
[Bug]: "rootcoord send tt to all channels too slowly #26279

Closed
1 task done
ronghuaihai opened this issue Aug 11, 2023 · 4 comments
Closed
1 task done

[Bug]: "rootcoord send tt to all channels too slowly #26279

ronghuaihai opened this issue Aug 11, 2023 · 4 comments

@ronghuaihai

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.8
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.2.1
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: Gold 5218 * 2 / 384G
- GPU: ASPEED Graphics Family (rev 41)
- Others:

Current Behavior

I installed a Milvus cluster (2.2.8) on Kubernetes. After it had been running normally for two months, I modified the memory limit of queryNode, like this:

queryNode:
  enabled: true
  replicas: 12
  resources:
    limits:
      memory: 48Gi
      cpu: 8
    requests:
      cpu: 4
      memory: 32Gi

and used the following command to upgrade the Milvus cluster:
helm upgrade my-release . --set metrics.serviceMonitor.enabled=true -f values.yaml


Then we found that all collections were loaded again, warning logs began to appear in the Milvus rootcoord logs, and both the insert rate and the search rate became very slow.
(Screenshots: rootcoord warning logs and insert/search rate metrics.)

Expected Behavior

When I increase the memory limits of the queryNodes or increase the number of queryNodes, performance should improve; instead, performance became worse.

Steps To Reproduce

1. Use the following command to deploy Milvus cluster 2.2.8:
helm install my-release . --set metrics.serviceMonitor.enabled=true

2. Then modify the queryNode memory limits in values.yaml:

queryNode:
  enabled: true
  replicas: 12
  resources:
    limits:
      memory: 48Gi
      cpu: 8
    requests:
      cpu: 4
      memory: 32Gi

3. Use the following command to upgrade the Milvus cluster:
helm upgrade my-release . --set metrics.serviceMonitor.enabled=true -f values.yaml

4. All collections reload again; running hello_milvus.py, the insert takes too long and eventually times out (see the timing sketch below).
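A minimal timing sketch for reproducing the slow insert, assuming the "hello_milvus" collection created by hello_milvus.py (int64 "pk", double "random", 8-dim float vector "embeddings") already exists and that the host/port below are placeholders for your cluster:

# Time a column-style insert against the hello_milvus collection.
import random
import time

from pymilvus import connections, Collection

connections.connect("default", host="127.0.0.1", port="19530")  # placeholder endpoint
coll = Collection("hello_milvus")

num = 3000
entities = [
    [i for i in range(num)],                                    # pk
    [random.random() for _ in range(num)],                      # random
    [[random.random() for _ in range(8)] for _ in range(num)],  # embeddings (dim=8)
]

start = time.time()
coll.insert(entities)
coll.flush()
print(f"insert + flush of {num} rows took {time.time() - start:.2f}s")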



Milvus Log

logs.tar.gz: https://github.com/milvus-io/milvus/files/12318791/logs.tar.gz

Anything else?

No response
ronghuaihai added the kind/bug and needs-triage labels on Aug 11, 2023
@yanliang567
Contributor

@ronghuaihai could you please attach the two kinds of logs so that we can compare the time cost between the two searches and address the issue? Please refer to this doc to export the whole Milvus logs.

/assign @ronghuaihai
/unassign

yanliang567 added the triage/needs-information label and removed the needs-triage label on Aug 11, 2023
@ronghuaihai
Author

ronghuaihai commented Aug 11, 2023

The querycoord and rootcoord logs are as follows:
querycoord1.log
rootcoord.log

czs007 assigned longjiquan and PowderLi and unassigned longjiquan on Aug 23, 2023
@PowderLi
Contributor

We can find the following logs:

  • rootcoord.log
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 8" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  111.237162ms]" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 13" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  104.016198ms]" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 4" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  113.397834ms]" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="Resending 1 pending batches" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
  • querycoord
[2023/08/10 14:32:52.028 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853324] [collectionID=441511672460751103] [segmentID=441511672490203790] [node=29] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m7.529309177s]
[2023/08/10 14:32:52.055 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853327] [collectionID=441511672460751103] [segmentID=441511672490203791] [node=33] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m7.556251684s]
[2023/08/10 14:32:54.018 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853326] [collectionID=441511672460751103] [segmentID=441511672494241580] [node=36] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.519515189s]
[2023/08/10 14:32:54.031 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853325] [collectionID=441511672460751103] [segmentID=441511672494241579] [node=31] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.532674343s]
[2023/08/10 14:32:54.032 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853324] [collectionID=441511672460751103] [segmentID=441511672490203790] [node=29] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.533836439s]
[2023/08/10 14:32:54.035 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853327] [collectionID=441511672460751103] [segmentID=441511672490203791] [node=33] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.536755333s]

There is reasonable suspicion that Pulsar was not working well at that time. Also, query nodes fetch a lot of log data from Pulsar after a restart, which may lead to high load on Pulsar.
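For context on the tsLag values above: posTs is a Milvus hybrid (TSO-style) timestamp, i.e. physical milliseconds since the Unix epoch shifted left by 18 bits to leave room for a logical counter. A small sketch to decode one (assuming the standard 18-bit logical part):

# Decode a Milvus hybrid timestamp: physical_ms << 18 | logical_counter.
from datetime import datetime, timezone

LOGICAL_BITS = 18  # TSO convention used by Milvus

def ts_to_datetime(ts: int) -> datetime:
    physical_ms = ts >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000.0, tz=timezone.utc)

# posTs from the querycoord warnings above
print(ts_to_datetime(443462992396025859))  # -> 2023-08-10 14:17:44.499 UTC, matching posTime in the log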

We recently released v2.2.14, and it is recommended that you use the latest version. Referring to the v2.2.x release notes, we made a lot of improvements around MQ and Pulsar after 2.2.8, for example (see the version-check sketch after this list):

  • [2.2.12] Made pulsar request timeout configurable (#26526).
  • [2.2.13] Prevented exclusive consumer exception in Pulsar (#25376) (#25378)
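After upgrading, a quick way to confirm which server version the cluster is actually running (a minimal sketch; the endpoint is a placeholder, and utility.get_server_version() is assumed to be available in your pymilvus release):

# Check the Milvus server version after the helm upgrade.
from pymilvus import connections, utility

connections.connect("default", host="127.0.0.1", port="19530")  # placeholder endpoint
print(utility.get_server_version())  # expect the upgraded version, e.g. v2.2.14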

@stale

stale bot commented Sep 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale bot added the stale label on Sep 27, 2023
stale bot closed this as completed on Oct 5, 2023
Labels: kind/bug, stale, triage/needs-information