
[Bug]: "rootcoord send tt to all channels too slowly #26279

Closed
1 task done
ronghuaihai opened this issue Aug 11, 2023 · 4 comments
Closed
1 task done

[Bug]: "rootcoord send tt to all channels too slowly #26279

ronghuaihai opened this issue Aug 11, 2023 · 4 comments

@ronghuaihai

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.8
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.2.1
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: Gold 5218 * 2 / 384G
- GPU: ASPEED Graphics Family (rev 41)
- Others:

Current Behavior

I installed a Milvus cluster (2.2.8) on Kubernetes. After it had been running normally for two months, I modified the memory limit of queryNode, like this:

queryNode:
  enabled: true
  replicas: 12
  resources:
    limits:
      memory: 48Gi
      cpu: 8
    requests:
      cpu: 4
      memory: 32Gi

and used the following command to upgrade the Milvus cluster:
helm upgrade my-release . --set metrics.serviceMonitor.enabled=true -f values.yaml


Then we found that all collections were loaded again, warning logs began to appear in the Milvus rootcoord logs, and both the insert rate and the search rate became very slow.
(Screenshots: rootcoord warning logs and insert/search rate metrics.)

Expected Behavior

When I increase the memory limits of the queryNodes or increase the number of queryNodes, performance should improve; instead, performance became worse.

Steps To Reproduce

1. Use the following command to deploy Milvus cluster 2.2.8:
helm install my-release . --set metrics.serviceMonitor.enabled=true

2. Then modify the queryNode memory limits in values.yaml:

queryNode:
  enabled: true
  replicas: 12
  resources:
    limits:
      memory: 48Gi
      cpu: 8
    requests:
      cpu: 4
      memory: 32Gi

3. Use the following command to upgrade the Milvus cluster:
helm upgrade my-release . --set metrics.serviceMonitor.enabled=true -f values.yaml

4. All collections reload again; running hello_milvus.py, the insert takes too long and eventually times out (see the timing sketch below).
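A minimal timing sketch for reproducing the slow insert, assuming the "hello_milvus" collection created by hello_milvus.py (int64 "pk", double "random", 8-dim float vector "embeddings") already exists and that the host/port below are placeholders for your cluster:

# Time a column-style insert against the hello_milvus collection.
import random
import time

from pymilvus import connections, Collection

connections.connect("default", host="127.0.0.1", port="19530")  # placeholder endpoint
coll = Collection("hello_milvus")

num = 3000
entities = [
    [i for i in range(num)],                                    # pk
    [random.random() for _ in range(num)],                      # random
    [[random.random() for _ in range(8)] for _ in range(num)],  # embeddings (dim=8)
]

start = time.time()
coll.insert(entities)
coll.flush()
print(f"insert + flush of {num} rows took {time.time() - start:.2f}s")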



Milvus Log

logs.tar.gz: https://github.com/milvus-io/milvus/files/12318791/logs.tar.gz

Anything else?

No response
ronghuaihai added the kind/bug and needs-triage labels on Aug 11, 2023
@yanliang567
Contributor

@ronghuaihai could you please attach the two kinds of logs so that we can compare the time cost between the two searches and address the issue? Please refer to this doc to export the whole Milvus logs.

/assign @ronghuaihai
/unassign

yanliang567 added the triage/needs-information label and removed the needs-triage label on Aug 11, 2023
@ronghuaihai
Author

ronghuaihai commented Aug 11, 2023

The querycoord and rootcoord logs are as follows:
querycoord1.log
rootcoord.log

czs007 assigned longjiquan and PowderLi and unassigned longjiquan on Aug 23, 2023
@PowderLi
Contributor

We can find the following logs:

  • rootcoord.log
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 8" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  111.237162ms]" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 13" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  104.016198ms]" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="Broker notification of Closed producer: 4" local_addr="100.86.24.35:45964" remote_addr="pulsar://my-release-pulsar-proxy:6650"
time="2023-08-10T14:33:06Z" level=warning msg="[Connection was closed]" cnx="100.86.24.35:45964 -> 10.96.2.251:6650" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnecting to broker in  113.397834ms]" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=4 producer_name=my-release-pulsar-1-1 topic="persistent://public/default/by-dev-rootcoord-dml_3"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=13 producer_name=my-release-pulsar-1-6 topic="persistent://public/default/by-dev-rootcoord-dml_12"
time="2023-08-10T14:33:06Z" level=info msg="Resending 1 pending batches" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
time="2023-08-10T14:33:06Z" level=info msg="[Reconnected producer to broker]" cnx="100.86.24.35:60744 -> 10.96.2.251:6650" producerID=8 producer_name=my-release-pulsar-1-3 topic="persistent://public/default/by-dev-rootcoord-dml_7"
  • querycoord
[2023/08/10 14:32:52.028 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853324] [collectionID=441511672460751103] [segmentID=441511672490203790] [node=29] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m7.529309177s]
[2023/08/10 14:32:52.055 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853327] [collectionID=441511672460751103] [segmentID=441511672490203791] [node=33] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m7.556251684s]
[2023/08/10 14:32:54.018 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853326] [collectionID=441511672460751103] [segmentID=441511672494241580] [node=36] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.519515189s]
[2023/08/10 14:32:54.031 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853325] [collectionID=441511672460751103] [segmentID=441511672494241579] [node=31] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.532674343s]
[2023/08/10 14:32:54.032 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853324] [collectionID=441511672460751103] [segmentID=441511672490203790] [node=29] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.533836439s]
[2023/08/10 14:32:54.035 +00:00] [WARN] [task/utils.go:110] ["deltaPosition is quite stale when packLoadSegmentRequest"] [taskID=853327] [collectionID=441511672460751103] [segmentID=441511672490203791] [node=33] [source=3] [channel=by-dev-rootcoord-dml_2_441511672460751103v0] [posSource=channelCheckpoint] [posTs=443462992396025859] [posTime=2023/08/10 14:17:44.499 +00:00] [tsLag=15m9.536755333s]

There is reasonable suspicion that Pulsar was not working well at that time. Also, query nodes fetch a lot of log data from Pulsar after a restart, which may lead to high load on Pulsar.
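For context on the tsLag values above: posTs is a Milvus hybrid (TSO-style) timestamp, i.e. physical milliseconds since the Unix epoch shifted left by 18 bits to leave room for a logical counter. A small sketch to decode one (assuming the standard 18-bit logical part):

# Decode a Milvus hybrid timestamp: physical_ms << 18 | logical_counter.
from datetime import datetime, timezone

LOGICAL_BITS = 18  # TSO convention used by Milvus

def ts_to_datetime(ts: int) -> datetime:
    physical_ms = ts >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000.0, tz=timezone.utc)

# posTs from the querycoord warnings above
print(ts_to_datetime(443462992396025859))  # -> 2023-08-10 14:17:44.499 UTC, matching posTime in the log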

We recently released v2.2.14, and it is recommended that you use the latest version. Referring to the v2.2.x release notes, we made a lot of improvements around MQ and Pulsar after 2.2.8, for example (see the version-check sketch after this list):

  • [2.2.12] Made pulsar request timeout configurable (#26526).
  • [2.2.13] Prevented exclusive consumer exception in Pulsar (#25376) (#25378)
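After upgrading, a quick way to confirm which server version the cluster is actually running (a minimal sketch; the endpoint is a placeholder, and utility.get_server_version() is assumed to be available in your pymilvus release):

# Check the Milvus server version after the helm upgrade.
from pymilvus import connections, utility

connections.connect("default", host="127.0.0.1", port="19530")  # placeholder endpoint
print(utility.get_server_version())  # expect the upgraded version, e.g. v2.2.14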

@stale

stale bot commented Sep 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale bot added the stale label on Sep 27, 2023
stale bot closed this as completed on Oct 5, 2023
Labels: kind/bug, stale, triage/needs-information