
[Bug]: [cluster] Nightly test case hangs for panic in querynode and datacoord #15283

Closed
1 task done
binbinlv opened this issue Jan 18, 2022 · 17 comments

Labels: kind/bug Issues or changes related to a bug · stale indicates no updates for 30 days · triage/accepted Indicates an issue or PR is ready to be actively worked on.

@binbinlv
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 3eb5cc4
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): 2.0.0rc10.dev11
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Test cases hang in cluster mode.

Expected Behavior

All test cases execute successfully.

Steps To Reproduce

Running nightly: [cluster mode]
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/305/pipeline

Anything else?

The tests hang and no logs are generated at this point, but the environment still exists, so you can log in to the machine to investigate. Thanks.

@binbinlv binbinlv added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 18, 2022
@yanliang567
Contributor

@sunby could you please take a look at this issue?
/assign @sunby
/unassign

@sre-ci-robot sre-ci-robot assigned sunby and unassigned yanliang567 Jan 18, 2022
@yanliang567 yanliang567 added this to the 2.0.0-GA milestone Jan 18, 2022
@yanliang567
Contributor

The querynode was restarted and did not come back into service.

@yanliang567 yanliang567 added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 19, 2022
@binbinlv
Contributor Author

binbinlv commented Jan 19, 2022

For nightly 305:

there are panics in datacoord and querynode:

(1) md-305-n-milvus-datacoord-f455cddc-56x5m_milvus-ci_datacoord-5a3376049d42d661cb64cc4484b97d98aaa684b96997104babf32713a1e527c4.log:

time="2022-01-18T12:22:15Z" level=error msg="[Failed to unsubscribe consumer]" consumerID=1 error="connection closed" name=gktcl subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

time="2022-01-18T12:22:15Z" level=info msg="Closing consumer=1" consumerID=1 name=gktcl subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

time="2022-01-18T12:22:15Z" level=warning msg="[Failed to close consumer]" consumerID=1 error="connection closed" name=gktcl subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

[2022/01/18 12:22:15.328 +00:00] [FATAL] [logutil.go:103] [panic] [recover={}] [stack="github.com/milvus-io/milvus/internal/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/logutil/logutil.go:103\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:969\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:113\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:66\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:57\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:108\ngithub.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close\n\t/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226\ngithub.com/milvus-io/milvus/internal/datacoord.(*Server).startDataNodeTtLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:478"]

(2) md-305-n-milvus-querynode-75c88f95cc-58qrm_milvus-ci_querynode-56feae97f0a74b6188854829f418a01551c57f8157014e43d4bcc690ce6ad0e9.log:

time="2022-01-18T05:53:30Z" level=error msg="[Failed to unsubscribe consumer]" consumerID=733 error="server error: MetadataError: Consumer not found" name=pqhmj subscription=by-dev-queryNode-430567640935170049-10 topic="persistent://public/default/by-dev-search-0"

time="2022-01-18T05:53:30Z" level=info msg="Closing consumer=733" consumerID=733 name=pqhmj subscription=by-dev-queryNode-430567640935170049-10 topic="persistent://public/default/by-dev-search-0"

time="2022-01-18T05:53:30Z" level=warning msg="[Failed to close consumer]" consumerID=733 error="server error: MetadataError: Consumer not found" name=pqhmj subscription=by-dev-queryNode-430567640935170049-10 topic="persistent://public/default/by-dev-search-0"

panic: topic persistent://public/default/by-dev-search-0, subscription by-dev-queryNode-430567640935170049-10: server error: MetadataError: Consumer not found



goroutine 219 [running]:

github.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1()

        /go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:113 +0xbb

sync.(*Once).doSlow(0xc004e472a8, 0xc003371b20)

        /usr/local/go/src/sync/once.go:66 +0xec

sync.(*Once).Do(...)

        /usr/local/go/src/sync/once.go:57

github.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close(0xc004e47260)

        /go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:108 +0x65

github.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close(0xc00104f2b0)

        /go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226 +0x173

github.com/milvus-io/milvus/internal/querynode.(*queryCollection).close(0xc001362800)

        /go/src/github.com/milvus-io/milvus/internal/querynode/query_collection.go:163 +0x59

github.com/milvus-io/milvus/internal/querynode.(*queryService).stopQueryCollection(0xc00073e780, 0x5f9aefcf9b80001)

        /go/src/github.com/milvus-io/milvus/internal/querynode/query_service.go:184 +0x1e7

github.com/milvus-io/milvus/internal/querynode.(*releaseCollectionTask).Execute(0xc000b5b0b0, 0x35a0e00, 0xc0002ed9c0, 0x0, 0x0)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task.go:771 +0x365

github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask(0xc0002eda00, 0x35b6300, 0xc000b5b0b0, 0x35b4560, 0xc0003300a0)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:61 +0x173

github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop(0xc0002eda00)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:78 +0x191

created by github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).Start

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:86 +0x5c

@binbinlv
Contributor Author

binbinlv commented Jan 19, 2022

Panic occurs again in nightly 306: (cluster mode)
https://ci.milvus.io:18080/jenkins/job/milvus-nightly-ci/job/master/306/:
(1) md-306-n-milvus-datacoord-66f99754f4-hmpqp_milvus-ci_datacoord-2773cbd4e7cc154f865398f4dd4f383940d9a2fd216e07ec0c367dae35766973.log:

time="2022-01-19T01:00:38Z" level=error msg="[Failed to unsubscribe consumer]" consumerID=1 error="connection closed" name=qlkju subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

time="2022-01-19T01:00:38Z" level=info msg="Closing consumer=1" consumerID=1 name=qlkju subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

time="2022-01-19T01:00:38Z" level=warning msg="[Failed to close consumer]" consumerID=1 error="connection closed" name=qlkju subscription=by-dev-dataCoord topic="persistent://public/default/by-dev-datacoord-timetick-channel"

[2022/01/19 01:00:38.237 +00:00] [FATAL] [logutil.go:103] [panic] [recover={}] [stack="github.com/milvus-io/milvus/internal/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/logutil/logutil.go:103\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:969\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:113\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:66\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:57\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:108\ngithub.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close\n\t/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226\ngithub.com/milvus-io/milvus/internal/datacoord.(*Server).startDataNodeTtLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:478"]

(2) md-306-n-milvus-querynode-7886c46b96-t6nf8_milvus-ci_querynode-56caf7d1f2e6c2b8e85235916c15e9ebceff9cf1095b7548f8258e73e570aea0.log:

panic: Failed to create consumer by-dev-rootcoord-dml_145, error = All attempts results:

attempt #1:server error: PersistenceError: java.util.concurrent.CompletionException: java.lang.NullPointerException

attempts #2 through #20 (identical): server error: ConsumerBusy: Exclusive consumer is already connected





goroutine 185 [running]:

github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumerWithPosition(0xc0002bd680, 0xc0001c86b0, 0x1, 0x1, 0xc00040c1e0, 0x26, 0x1)

        /go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:742 +0x2a7

github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumer(0xc0002bd680, 0xc0001c86b0, 0x1, 0x1, 0xc00040c1e0, 0x26)

        /go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:710 +0x66

github.com/milvus-io/milvus/internal/querynode.(*queryNodeFlowGraph).seekQueryNodeFlowGraph(0xc00036cc80, 0xc000995f20, 0xc00040c030, 0x2d)

        /go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_query_node.go:222 +0xac

github.com/milvus-io/milvus/internal/querynode.(*watchDmChannelsTask).Execute(0xc0003449f0, 0x35a1220, 0xc000322e00, 0x0, 0x0)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task.go:431 +0x26e5

github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask(0xc000322e40, 0x35b6860, 0xc0003449f0, 0x35b49a0, 0xc0000a6140)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:61 +0x173

github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).taskLoop(0xc000322e40)

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:78 +0x191

created by github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).Start

        /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:86 +0x5c

@binbinlv binbinlv changed the title [Bug]: Nightly test case hangs for cluster mode [Bug]: [cluster] Nightly test case hangs for panic in querynode and datacoord Jan 19, 2022
@czs007 czs007 assigned longjiquan and unassigned sunby Jan 20, 2022
@binbinlv
Contributor Author

binbinlv commented Jan 21, 2022

Tests hang again in cluster mode in the latest nightly:
https://ci.milvus.io:18080/jenkins/job/milvus-nightly-ci/job/master/310/

milvus: 21f999f
mode: cluster
pymilvus: 2.0.0rc10.dev12
replica number:
--set proxy.replicas=2
--set queryNode.replicas=2
--set indexNode.replicas=2
--set dataNode.replicas=2

  1. log name:
    artifacts-milvus-distributed-master-310-pymilvus-e2e-logs.tar.gz

  2. hangs time:
    [2022-01-20T18:12:07.859Z]

  3. panic info in milvus log:

 md-310-n-milvus-datacoord-6b4b4bbdc9-bkw7f_milvus-ci_datacoord-572749c5666c636809bf77d50839227b68ca7edc881cc9776a8efde4e8327fcc.log:[2022/01/21 02:35:58.417 +00:00] [FATAL] [logutil.go:103] [panic] [recover="[{},{},{},{},{},{}]"] [stack="github.com/milvus-io/milvus/internal/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/logutil/logutil.go:103\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:969\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:121\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:66\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:57\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:112\ngithub.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close\n\t/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226\ngithub.com/milvus-io/milvus/internal/datacoord.(*Server).startDataNodeTtLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:478"]
md-310-n-milvus-datanode-77cbb5848d-6sxvp_milvus-ci_datanode-6e52609ea060be6b976932e1f229e1406aa13f7abb8f7d17008cc1f9d57f4524.log:panic: All attempts results:
md-310-n-milvus-querynode-567476f669-j9c9v_milvus-ci_querynode-5a5bc1da2c7b404b7357ff03cd0f5de3794855d64885466a7dd63913d94610a1.log:panic: All attempts results:
md-310-n-milvus-querynode-567476f669-sjtnd_milvus-ci_querynode-d2b06698d3ee91034eaca3445612ae011dd01bc2edc7ac8cd87da76384395f65.log:panic: All attempts results:

@DragonDriver
The panic info is different from this issue's, but there have been some code changes these days, so I am not sure whether the root cause is actually the same as this issue.

Could you please check whether it has the same root cause as this issue?

If not, I will open another issue to track it, thanks.

@yanliang567
Contributor

@DragonDriver rootcoord crashed many times; please help check:
https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/311/pipeline

[2022/01/21 15:23:38.364 +00:00] [ERROR] [task.go:1701] ["loadBalanceTask: show collection's partitionIDs failed"] [collectionID=430644467046023169] [error="ShowPartitions failed: can't find collection id : 430644467046023169"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*loadBalanceTask).execute\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:1701\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).processTask\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:550\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:627"]

[2022/01/21 15:23:38.365 +00:00] [DEBUG] [task.go:1977] ["loadBalanceTask postExecute done"] ["trigger type"=4] [sourceNodeIDs="[7]"] [balanceReason=NodeDown] [taskID=430644161585087469]

panic: ShowPartitions failed: can't find collection id : 430644467046023169



goroutine 200 [running]:

github.com/milvus-io/milvus/internal/querycoord.(*loadBalanceTask).execute(0xc0008765a0, 0x3595b60, 0xc001170960, 0x0, 0x0)

	/go/src/github.com/milvus-io/milvus/internal/querycoord/task.go:1703 +0x5cbe

github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).processTask(0xc00026cdc0, 0x35bcc80, 0xc0008765a0, 0x0, 0x0)

	/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:550 +0x6d4

github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).scheduleLoop(0xc00026cdc0)

	/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:627 +0x12bf

created by github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).Start

	/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:868 +0x65

@xiaofan-luan
Contributor

@DragonDriver any progress?

@longjiquan
Contributor

I fired an issue in pulsar, apache/pulsar#13920.

@xige-16
Contributor

xige-16 commented Jan 24, 2022

> @DragonDriver rootcoord crashes for many times. please help to check https://ci.milvus.io:18080/jenkins/blue/organizations/jenkins/milvus-nightly-ci/detail/master/311/pipeline (full panic log quoted in the comment above)

querycoord panicked while rootcoord was dropping a collection, and rootcoord's release of the collection failed during the drop. So after querycoord restarted, it called rootcoord.ShowPartitions() and got the error back; querycoord then panics again after every reboot.

[2022/01/21 15:24:31.278 +00:00] [ERROR] [client.go:254] ["querycoord ClientBase ReCall grpc second call get error "] [error="err: number of querycoord is incorrect, 0\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:253 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:180 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ReleaseCollection\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:823 github.com/milvus-io/milvus/internal/rootcoord.(*Core).SetQueryCoord.func2\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:364 github.com/milvus-io/milvus/internal/rootcoord.(*DropCollectionReqTask).Execute\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:60 github.com/milvus-io/milvus/internal/rootcoord.executeTask.func1\n/usr/local/go/src/runtime/asm_amd64.s:1374 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:254\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ReleaseCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:180\ngithub.com/milvus-io/milvus/internal/rootcoord.(*Core).SetQueryCoord.func2\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:823\ngithub.com/milvus-io/milvus/internal/rootcoord.(*DropCollectionReqTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:364\ngithub.com/milvus-io/milvus/internal/rootcoord.executeTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:60"]

[2022/01/21 15:24:31.278 +00:00] [ERROR] [task.go:365] ["Failed to CallReleaseCollectionService"] [error="err: number of querycoord is incorrect, 0\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:253 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:180 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ReleaseCollection\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:823 github.com/milvus-io/milvus/internal/rootcoord.(*Core).SetQueryCoord.func2\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:364 github.com/milvus-io/milvus/internal/rootcoord.(*DropCollectionReqTask).Execute\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:60 github.com/milvus-io/milvus/internal/rootcoord.executeTask.func1\n/usr/local/go/src/runtime/asm_amd64.s:1374 runtime.goexit\n"] [stack="github.com/milvus-io/milvus/internal/rootcoord.(*DropCollectionReqTask).Execute\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:365\ngithub.com/milvus-io/milvus/internal/rootcoord.executeTask.func1\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:60"]

[2022/01/21 15:24:31.278 +00:00] [ERROR] [root_coord.go:1318] ["DropCollection failed"] [role=rootcoord] ["collection name"=search_collection_hR8hZgTI] [msgID=430644455249584592] [error="err: number of querycoord is incorrect, 0\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:253 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:180 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).ReleaseCollection\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:823 github.com/milvus-io/milvus/internal/rootcoord.(*Core).SetQueryCoord.func2\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:364 github.com/milvus-io/milvus/internal/rootcoord.(*DropCollectionReqTask).Execute\n/go/src/github.com/milvus-io/milvus/internal/rootcoord/task.go:60 github.com/milvus-io/milvus/internal/rootcoord.executeTask.func1\n/usr/local/go/src/runtime/asm_amd64.s:1374 runtime.goexit\n"] 
[stack="github.com/milvus-io/milvus/internal/rootcoord.(*Core).DropCollection\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:1318\ngithub.com/milvus-io/milvus/internal/distributed/rootcoord.(*Server).DropCollection\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/service.go:344\ngithub.com/milvus-io/milvus/internal/proto/rootcoordpb._RootCoord_DropCollection_Handler.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proto/rootcoordpb/root_coord.pb.go:879\ngithub.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38\ngithub.com/milvus-io/milvus/internal/proto/rootcoordpb._RootCoord_DropCollection_Handler\n\t/go/src/github.com/milvus-io/milvus/internal/proto/rootcoordpb/root_coord.pb.go:881\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"]

@binbinlv
Contributor Author

binbinlv commented Jan 28, 2022

Panic occurs again:
https://ci.milvus.io:18080/jenkins/job/milvus-nightly-ci/job/master/318/ [cluster mode]

milvus: a62e2ef
mode: cluster
pymilvus: 2.0.0rc10.dev12
replica number:
--set proxy.replicas=2
--set queryNode.replicas=2
--set indexNode.replicas=2
--set dataNode.replicas=2

  1. log name:
    artifacts-milvus-distributed-master-318-pymilvus-e2e-logs.tar.gz

  2. panic info:

md-318-n-milvus-datacoord-779867dfcd-kd4fb_milvus-ci_datacoord-09ab05dc1c3c353a4b43d3ea235e19d9dd50ac3a687c34e9e61c18d9f9a5ff56.log:[2022/01/28 01:01:59.786 +00:00] [FATAL] [logutil.go:103] [panic] [recover="[{},{},{},{},{},{}]"] [stack="github.com/milvus-io/milvus/internal/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/logutil/logutil.go:103\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:969\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:121\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:66\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:57\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:112\ngithub.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close\n\t/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226\ngithub.com/milvus-io/milvus/internal/datacoord.(*Server).startDataNodeTtLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:478"]
md-318-n-milvus-datanode-fc6dc59dc-q9kvx_milvus-ci_datanode-17f3752f6139ad4dcb5dfca5f7d2ce6682528d371407ece4d36fdc6b7584a5ee.log:panic: All attempts results:
md-318-n-milvus-querynode-586ffcbd64-b72ht_milvus-ci_querynode-b5c21dce78bea5d6ce274dbcc8e0a06b3955c89b9562d6c9de2196029696a61b.log:panic: All attempts results:
md-318-n-milvus-querynode-586ffcbd64-l4t6p_milvus-ci_querynode-9e058620f8bd16c497000fbcacc64824a1538a70385e09d8cf04833d1e364cae.log:panic: All attempts results:
  3. Case hangs time: 2022-01-27T16:40:09.497Z

@binbinlv
Contributor Author

binbinlv commented Feb 7, 2022

Same issue occurs again:
https://ci.milvus.io:18080/jenkins/job/milvus-nightly-ci/job/master/323/ [cluster mode]

milvus: b4bfe58
mode: cluster
pymilvus: pymilvus==2.0.1.dev1
replica number:
--set proxy.replicas=2
--set queryNode.replicas=2
--set indexNode.replicas=2
--set dataNode.replicas=2

  1. log name:
    artifacts-milvus-distributed-master-323-pymilvus-e2e-logs.tar.gz

  2. panic info

md-323-n-milvus-datacoord-5fd66dff74-q429j_milvus-ci_datacoord-a94c515bdf7adc4f492a3fd1d8bf6c9a5b8a744476e2bddd26ea4f347515d40a.log:[2022/01/30 01:02:22.628 +00:00] [FATAL] [logutil.go:103] [panic] [recover="[{},{},{},{},{},{}]"] [stack="github.com/milvus-io/milvus/internal/logutil.LogPanic\n\t/go/src/github.com/milvus-io/milvus/internal/logutil/logutil.go:103\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:965\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:121\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/util/mqclient.(*PulsarConsumer).Close\n\t/go/src/github.com/milvus-io/milvus/internal/util/mqclient/pulsar_consumer.go:112\ngithub.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).Close\n\t/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:226\ngithub.com/milvus-io/milvus/internal/datacoord.(*Server).startDataNodeTtLoop.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:478"]
md-323-n-milvus-datanode-7d7cb8bd97-nm2ml_milvus-ci_datanode-dd0e02dbfb2e0f0a1a1e9ecb5f1241753543dcbd2f88d601211fd57e7d4581f4.log:panic: All attempts results:
md-323-n-milvus-datanode-7d7cb8bd97-vvrcc_milvus-ci_datanode-c8a564641bd0b0db2f6ecbe284b0513b116dbd2b0c3332fba68b4619befae80f.log:panic: All attempts results:
md-323-n-milvus-querynode-5f6dbc4d64-bj76m_milvus-ci_querynode-03412a14133c03c233035a5b55f1558e5b93ec4735ccbb8e8acdae13f6435fc5.log:panic: All attempts results:
md-323-n-milvus-querynode-5f6dbc4d64-q66q2_milvus-ci_querynode-29167b2d233a782f0a905248951c0827a94a9650b7b959827ec7e75c1ecdcc02.log:panic: All attempts results:
  3. Case hangs time: 2022-01-29T16:42:57.544Z
[2022-01-29T16:42:57.544Z] testcases/test_search_20.py::TestCollectionSearch::test_search_with_expression_bool[128-False-True-False] PASSEDPASSEDPASSED
[2022-01-30T01:02:07.382Z] testcases/test_utility.py::TestUtilityBase::test_calc_distance_default_sqrt[metric-L2] PASSED
script returned exit code 124

@yanliang567 yanliang567 removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 15, 2022
@binbinlv
Contributor Author

binbinlv commented Mar 9, 2022

/assign @xiaofan-luan

@binbinlv
Contributor Author

binbinlv commented Mar 9, 2022

Similar issue occurs again:
https://jenkins.milvus.io:18080/blue/organizations/jenkins/milvus-ha-ci/detail/PR-15926/3/pipeline/123

  1. log name:
    artifacts-milvus-distributed-PR-15926-3-pymilvus-e2e-logs.tar.gz

  2. panic info:
md-15926-3-pr-milvus-datanode-777dff59b-77sgs_milvus-ci_datanode-71e594e0429fb5474c293ed9bceaf3f6d720e395d7bbaef33227a4ae3f185833.log:panic: All attempts results:
md-15926-3-pr-milvus-querynode-b949f67c6-7d5rr_milvus-ci_querynode-5dfe21800b41d0c5ed836917b0116ec7469147dc84998f5a3cf743d74d219b59.log:panic: All attempts results:
time="2022-03-08T08:42:46Z" level=info msg="Closing consumer=2560" consumerID=2560 name=iyise subscription=by-dev-queryNode-431680100766580737-3 topic="persistent://public/default/by-dev-rootcoord-dml_132"

time="2022-03-08T08:42:46Z" level=warning msg="[Failed to close consumer]" consumerID=2560 error="connection closed" name=iyise subscription=by-dev-queryNode-431680100766580737-3 topic="persistent://public/default/by-dev-rootcoord-dml_132"

[2022/03/08 08:42:46.501 +00:00] [INFO] [grpclog.go:37] ["[core]Channel Connectivity change to SHUTDOWN"]

[2022/03/08 08:42:46.501 +00:00] [INFO] [grpclog.go:37] ["[core]Subchannel Connectivity change to SHUTDOWN"]

panic: All attempts results:

attempt #1:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: UnknownError: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /managed-ledgers/public/default/persistent/by-dev-rootcoord-dml_132-by-dev-queryNode-431680100766580737-3__transaction_pending_ack

attempt #2:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: MetadataError: Consumer not found

attempt #3:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: MetadataError: Consumer not found

attempt #4:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: MetadataError: Consumer not found

attempt #5:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: MetadataError: Consumer not found

attempt #6:topic persistent://public/default/by-dev-rootcoord-dml_132, subscription by-dev-queryNode-431680100766580737-3: server error: MetadataError: Consumer not found

@binbinlv
Copy link
Contributor Author

binbinlv commented Mar 9, 2022

/assign @xiaofan-luan

@lijonnas
Copy link

Hey guys, I've hit a similar issue again, after I manually deleted the querycoord node.

Here is the log; I've met this issue several times before.

Welcome to use Milvus!
Version: v2.0.1
Built: Wed Feb 23 06:23:51 UTC 2022
GitCommit: ce9662c
GoVersion: go version go1.16.9 linux/amd64

open pid file: /run/milvus/querycoord.pid
lock pid file: /run/milvus/querycoord.pid
[2022/03/26 15:58:42.158 +00:00] [DEBUG] [base_table.go:65] ["config directory"] [configDir=/milvus/configs/]
[2022/03/26 15:58:42.163 +00:00] [DEBUG] [component_param.go:592] ["init cacheSize"] ["cacheSize (GB)"=0]
[2022/03/26 15:58:42.163 +00:00] [DEBUG] [component_param.go:624] ["query node init gracefulTime"] [gracefulTime=0]
[2022/03/26 15:58:42.167 +00:00] [DEBUG] [query_coord.go:302] ["query coordinator"] [queryChannels="[{}]"]
[2022/03/26 15:58:42.170 +00:00] [DEBUG] [base_table.go:65] ["config directory"] [configDir=/milvus/configs/]
[2022/03/26 15:58:42.173 +00:00] [DEBUG] [grpc_param.go:118] [initServerMaxSendSize] [role=querycoord] [grpc.serverMaxSendSize=2147483647]
[2022/03/26 15:58:42.173 +00:00] [DEBUG] [grpc_param.go:141] [initServerMaxRecvSize] [role=querycoord] [grpc.serverMaxRecvSize=2147483647]
[2022/03/26 15:58:42.177 +00:00] [DEBUG] [service.go:244] [network] [port=19531]
[2022/03/26 15:58:42.279 +00:00] [DEBUG] [session_util.go:86] ["Session try to connect to etcd"]
[2022/03/26 15:58:42.286 +00:00] [DEBUG] [session_util.go:101] ["Session connect to etcd success"]
[2022/03/26 15:58:42.286 +00:00] [DEBUG] [base_table.go:65] ["config directory"] [configDir=/milvus/configs/]
[2022/03/26 15:58:42.291 +00:00] [DEBUG] [grpc_param.go:186] [initClientMaxSendSize] [role=rootcoord] [grpc.clientMaxSendSize=104857600]
[2022/03/26 15:58:42.291 +00:00] [DEBUG] [grpc_param.go:209] [initClientMaxRecvSize] [role=rootcoord] [grpc.clientMaxRecvSize=104857600]
[2022/03/26 15:58:42.291 +00:00] [DEBUG] [service.go:153] ["QueryCoord try to wait for RootCoord ready"]
[2022/03/26 15:58:42.294 +00:00] [DEBUG] [session_util.go:294] ["SessionUtil GetSessions "] [prefix=rootcoord] [key=rootcoord] [address=172.22.102.33:53100]
[2022/03/26 15:58:42.294 +00:00] [DEBUG] [client.go:96] ["RootCoordClient GetSessions success"] [address=172.22.102.33:53100]
[2022/03/26 15:58:42.296 +00:00] [DEBUG] [service.go:163] ["QueryCoord report RootCoord ready"]
[2022/03/26 15:58:42.297 +00:00] [DEBUG] [session_util.go:86] ["Session try to connect to etcd"]
[2022/03/26 15:58:42.300 +00:00] [DEBUG] [session_util.go:101] ["Session connect to etcd success"]
[2022/03/26 15:58:42.300 +00:00] [DEBUG] [base_table.go:65] ["config directory"] [configDir=/milvus/configs/]
[2022/03/26 15:58:42.302 +00:00] [DEBUG] [grpc_param.go:186] [initClientMaxSendSize] [role=datacoord] [grpc.clientMaxSendSize=104857600]
[2022/03/26 15:58:42.302 +00:00] [DEBUG] [grpc_param.go:209] [initClientMaxRecvSize] [role=datacoord] [grpc.clientMaxRecvSize=104857600]
[2022/03/26 15:58:42.302 +00:00] [DEBUG] [service.go:182] ["QueryCoord try to wait for DataCoord ready"]
[2022/03/26 15:58:42.305 +00:00] [DEBUG] [session_util.go:294] ["SessionUtil GetSessions "] [prefix=datacoord] [key=datacoord] [address=172.22.102.34:13333]
[2022/03/26 15:58:42.306 +00:00] [DEBUG] [service.go:191] ["QueryCoord report DataCoord ready"]
[2022/03/26 15:58:42.306 +00:00] [DEBUG] [session_util.go:86] ["Session try to connect to etcd"]
[2022/03/26 15:58:42.309 +00:00] [DEBUG] [session_util.go:101] ["Session connect to etcd success"]
[2022/03/26 15:58:42.309 +00:00] [DEBUG] [base_table.go:65] ["config directory"] [configDir=/milvus/configs/]
[2022/03/26 15:58:42.314 +00:00] [DEBUG] [grpc_param.go:186] [initClientMaxSendSize] [role=indexcoord] [grpc.clientMaxSendSize=104857600]
[2022/03/26 15:58:42.314 +00:00] [DEBUG] [grpc_param.go:209] [initClientMaxRecvSize] [role=indexcoord] [grpc.clientMaxRecvSize=104857600]
[2022/03/26 15:58:42.314 +00:00] [DEBUG] [service.go:212] ["QueryCoord try to wait for IndexCoord ready"]
[2022/03/26 15:58:42.317 +00:00] [DEBUG] [session_util.go:294] ["SessionUtil GetSessions "] [prefix=indexcoord] [key=indexcoord] [address=172.22.102.27:31000]
[2022/03/26 15:58:42.317 +00:00] [DEBUG] [client.go:97] ["IndexCoordClient GetSessions success"] [key=indexcoord] [msess="{"indexcoord":{"ServerID":899,"ServerName":"indexcoord","Address":"172.22.102.27:31000","Exclusive":true,"TriggerKill":true}}"]
[2022/03/26 15:58:42.319 +00:00] [DEBUG] [service.go:218] ["QueryCoord report IndexCoord is ready"]
[2022/03/26 15:58:42.319 +00:00] [DEBUG] [service.go:225] [QueryCoord] [State=Initializing]
[2022/03/26 15:58:42.319 +00:00] [DEBUG] [query_coord.go:132] ["query coordinator start init, session info"] [metaPath=milvus-dev-1/meta] [address=172.22.102.52:19531]
[2022/03/26 15:58:42.319 +00:00] [DEBUG] [session_util.go:86] ["Session try to connect to etcd"]
[2022/03/26 15:58:42.321 +00:00] [DEBUG] [session_util.go:101] ["Session connect to etcd success"]
[2022/03/26 15:58:42.334 +00:00] [DEBUG] [session_util.go:175] ["Session get serverID success"] [key=id] [ServerId=953]
[2022/03/26 15:58:42.335 +00:00] [INFO] [base_table.go:440] ["Set log file to "] [path=]
2022/03/26 15:58:42 sync and save timestamp{last 16 1648310183932713070 UTC} {save 16 1648310325338783937 UTC} {next 16 1648310322338783937 UTC}
[2022/03/26 15:58:42.761 +00:00] [ERROR] [task_scheduler.go:228] ["reloadFromKV: taskStateInfo and triggerTaskInfo are inconsistent"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).reloadFromKV\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:228\ngithub.com/milvus-io/milvus/internal/querycoord.newTaskScheduler\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:166\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:187\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:134\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:226\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:95\ngithub.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runQueryCoord.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:178"]
[2022/03/26 15:58:42.763 +00:00] [ERROR] [task_scheduler.go:228] ["reloadFromKV: taskStateInfo and triggerTaskInfo are inconsistent"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).reloadFromKV\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:228\ngithub.com/milvus-io/milvus/internal/querycoord.newTaskScheduler\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:166\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:187\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:134\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:226\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:95\ngithub.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runQueryCoord.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:178"]
[2022/03/26 15:58:42.763 +00:00] [ERROR] [task_scheduler.go:228] ["reloadFromKV: taskStateInfo and triggerTaskInfo are inconsistent"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).reloadFromKV\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:228\ngithub.com/milvus-io/milvus/internal/querycoord.newTaskScheduler\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:166\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:187\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:134\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:226\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:95\ngithub.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runQueryCoord.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:178"]
[2022/03/26 15:58:42.764 +00:00] [ERROR] [task_scheduler.go:228] ["reloadFromKV: taskStateInfo and triggerTaskInfo are inconsistent"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).reloadFromKV\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:228\ngithub.com/milvus-io/milvus/internal/querycoord.newTaskScheduler\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:166\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:187\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/milvus-io/milvus/internal/querycoord.(*QueryCoord).Init\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/query_coord.go:134\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:226\ngithub.com/milvus-io/milvus/internal/distributed/querycoord.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:95\ngithub.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runQueryCoord.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:178"]
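
The repeated reloadFromKV error indicates that, on restart, querycoord found persisted task-state entries without matching trigger-task entries in etcd. A minimal sketch of that invariant check (hypothetical; `checkTaskConsistency` and its map shapes are assumptions for illustration, not the real `TaskScheduler.reloadFromKV` code):

```go
package main

import "fmt"

// checkTaskConsistency returns the IDs of tasks that have persisted state
// but no corresponding trigger-task entry -- the condition the
// "taskStateInfo and triggerTaskInfo are inconsistent" error reports.
func checkTaskConsistency(taskStateInfo map[int64]int, triggerTaskInfo map[int64]string) []int64 {
	var orphans []int64
	for taskID := range taskStateInfo {
		if _, ok := triggerTaskInfo[taskID]; !ok {
			orphans = append(orphans, taskID)
		}
	}
	return orphans
}

func main() {
	states := map[int64]int{432094624070238211: 1, 432094624070238212: 1}
	triggers := map[int64]string{432094624070238211: "loadSegment"}
	for _, id := range checkTaskConsistency(states, triggers) {
		fmt.Printf("reloadFromKV: taskStateInfo and triggerTaskInfo are inconsistent, taskID=%d\n", id)
	}
}
```

Orphaned entries like this typically mean a previous crash happened between the two etcd writes, which is consistent with the earlier panics reported in this thread.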
time="2022-03-26T15:58:42Z" level=info msg="[Connecting to broker]" remote_addr="pulsar://pulsar-dev-broker.arch.svc.cluster.local:6650"
[2022/03/26 15:58:42.777 +00:00] [WARN] [task.go:1020] ["loadSegmentTask: loadSegment occur error"] [taskID=432094624070238211]
[2022/03/26 15:58:42.778 +00:00] [WARN] [task.go:1192] ["watchDmChannelTask: watchDmChannel occur error"] [taskID=432094624070238213]
[2022/03/26 15:58:42.778 +00:00] [WARN] [task.go:1192] ["watchDmChannelTask: watchDmChannel occur error"] [taskID=432094624070238212]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]parsed scheme: """]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]scheme "" not registered, fallback to default scheme"]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]ccResolverWrapper: sending update to cc: {[{172.22.103.25:21123 0 }] <nil>}"]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]ClientConn switching balancer to "pick_first""]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]Channel switches to new LB policy "pick_first""]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]Subchannel Connectivity change to CONNECTING"]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]Subchannel picks a new address "172.22.103.25:21123" to connect"]
[2022/03/26 15:58:42.779 +00:00] [INFO] [grpclog.go:37] ["[core]pickfirstBalancer: UpdateSubConnState: 0xc0004fa8e0, {CONNECTING }"]
[2022/03/26 15:58:42.780 +00:00] [INFO] [grpclog.go:37] ["[core]Channel Connectivity change to CONNECTING"]
[2022/03/26 15:58:42.782 +00:00] [INFO] [grpclog.go:37] ["[core]Subchannel Connectivity change to READY"]
[2022/03/26 15:58:42.782 +00:00] [INFO] [grpclog.go:37] ["[core]pickfirstBalancer: UpdateSubConnState: 0xc0004fa8e0, {READY }"]
[2022/03/26 15:58:42.782 +00:00] [INFO] [grpclog.go:37] ["[core]Channel Connectivity change to READY"]
time="2022-03-26T15:58:42Z" level=info msg="[TCP connection established]" local_addr="172.22.102.52:56500" remote_addr="pulsar://pulsar-dev-broker.arch.svc.cluster.local:6650"
time="2022-03-26T15:58:42Z" level=info msg="[Connection is ready]" local_addr="172.22.102.52:56500" remote_addr="pulsar://pulsar-dev-broker.arch.svc.cluster.local:6650"
time="2022-03-26T15:58:42Z" level=info msg="[Connecting to broker]" remote_addr="pulsar://pulsar-dev-broker-2.pulsar-dev-broker.arch.svc.cluster.local:6650"
time="2022-03-26T15:58:42Z" level=info msg="[TCP connection established]" local_addr="172.22.102.52:35020" remote_addr="pulsar://pulsar-dev-broker-2.pulsar-dev-broker.arch.svc.cluster.local:6650"
time="2022-03-26T15:58:42Z" level=info msg="[Connection is ready]" local_addr="172.22.102.52:35020" remote_addr="pulsar://pulsar-dev-broker-2.pulsar-dev-broker.arch.svc.cluster.local:6650"
time="2022-03-26T15:58:42Z" level=error msg="[Failed to create consumer]" consumerID=1 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=havir subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
time="2022-03-26T15:58:42Z" level=error msg="[Failed to create consumer]" consumerID=1 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=havir subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
time="2022-03-26T15:58:43Z" level=error msg="[Failed to create consumer]" consumerID=2 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=hclif subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
time="2022-03-26T15:58:43Z" level=error msg="[Failed to create consumer]" consumerID=2 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=hclif subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
...
...
...
time="2022-03-26T15:59:30Z" level=error msg="[Failed to create consumer]" consumerID=20 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=mqynj subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
time="2022-03-26T15:59:30Z" level=error msg="[Failed to create consumer]" consumerID=20 error="server error: ConsumerBusy: Exclusive consumer is already connected" name=mqynj subscription=by-dev-queryNode-432070465564966913-900 topic="persistent://public/default/by-dev-rootcoord-dml_0"
panic: Failed to create consumer by-dev-rootcoord-dml_0, error = All attempts results:
attempt #1:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #2:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #3:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #4:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #5:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #6:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #7:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #8:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #9:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #10:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #11:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #12:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #13:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #14:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #15:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #16:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #17:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #18:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #19:server error: ConsumerBusy: Exclusive consumer is already connected
attempt #20:server error: ConsumerBusy: Exclusive consumer is already connected

goroutine 271 [running]:
github.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).AsConsumerWithPosition(0xc000280410, 0xc000122d00, 0x5, 0x8, 0xc0006ba0c0, 0x27, 0x1)
/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:165 +0x2a7
github.com/milvus-io/milvus/internal/msgstream.(*mqMsgStream).AsConsumer(0xc000280410, 0xc000122d00, 0x5, 0x8, 0xc0006ba0c0, 0x27)
/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:131 +0x66
github.com/milvus-io/milvus/internal/querycoord.unsubscribeChannels(0x34c06d0, 0xc0003f7480, 0x34c0ee8, 0xc0002d7060, 0xc0006ba0c0, 0x27, 0xc000122d00, 0x5, 0x8, 0x0, ...)
/go/src/github.com/milvus-io/milvus/internal/querycoord/channel_unsubscribe.go:171 +0xd1
github.com/milvus-io/milvus/internal/querycoord.(*channelUnsubscribeHandler).handleChannelUnsubscribeLoop(0xc0006ae9b0)
/go/src/github.com/milvus-io/milvus/internal/querycoord/channel_unsubscribe.go:136 +0x20b
created by github.com/milvus-io/milvus/internal/querycoord.(*channelUnsubscribeHandler).start
/go/src/github.com/milvus-io/milvus/internal/querycoord/channel_unsubscribe.go:156 +0x5c
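
The ConsumerBusy loop above is the classic symptom of an Exclusive Pulsar subscription: the broker still holds the previous consumer's session for that subscription name, so every reconnect attempt is refused until the old session is released. A toy model of the admission rule, not Pulsar's actual broker code (the `broker`/`subscribe` names below are assumptions for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// broker models exclusive-subscription admission: the first consumer on a
// subscription wins; any later consumer with a different name is rejected
// with ConsumerBusy until the owner unsubscribes (or its session expires).
type broker struct {
	mu     sync.Mutex
	owners map[string]string // subscription name -> owning consumer name
}

func newBroker() *broker {
	return &broker{owners: make(map[string]string)}
}

func (b *broker) subscribe(subscription, consumer string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if owner, ok := b.owners[subscription]; ok && owner != consumer {
		return fmt.Errorf("server error: ConsumerBusy: Exclusive consumer is already connected (held by %s)", owner)
	}
	b.owners[subscription] = consumer
	return nil
}

func (b *broker) unsubscribe(subscription string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.owners, subscription)
}

func main() {
	b := newBroker()
	sub := "by-dev-queryNode-432070465564966913-900"
	fmt.Println(b.subscribe(sub, "havir")) // first consumer is admitted
	fmt.Println(b.subscribe(sub, "hclif")) // second consumer gets ConsumerBusy
	b.unsubscribe(sub)
	fmt.Println(b.subscribe(sub, "hclif")) // admitted once the owner is gone
}
```

Under this model, retrying from the restarted process can never succeed while the stale session survives, which matches the 20 failed attempts in the log; the usual remedies are waiting for the broker to expire the old session or unsubscribing/clearing it explicitly.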

@stale
Copy link

stale bot commented Apr 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Apr 25, 2022
@binbinlv
Copy link
Contributor Author

Has not been reproduced for a long time, so closing it; will reopen if it occurs again in the future.
