Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: [benchmark][cluster] 1 billion dataset, query、load、search error #12323

Closed
1 task done
wangting0128 opened this issue Nov 29, 2021 · 11 comments
Closed
1 task done
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. test/benchmark benchmark test triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@wangting0128
Copy link
Contributor

wangting0128 commented Nov 29, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20211126-48bdeb5
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.0.0rc9.dev7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

client.log

client pod:benchmark-tag-79d6d-2834669771

client log:

[2021-11-27 08:54:14,661] [    INFO] -  (locust.stats_logger:733)
[2021-11-27 08:59:14,661] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] query
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1638003554.658878510","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
	{'API start': '2021-11-27 08:54:14.658085', 'RPC start': '2021-11-27 08:54:14.658092', 'RPC error': '2021-11-27 08:59:14.661492'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,674] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] load_collection
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1638003284.673792981","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
	{'API start': '2021-11-27 08:54:14.673329', 'RPC start': '2021-11-27 08:54:14.673338', 'RPC error': '2021-11-27 08:59:14.673983'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,692] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] describe_collection
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689533269","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:59:14.660490', 'RPC start': '2021-11-27 08:59:14.660499', 'RPC error': '2021-11-27 08:59:14.692242'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,693] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] search
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689533269","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:54:14.655184', 'RPC start': '2021-11-27 08:54:14.655191', 'RPC error': '2021-11-27 08:59:14.692960'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,693] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] describe_collection
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689555776","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:59:14.660828', 'RPC start': '2021-11-27 08:59:14.660835', 'RPC error': '2021-11-27 08:59:14.693464'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,694] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] search
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689555776","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:54:14.656158', 'RPC start': '2021-11-27 08:54:14.656165', 'RPC error': '2021-11-27 08:59:14.694061'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,694] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] describe_collection
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689562713","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:59:14.661138', 'RPC start': '2021-11-27 08:59:14.661148', 'RPC error': '2021-11-27 08:59:14.694872'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,695] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] search
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689562713","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:54:14.657094', 'RPC start': '2021-11-27 08:54:14.657100', 'RPC error': '2021-11-27 08:59:14.695435'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,696] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] describe_collection
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689568845","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:59:14.674683', 'RPC start': '2021-11-27 08:59:14.674691', 'RPC error': '2021-11-27 08:59:14.696006'} (pymilvus.client.grpc_handler:84)
[2021-11-27 08:59:14,696] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] search
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1638003554.689568845","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Socket closed","grpc_status":14}"
>
	{'API start': '2021-11-27 08:54:14.674914', 'RPC start': '2021-11-27 08:54:14.674921', 'RPC error': '2021-11-27 08:59:14.696496'} (pymilvus.client.grpc_handler:84)
[2021-11-27 09:00:14,692] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] _execute_search_requests
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1638003614.691806291","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
	{'API start': '2021-11-27 08:59:14.691454', 'RPC start': '2021-11-27 08:59:14.691463', 'RPC error': '2021-11-27 09:00:14.692293'} (pymilvus.client.grpc_handler:84)
[2021-11-27 09:00:14,706] [   ERROR] - 
Addr [benchmark-tag-79d6d-1-milvus.qa-milvus.svc.cluster.local:19530] search
RPC error: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1638003614.691806291","description":"Error received from peer ipv4:10.96.206.90:19530","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
	{'API start': '2021-11-27 08:54:14.654137', 'RPC start': '2021-11-27 08:54:14.654143', 'RPC error': '2021-11-27 09:00:14.706808'} (pymilvus.client.grpc_handler:84)

Expected Behavior

argo task: benchmark-tag-79d6d

test yaml:
client-configmap: client-random-locust-search-84h-1b
server-configmap: server-cluster-8c64m-datanode2-indexnode4-querynode6-nocompaction

server:

NAME                                                         READY   STATUS      RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-tag-79d6d-1-etcd-0                                      1/1     Running     0          2d10h   10.97.17.216   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-etcd-1                                      1/1     Running     0          2d10h   10.97.17.215   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-etcd-2                                      1/1     Running     0          2d10h   10.97.17.217   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-datacoord-5774c6dd4d-csw2s           1/1     Running     0          2d10h   10.97.10.36    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-datanode-78bc74874f-9nmj5            1/1     Running     0          2d10h   10.97.10.39    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-datanode-78bc74874f-dg2c2            1/1     Running     0          2d10h   10.97.13.83    qa-node010.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-indexcoord-6896f9fc4f-lkb54          1/1     Running     0          2d10h   10.97.10.35    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-indexnode-75cf4d6884-5rm7s           1/1     Running     0          2d10h   10.97.17.211   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-indexnode-75cf4d6884-8knn5           1/1     Running     0          2d10h   10.97.17.209   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-indexnode-75cf4d6884-sgntx           1/1     Running     0          2d10h   10.97.14.88    qa-node011.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-indexnode-75cf4d6884-xmwk6           1/1     Running     0          2d10h   10.97.19.44    qa-node016.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-proxy-568b885b58-8drrm               1/1     Running     0          2d10h   10.97.8.91     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querycoord-84565ff598-zphw5          1/1     Running     0          2d10h   10.97.10.37    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-29lnh           1/1     Running     0          2d10h   10.97.15.112   qa-node012.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-69x27           1/1     Running     0          2d10h   10.97.11.177   qa-node009.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-8dj7s           1/1     Running     0          2d10h   10.97.15.113   qa-node012.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-kssr2           1/1     Running     1          2d10h   10.97.17.210   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-tzskn           1/1     Running     1          2d10h   10.97.14.85    qa-node011.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-zgt2j           1/1     Running     0          2d10h   10.97.10.38    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-milvus-rootcoord-846d4bc9b4-xxrv9           1/1     Running     0          2d10h   10.97.8.89     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-minio-0                                     1/1     Running     0          2d10h   10.97.7.234    qa-node005.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-minio-1                                     1/1     Running     0          2d10h   10.97.7.236    qa-node005.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-minio-2                                     1/1     Running     0          2d10h   10.97.8.94     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-minio-3                                     1/1     Running     0          2d10h   10.97.8.93     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-autorecovery-857c7cd59b-blbbg        1/1     Running     0          2d10h   10.97.8.92     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-bastion-6bc898bb7b-gjvjv             1/1     Running     0          2d10h   10.97.8.90     qa-node006.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-bookkeeper-0                         1/1     Running     4          2d10h   10.97.7.235    qa-node005.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-bookkeeper-1                         1/1     Running     3          2d10h   10.97.9.29     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-broker-8f9cff75-wkjn6                1/1     Running     1          2d10h   10.97.9.26     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-proxy-54584b74f9-57wrs               2/2     Running     0          2d10h   10.97.10.34    qa-node008.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-zookeeper-0                          1/1     Running     0          2d10h   10.97.14.86    qa-node011.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-zookeeper-1                          1/1     Running     0          2d10h   10.97.9.27     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-zookeeper-2                          1/1     Running     0          2d10h   10.97.9.28     qa-node007.zilliz.local   <none>           <none>
benchmark-tag-79d6d-1-pulsar-zookeeper-metadata-n78cn             0/1     Completed   0          2d10h   10.97.14.84    qa-node011.zilliz.local   <none>           <none>

Steps To Reproduce

No response

Anything else?

No response

@wangting0128 wangting0128 added kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test labels Nov 29, 2021
@yanliang567 yanliang567 added this to the 2.0.0-GA milestone Nov 29, 2021
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 29, 2021
@wxyucs
Copy link
Member

wxyucs commented Nov 29, 2021

may related to #12285

@xige-16
Copy link
Contributor

xige-16 commented Nov 29, 2021

please check why querynode down, @bigsheeper
[2021/11/27 08:41:45.673 +00:00] [DEBUG] [session_util.go:347] ["watch services"] ["delete kv"="key:"by-dev/meta/session/QueryNode-10" create_revision:43 mod_revision:43 version:1 value:"{\"ServerID\":10,\"ServerName\":\"QueryNode\",\"Address\":\"10.97.17.210:21123\"}" lease:5853972929078545970 "]
[2021/11/27 08:41:45.674 +00:00] [INFO] [grpclog.go:37] ["[core]Subchannel Connectivity change to SHUTDOWN"]

@bigsheeper
Copy link
Contributor

/assign

@bigsheeper
Copy link
Contributor

[2021/11/27 08:41:00.591 +00:00] [DEBUG] [impl.go:279] ["watchDmChannelsTask WaitToFinish done"] [collectionID=429376988591685633]
panic: Failed to create consumer by-dev-rootcoord-dml_1, error = All attempts results:
attempt #8:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #15:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #17:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumerWithPosition(0xc0042005a0, 0xc003efab10, 0x1, 0x1, 0xc002e9e500, 0x3a, 0x1)
github.com/milvus-io/milvus/internal/querynode.(*watchDmChannelsTask).Execute(0xc0004f1620, 0x2913440, 0xc000515580, 0x0, 0x0)
attempt #12:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #18:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:81 +0x5c
attempt #1:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #11:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
/go/src/github.com/milvus-io/milvus/internal/querynode/task.go:335 +0x268e
github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).loadAndReleaseLoop(0xc0005155c0)
attempt #20:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #4:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
goroutine 462 [running]:
github.com/milvus-io/milvus/internal/querynode.(*queryNodeFlowGraph).seekQueryNodeFlowGraph(0xc003eeed70, 0xc0003d60c0, 0xc000506000, 0x2b)
attempt #5:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #7:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #16:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #19:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:689 +0x66
attempt #2:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #3:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #6:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #9:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #10:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #13:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
attempt #14:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available

/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:720 +0x2a7

github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumer(0xc0042005a0, 0xc003efab10, 0x1, 0x1, 0xc002e9e500, 0x3a)
/go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_query_node.go:218 +0xac
github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask(0xc0005155c0, 0x29257c0, 0xc0004f1620, 0x2923de0, 0xc0004e6500)
/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:56 +0x173
/go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:73 +0x191
created by github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).Start

@bigsheeper
Copy link
Contributor

bigsheeper commented Nov 29, 2021

[2021/11/27 08:41:00.591 +00:00] [DEBUG] [impl.go:279] ["watchDmChannelsTask WaitToFinish done"] [collectionID=429376988591685633] panic: Failed to create consumer by-dev-rootcoord-dml_1, error = All attempts results: attempt #8:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #15:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #17:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumerWithPosition(0xc0042005a0, 0xc003efab10, 0x1, 0x1, 0xc002e9e500, 0x3a, 0x1) github.com/milvus-io/milvus/internal/querynode.(*watchDmChannelsTask).Execute(0xc0004f1620, 0x2913440, 0xc000515580, 0x0, 0x0) attempt #12:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #18:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:81 +0x5c attempt #1:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #11:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available /go/src/github.com/milvus-io/milvus/internal/querynode/task.go:335 +0x268e github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).loadAndReleaseLoop(0xc0005155c0) attempt #20:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #4:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available goroutine 462 [running]: github.com/milvus-io/milvus/internal/querynode.(*queryNodeFlowGraph).seekQueryNodeFlowGraph(0xc003eeed70, 0xc0003d60c0, 0xc000506000, 0x2b) attempt #5:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #7:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #16:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #19:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available /go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:689 +0x66 attempt #2:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #3:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #6:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #9:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #10:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #13:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available attempt #14:server error: PersistenceError: java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available

/go/src/github.com/milvus-io/milvus/internal/msgstream/mq_msgstream.go:720 +0x2a7

github.com/milvus-io/milvus/internal/msgstream.(*MqTtMsgStream).AsConsumer(0xc0042005a0, 0xc003efab10, 0x1, 0x1, 0xc002e9e500, 0x3a) /go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_query_node.go:218 +0xac github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).processTask(0xc0005155c0, 0x29257c0, 0xc0004f1620, 0x2923de0, 0xc0004e6500) /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:56 +0x173 /go/src/github.com/milvus-io/milvus/internal/querynode/task_scheduler.go:73 +0x191 created by github.com/milvus-io/milvus/internal/querynode.(*taskScheduler).Start

Pod benchmark-tag-79d6d-1-milvus-querynode-77c56d79d6-tzskn panic caused by no enough non-faulty bookies available when consuming message stream.

@xige-16
Copy link
Contributor

xige-16 commented Nov 29, 2021

a querynode down, querycoord add a loadbalance task, when the intrenal task of the balance task reschedule, occure the error: @godchen0212 please check it
[2021/11/27 10:15:32.857 +00:00] [ERROR] [task_scheduler.go:705] ["waitActivateTaskDone: reschedule task error"] [taskID=429376955130052769] [triggerTaskID=429376955130052715] [error="delta channel not exist in meta"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).waitActivateTaskDone.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:705\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).waitActivateTaskDone\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:804"]

@xige-16
Copy link
Contributor

xige-16 commented Nov 29, 2021

a querynode down, querycoord add a loadbalance task, when the intrenal task of the balance task reschedule, occure the error: @godchen0212 please check it [2021/11/27 10:15:32.857 +00:00] [ERROR] [task_scheduler.go:705] ["waitActivateTaskDone: reschedule task error"] [taskID=429376955130052769] [triggerTaskID=429376955130052715] [error="delta channel not exist in meta"] [stack="github.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).waitActivateTaskDone.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:705\ngithub.com/milvus-io/milvus/internal/querycoord.(*TaskScheduler).waitActivateTaskDone\n\t/go/src/github.com/milvus-io/milvus/internal/querycoord/task_scheduler.go:804"]

@godchen0212
/assign @godchen0212

@bigsheeper
Copy link
Contributor

59359.661: Total time for which application threads were stopped: 0.0019294 seconds, Stopping threads took: 0.0008331 seconds
59359.669: Total time for which application threads were stopped: 0.0005431 seconds, Stopping threads took: 0.0000839 seconds
08:40:00.179 [BookieReadThreadPool-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting bookie
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:776) ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:188) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1445) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.proto.ReadEntryProcessor.processPacket(ReadEntryProcessor.java:80) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:128) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:164) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage.getEntry(SingleDirectoryDbLedgerStorage.java:430) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.readEntry(LedgerDescriptorImpl.java:160) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) [io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.getEntry(DbLedgerStorage.java:199) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731) ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:158) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]

memory limit of bookkeeper pod is not enough @wangting0128

@bigsheeper
Copy link
Contributor

[2021/11/27 08:41:05.270 +00:00] [DEBUG] [session_util.go:98] ["Session connect to etcd success"]
[2021/11/27 08:41:05.270 +00:00] [DEBUG] [session_util.go:128] ["Session checkIDExist Begin"]
[2021/11/27 08:41:05.271 +00:00] [DEBUG] [session_util.go:135] ["Session checkIDExist End"]
[2021/11/27 08:41:05.271 +00:00] [DEBUG] [session_util.go:140] ["Session try to get serverID"]
[2021/11/27 08:41:05.273 +00:00] [DEBUG] [session_util.go:171] ["Session get serverID success"]
[2021/11/27 08:41:05.273 +00:00] [DEBUG] [session_util.go:191] ["Session Register Begin QueryNode"]
[2021/11/27 08:41:05.275 +00:00] [DEBUG] [session_util.go:232] ["Session Register End"] [ServerID=19]
[2021/11/27 08:41:05.276 +00:00] [DEBUG] [query_node.go:141] ["query nodeID"] [nodeID=19]
[2021/11/27 08:41:05.276 +00:00] [DEBUG] [query_node.go:142] ["query node address"] [address=10.97.14.85:21123]
[2021/11/27 08:41:05.276 +00:00] [DEBUG] [service.go:232] ["QueryNode init done ..."]
[2021/11/27 08:41:05.275 +00:00] [INFO] [roles.go:482] ["Set log file to "] [path=]
[2021/11/27 08:41:05.284 +00:00] [DEBUG] [minio_kv.go:92] ["MinioKV new MinioKV success."]
[2021/11/27 08:41:05.284 +00:00] [DEBUG] [query_node.go:275] ["query node start successfully"] [queryNodeID=19] [IP=10.97.14.85] [Port=21123]
[2021/11/27 08:41:05.284 +00:00] [DEBUG] [service.go:237] ["QueryNode start done ..."]
[2021/11/27 08:41:05.284 +00:00] [DEBUG] [query_node.go:332] ["query node watchChangeInfo start"]
[2021/11/27 08:41:05.284 +00:00] [DEBUG] [query_node.go:54] ["QueryNode successfully started"]

query node restart successfully.

@xige-16
Copy link
Contributor

xige-16 commented Nov 29, 2021

There is some error when recover segment of the down query node, releated with #12340

@xige-16
Copy link
Contributor

xige-16 commented Dec 6, 2021

/unassig @bigsheeper
/unassign @godchen0212

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. test/benchmark benchmark test triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants