[Chaos] Create collection and search get stuck during datacoord is killed #5985

yanliang567 · 2021-06-22T09:44:28Z

Flush hangs if datacoord is deleted and restarted

Describe the bug
Flush hangs until timeout if the datacoord pod is deleted

Steps/Code to reproduce behavior

1. Deploy Milvus in cluster mode.
2. Insert and flush in a loop.
3. Delete the datacoord pod.
4. Flush hangs without timing out.

Expected behavior

Flush should time out when it cannot complete.

Actual

Flush hangs without timing out.
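To make the expected behavior concrete, here is a minimal sketch of a wait loop bounded by a deadline. It is not the pymilvus implementation; `wait_until`, `FlushTimeout`, and `never_flushed` are hypothetical names used only for illustration, and the predicate stands in for a "are all segments flushed?" check.

```python
import time


class FlushTimeout(Exception):
    """Raised when the condition is not met before the deadline (hypothetical)."""


def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            raise FlushTimeout(f"condition not met within {timeout} s")
        time.sleep(interval)


def never_flushed():
    # Hypothetical stand-in for a flush check that can never succeed,
    # for example because the data coordinator is down.
    return False


try:
    wait_until(never_flushed, timeout=0.1)
    outcome = "flushed"
except FlushTimeout:
    outcome = "timed out"
```

With a deadline like this, the caller gets an error after a bounded time instead of waiting indefinitely.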

Method of installation

Docker/cpu
Docker/gpu
Build from source

Milvus version (master or released version)

Milvus 2.0 with cluster deployment
Built: Tue Jun 22 03:36:05 UTC 2021
GitCommit: 438e7fb

Additional context


yanliang567 · 2021-06-22T09:50:33Z

Actually, one more side effect is that other operations such as create, insert, and search are also blocked or slowed down by the flush hang.

XuanYang-cn · 2021-06-23T11:03:31Z

I'm looking into this right now.

sunby · 2021-06-24T06:10:13Z

The client retried too many times to connect to datacoord, so the flush request blocked. The proxy handles requests sequentially, so other requests wait until the previous request returns. IMHO, requests should be handled concurrently unless they actually depend on each other. This will be fixed after 2.0-RC1. @yanliang567 @czs007
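A rough illustration of the effect described above, not the proxy's actual code: with a single worker, a fast request queued behind a slow one has to wait for it, while with two workers it returns right away. `slow_flush`, `fast_search`, and `search_latency` are hypothetical names, and the 0.3 s sleep is an arbitrary stand-in for a flush stuck retrying an unavailable service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_flush():
    # Stand-in for a flush that is stuck retrying an unavailable service.
    time.sleep(0.3)
    return "flush done"


def fast_search():
    # Stand-in for a request that does not depend on the flush.
    return "search done"


def search_latency(workers):
    """Submit a slow flush, then a search; return how long the search took."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.monotonic()
        pool.submit(slow_flush)
        search = pool.submit(fast_search)
        search.result()
        return time.monotonic() - start


# One worker models sequential handling; two workers model concurrent handling.
sequential = search_latency(workers=1)
concurrent = search_latency(workers=2)
print(f"sequential: {sequential:.2f} s, concurrent: {concurrent:.2f} s")
```

In the sequential case the search latency includes the whole flush duration, which matches the "other requests wait" symptom reported in this issue.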

yanliang567 · 2021-07-03T08:08:23Z

logs from build master-20210703-111a24a:
DEBUG ci_test:test_chaos.py:131 chaos injected
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR ci_test:api_request.py:24 Traceback (most recent call last):
File "/Users/yanliang/fork/milvus/tests20/python_client/utils/api_request.py", line 18, in inner_wrapper
res = func(*args, **kwargs)
File "/Users/yanliang/fork/milvus/tests20/python_client/utils/api_request.py", line 42, in api_request
return func(*arg, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus_orm/collection.py", line 893, in create_index
return conn.create_index(self._name, field_name, index_params,
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 63, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 51, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 40, in inner
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 749, in create_index
return handler.create_index(collection_name, field_name, params, timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 705, in create_index
self.flush([collection_name], timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 984, in flush
self._wait_for_flushed(collection_name, lambda: future.result().coll_segIDs[collection_name].data)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 955, in _wait_for_flushed
if flushed(first_segment_ids):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 940, in flushed
infos = self.get_persistent_segment_infos(collection_name, timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 935, in get_persistent_segment_infos
raise BaseException(status.error_code, status.reason)
pymilvus.client.exceptions.BaseException: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>

ERROR ci_test:api_request.py:25 (api_response) [Milvus API Exception]<function api_request at 0x7fd091091280>: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=Failed to call flush to data coordinator: Connect to datacoord failed with error:
All attempts results:
attempt #1:context deadline exceeded
attempt #2:context deadline exceeded
attempt #3:context deadline exceeded
attempt #4:context deadline exceeded
attempt #5:context deadline exceeded
attempt #6:context deadline exceeded
attempt #7:context deadline exceeded
attempt #8:context deadline exceeded
attempt #9:context deadline exceeded
attempt #10:context deadline exceeded
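The ten attempts listed above each ran until their own deadline, so the request blocked for roughly ten times the per-attempt timeout. A minimal sketch of capping the retries with an overall time budget follows. It is not the Milvus retry code; `call_with_retries` and `connect_to_datacoord` are hypothetical, and the 0.05 s per-attempt delay and 0.12 s budget are made-up values.

```python
import time


def call_with_retries(fn, attempts, overall_timeout):
    """Call fn() up to `attempts` times, but stop once overall_timeout has elapsed."""
    deadline = time.monotonic() + overall_timeout
    errors = []
    for attempt in range(1, attempts + 1):
        if time.monotonic() >= deadline:
            break  # budget spent: give up instead of starting another attempt
        try:
            return fn()
        except TimeoutError as exc:
            errors.append(f"attempt #{attempt}: {exc}")
    raise TimeoutError("; ".join(errors) or "no attempt made before the deadline")


calls = []


def connect_to_datacoord():
    # Hypothetical stand-in: every attempt waits out its own timeout, then fails.
    calls.append(time.monotonic())
    time.sleep(0.05)
    raise TimeoutError("context deadline exceeded")


try:
    call_with_retries(connect_to_datacoord, attempts=10, overall_timeout=0.12)
    failed = False
except TimeoutError:
    failed = True

print(f"gave up after {len(calls)} of 10 attempts")
```

Without the overall budget, the same ten attempts would take about 0.5 s in this sketch; with it, the caller gets the failure back sooner and can release the request.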

ThreadDao · 2021-08-16T06:52:58Z

/assign @ThreadDao

ThreadDao · 2021-08-19T03:22:36Z

The version

Built:     Thu Aug 19 02:31:20 UTC 2021
GitCommit: 648d22e

The result
Expect:
While the dataCoord pod is being killed, insert and flush are expected to fail, and create, index, query, and search are expected to succeed.
Actual:
create, index, and search failed; only query succeeded.
The logs
milvus_logs.tar.gz

Fault injection time period

[2021-08-19 11:01:25 - DEBUG - ci_test]: chaos injected (test_chaos.py:132)

[2021-08-19 11:02:06 - DEBUG - ci_test]: chaos deleted (test_chaos.py:157)

The datacoord pod name before it was killed
milvus-chaos-datacoord-756588d59c-kqfmx

All pod names after the chaos was deleted

milvus-chaos-datacoord-756588d59c-dcsqn      1/1     Running   0          18m
milvus-chaos-datanode-7cf694bc94-4w8j8       1/1     Running   0          23m
milvus-chaos-etcd-0                          1/1     Running   0          23m
milvus-chaos-indexcoord-fd9c6f54d-4qpww      1/1     Running   0          23m
milvus-chaos-indexnode-7cd6c6c7f5-cvlfr      1/1     Running   0          23m
milvus-chaos-minio-6cd5bb4c6f-gxhq9          1/1     Running   0          23m
milvus-chaos-proxy-79d8b948f8-9h5wc          1/1     Running   0          23m
milvus-chaos-pulsar-97cfccf4f-kljcn          1/1     Running   0          23m
milvus-chaos-querycoord-66b76476f9-xrb68     1/1     Running   0          23m
milvus-chaos-querynode-6b45f7bb5c-mfzwh      1/1     Running   0          23m
milvus-chaos-rootcoord-89d94b5fc-p95xp       1/1     Running   0          23m

ThreadDao · 2021-08-19T03:22:50Z

/unassign

ThreadDao · 2021-08-27T07:46:27Z

I could not reproduce this on commit 8701c47-20210826. Closing it for now.

yanliang567 added the kind/bug and sdk/python labels Jun 22, 2021

yanliang567 added this to the 2.0 milestone Jun 22, 2021

yanliang567 assigned xiaocai2333 Jun 22, 2021

xiaocai2333 assigned XuanYang-cn and unassigned xiaocai2333 Jun 22, 2021

sunby assigned sunby and unassigned XuanYang-cn Jun 24, 2021

wxyucs added the needs-triage label Jul 13, 2021

xiaofan-luan removed the sdk/python label Aug 14, 2021

sre-ci-robot assigned ThreadDao Aug 16, 2021

yanliang567 added the triage/accepted label and removed the needs-triage label Aug 16, 2021

sre-ci-robot unassigned ThreadDao Aug 19, 2021

ThreadDao modified the milestones: 2.0-Backlog, 2.0.0-RC5 Aug 19, 2021

ThreadDao added the priority/critical-urgent label Aug 19, 2021

xiaofan-luan modified the milestones: 2.0.0-RC5, 2.0.0-RC6 Aug 20, 2021

ThreadDao changed the title ~~[Chaos] Flush hang without timeout if datacoord deleted~~ [Chaos] Create collection and search get stuck during datacoord is killed Aug 25, 2021

ThreadDao closed this as completed Aug 27, 2021
