[Chaos] Create collection and search get stuck during datacoord is killed #5985

Closed
3 tasks
yanliang567 opened this issue Jun 22, 2021 · 8 comments
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@yanliang567
Contributor

yanliang567 commented Jun 22, 2021

Flush hangs if datacoord is deleted and restarted

Describe the bug
Flush hangs until it times out if datacoord is deleted

Steps/Code to reproduce behavior

  1. Deploy Milvus in cluster mode
  2. Insert and flush in a loop
  3. Delete the datacoord pod
  4. Flush hangs without timing out

Expected behavior

  1. flush should time out if it fails
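
For illustration, here is a minimal client-side sketch of the expected behavior: a blocking call is given a hard deadline so the caller gets control back instead of hanging. This is not pymilvus or Milvus code; `call_with_timeout` and `hanging_flush` are hypothetical names, and the stand-in simply blocks the way a flush might while datacoord is down.

```python
import concurrent.futures
import threading


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread; raise TimeoutError if it is not done in timeout_s."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call did not finish within {timeout_s}s") from None
    finally:
        # Do not wait for a stuck worker; the caller gets control back immediately.
        executor.shutdown(wait=False)


def hanging_flush(release_event):
    # Hypothetical stand-in for a flush that blocks while datacoord is unavailable.
    release_event.wait()
    return "flushed"


if __name__ == "__main__":
    release = threading.Event()
    try:
        call_with_timeout(hanging_flush, 0.2, release)
    except TimeoutError as err:
        print("flush gave up:", err)
    finally:
        release.set()  # let the stand-in worker exit so the process can stop
```

The worker thread is abandoned rather than cancelled, so this only bounds how long the caller waits. A real fix would also need the server side to stop retrying.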

Actual

  1. flush hangs without timeout

Method of installation

  • Docker/cpu

  • Docker/gpu

  • Build from source

  • Milvus version (master or released version)
    Milvus 2.0 with cluster deployment
    Built: Tue Jun 22 03:36:05 UTC 2021
    GitCommit: 438e7fb

Additional context

@yanliang567 yanliang567 added kind/bug Issues or changes related a bug sdk/python labels Jun 22, 2021
@yanliang567 yanliang567 added this to the 2.0 milestone Jun 22, 2021
@yanliang567
Contributor Author

Actually, one more side effect is that other operations such as create, insert, and search are also blocked or slowed down by the flush hang.

@XuanYang-cn
Contributor

I'm looking into this right now.

@sunby sunby assigned sunby and unassigned XuanYang-cn Jun 24, 2021
@sunby
Contributor

sunby commented Jun 24, 2021

The client retried too many times to connect to datacoord, so the flush request blocked. The proxy handles requests sequentially, so other requests wait until the previous request returns. IMHO, requests should be handled concurrently unless they actually depend on each other. This will be fixed after 2.0-RC1. @yanliang567 @czs007
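
To make the sequential-versus-concurrent point concrete, here is a small self-contained simulation. It is not the proxy's actual scheduler: `serve`, `slow_flush`, and `fast_search` are hypothetical names, and the sleep durations are arbitrary. With one worker, the search waits behind the stuck flush. With two workers, it completes on its own.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def serve(requests, max_workers):
    """Run (name, handler) pairs on a pool; return {name: seconds until completion}."""
    start = time.monotonic()
    finished = {}
    lock = threading.Lock()

    def run(name, handler):
        handler()
        with lock:
            finished[name] = time.monotonic() - start

    # Leaving the with-block waits for every submitted request to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, handler in requests:
            pool.submit(run, name, handler)
    return finished


def slow_flush():
    # Hypothetical stand-in for a flush stuck retrying its connection to datacoord.
    time.sleep(0.5)


def fast_search():
    # Hypothetical stand-in for a search that does not depend on datacoord.
    time.sleep(0.01)


requests = [("flush", slow_flush), ("search", fast_search)]
sequential = serve(requests, max_workers=1)  # search queues behind flush
parallel = serve(requests, max_workers=2)    # search runs alongside flush

print(f"sequential search done after {sequential['search']:.2f}s")
print(f"parallel search done after {parallel['search']:.2f}s")
```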

@yanliang567
Contributor Author

logs from build master-20210703-111a24a:
DEBUG ci_test:test_chaos.py:131 chaos injected
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR ci_test:api_request.py:24 Traceback (most recent call last):
File "/Users/yanliang/fork/milvus/tests20/python_client/utils/api_request.py", line 18, in inner_wrapper
res = func(*args, **kwargs)
File "/Users/yanliang/fork/milvus/tests20/python_client/utils/api_request.py", line 42, in api_request
return func(*arg, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus_orm/collection.py", line 893, in create_index
return conn.create_index(self._name, field_name, index_params,
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 63, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 51, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 40, in inner
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/stub.py", line 749, in create_index
return handler.create_index(collection_name, field_name, params, timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 705, in create_index
self.flush([collection_name], timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 984, in flush
self._wait_for_flushed(collection_name, lambda: future.result().coll_segIDs[collection_name].data)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 955, in _wait_for_flushed
if flushed(first_segment_ids):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 940, in flushed
infos = self.get_persistent_segment_infos(collection_name, timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 64, in handler
raise e
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 56, in handler
return func(self, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 935, in get_persistent_segment_infos
raise BaseException(status.error_code, status.reason)
pymilvus.client.exceptions.BaseException: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>

ERROR ci_test:api_request.py:25 (api_response) [Milvus API Exception]<function api_request at 0x7fd091091280>: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=getSegmentsOfCollection, err:rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
ERROR pymilvus.client.grpc_handler:grpc_handler.py:58 Error: <BaseException: (code=1, message=Failed to call flush to data coordinator: Connect to datacoord failed with error:
All attempts results:
attempt #1:context deadline exceeded
attempt #2:context deadline exceeded
attempt #3:context deadline exceeded
attempt #4:context deadline exceeded
attempt #5:context deadline exceeded
attempt #6:context deadline exceeded
attempt #7:context deadline exceeded
attempt #8:context deadline exceeded
attempt #9:context deadline exceeded
attempt #10:context deadline exceeded
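
The log above shows ten connection attempts, each ending in `context deadline exceeded`. As a rough sketch of how a retry loop can be bounded by both an attempt count and a total time budget, consider the following. This is not the Milvus retry implementation; `retry_with_deadline` and `RetryError` are hypothetical names, and the error message merely imitates the format seen in the log.

```python
import time


class RetryError(Exception):
    """Raised when every attempt failed or the overall deadline ran out."""

    def __init__(self, errors):
        self.errors = errors
        lines = [f"attempt #{i}: {err}" for i, err in enumerate(errors, start=1)]
        super().__init__("All attempts results:\n" + "\n".join(lines))


def retry_with_deadline(func, max_attempts, deadline_s, backoff_s=0.0):
    """Call func until it succeeds, attempts run out, or deadline_s has elapsed."""
    start = time.monotonic()
    errors = []
    for _ in range(max_attempts):
        # Stop starting new attempts once the total time budget is spent.
        if time.monotonic() - start >= deadline_s:
            break
        try:
            return func()
        except Exception as err:
            errors.append(err)
            time.sleep(backoff_s)
    raise RetryError(errors)
```

The deadline is only checked between attempts, so a single in-flight attempt is not interrupted. Each attempt still needs its own timeout.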

@wxyucs wxyucs added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 13, 2021
@ThreadDao
Contributor

/assign @ThreadDao

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 16, 2021
@ThreadDao
Contributor

ThreadDao commented Aug 19, 2021

  1. The version
    Built:     Thu Aug 19 02:31:20 UTC 2021
    GitCommit: 648d22e
  2. The result
    Expect:
    While the dataCoord pod is being killed, insert and flush are expected to fail, and create, index, query, and search are expected to succeed.
    Actual:
    create, index, and search failed; only query succeeded.

  3. The logs
    milvus_logs.tar.gz
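
The expected and actual results above can be compared mechanically. The snippet below is only an illustration of that comparison. The `actual` values for insert and flush are an assumption read from "only query" succeeding, since the report does not state them explicitly.

```python
# Expected outcome per operation while the dataCoord pod is being killed.
expected = {
    "insert": "fail",
    "flush": "fail",
    "create": "success",
    "index": "success",
    "query": "success",
    "search": "success",
}

# Reported outcome; insert and flush are assumed to have failed (not stated explicitly).
actual = {
    "insert": "fail",
    "flush": "fail",
    "create": "fail",
    "index": "fail",
    "query": "success",
    "search": "fail",
}

# Operations whose observed result differs from the expectation.
mismatches = sorted(op for op in expected if expected[op] != actual[op])
print(mismatches)
```

Under that assumption, the operations that deviate from the expectation are create, index, and search.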

Fault injection time period

[2021-08-19 11:01:25 - DEBUG - ci_test]: chaos injected (test_chaos.py:132)

[2021-08-19 11:02:06 - DEBUG - ci_test]: chaos deleted (test_chaos.py:157)

The datacoord pod name before it was killed
milvus-chaos-datacoord-756588d59c-kqfmx

All pod names after the chaos was deleted

milvus-chaos-datacoord-756588d59c-dcsqn      1/1     Running   0          18m
milvus-chaos-datanode-7cf694bc94-4w8j8       1/1     Running   0          23m
milvus-chaos-etcd-0                          1/1     Running   0          23m
milvus-chaos-indexcoord-fd9c6f54d-4qpww      1/1     Running   0          23m
milvus-chaos-indexnode-7cd6c6c7f5-cvlfr      1/1     Running   0          23m
milvus-chaos-minio-6cd5bb4c6f-gxhq9          1/1     Running   0          23m
milvus-chaos-proxy-79d8b948f8-9h5wc          1/1     Running   0          23m
milvus-chaos-pulsar-97cfccf4f-kljcn          1/1     Running   0          23m
milvus-chaos-querycoord-66b76476f9-xrb68     1/1     Running   0          23m
milvus-chaos-querynode-6b45f7bb5c-mfzwh      1/1     Running   0          23m
milvus-chaos-rootcoord-89d94b5fc-p95xp       1/1     Running   0          23m

@ThreadDao
Contributor

/unassign

@ThreadDao ThreadDao modified the milestones: 2.0-Backlog, 2.0.0-RC5 Aug 19, 2021
@ThreadDao ThreadDao added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 19, 2021
@xiaofan-luan xiaofan-luan modified the milestones: 2.0.0-RC5, 2.0.0-RC6 Aug 20, 2021
@ThreadDao ThreadDao changed the title from "[Chaos] Flush hang without timeout if datacoord deleted" to "[Chaos] Create collection and search get stuck during datacoord is killed" Aug 25, 2021
@ThreadDao
Contributor

ThreadDao commented Aug 27, 2021

I couldn't reproduce this on commit 8701c47-20210826. Closing it for now.
