[Bug]: Search failed with error fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica without any chaos #21026

Closed
1 task done
zhuwenxing opened this issue Dec 7, 2022 · 9 comments
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20221206-f8cff798
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.3.0.dev15
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2022-12-07T02:23:24.289Z] [2022-12-07 02:23:13 - INFO - ci_test]: [test][2022-12-07T02:23:09Z] [4.03896960s] Hello_Milvus load -> None (wrapper.py:30)

[2022-12-07T02:23:24.289Z] [2022-12-07 02:23:13 - INFO - ci_test]: assert load: 4.039217472076416 (test_data_persistence.py:89)

[2022-12-07T02:23:24.289Z] [2022-12-07 02:23:13 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.1086702406615037, 0.03836368826610158, 0.08882218314766184, 0.10044394949779682, 0.13572320745375752, 0.07011054020093992, 0.09889132419668602, 0.10155191962304858, 0.11281254282514301, 0.009236124677987909, 0.15303027050056867, 0.05284775407586183, 0.1029843927374862, 0.026665505247487422, 0.0......, kwargs: {} (api_request.py:56)

[2022-12-07T02:23:24.289Z] [2022-12-07 02:23:23 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-07T02:23:24.289Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_437880003361443220v1 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #8:context deadline exceeded

[2022-12-07T02:23:24.290Z] )>, <Time:{'RPC start': '2022-12-07 02:23:13.990435', 'RPC error': '2022-12-07 02:23:23.993269'}> (decorators.py:108)

[2022-12-07T02:23:24.290Z] [2022-12-07 02:23:23 - ERROR - ci_test]: Traceback (most recent call last):

[2022-12-07T02:23:24.290Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-12-07T02:23:24.290Z]     res = func(*args, **_kwargs)

[2022-12-07T02:23:24.290Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-12-07T02:23:24.290Z]     return func(*arg, **kwargs)

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search

[2022-12-07T02:23:24.290Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-12-07T02:23:24.290Z]     raise e

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-12-07T02:23:24.290Z]     return func(*args, **kwargs)

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-12-07T02:23:24.290Z]     ret = func(self, *args, **kwargs)

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-12-07T02:23:24.290Z]     raise e

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-12-07T02:23:24.290Z]     return func(self, *args, **kwargs)

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search

[2022-12-07T02:23:24.290Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests

[2022-12-07T02:23:24.290Z]     raise pre_err

[2022-12-07T02:23:24.290Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 429, in _execute_search_requests

[2022-12-07T02:23:24.290Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-12-07T02:23:24.290Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-07T02:23:24.290Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_437880003361443220v1 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #8:context deadline exceeded

[2022-12-07T02:23:24.290Z] )>

[2022-12-07T02:23:24.290Z]  (api_request.py:39)

[2022-12-07T02:23:24.290Z] [2022-12-07 02:23:23 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-07T02:23:24.290Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437880003361443220v0 is not available in any replica

[2022-12-07T02:23:24.290Z] attempt #2:fail to get shard leaders from QueryCoord: channel by...... (api_request.py:40)

Expected Behavior

All test cases pass.

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/378/pipeline
log:

artifacts-indexcoord-pod-failure-378-server-logs.tar.gz

artifacts-indexcoord-pod-failure-378-pytest-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2022
@yanliang567
Contributor

/assign @jiaoew1991
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2022
@yanliang567 yanliang567 added this to the 2.3 milestone Dec 7, 2022
@jiaoew1991
Contributor

/assign @smellthemoon
/unassign

@zhuwenxing
Contributor Author

zhuwenxing commented Dec 9, 2022

- deploy task: reinstall
- milvus mode: cluster
- old image tag: v2.1.4
- new image tag: master-20221208-eb7ef01b

It reproduced again!
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/66/pipeline
log:
artifacts-pulsar-cluster-reinstall-66-server-logs.tar.gz

artifacts-pulsar-cluster-reinstall-66-pytest-logs.tar.gz

@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Dec 9, 2022
@moliqingwa

We have the same issue. How can we avoid it?

@yah01
Member

yah01 commented Dec 9, 2022

The target observer has to guarantee that the next target is never empty; otherwise the checkers may release channels/segments.
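
The invariant described here can be sketched roughly as follows. This is a minimal illustrative model in Python, not Milvus's actual QueryCoord code: `Target`, `TargetManager`, and `promote_next` are hypothetical names, and treating "empty" as "has no channels" is an assumption. The idea is that a next target is promoted to current only when it is non-empty, so checkers that compare the loaded state against the current target never see an empty one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Target:
    """Hypothetical model of a target: what should be loaded for a collection."""
    channels: Set[str] = field(default_factory=set)
    segments: Set[int] = field(default_factory=set)

    def is_empty(self) -> bool:
        # Assumption: a target with no channels counts as empty.
        return not self.channels


class TargetManager:
    """Hypothetical holder of the current and next target per collection."""

    def __init__(self) -> None:
        self.current: Dict[int, Target] = {}
        self.next: Dict[int, Target] = {}

    def promote_next(self, collection_id: int) -> bool:
        """Promote next to current only if next exists and is not empty."""
        candidate: Optional[Target] = self.next.get(collection_id)
        if candidate is None or candidate.is_empty():
            # Keep the old current target: promoting an empty one would let
            # checkers conclude that loaded channels/segments can be released.
            return False
        self.current[collection_id] = candidate
        del self.next[collection_id]
        return True
```

With this guard, an empty next target leaves the previous current target in place instead of replacing it.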

@yah01
Member

yah01 commented Dec 9, 2022

/assign

@yah01
Member

yah01 commented Dec 12, 2022

/assign @zhuwenxing
Fixed with #21069 and #20914.

@zhuwenxing
Contributor Author

Not reproduced in master-20221212-e977e014

@yah01
Member

yah01 commented Dec 12, 2022

The target observer and the collection observer both update the current target, which may lead to an unfinished target being assigned as the current target.
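
The race can be illustrated with a small model. This is a sketch under assumptions, not the code from #21069 or #20914: `TargetStore`, `begin_next`, `mark_ready`, and `try_promote` are hypothetical names. Here every observer goes through one lock-protected method that refuses to promote a next target that has not been marked as finished, so whichever observer runs first cannot publish a partially built target.

```python
import threading
from typing import Optional, Set


class TargetStore:
    """Hypothetical store shared by several observers."""

    def __init__(self, current: Set[str]) -> None:
        self._lock = threading.Lock()
        self._current: Set[str] = set(current)
        self._next: Optional[Set[str]] = None
        self._next_ready = False

    def begin_next(self) -> None:
        """Start building a new next target; it is unfinished until mark_ready()."""
        with self._lock:
            self._next = set()
            self._next_ready = False

    def add_channel(self, channel: str) -> None:
        with self._lock:
            if self._next is None:
                raise RuntimeError("no next target in progress")
            self._next.add(channel)

    def mark_ready(self) -> None:
        """Declare the next target complete and safe to promote."""
        with self._lock:
            if self._next is None:
                raise RuntimeError("no next target in progress")
            self._next_ready = True

    def try_promote(self) -> bool:
        """Single entry point for all observers; unfinished targets are rejected."""
        with self._lock:
            if self._next is None or not self._next_ready:
                return False
            self._current, self._next = self._next, None
            self._next_ready = False
            return True

    def current(self) -> Set[str]:
        with self._lock:
            return set(self._current)
```

Because promotion happens in one place and checks the readiness flag under the lock, a second observer calling `try_promote()` early simply gets `False` and the current target stays unchanged.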

6 participants