Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: flush all failed after milvus already gets back from many chaos test #33540

Closed
1 task done
zhuwenxing opened this issue Jun 3, 2024 · 13 comments
Closed
1 task done
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240531-2c7bb0b8-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): all   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.102.75', 'port': 19530} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : True  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['Checker__bEn31jmW', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-06-02T23:26:00.847Z] -------------

[2024-06-02T23:26:00.847Z] <name>: Checker__bEn31jmW

[2024-06-02T23:26:00.847Z] <description>: 

[2024-06-02T23:26:00.847Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}......  (api_request.py:37)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:16:32 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:105)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Checker__bEn31jmW, flusht_ts: 450198291857539073)>, <Time:{'RPC start': '2024-06-02 23:16:32.904152', 'RPC error': '2024-06-02 23:19:33.021434'}> (decorators.py:139)

[2024-06-02T23:26:00.847Z] [2024-06-02 23:19:33 - ERROR - ci_test]: Traceback (most recent call last):

[2024-06-02T23:26:00.847Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-06-02T23:26:00.847Z]     res = func(*args, **_kwargs)

[2024-06-02T23:26:00.847Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-06-02T23:26:00.847Z]     return func(*arg, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-06-02T23:26:00.848Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-06-02T23:26:00.848Z]     raise e from e

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-06-02T23:26:00.848Z]     return func(*args, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-06-02T23:26:00.848Z]     return func(self, *args, **kwargs)

[2024-06-02T23:26:00.848Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 106, in handler

[2024-06-02T23:26:00.848Z]     raise MilvusException(

[2024-06-02T23:26:00.848Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: Checker__bEn31jmW, flusht_ts: 450198291857539073)>

[2024-06-02T23:26:00.848Z]  (api_request.py:45)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/14576/pipeline
log:
artifacts-etcd-followers-pod-failure-14576-server-logs.tar.gz


[2024-06-02T23:14:37.219Z] + kubectl get pods -o wide

[2024-06-02T23:14:37.221Z] + grep etcd-followers-pod-failure-14576

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-0                                1/1     Running       2 (17m ago)        44m     10.104.17.97    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-1                                1/1     Running       0                  44m     10.104.27.34    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-2                                1/1     Running       0                  44m     10.104.25.223   4am-node30   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-0                          2/2     Running       2 (43m ago)        44m     10.104.17.88    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-1                          2/2     Running       2 (43m ago)        44m     10.104.30.91    4am-node38   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-2                          2/2     Running       2 (43m ago)        44m     10.104.19.101   4am-node28   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-kafka-exporter-86cb4f8b9bkb5sz   1/1     Running       5 (42m ago)        44m     10.104.4.33     4am-node11   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datacoord-58cd8d5fwjxcz   1/1     Running       4 (42m ago)        44m     10.104.14.77    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datanode-c57d5b4c-chk52   1/1     Running       3 (43m ago)        44m     10.104.1.46     4am-node10   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-datanode-c57d5b4c-xct5m   1/1     Running       4 (42m ago)        44m     10.104.14.76    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexcoord-6d86598mltjj   1/1     Running       0                  44m     10.104.14.75    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bj5sw2   1/1     Running       1 (42m ago)        44m     10.104.13.104   4am-node16   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bn69qw   1/1     Running       4 (42m ago)        44m     10.104.20.80    4am-node22   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-indexnode-847c4b9bpz7lj   1/1     Running       4 (43m ago)        44m     10.104.18.248   4am-node25   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-proxy-74fdd457f6-8dpff    1/1     Running       4 (43m ago)        44m     10.104.17.66    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querycoord-68b8975vskfx   1/1     Running       4 (42m ago)        44m     10.104.14.74    4am-node18   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c449456f9b4   1/1     Running       4 (42m ago)        44m     10.104.23.7     4am-node27   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c44945f72hc   1/1     Running       4 (42m ago)        44m     10.104.34.118   4am-node37   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-querynode-57c44945s28nh   1/1     Running       4 (42m ago)        44m     10.104.15.167   4am-node20   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-milvus-rootcoord-7466f8c4rj92v   1/1     Running       4 (42m ago)        44m     10.104.26.125   4am-node32   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-0                          1/1     Running       0                  44m     10.104.27.30    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-1                          1/1     Running       0                  44m     10.104.17.98    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-2                          1/1     Running       0                  44m     10.104.26.131   4am-node32   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-minio-3                          1/1     Running       0                  44m     10.104.19.102   4am-node28   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-0                      1/1     Running       0                  44m     10.104.27.32    4am-node31   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-1                      1/1     Running       0                  44m     10.104.17.99    4am-node23   <none>           <none>

[2024-06-02T23:14:37.476Z] etcd-followers-pod-failure-14576-zookeeper-2                      1/1     Running       0                  44m     10.104.26.132   4am-node32   <none>           <none>

Anything else?

almost all chaos. test for master image all failed due to flush timeout
image

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 3, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 3, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Jun 3, 2024
@zhuwenxing
Copy link
Contributor Author

/assign @XuanYang-cn
PTAL

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 3, 2024
@yanliang567 yanliang567 removed their assignment Jun 3, 2024
@XuanYang-cn
Copy link
Contributor

/assign @xiaocai2333

@XuanYang-cn
Copy link
Contributor

image

@XuanYang-cn
Copy link
Contributor

Flush failed for L0 segments added back into metacache, those L0 segment will never get flushed and block flush forever

@xiaocai2333
Copy link
Contributor

fixed, please verify.
/assign @zhuwenxing

@XuanYang-cn
Copy link
Contributor

/unassign

@zhuwenxing
Copy link
Contributor Author

zhuwenxing commented Jun 13, 2024

it is still reproduced in master-20240612-9d4535ce-amd64

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/14838/pipeline
log:
artifacts-querynode-pod-kill-14838-server-logs.tar.gz

This is one of many failed tests.

@zhuwenxing
Copy link
Contributor Author

/unassign

sre-ci-robot pushed a commit that referenced this issue Jun 17, 2024
issue: #33540 
1. gorwing L0 segments is invisible to datacoord.
2. flushed L0 segments need to clean by datacoord.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@xiaocai2333
Copy link
Contributor

/assign @zhuwenxing
please verify again.

@zhuwenxing
Copy link
Contributor Author

image tag: master-20240619-7b9462c0-amd64

The current situation has seen some improvement, but the issue has not been fully resolved. Previously, almost all collections would fail to flush, and now some collections still fail to flush.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/15373/pipeline
log:
artifacts-proxy-pod-failure-15373-server-logs.tar.gz

[2024-06-19T19:52:25.998Z] [2024-06-19 19:49:21 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-06-19T19:52:25.998Z] -------------

[2024-06-19T19:52:25.998Z] <name>: InsertChecker__ia1cSkY2

[2024-06-19T19:52:25.999Z] <description>: 

[2024-06-19T19:52:25.999Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT......  (api_request.py:37)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:49:21 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:105)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)>, <Time:{'RPC start': '2024-06-19 19:49:21.823172', 'RPC error': '2024-06-19 19:52:21.943708'}> (decorators.py:139)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - ci_test]: Traceback (most recent call last):

[2024-06-19T19:52:25.999Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-06-19T19:52:25.999Z]     res = func(*args, **_kwargs)

[2024-06-19T19:52:25.999Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-06-19T19:52:25.999Z]     return func(*arg, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-06-19T19:52:25.999Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-06-19T19:52:25.999Z]     raise e from e

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-06-19T19:52:25.999Z]     return func(*args, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-06-19T19:52:25.999Z]     return func(self, *args, **kwargs)

[2024-06-19T19:52:25.999Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 106, in handler

[2024-06-19T19:52:25.999Z]     raise MilvusException(

[2024-06-19T19:52:25.999Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)>

[2024-06-19T19:52:25.999Z]  (api_request.py:45)

[2024-06-19T19:52:25.999Z] [2024-06-19 19:52:21 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__ia1cSkY2, flusht_ts: 450580070212829196)> (api_request.py:46)

[2024-06-19T19:52:25.999Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2024-06-19T19:52:25.999Z] =========================== short test summary info ============================

[2024-06-19T19:52:25.999Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__CVwOkPk7] - AssertionError: Response of API flush expect True, but got False

[2024-06-19T19:52:25.999Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[InsertChecker__ia1cSkY2] - AssertionError: Response of API flush expect True, but got False

@zhuwenxing
Copy link
Contributor Author

/unassign

@xiaocai2333
PTAL

@xiaocai2333
Copy link
Contributor

If the segment has just been created and doesn't have a stats log yet, it won't be include in the SyncSegments view. Fix this immediately.

sre-ci-robot pushed a commit that referenced this issue Jun 25, 2024
issue: #33540

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
@zhuwenxing
Copy link
Contributor Author

Not reproduced in master-20240625-506a9152-amd64

xiaocai2333 added a commit to xiaocai2333/milvus that referenced this issue Jun 26, 2024
issue: milvus-io#33540

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
xiaocai2333 added a commit to xiaocai2333/milvus that referenced this issue Jun 26, 2024
…#33829)

issue: milvus-io#33540
1. gorwing L0 segments is invisible to datacoord.
2. flushed L0 segments need to clean by datacoord.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
xiaocai2333 added a commit to xiaocai2333/milvus that referenced this issue Jun 26, 2024
issue: milvus-io#33540

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Jul 2, 2024
issue: milvus-io#33540

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Jul 2, 2024
…#33829)

issue: milvus-io#33540 
1. gorwing L0 segments is invisible to datacoord.
2. flushed L0 segments need to clean by datacoord.

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Jul 2, 2024
issue: milvus-io#33540

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

4 participants