Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: flush timeout after milvus recovered from pulsar pod kill chaos test #34255

Closed
1 task done
zhuwenxing opened this issue Jun 28, 2024 · 13 comments
Closed
1 task done
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240626-e43c5580-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:44)

[2024-06-27T17:59:07.747Z] ------------------------------ Captured log call -------------------------------

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.5.29', 'port': 19530} (api_request.py:62)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_response) : True  (api_request.py:37)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['HybridSearchChecker__4dXdCfE3', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-06-27T17:59:07.747Z] -------------

[2024-06-27T17:59:07.747Z] <name>: HybridSearchChecker__4dXdCfE3

[2024-06-27T17:59:07.747Z] <description>: 

[2024-06-27T17:59:07.747Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType......  (api_request.py:37)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:55:42 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:58:43 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:105)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:58:43 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: HybridSearchChecker__4dXdCfE3, flusht_ts: 450759476555546630)>, <Time:{'RPC start': '2024-06-27 17:55:42.678740', 'RPC error': '2024-06-27 17:58:43.140795'}> (decorators.py:139)

[2024-06-27T17:59:07.747Z] [2024-06-27 17:58:43 - ERROR - ci_test]: Traceback (most recent call last):

[2024-06-27T17:59:07.747Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-06-27T17:59:07.747Z]     res = func(*args, **_kwargs)

[2024-06-27T17:59:07.747Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-06-27T17:59:07.747Z]     return func(*arg, **kwargs)

[2024-06-27T17:59:07.747Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-06-27T17:59:07.748Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-06-27T17:59:07.748Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-06-27T17:59:07.748Z]     raise e from e

[2024-06-27T17:59:07.748Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-06-27T17:59:07.748Z]     return func(*args, **kwargs)

[2024-06-27T17:59:07.748Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-06-27T17:59:07.748Z]     return func(self, *args, **kwargs)

[2024-06-27T17:59:07.748Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 106, in handler

[2024-06-27T17:59:07.748Z]     raise MilvusException(

[2024-06-27T17:59:07.748Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: HybridSearchChecker__4dXdCfE3, flusht_ts: 450759476555546630)>

[2024-06-27T17:59:07.748Z]  (api_request.py:45)

[2024-06-27T17:59:07.748Z] [2024-06-27 17:58:43 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: HybridSearchChecker__4dXdCfE3, flusht_ts: 450759476555546630)> (api_request.py:46)

[2024-06-27T17:59:07.748Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2024-06-27T17:59:07.748Z] =========================== short test summary info ============================

[2024-06-27T17:59:07.748Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[HybridSearchChecker__4dXdCfE3] - AssertionError: Response of API flush expect True, but got False

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/15599/pipeline

log:
artifacts-pulsar-pod-kill-15599-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 28, 2024
@yanliang567
Copy link
Contributor

/assign @weiliu1031
/unassign

@yanliang567 yanliang567 added this to the 2.5.0 milestone Jul 1, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 1, 2024
@xiaofan-luan
Copy link
Contributor

/assign @bigsheeper
please help on it

@XuanYang-cn
Copy link
Contributor

segment 450758977781515919 's data is lost from pulsar.

after DN recovering and seek to 17:31:42, the msg in between 17:31:42 are lost cpTime=2024/06/27 17:46:30.521 are lost.

so the flush is timeout.

[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:53] ["datanode seek successfully when register to msgDispatcher"] [nodeID=13] [collectionID=450758977781515635] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [msgID="\u0008\n\u0010\ufffd\u0006\u0018\u0000 \u0000"] [tsTime=2024/06/27 17:31:42.121 +00:00] [tsLag=14m39.081598088s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/data_sync_service.go:93] ["dataSyncService starting flow graph"] [collectionID=450758977781515635] [vChanName=by-dev-rootcoord-dml_10_450758977781515635v1]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:379] ["Stop timer for ToWatch operation succeeded"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [opID=450758977782726625] [timeout=5m0s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:194] ["Success to watch"] [opID=450758977782726625] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [State=WatchSuccess]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/dispatcher.go:149] ["add new target"] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [isMain=true]
[2024/06/27 17:46:21.373 +00:00] [INFO] [msgdispatcher/manager.go:218] ["merge done"] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.784 +00:00] [INFO] [datanode/services.go:381] ["DataNode receives CheckChannelOperationProgress"] [traceID=c4cc98513456be9c1824b08d196f446a] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [operation=ToWatch]
[2024/06/27 17:47:00.720 +00:00] [DEBUG] [writebuffer/write_buffer.go:291] ["checkpoint from latest consumed msg"] [collectionID=450758977781515635] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTimestamp=450759331812737025]
[2024/06/27 17:47:02.392 +00:00] [DEBUG] [datanode/flow_graph_time_tick_node.go:122] ["UpdateChannelCheckpoint success"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTs=450759331812737025] [cpTime=2024/06/27 17:46:30.521 +00:00]

 5977 [2024/06/27 17:55:51.734 +00:00] [INFO] [datacoord/services.go:1241] ["DataCoord receive GetFlushState request, Flushed is false"] [traceID=9cc685a95c1cfb1ffd54f70ba951b85b] [collection=450758977781515635] [flushTs=2024/06/27 17:55:42.671 +00:00] [unflushed="[450758977781515919]"] [len=1]

@bigsheeper
Copy link
Contributor

segment 450758977781515919 's data is lost from pulsar.

after DN recovering and seek to 17:31:42, the msg in between 17:31:42 are lost cpTime=2024/06/27 17:46:30.521 are lost.

so the flush is timeout.

[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:53] ["datanode seek successfully when register to msgDispatcher"] [nodeID=13] [collectionID=450758977781515635] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [msgID="\u0008\n\u0010\ufffd\u0006\u0018\u0000 \u0000"] [tsTime=2024/06/27 17:31:42.121 +00:00] [tsLag=14m39.081598088s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/data_sync_service.go:93] ["dataSyncService starting flow graph"] [collectionID=450758977781515635] [vChanName=by-dev-rootcoord-dml_10_450758977781515635v1]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:379] ["Stop timer for ToWatch operation succeeded"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [opID=450758977782726625] [timeout=5m0s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:194] ["Success to watch"] [opID=450758977782726625] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [State=WatchSuccess]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/dispatcher.go:149] ["add new target"] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [isMain=true]
[2024/06/27 17:46:21.373 +00:00] [INFO] [msgdispatcher/manager.go:218] ["merge done"] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.784 +00:00] [INFO] [datanode/services.go:381] ["DataNode receives CheckChannelOperationProgress"] [traceID=c4cc98513456be9c1824b08d196f446a] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [operation=ToWatch]
[2024/06/27 17:47:00.720 +00:00] [DEBUG] [writebuffer/write_buffer.go:291] ["checkpoint from latest consumed msg"] [collectionID=450758977781515635] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTimestamp=450759331812737025]
[2024/06/27 17:47:02.392 +00:00] [DEBUG] [datanode/flow_graph_time_tick_node.go:122] ["UpdateChannelCheckpoint success"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTs=450759331812737025] [cpTime=2024/06/27 17:46:30.521 +00:00]
 5977 [2024/06/27 17:55:51.734 +00:00] [INFO] [datacoord/services.go:1241] ["DataCoord receive GetFlushState request, Flushed is false"] [traceID=9cc685a95c1cfb1ffd54f70ba951b85b] [collection=450758977781515635] [flushTs=2024/06/27 17:55:42.671 +00:00] [unflushed="[450758977781515919]"] [len=1]

Yes, after the datanode restart, the flowgraph did not receive insert data of segment 450758977781515919. There are several possible reasons for this:

  1. The channel checkpoint was incorrectly advanced past segment 450758977781515919 (without flush/sync).
  2. The insert data consumed was incorrectly skipped.
  3. Data was lost in Pulsar (although data was present in Pulsar before the datanode restart).

@zhuwenxing
Copy link
Contributor Author

it is not a stable reproduced issue. so it is a bit hard to verify and reproduce

@bigsheeper
Copy link
Contributor

it is not a stable reproduced issue. so it is a bit hard to verify and reproduce

Try to add some key logs; we'll look into it if it happens again.

sre-ci-robot pushed a commit that referenced this issue Jul 6, 2024
…logs (#34438)

Check if the segment exists during FlushSegments and add some key logs
in write path.

issue: #34255

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jul 8, 2024
…logs (#34438) (#34472)

Check if the segment exists during FlushSegments and add some key logs
in write path.

issue: #34255

pr: #34438

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@zhuwenxing
Copy link
Contributor Author

There is a reproduced case in 2.4-20240708-f6cd8416-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/14352/pipeline

log:
artifacts-pulsar-pod-kill-14352-server-logs.tar.gz


[2024-07-08T21:33:29.882Z] <name>: SearchChecker__GGsK0j2n

[2024-07-08T21:33:29.882Z] <description>: 

[2024-07-08T21:33:29.882Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT......  (api_request.py:37)

[2024-07-08T21:33:29.882Z] [2024-07-08 21:30:21 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:106)

[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)>, <Time:{'RPC start': '2024-07-08 21:30:21.415697', 'RPC error': '2024-07-08 21:33:21.580120'}> (decorators.py:146)

[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - ci_test]: Traceback (most recent call last):

[2024-07-08T21:33:29.882Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-07-08T21:33:29.882Z]     res = func(*args, **_kwargs)

[2024-07-08T21:33:29.882Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-07-08T21:33:29.882Z]     return func(*arg, **kwargs)

[2024-07-08T21:33:29.882Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush

[2024-07-08T21:33:29.882Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-07-08T21:33:29.882Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 147, in handler

[2024-07-08T21:33:29.882Z]     raise e from e

[2024-07-08T21:33:29.882Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 143, in handler

[2024-07-08T21:33:29.882Z]     return func(*args, **kwargs)

[2024-07-08T21:33:29.882Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 182, in handler

[2024-07-08T21:33:29.882Z]     return func(self, *args, **kwargs)

[2024-07-08T21:33:29.882Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 107, in handler

[2024-07-08T21:33:29.882Z]     raise MilvusException(

[2024-07-08T21:33:29.882Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)>

[2024-07-08T21:33:29.882Z]  (api_request.py:45)

[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)> (api_request.py:46)

@zhuwenxing
Copy link
Contributor Author

zhuwenxing commented Jul 10, 2024

@bigsheeper
Copy link
Contributor

The cause of this issue is different; it's due to tt delay:
image
image

@bigsheeper
Copy link
Contributor

reproduced in master-20240709-eeb03a0d-amd64 datanode chaos test

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/15588/pipeline

log:artifacts-datanode-pod-kill-15588-server-logs.tar.gz

dispatcher merged by mistake, the insert data consumed was incorrectly skipped:
image

sre-ci-robot pushed a commit that referenced this issue Jul 10, 2024
When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: #34255

pr: #34562

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jul 10, 2024
When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: #34255

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@bigsheeper
Copy link
Contributor

/assign @zhuwenxing
should be fixed

@bigsheeper
Copy link
Contributor

/unassign

bigsheeper added a commit to bigsheeper/milvus that referenced this issue Jul 12, 2024
When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: milvus-io#34255

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
bigsheeper added a commit to bigsheeper/milvus that referenced this issue Jul 12, 2024
When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: milvus-io#34255

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
bigsheeper added a commit to bigsheeper/milvus that referenced this issue Jul 12, 2024
…us-io#34563)

When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: milvus-io#34255

pr: milvus-io#34562

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jul 12, 2024
When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: #34255

pr: #34562

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
bigsheeper added a commit to bigsheeper/milvus that referenced this issue Jul 12, 2024
…us-io#34563)

When the main dispatcher has not yet consumed data, curTs is 0. During
this time, merging dispatchers should not be allowed; otherwise, the
data of the solo dispatcher will be skipped.

issue: milvus-io#34255

pr: milvus-io#34562

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jul 13, 2024
This PR cherry-picks the following commits that fix bugs:

- #34563
- #34230
- #34071
- #34302
- #34566

issue: #34255,
#34221,
#34068,
#34247,

pr: #34563,
#34230,
#34071,
#34302,
#34566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: SimFG <bang.fu@zilliz.com>
@zhuwenxing
Copy link
Contributor Author

verified and not reproduced in master-20240713-2c462d38-amd64

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants