[Bug]: Standalone restarted once during test #27145

Closed
1 task done
zhuwenxing opened this issue Sep 15, 2023 · 17 comments · Fixed by #27167

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.0-20230915-a42bc90c
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

Some tests failed because the standalone pod restarted.

Expected Behavior

[screenshot attached]

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/909/pipeline

log:

artifacts-rocksmq-standalone-reinstall-909-pytest-logs.tar.gz
artifacts-rocksmq-standalone-reinstall-909-server-logs (1).tar.gz

Anything else?

No response

@zhuwenxing added the kind/bug and needs-triage labels on Sep 15, 2023
@zhuwenxing added the priority/critical-urgent and severity/critical labels on Sep 15, 2023
@yanliang567
Contributor

I don't see any errors in the logs, except the following one:

[2023/09/15 13:31:39.712 +00:00] [INFO] [datacoord/index_builder.go:251] ["index builder peek client error, there is no available"]
[2023/09/15 13:31:39.712 +00:00] [INFO] [datacoord/index_builder.go:188] ["there is no IndexNode available or etcd is not serviceable, wait a minute..."]

@yanliang567
Contributor

/assign @czs007
/unassign

@sre-ci-robot assigned czs007 and unassigned yanliang567 on Sep 15, 2023
@yanliang567 added this to the 2.3.1 milestone on Sep 15, 2023
@yanliang567 added the triage/accepted label and removed the needs-triage label on Sep 15, 2023
@xiaofan-luan
Contributor

I don't see a complete log.

What is the test case?

@xiaofan-luan
Contributor

/assign @yiwangdr

@zhuwenxing
Contributor Author

I don't see a complete log.
What is the test case?

  1. From the failed job, we can get the pod name. The full log can be retrieved from Loki.

  2. This is a complex test scenario, with 5 test cases running.
    [screenshot attached]

@zhuwenxing
Contributor Author

[2023/09/15 16:50:01.286 +00:00] [INFO] [datanode/channel_meta.go:626] ["updated segment memorySize"] [segmentID=444280681155636934] [memorySize=0]
[2023/09/15 16:50:01.289 +00:00] [INFO] [datacoord/meta.go:615] ["meta update: update flush segments info - update flush segments info successfully"] [segmentID=444280681155837309]
[2023/09/15 16:50:01.289 +00:00] [INFO] [datacoord/services.go:497] ["flush segment with meta"] [traceID=6ce0561e94b6fcd07e6ea42be4808f49] [nodeID=1] [collectionID=444280681154030140] [segmentID=444280681155837309] [meta="[{\"fieldID\":101,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/101/444280681153636023\",\"log_size\":6004}]},{\"fieldID\":102,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/102/444280681153636023\",\"log_size\":31521}]},{\"fieldID\":103,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/103/444280681153636023\",\"log_size\":67962}]},{\"fieldID\":104,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/104/444280681153636023\",\"log_size\":768516}]},{\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/0/444280681153636023\",\"log_size\":12008}]},{\"fieldID\":1,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/1/444280681153636023\",\"log_size\":12008}]},{\"fieldID\":100,\"binlogs\":[{\"entries_num\":1501,\"timestamp_from\":444280759985111042,\"timestamp_to\":444280759985111042,\"log_path\":\"file/insert_log/444280681154030140/444280681154030141/444280681155837309/100/444280681153636023\",\"log_size\":12008}]}]"]
[2023/09/15 16:50:01.289 +00:00] [INFO] [datanode/segment.go:227] ["evictHistoryInsertBuffer done"] [segmentID=444280681155837309] [ts=2023/09/15 16:50:00.734 +00:00] [channel=by-dev-rootcoord-dml_6_444280681154030140v1]
[2023/09/15 16:50:01.289 +00:00] [INFO] [datanode/segment.go:250] ["evictHistoryDeleteBuffer done"] [segmentID=444280681155837309] [ts=2023/09/15 16:50:00.734 +00:00] [channel=by-dev-rootcoord-dml_6_444280681154030140v1]
[2023/09/15 16:50:01.289 +00:00] [INFO] [datanode/channel_meta.go:626] ["updated segment memorySize"] [segmentID=444280681155837309] [memorySize=0]
[2023/09/15 16:50:01.289 +00:00] [FATAL] [datanode/flow_graph_insert_buffer_node.go:498] ["insertBufferNode failed to flushBufferData"] [segmentID=444280681153630609] [flushed=false] [dropped=false] [auto=false] [position="channel_name:\"by-dev-rootcoord-dml_14_444280681153830019v0\" msgID:\"\\2439X\\236\\350f*\\006\" msgGroup:\"datanode-1-by-dev-rootcoord-dml_14_444280681153830019v0-true\" timestamp:444280744898461701 "] [channel=by-dev-rootcoord-dml_14_444280681153830019v0] [segmentID=444280681153630609] [error="attempt #0: no such segment 444280681153630609 in the channel: attempt #1: no such segment 444280681153630609 in the channel: attempt #2: no such segment 444280681153630609 in the channel: attempt #3: no such segment 444280681153630609 in the channel: attempt #4: no such segment 444280681153630609 in the channel: attempt #5: no such segment 444280681153630609 in the channel: attempt #6: no such segment 444280681153630609 in the channel: attempt #7: no such segment 444280681153630609 in the channel: attempt #8: no such segment 444280681153630609 in the channel: attempt #9: no such segment 444280681153630609 in the channel: attempt #10: no such segment 444280681153630609 in the channel: attempt #11: no such segment 444280681153630609 in the channel: attempt #12: no such segment 444280681153630609 in the channel: attempt #13: no such segment 444280681153630609 in the channel: attempt #14: no such segment 444280681153630609 in the channel: attempt #15: no such segment 444280681153630609 in the channel: attempt #16: no such segment 444280681153630609 in the channel: attempt #17: no such segment 444280681153630609 in the channel: attempt #18: no such segment 444280681153630609 in the channel: attempt #19: no such segment 444280681153630609 in the channel"] [stack="github.com/milvus-io/milvus/internal/datanode.(*insertBufferNode).Sync\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_insert_buffer_node.go:498\ngithub.com/milvus-io/milvus/internal/datanode.(*insertBufferNode).Operate\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_insert_buffer_node.go:205\ngithub.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work\n\t/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:127"]

@yiwangdr
Contributor

yiwangdr commented Sep 15, 2023

Timeline:

[2023/09/15 16:50:00.246 +00:00] [DEBUG] [querynodev2/services.go:886] ["start do query segments"] [traceID=561d735b354061e11b957c3ea06d7352] [msgID=444280759919575232] [collectionID=444280681153830019] [channel=by-dev-rootcoord-dml_14_444280681153830019v0] [scope=Historical] [fromShardLeader=true] [segmentIDs="[444280681155232677,444280681155031305,444280681155231605,444280681155232083,444280681155433120,444280681155433764,444280681155031012,444280681155031143,444280681155231896,444280681153630609,444280681155433497,444280681155031272,444280681155031425,444280681155232495,444280681155232426,444280681155433036,444280681155030896,444280681155231818,444280681155232309,444280681155432790,444280681155433362]"]

[2023/09/15 16:50:00.246 +00:00] [DEBUG] [querynodev2/services.go:694] ["start to search segments on worker"] [traceID=4ab4e35c58702b42126b926b27d33c9b] [msgID=444280759919575256] [collectionID=444280681153830019] [channel=by-dev-rootcoord-dml_14_444280681153830019v0] [scope=Historical] [segmentIDs="[444280681155232677,444280681155031305,444280681155231605,444280681155232083,444280681155433120,444280681155433764,444280681155031012,444280681155031143,444280681155231896,444280681153630609,444280681155433497,444280681155031272,444280681155031425,444280681155232495,444280681155232426,444280681155433036,444280681155030896,444280681155231818,444280681155232309,444280681155432790,444280681155433362]"]

[2023/09/15 16:50:00.246 +00:00] [DEBUG] [segments/segment.go:420] ["retrieve segment done"] [collectionID=444280681153830019] [partitionID=444280681153830020] [segmentID=444280681153630609] [msgID=444280759919575232] [segmentType=Sealed] [resultNum=12]

[2023/09/15 16:50:00.247 +00:00] [DEBUG] [segments/segment.go:362] ["search segment done"] [traceID=4ab4e35c58702b42126b926b27d33c9b] [collectionID=444280681153830019] [segmentID=444280681153630609] [segmentType=Sealed] [withIndex=true]

[2023/09/15 16:50:00.247 +00:00] [DEBUG] [querynodev2/services.go:725] [tr/searchSegments] [traceID=4ab4e35c58702b42126b926b27d33c9b] [msg="search segments done, channel = by-dev-rootcoord-dml_14_444280681153830019v0, segmentIDs = [444280681155232677 444280681155031305 444280681155231605 444280681155232083 444280681155433120 444280681155433764 444280681155031012 444280681155031143 444280681155231896 444280681153630609 444280681155433497 444280681155031272 444280681155031425 444280681155232495 444280681155232426 444280681155433036 444280681155030896 444280681155231818 444280681155232309 444280681155432790 444280681155433362]"] [duration=1.316608ms]

[2023/09/15 16:50:01.289 +00:00] [FATAL] [datanode/flow_graph_insert_buffer_node.go:498] ["insertBufferNode failed to flushBufferData"] [segmentID=444280681153630609] [flushed=false] [dropped=false] [auto=false] [position="channel_name:\"by-dev-rootcoord-dml_14_444280681153830019v0\" msgID:\"\\2439X\\236\\350f*\\006\" msgGroup:\"datanode-1-by-dev-rootcoord-dml_14_444280681153830019v0-true\" timestamp:444280744898461701 "] [channel=by-dev-rootcoord-dml_14_444280681153830019v0] [segmentID=444280681153630609] [error="attempt #0: no such segment 444280681153630609 in the channel: attempt #1: no such segment 444280681153630609 in the channel: attempt #2: no such segment 444280681153630609 in the channel: attempt #3: no such segment 444280681153630609 in the channel: attempt #4: no such segment 444280681153630609 in the channel: attempt #5: no such segment 444280681153630609 in the channel: attempt #6: no such segment 444280681153630609 in the channel: attempt #7: no such segment 444280681153630609 in the channel: attempt #8: no such segment 444280681153630609 in the channel: attempt #9: no such segment 444280681153630609 in the channel: attempt #10: no such segment 444280681153630609 in the channel: attempt #11: no such segment 444280681153630609 in the channel: attempt #12: no such segment 444280681153630609 in the channel: attempt #13: no such segment 444280681153630609 in the channel: attempt #14: no such segment 444280681153630609 in the channel: attempt #15: no such segment 444280681153630609 in the channel: attempt #16: no such segment 444280681153630609 in the channel: attempt #17: no such segment 444280681153630609 in the channel: attempt #18: no such segment 444280681153630609 in the channel: attempt #19: no such segment 444280681153630609 in the channel"] [stack="github.com/milvus-io/milvus/internal/datanode.(*insertBufferNode).Sync\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_insert_buffer_node.go:498\ngithub.com/milvus-io/milvus/internal/datanode.(*insertBufferNode).Operate\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_insert_buffer_node.go:205\ngithub.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work\n\t/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:127"]

Segment 444280681153630609 was loaded and searched successfully on the querynode, so the segment itself is not corrupted.
One second later, the datanode crashed because of it.
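As a side note, the colon-chained "attempt #0 … attempt #19" text in the FATAL log above is what a bounded retry loop produces when every attempt fails. A minimal, hypothetical sketch of such a helper (not the actual Milvus retry utility):

```go
package main

import (
	"errors"
	"fmt"
)

// retryN calls fn up to maxAttempts times; every failure is appended to a
// single colon-chained error so the final log line carries the full history,
// e.g. "attempt #0: ...: attempt #1: ...".
func retryN(maxAttempts int, fn func() error) error {
	var combined error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		if combined == nil {
			combined = fmt.Errorf("attempt #%d: %w", attempt, err)
		} else {
			combined = fmt.Errorf("%s: attempt #%d: %w", combined, attempt, err)
		}
	}
	return combined
}

func main() {
	errNoSuchSegment := errors.New("no such segment 444280681153630609 in the channel")
	// Every attempt fails because compaction already dropped the segment,
	// so the caller receives the accumulated 20-attempt error seen above.
	fmt.Println(retryN(20, func() error { return errNoSuchSegment }))
}
```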

@yiwangdr
Contributor

yiwangdr commented Sep 15, 2023

Note that

[2023/09/15 16:50:00.247 +00:00] ... ["search segment done"] ... [segmentID=444280681153630609] [segmentType=Sealed]...

^ this log is 1 second before the crash.

So the datanode was trying to flushBufferData for a sealed segment; if I understand correctly, it's for deletion.

@yiwangdr
Contributor

Complete log from the standalone, FYI:

milvus.log

@yiwangdr
Contributor

The datanode failed to fetch metadata for a sealed segment. The error "no such segment 444280681153630609 in the channel" comes from the channel meta check.

This issue is probably caused by #27063, which introduced changes to both flush and channel meta.
@bigsheeper fyi

More investigation is needed.
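For illustration only, a minimal sketch of the kind of channel-meta lookup racing with a compaction drop that could produce this error; the type and function names here are assumptions, not the actual Milvus code:

```go
package main

import "fmt"

// channelMeta stands in for the per-channel segment bookkeeping on the datanode.
type channelMeta struct {
	segments map[int64]struct{} // segmentID -> tracked in this channel
}

// getSegment is the meta check the flush path relies on; once compaction has
// removed the segment from the map, it returns the error seen in the FATAL log.
func (c *channelMeta) getSegment(segID int64) error {
	if _, ok := c.segments[segID]; !ok {
		return fmt.Errorf("no such segment %d in the channel", segID)
	}
	return nil
}

// dropCompacted models compaction removing the old segments from channel meta.
func (c *channelMeta) dropCompacted(segIDs ...int64) {
	for _, id := range segIDs {
		delete(c.segments, id)
	}
}

func main() {
	meta := &channelMeta{segments: map[int64]struct{}{444280681153630609: {}}}
	meta.dropCompacted(444280681153630609) // compaction wins the race with flush
	fmt.Println(meta.getSegment(444280681153630609))
}
```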

@bigsheeper
Contributor

Before #27063, a sealed segment was only synced every 10 minutes; after #27063, sealed segments are synced whenever Flush is invoked. I believe #27063 simply amplifies the likelihood of encountering this scenario: concurrent compaction and synchronization.

A possible quick fix:
Ignore the "no such segment" error for compacted segments. @xiaofan-luan @yiwangdr @czs007
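A hedged sketch of what that quick fix could look like; errSegmentNotFound, isCompacted, and handleFlushError are illustrative placeholders, not the merged patch:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errSegmentNotFound is an illustrative stand-in for the "no such segment in
// the channel" error returned by the channel meta check.
var errSegmentNotFound = errors.New("no such segment in the channel")

// isCompacted would consult channel/coordinator meta; stubbed out here.
func isCompacted(segmentID int64) bool { return true }

// handleFlushError decides whether a flush failure should stay fatal.
func handleFlushError(segmentID int64, err error) error {
	if errors.Is(err, errSegmentNotFound) && isCompacted(segmentID) {
		// The segment was dropped by compaction and its data already lives in
		// the compacted segment, so this particular failure can be ignored.
		log.Printf("ignore flush error for compacted segment %d: %v", segmentID, err)
		return nil
	}
	return err // anything else still propagates (and stays fatal upstream)
}

func main() {
	err := handleFlushError(444280681153630609, errSegmentNotFound)
	fmt.Println("flush result:", err) // flush result: <nil>
}
```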

@yiwangdr
Contributor

yiwangdr commented Sep 16, 2023

I tried to revert most of #27063 and the tests passed.
As @bigsheeper pointed out, that doesn't mean #27063 is the root cause; the PR just amplifies an existing issue.

A possible quick fix:
Ignore the "no such segment" error for compacted segments. @xiaofan-luan @yiwangdr @czs007

It prevents false positives, but the downside is that we may introduce true negatives.
@xiaofan-luan @bigsheeper If I understand correctly, we need to guard the meta with locks, which may lead to a performance drop. I'm not sure how big the impact would be; it may need some work to verify and test.
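To make the locking idea concrete, a purely illustrative sketch (not the actual Milvus meta code): the flush holds the channel-meta lock for its entire duration, so compaction cannot drop the segment mid-sync, but the two paths are now serialized, which is where the possible performance drop would come from:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// guardedMeta is an illustrative stand-in for per-channel segment meta.
type guardedMeta struct {
	mu       sync.Mutex
	segments map[int64]bool
}

// flushSegment holds the lock for the whole flush: safe against a concurrent
// compaction drop, but everything else on this channel waits behind it.
func (m *guardedMeta) flushSegment(segID int64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.segments[segID] {
		return fmt.Errorf("no such segment %d in the channel", segID)
	}
	time.Sleep(10 * time.Millisecond) // stand-in for writing binlogs to object storage
	return nil
}

// dropCompacted blocks until any in-flight flush of these segments finishes.
func (m *guardedMeta) dropCompacted(ids ...int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, id := range ids {
		delete(m.segments, id)
	}
}

func main() {
	m := &guardedMeta{segments: map[int64]bool{444280681153630609: true}}
	fmt.Println(m.flushSegment(444280681153630609)) // <nil>: compaction had to wait its turn
}
```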

@yanliang567 @bigsheeper @czs007 Is this PR critical to the 2.3.1 release? If not, I'd prefer to revert the PR to unblock the release.

@xiaofan-luan
Contributor

[2023/09/15 16:49:02.277 +00:00] [DEBUG] [datacoord/meta.go:1110] ["meta update: alter meta store for compaction updates"] ["compact from segments (segments to be updated as dropped)"="[444280681153630609,444280681155031143,444280681155031012,444280681155030896]"] ["new segmentID"=444280681153633878] [binlog=7] ["stats log"=1] ["delta logs"=4] ["compact to segment"=444280681153633878]

@xiaofan-luan
Contributor

At 16:49, the segment was compacted.

I agree that https://github.com//pull/27152 might not be the root cause, but flush is triggered more often than before.

@bigsheeper
Contributor

/assign @zhuwenxing
PR merged, please help to verify.

sre-ci-robot pushed a commit that referenced this issue Nov 29, 2023
We have been checking ErrSegmentNotFound in insert_buffer_node in the datanode;
we should also check it in delete_node.

issue: #27145

pr: #28371

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
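A rough sketch of the behavior this commit describes, with hypothetical names standing in for the real insert_buffer_node/delete_node code:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrSegmentNotFound is an illustrative sentinel, not the real Milvus error value.
var ErrSegmentNotFound = errors.New("segment not found")

// flushDeleteBuffer stands in for the delete_node sync step; here it always
// reports that compaction removed the segment from channel meta.
func flushDeleteBuffer(segID int64) error {
	return fmt.Errorf("no such segment %d in the channel: %w", segID, ErrSegmentNotFound)
}

// syncDeletes applies the same tolerance the commit gives delete_node:
// a not-found segment is skipped instead of crashing the datanode.
func syncDeletes(segID int64) {
	if err := flushDeleteBuffer(segID); err != nil {
		if errors.Is(err, ErrSegmentNotFound) {
			log.Printf("skip compacted/dropped segment %d: %v", segID, err)
			return
		}
		log.Fatalf("delete_node flush failed: %v", err) // still fatal for real errors
	}
}

func main() {
	syncDeletes(444280681153630609)
	fmt.Println("datanode still running")
}
```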