
[Bug]: standalone panic with error panic: segment not found[segment=450570869009695178] during test after reinstallation or upgrade #34018

Closed
zhuwenxing opened this issue Jun 20, 2024 · 5 comments
Labels
- kind/bug: Issues or changes related to a bug
- priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
- severity/critical: Critical; leads to crash, data missing, wrong result, or function totally not working.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.

@zhuwenxing (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240619-7b9462c0-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq (per the attached artifact names)
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024/06/19 10:14:23.995 +00:00] [INFO] [datacoord/task_scheduler.go:218] ["task is processing"] [taskID=450570345243517627] [state=JobStateInProgress]
[2024/06/19 10:14:23.996 +00:00] [DEBUG] [indexnode/indexnode_service.go:436] ["query index jobs result success"] [traceID=5365eaef05470135fb896116f7840443] [clusterID=by-dev] [taskIDs="[450570345243517627]"] [results="[{\"buildID\":450570345243517627,\"state\":2}]"]
[2024/06/19 10:14:23.996 +00:00] [INFO] [datacoord/task_index.go:299] ["query task index info successfully"] [taskID=450570345243517627] ["result state"=InProgress] [failReason=]
[2024/06/19 10:14:23.996 +00:00] [INFO] [datacoord/task_scheduler.go:218] ["task is processing"] [taskID=450570345243517628] [state=JobStateInit]
[2024/06/19 10:14:23.996 +00:00] [INFO] [indexnode/indexnode_service.go:213] ["Get Index Job Stats"] [traceID=46de8eb37c2ebf48f2329b6aa2af7dc2] [unissued=0] [active=1] [slot=0]
[2024/06/19 10:14:23.996 +00:00] [DEBUG] [datacoord/task_scheduler.go:229] ["pick client failed"]
[2024/06/19 10:14:23.996 +00:00] [INFO] [datacoord/task_scheduler.go:198] ["there is no idle indexing node, wait a minute..."]
[2024/06/19 10:14:24.009 +00:00] [DEBUG] [gc/gc_tuner.go:90] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=1527] ["total memory"=17389] ["next GC"=4050] ["new GOGC"=200] [gc-pause=9.588969ms] [gc-pause-end=1718792064004832930]
panic: segment not found[segment=450570869009695178]

goroutine 23545 [running]:
panic({0x5690c60?, 0xc00f4e3bf0?})
	/usr/local/go/src/runtime/panic.go:1017 +0x3ac fp=0xc1152a3540 sp=0xc1152a3490 pc=0x1e27b2c
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*storageV1Serializer).setTaskMeta.func1({0x5f69ea0?, 0xc00f4e3bf0?})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/storage_serializer.go:158 +0x26 fp=0xc1152a3560 sp=0xc1152a3540 pc=0x4620146
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*SyncTask).HandleError(0xc11901c9a0, {0x5f69ea0, 0xc00f4e3bf0})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/task.go:117 +0x7c fp=0xc1152a35f0 sp=0xc1152a3560 pc=0x461b99c
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*syncManager).safeSubmitTask(0xc004c008d0, {0x5fb4630, 0xc11901c9a0}, {0xc003b1c720, 0x1, 0x1})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/sync_manager.go:121 +0x147 fp=0xc1152a3678 sp=0xc1152a35f0 pc=0x461a9a7
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*syncManager).SyncData(0xc0059be380?, {0x5f9bc60?, 0x86b63e0?}, {0x5fb4630?, 0xc11901c9a0?}, {0xc003b1c720?, 0x0?, 0x0?})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/sync_manager.go:109 +0xeb fp=0xc1152a36b8 sp=0xc1152a3678 pc=0x461a7eb
github.com/milvus-io/milvus/internal/datanode/writebuffer.(*writeBufferBase).syncSegments(0xc0059be380, {0x5f9bc60, 0x86b63e0}, {0xc119028188, 0x1, 0x51?})
	/go/src/github.com/milvus-io/milvus/internal/datanode/writebuffer/write_buffer.go:362 +0x3f8 fp=0xc1152a3880 sp=0xc1152a36b8 pc=0x462f9f8
github.com/milvus-io/milvus/internal/datanode/writebuffer.(*writeBufferBase).triggerSync(0xc0059be380)
	/go/src/github.com/milvus-io/milvus/internal/datanode/writebuffer/write_buffer.go:307 +0x19f fp=0xc1152a3960 sp=0xc1152a3880 pc=0x462ec9f
github.com/milvus-io/milvus/internal/datanode/writebuffer.(*l0WriteBuffer).BufferData(0xc005e97b00, {0x86b63e0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0x1facc89?, 0xc00369a3c0)
	/go/src/github.com/milvus-io/milvus/internal/datanode/writebuffer/l0_write_buffer.go:182 +0x385 fp=0xc1152a3a88 sp=0xc1152a3960 pc=0x46288a5
github.com/milvus-io/milvus/internal/datanode/writebuffer.(*bufferManager).BufferData(0xc004631f40, {0xc001744510, 0x2b}, {0x86b63e0, 0x0, 0x0}, {0x0, 0x0, 0x0}, 0xc00369a360, ...)
	/go/src/github.com/milvus-io/milvus/internal/datanode/writebuffer/manager.go:200 +0x274 fp=0xc1152a3b48 sp=0xc1152a3a88 pc=0x462abd4
github.com/milvus-io/milvus/internal/datanode.(*writeNode).Operate(0xc0036a16d0, {0xc1145fb4a0?, 0x1?, 0x1?})
	/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_write_node.go:74 +0x703 fp=0xc1152a3ea0 sp=0xc1152a3b48 pc=0x4690563
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtxManager).workNodeStart(0xc007ca0e10)
	/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:125 +0x362 fp=0xc1152a3fc8 sp=0xc1152a3ea0 pc=0x4672d02
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtxManager).Start.func1()
	/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:78 +0x25 fp=0xc1152a3fe0 sp=0xc1152a3fc8 pc=0x4672965
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc1152a3fe8 sp=0xc1152a3fe0 pc=0x1e61381
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtxManager).Start in goroutine 7127
	/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:78 +0x65

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2431/pipeline
log: artifacts-rocksmq-standalone-reinstall-2431-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2024
@zhuwenxing zhuwenxing modified the milestones: 2.4.lru, 2.5.0 Jun 20, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 20, 2024
@zhuwenxing zhuwenxing changed the title [Bug]: standalone panic with error panic: segment not found[segment=450570869009695178] during test after reinstallation [Bug]: standalone panic with error panic: segment not found[segment=450570869009695178] during test after reinstallation or upgrade Jun 20, 2024
@zhuwenxing (Contributor, Author)

The issue also reproduced after upgrading from v2.3.15 to master-20240619-7b9462c0-amd64.

Error:

[2024/06/19 10:09:59.079 +00:00] [INFO] [metacache/meta_cache.go:299] ["remove dropped segment"] [segmentID=450570799631185937]
[2024/06/19 10:09:59.079 +00:00] [INFO] [datacoord/session_manager.go:256] ["success to sync segments"] [nodeID=3] [planID=0]
[2024/06/19 10:09:59.079 +00:00] [INFO] [datacoord/sync_segments_scheduler.go:145] ["sync segments success"] [collectionID=450570409833736812] [partitionID=450570409833736813] [channelName=by-dev-rootcoord-dml_2_450570409833736812v1] [nodeID=3] [segments="[450570799631185937,450570799631185738,450570799630779282]"]
[2024/06/19 10:09:59.080 +00:00] [INFO] [datanode/services.go:284] ["DataNode receives SyncSegments"] [traceID=a004a4287697927e184d13448af58e44] [planID=0] [nodeID=3] [collectionID=450570409833736906] [partitionID=450570409833736907] [channel=by-dev-rootcoord-dml_8_450570409833736906v0]
[2024/06/19 10:09:59.080 +00:00] [INFO] [util/load_stats.go:38] ["begin to init pk bloom filter"] [segmentID=450570799630779283] [statsBinLogsLen=1]
[2024/06/19 10:09:59.085 +00:00] [INFO] [util/load_stats.go:113] ["Successfully load pk stats"] [segmentID=450570799630779283] [time=5.178694ms] [size=408064]
panic: segment not found[segment=450570799632187110]

goroutine 18885 [running]:
panic({0x5690c60?, 0xc0084c5680?})
	/usr/local/go/src/runtime/panic.go:1017 +0x3ac fp=0xc005261540 sp=0xc005261490 pc=0x1e27b2c
[2024/06/19 10:09:59.085 +00:00] [INFO] [metacache/meta_cache.go:289] ["metacache does not have segment, add it"] [segmentID=450570799630779283]
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*storageV1Serializer).setTaskMeta.func1({0x5f69ea0?, 0xc0084c5680?})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/storage_serializer.go:158 +0x26 fp=0xc005261560 sp=0xc005261540 pc=0x4620146
github.com/milvus-io/milvus/internal/datanode/syncmgr.(*SyncTask).HandleError(0xc0058c8c60, {0x5f69ea0, 0xc0084c5680})
	/go/src/github.com/milvus-io/milvus/internal/datanode/syncmgr/task.go:117 +0x7c fp=0xc0052615f0 sp=0xc005261560 pc=0x461b99c
[2024/06/19 10:09:59.085 +00:00] [INFO] [metacache/meta_cache.go:299] ["remove dropped segment"] [segmentID=450570409833336946]
[2024/06/19 10:09:59.085 +00:00] [INFO] [metacache/meta_cache.go:299] ["remove dropped segment"] [segmentID=450570409833337817]
[2024/06/19 10:09:59.085 +00:00] [INFO] [metacache/meta_cache.go:299] ["remove dropped segment"] [segmentID=450570409834537791]
[2024/06/19 10:09:59.085 +00:00] [INFO] [datacoord/session_manager.go:256] ["success to sync segments"] [nodeID=3] [planID=0]
[2024/06/19 10:09:59.085 +00:00] [INFO] [datacoord/sync_segments_scheduler.go:145] ["sync segments success"] [collectionID=450570409833736906] [partitionID=450570409833736907] [channelName=by-dev-rootcoord-dml_8_450570409833736906v0] [nodeID=3] [segments="[450570799631185466,450570799631185548,450570799630779283]"]

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2433/pipeline
log: artifacts-rocksmq-standalone-upgrade-2433-server-logs.tar.gz

@xiaofan-luan (Contributor)

/assign @congqixia

Please help with this.

congqixia added a commit to congqixia/milvus that referenced this issue Jun 21, 2024
Related to milvus-io#34018

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia (Contributor)

The root cause was that the SyncSegments call removed a newly created growing segment from the datanode metacache, which caused a panic when the datanode later needed to flush that segment.
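The failure mode described above can be sketched in a few lines of Go. This is a simplified illustration, not the actual Milvus code: `MetaCache`, the segment states, and both `SyncSegments*` variants are hypothetical stand-ins for the real datanode metacache and SyncSegments handler. It shows why dropping every segment absent from the coordinator's list also evicts a growing segment the coordinator does not yet know about, and how restricting eviction to non-growing segments avoids the flush-time panic.

```go
package main

import "fmt"

// SegmentState loosely mirrors the datanode's notion of segment lifecycle.
type SegmentState int

const (
	Growing SegmentState = iota // still receiving writes; coordinator may not know it yet
	Flushed
)

// MetaCache is an illustrative stand-in for the datanode metacache.
type MetaCache struct {
	segments map[int64]SegmentState
}

func NewMetaCache() *MetaCache { return &MetaCache{segments: map[int64]SegmentState{}} }

func (c *MetaCache) Add(id int64, st SegmentState) { c.segments[id] = st }

// SyncSegmentsBuggy models the problematic behavior: every segment absent
// from the coordinator's list is dropped, including growing segments.
func (c *MetaCache) SyncSegmentsBuggy(keep map[int64]bool) {
	for id := range c.segments {
		if !keep[id] {
			delete(c.segments, id)
		}
	}
}

// SyncSegmentsFixed models the fix: growing segments are never evicted,
// since the coordinator's view can lag behind the datanode's.
func (c *MetaCache) SyncSegmentsFixed(keep map[int64]bool) {
	for id, st := range c.segments {
		if !keep[id] && st != Growing {
			delete(c.segments, id)
		}
	}
}

// Flush panics when the segment is missing, like the sync path in the report.
func (c *MetaCache) Flush(id int64) {
	if _, ok := c.segments[id]; !ok {
		panic(fmt.Sprintf("segment not found[segment=%d]", id))
	}
}

func main() {
	const growingID = 450570869009695178
	cache := NewMetaCache()
	cache.Add(growingID, Growing)
	cache.Add(100, Flushed)

	// The coordinator's SyncSegments call only lists segment 100.
	cache.SyncSegmentsFixed(map[int64]bool{100: true})

	// With the fix the growing segment survives; the buggy variant
	// would have evicted it, and this Flush would panic.
	cache.Flush(growingID)
	fmt.Println("growing segment retained")
}
```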

congqixia added a commit to congqixia/milvus that referenced this issue Jun 22, 2024
Related to milvus-io#34018

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Jun 24, 2024
Related to #34018

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia (Contributor)

Patch merged. Could you please help verify?
/assign @zhuwenxing

@congqixia congqixia added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 25, 2024
@zhuwenxing (Contributor, Author)

Not reproduced

xiaocai2333 pushed a commit to xiaocai2333/milvus that referenced this issue Jun 26, 2024
yellow-shine pushed a commit to yellow-shine/milvus that referenced this issue Jul 2, 2024