
[Bug]: Querynode terminated with log: failed to Deserialize index, cardinal inner error #30857

Open
ThreadDao opened this issue Feb 27, 2024 · 17 comments
Labels: kind/bug, severity/critical, triage/accepted

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.3-ef086dc-20240222
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Collection laion_stable_4 has 58M+ rows of 768-dim data, and the schema is:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
  2. Reload the collection (64 segments) -> concurrent requests: insert + delete + search + query

  3. One of the 4 querynodes terminated with exit code 134 and the following error logs:
    (Since cardinal is private, please get in touch with me for more detailed querynode termination logs.)

E20240226 16:13:52.003870   607 FileIo.cpp:25] [CARDINAL][FileReader][milvus] Failed to open file : /var/lib/milvus/data/querynode/index_files/447990444064979058/1/_mem.index.bin
E20240226 16:13:52.005385   607 cardinal.cc:368] [KNOWHERE][Deserialize][milvus] Cardinal Inner Exception: std::exception
I20240226 16:13:52.005625   607 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Load index: done (2.135270 ms)
 => failed to Deserialize index, cardinal inner error
non-Go function
    pc=0x7f58fbc2003b
non-Go function
    pc=0x7f58fbbff858
non-Go function
    pc=0x7f58fba998d0
non-Go function
    pc=0x7f58fbaa537b
non-Go function
    pc=0x7f58fbaa4358
non-Go function
    pc=0x7f58fbaa4d10
non-Go function
    pc=0x7f58fbde1bfe
runtime.cgocall(0x4749090, 0xc001774cd0)
    /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc001774ca8 sp=0xc001774c70 pc=0x1a627bc
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_DeleteSegment(0x7f58f68d1700)
    _cgo_gotypes.go:475 +0x45 fp=0xc001774cd0 sp=0xc001774ca8 pc=0x4522a45
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).Release.func1(0xc00175b2d8?)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment.go:1037 +0x3a fp=0xc001774d08 sp=0xc001774cd0 pc=0x453d07a
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).Release(0xc00175b290)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/segment.go:1037 +0xa6 fp=0xc001774f48 sp=0xc001774d08 pc=0x453c826
github.com/milvus-io/milvus/internal/querynodev2/segments.remove({0x5744620, 0xc00175b290})
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/manager.go:542 +0x42 fp=0xc001775010 sp=0xc001774f48 pc=0x452e5c2
github.com/milvus-io/milvus/internal/querynodev2/segments.(*segmentManager).Remove(0xc001620a80, 0x5158d57?, 0x3)
    /go/src/github.com/milvus-io/milvus/internal/querynodev2/segments/manager.go:447 +0x2d5 fp=0xc0017750b0 sp=0xc001775010 pc=0x452d5b5

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

laion1b-test-2-etcd-0                                             1/1     Running             1 (46d ago)     74d     10.104.25.31    4am-node30   <none>           <none>
laion1b-test-2-etcd-1                                             1/1     Running             0               74d     10.104.30.94    4am-node38   <none>           <none>
laion1b-test-2-etcd-2                                             1/1     Running             0               74d     10.104.34.225   4am-node37   <none>           <none>
laion1b-test-2-milvus-datanode-7b7f99b8d4-g8v8q                   1/1     Running             0               20h     10.104.16.187   4am-node21   <none>           <none>
laion1b-test-2-milvus-datanode-7b7f99b8d4-t7lfp                   1/1     Running             0               20h     10.104.30.131   4am-node38   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-2kbqd                   1/1     Running             0               15h     10.104.14.112   4am-node18   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-d6q6m                   1/1     Running             0               15h     10.104.9.46     4am-node14   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-lg4k7                   1/1     Running             0               15h     10.104.34.47    4am-node37   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-q9hcx                   1/1     Running             0               15h     10.104.17.50    4am-node23   <none>           <none>
laion1b-test-2-milvus-indexnode-c8c8f4584-vtvx5                   1/1     Running             0               15h     10.104.29.240   4am-node35   <none>           <none>
laion1b-test-2-milvus-mixcoord-74b896d49d-ljz4l                   1/1     Running             0               20h     10.104.18.222   4am-node25   <none>           <none>
laion1b-test-2-milvus-proxy-5cdb5b7d6b-w5h29                      1/1     Running             0               20h     10.104.19.6     4am-node28   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-8pfz2                1/1     Running             0               15h     10.104.17.49    4am-node23   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-cb9mn                1/1     Running             0               15h     10.104.28.101   4am-node33   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-dd9hn                1/1     Running             0               15h     10.104.33.74    4am-node36   <none>           <none>
laion1b-test-2-milvus-querynode-0-7977c8fdbf-zfhhz                1/1     Running             1 (15h ago)     15h     10.104.32.10    4am-node39   <none>           <none>
laion1b-test-2-pulsar-bookie-0                                    1/1     Running             0               74d     10.104.33.107   4am-node36   <none>           <none>
laion1b-test-2-pulsar-bookie-1                                    1/1     Running             0               50d     10.104.18.240   4am-node25   <none>           <none>
laion1b-test-2-pulsar-bookie-2                                    1/1     Running             0               74d     10.104.25.32    4am-node30   <none>           <none>
laion1b-test-2-pulsar-broker-0                                    1/1     Running             0               68d     10.104.1.69     4am-node10   <none>           <none>
laion1b-test-2-pulsar-proxy-0                                     1/1     Running             0               74d     10.104.4.218    4am-node11   <none>           <none>
laion1b-test-2-pulsar-recovery-0                                  1/1     Running             0               74d     10.104.14.151   4am-node18   <none>           <none>
laion1b-test-2-pulsar-zookeeper-0                                 1/1     Running             0               74d     10.104.29.87    4am-node35   <none>           <none>
laion1b-test-2-pulsar-zookeeper-1                                 1/1     Running             0               74d     10.104.21.124   4am-node24   <none>           <none>
laion1b-test-2-pulsar-zookeeper-2                                 1/1     Running             0               74d     10.104.34.229   4am-node37   <none>           <none>
  • core dump file:
    /tmp/cores/core-laion1b-test-2-milvus-querynode-0-7977c8fdbf-zfhhz-milvus-8-1708964037 of 4am-node39

Anything else?

No response

ThreadDao added the kind/bug and needs-triage labels on Feb 27, 2024
ThreadDao added this to the 2.3.10 milestone on Feb 27, 2024
ThreadDao added the severity/critical label on Feb 27, 2024
@ThreadDao
Contributor Author

Perhaps it is because the dataCoord.channel.watchTimeoutInterval configuration was modified and Milvus was restarted. I mean, when the querynode restarted it looks like the tests had not started yet.

@yanliang567
Contributor

/assign @liliu-z
/unassign

sre-ci-robot assigned liliu-z and unassigned yanliang567 on Feb 28, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on Feb 28, 2024
@foxspy
Contributor

foxspy commented Feb 28, 2024

/assign @foxspy
/unassign @liliu-z

sre-ci-robot assigned foxspy and unassigned liliu-z on Feb 28, 2024
yanliang567 modified the milestone from 2.3.10 to 2.3.11 on Feb 28, 2024
@foxspy
Contributor

foxspy commented Feb 28, 2024

The root cause seems to be a concurrency bug between release and load. The querynode releases a segment while the index engine is concurrently loading the index from file. The index engine throwing an exception is expected since the file does not exist, and that alone would not cause the querynode to core dump; the actual cause of the core dump is the release operation.

@foxspy
Contributor

foxspy commented Feb 28, 2024

/assign @yanliang567
/unassign

sre-ci-robot assigned yanliang567 and unassigned foxspy on Feb 28, 2024
@chyezh
Contributor

chyezh commented Feb 28, 2024

/assign @chyezh

@chyezh
Contributor

chyezh commented Feb 28, 2024

Index 447990444064979058 belongs to Segment 447990444064723266.

Node `` started to release the segment while a new load request was incoming.


[2024/02/26 16:13:34.948 +00:00] [INFO] [querynodev2/services.go:595] ["start to release segments"] [traceID=06cd8e504609e669a530c57535e33631] [collectionID=447902879639453431] [shard=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [segmentIDs="[447990444064723266]"] [currentNodeID=1681]

[2024/02/26 16:13:38.996 +00:00] [INFO] [querynodev2/services.go:433] ["received load segments request"] [traceID=7c9ecfdda5b9a070dc760feb81f2bf64] [collectionID=447902879639453431] [partitionID=447902879639453437] [shard=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [segmentID=447990444064723266] [currentNodeID=1681] [version=1708964018906870595] [needTransfer=false] [loadScope=Full]

Loading a duplicate segment is guarded by a check against the SegmentManager:

...
		if len(loader.manager.Segment.GetBy(WithType(segmentType), WithID(segment.GetSegmentID()))) == 0 &&
			!loader.loadingSegments.Contain(segment.GetSegmentID()) {
...

Releasing a segment first removes it from the SegmentManager and then releases the memory:

	case querypb.DataScope_Historical:
		sealed = mgr.removeSegmentWithType(SegmentTypeSealed, segmentID)
		if sealed != nil {
			removeSealed = 1
		}

	mgr.updateMetric()
	mgr.mu.Unlock()

	if sealed != nil {
		remove(sealed)
	}

So a concurrent load and release of the same segment can happen.
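
For illustration, here is a minimal, self-contained Go sketch of that interleaving (all names and types are illustrative, not the actual Milvus code): release removes the segment from the manager first, so a concurrent load passes the duplicate check, and only afterwards frees the resources that the load is already using.

```go
// Toy model of the race: "release" unregisters the segment before it frees
// the on-disk index file, so a concurrent "load" passes the duplicate check
// and later finds the file gone. Names here are illustrative only.
package main

import (
	"fmt"
	"sync"
	"time"
)

type manager struct {
	mu     sync.Mutex
	loaded map[int64]bool // segmentID -> registered in the manager
	files  map[int64]bool // segmentID -> index file still on disk
}

func (m *manager) release(id int64) {
	// Step 1: remove from the manager; from now on a reload is allowed.
	m.mu.Lock()
	delete(m.loaded, id)
	m.mu.Unlock()

	// Step 2: free the actual resources; this races with a concurrent load.
	time.Sleep(10 * time.Millisecond) // widen the window for the demo
	m.mu.Lock()
	delete(m.files, id)
	m.mu.Unlock()
}

func (m *manager) load(id int64) {
	// Duplicate check: passes because step 1 of the release already ran.
	m.mu.Lock()
	if m.loaded[id] {
		m.mu.Unlock()
		return
	}
	m.mu.Unlock()

	time.Sleep(20 * time.Millisecond) // "loading" the index takes a while

	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.files[id] {
		fmt.Println("load failed: index file removed by the concurrent release")
		return
	}
	m.loaded[id] = true
}

func main() {
	const segmentID int64 = 447990444064723266
	m := &manager{
		loaded: map[int64]bool{segmentID: true},
		files:  map[int64]bool{segmentID: true},
	}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); m.release(segmentID) }()
	go func() { defer wg.Done(); m.load(segmentID) }()
	wg.Wait()
}
```

With the two sleeps widening the window, the load deterministically observes the missing file, mirroring the "Failed to open file ... _mem.index.bin" error in the querynode log.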

@chyezh
Contributor

chyezh commented Feb 28, 2024

Short-term fix: implement mutual exclusion between Release and Load on the querynode.
Long-term, it is necessary to implement lifecycle control of segments on QueryCoord, i.e. Loading, Loaded, and Released states.
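
A rough sketch of what such mutual exclusion could look like, assuming a per-segment lock shared by the load and release paths is sufficient (hypothetical names, not the actual change that was merged):

```go
// Hypothetical per-segment lock table: Load and Release of the same segment
// serialize on one mutex, so a release can no longer free resources while a
// load of that segment is in flight (and vice versa).
package segments

import "sync"

type segmentLocks struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex
}

func newSegmentLocks() *segmentLocks {
	return &segmentLocks{locks: make(map[int64]*sync.Mutex)}
}

// lockFor returns the mutex dedicated to one segment ID, creating it lazily.
func (s *segmentLocks) lockFor(segmentID int64) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	l, ok := s.locks[segmentID]
	if !ok {
		l = &sync.Mutex{}
		s.locks[segmentID] = l
	}
	return l
}

var globalSegmentLocks = newSegmentLocks()

// LoadSegment runs the load callback while holding the segment's lock.
func LoadSegment(segmentID int64, doLoad func() error) error {
	l := globalSegmentLocks.lockFor(segmentID)
	l.Lock()
	defer l.Unlock()
	return doLoad()
}

// ReleaseSegment runs both release steps (unregister + free) under the same
// lock, so the window between them is no longer visible to a concurrent load.
func ReleaseSegment(segmentID int64, doRelease func()) {
	l := globalSegmentLocks.lockFor(segmentID)
	l.Lock()
	defer l.Unlock()
	doRelease()
}
```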

@yanliang567
Contributor

/unassign

@congqixia
Contributor

@chyezh
A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

@chyezh
Contributor

chyezh commented Feb 29, 2024

@chyezh A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

Load is triggered while the segment is releasing, not the other way around (a release is not triggered while the segment is loading).

The release operation is divided into two steps on the query node:

  1. Removing the segment from the segmentManager (after this, the SegmentLoader is allowed to reload this segment),
  2. Releasing the actual segment.

@congqixia
Contributor

@chyezh A loading segment will not be released in the segment manager. In my opinion, concurrent load & release should not happen for the same segment. Could you please explain in detail how it happened?

Load is triggered while the segment is releasing, not the other way around (a release is not triggered while the segment is loading).

The release operation is divided into two steps on the query node:

1. Removing the segment from the segmentManager (after this, the SegmentLoader is allowed to reload this segment),

2. Releasing the actual segment.

@chyezh got it, thanks!

@congqixia
Contributor

After some offline discussion, the final solution shall be to separate the disk resources for the different stages of the segment life-cycle.

One more thing, it looks weird that a segment is released and then loaded back. Maybe the segment was bouncing between querynodes?

@chyezh
Contributor

chyezh commented Feb 29, 2024

After some offline discussion, the final solution shall be to separate the disk resources for the different stages of the segment life-cycle.

One more thing, it looks weird that a segment is released and then loaded back. Maybe the segment was bouncing between querynodes?

  • The segment was released on the QN because the collection was released.
  • The segment was reloaded by the segment checker (lack of segment), updated by Distribution?
[2024/02/26 16:13:34.498 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067586] [type=Reduce] [source=segment_checker] [reason=collection released] [collectionID=447902879639453431] [replicaID=-1] [priority=Normal] [actionsCount=1] [actions={[type=Reduce][node=1681][streaming=false]}] [segmentID=447990444064723266]"]

[2024/02/26 16:13:38.501 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067608] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=447902879639453431] [replicaID=447990457955778562] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1681][streaming=false]}] [segmentID=447990444064723266]"]

@chyezh
Contributor

chyezh commented Mar 1, 2024

Releasing and then loading the collection, or a concurrent release and load of the collection, can reproduce it (see the log excerpt below; a rough reproduction sketch follows it).

2024-02-27 00:13:34.487	[2024/02/26 16:13:34.487 +00:00] [INFO] [querycoordv2/services.go:254] ["release collection request received"] [traceID=458948ca161f98b29f6d8118b6001ae5] [collectionID=447902879639453431]
2024-02-27 00:13:34.498	[2024/02/26 16:13:34.498 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067586] [type=Reduce] [source=segment_checker] [reason=collection released] [collectionID=447902879639453431] [replicaID=-1] [priority=Normal] [actionsCount=1] [actions={[type=Reduce][node=1681][streaming=false]}] [segmentID=447990444064723266]"]
2024-02-27 00:13:34.976	[2024/02/26 16:13:34.976 +00:00] [INFO] [task/executor.go:104] ["execute the action of task"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [step=0] [source=segment_checker]
2024-02-27 00:13:34.977	[2024/02/26 16:13:34.976 +00:00] [INFO] [task/executor.go:298] ["release segment..."] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [segmentID=447990444064723266] [node=1681] [source=segment_checker]
2024-02-27 00:13:35.469	[2024/02/26 16:13:35.469 +00:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [status=succeeded] [segmentID=447990444064723266]
2024-02-27 00:13:35.470	[2024/02/26 16:13:35.470 +00:00] [WARN] [task/executor.go:301] ["failed to release segment, it may be a false failure"] [taskID=1708948067586] [collectionID=447902879639453431] [replicaID=-1] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction: attempt #0: rpc error: code = Canceled desc = context canceled: context canceled"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: attempt #0: rpc error: code = Canceled desc = context canceled: context canceled\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271\n  | github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161\n  | 
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1\n  | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:107\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:192 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).ReleaseSegments\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:164 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments.func1\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:271 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).send\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/session/cluster.go:161 github.com/milvus-io/milvus/internal/querycoordv2/session.(*QueryCluster).ReleaseSegments\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:299 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:135 github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\nWraps: (3) attempt #0: rpc error: code = Canceled desc = context canceled\nWraps: (4) context canceled\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.multiErrors (4) *errors.errorString"]
2024-02-27 00:13:35.790	[2024/02/26 16:13:35.790 +00:00] [INFO] [querycoordv2/services.go:197] ["load collection request received"] [traceID=6019913291c1f2be024e5909f5edd21f] [collectionID=447902879639453431] [replicaNumber=1] [resourceGroups="[]"] [refreshMode=false] [schema="name:\"laion_stable_4\" fields:<fieldID:100 name:\"id\" is_primary_key:true data_type:Int64 > fields:<fieldID:101 name:\"float_vector\" data_type:FloatVector type_params:<key:\"dim\" value:\"768\" > > fields:<fieldID:102 name:\"int64_pk_5b\" data_type:Int64 is_partition_key:true > fields:<fieldID:103 name:\"varchar_caption\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:104 name:\"varchar_NSFW\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:105 name:\"float64_similarity\" data_type:Float > fields:<fieldID:106 name:\"int64_width\" data_type:Int64 > fields:<fieldID:107 name:\"int64_height\" data_type:Int64 > fields:<fieldID:108 name:\"int64_original_width\" data_type:Int64 > fields:<fieldID:109 name:\"int64_original_height\" data_type:Int64 > fields:<fieldID:110 name:\"varchar_md5\" data_type:VarChar type_params:<key:\"max_length\" value:\"8192\" > > fields:<fieldID:111 name:\"$meta\" description:\"dynamic schema\" data_type:JSON is_dynamic:true > enable_dynamic_field:true "] [fieldIndexes="[447902879639453513,447902879639453519,447902879639453502,447902879639453508]"]
2024-02-27 00:13:38.501	[2024/02/26 16:13:38.501 +00:00] [INFO] [task/scheduler.go:269] ["task added"] [task="[id=1708948067608] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=447902879639453431] [replicaID=447990457955778562] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1681][streaming=false]}] [segmentID=447990444064723266]"]
2024-02-27 00:13:38.608	[2024/02/26 16:13:38.608 +00:00] [INFO] [task/executor.go:104] ["execute the action of task"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [step=0] [source=segment_checker]
2024-02-27 00:13:38.906	[2024/02/26 16:13:38.906 +00:00] [INFO] [task/executor.go:230] ["load segments..."] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [shardLeader=1679]
2024-02-27 00:14:02.610	[2024/02/26 16:14:02.609 +00:00] [WARN] [task/executor.go:238] ["failed to load segment"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [node=1681] [source=segment_checker] [shardLeader=1679] [error="unrecoverable error"]
2024-02-27 00:14:02.610	[2024/02/26 16:14:02.609 +00:00] [INFO] [task/executor.go:119] ["execute action done, remove it"] [taskID=1708948067608] [step=0] [error="unrecoverable error"]
2024-02-27 00:14:02.623	[2024/02/26 16:14:02.623 +00:00] [WARN] [task/scheduler.go:727] ["task scheduler recordSegmentTaskError"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [segmentID=447990444064723266] [status=failed] [error="unrecoverable error"]
2024-02-27 00:14:02.623	[2024/02/26 16:14:02.623 +00:00] [INFO] [task/scheduler.go:768] ["task removed"] [taskID=1708948067608] [collectionID=447902879639453431] [replicaID=447990457955778562] [status=failed] [segmentID=447990444064723266]
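
For completeness, a rough reproduction sketch of that sequence, assuming the milvus-sdk-go v2 client (method names are from memory and should be checked against the SDK version in use) and a collection that is already loaded with indexed segments:

```go
// Release the collection and load it again right away; QueryCoord then issues
// Reduce and Grow tasks for the same segments back to back, as in the logs above.
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx := context.Background()

	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer c.Close()

	const collection = "laion_stable_4"

	// Release: querynodes start tearing down the collection's segments.
	if err := c.ReleaseCollection(ctx, collection); err != nil {
		log.Fatalf("release: %v", err)
	}

	// Load immediately afterwards: the segment checker may schedule loads for
	// segments whose release on the querynode is still in progress.
	if err := c.LoadCollection(ctx, collection, false); err != nil {
		log.Fatalf("load: %v", err)
	}
}
```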

yanliang567 modified the milestone from 2.3.11 to 2.3.12 on Mar 11, 2024
@ThreadDao
Contributor Author

ThreadDao commented Mar 14, 2024

@chyezh

  • image: cardinal-milvus-io-2.3-3c90475-20240311
  • queryNode laion1b-test-2-milvus-querynode-1-86cfff6f5d-7b2lv terminated with exit code 134 at 2024-03-12 16:02:40.814 (UTC)

@ThreadDao
Contributor Author

Short-term fix: implement mutual exclusion between Release and Load on the querynode.
