[Bug]: querynode load segment failed when load data into memory [error=std::exception] #34088

Closed
1 task done
sunwsh opened this issue Jun 24, 2024 · 12 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@sunwsh

sunwsh commented Jun 24, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.10
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  kafka
- SDK version(e.g. pymilvus v2.0.0rc2):   java sdk 2.3.3
- OS(Ubuntu or CentOS):  Ubuntu 20.4
- CPU/Memory:   12/64G
- GPU:
- Others:

Current Behavior

We observed a querynode failing to load a segment in our production environment.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

[2024/06/24 09:46:44.441 +08:00] [INFO] [segments/segment.go:691] ["start loading field data for field"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=1] [rowCount=61516]
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:691] ["start loading field data for field"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=101] [rowCount=61516]
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:691] ["start loading field data for field"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=0] [rowCount=61516]
[2024/06/24 09:46:44.441 +08:00] [INFO] [segments/segment.go:691] ["start loading field data for field"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=100] [rowCount=61516]
	
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:714] ["submitted loadFieldData task to dy pool"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=1] [rowCount=61516]
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:714] ["submitted loadFieldData task to dy pool"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=0] [rowCount=61516]	
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:714] ["submitted loadFieldData task to dy pool"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=101] [rowCount=61516]
[2024/06/24 09:46:44.442 +08:00] [INFO] [segments/segment.go:714] ["submitted loadFieldData task to dy pool"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=100] [rowCount=61516]

[2024/06/24 09:46:44.469 +08:00] [INFO] [segments/segment.go:727] ["load field done"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=0] [rowCount=61516]
[2024/06/24 09:46:44.478 +08:00] [INFO] [segments/segment.go:727] ["load field done"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=100] [rowCount=61516]
[2024/06/24 09:46:44.487 +08:00] [INFO] [segments/segment.go:727] ["load field done"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [fieldID=1] [rowCount=61516]



[2024/06/24 09:46:44.539 +08:00] [WARN] [segments/cgo_util.go:54] ["CStatus returns err"] [error=std::exception]
[2024/06/24 09:46:44.546 +08:00] [INFO] [gc/gc_tuner.go:90] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=33] ["total memory"=15847] ["next GC"=69] ["new GOGC"=200] [gc-pause=131.711µs] [gc-pause-end=1719193604545050717]	
[2024/06/24 09:46:44.548 +08:00] [WARN] [segments/segment_loader.go:263] ["load segment failed when load data into memory"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [segmentType=Sealed] [partitionID=449953581368041552] [segmentID=449953581514048119] [error=std::exception]
	
[2024/06/24 09:46:44.548 +08:00] [ERROR] [funcutil/parallel.go:88] [loadSegmentFunc] [error=std::exception] [idx=0] [stack="github.com/milvus-io/milvus/pkg/util/funcutil.ProcessFuncParallel.func3\n\t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/funcutil/parallel.go:88"]

[2024/06/24 09:46:44.548 +08:00] [WARN] [segments/segment_loader.go:288] ["failed to load some segments"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [segmentType=Sealed] [error=std::exception]
[2024/06/24 09:46:44.548 +08:00] [WARN] [segments/segment_loader.go:207] ["release new segment created due to load failure"] [traceID=71bd6f63d2bd45e6645282510ad1fa9e] [collectionID=449953581317609203] [segmentType=Sealed] [segmentID=449953581514048119] [error=std::exception]
[2024/06/24 09:46:44.574 +08:00] [INFO] [segments/segment.go:1053] ["delete segment from memory"] [collectionID=449953581317609203] [partitionID=449953581368041552] [segmentID=449953581514048119] [segmentType=Sealed] [insertCount=61516]

Anything else?

No response

@sunwsh sunwsh added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2024
@yanliang567
Contributor

@sunwsh are you running a newly created Milvus cluster, or one upgraded from an earlier version? Please retry on the latest Milvus v2.4.5 and provide the full Milvus logs for investigation. Please refer to this doc to export the whole Milvus logs.

/assign @sunwsh
/unassign

@sre-ci-robot sre-ci-robot assigned sunwsh and unassigned yanliang567 Jun 24, 2024
@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2024
@xiaofan-luan
Contributor

@sunwsh could you upload the server-side log so we can get more details?
There is a server-side error, but std::exception alone does not show the details.

@xiaofan-luan
Contributor

I guess this might be an exception during the S3 load, but we need more clues.

@xiaofan-luan
Contributor

xiaofan-luan commented Jun 24, 2024

you can also upgrade to 2.3.18 to see the detailed info

@sunwsh
Author

sunwsh commented Jun 24, 2024


It was not upgraded; the cluster was originally deployed as v2.3.10. Since this failure occurred, the querynode has kept retrying, once every minute.

@xiaofan-luan
Contributor

upgrade to 2.3.18 and you should see the error details

@sunwsh
Author

sunwsh commented Jun 24, 2024


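Oh, I see.

From reading the code, the failure seems to come from here. With version 2.3.18, will the logs from the C++ library be printed?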

func (s *LocalSegment) LoadFieldData(ctx context.Context, fieldID int64, rowCount int64, field *datapb.FieldBinlog) error {
	// ... (setup of loadFieldDataInfo omitted from this excerpt)
	var status C.CStatus
	// Run the blocking cgo call on the load pool and wait for it to finish.
	GetLoadPool().Submit(func() (any, error) {
		log.Info("submitted loadFieldData task to dy pool")
		status = C.LoadFieldData(s.ptr, loadFieldDataInfo.cLoadFieldDataInfo)
		return nil, nil
	}).Await()
	// Convert the returned CStatus into a Go error, if it signals a failure.
	if err := HandleCStatus(ctx, &status, "LoadFieldData failed",
		zap.Int64("collectionID", s.Collection()),
		zap.Int64("partitionID", s.Partition()),
		zap.Int64("segmentID", s.ID()),
		zap.Int64("fieldID", fieldID)); err != nil {
		return err
	}
	// ... (rest of the function omitted from this excerpt)
}
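
To illustrate why the Go side can only report what the C++ side hands back, here is a minimal pure-Go sketch of a status-to-error conversion. It is not the Milvus implementation: `cStatus` and `handleStatus` are hypothetical stand-ins for `C.CStatus` and `HandleCStatus`, and the exact message format is an assumption. If the message carried across the boundary is just "std::exception", that is all the resulting error can contain.

```go
package main

import (
	"errors"
	"fmt"
)

// cStatus is a hypothetical, pure-Go stand-in for the C.CStatus struct
// returned across the cgo boundary (an error code plus a message).
type cStatus struct {
	Code int
	Msg  string
}

// handleStatus sketches the general idea of a status handler: code 0 means
// success, anything else becomes a Go error carrying the C++ message.
func handleStatus(st cStatus, extra string) error {
	if st.Code == 0 {
		return nil
	}
	if st.Msg == "" {
		return errors.New(extra)
	}
	return fmt.Errorf("%s: %s", extra, st.Msg)
}

func main() {
	ok := cStatus{Code: 0}
	bad := cStatus{Code: 1, Msg: "std::exception"}
	fmt.Println(handleStatus(ok, "LoadFieldData failed"))
	fmt.Println(handleStatus(bad, "LoadFieldData failed"))
}
```

The second call yields an error whose only detail is the generic message, which matches the symptom in the logs above: no root cause is visible unless the C++ layer supplies a more specific message.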

@xiaofan-luan
Contributor

there is a cpp error. without upgrading you only see std::exception

@xiaofan-luan
Contributor

xiaofan-luan commented Jun 24, 2024

or you can search the logs to see the detailed errors.

@sunwsh
Author

sunwsh commented Jun 24, 2024


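I just found the cause: the object storage behind our S3 interface has a bug. The write reported success, but the data file actually stored was wrong.

Our Milvus has logic to retry when there is no response within 60 seconds. The first write got no response and the retry then succeeded, but the first, timed-out write was still applied in the background afterwards.
Fortunately the object storage had kept the previous version of the file, so restoring from the old file fixed it.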
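
The failure mode described above can be sketched as follows. This is a simplified, deterministic simulation, not Milvus or S3 code: `memStore` is a hypothetical in-memory stand-in for the bucket, the object key is made up, and the assumption that the late write carried a truncated body is an illustration only. It shows how a read-back checksum comparison would detect that the stored object no longer matches what the successful retry wrote.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// memStore is a hypothetical in-memory stand-in for an S3-compatible bucket.
type memStore struct{ objects map[string][]byte }

// Put stores a copy of data under key, overwriting any previous object.
func (s *memStore) Put(key string, data []byte) {
	s.objects[key] = append([]byte(nil), data...)
}

// Get returns the bytes currently stored under key.
func (s *memStore) Get(key string) []byte { return s.objects[key] }

// verify reports whether the stored object still matches the intended bytes.
func verify(s *memStore, key string, want []byte) bool {
	return sha256.Sum256(s.Get(key)) == sha256.Sum256(want)
}

func main() {
	store := &memStore{objects: map[string][]byte{}}
	full := []byte("complete binlog payload")
	key := "insert_log/example" // hypothetical object key

	// 1. The first attempt times out on the client side; nothing is visible yet.
	// 2. The retry succeeds and stores the full payload.
	store.Put(key, full)
	fmt.Println("after retry:", verify(store, key, full))

	// 3. The timed-out first attempt lands late, here with a truncated body (assumed).
	store.Put(key, full[:8])
	fmt.Println("after late write:", verify(store, key, full))
}
```

A check like this only helps if it runs after the late write has landed, so it is a detection aid rather than a fix; the underlying problem is the storage applying a request the client had already given up on.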

// The indexnode log reports an error parsing the file.

[2024/06/24 16:41:52.382 +08:00] [INFO] [indexnode/indexnode_service.go:56] ["IndexNode building index ..."] [traceID=e613ffeeedbfc91d5aab5cd79d9d4dd5] [clusterID=milvus-pricing] [indexBuildID=449953581514253890] [indexID=0] [indexName=] [indexFilePrefix=index_files] [indexVersion=21394] [dataPaths="[insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255319,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255424,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255470,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255488,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255518,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255524,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255530,insert_log/449953581317609203/449953581368041552/449953581514048119/101/449953581502255548]"] [typeParams="[{\"key\":\"dim\",\"value\":\"2048\"}]"] [indexParams="[{\"key\":\"metric_type\",\"value\":\"L2\"},{\"key\":\"nlist\",\"value\":\"1000\"},{\"key\":\"nbits\",\"value\":\"8\"},{\"key\":\"m\",\"value\":\"128\"},{\"key\":\"index_type\",\"value\":\"IVF_PQ\"}]"] [numRows=61516] [current_index_version=1]
[2024/06/24 16:41:52.456 +08:00] [INFO] [indexnode/indexnode_service.go:116] ["IndexNode successfully scheduled"] [traceID=e613ffeeedbfc91d5aab5cd79d9d4dd5] [clusterID=milvus-pricing] [indexBuildID=449953581514253890] [indexName=]

[2024/06/24 16:41:52.625 +08:00] [WARN] [indexnode/task.go:252] ["parse field meta from binlog failed"] [error="parquet: file is smaller than indicated metadata size"] [errorVerbose="parquet: file is smaller than indicated metadata size:\n    github.com/apache/arrow/go/v12/parquet/file.init\n       /go/pkg/mod/github.com/apache/arrow/go/v12@v12.0.1/parquet/file/file_reader.go:42"]
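
The "file is smaller than indicated metadata size" error relates to the Parquet layout: a file begins and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. Below is a minimal sketch of that structural check on a raw byte buffer. The function name `checkParquetFooter` is hypothetical, this is not the Arrow library's code, and it is assumed to be applied to a bare Parquet payload (Milvus binlogs may wrap the payload in their own framing, which this sketch does not parse).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const magic = "PAR1"

// checkParquetFooter performs a cheap structural check on a Parquet buffer:
// leading and trailing magic, and a footer length that fits in the buffer.
func checkParquetFooter(buf []byte) error {
	// Minimum: 4-byte leading magic + 4-byte footer length + 4-byte trailing magic.
	if len(buf) < 12 {
		return errors.New("buffer too small to be a Parquet file")
	}
	if string(buf[:4]) != magic || string(buf[len(buf)-4:]) != magic {
		return errors.New("missing PAR1 magic bytes")
	}
	footerLen := binary.LittleEndian.Uint32(buf[len(buf)-8 : len(buf)-4])
	// The footer plus the 12 framing bytes must fit inside the buffer.
	if uint64(footerLen)+12 > uint64(len(buf)) {
		return fmt.Errorf("footer claims %d bytes but buffer holds only %d", footerLen, len(buf))
	}
	return nil
}

func main() {
	// A tiny well-formed frame: magic, 5 footer bytes, length 5, magic.
	valid := append([]byte("PAR1meta!"), 5, 0, 0, 0)
	valid = append(valid, magic...)
	fmt.Println("valid:", checkParquetFooter(valid))

	// A frame whose footer length (200) exceeds the data actually present.
	short := append([]byte("PAR1"), 200, 0, 0, 0)
	short = append(short, magic...)
	fmt.Println("short:", checkParquetFooter(short))
}
```

A check of this kind only confirms that the framing is self-consistent; it does not validate the footer metadata or the column data.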

@sunwsh
Author

sunwsh commented Jun 24, 2024

This issue can be closed now.

@yanliang567
Contributor

great to hear that. thanks for updating


3 participants