[Enhancement]: Reduce lock granularity in Datacoord meta #30837

Closed
3 of 4 tasks
jaime0815 opened this issue Feb 26, 2024 · 3 comments
Labels
kind/enhancement (issues or changes related to enhancement), stale (indicates no updates for 30 days)

Comments

@jaime0815
Contributor

jaime0815 commented Feb 26, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

Why is this needed?

Lock contention easily causes load segment, release segment, and update channel checkpoint requests to fail, because the Datacoord meta holds a single coarse-grained lock and performs some I/O operations while holding it.
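One possible direction, shown only as a minimal sketch: give each channel its own lock so that a slow write for one channel does not block checkpoint updates for the others. Everything here (`channelCheckpoints`, `cpSlot`, the `save` callback standing in for the catalog write) is a hypothetical name for illustration, not the actual Milvus code.

```go
package main

import (
	"fmt"
	"sync"
)

// cpSlot holds the checkpoint of one channel behind its own mutex.
type cpSlot struct {
	mu sync.Mutex
	ts uint64
}

// channelCheckpoints is a hypothetical stand-in for the checkpoint part of
// the Datacoord meta; it is not the real Milvus type.
type channelCheckpoints struct {
	mu    sync.Mutex // guards the slots map only; never held during I/O
	slots map[string]*cpSlot
	save  func(channel string, ts uint64) error // stand-in for the catalog write
}

func newChannelCheckpoints(save func(string, uint64) error) *channelCheckpoints {
	return &channelCheckpoints{slots: make(map[string]*cpSlot), save: save}
}

// slot returns the per-channel slot, creating it on first use.
func (c *channelCheckpoints) slot(channel string) *cpSlot {
	c.mu.Lock()
	defer c.mu.Unlock()
	s, ok := c.slots[channel]
	if !ok {
		s = &cpSlot{}
		c.slots[channel] = s
	}
	return s
}

// Update persists and records a newer checkpoint. Only callers working on
// the same channel wait on each other while the write is in flight.
func (c *channelCheckpoints) Update(channel string, ts uint64) error {
	s := c.slot(channel)
	s.mu.Lock()
	defer s.mu.Unlock()
	if ts <= s.ts {
		return nil // stale or duplicate update, nothing to persist
	}
	if err := c.save(channel, ts); err != nil {
		return err
	}
	s.ts = ts
	return nil
}

// Get returns the last recorded checkpoint of a channel (0 if none).
func (c *channelCheckpoints) Get(channel string) uint64 {
	s := c.slot(channel)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ts
}

func main() {
	cps := newChannelCheckpoints(func(string, uint64) error { return nil })
	var wg sync.WaitGroup
	for i := 1; i <= 100; i++ {
		wg.Add(1)
		go func(ts uint64) {
			defer wg.Done()
			_ = cps.Update("ch-0", ts)
			_ = cps.Update("ch-1", ts*2)
		}(uint64(i))
	}
	wg.Wait()
	fmt.Println(cps.Get("ch-0"), cps.Get("ch-1")) // prints "100 200"
}
```

The map lock is held only for a lookup, so the expensive part (the write behind `save`) is serialized per channel rather than globally. Whether this fits the real meta depends on invariants that span channels and segments, which this sketch does not model.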

We tested stability by flushing after every single-row insert. Once the number of segments reached 20-30K, the following issues were exposed:

  1. The write lock was held for more than 2s while updating compaction results (screenshot omitted).

  2. 87K+ goroutines were waiting to acquire the write lock in order to update the channel checkpoint:

goroutine profile: total 89243
87385 @ 0x1a9c256 0x1aae02f 0x1aae006 0x1ace1c6 0x1af3885 0x1af4f16 0x1af4ef5 0x3de68a6 0x3e13318 0x3e2abaf 0x294819b 0x3ca29e2 0x264701a 0x3ca3753 0x264701a 0x39014c9 0x264701a 0x2345048 0x264701a 0x2646ebe 0x2948058 0x21beb33 0x21c3c56 0x21bc458 0x1ad2b21
#	0x1ace1c5	sync.runtime_SemacquireMutex+0x25										/usr/local/go/src/runtime/sema.go:77
#	0x1af3884	sync.(*Mutex).lockSlow+0x164											/usr/local/go/src/sync/mutex.go:171
#	0x1af4f15	sync.(*Mutex).Lock+0x35												/usr/local/go/src/sync/mutex.go:90
#	0x1af4ef4	sync.(*RWMutex).Lock+0x14											/usr/local/go/src/sync/rwmutex.go:147
#	0x3de68a5	github.com/milvus-io/milvus/internal/datacoord.(*meta).UpdateChannelCheckpoint+0x85				/go/src/github.com/milvus-io/milvus/internal/datacoord/meta.go:1330
#	0x3e13317	github.com/milvus-io/milvus/internal/datacoord.(*Server).UpdateChannelCheckpoint+0x117				/go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:1357
#	0x3e2abae	github.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).UpdateChannelCheckpoint+0x2e		/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:407
#	0x294819a	github.com/milvus-io/milvus/internal/proto/datapb._DataCoord_UpdateChannelCheckpoint_Handler.func1+0x7a		/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:6986
#	0x3ca29e1	github.com/milvus-io/milvus/pkg/util/interceptor.ServerIDValidationUnaryServerInterceptor.func1+0x101		/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/server_id_interceptor.go:54
#	0x2647019	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#	0x3ca3752	github.com/milvus-io/milvus/pkg/util/interceptor.ClusterValidationUnaryServerInterceptor.func1+0xf2		/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/cluster_interceptor.go:48
#	0x2647019	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#	0x39014c8	github.com/milvus-io/milvus/pkg/util/logutil.UnaryTraceLoggerInterceptor+0x48					/go/src/github.com/milvus-io/milvus/pkg/util/logutil/grpc_interceptor.go:23
#	0x2647019	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#	0x2345047	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1+0x527	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.38.0/interceptor.go:342
#	0x2647019	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39					/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#	0x2646ebd	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbd					/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34
#	0x2948057	github.com/milvus-io/milvus/internal/proto/datapb._DataCoord_UpdateChannelCheckpoint_Handler+0x137		/go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:6988
#	0x21beb32	google.golang.org/grpc.(*Server).processUnaryRPC+0xdf2								/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1345
#	0x21c3c55	google.golang.org/grpc.(*Server).handleStream+0xa35								/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1722
#	0x21bc457	google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97							/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:966
  3. Acquiring a read lock took about 600ms while getting the recovery info (see the timestamps in the log below):
[2024/02/22 02:33:41.590 +00:00] [INFO] [datacoord/services.go:757] ["get recovery info request received"] [traceID=e6f616c4cde658799fb1d6b2fca452e1] [collectionID=447845239012389309] [partitionIDs="[]"]
[2024/02/22 02:33:42.172 +00:00] [INFO] [datacoord/handler.go:116] [GetQueryVChanPositions] [collectionID=447845239012389309] [channel=in01-f52be80fa2fba4a-rootcoord-dml_0_447845239012389309v0] [numOfSegments=2] ["indexed segment"=1]
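For the read path in the last item, one option is to let readers work on an immutable snapshot so they never queue behind a writer. The sketch below is an illustration under assumptions, not the Datacoord implementation: `segmentView`, `segmentInfo`, and `SegmentsOfChannel` are hypothetical names, and it relies on `atomic.Pointer` from Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// segmentInfo is a hypothetical, trimmed-down segment record.
type segmentInfo struct {
	ID      int64
	Channel string
}

// segmentView publishes an immutable snapshot of the segment map to readers.
type segmentView struct {
	writeMu sync.Mutex                            // serializes writers only
	snap    atomic.Pointer[map[int64]segmentInfo] // loaded by readers without a lock
}

func newSegmentView() *segmentView {
	v := &segmentView{}
	empty := map[int64]segmentInfo{}
	v.snap.Store(&empty)
	return v
}

// Add copies the current map, adds the segment, and publishes the copy.
// Published maps are never mutated afterwards.
func (v *segmentView) Add(seg segmentInfo) {
	v.writeMu.Lock()
	defer v.writeMu.Unlock()
	old := *v.snap.Load()
	next := make(map[int64]segmentInfo, len(old)+1)
	for id, s := range old {
		next[id] = s
	}
	next[seg.ID] = seg
	v.snap.Store(&next)
}

// SegmentsOfChannel scans the current snapshot; it never waits for a writer.
func (v *segmentView) SegmentsOfChannel(channel string) []int64 {
	var ids []int64
	for id, s := range *v.snap.Load() {
		if s.Channel == channel {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	v := newSegmentView()
	v.Add(segmentInfo{ID: 1, Channel: "ch-0"})
	v.Add(segmentInfo{ID: 2, Channel: "ch-1"})
	v.Add(segmentInfo{ID: 3, Channel: "ch-0"})
	fmt.Println(len(v.SegmentsOfChannel("ch-0"))) // prints "2"
}
```

The trade-off is that every write copies the whole map, which is costly at 20-30K segments; batching writes or keeping one snapshot per channel would likely be needed before this pays off.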

Anything else?

No response

@xiaofan-luan
Contributor

This seems to be a challenging task to work on.
But it has also become a blocking issue, since we hold a global lock while doing I/O.

@sunby could you please help with this?


stale bot commented Apr 14, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label Apr 14, 2024
@xiaofan-luan
Contributor

keep it

@stale stale bot closed this as completed Apr 23, 2024