
enhance: declarative resource group api #31930

Merged

Conversation

chyezh
Contributor

@chyezh chyezh commented Apr 6, 2024

issue: #30647

  • Add a declarative resource group API (a usage sketch follows below)

  • Add config for resource group management

  • Enhance resource group recovery
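
For context, a minimal sketch of what the declarative shape looks like, using simplified local structs rather than the real rgpb/milvuspb proto messages; the field names (Requests, Limits, TransferFrom, TransferTo) mirror the config added in this PR, but treat the exact types and the default group name as illustrative assumptions:

```go
package main

import "fmt"

// Simplified stand-ins for the resource-group config messages added in this
// PR; the real definitions live in the rgpb/milvuspb protos.
type NodeCount struct {
	NodeNum int32
}

type Transfer struct {
	ResourceGroup string // name of the peer resource group
}

type ResourceGroupConfig struct {
	Requests     NodeCount  // desired number of query nodes
	Limits       NodeCount  // maximum number of query nodes
	TransferFrom []Transfer // preferred groups to borrow nodes from when nodes are missing
	TransferTo   []Transfer // preferred groups to return nodes to when nodes are redundant
}

func main() {
	// Declare the desired state of each resource group; the query coordinator
	// reconciles node assignments toward this target in the background.
	declared := map[string]*ResourceGroupConfig{
		"rg1": {
			Requests:     NodeCount{NodeNum: 2},
			Limits:       NodeCount{NodeNum: 4},
			TransferFrom: []Transfer{{ResourceGroup: "__default_resource_group"}},
			TransferTo:   []Transfer{{ResourceGroup: "__default_resource_group"}},
		},
	}
	for name, cfg := range declared {
		fmt.Printf("%s: requests=%d limits=%d\n", name, cfg.Requests.NodeNum, cfg.Limits.NodeNum)
	}
}
```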

@sre-ci-robot sre-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines.) label Apr 6, 2024
@sre-ci-robot sre-ci-robot added the area/dependency (Pull requests that update a dependency file) and area/internal-api labels Apr 6, 2024
@mergify mergify bot added the dco-passed (DCO check passed.) and kind/enhancement (Issues or changes related to enhancement) labels Apr 6, 2024
Contributor

mergify bot commented Apr 6, 2024

@chyezh The ut workflow job failed; comment rerun ut to trigger the job again.

Contributor

mergify bot commented Apr 7, 2024

@chyezh The ut workflow job failed; comment rerun ut to trigger the job again.

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from 0aeb5c2 to 5e5f394 on April 7, 2024 15:35
Contributor

mergify bot commented Apr 7, 2024

@chyezh The ut workflow job failed; comment rerun ut to trigger the job again.

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch 2 times, most recently from ec7c37c to 1a479ea on April 8, 2024 02:34
Contributor

mergify bot commented Apr 8, 2024

@chyezh The E2E Jenkins job failed; comment /run-cpu-e2e to trigger the job again.

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch 2 times, most recently from 068087d to a905127 on April 8, 2024 09:47
Contributor

mergify bot commented Apr 8, 2024

@chyezh The E2E Jenkins job failed; comment /run-cpu-e2e to trigger the job again.

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from 5a8ba03 to 37e1994 on April 8, 2024 14:06

codecov bot commented Apr 8, 2024

Codecov Report

Attention: Patch coverage is 82.30769%, with 138 lines in your changes missing coverage. Please review.

Project coverage is 81.75%. Comparing base (3d5fe7b) to head (b061234).
Report is 21 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #31930      +/-   ##
==========================================
+ Coverage   81.69%   81.75%   +0.06%     
==========================================
  Files         991      993       +2     
  Lines      122136   122555     +419     
==========================================
+ Hits        99777   100197     +420     
+ Misses      18530    18519      -11     
- Partials     3829     3839      +10     
Files Coverage Δ
internal/distributed/proxy/service.go 83.92% <100.00%> (+0.04%) ⬆️
internal/distributed/querycoord/client/client.go 97.47% <100.00%> (+0.06%) ⬆️
internal/distributed/querycoord/service.go 77.92% <100.00%> (+0.19%) ⬆️
internal/querycoordv2/meta/resource_group.go 100.00% <100.00%> (ø)
...nternal/querycoordv2/observers/replica_observer.go 95.23% <100.00%> (+0.15%) ⬆️
internal/querycoordv2/server.go 83.36% <100.00%> (+0.97%) ⬆️
pkg/util/constant.go 93.33% <ø> (ø)
pkg/util/merr/errors.go 86.95% <ø> (ø)
pkg/util/merr/utils.go 89.39% <100.00%> (+0.46%) ⬆️
pkg/util/paramtable/quota_param.go 83.69% <100.00%> (+0.15%) ⬆️
... and 8 more

... and 30 files with indirect coverage changes

@mergify mergify bot added the ci-passed label Apr 8, 2024
@@ -315,6 +315,17 @@ func (c *Client) CreateResourceGroup(ctx context.Context, req *milvuspb.CreateRe
})
}

func (c *Client) UpdateResourceGroups(ctx context.Context, req *milvuspb.UpdateResourceGroupsRequest, opts ...grpc.CallOption) (*commonpb.Status, error) {
Contributor

It's not recommended to use milvus.pbMsg for Milvus internal RPCs; milvus.pbMsg should only be used between the SDK and the proxy.

Contributor Author

Got it

req = typeutil.Clone(req)
commonpbutil.UpdateMsgBase(
req.GetBase(),
commonpbutil.FillMsgBaseFromClient(paramtable.GetNodeID(), commonpbutil.WithTargetID(c.grpcClient.GetNodeID())),
Contributor

No need to set target_id here; the server ID check is already done in the server ID interceptor.

Contributor Author

All APIs in the internal gRPC client set it, so I will keep that coding convention until some PR fixes them all.

log.Warn("UpdateResourceGroups failed",
zap.Error(err),
)
return getErrResponse(err, method, "", ""), nil
Contributor

It's odd to pass two empty string values here.

Contributor Author

The extra parameters of getErrResponse exist for database- and collection-level metrics. It's a poor implementation; all related code should be moved somewhere like a gRPC interceptor. I'll keep it in this PR, though; a refactor of getErrResponse should happen in another PR (a rough sketch of the interceptor idea follows).
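
As a rough illustration of that refactor direction (not part of this PR), a metrics-style unary server interceptor could look roughly like this; recordRequestMetric is a hypothetical placeholder for whatever metric sink the proxy actually uses:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// recordRequestMetric is a hypothetical placeholder for the real metric sink
// (e.g. per-method counters with database/collection labels).
func recordRequestMetric(method string, err error, elapsed time.Duration) {
	log.Printf("method=%s err=%v elapsed=%s", method, err, elapsed)
}

// MetricsInterceptor records status and latency for every unary RPC, so that
// handlers like UpdateResourceGroups no longer need to thread metric-only
// parameters (such as database/collection names) through error helpers.
func MetricsInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()
		resp, err := handler(ctx, req)
		recordRequestMetric(info.FullMethod, err, time.Since(start))
		return resp, err
	}
}

func main() {
	// Register the interceptor once when building the gRPC server.
	_ = grpc.NewServer(grpc.UnaryInterceptor(MetricsInterceptor()))
}
```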


log.Info("UpdateResourceGroups received")

if err := node.sched.ddQueue.Enqueue(t); err != nil {
Contributor

Tasks in ddQueue are processed one by one, so UpdateResourceGroups may be blocked by CreateCollection/LoadCollection.

Contributor Author

CreateResourceGroup and DropResourceGroup still use the queue to execute their tasks, so I will keep using the queue in this PR. The related APIs can be removed from ddQueue in another PR if necessary (a minimal sketch of the blocking behavior follows).
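
A minimal, self-contained sketch (not the actual ddQueue implementation) of why a serial task queue causes head-of-line blocking: a long-running LoadCollection-style task delays the UpdateResourceGroups-style task queued behind it.

```go
package main

import (
	"fmt"
	"time"
)

type task struct {
	name string
	run  func()
}

// runSerial processes tasks one by one, like a DDL queue: a slow task at the
// head delays every task queued behind it.
func runSerial(tasks []task) {
	enqueued := time.Now()
	for _, t := range tasks {
		t.run()
		fmt.Printf("%s finished %s after being enqueued\n", t.name, time.Since(enqueued))
	}
}

func main() {
	runSerial([]task{
		{name: "LoadCollection", run: func() { time.Sleep(200 * time.Millisecond) }},
		{name: "UpdateResourceGroups", run: func() { time.Sleep(10 * time.Millisecond) }},
	})
}
```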

nodeMgr: nodeMgr,
for _, meta := range rgs {
rg := NewResourceGroupFromMeta(meta)
rm.groups[rg.GetName()] = rg
Contributor

The new-version RG meta should be written back to the catalog, to avoid the old-version RG meta existing forever.

Contributor Author

@chyezh chyezh Apr 10, 2024

The compatibility problem between old- and new-version meta is resolved in resource_group.go: the old-version meta in meta storage can always be recovered into new-version meta in memory. So I think writing the new version back into meta storage is not needed here.
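
A hedged sketch of the compatibility idea described above, with hypothetical field names: an old-version meta that only recorded a capacity is normalized into the new declarative config when loaded into memory, so nothing has to be rewritten to storage up front.

```go
package main

import "fmt"

// Hypothetical shapes: oldRGMeta mimics the pre-declarative meta that only
// stored a node capacity; newRGConfig mimics the requests/limits config.
type oldRGMeta struct {
	Name     string
	Capacity int32
}

type newRGConfig struct {
	Requests int32
	Limits   int32
}

// fromOldMeta upgrades an old-version meta into the new in-memory config.
// Because this mapping is deterministic, old meta left in storage can always
// be recovered on startup without an eager rewrite.
func fromOldMeta(m oldRGMeta) newRGConfig {
	return newRGConfig{Requests: m.Capacity, Limits: m.Capacity}
}

func main() {
	cfg := fromOldMeta(oldRGMeta{Name: "rg1", Capacity: 3})
	fmt.Printf("recovered config: %+v\n", cfg)
}
```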

}
// transfer node from source resource group to target resource group at high priority.
targetCfg.TransferFrom = append(targetCfg.TransferFrom, &rgpb.ResourceGroupTransfer{
ResourceGroup: sourceRGName,
Contributor

Any details on the high priority?

Contributor Author

In the current implementation there's no high priority here. We may add an attribute like boost... in the future to enable priority.
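
For readers following the transfer path, a simplified sketch (local types, not the real rgpb messages) of how a TransferNode call can be expressed as a config mutation: the target group's requested node count is raised and the source group is recorded as a preferred donor, so the recovery loop pulls nodes from it first.

```go
package main

import "fmt"

type transfer struct {
	ResourceGroup string
}

type rgConfig struct {
	RequestsNodeNum int32
	TransferFrom    []transfer
}

// transferNode rewrites declared configs instead of moving nodes imperatively:
// it shifts the requested node counts and records the source group as a
// preferred donor for the target's recovery.
func transferNode(cfgs map[string]*rgConfig, source, target string, num int32) {
	t := cfgs[target]
	t.RequestsNodeNum += num
	t.TransferFrom = append(t.TransferFrom, transfer{ResourceGroup: source})
	cfgs[source].RequestsNodeNum -= num
}

func main() {
	cfgs := map[string]*rgConfig{
		"rg_source": {RequestsNodeNum: 3},
		"rg_target": {RequestsNodeNum: 1},
	}
	transferNode(cfgs, "rg_source", "rg_target", 1)
	fmt.Printf("source=%+v target=%+v\n", *cfgs["rg_source"], *cfgs["rg_target"])
}
```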

}
// After recovering, all node assigned to these rg has been removed.
// no secondary index need to be removed.
delete(rm.groups, rgName)
Contributor

Should nodeIDMap be cleared here?

Contributor Author

After the last recovery, all nodes assigned to this RG have already been removed, so no secondary index entries need to be removed.

newNodes = append(newNodes, nid)
for nodeID := range rm.nodeIDMap {
if node := rm.nodeMgr.Get(nodeID); node == nil || node.IsStoppingState() {
// unassignNode failure can be skip.
Contributor

Removing the stopping node from the resource manager may not be safe here? If a stopping node is executing a balance, what will happen?

Contributor Author

In the current implementation it's safe. In the replica manager the node is set as an RO (read-only) node, and the balancer starts to remove all segments and channels on it.
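
A small illustrative sketch (hypothetical types, not the actual replica manager) of the safety argument above: a stopping node is treated as read-only rather than dropped outright, so no new load is assigned to it while the balancer drains its existing segments and channels.

```go
package main

import "fmt"

type nodeState int

const (
	stateNormal nodeState = iota
	stateStopping
)

type nodeInfo struct {
	ID    int64
	State nodeState
}

// splitNodes separates read-write nodes from read-only ones. Stopping nodes
// land in the read-only set: nothing new is assigned to them, but the
// balancer keeps moving existing load off them before they disappear.
func splitNodes(nodes []nodeInfo) (rw, ro []int64) {
	for _, n := range nodes {
		if n.State == stateStopping {
			ro = append(ro, n.ID)
			continue
		}
		rw = append(rw, n.ID)
	}
	return rw, ro
}

func main() {
	rw, ro := splitNodes([]nodeInfo{{ID: 1, State: stateNormal}, {ID: 2, State: stateStopping}})
	fmt.Println("read-write:", rw, "read-only:", ro)
}
```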

func (rm *ResourceManager) recoverMissingNodeRG(rgName string) error {
for rm.groups[rgName].MissingNumOfNodes() > 0 {
rg := rm.groups[rgName]
sourceRG := rm.selectMissingRecoverSourceRG(rg)
Contributor

Instead of selecting the sourceRG again and again, it's recommended to sort the sourceRGs by priority.

Contributor Author

Got it

Contributor Author

I tried to fix it with a weight-order-based selection policy. It works for recoverMissingNodeRG, but it is difficult to implement in recoverRedundant, and the fix produces code that is hard to read. Most of the time a recover operation applies to a small set of query nodes, likely never more than 10 for a Milvus cluster, so I prefer to keep the current implementation for readability.
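
A compact sketch of the select-again-and-again pattern being discussed (simplified types, not the actual ResourceManager): each iteration picks one source group and moves one node, which stays readable at the small node counts involved.

```go
package main

import "fmt"

type group struct {
	name     string
	nodes    int
	requests int
}

// missing reports how many nodes the group still needs.
func (g *group) missing() int { return g.requests - g.nodes }

// selectSource picks the group with the most spare nodes (nodes above its
// own requests); returns nil when nobody can donate.
func selectSource(groups []*group, target *group) *group {
	var best *group
	for _, g := range groups {
		if g == target || g.nodes <= g.requests {
			continue
		}
		if best == nil || g.nodes-g.requests > best.nodes-best.requests {
			best = g
		}
	}
	return best
}

// recoverMissing moves one node per iteration until the target is satisfied
// or no source remains, mirroring the loop structure under review.
func recoverMissing(groups []*group, target *group) {
	for target.missing() > 0 {
		src := selectSource(groups, target)
		if src == nil {
			return
		}
		src.nodes--
		target.nodes++
	}
}

func main() {
	def := &group{name: "__default_resource_group", nodes: 4, requests: 1}
	rg1 := &group{name: "rg1", nodes: 0, requests: 2}
	recoverMissing([]*group{def, rg1}, rg1)
	fmt.Printf("default=%d rg1=%d\n", def.nodes, rg1.nodes)
}
```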

for _, rg := range rm.groups {
for _, transferCfg := range rg.GetConfig().GetTransferFrom() {
if transferCfg.GetResourceGroup() == rgName {
return errors.Wrapf(ErrDeleteInUsedRG, "resource group %s is used by %s's `from`, remove that configuration first", rgName, rg.name)
Contributor

Recommend defining an error in merr.

Contributor Author

Will fix it soon.
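
Not the actual merr change, just a generic sketch of the pattern being suggested: define the error once as a package-level sentinel (in Milvus it would live in pkg/util/merr; the name here is hypothetical) so callers can classify it with errors.Is instead of matching wrapped strings.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrResourceGroupInUse is a package-level sentinel error; the exact name and
// location are assumptions for illustration.
var ErrResourceGroupInUse = errors.New("resource group is still referenced")

// dropResourceGroup refuses to delete a group that another group's
// transfer_from configuration still references.
func dropResourceGroup(name string, referencedBy map[string]string) error {
	if user, ok := referencedBy[name]; ok {
		return fmt.Errorf("%w: %s is used by %s's transfer_from, remove that configuration first",
			ErrResourceGroupInUse, name, user)
	}
	return nil
}

func main() {
	err := dropResourceGroup("rg1", map[string]string{"rg1": "rg2"})
	// Callers can classify the failure without string matching.
	fmt.Println(errors.Is(err, ErrResourceGroupInUse), err)
}
```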

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from 37e1994 to e4f97c0 on April 10, 2024 09:50
@mergify mergify bot removed the ci-passed label Apr 10, 2024
@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from e4f97c0 to 511f897 on April 10, 2024 13:02
@chyezh
Contributor Author

chyezh commented Apr 10, 2024

Rebased onto master and resolved conflicts.

Contributor

mergify bot commented Apr 10, 2024

@chyezh The E2E Jenkins job failed; comment /run-cpu-e2e to trigger the job again.

@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from 511f897 to 35343b4 on April 11, 2024 07:45
@chyezh
Contributor Author

chyezh commented Apr 11, 2024

@chyezh chyezh added this to the 2.4.1 milestone Apr 11, 2024
@chyezh chyezh force-pushed the feat_milvus_resource_group_enhancement branch from 35343b4 to b061234 on April 11, 2024 11:56
@chyezh
Contributor Author

chyezh commented Apr 11, 2024

Fixed a conflict in resource_manager.go.
Updated the commit fix: upgrade resource group meta into latest version when startup rec…

Contributor

mergify bot commented Apr 11, 2024

@chyezh The E2E Jenkins job failed; comment /run-cpu-e2e to trigger the job again.

@chyezh
Contributor Author

chyezh commented Apr 11, 2024

/run-cpu-e2e

@weiliu1031
Contributor

/lgtm

@xiaofan-luan
Contributor

/lgtm
/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chyezh, xiaofan-luan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 48fe977 into milvus-io:master Apr 15, 2024
15 checks passed
@chyezh chyezh deleted the feat_milvus_resource_group_enhancement branch April 15, 2024 02:03
czs007 pushed a commit that referenced this pull request Apr 15, 2024
issue: #31930

Signed-off-by: chyezh <chyezh@outlook.com>
Labels
approved, area/dependency (Pull requests that update a dependency file), area/internal-api, area/test, ci-passed, dco-passed (DCO check passed), kind/enhancement (Issues or changes related to enhancement), lgtm, sig/testing, size/XXL (Denotes a PR that changes 1000+ lines), test/integration (integration test)