Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong watchedDmchannel and SealedSegment after querynode down #12624

Closed
wants to merge 1 commit into from

Conversation

xige-16
Copy link
Contributor

@xige-16 xige-16 commented Dec 2, 2021

fix issue: #12340 #12190
/kind bug
Signed-off-by: xige-16 xi.ge@zilliz.com

@sre-ci-robot sre-ci-robot added the kind/bug Issues or changes related a bug label Dec 2, 2021
@sre-ci-robot sre-ci-robot added area/internal-api size/XXL Denotes a PR that changes 1000+ lines. labels Dec 2, 2021
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xige-16
To complete the pull request process, please assign congqixia after the PR has been reviewed.
You can assign the PR to them by writing /assign @congqixia in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added the dco-passed DCO check passed. label Dec 2, 2021
@xige-16
Copy link
Contributor Author

xige-16 commented Dec 2, 2021

fix #12323

@xige-16 xige-16 force-pushed the fix-nodedown branch 5 times, most recently from 783508c to 0564904 Compare December 3, 2021 13:36
@xige-16
Copy link
Contributor Author

xige-16 commented Dec 3, 2021

/re-run unit

Signed-off-by: xige-16 <xi.ge@zilliz.com>
@codecov
Copy link

codecov bot commented Dec 6, 2021

Codecov Report

Merging #12624 (836c2f0) into master (a2a1f9d) will increase coverage by 0.07%.
The diff coverage is 83.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12624      +/-   ##
==========================================
+ Coverage   79.08%   79.15%   +0.07%     
==========================================
  Files         457      458       +1     
  Lines       60712    60960     +248     
==========================================
+ Hits        48016    48255     +239     
- Misses      10231    10238       +7     
- Partials     2465     2467       +2     
Impacted Files Coverage Δ
internal/querycoord/cluster.go 77.95% <66.66%> (+0.57%) ⬆️
internal/querycoord/util.go 79.79% <79.79%> (ø)
internal/querycoord/task.go 84.49% <80.08%> (-0.47%) ⬇️
internal/querynode/task.go 66.66% <86.66%> (-0.71%) ⬇️
internal/querycoord/querynode.go 73.17% <89.18%> (+3.05%) ⬆️
internal/querycoord/meta.go 78.47% <96.46%> (+1.15%) ⬆️
internal/querycoord/channel_allocator.go 87.50% <100.00%> (+1.01%) ⬆️
internal/querycoord/query_coord.go 79.03% <100.00%> (ø)
internal/querycoord/task_scheduler.go 75.74% <100.00%> (+0.09%) ⬆️
... and 11 more

@mergify mergify bot added the ci-passed label Dec 6, 2021
case queryPb.WatchType_WatchPartition:
lt = loadTypePartition
default:
return errors.New("unKnow watch type, collectionID = " + fmt.Sprintln(collectionID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print the watchType so it's easier for debugging

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should has unified log format, for instance
Failed to execute watchDmChannelsTask becasase of unknown watch type, collectionId xxx, watchType xxx

@@ -59,6 +59,7 @@ type Cluster interface {
watchDeltaChannels(ctx context.Context, nodeID int64, in *querypb.WatchDeltaChannelsRequest) error
//TODO:: removeDmChannel
getNumDmChannels(nodeID int64) (int, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not return err here? since this is only a in memory action

defer c.RUnlock()

if node, ok := c.nodes[nodeID]; ok {
watchInfos := node.getDmChannelWatchInfo(collectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy before return?

removeDmChannelByPartitionID(info, partitionID)
}
// delete delta channels
delete(qn.watchedDeltaChannels, collectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about delta channels? thought we also need to handle the partition case

}
err = qn.addDmChannel(in.CollectionID, channelInfo)
if err != nil {
log.Debug("watchDmChannels: add dm channel to node meta failed", zap.String("error", err.Error()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Warn log in error handling path

@@ -643,8 +668,12 @@ func (c *queryNodeCluster) removeNodeInfo(nodeID int64) error {
return err
}

if _, ok := c.nodes[nodeID]; ok {
err = c.nodes[nodeID].clearNodeInfo()
if node, ok := c.nodes[nodeID]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so far the logic is ok, but it seems that the logic is not achieved for idemptence.
For example , if removeDmChannelWatchInfosByNodeID success but clearNodeInfo failed, will the task retry to delete nodeId in c.nodes? can the retry success?

There is usually two way to guarantee idempt:

  1. do the io in a batch or trasanction
  2. do the critical io in the last step, and make sure all the previous step can be retried with no error.

// different queryNode may watch same dmChannel and create same growing segment
// deduplicate result when reduce, the correctness of search has no effect
// and growing segment will be removed after handoff
if loadReq.WatchType == querypb.WatchType_WatchTypeNone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird, do we has some meta about this collection whether it is loaded per collection or per segment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I guess delta channel need to be handled at the same time

@xige-16
Copy link
Contributor Author

xige-16 commented Dec 7, 2021

/hold

@mergify mergify bot removed the ci-passed label Dec 17, 2021
@xige-16 xige-16 closed this Dec 24, 2021
@xige-16 xige-16 deleted the fix-nodedown branch December 24, 2021 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/internal-api dco-passed DCO check passed. do-not-merge/hold kind/bug Issues or changes related a bug size/XXL Denotes a PR that changes 1000+ lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants