New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lightning: continue this region round and retry on next round when TiKV is busy #40278
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
Signed-off-by: lance6716 <lance6716@gmail.com>
/cc @gozssky @lichunzhu |
/run-integration-br-tests |
/run-integration-br-tests |
confVer: r.Region.GetRegionEpoch().GetConfVer(), | ||
version: r.Region.GetRegionEpoch().GetVersion(), | ||
} | ||
if _, ok := writeCheckpoint[checkpointKey]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to ensure the range bounds of nextRoundRegions are not changed after rescan?
Signed-off-by: lance6716 <lance6716@gmail.com>
Signed-off-by: lance6716 <lance6716@gmail.com>
/run-integration-br-tests |
/retest |
strange error if I make |
/retest |
Signed-off-by: lance6716 <lance6716@gmail.com>
/run-integration-br-tests |
/run-integration-br-tests |
Signed-off-by: lance6716 <lance6716@gmail.com>
/run-integration-br-tests |
/run-integration-br-tests |
ptal @gozssky @lichunzhu CI should be OK now |
ptal @gozssky |
Signed-off-by: lance6716 <lance6716@gmail.com>
/run-integration-br-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest lgtm
case retryBusyIngest: | ||
log.FromContext(ctx).Warn("meet tikv busy when ingest", log.ShortError(err), logutil.SSTMetas(ingestMetas), | ||
logutil.Region(region.Region)) | ||
// ImportEngine will continue on this unfinished range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tidb/br/pkg/lightning/backend/local/local.go
Lines 1651 to 1682 in 3cb091b
for { | |
unfinishedRanges := lf.unfinishedRanges(ranges) | |
if len(unfinishedRanges) == 0 { | |
break | |
} | |
log.FromContext(ctx).Info("import engine unfinished ranges", zap.Int("count", len(unfinishedRanges))) | |
// if all the kv can fit in one region, skip split regions. TiDB will split one region for | |
// the table when table is created. | |
needSplit := len(unfinishedRanges) > 1 || lfTotalSize > regionSplitSize || lfLength > regionSplitKeys | |
// split region by given ranges | |
for i := 0; i < maxRetryTimes; i++ { | |
err = local.SplitAndScatterRegionInBatches(ctx, unfinishedRanges, lf.tableInfo, needSplit, regionSplitSize, maxBatchSplitRanges) | |
if err == nil || common.IsContextCanceledError(err) { | |
break | |
} | |
log.FromContext(ctx).Warn("split and scatter failed in retry", zap.Stringer("uuid", engineUUID), | |
log.ShortError(err), zap.Int("retry", i)) | |
} | |
if err != nil { | |
log.FromContext(ctx).Error("split & scatter ranges failed", zap.Stringer("uuid", engineUUID), log.ShortError(err)) | |
return err | |
} | |
// start to write to kv and ingest | |
err = local.writeAndIngestByRanges(ctx, lf, unfinishedRanges, regionSplitSize, regionSplitKeys) | |
if err != nil { | |
log.FromContext(ctx).Error("write and ingest engine failed", log.ShortError(err)) | |
return err | |
} | |
} |
Seems there is no retry limit for ImportEngine. Are there any risks that will make lightning hang/run too long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the old logic, PTAL @gozssky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is no retry limit. Maybe we can check if there is progress on each retry. If there is no progress, then set a retry limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to add another retry counter and sleep interval for now. I'm thinking of providing a retry framework 😂 since the problem does not become worse than before, I prefer we do it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ptal @buchuitoudegou
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
/run-integration-br-tests |
/run-integration-br-tests |
/run-integration-br-tests |
/merge BR CI is very unstable, will check the effects of this PR later |
This pull request has been accepted and is ready to merge. Commit hash: d5c259d
|
Signed-off-by: lance6716 lance6716@gmail.com
What problem does this PR solve?
Issue Number: close #40205
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.