lightning: continue this region round and retry on next round when TiKV is busy #40278

Merged
merged 12 commits into from Jan 12, 2023

Conversation

lance6716
Contributor

Signed-off-by: lance6716 <lance6716@gmail.com>

What problem does this PR solve?

Issue Number: close #40205

Problem Summary:

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: lance6716 <lance6716@gmail.com>
@ti-chi-bot
Member

ti-chi-bot commented Jan 3, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • buchuitoudegou
  • gozssky

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot added the labels release-note-none and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) on Jan 3, 2023
Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/cc @gozssky @lichunzhu

@lance6716
Contributor Author

/run-integration-br-tests

Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

br/pkg/lightning/backend/local/local.go Outdated Show resolved Hide resolved
confVer: r.Region.GetRegionEpoch().GetConfVer(),
version: r.Region.GetRegionEpoch().GetVersion(),
}
if _, ok := writeCheckpoint[checkpointKey]; !ok {
Contributor

@sleepymole sleepymole Jan 3, 2023

How do we ensure that the range bounds of nextRoundRegions have not changed after a rescan?
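One possible way to approach this question, sketched under assumptions rather than taken from the PR: the diff fragment above builds a checkpoint key from the region epoch (`confVer`, `version`), and in TiKV a split or merge bumps the epoch version. If the checkpoint is keyed by region ID plus epoch, a region whose bounds changed after a rescan misses the checkpoint and is written again. The `regionInfo` type and the `keyOf` and `needsWrite` helpers below are hypothetical names used only for illustration.

```go
package main

import "fmt"

// regionInfo is a simplified, hypothetical stand-in for scanned region metadata.
type regionInfo struct {
	id      uint64
	confVer uint64
	version uint64
}

// checkpointKey identifies one region at one specific epoch.
type checkpointKey struct {
	regionID uint64
	confVer  uint64
	version  uint64
}

// keyOf builds the checkpoint key for a region at its current epoch.
func keyOf(r regionInfo) checkpointKey {
	return checkpointKey{regionID: r.id, confVer: r.confVer, version: r.version}
}

// needsWrite reports whether r has no checkpoint entry for its current epoch.
// A region that was split or merged since the last round carries a new
// version, so its old entry no longer matches.
func needsWrite(cp map[checkpointKey]struct{}, r regionInfo) bool {
	_, ok := cp[keyOf(r)]
	return !ok
}

func main() {
	cp := map[checkpointKey]struct{}{}

	before := regionInfo{id: 1, confVer: 1, version: 5}
	cp[keyOf(before)] = struct{}{} // the region was written in this round

	// After a rescan the same region ID comes back with a bumped version.
	after := regionInfo{id: 1, confVer: 1, version: 6}

	fmt.Println(needsWrite(cp, before), needsWrite(cp, after))
}
```

The trade-off in this sketch is that any epoch change, including a membership change that leaves the bounds intact, forces a rewrite; that is conservative but never reuses a checkpoint for a region whose range may have moved.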

Signed-off-by: lance6716 <lance6716@gmail.com>
@ti-chi-bot added the label size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the label size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) on Jan 4, 2023
Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/retest

@lance6716
Contributor Author

I get a strange error if I make the "no leader" error on writes to TiKV retryable 🤔

@lance6716
Contributor Author

/retest

Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/run-integration-br-tests

Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

ptal @gozssky @lichunzhu

CI should be OK now

@lance6716
Contributor Author

ptal @gozssky

br/pkg/lightning/backend/local/local.go Outdated Show resolved Hide resolved
@ti-chi-bot added the label status/LGT1 (indicates that a PR has LGTM 1) on Jan 10, 2023
Signed-off-by: lance6716 <lance6716@gmail.com>
@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/cc @D3Hunter @buchuitoudegou

Contributor

@buchuitoudegou buchuitoudegou left a comment

rest lgtm

case retryBusyIngest:
	log.FromContext(ctx).Warn("meet tikv busy when ingest", log.ShortError(err), logutil.SSTMetas(ingestMetas),
		logutil.Region(region.Region))
	// ImportEngine will continue on this unfinished range
Contributor

for {
	unfinishedRanges := lf.unfinishedRanges(ranges)
	if len(unfinishedRanges) == 0 {
		break
	}
	log.FromContext(ctx).Info("import engine unfinished ranges", zap.Int("count", len(unfinishedRanges)))
	// if all the kv can fit in one region, skip split regions. TiDB will split one region for
	// the table when table is created.
	needSplit := len(unfinishedRanges) > 1 || lfTotalSize > regionSplitSize || lfLength > regionSplitKeys
	// split region by given ranges
	for i := 0; i < maxRetryTimes; i++ {
		err = local.SplitAndScatterRegionInBatches(ctx, unfinishedRanges, lf.tableInfo, needSplit, regionSplitSize, maxBatchSplitRanges)
		if err == nil || common.IsContextCanceledError(err) {
			break
		}
		log.FromContext(ctx).Warn("split and scatter failed in retry", zap.Stringer("uuid", engineUUID),
			log.ShortError(err), zap.Int("retry", i))
	}
	if err != nil {
		log.FromContext(ctx).Error("split & scatter ranges failed", zap.Stringer("uuid", engineUUID), log.ShortError(err))
		return err
	}
	// start to write to kv and ingest
	err = local.writeAndIngestByRanges(ctx, lf, unfinishedRanges, regionSplitSize, regionSplitKeys)
	if err != nil {
		log.FromContext(ctx).Error("write and ingest engine failed", log.ShortError(err))
		return err
	}
}

It seems there is no retry limit for ImportEngine. Is there any risk that this makes Lightning hang or run for too long?

Contributor Author

I'm not sure about the old logic, PTAL @gozssky

Contributor

Yes, there is no retry limit. Maybe we can check if there is progress on each retry. If there is no progress, then set a retry limit.

Contributor Author

I don't want to add another retry counter and sleep interval for now. I'm thinking of providing a retry framework 😂 Since this PR does not make the problem worse than before, I'd prefer we do it later.

Contributor

lgtm

@ti-chi-bot added the label status/LGT2 (indicates that a PR has LGTM 2) and removed the label status/LGT1 (indicates that a PR has LGTM 1) on Jan 12, 2023
@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/run-integration-br-tests

@lance6716
Contributor Author

/merge

BR CI is very unstable; I will check the effects of this PR later.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: d5c259d

@ti-chi-bot added the label status/can-merge (indicates a PR has been approved by a committer) on Jan 12, 2023
@ti-chi-bot ti-chi-bot merged commit aef752a into pingcap:master Jan 12, 2023
Labels
release-note-none size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
5 participants