ddl: support checkpoint for ingest mode #42769

tangenta · 2023-04-03T08:44:30Z

What problem does this PR solve?

Issue Number: close #42164

Background:

When creating an index, most of the time is spent reading the table, writing the index, and importing. After the code refactoring in #42472 and #42668, the procedure is as follows:

1. The DDL worker first divides the table data into tasks, each representing a region, and sends them to the reader, which reads the row data of the region from TiKV storage in batches. A batch of data read by the reader is called a chunk.
2. The reader reads a chunk and sends it to the writer. The writer extracts the index columns of each row in the chunk, encodes them into index KVs, and writes them to the local engine of TiDB-lightning. Moreover,
   - when the writer's memory usage reaches the threshold, a flush is triggered to write the index KVs from the memory buffer to disk.
   - when disk usage reaches the threshold, an unsafe import is triggered to import the index KVs on disk into TiKV storage.
3. After the writer finishes writing a chunk, it returns the result to the DDL worker from time to time to update the current progress.
4. The DDL worker waits for all the results to return, and finally triggers an import to write all the remaining index KVs on disk to TiKV storage.
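
The procedure above can be sketched as a small reader/writer pipeline. This is only an illustration under simplifying assumptions: the `chunk`, `writer`, and `run` names are hypothetical, the thresholds count KVs rather than bytes, and the "disk" and "TiKV" are plain slices instead of the lightning local engine and the real store.

```go
package main

import "fmt"

// chunk is a batch of rows read from one region (hypothetical, simplified).
type chunk struct {
	taskID int
	rows   []string
}

// writer buffers encoded index KVs in memory, flushes them to a simulated
// local disk when the buffer reaches memLimit, and imports the disk content
// into a simulated TiKV when the disk reaches diskLimit.
type writer struct {
	memLimit, diskLimit int
	mem, disk, imported []string
}

// write encodes every row of the chunk and applies both thresholds.
func (w *writer) write(c chunk) {
	for _, row := range c.rows {
		w.mem = append(w.mem, "idx_"+row) // stand-in for index KV encoding
	}
	if len(w.mem) >= w.memLimit {
		w.flush()
	}
	if len(w.disk) >= w.diskLimit {
		w.importAll()
	}
}

// flush moves the in-memory KVs to the simulated disk.
func (w *writer) flush() {
	w.disk = append(w.disk, w.mem...)
	w.mem = w.mem[:0]
}

// importAll moves the KVs on the simulated disk to the simulated TiKV.
func (w *writer) importAll() {
	w.imported = append(w.imported, w.disk...)
	w.disk = w.disk[:0]
}

// run plays the role of the DDL worker: it feeds chunks through a reader
// goroutine, lets the writer consume them, and triggers the final import.
func run(chunks []chunk, memLimit, diskLimit int) *writer {
	w := &writer{memLimit: memLimit, diskLimit: diskLimit}
	ch := make(chan chunk)
	go func() { // reader
		defer close(ch)
		for _, c := range chunks {
			ch <- c
		}
	}()
	for c := range ch { // writer
		w.write(c)
	}
	w.flush()     // make sure nothing is left in memory
	w.importAll() // final import of the remaining KVs on disk
	return w
}

func main() {
	chunks := []chunk{
		{taskID: 1, rows: []string{"a", "b"}},
		{taskID: 2, rows: []string{"c", "d"}},
		{taskID: 3, rows: []string{"e", "f"}},
	}
	w := run(chunks, 2, 4)
	fmt.Println(len(w.imported), len(w.disk), len(w.mem)) // 6 0 0
}
```

In this sketch the second chunk triggers both a flush and an import, while the third chunk is only imported by the final step.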

If we treat the overall process as a progress bar, the start point is the start key of the table, and the end point is the end key of the table. There are two keys that can represent the current progress:

Global Checkpoint: all keys smaller than this have been imported into TiKV storage. Even if all TiDB instances crash, these keys do not need to be re-imported. It is updated by unsafe import.
Local Checkpoint: all keys smaller than this have been written to at least the local disk where the TiDB local engine is located. If TiDB is restarted and the data on the local disk can be accessed, these keys do not need to be read and encoded again. It is updated by flush.

As long as the Global/Local Checkpoint is persisted, we can compare the end key of a task with the checkpoint before the reader starts reading, and so determine whether the task can be skipped. Therefore, we need a component, called the Checkpoint Manager, to manage checkpoints, including adding, deleting, and modifying them.
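
A minimal sketch of the skip decision, assuming task ranges are half-open (`[start, end)`) and keys compare bytewise. The `progress` type, its fields, and `canSkip` are hypothetical names for illustration; the actual checks in the PR may differ.

```go
package main

import (
	"bytes"
	"fmt"
)

// progress holds the two checkpoints described above (hypothetical type).
type progress struct {
	globalCP       []byte // all keys smaller than this are in TiKV
	localCP        []byte // all keys smaller than this are at least on local disk
	localDataReady bool   // whether the local disk data survived the restart
}

// canSkip reports whether a task ending at endKey (exclusive) is already
// covered by a checkpoint and therefore does not need to be read again.
func canSkip(p progress, endKey []byte) bool {
	if p.globalCP != nil && bytes.Compare(endKey, p.globalCP) <= 0 {
		return true // already imported into TiKV
	}
	// The local checkpoint only helps if the local data is still accessible.
	return p.localDataReady && p.localCP != nil &&
		bytes.Compare(endKey, p.localCP) <= 0
}

func main() {
	p := progress{globalCP: []byte("k3"), localCP: []byte("k6"), localDataReady: true}
	fmt.Println(canSkip(p, []byte("k2"))) // true: below the global checkpoint
	fmt.Println(canSkip(p, []byte("k5"))) // true: below the local checkpoint
	fmt.Println(canSkip(p, []byte("k9"))) // false: not covered yet
}
```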

What is changed and how it works?

According to the above reading and writing process, we can abstract the interface for Checkpoint Manager:

type CheckpointManager interface {
   IsComplete(taskID int, start, end kv.Key) bool
   UpdateTotal(taskID int, added int, last bool)
   UpdateCurrent(taskID int, added int) error
}

IsComplete() is called before the reader reads the data, to decide whether to skip the current task.
UpdateTotal() is called by the reader after reading the data, to add the number of rows contained in the current chunk.
UpdateCurrent() is called by the writer after writing to the local engine, to add the number of rows just written.
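
To make the interplay between UpdateTotal() and UpdateCurrent() concrete, here is a simplified in-memory sketch. It is one plausible reading of the row-count bookkeeping, not the PR's implementation: the `rowTracker` and `taskProgress` types, the `finished` helper, and the rule "a task is finished when the last chunk has been seen and written rows equal read rows" are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// taskProgress is a hypothetical per-task record.
type taskProgress struct {
	total    int  // rows read so far by the reader
	current  int  // rows written so far by the writer
	lastSeen bool // whether the reader has delivered the last chunk
}

// rowTracker is an in-memory stand-in for the row-count part of a
// checkpoint manager.
type rowTracker struct {
	mu    sync.Mutex
	tasks map[int]*taskProgress
}

func newRowTracker() *rowTracker {
	return &rowTracker{tasks: make(map[int]*taskProgress)}
}

// UpdateTotal records rows read by the reader for a task.
func (r *rowTracker) UpdateTotal(taskID, added int, last bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.tasks[taskID]
	if !ok {
		p = &taskProgress{}
		r.tasks[taskID] = p
	}
	p.total += added
	p.lastSeen = p.lastSeen || last
}

// UpdateCurrent records rows written by the writer for a task.
func (r *rowTracker) UpdateCurrent(taskID, added int) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.tasks[taskID]
	if !ok {
		return fmt.Errorf("unknown task %d", taskID)
	}
	p.current += added
	return nil
}

// finished reports whether every row read for the task has been written.
func (r *rowTracker) finished(taskID int) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	p, ok := r.tasks[taskID]
	return ok && p.lastSeen && p.current == p.total
}

func main() {
	r := newRowTracker()
	r.UpdateTotal(1, 100, false) // first chunk read
	r.UpdateTotal(1, 50, true)   // last chunk read
	_ = r.UpdateCurrent(1, 100)
	fmt.Println(r.finished(1)) // false: 50 rows still unwritten
	_ = r.UpdateCurrent(1, 50)
	fmt.Println(r.finished(1)) // true
}
```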

The checkpoint manager spawns a background goroutine that periodically persists the checkpoint info to the system table mysql.tidb_ddl_reorg.
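
The background goroutine could look roughly like the following sketch. The `flusher` type and its methods are hypothetical, and the `persist` callback stands in for the write to mysql.tidb_ddl_reorg. The 10-minute interval comes from a later commit in this PR ("update the checkpoint flush interval to 10 min"); the final flush on close is an assumption of this sketch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// flushInterval mirrors the interval mentioned in the commit history.
const flushInterval = 10 * time.Minute

// flusher periodically hands the current row count to persist.
type flusher struct {
	mu      sync.Mutex
	rows    int
	persist func(rows int) // stand-in for updating the system table
	cancel  context.CancelFunc
	done    chan struct{}
}

// startFlusher spawns the background goroutine.
func startFlusher(interval time.Duration, persist func(rows int)) *flusher {
	ctx, cancel := context.WithCancel(context.Background())
	f := &flusher{persist: persist, cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(f.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				f.flush()
			case <-ctx.Done():
				f.flush() // final flush so the latest progress is not lost
				return
			}
		}
	}()
	return f
}

// add records newly written rows.
func (f *flusher) add(n int) {
	f.mu.Lock()
	f.rows += n
	f.mu.Unlock()
}

// flush persists a snapshot of the current progress.
func (f *flusher) flush() {
	f.mu.Lock()
	rows := f.rows
	f.mu.Unlock()
	f.persist(rows)
}

// Close stops the goroutine and waits for the final flush.
func (f *flusher) Close() {
	f.cancel()
	<-f.done
}

func main() {
	f := startFlusher(flushInterval, func(rows int) {
		fmt.Println("persisted rows:", rows)
	})
	f.add(100)
	f.Close() // prints "persisted rows: 100"
}
```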

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2023-04-03T08:44:32Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

Benjamin2037
wjhuang2016

ti-chi-bot · 2023-04-03T08:44:32Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

hawkingrei · 2023-04-07T03:15:04Z

/test all

tangenta · 2023-04-10T05:46:41Z

/retest

ddl/backfilling.go

ddl/ingest/checkpoint.go

ddl/backfilling_scheduler.go

wjhuang2016 · 2023-04-11T06:15:47Z

ddl/ingest/checkpoint.go

+)
+
+// CheckpointManager is an interface to manage checkpoints.
+type CheckpointManager interface {


We can remove this interface since it's not used in distributed reorg.

We can add a no-op implementation for distributed reorg.
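
For illustration, a no-op implementation of the interface from the PR description might look like the sketch below. `Key` is a local stand-in for `kv.Key` so the example compiles on its own, and `noopCheckpointManager` is a hypothetical name; whether the PR ended up adding such a type is not shown in this thread.

```go
package main

import "fmt"

// Key is a stand-in for kv.Key to keep the sketch self-contained.
type Key []byte

// CheckpointManager mirrors the interface from the PR description.
type CheckpointManager interface {
	IsComplete(taskID int, start, end Key) bool
	UpdateTotal(taskID int, added int, last bool)
	UpdateCurrent(taskID int, added int) error
}

// noopCheckpointManager never skips a task and records nothing.
type noopCheckpointManager struct{}

// IsComplete always returns false, so every task is processed.
func (noopCheckpointManager) IsComplete(int, Key, Key) bool { return false }

// UpdateTotal ignores the reported rows.
func (noopCheckpointManager) UpdateTotal(int, int, bool) {}

// UpdateCurrent ignores the reported rows and never fails.
func (noopCheckpointManager) UpdateCurrent(int, int) error { return nil }

// Compile-time check that the no-op type satisfies the interface.
var _ CheckpointManager = noopCheckpointManager{}

func main() {
	var mgr CheckpointManager = noopCheckpointManager{}
	fmt.Println(mgr.IsComplete(1, Key("a"), Key("z"))) // false
	fmt.Println(mgr.UpdateCurrent(1, 10))              // <nil>
}
```

With such a type, callers would not need a nil check before each call, at the cost of keeping the interface around.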

Benjamin2037

LGTM

wjhuang2016 · 2023-04-11T12:25:27Z

ddl/backfilling_scheduler.go

@@ -110,6 +111,8 @@ func (b *txnBackfillScheduler) setupWorkers() error {
 }

 func (b *txnBackfillScheduler) sendTask(task *reorgBackfillTask) {
+	b.taskMaxID++


Why reallocate the task ID?

Because it needs to be unique for the lifetime of the DDL job, not just within a task batch.
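
A small sketch of the idea: the counter lives outside any single batch, so IDs keep increasing across batches. The `taskIDAllocator` type and `sendBatch` helper are hypothetical; in the diff above the counter is the `taskMaxID` field on the scheduler.

```go
package main

import "fmt"

// taskIDAllocator hands out IDs that stay unique for the whole job.
type taskIDAllocator struct {
	maxID int
}

// alloc returns the next unused task ID.
func (a *taskIDAllocator) alloc() int {
	a.maxID++
	return a.maxID
}

// sendBatch simulates sending one batch of n tasks and returns their IDs.
func sendBatch(a *taskIDAllocator, n int) []int {
	ids := make([]int, 0, n)
	for i := 0; i < n; i++ {
		ids = append(ids, a.alloc())
	}
	return ids
}

func main() {
	a := &taskIDAllocator{}
	fmt.Println(sendBatch(a, 3)) // [1 2 3]
	fmt.Println(sendBatch(a, 2)) // [4 5]: IDs do not restart per batch
}
```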

wjhuang2016 · 2023-04-11T12:27:01Z

ddl/backfilling_scheduler.go

@@ -288,6 +299,12 @@ func (b *ingestBackfillScheduler) setupWorkers() error {
 		return errors.Trace(errors.New("cannot get lightning backend"))
 	}
 	b.backendCtx = bc
+	mgr, err := ingest.NewCheckpointManager(b.ctx, bc, b.sessPool, job.ID,


The manager shouldn't be set in the distributed case.

Let me resolve the conflict after #42753 is merged.

tangenta · 2023-04-12T09:46:08Z

/merge

ti-chi-bot · 2023-04-12T09:46:12Z

This pull request has been accepted and is ready to merge.

Commit hash: e535400

tangenta · 2023-04-12T11:42:57Z

/retest

tangenta · 2023-04-12T12:01:54Z

/merge

ti-chi-bot · 2023-04-12T12:01:58Z

This pull request has been accepted and is ready to merge.

Commit hash: 9ef599f

tangenta · 2023-04-12T12:52:37Z

/retest

ti-chi-bot · 2023-04-14T18:17:06Z

@tangenta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| idc-jenkins-ci-tidb/unit-test | `92dc8b5` | link | true | `/test unit-test` |
| idc-jenkins-ci-tidb/mysql-test | `e535400` | link | unknown | `/test mysql-test` |

ddl: support checkpoint for ingest mode

2b185f6

ti-chi-bot added the do-not-merge/needs-linked-issue, release-note-none, do-not-merge/work-in-progress, and size/XXL labels Apr 3, 2023

tangenta added 2 commits April 4, 2023 15:42

ddl: fix data inconsistency issue

73ddb5d

fix test for build and refine code

31a774a

ti-chi-bot removed the do-not-merge/needs-linked-issue label Apr 4, 2023

make job.RowCount more accurate

e0f7dbb

ti-chi-bot added the needs-rebase label Apr 4, 2023

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

6a9b52e

…int-v2

ti-chi-bot removed the needs-rebase label Apr 4, 2023

tangenta added 2 commits April 4, 2023 23:07

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

d85eb4b

…int-v2

move checkpoint to ingest package

d3241b7

tangenta mentioned this pull request Apr 6, 2023

ddl: support checkpoint for ingest mode of adding index #42350

Closed

12 tasks

add part of test for checkpoint manager

0b5b1ab

tangenta marked this pull request as ready for review April 6, 2023 13:01

ti-chi-bot removed the do-not-merge/work-in-progress label Apr 6, 2023

tangenta added 3 commits April 6, 2023 22:08

add more test for checkpoint manager

a8d490f

fix linter

c14674e

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

a378b11

…int-v2

update bazel

1abad96

tangenta requested review from Benjamin2037 and wjhuang2016 April 7, 2023 07:27

Benjamin2037 reviewed Apr 10, 2023

Benjamin2037 reviewed Apr 10, 2023

Benjamin2037 reviewed Apr 11, 2023

wjhuang2016 reviewed Apr 11, 2023

tangenta added 3 commits April 11, 2023 15:46

remove redundant file

9a3ca75

refine code

87987c6

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

92dc8b5

…int-v2

Benjamin2037 approved these changes Apr 11, 2023

ti-chi-bot added the status/LGT1 label Apr 11, 2023

wjhuang2016 reviewed Apr 11, 2023

tangenta added 4 commits April 12, 2023 10:10

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

b08d9b9

…int-v2

use checkpoint manager only if distributed exec is disabled

a562d83

update the checkpoint flush interval to 10 min

f3f2b55

add task ID allocator

e535400

wjhuang2016 approved these changes Apr 12, 2023

ti-chi-bot added status/LGT2 and removed status/LGT1 labels Apr 12, 2023

ti-chi-bot added the status/can-merge label Apr 12, 2023

Merge remote-tracking branch 'upstream/master' into add-index-checkpo…

9ef599f

…int-v2

ti-chi-bot removed the status/can-merge label Apr 12, 2023

ti-chi-bot added the status/can-merge label Apr 12, 2023

ti-chi-bot merged commit 7aac6ab into pingcap:master Apr 12, 2023
11 checks passed

This was referenced Apr 17, 2023

The row count is not updated in time during adding index #43102

Closed

system-variables: update tidb_ddl_enable_fast_reorg pingcap/docs#13309

Merged

system-variables: update tidb_ddl_enable_fast_reorg pingcap/docs-cn#13756

Merged
