
lightning: generate and send region job more smoothly #42780

Merged
merged 24 commits into from Apr 13, 2023

Conversation

lance6716
Contributor

@lance6716 lance6716 commented Apr 3, 2023

What problem does this PR solve?

Issue Number: close #42456

Problem Summary:

What is changed and how it works?

  1. In the old implementation we generated all region jobs at once, processed them, generated jobs for the unfinished ranges again, processed again, and so on. To reduce the lag caused by generation, we now start processing jobs as soon as a key range has produced a job, and start regenerating a job as soon as it needs to be regenerated (for example, when the region has changed). Generation and processing are done concurrently.
  2. unfinishedRange is removed, because there is no longer a barrier that stops all processing before unfinishedRange is used to generate jobs. Now, after a job is initially generated, its key range will either be fully finished by a worker or shrunk by a partial ingestion. A job may also be split into multiple jobs whose combined key range equals that of the original job.
  3. To let the workers schedule a backing-off job for later processing, a regionJobRetryer is introduced. The workflow now becomes
	/*
	   [prepareAndSendJob]-----jobToWorkerCh--->[workers]
	                        ^                       |
	                        |                jobFromWorkerCh
	                        |                       |
	                        |                       v
	               [regionJobRetryer]<--[dispatchJobGoroutine]-->done
	*/

Reviewers, please help check whether there is a deadlock and whether there is a better way to implement this.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Member

ti-chi-bot commented Apr 3, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • D3Hunter
  • gozssky

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-cherry-pick-release-7.0 and removed do-not-merge/needs-linked-issue labels Apr 3, 2023
Signed-off-by: lance6716 <lance6716@gmail.com>

func (local *local) doImport(ctx context.Context, engine *Engine, regionRanges []Range, regionSplitSize, regionSplitKeys int64) error {
/*
[prepareAndSendJob]-----jobToWorkerCh--->[workers]
Contributor Author
@gozssky Can you give some suggestions on the pattern? Currently a regionJob has different stages, multiple workers hold the resources (PD client, ...) that can advance the stage, and the producer and consumers communicate through channels.

Contributor Author
I can think of two design patterns.

One is to use goroutines that acquire a "worker resource" and then start the job; when the job needs to sleep for some period before retrying, it releases the worker resource, like

for _, j := range jobs {
  w := workerPool.Acquire()
  j := j // capture the loop variable (needed before Go 1.22)
  go func() {
    j.run(w)
    w.Release()
  }()
}

and

// when the job meets a retryable error
j.w.Release()
sleep()
j.w = workerPool.Acquire()

In this method, I want to rely on Go's goroutine scheduling to keep N system threads busy, where N is the worker count. But I'm afraid we would spawn too many sleeping goroutines and affect the runtime.

In this refactoring PR I want to do manual scheduling instead: try to keep the N worker goroutines always busy, and put a job back for later when it needs to retry. To solve #42456, I should move every time-consuming piece of work into the workers to make use of concurrency, and remove the barrier in the single-threaded producer goroutine. Do you have any suggestions? @gozssky @D3Hunter

@lance6716 lance6716 changed the title [WIP]lightning: generate region job more smoothly [WIP]lightning: generate and send region job more smoothly Apr 4, 2023
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 5, 2023
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2023
@lance6716 lance6716 changed the title [WIP]lightning: generate and send region job more smoothly lightning: generate and send region job more smoothly Apr 10, 2023
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2023
@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/retest

@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/hold

need to fix #42780 (comment)

@ti-chi-bot ti-chi-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 12, 2023
@lance6716
Contributor Author

/unhold

@ti-chi-bot ti-chi-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 12, 2023
@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Contributor Author

/lgtm cancel

fix a problem that sending to putBackCh will block retryer.push 45b208d

Because retryer.push must access the inner queue, sending to putBackCh must not be done while holding that queue's lock, otherwise it would block retryer.push. So we should pop the job into a temporary toPutBack first, to reduce the locking scope. @D3Hunter @gozssky

@D3Hunter @gozssky PTAL

@lance6716
Contributor Author

/run-integration-br-test

@lance6716
Copy link
Contributor Author

/retest

@lance6716
Contributor Author

/run-integration-br-test

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 13, 2023
@lance6716
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3f90366

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 13, 2023
@ti-chi-bot ti-chi-bot merged commit 44aa4cf into pingcap:master Apr 13, 2023
15 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.0: #43035.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Apr 13, 2023
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Labels
needs-cherry-pick-release-7.0 release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ddl: compared with v6.6.0, v7.0.0 has 10% - 13% performance regression
4 participants