
Adding index in parallel #19386

Closed
djshow832 opened this issue Aug 24, 2020 · 15 comments

djshow832 (Contributor) commented Aug 24, 2020

Description

Currently, adding an index is processed only on the DDL owner. When the table is huge, this takes too much time. We can leverage the computing capability of the whole cluster to speed it up.

Here's a rough plan:

  1. When the DDL owner receives an add-index job, it looks up the statistics to figure out whether the table is huge. If not, it just executes the job serially as before.
  2. If the table is huge, the owner splits the job into many subjobs, each of which takes care of a relatively small range of data. The subjobs are put into a queue.
  3. The owner sends the subjobs to all the other TiDB instances. Each TiDB instance receives only one subjob at a time. Note that because the number of subjobs is greater than the number of TiDB instances, the subjobs are not all sent at once.
  4. Each subjob is further split into multiple ranges, each of which corresponds to a transaction. Once a transaction is done, TiDB persists the progress to TiKV.
  5. Once a TiDB instance has processed a subjob, it notifies the owner to fetch the next subjob. In this way, the more efficient a TiDB instance is, the more subjobs it processes.
  6. If a subjob fails on one TiDB instance, that instance notifies the owner to roll back. The owner notifies all the TiDB instances to stop, and the rollback job is executed in the background on the owner.
  7. If a TiDB instance hasn't responded for some time, it may be down. The owner sends its subjob to another TiDB instance, which reads the persisted progress and continues the subjob.
  8. If the owner goes down, the new owner reads the subjob queue and continues the whole job.
  9. Once all the subjobs are done, the owner returns success. (A rough sketch of this dispatch loop follows below.)
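To make steps 3 and 5 concrete, here is a minimal Go sketch of the owner-side dispatch loop, under the assumption of hypothetical SubJob and Instance types (they are not actual TiDB structures): each instance holds at most one subjob at a time and pulls the next one as soon as it finishes, so a faster instance naturally processes more subjobs.

```go
// A minimal, hypothetical sketch of the owner-side dispatch described above.
// SubJob, Instance, and the channels are illustrative only, not TiDB types.
package main

import "fmt"

// SubJob covers a small range of the table's data.
type SubJob struct {
	ID       int
	StartKey string
	EndKey   string
}

// Instance represents a TiDB server that runs one subjob at a time.
type Instance struct {
	Name string
}

// dispatch hands out subjobs one at a time: whenever an instance reports it
// is done, it pulls the next subjob from the queue.
func dispatch(queue []SubJob, instances []Instance) {
	pending := make(chan SubJob, len(queue))
	for _, sj := range queue {
		pending <- sj
	}
	close(pending)

	done := make(chan string)
	for _, inst := range instances {
		go func(inst Instance) {
			// Each instance takes at most one subjob at a time.
			for sj := range pending {
				// In the real design this would backfill the index range
				// transaction by transaction and persist progress to TiKV.
				fmt.Printf("%s finished subjob %d [%s, %s)\n",
					inst.Name, sj.ID, sj.StartKey, sj.EndKey)
			}
			done <- inst.Name
		}(inst)
	}
	for range instances {
		<-done
	}
}

func main() {
	queue := []SubJob{{1, "a", "f"}, {2, "f", "m"}, {3, "m", "t"}, {4, "t", "z"}}
	dispatch(queue, []Instance{{"tidb-0"}, {"tidb-1"}})
}
```

A channel is used here only to model the queue; in the real design the queue and the per-transaction progress would be persisted to TiKV so that a crashed instance's subjob can be reassigned (steps 4 and 7).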

Score

  • 6600

SIG Slack Channel

You can join #sig-ddl on Slack in your spare time to discuss the task and get help from mentors or others.

Mentor

Contact the mentors: #tidb-challenge-program channel in TiDB Community Slack Workspace

Recommended Skills

  • DDL
  • Transaction
  • Golang

Learning Materials

ghost commented Aug 25, 2020

+1 I like this design over any static weighting because it responds well to environments where virtual machines could be migrated around or performance could change due to noisy neighbors/steal.

TszKitLo40 (Contributor) commented:

/pick-up

ghost mentioned this issue Sep 12, 2020
TszKitLo40 (Contributor) commented:

@djshow832 I want to try this issue. Can it be picked up?

@ti-challenge-bot ti-challenge-bot bot added picked and removed picked labels Sep 14, 2020
djshow832 (Contributor, Author) commented Sep 14, 2020

> @djshow832 I want to try this issue. Can it be picked up?

Of course, please try picking up again. @TszKitLo40

ti-challenge-bot commented:

The description of the issue was updated, but it still has some problems.


Tip:
You need to ensure that the issue description follows this template:

```
## Score

- ${score}

## Mentor

- ${mentor}
```

Warning: The description format for this issue is wrong.

TszKitLo40 (Contributor) commented:

/pick-up

Rustin170506 (Member) commented Sep 18, 2020

> /pick-up

@TszKitLo40 Sorry, you cannot pick up this issue because you do not have a team.

This is a known issue. I will fix it later.

The HPTC challenge program is only for teams, so you can try to join one.

ben2077 commented Sep 18, 2020

/pick-up

ti-challenge-bot commented:

Pick up success.

aierui (Contributor) commented Sep 19, 2020

/pick-up

ti-challenge-bot commented:

This issue has already been picked up by CodingBen.

ben2077 commented Sep 25, 2020

For the owner:
It first checks whether the job is the parent job; if it is, it only does the following three things (only the owner can execute the parent job):

  1. It gets all the subTasks from the subTaskQueue and checks each one's 'runner' status, which represents the TiDB instance executing that subTask (if the instance is down, it resets the runner to nil and sets the status to unclaimed).
  2. It checks the subTasks' statuses one by one and resets the runner of every failed subTask to nil and its status to unclaimed.
  3. If the owner finds that all the subTasks in the queue with the same jobId are done, it finishes the job.

For a TiDB instance that is not the owner:
Every TiDB instance in the cluster periodically reads the subTaskQueue, claims at most one unclaimed subTask at a time, and executes it. (A rough sketch of both loops follows below.)
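To illustrate the loops described above, here is a rough Go sketch; SubTask, its status values, the subTaskQueue-as-a-slice, and isDown are hypothetical stand-ins rather than real TiDB code.

```go
// Hypothetical sketch of the owner check loop and the worker claim step.
package main

import "fmt"

type status int

const (
	unclaimed status = iota
	running
	failed
	finished
)

// SubTask is one slice of the parent add-index job.
type SubTask struct {
	JobID  int
	ID     int
	Runner string // TiDB instance currently executing it, "" if none
	Status status
}

// ownerCheck implements the three owner-only steps: reset subtasks whose
// runner is down, reset failed subtasks, and report whether every subtask
// with the same jobID is done so the caller can finish the job.
func ownerCheck(queue []SubTask, jobID int, isDown func(string) bool) bool {
	allDone := true
	for i := range queue {
		t := &queue[i]
		if t.JobID != jobID {
			continue
		}
		if t.Status == running && isDown(t.Runner) {
			t.Runner, t.Status = "", unclaimed // step 1: runner is down
		}
		if t.Status == failed {
			t.Runner, t.Status = "", unclaimed // step 2: retry failed subtask
		}
		if t.Status != finished {
			allDone = false
		}
	}
	return allDone // step 3: finish the job when true
}

// workerClaim is run by every non-owner instance: claim at most one
// unclaimed subtask and execute it.
func workerClaim(queue []SubTask, me string) *SubTask {
	for i := range queue {
		if queue[i].Status == unclaimed {
			queue[i].Runner, queue[i].Status = me, running
			return &queue[i]
		}
	}
	return nil
}

func main() {
	queue := []SubTask{
		{JobID: 7, ID: 1, Runner: "tidb-1", Status: running},
		{JobID: 7, ID: 2, Status: failed},
		{JobID: 7, ID: 3, Status: finished},
	}
	down := func(name string) bool { return name == "tidb-1" }
	fmt.Println("all done:", ownerCheck(queue, 7, down))
	if t := workerClaim(queue, "tidb-2"); t != nil {
		fmt.Printf("tidb-2 claimed subtask %d\n", t.ID)
	}
}
```

In practice the queue and the claim operation would have to go through transactional storage so that two instances cannot claim the same subtask.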

ben2077 commented Sep 25, 2020

/pick-up

ti-challenge-bot commented:

This challenge program issue is already in the assignment flow and development has started, so you cannot pick up this issue. You can try other issues.

ben2077 commented Sep 26, 2020

The status quo as I understand it:
The original add-index logic is based on a single physical table, so the progress of an add index job is stored as a triplet:
1. JobId-StartHandle (how far the current job has progressed)
2. JobId-EndHandle
3. JobId-PhysicalTableID
This design limits the system to processing only one (add index DDL job, physical table) pair at a time, which is why the add index task for a partitioned table also handles partitions one by one. But it is difficult and inefficient to run such logic across a cluster, so we may redesign the reorg information as follows:
1. JobId-SubTaskId-StartHandle
2. JobId-SubTaskId-EndHandle
3. JobId-SubTaskId-AddedCount (for progress statistics)
In this way, the backfill index job can be resumed at the SubTask-Range level instead of the previous Job-Partition level.
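A small Go sketch of what the proposed per-subtask reorg records could look like when keyed by (JobId, SubTaskId); the type and field names are illustrative only, and an in-memory map stands in for the TiKV-persisted store.

```go
// Hypothetical sketch of the proposed reorg-info layout keyed by
// (JobID, SubTaskID); these are not real TiDB types.
package main

import "fmt"

// subTaskReorgKey identifies the progress record of one subtask.
type subTaskReorgKey struct {
	JobID     int64
	SubTaskID int64
}

// subTaskReorgInfo replaces the old per-job triplet: progress is kept per
// (job, subtask) range instead of per (job, physical table).
type subTaskReorgInfo struct {
	StartHandle int64 // first row handle of the range still to backfill
	EndHandle   int64 // last row handle of the range
	AddedCount  int64 // rows backfilled so far, for progress statistics
}

func main() {
	// In TiDB this would be persisted to TiKV so another instance can
	// resume the subtask; a map stands in for that store here.
	progress := map[subTaskReorgKey]subTaskReorgInfo{
		{JobID: 7, SubTaskID: 1}: {StartHandle: 0, EndHandle: 10000, AddedCount: 2500},
		{JobID: 7, SubTaskID: 2}: {StartHandle: 10000, EndHandle: 20000, AddedCount: 0},
	}
	for k, v := range progress {
		fmt.Printf("job %d subtask %d: backfilled %d rows in [%d, %d)\n",
			k.JobID, k.SubTaskID, v.AddedCount, v.StartHandle, v.EndHandle)
	}
}
```

With this layout, resuming a failed or reassigned subtask only needs its own (StartHandle, EndHandle, AddedCount) record, so recovery happens at the SubTask-Range level rather than per job.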
