
Make ddl operations for placement rule atomic #28159

Closed
Tracked by #18030
lcwangchao opened this issue Sep 17, 2021 · 12 comments · Fixed by #30590

@lcwangchao
Collaborator

lcwangchao commented Sep 17, 2021

Enhancement

Currently, the DDL operations for placement rules are not atomic. In the current implementation, if sending the request to PD fails, the meta operations are rolled back:

tidb/ddl/partition.go

Lines 1131 to 1135 in 680de92

err = infosync.PutRuleBundles(context.TODO(), bundles)
if err != nil {
job.State = model.JobStateCancelled
return ver, errors.Wrapf(err, "failed to notify PD the placement rules")
}

This works well most of the time. However, it is possible that PD has received and processed the request successfully but the response packet is lost in the network. In that case, the meta operations are rolled back, leaving the state inconsistent with PD.

Some ways to resolve the problem:

  • Roll back the PD state when the DDL job fails, even if we got a failed response. In this case, the PD rollback should be idempotent and should be retried infinitely if it keeps failing (see the sketch after this list).
  • Leave it to users. In this case, we should allow the user to retry the operation to make the state consistent manually. It would also be good to provide commands to check the consistency.
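
A rough Go sketch of the first option (the names below are placeholders, not the real infosync API): because the rollback payload is the complete previous bundle set, resending it is idempotent, so it can be retried with backoff until PD acknowledges it.

package placement

import (
	"context"
	"errors"
	"time"
)

// Bundle is a placeholder for the placement rule bundle type kept in PD.
type Bundle struct {
	ID    string
	Rules []string
}

// putRuleBundles is a hypothetical stand-in for the real PD request
// (infosync.PutRuleBundles in TiDB).
func putRuleBundles(ctx context.Context, bundles []*Bundle) error {
	return errors.New("not implemented in this sketch")
}

// rollbackPDBundles restores the previous bundles on PD when the DDL job
// fails. Because the payload is the complete previous state, resending it is
// idempotent, so it can be retried with backoff until PD acknowledges it.
func rollbackPDBundles(ctx context.Context, prev []*Bundle) error {
	backoff := time.Second
	for {
		if err := putRuleBundles(ctx, prev); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only if the caller cancels
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
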
@xhebox
Contributor

xhebox commented Sep 17, 2021

You should add another option: retry sending requests to PD until they succeed. The API is not only atomic but also idempotent. That means we can retry the DDL job forever until it succeeds, i.e. blocking.

It can be achieved with a middle state, and that is what we wanted to do at the very beginning of the previous work plan. But we decided not to, because in the previous version there was no persistence on TiKV: the infoschema is merely a cache, so it cannot become inconsistent.

But since we now have persistence, this plan becomes useful again.

The reason I did not choose rollback: our DDL cannot be canceled in many cases anyway. We are not good at rollback; it is hard.

BTW, please add it to the tracking issue, maybe in the rule engine section.

@morgo
Contributor

morgo commented Sep 20, 2021

Yes, I agree. We should retry requests n times before giving up and rolling back.

@morgo
Contributor

morgo commented Sep 27, 2021

We discussed this today in our weekly meeting. For direct placement options, it is more straightforward because we can retry the placement option, except that we need to be careful that the retrying does not happen in the DDL queue (which would block other DDL jobs).

For altering a PLACEMENT POLICY, we need to consider that it could cascade to many options, which may fail or require retrying.

@AilinKid can you please share your suggestion?

@AilinKid
Contributor

AilinKid commented Sep 27, 2021

How about moving the PD sync logic out of the DDL txn? I mean, do the sync after the DDL txn is committed for sure.

This follows the same logic as what GC does now. For example, when dropping a table, after the DDL is already committed, the following action is to write the table ID into the gc_delete_range physical table, which is done in finishDDLJob.

For the case here, syncing PD is a kind of afterward action, which requires the DDL txn to be done for sure. When the DDL is done, the bundle can be found in the meta, which is already the source of truth; syncing PD is then just a matter of time. (This avoids retrying the whole DDL job, which is time-costly.)
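
A rough sketch of this idea (the function and type names below are placeholders, not the real TiDB code): the PD request is issued only after the meta txn has committed, so a lost response only delays convergence instead of contradicting a rolled-back meta.

package placement

import "context"

// Job and Bundle are placeholders for the DDL job and placement bundle types.
type Job struct{ ID int64 }
type Bundle struct{ ID string }

// syncBundlesToPD is a hypothetical stand-in for the PD request.
func syncBundlesToPD(ctx context.Context, bundles []*Bundle) error { return nil }

// finishDDLJob sketches the proposal: the bundles are pushed to PD only after
// the DDL meta txn has already committed, the same way gc_delete_range entries
// are written for DROP TABLE. If the push fails, the meta (already the source
// of truth) is untouched and the sync can simply be retried later.
func finishDDLJob(ctx context.Context, job *Job, bundles []*Bundle) error {
	// ... the meta txn for this job has been committed before we get here ...
	if len(bundles) == 0 {
		return nil
	}
	// On failure this is retried later; it never rolls back the committed meta.
	return syncBundlesToPD(ctx, bundles)
}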

@xhebox
Contributor

xhebox commented Sep 27, 2021

How about moving the PD sync logic out of the DDL txn? I mean, do the sync after the DDL txn is committed for sure.

This follows the same logic as what GC does now. For example, when dropping a table, after the DDL is already committed, the following action is to write the table ID into the gc_delete_range physical table, which is done in finishDDLJob.

For the case here, syncing PD is a kind of afterward action, which requires the DDL txn to be done for sure. When the DDL is done, the bundle can be found in the meta, which is already the source of truth; syncing PD is then just a matter of time. (This avoids retrying the whole DDL job, which is time-costly.)

I like the idea, but the order of DDL jobs still needs to be kept. Bundles do not come with a monotonic ID like key ranges, which means the jobs either cannot be executed in parallel or must take care of the ordering. Otherwise, drop policy for table -> create policy for table may be synchronized as create policy for table -> drop policy for table.

Serial execution does not make a difference, and taking care of the order may introduce new inconsistencies. BTW, TiFlash now works by pulling meta from TiDB regularly (one persistence).

@lcwangchao
Collaborator Author

Summary of some candidates:

  • Retry infinitely until success. This achieves eventual consistency. However, it will block other DDL operations if the PD request continually fails. Moreover, the PD request may fail for non-network reasons; for example, we leave some parameter checks to PD instead of checking them in the TiDB layer.

  • Retry some number of times and then roll back. @xhebox has commented on this approach: "The reason I did not choose rollback: our DDL cannot be canceled in many cases anyway. We are not good at rollback; it is hard." But I think it is not necessary to add a rollback stage for every DDL operation. The first stage of a DDL operation can always be rolled back because no changes have been committed yet. So we can choose the "last stage the DDL can roll back from" and do the PD requests there (see the sketch after this list). This approach still has a small probability of leaving the state inconsistent when there is severe network jitter (the request succeeds but the response packet is lost).

  • Sync with PD outside of DDL operations, retrying infinitely in a standalone process. In this way the DDL operation will never be blocked, but we have to consider ordering more carefully, and it makes the system more complex.
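
A rough Go sketch of the second candidate (placeholder names, not the real infosync API): the PD request is issued at the last stage that can still roll back and retried a bounded number of times; if it keeps failing, the caller cancels the job and rolls back the meta changes.

package placement

import (
	"context"
	"time"
)

// Bundle is a placeholder for the placement rule bundle type.
type Bundle struct{ ID string }

// putRuleBundles is a hypothetical stand-in for infosync.PutRuleBundles.
func putRuleBundles(ctx context.Context, bundles []*Bundle) error { return nil }

// putBundlesWithRetry issues the PD request at the last DDL stage that can
// still be rolled back, retrying a bounded number of times. If every attempt
// fails, the caller cancels the job and the meta changes made so far are
// rolled back (the second candidate above).
func putBundlesWithRetry(ctx context.Context, bundles []*Bundle, maxRetry int) error {
	var err error
	for i := 0; i < maxRetry; i++ {
		if err = putRuleBundles(ctx, bundles); err == nil {
			return nil
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return err // caller sets the job state to cancelled and rolls back
}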

@AilinKid
Contributor

AilinKid commented Nov 13, 2021

Summary of some candidates:

  • Retry infinitely until success. This achieves eventual consistency. However, it will block other DDL operations if the PD request continually fails. Moreover, the PD request may fail for non-network reasons; for example, we leave some parameter checks to PD instead of checking them in the TiDB layer.
  • Retry some number of times and then roll back. @xhebox has commented on this approach: "The reason I did not choose rollback: our DDL cannot be canceled in many cases anyway. We are not good at rollback; it is hard." But I think it is not necessary to add a rollback stage for every DDL operation. The first stage of a DDL operation can always be rolled back because no changes have been committed yet. So we can choose the "last stage the DDL can roll back from" and do the PD requests there. This approach still has a small probability of leaving the state inconsistent when there is severe network jitter (the request succeeds but the response packet is lost).
  • Sync with PD outside of DDL operations, retrying infinitely in a standalone process. In this way the DDL operation will never be blocked, but we have to consider ordering more carefully, and it makes the system more complex.

My version of the third way is: do it in finishDDLJob, not in a standalone goroutine but in the main flow. The benefit is that it can be done after the KV meta txn is committed for sure (strictly in the same sequence as the DDL jobs are done).

Or we can make it async (in a standalone goroutine) and strictly control the syncing sequence to match the order in which the DDL jobs are done (another subtask queue?).

@xhebox
Contributor

xhebox commented Nov 14, 2021

My version of the third way is: do it in finishDDLJob, not in a standalone goroutine but in the main flow. The benefit is that it can be done after the KV meta txn is committed for sure (strictly in the same sequence as the DDL jobs are done).

Or we can make it async (in a standalone goroutine) and strictly control the syncing sequence to match the order in which the DDL jobs are done (another subtask queue?).

Where the third method is done is a trivial detail, but it will be complex to ensure order and consistency.

Moreover, the PD request may fail for non-network reasons; for example, we leave some parameter checks to PD instead of checking them in the TiDB layer.
This approach still has a small probability of leaving the state inconsistent when there is severe network jitter (the request succeeds but the response packet is lost).

We can tell whether it is a timeout or not. In case you lost the response, you can send the request again to get the correct response.
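
For example, a small sketch of such a check (a hypothetical helper, not an existing TiDB function): only timeouts and other network errors are treated as "the response may have been lost" and resent; a definitive answer from PD is not retried.

package placement

import (
	"context"
	"errors"
	"net"
)

// isRetryable sketches the distinction above: a timeout or other network
// error means the response may simply have been lost, so the idempotent PD
// request can be resent; a definitive answer from PD (e.g. a parameter
// validation error) should not be retried.
func isRetryable(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true // connection reset, timeout, etc.
	}
	return false // PD answered; retrying will not change the result
}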

But I think it is not necessary to add a rollback stage for every DDL operation. The first stage of a DDL operation can always be rolled back because no changes have been committed yet.

The hard part is determining the state to roll back to.

@lcwangchao
Collaborator Author

We can tell whether it is a timeout or not. In case you lost the response, you can send the request again to get the correct response.

Yes, we can tell whether it is a network error or a request error. But what I mean is that if it is a request error, it is meaningless to retry because it will always fail. So we still need to roll back.

The hard part is determining the state to roll back to.

Actually, we do not need to add an extra rollback stage for placement. If a DDL operation does not already support rollback, we just need to send the PD request in the first stage.

@xhebox
Contributor

xhebox commented Nov 15, 2021

Yes, we can tell whether it is a network error or a request error. But what I mean is that if it is a request error, it is meaningless to retry because it will always fail. So we still need to roll back.

So you can just retry if it is a network error, but fail if it is, for example, an HTTP 200 response (PD actually answered).

Actually, we do not need to add an extra rollback stage for placement. If a DDL operation does not already support rollback, we just need to send the PD request in the first stage.

What if it does support rollback? What is the original state?

@lcwangchao
Collaborator Author

What if it does support rollback? What is the original state?
It's easier to handle a job that supports rollback: put the PD request operation in the last stage.

@xhebox
Contributor

xhebox commented Nov 15, 2021

What if it does support rollback? What is the original state?
It's easier to handle a job that supports rollback: put the PD request operation in the last stage.

Yeah, you could do so, if you can construct the old placement rule bundles.
