
Potential data inconsistency that is caused by multiple writer #9440

Closed

overvenus opened this issue Jul 27, 2023 · 6 comments

Comments

@overvenus
Member

What did you do?

When something goes wrong, TiCDC may end up with multiple nodes replicating the same table to the same downstream. E.g., #9344.

We need a mechanism to protect access to the downstream. AFAIK, there are two ways:

  1. Fencing token [1]. It has a negative impact on performance, and some downstreams may not be able to support it at all.
  2. Lease. Every node maintains a TTL that is renewed on each successful heartbeat. A processor must stop replicating as soon as possible once its TTL expires. If the owner finds that a processor has not sent a heartbeat within one TTL, it considers that processor down and waits one more TTL before scheduling that processor's tables elsewhere (see the sketch after the reference below).

[1]: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
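
To make option 2 concrete, here is a minimal, self-contained sketch of the lease idea in Go. This is not TiCDC code: `runProcessor`, `ownerCanReschedule`, and the stub heartbeat/replicate functions are hypothetical names invented for illustration.

```go
package main

import (
	"context"
	"log"
	"time"
)

const leaseTTL = 10 * time.Second

// runProcessor replicates data only while its lease is valid. The lease
// deadline is pushed forward on every successful heartbeat; once the
// deadline passes, replication stops as soon as possible.
func runProcessor(ctx context.Context, heartbeat, replicateBatch func(context.Context) error) {
	deadline := time.Now().Add(leaseTTL)
	ticker := time.NewTicker(leaseTTL / 3) // heartbeat a few times per TTL
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := heartbeat(ctx); err == nil {
				deadline = time.Now().Add(leaseTTL) // renew the lease
			}
		default:
			if time.Now().After(deadline) {
				log.Println("lease expired, stop replicating")
				return
			}
			if err := replicateBatch(ctx); err != nil {
				log.Println("replicate error:", err)
			}
		}
	}
}

// ownerCanReschedule is the owner-side rule: a processor that misses
// heartbeats for one TTL is considered down, but its tables are only
// rescheduled after one MORE TTL has passed, giving the old processor
// time to notice its own expiry and stop writing.
func ownerCanReschedule(lastHeartbeat time.Time) bool {
	return time.Since(lastHeartbeat) > 2*leaseTTL
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	heartbeat := func(context.Context) error { return nil } // stub: always succeeds
	replicate := func(context.Context) error {              // stub: pretend to write a batch
		time.Sleep(500 * time.Millisecond)
		return nil
	}
	runProcessor(ctx, heartbeat, replicate)
	log.Println("owner may reschedule:", ownerCanReschedule(time.Now().Add(-3*leaseTTL)))
}
```

The owner-side 2×TTL wait mirrors the description above: one TTL to decide the processor is down, and one more to give it time to stop on its own.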

What did you expect to see?

No data inconsistency.

What did you see instead?

Data inconsistency.

Versions of the cluster

All releases

@overvenus overvenus added type/bug This is a bug. area/ticdc Issues or PRs related to TiCDC. labels Jul 27, 2023
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Jul 27, 2023
@lance6716
Contributor

lance6716 commented Jul 27, 2023

#3737 (comment)

DM has met a similar problem before. Even though dm-worker has a TTL and a worker knows its lease has expired, the exit logic may not finish before the new worker starts.

@overvenus
Member Author

#3737 (comment)

DM has met a similar problem before. Even though dm-worker has a TTL and a worker knows its lease has expired, the exit logic may not finish before the new worker starts.

Good point, that is the weakness of the TTL approach. It is only a best-effort way to prevent data inconsistency; "ASAP" may never be fast enough in a case like "processor pause".
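
For comparison with option 1, here is a minimal fencing-token sketch in Go (again not TiCDC code: `downstream`, `write`, and the token values are invented for illustration). The point is that the stale-writer check happens on the downstream side, so it still holds even if the old processor resumes after an arbitrarily long pause.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleToken is returned when a write carries a token older than the
// newest one the downstream has accepted.
var ErrStaleToken = errors.New("stale fencing token: writer has been superseded")

// downstream stands in for the real sink; a real database would enforce the
// same rule with something like a token column plus a conditional UPDATE.
type downstream struct {
	mu      sync.Mutex
	highest uint64
	rows    map[string]string
}

// write accepts the row only if the caller's token is at least as new as the
// highest token seen so far, so a stale processor cannot overwrite newer data.
func (d *downstream) write(token uint64, key, value string) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if token < d.highest {
		return ErrStaleToken
	}
	d.highest = token
	d.rows[key] = value
	return nil
}

func main() {
	d := &downstream{rows: map[string]string{}}

	// The new processor was granted token 8; the old one still holds token 7.
	fmt.Println(d.write(8, "t1", "from-new-processor")) // <nil>
	fmt.Println(d.write(7, "t1", "from-old-processor")) // stale fencing token: ...
}
```

The drawback from option 1 applies here: the downstream has to be able to express this check (for example via a conditional write), and not every sink can.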

@fubinzh

fubinzh commented Aug 1, 2023

/severity major

@nongfushanquan
Contributor

Timeout handling has been implemented. It will take effect when such an event actually occurs.

@nongfushanquan
Contributor

/close

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 21, 2023

@nongfushanquan: Closing this issue.

In response to this:

/close


@ti-chi-bot ti-chi-bot bot closed this as completed Aug 21, 2023
Question and Bug Reports automation moved this from Need Triage to Done Aug 21, 2023