
Potential data inconsistency that is caused by multiple writer #9440

Closed

overvenus opened this issue Jul 27, 2023 · 6 comments

Comments

@overvenus
Member

What did you do?

When something goes wrong, TiCDC may end up with multiple nodes replicating the same table to the same downstream. E.g., #9344.

We need a mechanism to protect access to the downstream. AFAIK, there are two ways:

  1. Fencing token [1]. It has a negative impact on performance, and some downstreams may not be able to support it at all.
  2. Lease. Every node maintains a TTL that is renewed on each successful heartbeat. A processor must stop replicating as soon as possible once its TTL expires. If the owner finds that a processor has not sent a heartbeat within one TTL, it considers that processor down and waits one more TTL before scheduling that processor's tables elsewhere (see the sketch after the reference below).

[1]: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
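
To make option 2 concrete, here is a minimal, self-contained sketch of the lease idea in Go. This is not TiCDC code: `runProcessor`, `ownerCanReschedule`, and the stub heartbeat/replicate functions are hypothetical names invented for illustration.

```go
package main

import (
	"context"
	"log"
	"time"
)

const leaseTTL = 10 * time.Second

// runProcessor replicates data only while its lease is valid. The lease
// deadline is pushed forward on every successful heartbeat; once the
// deadline passes, replication stops as soon as possible.
func runProcessor(ctx context.Context, heartbeat, replicateBatch func(context.Context) error) {
	deadline := time.Now().Add(leaseTTL)
	ticker := time.NewTicker(leaseTTL / 3) // heartbeat a few times per TTL
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := heartbeat(ctx); err == nil {
				deadline = time.Now().Add(leaseTTL) // renew the lease
			}
		default:
			if time.Now().After(deadline) {
				log.Println("lease expired, stop replicating")
				return
			}
			if err := replicateBatch(ctx); err != nil {
				log.Println("replicate error:", err)
			}
		}
	}
}

// ownerCanReschedule is the owner-side rule: a processor that misses
// heartbeats for one TTL is considered down, but its tables are only
// rescheduled after one MORE TTL has passed, giving the old processor
// time to notice its own expiry and stop writing.
func ownerCanReschedule(lastHeartbeat time.Time) bool {
	return time.Since(lastHeartbeat) > 2*leaseTTL
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	heartbeat := func(context.Context) error { return nil } // stub: always succeeds
	replicate := func(context.Context) error {              // stub: pretend to write a batch
		time.Sleep(500 * time.Millisecond)
		return nil
	}
	runProcessor(ctx, heartbeat, replicate)
	log.Println("owner may reschedule:", ownerCanReschedule(time.Now().Add(-3*leaseTTL)))
}
```

The owner-side 2×TTL wait mirrors the description above: one TTL to decide the processor is down, and one more to give it time to stop on its own.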

What did you expect to see?

No data inconsistency.

What did you see instead?

Data inconsistency.

Versions of the cluster

All releases

@overvenus overvenus added type/bug This is a bug. area/ticdc Issues or PRs related to TiCDC. labels Jul 27, 2023
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Jul 27, 2023
@lance6716
Contributor

lance6716 commented Jul 27, 2023

#3737 (comment)

DM has met a similar problem before. Even though dm-worker has a TTL and a worker knows its lease has expired, the exit logic may not finish before the new worker starts.

@overvenus
Member Author

#3737 (comment)

DM has met a similar problem before. Even though dm-worker has a TTL and a worker knows its lease has expired, the exit logic may not finish before the new worker starts.

Good point, that is the weakness of the TTL approach. It is only a best-effort way to prevent data inconsistency; "ASAP" may never be fast enough in a case like "processor pause".
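
For comparison with option 1, here is a minimal fencing-token sketch in Go (again not TiCDC code: `downstream`, `write`, and the token values are invented for illustration). The point is that the stale-writer check happens on the downstream side, so it still holds even if the old processor resumes after an arbitrarily long pause.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleToken is returned when a write carries a token older than the
// newest one the downstream has accepted.
var ErrStaleToken = errors.New("stale fencing token: writer has been superseded")

// downstream stands in for the real sink; a real database would enforce the
// same rule with something like a token column plus a conditional UPDATE.
type downstream struct {
	mu      sync.Mutex
	highest uint64
	rows    map[string]string
}

// write accepts the row only if the caller's token is at least as new as the
// highest token seen so far, so a stale processor cannot overwrite newer data.
func (d *downstream) write(token uint64, key, value string) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if token < d.highest {
		return ErrStaleToken
	}
	d.highest = token
	d.rows[key] = value
	return nil
}

func main() {
	d := &downstream{rows: map[string]string{}}

	// The new processor was granted token 8; the old one still holds token 7.
	fmt.Println(d.write(8, "t1", "from-new-processor")) // <nil>
	fmt.Println(d.write(7, "t1", "from-old-processor")) // stale fencing token: ...
}
```

The drawback from option 1 applies here: the downstream has to be able to express this check (for example via a conditional write), and not every sink can.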

@fubinzh

fubinzh commented Aug 1, 2023

/severity major

@nongfushanquan
Contributor

Timeout handling has been implemented. It will take effect when such an event actually occurs.

@nongfushanquan
Contributor

/close

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 21, 2023

@nongfushanquan: Closing this issue.

In response to this:

/close


@ti-chi-bot ti-chi-bot bot closed this as completed Aug 21, 2023
Question and Bug Reports automation moved this from Need Triage to Done Aug 21, 2023