Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BatchResolveLock doesn't handle stale pessimistic locks with invalidated primary and may break data consistency #43243

Closed
MyonKeminta opened this issue Apr 20, 2023 · 4 comments
Assignees
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. affects-6.0 affects-6.1 affects-6.2 affects-6.3 affects-6.4 affects-6.5 affects-6.6 affects-7.0 affects-7.1 severity/critical sig/transaction SIG:Transaction type/bug This issue is a bug.

Comments

@MyonKeminta
Copy link
Contributor

Bug Report

* This is found by reviewing code and not yet confirmed by tests.

As way know, pessimistic transactions may switch primary during execution, and a stale pessimistic lock's primary may point to a key that is no longer the primary of the transaction.

It's quite tricky to handle this case when resolving locks. Historically, we've done several fixes trying to make it correct, such as #14787 , #21689 and many more. However, we still find some incorrectly handled corner cases recently. One is #42937 , and here's another one.

Since a stale pessimistic lock may points to a key that's not the actual primary of that transaction, the status of the key pointed by the pessimistic lock may not indicate the real state (committed or rolled back) of that transaction. We used to optimize resolving lock by caching the result of check_txn_status, and then it's found to be problematic and fixed in #21689 .

However, we found that BatchResolveLocks (which is used in GC) have a different way to misuse the wrong transaction status. It's passed an array of locks (which is usually collected by ScanLock RPC), then checks their primary's status, collects the results in a list, and finally send them to TiKV in one ResolveLock RPC. There's no special handling for pessimistic locks, and it may send incorrect transaction state to TiKV. If the transaction is committed and some of their keys are prewritten but not committed (secondaries of 2PC transactions, or any keys of async commit transactions), these key may be incorrectly rolled back, causing the transaction incompletely committed and breaks the data consistency.

@MyonKeminta MyonKeminta added type/bug This issue is a bug. severity/critical labels Apr 20, 2023
@ti-chi-bot ti-chi-bot added may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Apr 20, 2023
@MyonKeminta MyonKeminta added affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. affects-6.0 affects-6.1 affects-6.2 affects-6.3 affects-6.4 affects-6.5 affects-6.6 affects-7.0 and removed may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Apr 20, 2023
@cfzjywxk
Copy link
Contributor

cfzjywxk commented May 4, 2023

I remember BatchResolveLocks is also used by tools like BR, but I'm not quite sure about it. @YuJuncen @3pointer Could you help verify the usages of lock-resolving related interfaces in current BR or other implementations?

/cc
@tonyxuqqi @nongfushanquan
PTAL if there are other components that would call BatchResolveLock directly using TiDB Dependency.

ti-chi-bot bot pushed a commit that referenced this issue May 18, 2023
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this issue May 18, 2023
ti-chi-bot bot pushed a commit that referenced this issue May 18, 2023
ti-chi-bot bot pushed a commit that referenced this issue May 30, 2023
@zyguan zyguan mentioned this issue Jun 30, 2023
12 tasks
ti-chi-bot bot pushed a commit that referenced this issue Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. affects-6.0 affects-6.1 affects-6.2 affects-6.3 affects-6.4 affects-6.5 affects-6.6 affects-7.0 affects-7.1 severity/critical sig/transaction SIG:Transaction type/bug This issue is a bug.
Projects
None yet
Development

No branches or pull requests

4 participants