ResolveLocks for pessimistic transaction that has switched primary may break transaction atomicity #42937

Closed
MyonKeminta opened this issue Apr 11, 2023 · 4 comments · Fixed by #42990
Labels
affects-4.0, affects-5.0, affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, affects-6.1, affects-6.2, affects-6.3, affects-6.4, affects-6.5, affects-6.6, affects-7.0, affects-7.1, severity/critical, sig/transaction, type/bug

Comments

@MyonKeminta
Contributor

MyonKeminta commented Apr 11, 2023

Bug Report

* I haven't strictly confirmed that the problem exists yet, but I think it's worth discussing.

Given the following facts:

  • A pessimistic transaction that has selected its primary may switch the primary to another key if the pessimistic locks acquired by its first statement (including the original primary) are rolled back, e.g., on a statement retry
  • When a statement retry happens, the pessimistic locks that the statement has just acquired should be rolled back with PessimisticRollback
  • A PessimisticRollback request may fail without affecting data consistency, so such failures can be ignored, which may leave residual pessimistic locks behind

Consider this case:

Initially, transaction T1 produces such state of data:

  • key1: primary, prewrite finished
  • key2: secondary, prewrite finished, primary points to key1
  • key3: residual pessimistic lock, primary points to key2
Steps to construct this state:
  1. Transaction T1's first statement locks key2 and key3, and key2 is selected as the primary
  2. The first statement encounters a write conflict and rolls back the pessimistic locks (key2 and key3) it has acquired, but the PessimisticRollback request to key3 fails
  3. T1's subsequent statements first lock key1 (which becomes the new primary), and then lock key2 again as a secondary

Then:

  1. T1 (committing in 2PC mode) completely finishes its prewrite phase.
  2. Another transaction T2 encounters the residual lock on key3, and to resolve it, it calls check_txn_status on key2 (even though key2 is not T1's real primary).
  3. TiKV doesn't check whether check_txn_status is called on a primary lock (see the sketch below). It finds that the lock on key2 (written during T1's prewrite) has expired, so it considers T1 to have failed and rolls back the lock.
  4. T1 successfully commits key1 (the primary) but fails to commit key2, so T1's atomicity is broken.
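
A minimal Go sketch of the resolve path described above. All names and types here (Lock, resolveLock, checkTxnStatus) are illustrative assumptions rather than the real client-go or TiKV API; the point is only that the caller blindly trusts the primary recorded in the residual lock:

// Illustrative sketch only: the names, types, and signatures below are
// hypothetical and do not match the real client-go / TiKV code.
package resolvesketch

import "bytes"

// Lock is the lock information a blocked reader (T2) sees.
type Lock struct {
	Key     []byte // e.g. key3 (the residual pessimistic lock)
	Primary []byte // the primary recorded in the lock, e.g. key2
	TxnID   uint64
}

// checkTxnStatus stands for the CheckTxnStatus request sent to the store
// holding primaryKey. Before the fix, the server only inspected the lock found
// on primaryKey and its TTL; it never verified that primaryKey was really the
// transaction's primary, so an expired secondary prewrite lock (key2) could be
// rolled back as if it were the primary.
func checkTxnStatus(txnID uint64, primaryKey []byte) (rolledBack bool) {
	// Server-side logic omitted; assume the lock's TTL has expired.
	return true
}

// resolveLock is what T2 conceptually does: it blindly trusts lock.Primary.
// For key3's residual lock this is key2, which is no longer T1's primary after
// the primary switched to key1.
func resolveLock(l *Lock) {
	if checkTxnStatus(l.TxnID, l.Primary) {
		// key2's prewrite lock is rolled back while T1 still commits key1,
		// which breaks T1's atomicity.
	}
	// The safeguard missing on the server side is roughly:
	//   if the lock found on primaryKey does not record primaryKey as its
	//   primary, reject the request.
	_ = bytes.Equal
}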
@MyonKeminta
Contributor Author

MyonKeminta commented Apr 13, 2023

The problem is confirmed using the following steps.

/* init */ create table t(id int primary key, v int unique);
/* init */ insert into t values (1, 10), (2, 20), (3, 30), (4, 40);
/* init */ create table t2 (id int primary key, v int);
/* init */ insert into t2 values (1, 1), (2, 2);

/* t1 */ set @@tidb_enable_async_commit = 0;
/* t1 */ set @@tidb_enable_1pc = 0;
/* t2 */ set @@tidb_enable_async_commit = 0;
/* t2 */ set @@tidb_enable_1pc = 0;

-- Enable failpoints:
--  * tikvclient/beforeAsyncPessimisticRollback = return("skip")
--  * tikvclient/twoPCRequestBatchSizeLimit = return
-- This simulates asyncPessimisticRollback failures and makes pessimistic lock requests be sent in multiple batches.

/* t1 */ begin pessimistic;
/* t2 */ begin pessimistic;
/* t2 */ update t set v = v + 1 where id = 2;

-- Enable failpoints:
--  * tikvclient/twoPCShortLockTTL = return
--  * tikvclient/shortPessimisticLockTTL = return
-- This gives the locks acquired by transaction t1 a very short TTL, to quickly
-- simulate the case where the locks' TTL has expired.

/* t1 */ with c as (select /*+ MERGE() */ v from t2 where id = 1 or id = 2) update c join t on c.v = t.id set t.v = t.v + 1;
-- t1 blocked by t2
-- In my test environment, the key of index(v):10 is selected as the primary.

/* t3 */ update t2 set v = v + 2; -- Change the rows that will be updated by t1.
/* t2 */ commit;
-- t1 resumed, row 3 and row 4 are updated

/* t1 */ update t set v = 0 where id = 1; -- This requires locking the key index(v):10 again.

-- Enable failpoint:
--  * tikvclient/beforeCommit = 1*return("delay")
--   which delays the commit randomly by 0 to 5 seconds
/* t1 */ commit;
-- t1 blocked by failpoint. Now, the initial state described in the issue is constructed.

-- sleep 1s

/* t2 */ insert into t values (5, 11);
-- t2 encounters the residual (not cleaned up) lock on key index(v):11, whose primary
-- points to index(v):10, and it succeeds after resolving the lock.

-- Wait for t1.
-- t1 is supposed to fail, since t2's lock resolving should have rolled it back. However,
-- it returns success to the client.

/* t1 */ admin check table t; -- Fails

Note that this procedure doesn't reliably reproduce the failure, since the delay introduced by the tikvclient/beforeCommit failpoint is randomized.

Also note that data-index inconsistency is not the only phenomenon this may cause; I'm afraid that in most cases it causes data inconsistency that's hardly noticeable.

The admin-check failure:

[executor:8134]data inconsistency in table: t, index: v, col: v, handle: "1", index-values:"KindInt64 10" != record-values:"KindInt64 0", compare err:<nil>

The log of committing secondaries:

[2023/04/13 14:04:57.182 +08:00] [ERROR] [commit.go:198] ["2PC failed commit key after primary key committed"] [error="Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(440759980553142284), commit_ts: TimeStamp(440759980579618825), key: [116, 128, 0, 0, 0, 0, 0, 0, 97, 95, 105, 128, 0, 0, 0, 0, 0, 0, 1, 3, 128, 0, 0, 0, 0, 0, 0, 10] })))))"] [errorVerbose="Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(440759980553142284), commit_ts: TimeStamp(440759980579618825), key: [116, 128, 0, 0, 0, 0, 0, 0, 97, 95, 105, 128, 0, 0, 0, 0, 0, 0, 1, 3, 128, 0, 0, 0, 0, 0, 0, 10] })))))\ngithub.com/tikv/client-go/v2/error.ExtractKeyErr\n\t/home/tidb/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230406064257-1ec0ff5bf443/error/error.go:293\ngithub.com/tikv/client-go/v2/txnkv/transaction.actionCommit.handleSingleBatch\n\t/home/tidb/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230406064257-1ec0ff5bf443/txnkv/transaction/commit.go:187\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1\n\t/home/tidb/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230406064257-1ec0ff5bf443/txnkv/transaction/2pc.go:1934\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"] [txnStartTS=440759980553142284] [commitTS=440759980579618825] [keys="[7480000000000000615f69800000000000000103800000000000000a]"] [stack="github.com/tikv/client-go/v2/txnkv/transaction.actionCommit.handleSingleBatch\n\t/home/tidb/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230406064257-1ec0ff5bf443/txnkv/transaction/commit.go:198\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1\n\t/home/tidb/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230406064257-1ec0ff5bf443/txnkv/transaction/2pc.go:1934"]

The test code:

func TestIssue42937(t *testing.T) {
	pprof.SetGoroutineLabels(pprof.WithLabels(context.Background(), pprof.Labels("name", "TestIssue42937")))
	store := realtikvtest.CreateMockStoreAndSetup(t)
	tk := testkit.NewTestKit(t, store)
	tk2 := testkit.NewTestKit(t, store)
	tk3 := testkit.NewTestKit(t, store)
	tk.MustExec("use test")
	tk.MustExec("set @@tidb_enable_async_commit = 0")
	tk.MustExec("set @@tidb_enable_1pc = 0")
	tk2.MustExec("use test")
	tk2.MustExec("set @@tidb_enable_async_commit = 0")
	tk2.MustExec("set @@tidb_enable_1pc = 0")
	tk3.MustExec("use test")
	tk.MustExec("create table t(id int primary key, v int unique)")
	tk.MustExec("insert into t values (1, 10), (2, 20), (3, 30), (4, 40)")
	tk.MustExec("create table t2 (id int primary key, v int)")
	tk.MustExec("insert into t2 values (1, 1), (2, 2)")
	require.NoError(t, failpoint.Enable("tikvclient/beforeAsyncPessimisticRollback", `return("skip")`))
	require.NoError(t, failpoint.Enable("tikvclient/twoPCRequestBatchSizeLimit", "return"))
	defer func() {
		require.NoError(t, failpoint.Disable("tikvclient/beforeAsyncPessimisticRollback"))
		require.NoError(t, failpoint.Disable("tikvclient/twoPCRequestBatchSizeLimit"))
	}()
	tk.MustExec("begin pessimistic")
	tk2.MustExec("begin pessimistic")
	tk2.MustExec("update t set v = v + 1 where id = 2")
	require.NoError(t, failpoint.Enable("tikvclient/twoPCShortLockTTL", "return"))
	require.NoError(t, failpoint.Enable("tikvclient/shortPessimisticLockTTL", "return"))
	ch := mustExecAsync(tk, `
with
c as (select /*+ MERGE() */ v from t2 where id = 1 or id = 2)
update c join t on c.v = t.id set t.v = t.v + 1`)
	mustTimeout(t, ch, time.Millisecond*100)
	tk3.MustExec("update t2 set v = v + 2")
	tk2.MustExec("commit")
	<-ch
	tk.MustQuery("select id, v from t order by id").Check(testkit.Rows("1 10", "2 20", "3 31", "4 41"))
	tk.MustExec("update t set v = 0 where id = 1")
	require.NoError(t, failpoint.Enable("tikvclient/beforeCommit", `1*return("delay")`))
	defer func() {
		require.NoError(t, failpoint.Disable("tikvclient/beforeCommit"))
	}()
	ch = mustExecAsync(tk, "commit")
	time.Sleep(time.Second * 1)
	mustTimeout(t, ch, time.Millisecond*100)
	require.NoError(t, failpoint.Disable("tikvclient/twoPCShortLockTTL"))
	require.NoError(t, failpoint.Disable("tikvclient/shortPessimisticLockTTL"))
	tk2.MustExec("insert into t values (5, 11)")
	time.Sleep(time.Second * 5)
	// tk is supposed to fail to keep the consistency
	mustRecv(t, ch)
	tk.MustExec("admin check table t")
	tk.MustQuery("select * from t order by id").Check(testkit.Rows(
		"1 0",
		"2 21",
		"3 31",
		"4 41",
		"5 11",
	))
}
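
The test above relies on the helpers mustExecAsync, mustTimeout, and mustRecv, which are defined elsewhere in TiDB's test utilities and are not shown here. Minimal sketches of what they could look like, assuming the same imports as the test file (testing, time, testkit); the actual implementations may differ:

// Assumed helper implementations; the real ones live elsewhere in TiDB's test
// code and may differ in details.
func mustExecAsync(tk *testkit.TestKit, sql string) <-chan struct{} {
	ch := make(chan struct{})
	go func() {
		defer close(ch)
		tk.MustExec(sql)
	}()
	return ch
}

// mustTimeout asserts that nothing is received from ch within timeout,
// i.e. the asynchronous statement is still blocked.
func mustTimeout(t *testing.T, ch <-chan struct{}, timeout time.Duration) {
	select {
	case <-ch:
		t.Fatal("the async statement finished, but it was expected to block")
	case <-time.After(timeout):
	}
}

// mustRecv asserts that ch is received from (the asynchronous statement
// finished) within a bounded time.
func mustRecv(t *testing.T, ch <-chan struct{}) {
	select {
	case <-ch:
	case <-time.After(10 * time.Second):
		t.Fatal("timed out waiting for the async statement to finish")
	}
}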

@MyonKeminta
Contributor Author

cc @cfzjywxk

@ti-chi-bot ti-chi-bot added the may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.1, may-affects-6.5 labels Apr 13, 2023
@MyonKeminta MyonKeminta added the affects-6.5, affects-7.0, affects-5.0, affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, affects-6.1 labels and removed the may-affects-4.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-5.0, may-affects-6.1, may-affects-6.5 labels Apr 13, 2023
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this issue Apr 26, 2023
close tikv#14636, ref pingcap/tidb#42937

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
lidezhu pushed a commit to lidezhu/tikv that referenced this issue Apr 27, 2023
… requests (tikv#14637)

close tikv#14636, ref pingcap/tidb#42937

Makes TiKV support checking whether the lock is primary when handling check_txn_status.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: lidezhu <lidezhu@pingcap.com>
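
The commit message above describes the TiKV-side fix: check_txn_status now verifies that the key it is asked to check is actually the primary recorded in the lock. A rough conceptual sketch in Go (the real change is in TiKV and written in Rust; the type, function, and error message below are illustrative only):

// Conceptual sketch of the check the fix adds; not the actual TiKV (Rust)
// implementation.
package checksketch

import (
	"bytes"
	"errors"
)

// Lock is the lock found on the key that check_txn_status is called on.
type Lock struct {
	Primary []byte // the primary key recorded in this lock
}

// checkTxnStatus rejects the request when the checked key is not the primary
// recorded in its lock, instead of treating an expired non-primary lock as
// proof that the transaction has failed.
func checkTxnStatus(lockOnKey Lock, checkedKey []byte) error {
	if !bytes.Equal(lockOnKey.Primary, checkedKey) {
		return errors.New("primary mismatch: check_txn_status called on a non-primary lock")
	}
	// Normal TTL / commit-status handling omitted.
	return nil
}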
@MyonKeminta
Contributor Author

MyonKeminta commented May 4, 2023

@shiyuhang0 I think you or your team also need to be aware of this, since TiSpark uses the Java client to resolve locks

ti-chi-bot bot pushed a commit to tikv/tikv that referenced this issue May 10, 2023
… requests (#14637) (#14659)

close #14636, ref #14637, ref pingcap/tidb#42937

Makes TiKV support checking whether the lock is primary when handling check_txn_status.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

Co-authored-by: MyonKeminta <9948422+MyonKeminta@users.noreply.github.com>
Co-authored-by: zyguan <zhongyangguan@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue May 23, 2023
…le pessimistic lock written before switching primary (#42990)

close #42937
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this issue May 23, 2023
ti-chi-bot bot pushed a commit that referenced this issue May 24, 2023
…le pessimistic lock written before switching primary (#42990) (#44119)

close #42937
ti-chi-bot bot added a commit to tikv/tikv that referenced this issue May 24, 2023
… requests (#14637) (#14661)

close #14636, ref #14637, ref pingcap/tidb#42937

Makes TiKV support checking whether the lock is primary when handling check_txn_status.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue May 29, 2023
…le pessimistic lock written before switching primary (#42990) (#44118)

close #42937