Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-retry on PD crash can *also* cause lost updates, read skew #10076

Closed
aphyr opened this issue Apr 8, 2019 · 6 comments

Comments

Projects
None yet
3 participants
@aphyr
Copy link

commented Apr 8, 2019

As noted in #10075, TiDB loses updates by default thanks to an auto-retry mechanism. The documentation says you can disable this mechanism by setting @@global.tidb_disable_txn_auto_retry = 1;. This is not entirely correct. There is a second automatic transaction retry mechanism, which comes into play when the connection to PD is lost--for example, if PD crashes. This retry mechanism can also cause lost updates.

I suggest disabling this retry mechanism by default as well, and, going forward, modify it to only retry transactions which can be safely re-applied.

In addition, setting tidb_disable_txn_auto_retry should disable all automatic retry mechanisms.

  1. What did you do?

With Jepsen 8c4f1101510b382266a3b8d3fdaa7adfc60d9d9a, run

lein run test --time-limit 120 --concurrency 2n -w bank --nemesis kill-pd
  1. What did you expect to see?

The total of all accounts should be constant.

  1. What did you see instead?

Shortly after the recovery from a PD crash affecting a majority of PD nodes, account balances can briefly fluctuate. This anomaly is not a transient read phenomenon; logical state is corrupted. We believe this represents another case of lost updates. http://jepsen.io.s3.amazonaws.com/analyses/tidb-2.1.7/20190408T132351-pd-auto-retry.zip

bank (8)

  1. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?

2.1.7-linux-amd64

@morgo morgo added the type/bug label Apr 8, 2019

@aphyr aphyr changed the title Auto-retry on PD crash can *also* cause lost updates Auto-retry on PD crash can *also* cause lost updates, read skew May 3, 2019

@aphyr

This comment has been minimized.

Copy link
Author

commented May 3, 2019

It looks like this can happen even when there are no process crashes, network problems, or other issues. In this test, we have a single case of read skew with tidb_disable_txn_auto_retry set to 1, but auto-retry-limit set to the default value of 10.

@jackysp

This comment has been minimized.

Copy link
Member

commented May 4, 2019

Hi @aphyr ,
I looked at your log and there was a transaction retry. The reason for this transaction retry is that the transaction commit time is too long. The tidb_disable_txn_auto_retry variable in the beta version of TiDB 3.0 only prevents automatic retry of transaction write conflicts, so transactions like network failures, program crashes, and commit times are too long, the transaction will automatically retry, when this retry and the write-conflict occurs at the same time, it will lead to the problem you are talking about.
Below is the log I found.

[2019/05/01 13:01:12.633 -07:00] [WARN] [session.go:386] [sql] [conn=2] [label=general] [error="[try again later]: conn2 2PC commit failed: retryable:\"Txn(Mvcc(TxnLockNotFound { start_ts: 408090278408224772, commit_ts: 408090279311835140, key: [116, 128, 0, 0, 0, 0, 0, 0, 255, 114, 95, 114, 128, 0, 0, 0, 0, 255, 0, 1, 21, 0, 0, 0, 0, 0, 250] }))\" "] [errorVerbose="conn2 2PC commit failed: retryable:\"Txn(Mvcc(TxnLockNotFound { start_ts: 408090278408224772, commit_ts: 408090279311835140, key: [116, 128, 0, 0, 0, 0, 0, 0, 255, 114, 95, 114, 128, 0, 0, 0, 0, 255, 0, 1, 21, 0, 0, 0, 0, 0, 250] }))\" \ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).commitSingleBatch\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:585\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:330\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnKeys\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:287\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).commitKeys\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:652\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).execute\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:751\ngithub.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).executeAndWriteFinishBinlog\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:660\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn).Commit\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/txn.go:275\ngithub.com/pingcap/tidb/session.(*TxnState).Commit\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/txn.go:192\ngithub.com/pingcap/tidb/session.(*session).doCommit\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:359\ngithub.com/pingcap/tidb/session.(*session).doCommitWithRetry\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:370\ngithub.com/pingcap/tidb/session.(*session).CommitTxn\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:442\ngithub.com/pingcap/tidb/session.finishStmt\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/tidb.go:171\ngithub.com/pingcap/tidb/session.runStmt\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/tidb.go:234\ngithub.com/pingcap/tidb/session.(*session).executeStatement\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:907\ngithub.com/pingcap/tidb/session.(*session).execute\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:981\ngithub.com/pingcap/tidb/session.(*session).Execute\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:930\ngithub.com/pingcap/tidb/server.(*TiDBContext).Execute\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/server/driver_tidb.go:246\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/server/conn.go:1045\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/server/conn.go:772\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/server/conn.go:608\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/server/server.go:437\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337\n[try again later]"] [txn="Txn{state=invalid}"]
[2019/05/01 13:01:12.633 -07:00] [WARN] [session.go:579] [retrying] [conn=2] [schemaVersion=20] [retryCnt=0] [queryNum=0] [sql="insert into lf (id, sk, val) values (277, 277, '1') on duplicate key update val = CONCAT(val, ',', '1')"]
[2019/05/01 13:01:12.633 -07:00] [INFO] [2pc.go:689] ["2PC clean up done"] [conn=2] [txnStartTS=408090278408224772]
[2019/05/01 13:01:12.634 -07:00] [WARN] [session.go:579] [retrying] [conn=2] [schemaVersion=20] [retryCnt=0] [queryNum=1] [sql="insert into lf (id, sk, val) values (278, 278, '1') on duplicate key update val = CONCAT(val, ',', '1')"]
[2019/05/01 13:01:12.634 -07:00] [WARN] [session.go:579] [retrying] [conn=2] [schemaVersion=20] [retryCnt=0] [queryNum=2] [sql=COMMIT]

The way to completely disable transaction automatic retry in TiDB is to set tidb_retry_limit to 0.

@jackysp

This comment has been minimized.

Copy link
Member

commented May 4, 2019

set tidb_disable_txn_auto_retry = 1 only disable retry for write conflicts is not a good idea, so I filed a pr to fix it #10339, hopefully this will appear in the 3.0 GA release.

@aphyr

This comment has been minimized.

Copy link
Author

commented May 6, 2019

I think that's a good idea, thanks! :)

@jackysp

This comment has been minimized.

Copy link
Member

commented May 17, 2019

After #10339 was merged, set tidb_disable_txn_auto_retry = 1 could avoid lost update and read skew.

@aphyr

This comment has been minimized.

Copy link
Author

commented May 29, 2019

Fixed in 3.0.0-rc.2!

@aphyr aphyr closed this May 29, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.