Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiDB crashes on startup when KV regions are unavailable #10470

Open
aphyr opened this issue May 14, 2019 · 3 comments

Comments

Projects
None yet
3 participants
@aphyr
Copy link

commented May 14, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?

Using Jepsen b4587653b9d5a33145bec901f9f56326975e6e31, we start PD, wait for all PD health pages to return, start KV, wait for all their health pages to return, then start DB. However, DB sometimes crashes, complaining that a region was unavailable. Full logs are available here

  1. What did you expect to see?

We expected to see DB block until the region was available, then run normally.

  1. What did you see instead?

DB crashes with a fatal warning:

2019/05/14 16:52:34.082 terror.go:329: [fatal] [tikv:9005]Region is unavailable[try again later]
github.com/pingcap/errors.AddStack
	/home/jenkins/workspace/release_tidb_master_tag/go/pkg/mod/github.com/pingcap/errors@v0.11.1/errors.go:174
github.com/pingcap/errors.Trace
	/home/jenkins/workspace/release_tidb_master_tag/go/pkg/mod/github.com/pingcap/errors@v0.11.1/juju_adaptor.go:15
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleCopResponse
	/home/jenkins/workspace/release_tidb_master_tag/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:730
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTaskOnce
	/home/jenkins/workspace/release_tidb_master_tag/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:639
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).handleTask
	/home/jenkins/workspace/release_tidb_master_tag/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:582
github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).run
	/home/jenkins/workspace/release_tidb_master_tag/go/src/github.com/pingcap/tidb/store/tikv/coprocessor.go:430
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1333
  1. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?

Release Version: v3.0.0-beta.1
Git Commit Hash: d6490c1
Git Branch: HEAD
UTC Build Time: 2019-03-26 02:46:54
GoVersion: go version go1.11.4 linux/amd64
Race Enabled: false
TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e
Check Table Before Drop: false

@Hoverbear

This comment has been minimized.

Copy link
Member

commented Jun 7, 2019

It seems like pd ended up in a fairly rocky election!

TiDB is reporting it hits a 20s max sleep on a region miss backoff:

# ec2-3-81-52-36.compute-1.amazonaws.com TiDB
[2019/05/14 16:52:34.935 +00:00] [WARN] [backoff.go:259] ["regionMiss backoffer.maxSleep 20000ms is exceeded, errors:\nmessage:\"region epoch is not match\" epoch_not_match:<current_regions:<id:3 start_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:4 store_id:1 > > current_regions:<id:40 start_key:\"t\\200\\000\\000\\000\\000\\000\\000O\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:41 store_id:1 > > >  at 2019-05-14T16:52:33.931495811Z\nmessage:\"region epoch is not match\" epoch_not_match:<current_regions:<id:3 start_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:4 store_id:1 > > current_regions:<id:40 start_key:\"t\\200\\000\\000\\000\\000\\000\\000O\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:41 store_id:1 > > >  at 2019-05-14T16:52:34.433260974Z\nmessage:\"region epoch is not match\" epoch_not_match:<current_regions:<id:3 start_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:4 store_id:1 > > current_regions:<id:40 start_key:\"t\\200\\000\\000\\000\\000\\000\\000O\" end_key:\"t\\200\\000\\000\\000\\000\\000\\000S\" region_epoch:<conf_ver:1 version:17 > peers:<id:41 store_id:1 > > >  at 2019-05-14T16:52:34.935179175Z"]

We can see TiKV is reporting that the regions epoch doesn't match for the request:

# ec2-3-81-52-36.compute-1.amazonaws.com TiKV
[2019/05/14 16:52:19.325 +00:00] [ERROR] [endpoint.rs:470] [error-response] [err="region Error { message: \"region epoch is not match\", not_leader: None, region_not_found: None, key_not_in_region: None, epoch_not_match: Some(EpochNotMatch { current_regions: [Region { id: 3, start_key: [116, 128, 0, 0, 0, 0, 0, 0, 255, 83, 0, 0, 0, 0, 0, 0, 0, 248], end_key: [], region_epoch: Some(RegionEpoch { conf_ver: 1, version: 17, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }), peers: [Peer { id: 4, store_id: 1, is_learner: false, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }], unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }, Region { id: 40, start_key: [116, 128, 0, 0, 0, 0, 0, 0, 255, 79, 0, 0, 0, 0, 0, 0, 0, 248], end_key: [116, 128, 0, 0, 0, 0, 0, 0, 255, 83, 0, 0, 0, 0, 0, 0, 0, 248], region_epoch: Some(RegionEpoch { conf_ver: 1, version: 17, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }), peers: [Peer { id: 41, store_id: 1, is_learner: false, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }], unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }], unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }), server_is_busy: None, stale_command: None, store_not_match: None, raft_entry_too_large: None, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }"]

Here we can see evidence that PD is tangled in an election during this time frame:

# ec2-3-81-52-36.compute-1.amazonaws.com PD
2019/05/14 16:52:13.143 node.go:312: [info] raft.node: 31a860370a7744b9 lost leader 1f6ee85649ca29fe at term 5
2019/05/14 16:52:13.144 node.go:306: [info] raft.node: 31a860370a7744b9 elected leader 579e9d5eb41f7baf at term 5
2019/05/14 16:52:37.119 raft.go:857: [info] 31a860370a7744b9 is starting a new election at term 5
2019/05/14 16:52:37.119 raft.go:684: [info] 31a860370a7744b9 became pre-candidate at term 5
2019/05/14 16:52:37.119 raft.go:755: [info] 31a860370a7744b9 received MsgPreVoteResp from 31a860370a7744b9 at term 5
2019/05/14 16:52:37.119 raft.go:742: [info] 31a860370a7744b9 [logterm: 5, index: 290] sent MsgPreVote request to 579e9d5eb41f7baf at term 5
2019/05/14 16:52:37.119 raft.go:742: [info] 31a860370a7744b9 [logterm: 5, index: 290] sent MsgPreVote request to 9f1e909c86ec2f0d at term 5
2019/05/14 16:52:37.119 raft.go:742: [info] 31a860370a7744b9 [logterm: 5, index: 290] sent MsgPreVote request to e3d0977a65591fe6 at term 5
2019/05/14 16:52:37.119 raft.go:742: [info] 31a860370a7744b9 [logterm: 5, index: 290] sent MsgPreVote request to 1f6ee85649ca29fe at term 5
2019/05/14 16:52:37.129 raft.go:782: [info] 31a860370a7744b9 [logterm: 5, index: 290, vote: 579e9d5eb41f7baf] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 5, index: 290] at term 5: lease is not expired (remaining ticks: 6)
2019/05/14 16:52:37.272 raft.go:782: [info] 31a860370a7744b9 [logterm: 5, index: 290, vote: 579e9d5eb41f7baf] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 5, index: 290] at term 5: lease is not expired (remaining ticks: 3)
2019/05/14 16:52:37.390 raft.go:782: [info] 31a860370a7744b9 [logterm: 5, index: 290, vote: 579e9d5eb41f7baf] ignored MsgPreVote from 9f1e909c86ec2f0d [logterm: 5, index: 290] at term 5: lease is not expired (remaining ticks: 1)
2019/05/14 16:52:37.399 raft.go:656: [info] 31a860370a7744b9 became follower at term 5
2019/05/14 16:53:22.140 raft.go:882: [info] 31a860370a7744b9 [logterm: 5, index: 344, vote: 579e9d5eb41f7baf] cast MsgPreVote for 9f1e909c86ec2f0d [logterm: 5, index: 344] at term 5
2019/05/14 16:53:22.142 raft.go:797: [info] 31a860370a7744b9 [term: 5] received a MsgVote message with higher term from 9f1e909c86ec2f0d [term: 6]
2019/05/14 16:53:22.142 raft.go:656: [info] 31a860370a7744b9 became follower at term 6
2019/05/14 16:53:22.142 raft.go:882: [info] 31a860370a7744b9 [logterm: 5, index: 344, vote: 0] cast MsgVote for 9f1e909c86ec2f0d [logterm: 5, index: 344] at term 6
2019/05/14 16:53:22.142 node.go:312: [info] raft.node: 31a860370a7744b9 lost leader 579e9d5eb41f7baf at term 6
2019/05/14 16:53:22.144 node.go:306: [info] raft.node: 31a860370a7744b9 elected leader 9f1e909c86ec2f0d at term 6
2019/05/14 16:56:03.215 raft.go:782: [info] 31a860370a7744b9 [logterm: 6, index: 520, vote: 9f1e909c86ec2f0d] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 6, index: 518] at term 6: lease is not expired (remaining ticks: 6)
2019/05/14 16:56:03.765 raft.go:782: [info] 31a860370a7744b9 [logterm: 6, index: 520, vote: 9f1e909c86ec2f0d] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 6, index: 518] at term 6: lease is not expired (remaining ticks: 6)
2019/05/14 16:56:03.765 raft.go:782: [info] 31a860370a7744b9 [logterm: 6, index: 520, vote: 9f1e909c86ec2f0d] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 6, index: 519] at term 6: lease is not expired (remaining ticks: 6)
2019/05/14 16:56:03.765 raft.go:782: [info] 31a860370a7744b9 [logterm: 6, index: 520, vote: 9f1e909c86ec2f0d] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 6, index: 519] at term 6: lease is not expired (remaining ticks: 6)
2019/05/14 16:56:03.770 raft.go:782: [info] 31a860370a7744b9 [logterm: 6, index: 520, vote: 9f1e909c86ec2f0d] ignored MsgPreVote from 1f6ee85649ca29fe [logterm: 6, index: 520] at term 6: lease is not expired (remaining ticks: 5)
2019/05/14 17:00:04.674 log.go:86: [warning] etcdserver: [request "header:<ID:3390483429799868527 > txn:<compare:<target:VALUE key:\"/pd/6690926463083149525/leader\" value:\"\\n\\003pd5\\020\\215\\336\\260\\267\\310\\223\\244\\217\\237\\001\\0323http://ec2-3-90-63-186.compute-1.amazonaws.com:2380\\\"3http://ec2-3-90-63-186.compute-1.amazonaws.com:2379\" > success:<request_put:<key:\"/pd/6690926463083149525/timestamp\" value:\"\\025\\236\\233<~\\211\\t\\206\" > > > " took too long (306.917341ms) to execute]
2019/05/14 17:00:04.674 log.go:86: [warning] etcdserver: [read-only range request "key:\"/pd/6690926463083149525/leader\" " took too long (263.854966ms) to execute]
2019/05/14 17:00:08.529 raft.go:882: [info] 31a860370a7744b9 [logterm: 6, index: 768, vote: 9f1e909c86ec2f0d] cast MsgPreVote for e3d0977a65591fe6 [logterm: 6, index: 768] at term 6
2019/05/14 17:00:08.531 raft.go:797: [info] 31a860370a7744b9 [term: 6] received a MsgVote message with higher term from e3d0977a65591fe6 [term: 7]
2019/05/14 17:00:08.531 raft.go:656: [info] 31a860370a7744b9 became follower at term 7
2019/05/14 17:00:08.531 raft.go:882: [info] 31a860370a7744b9 [logterm: 6, index: 768, vote: 0] cast MsgVote for e3d0977a65591fe6 [logterm: 6, index: 768] at term 7
2019/05/14 17:00:08.531 node.go:312: [info] raft.node: 31a860370a7744b9 lost leader 9f1e909c86ec2f0d at term 7
2019/05/14 17:00:08.533 node.go:306: [info] raft.node: 31a860370a7744b9 elected leader e3d0977a65591fe6 at term 7
2019/05/14 17:00:52.222 raft.go:882: [info] 31a860370a7744b9 [logterm: 7, index: 816, vote: e3d0977a65591fe6] cast MsgPreVote for 1f6ee85649ca29fe [logterm: 7, index: 816] at term 7
2019/05/14 17:00:52.224 raft.go:797: [info] 31a860370a7744b9 [term: 7] received a MsgVote message with higher term from 1f6ee85649ca29fe [term: 8]
2019/05/14 17:00:52.224 raft.go:656: [info] 31a860370a7744b9 became follower at term 8
2019/05/14 17:00:52.224 raft.go:882: [info] 31a860370a7744b9 [logterm: 7, index: 816, vote: 0] cast MsgVote for 1f6ee85649ca29fe [logterm: 7, index: 816] at term 8
2019/05/14 17:00:52.224 node.go:312: [info] raft.node: 31a860370a7744b9 lost leader e3d0977a65591fe6 at term 8
2019/05/14 17:00:52.226 node.go:306: [info] raft.node: 31a860370a7744b9 elected leader 1f6ee85649ca29fe at term 8
2019/05/14 17:00:52.877 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 817, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 817] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:00:53.984 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 818, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 818] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:00:58.974 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 827, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 827] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:00:59.840 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 828, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 828] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:00:59.845 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 828, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 828] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:22.846 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 971, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 971] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:24.224 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 973, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 971] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:24.618 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 973, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 971] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:24.619 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 973, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 971] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:24.619 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 973, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 971] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:03:24.620 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 973, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 972] at term 8: lease is not expired (remaining ticks: 5)
2019/05/14 17:20:13.499 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 1993, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 1992] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:20:13.508 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 1993, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 1992] at term 8: lease is not expired (remaining ticks: 6)
2019/05/14 17:20:13.508 raft.go:782: [info] 31a860370a7744b9 [logterm: 8, index: 1993, vote: 1f6ee85649ca29fe] ignored MsgPreVote from e3d0977a65591fe6 [logterm: 8, index: 1992] at term 8: lease is not expired (remaining ticks: 6)
@shenli

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

@aphyr During the startup process of tidb-server, if the tikv/pd is unavailable, it will abort.
This fail-fast behavior is by design. tidb-ansible will complain about the error. So the DBA could know tidb-server meet some problem and check its log.

@shenli shenli added the type/wontfix label Jun 11, 2019

@aphyr

This comment has been minimized.

Copy link
Author

commented Jun 11, 2019

OK. Since you've fixed two of these issues (tikv/tikv#4500, #10495), and talked about fixing a third (#10240), it might be worth sitting down and writing a project policy on whether you'd like to crash during startup so operators can observe the failure, or whether you'd like to be resilient to transient failures.

I'd like to suggest that since you need operational monitoring and diagnostic tools for transient failures anyway, the fail-fast feedback may not be a significant advantage here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.