TiDB can take 500+ seconds to recover from a single-node partition isolating a PD leader #10643

Open

aphyr opened this issue May 29, 2019 · 4 comments

@aphyr commented May 29, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?

On a five-node cluster, with Jepsen be3c0b730fc8392487428b12833932137af63183, run:

lein run test -w register --time-limit 450 --concurrency 6n --version 3.0.0-rc.2 --nemesis partition-pd-leader --nemesis-interval 200 --nemesis-schedule fixed; and-bell

This nemesis identifies the current PD leader and isolates that node from all other nodes in the cluster with an iptables network partition.
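For context, the effect of this nemesis can be approximated by hand with iptables. This is only a rough sketch under assumed hostnames (n3 as the isolated PD leader, n1/n2/n4/n5 as its peers), not the actual Jepsen nemesis code:

# Drop traffic in both directions between the isolated node (n3) and every peer.
for peer in n1 n2 n4 n5; do
  ssh n3 "iptables -A INPUT -s $peer -j DROP"      # on n3: drop packets from this peer
  ssh "$peer" "iptables -A INPUT -s n3 -j DROP"    # on this peer: drop packets from n3
done

# To heal the partition, flush the filter rules on every node:
for node in n1 n2 n3 n4 n5; do ssh "$node" "iptables -F"; done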

  2. What did you expect to see?

Since 4/5 nodes are still available, PD and every KV region should be able to elect a new leader in the majority component, and service should continue, perhaps after a brief interruption for leader election.

  3. What did you see instead?

TiKV stops working entirely for the duration of the network partition. I'm running longer tests now to see if it'll recover given more than 200 seconds.

[latency graph: latency-raw (3)]

20190529T172428.000-0400.zip

When the network partition isolating the current leader occurs, PD nodes in the majority component log messages indicating that the leader has been deleted; it looks like n1 may be elected, and then... we seem to get stuck!

[2019/05/29 14:27:48.528 -07:00] [INFO] [raft.go:947] ["f82e563e5c75137e [logterm: 2, index: 477, vote: 9b116f88cab4dc9] cast MsgPreVote for 5440ff22fe632778 [logterm: 2, index: 477] at term 2"]
[2019/05/29 14:27:48.543 -07:00] [INFO] [raft.go:862] ["f82e563e5c75137e [term: 2] received a MsgVote message with higher term from 5440ff22fe632778 [term: 3]"]
[2019/05/29 14:27:48.544 -07:00] [INFO] [raft.go:712] ["f82e563e5c75137e became follower at term 3"]
[2019/05/29 14:27:48.544 -07:00] [INFO] [raft.go:947] ["f82e563e5c75137e [logterm: 2, index: 477, vote: 0] cast MsgVote for 5440ff22fe632778 [logterm: 2, index: 477] at term 3"]
[2019/05/29 14:27:48.544 -07:00] [INFO] [node.go:336] ["raft.node: f82e563e5c75137e lost leader 9b116f88cab4dc9 at term 3"]
[2019/05/29 14:27:48.554 -07:00] [INFO] [node.go:330] ["raft.node: f82e563e5c75137e elected leader 5440ff22fe632778 at term 3"]
[2019/05/29 14:27:49.922 -07:00] [INFO] [leader.go:353] ["leader is deleted"]
[2019/05/29 14:27:49.922 -07:00] [ERROR] [client.go:115] ["region sync with leader meet error"] [error="rpc error: code = Canceled desc = context canceled"] [stack="github.com/pingcap/log.Error\n\t/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:42\ngithub.com/pingcap/pd/server/region_syncer.(*RegionSyncer).StartSyncWithLeader.func1\n\t/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/region_syncer/client.go:115"]
[2019/05/29 14:27:50.923 -07:00] [INFO] [leader.go:108] ["leader changed, try to campaign leader"]
[2019/05/29 14:27:50.923 -07:00] [INFO] [leader.go:114] ["skip campaign leader and check later"] [server-name=pd4] [etcd-leader-id=6071132823457310584]
[2019/05/29 14:27:51.124 -07:00] [INFO] [leader.go:106] ["start watch leader"] [leader="name:\"pd1\" member_id:6071132823457310584 peer_urls:\"http://n1:2380\" client_urls:\"http://n1:2379\" "]
[2019/05/29 14:27:51.126 -07:00] [INFO] [leader.go:415] ["server enable region storage"]
[2019/05/29 14:27:51.127 -07:00] [INFO] [client.go:111] ["server starts to synchronize with leader"] [server=pd4] [leader=pd1] [request-index=568]
[2019/05/29 14:27:53.166 -07:00] [WARN] [stream.go:435] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=f82e563e5c75137e] [remote-peer-id=9b116f88cab4dc9] [error="read tcp 192.168.122.14:52814->192.168.122.13:2380: i/o timeout"]
[2019/05/29 14:27:53.166 -07:00] [WARN] [peer_status.go:68] ["peer became inactive (message send to peer failed)"] [peer-id=9b116f88cab4dc9] [error="failed to read 9b116f88cab4dc9 on stream MsgApp v2 (read tcp 192.168.122.14:52814->192.168.122.13:2380: i/o timeout)"]
[2019/05/29 14:27:53.181 -07:00] [WARN] [stream.go:435] ["lost TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=f82e563e5c75137e] [remote-peer-id=9b116f88cab4dc9] [error="read tcp 192.168.122.14:52816->192.168.122.13:2380: i/o timeout"]
[2019/05/29 14:27:58.203 -07:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 192.168.122.14:2379->192.168.122.11:33944: read: connection reset by peer"]
[2019/05/29 14:28:16.439 -07:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=9b116f88cab4dc9] [rtt=918.209µs] [error="dial tcp 192.168.122.13:2380: i/o timeout"]
[2019/05/29 14:28:21.440 -07:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=9b116f88cab4dc9] [rtt=918.209µs] [error="dial tcp 192.168.122.13:2380: i/o timeout"]
[2019/05/29 14:28:26.440 -07:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=9b116f88cab4dc9] [rtt=918.209µs] [error="dial tcp 192.168.122.13:2380: i/o timeout"]

... and from then on, the prober keeps logging "prober detected unhealthy status" messages about n3 being down until the network partition is resolved.

  4. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?

3.0.0-rc.2

@siddontang (Member) commented May 30, 2019

@BusyJay

[2019/05/29 14:34:55.416 -07:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=RAFT_LOG_GC] [peer_id=1069] [region_id=1068]
[2019/05/29 14:34:55.416 -07:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=SPLIT_REGION_CHECK] [peer_id=1069] [region_id=1068]

What does this log mean?

@aphyr (Author) commented May 30, 2019

Here's a case with full logs for an outage lasting 500 seconds: 20190530T082003.000-0400.zip

[latency graph: latency-raw (6)]

Interestingly... it looks like these outages may not be total. For instance, in this test, we were able to perform exactly one transaction in a 500-second outage:

[latency graph: latency-raw (5)]

@aphyr changed the title from "TiDB can take 200+ seconds to recover from a single-node partition isolating a PD leader" to "TiDB can take 500+ seconds to recover from a single-node partition isolating a PD leader" on May 30, 2019

@nolouch (Member) commented May 30, 2019

@aphyr I found that some regions only have 2 replicas, like region 658; such a region cannot elect a new leader when the partition isolates one of its two peers. We need to wait for a while before starting the test, so that the replicas can be made up to 3.
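One way to check for this condition before injecting faults is to ask PD which regions still have fewer than three peers, for example via its HTTP API. This is a hedged sketch, assuming the /pd/api/v1/regions endpoint and its "regions"/"peers" field names, plus jq on the control node:

# List the IDs of regions that currently have fewer than 3 peers;
# wait until this prints an empty list before starting the nemesis.
curl -s http://n1:2379/pd/api/v1/regions \
  | jq '[.regions[] | select((.peers | length) < 3) | .id]'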

@BusyJay (Member) commented May 31, 2019

That log means the peer has been destroyed.
