TiDB can take 500+ seconds to recover from a single-node partition isolating a PD leader #10643
Please answer these questions before submitting your issue. Thanks!
On a five-node cluster, with Jepsen be3c0b730fc8392487428b12833932137af63183, run:
This nemesis identifies a current PD leader, and isolates that node from all other nodes in the cluster with an iptables network partition.
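To make the shape of that partition concrete, here is a minimal sketch in Python. It is not Jepsen's nemesis code. The hostnames `n1` to `n5`, the choice of `n3` as the isolated node, and the function name `isolation_rules` are all assumptions for illustration. The sketch only builds the `iptables` commands as strings and does not execute anything.

```python
def isolation_rules(nodes, isolated):
    """Return {node: [iptables commands]} that drop traffic between
    `isolated` and every other node, in both directions."""
    if isolated not in nodes:
        raise ValueError(f"{isolated!r} is not in the cluster")
    others = [n for n in nodes if n != isolated]
    rules = {n: [] for n in nodes}
    # The isolated node drops inbound packets from every peer.
    for peer in others:
        rules[isolated].append(f"iptables -A INPUT -s {peer} -j DROP")
    # Every peer drops inbound packets from the isolated node.
    for peer in others:
        rules[peer].append(f"iptables -A INPUT -s {isolated} -j DROP")
    return rules


if __name__ == "__main__":
    # Hypothetical hostnames; a real setup would typically use IP addresses.
    cluster = ["n1", "n2", "n3", "n4", "n5"]
    for node, cmds in isolation_rules(cluster, "n3").items():
        for cmd in cmds:
            print(f"{node}: {cmd}")
```

With one node isolated this way, the other four can still reach each other, which is the majority component discussed next.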
Since 4/5 nodes are still available, PD and every KV region should be able to elect a new leader in the majority component, and service should continue, perhaps after a brief interruption for leader election.
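That expectation is just majority-quorum arithmetic, sketched below. The helper names are made up for this example, and the only assumption is that leader election requires a strict majority of members, as in Raft-style consensus.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size, reachable):
    """True if `reachable` members are enough to elect a leader."""
    return reachable >= quorum(cluster_size)


if __name__ == "__main__":
    print(quorum(5))               # 3
    print(can_elect_leader(5, 4))  # True: the majority component
    print(can_elect_leader(5, 1))  # False: the isolated node alone
```

The same check applies per region: isolating a single node removes at most one replica from any region's group.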
TiKV stops working entirely for the duration of the network partition. I'm running longer tests now to see if it'll recover given more than 200 seconds.
When the network partition isolates the current leader, PD nodes in the majority partition log messages indicating that the leader has been deleted. It looks like n1 may be elected, and then... we seem to get stuck!
... and from then on,
What does the log mean?