Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
TiDB stoped serving requests for 2m40s when a leader got separated from the rest of the nodes #2676
I was testing a 3-nodes deployment of a TiDB cluster with default settings and observed a situation when TiDB stoped serving requests for 2 minutes 40 seconds. It happened when a leader got separated from the peers. The unavailable window is too wide so I believe that it a bug rather than an expected behavior.
The version of the TiDB is
The client app opened a connection to each of the nodes and was continuously running the following loop per each of them:
Each connection used its own key to avoid collision. If there was an error during the loop then it closed the current connection, opened a new one and began the next iteration.
Once in a second it dumped aggregated number of successful iterations per cluster and per each node for a last second.
When I separated a leader (10.0.0.7) from the peers with the following command:
the cluster became unavailable for more than two minutes (the rate of successful iterations dropped to zero)
Please see this repo for client's code, more information about the incident and the repro steps https://github.com/rystsov/perseus/tree/master/tidb
I have used your case and reproduce the problem. This is caused that your test client connects the TiDB in 10.0.0.7, but this TiDB can't connect to other PDs and TiKVs in 10.0.0.5 and 10.0.0.6. After a long time retry, TiDB will return the test client a timeout error, then the test client will connect another TiDB and works. In my machine, the unavailable time is about 1m30s.
TiDB is stateless, only PD and TiKV have the leader concept because of Raft. When the partition occurs, the new leader will be elected at about 20s, which is within the expected range. So now we need to let TiDB return error immediately or not retry too many times when we know the TiDB can't provide services. /cc @coocood
I'll be able to provide the logs in a couple of days.
It isn't 100% correct. My data shows that all cluster becomes unavailable so there is no other TiDB to connect to.
The test application opened 3 connections to each TiDB of the custer and was testing them independently (measuring the rate of successful operations per connection). After I isolated the first node (I called it a leader because the success rate for that node was higher than for the rest nodes) the rate dropped to zero for the whole cluster and it took more than 2 minutes to recover:
86 445 195 126 124 87 474 214 132 128 88 19 8 5 6 89 0 0 0 0 ... 247 0 0 0 0 248 29 0 0 29 249 237 0 138 99 250 289 0 179 110 251 314 0 197 117
The first column is number of seconds since the beginning of the experiment, the second is the number of successful read-inc-write iterations per all cluster per last second. The last three columns are success rate per each of connection.
Hi folks I reproduced the experiment using a 3-nodes deployment (each node hosted PD, TiKV and TiDB) with the recent version of TiDB:
For each TiDB, a client opened a connection, used a different key to avoid collisions and was executing the read-increment routine in a loop.
The first number in a row is the ordinal second of the experiment, the second is the number of iterations done across all connections per that second, the last three are numbers of iterations per connection (10.0.0.8, 10.0.0.9, 10.0.0.10).
During the experiment I isolated a node (10.0.0.10) which had a higher RPS with iptables:
I expected RPS in the last column to be dropped to zero without effect on the other columns. But from the client perspective the whole cluster went down for almost 20 seconds:
After some time I removed the isolation (approximately at 126th second):
The expectation was to see an immediate recovery of the RPS in the third connection but the it took 64 seconds and had very strange pattern. First the cluster became unavailable for almost 6 seconds:
Nothing happened for more than 40 seconds (the RPS of third node was zero) but then the cluster became unavailable for almost 12 seconds:
Finally, the third node recovered.
As you can see a simulated network glitch affecting one node in a three nodes deployment caused TiDB to stop serving requests for more than 35 seconds.
Please find logs here. I hope they will help you to fix the issue :)
The first 0 QPS is because that all your requests need to get data from TiKV 5 which has been partitioned yet. In our previous test, if we use iptables to do the isolation, the gRPC connection can't detect the network broken and it will throw a timeout error after 20 seconds (We configure the gRPC timeout to 20 seconds here).
We can decrease the timeout here or use keepalive to let the connection know it is broken more quickly. But we must know that TiKV also still needs some time to elect a new leader if the old leader is isolated.
The following log in tdb1.db.log shows this error:
The second 0 QPS is caused by the PD leader election. I will explain the reason in detail here:
So from 06:46:37 to 06:46:38, the PD couldn't provide service, and the QPS became 0 because TiDB can't get TSO from PD.
Using Raft Pre-Vote feature can avoid the case that the Raft leader is disturbed accidentally, but we also need to know why the Raft ReadIndex callback is not called immediately after the peer re-becomes the leader but timeout. We will discuss with the etcd team about this then update later. /cc @xiang90
By the way, I dockerized the scripts so the test can be reproduced just with a few commands - https://github.com/rystsov/perseus/tree/master/tidb
I run another test and observed the following situation: after a node was isolated from a three-nodes cluster the whole system became irresponsible for more than 5 minutes, I tired to wait until it recovers, rejoined the node and it took another 1m30s until it revived. The logs -
I see that you isolated tidb1 for 5m and then rejoined it, and the 5 minutes no response makes sense, I will explain it in detail here:
You start the cluster and isolate the tidb1 in 1 minute, at this time, there is only 1 replica in the Raft group on TiKV 1 on tidb2 machine.
This is the known problem, see https://github.com/coreos/etcd/tree/master/raft#implementation-notes, the note only tells the problem when removing Raft member, but adding Raft member has the same problem.
Although the result is reasonable here, it is still a bug I think we need to fix or improve. maybe we can:
Maybe now you can wait some time ( 5 minutes) then test again.
@siddontang thanks for the explanation! As I understood when TiDB starts it works without replication (Raft group consists of one TiKV) and it takes about 2 minutes to increase the replication factor so there is a risk of loosing already acknowledged data. Is it correct? Is it possible to configure TiDB to reject requests unless there are at least 2 TiKV in a Raft's group?
Yes, we have this problem. The number of replicas may 1 or 2, so if any replica fails, the whole cluster can't work.
But as you can see, this is only for cluster bootstrap, it is not a serious problem now Maybe we can speed up the replication at first. @disksing seems we can connect to PD when the TiKV provides service, then the PD can add the replicas immediately, no need to wait 1 minute.