In the past weeks, I have played a bit with the idea of partial connectivity [1][2]. The test I executed repeatedly involves 5 nodes named A to E, where A is the leader. To create the loss of connectivity, I use iptables, cutting off all nodes from each other at once except C. Going from the topology on the left to the one on the right:
I let it run in that scenario for several minutes and then recovered the network to full connectivity between the nodes. This was enough to record different behaviors.
During the partial connectivity period, leader A stepped down, and I did not find any safety violation. Nodes A, B, and C all seem to realize the cluster is partially connected. However, consistently across all my executions, nodes D and E do not update their view (I am not sure whether this is a problem). Their logs show something similar to:
165681 [WARN] GMS: D: not member of view [A|7]; discarding it
168589 [TRACE] FD_ALL3: D: sent heartbeat
168590 [DEBUG] FD_ALL3: D: haven't received a heartbeat from E in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from B in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from A in timeout period (40000 ms), adding it to suspect list
169202 [TRACE] GMS: D: I'm not the merge leader, waiting for merge leader (A) to start merge
In some executions, node C was elected leader. I think the election algorithm from the dissertation [3] is unable to make progress in this case; finding a way to make this happen deterministically would be best. Since node C still connects to everyone, it should be able to reach a quorum, but we must still respect the restrictions RAFT imposes, e.g., on the log prefix [3].
Now getting to the weird behavior.
In some executions, some nodes relentlessly try to become the leader, increasing the term in succession. In one execution, node A went from term 7 to 85 before stopping the voting thread.
And there is a liveness issue: after the network recovers, the nodes never elect a new leader, which requires manual intervention to restart nodes.
For the first scenario, I didn't dig too deep. I believe it happens when the node receives a merge view and mistakenly thinks it still connects to a majority, going on a spree sending vote requests until the view updates.
The second issue is where I spent more time. From what I could identify, nodes A, B, and C updated their views during the partial connectivity period, but nodes D and E did not. The issue happens after the network is restored when the view coordinator is either node D or E. Since they still hold the old view from before the disconnection, they compute the update as "no_change" and do not start the voting thread, still believing node A is the leader. Meanwhile, nodes A, B, and C stay idle since they are not the view coordinator. Some logs from node A after the network is restored:
I wrote a small reproducer (linked below). It does not replay the whole stack, only RAFT and ELECTION, simulating the steps with the views. In the end, no leader is elected.
Now, on to possible solutions, which also happen to be a highly recommended optimization. In Diego's dissertation [3], Section 9.6 briefly describes the PreVote mechanism. In summary, a node queries the nodes it knows of to check whether it can start a new election, which helps avoid disrupting the cluster.
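To make the mechanism concrete, here is a minimal sketch of the PreVote idea from Section 9.6. All names and signatures are illustrative, not the jgroups-raft API: a would-be candidate asks its peers whether they would grant a vote for term + 1 before actually incrementing its term, so a partitioned node cannot inflate terms and disrupt a healthy cluster.

```java
import java.util.List;

public class PreVoteSketch {
    // A peer's reply to a pre-vote request. Unlike a real vote, answering
    // does not change the peer's persistent state (term, votedFor).
    public static boolean grantPreVote(long candidateTerm, long candidateLastIndex,
                                       long myTerm, long myLastIndex,
                                       boolean heardFromLeaderRecently) {
        if (heardFromLeaderRecently)
            return false;                          // a leader is alive; no election needed
        if (candidateTerm < myTerm)
            return false;                          // candidate's term is behind
        return candidateLastIndex >= myLastIndex;  // candidate's log must be at least as complete
    }

    // The candidate bumps its term and starts the real election only if a
    // majority (counting itself) would grant the vote.
    public static boolean canStartElection(List<Boolean> replies, int clusterSize) {
        long grants = replies.stream().filter(b -> b).count() + 1; // +1 for self
        return grants > clusterSize / 2;
    }

    public static void main(String[] args) {
        // 5-node cluster: two peers grant, plus self = 3 > 5/2, so an election may start.
        System.out.println(canStartElection(List.of(true, true, false, false), 5)); // prints "true"
    }
}
```

In the partition above, a disconnected node like D would fail this probe (no majority of grants) and never disturb A's term, which is exactly the disruption the dissertation describes.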
Our implementation is already stable with respect to disruptions and already includes the CheckQuorum optimization [4]. I propose adding a PreVote phase before starting the voting thread to address case 1 described previously.
To solve issue 2, we could run PreVote whenever a node computes the new view as "no_change", is the new view coordinator, and is not the RAFT leader. This would cause the node to probe everyone, and if a majority of the replies agree, the node then starts the voting thread. To reinforce: we keep everything as we have today and only include PreVote as an additional validation before running the voting thread. Note that this could also add a slight delay to the election process.
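The gating condition for issue 2 could be sketched as follows. The names (ViewDiff, needsPreVote) are hypothetical, not existing jgroups-raft code; the point is only where the extra check slots in:

```java
public class ViewChangeSketch {
    // Possible outcomes of comparing the old and new views (illustrative).
    enum ViewDiff { LEADER_LOST, COORD_CHANGED, NO_CHANGE }

    // Today, a NO_CHANGE result never starts the voting thread. That is the
    // stuck case when D or E becomes coordinator while holding its stale
    // pre-partition view. Running PreVote here lets the node discover that
    // the old leader is gone and that an election is actually needed.
    public static boolean needsPreVote(ViewDiff diff, boolean isViewCoordinator,
                                       boolean isRaftLeader) {
        return diff == ViewDiff.NO_CHANGE && isViewCoordinator && !isRaftLeader;
    }

    public static void main(String[] args) {
        // D becomes coordinator with its pre-partition view: probe the cluster.
        System.out.println(needsPreVote(ViewDiff.NO_CHANGE, true, false)); // prints "true"
        // The node is still the RAFT leader: nothing to do.
        System.out.println(needsPreVote(ViewDiff.NO_CHANGE, true, true));  // prints "false"
    }
}
```

Every other view outcome keeps today's behavior unchanged; the probe only fires in the one case where the current logic stays silent.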
Let me know what you think of this. If this is something we could work around with configuration changes, that would be good, too. I can check other scenarios from [1][2] to stress the implementation.
Hi Jose
Implementing the prevoting, as discussed in Diego's thesis in ch. 9.6, sounds like a good plan.
Modeling the faults in PartialConnectivityTest first is critical IMO, then implementing prevoting and making sure the previously failing test now passes.
Given that partial connectivity is the edge case, and not the norm, I suggest creating an ELECTION2 protocol, perhaps extracting the common functionality into an Election class and making ELECTION and ELECTION2 extend it.
Alternatively, allow users to configure whether or not they want a prevoting phase: some users may not want this, as partial connectivity will not occur in their networks, or - if it does occur - they're willing to allow for manual intervention.
I plan on following the approach you suggested: creating ELECTION2, asserting that PartialConnectivityTest catches the issue (and passes with the fix), and in addition:
Extend some tests to cover both election protocols;
Include an operation to start the voting thread, aimed at operators as an easier path for manual intervention.
[1] https://dl.acm.org/doi/pdf/10.1145/3552326.3587441
[2] https://omnipaxos.com/blog/how-omnipaxos-handles-partial-connectivity-and-why-other-protocols-cant/
[3] https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf
[4] https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
Reproducer: jabolina@4231a04