Liveness issue with ELECTION #211

Closed · jabolina opened this issue Jun 15, 2023 · 3 comments · Fixed by #220

In the past weeks, I played a bit with the idea of partial connectivity [1][2]. The test I executed repeatedly here involves 5 nodes named A to E, where A is the leader. To create the loss of connectivity, I use iptables to cut the links between all nodes at once, except the links to C. The topology goes from the one on the left to the one on the right:

        +---+                           +---+
        |   |                           |   |
  +-----+ A +-----+                     | A |
  |     |   |     |                     |   |
  |     +-+-+     |                     +-+-+
  |       |       |                       |
+-+-+     |     +-+-+           +---+   +-+-+   +---+
|   |     |     |   |           |   |   |   |   |   |
| E +-----+-----+ B |           | E +---+ C +---+ B |
|   |     |     |   |           |   |   |   |   |   |
++--+     |     +--++           +---+   +-+-+   +---+
 |        |        |                      |
 | +---+  |  +---+ |                    +-+-+
 | |   |  |  |   | |                    |   |
 +-+ D +--+--+ C +-+                    | D |
   |   |     |   |                      |   |
   +---+     +---+                      +---+

I let it run in that scenario for several minutes and then recovered the network to full connectivity between the nodes. This was enough to record different behaviors.

During the partial connectivity issue, leader A stepped down, and I did not find any safety violation. Nodes A, B, and C all seem to realize the cluster is in partial connectivity, although, consistently across all my executions, nodes D and E do not update their view (not sure if this is a problem). Their logs show something like this:

165681 [WARN] GMS: D: not member of view [A|7]; discarding it
168589 [TRACE] FD_ALL3: D: sent heartbeat
168590 [DEBUG] FD_ALL3: D: haven't received a heartbeat from E in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from B in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from A in timeout period (40000 ms), adding it to suspect list
169202 [TRACE] GMS: D: I'm not the merge leader, waiting for merge leader (A) to start merge

In some executions, node C was elected the leader. I think the election algorithm from the dissertation [3] is unable to make progress in this case, and finding a way for C's election to happen deterministically would be best. Since node C still connects to everyone, it should be able to reach quorum, but we should still respect the restrictions RAFT imposes, e.g., the log prefix restriction [3].
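
For reference, a minimal sketch of that voting restriction as described in [3]: a voter only grants its vote if the candidate's log is at least as up to date as its own. The names below are illustrative, not the actual jgroups-raft code.

// Sketch of the Raft voting restriction from [3]: a voter grants its vote only if the
// candidate's log is at least as up to date as its own. Names are illustrative.
final class VoteRestriction {

    // A higher last term wins; on equal terms, the longer log wins.
    static boolean candidateLogIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                          long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm)
            return candidateLastTerm > voterLastTerm;
        return candidateLastIndex >= voterLastIndex;
    }
}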

Now, getting to the weird behaviors.

  1. In some executions, some nodes relentlessly try to become the leader, increasing the term in succession. In one execution, node A went from term 7 to 85 before stopping the voting thread.
  2. A liveness issue: after the network recovers, the nodes never elect a new leader, requiring manual intervention to restart nodes.

For the first scenario, I didn't dig too deep. I believe it happens when the node receives a merge view and mistakenly thinks it still connects to a majority, going on a spree sending vote requests until the view updates.

The second issue is where I spent more time. From what I could identify, nodes A, B, and C updated their views during the partial connectivity period, but nodes D and E did not. The issue happens after the network is restored and the view coordinator is either node D or E. Since they still hold the old view from before the disconnection, they compute the update as "no_change" and do not start the voting thread, still believing node A is the leader. Meanwhile, nodes A, B, and C are idle since they are not the view coordinator. Some logs from node A after the network is restored:

330341 [DEBUG] GMS: A: I will be the merge leader. Starting the merge task. Views: {B=[A|8] (3) [A, C, B], E=[A|4] (5) [A, B, C, D, E], A=[A|9] (2) [A, C]}
330341 [DEBUG] GMS: A: merge task A::2 started with 4 participants
330342 [TRACE] GMS: A: sending MERGE_REQ to [B, D, A, E]
330342 [TRACE] GMS: A: got merge request from A, merge_id=A::2, mbrs=[C, B, D, E, A]
330344 [TRACE] GMS: A: fetched all digests for [C, A] in 2 ms
330344 [TRACE] GMS: A: got merge response from A, merge_id=A::2, merge data is sender=A, view=[A|9] (2) [C, A], digest=C: [3 (3)], A: [93 (93)]
330344 [TRACE] GMS: A: got merge response from B, merge_id=A::2, merge data is sender=B, view=[A|8] (1) [B], digest=B: [2 (2)]
330346 [TRACE] GMS: A: got merge response from D, merge_id=A::2, merge data is sender=D, view=[A|4] (1) [D], digest=D: [0 (0)]
330346 [TRACE] GMS: A: got merge response from E, merge_id=A::2, merge data is sender=E, view=[A|4] (1) [E], digest=E: [0 (0)]
330347 [TRACE] GMS: A: collected 4 merge response(s) in 5 ms
330347 [TRACE] GMS: A: consolidated view=MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
consolidated digest=E: [0 (0)], B: [2 (2)], D: [0 (0)], C: [3 (3)], A: [93 (93)]
330347 [DEBUG] GMS: A: installing merge view [E|10] in [B, D, E, A]
330347 [DEBUG] GMS: A: merge A::2 took 6 ms
330347 [DEBUG] GMS: A: installing view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E] ([E, B, D] joined)
330348 [DEBUG] ELECTION: A: existing view: [A|9] (2) [A, C], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: reached
-- view change: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
330348 [TRACE] GMS: A: mcasting view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]

And node E:

296449 [TRACE] GMS: E: got merge request from A, merge_id=A::2, mbrs=[E]
296453 [DEBUG] GMS: E: installing view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E] 
296454 [DEBUG] FD_SOCK: E: socket to A was closed gracefully
296455 [DEBUG] FD_SOCK: E: pingable_mbrs=[E, B, D, C, A], ping_dest=B
296455 [TRACE] FD_SOCK: E: ping_dest=B, ping_sock=Socket[addr=/10.172.199.158,port=35159,localport=38397], cache=C: 10.172.199.64:33045 (295 secs old)
A: 10.172.199.185:44261 (295 secs old)
B: 10.172.199.158:35159 (295 secs old)
D: 10.172.199.86:43301 (295 secs old)
E: 10.172.199.232:37185 (295 secs old)

296456 [DEBUG] ELECTION: E: existing view: [A|4] (5) [A, B, C, D, E], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: no_change
-- view change: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
296457 [TRACE] GMS: E: mcasting view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]

I wrote a small reproducer (linked below). It does not replay the whole stack; it only runs RAFT and ELECTION, simulating the steps with the views. In the end, no leader is elected.
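
To make the classification explicit, here is a rough, self-contained sketch of the kind of majority check that yields "no_change" for D and E. The class and method names are hypothetical, not the actual ELECTION code.

import java.util.List;

// Hypothetical sketch: the node compares the member count of its last installed view and
// of the new view against the majority threshold. E never installed the smaller views
// during the partition, so both views have 5 members and the result is NO_CHANGE, meaning
// the voting thread never starts even though E became the view coordinator.
public class ViewClassificationSketch {

    enum Result { REACHED, LOST, NO_CHANGE }

    static Result classify(List<String> oldView, List<String> newView, int clusterSize) {
        int majority = clusterSize / 2 + 1;
        boolean hadMajority = oldView.size() >= majority;
        boolean hasMajority = newView.size() >= majority;
        if (!hadMajority && hasMajority) return Result.REACHED;
        if (hadMajority && !hasMajority) return Result.LOST;
        return Result.NO_CHANGE;
    }

    public static void main(String[] args) {
        List<String> oldView = List.of("A", "B", "C", "D", "E"); // E's view: [A|4] (5)
        List<String> newView = List.of("E", "B", "D", "C", "A"); // merge view: [E|10] (5)
        System.out.println(classify(oldView, newView, 5));       // prints NO_CHANGE
    }
}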

Moving on to possible solutions, which also happen to be a commonly suggested optimization. Section 9.6 of Diego's dissertation [3] briefly describes the PreVote mechanism: in summary, a node first queries the nodes it knows to check whether it can start a new election, which helps avoid disrupting the cluster.

Our implementation is already stable concerning disruptions and already includes the CheckQuorum optimization [4]. I propose adding the PreVote before starting the voting thread to address case 1 described previously.

To solve issue 2, we could run the PreVote when a node computes the new view as "no_change", is the new view coordinator, and is not the RAFT leader. The node would probe everyone, and if it receives a majority of replies agreeing, it then starts the voting thread. To reinforce: we keep everything as we have today and only include the PreVote as an additional validation before running the voting thread (see the sketch below). Note that this could also add a slight delay to the election process.
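
A rough sketch of the proposed gate, with hypothetical names (this is not the existing ELECTION API); the point is only that the voting thread starts exclusively after a majority of pre-vote replies.

import java.util.List;

// Hypothetical sketch of the PreVote gate proposed above. On a "no_change" view change,
// the view coordinator (if it is not the RAFT leader) probes every member; the regular
// voting thread starts only if a majority agrees that an election is needed.
public class PreVoteGateSketch {

    interface PreVoteTransport {
        // Returns true if the member would grant a pre-vote at the given term.
        boolean requestPreVote(String member, long proposedTerm);
    }

    private final PreVoteTransport transport;

    public PreVoteGateSketch(PreVoteTransport transport) {
        this.transport = transport;
    }

    void onNoChangeView(boolean isViewCoordinator, boolean isRaftLeader,
                        List<String> members, String self, long currentTerm) {
        if (!isViewCoordinator || isRaftLeader)
            return; // everything else behaves exactly as it does today

        int granted = 1; // count our own pre-vote
        for (String member : members) {
            if (!member.equals(self) && transport.requestPreVote(member, currentTerm + 1))
                granted++;
        }
        if (granted >= members.size() / 2 + 1)
            startVotingThread();
    }

    private void startVotingThread() {
        // the existing election logic would be triggered from here, unchanged
    }
}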

Let me know what you think of this. If this is something we could work around with configuration changes, that would be good, too. I can check other scenarios from [1][2] to stress the implementation.


[1] https://dl.acm.org/doi/pdf/10.1145/3552326.3587441
[2] https://omnipaxos.com/blog/how-omnipaxos-handles-partial-connectivity-and-why-other-protocols-cant/
[3] https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf
[4] https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
Reproducer: jabolina@4231a04

jabolina added the bug label on Jun 15, 2023
belaban (Member) commented Jun 17, 2023

Hi Jose,
Implementing the prevoting, as discussed in Diego's thesis in ch. 9.6, sounds like a good plan.
Modeling the faults in PartialConnectivityTest first is critical IMO, then implementing prevoting and making sure the previously failing test now passes.

belaban (Member) commented Jun 17, 2023

Given that partial connectivity is the edge case and not the norm, I suggest creating an ELECTION2 protocol, perhaps extracting the common functionality into an Election class and making ELECTION and ELECTION2 extend it.
Alternatively, allow users to configure whether or not they want a prevoting phase: some users may not want this, as partial connectivity will not occur in their networks, or - if it does occur - they're willing to allow for manual intervention.

jabolina (Member, Author) commented

Thanks, Bela.

I plan on following the approach you suggested: creating ELECTION2, asserting that PartialConnectivityTest catches the issue (and passes with the fix), and in addition:

  • Extend some tests to cover both election protocols;
  • Include an operation to start the voting thread, focused on operators as an easier way for manual intervention (see the sketch below).
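
For the operator-facing operation, something along these lines is what I have in mind; the class and method names are illustrative, not the final code.

import org.jgroups.annotations.ManagedOperation;

// Illustrative sketch only: exposes a management operation so an operator can force the
// voting thread to start (e.g., via JMX or the probe tool) after an incident like the one
// described above.
public class ManualElectionSketch {

    @ManagedOperation(description = "Manually starts the voting thread")
    public void startVotingThread() {
        // would delegate to the same code path the election protocol uses on view changes
    }
}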
