Liveness issue with ELECTION #211

Closed · jabolina opened this issue Jun 15, 2023 · 3 comments · Fixed by #220

In the past weeks, I played a bit with the idea of partial connectivity [1][2]. The test I executed repeatedly here involves 5 nodes named A to E, where A is the leader. To create the loss of connectivity, I use iptables to cut the links between all nodes at once, except the links to C. The topology goes from the one on the left to the one on the right:

        +---+                           +---+
        |   |                           |   |
  +-----+ A +-----+                     | A |
  |     |   |     |                     |   |
  |     +-+-+     |                     +-+-+
  |       |       |                       |
+-+-+     |     +-+-+           +---+   +-+-+   +---+
|   |     |     |   |           |   |   |   |   |   |
| E +-----+-----+ B |           | E +---+ C +---+ B |
|   |     |     |   |           |   |   |   |   |   |
++--+     |     +--++           +---+   +-+-+   +---+
 |        |        |                      |
 | +---+  |  +---+ |                    +-+-+
 | |   |  |  |   | |                    |   |
 +-+ D +--+--+ C +-+                    | D |
   |   |     |   |                      |   |
   +---+     +---+                      +---+

I let it run in that scenario for several minutes and then recovered the network to full connectivity between the nodes. This was enough to record different behaviors.

During the partial connectivity issue, leader A stepped down, and I did not find any safety violation. Nodes A, B, and C all seem to realize the cluster is in partial connectivity, although, consistently across all my executions, nodes D and E do not update their view (not sure if this is a problem). Their logs show something like this:

165681 [WARN] GMS: D: not member of view [A|7]; discarding it
168589 [TRACE] FD_ALL3: D: sent heartbeat
168590 [DEBUG] FD_ALL3: D: haven't received a heartbeat from E in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from B in timeout period (40000 ms), adding it to suspect list
168591 [DEBUG] FD_ALL3: D: haven't received a heartbeat from A in timeout period (40000 ms), adding it to suspect list
169202 [TRACE] GMS: D: I'm not the merge leader, waiting for merge leader (A) to start merge

In some executions, node C was elected the leader. I think the election algorithm from the dissertation [3] is unable to make progress in this case, and finding a way for C's election to happen deterministically would be best. Since node C still connects to everyone, it should be able to reach quorum, but we should still respect the restrictions RAFT imposes, e.g., the log prefix restriction [3].
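
For reference, a minimal sketch of that voting restriction as described in [3]: a voter only grants its vote if the candidate's log is at least as up to date as its own. The names below are illustrative, not the actual jgroups-raft code.

// Sketch of the Raft voting restriction from [3]: a voter grants its vote only if the
// candidate's log is at least as up to date as its own. Names are illustrative.
final class VoteRestriction {

    // A higher last term wins; on equal terms, the longer log wins.
    static boolean candidateLogIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                          long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm)
            return candidateLastTerm > voterLastTerm;
        return candidateLastIndex >= voterLastIndex;
    }
}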

Now, getting to the weird behaviors.

  1. In some executions, some nodes relentlessly try to become the leader, increasing the term in succession. In one execution, node A went from term 7 to 85 before stopping the voting thread.
  2. A liveness issue: after the network recovers, the nodes never elect a new leader, requiring manual intervention to restart nodes.

For the first scenario, I didn't dig too deep. I believe it happens when the node receives a merge view and mistakenly thinks it still connects to a majority, going on a spree sending vote requests until the view updates.

The second issue is where I spent more time. From what I could identify, nodes A, B, and C updated their views during the partial connectivity period, but nodes D and E did not. The issue happens after the network is restored and the view coordinator is either node D or E. Since they still hold the old view from before the disconnection, they compute the update as "no_change" and do not start the voting thread, still believing node A is the leader. Meanwhile, nodes A, B, and C are idle since they are not the view coordinator. Some logs from node A after the network is restored:

330341 [DEBUG] GMS: A: I will be the merge leader. Starting the merge task. Views: {B=[A|8] (3) [A, C, B], E=[A|4] (5) [A, B, C, D, E], A=[A|9] (2) [A, C]}
330341 [DEBUG] GMS: A: merge task A::2 started with 4 participants
330342 [TRACE] GMS: A: sending MERGE_REQ to [B, D, A, E]
330342 [TRACE] GMS: A: got merge request from A, merge_id=A::2, mbrs=[C, B, D, E, A]
330344 [TRACE] GMS: A: fetched all digests for [C, A] in 2 ms
330344 [TRACE] GMS: A: got merge response from A, merge_id=A::2, merge data is sender=A, view=[A|9] (2) [C, A], digest=C: [3 (3)], A: [93 (93)]
330344 [TRACE] GMS: A: got merge response from B, merge_id=A::2, merge data is sender=B, view=[A|8] (1) [B], digest=B: [2 (2)]
330346 [TRACE] GMS: A: got merge response from D, merge_id=A::2, merge data is sender=D, view=[A|4] (1) [D], digest=D: [0 (0)]
330346 [TRACE] GMS: A: got merge response from E, merge_id=A::2, merge data is sender=E, view=[A|4] (1) [E], digest=E: [0 (0)]
330347 [TRACE] GMS: A: collected 4 merge response(s) in 5 ms
330347 [TRACE] GMS: A: consolidated view=MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
consolidated digest=E: [0 (0)], B: [2 (2)], D: [0 (0)], C: [3 (3)], A: [93 (93)]
330347 [DEBUG] GMS: A: installing merge view [E|10] in [B, D, E, A]
330347 [DEBUG] GMS: A: merge A::2 took 6 ms
330347 [DEBUG] GMS: A: installing view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E] ([E, B, D] joined)
330348 [DEBUG] ELECTION: A: existing view: [A|9] (2) [A, C], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: reached
-- view change: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
330348 [TRACE] GMS: A: mcasting view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]

And node E:

296449 [TRACE] GMS: E: got merge request from A, merge_id=A::2, mbrs=[E]
296453 [DEBUG] GMS: E: installing view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E] 
296454 [DEBUG] FD_SOCK: E: socket to A was closed gracefully
296455 [DEBUG] FD_SOCK: E: pingable_mbrs=[E, B, D, C, A], ping_dest=B
296455 [TRACE] FD_SOCK: E: ping_dest=B, ping_sock=Socket[addr=/10.172.199.158,port=35159,localport=38397], cache=C: 10.172.199.64:33045 (295 secs old)
A: 10.172.199.185:44261 (295 secs old)
B: 10.172.199.158:35159 (295 secs old)
D: 10.172.199.86:43301 (295 secs old)
E: 10.172.199.232:37185 (295 secs old)

296456 [DEBUG] ELECTION: E: existing view: [A|4] (5) [A, B, C, D, E], new view: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E], result: no_change
-- view change: MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]
296457 [TRACE] GMS: E: mcasting view MergeView::[E|10] (5) [E, B, D, C, A], 3 subgroups: [A|8] (3) [A, C, B], [A|9] (2) [A, C], [A|4] (5) [A, B, C, D, E]

I wrote a small reproducer (linked below). It does not replay the whole stack; it only runs RAFT and ELECTION, simulating the steps with the views. In the end, no leader is elected.
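
To make the classification explicit, here is a rough, self-contained sketch of the kind of majority check that yields "no_change" for D and E. The class and method names are hypothetical, not the actual ELECTION code.

import java.util.List;

// Hypothetical sketch: the node compares the member count of its last installed view and
// of the new view against the majority threshold. E never installed the smaller views
// during the partition, so both views have 5 members and the result is NO_CHANGE, meaning
// the voting thread never starts even though E became the view coordinator.
public class ViewClassificationSketch {

    enum Result { REACHED, LOST, NO_CHANGE }

    static Result classify(List<String> oldView, List<String> newView, int clusterSize) {
        int majority = clusterSize / 2 + 1;
        boolean hadMajority = oldView.size() >= majority;
        boolean hasMajority = newView.size() >= majority;
        if (!hadMajority && hasMajority) return Result.REACHED;
        if (hadMajority && !hasMajority) return Result.LOST;
        return Result.NO_CHANGE;
    }

    public static void main(String[] args) {
        List<String> oldView = List.of("A", "B", "C", "D", "E"); // E's view: [A|4] (5)
        List<String> newView = List.of("E", "B", "D", "C", "A"); // merge view: [E|10] (5)
        System.out.println(classify(oldView, newView, 5));       // prints NO_CHANGE
    }
}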

Moving on to possible solutions, which also happen to be a commonly suggested optimization. Section 9.6 of Diego's dissertation [3] briefly describes the PreVote mechanism: in summary, a node first queries the nodes it knows to check whether it can start a new election, which helps avoid disrupting the cluster.

Our implementation is already stable concerning disruptions and already includes the CheckQuorum optimization [4]. I propose adding the PreVote before starting the voting thread to address case 1 described previously.

To solve issue 2, we could run the PreVote when a node computes the new view as "no_change", is the new view coordinator, and is not the RAFT leader. The node would probe everyone, and if it receives a majority of replies agreeing, it then starts the voting thread. To reinforce: we keep everything as we have today and only include the PreVote as an additional validation before running the voting thread (see the sketch below). Note that this could also add a slight delay to the election process.
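
A rough sketch of the proposed gate, with hypothetical names (this is not the existing ELECTION API); the point is only that the voting thread starts exclusively after a majority of pre-vote replies.

import java.util.List;

// Hypothetical sketch of the PreVote gate proposed above. On a "no_change" view change,
// the view coordinator (if it is not the RAFT leader) probes every member; the regular
// voting thread starts only if a majority agrees that an election is needed.
public class PreVoteGateSketch {

    interface PreVoteTransport {
        // Returns true if the member would grant a pre-vote at the given term.
        boolean requestPreVote(String member, long proposedTerm);
    }

    private final PreVoteTransport transport;

    public PreVoteGateSketch(PreVoteTransport transport) {
        this.transport = transport;
    }

    void onNoChangeView(boolean isViewCoordinator, boolean isRaftLeader,
                        List<String> members, String self, long currentTerm) {
        if (!isViewCoordinator || isRaftLeader)
            return; // everything else behaves exactly as it does today

        int granted = 1; // count our own pre-vote
        for (String member : members) {
            if (!member.equals(self) && transport.requestPreVote(member, currentTerm + 1))
                granted++;
        }
        if (granted >= members.size() / 2 + 1)
            startVotingThread();
    }

    private void startVotingThread() {
        // the existing election logic would be triggered from here, unchanged
    }
}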

Let me know what you think of this. If this is something we could work around with configuration changes, that would be good, too. I can check other scenarios from [1][2] to stress the implementation.


[1] https://dl.acm.org/doi/pdf/10.1145/3552326.3587441
[2] https://omnipaxos.com/blog/how-omnipaxos-handles-partial-connectivity-and-why-other-protocols-cant/
[3] https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf
[4] https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
Reproducer: jabolina@4231a04

jabolina added the bug label on Jun 15, 2023
belaban (Member) commented Jun 17, 2023

Hi Jose,
Implementing the prevoting, as discussed in Diego's thesis in ch. 9.6, sounds like a good plan.
Modeling the faults in PartialConnectivityTest first is critical IMO, then implementing prevoting and making sure the previously failing test now passes.

belaban (Member) commented Jun 17, 2023

Given that partial connectivity is the edge case and not the norm, I suggest creating an ELECTION2 protocol, perhaps extracting the common functionality into an Election class and making ELECTION and ELECTION2 extend it.
Alternatively, allow users to configure whether or not they want a prevoting phase: some users may not want this, as partial connectivity will not occur in their networks, or - if it does occur - they're willing to allow for manual intervention.

jabolina (Member, Author) commented

Thanks, Bela.

I plan on following the approach you suggested: creating ELECTION2, asserting that PartialConnectivityTest catches the issue (and passes with the fix), and in addition:

  • Extend some tests to cover both election protocols;
  • Include an operation to start the voting thread, focused on operators as an easier way for manual intervention (see the sketch below).
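
For the operator-facing operation, something along these lines is what I have in mind; the class and method names are illustrative, not the final code.

import org.jgroups.annotations.ManagedOperation;

// Illustrative sketch only: exposes a management operation so an operator can force the
// voting thread to start (e.g., via JMX or the probe tool) after an incident like the one
// described above.
public class ManualElectionSketch {

    @ManagedOperation(description = "Manually starts the voting thread")
    public void startVotingThread() {
        // would delegate to the same code path the election protocol uses on view changes
    }
}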
