Liveness issue joining during election thread execution #253

Closed
jabolina opened this issue Feb 14, 2024 · 0 comments

I found this locally because of a flaky test that waits for all nodes to install a leader. During concurrent joins, there is a small window in which a joining node might not receive the LeaderElected message. The scenario is:

  1. The coordinator has already collected the VoteResponses;
  2. The LeaderElected message has already been multicast;
  3. The leader has not yet been set locally on the elected node.

If the joiner arrives after the LeaderElected message is sent and before the leader is set locally, it will not receive the LeaderElected message.

As I am not sure how to test this, here is the log from the failing test:

9686 [TRACE] ELECTION: A -> all: VoteRequest: current_term=1
9686 [TRACE] ELECTION: A <- A: VoteRequest(term=1)
9688 [TRACE] ELECTION: B <- A: VoteRequest(term=1)
9688 [TRACE] RAFT: B: changed term from 0 -> 1
9688 [TRACE] RAFT: B: change leader from null -> null
A: collected votes from {B=VoteResponse: current_term=1, last_log_term=0, last_log_index=0, A=VoteResponse: current_term=1, last_log_term=0, last_log_index=0} in 2 ms (majority=2) -> leader is A (new_term=1)
9688 [TRACE] ELECTION: A -> all (-self): LeaderElected: current_term=1, leader=A
9688 [TRACE] ELECTION: B <- A: LeaderElected: current_term=1, leader=A
9688 [TRACE] RAFT: B: change leader from A -> A

-------------------------------------------------------------------
GMS: address=C, cluster=AppendEntriesTest, physical address=127.0.0.1:1
-------------------------------------------------------------------
9688 [DEBUG] ELECTION: A: existing view: [A|1] (2) [A, B], new view: [A|2] (3) [A, B, C], result: no_change
9688 [TRACE] RAFT: A: change leader from A -> A
9688 [TRACE] RAFT: A: changed role from Follower -> Leader
9688 [DEBUG] ELECTION: A: stopping the voting thread
9690 [DEBUG] ELECTION: B: existing view: [A|1] (2) [A, B], new view: [A|2] (3) [A, B, C], result: no_change
9690 [DEBUG] ELECTION: C: existing view: null, new view: [A|2] (3) [A, B, C], result: reached

Setting the leader locally before sending the LeaderElected message should fix this. If the elected leader is not the coordinator, it receives the LeaderElected message before the new view is installed and can then notify the new node once that view arrives.
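
To make the ordering concrete, here is a minimal toy model of the two orderings. It is only a sketch: `LeaderElectedRace`, `Cluster`, `multicastLeaderElected`, `onJoin`, and `joinerLearnsLeader` are hypothetical names, not jgroups-raft APIs. It assumes that a multicast only reaches the members of the view at send time, and that on a view change the elected node forwards the leader to the joiner only if it already has a leader set locally.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the race; none of these names come from jgroups-raft. */
public class LeaderElectedRace {

    /** Tracks the current view and each node's local idea of the leader. */
    static final class Cluster {
        final Set<String> view = new LinkedHashSet<>();
        final Map<String, String> leaderSeenBy = new HashMap<>();

        Cluster(String... members) {
            for (String m : members) {
                view.add(m);
            }
        }

        /** Assumption: the multicast reaches only the members present right now. */
        void multicastLeaderElected(String sender, String leader) {
            for (String m : view) {
                if (!m.equals(sender)) {
                    leaderSeenBy.put(m, leader);
                }
            }
        }

        /** Assumption: the handler node notifies the joiner only if it knows the leader. */
        void onJoin(String handler, String joiner) {
            view.add(joiner);
            String known = leaderSeenBy.get(handler);
            if (known != null) {
                leaderSeenBy.put(joiner, known);
            }
        }
    }

    /** Runs one election on A with C joining right after the multicast. */
    static boolean joinerLearnsLeader(boolean setLocallyBeforeSend) {
        Cluster c = new Cluster("A", "B");
        if (setLocallyBeforeSend) {
            c.leaderSeenBy.put("A", "A");        // proposed order: set locally first
            c.multicastLeaderElected("A", "A");
            c.onJoin("A", "C");                  // A already knows the leader
        } else {
            c.multicastLeaderElected("A", "A");  // reported order: send first
            c.onJoin("A", "C");                  // C lands in the gap
            c.leaderSeenBy.put("A", "A");        // too late for C
        }
        return "A".equals(c.leaderSeenBy.get("C"));
    }

    public static void main(String[] args) {
        System.out.println("send first, joiner learns leader: " + joinerLearnsLeader(false));
        System.out.println("set first, joiner learns leader: " + joinerLearnsLeader(true));
    }
}
```

In this model the send-first ordering leaves C without a leader, while the set-first ordering lets the view-change handler pass the leader on to C.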

jabolina added the bug label Feb 14, 2024
jabolina added a commit to jabolina/jgroups-raft that referenced this issue Mar 2, 2024
* Utilize the blocking interceptors to block while sending the
  LeaderElected message. This reproduces the issue with the test.
* The fix is just a change to add the leader locally before sending the
  message. If the view changes in the meantime, the node is already leader!

Closes jgroups-extras#253.