I found this locally through a flaky test that waits for all nodes to install a leader. During concurrent joins, there is a small window in which a joining node might not receive the LeaderElected message. The scenario is:
The coordinator has already collected the VoteResponses;
The LeaderElected message has already been multicast;
The leader has not yet set itself as leader locally.
If the joiner arrives after the LeaderElected message is sent and before the leader is set locally, it will not receive the LeaderElected message.
I am not sure how to test this, so here is the test log:
9686 [TRACE] ELECTION: A -> all: VoteRequest: current_term=1
9686 [TRACE] ELECTION: A <- A: VoteRequest(term=1)
9688 [TRACE] ELECTION: B <- A: VoteRequest(term=1)
9688 [TRACE] RAFT: B: changed term from 0 -> 1
9688 [TRACE] RAFT: B: change leader from null -> null
A: collected votes from {B=VoteResponse: current_term=1, last_log_term=0, last_log_index=0, A=VoteResponse: current_term=1, last_log_term=0, last_log_index=0} in 2 ms (majority=2) -> leader is A (new_term=1)
9688 [TRACE] ELECTION: A -> all (-self): LeaderElected: current_term=1, leader=A
9688 [TRACE] ELECTION: B <- A: LeaderElected: current_term=1, leader=A
9688 [TRACE] RAFT: B: change leader from A -> A
-------------------------------------------------------------------
GMS: address=C, cluster=AppendEntriesTest, physical address=127.0.0.1:1
-------------------------------------------------------------------
9688 [DEBUG] ELECTION: A: existing view: [A|1] (2) [A, B], new view: [A|2] (3) [A, B, C], result: no_change
9688 [TRACE] RAFT: A: change leader from A -> A
9688 [TRACE] RAFT: A: changed role from Follower -> Leader
9688 [DEBUG] ELECTION: A: stopping the voting thread
9690 [DEBUG] ELECTION: B: existing view: [A|1] (2) [A, B], new view: [A|2] (3) [A, B, C], result: no_change
9690 [DEBUG] ELECTION: C: existing view: null, new view: [A|2] (3) [A, B, C], result: reached
Setting the leader locally before sending the message should fix this. If the elected leader is not the coordinator, it will receive the LeaderElected message before the view is installed and then notify the new node in the new view.
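To make the ordering argument concrete, here is a minimal single-threaded sketch of the two orderings. It is a toy model, not the jgroups-raft code: `LeaderRaceSketch` and all of its methods are hypothetical names, and it assumes that the view-change handler tells a joiner about the leader only when the coordinator already knows the leader locally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the race; none of these names come from jgroups-raft.
final class LeaderRaceSketch {
    /** Leader as seen by each member; null means "no leader installed". */
    final Map<String, String> leaderSeenBy = new LinkedHashMap<>();
    final List<String> view = new ArrayList<>();
    final String coordinator;

    LeaderRaceSketch(String coordinator, String... others) {
        this.coordinator = coordinator;
        join(coordinator);
        for (String member : others) join(member);
    }

    private void join(String member) {
        view.add(member);
        leaderSeenBy.put(member, null);
    }

    /** Multicast LeaderElected to everyone in the current view except the sender. */
    void multicastLeaderElected(String leader) {
        for (String member : view)
            if (!member.equals(coordinator)) leaderSeenBy.put(member, leader);
    }

    void setLeaderLocally(String leader) {
        leaderSeenBy.put(coordinator, leader);
    }

    /** Assumed behavior: a joiner is notified only if the leader is already known locally. */
    void installViewWith(String joiner) {
        join(joiner);
        String known = leaderSeenBy.get(coordinator);
        if (known != null) leaderSeenBy.put(joiner, known);
    }

    /** Runs the election with C joining between the two steps; returns the leader C ends up with. */
    static String leaderSeenByJoiner(boolean setLocallyFirst) {
        LeaderRaceSketch cluster = new LeaderRaceSketch("A", "B");
        if (setLocallyFirst) {
            cluster.setLeaderLocally("A");
            cluster.installViewWith("C");        // join lands in the gap
            cluster.multicastLeaderElected("A");
        } else {
            cluster.multicastLeaderElected("A");
            cluster.installViewWith("C");        // join lands in the gap
            cluster.setLeaderLocally("A");
        }
        return cluster.leaderSeenBy.get("C");
    }

    public static void main(String[] args) {
        System.out.println("send first: C sees " + leaderSeenByJoiner(false));
        System.out.println("set first:  C sees " + leaderSeenByJoiner(true));
    }
}
```

In this model, `leaderSeenByJoiner(false)` leaves C without a leader, while `leaderSeenByJoiner(true)` gives C the leader A, because the view change finds the leader already set.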
* Use the blocking interceptors to block while sending the
LeaderElected message. This reproduces the issue in the test.
* The fix is simply to set the leader locally before sending the
message. If the view changes in the meantime, the node is already the leader.
Closes jgroups-extras#253.
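The reproduction idea in the commit message (hold the sender while the join happens) can be sketched roughly as follows. The project's blocking interceptors are not shown here, so a pair of `CountDownLatch` instances stands in for them; `BlockingSendSketch`, `run`, and the latch names are all hypothetical, and the joiner is modeled as simply reading what the coordinator knows locally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a blocking interceptor; not the project's test utility.
final class BlockingSendSketch {

    private static void await(CountDownLatch latch) {
        try {
            if (!latch.await(5, TimeUnit.SECONDS))
                throw new IllegalStateException("timed out waiting for latch");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Returns the leader a joiner observes when it arrives while the send is blocked. */
    static String run(boolean setLocallyFirst) {
        AtomicReference<String> localLeader = new AtomicReference<>(); // coordinator's local leader
        CountDownLatch sendReached = new CountDownLatch(1); // election thread is inside the send
        CountDownLatch release = new CountDownLatch(1);     // test lets the send complete

        Thread election = new Thread(() -> {
            if (setLocallyFirst) localLeader.set("A");
            sendReached.countDown();   // the "interceptor" caught the LeaderElected send
            await(release);            // block until the test has performed the join
            if (!setLocallyFirst) localLeader.set("A");
        });
        election.start();

        await(sendReached);
        // The joiner arrives now and can only learn what the coordinator knows locally.
        String seenByJoiner = localLeader.get();
        release.countDown();
        try {
            election.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenByJoiner;
    }
}
```

Blocking on the send makes the interleaving deterministic: with the old ordering (`run(false)`) the joiner sees no leader, and with the leader set first (`run(true)`) it sees A.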