Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] Clustering: leadership acquired actions could get stuck #1287

Merged
merged 2 commits into from Apr 3, 2023

Conversation

kozlovic
Copy link
Member

If a leadership changed occurred while leadership actions were
executed, before the raft.Barrier() call was made, the server
would be stuck in that call. This is because RAFT library
notifies the Streaming server code that a leadership changed
through a go channel that was just of size 1. Since the
streaming server read from the channel and then executes
the leadership acquired code, it could not read from the
notification channel that caused the RAFT library to block
on a go channel send, which then made the Barrier() call
block.

I believe the right approach is to have a bigger notification
go channel instead of making Barrier() time out. If it does
timeout, the server should then transfer leadership, which
I am afraid could cause a cascading effect if all servers
getting elected need longer that the chosen timeout to
apply all the preceding entries to the FSM.

Signed-off-by: Ivan Kozlovic ivan@synadia.com

If a leadership changed occurred while leadership actions were
executed, before the raft.Barrier() call was made, the server
would be stuck in that call. This is because RAFT library
notifies the Streaming server code that a leadership changed
through a go channel that was just of size 1. Since the
streaming server read from the channel and then executes
the leadership acquired code, it could not read from the
notification channel that caused the RAFT library to block
on a go channel send, which then made the Barrier() call
block.

I believe the right approach is to have a bigger notification
go channel instead of making Barrier() time out. If it does
timeout, the server should then transfer leadership, which
I am afraid could cause a cascading effect if all servers
getting elected need longer that the chosen timeout to
apply all the preceding entries to the FSM.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@coveralls
Copy link

Coverage Status

Coverage: 91.526% (+0.04%) from 91.49% when pulling ee84146 on fix_leadership_acquired into 2af2beb on main.

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kozlovic kozlovic merged commit 4cfa4f1 into main Apr 3, 2023
3 checks passed
@kozlovic kozlovic deleted the fix_leadership_acquired branch April 3, 2023 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants