[FIXED] Clustering: leadership acquired actions could get stuck #1287

kozlovic · 2023-03-31T00:21:29Z

If a leadership change occurred while the leadership-acquired
actions were executing, before the raft.Barrier() call was
made, the server would get stuck in that call. This is because
the RAFT library notifies the Streaming server code of a
leadership change through a Go channel that had a capacity of
only 1. Since the Streaming server reads from the channel and
then executes the leadership-acquired code, it could not read
from the notification channel again during that time. That
caused the RAFT library to block on a Go channel send, which
in turn made the Barrier() call block.

I believe the right approach is to use a bigger notification
Go channel instead of making Barrier() time out. If it did
time out, the server would then have to transfer leadership,
which I am afraid could cause a cascading effect if every
server that gets elected needs longer than the chosen timeout
to apply all the preceding entries to the FSM.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>

coveralls · 2023-03-31T00:28:23Z

Coverage: 91.526% (+0.04%) from 91.49% when pulling ee84146 on fix_leadership_acquired into 2af2beb on main.

derekcollison

LGTM

kozlovic added 2 commits March 30, 2023 17:53

Change travis to exclude staticcheck on Go 1.18

ee84146

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>

kozlovic requested a review from derekcollison March 31, 2023 00:21

derekcollison approved these changes Mar 31, 2023

kozlovic merged commit 4cfa4f1 into main Apr 3, 2023

kozlovic deleted the fix_leadership_acquired branch April 3, 2023 15:27
