Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: Fix possible deadlocks #1537

Merged
merged 3 commits into from
Sep 14, 2016
Merged

Commits on Sep 14, 2016

  1. raft: Fix race that leads to raft deadlock

    PR moby#1310 ("Fix infinite election loop") solved a problem with election
    loops on startup, by delaying new proposals until the leader has
    committed all its existing entries. This ensures that the state machine
    doesn't call ApplyStoreActions to commit a previous entry from the log
    while an new proposal is in process - since they both acquire a write
    lock over the memory store, which would deadlock.
    
    Unfortunately, there is still a race condition which can lead to a
    similar deadlock. processInternalRaftRequest makes sure that proposals
    arent't started after the manager loses its status as the leader by
    first registering a wait for the raft request, then checking the
    leadership status. If the leadership status is lost before calling
    register(), then the leadership check should fail, since it happens
    afterwards. Conversely, if the leadership status is lost after calling
    register(), then cancelAll() in Run() will make sure this wait gets
    cancelled.
    
    The problem with this is that the new code in PR moby#1310 calls cancelAll()
    *before* setting the leadership status. So it's possible that first we
    cancel all outstanding requests, then a new request is registered and
    successfully checks that we are still the leader, then we set leader to
    "false". This request never gets cancelled, so it causes a deadlock.
    Nothing can be committed to the store until this request goes through,
    but it can't go through if we're not the leader anymore.
    
    To fix this, swap the order of cancelAll so it happens after we change
    the leadership status variable. This means that no matter how the
    goroutines are interleaved, a new request will either cancel itself or
    be cancelled by Run when leadership is lost. I'm aware that this is ugly
    and I'm open to suggestions for refactoring or abstracting.
    
    Also, out of extra caution, call cancelAll in the situation which would
    lead to a deadlock if there were any outstanding raft requests.
    
    Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
    aaronlehmann committed Sep 14, 2016
    Configuration menu
    Copy the full SHA
    0324db7 View commit details
    Browse the repository at this point in the history
  2. raft: If no longer leader, cancel proposals before calling processCom…

    …mitted
    
    This was being done after processCommitted, which could cause a deadlock
    (if not for the cautionary cancelAll call added to processEntry in the
    previous commit).
    
    Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
    aaronlehmann committed Sep 14, 2016
    Configuration menu
    Copy the full SHA
    50b2dac View commit details
    Browse the repository at this point in the history
  3. raft: Fix possible deadlock in processInternalRaftRequest

    This function calls Propose, which will block if there's no leader.
    Suppose we lose leadership just before calling Propose, and now there's
    no leader. processInternalRaftRequest will block until a new leader is
    elected, but this may interfere with the leader election. If Run
    receives a Ready value that has new items to commit to the store, it
    will try to do that, and get stuck there because
    processInternalRaftRequest is called by a store function that has the
    write lock held. Then the Run goroutine will get stuck, and not be able
    to send outgoing messages.
    
    To solve this, create a context with a cancel function for
    processInternalRaftRequest, and make it so that cancelling the wait
    calls this cancel function and unblocks Propose.
    
    Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
    aaronlehmann committed Sep 14, 2016
    Configuration menu
    Copy the full SHA
    8b34600 View commit details
    Browse the repository at this point in the history