
raft: Fix possible deadlocks #1537

Merged
merged 3 commits on Sep 14, 2016

Conversation

aaronlehmann (Collaborator)

This includes three commits that fix possible deadlocks between ApplyStoreActions and processInternalRaftRequest:

raft: Fix race that leads to raft deadlock

PR #1310 ("Fix infinite election loop") solved a problem with election
loops on startup by delaying new proposals until the leader has
committed all its existing entries. This ensures that the state machine
doesn't call ApplyStoreActions to commit a previous entry from the log
while a new proposal is in progress, since both code paths acquire a
write lock over the memory store, which would deadlock.
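
A minimal, self-contained Go sketch of that circular wait (hypothetical names: storeLock, applyCommitted and proposeUpdate stand in for the memory store lock, ApplyStoreActions and a proposing store update; swarmkit's real locking is more involved):

```go
package main

import "sync"

// storeLock stands in for the memory store's lock.
var storeLock sync.RWMutex

// applyCommitted stands in for ApplyStoreActions: applying a committed
// entry needs the write lock, and only then is the entry marked applied.
func applyCommitted(applied chan<- struct{}) {
	storeLock.Lock()
	defer storeLock.Unlock()
	close(applied)
}

// proposeUpdate stands in for a store update that proposes through raft
// while holding the write lock. If it takes the lock before
// applyCommitted does, neither goroutine can progress: proposeUpdate
// waits for the entry to be applied, and applyCommitted waits for the
// lock.
func proposeUpdate(applied <-chan struct{}) {
	storeLock.Lock()
	defer storeLock.Unlock()
	<-applied
}
```

Delaying new proposals until all existing entries have been applied avoids starting proposeUpdate while applyCommitted still has pending work.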

Unfortunately, there is still a race condition which can lead to a
similar deadlock. processInternalRaftRequest makes sure that proposals
aren't started after the manager loses its status as the leader by
first registering a wait for the raft request, then checking the
leadership status. If the leadership status is lost before calling
register(), then the leadership check should fail, since it happens
afterwards. Conversely, if the leadership status is lost after calling
register(), then cancelAll() in Run() will make sure this wait gets
cancelled.
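
For illustration, a minimal sketch of that register-then-check ordering, with hypothetical names (waitRegistry, proposeRequest, isLeader) rather than swarmkit's actual types:

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

var errLostLeadership = errors.New("lost leadership")

// waitRegistry stands in for the wait map: one cancel function per
// outstanding raft request.
type waitRegistry struct {
	mu      sync.Mutex
	cancels map[uint64]func()
}

func (w *waitRegistry) register(id uint64, cancel func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.cancels[id] = cancel
}

func (w *waitRegistry) cancel(id uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if c, ok := w.cancels[id]; ok {
		delete(w.cancels, id)
		c()
	}
}

func (w *waitRegistry) cancelAll() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for id, c := range w.cancels {
		delete(w.cancels, id)
		c()
	}
}

var (
	waits    = &waitRegistry{cancels: make(map[uint64]func())}
	isLeader atomic.Bool
)

// proposeRequest registers its wait first and checks leadership second.
// If leadership is lost before register(), the check below fails; if it
// is lost after register(), cancelAll() in the Run loop finds this wait
// and cancels it (provided cancelAll runs after the leadership flag is
// cleared; see the next sketch).
func proposeRequest(id uint64, cancel func()) error {
	waits.register(id, cancel)
	if !isLeader.Load() {
		waits.cancel(id)
		return errLostLeadership
	}
	// ... propose through raft and wait for the entry to be committed ...
	return nil
}
```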

The problem with this is that the new code in PR #1310 calls cancelAll()
before setting the leadership status. So it's possible that first we
cancel all outstanding requests, then a new request is registered and
successfully checks that we are still the leader, then we set leader to
"false". This request never gets cancelled, so it causes a deadlock.
Nothing can be committed to the store until this request goes through,
but it can't go through if we're not the leader anymore.

To fix this, move the cancelAll call so that it happens after we change
the leadership status variable. This means that no matter how the
goroutines are interleaved, a new request will either cancel itself or
be cancelled by Run when leadership is lost. I'm aware that this is ugly
and I'm open to suggestions for refactoring or abstracting.

Also, out of extra caution, call cancelAll in the situation which would
lead to a deadlock if there were any outstanding raft requests.
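
Continuing the hypothetical sketch above, the corrected ordering on the leadership-loss side looks roughly like this:

```go
// onLoseLeadership clears the leadership flag before cancelling waits.
// Any request registered before the flag changed is swept up by
// cancelAll(); any request that registers afterwards fails its own
// leadership check in proposeRequest.
func onLoseLeadership() {
	isLeader.Store(false) // change the status variable first
	waits.cancelAll()     // then cancel everything already registered
}
```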

raft: If no longer leader, cancel proposals before calling processCommitted

This was being done after processCommitted, which could cause a deadlock
(if not for the cautionary cancelAll call added to processEntry in the
previous commit).

raft: Fix possible deadlock in processInternalRaftRequest

This function calls Propose, which blocks when there is no leader.
Suppose we lose leadership just before calling Propose:
processInternalRaftRequest will then block until a new leader is
elected, but it may itself interfere with the election. If Run receives
a Ready value that has new items to commit to the store, it will try to
do that and get stuck, because processInternalRaftRequest was called by
a store function that holds the write lock. With the Run goroutine
stuck, no outgoing messages can be sent, so the election cannot make
progress.

To solve this, create a context with a cancel function for
processInternalRaftRequest, and make it so that cancelling the wait
calls this cancel function and unblocks Propose.
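
A hedged sketch of the shape of this fix, reusing the hypothetical waitRegistry from the earlier sketch; raft.Node is etcd's raft Node interface, whose Propose method takes a context and returns once that context is cancelled (the import path shown is the current etcd module, swarmkit vendored an older one):

```go
package main

import (
	"context"

	"go.etcd.io/etcd/raft/v3"
)

// proposeWithCancel derives a cancellable context for the proposal and
// registers its cancel function as the wait's cancel callback, so
// cancelling the wait also unblocks a Propose call that is parked
// waiting for a leader.
func proposeWithCancel(ctx context.Context, rn raft.Node, waits *waitRegistry, id uint64, data []byte) error {
	waitCtx, cancel := context.WithCancel(ctx)
	defer cancel()

	// cancelAll (or a targeted cancel) on this wait now cancels waitCtx
	// as well.
	waits.register(id, cancel)

	// Propose blocks while there is no leader, but it honours context
	// cancellation, so a cancelled wait cannot wedge the caller (or,
	// transitively, the Run goroutine).
	return rn.Propose(waitCtx, data)
}
```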

cc @LK4D4

LK4D4 (Contributor) commented Sep 14, 2016

@aaronlehmann Nice, thanks! will try with my tests.

aaronlehmann modified the milestones: 1.12., 1.12.2 on Sep 14, 2016
codecov-io commented Sep 14, 2016

Current coverage is 53.84% (diff: 81.48%)

Merging #1537 into master will increase coverage by 0.13%

@@             master      #1537   diff @@
==========================================
  Files            82         82          
  Lines         13414      13415     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           7204       7223    +19   
+ Misses         5224       5209    -15   
+ Partials        986        983     -3   


Powered by Codecov. Last update ae7991c...8b34600

LK4D4 (Contributor) left a comment


The code makes sense to me. I've run basic tests and they work perfectly.

mrjana (Contributor) left a comment


LGTM
