raft: Fix possible deadlocks #1537

PR moby#1310 ("Fix infinite election loop") solved a problem with election loops on startup, by delaying new proposals until the leader has committed all its existing entries. This ensures that the state machine doesn't call ApplyStoreActions to commit a previous entry from the log while an new proposal is in process - since they both acquire a write lock over the memory store, which would deadlock. Unfortunately, there is still a race condition which can lead to a similar deadlock. processInternalRaftRequest makes sure that proposals arent't started after the manager loses its status as the leader by first registering a wait for the raft request, then checking the leadership status. If the leadership status is lost before calling register(), then the leadership check should fail, since it happens afterwards. Conversely, if the leadership status is lost after calling register(), then cancelAll() in Run() will make sure this wait gets cancelled. The problem with this is that the new code in PR moby#1310 calls cancelAll() *before* setting the leadership status. So it's possible that first we cancel all outstanding requests, then a new request is registered and successfully checks that we are still the leader, then we set leader to "false". This request never gets cancelled, so it causes a deadlock. Nothing can be committed to the store until this request goes through, but it can't go through if we're not the leader anymore. To fix this, swap the order of cancelAll so it happens after we change the leadership status variable. This means that no matter how the goroutines are interleaved, a new request will either cancel itself or be cancelled by Run when leadership is lost. I'm aware that this is ugly and I'm open to suggestions for refactoring or abstracting. Also, out of extra caution, call cancelAll in the situation which would lead to a deadlock if there were any outstanding raft requests. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

…mitted This was being done after processCommitted, which could cause a deadlock (if not for the cautionary cancelAll call added to processEntry in the previous commit). Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

This function calls Propose, which will block if there's no leader. Suppose we lose leadership just before calling Propose, and now there's no leader. processInternalRaftRequest will block until a new leader is elected, but this may interfere with the leader election. If Run receives a Ready value that has new items to commit to the store, it will try to do that, and get stuck there because processInternalRaftRequest is called by a store function that has the write lock held. Then the Run goroutine will get stuck, and not be able to send outgoing messages. To solve this, create a context with a cancel function for processInternalRaftRequest, and make it so that cancelling the wait calls this cancel function and unblocks Propose. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raft: Fix possible deadlocks #1537

raft: Fix possible deadlocks #1537

Commits on Sep 14, 2016