New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce fundingmanager chan send deadlock scenarios #1301
Reduce fundingmanager chan send deadlock scenarios #1301
Conversation
f.failFundingFlow(fmsg.peerAddress.IdentityKey, | ||
pendingChanID, err) | ||
fmsg.msg.ChanID, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// If this was the last active reservation for this peer, delete the | ||
// peer's entry altogether. | ||
if len(nodeReservations) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
fndgLog.Warnf("unable to delete reservation: %v", err) | ||
return | ||
} | ||
resCtx.err <- err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem that this actually solves the issue this PR was meant to. It should be safe to call handleErrorMsg
in a completely async fashion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should solve it, as we now cancel the reservation before we send the error on the channel, while before we did it after.
Good point about the async calling though, I will attempt to write a test that exercises that and make sure all necessary mutexes are in place.
UPDATE: I just checked, and handleErrorMsg
is only called from the reservationCoordinator
, which means it should only be called from one thread at a time. However, this was also the case before this change. We could select
on the resCtx.err
channel to make sure we don't deadlock, but I'm reluctant to deploy such a fix that is only really mitigating a different bug. This change greatly simplifies the ways the error could be sent, so if it happens again it should at least be much easier to track down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a commit that makes sure cancelReservationCtx
holds the mutex all the way through. Now a call to cancelReservationCtx
should be always return the resCtx
or an error, and since we now only send on the error channel after it is cancelled (exception is in CancelPeerReservations
, but that's under the mutex) we should be sage.
The error channel should never be nil, and it should always be buffered. Because of this we can send directly on the channel.
This commit moves the responsibility of sending a funding error on the reservation error channel inside failFundingFlow, reducing the places we need to keep track of sending it.
784d435
to
3d96462
Compare
This commit changes cancelReservationCtx to gold the resMtx from start to finish. Earlier it would lock at different times only when accessing the maps, meaning that other goroutines (I'm looking at you PeerTerminationWatcher) could come in and grab the context in between locks, possibly leading to a race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🐉
This PR consists of several commits meant to simplify and reduce the cases where we can end up deadlocking on sending errors on the error channels within
fundingmanager
.The biggest change is that we now determine whether to send on the
resCtx.err
channel insidefailFundingFlow
.After this change we will send on the
resCtx.err
channel only after a call tocancelReservationCtx
, making sure it won't be attempted again (which would cause a deadlock).We also increase the buffer of the
updateChan
to 2, so we can safely send on it during the funding workflow without blocking.Fixes #1254.