Reduce fundingmanager chan send deadlock scenarios #1301

halseth · 2018-05-30T10:52:13Z (edited by Roasbeef)

This PR consists of several commits meant to simplify and reduce the cases where the fundingmanager can end up deadlocking when sending errors on its error channels.

The biggest change is that we now determine whether to send on the resCtx.err channel inside failFundingFlow.

After this change, we send on the resCtx.err channel only after a successful call to cancelReservationCtx, which ensures the send won't be attempted a second time (a second send would cause a deadlock).

We also increase the buffer of the updateChan to 2, so we can safely send on it during the funding workflow without blocking.
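A minimal sketch of why a buffer of 2 helps, assuming (hypothetically) that the funding workflow emits at most two status updates on updateChan, for example a "pending" and an "open" notification. The statusUpdate type and runFundingFlow function are illustrative stand-ins, not lnd's server code.

```go
package main

import "fmt"

// statusUpdate is a hypothetical stand-in for a funding status update.
type statusUpdate string

// runFundingFlow simulates a funding workflow that emits two updates.
// With a buffer of 2, both sends complete even if nobody is reading yet.
func runFundingFlow(updates chan<- statusUpdate) {
	updates <- "pending"
	updates <- "open"
}

func main() {
	updateChan := make(chan statusUpdate, 2)

	// No receiver is running; with a smaller buffer this call would block.
	runFundingFlow(updateChan)

	fmt.Println(len(updateChan))            // 2
	fmt.Println(<-updateChan, <-updateChan) // pending open
}
```

The buffer size only guarantees non-blocking sends as long as the number of sends per flow never exceeds it; a third update would block again without a reader.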

Fixes #1254.

Roasbeef · 2018-06-01T03:25:14Z

fundingmanager.go

 		f.failFundingFlow(fmsg.peerAddress.IdentityKey,
-			pendingChanID, err)
+			fmsg.msg.ChanID, err)


Roasbeef · 2018-06-01T03:48:37Z

fundingmanager.go

+
+	// If this was the last active reservation for this peer, delete the
+	// peer's entry altogether.
+	if len(nodeReservations) == 0 {
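The cleanup added in this hunk can be sketched as follows, assuming a simplified nested map from peer to that peer's reservations. The key types and the deleteReservation method are hypothetical; the point is only that an empty inner map is removed together with its last reservation.

```go
package main

import "fmt"

// pendingChanID and peerKey are hypothetical simplified key types.
type pendingChanID [32]byte
type peerKey string

// reservations maps each peer to its active reservations, keyed by pending
// channel ID (a simplified stand-in for the fundingmanager's nested map).
type reservations map[peerKey]map[pendingChanID]struct{}

// deleteReservation removes one reservation and, if it was the peer's last,
// drops the peer's entry so no empty inner map lingers.
func (r reservations) deleteReservation(peer peerKey, id pendingChanID) {
	nodeReservations, ok := r[peer]
	if !ok {
		return
	}
	delete(nodeReservations, id)

	// If this was the last active reservation for this peer, delete the
	// peer's entry altogether.
	if len(nodeReservations) == 0 {
		delete(r, peer)
	}
}

func main() {
	r := reservations{
		"alice": {{1}: {}, {2}: {}},
	}
	r.deleteReservation("alice", pendingChanID{1})
	fmt.Println(len(r)) // 1: alice still has reservation 2
	r.deleteReservation("alice", pendingChanID{2})
	fmt.Println(len(r)) // 0: alice's entry removed
}
```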


Roasbeef · 2018-06-01T03:52:07Z

fundingmanager.go

-		fndgLog.Warnf("unable to delete reservation: %v", err)
-		return
-	}
+	resCtx.err <- err


It doesn't seem that this actually solves the issue this PR was meant to fix. It should be safe to call handleErrorMsg in a completely async fashion.

halseth · Jun 1, 2018

It should solve it: we now cancel the reservation before sending the error on the channel, whereas previously we did it afterwards.

Good point about the async calling though, I will attempt to write a test that exercises that and make sure all necessary mutexes are in place.

UPDATE: I just checked, and handleErrorMsg is only called from the reservationCoordinator, which means it should only be called from one goroutine at a time. However, this was also the case before this change. We could select on the resCtx.err channel to make sure we don't deadlock, but I'm reluctant to deploy a fix that only masks a different bug. This change greatly simplifies the ways the error can be sent, so if it happens again it should at least be much easier to track down.

Pushed a commit that makes sure cancelReservationCtx holds the mutex all the way through. Now a call to cancelReservationCtx should always return either the resCtx or an error, and since we now only send on the error channel after the reservation has been cancelled (the exception is CancelPeerReservations, but that runs under the mutex), we should be safe.

The error channel should never be nil, and it should always be buffered. Because of this we can send directly on the channel.
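A sketch of the direct send this relies on, assuming every reservation context is created with a non-nil error channel of capacity 1. The constructor, the map layout and this CancelPeerReservations signature are simplified and hypothetical; the sketch holds the mutex while sending and removes the peer's entry, so no other goroutine can cancel (and send on) the same contexts afterwards.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// reservationCtx is hypothetical; err is never nil and always buffered.
type reservationCtx struct {
	err chan error
}

func newReservationCtx() *reservationCtx {
	return &reservationCtx{err: make(chan error, 1)}
}

type fundingManager struct {
	resMtx             sync.Mutex
	activeReservations map[string]map[[32]byte]*reservationCtx
}

var errPeerDisconnected = errors.New("peer disconnected")

// CancelPeerReservations fails every reservation for peer while holding the
// mutex, then removes the peer's entry so the contexts cannot be reused.
// It returns how many reservations were failed.
func (f *fundingManager) CancelPeerReservations(peer string) int {
	f.resMtx.Lock()
	defer f.resMtx.Unlock()

	n := 0
	for _, ctx := range f.activeReservations[peer] {
		// Direct send is safe: the channel is buffered and this is its only send.
		ctx.err <- errPeerDisconnected
		n++
	}
	delete(f.activeReservations, peer)
	return n
}

func main() {
	a, b := newReservationCtx(), newReservationCtx()
	f := &fundingManager{activeReservations: map[string]map[[32]byte]*reservationCtx{
		"alice": {{1}: a, {2}: b},
	}}
	fmt.Println(f.CancelPeerReservations("alice")) // 2
	fmt.Println(<-a.err, <-b.err)
}
```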

This commit moves the responsibility of sending a funding error on the reservation error channel inside failFundingFlow, reducing the places we need to keep track of sending it.

This commit changes cancelReservationCtx to hold the resMtx from start to finish. Earlier it would lock only while accessing the maps, at different times, meaning that other goroutines (I'm looking at you, PeerTerminationWatcher) could come in and grab the context in between locks, possibly leading to a race.

Roasbeef

LGTM 🐉

Roasbeef requested changes Jun 1, 2018

halseth added 9 commits June 1, 2018 08:55

fundingmanager: use non-zero channel ID

5ddb154

server: increase updateChan buffer to 2

08d9edd

fundingmanager: don't block on pendingChannels chan send

6ee58ec

fundingmanager: delete empty peer in deleteReservationCtx

ccf9cd4

fundingmanager: fail funding flow on confirmation timeout

b63ee1a

fundingmanager: send error directly in CancelPeerReservations

7d38e35

fundingmanager: simplify handleErrorMsg

9090344

fundingmanager: send on resCtx.err in failFundingFlow

d20cb8e

fundingmanager test: check that error is sent on timeout

3d96462

halseth force-pushed the fundingmanager-double-error branch from 784d435 to 3d96462 on June 1, 2018 06:55

Roasbeef approved these changes Jun 2, 2018

Roasbeef merged commit 7e4abb4 into lightningnetwork:master Jun 2, 2018

halseth deleted the fundingmanager-double-error branch July 12, 2018 13:28
