Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peer: eliminate circular waiting by calling maybeSendNodeAnn async #7938

Merged
merged 1 commit into from Aug 30, 2023

Conversation

Roasbeef
Copy link
Member

@Roasbeef Roasbeef commented Aug 29, 2023

In this commit, we attempt to fix circular waiting scenario introduced inadvertently when fixing a race condition
scenario
. With that PR, we added a new channel that would block Disconnect, and WaitForDisconnect to ensure that only until the Start method has finished would those calls be allowed to succeed.

The issue is that if the server is trying to disconnect a peer due to a concurrent connection, but Start is blocked on maybeSendNodeAnn, which then wants to grab the main server mutex, then Start can never exit, which causes startReady to never be closed, which then causes the server to be blocked.

This PR attempts to fix the issue by calling maybeSendNodeAnn in a goroutine, so it can grab the server mutex and not block the Start method.

Fixes #7924

Fixes #7928

Fixes #7866

Fixes #7917

In this commit, we attempt to fix circular waiting scenario introduced
inadvertently when [fixing a race condition
scenario](lightningnetwork#7856). With that
PR, we added a new channel that would block `Disconnect`, and
`WaitForDisconnect` to ensure that only until the `Start` method has
finished would those calls be allowed to succeed.

The issue is that if the server is trying to disconnect a peer due to a
concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`,
which then wants to grab the main server mutex, then `Start` can never
exit, which causes `startReady` to never be closed, which then causes
the server to be blocked.

This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a
goroutine, so it can grab the server mutex and not block the `Start`
method.

Fixes lightningnetwork#7924

Fixes lightningnetwork#7928

Fixes lightningnetwork#7866
@Roasbeef Roasbeef added this to the v0.17.0 milestone Aug 29, 2023
@Roasbeef
Copy link
Member Author

IMO this isn't an ideal solution, but pretty sure this is the interaction at hand.

Here're some supporting goroutine stack traces:

Server blocked on grabbing the mutex to execute the InboundPeerConnected call back:

285 @ 0x43df0e 0x44f3b8 0x44f38f 0x46cf65 0x48d81d 0x48ecf1 0x48ecd2 0x18da3df 0x4710e1
#    0x46cf64    sync.runtime_SemacquireMutex+0x24                    runtime/sema.go:77
#    0x48d81c    sync.(*Mutex).lockSlow+0x15c                        sync/mutex.go:171
#    0x48ecf0    sync.(*Mutex).Lock+0x30                            sync/mutex.go:90
#    0x48ecd1    sync.(*RWMutex).Lock+0x11                        sync/rwmutex.go:147
#    0x18da3de    github.com/lightningnetwork/lnd.(*server).InboundPeerConnected+0xfe    github.com/lightningnetwork/lnd/server.go:3520

Peer goroutines unable to be torn down (concurrent connection case), as waiting on startReady to be closed:

1 @ 0x43df0e 0x44e385 0x18196fd 0x181c27a 0x4710e1
#    0x18196fc    github.com/lightningnetwork/lnd/peer.(*Brontide).Disconnect+0x9c    github.com/lightningnetwork/lnd/peer/brontide.go:1115
#    0x181c279    github.com/lightningnetwork/lnd/peer.(*Brontide).readHandler+0x10b9    github.com/lightningnetwork/lnd/peer/brontide.go:1697

1 @ 0x43df0e 0x44e385 0x18196fd 0x18df825 0x18da9ba 0x4710e1
#    0x18196fc    github.com/lightningnetwork/lnd/peer.(*Brontide).Disconnect+0x9c    github.com/lightningnetwork/lnd/peer/brontide.go:1115
#    0x18df824    github.com/lightningnetwork/lnd.(*server).removePeer+0x104        github.com/lightningnetwork/lnd/server.go:4309
#    0x18da9b9    github.com/lightningnetwork/lnd.(*server).InboundPeerConnected+0x6d9    github.com/lightningnetwork/lnd/server.go:3586

Peer goroutine can't finish starting up as it needs to call maybeSendAnn, which wants the server mutex:

2 @ 0x43df0e 0x44f3b8 0x44f38f 0x46cf65 0x48d81d 0x48ecf1 0x48ecd2 0x18d6a8f 0x18dcc65 0x18192d1 0x1816ac5 0x18dd0b8 0x4710e1
#    0x46cf64    sync.runtime_SemacquireMutex+0x24                    runtime/sema.go:77
#    0x48d81c    sync.(*Mutex).lockSlow+0x15c                        sync/mutex.go:171
#    0x48ecf0    sync.(*Mutex).Lock+0x30                            sync/mutex.go:90
#    0x48ecd1    sync.(*RWMutex).Lock+0x11                        sync/rwmutex.go:147
#    0x18d6a8e    github.com/lightningnetwork/lnd.(*server).genNodeAnnouncement+0x8e    github.com/lightningnetwork/lnd/server.go:2972
#    0x18dcc64    github.com/lightningnetwork/lnd.(*server).peerConnected.func1+0x84    github.com/lightningnetwork/lnd/server.go:3857
#    0x18192d0    github.com/lightningnetwork/lnd/peer.(*Brontide).maybeSendNodeAnn+0xb0    github.com/lightningnetwork/lnd/peer/brontide.go:1069
#    0x1816ac4    github.com/lightningnetwork/lnd/peer.(*Brontide).Start+0xae4        github.com/lightningnetwork/lnd/peer/brontide.go:675
#    0x18dd0b7    github.com/lightningnetwork/lnd.(*server).peerInitializer+0x137        github.com/lightningnetwork/lnd/server.go:3977

So now Start can't finish, which means that Disconnect can't be called.

Copy link
Collaborator

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works. There's also an old PR #7525 which removes maybeSendNodeAnn as it seems it's not needed?

Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

Either this approach or the one in #7525 seem fine to me.

Copy link
Collaborator

@Crypt-iQ Crypt-iQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch 🎣

@Roasbeef Roasbeef merged commit eb0d8af into lightningnetwork:master Aug 30, 2023
21 of 25 checks passed
Roasbeef added a commit to Roasbeef/lnd that referenced this pull request Sep 28, 2023
This PR is a follow up, to a [follow
up](lightningnetwork#7938) of an [initial
concurrency issue](lightningnetwork#7856)
fixed in the peer goroutine.

In lightningnetwork#7938, we noticed that the introduction of `p.startReady` can cause
`Disconnect` to block. This happens as `Disconnect` cannot be called
until `p.startReady` has been closed. `Disconnect` is also called from
`InboundPeerConnected` (the case of concurrent peers, so we need to
remove one of the connections) while the main server mutex is held. If
`p.Start` blocks for any reason, then this leads to the deadlock as: we
can't disconnect until we've finished starting, and we can't finish
starting as we need the disconnect caller to exit as it has the mutex.

In this commit, we now make the call to `prunePersistentPeerConnection`
async. The call to `prunePersistentPeerConnection` eventually wants to
grab the server mutex, which triggers the circular waiting scenario
above.

The main learning here is that no calls to the main server mutex path
can block from `p.Start`. This is more or less a stop gap to resolve the
issue initially introduced in v0.16.4. Assuming we want to move forward
with this fix, we should reexamine `p.startReady` all together, and also
revisit attempt to refactor this section of the code to eliminate the
mega mutex in the server in favor of a dedicated event loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants