Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peer: launch persistent peer pruning in background goroutine #8041

Merged
merged 1 commit into from Sep 28, 2023

Conversation

Roasbeef
Copy link
Member

This PR is a follow up, to a follow
up
of an initial concurrency issue fixed in the peer goroutine.

In #7938, we noticed that the introduction of p.startReady can cause Disconnect to block. This happens as Disconnect cannot be called until p.startReady has been closed. Disconnect is also called from InboundPeerConnected (the case of concurrent peers, so we need to remove one of the connections) while the main server mutex is held. If p.Start blocks for any reason, then this leads to the deadlock as: we can't disconnect until we've finished starting, and we can't finish starting as we need the disconnect caller to exit as it has the mutex.

In this commit, we now make the call to prunePersistentPeerConnection async. The call to prunePersistentPeerConnection eventually wants to grab the server mutex, which triggers the circular waiting scenario above.

The main learning here is that no calls to the main server mutex path can block from p.Start. This is more or less a stop gap to resolve the issue initially introduced in v0.16.4. Assuming we want to move forward with this fix, we should reexamine p.startReady all together, and also revisit attempt to refactor this section of the code to eliminate the mega mutex in the server in favor of a dedicated event loop.

Fixes #8039

This PR is a follow up, to a [follow
up](lightningnetwork#7938) of an [initial
concurrency issue](lightningnetwork#7856)
fixed in the peer goroutine.

In lightningnetwork#7938, we noticed that the introduction of `p.startReady` can cause
`Disconnect` to block. This happens as `Disconnect` cannot be called
until `p.startReady` has been closed. `Disconnect` is also called from
`InboundPeerConnected` (the case of concurrent peers, so we need to
remove one of the connections) while the main server mutex is held. If
`p.Start` blocks for any reason, then this leads to the deadlock as: we
can't disconnect until we've finished starting, and we can't finish
starting as we need the disconnect caller to exit as it has the mutex.

In this commit, we now make the call to `prunePersistentPeerConnection`
async. The call to `prunePersistentPeerConnection` eventually wants to
grab the server mutex, which triggers the circular waiting scenario
above.

The main learning here is that no calls to the main server mutex path
can block from `p.Start`. This is more or less a stop gap to resolve the
issue initially introduced in v0.16.4. Assuming we want to move forward
with this fix, we should reexamine `p.startReady` all together, and also
revisit attempt to refactor this section of the code to eliminate the
mega mutex in the server in favor of a dedicated event loop.
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

Copy link
Collaborator

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix🙏

@Roasbeef
Copy link
Member Author

Thx for the review, confirmed fix by the user the reported it. Looking forward to reworking this area entirely and getting rid of the mega mutex.

@Roasbeef Roasbeef merged commit edeb0d7 into lightningnetwork:master Sep 28, 2023
22 of 26 checks passed
@LNBIG-COM
Copy link

Hello,
May be fix was not fixed...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug]: 0.17.0-rc5 - lncli getinfo hangs, lnd has 20000 files opened
4 participants