-
Couldn't load subscription status.
- Fork 421
Fix race in PeerManager read pausing.
#4168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix race in PeerManager read pausing.
#4168
Conversation
TheBlueMatt
commented
Oct 22, 2025
|
👋 I see @tankyleo was un-assigned. |
aa5f64e to
ad1e948
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4168 +/- ##
==========================================
+ Coverage 88.78% 88.81% +0.03%
==========================================
Files 180 180
Lines 137004 137261 +257
Branches 137004 137261 +257
==========================================
+ Hits 121642 121914 +272
+ Misses 12538 12535 -3
+ Partials 2824 2812 -12
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer. |
What API change would be required to fix it completely? And is a reason not to do it to avoid breaking external usage of this code? |
|
The API change in this PR should fix it completely. My comment was about backporting to 0.1, where we aren't allowed to remove the returned-bool from |
e4a70b9 to
bd4356a
Compare
|
Dropped the first commit as it makes it more annoying to remove the spurious |
|
✅ Added second reviewer: @wpaulino |
|
✅ Added second reviewer: @jkczyz |
|
✅ Added second reviewer: @tankyleo |
|
✅ Added second reviewer: @valentinewallace |
Can you clarify this in the commit message? It isn't clear atm |
bd4356a to
254673b
Compare
|
Okay, I rewrote that paragraph |
^ from the commit message I found this unclear at first, I think adding a sentence saying "This race condition caused us to unnecessarily halt reads from a peer, even after we got out some writes that should have unpaused the reads. Halting reads in this way triggered a hang in testing." would go a long way (assuming what I wrote is correct) |
254673b to
028d64c
Compare
The proposed wording I found confusing as it seems to imply that the write events happen after the read event, which is only half-true. I tried again with the following: |
|
Much clearer, thanks. I'm good with a squash (and CI fix) |
06ad948 to
6dd0836
Compare
|
Squashed with an additional doc fix: |
|
|
We recently ran into a race condition on macOS where `read_event` would return `Ok(true)` (implying reads should be paused) due to many queued outbound messages but before the caller was able to set the read-pause flag, the `send_data` calls to flush the buffered messages completed. Thus, when the `read_event` caller got scheduled again, the buffer was empty and we should be reading, but it is finally processing the read-pause flag and we end up hanging, unwilling to read messages and unable to learn that we should start reading again as there are no messages to `send_data` for. This should be fairly rare, but not unheard of - the `pause_read` flag in `read_event` is calculated before handling the last message, so there's some time between when its calculated and when its returned. However, that has to race with multiple calls to `send_data` to send all the pending messages, which all have to complete before the `read_event` return happens. We've (as far as I recall) never hit this in prod, but a benchmark HTLC-flood test managed to hit it somewhat reliably within a few minutes on macOS and when a synthetic few-ms sleep was added to each message handling call. Ultimately this is an issue with the API - we pause reads via a returned flag but unpause them via a called method, creating two independent "stream"s of pause/unpauses which can get out of sync. Thus, here, we stick to a single "stream" of pause-read events from `PeerManager` to user code via `send_data` calls, dropping the read-pause flag return from `read_event` entirely. Technically this adds risk that someone can flood us with enough messages fast enough to bloat our outbound buffer for a peer before `PeerManager::process_events` gets called and can flush the pause flag via `read_event` calls to all descriptors. This isn't ideal but it should still be relatively hard to do as `process_events` calls are pretty quick and should be triggered immediately after each `read_event` call completes.
In the previous commit, we moved the `send_data` `resume_read` flag to also indicate that we should pause if its unset. This should work as we mostly only set the flag when we're sending but may cause us to fail to pause if we are blocked on gossip validation but `awaiting_write_event` wasn't set as we had previously failed to fully flush a buffer (which no longer implies read-pause). Here we make this logic much more robust by ensuring we always make at least one `send_data` call in `do_attempt_write_data` if we need to pause read (or unpause read).
6dd0836 to
5e93838
Compare
|
Ha, heat me to it, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion but LGTM
| let mut have_written = false; | ||
| while !peer.awaiting_write_event { | ||
| if !self.should_read_from(peer) { | ||
| if !peer.sent_pause_read { | ||
| force_one_write = true; | ||
| } | ||
| } else { | ||
| if peer.sent_pause_read { | ||
| force_one_write = true; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified, and I think a comment would be helpful in explaining what's going on:
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 3c7f4a845..76737ea67 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -1542,15 +1542,12 @@ where
fn do_attempt_write_data(
&self, descriptor: &mut Descriptor, peer: &mut Peer, mut force_one_write: bool,
) {
- if !self.should_read_from(peer) {
- if !peer.sent_pause_read {
- force_one_write = true;
- }
- } else {
- if peer.sent_pause_read {
- force_one_write = true;
- }
- }
+ // If we detect that we should be reading from the peer but reads are currently paused, or vice
+ // versa, then we need to tell the peer to update their internal flag indicating whether or not
+ // reads are paused. Do this by forcing a write to the peer with the desired `continue_read`
+ // flag set, even if no outbound write is currently pending.
+ force_one_write |= self.should_read_from(peer) == peer.sent_pause_read;
+
while force_one_write || !peer.awaiting_write_event {
if peer.should_buffer_onion_message() {
if let Some((peer_node_id, _)) = peer.their_node_id {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd indeed say this fix is also a simplification. Win-win.
Still don't have full understanding of this part of the code, so should consider my approval secondary.