
Conversation

@TheBlueMatt
Collaborator

We recently ran into a race condition on macOS where `read_event`
would return `Ok(true)` (implying reads should be paused), but calls
to `send_data`, which flushed the buffer, completed before the
`read_event` caller was able to set the read-pause flag.

This should be fairly rare, but not unheard of - the `pause_read`
flag in `read_event` is calculated before handling the last
message, so there's some time between when it's calculated and when
it's returned. However, that has to race with multiple calls to
`send_data` to send all the pending messages, which all have to
complete before the `read_event` return happens. We've (as far as I
can tell) never hit this on Linux, but a benchmark HTLC-flood test
managed to hit it somewhat reliably within a few minutes on macOS.

Ultimately we can't fix this with the current API (though we could
make it more rare). Thus, here, we stick to a single "stream" of
pause-read events from `PeerManager` to user code via `send_data`
calls, dropping the read-pause flag return from `read_event`
entirely.

Technically this adds risk that someone can flood us with enough
messages fast enough to bloat our outbound buffer for a peer before
`PeerManager::process_events` gets called and can flush the pause
flag via `read_event` calls to all descriptors. This isn't ideal
but it should still be relatively hard to do as `process_events`
calls are pretty quick and should be triggered immediately after
each `read_event` call completes.
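
For readers less familiar with the networking interface, here is a rough before/after sketch of the shapes being discussed. This is a minimal illustration under assumptions, not the exact LDK signatures: the `continue_read` name is taken from the diff comments later in this thread, `resume_read` from the pre-change docs, and the surrounding types are abbreviated.

```rust
// Illustrative sketch only -- not the exact LDK API.
struct PeerHandleError;

// Before: the socket driver learned about read-pauses from two places -- the
// bool returned by `read_event` *and* the resume-read flag on `send_data` --
// and those two signals could get out of sync with each other.
trait SocketDescriptorBefore {
    fn send_data(&mut self, data: &[u8], resume_read: bool) -> usize;
}
trait PeerManagerBefore<D> {
    /// `Ok(true)` tells the caller to pause reads on this socket.
    fn read_event(&self, descriptor: &mut D, data: &[u8]) -> Result<bool, PeerHandleError>;
}

// After: `read_event` no longer returns a pause flag; pause/unpause is carried
// only on `send_data`, giving a single ordered stream of signals.
trait SocketDescriptorAfter {
    /// When `continue_read` is false, the driver should stop reading from this
    /// peer until a later call sets it back to true.
    fn send_data(&mut self, data: &[u8], continue_read: bool) -> usize;
}
trait PeerManagerAfter<D> {
    fn read_event(&self, descriptor: &mut D, data: &[u8]) -> Result<(), PeerHandleError>;
}
```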

@TheBlueMatt TheBlueMatt added this to the 0.2 milestone Oct 22, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Oct 22, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch from aa5f64e to ad1e948 on October 22, 2025 21:18
@codecov

codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.80%. Comparing base (05f2848) to head (467d311).
⚠️ Report is 50 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/peer_handler.rs 92.50% 3 Missing ⚠️
lightning-background-processor/src/lib.rs 0.00% 1 Missing ⚠️
lightning-net-tokio/src/lib.rs 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4168      +/-   ##
==========================================
+ Coverage   88.78%   88.80%   +0.01%     
==========================================
  Files         180      180              
  Lines      137004   137271     +267     
  Branches   137004   137271     +267     
==========================================
+ Hits       121642   121901     +259     
- Misses      12538    12559      +21     
+ Partials     2824     2811      -13     
Flag Coverage Δ
fuzzing 21.49% <65.00%> (+0.53%) ⬆️
tests 88.64% <89.36%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@ldk-reviews-bot

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

@joostjager
Contributor

Ultimately we can't fix this with the current API (though we could
make it more rare).

What API change would be required to fix it completely? And is the reason not to do it to avoid breaking external usage of this code?

@TheBlueMatt
Collaborator Author

The API change in this PR should fix it completely. My comment was about backporting to 0.1, where we aren't allowed to remove the returned bool from read_event, and probably shouldn't silently change the semantics of the resume_read/continue_read bool (which now implies we should stop reading if it's false).

@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch 2 times, most recently from e4a70b9 to bd4356a on October 25, 2025 14:15
@TheBlueMatt
Collaborator Author

Dropped the first commit as it makes it more annoying to remove the spurious Box::pins now that our MSRV is higher.

@ldk-reviews-bot

✅ Added second reviewer: @wpaulino

@ldk-reviews-bot

✅ Added second reviewer: @jkczyz

@ldk-reviews-bot

✅ Added second reviewer: @tankyleo

@ldk-reviews-bot

✅ Added second reviewer: @valentinewallace

@valentinewallace
Contributor

The API change in this PR should fix it completely. My comment was about backporting to 0.1, where we aren't allowed to remove the returned bool from read_event, and probably shouldn't silently change the semantics of the resume_read/continue_read bool (which now implies we should stop reading if it's false).

Can you clarify this in the commit message? It isn't clear atm

@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch from bd4356a to 254673b on October 27, 2025 15:51
@TheBlueMatt
Collaborator Author

Okay, I rewrote that paragraph.

@valentinewallace
Contributor

We recently ran into a race condition on macOS where read_event
would return Ok(true) (implying reads should be paused) but calls
to send_data which flushed the buffer completed before the
read_event caller was able to set the read-pause flag.

^ from the commit message

I found this unclear at first. I think adding a sentence saying "This race condition caused us to unnecessarily halt reads from a peer, even after we got out some writes that should have unpaused the reads. Halting reads in this way triggered a hang in testing." would go a long way (assuming what I wrote is correct)

@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch from 254673b to 028d64c on October 27, 2025 19:57
@TheBlueMatt
Collaborator Author

I found this unclear at first. I think adding a sentence saying "This race condition caused us to unnecessarily halt reads from a peer, even after we got out some writes that should have unpaused the reads. Halting reads in this way triggered a hang in testing." would go a long way (assuming what I wrote is correct)

I found the proposed wording confusing, as it seems to imply that the write events happen after the read event, which is only half true. I tried again with the following:

We recently ran into a race condition on macOS where `read_event`
would return `Ok(true)` (implying reads should be paused) due to
many queued outbound messages but before the caller was able to
set the read-pause flag, the `send_data` calls to flush the
buffered messages completed. Thus, when the `read_event` caller got
scheduled again, the buffer was empty and we should be reading, but
it is finally processing the read-pause flag and we end up hanging,
unwilling to read messages and unable to learn that we should start
reading again as there are no messages to `send_data` for.
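
Restating that interleaving as a rough timeline (the task names below are purely illustrative, not code from this PR; "socket task" is whatever drives `read_event`/`send_data`):

```text
socket task:      read_event(desc, data)
                    -> handling the messages queues many outbound messages
                    -> computes pause_read = true (outbound buffer is full)
process_events:   send_data(desc, ..) called repeatedly
                    -> outbound buffer fully flushed; reads could resume
socket task:      read_event returns Ok(true)   <- stale by now
                    -> caller sets its local "reads paused" flag

Result: reads stay paused, but the buffer is already empty, so no further
send_data call will ever arrive to unpause them and the connection hangs.
```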

@valentinewallace
Contributor

valentinewallace commented Oct 27, 2025

Much clearer, thanks. I'm good with a squash (and CI fix)

@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch 2 times, most recently from 06ad948 to 6dd0836 on October 27, 2025 20:39
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Oct 27, 2025

Squashed with an additional doc fix:

$ git diff-tree -U1 06ad9485c 6dd0836c8
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 9ac39a6fbc..3c7f4a845b 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -783,3 +783,3 @@ struct Peer {
 	awaiting_write_event: bool,
-	/// Set to true if the last call to [`Descriptor::send_data`] for this peer had the
+	/// Set to true if the last call to [`SocketDescriptor::send_data`] for this peer had the
 	/// `should_read` flag unset, indicating we've told the driver to stop reading from this peer.

@valentinewallace
Contributor

fuzz is still sad

We recently ran into a race condition on macOS where `read_event`
would return `Ok(true)` (implying reads should be paused) due to
many queued outbound messages but before the caller was able to
set the read-pause flag, the `send_data` calls to flush the
buffered messages completed. Thus, when the `read_event` caller got
scheduled again, the buffer was empty and we should be reading, but
it is finally processing the read-pause flag and we end up hanging,
unwilling to read messages and unable to learn that we should start
reading again as there are no messages to `send_data` for.

This should be fairly rare, but not unheard of - the `pause_read`
flag in `read_event` is calculated before handling the last
message, so there's some time between when it's calculated and when
it's returned. However, that has to race with multiple calls to
`send_data` to send all the pending messages, which all have to
complete before the `read_event` return happens. We've (as far as I
recall) never hit this in prod, but a benchmark HTLC-flood test
managed to hit it somewhat reliably within a few minutes on macOS
and when a synthetic few-ms sleep was added to each message
handling call.

Ultimately this is an issue with the API - we pause reads via a
returned flag but unpause them via a called method, creating two
independent "stream"s of pause/unpauses which can get out of sync.
Thus, here, we stick to a single "stream" of pause-read events from
`PeerManager` to user code via `send_data` calls, dropping the
read-pause flag return from `read_event` entirely.

Technically this adds risk that someone can flood us with enough
messages fast enough to bloat our outbound buffer for a peer before
`PeerManager::process_events` gets called and can flush the pause
flag via `read_event` calls to all descriptors. This isn't ideal
but it should still be relatively hard to do as `process_events`
calls are pretty quick and should be triggered immediately after
each `read_event` call completes.
@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch from 6dd0836 to 5e93838 on October 27, 2025 20:47
@TheBlueMatt
Collaborator Author

Ha, beat me to it:

$ git diff-tree -U1 06ad9485c 5e93838f3
diff --git a/fuzz/src/full_stack.rs b/fuzz/src/full_stack.rs
index 186dce0933..97a74871ea 100644
--- a/fuzz/src/full_stack.rs
+++ b/fuzz/src/full_stack.rs
@@ -697,3 +697,3 @@ pub fn do_test(mut data: &[u8], logger: &Arc<dyn Logger>) {
 				match loss_detector.handler.read_event(&mut peer, get_slice!(get_slice!(1)[0])) {
-					Ok(res) => assert!(!res),
+					Ok(()) => {},
 					Err(_) => {
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 9ac39a6fbc..3c7f4a845b 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -783,3 +783,3 @@ struct Peer {
 	awaiting_write_event: bool,
-	/// Set to true if the last call to [`Descriptor::send_data`] for this peer had the
+	/// Set to true if the last call to [`SocketDescriptor::send_data`] for this peer had the
 	/// `should_read` flag unset, indicating we've told the driver to stop reading from this peer.

Contributor

@valentinewallace valentinewallace left a comment

One suggestion but LGTM

joostjager
joostjager previously approved these changes Oct 28, 2025
Contributor

@joostjager joostjager left a comment

I'd indeed say this fix is also a simplification. Win-win.

I still don't have a full understanding of this part of the code, so my approval should be considered secondary.

In the previous commit, we moved the `send_data` `resume_read` flag
to also indicate that we should pause if it's unset. This should
work as we mostly only set the flag when we're sending but may
cause us to fail to pause if we are blocked on gossip validation
but `awaiting_write_event` wasn't set as we had previously failed
to fully flush a buffer (which no longer implies read-pause).

Here we make this logic much more robust by ensuring we always make
at least one `send_data` call in `do_attempt_write_data` if we
need to pause read (or unpause read).
@TheBlueMatt TheBlueMatt force-pushed the 2025-10-net-race-fixes branch from 276894f to 467d311 on October 28, 2025 12:51
@TheBlueMatt
Collaborator Author

Oops, yea, duh, can be simplified:

$ git diff-tree -U2 5e93838f3 467d311c03
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 3c7f4a845b..74f081b03a 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -1543,13 +1543,9 @@ where
 		&self, descriptor: &mut Descriptor, peer: &mut Peer, mut force_one_write: bool,
 	) {
-		if !self.should_read_from(peer) {
-			if !peer.sent_pause_read {
-				force_one_write = true;
-			}
-		} else {
-			if peer.sent_pause_read {
-				force_one_write = true;
-			}
-		}
+		// If we detect that we should be reading from the peer but reads are currently paused, or
+		// vice versa, then we need to tell the socket driver to update their internal flag
+		// indicating whether or not reads are paused. Do this by forcing a write with the desired
+		// `continue_read` flag set, even if no outbound messages are currently queued.
+		force_one_write |= self.should_read_from(peer) == peer.sent_pause_read;
 		while force_one_write || !peer.awaiting_write_event {
 			if peer.should_buffer_onion_message() {
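
For readers following the boolean logic in that one-liner: `sent_pause_read` is true when the last `send_data` call told the driver to stop reading, so `should_read_from(peer) == peer.sent_pause_read` holds exactly when the desired state and the last-communicated state disagree, which is when an extra write must be forced. A small truth table (illustrative only):

```text
should_read_from(peer) | sent_pause_read | state                        | force a write?
true                   | true            | want reads, driver paused    | yes (unpause)
true                   | false           | want reads, driver reading   | no
false                  | true            | want pause, driver paused    | no
false                  | false           | want pause, driver reading   | yes (pause)
```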

@TheBlueMatt TheBlueMatt merged commit c11d1a5 into lightningdevkit:main Oct 28, 2025
23 of 25 checks passed
@TheBlueMatt TheBlueMatt mentioned this pull request Oct 29, 2025
@TheBlueMatt
Collaborator Author

Backported to 0.2 in #4185.
