Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race between outbound messages and peer disconnection #2663

Merged

Conversation

TheBlueMatt
Copy link
Collaborator

Previously, outbound messages held in process_events could race
with peer disconnection, allowing a message intended for a peer
before disconnection to be sent to the same peer after
disconnection.

The fix is simple - hold the peers read lock while we fetch
pending messages from peers (as we disconnect with the write lock).

@TheBlueMatt TheBlueMatt added this to the 0.0.118 milestone Oct 13, 2023
@codecov-commenter
Copy link

codecov-commenter commented Oct 14, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (22a0bfc) 89.04% compared to head (4366369) 89.05%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2663      +/-   ##
==========================================
+ Coverage   89.04%   89.05%   +0.01%     
==========================================
  Files         112      112              
  Lines       87419    87417       -2     
  Branches    87419    87417       -2     
==========================================
+ Hits        77841    77850       +9     
+ Misses       7336     7328       -8     
+ Partials     2242     2239       -3     
Files Coverage Δ
lightning/src/ln/peer_handler.rs 58.68% <100.00%> (+<0.01%) ⬆️

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


#[test]
#[cfg(feature = "std")]
fn test_rapid_connect_events_order_multithreaded() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this about 100x with your change reverted and wasn't able to get it to fail :/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, yea, sorry, this test isnt even close to right. I'll spend some more time on this but the issue is pretty hard to repro.

@TheBlueMatt
Copy link
Collaborator Author

For now simply removed the test. I think we should land this as-is because writing a generic stress test that isn't super targeted is not trivial and we need to ship this.

return Err(PeerHandleError { }.into());
}
if let Err(()) = self.message_handler.custom_message_handler.peer_connected(&their_node_id, &msg, peer_lock.inbound_connection) {
log_debug!(self.logger, "Custom Message Handler decided we couldn't communicate with peer {}", log_pubkey!(their_node_id));
self.message_handler.onion_message_handler.peer_disconnected(&their_node_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about route_handler?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have a peer_disconnected.

)*
if should_disconnect {
$(
self.$field.peer_disconnected(their_node_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that peer_disconnected may be called for some handlers that did not get a peer_connected first?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All handlers get a peer_connected here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is somewhat awkward that we have different semantics on the custom message handler from the other message handlers - the others dont get a peer_disconnected if they return an error from peer_connected, custom does (here). Let me fix that.

Previously, outbound messages held in `process_events` could race
with peer disconnection, allowing a message intended for a peer
before disconnection to be sent to the same peer after
disconnection.

The fix is simple - hold the peers read lock while we fetch
pending messages from peers (as we disconnect with the write lock).
@TheBlueMatt
Copy link
Collaborator Author

Dropped all the commits but the real fix, since those commits were mostly just for the test that didnt work anyway. I'll get a PR up with the changes and a test at some point in a followup.

@TheBlueMatt TheBlueMatt merged commit 0357caf into lightningdevkit:main Oct 18, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants