Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnect peers on timer ticks to unblock channel state machine #2293

Merged

Conversation

wpaulino
Copy link
Contributor

At times, we've noticed that channels with lnd counterparties do not receive messages we expect to in a timely manner (or at all) after sending them a ChannelReestablish upon reconnection, or a CommitmentSigned message. This can block the channel state machine from making progress, eventually leading to force closes, if any pending HTLCs are committed and their expiration is met.

It seems common wisdom for lnd node operators to periodically restart their node/reconnect to their peers, allowing them to start from a fresh state such that the message we expect to receive hopefully gets sent. We can achieve the same end result by disconnecting peers ourselves (regardless of whether they're a lnd node), which we opt to implement here by awaiting their response within two timer ticks.

Fixes #2282.

@wpaulino wpaulino added this to the 0.0.116 milestone May 13, 2023
@codecov-commenter
Copy link

codecov-commenter commented May 13, 2023

Codecov Report

Patch coverage: 57.66% and project coverage change: -0.09 ⚠️

Comparison is base (4dce209) 90.79% compared to head (09c051f) 90.70%.

❗ Current head 09c051f differs from pull request most recent head 5bf7fac. Consider uploading reports for the commit 5bf7fac to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2293      +/-   ##
==========================================
- Coverage   90.79%   90.70%   -0.09%     
==========================================
  Files         104      104              
  Lines       53033    53162     +129     
  Branches    53033    53162     +129     
==========================================
+ Hits        48153    48223      +70     
- Misses       4880     4939      +59     
Impacted Files Coverage Δ
lightning/src/ln/msgs.rs 85.07% <0.00%> (-0.06%) ⬇️
lightning/src/ln/wire.rs 49.09% <2.17%> (-7.76%) ⬇️
lightning/src/ln/peer_handler.rs 58.85% <23.07%> (-0.52%) ⬇️
lightning/src/ln/channel.rs 89.78% <86.66%> (-0.02%) ⬇️
lightning/src/ln/functional_tests.rs 98.25% <98.48%> (+<0.01%) ⬆️
lightning/src/ln/channelmanager.rs 87.13% <100.00%> (+0.02%) ⬆️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@wpaulino wpaulino force-pushed the disconnect-peers-timer-tick branch from 15c311e to 6353683 Compare May 13, 2023 21:59
@TheBlueMatt
Copy link
Collaborator

Grr, sorry for the delay in responding here. So I'm not sure what to do about expected-CS.

If we have an inbound HTLC that we remove, send a CS, receive an RAA and then don't receive the expected CS so we can RAA we'll (a) no longer have any state about this HTLC in the channel - its been removed from our state as we never need to care about it in the off-chain state side of things and (b) still force-close the channel if the HTLC times out, as the ChannelMonitor sees that if we broadcast we'll still have the HTLC which we need to time-out.

At least the two state-machine-deadlocks I saw with lnd peers we were waiting on an RAA, and because we're not blocked on an RAA we can make progress (in the form of sending our own CS and then presumably if they're still hung we'll decide to d/c because we're awaiting-RAA), but its kinda awkward there's still a case there where we should d/c but won't.

For now its probably fine, and its better that we do d/c if we have to than nothing, at least because actually fixing this would require state machine tweaks that I don't really want to have to think hard about.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
@wpaulino wpaulino force-pushed the disconnect-peers-timer-tick branch from 105e9ac to fe3a962 Compare May 18, 2023 19:15
@TheBlueMatt
Copy link
Collaborator

MSRV build is sad.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good I think, mostly nits.

lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/ln/peer_handler.rs Outdated Show resolved Hide resolved
lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
let alice_init = msgs::Init { features: nodes[0].node.init_features(), remote_network_address: None };
nodes[1].node.peer_connected(&nodes[0].node.get_our_node_id(), &alice_init, true).unwrap();

// Upon reconnection, Alice sends her `ChannelReestablish` to Bob. Alice, however, hasn't
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add another (or just extend the test) to include a channel_ready message? That should emulate the exact behavior we've seen out of lnd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter? They can send channel_ready if they wish, the issue is not receiving channel_reestablish.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter? They can send channel_ready if they wish, the issue is not receiving channel_reestablish.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, it doesn't matter much I just wanted to match exactly what we've seen in the wild cause I can't readily repro with any peers so its nice to get an exact test to make sure we're good.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good I think, mostly nits.

@wpaulino wpaulino force-pushed the disconnect-peers-timer-tick branch 2 times, most recently from d767d1d to 09c051f Compare May 24, 2023 17:33
@TheBlueMatt
Copy link
Collaborator

Ugh needs rebase, LGTM, though.

The inner structs of each enum variant already implemented them and we
plan to pass in `Message`s to `enqueue_message` in a future commit.
`enqueue_message` simply adds the message to the outbound queue, it
still needs to be written to the socket with `do_attempt_write_data`.
However, since we immediately return an error causing the socket to be
closed, the message never actually gets sent.
At times, we've noticed that channels with `lnd` counterparties do not
receive messages we expect to in a timely manner (or at all) after
sending them a `ChannelReestablish` upon reconnection, or a
`CommitmentSigned` message. This can block the channel state machine
from making progress, eventually leading to force closes, if any pending
HTLCs are committed and their expiration is met.

It seems common wisdom for `lnd` node operators to periodically restart
their node/reconnect to their peers, allowing them to start from a fresh
state such that the message we expect to receive hopefully gets sent. We
can achieve the same end result by disconnecting peers ourselves
(regardless of whether they're a `lnd` node), which we opt to implement
here by awaiting their response within two timer ticks.
@wpaulino wpaulino force-pushed the disconnect-peers-timer-tick branch from 09c051f to 5bf7fac Compare May 26, 2023 21:41
@wpaulino wpaulino requested a review from dunxen May 30, 2023 17:55
Copy link
Contributor

@dunxen dunxen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nothing else from my side.

@TheBlueMatt TheBlueMatt merged commit eec5ec6 into lightningdevkit:main May 30, 2023
14 checks passed
@wpaulino wpaulino deleted the disconnect-peers-timer-tick branch May 30, 2023 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disconnect peers which don't respond to commitment signed after a while
4 participants