New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure successful message propagation in case of disconnection mid-handshake #2725
Conversation
This PR makes the following interpretation of the issue and follows the solution accordingly:
If in case, my interpretation of the problem has been erroneous, do let me and I shall be glad to correct it! :) |
Also, this PR has been set to draft because the tests are incomplete and only partially test the added code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the status here. You have it marked draft, do you want feedback? Whay kind of feedback/review is this ready for?
Hi, @TheBlueMatt So, I am facing trouble preparing a test for when the peer is reconnected on time, and hence the open channel message is sent to it because it conflicts with how the rest of the test codebase is set up. So I wanted to get a general Approach ACK, before I go about hacking in the test to make them work. |
Updated from pr2725.01 -> pr2725.02 (diff) Changes:
Note:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay here, busy with thanksgiving travel and other stuff.
Updated from pr2725.03 -> pr2725.04 (diff) Changes:
Thank you very much, @TheBlueMatt, for this new idea to solve this problem. |
We still need to resend |
You are right! That was an oversight from my side. Thank you for pointing it out.
You are right. However, the goal of the PR is to ensure proper execution of the |
I don't think we necessarily care about that. What we really want is to end up with a funded channel (within a reasonable timeout) if a user requests one while being able to handle the counterparty disconnecting mid-handshake.
Typically nodes forget all about channels before sending/receiving |
Thanks, @wpaulino, for the details about the message transmissions! Seems like it's worth considering extending the PR from fixing the original issue to not failing channel creation mid-handshake due to channel disconnection. I am tinkering with an approach, and I shall update the PR very soon! |
Update: Okay, so I have figured out an approach, but this depends on the behavior changes introduced in #2760. We can track the list of msg_events we have sent during the handshake process, which can be used in case a peer disconnects midway. Once the funding is signed, we graduate the channel from However, currently, in the main, we graduate the channel as soon as we have created the funding. let (chan: Channel<SP>, msg_opt) = match peer_state.channel_by_id.remove(temporary_channel_id) {
...
}, Since #2760 is already getting approval and will soon be merged, I shall build this new approach over the changes introduced there. |
I don't think we need to explicitly track which message we're ready to send to our counterparty - if we are disconnected from a peer, then reconnect prior to funding, we have to restart from the |
Updated from pr2725.04 -> pr2725.05 (diff) Updates:
Logic: -> Follow the standard handshake routine. -> If we disconnect mid-handshake from our peer (that is, OutboundV1Channel is not resolved to a funded channel), we don't immediately close the OutboundV1Channel. -> Instead, we track how long it has been since we disconnected from peers. -> If we connect back within time, we rebroadcast SendOpenChannel corresponding to OutboundV1Channel to the peer. -> If we do not connect back within N (=2) timer ticks, we force close and remove the channel. Note:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. One comment. Note that the test fixes in the last commit will need to get squashed into the commit that broke tests. We require (but don't actually check in CI) that each individual commit builds and passes tests.
Updated from pr2725.06 -> pr2725.07 (diff) Update:
|
Codecov ReportAttention:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #2725 +/- ##
==========================================
- Coverage 89.14% 89.14% -0.01%
==========================================
Files 116 116
Lines 93205 93186 -19
Branches 93205 93186 -19
==========================================
- Hits 83089 83066 -23
- Misses 7583 7587 +4
Partials 2533 2533 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (4 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Additional comments: 8
lightning/src/ln/functional_tests.rs (4)
- 3701-3704: The test
test_peer_disconnected_before_funding_broadcasted
has been correctly marked as ignored, which aligns with the PR's objective to change the behavior of channel closure upon disconnection.- 10514-10557: The new test
test_channel_close_when_not_timely_accepted
is added to simulate the scenario where a channel is not accepted in a timely manner. The test logic appears to follow the intended behavior of the PR, but it should be verified that the test covers all new edge cases introduced by the PR.- 10558-10596: The new test
test_rebroadcast_open_channel_when_reconnect_mid_handshake
is added to ensure that theSendOpenChannel
message is rebroadcast upon reconnection. This aligns with the PR's objectives and should be verified for completeness.Verification successful
The extended context now includes the assertions after the reconnection logic in the
test_rebroadcast_open_channel_when_reconnect_mid_handshake
test. It confirms that after the peers reconnect, the test checks that theSendOpenChannel
message is indeed added tonode_0
's pending message events, with the assertionassert_eq!(events.len(), 1);
. This assertion is critical as it verifies that theSendOpenChannel
message is rebroadcast upon reconnection, which aligns with the PR's objectives.Based on the provided context, the test
test_rebroadcast_open_channel_when_reconnect_mid_handshake
appears to be complete and covers the rebroadcast logic as intended by the PR.* 10781-10785: The code block is part of a test that simulates the timeout behavior for channel closure. It should be verified that the timeout value `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS` is consistent with the intended behavior and that the test accurately reflects the new logic.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the new test `test_rebroadcast_open_channel_when_reconnect_mid_handshake` is complete and covers the rebroadcast logic. rg --type rust "test_rebroadcast_open_channel_when_reconnect_mid_handshake"Length of output: 179
Script:
#!/bin/bash # Print the lines of code around the test function to examine the test logic in more detail. rg --type rust --context 10 "test_rebroadcast_open_channel_when_reconnect_mid_handshake" lightning/src/ln/functional_tests.rsLength of output: 1061
Script:
#!/bin/bash # Extend the range of context lines to capture more of the test function. rg --type rust --context 30 "test_rebroadcast_open_channel_when_reconnect_mid_handshake" lightning/src/ln/functional_tests.rsLength of output: 2994
Script:
#!/bin/bash # Request more lines after the last shown to see the assertions related to the rebroadcast of the `SendOpenChannel` message. rg --type rust --after-context 20 "peer_connected" lightning/src/ln/functional_tests.rsLength of output: 6881
Verification successful
The timeout value
UNFUNDED_CHANNEL_AGE_LIMIT_TICKS
is used consistently across the tests infunctional_tests.rs
to simulate the passage of time and to test the behavior of channel closure due to unfunded channels over time. The tests reflect the new logic introduced in the PR, where channels are expected to be closed if they remain unfunded past the age limit and the peer does not reconnect within the specified timeframe. The test setup and assertions are aligned with the intended behavior of handling disconnections and reconnections during the handshake process.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the timeout value used in the test is consistent with the intended behavior. rg --type rust "UNFUNDED_CHANNEL_AGE_LIMIT_TICKS"Length of output: 868
Script:
#!/bin/bash # Let's examine the surrounding context of the test where `UNFUNDED_CHANNEL_AGE_LIMIT_TICKS` is used. # This will help us understand how the test is set up and whether it accurately reflects the new logic. rg --type rust -C 10 "UNFUNDED_CHANNEL_AGE_LIMIT_TICKS" lightning/src/ln/functional_tests.rsLength of output: 5292
lightning/src/ln/channelmanager.rs (4)
- 896-896: The addition of a condition to check for
UnfundedOutboundV1
channels in theis_live
function seems to align with the PR's objective to handle disconnections during the handshake. This change ensures that channels in theUnfundedOutboundV1
phase are considered not live, which is consistent with the intended logic.- 8873-8876: The comments added provide clarity on the handling of
UnfundedOutboundV1
channels, indicating that these channels are retained for some time in case the peer reconnects. This is a logical change that supports the PR's goal of making the handshake process more robust against unexpected disconnections.- 9017-9032: The logic added to
get_and_clear_pending_msg_events
function to handleUnfundedOutboundV1
channels by pushing aSendOpenChannel
event is in line with the PR's objectives. This ensures that if a peer reconnects, theopen_channel
message is resent, which is necessary since the peer may have forgotten about the channel. However, ensure that there is a mechanism to prevent an infinite loop of reconnection and message rebroadcast if the peer keeps disconnecting and reconnecting.- 11149-11149: The import changes in hunk 4 seem unrelated to the PR's objectives. Ensure that these imports are used in the code that follows, otherwise, they should be removed to keep the code clean and maintainable.
Verification successful
The imports
PaymentPreimage
,PaymentHash
, andPaymentSecret
are used in the codebase, as evidenced by the search results. Therefore, they are related to the PR's objectives and should not be removed.Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Verify that the newly added imports are used in the code. rg --type rust "PaymentPreimage|PaymentHash|PaymentSecret|create_recv_pending_htlc_info|HTLCForwardInfo|inbound_payment|PaymentId|PaymentSendFailure|RecipientOnionFields|InterceptId"Length of output: 146768
Updated from pr2725.10 -> pr2725.11 (diff)
|
@shaavan, the updates you've made in response to the comments seem to be well-detailed and focused on improving the PR's clarity and functionality. It's good to see that you've expanded the tests to cover the new behavior thoroughly. Regarding the mention of (\( ⁰⊖⁰)/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 4
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Additional comments: 6
lightning/src/ln/functional_tests.rs (3)
- 3727-3737: The logic for handling disconnection before funding transaction broadcast is clear. However, the test seems to be disabled. Confirm if this is intentional and if so, provide a reason or a TODO comment for future enablement.
- 10513-10561: The new test
test_channel_close_when_not_timely_accepted
is well-structured and seems to cover the scenario it's designed for. However, ensure that the test is enabled and verify that it passes in the test suite.- 10563-10607: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshake
appears to correctly simulate the scenario of a peer disconnecting and reconnecting mid-handshake. Verify that the test is enabled and that it passes in the test suite.lightning/src/ln/channelmanager.rs (3)
- 895-897: The logic here checks if any channel is in the
Funded
orUnfundedOutboundV1
phase. Ensure that this logic aligns with the intended behavior of theis_live
function, especially considering the newUnfundedOutboundV1
state.- 8876-8877: The
UnfundedOutboundV1
channel phase is set to always return true, which implies that these channels are considered live even if the peer is disconnected. Confirm that this behavior is consistent with the overall system logic and that it won't lead to any unexpected side effects.- 9028-9032: The addition of logic to push a
SendOpenChannel
message forUnfundedOutboundV1
channels is consistent with the PR's objective to allow rebroadcasting if the peer reconnects. Ensure that theget_open_channel
function generates the correct message and that this behavior is tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good other than some comment re-phrasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (4 hunks)
Files skipped from review as they are similar to previous changes (2)
- lightning/src/ln/channelmanager.rs
- lightning/src/ln/functional_tests.rs
@TheBlueMatt |
Updated from pr2725.12 -> pr2725.13 (diff)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 6
lightning/src/ln/functional_tests.rs (6)
- 3727-3739: The logic for handling peer disconnection before funding is broadcasted seems to correctly simulate the disconnection and checks for the expected closure reasons. However, it's important to ensure that the
UNFUNDED_CHANNEL_AGE_LIMIT_TICKS
constant is appropriately defined and used across the test to simulate the timeout accurately.Ensure
UNFUNDED_CHANNEL_AGE_LIMIT_TICKS
is defined with a value that accurately represents the intended timeout duration for the test scenario.
- 10515-10559: The test
test_channel_close_when_not_timely_accepted
simulates a scenario where peers disconnect mid-handshake, and the channel is not timely accepted. The test setup and the disconnection simulation are correctly implemented. However, the assertion that checks the channel's state after disconnection (line 10534) and the assertion for the channel's closure (line 10550) are critical to validate the intended behavior. It's essential to ensure that these assertions accurately reflect the expected state changes in the system under test.Verify that the assertions accurately reflect the expected outcomes and that the test covers all relevant scenarios for the feature being tested.
- 10561-10604: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshake
correctly simulates a peer disconnection and reconnection mid-handshake. The test ensures that theSendOpenChannel
message is rebroadcast upon reconnection (lines 10598-10603). This behavior aligns with the PR's objective to improve the robustness of the channel handshake process. However, it's crucial to verify that the rebroadcast logic is implemented as intended in the actual system code and not just within the test environment.Confirm that the rebroadcast logic for the
SendOpenChannel
message upon peer reconnection is correctly implemented in the system code and not solely within the test.
- 10762-10764: The introduction of the test
test_close_in_funding_batch
aims to ensure that if one channel in a batch closes, the entire batch is closed. This test is crucial for validating the robustness of batch processing in channel funding. It's important to ensure that the test setup correctly simulates the batch funding scenario and that the logic for triggering a channel close within the batch is accurately implemented.Ensure the test accurately simulates batch funding scenarios and correctly implements the logic for closing a channel within a batch.
- 10788-10820: The logic within
test_close_in_funding_batch
for force-closing a channel and verifying the closure of all channels in the batch (lines 10794-10820) is critical for ensuring the intended behavior of batch processing. The assertions and checks (lines 10797-10803, 10805-10809, and 10811-10818) are essential for validating the state of the system after a force-close operation. It's important to verify that these checks accurately reflect the expected outcomes and that the test covers all relevant scenarios for batch processing in channel funding.Verify that the assertions and checks within the test accurately reflect the expected outcomes for batch processing in channel funding and that all relevant scenarios are covered.
- 10820-10820: The final assertion in
test_close_in_funding_batch
that checks for the immediate closure of all channels in the batch upon a single channel's force-close (line 10820) is a key part of validating the intended behavior. However, it's crucial to ensure that this behavior aligns with the system's design and that the test accurately reflects the real-world scenario it intends to simulate.Confirm that the immediate closure of all channels in a batch upon a single channel's force-close aligns with the system's design and that the test accurately simulates this scenario.
- Do not remove channel immediately when peer_disconnect, instead removed it after some time if peer doesn't reconnect soon (handled in previous commit). - Do not mark per ok_to_remove if we have some OutboundV1Channels too. - Rebroadcast SendOpenChannel for outboundV1Channel when peer reconnects. - Update the relevant tests to account for the behavior change. - Repurpose the test_disconnect_in_funding_batch to test that all channels in the batch close when one them closes.
- The first test make sure that the OutboundV1Channel is not immediately removed when peers disconnect, but is removed after N timer ticks. - The second test makes sure that the SendOpenChannel is rebroadcasted for the OutboundV1Channel if peer reconnects within time.
Updated from pr2725.13 -> pr2725.14 (diff)
Updated from pr2725.14 -> pr2725.15 Range-diff
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- lightning/src/ln/channelmanager.rs (3 hunks)
- lightning/src/ln/functional_tests.rs (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- lightning/src/ln/channelmanager.rs
Additional comments: 7
lightning/src/ln/functional_tests.rs (7)
- 3698-3698: The test description for
test_peer_disconnected_before_funding_broadcasted
is clear and sets the context well for what the test aims to achieve. However, ensure that the test implementation fully covers the scenario described, including both the disconnection and the failure to reconnect within the specified time.- 3727-3739: The logic to simulate peer disconnection before funding is broadcasted and to check the channel closure with the appropriate
ClosureReason
is implemented correctly. However, consider adding a comment explaining whyUNFUNDED_CHANNEL_AGE_LIMIT_TICKS
is used to simulate the passage of time and its significance in the context of this test.- 10517-10557: The test
test_channel_close_when_not_timely_accepted
correctly simulates a scenario where a peer disconnects mid-handshake and checks the state of channels and peer state after a specified time has passed. This test effectively covers the new behavior introduced in the PR. Ensure that the constants used, likeUNFUNDED_CHANNEL_AGE_LIMIT_TICKS
, are well-documented and their values are justified within the context of this test.- 10560-10598: The test
test_rebroadcast_open_channel_when_reconnect_mid_handshake
accurately simulates the scenario of peer disconnection and reconnection during the handshake process. It checks that theSendOpenChannel
message is rebroadcast upon reconnection, aligning with the PR's objectives. This test is well-structured and covers the critical functionality introduced. Ensure that the test includes assertions for the state of both nodes after reconnection to fully validate the rebroadcast logic.- 10756-10756: The introduction of
test_close_in_funding_batch
aims to test the behavior when one of the channels in a batch closes. This is a good addition to ensure that batch processing of channel closures behaves as expected. However, the test description could be expanded to detail the expected behavior of the batch closure process for clarity.- 10782-10813: In
test_close_in_funding_batch
, the logic to force-close a channel and check the resulting state, including monitor updates and message events, is implemented correctly. This test effectively validates the behavior when a channel in a funding batch is closed. Ensure that the test also verifies the state of other channels in the batch to confirm that they are affected as expected by the batch closure process.- 10814-10814: The assertion that all channels in the batch should close immediately after one channel is force-closed is a critical part of
test_close_in_funding_batch
. This ensures that the batch processing logic is working as intended. Consider adding more detailed assertions to verify the closure reasons for each channel in the batch to ensure they align with the expected outcomes.
_ => panic!("Unexpected message."), | ||
} | ||
|
||
// We broadcast the commitment transaction as part of the force-close. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, this is kinda dumb, maybe we should fix that, but its not super critical and certainly unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super interested in understanding the issue here! And probably might give it a try if it's not a super biggie!
Resolves #2096