Do not broadcast commitment txn on Permanent mon update failure #1106
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1106 +/- ##
==========================================
+ Coverage 90.40% 92.12% +1.71%
==========================================
Files 68 66 -2
Lines 34796 40933 +6137
==========================================
+ Hits 31458 37708 +6250
+ Misses 3338 3225 -113
Continue to review full report at Codecov.
Force-pushed bd79c03 to e6966ce
/// or, e.g. we've moved on to a different watchtower and cannot update with all watchtowers
/// that were previously informed of this channel).
Trying to parse this watchtower case -- in this situation, we're switching watchtowers, but have no way of contacting the old watchtower to just delete our data, therefore the channel needs to close via PermanentFailure?
Yea, but honestly it's confusing; I replaced it with a better example.
/// [`ChannelMonitor::get_latest_holder_commitment_txn`] once you've safely ensured no further
/// off-chain updates to the channel can occur.
Checking -- "ensure no further off-chain updates can occur" == "applied the final monitor update mentioned on L170"? Or?
It means "no ChannelManager still knows about this channel", I reworded it.
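To make that recovery flow concrete, here is a minimal sketch of broadcasting the latest holder state yourself once no ChannelManager still knows about the channel. The closures stand in for ChannelMonitor::get_latest_holder_commitment_txn and BroadcasterInterface::broadcast_transaction so the sketch does not depend on exact LDK signatures, which vary between releases.

// Hedged sketch only: `get_latest_txn` and `broadcast` are placeholders for
// the real monitor/broadcaster calls, passed in by the caller.
fn broadcast_latest_holder_state<Tx>(
    get_latest_txn: impl FnOnce() -> Vec<Tx>,
    broadcast: impl Fn(&Tx),
) {
    // Only do this once you are certain no ChannelManager can still produce
    // off-chain updates for the channel, otherwise the state you broadcast
    // may later be revoked.
    for tx in get_latest_txn() {
        // Holder commitment transaction plus any claimable HTLC transactions.
        broadcast(&tx);
    }
}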
Bit confused how to implement these new requirements in the sample. I guess we now need a proxy layer implementing
I think we, in general, need to think hard about how we handle disk failures in the sample/lightning-persister. If the disk gets unplugged while we're running, we should just shut down and move on, but there's currently no way to surface that error out of the persistence path. I think the end result for the sample needs to be a "really, definitely broadcast the latest state from the channelmonitor for channel X" command, which is marked unsafe. We should probably also have a programmatic way to detect monitors in this condition and not just rely on the logs.
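As a rough starting point for that proxy-layer / shut-down idea, a sketch of wrapping the sample's persister so any disk error is remembered and the node can stop taking off-chain updates and exit. Names and signatures here are invented for illustration; the real lightning-persister and Persist trait take different parameters.

use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical proxy layer around whatever persister the sample uses.
struct ShutdownOnDiskFailure<P> {
    inner: P,
    disk_failed: AtomicBool,
}

impl<P> ShutdownOnDiskFailure<P> {
    /// Called wherever the inner persister would have written a monitor update.
    fn persist_update(&self, do_write: impl FnOnce() -> std::io::Result<()>) -> bool {
        match do_write() {
            Ok(()) => true,
            Err(_) => {
                // Remember that persistence is broken so the node can stop
                // accepting new off-chain updates and shut down cleanly
                // instead of broadcasting anything.
                self.disk_failed.store(true, Ordering::Release);
                false
            }
        }
    }

    fn should_shut_down(&self) -> bool {
        self.disk_failed.load(Ordering::Acquire)
    }
}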
Force-pushed 5f848de to 6ad8583
Rebased on upstream. Need to make sure all the updated content in the failure enum docs is still kept, and we should rename TemporaryFailure -> AsyncPersist or so.
Force-pushed 6ad8583 to f99da12
Force-pushed f99da12 to 2572ac5
Codecov Report
Base: 90.78% // Head: 91.21% // Increases project coverage by +0.43%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1106 +/- ##
==========================================
+ Coverage 90.78% 91.21% +0.43%
==========================================
Files 86 87 +1
Lines 46631 50794 +4163
Branches 46631 50794 +4163
==========================================
+ Hits 42335 46333 +3998
- Misses 4296 4461 +165
☔ View full report at Codecov.
Force-pushed 2572ac5 to 0fe1f28
Finally getting back to the monitor work - updated this to include a monitor update failure type rework, but no further functional changes.
Force-pushed 0fe1f28 to b3ff257
Force-pushed dc2888a to a9f656a
Oops, somehow I totally lost track of this PR having a pending review, so sorry about that! Rebased on latest git and addressed feedback.
Force-pushed a9f656a to e3347f1
Force-pushed 019aadf to 2ad3987
Force-pushed 2ad3987 to 86a5a28
See doc updates for more info on the edge case this prevents, and there isn't really a strong reason why we would need to broadcast the latest state immediately. Specifically, in the case of HTLC claims (the most important reason to ensure we have state on chain if it cannot be persisted), we will still force-close if there are HTLCs which need claiming and are going to expire. Surprisingly, there were no tests which failed as a result of this change, but a new one has been added.
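For illustration, a hedged sketch of the failure mode being discussed: a disk that starts rejecting writes (e.g. after the Linux-default read-only remount) surfacing as a permanent monitor update failure. The function and error type here are made up, not the actual Persist trait; with this change, reporting such a failure force-closes the channel but does not broadcast the commitment transaction.

use std::io::Write;

// Hypothetical persister fragment mapping a failed write to a "permanent"
// monitor update failure.
fn write_monitor_bytes(path: &std::path::Path, bytes: &[u8]) -> Result<(), &'static str> {
    let res = std::fs::File::create(path).and_then(|mut f| f.write_all(bytes));
    match res {
        Ok(()) => Ok(()),
        // A disk remounted read-only will surface here as an io::Error on
        // every write; the caller would report this as a permanent failure
        // rather than broadcasting a possibly-stale commitment transaction.
        Err(_) => Err("permanent monitor update failure"),
    }
}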
Force-pushed 86a5a28 to 24f8279
Rebased to address conflicts and went ahead and squashed as it's been a while.
Force-pushed 24f8279 to 4825a96
lightning/src/ln/channelmanager.rs
Outdated
@@ -4492,7 +4492,7 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
 // We do not do a force-close here as that would generate a monitor update for
 // a monitor that we didn't manage to store (and that we don't care about - we
 // don't respond with the funding_signed so the channel can never go on chain).
- let (_monitor_update, failed_htlcs) = chan.force_shutdown(true);
+ let (_monitor_update, failed_htlcs) = chan.force_shutdown(false);
Looks like there is no test for this scenario. All tests pass when reverting to true.
It's unobservable - note that the monitor_update, which is the place the bool is propagated to, is dropped.
Force-pushed 4825a96 to c0bb4f4
lightning/src/util/errors.rs
Outdated
/// An attempt to call watch/update_channel returned an Err (ie you did this!), causing the
/// attempted action to fail.
Could you update the docs since it's no longer from an Err? Would be good to link to the corresponding method docs and maybe phrase the docs to be a little clearer, too.
+1, also a bit weird that it's still considered an APIError variant, but that might require some more brainstorming than a simple rename/doc update.
Yea, did update the docs, but agree - this shouldn't really be an error at all, it should just be Ok. I think there's only one(-ish) place where it Errs - sending payments - so it should be easy to do in a followup.
/// When this is returned, [`ChannelManager`] will force-close the channel but *not* broadcast
/// our current commitment transaction. This avoids a dangerous case where a local disk failure
/// (e.g. the Linux-default remounting of the disk as read-only) causes [`PermanentFailure`]s
/// for all monitor updates. If we were to broadcast our latest commitment transaction and then
/// restart, we could end up reading a previous [`ChannelMonitor`] and [`ChannelManager`],
/// revoking our now-broadcasted state before seeing it confirm and losing all our funds.
IIUC, while the user could choose not to broadcast the latest commitment transaction, which has been revoked, it could still be done by LDK if it contained any pending HTLCs that are expiring soon.
Indeed, somewhat of a separate issue that should be addressed in #1593
Force-pushed c0bb4f4 to 74745cb
LGTM, feel free to squash.
Force-pushed 74745cb to fc18acb
Yes, please squash.
Force-pushed fc18acb to 4ac5804
Addressed one more comment and squashed. Diff since last push:
$ git diff-tree -U1 fc18acb4 4ac58048
diff --git a/lightning/src/util/errors.rs b/lightning/src/util/errors.rs
index 83324eab6..ad6993542 100644
--- a/lightning/src/util/errors.rs
+++ b/lightning/src/util/errors.rs
@@ -48,7 +48,9 @@ pub enum APIError {
},
- /// An attempt to call watch/update_channel returned a
- /// [`ChannelMonitorUpdateStatus::InProgress`] indicating the persistence of a monitor update
- /// is awaiting async resolution. Once it resolves the attempted action should complete
- /// automatically.
+ /// An attempt to call [`chain::Watch::watch_channel`]/[`chain::Watch::update_channel`]
+ /// returned a [`ChannelMonitorUpdateStatus::InProgress`] indicating the persistence of a
+ /// monitor update is awaiting async resolution. Once it resolves the attempted action should
+ /// complete automatically.
///
+ /// [`chain::Watch::watch_channel`]: crate::chain::Watch::watch_channel
+ /// [`chain::Watch::update_channel`]: crate::chain::Watch::update_channel
/// [`ChannelMonitorUpdateStatus::InProgress`]: crate::chain::ChannelMonitorUpdateStatus::InProgress
When a `chain::Watch` `ChannelMonitor` update method is called, the user has three options: (a) persist the monitor update immediately and return success, (b) fail to persist the monitor update immediately and return failure, (c) return a flag indicating the monitor update is in progress and will complete in the future. (c) is rather harmless, and in some deployments should be expected to be the return value for all monitor update calls, but currently requires returning `Err(ChannelMonitorUpdateErr::TemporaryFailure)`, which isn't very descriptive and sounds scarier than it is. Instead, here, we change the return type used to be a single enum (rather than a Result) and rename `TemporaryFailure` to `UpdateInProgress`.
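A rough rendering of that three-way return value as an enum, for readers following along. The variant names mirror this discussion and the diff above (the shipped type is chain::ChannelMonitorUpdateStatus), but treat the exact spelling as version-dependent rather than the definitive API.

// Illustrative only: the three outcomes described above as a plain enum
// instead of a Result.
enum MonitorUpdateOutcome {
    /// (a) The update was persisted synchronously; the operation proceeds.
    Completed,
    /// (c) Persistence is happening asynchronously; the operation completes
    /// automatically once the user signals the update is done. Harmless, and
    /// in some deployments the expected return for every call.
    InProgress,
    /// (b) Persistence failed for good; the channel must be closed, but the
    /// commitment transaction is *not* broadcast automatically (see the doc
    /// updates in this PR).
    PermanentFailure,
}

// A deployment that always persists asynchronously would simply return
// MonitorUpdateOutcome::InProgress from every update call.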
If we receive a monitor event from a forwarded-to channel which contains a preimage for an HTLC, we have to propagate that preimage back to the forwarded-from channel monitor. However, once we have that update, we're running in a relatively unsafe state - we have the preimage in memory, but if we were to crash, the forwarded-to channel monitor will not regenerate the update with the preimage for us. If we haven't managed to write the monitor update to the forwarded-from channel by that point, we've lost the preimage, and, thus, money!
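A minimal sketch of the ordering requirement described in that commit message, with invented names (this is not the actual ChannelManager code path): persist the preimage-bearing update for the forwarded-from monitor before acknowledging the downstream event, so a crash in between cannot lose the preimage.

// Illustrative only; the preimage learned from the downstream channel lives
// only in memory until the upstream monitor update is written.
struct PendingClaim {
    upstream_channel_id: [u8; 32],
    preimage: [u8; 32],
}

fn settle_upstream(
    claim: PendingClaim,
    persist_upstream_update: impl Fn(&PendingClaim) -> std::io::Result<()>,
    ack_downstream_event: impl Fn([u8; 32]),
) -> std::io::Result<()> {
    // 1) Durably record the preimage against the forwarded-from monitor.
    persist_upstream_update(&claim)?;
    // 2) Only then tell the downstream side we're done; if we crash before
    //    this point, the persisted update still lets us claim on restart.
    ack_downstream_event(claim.upstream_channel_id);
    Ok(())
}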
This much more accurately represents the error, indicating that a monitor update is in progress asynchronously and may complete at a later time.
Force-pushed 4ac5804 to 52934e0
Oops, sorry - reordered the missing hunk into the right commit; no change to the overall diff, though.
See doc updates for more info on the edge case this prevents, and there isn't really a strong reason why we would need to broadcast the latest state immediately. Specifically, in the case of HTLC claims (the most important reason to ensure we have state on chain if it cannot be persisted), we will still force-close if there are HTLCs which need claiming and are going to expire.

Surprisingly, there were no tests which failed as a result of this change, but a new one has been added.