
Implement concurrent broadcast tolerance for distributed watchtowers #679

Merged

Conversation


@ariard ariard commented Aug 28, 2020

Implement concurrent broadcast tolerance for distributed watchtowers

With a distributed watchtower deployment, where each monitor is plugged
into its own chain view, there is no guarantee that blocks are going to
be seen in the same order. Watchtowers may diverge in their acceptance
of a submitted commitment_signed update due to a block timing out an
HTLC and provoking a rejection by one watchtower subset while not yet
being seen by the other subset. Any update rejected by one of the
watchtowers must prevent the offchain coordinator from moving the
channel state forward and releasing the revocation secret for the
previous state.

In this case, we want any watchtower from the rejecting subset to still
be able to claim outputs if the concurrent state, as accepted by the
other subset, confirms. This improves overall watchtower system
fault-tolerance.

This change stores the local commitment transaction unconditionally and
fails the update if there is knowledge of an already-signed commitment
transaction (ChannelMonitor.local_tx_signed=true).

Follow-up of #667, which lets us easily implement concurrent tolerance and get rid of confusing OnchainTxHandler/ChannelMonitor comments.
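
For illustration, here is a minimal sketch of the store-then-reject behavior described above. The struct and field names loosely mirror the ChannelMonitor fields touched by this PR, but this is not the actual implementation:

```rust
// Illustrative sketch only, not the actual ChannelMonitor code.
use std::mem;

struct MonitorUpdateError(&'static str);

struct Monitor {
    // Set once a local commitment has been signed and handed to the broadcaster.
    local_tx_signed: bool,
    // Stand-ins for the real commitment transaction types.
    current_local_commitment_tx: Vec<u8>,
    prev_local_signed_commitment_tx: Option<Vec<u8>>,
}

impl Monitor {
    fn provide_latest_local_commitment_tx(
        &mut self,
        mut new_local_commitment_tx: Vec<u8>,
    ) -> Result<(), MonitorUpdateError> {
        // Store the new state unconditionally, so this monitor can still claim
        // outputs if a concurrent monitor copy broadcasts it and it confirms.
        mem::swap(&mut new_local_commitment_tx, &mut self.current_local_commitment_tx);
        self.prev_local_signed_commitment_tx = Some(new_local_commitment_tx);
        // Only then fail the update if a local commitment was already signed,
        // so the coordinator never releases the previous revocation secret.
        if self.local_tx_signed {
            return Err(MonitorUpdateError("latest local commitment has already been signed"));
        }
        Ok(())
    }
}
```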

@ariard ariard force-pushed the 2020-08-concurrent-watchtowers branch 2 times, most recently from 6ebfabf to 431e0e8 on August 28, 2020 15:19

codecov bot commented Aug 28, 2020

Codecov Report

Merging #679 into master will decrease coverage by 0.01%.
The diff coverage is 98.70%.

@@            Coverage Diff             @@
##           master     #679      +/-   ##
==========================================
- Coverage   91.92%   91.91%   -0.02%     
==========================================
  Files          35       35              
  Lines       20082    20154      +72     
==========================================
+ Hits        18461    18524      +63     
- Misses       1621     1630       +9     
Impacted Files                          Coverage Δ
lightning/src/ln/functional_tests.rs    96.98% <98.63%> (-0.16%) ⬇️
lightning/src/ln/channelmonitor.rs      94.99% <100.00%> (ø)
lightning/src/ln/onchaintx.rs           93.89% <100.00%> (-0.02%) ⬇️
lightning/src/ln/channel.rs             86.23% <0.00%> (+0.08%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 343aacc...6622ea7.

self.current_local_commitment_number = 0xffff_ffff_ffff - ((((sequence & 0xffffff) << 3*8) | (locktime as u64 & 0xffffff)) ^ self.commitment_transaction_number_obscure_factor);
mem::swap(&mut new_local_commitment_tx, &mut self.current_local_commitment_tx);
self.prev_local_signed_commitment_tx = Some(new_local_commitment_tx);
if self.local_tx_signed {
return Err(MonitorUpdateError("Latest local commitment signed has already been signed, update is rejected"));
Collaborator

Updating before returning an Err probably needs a documentation update to indicate to the user that they should still write out the new version to disk even in case of err (which is a super surprising API).

@ariard (Author) Aug 28, 2020

Added a new comment at the top-level API: ManyChannelMonitor::update_monitor

I know this API is surprising, but you rarely have a distributed system driven both by a central coordinator and a p2p network.

if let Err(_) = self.onchain_tx_handler.provide_latest_local_tx(commitment_tx) {
return Err(MonitorUpdateError("Local commitment signed has already been signed, no further update of LOCAL commitment transaction is allowed"));
}
self.onchain_tx_handler.provide_latest_local_tx(commitment_tx);
Collaborator

Can we assert somewhere in here that prev_local_signed_commitment_tx must have been None before this? I suppose return Err and debug_assert!(false).

@ariard (Author) Aug 28, 2020

We don't explicitly prune out prev_local_signed_commitment by setting it to None.
We could add a round-trip between the coordinator and watchtowers with a ChannelMonitorUpdateStep::PrunePrevLocalCommitment, but that would be a latency hit for nothing, assuming we only return an Err once we have broadcast a local commitment, in which case the coordinator must not advance the channel forward by releasing the secret. Note the broadcast-commitment-then-reject-update case is covered in the new test (watchtower_alice failing the update_monitor).

Collaborator

Ah, right.


#[test]
fn test_concurrent_monitor_claim() {
// Watchtower Alice receives block 135, broadcasts state X, rejects state Y.
Collaborator

Can you make this comment more chronological? I think this is what threw me off the other day; roughly (a possible wording is sketched after this list):

Watchtower A receives block, broadcasts state,
then channel receives new state Y, sending it to both watchtowers,
Bob accepts Y, then receives block and broadcasts the latest state Y,
Alice rejects state Y, but Bob has already broadcasted it,
Y confirms.
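
One way the test comment could read chronologically, keeping the scenario from the discussion above (illustrative wording only, not the actual comment added in the PR):

```rust
// Watchtower Alice receives a block and broadcasts her latest local state X.
// The channel then advances to state Y, which is sent to both watchtowers.
// Watchtower Bob accepts Y, receives the next block, and broadcasts state Y.
// Watchtower Alice rejects the update to Y, but Bob has already broadcast it.
// State Y confirms on chain; Alice must still be able to claim her outputs.
```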

@TheBlueMatt (Collaborator) left a comment

I didn't read through the test yet, but I better understand what you're going for now. A few relatively minor comments, but conceptually this looks good.

@ariard ariard force-pushed the 2020-08-concurrent-watchtowers branch from 431e0e8 to 963aaa6 on August 28, 2020 20:19
@@ -823,6 +830,10 @@ pub struct ChannelMonitor<ChanSigner: ChannelKeys> {
// Set once we've signed a local commitment transaction and handed it over to our
// OnchainTxHandler. After this is set, no future updates to our local commitment transactions
// may occur, and we fail any such monitor updates.
//
// In case of update rejection due to a locally already signed commitment transaction, we
Collaborator

Maybe:
Note that even when we fail a local commitment transaction update, we still store the update to ensure we can claim from it in case a duplicate copy of this ChannelMonitor broadcasts it.

@@ -887,6 +898,11 @@ pub trait ManyChannelMonitor: Send + Sync {
///
/// Any spends of outputs which should have been registered which aren't passed to
/// ChannelMonitors via block_connected may result in FUNDS LOSS.
///
/// In case of distributed watchtowers deployment, even if an Err is return, the new version
Collaborator

This needs more detail about the exact setup of a distributed ChannelMonitor, as several are possible (I think it's a misnomer to call it a watchtower, as that implies something specific about availability of keys which isn't true today), and probably this comment should be on ChannelMonitor::update_monitor, not the trait (though it could be referenced in the trait), because the trait's return value is only relevant for ChannelManager.

@ariard (Author) Aug 31, 2020

I know writing documentation for a distributed ChannelMonitor is tracked by #604. I intend to do it post-anchor, as there are implications for bumping UTXO management, namely that you should also duplicate those keys across distributed monitors.

///
/// Should also be used to indicate a failure to update the local persisted copy of the channel
/// monitor.
/// At reception of this error, channel manager MUST force-close the channel and return at
Collaborator

I don't think it makes sense to describe this in these terms: what ChannelManager MUST do isn't interesting to our users, it's what we do! Instead, we should describe it as "ChannelManager will do X".

Author

Addressed.

/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// In case of a distributed watchtower deployment, failures sources can be divergent and
Collaborator

I mean also because there's no use in any more granular errors :).

@ariard ariard force-pushed the 2020-08-concurrent-watchtowers branch from 764f245 to aaf8c83 on August 31, 2020 19:03

ariard commented Aug 31, 2020

Updated; for complete documentation, I will do it post-anchor/chaininterface refactoring.

@ariard ariard force-pushed the 2020-08-concurrent-watchtowers branch from aaf8c83 to 2ad5aa1 on August 31, 2020 19:06
/// This failure may also signal a failure to update the local persisted copy of one of
/// the channel monitor instance.
///
/// Note that even when we fail a local commitment transaction update, we still store the
Collaborator

This should be phrased as "you must" and probably be clearer (not "we" in the sense of "rust-lightning does"). Also we should distinguish between "ChannelMonitor returned" and "you are returning from a ManyChannelMonitor/chain::Watch", though if you want to just make @jkczyz deal with this in #649 that's ok too.

Antoine Riard added 3 commits September 15, 2020 18:17
With a distributed watchtower deployment, where each monitor is plugged
into its own chain view, there is no guarantee that blocks are going to
be seen in the same order. Watchtowers may diverge in their acceptance
of a submitted `commitment_signed` update due to a block timing out an
HTLC and provoking a rejection by one watchtower subset while not yet
being seen by the other subset. Any update rejected by one of the
watchtowers must prevent the offchain coordinator from moving the
channel state forward and releasing the revocation secret for the
previous state.

In this case, we want any watchtower from the rejecting subset to still
be able to claim outputs if the concurrent state, as accepted by the
other subset, confirms. This improves overall watchtower system
fault-tolerance.

This change stores the local commitment transaction unconditionally and
fails the update if there is knowledge of an already-signed commitment
transaction (ChannelMonitor.local_tx_signed=true).
Watchtower Alice receives block 134, broadcasts state X, rejects state Y.
Watchtower Bob accepts state Y, receives block 135, broadcasts state Y.
State Y confirms onchain. Alice must be able to claim outputs.
Sources of the failure may be multiple in case of a distributed
watchtower deployment. In either case, the channel manager must return
a final update asking its channel monitor(s) to broadcast the latest
state available. The revocation secret must not be released for the
faulty channel.

In the future, we may return a wider range of failure types to allow
more fine-grained processing decisions (e.g. if there is a local disk
failure and a redundant remote channel copy is available, the channel
may still be processed forward).
@ariard ariard force-pushed the 2020-08-concurrent-watchtowers branch from 7cd09c5 to 6622ea7 on September 15, 2020 22:17
@TheBlueMatt TheBlueMatt merged commit 44734be into lightningdevkit:master Sep 16, 2020
TheBlueMatt added a commit that referenced this pull request Sep 16, 2020
Implement concurrent broadcast tolerance for distributed watchtowers
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Sep 16, 2020
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Sep 16, 2020
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Sep 16, 2020