
Don't Always Broadcast latest state on startup if we'd previously closed-without-broadcast #1563

Open
TheBlueMatt opened this issue Jun 23, 2022 · 6 comments · Fixed by #1564 · May be fixed by #1593
Labels: good first issue (Good for newcomers)
Milestone: 0.1

Comments

@TheBlueMatt
Collaborator

On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel. This is all fine and great except if we'd previously closed the ChannelMonitor with a ChannelForceClosed { should_broadcast: false } indicating it may be unsafe to broadcast. In this case, we should disable automated broadcasting. This should just require tracking the should_broadcast state in the ChannelMonitor itself and checking it before broadcasting (maybe a new automated flag on broadcast_latest_holder_commitment_txn?)

This is really "the" issue on #775 so I'm gonna close that in favor of this one.
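
A minimal, self-contained sketch of that idea, assuming a placeholder `MonitorState` type and an `automated` parameter on the broadcast call; this is not LDK's actual API, just an illustration of gating automated broadcasts on the remembered `should_broadcast` state:

```rust
// Hypothetical sketch only, not rust-lightning's real types or API. It shows the
// idea of remembering the last `should_broadcast` we saw and gating *automated*
// broadcasts on it, while still allowing an explicit user-requested broadcast.
struct MonitorState {
    // Would be set to false when processing ChannelForceClosed { should_broadcast: false }.
    allow_automated_broadcast: bool,
}

impl MonitorState {
    // `automated` would be true when called from ChannelManager startup and
    // false when the user explicitly asks us to broadcast.
    fn broadcast_latest_holder_commitment_txn(&self, automated: bool) {
        if automated && !self.allow_automated_broadcast {
            // A previous close told us broadcasting may be unsafe; do nothing.
            return;
        }
        // ... sign and hand the latest holder commitment transaction(s) to the broadcaster
    }
}
```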

TheBlueMatt added the good first issue label on Jun 23, 2022
TheBlueMatt added this to the 0.1 milestone on Jun 23, 2022
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 23, 2022
When we receive a `channel_reestablish` with a `data_loss_protect`
that proves we're running with a stale state, instead of
force-closing the channel, we immediately panic. This lines up with
our refusal to run if we find a `ChannelMonitor` which is stale
compared to our `ChannelManager` during `ChannelManager`
deserialization. Ultimately both are an indication of the same
thing - that the API requirements on `chain::Watch` were violated.

In the "running with outdated state but ChannelMonitor(s) and
ChannelManager lined up" case specifically its likely we're running
off of an old backup, in which case connecting to peers with
channels still live is explicitly dangerous. That said, because
this could be an operator error that is correctable, panicing
instead of force-closing may allow for normal operation again in
the future (cc lightningdevkit#1207).

In any case, we provide instructions in the panic message for how
to force-close channels prior to peer connection, as well as a note
on how to broadcast the latest state if users are willing to take
the risk.

Note that this is still somewhat unsafe until we resolve lightningdevkit#1563.
TheBlueMatt added commits to TheBlueMatt/rust-lightning that referenced this issue Jun 24, 2022 (same commit message as above)
@QPotato

QPotato commented Jun 25, 2022

I want to take this as my first issue.

When you say "On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel", I assume you mean channelmanager.rs:6823-6828, where it checks for each channel monitor whether its funding transaction is missing from the funding transactions set, right? If I understand this correctly, when de-serializing the persisted state, every ChannelMonitorUpdateStep for a ChannelMonitor is replayed so that we build up to the state we had before the last shutdown. Then, the solution you propose would be to

  • Add a new property allow_automated_broadcast to ChannelMonitorImpl that defaults to true
  • On update_monitor, when matching ChannelForceClosed with should_broadcast: false, set the flag to false
  • On broadcast_latest_holder_commitment_txn, check the flag and if it's false then... log_info! and do nothing? (roughly as sketched below)
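
A rough sketch of those three steps, using simplified stand-ins for `ChannelMonitorUpdateStep` and `ChannelMonitorImpl` rather than the real rust-lightning definitions:

```rust
// Hypothetical, simplified placeholders -- not rust-lightning's actual definitions.
enum ChannelMonitorUpdateStep {
    ChannelForceClosed { should_broadcast: bool },
    // ... the other variants are elided
}

struct ChannelMonitorImpl {
    allow_automated_broadcast: bool, // would default to true when the monitor is created
}

impl ChannelMonitorImpl {
    fn update_monitor(&mut self, update: &ChannelMonitorUpdateStep) {
        match update {
            ChannelMonitorUpdateStep::ChannelForceClosed { should_broadcast: false } => {
                // From now on, automated broadcasting may be unsafe; remember that
                // (and persist it, so it survives a write/read round trip).
                self.allow_automated_broadcast = false;
            }
            _ => { /* existing handling of the remaining update steps */ }
        }
    }

    fn broadcast_latest_holder_commitment_txn(&self) {
        if !self.allow_automated_broadcast {
            // The real code would log_info!(...) here; either way, broadcast nothing.
            return;
        }
        // ... existing broadcast logic
    }
}
```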

Some questions I have:

  • It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

    • Rename the method to maybe_broadcast_latest_holder_commitment_txn, so that it's explicit that under certain conditions it will not broadcast?
    • Define a new method maybe_broadcast... that wraps up the current one but checks the flag first and use this new method in this specific use case?

    I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

  • Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 25, 2022 (same commit message as above)
TheBlueMatt reopened this on Jun 27, 2022
@TheBlueMatt
Collaborator Author

Heh, #1564 didn't resolve this, it just referenced it...

@TheBlueMatt
Collaborator Author

Yes, your understanding is correct. As for your questions:

It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

Probably the second: it's still important that users be able to broadcast the latest state if they want to, so we at least need an option to broadcast the state. Though it also wouldn't be unreasonable to split it into two methods, maybe_... and force_..._unsafe, to indicate that the first will only work if we think it's okay to do so, and the second will force it to happen even if it's unsafe to do so.
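
A rough sketch of what that two-method split could look like; the exact names (`maybe_broadcast_latest_holder_commitment_txn`, `force_broadcast_latest_holder_commitment_txn_unsafe`) and the `Monitor` type here are placeholders, not settled API:

```rust
// Hypothetical sketch of the maybe_ / force_..._unsafe split discussed above.
struct Monitor {
    allow_automated_broadcast: bool,
}

impl Monitor {
    // What automated paths (e.g. ChannelManager startup) would call: only
    // broadcasts if we haven't previously been told it may be unsafe.
    fn maybe_broadcast_latest_holder_commitment_txn(&self) {
        if self.allow_automated_broadcast {
            self.force_broadcast_latest_holder_commitment_txn_unsafe();
        }
    }

    // Always broadcasts, even if we may be running with stale state, so a user
    // who accepts the risk can still get their latest commitment tx on chain.
    fn force_broadcast_latest_holder_commitment_txn_unsafe(&self) {
        // ... sign and hand the latest holder commitment transaction(s) to the broadcaster
    }
}
```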

I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

I think in that case, yes, the flag should stop us (though of course generally the flag is not set).

Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

No, once it's unsafe it's unsafe. Unsafe here means we've lost some data, and there's no way to recover that data.

@QPotato

QPotato commented Jun 27, 2022

Great. I'll get to work on the PR then. Thanks!

@QPotato

QPotato commented Jul 4, 2022

I'm having a hard time trying to test the code. I'll keep working on it, but I'm posting my progress here in case anyone wants to chime in with help/comments/ideas. My idea would be to cover this new behavior with a simple unit test that:

  • Builds a ChannelMonitor
  • Sends it a ForceClosed{should_broadcast: false}
  • Writes and reads it again
  • And then tries to call the new methods and asserts whether the transaction was broadcast (sketched in rough form at the end of this comment).

My plan was to copy and adapt some other unit test for broadcast_latest_holder_commitment_txn but couldn't find any. If I understand correctly, you write unit tests for this at the end of the module and integration tests in separate modules, mostly in functional_tests.rs. I also found:

  • A TestPersister that I think implements Write and Read in-memory.
  • A TestBroadcaster that can probably help me with the assertions.
  • A check_closed_broadcast! which I believe won't be useful here because it's checking whether we broadcast the channel-closed message as a gossip update to network peers, though I'm not really sure I understood that correctly.
  • On your last PR Matt, you added do_test_data_loss_protect, which seems to be a more general-case integration test. Maybe I should just (or also) extend that one?
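
For illustration only, here is a standalone sketch of that test shape using toy types instead of LDK's ChannelMonitor/TestPersister/TestBroadcaster; it only shows the "force-close without broadcast, round-trip, then assert nothing was broadcast" structure, not how to wire it into functional_tests.rs:

```rust
// Toy stand-ins for the real monitor, persister and broadcaster.
use std::cell::RefCell;

struct ToyBroadcaster { txn: RefCell<Vec<String>> }
impl ToyBroadcaster {
    fn broadcast(&self, tx: String) { self.txn.borrow_mut().push(tx); }
}

struct ToyMonitor { allow_automated_broadcast: bool }
impl ToyMonitor {
    fn force_closed(&mut self, should_broadcast: bool) {
        if !should_broadcast { self.allow_automated_broadcast = false; }
    }
    // Stand-ins for (de)serialization; a real test would go through the persister.
    fn write(&self) -> Vec<u8> { vec![self.allow_automated_broadcast as u8] }
    fn read(bytes: &[u8]) -> Self { Self { allow_automated_broadcast: bytes[0] == 1 } }
    fn maybe_broadcast(&self, broadcaster: &ToyBroadcaster) {
        if self.allow_automated_broadcast {
            broadcaster.broadcast("latest_holder_commitment_tx".to_string());
        }
    }
}

#[test]
fn no_automated_broadcast_after_unsafe_close() {
    let mut monitor = ToyMonitor { allow_automated_broadcast: true };
    monitor.force_closed(false);                        // ChannelForceClosed { should_broadcast: false }
    let reloaded = ToyMonitor::read(&monitor.write());  // write it out and read it back
    let broadcaster = ToyBroadcaster { txn: RefCell::new(Vec::new()) };
    reloaded.maybe_broadcast(&broadcaster);
    assert!(broadcaster.txn.borrow().is_empty());       // the commitment tx was not broadcast
}
```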

@TheBlueMatt
Collaborator Author

I left a small pointer on your PR and let's continue there, but in general it's probably simplest, and gives the most coverage, to add a new functional test based on an existing one.
