
Don't Always Broadcast latest state on startup if we'd previously closed-without-broadcast #1563

Open
TheBlueMatt opened this issue Jun 23, 2022 · 6 comments · Fixed by #1564 · May be fixed by #1593
Labels: good first issue (Good for newcomers)
Milestone: 0.1

Comments

@TheBlueMatt
Collaborator

On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel. This is all fine and great except if we'd previously closed the ChannelMonitor with a ChannelForceClosed { should_broadcast: false } indicating it may be unsafe to broadcast. In this case, we should disable automated broadcasting. This should just require tracking the should_broadcast state in the ChannelMonitor itself and checking it before broadcasting (maybe a new automated flag on broadcast_latest_holder_commitment_txn?)

This is really "the" issue on #775 so I'm gonna close that in favor of this one.
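
A minimal, self-contained sketch of that idea, assuming a placeholder `MonitorState` type and an `automated` parameter on the broadcast call; this is not LDK's actual API, just an illustration of gating automated broadcasts on the remembered `should_broadcast` state:

```rust
// Hypothetical sketch only, not rust-lightning's real types or API. It shows the
// idea of remembering the last `should_broadcast` we saw and gating *automated*
// broadcasts on it, while still allowing an explicit user-requested broadcast.
struct MonitorState {
    // Would be set to false when processing ChannelForceClosed { should_broadcast: false }.
    allow_automated_broadcast: bool,
}

impl MonitorState {
    // `automated` would be true when called from ChannelManager startup and
    // false when the user explicitly asks us to broadcast.
    fn broadcast_latest_holder_commitment_txn(&self, automated: bool) {
        if automated && !self.allow_automated_broadcast {
            // A previous close told us broadcasting may be unsafe; do nothing.
            return;
        }
        // ... sign and hand the latest holder commitment transaction(s) to the broadcaster
    }
}
```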

TheBlueMatt added the good first issue label on Jun 23, 2022
TheBlueMatt added this to the 0.1 milestone on Jun 23, 2022
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 23, 2022
When we receive a `channel_reestablish` with a `data_loss_protect`
that proves we're running with a stale state, instead of
force-closing the channel, we immediately panic. This lines up with
our refusal to run if we find a `ChannelMonitor` which is stale
compared to our `ChannelManager` during `ChannelManager`
deserialization. Ultimately both are an indication of the same
thing - that the API requirements on `chain::Watch` were violated.

In the "running with outdated state but ChannelMonitor(s) and
ChannelManager lined up" case specifically its likely we're running
off of an old backup, in which case connecting to peers with
channels still live is explicitly dangerous. That said, because
this could be an operator error that is correctable, panicing
instead of force-closing may allow for normal operation again in
the future (cc lightningdevkit#1207).

In any case, we provide instructions in the panic message for how
to force-close channels prior to peer connection, as well as a note
on how to broadcast the latest state if users are willing to take
the risk.

Note that this is still somewhat unsafe until we resolve lightningdevkit#1563.
TheBlueMatt added commits to TheBlueMatt/rust-lightning that referenced this issue Jun 24, 2022 (same commit message as above)
@QPotato

QPotato commented Jun 25, 2022

I want to take this as my first issue.

When you say "On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel", I assume you mean channelmanager.rs:6823-6828, where it checks for each channel monitor whether its funding transaction is missing from the funding transactions set, right? If I understand this correctly, when de-serializing the persisted state, every ChannelMonitorUpdateStep for a ChannelMonitor is replayed so that we build up to the state we had before the last shutdown. Then, the solution you propose would be to

  • Add a new property allow_automated_broadcast to ChannelMonitorImpl that defaults to true
  • On update_monitor, when matching ChannelForceClosed with should_broadcast: false, set the flag to false
  • On broadcast_latest_holder_commitment_txn, check the flag and if it's false then... log_info! and do nothing? (roughly as sketched below)
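
A rough sketch of those three steps, using simplified stand-ins for `ChannelMonitorUpdateStep` and `ChannelMonitorImpl` rather than the real rust-lightning definitions:

```rust
// Hypothetical, simplified placeholders -- not rust-lightning's actual definitions.
enum ChannelMonitorUpdateStep {
    ChannelForceClosed { should_broadcast: bool },
    // ... the other variants are elided
}

struct ChannelMonitorImpl {
    allow_automated_broadcast: bool, // would default to true when the monitor is created
}

impl ChannelMonitorImpl {
    fn update_monitor(&mut self, update: &ChannelMonitorUpdateStep) {
        match update {
            ChannelMonitorUpdateStep::ChannelForceClosed { should_broadcast: false } => {
                // From now on, automated broadcasting may be unsafe; remember that
                // (and persist it, so it survives a write/read round trip).
                self.allow_automated_broadcast = false;
            }
            _ => { /* existing handling of the remaining update steps */ }
        }
    }

    fn broadcast_latest_holder_commitment_txn(&self) {
        if !self.allow_automated_broadcast {
            // The real code would log_info!(...) here; either way, broadcast nothing.
            return;
        }
        // ... existing broadcast logic
    }
}
```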

Some questions I have:

  • It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

    • Rename the method to maybe_broadcast_latest_holder_commitment_txn, so that it's explicit that under certain conditions it will not broadcast?
    • Define a new method maybe_broadcast... that wraps up the current one but checks the flag first and use this new method in this specific use case?

    I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

  • Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 25, 2022 (same commit message as above)
TheBlueMatt reopened this on Jun 27, 2022
@TheBlueMatt
Collaborator Author

Heh, #1564 didn't resolve this, it just referenced it...

@TheBlueMatt
Collaborator Author

Yes, your understanding is correct. As for your questions:

It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

Probably the second: it's still important that users be able to broadcast the latest state if they want to, so we at least need an option to broadcast the state. Though it also wouldn't be unreasonable to split it into two methods, maybe_... and force_..._unsafe, to indicate that the first will only work if we think it's okay to do so, and the second will force it to happen even if it's unsafe to do so.
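
A rough sketch of what that two-method split could look like; the exact names (`maybe_broadcast_latest_holder_commitment_txn`, `force_broadcast_latest_holder_commitment_txn_unsafe`) and the `Monitor` type here are placeholders, not settled API:

```rust
// Hypothetical sketch of the maybe_ / force_..._unsafe split discussed above.
struct Monitor {
    allow_automated_broadcast: bool,
}

impl Monitor {
    // What automated paths (e.g. ChannelManager startup) would call: only
    // broadcasts if we haven't previously been told it may be unsafe.
    fn maybe_broadcast_latest_holder_commitment_txn(&self) {
        if self.allow_automated_broadcast {
            self.force_broadcast_latest_holder_commitment_txn_unsafe();
        }
    }

    // Always broadcasts, even if we may be running with stale state, so a user
    // who accepts the risk can still get their latest commitment tx on chain.
    fn force_broadcast_latest_holder_commitment_txn_unsafe(&self) {
        // ... sign and hand the latest holder commitment transaction(s) to the broadcaster
    }
}
```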

I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

I think in that case, yes, the flag should stop us (though of course generally the flag is not set).

Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

No, once it's unsafe it's unsafe. Unsafe here means we've lost some data, and there's no way to recover that data.

@QPotato

QPotato commented Jun 27, 2022

Great. I'll get to work on the PR then. Thanks!

@QPotato

QPotato commented Jul 4, 2022

I'm having a hard time trying to test the code. I'll keep working on it, but I'm posting my progress here in case anyone wants to chime in with help/comments/ideas. My idea would be to cover this new behavior with a simple unit test that:

  • Builds a ChannelMonitor
  • Sends it a ForceClosed{should_broadcast: false}
  • Writes and reads it again
  • And then tries to call the new methods and asserts whether the transaction was broadcast (sketched in rough form at the end of this comment).

My plan was to copy and adapt some other unit test for broadcast_latest_holder_commitment_txn but couldn't find any. If I understand correctly, you write unit tests for this at the end of the module and integration tests in separate modules, mostly in functional_tests.rs. I also found:

  • A TestPersister that I think implements Write and Read in-memory.
  • A TestBroadcaster that can probably help me with the assertions.
  • A check_closed_broadcast! which I believe won't be useful here because it's checking whether we broadcast the channel-closed message as a gossip update to network peers, though I'm not really sure I understood that correctly.
  • On your last PR Matt, you added do_test_data_loss_protect, which seems to be a more general-case integration test. Maybe I should just (or also) extend that one?
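
For illustration only, here is a standalone sketch of that test shape using toy types instead of LDK's ChannelMonitor/TestPersister/TestBroadcaster; it only shows the "force-close without broadcast, round-trip, then assert nothing was broadcast" structure, not how to wire it into functional_tests.rs:

```rust
// Toy stand-ins for the real monitor, persister and broadcaster.
use std::cell::RefCell;

struct ToyBroadcaster { txn: RefCell<Vec<String>> }
impl ToyBroadcaster {
    fn broadcast(&self, tx: String) { self.txn.borrow_mut().push(tx); }
}

struct ToyMonitor { allow_automated_broadcast: bool }
impl ToyMonitor {
    fn force_closed(&mut self, should_broadcast: bool) {
        if !should_broadcast { self.allow_automated_broadcast = false; }
    }
    // Stand-ins for (de)serialization; a real test would go through the persister.
    fn write(&self) -> Vec<u8> { vec![self.allow_automated_broadcast as u8] }
    fn read(bytes: &[u8]) -> Self { Self { allow_automated_broadcast: bytes[0] == 1 } }
    fn maybe_broadcast(&self, broadcaster: &ToyBroadcaster) {
        if self.allow_automated_broadcast {
            broadcaster.broadcast("latest_holder_commitment_tx".to_string());
        }
    }
}

#[test]
fn no_automated_broadcast_after_unsafe_close() {
    let mut monitor = ToyMonitor { allow_automated_broadcast: true };
    monitor.force_closed(false);                        // ChannelForceClosed { should_broadcast: false }
    let reloaded = ToyMonitor::read(&monitor.write());  // write it out and read it back
    let broadcaster = ToyBroadcaster { txn: RefCell::new(Vec::new()) };
    reloaded.maybe_broadcast(&broadcaster);
    assert!(broadcaster.txn.borrow().is_empty());       // the commitment tx was not broadcast
}
```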

@TheBlueMatt
Collaborator Author

I left a small pointer on your PR and let's continue there, but in general it's probably simplest, and gives the most coverage, to add a new functional test based on an existing one.
