[contractcourt] Republish close tx on startup #3016

Merged

Conversation

@halseth (Collaborator) commented Apr 26, 2019

This PR makes us store the closing tx we propose in the database, so that the ChainArbitrator can re-publish it at startup for channels in the waiting-close state.

This is meant to handle the case where we attempt to close the channel but go down before the tx has propagated to the network. Now we store this tx the moment it is ready, so that we can re-publish it.

Builds on #3510
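
As a rough illustration (not the actual lnd API; all names below are made up for this sketch), the startup behaviour described above amounts to something like:

import "github.com/btcsuite/btcd/wire"

// waitingCloseChannel is an illustrative stand-in for a channel in the
// waiting-close state that may have a closing tx persisted for it.
type waitingCloseChannel struct {
	closeTx *wire.MsgTx // nil if no closing tx was stored before shutdown
}

// txPublisher is a stand-in for whichever component broadcasts transactions.
type txPublisher interface {
	PublishTransaction(tx *wire.MsgTx) error
}

// republishClosingTxns re-broadcasts the stored closing tx of every channel
// still waiting for its close to confirm. Re-publishing an already-known or
// already-confirmed tx is harmless, so this can run on every startup.
func republishClosingTxns(chans []waitingCloseChannel, pub txPublisher) error {
	for _, c := range chans {
		if c.closeTx == nil {
			// Nothing was persisted for this channel; skip it.
			continue
		}
		if err := pub.PublishTransaction(c.closeTx); err != nil {
			return err
		}
	}
	return nil
}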

Member left a comment:

The diff looks pretty straightforward on first pass. The only significant question that comes to mind is whether we need to manually set the state of the channel arb if we find that we marked it as broadcast but then never committed that state transition.

This PR should also serve to remedy a portion of the lingering DLP initiation bug. However, the larger fix seems to be a bit more involved, so I'm happy we split this off into its own PR, as it can likely land in 0.6.1.

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch 2 times, most recently from 5cf7f84 to 4686e5c Apr 30, 2019
@halseth (Collaborator, Author) commented Apr 30, 2019

Comments addressed. PTAL @Roasbeef

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from 4686e5c to d390428 May 9, 2019
@Roasbeef Roasbeef added this to the 0.7 milestone May 10, 2019
@@ -721,6 +721,15 @@ func (c *ChannelArbitrator) stateStep(triggerHeight uint32,
	}
	closeTx = closeSummary.CloseTx

	// Before publishing the closing tx, we persist it to the

@Roasbeef (Member), May 10, 2019:

I think there's a gap we're still missing. When we call the ForceCloseChan method above, we'll go to mark the channel as borked: https://github.com/lightningnetwork/lnd/blob/master/contractcourt/chain_arbitrator.go#L236. However, we don't specify any additional context here to convey that we're about to force close the channel. As a result, when we come back up, we won't know to trigger the force close again as there's no force close transaction present.

I'm on the fence w.r.t if we need to care about this scenario atm. This starts to get into the greater issue w.r.t reliably carrying out the DLP steps, since it also applies to all the calls to MarkBorked in this method: https://github.com/lightningnetwork/lnd/blob/master/lnwallet/channel.go#L3211

If we provide the context there, then we'll be able to handle failures at each step of the force close (and also DLP) flow:

  • mark channel action (borked) -> (1) mark commitment broadcast and commit transaction -> (2) broadcast transaction -> (3) update state

The state written at step 1 should allow us to determine whether we need to act (remote-party data loss DLP and our local force close), or just halt and wait (our own data loss). Any auxiliary data such as a channel close summary, or the last chan-reest message sent, should also be written at that point. In the DLP case, we need to write the message up front, as otherwise when we go to look for it, it won't be present until the channel has been fully closed on-chain.
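
As a rough sketch of that ordering (method names are illustrative, except MarkCommitmentBroadcasted which appears in this PR's diff), the persist has to happen before the broadcast so a crash in between can be recovered:

import "github.com/btcsuite/btcd/wire"

// Illustrative interfaces for the three numbered steps above.
type closeStateStore interface {
	MarkCommitmentBroadcasted(tx *wire.MsgTx) error // (1) persist tx + status
	AdvanceState() error                            // (3) record the new state
}

type broadcaster interface {
	PublishTransaction(tx *wire.MsgTx) error // (2) hand the tx to the network
}

func persistThenForceClose(db closeStateStore, net broadcaster, closeTx *wire.MsgTx) error {
	// Persist first: if we crash after this call, startup code can find the
	// stored tx and resume at step (2) instead of losing the close entirely.
	if err := db.MarkCommitmentBroadcasted(closeTx); err != nil {
		return err
	}
	if err := net.PublishTransaction(closeTx); err != nil {
		return err
	}
	return db.AdvanceState()
}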

@halseth (Author, Collaborator), May 21, 2019:

You're right, there are several cases like this, and it feels like they should be handled in a general way.

We could make all the MarkChanX methods take an optional close tx, such that we can provide it for the cases where we know of a safe tx to publish. Alternatively we could make the borked state imply that we deem it safe to publish our latest commitment.
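
A hypothetical shape for the first option, with all types and helpers invented purely for illustration (the real MarkChanX methods and statuses differ):

import "github.com/btcsuite/btcd/wire"

// Hypothetical types and helpers, defined only so the sketch is complete.
type chanStatus uint8

const statusBorked chanStatus = 1

type channelStateDB struct{}

func (d *channelStateDB) storeCloseTx(tx *wire.MsgTx) error { return nil } // stub
func (d *channelStateDB) putStatus(s chanStatus) error      { return nil } // stub

// MarkBorked records the borked status and, if we already know of a tx that
// is safe to publish, persists it as well. A nil tx means "no safe tx known".
func (d *channelStateDB) MarkBorked(safeCloseTx *wire.MsgTx) error {
	if safeCloseTx != nil {
		// Persist the tx alongside the status so startup code can
		// re-publish it without further negotiation with the peer.
		if err := d.storeCloseTx(safeCloseTx); err != nil {
			return err
		}
	}
	return d.putStatus(statusBorked)
}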

Collaborator left a comment:

My general idea about this part of the system is that it should move towards having a lot less persistent state, re-deriving data on startup rather than relying on state from disk.

There are a lot of things persisted atm like the actions, resolvers, resolutions, arbitrator state. It is likely that not all of that is necessary to keep on disk, but it does limit our flexibility in making changes.

This PR adds another bit of data, the close tx, to what we already have, making it more db data instead of less. In that regard, it feels like a step in a direction we may not want to go.

Is it possible to fix this issue without adding more persistent state? Maybe by doing a bit more republishing, possibly redundant, like in the sweeper?

@Roasbeef (Member) commented May 14, 2019

Is it possible to fix this issue without adding more persistent state? Maybe by doing a bit more republishing, possibly redundant, like in the sweeper?

Yes, but it needs to store this transaction somewhere. For channels that are closing or closed, it currently isn't stored anywhere. We can't reconstruct this data ourselves, since we need the signature of the remote node alongside it.

@halseth (Collaborator, Author) commented May 16, 2019

Yes, the idea behind storing the tx was to handle all possible closes uniformly. Our own force close tx we can re-derive, but not coop closes. It would also allow us in the future to store the remote force close tx if we retrieve it.

@Roasbeef (Member) commented May 21, 2019

So with that said, I think we can move forward on this? Or are we still unclear w.r.t the goals here?

@joostjager (Collaborator) commented May 21, 2019

My question about why this needs to go in the db has been answered.

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch 3 times, most recently from 8beb6e5 to 08a4a2d May 21, 2019
@Roasbeef (Member) commented May 21, 2019

Is it possible to fix this issue without adding more persistent state? Maybe by doing a bit more republishing, possibly redundant, like in the sweeper?

We can reconstruct our commitment transaction for force closes, but we can't reconstruct the co-op close transaction since we need a new signature from the remote party. So the question is whether we should have a unified location to check for the closing transaction, or special-case the possible closing paths. At the very least, it seems we need a new key in the database to know what type of close was attempted and where we are in that process.
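
Purely for illustration, such a key could be as small as an enum-like marker per channel (not what this PR actually stores):

// Illustrative only: a per-channel marker recording which close path was
// attempted, so startup logic knows whether the tx can be re-derived
// locally (force close) or must be kept verbatim (co-op close).
type closeAttempt uint8

const (
	closeNone        closeAttempt = iota // no close attempted yet
	closeLocalForce                      // our commitment; can be re-derived
	closeCooperative                     // needs the stored, remotely signed tx
	closeRemoteForce                     // remote's commitment, stored if retrieved
)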

@Roasbeef (Member) commented May 21, 2019

Another open question is whether we deem the level of fix implemented in this PR acceptable to move forward with, or whether we'd like to step back and resolve the other issues I outlined in an earlier comment as well.

Collaborator left a comment:

in agreement with the general idea of distinguishing coop and force closes in the channel status. if it's clear how we could migrate from the fix in this pr to distinguishing the coop case later, then i think it makes sense to move forward. if that's not so clear, might be better to hold off and resolve them all at once.

	if err := c.cfg.MarkCommitmentBroadcasted(closeTx); err != nil {
		log.Errorf("ChannelArbitrator(%v): unable to "+
			"mark commitment broadcasted: %v",
			c.cfg.ChanPoint, err)

@cfromknecht (Collaborator), May 22, 2019:

feels like the error should be returned here?

@halseth (Author, Collaborator), Sep 11, 2019:

Done: 3725355
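
For context, the referenced change presumably turns the log-and-continue above into something along these lines (a hedged sketch continuing the quoted snippet; any additional return values of the surrounding function are omitted):

if err := c.cfg.MarkCommitmentBroadcasted(closeTx); err != nil {
	log.Errorf("ChannelArbitrator(%v): unable to "+
		"mark commitment broadcasted: %v",
		c.cfg.ChanPoint, err)
	// Propagate the failure instead of only logging it, so the state
	// machine does not advance past a failed persist.
	return err
}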

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from 08a4a2d to 152afb3 May 29, 2019
@halseth (Collaborator, Author) commented May 29, 2019

Pushed a new version which makes sure we store the necessary context when changing the channel status in the DB. Instead of calling MarkBorked when we attempt to force close, we now call MarkCommitmentBroadcasted immediately, with our closing tx, so that it will be re-published on startup.

Calling MarkBorked should now only be done in cases where we don't know if it is safe to force close the channel.

PTAL @Roasbeef @cfromknecht @wpaulino

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from 89129f8 to 70a0c54 Sep 19, 2019
// connection is being torn down simultaneously. This delay can be
// removed after the force close is reliable, but in the meantime it
// improves the reliability of successfully closing out the channel.
if chanState.HasChanStatus(channeldb.ChanStatusRestored) {

@cfromknecht (Collaborator), Sep 19, 2019:

does this still trigger itest flakes?

@halseth (Author, Collaborator), Sep 20, 2019:

Flake fixed by 70a0c54 😄

return daveBalResp.ConfirmedBalance > 0
}, defaultTimeout)
if err != nil {
t.Fatalf("On-chain balance not restored: %v", err)

@cfromknecht (Collaborator), Sep 19, 2019:

perhaps wait.NoError here, so that we can get the incorrect balance in the err msg?

@halseth (Author, Collaborator), Sep 20, 2019:

It will most likely just be zero, but fixed: f4786ab 😅
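
For reference, the wait.NoError variant suggested above could look roughly like this. It is only a sketch: it assumes the surrounding test's dave node handle, ctxb context, defaultTimeout, and the lntest wait and lnrpc imports already present in the test file:

err = wait.NoError(func() error {
	daveBalResp, err := dave.WalletBalance(ctxb, &lnrpc.WalletBalanceRequest{})
	if err != nil {
		return fmt.Errorf("unable to query balance: %v", err)
	}
	if daveBalResp.ConfirmedBalance <= 0 {
		// Surfacing the actual value makes flake triage easier.
		return fmt.Errorf("expected positive confirmed balance, got %d",
			daveBalResp.ConfirmedBalance)
	}
	return nil
}, defaultTimeout)
if err != nil {
	t.Fatalf("on-chain balance not restored: %v", err)
}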

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from 70a0c54 to f4786ab Sep 20, 2019
Member left a comment:

This is getting very close, nice work! Can we get a pre-emptive squash on this? To my current understanding, what's implemented in this PR appears to address the remaining gaps we were aware of w.r.t reliable DLP execution.

However, neutrino itest fails with:

    --- FAIL: TestLightningNetworkDaemon/data_loss_protection (13.37s)
        lnd_test.go:100: Failed: (data loss protection): exited with error: 
            *errors.errorString unable to connect Dave to carol: rpc error: code = Unknown desc = dial tcp 127.0.0.1:19799: connect: connection refused
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lntest/itest/lnd_test.go:8213 (0xd4c4a3)

Is this unrelated? Do we have any idea what is causing the failure? It seems distinct from the prior lack of force close related errors due to the inconsistency/race-condition.

@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from f4786ab to 0993bed Sep 24, 2019
@halseth (Collaborator, Author) commented Sep 24, 2019

However, neutrino itest fails with:

    --- FAIL: TestLightningNetworkDaemon/data_loss_protection (13.37s)
        lnd_test.go:100: Failed: (data loss protection): exited with error: 
            *errors.errorString unable to connect Dave to carol: rpc error: code = Unknown desc = dial tcp 127.0.0.1:19799: connect: connection refused
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lntest/itest/lnd_test.go:8213 (0xd4c4a3)

Is this unrelated? Do we have any idea what is causing the failure? It seems distinct from the prior lack of force close related errors due to the inconsistency/race-condition.

Yeah, I saw this flake pretty regularly when working on the initial neutrino itests. It was caused by some connection replacement in the peer/connmgr logic. It went away mostly with the connection manager changes, but maybe it is still not totally robust yet.

I gave it a squash, but it still needs #3510 before the final squash and merge.

@halseth (Collaborator, Author) commented Sep 25, 2019

However, neutrino itest fails with:

    --- FAIL: TestLightningNetworkDaemon/data_loss_protection (13.37s)
        lnd_test.go:100: Failed: (data loss protection): exited with error: 
            *errors.errorString unable to connect Dave to carol: rpc error: code = Unknown desc = dial tcp 127.0.0.1:19799: connect: connection refused
            /home/travis/gopath/src/github.com/lightningnetwork/lnd/lntest/itest/lnd_test.go:8213 (0xd4c4a3)

Is this unrelated? Do we have any idea what is causing the failure? It seems distinct from the prior lack of force close related errors due to the inconsistency/race-condition.

Yeah, I saw this flake pretty regularly when working on the initial neutrino itests. It was caused by some connection replacement in the peer/connmgr logic. It went away mostly with the connection manager changes, but maybe it is still not totally robust yet.

Just hit this also on the travis run for #3510 https://travis-ci.org/lightningnetwork/lnd/jobs/589294404, so it is an existing issue.

halseth added 17 commits Sep 6, 2019
The exported FullSync method is only used by test code, so we remove it
and instead use SyncPending.
When loading active channels for a connected peer, we gather channel
sync messages for all borked channels and send them to the peer. This
should help a peer realize that the state is irreconcilable, as we have
already realized.
Previously it would always be the same, resulting in multiple calls to
the method not being usable to create more than one set of channels.
…Commitment

TestChainArbitratorRepulishCommitment tests that the chain arbitrator
will republish closing transactions for channels marked
CommitmentBroadcast in the database at startup.
This commit converts the ErrCommitSyncLocalDataLoss error into a struct
that also holds the last unrevoked commit point received from the remote
party.
Instead of marking the database state when processing the channel
reestablishment message, we wait for the result of this processing to
arrive in the link, and mark it accordingly in the database here.

We do this to move the logic that determines whether we should force
close the channel or not, and what state to mark it as in the DB, to the
same place, as these need to be consistent.
Instead of marking the channel Borked in cases where we want to force
close it, we immediately let the peer fail the link. The channel state
will instead be updated by the channel arbitrator, which will transition
to StateBroadcastCommit, marking the channel borked, then marking the
commitment tx broadcasted right before publishing the force close tx. We
do this to avoid the case where we would mark it Borked, but go down
before being able to publish the closing tx.

Storing the force close tx ensures it will be re-published on startup.
We call MarkCommitmentBroadcasted before publishing the closing tx to
ensure we can attempt to republish at startup if something goes wrong.
…blish

Before publishing the close tx to the network and committing to the
StateCommitmentBroadcasted state, we mark the commitment as broadcasted
and store it in the db. This ensures it will get re-published on startup
if we go down.
Since we have access to the internal state of the channel, we can get
it directly instead of passing it in as a parameter.
Earlier this delay was needed to increase the likelihood that the DLP
scenario was successfully completed. Since we would risk the connection
being torn down and the link exiting, we could end up with the remote
marking the channel borked, but not finishing the force close.

With the previous set of commits, we should now trigger the force close
before we mark the channel borked, which should ensure we'll resume the
process on the next restart/connect.
We add a wait predicate to make sure the node's on-chain balance is
restored before continuing the restore test case.

This is needed since the DLP test scenario includes several restarts of
the node, and if the node isn't done scanning for on-chain balance
before the restart happens, it would be unlocked without a recovery
window, causing funds to be left undiscovered.
@halseth halseth force-pushed the halseth:republish-close-tx-on-startup branch from 0993bed to 97093b4 Sep 25, 2019
}

daveBal := daveBalResp.ConfirmedBalance
if daveBal <= 0 {

@wpaulino (Collaborator), Sep 25, 2019:

Can we provide a better bound to ensure Dave has recovered all of his balance?

Collaborator left a comment:

LGTM 🔮

Member left a comment:

LGTM 🧩

@Roasbeef Roasbeef merged commit 0c076bf into lightningnetwork:master Sep 25, 2019
2 checks passed:
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.06%) to 63.217%