Data loss protect resending #1937

Merged
merged 18 commits into lightningnetwork:master from halseth:data-loss-protect-resending on Nov 26, 2018

Conversation

halseth (Collaborator) commented Sep 19, 2018

This PR makes the latest channel reestablishment message be stored as part of the channel close summary, such that it can be sent even after a channel has been closed.

This makes it possible for nodes to recover in the case where they have been offline, lost state, and come back online, attempting to resync the channel.

Note: This PR contains a DB migration. To ensure the deserialization code is not changed in the future, breaking old migrations, a versioned copy of the current deserialization logic is added to the file legacy_serialization.go. Backwards-compatible addition of the new field turned out to be complex because of the existing optional fields (see the now-removed commit 14648b2).

TODO:

  • Unit test for FetchClosedChannelForID
// Check if we have a channel sync message to read.
var hasChanSyncMsg bool
err = ReadElements(r, &hasChanSyncMsg)
if err == io.EOF {

halseth (Collaborator) Sep 19, 2018

Note: we could write the false marker here as part of the migration to avoid this EOF check.

halseth (Collaborator) Sep 19, 2018

Currently the addition of the LastChanSyncMsg field is kept in its own commit to show how new fields can easily be added after the presence flags are introduced in the serialization.

Roasbeef (Member) Nov 20, 2018

Why are we mixing the two approaches? We should either use EOF everywhere (so no migration) or the bool based approach.

Roasbeef (Member) Nov 20, 2018

Ah scratch that, I see the usage now

halseth (Collaborator) Nov 20, 2018

This is how you would add new fields in the future without a migration.

if hasChanSyncMsg {
// We must pass in reference to a lnwire.Message for the codec
// to support it.
var msg lnwire.Message

halseth (Collaborator) Sep 19, 2018

#golang pop quiz: Why doesn't this work?

var msg *lnwire.ChannelReestablish
if err := ReadElements(r, &msg); err != nil {
		return nil, err
}
c.LastChanSyncMsg = msg

cfromknecht (Collaborator) Sep 21, 2018

i'm guessing that **lnwire.ChannelReestablish and *lnwire.Message are distinct concrete types, and the type switch doesn't try to determine whether the pointed-to type *lnwire.ChannelReestablish implements the pointed-to interface lnwire.Message

halseth (Collaborator) Sep 25, 2018

Yes, I think that's right. I think the type switch is checking if the passed interface{} is of concrete type *lnwire.Message, not whether it satisfies the interface. TIL you cannot do that: https://play.golang.org/p/SmL37Hobvvs
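
For readers following along, here is a minimal, runnable sketch of the pitfall discussed above. Message, ChannelReestablish, and readElement are simplified stand-ins for the lnwire types and the channeldb codec, not the actual code:

package main

import "fmt"

// Message and ChannelReestablish mimic the lnwire types; readElement mimics
// a codec whose type switch only knows about a pointer to the interface type.
type Message interface {
	MsgType() string
}

type ChannelReestablish struct{}

func (c *ChannelReestablish) MsgType() string { return "channel_reestablish" }

func readElement(element interface{}) error {
	switch element.(type) {
	case *Message:
		// The real codec would decode into *element here.
		return nil
	default:
		return fmt.Errorf("unknown type %T", element)
	}
}

func main() {
	// Fails: &concrete has dynamic type **ChannelReestablish, which the
	// type switch does not match, even though *ChannelReestablish
	// satisfies Message.
	var concrete *ChannelReestablish
	fmt.Println(readElement(&concrete))

	// Works: &msg has dynamic type *Message, matching the case exactly.
	var msg Message
	fmt.Println(readElement(&msg))
}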

Roasbeef (Member) left a comment

Nice work! This along with static channel back ups (incoming) really wraps up the set of items we need in place in lnd to cover all the bases we (currently) can as far as channel recovery. Most of my comments in this initial sweep concern the new migration and whether it's actually needed or not. Will do another pass over the tests, and also test this out a bit locally myself as well.

@@ -2094,7 +2094,12 @@ func serializeChannelCloseSummary(w io.Writer, cs *ChannelCloseSummary) error {
 	// If this is a close channel summary created before the addition of
 	// the new fields, then we can exit here.
 	if cs.RemoteCurrentRevocation == nil {
-		return nil
+		return WriteElements(w, false)

Roasbeef (Member) Nov 20, 2018

This is to indicate if optional fields are there or not? Why's this better (in terms of set of changes) than just attempting to read where we know fields will be appended? May be worth even just going to something more future proof here.

halseth (Collaborator) Nov 20, 2018

The problem with omitting optional fields is that it only works for the last field. If there is more than one optional field, you won't know what the data that is present represents. This way we attempt to make it future proof.

By adding booleans to indicate presence, we make sure old code can read the new format (it won't attempt to read any extra data it doesn't understand), and that new code can read the old format (the booleans indicate which fields are present, so it knows which fields are missing from the old format).
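
To make the scheme concrete, here is a rough read-side sketch. It assumes the channeldb ReadElements helper and the fields referenced elsewhere in this PR, plus the io, errors, and lnwire imports; the deserializeOptionalFields wrapper and the exact error handling are illustrative, not the PR's code:

func deserializeOptionalFields(r io.Reader, cs *ChannelCloseSummary) error {
	// Each optional field is preceded by a boolean saying whether the
	// field follows. Old readers stop after the fields they know about,
	// so appended data never confuses them.
	var hasRemoteNextRevocation bool
	if err := ReadElements(r, &hasRemoteNextRevocation); err != nil {
		return err
	}
	if hasRemoteNextRevocation {
		if err := ReadElements(r, &cs.RemoteNextRevocation); err != nil {
			return err
		}
	}

	// A field appended later, without a migration: if the stored summary
	// predates it we hit EOF here and simply stop, otherwise the marker
	// says whether the message follows.
	var hasChanSyncMsg bool
	err := ReadElements(r, &hasChanSyncMsg)
	if err == io.EOF {
		return nil
	} else if err != nil {
		return err
	}
	if hasChanSyncMsg {
		var msg lnwire.Message
		if err := ReadElements(r, &msg); err != nil {
			return err
		}
		chanSync, ok := msg.(*lnwire.ChannelReestablish)
		if !ok {
			return errors.New("unexpected message type for chan sync")
		}
		cs.LastChanSyncMsg = chanSync
	}
	return nil
}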

Roasbeef (Member) Nov 21, 2018

Yep, realized this later that day. I think in the future we should do a clean sweep and just move to a TLV format everywhere. You read it all, then at the end check the array/map for the keys you actually know of, then parse those. With that, we gain a way to handle an arbitrary number of fields in the future in a uniform manner throughout the codebase.
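
As a rough sketch of that TLV approach (generic illustration only; the record layout and the readTLVRecords helper are made up here and are not lnd's eventual tlv package): the reader collects every record into a map, and the caller then parses only the types it knows. Assumes "encoding/binary" and "io" are imported.

func readTLVRecords(r io.Reader) (map[uint16][]byte, error) {
	records := make(map[uint16][]byte)
	for {
		// Each record is a 2-byte type, a 2-byte length, then the value.
		var hdr [4]byte
		if _, err := io.ReadFull(r, hdr[:]); err == io.EOF {
			// Clean end of stream: return everything we found.
			return records, nil
		} else if err != nil {
			return nil, err
		}

		typ := binary.BigEndian.Uint16(hdr[:2])
		length := binary.BigEndian.Uint16(hdr[2:])

		value := make([]byte, length)
		if _, err := io.ReadFull(r, value); err != nil {
			return nil, err
		}

		// Unknown types are kept but simply ignored by the caller,
		// which is what makes the format forward compatible.
		records[typ] = value
	}
}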

halseth (Collaborator) Nov 21, 2018

ooooh, yeah that would be nice!

 	if cs.RemoteNextRevocation == nil {
-		return nil
+		if err := WriteElements(w, false); err != nil {

Roasbeef (Member) Nov 20, 2018

This seems to break the current serialization more than necessary. The minimal patch set would be to append the chan reest we need and nothing more.

halseth (Collaborator) Nov 20, 2018

Won't work for the case where cs.RemoteNextRevocation == nil and chanSyncMsg != nil, as the deserialization code will attempt to read the chanSyncMsg as RemoteNextRevocation.

@@ -830,11 +830,11 @@ func newChanMsgStream(p *peer, cid lnwire.ChannelID) *msgStream {
 		fmt.Sprintf("Update stream for ChannelID(%x) exiting", cid[:]),
 		1000,
 		func(msg lnwire.Message) {
-			_, isChanSycMsg := msg.(*lnwire.ChannelReestablish)
+			_, isChanSyncMsg := msg.(*lnwire.ChannelReestablish)

Roasbeef (Member) Nov 20, 2018

;)

@@ -573,3 +573,40 @@ func migratePruneEdgeUpdateIndex(tx *bolt.Tx) error {

return nil
}

// migrateOptionalChannelCloseSummaryFields migrates the serialized format of

Roasbeef (Member) Nov 20, 2018

See comments re if this is really worth yet another migration.

halseth (Collaborator) Nov 20, 2018

I first attempted to do this without a migration, but it turned out to be complex: 14648b2

Roasbeef (Member) Nov 21, 2018

Ah ok, yeah that's pretty convoluted.
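
For orientation, here is a rough sketch of the overall shape such a migration takes, pieced together from the description in this PR; the closedChannelBucket name and the exact helper signatures are assumptions rather than quotes from the diff:

func migrateOptionalChannelCloseSummaryFields(tx *bolt.Tx) error {
	// closedChannelBucket is assumed to be the bucket holding the close
	// summaries keyed by channel ID.
	closedChanBucket := tx.Bucket(closedChannelBucket)
	if closedChanBucket == nil {
		return nil
	}

	// Decode every stored summary with the frozen legacy deserializer,
	// then re-encode it using the new boolean presence markers. Collect
	// the new values first, since mutating a bucket while iterating it
	// is not allowed.
	newSummaries := make(map[string][]byte)
	err := closedChanBucket.ForEach(func(chanID, summary []byte) error {
		cs, err := deserializeCloseChannelSummaryV6(bytes.NewReader(summary))
		if err != nil {
			return err
		}

		var b bytes.Buffer
		if err := serializeChannelCloseSummary(&b, cs); err != nil {
			return err
		}
		newSummaries[string(chanID)] = b.Bytes()
		return nil
	})
	if err != nil {
		return err
	}

	for chanID, summary := range newSummaries {
		if err := closedChanBucket.Put([]byte(chanID), summary); err != nil {
			return err
		}
	}
	return nil
}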

@@ -7198,6 +7198,85 @@ func testDataLossProtection(net *lntest.NetworkHarness, t *harnessTest) {

assertNodeNumChannels(t, ctxb, dave, 0)
assertNodeNumChannels(t, ctxb, carol, 0)

Roasbeef (Member) Nov 20, 2018

Perhaps this should be split out into another test, or a sub-test? As is, this test is nearly 400 lines; not a blocker though, just something we should start to think about re the growth of the integration tests in the future.

Roasbeef (Member) Nov 20, 2018

Least blocking route would be a comment that delineates this new testing scenario.

halseth (Collaborator) Nov 20, 2018

Improved the comments.

Agreed that some of the tests would benefit from being split up. I think this might cause our integration tests to run even longer though, bc of the added overhead. Might be worth looking into in combination with running integration tests in parallel (perhaps using containers?).

Roasbeef (Member) Nov 21, 2018

👍

Yeah, we'll need to whiteboard out a better architecture as we add more backends, and eventually support platforms other than macOS.

} else {
closeSummary.LastChanSyncMsg = chanSync
}

Roasbeef (Member) Nov 20, 2018

Unrelated to this PR, but shouldn't the BRAR be the one that's making this DB state change here? Re the whole reliable hand-off thing.

cfromknecht (Collaborator) Nov 20, 2018

the brar writes the retributions to disk, then acks back. after that the chain watcher knows it's safe to mark the channel as pending closed

halseth (Collaborator) Nov 20, 2018

We could let the BRAR have that responsibility, as then we could get rid of the ACK logic.

cfromknecht (Collaborator) Nov 21, 2018

the cleaner solution IMO is just to write resolutions in CloseChannel. then there is no need for a reliable handoff. the resolutions can be passed in memory to kick off the process, but if we crash the brar would look for all breached channels in the db and load resolutions from there

// point, there's not much we can do other than wait
// for us to retrieve it. We will attempt to retrieve
// it from the peer each time we connect to it.
// TODO(halseth): actively initiate re-connection to

Roasbeef (Member) Nov 20, 2018

This can eventually be hooked up to the channel lifecycle management stuff, so we can request a notification once a new node connects. Actually we already have this for use in the funding manager, so perhaps it can be used here?

cfromknecht (Collaborator) Nov 20, 2018

yeah i think we can use NotifyWhenOnline to just try if they ever reconnect

halseth (Collaborator) Nov 20, 2018

I actually like the current polling more, since we might need some time to process the ChannelReestablish after they reconnect. In that case the first lookup after they connect might fail, so we have to wait a little bit anyway.

If we don't wait before retrying at that point, we risk the peer coming online, sending us the ChannelReestablish, and disconnecting, leaving us stalled here waiting for them to connect again.

cfromknecht (Collaborator) Nov 21, 2018

good point, leaving this decoupled from the peer lifecycle makes sense to me then


@halseth halseth force-pushed the halseth:data-loss-protect-resending branch 8 times, most recently from a34d016 to 4124c6d Nov 20, 2018

halseth (Collaborator) commented Nov 20, 2018

Comments addressed, PTAL 🐶

 	return &commitmentChain{
-		commitments:    list.New(),
-		startingHeight: initialHeight,
+		commitments: list.New(),

Roasbeef (Member) Nov 21, 2018

Where do we set the new starting height of a commitment chain? Don't see why this change is necessary atm. There're other spots where we depend on the height being set properly.

halseth (Collaborator) Nov 21, 2018

We never read it, we just look at the height of the commitment in the chain.

Agreed that the change is not directly relevant to this PR, I just noticed when transforming the ChanSync method to not being dependent on the chain heights.

Removed and made a separate PR: #2206

@@ -5113,6 +5112,15 @@ func NewUnilateralCloseSummary(chanState *channeldb.OpenChannel, signer Signer,
LocalChanConfig: chanState.LocalChanCfg,
}

// Attempt to add a channel sync message to the close summary.
chanSync, err := ChanSyncMsg(chanState)

Roasbeef (Member) Nov 21, 2018

Ah ok I see this now re remote close. We could also shuffle the other spots we populate this information into this package. Don't consider it a blocker though.

Roasbeef (Member) commented Nov 21, 2018

Final lingering comments are re whether we should have a lower backoff, and also whether a minor refactor is necessary or not.

@@ -2096,6 +2101,22 @@ func serializeChannelCloseSummary(w io.Writer, cs *ChannelCloseSummary) error {
}
}

// Write the channel sync message, if present.

cfromknecht (Collaborator) Nov 21, 2018

Seems equivalent to:

	// Write whether the channel sync message is present.
	if err := WriteElements(w, cs.LastChanSyncMsg != nil); err != nil {
		return err
	}
	// Write the channel sync message, if present.
	if cs.LastChanSyncMsg != nil {
		if err := WriteElements(w, cs.LastChanSyncMsg); err != nil {
			return err
		}
	}

halseth (Collaborator) Nov 21, 2018

YES! Very nice :)

halseth (Collaborator) Nov 21, 2018

Also did the same change for RemoteNextRevocation


// When Dave comes online, he will reconnect to Carol, try to resync
// the channel, but it will already be closed. Carol should resend the
// information Dave needs to sweep his funds.

cfromknecht (Collaborator) Nov 21, 2018

👍

@@ -3446,27 +3445,27 @@ func (lc *LightningChannel) ProcessChanSyncMsg(
 // it.
 // 3. We didn't get the last RevokeAndAck message they sent, so they'll
 // re-send it.
-func (lc *LightningChannel) ChanSyncMsg() (*lnwire.ChannelReestablish, error) {
+func ChanSyncMsg(c *channeldb.OpenChannel) (*lnwire.ChannelReestablish, error) {

cfromknecht (Collaborator) Nov 21, 2018

should this have a mutex around it? esp now that this is called in more places than just the switch

also seems like something that should be implemented in channeldb as a method of OpenChannel, since it is otherwise unrelated to anything in lnwallet. would be nice to keep the .ChanSyncMsg() notation, and not use a package level function

halseth (Collaborator) Nov 21, 2018

Good idea, added mutex.

The reason I didn't make this a method on OpenChannel is that it calls ComputeCommitmentPoint, which is a method from lnwallet, leading to an import cycle. We could however move this method to a new package (lnutil? 😛) if we want this, but I felt it was not worth it.

Also moving it to channeldb felt wrong 🤔

// migrateOptionalChannelCloseSummaryFields migrates the serialized format of
// ChannelCloseSummary to a format where optional fields' presence is indicated
// with boolean markers.
func migrateOptionalChannelCloseSummaryFields(tx *bolt.Tx) error {

cfromknecht (Collaborator) Nov 21, 2018

would be nice to have a unit test for this migration, asserting it behaves properly in all four cases

halseth (Collaborator) Nov 21, 2018

Migration test added! (only three cases though?)

@halseth halseth force-pushed the halseth:data-loss-protect-resending branch from 4124c6d to 1d1ce73 Nov 21, 2018

halseth added some commits Nov 20, 2018

lnd test: refactor testDataLossProtection
This extracts part of the test into a new helper method timeTravel,
which can be used to easily reset a node back to a state where channel
state is lost.
channeldb/legacy_serialization: add deserializeCloseChannelSummaryV6
This commit adds a new file legacy_serialization.go, where a copy of the
current deserializeCloseChannelSummary is made, called
deserializeCloseChannelSummaryV6.

The rationale is to keep old deserialization code around to be used
during migration, as it is hard to maintain compatibility with the old
format while changing the code in use.

@halseth halseth force-pushed the halseth:data-loss-protect-resending branch from 1d1ce73 to 96bb63e Nov 21, 2018

halseth added some commits Nov 20, 2018

@halseth halseth force-pushed the halseth:data-loss-protect-resending branch from 96bb63e to 56ea47b Nov 21, 2018

halseth added some commits Nov 20, 2018

channeldb/channel: add LastChanSync field to CloseChannelSummary
This commit adds an optional field LastChanSyncMsg to the
CloseChannelSummary, which will be used to save the ChannelReestablish
message for the channel at the point of channel close.
lnwallet+link: make ChanSyncMsg take channel state as arg
This lets us get the channel reestablish message without creating the LightningChannel struct first.
peer: define resendChanSyncMsg
This method is used to fetch channel sync messages for closed channels
from the db, and respond to the peer.
channeldb/db: define FetchClosedChannelForID
FetchClosedChannelForID is used to find the channel close summary given
only a channel ID.
lnd test: add offline scenario to testDataLossProtection
This adds the scenario where a channel is closed while the node is
offline, the node loses state and comes back online. In this case the
node should attempt to resync the channel, and the peer should resend a
channel sync message for the closed channel, such that the node can
retrieve its funds.
chain_watcher: poll for commit point in case of failure
We poll the database for the channel commit point with an exponential
backoff. This is meant to handle the case where we are in the process of
handling a channel sync, and the case where we detect a channel close
and must wait for the peer to come online to start channel sync before
we can proceed.
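
A minimal sketch of what such an exponential-backoff polling loop can look like; waitForCommitPoint, fetchCommitPoint, the interval constants, the quit channel, and the logger are all illustrative names, not the actual chain_watcher code. Assumes "time", "errors", and "fmt" are imported.

func waitForCommitPoint(chanState *channeldb.OpenChannel,
	quit <-chan struct{}) (*btcec.PublicKey, error) {

	backoff := minCommitPointPollInterval
	for i := 0; i < maxCommitPointPollAttempts; i++ {
		// Try to read the commit point the peer should have handed us
		// during channel sync.
		commitPoint, err := fetchCommitPoint(chanState)
		if err == nil {
			return commitPoint, nil
		}

		log.Warnf("Unable to fetch commit point, retrying in %v: %v",
			backoff, err)

		// Wait before the next attempt, doubling the delay up to a cap.
		select {
		case <-time.After(backoff):
		case <-quit:
			return nil, errors.New("shutting down")
		}

		backoff *= 2
		if backoff > maxCommitPointPollInterval {
			backoff = maxCommitPointPollInterval
		}
	}
	return nil, fmt.Errorf("commit point not found after %d attempts",
		maxCommitPointPollAttempts)
}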

@halseth halseth force-pushed the halseth:data-loss-protect-resending branch from 56ea47b to a9bd610 Nov 21, 2018

// ChannelCloseSummary.
//
// NOTE: deprecated, only for migration.
func deserializeCloseChannelSummaryV6(r io.Reader) (*ChannelCloseSummary, error) {

halseth (Collaborator) Nov 21, 2018

This was after an idea by @joostjager. Lmk what you think :)

halseth (Collaborator) commented Nov 21, 2018

Comments addressed and commits squashed. PTAL.

Roasbeef (Member) left a comment

LGTM 🧩

Have been running this on all my nodes (mainnet+testnet) the past few days, and haven't run into any issues. Migration went smoothly, but ofc I haven't (yet) run into any instances where my node needed to retransmit the last chan sync message.

@Roasbeef Roasbeef modified the milestones: 0.5.2, 0.5.1 Nov 26, 2018

@Roasbeef Roasbeef merged commit 42c4597 into lightningnetwork:master Nov 26, 2018

1 of 2 checks passed

coverage/coveralls Coverage decreased (-0.03%) to 56.005%
continuous-integration/travis-ci/pr The Travis CI build passed