
Data loss protect resending #1937

Merged
Roasbeef merged 18 commits into lightningnetwork:master from the data-loss-protect-resending branch on Nov 26, 2018

Conversation

@halseth
Collaborator

@halseth halseth commented Sep 19, 2018

This PR stores the latest channel reestablishment message as part of the channel close summary, such that it can be sent even after a channel has been closed.

This makes it possible for nodes to recover in the case where they have been offline, lost state, and come back online, attempting to resync the channel.

Note: This PR contains a DB migration. To ensure the deserialization code is not changed in the future, breaking old migrations, a versioned copy of the current deserialization logic is added to the file legacy_serialization.go. A backwards-compatible addition of the new field turned out to be complex because of the presence of optional fields (see the now removed commit 14648b2).

TODO:

  • Unit test for FetchClosedChannelForID
// Check if we have a channel sync message to read.
var hasChanSyncMsg bool
err = ReadElements(r, &hasChanSyncMsg)
if err == io.EOF {
Collaborator Author

@halseth halseth Sep 19, 2018

Note: we could write the false marker here as part of the migration to avoid this EOF check.

Collaborator Author

@halseth halseth Sep 19, 2018

Currently the addition of the LastChanSyncMsg field is kept in its own commit to show how new fields can easily be added after the presence flags are introduced in the serialization.

Member

@Roasbeef Roasbeef Nov 20, 2018

Why are we mixing the two approaches? We should either use EOF everywhere (so no migration) or the bool based approach.

Member

@Roasbeef Roasbeef Nov 20, 2018

Ah scratch that, I see the usage now

Collaborator Author

@halseth halseth Nov 20, 2018

This is how you would add new fields in the future without a migration.
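For illustration, a sketch of what such a future addition could look like at the end of the deserialization code (NewField and its marker are hypothetical, not part of this PR; c and r are the summary and reader from the surrounding code):

// Hypothetical future addition: append another marker+field pair at the
// end of deserializeCloseChannelSummary. Databases written before the
// field existed simply end here, so io.EOF on the marker means "absent"
// and no migration is needed.
var hasNewField bool
switch err := ReadElements(r, &hasNewField); {
case err == io.EOF:
	return c, nil
case err != nil:
	return nil, err
}
if hasNewField {
	if err := ReadElements(r, &c.NewField); err != nil {
		return nil, err
	}
}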

if hasChanSyncMsg {
// We must pass in reference to a lnwire.Message for the codec
// to support it.
var msg lnwire.Message
Collaborator Author

@halseth halseth Sep 19, 2018

#golang pop quiz: Why doesn't this work?

var msg *lnwire.ChannelReestablish
if err := ReadElements(r, &msg); err != nil {
		return nil, err
}
c.LastChanSyncMsg = msg

Collaborator

@cfromknecht cfromknecht Sep 21, 2018

I'm guessing that **lnwire.ChannelReestablish and *lnwire.Message are distinct concrete types, and the type switch doesn't try to determine whether the pointed-to *lnwire.ChannelReestablish implements the pointed-to lnwire.Message interface

Collaborator Author

@halseth halseth Sep 25, 2018

Yes, I think that's right. I think the type switch is checking whether the passed interface{} is of concrete type *lnwire.Message, not whether it satisfies the interface. TIL you cannot do that: https://play.golang.org/p/SmL37Hobvvs
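A self-contained toy version of the quiz, with made-up types standing in for the lnwire ones and a readElement that mimics a codec built around a type switch:

package main

import "fmt"

// Message and ChannelReestablish stand in for lnwire.Message and
// *lnwire.ChannelReestablish.
type Message interface{ MsgType() string }

type ChannelReestablish struct{}

func (c *ChannelReestablish) MsgType() string { return "channel_reestablish" }

// readElement mimics a codec that type-switches on the supplied pointer.
func readElement(element interface{}) error {
	switch e := element.(type) {
	case *Message:
		// Only a pointer-to-interface lands here.
		*e = &ChannelReestablish{}
		return nil
	default:
		return fmt.Errorf("unhandled element type %T", element)
	}
}

func main() {
	// Works: &msg has type *Message, which the codec knows about.
	var msg Message
	fmt.Println(readElement(&msg)) // <nil>

	// Fails: &reest has type **ChannelReestablish, a distinct concrete
	// type. The type switch matches concrete types only; it never asks
	// whether the pointed-to value would satisfy Message.
	var reest *ChannelReestablish
	fmt.Println(readElement(&reest)) // unhandled element type **main.ChannelReestablish
}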

@halseth halseth added this to the 0.5.1 milestone Sep 19, 2018
@halseth halseth removed this from the 0.5.1 milestone Sep 20, 2018
@halseth halseth added this to the 0.5.2 milestone Sep 20, 2018
Member

@Roasbeef Roasbeef left a comment

Nice work! This, along with static channel backups (incoming), really wraps up the set of items we need in place in lnd to cover all the bases we (currently) can as far as channel recovery. Most of my comments in this initial sweep concern the new migration and whether it's actually needed or not. Will do another pass in the tests, and also test this out a bit locally myself as well.

@@ -2094,7 +2094,12 @@ func serializeChannelCloseSummary(w io.Writer, cs *ChannelCloseSummary) error {
// If this is a close channel summary created before the addition of
// the new fields, then we can exit here.
if cs.RemoteCurrentRevocation == nil {
return nil
return WriteElements(w, false)
Member

@Roasbeef Roasbeef Nov 20, 2018

This is to indicate if optional fields are there or not? Why's this better (in terms of set of changes) than just attempting to read where we know fields will be appended? May be worth even just going to something more future proof here.

Collaborator Author

@halseth halseth Nov 20, 2018

The problem with omitting optional fields is that it only works for the last field. If there is more than one optional field, you won't know which field the data that is present represents. This way we attempt to make it future-proof.

By adding booleans to indicate presence, we can make sure old code will still read the new format (it won't attempt to read any extra data it doesn't understand) and that new code will read the old format (the booleans indicate which fields are present, so it knows which fields the old format lacks).
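A minimal sketch of the presence-marker scheme being described, using the same channeldb WriteElements/ReadElements helpers that appear in the diffs above (the serializeOptionalMsg/deserializeOptionalMsg names are invented for illustration):

// serializeOptionalMsg writes a boolean marker followed by the message,
// but only when the message is present.
func serializeOptionalMsg(w io.Writer, msg lnwire.Message) error {
	if err := WriteElements(w, msg != nil); err != nil {
		return err
	}
	if msg != nil {
		return WriteElements(w, msg)
	}
	return nil
}

// deserializeOptionalMsg checks the marker before reading the field, so
// the reader always knows whether the bytes that follow belong to this
// field or to the next one.
func deserializeOptionalMsg(r io.Reader) (lnwire.Message, error) {
	var present bool
	if err := ReadElements(r, &present); err != nil {
		return nil, err
	}
	if !present {
		return nil, nil
	}
	var msg lnwire.Message
	if err := ReadElements(r, &msg); err != nil {
		return nil, err
	}
	return msg, nil
}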

Member

@Roasbeef Roasbeef Nov 21, 2018

Yep, realized this later that day. I think in the future, we should do a clean sweep to just move to a TLV format everywhere. You read it all, then at the end check the array/map for the keys you actually know of then parse that. With that, we gain a way to handle an arbitrary number of fields in the future in a uniform manner throughout the codebase.

Collaborator Author

@halseth halseth Nov 21, 2018

ooooh, yeah that would be nice!
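A rough sketch of the read-everything-then-pick-known-keys idea (fixed-width uint16 type/length fields for brevity; this is not how lnd's actual TLV encoding later ended up, and it assumes the encoding/binary and io imports):

// parseTLVStream reads every type-length-value record in the stream and
// returns them keyed by type. A reader can then look up the types it
// knows about and silently ignore the rest, so new fields never break
// old readers.
func parseTLVStream(r io.Reader) (map[uint16][]byte, error) {
	records := make(map[uint16][]byte)
	for {
		var hdr [4]byte
		if _, err := io.ReadFull(r, hdr[:]); err == io.EOF {
			// Clean end of stream: return what we collected.
			return records, nil
		} else if err != nil {
			return nil, err
		}
		typ := binary.BigEndian.Uint16(hdr[0:2])
		length := binary.BigEndian.Uint16(hdr[2:4])
		value := make([]byte, length)
		if _, err := io.ReadFull(r, value); err != nil {
			return nil, err
		}
		records[typ] = value
	}
}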

if cs.RemoteNextRevocation == nil {
return nil
if err := WriteElements(w, false); err != nil {
Member

@Roasbeef Roasbeef Nov 20, 2018

This seems to break the current serialization more than necessary. The minimal patch set would be to append the chan reest we need and nothing more.

Collaborator Author

@halseth halseth Nov 20, 2018

Won't work for the case where cs.RemoteNextRevocation == nil and chanSyncMsg != nil, as the deserialization code will attempt to read the chanSyncMsg as RemoteNextRevocation.

@@ -830,11 +830,11 @@ func newChanMsgStream(p *peer, cid lnwire.ChannelID) *msgStream {
fmt.Sprintf("Update stream for ChannelID(%x) exiting", cid[:]),
1000,
func(msg lnwire.Message) {
_, isChanSycMsg := msg.(*lnwire.ChannelReestablish)
_, isChanSyncMsg := msg.(*lnwire.ChannelReestablish)
Member

@Roasbeef Roasbeef Nov 20, 2018

;)

@@ -573,3 +573,40 @@ func migratePruneEdgeUpdateIndex(tx *bolt.Tx) error {

return nil
}

// migrateOptionalChannelCloseSummaryFields migrates the serialized format of
Member

@Roasbeef Roasbeef Nov 20, 2018

See comments re if this is really worth yet another migration.

Collaborator Author

@halseth halseth Nov 20, 2018

I first attempted to do this without a migration, but it turned out to be complex: 14648b2

Member

@Roasbeef Roasbeef Nov 21, 2018

Ah ok, yeah that's pretty convoluted.

@@ -7198,6 +7198,85 @@ func testDataLossProtection(net *lntest.NetworkHarness, t *harnessTest) {

assertNodeNumChannels(t, ctxb, dave, 0)
assertNodeNumChannels(t, ctxb, carol, 0)

Member

@Roasbeef Roasbeef Nov 20, 2018

Perhaps this should be split out into another test, or a sub-test? As is, this test is nearly 400 lines. Not a blocker though, just something we should start to think about re the growth of the integration tests in the future.

Member

@Roasbeef Roasbeef Nov 20, 2018

Least blocking route would be a comment that delineates this new testing scenario.

Collaborator Author

@halseth halseth Nov 20, 2018

Improved the comments.

Agreed that some of the tests would benefit from being split up. I think this might cause our integration tests to run even longer though, because of the added overhead. Might be worth looking into in combination with running integration tests in parallel (perhaps using containers?).

Member

@Roasbeef Roasbeef Nov 21, 2018

👍

Yeah, we'll need to whiteboard out a better architecture as we add more backends, and also eventually start targeting platforms other than macOS.

contractcourt/chain_watcher.go (resolved)
} else {
closeSummary.LastChanSyncMsg = chanSync
}

Member

@Roasbeef Roasbeef Nov 20, 2018

Unrelated to this PR, but shouldn't the BRAR be the one that's making this DB state here? Re the whole reliable hand off thing.

Collaborator

@cfromknecht cfromknecht Nov 20, 2018

the brar writes the retributions to disk, then acks back. after that the chain watcher knows it's safe to mark the channel as pending closed

Collaborator Author

@halseth halseth Nov 20, 2018

We could let the BRAR have that responsibility, as we could get rid of ACK logic.

Collaborator

@cfromknecht cfromknecht Nov 21, 2018

the cleaner solution IMO is just to write resolutions in CloseChannel. then there is no need for a reliable handoff. the resolutions can be passed in memory to kick off the process, but if we crash the brar would look for all breached channels in the db and load resolutions from there

lnd_test.go (outdated, resolved)
peer.go (resolved)
// point, there's not much we can do other than wait
// for us to retrieve it. We will attempt to retrieve
// it from the peer each time we connect to it.
// TODO(halseth): actively initiate re-connection to
Member

@Roasbeef Roasbeef Nov 20, 2018

This can eventually be hooked up into the channel life cycle management stuff. So to request a notification once a new node connects. Actually we already have this for use in the funding manager, so perhaps it can be used here?

Collaborator

@cfromknecht cfromknecht Nov 20, 2018

yeah i think we can use NotifyWhenOnline to just try if they ever reconnect

Collaborator Author

@halseth halseth Nov 20, 2018

I actually like the current polling more, since we might need some time to process the ChannelReestablish after they reconnect. In that case the first lookup after they connect might fail, so we have to wait a little bit anyway.

If we don't wait before retrying at that point, we risk that the peer comes online, sends us the ChannelReestablish, and disconnects, and we will stall here waiting for them to connect again.

Collaborator

@cfromknecht cfromknecht Nov 21, 2018

good point, leaving this decoupled from the peer lifecycle makes sense to me then
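A sketch of the polling pattern being discussed (hypothetical helper and signature, not the chain watcher's actual code; assumes the btcec, time, and errors imports):

// waitForCommitPoint polls lookup with exponential backoff until it
// returns a commitment point, or until quit is closed. This covers both
// the "peer still processing our channel reestablish" case and the
// "waiting for the peer to come back online" case.
func waitForCommitPoint(lookup func() (*btcec.PublicKey, error),
	quit <-chan struct{}) (*btcec.PublicKey, error) {

	backoff := time.Second
	const maxBackoff = 5 * time.Minute

	for {
		if point, err := lookup(); err == nil {
			return point, nil
		}

		select {
		case <-time.After(backoff):
		case <-quit:
			return nil, errors.New("shutting down")
		}

		// Double the wait between attempts, up to a cap.
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}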

lnd_test.go (outdated, resolved)
@halseth halseth force-pushed the data-loss-protect-resending branch 8 times, most recently from a34d016 to 4124c6d Nov 20, 2018
@halseth
Collaborator Author

@halseth halseth commented Nov 20, 2018

Comments addressed, PTAL 🐶

channeldb/db_test.go (resolved)
contractcourt/chain_watcher.go (outdated, resolved)
contractcourt/chain_watcher.go (resolved)
return &commitmentChain{
commitments: list.New(),
startingHeight: initialHeight,
commitments: list.New(),
Member

@Roasbeef Roasbeef Nov 21, 2018

Where do we set the new starting height of a commitment chain? Don't see why this change is necessary atm. There're other spots where we depend on the height being set properly.

Collaborator Author

@halseth halseth Nov 21, 2018

We never read it, we just look at the height of the commitment in the chain.

Agreed that the change is not directly relevant to this PR, I just noticed when transforming the ChanSync method to not being dependent on the chain heights.

Removed and made a separate PR: #2206

@@ -5113,6 +5112,15 @@ func NewUnilateralCloseSummary(chanState *channeldb.OpenChannel, signer Signer,
LocalChanConfig: chanState.LocalChanCfg,
}

// Attempt to add a channel sync message to the close summary.
chanSync, err := ChanSyncMsg(chanState)
Member

@Roasbeef Roasbeef Nov 21, 2018

Ah ok I see this now re remote close. We could also shuffle the other spots we populate this information into this package. Don't consider it a blocker though.

@Roasbeef
Member

@Roasbeef Roasbeef commented Nov 21, 2018

Final lingering comments are re whether we should have a lower backoff, and also whether a minor refactor is necessary or not.

@@ -2096,6 +2101,22 @@ func serializeChannelCloseSummary(w io.Writer, cs *ChannelCloseSummary) error {
}
}

// Write the channel sync message, if present.
Collaborator

@cfromknecht cfromknecht Nov 21, 2018

Seems equivalent to:

	// Write whether the channel sync message is present.
	if err := WriteElements(w, cs.LastChanSyncMsg != nil); err != nil {
		return err
	}
	// Write the channel sync message, if present.
	if cs.LastChanSyncMsg != nil {
		if err := WriteElements(w, cs.LastChanSyncMsg); err != nil {
			return err
		}
	}

Collaborator Author

@halseth halseth Nov 21, 2018

YES! Very nice :)

Collaborator Author

@halseth halseth Nov 21, 2018

Also did the same change for RemoteNextRevocation
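Presumably the analogous shape for RemoteNextRevocation looks something like this (a sketch of the pattern, not the exact diff):

	// Write whether the remote next revocation is present.
	if err := WriteElements(w, cs.RemoteNextRevocation != nil); err != nil {
		return err
	}

	// Write the remote next revocation, if present.
	if cs.RemoteNextRevocation != nil {
		if err := WriteElements(w, cs.RemoteNextRevocation); err != nil {
			return err
		}
	}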

peer.go (outdated, resolved)
contractcourt/chain_watcher.go (outdated, resolved)
contractcourt/chain_watcher.go (resolved)

// When Dave comes online, he will reconnect to Carol, try to resync
// the channel, but it will already be closed. Carol should resend the
// information Dave needs to sweep his funds.
Collaborator

@cfromknecht cfromknecht Nov 21, 2018

👍

channeldb/db.go (outdated, resolved)
@@ -3446,27 +3445,27 @@ func (lc *LightningChannel) ProcessChanSyncMsg(
// it.
// 3. We didn't get the last RevokeAndAck message they sent, so they'll
// re-send it.
func (lc *LightningChannel) ChanSyncMsg() (*lnwire.ChannelReestablish, error) {
func ChanSyncMsg(c *channeldb.OpenChannel) (*lnwire.ChannelReestablish, error) {
Collaborator

@cfromknecht cfromknecht Nov 21, 2018

should this have a mutex around it? esp now that this is called in more places than just the switch

also seems like something that should be implemented in channeldb as a method of OpenChannel, since it is otherwise unrelated to anything in lnwallet. would be nice to keep the .ChanSyncMsg() notation, and not use a package level function

Collaborator Author

@halseth halseth Nov 21, 2018

Good idea, added mutex.

The reason I didn't make this a method on OpenChannel is that it calls ComputeCommitmentPoint, which is a method from lnwallet, leading to an import cycle. We could however move this method to a new package (lnutil? 😛) if we want this, but I felt it was not worth it.

Also moving it to channeldb felt wrong 🤔

// migrateOptionalChannelCloseSummaryFields migrates the serialized format of
// ChannelCloseSummary to a format where optional fields' presence is indicated
// with boolean markers.
func migrateOptionalChannelCloseSummaryFields(tx *bolt.Tx) error {
Collaborator

@cfromknecht cfromknecht Nov 21, 2018

would be nice to have a unit test for this migration, asserting it behaves properly in all four cases

Collaborator Author

@halseth halseth Nov 21, 2018

Migration test added! (only three cases though?)

@halseth halseth force-pushed the data-loss-protect-resending branch from 4124c6d to 1d1ce73 Nov 21, 2018
halseth added 4 commits Nov 21, 2018
This extracts part of the test into a new helper method timeTravel,
which can be used to easily reset a node back to a state where channel
state is lost.
This commit adds a new file legacy_serialization.go, where a copy of the
current deserializeCloseChannelSummary is made, called
deserializeCloseChannelSummaryV6.

The rationale is to keep the old deserialization code around to be used
during the migration, as it is hard to maintain compatibility with the
old format while changing the code in use.
@halseth halseth force-pushed the data-loss-protect-resending branch from 1d1ce73 to 96bb63e Nov 21, 2018
@halseth halseth force-pushed the data-loss-protect-resending branch from 96bb63e to 56ea47b Nov 21, 2018
halseth added 9 commits Nov 21, 2018
This commit adds an optional field LastChanSyncMsg to the
CloseChannelSummary, which will be used to save the ChannelReestablish
message for the channel at the point of channel close.
This lets us get the channel reestablish message without creating the LightningChannel struct first.
FetchClosedChannelForID is used to find the channel close summary given
only a channel ID.
This method is used to fetch channel sync messages for closed channels
from the db, and respond to the peer.
We poll the database for the channel commit point with an exponential
backoff. This is meant to handle the case where we are in the process of
handling a channel sync, and the case where we detect a channel close
and must wait for the peer to come online to start channel sync before
we can proceed.
This adds the scenario where a channel is closed while the node is
offline, the node loses state and comes back online. In this case the
node should attempt to resync the channel, and the peer should resend a
channel sync message for the closed channel, such that the node can
retrieve its funds.
@halseth halseth force-pushed the data-loss-protect-resending branch from 56ea47b to a9bd610 Nov 21, 2018
// ChannelCloseSummary.
//
// NOTE: deprecated, only for migration.
func deserializeCloseChannelSummaryV6(r io.Reader) (*ChannelCloseSummary, error) {
Collaborator Author

@halseth halseth Nov 21, 2018

This was after an idea by @joostjager. Lmk what you think :)

@halseth
Collaborator Author

@halseth halseth commented Nov 21, 2018

Comments addressed and commits squashed. PTAL.

Member

@Roasbeef Roasbeef left a comment

LGTM 🧩

Have been running this on all my nodes (mainnet+testnet) the past few days, and haven't run into any issues. The migration went smoothly, but of course I haven't (yet) run into any instances where my node needed to retransmit the last chan sync message.

@Roasbeef Roasbeef removed this from the 0.5.2 milestone Nov 26, 2018
@Roasbeef Roasbeef added this to the 0.5.1 milestone Nov 26, 2018
@Roasbeef Roasbeef merged commit 42c4597 into lightningnetwork:master Nov 26, 2018
1 of 2 checks passed