multi: implement new safe static channel backup and recovery scheme, RPCs, and cli commands #2313

Merged
merged 46 commits into from Apr 1, 2019

7 participants
@Roasbeef
Member

Roasbeef commented Dec 11, 2018

Overview

In this PR, we implement a new safe scheme for static channel backups (SCBs) for lnd. We say safe, as care has been taken to ensure that there are no foot guns in this method of backing up channels, vs doing things like rsyncing or copying the channel.db file periodically. Those methods can be dangerous, as one never knows whether the copy holds the latest state of a channel or not. Instead, we aim to provide a simple, safe scheme that allows users to recover the settled funds in their channels in the case of partial or complete data loss. The backups themselves are encrypted using a key derived from the user's seed. This way we protect the privacy of the user's channels in the backup state, and ensure that a random node can't attempt to import another user's channels.

Once this PR is merged, given their seed and the latest backup file, the user will be able to recover both their on-chain funds, and also funds that are fully settled within their channels. By "fully settled" we mean funds that are in the base commitment outputs, and not HTLCs. We can only restore these funds because all the data required to back them up is available right after the channel is created. In contrast, in order to resolve HTLCs, we would also need to update the backup state with each new channel update, which is tricky to do without additional infrastructure. This infrastructure will be built out in the near future, but until then we have this scheme, which will also be a fallback in the scenario that any higher level mechanisms fail.

At a later point, we also plan to propose this backup scheme as an addition to the spec, as even with the change to make the "to self" outputs static, we still need this SCB information in order to restore user funds. Additionally, the current serialization format is a bit up in the air. Atm, we use the same "codec" as we do within the wire protocol for the BOLT specs. However, we'll likely move to a TLV (type-length-value) format, as it's extremely flexible and allows us to add/remove fields in the future once we gain new channel types, or modifications are made in the protocol that warrant a change to the backup format. Most importantly, if aezeed and this chanbackup scheme are added to the spec, then it will be possible to write a simple program that, given a seed+backup from any of the implementations, will be able to recover all funds (sweep them to an address) and then shut down.

Recovery Flow

Skipping the backup flow for a second, given their 24-word aezeed seed and a special channels.backup file, the recovery flow would be something like the following:

  1. The user uses lncli create or the gRPC WalletUnlocker.Init call to input their seed and fully serialized backups.

    1. Alternatively, if they already have a new node set up, they can use the cli and RPC commands to import channels one at a time, or the entire file.
  2. lnd boots up and the wallet performs a rescan from the wallet's birthday (encoded in their aezeed) to restore all on-chain funds. Once this process is complete, the main lnd server will start up.

  3. Given the set of channels to recover, the server will then (using the new chanbackup package) insert a series of "channel shells" into the database. These contain only the information required to initiate the DLP (data loss protection) protocol and nothing more. As a result, they're marked as "recovered" channels in the database, and we'll disallow trying to use them for any other process.

  4. Once the channel is recovered, the chanbackup package will attempt to insert a LinkNode that contains all prior addresses that we were able to reach the peer at. During the process, we'll also insert the edge for that channel (only our outgoing direction) into the database as well.

  5. lnd will then start up, and as usual attempt to establish connections to all peers that we have channels open with.

  6. Once we connect with a peer, we'll then initiate the DLP protocol. The remote peer will discover that we've lost data, and then immediately force close their channel. Before they do though, they'll send over their latest unrevoked commitment point which we need to derive keys (will be fixed in BOLT 1.1 by making the key static) to sweep our funds.

  7. Once the commitment transaction confirms, given information within the SCB we'll re-derive all keys we need, and then sweep the funds.
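
Step 3 above can be pictured with a toy Go model. This is only a sketch under stated assumptions, not code from this PR: the `channelShell` type, its fields, and the `checkUsable` and `restoreShells` helpers are all hypothetical names, and the real database record carries far more data. The only point is that a shell restored from a backup is flagged as recovered and every non-DLP code path refuses to touch it.

```go
package main

import (
	"errors"
	"fmt"
)

// channelShell is a hypothetical, stripped-down channel record. It holds just
// enough to identify the channel, plus the flag that marks it as restored.
type channelShell struct {
	ChanPoint string
	Recovered bool
}

// errRecoveredChannel is returned whenever a recovered shell is used for
// anything other than the data loss protection hand-off.
var errRecoveredChannel = errors.New("channel was restored from a backup and can only be used for DLP")

// checkUsable gates every non-DLP operation (sending, forwarding, local
// force close, ...) in this toy model.
func (c *channelShell) checkUsable() error {
	if c.Recovered {
		return errRecoveredChannel
	}
	return nil
}

// restoreShells turns the channel points found in a backup into shells that
// are marked as recovered from the moment they are created.
func restoreShells(chanPoints []string) []*channelShell {
	shells := make([]*channelShell, 0, len(chanPoints))
	for _, cp := range chanPoints {
		shells = append(shells, &channelShell{ChanPoint: cp, Recovered: true})
	}
	return shells
}

func main() {
	// Placeholder channel point, not a real outpoint.
	for _, shell := range restoreShells([]string{"<funding txid>:0"}) {
		fmt.Println(shell.ChanPoint, "->", shell.checkUsable())
	}
}
```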

Backup + Recovery Methods

This PR exposes multiple safe ways to back up and recover a channel. We expect only one of them to be used primarily by unsophisticated end users, but have provided other mechanisms for more advanced users and businesses that already script lnd via the gRPC system.

First, the easiest method for backup+recovery. After this PR, lnd will maintain a channels.backup file in the same location that we store all the other files. Users will at any time be able to safely copy and back up this file. Each time a channel is opened or closed, lnd will update this file with the latest channel state. Users can use scripts to detect changes to the file, and upload them to their backup location. Something like fsnotify can notify a script each time the file changes, so it can be backed up once again. The file is encrypted using an AEAD scheme, so it can safely be stored plainly in cloud storage, your SD card, etc. The file uses a special format and can be imported via any of the recovery methods described below.
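
A script that reacts to changes in channels.backup could be as small as the following Go sketch. It makes a few assumptions that are not part of this PR: it compares SHA-256 digests instead of using fsnotify (to stay within the standard library), the destination is just another local path, and `copyIfChanged` and `runDemo` are made-up helper names. A real script would call `copyIfChanged` from a ticker loop or an fsnotify event handler and upload the copy wherever backups live.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// copyIfChanged copies src to dst when the SHA-256 of src differs from last.
// It returns the digest to remember for the next call and whether a copy
// was made.
func copyIfChanged(src, dst string, last [32]byte) ([32]byte, bool, error) {
	data, err := os.ReadFile(src)
	if err != nil {
		return last, false, err
	}
	sum := sha256.Sum256(data)
	if sum == last {
		return last, false, nil
	}

	// Write to a temporary file and rename it, so a reader of dst never
	// observes a half-written backup.
	tmp := dst + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return last, false, err
	}
	if err := os.Rename(tmp, dst); err != nil {
		return last, false, err
	}
	return sum, true, nil
}

// runDemo exercises copyIfChanged twice against an unchanged file inside a
// throwaway directory and reports whether each call copied.
func runDemo() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "scb-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "channels.backup")
	dst := filepath.Join(dir, "channels.backup.copy")
	if err := os.WriteFile(src, []byte("stand-in for an encrypted backup"), 0o600); err != nil {
		return false, false, err
	}

	var last [32]byte
	last, first, err = copyIfChanged(src, dst, last)
	if err != nil {
		return first, second, err
	}
	_, second, err = copyIfChanged(src, dst, last)
	return first, second, err
}

func main() {
	first, second, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("first call copied:", first)
	fmt.Println("second call copied:", second)
}
```

Because the file is already encrypted, the copy itself needs no extra protection beyond what the storage location requires.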

The second mechanism is via the new SubscribeChanBackups streaming gRPC method. Each time a channel is opened or closed, you'll get a new notification with all the chanbackup.Single files (described below), and a single chanbackup.Multi that contains all the information for all channels.

Finally, users are able to request a backup of a single channel, or all the channels, via the cli and RPC methods. Here are a few ways users can obtain backups; see the PR for full details:

⛰ lncli --network=simnet exportchanbackup --chan_point=29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0
{
    "chan_point": "29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0",
    "chan_backup": "02e7b423c8cf11038354732e9696caff9d5ac9720440f70a50ca2b9fcef5d873c8e64d53bdadfe208a86c96c7f31dc4eb370a02631bb02dce6611c435753a0c1f86c9f5b99006457f0dc7ee4a1c19e0d31a1036941d65717a50136c877d66ec80bb8f3e67cee8d9a5cb3f4081c3817cd830a8d0cf851c1f1e03fee35d790e42d98df5b24e07e6d9d9a46a16352e9b44ad412571c903a532017a5bc1ffe1369c123e1e17e1e4d52cc32329aa205d73d57f846389a6e446f612eeb2dcc346e4590f59a4c533f216ee44f09c1d2298b7d6c"
}

⛰ lncli --network=simnet exportchanbackup --all
{
    "chan_points": [
        "29be6d259dc71ebdf0a3a0e83b240eda78f9023d8aeaae13c89250c7e59467d5:0"
    ],
    "multi_chan_backup": "fd73e992e5133aa085c8e45548e0189c411c8cfe42e902b0ee2dec528a18fb472c3375447868ffced0d4812125e4361d667b7e6a18b2357643e09bbe7e9110c6b28d74f4f55e7c29e92419b52509e5c367cf2d977b670a2ff7560f5fe24021d246abe30542e6c6e3aa52f903453c3a2389af918249dbdb5f1199aaecf4931c0366592165b10bdd58eaf706d6df02a39d9323a0c65260ffcc84776f2705e4942d89e4dbefa11c693027002c35582d56e295dcf74d27e90873699657337696b32c05c8014911a7ec8eb03bdbe526fe658be8abdf50ab12c4fec9ddeefc489cf817721c8e541d28fbe71e32137b5ea066a9f4e19814deedeb360def90eff2965570aab5fedd0ebfcd783ce3289360953680ac084b2e988c9cbd0912da400861467d7bb5ad4b42a95c2d541653e805cbfc84da401baf096fba43300358421ae1b43fd25f3289c8c73489977592f75bc9f73781f41718a752ab325b70c8eb2011c5d979f6efc7a76e16492566e43d94dbd42698eb06ff8ad4fd3f2baabafded"
}

⛰ lncli --network=simnet exportchanbackup --all --output_file=channels.backup

⛰ ll channels.backup
-rw-r--r--  1 roasbeef  staff   381B Dec  9 18:16 channels.backup
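
For scripts that consume the JSON printed above instead of using --output_file, the single-channel response could be decoded along these lines. This is a hedged sketch: the field names `chan_point` and `chan_backup` and the hex encoding are taken from the sample output above, while the `singleExport` type and `decodeSingleExport` helper are hypothetical, and the input in `main` is a truncated placeholder rather than a usable backup.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// singleExport mirrors the two fields shown in the sample
// `exportchanbackup --chan_point=...` output.
type singleExport struct {
	ChanPoint  string `json:"chan_point"`
	ChanBackup string `json:"chan_backup"`
}

// decodeSingleExport parses the JSON and hex-decodes the encrypted blob.
func decodeSingleExport(raw []byte) (string, []byte, error) {
	var out singleExport
	if err := json.Unmarshal(raw, &out); err != nil {
		return "", nil, err
	}
	blob, err := hex.DecodeString(out.ChanBackup)
	if err != nil {
		return "", nil, fmt.Errorf("chan_backup is not valid hex: %w", err)
	}
	return out.ChanPoint, blob, nil
}

func main() {
	// Placeholder input: only the first bytes of a backup, for illustration.
	raw := []byte(`{"chan_point": "<funding txid>:0", "chan_backup": "02e7b423c8cf"}`)

	chanPoint, blob, err := decodeSingleExport(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s: %d encrypted bytes\n", chanPoint, len(blob))
}
```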

Static Channel Backup Scheme

Crypto

For encryption, we utilize chacha20poly1305 with a random 24 byte nonce. We use a larger nonce size as this can be safely generated via a CSPRNG without fear of collisions between generated nonces. To encrypt a blob, we then use this nonce as the AD (associated data) and prepend the nonce to the front of the ciphertext package.

For key generation, in order to ensure the user only needs their passphrase and the backup file, we utilize the existing keychain to derive a private key. In order to ensure that we don't force any hardware signer to be aware of our crypto operations, we instead opt to utilize a public key that will be hashed to derive our private key. The assumption here is that this key will only be exposed to this software, and never derived as a public facing address.
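
As a rough illustration of the two paragraphs above, here is a minimal Go sketch of the blob layout: a key obtained by hashing a serialized public key, a random nonce passed as the associated data, and the nonce prepended to the ciphertext. It is not the code from this PR. The helper names (`deriveKey`, `newAEAD`, `encrypt`, `decrypt`) are made up, SHA-256 as the hash is an assumption, and AES-256-GCM from the standard library stands in for chacha20poly1305 (so the nonce is 12 bytes here rather than 24) purely to keep the example free of external dependencies.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// deriveKey hashes a serialized public key into a 32-byte symmetric key.
// Assumption: SHA-256 is used here only for illustration.
func deriveKey(pubKey []byte) [32]byte {
	return sha256.Sum256(pubKey)
}

// newAEAD builds the AEAD. AES-256-GCM is a stand-in for the
// chacha20poly1305 construction described in the text.
func newAEAD(key [32]byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext, with the nonce also bound as the
// associated data.
func encrypt(key [32]byte, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}

	// Start the output with a copy of the nonce and let Seal append to it.
	out := make([]byte, len(nonce), len(nonce)+len(plaintext)+aead.Overhead())
	copy(out, nonce)
	return aead.Seal(out, nonce, plaintext, nonce), nil
}

// decrypt splits the nonce off the front of the blob and authenticates both
// the ciphertext and the nonce (as associated data).
func decrypt(key [32]byte, blob []byte) ([]byte, error) {
	aead, err := newAEAD(key)
	if err != nil {
		return nil, err
	}
	if len(blob) < aead.NonceSize() {
		return nil, errors.New("blob shorter than nonce")
	}
	nonce, ciphertext := blob[:aead.NonceSize()], blob[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, nonce)
}

func main() {
	// Placeholder 33-byte "compressed public key"; a real one would come
	// from the wallet's keychain.
	pub := make([]byte, 33)
	pub[0] = 0x02

	key := deriveKey(pub)
	blob, err := encrypt(key, []byte("serialized channel backup"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, blob)
	if err != nil {
		panic(err)
	}
	fmt.Printf("blob is %d bytes, decrypted: %q\n", len(blob), plain)
}
```

Since the code only asks the AEAD for its `NonceSize()`, swapping in a construction with a 24-byte nonce would leave `encrypt` and `decrypt` unchanged.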

chanbackup.Single

The SCB contains all information required to initiate the data loss protection protocol once we restore the channel and connect to the remote channel peer.

The primary way outside callers will interact with this package is via the Pack and Unpack methods. Packing means writing a serialized+encrypted version of the SCB to an io.Writer. Unpacking does the opposite.

The encoding format itself uses the same encoding as we do on the wire within Lightning. Each encoded backup begins with a version so we can easily add or modify the serialization format in the future, if new channel types appear, or we need to add/remove fields. The backup contains:

  • The chain a channel belongs to.
  • The chanPoint of the channel.
  • The shortChanID of the channel.
  • The public key of the remote node.
  • The series of addresses that we can use to reach the node.
  • The CSV delay of the channel (required to later reconstruct our output script after BOLT 1.1)
  • A keychain.KeyLocator that allows us to re-derive the payment base point we need to sweep our funds.
  • A keychain.KeyDescriptor that we need in order to re-derive our shachain root to validate the information the remote party gives us during the DLP protocol. (see the next section for the complications that arose here)
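
To make the "version first, then fixed fields" idea concrete, here is a small Go sketch of a versioned encoding for a subset of the fields listed above. It is an assumption-laden illustration rather than the PR's actual format: the `single` and `keyLocator` types, their field names, and the big-endian layout are invented for the example; the remote node key, addresses, and shachain root descriptor are left out; and the encryption step that Pack performs is omitted.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// singleVersion is the leading version byte of this toy encoding.
const singleVersion byte = 0

// keyLocator is the (family, index) two-tuple used to re-derive a key.
type keyLocator struct {
	Family uint32
	Index  uint32
}

// single holds a subset of the backup fields, all of fixed size so that
// encoding/binary can serialize the struct directly.
type single struct {
	ChainHash      [32]byte
	FundingTxid    [32]byte
	FundingIndex   uint32
	ShortChanID    uint64
	CsvDelay       uint16
	PaymentBaseLoc keyLocator
}

// pack writes the version byte followed by the fields in declaration order.
func (s *single) pack() ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteByte(singleVersion)
	if err := binary.Write(&buf, binary.BigEndian, s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unpack checks the version byte before decoding the rest, which is what
// lets the format change later without breaking old readers silently.
func unpack(raw []byte) (*single, error) {
	if len(raw) == 0 {
		return nil, errors.New("empty backup")
	}
	if raw[0] != singleVersion {
		return nil, fmt.Errorf("unknown backup version %d", raw[0])
	}
	var s single
	if err := binary.Read(bytes.NewReader(raw[1:]), binary.BigEndian, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	// Made-up values, for illustration only.
	in := single{FundingIndex: 0, ShortChanID: 12345, CsvDelay: 144}
	in.PaymentBaseLoc = keyLocator{Family: 3, Index: 7}

	raw, err := in.pack()
	if err != nil {
		panic(err)
	}
	out, err := unpack(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("packed %d bytes, csv delay after round trip: %d\n", len(raw), out.CsvDelay)
}
```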

chanbackup.Multi

Multi is a series of static channel backups. This type of backup can contain ALL the channel backup state in a single packed blob. This is suitable for storing on your file system, cloud storage, etc. Systems will be in place within lnd to ensure that one can easily obtain the latest version of the Multi for the node, and also that it will be kept up to date if channel state changes.

Implementation Complications and Open Questions

The main complication that arose during the implementation was that I realized late in development that we also need to back up the details w.r.t how we derive our shachain root. We got a bit lucky here as we store the private key we use as the root, and not the public key itself. In order to derive the shachain roots, we use a special keychain.KeyFamily. However, we don't store the keychain.KeyLocator information, which is a two-tuple that allows us to derive a key w/o knowing the public key or having any state in the wallet. Instead, within the backup, we're forced to store the entire public key and not just the key locator information. As a result, I needed to modify keychain.SecretKeyRing.DerivePrivKey to support a brute force scan to allow us to derive the key. In the future, we'll want to do a migration to also store the key locator information so we don't need to always do this brute force. In order to ensure we don't scan to infinity if we don't actually know the public key, I've added a cap on the max number of iterations.
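
The capped brute force scan described above could look roughly like the following Go sketch. Everything here is a stand-in chosen to keep the example self-contained: `derivePriv` uses HMAC-SHA256 instead of real HD derivation, `pubFromPriv` uses a hash instead of elliptic curve point multiplication, and `maxKeyScan`, `keyLocator`, and `scanForKey` are hypothetical names with an arbitrary cap. Only the shape of the loop is meant to carry over: walk the indices of one key family, compare against the known public key, and give up after a fixed number of tries.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// maxKeyScan caps the scan so an unknown public key cannot make it run
// forever. The value is arbitrary for this sketch.
const maxKeyScan = 10000

// keyLocator is the (family, index) two-tuple that identifies a key.
type keyLocator struct {
	Family uint32
	Index  uint32
}

var errKeyNotFound = errors.New("no key in the scanned range matches the public key")

// derivePriv is a stand-in for HD derivation: HMAC-SHA256 over the locator,
// keyed by the seed.
func derivePriv(seed []byte, loc keyLocator) [32]byte {
	var msg [8]byte
	binary.BigEndian.PutUint32(msg[0:4], loc.Family)
	binary.BigEndian.PutUint32(msg[4:8], loc.Index)

	mac := hmac.New(sha256.New, seed)
	mac.Write(msg[:])

	var priv [32]byte
	copy(priv[:], mac.Sum(nil))
	return priv
}

// pubFromPriv is a stand-in for computing the public key of a private key.
func pubFromPriv(priv [32]byte) [32]byte {
	return sha256.Sum256(priv[:])
}

// scanForKey walks indices 0..maxKeyScan-1 of one family until the derived
// public key equals target.
func scanForKey(seed []byte, family uint32, target [32]byte) (keyLocator, [32]byte, error) {
	for i := uint32(0); i < maxKeyScan; i++ {
		loc := keyLocator{Family: family, Index: i}
		priv := derivePriv(seed, loc)
		if pubFromPriv(priv) == target {
			return loc, priv, nil
		}
	}
	return keyLocator{}, [32]byte{}, errKeyNotFound
}

func main() {
	seed := []byte("example seed, not a real aezeed")
	want := keyLocator{Family: 5, Index: 42} // made-up locator

	// Pretend only the public key was stored in the backup.
	target := pubFromPriv(derivePriv(seed, want))

	loc, _, err := scanForKey(seed, want.Family, target)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("found key at family=%d index=%d\n", loc.Family, loc.Index)
}
```

Storing the locator alongside the public key, as the planned migration suggests, would replace the loop with a single derivation.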

As a result of the case above, it's now the case that any future hardware signers need to be aware of the shachain protocol, in order to generate and validate any points we receive.

The one other section that we maybe want to modify is the way we derive the key we use for encryption. We made an attempt to ensure that any future hardware signers don't actually need to understand our encryption protocol. So instead what we do is use a public point with the assumption that it will never be used for an address and be unveiled to the outside world. One alternative that I had (but scrapped, idk why TBH) is to use a point, but then have the hardware signer provide us with an ECDH of that point and another. This would ensure that the key is derived from secret data, but allow us to not store any private data in the backup.

TODO's

  • write integration tests

  • write additional unit tests in channeldb

  • real world recovery attempts

  • update docs on how to use the recovery tools

  • after #1988 is in, finish hooking up the chanbackup.SubSwapper so we can auto update the backup file on disk

Fixes #175

chanbackup/backupfile.go (outdated)
}
}

// UpdateAndSwap will attempt write a new temporary backup file to disk with

@alexbosworth

alexbosworth Dec 11, 2018

Contributor
Suggested change
- // UpdateAndSwap will attempt write a new temporary backup file to disk with
+ // UpdateAndSwap will attempt to write a new temporary backup file to disk with
@lsching17

lsching17 commented Dec 13, 2018

"First, the easiest method for backup+recovery. After this PR, lnd will maintain a channels.backup file in the same location that we store all the other files. .."

Can a dedicated folder be used? If it is mounted with sshfs or nfs, the channels.backup and channel.db files can be kept on different machines.

@Roasbeef

Member Author

Roasbeef commented Dec 13, 2018

Can a dedicated folder be used?

I don't see why not. We can add a config flag for the backup file location.

@Roasbeef

Member Author

Roasbeef commented Dec 25, 2018

Alrighty, I've broken this PR up into 5 distinct PRs. Each new PR depends on the prior PR. As a result, they can go in one by one and be reviewed in smaller units, rather than waiting for the final dependents of this larger PR to be finalized. I'll keep this one as is though, as it has the full description and also builds, allowing users to experiment with the set of commands. Once the final PR is ready for review (as all the prior PRs have been merged), I'll rebase this on top of that, so everyone can use this as a central point of end to end testing.

@Roasbeef

Member Author

Roasbeef commented Feb 1, 2019

Pushed up a rebased version as all the dependent PRs have been merged. Once #1988 is in, then I'll start the final push to getting this merged!

@Roasbeef Roasbeef force-pushed the Roasbeef:static-chan-backups branch from f31163d to 747b1c2 Feb 7, 2019

@Roasbeef Roasbeef force-pushed the Roasbeef:static-chan-backups branch from 747b1c2 to 861c6fb Feb 9, 2019

@Roasbeef

Member Author

Roasbeef commented Feb 9, 2019

Pushed up a new version that maintains the backup file on disk and modifies it based on new/closed channels. Will push up the integration tests next, and after that it's ready for review.


Roasbeef added some commits Mar 10, 2019

contractcourt: only look for local force close for non-recovered channel
In this commit, we modify the main `closeObserver` dispatch loop to only
look for the local force close if we didn't recover the channel. We do
this because it isn't possible for us to force close a recovered
channel.
contractcourt: ignore all other dispatch cases in closeObserver when …
…recovered chan

In this commit, we modify the `closeObserver` to fast path the DLP
dispatch case if we detect that the channel has been restored. We do
this as otherwise, we may inadvertently enter one of the other cases
erroneously, causing us to not properly look up their dlp commitment
point.
server: convert Start/Stop methods to use sync.Once
In this commit, we convert the server's Start/Stop methods to use the
sync.Once. We do this in order to fix concurrency issues that would
allow certain queries to be sent to the server before it has actually
fully started up. Before this commit, we would set started to 1 at the
very top of the method, allowing certain queries to pass before the rest
of the daemon had started up.

In order to fix this issue, we've converted the server to using a
sync.Once, and two new atomic variables for clients to query to see if
the server has fully started up, or is in the process of stopping.
channeldb: in RestoreChannelShells don't exit if edge already exists
During the restore process, it may be possible that we have already
heard about our prior edge from a node on the network (or our channel
peers). As a result, we shouldn't exit if this happens, and instead
should continue with the rest of the restoration process.
lntest: extend the restore/restart methods to also accept optional SCBs
In this commit, we modify the `RestoreNodeWithSeed` and `RestartNode`
methods to also accept an SCB. This will be useful in new integration
tests to properly exercise the various restore/restart scenarios using
static channel backups.
test: update to new getChanPointFundingTxid
In this commit, we update all uses of the `getChanPointFundingTxid` to
match the new function signature. We no longer need to convert to a
chainhash.Hash, as the method does so underneath now.
test: refactor testDataLossProtection to extract core DLP scenario ou…
…t to new func

In this commit, we modify the core testDataLossProtection test to
extract the primary DLP assertion logic into a new function. We do this,
as the upcoming SCB tests will fallback to this test after some initial
set up.
test: add new series of itests for various SCB restore scenarios
In this commit, we add 4 new itests for exercising the SCB restore
process via 4 primary scenarios: recover from backup using RPC, recover
from file using RPC, recover channels during init/creation, recover
channels during unlock. With all fields populated there're a total of 24
new scenarios to cover. At the time of authoring of this commit, the
other scenarios (bits are: initiator, updates, private) have been left
out for now, as they increased the run time of the integration tests
significantly.
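
For the `server: convert Start/Stop methods to use sync.Once` commit above, the pattern it describes might look roughly like this Go sketch. It is an illustration under assumptions, not the code from that commit: the `server` type, its field names, and the `bootCount` counter are invented for the example. What it shows is the ordering the commit message is about: the "started" flag is only set at the end of the one-time start-up body, so callers polling `Started()` cannot get through before the subsystems are up.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// server is a toy stand-in for a daemon with one-time start and stop.
type server struct {
	start sync.Once
	stop  sync.Once

	active   int32 // set to 1 only after start-up has fully finished
	stopping int32 // set to 1 as soon as Stop begins

	bootCount int // how many times the start-up body actually ran
}

// Start runs the start-up body at most once, however often it is called.
func (s *server) Start() {
	s.start.Do(func() {
		// ... bring up every subsystem here ...
		s.bootCount++

		// Flip the flag last, so queries gated on Started() cannot slip
		// in before the work above is done.
		atomic.StoreInt32(&s.active, 1)
	})
}

// Stop runs the shutdown body at most once.
func (s *server) Stop() {
	s.stop.Do(func() {
		atomic.StoreInt32(&s.stopping, 1)
		// ... tear down subsystems here ...
	})
}

// Started reports whether start-up has fully completed.
func (s *server) Started() bool { return atomic.LoadInt32(&s.active) != 0 }

// Stopped reports whether shutdown has begun.
func (s *server) Stopped() bool { return atomic.LoadInt32(&s.stopping) != 0 }

func main() {
	s := &server{}
	fmt.Println("started before Start:", s.Started())

	s.Start()
	s.Start() // second call is a no-op
	fmt.Println("started after Start:", s.Started(), "boot count:", s.bootCount)

	s.Stop()
	fmt.Println("stopping:", s.Stopped())
}
```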

@Roasbeef Roasbeef force-pushed the Roasbeef:static-chan-backups branch from fc8f85c to f216027 Mar 29, 2019

@molxyz


molxyz commented Mar 30, 2019

Tested on a testnet node that has been running with noseedbackup, SCB still let me do exportchanbackup. Shouldn't this result in an error message instead?
https://hastebin.com/raw/urohupeful

@Roasbeef

Member Author

Roasbeef commented Apr 1, 2019

@molxyz at runtime, lnd doesn't know if you actually got a seed or not.

@Roasbeef

Member Author

Roasbeef commented Apr 1, 2019

In that case, you wouldn't actually be able to decrypt the SCB unless you read out the private data of the database.

@cfromknecht
Collaborator

cfromknecht left a comment

Awesome work on this feature @Roasbeef!

It is time to start securing our bags.

LGTM 💰

@Roasbeef Roasbeef merged commit c37ea68 into lightningnetwork:master Apr 1, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls First build on static-chan-backups at 59.342%

High Priority automation moved this from Needs review to Done Apr 1, 2019

@ZapUser77


ZapUser77 commented Apr 2, 2019

Any chance you can include what the commands are to restore (exact syntax), and what the expected outputs would be (just an example)? Considering how important this is, just guessing and 'trying to figure it out' may not be the best idea.

From my understanding, this isn't actually a "back up" of the channels, and is instead a "channel funds recovery mechanism". Correct? If you restored using this, you'd have a node with zero channels, and would have to start opening channels from scratch. Correct?

@Roasbeef

Member Author

Roasbeef commented Apr 2, 2019

@ZapUser77


ZapUser77 commented Apr 2, 2019

"Check out the PR description"
I did, read the entire thing, multiple times. I wouldn't have asked before reading.

"more docs will be provided later."
Thanks as always for your diligent hard work. It really is appreciated.
