Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htlcswitch: fix empty commit sig and no commit sig #2927

Merged
merged 8 commits into from Nov 9, 2019

Conversation

joostjager
Copy link
Collaborator

@joostjager joostjager commented Apr 10, 2019

Builds on top of #3564.

This PR fixes two known bugs in the commitment tx update sequence:

  • Sending an commit sig message that didn't sign any new updates ("empty commit sig")
  • Not sending a commit sig message when it is expected.

In addition to that, several simplifications are carried through where values were tracked in the link that can also be obtained from the channel.

@joostjager joostjager changed the title htlcswitch: always send commitSig htlcswitch: always send commitSig [wip] Apr 10, 2019
@cfromknecht cfromknecht added this to the 0.7 milestone Apr 11, 2019
@halseth
Copy link
Collaborator

@halseth halseth commented Apr 15, 2019

For reference: #1927

Roasbeef referenced this issue in rustyrussell/lightning Apr 16, 2019
The spec says not to send a commitment_signed without any changes, but LND
does this.  To understand why, you have to understand how LND works.  I
haven't read the code, but I'm pretty sure it works like this:

1. lnd slows down to do garbage collection, because it's in Go.
2. When an alert timer goes off, noticing it's not making process, it
   sends a twitter message to @Roasbeef.
3. @Roasbeef sshs into the user's machine and binary patches lnd to send
   a commitment_signed message.
4. Unfortunately he works so fast that various laws of causality are broken,
   meaning sometimes the commitment_signed is sent before any of thes
   other things happen.

I'm fairly sure that this will stop as @Roasbeef ages, or lnd introduces
some kind of causality enforcement fix.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@Roasbeef Roasbeef added bug fix htlcswitch commitments labels Apr 20, 2019
@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Apr 25, 2019

@halseth's repro of empty commitSig: #3012

@cfromknecht cfromknecht removed this from the 0.7 milestone May 30, 2019
@wpaulino wpaulino added this to the 0.8.0 milestone Jul 25, 2019
@wpaulino wpaulino added the v0.8.0 label Jul 30, 2019
@wpaulino wpaulino removed the v0.8.0 label Sep 16, 2019
@wpaulino wpaulino removed this from the 0.8.0 milestone Sep 16, 2019
@joostjager joostjager changed the title htlcswitch: always send commitSig [wip] htlcswitch: fix empty commit sig and no commit sig Sep 24, 2019
@joostjager joostjager removed the request for review from halseth Sep 24, 2019
@joostjager joostjager force-pushed the hodl-drop-fix-better branch 2 times, most recently from 8b50fb3 to ab2078b Compare Sep 24, 2019
@joostjager joostjager force-pushed the hodl-drop-fix-better branch 2 times, most recently from a073f13 to c230496 Compare Sep 26, 2019
@@ -1416,6 +1422,7 @@ func NewLightningChannel(signer input.Signer,
Capacity: state.Capacity,
LocalFundingKey: state.LocalChanCfg.MultiSigKey.PubKey,
RemoteFundingKey: state.RemoteChanCfg.MultiSigKey.PubKey,
log: build.NewPrefixLog(logPrefix, walletLog),
Copy link
Collaborator

@cfromknecht cfromknecht Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge with prior commit? should we also use this in l.debugf(), l.errorf(), etc? atm i think some use chan point and others use short channel id, would be nice if we were consistent

Copy link
Collaborator Author

@joostjager joostjager Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe. Ok, I know that I sometimes comment on commit structure in PRs and suggest some of them to be merged. This is particular the case for structs that are introduced in a separate commit but it is hard to review the code without any further context. In this case the prefix logger is an object that doesn't need much context to review (although this is a sliding scale) and therefore I made it a separate commit.

Copy link
Collaborator Author

@joostjager joostjager Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added commits to also make the link logging consistent. Open question is what we think is the best prefix to use for link and channel? uint64, shortchan, chanpoint?

remoteUpdatesPending := lastLocalCommit.theirMessageIndex !=
lastRemoteCommit.theirMessageIndex

allSigned := !localUpdatesPending && !remoteUpdatesPending
Copy link
Collaborator

@cfromknecht cfromknecht Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks duplicated from earlier commit, perhaps make internal helper

Copy link
Collaborator Author

@joostjager joostjager Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it isn't duplicated. Here we check if we owe a signature, while in ReceiveNewCommitment, we check whether the remote party owed anything.

build/prefix_log.go Outdated Show resolved Hide resolved
htlcswitch/link.go Show resolved Hide resolved
htlcswitch/link.go Show resolved Hide resolved
htlcswitch/link_isolated_test.go Outdated Show resolved Hide resolved
@joostjager joostjager requested a review from cfromknecht Oct 1, 2019
v0.9.0-beta automation moved this from Needs Review to Approved Oct 29, 2019
Copy link
Collaborator

@cfromknecht cfromknecht left a comment

Great work! Indeed this simplifies state machine and reduces the chance of bugs due to local state, e.g. needsUpdate 🍾

htlcswitch/link.go Show resolved Hide resolved
lnwallet/channel.go Show resolved Hide resolved
@joostjager joostjager moved this from Approved to WIP in v0.9.0-beta Oct 29, 2019
@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Oct 29, 2019

PR is approved, but need to do interop testing with CL and Eclair

@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Oct 31, 2019

Testing interop between CL 0.7.2 and this branch 5080a03. Setup: connect CL and LND nodes with a channel. Then pay invoices both ways concurrently. Used script:

#!/bin/bash

while true
do
        LNCLI="$HOME/lightninglabs/lnd/lncli-debug --network=regtest"

        ID1=`od -vAn -N4 -tu4 < /dev/urandom`
        ID2=`od -vAn -N4 -tu4 < /dev/urandom`

        BOLT1=`lightning-cli invoice 10000 $ID1 x | jq -r .bolt11`
        BOLT2=`$LNCLI addinvoice 10 | jq -r .pay_req`

        $LNCLI --network=regtest payinvoice -f $BOLT1
        lightning-cli pay $BOLT2
done

I ran four instances of the script in parallel. No anomalies observed and channel remained functional.

@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Oct 31, 2019

Same test for Eclair 0.3.2-5ad3944, no problems observed.

#!/bin/bash

while true
do
        LNCLI="$HOME/lightninglabs/lnd/lncli-debug --network=regtest"
        ECLAIRCLI="$HOME/Downloads/eclair-cli.sh -a localhost:8081 -p foobar"
        ID1=`od -vAn -N4 -tu4 < /dev/urandom`
        ID2=`od -vAn -N4 -tu4 < /dev/urandom`

        BOLT1=`$ECLAIRCLI createinvoice --amountMsat=10000 --description="test" | jq -r .serialized`
        BOLT2=`$LNCLI addinvoice 10 | jq -r .pay_req`

        $LNCLI --network=regtest payinvoice -f $BOLT1
        $ECLAIRCLI payinvoice --invoice=$BOLT2
done

@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Oct 31, 2019

Finally interop with lnd 0.8. Works too.

@joostjager joostjager moved this from WIP to Approved in v0.9.0-beta Oct 31, 2019
@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Oct 31, 2019

Moved PR back into the board's approved column. Ready for a final check

@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Nov 5, 2019

Rebased

Copy link
Member

@Roasbeef Roasbeef left a comment

Solid set of changes, no major comments on initial pass. I've let a comment on a older related PR, to see how we can reconcile the two of them. Only material comment is a shift to avoid grabbing the channel state machine's mutex twice when we receive a new commitment.

// updates in the local commit tx that aren't reflected in the remote commit tx
// yet.
func (lc *LightningChannel) OweCommitment(local bool) bool {
lc.RLock()
Copy link
Member

@Roasbeef Roasbeef Nov 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should instead obtain the mutex once within ReceiveNewCommitment, then either make this a private method, or update the godoc to note that it expects the mutex to already be held?

Copy link
Collaborator Author

@joostjager joostjager Nov 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also called without a lock from ChannelLink. I now created a private no-lock version and a public method that wraps it in a lock.

I am wondering whether our channel manipulations are always safe though. Something like this for example:

if lc.OweCommitment(true) {
			commitSig, htlcSigs, _, err := lc.SignNextCommitment()

I think the channel isn't modified concurrently, but why it the lock there then? Is it a safety precaution? Or for reading channel state in another subsystem?

Copy link
Member

@Roasbeef Roasbeef Nov 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock was mostly a safety precaution when the struct was first created just to ensure things were goroutine safe, so all other code added afterwards followed that trend. In the past there were also instances of grabbing live state machines from the peer, and using that to populate RPC information.

@@ -1103,7 +1103,7 @@ out:
// update in some time, check to see if we have any
// pending updates we need to commit due to our
// commitment chains being desynchronized.
if l.channel.FullySynced() {
if !l.channel.OweCommitment(true) {
Copy link
Member

@Roasbeef Roasbeef Nov 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see how this is used here now. Updating my prior comment, we can make a private version of this method that assumes the lock is already held, then call that when accepting a new commitment.

@@ -1109,19 +1112,10 @@ out:
}

case <-l.cfg.BatchTicker.Ticks():
// If the current batch is empty, then we have no work
Copy link
Member

@Roasbeef Roasbeef Nov 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this should also lower idle CPU usage for routing nodes with tons of channels.

@joostjager joostjager force-pushed the hodl-drop-fix-better branch 3 times, most recently from 7d856d1 to d0d6b20 Compare Nov 6, 2019
joostjager and others added 8 commits Nov 6, 2019
To facilitate the logging, this commit adds a new OweCommitment method.
For the logging, we only need to consider the remote perspective. In a
later commit, we'll also start using the local perspective to support
the decision to send another signature.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
Previously the channel method FullySynced was used to decide whether to
send a new commit sig message. However, it could happen that FullySynced
was false, but that we didn't owe a commitment signature. Instead we
were waiting on the other party to send us a signature. If that
happened, we'd send out an empty commit sig. This commit modifies the
condition that triggers a new commit sig and fixes this deviation from
the spec.
Instead of tracking local updates in a separate link variable, query
this state from the channel itself.

This commit also fixes the issue where the commit tx was not updated
anymore after a failed first attempt because the revocation window was
closed. Also those pending updates will be taken into account when the
remote party revokes.
Now that channel exposes the number of pending local updates, it is no
longer necessary to track the batch size separately in the link.
Replace logCommitTick as a way to deal with revocation window exhaustion
by retrying to update the commit tx when the remote revocation is
received.

The rationale is that the revocation window always opens up because of a
revoke message that is received from the other party. It is therefore
not necessary to set a timer for this. The reception of the revoke
message is the trigger to send a new commit sig if necessary.
@joostjager
Copy link
Collaborator Author

@joostjager joostjager commented Nov 6, 2019

Test from #3012 added to this PR. It needed a bit of modification, but the idea of the scenario is the same.

For review, this is the delta with the previous revision of this pr:
https://github.com/joostjager/lnd/compare/hodl-drop-fix-better-ref..joostjager:hodl-drop-fix-better

@joostjager joostjager requested a review from Roasbeef Nov 6, 2019
Copy link
Member

@Roasbeef Roasbeef left a comment

LGTM 🎷

@Roasbeef Roasbeef merged commit 86bed39 into lightningnetwork:master Nov 9, 2019
2 checks passed
v0.9.0-beta automation moved this from Approved to Done Nov 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix commitments htlcswitch v0.9.0
Projects
No open projects
v0.9.0-beta
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

5 participants