[reliable payments] persist htlcswitch pending payments #2762

Merged

5 participants
@halseth
Collaborator

commented Mar 12, 2019

This PR is a follow-up from #2761. It adds a persistent pendingPaymentStore to the Switch, needed to store payment results when the router is not available to handle them.

Problem

After a restart, there is no connection between an HTLC in flight on the network and the router payment flow that initially sent the HTLC. The router will eventually call GetPaymentResult to retrieve the result of the HTLC, but we have no guarantees this will be done before the result is received. This can lead to the result getting dropped, which is not ideal.

Solution

We introduce a pendingPaymentStore which has three primary functions:

  1. It stores PaymentResults when they are received, regardless of whether there are any subscribers (the ChannelRouter) to the corresponding HTLC. This ensures that the router gets handed the result when it calls GetPaymentResult.
  2. It stores the information necessary to identify and handle a result that comes back after a restart. This is the chanID and paymentID, used to identify the payment in the circuit map.
  3. On restarts it syncs the pending payments with the CircuitMap. We use the circuit map as the source of truth for whether a payment is in flight on the network. Syncing it at startup, before we start handling HTLCs, is necessary to be able to decide whether an HTLC was successfully forwarded.

Builds on #2761
Replaces #2265

@halseth halseth added this to the 0.6 milestone Mar 12, 2019

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch 6 times, most recently from c22bf88 to 298a129 Mar 12, 2019

@Roasbeef
Member

left a comment

I find this new approach much easier to follow than the prior iterations! I'm not sure we can get rid of the control tower as is though, due to the possibility of a user initiating a payment, then updating to this new version. Worst case they would re-try with the same payment hash and possibly lose that payment amount.

First pass review completed, will need some additional unit tests for the new biz logic in the pending payment store, and also the new functionality w.r.t passing back the same payment ID, etc.

htlcswitch/pending_payment.go Outdated
// network. By deleting those pending payments that are not found in
// the circuit map, we ensure that the router won't wait for a result
// to be returned for this payment ID.
// NOTE: this assumes that the circuit map has already been trimmed,

@Roasbeef

Roasbeef Mar 20, 2019

Member

I don't think this last assumption is sound with the current implementation. IIRC, the circuit map as is can contain circuits for payments that have already either completed or failed on the network.

cc @cfromknecht

@joostjager
Collaborator

left a comment

First pass review

// fetchPendingPayments retrieves all payments from the store that are pending,
// meaning no results are available yet, and can be re-forwarded on the
// network.
func (store *pendingPaymentStore) fetchPendingPayments() ([]*PendingPayment,

@joostjager

joostjager Mar 20, 2019

Collaborator

Looks like this is test only code. Would look for a different way to assert behaviour (intercepting mock?)


// getPayment is used to query the store for a pending payment for the given
// payment ID. If the pid is not found, ErrPaymentIDNotFound will be returned.
func (store *pendingPaymentStore) getPayment(pid uint64) (*PendingPayment,

@joostjager

joostjager Mar 20, 2019

Collaborator

Is this function used?

// syncWithCircuitMap should be called at startup to delete the pending
// payments which are no longer found in the circuit map. This removes pending
// payments that were not successfully forwarded to the link before shutdown.
func (store *pendingPaymentStore) syncWithCircuitMap(

@joostjager

joostjager Mar 20, 2019

Collaborator

If we need to do this anyway, what is the use of persisting pending payments? Can't that list be reconstructed on startup from the circuit map?

So only persist the outcomes when they become available.


// completePayment stores the PaymentResult for the given paymentID, and
// notifies any subscribers.
func (store *pendingPaymentStore) completePayment(paymentID uint64,

@joostjager

joostjager Mar 20, 2019

Collaborator

Are those results left in forever?

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch from 298a129 to cb790c4 Mar 20, 2019

@Roasbeef Roasbeef removed this from the 0.6 milestone Mar 26, 2019

@joostjager joostjager referenced this pull request Apr 1, 2019

Merged

Loop In #34

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch 3 times, most recently from aea2217 to 858ffd8 Apr 8, 2019

@cfromknecht cfromknecht added this to the 0.7 milestone Apr 11, 2019

// If the provided deobfuscator is nil, we have discarded the error
// decryptor due to a restart. We'll return a fixed error and signal a
// temporary channel failure to the router.
case deobfuscator == nil:
userErr := fmt.Sprintf("error decryptor for payment " +
"could not be located, likely due to restart")

@joostjager

joostjager May 6, 2019

Collaborator

How is this still possible after this PR?

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch 2 times, most recently from 39263d4 to e24ded1 May 6, 2019

@@ -0,0 +1,268 @@
package htlcswitch

@joostjager

joostjager May 6, 2019

Collaborator

Rename file to payment_result_store.go, as we do with other stores too?


// Otherwise, write 'true' and the error fields.
msg := []byte(p.Error.ExtraMsg)
err := channeldb.WriteElements(w, true,

@joostjager

joostjager May 6, 2019

Collaborator

Saving the human readable error ExtraMsg to the db is not ideal. There are only a few cases for ExtraMsg, those can probably be saved in a structured way.

@joostjager

joostjager May 6, 2019

Collaborator

Upon further inspection, there may be a few more. But still doubt whether we should persist data just to get a nice error message. Maybe just logging when the error occurs is enough.

@halseth

halseth May 7, 2019

Author Collaborator

Yeah, looks like they are only used locally for passing the cause of the error to the RPC. Would it be useful to define a field for local error codes that can be used to communicate local failures?

@joostjager

joostjager May 7, 2019

Collaborator

I think that local internal errors (things that should never happen) don't need to be exposed over rpc. Logging and returning a generic internal error should be enough?

// Otherwise, write 'true' and the error fields.
msg := []byte(p.Error.ExtraMsg)
err := channeldb.WriteElements(w, true,
p.Error.ErrorSource, msg,

@joostjager

joostjager May 6, 2019

Collaborator

ErrorSource could be nil if the error source is unknown (for example after restart or when the onion decrypt fails). The store should probably prepare for that, to prevent a db migration.

@halseth

halseth May 7, 2019

Author Collaborator

Using nil can lead to problems down the line, what do you think about defining a constant UnknownPaymentSource?

@joostjager

joostjager May 7, 2019

Collaborator

What kind of problems? nil for unset fields is quite common? For serializing to the db we can of course write anything we want.

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch 2 times, most recently from 7c2f8d7 to 8294a42 May 6, 2019


@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch 4 times, most recently from 0975df7 to 28401bb May 7, 2019

@halseth

Collaborator Author

commented May 28, 2019

Rebased.

@Roasbeef

Member

commented May 29, 2019

First commit can be removed during the next rebase.

htlcswitch/payment_result.go
var paymentIDBytes [8]byte
binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID)

err := store.db.Batch(func(tx *bbolt.Tx) error {

@Roasbeef

Roasbeef May 29, 2019

Member

May want to consider making this an update instead, if only for the sake of the integration tests. In the past, the extra 30ms or so jitter here really accumulated making many tests time out when we tried Batching everywhere before.

@halseth

halseth May 29, 2019

Author Collaborator

Yeah, I'm def in favor of using Update as long as we don't have any clear indication that it is a performance bottleneck.

@cfromknecht

cfromknecht May 29, 2019

Collaborator

if we do decide to use Update, then we probably don't need a multimutex since the db txns can't execute concurrently. storeResult is called in separate goroutines, so under heavy load i suspect Batch would help, but would need to benchmark to be certain

@Roasbeef

Roasbeef Jun 7, 2019

Member

Still using a Update here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.

htlcswitch/switch.go
@@ -989,6 +976,7 @@ func (s *Switch) parseFailedPayment(deobfuscator ErrorDecrypter,
// the first hop. In this case, we'll report a permanent
// channel failure as this means us, or the remote party had to
// go on chain.
// TODO: check reason length instead

@cfromknecht

cfromknecht May 29, 2019

Collaborator

what is meant by this/when do we plan to do this?

@halseth

halseth Jun 7, 2019

Author Collaborator

Removed.

t.Fatalf("unable to get payment result: %v", err)
}

select {

@cfromknecht

cfromknecht May 29, 2019

Collaborator

not blocking, but this segment is an exact copy of the logic above, maybe add a checkResult closure?

@cfromknecht

Collaborator

commented May 29, 2019

Really happy with how this refactor turned out, this method of storing the payment results i think is much cleaner than some of the avenues we explored earlier! Will need rebase now that #3087 has landed

lnd_test.go Outdated
}

// Wait for all the invoices to reach the OPEN state.
for _, stream := range invoiceStreams {

@cfromknecht

cfromknecht May 30, 2019

Collaborator

at some point these helper routines for hodl invoices could be extracted, but not a blocker for this PR

msg: pkt.htlc,
unencrypted: unencrypted,
isResolution: pkt.isResolution,
}

@joostjager

joostjager Jun 4, 2019

Collaborator

Add timestamp to record when failure was received. This is one building block of black hole defense

@joostjager

joostjager Jun 4, 2019

Collaborator

Preferably time.UnixNano for sub-second resolution

@Roasbeef

Roasbeef Jun 7, 2019

Member

AFAIK, we don't store timestamps when the payment is sent, so what good will it do to store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since, although they're able to operate independently with the new design, they typically operate in near unison.

In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.

@Roasbeef

Member

commented Jun 6, 2019

Can now be updated to use the latest RPC calls in the new sub-server for the new itests.

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch from bc77400 to 88baef2 Jun 6, 2019

@Roasbeef
Member

left a comment

LGTM ☄️

Two non-blocking follow-ups that I'd like to see as a PR soon after this lands:

  1. Additional itest for the payment failure case
  2. Clean up of the payment result once the router updates its state, and ACKs the message from the switch.

@cfromknecht
Collaborator

left a comment

LGTM! 🔥

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch from b4ea245 to 78dfb83 Jun 7, 2019

halseth added some commits Jun 7, 2019

htlcswitch/payment_result: add paymentResultStore
paymentResultStore is a persistent store where we keep track of all
received payment results. This is used to ensure we don't lose results
from payment attempts on restarts.
multi: make GetPaymentResult take payment hash
Used for logging in the switch, and when we remove the pending payments,
only the router will have the hash stored across restarts.
lnd_test: add testHoldInvoicePersistence
testHoldInvoicePersistence tests that a sender to a hold-invoice, can be
restarted before the payment gets settled, and still be able to receive
the preimage.
htlcswitch/payment_result_test: add TestNetworkResultStore
TestNetworkResultStore tests that the networkResult store behaves as
expected, and that we can store, get and subscribe to results.
htlcswitch/switch test: add TestSwitchGetPaymentResult
TestSwitchGetPaymentResult tests that the switch interacts as expected
with the circuit map and network result store when looking up the result
of a payment ID. This is important to avoid losing results when lookups
and incoming results happen concurrently.

@halseth halseth force-pushed the halseth:reliable-payments-lookup-circuitmap branch from 78dfb83 to dd88015 Jun 7, 2019

@halseth

Collaborator Author

commented Jun 7, 2019

Alright, itest using the new payment API is pushed. PTAL @joostjager

@halseth halseth merged commit e45d4d7 into lightningnetwork:master Jun 8, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.03%) to 60.938%