-
Notifications
You must be signed in to change notification settings - Fork 2.2k
[reliable payments] persist htlcswitch pending payments #2762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reliable payments] persist htlcswitch pending payments #2762
Conversation
c22bf88 to
298a129
Compare
Roasbeef
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this new approach much easier to follow than the prior iterations! I'm not sure we can get rid of the control tower as is though, due to the possibility of a user initiating a payment, then updating to this new version. Worst case they would re-try with the same payment hash and possibly lose that payment amount.
First pass review completed, will need some additional unit tests for the new biz logic in the pending payment store, and also the new functionality w.r.t passing back the same payment ID, etc.
htlcswitch/switch.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this last assumption is sound with the current implementation. IIRC, the circuit map as is, can contain circuits for payments they have either already been completed or failed on the network.
cc @cfromknecht
joostjager
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass review
htlcswitch/pending_payment.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is test only code. Would look for a different way to assert behaviour (intercepting mock?)
htlcswitch/pending_payment.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function used?
htlcswitch/pending_payment.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to do this anyway, what is the use of persisting pending payments? Can't that list be reconstructed on startup from the circuit map?
So only persist the outcomes when they become available.
htlcswitch/pending_payment.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those results left in forever?
298a129 to
cb790c4
Compare
aea2217 to
858ffd8
Compare
htlcswitch/switch.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this still possible after this PR?
39263d4 to
e24ded1
Compare
htlcswitch/payment_result.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename file to payment_result_store.go, as we do with other stores too?
htlcswitch/payment_result.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saving the human readable error ExtraMsg to the db is not ideal. There are only a few cases for ExtraMsg, those can probably be saved in a structured way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further inspection, there may be a few more. But still doubt whether we should persist data just to get a nice error message. Maybe just logging when the error occurs is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looks like they are only used locally for passing the case of error to the RPC. Would it be useful to define a field for local error codes that can be used to communicate local failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that local internal errors (things that should never happen) don't need to be exposed over rpc. Logging and returning a generic internal error should be enough?
htlcswitch/payment_result.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrorSource could be nil if the error source is unknown (for example after restart or when the onion decrypt fails). The store should probably prepare for that, to prevent a db migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using nil can leads to problems down the line, what do you think about defining a constant UnknownPaymentSource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of problems? nil for unset fields is quite common? For serialize to the db we can of course write anything we want.
7c2f8d7 to
8294a42
Compare
0975df7 to
28401bb
Compare
7195371 to
e9c25ae
Compare
|
Rebased. |
|
First commit can be removed during the next rebase. |
| var paymentIDBytes [8]byte | ||
| binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID) | ||
|
|
||
| err := store.db.Batch(func(tx *bbolt.Tx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want to consider making this an update instead, if only for the sake of the integration tests. In the past, then extra 30ms or so jitter here really accumulated making many tests time out when we tried Batching everywhere before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm def in favor of using Update as long as we don't have any clear indication that it is a performance bottleneck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do decide to use Update, then we probably don't need a multimutex since the db txns can't execute concurrently. storeResult is called in separate goroutines, so under heavy load i suspect Batch would help, but would need to benchmark to be certain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still using a Update here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.
e9c25ae to
bc77400
Compare
htlcswitch/switch.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is meant by this/when do we plan to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
htlcswitch/switch_test.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blocking, but this segment is an exact copy of the logic above, maybe add a checkResult closure?
|
Really happy with how this refactor turned out, this method of storing the payment results i think is much cleaner than some of the avenues we explored earlier! Will need rebase now that #3087 has landed |
lnd_test.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point these helper routines for hodl invoices could be extracted, but not blocker on this PR
htlcswitch/switch.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add timestamp to record when failure was received. This is one building block of black hole defense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably time.UnixNano for sub-second resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.
In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.
|
Can now be updated to use the latest RPC calls in the new sub-server for the new itests. |
bc77400 to
88baef2
Compare
Roasbeef
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ☄️
Two non-blocking follow up that I'd like to see extended as a PR soon after this lands:
- Additional itest for the payment failure case
- Clean up of the payment result once that router updates its state, and ACKs the message from the switch.
| var paymentIDBytes [8]byte | ||
| binary.BigEndian.PutUint64(paymentIDBytes[:], paymentID) | ||
|
|
||
| err := store.db.Batch(func(tx *bbolt.Tx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still using a Update here, but haven't had a chance to attempt to ascertain the performance impact. I think we can just leave it as it is, then during the rc process possibly adjust it if we find that it makes a significant impact.
htlcswitch/switch.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, we don't store time stamps when the payment is sent, so what good will it to do store them here? Even then, I think we can store all the timestamps at the router level rather than the switch since although they're able to operate independently with the new design, they operate in near unison typically.
In any case, rather than piggy backing on the PR, I think this can be done as a distinct change once we start to center in on a more stable design.
cfromknecht
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🔥
b4ea245 to
78dfb83
Compare
paymentResultStore is a persistent store where we keep track of all received payment results. This is used to ensure we don't lose results from payment attempts on restarts.
Used for logging in the switch, and when we remove the pending payments, only the router will have the hash stored across restarts.
testHoldInvoicePersistence tests that a sender to a hold-invoice, can be restarted before the payment gets settled, and still be able to receive the preimage.
TestNetworkResultStore tests that the networkResult store behaves as expected, and that we can store, get and subscribe to results.
TestSwitchGetPaymentResult tests that the switch interacts as expected with the circuit map and network result store when looking up the result of a payment ID. This is important for not to lose results under concurrent lookup and receiving results.
78dfb83 to
dd88015
Compare
|
Alright, itest using the new payment API is pushed. PTAL @joostjager |
This PR is a follow-up from #2761. It adds a persistent
pendingPaymentStoreto theSwitch, needed to store payment results when therouteris not available to handle them.Problem
After a restart, there is no connection between an HTLC in flight on the network and the
routerpayment flow that initially sent the HTLC. Therouterwill eventually callGetPaymentResultto retrieve the result of the HTLC, but we have no guarantees this will be done before the result is received. This can lead to the result getting dropped, which is not ideal.Solution
We introduce a
pendingPayementStorewhich has two primary functions:PaymentResults when they are received, regardless of whether there are any subscribers (theChannelRouter) to the corresponding HTLC. This ensures that theroutergets handed the result when it callsGetPaymentResult.chanIDandpaymentID, used to identify the payment in the circuit map.CircuitMap. We use the circuit map as the source of truth whether a payment is in flight on the network. Syncing it at startup, before we start handling HTLCs, is necessary to be able to decide whether an HTLC was successfully forwarded.Builds on #2761
Replaces #2265