Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reliable payments] router payment state machine #2761

Merged

Conversation

@halseth
Copy link
Collaborator

@halseth halseth commented Mar 12, 2019

This PR introduces a persistent state machine to the ChannelRouter's payment flow, ensuring we can handle payment results that are received after a restart. It also paves they way for adding cancellation of payments, and resuming payment sessions across restarts.

Problem

lnd currently runs into problems if it is restarted while an HTLC is in flight on the network. The primary reason for this is the way the router hands the HTLC to the switch. The router persist no information about the payment, so when a result eventually comes back, we risk the information needed to properly populate the OutgoingPayment in the database is lost.

Solution

With the paymentStateMachine introduced in this PR, we store two pieces of key information:

  1. The HTLC paymentID. This is used to query the Switch whether the HTLC is still active. If not active we know that we can safely retry the payment attempt. If active the router will wait for the result to be available, and store the result to the DB.
  2. The HTLC's route. This is added to the DB together with the preimage when the payment succeeds.

Note

The Switch does not currently persist the pending payment attempts across restarts. This will be added in a follow-up PR.

Replaces #2475

Builds on

@halseth halseth added this to the 0.6 milestone Mar 12, 2019
@halseth halseth force-pushed the reliable-payments-router-state-machine branch from e83fd93 to 2409e3d Mar 12, 2019
Copy link
Member

@Roasbeef Roasbeef left a comment

I really dig this approach! With the state machine, I find it much easier to follow than the prior attempts at a solution to this problem. I've completed an initial pass so far, and the main question in my mind is the size of the overlap between this new state machine and the existing control tower in the switch. At one point the control tower addressed a need within the codebase, but it seems like this new state machine can eventually subsume the responsibilities of the control tower.

htlcswitch/switch.go Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
routing/route.go Outdated Show resolved Hide resolved
routing/payment_state_machine.go Outdated Show resolved Hide resolved
routing/payment_state_machine.go Outdated Show resolved Hide resolved
routing/payment_state_machine.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
@halseth halseth force-pushed the reliable-payments-router-state-machine branch 7 times, most recently from 78a0c81 to 1687cdc Mar 20, 2019
Copy link
Collaborator

@joostjager joostjager left a comment

I am, as always, worried about the persistence and the work and inflexibility it may cause us in the future. Tried to mainly analyze this part of the PR.

Why don't you merge the switch pr first btw, work bottom up? Or will we have with just this PR already good working resumes of payments?

channeldb/payments.go Outdated Show resolved Hide resolved
channeldb/payments.go Outdated Show resolved Hide resolved
htlcswitch/pending_payment.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/payment_state_machine.go Outdated Show resolved Hide resolved
binary.BigEndian.PutUint64(paymentIDBytes, paymentID)
// CompletePayment overwrites the OutgoingPayment stored in the DB for the
// corresponding payment hash with the completed one.
func (db *DB) CompletePayment(preimage lntypes.Preimage,
Copy link
Collaborator

@joostjager joostjager Mar 20, 2019

Looks unused

routing/router.go Show resolved Hide resolved
routing/router.go Show resolved Hide resolved
routing/payment_state_machine.go Outdated Show resolved Hide resolved
@halseth halseth force-pushed the reliable-payments-router-state-machine branch 5 times, most recently from ce35b99 to 9d1bd42 Mar 21, 2019
Copy link
Collaborator

@joostjager joostjager left a comment

Definitely nicer without that intermediate persistent state.

Main comments:

  • Code structure in router
  • Consolidation of payment related stores

channeldb/payments.go Outdated Show resolved Hide resolved
routing/control_tower.go Outdated Show resolved Hide resolved
routing/payment_store.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
channeldb/payments.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@cfromknecht cfromknecht left a comment

nice refactor of the router/switch interactions, this should make the whole payment flow much tighter!

one question i have with new state introduced in the control tower, will we just treat all payments that are currently grounded as started but never attempted? an alternative would be to rename grounded payments as failed, and introduce a new state for initiated. not sure which is better atm

routing/router.go Outdated Show resolved Hide resolved
return nil
}

ctx.router.cfg.GetPaymentResult = func(paymentID uint64) (
Copy link
Collaborator

@cfromknecht cfromknecht Mar 22, 2019

many of these GetPaymentResult funcs seem similar, any way we can generate the closures w/ less code duplication?

routing/router.go Outdated Show resolved Hide resolved
routing/router.go Outdated Show resolved Hide resolved
routing/route_test.go Outdated Show resolved Hide resolved
rpcserver.go Show resolved Hide resolved
routing/router_test.go Outdated Show resolved Hide resolved
routing/control_tower.go Outdated Show resolved Hide resolved
routing/control_tower.go Outdated Show resolved Hide resolved
htlcswitch/switch.go Outdated Show resolved Hide resolved
@halseth halseth force-pushed the reliable-payments-router-state-machine branch 3 times, most recently from 7180772 to 7aae12a Mar 26, 2019
channeldb/payments.go Outdated Show resolved Hide resolved
routing/router.go Show resolved Hide resolved
channeldb/control_tower.go Outdated Show resolved Hide resolved
@halseth halseth force-pushed the reliable-payments-router-state-machine branch 2 times, most recently from fd20644 to ab0f77c Mar 26, 2019
@Roasbeef Roasbeef removed this from the 0.6 milestone Mar 26, 2019
halseth added 25 commits May 27, 2019
migrateOutgoingPayments moves the OutgoingPayments into a new bucket format
where they all reside in a top-level bucket indexed by the payment hash. In
this sub-bucket we store information relevant to this payment, such as the
payment status.

To avoid that the router resend payments that have the status InFlight (we
cannot resume these payments for pre-migration payments) we delete those
statuses, so only Completed payments remain in the new bucket structure.
Since we have performed a migration, the db should be in a consistent
state, and we can remove the non-strict option.
This commit gives a new responsibility to the control tower, letting it
populate the payment bucket structure as the payment goes through
its different stages.

The payment will transition states Grounded->InFlight->Success/Failed,
where the CreationInfo/AttemptInfo/Preimage must be set accordingly.

This will be the main driver for the router state machine.
This encapsulates all state needed to resume a payment from any point of
the payment flow, and that must be shared between the different stages
of the execution. This is done to prepare for breaking the send loop
into smaller parts, and being able to resume the payment from any point
from persistent state.
This commit makes the router use the ControlTower to drive the payment
life cycle state machine, to keep track of active payments across
restarts.  This lets the router resume payments on startup, such that
their final results can be handled and stored when ready.
On startup the router will fetch the in-flight payments from the control
tower, and resume their execution.
TestRouterPaymentStateMachine tests that the router interacts as
expected with the ControlTower during a payment lifecycle, such that it
payment attempts are not sent twice to the switch, and results are
handled after a restart.
TestPaymentControlDeleteNonInFlight checks that calling DeletaPayments
only deletes payments from the database that are not in-flight.
@halseth halseth force-pushed the reliable-payments-router-state-machine branch from 9f7b1d9 to 7cb25a5 May 27, 2019
@Roasbeef Roasbeef merged commit 19fafd7 into lightningnetwork:master May 27, 2019
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants