routing: fix payment state and refactor payment lifecycle tests #5332

yyforyongyu · 2021-05-26T13:05:43Z

Depends on #5218, only the ~~last 7 commits~~ last 4 commits are relevant.

This PR fixes a potential issue that causes inconsistency in the paymentState. In particular, the paymentState.terminate might become inconsistent with the rest. A new mock library is added to assist the unit testing and be further applied in refactoring the old MPP-related tests.

PS, is there a way to view the diff of a PR based on another PR on github?

Roasbeef · 2021-05-27T01:31:37Z

PS, is there a way to view the diff of a PR based on another PR on github?

Unfortunately no :(

Stuff like https://reviewable.io/ helps do that, but we never managed to make the full switch over.

routing/payment_lifecycle.go

yyforyongyu · 2021-05-27T08:50:26Z

routing/payment_lifecycle.go

+		p.state.Lock()
+
+		// Block until the result is available.
+		result, err := sh.collectResult(attempt)


@halseth Suppose the payment state is,

terminate: false remainingAmt: 0 numInflightShards: 2

There are two steps here,

We fail the attempt inside the collectResult, this has the effect of marking one of the payment.HTLCs as failed, the payment state then becomes,
terminate: false remainingAmt: 5000 numInflightShards: 1

Then calls handleSendError at line 116, which will fail the payment, and the state becomes,
terminate: true remainingAmt: 5000 numInflightShards: 1

Between step 1 and 2, other goroutines may act and change the state.

Thanks for the explanation! I'm still trying to understand why changing the state between step 1 and 2 would be a problem though. Payment attempt outcome and overall payment outcome should ideally be decoupled enough for that not to be a problem.

What could happen between 1 and 2 that would lead to unexpected behavior?

Thank you for reading this!
Tbh I am too trying to understand its implications. The above case could lead to the situation where we would fire one more shard in the next round of the lifecycle loop. Suppose this new one is settled, but it failed to update the payment in db, and the previous step 2 now finished, then we have a payment that's both settled and failed? (Don't know if it's possible tho, still needs to read more code on MMP part).

This is a case when I was refactoring the tests, data race came up because we were updating the payment in one goroutine, while reading it from another. Previously, whenever we construct the payment state, by calling payment.SentAmt(), payment.TerminalInfo(), etc, there is another goroutine modifying them.

I've also updated this PR, realized that there's a mutex used in control tower when we update the payment. That mutex can be used too when we read the payment state.

If the payment has both failed and succeed (which can happen if both happens simultaneously), the success takes precedence:

lnd/channeldb/payments.go

Lines 349 to 367 in 9163d1a

// Use the DB state to determine the status of the payment.

var paymentStatus PaymentStatus

switch {

// If any of the the HTLCs did succeed and there are no HTLCs in

// flight, the payment succeeded.

case !inflight && settled:

paymentStatus = StatusSucceeded

// If we have no in-flight HTLCs, and the payment failure is set, the

// payment is considered failed.

case !inflight && failureReason != nil:

paymentStatus = StatusFailed

// Otherwise it is still in flight.

default:

paymentStatus = StatusInFlight

}

I see what you mean that firing a new shard is unnecessary after the attempt is failed, but before we have marked the payment failed, but it should be fine.

Maybe the data race you saw is a missing mutex within the mock used during the test?

Further looked into it and yeah, it's a data race within the test. But this line got my attention tho, should there be a lock during the following read?

lnd/routing/control_tower.go

Lines 188 to 192 in 9163d1a

func (p *controlTower) FetchPayment(paymentHash lntypes.Hash) (

*channeldb.MPPayment, error) {

return p.db.FetchPayment(paymentHash)

}

I've updated the PR again. If the inconsistency is not an issue, then yeah there's no need to acquire a lock. The extra shard will then resolve itself out in the next round of lifecycle.

I think the mutex is not needed here since it is only used when sending a notification.

routing/payment_lifecycle.go

routing/mock_test.go

routing/payment_lifecycle_test.go

Roasbeef · 2021-06-23T03:33:25Z

Needs a rebase!

Roasbeef · 2021-06-23T03:35:24Z

Will be looking to merge this as a whole instead of #5218 then this since this builds upon the base PR cleanly.

The simulated error returned was rejected due to signature failure, and didn't simulate correctly the insufficient fees error as intended. Fix error by including correct signature.

This commit refactors some of the tests in router_test.go to use the require package.

This commit moves the handleSendError method from ChannelRouter to shardHandler. In doing so, shardHandler can now apply updates to the in-memory paymentSession if they are found in the error message.

This commit adds the method UpdateAdditionalEdge in PaymentSession, which allows the addtional channel edge policy to be updated from a ChannelUpdate message. Another method, GetAdditionalEdgePolicy is added to allow querying additional edge policies.

This commit adds payment session to shardHandler to enable private edge policies being updated in shardHandler. The relevant interface and mock are updated. From now on, upon seeing a ChannelUpdate message, shardHandler will first try to find the target policy in additionalEdges and update it. If nothing found, it will then check the database for edge policy to update.

This commit renames the mock structs by appending Old in their names. In doing so the old tests stay unchanged and new mock structs can be added in the following commit.

This commit uses the package mock to create new mock structs, replacing the old ones for better control when writing tests.

This commit refactors the resumePayment to extract some logics back to paymentState so that the code is more testable. It also adds unit tests for paymentState, and breaks the original MPPayment tests into independent tests so that it's easier to maintain and debug. All the new tests are built using mock so that the control flow is eaiser to setup and change.

yyforyongyu · 2021-06-23T12:39:11Z

Needs a rebase!

Done!

Roasbeef

Solid work! We definitely have a bit more work ahead of us here to make this section of the codebase more scrutable, but this is certainly a step in the right direction. I like the new style you've introduced here in the newly added tests where we work up our series of assertions, then asynchronously kick off the function under test and assert that things lined up with our expectations.

One lingering question re an additional case that was added in the paymentState struct, and if that may address some recently reported issues that have popped up.

LGTM 🍙

Roasbeef · 2021-06-23T03:37:18Z

routing/mock_test.go

@@ -529,3 +530,182 @@ func (m *mockControlTowerOld) SubscribePayment(paymentHash lntypes.Hash) (

 	return nil, errors.New("not implemented")
 }
+
+type mockPaymentAttemptDispatcher struct {


Nice move re renaming the older versions to keep things in place and create a new base test abstraction for any future changes!

Roasbeef · 2021-06-23T23:54:22Z

routing/payment_lifecycle.go

+	// If we have in flight shards and the payment is in final stage, we
+	// need to wait for the outcomes from the shards. Or if we have no more
+	// money to be sent, we need to wait for the already launched shards.
+	if ps.numShardsInFlight == 0 {


Does this represent a bug fix as the existing switch statement didn't exactly account for this?

(still getting through the diff)

Yeah I would say so. When there are no shards in flight, there's no need to wait for outcomes from shards. When a new payment is created, we have zero shards thus there are no outcomes to wait for. When we are in the middle of a MPP, zero numShardsInFlight could mean three cases,

all the HTLC attempts failed (h.Failure != nil). In this case, remainingAmt would be the full payment amount, terminate is false unless we have a "payment level" failure.

all the HTLC attempts settled (h.Settle !=nil). In this case, remainingAmt would be 0, and terminate is true.

a mix of settled and failed HTLC attempts coexist (bad). remaingAmt is not 0, but terminate is true.

And we need to check this logic together with terminated(), as we have the cases,

case currentState.terminated(): ... case currentState.needWaitForShards(): ...

Now for case 2 and 3, because ps.terminate is true and ps.numShardsInFlight == 0, this means we would go to case currentState.terminated():.
For case 1, if the payment failed, it would also go to case currentState.terminated():, otherwise we would launch new shards.

That being said, the complexity shown here probably means we have some improvement to achieve. There are three states, ps.terminate == true, ps.remainingAmt == 0 and ps.numShardsInFlight == 0 to control the workflow. In theory, we should have 8 conditions to be checked. That's a bit much. Underneath the states, we use payment and HTLC attempts to control them. This is the part that brings complexity too. We have a payment with multiple HTLC attempts (one-to-many). Each of the HTLC has states, settled, failed, or in-flight. A settled HTLC means the payment has state settled, a failed HTLC does not. A payment has its own failed state determined by a different set of logic. I think this could be the root cause of all the complexity and unclear abstraction found in the routing package.

Roasbeef · 2021-06-23T23:57:00Z

routing/payment_lifecycle.go

+// updatePaymentState will fetch db for the payment to find the latest
+// information we need to act on every iteration of the payment loop and update
+// the paymentState.
+func (p *paymentLifecycle) updatePaymentState() (*channeldb.MPPayment,


updatePaymentState makes it semes like we're mutating the state on disk. How about fetchPaymentState instead?

Also some small grammatical changes to the comment:

updatePaymentState will query the db for the latest payment state information we need to act on every iteration of the payment loop and update the paymentState.

Agreed, now fixed!

Roasbeef · 2021-06-24T00:02:21Z

routing/payment_lifecycle.go

-		// a terminal state (settled or permanent failure), while we
-		// were pathfinding. We know we're in a terminal state here,
-		// so we can continue and wait for our last shards to return.
-		case err == channeldb.ErrPaymentTerminal:


Why don't we need to handle this terminal payment state any longer?

Actually we do need. This was deleted when the previous change locked the payment state so we wouldn't hit this error. This is now put back.

Roasbeef · 2021-06-24T00:07:03Z

routing/payment_lifecycle.go

-			case <-p.router.quit:
-			case <-p.quit:
-			}
+			// Overwrite errToSend and return.


May be worth mentioning here (to make the control flow clearer) that we're just setting the "return param" for the defer above (handleResultErr).

Roasbeef · 2021-06-24T00:07:16Z

routing/payment_lifecycle.go

@@ -456,32 +493,18 @@ func (p *shardHandler) collectResultAsync(attempt *channeldb.HTLCAttemptInfo) {
 					attempt.AttemptID, p.identifier, err)
 			}

-			select {
-			case p.shardErrors <- err:


Nice usage of the defer here to get rid of that duplication!

Roasbeef · 2021-06-24T00:10:36Z

routing/payment_lifecycle_test.go

+	testCases := []struct {
+		name string
+
+		// Use the following three params, each is equivalent to a bool


Roasbeef · 2021-06-24T00:24:04Z

routing/router_test.go

+	var failedAttempt channeldb.HTLCAttempt
+	controlTower.On("FailAttempt",
+		identifier, mock.Anything, mock.Anything,
+	).Return(&failedAttempt, nil).Run(func(args mock.Arguments) {


Interesting alternative to the way we typically do things (sort of create the mocks as normal struct and "re-implement" any logic not able to be mocked out cleanly). This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".

This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".

Agreed that there are unnessary (imo) side effects used in these mock functions. Ideally it should just be mock.On("method").Return(something). A lot of the extra lines are used to contruct a wanted payment state. If we could somehow mock currentState.terminated() and currentState.needWaitForShards() then the test would be much easier. I guess this will be a future improvement.

Roasbeef · 2021-06-25T01:41:14Z

Travis runs look good other than the shutdown issue, and we're running into that reported timeout in one of the switch tests (unrelated to these changes), will land as is.

yyforyongyu requested a review from Roasbeef May 26, 2021 13:05

yyforyongyu force-pushed the 5259-routing-fix-state branch 2 times, most recently from 846a62c to e7ecb4e Compare May 26, 2021 16:07

Roasbeef requested a review from halseth May 27, 2021 01:31

yyforyongyu force-pushed the 5259-routing-fix-state branch from e7ecb4e to 13325b5 Compare May 27, 2021 06:56

halseth suggested changes May 27, 2021

View reviewed changes

routing/payment_lifecycle.go Outdated Show resolved Hide resolved

yyforyongyu commented May 27, 2021

View reviewed changes

yyforyongyu requested a review from halseth May 27, 2021 09:47

yyforyongyu force-pushed the 5259-routing-fix-state branch 4 times, most recently from 2e2b65d to 831602b Compare May 28, 2021 09:08

halseth reviewed Jun 1, 2021

View reviewed changes

routing/payment_lifecycle.go Show resolved Hide resolved

routing/mock_test.go Outdated Show resolved Hide resolved

routing/payment_lifecycle_test.go Show resolved Hide resolved

yyforyongyu force-pushed the 5259-routing-fix-state branch from 831602b to 253a8bd Compare June 3, 2021 11:39

yyforyongyu requested a review from halseth June 3, 2021 11:50

halseth approved these changes Jun 4, 2021

View reviewed changes

Crypt-iQ added routing testing Improvements/modifications to the test suite labels Jun 21, 2021

yyforyongyu and others added 9 commits June 23, 2021 18:13

routing: make payment lifecycle test more verbose

9062ab1

routing: add private key attribute in router test

fc113c7

routing: fix TestSendPaymentErrorRepeatedFeeInsufficient

242a844

The simulated error returned was rejected due to signature failure, and didn't simulate correctly the insufficient fees error as intended. Fix error by including correct signature.

routing: test private edge on fee error

37d0e21

routing: parse Channel ID from json file

ae6d8a9

routing: rm hardcoded channel id in router test

8172811

routing: use require in router test

54aacac

This commit refactors some of the tests in router_test.go to use the require package.

routing: use shardHandler to process err in SendToRoute

1656611

routing: move sendErr handling in shardHandler

cf2b574

This commit moves the handleSendError method from ChannelRouter to shardHandler. In doing so, shardHandler can now apply updates to the in-memory paymentSession if they are found in the error message.

yyforyongyu added 4 commits June 23, 2021 18:13

routing: refactor TestSendPaymentErrorFeeInsufficientPrivateEdge

e05b78f

routing: add TestSendPaymentPrivateEdgeUpdateFeeExceedsLimit

735e89c

yyforyongyu force-pushed the 5259-routing-fix-state branch 2 times, most recently from e6a430a to 33caf07 Compare June 23, 2021 10:44

yyforyongyu added 8 commits June 23, 2021 20:35

itest: moving routing related tests into one file

e10bd84

itest: fix make lint

319cc53

itest: test channel policy update in private channels

193d149

itest: silence depreciation warnings

54ed6c2

routing: rename mock structs to make them obsolete

289d97f

This commit renames the mock structs by appending Old in their names. In doing so the old tests stay unchanged and new mock structs can be added in the following commit.

mod: add package stretchr/testify mock

f5de56a

routing: add mock structs for testing

e79e46e

This commit uses the package mock to create new mock structs, replacing the old ones for better control when writing tests.

yyforyongyu force-pushed the 5259-routing-fix-state branch from 33caf07 to cd35981 Compare June 23, 2021 12:37

Roasbeef approved these changes Jun 24, 2021

View reviewed changes

Roasbeef added this to the 0.13.1 milestone Jun 24, 2021

Roasbeef merged commit 198ac34 into lightningnetwork:master Jun 25, 2021

yyforyongyu mentioned this pull request Jun 25, 2021

routing: rename method and add err check when launch shard #5431

Merged

alexbosworth mentioned this pull request Jul 6, 2021

Payment failures are no longer populated as structured errors, payment failure details not returned #5477

Closed

Crypt-iQ mentioned this pull request Oct 20, 2021

Explicitly allow funding_locked early, and support alias scids (feat 46/47/50/51) lightning/bolts#910

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

routing: fix payment state and refactor payment lifecycle tests #5332

routing: fix payment state and refactor payment lifecycle tests #5332

yyforyongyu commented May 26, 2021 •

edited

Loading

Roasbeef commented May 27, 2021

yyforyongyu May 27, 2021

halseth May 27, 2021

yyforyongyu May 27, 2021

halseth May 27, 2021

yyforyongyu May 27, 2021

halseth Jun 1, 2021

Roasbeef commented Jun 23, 2021

Roasbeef commented Jun 23, 2021

yyforyongyu commented Jun 23, 2021

Roasbeef left a comment

Roasbeef Jun 23, 2021

Roasbeef Jun 23, 2021

yyforyongyu Jun 24, 2021

Roasbeef Jun 23, 2021

yyforyongyu Jun 24, 2021

Roasbeef Jun 24, 2021

yyforyongyu Jun 24, 2021

Roasbeef Jun 24, 2021

yyforyongyu Jun 24, 2021

Roasbeef Jun 24, 2021

Roasbeef Jun 24, 2021

Roasbeef Jun 24, 2021

yyforyongyu Jun 24, 2021

Roasbeef commented Jun 25, 2021

	// Use the DB state to determine the status of the payment.
	var paymentStatus PaymentStatus

	switch {

	// If any of the the HTLCs did succeed and there are no HTLCs in
	// flight, the payment succeeded.
	case !inflight && settled:
	paymentStatus = StatusSucceeded

	// If we have no in-flight HTLCs, and the payment failure is set, the
	// payment is considered failed.
	case !inflight && failureReason != nil:
	paymentStatus = StatusFailed

	// Otherwise it is still in flight.
	default:
	paymentStatus = StatusInFlight
	}

	func (p *controlTower) FetchPayment(paymentHash lntypes.Hash) (
	*channeldb.MPPayment, error) {

	return p.db.FetchPayment(paymentHash)
	}

routing: fix payment state and refactor payment lifecycle tests #5332

routing: fix payment state and refactor payment lifecycle tests #5332

Conversation

yyforyongyu commented May 26, 2021 • edited Loading

Roasbeef commented May 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Roasbeef commented Jun 23, 2021

Roasbeef commented Jun 23, 2021

yyforyongyu commented Jun 23, 2021

Roasbeef left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Roasbeef commented Jun 25, 2021

yyforyongyu commented May 26, 2021 •

edited

Loading