Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

routing: fix payment state and refactor payment lifecycle tests #5332

Merged
merged 21 commits into from
Jun 25, 2021

Conversation

yyforyongyu
Copy link
Collaborator

@yyforyongyu yyforyongyu commented May 26, 2021

Depends on #5218, only the last 7 commits last 4 commits are relevant.

This PR fixes a potential issue that causes inconsistency in the paymentState. In particular, the paymentState.terminate might become inconsistent with the rest. A new mock library is added to assist the unit testing and be further applied in refactoring the old MPP-related tests.

PS, is there a way to view the diff of a PR based on another PR on github?

@yyforyongyu yyforyongyu requested a review from Roasbeef May 26, 2021 13:05
@yyforyongyu yyforyongyu force-pushed the 5259-routing-fix-state branch 2 times, most recently from 846a62c to e7ecb4e Compare May 26, 2021 16:07
@Roasbeef
Copy link
Member

PS, is there a way to view the diff of a PR based on another PR on github?

Unfortunately no :(

Stuff like https://reviewable.io/ helps do that, but we never managed to make the full switch over.

routing/payment_lifecycle.go Outdated Show resolved Hide resolved
p.state.Lock()

// Block until the result is available.
result, err := sh.collectResult(attempt)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@halseth Suppose the payment state is,

terminate: false
remainingAmt: 0
numInflightShards: 2

There are two steps here,

  1. We fail the attempt inside the collectResult, this has the effect of marking one of the payment.HTLCs as failed, the payment state then becomes,
    terminate: false
    remainingAmt: 5000
    numInflightShards: 1
    
  2. Then calls handleSendError at line 116, which will fail the payment, and the state becomes,
    terminate: true
    remainingAmt: 5000
    numInflightShards: 1
    

Between step 1 and 2, other goroutines may act and change the state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I'm still trying to understand why changing the state between step 1 and 2 would be a problem though. Payment attempt outcome and overall payment outcome should ideally be decoupled enough for that not to be a problem.

What could happen between 1 and 2 that would lead to unexpected behavior?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reading this!
Tbh I am too trying to understand its implications. The above case could lead to the situation where we would fire one more shard in the next round of the lifecycle loop. Suppose this new one is settled, but it failed to update the payment in db, and the previous step 2 now finished, then we have a payment that's both settled and failed? (Don't know if it's possible tho, still needs to read more code on MMP part).

This is a case when I was refactoring the tests, data race came up because we were updating the payment in one goroutine, while reading it from another. Previously, whenever we construct the payment state, by calling payment.SentAmt(), payment.TerminalInfo(), etc, there is another goroutine modifying them.

I've also updated this PR, realized that there's a mutex used in control tower when we update the payment. That mutex can be used too when we read the payment state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the payment has both failed and succeed (which can happen if both happens simultaneously), the success takes precedence:

lnd/channeldb/payments.go

Lines 349 to 367 in 9163d1a

// Use the DB state to determine the status of the payment.
var paymentStatus PaymentStatus
switch {
// If any of the the HTLCs did succeed and there are no HTLCs in
// flight, the payment succeeded.
case !inflight && settled:
paymentStatus = StatusSucceeded
// If we have no in-flight HTLCs, and the payment failure is set, the
// payment is considered failed.
case !inflight && failureReason != nil:
paymentStatus = StatusFailed
// Otherwise it is still in flight.
default:
paymentStatus = StatusInFlight
}

I see what you mean that firing a new shard is unnecessary after the attempt is failed, but before we have marked the payment failed, but it should be fine.

Maybe the data race you saw is a missing mutex within the mock used during the test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further looked into it and yeah, it's a data race within the test. But this line got my attention tho, should there be a lock during the following read?

func (p *controlTower) FetchPayment(paymentHash lntypes.Hash) (
*channeldb.MPPayment, error) {
return p.db.FetchPayment(paymentHash)
}

I've updated the PR again. If the inconsistency is not an issue, then yeah there's no need to acquire a lock. The extra shard will then resolve itself out in the next round of lifecycle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the mutex is not needed here since it is only used when sending a notification.

@yyforyongyu yyforyongyu requested a review from halseth May 27, 2021 09:47
@yyforyongyu yyforyongyu force-pushed the 5259-routing-fix-state branch 4 times, most recently from 2e2b65d to 831602b Compare May 28, 2021 09:08
routing/payment_lifecycle.go Show resolved Hide resolved
routing/mock_test.go Outdated Show resolved Hide resolved
routing/payment_lifecycle_test.go Show resolved Hide resolved
@Crypt-iQ Crypt-iQ added routing testing Improvements/modifications to the test suite labels Jun 21, 2021
@Roasbeef
Copy link
Member

Needs a rebase!

@Roasbeef
Copy link
Member

Will be looking to merge this as a whole instead of #5218 then this since this builds upon the base PR cleanly.

yyforyongyu and others added 9 commits June 23, 2021 18:13
The simulated error returned was rejected due to signature failure,
and didn't simulate correctly the insufficient fees error as
intended. Fix error by including correct signature.
This commit refactors some of the tests in router_test.go to use the
require package.
This commit moves the handleSendError method from ChannelRouter to
shardHandler. In doing so, shardHandler can now apply updates to the
in-memory paymentSession if they are found in the error message.
This commit adds the method UpdateAdditionalEdge in PaymentSession,
which allows the addtional channel edge policy to be updated from a
ChannelUpdate message. Another method, GetAdditionalEdgePolicy is added
to allow querying additional edge policies.
This commit adds payment session to shardHandler to enable private edge
policies being updated in shardHandler. The relevant interface and mock
are updated. From now on, upon seeing a ChannelUpdate message,
shardHandler will first try to find the target policy in additionalEdges
and update it. If nothing found, it will then check the database for
edge policy to update.
@yyforyongyu yyforyongyu force-pushed the 5259-routing-fix-state branch 2 times, most recently from e6a430a to 33caf07 Compare June 23, 2021 10:44
This commit renames the mock structs by appending Old in their names. In
doing so the old tests stay unchanged and new mock structs can be added
in the following commit.
This commit uses the package mock to create new mock structs, replacing
the old ones for better control when writing tests.
This commit refactors the resumePayment to extract some logics back to
paymentState so that the code is more testable. It also adds unit tests
for paymentState, and breaks the original MPPayment tests into independent tests
so that it's easier to maintain and debug. All the new tests are built
using mock so that the control flow is eaiser to setup and change.
@yyforyongyu
Copy link
Collaborator Author

Needs a rebase!

Done!

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work! We definitely have a bit more work ahead of us here to make this section of the codebase more scrutable, but this is certainly a step in the right direction. I like the new style you've introduced here in the newly added tests where we work up our series of assertions, then asynchronously kick off the function under test and assert that things lined up with our expectations.

One lingering question re an additional case that was added in the paymentState struct, and if that may address some recently reported issues that have popped up.

LGTM 🍙

@@ -529,3 +530,182 @@ func (m *mockControlTowerOld) SubscribePayment(paymentHash lntypes.Hash) (

return nil, errors.New("not implemented")
}

type mockPaymentAttemptDispatcher struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice move re renaming the older versions to keep things in place and create a new base test abstraction for any future changes!

// If we have in flight shards and the payment is in final stage, we
// need to wait for the outcomes from the shards. Or if we have no more
// money to be sent, we need to wait for the already launched shards.
if ps.numShardsInFlight == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this represent a bug fix as the existing switch statement didn't exactly account for this?

(still getting through the diff)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would say so. When there are no shards in flight, there's no need to wait for outcomes from shards. When a new payment is created, we have zero shards thus there are no outcomes to wait for. When we are in the middle of a MPP, zero numShardsInFlight could mean three cases,

  1. all the HTLC attempts failed (h.Failure != nil). In this case, remainingAmt would be the full payment amount, terminate is false unless we have a "payment level" failure.
  2. all the HTLC attempts settled (h.Settle !=nil). In this case, remainingAmt would be 0, and terminate is true.
  3. a mix of settled and failed HTLC attempts coexist (bad). remaingAmt is not 0, but terminate is true.

And we need to check this logic together with terminated(), as we have the cases,

case currentState.terminated():
    ...
case currentState.needWaitForShards():
    ...

Now for case 2 and 3, because ps.terminate is true and ps.numShardsInFlight == 0, this means we would go to case currentState.terminated():.
For case 1, if the payment failed, it would also go to case currentState.terminated():, otherwise we would launch new shards.

That being said, the complexity shown here probably means we have some improvement to achieve. There are three states, ps.terminate == true, ps.remainingAmt == 0 and ps.numShardsInFlight == 0 to control the workflow. In theory, we should have 8 conditions to be checked. That's a bit much. Underneath the states, we use payment and HTLC attempts to control them. This is the part that brings complexity too. We have a payment with multiple HTLC attempts (one-to-many). Each of the HTLC has states, settled, failed, or in-flight. A settled HTLC means the payment has state settled, a failed HTLC does not. A payment has its own failed state determined by a different set of logic. I think this could be the root cause of all the complexity and unclear abstraction found in the routing package.

// updatePaymentState will fetch db for the payment to find the latest
// information we need to act on every iteration of the payment loop and update
// the paymentState.
func (p *paymentLifecycle) updatePaymentState() (*channeldb.MPPayment,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updatePaymentState makes it semes like we're mutating the state on disk. How about fetchPaymentState instead?

Also some small grammatical changes to the comment:

updatePaymentState will query the db for the latest payment state information we need to act on every iteration of the payment loop and update the paymentState.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, now fixed!

// a terminal state (settled or permanent failure), while we
// were pathfinding. We know we're in a terminal state here,
// so we can continue and wait for our last shards to return.
case err == channeldb.ErrPaymentTerminal:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to handle this terminal payment state any longer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we do need. This was deleted when the previous change locked the payment state so we wouldn't hit this error. This is now put back.

case <-p.router.quit:
case <-p.quit:
}
// Overwrite errToSend and return.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be worth mentioning here (to make the control flow clearer) that we're just setting the "return param" for the defer above (handleResultErr).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

@@ -456,32 +493,18 @@ func (p *shardHandler) collectResultAsync(attempt *channeldb.HTLCAttemptInfo) {
attempt.AttemptID, p.identifier, err)
}

select {
case p.shardErrors <- err:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice usage of the defer here to get rid of that duplication!

testCases := []struct {
name string

// Use the following three params, each is equivalent to a bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

var failedAttempt channeldb.HTLCAttempt
controlTower.On("FailAttempt",
identifier, mock.Anything, mock.Anything,
).Return(&failedAttempt, nil).Run(func(args mock.Arguments) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting alternative to the way we typically do things (sort of create the mocks as normal struct and "re-implement" any logic not able to be mocked out cleanly). This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".

Agreed that there are unnessary (imo) side effects used in these mock functions. Ideally it should just be mock.On("method").Return(something). A lot of the extra lines are used to contruct a wanted payment state. If we could somehow mock currentState.terminated() and currentState.needWaitForShards() then the test would be much easier. I guess this will be a future improvement.

@Roasbeef Roasbeef added this to the 0.13.1 milestone Jun 24, 2021
@Roasbeef
Copy link
Member

Travis runs look good other than the shutdown issue, and we're running into that reported timeout in one of the switch tests (unrelated to these changes), will land as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
routing testing Improvements/modifications to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants