
routing: failure attribution #3256

Merged
merged 7 commits into lightningnetwork:master from joostjager:failure-attribution on Aug 20, 2019

Conversation

@joostjager
Collaborator

commented Jun 28, 2019

This PR overhauls the interpretation of failed payments. It changes the interpretation rules so that we always apply the strongest possible set of penalties, without making assumptions that would hurt good nodes.

Main changes are:

  • Apply different rule sets for intermediate and final nodes. Both types of nodes have different sets of failures that we expect. Penalize nodes that send unexpected failure messages.

  • Distinguish between direct payments and multi-hop payments. For direct payments, we can infer more about the performance of our peer because we trust ourselves.

  • In many cases it is impossible for the sender to determine which of the two nodes in a pair is responsible for the failure. In this situation, we now penalize bidirectionally. This does not hurt the good node of the pair, because only its connection to a bad node is penalized.

  • Previously we always penalized the outgoing connection of the reporting node. This is incorrect for policy related failures: it could also be that the reporting node received a wrongly crafted htlc from its predecessor. By penalizing the incoming channel, we are certain to hit the responsible node.

  • FailExpiryTooSoon is a failure that could have been caused by any node up to the reporting node by delaying forwarding of the htlc. We don't know which node is responsible, therefore we now penalize all node pairs in the route. By doing this, we do hurt good nodes, but unfortunately there is currently no way to pinpoint just a single pair to penalize. A sketch of this pairwise penalization follows below.
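
For illustration, the bidirectional and route-wide penalization rules might look roughly like this. All names and types here are illustrative sketches, not lnd's actual mission control API:

```go
package main

import "fmt"

// nodePair identifies a directed connection between two adjacent nodes in
// a route. This is an illustrative type, not lnd's internal representation.
type nodePair struct {
	from, to string
}

// penalizePair penalizes both directions of the connection between two
// adjacent nodes. Since only the connection between them is hit, a good
// node's other channels remain unaffected.
func penalizePair(penalties map[nodePair]int, a, b string) {
	penalties[nodePair{from: a, to: b}]++
	penalties[nodePair{from: b, to: a}]++
}

// penalizeAllPairs handles failures like FailExpiryTooSoon, where any node
// up to the reporting node could be responsible: every adjacent pair in
// the route is penalized, guaranteeing the responsible node is hit.
func penalizeAllPairs(penalties map[nodePair]int, route []string) {
	for i := 0; i < len(route)-1; i++ {
		penalizePair(penalties, route[i], route[i+1])
	}
}

func main() {
	penalties := make(map[nodePair]int)
	penalizeAllPairs(penalties, []string{"self", "A", "B", "C"})
	fmt.Println(penalties) // each adjacent pair penalized in both directions
}
```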

@joostjager joostjager force-pushed the joostjager:failure-attribution branch from 50a8a0b to 4127a85 Jun 28, 2019

@joostjager

Collaborator Author

commented Jun 28, 2019

note to self: handle private channels

@joostjager joostjager force-pushed the joostjager:failure-attribution branch 2 times, most recently from 008ca87 to fc858ec Jun 28, 2019

@joostjager joostjager force-pushed the joostjager:failure-attribution branch from fc858ec to decbe12 Jul 29, 2019

@joostjager joostjager force-pushed the joostjager:failure-attribution branch 7 times, most recently from ced7aed to 40c997b Jul 29, 2019

@joostjager joostjager force-pushed the joostjager:failure-attribution branch 5 times, most recently from 4170868 to 2172b17 Aug 2, 2019

@joostjager joostjager force-pushed the joostjager:failure-attribution branch 3 times, most recently from a3c0e2d to 6587416 Aug 5, 2019

@Roasbeef

Member

commented Aug 16, 2019

On travis, the integration tests seem to stall out after (reset twice and same result):

=== RUN   TestLightningNetworkDaemon/sphinx_replay_persistence

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

@joostjager joostjager force-pushed the joostjager:failure-attribution branch 3 times, most recently from 890e1d8 to 10a9751 Aug 16, 2019

@joostjager

Collaborator Author

commented Aug 16, 2019

This was a very useful test failure. Previously I assumed that any failures originating from self would be temporary and caused by the gap between path finding and payment execution. But there is also the case where our peer sends us a malformed htlc failure that is converted into a local error.

I changed the failure that is returned in that case (commit "htlcswitch: always assume an onion error for malformed htlc failures"); see the commit message for an explanation. I also added an exception for handling self failures during result interpretation.

@joostjager

Collaborator Author

commented Aug 16, 2019

private_channels itest now failing, going to look into that.

@joostjager joostjager force-pushed the joostjager:failure-attribution branch from 10a9751 to b84c303 Aug 17, 2019

@joostjager

Collaborator Author

commented Aug 17, 2019

private_channels was another interesting failing test: it surfaced that we only query bandwidth hints once per payment. In this test, two concurrent payments are executed. Because we stop tracking our own reputation in this PR, not updating the bandwidth hints led to an endless loop. Added a commit to update the bandwidth hints with every attempt.

@joostjager joostjager force-pushed the joostjager:failure-attribution branch from b84c303 to fe94e56 Aug 17, 2019

joostjager added 7 commits Aug 5, 2019
routing: convert to nillable failure reason
This commit converts several functions from returning a bool and a
failure reason to a nillable failure reason as return parameter. This
will take away confusion about the interpretation of the two separate
values.
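
To illustrate the shape of this change (the function name and stand-in type below are hypothetical, not the actual lnd code):

```go
package main

import "fmt"

// failureReason is a stand-in for lnd's channeldb.FailureReason.
type failureReason int

const failureReasonNoRoute failureReason = 0

// Before, callers had to interpret a (bool, failureReason) pair, which
// invited confusion about what the reason meant when the bool was false:
//
//	func processPaymentOutcome(attemptsLeft int) (bool, failureReason)
//
// After, a single nillable return value carries both pieces of
// information: nil means the payment may be retried, non-nil is the
// definitive failure reason.
func processPaymentOutcome(attemptsLeft int) *failureReason {
	if attemptsLeft > 0 {
		return nil // not a final failure, keep trying
	}
	reason := failureReasonNoRoute
	return &reason
}

func main() {
	if reason := processPaymentOutcome(0); reason != nil {
		fmt.Println("payment failed permanently:", *reason)
	}
}
```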
routing: isolate failure interpretation from mission control
This commit moves the payment outcome interpretation logic into a
separate file. Also, mission control isn't updated directly anymore, but
results are stored in an interpretedResult struct. This allows the
mission control state to be locked for a minimum amount of time and
makes it easier to unit test the result interpretation.
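
A rough sketch of the structure this describes, with hypothetical names (lnd's actual interpretedResult differs in detail):

```go
package main

import (
	"fmt"
	"sync"
)

// interpretedResult accumulates the conclusions drawn from one payment
// outcome. Interpretation happens entirely on this value, without holding
// any mission control lock, which also makes it easy to unit test.
type interpretedResult struct {
	pairResults map[string]bool // pair key -> whether the pair performed
}

// missionControl guards the shared reputation state with a mutex.
type missionControl struct {
	mu      sync.Mutex
	history map[string]bool
}

// applyResult merges a fully interpreted result into mission control,
// holding the lock only for this short copy.
func (m *missionControl) applyResult(r *interpretedResult) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for pair, success := range r.pairResults {
		m.history[pair] = success
	}
}

func main() {
	mc := &missionControl{history: make(map[string]bool)}
	mc.applyResult(&interpretedResult{
		pairResults: map[string]bool{"A->B": false},
	})
	fmt.Println(mc.history)
}
```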
routing/test: do not test local channel mission control
This commit updates existing tests to not rely on mission control for
pruning of local channels. Information about local channels should
already be up to date before path finding starts. If not, the problem
should be fixed where bandwidth hints are set up.
routing: penalize all node pairs for unknown outcomes
When an undecryptable failure comes back for a payment attempt, we
previously only penalized our own outgoing connection. However,
any node could have caused this failure. It is therefore better to
penalize all node connections along the route. Then at least we know for
sure that we will hit the responsible node.
htlcswitch: always assume an onion error for malformed htlc failures
Previously a temporary channel failure was returned for unexpected
malformed htlc failures. This is not what we want to communicate to the
sender, because the sender may apply a penalty to us only.

Returning the temporary channel failure is especially problematic if we
ourselves are the sender and the malformed htlc failure comes from our
direct peer. When interpreting the failure, we can no longer distinguish
between our channel not having enough balance and our peer sending an
unexpected failure back.
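
A simplified sketch of the idea (types and names are invented for illustration; the real handling lives in htlcswitch):

```go
package main

import "fmt"

// attributedFailure says which hop a failure is attributed to. Hop 0 is
// ourselves, hop 1 our direct peer, and so on. Illustrative only.
type attributedFailure struct {
	sourceHop int
	code      string
}

// convertMalformedFailure attributes an unexpected
// update_fail_malformed_htlc to the peer that sent it, as if it were an
// onion error from that hop. Returning a local temporary channel failure
// instead (the old behavior) would let the sender penalize us, and when
// we are the sender ourselves, it becomes indistinguishable from our own
// channel lacking balance.
func convertMalformedFailure(peerHop int, code string) attributedFailure {
	return attributedFailure{sourceHop: peerHop, code: code}
}

func main() {
	f := convertMalformedFailure(1, "invalid_onion_key")
	fmt.Printf("penalize from hop %d: %s\n", f.sourceHop, f.code)
}
```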
routing: query bandwidth hints before each payment attempt
Previously the bandwidth hints were only queried once per payment. This
did not allow for concurrent payments changing channel balances.
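
The difference can be sketched as follows (function names and structure are illustrative, not the actual router code):

```go
package main

import (
	"errors"
	"fmt"
)

// queryBandwidthHints is a stand-in for asking the switch for the current
// usable balance of each local channel.
func queryBandwidthHints(balances map[string]int) map[string]int {
	hints := make(map[string]int, len(balances))
	for ch, b := range balances {
		hints[ch] = b
	}
	return hints
}

// findRoute picks any local channel with enough balance (illustrative).
func findRoute(amt int, hints map[string]int) (string, bool) {
	for ch, b := range hints {
		if b >= amt {
			return ch, true
		}
	}
	return "", false
}

// sendPayment re-queries the bandwidth hints on every attempt. With hints
// queried only once per payment (the old behavior), a concurrent payment
// that drains a channel after the initial query would leave this loop
// retrying the same depleted channel indefinitely.
func sendPayment(amt int, balances map[string]int, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		hints := queryBandwidthHints(balances) // fresh view per attempt
		route, ok := findRoute(amt, hints)
		if !ok {
			return errors.New("no route with sufficient balance")
		}
		fmt.Println("attempting payment via", route)
		// ... send the HTLC; on a retryable failure, continue the loop.
		return nil
	}
	return errors.New("max attempts exceeded")
}

func main() {
	fmt.Println(sendPayment(50, map[string]int{"chan-1": 100}, 3))
}
```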
routing: stricter payment result interpretation
This commit overhauls the interpretation of failed payments. It changes
the interpretation rules so that we always apply the strongest possible
set of penalties, without making assumptions that would hurt good nodes.

Main changes are:

- Apply different rule sets for intermediate and final nodes. Both types
of nodes have different sets of failures that we expect. Penalize nodes
that send unexpected failure messages.

- Distinguish between direct payments and multi-hop payments. For direct
payments, we can infer more about the performance of our peer because we
trust ourselves.

- In many cases it is impossible for the sender to determine which of
the two nodes in a pair is responsible for the failure. In this
situation, we now penalize bidirectionally. This does not hurt the good
node of the pair, because only its connection to a bad node is
penalized.

- Previously we always penalized the outgoing connection of the
reporting node. This is incorrect for policy related failures. For
policy related failures, it could also be that the reporting node
received a wrongly crafted htlc from its predecessor. By penalizing the
incoming channel, we surely hit the responsible node.

- FailExpiryTooSoon is a failure that could have been caused by any node
up to the reporting node by delaying forwarding of the htlc. We don't
know which node is responsible, therefore we now penalize all node pairs
in the route.

@joostjager joostjager force-pushed the joostjager:failure-attribution branch from fe94e56 to d9ec158 Aug 17, 2019

@joostjager joostjager requested a review from Roasbeef Aug 19, 2019

@joostjager joostjager removed the incomplete label Aug 19, 2019

@cfromknecht
Collaborator

left a comment

Because we stop tracking our own reputation in this pr, not updating the bandwidth hints led to an endless loop.

Is this infinite loop still possible if an outdated node sends back tempchanfailure instead of invalidonionkey?

@Roasbeef

Member

commented Aug 20, 2019

In this test, two concurrent payments are executed. Because we stop tracking our own reputation in this pr,

Doesn't this series of introduced bugs show that maybe we shouldn't ignore our own channels at the router level? These isolated cases in our existing integration tests have been resolved, but who's to say there aren't other regressions waiting?

@Roasbeef

Member

commented Aug 20, 2019

Because we stop tracking our own reputation in this pr, not updating the bandwidth hints led to an endless loop.

Why does it loop endlessly rather than just fail? (tentative LGTM...but want to know why this happened)

@joostjager

Collaborator Author

commented Aug 20, 2019

Is this infinite loop still possible if an outdated node sends back tempchanfailure instead of invalidonionkey?

No, because the infinite loop could only happen when we got a tempchanfail from the self node. The self node cannot be outdated. For all other nodes on the route, we apply penalties.

Doesn't this series of bugs introduced show that maybe we shouldn't ignore our own channels at the router level? These isolated cases in our existing integration tests have been resolved, but whose to say there aren't other regressions awaiting?

Imo it shows that bugs have been able to go undetected because penalizing our own channels created a vaguely defined safety net. Fixing those bugs doesn't just serve the changes in this PR, it genuinely improves lnd's performance. For example, querying bandwidth hints only once per payment may have been causing intermittent payment failures in settings with concurrent payments.

I am still convinced that we should move forward with not penalizing ourselves. Imo there isn't a fundamental reason why we would need to do that, and it creates a complex interaction between the switch, (global) mission control and concurrent payments that is hard to understand.

I do agree that other latent bugs may surface because of this change. In that case we will log a warning, but an endless loop could still occur. That's not the worst thing that can happen, and with a one minute payment loop timeout it doesn't last that long, but it isn't pretty.

We could add some kind of circuit breaker that, for example, terminates the payment loop and logs an error if the same route is attempted more than three times. But it needs to be kept in mind that three consecutive failures with identical routes could also happen without any bugs: each time, a channel has balance during path finding but no longer during execution (because of concurrent payments), and a payment then comes in to refill the channel before the next path finding round.

Also, the circuit breaker itself may again hide bugs because users ignore the error in the log. And it may never trip because there are no more bugs.
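
For illustration only, such a circuit breaker might look roughly like this (hypothetical; nothing like it is added in this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// routeBreaker aborts the payment loop when the exact same route has been
// attempted too many times in a row, which usually indicates a bug in
// result interpretation rather than normal balance churn.
type routeBreaker struct {
	lastRoute string
	repeats   int
	limit     int
}

// observe returns an error once the identical route has been tried more
// than limit consecutive times.
func (b *routeBreaker) observe(route string) error {
	if route == b.lastRoute {
		b.repeats++
	} else {
		b.lastRoute = route
		b.repeats = 1
	}
	if b.repeats > b.limit {
		return errors.New("circuit breaker: identical route retried too often")
	}
	return nil
}

func main() {
	b := &routeBreaker{limit: 3}
	for i := 0; i < 5; i++ {
		if err := b.observe("self->A->B"); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```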

I tend towards reviewing the local dispatch execution path once more for any missed self-failure cases and then merging this without protection. What are your thoughts?

@joostjager

Collaborator Author

commented Aug 20, 2019

Why does it loop endlessly rather than just fail? (tentative LGTM...but want to know why this happened)

It queried the bandwidth hints once at the start of the payment process. Bandwidth was sufficient at that point. Then a concurrent payment took the balance. The first payment process then started its attempt loop. An endless loop occurred because it tried the same depleted channel over and over again without updating the bandwidth hints. If it had done so, it would have known that the channel had been emptied and would have given up.

@joostjager

Collaborator Author

commented Aug 20, 2019

Traced the local dispatch/response path and checked all locally generated onion failure messages in switch.handleLocalDispatch, link.handleDownStreamPkt, link.handleUpstreamMsg and switch.GetPaymentResult (and the functions they call). Error cases can be divided into four groups:

  1. Errors that fail the link -> will be picked up in next path finding round
  2. Errors that are checked for during path finding too (for example max_htlc) -> will be picked up in next path finding round
  3. Unexpected errors that should not happen. Examples:
    • We can't deserialize our own locally serialized failure
    • channel.AddHtlc fails (no overflow)
    • Can not get our own channel policy
      -> may trigger retry or an endless loop
  4. Errors that could happen, but aren't picked up during path finding -> I found none of those

Assuming my assessment of this is correct, a potential endless loop can only occur for errors that shouldn't happen (group 3). I think that is acceptable. An endless loop (until timeout) isn't the worst that can happen, and in those cases there may be instability in other areas of lnd as well.

Further into the future, we may want to reevaluate how we handle those unexpected errors: better define how fail-over behavior should be implemented, for example by failing the link instead of just logging an error, or alternatively don't support fail-over at all and exit lnd completely in case of an undefined state.

@Roasbeef

Member

commented Aug 20, 2019

Thanks for investigating this in detail @joostjager! In the wild, it seems that even if we encounter something within the 3rd category, it won't loop endlessly as we still have the payment timeout parameter. In any case, as you pointed out, re-evaluating the bandwidth hints each time should address any issues that can arise today with multiple concurrent payments, which'll really help multi-user lnd instances such as some of the popular custodial wallets out there.

@Roasbeef
Member

left a comment

LGTM 👽

@Roasbeef Roasbeef merged commit 2f8d3c4 into lightningnetwork:master Aug 20, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.02%) to 60.945%