routing: failure attribution #3256
Conversation
Force-pushed from 50a8a0b to 4127a85
note to self: handle private channels
Force-pushed from 008ca87 to fc858ec
Force-pushed from fc858ec to decbe12
Force-pushed from ced7aed to 40c997b
Force-pushed from 4170868 to 2172b17
Force-pushed from a3c0e2d to 6587416
Force-pushed from 10a9751 to b84c303
Force-pushed from b84c303 to fe94e56
This commit converts several functions from returning a bool and a failure reason to a nillable failure reason as return parameter. This will take away confusion about the interpretation of the two separate values.
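A minimal Go sketch of the signature change described above. The type and function names are hypothetical, not lnd's actual API; the point is that a single nillable return value admits only two states, where the old (bool, reason) pair allowed confusing combinations.

```go
package main

import "fmt"

// FailureReason is an illustrative stand-in for lnd's failure reason type.
type FailureReason int

const (
	FailureReasonError FailureReason = iota
	FailureReasonIncorrectDetails
)

// Before the change, callers had to interpret two values:
//
//	func interpretFailure(...) (final bool, reason FailureReason)
//
// which allows ambiguous combinations such as final=false with a
// populated reason. After the change, nil means "not a terminal
// failure"; a non-nil pointer carries the terminal failure reason.
func interpretFailure(terminal bool) *FailureReason {
	if !terminal {
		return nil
	}
	reason := FailureReasonIncorrectDetails
	return &reason
}

func main() {
	if r := interpretFailure(true); r != nil {
		fmt.Println("terminal failure, reason:", *r)
	}
	fmt.Println("retryable:", interpretFailure(false) == nil)
}
```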
This commit moves the payment outcome interpretation logic into a separate file. Also, mission control isn't updated directly anymore, but results are stored in an interpretedResult struct. This allows the mission control state to be locked for a minimum amount of time and makes it easier to unit test the result interpretation.
This commit updates existing tests to not rely on mission control for pruning of local channels. Information about local channels should already be up to date before path finding starts. If not, the problem should be fixed where bandwidth hints are set up.
When an undecryptable failure comes back for a payment attempt, we previously only penalized our own outgoing connection. However, any node could have caused this failure. It is therefore better to penalize all node connections along the route. Then at least we know for sure that we will hit the responsible node.
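The idea can be sketched as follows (a simplified illustration, not lnd's actual code): when the failure source is unknown, every adjacent node pair along the route is penalized, which guarantees the responsible node is among them.

```go
package main

import "fmt"

// pairsToPenalize returns every adjacent (incoming, outgoing) node
// pair along a route. When a failure cannot be decrypted, any hop may
// be the culprit, so all pairs are penalized.
func pairsToPenalize(route []string) [][2]string {
	var pairs [][2]string
	for i := 0; i+1 < len(route); i++ {
		pairs = append(pairs, [2]string{route[i], route[i+1]})
	}
	return pairs
}

func main() {
	route := []string{"self", "hop1", "hop2", "dest"}
	for _, p := range pairsToPenalize(route) {
		fmt.Printf("penalize %s -> %s\n", p[0], p[1])
	}
}
```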
Previously a temporary channel failure was returned for unexpected malformed htlc failures. This is not what we want to communicate to the sender, because the sender may apply a penalty to us only. Returning the temporary channel failure is especially problematic if we ourselves are the sender and the malformed htlc failure comes from our direct peer. When interpreting the failure, we would no longer be able to distinguish between our channel not having enough balance and our peer sending back an unexpected failure.
Previously the bandwidth hints were only queried once per payment. This did not allow for concurrent payments changing channel balances.
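The fix can be illustrated with a toy attempt loop (all names and values are hypothetical, not lnd's internals): the bandwidth hint is re-queried on every attempt, so a channel depleted by a concurrent payment is noticed and the loop terminates instead of retrying forever.

```go
package main

import (
	"errors"
	"fmt"
)

// Simulated local channel balance that changes between attempts
// because a concurrent payment spends 60k mid-flight.
var balances = []int64{100_000, 40_000}
var call int

func queryBandwidth() int64 {
	b := balances[call]
	if call < len(balances)-1 {
		call++
	}
	return b
}

// sendPayment re-queries the bandwidth hint on every attempt instead
// of once up front. With a stale hint, a depleted channel would be
// retried endlessly; with a fresh hint, we give up cleanly.
func sendPayment(amt int64, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		if queryBandwidth() < amt {
			return errors.New("no route: insufficient local balance")
		}
		// A real implementation would dispatch the HTLC here and
		// interpret the failure; we simulate a failed attempt by
		// simply looping again.
	}
	return errors.New("attempts exhausted")
}

func main() {
	fmt.Println(sendPayment(50_000, 10))
}
```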
This commit overhauls the interpretation of failed payments. It changes the interpretation rules so that we always apply the strongest possible set of penalties, without making assumptions that would hurt good nodes. Main changes are:
- Apply different rule sets for intermediate and final nodes. Both types of nodes have different sets of failures that we expect. Penalize nodes that send unexpected failure messages.
- Distinguish between direct payments and multi-hop payments. For direct payments, we can infer more about the performance of our peer because we trust ourselves.
- In many cases it is impossible for the sender to determine which of the two nodes in a pair is responsible for the failure. In this situation, we now penalize bidirectionally. This does not hurt the good node of the pair, because only its connection to a bad node is penalized.
- Previously we always penalized the outgoing connection of the reporting node. This is incorrect for policy related failures. For policy related failures, it could also be that the reporting node received a wrongly crafted htlc from its predecessor. By penalizing the incoming channel, we surely hit the responsible node.
- FailExpiryTooSoon is a failure that could have been caused by any node up to the reporting node by delaying forwarding of the htlc. We don't know which node is responsible, therefore we now penalize all node pairs in the route.
Force-pushed from fe94e56 to d9ec158
cfromknecht left a comment:
Because we stop tracking our own reputation in this PR, not updating the bandwidth hints led to an endless loop.
Is this infinite loop still possible if an outdated node sends back tempchanfailure instead of invalidonionkey?
Doesn't this series of introduced bugs show that maybe we shouldn't ignore our own channels at the router level? These isolated cases in our existing integration tests have been resolved, but who's to say there aren't other regressions waiting?
Why does it loop endlessly rather than just fail? (Tentative LGTM, but I want to know why this happened.)
No, because the infinite loop could only happen when we got a
Imo it shows that bugs have been able to go undetected because we penalize our own channels to create a vaguely defined safety net. Fixing those bugs doesn't just serve the changes in this PR, but improves performance more broadly.

I am still convinced that we should move forward with not penalizing ourselves. Imo there isn't a fundamental reason why we would need to do that, and it creates a complex interaction between the switch, (global) mission control and concurrent payments that is hard to understand.

I do agree that other latent bugs may surface because of this change. In that case we will log a warning, but it could be that an endless loop occurs. Not the worst thing that can happen, and with a one-minute payment loop timeout it doesn't last that long, but it isn't pretty.

We could add some kind of circuit breaker that, for example, terminates the payment loop and logs an error if the same route is attempted more than three times. But keep in mind that three consecutive failures with identical routes could also happen without any bugs: if every time a channel has balance during path finding, but not anymore during execution (because of concurrent payments), AND a payment comes in thereafter to refill the channel before the next path finding round. Also, the circuit breaker itself may again hide bugs because users ignore the error in the log. And it may never trip because there are no more bugs.

I tend towards reviewing the local dispatch execution path once more for any missed self-failure cases and then merging this without protection. What are your thoughts?
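The circuit breaker floated in the comment above could look something like this sketch (entirely hypothetical, not part of the PR): trip after the same route has been attempted more than a fixed number of times in a row. As the comment notes, identical consecutive failures can also be legitimate, so tripping should at most abort the payment loop and log an error.

```go
package main

import "fmt"

// routeBreaker trips when the same route is attempted more than
// `limit` times consecutively. All names are hypothetical.
type routeBreaker struct {
	limit     int
	lastRoute string
	count     int
}

// record registers an attempted route and reports whether the
// breaker has tripped.
func (b *routeBreaker) record(route string) bool {
	if route == b.lastRoute {
		b.count++
	} else {
		b.lastRoute, b.count = route, 1
	}
	return b.count > b.limit
}

func main() {
	b := &routeBreaker{limit: 3}
	for attempt := 1; attempt <= 4; attempt++ {
		if b.record("a->b->c") {
			fmt.Printf("breaker tripped on attempt %d\n", attempt)
		}
	}
}
```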
It queried the bandwidth hints once at the start of the payment process. Bandwidth was sufficient at that point. Then a concurrent payment took the balance. The first payment process then started its attempt loop. An endless loop occurred because it tried the same depleted channel over and over again without updating the bandwidth hints. Had it updated them, it would have known that the channel had been emptied and would have given up.
Traced the local dispatch/response path and checked all locally generated onion failure messages in
Assuming my assessment of this is correct, a potential endless loop can only occur for errors that shouldn't happen (category 3). I think that is acceptable. An endless loop (until timeout) isn't the worst that can happen, and in those cases there may be instability in other areas as well.

Further into the future, we may want to reevaluate how we handle those unexpected errors: better define how fail-over behavior should be implemented, for example by failing the link instead of just logging an error. Or alternatively, don't support fail-over at all and exit.
Thanks for investigating this in detail @joostjager! In the wild, it seems that even if we encounter something within the third category, it won't loop endlessly as we still have the payment timeout parameter. In any case, as you pointed out, re-evaluating the bandwidth hints each time should address any issues that can arise today with multiple concurrent payments, which'll really help multi-user
Roasbeef left a comment:
LGTM 👽
Hi @joostjager. I was going through the penalization mechanism for failed payments. I don't find any mention of it in the BOLTs (sorry, maybe I have missed it). I see this has already been pushed to the lnd codebase, so is it somehow working inherently? Can you elaborate on how it works? How are you deciding the penalty value?
There is no BOLT for the penalization mechanism. It is up to the implementation team to decide.
@joostjager do you have any documentation on how this would work, if implemented? And how does the penalty value get decided for each category of failure?
There is just the code and the comments in the code. You'll have to figure it out from there.
Another resource is my presentation at lnconf 2019 in Berlin: https://twitter.com/joostjgr/status/1186177262238031872
Thanks for sharing.
This PR overhauls the interpretation of failed payments. It changes the interpretation rules so that we always apply the strongest possible set of penalties, without making assumptions that would hurt good nodes.
Main changes are:
- Apply different rule sets for intermediate and final nodes. Both types of nodes have different sets of failures that we expect. Penalize nodes that send unexpected failure messages.
- Distinguish between direct payments and multi-hop payments. For direct payments, we can infer more about the performance of our peer because we trust ourselves.
- In many cases it is impossible for the sender to determine which of the two nodes in a pair is responsible for the failure. In this situation, we now penalize bidirectionally. This does not hurt the good node of the pair, because only its connection to a bad node is penalized.
- Previously we always penalized the outgoing connection of the reporting node. This is incorrect for policy related failures. For policy related failures, it could also be that the reporting node received a wrongly crafted htlc from its predecessor. By penalizing the incoming channel, we surely hit the responsible node.
- FailExpiryTooSoon is a failure that could have been caused by any node up to the reporting node by delaying forwarding of the htlc. We don't know which node is responsible, therefore we now penalize all node pairs in the route. By doing this, we do hurt good nodes. But unfortunately there is currently no way to pinpoint just a single pair to penalize.
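The bidirectional penalization described above can be sketched as follows (illustrative names only, not lnd's mission control API): both directions of the ambiguous pair are penalized, but a good node's other connections are untouched.

```go
package main

import "fmt"

// penalizePair records a failure bidirectionally: when the sender
// cannot tell which node of a pair misbehaved, both directions of
// that pair are penalized. Only the connection between the two
// nodes is affected; the good node's other pairs keep their
// reputation.
func penalizePair(history map[string]int, a, b string) {
	history[a+"->"+b]++
	history[b+"->"+a]++
}

func main() {
	history := make(map[string]int)
	penalizePair(history, "carol", "mallory")
	fmt.Println(history["carol->mallory"], history["mallory->carol"])
	// carol's other pairs, e.g. carol->dave, remain unpenalized.
	fmt.Println(history["carol->dave"])
}
```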