routing+routerrpc: improve prob. estimation for untried connections #3462
Roasbeef merged 7 commits into lightningnetwork:master
Conversation
(force-pushed 95a93c8 to 4d4e209)
(force-pushed b93992e to d80f08c)
Did a bit of performance testing with and without these changes. With a clean mission control history, there is no noticeable difference. This is expected, because there are hardly any iterations in the relevant loop. To see a performance effect, a fully loaded mission control history is probably needed. It is a question how likely this situation is, especially on low-powered devices where pathfinding performance becomes important. There is also a default limit to the mission control history of 1000 past payments, although this can be changed. I tried to simulate a full mission control by inserting 10 fake results for every node visited, but didn't see an effect either. DB access during pathfinding is likely still where time is mainly spent. I'd propose not to prematurely add optimizations like caching of the node probability in this PR and first see what feedback we get on it from users. I've added another commit that logs useful performance metrics.
(force-pushed 03c6ad6 to 9614d33)
it's not entirely obvious to me what this formula is doing... care to add a more detailed comment on how the factor is different from the weight?
explanation added
this is the critical change of this PR. Would be nice to have some examples on how the resulting probability will look like given a few results. Maybe the examples can be given as unit tests for this formula?
In particular I'm interested in the scenarios we talked about earlier:
- A node having 100s of untried channels, and 1 tried channel (success/failure)
- A node having 100s of untried channels, and X tried channels (success/failure), where X is about 2-5
- A node having few channels, and some result (say 3 channels, 2 failures/successes)
Extracted probability estimation from mission control and added more unit tests.
The number of untried channels isn't a factor in the calculation at all, so I can't write a unit test for that. Probability estimates are only based on previous payment results.
(force-pushed ef1b2e9 to 9316e0f)
(force-pushed 9316e0f to 4c01b5a)
Roasbeef
left a comment
Completed an initial surface-level pass. Will do another one focusing on the core estimation changes to ensure I fully grok them. Overall I think this is a step in the right direction, but wonder how we can test these assumptions in the wild, and also if it's too early to make a change like this given that as we know it, node operators aren't very sophisticated yet.
If we have A <-> B <-> C and B failed, the incoming is A -> B and outgoing is B -> C?
Yes. Added helper variables to make this clearer.
We can always set the default a priori weight quite high, so the effect of this will be light initially. Give some time for users and operators to become aware of the new scoring.
(force-pushed 4c01b5a to c315ee3)
halseth
left a comment
I like the simplicity and the intended impact of this PR! I'm down with the current approach, provided that we test it a bit to make sure we don't have any obvious regressions.
In addition to logging performance metrics as we do now, should we consider logging more (at trace level) about how the probabilities are calculated? I think we now log the final probability found for a channel, but not how it was calculated. It could be useful to log whether it was calculated from a previous success, failure, or node-level failure.
set success=false in pairResult for clarity?
If I remember correctly, there were comments in other PRs about not adding redundant initializers. Not sure what your standard is here.
Personally I like adding them (exceptions being nil initializers perhaps), and in this case I feel it makes sense, since an empty struct is not obvious what it means.
same comment about setting success=false
(force-pushed 00b3271 to 07e1ad1)
Moved new …
halseth
left a comment
LGTM, but would be nice IMO to set success: false to more clearly indicate failure 😄
Can use this in indexing above.
don't you agree that just from looking at this call it is not obvious that we are recording a failure here?
I've added constructors to make it more obvious without adding redundant field initializers.
(force-pushed 07e1ad1 to ac28dd3)
(force-pushed ac28dd3 to 414a5f1)
Why don't we thread through the amount of the HTLC that resulted in this failure at this stage?
We don't need it for this PR. Here it is intentionally set to zero, because the penalty should apply to any amount, not just amounts >= amt.
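The zero-amount convention can be shown with a minimal sketch (hypothetical helper name; mission control's actual amount handling also involves time decay). Failing to forward some amount implies that larger amounts fail too, so a failure recorded at amount zero penalizes every attempt.

```go
package main

import "fmt"

// failureApplies reports whether a recorded failure amount penalizes an
// attempted amount: failing to forward failAmt implies that larger
// amounts fail too. Hypothetical helper, simplified from mission
// control's amount handling.
func failureApplies(attemptAmt, failAmt int64) bool {
	return attemptAmt >= failAmt
}

func main() {
	// A failure recorded at amount zero penalizes every attempt.
	fmt.Println(failureApplies(1000, 0)) // true
	// A failure at 1000 msat does not penalize a smaller attempt.
	fmt.Println(failureApplies(500, 1000)) // false
}
```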
Why do we need to reverse the pair here? If I'm following, then the intention is to fail A -> B, not B <- A.
Prior context (in case it's hidden):
> If we have A <-> B <-> C and B failed, the incoming is A -> B and outgoing is B -> C?
We intend to penalize as much as we can for a node-level failure, including future outgoing traffic for this connection. The pair as returned by getPair is directed towards the failed node. Therefore we first reverse the pair.
Added this as a comment.
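The reversal itself is trivial; a sketch with a simplified pair type (node keys reduced to strings, names hypothetical):

```go
package main

import "fmt"

// directedNodePair is a hypothetical directed connection between two
// node identifiers (simplified to strings for this sketch).
type directedNodePair struct {
	from, to string
}

// reverse flips the direction of the pair. For a node-level failure of
// B on route A -> B -> C, getPair-style logic yields the pair directed
// towards the failed node (A -> B); reversing it (B -> A) lets us also
// penalize future traffic leaving the failed node over that connection.
func (d directedNodePair) reverse() directedNodePair {
	return directedNodePair{from: d.to, to: d.from}
}

func main() {
	incoming := directedNodePair{from: "A", to: "B"}
	fmt.Println(incoming.reverse()) // prints {B A}
}
```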
(force-pushed ed598c0 to a0b76b6)
(force-pushed a0b76b6 to 24e3299)
guggero
left a comment
Nice, clean PR! I was able to follow the changes quite easily even though the subject matter is very new to me. Only found a few nits in the code.
Not in scope of this PR, just thoughts in general:
The main thing I'm not sure about are the memory and CPU time implications of these changes when we get to a larger number of channels/peers. It would be very useful if we had something like the "1 million channel project" for lnd.
To make it explicit whether a failure or a success result is instantiated.
This commit changes the in-memory structure of the mission control state. It prepares for calculation of a node probability. For this we need to be able to efficiently look up the last results for all channels of a node.
This commit modifies the interpretation of node-level failures. Previously only the failing node was marked. With this commit, also the incoming and outgoing connections involved in the route are marked as failed. The change prepares for the removal of node-level failures in mission control probability estimation.
Probability estimates are amount dependent. Previously we assumed an amount, but that starts to make less sense when we make probability more dependent on amounts in the future.
This commit changes mission control to partially base the estimated probability for untried connections on historical results obtained in previous payment attempts. This incentivizes routing nodes to keep all of their channels in good shape.
(force-pushed 24e3299 to 1fac41d)
This PR changes mission control to partially base the estimated probability for untried connections on historical results obtained in previous payment attempts. This prevents the payment loop from getting stuck in an area of the network that is not of good quality. It also incentivizes routing nodes to keep all of their channels in good shape.
The configuration parameter `--routerrpc.aprioriweight` is a value in the range [0, 1] that defines to what extent historical results should be extrapolated to untried connections. Setting it to 1 will completely ignore historical results and always assume the configured a priori probability for untried connections. A value of 0 will ignore the a priori probability completely and base the probability only on historical results, unless there are none available.

Using this parameter, senders can configure how aggressively nodes should be penalized. A low a priori weight will make mission control switch quickly to a different node if a tried node doesn't perform well. The downside is that the sender may miss out on low-fee channels that could possibly complete the payment successfully.
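At the extremes, the parameter behaves like a simple linear mix. The sketch below is illustrative only; lnd's actual computation also time-decays and weights individual results.

```go
package main

import "fmt"

// blend is a simplified illustration of how the a priori weight w mixes
// the configured a priori probability with a probability derived purely
// from historical results. Hypothetical, not lnd's exact formula.
func blend(w, apriori, historical float64) float64 {
	return w*apriori + (1-w)*historical
}

func main() {
	apriori := 0.6    // assumed configured a priori probability
	historical := 0.0 // history-based estimate after a failure

	fmt.Println(blend(1, apriori, historical))   // 0.6: history ignored
	fmt.Println(blend(0, apriori, historical))   // 0: history only
	fmt.Println(blend(0.5, apriori, historical)) // 0.3: in between
}
```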
Indirectly, the collective of senders also shapes the network with their settings of this parameter. When senders start to demand a higher performance from routing nodes, routing nodes need to meet that expectation in order to keep generating revenue.