-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing+routerrpc: improve prob. estimation for untried connections #3462
Conversation
95a93c8
to
4d4e209
Compare
b93992e
to
d80f08c
Compare
Did a bit of performance testing with and without these changes. With a clean mission control history, there is no noticeable difference. This is expected, because there are hardly iterations in the loop inside To see a performance effect, it is probably needed to have a fully loaded mission control history. It is a question how likely this situation is, especially on low-powered devices where pathfinding performance becomes important. There is also a default limit to the mission control history of 1000 past payments, although this can be changed. I tried a simulation of a full mission control by inserting 10 fake results for every node visited, but didn't see an effect either. Db access during path finding is likely still where time is mainly spent. I'd propose to not prematurely add optimizations like caching of the node probability in this pr and first see what feedback we get on it from users. I've added another commit that logs useful performance metrics. |
03c6ad6
to
9614d33
Compare
routing/missioncontrol.go
Outdated
// The value of the apriori weight is in the range [0, 1]. Convert it to | ||
// a factor that properly expresses the intention of the weight in the | ||
// following weight average calculation. | ||
aprioriFactor := 1/(1-m.cfg.AprioriWeight) - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not entirely obvious to me what this formula is doing... care to add a more detailed comment on how the factor is different from the weight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explanation added
routing/missioncontrol.go
Outdated
|
||
return probability | ||
now := m.now() | ||
for _, result := range results { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the critical change of this PR. Would be nice to have some examples on how the resulting probability will look like given a few results. Maybe the examples can be given as unit tests for this formula?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular I'm interested in the scenarios we talked about earlier:
A node having 100s of untried channels, and 1 tried channel (success/failure)
A node having 100s of untried channels, and X tried channels (success/failure), where X is about 2-5
A node having few channels, and some result (say 3 channels, 2 failures/successes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted probability estimation from mission control and added more unit tests.
The number of untried channels isn't a factor in the calculation at all, so I can't write a unit test for that. Probability estimates are only based on previous payment results.
ef1b2e9
to
9316e0f
Compare
9316e0f
to
4c01b5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completed an initial surface-level pass. Will do another one focusing on the core estimation changes to ensure I fully grok them. Overall I think this is a step in the right direction, but wonder how we can test these assumptions in the wild, and also if it's too early to make a change like this given that as we know it, node operators aren't very sophisticated yet.
i.pairResults[inPair.Reverse()] = pairResult{} | ||
|
||
// If not the ultimate node, mark the outgoing connection as failed for | ||
// the node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have A <-> B <-> C
and B
failed, the incoming is A -> B
and outgoing is B -> C
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Added helper variables to make this clearer.
We can always set the default a priori weight quite high, so the effect of this will be light initially. Give some time for users and operators to become aware of the new scoring. |
4c01b5a
to
c315ee3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplicity and conceived impact of this PR! I'm down with the current approach, deemed that we test it a bit to make sure we don't have any obvious regressions.
In addition to logging performance metrics as now, should we consider logging more (on trace) how the probabilities are calculated? I think we now log the final probability found for a channel, but not how it was calculated. Could maybe be useful to log whether it was calculated from a previous success, failure or node level failure?
routing/result_interpretation.go
Outdated
|
||
// Mark the incoming connection as failed for the node. | ||
inPair, _ := getPair(rt, idx-1) | ||
i.pairResults[inPair.Reverse()] = pairResult{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set success=false
in pairResult
for clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, there were comments in other prs about not adding redundant initializers. Not sure what you standard is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I like adding them (exceptions being nil
initializers perhaps), and in this case I feel it makes sense, since an empty struct is not obvious what it means.
routing/missioncontrol.go
Outdated
m.lastNodeFailure[*i.nodeFailure] = result.timeReply | ||
m.setAllPairResult(*i.nodeFailure, &timedPairResult{ | ||
timestamp: result.timeReply, | ||
pairResult: pairResult{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about setting success=false
00b3271
to
07e1ad1
Compare
Moved new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but would be nice IMO to set success: false
to more clearly indicate failure 😄
i.nodeFailure = &rt.Hops[idx-1].PubKeyBytes | ||
|
||
// Mark the incoming connection as failed for the node. | ||
incomingChannelIdx := idx - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use this in indexing above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
routing/missioncontrol.go
Outdated
@@ -446,8 +452,7 @@ func (m *MissionControl) applyPaymentResult( | |||
"node=%v", *i.nodeFailure) | |||
|
|||
m.setAllPairResult(*i.nodeFailure, timedPairResult{ | |||
timestamp: result.timeReply, | |||
pairResult: pairResult{}, | |||
timestamp: result.timeReply, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't you agree that just from looking at this call it is not obvious that we are recording a failure here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added constructors to make it more obvious without adding redundant field initializers.
lnrpc/routerrpc/router.proto
Outdated
|
||
reserved 3, 4, 5, 6; | ||
|
||
PairData data = 7 [json_name="data"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name suggestion: history
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to #3556 and renamed
07e1ad1
to
ac28dd3
Compare
ac28dd3
to
414a5f1
Compare
if idx < len(rt.Hops) { | ||
outgoingChannelIdx := idx | ||
outPair, _ := getPair(rt, outgoingChannelIdx) | ||
i.pairResults[outPair] = failPairResult(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we thread through the amount of the HTLC that resulted in this failure at this stage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it for this PR. Here it is intentionally set to zero, because it should be a penalty to any amount, not just >= amt
.
// Mark the incoming connection as failed for the node. | ||
incomingChannelIdx := idx - 1 | ||
inPair, _ := getPair(rt, incomingChannelIdx) | ||
i.pairResults[inPair.Reverse()] = failPairResult(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to reverse the pair here? If I'm following, then the intention is to fail A -> B
, not B <- A
.
Prior context (in case it's hidden):
If we have A <-> B <-> C and B failed, the incoming is A -> B and outgoing is B -> C?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We intent to penalize as much as we can for a node level failure, including future outgoing traffic for this connection. The pair as it is returned by getPair is directed towards the failed node. Therefore we first reverse the pair.
Added this as comment.
ed598c0
to
a0b76b6
Compare
a0b76b6
to
24e3299
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, clean PR! I was able to follow the changes quite easily even though the subject matter is very new to me. Only found a few nits in the code.
Not scope of this PR, just thoughts in general:
The main thing I'm not sure about are the memory and CPU time implications of these changes when we get to a larger number of channels/peers. It would be very useful if we had something like the "1 million channel project" for lnd.
To make it explicit whether a failure or a success result is instantiated.
This commit changes the in-memory structure of the mission control state. It prepares for calculation of a node probability. For this we need to be able to efficiently look up the last results for all channels of a node.
This commit modifies the interpretation of node-level failures. Previously only the failing node was marked. With this commit, also the incoming and outgoing connections involved in the route are marked as failed. The change prepares for the removal of node-level failures in mission control probability estimation.
Probability estimates are amount dependent. Previously we assumed an amount, but that starts to make less sense when we make probability more dependent on amounts in the future.
This commit changes mission control to partially base the estimated probability for untried connections on historical results obtained in previous payment attempts. This incentivizes routing nodes to keep all of their channels in good shape.
24e3299
to
1fac41d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🏹
This PR changes mission control to partially base the estimated probability for untried connections on historical results obtained in previous payment attempts. This prevents the payment loop from getting stuck in an area of the network that is not of good quality. It also incentivizes routing nodes to keep all
of their channels in good shape.
The configuration parameter
--routerrpc.aprioriweight
is a value in the range [0, 1] that defines to what extent historical results should be extrapolated to untried connections. Setting it to 1 will completely ignore historical results and always assume the configured a priori probability for untried connections. A value of 0 will ignore the a priori probability completely and only base the probability on historical results, unless there are none available.Using this parameter, senders can configure how aggressive nodes should be penalized. A low a priori weight will make mission control switch quickly to using a different node if a tried node doesn't perform well. The downside of this is that the sender may miss out on low fee channels that could possibly complete the payment successfully.
Indirectly, the collective of senders also shapes the network with their settings of this parameter. When senders start to demand a higher performance from routing nodes, routing nodes need to meet that expectation in order to keep generating revenue.