routing: prune based on channel sets instead of channels #1734
Conversation
@joostjager really dig these changes, should help drive down the number of failed attempts for making a successful payment. Did an initial pass and changes read well to me.
One thing that I'm not sure of is whether we should always prune the edge set in response to a `TemporaryChannelFailure` or `PermanentChannelFailure`.
In the most literal sense, it would seem that we should only apply this to the specific edges that failed. Given the current state of the network however, it could be that these are also symptoms of unstable hops, and perhaps should be avoided altogether. There seems to be a balance, but I don't have enough data to really point one way or another. Do you have any more thoughts on this?
Down the road, these failures could be fed directly into a stochastic mission control, which would learn or apply the correct heuristic. At the moment though, we may have to determine this through experimentation.
routing/router.go (Outdated)
```go
if badChan, ok := route.nextHopChannel(errSource); ok {
	return badChan.ChannelID, nil
}

if nextNode, ok := route.nextHopVertex(errSource); ok {
```
is it always the case that we want to prefer the outgoing channel? for most of the cases this appears to be correct, though it is possible to get a `TemporaryChannelFailure` from the incoming link.
Interesting. I don't think I changed the logic for determining the problematic channel. In master it is done similarly:
Line 2006 in dcd8190:
```go
badChan, ok := route.nextHopChannel(errSource)
```
Will think about this. Actually, if we stop making assumptions completely: when we receive an error from a remote node, the only thing we really know is that it comes from that node and that it can be related to any of its incoming channels from the previous node or its outgoing channels to the next node. That is an even broader scope than just the set of outgoing channels. We cannot even know which channels those are, because they can be private (the node may choose a private channel instead of the specified one).
Do we want to assume that nodes return honest failure messages? If we don't want to assume that, we could just as well treat all errors identically and take their type as just an identifier for statistical bookkeeping in the probability machine.
We consider the outgoing channel here because, if the node is operating normally, the payment already arrived via the incoming channel. The only time we consider the incoming channel as the channel being failed is for the destination node.
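As a rough sketch of that rule (hypothetical types and field names, not lnd's actual route code), the failing channel could be selected from the error source's position in the route like this:

```go
package main

import "fmt"

// hop is a simplified stand-in for a route hop; the field names are
// hypothetical, not lnd's actual route representation.
type hop struct {
	nodeID     string
	chanToNext uint64 // channel used to forward to the next hop (0 for the final hop)
}

// failedChannel picks the channel to blame for an error reported by the hop
// at index errSource: the outgoing channel for intermediate hops (the payment
// evidently arrived over the incoming one), and the incoming channel only
// when the error comes from the destination.
func failedChannel(route []hop, errSource int) (uint64, bool) {
	if len(route) < 2 || errSource < 0 || errSource >= len(route) {
		return 0, false
	}
	if errSource < len(route)-1 {
		// Intermediate hop: blame its outgoing channel.
		return route[errSource].chanToNext, true
	}
	// Destination node: blame its incoming channel.
	return route[errSource-1].chanToNext, true
}

func main() {
	route := []hop{
		{nodeID: "A", chanToNext: 101},
		{nodeID: "B", chanToNext: 202},
		{nodeID: "C", chanToNext: 303},
		{nodeID: "D"},
	}
	fmt.Println(failedChannel(route, 1)) // B is intermediate: blame outgoing 202
	fmt.Println(failedChannel(route, 3)) // D is the destination: blame incoming 303
}
```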
> Do we want to assume that nodes return honest failure messages?
The current logic assumes that to a degree. It seems in order to remove this assumption, we'd only need to perform a sanity check (if the failure contains a channel update, maybe they all should?) to ensure we prune the correct edge set.
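A minimal sketch of such a sanity check, under the assumption that we only trust a failure's channel update when it refers to the channel we expected at that hop (the types and names here are illustrative, not lnwire's):

```go
package main

import "fmt"

// channelUpdate and failure are simplified stand-ins for lnwire's types; the
// field names here are hypothetical.
type channelUpdate struct {
	shortChanID uint64
}

type failure struct {
	update *channelUpdate // nil if the failure carries no update
}

// shouldPruneExpected returns true if it is safe to prune the edge set we
// expected to be used at the failing hop. If the failure carries a channel
// update, we require its channel id to match the expected channel; a node
// returning updates for unrelated channels is treated as suspect, and the
// caller can fall back to penalizing the reporting node instead.
func shouldPruneExpected(f failure, expectedChanID uint64) bool {
	if f.update == nil {
		// No update to cross-check; the caller decides how much to trust this.
		return true
	}
	return f.update.shortChanID == expectedChanID
}

func main() {
	f := failure{update: &channelUpdate{shortChanID: 123}}
	fmt.Println(shouldPruneExpected(f, 123)) // true: update matches the expected channel
	fmt.Println(shouldPruneExpected(f, 456)) // false: mismatch, don't trust it blindly
}
```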
routing/router.go (Outdated)
```diff
@@ -1891,15 +1884,15 @@ func (r *ChannelRouter) sendPayment(payment *LightningPayment,
 		// the update and continue.
 		case *lnwire.FailChannelDisabled:
 			applyChannelUpdate(&onionErr.Update)
-			pruneEdgeFailure(paySession, failedChanID)
+			pruneEdgeFailure(paySession, failedEdgeSet)
```
Side note: Noticing now that we don't return this error anywhere in the link or switch, though the spec says it should be returned if "The channel from the processing node has been disabled." LND will send out disables if the remote peer isn't online, which would make it equivalent to `UnknownNextPeer`. If others are using the same disabling criteria then pruning the edge set seems appropriate. If that changes, this may need to be revisited.
Yes, I noticed too that `lnd` doesn't send it. Actually, I think the disabled notification should be handled similarly to policy mismatches: it should get a second chance. Somehow it is done differently here. Now that we know the channel is disabled, rerouting could yield another channel between the same two nodes.
Correct, we don't currently send it, as we assume that if we've sent out a disable, nodes simply won't route through it. Channel disable and unknown next peer are more or less the same to us atm. What we could do, though, is send back the disabled error if we get a forwarding request over a channel whose closing transaction has been broadcast but not yet confirmed.
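For illustration, the "second chance" idea mentioned above could look roughly like this (hypothetical types, not lnd's actual failure handling): the first soft failure for a node pair only triggers a retry, the second prunes the pair.

```go
package main

import "fmt"

// pair is a hypothetical directed node pair key (simplified to strings here).
type pair struct{ from, to string }

// secondChance records pairs that already failed once with a "soft" error
// such as a policy mismatch or a channel-disabled failure. The first failure
// only triggers applying the channel update and retrying (another channel
// between the same nodes may still work); the second failure prunes the pair.
type secondChance struct {
	seen map[pair]bool
}

func newSecondChance() *secondChance {
	return &secondChance{seen: make(map[pair]bool)}
}

// shouldPrune returns false on the first soft failure for a pair and true on
// any subsequent one.
func (s *secondChance) shouldPrune(p pair) bool {
	if s.seen[p] {
		return true
	}
	s.seen[p] = true
	return false
}

func main() {
	s := newSecondChance()
	p := pair{from: "B", to: "C"}
	fmt.Println(s.shouldPrune(p)) // false: give the pair a second chance
	fmt.Println(s.shouldPrune(p)) // true: prune on repeat failure
}
```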
@cfromknecht, this PR builds on #1706. The first two commits in this PR are taken from there. Will apply your comments on those commits to #1706.
The problem remains that you cannot be sure which edge really failed. The more I think about it, the more I am in favor of not making any assumptions, not only to deal with buggy nodes properly, but also in case malicious nodes appear in the future. Suppose the hop payload contained not a channel id, but just the pubkey of the next hop.

Edge set pruning was already a smaller potential problem there. For c-lightning it wasn't a problem, as you said, because only a single channel between two nodes is allowed. That leaves eclair: what could happen is that we wouldn't try all channels from the forwarding node to the next, but stop after two failures, even though a channel with enough balance may exist. I don't know how many relevant eclair forwarding nodes there are currently, what Eclair's plans are to change its forwarding logic to what lnd does (and increase earned fees for forwarding nodes), how often routing needs to probe more than two channels between the same pair of nodes, how unbalanced those channels are, etc. It is an uncertainty, but maybe we are able to try it out when we are working on the probability machine anyway. For that we also need to evaluate different models/parameters.
This PR is still open-ended. At the moment, the code serves primarily to show a possible direction to take with this subject.
Eclair merged non-strict fwding:
Based on the findings re eclair above, and also the to-be-added section to the spec on non-strict forwarding, I think we can safely proceed with these changes now. The one open question I have (was possibly answered above) is how will we deal with fee errors in the case of multiple channels to a node with distinct fees? Will we simply assume that there's no reason to do this, and not try to do anything fancy w.r.t. errors sent back to the sender?
LGTM 🗿
Needs a rebase to resolve the conflicts once #1706 is merged, then we can merge it in post release to get more active testing on mainnet and testnet.
routing/missioncontrol.go (Outdated)
```go
		failedVertexes: make(map[Vertex]time.Time),
		selfNode:       selfNode,
		queryBandwidth: qb,
		graph:          g,
	}
}

// edgeSet identifies a set of channel edges from a specific node to a
// specific node.
```
👍
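For readers following along, a minimal sketch of the kind of key the comment describes, assuming `Vertex` is a 33-byte pubkey as in the hunk above (the surrounding code is illustrative, not the PR's implementation):

```go
package main

import (
	"fmt"
	"time"
)

// Vertex stands in for a 33-byte node pubkey, as used in lnd's routing package.
type Vertex [33]byte

// edgeSet identifies the set of channel edges from one specific node to
// another specific node. Keying failures by this pair lets mission control
// penalize all parallel channels between the two nodes at once.
type edgeSet struct {
	fromNode, toNode Vertex
}

func main() {
	// Mirrors the failedVertexes map in the hunk above, but keyed by the
	// directed pair instead of an individual channel id.
	failedEdgeSets := make(map[edgeSet]time.Time)

	var a, b Vertex
	a[0], b[0] = 0x02, 0x03
	failedEdgeSets[edgeSet{fromNode: a, toNode: b}] = time.Now()

	fmt.Println(len(failedEdgeSets)) // 1
}
```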
I would say that we only prune the channels that would need a fee of at most what we paid in the attempt for which we got the error. In case all channels have the same policy, this means we prune all of them. If there are cheaper channels, we prune those too; more expensive ones are left in for a new path finding round. Time lock needs to be worked in too. So: prune all channels with a lower or equal fee and a shorter or equal timelock delta.
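A sketch of that pruning rule under assumed, simplified policy types (not lnd's actual edge policy structs): prune every channel between the pair whose fee for the attempted amount and whose timelock delta do not exceed what the failed attempt used.

```go
package main

import "fmt"

// channelPolicy is a simplified stand-in for an edge policy between two nodes.
type channelPolicy struct {
	chanID        uint64
	feeBaseMsat   int64
	feeRatePPM    int64 // proportional fee in parts per million
	timeLockDelta uint16
}

// feeMsat computes the forwarding fee this policy charges for amtMsat.
func (p channelPolicy) feeMsat(amtMsat int64) int64 {
	return p.feeBaseMsat + amtMsat*p.feeRatePPM/1_000_000
}

// channelsToPrune returns the channels between a node pair that should be
// pruned after a failure: those whose fee for the attempted amount is at most
// what the failed attempt paid, and whose timelock delta is no larger than
// the one used. More expensive or longer-timelock channels stay in for the
// next path finding round.
func channelsToPrune(policies []channelPolicy, amtMsat, paidFeeMsat int64,
	usedDelta uint16) []uint64 {

	var pruned []uint64
	for _, p := range policies {
		if p.feeMsat(amtMsat) <= paidFeeMsat && p.timeLockDelta <= usedDelta {
			pruned = append(pruned, p.chanID)
		}
	}
	return pruned
}

func main() {
	policies := []channelPolicy{
		{chanID: 1, feeBaseMsat: 1000, feeRatePPM: 1, timeLockDelta: 40},
		{chanID: 2, feeBaseMsat: 1000, feeRatePPM: 1, timeLockDelta: 40},
		{chanID: 3, feeBaseMsat: 5000, feeRatePPM: 100, timeLockDelta: 144},
	}
	// The failed attempt paid 2000 msat in fees over a 40-block delta for a
	// 1,000,000 msat payment: channels 1 and 2 are pruned, 3 is kept.
	fmt.Println(channelsToPrune(policies, 1_000_000, 2000, 40)) // [1 2]
}
```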
To prune channel sets, the issue of directionality needs to be fixed first: #2243
Follow up: #3353, in which we go one step further by not only generalizing over channels, but also tracking without directionality.
Nice, I really like the direction this is going! Diff was much smaller than I expected, and separating pair failures from node failures also makes the code easier to reason about.
LGTM, but only did a high-level pass for now, will circle back to it.
```diff
@@ -142,37 +142,41 @@ func (r *RouterBackend) QueryRoutes(ctx context.Context,
 		ignoredNodes[ignoreVertex] = struct{}{}
 	}

-	ignoredEdges := make(map[routing.EdgeLocator]struct{})
+	ignoredEdges := make(map[routing.DirectedNodePair]struct{})
 	for _, ignoredEdge := range in.IgnoredEdges {
```
Should we change the RPC param to use node pairs instead of chans also?
RPC parameter added and `ignored_edges` deprecated.
(branch updated from 4e9f823 to 5dfeffa)
LGTM 🚂
(comments moved from #3256)
```proto
/**
A list of directed node pairs that will be ignored during path finding.
*/
repeated NodePair ignored_pairs = 10;
```
somewhat unfortunate that this requires 33*2/8 = 8.25x more bandwidth to transmit (assuming single channel), but probably okay unless the set is huge.
are there any plans to make this streaming so the whole list isn't transmitted on each request?
I don't have those plans, but I find it difficult to foresee if this becomes a problem, because it depends on the actual use of the functionality.
definitely, i just mean that, atm, this would require a quadratic amount of bandwidth/serialization to do an iterative blacklist. i don't think it'd be too hard to switch to a streaming mode later if we find it an issue
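As a rough back-of-the-envelope illustration of that quadratic growth (assuming one 66-byte pair is added per attempt and the full list is re-sent on every request): after n attempts the total ignored-pair bytes transmitted is 66 * (1 + 2 + ... + n) = 33 * n * (n + 1), so 100 attempts already transfer about 333 kB of ignored pairs, versus roughly 40 kB if 8-byte channel ids were sent instead.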
Previously, mission control tracked failures on a per-node, per-channel basis. This commit changes this to tracking at the level of directed node pairs. The goal of moving to this coarser-grained level is to reduce the number of required payment attempts without compromising payment reliability.
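A minimal sketch of what tracking on directed node pairs could look like, following the map[...]time.Time pattern visible earlier in the diff (the names and the timeout behaviour here are assumptions, not the PR's actual mission control code):

```go
package main

import (
	"fmt"
	"time"
)

// DirectedNodePair is a simplified stand-in for the directed pair key the
// commit message describes (string node ids instead of 33-byte pubkeys).
type DirectedNodePair struct {
	From, To string
}

// missionControl sketches per-pair failure tracking with a simple timeout.
// This is an illustration of the idea, not lnd's actual mission control.
type missionControl struct {
	failedPairs map[DirectedNodePair]time.Time
	penaltyTime time.Duration
}

func newMissionControl(penalty time.Duration) *missionControl {
	return &missionControl{
		failedPairs: make(map[DirectedNodePair]time.Time),
		penaltyTime: penalty,
	}
}

// reportFailure penalizes every channel from p.From to p.To at once.
func (m *missionControl) reportFailure(p DirectedNodePair, at time.Time) {
	m.failedPairs[p] = at
}

// shouldIgnore tells path finding to skip all edges between the pair until
// the penalty has expired.
func (m *missionControl) shouldIgnore(p DirectedNodePair, now time.Time) bool {
	failedAt, ok := m.failedPairs[p]
	return ok && now.Sub(failedAt) < m.penaltyTime
}

func main() {
	m := newMissionControl(5 * time.Minute)
	p := DirectedNodePair{From: "B", To: "C"}
	m.reportFailure(p, time.Now())
	fmt.Println(m.shouldIgnore(p, time.Now()))                     // true: still penalized
	fmt.Println(m.shouldIgnore(p, time.Now().Add(10*time.Minute))) // false: penalty expired
}
```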
LGTM 🚀 cool to finally see this improvement go in after all the preliminary refactoring has been done, happy almost birthday to this PR 🎂🎉🍾
LGTM 🐳
UPDATE 2019-08-02
Things have moved on since this PR was opened. I pushed an updated version of node pair pruning in mission control.
Advantages of node pair pruning:
- When two nodes have multiple public channels between them, mission control is only going to try one of those channels. With non-strict forwarding (which all current implementations support), this should be enough. This does assume that the same policy is set for all channels between those nodes (as is recommended by the LN spec).
- When a node closes a channel and opens a new channel to the same peer, it won't need to rebuild its reputation with senders for the new channel.
ORIGINAL PR DESCRIPTION
In the light of the discussion in #1527, it could be useful to also acknowledge in the retry behaviour that the actual channel being used isn't known. It may be better to associate failures not with specific channels, but with the two endpoints (pubkeys) of the channel. So instead of pruning channel 123, we prune the set of channels from node A to node B. (In case of a single channel, the result is the same)
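To make the idea concrete, a small sketch of pair-based ignoring during path finding (hypothetical graph types, not lnd's path finding code): any edge between an ignored pair is skipped, including parallel channels that were never attempted.

```go
package main

import "fmt"

// nodePair is a hypothetical directed pair key; edge is a simplified graph
// edge. This sketches how path finding would skip edges, not lnd's code.
type nodePair struct{ from, to string }

type edge struct {
	chanID   uint64
	from, to string
}

// usableEdges filters out every edge between an ignored node pair, regardless
// of which specific channel id the failed attempt actually used. Pruning the
// A->B pair therefore also removes A->B channels that were never tried.
func usableEdges(edges []edge, ignoredPairs map[nodePair]struct{}) []edge {
	var out []edge
	for _, e := range edges {
		if _, ignored := ignoredPairs[nodePair{from: e.from, to: e.to}]; ignored {
			continue
		}
		out = append(out, e)
	}
	return out
}

func main() {
	edges := []edge{
		{chanID: 123, from: "A", to: "B"},
		{chanID: 124, from: "A", to: "B"}, // parallel channel, pruned too
		{chanID: 200, from: "B", to: "C"},
	}
	ignored := map[nodePair]struct{}{{from: "A", to: "B"}: {}}
	fmt.Println(usableEdges(edges, ignored)) // only the B->C edge remains
}
```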
@cfromknecht:
So for Eclair what would happen:
In the last case, channel set pruning will result in a sub-optimal outcome. In the current channel-based pruning the third route calculation may retry 125 with the new policy applied or come up with a third channel E->F which hasn't been tried yet.
If the forwarding node (L) is an lnd node,
So actually, in both the eclair and lnd cases, channel set pruning can be sub-optimal.
So maybe not a good idea then at all for policy failures?
Other scenarios to think about could be with a malicious/buggy node. I am not sure if it could keep the source node busy with rerouting for a while by sending back channel updates for different channels than the channel that was requested (setting failed once markers for irrelevant channels, so S keeps trying). It might be worthwhile to match the channel id in the update with the requested channel id as a way to mitigate this.
On the other hand, edges are also pruned without a channel update: `FailUnknownNextPeer`, `FailPermanentChannelFailure` and `FailTemporaryChannelFailure` (for which the update is optional). In that case it may be less clear (for lnd forwarding nodes) that the (requested) channel should be pruned. Pruning the full set may be better.

I think a fundamental decision is whether to base the logic on
or
Ideas are welcome.