routing: prune based on channel sets instead of channels #1734

Merged · 6 commits · Aug 14, 2019

Conversation

@joostjager (Contributor) commented Aug 16, 2018

UPDATE 2019-08-02

Things have moved on since this PR was opened. I pushed an updated version of node pair pruning in mission control.

Advantages of node pair pruning (a rough sketch in code follows this list):

  • When two nodes have multiple public channels between them, mission control is only going to try one of those channels. With non-strict forwarding (which all current implementations support), this should be enough. This does assume that the same policy is set for all channels between those nodes (as is recommended by the LN spec).

  • When a node closes a channel and opens a new channel to the same peer, it won't need to rebuild its reputation with senders for the new channel.
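For illustration, here is a minimal, hedged sketch of what keying mission control on directed node pairs instead of channel IDs could look like. A `DirectedNodePair` type does appear in the actual diff further down, but everything else here (the struct fields, method names, and the time-window check) is simplified and hypothetical, not lnd's real API:

```go
package main

import (
	"fmt"
	"time"
)

// Vertex is a node public key (33 bytes, compressed).
type Vertex [33]byte

// DirectedNodePair identifies the whole set of channels from one node to
// another, regardless of which specific channel carried the HTLC.
type DirectedNodePair struct {
	From, To Vertex
}

// missionControl tracks payment failures per directed node pair instead of
// per channel ID, so closing and reopening a channel between the same two
// peers does not reset the pair's history.
type missionControl struct {
	lastPairFailure map[DirectedNodePair]time.Time
}

// reportPairFailure records a failure for every channel between from and to
// at once; with non-strict forwarding we cannot know which channel failed.
func (m *missionControl) reportPairFailure(from, to Vertex) {
	m.lastPairFailure[DirectedNodePair{From: from, To: to}] = time.Now()
}

// pairRecentlyFailed tells path finding whether to skip the pair entirely.
func (m *missionControl) pairRecentlyFailed(from, to Vertex, window time.Duration) bool {
	t, ok := m.lastPairFailure[DirectedNodePair{From: from, To: to}]
	return ok && time.Since(t) < window
}

func main() {
	mc := &missionControl{lastPairFailure: make(map[DirectedNodePair]time.Time)}
	var a, b Vertex
	a[0], b[0] = 0x02, 0x03
	mc.reportPairFailure(a, b)
	fmt.Println(mc.pairRecentlyFailed(a, b, time.Hour)) // true
}
```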

ORIGINAL PR DESCRIPTION

In light of the discussion in #1527, it could be useful to also acknowledge in the retry behaviour that the actual channel being used isn't known. It may be better to associate failures not with specific channels, but with the two endpoints (pubkeys) of the channel. So instead of pruning channel 123, we prune the set of channels from node A to node B. (In the case of a single channel, the result is the same.)

@cfromknecht:

most certainly, that's something we've discussed before as it's very related to this PR.

Since we don't have certainty about what channel the other hops try, the feedback to the router is lossy (and potentially malicious/incorrect). With sufficient fees, multiple channels between the same peers can be effectively treated as one channel from the source's perspective, so the suggestion to prune all channels connecting the pubkeys makes sense to me.

It also shouldn't be an issue if we assume that a node will always try the "best" (or at least a sufficient) channel first, as then there'd be no point in retrying if the others are worse. However, this change could still cause us to miss a possible route if they maintain a strict forwarding policy.

Atm we will try to find the hop with the most bandwidth. I think c-lightning has a max of one channel per peer, so it wouldn't be an issue there. Eclair looks like it abides by the channel requested in the payload, though as a disclaimer, my Scala is a little rusty. Just based on that knowledge, it may produce less-than-optimal results when routing through an eclair node that has multiple outgoing channels with the next peer. (A rough sketch of this bandwidth-based selection follows.)
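As an aside, a hedged sketch of the bandwidth-based, non-strict selection just described. The types and helper are illustrative only and do not mirror lnd's switch code:

```go
// outgoingChannel is a simplified view of a channel to the next peer.
type outgoingChannel struct {
	ChanID    uint64
	Bandwidth int64 // available outgoing balance, in msat
}

// selectNonStrict picks the channel to the next peer with the most available
// bandwidth that can still carry amt, falling back to the channel named in
// the onion payload if none qualifies. The requested channel is only a hint.
func selectNonStrict(requested uint64, candidates []outgoingChannel, amt int64) uint64 {
	best := requested
	var bestBandwidth int64 = -1
	for _, c := range candidates {
		if c.Bandwidth >= amt && c.Bandwidth > bestBandwidth {
			best = c.ChanID
			bestBandwidth = c.Bandwidth
		}
	}
	return best
}
```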

So for Eclair, this is what would happen:

  • Eclair node E does strict forwarding over channel 123 to destination F, finds that its policy isn't satisfied, and sends back a channel update for channel 123.
  • Source node S processes the channel update and adds all channels from E to F to its failed-once set.
  • S will calculate a second route using the updated graph. It is not yet going to avoid E-F channels, because a channel of that set has only failed once. If the rerouting again chooses the E-F connection, three things can happen:
    • Channel 123 is selected again, but this time with the right fee/timelock delta. The forwarding from E to F is expected to succeed.
    • A different channel from E to F, say 125, is selected. If the policy for 125 is up to date, the forwarding is expected to succeed.
    • However, if 125's policy is also not up to date, the payment will fail again. This time the complete set of channels E->F will be pruned. Currently the node is also pruned completely (quite aggressive).

In the last case, channel set pruning results in a sub-optimal outcome. With the current channel-based pruning, the third route calculation may retry 125 with the new policy applied, or come up with a third E->F channel which hasn't been tried yet. A rough sketch of the two-strike behaviour described above follows.
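A hedged sketch of that two-strike rule, reusing the `Vertex` and `DirectedNodePair` types from the earlier sketch. The field and function names are illustrative, not the actual payment session API:

```go
// paymentSession tracks per-payment failure state for policy errors.
type paymentSession struct {
	failedOnce  map[DirectedNodePair]struct{} // one policy failure seen, retry allowed
	prunedPairs map[DirectedNodePair]struct{} // excluded from further path finding
}

// reportPolicyFailure applies the two-strike rule: the first policy failure
// for a pair gives the updated policy another chance; the second prunes the
// entire E->F channel set for the remainder of the payment.
func (s *paymentSession) reportPolicyFailure(from, to Vertex) {
	pair := DirectedNodePair{From: from, To: to}
	if _, ok := s.failedOnce[pair]; ok {
		s.prunedPairs[pair] = struct{}{}
		return
	}
	s.failedOnce[pair] = struct{}{}
}
```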

If the forwarding node (L) is an lnd node:

  • S requests L to forward to M over channel 234.
  • L sees that the policy of none of its channels is satisfied. It returns a channel update for the requested channel 234 (for example with increased fees).
  • Now, similar to the steps above, if the second route attempt uses a different L->M channel, 235 (because according to S's version of the graph it is now the cheapest option), and that policy is also not up to date, the payment fails again and the complete set will be pruned, even though a route via 235 could be optimal. Channel 235 couldn't be picked by L on the first attempt, because that route didn't carry enough fee. So it is not always the case that lnd will try the best channel and that there is no point in retrying. A route with different fees may 'unlock' other channels that do make the payment succeed.

So actually, in both the eclair and lnd cases, channel set pruning can be sub-optimal.

So maybe it isn't such a good idea for policy failures after all?

Other scenarios to think about involve a malicious or buggy node. I am not sure whether it could keep the source node busy with rerouting for a while by sending back channel updates for channels other than the one that was requested (setting failed-once markers for irrelevant channels, so S keeps trying). It might be worthwhile to match the channel id in the update with the requested channel id as a way to mitigate this, as sketched below.
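A minimal sketch of that mitigation, assuming the failure carries a channel update with a short channel id. The `ChannelUpdate` stand-in below is a trimmed-down illustration, not the lnwire type:

```go
// ChannelUpdate is a trimmed-down stand-in for the gossip channel update
// that some failure messages carry.
type ChannelUpdate struct {
	ShortChannelID uint64
	FeeBaseMsat    uint32
	TimeLockDelta  uint16
}

// validFailureUpdate checks that a channel update returned in a failure
// refers to the channel we actually asked the hop to use. This stops a buggy
// or malicious hop from setting failed-once markers on unrelated channels
// and keeping the sender busy rerouting.
func validFailureUpdate(requestedChanID uint64, update *ChannelUpdate) bool {
	if update == nil {
		// Some failures legitimately carry no update; handled elsewhere.
		return true
	}
	return update.ShortChannelID == requestedChanID
}
```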

On the other hand, edges are also pruned without a channel update: FailUnknownNextPeer, FailPermanentChannelFailure and FailTemporaryChannelFailure (for which the update is optional). In those cases it may be less clear (for lnd forwarding nodes) that the (requested) channel is the one that should be pruned. Pruning the full set may be better.

I think a fundamental decision is whether to base the logic on

  • what the source node knows for sure
    or
  • what would be optimal given inside knowledge of the various implementations.

Ideas are welcome.

@Roasbeef Roasbeef added enhancement Improvements to existing features / behaviour routing P3 might get fixed, nice to have labels Aug 17, 2018
@Roasbeef Roasbeef added this to the 0.5.1 milestone Aug 17, 2018
@cfromknecht (Contributor) left a comment:

@joostjager really dig these changes, should help drive down the number of failed attempts for making a successful payment. Did an initial pass and changes read well to me.

One thing that I'm not sure of is whether we should always prune the edge set in response to a TemporaryChannelFailure or PermanentChannelFailure.

In the most literal sense, it would seem that we should only apply this to the specific edges that failed. Given the current state of the network however, it could be that these are also symptoms of unstable hops, and perhaps should be avoided altogether. There seems to be a balance, but I don't have enough data to really point one way or another. Do you have any more thoughts on this?

Down the road, these failures could be fed directly into a stochastic mission control, which would learn or apply the correct heuristic. At the moment though, we may have to determine this through experimentation.

(Resolved review threads on routing/router.go and rpcserver.go.)
-	if badChan, ok := route.nextHopChannel(errSource); ok {
-		return badChan.ChannelID, nil
+	if nextNode, ok := route.nextHopVertex(errSource); ok {
Contributor:
Is it always the case that we want to prefer the outgoing channel? For most cases this appears to be correct, though it is possible to get a TemporaryChannelFailure from the incoming link.

Contributor Author (@joostjager):
Interesting. I don't think I changed the logic for determining the problematic channel. In master it is done similarly:

badChan, ok := route.nextHopChannel(errSource)

Will think about this. Actually, if we stop making assumptions completely: if we receive an error from a remote node, the only thing we really know is that it is coming from that node, and that it can be related to any of its incoming channels from the previous node or outgoing channels to the next node. So an even broader scope than just the set of outgoing channels. We cannot even know which channels those are, because they can be private (the node may choose a private channel instead of the specified one).

Do we want to assume that nodes return honest failure messages? If we don't want to assume that, we could just as well treat all errors identically: take their type as just an identifier and keep a record for statistical purposes in the probability machine.

Member:
We consider the outgoing channel here because, if the node is operating normally, the HTLC arrived via the incoming channel. The only time we consider the incoming channel as the channel being failed is for the destination node.

Member:
> Do we want to assume that nodes return honest failure messages?

The current logic assumes that to a degree. It seems in order to remove this assumption, we'd only need to perform a sanity check (if the failure contains a channel update, maybe they all should?) to ensure we prune the correct edge set.

@@ -1891,15 +1884,15 @@ func (r *ChannelRouter) sendPayment(payment *LightningPayment,
 	// the update and continue.
 	case *lnwire.FailChannelDisabled:
 		applyChannelUpdate(&onionErr.Update)
-		pruneEdgeFailure(paySession, failedChanID)
+		pruneEdgeFailure(paySession, failedEdgeSet)
Contributor:
Side note: I'm noticing now that we don't return this error anywhere in the link or switch, though the spec says it should be returned if "the channel from the processing node has been disabled." lnd will send out disables if the remote peer isn't online, which would make it equivalent to UnknownNextPeer. If others use the same disabling criteria, then pruning the edge set seems appropriate. If that changes, this may need to be revisited.

@joostjager (Contributor, Author) commented Sep 7, 2018:
Yes, I noticed too that lnd doesn't send it. Actually, I think the disabled notification should be handled similarly to policy mismatches: it should get a second chance. Somehow it is handled differently here. Knowing now that the channel is disabled, rerouting could yield another channel between the same two nodes.

Member:
Correct, we don't currently send it, as we assume that if we've sent out a disable, then nodes simply won't route through the channel. Channel disabled and unknown next peer are more or less the same to us atm. What we could do, though, is send back the disabled error if we get a forwarding request over a channel whose closing transaction has been broadcast but not yet confirmed.

@joostjager (Contributor, Author) commented Sep 7, 2018:

@cfromknecht, this PR builds on #1706. The first two commits in this PR are taken from there. Will apply your comments on those commits to #1706.

@joostjager (Contributor, Author):

> In the most literal sense, it would seem that we should only apply this to the specific edges that failed. Given the current state of the network however, it could be that these are also symptoms of unstable hops, and perhaps should be avoided altogether. There seems to be a balance, but I don't have enough data to really point one way or another. Do you have any more thoughts on this?

The problem remains that you cannot be sure what edge really failed. The more I think about it, the more I am in favor of not making any assumptions. Not only to deal with buggy nodes properly, but also in case of future malicious nodes appearing.

Suppose the hop payload had not contained a channel id, but just the pubkey of the next hop. (Could it be that this would actually have been a better decision?) In that case, all implementations would be forced to choose the best channel themselves. The logic for deciding which channel to return an update for, in case none of the policies match, would also be different. The forwarding node knows the amount to forward at that point. It could look at its outgoing channels, find the lowest-fee channel that is able (given actual balances) to carry the payment, and send back an update for that channel. The sender could then still choose what fee to attach for the next attempt, but it would know that satisfying the returned policy will most likely make the payment succeed (for that hop). I am wondering whether this way of selecting the channel update would already be beneficial in lnd now. It would, in a sense, use the channel update reply to signal where balance is available. A sketch of this reply-selection idea follows.
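A hedged sketch of that reply-selection idea (purely illustrative; it is not how lnd currently picks the update to return): among the outgoing channels to the next hop that have enough balance, return the policy of the cheapest one, so that satisfying the returned policy is likely to make the next attempt succeed. The `channelPolicy` type and field names are assumptions for the sketch:

```go
// channelPolicy combines a channel's advertised policy with its current
// local balance. All fields are illustrative simplifications.
type channelPolicy struct {
	ChanID        uint64
	FeeBaseMsat   uint32
	TimeLockDelta uint16
	Balance       int64 // msat available to forward
}

// selectUpdateToReturn picks which channel's policy to send back when no
// policy was satisfied: the lowest-fee channel that can actually carry amt.
// It returns nil if no channel to the next hop has sufficient balance.
func selectUpdateToReturn(chans []channelPolicy, amt int64) *channelPolicy {
	var best *channelPolicy
	for i := range chans {
		c := &chans[i]
		if c.Balance < amt {
			continue
		}
		if best == nil || c.FeeBaseMsat < best.FeeBaseMsat {
			best = c
		}
	}
	return best
}
```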

Edge set pruning was already a smaller potential problem for lnd forwarding nodes, because of the automatic channel selection. I think that by sending back an update according to the idea above, edge set pruning becomes an even smaller problem. There are still scenarios, when a lot of the graph data on the sender's side is out of date, where pruning individual edges might be better, but how likely is that?

For c-lightning it wasn't a problem, as you said, because only a single channel per peer is allowed.

Then just for eclair, what could happen is that we wouldn't try all channels from the forwarding node to the next node, but stop after two failures, even though a channel with enough balance may exist. I don't know how many relevant eclair forwarding nodes there currently are, what Eclair's plans are to change its forwarding logic to what lnd does (and increase earned fees for forwarding nodes), how often routing needs to probe more than two channels between the same pair of nodes, how unbalanced those channels are, etc. It is an uncertainty, but maybe we can try it out when we are working on the probability machine anyway, since we also need to evaluate different models/parameters there.

@halseth halseth modified the milestones: 0.5.1, 0.5.2 Sep 20, 2018
@joostjager (Contributor, Author):
This PR is still open-ended. At the moment, the code serves primarily to show a possible direction to take this subject in.

@joostjager (Contributor, Author):
Eclair merged non-strict fwding:
ACINQ/eclair#648

@Roasbeef (Member):
Based on the findings re eclair above, and also the to-be-added section to the spec on non-strict forwarding, I think we can safely proceed with these changes now. The one open question I have (was possibly answered above) is how will we deal with fee errors in the case of multiple channels to a node with distinct fees? Will we simply assume that there's no reason to do this, and not try to do anything fancy w.r.t errors sent back to the sender?

@Roasbeef (Member) left a comment:
LGTM 🗿

Needs a rebase to resolve the conflicts once #1706 is merged, then we can merge it in post release to get more active testing on mainnet and testnet.

		failedVertexes: make(map[Vertex]time.Time),
		selfNode:       selfNode,
		queryBandwidth: qb,
		graph:          g,
	}
}

// edgeSet identifies a set of channel edges from a specific node to a specific node.
Member:
👍

@joostjager (Contributor, Author):
> Based on the findings re eclair above, and also the to-be-added section to the spec on non-strict forwarding, I think we can safely proceed with these changes now. The one open question I have (was possibly answered above) is how will we deal with fee errors in the case of multiple channels to a node with distinct fees? Will we simply assume that there's no reason to do this, and not try to do anything fancy w.r.t errors sent back to the sender?

I would say that we only prune the channels that would require a fee of at most what we paid in the attempt for which we got the error. If all channels have the same policy, this means we prune all channels. If there are cheaper channels, we prune those too; more expensive ones are left in for a new path-finding round. Timelock needs to be worked in too. So: prune all channels with a lower or equal fee and a shorter or equal timelock delta, as sketched below.
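A small sketch of that rule (illustrative only, assuming we record the fee and timelock delta offered on the failed attempt): a channel in the set is pruned exactly when its policy is no more demanding than what we already offered, so retrying it cannot help.

```go
// shouldPrune reports whether a channel in the E->F set is dominated by the
// failed attempt: its policy asks for at most the fee we paid and at most
// the timelock delta we granted. More expensive channels stay in for the
// next path-finding round.
func shouldPrune(chanFeeMsat, paidFeeMsat int64, chanTimeLockDelta, usedTimeLockDelta uint16) bool {
	return chanFeeMsat <= paidFeeMsat && chanTimeLockDelta <= usedTimeLockDelta
}
```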

@joostjager (Contributor, Author):
To prune channel sets, the issue of directionality needs to be fixed first: #2243

@joostjager joostjager changed the title to "routing: prune based on channel sets instead of channels [no review]" on Dec 4, 2018
@Roasbeef Roasbeef removed this from the 0.5.2 milestone Jan 16, 2019
@joostjager (Contributor, Author) commented Jul 29, 2019:

Follow-up: #3353, in which we go one step further by not only generalizing over channels, but also tracking without directionality.

@joostjager joostjager closed this Jul 29, 2019
@joostjager joostjager reopened this Aug 2, 2019
@joostjager joostjager changed the title back to "routing: prune based on channel sets instead of channels" on Aug 2, 2019
@halseth (Contributor) left a comment:
Nice, I really like the direction this is going! Diff was much smaller than I expected, and separating pair failures from node failures also makes the code easier to reason about.

LGTM, but only did a high-level pass for now, will circle back to it.

(Resolved review thread on routing/nodepair.go.)
@@ -142,37 +142,41 @@ func (r *RouterBackend) QueryRoutes(ctx context.Context,
 		ignoredNodes[ignoreVertex] = struct{}{}
 	}

-	ignoredEdges := make(map[routing.EdgeLocator]struct{})
+	ignoredEdges := make(map[routing.DirectedNodePair]struct{})
 	for _, ignoredEdge := range in.IgnoredEdges {
Contributor:
Should we change the RPC param to use node pairs instead of chans also?

Contributor Author (@joostjager):
RPC parameter added and ignored_edges deprecated

@joostjager joostjager force-pushed the edgesets branch 3 times, most recently from 4e9f823 to 5dfeffa on August 7, 2019 11:23
@joostjager joostjager requested a review from halseth August 7, 2019 11:25
@halseth (Contributor) left a comment:
LGTM 🚂

(Review threads on lnrpc/routerrpc/router.proto, lnrpc/routerrpc/router_server.go, routing/missioncontrol.go, and routing/route/route.go; resolved.)
@cfromknecht (Contributor) left a comment:
(comments moved from #3256)

(Review threads on lnrpc/routerrpc/router_backend_test.go and routing/missioncontrol.go.)
/**
A list of directed node pairs that will be ignored during path finding.
*/
repeated NodePair ignored_pairs = 10;
Contributor:
Somewhat unfortunate that this requires 33*2/8 = 8.25x more bandwidth to transmit (assuming a single channel), but probably okay unless the set is huge.

are there any plans to make this streaming so the whole list isn't transmitted on each request?

Contributor Author (@joostjager):
I don't have those plans, and I find it difficult to foresee whether this will become a problem, because it depends on the actual use of the functionality.

Contributor:
Definitely. I just mean that, atm, this would require a quadratic amount of bandwidth/serialization to do an iterative blacklist. I don't think it'd be too hard to switch to a streaming mode later if we find it's an issue.

Previously mission control tracked failures on a per node, per channel basis.
This commit changes this to tracking on the level of directed node pairs. The goal
of moving to this coarser-grained level is to reduce the number of required
payment attempts without compromising payment reliability.
@cfromknecht (Contributor) left a comment:
LGTM 🚀 cool to finally see this improvement go in after all the preliminary refactoring has been done, happy almost birthday to this PR 🎂🎉🍾

@Roasbeef (Member) left a comment:
LGTM 🐳

@Roasbeef Roasbeef merged commit d134e03 into lightningnetwork:master Aug 14, 2019
@joostjager joostjager deleted the edgesets branch August 14, 2019 06:12