
routing+channeldb: create new sub-system to keep track of "fitness" of various nodes #1253

Closed
Roasbeef opened this issue May 16, 2018 · 1 comment
Labels: advanced, autopilot, database, mission control, P2, P3, routing

Comments

@Roasbeef
Member

Throughout the operation of lnd, we directly and indirectly interact with various nodes within the network. We may see a new NodeAnnouncement for a node, successfully route a payment where the node was a traversed vertex, receive an HTLC routing failure from that node, or need to force close a channel because a node has gone inactive. This is all very valuable information that currently isn't being persistently tracked at all. If we start to collate this information persistently, we can begin to use it in several sub-systems within the daemon. For example, within the autopilot sub-system, we can use it to weed out nodes that have historically been unreliable. Another example lies in the missionControl sub-system: we can preemptively blacklist any path that includes historically faulty vertices.

Steps To Completion

  • Create a new NodeHistory sub-bucket within the current channeldb package. Starting points for desirable metrics to track include:

    • first time a node was seen
    • last time a node was seen
    • number of times they were included within an attempted route
    • number of times a route they were involved w/ was successfully settled
    • number of times a route they were involved w/ was cancelled back (can be more granular: sent error or was source of error)
    • number of times a route they were involved w/ had to timeout
    • number of times we had to force close a channel w/ the node due to inactivity
    • number of times the node incurred a protocol violation (invalid commit or the like)
    • largest number of channels closed by the node in a single block or in consecutive blocks
    • number of attempted breaches
    • number of successful breaches
    • frequency of channel turnover (are channels opened, then closed only a few blocks afterwards?)
  • Due to the nature of each of these fields, we'll likely want to adopt a sort of "columnar" schema within the database. This can be implemented by using key || nodeID prefixes. This schema will allow for efficient queries/scans, as all the data for a given metric will be laid out contiguously. Metrics for a particular node can still be obtained by fully populating the two-tuple key. A rough sketch of this layout is included after this list.

  • Start to populate/integrate incrementing counters for the various metrics in the ChannelRouter as well as within missionControl. For example, each successful/failed payment should trigger an update (or several updates) to the NodeHistory database.

  • As a bonus, we can start to proactively probe nodes within the missionControl sub-system using specially crafted HTLCs. If we find a node that we want to update our statistics for, we'll locate the smallest possible HTLC value that all nodes on paths to that node will accept. We'll then send an HTLC with this value, but with a random, unknown payment hash, to the node. If the node is online, it should cancel back with an unknown payment hash error; otherwise, we'll get various other types of errors back.

  • Create a set of RPCs to allow users/UIs/monitoring tools to query (and possibly manually update) these metrics.
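
As referenced in the schema step above, here is a minimal sketch of the key || nodeID "columnar" layout, using bbolt (the embedded store channeldb builds on). The bucket name, metric prefixes, and helper are hypothetical, purely to illustrate how per-metric prefixes keep all values for one metric contiguous:

```go
package channeldb

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// Illustrative bucket and metric prefixes; these names are not part of the
// actual channeldb schema.
var (
	nodeHistoryBucket = []byte("node-history")

	// Each metric gets its own prefix, so the full key is
	// metricPrefix || nodeID and all values for one metric are contiguous.
	prefixRouteAttempts = []byte{0x01}
	prefixRouteSettles  = []byte{0x02}
	prefixRouteFailures = []byte{0x03}
)

// incrementNodeMetric bumps the counter stored under metricPrefix || nodeID.
func incrementNodeMetric(tx *bolt.Tx, metricPrefix []byte, nodeID [33]byte) error {
	bucket, err := tx.CreateBucketIfNotExists(nodeHistoryBucket)
	if err != nil {
		return err
	}

	key := append(append([]byte{}, metricPrefix...), nodeID[:]...)

	var count uint64
	if v := bucket.Get(key); len(v) == 8 {
		count = binary.BigEndian.Uint64(v)
	}

	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], count+1)
	return bucket.Put(key, buf[:])
}
```

Scanning all nodes for one metric is then a simple cursor walk over keys sharing the metric prefix, while fully populating the two-tuple key still gives a point lookup for a single node.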

@Roasbeef added the advanced, database, routing, autopilot, and mission control labels on May 16, 2018
@Roasbeef added the P2 and P3 labels on Jul 10, 2018
@joostjager
Contributor

The approach described in this issue is to start by (persistently) tracking various metrics, and from there to start using those metrics to improve autopilot and routing behaviour.

I find it difficult to define the metrics without yet knowing what exactly is needed.

How would this work if the process were turned around? Start by trying to come up with the metrics that could improve routing and/or autopilot, create an in-memory implementation first to validate that behaviour indeed improves, and only then move on to database schemas.

Looking just at routing, there are already some metrics being tracked in memory. There are blacklists for nodes and channels with decay periods, and there is a "second chance" after the first failure. Pull request #1734 contains some thoughts about how to extend this to sets of channels.

One idea is to try to unify all this with the following mechanism: expose, for every node and channel, a "success chance" to the path-finding algorithm. For a payment to succeed, none of the nodes and channels on the path may fail, so the success chance for a payment is the product of the success chances of all nodes and channels along the path. This could be factored into the edge weight function, so that routing takes into account not only fee and timelock, but success chance as well. A trade-off needs to be made here.
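
As a minimal sketch of how this could look: because chances multiply along a path while edge weights add, one option is to add a -log(chance) penalty per edge, so minimizing total weight also favours paths with a high overall success chance. The function name, signature, and constants below are assumptions for illustration, not lnd's actual weight function:

```go
package routing

import "math"

// edgeWeight is a hypothetical weight function combining fee, timelock cost,
// and the hop's estimated success chance. Because chances multiply along a
// path while weights add, -log(successChance) is used as the penalty term.
// The constants are illustrative tuning knobs, not values used by lnd.
func edgeWeight(feeMsat, timeLockDelta int64, successChance float64) float64 {
	const (
		timeLockWeight = 10.0 // relative cost per block of timelock
		chanceWeight   = 1e6  // how strongly unreliability is penalized
	)

	// A hop with no chance of success should never be chosen.
	if successChance <= 0 {
		return math.Inf(1)
	}

	return float64(feeMsat) +
		timeLockWeight*float64(timeLockDelta) +
		chanceWeight*(-math.Log(successChance))
}
```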

Then there needs to be a unit that provides those chances. A first step could be to rewrite the current blacklist logic as a "success chance provider". The second-chance mechanism can be incorporated into this unit, and the time decay can be translated into a success chance that slowly recovers. This first step alone might already bring up some interesting questions.

Looking at how to extend this, a further step could be to keep track of the success chance for unknown nodes and channels. This is the base chance when no other information is available.

Next, based on successes or failures, the chance may be adjusted upward or downward. As time passes, the chance slowly moves back to the base chance.
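
A small sketch of how such a recovering chance could be tracked, assuming an exponential decay back toward the base chance; the type, constants, and update rules are all illustrative assumptions:

```go
package routing

import (
	"math"
	"time"
)

// nodeChance tracks the estimated success chance for a node, decaying back
// toward a base chance as time passes.
type nodeChance struct {
	chance     float64   // last estimated success chance
	lastUpdate time.Time // when chance was last adjusted
}

const (
	baseChance = 0.6           // assumed chance for unknown nodes/channels
	halfLife   = 1 * time.Hour // time for half of an adjustment to fade
)

// current returns the chance at time now, moved part of the way back toward
// baseChance using an exponential decay with the configured half-life.
func (n *nodeChance) current(now time.Time) float64 {
	elapsed := now.Sub(n.lastUpdate)
	decay := math.Pow(0.5, elapsed.Hours()/halfLife.Hours())
	return baseChance + (n.chance-baseChance)*decay
}

// observe nudges the chance up on success and down on failure, then resets
// the decay clock.
func (n *nodeChance) observe(success bool, now time.Time) {
	c := n.current(now)
	if success {
		c += (1 - c) * 0.25 // move a quarter of the way toward 1
	} else {
		c *= 0.5 // halve the chance on failure
	}
	n.chance, n.lastUpdate = c, now
}
```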

Interesting here is how to deal with timeouts. It can take a long time before an htlc times out, but if no response has been received after, say, one minute, you may already want to start lowering the success chance of that node or channel.

Another aspect that could factor into the success chance is the payment amount, both in absolute terms and in relation to the total channel capacity.
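
For instance, the chance could be scaled down as the amount approaches the channel capacity; a purely illustrative sketch:

```go
// amountFactor is an illustrative discount on the success chance as the
// payment amount approaches the channel capacity: payments close to the
// full capacity are assumed to be less likely to succeed.
func amountFactor(amtMsat, capacityMsat int64) float64 {
	if capacityMsat <= 0 {
		return 0
	}
	ratio := float64(amtMsat) / float64(capacityMsat)
	if ratio >= 1 {
		return 0
	}
	return 1 - ratio
}
```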

It may require quite some experimentation (or even simulation) to get this right. At that point, it will also be clear what kind of information needs to be tracked. A subsequent step is to save this information in the db. Maybe it will result in more or less the same metrics as suggested in the description of this issue, but at least then you will know exactly why you need those metrics.
