
routing+channeldb: create new sub-system to keep track of "fitness" of various nodes #1253

Closed
Roasbeef opened this issue May 16, 2018 · 1 comment
Labels: advanced, autopilot, database, mission control, P2, P3, routing

Comments

@Roasbeef
Member

Throughout the operation of lnd, we directly and indirectly interact with various nodes within the network. We may see a new NodeAnnouncement for a node, successfully route a payment where the node was a traversed vertex, receive an HTLC routing failure from that node, or need to force close a channel because a node has gone inactive. This is all very valuable information that currently isn't being persistently tracked at all. If we start to collate this information persistently, we can begin to use it in several sub-systems within the daemon. For example, within the autopilot sub-system, we can use it to weed out nodes that have historically been unreliable. Another example lies in the missionControl sub-system: we can preemptively blacklist any path that includes historically faulty vertices.

Steps To Completion

  • Create a new NodeHistory sub-bucket within the current channeldb package. Starting points for desirable metrics to track include:

    • first time a node was seen
    • last time a node was seen
    • number of times they were included within an attempted route
    • number of times a route they were involved w/ was successfully settled
    • number of times a route they were involved w/ was cancelled back (can be more granular: sent error or was source of error)
    • number of times a route they were involved w/ had to timeout
    • number of times we had to force close a channel w/ the node due to inactivity
    • number of times the node incurred a protocol violation (invalid commit or the like)
    • largest number of channels closed by the node in a single block or in consecutive blocks
    • number of attempted breaches
    • number of successful breaches
    • frequency of channel turnover (are channels opened, then closed only a few blocks afterwards?)
  • Due to the nature of each of these fields, we'll likely want to adopt a sort of "columnar" schema within the database. This can be implemented by using key || nodeID prefixes. This schema will allow for efficient queries/scans, as all the data for a given metric will be laid out contiguously. Metrics for a particular node can still be obtained by fully populating the two-tuple key. A rough sketch of this layout is included after this list.

  • Start to populate/integrate incrementing counters for the various metrics in the ChannelRouter as well as within missionControl. For example, each successful/failed payment should trigger an update (or several updates) to the NodeHistory database.

  • As a bonus, we can start to proactively probe nodes within the missionControl sub-system using specially crafted HTLCs. If we find a node that we want to update our statistics for, we'll locate the smallest possible HTLC value that all nodes on paths to that node will accept. We'll then send an HTLC with this value, but with a random, unknown payment hash, to the node. If the node is online, it should cancel back with an unknown payment hash error; otherwise, we'll get various other types of errors back.

  • Create a set of RPCs to allow users/UIs/monitoring tools to query (and possibly manually update) these metrics.
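
As referenced in the schema step above, here is a minimal sketch of the key || nodeID "columnar" layout, using bbolt (the embedded store channeldb builds on). The bucket name, metric prefixes, and helper are hypothetical, purely to illustrate how per-metric prefixes keep all values for one metric contiguous:

```go
package channeldb

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// Illustrative bucket and metric prefixes; these names are not part of the
// actual channeldb schema.
var (
	nodeHistoryBucket = []byte("node-history")

	// Each metric gets its own prefix, so the full key is
	// metricPrefix || nodeID and all values for one metric are contiguous.
	prefixRouteAttempts = []byte{0x01}
	prefixRouteSettles  = []byte{0x02}
	prefixRouteFailures = []byte{0x03}
)

// incrementNodeMetric bumps the counter stored under metricPrefix || nodeID.
func incrementNodeMetric(tx *bolt.Tx, metricPrefix []byte, nodeID [33]byte) error {
	bucket, err := tx.CreateBucketIfNotExists(nodeHistoryBucket)
	if err != nil {
		return err
	}

	key := append(append([]byte{}, metricPrefix...), nodeID[:]...)

	var count uint64
	if v := bucket.Get(key); len(v) == 8 {
		count = binary.BigEndian.Uint64(v)
	}

	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], count+1)
	return bucket.Put(key, buf[:])
}
```

Scanning all nodes for one metric is then a simple cursor walk over keys sharing the metric prefix, while fully populating the two-tuple key still gives a point lookup for a single node.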

@Roasbeef added the advanced, database, routing, autopilot, and mission control labels on May 16, 2018
@Roasbeef added the P2 and P3 labels on Jul 10, 2018
@joostjager
Contributor

The approach described in this issue is to start by (persistently) tracking various metrics, and from there to start using those metrics to improve autopilot and routing behaviour.

I find it difficult to define the metrics without yet knowing what exactly is needed.

How would this work if the process were turned around? Start by trying to come up with the metrics that could improve routing and/or autopilot, create an in-memory implementation first to validate that behaviour indeed improves, and only then move on to database schemas.

Looking just at routing, there are already some metrics being tracked in memory. There are blacklists for nodes and channels with decay periods, and there is a "second chance" after the first failure. Pull request #1734 contains some thoughts about how to extend this to sets of channels.

One idea is to try to unify all this with the following mechanism: expose, for every node and channel, a "success chance" to the path-finding algorithm. For a payment to succeed, none of the nodes and channels on the path may fail, so the success chance for a payment is the product of the success chances of all nodes and channels along the path. This could be factored into the edge weight function, so that routing takes into account not only fee and timelock, but success chance as well. A trade-off needs to be made here.
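
As a minimal sketch of how this could look: because chances multiply along a path while edge weights add, one option is to add a -log(chance) penalty per edge, so minimizing total weight also favours paths with a high overall success chance. The function name, signature, and constants below are assumptions for illustration, not lnd's actual weight function:

```go
package routing

import "math"

// edgeWeight is a hypothetical weight function combining fee, timelock cost,
// and the hop's estimated success chance. Because chances multiply along a
// path while weights add, -log(successChance) is used as the penalty term.
// The constants are illustrative tuning knobs, not values used by lnd.
func edgeWeight(feeMsat, timeLockDelta int64, successChance float64) float64 {
	const (
		timeLockWeight = 10.0 // relative cost per block of timelock
		chanceWeight   = 1e6  // how strongly unreliability is penalized
	)

	// A hop with no chance of success should never be chosen.
	if successChance <= 0 {
		return math.Inf(1)
	}

	return float64(feeMsat) +
		timeLockWeight*float64(timeLockDelta) +
		chanceWeight*(-math.Log(successChance))
}
```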

Then there needs to be a unit that provides those chances. A first step could be to rewrite the current blacklist logic as a "success chance provider". The second-chance mechanism can be incorporated into this unit, and the time decay can be translated into a success chance that slowly recovers. This first step alone might already bring up some interesting questions.

Looking at how to extend this, a further step could be to keep track of the success chance for unknown nodes and channels. This is the base chance when no other information is available.

Next, based on successes or failures, the chance may be adjusted upward or downward. As time passes, the chance slowly moves back to the base chance.
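
A small sketch of how such a recovering chance could be tracked, assuming an exponential decay back toward the base chance; the type, constants, and update rules are all illustrative assumptions:

```go
package routing

import (
	"math"
	"time"
)

// nodeChance tracks the estimated success chance for a node, decaying back
// toward a base chance as time passes.
type nodeChance struct {
	chance     float64   // last estimated success chance
	lastUpdate time.Time // when chance was last adjusted
}

const (
	baseChance = 0.6           // assumed chance for unknown nodes/channels
	halfLife   = 1 * time.Hour // time for half of an adjustment to fade
)

// current returns the chance at time now, moved part of the way back toward
// baseChance using an exponential decay with the configured half-life.
func (n *nodeChance) current(now time.Time) float64 {
	elapsed := now.Sub(n.lastUpdate)
	decay := math.Pow(0.5, elapsed.Hours()/halfLife.Hours())
	return baseChance + (n.chance-baseChance)*decay
}

// observe nudges the chance up on success and down on failure, then resets
// the decay clock.
func (n *nodeChance) observe(success bool, now time.Time) {
	c := n.current(now)
	if success {
		c += (1 - c) * 0.25 // move a quarter of the way toward 1
	} else {
		c *= 0.5 // halve the chance on failure
	}
	n.chance, n.lastUpdate = c, now
}
```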

Interesting here is how to deal with timeouts. It can take a long time before an htlc times out, but if no response has been received after, say, one minute, you may already want to start lowering the success chance of that node or channel.

Another aspect that could factor into the success chance is the payment amount, both in absolute terms and in relation to the total channel capacity.
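
For instance, the chance could be scaled down as the amount approaches the channel capacity; a purely illustrative sketch:

```go
// amountFactor is an illustrative discount on the success chance as the
// payment amount approaches the channel capacity: payments close to the
// full capacity are assumed to be less likely to succeed.
func amountFactor(amtMsat, capacityMsat int64) float64 {
	if capacityMsat <= 0 {
		return 0
	}
	ratio := float64(amtMsat) / float64(capacityMsat)
	if ratio >= 1 {
		return 0
	}
	return 1 - ratio
}
```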

It may require quite some experimentation (or even simulation) to get this right. At that point, it will also be clear what kind of information needs to be tracked. A subsequent step is to save this information in the db. Maybe it will result in more or less the same metrics as suggested in the description of this issue, but at least then you will know exactly why you need those metrics.
