routing: process payment successes in mission control #3372

Merged
merged 2 commits into from Aug 23, 2019


@joostjager (Collaborator) commented Aug 5, 2019

This PR modifies paymentLifecycle so that it not only feeds failures into mission control, but successes as well. This allows for more accurate probability estimates. Previously, the success probability for a successful pair and a pair with no history was equal. There was no force that pushed towards previously successful routes.

A fixed success probability is introduced for node pairs that proved to be successful on the previous payment attempt. This value could be exposed as a parameter in the future.

Furthermore, the default a priori success probability (the probability assigned to node pairs with no history) is lowered from 95% to a more realistic 60%. This wasn't possible before, because it would discourage longer routes too much. With the newly introduced success probability for previously successful pairs, only long routes consisting of unknown node pairs are avoided.
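A minimal sketch of why the lower a priori value mainly penalizes long routes of unknown pairs. It assumes that a route's estimate is the product of its per-pair estimates and uses the 60% and 95% figures mentioned in this conversation. The constant and function names are hypothetical and are not taken from the lnd code.

```go
package main

import "fmt"

// Illustrative values from the PR discussion. The names are hypothetical.
const (
	aprioriProbability = 0.60 // node pairs with no history
	successProbability = 0.95 // node pairs that succeeded on the previous attempt
)

// pairProbability returns the estimate for a single node pair.
func pairProbability(succeededBefore bool) float64 {
	if succeededBefore {
		return successProbability
	}
	return aprioriProbability
}

// routeProbability multiplies the per-pair estimates along a route.
// Assumption: pairs are treated as independent.
func routeProbability(succeededBefore []bool) float64 {
	p := 1.0
	for _, ok := range succeededBefore {
		p *= pairProbability(ok)
	}
	return p
}

func main() {
	unknown := []bool{false, false, false, false}
	proven := []bool{true, true, true, true}

	fmt.Printf("4 unknown pairs: %.3f\n", routeProbability(unknown))
	fmt.Printf("4 proven pairs:  %.3f\n", routeProbability(proven))
}
```

Under these assumptions a four-hop route of unknown pairs scores roughly 0.13, while the same length over previously successful pairs scores roughly 0.81, so route length alone is no longer the dominant factor.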

No database migration is needed, because we persist only the raw payment results and not the interpretation.

@joostjager joostjager requested a review from Roasbeef as a code owner Aug 5, 2019
@joostjager joostjager changed the title routing: process payment successes in mission control routing: process payment successes in mission control [wip] Aug 5, 2019
@joostjager joostjager removed the request for review from Roasbeef Aug 5, 2019
@joostjager joostjager force-pushed the joostjager:mc-successes branch 3 times, most recently from 5e3c09f to 17c38b4 Aug 5, 2019
@joostjager joostjager changed the title routing: process payment successes in mission control [wip] routing: process payment successes in mission control Aug 6, 2019
@joostjager joostjager force-pushed the joostjager:mc-successes branch from 17c38b4 to b5ddcf8 Aug 6, 2019
@wpaulino wpaulino added this to the 0.8.0 milestone Aug 7, 2019
@joostjager joostjager requested review from halseth and cfromknecht Aug 7, 2019
@Roasbeef Roasbeef requested review from wpaulino and removed request for cfromknecht Aug 13, 2019
Collaborator left a comment

Changes LGTM, only have minor comments. My review is only of the relevant commit for this PR: b5ddcf8.

routing/missioncontrol.go (outdated, resolved)
routing/missioncontrol.go (resolved)
routing/missioncontrol_test.go (outdated, resolved)
routing/result_interpretation.go (outdated, resolved)
routing/missioncontrol.go (resolved)
routing/result_interpretation_test.go (outdated, resolved)
@joostjager joostjager force-pushed the joostjager:mc-successes branch 2 times, most recently from 96675fa to b5c88f7 Aug 14, 2019
@joostjager joostjager requested review from wpaulino and Roasbeef and removed request for halseth Aug 14, 2019
@joostjager joostjager force-pushed the joostjager:mc-successes branch 2 times, most recently from bc5b1c0 to 5357daa Aug 15, 2019
@joostjager (Collaborator, Author) commented Aug 15, 2019

Converted to parameterized tests and added one for incorrect payment details.

@joostjager joostjager requested a review from wpaulino Aug 15, 2019
@joostjager joostjager force-pushed the joostjager:mc-successes branch from 5357daa to 7a6dd9a Aug 15, 2019
Collaborator left a comment

LGTM 💥

@wpaulino (Collaborator) commented Aug 15, 2019

Currently failing travis though:

--- FAIL: TestMissionControl (0.01s)
missioncontrol_test.go:175: unexpected number of channels

Member left a comment

Only a few minor comments, and some brainstorming for possible future directions. This depends on another PR, so we'll need to wait until that goes in before we can proceed with this one.

routing/missioncontrol.go (resolved)
routing/result_interpretation.go (resolved)
getTestPair(2, 1): 0,
getTestPair(2, 3): 0,
getTestPair(3, 2): 0,
pairResults: map[DirectedNodePair]pairResult{

@Roasbeef (Member), Aug 16, 2019

Shouldn't all nodes but the final have been marked as a success since they successfully forwarded?

@joostjager (Author, Collaborator), Aug 17, 2019

In case of ExpiryTooSoon, any of the nodes could have caused a delay triggering this failure.

@joostjager joostjager force-pushed the joostjager:mc-successes branch from 7a6dd9a to 17cd4d4 Aug 17, 2019
@joostjager joostjager requested a review from cfromknecht as a code owner Aug 17, 2019
@joostjager (Collaborator, Author) commented Aug 17, 2019

Currently failing travis though:

--- FAIL: TestMissionControl (0.01s)
missioncontrol_test.go:175: unexpected number of channels

Fixed

@joostjager joostjager force-pushed the joostjager:mc-successes branch 2 times, most recently from 2b87a39 to 9855ca4 Aug 17, 2019
@joostjager joostjager requested a review from Roasbeef Aug 19, 2019
@joostjager joostjager force-pushed the joostjager:mc-successes branch from 9855ca4 to 88debb9 Aug 21, 2019
@joostjager (Collaborator, Author) commented Aug 21, 2019

Rebased after merge of #3256

Collaborator left a comment

Changes LGTM

The only thing that isn't immediately obvious is the constants: how much testing/validation has been done in the wild?

@joostjager (Collaborator, Author) commented Aug 22, 2019

Yes, valid concern. Those constants are set to something that looked reasonable to me, but there is no science behind it.

I spent quite some time producing a test result that would show the benefit of the changes in this PR.

Approach:

  • Try to pay to random (testnet) nodes with an unknown hash. The ones that reply with invalid_payment_details are logged in a file. I stopped when this file had 40 nodes in it. So those are nodes that I could pay to: the test set.
  • Reset mission control
  • Run a test where an unknown-hash payment is made to each of those test set nodes. Payments are executed concurrently, but no more than 5 at the same time. This should really exercise mission control and provide path finding input.
  • Count the number of payment attempts made by grepping the lnd log. It took 66 attempts to complete the 40 payments.
  • Repeat the same test on master. It needed 68 attempts.
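The concurrency limit in the approach above can be sketched as follows. This is only an illustration of running payments with at most five in flight while counting attempts. It is not the script that was used for the test, and `payFunc`, `runPayments`, and the stub in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// payFunc is a hypothetical stand-in for one payment to a target node. It
// returns the number of attempts the payment needed.
type payFunc func(target int) int

// runPayments pays every target concurrently, with at most maxInFlight
// payments active at once. It returns the total attempt count and the
// highest concurrency that was actually observed.
func runPayments(targets []int, maxInFlight int, pay payFunc) (total, peak int) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
	)
	sem := make(chan struct{}, maxInFlight)

	for _, target := range targets {
		sem <- struct{}{} // Blocks while maxInFlight payments are active.
		wg.Add(1)

		go func(target int) {
			defer wg.Done()

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			attempts := pay(target)

			mu.Lock()
			total += attempts
			inFlight--
			mu.Unlock()

			<-sem // Free the slot for the next payment.
		}(target)
	}

	wg.Wait()
	return total, peak
}

func main() {
	targets := make([]int, 40)
	for i := range targets {
		targets[i] = i
	}

	// Stub: every fourth target needs a second attempt.
	stub := func(target int) int {
		if target%4 == 0 {
			return 2
		}
		return 1
	}

	total, peak := runPayments(targets, 5, stub)
	fmt.Printf("payments=%d attempts=%d peak concurrency=%d\n",
		len(targets), total, peak)
}
```

A buffered channel is used as a counting semaphore here because it keeps the limit explicit without extra dependencies.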

No significant difference unfortunately. Intuitively I'd expect that accounting for successes should show up in the number of attempts needed, but with this test setup it doesn't. I tried with different amounts, but that didn't make much of a difference either.

My current theory is that there are not enough failures to make any improvement in mission control stand out. However, in other circumstances (mainnet, node connected to a bad cluster) it could make a difference. It isn't ideal though that I cannot demonstrate this.

One thing that I did get out of this test is that there apparently isn't a regression with the changes in this pr. Payment performance is about the same as before.

What I didn't test is the difference in performance over time. Because failures decay (and we need to decay them, otherwise we may end up with an empty channel set), a mission control based on failures only will slowly forget everything it has seen. The decay speed can be tuned, but it remains a difficult parameter to pick. Success observations, as they are introduced in this PR, do not decay. They keep the success probability of the channel high as long as no failure is observed. That means that this part of mission control's memory isn't forgetting. Therefore payment performance after a day (with the default mission control half-life) could very well be noticeably better for a successes-aware mission control.
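To make the contrast between decaying failures and non-decaying successes concrete, here is a minimal sketch. It assumes an exponential half-life recovery of a failed pair back toward the a priori value, which is one plausible shape and may differ from the exact formula in lnd. The function names and the one-hour half-life in `main` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative values from the PR discussion. The names are hypothetical.
const (
	aprioriProbability = 0.60
	successProbability = 0.95
)

// failureEstimate returns the estimate for a pair whose last result was a
// failure hoursSince hours ago. Assumption: the estimate starts at zero and
// recovers toward the a priori value with the given half-life.
func failureEstimate(hoursSince, halfLifeHours float64) float64 {
	return aprioriProbability * (1 - math.Pow(2, -hoursSince/halfLifeHours))
}

// successEstimate returns the estimate for a pair whose last result was a
// success. It ignores the elapsed time because successes do not decay.
func successEstimate(_ float64) float64 {
	return successProbability
}

func main() {
	const halfLifeHours = 1.0 // assumed value for illustration only

	for _, h := range []float64{0, 1, 2, 24} {
		fmt.Printf("after %4.0fh: failed pair %.3f, successful pair %.3f\n",
			h, failureEstimate(h, halfLifeHours), successEstimate(h))
	}
}
```

Under this model a failed pair becomes indistinguishable from an unknown pair after enough half-lives, while a previously successful pair keeps its higher estimate until a failure is observed.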

@joostjager (Collaborator, Author) commented Aug 22, 2019

To validate that last hypothesis (performance over time), I did the following:

  • Run the test as described above
  • Re-running this test on master after a day would return the same score, because all mission control data would have decayed.
  • With this PR, it should perform better because successes don't decay. To test this, I restarted lnd and set routerrpc.penaltyhalflife to 1 second. After lnd was started, I verified by running querymc that the failures had indeed decayed back to 60%. The successes were still at 95%.
  • I reran the test with mission control history from the first run of which the failures had decayed artificially fast.
  • 42 payment attempts were needed, meaning every payment but two succeeded the first time.

So possibly the behavior over time is where the real win of this PR is.
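The three results reported in this thread can be put side by side with a little arithmetic. The numbers below are the ones quoted above. The type and helper names are hypothetical.

```go
package main

import "fmt"

// testRun holds the attempt and payment counts of one test run.
type testRun struct {
	name     string
	attempts int
	payments int
}

// attemptsPerPayment returns the average number of attempts per payment.
func attemptsPerPayment(r testRun) float64 {
	return float64(r.attempts) / float64(r.payments)
}

// retries returns how many attempts were made beyond the first one.
func retries(r testRun) int {
	return r.attempts - r.payments
}

func main() {
	runs := []testRun{
		{"this PR, fresh mission control", 66, 40},
		{"master, fresh mission control", 68, 40},
		{"this PR, failures decayed", 42, 40},
	}

	for _, r := range runs {
		fmt.Printf("%-32s %.2f attempts/payment, %d retries\n",
			r.name, attemptsPerPayment(r), retries(r))
	}
}
```

This gives 1.65 and 1.70 attempts per payment for the two fresh runs, and 1.05 for the run that reused the success history.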

joostjager added 2 commits Aug 14, 2019
This commit modifies paymentLifecycle so that it not only feeds
failures into mission control, but successes as well.
This allows for more accurate probability estimates. Previously,
the success probability for a successful pair and a pair with
no history was equal. There was no force that pushed towards
previously successful routes.
@joostjager joostjager force-pushed the joostjager:mc-successes branch from 88debb9 to ff0c5a0 Aug 23, 2019
Member left a comment

LGTM 🐲

@Roasbeef Roasbeef merged commit 557083c into lightningnetwork:master Aug 23, 2019
2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.09%) to 61.115%
@1043465747 commented Aug 23, 2019
