
Stop decaying liquidity information during scoring #2656

Merged

Conversation

TheBlueMatt
Collaborator

Because scoring is an incredibly performance-sensitive operation,
doing liquidity information decay (and especially fetching the
current time!) during scoring isn't really a great idea. Instead, this PR moves to handling decaying in a background processor job.

This should fix #2311, which apparently is still an issue for some users as of 0.0.116.

There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves an ms or two, though certainly not a ton, (c) this leads to a much nicer/simpler API - we can remove Time, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh a, b, and c here.
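To make the new shape concrete, here's a minimal sketch of what the background job amounts to, assuming (per the commit messages in this PR) the new `ScoreUpdate` method takes the current time as a `Duration` since the UNIX epoch. The trait stand-in and function name here are illustrative, not LDK's actual API:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Illustrative stand-in for the trait method this PR adds to `ScoreUpdate`.
trait DecayableScorer {
    fn decay_liquidity_certainty(&mut self, duration_since_epoch: Duration);
}

// Called from the background processor on its scorer timer, rather than
// fetching the time and decaying on every scoring call during routefinding.
fn on_scorer_timer_tick<S: DecayableScorer>(scorer: &mut S) {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or(Duration::ZERO);
    scorer.decay_liquidity_certainty(now);
}
```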

@codecov-commenter

codecov-commenter commented Oct 9, 2023

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (9856fb6) 88.64% compared to head (f8fb70a) 88.93%.
Report is 4 commits behind head on main.

Files Patch % Lines
lightning/src/routing/scoring.rs 92.69% 18 Missing and 1 partial ⚠️
lightning/src/util/test_utils.rs 0.00% 5 Missing ⚠️
lightning-background-processor/src/lib.rs 91.17% 0 Missing and 3 partials ⚠️
lightning/src/routing/router.rs 0.00% 2 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2656      +/-   ##
==========================================
+ Coverage   88.64%   88.93%   +0.29%     
==========================================
  Files         115      115              
  Lines       91894    93489    +1595     
  Branches    91894    93489    +1595     
==========================================
+ Hits        81458    83145    +1687     
+ Misses       7953     7885      -68     
+ Partials     2483     2459      -24     


@tnull tnull self-requested a review October 12, 2023 08:12
@TheBlueMatt
Collaborator Author

Fixed the no-std bug, should be good to go now.

@TheBlueMatt
Collaborator Author

Marking this 0.0.119, it turns out we're not decaying our historical buckets properly in at least two cases - (a) get_total_valid_points only shifts once, but the data is squared, so it should be shifting twice (see the toy illustration below), (b) the issue fixed alternatively in #2530.

While both could be fixed directly, I'd like to at least consider this first.
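For point (a), a toy illustration (made-up constants, not LDK's actual bucket code) of why data formed from a product of two decayed values needs twice the shift:

```rust
// Toy model: buckets decay by right-shifting `shifts` bits. "Points" built
// from a *product* of two bucket values pick up the decay factor twice, so
// normalizing a raw product requires shifting by `2 * shifts`, not `shifts`.
fn decayed_bucket(bucket: u64, shifts: u32) -> u64 {
    bucket >> shifts
}

fn main() {
    let (min_b, max_b, shifts) = (64u64, 32u64, 1u32);
    let points = decayed_bucket(min_b, shifts) * decayed_bucket(max_b, shifts);
    // Shifting the raw product only once would leave it 2x too large.
    assert_eq!(points, (min_b * max_b) >> (2 * shifts));
    println!("{points}");
}
```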

Comment on lines 244 to 273
 match event {
 	Event::PaymentPathFailed { ref path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_failed(path, *scid);
+		score.payment_path_failed(path, *scid, duration_since_epoch);
 	},
 	Event::PaymentPathFailed { ref path, payment_failed_permanently: true, .. } => {
 		// Reached if the destination explicitly failed it back. We treat this as a successful probe
 		// because the payment made it all the way to the destination with sufficient liquidity.
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::PaymentPathSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.payment_path_successful(path);
+		score.payment_path_successful(path, duration_since_epoch);
 	},
 	Event::ProbeSuccessful { path, .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_successful(path);
+		score.probe_successful(path, duration_since_epoch);
 	},
 	Event::ProbeFailed { path, short_channel_id: Some(scid), .. } => {
 		let mut score = scorer.write_lock();
-		score.probe_failed(path, *scid);
+		score.probe_failed(path, *scid, duration_since_epoch);
 	},
 	_ => return false,
 }
Contributor

Won't this mean channels along recently used paths will have their offsets decayed but other channels will not?

Collaborator Author

Rather the opposite - by the end of the patchset, we only decay in the timer method. When updating we just set the last-update to duration_since_epoch. In theory if a channel is updated in between each timer tick it won't be materially decayed, but I think that's kinda okay - I mean, it's not a lot of time anyway. If we want to be more pedantically correct I could decay the old data before update.

Contributor

Maybe I'm confused, but it looks like we only decay once per hour in the background processor.

Collaborator Author

Plus once on startup. I'm not understanding the issue you're raising - are you saying we should reduce the hour to something less?

Contributor

Yeah, I was pointing out that we are left in a state of partial decay. Added a comment elsewhere, but if you modify last_updated and set, say, the max offset, then you need to decay the min offset. Otherwise, it won't be properly decayed on the timer tick. So, after fixing that, you'll end up with recently used channels decayed while the others are not.

Contributor

> All that said, I'm not really convinced either is a super critical issue, at least if we decay more often, at max we'd be off by a small part of a half-life.

Hmm... if one offset is updated frequently, you'll get into a state where the other offset is only ever partially decayed even though it may have been given that value many half-lives ago. So would really depend on both payment and decay frequency.

Collaborator Author

If we're regularly sending some sats over a channel successfully, so we're constantly reducing our upper bound by the amount we're sending, I think it's fine to not decay the lower bound? We'll eventually pick some other channel to send over because we ran out of estimated liquidity, and we'll decay at that point.

Contributor

FWIW, that's not the only scenario. Failures at a channel and downstream from it adjust its upper and lower bounds, respectively. So if you fail downstream with increasing amounts, the upper bound may not be properly decayed.

Collaborator Author

Right, but presumably repeatedly failing downstream of a channel with higher and higher amounts isn't super likely.

Contributor

Not necessarily for the same payment or at the same downstream channel. From the perspective of the scored channel, it's simply the normal case of learning a more accurate lower bound on its liquidity as a consequence of knowing a payment routed through it but failed downstream.

Contributor

@jkczyz jkczyz left a comment

> There was some discussion of an alternative approach where we fetch the time at the start of a routefinding session, store it on the stack, and pass it through to the scorer as we go. I opted not to do this because (a) bindings can't map unbounded generics, which this would need, (b) this avoids actually doing the decay during scoring at all, which probably saves an ms or two, though certainly not a ton, (c) this leads to a much nicer/simpler API - we can remove Time, which we either need to remove or make public (see #2497), and can drop the excess type alias, both of which are much nicer than the alternative. I'm open to more discussion here, but the cost of having one more thing to call as time moves forward doesn't seem high enough to outweigh a, b, and c here.

Hmmm... (a) can be avoided if we use a Duration since we have Time::duration_since_epoch. (b) seems negligible. And I'm not entirely convinced on (c) regarding the API, as now Duration is used in the mutable but not the non-mutable interface, which isn't very intuitive in places (see comments). Also, there's the risk of adding new bugs.

Comment on lines 1315 to 1316
*self.last_updated = duration_since_epoch;
*self.offset_history_last_updated = duration_since_epoch;
Contributor

If you change last_updated, max_liquidity_offset_msat needs to be decayed. Likewise, for the buckets when changing offset_history_last_updated, right?
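For reference, a self-contained sketch of the "decay the old data before update" option floated earlier in this thread. All names and the shift-based decay here are illustrative stand-ins, not LDK's actual structure:

```rust
use std::time::Duration;

struct Liquidity {
    min_liquidity_offset_msat: u64,
    max_liquidity_offset_msat: u64,
    capacity_msat: u64,
    last_updated: Duration,
    half_life: Duration,
}

impl Liquidity {
    // Stand-in decay: one right-shift per whole elapsed half-life.
    fn decayed_offset(&self, offset: u64, elapsed: Duration) -> u64 {
        let half_lives = elapsed.as_secs() / self.half_life.as_secs().max(1);
        offset.checked_shr(half_lives as u32).unwrap_or(0)
    }

    fn set_max_liquidity_msat(&mut self, amount_msat: u64, duration_since_epoch: Duration) {
        // Bring the *other* offset up to date before resetting last_updated,
        // so the elapsed time isn't silently dropped for it.
        let elapsed = duration_since_epoch.saturating_sub(self.last_updated);
        self.min_liquidity_offset_msat =
            self.decayed_offset(self.min_liquidity_offset_msat, elapsed);
        self.max_liquidity_offset_msat = self.capacity_msat.saturating_sub(amount_msat);
        if amount_msat < self.min_liquidity_offset_msat {
            self.min_liquidity_offset_msat = 0;
        }
        self.last_updated = duration_since_epoch;
    }
}
```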

Collaborator Author

Yea, mostly, let's discuss on your first comment at #2656 (comment)

Comment on lines 1320 to 1326
 	fn set_max_liquidity_msat(&mut self, amount_msat: u64, duration_since_epoch: Duration) {
 		*self.max_liquidity_offset_msat = self.capacity_msat.checked_sub(amount_msat).unwrap_or(0);
-		*self.min_liquidity_offset_msat = if amount_msat < self.min_liquidity_msat() {
-			0
-		} else {
-			self.decayed_offset_msat(*self.min_liquidity_offset_msat)
-		};
-		*self.last_updated = self.now;
+		if amount_msat < *self.min_liquidity_offset_msat {
+			*self.min_liquidity_offset_msat = 0;
+		}
+		*self.last_updated = duration_since_epoch;
+		*self.offset_history_last_updated = duration_since_epoch;
Contributor

Likewise for min_liquidity_offset_msat.

Collaborator Author

Yea, mostly, let's discuss on your first comment at #2656 (comment)

Comment on lines 1252 to 1249
let existing_max_msat = self.max_liquidity_msat();
if amount_msat < existing_max_msat {
Contributor

It's a bit unintuitive that we compare against an un-decayed value even though we have the time.

Collaborator Author

Similarly, I think we can consider the undecayed values canonical for channels we're updating often, but we can discuss more on your first thread at #2656 (comment).

 /// Adjusts the channel liquidity balance bounds when failing to route `amount_msat`.
-fn failed_at_channel<Log: Deref>(&mut self, amount_msat: u64, chan_descr: fmt::Arguments, logger: &Log) where Log::Target: Logger {
+fn failed_at_channel<Log: Deref>(
+	&mut self, amount_msat: u64, duration_since_epoch: Duration, chan_descr: fmt::Arguments, logger: &Log
Contributor

Gotta say, I really don't like the incongruity of passing the current time in the mutable interface but not in the non-mutable one, which can't be avoided with this approach. That makes uses of the non-mutable interface from the mutable interface harder to reason about. I much prefer the approach of passing the current time to DirectedChannelLiquidity.

Collaborator Author

Yea, I see why it's annoying to have another parameter, but I kinda disagree about it belonging in DirectedChannelLiquidity. Everything else in DirectedChannelLiquidity is just a reference to liquidity data for a single channel, plus a reference to the decay settings of the overall scorer. The current time doesn't fit into either of those, and is information about the failed payment, which is otherwise all arguments to the failed/success methods.

*self.max_liquidity_offset_msat = 0;
}
*self.last_updated = duration_since_epoch;
*self.offset_history_last_updated = duration_since_epoch;
Contributor

Should we update offset_history_last_updated in update_history_buckets instead?

Collaborator Author

Yes, good catch, fixed.

-		let half_lives = self.now.duration_since(*self.last_updated).as_secs()
+	fn update_history_buckets(&mut self, bucket_offset_msat: u64, duration_since_epoch: Duration) {
+		let half_lives =
+			duration_since_epoch.checked_sub(*self.offset_history_last_updated)
Contributor

This isn't accurate given that offset_history_last_updated is updated in set_min_liquidity_msat and set_max_liquidity_msat which could (but may not) be called prior to calling update_history_buckets. Do we have tests to catch this?

Collaborator Author

No, we don't have really good testing of these kinds of issues, as evidenced also by your bugfix at #2530. Luckily, doing the decaying in the background means this isn't actually a concern anymore - we only care about this in the rare case that we need to decay the buckets now, but haven't run the decayer yet, and then we get a new datapoint. But that doesn't really matter, because it's no different from just increasing the half-life by a few minutes, which shouldn't really matter at all.

Collaborator Author

Actually, just went ahead and removed the half-life-based decay here, there's really no reason for it and we should just rely on the one in decay_liquidity_certainty.

@TheBlueMatt TheBlueMatt force-pushed the 2023-09-scoring-decay-timer branch 3 times, most recently from 0498a13 to 60be6f9 on November 29, 2023 03:20
Contributor

@tnull tnull left a comment

Did an initial pass (finally, excuse the delay).

@@ -274,7 +274,7 @@ macro_rules! define_run_body {
 	$channel_manager: ident, $process_channel_manager_events: expr,
 	$gossip_sync: ident, $peer_manager: ident, $logger: ident, $scorer: ident,
 	$loop_exit_check: expr, $await: expr, $get_timer: expr, $timer_elapsed: expr,
-	$check_slow_await: expr)
+	$check_slow_await: expr, $time_fetch: expr)
Contributor

Rather than just making this a closure, could we introduce a TimeSource trait and add a default impl based on SystemTime? It seems that we regularly need to retrieve the time in some no-std compatible way and it might be nice to make this generic (e.g., we'd also want something like that in lightning-liquidity lightningdevkit/lightning-liquidity#54)?
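A hypothetical sketch of the kind of trait being suggested; none of these names exist in LDK, this is just the shape such an abstraction could take:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical `TimeSource` abstraction: users on no-std targets could supply
// their own implementation, while std builds get a default.
pub trait TimeSource {
    fn duration_since_epoch(&self) -> Duration;
}

pub struct SystemClock;

impl TimeSource for SystemClock {
    fn duration_since_epoch(&self) -> Duration {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap_or(Duration::ZERO)
    }
}
```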

Collaborator Author

In general I'd really prefer not to have a million copies of "CurrentTimeFetcher" in the API everywhere, and for the most part we don't need it - we can mostly just use the latest block timestamp and call it a day (and in a few places use timer ticks if we want more granular expiry). Part of the goal of this PR is to move towards removing the Time trait, which I think is really unnecessary (and to no longer depend on it being blazing fast for routing perf).

Contributor

Not sure why having a trait is directly connected to using it in scoring? We could still have a trait-based impl that is called in the background processor? Just brought it up as in lightning-liquidity we probably want to store a reference to the time source the user will give us upon setup, and it might make sense to have that generic cross-compatible with other LDK crates that require the same functionality?

Collaborator Author

Right, I think my point is mostly "why can't lightning-liquidity use the block timestamp rather than the time?".

@TheBlueMatt
Collaborator Author

Addressed feedback and rebased.

@tnull
Contributor

tnull commented Dec 6, 2023

This unfortunately needs a rebase now.

@TheBlueMatt TheBlueMatt force-pushed the 2023-09-scoring-decay-timer branch 2 times, most recently from 2338237 to 2dce56d on December 6, 2023 20:31
@TheBlueMatt
Collaborator Author

Rebased on top of #2774, since it needs to go anyway.

@@ -773,7 +796,10 @@ impl BackgroundProcessor {
 	handle_network_graph_update(network_graph, &event)
 }
 if let Some(ref scorer) = scorer {
-	if update_scorer(scorer, &event) {
+	use std::time::SystemTime;
Contributor

Since this function uses system time now, it should probably be #[cfg(all(feature = "std", not(feature = "no-std")))] to handle when you'll be able to use both flags together in the future.

Collaborator Author

I think just feature=std is correct. If two crates depend on LDK, with one setting std and another setting no-std, LDK should build with all features. Otherwise, the crate relying on std features will fail to compile because of an unrelated crate also in the dependency tree.
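A minimal sketch of the gating Matt describes: because cargo unions features across the dependency tree, gating on `feature = "std"` alone keeps std-dependent items available to any crate that enabled `std`, regardless of what other crates set:

```rust
// Gate only on `std` being enabled; don't also require `no-std` to be unset,
// since an unrelated dependant enabling `no-std` would otherwise break this.
#[cfg(feature = "std")]
fn duration_since_epoch() -> std::time::Duration {
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time should be after the UNIX epoch")
}
```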

@TheBlueMatt
Collaborator Author

Rebased now that #2774 landed.

Comment on lines 244 to 273
Contributor

Yeah, decaying more often helps. For me it's more about consistency within our model: last_updated no longer has a well-defined meaning, as it may only be accurate for one offset. So we have to choose between internal consistency for a channel and consistency across channels with this approach.

@@ -114,7 +114,7 @@ const ONION_MESSAGE_HANDLER_TIMER: u64 = 1;
 const NETWORK_PRUNE_TIMER: u64 = 60 * 60;
 
 #[cfg(not(test))]
-const SCORER_PERSIST_TIMER: u64 = 60 * 60;
+const SCORER_PERSIST_TIMER: u64 = 60 * 5;
Contributor

Not sure if we should use a constant here. It should be no more than the user-defined half-life, ideally such that the half-life is divisible by it.

Collaborator Author

Hmm, I guess? If a user sets an aggressive half-life I'm not entirely convinced we want to spin their CPU trying to decay liquidity bounds. Doing it a bit too often when they set a super high decay also seems fine-ish? I agree it'd be a bit nicer to switch to some function of the configured half-life, but I'm not sure it's worth adding some accessor to ScoreUpdate.
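For illustration, a hypothetical version of deriving the interval from the configured half-life, clamped on both ends; nothing like this exists in the PR, and the divisor and bounds are invented:

```rust
use std::time::Duration;

// Hypothetical: tick at a quarter of the configured half-life, but never more
// than once a minute and never less than once an hour.
fn scorer_decay_interval(liquidity_offset_half_life: Duration) -> Duration {
    let min_interval = Duration::from_secs(60);
    let max_interval = Duration::from_secs(60 * 60);
    liquidity_offset_half_life
        .checked_div(4)
        .unwrap_or(min_interval)
        .clamp(min_interval, max_interval)
}
```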

@@ -1700,7 +1732,7 @@ mod tests {
 		_ = exit_receiver.changed() => true,
 	}
 })
-}, false,
+}, false, || Some(Duration::from_secs(1696300000)),
Contributor

What's behind the choice of this number?

Collaborator Author

It's, basically, when I wrote the patch.

Contributor

Any reason why it can't be Duration::ZERO like in the other tests?

Collaborator Author

Not really, it just seemed a bit more realistic.

Contributor

Given the value doesn't affect the test, it's just curious to the reader to see something different from all the other places.

Collaborator Author

Ah, I tried to switch to ZERO but the test fails - it expects to prune entries from the network graph against a static RGS snapshot that has a timestamp in it.

@TheBlueMatt
Collaborator Author

On a Skylake system, 10k samples in the bench give me these changes for this branch. There's quite a bit of noise, as usual, but it does look like a non-zero win.


generate_routes_with_zero_penalty_scorer
0.0.118		                        time:   [123.62 ms 124.40 ms 125.20 ms]
current git	                        time:   [110.30 ms 110.91 ms 111.52 ms]
			                        change: [-11.604% -10.846% -10.115%] (p = 0.00 < 0.05)
bg decay	                        time:   [118.97 ms 119.83 ms 120.69 ms]
			                        change: [+7.0673% +8.0403% +9.0578%] (p = 0.00 < 0.05)

generate_mpp_routes_with_zero_penalty_scorer
0.0.118		                        time:   [92.596 ms 93.406 ms 94.223 ms]
current git	                        time:   [116.78 ms 118.15 ms 119.54 ms]
			                        change: [+24.595% +26.488% +28.359%] (p = 0.00 < 0.05)
bg decay	                        time:   [108.13 ms 109.02 ms 109.92 ms]
			                        change: [-9.1122% -7.7238% -6.4060%] (p = 0.00 < 0.05)

generate_routes_with_probabilistic_scorer
0.0.118		                        time:   [149.15 ms 149.83 ms 150.51 ms]
current git	                        time:   [164.07 ms 164.76 ms 165.45 ms]
			                        change: [+9.3074% +9.9667% +10.682%] (p = 0.00 < 0.05)
bg decay	                        time:   [136.01 ms 136.90 ms 137.80 ms]
			                        change: [-17.595% -16.910% -16.259%] (p = 0.00 < 0.05)

generate_mpp_routes_with_probabilistic_scorer
0.0.118		                        time:   [143.37 ms 144.12 ms 144.88 ms]
current git	                        time:   [155.43 ms 156.14 ms 156.84 ms]
			                        change: [+7.6204% +8.3365% +9.0807%] (p = 0.00 < 0.05)
bg decay	                        time:   [148.16 ms 149.06 ms 149.96 ms]
			                        change: [-5.2250% -4.5331% -3.7806%] (p = 0.00 < 0.05)

generate_large_mpp_routes_with_probabilistic_scorer
0.0.118		                        time:   [426.72 ms 432.73 ms 438.78 ms]
current git	                        time:   [403.56 ms 409.49 ms 415.51 ms]
			                        change: [-7.2966% -5.3700% -3.4360%] (p = 0.00 < 0.05)
bg decay	                        time:   [443.54 ms 447.70 ms 451.88 ms]
			                        change: [+7.4605% +9.3325% +11.248%] (p = 0.00 < 0.05)

generate_routes_with_nonlinear_probabilistic_scorer
0.0.118		                        time:   [149.56 ms 150.22 ms 150.89 ms]
current git	                        time:   [148.34 ms 149.29 ms 150.24 ms]
			                        change: [-1.3545% -0.6203% +0.1605%] (p = 0.12 > 0.05)
bg decay	                        time:   [140.49 ms 141.41 ms 142.32 ms]
			                        change: [-6.1363% -5.2818% -4.4185%] (p = 0.00 < 0.05)

generate_mpp_routes_with_nonlinear_probabilistic_scorer
0.0.118		                        time:   [151.39 ms 152.03 ms 152.67 ms]
current git	                        time:   [146.50 ms 147.28 ms 148.07 ms]
			                        change: [-3.8003% -3.1201% -2.4334%] (p = 0.00 < 0.05)
bg decay	                        time:   [142.28 ms 143.08 ms 143.87 ms]
			                        change: [-3.5925% -2.8575% -2.0824%] (p = 0.00 < 0.05)

generate_large_mpp_routes_with_nonlinear_probabilistic_scorer
0.0.118		                        time:   [368.48 ms 372.81 ms 377.14 ms]
current git	                        time:   [403.42 ms 408.91 ms 414.47 ms]
			                        change: [+7.6832% +9.6843% +11.707%] (p = 0.00 < 0.05)
bg decay	                        time:   [350.87 ms 355.56 ms 360.27 ms]
			                        change: [-14.678% -13.047% -11.244%] (p = 0.00 < 0.05)

Comment on lines 244 to 273
Contributor

More in the sense that its use in decaying isn't well defined. We should at least note that in the decay_liquidity_certainty implementation.

Comment on lines +1457 to +1467
let half_life = decay_params.historical_no_updates_half_life.as_secs_f64();
if half_life != 0.0 {
let divisor = powf64(2048.0, elapsed_time.as_secs_f64() / half_life) as u64;
for bucket in liquidity.min_liquidity_offset_history.buckets.iter_mut() {
*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
}
for bucket in liquidity.max_liquidity_offset_history.buckets.iter_mut() {
*bucket = ((*bucket as u64) * 1024 / divisor) as u16;
}
liquidity.offset_history_last_updated = duration_since_epoch;
}
Contributor

IIUC, this means we'll decay partial half-lives but only after decaying one full half-life. Why bother with using 2048.0 and 1024 here if this is happening in the background?

Collaborator Author

Those multipliers are just to get reasonable precision. We could cast the bucket to a float and then do the whole thing in float math, but it seems easier to just keep the buckets as ints.
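To make the fixed-point choice concrete, a worked check using the constants from the quoted diff (the in-tree guard that at least a full half-life has elapsed is assumed here):

```rust
// `bucket * 1024 / divisor` is an integer multiply by 1024/divisor, keeping
// ~10 fractional bits. With divisor = 2048^(elapsed/half_life), exactly one
// elapsed half-life gives divisor = 2048, i.e. each bucket value halves.
fn decayed(bucket: u16, divisor: u64) -> u16 {
    ((bucket as u64) * 1024 / divisor) as u16
}

fn main() {
    let divisor = 2048f64.powf(1.0) as u64; // exactly one half-life elapsed
    assert_eq!(decayed(200, divisor), 100);
    assert_eq!(decayed(31, divisor), 15); // integer division truncates
    println!("ok");
}
```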

Contributor

But why not do partial decays when less than one half-life has passed?

Collaborator Author

It kinda goes against the model of the historical buckets - they're intended to be "time-free", only using a decay parameter if we really haven't seen that channel in a long time. Now, I wouldn't be against revisiting that idea, it's quite possible we over-corrected from having too much of a time parameter in the non-historical data, but I'd like to think about that separately.

Contributor

I see, SGTM.

In the next commits we'll need `f64`'s `powf`, which is only
available in `std`. For `no-std`, here we depend on `libm` (a
`rust-lang` org project), which we can use for `powf`.

In the coming commits, we'll stop relying on fetching the time
during routefinding, preferring to decay score data in the
background instead.

The first step towards this - passing the current time through into
the scorer when updating.

Rather than relying on fetching the current time during
routefinding, here we introduce a new trait method to `ScoreUpdate`
to do so. This largely mirrors what we do with the `NetworkGraph`,
and allows us to take on much more expensive operations (floating
point exponentiation) in our decaying.

In the next commit, we'll start to use the new
`ScoreUpdate::decay_liquidity_certainty` to decay our bounds in the
background. This will result in the `last_updated` field getting
updated regularly on decay, rather than only on update. While this
isn't an issue for the regular liquidity bounds, it poses a problem
for the historical liquidity buckets, which are decayed on a
separate (and by default much longer) timer. If we didn't move to
tracking their decays separately, we'd never let the `last_updated`
field get old enough for the historical buckets to decay at all.

Instead, here we introduce a new `Duration` in the
`ChannelLiquidity` which tracks the last time the historical
liquidity buckets were last updated. We initialize it to a copy of
`last_updated` on deserialization if it is missing.

This implements decaying in the `ProbabilisticScorer`'s
`ScoreLookup::decay_liquidity_certainty` implementation, using
floats for accuracy since we're no longer particularly
time-sensitive. Further, it (finally) removes score entries which
have decayed to zero.

Because scoring is an incredibly performance-sensitive operation,
doing liquidity information decay (and especially fetching the
current time!) during scoring isn't really a great idea. Now that
we decay liquidity information in the background, we don't have any
reason to decay during scoring, and we remove the historical bucket
liquidity decaying here.

Because scoring is an incredibly performance-sensitive operation,
doing liquidity information decay (and especially fetching the
current time!) during scoring isn't really a great idea. Now that
we decay liquidity information in the background, we don't have any
reason to decay during scoring, and we ultimately remove it
entirely here.

Now that we aren't decaying during scoring, the point at which we
set the last_updated time in the history bucket logic doesn't
matter, so we should just update it when we've just updated the
history buckets.

In the coming commits, the `T: Time` bound on `ProbabilisticScorer`
will be removed. In order to enable that, we need to pass the
current time (as a `Duration` since the unix epoch) through the
score updating pipeline, allowing us to keep the
`*last_updated_time` fields up-to-date as we go.

In the coming commits, the `T: Time` bound on `ProbabilisticScorer`
will be removed. In order to enable that, we need to switch over to
using the `ScoreUpdate`-provided current time (as a `Duration`
since the unix epoch), making the `T` bound entirely unused.

Now that we don't access time via the `Time` trait in
`ProbabilisticScorer`, we can finally drop the `Time` bound
entirely, removing the `ProbabilisticScorerUsingTime` and type
alias indirection and replacing it with a simple struct.

As we now no longer decay bounds information when fetching them,
there is no need to have a decaying-fetching helper utility.

This is a good gut-check to ensure we don't end up taking a ton of
time decaying channel liquidity info.

It currently clocks in around 1.25ms on an i7-1360P.

Now that the serialization format of `no-std` and `std`
`ProbabilisticScorer`s both just use `Duration` since UNIX epoch
and don't care about time except when decaying, we don't need to
warn users to not mix the scorers across `no-std` and `std` flags.

Fixes lightningdevkit#2539

There are some edge cases in our scoring when the information really
should be decayed but hasn't yet been prior to an update. Rather
than try to fix them exactly, we instead decay the scorer a bit
more often, which largely solves them but also gives us a bit more
accurate bounds on our channels, allowing us to reuse channels at
a similar amount to what just failed immediately, but at a
substantial penalty.

Because we decay the bucket information in the background, there's
not much reason to try to decay them immediately prior to updating,
and in removing that we can also clean up a good bit of dead code,
which we do here.

Now that we use explicit times passed to decay methods, there's no
reason to make calls to `SinceEpoch::advance` in scoring tests.
@TheBlueMatt
Collaborator Author

Squashed the fixups, without any changes.

@jkczyz
Contributor

jkczyz commented Dec 15, 2023

Code overall looks good. I'm not opposed to the approach, necessarily. I think there still can be an issue decaying in the case of a node with a large payment volume (See #2656 (comment)).

Otherwise, I don't have any other concerns. @tnull Could you take another look?

@TheBlueMatt
Collaborator Author

> Code overall looks good. I'm not opposed to the approach, necessarily. I think there still can be an issue decaying in the case of a node with a large payment volume (See #2656 (comment)).

Yea, I mean it's definitely not "right", just not clear to me it's "wrong" either. The only thing we could really do to address it is split the last_updated fields into two. I'm totally fine doing that if you think it's worth it.

@TheBlueMatt TheBlueMatt force-pushed the 2023-09-scoring-decay-timer branch 2 times, most recently from 4e9783e to f8fb70a on December 15, 2023 04:57
Contributor

@tnull tnull left a comment

Did another pass.

LGTM I think. ACKing in case we want to land this soon, but might come back for yet another pass next week.

@jkczyz
Contributor

jkczyz commented Dec 15, 2023

> Yea, I mean it's definitely not "right", just not clear to me it's "wrong" either. The only thing we could really do to address it is split the last_updated fields into two. I'm totally fine doing that if you think it's worth it.

I guess it just can come across as surprising to an outside observer. They can tell that a bound has been adjusted, but one of them doesn't seem to be decaying as expected based on the configuration and the observed time of adjustment. Maybe it's not horrible in practice?

@TheBlueMatt
Collaborator Author

I guess to an outside observer it just looks like both ends got updated? Which isn't true, but not crazy broken either (at least in the sense that I'm not sure what outside observer would be looking at both their failures and their success/failure stream and somehow caring about the inconsistent decays). I'm happy with either solution, though - as-is or splitting the last-updated tracking. We could also split the last-updated tracking in a followup, it'd be a new commit either way.

@jkczyz
Contributor

jkczyz commented Dec 15, 2023

> I guess to an outside observer it just looks like both ends got updated? Which isn't true, but not crazy broken either (at least in the sense that I'm not sure what outside observer would be looking at both their failures and their success/failure stream and somehow caring about the inconsistent decays). I'm happy with either solution, though - as-is or splitting the last-updated tracking. We could also split the last-updated tracking in a followup, it'd be a new commit either way.

Was thinking in terms of something like estimated_channel_liquidity_range and a prober that may use the data to make decisions on future probes. An outsider can observe the bounds changing over time and from events.

But, yeah, a follow-up is fine. Seems like we can always do it later if we find problems, too, given the serialization format. Just seems more correct to use separate timestamps.

@TheBlueMatt
Collaborator Author

Alright, gonna merge this. I have a large pile of performance tweaks to the router and scorer up next, and I can incorporate a separate decay for the two bounds in some of that work.

@TheBlueMatt TheBlueMatt merged commit c92db69 into lightningdevkit:main Dec 15, 2023
29 of 30 checks passed
Comment on lines -3519 to -3520
-	usage.inflight_htlc_msat = 0;
-	assert_eq!(scorer.channel_penalty_msat(&candidate, usage, &params), 866);
Contributor

Was just fixing a build warning and noticed this. Why did this check need to be removed? Deleted in 35b4964.

Collaborator Author

The previous test relied on the behavior where we actually used undecayed data in the buckets when scoring, and only considered the decaying when deciding if we should score at all. We now actually decay the data, so we no longer have the undecayed data available.

Successfully merging this pull request may close these issues.

Support loading a std-persisted ProbabilisticScorer in no-std
Bug: On iOS LDK panics on reading scorer