Wake background-processor on async monitor update completion #2090

Merged · 8 commits into lightningdevkit:main · Apr 3, 2023

Conversation

TheBlueMatt
Collaborator

Fixes #2052.

Open to feedback, especially on the last commit. It's nice to reduce our polling interval, but the complexity here just to reduce wakeups by something like 2x (three times per second plus once every 500ms, or something like that, down from 10 times per second) seems... not worth it?

cc @MaxFangX

@TheBlueMatt TheBlueMatt added this to the 0.0.115 milestone Mar 9, 2023
@@ -175,6 +119,9 @@ impl FutureState {
}

/// A simple future which can complete once, and calls some callback(s) when it does so.
///
/// Clones can be made and all futures cloned from the same source will complete at the same time.
#[derive(Clone)]
pub struct Future {

a bit confusing to see Future != StdFuture. maybe rename to Notified or NotifiedFuture?

Collaborator Author

Yea, I agree, let's change it in a followup though, this PR is already kinda big.
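
A minimal, self-contained sketch of the contract the new doc line describes (all clones of a future share one underlying state and complete together). The Notifier/NotifiedFuture names below borrow the rename suggested above and are purely illustrative; this is not LDK's actual implementation.

use std::future::Future as StdFuture;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Shared state: whether the notification fired, plus any tasks to wake.
#[derive(Default)]
struct State {
    complete: bool,
    wakers: Vec<Waker>,
}

// Illustrative notifier; calling `notify` completes every cloned future.
#[derive(Clone, Default)]
struct Notifier(Arc<Mutex<State>>);

impl Notifier {
    fn notify(&self) {
        let mut state = self.0.lock().unwrap();
        state.complete = true;
        for waker in state.wakers.drain(..) {
            waker.wake();
        }
    }
    fn get_future(&self) -> NotifiedFuture {
        NotifiedFuture(Arc::clone(&self.0))
    }
}

// Clones share the same Arc'd state, so they all complete at the same time.
#[derive(Clone)]
struct NotifiedFuture(Arc<Mutex<State>>);

impl StdFuture for NotifiedFuture {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut state = self.0.lock().unwrap();
        if state.complete {
            Poll::Ready(())
        } else {
            state.wakers.push(cx.waker().clone());
            Poll::Pending
        }
    }
}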

@MaxFangX
Contributor

MaxFangX commented Mar 9, 2023

the complexity here just to reduce wakeups by something like 2x (3 times per second plus once every 500ms, or something like that down from 10 times per second) seems...not worth it?

Yeah, I certainly don't want to add more complexity / maintenance burden to LDK especially given that we're not actually using the upstream background processor. As I wrote in #2052:

Even though we don't use LDK's background processor directly, [removing the 100ms sleeper future in the LDK background processor] would be a strong signal to us that LDK has confidence in its provided mechanisms for waking the background processor when needed, and would thus give us the confidence to remove our process events timer downstream.

In other words, the main guarantee that we're interested in is that if any event is generated which needs processing, an associated future (either the chain monitor or channel manager future will do) will complete, which will allow our background processor to immediately begin processing events (in order to minimize latency and response time). Previously, before these futures were exposed, the 100ms sleeper future acted as a catch-all to ensure that any generated events are processed within at most 100ms, which meant that the average minimum response time was about 50ms, rather than the 0ms that we want. Now that we have the channel manager and chain monitor futures, (IIUC) the only reason the 100ms sleeper future still needs to exist is to detect if the node has been starved of CPU cycles on some platforms in order to reconnect sockets, NOT as a catch-all to process events. So it seems that the BGP could be rewritten so that events are processed ONLY IF one of those two futures completed, rather than any time the 100ms sleeper future completes. This would be a strong signal to us that the guarantee we're interested in has been satisfied.

Also, Lexe likely has more stringent CPU and performance requirements than most LDK users, so I wouldn't want you all to take on the extra complexity to reduce the CPU usage by just a factor of 2 in a BGP that we don't even use, just for us. So if you feel that most users won't need the 2x performance improvement, definitely feel free to remove the last bits which introduce complexity - as emphasized above, our primary interest in the upstream BGP is mostly as a reference to see which guarantees LDK has implicitly promised to uphold.
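
A rough sketch of the processing model described here, with tokio::sync::Notify standing in for the channel-manager and chain-monitor wake futures; this is an illustration of the pattern under those assumptions, not the actual LDK background processor.

use std::sync::Arc;
use tokio::sync::Notify;

// Stand-ins for "an event needing processing was generated" signals.
async fn background_processor(chan_man_wake: Arc<Notify>, chain_mon_wake: Arc<Notify>) {
    loop {
        // Sleep until *either* source signals; no fixed 100ms polling interval.
        tokio::select! {
            _ = chan_man_wake.notified() => {},
            _ = chain_mon_wake.notified() => {},
        }
        // Process whatever events were generated, then go back to sleep.
        process_pending_events().await;
    }
}

async fn process_pending_events() {
    // Placeholder for ChannelManager/ChainMonitor event processing.
}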

@TheBlueMatt
Collaborator Author

In other words, the main guarantee that we're interested in is that if any event is generated which needs processing, an associated future (either the chain monitor or channel manager future will do) will complete, which will allow our background processor to immediately begin processing events (in order to minimize latency and response time).

This is absolutely a goal, was before, and continues to be. Notably, even the 100ms backstop is an impossibly long delay to have to wait twice per payment! #2052 is, after all, a bug :)

Now that we have the channel manager and chain monitor futures, (IIUC) the only reason the 100ms sleeper future still needs to exist is to detect if the node has been starved of CPU cycles on some platforms in order to reconnect sockets, NOT as a catch-all to process events. So it seems that the BGP could be rewritten so that events are processed ONLY IF one of those two futures completed, rather than any time the 100ms sleeper future completes. This would be a strong signal to us that the guarantee we're interested in has been satisfied.

Indeed, largely. Sadly, I totally misread the constants: they say 1 second for testing, but I skimmed that as 1 second in all builds :). Thus, the proposed rewrite here makes no sense, really. Rather, we should just always sleep on the PING_TIMER's 10-second interval, and expose the concept of "am I on mobile".

Also, Lexe likely has more stringent CPU and performance requirements than most LDK users, so I wouldn't want you all to take on the extra complexity to reduce the CPU usage by just a factor of 2 in a BGP that we don't even use, just for us. So if you feel that most users won't need the 2x performance improvement, definitely feel free to remove the last bits which introduce complexity - as emphasized above, our primary interest in the upstream BGP is mostly as a reference to see which guarantees LDK has implicitly promised to uphold.

I don't think Lexe is the only LDK user who will ever run a ton of nodes in one process on one server; rather, y'all are just the furthest along towards building that. It's definitely a use-case we want to support, even if we have ten other use-cases we also have to support :)

@MaxFangX
Contributor

MaxFangX commented Mar 9, 2023

Btw, we actually run our nodes as separate processes, for SGX reasons. But yeah, all the above sounds good to me!

@TheBlueMatt TheBlueMatt force-pushed the 2023-03-mon-wake-bp branch 2 times, most recently from 003be28 to 4f2b20b on March 14, 2023 21:44
@TheBlueMatt
Collaborator Author

Replaced the last commit with one that simply bumps the sleep time to 10s based on a new option, which should suffice.
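
A sketch of what that option could look like in the futures-based loop: the wake futures still drive event processing immediately, while the fallback timer exists only as a periodic backstop (e.g. to notice an iOS app being backgrounded). The names and the 100ms/10s values mirror the discussion above, but the exact LDK signature may differ.

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

async fn bp_loop(wake: Arc<Notify>, mobile_interruptable_platform: bool) {
    // Mobile platforms keep the short wake so socket-killing backgrounding is
    // noticed quickly; other deployments only need the occasional backstop.
    let fallback = if mobile_interruptable_platform {
        Duration::from_millis(100)
    } else {
        Duration::from_secs(10)
    };
    loop {
        tokio::select! {
            _ = wake.notified() => { /* an event needs processing right now */ },
            _ = tokio::time::sleep(fallback) => { /* periodic backstop only */ },
        }
        // ... process events, check peer connectivity, etc.
    }
}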

@TheBlueMatt TheBlueMatt force-pushed the 2023-03-mon-wake-bp branch 2 times, most recently from 7eba7ef to 2f1fb7c on March 15, 2023 18:01
@codecov-commenter

codecov-commenter commented Mar 15, 2023

Codecov Report

Patch coverage: 93.18% and project coverage change: -0.02% ⚠️

Comparison is base (0e28bcb) 91.38% compared to head (7f1bbf5) 91.37%.

❗ Current head 7f1bbf5 differs from pull request most recent head 94a11f7. Consider uploading reports for the commit 94a11f7 to get more accurate results


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2090      +/-   ##
==========================================
- Coverage   91.38%   91.37%   -0.02%     
==========================================
  Files         102      102              
  Lines       49767    49832      +65     
==========================================
+ Hits        45482    45532      +50     
- Misses       4285     4300      +15     
Impacted Files Coverage Δ
lightning-net-tokio/src/lib.rs 78.41% <ø> (ø)
lightning/src/sync/nostd_sync.rs 100.00% <ø> (ø)
lightning-background-processor/src/lib.rs 77.77% <83.33%> (-0.03%) ⬇️
lightning/src/sync/debug_sync.rs 89.69% <83.33%> (-2.33%) ⬇️
lightning/src/util/wakers.rs 96.20% <92.92%> (-1.48%) ⬇️
lightning/src/chain/chainmonitor.rs 94.72% <100.00%> (+0.08%) ⬆️
lightning/src/ln/channelmanager.rs 89.10% <100.00%> (ø)

... and 2 files with indirect coverage changes



@TheBlueMatt
Collaborator Author

Will have to wait on #2107

@tnull tnull self-requested a review March 16, 2023 09:23
@TheBlueMatt TheBlueMatt force-pushed the 2023-03-mon-wake-bp branch 3 times, most recently from 74775fe to c827d88 on March 20, 2023 18:28
Contributor

@tnull tnull left a comment

Mh, seems we now also need to add wait_while/wait_timeout_while to nostd_sync::Condvar. Should be trivial I believe as all of the methods there are dummies anyway?

@TheBlueMatt
Collaborator Author

Mmm, actually, yea, that's nonsense; I opted to remove the dummy Condvar instead and make the relevant wait methods std-only (which they should have been to begin with).
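
For illustration, a small std-only notifier of the shape this implies, using the real std::sync::Condvar wait_while/wait_timeout_while helpers; in a dual std/no-std crate the blocking methods would simply sit behind #[cfg(feature = "std")]. This is not LDK's actual Notifier, just a sketch.

use std::sync::{Condvar, Mutex};
use std::time::Duration;

pub struct Notifier {
    notified: Mutex<bool>,
    condvar: Condvar,
}

impl Notifier {
    pub fn new() -> Self {
        Notifier { notified: Mutex::new(false), condvar: Condvar::new() }
    }

    pub fn notify(&self) {
        *self.notified.lock().unwrap() = true;
        self.condvar.notify_all();
    }

    // In a real dual-build crate this would be #[cfg(feature = "std")].
    pub fn wait(&self) {
        let guard = self.notified.lock().unwrap();
        let mut guard = self.condvar.wait_while(guard, |notified| !*notified).unwrap();
        *guard = false;
    }

    // Bounded variant; returns true if we were notified rather than timing out.
    pub fn wait_timeout(&self, max_wait: Duration) -> bool {
        let guard = self.notified.lock().unwrap();
        let (mut guard, result) = self.condvar
            .wait_timeout_while(guard, max_wait, |notified| !*notified)
            .unwrap();
        if !result.timed_out() { *guard = false; }
        !result.timed_out()
    }
}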

Contributor

@wpaulino wpaulino left a comment

Will do a full pass after squash

.wait_timeout(Duration::from_millis(10)));

// Test ordering somewhat more.
notifier_a.notify();
Contributor

I would have expected this call to only complete the previous future because we were already waiting on it, rather than also completing the next future below. Is there a reason behind this?

Collaborator Author

In general, our goal is to ensure that after any notification the user can see it. The sleepers that were already running, but that we'd given up on by this point absolutely shouldn't count for that - the user has given up on them, dropped them, and moved on. We need to ensure that their next wait completes so that they will persist ASAP.

Contributor

Coming back to this after re-visiting the Notifier (and I may be missing something), it seems this is not the case there. The future may be dropped by the user, but not by the Notifier itself. A notify would wake the "new" future, but it's really just a clone of the future they previously dropped (we test this in notifier_future_completes_wake). With the Sleeper, I guess the new Sleeper future could be polling futures from a distinct set of Notifiers than the previous Sleeper future dropped, so we'd definitely want to wake the new future there?

Collaborator Author

Indeed, there's no such thing as a "new" future (until we set callbacks_made in the old one's state), but that's not really observable to the user. As far as the user is concerned, they got a new future, which completes instantly because a notification came in after they dropped the previous future (or after the last time they poll'ed it).

At a high level, our goal is "after we call notify, the user will be notified". If the user has a future and they don't sleep on it (or have already been woken on a sleeper for a different future), or they stopped poll'ing already, or... we will notify them the next time they go to poll an associated future. This applies no matter what the internal state of the future is - we always make sure to wake/complete a future at some point after the notify.

Concretely, in this test, you can imagine the BP - it is polling two notifiers, the ChannelManager and the ChainMonitor. It got notified by both, and both caused a wakeup; then it went to poll one of the futures again, but there wasn't a wake, so we timed out instead. At this point we're gonna go around the loop. If we get a notify to one of our objects while we're in the loop body processing some events, when we get back to the top we don't want to sleep - we want the new future to complete instantly.

Just because we polled before and it timed out (ignoring whether the future was drop'd - we treat it the same whether it was or not) doesn't mean we don't need to be woken up now - if some notify happens, we need to make sure the next poll/sleep wakes instantly.
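
A tiny runnable illustration of that guarantee, using a plain bool plus Condvar in place of LDK's Notifier: a notify that lands while we are busy in the loop body makes the next timed wait return immediately instead of sleeping.

use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

fn main() {
    let state = Arc::new((Mutex::new(false), Condvar::new()));
    let (lock, cv) = &*state;

    // First wait: nothing pending, so this times out after ~10ms.
    let guard = lock.lock().unwrap();
    let (guard, res) =
        cv.wait_timeout_while(guard, Duration::from_millis(10), |notified| !*notified).unwrap();
    assert!(res.timed_out());
    drop(guard);

    // A notification arrives while we're "in the loop body" doing other work.
    *lock.lock().unwrap() = true;
    cv.notify_all();

    // Second wait: returns immediately because the pending flag is still set.
    let guard = lock.lock().unwrap();
    let (_guard, res) =
        cv.wait_timeout_while(guard, Duration::from_secs(10), |notified| !*notified).unwrap();
    assert!(!res.timed_out());
}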

Contributor

That was very helpful, thanks! I think we could do a bit better with documentation here, even if it's mostly internal, just noting some of the requirements.

@TheBlueMatt
Collaborator Author

Went ahead and squashed down all the fixups.

@TheBlueMatt TheBlueMatt force-pushed the 2023-03-mon-wake-bp branch 3 times, most recently from 7cd102c to 90587bc on March 25, 2023 02:46
@TheBlueMatt
Collaborator Author

Had to re-add the loop-based waits in a fixup and then remove them later - reordering the commits here is...kinda tough - we want to remove the no-std condvar and make all the sleeps std-only, except first we add the sleeper, which is then used to move the pub wait methods on channelmanager, which then include moving to futures, which then allow us to remove the no-std waits... I figured it wasn't worth bothering trying to rewrite half the commits here just to reorder them; rather, adding code that's probably correct and then removing it to simplify in a later commit seemed fine.

@wpaulino
Contributor

Feel free to squash, mostly LGTM but will give it a final pass after

@wpaulino
Contributor

LGTM after squash, but CI is not happy. Will likely do a final pass once @tnull comes around.

Contributor

@tnull tnull left a comment

LGTM, only some non-blocking nits.

Contributor

@tnull tnull left a comment

LGTM, would be ready for the squashing from my side.

@TheBlueMatt
Collaborator Author

Squashed without further changes.

@wpaulino
Contributor

error[E0599]: no method named `wait_while` found for struct `alloc::sync::Arc<nostd_sync::Condvar>` in the current scope
   --> lightning/src/util/wakers.rs:243:25
    |
243 |         let notified_fut = cv.wait_while(notified_fut_mtx.lock().unwrap(), |fut_opt| fut_opt.is_none())
    |                               ^^^^^^^^^^ method not found in `alloc::sync::Arc<nostd_sync::Condvar>`

Could also use a rebase to get the MSRV builds running

The `lightning-net-tokio` crate-level example contained a carryover
from when it was the primary notifier of the background processor,
and now just shows an "example" of creating a method that calls
another method with the same parameters and then does event
processing (which doesn't make sense; the BP should do that).

Instead, the examples are simply removed and the documentation is
tweaked to include recent changes.
@TheBlueMatt TheBlueMatt force-pushed the 2023-03-mon-wake-bp branch 2 times, most recently from 128bb01 to e1ba7ff on March 31, 2023 05:14
@tnull
Contributor

tnull commented Mar 31, 2023

Nope, still not fixed in all commits:

Checking lightning v0.0.114 (/home/runner/work/rust-lightning/rust-lightning/lightning)
error[E0599]: no method named `wait_while` found for struct `alloc::sync::Arc<nostd_sync::Condvar>` in the current scope
   --> lightning/src/util/wakers.rs:243:25
    |
243 |         let notified_fut = cv.wait_while(notified_fut_mtx.lock().unwrap(), |fut_opt| fut_opt.is_none())
    |                               ^^^^^^^^^^ method not found in `alloc::sync::Arc<nostd_sync::Condvar>`

@TheBlueMatt
Collaborator Author

Lol sorry I was "fixing" the issue in the wrong place, actually fixed now.

tnull previously approved these changes Apr 3, 2023
@wpaulino
Contributor

wpaulino commented Apr 3, 2023

Needs squash

These are useful, but we previously couldn't use them due to our
MSRV. Now that we can, we should use them, so we expose them via
our normal debug_sync wrappers.

`Send` is rather useless on a `no-std` target - we don't have
threads and are just needlessly restricting ourselves, so here we
skip it for the wakers callback.

In the next commits we'll be adding a second notify pipeline - from
the `ChainMonitor` back to the background processor. This will
cause the `background-processor` to need to await multiple wakers
at once, which we cannot do in the current scheme without taking on
a full async runtime.

Building a multi-future waiter isn't so bad, and notably will allow
us to remove the existing sleep pipeline entirely, reducing the
complexity of our wakers implementation by only having one notify
path to consider.

Rather than having three ways to await a `ChannelManager` being
persistable, this moves to just exposing the awaitable `Future` and
having sleep functions on that.

In `no-std`, we exposed `wait` functions which rely on a dummy
`Condvar` which never actually sleeps. This is somewhat nonsensical,
not to mention confusing to users. Instead, we simply remove the
`wait` methods in `no-std` builds.

If the `ChainMonitor` gets an async monitor update completion, this
means the `ChannelManager` needs to be polled for event processing.
Here we wake it using the new multi-`Future`-await `Sleeper`, or
the existing `select` block in the async BP.

Fixes lightningdevkit#2052.

Some users have suggested that waking every 100ms can be
CPU-intensive in deployments with hundreds or thousands of nodes
all running on the same machine. Thus, we add an option to the
futures-based `background-processor` to avoid waking every 100ms to
check for iOS having backgrounded our app and cut our TCP sockets.

This cuts the normal sleep time down from 100ms to 10s, for those
who turn it on.
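
To make the multi-future-waiter commit concrete, here is a minimal runtime-free sketch of "block this thread until any of these futures completes", built on std::task::Wake; it illustrates the idea behind the Sleeper described above and is not LDK's implementation (which also has to respect a lower MSRV).

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Condvar, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Parked-thread waker: `wake()` flips a flag and pokes the Condvar.
struct ThreadNotify {
    woken: Mutex<bool>,
    cv: Condvar,
}

impl Wake for ThreadNotify {
    fn wake(self: Arc<Self>) {
        *self.woken.lock().unwrap() = true;
        self.cv.notify_all();
    }
}

// Block the current thread until *any* of the given futures completes,
// without pulling in an async runtime.
fn sleep_until_any(mut futs: Vec<Pin<Box<dyn Future<Output = ()> + Send>>>) {
    let notify = Arc::new(ThreadNotify { woken: Mutex::new(false), cv: Condvar::new() });
    let waker = Waker::from(Arc::clone(&notify));
    let mut cx = Context::from_waker(&waker);
    loop {
        // Poll every future; if one is done, we're awake.
        for fut in futs.iter_mut() {
            if let Poll::Ready(()) = fut.as_mut().poll(&mut cx) {
                return;
            }
        }
        // None ready: park until some future's waker fires, then re-poll.
        let mut woken = notify.woken.lock().unwrap();
        while !*woken {
            woken = notify.cv.wait(woken).unwrap();
        }
        *woken = false;
    }
}
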
@TheBlueMatt
Collaborator Author

Squashed without further changes.

@TheBlueMatt TheBlueMatt merged commit 3b8bf93 into lightningdevkit:main Apr 3, 2023
14 checks passed