
Conversation

@TheBlueMatt
Collaborator

A user pointed out, when looking to upgrade to LDK 0.2, that the
lazy flag is actually quite important for performance when using
a MonitorUpdatingPersister, especially in synchronous persistence
mode.

Thus, we add it back here.

@TheBlueMatt TheBlueMatt added this to the 0.2 milestone Oct 30, 2025

@TheBlueMatt TheBlueMatt linked an issue Oct 30, 2025 that may be closed by this pull request
@TheBlueMatt TheBlueMatt requested a review from tnull October 30, 2025 00:21
@codecov

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 56.25000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.84%. Comparing base (02a9af9) to head (0f9548b).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| lightning-persister/src/fs_store.rs | 52.38% | 5 Missing and 5 partials ⚠️ |
| lightning/src/util/persist.rs | 68.75% | 3 Missing and 2 partials ⚠️ |
| lightning-background-processor/src/lib.rs | 0.00% | 2 Missing ⚠️ |
| lightning/src/util/test_utils.rs | 60.00% | 2 Missing ⚠️ |
| lightning-liquidity/src/lsps2/service.rs | 0.00% | 1 Missing ⚠️ |
| lightning-liquidity/src/lsps5/service.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4189      +/-   ##
==========================================
- Coverage   88.87%   88.84%   -0.04%     
==========================================
  Files         180      180              
  Lines      137863   137870       +7     
  Branches   137863   137870       +7     
==========================================
- Hits       122522   122485      -37     
- Misses      12532    12573      +41     
- Partials     2809     2812       +3     
Flag Coverage Δ
fuzzing 21.44% <0.00%> (+0.58%) ⬆️
tests 88.68% <56.25%> (-0.04%) ⬇️


/// potentially get lost on crash after the method returns. Therefore, this flag should only be
/// set for `remove` operations that can be safely replayed at a later time.
///
/// All removal operations must complete in a consistent total order with [`Self::write`]s
@tnull tnull (Contributor) Oct 30, 2025
I'm still not sure if this would even work. For example in FilesystemStore, if we simply call remove and leave the decision on when to sync the changes to disk to the OS, how could we be certain that the ordering is preserved? IIRC we basically concluded this can't be guaranteed, especially since different guarantees on different platforms might vary?

@tnull tnull (Contributor) Oct 30, 2025

Note that man 2 unlink states:

> If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed.

That means that if we have a concurrent read, we may defer the actual deletion, allowing it to interact with following writes, e.g.:

| t1   | t2     | t3     |
| READ | unlink |        |
| READ | ...    | WRITE  |
| READ | ...    |        |
| READ | ...    | SYNC   |
| READ | SYNC   |        |

Although, given we use rename for write, I do wonder if the unlink would simply get lost here as it would apply to the original file that is dropped already anyways?

Contributor:

To avoid this, maybe it is possible to constrain lazy removes to keys that won't ever be written again in the future?

I've read the context of this PR now, and it seems the perf issue was around monitor updates. I think those are never re-written?

Collaborator Author:

> I'm still not sure if this would even work. For example in FilesystemStore, if we simply call remove and leave the decision on when to sync the changes to disk to the OS, how could we be certain that the ordering is preserved?

Filesystems provide an order; the only thing they don't provide without an fsync is any kind of guarantee that it's on disk. I don't think this is a problem.

> Although, given we use rename for write, I do wonder if the unlink would simply get lost here as it would apply to the original file that is dropped already anyways?

Yes, that is how it should work on any reasonable filesystem. In theory it's possible for some filesystems to fail the rename part of the write because the file still exists, but that's unrelated to the remove; that's just the read and write happening at the same time.
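To make the mechanics being debated here concrete, below is a minimal sketch of the tempfile-then-rename write and an unsynced (lazy) unlink. It is illustrative only, assumes POSIX semantics, and is not the actual `FilesystemStore` code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write in the tempfile-then-rename style: `rename` atomically (re)creates
/// the directory entry for `dest`, independent of any earlier unlink of the
/// old entry for the same path.
fn write_via_rename(dest: &Path, buf: &[u8]) -> io::Result<()> {
	let tmp = dest.with_extension("tmp");
	let mut file = File::create(&tmp)?;
	file.write_all(buf)?;
	file.sync_all()?; // the data itself is made durable
	fs::rename(&tmp, dest)?; // atomic swap of the directory entry
	// A non-lazy store would additionally fsync the parent directory so the
	// rename (and any prior unlink) survives a crash.
	Ok(())
}

/// Lazy remove: unlink without fsyncing the parent directory. The entry is
/// gone immediately in the filesystem's view, so ordering against later
/// writes is preserved, but the removal itself may not survive a crash.
/// That is only acceptable if the removal can safely be replayed later.
fn lazy_remove(dest: &Path) -> io::Result<()> {
	fs::remove_file(dest)
}
```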

Contributor:

> Filesystems provide an order; the only thing they don't provide without an fsync is any kind of guarantee that it's on disk. I don't think this is a problem.

A file that should have been removed but is still there: is that not a problem?

Collaborator Author:

That's the explicit point of the `lazy` flag: it allows a store to not guarantee that the entry will be removed if there's a crash or ill-timed restart.

Contributor:

Discussed this offline; man 3p unlink had me convinced this is safe to do on POSIX.
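For readers skimming this thread: the method under discussion is the store's `remove` with the re-added `lazy` flag. A rough sketch of the synchronous shape follows (paraphrased for illustration; the namespace parameters and exact naming may differ from the trait as merged in this PR):

```rust
use std::io;

/// Rough shape of the synchronous store trait under discussion (paraphrased;
/// not the exact LDK trait definition).
pub trait KvStoreSketch {
	/// Persists `buf` under the given key.
	fn write(
		&self, primary_namespace: &str, secondary_namespace: &str, key: &str, buf: &[u8],
	) -> Result<(), io::Error>;

	/// Removes the given key. If `lazy` is set, the removal may be lost on a
	/// crash after this returns, so it must only be used for removals that
	/// can safely be replayed later (e.g. stale monitor updates). Removals
	/// must still complete in a consistent total order with writes to the
	/// same key.
	fn remove(
		&self, primary_namespace: &str, secondary_namespace: &str, key: &str, lazy: bool,
	) -> Result<(), io::Error>;
}
```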


@tnull tnull requested a review from joostjager October 30, 2025 09:42
@joostjager joostjager (Contributor) left a comment

Are there more details on why/when the lazy flag is important?

@TheBlueMatt
Collaborator Author

In this case it's important for perf when removing a large number of monitor updates on an fsstore (requiring an fsync for each can add up rather substantially, e.g. if we're removing 1k monitor updates). But thinking about it more, I think it's also important for the same case in the async design: if you have a KVStore that handles ordering (e.g. like the locks in the fsstore/vss store), then the lazy flag allows you to spawn-and-forget a removal, rather than the callsite having to "block" waiting on the removal to finish.
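A rough illustration of the spawn-and-forget point above, assuming a hypothetical async store whose backend already orders operations per key; the trait and function names here are made up for the example and are not the LDK API:

```rust
use std::future::Future;
use std::io;
use std::pin::Pin;
use std::sync::Arc;

/// Hypothetical async store for illustration; not the actual LDK trait.
trait AsyncRemove: Send + Sync + 'static {
	fn remove(
		&self, key: String, lazy: bool,
	) -> Pin<Box<dyn Future<Output = io::Result<()>> + Send>>;
}

/// Non-lazy removal: the caller has to wait for completion before it can
/// rely on the entry being durably gone.
async fn remove_durably<S: AsyncRemove>(store: &S, key: String) -> io::Result<()> {
	store.remove(key, false).await
}

/// Lazy removal: the caller can spawn-and-forget, because losing the removal
/// on a crash is acceptable and the backend orders it against later writes
/// to the same key.
fn remove_lazily<S: AsyncRemove>(store: Arc<S>, key: String) {
	tokio::spawn(async move {
		let _ = store.remove(key, true).await;
	});
}
```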

@joostjager
Contributor

1k updates is a lot. Do you think it adds much over, let's say, 50 updates? Maybe that also makes the perf problem go away without the lazy flag...

@TheBlueMatt
Collaborator Author

1k updates seems entirely reasonable for a node doing lots of forwarding. ChannelMonitors can easily be a few thousand times larger than ChannelMonitorUpdates, so wanting to amortize over more ChannelMonitorUpdates seems very reasonable (the startup cost of more ChannelMonitorUpdates is pretty low, or at least it is if your KVStore read latency is low, or once we load them in parallel).

Not being able to pick a reasonable update count just because of an API limitation in how we do removals seems like a pretty weird limitation, no?

@joostjager
Contributor

> Not being able to pick a reasonable update count just because of an API limitation in how we do removals seems like a pretty weird limitation, no?

The question was just whether 1000 is reasonable, and you made it clear that it is 👍

joostjager previously approved these changes Oct 30, 2025
This reverts commit 561da4c.

A user pointed out, when looking to upgrade to LDK 0.2, that the
`lazy` flag is actually quite important for performance when using
a `MonitorUpdatingPersister`, especially in synchronous persistence
mode.

Thus, we add it back here.

Fixes lightningdevkit#4188
In the previous commit we reverted
561da4c. One of the motivations
for it (in addition to `lazy` removals being somewhat less, though
still arguably, useful in an async context) was that the ordering
requirements of `lazy` removals were somewhat unclear.

Here we simply default to the simplest safe option, requiring a
total order across all `write` and `remove` operations to the same
key, `lazy` or not.
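One simple way a store implementation can satisfy that total-order requirement is to serialize `write` and `remove` for the same key behind a per-key lock. A minimal sketch follows (illustrative only; the real stores have their own locking):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Illustrative per-key serialization: any write or remove for a given key
/// takes that key's lock, so operations on the same key observe a total
/// order even when removes are lazy (i.e. not fsync'd).
#[derive(Default)]
struct PerKeyLocks {
	locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl PerKeyLocks {
	fn lock_for(&self, key: &str) -> Arc<Mutex<()>> {
		let mut map = self.locks.lock().unwrap();
		Arc::clone(map.entry(key.to_string()).or_default())
	}
}

fn write_then_lazy_remove(locks: &PerKeyLocks, key: &str) {
	// Both operations on the same key take the same lock, so a lazy remove
	// can never be reordered ahead of an earlier write (or vice versa).
	let key_lock = locks.lock_for(key);
	{
		let _held = key_lock.lock().unwrap();
		// ... perform the durable write for `key` here ...
	}
	{
		let _held = key_lock.lock().unwrap();
		// ... perform the (possibly lazy) remove for `key` here ...
	}
}
```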
@TheBlueMatt
Collaborator Author

Fixed rustfmt

$ git diff-tree -U1 9973d780a 0f9548bf8
diff --git a/lightning/src/util/persist.rs b/lightning/src/util/persist.rs
index 78fdba2113..5d34603c96 100644
--- a/lightning/src/util/persist.rs
+++ b/lightning/src/util/persist.rs
@@ -1084,4 +1084,3 @@ where
 				let latest_update_id = current_monitor.get_latest_update_id();
-				self
-					.cleanup_stale_updates_for_monitor_to(&monitor_key, latest_update_id, lazy)
+				self.cleanup_stale_updates_for_monitor_to(&monitor_key, latest_update_id, lazy)
 					.await?;

@TheBlueMatt
Collaborator Author

Only a rustfmt change since @joostjager ack'd, so landing.

@TheBlueMatt TheBlueMatt merged commit d53d6b4 into lightningdevkit:main Oct 30, 2025
23 of 25 checks passed
@TheBlueMatt TheBlueMatt mentioned this pull request Oct 30, 2025
@TheBlueMatt
Collaborator Author

Backported to 0.2 in #4193

@domZippilli
Contributor

🥳

@wvanlint
Contributor

Thanks for landing this!

I think the comments above covered everything. We use the MonitorUpdatingPersister with maximum_pending_updates = 1000 for efficiency, due to the high forwarding volume, and an update_persisted_channel call can trigger channel monitor update consolidation when maximum_pending_updates is reached. This consolidation results in maximum_pending_updates sequential KVStore::remove calls, which caused issues when performed in a non-lazy fashion. In our case in 0.1, it blocked the Tokio runtime (for ~7 ms * 1000 = 7 s), but I assume it will affect the caller in the async design as well, as Matt mentioned.

I was curious whether there are possible simplifications, such as all remove calls being considered lazy, or remove being constrained to keys that won't ever be written again in the future, as Joost mentioned. But I see there are requirements coming from #4059 (comment) as well.
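To put numbers on the consolidation cost described above: a non-lazy cleanup issues one synchronous remove (and thus roughly one fsync) per stale update, so at ~7 ms each, 1000 updates come to about 7 seconds of blocking. The loop shape is sketched below; this is illustrative only, not the `MonitorUpdatingPersister` internals, and the key format is made up.

```rust
use std::io;

/// Illustrative only: one remove call per stale update. With `lazy = false`
/// each call typically pays its own fsync (~7 ms in the report above), so
/// 1000 stale updates block for roughly 7 seconds; with `lazy = true` the
/// per-call fsync is skipped and any removals lost in a crash are simply
/// replayed on the next consolidation.
fn cleanup_stale_updates(
	remove: impl Fn(&str, bool) -> io::Result<()>, monitor_key: &str, latest_update_id: u64,
	lazy: bool,
) -> io::Result<()> {
	for update_id in 1..=latest_update_id {
		// Hypothetical key format, for illustration only.
		let key = format!("{}/{}", monitor_key, update_id);
		remove(&key, lazy)?;
	}
	Ok(())
}
```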

Successfully merging this pull request may close these issues: Add lazy persistence back to KVStore::delete