-
Notifications
You must be signed in to change notification settings - Fork 433
Defer ChainMonitor updates and persistence to flush() #4345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Defer ChainMonitor updates and persistence to flush() #4345
Conversation
|
👋 Hi! I see this is a draft PR. |
|
Added a DeferredChainMonitor wrapper instead of modifying ChainMonitor directly. The wrapper intercepts watch_channel and update_channel calls, queues them, and returns InProgress. When flush is called, it processes the queued operations and persists them in the correct order after ChannelManager persistence. This approach keeps ChainMonitor unchanged so that existing tests which expect synchronous behavior continue to work without modification. Only the background processor and production code paths use the deferred wrapper while the test suite can keep using ChainMonitor directly. |
36a8b33 to
73c0a66
Compare
|
Initially attempted to implement this as a thin adapter/wrapper that would sit between the ChannelManager and an existing ChainMonitor, forwarding calls while deferring the Watch operations. However, when integrating with ldk-node, this approach quickly ran into Rust ownership and lifetime issues since it required keeping both the original ChainMonitor and the wrapper around simultaneously. The current implementation takes a simpler approach where DeferredChainMonitor owns its own ChainMonitor internally and implements Deref to it, making it a complete drop-in replacement that can be instantiated with the same parameters as ChainMonitor while exposing all the same traits and methods. |
5bd0ea3 to
0c005d0
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4345 +/- ##
========================================
Coverage 86.09% 86.09%
========================================
Files 156 157 +1
Lines 102462 102932 +470
Branches 102462 102932 +470
========================================
+ Hits 88213 88621 +408
- Misses 11753 11808 +55
- Partials 2496 2503 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0c005d0 to
bc1b327
Compare
ed6f0a3 to
360b6e5
Compare
Extract common logic for listing monitor files into a helper function that filters out temporary .tmp files created during persistence operations. This simplifies test code and improves reliability on systems where directory iteration order is non-deterministic. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduce a `DeferredChainMonitor` wrapper around `ChainMonitor` that queues `watch_channel` and `update_channel` operations, returning `InProgress` until `flush()` is called. This enables batched persistence of monitor updates after `ChannelManager` persistence, ensuring correct ordering where the `ChannelManager` state is never ahead of the monitor state on restart. Key changes: - `DeferredChainMonitor` queues monitor operations and returns `InProgress` - Calling `flush()` applies pending operations and persists monitors - All `ChainMonitor` traits (Listen, Confirm, EventsProvider, etc.) are passed through, allowing drop-in replacement - Background processor updated to capture pending count before `ChannelManager` persistence, then flush after persistence completes Includes comprehensive tests covering the full channel lifecycle with payment flows using `DeferredChainMonitor`. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
360b6e5 to
b7d9730
Compare
| /// `update_channel` operations until `flush()` is called, using real | ||
| /// ChannelManagers and a complete channel open + payment flow. | ||
| #[test] | ||
| fn test_deferred_monitor_payment() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to include a test that sets up from scratch, because the test infra is built around ChainMonitor
Summary
Introduce
DeferredChainMonitor, a wrapper aroundChainMonitorthat queueswatch_channelandupdate_channeloperations, returningInProgressuntilflush()is called. This enables batched persistence of monitor updates afterChannelManagerpersistence, ensuring correct ordering where theChannelManagerstate is never ahead of the monitor state on restart.The Problem
There's a race condition that can cause channel force closures: if the node crashes after writing channel monitors but before writing the channel manager, the monitors will be ahead of the manager on restart. This can lead to state desync and force closures.
The Solution
By deferring monitor writes until after the channel manager is persisted (via
flush()), we ensure the manager is always at least as up-to-date as the monitors.Key changes:
DeferredChainMonitorqueues monitor operations and returnsInProgressflush()applies pending operations and persists monitorsChainMonitortraits (Listen,Confirm,EventsProvider, etc.) are passed through, allowing drop-in replacementChannelManagerpersistence, then flush after persistence completesPerformance Impact
Multi-channel, multi-node load testing (using ldk-server chaos branch) shows no measurable throughput difference between deferred and direct persistence modes.
This is likely because forwarding and payment processing are already effectively single-threaded: the background processor batches all forwards for the entire node in a single pass, so the deferral overhead doesn't add any meaningful bottleneck to an already serialized path.
Alternative Designs Considered
Several approaches were explored to solve the monitor/manager persistence ordering problem:
1. Queue at KVStore level (#4310)
Introduces a
QueuedKVStoreSyncwrapper that queues all writes in memory, committing them in a single batch at chokepoints where data leaves the system (get_and_clear_pending_msg_events,get_and_clear_pending_events). This approach aims for true atomic multi-key writes but requires KVStore backends that support transactions (e.g., SQLite) - filesystem backends cannot achieve full atomicity.Trade-offs: Most general solution but requires changes to persistence boundaries and cannot fully close the desync gap with filesystem storage.
2. Queue at Persister level (#4317)
Updates
MonitorUpdatingPersisterto queue persist operations in memory, with actual writes happening onflush(). Addsflush()to thePersisttrait andChainMonitor.Trade-offs: Only fixes the issue for
MonitorUpdatingPersister; customPersistimplementations remain vulnerable to the race condition.3. Queue internally in ChainMonitor (#4351)
Modifies
ChainMonitordirectly to queue operations internally, returningInProgressuntilflush()is called.Trade-offs: Requires an enormous amount of test changes since existing tests expect immediate persistence behavior.