@joostjager
Contributor

@joostjager joostjager commented Jan 26, 2026

Summary

Introduce DeferredChainMonitor, a wrapper around ChainMonitor that queues watch_channel and update_channel operations and returns InProgress until flush() is called. This allows monitor updates to be persisted in a batch after the ChannelManager has been persisted, so that on restart the persisted monitor state is never ahead of the persisted ChannelManager state.

The Problem

There is a crash-timing race that can cause channel force closures: if the node crashes after writing the channel monitors but before writing the channel manager, the persisted monitors are ahead of the persisted manager on restart. The resulting state desync can lead to force closures.

The Solution

By deferring monitor writes until after the channel manager is persisted (via flush()), we ensure the manager is always at least as up-to-date as the monitors.

Key changes:

  • DeferredChainMonitor queues monitor operations and returns InProgress
  • Calling flush() applies pending operations and persists monitors
  • All ChainMonitor traits (Listen, Confirm, EventsProvider, etc.) are passed through, allowing drop-in replacement
  • Background processor updated to capture pending count before ChannelManager persistence, then flush after persistence completes
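
The ordering described in the list above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `DeferredMonitor`, `InnerMonitor`, `PendingOp`, `UpdateStatus`, `persist_cycle`, and the `flush(count)` signature are all hypothetical stand-ins, and real monitor data is replaced by plain ids.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Simplified stand-in for a monitor update status (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum UpdateStatus {
    Completed,
    InProgress,
}

/// A queued operation; real monitor data is replaced by plain ids here.
#[derive(Debug, PartialEq, Eq)]
enum PendingOp {
    WatchChannel { channel_id: u32 },
    UpdateChannel { channel_id: u32, update_id: u64 },
}

/// Hypothetical inner monitor that "persists" an operation as soon as it is applied.
#[derive(Default)]
struct InnerMonitor {
    persisted: Vec<PendingOp>,
}

impl InnerMonitor {
    fn apply(&mut self, op: PendingOp) -> UpdateStatus {
        self.persisted.push(op);
        UpdateStatus::Completed
    }
}

/// Queues operations instead of applying them; `flush` applies them later.
#[derive(Default)]
struct DeferredMonitor {
    inner: Mutex<InnerMonitor>,
    pending: Mutex<VecDeque<PendingOp>>,
}

impl DeferredMonitor {
    fn watch_channel(&self, channel_id: u32) -> UpdateStatus {
        self.pending.lock().unwrap().push_back(PendingOp::WatchChannel { channel_id });
        UpdateStatus::InProgress
    }

    fn update_channel(&self, channel_id: u32, update_id: u64) -> UpdateStatus {
        self.pending
            .lock()
            .unwrap()
            .push_back(PendingOp::UpdateChannel { channel_id, update_id });
        UpdateStatus::InProgress
    }

    fn pending_count(&self) -> usize {
        self.pending.lock().unwrap().len()
    }

    fn persisted_count(&self) -> usize {
        self.inner.lock().unwrap().persisted.len()
    }

    /// Applies at most `count` queued operations, oldest first, and returns
    /// how many were applied.
    fn flush(&self, count: usize) -> usize {
        let mut applied = 0;
        for _ in 0..count {
            let next = self.pending.lock().unwrap().pop_front();
            let Some(op) = next else { break };
            self.inner.lock().unwrap().apply(op);
            applied += 1;
        }
        applied
    }
}

/// One persistence cycle: capture the pending count, persist the manager
/// (`persist_manager` is a hypothetical callback), then flush only the
/// operations that were already queued when the count was captured.
fn persist_cycle(monitor: &DeferredMonitor, persist_manager: impl FnOnce()) -> usize {
    let count = monitor.pending_count();
    persist_manager();
    monitor.flush(count)
}
```

Capturing the count before persisting the manager means that an operation queued while the manager is being written stays in the queue until the next cycle, so nothing is flushed that the just-written manager snapshot might not know about.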

Performance Impact

Multi-channel, multi-node load testing (using ldk-server chaos branch) shows no measurable throughput difference between deferred and direct persistence modes.

This is likely because forwarding and payment processing are already effectively single-threaded: the background processor batches all forwards for the entire node in a single pass, so the deferral overhead doesn't add any meaningful bottleneck to an already serialized path.

Alternative Designs Considered

Several approaches were explored to solve the monitor/manager persistence ordering problem:

1. Queue at KVStore level (#4310)

Introduces a QueuedKVStoreSync wrapper that queues all writes in memory, committing them in a single batch at chokepoints where data leaves the system (get_and_clear_pending_msg_events, get_and_clear_pending_events). This approach aims for true atomic multi-key writes but requires KVStore backends that support transactions (e.g., SQLite); filesystem backends cannot achieve full atomicity.

Trade-offs: Most general solution but requires changes to persistence boundaries and cannot fully close the desync gap with filesystem storage.
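
To make the idea concrete, here is a rough sketch of a write-queueing store. It is not the code from #4310: `SimpleKvStore` is a deliberately simplified, hypothetical trait (it does not match LDK's KVStore signatures), and `MemoryStore`, `QueuedStore`, and `commit()` are illustrative names.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical, simplified write interface used only for this sketch.
trait SimpleKvStore {
    fn write(&self, key: &str, value: Vec<u8>);
}

/// In-memory backend standing in for a real store.
#[derive(Default)]
struct MemoryStore {
    data: Mutex<HashMap<String, Vec<u8>>>,
}

impl MemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.data.lock().unwrap().get(key).cloned()
    }
}

impl SimpleKvStore for MemoryStore {
    fn write(&self, key: &str, value: Vec<u8>) {
        self.data.lock().unwrap().insert(key.to_string(), value);
    }
}

/// Wrapper that buffers writes in memory until `commit` is called.
struct QueuedStore<S: SimpleKvStore> {
    inner: S,
    queue: Mutex<Vec<(String, Vec<u8>)>>,
}

impl<S: SimpleKvStore> QueuedStore<S> {
    fn new(inner: S) -> Self {
        Self { inner, queue: Mutex::new(Vec::new()) }
    }

    /// Writes all queued entries to the backend in queue order and returns
    /// how many were written. A transactional backend could wrap this loop
    /// in a single transaction; this in-memory sketch does not.
    fn commit(&self) -> usize {
        let batch = std::mem::take(&mut *self.queue.lock().unwrap());
        let written = batch.len();
        for (key, value) in batch {
            self.inner.write(&key, value);
        }
        written
    }
}

impl<S: SimpleKvStore> SimpleKvStore for QueuedStore<S> {
    /// Queues the write instead of forwarding it to the backend.
    fn write(&self, key: &str, value: Vec<u8>) {
        self.queue.lock().unwrap().push((key.to_string(), value));
    }
}
```

The loop in `commit()` is where atomicity matters: with a backend that cannot apply several keys in one transaction, a crash partway through the loop still leaves a partially written batch, which is the gap the trade-off above refers to.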

2. Queue at Persister level (#4317)

Updates MonitorUpdatingPersister to queue persist operations in memory, with actual writes happening on flush(). Adds flush() to the Persist trait and ChainMonitor.

Trade-offs: Only fixes the issue for MonitorUpdatingPersister; custom Persist implementations remain vulnerable to the race condition.

3. Queue internally in ChainMonitor (#4351)

Modifies ChainMonitor directly to queue operations internally, returning InProgress until flush() is called.

Trade-offs: Requires a very large number of test changes, since existing tests expect immediate persistence behavior.

@ldk-reviews-bot

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

@joostjager joostjager changed the title from "Chain mon deferred writes" to "Defer ChainMonitor updates and persistence to flush()" Jan 26, 2026
@joostjager
Contributor Author

Added a DeferredChainMonitor wrapper instead of modifying ChainMonitor directly. The wrapper intercepts watch_channel and update_channel calls, queues them, and returns InProgress. When flush() is called, it applies the queued operations and persists them; the background processor calls it only after ChannelManager persistence has completed. This keeps ChainMonitor unchanged, so existing tests that expect synchronous behavior continue to work without modification. Only the background processor and production code paths use the deferred wrapper; the test suite can keep using ChainMonitor directly.

@joostjager joostjager force-pushed the chain-mon-deferred-writes branch 3 times, most recently from 36a8b33 to 73c0a66 January 26, 2026 13:59
@joostjager
Contributor Author

Initially attempted to implement this as a thin adapter/wrapper that would sit between the ChannelManager and an existing ChainMonitor, forwarding calls while deferring the Watch operations. However, when integrating with ldk-node, this approach quickly ran into Rust ownership and lifetime issues since it required keeping both the original ChainMonitor and the wrapper around simultaneously. The current implementation takes a simpler approach where DeferredChainMonitor owns its own ChainMonitor internally and implements Deref to it, making it a complete drop-in replacement that can be instantiated with the same parameters as ChainMonitor while exposing all the same traits and methods.
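
The owning-wrapper-plus-Deref pattern mentioned here can be shown with a small sketch. The types below (`InnerMonitor`, `DeferredWrapper`) and their methods are hypothetical placeholders, not the types from this PR; the point is only that a wrapper which owns the inner value and implements `Deref` exposes the inner value's methods without a second, separately owned instance.

```rust
use std::ops::Deref;
use std::sync::Mutex;

/// Hypothetical inner type standing in for the wrapped monitor.
struct InnerMonitor {
    label: String,
}

impl InnerMonitor {
    fn new(label: &str) -> Self {
        Self { label: label.to_string() }
    }

    fn label(&self) -> &str {
        &self.label
    }
}

/// Wrapper that owns the inner value and adds its own queue.
struct DeferredWrapper {
    inner: InnerMonitor,
    queued: Mutex<Vec<u64>>,
}

impl DeferredWrapper {
    /// Takes the same constructor argument as the inner type and builds it
    /// internally, so callers never hold the inner value themselves.
    fn new(label: &str) -> Self {
        Self { inner: InnerMonitor::new(label), queued: Mutex::new(Vec::new()) }
    }

    fn queue_update(&self, update_id: u64) {
        self.queued.lock().unwrap().push(update_id);
    }

    fn queued_len(&self) -> usize {
        self.queued.lock().unwrap().len()
    }
}

impl Deref for DeferredWrapper {
    type Target = InnerMonitor;

    /// Lets callers use `InnerMonitor` methods directly on the wrapper.
    fn deref(&self) -> &Self::Target {
        &self.inner
    }
}
```

With this shape, `DeferredWrapper::new("node-a").label()` resolves through `Deref` to the inner method, while the wrapper's own methods handle the queue. Because there is a single owner, no lifetime has to tie a borrowed inner value to the wrapper.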

@joostjager joostjager force-pushed the chain-mon-deferred-writes branch 2 times, most recently from 5bd0ea3 to 0c005d0 January 26, 2026 14:08
@codecov

codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 86.30952% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.09%. Comparing base (e8a9303) to head (b7d9730).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                     Patch %   Lines
lightning/src/chain/deferred.rs              87.78%    46 Missing and 7 partials ⚠️
lightning-background-processor/src/lib.rs    77.14%    12 Missing and 4 partials ⚠️

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #4345    +/-   ##
========================================
  Coverage   86.09%   86.09%            
========================================
  Files         156      157     +1     
  Lines      102462   102932   +470     
  Branches   102462   102932   +470     
========================================
+ Hits        88213    88621   +408     
- Misses      11753    11808    +55     
- Partials     2496     2503     +7     

Flag     Coverage Δ
tests    86.09% <86.30%> (+<0.01%) ⬆️


joostjager and others added 2 commits January 28, 2026 12:56
Extract common logic for listing monitor files into a helper function
that filters out temporary .tmp files created during persistence
operations. This simplifies test code and improves reliability on
systems where directory iteration order is non-deterministic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
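
A helper of the kind described in the commit message above might look roughly like the following. This is a guess at the shape, not the committed code: the name `list_monitor_files` is hypothetical, and sorting the result is an assumption made here to get a deterministic order regardless of how the directory is iterated.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists regular files in `dir`, skipping any with a `.tmp` extension, and
/// returns them sorted by path. (Hypothetical helper for illustration.)
fn list_monitor_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Temporary files left by in-progress writes are ignored.
        let is_tmp = path.extension().and_then(|ext| ext.to_str()) == Some("tmp");
        if path.is_file() && !is_tmp {
            files.push(path);
        }
    }
    // Directory iteration order is platform-dependent, so sort explicitly.
    files.sort();
    Ok(files)
}
```
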
Introduce a `DeferredChainMonitor` wrapper around `ChainMonitor` that
queues `watch_channel` and `update_channel` operations, returning
`InProgress` until `flush()` is called. This enables batched persistence
of monitor updates after `ChannelManager` persistence, ensuring that on
restart the persisted monitor state is never ahead of the persisted
`ChannelManager` state.

Key changes:
- `DeferredChainMonitor` queues monitor operations and returns `InProgress`
- Calling `flush()` applies pending operations and persists monitors
- All `ChainMonitor` traits (Listen, Confirm, EventsProvider, etc.) are
  passed through, allowing drop-in replacement
- Background processor updated to capture pending count before
  `ChannelManager` persistence, then flush after persistence completes

Includes comprehensive tests covering the full channel lifecycle with
payment flows using `DeferredChainMonitor`.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@joostjager joostjager force-pushed the chain-mon-deferred-writes branch from 360b6e5 to b7d9730 January 28, 2026 12:31
/// `update_channel` operations until `flush()` is called, using real
/// ChannelManagers and a complete channel open + payment flow.
#[test]
fn test_deferred_monitor_payment() {
Contributor Author

Had to include a test that sets everything up from scratch, because the existing test infrastructure is built around ChainMonitor.
