channeldb: split channel close into two phases to avoid long DB write-locks#10732
ziggie1984 wants to merge 5 commits into lightningnetwork:master from
Conversation
Define a backend-agnostic interface for the channel close lifecycle.
The interface deliberately splits close into two phases to avoid
holding a single large write lock on SQLite/Postgres KV-over-SQL
backends, where bulk cascade-deletes inside one transaction can stall
all other writers for seconds:
- CloseChannel (Phase 1): fast, atomic — writes the close summary,
archives channel state, updates the outpoint index, deletes the
small per-channel keys, and registers a cleanup task.
- PurgeClosedChannelData (Phase 2): heavy, incremental — deletes
revocation log entries and forwarding packages in small batches,
each batch in its own short transaction, so other writers can
interleave between batches.
- FetchChannelsPendingCleanup: startup hook — returns channels that
completed Phase 1 but not Phase 2, allowing the node to resume
interrupted purges after a crash or restart.
The current KV implementation on *ChannelStateDB will satisfy this
interface; a future native-SQL backend can provide its own concrete
type without changing any caller.
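As a rough illustration, the two-phase lifecycle could be captured in an interface like the sketch below. The method names come from the PR description, but the signatures, the `OutPoint` placeholder type, and the toy in-memory implementation are assumptions for illustration only, not the PR's actual Go types.

```go
package main

import "fmt"

// OutPoint stands in for wire.OutPoint (illustrative only).
type OutPoint struct {
	Hash  [32]byte
	Index uint32
}

// ChannelCloser sketches the backend-agnostic close lifecycle;
// signatures here are assumptions, not the PR's exact types.
type ChannelCloser interface {
	CloseChannel(op OutPoint) error
	PurgeClosedChannelData(op OutPoint, batchSize int) error
	FetchChannelsPendingCleanup() ([]OutPoint, error)
}

// memCloser is a toy in-memory implementation used to show the flow.
type memCloser struct {
	open    map[OutPoint]bool
	pending map[OutPoint]int // outpoint -> remaining bulk entries
}

func (m *memCloser) CloseChannel(op OutPoint) error {
	if !m.open[op] {
		return fmt.Errorf("channel not found")
	}
	delete(m.open, op)   // Phase 1: leave all open-channel views
	m.pending[op] = 1000 // register the Phase 2 cleanup task
	return nil
}

func (m *memCloser) PurgeClosedChannelData(op OutPoint, batchSize int) error {
	for m.pending[op] > 0 {
		// Each iteration models one short write transaction.
		n := batchSize
		if n > m.pending[op] {
			n = m.pending[op]
		}
		m.pending[op] -= n
	}
	delete(m.pending, op) // deregister the cleanup task; idempotent
	return nil
}

func (m *memCloser) FetchChannelsPendingCleanup() ([]OutPoint, error) {
	var ops []OutPoint
	for op := range m.pending {
		ops = append(ops, op)
	}
	return ops, nil
}
```

Note that `PurgeClosedChannelData` is naturally idempotent here: purging an already-purged channel simply finds nothing pending.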
Phase 1 (CloseChannel) atomically moves the channel out of all open-channel views and records a sentinel key in the channel bucket so scan helpers can distinguish an intentional zombie from data corruption. A lightweight entry is written to pendingChanCleanupBucket so the node can resume Phase 2 after a restart.

Phase 2 (PurgeClosedChannelData) deletes revocation-log entries and forwarding packages in small batches (DefaultPurgeBatchSize = 500) to avoid holding a long write-lock on the KV-over-SQL backend, then removes the channel bucket and deregisters the cleanup task.

fetchChanInfo now returns the new ErrChannelPendingCleanup sentinel when the channel-info key is absent but the Phase 1 sentinel key is present. The three open-channel scan paths (fetchNodeChannels, FetchPermAndTempPeers, channelScanner) are each guarded to skip zombie buckets rather than surface an error to callers.

TestChannelStateTransition is updated to drive Phase 2 explicitly before asserting that revlog and forwarding packages have been deleted.
After RepairLinkNodes, query FetchChannelsPendingCleanup and spawn a goroutine for each outstanding channel to drive PurgeClosedChannelData to completion. This ensures that a purge interrupted by a node restart is automatically resumed on the next startup rather than leaving stale bulk data in the KV store indefinitely.
Three new test cases: TestCloseChannelPhase1RemovesFromOpenScans - verifies that after Phase 1 the closed channel no longer appears in FetchAllChannels, FetchOpenChannels, or FetchPermAndTempPeers, and that FetchChannelsPendingCleanup returns the outpoint while revlog entries are still intact. TestCloseChannelPhase2PurgesDataInBatches - drives Phase 2 with a small batch size (2) to confirm multi-iteration deletion, checks the channel bucket is fully removed, the cleanup task deregistered, and that a second call is idempotent. TestCloseChannelPendingCleanupPersists - simulates a node restart by wrapping the same backend in a fresh ChannelStateDB, confirms the cleanup task survives, and verifies Phase 2 on the new instance runs to completion.
Summary of Changes (Gemini Code Assist)

This pull request addresses database performance issues encountered when closing Lightning channels with large revocation logs. By decoupling the atomic state update from the resource-intensive deletion of historical data, the node remains responsive during channel closures. The solution is designed to be crash-safe, ensuring that background cleanup tasks are completed even if the node restarts.

Highlights
Documents the two-phase channel close change (PR lightningnetwork#10732) in the 0.21.0 release notes under the Database section.
🔴 PR Severity: CRITICAL
🔴 Critical (5 files)
⚪ Excluded (tests) (2 files)
Analysis: This PR touches multiple distinct critical packages simultaneously.
The PR exceeds 500 non-test lines changed (707 lines), which independently qualifies for a severity bump, though the baseline is already CRITICAL.
Code Review
This pull request implements a two-phase channel closure process to prevent long-held database locks on SQLite and Postgres backends. Phase 1 (CloseChannel) handles atomic state updates and archiving, while Phase 2 (PurgeClosedChannelData) incrementally deletes bulk historical data, such as revocation logs, in small batches. The changes also include logic to resume interrupted purges on startup and updates to channel scanning to skip channels pending cleanup. Feedback identifies opportunities to improve the conciseness of the ErrChannelPendingCleanup documentation and to consolidate redundant comments regarding the cleanup sentinel key.
// ErrChannelPendingCleanup is returned when a channel bucket exists
// in openChannelBucket but chanInfoKey is absent and the
// pendingCleanupKey sentinel is present. This means CloseChannel
// (Phase 1) has completed — the channel is logically closed — but
// PurgeClosedChannelData (Phase 2) has not yet deleted the bulk
// historical data from this bucket. Callers that scan all open
// channels should skip such buckets; callers doing targeted lookups
// should treat the channel as not found.
ErrChannelPendingCleanup = fmt.Errorf("channel pending cleanup")
var (
	// pendingChanCleanupBucket tracks channels whose bulk historical data
	// (revocation log, forwarding packages) has not yet been deleted after
	// close. Each key is a serialized wire.OutPoint; the value is empty.
	// The presence of a key means Phase 2 (PurgeClosedChannelData) is still
	// outstanding for that channel.
	pendingChanCleanupBucket = []byte("pending-chan-cleanup")

	// pendingCleanupKey is written into a channel's own bucket by Phase 1
	// (CloseChannel) to mark that bulk data deletion is still pending. It
	// allows fetchNodeChannels to distinguish intentional Phase 1 state
	// from genuine data corruption when chanInfoKey is absent. The key is
	// removed automatically when Phase 2 deletes the channel bucket.
	pendingCleanupKey = []byte("phase2-pending")
)
Massively affected by this. Closing a channel with more than 2M states locks my node up for about 30-50 minutes depending on the number of states. Also lnd doesn't register correct channel closure on tx confirmation. During closure a lot of channels go offline, and I see force closures due to revocation errors lnd is throwing because of db locks. Running PG and lnd 0.20.1 here. Possibly also related to this and should be solved by this PR.
Authenticity is also affected by this. Closing a channel with more than 500k states locks up my node for 20-30 minutes. I am running oversized RAM/CPU on both my LND (20.1) and PG Server (on separate nodes). I currently have 180 public and 15 private channels with total states on my node at 40 million. I occasionally cycle channels to reduce the total number of states, but with this issue there is downtime with every channel close. Logs constantly show timeouts/db retries in lnd.log ("closed: db tx retries exceeded" and other similar errors).
Problem
When a Lightning channel with a large revocation log is closed, the
current implementation deletes all channel data — including every
revocation log entry and forwarding package — inside a single
transaction. On the KV-over-SQL backends (SQLite / Postgres) the
underlying schema stores all KV data in one self-referential table with
ON DELETE CASCADE. A channel that has been open for a long time can
accumulate tens of thousands of revocation log entries, so this cascade
delete holds the database write-lock for an extended period, blocking
all other DB operations on the node (HTLC updates, channel state
transitions, etc.).
Solution
Split channel close into two phases:
Phase 1 – fast, atomic close (`CloseChannel`)

Removes the channel from every open-channel view (writes the close summary, moves the record to the closed-channel bucket, clears the in-memory state). A small sentinel key is written into the channel bucket and an entry is added to a new `pendingChanCleanupBucket` so the node can always resume Phase 2 after a restart. The total data written is O(1) regardless of channel history.

Phase 2 – incremental bulk deletion (`PurgeClosedChannelData`)

Deletes revocation log entries and forwarding packages in small batches (`DefaultPurgeBatchSize = 500` per transaction), then removes the channel bucket entirely and deregisters the cleanup task. Each batch holds the write-lock only for a short window, keeping the node responsive throughout.

Restart safety

At startup, `server.go` calls `FetchChannelsPendingCleanup` and spawns a goroutine for each outstanding channel to drive Phase 2 to completion, so no cleanup task can be permanently lost.

Zombie-bucket guards

`fetchChanInfo` now returns the new `ErrChannelPendingCleanup` sentinel when a channel bucket exists but lacks its info key (Phase 2 in progress). The three open-channel scan paths (`fetchNodeChannels`, `FetchPermAndTempPeers`, `channelScanner`) each skip such buckets rather than surface an error to callers.
Commits
- channeldb: introduce ChannelCloser interface for two-phase cleanup
- channeldb: split CloseChannel into two phases to reduce DB lock pressure
- server: resume pending channel data purges at startup
- channeldb: add tests for two-phase close and pending cleanup

Test plan

- `go test ./channeldb/... -run TestCloseChannel` – three new unit tests covering Phase 1 scan guards, Phase 2 batch deletion, and cleanup persistence across restart
- `go test ./channeldb/... -run TestChannelStateTransition` – existing test updated to drive Phase 2 before asserting revlog deletion
- `make lint` passes