channeldb: split channel close into two phases to avoid long DB write-locks#10732
ziggie1984 wants to merge 5 commits into lightningnetwork:master from
Conversation
Define a backend-agnostic interface for the channel close lifecycle.
The interface deliberately splits close into two phases to avoid
holding a single large write lock on SQLite/Postgres KV-over-SQL
backends, where bulk cascade-deletes inside one transaction can stall
all other writers for seconds:
- CloseChannel (Phase 1): fast, atomic — writes the close summary,
archives channel state, updates the outpoint index, deletes the
small per-channel keys, and registers a cleanup task.
- PurgeClosedChannelData (Phase 2): heavy, incremental — deletes
revocation log entries and forwarding packages in small batches,
each batch in its own short transaction, so other writers can
interleave between batches.
- FetchChannelsPendingCleanup: startup hook — returns channels that
completed Phase 1 but not Phase 2, allowing the node to resume
interrupted purges after a crash or restart.
The current KV implementation on *ChannelStateDB will satisfy this
interface; a future native-SQL backend can provide its own concrete
type without changing any caller.
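As a rough illustration, the two-phase lifecycle could be captured in an interface like the sketch below. The method names come from the PR description, but the signatures, the `OutPoint` placeholder type, and the toy in-memory implementation are assumptions for illustration only, not the PR's actual Go types.

```go
package main

import "fmt"

// OutPoint stands in for wire.OutPoint (illustrative only).
type OutPoint struct {
	Hash  [32]byte
	Index uint32
}

// ChannelCloser sketches the backend-agnostic close lifecycle;
// signatures here are assumptions, not the PR's exact types.
type ChannelCloser interface {
	CloseChannel(op OutPoint) error
	PurgeClosedChannelData(op OutPoint, batchSize int) error
	FetchChannelsPendingCleanup() ([]OutPoint, error)
}

// memCloser is a toy in-memory implementation used to show the flow.
type memCloser struct {
	open    map[OutPoint]bool
	pending map[OutPoint]int // outpoint -> remaining bulk entries
}

func (m *memCloser) CloseChannel(op OutPoint) error {
	if !m.open[op] {
		return fmt.Errorf("channel not found")
	}
	delete(m.open, op)   // Phase 1: leave all open-channel views
	m.pending[op] = 1000 // register the Phase 2 cleanup task
	return nil
}

func (m *memCloser) PurgeClosedChannelData(op OutPoint, batchSize int) error {
	for m.pending[op] > 0 {
		// Each iteration models one short write transaction.
		n := batchSize
		if n > m.pending[op] {
			n = m.pending[op]
		}
		m.pending[op] -= n
	}
	delete(m.pending, op) // deregister the cleanup task; idempotent
	return nil
}

func (m *memCloser) FetchChannelsPendingCleanup() ([]OutPoint, error) {
	var ops []OutPoint
	for op := range m.pending {
		ops = append(ops, op)
	}
	return ops, nil
}
```

Note that `PurgeClosedChannelData` is naturally idempotent here: purging an already-purged channel simply finds nothing pending.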
Phase 1 (CloseChannel) atomically moves the channel out of all open-channel views and records a sentinel key in the channel bucket so scan helpers can distinguish an intentional zombie from data corruption. A lightweight entry is written to pendingChanCleanupBucket so the node can resume Phase 2 after a restart.

Phase 2 (PurgeClosedChannelData) deletes revocation-log entries and forwarding packages in small batches (DefaultPurgeBatchSize = 500) to avoid holding a long write-lock on the KV-over-SQL backend, then removes the channel bucket and deregisters the cleanup task.

fetchChanInfo now returns the new ErrChannelPendingCleanup sentinel when the channel-info key is absent but the Phase 1 sentinel key is present. The three open-channel scan paths (fetchNodeChannels, FetchPermAndTempPeers, channelScanner) are each guarded to skip zombie buckets rather than surface an error to callers.

TestChannelStateTransition is updated to drive Phase 2 explicitly before asserting that revlog and forwarding packages have been deleted.
After RepairLinkNodes, query FetchChannelsPendingCleanup and spawn a goroutine for each outstanding channel to drive PurgeClosedChannelData to completion. This ensures that a purge interrupted by a node restart is automatically resumed on the next startup rather than leaving stale bulk data in the KV store indefinitely.
Three new test cases: TestCloseChannelPhase1RemovesFromOpenScans - verifies that after Phase 1 the closed channel no longer appears in FetchAllChannels, FetchOpenChannels, or FetchPermAndTempPeers, and that FetchChannelsPendingCleanup returns the outpoint while revlog entries are still intact. TestCloseChannelPhase2PurgesDataInBatches - drives Phase 2 with a small batch size (2) to confirm multi-iteration deletion, checks the channel bucket is fully removed, the cleanup task deregistered, and that a second call is idempotent. TestCloseChannelPendingCleanupPersists - simulates a node restart by wrapping the same backend in a fresh ChannelStateDB, confirms the cleanup task survives, and verifies Phase 2 on the new instance runs to completion.
Summary of Changes (Gemini Code Assist)

This pull request addresses database performance issues encountered when closing Lightning channels with large revocation logs. By decoupling the atomic state update from the resource-intensive deletion of historical data, the node remains responsive during channel closures. The solution is designed to be crash-safe, ensuring that background cleanup tasks are completed even if the node restarts.

Highlights
Documents the two-phase channel close change (PR lightningnetwork#10732) in the 0.21.0 release notes under the Database section.
🔴 PR Severity: CRITICAL
🔴 Critical (5 files)
⚪ Excluded (tests) (2 files)
Analysis: This PR touches multiple distinct critical packages simultaneously.
The PR exceeds 500 non-test lines changed (707 lines), which independently qualifies for a severity bump, though the baseline is already CRITICAL.
Code Review
This pull request implements a two-phase channel closure process to prevent long-held database locks on SQLite and Postgres backends. Phase 1 (CloseChannel) handles atomic state updates and archiving, while Phase 2 (PurgeClosedChannelData) incrementally deletes bulk historical data, such as revocation logs, in small batches. The changes also include logic to resume interrupted purges on startup and updates to channel scanning to skip channels pending cleanup. Feedback identifies opportunities to improve the conciseness of the ErrChannelPendingCleanup documentation and to consolidate redundant comments regarding the cleanup sentinel key.
// ErrChannelPendingCleanup is returned when a channel bucket exists
// in openChannelBucket but chanInfoKey is absent and the
// pendingCleanupKey sentinel is present. This means CloseChannel
// (Phase 1) has completed — the channel is logically closed — but
// PurgeClosedChannelData (Phase 2) has not yet deleted the bulk
// historical data from this bucket. Callers that scan all open
// channels should skip such buckets; callers doing targeted lookups
// should treat the channel as not found.
ErrChannelPendingCleanup = fmt.Errorf("channel pending cleanup")
var (
	// pendingChanCleanupBucket tracks channels whose bulk historical data
	// (revocation log, forwarding packages) has not yet been deleted after
	// close. Each key is a serialized wire.OutPoint; the value is empty.
	// The presence of a key means Phase 2 (PurgeClosedChannelData) is still
	// outstanding for that channel.
	pendingChanCleanupBucket = []byte("pending-chan-cleanup")

	// pendingCleanupKey is written into a channel's own bucket by Phase 1
	// (CloseChannel) to mark that bulk data deletion is still pending. It
	// allows fetchNodeChannels to distinguish intentional Phase 1 state
	// from genuine data corruption when chanInfoKey is absent. The key is
	// removed automatically when Phase 2 deletes the channel bucket.
	pendingCleanupKey = []byte("phase2-pending")
)
Massively affected by this. Closing a channel with more than 2M states locks my node up for about 30-50 minutes depending on the number of states. Also lnd doesn't register correct channel closure on tx confirmation. During closure a lot of channels go offline, and I see force closures due to revocation errors lnd is throwing because of db locks. Running PG and lnd 0.20.1 here. Possibly also related to this and should be solved by this PR.
Authenticity is also affected by this. Closing a channel with more than 500k states locks up my node for 20-30 minutes. I am running oversized RAM/CPU on both my LND (20.1) and PG Server (on separate nodes). I currently have 180 public and 15 private channels with total states on my node at 40 million. I occasionally cycle channels to reduce the total number of states, but with this issue there is downtime with every channel close. Logs constantly show timeouts/db retries in lnd.log ("closed: db tx retries exceeded" and other similar errors).
Problem
When a Lightning channel with a large revocation log is closed, the
current implementation deletes all channel data — including every
revocation log entry and forwarding package — inside a single
transaction. On the KV-over-SQL backends (SQLite / Postgres) the
underlying schema stores all KV data in one self-referential table with
ON DELETE CASCADE. A channel that has been open for a long time can
accumulate tens of thousands of revocation log entries, so this cascade
delete holds the database write-lock for an extended period, blocking
all other DB operations on the node (HTLC updates, channel state
transitions, etc.).
Solution
Split channel close into two phases:
Phase 1 – fast, atomic close (`CloseChannel`)

Removes the channel from every open-channel view (writes the close summary, moves the record to the closed-channel bucket, clears the in-memory state). A small sentinel key is written into the channel bucket and an entry is added to a new `pendingChanCleanupBucket` so the node can always resume Phase 2 after a restart. The total data written is O(1) regardless of channel history.

Phase 2 – incremental bulk deletion (`PurgeClosedChannelData`)

Deletes revocation log entries and forwarding packages in small batches (`DefaultPurgeBatchSize = 500` per transaction), then removes the channel bucket entirely and deregisters the cleanup task. Each batch holds the write-lock only for a short window, keeping the node responsive throughout.

Restart safety

At startup, `server.go` calls `FetchChannelsPendingCleanup` and spawns a goroutine for each outstanding channel to drive Phase 2 to completion, so no cleanup task can be permanently lost.

Zombie-bucket guards

`fetchChanInfo` now returns the new `ErrChannelPendingCleanup` sentinel when a channel bucket exists but lacks its info key (Phase 2 in progress). The three open-channel scan paths (`fetchNodeChannels`, `FetchPermAndTempPeers`, `channelScanner`) each skip such buckets rather than surface an error to callers.
Commits
- channeldb: introduce ChannelCloser interface for two-phase cleanup
- channeldb: split CloseChannel into two phases to reduce DB lock pressure
- server: resume pending channel data purges at startup
- channeldb: add tests for two-phase close and pending cleanup

Test plan

- `go test ./channeldb/... -run TestCloseChannel` – three new unit tests covering Phase 1 scan guards, Phase 2 batch deletion, and cleanup persistence across restart
- `go test ./channeldb/... -run TestChannelStateTransition` – existing test updated to drive Phase 2 before asserting revlog deletion
- `make lint` passes