
channeldb: split channel close into two phases to avoid long DB write-locks#10732

Draft

ziggie1984 wants to merge 5 commits into lightningnetwork:master from ziggie1984:channeldb-two-phase-close

Conversation

@ziggie1984
Collaborator

Problem

When a Lightning channel with a large revocation log is closed, the
current implementation deletes all channel data — including every
revocation log entry and forwarding package — inside a single
transaction. On the KV-over-SQL backends (SQLite / Postgres) the
underlying schema stores all KV data in one self-referential table with
ON DELETE CASCADE. A channel that has been open for a long time can
accumulate tens of thousands of revocation log entries, so this cascade
delete holds the database write-lock for an extended period, blocking
all other DB operations on the node (HTLC updates, channel state
transitions, etc.).

Solution

Split channel close into two phases:

Phase 1 – fast, atomic close (CloseChannel)
Removes the channel from every open-channel view (writes the close
summary, moves the record to the closed-channel bucket, clears the
in-memory state). A small sentinel key is written into the channel
bucket and an entry is added to a new pendingChanCleanupBucket so
the node can always resume Phase 2 after a restart. The total data
written is O(1) regardless of channel history.
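As a rough illustration of why Phase 1 stays O(1), here is a minimal sketch against a toy flat key/value store. The key names (`open/`, `closed/`, `pending-cleanup/`, `phase2-pending`) are illustrative only, not lnd's actual bucket layout, and in the real code all four writes share one small atomic transaction:

```go
package main

import "fmt"

// kv is a toy flat key/value store standing in for the channel DB.
type kv map[string][]byte

// closeChannelPhase1 sketches Phase 1's O(1) bookkeeping: write the
// close summary, drop the channel from the open view, leave a sentinel
// in the channel's own namespace, and register the cleanup task.
func closeChannelPhase1(store kv, chanID string, summary []byte) {
	store["closed/"+chanID] = summary             // close summary
	delete(store, "open/"+chanID)                 // remove from open-channel view
	store["chan/"+chanID+"/phase2-pending"] = nil // sentinel: zombie, not corruption
	store["pending-cleanup/"+chanID] = nil        // lets Phase 2 resume after restart
}

func main() {
	store := kv{"open/ch1": []byte("info")}
	closeChannelPhase1(store, "ch1", []byte("summary"))
	_, stillOpen := store["open/ch1"]
	_, cleanupPending := store["pending-cleanup/ch1"]
	fmt.Println(stillOpen, cleanupPending) // false true
}
```

Note that none of these writes scale with the number of revocation-log entries, which is the whole point of the split.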

Phase 2 – incremental bulk deletion (PurgeClosedChannelData)
Deletes revocation log entries and forwarding packages in small batches
(DefaultPurgeBatchSize = 500 per transaction), then removes the
channel bucket entirely and deregisters the cleanup task. Each batch
holds the write-lock only for a short window, keeping the node
responsive throughout.

Restart safety
At startup, server.go calls FetchChannelsPendingCleanup and spawns
a goroutine for each outstanding channel to drive Phase 2 to
completion, so no cleanup task can be permanently lost.

Zombie-bucket guards
fetchChanInfo now returns the new ErrChannelPendingCleanup sentinel
when a channel bucket exists but lacks its info key (Phase 2 in
progress). The three open-channel scan paths (fetchNodeChannels,
FetchPermAndTempPeers, channelScanner) each skip such buckets
rather than surface an error to callers.
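The guard pattern can be sketched like so, assuming a flattened stand-in for a kvdb channel bucket (key names and helper signatures are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

var errChannelPendingCleanup = errors.New("channel pending cleanup")

// bucket is a stand-in for a kvdb channel bucket.
type bucket map[string][]byte

// fetchChanInfo sketches the guard: a bucket missing its info key but
// carrying the Phase 2 sentinel is an intentional zombie, not corruption.
func fetchChanInfo(b bucket) ([]byte, error) {
	if info, ok := b["chan-info"]; ok {
		return info, nil
	}
	if _, ok := b["phase2-pending"]; ok {
		return nil, errChannelPendingCleanup
	}
	return nil, errors.New("channel info not found (possible corruption)")
}

// scanChannels shows how a scan path skips zombie buckets rather than
// surfacing the sentinel error to callers.
func scanChannels(buckets []bucket) ([][]byte, error) {
	var infos [][]byte
	for _, b := range buckets {
		info, err := fetchChanInfo(b)
		if errors.Is(err, errChannelPendingCleanup) {
			continue // Phase 2 in progress: skip, not an error.
		}
		if err != nil {
			return nil, err
		}
		infos = append(infos, info)
	}
	return infos, nil
}

func main() {
	buckets := []bucket{
		{"chan-info": []byte("a")},
		{"phase2-pending": nil}, // zombie: Phase 1 done, Phase 2 pending
		{"chan-info": []byte("b")},
	}
	infos, err := scanChannels(buckets)
	fmt.Println(len(infos), err) // 2 <nil>
}
```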

Commits

  1. channeldb: introduce ChannelCloser interface for two-phase cleanup
  2. channeldb: split CloseChannel into two phases to reduce DB lock pressure
  3. server: resume pending channel data purges at startup
  4. channeldb: add tests for two-phase close and pending cleanup

Test plan

  • go test ./channeldb/... -run TestCloseChannel — three new unit tests covering Phase 1 scan guards, Phase 2 batch deletion, and cleanup persistence across restart
  • go test ./channeldb/... -run TestChannelStateTransition — existing test updated to drive Phase 2 before asserting revlog deletion
  • make lint passes

Define a backend-agnostic interface for the channel close lifecycle.
The interface deliberately splits close into two phases to avoid
holding a single large write lock on SQLite/Postgres KV-over-SQL
backends, where bulk cascade-deletes inside one transaction can stall
all other writers for seconds:

  - CloseChannel (Phase 1): fast, atomic — writes the close summary,
    archives channel state, updates the outpoint index, deletes the
    small per-channel keys, and registers a cleanup task.

  - PurgeClosedChannelData (Phase 2): heavy, incremental — deletes
    revocation log entries and forwarding packages in small batches,
    each batch in its own short transaction, so other writers can
    interleave between batches.

  - FetchChannelsPendingCleanup: startup hook — returns channels that
    completed Phase 1 but not Phase 2, allowing the node to resume
    interrupted purges after a crash or restart.

The current KV implementation on *ChannelStateDB will satisfy this
interface; a future native-SQL backend can provide its own concrete
type without changing any caller.
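A rough sketch of the interface shape, with simplified signatures (the real methods presumably take wire.OutPoint and kvdb types; `stubDB` is a placeholder used only to show the usual compile-time satisfaction check):

```go
package main

import "fmt"

// OutPoint stands in for wire.OutPoint; only the method set matters here.
type OutPoint struct {
	Index uint32
}

// ChannelCloser mirrors the backend-agnostic lifecycle described above.
type ChannelCloser interface {
	// CloseChannel runs Phase 1: fast, atomic, O(1) bookkeeping.
	CloseChannel(op OutPoint) error

	// PurgeClosedChannelData runs Phase 2: batched bulk deletion, each
	// batch in its own short transaction.
	PurgeClosedChannelData(op OutPoint, batchSize int) error

	// FetchChannelsPendingCleanup returns channels that finished
	// Phase 1 but not Phase 2, for resumption at startup.
	FetchChannelsPendingCleanup() ([]OutPoint, error)
}

// stubDB is a placeholder concrete type; a future native-SQL backend
// would slot in here the same way.
type stubDB struct{}

func (s *stubDB) CloseChannel(OutPoint) error                { return nil }
func (s *stubDB) PurgeClosedChannelData(OutPoint, int) error { return nil }
func (s *stubDB) FetchChannelsPendingCleanup() ([]OutPoint, error) {
	return nil, nil
}

// Compile-time check that stubDB satisfies the interface.
var _ ChannelCloser = (*stubDB)(nil)

func main() {
	var c ChannelCloser = &stubDB{}
	pending, err := c.FetchChannelsPendingCleanup()
	fmt.Println(len(pending), err) // 0 <nil>
}
```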
Phase 1 (CloseChannel) atomically moves the channel out of all
open-channel views and records a sentinel key in the channel bucket so
scan helpers can distinguish an intentional zombie from data corruption.
A lightweight entry is written to pendingChanCleanupBucket so the node
can resume Phase 2 after a restart.

Phase 2 (PurgeClosedChannelData) deletes revocation-log entries and
forwarding packages in small batches (DefaultPurgeBatchSize = 500) to
avoid holding a long write-lock on the KV-over-SQL backend, then
removes the channel bucket and deregisters the cleanup task.

fetchChanInfo now returns the new ErrChannelPendingCleanup sentinel when
the channel-info key is absent but the phase-2 sentinel key is present.
The three open-channel scan paths (fetchNodeChannels, FetchPermAndTempPeers,
channelScanner) are each guarded to skip zombie buckets rather than
surface an error to callers.

TestChannelStateTransition is updated to drive Phase 2 explicitly before
asserting that revlog and forwarding packages have been deleted.
After RepairLinkNodes, query FetchChannelsPendingCleanup and spawn a
goroutine for each outstanding channel to drive PurgeClosedChannelData
to completion. This ensures that a purge interrupted by a node restart
is automatically resumed on the next startup rather than leaving stale
bulk data in the KV store indefinitely.
Three new test cases:

TestCloseChannelPhase1RemovesFromOpenScans - verifies that after Phase 1
the closed channel no longer appears in FetchAllChannels,
FetchOpenChannels, or FetchPermAndTempPeers, and that
FetchChannelsPendingCleanup returns the outpoint while revlog entries
are still intact.

TestCloseChannelPhase2PurgesDataInBatches - drives Phase 2 with a small
batch size (2) to confirm multi-iteration deletion, checks the channel
bucket is fully removed, the cleanup task deregistered, and that a
second call is idempotent.

TestCloseChannelPendingCleanupPersists - simulates a node restart by
wrapping the same backend in a fresh ChannelStateDB, confirms the
cleanup task survives, and verifies Phase 2 on the new instance runs to
completion.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses database performance issues encountered when closing Lightning channels with large revocation logs. By decoupling the atomic state update from the resource-intensive deletion of historical data, the node remains responsive during channel closures. The solution is designed to be crash-safe, ensuring that background cleanup tasks are completed even if the node restarts.

Highlights

  • Two-Phase Channel Closure: Split the channel closure process into two phases to prevent long-held database write-locks. Phase 1 performs a fast, atomic record of the closure, while Phase 2 handles the bulk deletion of historical data in small, incremental batches.
  • Restart Safety: Introduced a persistent cleanup task tracking system that ensures any interrupted Phase 2 purges are automatically resumed upon node restart.
  • Zombie-Bucket Guards: Implemented sentinel keys and a new error type to allow the system to safely identify and skip channels that are in the middle of a Phase 2 cleanup, preventing errors during channel scans.



@ziggie1984 ziggie1984 self-assigned this Apr 15, 2026
@github-actions github-actions bot added the severity-critical Requires expert review - security/consensus critical label Apr 15, 2026
Documents the two-phase channel close change (PR lightningnetwork#10732) in the 0.21.0
release notes under the Database section.
@github-actions

🔴 PR Severity: CRITICAL

First classification | 5 files | 707 lines changed (excluding tests)

🔴 Critical (5 files)
  • channeldb/channel.go - Channel state persistence; core channel struct modifications
  • channeldb/channel_closer.go - New file in channeldb; channel closing logic and state
  • channeldb/db.go - Channel database operations; largest change at 423 additions
  • contractcourt/chain_arbitrator.go - On-chain dispute resolution and chain arbitration
  • server.go - Core server coordination
Excluded (tests) (2 files)
  • channeldb/channel_test.go - Test file (excluded from severity calculation)
  • channeldb/close_channel_test.go - Test file (excluded from severity calculation)

Analysis

This PR touches multiple distinct critical packages simultaneously:

  1. channeldb/ - Three files modified/added, including db.go (+423 lines) and a new channel_closer.go. Changes here affect channel state persistence.

  2. contractcourt/chain_arbitrator.go - The chain arbitrator coordinates on-chain responses to channel breaches and force-closes.

  3. server.go - Core server coordination with 38 additions.

The PR exceeds 500 non-test lines changed (707 lines), which independently qualifies for a severity bump, though baseline is already CRITICAL.


To override, add a severity-override-{critical,high,medium,low} label.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a two-phase channel closure process to prevent long-held database locks on SQLite and Postgres backends. Phase 1 (CloseChannel) handles atomic state updates and archiving, while Phase 2 (PurgeClosedChannelData) incrementally deletes bulk historical data, such as revocation logs, in small batches. The changes also include logic to resume interrupted purges on startup and updates to channel scanning to skip channels pending cleanup. Feedback identifies opportunities to improve the conciseness of the ErrChannelPendingCleanup documentation and to consolidate redundant comments regarding the cleanup sentinel key.

Comment thread channeldb/channel.go
Comment on lines +187 to +195
	// ErrChannelPendingCleanup is returned when a channel bucket exists
	// in openChannelBucket but chanInfoKey is absent and the
	// pendingCleanupKey sentinel is present. This means CloseChannel
	// (Phase 1) has completed — the channel is logically closed — but
	// PurgeClosedChannelData (Phase 2) has not yet deleted the bulk
	// historical data from this bucket. Callers that scan all open
	// channels should skip such buckets; callers doing targeted lookups
	// should treat the channel as not found.
	ErrChannelPendingCleanup = fmt.Errorf("channel pending cleanup")


medium

The comment for ErrChannelPendingCleanup is quite long and could be more concise to improve readability while still explaining the purpose.

Comment thread channeldb/db.go
Comment on lines +464 to +478
var (
	// pendingChanCleanupBucket tracks channels whose bulk historical data
	// (revocation log, forwarding packages) has not yet been deleted after
	// close. Each key is a serialized wire.OutPoint; the value is empty.
	// The presence of a key means Phase 2 (PurgeClosedChannelData) is still
	// outstanding for that channel.
	pendingChanCleanupBucket = []byte("pending-chan-cleanup")

	// pendingCleanupKey is written into a channel's own bucket by Phase 1
	// (CloseChannel) to mark that bulk data deletion is still pending. It
	// allows fetchNodeChannels to distinguish intentional Phase 1 state
	// from genuine data corruption when chanInfoKey is absent. The key is
	// removed automatically when Phase 2 deletes the channel bucket.
	pendingCleanupKey = []byte("phase2-pending")
)


medium

The pendingCleanupKey comment is slightly redundant with the ErrChannelPendingCleanup comment. Consider consolidating the explanation of the sentinel key's purpose.

@ziggie1984 ziggie1984 added this to the v0.21.0 milestone Apr 15, 2026
@ziggie1984 ziggie1984 added this to v0.21 Apr 15, 2026
@ziggie1984 ziggie1984 added the performance and code health (Related to code commenting, refactoring, and other non-behaviour improvements) labels Apr 15, 2026
@warioishere

warioishere commented Apr 15, 2026

Massively affected by this. Closing a channel with more than 2M states locks my node up for about 30-50 minutes, depending on the number of states. Also, lnd doesn't register the correct channel closure on tx confirmation. During closure a lot of channels go offline, and I see force closures due to revocation errors lnd is throwing because of db locks.

Running PG and lnd 0.20.1 here

Possibly
#10320
is also related to this and should be solved by this PR.

@AuthenticityBTC

Authenticity is also affected by this. Closing a channel with more than 500k states locks up my node for 20-30 minutes. I am running oversized RAM/CPU on both my LND (20.1) and PG server (on separate nodes). I currently have 180 public and 15 private channels, with total states on my node at 40 million. I occasionally cycle channels to reduce the total number of states, but with this issue there is downtime with every channel close.

The logs constantly show timeouts/db retries in lnd.log ("closed: db tx retries exceeded" and other similar errors).


Labels

code health (Related to code commenting, refactoring, and other non-behaviour improvements), kvdb, performance, severity-critical (Requires expert review - security/consensus critical), sql

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants