sqldb: avoid materializing non-terminal payments #10851

Open
ziggie1984 wants to merge 1 commit into lightningnetwork:master from ziggie1984:fix-fetch-nonterminal-pagination

@ziggie1984
Collaborator

Motivation

FetchInFlightPayments uses FetchNonTerminalPayments during SQL payment recovery, including router startup payment resume and TrackPayments subscription setup.

The current query first builds a non_terminal_ids CTE using UNION, then applies pagination outside the CTE:

FROM non_terminal_ids n
JOIN payments p ON p.id = n.id
WHERE p.id > $1
ORDER BY p.id ASC
LIMIT $2

On SQLite this can force materialization of the full candidate set before the outer WHERE/ORDER BY/LIMIT can be applied. On a node with a large payment history, EXPLAIN QUERY PLAN showed:

MATERIALIZE non_terminal_ids
UNION USING TEMP B-TREE
SCAN ha

That means the query can effectively scan historical payment/attempt/resolution tables before returning a single page. In one observed DB, payments WHERE fail_reason IS NULL alone contained ~250k rows, causing startup recovery to spend a long time inside FetchNonTerminalPayments.
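To make the original shape concrete, here is a minimal sketch that runs a CTE-plus-outer-pagination query against an in-memory SQLite database. The schema is a heavily simplified assumption (the column `attempt_id`, the resolution code `1 = settled`, and the helper names `build_demo_db`, `fetch_page_cte` and `query_plan` are all hypothetical), and the query is a reconstruction from the description above, not the actual lnd statement. The plan lines are printed rather than asserted because their wording depends on the SQLite version and table statistics.

```python
import sqlite3

# Hypothetical, simplified schema; the real lnd tables carry more columns.
SCHEMA = """
CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    fail_reason INTEGER               -- NULL: not permanently failed
);
CREATE TABLE payment_htlc_attempts (
    id INTEGER PRIMARY KEY,
    payment_id INTEGER NOT NULL REFERENCES payments(id),
    attempt_time INTEGER NOT NULL
);
CREATE INDEX idx_htlc_payment_id_attempt_time
    ON payment_htlc_attempts(payment_id, attempt_time);
CREATE TABLE payment_htlc_attempt_resolutions (
    attempt_id INTEGER PRIMARY KEY REFERENCES payment_htlc_attempts(id),
    resolution_type INTEGER NOT NULL  -- assumed: 1 = settled, 2 = failed
);
"""

# Reconstructed "CTE first, paginate outside" shape.
CTE_QUERY = """
WITH non_terminal_ids AS (
    -- Branch 1: not permanently failed and no settled attempt.
    SELECT p.id FROM payments p
    WHERE p.fail_reason IS NULL
      AND NOT EXISTS (
          SELECT 1 FROM payment_htlc_attempts ha
          JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
          WHERE ha.payment_id = p.id AND hr.resolution_type = 1
      )
    UNION
    -- Branch 2: at least one attempt without a resolution row.
    SELECT ha.payment_id FROM payment_htlc_attempts ha
    LEFT JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
    WHERE hr.attempt_id IS NULL
)
SELECT p.id
FROM non_terminal_ids n
JOIN payments p ON p.id = n.id
WHERE p.id > ?
ORDER BY p.id ASC
LIMIT ?
"""


def build_demo_db():
    """Create six payments; ids 2, 3, 5 and 6 are non-terminal."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO payments (id, fail_reason) VALUES (?, ?)",
        [(1, None), (2, None), (3, 1), (4, 1), (5, None), (6, None)],
    )
    conn.executemany(
        "INSERT INTO payment_htlc_attempts (id, payment_id, attempt_time) "
        "VALUES (?, ?, ?)",
        [(10, 1, 100), (30, 3, 100), (40, 4, 100),
         (50, 5, 100), (60, 6, 100), (61, 6, 101)],
    )
    # Attempts 30 and 61 have no resolution row, so they are unresolved.
    conn.executemany(
        "INSERT INTO payment_htlc_attempt_resolutions "
        "(attempt_id, resolution_type) VALUES (?, ?)",
        [(10, 1), (40, 2), (50, 2), (60, 1)],
    )
    return conn


def fetch_page_cte(conn, cursor, limit):
    """Return one page of non-terminal payment ids after `cursor`."""
    return [row[0] for row in conn.execute(CTE_QUERY, (cursor, limit))]


def query_plan(conn, sql, params):
    """Return the detail column of EXPLAIN QUERY PLAN for `sql`."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


conn = build_demo_db()
print(fetch_page_cte(conn, 0, 2))
for line in query_plan(conn, CTE_QUERY, (0, 2)):
    print(line)
```

On a toy database the planner may pick a different strategy than on the affected node, so the printed plan is only a way to inspect the shape locally.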

What Changed

This rewrites FetchNonTerminalPayments to be payment-driven:

FROM payments p
WHERE p.id > $1
...
ORDER BY p.id ASC
LIMIT $2

The non-terminal checks are kept as correlated EXISTS / NOT EXISTS predicates on each candidate payment.

This lets SQLite use the primary-key payment cursor first and stop once the requested page is found, instead of materializing all non-terminal IDs up front.
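A minimal sketch of the payment-driven shape, together with the keyset-pagination loop a caller might run on top of it, could look like the following. The schema, the resolution code `1 = settled`, and the helpers `fetch_page` and `iter_non_terminal` are assumptions for illustration; the real query also joins further tables (for example `payment_intents`), which are omitted here.

```python
import sqlite3

# Hypothetical, simplified schema; the real lnd tables carry more columns.
SCHEMA = """
CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    fail_reason INTEGER               -- NULL: not permanently failed
);
CREATE TABLE payment_htlc_attempts (
    id INTEGER PRIMARY KEY,
    payment_id INTEGER NOT NULL REFERENCES payments(id),
    attempt_time INTEGER NOT NULL
);
CREATE INDEX idx_htlc_payment_id_attempt_time
    ON payment_htlc_attempts(payment_id, attempt_time);
CREATE TABLE payment_htlc_attempt_resolutions (
    attempt_id INTEGER PRIMARY KEY REFERENCES payment_htlc_attempts(id),
    resolution_type INTEGER NOT NULL  -- assumed: 1 = settled, 2 = failed
);
"""

# The cursor predicate comes first; the non-terminal checks are
# correlated subqueries evaluated per candidate payment.
PAYMENT_DRIVEN_QUERY = """
SELECT p.id
FROM payments p
WHERE p.id > ?
  AND (
      (p.fail_reason IS NULL AND NOT EXISTS (
          SELECT 1 FROM payment_htlc_attempts ha
          JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
          WHERE ha.payment_id = p.id AND hr.resolution_type = 1
      ))
      OR EXISTS (
          SELECT 1 FROM payment_htlc_attempts ha
          LEFT JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
          WHERE ha.payment_id = p.id AND hr.attempt_id IS NULL
      )
  )
ORDER BY p.id ASC
LIMIT ?
"""


def fetch_page(conn, cursor, limit):
    """Return one page of non-terminal payment ids after `cursor`."""
    return [row[0] for row in conn.execute(PAYMENT_DRIVEN_QUERY, (cursor, limit))]


def iter_non_terminal(conn, page_size):
    """Walk all non-terminal payments page by page using the id cursor."""
    cursor = 0
    while True:
        page = fetch_page(conn, cursor, page_size)
        if not page:
            return
        yield from page
        cursor = page[-1]  # next page starts after the last id seen


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO payments (id, fail_reason) VALUES (?, ?)",
    [(1, None), (2, None), (3, 1), (4, 1), (5, None), (6, None)],
)
conn.executemany(
    "INSERT INTO payment_htlc_attempts (id, payment_id, attempt_time) "
    "VALUES (?, ?, ?)",
    [(10, 1, 100), (30, 3, 100), (40, 4, 100),
     (50, 5, 100), (60, 6, 100), (61, 6, 101)],
)
# Attempts 30 and 61 have no resolution row, so they are unresolved.
conn.executemany(
    "INSERT INTO payment_htlc_attempt_resolutions "
    "(attempt_id, resolution_type) VALUES (?, ?)",
    [(10, 1), (40, 2), (50, 2), (60, 1)],
)

print(list(iter_non_terminal(conn, page_size=2)))
```

Because each page only needs rows with `p.id` greater than the cursor, the work per call is bounded by how far the scan has to walk to fill the page rather than by the size of the full history.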

The observed query plan for the rewritten shape is:

SEARCH p USING INTEGER PRIMARY KEY (rowid>?)
CORRELATED SCALAR SUBQUERY
  SEARCH ha USING INDEX idx_htlc_payment_id_attempt_time (payment_id=?)
  SEARCH hr USING INTEGER PRIMARY KEY (rowid=?)
SEARCH pi USING INDEX sqlite_autoindex_payment_intents_1 (payment_id=?)

On the affected DB, the rewritten query returned the first page in ~0.4s.

Semantics

The returned payment set is unchanged:

  • payments that have not permanently failed and have no settled attempt
  • payments with at least one unresolved HTLC attempt

The change only alters the selector shape so pagination can be applied before scanning historical attempts/resolutions.
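One way to gain confidence in the "unchanged set" claim is a randomized comparison of the two shapes. The sketch below is not part of this PR's test suite; it uses a simplified hypothetical schema, reconstructed versions of both queries, and invented helper names (`build_random_db`, `ids`, `shapes_agree`), and checks that every cursor position yields the same page from both shapes.

```python
import random
import sqlite3

# Hypothetical, simplified schema; the real lnd tables carry more columns.
SCHEMA = """
CREATE TABLE payments (id INTEGER PRIMARY KEY, fail_reason INTEGER);
CREATE TABLE payment_htlc_attempts (
    id INTEGER PRIMARY KEY,
    payment_id INTEGER NOT NULL REFERENCES payments(id),
    attempt_time INTEGER NOT NULL
);
CREATE INDEX idx_htlc_payment_id_attempt_time
    ON payment_htlc_attempts(payment_id, attempt_time);
CREATE TABLE payment_htlc_attempt_resolutions (
    attempt_id INTEGER PRIMARY KEY REFERENCES payment_htlc_attempts(id),
    resolution_type INTEGER NOT NULL  -- assumed: 1 = settled, 2 = failed
);
"""

NO_SETTLED_ATTEMPT = """
    NOT EXISTS (
        SELECT 1 FROM payment_htlc_attempts ha
        JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
        WHERE ha.payment_id = p.id AND hr.resolution_type = 1
    )
"""

CTE_QUERY = f"""
WITH non_terminal_ids AS (
    SELECT p.id FROM payments p
    WHERE p.fail_reason IS NULL AND {NO_SETTLED_ATTEMPT}
    UNION
    SELECT ha.payment_id FROM payment_htlc_attempts ha
    LEFT JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
    WHERE hr.attempt_id IS NULL
)
SELECT p.id FROM non_terminal_ids n
JOIN payments p ON p.id = n.id
WHERE p.id > ? ORDER BY p.id ASC LIMIT ?
"""

PAYMENT_DRIVEN_QUERY = f"""
SELECT p.id FROM payments p
WHERE p.id > ?
  AND (
      (p.fail_reason IS NULL AND {NO_SETTLED_ATTEMPT})
      OR EXISTS (
          SELECT 1 FROM payment_htlc_attempts ha
          LEFT JOIN payment_htlc_attempt_resolutions hr ON hr.attempt_id = ha.id
          WHERE ha.payment_id = p.id AND hr.attempt_id IS NULL
      )
  )
ORDER BY p.id ASC LIMIT ?
"""


def build_random_db(seed, n_payments):
    """Populate payments with 0-3 attempts each and random outcomes."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    attempt_id = 0
    for pid in range(1, n_payments + 1):
        conn.execute(
            "INSERT INTO payments (id, fail_reason) VALUES (?, ?)",
            (pid, rng.choice([None, None, 1])),
        )
        for _ in range(rng.randint(0, 3)):
            attempt_id += 1
            conn.execute(
                "INSERT INTO payment_htlc_attempts "
                "(id, payment_id, attempt_time) VALUES (?, ?, ?)",
                (attempt_id, pid, attempt_id),
            )
            outcome = rng.choice([None, 1, 2])  # None: still unresolved
            if outcome is not None:
                conn.execute(
                    "INSERT INTO payment_htlc_attempt_resolutions "
                    "(attempt_id, resolution_type) VALUES (?, ?)",
                    (attempt_id, outcome),
                )
    return conn


def ids(conn, sql, cursor, limit):
    return [row[0] for row in conn.execute(sql, (cursor, limit))]


def shapes_agree(conn, n_payments, page_size):
    """True if both shapes return identical pages for every cursor."""
    return all(
        ids(conn, CTE_QUERY, cursor, page_size)
        == ids(conn, PAYMENT_DRIVEN_QUERY, cursor, page_size)
        for cursor in range(0, n_payments + 1)
    )


conn = build_random_db(seed=7, n_payments=200)
print(shapes_agree(conn, n_payments=200, page_size=10))
```

The two shapes should agree because a `UNION` of two ID sets joined back to `payments` selects the same rows as an `OR` of the two membership predicates on `payments`; the check above only exercises that on sample data and is no substitute for testing against the real schema.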

Testing

  • Ran make sqlc
  • Manually compared SQLite query plans on an affected payment DB
  • Verified the rewritten query returns quickly on the affected DB

@ziggie1984 ziggie1984 self-assigned this May 27, 2026
@ziggie1984 ziggie1984 added this to v0.21 May 27, 2026
@ziggie1984 ziggie1984 moved this to In review in v0.21 May 27, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the FetchNonTerminalPayments SQL query within the sqldb package. By transforming the query from a CTE-based approach to a payment-driven structure with correlated subqueries, it resolves a performance bottleneck where SQLite would materialize a large candidate set before pagination. This change drastically improves the efficiency of payment recovery and tracking, especially on databases with extensive payment histories, without altering the logical definition of a non-terminal payment.

Highlights

  • SQL Query Optimization: The FetchNonTerminalPayments SQL query was refactored to improve performance, particularly on SQLite, by avoiding the premature materialization of non-terminal payment IDs.
  • Enhanced Database Performance: The revised query now uses the payments primary key for the cursor and applies pagination earlier. On the affected database the first page returned in ~0.4 seconds, where the previous query had to scan hundreds of thousands of historical rows before returning anything.
  • Semantic Preservation: The update ensures that the criteria for identifying non-terminal payments remain unchanged, guaranteeing that the set of payments returned by the function is semantically identical to the previous implementation.

@ziggie1984 ziggie1984 added this to the v0.21.0 milestone May 27, 2026
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the FetchNonTerminalPayments SQL query in both the source SQL file and the generated Go code. The common table expression (CTE) non_terminal_ids and its UNION operation have been removed. Instead, the non-terminal payment filtering logic has been inline-expanded directly into the main query's WHERE clause using OR and EXISTS/NOT EXISTS subqueries. I have no feedback to provide as there are no review comments.

@ziggie1984
Collaborator Author

Postgres validation

Wanted to add some context on why the SQLite plan was problematic and confirm the rewrite is safe on Postgres, since the PR description only covers SQLite.

What was happening on SQLite

The original query expressed "non-terminal payments" as a CTE that unioned two ID sets (payments that have not permanently failed and have no settled attempt, and payments with any unresolved attempt) and then joined that CTE back to payments with the cursor + LIMIT applied outside.

SQLite's planner could not push the cursor into the CTE. It produced:

MATERIALIZE non_terminal_ids
UNION USING TEMP B-TREE
SCAN ha

In other words, every call to FetchNonTerminalPayments rebuilt the full non-terminal set by scanning the entire payment_htlc_attempts table (and joining to payment_htlc_attempt_resolutions) and writing the result into a temp B-tree for dedup, before the outer WHERE p.id > $1 … LIMIT $2 could trim it. On the affected node (~250k rows in payments WHERE fail_reason IS NULL alone, plus the associated attempts) this dominated startup recovery.

The rewrite turns the selector payment-driven: scan payments in PK order from the cursor, evaluate the non-terminal predicate as correlated EXISTS clauses per row, stop when LIMIT is reached. SQLite then plans it as:

SEARCH p USING INTEGER PRIMARY KEY (rowid>?)
CORRELATED SCALAR SUBQUERY
  SEARCH ha USING INDEX idx_htlc_payment_id_attempt_time (payment_id=?)
  SEARCH hr USING INTEGER PRIMARY KEY (rowid=?)
SEARCH pi USING INDEX sqlite_autoindex_payment_intents_1 (payment_id=?)

First page on the affected DB drops from multi-second to ~0.4 s.

Postgres — does this regress?

Ran EXPLAIN (ANALYZE, BUFFERS, VERBOSE) on a local Postgres DB against three shapes for comparison. Caveat: the local DB is small (handful of payments, 0 non-terminal), so this is a plan-shape check, not a perf benchmark.

| Query shape | Buffers | Time | Cursor on payments_pkey? |
| --- | --- | --- | --- |
| Master (WITH non_terminal_ids …) | hit=9 | 0.52 ms | Yes (Bitmap Heap Scan … Recheck Cond: (p.id > 0)) |
| This PR (OR EXISTS) | hit=10 | 0.28 ms | Yes (Index Scan using payments_pkey, Index Cond: (p.id > 0)) |
| Alternative UNION rewrite | hit=24 | 0.95 ms | Yes (both branches) |

Master plan excerpt — note the CTE is inlined into a Hash Join, not materialized as a separate node:

Limit (actual time=0.512..0.524 rows=0 loops=1) Buffers: shared hit=9
  Sort (Sort Key: p.id)
    Nested Loop Left Join
      Hash Join (Hash Cond: (p.id = p_1.id))
        Bitmap Heap Scan on payments p (Recheck Cond: (p.id > 0))

This PR's plan excerpt — OR EXISTS becomes a hashed SubPlan, cursor reaches the PK index:

Limit (actual time=0.280..0.283 rows=0 loops=1) Buffers: shared hit=10
  Merge Left Join (Merge Cond: (p.id = pi.payment_id))
    Index Scan using payments_pkey on payments p
      Index Cond: (p.id > 0)
      Filter: (((p.fail_reason IS NULL) AND (NOT (hashed SubPlan 2))) OR (hashed SubPlan 4))
      Rows Removed by Filter: 5

Why Postgres doesn't have the SQLite problem

Postgres 12+ inlines non-recursive CTEs by default, so on master the non_terminal_ids CTE is merged into the main plan rather than materialized as a separate CTE Scan. The pathological MATERIALIZE … UNION USING TEMP B-TREE shape from SQLite simply doesn't occur. Master and this PR end up doing structurally equivalent work on Postgres.

For this PR's shape, Postgres chooses hashed SubPlan for the two EXISTS predicates — it evaluates each subplan once, hashes the matching payment_ids, and probes per outer row. The outer cursor still reaches payments_pkey, so pagination is preserved. On large attempt/resolution tables this is one full scan per call to build the hashes — same total work as master's inlined CTE, just expressed differently.
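As a rough mental model of the hashed SubPlan behaviour described above, the following sketch builds each hash once per call and then probes it per outer row. It is plain Python over in-memory lists, not Postgres's executor; the data layout, the constant `SETTLED = 1`, and the function `non_terminal_page_hashed` are assumptions made for illustration.

```python
import bisect

SETTLED = 1  # assumed resolution code for a settled attempt


def non_terminal_page_hashed(payments, attempts, resolutions, cursor, limit):
    """Model of "build hashes once, probe per outer row".

    payments:    list of (payment_id, fail_reason), sorted by payment_id
    attempts:    list of (attempt_id, payment_id)
    resolutions: dict mapping attempt_id -> resolution_type
    """
    # One full pass over attempts/resolutions builds both hash sets.
    settled = set()     # payments with at least one settled attempt
    unresolved = set()  # payments with at least one unresolved attempt
    for attempt_id, payment_id in attempts:
        if attempt_id not in resolutions:
            unresolved.add(payment_id)
        elif resolutions[attempt_id] == SETTLED:
            settled.add(payment_id)

    # Emulate the primary-key index seek to the first id after the cursor.
    payment_ids = [payment_id for payment_id, _ in payments]
    start = bisect.bisect_right(payment_ids, cursor)

    page = []
    for payment_id, fail_reason in payments[start:]:
        not_failed_and_unsettled = fail_reason is None and payment_id not in settled
        if not_failed_and_unsettled or payment_id in unresolved:
            page.append(payment_id)
            if len(page) == limit:
                break  # LIMIT reached: stop scanning the outer table
    return page


payments = [(1, None), (2, None), (3, 1), (4, 1), (5, None), (6, None)]
attempts = [(10, 1), (30, 3), (40, 4), (50, 5), (60, 6), (61, 6)]
resolutions = {10: 1, 40: 2, 50: 2, 60: 1}

print(non_terminal_page_hashed(payments, attempts, resolutions, cursor=0, limit=2))
```

The model makes the trade-off visible: the outer scan stops early once the page is full, but building the two sets still costs one pass over all attempts on every call.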

Conclusion

  • SQLite: real win — see plan transformation above and the ~0.4 s first-page result on the affected node.
  • Postgres: performance-neutral vs. master. Cursor + payments_pkey index scan intact, plan shape structurally equivalent, buffer counts within noise on the test data.
  • No need to refactor further (e.g. to a UNION ALL form); on idle nodes the current shape actually does less work.

@ziggie1984 ziggie1984 requested a review from yyforyongyu May 27, 2026 20:47
@ziggie1984 ziggie1984 added the backport-v0.21.x-branch This label triggers a backport to branch `v0.21.x-branch ` label May 27, 2026

Labels

backport-v0.21.x-branch This label triggers a backport to branch `v0.21.x-branch ` bug fix no-changelog sql

Projects

Status: In review
