fix(spill): stop at max level instead of erroring, fix memory leaks #24848

Merged
mergify[bot] merged 2 commits into matrixorigin:main from aunjgr:fix/spill-max-level-and-leaks
Jun 4, 2026
Contributor

@aunjgr aunjgr commented Jun 4, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

Fixes #23353
Fixes #24836

What this PR does / why we need it:

Four fixes in the spill code across group, hashbuild, and hashjoin:

1. Spill max level: stop instead of erroring. Previously, when group-by spill hit the 3-level limit (32³ = 32768 buckets), the query failed with "spill level too deep". Now the operator degrades gracefully: the data stays in memory at level 3 and the query continues.

2. spillAggList leak on error paths. loadSpilledData allocated spillAggList inside the loop body and freed it only on the success path. Every error return (10+ paths) leaked the list. Fixed by adding freeSpillAggList() to the deferred cleanup at function entry.

3. Group spill cached buffers never freed. spillBuf (1MB+ bytes.Buffer), spillReader (bufio.Reader), spillHashCodes, spillChunkFlags, spillFlagFlat, spillNonEmptyBuckets, and spillBucketRowIds were all cached on the container but never set to nil in free(), so they persisted across operator reuse via reuse.Alloc.

4. Hashbuild/hashjoin spill executors and buffers never freed. freeSpillExprExecs() / freeSpillBuildExprExecs() were called during re-init but not from Free(). The same applied to spillKeyVecs, spillHashValues, spillBucketRowIds, and spillNonEmptyBuckets.
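
As a rough illustration of fix 1, the sketch below shows the bucket arithmetic and the "stop instead of erroring" decision. The names spillNumBuckets, bucketsAtLevel, and shouldSpill are hypothetical and are not taken from the MatrixOne source; only the fan-out of 32, the limit of 3 levels, and the nil return at the limit come from the description above.

```go
package main

import "fmt"

// Hypothetical constants mirroring the numbers quoted in the PR description.
const (
	spillNumBuckets = 32 // assumed fan-out per spill level
	spillMaxPass    = 3  // deepest level reached by repartitioning
)

// bucketsAtLevel returns spillNumBuckets raised to the given level.
func bucketsAtLevel(level int) int {
	n := 1
	for i := 0; i < level; i++ {
		n *= spillNumBuckets
	}
	return n
}

// shouldSpill reports whether data at the given level may be repartitioned
// to disk once more. At the maximum level it returns false with a nil error,
// so the caller keeps the remaining data in memory instead of failing.
func shouldSpill(level int) (bool, error) {
	if level < 0 {
		return false, fmt.Errorf("invalid spill level %d", level)
	}
	if level >= spillMaxPass {
		return false, nil // previously this case was an error
	}
	return true, nil
}

func main() {
	fmt.Println(bucketsAtLevel(spillMaxPass)) // 32768
	for level := 0; level <= spillMaxPass; level++ {
		ok, err := shouldSpill(level)
		fmt.Println(level, ok, err)
	}
}
```

The trade-off in this sketch is the same one the description names: at the deepest level the operator may use more memory than the spill threshold, but the query completes.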

Checklist:

  • I have added unit tests for the changes
  • I have run make static-check to verify code quality
  • I have updated the documentation accordingly
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

- group spill: return nil at spillMaxPass instead of erroring, keeping
  remaining data in memory rather than failing the query.
- group spill: fix spillAggList leak on error paths in loadSpilledData
  by freeing in deferred cleanup.
- group/hashbuild/hashjoin: nil cached spill buffers (bufio.Reader,
  bytes.Buffer, hashCodes, chunkFlags, flagFlat, nonEmptyBuckets,
  bucketRowIds, expression executors) in Free() to allow GC on
  operator reuse.
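
The deferred-cleanup pattern behind the spillAggList fix can be sketched as follows. This is a simplified model under assumed types: container, aggList, and the allocs/frees counters are hypothetical stand-ins, and only the names loadSpilledData and freeSpillAggList come from the PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// aggList is a hypothetical stand-in for the per-bucket aggregator list.
type aggList struct{}

// container is a hypothetical stand-in for the group operator state.
type container struct {
	spillAggList *aggList
	allocs       int // number of lists allocated
	frees        int // number of lists released
}

// freeSpillAggList releases the current list; it is safe to call repeatedly.
func (c *container) freeSpillAggList() {
	if c.spillAggList != nil {
		c.spillAggList = nil
		c.frees++
	}
}

// loadSpilledData allocates a list per chunk. The deferred call at function
// entry covers every return path, including early error returns.
func (c *container) loadSpilledData(chunks []string) error {
	defer c.freeSpillAggList()
	for _, chunk := range chunks {
		c.freeSpillAggList() // drop the list from the previous iteration
		c.spillAggList = &aggList{}
		c.allocs++
		if chunk == "" {
			return errors.New("corrupt chunk") // list is freed by the defer
		}
	}
	return nil
}

func main() {
	c := &container{}
	err := c.loadSpilledData([]string{"a", "", "b"})
	fmt.Println(err, c.allocs, c.frees) // corrupt chunk 2 2
}
```

Registering the cleanup once at entry avoids having to remember a free call on each of the individual error returns.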

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@aunjgr aunjgr requested a review from ouyuanning as a code owner June 4, 2026 09:45
Contributor

@XuPeng-SH XuPeng-SH left a comment

Blocking on coverage for the behavior changes in this PR.

The code changes themselves look plausible, but this patch changes spill semantics and multiple cleanup paths without adding targeted regression tests in the touched areas:

  • pkg/sql/colexec/group/helper.go: spillDataToDisk no longer errors at spillMaxPass; it silently keeps data in memory instead. That is a real behavior change in the group spill path, but I don't see a group-side regression test that drives the max-depth case and proves the operator still completes correctly.
  • pkg/sql/colexec/group/helper.go / types2.go: the spillAggList error-path cleanup and cached spill buffer cleanup are easy to regress, but there is no new test that exercises those paths.
  • pkg/sql/colexec/hashbuild/types.go and pkg/sql/colexec/hashjoin/types.go: Free() now additionally tears down cached spill executors/buffers, but again there is no matching regression coverage for operator reuse / cleanup behavior.

Given that this is spill + memory-lifetime code, I don't think we should merge the semantic change and cleanup changes without the tests the PR checklist claims were added. Once there is focused coverage for the group max-depth fallback and the new cleanup paths, I'm happy to re-review.
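
A focused check of the cleanup behavior requested here might look like the sketch below. It is not the project's test code: spillCache, warmUp, and isClean are hypothetical, the element types of the slices are assumed, and only the field names spillBuf, spillReader, spillHashCodes, and spillBucketRowIds come from the PR description.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// spillCache is a hypothetical container for cached spill state.
// Field names follow the PR description; the types are assumptions.
type spillCache struct {
	spillBuf          *bytes.Buffer
	spillReader       *bufio.Reader
	spillHashCodes    []uint64
	spillBucketRowIds [][]int32
}

// warmUp simulates one spill cycle that populates the cached buffers.
func (c *spillCache) warmUp() {
	c.spillBuf = bytes.NewBuffer(make([]byte, 0, 1<<20))
	c.spillReader = bufio.NewReader(c.spillBuf)
	c.spillHashCodes = make([]uint64, 8192)
	c.spillBucketRowIds = make([][]int32, 32)
}

// free drops every cached reference so the garbage collector can reclaim
// the memory even if the container itself is reused.
func (c *spillCache) free() {
	c.spillBuf = nil
	c.spillReader = nil
	c.spillHashCodes = nil
	c.spillBucketRowIds = nil
}

// isClean reports whether no cached spill state remains.
func (c *spillCache) isClean() bool {
	return c.spillBuf == nil && c.spillReader == nil &&
		c.spillHashCodes == nil && c.spillBucketRowIds == nil
}

func main() {
	c := &spillCache{}
	c.warmUp()
	fmt.Println(c.isClean()) // false
	c.free()
	fmt.Println(c.isClean()) // true
}
```

A real regression test would drive the actual operators through a reuse cycle; this sketch only shows the shape of the assertion.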

Contributor

@XuPeng-SH XuPeng-SH left a comment

Updating my review based on the latest discussion.

On a strict pass, the gap I was blocking on was targeted regression coverage for the spill max-depth fallback and the new cleanup paths. I did not identify another concrete correctness/concurrency/performance bug in the latest code itself. Given the accepted risk on coverage, this looks okay to merge.

@mergify mergify Bot added the queued label Jun 4, 2026
Contributor

mergify Bot commented Jun 4, 2026

Merge Queue Status

  • Entered queue: 2026-06-04 13:45 UTC · Rule: main
  • Checks passed · in-place
  • Merged: 2026-06-04 14:49 UTC · at 2b14b882a64f92b8b0402a5142623b818a65ae6d · squash

This pull request spent 1 hour 3 minutes 50 seconds in the queue, including 1 hour 3 minutes 27 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • github-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage

@mergify mergify Bot merged commit 0e5cddb into matrixorigin:main Jun 4, 2026
23 of 24 checks passed
@mergify mergify Bot removed the queued label Jun 4, 2026
mergify Bot pushed a commit that referenced this pull request Jun 5, 2026
…24850)

Cherry-pick of #24848 to 4.0-dev. Four fixes in the spill code:

**1. Spill max level: stop instead of erroring.** When group-by spill hits the 3-level limit, it now returns nil instead of erroring, so the data stays in memory.

**2. `spillAggList` leak on error paths.** Added `freeSpillAggList()` to deferred cleanup in `loadSpilledData`.

**3. Group spill cached buffers never freed.** Nil `spillBuf`, `spillReader`, `spillHashCodes`, `spillChunkFlags`, `spillFlagFlat`, `spillNonEmptyBuckets`, `spillBucketRowIds` in `free()`.

**4. Hashbuild/hashjoin spill executors and buffers never freed.** Added `freeSpillExprExecs`/`freeSpillBuildExprExecs` and buffer nil to `Free()`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Approved by: @ouyuanning, @XuPeng-SH
Labels

  • kind/bug: Something isn't working
  • size/S: Denotes a PR that changes [10,99] lines

Development

Successfully merging this pull request may close these issues.

  • [Bug] TPCH 1T q18 fails with spill level too deep on main
  • [Feature Request]: Support Spill for Sort and Hash Join

4 participants