fix(spill): stop at max level instead of erroring, fix memory leaks#24848
Conversation
- group spill: return nil at spillMaxPass instead of erroring, keeping remaining data in memory rather than failing the query. - group spill: fix spillAggList leak on error paths in loadSpilledData by freeing in deferred cleanup. - group/hashbuild/hashjoin: nil cached spill buffers (bufio.Reader, bytes.Buffer, hashCodes, chunkFlags, flagFlat, nonEmptyBuckets, bucketRowIds, expression executors) in Free() to allow GC on operator reuse. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
XuPeng-SH
left a comment
There was a problem hiding this comment.
Blocking on coverage for the behavior changes in this PR.
The code changes themselves look plausible, but this patch changes spill semantics and multiple cleanup paths without adding targeted regression tests in the touched areas:
pkg/sql/colexec/group/helper.go:spillDataToDiskno longer errors atspillMaxPass; it silently keeps data in memory instead. That is a real behavior change in the group spill path, but I don't see a group-side regression test that drives the max-depth case and proves the operator still completes correctly.pkg/sql/colexec/group/helper.go/types2.go: thespillAggListerror-path cleanup and cached spill buffer cleanup are easy to regress, but there is no new test that exercises those paths.pkg/sql/colexec/hashbuild/types.goandpkg/sql/colexec/hashjoin/types.go:Free()now additionally tears down cached spill executors/buffers, but again there is no matching regression coverage for operator reuse / cleanup behavior.
Given that this is spill + memory-lifetime code, I don't think we should merge the semantic change and cleanup changes without the tests the PR checklist claims were added. Once there is focused coverage for the group max-depth fallback and the new cleanup paths, I'm happy to re-review.
XuPeng-SH
left a comment
There was a problem hiding this comment.
Updating my review based on the latest discussion.
On a strict pass, the gap I was blocking on was targeted regression coverage for the spill max-depth fallback and the new cleanup paths. I did not identify another concrete correctness/concurrency/performance bug in the latest code itself. Given the accepted risk on coverage, this looks okay to merge.
Merge Queue Status
This pull request spent 1 hour 3 minutes 50 seconds in the queue, including 1 hour 3 minutes 27 seconds running CI. Required conditions to merge
|
…24850) Cherry-pick of #24848 to 4.0-dev. Four fixes in the spill code: **1. Spill max level: stop instead of erroring.** When group-by spill hits the 3-level limit, now returns nil instead of erroring — data stays in memory. **2. `spillAggList` leak on error paths.** Added `freeSpillAggList()` to deferred cleanup in `loadSpilledData`. **3. Group spill cached buffers never freed.** Nil `spillBuf`, `spillReader`, `spillHashCodes`, `spillChunkFlags`, `spillFlagFlat`, `spillNonEmptyBuckets`, `spillBucketRowIds` in `free()`. **4. Hashbuild/hashjoin spill executors and buffers never freed.** Added `freeSpillExprExecs`/`freeSpillBuildExprExecs` and buffer nil to `Free()`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Approved by: @ouyuanning, @XuPeng-SH
What type of PR is this?
Which issue(s) this PR fixes:
Fixes #23353
Fixes #24836
What this PR does / why we need it:
Four fixes in the spill code across group, hashbuild, and hashjoin:
1. Spill max level: stop instead of erroring. When group-by spill hits the 3-level limit (32³ = 32768 buckets), previously the query failed with "spill level too deep". Now it gracefully degrades — data stays in memory at level 3 instead of killing the query.
2.
spillAggListleak on error paths.loadSpilledDataallocatedspillAggListinside the loop body and freed it only on the success path. Every error return (10+ paths) leaked the list. Fixed by addingfreeSpillAggList()to the deferred cleanup at function entry.3. Group spill cached buffers never freed.
spillBuf(1MB+ bytes.Buffer),spillReader(bufio.Reader),spillHashCodes,spillChunkFlags,spillFlagFlat,spillNonEmptyBuckets,spillBucketRowIds— all cached on the container but never nilled infree(). Persisted across operator reuse viareuse.Alloc.4. Hashbuild/hashjoin spill executors and buffers never freed.
freeSpillExprExecs()/freeSpillBuildExprExecs()were called during re-init but not fromFree(). Same forspillKeyVecs,spillHashValues,spillBucketRowIds,spillNonEmptyBuckets.Checklist:
make static-checkto verify code qualityCo-Authored-By: Claude Opus 4.7 noreply@anthropic.com