fix(plan): prevent OOM from non-selective secondary index range scans #24309
Conversation
Force-pushed from 1509aeb to add2a82
…s to prevent stampede OOM (matrixorigin#24307)

When 100 concurrent INSERT IGNORE scans miss the same cache entry simultaneously, every goroutine performs the same redundant I/O, loading identical bloom filters and object metadata into the Go heap 100 times over. This balloons heap_sys to 39 GiB (well beyond GOMEMLIMIT=24G) and triggers an OOM kill under the 55 GiB cgroup limit.

Add dedupLoad using a sync.Map + channel pattern (a singleflight equivalent) so that only one goroutine performs the actual ReadExtent/ReadBloomFilter for a given key while the others wait and share the result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LeftHandCold left a comment:
Found one blocker in the group/merge_group mpool cleanup change.
Group.Prepare and MergeGroup.Prepare now delete ctr.mp before freeing or recreating the container state that may still own allocations from that pool. For example, Group.Prepare deletes the old pool at pkg/sql/colexec/group/exec2.go:51, but prepareGroupAndAggArg can then reuse the existing ctr.aggList when the aggregate count matches and calls ag.Free() / ag.GroupGrow(1) on aggregators that were created with the old ctr.mp in container.makeAggList. MergeGroup.Prepare has the same pattern, and buildOneBatch can reuse ctr.aggList when lengths match.
This means allocations/frees can continue through a deleted/deregistered mpool, while ctr.memUsed() and spill decisions read the newly created pool. It can hide memory from the spill accounting and also risks freeing old groupByBatches / spill buffers with the new pool during later cleanup.
Please free/reset the full container state with the old pool before deleting/replacing ctr.mp, or force all pool-owning state (aggList, hash state, group batches, spill buffers/evaluators, etc.) to be released and recreated before the old pool is deleted.
…tor reuse (matrixorigin#24307)

When a Group/MergeGroup operator is reused (e.g. via StarCountMergeGroup cached on Scope), Prepare() creates a new mpool without freeing the old one, orphaning potentially gigabytes of allocated memory indefinitely.

Use ctr.free() instead of a bare mpool.DeleteMPool() to properly release all pool-owning state (aggList, groupByBatches, spill buffers, hash state, evaluators) with the old pool before deleting it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
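The ordering the fix enforces can be shown with a toy pool that counts live allocations. This is a minimal sketch, not MatrixOne's real mpool API: `pool`, `container`, and `batches` here are stand-ins for the actual mpool, operator container, and pool-owning state (aggList, spill buffers, etc.).

```go
package main

import "fmt"

// pool is a toy allocator that tracks how many allocations are live,
// so we can observe whether anything is orphaned when it is dropped.
type pool struct{ live int }

func (p *pool) Alloc(n int) []byte { p.live++; return make([]byte, n) }
func (p *pool) Free(b []byte)      { p.live-- }

// container owns a pool and state allocated from it.
type container struct {
	mp      *pool
	batches [][]byte // stand-in for aggList, groupByBatches, spill buffers, ...
}

// free returns every allocation to the pool that made it, then drops
// the pool. A Prepare() that swaps in a fresh pool must call this
// first: deleting c.mp while c.batches still holds its allocations
// orphans that memory, and freeing those batches later through the
// NEW pool would corrupt its accounting.
func (c *container) free() {
	for _, b := range c.batches {
		c.mp.Free(b)
	}
	c.batches = nil
	c.mp = nil
}

func main() {
	c := &container{mp: &pool{}}
	c.batches = append(c.batches, c.mp.Alloc(1024), c.mp.Alloc(2048))
	old := c.mp
	c.free() // release all state with the old pool before replacing it
	fmt.Println(old.live) // prints 0: nothing orphaned
}
```

This also mirrors the reviewer's blocker above: the spill accounting reads the new pool, so any allocation still owned by the deleted pool is invisible to it.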
Merge Queue Status
This pull request spent 1 hour 51 minutes 44 seconds in the queue, including 58 minutes 40 seconds running CI.
What type of PR is this?
Which issue(s) does this PR fix?
Fixes #24307
What this PR does / why we need it:
Fixes the OOM regression in the sysbench 1000w insert-ignore scenario introduced by #24275.
Two root causes were identified:

1. Missing selectivity guard in tryIndexOnlyScan: the function runs before the general selectivity check in applyIndicesForFiltersRegularIndex, so non-selective range queries (>= / <) recognized by checkIndexFilter from feat(plan): support range queries on secondary indexes #24275 could trigger index-only scans without any cost check on large tables. This causes full index table scans → excessive metadata/bloom filter re-reads → unbounded mpool growth → OOM. Fixed by adding a guard that skips the index-only scan for non-equality leading conditions on tables with >= 50000 rows when selectivity > 0.3 or output cardinality > 10000.

2. Buffer overread in ConstructBasePKFilter's in_range case: types.DecodeInt64(vals[2]) reads 8 bytes from a 1-byte uint8 flag slice, causing undefined behavior. Fixed by reading the flag byte directly as vals[2][0].
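The second root cause can be illustrated with a small sketch. The `decodeInt64` below mimics the common unsafe-cast style of such fixed-width decoders (the real types.DecodeInt64 may differ in detail): it reinterprets 8 bytes starting at the slice's first element regardless of the slice's length.

```go
package main

import (
	"fmt"
	"unsafe"
)

// decodeInt64 reinterprets the memory at &v[0] as an int64. It reads
// 8 bytes unconditionally, even when len(v) < 8 — the overread the
// fix removes. (Illustrative; the real decoder may differ.)
func decodeInt64(v []byte) int64 {
	return *(*int64)(unsafe.Pointer(&v[0]))
}

func main() {
	// A 1-byte flag slice, as in the in_range case: len(vals[2]) == 1.
	// Backing padding keeps this demo itself in-bounds; real callers
	// have no such guarantee, so the read is undefined behavior.
	backing := make([]byte, 8)
	backing[0] = 1
	flag := backing[:1]

	// Buggy: 7 of the 8 bytes read lie past the slice's length.
	_ = decodeInt64(flag)

	// Fixed: read the single flag byte directly, as vals[2][0].
	fmt.Println(flag[0]) // prints 1
}
```

Reading `vals[2][0]` touches exactly the one byte the slice actually owns, so the fix is both correct and cheaper than the 8-byte decode.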