fix: prevent OOM in PKPersistedBetween under high-concurrency lock checks #24373
jiangxinmeng1 wants to merge 5 commits into
Conversation
…ecks

Under 1000 concurrent TPCC NEW_ORDER transactions, each LockOp triggers PKPersistedBetween to verify primary key conflicts. This path loads object metadata, bloom filters, and block PK columns for all changed objects since the transaction's snapshot. With many concurrent writers on the same table, this produces a thundering herd of block I/O that exhausts mpool capacity.

Three mitigations:
1. Early exit when changed objects exceed a threshold (64) — conservatively returns "may be modified" to avoid loading hundreds of object metas and bloom filters.
2. Early exit when candidate blocks exceed a threshold (32) — avoids reading too many blocks that passed zonemap + bloom filter checks.
3. Global semaphore (capacity 16) on the block I/O phase — caps peak concurrent mpool allocations from LoadColumns.

Also extends the secondary index selectivity guard in getIndexForNonEquiCond to cover in_range operators (previously only single range ops were guarded), preventing non-selective in_range conditions from triggering full index table scans on large tables.

Fixes matrixorigin#24348

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gouhongshen
left a comment
🔍 Code Review Report
📊 Overview
| Metric | Value |
|---|---|
| Files reviewed | 2 |
| Total findings | 7 (after dedup) |
| 🔴 Must Fix | 1 |
| 🟡 Should Fix | 4 |
| 🟢 Nit | 2 |
| 🧨 Destructive-test verdict | Insufficient |
📝 Summary
The PR's direction is correct — three gates (object count / block count / semaphore) bound PKPersistedBetween's peak memory under high concurrency, and conservatively returning true to trigger a conflict retry is safe RC semantics. The main problems are that the semaphore's scope is unintentionally widened (defer semantics mean the tombstone I/O path is also throttled) and that counter semantics are polluted. The in_range extension in apply_indices.go is reasonable and carries no regression risk.
🧨 Destructive Test Verdict
- Verdict: Insufficient
- Failure modes covered: none (the PR contains no automated test changes)
- Missing failure modes: ① threshold boundary behavior (the 64→65 switch point); ② ctx timeout behavior under semaphore contention; ③ transaction-retry correctness after a conservative bail-out; ④ semaphore starvation under concurrency
- Rationale: an OOM fix is a high-risk change; it introduces 3 new early-return paths + 1 global semaphore, all without automated verification. Relying on manual TPCC testing cannot prevent CI regressions.
🔴 Must Fix
1. Semaphore held too broadly, covering tombstone I/O — txn_table.go
- Category: concurrency design flaw
- Found by: 老K, 虫姐 (consensus)
- Description: `defer func() { <-pkCheckSemaphore }()` is registered inside `if len(candidateBlks) > 0`, but defer is function-scoped. The semaphore is held until the function returns, including through the subsequent `tombstonePKExistsInRange` path (which performs its own block I/O). Net effect: tombstone checks are also capped at 16 concurrent, and a single goroutine serially reading 32 blocks can hold a slot for seconds, causing head-of-line blocking.
- Fix: narrow the semaphore's scope to the candidateBlks loop itself (not a defer), or switch to per-block acquire/release:

```go
pkCheckSemaphore <- struct{}{}
for _, blk := range candidateBlks {
    release, err := ioutil.LoadColumns(...)
    // ... use data ...
    release()
}
<-pkCheckSemaphore
```

This way the tombstone path is not throttled, and each goroutine's hold time stays bounded.
🟡 Should Fix
1. Counter semantics pollution — TxnPKChangeCheckChangedCounter
- Category: observability flaw
- Found by: 老K, 虫姐, 补丁犬 (consensus 3/4)
- Description: the same counter now mixes three semantics: ① genuine changed/conflict (original), ② objects-cap bail-out (new), ③ blocks-cap bail-out (new). Monitoring cannot distinguish "real conflicts" from "protective downgrades".
- Fix: add a new `TxnPKChangeCheckBailoutCounter`, or add a label to the existing counter distinguishing `reason="conflict"/"objects_cap"/"blocks_cap"`.
2. IOCounter over-counting — incremented on ctx cancellation even though no I/O happened
- Category: observability flaw
- Found by: 虫姐
- Description: `v2.TxnPKChangeCheckIOCounter.Inc()` runs before the semaphore acquire; if ctx.Done fires, no I/O is ever performed but the counter has already been incremented.
- Fix: move the `.Inc()` to after the semaphore is successfully acquired.
3. Objects cap=64 fires before ZoneMap filtering — false-positive risk
- Category: performance/correctness trade-off
- Found by: 闪电, 虫姐
- Description: `len(cObjs)+len(delObjs) > 64` is checked before `ForeachCommittedObjects` (ZM/BF filtering). In long-transaction scenarios, even when 90%+ of objects would be excluded by ZM, the conservative exit still fires → unnecessary conflict retries → throughput drop. Moreover, `delObjs` usually contains no conflicting PKs, so counting it in full toward the threshold is overly conservative.
- Fix: raise the threshold to 256 (ZM exclusion is extremely cheap, reading only in-memory ObjectStats), count only `cObjs`, or move the threshold inside the `ForeachCommittedObjects` loop to count only objects that actually need I/O.
4. Hardcoded thresholds with no runtime tuning
- Category: operability
- Found by: 老K, 补丁犬
- Description: 64/32/16 are all fixed at compile time. Deployments of different sizes (8G vs 64G) have vastly different OOM tolerance; if a limit misbehaves, the only recourse is a new release.
- Fix: converge on registered `runtime.Parameter` values, or at least derive the semaphore capacity from total mpool size / cgroup memory.
🟢 Nit
1. Counting delObjs in full toward the threshold may be overly conservative
- Consider `len(cObjs) + len(delObjs)/2` or counting only `cObjs`.
2. isSingleRangeOp → isRangeOp rename
- The new name improves semantic accuracy; no issue. But there is no corresponding UT verifying that `in_range` is actually guarded.
💬 Jury Disagreement Log
| Topic | Disagreement | Verdict |
|---|---|---|
| Semaphore design soundness | 闪电: peak memory bounded ✅; 虫姐: severe head-of-line blocking | 🔴 Must Fix — memory control is effective, but scope and granularity need correction |
| objects cap=64 placement | 闪电: should move after ZM; 虫姐: counting delObjs is overly conservative | 🟡 — works as-is, but the false-positive rate is high |
Review by 7-agent jury system (code-review skill v1.5)
XuPeng-SH
left a comment
Review
The approach is directionally correct — conservatively returning "may be modified" is safe RC semantics and avoids the thundering-herd I/O. A few issues to address before merge:
🔴 Must Fix: Semaphore held across tombstone I/O path
`defer func() { <-pkCheckSemaphore }()` is registered inside `if len(candidateBlks) > 0`, but defer is function-scoped — it won't release until `PKPersistedBetween` returns. After the candidate-block loop, the function may proceed to `tombstonePKExistsInRange` (which performs its own block I/O). The semaphore stays held throughout, meaning:
- Tombstone checking is unintentionally rate-limited by `pkCheckSemaphore`
- A single goroutine holds a slot for candidateBlk reads + tombstone reads, creating head-of-line blocking

Fix: scope the release to just the candidateBlks loop:

```go
pkCheckSemaphore <- struct{}{}
for _, blk := range candidateBlks {
    // ...
}
<-pkCheckSemaphore
```

🟡 Should Fix
1. Counter semantics pollution — `TxnPKChangeCheckChangedCounter` is now overloaded: actual conflict (original), objects-cap bail-out, blocks-cap bail-out. Dashboards can't distinguish real conflicts from protective bail-outs. Use a separate counter or add labels.
2. IO counter incremented before semaphore acquire — `v2.TxnPKChangeCheckIOCounter.Inc()` fires before the semaphore select. If `ctx.Done()` fires, no I/O happens but the counter already incremented. Move `.Inc()` after successful acquire.
3. Objects cap (64) fires before ZoneMap filtering — ZoneMap checks in `ForeachCommittedObjects` are cheap in-memory reads that eliminate most objects. Under long transactions (not just high concurrency), many objects accumulate but most are irrelevant. Consider raising the threshold, counting only `cObjs`, or moving the cap inside the loop to count objects that actually need I/O.
4. Hardcoded thresholds (64/32/16) — deployments with 8GB vs 64GB RAM have very different tolerance. Consider runtime parameters or deriving from mpool capacity.
- TestPkCheckSemaphore_LimitsConcurrency: verifies the global semaphore actually limits concurrent goroutines to its capacity (16)
- TestPkCheckSemaphore_RespectsContextCancellation: verifies a cancelled context causes immediate return without blocking
- TestIsRangeOp: verifies the expanded range-operator classification covers in_range in addition to single comparison ops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ecks (cherry-pick to 4.0-dev) (#24380)

## What type of PR is this?

- [ ] API-change
- [x] BUG
- [ ] Improvement
- [ ] Documentation
- [ ] Feature
- [ ] Test and CI
- [ ] Code Refactoring

## Which issue(s) does this PR fix or relate to?

Cherry-pick of #24373 to 4.0-dev branch.

Fixes #24348

## What this PR does / why we need it:

Under 1000 concurrent TPCC NEW_ORDER transactions, each `LockOp` triggers `PKPersistedBetween` to verify primary key conflicts. This path loads object metadata, bloom filters, and block PK columns for all changed objects since the transaction's snapshot. With many concurrent writers on the same table (e.g., `stock`), this produces a thundering-herd of block I/O that exhausts mpool capacity → OOM.

### Changes

**`pkg/vm/engine/disttae/txn_table.go`** — Three mitigations in `PKPersistedBetween`:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

**`pkg/sql/plan/apply_indices.go`** — Extend selectivity guard in `getIndexForNonEquiCond` to cover `in_range` operators.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor PKPersistedBetween guard handling into testable helpers and add coverage for bailout and semaphore branches. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What type of PR is this?
Which issue(s) does this PR fix or relate to?
Fixes #24348
What this PR does / why we need it:
Under 1000 concurrent TPCC NEW_ORDER transactions, each `LockOp` triggers `PKPersistedBetween` to verify primary key conflicts. This path loads object metadata, bloom filters, and block PK columns for all changed objects since the transaction's snapshot. With many concurrent writers on the same table (e.g., `stock`), this produces a thundering-herd of block I/O that exhausts mpool capacity → OOM.

Root Cause
1000 goroutines × N candidate blocks × ~683KB/block = unbounded peak memory.
Changes

`pkg/vm/engine/disttae/txn_table.go` — Three mitigations in `PKPersistedBetween`:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

Returning `true` conservatively means the transaction gets a refresh timestamp and re-reads at a newer snapshot — correct RC behavior, just one extra read.

`pkg/sql/plan/apply_indices.go` — Extend selectivity guard in `getIndexForNonEquiCond`:
- Previously only `>=`/`>`/`<=`/`<` (single range ops) were guarded against non-selective secondary index scans on large tables
- Now `in_range` is also guarded, preventing two-sided range conditions from triggering full index table scans

Test plan
- `random_ranges` TPS not regressed (index-only scan path uses a different threshold via `tryIndexOnlyScan`)
- `PKPersistedBetween` still detects actual PK conflicts correctly (conservativeness only triggers above thresholds)

🤖 Generated with Claude Code