
fix: prevent OOM in PKPersistedBetween under high-concurrency lock checks #24373

Open

jiangxinmeng1 wants to merge 5 commits into matrixorigin:main from jiangxinmeng1:fix/tpcc-new-order-oom

Conversation

@jiangxinmeng1 (Contributor) commented May 13, 2026

What type of PR is this?

  - [ ] API-change
  - [x] BUG
  - [ ] Improvement
  - [ ] Documentation
  - [ ] Feature
  - [ ] Test and CI
  - [ ] Code Refactoring

Which issue(s) does this PR fix or relate to?

Fixes #24348

What this PR does / why we need it:

Under 1000 concurrent TPCC NEW_ORDER transactions, each LockOp triggers PKPersistedBetween to verify primary key conflicts. This path loads object metadata, bloom filters, and block PK columns for all changed objects since the transaction's snapshot. With many concurrent writers on the same table (e.g., stock), this produces a thundering herd of block I/O that exhausts mpool capacity → OOM.

Root Cause

LockOp.Call → callNonBlocking → doLock → hasNewVersionInRange
 → PKPersistedBetween
  → GetChangedObjsBetween (many objects under high concurrency)
  → ForeachCommittedObjects (load meta + BF for each)
  → LoadColumns for each candidate block (mpool allocation)

1000 goroutines × N candidate blocks × ~683KB/block = unbounded peak memory.
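A back-of-envelope bound (assumed numbers): even if every transaction were capped at 32 candidate blocks, the worst case is 1000 × 32 × ~683 KB ≈ 21 GB of simultaneous mpool demand; limiting concurrent loaders to 16 brings the peak to 16 × 32 × ~683 KB ≈ 350 MB, which is what the thresholds below aim for.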

Changes

pkg/vm/engine/disttae/txn_table.go — Three mitigations in PKPersistedBetween:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

Returning true conservatively means the transaction gets a refresh timestamp and re-reads at a newer snapshot — correct RC behavior, just one extra read.
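As a minimal Go sketch of how the three gates compose (helper names, signatures, and the gate ordering relative to filtering are assumptions; only the thresholds and the pkCheckSemaphore name come from the PR and its reviews):

```go
package disttae

import "context"

const (
	maxChangedObjects  = 64 // gate 1: cap on changed objects since snapshot
	maxCandidateBlocks = 32 // gate 2: cap on blocks surviving zonemap + BF
)

// Gate 3: global semaphore capping concurrent block loads (and thus peak
// mpool allocations) across all transactions.
var pkCheckSemaphore = make(chan struct{}, 16)

type blockDesc struct{} // placeholder for the real block descriptor

// pkPersistedBetweenSketch returns (true, nil) either on a real PK conflict
// or as the conservative answer when a gate trips; the caller then refreshes
// its snapshot and re-reads, which is correct under RC.
func pkPersistedBetweenSketch(
	ctx context.Context,
	changedObjects int, // |cObjs| + |delObjs|, counted before any I/O
	candidateBlks []blockDesc, // blocks that already passed zonemap + BF
	scanBlockForPK func(context.Context, blockDesc) (bool, error), // hypothetical
) (bool, error) {
	if changedObjects > maxChangedObjects {
		return true, nil // gate 1: skip loading meta/BF entirely
	}
	if len(candidateBlks) > maxCandidateBlocks {
		return true, nil // gate 2: skip block I/O entirely
	}
	select {
	case pkCheckSemaphore <- struct{}{}: // gate 3: acquire a loader slot
	case <-ctx.Done():
		return false, ctx.Err()
	}
	defer func() { <-pkCheckSemaphore }()
	for _, blk := range candidateBlks {
		found, err := scanBlockForPK(ctx, blk)
		if err != nil {
			return false, err
		}
		if found {
			return true, nil // genuine PK conflict
		}
	}
	return false, nil
}
```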

pkg/sql/plan/apply_indices.go — Extend selectivity guard in getIndexForNonEquiCond:

  • Previously only >=/>/<=/< (single range ops) were guarded against non-selective secondary index scans on large tables
  • Now in_range is also guarded, preventing two-sided range conditions from triggering full index table scans (see the sketch after this list)
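A hedged sketch of the expanded classifier; the reviews below note the function was renamed from isSingleRangeOp to isRangeOp, but the operator-name strings here are assumptions:

```go
// isRangeOp reports whether fn is a range comparison whose secondary-index
// plan should be guarded by the selectivity check on large tables.
func isRangeOp(fn string) bool {
	switch fn {
	case ">", ">=", "<", "<=": // single-sided range ops (previously guarded)
		return true
	case "in_range": // two-sided range condition (newly guarded)
		return true
	default:
		return false
	}
}
```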

Test plan

  • Verify TPCC 1000-terminal NEW_ORDER no longer OOMs under a 55 GiB cgroup limit
  • Verify sysbench random_ranges TPS not regressed (index-only scan path uses different threshold via tryIndexOnlyScan)
  • Verify PKPersistedBetween still detects actual PK conflicts correctly (conservativeness only triggers above thresholds)

🤖 Generated with Claude Code

…ecks

Under 1000 concurrent TPCC NEW_ORDER transactions, each LockOp triggers
PKPersistedBetween to verify primary key conflicts. This path loads object
metadata, bloom filters, and block PK columns for all changed objects since
the transaction's snapshot. With many concurrent writers on the same table,
this produces a thundering herd of block I/O that exhausts mpool capacity.

Three mitigations:
1. Early exit when changed objects exceed threshold (64) — conservatively
   returns "may be modified" to avoid loading hundreds of object meta/BF.
2. Early exit when candidate blocks exceed threshold (32) — avoids reading
   too many blocks that passed zonemap + bloom filter checks.
3. Global semaphore (capacity 16) on the block I/O phase — caps peak
   concurrent mpool allocations from LoadColumns.

Also extends the secondary index selectivity guard in getIndexForNonEquiCond
to cover in_range operators (previously only single range ops were guarded),
preventing non-selective in_range conditions from triggering full index table
scans on large tables.

Fixes matrixorigin#24348

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gouhongshen (Contributor) left a comment

🔍 Code Review Report

📊 Overview

| Metric | Value |
|--------|-------|
| Files reviewed | 2 |
| Total findings | 7 (after deduplication) |
| 🔴 Must Fix | 1 |
| 🟡 Should Fix | 4 |
| 🟢 Nit | 2 |
| 🧨 Destructive-testing verdict | Insufficient |

📝 Summary

The PR's direction is correct: three gates (object count / block count / semaphore) bound PKPersistedBetween's peak memory under high concurrency, and conservatively returning true to trigger a conflict retry is safe RC semantics. The main problems are the semaphore's scope being unintentionally enlarged (defer semantics throttle the tombstone I/O path as well) and polluted counter semantics. The in_range extension in apply_indices.go is reasonable, with no regression risk.

🧨 Destructive-Testing Judgment

  • Verdict: insufficient
  • Failure modes covered: none (the PR contains no automated test changes)
  • Missing failure modes: ① threshold boundary behavior (the 64→65 switch point); ② ctx timeout behavior under semaphore contention; ③ transaction retry correctness after a conservative bail-out; ④ semaphore starvation under concurrency
  • Rationale: an OOM fix is a high-risk change that introduces 3 new early-return paths plus 1 global semaphore, all without automated verification. Relying on manual TPCC runs cannot prevent CI regressions.

🔴 Must Fix

1. Semaphore held too broadly, covering tombstone I/O — txn_table.go

  • Category: concurrency design flaw
  • Found by: 老K, 虫姐 (consensus)
  • Description: defer func() { <-pkCheckSemaphore }() is registered inside if len(candidateBlks) > 0, but defer is function-scoped. The semaphore is held until the function returns, including the subsequent tombstonePKExistsInRange path (which does its own block I/O). Net effect: tombstone checks are also throttled to 16 concurrent, and a single goroutine serially reading 32 blocks can hold a slot for seconds, causing head-of-line blocking.
  • Fix: narrow the semaphore to the candidateBlks loop itself (not a defer), or switch to per-block acquire/release (sketched after this item):
    pkCheckSemaphore <- struct{}{}
    for _, blk := range candidateBlks {
        release, err := ioutil.LoadColumns(...)
        // ... use data ...
        release()
    }
    <-pkCheckSemaphore
    This keeps the tombstone path unthrottled and bounds how long a single goroutine holds a slot.
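A minimal sketch of the per-block acquire/release variant mentioned above, meant to replace the body of the candidate-block loop; loadBlockPKColumn and scanForPK are hypothetical stand-ins, not the real ioutil API:

```go
// Each block load holds a semaphore slot only for its own I/O, so one slow
// goroutine cannot pin a slot across all 32 candidate blocks, and the
// tombstone path after this loop is never throttled.
for _, blk := range candidateBlks {
	select {
	case pkCheckSemaphore <- struct{}{}: // acquire one slot per block
	case <-ctx.Done():
		return false, ctx.Err()
	}
	data, release, err := loadBlockPKColumn(ctx, blk) // hypothetical loader
	<-pkCheckSemaphore // release as soon as the allocation-heavy load ends
	if err != nil {
		return false, err
	}
	conflict := scanForPK(data) // hypothetical PK scan
	release()
	if conflict {
		return true, nil
	}
}
```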

🟡 Should Fix

1. Counter semantics pollution — TxnPKChangeCheckChangedCounter

  • Category: observability flaw
  • Found by: 老K, 虫姐, 补丁犬 (3/4 consensus)
  • Description: the same counter now mixes three meanings: ① error-or-actually-changed (original), ② objects-cap bail-out (new), ③ blocks-cap bail-out (new). Monitoring cannot distinguish "real conflicts" from "protective degradation".
  • Fix: add a TxnPKChangeCheckBailoutCounter, or add a label to the existing counter to distinguish reason="conflict"/"objects_cap"/"blocks_cap".

2. IOCounter over-counts — incremented even when ctx cancellation prevents the I/O

  • Category: observability flaw
  • Found by: 虫姐
  • Description: v2.TxnPKChangeCheckIOCounter.Inc() runs before the semaphore acquire; if ctx.Done fires, the counter has already incremented even though no I/O ever happened.
  • Fix: move .Inc() to after a successful semaphore acquire.

3. Objects cap = 64 fires before ZoneMap filtering — false-positive risk

  • Category: performance/correctness trade-off
  • Found by: 闪电, 虫姐
  • Description: len(cObjs)+len(delObjs) > 64 is checked before ForeachCommittedObjects (the ZM/BF filtering). In long-transaction scenarios, even when 90%+ of the objects would be excluded by ZM, the conservative exit still triggers → unnecessary conflict retries → throughput drop. Also, delObjs rarely contain conflicting PKs, so counting them in full toward the threshold is overly conservative.
  • Fix: raise the threshold to 256 (ZM exclusion is very cheap, reading only in-memory ObjectStats), count only cObjs, or move the cap inside the ForeachCommittedObjects loop so it counts only objects that actually need I/O.

4. Hardcoded thresholds with no runtime tuning (see the sketch after this list)

  • Category: operability
  • Found by: 老K, 补丁犬
  • Description: 64/32/16 are all fixed at compile time. Deployments of different sizes (8 GB vs 64 GB) have very different OOM tolerance; once a threshold misfires, the only remedy is a new release.
  • Fix: register them as runtime.Parameter configuration values, or at least derive the semaphore capacity from total mpool size / cgroup memory.
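A back-of-envelope sketch of deriving the semaphore capacity from memory, per the fix above; the ~683 KB per-block figure comes from the PR description, and the 1% budget is an assumption:

```go
// semaphoreCapacity sizes the PK-check semaphore from the mpool budget
// instead of hardcoding 16. Assumes ~683 KB per in-flight block load and
// a target of at most 1% of mpool capacity spent on PK checks.
func semaphoreCapacity(mpoolBytes int64) int {
	const perBlockBytes = 683 * 1024
	capacity := int(mpoolBytes / 100 / perBlockBytes)
	if capacity < 4 {
		capacity = 4 // floor so PK checks are never fully starved
	}
	return capacity
}
```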

🟢 Nit

1. Counting delObjs in full toward the threshold may be overly conservative

  • Consider len(cObjs) + len(delObjs)/2, or counting cObjs only

2. isSingleRangeOp → isRangeOp rename

  • The rename improves semantic accuracy; no issue. But there is no corresponding UT verifying that in_range is actually guarded.

💬 Jury Disagreement Log

| Issue | Disagreement | Ruling |
|-------|--------------|--------|
| Soundness of the semaphore design | 闪电: peak memory bounded ✅; 虫姐: severe head-of-line blocking ⚠️; 老K: scope too broad 🔴 | 🔴 Must Fix — memory control works, but scope and granularity need correction |
| Placement of the objects cap = 64 | 闪电: should move after ZM filtering; 虫姐: counting delObjs is overly conservative | 🟡 — works as-is, but the false-positive rate is high |

Review by 7-agent jury system (code-review skill v1.5)

@XuPeng-SH (Contributor) left a comment

Review

The approach is directionally correct — conservatively returning "may be modified" is safe RC semantics and avoids the thundering-herd I/O. A few issues to address before merge:

🔴 Must Fix: Semaphore held across tombstone I/O path

defer func() { <-pkCheckSemaphore }() is registered inside if len(candidateBlks) > 0, but defer is function-scoped — it won't release until PKPersistedBetween returns. After the candidate-block loop, the function may proceed to tombstonePKExistsInRange (which performs its own block I/O). The semaphore stays held throughout, meaning:

  1. Tombstone checking is unintentionally rate-limited by pkCheckSemaphore
  2. A single goroutine holds a slot for candidateBlk reads + tombstone reads, creating head-of-line blocking

Fix: scope the release to just the candidateBlks loop:

pkCheckSemaphore <- struct{}{}
for _, blk := range candidateBlks {
    // ...
}
<-pkCheckSemaphore

🟡 Should Fix

  1. Counter semantics pollution — TxnPKChangeCheckChangedCounter is now overloaded: actual conflict (original), objects-cap bail-out, blocks-cap bail-out. Dashboards can't distinguish real conflicts from protective bail-outs. Use a separate counter or add labels (see the sketch after this list).

  2. IO counter incremented before semaphore acquire — v2.TxnPKChangeCheckIOCounter.Inc() fires before the semaphore select. If ctx.Done() fires, no I/O happens but the counter already incremented. Move .Inc() after successful acquire.

  3. Objects cap (64) fires before ZoneMap filtering — ZoneMap checks in ForeachCommittedObjects are cheap in-memory reads that eliminate most objects. Under long transactions (not just high concurrency), many objects accumulate but most are irrelevant. Consider raising the threshold, counting only cObjs, or moving the cap inside the loop to count objects that actually need I/O.

  4. Hardcoded thresholds (64/32/16) — Deployments with 8GB vs 64GB RAM have very different tolerance. Consider runtime parameters or deriving from mpool capacity.
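A minimal sketch of the labeled-counter option using a Prometheus CounterVec; the metric name and reason values are illustrative, not the existing v2 definitions:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical labeled counter separating real conflicts from protective
// bail-outs, so dashboards can tell degradation from genuine contention.
var txnPKChangeCheckCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mo_txn_pk_change_check_total", // illustrative name
		Help: "PK change-check outcomes, partitioned by reason",
	},
	[]string{"reason"},
)

// At the three return sites:
//   txnPKChangeCheckCounter.WithLabelValues("conflict").Inc()
//   txnPKChangeCheckCounter.WithLabelValues("objects_cap").Inc()
//   txnPKChangeCheckCounter.WithLabelValues("blocks_cap").Inc()
```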

- TestPkCheckSemaphore_LimitsConcurrency: verifies the global semaphore
  actually limits concurrent goroutines to its capacity (16)
- TestPkCheckSemaphore_RespectsContextCancellation: verifies cancelled
  context causes immediate return without blocking
- TestIsRangeOp: verifies the expanded range operator classification
  covers in_range in addition to single comparison ops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
heni02 pushed a commit that referenced this pull request May 13, 2026
…ecks (cherry-pick to 4.0-dev) (#24380)

## What type of PR is this?

- [ ] API-change
- [x] BUG
- [ ] Improvement
- [ ] Documentation
- [ ] Feature
- [ ] Test and CI
- [ ] Code Refactoring

## Which issue(s) does this PR fix or relate to?

Cherry-pick of #24373 to 4.0-dev branch.

Fixes #24348

## What this PR does / why we need it:

Under 1000 concurrent TPCC NEW_ORDER transactions, each `LockOp`
triggers `PKPersistedBetween` to verify primary key conflicts. This path
loads object metadata, bloom filters, and block PK columns for all
changed objects since the transaction's snapshot. With many concurrent
writers on the same table (e.g., `stock`), this produces a
thundering herd of block I/O that exhausts mpool capacity → OOM.

### Changes

**`pkg/vm/engine/disttae/txn_table.go`** — Three mitigations in
`PKPersistedBetween`:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

**`pkg/sql/plan/apply_indices.go`** — Extend selectivity guard in
`getIndexForNonEquiCond` to cover `in_range` operators.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor PKPersistedBetween guard handling into testable helpers and add coverage for bailout and semaphore branches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
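Since the refactor above adds coverage for the semaphore branch, here is a self-contained sketch of what a concurrency-limit test could look like (all names hypothetical):

```go
package disttae

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// TestSemaphoreLimit checks the invariant the PR relies on: no more than
// `capacity` goroutines are ever inside the guarded section at once.
func TestSemaphoreLimit(t *testing.T) {
	const capacity = 16
	sem := make(chan struct{}, capacity)
	var inFlight atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release on exit
			if n := inFlight.Add(1); n > capacity {
				t.Errorf("%d goroutines in flight, capacity is %d", n, capacity)
			}
			time.Sleep(time.Millisecond) // simulate block I/O
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
}
```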

Labels

kind/bug (Something isn't working), size/M (Denotes a PR that changes [100,499] lines)


Development

Successfully merging this pull request may close these issues.

[Bug]: [date 5.11]main tke regression: tpcc 100w 1000 terminal test cn oom
