
fix: prevent OOM in PKPersistedBetween under high-concurrency lock checks #24373

Open

jiangxinmeng1 wants to merge 5 commits into matrixorigin:main from jiangxinmeng1:fix/tpcc-new-order-oom

Conversation

@jiangxinmeng1 (Contributor) commented May 13, 2026

What type of PR is this?

  - [ ] API-change
  - [x] BUG
  - [ ] Improvement
  - [ ] Documentation
  - [ ] Feature
  - [ ] Test and CI
  - [ ] Code Refactoring

Which issue(s) does this PR fix or relate to?

Fixes #24348

What this PR does / why we need it:

Under 1000 concurrent TPCC NEW_ORDER transactions, each LockOp triggers PKPersistedBetween to verify primary key conflicts. This path loads object metadata, bloom filters, and block PK columns for all changed objects since the transaction's snapshot. With many concurrent writers on the same table (e.g., stock), this produces a thundering herd of block I/O that exhausts mpool capacity → OOM.

Root Cause

LockOp.Call → callNonBlocking → doLock → hasNewVersionInRange
 → PKPersistedBetween
  → GetChangedObjsBetween (many objects under high concurrency)
  → ForeachCommittedObjects (load meta + BF for each)
  → LoadColumns for each candidate block (mpool allocation)

1000 goroutines × N candidate blocks × ~683KB/block = unbounded peak memory.
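A back-of-envelope bound (assumed numbers): even if every transaction were capped at 32 candidate blocks, the worst case is 1000 × 32 × ~683 KB ≈ 21 GB of simultaneous mpool demand; limiting concurrent loaders to 16 brings the peak to 16 × 32 × ~683 KB ≈ 350 MB, which is what the thresholds below aim for.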

Changes

pkg/vm/engine/disttae/txn_table.go — Three mitigations in PKPersistedBetween:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

Returning true conservatively means the transaction gets a refresh timestamp and re-reads at a newer snapshot — correct RC behavior, just one extra read.
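As a minimal Go sketch of how the three gates compose (helper names, signatures, and the gate ordering relative to filtering are assumptions; only the thresholds and the pkCheckSemaphore name come from the PR and its reviews):

```go
package disttae

import "context"

const (
	maxChangedObjects  = 64 // gate 1: cap on changed objects since snapshot
	maxCandidateBlocks = 32 // gate 2: cap on blocks surviving zonemap + BF
)

// Gate 3: global semaphore capping concurrent block loads (and thus peak
// mpool allocations) across all transactions.
var pkCheckSemaphore = make(chan struct{}, 16)

type blockDesc struct{} // placeholder for the real block descriptor

// pkPersistedBetweenSketch returns (true, nil) either on a real PK conflict
// or as the conservative answer when a gate trips; the caller then refreshes
// its snapshot and re-reads, which is correct under RC.
func pkPersistedBetweenSketch(
	ctx context.Context,
	changedObjects int, // |cObjs| + |delObjs|, counted before any I/O
	candidateBlks []blockDesc, // blocks that already passed zonemap + BF
	scanBlockForPK func(context.Context, blockDesc) (bool, error), // hypothetical
) (bool, error) {
	if changedObjects > maxChangedObjects {
		return true, nil // gate 1: skip loading meta/BF entirely
	}
	if len(candidateBlks) > maxCandidateBlocks {
		return true, nil // gate 2: skip block I/O entirely
	}
	select {
	case pkCheckSemaphore <- struct{}{}: // gate 3: acquire a loader slot
	case <-ctx.Done():
		return false, ctx.Err()
	}
	defer func() { <-pkCheckSemaphore }()
	for _, blk := range candidateBlks {
		found, err := scanBlockForPK(ctx, blk)
		if err != nil {
			return false, err
		}
		if found {
			return true, nil // genuine PK conflict
		}
	}
	return false, nil
}
```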

pkg/sql/plan/apply_indices.go — Extend selectivity guard in getIndexForNonEquiCond:

  • Previously only >=/>/<=/< (single range ops) were guarded against non-selective secondary index scans on large tables
  • Now in_range is also guarded, preventing two-sided range conditions from triggering full index table scans (see the sketch after this list)
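A hedged sketch of the expanded classifier; the reviews below note the function was renamed from isSingleRangeOp to isRangeOp, but the operator-name strings here are assumptions:

```go
// isRangeOp reports whether fn is a range comparison whose secondary-index
// plan should be guarded by the selectivity check on large tables.
func isRangeOp(fn string) bool {
	switch fn {
	case ">", ">=", "<", "<=": // single-sided range ops (previously guarded)
		return true
	case "in_range": // two-sided range condition (newly guarded)
		return true
	default:
		return false
	}
}
```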

Test plan

  • Verify TPCC 1000-terminal NEW_ORDER no longer OOMs under a 55 GiB cgroup limit
  • Verify sysbench random_ranges TPS not regressed (index-only scan path uses different threshold via tryIndexOnlyScan)
  • Verify PKPersistedBetween still detects actual PK conflicts correctly (conservativeness only triggers above thresholds)

🤖 Generated with Claude Code

…ecks

Under 1000 concurrent TPCC NEW_ORDER transactions, each LockOp triggers
PKPersistedBetween to verify primary key conflicts. This path loads object
metadata, bloom filters, and block PK columns for all changed objects since
the transaction's snapshot. With many concurrent writers on the same table,
this produces a thundering herd of block I/O that exhausts mpool capacity.

Three mitigations:
1. Early exit when changed objects exceed threshold (64) — conservatively
   returns "may be modified" to avoid loading hundreds of object meta/BF.
2. Early exit when candidate blocks exceed threshold (32) — avoids reading
   too many blocks that passed zonemap + bloom filter checks.
3. Global semaphore (capacity 16) on the block I/O phase — caps peak
   concurrent mpool allocations from LoadColumns.

Also extends the secondary index selectivity guard in getIndexForNonEquiCond
to cover in_range operators (previously only single range ops were guarded),
preventing non-selective in_range conditions from triggering full index table
scans on large tables.

Fixes matrixorigin#24348

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gouhongshen (Contributor) left a comment

🔍 Code Review Report

📊 Overview

| Metric | Value |
|--------|-------|
| Files reviewed | 2 |
| Total findings | 7 (after deduplication) |
| 🔴 Must Fix | 1 |
| 🟡 Should Fix | 4 |
| 🟢 Nit | 2 |
| 🧨 Destructive-testing verdict | Insufficient |

📝 Summary

The PR's direction is correct: three gates (object count / block count / semaphore) bound PKPersistedBetween's peak memory under high concurrency, and conservatively returning true to trigger a conflict retry is safe RC semantics. The main problems are the semaphore's scope being unintentionally enlarged (defer semantics throttle the tombstone I/O path as well) and polluted counter semantics. The in_range extension in apply_indices.go is reasonable, with no regression risk.

🧨 Destructive-Testing Judgment

  • Verdict: insufficient
  • Failure modes covered: none (the PR contains no automated test changes)
  • Missing failure modes: ① threshold boundary behavior (the 64→65 switch point); ② ctx timeout behavior under semaphore contention; ③ transaction retry correctness after a conservative bail-out; ④ semaphore starvation under concurrency
  • Rationale: an OOM fix is a high-risk change that introduces 3 new early-return paths plus 1 global semaphore, all without automated verification. Relying on manual TPCC runs cannot prevent CI regressions.

🔴 Must Fix

1. Semaphore held too broadly, covering tombstone I/O — txn_table.go

  • Category: concurrency design flaw
  • Found by: 老K, 虫姐 (consensus)
  • Description: defer func() { <-pkCheckSemaphore }() is registered inside if len(candidateBlks) > 0, but defer is function-scoped. The semaphore is held until the function returns, including the subsequent tombstonePKExistsInRange path (which does its own block I/O). Net effect: tombstone checks are also throttled to 16 concurrent, and a single goroutine serially reading 32 blocks can hold a slot for seconds, causing head-of-line blocking.
  • Fix: narrow the semaphore to the candidateBlks loop itself (not a defer), or switch to per-block acquire/release (sketched after this item):
    pkCheckSemaphore <- struct{}{}
    for _, blk := range candidateBlks {
        release, err := ioutil.LoadColumns(...)
        // ... use data ...
        release()
    }
    <-pkCheckSemaphore
    This keeps the tombstone path unthrottled and bounds how long a single goroutine holds a slot.
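A minimal sketch of the per-block acquire/release variant mentioned above, meant to replace the body of the candidate-block loop; loadBlockPKColumn and scanForPK are hypothetical stand-ins, not the real ioutil API:

```go
// Each block load holds a semaphore slot only for its own I/O, so one slow
// goroutine cannot pin a slot across all 32 candidate blocks, and the
// tombstone path after this loop is never throttled.
for _, blk := range candidateBlks {
	select {
	case pkCheckSemaphore <- struct{}{}: // acquire one slot per block
	case <-ctx.Done():
		return false, ctx.Err()
	}
	data, release, err := loadBlockPKColumn(ctx, blk) // hypothetical loader
	<-pkCheckSemaphore // release as soon as the allocation-heavy load ends
	if err != nil {
		return false, err
	}
	conflict := scanForPK(data) // hypothetical PK scan
	release()
	if conflict {
		return true, nil
	}
}
```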

🟡 Should Fix

1. Counter semantics pollution — TxnPKChangeCheckChangedCounter

  • Category: observability flaw
  • Found by: 老K, 虫姐, 补丁犬 (3/4 consensus)
  • Description: the same counter now mixes three meanings: ① error-or-actually-changed (original), ② objects-cap bail-out (new), ③ blocks-cap bail-out (new). Monitoring cannot distinguish "real conflicts" from "protective degradation".
  • Fix: add a TxnPKChangeCheckBailoutCounter, or add a label to the existing counter to distinguish reason="conflict"/"objects_cap"/"blocks_cap".

2. IOCounter over-counts — incremented even when ctx cancellation prevents the I/O

  • Category: observability flaw
  • Found by: 虫姐
  • Description: v2.TxnPKChangeCheckIOCounter.Inc() runs before the semaphore acquire; if ctx.Done fires, the counter has already incremented even though no I/O ever happened.
  • Fix: move .Inc() to after a successful semaphore acquire.

3. Objects cap = 64 fires before ZoneMap filtering — false-positive risk

  • Category: performance/correctness trade-off
  • Found by: 闪电, 虫姐
  • Description: len(cObjs)+len(delObjs) > 64 is checked before ForeachCommittedObjects (the ZM/BF filtering). In long-transaction scenarios, even when 90%+ of the objects would be excluded by ZM, the conservative exit still triggers → unnecessary conflict retries → throughput drop. Also, delObjs rarely contain conflicting PKs, so counting them in full toward the threshold is overly conservative.
  • Fix: raise the threshold to 256 (ZM exclusion is very cheap, reading only in-memory ObjectStats), count only cObjs, or move the cap inside the ForeachCommittedObjects loop so it counts only objects that actually need I/O.

4. Hardcoded thresholds with no runtime tuning (see the sketch after this list)

  • Category: operability
  • Found by: 老K, 补丁犬
  • Description: 64/32/16 are all fixed at compile time. Deployments of different sizes (8 GB vs 64 GB) have very different OOM tolerance; once a threshold misfires, the only remedy is a new release.
  • Fix: register them as runtime.Parameter configuration values, or at least derive the semaphore capacity from total mpool size / cgroup memory.
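A back-of-envelope sketch of deriving the semaphore capacity from memory, per the fix above; the ~683 KB per-block figure comes from the PR description, and the 1% budget is an assumption:

```go
// semaphoreCapacity sizes the PK-check semaphore from the mpool budget
// instead of hardcoding 16. Assumes ~683 KB per in-flight block load and
// a target of at most 1% of mpool capacity spent on PK checks.
func semaphoreCapacity(mpoolBytes int64) int {
	const perBlockBytes = 683 * 1024
	capacity := int(mpoolBytes / 100 / perBlockBytes)
	if capacity < 4 {
		capacity = 4 // floor so PK checks are never fully starved
	}
	return capacity
}
```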

🟢 Nit

1. Counting delObjs in full toward the threshold may be overly conservative

  • Consider len(cObjs) + len(delObjs)/2, or counting cObjs only

2. isSingleRangeOp → isRangeOp rename

  • The rename improves semantic accuracy; no issue. But there is no corresponding UT verifying that in_range is actually guarded.

💬 Jury Disagreement Log

| Issue | Disagreement | Ruling |
|-------|--------------|--------|
| Soundness of the semaphore design | 闪电: peak memory bounded ✅; 虫姐: severe head-of-line blocking ⚠️; 老K: scope too broad 🔴 | 🔴 Must Fix — memory control works, but scope and granularity need correction |
| Placement of the objects cap = 64 | 闪电: should move after ZM filtering; 虫姐: counting delObjs is overly conservative | 🟡 — works as-is, but the false-positive rate is high |

Review by 7-agent jury system (code-review skill v1.5)

@XuPeng-SH (Contributor) left a comment

Review

The approach is directionally correct — conservatively returning "may be modified" is safe RC semantics and avoids the thundering-herd I/O. A few issues to address before merge:

🔴 Must Fix: Semaphore held across tombstone I/O path

defer func() { <-pkCheckSemaphore }() is registered inside if len(candidateBlks) > 0, but defer is function-scoped — it won't release until PKPersistedBetween returns. After the candidate-block loop, the function may proceed to tombstonePKExistsInRange (which performs its own block I/O). The semaphore stays held throughout, meaning:

  1. Tombstone checking is unintentionally rate-limited by pkCheckSemaphore
  2. A single goroutine holds a slot for candidateBlk reads + tombstone reads, creating head-of-line blocking

Fix: scope the release to just the candidateBlks loop:

pkCheckSemaphore <- struct{}{}
for _, blk := range candidateBlks {
    // ...
}
<-pkCheckSemaphore

🟡 Should Fix

  1. Counter semantics pollution — TxnPKChangeCheckChangedCounter is now overloaded: actual conflict (original), objects-cap bail-out, blocks-cap bail-out. Dashboards can't distinguish real conflicts from protective bail-outs. Use a separate counter or add labels (see the sketch after this list).

  2. IO counter incremented before semaphore acquire — v2.TxnPKChangeCheckIOCounter.Inc() fires before the semaphore select. If ctx.Done() fires, no I/O happens but the counter already incremented. Move .Inc() after successful acquire.

  3. Objects cap (64) fires before ZoneMap filtering — ZoneMap checks in ForeachCommittedObjects are cheap in-memory reads that eliminate most objects. Under long transactions (not just high concurrency), many objects accumulate but most are irrelevant. Consider raising the threshold, counting only cObjs, or moving the cap inside the loop to count objects that actually need I/O.

  4. Hardcoded thresholds (64/32/16) — Deployments with 8GB vs 64GB RAM have very different tolerance. Consider runtime parameters or deriving from mpool capacity.
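A minimal sketch of the labeled-counter option using a Prometheus CounterVec; the metric name and reason values are illustrative, not the existing v2 definitions:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical labeled counter separating real conflicts from protective
// bail-outs, so dashboards can tell degradation from genuine contention.
var txnPKChangeCheckCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mo_txn_pk_change_check_total", // illustrative name
		Help: "PK change-check outcomes, partitioned by reason",
	},
	[]string{"reason"},
)

// At the three return sites:
//   txnPKChangeCheckCounter.WithLabelValues("conflict").Inc()
//   txnPKChangeCheckCounter.WithLabelValues("objects_cap").Inc()
//   txnPKChangeCheckCounter.WithLabelValues("blocks_cap").Inc()
```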

- TestPkCheckSemaphore_LimitsConcurrency: verifies the global semaphore
  actually limits concurrent goroutines to its capacity (16)
- TestPkCheckSemaphore_RespectsContextCancellation: verifies cancelled
  context causes immediate return without blocking
- TestIsRangeOp: verifies the expanded range operator classification
  covers in_range in addition to single comparison ops

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
heni02 pushed a commit that referenced this pull request May 13, 2026
…ecks (cherry-pick to 4.0-dev) (#24380)

## What type of PR is this?

- [ ] API-change
- [x] BUG
- [ ] Improvement
- [ ] Documentation
- [ ] Feature
- [ ] Test and CI
- [ ] Code Refactoring

## Which issue(s) does this PR fix or relate to?

Cherry-pick of #24373 to 4.0-dev branch.

Fixes #24348

## What this PR does / why we need it:

Under 1000 concurrent TPCC NEW_ORDER transactions, each `LockOp`
triggers `PKPersistedBetween` to verify primary key conflicts. This path
loads object metadata, bloom filters, and block PK columns for all
changed objects since the transaction's snapshot. With many concurrent
writers on the same table (e.g., `stock`), this produces a
thundering herd of block I/O that exhausts mpool capacity → OOM.

### Changes

**`pkg/vm/engine/disttae/txn_table.go`** — Three mitigations in
`PKPersistedBetween`:

| Mechanism | Threshold | Effect |
|-----------|-----------|--------|
| Changed objects cap | 64 | Conservatively return `true` to skip expensive I/O |
| Candidate blocks cap | 32 | Avoid reading too many blocks that passed zonemap + BF |
| Global semaphore | 16 concurrent | Cap peak mpool allocations from `LoadColumns` |

**`pkg/sql/plan/apply_indices.go`** — Extend selectivity guard in
`getIndexForNonEquiCond` to cover `in_range` operators.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor PKPersistedBetween guard handling into testable helpers and add coverage for bailout and semaphore branches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
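Since the refactor above adds coverage for the semaphore branch, here is a self-contained sketch of what a concurrency-limit test could look like (all names hypothetical):

```go
package disttae

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// TestSemaphoreLimit checks the invariant the PR relies on: no more than
// `capacity` goroutines are ever inside the guarded section at once.
func TestSemaphoreLimit(t *testing.T) {
	const capacity = 16
	sem := make(chan struct{}, capacity)
	var inFlight atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release on exit
			if n := inFlight.Add(1); n > capacity {
				t.Errorf("%d goroutines in flight, capacity is %d", n, capacity)
			}
			time.Sleep(time.Millisecond) // simulate block I/O
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
}
```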

Labels

kind/bug (Something isn't working), size/M (Denotes a PR that changes [100,499] lines)


Development

Successfully merging this pull request may close these issues.

[Bug]: [date 5.11]main tke regression: tpcc 100w 1000 terminal test cn oom
