TKE checkin: TP CN OOMKilled during TPCC/SSB/TPCH/sysbench load #24846

## Summary

`MO Checkin Regression On TKE` has recently started failing frequently on main/4.0 runs. The visible failure pattern is a TP CN being OOMKilled during TPCC/SSB/TPCH/sysbench load, followed by client connection loss and lock-table-related errors.

## Good / Bad Boundary

Known good run:
- Run: https://github.com/matrixorigin/matrixone/actions/runs/26881644689
- Namespace: `mo-checkin-regression-24774`
- Image: `commit-fca451f`
- Result: success
- TP CN restarts: 0
- TP CN RSS peak: ~10.1/10.3 GiB
- Go heap peak: ~4.1/4.3 GiB

First observed bad run:
- Run: https://github.com/matrixorigin/matrixone/actions/runs/26886279863
- Namespace: `mo-checkin-regression-24803`
- Image: `commit-0935043`
- Result: failure
- TP CN `mo-checkin-regression-tp-cn-nnbj8` OOMKilled
- RSS peak: ~13.56 GiB
- Go heap peak: ~10.9 GiB
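
To make the boundary easier to reason about, the peak figures from the two runs can be compared directly. This is a minimal sketch using only the numbers listed above; where the good run reports two values, the higher one is used, and the variable and function names are chosen for illustration. The peaks may not coincide in time, so the result is only a rough indication of where the growth sits.

```python
# Peak figures copied from the good and bad runs above (GiB).
# Assumption: for the good run, the higher of the two reported values is used.
good = {"rss": 10.3, "go_heap": 4.3}
bad = {"rss": 13.56, "go_heap": 10.9}


def growth(before, after):
    """Return absolute growth in GiB per metric between two runs."""
    return {key: round(after[key] - before[key], 2) for key in before}


delta = growth(good, bad)
for metric, value in delta.items():
    print(f"{metric}: +{value} GiB")

# Ratio of Go heap growth to RSS growth between the two runs.
heap_to_rss = delta["go_heap"] / delta["rss"]
print(f"Go heap growth is {heap_to_rss:.2f}x the RSS growth")
```

On these figures the Go heap grows by more than the RSS does, which is consistent with the regression living in Go-managed memory rather than in off-heap allocations.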

Later runs showed a similar signature:
- https://github.com/matrixorigin/matrixone/actions/runs/26933239941 (`mo-checkin-regression-24798`)
- https://github.com/matrixorigin/matrixone/actions/runs/26936440344 (`mo-checkin-regression-24792`)
- retained namespace `mo-checkin-regression-24624`

## Symptoms

Common test/log symptoms:
- TP CN container exits with OOMKilled / exit 137
- TPCC/SSB/TPCH report `Lost connection to MySQL server during query`
- sysbench and background tasks report many `lock table bind changed` errors
- memory growth is concentrated on one TP CN while the peer CN stays much lower
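
When triaging a new failing run, the symptoms above can be counted mechanically from collected logs. The following is a rough sketch, not an existing tool: the signature strings come from the list above, while the labels, the function name, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Signature strings are taken from the symptoms above.
# The labels are hypothetical names chosen for this sketch.
SIGNATURES = {
    "oom_killed": re.compile(r"OOMKilled|exit(?: code)? 137", re.IGNORECASE),
    "lost_connection": re.compile(r"Lost connection to MySQL server during query"),
    "bind_changed": re.compile(r"lock table bind changed"),
}


def count_signatures(lines):
    """Count how many lines match each known failure signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Made-up sample lines for illustration only; not real log output.
sample = [
    "tp-cn container terminated: reason=OOMKilled exit code 137",
    "ERROR 2013 (HY000): Lost connection to MySQL server during query",
    "internal error: lock table bind changed",
    "internal error: lock table bind changed",
    "load finished for warehouse 1",
]

counts = count_signatures(sample)
for label in SIGNATURES:
    print(label, counts[label])
```

A run that shows all three signatures together, with the OOM kill first, matches the pattern described in this issue.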

## Profile / Metrics Observed

Bad TP CN profile/metrics around failure:
- RSS grows to ~13.5 GiB, close to the 14 GiB container limit
- Go heap grows into the ~10-12 GiB range
- `malloc:inuse_space` / mpool profile shows no meaningful signal in the checked windows
- heap profile is dominated by lockservice-related stacks such as holders/waiter queue/btree structures

## Infrastructure Notes

There is no clear evidence that the runner or TP CN resource limit was recently reduced:
- TP CN requests/limits observed as `12Gi` / `14Gi`
- `GOMEMLIMIT=8000MiB`
- first good and first bad runs used the same TKE worker nodes for the TP CNs
- retained namespace nodes are `SA3.2XLARGE16`; node memory pressure was false when checked
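
The limits above can be set against the peaks observed in the first bad run. This is a back-of-the-envelope sketch using only figures from this issue; the variable names are illustrative. Note that `GOMEMLIMIT` is a soft limit in the Go runtime, so the heap can exceed it when live data keeps growing, and the "Go heap" metric reported here may not cover exactly the same memory the limit accounts for.

```python
MIB_PER_GIB = 1024

# Values from the infrastructure notes and the first bad run above.
container_limit_gib = 14.0
gomemlimit_gib = 8000 / MIB_PER_GIB  # GOMEMLIMIT=8000MiB expressed in GiB
bad_heap_peak_gib = 10.9
bad_rss_peak_gib = 13.56

# How far the Go heap peak sits above the soft memory limit.
heap_over_soft_limit = bad_heap_peak_gib - gomemlimit_gib

# How much room is left between the RSS peak and the container limit.
rss_headroom = container_limit_gib - bad_rss_peak_gib

print(f"GOMEMLIMIT: {gomemlimit_gib:.2f} GiB")
print(f"Go heap peak above GOMEMLIMIT: {heap_over_soft_limit:.2f} GiB")
print(f"RSS headroom below container limit: {rss_headroom:.2f} GiB")
```

A heap peak well above the soft limit suggests that the memory is still reachable rather than garbage awaiting collection, which fits the lockservice-dominated heap profile.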

## CI Note

The CI workflow cleanup behavior is being changed separately so failed checkin namespaces are no longer retained by default: matrixorigin/CI#360

