Summary
MO Checkin Regression On TKE has recently started failing frequently on main/4.0 runs. The visible failure pattern is TP CN OOMKilled during TPCC/SSB/TPCH/sysbench load, followed by client connection loss and lock-table related errors.
Good / Bad Boundary
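Known good run:
- mo-checkin-regression-24774, commit-fca451f
First observed bad run:
- mo-checkin-regression-24803, commit-0935043
- mo-checkin-regression-tp-cn-nnbj8: OOMKilled
Later runs showed a similar signature:
- mo-checkin-regression-24798
- mo-checkin-regression-24792
- mo-checkin-regression-24624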
Symptoms
Common test/log symptoms:
- TP CN container exits with OOMKilled / exit 137
- TPCC/SSB/TPCH report "Lost connection to MySQL server during query"
- sysbench and background tasks report many "lock table bind changed" errors
- memory growth is concentrated on one TP CN while the peer CN stays much lower
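
For triaging further runs, the three log-visible symptoms above can be checked mechanically. The following is a minimal sketch, not part of the existing checkin tooling: the category names, the helper functions, and the exact wording matched for the exit code are assumptions made for illustration. Only the two error strings and the OOMKilled / exit 137 markers come from the observed logs.

```python
import re
from collections import Counter

# Error strings come from the symptoms listed above. The category names
# and the exit-code wording variants are hypothetical choices for this sketch.
SIGNATURES = {
    "oom_killed": re.compile(r"OOMKilled|exit(?:\s+code)?:?\s+137", re.IGNORECASE),
    "lost_connection": re.compile(r"Lost connection to MySQL server during query"),
    "lock_bind_changed": re.compile(r"lock table bind changed"),
}


def classify_lines(lines):
    """Count how many log lines match each symptom category."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def matches_regression_signature(counts):
    """Return True only when every symptom category was seen at least once."""
    return all(counts[name] > 0 for name in SIGNATURES)


if __name__ == "__main__":
    # Illustrative lines only; real input would be pod status plus test logs.
    sample = [
        "Last State: Terminated, Reason: OOMKilled, exit 137",
        "ERROR 2013 (HY000): Lost connection to MySQL server during query",
        "internal error: lock table bind changed",
    ]
    counts = classify_lines(sample)
    print(dict(counts), matches_regression_signature(counts))
```

Requiring all three categories keeps the check conservative: a run that only lost connections, without an OOMKilled TP CN, would not be counted as the same regression.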
Profile / Metrics Observed
Bad TP CN profile/metrics around failure:
- RSS grows to ~13.5 GiB, close to the 14 GiB container limit
- Go heap grows to the ~10-12 GiB range
- malloc:inuse_space / mpool profile has no effective signal in the checked windows
- heap profile is dominated by lockservice-related stacks such as holders/waiter queue/btree structures
Infrastructure Notes
No clear evidence that the runner or TP CN resource limit was recently reduced:
- TP CN requests/limits observed as 12Gi / 14Gi
- GOMEMLIMIT=8000MiB
- first good and first bad runs used the same TKE worker nodes for the TP CNs
- retained namespace nodes are SA3.2XLARGE16; node memory pressure was false when checked
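
To put the observed figures side by side, the arithmetic below converts everything to one unit and computes the remaining headroom. It is a sketch using only the numbers reported in this issue; the variable and function names are made up for illustration. One possible reading, offered as an assumption rather than a conclusion: GOMEMLIMIT is a soft limit in the Go runtime, so a Go heap several GiB above it suggests the memory is still reachable rather than garbage awaiting collection, which would be consistent with the lockservice-dominated heap profile.

```python
MIB_PER_GIB = 1024

# Figures reported in this issue (approximate where the issue says "~").
container_limit_gib = 14.0            # TP CN memory limit
observed_rss_gib = 13.5               # RSS shortly before the OOM kill
gomemlimit_gib = 8000 / MIB_PER_GIB   # GOMEMLIMIT=8000MiB expressed in GiB
heap_range_gib = (10.0, 12.0)         # observed Go heap range


def headroom_mib(limit_gib, used_gib):
    """Memory left before the container limit, in MiB."""
    return (limit_gib - used_gib) * MIB_PER_GIB


def overshoot_gib(heap_gib, soft_limit_gib):
    """How far the Go heap sits above the GOMEMLIMIT soft limit, in GiB."""
    return heap_gib - soft_limit_gib


if __name__ == "__main__":
    print(f"GOMEMLIMIT: {gomemlimit_gib} GiB")
    print(f"RSS headroom: {headroom_mib(container_limit_gib, observed_rss_gib)} MiB")
    for heap in heap_range_gib:
        print(f"heap {heap} GiB exceeds GOMEMLIMIT by "
              f"{overshoot_gib(heap, gomemlimit_gib)} GiB")
```

With these inputs the RSS headroom is 512 MiB, and the heap sits roughly 2.2 to 4.2 GiB above the 7.8125 GiB soft limit.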
CI Note
The CI workflow cleanup behavior is being changed separately so failed checkin namespaces are no longer retained by default: matrixorigin/CI#360