TKE checkin: TP CN OOMKilled during TPCC/SSB/TPCH/sysbench load #24846

@aptend

Summary

MO Checkin Regression On TKE has recently started failing frequently on main/4.0 runs. The visible failure pattern is a TP CN being OOMKilled during TPCC/SSB/TPCH/sysbench load, followed by client connection loss and lock-table-related errors.

Good / Bad Boundary

Known good run:

First observed bad run:

Later runs showed a similar signature:

Symptoms

Common test/log symptoms:

  • TP CN container exits with OOMKilled / exit 137
  • TPCC/SSB/TPCH report "Lost connection to MySQL server during query"
  • sysbench and background tasks report many "lock table bind changed" errors
  • memory growth is concentrated on one TP CN, while the peer CN's usage stays much lower
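
As a reading aid for the first symptom, here is a minimal Go sketch of how exit code 137 decodes under the usual 128 + signal convention. The helper name `signalFromExitCode` is hypothetical and is not part of MatrixOne or Kubernetes. Signal 9 on its own only means the process was killed; it is the OOMKilled reason reported for the container that ties the kill to the memory limit.

```go
package main

import "fmt"

// signalFromExitCode is a hypothetical helper. By the common shell and
// container convention, an exit code of 128+N means the process was
// terminated by signal N. It returns N and true for such codes, or
// 0 and false for an ordinary exit status.
func signalFromExitCode(code int) (int, bool) {
	if code > 128 && code < 256 {
		return code - 128, true
	}
	return 0, false
}

func main() {
	const sigkill = 9 // SIGKILL, the signal the kernel OOM killer sends

	if sig, ok := signalFromExitCode(137); ok {
		fmt.Printf("exit 137 -> signal %d (SIGKILL: %v)\n", sig, sig == sigkill)
	}
}
```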

Profile / Metrics Observed

Bad TP CN profile/metrics around failure:

  • RSS grows to ~13.5 GiB, close to the 14 GiB container limit
  • Go heap grows to roughly 10-12 GiB
  • malloc:inuse_space / mpool profile shows no meaningful signal in the windows checked
  • heap profile is dominated by lockservice-related stacks, such as holders, waiter queue, and btree structures
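
For anyone reproducing these observations, the sketch below shows one generic way a Go process can record a heap profile together with a few runtime memory figures using only the standard library. It is an assumption-laden illustration: the `snapshot` type and `takeSnapshot` function are hypothetical, and this is not how the profiles in this issue were collected.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/debug"
	"runtime/pprof"
)

// snapshot (hypothetical) groups a heap profile with a few runtime figures
// that are useful to record alongside it.
type snapshot struct {
	HeapInuse   uint64 // bytes in in-use heap spans
	Sys         uint64 // bytes the Go runtime has obtained from the OS
	MemoryLimit int64  // current soft memory limit (what GOMEMLIMIT sets)
	Profile     []byte // heap profile in compressed pprof format
}

// takeSnapshot reads runtime memory statistics and captures a heap profile.
func takeSnapshot() (snapshot, error) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	var buf bytes.Buffer
	// debug=0 writes the compressed protobuf format that `go tool pprof` reads.
	if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
		return snapshot{}, err
	}

	return snapshot{
		HeapInuse: ms.HeapInuse,
		Sys:       ms.Sys,
		// A negative argument only reads the current limit without changing it.
		MemoryLimit: debug.SetMemoryLimit(-1),
		Profile:     buf.Bytes(),
	}, nil
}

func main() {
	s, err := takeSnapshot()
	if err != nil {
		fmt.Println("snapshot failed:", err)
		return
	}
	fmt.Printf("heap in use: %d bytes, sys: %d bytes, limit: %d bytes, profile: %d bytes\n",
		s.HeapInuse, s.Sys, s.MemoryLimit, len(s.Profile))
}
```

Note that a Go heap profile only covers allocations made through the Go allocator, so memory obtained outside it would have to be tracked separately.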

Infrastructure Notes

There is no clear evidence that the runner or the TP CN resource limits were recently reduced:

  • TP CN requests/limits observed as 12Gi / 14Gi
  • GOMEMLIMIT=8000MiB
  • first good and first bad runs used the same TKE worker nodes for the TP CNs
  • retained namespace nodes are SA3.2XLARGE16; node memory pressure was false when checked
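
To put the figures reported in this issue side by side, the following Go sketch converts them to common units and computes the remaining headroom. The constants are the values quoted above; the helper names `toGiB` and `headroom` are hypothetical. GOMEMLIMIT is a soft limit on memory managed by the Go runtime, so exceeding it is possible when live data keeps growing. Whether the reported "Go heap" figure is measured the same way the runtime counts memory against that limit is an assumption here.

```go
package main

import "fmt"

const (
	MiB uint64 = 1 << 20
	GiB uint64 = 1 << 30
)

// toGiB converts a byte count to GiB for display.
func toGiB(b uint64) float64 { return float64(b) / float64(GiB) }

// headroom returns limit - used, or 0 when used has already reached limit.
func headroom(limit, used uint64) uint64 {
	if used >= limit {
		return 0
	}
	return limit - used
}

func main() {
	// Values as reported in this issue.
	containerLimit := 14 * GiB
	goMemLimit := 8000 * MiB
	rss := 13*GiB + 512*MiB // ~13.5 GiB
	heapLow, heapHigh := 10*GiB, 12*GiB

	fmt.Printf("GOMEMLIMIT:                 %.4f GiB\n", toGiB(goMemLimit))
	fmt.Printf("RSS headroom to limit:      %.4f GiB\n", toGiB(headroom(containerLimit, rss)))
	fmt.Printf("heap above GOMEMLIMIT:      %.4f to %.4f GiB\n",
		toGiB(heapLow-goMemLimit), toGiB(heapHigh-goMemLimit))
	fmt.Printf("GOMEMLIMIT below the limit: %.4f GiB\n", toGiB(headroom(containerLimit, goMemLimit)))
}
```

With these inputs, 8000 MiB is about 7.81 GiB, and the observed RSS leaves roughly 0.5 GiB before the 14 GiB container limit.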

CI Note

The CI workflow cleanup behavior is being changed separately so failed checkin namespaces are no longer retained by default: matrixorigin/CI#360
