Skip to content

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#12996

Open
kpurdon wants to merge 2 commits intogoogle:masterfrom
kpurdon:kpurdon/12838/fix-taskoom-empty-id
Open

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#12996
kpurdon wants to merge 2 commits intogoogle:masterfrom
kpurdon:kpurdon/12838/fix-taskoom-empty-id

Conversation

@kpurdon
Copy link
Copy Markdown
Contributor

@kpurdon kpurdon commented Apr 23, 2026

Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with reason:Error instead of reason:OOMKilled.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes TaskOOM{ContainerID: s.id}, and s.id is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

Root cause

  1. pkg/shim/v1/manager.go:newCommand() spawns the shim binary with -namespace, -address, and -publish-binary only — no -id.
  2. In the spawned shim, containerd's vendored shim framework parses the flags; -id defaults to "" (runtime/v2/shim/shim.go: flag.StringVar(&id, "id", "", ...)). With manager == nil, that empty id is threaded into runsc.NewTaskService(ctx, id, ...), so runscService.id = "".
  3. Every TaskOOM{ContainerID: s.id} — from the async oom_v2.go's run() and from the sync service.go's checkProcesses() added by fix(shim): synchronously check cgroup for OOM at container exit #12843 — ships with ContainerID="".
  4. containerd's CRI handler (internal/cri/server/events.go) routes TaskOOM via containerStore.Get(e.ContainerID). Get("") returns ErrEmptyPrefix (from internal/truncindex). That is not IsNotFound, so handleEvent wraps and returns the error; containerd logs can't find container for TaskOOM event: prefix can't be empty, and the container's Status.Reason is never set.
  5. TaskExit is unaffected because containerd routes TaskExit by e.ID (init process id), not e.ContainerID — different field, different lookup.

Fix

Don't rely on s.id for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

  • CreateWithFSRestore: register the cgroup with rfs.Create.ID (the per-container id from the CreateTaskRequest) rather than s.id. The id stored in the watcher's cgroups and lastOOM maps is now non-empty and correctly keyed per container, so both the async (EventChan -> run) and sync (isOOM at exit) paths publish with a real ContainerID.
  • checkProcesses: locate the container whose process matched the exit event, use that container's id for isOOM, and use it for both TaskOOM.ContainerID and TaskExit.ContainerID.

oom_v2.go is unchanged — with add() keyed by rfs.Create.ID, the existing async publish path already emits TaskOOM with the correct id via i.id.

Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

  • Before: OOM'd pods show reason:Error; containerd journal shows TaskOOM event (trailing space = empty ContainerID) and Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty".
  • After: OOM'd pods show reason:OOMKilled; journal shows TaskOOM event container_id:"<real-64-hex>" and zero prefix can't be empty errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report reason:OOMKilled on both arches.

Scope

This PR touches only the cgroups v2 init-exit path (CreateWithFSRestore and checkProcesses in runsc/service.go). The cgroups v1 poller (epoll.go) has the same structural reliance on s.id and should get the equivalent treatment — happy to file a follow-up once this lands.

Follow-ups noted in this work (separate PRs)

  • cgroups v1 shim (epoll.go) has the same empty-id TaskOOM path.
  • pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath documents %ID% substitution but the implementation only appends a filename when the path ends in /. As a result the shim currently writes all container logs into a literal /var/log/runsc/%ID%/ directory rather than per-container directories. Noticed while gathering logs for this PR.

@kpurdon kpurdon force-pushed the kpurdon/12838/fix-taskoom-empty-id branch 2 times, most recently from aef69a5 to 9af7cfe Compare April 23, 2026 17:11
@milantracy
Copy link
Copy Markdown
Contributor

Thanks! I'll happy to review when it is ready

@kpurdon kpurdon force-pushed the kpurdon/12838/fix-taskoom-empty-id branch from 9af7cfe to 203ee86 Compare April 23, 2026 19:56
On aarch64 (and any architecture where the containerd shim race
between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed
pods run by gVisor report kubelet reason=Error instead of
reason=OOMKilled.

Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's
newCommand() without the `-id` flag. containerd's shim framework then
defaults `id` to "", which threads through to runscService.id = "".
Every TaskOOM{ContainerID: s.id} therefore carries an empty string.

containerd's CRI event handler routes TaskOOM by ContainerID through
containerStore.Get(e.ContainerID). With an empty id, the truncindex
lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI
event handler wraps and returns the error. The container's
Status.Reason is never set, so kubelet falls back to reason:Error.

TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID.

The prior fix in google#12843 added a synchronous cgroup check at init
exit, which does close the async race — but that path also publishes
TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues
to produce reason=Error after google#12843 merged. Don't rely on s.id.

This change:

* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
  per-container id from the Create request) instead of s.id. The id
  stored in the watcher's cgroups and lastOOM maps is now non-empty
  and correctly keyed, so both async (EventChan -> run) and sync
  (isOOM at exit) paths publish TaskOOM with a real ContainerID.

* checkProcesses: find the container whose process matched the exit
  event, use that container's id for isOOM, and use it for both the
  TaskOOM and TaskExit ContainerID fields.

No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the
existing async publish path already emits TaskOOM with the correct
id via i.id.

Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd
2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods
show reason:Error and journald reports "can't find container for
TaskOOM event: prefix can't be empty"; after, pods show
reason:OOMKilled and the TaskOOM event carries the real container id.

Assisted-by: Claude Code
@kpurdon kpurdon force-pushed the kpurdon/12838/fix-taskoom-empty-id branch from 203ee86 to f4c6013 Compare April 24, 2026 15:29
@kpurdon kpurdon changed the title fix(shim): fan out TaskOOM to per-container IDs (fixes #12838) fix(shim): use per-container id for TaskOOM routing (fixes #12838) Apr 24, 2026
@kpurdon kpurdon marked this pull request as ready for review April 24, 2026 15:37
@kpurdon
Copy link
Copy Markdown
Contributor Author

kpurdon commented Apr 24, 2026

@milantracy validate this fully today, should be good to go!

@milantracy milantracy requested review from fvoznika and milantracy and removed request for milantracy April 30, 2026 22:58
Copy link
Copy Markdown
Contributor

@milantracy milantracy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kpurdon ! i added some comments, but most of change LGTM

Comment thread pkg/shim/v1/runsc/service.go
Comment thread pkg/shim/v1/runsc/service.go
Address review feedback on google#12996 — use the more idiomatic name.
@kpurdon kpurdon requested a review from milantracy May 5, 2026 21:12
copybara-service Bot pushed a commit that referenced this pull request May 6, 2026
## Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

## Root cause

1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`.
2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`.
3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`.
4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set.
5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup.

## Fix

Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

* `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`.
* `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`.

`oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`.

## Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

* **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`.
* **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches.

## Scope

This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands.

## Follow-ups noted in this work (separate PRs)

* cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path.
* `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594
PiperOrigin-RevId: 911236015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OOMKilled reason not reported on aarch64 — container shows reason: Error instead of reason: OOMKilled

2 participants