fix(shim): use per-container id for TaskOOM routing (fixes #12838) #12996
Open
kpurdon wants to merge 2 commits into google:master from kpurdon:kpurdon/12838/fix-taskoom-empty-id
Conversation
Force-pushed aef69a5 to 9af7cfe
Contributor: Thanks! I'll be happy to review when it is ready.
Force-pushed 9af7cfe to 203ee86
On aarch64 (and any architecture where the containerd shim race
between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed
pods run by gVisor are reported by kubelet with reason=Error instead
of reason=OOMKilled.
Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's
newCommand() without the `-id` flag. containerd's shim framework then
defaults `id` to "", which threads through to runscService.id = "".
Every TaskOOM{ContainerID: s.id} therefore carries an empty string.
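The empty-id default is easy to reproduce with stdlib flag parsing. This is an illustrative sketch, not containerd's actual shim code (the real parsing lives in `runtime/v2/shim/shim.go`); `parseShimID` is a hypothetical helper named for this example.

```go
package main

import (
	"flag"
	"fmt"
)

// parseShimID mimics how a shim framework reads its flags: if the
// parent process never passes -id, the flag keeps its "" default.
func parseShimID(args []string) string {
	fs := flag.NewFlagSet("shim", flag.ContinueOnError)
	id := fs.String("id", "", "container id")
	fs.String("namespace", "", "containerd namespace")
	fs.String("address", "", "containerd address")
	_ = fs.Parse(args)
	return *id
}

func main() {
	// Like newCommand(): only -namespace and -address are passed,
	// so the id silently stays empty.
	args := []string{"-namespace", "k8s.io", "-address", "/run/containerd.sock"}
	fmt.Printf("id=%q\n", parseShimID(args)) // prints: id=""
}
```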
containerd's CRI event handler routes TaskOOM by ContainerID through
containerStore.Get(e.ContainerID). With an empty id, the truncindex
lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI
event handler wraps and returns the error. The container's
Status.Reason is never set, so kubelet falls back to reason:Error.
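The routing failure can be sketched with sentinel errors. The names below (`errEmptyPrefix`, `getContainer`, `handleTaskOOM`) are invented stand-ins for containerd's truncindex lookup and CRI event handler; only the shape of the decision (NotFound is tolerated, empty-prefix is not) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errEmptyPrefix = errors.New("prefix can't be empty")
	errNotFound    = errors.New("not found")
)

// getContainer stands in for containerStore.Get: an empty id is
// rejected outright, before any index lookup happens.
func getContainer(id string) error {
	if id == "" {
		return errEmptyPrefix
	}
	return errNotFound // pretend no container matches
}

// handleTaskOOM mirrors the handler's decision: a NotFound is swallowed
// (container already gone), anything else propagates as an error, so
// Status.Reason never becomes OOMKilled.
func handleTaskOOM(containerID string) error {
	if err := getContainer(containerID); err != nil {
		if errors.Is(err, errNotFound) {
			return nil
		}
		return fmt.Errorf("can't find container for TaskOOM event: %w", err)
	}
	return nil
}

func main() {
	// The empty-id bug: prints
	// "can't find container for TaskOOM event: prefix can't be empty"
	fmt.Println(handleTaskOOM(""))
}
```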
TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID.
The prior fix in google#12843 added a synchronous cgroup check at init
exit, which does close the async race — but that path also publishes
TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues
to produce reason=Error after google#12843 merged. Don't rely on s.id.
This change:
* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
per-container id from the Create request) instead of s.id. The id
stored in the watcher's cgroups and lastOOM maps is now non-empty
and correctly keyed, so both async (EventChan -> run) and sync
(isOOM at exit) paths publish TaskOOM with a real ContainerID.
* checkProcesses: find the container whose process matched the exit
event, use that container's id for isOOM, and use it for both the
TaskOOM and TaskExit ContainerID fields.
No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the
existing async publish path already emits TaskOOM with the correct
id via i.id.
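The two changes can be sketched together. Every type and name below (`oomWatcher`, `registerCgroup`, `containerIDForExit`, the `container` struct) is a simplified stand-in invented for this example; only the keying idea, per-container id instead of the service-level s.id, comes from the change itself.

```go
package main

import (
	"fmt"
	"sync"
)

// oomWatcher stands in for the cgroups v2 OOM watcher: keying its map
// by the per-container id makes published TaskOOM events routable.
type oomWatcher struct {
	mu      sync.Mutex
	cgroups map[string]string // container id -> cgroup path
}

func (w *oomWatcher) add(id, cgroupPath string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.cgroups[id] = cgroupPath
}

// createRequest mimics the relevant field of the Create request.
type createRequest struct{ ID string }

// registerCgroup shows the CreateWithFSRestore change: register under
// the Create request's id (never empty), not s.id.
func registerCgroup(w *oomWatcher, req createRequest, cgroupPath string) {
	w.add(req.ID, cgroupPath)
}

// container and containerIDForExit show the checkProcesses change:
// find the container owning the exited process and use its id for both
// TaskOOM and TaskExit, rather than falling back to s.id.
type container struct {
	id      string
	initPid int
}

func containerIDForExit(containers []container, exitPid int) (string, bool) {
	for _, c := range containers {
		if c.initPid == exitPid {
			return c.id, true
		}
	}
	return "", false
}

func main() {
	w := &oomWatcher{cgroups: map[string]string{}}
	registerCgroup(w, createRequest{ID: "abc123"}, "/sys/fs/cgroup/k8s/abc123")

	id, ok := containerIDForExit([]container{{id: "abc123", initPid: 4242}}, 4242)
	fmt.Println(id, ok, len(w.cgroups)) // prints: abc123 true 1
}
```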
Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd
2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods
show reason:Error and journald reports "can't find container for
TaskOOM event: prefix can't be empty"; after, pods show
reason:OOMKilled and the TaskOOM event carries the real container id.
Assisted-by: Claude Code
Force-pushed 203ee86 to f4c6013
Contributor (Author): @milantracy validated this fully today, should be good to go!
milantracy requested changes (Apr 30, 2026):
Thanks @kpurdon! I added some comments, but most of the change LGTM.
Address review feedback on google#12996 — use the more idiomatic name.
milantracy approved these changes (May 6, 2026)
copybara-service bot pushed a commit that referenced this pull request (May 6, 2026):
## Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

## Root cause

1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`.
2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`.
3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`.
4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set.
5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup.

## Fix

Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

* `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`.
* `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`.

`oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`.

## Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

* **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`.
* **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gVisor pods, forced OOM on a 128Mi-limited Python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches.

## Scope

This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands.

## Follow-ups noted in this work (separate PRs)

* cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path.
* `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594
PiperOrigin-RevId: 911236015