fix(shim): use per-container id for TaskOOM routing (fixes #12838) by copybara-service[bot] · Pull Request #13097 · google/gvisor

copybara-service · 2026-05-06T10:56:08Z

fix(shim): use per-container id for TaskOOM routing (fixes #12838)

Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with reason:Error instead of reason:OOMKilled.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes TaskOOM{ContainerID: s.id}, and s.id is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

Root cause

pkg/shim/v1/manager.go:newCommand() spawns the shim binary with -namespace, -address, and -publish-binary only — no -id.
In the spawned shim, containerd's vendored shim framework parses the flags; -id defaults to "" (runtime/v2/shim/shim.go: flag.StringVar(&id, "id", "", ...)). With manager == nil, that empty id is threaded into runsc.NewTaskService(ctx, id, ...), so runscService.id = "".
Every TaskOOM{ContainerID: s.id} — from the async oom_v2.go's run() and from the sync service.go's checkProcesses() added by fix(shim): synchronously check cgroup for OOM at container exit #12843 — ships with ContainerID="".
containerd's CRI handler (internal/cri/server/events.go) routes TaskOOM via containerStore.Get(e.ContainerID). Get("") returns ErrEmptyPrefix (from internal/truncindex). That is not IsNotFound, so handleEvent wraps and returns the error; containerd logs can't find container for TaskOOM event: prefix can't be empty, and the container's Status.Reason is never set.
TaskExit is unaffected because containerd routes TaskExit by e.ID (init process id), not e.ContainerID — different field, different lookup.

Fix

Don't rely on s.id for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

CreateWithFSRestore: register the cgroup with rfs.Create.ID (the per-container id from the CreateTaskRequest) rather than s.id. The id stored in the watcher's cgroups and lastOOM maps is now non-empty and correctly keyed per container, so both the async (EventChan -> run) and sync (isOOM at exit) paths publish with a real ContainerID.
checkProcesses: locate the container whose process matched the exit event, use that container's id for isOOM, and use it for both TaskOOM.ContainerID and TaskExit.ContainerID.

oom_v2.go is unchanged — with add() keyed by rfs.Create.ID, the existing async publish path already emits TaskOOM with the correct id via i.id.

Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

Before: OOM'd pods show reason:Error; containerd journal shows TaskOOM event (trailing space = empty ContainerID) and Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty".
After: OOM'd pods show reason:OOMKilled; journal shows TaskOOM event container_id:"<real-64-hex>" and zero prefix can't be empty errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report reason:OOMKilled on both arches.

Scope

This PR touches only the cgroups v2 init-exit path (CreateWithFSRestore and checkProcesses in runsc/service.go). The cgroups v1 poller (epoll.go) has the same structural reliance on s.id and should get the equivalent treatment — happy to file a follow-up once this lands.

Follow-ups noted in this work (separate PRs)

cgroups v1 shim (epoll.go) has the same empty-id TaskOOM path.
pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath documents %ID% substitution but the implementation only appends a filename when the path ends in /. As a result the shim currently writes all container logs into a literal /var/log/runsc/%ID%/ directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594

## Summary Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`. PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing. ## Root cause 1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`. 2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`. 3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`. 4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set. 5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup. ## Fix Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed. * `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`. * `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`. `oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`. ## Validation End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch): * **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`. * **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors. Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches. ## Scope This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands. ## Follow-ups noted in this work (separate PRs) * cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path. * `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR. FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594 PiperOrigin-RevId: 911236015

copybara-service Bot added the exported Issue was exported automatically label May 6, 2026

copybara-service Bot force-pushed the test/cl911236015 branch from 9eea54a to d6923fe Compare May 6, 2026 16:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097
copybara-service[bot] wants to merge 1 commit intomasterfrom
test/cl911236015

copybara-service Bot commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

copybara-service Bot commented May 6, 2026

Summary

Root cause

Fix

Validation

Scope

Follow-ups noted in this work (separate PRs)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant