Skip to content

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097

Open
copybara-service[bot] wants to merge 1 commit intomasterfrom
test/cl911236015
Open

fix(shim): use per-container id for TaskOOM routing (fixes #12838)#13097
copybara-service[bot] wants to merge 1 commit intomasterfrom
test/cl911236015

Conversation

@copybara-service
Copy link
Copy Markdown

fix(shim): use per-container id for TaskOOM routing (fixes #12838)

Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with reason:Error instead of reason:OOMKilled.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes TaskOOM{ContainerID: s.id}, and s.id is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

Root cause

  1. pkg/shim/v1/manager.go:newCommand() spawns the shim binary with -namespace, -address, and -publish-binary only — no -id.
  2. In the spawned shim, containerd's vendored shim framework parses the flags; -id defaults to "" (runtime/v2/shim/shim.go: flag.StringVar(&id, "id", "", ...)). With manager == nil, that empty id is threaded into runsc.NewTaskService(ctx, id, ...), so runscService.id = "".
  3. Every TaskOOM{ContainerID: s.id} — from the async oom_v2.go's run() and from the sync service.go's checkProcesses() added by fix(shim): synchronously check cgroup for OOM at container exit #12843 — ships with ContainerID="".
  4. containerd's CRI handler (internal/cri/server/events.go) routes TaskOOM via containerStore.Get(e.ContainerID). Get("") returns ErrEmptyPrefix (from internal/truncindex). That is not IsNotFound, so handleEvent wraps and returns the error; containerd logs can't find container for TaskOOM event: prefix can't be empty, and the container's Status.Reason is never set.
  5. TaskExit is unaffected because containerd routes TaskExit by e.ID (init process id), not e.ContainerID — different field, different lookup.

Fix

Don't rely on s.id for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

  • CreateWithFSRestore: register the cgroup with rfs.Create.ID (the per-container id from the CreateTaskRequest) rather than s.id. The id stored in the watcher's cgroups and lastOOM maps is now non-empty and correctly keyed per container, so both the async (EventChan -> run) and sync (isOOM at exit) paths publish with a real ContainerID.
  • checkProcesses: locate the container whose process matched the exit event, use that container's id for isOOM, and use it for both TaskOOM.ContainerID and TaskExit.ContainerID.

oom_v2.go is unchanged — with add() keyed by rfs.Create.ID, the existing async publish path already emits TaskOOM with the correct id via i.id.

Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

  • Before: OOM'd pods show reason:Error; containerd journal shows TaskOOM event (trailing space = empty ContainerID) and Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty".
  • After: OOM'd pods show reason:OOMKilled; journal shows TaskOOM event container_id:"<real-64-hex>" and zero prefix can't be empty errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report reason:OOMKilled on both arches.

Scope

This PR touches only the cgroups v2 init-exit path (CreateWithFSRestore and checkProcesses in runsc/service.go). The cgroups v1 poller (epoll.go) has the same structural reliance on s.id and should get the equivalent treatment — happy to file a follow-up once this lands.

Follow-ups noted in this work (separate PRs)

  • cgroups v1 shim (epoll.go) has the same empty-id TaskOOM path.
  • pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath documents %ID% substitution but the implementation only appends a filename when the path ends in /. As a result the shim currently writes all container logs into a literal /var/log/runsc/%ID%/ directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594

@copybara-service copybara-service Bot added the exported Issue was exported automatically label May 6, 2026
## Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

## Root cause

1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`.
2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`.
3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`.
4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set.
5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup.

## Fix

Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

* `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`.
* `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`.

`oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`.

## Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

* **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`.
* **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gvisor pods, forced OOM on a 128Mi-limited python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches.

## Scope

This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands.

## Follow-ups noted in this work (separate PRs)

* cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path.
* `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594
PiperOrigin-RevId: 911236015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

exported Issue was exported automatically

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OOMKilled reason not reported on aarch64 — container shows reason: Error instead of reason: OOMKilled

1 participant