fix(shim): use per-container id for TaskOOM routing (fixes #12838) #12996
Open
kpurdon wants to merge 2 commits into google:master from kpurdon:kpurdon/12838/fix-taskoom-empty-id
Conversation
Force-pushed aef69a5 to 9af7cfe
Contributor: Thanks! I'll be happy to review when it is ready.
Force-pushed 9af7cfe to 203ee86
On aarch64 (and any architecture where the containerd shim race
between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed
pods run by gVisor are reported by kubelet with reason=Error instead
of reason=OOMKilled.
Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's
newCommand() without the `-id` flag. containerd's shim framework then
defaults `id` to "", which threads through to runscService.id = "".
Every TaskOOM{ContainerID: s.id} therefore carries an empty string.
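The empty-id default is easy to reproduce with stdlib flag parsing. This is an illustrative sketch, not containerd's actual shim code (the real parsing lives in `runtime/v2/shim/shim.go`); `parseShimID` is a hypothetical helper named for this example.

```go
package main

import (
	"flag"
	"fmt"
)

// parseShimID mimics how a shim framework reads its flags: if the
// parent process never passes -id, the flag keeps its "" default.
func parseShimID(args []string) string {
	fs := flag.NewFlagSet("shim", flag.ContinueOnError)
	id := fs.String("id", "", "container id")
	fs.String("namespace", "", "containerd namespace")
	fs.String("address", "", "containerd address")
	_ = fs.Parse(args)
	return *id
}

func main() {
	// Like newCommand(): only -namespace and -address are passed,
	// so the id silently stays empty.
	args := []string{"-namespace", "k8s.io", "-address", "/run/containerd.sock"}
	fmt.Printf("id=%q\n", parseShimID(args)) // prints: id=""
}
```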
containerd's CRI event handler routes TaskOOM by ContainerID through
containerStore.Get(e.ContainerID). With an empty id, the truncindex
lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI
event handler wraps and returns the error. The container's
Status.Reason is never set, so kubelet falls back to reason:Error.
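The routing failure can be sketched with sentinel errors. The names below (`errEmptyPrefix`, `getContainer`, `handleTaskOOM`) are invented stand-ins for containerd's truncindex lookup and CRI event handler; only the shape of the decision (NotFound is tolerated, empty-prefix is not) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errEmptyPrefix = errors.New("prefix can't be empty")
	errNotFound    = errors.New("not found")
)

// getContainer stands in for containerStore.Get: an empty id is
// rejected outright, before any index lookup happens.
func getContainer(id string) error {
	if id == "" {
		return errEmptyPrefix
	}
	return errNotFound // pretend no container matches
}

// handleTaskOOM mirrors the handler's decision: a NotFound is swallowed
// (container already gone), anything else propagates as an error, so
// Status.Reason never becomes OOMKilled.
func handleTaskOOM(containerID string) error {
	if err := getContainer(containerID); err != nil {
		if errors.Is(err, errNotFound) {
			return nil
		}
		return fmt.Errorf("can't find container for TaskOOM event: %w", err)
	}
	return nil
}

func main() {
	// The empty-id bug: prints
	// "can't find container for TaskOOM event: prefix can't be empty"
	fmt.Println(handleTaskOOM(""))
}
```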
TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID.
The prior fix in google#12843 added a synchronous cgroup check at init
exit, which does close the async race — but that path also publishes
TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues
to produce reason=Error after google#12843 merged. Don't rely on s.id.
This change:
* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
per-container id from the Create request) instead of s.id. The id
stored in the watcher's cgroups and lastOOM maps is now non-empty
and correctly keyed, so both async (EventChan -> run) and sync
(isOOM at exit) paths publish TaskOOM with a real ContainerID.
* checkProcesses: find the container whose process matched the exit
event, use that container's id for isOOM, and use it for both the
TaskOOM and TaskExit ContainerID fields.
No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the
existing async publish path already emits TaskOOM with the correct
id via i.id.
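The two changes can be sketched together. Every type and name below (`oomWatcher`, `registerCgroup`, `containerIDForExit`, the `container` struct) is a simplified stand-in invented for this example; only the keying idea, per-container id instead of the service-level s.id, comes from the change itself.

```go
package main

import (
	"fmt"
	"sync"
)

// oomWatcher stands in for the cgroups v2 OOM watcher: keying its map
// by the per-container id makes published TaskOOM events routable.
type oomWatcher struct {
	mu      sync.Mutex
	cgroups map[string]string // container id -> cgroup path
}

func (w *oomWatcher) add(id, cgroupPath string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.cgroups[id] = cgroupPath
}

// createRequest mimics the relevant field of the Create request.
type createRequest struct{ ID string }

// registerCgroup shows the CreateWithFSRestore change: register under
// the Create request's id (never empty), not s.id.
func registerCgroup(w *oomWatcher, req createRequest, cgroupPath string) {
	w.add(req.ID, cgroupPath)
}

// container and containerIDForExit show the checkProcesses change:
// find the container owning the exited process and use its id for both
// TaskOOM and TaskExit, rather than falling back to s.id.
type container struct {
	id      string
	initPid int
}

func containerIDForExit(containers []container, exitPid int) (string, bool) {
	for _, c := range containers {
		if c.initPid == exitPid {
			return c.id, true
		}
	}
	return "", false
}

func main() {
	w := &oomWatcher{cgroups: map[string]string{}}
	registerCgroup(w, createRequest{ID: "abc123"}, "/sys/fs/cgroup/k8s/abc123")

	id, ok := containerIDForExit([]container{{id: "abc123", initPid: 4242}}, 4242)
	fmt.Println(id, ok, len(w.cgroups)) // prints: abc123 true 1
}
```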
Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd
2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods
show reason:Error and journald reports "can't find container for
TaskOOM event: prefix can't be empty"; after, pods show
reason:OOMKilled and the TaskOOM event carries the real container id.
Assisted-by: Claude Code
Force-pushed 203ee86 to f4c6013
Contributor (Author): @milantracy validated this fully today, should be good to go!
milantracy requested changes (Apr 30, 2026):
Thanks @kpurdon! I added some comments, but most of the change LGTM.
Address review feedback on google#12996 — use the more idiomatic name.
milantracy approved these changes (May 6, 2026)
copybara-service bot pushed a commit that referenced this pull request (May 6, 2026):
## Summary

Fixes #12838 for real. On aarch64 (and any architecture where the containerd race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor are reported by kubelet with `reason:Error` instead of `reason:OOMKilled`.

PR #12843 added a synchronous cgroup check at init exit to close the async OOM-vs-exit race. That part works — but it also publishes `TaskOOM{ContainerID: s.id}`, and `s.id` is empty in the normal shim spawn path. So on aarch64 the symptom survives #12843 because the bug is routing, not just timing.

## Root cause

1. `pkg/shim/v1/manager.go:newCommand()` spawns the shim binary with `-namespace`, `-address`, and `-publish-binary` only — no `-id`.
2. In the spawned shim, containerd's vendored shim framework parses the flags; `-id` defaults to `""` (`runtime/v2/shim/shim.go`: `flag.StringVar(&id, "id", "", ...)`). With `manager == nil`, that empty id is threaded into `runsc.NewTaskService(ctx, id, ...)`, so `runscService.id = ""`.
3. Every `TaskOOM{ContainerID: s.id}` — from the async `oom_v2.go`'s `run()` and from the sync `service.go`'s `checkProcesses()` added by #12843 — ships with `ContainerID=""`.
4. containerd's CRI handler (`internal/cri/server/events.go`) routes TaskOOM via `containerStore.Get(e.ContainerID)`. `Get("")` returns `ErrEmptyPrefix` (from `internal/truncindex`). That is **not** `IsNotFound`, so `handleEvent` wraps and returns the error; containerd logs `can't find container for TaskOOM event: prefix can't be empty`, and the container's `Status.Reason` is never set.
5. `TaskExit` is unaffected because containerd routes TaskExit by `e.ID` (init process id), not `e.ContainerID` — different field, different lookup.

## Fix

Don't rely on `s.id` for TaskOOM routing. Use the per-container id that's already available at the point the OOM poller is wired up and at the point the exit is observed.

* `CreateWithFSRestore`: register the cgroup with `rfs.Create.ID` (the per-container id from the CreateTaskRequest) rather than `s.id`. The id stored in the watcher's `cgroups` and `lastOOM` maps is now non-empty and correctly keyed per container, so both the async (`EventChan -> run`) and sync (`isOOM` at exit) paths publish with a real `ContainerID`.
* `checkProcesses`: locate the container whose process matched the exit event, use that container's id for `isOOM`, and use it for both `TaskOOM.ContainerID` and `TaskExit.ContainerID`.

`oom_v2.go` is unchanged — with `add()` keyed by `rfs.Create.ID`, the existing async publish path already emits TaskOOM with the correct id via `i.id`.

## Validation

End-to-end in a prod-matching environment (AL2023 aarch64, kernel 6.12, containerd 2.2.1, kubelet 1.34, runsc built from this branch):

* **Before**: OOM'd pods show `reason:Error`; containerd journal shows `TaskOOM event ` (trailing space = empty ContainerID) and `Failed to handle backOff event for error="can't find container for TaskOOM event: prefix can't be empty"`.
* **After**: OOM'd pods show `reason:OOMKilled`; journal shows `TaskOOM event container_id:"<real-64-hex>"` and zero `prefix can't be empty` errors.

Also soaked on staging at Semgrep with real workloads across both x86_64 and aarch64 nodepools (multi-container gVisor pods, forced OOM on a 128Mi-limited Python container alongside a busybox sidecar). Main-container OOMs now consistently report `reason:OOMKilled` on both arches.

## Scope

This PR touches only the cgroups v2 init-exit path (`CreateWithFSRestore` and `checkProcesses` in `runsc/service.go`). The cgroups v1 poller (`epoll.go`) has the same structural reliance on `s.id` and should get the equivalent treatment — happy to file a follow-up once this lands.

## Follow-ups noted in this work (separate PRs)

* cgroups v1 shim (`epoll.go`) has the same empty-id TaskOOM path.
* `pkg/shim/v1/runsccmd/utils.go:FormatShimLogPath` documents `%ID%` substitution but the implementation only appends a filename when the path ends in `/`. As a result the shim currently writes all container logs into a literal `/var/log/runsc/%ID%/` directory rather than per-container directories. Noticed while gathering logs for this PR.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12996 from kpurdon:kpurdon/12838/fix-taskoom-empty-id ef14594
PiperOrigin-RevId: 911236015