fix(shim): synchronously check cgroup for OOM at container exit #12843
Merged
copybara-service[bot] merged 1 commit into google:master from kpurdon:kpurdon/12838/fix-aarch64-oom-reporting on Apr 9, 2026
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Force-pushed da73340 to 89555b6
On aarch64, the async OOM notification via inotify/`EventChan` (cgroups v2) loses the race against the container exit notification from `runtime.Wait()`. This causes containerd to process the `TaskExit` before `TaskOOM` arrives, leading kubelet to report `reason=Error` instead of `reason=OOMKilled`.

Add a synchronous check of the cgroup's `memory.events` at container exit time. When OOM kills are detected, publish the `TaskOOM` event through the same sequential event pipeline as `TaskExit`, guaranteeing the OOM event arrives first. The async and sync paths coordinate through a shared `lastOOM` map under a mutex to prevent duplicate events.

Fixes #12838
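A minimal sketch of that dedup coordination, with simplified stand-in types (`oomEventCoordinator`, `maybePublishOOM`, and the `publish` callback are hypothetical names; only the shared map-under-mutex pattern comes from the description above):

```go
package main

import "sync"

// Hypothetical coordinator illustrating the shared lastOOM map described
// above; the real shim wires this into its OOM poller and event pipeline.
type oomEventCoordinator struct {
	mu      sync.Mutex
	lastOOM map[string]uint64 // container ID -> last published oom_kill count
}

// maybePublishOOM lets either path (async watcher or sync exit-time check)
// publish a TaskOOM exactly once per observed kill: whichever path sees the
// new oom_kill count first wins, and the other becomes a no-op.
func (c *oomEventCoordinator) maybePublishOOM(id string, oomKills uint64, publish func(string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if oomKills <= c.lastOOM[id] {
		return // already published for this kill count
	}
	c.lastOOM[id] = oomKills
	publish(id)
}
```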
Force-pushed 89555b6 to f1c7618
milantracy approved these changes on Apr 8, 2026
copybara-service bot pushed a commit that referenced this pull request on Apr 9, 2026
## Why

On aarch64 (ARM64/Graviton), the async OOM notification via inotify/`EventChan` (cgroups v2) loses the race against the container exit from `runtime.Wait()`. Containerd processes the `TaskExit` before `TaskOOM` arrives, so kubelet reports `reason: Error` instead of `reason: OOMKilled`. This breaks monitoring, autoscaling, and alerting on aarch64 gVisor nodes.

## What

- Added `isOOM(id string) bool` to the `oomPoller` interface for synchronous OOM checking at exit time (one-shot: consumes the cgroup reference)
- Implemented `isOOM` for cgroups v2 (`watcherV2`) using `cg.Stat()` to read `memory.events` directly, bypassing the async inotify path (see the sketch after this message)
- Promoted `lastOOMMap` from goroutine-local in `run()` to a shared mutex-protected field, coordinating between async and sync paths to prevent duplicate `TaskOOM` events
- Guarded the sync OOM check to init process exits only — exec processes share the sandbox cgroup and would produce spurious events
- Added a no-op `isOOM` for cgroups v1 (`epoller`) since its eventfd mechanism doesn't have this race
- Added 6 unit tests covering async publish, dedup, sync/async coordination, and error cleanup

Fixes #12838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12843 from kpurdon:kpurdon/12838/fix-aarch64-oom-reporting f1c7618

PiperOrigin-RevId: 896802037
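A hedged sketch of the synchronous check, assuming the `github.com/containerd/cgroups/v3/cgroup2` API; the shim's actual `watcherV2.isOOM` also handles the one-shot cgroup-reference cleanup noted above and may differ in structure:

```go
package main

import (
	"github.com/containerd/cgroups/v3/cgroup2"
)

// isOOM reads memory.events synchronously via Stat() instead of waiting on
// the inotify-backed EventChan. A nonzero oom_kill count means a process in
// this cgroup was killed by the OOM killer.
func isOOM(cg *cgroup2.Manager) bool {
	metrics, err := cg.Stat()
	if err != nil || metrics.GetMemoryEvents() == nil {
		return false // no cgroup data available; report no OOM
	}
	return metrics.GetMemoryEvents().GetOomKill() > 0
}
```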
Contributor (Author)
@milantracy when should I expect to see this in a release? Thanks!
Collaborator
We usually cut releases on Wednesdays. Should be included in the next release.
maci0 pushed a commit to maci0/gvisor that referenced this pull request on Apr 23, 2026

…4-oom-reporting PiperOrigin-RevId: 896927425
kpurdon added a commit to kpurdon/gvisor that referenced this pull request on Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet reason=Error instead of reason=OOMKilled. Root cause: the shim manager's newCommand() spawns the shim binary without -id, so containerd's shim framework defaults id to "". NewTaskService then sets runscService.id = "", and every TaskOOM{ContainerID: s.id} carries an empty string. containerd's CRI does containerStore.Get("") which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is dropped and the container's Reason is never set to OOMKilled. TaskExit still works because containerd routes it by e.ID (the init process id), not e.ContainerID.

Fix: do not rely on s.id for TaskOOM routing (see the sketch after this message). In the async watcher, take a containerIDs callback that returns all container IDs currently tracked by the service, and publish a TaskOOM per id (unknown ids return NotFound and are silently ignored by containerd, so the real app container is the only one that gets marked reason=OOMKilled). In the sync-at-exit path, find the container whose init process matched the exit, use its id for isOOM, and fan out on confirmed OOM. Also use the per-container id in TaskExit.ContainerID for consistency.

Fixes google#12838 (again — the prior synchronous-cgroup-check approach in google#12843 still published with ContainerID="" so kubelet continued to see reason=Error).
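A sketch of that fan-out with simplified plumbing (`publishOOMFanOut` and its callbacks are hypothetical names; the one-TaskOOM-per-tracked-id behavior is from the commit message):

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd/api/events"
)

// publishOOMFanOut publishes a TaskOOM for every container the service
// currently tracks, instead of a single event keyed by the (empty) s.id.
// Unknown IDs return NotFound on the containerd side and are ignored, so
// only the container that actually sits in the OOMing cgroup is marked.
func publishOOMFanOut(ctx context.Context, containerIDs func() []string,
	publish func(context.Context, *events.TaskOOM) error) {
	for _, id := range containerIDs() {
		if err := publish(ctx, &events.TaskOOM{ContainerID: id}); err != nil {
			log.Printf("publish TaskOOM for %q: %v", id, err)
		}
	}
}
```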
kpurdon added a commit to kpurdon/gvisor that referenced this pull request on Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet reason=Error instead of reason=OOMKilled. Root cause: the shim manager's newCommand() spawns the shim binary without -id, so containerd's shim framework defaults id to "". NewTaskService then sets runscService.id = "", and every TaskOOM{ContainerID: s.id} carries an empty string. containerd's CRI does containerStore.Get("") which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is dropped and the container's Reason is never set to OOMKilled. TaskExit is unaffected because containerd routes TaskExit by e.ID (init process id), not e.ContainerID — two different fields on two different lookup paths.

Fix: do not rely on s.id for TaskOOM routing. Use the per-container id instead (sketched after this message):

- pkg/shim/v1/runsc/service.go:
  - CreateWithFSRestore: register the cgroup with rfs.Create.ID (the per-container id from the Create request) rather than s.id. This makes the id stored in the watcher's cgroups/lastOOM maps non-empty and correctly keyed, so both the async (EventChan -> run) and sync (isOOM at exit) paths have a real id to publish TaskOOM with.
  - checkProcesses: find the container whose process matched the exit, use that container's id for isOOM and for both TaskOOM and TaskExit ContainerID fields. Drops the reliance on s.id entirely.

No changes to oom_v2.go — with add() keyed by rfs.Create.ID, the existing async publish path in run() already emits TaskOOM with the correct id via i.id.

Fixes google#12838 (the prior synchronous-cgroup-check in google#12843 also published with ContainerID="" so kubelet continued to see reason=Error on aarch64).

Assisted-by: Claude Code
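A sketch of the exit-path resolution described above; `container`, `service`, and the publish plumbing are simplified stand-ins, while the per-container keying, the `isOOM` call, and the event fields follow the commit message:

```go
package main

import (
	"context"

	"github.com/containerd/containerd/api/events"
)

// Simplified stand-ins for the shim's real types.
type container struct {
	id      string
	initPID int
}

type oomPoller interface{ isOOM(id string) bool }

type service struct {
	containers []*container
	oom        oomPoller
	publish    func(ctx context.Context, topic string, evt any)
}

// checkProcesses resolves the container whose init process matched the
// exit and uses its id (not the shim-level s.id) for both events.
func (s *service) checkProcesses(ctx context.Context, pid, status int) {
	for _, c := range s.containers {
		if c.initPID != pid {
			continue
		}
		if s.oom.isOOM(c.id) { // synchronous cgroup check before TaskExit
			s.publish(ctx, "task-oom", &events.TaskOOM{ContainerID: c.id})
		}
		s.publish(ctx, "task-exit", &events.TaskExit{
			ContainerID: c.id, // per-container id, never the empty s.id
			ID:          c.id, // the init process id matches the container id
			Pid:         uint32(pid),
			ExitStatus:  uint32(status),
		})
		return
	}
}
```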
kpurdon added a commit to kpurdon/gvisor that referenced this pull request on Apr 24, 2026
On aarch64 (and any architecture where the containerd shim race between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed pods run by gVisor report kubelet reason=Error instead of reason=OOMKilled.

Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's newCommand() without the `-id` flag. containerd's shim framework then defaults `id` to "", which threads through to runscService.id = "". Every TaskOOM{ContainerID: s.id} therefore carries an empty string. containerd's CRI event handler routes TaskOOM by ContainerID through containerStore.Get(e.ContainerID). With an empty id, the truncindex lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI event handler wraps and returns the error. The container's Status.Reason is never set, so kubelet falls back to reason:Error. TaskExit is unaffected because containerd routes TaskExit by e.ID (init process id), not e.ContainerID.

The prior fix in google#12843 added a synchronous cgroup check at init exit, which does close the async race — but that path also publishes TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues to produce reason=Error after google#12843 merged. Don't rely on s.id. This change:

* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the per-container id from the Create request) instead of s.id. The id stored in the watcher's cgroups and lastOOM maps is now non-empty and correctly keyed, so both async (EventChan -> run) and sync (isOOM at exit) paths publish TaskOOM with a real ContainerID.
* checkProcesses: find the container whose process matched the exit event, use that container's id for isOOM, and use it for both the TaskOOM and TaskExit ContainerID fields.

No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the existing async publish path already emits TaskOOM with the correct id via i.id.

Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd 2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods show reason:Error and journald reports "can't find container for TaskOOM event: prefix can't be empty"; after, pods show reason:OOMKilled and the TaskOOM event carries the real container id.

Assisted-by: Claude Code
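For concreteness, a hedged illustration of the two routing keys using containerd's event types; the comments paraphrase the behavior described above rather than quote containerd's source:

```go
package main

import "github.com/containerd/containerd/api/events"

// buildEvents shows why TaskExit survived the empty-id bug while TaskOOM
// did not: the two events carry different routing keys on the CRI side.
func buildEvents(containerID, initProcessID string, pid, status uint32) (*events.TaskOOM, *events.TaskExit) {
	// TaskOOM is routed by ContainerID; with s.id == "" the lookup fails
	// with "prefix can't be empty" and the event is dropped.
	oom := &events.TaskOOM{ContainerID: containerID}

	// TaskExit is routed by ID (the init process id), so it kept working
	// even while ContainerID was empty.
	exit := &events.TaskExit{
		ContainerID: containerID,
		ID:          initProcessID,
		Pid:         pid,
		ExitStatus:  status,
	}
	return oom, exit
}
```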