fix(shim): synchronously check cgroup for OOM at container exit #12843

Merged
copybara-service[bot] merged 1 commit into google:master from
kpurdon:kpurdon/12838/fix-aarch64-oom-reporting
Apr 9, 2026

Conversation

kpurdon (Contributor) commented Apr 1, 2026

Why

On aarch64 (ARM64/Graviton), the async OOM notification via inotify/EventChan (cgroups v2) loses the race against the container exit from runtime.Wait(). Containerd processes the TaskExit before TaskOOM arrives, so kubelet reports reason: Error instead of reason: OOMKilled. This breaks monitoring, autoscaling, and alerting on aarch64 gVisor nodes.

What

  • Added `isOOM(id string) bool` to the `oomPoller` interface for synchronous OOM checking at exit time (one-shot: consumes the cgroup reference)
  • Implemented `isOOM` for cgroups v2 (`watcherV2`) using `cg.Stat()` to read `memory.events` directly, bypassing the async inotify path (sketched below)
  • Promoted `lastOOMMap` from goroutine-local in `run()` to a shared mutex-protected field, coordinating between async and sync paths to prevent duplicate `TaskOOM` events
  • Guarded the sync OOM check to init process exits only — exec processes share the sandbox cgroup and would produce spurious events
  • Added no-op `isOOM` for cgroups v1 (`epoller`) since its eventfd mechanism doesn't have this race
  • Added 6 unit tests covering async publish, dedup, sync/async coordination, and error cleanup

Fixes #12838
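
This is not the merged code, but the v2 check described above reduces to roughly the following sketch. It assumes the `github.com/containerd/cgroups/v2` API (`Stat()` returning metrics with a `MemoryEvents.OomKill` counter); the watcher's internal field names (`cgroups`, `lastOOM`, `mu`) are guesses from the summary:

```go
package runsc

import (
	"sync"

	cgroupsv2 "github.com/containerd/cgroups/v2"
)

// Field names are guesses; only the behavior is taken from the PR text.
type watcherV2 struct {
	mu      sync.Mutex
	cgroups map[string]*cgroupsv2.Manager // tracked container cgroups
	lastOOM map[string]uint64             // oom_kill counts already published async
}

// isOOM synchronously reads memory.events via Stat(), bypassing the
// async inotify path. One-shot: the cgroup reference is consumed.
func (w *watcherV2) isOOM(id string) bool {
	w.mu.Lock()
	cg, ok := w.cgroups[id]
	delete(w.cgroups, id)
	w.mu.Unlock()
	if !ok {
		return false
	}
	stats, err := cg.Stat()
	if err != nil || stats.MemoryEvents == nil {
		return false
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	if stats.MemoryEvents.OomKill > w.lastOOM[id] {
		// A kill the async path hasn't published yet: record it so the
		// two paths never emit duplicate TaskOOM events.
		w.lastOOM[id] = stats.MemoryEvents.OomKill
		return true
	}
	return false
}
```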

google-cla Bot commented Apr 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

kpurdon force-pushed the kpurdon/12838/fix-aarch64-oom-reporting branch 2 times, most recently from da73340 to 89555b6 on April 1, 2026 17:04
On aarch64, the async OOM notification via inotify/EventChan (cgroups
v2) loses the race against the container exit notification from
runtime.Wait(). This causes containerd to process the TaskExit before
TaskOOM arrives, leading kubelet to report reason=Error instead of
reason=OOMKilled.

Add a synchronous check of the cgroup's memory.events at container exit
time. When OOM kills are detected, publish the TaskOOM event through the
same sequential event pipeline as TaskExit, guaranteeing the OOM event
arrives first. The async and sync paths coordinate through a shared
lastOOM map under mutex to prevent duplicate events.

Fixes #12838
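
Read as pseudocode for the ordering guarantee, everything below is an illustrative stand-in for the shim's internals, not its real types; only the enqueue order and the init-only guard come from the message above:

```go
package main

// Illustrative stand-ins for the shim's event types.
type taskOOM struct{ ContainerID string }
type taskExit struct {
	ContainerID, ID string
	Pid, ExitStatus uint32
}

type oomPoller interface {
	isOOM(id string) bool // one-shot synchronous cgroup check
}

type service struct {
	id     string
	oom    oomPoller
	events chan any // the sequential pipeline shared by all task events
}

// reportExit enqueues TaskOOM before TaskExit on the same channel, so a
// consumer that publishes events in order delivers the OOM first and
// kubelet sees reason=OOMKilled. Only init exits trigger the check:
// exec processes share the sandbox cgroup and would report spuriously.
func (s *service) reportExit(initExit bool, procID string, pid, status uint32) {
	if initExit && s.oom.isOOM(s.id) {
		s.events <- taskOOM{ContainerID: s.id}
	}
	s.events <- taskExit{ContainerID: s.id, ID: procID, Pid: pid, ExitStatus: status}
}
```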
kpurdon force-pushed the kpurdon/12838/fix-aarch64-oom-reporting branch from 89555b6 to f1c7618 on April 1, 2026 17:46
kpurdon marked this pull request as ready for review on April 1, 2026 17:49
tranji-cloud requested a review from milantracy on April 3, 2026 23:51
copybara-service Bot pushed a commit that referenced this pull request Apr 9, 2026
## Why

On aarch64 (ARM64/Graviton), the async OOM notification via inotify/`EventChan` (cgroups v2) loses the race against the container exit from `runtime.Wait()`. Containerd processes the `TaskExit` before `TaskOOM` arrives, so kubelet reports `reason: Error` instead of `reason: OOMKilled`. This breaks monitoring, autoscaling, and alerting on aarch64 gVisor nodes.

## What

- Added `isOOM(id string) bool` to the `oomPoller` interface for synchronous OOM checking at exit time (one-shot: consumes the cgroup reference)
- Implemented `isOOM` for cgroups v2 (`watcherV2`) using `cg.Stat()` to read `memory.events` directly, bypassing the async inotify path
- Promoted `lastOOMMap` from goroutine-local in `run()` to a shared mutex-protected field, coordinating between async and sync paths to prevent duplicate `TaskOOM` events
- Guarded the sync OOM check to init process exits only — exec processes share the sandbox cgroup and would produce spurious events
- Added no-op `isOOM` for cgroups v1 (`epoller`) since its eventfd mechanism doesn't have this race
- Added 6 unit tests covering async publish, dedup, sync/async coordination, and error cleanup

Fixes #12838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12843 from kpurdon:kpurdon/12838/fix-aarch64-oom-reporting f1c7618
PiperOrigin-RevId: 896802037
copybara-service Bot merged commit 7b4fe74 into google:master Apr 9, 2026
6 checks passed
kpurdon deleted the kpurdon/12838/fix-aarch64-oom-reporting branch April 10, 2026 17:55
kpurdon (Contributor, Author) commented Apr 13, 2026

@milantracy when should I expect to see this in a release? Thanks!

ayushr2 (Collaborator) commented Apr 13, 2026

We usually cut releases on Wednesdays. Should be included in the next release.

maci0 pushed a commit to maci0/gvisor that referenced this pull request Apr 23, 2026
…4-oom-reporting

PiperOrigin-RevId: 896927425
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet
reason=Error instead of reason=OOMKilled. Root cause: the shim
manager's newCommand() spawns the shim binary without -id, so
containerd's shim framework defaults id to "". NewTaskService then
sets runscService.id = "", and every TaskOOM{ContainerID: s.id}
carries an empty string. containerd's CRI does containerStore.Get("")
which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is
dropped and the container's Reason is never set to OOMKilled. TaskExit
still works because containerd routes it by e.ID (the init process id),
not e.ContainerID.

Fix: do not rely on s.id for TaskOOM routing. In the async watcher,
take a containerIDs callback that returns all container IDs currently
tracked by the service, and publish a TaskOOM per id (unknown ids
return NotFound and are silently ignored by containerd, so the real
app container is the only one that gets marked reason=OOMKilled). In
the sync-at-exit path, find the container whose init process matched
the exit, use its id for isOOM, and fan out on confirmed OOM. Also use
the per-container id in TaskExit.ContainerID for consistency.

Fixes google#12838 (again — the prior synchronous-cgroup-check approach in
google#12843 still published with ContainerID="" so kubelet continued to see
reason=Error).
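
A sketch of that fan-out; `containerIDs` is the callback named in the message, the rest is illustrative glue:

```go
// oomWatcher stands in for the async watcher; only the containerIDs
// callback and the per-id publish come from the commit message.
type oomWatcher struct {
	containerIDs func() []string // all ids currently tracked by the service
}

// onOOM publishes one TaskOOM per tracked container rather than a single
// event keyed by the (possibly empty) shim id. containerd answers
// NotFound for ids it doesn't track and the CRI plugin ignores those,
// so only the real app container ends up with reason=OOMKilled.
func (w *oomWatcher) onOOM(publish func(containerID string) error) {
	for _, id := range w.containerIDs() {
		_ = publish(id) // NotFound for unrelated ids is expected; ignore it
	}
}
```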
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet
reason=Error instead of reason=OOMKilled. Root cause: the shim
manager's newCommand() spawns the shim binary without -id, so
containerd's shim framework defaults id to "". NewTaskService then
sets runscService.id = "", and every TaskOOM{ContainerID: s.id}
carries an empty string. containerd's CRI does containerStore.Get("")
which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is
dropped and the container's Reason is never set to OOMKilled.

TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID — two different fields on two
different lookup paths.

Fix: do not rely on s.id for TaskOOM routing. Use the per-container
id instead:

- pkg/shim/v1/runsc/service.go:
  - CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
    per-container id from the Create request) rather than s.id. This
    makes the id stored in the watcher's cgroups/lastOOM maps non-empty
    and correctly keyed, so both the async (EventChan -> run) and sync
    (isOOM at exit) paths have a real id to publish TaskOOM with.
  - checkProcesses: find the container whose process matched the exit,
    use that container's id for isOOM and for both TaskOOM and TaskExit
    ContainerID fields. Drops the reliance on s.id entirely.

No changes to oom_v2.go — with add() keyed by rfs.Create.ID, the
existing async publish path in run() already emits TaskOOM with the
correct id via i.id.

Fixes google#12838 (the prior synchronous-cgroup-check in google#12843 also
published with ContainerID="" so kubelet continued to see reason=Error
on aarch64).

Assisted-by: Claude Code
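
The re-keying reduces to a one-line change. In this sketch everything except the `rfs.Create.ID` vs. `s.id` choice is a minimal stand-in:

```go
package main

import "context"

// Minimal stand-ins; only rfs.Create.ID vs. s.id is the point.
type createRequest struct{ ID, Bundle string }
type restoreRequest struct{ Create createRequest }
type cgroup struct{}

type oomAdder interface{ add(id string, cg *cgroup) error }

type service struct {
	id  string // "" when the shim is spawned without -id
	oom oomAdder
}

func openContainerCgroup(bundle string) (*cgroup, error) { return &cgroup{}, nil } // stub

// CreateWithFSRestore registers the cgroup under the per-container id
// carried by the Create request, never under s.id.
func (s *service) CreateWithFSRestore(ctx context.Context, rfs *restoreRequest) error {
	cg, err := openContainerCgroup(rfs.Create.Bundle)
	if err != nil {
		return err
	}
	// Before this fix: s.oom.add(s.id, cg), which keyed the watcher's
	// cgroups and lastOOM maps by the empty string.
	return s.oom.add(rfs.Create.ID, cg)
}
```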
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 24, 2026
On aarch64 (and any architecture where the containerd shim race
between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed
pods run by gVisor report kubelet reason=Error instead of
reason=OOMKilled.

Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's
newCommand() without the `-id` flag. containerd's shim framework then
defaults `id` to "", which threads through to runscService.id = "".
Every TaskOOM{ContainerID: s.id} therefore carries an empty string.

containerd's CRI event handler routes TaskOOM by ContainerID through
containerStore.Get(e.ContainerID). With an empty id, the truncindex
lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI
event handler wraps and returns the error. The container's
Status.Reason is never set, so kubelet falls back to reason:Error.

TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID.

The prior fix in google#12843 added a synchronous cgroup check at init
exit, which does close the async race — but that path also publishes
TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues
to produce reason=Error after google#12843 merged. Don't rely on s.id.

This change:

* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
  per-container id from the Create request) instead of s.id. The id
  stored in the watcher's cgroups and lastOOM maps is now non-empty
  and correctly keyed, so both async (EventChan -> run) and sync
  (isOOM at exit) paths publish TaskOOM with a real ContainerID.

* checkProcesses: find the container whose process matched the exit
  event, use that container's id for isOOM, and use it for both the
  TaskOOM and TaskExit ContainerID fields.

No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the
existing async publish path already emits TaskOOM with the correct
id via i.id.

Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd
2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods
show reason:Error and journald reports "can't find container for
TaskOOM event: prefix can't be empty"; after, pods show
reason:OOMKilled and the TaskOOM event carries the real container id.

Assisted-by: Claude Code
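
A minimal snippet contrasting the two routing fields, using the real `containerd/api/events` types; the routing behavior in the comments restates the message above:

```go
package main

import (
	"fmt"

	events "github.com/containerd/containerd/api/events"
)

// The CRI plugin routes TaskOOM by ContainerID but TaskExit by ID (the
// init process id), so the exit survives an empty ContainerID and the
// OOM does not.
func main() {
	oom := &events.TaskOOM{ContainerID: ""} // Get("") -> ErrEmptyPrefix: event dropped
	exit := &events.TaskExit{
		ContainerID: "",              // not used for routing
		ID:          "app-container", // routed by this field: still delivered
		Pid:         1,
		ExitStatus:  137,
	}
	fmt.Println(oom, exit)
}
```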
Successfully merging this pull request may close these issues.

OOMKilled reason not reported on aarch64 — container shows reason: Error instead of reason: OOMKilled

3 participants