fix(shim): synchronously check cgroup for OOM at container exit #12843

Merged
copybara-service[bot] merged 1 commit into google:master from
kpurdon:kpurdon/12838/fix-aarch64-oom-reporting
Apr 9, 2026

Conversation

kpurdon (Contributor) commented Apr 1, 2026

Why

On aarch64 (ARM64/Graviton), the async OOM notification via inotify/EventChan (cgroups v2) loses the race against the container exit from runtime.Wait(). Containerd processes the TaskExit before TaskOOM arrives, so kubelet reports reason: Error instead of reason: OOMKilled. This breaks monitoring, autoscaling, and alerting on aarch64 gVisor nodes.

What

  • Added `isOOM(id string) bool` to the `oomPoller` interface for synchronous OOM checking at exit time (one-shot: consumes the cgroup reference)
  • Implemented `isOOM` for cgroups v2 (`watcherV2`) using `cg.Stat()` to read `memory.events` directly, bypassing the async inotify path (sketched below)
  • Promoted `lastOOMMap` from goroutine-local in `run()` to a shared mutex-protected field, coordinating between async and sync paths to prevent duplicate `TaskOOM` events
  • Guarded the sync OOM check to init process exits only — exec processes share the sandbox cgroup and would produce spurious events
  • Added no-op `isOOM` for cgroups v1 (`epoller`) since its eventfd mechanism doesn't have this race
  • Added 6 unit tests covering async publish, dedup, sync/async coordination, and error cleanup

Fixes #12838
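
This is not the merged code, but the v2 check described above reduces to roughly the following sketch. It assumes the `github.com/containerd/cgroups/v2` API (`Stat()` returning metrics with a `MemoryEvents.OomKill` counter); the watcher's internal field names (`cgroups`, `lastOOM`, `mu`) are guesses from the summary:

```go
package runsc

import (
	"sync"

	cgroupsv2 "github.com/containerd/cgroups/v2"
)

// Field names are guesses; only the behavior is taken from the PR text.
type watcherV2 struct {
	mu      sync.Mutex
	cgroups map[string]*cgroupsv2.Manager // tracked container cgroups
	lastOOM map[string]uint64             // oom_kill counts already published async
}

// isOOM synchronously reads memory.events via Stat(), bypassing the
// async inotify path. One-shot: the cgroup reference is consumed.
func (w *watcherV2) isOOM(id string) bool {
	w.mu.Lock()
	cg, ok := w.cgroups[id]
	delete(w.cgroups, id)
	w.mu.Unlock()
	if !ok {
		return false
	}
	stats, err := cg.Stat()
	if err != nil || stats.MemoryEvents == nil {
		return false
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	if stats.MemoryEvents.OomKill > w.lastOOM[id] {
		// A kill the async path hasn't published yet: record it so the
		// two paths never emit duplicate TaskOOM events.
		w.lastOOM[id] = stats.MemoryEvents.OomKill
		return true
	}
	return false
}
```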

google-cla Bot commented Apr 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

kpurdon force-pushed the kpurdon/12838/fix-aarch64-oom-reporting branch 2 times, most recently from da73340 to 89555b6 on April 1, 2026 17:04
On aarch64, the async OOM notification via inotify/EventChan (cgroups
v2) loses the race against the container exit notification from
runtime.Wait(). This causes containerd to process the TaskExit before
TaskOOM arrives, leading kubelet to report reason=Error instead of
reason=OOMKilled.

Add a synchronous check of the cgroup's memory.events at container exit
time. When OOM kills are detected, publish the TaskOOM event through the
same sequential event pipeline as TaskExit, guaranteeing the OOM event
arrives first. The async and sync paths coordinate through a shared
lastOOM map under mutex to prevent duplicate events.

Fixes #12838
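
Read as pseudocode for the ordering guarantee, everything below is an illustrative stand-in for the shim's internals, not its real types; only the enqueue order and the init-only guard come from the message above:

```go
package main

// Illustrative stand-ins for the shim's event types.
type taskOOM struct{ ContainerID string }
type taskExit struct {
	ContainerID, ID string
	Pid, ExitStatus uint32
}

type oomPoller interface {
	isOOM(id string) bool // one-shot synchronous cgroup check
}

type service struct {
	id     string
	oom    oomPoller
	events chan any // the sequential pipeline shared by all task events
}

// reportExit enqueues TaskOOM before TaskExit on the same channel, so a
// consumer that publishes events in order delivers the OOM first and
// kubelet sees reason=OOMKilled. Only init exits trigger the check:
// exec processes share the sandbox cgroup and would report spuriously.
func (s *service) reportExit(initExit bool, procID string, pid, status uint32) {
	if initExit && s.oom.isOOM(s.id) {
		s.events <- taskOOM{ContainerID: s.id}
	}
	s.events <- taskExit{ContainerID: s.id, ID: procID, Pid: pid, ExitStatus: status}
}
```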
kpurdon force-pushed the kpurdon/12838/fix-aarch64-oom-reporting branch from 89555b6 to f1c7618 on April 1, 2026 17:46
kpurdon marked this pull request as ready for review on April 1, 2026 17:49
tranji-cloud requested a review from milantracy on April 3, 2026 23:51
copybara-service Bot pushed a commit that referenced this pull request Apr 9, 2026
## Why

On aarch64 (ARM64/Graviton), the async OOM notification via inotify/`EventChan` (cgroups v2) loses the race against the container exit from `runtime.Wait()`. Containerd processes the `TaskExit` before `TaskOOM` arrives, so kubelet reports `reason: Error` instead of `reason: OOMKilled`. This breaks monitoring, autoscaling, and alerting on aarch64 gVisor nodes.

## What

- Added `isOOM(id string) bool` to the `oomPoller` interface for synchronous OOM checking at exit time (one-shot: consumes the cgroup reference)
- Implemented `isOOM` for cgroups v2 (`watcherV2`) using `cg.Stat()` to read `memory.events` directly, bypassing the async inotify path
- Promoted `lastOOMMap` from goroutine-local in `run()` to a shared mutex-protected field, coordinating between async and sync paths to prevent duplicate `TaskOOM` events
- Guarded the sync OOM check to init process exits only — exec processes share the sandbox cgroup and would produce spurious events
- Added no-op `isOOM` for cgroups v1 (`epoller`) since its eventfd mechanism doesn't have this race
- Added 6 unit tests covering async publish, dedup, sync/async coordination, and error cleanup

Fixes #12838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#12843 from kpurdon:kpurdon/12838/fix-aarch64-oom-reporting f1c7618
PiperOrigin-RevId: 896802037
copybara-service Bot merged commit 7b4fe74 into google:master Apr 9, 2026
6 checks passed
kpurdon deleted the kpurdon/12838/fix-aarch64-oom-reporting branch April 10, 2026 17:55
kpurdon (Contributor, Author) commented Apr 13, 2026

@milantracy when should I expect to see this in a release? Thanks!

ayushr2 (Collaborator) commented Apr 13, 2026

We usually cut releases on Wednesdays. Should be included in the next release.

maci0 pushed a commit to maci0/gvisor that referenced this pull request Apr 23, 2026
…4-oom-reporting

PiperOrigin-RevId: 896927425
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet
reason=Error instead of reason=OOMKilled. Root cause: the shim
manager's newCommand() spawns the shim binary without -id, so
containerd's shim framework defaults id to "". NewTaskService then
sets runscService.id = "", and every TaskOOM{ContainerID: s.id}
carries an empty string. containerd's CRI does containerStore.Get("")
which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is
dropped and the container's Reason is never set to OOMKilled. TaskExit
still works because containerd routes it by e.ID (the init process id),
not e.ContainerID.

Fix: do not rely on s.id for TaskOOM routing. In the async watcher,
take a containerIDs callback that returns all container IDs currently
tracked by the service, and publish a TaskOOM per id (unknown ids
return NotFound and are silently ignored by containerd, so the real
app container is the only one that gets marked reason=OOMKilled). In
the sync-at-exit path, find the container whose init process matched
the exit, use its id for isOOM, and fan out on confirmed OOM. Also use
the per-container id in TaskExit.ContainerID for consistency.

Fixes google#12838 (again — the prior synchronous-cgroup-check approach in
google#12843 still published with ContainerID="" so kubelet continued to see
reason=Error).
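
A sketch of that fan-out; `containerIDs` is the callback named in the message, the rest is illustrative glue:

```go
// oomWatcher stands in for the async watcher; only the containerIDs
// callback and the per-id publish come from the commit message.
type oomWatcher struct {
	containerIDs func() []string // all ids currently tracked by the service
}

// onOOM publishes one TaskOOM per tracked container rather than a single
// event keyed by the (possibly empty) shim id. containerd answers
// NotFound for ids it doesn't track and the CRI plugin ignores those,
// so only the real app container ends up with reason=OOMKilled.
func (w *oomWatcher) onOOM(publish func(containerID string) error) {
	for _, id := range w.containerIDs() {
		_ = publish(id) // NotFound for unrelated ids is expected; ignore it
	}
}
```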
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 23, 2026
On aarch64 (and likely others), OOM-killed gvisor pods get kubelet
reason=Error instead of reason=OOMKilled. Root cause: the shim
manager's newCommand() spawns the shim binary without -id, so
containerd's shim framework defaults id to "". NewTaskService then
sets runscService.id = "", and every TaskOOM{ContainerID: s.id}
carries an empty string. containerd's CRI does containerStore.Get("")
which returns ErrEmptyPrefix (not NotFound), so the TaskOOM is
dropped and the container's Reason is never set to OOMKilled.

TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID — two different fields on two
different lookup paths.

Fix: do not rely on s.id for TaskOOM routing. Use the per-container
id instead:

- pkg/shim/v1/runsc/service.go:
  - CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
    per-container id from the Create request) rather than s.id. This
    makes the id stored in the watcher's cgroups/lastOOM maps non-empty
    and correctly keyed, so both the async (EventChan -> run) and sync
    (isOOM at exit) paths have a real id to publish TaskOOM with.
  - checkProcesses: find the container whose process matched the exit,
    use that container's id for isOOM and for both TaskOOM and TaskExit
    ContainerID fields. Drops the reliance on s.id entirely.

No changes to oom_v2.go — with add() keyed by rfs.Create.ID, the
existing async publish path in run() already emits TaskOOM with the
correct id via i.id.

Fixes google#12838 (the prior synchronous-cgroup-check in google#12843 also
published with ContainerID="" so kubelet continued to see reason=Error
on aarch64).

Assisted-by: Claude Code
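
The re-keying reduces to a one-line change. In this sketch everything except the `rfs.Create.ID` vs. `s.id` choice is a minimal stand-in:

```go
package main

import "context"

// Minimal stand-ins; only rfs.Create.ID vs. s.id is the point.
type createRequest struct{ ID, Bundle string }
type restoreRequest struct{ Create createRequest }
type cgroup struct{}

type oomAdder interface{ add(id string, cg *cgroup) error }

type service struct {
	id  string // "" when the shim is spawned without -id
	oom oomAdder
}

func openContainerCgroup(bundle string) (*cgroup, error) { return &cgroup{}, nil } // stub

// CreateWithFSRestore registers the cgroup under the per-container id
// carried by the Create request, never under s.id.
func (s *service) CreateWithFSRestore(ctx context.Context, rfs *restoreRequest) error {
	cg, err := openContainerCgroup(rfs.Create.Bundle)
	if err != nil {
		return err
	}
	// Before this fix: s.oom.add(s.id, cg), which keyed the watcher's
	// cgroups and lastOOM maps by the empty string.
	return s.oom.add(rfs.Create.ID, cg)
}
```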
kpurdon added a commit to kpurdon/gvisor that referenced this pull request Apr 24, 2026
On aarch64 (and any architecture where the containerd shim race
between TaskOOM and TaskExit routinely resolves OOM-last), OOM-killed
pods run by gVisor report kubelet reason=Error instead of
reason=OOMKilled.

Root cause: the shim binary is spawned by pkg/shim/v1/manager.go's
newCommand() without the `-id` flag. containerd's shim framework then
defaults `id` to "", which threads through to runscService.id = "".
Every TaskOOM{ContainerID: s.id} therefore carries an empty string.

containerd's CRI event handler routes TaskOOM by ContainerID through
containerStore.Get(e.ContainerID). With an empty id, the truncindex
lookup returns ErrEmptyPrefix (which is not a NotFound), so the CRI
event handler wraps and returns the error. The container's
Status.Reason is never set, so kubelet falls back to reason:Error.

TaskExit is unaffected because containerd routes TaskExit by e.ID
(init process id), not e.ContainerID.

The prior fix in google#12843 added a synchronous cgroup check at init
exit, which does close the async race — but that path also publishes
TaskOOM{ContainerID: s.id}, so on aarch64 the empty-id bug continues
to produce reason=Error after google#12843 merged. Don't rely on s.id.

This change:

* CreateWithFSRestore: register the cgroup with rfs.Create.ID (the
  per-container id from the Create request) instead of s.id. The id
  stored in the watcher's cgroups and lastOOM maps is now non-empty
  and correctly keyed, so both async (EventChan -> run) and sync
  (isOOM at exit) paths publish TaskOOM with a real ContainerID.

* checkProcesses: find the container whose process matched the exit
  event, use that container's id for isOOM, and use it for both the
  TaskOOM and TaskExit ContainerID fields.

No changes to oom_v2.go: with add() keyed by rfs.Create.ID, the
existing async publish path already emits TaskOOM with the correct
id via i.id.

Validated end-to-end on AL2023 aarch64 (kernel 6.12, containerd
2.2.1, kubelet 1.34) with a forced-OOM pod: before this change pods
show reason:Error and journald reports "can't find container for
TaskOOM event: prefix can't be empty"; after, pods show
reason:OOMKilled and the TaskOOM event carries the real container id.

Assisted-by: Claude Code
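
A minimal snippet contrasting the two routing fields, using the real `containerd/api/events` types; the routing behavior in the comments restates the message above:

```go
package main

import (
	"fmt"

	events "github.com/containerd/containerd/api/events"
)

// The CRI plugin routes TaskOOM by ContainerID but TaskExit by ID (the
// init process id), so the exit survives an empty ContainerID and the
// OOM does not.
func main() {
	oom := &events.TaskOOM{ContainerID: ""} // Get("") -> ErrEmptyPrefix: event dropped
	exit := &events.TaskExit{
		ContainerID: "",              // not used for routing
		ID:          "app-container", // routed by this field: still delivered
		Pid:         1,
		ExitStatus:  137,
	}
	fmt.Println(oom, exit)
}
```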
Successfully merging this pull request may close these issues.

OOMKilled reason not reported on aarch64 — container shows reason: Error instead of reason: OOMKilled

3 participants