OOM OCI Events Broken for Kubernetes + CgroupsV2 #9723

Closed
jcodybaker opened this issue Nov 16, 2023 · 6 comments · Fixed by #11044

Labels: area: container runtime (Issue related to docker, kubernetes, OCI runtime), no-auto-close, type: bug (Something isn't working)

jcodybaker commented Nov 16, 2023

Description

When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported with exit code 128 or 143 (128 + SIGTERM), and the OCI OOM event is not published.

In this configuration, gVisor runs as a child (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope) of the pod's cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice). This child cgroup doesn't specify limits itself; enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file, and it appears these events do not propagate to the child.

I've been able to illustrate this with tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs to display cgroup membership and the memory.events counters for both cgroups. Given that the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or mishandled by gVisor. That said, it shows 0 for all values, including max, which makes me suspect the child's counters are never incremented since no limits are set on it.

$ pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice

$ tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs

==> memory.events <==
low 0
high 0
max 566
oom 23
oom_kill 23
oom_group_kill 0

==> cgroup.procs <==

==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/memory.events <==
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0

==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/cgroup.procs <==
139422
139423
139452
139473
139540
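
The same check can be done programmatically. Below is a minimal sketch in Go (not runsc's actual implementation) that reads the oom_kill counter from a cgroup v2 memory.events file; pointed at the pod slice and container scope directories shown above, it reports the kills on the slice and zero on the scope, consistent with the tail output. The two command-line arguments are placeholders for those directories.

// oomkills.go - minimal sketch, not runsc's actual code.
// Prints the cgroup v2 oom_kill counter for each directory given.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// oomKills returns the oom_kill counter from <cgroupDir>/memory.events.
func oomKills(cgroupDir string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}

func main() {
	// Usage: oomkills <pod-slice-dir> <container-scope-dir>
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: oomkills <pod-slice-dir> <container-scope-dir>")
		os.Exit(1)
	}
	for _, dir := range os.Args[1:] {
		n, err := oomKills(dir)
		fmt.Printf("%s: oom_kill=%d err=%v\n", dir, n, err)
	}
}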

Steps to reproduce

Kubernetes + gVisor + cgroups v2-based OS (debian bookworm)
https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de

Create a pod whose workload exceeds its memory limit.

cat << 'EOF' | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example
  name: example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: ubuntu:jammy
        name: ubuntu
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor-ptrace
      tolerations:
      - operator: Exists
EOF

Wait for the pod to crash.

Then inspect its status:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    128
      Started:      Thu, 16 Nov 2023 10:12:13 -0500
      Finished:     Thu, 16 Nov 2023 10:12:18 -0500
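
For comparison, when the OOM is reported correctly, the expected status (not captured from this reproduction, but the standard Kubernetes rendering of a runtime-reported OOM) looks like:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137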

runsc version

runsc version release-20231106.0
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux

kubectl (if using Kubernetes)

v1.28.2

repo state (if built from source)

No response

runsc debug logs (if available)

No response

jcodybaker added the type: bug label Nov 16, 2023
jcodybaker changed the title from OOM Reporting Broken for Kubernetes + CgroupsV2 to OOM OCI Events Broken for Kubernetes + CgroupsV2 Nov 16, 2023
manninglucas added the area: container runtime label Dec 12, 2023
github-actions bot commented:

A friendly reminder that this issue had no activity for 120 days.

github-actions bot added the stale-issue label Apr 11, 2024
github-actions bot commented:

This issue has been closed due to lack of activity.

github-actions bot closed this as not planned Jul 10, 2024
markusthoemmes commented:

@manninglucas qq on this one since it was closed due to inactivity: is this supposed to be fixed at HEAD?

markusthoemmes commented:

Dr. Empirical says: not fixed at HEAD 😢. @manninglucas any chance of a fix here?

manninglucas reopened this Oct 14, 2024
manninglucas added the no-auto-close label and removed the stale-issue and auto-closed labels Oct 14, 2024
manninglucas (Contributor) commented:

Looking into this now. I think we need the equivalent of this commit in runsc. Right now our events implementation only handles stats, not OOMs.
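
For context, containerd's cgroup v2 support (as I understand the referenced change) watches memory.events with inotify and treats an increase of the oom_kill counter as an OOM, rather than using the cgroup v1 memory.oom_control eventfd. Below is a rough, self-contained sketch of that idea; it is not the runsc or shim code, and readOOMKills/watchOOMKills are illustrative names only.

// oomwatch.go - hedged sketch of cgroup v2 OOM detection via inotify on
// memory.events; illustrative only, not the runsc/shim implementation.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// readOOMKills parses the oom_kill counter out of memory.events.
func readOOMKills(cgroupDir string) (uint64, error) {
	data, err := os.ReadFile(cgroupDir + "/memory.events")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("oom_kill not found in %s/memory.events", cgroupDir)
}

// watchOOMKills blocks, invoking notify each time the oom_kill counter in
// cgroupDir/memory.events increases. Note that in the Kubernetes layout
// described above, the counter increases on the pod slice (where the limit
// is set), not the container scope, so the watched directory matters.
func watchOOMKills(cgroupDir string, notify func(kills uint64)) error {
	fd, err := unix.InotifyInit1(unix.IN_CLOEXEC)
	if err != nil {
		return err
	}
	defer unix.Close(fd)
	if _, err := unix.InotifyAddWatch(fd, cgroupDir+"/memory.events", unix.IN_MODIFY); err != nil {
		return err
	}
	last, _ := readOOMKills(cgroupDir)
	buf := make([]byte, 16*unix.SizeofInotifyEvent)
	for {
		if _, err := unix.Read(fd, buf); err != nil {
			return err
		}
		kills, err := readOOMKills(cgroupDir)
		if err != nil {
			return err
		}
		if kills > last {
			last = kills
			notify(kills)
		}
	}
}

func main() {
	// Usage: oomwatch <cgroup-dir>
	err := watchOOMKills(os.Args[1], func(kills uint64) {
		fmt.Printf("OOM kill observed (oom_kill=%d)\n", kills)
	})
	fmt.Fprintln(os.Stderr, err)
}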

copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
The reason for this is outlined in containerd/containerd@7275411.

TL;DR: K8s sets mem limits on the pod cgroup (slice), which means
the scope cgroup (container) gets OOMKilled, not OOMed.

Fixes #9723

PiperOrigin-RevId: 686297071
manninglucas (Contributor) commented:

My initial theory was incorrect. The issue is that we didn't update our containerd shim to reflect this change. The fix is in #11044; let me know if you patch it in and it still doesn't work for you.
