OOM OCI Events Broken for Kubernetes + CgroupsV2 #9723

Closed
jcodybaker opened this issue Nov 16, 2023 · 6 comments · Fixed by #11044

Labels: area: container runtime (Issue related to docker, kubernetes, OCI runtime), no-auto-close, type: bug (Something isn't working)

jcodybaker commented Nov 16, 2023

Description

When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported with exit code 128 or 143 (128 + SIGTERM), and the OCI OOM event is not published.

In this configuration, gVisor runs as a child (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope) of the pod's cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice). This child cgroup doesn't specify limits itself; enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file, and it appears these events do not propagate to the child.

I've been able to illustrate this with tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs to display cgroup membership and the memory.events counters for both cgroups. Given that the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or mishandled by gVisor. That said, it shows 0 for all values, including max, which makes me suspect the child's counters are never incremented since no limits are set on it.

$ pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice

$ tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs

==> memory.events <==
low 0
high 0
max 566
oom 23
oom_kill 23
oom_group_kill 0

==> cgroup.procs <==

==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/memory.events <==
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0

==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/cgroup.procs <==
139422
139423
139452
139473
139540
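
The same check can be done programmatically. Below is a minimal sketch in Go (not runsc's actual implementation) that reads the oom_kill counter from a cgroup v2 memory.events file; pointed at the pod slice and container scope directories shown above, it reports the kills on the slice and zero on the scope, consistent with the tail output. The two command-line arguments are placeholders for those directories.

// oomkills.go - minimal sketch, not runsc's actual code.
// Prints the cgroup v2 oom_kill counter for each directory given.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// oomKills returns the oom_kill counter from <cgroupDir>/memory.events.
func oomKills(cgroupDir string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupDir, "memory.events"))
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, s.Err()
}

func main() {
	// Usage: oomkills <pod-slice-dir> <container-scope-dir>
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: oomkills <pod-slice-dir> <container-scope-dir>")
		os.Exit(1)
	}
	for _, dir := range os.Args[1:] {
		n, err := oomKills(dir)
		fmt.Printf("%s: oom_kill=%d err=%v\n", dir, n, err)
	}
}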

Steps to reproduce

Kubernetes + gVisor + cgroups v2-based OS (debian bookworm)
https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de

Create a pod whose workload exceeds its memory limit.

cat << 'EOF' | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example
  name: example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: ubuntu:jammy
        name: ubuntu
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor-ptrace
      tolerations:
      - operator: Exists
EOF

Wait for the pod to crash.

Then inspect its status:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    128
      Started:      Thu, 16 Nov 2023 10:12:13 -0500
      Finished:     Thu, 16 Nov 2023 10:12:18 -0500
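
For comparison, when the OOM is reported correctly, the expected status (not captured from this reproduction, but the standard Kubernetes rendering of a runtime-reported OOM) looks like:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137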

runsc version

runsc version release-20231106.0
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux

kubectl (if using Kubernetes)

v1.28.2

repo state (if built from source)

No response

runsc debug logs (if available)

No response

jcodybaker added the type: bug label Nov 16, 2023
jcodybaker changed the title from OOM Reporting Broken for Kubernetes + CgroupsV2 to OOM OCI Events Broken for Kubernetes + CgroupsV2 Nov 16, 2023
manninglucas added the area: container runtime label Dec 12, 2023
github-actions bot commented:

A friendly reminder that this issue had no activity for 120 days.

github-actions bot added the stale-issue label Apr 11, 2024
github-actions bot commented:

This issue has been closed due to lack of activity.

github-actions bot closed this as not planned Jul 10, 2024
markusthoemmes commented:

@manninglucas qq on this one since it was closed due to inactivity: is this supposed to be fixed at HEAD?

markusthoemmes commented:

Dr. Empirical says: not fixed at HEAD 😢. @manninglucas any chance of a fix here?

manninglucas reopened this Oct 14, 2024
manninglucas added the no-auto-close label and removed the stale-issue and auto-closed labels Oct 14, 2024
manninglucas (Contributor) commented:

Looking into this now. I think we need the equivalent of this commit in runsc. Right now our events implementation only handles stats, not OOMs.
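
For context, containerd's cgroup v2 support (as I understand the referenced change) watches memory.events with inotify and treats an increase of the oom_kill counter as an OOM, rather than using the cgroup v1 memory.oom_control eventfd. Below is a rough, self-contained sketch of that idea; it is not the runsc or shim code, and readOOMKills/watchOOMKills are illustrative names only.

// oomwatch.go - hedged sketch of cgroup v2 OOM detection via inotify on
// memory.events; illustrative only, not the runsc/shim implementation.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

// readOOMKills parses the oom_kill counter out of memory.events.
func readOOMKills(cgroupDir string) (uint64, error) {
	data, err := os.ReadFile(cgroupDir + "/memory.events")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("oom_kill not found in %s/memory.events", cgroupDir)
}

// watchOOMKills blocks, invoking notify each time the oom_kill counter in
// cgroupDir/memory.events increases. Note that in the Kubernetes layout
// described above, the counter increases on the pod slice (where the limit
// is set), not the container scope, so the watched directory matters.
func watchOOMKills(cgroupDir string, notify func(kills uint64)) error {
	fd, err := unix.InotifyInit1(unix.IN_CLOEXEC)
	if err != nil {
		return err
	}
	defer unix.Close(fd)
	if _, err := unix.InotifyAddWatch(fd, cgroupDir+"/memory.events", unix.IN_MODIFY); err != nil {
		return err
	}
	last, _ := readOOMKills(cgroupDir)
	buf := make([]byte, 16*unix.SizeofInotifyEvent)
	for {
		if _, err := unix.Read(fd, buf); err != nil {
			return err
		}
		kills, err := readOOMKills(cgroupDir)
		if err != nil {
			return err
		}
		if kills > last {
			last = kills
			notify(kills)
		}
	}
}

func main() {
	// Usage: oomwatch <cgroup-dir>
	err := watchOOMKills(os.Args[1], func(kills uint64) {
		fmt.Printf("OOM kill observed (oom_kill=%d)\n", kills)
	})
	fmt.Fprintln(os.Stderr, err)
}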

copybara-service bot pushed a commit that referenced this issue Oct 16, 2024
The reason for this is outlined in containerd/containerd@7275411.

TL;DR: K8s sets mem limits on the pod cgroup (slice), which means
the scope cgroup (container) gets OOMKilled, not OOMed.

Fixes #9723

PiperOrigin-RevId: 686297071
manninglucas (Contributor) commented:

My initial theory was incorrect. The issue is that we didn't update our containerd shim to reflect this change. The fix is in #11044; let me know if you patch it in and it still doesn't work for you.
