OOM OCI Events Broken for Kubernetes + CgroupsV2 #9723
Comments
A friendly reminder that this issue had no activity for 120 days.
This issue has been closed due to lack of activity.
@manninglucas qq on this one since it was closed due to inactivity: Is this supposed to be fixed at HEAD?
Dr. Empirical says: Not fixed at HEAD 😢. @manninglucas any chances of a fix here?
Looking into this now. I think we need the equivalent of this commit in runsc. Right now our events implementation only handles stats, not OOMs.
The reason for this is outlined in containerd/containerd@7275411. TL;DR: K8s sets mem limits on the pod cgroup (slice), which means the scope cgroup (container) gets OOMKilled, not OOMed. Fixes #9723 PiperOrigin-RevId: 686297071
My initial theory was incorrect. The issue is that we didn't update our containerd shim to reflect this change. The fix is here: #11044. Let me know if you patch it and it still doesn't work for you.
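For anyone following along, here is a rough sketch of the behavior that commit describes: on cgroups v2, the watcher has to key off the `oom_kill` counter in `memory.events`, since a Kubernetes container scope whose limit lives on the pod slice may never see `oom` increment. This is an illustration only (polling rather than inotify, not the actual runsc or shim code), and the path argument and one-second poll interval are placeholders:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseMemoryEvents reads a cgroup v2 memory.events file and returns its
// counters (low, high, max, oom, oom_kill).
func parseMemoryEvents(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	events := make(map[string]uint64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		events[fields[0]] = n
	}
	return events, s.Err()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: oomwatch <container cgroup memory.events path>")
		os.Exit(1)
	}
	path := os.Args[1]

	var lastOOMKill uint64
	for {
		ev, err := parseMemoryEvents(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "reading memory.events:", err)
			os.Exit(1)
		}
		// With the memory limit set on the pod slice rather than on the
		// container scope, the scope's "oom" counter can stay 0 while
		// "oom_kill" still increments when one of its processes is killed,
		// so "oom_kill" is the counter to surface as an OCI OOM event.
		if ev["oom_kill"] > lastOOMKill {
			fmt.Println("oom_kill incremented; an OOM event should be published")
			lastOOMKill = ev["oom_kill"]
		}
		time.Sleep(time.Second)
	}
}
```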
Description
When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported as exit code 128 or 143 (SIGTERM + 128), and the OCI OOM event is not published.
In this configuration, gVisor runs as a child (ex.
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope
) of the pod's cgroup (ex.
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice
). This child cgroup doesn't specify limits itself, but enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file. It seems these events must not propagate to the child. I've been able to illustrate this with
tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs
to display the various cgroup memberships and memory.events counters. Given that the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or incorrectly handled by gVisor. That said, it shows 0 for all values, including max, which makes me suspect it's not considered since there are no limits on the child cgroup itself.
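For reference, here is a minimal standalone sketch of the watch mechanism described above (an inotify watch on the container cgroup's memory.events), using golang.org/x/sys/unix. This is an illustration only, not gVisor's actual implementation; the memory.events path is passed on the command line.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: evwatch <path to memory.events>")
		os.Exit(1)
	}
	path := os.Args[1]

	fd, err := unix.InotifyInit1(0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "inotify_init1:", err)
		os.Exit(1)
	}
	defer unix.Close(fd)

	// memory.events is rewritten in place whenever a counter changes,
	// so IN_MODIFY is the event of interest.
	if _, err := unix.InotifyAddWatch(fd, path, unix.IN_MODIFY); err != nil {
		fmt.Fprintln(os.Stderr, "inotify_add_watch:", err)
		os.Exit(1)
	}

	buf := make([]byte, 4096)
	for {
		// Block until the watched file is modified.
		if _, err := unix.Read(fd, buf); err != nil {
			fmt.Fprintln(os.Stderr, "read:", err)
			os.Exit(1)
		}
		// Re-read the counters on each notification. If the cgroup is
		// torn down right after its final update, this read fails,
		// which is the sort of race described above.
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "memory.events is gone:", err)
			return
		}
		fmt.Printf("memory.events changed:\n%s", data)
	}
}
```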
Steps to reproduce
Kubernetes + gVisor + a cgroups v2-based OS (Debian Bookworm)
https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de
Create a pod which exceeds the memory available.
Wait for the pod to crash.
Then inspect its status:
runsc version
docker version (if using docker)
No response
uname
Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
No response