
Resource limits not enforced in EKS cluster #8047

Closed
DobromirM opened this issue Oct 4, 2022 · 11 comments
Labels: type: bug (Something isn't working)

Comments

@DobromirM

Description

When a Kubernetes deployment is created with gVisor as the runtime, resource limits are not enforced. Pods that exceed the node's capacity are not terminated; instead, they crash the node.

The limits work as expected with the default runtime, and the pods can be stopped manually.

containerd/config.toml

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.runtime.v1.linux"]
shim_debug = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

Steps to reproduce

  1. Create an EKS cluster.
  2. Create an EKS node group.

eksNodeGroup.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER_NAME
  region: us-west-1
vpc:
  id: VPC_ID
  securityGroup: SECURITY_GROUP
  subnets:
    public:
      public1:
          id: SUBNET_1
      public2:
          id: SUBNET_2
managedNodeGroups:
  - name: eks-node-group
    instanceType: t3.small
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    ssh:
      allow: true
      publicKeyPath: KEY_PATH
    volumeSize: 30
    ami: ami-013fb9424761e4389
    overrideBootstrapCommand: |
      #!/bin/bash
      
      # Install gvisor
      
      ARCH=$(uname -m)
      URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
      wget ${URL}/runsc ${URL}/runsc.sha512 \
      ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
      sha512sum -c runsc.sha512 -c containerd-shim-runsc-v1.sha512
      rm -f *.sha512
      chmod a+rx runsc containerd-shim-runsc-v1
      sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
      
      # Create the config file
      
      cat <<EOF > /etc/eks/custom-containerd-config.toml
      version = 2
      root = "/var/lib/containerd"
      state = "/run/containerd"
      
      [grpc]
      address = "/run/containerd/containerd.sock"
      
      [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      
      [plugins."io.containerd.grpc.v1.cri"]
      sandbox_image = "602401143452.dkr.ecr.us-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1"
      
      [plugins."io.containerd.runtime.v1.linux"]
      shim_debug = true
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
      runtime_type = "io.containerd.runsc.v1"
      
      [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      EOF
      
      # Restart containerd
      sudo systemctl restart containerd
      
      # Run the bootstrap script
      /etc/eks/bootstrap.sh CLUSTER_NAME --container-runtime containerd --containerd-config-file /etc/eks/custom-containerd-config.toml

eksctl create nodegroup --config-file eksNodeGroup.yaml
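
To confirm the node group came up (an optional check, not part of the original report; CLUSTER_NAME as in the config above):

$ eksctl get nodegroup --cluster CLUSTER_NAME
$ kubectl get nodes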

  3. Create a runtime class.

runtimeClass.yaml

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor 
handler: runsc  

kubectl apply -f runtimeClass.yaml
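
To confirm the class registered (a quick optional check; the handler must match the runtimes.runsc key in the containerd config above):

$ kubectl get runtimeclass gvisor
NAME     HANDLER   AGE
gvisor   runsc     ...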

  4. Create a deployment with resource limits and the gVisor runtime.

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      runtimeClassName: gvisor
      containers:
      - image: nginx
        name: nginx
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            cpu: "1m" 
            memory: "1Mi"
          limits:
            cpu: "1m"
            memory: "1Mi"

kubectl apply -f deployment.yaml
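
To observe the reported behavior, watch the pod after applying the deployment. With the default runc runtime the 1Mi limit kills the pod almost immediately, while with the gvisor runtime class it keeps running as if no limit were set:

$ kubectl get pods -l app=nginx -w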

runsc version

runsc version release-20220919.0
spec: 1.0.2-dev

docker version (if using docker)

No response

uname

Linux ip-172-31-11-114.us-west-1.compute.internal 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

kubectl version:

Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-13+d2965f0db10712", GitCommit:"d2965f0db1071203c6f5bc662c2827c71fc8b20d", GitTreeState:"clean", BuildDate:"2021-06-26T01:02:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

kubectl get nodes:

NAME                                          STATUS   ROLES    AGE   VERSION
ip-172-31-11-114.us-west-1.compute.internal   Ready    <none>   18m   v1.23.9-eks-ba74326


repo state (if built from source)

No response

runsc debug logs (if available)

No response

DobromirM added the type: bug label on Oct 4, 2022

@DobromirM (Author)

@zkoopmans

@zkoopmans (Contributor)

@fvoznika and @konstantin-s-bogom, this is AWS, and this doesn't look like kubernetes/kubernetes#107172. Have we tried this on GKE yet?

If it is us, I suspect it is some mismatch of a message in the shim, but I don't know that code well enough yet to say either way.

@konstantin-s-bogom (Member)

I think it's likely that kubernetes/kubernetes#107172 is at play here, because AFAIK kubelet versions <1.25 will prefer cAdvisor stats, which report incorrect resource usage for runsc containers.

@DobromirM I don't know if EKS supports using a custom kubelet, but if you know of a way, then try one with this patch applied and see if it fixes the issue: kubernetes/kubernetes@9c3a4aa

@fvoznika (Member) commented Oct 4, 2022

This issue is not related to cAdvisor. cAdvisor just affects reported metrics, not cgroup limits.

In K8s both pods and containers have limits. Users can only set container limits. Pod limits are automatically set with the aggregate of all containers. For example, if one container has a memory limit of 512MB and another has a 128MB limit, the pod limit will be 640MB.

gVisor doesn't currently enforce container limits because it doesn't have cgroups support. However, pod limits are enforced: the sandbox process joins the pod cgroup on the host. In the example above, a container may be allowed to go over its own limit (say, the container with the 128MB limit allocates more than 128MB), but it won't be allowed to exceed the 640MB pod limit. It is important, however, to ensure that all containers in the pod have a limit set; otherwise the aggregate limit for the pod will be unlimited.
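
A minimal sketch of that aggregation, using the 512MB and 128MB example above (Mi meaning 2^20 bytes; the pod cgroup path is illustrative, and the QoS directory and pod UID vary):

$ echo $(( (512 + 128) * 1024 * 1024 ))
671088640
# On a cgroup v1 node with the cgroupfs driver, the pod-level cgroup carries the aggregate:
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/memory.limit_in_bytes
671088640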

@fvoznika (Member) commented Oct 5, 2022

I see that your example has a single container; in that case the pod limit should be enforced, unless EKS is not setting up the pod cgroup appropriately. You can check by looking at the cgroup configuration on the node: find the sandbox process, look up its cgroup, and check whether a limit is set. This is what I get running in GKE:

$ cat /proc/$(pidof runsc-sandbox)/cgroup
$ cat /sys/fs/cgroup/memory/kubepods/burstable/poda6624945-2024-4cd1-94da-8abb842f568f/memory.limit_in_bytes
671088640

@DobromirM (Author)

$ cat /proc/$(pidof runsc-sandbox)/cgroup
11:cpuset:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
10:net_cls,net_prio:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
9:hugetlb:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
8:memory:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
7:devices:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
6:pids:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
5:freezer:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
4:perf_event:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
3:blkio:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
2:cpu,cpuacct:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
1:name=systemd:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
$ cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice/memory.limit_in_bytes
1048576

@fvoznika (Member) commented Oct 5, 2022

When I try to run your deployment, the pod doesn't even start because gVisor fails to boot due to the low memory limit (which is the expected behavior when such a low limit is set). Can you describe in more detail what happens when you try to run the deployment? Can you collect and attach debug logs (see instructions here and here)?

@DobromirM (Author) commented Oct 6, 2022

When I run the deployment with gVisor as the runtime, the pod starts and runs as if the limits are not there. Running the deployment with the default runtime makes it fail instantly due to the low limit, which, as you said, is the expected behaviour.

I've created a new node on the cluster with the following debug configs:

/etc/containerd/runsc.toml

log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
  
[runsc_config]
debug = "true"
debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"

/etc/containerd/config.toml

...

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"

...

After applying only the deployment.yaml from my first comment, the /var/log/runsc directory was created on the node and it contained two sub-directories: 094dd09636d33b2d96488bb0ccb5281b0dae9bd26845186402d16577e4812121 and 620519fc2170e56bd9ab81bf669001e737955db2f10f5029b578d8e9c04ef1c7.

I've uploaded the logs here: https://github.com/DobromirM/gvisor-logs

@fvoznika (Member)

I believe the problem is that the node is using the systemd cgroup driver, given the path below:

    "cgroupsPath": "kubepods-pod3d6523c3_ac3f_4f51_bd36_ca6b432568d7.slice:cri-containerd:094dd09636d33b2d96488bb0ccb5281b0dae9bd26845186402d16577e4812121",

The cgroup configurations runsc currently supports are (see the check sketched after this list):

  • cgroupfs v1
  • cgroupfs v2
  • systemd cgroup v2
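
A quick way to check which cgroup version a node runs (a minimal sketch; on this node the systemd driver is already evident from the slice-style cgroupsPath above):

$ stat -fc %T /sys/fs/cgroup
tmpfs
# "cgroup2fs" would indicate cgroup v2; "tmpfs" indicates cgroup v1, where runsc
# supports only the cgroupfs driver, so a systemd-driver node like this one is unsupported.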

@DobromirM (Author)

We switched to an EKS-optimized Ubuntu Linux AMI to avoid the original problem.

@fvoznika (Member)

Thanks for the update.
