
Resource limits not enforced in EKS cluster #8047

Closed
DobromirM opened this issue Oct 4, 2022 · 11 comments
Labels: type: bug (Something isn't working)

Comments

@DobromirM

Description

When a Kubernetes deployment is created with gVisor as the runtime, resource limits are not enforced. Pods that exceed the node's capacity are not terminated; instead, they crash the node.

The limits work as expected with the default runtime, and the pods can be stopped manually.

containerd/config.toml

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.runtime.v1.linux"]
shim_debug = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

Steps to reproduce

  1. Create an EKS cluster.
  2. Create an EKS node group.

eksNodeGroup.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER_NAME
  region: us-west-1
vpc:
  id: VPC_ID
  securityGroup: SECURITY_GROUP
  subnets:
    public:
      public1:
          id: SUBNET_1
      public2:
          id: SUBNET_2
managedNodeGroups:
  - name: eks-node-group
    instanceType: t3.small
    minSize: 1
    maxSize: 1
    desiredCapacity: 1
    ssh:
      allow: true
      publicKeyPath: KEY_PATH
    volumeSize: 30
    ami: ami-013fb9424761e4389
    overrideBootstrapCommand: |
      #!/bin/bash
      
      # Install gvisor
      
      ARCH=$(uname -m)
      URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
      wget ${URL}/runsc ${URL}/runsc.sha512 \
      ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
      sha512sum -c runsc.sha512 -c containerd-shim-runsc-v1.sha512
      rm -f *.sha512
      chmod a+rx runsc containerd-shim-runsc-v1
      sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
      
      # Create the config file
      
      cat <<EOF > /etc/eks/custom-containerd-config.toml
      version = 2
      root = "/var/lib/containerd"
      state = "/run/containerd"
      
      [grpc]
      address = "/run/containerd/containerd.sock"
      
      [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      
      [plugins."io.containerd.grpc.v1.cri"]
      sandbox_image = "602401143452.dkr.ecr.us-west-1.amazonaws.com/eks/pause:3.1-eksbuild.1"
      
      [plugins."io.containerd.runtime.v1.linux"]
      shim_debug = true
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
      runtime_type = "io.containerd.runsc.v1"
      
      [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      EOF
      
      # Restart containerd
      sudo systemctl restart containerd
      
      # Run the bootstrap script
      /etc/eks/bootstrap.sh CLUSTER_NAME --container-runtime containerd --containerd-config-file /etc/eks/custom-containerd-config.toml

eksctl create nodegroup --config-file eksNodeGroup.yaml
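
To confirm the node group came up (an optional check, not part of the original report; CLUSTER_NAME as in the config above):

$ eksctl get nodegroup --cluster CLUSTER_NAME
$ kubectl get nodes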

  3. Create a runtime class.

runtimeClass.yaml

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor 
handler: runsc  

kubectl apply -f runtimeClass.yaml
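
To confirm the class registered (a quick optional check; the handler must match the runtimes.runsc key in the containerd config above):

$ kubectl get runtimeclass gvisor
NAME     HANDLER   AGE
gvisor   runsc     ...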

  4. Create a deployment with resource limits and the gVisor runtime.

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      runtimeClassName: gvisor
      containers:
      - image: nginx
        name: nginx
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            cpu: "1m" 
            memory: "1Mi"
          limits:
            cpu: "1m"
            memory: "1Mi"

kubectl apply -f deployment.yaml
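
To observe the reported behavior, watch the pod after applying the deployment. With the default runc runtime the 1Mi limit kills the pod almost immediately, while with the gvisor runtime class it keeps running as if no limit were set:

$ kubectl get pods -l app=nginx -w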

runsc version

runsc version release-20220919.0
spec: 1.0.2-dev

docker version (if using docker)

No response

uname

Linux ip-172-31-11-114.us-west-1.compute.internal 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

kubectl version:

Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-13+d2965f0db10712", GitCommit:"d2965f0db1071203c6f5bc662c2827c71fc8b20d", GitTreeState:"clean", BuildDate:"2021-06-26T01:02:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

kubectl get nodes:

NAME                                          STATUS   ROLES    AGE   VERSION
ip-172-31-11-114.us-west-1.compute.internal   Ready    <none>   18m   v1.23.9-eks-ba74326


repo state (if built from source)

No response

runsc debug logs (if available)

No response

DobromirM added the type: bug label on Oct 4, 2022

@DobromirM (Author)

@zkoopmans

@zkoopmans (Contributor)

@fvoznika and @konstantin-s-bogom, this is AWS, and this doesn't look like kubernetes/kubernetes#107172. Have we tried this on GKE yet?

If it is us, I suspect it is some mismatch of a message in the shim, but I don't know that code well enough yet to say either way.

@konstantin-s-bogom (Member)

I think it's likely that kubernetes/kubernetes#107172 is at play here, because AFAIK kubelet versions <1.25 will prefer cAdvisor stats, which report incorrect resource usage for runsc containers.

@DobromirM I don't know if EKS supports using a custom kubelet, but if you know of a way, then try one with this patch applied and see if it fixes the issue: kubernetes/kubernetes@9c3a4aa

@fvoznika (Member) commented Oct 4, 2022

This issue is not related to cAdvisor. cAdvisor just affects reported metrics, not cgroup limits.

In K8s both pods and containers have limits. Users can only set container limits. Pod limits are automatically set with the aggregate of all containers. For example, if one container has a memory limit of 512MB and another has a 128MB limit, the pod limit will be 640MB.

gVisor doesn't currently enforce container limits because it doesn't have cgroups support. However, pod limits are enforced: the sandbox process joins the pod cgroup on the host. In the example above, a container may be allowed to go over its own limit (say, the container with the 128MB limit allocates more than 128MB), but it won't be allowed to exceed the 640MB pod limit. It is important, however, to ensure that all containers in the pod have a limit set; otherwise the aggregate limit for the pod will be unlimited.
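
A minimal sketch of that aggregation, using the 512MB and 128MB example above (Mi meaning 2^20 bytes; the pod cgroup path is illustrative, and the QoS directory and pod UID vary):

$ echo $(( (512 + 128) * 1024 * 1024 ))
671088640
# On a cgroup v1 node with the cgroupfs driver, the pod-level cgroup carries the aggregate:
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/memory.limit_in_bytes
671088640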

@fvoznika (Member) commented Oct 5, 2022

I see that your example has a single container; in that case the pod limit should be enforced, unless EKS is not setting up the pod cgroup appropriately. You can check by looking at the cgroup configuration on the node: find the sandbox process, look up its cgroup, and check whether a limit is set. This is what I get running in GKE:

$ cat /proc/$(pidof runsc-sandbox)/cgroup
$ cat /sys/fs/cgroup/memory/kubepods/burstable/poda6624945-2024-4cd1-94da-8abb842f568f/memory.limit_in_bytes
671088640

@DobromirM (Author)

$ cat /proc/$(pidof runsc-sandbox)/cgroup
11:cpuset:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
10:net_cls,net_prio:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
9:hugetlb:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
8:memory:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
7:devices:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
6:pids:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
5:freezer:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
4:perf_event:/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
3:blkio:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
2:cpu,cpuacct:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
1:name=systemd:/system.slice/containerd.service/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice:cri-containerd:5b476c97aeb484d3ad966bb1d7928c2ec93f4cd86733ecfdf847ede1e9e29b40
$ cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-poddb2b4922_5472_44af_9e3d_fd5f8625877f.slice/memory.limit_in_bytes
1048576

@fvoznika (Member) commented Oct 5, 2022

When I try to run your deployment, the pod doesn't even start because gVisor fails to boot due to the low memory limit (which is the expected behavior when such a low limit is set). Can you describe in more detail what happens when you try to run the deployment? Can you collect and attach debug logs (see instructions here and here)?

@DobromirM (Author) commented Oct 6, 2022

When I run the deployment with gVisor as the runtime, the pod starts and runs as if the limits are not there. Running the deployment with the default runtime makes it fail instantly due to the low limit, which, as you said, is the expected behaviour.

I've created a new node on the cluster with the following debug configs:

/etc/containerd/runsc.toml

log_path = "/var/log/runsc/%ID%/shim.log"
log_level = "debug"
  
[runsc_config]
debug = "true"
debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"

/etc/containerd/config.toml

...

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"

...

After applying only the deployment.yaml from my first comment, the /var/log/runsc directory was created on the node and it contained two sub-directories: 094dd09636d33b2d96488bb0ccb5281b0dae9bd26845186402d16577e4812121 and 620519fc2170e56bd9ab81bf669001e737955db2f10f5029b578d8e9c04ef1c7.

I've uploaded the logs here: https://github.com/DobromirM/gvisor-logs

@fvoznika (Member)

I believe the problem is that the node is using the systemd cgroup driver, given the path below:

    "cgroupsPath": "kubepods-pod3d6523c3_ac3f_4f51_bd36_ca6b432568d7.slice:cri-containerd:094dd09636d33b2d96488bb0ccb5281b0dae9bd26845186402d16577e4812121",

The cgroup configurations runsc currently supports are (see the check sketched after this list):

  • cgroupfs v1
  • cgroupfs v2
  • systemd cgroup v2
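
A quick way to check which cgroup version a node runs (a minimal sketch; on this node the systemd driver is already evident from the slice-style cgroupsPath above):

$ stat -fc %T /sys/fs/cgroup
tmpfs
# "cgroup2fs" would indicate cgroup v2; "tmpfs" indicates cgroup v1, where runsc
# supports only the cgroupfs driver, so a systemd-driver node like this one is unsupported.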

@DobromirM (Author)

We switched to an EKS-optimized Ubuntu Linux AMI to avoid the original problem.

@fvoznika (Member)

Thanks for the update.
