
k3s container v1.22 and newer fails on docker-desktop and k3d clusters #4873

Closed

FabianKramm opened this issue Jan 6, 2022 · 17 comments

@FabianKramm commented Jan 6, 2022

Environmental Info:
K3s Version:

k3s version v1.22.2+k3s2 (3f5774b4)
go version go1.16.8

Node(s) CPU architecture, OS, and Version:

Linux test-0 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64 GNU/Linux

Cluster Configuration:

container k3s, single server, no agents

Describe the bug:
Hello! Thanks a lot for the great project! I'm one of the maintainers of vcluster and we use k3s as a minimal control plane for our virtual cluster implementation. Unfortunately, k3s stopped working for us with version v1.22 (essentially every version released after PR #4086), emitting the following error on docker-desktop, kind, and k3d host clusters:

time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"

It worked fine with earlier versions and works fine with vanilla k8s or k0s v1.22 containers.

We have a slightly special setup where we run k3s without the agent and scheduler, and I'm not sure what exactly causes this error, as it works on GKE for example. Would it be possible to skip the root cgroup evacuation when the agent is not enabled, in order to restore the behaviour of older versions? If not, could a flag be introduced to disable it?
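
For reference, one way to confirm the read-only cgroup mount from inside such an unprivileged pod (a hedged diagnostic sketch; cgroup-check is a hypothetical pod name, and the sleep/grep utilities are assumed from the busybox shell shipped in the rancher/k3s image):

# run a throwaway pod from the same image, overriding the entrypoint so it
# doesn't crash, then inspect the cgroup mount flags; "ro" in the options
# column is what makes the mkdir in the fatal error above fail
kubectl run cgroup-check --image=rancher/k3s:v1.22.2-k3s2 --restart=Never \
  --command -- sleep 3600
kubectl exec cgroup-check -- grep cgroup /proc/mounts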

Steps To Reproduce:

  • Install a docker-desktop, kind, or equivalent host cluster at v1.22 or higher
  • Create a new k3s container within that host cluster:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.2-k3s2
      name: k3s

This doesn't work with v1.22 and newer, while it works with v1.21 (e.g. image rancher/k3s:v1.21.2-k3s1) and lower.
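
A quick way to reproduce (a minimal sketch, assuming the manifest above is saved as test.yaml):

kubectl apply -f test.yaml
# with a v1.22+ image the pod crash-loops and the fatal cgroup error shows up in:
kubectl logs test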

Expected behavior:

k3s container should be running without errors

Actual behavior:

k3s container fails with error:

time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"

Backporting

  • [x] Needs backporting to older releases
@brandond (Member) commented Jan 6, 2022

Hmm. Disabling cgroup evac when the agent is disabled should be easy enough to do, but we are considering dropping the hidden --disable-agent flag entirely, since it has long been unsupported and some features (managed etcd for example) will not work properly with the agent disabled.

@FabianKramm (Author) commented Jan 6, 2022

@brandond thanks for the answer! Hmm, that's sad to hear, as it would eliminate k3s as a viable solution for our use case. We really think k3s is currently a great fit for virtual Kubernetes clusters: it provides a minimal control plane, which is exactly what we need, and has quite a few advantages over a regular k8s deployment. We could switch to another distro such as k0s or vanilla k8s containers, which currently work fine, but we have been very happy with what k3s provides and it has worked really well for our users up to this point.

I know the disable flag was kind of a workaround to begin with, and we would be fine with certain features not working with it, but removing it would definitely render k3s unusable for us. So we would be very grateful if you would consider continuing to support disabling the agent, which might be useful for other use cases as well that only need parts of the control plane.

@brandond (Member) commented Jan 6, 2022

What's the downside of running the kubelet in your container? Do you just want to avoid seeing a node object in the virtual cluster?

@FabianKramm (Author) commented Jan 6, 2022

@brandond vcluster only virtualizes the control plane and schedules the actual workloads on the host cluster. That means the virtual cluster consists of just an api server, controller manager, storage backend, and a hypervisor that translates objects between the virtual control plane and the actual host cluster, so no additional kubelets are required. It would also be possible to run an actual kubelet in the container, but that would most certainly require more permissions on the node, while vcluster is mostly targeted at multi-tenancy use cases where, for example, you only have access to a single namespace but need to install a new CRD or webhook in it, which the control plane virtualization allows.

@brandond (Member) commented Jan 6, 2022

It's a bit of a hack, but since cgroup evacuation only runs if k3s is pid 1, you could try running /bin/k3s from /bin/sh:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - -c
        - /bin/k3s server
          --write-kubeconfig=/data/k3s-config/kube-config.yaml
          --disable=traefik,servicelb,metrics-server,local-storage,coredns
          --disable-network-policy
          --disable-agent
          --disable-scheduler
          --disable-cloud-controller
          --flannel-backend=none
          --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
          --service-cidr=10.96.0.0/12
          && true
      command:
        - /bin/sh
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s

Note that && true is necessary to prevent /bin/sh from simply exec'ing k3s, which would leave it as pid 1 again.
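
A minimal sketch of the pid 1 distinction being relied on here (assumed shell behavior: many sh implementations exec a lone -c command as an optimization instead of forking):

/bin/sh -c '/bin/k3s server'          # sh may exec() k3s, making k3s pid 1 again
/bin/sh -c '/bin/k3s server && true'  # sh must stay alive to run the "&& true",
                                      # so sh keeps pid 1 and k3s runs as a child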

@brandond (Member) commented Jan 6, 2022

That said, I'm curious what about your environment makes the cgroups read-only; we regularly run K3s in Docker and containerd without issue.

@FabianKramm (Author) commented

@brandond great, thanks a lot, that certainly helps. I tested this with my Docker Desktop 4.3.2 (72729) Kubernetes cluster v1.22.4 on Intel macOS Monterey 12.1. We have several test machines, and it no longer works on any of them with the new Docker Desktop version that uses the new v1.22 Kubernetes cluster. Maybe they changed something there, but we have also received reports of this not working anymore in v1.22 kind or k3d on Linux machines.

@brandond (Member) commented Jan 6, 2022

k3d doesn't run k3s as pid 1 (it uses its own entrypoint script that does the cgroup evacuation, among other things), so it wouldn't be affected. This behavior was added as a workaround for cgroupv2 systems.

I don't personally use kind, is it normal for that to set up the containers with read-only cgroups?
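
For context, a rough sketch of what root cgroup evacuation typically involves on cgroup v2 (modeled on the common docker-in-docker entrypoint pattern, not necessarily the exact k3s or k3d code). On cgroup v2, controllers can only be delegated via the root's cgroup.subtree_control once no processes remain in the root cgroup, so pid 1 first moves everything into a child group:

# create a child cgroup for existing processes; this is the mkdir that fails
# with "read-only file system" in this issue
mkdir -p /sys/fs/cgroup/init
# move every process out of the root cgroup into the child
# (xargs with no command defaults to echo, emitting one pid per line)
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs
# with the root cgroup empty, controllers can now be enabled for children
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers \
  > /sys/fs/cgroup/cgroup.subtree_control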

@FabianKramm (Author) commented Jan 6, 2022

@brandond but we are running k3s within k3d as a container, and that container then fails, so I guess the k3s container would run as pid 1 within the k3d cluster, correct? I'm not sure why those cgroups are read-only; I don't have a lot of expertise there, to be honest, but it's certainly very weird that this only occurs on some systems.

@brandond (Member) commented Jan 7, 2022

Can you identify which host operating systems/distros it's read-only on?

@brandond (Member) commented Jan 7, 2022

Cc @iwilltry42

@FabianKramm (Author) commented Jan 7, 2022

@brandond this is probably a non-exhaustive list, but docker desktop seems to use linuxkit (Linux version 5.10.76-linuxkit (root@buildkitsandbox) (gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, GNU ld (GNU Binutils) 2.35.2) #1 SMP Mon Nov 8 10:21:19 UTC 2021), and one of our users who reported this bug as well is using Manjaro Linux with kernel 5.10.

One thing that caught my eye: it worked with docker desktop v4.2.0 but didn't with v4.3.0, and the following appears in the release notes for v4.3.0:

Docker Desktop now uses cgroupv2. If you need to run systemd in a container then:
- Ensure your version of systemd supports cgroupv2. It must be at least systemd 247. Consider upgrading any centos:7 images to centos:8.
- Containers running systemd need the following options: --privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw.

Especially that last point indicates that just running the k3s container in their Kubernetes distribution will not be enough, as we would need privileged access there as well as those rw permissions. Unfortunately, running k3s as a privileged container isn't an option for us; in multi-tenancy scenarios that is pretty much a no-go.

Thanks so much for your help!
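
For completeness, a rough pod-spec equivalent of the docker flags quoted above (an assumed mapping, shown only to illustrate why this path is a no-go for multi-tenancy; note that --cgroupns=host has no direct per-pod equivalent):

apiVersion: v1
kind: Pod
metadata:
  name: k3s-privileged   # hypothetical name, illustration only
spec:
  containers:
    - name: k3s
      image: rancher/k3s:v1.22.5-k3s1
      command: ["/bin/k3s"]
      args: ["server"]
      securityContext:
        privileged: true            # maps to docker's --privileged
      volumeMounts:
        - name: cgroup
          mountPath: /sys/fs/cgroup # maps to -v /sys/fs/cgroup:/sys/fs/cgroup:rw
  volumes:
    - name: cgroup
      hostPath:
        path: /sys/fs/cgroup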

@iwilltry42 (Collaborator) commented Jan 8, 2022

Not sure if it helps, but let me just drop some info here:

  • k3d cluster create test-cluster --image rancher/k3s:v1.22.5-k3s1 works without problems for me
    • Docker 20.10.12 on Ubuntu 21.10 (kernel 5.15.8) with k3d v5.2.2
  • k3d runs K3s containers in privileged mode by default
  • k3d runs K3s containers with docker-init and a custom entrypoint (e.g. as mentioned for the cgroup evacuation):
    / # ps aux
    PID   USER     COMMAND
        1 0        /sbin/docker-init -- /bin/k3d-entrypoint.sh server --tls-san 0.0.0.0
        7 0        /bin/k3s server
       69 0        containerd

UPDATE 1: Just tested with Docker for Desktop on Windows 10 without a problem 🤔

  • Docker v20.10.11 (DfD v4.3.2)
    • Kernel 5.10.76-linuxkit
    • cgroup2/cgroupfs
  • k3d v5.2.2
  • k3s v1.22.5-k3s1

@FabianKramm (Author) commented Jan 8, 2022

@iwilltry42 thanks so much for your reply and investigation! Our use case is a little different from the default k3d setup: we do not run k3s in docker directly, but rather use an already existing k3d, docker desktop, or kind Kubernetes cluster and schedule a new limited k3s pod in there (basically just the data store, api server, and controller manager, with everything else such as the scheduler, agent, etc. disabled). The problem is that this pod fails to start, because k3s tries to evacuate the cgroups on a read-only file system while running unprivileged (which for our use case shouldn't be necessary at all). So it's basically Kubernetes within Kubernetes instead of Kubernetes within docker. To reproduce the problem, set up k3d like you did and then schedule a pod like the following in there; it should fail with the above error message (though mysteriously it works on some systems, for example GKE or older docker desktop versions, which might not use cgroup v2):

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s

We then have an additional component that syncs pods created in that minimal control plane to the actual Kubernetes cluster, which schedules them on the real nodes; the k3s pod itself cannot schedule any pods, as no real nodes are joined. The advantage is that you can essentially split up the control plane and give users access to a fully working Kubernetes cluster with CRDs, webhooks, ClusterRoles, etc., while the actual workloads are synced to the same namespace on the host cluster. This is great for multi-tenancy scenarios where you want to give different people limited access to the host Kubernetes cluster.

@brandond (Member) commented Jan 8, 2022

Not mucking about with cgroups when not running the kubelet seems reasonable; I'll take a shot at that for the next patch release.

@FabianKramm (Author) commented

@brandond thanks so much, sounds great!

@rancher-max (Contributor) commented

Validated in all of v1.20.15-rc1+k3s1, v1.21.9-rc1+k3s1, v1.22.6-rc1+k3s1, and v1.23.2-rc1+k3s1, using the following pod manifests:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-120
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.20.15-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-121
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.21.9-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-122
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.6-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-123
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.23.2-rc1-k3s1
      name: k3s
  • All pods, other than the original test pod, are up and running successfully, as expected:
# kubectl get nodes,pods -A -o wide
NAME                             STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION            CONTAINER-RUNTIME
node/k3d-test-cluster-server-0   Ready    control-plane,master   13m   v1.22.5+k3s1   172.18.0.2    <none>        K3s dev    5.11.12-300.fc34.x86_64   containerd://1.5.8-k3s1

NAMESPACE     NAME                                         READY   STATUS             RESTARTS      AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
kube-system   pod/coredns-85cb69466-qcs64                  1/1     Running            0             13m     10.42.0.4    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/local-path-provisioner-64ffb68fd-vkzzn   1/1     Running            0             13m     10.42.0.2    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/metrics-server-9cf544f65-w5fbw           1/1     Running            0             13m     10.42.0.3    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/helm-install-traefik-crd--1-jc8lh        0/1     Completed          0             13m     10.42.0.5    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/helm-install-traefik--1-kzcbp            0/1     Completed          2             13m     10.42.0.6    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/svclb-traefik-ft5p8                      2/2     Running            0             12m     10.42.0.7    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/traefik-786ff64748-fxj6f                 1/1     Running            0             12m     10.42.0.8    k3d-test-cluster-server-0   <none>           <none>
default       pod/test-122                                 1/1     Running            0             6m14s   10.42.0.14   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-120                                 1/1     Running            0             6m14s   10.42.0.11   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-123                                 1/1     Running            0             6m14s   10.42.0.12   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-121                                 1/1     Running            0             6m14s   10.42.0.13   k3d-test-cluster-server-0   <none>           <none>
default       pod/test                                     0/1     CrashLoopBackOff   6 (46s ago)   6m14s   10.42.0.10   k3d-test-cluster-server-0   <none>           <none>
  • The original test pod has the expected error:
# k logs test
time="2022-01-24T18:29:11Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"
