kubelet crashes with: root container [kubepods] doesn't exist #95488

Closed
b10s opened this issue Oct 12, 2020 · 31 comments
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@b10s
Contributor

b10s commented Oct 12, 2020

What happened:
kubelet became unhealthy after a while, a few days after a healthy run.

Only two ways to bring it back:

  • rebooting the node (restarting the systemd service doesn't help);
  • using the "workaround" kubelet parameters and restarting the service:
--cgroups-per-qos=false
--enforce-node-allocatable=""

What you expected to happen:

  • kubelet keeps running;
  • in case of a crash, kubelet is resurrected automatically by a systemd restart.

How to reproduce it (as minimally and precisely as possible):
Run the plain kubelet binary on a CoreOS 2135.6.0 node. Before, we ran it as a hyperkube image.

I know CoreOS is at its End Of Life now; however, I still need to find the root cause and fix it if possible, or use the workaround with a full understanding of the current situation.

Anything else we need to know?:
error message:
Failed to start ContainerManager failed to initialize top level QOS containers: root container [kubepods] doesn't exist

Environment:

  • Kubernetes version (use kubectl version):
    v1.19.2 (latest as of now)
  • Cloud provider or hardware configuration:
    bare metal node
  • OS (e.g: cat /etc/os-release):
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=2135.6.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 2135.6.0 (Rhyolite)"
  • Kernel (e.g. uname -a):
    4.19.56-coreos-r1
  • Install tools:
    manually crafted systemd unit file:
[Service]
EnvironmentFile=/etc/environment
Environment=DNS_SERVICE_IP=1.1.1.1
Environment=NETWORK_PLUGIN=cni
Environment=SERVICE_IP_RANGE=10.10.0.0/16

ExecStartPre=/usr/bin/mkdir -p /etc/cni
ExecStartPre=/usr/bin/mkdir -p /opt/cni/bin
ExecStartPre=/usr/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/usr/bin/mkdir -p /var/log/containers
ExecStart=/opt/bin/kubelet \
  --config=/etc/kubernetes/kubelet-kubelet-configuration.yaml \
  --kubeconfig=/etc/kubernetes/kubelet-kubeconfig.yaml \
  --node-labels=role=master,node.kubernetes.io/master=,machine-type=Physical \
  --cni-conf-dir=/etc/cni/net.d \
  --network-plugin=${NETWORK_PLUGIN} \
  --container-runtime=docker
Restart=always
RestartSec=10
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Other:

On my node I can see such hierarchy under /sys/fs/cgroup/memory/kubepods/:

# ls -l /sys/fs/cgroup/memory/kubepods/ | head -n 5
total 0
drwxr-xr-x.  2 root root 0 Oct 12 13:57 besteffort
drwxr-xr-x. 14 root root 0 Oct 12 13:57 burstable
-rw-r--r--.  1 root root 0 Oct 12 13:57 cgroup.clone_children
--w--w--w-.  1 root root 0 Oct 12 13:57 cgroup.event_control

Do I understand right that, since --cgroups-per-qos is true by default, in each cgroup hierarchy - e.g. memory, cpu,cpuacct, etc. - kubelet creates a directory kubepods, which it calls the root container?

Also, inside kubepods in each hierarchy the kubelet creates subdirectories only for two QoS classes - besteffort and burstable - but how does kubelet use these two subdirectories in each hierarchy? (I was only able to find a setting of minShares.)

It would be good to know the root cause of the issue and to understand what kubepods and the root container are, and how node allocatable and cgroups-per-qos relate to them.
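For what it's worth, this is how I picture the expected layout - a purely illustrative Go sketch, not kubelet's actual code (the assumption being that Guaranteed pods sit directly under kubepods, while besteffort and burstable get their own subdirectories in every hierarchy):

package main

import (
	"fmt"
	"path/filepath"
)

// expectedQOSCgroups is purely illustrative (not kubelet code): with
// --cgroups-per-qos enabled, every cgroup hierarchy should contain a
// kubepods directory plus besteffort and burstable subdirectories
// (Guaranteed pods are placed directly under kubepods).
func expectedQOSCgroups(mountPoints map[string]string) []string {
	var dirs []string
	for _, mp := range mountPoints {
		dirs = append(dirs,
			filepath.Join(mp, "kubepods"),
			filepath.Join(mp, "kubepods", "besteffort"),
			filepath.Join(mp, "kubepods", "burstable"),
		)
	}
	return dirs
}

func main() {
	mounts := map[string]string{
		"memory": "/sys/fs/cgroup/memory",
		"cpu":    "/sys/fs/cgroup/cpu,cpuacct",
	}
	for _, d := range expectedQOSCgroups(mounts) {
		fmt.Println(d)
	}
}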

@b10s b10s added the kind/bug Categorizes issue or PR as related to a bug. label Oct 12, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 12, 2020
@neolit123
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 12, 2020
@b10s
Contributor Author

b10s commented Nov 9, 2020

Just got a new crash of one worker node, right after updating to v1.19.2 from v1.18.6 and switching from the hyperkube image to the plain kubelet.

The error from kubelet logs is:

Nov 09 15:53:43 node-name kubelet[20659]: F1109 15:53:43.371350   20659 kubelet.go:1296] Failed to start ContainerManager failed to initialize top level QOS containers: root container [kubepods] doesn't exist

All kubepods cgroups seem to be present (compared to a healthy node, all 15 are here):

# ls -ld /sys/fs/cgroup/*/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/blkio/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/cpu,cpuacct/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/cpu/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/cpuacct/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/cpuset/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/devices/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/freezer/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/hugetlb/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/memory/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/net_cls,net_prio/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/net_cls/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/net_prio/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/perf_event/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/pids/kubepods
drwxr-xr-x. 4 root root 0 Jul 16 08:06 /sys/fs/cgroup/systemd/kubepods

Probably I need to investigate the cgroup structure more deeply.
If you have any ideas what to check, I will appreciate it.

@b10s
Contributor Author

b10s commented Nov 9, 2020

[screenshot: Selection_999(088)]
I've straced kubelet a bit. It seems it tries to newfstatat() a cgroup which doesn't exist:

newfstatat(AT_FDCWD, "/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15/kubepods", 0xc0018d1968, 0) = -1 ENOENT (No such file or directory)

Why might it happen after the update?
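For context, this is roughly the check that the strace output suggests - an assumption sketched in Go, not kubelet's verbatim code: stat the kubepods path in every subsystem hierarchy and report the root container as missing if any single path is absent.

package main

import (
	"fmt"
	"os"
)

// cgroupExists is a rough sketch of the check hinted at by the strace
// above (assumption, not kubelet's real code): stat the kubepods path
// in every subsystem hierarchy and treat a single missing path as
// "root container doesn't exist".
func cgroupExists(cgroupPaths map[string]string) bool {
	for subsystem, path := range cgroupPaths {
		if _, err := os.Stat(path); err != nil {
			fmt.Printf("missing %s cgroup: %s\n", subsystem, path)
			return false
		}
	}
	return true
}

func main() {
	paths := map[string]string{
		"memory":  "/sys/fs/cgroup/memory/kubepods",
		"systemd": "/sys/fs/cgroup/systemd/kubepods/burstable/pod.../.../kubepods", // bogus path as seen in the strace
	}
	fmt.Println("kubepods exists:", cgroupExists(paths))
}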

@b10s
Contributor Author

b10s commented Nov 9, 2020

Is the following cgroup a valid cgroup?

/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15/kubepods

It has two kubepods in its path (see the screenshot above).

I checked with find on other nodes - it seems kubepods appears only once in each cgroup hierarchy:

# find /sys/fs/cgroup/ -name kubepods -type d
/sys/fs/cgroup/pids/kubepods
/sys/fs/cgroup/hugetlb/kubepods
/sys/fs/cgroup/net_cls,net_prio/kubepods
/sys/fs/cgroup/cpuset/kubepods
/sys/fs/cgroup/blkio/kubepods
/sys/fs/cgroup/devices/kubepods
/sys/fs/cgroup/perf_event/kubepods
/sys/fs/cgroup/memory/kubepods
/sys/fs/cgroup/cpu,cpuacct/kubepods
/sys/fs/cgroup/freezer/kubepods
/sys/fs/cgroup/systemd/kubepods

However, if I manually remove the last kubepods part, I get a valid path:

# ls /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15/
cgroup.clone_children  cgroup.procs  notify_on_release  tasks

@b10s
Contributor Author

b10s commented Nov 10, 2020

Built kubelet with debug info:

mkdir -p $GOPATH/src/k8s.io
cd $GOPATH/src/k8s.io
git clone https://github.com/kubernetes/kubernetes
cd kubernetes
git checkout tags/v1.19.2
make all GOGCFLAGS="-N -l" GOLDFLAGS=""

brought it to an unhealthy node and ran it under Delve.

I can see an unexpected cgroupPaths value for the systemd part:

(dlv) p cgroupPaths
map[string]string [
        "memory": "/sys/fs/cgroup/memory/kubepods", 
        "cpuset": "/sys/fs/cgroup/cpuset/kubepods", 
        "net_cls": "/sys/fs/cgroup/net_cls,net_prio/kubepods", 
        "net_prio": "/sys/fs/cgroup/net_cls,net_prio/kubepods", 
        "pids": "/sys/fs/cgroup/pids/kubepods", 
        "perf_event": "/sys/fs/cgroup/perf_event/kubepods", 
        "systemd": "/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-...+91 more", 
        "freezer": "/sys/fs/cgroup/freezer/kubepods", 
        "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods", 
        "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods", 
        "devices": "/sys/fs/cgroup/devices/kubepods", 
        "blkio": "/sys/fs/cgroup/blkio/kubepods", 
        "hugetlb": "/sys/fs/cgroup/hugetlb/kubepods", 
]

at point
https://github.com/kubernetes/kubernetes/blob/v1.19.2/pkg/kubelet/cm/cgroup_manager_linux.go#L278

Probably the buildCgroupPaths has some unexpected behavior

cgroupPaths := m.buildCgroupPaths(name)
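A minimal sketch of what buildCgroupPaths appears to do (an assumption, not the real implementation): it joins each subsystem's mount point with the cgroup name, so the contaminated systemd mount point above produces a path with a second kubepods component - exactly the non-existent path seen in the strace.

package main

import (
	"fmt"
	"path/filepath"
)

// buildPaths is an illustrative stand-in for buildCgroupPaths: join
// each subsystem's mount point with the requested cgroup name.
func buildPaths(mountPoints map[string]string, name string) map[string]string {
	paths := make(map[string]string, len(mountPoints))
	for subsystem, mp := range mountPoints {
		paths[subsystem] = filepath.Join(mp, name)
	}
	return paths
}

func main() {
	mountPoints := map[string]string{
		"memory": "/sys/fs/cgroup/memory",
		// the contaminated entry seen under Delve
		"systemd": "/sys/fs/cgroup/systemd/kubepods/burstable/pod.../container...",
	}
	for s, p := range buildPaths(mountPoints, "kubepods") {
		fmt.Printf("%-7s -> %s\n", s, p)
	}
	// The systemd path ends up with a second "kubepods" component,
	// which is exactly the non-existent path the strace showed.
}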

@b10s
Contributor Author

b10s commented Nov 10, 2020

It seems something goes wrong during the parsing of /proc/self/mountinfo:
https://github.com/opencontainers/runc/blob/v1.0.0-rc91/libcontainer/cgroups/v1_utils.go#L120
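As a rough illustration of that parsing step (a simplified sketch, not runc's actual code): every mountinfo line whose filesystem type is cgroup becomes one Mount entry, so a second systemd line - like the stray one on the unhealthy node below - yields a second entry for the systemd subsystem.

package main

import (
	"fmt"
	"strings"
)

type cgroupMount struct {
	Mountpoint string
	Root       string
	Subsystems []string
}

// parseCgroupMounts is a simplified sketch (not runc's real parser):
// keep every mountinfo line whose filesystem type is "cgroup" and read
// the subsystems from the super options, turning "name=systemd" into
// the "systemd" named hierarchy.
func parseCgroupMounts(mountinfo string) []cgroupMount {
	var mounts []cgroupMount
	for _, line := range strings.Split(mountinfo, "\n") {
		pre, post, ok := strings.Cut(line, " - ")
		if !ok {
			continue
		}
		preFields := strings.Fields(pre)
		postFields := strings.Fields(post)
		if len(preFields) < 5 || len(postFields) < 3 || postFields[0] != "cgroup" {
			continue
		}
		var subs []string
		for _, opt := range strings.Split(postFields[2], ",") {
			switch {
			case strings.HasPrefix(opt, "name="):
				subs = append(subs, strings.TrimPrefix(opt, "name="))
			case opt == "rw", opt == "ro", opt == "xattr", strings.HasPrefix(opt, "release_agent="):
				// mount options, not subsystems
			default:
				subs = append(subs, opt)
			}
		}
		mounts = append(mounts, cgroupMount{
			Mountpoint: preFields[4],
			Root:       preFields[3],
			Subsystems: subs,
		})
	}
	return mounts
}

func main() {
	// Two systemd lines, as on the unhealthy node; the second is the stray mount.
	info := "26 25 0:23 / /sys/fs/cgroup/systemd rw,relatime shared:6 - cgroup cgroup rw,name=systemd\n" +
		"2826 26 0:23 /kubepods/... /sys/fs/cgroup/systemd/kubepods/... rw,relatime shared:6 - cgroup cgroup rw,name=systemd"
	for _, m := range parseCgroupMounts(info) {
		fmt.Printf("%+v\n", m)
	}
}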

@b10s
Contributor Author

b10s commented Nov 10, 2020

[screenshot: Selection_999(091)]

https://man7.org/linux/man-pages/man5/proc.5.html

On healthy node:

# cat /proc/self/mountinfo | grep -P ' - cgroup'
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
29 25 0:26 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,devices
30 25 0:27 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,perf_event
31 25 0:28 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,hugetlb
32 25 0:29 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,blkio
33 25 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,freezer
34 25 0:31 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,net_cls,net_prio
35 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,cpu,cpuacct
36 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,memory
37 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
38 25 0:35 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,cpuset

on unhealthy node:

# cat /proc/self/mountinfo | grep -P ' - cgroup'
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
29 25 0:26 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,freezer
30 25 0:27 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,cpu,cpuacct
31 25 0:28 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,memory
32 25 0:29 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,perf_event
33 25 0:30 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,devices
34 25 0:31 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,blkio
35 25 0:32 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,cpuset
36 25 0:33 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
37 25 0:34 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,hugetlb
38 25 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,pids
2826 26 0:23 /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

Why might such a "bad" record (the last one) appear?

@b10s
Contributor Author

b10s commented Nov 11, 2020

It seems the kubelet code

https://github.com/kubernetes/kubernetes/blob/v1.19.2/pkg/kubelet/cm/helpers_linux.go#L203-L208

squeezes the array of 12 elements into a map of 13 elements.

But why "squeezes" if the map is bigger than the array?!

Because a few hierarchies belong to multiple subsystems, while only one subsystem - systemd - has two hierarchies, and in kubelet's map it ends up with only one of them.
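As an aside, the squeeze is easy to see in a minimal sketch of the map construction (illustrative only, not the verbatim helpers_linux.go code): the map is keyed by subsystem name, so when the same subsystem appears in two mounts, whichever mount comes last simply overwrites the earlier entry.

package main

import "fmt"

type mount struct {
	Mountpoint string
	Subsystems []string
}

func main() {
	// Two mounts carry the "systemd" subsystem, as on the broken node.
	allCgroups := []mount{
		{Mountpoint: "/sys/fs/cgroup/systemd", Subsystems: []string{"systemd"}},
		{Mountpoint: "/sys/fs/cgroup/cpu,cpuacct", Subsystems: []string{"cpu", "cpuacct"}},
		{Mountpoint: "/sys/fs/cgroup/systemd/kubepods/burstable/pod.../container...", Subsystems: []string{"systemd"}},
	}

	// Keyed by subsystem name: the later systemd mount silently wins.
	mountPoints := make(map[string]string)
	for _, m := range allCgroups {
		for _, s := range m.Subsystems {
			mountPoints[s] = m.Mountpoint
		}
	}
	fmt.Println(mountPoints["systemd"]) // prints the stray, nested mount point
}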


The array returned from runc (important: there are two mount points for the systemd subsystem here):

> k8s.io/kubernetes/pkg/kubelet/cm.getCgroupSubsystemsV1() /home/b10s/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/cm/helpers_linux.go:197 (PC: 0x1d36554)

(dlv) p allCgroups
[]k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups.Mount len: 12, cap: 14, [
        {
                Mountpoint: "/sys/fs/cgroup/systemd",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["systemd"],},
        {
                Mountpoint: "/sys/fs/cgroup/freezer",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["freezer"],},
        {
                Mountpoint: "/sys/fs/cgroup/cpu,cpuacct",
                Root: "/",
                Subsystems: []string len: 2, cap: 2, ["cpu","cpuacct"],},
        {
                Mountpoint: "/sys/fs/cgroup/memory",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["memory"],},
        {
                Mountpoint: "/sys/fs/cgroup/perf_event",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, [
                        "perf_event",
                ],},
        {
                Mountpoint: "/sys/fs/cgroup/devices",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["devices"],},
        {
                Mountpoint: "/sys/fs/cgroup/blkio",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["blkio"],},
        {
                Mountpoint: "/sys/fs/cgroup/cpuset",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["cpuset"],},
        {
                Mountpoint: "/sys/fs/cgroup/net_cls,net_prio",
                Root: "/",
                Subsystems: []string len: 2, cap: 2, [
                        "net_cls",
                        "net_prio",
                ],},
        {
                Mountpoint: "/sys/fs/cgroup/hugetlb",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["hugetlb"],},
        {
                Mountpoint: "/sys/fs/cgroup/pids",
                Root: "/",
                Subsystems: []string len: 1, cap: 1, ["pids"],},
        {
                Mountpoint: "/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-...+82 more",
                Root: "/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842...+60 more",
                Subsystems: []string len: 1, cap: 1, ["systemd"],},
]

The map[string]string which kubelet builds from the above array:

> k8s.io/kubernetes/pkg/kubelet/cm.getCgroupSubsystemsV1() /home/b10s/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/kubelet/cm/helpers_linux.go:209 (PC: 0x1d36655)

(dlv) p mountPoints
map[string]string [
        "systemd": "/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15", 
        "cpu": "/sys/fs/cgroup/cpu,cpuacct", 
        "perf_event": "/sys/fs/cgroup/perf_event", 
        "blkio": "/sys/fs/cgroup/blkio", 
        "cpuset": "/sys/fs/cgroup/cpuset", 
        "net_prio": "/sys/fs/cgroup/net_cls,net_prio", 
        "hugetlb": "/sys/fs/cgroup/hugetlb", 
        "freezer": "/sys/fs/cgroup/freezer", 
        "cpuacct": "/sys/fs/cgroup/cpu,cpuacct", 
        "memory": "/sys/fs/cgroup/memory", 
        "devices": "/sys/fs/cgroup/devices", 
        "net_cls": "/sys/fs/cgroup/net_cls,net_prio", 
        "pids": "/sys/fs/cgroup/pids", 
]

Probably there is an expectation of only one mount point per subsystem, as stated in Rule 2 of the Red Hat doc.

However, a single subsystem can be attached to two hierarchies if both of those hierarchies have only that subsystem attached.

May I ask, is this intentional, or is it an unexpected squeeze?

@odinuge
Member

odinuge commented Nov 16, 2020

Hmm. Interesting..

I see you are using dockershim, but what cgroup driver are docker and kubelet using? If you use systemd as init, you should also use systemd as the cgroup driver for both kubelet and the container runtime.

Also, do you have any pods with containers running systemd? Maybe that is causing the "double" mount?

There is also a preflight warning in kubeadm that says:

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/

@b10s
Contributor Author

b10s commented Nov 16, 2020

@odinuge thank you for the reply!

what cgroup driver is docker and kubelet using?

The kubelet and container runtime (docker) are using cgroupfs driver:

# cat /etc/kubernetes/kubelet-kubelet-configuration.yaml | grep -i cgroupDriver
cgroupDriver: "cgroupfs"

and

# docker info | grep -i cgr
Cgroup Driver: cgroupfs

If you use systemd as init you should also use systemd as cgroup driver

Yeah, we are considering moving to the systemd cgroup driver, as recommended, in upcoming updates of our cluster, but not right now, I think.

do you have any pods with containers running systemd?

That is a good point! I'm not sure, but it might be the reason if such a container exists and has access to the host's mount namespace, so that it can mount the named systemd hierarchy. Or is there even no need for access to the host mount namespace?
I think it should be easy to reproduce and confirm whether the assumption is correct.
Let me try! Do you think it will be enough to run a pod with a container with systemd on board? Or are some special privileges required for such a container to reproduce the double systemd named mount?

There is also a preflight warning in kubeadm

Unfortunately we do not use kubeadm, but our own set of scripts to deploy k8s. Perhaps in the future we could consider improving kubelet to emit the same warning, or even to refuse to run with the cgroupfs cgroup driver on systems with systemd as init. What do you think?
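For what it's worth, detecting a systemd init is cheap. A minimal sketch of such a check, assuming the usual sd_booted() convention that systemd creates /run/systemd/system when it runs as PID 1:

package main

import (
	"fmt"
	"os"
)

// systemdIsInit mirrors the common sd_booted() convention: systemd
// creates the /run/systemd/system directory when it runs as PID 1.
func systemdIsInit() bool {
	info, err := os.Stat("/run/systemd/system")
	return err == nil && info.IsDir()
}

func main() {
	if systemdIsInit() {
		fmt.Println("warning: systemd is the init system; the systemd cgroup driver is recommended over cgroupfs")
	}
}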

@odinuge
Member

odinuge commented Nov 16, 2020

Ahh, yeah, running with cgroupfs together with systemd is probably what is causing the problems.

Do you think it will be enough to run a pod with container with systemd on board? Or some special privileges are required for such container to reproduce double systemd named mount?

I do not remember off the top of my head what it will require, but it does require more than just the default values for pods. It is possible with the right permissions, though.

Unfortunately we do not use kubeadm but our own set of scripts to deploy k8s. Probably we can consider in future to improve kubelet to emit the same warning or even do not run with cgroupfs cgroup driver on systems with systemd init. What do you think?

Ahh, makes sense. Hmm. I guess we could fail kubelet on startup (as we do with swap), but that might end up with upgrades failing. Log entries with a warning would probably not help that much, but I guess that is possible.

@b10s
Contributor Author

b10s commented Nov 18, 2020

@odinuge I've tried two scenarios:

  • running a privileged pod with systemd on board;
  • running a container on the same host via docker.

Unfortunately I was not able to reproduce it.

Maybe there is some other way to escape the mount namespace and create a mount on the host machine like

/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15

or more detailed:

mount ID         2826
parent ID        26
major:minor      0:23
root             /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
mount point      /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
mount options    rw,nosuid,nodev,noexec,relatime
optional fields  shared:6
separator        -
filesystem type  cgroup
mount source     cgroup
super options    rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

What I tried so far in detail

run privileged pod with systemd on board

The way I run:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: po
  name: po
spec:
  nodeSelector:
    "kubernetes.io/hostname": "node4"
  containers:
  - args:
    - sleep
    - inf
    image: centos/systemd
    name: po
    securityContext:
      privileged: true
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}

On the node I've checked some runc arguments:

# cat /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022/config.json | jq '.mounts'
..
    {
      "destination": "/sys/fs/cgroup",
      "type": "bind",
      "source": "/var/lib/docker/volumes/1b979964a75408e2e7975095723cbdfca9b4cd1e5cdfacaa293e266b5cbd3bd9/_data",
      "options": [
        "rbind"
      ]
    },
..

and

# cat /var/run/docker/runtime-runc/moby/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022/state.json | jq '.'
..
      {
        "source": "/var/lib/docker/volumes/1b979964a75408e2e7975095723cbdfca9b4cd1e5cdfacaa293e266b5cbd3bd9/_data",
        "destination": "/sys/fs/cgroup",
        "device": "bind",
        "flags": 20480,
        "propagation_flags": null,
        "data": "",
        "relabel": "",
        "extensions": 0,
        "premount_cmds": null,
        "postmount_cmds": null
      },
..

and

# cat /var/run/docker/runtime-runc/moby/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022/state.json | jq '.cgroup_paths'
{
  "blkio": "/sys/fs/cgroup/blkio/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "cpu": "/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "cpuset": "/sys/fs/cgroup/cpuset/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "devices": "/sys/fs/cgroup/devices/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "freezer": "/sys/fs/cgroup/freezer/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "hugetlb": "/sys/fs/cgroup/hugetlb/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "memory": "/sys/fs/cgroup/memory/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "name=systemd": "/sys/fs/cgroup/systemd/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "net_cls": "/sys/fs/cgroup/net_cls,net_prio/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "net_prio": "/sys/fs/cgroup/net_cls,net_prio/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "perf_event": "/sys/fs/cgroup/perf_event/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022",
  "pids": "/sys/fs/cgroup/pids/kubepods/burstable/podee9951df-fb87-4568-98f1-bde2c62fbad6/5dd3245a5134f58848b80089a740edf159c0e5f2f52af6c672ba7b4793b10022"
}

From container:

[root@po /]# cat /proc/self/mountinfo 
12305 10220 0:1040 / / rw,relatime master:2355 - overlay overlay rw,lowerdir=/var/lib/docker/overlay2/l/XXQHBPSMFNDJP5VKTGABM7IVJV:/var/lib/docker/overlay2/l/LKMAVKN7QYKZW4G7J33C5O5URP:/var/lib/docker/overlay2/l/LQJ32S2WO7FBV6ZFK46FXS4HGS,upperdir=/var/lib/docker/overlay2/81d1217983e77c5e0d439bbe106939e283f347fbaaf1864949459aec88de9b9f/diff,workdir=/var/lib/docker/overlay2/81d1217983e77c5e0d439bbe106939e283f347fbaaf1864949459aec88de9b9f/work
12306 12305 0:1041 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
12307 12305 0:1042 / /dev rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
12308 12307 0:1043 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=666
12339 12305 0:1153 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs ro,seclabel
12340 12307 0:1149 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw,seclabel
12341 12307 8:9 /var/lib/kubelet/pods/ee9951df-fb87-4568-98f1-bde2c62fbad6/containers/po/8a03d371 /dev/termination-log rw,relatime - ext4 /dev/sda9 rw,seclabel
12342 12305 8:9 /var/lib/docker/containers/076fc59c896cbe7e6eb02af64156aac4dbaca340ae3c5a0a3750f46191843d1f/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda9 rw,seclabel
12343 12305 8:9 /var/lib/docker/containers/076fc59c896cbe7e6eb02af64156aac4dbaca340ae3c5a0a3750f46191843d1f/hostname /etc/hostname rw,relatime - ext4 /dev/sda9 rw,seclabel
12344 12305 8:9 /var/lib/kubelet/pods/ee9951df-fb87-4568-98f1-bde2c62fbad6/etc-hosts /etc/hosts rw,relatime - ext4 /dev/sda9 rw,seclabel
12345 12307 0:1148 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,context="system_u:object_r:svirt_lxc_file_t:s0:c777,c1001",size=65536k
12346 12339 8:9 /var/lib/docker/volumes/1b979964a75408e2e7975095723cbdfca9b4cd1e5cdfacaa293e266b5cbd3bd9/_data /sys/fs/cgroup rw,relatime master:1 - ext4 /dev/sda9 rw,seclabel
12347 12305 0:1146 / /run/secrets/kubernetes.io/serviceaccount ro,relatime - tmpfs tmpfs rw,seclabel

run container on the same host via docker

The way I run:
docker run -d --tmpfs /tmp --tmpfs /run -v /sys/fs/cgroup:/sys/fs/cgroup:ro centos/systemd

On the node I've checked some runc arguments:

# cat /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/config.json | jq '.mounts'
..
  {
    "destination": "/sys/fs/cgroup",
    "type": "bind",
    "source": "/sys/fs/cgroup",
    "options": [
      "rbind",
      "ro",
      "rprivate"
    ]
  }
..

and

# cat /var/run/docker/runtime-runc/moby/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/state.json | jq '.config.mounts' 
..
  {
    "source": "/sys/fs/cgroup",
    "destination": "/sys/fs/cgroup",
    "device": "bind",
    "flags": 20481,
    "propagation_flags": [
      278528
    ],
    "data": "",
    "relabel": "",
    "extensions": 0,
    "premount_cmds": null,
    "postmount_cmds": null
  }
..

and

# cat /var/run/docker/runtime-runc/moby/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/state.json | jq '.cgroup_paths'
{
  "blkio": "/sys/fs/cgroup/blkio/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "cpu": "/sys/fs/cgroup/cpu,cpuacct/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "cpuacct": "/sys/fs/cgroup/cpu,cpuacct/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "cpuset": "/sys/fs/cgroup/cpuset/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "devices": "/sys/fs/cgroup/devices/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "freezer": "/sys/fs/cgroup/freezer/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "hugetlb": "/sys/fs/cgroup/hugetlb/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "memory": "/sys/fs/cgroup/memory/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "name=systemd": "/sys/fs/cgroup/systemd/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "net_cls": "/sys/fs/cgroup/net_cls,net_prio/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "net_prio": "/sys/fs/cgroup/net_cls,net_prio/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "perf_event": "/sys/fs/cgroup/perf_event/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c",
  "pids": "/sys/fs/cgroup/pids/docker/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c"
}

From container:

[root@0143b5b44d27 /]# cat /proc/self/mountinfo 
12527 11910 0:1227 / / rw,relatime master:2565 - overlay overlay rw,context="system_u:object_r:svirt_lxc_file_t:s0:c387,c436",lowerdir=/var/lib/docker/overlay2/l/EQUTKPT3OGECJ3HBZA3DQOIKFE:/var/lib/docker/overlay2/l/LKMAVKN7QYKZW4G7J33C5O5URP:/var/lib/docker/overlay2/l/LQJ32S2WO7FBV6ZFK46FXS4HGS,upperdir=/var/lib/docker/overlay2/a9ff1d270f400b017efdf08065592af6d9ae31813982e449f929a82c44f5e6bb/diff,workdir=/var/lib/docker/overlay2/a9ff1d270f400b017efdf08065592af6d9ae31813982e449f929a82c44f5e6bb/work
12528 12527 0:1230 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
12529 12527 0:1231 / /dev rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
12530 12529 0:1232 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=666
12531 12527 0:1233 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro,seclabel
12532 12529 0:1229 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw,seclabel
12533 12527 0:1234 / /tmp rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,seclabel
12534 12527 0:1235 / /run rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,seclabel
12535 12527 8:9 /var/lib/docker/containers/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda9 rw,seclabel
12536 12527 8:9 /var/lib/docker/containers/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/hostname /etc/hostname rw,relatime - ext4 /dev/sda9 rw,seclabel
12537 12527 8:9 /var/lib/docker/containers/0143b5b44d276de4af8b7c4ba020bccf684552e4518932efaf90c45c380f680c/hosts /etc/hosts rw,relatime - ext4 /dev/sda9 rw,seclabel
12538 12529 0:1228 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,context="system_u:object_r:svirt_lxc_file_t:s0:c387,c436",size=65536k
12539 12531 0:22 / /sys/fs/cgroup ro - tmpfs tmpfs ro,seclabel,mode=755
12540 12539 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
12541 12539 0:26 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
12542 12539 0:27 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
12543 12539 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
12544 12539 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
12545 12539 0:30 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
12546 12539 0:31 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
12547 12539 0:32 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
12548 12539 0:33 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
12549 12539 0:34 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
12550 12539 0:35 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset
11911 12528 0:1230 /bus /proc/bus ro,relatime - proc proc rw
11912 12528 0:1230 /fs /proc/fs ro,relatime - proc proc rw
11913 12528 0:1230 /irq /proc/irq ro,relatime - proc proc rw
11914 12528 0:1230 /sys /proc/sys ro,relatime - proc proc rw
11915 12528 0:1230 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
11916 12528 0:1236 / /proc/acpi ro,relatime - tmpfs tmpfs ro,seclabel
11917 12528 0:1231 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
11918 12528 0:1231 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
11919 12528 0:1231 /null /proc/latency_stats rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
11920 12528 0:1231 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
11921 12528 0:1231 /null /proc/sched_debug rw,nosuid - tmpfs tmpfs rw,seclabel,size=65536k,mode=755
11922 12528 0:1237 / /proc/scsi ro,relatime - tmpfs tmpfs ro,seclabel
11923 12531 0:1238 / /sys/firmware ro,relatime - tmpfs tmpfs ro,seclabel

It seems runc has many arguments for bind mounts:
https://github.com/opencontainers/runc/blob/4d4d19ce528ac40cc357ef92cd3a6931dba19316/libcontainer/specconv/spec_linux.go#L753

I think all of those supported by https://man7.org/linux/man-pages/man2/mount.2.html

One theory is that it is possible to pass such a set of arguments that runc will create such a mount on the host:

mount ID         2826
parent ID        26
major:minor      0:23
root             /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
mount point      /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
mount options    rw,nosuid,nodev,noexec,relatime
optional fields  shared:6
separator        -
filesystem type  cgroup
mount source     cgroup
super options    rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

@odinuge
Member

odinuge commented Nov 18, 2020

Hey! I did some testing again and found a few things:

If i run:

docker run --privileged -ti -v /tmp/cg/:/sys/fs/cgroup/:shared -p 80:80 centos/systemd

I do actually end up with these mounts, and I guess it is possible to do the same in kubernetes as well (with some combination of cri and pod config):

root@odin:~# cat /proc/self/mountinfo | grep cgroup
33 24 0:28 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:9 - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
34 33 0:29 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:10 - cgroup2 cgroup2 rw,nsdelegate
35 33 0:30 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
38 33 0:33 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,cpu,cpuacct
39 33 0:34 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,blkio
40 33 0:35 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,pids
41 33 0:36 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,net_cls,net_prio
42 33 0:37 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,cpuset
43 33 0:38 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,freezer
44 33 0:39 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,hugetlb
45 33 0:40 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,devices
46 33 0:41 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,memory
47 33 0:42 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,rdma
48 33 0:43 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:25 - cgroup cgroup rw,perf_event
457 29 0:30 / /tmp/cg/systemd rw,nosuid,nodev,noexec,relatime shared:305 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
527 29 0:33 / /tmp/cg/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:311 - cgroup cgroup rw,cpu,cpuacct
541 29 0:43 / /tmp/cg/perf_event rw,nosuid,nodev,noexec,relatime shared:317 - cgroup cgroup rw,perf_event
670 29 0:40 / /tmp/cg/devices rw,nosuid,nodev,noexec,relatime shared:323 - cgroup cgroup rw,devices
684 29 0:36 / /tmp/cg/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:329 - cgroup cgroup rw,net_cls,net_prio
698 29 0:39 / /tmp/cg/hugetlb rw,nosuid,nodev,noexec,relatime shared:335 - cgroup cgroup rw,hugetlb
712 29 0:37 / /tmp/cg/cpuset rw,nosuid,nodev,noexec,relatime shared:341 - cgroup cgroup rw,cpuset
726 29 0:41 / /tmp/cg/memory rw,nosuid,nodev,noexec,relatime shared:347 - cgroup cgroup rw,memory
740 29 0:34 / /tmp/cg/blkio rw,nosuid,nodev,noexec,relatime shared:353 - cgroup cgroup rw,blkio
754 29 0:42 / /tmp/cg/rdma rw,nosuid,nodev,noexec,relatime shared:359 - cgroup cgroup rw,rdma
768 29 0:38 / /tmp/cg/freezer rw,nosuid,nodev,noexec,relatime shared:365 - cgroup cgroup rw,freezer
782 29 0:35 / /tmp/cg/pids rw,nosuid,nodev,noexec,relatime shared:371 - cgroup cgroup rw,pids

I guess it isn't smart to do this, but it certainly looks like it is possible..

Also, if the node with the problem you are referring to is still running, it would be nice if you could see which pod and container "owns" that mount.

Something like crictl inspect <container-id> should work (with container-id set to 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 in your case).

@b10s
Contributor Author

b10s commented Nov 18, 2020

@odinuge thank you!
That is amazing. I didn't know about propagation options such as "shared".

Interestingly, the mount is still there even after the container is stopped and deleted:

# docker ps | grep init
948257fdef1a        centos/systemd                                                                   "/usr/sbin/init"         31 seconds ago      Up 31 seconds       0.0.0.0:80->80/tcp   sad_lamarr

# docker stop 948257fdef1a
948257fdef1a

# docker rm 948257fdef1a
948257fdef1a

# cat /proc/self/mountinfo | grep systemd | grep cgr
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
11806 47 0:23 / /tmp/cg/systemd rw,nosuid,nodev,noexec,relatime shared:2600 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

It means we can almost reproduce such a state with docker!
There are two differences:

  • originally the mount seems to be cyclic: inside /sys/fs/cgroup/systemd, which is the systemd hierarchy, someone has mounted the same hierarchy under the /kubepods/podID/containerID path;
  • after removing the docker container all the mounts persist, but in the case of /kubepods/podID/containerID only the systemd hierarchy is still mounted.
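As a small aid for spotting this state, here is a rough Go helper (my own sketch, not part of any existing tool) that scans /proc/self/mountinfo for cgroup v1 mounts nested inside another cgroup mount point - the kind of leftover mount that trips up the kubelet here:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findNestedCgroupMounts lists cgroup v1 mounts whose mount point lies
// inside another cgroup mount point (e.g. a stray mount under
// /sys/fs/cgroup/systemd). Rough inspection helper, assumes v1 layout.
func findNestedCgroupMounts() ([]string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var points []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		pre, post, ok := strings.Cut(sc.Text(), " - ")
		if !ok || !strings.HasPrefix(post, "cgroup ") {
			continue
		}
		fields := strings.Fields(pre)
		if len(fields) >= 5 {
			points = append(points, fields[4])
		}
	}
	var nested []string
	for _, p := range points {
		for _, q := range points {
			if p != q && strings.HasPrefix(p, q+"/") {
				nested = append(nested, p)
				break
			}
		}
	}
	return nested, sc.Err()
}

func main() {
	nested, err := findNestedCgroupMounts()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range nested {
		fmt.Println("nested cgroup mount:", p)
	}
}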

Also, if the node with the problem you are reffering to are still running, it would be nice if you could see what pod and container that "owns" that mount.

Yeah, the node is still running, but it seems there is no such container anymore.

docker ps -a | grep 884 gives me nothing, and there is no crictl on CoreOS (if you think it's worth bringing crictl here, I'll try).

However, searching the filesystem gives me a result involving systemd-nspawn:

# find / -name '*8842def24*'
/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
/sys/fs/cgroup/systemd/machine.slice/systemd-nspawn@centos75.service/payload/system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount

There are two machines run by systemd on each node:

# machinectl list
MACHINE  CLASS     SERVICE        OS     VERSION ADDRESSES
centos75 container systemd-nspawn centos 7       -        
frr      container systemd-nspawn ubuntu 18.04   -        

2 machines listed.

Interesting - how might systemd-nspawn and k8s (kubelet, docker, containerd, runc) interact with each other?

@odinuge
Member

odinuge commented Nov 19, 2020

Hmm, interesting. Thanks for your detailed feedback!

The docker ps -a | grep 884 gives me nothing and there is no crictl on CoreOS (if you think it worth to bring crictl here, I'll try)

No, I don't think it will give any more insight into what you have already uncovered.

Interesting, how systemd-nspawn and k8s (kubelet, docker, containerd, runc) may interact with each other?

Interesting! I'm not familiar with systemd-nspawn, so I have no real idea about this.

Overall it looks like the most viable solution is to switch the cgroup driver to systemd. The "extra" cgroup mounts cause some trouble though, and mounting multiple cgroup v1 hierarchies for the same controller in the same mount namespace isn't currently supported by runc and leads to strange behavior. A lot of these mount issues will probably be easier with cgroup v2, since it only has a single mount for the whole cgroup hierarchy. Strange.

@arjunagl

arjunagl commented Dec 3, 2020

/assign

@arjunagl

/unassign

@b10s
Contributor Author

b10s commented Dec 24, 2020

I've updated the cluster to the systemd cgroup driver and caught a crash with such a mount on the host:

# cat /proc/self/mountinfo | grep systemd | grep cgr
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
5855 26 0:23 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4193b51_e5f8_4348_90b3_5423cb74e0d1.slice/docker-57353a2ab1bd38bc5b00500bd4dc9795b6bf0e6e3a15479c474ccb3be605cef8.scope /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4193b51_e5f8_4348_90b3_5423cb74e0d1.slice/docker-57353a2ab1bd38bc5b00500bd4dc9795b6bf0e6e3a15479c474ccb3be605cef8.scope rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
764 763 0:23 / /var/lib/rkt/pods/run/7bb7089e-5b9c-4a35-9552-93a68664d3ae/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
765 764 0:23 /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4193b51_e5f8_4348_90b3_5423cb74e0d1.slice/docker-57353a2ab1bd38bc5b00500bd4dc9795b6bf0e6e3a15479c474ccb3be605cef8.scope /var/lib/rkt/pods/run/7bb7089e-5b9c-4a35-9552-93a68664d3ae/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4193b51_e5f8_4348_90b3_5423cb74e0d1.slice/docker-57353a2ab1bd38bc5b00500bd4dc9795b6bf0e6e3a15479c474ccb3be605cef8.scope rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2021
@matthyx
Contributor

matthyx commented Jun 24, 2021

/remove-lifecycle rotten
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@matthyx
Contributor

matthyx commented Jun 24, 2021

/area kubelet

@Jean-Daniel

Jean-Daniel commented Jun 27, 2021

Everything was fine with my cluster until I upgraded kubelet from 1.21.1 to 1.21.2, and now it fails to start with this error.

I'm using cgroup2 and systemd.

Passing the two flags mentioned above as a workaround makes the problem disappear, but this is not a satisfying solution.

@Jean-Daniel

I found my issue is #102676

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 25, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node Bugs automation moved this from Triaged to Done Nov 24, 2021
@b10s
Contributor Author

b10s commented Jan 21, 2022

@odinuge the magic mounts are very similar to what we can see here:
https://security.googleblog.com/2021/12/exploring-container-security-storage.html

: )
