
K8s + Sysbox: mount sysfs fails (EPERM) during pod creation #67

Closed
rodnymolina opened this issue Sep 12, 2020 · 10 comments
Labels
enhancement New feature or request

Comments

@rodnymolina
Member

rodnymolina commented Sep 12, 2020

I ran into this one while trying to scope the level of effort required to launch K8s pods through the Sysbox runtime.

I initially stumbled into issue #66, which hasn't been properly fixed yet, and then reproduced the problem described herein. Notice that even though the symptoms are identical (i.e., unable to mount sysfs), the cause seems to be different in this case, which is why we are tracking this issue separately.

After multiple attempts at bisecting the container's OCI spec, I was able to identify the spec instruction causing this problem; however, the low-level root cause has not been found yet.

The problem is reproduced whenever a sandbox container (e.g., "pause") is instantiated by the K8s master. There's nothing especially relevant in the spec of this container, except that a "path" element is passed as part of the network-namespace entry:

        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "path": "/var/run/netns/cni-ca69f110-38f9-4be8-dca4-10cbb16f8695",
                "type": "network"
            }
        ],

As per the OCI specification, a compliant runtime is expected to place the to-be-created container in the network namespace indicated by this file (which, in turn, is a bind-mount of a "/proc/<pid>/ns/net" file).

path (string, OPTIONAL) - namespace file. This value MUST be an absolute path in the runtime mount namespace. The runtime MUST place the container process in the namespace associated with that path. The runtime MUST generate an error if path is not associated with a namespace of type type. If path is not specified, the runtime MUST create a new container namespace of type type.

We can re-create the observed behavior by following the steps indicated below ...

Let's start by creating the shared network namespace that our POD will be part of:

rmolina@heavy-vm-bionic:~/wsp$ sudo ip netns add test-ns-1

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ ls -li /run/netns/
total 0
4026532321 -r--r--r-- 1 root root 0 May 13 02:31 test-ns-1
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ findmnt
...
├─/run                                tmpfs                  tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│ ├─/run/lock                         tmpfs                  tmpfs       rw,nosuid,nodev,noexec,relatime,size=5120k
│ ├─/run/user/1000                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1000,gid=1000
│ ├─/run/user/1001                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1001,gid=1001
│ ├─/run/netns/test-ns-1              nsfs[net:[4026532321]] nsfs        rw
│ └─/run/netns                        tmpfs[/netns]          tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│   └─/run/netns/test-ns-1            nsfs[net:[4026532321]] nsfs        rw
├─/boot                               /dev/sda1              ext4        rw,relatime
...
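
As a quick cross-check (a small sketch under the same setup; the inode value will differ on other hosts), the nsfs inode shown by findmnt should match the one reported by ls -li, and a process can be placed into the namespace through that file, which is what an OCI runtime does via setns(2):

# inode of the nsfs bind-mount target; should match the net:[...] value shown by findmnt
sudo stat -L -c '%i' /run/netns/test-ns-1

# the current shell's net-ns, for comparison (a different inode, owned by the init user-ns)
readlink /proc/self/ns/net

# join the net-ns via its path; a freshly created netns should only show the loopback interface
sudo nsenter --net=/run/netns/test-ns-1 ip link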

Let's now add this network-ns file to our hand-crafted spec:

		"namespaces": [
			{
				"type": "pid"
			},
		        {
			        "path": "/var/run/netns/test-ns-1",
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			},
			{
				"type": "cgroup"
			}
		],

The problem is reproduced right away:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo sysbox-runc run ubuntu-1
container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:58: setting up rootfs mounts caused \\\"rootfs_linux.go:928: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

As expected, the problem is not reproduced with upstream runc in its default configuration (no user-ns); if it were, runc would also fail in all K8s deployments. However, the exact same issue is reproduced the moment we request user-ns creation.

There is no issue with runc when relying on the above spec:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
#

Let's modify the spec to explicitly activate user-ns creation:

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat /etc/subuid
lxd:100000:65536
root:100000:65536
vagrant:165536:65536
rmolina:231072:65536
sysbox:296608:268435456

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat config.json
...
        "linux": {
        "uidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
        "gidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
            "namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "path": "/var/run/netns/test-ns-1",
                                "type": "network"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        },
                        {
                                "type": "user"
                        },
                        {
                                "type": "cgroup"
                        }
                ],
...

Trying runc once again shows the same problem reported by sysbox-runc:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
WARN[0000] exit status 1
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

The problem seems to be caused by some sort of kernel limitation or requirement imposed on user namespaces and their relationship with network namespaces. Note that the issue is also reproduced when leaving the runtimes out of the equation:


<-- Unsharing a new network-ns (-n): the sysfs mount succeeds:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -n -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ echo $?
0

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo rm -rf /root/sys

<-- Without unsharing a network-ns (no -n): the sysfs mount fails:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
mount: /root/sys: permission denied.
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$
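
As a further data point (a hedged variant of the commands above, under the same setup): if the network-ns is unshared after the user-ns, i.e. from inside it, the mount is expected to succeed again, because the new net-ns is then owned by the new user-ns:

sudo unshare -m -u -i -p -U -f -r bash -c \
  "unshare -n bash -c 'mkdir -p /root/sys && mount -t sysfs sysfs /root/sys && echo OK'"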

More details to come ...

@rodnymolina rodnymolina added the enhancement New feature or request label Sep 12, 2020
@rodnymolina
Member Author

Btw, it's interesting to note that Docker doesn't have a proper solution to this problem either: they don't support user-namespaces along with the "--net=host" functionality.

https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

On the other hand, I can see some logic written in Docker (libnetwork), as well as in the K8s dockershim implementation, to deal with these scenarios through the use of container "hooks". But it's not clear to me how mature this implementation is, nor whether anyone is actually using K8s along with user-namespaces.

References:

opencontainers/runc#799
moby/moby#21800
opencontainers/runc#807
systemd/systemd#1555
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=87a8ebd637dafc255070f503909a053cf0d98d3f

Will get back to this when done with the ongoing release cycle.

@rodnymolina
Member Author

I found what could be a valid explanation for the behavior observed above. As suspected, the kernel imposes certain restrictions on users trying to mount procfs and sysfs from within a non-init user-namespace.

In these scenarios, the kernel expects the user creating the container to have CAP_SYS_ADMIN rights in the user-ns that owns the network-ns in question. By the time we mount sysfs we are already "inside" the new user-ns, and by then we have no rights over any resource owned by the root (init) user-ns, including the root network-ns.

This is the kernel patch that added this restriction:

https://lists.linuxfoundation.org/pipermail/containers/2013-August/033388.html

A potential solution I can think of is to extend sysbox-runc's existing "proxy" handler so that the parent process performs the sysfs mount on behalf of the container's init process. I'll get back to this in a couple of weeks once we are done with our current release cycle.
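
Conceptually, the idea is something along the lines of the sketch below. This is not the actual sysbox-runc implementation, just an illustration, and INIT_PID is a hypothetical placeholder: a helper that is privileged in the init user-ns joins only the container's mount and net namespaces (but not its user-ns) and performs the sysfs mount there, so the CAP_SYS_ADMIN check against the user-ns that owns the net-ns can be satisfied.

# Hypothetical sketch: a helper privileged in the init user-ns enters the
# container's mount and network namespaces (but NOT its user namespace)
# and mounts sysfs on the container's behalf.
INIT_PID=12345   # hypothetical: PID of the container's init process
sudo nsenter -t "$INIT_PID" -m -n mount -t sysfs sysfs /sys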

@rodnymolina
Member Author

Ref #64.

@ctalledo
Member

Here is a quick experiment to reproduce this issue with unshare:

$ sudo unshare -n bash
$ unshare -U -m -i -p -u -C -f -r --mount-proc bash
$ mount -t sysfs sysfs sys
mount: /home/cesar//rootfs/sys: permission denied
$ unshare -n bash
$ mount -t sysfs sysfs sys
(no problem)

In other words, when the netns is created before the userns, mounting sysfs inside that userns fails (for some reason). But if a netns is created inside the userns, mounting sysfs works without problem.

@kylecarbs

Any updates on this or possible workarounds?

@rodnymolina
Member Author

@kylecarbs, unfortunately there's no workaround for this one at the moment, but we're actively working on this issue and we expect to have good news soon. Please stay tuned.

@ctalledo
Member

Found the reason for this behavior: when a process mounts sysfs, the kernel checks that the process has CAP_SYS_ADMIN in the user namespace that owns the process's network namespace.

In the example above, the net-ns was created before the user-ns was created, so that net-ns is associated with the init user-ns, not the newly created user-ns. As a result, a process inside the user-ns can't mount sysfs because it does not have CAP_SYS_ADMIN in the init user-ns.

However, when we later unshare the net-ns inside the user-ns, the situation changes: that new net-ns is associated with that user-ns, and the process invoking the mount of sysfs does have CAP_SYS_ADMIN in that user-ns. As a result, the sysfs mount succeeds.

@ctalledo
Member

ctalledo commented Nov 12, 2020

This issue was found by Rodny while trying to scope the level of effort required to launch K8s PODs using the Sysbox runtime (aka sysbox pods).

Sysbox always uses the Linux user-ns in the containers/pods it creates. It's a must-have for proper functionality & isolation.

From the prior comment, it's clear that the pod's user-ns must be created before the network-ns in order for sysfs mounts to work inside the pod. This requirement applies to other kernel network resources too (e.g., those exposed via procfs).

As a result, in order for K8s to create pods with sysbox, the user-ns associated with the pod must be created before the network ns (and all other namespaces too) associated with that pod.

This can't be done by sysbox itself, because per the K8s CRI spec, it's the CRI implementation (e.g., dockershim, containerd, or cri-o) that sequences this.

I've done some research and found that cri-o has experimental support for enabling the user-ns in pods. That is, upstream versions of cri-o are capable of creating the user-ns for a pod first, then creating the remaining namespaces as required.

At a high level, the way this would work is (a hypothetical pod-spec sketch follows the list):

  1. The user creates a pod with an annotation indicating use of the "user-ns" and sets runtimeClass to the sysbox runtime.
  2. k8s deploys that pod on a node
  3. The k8s kubelet on that node talks to cri-o to create the pod with user-ns and sysbox
  4. cri-o creates the user-ns and network-ns, then tells sysbox to create the containers for the pod
  5. sysbox creates the containers
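
To make step 1 concrete, here is a hypothetical sketch of what such a pod might look like. The annotation key/value (io.kubernetes.cri-o.userns-mode) and the RuntimeClass name (sysbox-runc) are assumptions based on cri-o's experimental user-ns support and a typical Sysbox install; the exact names and values may differ:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: userns-pod                                       # hypothetical example pod
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"   # assumed cri-o annotation
spec:
  runtimeClassName: sysbox-runc                          # assumed Sysbox RuntimeClass name
  containers:
  - name: ubu
    image: ubuntu:20.04
    command: ["sleep", "infinity"]
EOF

For this to schedule, a RuntimeClass object pointing at the sysbox handler would have to exist on the cluster, and cri-o would need the annotation whitelisted for that runtime; both are assumptions about the eventual setup rather than something verified here.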

Other than cri-o, I don't believe the other CRI implementations (dockershim and containerd) support user-ns functionality. dockershim is in fact out of the question because it does not even support the runtimeClass spec required for K8s to deploy containers with sysbox.

Thus, it looks like an initial implementation of K8s + sysbox would require the latest versions of cri-o at this time.

I am working on this right now.

@ctalledo ctalledo changed the title Unable to mount sysfs (EPERM) in shared network-ns scenarios Sysbox fails mount sysfs (EPERM) during container creation in shared network-ns scenarios Nov 12, 2020
@ctalledo ctalledo changed the title Sysbox fails mount sysfs (EPERM) during container creation in shared network-ns scenarios Sysbox fails mount sysfs (EPERM) during container creation Nov 12, 2020
@ctalledo
Member

ctalledo commented Nov 12, 2020

In the prior comments, we determined that the EPERM error that sysbox gets when mounting sysfs into a container occurred in scenarios where the network-ns for the sys container was created before the user-ns for that same sys container.

But there is another way to hit this EPERM error too: when sysbox runs in an environment where sysfs is mounted read-only.

For example, I was able to reproduce this by having K8s deploy a privileged pod that had Docker and Sysbox inside. The pod was deployed using the OCI runc runtime. Inside the privileged pod, I started Docker and Sysbox and then tried to create a system container. This failed with:

root@pod-with-sysbox:~/nestybox/sysbox# docker run --runtime=sysbox-runc -it --rm alpine                                                                       
docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:62: setting up rootfs mounts caused \\\"rootfs_linux.go:932: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2
/fbdaf2935a2b8ffe777060a1db2e63be8da034f35c47315d5b544a4ca6718bf6/merged\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\"": unknown.

The reason for this failure is that inside the privileged pod, sysfs is mounted read-only:

root@pod-with-sysbox:~/nestybox/sysbox# findmnt | grep "sysfs"
|-/sys    sysfs    sysfs    ro,nosuid,nodev,noexec,relatime

This is a bit unexpected: since the pod is privileged, we expect sysfs to be mounted read-write. The reason sysfs is read-only is that when the pod was created, the K8s pause container mounted it read-only by default, and this read-only attribute propagates to the other containers in the pod. The propagation occurs because sysfs is tightly coupled to the network namespace, and all containers in the pod share that namespace.

The fix is simple: remount sysfs as read-write:

root@pod-with-sysbox:~/nestybox/sysbox# mount -o remount,rw /sys /sys
root@pod-with-sysbox:~/nestybox/sysbox# findmnt | grep "sysfs"
|-/sys    sysfs    sysfs    rw,relatime

After this, I was able to deploy a sys container with Docker + Sysbox without problem.

@ctalledo ctalledo changed the title Sysbox fails mount sysfs (EPERM) during container creation K8s + Sysbox: mount sysfs fails (EPERM) during pod creation Nov 27, 2020
@ctalledo
Member

Closing as the problem and solution are understood. The solution is to use a CRI that supports user-namespaces (e.g., CRI-O) in order to deploy K8s pods with the sysbox runtime. This task is tracked by issue #64.

staticfloat added a commit to JuliaContainerization/Sandbox.jl that referenced this issue Jul 29, 2022
It turns out that we can't bindmount `sysfs` if we're using the
unprivileged executor, which is our favorite executor to use.
X-ref: nestybox/sysbox#67 (comment)

This reverts commit a58ccf0.