K8s + Sysbox: mount sysfs fails (EPERM) during pod creation #67
Comments
By the way, it's interesting that Docker doesn't have a proper solution to this problem: it does not support user-namespaces together with the "--net=host" functionality. https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations On the other hand, I can see some logic in Docker (libnetwork), as well as in the K8s dockershim implementation, that deals with these scenarios through the use of container "hooks". But it's not clear to me how mature this implementation is, or whether anyone is actually using K8s with user-namespaces. References: opencontainers/runc#799 Will get back to this when done with the ongoing release cycle.
I found what could be a valid explanation for the behavior observed above. As suspected, the kernel imposes certain restrictions on users trying to mount procfs and sysfs from within a non-init user-namespace. In these scenarios, the kernel expects the user creating the container to have CAP_SYS_ADMIN rights in the user-ns that owns the network-ns in question. By the time we mount sysfs we are already "inside" the new user-ns, and by then we have no rights over any resource owned by the root (init) user-ns, including the root network-ns. This is the kernel patch that added this restriction: https://lists.linuxfoundation.org/pipermail/containers/2013-August/033388.html A potential solution I can think of is to extend our existing sysbox-runc "proxy" handler so that the parent process performs the sysfs mount on behalf of the init process. I'll get back to this in a couple of weeks once we are done with our current release cycle.
Ref #64.
Here is a quick experiment to reproduce this issue with
In other words, when the netns is created before the userns, mounting sysfs inside that userns fails (for some reason). But if a netns is created inside the userns, mounting sysfs works without problem.
Any updates on this or possible workarounds?
@kylecarbs, unfortunately there's no workaround for this one at the moment, but we're actively working on this issue and we expect to have good news soon. Please stay tuned.
Found the reason for this behavior: when a process mounts sysfs, the kernel checks that the process has CAP_SYS_ADMIN in the user namespace associated with that network namespace. In the example above, the net-ns was created before the user-ns was created, so that net-ns is associated with the init user-ns, not the newly created user-ns. As a result, a process inside the user-ns can't mount sysfs because it does not have CAP_SYS_ADMIN in the init user-ns. However, when we later unshare the net-ns inside the user-ns, the situation changes: that new net-ns is associated with that user-ns, and the process invoking the mount of sysfs does have CAP_SYS_ADMIN in that user-ns. As a result, the sysfs mount succeeds.
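The ownership rule described in this comment can be illustrated with a small toy model. This is only a sketch of the reasoning, not kernel code: the `UserNS` and `NetNS` classes and both functions are hypothetical names, and the capability check is simplified to "a process that is root in a user-ns has CAP_SYS_ADMIN in that user-ns and in its descendants".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserNS:
    """Toy user namespace: a name plus a link to its parent (None for init)."""
    name: str
    parent: Optional["UserNS"] = None


@dataclass(frozen=True)
class NetNS:
    """Toy network namespace: remembers the user-ns it was created in."""
    owner: UserNS


def has_cap_sys_admin(proc_userns: UserNS, target: UserNS) -> bool:
    """Simplified rule: root in proc_userns is privileged over proc_userns
    itself and every user-ns that descends from it, but never over ancestors."""
    ns = target
    while ns is not None:
        if ns is proc_userns:
            return True
        ns = ns.parent
    return False


def can_mount_sysfs(proc_userns: UserNS, netns: NetNS) -> bool:
    """sysfs mount is allowed only with CAP_SYS_ADMIN in the net-ns owner."""
    return has_cap_sys_admin(proc_userns, netns.owner)


init_userns = UserNS("init")
pod_userns = UserNS("pod", parent=init_userns)

# net-ns created before the user-ns: it is owned by the init user-ns.
early_netns = NetNS(owner=init_userns)
# net-ns unshared from inside the new user-ns: it is owned by that user-ns.
late_netns = NetNS(owner=pod_userns)

print(can_mount_sysfs(pod_userns, early_netns))  # the EPERM case
print(can_mount_sysfs(pod_userns, late_netns))   # the working case
```

The model only captures the ordering dependency between the two namespaces; real capability checks also depend on the process credentials and the id mappings.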
This issue was found by Rodny while trying to scope the level of effort required to launch K8s PODs using the Sysbox runtime (aka sysbox pods). Sysbox always uses the Linux user-ns in the containers/pods it creates. It's a must-have for proper functionality & isolation. From the prior comment, it's clear that the pod's user-ns must be created before the network-ns in order for sysfs mounts to work inside the pod. This requirement applies to other kernel network resources too (e.g., those exposed via procfs). As a result, in order for K8s to create pods with sysbox, the user-ns associated with the pod must be created before the network-ns (and all other namespaces too) associated with that pod. This can't be done by sysbox itself because, per the K8s CRI spec, it's the CRI implementation (e.g., dockershim, containerd, or cri-o) that sequences this. I've done some research and found that cri-o has experimental support for enabling the user-ns in pods. That is, upstream versions of cri-o are capable of creating the user-ns for a pod first, and then creating the remaining namespaces as required. At a high level, the way this works is:
Other than cri-o, I don't believe the other CRI implementations (dockershim and containerd) support user-ns functionality. dockershim is in fact out of the question because it does not even support the runtimeClass spec required for K8s to deploy containers with sysbox. Thus, it looks like an initial implementation of K8s + sysbox would require the latest versions of cri-o at this time. I am working on this right now.
In the prior comments, we determined that the EPERM error that sysbox gets when mounting sysfs into a container occurred in scenarios where the network-ns for the sys container was created before the user-ns for that same sys container. But there is another way to hit this EPERM error too: when sysbox runs in an environment where sysfs is mounted read-only. For example, I was able to reproduce this by having K8s deploy a privileged pod that had docker and sysbox inside. The pod was deployed using the OCI runc. Inside the privileged pod, I started Docker and Sysbox, and then I tried to create a system container. But this failed with:
The reason for this failure is that inside the privileged pod, sysfs is mounted read-only:
This is a bit unexpected since the pod is privileged, so we expect sysfs to be mounted read-write. The reason sysfs is mounted read-only is that when the pod was created, the K8s pause container mounts it as read-only by default, and this read-only attribute is propagating to the other containers in the pod. This propagation occurs because sysfs is tightly coupled to the network namespace, and all containers in the pod share that namespace. The fix is simple: remount sysfs as read-write:
After this, I was able to deploy a sys container with Docker + Sysbox without problem.
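To spot the read-only case before running into the EPERM error, one option is to inspect the mount table. The following is a minimal sketch, assuming the standard `/proc/mounts` line layout (device, mount point, filesystem type, options, two trailing numbers); the function names and the sample text are made up for illustration.

```python
def sysfs_mounts(mounts_text):
    """Return (mount_point, options) pairs for each sysfs entry found in
    /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "sysfs":
            found.append((fields[1], fields[3].split(",")))
    return found


def sysfs_is_read_only(mounts_text, mount_point="/sys"):
    """True if sysfs at mount_point is mounted 'ro', False if 'rw',
    None if no sysfs mount is listed there. The last matching entry wins,
    since later mounts shadow earlier ones."""
    result = None
    for point, options in sysfs_mounts(mounts_text):
        if point == mount_point:
            result = "ro" in options
    return result


# Illustrative input only; on a real host, read the text from /proc/mounts.
sample = (
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
    "sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0\n"
)
print(sysfs_is_read_only(sample))
```

On a live system the same check would be `sysfs_is_read_only(open("/proc/mounts").read())`, run inside the pod where Sysbox is about to be started.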
Closing as the problem and solution are understood. The solution is to use a CRI that supports user-namespaces (e.g., CRI-O) in order to deploy K8s pods with the sysbox runtime. This task is tracked by issue #64.
It turns out that we can't bindmount `sysfs` if we're using the unprivileged executor, which is our favorite executor to use. X-ref: nestybox/sysbox#67 (comment) This reverts commit a58ccf0.
I ran into this one while trying to scope the level of effort required to launch K8s PODs through the Sysbox runtime.
I initially stumbled into issue #66, which hasn't been properly fixed yet, and then reproduced the problem described herein. Notice that even though the symptoms are identical (i.e., unable to mount sysfs), the cause seems to be different in this case, and that's why we are tracking this issue separately.
After multiple attempts at bisecting the container's OCI spec, I was able to identify the spec instruction causing this problem; however, the low-level root cause has not been found yet.
The problem is reproduced whenever a sandbox container (e.g., "pause") is instantiated by the K8s master. There's nothing especially relevant in the spec of this container, except for the fact that a "path" element is passed as part of the network-namespace element:
As per OCI's specification, a compliant runtime is expected to place the to-be-created container in the network namespace indicated by this file (which in turn, represents a bind-mount of a "/proc/pid/ns/net").
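For readers unfamiliar with this part of the OCI runtime spec, the sketch below shows how a network-namespace entry with a "path" can be added to a spec held as a Python dict. The `linux.namespaces` array with `type` and `path` keys follows the OCI runtime spec; the helper name and the path value are hypothetical and are not taken from the spec used in this issue.

```python
import json


def join_network_namespace(spec, netns_path):
    """Make the container join an existing network namespace by giving the
    'network' entry in linux.namespaces a 'path'. Any previous 'network'
    entry (with or without a path) is replaced."""
    linux = spec.setdefault("linux", {})
    others = [ns for ns in linux.get("namespaces", [])
              if ns.get("type") != "network"]
    linux["namespaces"] = others + [{"type": "network", "path": netns_path}]
    return spec


# Hypothetical starting point: a spec that asks for fresh namespaces.
spec = {"linux": {"namespaces": [{"type": "pid"},
                                 {"type": "network"},
                                 {"type": "mount"}]}}

# Hypothetical path: a bind-mount of some process's /proc/<pid>/ns/net.
join_network_namespace(spec, "/var/run/netns/example-pod")
print(json.dumps(spec["linux"]["namespaces"], indent=2))
```

An entry without a "path" asks the runtime to create a new namespace of that type, while an entry with a "path" asks it to join the namespace the file refers to.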
We can re-create the observed behavior by following the steps indicated below ...
Let's start by creating the shared network namespace that our POD will be part of:
Let's now add this network-ns file to our own baked spec:
The problem is reproduced right away:
As expected, the problem is not reproduced with upstream runc in the default configuration (no user-ns), since otherwise it would also fail in all K8s deployments. However, the exact same issue is reproduced the moment we request user-ns creation.
There is no issue with runc when relying on the above spec:
Let's modify the spec to explicitly activate user-ns creation:
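A minimal sketch of this kind of spec change, assuming the spec is held as a Python dict. The `uidMappings`/`gidMappings` field names follow the OCI runtime spec; the helper names, the host id base of 100000 and the range size of 65536 are illustrative assumptions, not the values used in this issue. The second helper flags the combination discussed here: a user-ns that the runtime must create together with a network-ns that already exists.

```python
def enable_user_namespace(spec, host_id_base=100000, size=65536):
    """Request creation of a new user namespace and map container ids
    0..size-1 onto host ids starting at host_id_base (assumed values)."""
    linux = spec.setdefault("linux", {})
    namespaces = linux.setdefault("namespaces", [])
    if not any(ns.get("type") == "user" for ns in namespaces):
        namespaces.append({"type": "user"})
    mapping = {"containerID": 0, "hostID": host_id_base, "size": size}
    linux["uidMappings"] = [dict(mapping)]
    linux["gidMappings"] = [dict(mapping)]
    return spec


def netns_predates_userns(spec):
    """True when the spec joins an existing network-ns (entry has a 'path')
    while asking for a brand-new user-ns (entry has no 'path')."""
    namespaces = spec.get("linux", {}).get("namespaces", [])
    joins_netns = any(ns.get("type") == "network" and ns.get("path")
                      for ns in namespaces)
    new_userns = any(ns.get("type") == "user" and not ns.get("path")
                     for ns in namespaces)
    return joins_netns and new_userns


# Hypothetical spec that joins a pre-created network namespace.
pod_spec = {"linux": {"namespaces": [
    {"type": "network", "path": "/var/run/netns/example-pod"},
    {"type": "mount"},
]}}

print(netns_predates_userns(pod_spec))  # before enabling the user-ns
enable_user_namespace(pod_spec)
print(netns_predates_userns(pod_spec))  # after enabling the user-ns
```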
Trying runc once again shows the same problem reported by sysbox-runc:
The problem seems to be caused by some sort of kernel limitation or requirement imposed on user-namespaces and their relationship with network-namespaces. Note that the issue is also reproduced when leaving the runtimes out of the equation:
More details to come ...