Finding the minimal set of privileges for a docker container to spawn rootless containers #1456

Open
ggoodman opened this Issue May 19, 2017 · 7 comments

ggoodman commented May 19, 2017

I've been flailing away at the idea of running a pool of rootless containers as children of a docker container. My intent is to have the docker container run a web server that will spin up a pool of rootless child containers to which requests can be proxied. These children would be designed to be isolated from each other, and to shield the host system from the side effects of running untrusted code.

I need to pass additional file descriptors to these children, which precludes running them as siblings using the host docker daemon. So here I am, and I hope I'm not overstepping my bounds by asking for guidance via an issue.

Set up

Create a root filesystem tgz:

$ docker export $(docker create alpine) > rootfs.tgz

Dockerfile with runc, libseccomp2 and the rootfs:

FROM buildpack-deps

RUN apt-get update && apt-get install -y --no-install-recommends \
		libseccomp2 \
	&& rm -rf /var/lib/apt/lists/*

ADD rootfs.tgz /child/rootfs
ADD runc /usr/local/sbin/runc

WORKDIR /child/rootfs

RUN runc spec --rootless

CMD ["runc", "run", "child"]

False starts:

Build and run the container, adding CAP_SYS_ADMIN:

$ docker run --rm -it --cap-add SYS_ADMIN $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:261: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/cpuset/child: read-only file system\""

Same, but mount /sys/fs/cgroup as rw:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .)
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"could not create session key: operation not permitted\""

Same, but invoke runc with --no-new-keyring:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring child
container_linux.go:265: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:104: jailing process inside rootfs caused \\\"pivot_root operation not permitted\\\"\""

Finally 'working':

Same, but also add --no-pivot:

$ docker run --rm -it --cap-add SYS_ADMIN -v /sys/fs/cgroup:/sys/fs/cgroup:rw $(docker build -q .) runc run --no-new-keyring --no-pivot child
/ #

Disclaimer: I'm still wrapping my head around all of the complexity and nuances of all the technologies we call 'containers' so please correct me if I'm wrong.

Removing pivot_root seems like a bad idea given my objectives, so I created a copy of the default seccomp profile and added the pivot_root syscall to the big list of SCMP_ACT_ALLOW calls. This let me drop --no-pivot.
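
For anyone repeating this, the profile edit could be scripted roughly as below. This is only a sketch: it assumes the profile uses the Docker-style JSON layout with a "syscalls" list whose entries carry "names", "action" and "args" keys (older profiles used a different shape), and the helper name allow_syscall and the tiny sample profile are made up for illustration.

```python
import copy
import json


def allow_syscall(profile, name):
    """Return a copy of a Docker-style seccomp profile that also allows `name`.

    Assumes the layout {"defaultAction": ..., "syscalls": [{"names": [...],
    "action": ..., "args": [...]}, ...]}; this is an assumption, check your file.
    """
    patched = copy.deepcopy(profile)
    rules = patched.setdefault("syscalls", [])
    for rule in rules:
        # Only extend a rule that allows its syscalls without extra conditions.
        unconditional = (
            rule.get("action") == "SCMP_ACT_ALLOW"
            and not rule.get("args")
            and not rule.get("includes")
            and not rule.get("excludes")
        )
        if unconditional and "names" in rule:
            if name not in rule["names"]:
                rule["names"].append(name)
            return patched
    # No suitable rule found: add a dedicated allow rule.
    rules.append({"names": [name], "action": "SCMP_ACT_ALLOW", "args": []})
    return patched


# Hypothetical stand-in for a copy of the default profile (the real one is far longer).
sample_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW", "args": []}],
}
print(json.dumps(allow_syscall(sample_profile, "pivot_root"), indent=2))
```

The resulting JSON, saved to a file, can then be handed to docker run via --security-opt seccomp=<path to that file>.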

What kind of exposure am I creating by whitelisting the pivot_root syscall?

Also, I'm past my abilities in trying to figure out how I might avoid --no-new-keyring.

What kind of exposure am I creating by using the --no-new-keyring flag?

Member

cyphar commented May 24, 2017

> docker run --rm -it --cap-add SYS_ADMIN

You are already running privileged containers at this point. You'd need to do --cap-drop all and a few other flags to entirely drop privileges inside Docker. Basically you're running as mostly-root if you add that capability (and Docker has a bunch enabled by default).
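
To see which capabilities a process inside the container actually holds, one option is to decode the CapEff bitmask from /proc/self/status. A rough sketch follows; the bit numbering is taken from the Linux capability constants as I remember them, and the helper names are invented for this example, so treat both as assumptions to verify against your kernel headers.

```python
# Capability names indexed by bit number (assumed to match linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask_hex):
    """Turn a hex capability mask (as printed in /proc/<pid>/status) into names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff line of a status file and decode it (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []


# Example mask only; run effective_caps() inside the container for real values.
print(decode_caps("00000000a80425fb"))
```

Running effective_caps() with and without --cap-add SYS_ADMIN (or with --cap-drop all) should make the difference between the two setups visible.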

> What kind of exposure am I creating by whitelisting the pivot_root syscall?

None really. pivot_root is a more secure chroot. Docker disables it because it involves messing with mount namespaces (which normal containers shouldn't be doing) but it's not a security issue in principle (maybe @jessfraz might remember why it was added).

> What kind of exposure am I creating by using the --no-new-keyring flag?

Processes inside the container can access the host's kernel keyring directly (which contains various crypto stuff that some system components use) if a process is running with privileges. Though processes inside the inner container (if you're using user namespaces) might be blocked (I haven't read that kernel code in a while though).

In general I would discourage it, but to be honest if you don't specify --no-new-keyring I would wager that there's a kernel-level hole that just hasn't been solved yet from the kernel side.

Member

cyphar commented May 24, 2017

Oh sorry, this too

> Same, but mount /sys/fs/cgroup as rw

Rootless containers most definitely cannot do this. The reason that cgroups actually work in the inner container is because your container has CAP_SYS_ADMIN, which is basically the majority of root functionality.

ggoodman commented May 24, 2017

@cyphar thank you for taking the time. I'm super excited about the whole rootless container story and all the work you've been doing.

I've now been able to get a working prototype of the concept described above. The key differences are that I have been unable to get the networking set up without adding CAP_NET_ADMIN to the parent docker container and running the container's process as root. Despite running as root, I still seem to need to mount the host cgroup fs and run the child sandboxes with the --no-new-keyring option.

Would it be fair to say that the machinery is not yet in place to be able to run rootless containers with networking as children of unprivileged containers themselves?

To provide network access, I've picked up @jessfraz's netns to great, frictionless effect and run it as a prestart hook. This also meant mounting the host's /lib/modules (from the alpine-based xhyve vm used by docker-for-mac) and installing iptables and kmod (for modprobe) in the debian-based parent docker container.
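
For reference, wiring a binary in as a prestart hook amounts to adding an entry under hooks.prestart in the bundle's config.json. A minimal sketch of that edit is below; the function name and the /usr/local/bin/netns install path are assumptions for illustration, not something taken from the setup described here.

```python
import copy
import json


def add_prestart_hook(config, path, args=None):
    """Return a copy of an OCI runtime config with one more prestart hook.

    `path` should be an absolute path to the hook binary; `args`, if given,
    conventionally starts with the program name.
    """
    patched = copy.deepcopy(config)
    hook = {"path": path}
    if args:
        hook["args"] = list(args)
    patched.setdefault("hooks", {}).setdefault("prestart", []).append(hook)
    return patched


# Hypothetical fragment of a config.json as generated by `runc spec`.
sample_config = {"ociVersion": "1.0.0", "process": {"args": ["sh"]}}
# The netns install path below is an assumption.
print(json.dumps(add_prestart_hook(sample_config, "/usr/local/bin/netns"), indent=2))
```

In practice the config would be loaded from and written back to the bundle's config.json with json.load and json.dump.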

Given the lack of granularity in CAP_SYS_ADMIN, are we forced to play with seccomp filters as a mechanism to limit access to the components of this flag's capabilities?

Please feel free to close at any time given that this isn't so much of an issue as it is a discussion. I hope that it helps someone else in their wacky experimentation down the road.

Contributor

jessfraz commented May 25, 2017

> What kind of exposure am I creating by whitelisting the pivot_root syscall?

This was merely not included to prevent people from shooting themselves in the foot; it should be a privileged operation. Also, if people were using pivot_root they were probably using other things that were blocked as well, like mount or chroot, so it kinda just went hand in hand. There wasn't an exact reason.

frezbo commented Dec 27, 2017

@ggoodman @cyphar Good to see this awesome discussion. I'm also trying to run runc inside a container as a normal user (rootless), and I'm facing many issues; I do not want to run the base container with all the privileges. I was able to get it working with udocker (https://github.com/indigo-dc/udocker), but I was hoping I could use runc and avoid the Python dependency. The problem with udocker is that it does not support OCI-compliant images. I would like to get your opinions on what's the best approach, and on the hurdles you came across while running them. Thanks.

Member

cyphar commented Dec 27, 2017

@frezbo You can use skopeo to convert a Docker image into an OCI image (I've added support to skopeo to allow you to convert a local file that you would generate from docker save). If you then want to unpack the OCI image into an OCI runtime configuration that can be used by runc, you can use umoci, which is a tool I wrote. umoci has a --rootless flag which generates a rootless OCI runtime configuration and root filesystem without needing root (please read the relevant section in umoci's documentation for caveats).
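
As a rough illustration of that pipeline, the three steps (docker save, skopeo copy, umoci unpack) can be laid out as command lines. The sketch below only builds and prints the commands rather than running them; the image, file and directory names are placeholders, and the exact transport strings should be checked against the skopeo and umoci versions you have installed.

```python
import shlex


def conversion_commands(image, archive, oci_layout, tag, bundle):
    """Build the argv lists for the save -> convert -> unpack pipeline."""
    return [
        # Export the Docker image to a local tar archive.
        ["docker", "save", "-o", archive, image],
        # Convert the archive into an OCI image layout directory.
        ["skopeo", "copy", f"docker-archive:{archive}", f"oci:{oci_layout}:{tag}"],
        # Unpack the OCI image into a runtime bundle without needing root.
        ["umoci", "unpack", "--rootless", "--image", f"{oci_layout}:{tag}", bundle],
    ]


# Placeholder names throughout; nothing is executed here.
for argv in conversion_commands("alpine", "alpine.tar", "alpine-oci", "latest", "bundle"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list can be passed to subprocess.run(argv, check=True) once the tools are installed; the resulting bundle directory is what runc would then be pointed at.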

frezbo commented Dec 27, 2017

@cyphar I have used both of your awesome tools, and the relevant issue is here: indigo-dc/udocker#111. udocker can only understand the v1 spec and skopeo copies from oci to v2, so I'm planning to use runc inside a container, with the least possible privilege escalation.
