Docker runs privileged container as "unconfined" after restarting container #38075
Comments
To my knowledge, the @justincormack PTAL
Yes, it is applied before the restart, but not after.
That may actually be a bug. Note that generally, it's really discouraged to use `--privileged`. We should look into this though, and (at least) document the expected behavior.
I feel like I should explain a bit about the use case, because I'm eager to remove `--privileged`.

The use case is very simple: the container must be able to configure cgroups inside the Docker container's cgroup (for example, create the cgroups /sys/fs/cgroup/cpu/docker/[container-id]/pool1, /sys/fs/cgroup/cpu/docker/[container-id]/pool2, ... as seen from the host). So the cgroup filesystem inside the container must be writable. I have found no way other than `--privileged` to make the cgroup filesystem that is mounted into the container writable. Is there an (undocumented) way to do that?

To reduce the attack surface, I paired `--privileged` with a hand-crafted AppArmor profile (the docker-default profile in the example is just an example, because I suppose not many people hand-craft AppArmor profiles themselves and have some lying around for testing). In our case, we have multiple apps in containers that use different AppArmor profiles and a common seccomp profile.

The problem now is that, for example after rebooting, the apps in the container cannot change AppArmor hats anymore, because they run under the "unconfined" profile and the "unconfined" profile has no hats defined. So, clearly not the same behaviour as on first start (`docker run`).

I'm more than happy to remove the `--privileged` if there is a way to make the cgroup sysfs writable. How is that achievable? Nevertheless, the inconsistency between before and after restart must IMO be fixed, because it creates the very dangerous situation where the sysadmin expects the container to run confined, but it does not.
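A minimal sketch of the setup described above (the `ubuntu` image and the `docker-default` profile are illustrative placeholders, not our production values; assumes an AppArmor-enabled Docker host):

```shell
# Sketch only: pair --privileged (for a writable cgroupfs inside the
# container) with an AppArmor profile to reduce the attack surface.
if command -v docker >/dev/null 2>&1; then
    cid=$(docker run -d --privileged \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 2>/dev/null) || cid=""
    if [ -n "$cid" ]; then
        docker rm -f "$cid" >/dev/null
    fi
else
    echo "docker not available; commands shown for illustration only"
fi
sketch_done=1
```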
(Just curious) That means the container itself can control how many resources it can use; is there a reason you cannot set those restrictions when starting the container?
Agreed.
You understand correctly: the container needs to assign processes forked inside the container into cgroups. So it controls how much of the resources it is itself given it assigns to which tenant, via the master process. For example, the service can create 50 new processes by forking. These groups of processes (in a cgroup) should have equal weight, but not the processes themselves. If, for example, one connection requires 50 processes to be forked and another connection needs only 1, the 50 processes would hog all resources and connection 2 would starve. Cgroups are supposed to solve exactly this problem. The tenants (and therefore the cgroup names) are unknown at container startup and huge in number (400k across the system), so they need to be created and assigned dynamically. The container itself is its own cgroup created by Docker, so containers don't hog resources. So yes, in the end we have a hierarchy like this:
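From inside the container, the per-tenant assignment described above can be sketched like this (cgroup v1 paths; "pool1"/"pool2" are example tenant names and the weight value is arbitrary):

```shell
# Sketch: create one cgroup per tenant under the container's cpu cgroup
# and give each tenant group equal weight, regardless of process count.
CG=/sys/fs/cgroup/cpu
if [ -d "$CG" ] && [ -w "$CG" ]; then
    mkdir -p "$CG/pool1" "$CG/pool2" 2>/dev/null || true
    echo 1024 > "$CG/pool1/cpu.shares" 2>/dev/null || true
    echo 1024 > "$CG/pool2/cpu.shares" 2>/dev/null || true
    # A forked worker is then assigned by writing its PID, e.g.:
    #   echo "$worker_pid" > "$CG/pool1/cgroup.procs"
fi
cgroup_sketch_done=1
```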
Just to clarify: we have had this setup in prod for more than a year and it has improved reliability in a major, major way.
Have you tried bind mounting or mounting the cgroup mount points into the container instead (read-write)? This should make them modifiable. You could add them at a different mount point. I have done this with other mounts sometimes to modify them without using privileged.
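An untested sketch of that suggestion (paths and image name are illustrative; whether a read-write bind mount actually suffices depends on the host's cgroup layout):

```shell
# Sketch: bind-mount a cgroup controller read-write instead of using
# --privileged, while keeping the AppArmor profile in place.
if command -v docker >/dev/null 2>&1; then
    docker run -d \
        -v /sys/fs/cgroup/cpu:/sys/fs/cgroup/cpu:rw \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 >/dev/null 2>&1 || true
fi
mount_sketch_done=1
```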
@justincormack IIRC I did, but that was two years ago. I don't recall the exact problem, but I'll make a test build and see if I can make it work now. That aside, what can we do to fix this? As I said, if someone thinks the container is AppArmor-protected, that's wrong after a restart / system reboot, and the processes in the container are not confined as thought. I think that's still huge. I had a look at the code and tried to find the places where AppArmor support for start/restart is needed. I haven't really found where AppArmor is applied in the `containerStart` routine; can you give me a hint?
Description
A privileged container confined to an AppArmor profile specified with `--security-opt apparmor=profile-name` is run as "unconfined" after restarting the container. The profile is only used when first starting the container with `docker run`; stopping the container and starting it again with `docker start` starts it as unconfined (auto-restart triggers it, too).

Steps to reproduce the issue:

1. `docker run --privileged --security-opt apparmor=docker-default`, note the container ID and stop it

Another way, same results:
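The steps above can be scripted roughly as follows (image and profile names illustrative; needs root on an AppArmor-enabled host). Reading /proc/&lt;pid&gt;/attr/current shows the profile the container's init process actually runs under:

```shell
# Sketch of the reproduction: run privileged + confined, stop/start,
# then inspect the runtime AppArmor label of the container's process.
if command -v docker >/dev/null 2>&1; then
    cid=$(docker run -d --privileged \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 2>/dev/null) || cid=""
    if [ -n "$cid" ]; then
        docker stop "$cid" >/dev/null
        docker start "$cid" >/dev/null
        pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
        # On an affected daemon this prints "unconfined" instead of
        # "docker-default (enforce)":
        cat "/proc/$pid/attr/current"
        docker rm -f "$cid" >/dev/null
    fi
fi
repro_done=1
```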
Describe the results you received:

First of all, `aa-status` does not show the processes in the container to be in enforce mode. They are missing from the `aa-status` output completely.

Secondly, audit.log shows the following log entry, coming from the "unconfined" profile, if you try to change hat:
Describe the results you expected:

`aa-status` showing the processes under "processes are in enforce mode".

Additional information you deem important (e.g. issue happens only occasionally):
The bug only presents itself in privileged containers.
When you don't have code that changes AppArmor hats, I don't see how the bug surfaces at all, unless you watch `aa-status`. As the container runs unconfined, you'll never really see any problems or hints of one.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
Ubuntu trusty, KVM virtual machine