Docker runs privileged container as "unconfined" after restarting container #38075

Open
phillipp opened this issue Oct 24, 2018 · 8 comments

Comments

@phillipp

Description

A privileged container confined to an AppArmor profile specified with --security-opt apparmor=profile-name runs as "unconfined" after the container is restarted.
The profile is only applied when the container is first started with docker run; stopping the container and starting it again with docker start runs it unconfined. An automatic restart via a restart policy triggers the same behaviour.

Steps to reproduce the issue:

  1. (Optional, provides evidence in audit.log) Build a container with a program that changes AppArmor hat
  2. Start the container with docker run --privileged --security-opt apparmor=docker-default, note the container ID, and stop it
  3. Start the container again with docker start

Another way, same results:

  1. Set a restart policy for a container that has an AppArmor profile
  2. Let the container restart automatically, for example by rebooting the host
  3. The processes in the container are now unconfined
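The first set of steps can be scripted roughly as follows. This is a minimal sketch, assuming the Docker CLI is on PATH, the script runs as root on an AppArmor-enabled host, and the image provides a `sleep` that accepts `infinity` (e.g. GNU coreutils); all function names are hypothetical. It reads `/proc/<pid>/attr/current` of the container's main process, which holds the AppArmor label, e.g. `docker-default (enforce)` or plain `unconfined`.

```python
import subprocess
from pathlib import Path
from typing import Optional, Tuple


def parse_label(raw: str) -> Tuple[str, Optional[str]]:
    """Split an AppArmor label such as 'docker-default (enforce)' into
    (profile, mode). An 'unconfined' label carries no mode suffix."""
    label = raw.replace("\x00", "").strip()
    if label.endswith(")") and " (" in label:
        profile, _, mode = label.rpartition(" (")
        return profile, mode[:-1]
    return label, None


def docker(*args: str) -> str:
    """Run a docker CLI command and return its trimmed stdout."""
    result = subprocess.run(["docker", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def container_label(container: str) -> Tuple[str, Optional[str]]:
    """Read the AppArmor label of the container's main process via /proc."""
    pid = docker("inspect", "--format", "{{.State.Pid}}", container)
    return parse_label(Path(f"/proc/{pid}/attr/current").read_text())


def reproduce(image: str, profile: str = "docker-default") -> None:
    """docker run -> docker stop -> docker start, printing the label after each start."""
    cid = docker("run", "-d", "--privileged",
                 "--security-opt", f"apparmor={profile}",
                 image, "sleep", "infinity")
    print("after docker run:  ", container_label(cid))
    docker("stop", "-t", "1", cid)
    docker("start", cid)
    print("after docker start:", container_label(cid))
```

Usage: `reproduce("ubuntu:18.04")`. Per the report, on affected versions the second line would show the profile as `unconfined` with no mode.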

Describe the results you received:

First, aa-status does not list the processes in the container as being in enforce mode; they are missing from the aa-status output entirely.

Second, when a process in the container tries to change hat, audit.log shows the following entries, attributed to the "unconfined" profile:

type=SYSCALL msg=audit(1540385552.452:1913897): arch=c000003e syscall=1 success=no exit=-2 a0=32 a1=30da220 a2=24 a3=7 items=0 ppid=91225 pid=92216 auid=4294967295 uid=377118 gid=65534 euid=377118 suid=377118 fsuid=377118 egid=65534 sgid=65534 fsgid=65534 tty=(none) ses=4294967295 comm="php-fpm" exe="/opt/lima-php/5.6/sbin/php-fpm" key=(null)
type=UNKNOWN[1327] msg=audit(1540385552.452:1913897): proctitle=7068702D66706D3A206D61737465722070726F6365737320282F6F70742F6C696D612D7068702F352E362F6574632F7068702D66706D2E636F6E6629
type=AVC msg=audit(1540385552.452:1913898): apparmor="DENIED" operation="change_profile" info="label not found" error=-2 profile="unconfined" name="unconfined//webdefault" pid=92217 comm="php-fpm"
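The AVC record is the telling one: `profile="unconfined"` and `name="unconfined//webdefault"` indicate that the hat `webdefault` was looked up under the unconfined profile, which has no such hat (`info="label not found"`). A minimal sketch for pulling these fields out of audit.log lines; it is a simplified parser written for illustration (function names are hypothetical), and hex-encoded values such as `proctitle` are left encoded.

```python
import re
import shlex

# type=<TYPE> msg=audit(<seconds>.<millis>:<serial>): <key=value ...>
AUDIT_RE = re.compile(r"^type=(\S+) msg=audit\((\d+\.\d+):(\d+)\): (.*)$")


def parse_audit_record(line: str) -> dict:
    """Parse one audit.log line into its type, serial number and key=value
    fields. Quoted values are unquoted; other values are kept verbatim."""
    match = AUDIT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an audit record: {line!r}")
    rtype, _timestamp, serial, body = match.groups()
    fields = {}
    # shlex splits on whitespace while honouring quotes: info="label not found"
    for token in shlex.split(body):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return {"type": rtype, "serial": int(serial), "fields": fields}


def is_apparmor_denial(record: dict) -> bool:
    """True for AVC records in which AppArmor denied an operation."""
    return record["type"] == "AVC" and record["fields"].get("apparmor") == "DENIED"
```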

Describe the results you expected:

  1. aa-status showing the processes under "processes are in enforce mode"
  2. AppArmor changing the hat without an error message in audit.log

Additional information you deem important (e.g. issue happens only occasionally):

The bug only presents itself in privileged containers.

Unless you have code that changes AppArmor hat, I don't see how the bug would surface at all, short of watching aa-status. Because the container runs unconfined, you'll never see any problems or any hint of a problem.

Output of docker version:

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:20:43 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:28:38 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 43
 Running: 1
 Paused: 0
 Stopped: 42
Images: 18
Server Version: 18.06.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 296
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-116-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 11.73GiB
Name: qgmdrp7q
ID: XWH2:JCXJ:GQBQ:MZUK:2YGE:3NPY:W6FG:GIFJ:7O5F:IDNI:JQU6:FJSY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Ubuntu trusty, KVM virtual machine

@thaJeztah
Member

To my knowledge, the --privileged option disables all security measures (including AppArmor, SELinux, and seccomp). Is the AppArmor profile applied before the container is restarted?

@justincormack PTAL

@phillipp
Author

Yes, it is applied before the restart, but not after.

@thaJeztah
Member

That may actually be a bug, as --privileged should disable all of these. I must admit, though, that I don't know whether manually setting the AppArmor profile back to its default value was ever taken into account there. I suspect the cause here is that, upon restart, the daemon can't determine whether the "default" profile was set manually or not set at all (in which case --privileged disables the profile).
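To make this hypothesis concrete, here is a toy model in Python. It is not Docker's code, and every name in it is hypothetical. `effective_profile` encodes the precedence the reporter expected (an explicitly requested profile wins; otherwise --privileged means unconfined), and `profile_on_restart_ambiguous` models the suspected failure, where the persisted value cannot distinguish "docker-default was requested" from "nothing was requested". Note that the reporter's hand-crafted profiles are reportedly affected as well, which this model alone would not explain.

```python
from typing import Optional

DEFAULT_PROFILE = "docker-default"
UNCONFINED = "unconfined"


def effective_profile(explicit: Optional[str], privileged: bool) -> str:
    """Hypothetical precedence: an explicit profile wins; otherwise
    --privileged means unconfined; otherwise use the default profile."""
    if explicit:
        return explicit
    if privileged:
        return UNCONFINED
    return DEFAULT_PROFILE


def profile_on_restart_ambiguous(stored: str, privileged: bool) -> str:
    """Hypothetical model of the suspected bug: a stored 'docker-default'
    is treated as 'nothing was requested', so --privileged takes over."""
    explicit = None if stored == DEFAULT_PROFILE else stored
    return effective_profile(explicit, privileged)
```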

Note that using --privileged is generally strongly discouraged, as it's far too permissive; if a container needs specific capabilities, using --cap-add instead is a better approach.

We should look into this, though, and (at least) document the expected behavior and, if --privileged and --security-opt cannot be combined, produce an error instead of silently ignoring the option on restart.

@phillipp
Author

phillipp commented Oct 25, 2018

@thaJeztah

I feel like I should explain a bit about the use case, because I'm eager to remove --privileged, if possible.

The use-case is very simple: the container must be able to configure cgroups inside the docker container cgroup (for example, create the cgroups /sys/fs/cgroup/cpu/docker/[container-id]/pool1, /sys/fs/cgroup/cpu/docker/[container-id]/pool2, ... as seen from the host). So the cgroup filesystem inside the container must be writable.

I have found no way other than --privileged to make the cgroup filesystem that is mounted into the container writable. Is there an (undocumented) way to do that?

To reduce the attack surface, I paired --privileged with a hand-crafted AppArmor profile (docker-default in the example is only a stand-in, since I suppose few people hand-craft AppArmor profiles themselves or have some lying around for testing). In our case, we have multiple apps in containers that use different AppArmor profiles and a common seccomp profile.

The problem now is that, for example after a reboot, the apps in the container can no longer change AppArmor hats, because they run in the "unconfined" profile, which has no hats defined. So this is clearly not the same behaviour as on the first start ("docker run").

I'm more than happy to remove --privileged if there is a way to make the cgroup sysfs writable. How can that be achieved?

Nevertheless, the inconsistency regarding before and after restart must IMO be fixed, because it creates the very dangerous situation where the sysadmin expects the container to run confined, but it does not.

@thaJeztah
Member

The use-case is very simple: the container must be able to configure cgroups inside the docker container cgroup

(Just curious) That means the container itself controls how many resources it can use; is there a reason you cannot set those restrictions when starting the container?

Nevertheless, the inconsistency regarding before and after restart must IMO be fixed, because it creates the very dangerous situation where the sysadmin expects the container to run confined, but it does not.

Agreed.

@phillipp
Author

phillipp commented Oct 25, 2018

@thaJeztah

You understand correctly: the container needs to assign forked processes inside the container to cgroups. Via the master process, it controls how much of the resources it is given goes to each tenant.

For example, the service may fork 50 new processes. Each group of processes (one cgroup) should get equal weight, not each individual process. If one connection requires 50 forked processes and another needs only one, the 50 processes would hog all resources and the second connection would starve. Cgroups are meant to solve exactly this problem.
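A worked example of that arithmetic, assuming a single CPU, all 51 processes CPU-bound with equal nice values, and both pools left at the cgroup v1 default `cpu.shares` of 1024 ($w_1 = w_2 = 1024$). Under per-group fairness, pool 1's 50 processes split the other half, about 1% each.

```latex
\text{per-process fairness:}\quad s_{\text{conn 2}} = \frac{1}{50 + 1} = \frac{1}{51} \approx 2\%
\\[4pt]
\text{per-cgroup fairness:}\quad s_{\text{pool 2}} = \frac{w_2}{w_1 + w_2} = \frac{1024}{1024 + 1024} = 50\%
```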

The tenants (and therefore cgroup names) are unknown at container startup and very numerous (400k across the system), so the cgroups need to be created and assigned dynamically.

The container itself is its own cgroup created by Docker, so containers don't hog resources. So yes, in the end we have a hierarchy like this:

/sys/fs/cgroup/cpu/docker/[container-id]/tasks (for example master process)
/sys/fs/cgroup/cpu/docker/[container-id]/pool1/tasks (child procs of tenant 1, cg created by master)
/sys/fs/cgroup/cpu/docker/[container-id]/pool2/tasks (child procs of tenant 2, cg created by master)

etc.
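A minimal sketch of what the master process does for each tenant, assuming cgroup v1 with the cpu controller and a writable cgroup filesystem. `assign_to_pool`, the pool names, and the weight are hypothetical. Seen from the host, `container_cgroup` would be `/sys/fs/cgroup/cpu/docker/[container-id]`; the equivalent path inside the container depends on how cgroupfs is mounted there.

```python
from pathlib import Path


def assign_to_pool(container_cgroup: Path, pool: str, pid: int,
                   shares: int = 1024) -> Path:
    """Create (if needed) the child cgroup `pool` under the container's
    cgroup v1 cpu directory, set its weight, and move `pid` into it."""
    pool_dir = container_cgroup / pool
    # On a real cgroupfs, mkdir makes the kernel create the control files.
    pool_dir.mkdir(exist_ok=True)
    # Equal cpu.shares across pools gives each tenant equal weight,
    # regardless of how many processes it has.
    (pool_dir / "cpu.shares").write_text(f"{shares}\n")
    # Writing a PID to "tasks" moves that single thread; writing it to
    # "cgroup.procs" instead would move every thread of the process.
    with open(pool_dir / "tasks", "a") as tasks:
        tasks.write(f"{pid}\n")
    return pool_dir
```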

Just to clarify: we have had this setup in production for more than a year, and it has improved reliability in a major, major way.

@justincormack
Contributor

Have you tried bind mounting or mounting the cgroups mount points into the container instead (read-write)? This should make them modifiable. You could add them at a different mount point. I have done this with other mounts sometimes to modify them without using privileged.
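The suggestion above, sketched as a hypothetical helper that only assembles a `docker run` argument list and runs nothing itself. `run_args` and the target path `/host-cgroup/cpu` are illustrative assumptions. Whether writes through such a mount succeed under a given AppArmor profile and the default capabilities has not been verified here. Note also that bind-mounting the host's `/sys/fs/cgroup/cpu` exposes the whole host cpu hierarchy, not only the container's own cgroup.

```python
from typing import List


def run_args(image: str, profile: str,
             cgroup_src: str = "/sys/fs/cgroup/cpu",
             cgroup_dst: str = "/host-cgroup/cpu") -> List[str]:
    """Build a `docker run` argv that bind-mounts the cpu cgroup hierarchy
    at a separate path instead of using --privileged. Bind mounts created
    with --mount are read-write unless 'readonly' is added."""
    return [
        "docker", "run", "-d",
        "--security-opt", f"apparmor={profile}",
        "--mount", f"type=bind,source={cgroup_src},target={cgroup_dst}",
        image,
    ]
```

Usage: `subprocess.run(run_args("IMAGE", "my-profile"), check=True)`, where `IMAGE` and `my-profile` are placeholders.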

@phillipp
Author

@justincormack IIRC I did, but that was two years ago. I don't recall the exact problem, but I'll make a test build and see if I can make it work now.

That aside, what can we do to fix this? As I said, anyone who thinks the container is AppArmor-protected is wrong after a restart or system reboot: the processes in the container are not confined as expected. I think that's still a huge problem.

I had a look at the code and tried to find the places where AppArmor support is needed for start/restart.

I haven't really found where AppArmor is applied in the containerStart routine; can you give me a hint?

@neersighted neersighted changed the title Docker runs container as "unconfined" after restarting container Docker runs privileged container as "unconfined" after restarting container Feb 29, 2024