Docker runs privileged container as "unconfined" after restarting container #38075
Comments
To my knowledge, the @justincormack PTAL
Yes, it is applied before the restart, but not after.
That may actually be a bug. Note that generally, it's really discouraged to use `--privileged`. We should look into this though, and (at least) document the expected behavior.
I feel like I should explain a bit about the use case, because I'm eager to remove `--privileged`.

The use case is very simple: the container must be able to configure cgroups inside the Docker container's cgroup (for example, create the cgroups /sys/fs/cgroup/cpu/docker/[container-id]/pool1, /sys/fs/cgroup/cpu/docker/[container-id]/pool2, ... as seen from the host). So the cgroup filesystem inside the container must be writable. I have found no way other than `--privileged` to make the cgroup filesystem that is mounted into the container writable. Is there an (undocumented) way to do that?

To reduce the attack surface, I paired `--privileged` with a hand-crafted AppArmor profile (the docker-default profile in the example is just an example, because I suppose not many people hand-craft AppArmor profiles themselves and have some lying around for testing). In our case, we have multiple apps in containers that use different AppArmor profiles and a common seccomp profile.

The problem now is that, for example after rebooting, the apps in the container cannot change AppArmor hats anymore, because they run under the "unconfined" profile and the "unconfined" profile has no hats defined. So, clearly not the same behaviour as on first start (`docker run`).

I'm more than happy to remove the `--privileged` if there is a way to make the cgroup sysfs writable. How is that achievable? Nevertheless, the inconsistency between before and after restart must IMO be fixed, because it creates the very dangerous situation where the sysadmin expects the container to run confined, but it does not.
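A minimal sketch of the setup described above (the `ubuntu` image and the `docker-default` profile are illustrative placeholders, not our production values; assumes an AppArmor-enabled Docker host):

```shell
# Sketch only: pair --privileged (for a writable cgroupfs inside the
# container) with an AppArmor profile to reduce the attack surface.
if command -v docker >/dev/null 2>&1; then
    cid=$(docker run -d --privileged \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 2>/dev/null) || cid=""
    if [ -n "$cid" ]; then
        docker rm -f "$cid" >/dev/null
    fi
else
    echo "docker not available; commands shown for illustration only"
fi
sketch_done=1
```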
(Just curious) That means the container itself can control how many resources it can use; is there a reason you cannot set those restrictions when starting the container?
Agreed.
You understand correctly: the container needs to assign processes forked inside the container into cgroups. So it controls how much of the resources it is itself given it assigns to which tenant, via the master process. For example, the service can create 50 new processes by forking. These groups of processes (in a cgroup) should have equal weight, but not the processes themselves. If, for example, one connection requires 50 processes to be forked and another connection needs only 1, the 50 processes would hog all resources and connection 2 would starve. Cgroups are supposed to solve exactly this problem. The tenants (and therefore the cgroup names) are unknown at container startup and huge in number (400k across the system), so they need to be created and assigned dynamically. The container itself is its own cgroup created by Docker, so containers don't hog resources. So yes, in the end we have a hierarchy like this:
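From inside the container, the per-tenant assignment described above can be sketched like this (cgroup v1 paths; "pool1"/"pool2" are example tenant names and the weight value is arbitrary):

```shell
# Sketch: create one cgroup per tenant under the container's cpu cgroup
# and give each tenant group equal weight, regardless of process count.
CG=/sys/fs/cgroup/cpu
if [ -d "$CG" ] && [ -w "$CG" ]; then
    mkdir -p "$CG/pool1" "$CG/pool2" 2>/dev/null || true
    echo 1024 > "$CG/pool1/cpu.shares" 2>/dev/null || true
    echo 1024 > "$CG/pool2/cpu.shares" 2>/dev/null || true
    # A forked worker is then assigned by writing its PID, e.g.:
    #   echo "$worker_pid" > "$CG/pool1/cgroup.procs"
fi
cgroup_sketch_done=1
```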
Just to clarify: we have had this setup in prod for more than a year and it has improved reliability in a major, major way.
Have you tried bind mounting or mounting the cgroup mount points into the container instead (read-write)? This should make them modifiable. You could add them at a different mount point. I have done this with other mounts sometimes to modify them without using privileged.
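An untested sketch of that suggestion (paths and image name are illustrative; whether a read-write bind mount actually suffices depends on the host's cgroup layout):

```shell
# Sketch: bind-mount a cgroup controller read-write instead of using
# --privileged, while keeping the AppArmor profile in place.
if command -v docker >/dev/null 2>&1; then
    docker run -d \
        -v /sys/fs/cgroup/cpu:/sys/fs/cgroup/cpu:rw \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 >/dev/null 2>&1 || true
fi
mount_sketch_done=1
```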
@justincormack IIRC I did, but that was two years ago. I don't recall the exact problem, but I'll make a test build and see if I can make it work now. That aside, what can we do to fix this? As I said, if someone thinks the container is AppArmor-protected, that's wrong after a restart / system reboot, and the processes in the container are not confined as thought. I think that's still huge. I had a look at the code and tried to find the places where AppArmor support for start/restart is needed. I haven't really found where AppArmor is applied in the `containerStart` routine; can you give me a hint?
Description
A privileged container confined to an AppArmor profile specified with `--security-opt apparmor=profile-name` is run as "unconfined" after restarting the container. The profile is only used when first starting the container with `docker run`; stopping the container and starting it again with `docker start` starts it as unconfined (auto-restart triggers it, too).

Steps to reproduce the issue:

1. `docker run --privileged --security-opt apparmor=docker-default`, note the container ID and stop it

Another way, same results:
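The steps above can be scripted roughly as follows (image and profile names illustrative; needs root on an AppArmor-enabled host). Reading /proc/&lt;pid&gt;/attr/current shows the profile the container's init process actually runs under:

```shell
# Sketch of the reproduction: run privileged + confined, stop/start,
# then inspect the runtime AppArmor label of the container's process.
if command -v docker >/dev/null 2>&1; then
    cid=$(docker run -d --privileged \
        --security-opt apparmor=docker-default \
        ubuntu sleep 300 2>/dev/null) || cid=""
    if [ -n "$cid" ]; then
        docker stop "$cid" >/dev/null
        docker start "$cid" >/dev/null
        pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
        # On an affected daemon this prints "unconfined" instead of
        # "docker-default (enforce)":
        cat "/proc/$pid/attr/current"
        docker rm -f "$cid" >/dev/null
    fi
fi
repro_done=1
```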
Describe the results you received:

First of all, `aa-status` does not show the processes in the container to be in enforce mode. They are missing from the `aa-status` output completely.

Secondly, audit.log shows the following log entry, coming from the "unconfined" profile, if you try to change hat:
Describe the results you expected:

`aa-status` showing the processes under "processes are in enforce mode".

Additional information you deem important (e.g. issue happens only occasionally):
The bug only presents itself in privileged containers.
When you don't have code that changes AppArmor hats, I don't see how the bug surfaces at all, unless you watch `aa-status`. As the container runs unconfined, you'll never really see any problems or hints of one.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
Ubuntu trusty, KVM virtual machine