Some capabilities do not work in a container with user namespace support #25622

DeathTickle · 2016-08-11T15:25:58Z

I am trying to run a program inside a docker container with user namespace support that requires some capabilities. Inside the container I am able to see that the capabilities were added but when I make calls that require these capabilities they fail.

The same calls work fine when I bypass user namespace with --userns host and add the required capabilities with --cap-add.

So far I have tested changing the scheduling policy and thread priority with a C program and chrt. This requires the SYS_NICE capability.
I have also tested creating a message queue with larger sizes than allowed in /proc/sys/fs/mqueue. For the call to succeed the program would require the SYS_RESOURCE capability.

Nothing in the dockerd documentation for user namespace support suggests that this is unsupported. In my opinion this would be a very useful feature that would increase security by enabling users to run semi-privileged programs in their own user namespace.

In any case, in the Docker engine reference for dockerd under "User namespace known restrictions", the following paragraphs should be modified to reflect these new restrictions:

Using --privileged mode flag on docker run (unless also specifying --userns=host)

Using --privileged mode or --cap-add for certain capabilities on docker run (unless also specifying --userns=host)

Finally, while the root user inside a user namespaced container process has many of the expected admin privileges that go along with being the superuser, the Linux kernel has restrictions based on internal knowledge that this is a user namespaced process. The most notable restriction that we are aware of at this time is the inability to use mknod. Permission will be denied for device creation even as container root inside a user namespace.

This paragraph could be extended to list all the functions that fail or capabilities that have no effect.

BUG REPORT INFORMATION

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 6
 Running: 1
 Paused: 0
 Stopped: 5
Images: 20
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/296608.296608/aufs
 Backing Filesystem: extfs
 Dirs: 25
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-34-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 2.768 GiB
Docker Root Dir: /var/lib/docker/296608.296608

dockerd run command:
/usr/bin/dockerd --userns-remap=default -H fd://

I am running an Ubuntu VM on VMware Player on Windows 7

_For scheduling:_
Steps to reproduce the issue:

docker run -it --rm --cap-add SYS_NICE ubuntu bash
chrt -f -p 99 1

Describe the results you received:
chrt: failed to set pid 1's policy: Operation not permitted

Describe the results you expected:
I get the expected results using --userns host

docker run -it --rm --userns host --cap-add SYS_NICE ubuntu bash
chrt -f -p 99 1
chrt -p 1
pid 1's current scheduling policy: SCHED_FIFO
pid 1's current scheduling priority: 99

_For the message queue:_
Here is a simple C program to test the message queue: https://gist.github.com/DeathTickle/aa8f980577d498850af4e819319636f9

Modify (or don't) the QUEUE_SIZE and QUEUE_MSG_SIZE defines to be greater than /proc/sys/fs/mqueue/msg_max and/or /proc/sys/fs/mqueue/msgsize_max

Compile it with: gcc mq_open_test.c -o mq_open_test -lrt

Steps to reproduce the issue:

docker run -it --rm --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
./mq_open_test

Describe the results you received:
mq_open: Invalid argument

Describe the results you expected:
I get the expected results using --userns host

docker run -it --rm --userns host --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
./mq_open_test
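
To confirm that the sizes compiled into the test program really are above the kernel limits (and therefore need SYS_RESOURCE), a small check along these lines may help. It is only a sketch: `read_limit` and `exceeds_limits` are hypothetical helper names, and the numbers in the usage note are illustrative, not taken from the gist.

```python
from pathlib import Path

# Directory holding the POSIX message queue limits mentioned above.
MQUEUE_DIR = Path("/proc/sys/fs/mqueue")


def read_limit(name, default=None):
    """Return the integer stored in /proc/sys/fs/mqueue/<name>, or `default` if unreadable."""
    try:
        return int((MQUEUE_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return default


def exceeds_limits(maxmsg, msgsize, msg_max, msgsize_max):
    """True if the requested queue attributes are above either limit."""
    return maxmsg > msg_max or msgsize > msgsize_max
```

For example, `exceeds_limits(11, 8192, read_limit("msg_max", 10), read_limit("msgsize_max", 8192))` reports whether a queue of 11 messages of 8192 bytes would go past the limits on the current machine.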


cpuguy83 · 2016-08-11T15:31:24Z

I have a feeling this is blocked by seccomp, not userns.
Can you try with --security-opt seccomp:disabled?

DeathTickle · 2016-08-11T15:55:24Z

docker run -it --rm --cap-add SYS_NICE --security-opt seccomp=unconfined ubuntu bash
root@bbf029cd19d9:/# chrt -f -p 99 1
chrt: failed to set pid 1's policy: Operation not permitted

The option --security-opt seccomp:disabled was failing.
Maybe this is coming from Apparmor?

I have just tried the IPC_LOCK capability with mlock() and that fails as well.

cpuguy83 · 2016-08-11T15:59:11Z

Sorry yes unconfined is the correct term :)

Could be apparmor... --security-opt apparmor=unconfined

DeathTickle · 2016-08-11T16:03:51Z

Doesn't look like it ...
docker run -it --cap-add SYS_NICE --security-opt apparmor=unconfined --rm ubuntu bash
root@a12f12df0c88:/# chrt -f -p 99 1
chrt: failed to set pid 1's policy: Operation not permitted

I also tried with both security settings set to unconfined

cpuguy83 · 2016-08-11T16:08:12Z

Does this work with --privileged?

DeathTickle · 2016-08-11T16:13:25Z

Yes, I showed that at the end of the bug report. It works with --userns host --privileged or --userns host --cap-add SYS_NICE.

And obviously it also works with the docker daemon running without user namespace support.

That seems to rule out everything except for the user namespace.

cpuguy83 · 2016-08-11T16:15:42Z

@DeathTickle In this case, the userns is just whatever the daemon is in. --privileged does a bunch of things none of which are related to userns (unless there's been a change).

ping @estesp

DeathTickle · 2016-08-11T16:19:55Z

I'll also add that the additional capabilities are being correctly detected by capsh
An example:
docker run -it --cap-add SYS_NICE --rm ubuntu bash
root@afbee179e505:/# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,**_cap_sys_nice_**,cap_mknod,cap_audit_write,cap_setfcap+eip
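
The same information can be read without capsh from the CapEff line of /proc/self/status. The sketch below decodes that hexadecimal mask for a handful of capabilities; the bit numbers come from <linux/capability.h>, while `CAP_BITS`, `decode_caps` and `effective_mask` are hypothetical names chosen for illustration. Like capsh, it only shows which capabilities the process holds, not whether a particular call will be allowed.

```python
# Bit positions from <linux/capability.h>; only a few capabilities are listed.
CAP_BITS = {
    "cap_ipc_lock": 14,
    "cap_sys_module": 16,
    "cap_sys_nice": 23,
    "cap_sys_resource": 24,
    "cap_sys_time": 25,
    "cap_mknod": 27,
}


def decode_caps(mask):
    """Return the sorted names of the known capabilities set in `mask`."""
    return sorted(name for name, bit in CAP_BITS.items() if mask & (1 << bit))


def effective_mask(status_text):
    """Extract the CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")
```

Inside a container, `decode_caps(effective_mask(open("/proc/self/status").read()))` should include `cap_sys_nice` when the container was started with --cap-add SYS_NICE.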

estesp · 2016-08-11T18:19:37Z

I've taken a bit of time to dig into the SYS_NICE-specific problem. I'm afraid the cause is deep within the kernel's decision about whether you can call sched_setscheduler with the FIFO scheduler option (your use of -f) in a user namespace with the current capability set. Based on my strace/debug, this syscall is where you are getting EPERM, and without kernel debugging I don't know where in that code path (there are several places in the source that can return EPERM) you are being denied even though the cap is added.

In general I agree if we find a list of concrete restrictions that aren't solvable without changing Linux kernel restrictions, we should add them to the documentation.

justincormack · 2016-08-11T23:04:18Z

I think this is because scheduling is essentially a global property, so it is probably being denied for that reason. I agree that the documentation is unclear, but essentially you get capabilities over local (i.e. namespaced) resources, not global ones. What counts as global vs local is not well documented.

DeathTickle · 2016-08-12T12:35:58Z

After more testing I have come to a conclusion similar to @justincormack's. Many Linux capabilities act on some global property that only the real root user can change. These capabilities can be granted to another user in a user namespace, but calls that require them still fail and return the EPERM error.

I have tested the same calls (sched_setscheduler, mlock and mq_open) in an LXC container and I get the same behaviour. I also found a helper program in the man page for user namespaces that creates an arbitrary user namespace: userns_child_exec.c. I get the same results inside a simple user only namespace created by this program.
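
When comparing environments like these, it can help to confirm that a process really is in a non-initial user namespace. One rough way is to look at /proc/self/uid_map: the initial namespace shows a single identity mapping over the full 32-bit range. The sketch below is a heuristic only (a child namespace could in principle be given the same mapping), and the function names are hypothetical.

```python
def parse_uid_map(text):
    """Parse the text of /proc/<pid>/uid_map into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(field) for field in fields))
    return entries


def looks_like_initial_userns(text):
    """Heuristic: a single identity mapping covering the whole UID range."""
    return parse_uid_map(text) == [(0, 0, 4294967295)]
```

With the daemon from this report, container root maps to host UID 296608, so a uid_map line such as `0 296608 65536` (the length 65536 is an assumed subordinate range) would make `looks_like_initial_userns` return False.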

In my opinion this behaviour is counterintuitive and could lead to bugs. If a program checks its capabilities with cap_get_proc and sees that it has a capability it needs, but a call that depends on that capability then fails, the program has to handle an error that is specific to user namespaces.

I might try to make a list of the capabilities that are not effective in a user namespace and it would be nice to have it referenced in the documentation. So far they are SYS_NICE, SYS_RESOURCE and IPC_LOCK. Maybe the kernel documentation or developers have more information on this but I haven't been able to find it.

justincormack · 2016-08-12T16:32:56Z

The documentation is actually fairly clear about how limited this is (I have just done some testing and found that CAP_MKNOD does not work):

http://man7.org/linux/man-pages/man7/user_namespaces.7.html

       On the other hand, there are many privileged operations that affect
       resources that are not associated with any namespace type, for
       example, changing the system time (governed by CAP_SYS_TIME), loading
       a kernel module (governed by CAP_SYS_MODULE), and creating a device
       (governed by CAP_MKNOD).  Only a process with privileges in the
       initial user namespace can perform such operations.

I think that programs should not check their capabilities; they should try the operation and see whether it fails. Lots of subsystems can now disable the ability to do things, so it is hard to work out whether something will be allowed other than by trying.
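
A minimal sketch of that "try it and handle the failure" approach, using the SCHED_FIFO case from this issue. `attempt` and `request_fifo` are hypothetical names; `os.sched_setscheduler` is the standard-library wrapper for the syscall and is only available on Linux-like platforms.

```python
import errno
import os


def attempt(operation):
    """Run `operation`; return None on success or the errno name if it raises OSError."""
    try:
        operation()
    except OSError as exc:
        return errno.errorcode.get(exc.errno, "UNKNOWN")
    return None


def request_fifo(priority=1):
    """Ask for SCHED_FIFO for the calling process (changes its scheduling if allowed)."""
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
```

Calling `attempt(request_fifo)` returns None where the kernel allows the change. In the user-namespaced container described above it would be expected to return "EPERM", and the program can then fall back to the default scheduling policy instead of relying on what its capability set claims.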

DeathTickle · 2016-08-16T13:58:45Z

The documentation is clear but only for the 3 mentioned capabilities.

I have found more information regarding capabilities in user namespaces here:
https://lists.linuxcontainers.org/pipermail/lxc-users/2016-May/011665.html
https://lwn.net/Articles/420624/

If I have understood correctly, any kernel code path that checks capable() instead of ns_capable() requires privileges in the initial user namespace, since capable() is defined as:
return ns_capable(&init_user_ns, cap);
Kernel functions requiring capabilities that support being used in user namespaces make a check with ns_capable(some_user_ns).

Making a list of functions that use capable() in the kernel and mapping them to the required capabilities would show which capabilities one can add to a docker container that have no effect.
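
The difference between the two checks can be illustrated with a toy model. This is not kernel code: it is a simplified sketch, assuming a capability counts against a target namespace only when the task holds it in that namespace or in one of its ancestors, and it leaves out the kernel's extra rule for the owner of a namespace. `UserNS`, `Task` and the Python functions are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None


@dataclass(frozen=True)
class Task:
    user_ns: UserNS
    effective: FrozenSet[str]


INIT_USER_NS = UserNS("init_user_ns")


def ns_capable(task, target_ns, cap):
    """Toy check: walk up from target_ns; the task's own namespace must be on that path."""
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.effective
        ns = ns.parent
    return False


def capable(task, cap):
    """Toy equivalent of capable(): always checks against the initial namespace."""
    return ns_capable(task, INIT_USER_NS, cap)
```

In this model a container root task with CAP_SYS_NICE passes `ns_capable` against its own namespace but fails `capable`, which matches the behaviour reported in this issue.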

mterron · 2018-06-22T00:36:34Z

Not having IPC_LOCK is a big issue for containers that need to ensure memory is not swapped for security or performance reasons. Off the top of my head I can think of ElasticSearch and Hashicorp Vault; I'm sure there are more.

GordonTheTurtle added the version/1.12 label Aug 11, 2016

justincormack added the area/security label Aug 12, 2016

justincormack added the area/docs label Oct 10, 2016

jcberthon mentioned this issue Nov 6, 2016

Can't bind to privileged ports as non-root #8460

Closed

thaJeztah added the area/security/userns label Mar 21, 2023

This was referenced Sep 18, 2023

rootless docker doesn't apply capabilities #46501

Closed

rootless: document limitations of linux capabilities when running rootless docker/docs#18231

Open
