Some capabilities do not work in a container with user namespace support #25622

Open
DeathTickle opened this Issue Aug 11, 2016 · 14 comments


DeathTickle commented Aug 11, 2016

I am trying to run a program inside a docker container with user namespace support that requires some capabilities. Inside the container I am able to see that the capabilities were added but when I make calls that require these capabilities they fail.

The same calls work fine when I bypass user namespace with --userns host and add the required capabilities with --cap-add.

So far I have tested changing the scheduling policy and thread priority with a C program and chrt. This requires the SYS_NICE capability.
I have also tested creating a message queue with larger sizes than allowed in /proc/sys/fs/mqueue. For the call to succeed the program would require the SYS_RESOURCE capability.

Nothing in the documentation for user namespace support in dockerd suggests that this would be unsupported. In my opinion this would be a very useful feature that would increase security by enabling users to run semi-privileged programs in their own user namespace.

In any case, in the Docker engine reference for dockerd, under "User namespace known restrictions", the following paragraphs should be modified to reflect these additional restrictions.

Current text:
Using --privileged mode flag on docker run (unless also specifying --userns=host)

Proposed text:
Using --privileged mode or --cap-add for certain capabilities on docker run (unless also specifying --userns=host)

Finally, while the root user inside a user namespaced container process has many of the expected admin privileges that go along with being the superuser, the Linux kernel has restrictions based on internal knowledge that this is a user namespaced process. The most notable restriction that we are aware of at this time is the inability to use mknod. Permission will be denied for device creation even as container root inside a user namespace.

This paragraph could be extended to list all the functions that fail or capabilities that have no effect.


BUG REPORT INFORMATION

Output of docker version:

Client:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   8eab29e
 Built:        Thu Jul 28 22:11:10 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 6
 Running: 1
 Paused: 0
 Stopped: 5
Images: 20
Server Version: 1.12.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/296608.296608/aufs
 Backing Filesystem: extfs
 Dirs: 25
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null overlay host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-34-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 2.768 GiB
Docker Root Dir: /var/lib/docker/296608.296608

dockerd run command:
/usr/bin/dockerd --userns-remap=default -H fd://

I am running an Ubuntu VM on VMware Player on Windows 7

_For scheduling:_
Steps to reproduce the issue:

  1. docker run -it --rm --cap-add SYS_NICE ubuntu bash
  2. chrt -f -p 99 1

Describe the results you received:
chrt: failed to set pid 1's policy: Operation not permitted

Describe the results you expected:
I get the expected results using --userns host

  1. docker run -it --rm --userns host --cap-add SYS_NICE ubuntu bash
  2. chrt -f -p 99 1
  3. chrt -p 1
    pid 1's current scheduling policy: SCHED_FIFO
    pid 1's current scheduling priority: 99

_For the message queue:_
Here is a simple C program to test the message queue: https://gist.github.com/DeathTickle/aa8f980577d498850af4e819319636f9

Modify (or don't) the QUEUE_SIZE and QUEUE_MSG_SIZE defines to be greater than /proc/sys/fs/mqueue/msg_max and/or /proc/sys/fs/mqueue/msgsize_max

Compile it with gcc mq_open_test.c -o mq_open_test -lrt
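As a rough companion to the C test, the comparison against the two limit files can be sketched in Python. This is only an illustration: the helper names `read_limit` and `exceeds_limits` are hypothetical, and the only paths used are the ones named above. Per mq_open(3), a process without CAP_SYS_RESOURCE that asks for attributes above these limits gets EINVAL, which matches the "Invalid argument" result reported below.

```python
from pathlib import Path

# Directory holding the limits referenced in this report.
MQUEUE_DIR = Path("/proc/sys/fs/mqueue")


def read_limit(name, base=MQUEUE_DIR):
    """Return the integer stored in base/name, or None if it cannot be read."""
    try:
        return int((base / name).read_text().strip())
    except (OSError, ValueError):
        return None


def exceeds_limits(queue_size, msg_size, msg_max, msgsize_max):
    """True if either requested attribute is above its limit.

    In that case mq_open needs CAP_SYS_RESOURCE to succeed.
    """
    return queue_size > msg_max or msg_size > msgsize_max
```

A caller would pass `read_limit("msg_max")` and `read_limit("msgsize_max")` together with the QUEUE_SIZE and QUEUE_MSG_SIZE values from the test program, after checking that neither limit came back as None.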

Steps to reproduce the issue:

  1. docker run -it --rm --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
  2. ./mq_open_test

Describe the results you received:
mq_open: Invalid argument

Describe the results you expected:
I get the expected results using --userns host

  1. docker run -it --rm --userns host --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
  2. ./mq_open_test

cpuguy83 Aug 11, 2016

Contributor

I have a feeling this is blocked by seccomp, not userns.
Can you try with --security-opt seccomp:disabled?

DeathTickle Aug 11, 2016

docker run -it --rm --cap-add SYS_NICE --security-opt seccomp=unconfined ubuntu bash
root@bbf029cd19d9:/# chrt -f -p 99 1
chrt: failed to set pid 1's policy: Operation not permitted

The option --security-opt seccomp:disabled was failing.
Maybe this is coming from Apparmor?

I have just tried the IPC_LOCK capability with mlock() and that fails as well.

cpuguy83 Aug 11, 2016

Contributor

Sorry yes unconfined is the correct term :)

Could be apparmor... --security-opt apparmor=unconfined

DeathTickle Aug 11, 2016

Doesn't look like it ...
docker run -it --cap-add SYS_NICE --security-opt apparmor=unconfined --rm ubuntu bash
root@a12f12df0c88:/# chrt -f -p 99 1
chrt: failed to set pid 1's policy: Operation not permitted

I also tried with both security settings set to unconfined

cpuguy83 Aug 11, 2016

Contributor

Does this work with --privileged?

DeathTickle Aug 11, 2016

Yes, I showed that at the end of the bug report. It works with --userns host --privileged or --userns host --cap-add SYS_NICE.

And obviously it also works with the docker daemon running without user namespace support.

That seems to rule out everything except for the user namespace.

cpuguy83 Aug 11, 2016

Contributor

@DeathTickle In this case, the userns is just whatever the daemon is in. --privileged does a bunch of things none of which are related to userns (unless there's been a change).

ping @estesp

DeathTickle Aug 11, 2016

I'll also add that the additional capabilities are being correctly detected by capsh
An example:
docker run -it --cap-add SYS_NICE --rm ubuntu bash
root@afbee179e505:/# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap+eip
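The same information is available without capsh, from the CapEff line of /proc/self/status. The sketch below decodes that hexadecimal mask in Python. The helper names are hypothetical, and the bit numbers are the ones defined in linux/capability.h for the capabilities discussed in this thread. Like capsh, it only reports that the bit is set, which says nothing about whether the kernel will honour it inside a user namespace.

```python
# Bit positions from linux/capability.h for the capabilities in this thread.
CAP_BITS = {
    "cap_ipc_lock": 14,
    "cap_sys_nice": 23,
    "cap_sys_resource": 24,
    "cap_mknod": 27,
}


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, name):
    """True if the named capability's bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)
```

Inside a container this could be used as `has_cap(parse_cap_eff(open("/proc/self/status").read()), "cap_sys_nice")`.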

estesp Aug 11, 2016

Contributor

I've taken a bit of time to dig into the SYS_NICE-specific problem; I'm afraid it lies deep within the kernel's decision about whether you can call sched_setscheduler with the FIFO scheduler option (your use of -f) in a user namespace with the current capability set. Based on my strace/debug, this syscall is where you are getting EPERM, and without kernel debug I don't know where in that code path (there are several chances to hit EPERM in the source) you are being denied even though the cap is added.

In general I agree if we find a list of concrete restrictions that aren't solvable without changing Linux kernel restrictions, we should add them to the documentation.

justincormack Aug 11, 2016

Contributor

I think this is because scheduling is essentially a global property, so it is probably being denied for that reason. I agree that the documentation is unclear, but essentially you get capabilities over local (i.e. namespaced) resources, not global ones. What counts as global vs local is not well documented.

DeathTickle Aug 12, 2016

After more testing I have come to a conclusion similar to @justincormack's. Many Linux capabilities act on some global property that only the real root user can use. These capabilities can be given to another user in a user namespace, but calls that require them do not work and return the EPERM error.

I have tested the same calls (sched_setscheduler, mlock and mq_open) in an LXC container and I get the same behaviour. I also found a helper program in the man page for user namespaces that creates an arbitrary user namespace: userns_child_exec.c. I get the same results inside a simple user only namespace created by this program.

In my opinion this behaviour is counterintuitive and could lead to bugs. If a program checks its capabilities with cap_get_proc and sees that it has a capability it needs, but a call that depends on that capability then fails, the program has to handle an error that is specific to user namespaces.

I might try to make a list of the capabilities that are not effective in a user namespace and it would be nice to have it referenced in the documentation. So far they are SYS_NICE, SYS_RESOURCE and IPC_LOCK. Maybe the kernel documentation or developers have more information on this but I haven't been able to find it.
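One way a program could tell that it is probably running in a remapped user namespace is to look at /proc/self/uid_map. The Python sketch below is a heuristic only, and the function name is hypothetical. It assumes that the initial namespace shows a single identity mapping covering the full ID range. A child namespace can in principle be given the same mapping, so a positive result is not proof.

```python
def is_initial_userns_map(uid_map_text):
    """Heuristic check on the contents of /proc/<pid>/uid_map.

    Returns True when the map is a single row mapping the whole ID range
    onto itself, which is what the initial user namespace shows.
    """
    rows = [line.split() for line in uid_map_text.splitlines() if line.strip()]
    return rows == [["0", "0", "4294967295"]]
```

With --userns-remap the map instead looks like a single row starting at a high host ID (296608 in the docker info output above), and the function returns False.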

justincormack Aug 12, 2016

Contributor

The documentation is actually fairly clear about how limited this is (having just done some testing and found that CAP_MKNOD does not work):

http://man7.org/linux/man-pages/man7/user_namespaces.7.html

       On the other hand, there are many privileged operations that affect
       resources that are not associated with any namespace type, for
       example, changing the system time (governed by CAP_SYS_TIME), loading
       a kernel module (governed by CAP_SYS_MODULE), and creating a device
       (governed by CAP_MKNOD).  Only a process with privileges in the
       initial user namespace can perform such operations.

I think that programs should not check their capabilities, they should try to do operations and see if they fail. Lots of subsystems can disable ability to do things now, so it is hard to work out whether something will be allowed other than trying.
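A minimal Python sketch of that "try it and see" approach follows. The names `attempt` and `request_fifo` are hypothetical, and the scheduling call is just the example from this issue. The point is that the caller reacts to the error the kernel actually returns instead of inspecting its capability set first.

```python
import errno
import os


def attempt(operation):
    """Run operation(); return (True, None) or (False, errno name)."""
    try:
        operation()
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    return True, None


def request_fifo(priority=1):
    """Ask for SCHED_FIFO on the calling process (Linux only)."""
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
```

A program would call `ok, err = attempt(request_fifo)` and, if `err == "EPERM"`, carry on with the default scheduling policy or report that real-time scheduling is unavailable.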

DeathTickle Aug 16, 2016

The documentation is clear but only for the 3 mentioned capabilities.

I have found more information regarding capabilities in user namespaces here:
https://lists.linuxcontainers.org/pipermail/lxc-users/2016-May/011665.html
https://lwn.net/Articles/420624/

If I have understood correctly, any call into the kernel that is guarded by a capable() check instead of ns_capable() requires privileges in the initial user namespace, since capable() is defined as:
return ns_capable(&init_user_ns, cap);
Kernel functions requiring capabilities that support being used in user namespaces make a check with ns_capable(some_user_ns).

Making a list of functions that use capable() in the kernel and mapping them to the required capabilities would show which capabilities one can add to a docker container that have no effect.
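A first pass at such a list could be a simple text scan over the kernel sources. The Python sketch below is a rough heuristic, not a parser: the function name `classify_checks` is hypothetical, it only looks at direct capable(...) and ns_capable(...) calls, and it ignores other variants and wrappers. It records, for each CAP_* name, which of the two checks it was seen with.

```python
import re
from collections import defaultdict

# Match capable(...) or ns_capable(...) and the first CAP_* name that
# follows within the same statement (no semicolon in between).
CHECK = re.compile(r"\b(ns_capable|capable)\s*\([^;]*?\b(CAP_[A-Z_]+)\b")


def classify_checks(source):
    """Map each capability name to the sorted list of checks that use it."""
    found = defaultdict(set)
    for func, cap in CHECK.findall(source):
        found[cap].add(func)
    return {cap: sorted(funcs) for cap, funcs in found.items()}
```

Capabilities that only ever appear with capable() in the result would be candidates for the "has no effect in a user namespace" list. Each one would still need to be confirmed by hand.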

mterron Jun 22, 2018

Not having IPC_LOCK is a big issue for containers that need to ensure memory is not swapped, for security or performance reasons. Off the top of my head I can think of ElasticSearch and Hashicorp Vault; I'm sure there are more.
