Some capabilities do not work in a container with user namespace support #25622
Comments
I have a feeling this is blocked by seccomp, not userns.
The option I have just tried the IPC_LOCK capability with mlock() and that fails as well.
Sorry yes Could be apparmor...
Doesn't look like it ... I also tried with both security settings set to unconfined
Does this work with
Yes, I showed that at the end of the bug report. It works with And obviously it also works with the docker daemon running without user namespace support. That seems to rule out everything except for the user namespace.
@DeathTickle In this case, the userns is just whatever the daemon is in. ping @estesp
I'll also add that the additional capabilities are being correctly detected by
I've taken a bit of time to try and dig into the SYS_NICE specific problem; I'm afraid it is deep within the kernel's decision of whether you can call In general I agree if we find a list of concrete restrictions that aren't solvable without changing Linux kernel restrictions, we should add them to the documentation.
I think this is because scheduling is essentially a global property, so it is probably being denied for that reason. I agree that the documentation is unclear, but essentially you get capabilities over local (i.e. namespaced) resources, not global ones; what counts as global vs local is not well documented.
After more testing I have come to a similar conclusion as @justincormack's. Many Linux capabilities act on some global property that only the real root user can use. These capabilities can be given to another user in a user namespace, but calls requiring the capabilities do not work and return the I have tested the same calls ( In my opinion this behaviour is counterintuitive and could lead to bugs. If a program checks for its capabilities with I might try to make a list of the capabilities that are not effective in a user namespace and it would be nice to have it referenced in the documentation. So far they are
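To make the mismatch concrete, here is a minimal Python sketch that decodes the effective capability mask from `/proc/self/status`. The helper names are hypothetical, and this is not necessarily the check any particular program performs; the bit numbers are the ones defined in `linux/capability.h`. A set bit only tells you the capability is held inside the current user namespace, not that a call depending on it will succeed.

```python
"""Decode the effective capability mask of the current process (Linux only)."""

# Bit numbers as defined in <linux/capability.h>.
CAP_IPC_LOCK = 14
CAP_SYS_NICE = 23
CAP_SYS_RESOURCE = 24


def parse_cap_eff(status_text):
    """Return the CapEff mask found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask without a 0x prefix.
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap_bit):
    """True if the capability with the given bit number is set in the mask."""
    return bool(mask & (1 << cap_bit))


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    for name, bit in [("CAP_IPC_LOCK", CAP_IPC_LOCK),
                      ("CAP_SYS_NICE", CAP_SYS_NICE),
                      ("CAP_SYS_RESOURCE", CAP_SYS_RESOURCE)]:
        print(name, "in CapEff" if has_cap(mask, bit) else "not in CapEff")
```

Run inside a container started with `--cap-add SYS_NICE`, a check like this would be expected to report the capability as present even in the case where `chrt` fails with "Operation not permitted".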
The documentation is actually fairly clear how limited this is (having just done some testing and found that http://man7.org/linux/man-pages/man7/user_namespaces.7.html
I think that programs should not check their capabilities; they should try to do operations and see if they fail. Lots of subsystems can disable the ability to do things now, so it is hard to work out whether something will be allowed other than by trying.
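As a rough illustration of the "try it and see" approach, the following Python sketch attempts to switch the current process to `SCHED_FIFO` and reports the error if the kernel refuses. The function name `try_set_fifo` and the choice of priority 1 are assumptions made for this example; the reproduction in this issue uses `chrt` on pid 1 with priority 99 instead.

```python
import errno
import os


def try_set_fifo(priority=1):
    """Try to switch this process to SCHED_FIFO, then restore the old policy.

    Returns (True, None) on success or (False, errno_name) on failure.
    """
    # Remember the current policy so a successful change can be undone.
    old_policy = os.sched_getscheduler(0)
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError as exc:
        # EPERM here corresponds to "Operation not permitted" from chrt.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    # Best effort: put the original scheduling policy back.
    try:
        os.sched_setscheduler(0, old_policy, old_param)
    except OSError:
        pass
    return True, None


if __name__ == "__main__":
    ok, err = try_set_fifo()
    print("SCHED_FIFO allowed" if ok else "SCHED_FIFO refused: " + err)
```

A program written this way can fall back to normal scheduling when the call is refused, instead of trusting a capability check that may not reflect what the kernel will allow.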
The documentation is clear but only for the 3 mentioned capabilities. I have found more information regarding capabilities in user namespaces here: If I have understood correctly any task calling the kernel with a Making a list of functions that use
Not having IPC_LOCK is a big issue for containers that need to ensure memory is not swapped for security or performance reasons. Off the top of my head I can think of ElasticSearch and Hashicorp Vault; I'm sure there are more.
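For anyone who wants to probe this from inside a container, here is a small Python sketch that calls `mlock()` through `ctypes` and reports the error name on failure. The helper `try_mlock` is hypothetical and the sketch assumes a Linux libc. Per mlock(2), a process without an effective `CAP_IPC_LOCK` can still lock memory up to its `RLIMIT_MEMLOCK` soft limit, so small requests may succeed either way; it is requests above that limit that depend on the capability.

```python
import ctypes
import errno
import resource


def try_mlock(size):
    """Try to mlock() a buffer of `size` bytes; return None or an errno name."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    buf = ctypes.create_string_buffer(size)
    addr = ctypes.addressof(buf)
    if libc.mlock(addr, size) != 0:
        err = ctypes.get_errno()
        return errno.errorcode.get(err, str(err))
    # Unlock again so the probe leaves no locked pages behind.
    libc.munlock(addr, size)
    return None


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("RLIMIT_MEMLOCK soft/hard:", soft, hard)
    for size in (4096, 64 * 1024 * 1024):
        result = try_mlock(size)
        print(size, "bytes:", "locked" if result is None else "failed with " + result)
```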
I am trying to run a program inside a docker container with user namespace support that requires some capabilities. Inside the container I can see that the capabilities were added, but calls that require these capabilities fail.
The same calls work fine when I bypass the user namespace with `--userns host` and add the required capabilities with `--cap-add`.
So far I have tested changing the scheduling policy and thread priority with a C program and `chrt`. This requires the `SYS_NICE` capability.
I have also tested creating a message queue with larger sizes than allowed in `/proc/sys/fs/mqueue`. For the call to succeed the program would require the `SYS_RESOURCE` capability.
Nothing in the documentation for user namespace support in dockerd suggests that this is unsupported. In my opinion this would be a very useful feature that would increase security by enabling users to run semi-privileged programs in their own user namespace.
In any case, in the Docker engine reference for dockerd under "User namespace known restrictions", the following paragraphs should be modified to reflect these new restrictions:
Using `--privileged` mode or `--cap-add` for certain capabilities on docker run (unless also specifying `--userns=host`)
This paragraph could be extended to list all the functions that fail or capabilities that have no effect.
BUG REPORT INFORMATION
Output of `docker version`:
Output of `docker info`:
dockerd run command:
/usr/bin/dockerd --userns-remap=default -H fd://
I am running an Ubuntu VM on VMware Player on Windows 7
_For scheduling:_
Steps to reproduce the issue:
docker run -it --rm --cap-add SYS_NICE ubuntu bash
chrt -f -p 99 1
Describe the results you received:
chrt: failed to set pid 1's policy: Operation not permitted
Describe the results you expected:
I get the expected results using `--userns host`:
docker run -it --rm --userns host --cap-add SYS_NICE ubuntu bash
chrt -f -p 99 1
chrt -p 1
pid 1's current scheduling policy: SCHED_FIFO
pid 1's current scheduling priority: 99
_For the message queue:_
Here is a simple C program to test the message queue: https://gist.github.com/DeathTickle/aa8f980577d498850af4e819319636f9
Modify (or don't) the `QUEUE_SIZE` and `QUEUE_MSG_SIZE` defines to be greater than `/proc/sys/fs/mqueue/msg_max` and/or `/proc/sys/fs/mqueue/msgsize_max`, then compile it with
gcc mq_open_test.c -o mq_open_test -lrt
Steps to reproduce the issue:
docker run -it --rm --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
./mq_open_test
Describe the results you received:
mq_open: Invalid argument
Describe the results you expected:
I get the expected results using `--userns host`:
docker run -it --rm --userns host --cap-add SYS_RESOURCE -v $PWD/mq_open_test:/mq_open_test ubuntu bash
./mq_open_test
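For context on the "Invalid argument" result: per mq_open(3), a process that the kernel does not treat as holding `CAP_SYS_RESOURCE` gets `EINVAL` when the requested `mq_maxmsg` or `mq_msgsize` exceeds the limits under `/proc/sys/fs/mqueue`. The Python sketch below only compares requested values against those limits; it does not call `mq_open` itself. The helper names and the requested values (100 messages of 16384 bytes) are illustrative assumptions, not the values used in the gist.

```python
from pathlib import Path

MQUEUE_DIR = Path("/proc/sys/fs/mqueue")


def read_limit(name, base=MQUEUE_DIR):
    """Read one integer limit such as msg_max or msgsize_max."""
    return int((base / name).read_text().strip())


def exceeded_limits(maxmsg, msgsize, msg_max, msgsize_max):
    """Return the names of the limits the requested queue attributes exceed."""
    exceeded = []
    if maxmsg > msg_max:
        exceeded.append("msg_max")
    if msgsize > msgsize_max:
        exceeded.append("msgsize_max")
    return exceeded


if __name__ == "__main__":
    # Hypothetical request; substitute the QUEUE_SIZE / QUEUE_MSG_SIZE in use.
    requested_maxmsg, requested_msgsize = 100, 16384
    over = exceeded_limits(requested_maxmsg, requested_msgsize,
                           read_limit("msg_max"), read_limit("msgsize_max"))
    if over:
        print("request exceeds:", ", ".join(over))
    else:
        print("request is within the unprivileged limits")
```

If the request is reported as exceeding a limit, the `mq_open` call is one that needs `SYS_RESOURCE` to take effect, which is the case that fails under user namespace remapping in this report.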