Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add proposal to introduce User Capabilities in Kubernetes and CRI.
- Loading branch information
1 parent
98d87ab
commit efac407
Showing
1 changed file
with
321 additions
and
0 deletions.
There are no files selected for viewing
321 changes: 321 additions & 0 deletions
321
contributors/design-proposals/node/user-capabilities.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,321 @@ | ||
# User Capabilities | ||
|
||
**Authors**: | ||
|
||
1. Filipe Brandenburger (@filbranden) | ||
|
||
**Last Updated**: 2018-06-18 | ||
|
||
**Status**: Draft | ||
|
||
This document proposes the introduction of a new "userCapabilities" setting in | ||
Kubernetes and the CRI, in order to support augmented capabilities for non-root | ||
users in containers. | ||
|
||
## Introduction | ||
|
||
Capabilities have been used in OCI containers to weaken root inside the | ||
container, preventing operations (such as creating device nodes, etc.) deemed to | ||
counter other security measures, or have other security implications. | ||
|
||
Capabilities can also be useful when applied to non-root users, in order to | ||
empower them to perform selective actions usually reserved to root. A great | ||
example of where this is useful is having a non-root user running nginx in a | ||
container listening on port 80 (which is typically reserved for root and | ||
requires a non-root user to have the `CAP_NET_BIND_SERVICE` capability to be | ||
able to perform bind on that port.) (See also kubernetes/kubernetes#56374.) | ||
|
||
This document proposes introducing a new "userCapabilities" setting to be able | ||
to add capabilities to a non-root user running inside the container. | ||
|
||
## Why not reuse "capabilities"? | ||
|
||
The main problem with reusing the existing setting (as suggested or implied in | ||
kubernetes/kubernetes#56374), is that it might be unclear at the time of | ||
configuration whether the workload in the container will run as root or | ||
non-root. | ||
|
||
For instance, the user to run an application inside the container might be set | ||
in an [`USER` directive of a | ||
Dockerfile](https://docs.docker.com/v17.09/engine/reference/builder/#user), in | ||
which case the operator writing the Kubernetes config might think they are | ||
configuring capabilities for root, when in fact they are configuring them for | ||
non-root, and might inadvertently give them more access than required, | ||
effectively making that non-root user behave the same as root inside the | ||
container. | ||
|
||
Such cases might arise when the full set of capabilities is redefined in the | ||
config, for example: | ||
|
||
```yaml | ||
securityContext: | ||
capabilities: | ||
drop: | ||
- all | ||
add: | ||
- CAP_CHOWN | ||
- CAP_DAC_OVERRIDE | ||
- CAP_FOWNER | ||
- CAP_FSETID | ||
- CAP_KILL | ||
- CAP_SETGID | ||
- CAP_SETUID | ||
- CAP_SETPCAP | ||
- CAP_NET_BIND_SERVICE | ||
- CAP_NET_RAW | ||
- CAP_SYS_CHROOT | ||
- CAP_MKNOD | ||
- CAP_AUDIT_WRITE | ||
- CAP_SETFCAP | ||
``` | ||
|
||
This defines the capability set to the OCI default for root (for example, to | ||
avoid any changes if the OCI default set is redefined, or perhaps to work around | ||
differences in how CRI implementations handle capabilities.) | ||
|
||
If such a setting were to be reused for non-root, suddenly the non-root user | ||
would become as powerful as root inside the container. | ||
|
||
Having two separate settings (one for root and one for non-root) also help | ||
clarify what should happen when the container is entered more than once using | ||
different users (for example, using `kubelet exec` into an existing container.) | ||
|
||
It also helps clarify what happens if a non-root user should execute a setuid | ||
binary inside the container (in which case, the existing "capabilities" setting | ||
can still be used to bound how root inside the container should behave.) | ||
|
||
For all these reasons, I propose that introducing a new, separate | ||
"userCapabilities" setting is the best approach. | ||
|
||
## Background | ||
|
||
### Ambient Capabitilies | ||
|
||
Until recently, Linux did not have a good API for setting capabilities and | ||
preserving them as non-root users. When running as non-root, all capabilities | ||
are by default dropped whenever a command is executed using the `execve()` | ||
family of syscalls. | ||
|
||
While the API includes a set of "inheritable" capabilities, those will only be | ||
preserved when the binaries executed are marked with specific corresponding | ||
capabilities bits, which, in practice, never actually happens (much less so in | ||
the context of OCI containers, where there is not even a mechanism to apply or | ||
preserve those bits in binaries inside the containers.) As a result, | ||
"inheritable" capabilities do not work in practice, and there was no good way to | ||
set capabilities for non-root users. | ||
|
||
The newly introduced "ambient" capabilities solve that problem, by copying those | ||
to the "permitted" and "effective" sets across `execve()` by non-root users, | ||
regardless of the binaries being executed having any special attributes set. | ||
|
||
This feature that makes it possible to implement "userCapabilities" is available | ||
starting with Linux kernel 4.3. | ||
|
||
### Capabilities support in runc and libcontainer, history in Docker | ||
|
||
While runc tries to preserve capabilites across changing uids (so, in theory | ||
making it possible to set capabilities for non-root users), the history lack of | ||
support for "ambient" capabilities made it such that somewhat akin to | ||
"userCapabilities" was never really supported. | ||
|
||
libcontainer has code to [preserve capabilities while changing | ||
users](https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/init_linux.go#L141). | ||
Historically, this only included "inheritable" capabilities (but not "ambient" | ||
capabilities), therefore all those settings would essentially just go away | ||
whenever `execve()` was called inside the container. The end effect was that, | ||
even though libcontainer took steps to preserve the capabilities, they were | ||
effectively dropped by the time a binary was executed as a non-root user inside | ||
the container. | ||
|
||
libcontainer also takes [separate capability masks for each capability | ||
set](https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/configs/config.go#L208), | ||
including "ambient" capabilities. We can now leverage that existing support to | ||
implement "userCapabilities" as described here. | ||
|
||
## Design | ||
|
||
For sake of clarity, let's use an example to describe how each capability mask | ||
should be set in order to implement root and non-root capabilities. In this | ||
example, we'll use the [standard set of | ||
capabilities](https://github.com/opencontainers/runc/blob/v1.0.0-rc5/libcontainer/SPEC.md#security) | ||
for the root capabilities mask, which encodes to `00000000a80425fb` in hex, and | ||
we'll use a capabilities mask that only sets `CAP_NET_BIND_SERVICE` for the | ||
non-root user, which encodes to `0000000000000400` in hex. | ||
|
||
### root capabilities | ||
|
||
In order to get the capabilities masks correctly set for the root user, we need | ||
to set them this way: | ||
|
||
``` | ||
CapInh: xxxxxxxxxxxxxxxx <- unimportant, in general | ||
CapPrm: 00000000a80425fb | ||
CapEff: 00000000a80425fb <- the essential setting | ||
CapBnd: 00000000a80425fb | ||
CapAmb: xxxxxxxxxxxxxxxx <- unimportant for root | ||
``` | ||
|
||
The key here is to set the "effective" mask to include the capabilities root | ||
inside the container should have. | ||
|
||
Furthermore, the "permitted" and "bounding" sets should use the same | ||
capabilities, to prevent root inside the container from gaining more | ||
capabilities by executing a setuid binary or a binary with file capabilities. | ||
|
||
The "inheritable" capabilities are unimportant, since they only work with file | ||
capabilities (which we can effectively ignore in general, and especially inside | ||
containers.) | ||
|
||
The "ambient" capabilities are mostly ignored when running as root, so also not | ||
very important in this context. | ||
|
||
### user capabilities | ||
|
||
To get non-root capabilities set correctly, we should set them this way: | ||
|
||
``` | ||
CapInh: xxxxxxxxxxxxxxxx <- unimportant, in general | ||
CapPrm: 00000000a80425fb <- important for setuid binaries | ||
CapEff: 0000000000000400 <- makes a difference before execve(), but not essential | ||
CapBnd: 00000000a80425fb | ||
CapAmb: 0000000000000400 <- the essential setting | ||
``` | ||
|
||
Here, the key is to set the "ambient" capabilities to the ones the non-root user | ||
wants. Whenever `execve()` is called, the "ambient" capabilities will be copied | ||
to the "effective" set, making those capabilities take effect. | ||
|
||
The "permitted" set is also important here, since it is used whenever a setuid | ||
binary is executed by the non-root user inside the container. We still want a | ||
limited root in container in that case, so setting the "permitted" capabilities | ||
correctly ensures this will be the case. | ||
|
||
The "effective" set should ideally be set to the more restrictive permissions of | ||
the non-root user. If runc/libcontainer sets them to the root capabilities | ||
(`00000000a80425fb`), then in effect runc/libcontainer will be running as | ||
non-root but with the *same capabilities as root has*. | ||
|
||
That is not really that big of a problem, since as soon as `execve()` is called | ||
(and one will be called to execute a binary inside the container), the "ambient" | ||
capabilities will prevail. So this only happens during the time when runc is | ||
running after switching to the non-root user, but before it executes a file | ||
inside the container, which is fairly short. | ||
|
||
It might be possible to update runc/libcontainer to fix that, by checking | ||
whether it switched id to non-root and then masking the "effective" capabilities | ||
to be applied by ANDing it with the "ambient" capabilities. | ||
|
||
### Putting both together | ||
|
||
Given that root capabilities control what the "effective", "permitted" and | ||
"bounding" sets should be set to and that the user capabilities control what the | ||
"ambient" capabilities should be set to, we can converge to a single setting | ||
that will work for both cases. | ||
|
||
The "inheritable" capabilities don't really matter much in this context, but | ||
setting them to match the root capabilities is probably fine, since they're only | ||
going in effect when executing a file with file capabilities and that's akin to | ||
a setuid binary, thus using a similar upper bound is OK in that case. | ||
|
||
So the full setting should be: | ||
|
||
``` | ||
CapInh: 00000000a80425fb | ||
CapPrm: 00000000a80425fb <- determines how setuid binaries act | ||
CapEff: 00000000a80425fb <- capabilities for root inside the container | ||
CapBnd: 00000000a80425fb | ||
CapAmb: 0000000000000400 <- capabilities for non-root inside the container | ||
``` | ||
|
||
As explained above, setting the "effective" capabilities to the root | ||
capabilities will make runc/libcontainer still run with privileges even after | ||
switching to a non-root user, but that situation will be fixed as soon as it | ||
executes a binary inside the container. | ||
|
||
## Implementation | ||
|
||
We will extend the CRI protocol to include two sets of capabilities, the current | ||
`Capabilities` and a new field `UserCapabilities`. Both of these will be | ||
available when configuring the "securityContext:" of a pod or container. | ||
|
||
The fields will be passed on to the Runtimes through the CRI, where they will | ||
eventually be decoded into the sets of capabilities (including "ambient" | ||
capabilities) to be passed to runc/libcontainer. | ||
|
||
Most of the code is plumbing to expose this new field in the configs, then pass | ||
it to the Runtime through the CRI, then in the Runtime implementations, using | ||
it to populate the "ambient" capabilities. | ||
|
||
### Kubernetes | ||
|
||
An early WIP for adding the field to Kubernetes can be found | ||
[here](https://github.com/filbranden/kubernetes/commit/e8561087343c81478221a4cd6f8a9cc7e17cf502). | ||
It still needs more checking around the values that can be set here (as it's | ||
currently done for the root capabilities.) | ||
|
||
### containerd/cri | ||
|
||
For containerd/cri, a first step is to update to latest | ||
opencontainers/runtime-tools which gives more granular access to setting each | ||
capability set individually. PR containerd/cri#820 covers that. | ||
|
||
A follow up is to take the newly passed user capabilities and use them to | ||
populate the "ambient" capabilities, which can be done | ||
[here](https://github.com/containerd/cri/blob/v1.0.3/pkg/server/container_create.go#L375). | ||
|
||
### CRI-O | ||
|
||
Similar to containerd, "ambient" capabilities are currently being cleared in | ||
CRI-O, so it's also clear [where in the | ||
code](https://github.com/kubernetes-incubator/cri-o/blob/v1.10.3/server/container_create.go#L502) | ||
we should populate those using the new "userCapabilities". | ||
|
||
### User Interface | ||
|
||
The user will have access to this new feature through a new field in their | ||
"securityContext:" config for a container. | ||
|
||
As an example of a container running netcat as user "nobody" with | ||
`CAP_NET_BIND_SERVICE` to be able to bind to port 80, while also adding | ||
`CAP_SYS_NICE` to root inside container (in case someone uses `kubectl exec` on | ||
it to renice a process): | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: Pod | ||
spec: | ||
containers: | ||
- name: netcat | ||
image: alpine | ||
args: | ||
- /bin/sh | ||
- -c | ||
- "nc -lk -p 80 -e echo hello" | ||
securityContext: | ||
runAsUser: nobody | ||
userCapabilities: | ||
add: | ||
- NET_BIND_SERVICE | ||
capabilities: | ||
add: | ||
- CAP_SYS_NICE | ||
``` | ||
|
||
Note that adding "all" or "ALL" to "userCapabilities:" will have no effect, | ||
since doing otherwise would just amount to turning non-root inside container | ||
into real root. That's too dangerous, so let's just avoid it. Technically, it's | ||
possible to get a non-root user get all capabilities by listing them all | ||
explicitly, but at least that's explicit, so it's easier to catch. | ||
|
||
### Compatibility | ||
|
||
With the fields being passed as protobufs using gRPC, the absence of support for | ||
the field on either side just reverts to the default behavior, which is ignoring | ||
that user capabilities exist and keeping the "ambient" capabilities set to none. | ||
|
||
In effect, if either side lacks support, user capabilities will simply be | ||
silently dropped. | ||
|
||
This is the most secure setting and should not be too hard to troubleshoot given | ||
the failure scenario is likely to make the container fail quickly with a message | ||
that is likely to point to the lack of specific capabilities. |