
Kubernetes should configure the ambient capability set #56374

Open
danderson opened this issue Nov 26, 2017 · 30 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/security Categorizes an issue or PR as relevant to SIG Security.

Comments

@danderson

danderson commented Nov 26, 2017

/kind bug

What happened:

The following takes place on a k8s 1.8.2 cluster.

I have a Docker container image that wants to listen on :80, and specifies a non-root USER. To get this running, in my pod spec the container has the following security context:

securityContext:
    capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
    allowPrivilegeEscalation: false

When I schedule this pod on the cluster, the container fails to bind to :80 (permission denied), and goes into a crashloop. Note that Kubernetes did not complain that this configuration is in any way infeasible.

The reason for this is that Linux capabilities interact in surprising ways with other security mechanisms. In this case, the problem is that I'm also running the container as a non-root user, and Kubernetes/Docker are only setting the inherited, permitted, effective and bounding capability sets. The catch is: the effective and permitted sets get cleared when you transition from UID 0 to UID !0, so my container ends up with:

CapInh:	0000000000000400
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000000000000400
CapAmb:	0000000000000000
NoNewPrivs:	1

0x400 is CAP_NET_BIND_SERVICE, and as you can see my effective capabilities do not have this bit set.

The Linux kernel corrected this very confusing behavior by introducing the ambient capability set, which behaves predictably when you transition from UID 0 to !0: if you've set a capability as ambient, you keep it unless you explicitly revoke it.
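To make the mask arithmetic concrete, here is a small sketch (assuming bash) that decodes the values from the /proc status dump above. CAP_NET_BIND_SERVICE is capability number 10 in linux/capability.h, so its bit is 1 << 10 = 0x400:

```shell
#!/usr/bin/env bash
# CAP_NET_BIND_SERVICE is capability number 10, so its bit in the
# /proc/<pid>/status capability masks is 1 << 10 = 0x400.
cap_net_bind_service=$(( 1 << 10 ))
printf 'CAP_NET_BIND_SERVICE bit: 0x%x\n' "$cap_net_bind_service"

# Values reported by the container above:
cap_eff=0x0000000000000000   # CapEff: cleared on the UID 0 -> !0 transition
cap_bnd=0x0000000000000400   # CapBnd: still contains the bit

# The bind(2) permission check consults the effective set, not the bounding set:
if (( cap_eff & cap_net_bind_service )); then
  echo "effective: yes"
else
  echo "effective: no"   # this is why the bind to :80 fails
fi
```

The bounding set still carrying the bit is what makes the setcap workaround possible: the capability is merely latent, not gone.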

What you expected to happen:

I expect the capabilities I assign in my podspec to still exist when my main binary execs, regardless of other security context configuration (assuming k8s accepted my manifest as valid). To me, that translates to: k8s should be writing the caps described by securityContext.capabilities into the ambient capability set, as well as the other capability sets.

Alternatively, if you believe the current behavior of securityContext.capabilities is working as intended, there should be another knob somewhere that I can use to populate the ambient capability set. However, I would strongly encourage you to instead consider the current behavior of securityContext.capabilities combined with non-root users as a bug, because it will likely trip up ~everyone using it unless they know a lot about the linux capability implementation.

How to reproduce it (as minimally and precisely as possible):

Deploy this pod to a cluster using the default container runtime. You should see it crashlooping, with kubectl logs bug-demo showing that netcat is not allowed to bind to :80. If you comment out runAsUser and let the container binary run as root, it'll work fine. Similarly, if you modify the container to use a binary that has been altered with setcap net_bind_service=+ep, the container will run correctly as !root, because the setcap'd binary allows the container to regain the privileges it lost when transitioning out of UID 0.

apiVersion: v1
kind: Pod
metadata:
  name: bug-demo
spec:
  containers:
  - name: netcat
    image: danderson/bug-demo:latest
    args:
    - /bin/sh
    - -c
    - "netcat -l -p 80"
    securityContext:
      runAsUser: 65534
      capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
      allowPrivilegeEscalation: false
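For reference, the setcap workaround mentioned above looks roughly like this (a sketch: the binary path is illustrative, not taken from the image, and both commands need root at image-build time):

```shell
# Sketch only: /usr/bin/netcat is an illustrative path.
# Grant the file the capability in its permitted+effective file capability
# sets, so a non-root process exec'ing it regains CAP_NET_BIND_SERVICE
# (from the bounding set) despite the UID transition.
setcap 'cap_net_bind_service=+ep' /usr/bin/netcat

# Verify the file capability was recorded:
getcap /usr/bin/netcat
```

Note that gaining file capabilities on exec counts as privilege escalation, so this route is blocked once allowPrivilegeEscalation: false (no_new_privs) is set.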

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.2 server, 1.8.1 kubectl
  • Cloud provider or hardware configuration: bare metal cluster, single node (master taint removed), set up with kubeadm.
  • OS (e.g. from /etc/os-release): Debian Testing
  • Kernel (e.g. uname -a): Linux pandora 4.12.0-2-amd64 #1 SMP Debian 4.12.13-1 (2017-09-19) x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2017
@danderson
Author

Taking a guess that this is the domain of the node SIG...

@kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Nov 26, 2017
@k8s-ci-robot
Contributor

@danderson: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

Taking a guess that this is the domain of the node SIG...

@kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2017
@danderson
Author

Ah, late-breaking news from Slack is that I failed to remember the surprising way in which Linux capabilities work. Because the ambient capabilities mask is not set, switching away from the root user clears all capabilities, and the binary in the container image must have the capability bits set via setcap to regain those capabilities.

So, I guess this bug could also be a feature request: please set the ambient capability bits to match the container securityContext, so that Kubernetes can grant images extra capabilities without having to respin the container image :)

@tallclair
Member

and the binary in the container image must have the capability bits set via setcap to regain those capabilities.

Note that that will not work with allowPrivilegeEscalation: false.

Adding APIs for adjusting the ambient set makes sense to me. @danderson Can you modify the issue description / title to better reflect this?

@danderson danderson changed the title Specified capabilities don't stick with docker + non-root container user Kubernetes should configure the ambient capability set Dec 14, 2017
@danderson
Author

Updated issue title, and the top-level description with what I understand to be the issue, and my humble suggestion for how k8s should resolve it.

danderson added a commit to metallb/metallb that referenced this issue Dec 18, 2017
The speaker pod is more privileged, because of ARP: it needs access
to the host networking stack, and it needs CAP_NET_RAW to receive and
transmit ARP.

Annoyingly, because of kubernetes/kubernetes#56374, we can't run the
container as non-root but still grant it net_raw. So the closest we
can get to minimum necessary privileges is to run as root, but drop
all caps except net_raw, and disallow privilege escalation so that
the container cannot regain other privileges.
@php-coder
Contributor

php-coder commented Jan 18, 2018

Is it really Kubernetes issue? AFAIU it's default Docker behavior that was fixed once and rolled back later (see moby/moby#8460 and moby/moby#26979).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2018
@johngmyers
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2018
@filbranden
Contributor

So I really think we should do this one, but I think reusing capabilities: for that can be pretty dangerous.

We're currently using capabilities: to weaken root inside the container.

But if those same capabilities are applied to a non-root user, they will actually empower it to become essentially what root in the container is...

On the other hand, it's hard to tell whether the process launched inside the container will be root or not (for instance, it can be set by USER in a Dockerfile, and you can also have a main process in the container and then kubectl exec into it later as a different user).

So what I think we should do here is introduce a separate userCapabilities: that could be used to raise the normal capabilities of a non-root user, keeping the original capabilities: to define what root looks like in the container...

The trouble with that is that we need to plumb this through the whole ecosystem, all the way to runc/libcontainer. And we probably need Docker to use the same concepts on their side as well.

Having said that, I think it's a really useful feature and we should get it going. I'll try to bring this up in a SIG Node meeting.

Cheers,
Filipe

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2018
@yujuhong yujuhong added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2018
mkmik pushed a commit to mkmik/udig that referenced this issue Nov 30, 2018
Since k8s drops ambient capabilities when switching uids.

See kubernetes/kubernetes#56374
@ehashman
Member

#56374 (comment) has a good summary, and all the linked issues are now closed.

@mrunalp @Random-Liu @tianon PTAL -- should we try to figure out how we want to move forward on this one?

/remove-kind bug
/kind feature

I think this is effectively a feature request... Kubernetes isn't doing anything wrong, per se, it's just confusing how all the duct tape and layers of system work together.

There is a separate and related discussion about default sysctls like net.ipv4.ip_unprivileged_port_start, although I'm not sure if that addresses the issue of ambient capabilities.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 24, 2021
@BenTheElder
Member

FYI, there is now a KEP covering such a feature: kubernetes/enhancements#2763, thanks to @vinayakankugoyal (kubernetes/enhancements#2757)

@tianon

tianon commented Jun 26, 2021

@ehashman in my experience, NET_BIND_SERVICE is the most (only?) compelling use case for ambient capabilities (and the referenced KEP appears also to argue for that as the use case), and it's solved much more cleanly (IMO) by net.ipv4.ip_unprivileged_port_start=0, which Docker now sets appropriately by default for network namespaces it creates (moby/moby#41030) -- the downside is requiring a slightly newer kernel (4.11+ IIRC), but that's not exactly a "new" version anymore (2017)
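For the plain-Docker case, the sysctl route described here looks like this (a sketch; the image, user, and command are illustrative):

```shell
# Lower the privileged-port floor to 0 for this container's network namespace,
# so an unprivileged user can bind :80 without needing any capability at all.
# Requires kernel 4.11+; newer Docker releases set this by default for the
# network namespaces they create (moby/moby#41030).
docker run --sysctl net.ipv4.ip_unprivileged_port_start=0 \
  --user 65534 some-image nc -l -p 80
```

This sidesteps the capability machinery entirely, which is why it avoids the ambient-set problem.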

@BenTheElder
Member

@tianon @thockin has an issue for setting that in #102612

@pacoxu
Member

pacoxu commented Sep 28, 2021

#105309 is an example. We can use pod securityContext.sysctls instead of container securityContext.capabilities(NET_BIND_SERVICE).

      securityContext:
        sysctls:
        - name: net.ipv4.ip_unprivileged_port_start
          value: "53"

But kubelet only supports this as of 1.22, and the cluster/apiserver should stay compatible with earlier releases like n-1 or n-2.
I am not sure when we can move on to using securityContext.sysctls. 1.23 or 1.24?

@liggitt
Member

liggitt commented Sep 28, 2021

supported skew is n-2, so as of the 1.24 timeframe, all supported nodes running against a 1.24 cluster would support this

@joebowbeer

Is it true that the following securityContext, which is allowed by the restricted Pod Security Standards (PSS) policy, is not achievable?

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
    add:
    - NET_BIND_SERVICE
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault

If so, then should the information about NET_BIND_SERVICE be dropped from the restricted PSS spec?

Containers must drop ALL capabilities, and are only permitted to add back the NET_BIND_SERVICE capability.

novad03 pushed a commit to novad03/k8s-meta that referenced this issue Nov 25, 2023
The speaker pod is more privileged, because of ARP: it needs access
to the host networking stack, and it needs CAP_NET_RAW to receive and
transmit ARP.

Annoyingly, because of kubernetes/kubernetes#56374, we can't run the
container as non-root but still grant it net_raw. So the closest we
can get to minimum necessary privileges is to run as root, but drop
all caps except net_raw, and disallow privilege escalation so that
the container cannot regain other privileges.
yaraskm added a commit to yaraskm/pretalx-docker that referenced this issue Apr 3, 2024
- This PR will resolve issue pretalx#55 by allowing admins to configure
gunicorn to listen on any non-privileged port (>1024). Listening on
privileged ports can cause issues on hardened Kubernetes clusters and
runtimes such as Podman, as discussed in a [related Kubernetes issue](kubernetes/kubernetes#56374)
rixx pushed a commit to pretalx/pretalx-docker that referenced this issue Apr 4, 2024
- This PR will resolve issue #55 by allowing admins to configure
gunicorn to listen on any non-privileged port (>1024). Listening on
privileged ports can cause issues on hardened Kubernetes clusters and
runtimes such as Podman, as discussed in a [related Kubernetes issue](kubernetes/kubernetes#56374)