
Kubernetes should configure the ambient capability set #56374

Open
danderson opened this issue Nov 26, 2017 · 30 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/security Categorizes an issue or PR as relevant to SIG Security.

Comments

@danderson

danderson commented Nov 26, 2017

/kind bug

What happened:

The following takes place on a k8s 1.8.2 cluster.

I have a Docker container image that wants to listen on :80, and specifies a non-root USER. To get this running, in my pod spec the container has the following security context:

securityContext:
    capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
    allowPrivilegeEscalation: false

When I schedule this pod on the cluster, the container fails to bind to :80 (permission denied), and goes into a crashloop. Note that Kubernetes did not complain that this configuration is in any way infeasible.

The reason for this is that Linux capabilities interact in surprising ways with other security mechanisms. In this case, the problem is that I'm also running the container as a non-root user, and Kubernetes/Docker are only setting the inherited, permitted, effective and bounding capability sets. The catch is: the effective and permitted sets get cleared when you transition from UID 0 to UID !0, so my container ends up with:

CapInh:	0000000000000400
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000000000000400
CapAmb:	0000000000000000
NoNewPrivs:	1

0x400 is CAP_NET_BIND_SERVICE, and as you can see my effective capabilities do not have this bit set.

The Linux kernel corrected this very confusing behavior by introducing the ambient capability set, which behaves predictably when you transition from UID 0 to !0: if you've set a capability as ambient, you keep it unless you explicitly revoke it.
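To make the mask arithmetic concrete, here is a small sketch (assuming bash) that decodes the values from the /proc status dump above. CAP_NET_BIND_SERVICE is capability number 10 in linux/capability.h, so its bit is 1 << 10 = 0x400:

```shell
#!/usr/bin/env bash
# CAP_NET_BIND_SERVICE is capability number 10, so its bit in the
# /proc/<pid>/status capability masks is 1 << 10 = 0x400.
cap_net_bind_service=$(( 1 << 10 ))
printf 'CAP_NET_BIND_SERVICE bit: 0x%x\n' "$cap_net_bind_service"

# Values reported by the container above:
cap_eff=0x0000000000000000   # CapEff: cleared on the UID 0 -> !0 transition
cap_bnd=0x0000000000000400   # CapBnd: still contains the bit

# The bind(2) permission check consults the effective set, not the bounding set:
if (( cap_eff & cap_net_bind_service )); then
  echo "effective: yes"
else
  echo "effective: no"   # this is why the bind to :80 fails
fi
```

The bounding set still carrying the bit is what makes the setcap workaround possible: the capability is merely latent, not gone.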

What you expected to happen:

I expect the capabilities I assign in my podspec to still exist when my main binary execs, regardless of other security context configuration (assuming k8s accepted my manifest as valid). To me, that translates to: k8s should be writing the caps described by securityContext.capabilities into the ambient capability set, as well as the other capability sets.

Alternatively, if you believe the current behavior of securityContext.capabilities is working as intended, there should be another knob somewhere that I can use to populate the ambient capability set. However, I would strongly encourage you to instead consider the current behavior of securityContext.capabilities combined with non-root users as a bug, because it will likely trip up ~everyone using it unless they know a lot about the linux capability implementation.

How to reproduce it (as minimally and precisely as possible):

Deploy this pod to a cluster using the default container runtime. You should see it crashlooping, with kubectl logs bug-demo showing that netcat is not allowed to bind to :80. If you comment out runAsUser and let the container binary run as root, it'll work fine. Similarly, if you modify the container to use a binary that has been altered with setcap net_bind_service=+ep, the container will run correctly as !root, because the setcap'd binary allows the container to regain the privileges it lost when transitioning out of UID 0.

apiVersion: v1
kind: Pod
metadata:
  name: bug-demo
spec:
  containers:
  - name: netcat
    image: danderson/bug-demo:latest
    args:
    - /bin/sh
    - -c
    - "netcat -l -p 80"
    securityContext:
      runAsUser: 65534
      capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
      allowPrivilegeEscalation: false
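For reference, the setcap workaround mentioned above looks roughly like this (a sketch: the binary path is illustrative, not taken from the image, and both commands need root at image-build time):

```shell
# Sketch only: /usr/bin/netcat is an illustrative path.
# Grant the file the capability in its permitted+effective file capability
# sets, so a non-root process exec'ing it regains CAP_NET_BIND_SERVICE
# (from the bounding set) despite the UID transition.
setcap 'cap_net_bind_service=+ep' /usr/bin/netcat

# Verify the file capability was recorded:
getcap /usr/bin/netcat
```

Note that gaining file capabilities on exec counts as privilege escalation, so this route is blocked once allowPrivilegeEscalation: false (no_new_privs) is set.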

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.2 server, 1.8.1 kubectl
  • Cloud provider or hardware configuration: bare metal cluster, single node (master taint removed), set up with kubeadm.
  • OS (e.g. from /etc/os-release): Debian Testing
  • Kernel (e.g. uname -a): Linux pandora 4.12.0-2-amd64 #1 SMP Debian 4.12.13-1 (2017-09-19) x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2017
@danderson
Author

Taking a guess that this is the domain of the node SIG...

@kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Nov 26, 2017
@k8s-ci-robot
Contributor

@danderson: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

Taking a guess that this is the domain of the node SIG...

@kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2017
@danderson
Author

Ah, late-breaking news from Slack is that I failed to remember the surprising way in which Linux capabilities work. Because the ambient capabilities mask is not set, switching away from the root user clears all capabilities, and the binary in the container image must have the capability bits set via setcap to regain those capabilities.

So, I guess this bug could also be a feature request: please set the ambient capability bits to match the container securityContext, so that Kubernetes can grant images extra capabilities without having to respin the container image :)

@tallclair
Member

and the binary in the container image must have the capability bits set via setcap to regain those capabilities.

Note that that will not work with allowPrivilegeEscalation: false.

Adding APIs for adjusting the ambient set makes sense to me. @danderson Can you modify the issue description / title to better reflect this?

@danderson danderson changed the title Specified capabilities don't stick with docker + non-root container user Kubernetes should configure the ambient capability set Dec 14, 2017
@danderson
Author

Updated issue title, and the top-level description with what I understand to be the issue, and my humble suggestion for how k8s should resolve it.

danderson added a commit to metallb/metallb that referenced this issue Dec 18, 2017
The speaker pod is more privileged, because of ARP: it needs access
to the host networking stack, and it needs CAP_NET_RAW to receive and
transmit ARP.

Annoyingly, because of kubernetes/kubernetes#56374, we can't run the
container as non-root but still grant it net_raw. So the closest we
can get to minimum necessary privileges is to run as root, but drop
all caps except net_raw, and disallow privilege escalation so that
the container cannot regain other privileges.
@php-coder
Contributor

php-coder commented Jan 18, 2018

Is it really Kubernetes issue? AFAIU it's default Docker behavior that was fixed once and rolled back later (see moby/moby#8460 and moby/moby#26979).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2018
@johngmyers
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2018
@filbranden
Contributor

So I really think we should do this one, but I think reusing capabilities: for that can be pretty dangerous.

We're currently using capabilities: to weaken root inside the container.

But if those same capabilities are applied to a non-root user, they will actually empower it to become essentially what root in the container is...

On the other hand, it's hard to tell whether the process launched inside the container will be root or not (for instance, it can be set by USER in a Dockerfile, and you can also have a main process in the container and then kubectl exec into it later as a different user).

So what I think we should do here is introduce a separate userCapabilities: that could be used to raise the normal capabilities of a non-root user, keeping the original capabilities: to define what root looks like in the container...

The trouble with that is that we need to plumb this through the whole ecosystem, all the way to runc/libcontainer. And we probably need Docker to use the same concepts on their side as well.

Having said that, I think it's a really useful feature and we should get it going. I'll try to bring this up in a SIG Node meeting.

Cheers,
Filipe

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2018
@yujuhong yujuhong added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2018
mkmik pushed a commit to mkmik/udig that referenced this issue Nov 30, 2018
Since k8s drops ambient capabilities when switching uids.

See kubernetes/kubernetes#56374
@ehashman
Member

#56374 (comment) has a good summary, and all the linked issues are now closed.

@mrunalp @Random-Liu @tianon PTAL -- should we try to figure out how we want to move forward on this one?

/remove-kind bug
/kind feature

I think this is effectively a feature request... Kubernetes isn't doing anything wrong, per se, it's just confusing how all the duct tape and layers of system work together.

There is a separate and related discussion about default sysctls like net.ipv4.ip_unprivileged_port_start, although I'm not sure if that addresses the issue of ambient capabilities.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 24, 2021
@BenTheElder
Member

FYI, there is now a KEP covering such a feature: kubernetes/enhancements#2763, thanks to @vinayakankugoyal (kubernetes/enhancements#2757)

@tianon

tianon commented Jun 26, 2021

@ehashman in my experience, NET_BIND_SERVICE is the most (only?) compelling use case for ambient capabilities (and the referenced KEP appears also to argue for that as the use case), and it's solved much more cleanly (IMO) by net.ipv4.ip_unprivileged_port_start=0, which Docker now sets appropriately by default for network namespaces it creates (moby/moby#41030) -- the downside is requiring a slightly newer kernel (4.11+ IIRC), but that's not exactly a "new" version anymore (2017)
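For the plain-Docker case, the sysctl route described here looks like this (a sketch; the image, user, and command are illustrative):

```shell
# Lower the privileged-port floor to 0 for this container's network namespace,
# so an unprivileged user can bind :80 without needing any capability at all.
# Requires kernel 4.11+; newer Docker releases set this by default for the
# network namespaces they create (moby/moby#41030).
docker run --sysctl net.ipv4.ip_unprivileged_port_start=0 \
  --user 65534 some-image nc -l -p 80
```

This sidesteps the capability machinery entirely, which is why it avoids the ambient-set problem.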

@BenTheElder
Member

@tianon @thockin has an issue for setting that in #102612

@pacoxu
Member

pacoxu commented Sep 28, 2021

#105309 is an example. We can use pod securityContext.sysctls instead of container securityContext.capabilities(NET_BIND_SERVICE).

      securityContext:
        sysctls:
        - name: net.ipv4.ip_unprivileged_port_start
          value: "53"

But kubelet only supports this as of 1.22, and the cluster/apiserver should stay compatible with earlier releases like n-1 or n-2.
I am not sure when we can move on to using securityContext.sysctls. 1.23 or 1.24?

@liggitt
Member

liggitt commented Sep 28, 2021

supported skew is n-2, so as of the 1.24 timeframe, all supported nodes running against a 1.24 cluster would support this

@joebowbeer

Is it true that the following securityContext, which is allowed by the restricted Pod Security Standards (PSS) policy, is not achievable?

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
    add:
    - NET_BIND_SERVICE
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault

If so, then should the information about NET_BIND_SERVICE be dropped from the restricted PSS spec?

Containers must drop ALL capabilities, and are only permitted to add back the NET_BIND_SERVICE capability.

novad03 pushed a commit to novad03/k8s-meta that referenced this issue Nov 25, 2023
The speaker pod is more privileged, because of ARP: it needs access
to the host networking stack, and it needs CAP_NET_RAW to receive and
transmit ARP.

Annoyingly, because of kubernetes/kubernetes#56374, we can't run the
container as non-root but still grant it net_raw. So the closest we
can get to minimum necessary privileges is to run as root, but drop
all caps except net_raw, and disallow privilege escalation so that
the container cannot regain other privileges.
yaraskm added a commit to yaraskm/pretalx-docker that referenced this issue Apr 3, 2024
- This PR will resolve issue pretalx#55 by allowing admins to configure
gunicorn to listen on any non-privileged port (>1024). Listening on
privileged ports can cause issues on hardened Kubernetes clusters and
runtimes such as Podman, as discussed in a [related Kubernetes issue](kubernetes/kubernetes#56374)
rixx pushed a commit to pretalx/pretalx-docker that referenced this issue Apr 4, 2024
- This PR will resolve issue #55 by allowing admins to configure
gunicorn to listen on any non-privileged port (>1024). Listening on
privileged ports can cause issues on hardened Kubernetes clusters and
runtimes such as Podman, as discussed in a [related Kubernetes issue](kubernetes/kubernetes#56374)