Kubernetes should configure the ambient capability set #56374
Comments
Taking a guess that this is the domain of the node SIG... @kubernetes/sig-node-bugs
Ah, late-breaking news from Slack is that I failed to remember the surprising way in which Linux capabilities work. Because the ambient capability set is not populated, switching away from the root user clears all capabilities, and the binary in the container image must have the capability bits set via `setcap` to get them back. So, I guess this bug could also be a feature request: please set the ambient capability bits to match the container securityContext, so that Kubernetes can grant images extra capabilities without having to respin the container image :)
Note that that will not work with `allowPrivilegeEscalation: false`, since file capabilities cannot take effect once no-new-privileges is set. Adding APIs for adjusting the ambient set makes sense to me. @danderson Can you modify the issue description / title to better reflect this?
Updated issue title, and the top-level description with what I understand to be the issue, and my humble suggestion for how k8s should resolve it.
The speaker pod is more privileged, because of ARP: it needs access to the host networking stack, and it needs CAP_NET_RAW to receive and transmit ARP. Annoyingly, because of kubernetes/kubernetes#56374, we can't run the container as non-root but still grant it net_raw. So the closest we can get to minimum necessary privileges is to run as root, but drop all caps except net_raw, and disallow privilege escalation so that the container cannot regain other privileges.
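The "run as root, drop all caps except net_raw, disallow privilege escalation" arrangement described above can be sketched as a container securityContext (a sketch based on this description, not MetalLB's actual manifest):

```yaml
# Sketch: the closest-to-least-privilege setup described above.
# Run as root (UID 0) so the granted capabilities survive, drop
# everything except NET_RAW, and forbid regaining privileges.
securityContext:
  runAsUser: 0
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - NET_RAW
```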
Is it really a Kubernetes issue? AFAIU it's default Docker behavior that was fixed once and rolled back later (see moby/moby#8460 and moby/moby#26979).
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
So I really think we should do this one, but I think reusing `securityContext.capabilities` for this is problematic.

We're currently using those capabilities to populate the bounding, permitted, effective and inheritable sets, which is safe when the container runs as root. But if those same capabilities are applied (as ambient) to a non-root user, they will actually empower it to become essentially what root in the container is...

On the other hand, it's hard to tell whether the process launched inside the container will be root or not (for instance, set by `USER` in the container image). So what I think we should do here is introduce a separate field for ambient capabilities in the securityContext.

The trouble with that is that we need to plumb this through the whole ecosystem, all the way to runc/libcontainer. And we probably need Docker to use the same concepts on their side as well.

Having said that, I think it's a really useful feature and we should get it going. I'll try to bring this up in a SIG Node meeting.

Cheers,
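A separate knob of the kind proposed above might look something like this (purely illustrative: `ambient` is a hypothetical field name, not an actual Kubernetes API):

```yaml
# Hypothetical sketch only: "ambient" is NOT a real Kubernetes field.
# It illustrates a separate knob for the ambient set, distinct from
# the existing capabilities list.
securityContext:
  runAsUser: 1000
  capabilities:
    add:
      - NET_BIND_SERVICE   # permitted/effective/bounding, as today
    ambient:
      - NET_BIND_SERVICE   # would additionally be placed in the ambient set
```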
Since k8s does not set the ambient capability set, capabilities are dropped when switching UIDs. See kubernetes/kubernetes#56374
#56374 (comment) has a good summary, and all the linked issues are now closed. @mrunalp @Random-Liu @tianon PTAL -- should we try to figure out how we want to move forward on this one?

/remove-kind bug

I think this is effectively a feature request... Kubernetes isn't doing anything wrong, per se; it's just confusing how all the duct tape and layers of the system work together. There is a separate and related discussion about default sysctls like `net.ipv4.ip_unprivileged_port_start`.
FYI, there is a KEP now to cover such a feature (kubernetes/enhancements#2763, kubernetes/enhancements#2757), thanks to @vinayakankugoyal.
@ehashman in my experience,
#105309 is an example. We can use the pod-level `securityContext.sysctls` instead of the container-level `securityContext.capabilities` (`NET_BIND_SERVICE`).
But kubelet only supports it since 1.22, and the cluster/apiserver should stay compatible with earlier releases like n-1 or n-2.
supported skew is n-2, so as of the 1.24 timeframe, all supported nodes running against a 1.24 cluster would support this |
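The sysctl alternative discussed above can be sketched as follows (a sketch; `net.ipv4.ip_unprivileged_port_start` became a safe namespaced sysctl in Kubernetes 1.22):

```yaml
# Sketch: let a non-root container bind to port 80 without any
# capability, by lowering the unprivileged-port threshold for the
# pod's network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: low-port-demo
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_unprivileged_port_start
        value: "0"
  containers:
    - name: web
      image: nginx
      securityContext:
        runAsUser: 1000
        allowPrivilegeEscalation: false
```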
Is it true that the following securityContext now allows a non-root container to bind to privileged ports?

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
```

If so, then should the information about NET_BIND_SERVICE be dropped from the restricted PSS spec?
- This PR will resolve issue pretalx#55 by allowing admins to configure gunicorn to listen on any non-privileged port (>1024). Listening on privileged ports can cause issues on hardened Kubernetes clusters and runtimes such as Podman, as discussed in a [related Kubernetes issue](kubernetes/kubernetes#56374)
What happened:
The following takes place on a k8s 1.8.2 cluster.
I have a Docker container image that wants to listen on :80, and specifies a non-root USER. To get this running, in my pod spec the container has the following security context:
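A sketch of such a securityContext, with assumed values (the UID is illustrative; the capability and port come from the surrounding description):

```yaml
# Sketch with assumed values: grant NET_BIND_SERVICE and run as a
# non-root UID, matching the failure described below.
securityContext:
  runAsUser: 1000        # any non-zero UID reproduces the problem
  capabilities:
    add:
      - NET_BIND_SERVICE
```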
When I schedule this pod on the cluster, the container fails to bind to :80 (permission denied), and goes into a crashloop. Note that Kubernetes did not complain that this configuration is in any way infeasible.
The reason for this is that Linux capabilities interact in surprising ways with other security mechanisms. In this case, the problem is that I'm also running the container as a non-root user, and Kubernetes/Docker only set the inherited, permitted, effective and bounding capability sets. The catch is: the effective and permitted sets get cleared when you transition from UID 0 to UID !0, so my container ends up without them.
0x400 is CAP_NET_BIND_SERVICE, and the effective capability set (CapEff in /proc/self/status) does not have this bit set.
The Linux kernel corrected this very confusing behavior by introducing the ambient capability set, which does not have surprising behaviors when you transition from UID 0 to !0. If you've set a capability as ambient, you keep it unless you explicitly revoke it.
What you expected to happen:
I expect the capabilities I assign in my podspec to still exist when my main binary `exec`s, regardless of other security context configuration (assuming k8s accepted my manifest as valid). To me, that translates to: k8s should be writing the caps described by `securityContext.capabilities` into the ambient capability set, as well as the other capability sets.

Alternatively, if you believe the current behavior of `securityContext.capabilities` is working as intended, there should be another knob somewhere that I can use to populate the ambient capability set. However, I would strongly encourage you to instead consider the current behavior of `securityContext.capabilities` combined with non-root users as a bug, because it will likely trip up ~everyone using it unless they know a lot about the linux capability implementation.

How to reproduce it (as minimally and precisely as possible):
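A minimal pod consistent with the reproduction steps below (a sketch with assumed image and UID; the original manifest isn't shown here):

```yaml
# Sketch of the repro pod: netcat listening on :80 as a non-root
# user with NET_BIND_SERVICE granted via securityContext.
apiVersion: v1
kind: Pod
metadata:
  name: bug-demo
spec:
  containers:
    - name: bug-demo
      image: alpine        # assumed image providing nc
      command: ["nc", "-l", "-p", "80"]
      securityContext:
        runAsUser: 1000    # assumed non-zero UID
        capabilities:
          add:
            - NET_BIND_SERVICE
```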
Deploy this pod to a cluster using the default container runtime. You should see it crashlooping, with `kubectl logs bug-demo` showing that netcat is not allowed to bind to :80. If you comment out `runAsUser` and let the container binary run as root, it'll work fine. Similarly, if you modify the container to have a binary that has been altered with `setcap net_bind_service=+ep`, the container will run correctly as !root, because the setcap'd binary allows the container to regain the privileges it lost when transitioning out of UID 0.

Anything else we need to know?:
Environment:
- Kubernetes version (`kubectl version`): 1.8.2 server, 1.8.1 kubectl
- Install tools: `kubeadm`
- Kernel (`uname -a`): Linux pandora 4.12.0-2-amd64 #1 SMP Debian 4.12.13-1 (2017-09-19) x86_64 GNU/Linux