runc-dmz: Inheritable capabilities are dropped when they previously weren't #4125
Comments
Thanks for your report. @cyphar and @dgl , do you think we should fix it in |
@lifubang I think adding to the ambient set like that could be dangerous. For example a container may have a binary that runs as root and then deliberately drops privileges via Maybe along those lines though a fix could be opting out of |
From the PR you mention:
Note that CVE-2022-24769 did result in the inheritable set generally not being configurable, or at least warnings being generated. I think therefore in my previous comment I would mean "effective and ambient" (not inheritable), but some review from someone familiar with the previous CVE would be good. |
I agree; there is a similar fix for SELinux. Could you submit a patch for this issue?
The grand irony being that runc-dmz exists almost entirely because of concerns about Kubernetes workloads with a lot of container churn. The previous mount-fd stuff was added to placate some Kubernetes e2e tests, and when we removed it because it caused performance issues (and was arguably not really secure) the same Kubernetes e2e test concerns came up. If it were entirely up to me, I'm not sure we would have runc-dmz in the first place (nice though it is, it adds extra complexity that shouldn't be necessary in general).
In fact, there was a Docker CVE related to ambient capabilities back in the day (CVE-2016-8867) which IIRC boiled down to Docker doing exactly this -- blindly setting ambient caps to the other cap sets (see moby/moby#27610). The only solution that keeps runc-dmz and the old capability behaviour is if we were to move the setting of capabilities to runc-dmz. There are three issues with this, which make this a non-starter IMHO:
If Kubernetes doesn't mind having runc-dmz be disabled (through the mechanism you proposed), I really wonder whether we need runc-dmz at all. We have to disable runc-dmz for older SELinux systems too... @kolyshkin @AkihiroSuda? |
Maybe we should just disable dmz by default and make it opt-in. |
I think |
Unfortunately the capabilities issue applies to many Kubernetes pods: the default list of capabilities from Docker (per this) means that unless users have set capabilities to a more restrictive set (which they really should), many pods will have that set. The ideal way to do this would be to decide whether
I've opened a draft PR #4129, it's a bit ugly, but does work in my basic testing. I'll test it within a Kubernetes cluster and see about cleaning it up (the reuse of |
This would be fundamentally unsafe to do, due to the obvious races. We really can't make security decisions based on container state the container itself controls. In addition, this wouldn't work for I also wouldn't be super happy with the idea of emulating the Linux capability system in order to determine whether
We unfortunately can't do the check that late because you need to have already done the The only choice I can see for making this automatic is to look at the configured capability sets and see if But I think making it opt-in is probably going to be less of a headache.
Right, they were known at the time, but at the time I assumed the SELinux issue was mostly hypothetical (an SELinux policy could be really annoying and block this, not that the current RHEL policy did block it at the time) and for capabilities I assumed everyone had enabled ambient capability support 4-5 years ago. I would never have guessed that Kubernetes still doesn't use them. |
I think even with ambient capability support this is a bit tricky to do right, as a username is (potentially) passed all the way through and ambient capabilities have no observable effect if runc is doing the execve as root (so it is safe to use If the optimization isn't going to work for that case, then I agree, maybe opt-in is best. (If only Docker had decided to resolve |
Luckily, in runc we only accept uids and gids, as opposed to usernames†. So we could in principle just check if the
I am leaning towards making it opt-in, though. Let me think about it for a bit...
† Despite how the code actually looks, the story behind that goes back to the original pre-runc Docker implementation of libcontainer. See
// TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
User: fmt.Sprintf("%d:%d", p.User.UID, p.User.GID),
So, the plan here is to change runc-dmz to be opt-in at compilation time?
Can be just a run-time CLI flag IIUC |
We can make it so that |
dmz is now disabled by default, so I guess we can close this |
Description
runc-dmz results in a change in capabilities behaviour for non-root users. Previously, if a binary had file capabilities it would inherit those, provided it was the first execve in the container. It turns out this worked the way many people wanted, even if they didn't intend it, as the service running in the container would get the ability to bind low ports.
This happens when ambient capabilities aren't used. Note that Kubernetes does not currently set ambient capabilities (there is a KEP for this: kubernetes/enhancements#2763), but this is still a change in observable runc behaviour.
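To make the role of ambient capabilities concrete, here is a minimal sketch of the execve capability transformation as documented in capabilities(7), applied to a non-root process that execs an intermediate binary with no file capabilities (the position runc-dmz occupies). The names `CapSet`, `ProcCaps`, `FileCaps` and `Execve` are hypothetical and exist only for this illustration; they are not runc or kernel APIs. The model deliberately ignores the special cases for root, set-user-ID binaries, securebits and no_new_privs, so it should be read as an approximation of the rules rather than a statement of what runc does.

```go
package main

import "fmt"

// CapSet is a bitmask with one bit per capability number (illustrative type).
type CapSet uint64

// CapNetBindService is bit 10, the number of CAP_NET_BIND_SERVICE.
const CapNetBindService CapSet = 1 << 10

// ProcCaps models the five per-thread capability sets.
type ProcCaps struct {
	Permitted, Inheritable, Effective, Bounding, Ambient CapSet
}

// FileCaps models the capabilities attached to an executable file.
type FileCaps struct {
	Permitted, Inheritable CapSet
	Effective              bool // the single file "effective" bit
}

// hasCaps reports whether the file carries any capabilities. Simplification:
// set-user-ID and set-group-ID bits, which also clear the ambient set, are
// not modelled here.
func (f FileCaps) hasCaps() bool {
	return f.Permitted != 0 || f.Inheritable != 0 || f.Effective
}

// Execve applies the capabilities(7) transformation for a non-root process.
// Assumption: no root special-casing, no securebits, no no_new_privs handling.
func Execve(p ProcCaps, f FileCaps) ProcCaps {
	var n ProcCaps
	// The ambient set survives only when the file has no capabilities.
	if !f.hasCaps() {
		n.Ambient = p.Ambient
	}
	n.Permitted = (p.Inheritable & f.Inheritable) | (f.Permitted & p.Bounding) | n.Ambient
	if f.Effective {
		n.Effective = n.Permitted
	} else {
		n.Effective = n.Ambient
	}
	// The inheritable and bounding sets pass through unchanged.
	n.Inheritable = p.Inheritable
	n.Bounding = p.Bounding
	return n
}

func main() {
	helper := FileCaps{} // an intermediate binary with no file capabilities

	// Non-root process configured without ambient or inheritable capabilities.
	start := ProcCaps{
		Permitted: CapNetBindService,
		Effective: CapNetBindService,
		Bounding:  CapNetBindService,
	}
	after := Execve(start, helper)
	fmt.Printf("no ambient:   permitted=%#x effective=%#x\n", after.Permitted, after.Effective)

	// Same process, but with the capability also inheritable and ambient
	// (an ambient capability must be both permitted and inheritable).
	start.Inheritable = CapNetBindService
	start.Ambient = CapNetBindService
	after = Execve(start, helper)
	fmt.Printf("with ambient: permitted=%#x effective=%#x\n", after.Permitted, after.Effective)
}
```

In this model the permitted and effective sets come out empty after the intermediate exec unless the capability is in the ambient set, which is why an extra exec step is observable only when ambient capabilities aren't used.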
Steps to reproduce the issue
I spotted this on a Kubernetes cluster using runc from main, as CoreDNS wasn't starting successfully (CoreDNS >= v1.11 runs as non-root; that version ships with Kubernetes 1.29 or greater, depending on exactly how the cluster is created).
One way to do that is:
ARG RUNC_VERSION="main"
)make quick
in the directory)
kind build node-image ~/Code/kubernetes --image kindest/node:runc-main --base-image=gcr.io/k8s-staging-kind/base:v20231124-6a461ab5-dirty
However it can be reduced to running a runc container where the args point to something with setcap.
$ runc spec
[edit config.json to look something like this at the top:
(Basically run nc.openbsd attempting to listen on a <1024 port but drop the ambient capabilities.)
Make sure that file exists (netcat-openbsd in a Debian/Ubuntu rootfs works) and has file capabilities:
Describe the results you received and expected
Expected: the binary runs and is able to listen on a <1024 port. Received: CoreDNS (or another binary) instead gives permission denied on bind:
What version of runc are you using?
Host OS information
Host kernel information
Linux 6.5.0