Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runc-dmz: Inheritable capabilities are dropped when they previously weren't #4125

Closed
Tracked by #4114
dgl opened this issue Nov 24, 2023 · 16 comments
Closed
Tracked by #4114

Comments

@dgl
Copy link

dgl commented Nov 24, 2023

Description

runc-dmz results in a change in capabilities behaviour, for non-root users. Previously if a binary had file capabilities it would inherit those, if it was the first execve in the container. It turns out this worked as many people desired, if they didn't intend, as the service running in the container would get the ability to bind low ports.

This happens when ambient capabilities aren't used. Note Kubernetes does not set ambient capabilities currently, there is a KEP for this: kubernetes/enhancements#2763 but this is a change in observable runc behaviour.

Steps to reproduce the issue

I spotted this on a Kubernetes cluster using runc from main as CoreDNS wasn't starting successfully (CoreDNS >= v1.11 runs as non-root, which is in Kubernetes 1.29 or greater, depending exactly how the cluster is created).

One way to do that is:

  1. Update kind's base image to use runc from main (edit images/base/Dockerfile ARG RUNC_VERSION="main")
  2. Build that (make quick in the directory)
  3. kind build node-image ~/Code/kubernetes --image kindest/node:runc-main --base-image=gcr.io/k8s-staging-kind/base:v20231124-6a461ab5-dirty
  4. Build a cluster with it:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:runc-main

However it can be reduced to running a runc container where the args point to something with setcap.

$ runc spec
[edit config.json to look something like this at the top:

{
        "ociVersion": "1.1.0+dev",
        "process": {
                "terminal": true,
                "user": {
                        "uid": 1000,
                        "gid": 1000
                },
                "args": [
                        "/usr/bin/nc.openbsd", "-l", "80"
                ],
                "env": [
                        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                        "TERM=xterm"
                ],
                "cwd": "/",
                "capabilities": {
                        "bounding": [
                                "CAP_NET_BIND_SERVICE"
                        ],
                        "effective": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ],
                        "permitted": [
                                "CAP_AUDIT_WRITE",
                                "CAP_KILL",
                                "CAP_NET_BIND_SERVICE"
                        ]
                },

(Basically run nc.openbsd attempting to listen on a <1024 port but drop the ambient capabilities.)

Make sure that file exists (netcat-openbsd in a Debian/Ubuntu rootfs works) and has file capabilities:

$ sudo setcap cap_net_bind_service=+ep rootfs/usr/bin/nc.openbsd
$ sudo getcap rootfs/usr/bin/nc.openbsd          
rootfs/usr/bin/nc.openbsd cap_net_bind_service=ep

Describe the results you received and expected

Binary runs and is able to listen on <1024 port. Instead CoreDNS/other binary gives permission denied on bind:

$ sudo ./runc run config.json                    
nc.openbsd: Permission denied
$ sudo env RUNC_DMZ=legacy ./runc run config.json
[works as expected]

What version of runc are you using?

# runc --version
runc version 1.1.0+dev
commit: v1.1.0-855-g95a93c1
spec: 1.1.0+dev
go: go1.20.4
libseccomp: 2.5.4

Host OS information

$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"

Host kernel information

Linux 6.5.0

@lifubang
Copy link
Member

lifubang commented Nov 24, 2023

Thanks for your report.
There is a similar discussion when we proposed to introduce runc-dmz: #3983 (comment)
@cyphar has indicated that Due to the way capabilities have to be set by runc, capabilities need to be included in the inheritable and ambient sets anyway. in 6be763e , but unfortunately Kubernetes does not set ambient capabilities currently.

@cyphar and @dgl , do you think we should fix it in runc like this or not:
If we use runc-dmz to run a container, we should check that all capabilities in the bounding set should be included in the inheritable and ambient sets, otherwise add it to these sets before calling execve runc-dmz.

@dgl
Copy link
Author

dgl commented Nov 26, 2023

@lifubang I think adding to the ambient set like that could be dangerous. For example a container may have a binary that runs as root and then deliberately drops privileges via setuid and execing another binary, if runc infers the ambient set, then the user may (depending on how privs are dropped, etc.) see a visible change to the capabilities the supposedly unprivileged binary now has.

Maybe along those lines though a fix could be opting out of runc-dmz if there are any capabilities in the bounding set which aren't in the inheritable and ambient set. (Which currently has the property that it would opt-out any Kubernetes capability using container out of runc-dmz, until Kubernetes implements ambient capabs.)

@dgl
Copy link
Author

dgl commented Nov 27, 2023

From the PR you mention:

In fact, the runtime-spec having the inheritable set be explicitly configurable is quite strange now that I think about it -- because the runtime has to configure the capabilities and then exec the container process, configuring a strange inheritable configuration would only result in weird runtime behaviour.

Note that CVE-2022-24769 did result in the inheritable set generally not being configurable, or at least warnings being generated. I think therefore in my previous comment I would mean "effective and ambient" (not inheritable), but some review from someone familiar with the previous CVE would be good.

@lifubang
Copy link
Member

I agree, there is a similar fix for ‘selinux’, could you submit a patch for this issue.

@cyphar
Copy link
Member

cyphar commented Nov 27, 2023

Maybe along those lines though a fix could be opting out of runc-dmz if there are any capabilities in the bounding set which aren't in the inheritable and ambient set. (Which currently has the property that it would opt-out any Kubernetes capability using container out of runc-dmz, until Kubernetes implements ambient capabs.)

The grand irony being that runc-dmz exists almost entirely because of concerns about Kubernetes workloads with a lot of container churn. The previous mount-fd stuff was added to placate some Kubernetes e2e tests, and when we removed it because it caused performance issues (and was arguably not really secure) the same Kubernetes e2e test concerns came up. If it were entirely up to me, I'm not sure we would have runc-dmz in the first place (nice though it is, it adds extra complexity that shouldn't be necessary in general).

I think adding to the ambient set like that could be dangerous. For example a container may have a binary that runs as root and then deliberately drops privileges via setuid and execing another binary, if runc infers the ambient set, then the user may (depending on how privs are dropped, etc.) see a visible change to the capabilities the supposedly unprivileged binary now has.

In fact, there was a Docker CVE related to ambient capabilities back in the day (CVE-2016-8867) which IIRC boiled down to Docker doing exactly this -- blindly setting ambient caps to the other cap sets (see moby/moby#27610).

The only solution that keeps runc-dmz and the old capability behaviour is if we were to move the setting of capabilities to runc-dmz. There are three issues with this, which make this a non-started IMHO:

  1. In theory the ABI for capabilities is not readily exposed to userspace, so we would need to use libcap which would significantly increase the size of runc-dmz, reducing its usefulness.
  2. runc-dmz would need to do more than just set capabilities, because the ordering in *_init_linux.go is special and certain operations need to happen after configuring capabilities (such as setting seccomp filters), which means that runc-dmz would need to do those too. This would also increase the binary size.
  3. In order to facilitate (2) we would need to serialise and then deserialise the internal configuration (in C) in order to . Aside from general inefficiency, it would also be a potential attack vector and would also increase the binary size.

If Kubernetes doesn't mind having runc-dmz be disabled (through the mechanism you proposed), I really wonder whether we need runc-dmz at all. We have to disable runc-dmz for older SELinux systems too... @kolyshkin @AkihiroSuda?

@AkihiroSuda
Copy link
Member

If Kubernetes doesn't mind having runc-dmz be disabled (through the mechanism you proposed), I really wonder whether we need runc-dmz at all. We have to disable runc-dmz for older SELinux systems too... @kolyshkin @AkihiroSuda?

Maybe we should just disable dmz by default and make it opt-in.
Eventually we may deprecate it if nobody really opts-in.

@lifubang
Copy link
Member

I think capabilities issue, selinux issue and another one are the known issues when we were introducing runc-dmz.
Maybe we can open an issue to list these dmz opt-out situations, if there were more situations added in, we can disable dmz by default.

@dgl
Copy link
Author

dgl commented Nov 28, 2023

Unfortunately the capabilities issue applies to many Kubernetes pods, the default list of capabilities from Docker (per this) means that unless users have set capabilities to a more restrictive set (which they really should) many pods will have that set.

The ideal way to do this would be to decide whether runc-dmz should be used when it's known if the container asks for any capabilities and it is running as non-root (or even look at the file capabilities of the binary being run, although that is potentially racy). Doing this needs a check quite late on in runc's execution (e.g. finalizeNamespace in libcontainer/init_linux.go, as runc doesn't know the UID earlier than that).

@dgl
Copy link
Author

dgl commented Nov 28, 2023

I've opened a draft PR #4129, it's a bit ugly, but does work in my basic testing. I'll test it within a Kubernetes cluster and see about cleaning it up (the reuse of initConfig is just lazy, I'll see about passing whether the dmz binary should be used in a better way).

@cyphar
Copy link
Member

cyphar commented Nov 28, 2023

@dgl

or even look at the file capabilities of the binary being run, although that is potentially racy

This would be fundamentally unsafe to do, due to the obvious races. We really can't make security decisions based on container state the container itself controls. In addition, this wouldn't work for #!-style scripts (in that case, the relevant file capabilities are that of the script interpreter, not the script).

I also wouldn't be super happy with the idea of emulating the Linux capability system in order to determine whether runc-dmz would cause issues in the execve transition, even if it were safe.

Doing this needs a check quite late on in runc's execution (e.g. finalizeNamespace in libcontainer/init_linux.go, as runc doesn't know the UID earlier than that).

We unfortunately can't do the check that late because you need to have already done the /proc/self/exe cloning before the runc init re-exec (this is also why we have to detect potential SELinux issues much earlier and have a much more conservative check for SELinux). We could work around it but it would make things far more complicated.

The only choice I can see for making this automatic is to look at the configured capability sets and see if runc-dmz would cause a change in behaviour. I will need to take a really close look at the various interactions (the capability system in Linux is a complete mess...) but I think the most conservative option would be to only use runc-dmz if the ambient capability set is a superset of the effective and permissible sets (since the ambient capability rules are that ambient caps need to be both effective and permissible, in practice this means that we would only use runc-dmz if ambient == effective == permissible).

But I think making it opt-in is probably going to be less of a headache.

@lifubang

I think capabilities issue, selinux issue and another one are the known issues when we were introducing runc-dmz.

Right, they were known at the time, but at the time I assumed the SELinux issue was mostly hypothetical (an SELinux policy could be really annoying and block this, not that the current RHEL policy did block it at the time) and for capabilities I assumed everyone had enabled ambient capability support 4-5 years ago. I would never have guessed that Kubernetes still doesn't use them.

@dgl
Copy link
Author

dgl commented Nov 28, 2023

@cyphar

I think even with ambient capability support this is a bit tricky to do right, as a username is (potentially) passed all the way through and ambient capabilities have no observable effect if runc is doing the execve as root (so it is safe to use runc-dmz).

If the optimization isn't going to work for that case, then I agree, maybe opt-in is best.

(If only Docker had decided to resolve USER lines at build rather than runtime, but I doubt that is going to change, it causes user visible pain in Kubernetes too...)

@cyphar
Copy link
Member

cyphar commented Nov 28, 2023

Luckily, in runc we only accept uid and gids, as opposed to usernames†. So we could in principle just check if the config.json uid is 0 but it might require some changes to internal APIs (including stuff that is serialised to state.json, which is going to cause other issues...).

I am leaning towards making it opt-in, though. Let me think about it for a bit...

† Despite how the code actually looks, the story behind that goes back to the original pre-runc Docker implementation of libcontainer. See newProcess in utils_linux.go:

// TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
User:            fmt.Sprintf("%d:%d", p.User.UID, p.User.GID),

@rata
Copy link
Member

rata commented Dec 5, 2023

So, the plan here is to change runc-dmz to be opt-it at compilation time?

@AkihiroSuda
Copy link
Member

So, the plan here is to change runc-dmz to be opt-it at compilation time?

Can be just a run-time CLI flag IIUC

@cyphar
Copy link
Member

cyphar commented Dec 7, 2023

We can make it so that RUNC_DMZ is used to opt-in to the feature. I'll send a PR once I finish with the cleanups of #3985.

@AkihiroSuda
Copy link
Member

dmz is now disabled by default, so I guess we can close this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants