Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: Reworking Seccomp Profiles to ENOSYS Default #42871

Open
cyphar opened this issue Sep 22, 2021 · 3 comments
Open

RFE: Reworking Seccomp Profiles to ENOSYS Default #42871

cyphar opened this issue Sep 22, 2021 · 3 comments
Labels
area/security/seccomp kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Comments

@cyphar
Copy link
Contributor

cyphar commented Sep 22, 2021

I think we need to rework the approach towards seccomp because the stub workaround I came up earlier this year in opencontainers/runc#2750 seems to not be doing a great job of solving the problem (maybe it's unclear how it works, or maybe the fact you need to be very careful with updating the syscall set in your filter is not being enforced by upstream projects or being missed somehow). In any case, it seems that the setup is just as fragile as it was before but with the added complication that it seems to only break sometimes (because it depends on whether you've included a new syscall without mentioning an old one).

Since Docker is one of the the primary users of runc (alongside containerd, crio, podman, etc) we should probably sit down and unify the rules for seccomp profiles. The only reason we didn't switch to ENOSYS by default is because it would cause existing Docker profiles to change their behaviour. However, with defaultErrnoRet each user of runc can on their own terms switch to a setup where they must explicitly set EPERM for the syscalls they wish to block rather than having this ill-advised setup where syscalls that are not mentioned give you EPERM. The idea was that the ENOSYS-for-syscalls-larger-than-the-largest-specified-in-the-profile behaviour implemented in opencontainers/runc#2750 was meant to be an interim solution until Docker could move to explicitly specifying which syscalls they want to EPERM and the rest should be ENOSYS. Unfortunately that didn't happen.

Basically someone needs to sit down and go through all the syscalls in the Docker profile which are currently not explicitly blocked and add a new EPERM rule for them. There are a few syscalls where this will be complicated or impossible (such as those that have multiple argument matches -- which podman has -- or the clone flag mask case) but we should probably have a discussion whether returning ENOSYS in those cases rather than EPERM is going to be a deal breaker or not. The point is that hopefully we should move all OCI users to defaultErrnoRet: 38 // ENOSYS sooner rather than later if possible.

Originally posted by @cyphar in #42681 (comment)

@cyphar
Copy link
Contributor Author

cyphar commented Sep 22, 2021

Just to make it perfectly clear how the current runc heuristic works for the context of this dicussion:

Runc prepends a tiny eBPF stub to every seccomp profile it is instructed to create. This program checks the syscall number and if the syscall number is larger than any syscall specified in the profile, it returns ENOSYS. Otherwise it executes the libseccomp-generated eBPF program. This means that adding any new syscalls to the seccomp profile must be done with extreme caution because if you skip a syscall, that syscall will now return EPERM and glibc will break in 6 months time as a result.

Unfortunately there wasn't a better solution, because libseccomp doesn't allow you to create conditionals based on syscall number, nor can you construct an ENOSYS-default profile using just the EPERM-default profile because there are limitations on what kinds of boolean logic you can use with libseccomp, and it is non-trivial to make the cut-off for returning ENOSYS smaller than the largest syscall in the profile because doing so would require rewriting every return (and thus every jump) in the profile which would be far more invasive and hard to verify correct than our current stub.

So the core issue is that we need to change what profiles are being provided to runc, because the above approach -- while workable in theory -- in practice hasn't solved the issue, and there's nothing more we can do purely in runc to solve the problem. Instead, Docker (as well as containerd, podman, crio, and everyone else using runc) needs to change their profiles to use -ENOSYS as the default errno. This is supported by the latest version of the runtime-spec (thanks to @guiseppe for opencontainers/runtime-spec#1087 -- in retrospect it's quite obvious that the larger changes to the seccomp landscape I discussed were a bit too ambitious to block us solving this with the default errno setup).

The main issue with this switch is that due to limitations in libseccomp, we cannot create explicit EPERM rules for every existing syscall rule. For instance, any rule which contains more than one rule allowing a syscall based on a given argument cannot be inverted because boolean OR inversion requires AND and libseccomp doesn't allow using AND on the same argument. SCMP_MASKED_NEQ doesn't exist yet so SCMP_MASKED_EQ has a non-trivial inversion that requires exponential instructions based on how many bits are set in the mask despite the fact that eBPF has an inverse operation that could be trivially used. This means that we need to figure out out whether anything will break if some syscalls return ENOSYS even though the syscall exists and can be used with a different set of arguments.

@thaJeztah thaJeztah added area/security/seccomp kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. labels Sep 22, 2021
@thaJeztah
Copy link
Member

Thanks for opening; some preparation work was done to at least have support for the errNoRet and defaultErrnoRet fields (#42005 (comment), #42604 (comment)), also with the intent to be able to switch to a different default, and to open the discussion on changing to ENOSYS

Unfortunately there wasn't a better solution, because libseccomp doesn't allow you to create conditionals based on syscall number, nor can you construct an ENOSYS-default profile using just the EPERM-default profile because there are limitations on what kinds of boolean logic you can use with libseccomp

...

Basically someone needs to sit down and go through all the syscalls in the Docker profile which are currently not explicitly blocked and add a new EPERM rule for them.

Does that apply to the "current" situation (where runc does the magic rewriting, and Docker uses EPERM as default), or is this still the case defaultErrnoRet in the profile would be set to ENOSYS?

@cyphar
Copy link
Contributor Author

cyphar commented Sep 22, 2021

The conditional limitations are a general problem, which means that swapping to a defaultErrnoRet of ENOSYS will result in some syscalls returning ENOSYS when they currently return EPERM (even if you try to write explicit rules to generate EPERM for syscalls you want to block like open_by_handle_at(2) or whatever -- the problem is that it is currently not possible to write perfect explicit EPERM rules for every possible seccomp rule with default EPERM). Whether that is a big problem for Docker is up to you folks really. I think that we should just do some testing and see how many things break if socket and clone return spurrious-seeming ENOSYS errors -- though I know of at least one program (runc) that explicitly has EPERM vs ENOSYS handling so I'm a bit worried.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/security/seccomp kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.
Projects
None yet
Development

No branches or pull requests

2 participants