Docker fails to start container if certain syscalls are restricted by seccomp #22252

Closed
jpallen opened this Issue Apr 22, 2016 · 10 comments

jpallen commented Apr 22, 2016

Description of problem: The following syscalls must be allowed in the seccomp profile, even if the process run in the container never uses them:

capget
capset
chdir
fchown
futex
getdents64
getpid
getppid
lstat
openat
prctl
setgid
setgroups
setuid
stat

If any of these is not allowed, the container fails to start, with an error message that varies depending on which syscall is missing. I'm not familiar with the internals, but I suspect the seccomp profile is applied before the container is set up, and these syscalls are needed for that setup. A security model that allows limiting syscalls should also make it possible to deny these calls when the process that is actually run does not need them.

docker version:

Client:
 Version:      1.11.0-rc5
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   6178547
 Built:        Mon Apr 11 21:16:15 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0-rc5
 API version:  1.23
 Go version:   go1.5.3
 Git commit:   6178547
 Built:        Mon Apr 11 21:16:15 2016
 OS/Arch:      linux/amd64

docker info:

Containers: 6
 Running: 0
 Paused: 0
 Stopped: 6
Images: 3
Server Version: 1.11.0-rc5
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 74
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: null host bridge
Kernel Version: 4.4.0-18-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953 GiB
Name: sl-lin-stag-clsi-2
ID: 53XM:GFZU:I5QC:MGBM:HPGW:KENC:ZBK7:GN4C:TGXD:JJPC:AO3S:IVM4
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

uname -a: Linux sl-lin-stag-clsi-2 4.4.0-18-generic #34-Ubuntu SMP Wed Apr 6 14:01:02 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.): Test machine was a VPS from Linode

How reproducible: Very. Just run a container with any of the above syscalls removed from the default docker profile.

Steps to Reproduce:

  1. Create a seccomp profile for echo that allows only the syscalls it needs (check using strace):
$ strace echo hi 2>&1 | grep -v '+++ exited' | cut -d'(' -f1 | grep -v ')' | sort | uniq
access
arch_prctl
brk
close
execve
exit_group
fstat
mmap
mprotect
munmap
open
read
write
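
The shell pipeline above can also be sketched in Python, which makes the filtering step explicit. This is only an illustration: the function name `syscalls_from_strace` and the sample text are hypothetical, and the sketch assumes plain `strace` output without the PID prefixes that `strace -f` adds.

```python
import re

# Matches the syscall name at the start of an strace line,
# e.g. `brk(NULL) = 0x1234000`.
SYSCALL_LINE = re.compile(r"^([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(output):
    """Return the sorted, de-duplicated syscall names found in strace output."""
    names = set()
    for line in output.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Hypothetical, abbreviated strace output used only for illustration.
sample = """execve("/bin/echo", ["echo", "hi"], [/* 20 vars */]) = 0
brk(NULL) = 0x1234000
write(1, "hi", 2) = 2
hi
exit_group(0) = ?
+++ exited with 0 +++
"""

print(syscalls_from_strace(sample))
```

Lines that do not start with a syscall name followed by `(`, such as the program's own output or the `+++ exited` marker, are skipped.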

The seccomp profile which allows only these syscalls is attached:
echo-seccomp-profile.json.txt
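
The attachment itself is not reproduced in this thread. As a rough sketch of what such an allow-list profile could look like, the snippet below builds one from the syscall list above. The JSON layout (a `defaultAction` plus one `name`/`action`/`args` entry per syscall) is an assumption based on the default profile format Docker shipped around this release; later Docker releases, as far as I know, group syscalls under a `names` list instead. The helper name `build_profile` is hypothetical.

```python
import json

# Syscalls reported by strace for `echo hi` in the step above.
ECHO_SYSCALLS = [
    "access", "arch_prctl", "brk", "close", "execve", "exit_group",
    "fstat", "mmap", "mprotect", "munmap", "open", "read", "write",
]


def build_profile(allowed):
    """Build an allow-list profile: deny by default, allow only `allowed`."""
    return {
        # Assumed layout: any syscall not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ALLOW", "args": []}
            for name in allowed
        ],
    }


profile = build_profile(ECHO_SYSCALLS)
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

Saving `profile_json` to a file such as `echo-seccomp-profile.json` would give something that can be passed to `--security-opt seccomp=...` as in the next step.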

  2. Try to run echo in a container with this profile:
$ sudo docker run -it --rm --security-opt seccomp=echo-seccomp-profile.json ubuntu echo hi
docker: Error response from daemon: rpc error: code = 2 desc = "oci runtime error: open /proc/self/fd: operation not permitted".

Expected Results: It should run the echo command.

Additional info: I found the list of syscalls that docker needs (in the first paragraph of this report) by removing each syscall in turn from the default profile and seeing which removals cause a run of echo hi to fail in a container (excluding the ones explicitly needed by echo). It is likely that some of access, arch_prctl, brk, close, execve, exit_group, fstat, mmap, mprotect, munmap, open, read, write are also required for setting up the container, rather than just for the echo command, since these are pretty fundamental calls.
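
The remove-one-at-a-time search described above could be automated along these lines. This is a minimal sketch, not the script used for the report: `required_syscalls` and `fake_runs_ok` are hypothetical names, the profile layout (one `name` entry per syscall) is assumed, and the container check is injected as a callable so the logic can be exercised without Docker.

```python
import copy


def required_syscalls(profile, runs_ok):
    """Return the syscalls whose removal from `profile` makes `runs_ok` fail.

    `runs_ok` takes a profile dict and returns True if the container started.
    """
    required = []
    for entry in profile["syscalls"]:
        trimmed = copy.deepcopy(profile)
        trimmed["syscalls"] = [
            e for e in trimmed["syscalls"] if e["name"] != entry["name"]
        ]
        if not runs_ok(trimmed):
            required.append(entry["name"])
    return sorted(required)


# Stand-in for a real check: pretends the container only starts when
# futex and stat are allowed. Purely illustrative.
def fake_runs_ok(profile):
    allowed = {e["name"] for e in profile["syscalls"]}
    return {"futex", "stat"} <= allowed


demo_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": n, "action": "SCMP_ACT_ALLOW", "args": []}
        for n in ["futex", "read", "stat", "write"]
    ],
}

print(required_syscalls(demo_profile, fake_runs_ok))
```

A real `runs_ok` would write the trimmed profile to a file, run the `docker run ... --security-opt seccomp=<file> ubuntu echo hi` command from the reproduction steps, and report whether it exited successfully.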

cpuguy83 commented Apr 22, 2016

Nice find!
Ping @justincormack

justincormack commented Apr 22, 2016

Yes, when we did the original addition I did notice that the seccomp profile was being applied a little earlier than ideal (i.e. before setting capabilities and a few other things), but it may also have moved around with the runc work. It would indeed make sense to apply it as late as possible, since restricting some of those calls is a good idea. I will look into how to move it.

justincormack commented Apr 27, 2016

OK, I have it working with only futex, stat and execve needed. The stat can be moved later (it is Go trying to find the executable in the PATH). The futex is not very easy to remove, but it is pretty essential anyway. The execve is also hard to remove, as we need to exec the new program, and the alternative, execveat, is only available in very new kernels.

There is a slight complication around no new privs and capabilities. There are a few options, and I will work out the best solution.
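
To make the effect of this change concrete, the following sketch compares the setup syscalls listed in the original report with the three that this comment says are still needed. It is plain set arithmetic on the two lists from this thread, assuming the patched runtime behaves as described here; the variable names are hypothetical.

```python
# Syscalls the original report found had to be allowed for container setup.
NEEDED_BEFORE = {
    "capget", "capset", "chdir", "fchown", "futex", "getdents64",
    "getpid", "getppid", "lstat", "openat", "prctl", "setgid",
    "setgroups", "setuid", "stat",
}

# Syscalls this comment reports as still required after moving seccomp later.
NEEDED_AFTER = {"futex", "stat", "execve"}

# Setup syscalls a profile could now deny, provided the program itself
# does not use them.
now_deniable = sorted(NEEDED_BEFORE - NEEDED_AFTER)
print(now_deniable)
```

Calls such as `setuid`, `setgid` and `capset` fall into the deniable set, which matches the kind of restriction the report asked for.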

justincormack added a commit to justincormack/runc that referenced this issue Apr 27, 2016

If possible, apply seccomp rules immediately before exec
See moby/moby#22252

Previously we would apply seccomp rules before applying
capabilities, because it requires CAP_SYS_ADMIN. This
however means that a seccomp profile needs to allow
operations such as setcap() and setuid() which you
might reasonably want to disallow.

If prctl(PR_SET_NO_NEW_PRIVS) has been applied however
setting a seccomp filter is an unprivileged operation.
Therefore if this has been set, apply the seccomp
filter as late as possible, after capabilities have
been dropped and the uid set.

Note a small number of syscalls will take place
after the filter is applied, such as `futex`,
`stat` and `execve`, so these still need to be allowed
in addition to any the program itself needs.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
justincormack commented Apr 27, 2016

Ok, I have sent a PR to runc opencontainers/runc#789

This only moves seccomp to after setting capabilities if --security-opt="no-new-privileges" is set. Otherwise the current behaviour has to be kept, because without that flag setting the seccomp filter is a privileged operation. It is a good idea to use that flag, and if you are already setting a custom filter, adding it is not too much of a burden. (I wish that flag had been the default, but it was added later and it does cause problems if people want to use suid binaries in a container.)
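
The ordering described in this comment can be pictured with a toy model. This is not runc's code: the step names are simplified labels taken from the discussion, and `init_steps` is a hypothetical helper that only shows where the seccomp filter lands relative to the privilege changes.

```python
def init_steps(no_new_privileges):
    """Return a simplified order of container init steps (toy model only)."""
    privilege_steps = ["apply capabilities", "set uid"]
    if no_new_privileges:
        # With no-new-privileges set, loading a filter is unprivileged,
        # so it can happen after privileges have been dropped.
        return (
            ["prctl(PR_SET_NO_NEW_PRIVS)"]
            + privilege_steps
            + ["load seccomp filter", "execve"]
        )
    # Without the flag, loading the filter is a privileged operation,
    # so it has to happen before capabilities are dropped.
    return ["load seccomp filter"] + privilege_steps + ["execve"]


for flag in (False, True):
    print(flag, init_steps(flag))
```

In the first ordering the profile has to allow whatever the privilege-dropping steps call. In the second it only has to allow what runs between loading the filter and the exec, plus the program's own syscalls. On the command line this would presumably mean passing `--security-opt no-new-privileges` alongside the `--security-opt seccomp=...` option from the report.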

mrunalp commented Apr 27, 2016

@justincormack Yep, the reason for not making no-new-privileges the default was to not break existing containers that use setuid binaries.

justincormack commented Apr 28, 2016

This has been merged into runc now.

thaJeztah commented Apr 28, 2016

@justincormack think we need a PR to bump runC here to close this issue

@thaJeztah thaJeztah added this to the 1.12.0 milestone Apr 28, 2016

vonpupp commented May 3, 2016

I am having the exact same issue here on a Linode droplet using Ubuntu 16.04.

av@ubuntu:~$ docker version
Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:43:49 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:43:49 2016
 OS/Arch:      linux/amd64

av@ubuntu:~$ uname -a
Linux ubuntu 4.5.0-x86_64-linode65 #2 SMP Mon Mar 14 18:01:58 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
justincormack commented May 4, 2016

@vonpupp which issue exactly?

vonpupp commented May 5, 2016

The original issue that opened this thread. Since it was a brand new droplet and I needed docker to work as soon as possible, I downgraded to Ubuntu 14.04.
