Process capabilities cannot be retained when starting a container as non-root with --security-opt=no-new-privileges #45491

Closed
vasiliy-ul opened this issue May 8, 2023 · 10 comments · Fixed by #45511
Labels
kind/bug, status/confirmed, version/20.10, version/23.0

Comments

@vasiliy-ul
Contributor

vasiliy-ul commented May 8, 2023

Description

When Docker is used as the container runtime in Kubernetes, the capabilities specified in the container's security context (in the pod YAML manifest) are not respected if the container runs as a non-root user:

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107

$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status 
CapInh:	0000000000000000
CapPrm:	0000000000000000 # permitted caps zeroed
CapEff:	0000000000000000 # effective caps zeroed
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000

In the KubeVirt project, we have had several similar issues reported: kubevirt/kubevirt#9465

This can be easily reproduced with minikube. Other runtimes (containerd and crio) handle the capabilities correctly:

CapInh:	0000000000000000
CapPrm:	0000000000000400 # cap_net_bind_service
CapEff:	0000000000000400 # cap_net_bind_service
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000
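
For readers unfamiliar with the hex masks above, here is a minimal sketch of how they map to capability names. The helper `decodeCaps` and the abbreviated `capNames` table are hypothetical and only for illustration; the bit numbers come from the kernel's `<linux/capability.h>`, where `CAP_NET_BIND_SERVICE` is bit 10, i.e. mask `0x400`.

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames lists a few Linux capability bit numbers from <linux/capability.h>.
// Only a subset is included to keep the sketch short.
var capNames = map[uint]string{
	0:  "CAP_CHOWN",
	10: "CAP_NET_BIND_SERVICE",
	12: "CAP_NET_ADMIN",
	13: "CAP_NET_RAW",
	21: "CAP_SYS_ADMIN",
}

// decodeCaps (hypothetical helper) turns a hex mask as printed in
// /proc/<pid>/status into capability names, lowest bit first.
// Bits missing from capNames are reported as "bit N".
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	names := []string{}
	for bit := uint(0); bit < 64; bit++ {
		if mask&(1<<bit) == 0 {
			continue
		}
		name, ok := capNames[bit]
		if !ok {
			name = fmt.Sprintf("bit %d", bit)
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	// The two masks that appear in the outputs above.
	for _, m := range []string{"0000000000000400", "0000000000000000"} {
		names, err := decodeCaps(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m, names)
	}
}
```

On a real system, `capsh --decode=0000000000000400` from libcap gives the same information.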

I briefly looked at the sources. I am not 100% confident that this snippet is actually causing the problem, but the code below looked suspicious to me:

moby/oci/oci.go

Lines 31 to 35 in c651a53

// Do not set Effective and Permitted capabilities for non-root users,
// to match what execve does.
s.Process.Capabilities = &specs.LinuxCapabilities{
	Bounding: caplist,
}

It was introduced by this commit 349aeea (and refactored in 0d9a37d).

Reproduce

$ minikube start --driver=kvm2
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-operator.yaml
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-cr.yaml
$ wget https://kubevirt.io/labs/manifests/vm.yaml
$ vim vm.yaml # add annotation `kubevirt.io/keep-launcher-alive-after-failure: "true"`
$ k create -f vm.yaml
$ k edit vm testvm # set `running: true`
$ k logs -f virt-launcher-testvm-XXXX
...
{"component":"virt-launcher","level":"error","msg":"failed to start virtqemud","pos":"libvirt_helper.go:250","reason":"fork/exec /usr/sbin/virtqemud: errno 0","timestamp":"2023-05-08T09:34:32.370373Z"}
panic: fork/exec /usr/sbin/virtqemud: errno 0
...
$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status 
CapInh:	0000000000000000
CapPrm:	0000000000000000 # permitted caps zeroed
CapEff:	0000000000000000 # effective caps zeroed
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000

Expected behavior

Effective/permitted caps should be set correctly:

CapPrm:	0000000000000400
CapEff:	0000000000000400

docker version

Client:
 Version:           20.10.23
 API version:       1.41
 Go version:        go1.18.10
 Git commit:        7155243
 Built:             Thu Jan 19 17:30:35 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.23
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.10
  Git commit:       6051f14
  Built:            Thu Jan 19 17:36:08 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.0
  GitCommit:        1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc:
  Version:          1.1.5
  GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 34
  Running: 28
  Paused: 0
  Stopped: 6
 Images: 14
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.57
 Operating System: Buildroot 2021.02.12
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.22GiB
 Name: minikube
 ID: 462Q:TJOC:6UQE:VT5O:7XAO:AS3J:5M6Q:VOT3:HXV2:HTVP:4TFY:4W7K
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
  provider=kvm2
 Experimental: false
 Insecure Registries:
  10.96.0.0/12
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support

Additional Info

This can also be reproduced without KubeVirt:

$ k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sleeper
spec:
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  containers:
  - name: sleeper
    image: busybox
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107
    command:
    - /bin/sh
    - "-euxc"
    - |
      sleep infinity
EOF
$ k exec -ti sleeper -- sh
~ $ ps aux
PID   USER     TIME  COMMAND
    1 107       0:00 /bin/sh -euxc sleep infinity 
   13 107       0:00 sh
   19 107       0:00 ps aux
~ $ grep Cap /proc/1/status
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000000000000400
CapAmb:	0000000000000000
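
The manual check above can be automated. The following is a minimal sketch, assuming the status text has already been read (for example from `/proc/1/status`); `parseCapSets` and `hasCap` are hypothetical helper names, and the sample input is the sleeper pod output shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

const capNetBindService = 10 // bit number of CAP_NET_BIND_SERVICE

// parseCapSets (hypothetical helper) extracts the Cap* lines from the text of
// /proc/<pid>/status and returns each set as a 64-bit mask keyed by field name.
func parseCapSets(status string) (map[string]uint64, error) {
	sets := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), ":")
		if !ok || !strings.HasPrefix(key, "Cap") {
			continue // not a capability line
		}
		mask, err := strconv.ParseUint(strings.TrimSpace(value), 16, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %s: %w", key, err)
		}
		sets[key] = mask
	}
	return sets, sc.Err()
}

// hasCap reports whether the given capability bit is set in mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(1<<bit) != 0
}

func main() {
	// Sample taken from the sleeper pod output above.
	status := "CapInh:\t0000000000000000\nCapPrm:\t0000000000000000\n" +
		"CapEff:\t0000000000000000\nCapBnd:\t0000000000000400\n" +
		"CapAmb:\t0000000000000000\n"
	sets, err := parseCapSets(status)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bounding has NET_BIND_SERVICE:", hasCap(sets["CapBnd"], capNetBindService))
	fmt.Println("effective has NET_BIND_SERVICE:", hasCap(sets["CapEff"], capNetBindService))
}
```

With the sample input, the capability is present in the bounding set but absent from the effective set, which is exactly the symptom reported here.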
@vasiliy-ul added the kind/bug and status/0-triage labels on May 8, 2023
@thaJeztah
Member

I briefly looked at the sources. I am not 100% confident that this snippet is actually causing the problem, but the code below looked suspicious to me:

Those changes were part of Docker 20.10.14 (https://docs.docker.com/engine/release-notes/20.10/#201014) and were made to address CVE-2022-24769.

/cc @samuelkarp

@vasiliy-ul
Contributor Author

vasiliy-ul commented May 8, 2023

It was introduced by this commit 0d9a37d.

Sorry, I was wrong. That commit only refactored the existing code that drops the effective and permitted capabilities. I think the commit that actually introduced it is 349aeea:

	// if non root drop capabilities in the way execve does
	if s.Process.User.UID != 0 {
		s.Process.Capabilities.Effective = []string{}
		s.Process.Capabilities.Permitted = []string{}
	}

committed on Jun 16, 2018

Looks pretty old...

@neersighted
Member

What version of Kubernetes or cri-dockerd are you using? I don't think they're in the loop here, but a full context is helpful.

@vasiliy-ul
Contributor Author

I was able to reproduce the problem on minikube with Kubernetes v1.26.3:

$ /usr/bin/cri-dockerd --version
cri-dockerd 0.3.1 (9a87d6a)

@vasiliy-ul
Contributor Author

Forgot to mention that the binary has the capabilities set:

bash-5.1$ getcap /usr/bin/virt-launcher-monitor
/usr/bin/virt-launcher-monitor cap_net_bind_service=ep

@corhere
Contributor

corhere commented May 9, 2023

Clearing out the effective and permitted capability sets for non-root users was introduced in #36587. #36587 (comment) notes that the docker run --no-new-privileges option stops file capabilities from working, which, if I'm not mistaken, is equivalent to securityContext.allowPrivilegeEscalation=false in a Pod spec.

Notably, there was work to change cri-containerd to implement the same capability-set-clearing behaviour as moby, with lots of agreement that it should happen (though it hasn't quite happened yet), which disproves the claim that moby's behaviour is wrong and the other projects are correct.

@vasiliy-ul
Contributor Author

Hi @corhere, thank you for the clarification. This now makes more sense to me.

allowPrivilegeEscalation: Controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag gets set on the container process.

With no_new_privs set, execve() promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set, and LSMs will not relax constraints after execve.

So apparently we now have different behavior in k8s depending on the container runtime.

@xpivarc
Contributor

xpivarc commented May 10, 2023

Hi @corhere , @justincormack
I don't quite understand why you would want to clear the capabilities before the execve. --no-new-privileges should not stop file capabilities from working, only those which are not explicitly requested, imho.

Can you help me understand the reasoning behind this? I think docker should keep the capabilities requested by the user.

@corhere
Contributor

corhere commented May 10, 2023

The man pages document the behaviour of the no_new_privs attribute on execve as follows:

capabilities(7):

Note: during the capability transitions described above, file
capabilities may be ignored (treated as empty) for the same
reasons that the set-user-ID and set-group-ID bits are ignored;
see execve(2).

Note: according to the rules above, if a process with nonzero
user IDs performs an execve(2) then any capabilities that are
present in its permitted and effective sets will be cleared.

execve(2):

The aforementioned transformations of the effective IDs are not
performed (i.e., the set-user-ID and set-group-ID bits are
ignored) if any of the following is true:

  • the no_new_privs attribute is set for the calling thread (see
    prctl(2));

[...]

The capabilities of the program file (see capabilities(7)) are
also ignored if any of the above are true.

If a process with nonzero UIDs and the no_new_privs attribute set calls execve, the kernel will zero out its permitted and effective capability sets (or set them to the ambient set) irrespective of the file capabilities of the exec'ed program. At least according to those man pages. That is not the case in practice.

The actual semantics of no_new_privs are described in the kernel documentation.

With no_new_privs set, execve() promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set, and LSMs will not relax constraints after execve.

(Emphasis mine.)

File capabilities are taken into consideration when no_new_privs is set! The permitted set after execve is the intersection of the file capabilities and the old permitted set.

Moby is in the wrong here. It doesn't need to emulate execve by zeroing the permitted and effective capabilities; the kernel will do that on its own when the runtime changes its UID and calls execve. The only impact of zeroing out the permitted and effective capability sets in the OCI spec is blocking containers from utilizing file capabilities to retain permitted capabilities when exec'ed with no_new_privs.
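
To make the argument above concrete, here is a simplified model of the execve capability transformation for a non-root process, following the formulas in capabilities(7) plus the no_new_privs restriction described in the kernel documentation. It is a sketch under assumptions, not kernel or moby code: the types `procCaps` and `fileCaps` and the function `execveNonRoot` are hypothetical, and details such as ptrace handling and set-user-ID binaries are left out. The "old" permitted set stands for whatever the runtime applied from the OCI spec before calling execve.

```go
package main

import "fmt"

const capNetBindService uint64 = 1 << 10 // CAP_NET_BIND_SERVICE

// procCaps and fileCaps are simplified, hypothetical stand-ins for the
// kernel's per-thread and per-file capability sets.
type procCaps struct {
	Permitted, Effective, Inheritable, Bounding, Ambient uint64
}

type fileCaps struct {
	Permitted, Inheritable uint64
	Effective              bool // the file "effective" bit
}

// execveNonRoot models the capabilities(7) transformation for a non-root
// process. With noNewPrivs set, the new permitted set is additionally
// limited to the old permitted set, so file capabilities cannot add to it.
func execveNonRoot(p procCaps, f fileCaps, noNewPrivs bool) procCaps {
	n := procCaps{Inheritable: p.Inheritable, Bounding: p.Bounding}
	n.Permitted = (p.Inheritable & f.Inheritable) | (f.Permitted & p.Bounding)
	if noNewPrivs {
		n.Permitted &= p.Permitted
	}
	// Ambient capabilities survive only if the file carries no capabilities.
	hasFileCaps := f.Permitted != 0 || f.Inheritable != 0 || f.Effective
	if !hasFileCaps {
		n.Ambient = p.Ambient
	}
	n.Permitted |= n.Ambient
	if f.Effective {
		n.Effective = n.Permitted
	} else {
		n.Effective = n.Ambient
	}
	return n
}

func main() {
	// Binary with cap_net_bind_service=ep, as reported by getcap earlier.
	file := fileCaps{Permitted: capNetBindService, Effective: true}

	// Spec with only the bounding set populated (permitted/effective zeroed).
	zeroed := procCaps{Bounding: capNetBindService}
	// Spec that keeps the requested capability in permitted/effective.
	retained := procCaps{
		Permitted: capNetBindService,
		Effective: capNetBindService,
		Bounding:  capNetBindService,
	}

	fmt.Printf("zeroed spec:   CapPrm=%016x\n", execveNonRoot(zeroed, file, true).Permitted)
	fmt.Printf("retained spec: CapPrm=%016x\n", execveNonRoot(retained, file, true).Permitted)
}
```

In this model, the zeroed spec ends up with an empty permitted set under no_new_privs, while the retained spec keeps cap_net_bind_service, which matches the difference between the two outputs in the issue description.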

@corhere changed the title from "Incorrect process capabilities when running as non-root in kubernetes" to "Process capabilities cannot be retained when starting a container as non-root with --security-opt=no-new-privileges" on May 10, 2023