sig-node: Rootless mode
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
AkihiroSuda committed Jan 28, 2020
1 parent e485562 commit c8b1f15
Showing 1 changed file with 213 additions and 0 deletions.
213 changes: 213 additions & 0 deletions keps/sig-node/20190604-rootless.md
---
title: Rootless mode
authors:
- "@AkihiroSuda"
owning-sig: sig-node
reviewers:
- TBD
approvers:
- TBD
creation-date: 2019-06-04
last-updated: 2020-01-28
status: provisional
see-also:
- https://github.com/kubernetes/enhancements/pull/1370
replaces:
- https://github.com/kubernetes/enhancements/pull/1084
---

# Rootless mode

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [FAQ: why not use admission controllers?](#faq-why-not-use-admission-controllers)
- [Goals](#goals)
- [Implementation Details](#implementation-details)
  - [User namespaces](#user-namespaces)
    - [Paths](#paths)
    - [Pod SecurityContext](#pod-securitycontext)
    - [Required changes to Kubernetes](#required-changes-to-kubernetes)
  - [cgroup](#cgroup)
    - [Required changes to Kubernetes](#required-changes-to-kubernetes-1)
  - [Network namespaces](#network-namespaces)
    - [Required changes to Kubernetes](#required-changes-to-kubernetes-2)
- [Risks and Mitigations](#risks-and-mitigations)
- [Graduation Criteria](#graduation-criteria)
- [Testing](#testing)
- [History](#history)
<!-- /toc -->

## Summary

Allow running all Kubernetes components (`kubelet`, CRI, OCI, CNI, and all `kube-*`) as a non-root user on the host.

See the [FOSDEM 2019 talk "Rootless Kubernetes"](https://www.slideshare.net/AkihiroSuda/rootless-kubernetes) by [@AkihiroSuda](https://github.com/AkihiroSuda) and [@giuseppe](https://github.com/giuseppe) for an overview of Rootless mode.
See [Usernetes](https://github.com/rootless-containers/usernetes) for a proof of concept (POC).

Rootless mode has already been adopted by k3s, but the k3s implementation cannot set resource limits because it lacks cgroup support.

This KEP plans to support resource limits using cgroup v2, which officially supports delegating control to a non-root user.

## Motivation

* Protect the host from potential container-breakout vulnerabilities. This is the main motivation.
* Allow users of shared machines (especially HPC) to run Kubernetes without the risk of accidentally breaking their colleagues' environments.
  Not recommended for real multi-tenancy where the users cannot be trusted.
* Safe [`kind`](https://github.com/kubernetes-sigs/kind).
* Safe Kubernetes-on-Kubernetes, to isolate workloads more strictly than Kubernetes API namespaces.

### FAQ: why not use admission controllers?
Admission controllers such as PodSecurityPolicy (PSP) can require containers to use additional security mechanisms such as SELinux, gVisor, and potentially [Node-level UserNS](https://github.com/kubernetes/enhancements/issues/127) in the future.

However, these mechanisms are not effective at mitigating vulnerabilities in the node components themselves (kubelet, CRI, OCI...).

Examples:

- [CVE-2017-1002102](https://nvd.nist.gov/vuln/detail/CVE-2017-1002102): kubelet could delete files on the host during syncing secret/configMap/downwardAPI volumes
- [CVE-2019-11245](https://nvd.nist.gov/vuln/detail/CVE-2019-11245): Dockerfile USER instruction was ignored
- [CVE-2018-11235](https://nvd.nist.gov/vuln/detail/CVE-2018-11235): kubelet could execute an arbitrary command as root via gitRepo volumes
- Potential [zip-slip](https://snyk.io/research/zip-slip-vulnerability) vulnerabilities during image extraction in CRI runtimes. Both containerd and CRI-O are working on support for new formats such as zstd, imgcrypt, and stargz, and these implementations could contain such vulnerabilities.
- Many other CRI/OCI vulnerabilities in the past.

## Goals
* Step 1: Allow `kubelet` and `kube-proxy` to be executed inside user namespaces created by a non-root user [**mergeable**]
  * See [User namespaces](#user-namespaces) and https://github.com/kubernetes/kubernetes/pull/78635
* Step 2: Allow `kubelet` to configure cgroup v2 via a user-instance of systemd [**Waiting for cgroup2 KEP to settle down**]
  * See [cgroup](#cgroup) and https://github.com/kubernetes/enhancements/pull/1370
* Step 3: Allow `kube-proxy` to propagate NodePort ports from its network namespace into the initial network namespace (optional)
  * See [Network namespaces](#network-namespaces)

## Implementation Details

### User namespaces
All components need to be executed inside a user namespace (along with a network namespace and a mount namespace)
to gain fake-root privileges, mostly for network and mount operations.

These namespaces are expected to be created using [RootlessKit](https://github.com/rootless-containers/rootlesskit).
In short, RootlessKit is an extended version of [`unshare -rmn`](http://man7.org/linux/man-pages/man1/unshare.1.html).

RootlessKit has already been adopted by Docker, BuildKit, Usernetes, and k3s.
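
A component that needs to adapt its behaviour can tell whether it is running inside a user namespace by inspecting `/proc/self/uid_map`. The following is a minimal sketch in Python, not the code Kubernetes uses; the function name `running_in_userns` is hypothetical. It relies on the fact that the initial namespace has a single identity mapping covering the full 32-bit UID range, so any other mapping indicates a user namespace.

```python
def running_in_userns(uid_map_text: str) -> bool:
    """Guess from uid_map content whether the process is in a user namespace.

    Each line of /proc/<pid>/uid_map is "<inside-id> <outside-id> <length>".
    The initial namespace contains exactly one line: "0 0 4294967295".
    """
    entries = [line.split() for line in uid_map_text.splitlines() if line.strip()]
    if not entries:
        # No mapping information available: assume the initial namespace.
        return False
    if len(entries) > 1:
        # The initial namespace never has more than one range.
        return True
    try:
        inside, outside, length = (int(field) for field in entries[0])
    except ValueError:
        # Unparseable line: conservatively assume the initial namespace.
        return False
    return not (inside == 0 and outside == 0 and length == 4294967295)


# Example mapping of the kind a rootless setup might produce (values are illustrative):
# container UID 0 is the unprivileged host user, UIDs 1.. come from /etc/subuid.
sample = "0 1000 1\n1 100000 65536\n"
in_userns = running_in_userns(sample)
```

On a Linux host, passing the content of `/proc/self/uid_map` to this function gives the answer for the current process.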

> **Note**
> Rootless mode is unrelated to [the Node-level UserNS KEP](https://github.com/kubernetes/enhancements/issues/127).
>
> Rootless mode executes all components inside UserNS to mitigate vulnerabilities of all components,
> while Node-level UserNS executes only containers inside UserNS.
>
> Rootless mode and Node-level UserNS do not conflict and can be stacked together. (Node-level UserNS inside Rootless UserNS.)

#### Paths

Some paths like `/var/log/pods` are hardcoded in Kubernetes and hard to change.
As RootlessKit can bind-mount writable directories on these paths, Kubernetes does not need to be updated.

#### Pod SecurityContext

- `runAsUser`: supported, but the range of usable UIDs is limited by the subordinate UID ranges in `/etc/subuid`.
- `sysctls`: some sysctl parameters are supported, but others fail with `EPERM`.
  Creating Pods whose manifests contain such sysctl parameters would fail.
  If this behavior is problematic, users can write a Mutating Admission Webhook that removes such sysctl parameters from the Pod manifest.
- seccomp: supported
- SELinux: supported
- AppArmor: unsupported. Creating Pod manifests with AppArmor profile specified would fail.
- [Node-level UserNS KEP](https://github.com/kubernetes/enhancements/issues/127): can be supported. This UserNS will be nested inside the Rootless mode's UserNS.
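
To illustrate the `runAsUser` limit, here is a rough sketch that parses `/etc/subuid`-formatted text and checks whether a container UID can be mapped. It assumes the common rootless layout in which container UID 0 maps to the unprivileged host user and container UIDs 1..N map onto that user's subordinate UIDs. The function names are hypothetical and not part of any Kubernetes or RootlessKit API.

```python
def subuid_count(subuid_text: str, user: str) -> int:
    """Sum the sizes of all subordinate UID ranges assigned to `user`.

    Each line of /etc/subuid has the form "<user>:<first-host-uid>:<count>".
    Lines that do not match this shape, or belong to other users, are skipped.
    """
    total = 0
    for line in subuid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3 or fields[0] != user or not fields[2].isdigit():
            continue
        total += int(fields[2])
    return total


def run_as_user_is_mappable(uid: int, subuid_text: str, user: str) -> bool:
    """Assumed layout: container UID 0 -> host user, UIDs 1..count -> subordinate UIDs."""
    return 0 <= uid <= subuid_count(subuid_text, user)


# Illustrative content; real files differ per host.
example_subuid = "alice:100000:65536\nbob:165536:65536\n"
ok = run_as_user_is_mappable(1000, example_subuid, "alice")
too_large = run_as_user_is_mappable(70000, example_subuid, "alice")
```

With a 65536-entry range, `runAsUser: 1000` is usable while `runAsUser: 70000` is not, which is the kind of failure the list above refers to.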

#### Required changes to Kubernetes
`kubelet` needs to be updated to ignore errors during [`setupKernelTunables()`](https://github.com/kubernetes/kubernetes/blob/v1.18.0-alpha.2/pkg/kubelet/cm/container_manager_linux.go#L384-L423),
because most sysctl parameters cannot be modified inside the user namespace, and modifying them is not mandatory.

> **Note**
> These sysctl parameters are set by `kubelet` itself and are unrelated to `.spec.securityContext.sysctls` in Pod manifests.

The same applies to `setRLimit()` and the conntrack setup in `kube-proxy`: failures there need to be ignored as well.

See PR https://github.com/kubernetes/kubernetes/pull/78635 (should be ready to merge).
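
The idea of tolerating these failures can be sketched as follows. This is an illustration in Python rather than the Go code in the PR; `apply_tunables` and the `write_sysctl` callback are hypothetical names. Outside a user namespace an error stays fatal, while inside one it is logged and skipped.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("rootless-sketch")


def apply_tunables(
    desired: Dict[str, int],
    write_sysctl: Callable[[str, int], None],
    in_userns: bool,
) -> List[str]:
    """Try to set every tunable and return the names that were skipped.

    `write_sysctl` is expected to raise OSError (for example PermissionError)
    when the kernel refuses the write.
    """
    skipped: List[str] = []
    for name, value in desired.items():
        try:
            write_sysctl(name, value)
        except OSError as err:
            if not in_userns:
                # Rootful mode: keep the existing fail-fast behaviour.
                raise
            log.warning("ignoring failure to set %s=%d in user namespace: %s",
                        name, value, err)
            skipped.append(name)
    return skipped
```

Returning the skipped names keeps the decision visible to the caller, which can report them instead of silently dropping them.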

### cgroup

[cgroup v2](https://github.com/kubernetes/enhancements/pull/1370) will be required.
In most environments, cgroup will be configured via a user-instance of systemd, just as in rootless Podman+crun.

cgroup v1 won't be supported due to security concerns.

[Alternatively, we could just disable cgroup](https://github.com/kubernetes/enhancements/pull/1084), but disabling cgroup is
not considered to be useful.
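
As a rough illustration of the cgroup v2 prerequisite, the sketch below checks whether a directory belongs to a cgroup v2 hierarchy and which controllers are available in it. It relies on the `cgroup.controllers` file that cgroup v2 exposes in every cgroup directory; the function name and the default path `/sys/fs/cgroup` are assumptions made for this example.

```python
from pathlib import Path
from typing import List, Optional


def available_controllers(cgroup_dir: str = "/sys/fs/cgroup") -> Optional[List[str]]:
    """Return the controllers listed in cgroup.controllers.

    Returns None when the file is missing or unreadable, which is what one
    would see on a cgroup v1 host or on a path that is not a cgroup directory.
    """
    controllers_file = Path(cgroup_dir) / "cgroup.controllers"
    try:
        return controllers_file.read_text().split()
    except OSError:
        return None


def can_limit(resource: str, cgroup_dir: str = "/sys/fs/cgroup") -> bool:
    """True if the given controller (e.g. "memory") is available in cgroup_dir."""
    controllers = available_controllers(cgroup_dir)
    return controllers is not None and resource in controllers
```

In a rootless setup the interesting directory is the cgroup delegated to the user by systemd rather than the hierarchy root, and the controllers listed there may be a subset of those on the host.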

#### Required changes to Kubernetes

`pkg/kubelet/cm` needs to be updated to talk to the systemd user instance.

Work on this is blocked on the [cgroup v2](https://github.com/kubernetes/enhancements/pull/1370) KEP.

### Network namespaces

RootlessKit supports two kinds of networks:
* TAP with pure usermode network stack (either `slirp4netns` or VPNKit)
* vEth with setuid binary `lxc-user-nic`

`slirp4netns` is preferred for security; `lxc-user-nic` is preferred for performance.

Flannel (VXLAN) is known to work with these stacks.

#### Required changes to Kubernetes

As the components are executed inside a network namespace, `NodePorts` are not directly accessible from other hosts, and
port forwarding needs to be set up.

RootlessKit provides an [API](https://github.com/rootless-containers/rootlesskit/blob/v0.7.2/pkg/api/openapi.yaml) over
a UNIX socket for requesting port forwarding into the network namespace.

Two possible solutions to integrate the RootlessKit API client into Rootless Kubernetes:

1. Write an external controller that watches changes on `corev1.Service` resources and calls the RootlessKit API.
   This is what the current k3s implementation does: https://github.com/rancher/k3s/blob/v1.17.2+k3s1/pkg/rootlessports/controller.go#L92-L96
2. Embed a RootlessKit API client into `kube-proxy`. This is akin to Rootless Docker's `rootlesskit-docker-proxy`.

Solution 1 is probably sufficient, but solution 2 can also be considered for simplicity.
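
Solution 1 boils down to a reconciliation loop. The sketch below shows only the diffing step in Python, with Service objects reduced to plain dictionaries. It does not call the RootlessKit API, and `desired_node_ports`, `plan_port_forwards`, and the data shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Port = Tuple[str, int]  # (protocol, nodePort)


def desired_node_ports(services: Iterable[Dict]) -> Set[Port]:
    """Collect (protocol, nodePort) pairs from Service-like dictionaries."""
    ports: Set[Port] = set()
    for svc in services:
        for p in svc.get("spec", {}).get("ports", []):
            node_port = p.get("nodePort")
            if node_port:
                # Service ports default to TCP when no protocol is given.
                ports.add((p.get("protocol", "TCP").lower(), node_port))
    return ports


def plan_port_forwards(
    services: Iterable[Dict], forwarded: Set[Port]
) -> Tuple[List[Port], List[Port]]:
    """Return (to_add, to_remove) so that forwarded ports match the Services."""
    desired = desired_node_ports(services)
    to_add = sorted(desired - forwarded)
    to_remove = sorted(forwarded - desired)
    return to_add, to_remove
```

A real controller would run this on every Service event and then translate `to_add` and `to_remove` into requests against the port-forwarding API; computing the difference first keeps repeated events idempotent.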

## Risks and Mitigations

Privileges:
- Privileged containers cannot gain real root privileges on the host.

cgroup:
- Does not work on cgroup v1 hosts.

Network:
- Some CNI plugins might not work. Flannel (VXLAN) is known to work.
- Limited network performance.
  **Mitigation:** Install [`lxc-user-nic` (SETUID binary)](https://github.com/rootless-containers/rootlesskit/tree/v0.7.2#--netlxc-user-nic-experimental).
- NodePorts below 1024 cannot be exposed.
  **Mitigation:** Set the `CAP_NET_BIND_SERVICE` file capability on the `rootlesskit` binary.

Volumes:
- Only `emptyDir`, `hostPath`, `local`, and FUSE-based CSI volumes can be supported,
  because a user namespace only allows mounting `tmpfs`, `bind`, and FUSE filesystems.
  **Mitigation:** Write a "proxy" CSI plugin that talks to a privileged CSI plugin daemon.

## Graduation Criteria

- Alpha: Basic support for rootless mode on cgroup v2 hosts.

- Beta: e2e test coverage, or a plan for the failing tests.
Requires Kubernetes support for cgroup v2 (rootful) to be Beta or GA.

- GA: Assuming no negative user feedback based on production
experience, promote after >= 2 releases in beta.
Requires Kubernetes support for cgroup v2 (rootful) to be GA.

## Testing

Testing needs cgroup v2 CI infra.

## History

- 2018-07-20: Early POC implementation in [Usernetes project](https://github.com/rootless-containers/usernetes)
- 2019-04-10: [k3s adopted the Usernetes patches](https://github.com/rancher/k3s/pull/195)
- 2019-06-04: [Presented KEP to SIG-node (cgroupless version)](https://github.com/kubernetes/enhancements/pull/1084)
- 2019-07-08: Withdrew the cgroupless KEP
- 2019-11-19: @giuseppe submitted [cgroup v2 KEP](https://github.com/kubernetes/enhancements/pull/1370)
- 2019-11-19: Presented KEP to SIG-node (cgroup v2 version)
