Add blog for Speed up recursive SELinux label change beta
jsafrane committed Mar 29, 2023
1 parent 68750e7 commit 1c92352
Showing 1 changed file with 131 additions and 0 deletions.
@@ -0,0 +1,131 @@
---
layout: blog
title: "Kubernetes 1.27: Efficient SELinux volume relabeling (Beta)"
date: 2023-04-11T10:00:00-08:00
slug: kubernetes-1-27-efficient-selinux-relabeling-beta
---

**Author:** Jan Šafránek (Red Hat)

In Kubernetes 1.27, we graduated to beta a more efficient way of applying
SELinux labels to volumes used by Pods.

## TL;DR

If a Pod has an SELinux context assigned **and** the operating system supports
SELinux **and** the Pod uses a PersistentVolume with the `ReadWriteOncePod`
access mode **and** the CSI driver that handles the volume announces
`SELinuxMount: true` in its CSIDriver instance, **then** Kubernetes and the CSI
driver mount the volume with the Pod's SELinux label directly, and the container
runtime does not relabel the files on the volume.

Nothing changes on Windows or on Linux that does not use SELinux.

See below for details and future direction.

## SELinux in containers

See the excellent
[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
by Daniel J Walsh. Note that the guide is older than Kubernetes: it describes
*Multi-Category Security* (MCS) mode using virtual machines as an example, but
a similar concept is used for containers.

See the following blog posts for details on exactly how container runtimes
apply SELinux to containers:

* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)

## SELinux in Kubernetes

Kubernetes allows setting the complete pod process label in the
`securityContext` field of a Pod, or in the `securityContext` of each container
in the Pod.
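
As a minimal sketch, the label is set through the `seLinuxOptions` block of a
`securityContext`. The Pod name, image, and MCS level below are made-up values
for illustration only; the field names come from the Pod API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo                       # hypothetical name
spec:
  securityContext:
    seLinuxOptions:                        # pod-wide SELinux label
      level: "s0:c123,c456"                # arbitrary example MCS level
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      # The same seLinuxOptions block is also accepted under
      # containers[].securityContext to set a label per container.
```

The remaining parts of the label (`user`, `role`, and `type`) can be set in the
same `seLinuxOptions` block.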

Kubernetes passes the SELinux label to the container runtime, together with
the pod's volumes and their subpaths. By default, Kubernetes tells the
container runtime to recursively apply the SELinux label to all files on all
volumes that support SELinux before running the pod's containers.

{{< caution >}}
The container runtime relabels only the part of a volume that's visible to the
running container(s). If a container uses `subPath` of a volume, only that
`subPath` is relabeled.

This allows two pods that have two different SELinux labels to use the same
volume, as long as they use different subpaths of it.
{{< /caution >}}
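
To illustrate the caution above, here is a sketch of two Pods with different
SELinux labels that mount different subpaths of one volume. All names, the
image, the claim, and the MCS levels are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a                           # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c10,c20"                  # arbitrary example level
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      volumeMounts:
        - name: shared
          mountPath: /data
          subPath: tenant-a                # only this directory is relabeled
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-data             # hypothetical claim
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-b                           # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c30,c40"                  # a different label than tenant-a
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      volumeMounts:
        - name: shared
          mountPath: /data
          subPath: tenant-b                # a different directory of the same volume
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-data             # same hypothetical claim
```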

{{< caution >}}
If a Pod does not have any SELinux label assigned in the Kubernetes API, the
container runtime assigns a unique random one, so a process that potentially
escapes the container boundary cannot access the data of any other container on
the host. The container runtime still recursively relabels all pod volumes with
this random SELinux label.
{{< /caution >}}

It's up to the cluster user, or a security-related admission plugin, to set the
SELinux labels on Pods so that Pods that should share volumes have the same
SELinux label.

## Improvement using mount options

A Linux kernel with SELinux support allows the first mount of a volume to set
the SELinux label on the whole volume using the `-o context=<SELinux label>`
mount option. This way, all files are assigned the given label in constant
time, without recursively walking through the whole volume.

The `context` mount option cannot be applied to bind mounts or remounts of
already mounted volumes. Since the CSI driver performs the first mount of a
volume, it must be the CSI driver that actually applies this mount option. We
added a new field, `SELinuxMount`, to the CSIDriver object, so CSI drivers can
announce whether they support the `-o context` mount option.
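
As a sketch of what such an announcement might look like, the manifest below
sets the field on a CSIDriver object; in YAML it is serialized as
`spec.seLinuxMount`. The driver name is hypothetical, and in practice this
object is usually shipped by the CSI driver vendor rather than written by
cluster users.

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.com   # hypothetical driver name
spec:
  seLinuxMount: true      # the driver supports the `-o context` mount option
```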

If Kubernetes knows the SELinux label of a Pod **and** the CSI driver
responsible for the pod's volume announces `SELinuxMount: true` **and** the
volume has the access mode `ReadWriteOncePod`, then Kubernetes asks the CSI
driver to mount the volume with the mount option `context=<SELinux label>`
**and** tells the container runtime not to relabel the content of the volume,
because all files already have the right label.
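
Putting the conditions together, a workload that qualifies could look roughly
like the sketch below. The claim, storage class, Pod name, image, and MCS level
are all hypothetical, and the storage class is assumed to be backed by a CSI
driver that announces `SELinuxMount: true`.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-relabel-data                  # hypothetical claim
spec:
  accessModes:
    - ReadWriteOncePod                     # required for the `context` mount in 1.27
  storageClassName: example-csi            # hypothetical class; its driver must announce SELinuxMount
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fast-relabel-demo                  # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"                # label known to Kubernetes, arbitrary example
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fast-relabel-data
```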

{{< note >}}
Not all filesystems support the `-o context` mount option out of the box. For
example, blindly passing `-o context=<SELinux label>` when mounting a share from
an NFS server would set the SELinux context for all subsequent mounts from the
same server. A CSI driver that uses NFS must be smart enough to add the
`nosharecache` mount option, so that a subsequent mount of a different volume
from the same NFS server can have a different `context` option. It's up to the
CSI driver vendor to carefully weigh the benefits of applying the SELinux label
in constant time against the potential performance impact of the necessary
mount options, and to test the CSI driver in an SELinux-enabled environment
before setting `SELinuxMount` to `true`.
{{< /note >}}

## Limitation of the initial implementation

{{< caution >}}
Since the `context` mount option always applies to the whole volume, two pods
with two different SELinux contexts may not be able to access the same volume,
even if they use different subpaths of it. Whether a single volume can be
mounted multiple times with different SELinux labels depends on the CSI driver:
it's often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but
it's impossible to mount a single block device with an ext4 filesystem on the
same host twice with different SELinux contexts.
{{< /caution >}}

Due to this limitation, in Kubernetes 1.27 we've chosen to implement the
`context` mount only for PersistentVolumes that have the access mode
`ReadWriteOncePod`. Such volumes can be used by only a single pod and thus with
only one SELinux label.

[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
that count how many pods would not start if we extended the implementation to
all volume Access Modes.

We kindly ask Kubernetes cluster admins to check the metrics and report any
breakage that would be caused by extending the `context` mount to *all* volumes.
Please tag `@jsafrane` in Kubernetes issues.

## How can I learn more?

Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)
