Add blog for Speed up recursive SELinux label change beta
Showing 1 changed file with 131 additions and 0 deletions.
content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
@@ -0,0 +1,131 @@
---
layout: blog
title: "Kubernetes 1.27: Efficient SELinux volume relabeling (Beta)"
date: 2023-04-11T10:00:00-08:00
slug: kubernetes-1-27-efficient-selinux-relabeling-beta
---

**Author:** Jan Šafránek (Red Hat)

In Kubernetes 1.27, we graduated to beta a more efficient way of applying
SELinux labels to volumes used by Pods.

## TL;DR

If a Pod has an SELinux context assigned **and** the operating system supports
SELinux **and** the Pod uses a PersistentVolume with
`accessMode: ReadWriteOncePod` **and** the CSI driver
that handles the volume announces `SELinuxMount: true` in its CSIDriver
instance, **then** Kubernetes and the CSI driver mount the volume with the Pod's
SELinux label directly, and the container runtime does not relabel the files on
the volume.
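
As an illustration of the conditions above, here is a minimal sketch of three
objects that would satisfy them, built as Python dictionaries and serialized to
JSON (which `kubectl apply -f` accepts as well as YAML). The driver name
`example.csi.vendor.io`, the object names `data` and `app`, the image, and the
SELinux label values are all placeholders chosen for this example. In the
serialized API the CSIDriver field is spelled `seLinuxMount`. A CSIDriver object
is normally installed by the driver vendor rather than written by hand.

```python
import json

# Placeholder CSI driver that announces support for the `-o context` mount option.
csi_driver = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "CSIDriver",
    "metadata": {"name": "example.csi.vendor.io"},  # hypothetical driver name
    "spec": {"seLinuxMount": True},
}

# Claim for a volume that only a single Pod may use.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data"},
    "spec": {
        "accessModes": ["ReadWriteOncePod"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

# Pod with an explicit SELinux context (illustrative values only).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app"},
    "spec": {
        "securityContext": {
            "seLinuxOptions": {
                "user": "system_u",
                "role": "system_r",
                "type": "container_t",
                "level": "s0:c10,c20",
            }
        },
        "containers": [
            {
                "name": "app",
                "image": "registry.example/app:latest",  # hypothetical image
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }
        ],
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "data"}}
        ],
    },
}


def to_manifest_list(*objects):
    """Wrap the objects in a v1 List so they can be applied in one file."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": list(objects)}, indent=2
    )


manifests = to_manifest_list(csi_driver, pvc, pod)
```

Writing `manifests` to a file gives a single document that can be reviewed
before it is applied to a test cluster.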

Nothing changes on Windows or on Linux hosts that do not use SELinux.

See below for more details and future direction.

## SELinux in containers

See the excellent
[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
by Daniel J Walsh. Note that the guide predates Kubernetes; it describes
*Multi-Category Security* (MCS) mode using virtual machines as an example,
but a similar concept is used for containers.

See the following blog posts for details on exactly how container runtimes
apply SELinux to containers:

* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)

## SELinux in Kubernetes

Kubernetes allows setting the complete pod process label in the
`securityContext` field of a Pod, or in the `securityContext` of each container
in the Pod.

Kubernetes passes the SELinux label to the container runtime, together
with the pod's volumes and their subpaths. By default, Kubernetes tells the
container runtime to recursively apply the SELinux label to all files on all
volumes that support SELinux before running the pod's containers.
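
To make the cost of this default behavior concrete, the following is a
simplified sketch of a recursive relabel, not the container runtime's actual
code. The function name `relabel_recursively` and the `set_label` callback are
hypothetical; a real implementation would write the label to each file's
`security.selinux` extended attribute. The point is that the work grows with
the number of files and directories on the volume.

```python
import os


def relabel_recursively(root, label, set_label):
    """Apply `label` to `root` and everything beneath it.

    `set_label(path, label)` is a hypothetical callback standing in for the
    real labeling operation. Returns the number of paths that were labeled.
    Simplification: symbolic links to directories are not followed or labeled.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Label the directory itself (the first iteration is `root`).
        set_label(dirpath, label)
        count += 1
        # Label every regular entry inside this directory.
        for name in filenames:
            set_label(os.path.join(dirpath, name), label)
            count += 1
    return count
```

On a volume with millions of files, every one of them is visited before the
pod's containers can start, which is the delay this feature avoids.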

{{< caution >}}
The container runtime relabels only the part of a volume that's visible to the
running container(s). If a container uses a `subPath` of a volume, only that
`subPath` is relabeled.

This allows two pods that have two different SELinux labels to use the same
volume, as long as they use different subpaths of it.
{{< /caution >}}

{{< caution >}}
If a Pod does not have any SELinux label assigned in the Kubernetes API, the
container runtime assigns a unique random one, so a process that potentially
escapes the container boundary cannot access data of any other container on the
host. The container runtime still recursively relabels all pod volumes with this
random SELinux label.
{{< /caution >}}

It's up to the cluster user, or a security-related admission plugin, to set the
SELinux labels on Pods so that Pods that should share volumes have the same
SELinux label.

## Improvement using mount options

A Linux kernel with SELinux support allows the first mount of a volume to set
the SELinux label on the whole volume using the `-o context=<SELinux label>`
mount option. This way, all files are assigned the given label in constant
time, without recursively walking through the whole volume.
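
As a rough sketch of what such a mount looks like, the helper below builds the
argument list for a `mount` invocation. The function name, the device, the
target path, and the label value are placeholders for this example. The label
is wrapped in double quotes because SELinux labels can contain commas (in the
category part), and `mount` would otherwise treat a comma as a separator
between mount options.

```python
def context_mount_args(device, target, selinux_label, fstype="ext4"):
    """Build an argv list for mounting `device` with an SELinux context.

    The arguments are meant to be passed without a shell (for example to
    subprocess.run), so the double quotes reach `mount` literally.
    """
    option = f'context="{selinux_label}"'
    return ["mount", "-t", fstype, "-o", option, device, target]


# Illustrative values only.
args = context_mount_args(
    "/dev/sdb",
    "/mnt/volume",
    "system_u:object_r:container_file_t:s0:c10,c20",
)
```

The list is only constructed here, not executed; running it requires root
privileges and a host with SELinux enabled.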

The `context` mount option cannot be applied to bind mounts or remounts of
already mounted volumes. Since a CSI driver does the first mount of a volume,
it must be the CSI driver that actually applies this mount option. We added a
new field, `SELinuxMount`, to the CSIDriver object, so CSI drivers can announce
whether they support the `-o context` mount option.

If Kubernetes knows the SELinux label of a Pod **and** the CSI driver
responsible for the pod's volume announces `SELinuxMount: true` **and** the
volume has access mode `ReadWriteOncePod`, then it asks the CSI driver to mount
the volume with the mount option `context=<SELinux label>` **and** tells the
container runtime not to relabel the content of the volume, because all files
already have the right label.
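
The decision described above can be sketched as a small function. This is an
illustration of the rule as stated in this post, not the kubelet's actual
implementation; `VolumePlan` and `plan_volume` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VolumePlan:
    """What happens to one volume of a Pod (hypothetical structure)."""
    mount_options: List[str] = field(default_factory=list)
    runtime_relabels: bool = True


def plan_volume(
    pod_label: Optional[str],
    driver_selinux_mount: bool,
    access_modes: List[str],
) -> VolumePlan:
    """Choose between the `context` mount and recursive relabeling."""
    use_context_mount = (
        pod_label is not None
        and driver_selinux_mount
        and access_modes == ["ReadWriteOncePod"]
    )
    if use_context_mount:
        # The CSI driver mounts with the label; the runtime skips relabeling.
        return VolumePlan([f'context="{pod_label}"'], runtime_relabels=False)
    # Otherwise nothing changes: the container runtime relabels recursively.
    return VolumePlan([], runtime_relabels=True)
```

If any one of the three conditions is false, the function falls back to the
previous behavior, which matches the post's statement that nothing changes for
other configurations.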

{{< note >}}
Not all filesystems support the `-o context` mount option out of the box. For
example, blindly passing `-o context=<SELinux label>` when mounting a share from
an NFS server would set the SELinux context for all subsequent mounts from the
same server. A CSI driver that uses NFS must be smart enough to add the
`nosharecache` mount option, so a subsequent mount of a different volume from
the same NFS server can have a different `context` option. It's up to the CSI
driver vendor to carefully weigh the benefits of applying the SELinux label in
constant time against the potential performance impact of the necessary mount
options, and to test the CSI driver in an SELinux-enabled environment before
setting `SELinuxMount` to `true`.
{{< /note >}}
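
A driver-side handling of the NFS case from the note could look roughly like
the sketch below. The function name and the set of filesystem type strings are
assumptions made for this example; a real CSI driver would need to decide
which extra options its filesystems require and measure their cost.

```python
def selinux_mount_options(fstype, selinux_label):
    """Return mount options for a first mount with an SELinux context.

    Hypothetical helper: for NFS it adds `nosharecache`, so that a later mount
    of a different volume from the same server can use a different context.
    """
    options = [f'context="{selinux_label}"']
    if fstype in ("nfs", "nfs4"):
        options.append("nosharecache")
    return options
```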

## Limitation of the initial implementation

{{< caution >}}
Since the `context` mount option always applies to the whole volume, two pods
with two different SELinux contexts may not be able to access the same volume,
even if they use different subpaths of it. Whether a single volume can be
mounted multiple times with different SELinux labels depends on the CSI driver:
it's often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but
it's impossible to mount a single block device with an ext4 filesystem on the
same host twice with different SELinux contexts.
{{< /caution >}}

Due to this limitation, in Kubernetes 1.27 we've chosen to implement the
`context` mount only for PersistentVolumes that have access mode
`ReadWriteOncePod`. Such volumes can be used by only a single pod, and thus
with only one SELinux label.
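
The kind of conflict this limitation is about can be illustrated with a short
check that finds volumes used by pods with different SELinux labels. This is a
simplified sketch over plain tuples, with a hypothetical function name; it does
not query a cluster and is not how Kubernetes itself detects such conflicts.

```python
from collections import defaultdict


def conflicting_volumes(pods):
    """Return names of volumes that are used with more than one SELinux label.

    `pods` is an iterable of (pod_name, selinux_label, volume_names) tuples.
    """
    labels_by_volume = defaultdict(set)
    for _pod_name, label, volume_names in pods:
        for volume in volume_names:
            labels_by_volume[volume].add(label)
    return sorted(
        volume for volume, labels in labels_by_volume.items() if len(labels) > 1
    )


# Illustrative data: "shared" is used with two different labels.
example = [
    ("pod-a", "s0:c1,c2", ["shared", "a-data"]),
    ("pod-b", "s0:c3,c4", ["shared"]),
    ("pod-c", "s0:c1,c2", ["a-data"]),
]
conflicts = conflicting_volumes(example)
```

With recursive relabeling and distinct subpaths, such pods could still share
the volume. With a whole-volume `context` mount, they could not unless the
driver supports mounting the volume more than once with different labels.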

[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
that count how many pods would not start if we extended the implementation to
all volume access modes.

We kindly ask Kubernetes cluster admins to check the metrics and report any
breakage that would be caused by extending the `context` mount to *all* volumes.
Please tag `@jsafrane` in Kubernetes issues.

## How can I learn more?

Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)