Add blog for Speed up recursive SELinux label change beta

kubernetes · Mar 29, 2023 · 1c92352
1 parent 68750e7
commit 1c92352
Showing 1 changed file with 131 additions and 0 deletions.
diff --git a/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md b/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
@@ -0,0 +1,131 @@
+---
+layout: blog
+title: "Kubernetes 1.27: Efficient SELinux volume relabeling (Beta)"
+date: 2023-04-11T10:00:00-08:00
+slug: kubernetes-1-27-efficient-selinux-relabeling-beta
+---
+
+**Author:** Jan Šafránek (Red Hat)
+
+In Kubernetes 1.27, we graduated to beta a more efficient way of applying
+SELinux labels to volumes used by Pods.
+
+## TL;DR
+
+If a Pod has an SELinux context assigned **and** the operating system supports
+SELinux **and** the Pod uses a PersistentVolume with the
+`ReadWriteOncePod` access mode **and** the CSI driver
+that handles the volume announces `SELinuxMount: true` in its CSIDriver
+instance, **then** Kubernetes and the CSI driver mount the volume with the Pod's
+SELinux label directly, and the container runtime does not relabel the files on
+the volume.
+
+Nothing changes on Windows or on Linux that does not use SELinux.
+
+See below for details and future directions.
+
+## SELinux in containers
+
+For background, see the excellent
+[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
+by Daniel J Walsh. Note that the guide predates Kubernetes: it describes
+*Multi-Category Security* (MCS) mode using virtual machines as an example,
+but a similar concept is used for containers.
+
+For details on exactly how container runtimes apply SELinux to containers, see
+this series of blog posts:
+
+* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
+* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
+
+## SELinux in Kubernetes
+
+Kubernetes allows setting the complete pod process label in the `securityContext`
+field of a Pod, or in the `securityContext` of each container in the Pod.
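+
+As a minimal sketch of what this looks like, the manifest below sets the label
+through `seLinuxOptions` at the Pod level. The pod name, image, and the concrete
+label values are assumptions chosen for illustration only; use values that
+match the SELinux policy on your nodes.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: selinux-demo                  # hypothetical name
+spec:
+  securityContext:
+    seLinuxOptions:                   # applies to all containers in the Pod
+      user: system_u                  # assumed values, typical for container policies
+      role: system_r
+      type: container_t
+      level: "s0:c10,c20"             # example MCS level
+  containers:
+  - name: app
+    image: registry.example.com/app:latest   # placeholder image
+```
+
+The same `seLinuxOptions` block can also be placed under the `securityContext`
+of an individual container to set the label for that container only.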
+
+Kubernetes passes the SELinux label to the container runtime, together
+with the pod's volumes and their subpaths. By default, Kubernetes tells the
+container runtime to recursively apply the SELinux label to all files on all
+volumes that support SELinux before running the pod's containers.
+
+{{< caution >}}
+The container runtime relabels only the part of a volume that's visible to the
+running container(s). If a container uses `subPath` of a volume, only that
+`subPath` is relabeled.
+
+This allows two pods that have two different SELinux labels to use the same
+volume, as long as they use different subpaths of it.
+{{< /caution >}}
+
+{{< caution >}}
+If a Pod does not have any SELinux label assigned in the Kubernetes API, the
+container runtime assigns a unique random one, so a process that potentially
+escapes the container boundary cannot access data of any other container on the
+host. The container runtime still recursively relabels all pod volumes with this
+random SELinux label.
+{{< /caution >}}
+
+It's up to the cluster user, or a security-related admission plugin, to set the
+SELinux labels on Pods so that Pods that should share volumes have the same
+SELinux label.
+
+## Improvement using mount options
+
+A Linux kernel with SELinux support allows the first mount of a volume to set
+the SELinux label on the whole volume using the `-o context=<SELinux label>`
+mount option. This way, all files get the given label in constant
+time, without recursively walking through the whole volume.
+
+The `context` mount option cannot be applied to bind mounts or remounts of
+already mounted volumes. Since it is the CSI driver that does the first mount of
+a volume, it must be the CSI driver that actually applies this mount option. We
+added a new field, `SELinuxMount`, to the CSIDriver object, so CSI drivers can
+announce whether they support the `-o context` mount option.
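+
+A CSIDriver object that opts in might look like the sketch below. The driver
+name is hypothetical; in a manifest, the field is serialized as `seLinuxMount`
+under `spec`.
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: CSIDriver
+metadata:
+  name: csi.example.com     # hypothetical driver name
+spec:
+  seLinuxMount: true        # driver announces support for -o context mounts
+```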
+
+If Kubernetes knows the SELinux label of a Pod **and** the CSI driver
+responsible for a pod's volume announces `SELinuxMount: true` **and** the volume
+has access mode `ReadWriteOncePod`, then Kubernetes asks the CSI driver to mount
+the volume with the mount option `context=<SELinux label>` **and** tells the
+container runtime not to relabel the content of the volume, because all files
+already have the right label.
+
+{{< note >}}
+Not all filesystems support the `-o context` mount option out of the box. For
+example, blindly passing `-o context=<SELinux label>` when mounting a share from
+an NFS server would set the SELinux context for all subsequent mounts from the
+same server. A CSI driver that uses NFS must be smart enough to add the
+`nosharecache` mount option, so a subsequent mount of a different volume from
+the same NFS server can have a different `context` option. It's up to the CSI
+driver vendor to carefully weigh the benefit of applying the SELinux label in
+constant time against the potential performance impact of the necessary mount
+options, and to test the CSI driver in an SELinux-enabled environment before
+setting `SELinuxMount` to `true`.
+{{< /note >}}
+
+## Limitation of the initial implementation
+
+{{< caution >}}
+Since the `context` mount option always applies to the whole volume, two pods
+with two different SELinux contexts may not be able to access the same volume,
+even if they use different subpaths of it. Whether a single volume can be
+mounted multiple times with different SELinux labels depends on the CSI driver:
+it's often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but
+it's impossible to mount a single block device with an ext4 filesystem on the
+same host twice with different SELinux contexts.
+{{< /caution >}}
+
+Due to this limitation, in Kubernetes 1.27 we've chosen to implement the
+`context` mount only for PersistentVolumes that have the `ReadWriteOncePod`
+access mode. Such volumes can be used by only a single pod, and thus with only
+one SELinux label.
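+
+For illustration, a claim that requests such a volume could be sketched as
+follows. The claim name, storage class, and size are assumptions; the part that
+matters for this feature is `ReadWriteOncePod` in `accessModes`.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: selinux-demo-data         # hypothetical name
+spec:
+  accessModes:
+  - ReadWriteOncePod              # volume usable by a single pod only
+  storageClassName: example-sc    # hypothetical class backed by a CSI driver
+  resources:
+    requests:
+      storage: 1Gi
+```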
+
+[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
+that count how many pods would not start if we extended the implementation to
+all volume Access Modes.
+
+We kindly ask Kubernetes cluster admins to check the metrics and report any
+breakage that would be caused by extending the `context` mount to *all* volumes.
+Please tag `@jsafrane` in Kubernetes issues.
+
+## How can I learn more?
+Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)