Rephrase blog

kubernetes · Apr 4, 2023 · c9321b0
1 parent 1c92352
commit c9321b0
Showing 1 changed file with 90 additions and 95 deletions.
diff --git a/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md b/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
@@ -7,125 +7,120 @@ slug: kubernetes-1-27-efficient-selinux-relabeling-beta
 
 **Author:** Jan Šafránek (Red Hat)
 
-In Kubernetes 1.27 we graduated to beta a more efficient way, how SELinux labels
-are applied to volumes used by Pods.
+# The problem
 
-## Tl;Dr
+On Linux with Security-Enhanced Linux (SELinux) enabled, it's traditionally
+the container runtime that applies SELinux labels to a Pod and all its volumes.
+Kubernetes only passes the SELinux label from the Pod's Security Context fields
+to the container runtime.
 
-If a Pod has SELinux context assigned **and** the operating system supports
-SELinux **and** the Pod uses a PersistentVolume with
-`accessMode: ReadWriteOncePod` **and** the CSI driver
-that handles the volume announces `SELinuxMount: true` in its CSIDriver
-instance, **then** Kubernetes + the CSI driver mounts the volume with the Pod's
-SELinux label directly, and the container runtime will not relabel the files on
-the volume.
+The container runtime then recursively changes the SELinux label on all files
+that are visible to the Pod's containers. This can be time-consuming if there
+are many files on the volume, especially when the volume is on a remote filesystem.
 
-Nothing changes on Windows or on Linux that does not use SELinux.
+{{< note >}}
+If a container uses a `subPath` of a volume, only that `subPath` of the whole
+volume is relabeled. This allows two pods that have two different SELinux labels
+to use the same volume, as long as they use different subpaths of it.
+{{< /note >}}
 
-See below for more description and future direction.
+{{< note >}}
+If a Pod does not have any SELinux label assigned in the Kubernetes API, the
+container runtime assigns a unique random one, so a process that potentially
+escapes the container boundary cannot access data of any other container on the
+host. The container runtime still recursively relabels all pod volumes with this
+random SELinux label.
+{{< /note >}}
 
-## SELinux in containers
+# Improvement using mount options
 
-See excellent
-[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
-by Daniel J Walsh. Note that the guide is older than Kubernetes, it describes
-*Multi-Category Security* (MCS) mode using virtual machines as an example,
-however, similar concept is used for containers.
+If a Pod and its volume satisfy **all** of the following conditions, Kubernetes
+will _mount_ the volume directly with the right SELinux label. Such a mount
+happens in constant time, and the container runtime does not need to
+recursively relabel any files on it.
 
-See a series of blog posts for details how exactly SELinux is applied to
-containers by container runtimes:
+1. The operating system must support SELinux.
 
-* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
-* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
+   Without SELinux support detected, the kubelet and the container runtime do
+   not do anything with regard to SELinux.
 
-## SELinux in Kubernetes
+1. The [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
+   `ReadWriteOncePod` and `SELinuxMountReadWriteOncePod` must be enabled.
+   These feature gates are Beta in Kubernetes 1.27 and Alpha in 1.25.
 
-Kubernetes allows setting the complete pod process label in `securityContext`
-field of a Pod, or in `securityContext` of each container in the Pod.
+   If either of these feature gates is disabled, SELinux labels are always
+   applied by the container runtime through a recursive walk of the volume
+   (or its subPaths).
 
-Kubernetes passes the SELinux label to the container runtime, together
-with pod's volumes and their subpaths. By default, Kubernetes tells the
-container runtime to recursively apply the SELinux label to all files on all
-volumes that support SELinux before running the pod containers.
+1. The Pod must have at least `seLinuxOptions.level` assigned in its [Pod Security Context](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context) or all Pod containers must have it set in their [Security Contexts](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1).
+   Kubernetes reads the default `user`, `role` and `type` from the operating
+   system (typically `system_u`, `system_r` and `container_t`).
 
-{{< caution >}}
-The container runtime relabels only the part of a volume that's visible to the
-running container(s). If a container uses `subPath` of a volume, only that
-`subPath` is relabeled.
+   Without Kubernetes knowing at least the SELinux `level`, the container
+   runtime will assign a random one _after_ the volumes are mounted. The
+   container runtime will still relabel the volumes recursively in that case.
 
-This allows two pods that have two different SELinux labels to use the same
-volume, as long as they use different subpaths of it.
-{{< /caution >}}
+1. The volume must be a Persistent Volume with
+   [Access Mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
+   `ReadWriteOncePod`.
 
-{{< caution >}}
-If a Pod does not have any SELinux label assigned in Kubernetes API, the
-container runtime assigns a unique random one, so a process that potentially
-escapes the container boundary cannot access data of any other container on the
-host. The container runtime still recursively relabels all pod volumes with this
-random SELinux label.
-{{< /caution >}}
+   {{< caution >}}
+   This is a limitation of the initial implementation. As described above,
+   two Pods can have different SELinux labels and still use the same volume,
+   as long as each uses a different `subPath` of it. This use case is not
+   possible when the volumes are _mounted_ with the SELinux label, because the
+   whole volume is mounted and most filesystems don't support mounting a single
+   volume multiple times with multiple SELinux labels.
+   {{< /caution >}}
 
-It's up to the cluster user, or a security related admission plugin, to set the
-SELinux labels on Pods so Pods that should share volumes have the same SELinux
-label.
+   {{< note >}}
+   Please report in
+   [the feature issue](https://github.com/kubernetes/enhancements/issues/1710)
+   if running two Pods with two different SELinux contexts and using
+   different `subPaths` of the same volume is necessary in your deployments.
+   Such pods may not run when we extend the feature to all volume access modes.
+   {{< /note >}}
 
-# Improvement using mount options
+1. The volume plugin or the CSI driver responsible for the volume must support
+   mounting with SELinux mount options.
 
-Linux kernel with SELinux support allows the first mount of a volume to set
-SELinux label on the whole volume using `-o context=<SELinux label>` mount
-option. This way, all files will have assigned the given label in a constant
-time, without recursively walking through the whole volumes.
+   These in-tree volume plugins support mounting with SELinux mount options:
+   `fc`, `iscsi`, and `rbd`.
 
-`context` mount option cannot be applied to bind-mounts or re-mounts of already
-mounted volumes. Since it's a CSI driver that does the first mount of a volume,
-it must be the CSI driver who actually applies this mount option. We added a new
-field `SELinuxMount` to CSI Driver object, so CSI drivers can announce if they
-support `-o context` mount option.
+   CSI drivers that support mounting with SELinux mount options must announce
+   that in their
+   [CSI Driver](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/)
+   instance by setting the `seLinuxMount` field to `true`.
 
-If Kubernetes knows SELinux label of a Pod **and** CSI driver responsible for
-a pod's volume announces `SELinuxMount: true` **and** the volume has Access Mode
-`ReadWriteOncePod`, then it will ask the CSI driver to mount the volume with
-mount option `context=<SELinux label>` **and** it will tell the container
-runtime not to relabel content of the volume - all files already have the right
-label.
+   Volumes managed by other volume plugins or CSI drivers that don't
+   set `seLinuxMount: true` will be recursively relabelled by the container
+   runtime.
 
-{{< note >}}
-Not all filesystems support `-o context` mount option out of the box. For
-example, blindly passing `-o context=<SELinux label>` to mount of a share from a
-NFS server would set the SELinux context for all subsequent mounts from the same
-server. A CSI driver that uses NFS must be smart enough to add `nosharecache`
-mount option, so a subsequent mount of a different volume from the same NFS
-server can have a different `context` option. It's up to a CSI driver vendor
-to carefully weight benefits of applying SELinux label in a constant time
-and potential performance impact caused by the necessary mount options
-and to test the CSI driver in a SELinux enabled environment before setting
-`SELinuxMount` to `true`.
-{{< /note >}}
+## Mounting with SELinux context
 
-## Limitation of the initial implementation
+When all of the aforementioned conditions are met, the kubelet passes the
+`-o context=<SELinux label>` mount option to the volume plugin or CSI
+driver. CSI driver vendors must ensure that their CSI driver supports this
+mount option and, if necessary, that the driver appends any other mount
+options that are needed for `-o context` to work.
 
-{{< caution >}}
-Since the `context` mount option always applies to the whole volume, two pods
-with two different SELinux context may not access the same volume, even if
-they use different subpaths of it. It depends on the CSI driver if it supports
-mounting a single volume multiple times with different SELinux labels - it's
-often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but it's
-impossible to mount a single block device with ext4 filesystem on the
-same host twice with different SELinux contexts.
-{{< /caution >}}
+For example, NFS may need `-o context=<SELinux label>,nosharecache`, so each
+volume mounted from the same NFS server can have a different SELinux label
+value. Similarly, CIFS may need `-o context=<SELinux label>,nosharesock`.
 
-Due to this limitation, we've chosen to implement `context` mount only for
-Persistent Volumes that have Access Mode `ReadWriteOncePod` in Kubernetes 1.27.
-Such volumes can be used only by a single pod and thus only with one SELinux
-label.
+It's up to the CSI driver vendor to test their CSI driver in an SELinux-enabled
+environment before setting `seLinuxMount: true` in the CSI Driver instance.
 
-[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
-that count how many pods would not start if we extended the implementation to
-all volume Access Modes.
+# How can I learn more?
+SELinux in containers: see the excellent
+[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
+by Daniel J Walsh. Note that the guide is older than Kubernetes; it describes
+*Multi-Category Security* (MCS) mode using virtual machines as an example,
+but a similar concept is used for containers.
 
-We kindly ask Kubernetes cluster admins to check the metrics and report any
-breakage that would be caused by extending the `context` mount to *all* volumes.
-Please tag `@jsafrane` in Kubernetes issues.
+See this series of blog posts for details on how exactly SELinux is applied to
+containers by container runtimes:
+* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
+* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
 
-# How can I learn more?
 Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)
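
The conditions that the new text lists under "Improvement using mount options", and the mount options described under "Mounting with SELinux context", can be summarized as a small decision function. The following is only a minimal sketch of that logic in Python, not the kubelet's actual code: the names `VolumeInfo`, `can_mount_with_context` and `mount_options`, the dictionary of feature gates, and the example label in the usage note are all hypothetical. Quoting the `context` value is also an assumption here, made because SELinux category lists contain commas that would otherwise be read as option separators.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Hypothetical model of the conditions listed in the post; none of these
# names come from Kubernetes itself.

# In-tree volume plugins the post lists as supporting SELinux mount options.
IN_TREE_SELINUX_PLUGINS = {"fc", "iscsi", "rbd"}

# Extra mount options the post says may be needed next to `context`.
EXTRA_OPTIONS = {"nfs": "nosharecache", "cifs": "nosharesock"}

REQUIRED_GATES = ("ReadWriteOncePod", "SELinuxMountReadWriteOncePod")


@dataclass
class VolumeInfo:
    access_modes: Sequence[str]
    in_tree_plugin: Optional[str] = None  # e.g. "iscsi"; None for CSI volumes
    csi_selinux_mount: bool = False       # value of seLinuxMount in the CSI Driver instance


def can_mount_with_context(
    selinux_supported: bool,
    feature_gates: Mapping[str, bool],
    selinux_level: Optional[str],
    volume: VolumeInfo,
) -> bool:
    """Return True only if every condition from the post holds."""
    if not selinux_supported:
        return False  # condition 1: the OS must support SELinux
    if not all(feature_gates.get(gate, False) for gate in REQUIRED_GATES):
        return False  # condition 2: both feature gates enabled
    if not selinux_level:
        return False  # condition 3: at least seLinuxOptions.level is set
    if "ReadWriteOncePod" not in volume.access_modes:
        return False  # condition 4: ReadWriteOncePod access mode
    # condition 5: the plugin or the CSI driver supports SELinux mount options
    return volume.in_tree_plugin in IN_TREE_SELINUX_PLUGINS or volume.csi_selinux_mount


def mount_options(label: str, fs_type: str) -> str:
    """Build the option string, adding the filesystem-specific extra option."""
    options = [f'context="{label}"']  # quoting is an assumption, see lead-in
    extra = EXTRA_OPTIONS.get(fs_type)
    if extra:
        options.append(extra)
    return ",".join(options)
```

For instance, with both feature gates enabled, a level of `s0:c10,c20`, and a `ReadWriteOncePod` volume whose CSI driver sets `seLinuxMount: true`, `can_mount_with_context` returns `True`. Dropping the level makes it return `False`, which corresponds to the recursive relabeling fallback described in the post. Calling `mount_options` with an illustrative label and `"nfs"` appends `nosharecache` after the `context` option.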