diff --git a/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md b/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
index 94fbbd631292b..42ecaede37cef 100644
--- a/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
+++ b/content/en/blog/_posts/kubernetes-1-27-efficient-selinux-relabeling-beta.md
@@ -7,125 +7,120 @@ slug: kubernetes-1-27-efficient-selinux-relabeling-beta
 
 **Author:** Jan Šafránek (Red Hat)
 
-In Kubernetes 1.27 we graduated to beta a more efficient way, how SELinux labels
-are applied to volumes used by Pods.
+# The problem
 
-## Tl;Dr
+On Linux with Security-Enhanced Linux (SELinux) enabled, it's traditionally
+the container runtime who applies SELinux labels to a Pod and all it's volumes.
+Kubernetes only provides the SELinux label from Pod's Security Context fields
+to the container runtime.
 
-If a Pod has SELinux context assigned **and** the operating system supports
-SELinux **and** the Pod uses a PersistentVolume with
-`accessMode: ReadWriteOncePod` **and** the CSI driver
-that handles the volume announces `SELinuxMount: true` in its CSIDriver
-instance, **then** Kubernetes + the CSI driver mounts the volume with the Pod's
-SELinux label directly, and the container runtime will not relabel the files on
-the volume.
+The container runtime then recursively changes SELinux label on all files that
+are visible to the Pod's containers. This can be time-consuming, if there are
+many files on the volume, especially when the volume is on a remote filesystem.
 
-Nothing changes on Windows or on Linux that does not use SELinux.
+{{< note >}}
+If a container uses `subPath` of a volume, only that `subPath` of the whole
+volume is relabeled. This allows two pods that have two different SELinux labels
+to use the same volume, as long as they use different subpaths of it.
+{{< /note >}}
 
-See below for more description and future direction.
+{{< note >}}
+If a Pod does not have any SELinux label assigned in Kubernetes API, the
+container runtime assigns a unique random one, so a process that potentially
+escapes the container boundary cannot access data of any other container on the
+host. The container runtime still recursively relabels all pod volumes with this
+random SELinux label.
+{{< /note >}}
 
-## SELinux in containers
+# Improvement using mount options
 
-See excellent
-[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
-by Daniel J Walsh. Note that the guide is older than Kubernetes, it describes
-*Multi-Category Security* (MCS) mode using virtual machines as an example,
-however, similar concept is used for containers.
+If a Pod + its volume satisfies **all** following conditions, Kubernetes will
+_mount_ the volume directly with the right SELinux label. Such mount will happen
+in a constant time and the container runtime will not need to recursively
+relabel any files on it.
 
-See a series of blog posts for details how exactly SELinux is applied to
-containers by container runtimes:
+1. The operating system must support SELinux.
 
-* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
-* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
+   Without SELinux support detected, kubelet and the container runtime does not
+   do anything with regard to SELinux.
 
-## SELinux in Kubernetes
+1. The [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
+   `ReadWriteOncePod` and `SELinuxMountReadWriteOncePod` must be enabled.
+   These feature gates are Beta in Kubernetes 1.27 and Alpha in 1.25.
 
-Kubernetes allows setting the complete pod process label in `securityContext`
-field of a Pod, or in `securityContext` of each container in the Pod.
+   With any of these feature gates disabled, SELinux labels will be always
+   applied by the container runtime by a recursive walk through the volume
+   (or its subPaths).
 
-Kubernetes passes the SELinux label to the container runtime, together
-with pod's volumes and their subpaths. By default, Kubernetes tells the
-container runtime to recursively apply the SELinux label to all files on all
-volumes that support SELinux before running the pod containers.
+1. The Pod must have at least `seLinuxOptions.level` assigned in its [Pod Security Context](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context) or all Pod containers must have it set in their [Security Contexts](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1).
+   Kubernetes will read the default `user`, `role` and `type` from the operating
+   system defaults (typically `system_u`, `system_r` and `container_t`).
 
-{{< caution >}}
-The container runtime relabels only the part of a volume that's visible to the
-running container(s). If a container uses `subPath` of a volume, only that
-`subPath` is relabeled.
+   Without Kubernetes knowing at least the SELinux `level`, the container
+   runtime will assign a random one _after_ the volumes are mounted. The
+   container runtime will still relabel the volumes recursively in that case.
 
-This allows two pods that have two different SELinux labels to use the same
-volume, as long as they use different subpaths of it.
-{{< /caution >}}
+1. The volume must be a Persistent Volume with
+   [Access Mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
+   `ReadWriteOncePod`.
 
-{{< caution >}}
-If a Pod does not have any SELinux label assigned in Kubernetes API, the
-container runtime assigns a unique random one, so a process that potentially
-escapes the container boundary cannot access data of any other container on the
-host. The container runtime still recursively relabels all pod volumes with this
-random SELinux label.
-{{< /caution >}}
+   {{< caution >}}
+   This is a limitation of the initial implementation. As described above,
+   two Pods can have a different SELinux label and still use the same volume,
+   as long as they use a different `subPath` of it. This use case is not
+   possible when the volumes are _mounted_ with the SELinux label, because the
+   whole volume is mounted and most filesystems don't support mounting a single
+   volume multiple times with multiple SELinux labels.
+   {{< /caution >}}
 
-It's up to the cluster user, or a security related admission plugin, to set the
-SELinux labels on Pods so Pods that should share volumes have the same SELinux
-label.
+   {{< note >}}
+   Please report in
+   [the feature issue](https://github.com/kubernetes/enhancements/issues/1710)
+   if running two Pods with two different SELinux contexts and using
+   different `subPaths` of the same volume is necessary in your deployments.
+   Such pods may not run when we extend the feature to all volume access modes.
+   {{< /note >}}
 
-# Improvement using mount options
+1. The volume plugin or the CSI driver responsible for the volume supports
+   mounting with SELinux mount options.
 
-Linux kernel with SELinux support allows the first mount of a volume to set
-SELinux label on the whole volume using `-o context=` mount
-option. This way, all files will have assigned the given label in a constant
-time, without recursively walking through the whole volumes.
+   These in-tree volume plugins support mounting with SELinux mount options:
+   `fc`, `iscsi`, and `rbd`.
 
-`context` mount option cannot be applied to bind-mounts or re-mounts of already
-mounted volumes. Since it's a CSI driver that does the first mount of a volume,
-it must be the CSI driver who actually applies this mount option. We added a new
-field `SELinuxMount` to CSI Driver object, so CSI drivers can announce if they
-support `-o context` mount option.
+   CSI drivers that support mounting with SELinux mount options must announce
+   that in their
+   [CSI Driver](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/)
+   instance by setting `seLinuxMount` field.
 
-If Kubernetes knows SELinux label of a Pod **and** CSI driver responsible for
-a pod's volume announces `SELinuxMount: true` **and** the volume has Access Mode
-`ReadWriteOncePod`, then it will ask the CSI driver to mount the volume with
-mount option `context=` **and** it will tell the container
-runtime not to relabel content of the volume - all files already have the right
-label.
+   Volumes managed by other volume plugins or CSI drivers that don't
+   set `seLinuxMount: true` will be recursively relabelled by the container
+   runtime.
 
-{{< note >}}
-Not all filesystems support `-o context` mount option out of the box. For
-example, blindly passing `-o context=` to mount of a share from a
-NFS server would set the SELinux context for all subsequent mounts from the same
-server. A CSI driver that uses NFS must be smart enough to add `nosharecache`
-mount option, so a subsequent mount of a different volume from the same NFS
-server can have a different `context` option. It's up to a CSI driver vendor
-to carefully weight benefits of applying SELinux label in a constant time
-and potential performance impact caused by the necessary mount options
-and to test the CSI driver in a SELinux enabled environment before setting
-`SELinuxMount` to `true`.
-{{< /note >}}
+## Mounting with SELinux context
 
-## Limitation of the initial implementation
+When all aforementioned conditions are met, kubelet will
+pass `-o context=` mount option to the volume plugin or CSI
+driver. CSI driver vendors must ensure that this mount option is supported
+by their CSI driver and, if necessary, the CSI driver appends other mount
+options that are needed for `-o context` to work.
 
-{{< caution >}}
-Since the `context` mount option always applies to the whole volume, two pods
-with two different SELinux context may not access the same volume, even if
-they use different subpaths of it. It depends on the CSI driver if it supports
-mounting a single volume multiple times with different SELinux labels - it's
-often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but it's
-impossible to mount a single block device with ext4 filesystem on the
-same host twice with different SELinux contexts.
-{{< /caution >}}
+For example, NFS may need `-o context=,nosharecache`, so each
+volume mounted from the same NFS server can have a different SELinux label
+value. Similarly, CIFS may need `-o context=,nosharesock`.
 
-Due to this limitation, we've chosen to implement `context` mount only for
-Persistent Volumes that have Access Mode `ReadWriteOncePod` in Kubernetes 1.27.
-Such volumes can be used only by a single pod and thus only with one SELinux
-label.
+It's up to the CSI driver vendor to test their CSI driver in a SELinux enabled
+environment before setting `seLinuxMount: true` in the CSI Driver instance.
 
-[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
-that count how many pods would not start if we extended the implementation to
-all volume Access Modes.
+# How can I learn more?
+SELinux in containers: see excellent
+[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
+by Daniel J Walsh. Note that the guide is older than Kubernetes, it describes
+*Multi-Category Security* (MCS) mode using virtual machines as an example,
+however, similar concept is used for containers.
 
-We kindly ask Kubernetes cluster admins to check the metrics and report any
-breakage that would be caused by extending the `context` mount to *all* volumes.
-Please tag `@jsafrane` in Kubernetes issues.
+See a series of blog posts for details how exactly SELinux is applied to
+containers by container runtimes:
+* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
+* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
 
-# How can I learn more?
 Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)
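
The conditions listed in the new text of this change can be illustrated with a small set of manifests. This is only a sketch and is not part of the patch. The driver name `example.csi.vendor.io`, the storage class `example-sc`, the object names, the image, and the level value `s0:c123,c456` are all hypothetical. It also assumes a node with SELinux support and the `ReadWriteOncePod` and `SELinuxMountReadWriteOncePod` feature gates enabled.

```yaml
# Hypothetical CSIDriver object; normally installed by the CSI driver vendor.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.vendor.io      # made-up driver name
spec:
  seLinuxMount: true               # driver announces support for -o context mounts
---
# Claim that only a single Pod may use, as the access-mode condition requires.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOncePod
  storageClassName: example-sc     # hypothetical class backed by the driver above
  resources:
    requests:
      storage: 1Gi
---
# Pod that sets at least the SELinux level, so the label is known before mounting.
apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"        # example MCS level; user, role and type come from OS defaults
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim
```

If any one of these pieces is missing (for example, the claim uses `ReadWriteOnce`, or the Pod sets no `seLinuxOptions.level`), the text above implies the container runtime falls back to recursive relabeling.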