Rephrase blog
jsafrane committed Apr 4, 2023
1 parent 1c92352 commit c9321b0
Showing 1 changed file with 90 additions and 95 deletions.
@@ -7,125 +7,120 @@ slug: kubernetes-1-27-efficient-selinux-relabeling-beta

**Author:** Jan Šafránek (Red Hat)

In Kubernetes 1.27, a more efficient way of applying SELinux labels to volumes
used by Pods graduated to beta.
# The problem

## Tl;Dr
On Linux with Security-Enhanced Linux (SELinux) enabled, it's traditionally
the container runtime that applies SELinux labels to a Pod and all its volumes.
Kubernetes only passes the SELinux label from the Pod's Security Context fields
to the container runtime.

If a Pod has SELinux context assigned **and** the operating system supports
SELinux **and** the Pod uses a PersistentVolume with
`accessMode: ReadWriteOncePod` **and** the CSI driver
that handles the volume announces `SELinuxMount: true` in its CSIDriver
instance, **then** Kubernetes + the CSI driver mounts the volume with the Pod's
SELinux label directly, and the container runtime will not relabel the files on
the volume.
The container runtime then recursively changes the SELinux label on all files that
are visible to the Pod's containers. This can be time-consuming if there are
many files on the volume, especially when the volume is on a remote filesystem.

Nothing changes on Windows or on Linux that does not use SELinux.
{{< note >}}
If a container uses `subPath` of a volume, only that `subPath` of the whole
volume is relabeled. This allows two pods that have two different SELinux labels
to use the same volume, as long as they use different subpaths of it.
{{< /note >}}

See below for more description and future direction.
{{< note >}}
If a Pod does not have any SELinux label assigned in Kubernetes API, the
container runtime assigns a unique random one, so a process that potentially
escapes the container boundary cannot access data of any other container on the
host. The container runtime still recursively relabels all pod volumes with this
random SELinux label.
{{< /note >}}

## SELinux in containers
# Improvement using mount options

See the excellent
[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
by Daniel J Walsh. Note that the guide is older than Kubernetes; it describes
*Multi-Category Security* (MCS) mode using virtual machines as an example,
but a similar concept is used for containers.
If a Pod and its volume satisfy **all** of the following conditions, Kubernetes will
_mount_ the volume directly with the right SELinux label. Such a mount happens
in constant time, and the container runtime does not need to recursively
relabel any files on it.

See this series of blog posts for details on how exactly SELinux is applied to
containers by container runtimes:
1. The operating system must support SELinux.

* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)
Without SELinux support detected, kubelet and the container runtime do not
do anything with regard to SELinux.

## SELinux in Kubernetes
1. The [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
`ReadWriteOncePod` and `SELinuxMountReadWriteOncePod` must be enabled.
Both feature gates are Beta in Kubernetes 1.27; `ReadWriteOncePod` was introduced as Alpha in 1.22 and `SELinuxMountReadWriteOncePod` as Alpha in 1.25.

Kubernetes allows setting the complete pod process label in `securityContext`
field of a Pod, or in `securityContext` of each container in the Pod.
With either of these feature gates disabled, SELinux labels will always be
applied by the container runtime through a recursive walk of the volume
(or its subPaths).

Kubernetes passes the SELinux label to the container runtime, together
with pod's volumes and their subpaths. By default, Kubernetes tells the
container runtime to recursively apply the SELinux label to all files on all
volumes that support SELinux before running the pod containers.
1. The Pod must have at least `seLinuxOptions.level` assigned in its [Pod Security Context](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context) or all Pod containers must have it set in their [Security Contexts](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1).
Kubernetes will read the default `user`, `role` and `type` from the operating
system defaults (typically `system_u`, `system_r` and `container_t`).

{{< caution >}}
The container runtime relabels only the part of a volume that's visible to the
running container(s). If a container uses `subPath` of a volume, only that
`subPath` is relabeled.
Without Kubernetes knowing at least the SELinux `level`, the container
runtime will assign a random one _after_ the volumes are mounted. The
container runtime will still relabel the volumes recursively in that case.

This allows two pods that have two different SELinux labels to use the same
volume, as long as they use different subpaths of it.
{{< /caution >}}
1. The volume must be a Persistent Volume with
[Access Mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
`ReadWriteOncePod`.

{{< caution >}}
If a Pod does not have any SELinux label assigned in Kubernetes API, the
container runtime assigns a unique random one, so a process that potentially
escapes the container boundary cannot access data of any other container on the
host. The container runtime still recursively relabels all pod volumes with this
random SELinux label.
{{< /caution >}}
{{< caution >}}
This is a limitation of the initial implementation. As described above,
two Pods can have a different SELinux label and still use the same volume,
as long as they use a different `subPath` of it. This use case is not
possible when the volumes are _mounted_ with the SELinux label, because the
whole volume is mounted and most filesystems don't support mounting a single
volume multiple times with multiple SELinux labels.
{{< /caution >}}

It's up to the cluster user, or a security related admission plugin, to set the
SELinux labels on Pods so Pods that should share volumes have the same SELinux
label.
{{< note >}}
Please report in
[the feature issue](https://github.com/kubernetes/enhancements/issues/1710)
if running two Pods with two different SELinux contexts and using
different `subPaths` of the same volume is necessary in your deployments.
Such pods may not run when we extend the feature to all volume access modes.
{{< /note >}}

# Improvement using mount options
1. The volume plugin or the CSI driver responsible for the volume supports
mounting with SELinux mount options.

A Linux kernel with SELinux support allows the first mount of a volume to set
the SELinux label on the whole volume using the `-o context=<SELinux label>` mount
option. This way, all files are assigned the given label in constant
time, without recursively walking through the whole volume.
These in-tree volume plugins support mounting with SELinux mount options:
`fc`, `iscsi`, and `rbd`.

The `context` mount option cannot be applied to bind mounts or remounts of already
mounted volumes. Since it's the CSI driver that does the first mount of a volume,
it must be the CSI driver that actually applies this mount option. We added a new
field `SELinuxMount` to the CSI Driver object, so CSI drivers can announce whether they
support the `-o context` mount option.
CSI drivers that support mounting with SELinux mount options must announce
that in their
[CSI Driver](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/)
instance by setting the `seLinuxMount` field.

If Kubernetes knows SELinux label of a Pod **and** CSI driver responsible for
a pod's volume announces `SELinuxMount: true` **and** the volume has Access Mode
`ReadWriteOncePod`, then it will ask the CSI driver to mount the volume with
mount option `context=<SELinux label>` **and** it will tell the container
runtime not to relabel content of the volume - all files already have the right
label.
Volumes managed by other volume plugins or CSI drivers that don't
set `seLinuxMount: true` will be recursively relabelled by the container
runtime.
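
As a rough illustration of the conditions listed above, the manifests below sketch a claim and a Pod that would qualify for mounting with the SELinux label. All names (`data-claim`, `example-csi-sc`, `fast-relabel-demo`), the image and the MCS level are made up for this example, and the storage class is assumed to be backed by a CSI driver that sets `seLinuxMount: true`.

```yaml
# Illustrative sketch only; every name and value here is hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
    - ReadWriteOncePod               # the volume may be used by a single Pod only
  storageClassName: example-csi-sc   # assumed: its CSI driver announces seLinuxMount: true
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fast-relabel-demo
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c10,c20"            # at least the level must be set; example MCS value
  containers:
    - name: app
      image: registry.example/app:latest   # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-claim
```

Setting the level once in the Pod-level `securityContext` keeps the example short; setting it in the `securityContext` of every container instead should satisfy the same condition.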

{{< note >}}
Not all filesystems support the `-o context` mount option out of the box. For
example, blindly passing `-o context=<SELinux label>` to a mount of a share from an
NFS server would set the SELinux context for all subsequent mounts from the same
server. A CSI driver that uses NFS must be smart enough to add the `nosharecache`
mount option, so a subsequent mount of a different volume from the same NFS
server can have a different `context` option. It's up to a CSI driver vendor
to carefully weigh the benefits of applying the SELinux label in constant time
against the potential performance impact of the necessary mount options,
and to test the CSI driver in an SELinux-enabled environment before setting
`SELinuxMount` to `true`.
{{< /note >}}
## Mounting with SELinux context

## Limitation of the initial implementation
When all the aforementioned conditions are met, kubelet will
pass the `-o context=<SELinux label>` mount option to the volume plugin or CSI
driver. CSI driver vendors must ensure that this mount option is supported
by their CSI driver and that, if necessary, the CSI driver appends other mount
options that are needed for `-o context` to work.

{{< caution >}}
Since the `context` mount option always applies to the whole volume, two pods
with two different SELinux contexts may not be able to access the same volume, even if
they use different subpaths of it. It depends on the CSI driver whether it supports
mounting a single volume multiple times with different SELinux labels. It's
often easy for shared filesystems like NFS, CIFS, GlusterFS and CephFS, but it's
impossible to mount a single block device with an ext4 filesystem on the
same host twice with different SELinux contexts.
{{< /caution >}}
For example, NFS may need `-o context=<SELinux label>,nosharecache`, so each
volume mounted from the same NFS server can have a different SELinux label
value. Similarly, CIFS may need `-o context=<SELinux label>,nosharesock`.

Due to this limitation, we've chosen to implement `context` mount only for
Persistent Volumes that have Access Mode `ReadWriteOncePod` in Kubernetes 1.27.
Such volumes can be used only by a single pod and thus only with one SELinux
label.
It's up to the CSI driver vendor to test their CSI driver in an SELinux-enabled
environment before setting `seLinuxMount: true` in the CSI Driver instance.
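
For reference, a CSI Driver instance that announces support for the `-o context` mount option might look like the sketch below. The driver name `csi.example.com` is hypothetical; a real driver would normally ship this object as part of its own deployment manifests.

```yaml
# Illustrative sketch only; csi.example.com is a made-up driver name.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.com
spec:
  seLinuxMount: true   # the driver supports mounting with -o context=<SELinux label>
```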

[The KEP describes additional metrics](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling#monitoring-requirements)
that count how many pods would not start if we extended the implementation to
all volume Access Modes.
# How can I learn more?
SELinux in containers: see the excellent
[visual SELinux guide](https://opensource.com/business/13/11/selinux-policy-guide)
by Daniel J Walsh. Note that the guide is older than Kubernetes; it describes
*Multi-Category Security* (MCS) mode using virtual machines as an example,
but a similar concept is used for containers.

We kindly ask Kubernetes cluster admins to check the metrics and report any
breakage that would be caused by extending the `context` mount to *all* volumes.
Please tag `@jsafrane` in Kubernetes issues.
See this series of blog posts for details on how exactly SELinux is applied to
containers by container runtimes:
* [How SELinux separates containers using Multi-Level Security](https://www.redhat.com/en/blog/how-selinux-separates-containers-using-multi-level-security)
* [Why you should be using Multi-Category Security for your Linux containers](https://www.redhat.com/en/blog/why-you-should-be-using-multi-category-security-your-linux-containers)

# How can I learn more?
Read the KEP: [Speed up SELinux volume relabeling using mounts](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling)
