Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: send pod information in NodePublishVolumeRequest #2439

Merged
merged 1 commit into from
Jan 29, 2019
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Pod in CSI NodePublish request
Author: @jsafrane

## Goal
* Pass Pod information (pod name/namespace/UID + service account) to CSI drivers in `NodePublish` request as CSI volume attributes.

## Motivation
We'd like to move away from exec based Flex to gRPC based CSI volumes. In Flex, kubelet always passes `pod.namespace`, `pod.name`, `pod.uid` and `pod.spec.serviceAccountName` ("pod information") in every `mount` call. In Kubernetes community we've seen some Flex drivers that use pod or service account information to authorize or audit usage of a volume or generate content of the volume tailored to the pod (e.g. https://github.com/Azure/kubernetes-keyvault-flexvol).

CSI is agnostic to container orchestrators (such as Kubernetes, Mesos or CloudFoundry) and as such does not understand concept of pods and service accounts. [Enhancement of CSI protocol](https://github.com/container-storage-interface/spec/pull/252) to pass "workload" (~pod) information from Kubernetes to CSI driver has met some resistance.

## High-level design
We decided to pass the pod information as `NodePublishVolumeRequest.volume_attributes`.

* Kubernetes passes pod information only to CSI drivers that explicitly require that information in their [`CSIDriver` instance](https://github.com/kubernetes/community/pull/2523). These drivers are tightly coupled to Kubernetes and may not work or may require reconfiguration on other cloud orchestrators. It is expected (but not limited to) that these drivers will provide ephemeral volumes similar to Secrets or ConfigMap, extending Kubernetes secret or configuration sources.
* Kubernetes will not pass pod information to CSI drivers that don't know or don't care about pods and service accounts. It is expected (but not limited to) that these drivers will provide real persistent storage. Such CSI driver would reject a CSI call with pod information as invalid. This is current behavior of Kubernetes and it will be the default behavior.

## Detailed design

### API changes
No API changes.

### CSI enhancement
We don't need to change CSI protocol in any way. It allows kubelet to pass `pod.name`, `pod.uid` and `pod.spec.serviceAccountName` in [`NodePublish` call as `volume_attributes`]((https://github.com/container-storage-interface/spec/blob/master/spec.md#nodepublishvolume)). `NodePublish` is roughly equivalent to Flex `mount` call.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NodeServer has two sets of calls that I think are applicable. NodePublishVolume/Unpublish, but also NodeStageVolume/Unstage https://github.com/container-storage-interface/spec/blob/master/spec.md#nodestagevolume. Those calls also need the additional info to function properly.

Copy link
Member Author

@jsafrane jsafrane Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case several pods use the same volume on the same node, NodeStageVolume is called only once. It would include just the first Pod. NodePublishVolume is called separately for each pod that uses the volume and CSI driver can do the right thing there.

In addition, Flex driver did not get any pod in mountdevice call (which is equivalent of NodeStageVolume in CSI and nobody complained.

Flex did not provide pod information in unmount (=NodeUnpublishVolume) and nobody complained. Do you have concrete use case when pod can be useful in NodeUnpublishVolume?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, really? I thought stage was called once per pod, but mount was called per container mounting it? I guess I was mistaken. So I'll have to rewrite the image driver to have those semantics then. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsafrane For inline CSI volume, what would be the equivalent of flexvolume's options? Currently we can only pass in volumeAttributes to csi as part of a PersistentVolume deployment. Is there a way to support volumeAttributes as part of inline CSI volume as part of the pod spec?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The only thing we need to do is to **define** names of the `volume_attributes` keys that CSI drivers can expect:
* `csi.storage.k8s.io/pod.name`: name of the pod that wants the volume.
* `csi.storage.k8s.io/pod.namespace`: namespace of the pod that wants the volume.
* `csi.storage.k8s.io/pod.uid`: uid of the pod that wants the volume.
* `csi.storage.k8s.io/serviceAccount.name`: name of the service account under which the pod operates. Namespace of the service account is the same as `pod.namespace`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional csi stage call here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added NodeStageVolume to the example, with a note that it won't get any pod information.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks. The clarity will help others writing drivers.

Note that these attribute names are very similar to [parameters we pass to flex volume plugin](https://github.com/kubernetes/kubernetes/blob/10688257e63e4d778c499ba30cddbc8c6219abe9/pkg/volume/flexvolume/driver-call.go#L55).

### Kubelet
Kubelet needs to create informer to cache `CSIDriver` instances. It passes the informer to CSI volume plugin as a new argument of [`ProbeVolumePlugins`](https://github.com/kubernetes/kubernetes/blob/43f805b7bdda7a5b491d34611f85c249a63d7f97/pkg/volume/csi/csi_plugin.go#L58).

### CSI volume plugin
In `SetUpAt()`, the CSI volume plugin checks the `CSIDriver` informer if `CSIDriver` instance exists for a particular CSI driver that handles the volume. If the instance exists and has `PodInfoRequiredOnMount` set, the volume plugin adds `csi.storage.k8s.io/*` attributes to `volume_attributes` of the CSI volume. It blindly overwrites any existing values there.

Kubelet and the volume plugin must tolerate when CRD for `CSIDriver` is not created (yet). Kubelet and CSI volume plugin falls back to original behavior, i.e. does not pass any pod information to CSI. We expect that CSI drivers will return reasonable error code instead of mounting a wrong volume.

TODO(jsafrane): check what (shared?) informer does when it's created for non-existing CRD. Will it start working automatically when the CRD is created? Or shall we retry creation of the informer every X seconds until the CRD is created? Alternatively, we may GEt fresh `CSIDriver` from API server in `SetUpAt()`, without any informer.

## Implementation

* Alpha in 1.12 (behind `CSIPodInfo` feature gate)
* Beta in 1.13 (behind `CSIPodInfo` feature gate)
* GA 1.14?