
Inline CSI volumes support KEP #716

Merged (1 commit, Jan 30, 2019)
Conversation

@vladimirvivien (Member) commented Jan 23, 2019

Currently, volumes backed by CSI drivers can only be used via the PersistentVolume and PersistentVolumeClaim objects. This proposal is to implement support for nesting CSI volume declarations directly within pod specs:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
    ...
  volumes:
    - name: vol
      csi:
        driver: mock.storage.kubernetes.io
        volumeAttributes:
          name: "Mock Volume 1"
        volumeHandle: "1"

This would bring CSI drivers to feature parity with existing in-tree storage plugins, which already support this volume deployment mode.

Related feature entry - #596
This KEP started life as feature #2273.

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 23, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/pm size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2019
volumes:
  - name: vol
    csi:
      lifecycle: ephemeral
@kfox1111 (Jan 23, 2019):

This has a few issues in decreasing levels of importance:

  1. The user is in control of it. This means the user can omit it, give it the same volume handle as another user's ephemeral volume, and it would mount their volume. This is a security issue.
  2. Because of the restricted api, a given driver may malfunction if switched to ephemeral style when the driver doesn't support it.
  3. The user needs to know to specify it.
  4. It takes more effort for the user to specify.

Due to all of these, I recommend we change it to a CSIDriver CR flag. The driver is then flagged as either ephemeral or normal by the driver packager. A driver cannot be made both normal and ephemeral. We drop the lifecycle flag.

Member Author:

@kfox1111 thank you for the input. I think it escaped me that a driver should either be persistent or ephemeral, but never both. In that case, the setting can be introduced in the CSIDriver object.


lifecycle flag still here. should be removed.

Member:

should this lifecycle field be removed then?

Member Author:

ah ok. Fixing.

@kfox1111

Oh. One more thing. There needs to be a section covering the PodSecurityPolicy stuff around needing a way to limit which drivers can be used PodInline. As an Operator I need to restrict my users only to the safe subset of ephemeral drivers I want to support for them.

@vladimirvivien (Member Author):

@kfox1111 thanks for the input. I will address.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/blocked-paths Indicates that a PR should not merge because it touches files in blocked paths. label Jan 25, 2019
@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/blocked-paths Indicates that a PR should not merge because it touches files in blocked paths. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2019
// CSI drivers may be used. This parameter is effective only when the usage of the CSI plugin
// is allowed in the "Volumes" field.
// +optional
AllowedCSIDrivers []AllowedCSIDriver
Member:

AllowedInlineCSIDrivers? Or at least mention in the comment that this refers only to volumes referenced inline not via PVC?

Member:

It probably should apply to both inline and PVC right? We should be clear about that. If it applies to both, does it need to be part of this inline design or should it be a new standalone design?


PVCs can't specify a driver to use. PVs do, but normal users can't create those at all, so I don't think it matters there. I don't think the other volume restrictions in PodSecurityPolicy apply to PVs either.

Member:

PSP only applies to pod objects... it doesn't have visibility to transitive information in PVC->PV or PVC->StorageClass

Member:

I don't think Inline is needed... most of the other names have counterparts in PV sources. PSP is specifically for things in pod specs

@vladimirvivien (Member Author):

@saad-ali thanks for the review.

@vladimirvivien (Member Author):

@msau42 thank you for the review.

@kfox1111 left a comment:

Looks good to me, I think. Down to a few simple typos now; nothing that affects the actual implementation.

* As a storage provider, I want to be able to create CSI drivers that support persistent volumes that can be nested within pod specs. These volumes would work similarly to how my current in-tree drivers work (with minor limitations addressed later).
* As a storage provider, I want to use the CSI API to develop arbitrary drivers that can mount ephemeral volumes which follow the lifecycles of the pods where they are embedded. These ephemeral volumes do not follow the full CSI storage workflow (of provision, stage, attach, mount, unmount, detach, unstage, delete).
* As a user, I want to be able to deploy pods, with persistent CSI volumes embedded in their specs, without the use of PV/PVC.
* As a user, I want to be able to define pod specs with embedded ephemeral CSI volumes that are created/mounted when the pod is deployed and are deleted when the pod goes away.
Member:

It would help the reader to know what the use cases are for out of tree ephemeral CSI volumes. From a user story perspective, I'm missing the "why", or a reference/link to the why.

Member Author:

Adding to the storage provider user story, why ephemeral CSI drivers would be useful. Thanks.


## Motivation
Implementing support for embedding volumes directly in pod specs would give external CSI drivers feature parity with their in-tree counterparts. There are several reasons why this is desirable:
* It provides the community a path to move away from in-tree volume plugins to CSI, as designed in a separate proposal https://github.com/kubernetes/community/pull/2199/.
Member:

Is there a tracking issue, or reference, that outlines the issues with moving storage plugins out of tree?

The grand promise of CSI is ~= CNI, it's a post bootstrap addition, and this KEP is a step in that direction. I guess I'm looking for a roadmap or umbrella issue that outlines this.


I don't know of a complete issue/roadmap.

I do know about this one:
kubernetes/community#2199

Member Author:

@timothysc CSI has a similar aspiration of moving in-tree drivers out via CSI drivers. There is a separate set of tasks and a roadmap (here) with design to track the exodus.

### Goals
* Provide a high level design for inline CSI volumes support
* Define API changes needed to support this feature
* Inline CSI volumes should work with existing CSI drivers
Contributor:

I'm still confused about this. So now here it states that it should work "with existing CSI drivers".

But then further down it says "A CSI driver must indicate that it supports either persistent or ephemeral inline mode." and explains how a driver in ephemeral mode must "delay or combine any provisioning/deprovisioning operation".

In other words, none of the "existing drivers" will work - at least not for ephemeral mode.

Or is the goal merely that "Inline persistent volumes" should work without modifications?


existing (normal) drivers should work without changes with pod inline.

There isn't such a thing as ephemeral drivers yet. Those need to be specifically coded to work, as they don't support the entire CSI workflow and need to self-provision/delete backing stores.

Contributor:

Can we change the wording to make this obvious? I propose:

  • Inline persistent CSI volumes should work with existing CSI drivers


sounds good to me.

Member Author:

@pohly sounds good. I will make the change.


#### Persistent inline mode
For CSI drivers that support persistent volumes (whether originated from PV/PVC or inline), the `CSIDriver.lifecycle` value should be set to `persistent` (the default). That value indicates the following:

Contributor:

Above it says "A CSI driver must indicate that it supports either persistent or ephemeral inline mode." and here it says that persistent is the default.

So if CSIDriver.lifecycle is not set, does that mean that support for persistent inline mode is enabled (because that is the default)?


yes

Member Author:

@pohly in order for existing persistent drivers to work automatically when the new flag is introduced in the CSIDriver object, the code must assume a driver works as a persistent driver if the flag is not provided or not set.

Member Author:

I clarified the language anyway.

The `PodSecurityPolicy.AllowedCSIDrivers` value must be explicitly set with the names of CSI drivers that are allowed to be embedded within a pod spec. An empty value means no CSI drivers may be specified inline inside a pod spec.
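A minimal sketch of how a cluster admin might use the proposed field. The `allowedCSIDrivers` field is this KEP's proposal, not a shipped API; the driver name is illustrative, and the other fields are just the minimum a PodSecurityPolicy requires:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: inline-csi-restricted
spec:
  seLinux: {rule: RunAsAny}
  runAsUser: {rule: RunAsAny}
  supplementalGroups: {rule: RunAsAny}
  fsGroup: {rule: RunAsAny}
  volumes: ["csi"]            # permit the inline CSI volume type at all
  allowedCSIDrivers:          # proposed: whitelist of drivers allowed inline
    - name: mock.storage.kubernetes.io
```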

### Persistent vs ephemeral lifecycle setting
A CSI driver must indicate that it supports either persistent or ephemeral inline mode. This setting will be stored in a [`CSIDriver` configuration CRD](https://github.com/kubernetes/community/pull/2514) and must be retrieved from the API server at runtime.
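A hypothetical sketch of what such a `CSIDriver` object might look like under this proposal. The `lifecycle` field name and its values follow the KEP's wording and are not a settled API:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: mock.storage.kubernetes.io
spec:
  # Proposed field: `persistent` (assumed when omitted, so existing
  # drivers keep working unchanged) or `ephemeral`.
  lifecycle: persistent
```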
Contributor:

Given @kfox1111's clarification, I propose to change this into "A CSI driver must indicate that it supports ephemeral inline mode. This setting will be stored...".

What I'm missing in the document is an explicit statement that a driver can only be used in one mode or the other, and an explanation why that decision was made.

At first glance that decision seems to make it harder to deploy a driver that can operate in both modes. My colleagues are working on https://github.com/intel/pmem-csi, which manages local storage. Exposing that as ephemeral local storage is one way of using that storage, but we have ideas how we can make persistent mode also useful. It would be nice if we didn't have to run two driver daemons on each node just to support both. I can imagine doing it via two different CSI sockets provided by the same process, but if we could have a single socket and a parameter on NodePublishVolume for "this is an ephemeral volume" then that might be simpler.


The issue is, Mount/Unmount of an ephemeral driver will do the creation of the backing store and the deletion of the backing store in the driver itself. The driver isn't handed any info to decide if it is an ephemeral driver or a normal one.

We might be able to relay this information in the future. I'd suggest though we make them totally separate for now and we can revisit having a driver that can do both modes in the future. We're really close to the deadline now.

Member Author:

@pohly thanks for the feedback. So I think a bit of clarification is in order (thanks for pointing that out). A driver should be able to support both at the same time. The lifecycle flag is consequential depending on how the volume is originated. When a volume is used inline in a pod, the lifecycle flag determines how it should function. When a volume originates from a PV/PVC, it is automatically assumed to be persistent.

The problem is, if a volume originates from PV/PVC and the driver does not in fact support persistent volume, then there will be issues.


The protocol we came up with for how an ephemeral driver will work does not have enough information to switch between being ephemeral and not being ephemeral. I say we only support one or the other explicitly for now, as this is alpha, and add that functionality later, perhaps as an enhancement to the pod info passing feature.

Contributor:

FWIW, I'm fine with supporting just one mode for now. I just wish the document would elaborate a bit more on the reasoning that led to the proposal. But that shouldn't stop it from getting merged as-is.

Member Author:

@pohly I updated the doc to address the language and clarify your concerns.

@msau42 (Member) commented Jan 29, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2019
@kfox1111:

@saad-ali how is this looking? Good enough?

@msau42 (Member) commented Jan 29, 2019

@vladimirvivien I think you need to update the NEXT_KEP_NUMBER file
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2019
@msau42 (Member) commented Jan 29, 2019

Never mind. Apparently KEP numbers are not used anymore, so you should actually rename the file and remove the KEP number from it.
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2019
@saad-ali (Member) left a comment:

/approve

Will let @liggitt lgtm

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, saad-ali, vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2019
@k8s-ci-robot k8s-ci-robot merged commit 0684459 into kubernetes:master Jan 30, 2019
- "@jsafrane"
- "@liggit"
approvers:
- TBD
Member:

not sure how this got to implementable with TBD approvers. moving @thockin (for API) and @saad-ali (for storage) from reviewers to approvers would probably make sense

#### Persistent inline mode
For CSI drivers that indicate they support persistent inline volumes, the `CSIDriver.lifecycle` value should be set to `persistent` (the default if not provided). When a driver handles volumes from an inline context:
* Volumes are assumed to be originated from an inline pod spec.
* The `volumeHandle` value must be generated ahead of time and manually entered in the pod spec.
Member:

is this enforced by API validation, or by failing the pod at runtime if no volumeHandle is provided and the CSIDriver.lifecycle is not ephemeral?


Runtime is pretty easy to implement; API validation would be harder. Can an API validator look up CRs to validate against? Can we add to the existing API validation logic, or would that need to be an external webhook?

Member Author:

@liggitt this will have to happen at runtime, since volumeHandle supports two modes, one of which (ephemeral) does not require it. The API validators don't usually call the API server (or at least I haven't seen one do it), but if that is possible or doable, the validator will check the CRD for proper application.

Member:

Runtime makes sense. Clarify that pods targeting ephemeral inline drivers can specify a volumeHandle that will be ignored, and that pods targeting persistent inline drivers can omit a volumeHandle, which will result in an error (and describe where in the lifecycle that error occurs and how it surfaces to the user... I'd guess when the VolumeAttachment creation is attempted, since no name would be able to be computed).

* The external CSI A/D controller **copies whole `VolumeSource`** from `Pod` into `VolumeAttachment`. This allows external CSI attacher to detach volumes for deleted pods without keeping any internal database of attached VolumeSources.
* Validation of `VolumeSource` will ensure that only `CSIVolumeSource` is being copied.
* External CSI attacher must be extended to process either `PersistentVolumeName` or `VolumeSource`.
* **External attacher may need permission to access secrets in any pod namespace** where inline volume is specified.
Member:

We'll need to iron out the implementation details when multiple pods in different namespaces share the same volume.

just got here... sketching out the lifecycle of a volumeattachment object with multiple pods involved would help understand if there are information flow gaps. (who creates the object, based on what info? who detaches the volume, based on what signal? etc)

* External CSI attacher must be extended to process either `PersistentVolumeName` or `VolumeSource`.
* **External attacher may need permission to access secrets in any pod namespace** where inline volume is specified.
* CSI `ControllerUnpublishVolume` call (~ volume detach) will require the secrets to be available at detach time.
* If user deletes secrets in a pod's namespace that was used to attach an inline volume, the external attacher will fail during detach (volume will remain attached), reporting errors about missing secrets to user.
Member:

reporting errors about missing secrets to user

using what mechanism?


pod events?

Member Author:

@liggitt this will have to be done with events and logging.

Member:

if there's an error attaching a volume, is it reported as an event into all namespaces that have pods waiting for that attachment, or just into the namespace referenced by the VolumeAttachment object?

* Ephemeral CSI drivers will have to delay or combine any provisioning/deprovisioning operations during the different phases.

#### Attach/detach
CSI uses API object `storage.VolumeAttachment` to track and manage attach/detach storage operation. Currently that object contains reference to an associated PV when the driver is not used inline. However, it must be extended to support information from `CSIVolumeSource` (see above) when an inline volume is being attached.
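A sketch of what an extended `VolumeAttachment` might carry for an inline volume, assuming a new source field alongside the existing `persistentVolumeName`. The `volumeSource` field name and the object name here are illustrative, not a final API:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttachment
metadata:
  name: csi-abc123                 # illustrative name
spec:
  attacher: mock.storage.kubernetes.io
  nodeName: node-1
  source:
    # today: persistentVolumeName: <pv-name>
    # proposed for inline volumes: a copy of the pod's CSIVolumeSource
    volumeSource:
      csi:
        driver: mock.storage.kubernetes.io
        volumeHandle: "1"
```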
Member:

is this section only relevant for persistent inline CSI volumes? see question below at the "Ephemeral inline drivers must ignore attachment requests" line


yeah. Not needed for ephemeral I think.

Member Author:

@liggitt that is correct. Only persistent inline drivers will participate in that phase.

@liggitt (Member) Jan 30, 2019:

if so, clarify that at the top... if VolumeAttachment objects are not created for ephemeral inline volumes, that answers #716 (comment) and partially answers the ephemeral part of #716 (comment)

Member Author:

Clarified.

* As stated, volumeHandle must refer to a pre-existing provisioned volume

For ephemeral inline drivers:
* Ephemeral inline drivers must ignore attachment requests
Member:

this is a little confusing...
does that mean VolumeAttachment objects are not created for ephemeral inline volumes?
does that mean the CSI driver is not invoked with attach/detach requests, or that it is invoked and should ignore the invocations?


I think it means all three of:

  • VolumeAttachment objects won't be created
  • CSI attach/detach requests will not be issued to the driver
  • the CSI driver should not implement attach/detach

The latter allows us to potentially add them at some point without breaking existing ephemeral drivers. I don't foresee ever needing them, but it's always good to give yourself room to make enhancements later to cover your own lack of foresight. :)

Member Author:

@liggitt one of the major reasons ephemeral volumes won't participate in attach/detach is the volumeHandle auto-generation strategy, which will be based on the podUID for guaranteed uniqueness. During Attach, there is no suitable podInfo that I was able to find. So ephemeral drivers are restricted to mount/unmount, which is when all the info needed to form the handle is available.

Member:

that seems reasonable... clarifying the whole attach/detach/volumeattachment section only applies to persistent inline drivers would be good

Member:

I found the "Ephemeral volumes will only make CSI calls for *mount* and *unmount* volume operations" statement, which I should have connected to the volume attachment bits not applying, but reiterating it for slow readers like me would be nice :)

Member Author:

Will re-read KEP and make sure that point is clear.

* Ephemeral drivers are responsible for handling volume operations (create, mount, unmount, delete) during this stage
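Putting the points above together, an ephemeral inline volume might then look like the following in a pod spec. The driver name and attributes are illustrative; note the absence of `volumeHandle`, which per this proposal would be auto-generated from the pod UID at mount time:

```yaml
volumes:
  - name: scratch
    csi:
      driver: ephemeral.example.com   # a driver registered as ephemeral
      volumeAttributes:
        size: "1Gi"
      # no volumeHandle: generated at mount time from the pod UID
```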

## Security considerations
As written above, external attacher may require permissions to read Secrets in any namespace. It is up to CSI driver author to document if the driver needs such permission (i.e. access to Secrets at attach/detach time) and up to cluster admin to deploy the driver with these permissions or restrict external attacher to access secrets only in some namespaces.
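One way a cluster admin might restrict the external attacher to secrets in specific namespaces is a plain namespaced RBAC Role rather than a cluster-wide grant. This is a sketch; the Role name and namespace are illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: csi-attacher-secret-reader
  namespace: team-a        # repeat per namespace the attacher may read from
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
# bind to the external attacher's ServiceAccount with a RoleBinding
```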
Member:

to read Secrets in any namespace

clarify this only attempts to read secrets referenced by CSI volume sources
