Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring Volume Snapshot to Beta #1133

Merged
merged 1 commit into from
Jul 31, 2019

Conversation

xing-yang
Copy link
Contributor

This KEP is for the volume snapshot feature.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jul 10, 2019
@xing-yang
Copy link
Contributor Author

@jingxu97 @msau42 PTAL.


The rest of the document includes required information missing from the original design document: test plan and graduation criteria.

## Test Plan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to add a section for changes from the original design?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a section for changes since the original design doc. PTAL.

@msau42
Copy link
Member

msau42 commented Jul 17, 2019

/assign @jingxu97 @saad-ali

@kacole2
Copy link
Contributor

kacole2 commented Jul 29, 2019

Is there a window of possibility this KEP will be merged and in an implementable state by EOD tomorrow? Enhancement Freeze is 7/30 and this is going to hinder this feature from being included in 1.16. Thanks

@xing-yang
Copy link
Contributor Author

@kacole2 Yes, we are working on it. Thanks.

@xing-yang xing-yang force-pushed the snapshot_beta branch 3 times, most recently from 839f544 to f0139af Compare July 29, 2019 15:43

## Changes

Here are the changes since the original design proposal:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also describe the changes to the CRD lifecycle and controller management.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

* e2e tests for creating/deleting snapshot.
* e2e tests for creating volume from snapshot.

## Graduation Criteria
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add one more section for implementation history for alpha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 29, 2019
@xing-yang
Copy link
Contributor Author

@msau42 your comments are addressed. PTAL.


## Implementation History

Volume snapshot is implemented as alpha feature in this repo:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add what release it was added as alpha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

* Added Finalizer on the snapshot source PVC to prevent it from being deleted when a snapshot is being created from it.
* Added check to see whether ListSnapshots is supported by the CSI driver. If it is supported, ListSnapshots will be called to find out the status of a snapshot during static binding; otherwise it is assumed the snapshot ID provided by the admin is valid.
* If snapshot creation times out, we should allow CSI driver to continue to finish the snapshot creation in the background. Timeout will happen at the foreground so VolumeSnapshot is marked as failed. VolumeSnapshotContent will be created but eventually be marked as failed because there is no VolumeSnapshot requesting for it. This work is on-going.
* Investigation is on-going to determine whether we should separate common controller logic from other logic that belongs to the sidecar. Can be in the same external-snapshotter repo. The common controller should not be deployed with the driver. It should be deployed by the cluster deployer, or we can provide a way to deploy it as a separate Statefulset, not together with the driver.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for sure CRD installation should not be done as part the driver. So we should call that out and who is responsible for managing the CRD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we classify three groups of people? Cluster admin, storage admin, and users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

* Renamed `Ready` to `ReadyToUse` in the `Status` field of `VolumeSnapshot` API object.
* Changed type of `RestoreSize` in `CSIVolumeSnapshotSource` from `*resource.Quantity` to `*int64`.
* Lease based Leader Election support is added.
* Added `VolumeSnapshotContent` deletion policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deletion policy can be specified in VolumeSnapshotClass, I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification that it can be specified in VolumeSnapshotClass. We have it in both Content and Class. What I meant to say is that this is a policy for the content.

* Added `VolumeSnapshot` and `VolumeSnapshotContent` in Use Protection using Finalizers.
* Added Finalizer on the snapshot source PVC to prevent it from being deleted when a snapshot is being created from it.
* Added check to see whether ListSnapshots is supported by the CSI driver. If it is supported, ListSnapshots will be called to find out the status of a snapshot during static binding; otherwise it is assumed the snapshot ID provided by the admin is valid.
* If snapshot creation times out, we should allow CSI driver to continue to finish the snapshot creation in the background. Timeout will happen at the foreground so VolumeSnapshot is marked as failed. VolumeSnapshotContent will be created but eventually be marked as failed because there is no VolumeSnapshot requesting for it. This work is on-going.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for timeout, VolumeSnapshot status will not be marked as failed so that controller will continue to retry to create until the operation either succeeded or failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Fixed it.

@xing-yang
Copy link
Contributor Author

@msau42 and @jingxu97, addressed your comments. PTAL. Thanks.

@msau42
Copy link
Member

msau42 commented Jul 30, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2019
@jingxu97
Copy link
Contributor

/assign @saad-ali

* Snapshot feature is used as a basic building block in other advanced applications.
* Feature deployed in production and have gone through at least one K8s upgrade.

## Changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we separate this list between what is already complete and what is in progress for beta so it is easier to understand current state vs what is being done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

* Create/delete volume snapshots
* Create new volumes from a snapshot
* SnapshotContent Deletion/Retain Policy
* Snapshot Object in Use Protection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Beta API? Secrets? New controller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2019
@xing-yang
Copy link
Contributor Author

Hi @saad-ali, addressed your comments. PTAL. Thanks!

### E2E tests

* e2e tests for creating/deleting snapshot.
* e2e tests for creating volume from snapshot.
Copy link
Member

@saad-ali saad-ali Jul 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make those P0.

And add e2e tests for:

  • P1 delete/retain policy
  • P1 deleting API objects out of order (snapshot protection)
  • P2 test for secret fields
  • P2 tests for metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

* If snapshot creation times out, VolumeSnapshot status will not be marked as failed so that controller will continue to retry to create until the operation either succeeds or fails. It is up to the user or an upper level application that uses the VolumeSnapshot to determine what to do with the snapshot. This work is on-going.
* Investigation is on-going to determine whether we should separate common controller logic from other logic that belongs to the sidecar. Can be in the same external-snapshotter repo. The common controller should not be deployed with the driver. It should be deployed by the cluster deployer, or we can provide a way to deploy it as a separate Statefulset, not together with the driver. CRD installation should also be done by the cluster deployer.
* Add secrets field to list-snapshots RPC in the CSI spec. Add “data-source-secret” in create-volume intended for accessing the data source. Implement them in external-snapshotter and external-provisioner.
* Add metrics support in the snapshot controller.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make the list of metrics to-be-added more specific? I am thinking about following for BETA:

  1. operational end to end latency metrics
    labels:
    a. operation_name, i.e., creation-snapshot, delete-snapshot
    b. csi-driver-name
  2. operation error count
    labels:
    a. operation_name, i.e., creation-snapshot, delete-snapshot
    b. csi-driver-name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@xing-yang
Copy link
Contributor Author

Hi @saad-ali and @yuxiangqian, your comments are addressed. PTAL. Thanks!

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 31, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, saad-ali, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3e36f06 into kubernetes:master Jul 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants