
Fix the race between PVC finalizer and snapshot finalizer removal #360

Merged 2 commits into kubernetes-csi:master on Sep 2, 2020

Conversation

@xing-yang (Collaborator) commented Aug 17, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #349

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixed the race between PVC finalizer and snapshot finalizer removal.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 17, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xing-yang (Collaborator, Author)

/hold

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 17, 2020
@huffmanca (Contributor)

I've tested this, and it addresses the issue that I was mentioning in #354. I am able to delete a VolumeSnapshot whose source is a non-CSI PVC.

@chrishenzie (Contributor)

Just tested this but was unable to delete the PVC or the VolumeSnapshot. This was the manifest that was used, which tries to create a VolumeSnapshot at the same time the PVC is created.

@humblec (Contributor) commented Aug 20, 2020

@xing-yang are we targeting this for 2.2?

@xing-yang (Collaborator, Author)

@xing-yang are we targeting this for 2.2?

Yes.

@xing-yang (Collaborator, Author)

Just tested this but was unable to delete the PVC or the VolumeSnapshot. This was the manifest that was used, which tries to create a VolumeSnapshot at the same time the PVC is created.

@chrishenzie I updated the PR. Can you test again?

@chrishenzie (Contributor)

@xing-yang Just tested using the same manifest linked above. All resources deleted successfully without any issues. Thank you!

@xing-yang (Collaborator, Author)

@chrishenzie Thank you for testing this!

@xing-yang (Collaborator, Author)

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 27, 2020
@xing-yang (Collaborator, Author)

@ggriffiths Addressed your comments. PTAL. Thanks.

@@ -1303,6 +1285,16 @@ func (ctrl *csiSnapshotCommonController) removeSnapshotFinalizer(snapshot *crdv1
return nil
Contributor

removeSourceFinalizer will always be true; let's just remove this parameter?

Collaborator Author

You meant the caller always passes in removeSourceFinalizer as true here? removeSnapshotFinalizer is a utility function, so I'd like to make it usable even when removeSourceFinalizer is false, for completeness.

// If we are here, it means we are going to remove finalizers from the snapshot so that the snapshot can be deleted.
// We need to check if there is still PVC finalizer that needs to be removed before removing the snapshot finalizers.
// Once snapshot is deleted, there won't be any snapshot update event that can trigger the PVC finalizer removal.
if err := ctrl.checkandRemovePVCFinalizer(snapshot, true); err != nil {
Contributor

I am not sure this is the best place to put the fix.
checkandRemoveSnapshotFinalizersAndCheckandDeleteContent, which is the caller of this function, has quite a few return points, e.g. this one:

if !utils.IsSnapshotDeletionCandidate(snapshot) {

If any of those return points is hit, the PVC will still not be deletable.

Collaborator Author

It's a little tricky. We have to make sure the PVC finalizer is deleted before the VolumeSnapshot is deleted. Once the VolumeSnapshot is deleted, there is no sync routine that handles PVC finalizer deletion any more. We also can't remove the PVC finalizer too early: it should stay on as long as a VolumeSnapshot is still using the PVC.

I added the logic in the current place because this is where we are sure we want to delete the VolumeSnapshot object.
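
The ordering constraint described above can be sketched as a minimal, runnable Go example. This is not the controller's actual code: the `object` type and the two functions below are simplified stand-ins for the real `checkandRemovePVCFinalizer` and `removeSnapshotFinalizer`, with finalizer handling reduced to plain slice operations.

```go
package main

import "fmt"

// Simplified stand-in objects: only the finalizer lists matter for this sketch.
type object struct {
	finalizers []string
}

const pvcAsSourceProtection = "snapshot.storage.kubernetes.io/pvc-as-source-protection"

// removeString returns list without any occurrence of s.
func removeString(list []string, s string) []string {
	out := []string{}
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

// checkAndRemovePVCFinalizer drops the source-protection finalizer from the
// PVC when the caller asks for it (removeSourceFinalizer == true).
func checkAndRemovePVCFinalizer(pvc *object, removeSourceFinalizer bool) {
	if removeSourceFinalizer {
		pvc.finalizers = removeString(pvc.finalizers, pvcAsSourceProtection)
	}
}

// removeSnapshotFinalizer enforces the ordering the fix relies on: handle the
// PVC finalizer first, because once the snapshot's own finalizers are gone the
// snapshot object disappears and no further snapshot update event can trigger
// the PVC finalizer removal.
func removeSnapshotFinalizer(snap, pvc *object, removeSourceFinalizer bool) {
	checkAndRemovePVCFinalizer(pvc, removeSourceFinalizer)
	snap.finalizers = nil
}

func main() {
	pvc := &object{finalizers: []string{"kubernetes.io/pvc-protection", pvcAsSourceProtection}}
	snap := &object{finalizers: []string{"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection"}}
	removeSnapshotFinalizer(snap, pvc, true)
	fmt.Println(pvc.finalizers)       // [kubernetes.io/pvc-protection]
	fmt.Println(len(snap.finalizers)) // 0
}
```

If the two calls inside `removeSnapshotFinalizer` were swapped, the PVC would be left with a finalizer that nothing ever removes, which is exactly the stuck-deletion symptom this PR fixes.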

@@ -170,17 +170,14 @@ func (ctrl *csiSnapshotCommonController) syncSnapshot(snapshot *crdv1.VolumeSnap
klog.V(5).Infof("syncSnapshot [%s]: check if we should remove finalizer on snapshot PVC source and remove it if we can", utils.SnapshotKey(snapshot))

// Check if we should remove finalizer on PVC and remove it if we can.
if err := ctrl.checkandRemovePVCFinalizer(snapshot); err != nil {
if err := ctrl.checkandRemovePVCFinalizer(snapshot, false); err != nil {
Contributor

IMHO, checkandRemovePVCFinalizer could take on the responsibility of handling the "snapshot.ObjectMeta.DeletionTimestamp != nil" case, so the bool flag could be avoided?

Collaborator Author

I think it is safer to have this bool flag passed in. Even when "snapshot.ObjectMeta.DeletionTimestamp != nil", it is possible that the snapshot is also used as a source for another PVC. In that case, we still want to keep the PVC finalizer on the snapshot's source if the snapshot is not ready yet.
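
The trade-off above can be condensed into a small Go sketch. `shouldRemovePVCFinalizer` is a hypothetical helper (not the controller's actual code): the bool flag forces removal only on the path where the snapshot object itself is about to be deleted; otherwise the source PVC stays protected while the snapshot is not ready, even if the snapshot already carries a deletion timestamp.

```go
package main

import "fmt"

// snapshot is a stripped-down stand-in carrying only the fields the
// decision needs.
type snapshot struct {
	beingDeleted bool // DeletionTimestamp != nil
	readyToUse   bool
}

// shouldRemovePVCFinalizer condenses the discussion: removeSourceFinalizer
// forces removal on the final snapshot-deletion path; otherwise the PVC
// finalizer is kept until the snapshot no longer needs its source, i.e.
// until it is ready.
func shouldRemovePVCFinalizer(s snapshot, removeSourceFinalizer bool) bool {
	if removeSourceFinalizer {
		return true
	}
	return s.readyToUse
}

func main() {
	// Deletion timestamp set but not ready: keep protecting the source PVC.
	fmt.Println(shouldRemovePVCFinalizer(snapshot{beingDeleted: true}, false)) // false
	// Final deletion path forces removal regardless of readiness.
	fmt.Println(shouldRemovePVCFinalizer(snapshot{beingDeleted: true}, true)) // true
}
```

This is why collapsing the flag into a DeletionTimestamp check alone would be lossy: the timestamp says the snapshot is going away, but not whether its source data is still needed.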

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 29, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2020
@xing-yang (Collaborator, Author)

/retest

@ggriffiths (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit 42650bc into kubernetes-csi:master Sep 2, 2020
@ibadullaev-inc4 commented Jun 9, 2022

Hi.
I have the same problem.

[nariman@notebook cronjob-k8s-client-python]$ kubectl get volumesnapshotcontents.snapshot.storage.k8s.io 
No resources found

When I list snapshots:

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n geth get volumesnapshot
NAME                          READYTOUSE   SOURCEPVC      SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
data-gethu-0-20220608160302   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-569cd33c-4f39-4914-bfdc-1109fa0db30c   31h            31h
data-gethu-0-20220609083340   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-8b131c41-e940-4598-9557-16c352722456   14h            14h
data-gethu-0-20220609100305   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-a723ee07-8cf5-4d3f-b909-29b8e378930c   13h            13h
data-gethu-0-20220609120305   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-12a60ecc-88cb-4dc5-9762-f9ed7fe9dbdb   11h            11h
data-gethu-0-20220609140304   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-3c5af74c-cd01-473b-adf5-b705056526e5   9h             9h
data-gethu-0-20220609160302   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-15737a29-6eb1-4eee-adc0-1ea470a00d26   7h27m          7h27m
data-gethu-0-20220609180303   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-58fb6936-d432-408b-ac87-58c7869464af   5h27m          5h27m
data-gethu-0-20220609230303   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-7e84de30-3f4a-45c6-b63e-3953e0a0c115   27m            27m

When I try to delete, everything is stuck:

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n geth delete volumesnapshot data-gethu-0-20220609230303
volumesnapshot.snapshot.storage.k8s.io "data-gethu-0-20220609230303" deleted

No snapshots are present in AWS:

[nariman@notebook geth]$ aws ec2 describe-snapshots --owner-ids self  --query 'Snapshots[0]' --region=us-east-2
null

When I describe snapshot:

Events:
  Type     Reason             Age                  From                 Message
  ----     ------             ----                 ----                 -------
  Normal   CreatingSnapshot   33m                  snapshot-controller  Waiting for a snapshot geth/data-gethu-0-20220609230303 to be created by the CSI driver.
  Normal   SnapshotCreated    33m                  snapshot-controller  Snapshot geth/data-gethu-0-20220609230303 was successfully created by the CSI driver.
  Normal   SnapshotReady      32m                  snapshot-controller  Snapshot geth/data-gethu-0-20220609230303 is ready to use.
  Warning  ErrorPVCFinalizer  22m (x19 over 32m)   snapshot-controller  Error check and remove PVC Finalizer for VolumeSnapshot
  Warning  ErrorPVCFinalizer  109s (x26 over 20m)  snapshot-controller  Error check and remove PVC Finalizer for VolumeSnapshot

snapshot-controller log:

I0609 23:31:27.162964       1 snapshot_controller_base.go:217] Failed to sync snapshot "geth/data-gethu-0-20220609160302", will retry again: snapshot controller failed to update data-gethu-0-20220609160302 on API server: snapshot controller failed to update data-gethu-0 on API server: PersistentVolumeClaim "data-gethu-0" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	... // 4 identical fields
  	StorageClassName: &"ebs-sc",
  	VolumeMode:       &"Filesystem",
  	DataSource: &core.TypedLocalObjectReference{
  		APIGroup: &"snapshot.storage.k8s.io",
  		Kind:     "VolumeSnapshot",
- 		Name:     "data-gethu-0-20220609230303",
+ 		Name:     "data-gethu-0-20220608160302",
  	},
  	DataSourceRef: nil,
  }
I0609 23:31:27.175173       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshot", Namespace:"geth", Name:"data-gethu-0-20220609160302", UID:"15737a29-6eb1-4eee-adc0-1ea470a00d26", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"4512026", FieldPath:""}): type: 'Warning' reason: 'ErrorPVCFinalizer' Error check and remove PVC Finalizer for VolumeSnapshot
I0609 23:31:27.190708       1 leaderelection.go:278] successfully renewed lease kube-system/snapshot-controller-leader

Question: how can I solve this issue? Please help.

@xing-yang (Collaborator, Author)

What version of external-snapshotter are you using?

@ibadullaev-inc4 commented Jun 17, 2022

I am using version:
gcr.io/k8s-staging-sig-storage/snapshot-controller:v5.0.1

[nariman@notebook bnb]$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.24.0
Kustomize Version: v4.5.4
Server Version: v1.22.9-eks-a64ea69
WARNING: version difference between client (1.24) and server (1.22) exceeds the supported minor version skew of +/-1

aws-ebs-csi-driver

  repository = "https://kubernetes-sigs.github.io/aws-ebs-csi-driver"
  chart      = "aws-ebs-csi-driver"
  version    = "2.6.7"

Additional test: in two different namespaces I have two ready snapshots in each.

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n near get volumesnapshot
NAME                          READYTOUSE   SOURCEPVC      SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
data-nearu-0-20220617005449   true         data-nearu-0                           800Gi         aws-block-storage   snapcontent-70295c59-f75d-4e7e-b3e5-28bcbf57bdc7   49m            49m
data-nearu-0-20220617014011   true         data-nearu-0                           800Gi         aws-block-storage   snapcontent-4612e854-a1b2-4ce2-a55c-23c2f7dc1978   3m58s          3m58s
[nariman@notebook cronjob-k8s-client-python]$ kubectl -n geth get volumesnapshot
NAME                          READYTOUSE   SOURCEPVC      SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
data-gethu-0-20220617000103   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-7f8eef36-319e-4a12-9f25-8445ac72ea71   104m           104m
data-gethu-0-20220617005748   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-8ebc315e-1690-408c-b322-e640d9f3d75c   47m            47m

When I delete a volumesnapshot from the near namespace, the result is successful:

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n near delete volumesnapshot data-nearu-0-20220617005449
volumesnapshot.snapshot.storage.k8s.io "data-nearu-0-20220617005449" deleted
[nariman@notebook cronjob-k8s-client-python]$

But when I delete a snapshot from the geth namespace, the command is stuck:

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n geth delete volumesnapshot data-gethu-0-20220617000103
volumesnapshot.snapshot.storage.k8s.io "data-gethu-0-20220617000103" deleted

Both snapshots were created the same way.

When I describe this snapshot, I see the following ErrorPVCFinalizer events:
Error check and remove PVC Finalizer for VolumeSnapshot

[nariman@notebook bnb]$ kubectl -n geth describe volumesnapshot data-gethu-0-20220617000103
Name:         data-gethu-0-20220617000103
Namespace:    geth
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:             2022-06-17T00:01:03Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2022-06-17T01:04:51Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generation:  2
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:source:
          .:
          f:persistentVolumeClaimName:
        f:volumeSnapshotClassName:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2022-06-17T00:01:03Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection":
          v:"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection":
    Manager:      snapshot-controller
    Operation:    Update
    Time:         2022-06-17T00:01:03Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:boundVolumeSnapshotContentName:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
    Manager:         snapshot-controller
    Operation:       Update
    Subresource:     status
    Time:            2022-06-17T00:01:03Z
  Resource Version:  8112914
  UID:               7f8eef36-319e-4a12-9f25-8445ac72ea71
Spec:
  Source:
    Persistent Volume Claim Name:  data-gethu-0
  Volume Snapshot Class Name:      aws-block-storage
Status:
  Bound Volume Snapshot Content Name:  snapcontent-7f8eef36-319e-4a12-9f25-8445ac72ea71
  Creation Time:                       2022-06-17T00:01:03Z
  Ready To Use:                        true
  Restore Size:                        800Gi
Events:
  Type     Reason             Age                    From                 Message
  ----     ------             ----                   ----                 -------
  Warning  ErrorPVCFinalizer  2m21s (x52 over 113m)  snapshot-controller  Error check and remove PVC Finalizer for VolumeSnapshot

Snapshotter log:

E0617 01:58:22.574735       1 snapshot_controller_base.go:399] could not sync snapshot "geth/data-gethu-0-20220617000103": snapshot controller failed to update data-gethu-0-20220617000103 on API server: snapshot controller failed to update data-gethu-0 on API server: PersistentVolumeClaim "data-gethu-0" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	... // 4 identical fields
  	StorageClassName: &"ebs-sc",
  	VolumeMode:       &"Filesystem",
  	DataSource: &core.TypedLocalObjectReference{
  		APIGroup: &"snapshot.storage.k8s.io",
  		Kind:     "VolumeSnapshot",
- 		Name:     "data-gethu-0-20220617005748",
+ 		Name:     "data-gethu-0-20220609235015",
  	},
  	DataSourceRef: nil,
  }
I0617 01:58:22.574751       1 snapshot_controller_base.go:217] Failed to sync snapshot "geth/data-gethu-0-20220617000103", will retry again: snapshot controller failed to update data-gethu-0-20220617000103 on API server: snapshot controller failed to update data-gethu-0 on API server: PersistentVolumeClaim "data-gethu-0" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	... // 4 identical fields
  	StorageClassName: &"ebs-sc",
  	VolumeMode:       &"Filesystem",
  	DataSource: &core.TypedLocalObjectReference{
  		APIGroup: &"snapshot.storage.k8s.io",
  		Kind:     "VolumeSnapshot",
- 		Name:     "data-gethu-0-20220617005748",
+ 		Name:     "data-gethu-0-20220609235015",
  	},
  	DataSourceRef: nil,
  }
I0617 01:58:22.575086       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshot", Namespace:"geth", Name:"data-gethu-0-20220617000103", UID:"7f8eef36-319e-4a12-9f25-8445ac72ea71", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"8112914", FieldPath:""}): type: 'Warning' reason: 'ErrorPVCFinalizer' Error check and remove PVC Finalizer for VolumeSnapshot

But if I execute the following command, the snapshot is deleted successfully:

[nariman@notebook cronjob-k8s-client-python]$ kubectl -n geth get volumesnapshot data-gethu-0-20220617000103 -o=json | jq '.metadata.finalizers = null' | kubectl apply -f -
Warning: resource volumesnapshots/data-gethu-0-20220617000103 is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
Warning: Detected changes to resource data-gethu-0-20220617000103 which is currently being deleted.
volumesnapshot.snapshot.storage.k8s.io/data-gethu-0-20220617000103 configured

@xing-yang (Collaborator, Author)

Are you trying to modify the VolumeSnapshot name? That field is immutable.

E0617 01:58:22.574735       1 snapshot_controller_base.go:399] could not sync snapshot "geth/data-gethu-0-20220617000103": snapshot controller failed to update data-gethu-0-20220617000103 on API server: snapshot controller failed to update data-gethu-0 on API server: PersistentVolumeClaim "data-gethu-0" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
  core.PersistentVolumeClaimSpec{
  	... // 4 identical fields
  	StorageClassName: &"ebs-sc",
  	VolumeMode:       &"Filesystem",
  	DataSource: &core.TypedLocalObjectReference{
  		APIGroup: &"snapshot.storage.k8s.io",
  		Kind:     "VolumeSnapshot",
- 		Name:     "data-gethu-0-20220617005748",
+ 		Name:     "data-gethu-0-20220609235015",
  	},
  	DataSourceRef: nil,
  }

@ibadullaev-inc4

No, I am trying to delete a volumesnapshot.

  1. List PVCs in the geth namespace:
[nariman@notebook bnb]$ kubectl -n geth get pvc
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-gethu-0   Bound    pvc-1ca62d5d-f661-4f6c-b095-37a3d148c911   800Gi      RWO            ebs-sc         7d21h
  2. Describe the PVC; note the field DataSource.Name:
[nariman@notebook bnb]$ kubectl -n geth describe pvc data-gethu-0
Name:          data-gethu-0
Namespace:     geth
StorageClass:  ebs-sc
Status:        Bound
Volume:        pvc-1ca62d5d-f661-4f6c-b095-37a3d148c911
Labels:        app=gethu
Annotations:   policies.kyverno.io/last-applied-patches: set-name-latest-snapshot.mutate-pvc.kyverno.io: replaced /spec/dataSource/name
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/selected-node: ip-10-0-1-41.us-east-2.compute.internal
Finalizers:    [kubernetes.io/pvc-protection snapshot.storage.kubernetes.io/pvc-as-source-protection]
Capacity:      800Gi
Access Modes:  RWO
VolumeMode:    Filesystem
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      data-gethu-0-20220609235015
Used By:     gethu-0
Events:      <none>

  3. List volumesnapshots:
[nariman@notebook bnb]$ kubectl -n geth get volumesnapshots.snapshot.storage.k8s.io 
NAME                          READYTOUSE   SOURCEPVC      SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT                                    CREATIONTIME   AGE
data-gethu-0-20220617005748   true         data-gethu-0                           800Gi         aws-block-storage   snapcontent-8ebc315e-1690-408c-b322-e640d9f3d75c   20h            20h

  4. Try to delete; the command gets stuck:
[nariman@notebook bnb]$ kubectl -n geth delete volumesnapshots.snapshot.storage.k8s.io data-gethu-0-20220617005748
volumesnapshot.snapshot.storage.k8s.io "data-gethu-0-20220617005748" deleted

Log: log.txt
