Retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1 #778

Closed
ambiknai opened this issue Nov 4, 2022 · 18 comments · Fixed by #871

@ambiknai

ambiknai commented Nov 4, 2022

What happened:

Created a VolumeSnapshot, but the request failed due to an authorisation issue in the storage provider. In the logs, I could see frequent calls to the CreateSnapshot method. As per the docs, the retry interval should double after each failure, but the logs do not show that behaviour.

What you expected to happen:
The retry interval of failed volume snapshot creation or deletion should double after each failure.
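
For reference, the doubling described in the docs comes from re-queuing failed items through a rate-limited workqueue. A minimal sketch of that mechanism, assuming client-go's per-item exponential failure rate limiter (the base and max delays here are illustrative, not necessarily the sidecar's actual settings):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Per-item exponential backoff: the delay returned by When() doubles on
    // every consecutive failure of the same item, up to the max delay.
    limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

    item := "snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1"
    for i := 0; i < 5; i++ {
        fmt.Println(limiter.When(item)) // 1s, 2s, 4s, 8s, 16s
    }

    // After a successful sync the controller would call Forget(), resetting
    // the failure count so the next failure starts from the base delay again.
    limiter.Forget(item)
}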

How to reproduce it:

Create a negative test scenario where VolumeSnapshot creation fails and observe the csi-snapshotter sidecar logs.

Anything else we need to know?:

cat csi-snapshotter.log | grep "GRPC call: /csi.v1.Controller/CreateSnapshot"
I1103 09:23:10.412875       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.306660       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.477653       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.785050       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.943098       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:13.124844       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:13.700102       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:14.300774       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:14.904655       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:15.504961       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:16.101649       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:16.704480       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:17.303818       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:17.904957       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:18.524977       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:19.100899       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:19.701028       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:20.302357       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:20.903176       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:21.502685       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:22.101407       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:22.701141       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:23.306374       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:23.902031       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:24.546378       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:25.104353       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:25.706340       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:26.304882       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:26.902224       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:27.500703       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:28.100920       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:28.705816       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot

The csi-snapshotter log shows the corresponding error:

I1103 09:23:10.412875       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:10.412881       1 connection.go:184] GRPC request: {"name":"snapshot-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1","source_volume_id":"r014-41301bc5-7acf-4013-aaae-386ae5cbd5f4"}
I1103 09:23:10.426415       1 snapshot_controller_base.go:162] enqueued "snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1" for sync
I1103 09:23:12.279782       1 connection.go:186] GRPC response: {}
I1103 09:23:12.279876       1 connection.go:187] GRPC error: rpc error: code = Internal desc = {RequestID: 556ec084-4b68-4bd7-bd9f-fcc5f44bccaf , Code: InternalError, Description: Internal error occurred%!(EXTRA string=creation), BackendError: {Code:SnapshotSpaceOrderFailed, Type:ProvisioningFailed, Description:Snapshot space order failed for the given volume ID, BackendError:Trace Code:556ec084-4b68-4bd7-bd9f-fcc5f44bccaf, The user is not authorized Please check Check access permissions and try again., RC:500}, Action: Please check 'BackendError' tag for more details}
I1103 09:23:12.279895       1 snapshot_controller.go:324] createSnapshotWrapper: CreateSnapshot for content snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1 returned error: rpc error: code = Internal desc = {RequestID: 556ec084-4b68-4bd7-bd9f-fcc5f44bccaf , Code: InternalError, Description: Internal error occurred%!(EXTRA string=creation), BackendError: {Code:SnapshotSpaceOrderFailed, Type:ProvisioningFailed, Description:Snapshot space order failed for the given volume ID, BackendError:Trace Code:556ec084-4b68-4bd7-bd9f-fcc5f44bccaf, The user is not authorized Please check Check access permissions and try again., RC:500}, Action: Please check 'BackendError' tag for more details}

External-Snapshotter version: v6.0.1

Related PR for reference : #651

Environment:

  • Driver version:
  • Kubernetes version (use kubectl version): 1.25
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@ambiknai changed the title from "retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1" to "Retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1" on Nov 4, 2022
@xing-yang
Collaborator

@zhucan Can you take a look?

@zhucan
Member

zhucan commented Nov 7, 2022

@xing-yang @ambiknai

if newSnapContent.Status != nil && newSnapContent.Status.Error != nil {
If creating the snapshot fails (and the error is a CSI final error), it will remove the finalizer from the content (that is the update function) while content.Status.Error is nil, and the content will be re-added to the queue again. If we don't add this condition checking content.Status.Error, it will not be re-added to the queue.
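
For illustration, a simplified, self-contained sketch of the check being discussed (local stand-in types, not the sidecar's actual structs; the annotation key is shown only for context):

package main

import "fmt"

// Simplified stand-ins for the VolumeSnapshotContent fields involved.
type contentStatus struct{ Error *string }

type content struct {
    Annotations map[string]string
    Status      *contentStatus
}

const annBeingCreated = "snapshot.storage.kubernetes.io/volumesnapshot-being-created"

// shouldSkipRequeue mirrors the guarded early return quoted above: re-enqueue
// is skipped only when the updated content already records an error and the
// "being created" annotation was just removed. Any other update, such as a
// finalizer removal while Status.Error is nil, is still enqueued.
func shouldSkipRequeue(oldObj, newObj content) bool {
    if newObj.Status != nil && newObj.Status.Error != nil {
        _, newExists := newObj.Annotations[annBeingCreated]
        _, oldExists := oldObj.Annotations[annBeingCreated]
        if !newExists && oldExists {
            return true
        }
    }
    return false
}

func main() {
    errMsg := "rpc error: code = Internal"
    oldContent := content{Annotations: map[string]string{annBeingCreated: "yes"}}
    newContent := content{Annotations: map[string]string{}, Status: &contentStatus{Error: &errMsg}}
    fmt.Println(shouldSkipRequeue(oldContent, newContent)) // true: skip, let backoff drive the retry
}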

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 5, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 7, 2023
@avivlevitski-vlz

Workaround:
The expected behaviour only appears when the sidecar's optional timeout argument is added (for example --timeout=60s); with it set, the retry interval backs off as expected.

@gfariasalves-ionos

After some debugging we noticed some details that might be affecting the workflow:

content, err = ctrl.setAnnVolumeSnapshotBeingCreated(content)

When creating a snapshot, ctrl.setAnnVolumeSnapshotBeingCreated() and ctrl.removeAnnVolumeSnapshotBeingCreated() are called with content. The annotation is used in ResourceEventHandlerFuncs's UpdateFunc() for the exponential backoff in

UpdateFunc: func(oldObj, newObj interface{}) {

ResourceEventHandlerFuncs is an implementation of ResourceEventHandler, whose OnUpdate method has the following documentation:

// * OnUpdate is called when an object is modified. Note that oldObj is the
// last known state of the object-- it is possible that several changes
// were combined together, so you can't use this to see every single
// change. OnUpdate is also called when a re-list happens, and it will
// get called even if nothing changed. This is useful for periodically
// evaluating or syncing something.

Since the object is modified more than once during the workflow (for example when there is an error), it is possible that oldObj no longer has the annotation set, so the check in line 111 does not really work.
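
To make the hypothesis concrete, a minimal sketch of that scenario (simplified stand-in code, not the sidecar's): if several updates were coalesced, the old object handed to UpdateFunc may already be missing the annotation, so the early return that enables exponential backoff never triggers.

package main

import "fmt"

const annBeingCreated = "snapshot.storage.kubernetes.io/volumesnapshot-being-created"

func main() {
    // Hypothetical coalesced update: several changes were combined, so the
    // "old" object handed to OnUpdate has already lost the annotation too.
    oldAnnotations := map[string]string{} // annotation already gone
    newAnnotations := map[string]string{} // annotation gone as well

    _, oldExists := oldAnnotations[annBeingCreated]
    _, newExists := newAnnotations[annBeingCreated]

    if !newExists && oldExists {
        fmt.Println("skip re-enqueue (exponential backoff applies)")
    } else {
        // oldExists is false here, so the early return never fires and the
        // content is enqueued immediately, with no backoff between retries.
        fmt.Println("enqueue immediately")
    }
}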

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Apr 28, 2023
@torredil
Member

/reopen

@k8s-ci-robot
Contributor

@torredil: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang
Collaborator

/reopen

@k8s-ci-robot reopened this on May 19, 2023
@k8s-ci-robot
Contributor

@xing-yang: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue May 22, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

This does not seek to be a comprehensive rate-limiting solution, but
rather to add a temporary workaround for the bug in the snapshotter
sidecar by refusing to call the CreateSnapshot for a specific volume
unless it has been 30 seconds since the last attempt.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Jun 18, 2023
@ambiknai
Author

/reopen

@k8s-ci-robot
Contributor

@ambiknai: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jun 19, 2023
@sameshai
Contributor

@zhucan @xing-yang @gfariasalves-ionos we have tried reverting to the original fix, but the issue still exists.

Can we get help from the community to fix this ASAP? This is a big issue, effectively causing a DoS in scenarios where the source volume was deleted or the volume is not attached for the respective snapshot and snapshot content objects.

This can be of great concern for any consumer of this sidecar.

diff --git a/pkg/sidecar-controller/snapshot_controller_base.go b/pkg/sidecar-controller/snapshot_controller_base.go
index 60b30e7a..88c5b10f 100644
--- a/pkg/sidecar-controller/snapshot_controller_base.go
+++ b/pkg/sidecar-controller/snapshot_controller_base.go
@@ -104,13 +104,17 @@ func NewCSISnapshotSideCarController(
                                // and CSI CreateSnapshot will be called again without exponential backoff.
                                // So we are skipping the re-queue here to avoid CreateSnapshot being called without exponential backoff.
                                newSnapContent := newObj.(*crdv1.VolumeSnapshotContent)
-                               if newSnapContent.Status != nil && newSnapContent.Status.Error != nil {
-                                       oldSnapContent := oldObj.(*crdv1.VolumeSnapshotContent)
-                                       _, newExists := newSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
-                                       _, oldExists := oldSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
-                                       if !newExists && oldExists {
-                                               return
-                                       }
+                               oldSnapContent := oldObj.(*crdv1.VolumeSnapshotContent)
+                               klog.V(5).Infof("newSnapContent %+v", newSnapContent)
+                               klog.V(5).Infof("oldSnapContent %+v", oldSnapContent)
+                               klog.V(5).Infof("newSnapContent status %+v", newSnapContent.Status)
+                               klog.V(5).Infof("oldSnapContent status %+v", oldSnapContent.Status)
+                               _, newExists := newSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
+                               _, oldExists := oldSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
+                               klog.V(5).Infof("newExists %v", newExists)
+                               klog.V(5).Infof("oldExists %v", oldExists)
+                               if !newExists && oldExists {
+                                       return
                                }
                                ctrl.enqueueContentWork(newObj)
                        },

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue Jun 26, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

Adds a 15 second rate limit to CreateSnapshot when the failure
originates from CreateSnapshot in cloud (i.e. the error likely
originates from the AWS API).

This prevents the driver from getting stuck in an infinite loop if
snapshot creation fails, where it will indefinitely retry creating a
snapshot and continue to receive an error because it is going too fast.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
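
For context, the driver-side workaround described in this commit amounts to remembering the last failed attempt per volume and refusing further CreateSnapshot calls until a minimum interval has elapsed. A minimal sketch of that idea; this is not the actual aws-ebs-csi-driver code, the type and function names are hypothetical, and the 15-second interval simply mirrors the commit description:

package main

import (
    "fmt"
    "sync"
    "time"
)

// snapshotRateLimiter is a hypothetical per-volume limiter: after a failed
// CreateSnapshot, further attempts for the same volume are rejected until
// minInterval has passed.
type snapshotRateLimiter struct {
    mu          sync.Mutex
    lastFailure map[string]time.Time
    minInterval time.Duration
}

func newSnapshotRateLimiter(minInterval time.Duration) *snapshotRateLimiter {
    return &snapshotRateLimiter{lastFailure: map[string]time.Time{}, minInterval: minInterval}
}

// Allow reports whether a CreateSnapshot call for volumeID may proceed.
func (l *snapshotRateLimiter) Allow(volumeID string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    last, ok := l.lastFailure[volumeID]
    return !ok || time.Since(last) >= l.minInterval
}

// RecordFailure remembers when the last attempt for volumeID failed.
func (l *snapshotRateLimiter) RecordFailure(volumeID string) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.lastFailure[volumeID] = time.Now()
}

func main() {
    limiter := newSnapshotRateLimiter(15 * time.Second)
    vol := "r014-41301bc5-7acf-4013-aaae-386ae5cbd5f4"

    if !limiter.Allow(vol) {
        fmt.Println("rate limited, skipping CreateSnapshot")
        return
    }
    // ... the cloud CreateSnapshot call would go here; on error:
    limiter.RecordFailure(vol)
    fmt.Println("next attempt allowed:", limiter.Allow(vol)) // false until 15s have passed
}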
@xing-yang
Collaborator

/remove-lifecycle rotten
