Retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1 #778

Closed
ambiknai opened this issue Nov 4, 2022 · 18 comments · Fixed by #871

@ambiknai

ambiknai commented Nov 4, 2022

What happened:

Created a VolumeSnapshot, but the request failed due to an authorisation issue in the storage provider. In the logs, I could see frequent calls to the CreateSnapshot method. As per the docs, the retry interval should double after each failure, but the logs do not show that behaviour.

What you expected to happen:
The retry interval of failed volume snapshot creation or deletion should double after each failure.
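
For reference, the doubling described in the docs comes from re-queuing failed items through a rate-limited workqueue. A minimal sketch of that mechanism, assuming client-go's per-item exponential failure rate limiter (the base and max delays here are illustrative, not necessarily the sidecar's actual settings):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Per-item exponential backoff: the delay returned by When() doubles on
    // every consecutive failure of the same item, up to the max delay.
    limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

    item := "snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1"
    for i := 0; i < 5; i++ {
        fmt.Println(limiter.When(item)) // 1s, 2s, 4s, 8s, 16s
    }

    // After a successful sync the controller would call Forget(), resetting
    // the failure count so the next failure starts from the base delay again.
    limiter.Forget(item)
}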

How to reproduce it:

Create a negative test scenario where VolumeSnapshot creation fails and observe the csi-snapshotter sidecar logs.

Anything else we need to know?:

cat csi-snapshotter.log | grep "GRPC call: /csi.v1.Controller/CreateSnapshot"
I1103 09:23:10.412875       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.306660       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.477653       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.785050       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:12.943098       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:13.124844       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:13.700102       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:14.300774       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:14.904655       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:15.504961       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:16.101649       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:16.704480       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:17.303818       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:17.904957       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:18.524977       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:19.100899       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:19.701028       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:20.302357       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:20.903176       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:21.502685       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:22.101407       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:22.701141       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:23.306374       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:23.902031       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:24.546378       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:25.104353       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:25.706340       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:26.304882       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:26.902224       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:27.500703       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:28.100920       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:28.705816       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot

The csi-snapshotter log shows the corresponding error:

I1103 09:23:10.412875       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateSnapshot
I1103 09:23:10.412881       1 connection.go:184] GRPC request: {"name":"snapshot-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1","source_volume_id":"r014-41301bc5-7acf-4013-aaae-386ae5cbd5f4"}
I1103 09:23:10.426415       1 snapshot_controller_base.go:162] enqueued "snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1" for sync
I1103 09:23:12.279782       1 connection.go:186] GRPC response: {}
I1103 09:23:12.279876       1 connection.go:187] GRPC error: rpc error: code = Internal desc = {RequestID: 556ec084-4b68-4bd7-bd9f-fcc5f44bccaf , Code: InternalError, Description: Internal error occurred%!(EXTRA string=creation), BackendError: {Code:SnapshotSpaceOrderFailed, Type:ProvisioningFailed, Description:Snapshot space order failed for the given volume ID, BackendError:Trace Code:556ec084-4b68-4bd7-bd9f-fcc5f44bccaf, The user is not authorized Please check Check access permissions and try again., RC:500}, Action: Please check 'BackendError' tag for more details}
I1103 09:23:12.279895       1 snapshot_controller.go:324] createSnapshotWrapper: CreateSnapshot for content snapcontent-b9eb23d0-cd0d-47e9-a73f-da6d2132bde1 returned error: rpc error: code = Internal desc = {RequestID: 556ec084-4b68-4bd7-bd9f-fcc5f44bccaf , Code: InternalError, Description: Internal error occurred%!(EXTRA string=creation), BackendError: {Code:SnapshotSpaceOrderFailed, Type:ProvisioningFailed, Description:Snapshot space order failed for the given volume ID, BackendError:Trace Code:556ec084-4b68-4bd7-bd9f-fcc5f44bccaf, The user is not authorized Please check Check access permissions and try again., RC:500}, Action: Please check 'BackendError' tag for more details}

External-Snapshotter version: v6.0.1

Related PR for reference : #651

Environment:

  • Driver version:
  • Kubernetes version (use kubectl version): 1.25
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@ambiknai changed the title from "retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1" to "Retry interval of failed volume snapshot creation or deletion does not double after each failure: v6.0.1" on Nov 4, 2022
@xing-yang
Collaborator

@zhucan Can you take a look?

@zhucan
Member

zhucan commented Nov 7, 2022

@xing-yang @ambiknai

if newSnapContent.Status != nil && newSnapContent.Status.Error != nil {
If creating the snapshot fails (and the error is a CSI final error), it will remove the finalizer from the content (that is the update function) while content.Status.Error is nil, and the content will be re-added to the queue again. If we don't add this condition checking content.Status.Error, it will not be re-added to the queue.
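
For illustration, a simplified, self-contained sketch of the check being discussed (local stand-in types, not the sidecar's actual structs; the annotation key is shown only for context):

package main

import "fmt"

// Simplified stand-ins for the VolumeSnapshotContent fields involved.
type contentStatus struct{ Error *string }

type content struct {
    Annotations map[string]string
    Status      *contentStatus
}

const annBeingCreated = "snapshot.storage.kubernetes.io/volumesnapshot-being-created"

// shouldSkipRequeue mirrors the guarded early return quoted above: re-enqueue
// is skipped only when the updated content already records an error and the
// "being created" annotation was just removed. Any other update, such as a
// finalizer removal while Status.Error is nil, is still enqueued.
func shouldSkipRequeue(oldObj, newObj content) bool {
    if newObj.Status != nil && newObj.Status.Error != nil {
        _, newExists := newObj.Annotations[annBeingCreated]
        _, oldExists := oldObj.Annotations[annBeingCreated]
        if !newExists && oldExists {
            return true
        }
    }
    return false
}

func main() {
    errMsg := "rpc error: code = Internal"
    oldContent := content{Annotations: map[string]string{annBeingCreated: "yes"}}
    newContent := content{Annotations: map[string]string{}, Status: &contentStatus{Error: &errMsg}}
    fmt.Println(shouldSkipRequeue(oldContent, newContent)) // true: skip, let backoff drive the retry
}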

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 5, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 7, 2023
@avivlevitski-vlz

Workaround:
The expected behaviour only appears when the sidecar's optional timeout argument is added (for example --timeout=60s); with it set, the retry interval backs off as expected.

@gfariasalves-ionos

After some debugging we noticed some details that might be affecting the workflow:

content, err = ctrl.setAnnVolumeSnapshotBeingCreated(content)

When creating a snapshot, ctrl.setAnnVolumeSnapshotBeingCreated() and ctrl.removeAnnVolumeSnapshotBeingCreated() are called with content. The annotation is used in ResourceEventHandlerFuncs's UpdateFunc() for the exponential backoff in

UpdateFunc: func(oldObj, newObj interface{}) {

ResourceEventHandlerFuncs is an implementation of ResourceEventHandler, whose OnUpdate method has the following documentation:

// * OnUpdate is called when an object is modified. Note that oldObj is the
// last known state of the object-- it is possible that several changes
// were combined together, so you can't use this to see every single
// change. OnUpdate is also called when a re-list happens, and it will
// get called even if nothing changed. This is useful for periodically
// evaluating or syncing something.

Since the object is modified more than once during the workflow (for example when there is an error), it is possible that oldObj no longer has the annotation set, so the check in line 111 does not really work.
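
To make the hypothesis concrete, a minimal sketch of that scenario (simplified stand-in code, not the sidecar's): if several updates were coalesced, the old object handed to UpdateFunc may already be missing the annotation, so the early return that enables exponential backoff never triggers.

package main

import "fmt"

const annBeingCreated = "snapshot.storage.kubernetes.io/volumesnapshot-being-created"

func main() {
    // Hypothetical coalesced update: several changes were combined, so the
    // "old" object handed to OnUpdate has already lost the annotation too.
    oldAnnotations := map[string]string{} // annotation already gone
    newAnnotations := map[string]string{} // annotation gone as well

    _, oldExists := oldAnnotations[annBeingCreated]
    _, newExists := newAnnotations[annBeingCreated]

    if !newExists && oldExists {
        fmt.Println("skip re-enqueue (exponential backoff applies)")
    } else {
        // oldExists is false here, so the early return never fires and the
        // content is enqueued immediately, with no backoff between retries.
        fmt.Println("enqueue immediately")
    }
}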

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Apr 28, 2023
@torredil
Member

/reopen

@k8s-ci-robot
Contributor

@torredil: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang
Collaborator

/reopen

@k8s-ci-robot reopened this on May 19, 2023
@k8s-ci-robot
Contributor

@xing-yang: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue May 22, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

This does not seek to be a comprehensive rate-limiting solution, but
rather to add a temporary workaround for the bug in the snapshotter
sidecar by refusing to call the CreateSnapshot for a specific volume
unless it has been 30 seconds since the last attempt.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Jun 18, 2023
@ambiknai
Author

/reopen

@k8s-ci-robot
Contributor

@ambiknai: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jun 19, 2023
@sameshai
Contributor

@zhucan @xing-yang @gfariasalves-ionos we have tried reverting to the original fix, but the issue still exists.

Can we get help from the community to fix this ASAP? This is a big issue, effectively causing a DoS in scenarios where the source volume was deleted or the volume is not attached for the respective snapshot and snapshot content objects.

This can be of great concern for any consumer of this sidecar.

diff --git a/pkg/sidecar-controller/snapshot_controller_base.go b/pkg/sidecar-controller/snapshot_controller_base.go
index 60b30e7a..88c5b10f 100644
--- a/pkg/sidecar-controller/snapshot_controller_base.go
+++ b/pkg/sidecar-controller/snapshot_controller_base.go
@@ -104,13 +104,17 @@ func NewCSISnapshotSideCarController(
                                // and CSI CreateSnapshot will be called again without exponential backoff.
                                // So we are skipping the re-queue here to avoid CreateSnapshot being called without exponential backoff.
                                newSnapContent := newObj.(*crdv1.VolumeSnapshotContent)
-                               if newSnapContent.Status != nil && newSnapContent.Status.Error != nil {
-                                       oldSnapContent := oldObj.(*crdv1.VolumeSnapshotContent)
-                                       _, newExists := newSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
-                                       _, oldExists := oldSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
-                                       if !newExists && oldExists {
-                                               return
-                                       }
+                               oldSnapContent := oldObj.(*crdv1.VolumeSnapshotContent)
+                               klog.V(5).Infof("newSnapContent %+v", newSnapContent)
+                               klog.V(5).Infof("oldSnapContent %+v", oldSnapContent)
+                               klog.V(5).Infof("newSnapContent status %+v", newSnapContent.Status)
+                               klog.V(5).Infof("oldSnapContent status %+v", oldSnapContent.Status)
+                               _, newExists := newSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
+                               _, oldExists := oldSnapContent.ObjectMeta.Annotations[utils.AnnVolumeSnapshotBeingCreated]
+                               klog.V(5).Infof("newExists %v", newExists)
+                               klog.V(5).Infof("oldExists %v", oldExists)
+                               if !newExists && oldExists {
+                                       return
                                }
                                ctrl.enqueueContentWork(newObj)
                        },

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue Jun 26, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

Adds a 15 second rate limit to CreateSnapshot when the failure
originates from CreateSnapshot in cloud (i.e. the error likely
originates from the AWS API).

This prevents the driver from getting stuck in an infinite loop if
snapshot creation fails, where it will indefinitely retry creating a
snapshot and continue to receive an error because it is going too fast.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
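
For context, the driver-side workaround described in this commit amounts to remembering the last failed attempt per volume and refusing further CreateSnapshot calls until a minimum interval has elapsed. A minimal sketch of that idea; this is not the actual aws-ebs-csi-driver code, the type and function names are hypothetical, and the 15-second interval simply mirrors the commit description:

package main

import (
    "fmt"
    "sync"
    "time"
)

// snapshotRateLimiter is a hypothetical per-volume limiter: after a failed
// CreateSnapshot, further attempts for the same volume are rejected until
// minInterval has passed.
type snapshotRateLimiter struct {
    mu          sync.Mutex
    lastFailure map[string]time.Time
    minInterval time.Duration
}

func newSnapshotRateLimiter(minInterval time.Duration) *snapshotRateLimiter {
    return &snapshotRateLimiter{lastFailure: map[string]time.Time{}, minInterval: minInterval}
}

// Allow reports whether a CreateSnapshot call for volumeID may proceed.
func (l *snapshotRateLimiter) Allow(volumeID string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    last, ok := l.lastFailure[volumeID]
    return !ok || time.Since(last) >= l.minInterval
}

// RecordFailure remembers when the last attempt for volumeID failed.
func (l *snapshotRateLimiter) RecordFailure(volumeID string) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.lastFailure[volumeID] = time.Now()
}

func main() {
    limiter := newSnapshotRateLimiter(15 * time.Second)
    vol := "r014-41301bc5-7acf-4013-aaae-386ae5cbd5f4"

    if !limiter.Allow(vol) {
        fmt.Println("rate limited, skipping CreateSnapshot")
        return
    }
    // ... the cloud CreateSnapshot call would go here; on error:
    limiter.RecordFailure(vol)
    fmt.Println("next attempt allowed:", limiter.Allow(vol)) // false until 15s have passed
}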
@xing-yang
Collaborator

/remove-lifecycle rotten
