
Using Fast Snapshot Restores could not create the second volume's fast snapshot #1608

Closed
Phaow opened this issue May 19, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Phaow

Phaow commented May 19, 2023

/kind bug

What happened?

  1. Create a VolumeSnapshotClass with the fastSnapshotRestoreAvailabilityZones parameter:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc-fast
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  fastSnapshotRestoreAvailabilityZones: "us-east-2a,us-east-2b,us-east-2c"
  2. Create two namespaces, fsr-test1 and fsr-test2.
  3. Create pvc1 and pod1 in namespace fsr-test1, wait for the pod to be running, then create a snapshot of pvc1; the volumesnapshot/volumesnapshotcontent becomes readyToUse: true.
  4. Create pvc2 and pod2 in namespace fsr-test2, wait for the pod to be running, then create a snapshot of pvc2; the volumesnapshot/volumesnapshotcontent is stuck at readyToUse: false.

What you expected to happen?
In step 4, the volumesnapshot/volumesnapshotcontent should also become readyToUse: true.

How to reproduce it (as minimally and precisely as possible)?
Always

Anything else we need to know?:
Checking the controller's csi-snapshotter container logs shows a large number of errors of the form: Could not create snapshot ... SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.

$ kubectl logs -c csi-snapshotter ebs-csi-controller-5f5cb78cf6-98z8c
...
E0519 14:02:50.823265       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume vol-069dd17ffe023255d: "rpc error: code = Internal desc = Could not create snapshot \"snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade\": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.\n\tstatus code: 400, request id: 3dbc9ef3-9fe1-486b-83c6-267a744309a1"
E0519 14:02:50.823280       1 snapshot_controller_base.go:283] could not sync content "snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade": failed to take snapshot of the volume vol-069dd17ffe023255d: "rpc error: code = Internal desc = Could not create snapshot \"snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade\": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.\n\tstatus code: 400, request id: 3dbc9ef3-9fe1-486b-83c6-267a744309a1"
I0519 14:02:50.823376       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade", UID:"56274979-34d6-4d35-9a1a-5279[… output truncated …]nal desc = Could not create snapshot "snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.
	status code: 400, request id: e36e2a71-802f-4ffd-9964-60e598d09a41
...

$ kubectl logs -c csi-snapshotter ebs-csi-controller-5f5cb78cf6-98z8c|grep -i 'The maximum per volume CreateSnapshot request rate has been exceeded.'|wc -l
    3496

Environment

  • Kubernetes version (use kubectl version):
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.0-2934+b40493584076fb-dirty", GitCommit:"b40493584076fb1ab29f3bed1d05d16cbc5b17f1", GitTreeState:"dirty", BuildDate:"2023-04-19T02:35:52Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 19, 2023

@torredil
Member

Hi @Phaow, thanks for reporting this. I was able to reproduce it.

This seems to be a bug in the external-snapshotter sidecar, which watches for VolumeSnapshotContent events and triggers CreateSnapshot calls against the CSI endpoint.

More specifically, the retry interval should double with each failure:

--retry-interval-start: Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default value is 1 second.
--retry-interval-max: Maximum retry interval of failed volume snapshot creation or deletion. Default value is 5 minutes.
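
For reference, the behavior those flags describe is a capped exponential backoff. Here is a minimal Go sketch of the intended retry schedule (illustrative only, not the sidecar's actual code):

package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 1 * time.Second    // --retry-interval-start default
	maxInterval := 5 * time.Minute // --retry-interval-max default

	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("attempt %d: wait %v before retrying CreateSnapshot\n", attempt, interval)
		// Double the interval after each failure, capped at the maximum.
		interval *= 2
		if interval > maxInterval {
			interval = maxInterval
		}
	}
}

With the defaults, the waits should run 1s, 2s, 4s, and so on up to 5m.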

This log demonstrates that the retry mechanism is not working as expected; the calls arrive roughly every half second:

$ kubectl logs ebs-csi-controller-c647996b4-5ssr4 -n kube-system -c ebs-plugin | grep "CreateSnapshot: called"

I0519 15:07:59.990335       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:00.712428       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:01.160139       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:01.640038       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:02.060471       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:02.484650       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:02.899675       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:03.254507       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:03.629363       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:03.979643       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:04.518830       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:05.118366       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:05.719047       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:06.318988       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:06.923287       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:07.518083       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:08.118241       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:08.719120       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:09.318523       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:09.919449       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:10.525795       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:11.125796       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:11.721766       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:12.317995       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:12.917658       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:14.395720       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:14.779429       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:15.130892       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:15.944187       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:16.266930       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:16.645909       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:17.123715       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:17.718194       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:18.318361       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:18.918820       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:19.520324       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:20.117079       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:20.718375       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:21.324949       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:21.921063       1 controller.go:574] "CreateSnapshot: called"

@torredil
Member

The second CreateSnapshot call likely fails due to a service quota limit:

You can enable up to 5 snapshots for fast snapshot restore per Region.

Error: {
     Code: "FastSnapshotRestoreLimitExceeded",
     Message: "Your account's regional limit would be exceeded by your request to enable fast snapshot restore on snapshot 'snap-0f2b152a1e1268e1a' in Availability Zone 'us-east-1c'"
}

(With three AZs per snapshot, the two snapshots in the reproduction already require six fast snapshot restores, exceeding the limit of five.) This would also happen within a single namespace if you add more AZs to your VolumeSnapshotClass:

fastSnapshotRestoreAvailabilityZones: "us-east-2a,us-east-2b,us-east-2c,us-east-2d,us-east-2e,us-east-2f"

You may view your applied quota value and request an increase in the Service Quotas dashboard:

[Screenshot: Service Quotas dashboard showing the applied quota value for fast snapshot restores]

The retry interval not doubling is still a bug and is causing the observed SnapshotCreationPerVolumeRateExceeded errors. Tracking this one here: kubernetes-csi/external-snapshotter#778

@Phaow
Author

Phaow commented May 20, 2023

Hi @torredil, got it, thanks a lot for the detailed debugging and clarification! Sorry I missed the 5-snapshots-per-Region quota for fast snapshot restore. Agreed that the retry interval not doubling is still a bug we can track with kubernetes-csi/external-snapshotter#778. We could either keep this issue in its current state (wait for kubernetes-csi/external-snapshotter#778 to be solved, then update the external-snapshotter sidecar image and close) or close it now; both are fine with me. I can close it if needed.

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue May 22, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

This does not seek to be a comprehensive rate-limiting solution, but
rather to add a temporary workaround for the bug in the snapshotter
sidecar by refusing to call the CreateSnapshot for a specific volume
unless it has been 30 seconds since the last attempt.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
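
A minimal Go sketch of the kind of per-volume cooldown this commit describes; the type and all names here are hypothetical, not the driver's actual code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// snapshotRateLimiter tracks the last CreateSnapshot attempt per volume
// and rejects attempts that arrive before the cooldown has elapsed.
type snapshotRateLimiter struct {
	mu          sync.Mutex
	lastAttempt map[string]time.Time
	cooldown    time.Duration
}

func newSnapshotRateLimiter(cooldown time.Duration) *snapshotRateLimiter {
	return &snapshotRateLimiter{
		lastAttempt: make(map[string]time.Time),
		cooldown:    cooldown,
	}
}

// allow reports whether a CreateSnapshot call for volumeID may proceed,
// recording the attempt time when it does.
func (l *snapshotRateLimiter) allow(volumeID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if last, ok := l.lastAttempt[volumeID]; ok && time.Since(last) < l.cooldown {
		return false // too soon; the sidecar will retry later
	}
	l.lastAttempt[volumeID] = time.Now()
	return true
}

func main() {
	l := newSnapshotRateLimiter(30 * time.Second)
	fmt.Println(l.allow("vol-069dd17ffe023255d")) // true: first attempt proceeds
	fmt.Println(l.allow("vol-069dd17ffe023255d")) // false: rejected within the cooldown
}

Rejecting the call locally means the sidecar's rapid retries never reach the EC2 API.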
@torredil
Member

Hi @Phaow, to loop you in on our current plan: we are introducing a change in the next release to temporarily address the SnapshotCreationPerVolumeRateExceeded errors. Let's keep this open for now; it will help with tracking and visibility until a more permanent fix is in place and the retry mechanism is fixed in the external-snapshotter sidecar.

@Phaow
Author

Phaow commented May 24, 2023

@torredil Good to know, many thanks again!

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue Jun 26, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

Adds a 15 second rate limit to CreateSnapshot when the failure
originates from CreateSnapshot in cloud (i.e. the error likely
originates from the AWS API).

This prevents the driver from getting stuck in an infinite loop if
snapshot creation fails, where it would indefinitely retry creating a
snapshot and continue to receive an error because it is going too fast.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
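
Sketched in the same hypothetical style as above, the difference from the earlier 30-second workaround is that the cooldown starts only when the failure comes from the cloud call itself:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errCooldown = errors.New("CreateSnapshot attempted too soon after a cloud failure")

type cloudCooldown struct {
	until map[string]time.Time // volume ID -> earliest allowed next attempt
}

func (c *cloudCooldown) createSnapshot(volumeID string, callAWS func() error) error {
	if t, ok := c.until[volumeID]; ok && time.Now().Before(t) {
		return errCooldown // rejected locally; the sidecar retries later
	}
	if err := callAWS(); err != nil {
		// The failure came from the cloud call, so start the 15s cooldown.
		c.until[volumeID] = time.Now().Add(15 * time.Second)
		return fmt.Errorf("CreateSnapshot failed in cloud: %w", err)
	}
	delete(c.until, volumeID) // success clears any cooldown
	return nil
}

func main() {
	c := &cloudCooldown{until: make(map[string]time.Time)}
	fail := func() error { return errors.New("SnapshotCreationPerVolumeRateExceeded") }
	fmt.Println(c.createSnapshot("vol-069dd17ffe023255d", fail)) // cloud error starts the cooldown
	fmt.Println(c.createSnapshot("vol-069dd17ffe023255d", fail)) // rejected during the cooldown
}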
@AndrewSirenko
Contributor

Closing this issue, as it should be fixed by the resolution of "external-snapshotter constantly retrying CreateSnapshot calls on error w/o backoff" (#871). Many thanks @Phaow! /close

@AndrewSirenko
Contributor

/close

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Phaow
Author

Phaow commented Sep 26, 2023

@AndrewSirenko Thanks for the info and update!
