
Using Fast Snapshot Restores could not create the second volume's fast snapshot #1608

Closed
Phaow opened this issue May 19, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Phaow

Phaow commented May 19, 2023

/kind bug

What happened?

  1. Create a VolumeSnapshotClass with the fastSnapshotRestoreAvailabilityZones parameter:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc-fast
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  fastSnapshotRestoreAvailabilityZones: "us-east-2a,us-east-2b,us-east-2c"
  2. Create two namespaces, fsr-test1 and fsr-test2.
  3. Create pvc1 and pod1 in namespace fsr-test1, wait for the pod to be running, then create a snapshot of pvc1; the volumesnapshot/volumesnapshotcontent becomes readyToUse: true.
  4. Create pvc2 and pod2 in namespace fsr-test2, wait for the pod to be running, then create a snapshot of pvc2; the volumesnapshot/volumesnapshotcontent is stuck at readyToUse: false.

What you expected to happen?
In step 4, the volumesnapshot/volumesnapshotcontent should also become readyToUse: true.

How to reproduce it (as minimally and precisely as possible)?
Always

Anything else we need to know?:
Checking the controller's csi-snapshotter container logs shows a large number of errors of the form: Could not create snapshot ... SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.

$ kubectl logs -c csi-snapshotter ebs-csi-controller-5f5cb78cf6-98z8c
...
E0519 14:02:50.823265       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume vol-069dd17ffe023255d: "rpc error: code = Internal desc = Could not create snapshot \"snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade\": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.\n\tstatus code: 400, request id: 3dbc9ef3-9fe1-486b-83c6-267a744309a1"
E0519 14:02:50.823280       1 snapshot_controller_base.go:283] could not sync content "snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade": failed to take snapshot of the volume vol-069dd17ffe023255d: "rpc error: code = Internal desc = Could not create snapshot \"snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade\": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.\n\tstatus code: 400, request id: 3dbc9ef3-9fe1-486b-83c6-267a744309a1"
I0519 14:02:50.823376       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-80e53a1d-3d11-47a4-a0ec-803e6c291ade", UID:"56274979-34d6-4d35-9a1a-5279[… output truncated …]nal desc = Could not create snapshot "snapshot-80e53a1d-3d11-47a4-a0ec-803e6c291ade": error creating snapshot of volume vol-069dd17ffe023255d: SnapshotCreationPerVolumeRateExceeded: The maximum per volume CreateSnapshot request rate has been exceeded. Use an increasing or variable sleep interval between requests.
	status code: 400, request id: e36e2a71-802f-4ffd-9964-60e598d09a41
...

$ kubectl logs -c csi-snapshotter ebs-csi-controller-5f5cb78cf6-98z8c|grep -i 'The maximum per volume CreateSnapshot request rate has been exceeded.'|wc -l
    3496

Environment

  • Kubernetes version (use kubectl version):
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.0-2934+b40493584076fb-dirty", GitCommit:"b40493584076fb1ab29f3bed1d05d16cbc5b17f1", GitTreeState:"dirty", BuildDate:"2023-04-19T02:35:52Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 19, 2023

@torredil
Member

Hi @Phaow, thanks for reporting this. I was able to reproduce it.

This seems to be a bug in the external-snapshotter sidecar, which watches for VolumeSnapshotContent events and triggers CreateSnapshot calls against the CSI endpoint.

More specifically, the retry interval should double with each failure:

--retry-interval-start: Initial retry interval of failed volume snapshot creation or deletion. It doubles with each failure, up to retry-interval-max. Default value is 1 second.
--retry-interval-max: Maximum retry interval of failed volume snapshot creation or deletion. Default value is 5 minutes.
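
For reference, the behavior those flags describe is a capped exponential backoff. Here is a minimal Go sketch of the intended retry schedule (illustrative only, not the sidecar's actual code):

package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 1 * time.Second    // --retry-interval-start default
	maxInterval := 5 * time.Minute // --retry-interval-max default

	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("attempt %d: wait %v before retrying CreateSnapshot\n", attempt, interval)
		// Double the interval after each failure, capped at the maximum.
		interval *= 2
		if interval > maxInterval {
			interval = maxInterval
		}
	}
}

With the defaults, the waits should run 1s, 2s, 4s, and so on up to 5m.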

This log demonstrates that the retry mechanism is not working as expected; the calls arrive roughly every half second:

$ kubectl logs ebs-csi-controller-c647996b4-5ssr4 -n kube-system -c ebs-plugin | grep "CreateSnapshot: called"

I0519 15:07:59.990335       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:00.712428       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:01.160139       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:01.640038       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:02.060471       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:02.484650       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:02.899675       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:03.254507       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:03.629363       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:03.979643       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:04.518830       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:05.118366       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:05.719047       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:06.318988       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:06.923287       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:07.518083       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:08.118241       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:08.719120       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:09.318523       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:09.919449       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:10.525795       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:11.125796       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:11.721766       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:12.317995       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:12.917658       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:14.395720       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:14.779429       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:15.130892       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:15.944187       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:16.266930       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:16.645909       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:17.123715       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:17.718194       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:18.318361       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:18.918820       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:19.520324       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:20.117079       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:20.718375       1 controller.go:574] "CreateSnapshot: called"
I0519 15:08:21.324949       1 controller.go:574] "CreateSnapshot: called" 
I0519 15:08:21.921063       1 controller.go:574] "CreateSnapshot: called"

@torredil
Member

The second CreateSnapshot call likely fails due to a service quota limit:

You can enable up to 5 snapshots for fast snapshot restore per Region.

Error: {
     Code: "FastSnapshotRestoreLimitExceeded",
     Message: "Your account's regional limit would be exceeded by your request to enable fast snapshot restore on snapshot 'snap-0f2b152a1e1268e1a' in Availability Zone 'us-east-1c'"
}

(With three AZs per snapshot, the two snapshots in the reproduction already require six fast snapshot restores, exceeding the limit of five.) This would also happen within a single namespace if you add more AZs to your VolumeSnapshotClass:

fastSnapshotRestoreAvailabilityZones: "us-east-2a,us-east-2b,us-east-2c,us-east-2d,us-east-2e,us-east-2f"

You may view your applied quota value and request an increase in the Service Quotas dashboard:

[Screenshot: Service Quotas dashboard showing the applied quota value for fast snapshot restores]

The retry interval not doubling is still a bug and is causing the observed SnapshotCreationPerVolumeRateExceeded errors. Tracking this one here: kubernetes-csi/external-snapshotter#778

@Phaow
Author

Phaow commented May 20, 2023

Hi @torredil, got it, thanks a lot for the detailed debugging and clarification! Sorry I missed the 5-snapshots-per-Region quota for fast snapshot restore. Agreed that the retry interval not doubling is still a bug we can track with kubernetes-csi/external-snapshotter#778. We could either keep this issue in its current state (wait for kubernetes-csi/external-snapshotter#778 to be solved, then update the external-snapshotter sidecar image and close) or close it now; both are fine with me. I can close it if needed.

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue May 22, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

This does not seek to be a comprehensive rate-limiting solution, but
rather to add a temporary workaround for the bug in the snapshotter
sidecar by refusing to call the CreateSnapshot for a specific volume
unless it has been 30 seconds since the last attempt.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
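
A minimal Go sketch of the kind of per-volume cooldown this commit describes; the type and all names here are hypothetical, not the driver's actual code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// snapshotRateLimiter tracks the last CreateSnapshot attempt per volume
// and rejects attempts that arrive before the cooldown has elapsed.
type snapshotRateLimiter struct {
	mu          sync.Mutex
	lastAttempt map[string]time.Time
	cooldown    time.Duration
}

func newSnapshotRateLimiter(cooldown time.Duration) *snapshotRateLimiter {
	return &snapshotRateLimiter{
		lastAttempt: make(map[string]time.Time),
		cooldown:    cooldown,
	}
}

// allow reports whether a CreateSnapshot call for volumeID may proceed,
// recording the attempt time when it does.
func (l *snapshotRateLimiter) allow(volumeID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if last, ok := l.lastAttempt[volumeID]; ok && time.Since(last) < l.cooldown {
		return false // too soon; the sidecar will retry later
	}
	l.lastAttempt[volumeID] = time.Now()
	return true
}

func main() {
	l := newSnapshotRateLimiter(30 * time.Second)
	fmt.Println(l.allow("vol-069dd17ffe023255d")) // true: first attempt proceeds
	fmt.Println(l.allow("vol-069dd17ffe023255d")) // false: rejected within the cooldown
}

Rejecting the call locally means the sidecar's rapid retries never reach the EC2 API.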
@torredil
Member

Hi @Phaow, to loop you in on our current plan: we are introducing a change in the next release to temporarily address the SnapshotCreationPerVolumeRateExceeded errors. Let's keep this open for now; it will help with tracking and visibility until a more permanent fix is in place and the retry mechanism is fixed in the external-snapshotter sidecar.

@Phaow
Author

Phaow commented May 24, 2023

@torredil Good to know, many thanks again!

ConnorJC3 added a commit to ConnorJC3/aws-ebs-csi-driver that referenced this issue Jun 26, 2023
See kubernetes-sigs#1608
See kubernetes-csi/external-snapshotter#778

Adds a 15 second rate limit to CreateSnapshot when the failure
originates from CreateSnapshot in cloud (i.e. the error likely
originates from the AWS API).

This prevents the driver from getting stuck in an infinite loop if
snapshot creation fails, where it would indefinitely retry creating a
snapshot and continue to receive an error because it is going too fast.

Signed-off-by: Connor Catlett <conncatl@amazon.com>
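
Sketched in the same hypothetical style as above, the difference from the earlier 30-second workaround is that the cooldown starts only when the failure comes from the cloud call itself:

package main

import (
	"errors"
	"fmt"
	"time"
)

var errCooldown = errors.New("CreateSnapshot attempted too soon after a cloud failure")

type cloudCooldown struct {
	until map[string]time.Time // volume ID -> earliest allowed next attempt
}

func (c *cloudCooldown) createSnapshot(volumeID string, callAWS func() error) error {
	if t, ok := c.until[volumeID]; ok && time.Now().Before(t) {
		return errCooldown // rejected locally; the sidecar retries later
	}
	if err := callAWS(); err != nil {
		// The failure came from the cloud call, so start the 15s cooldown.
		c.until[volumeID] = time.Now().Add(15 * time.Second)
		return fmt.Errorf("CreateSnapshot failed in cloud: %w", err)
	}
	delete(c.until, volumeID) // success clears any cooldown
	return nil
}

func main() {
	c := &cloudCooldown{until: make(map[string]time.Time)}
	fail := func() error { return errors.New("SnapshotCreationPerVolumeRateExceeded") }
	fmt.Println(c.createSnapshot("vol-069dd17ffe023255d", fail)) // cloud error starts the cooldown
	fmt.Println(c.createSnapshot("vol-069dd17ffe023255d", fail)) // rejected during the cooldown
}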
@AndrewSirenko
Contributor

Closing this issue, as it should be fixed by the resolution of "external-snapshotter constantly retrying CreateSnapshot calls on error w/o backoff" (#871). Many thanks @Phaow! /close

@AndrewSirenko
Contributor

/close

@k8s-ci-robot
Contributor

@AndrewSirenko: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Phaow
Author

Phaow commented Sep 26, 2023

@AndrewSirenko Thanks for the info and update!
