Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external-snapshotter does not wait snapshot to get ready #953

Closed
jsafrane opened this issue Nov 7, 2023 · 2 comments · Fixed by #958
Closed

external-snapshotter does not wait snapshot to get ready #953

jsafrane opened this issue Nov 7, 2023 · 2 comments · Fixed by #958

Comments

@jsafrane
Copy link
Contributor

jsafrane commented Nov 7, 2023

What happened:

Consider this sequence of calls between the external-snapshotter and a CSI driver creating a single snapshot:

  1. CreateSnapshot succeeds and returns an un-ready snapshot.
  2. Subsequent CreateSnapshot with the same name fails with an error (this seems to be important!)
  3. Subsequent CreateSnapshot with the same name succeeds with un-ready snapshot again (the driver / storage is slow).

At this point, the external-snapshotter stops. It does not call CreateSnapshot again to check if the snapshot is ready. The VolumeSnapshot and VolumeSnapshotContent are readyToUse: false forever.

What you expected to happen:

The external-snapshotter calls CreateSnapshot again, until it's ready.

How to reproduce it:

See above. The error injected at step 2. seems to be somehow important.

Anything else we need to know?:

Environment:
external-snapshotter v6.3.0

@jsafrane
Copy link
Contributor Author

jsafrane commented Nov 7, 2023

I actually haven't found any code that waits for a snapshot to become ready. createSnapshotWrapper is called just because it got informer events, because the very same createSnapshotWrapper keeps adding + removing AnnVolumeSnapshotBeingCreated to VolumeSnapshotContent that it syncs.. There is no exp. backoff when waiting for the snapshot to become ready!

Somehow, an error response breaks this vicious cycle.

@jsafrane
Copy link
Contributor Author

jsafrane commented Nov 8, 2023

The issue seems to be very timing sensitive and the error at step 2. IMO just changes the timing a bit.

The initial CreateSnapshot that actually created a snapshot + returned readyToUse=false logs this after the snapshot creation:

snapshot_controller.go:338] Created snapshot: driver hostpath.csi.k8s.io, snapshotId 37d69928-7e21-11ee-8ef0-0a580a830019, creationTime 2023-11-08 10:26:03.089625331 +0000 UTC, size 1073741824, readyToUse false
snapshot_controller.go:419] updateSnapshotContentStatus: updating VolumeSnapshotContent [snapcontent-618b23b8-8714-4166-95d7-b5066958b188], snapshotHandle 37d69928-7e21-11ee-8ef0-0a580a830019, readyToUse false, createdAt 1699439163089625331, size 1073741824
snapshot_controller_base.go:208] enqueued "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" for sync
snapshot_controller_base.go:208] enqueued "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" for sync
snapshot_controller.go:662] Removed VolumeSnapshotBeingCreated annotation from volume snapshot content snapcontent-618b23b8-8714-4166-95d7-b5066958b188
util.go:248] storeObjectUpdate updating content "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" with version 72030
util.go:248] storeObjectUpdate updating content "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" with version 72030

I.e. the snapshotter has updated VolumeSnapshotContentStatus + removed AnnVolumeSnapshotBeingCreated and got events from the informer before it has saved new VolumeSnapshotContent resourceVersion to the informer cache.

In the step 3., the snapshotter managed to save the VolumeSnapshotContent in the informer cache before the informer received it through its watch and thus the informer ignored it:

snapshot_controller.go:338] Created snapshot: driver hostpath.csi.k8s.io, snapshotId 37d69928-7e21-11ee-8ef0-0a580a830019, creationTime 2023-11-08 10:26:03.089625331 +0000 UTC, size 1073741824, readyToUse false
snapshot_controller.go:419] updateSnapshotContentStatus: updating VolumeSnapshotContent [snapcontent-618b23b8-8714-4166-95d7-b5066958b188], snapshotHandle 37d69928-7e21-11ee-8ef0-0a580a830019, readyToUse false, createdAt 1699439163089625331, size 1073741824
request.go:629] Waited for 178.269582ms due to client-side throttling, not priority and fairness, request: PATCH:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1/volumesnapshotcontents/snapcontent-618b23b8-8714-4166-95d7-b5066958b188
snapshot_controller.go:662] Removed VolumeSnapshotBeingCreated annotation from volume snapshot content snapcontent-618b23b8-8714-4166-95d7-b5066958b188
util.go:248] storeObjectUpdate updating content "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" with version 72052
util.go:248] storeObjectUpdate updating content "snapcontent-618b23b8-8714-4166-95d7-b5066958b188" with version 72052

IMO, the snapshotter should re-queue all VolumeSnapshotContents that are still readyToUse=false with exp. backoff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant