
[BUG] VolumeSnapshot remains in a non-ready state even though the related LH snapshot and backup are ready #8618

Closed
WebberHuang1118 opened this issue May 22, 2024 · 5 comments
Assignees
Labels
• area/csi: CSI related like control/node driver, sidecars
• area/volume-backup-restore: Volume backup restore
• investigation-needed: Need to identify the case before estimating and starting the development
• kind/bug
• require/backport: Require backport. Only used when the specific versions to backport have not been defined.
• require/qa-review-coverage: Require QA to review coverage
Milestone

Comments


WebberHuang1118 commented May 22, 2024

Describe the bug

The VolumeSnapshot remains in a non-ready state even though the related LH snapshot and backup are ready.

To Reproduce

Steps to reproduce the behavior:

  1. Create a custom storage class named custom (screenshot omitted).

  2. Create 3 VMs with different volume setups (screenshots omitted):

    • vm-1 (only 1 rootdisk)

    • vm-2-default (1 rootdisk + 1 extra volume using the default SC)

    • vm-2-custom (1 rootdisk + 1 extra volume using the custom SC)

  3. Take a backup of the 3 VMs.

  4. Stop the 3 VMs.

  5. Take a backup of the 3 VMs again; the backup may fail (screenshot omitted).
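The custom storage class in step 1 was created through the Harvester UI, and its screenshot is not preserved here. A roughly equivalent manifest, assuming the Longhorn CSI provisioner and with purely illustrative parameter values (the report does not show the actual settings), might look like:

```yaml
# Hypothetical equivalent of the "custom" storage class created in the UI.
# Provisioner and parameter values are assumptions, not taken from the report.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: custom
provisioner: driver.longhorn.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
```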

Expected behavior

Backups should succeed for the VMs both when they are Running and when they are Off.

Support bundle for troubleshooting

supportbundle_off-extra-disk_2nodes.zip

Environment

  • Harvester
    • Version: v1.2.2
    • Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
    • ui-source: Auto

Additional context

Some investigation: harvester/harvester#5841 (comment)

@WebberHuang1118 WebberHuang1118 added kind/bug require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been defined. labels May 22, 2024
@innobead innobead added this to the v1.7.0 milestone May 22, 2024

derekbit commented May 22, 2024

@ejweber @PhanLe1010 Can you help check whether this is a valid issue? We are waiting for the analysis before releasing v1.6.2.
The issue is severity/3 or severity/4 and not a blocker for the release.

@innobead innobead added investigation-needed Need to identify the case before estimating and starting the development area/volume-backup-restore Volume backup restore area/csi CSI related like control/node driver, sidecars labels May 22, 2024

ejweber commented May 22, 2024

The analysis is mostly in harvester/harvester#5841, where we describe why the failure occurs in the Harvester reproduction. The most important thing in the Harvester case is to avoid the failure in the first place, as unnecessary volume faulting is never good.

Here we can investigate the theory that the upstream fix kubernetes-csi/external-snapshotter#953 prevents us from getting stuck permanently.


This is a fairly reliable reproduce:

  1. Deploy Longhorn from master, except use csi-snapshotter v6.3.2 (Longhorn master has already upgraded past it).
  2. Create a volume (called test) with three replicas.
  3. Deploy the default longhorn VolumeSnapshotClass from examples/snapshot/snapshotclass.yaml.
  4. Prepare a VolumeSnapshot like the one below.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: test
  5. Execute something similar to the following command, which manually attaches the volume to a node and then immediately takes a snapshot. (This is the cause of the failure in the Harvester case.)
kl patch volume test --type json -p '[{"op": "replace", "path": "/spec/nodeID", "value":"eweber-v126-worker-9c1451b4-kgxdq"}]' && k apply -f examples/snapshot/snapshot_pvc.yaml
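One way to observe whether the VolumeSnapshot created above ever becomes ready is to poll status.readyToUse in the output of kubectl get volumesnapshot -o json. A minimal sketch of such a watcher (the function names and the polling loop are illustrative, not part of the reproduce steps):

```python
import json
import subprocess
import time


def snapshot_ready(snapshot: dict) -> bool:
    """Return True only when status.readyToUse is reported as true.

    The status block is absent until the snapshot controller first
    reconciles the VolumeSnapshot, so missing keys mean "not ready".
    """
    return bool(snapshot.get("status", {}).get("readyToUse", False))


def wait_for_snapshot(name: str, timeout: float = 300.0) -> bool:
    """Poll `kubectl get volumesnapshot <name> -o json` until ready or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "volumesnapshot", name, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        if snapshot_ready(json.loads(out)):
            return True
        time.sleep(5)
    return False
```

With the older csi-snapshotter, a stuck snapshot never flips readyToUse, so a loop like this times out; with v7.0.2 it eventually returns once the retries below succeed.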

I was able to get stuck (seemingly) permanently in a way similar to https://github.com/harvester/harvester/issues/5841 two out of four times.


Now, do the same as the above, except use csi-snapshotter v7.0.2 (the one Longhorn master uses).

I was able to hit a backup issue temporarily three out of five times. However, csi-snapshotter re-reconciled the VolumeSnapshotContents periodically until it detected (and recorded) readyToUse == true. In the logs below, we see the subsequent reconciliations.

[csi-snapshotter-59587776b-75wg7] I0522 20:42:20.714309       1 snapshot_controller.go:341] createSnapshotWrapper: CreateSnapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd returned error: rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume test and snapshot snapshot-98e7e751-f82d-4e86-a49c-0e38b921a9dd
[csi-snapshotter-59587776b-75wg7] E0522 20:42:20.724558       1 snapshot_controller.go:121] createSnapshot for content [snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume test: "rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume test and snapshot snapshot-98e7e751-f82d-4e86-a49c-0e38b921a9dd"
[csi-snapshotter-59587776b-75wg7] E0522 20:42:20.724806       1 snapshot_controller_base.go:359] could not sync content "snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd": failed to take snapshot of the volume test: "rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume test and snapshot snapshot-98e7e751-f82d-4e86-a49c-0e38b921a9dd"
[csi-snapshotter-59587776b-75wg7] I0522 20:42:20.725405       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0522 20:42:20.724725       1 event.go:364] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd", UID:"a857e6bc-07b3-41d4-821c-f01c57c06ca6", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"102407610", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume test: "rpc error: code = DeadlineExceeded desc = waitForBackupControllerSync: timeout while waiting for backup controller to sync for volume test and snapshot snapshot-98e7e751-f82d-4e86-a49c-0e38b921a9dd"
[csi-snapshotter-59587776b-75wg7] I0522 20:42:21.725627       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0522 20:42:25.763006       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0522 20:42:33.811218       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0522 20:42:49.852162       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-98e7e751-f82d-4e86-a49c-0e38b921a9dd through the plugin ...
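The retry timestamps above (20:42:21, :25, :33, :49) show roughly doubling intervals between reconciliations, presumably from the controller's rate-limited workqueue. A minimal sketch of that doubling-retry pattern (function name, base, and cap values are illustrative, not taken from csi-snapshotter):

```python
import time
from typing import Callable


def retry_with_backoff(op: Callable[[], bool],
                       base: float = 1.0,
                       cap: float = 300.0,
                       max_attempts: int = 10) -> bool:
    """Retry `op` until it succeeds, doubling the delay after each failure."""
    delay = base
    for _ in range(max_attempts):
        if op():
            return True
        # The real controller requeues the item instead of sleeping inline.
        time.sleep(delay)
        delay = min(delay * 2, cap)
    return False
```

This is why the newer snapshotter recovers: each failed CreateSnapshot call is retried with growing delays until the backup controller has synced and readyToUse can be recorded.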

PhanLe1010 (Contributor) commented:

Agree that this is not a Longhorn issue. Let's close this ticket and track it on the Harvester ticket. WDYT? @ejweber @derekbit @WebberHuang1118


ejweber commented May 22, 2024

By the way, we did not backport csi-snapshotter v7.0.2 to Longhorn v1.6.2 because of our policy around major version changes in a stable release. Since there is a workaround (and we probably won't regularly hit this issue anyway), I think we should continue to not backport.

#8493 (comment)

> Agree that this is not a Longhorn issue. Let's close this ticket and track it on the Harvester ticket. WDYT? @ejweber @derekbit @WebberHuang1118

I agree.

PhanLe1010 (Contributor) commented:

Closing. Feel free to reopen if there are additional concerns.
