[BUG] Fail to backup a Stopped/Off VM due to volume error state #5841

Open
albinsun opened this issue May 21, 2024 · 26 comments
Labels
area/longhorn issues depends on the upstream longhorn kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/often Reproducible 10% to 99% of the time require/doc Improvements or additions to documentation require/investigate severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)

@albinsun

Describe the bug
Backups may fail on an Off/Stopped VM.
image

To Reproduce
Steps to reproduce the behavior:

  1. Create a custom storage class named custom (a YAML sketch of such a class is shown after these steps)

    image

  2. Create 3 VMs with different volume setups

    • vm-1 (Only 1 rootdisk)

      image

    • vm-2-default (1 rootdisk + 1 extra volume using default SC)

      image

    • vm-2-custom (1 rootdisk + 1 extra volume using custom SC)

      image

  3. Take a backup of each of the 3 VMs

  4. Stop the 3 VMs

  5. Take a backup of each of the 3 VMs again
    The backup may fail
    image
    The Longhorn volume replica fails
    image
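
For reference, the custom storage class from step 1 can also be created from the CLI. This is only a minimal sketch of a Longhorn-backed StorageClass; the parameter values (replica count, timeouts) are assumptions, not the exact settings shown in the screenshots:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: custom
provisioner: driver.longhorn.io        # Longhorn CSI driver
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"                # assumed value for the custom class
  staleReplicaTimeout: "30"
EOF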

Expected behavior
The VM can be backed up both when it is Running and when it is Off.

Support bundle

  1. 3 nodes: supportbundle_off-volume_3nodes-2.zip

Environment

  • Harvester
    • Version: v1.2.2
    • Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
    • ui-source: Auto

Additional context

  1. Seems hard to reproduce on a single-node Harvester.
  2. Can be reproduced on a 2-node cluster
    supportbundle_off-extra-disk_2nodes.zip
    image
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) area/longhorn issues depends on the upstream longhorn reproduce/often Reproducible 10% to 99% of the time labels May 21, 2024
@albinsun albinsun added this to the v1.2.2 milestone May 21, 2024
@innobead

innobead commented May 21, 2024

@albinsun do we cover this in our e2e auto test coverage? or just manual test coverage?

cc @khushboo-rancher

@WebberHuang1118
Member

Some findings from the 2-node support bundle (SB):

The failing vmbackup is default/extra-default.off:

apiVersion: harvesterhci.io/v1beta1
kind: VirtualMachineBackup
metadata:
  name: extra-default.off
  namespace: default
  resourceVersion: "1265"
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-extra-default
  type: backup
status:
  backupTarget:
    endpoint: nfs://192.168.123.1:/pcloud/
  conditions:
  - lastTransitionTime: "2024-05-21T05:15:21Z"
    message: VolumeSnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py
      in error state
    reason: Error
    status: "False"
    type: InProgress
  - lastTransitionTime: "2024-05-21T05:15:21Z"
    message: Not Ready
    status: "False"
    type: Ready
  csiDriverVolumeSnapshotClassNames:
    driver.longhorn.io: longhorn
  error:
    message: VolumeSnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py
      in error state
    time: "2024-05-21T05:15:21Z"
  readyToUse: false
  secretBackups:
  - data:
      userdata: I2Nsb3VkLWNvbmZpZwpwYWNrYWdlX3VwZGF0ZTogdHJ1ZQpwYWNrYWdlczoKICAtIHFlbXUtZ3Vlc3QtYWdlbnQKcnVuY21kOgogIC0gLSBzeXN0ZW1jdGwKICAgIC0gZW5hYmxlCiAgICAtIC0tbm93CiAgICAtIHFlbXUtZ3Vlc3QtYWdlbnQuc2VydmljZQo=
    name: vm-extra-default-cryzr
  source:
    metadata:
      annotations:
        harvesterhci.io/vmRunStrategy: RerunOnFailure
        harvesterhci.io/volumeClaimTemplates: '[{"metadata":{"name":"vm-extra-default-rootdisk-vp3py","annotations":{"harvesterhci.io/imageId":"default/image-c87wj"}},"spec":{"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"10Gi"}},"volumeMode":"Block","storageClassName":"longhorn-image-c87wj"}},{"metadata":{"name":"vm-extra-default-disk-1-oohjf"},"spec":{"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"4Gi"}},"volumeMode":"Block","storageClassName":"harvester-longhorn"}}]'
        kubevirt.io/latest-observed-api-version: v1
        kubevirt.io/storage-observed-api-version: v1
        network.harvesterhci.io/ips: '[]'
      labels:
        harvesterhci.io/creator: harvester
        harvesterhci.io/os: ubuntu
      name: vm-extra-default
      namespace: default
  sourceUID: a2e7f0e6-650a-46b5-bd4e-58f19e5b0a19
  volumeBackups:
  - creationTime: "2024-05-21T05:15:27Z"
    csiDriverName: driver.longhorn.io
    error:
      message: 'Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349:
        "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting
        for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action
        [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349
        volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349]
        map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach
        detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach
        salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"null'
      time: "2024-05-21T05:15:20Z"
    longhornBackupName: backup-e6b3ac5fe0fb463a
    name: extra-default.off-volume-vm-extra-default-rootdisk-vp3py
    persistentVolumeClaim:
      metadata:
        annotations:
          harvesterhci.io/imageId: default/image-c87wj
          harvesterhci.io/owned-by: '[{"schema":"kubevirt.io.virtualmachine","refs":["default/vm-extra-default"]}]'
          pv.kubernetes.io/bind-completed: "yes"
          pv.kubernetes.io/bound-by-controller: "yes"
          volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
          volume.kubernetes.io/storage-provisioner: driver.longhorn.io
        name: vm-extra-default-rootdisk-vp3py
        namespace: default
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 10Gi
        storageClassName: longhorn-image-c87wj
        volumeMode: Block
        volumeName: pvc-31c455ad-c207-4d2f-bc77-229619430349
    readyToUse: false
    volumeName: rootdisk
  - creationTime: "2024-05-21T05:15:26Z"
    csiDriverName: driver.longhorn.io
    error:
      message: 'Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated
        annotation from the content snapcontent-7387072c-f039-4d65-b989-152587fd8379:
        snapshot controller failed to update snapcontent-7387072c-f039-4d65-b989-152587fd8379
        on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io
        "snapcontent-7387072c-f039-4d65-b989-152587fd8379": the object has been modified;
        please apply your changes to the latest version and try again'
      time: "2024-05-21T05:15:20Z"
    longhornBackupName: backup-0ce511dbbb7244fb
    name: extra-default.off-volume-vm-extra-default-disk-1-oohjf
    persistentVolumeClaim:
      metadata:
        annotations:
          harvesterhci.io/owned-by: '[{"schema":"kubevirt.io.virtualmachine","refs":["default/vm-extra-default"]}]'
          pv.kubernetes.io/bind-completed: "yes"
          pv.kubernetes.io/bound-by-controller: "yes"
          volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
          volume.kubernetes.io/storage-provisioner: driver.longhorn.io
        name: vm-extra-default-disk-1-oohjf
        namespace: default
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 4Gi
        storageClassName: harvester-longhorn
        volumeMode: Block
        volumeName: pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a
    readyToUse: false
    volumeName: disk-1

This vmbackup tried to back up two of the VM's disks:

  • rootdisk
    • PVC vm-extra-default-rootdisk-vp3py
    • LH volume pvc-31c455ad-c207-4d2f-bc77-229619430349
  • disk1
    • PVC vm-extra-default-disk-1-oohjf
    • LH volume pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a

The backup is still in progress; however, these two volumes are already in the detached state, in which snapshots cannot be taken.

kubectl -n longhorn-system get volumes.longhorn.io 
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE   AGE
pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a   detached   unknown                  4294967296           3h50m
pvc-31c455ad-c207-4d2f-bc77-229619430349   detached   unknown                  10737418240          3h50m
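
The state of the individual replicas behind these volumes can be checked in the same way (a sketch; it assumes Longhorn's usual longhornvolume label on replica objects):

# list the replicas belonging to each of the two Longhorn volumes above
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-31c455ad-c207-4d2f-bc77-229619430349
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a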

It seems there is some similarity to #3813. I will try to reproduce it for further investigation, thanks.

@bk201
Member

bk201 commented May 21, 2024

I can repro this with v1.3-head + Longhorn v1.6.2-rc2 while testing #5787.
Update: it seems to be due to a space issue, please ignore.

@albinsun
Author

FYR, I can reproduce on the dolphins cluster (4 nodes) on v1.2.2.
Will send you the support bundle in a message.

Backup

VolumeSnapshot vm2-custom-volume-vm2-custom-disk-1-d84nk in error state
image

Longhorn

  • Volume pvc-324a9e72-3061-4186-b88e-967a1f89ca61
    Replica 967a1f89ca61-r-7aa83157 / harvester-rrtfg Failed
    image

  • Volume pvc-1e1ddf14-9f0b-4ee5-9efe-dc0f6fa44991
    Replica dc0f6fa44991-r-64e568b0 / harvester-rrtfg Failed
    image

@albinsun
Author

albinsun commented May 21, 2024

@albinsun do we cover this in our e2e auto test coverage? or just manual test coverage?

cc @khushboo-rancher

After some checking, our current VM backup TCs are mainly located in harvester_e2e_tests/integrations/test_4_vm_backup_restore, but they seem to take backups only on a Running VM.

  1. TestBackupRestore::tests_backup_vm
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L331
  2. TestBackupRestoreOnMigration::test_backup_migrated_vm
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L659
  3. TestMultipleBackupRestore::test_backup_multiple
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L836

Note that, in testing so far, it's harder to reproduce this in a bare-metal env,
so hardware performance may be one factor.

c.c @khushboo-rancher, @lanfon72

@irishgordo

@albinsun - I struggled to reproduce this on bare metal; I've not been able to reproduce it successfully.
This might be similar to what I noticed with:
#5427
It was impossible for me to reproduce 5427 on bare metal (on a system that met the "testing requirements") - only a KVM-provisioned env was successful.

@albinsun
Author

@albinsun - I struggled to reproduce this on bare metal; I've not been able to reproduce it successfully. This might be similar to what I noticed with: #5427 It was impossible for me to reproduce 5427 on bare metal (on a system that met the "testing requirements") - only a KVM-provisioned env was successful.

Thank you for checking!
This issue can be reproduced on bare metal, just less frequently.
Refer to #5841 (comment)

@irishgordo

@albinsun - I'll run a few more checks on it too, as it was just two runs - perhaps I'll be able to reproduce it with a few more iterations 👍

@lanfon72
Member

We don't have any test cases that back up the VM twice.

@derekbit

cc @ChanYiLin

@WebberHuang1118
Member

WebberHuang1118 commented May 22, 2024

TL;DR: LH CSI CreateSnapshot() returns without error but sets ReadyToUse to false; this makes the external-snapshotter stop reconciling, and the VolumeSnapshot stays in a non-ready state.


Taking the 2-node SB as an example

Data structure for one problematic volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py:

  • volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py (READYTOUSE=false)
    • volumesnapshotcontent snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 (READYTOUSE=false)
      • snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 (READYTOUSE=true)
      • backup backup-e6b3ac5fe0fb463a (state=completed)
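
For reference, each object in this chain can be inspected directly with kubectl (a sketch using the names from this example):

# CSI-level objects
kubectl -n default get volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py -o yaml
kubectl get volumesnapshotcontent snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 -o yaml

# Longhorn-level objects
kubectl -n longhorn-system get snapshots.longhorn.io snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 -o yaml
kubectl -n longhorn-system get backups.longhorn.io backup-e6b3ac5fe0fb463a -o yaml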

Timeline while trying to take the snapshot:

Around 2024-05-21T05:15:18, the CSI snapshotter tried to take the snapshot; however, the LH snapshot was not ready yet, as we can see from the external-snapshotter log:

2024-05-21T05:15:18.588724357Z I0521 05:15:18.588673       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

2024-05-21T05:15:20.806632265Z I0521 05:15:20.806587       1 snapshot_controller.go:325] createSnapshotWrapper: CreateSnapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 returned error: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]

The LH CSI plugin has matching logs:

2024-05-21T05:15:18.664498451Z time="2024-05-21T05:15:18Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",    \"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"

 2024-05-21T05:15:18.670883382Z time="2024-05-21T05:15:18Z" level=info msg="Creating Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:776" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup

2024-05-21T05:15:20.806363293Z time="2024-05-21T05:15:20Z" level=error msg="Failed to create snapshot" func="csi.(*ControllerServer).CreateSnapshot.func1" file="controller_server.go:590" component=csi-controller-server error="rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" function=CreateSnapshot

Around 2024-05-21T05:15:21, the CSI snapshotter tried to take the snapshot again:

2024-05-21T05:15:21.889671854Z I0521 05:15:21.888949       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

This time LH successfully took the snapshot and backup:

2024-05-21T05:15:21.948437174Z time="2024-05-21T05:15:21Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"

2024-05-21T05:15:33.958948776Z time="2024-05-21T05:15:33Z" level=info msg="Creating volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:794" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup

2024-05-21T05:15:48.084897436Z time="2024-05-21T05:15:48Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup backup-e6b3ac5fe0fb463a of snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:813"

However, even though the LH snapshot and backup were successfully created, the related volumesnapshot stayed at READYTOUSE=false. IMHO, this is because LH checks whether the backup is ready via the created field but reports ReadyToUse based on the state field; if at a specific point the LH backup has created != nil but state == false, this is likely to happen.

During the reproduction steps on my side, I turned the external-snapshotter log level up to V5 and found this log (from https://github.com/kubernetes-csi/external-snapshotter/blob/f7ea027c3cdc76537edd9a295c4a5dfd94cfdc8e/pkg/sidecar-controller/snapshot_controller.go#L336), which indicates LH returned the volumesnapshotcontent without error but with readyToUse as false:

csi-snapshotter-6b754d4b6c-5j4zp:I0522 06:18:06.232129 1 snapshot_controller.go:417] updateSnapshotContentStatus: updating VolumeSnapshotContent [snapcontent-08867fd6-3f70-4c59-b6b2-dce8fb144901], snapshotHandle bak://pvc-53d62a14-ad8d-499d-9fa2-9bd62e9d1095/backup-aad0e09c1cd84358, readyToUse false, createdAt 1716358667000000000, size 10737418240

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.
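
For completeness, a sketch of the workaround plus a quick check that the stuck VolumeSnapshot recovers (object names are the ones from this thread):

# restart the sidecar so it re-reconciles the stuck VolumeSnapshotContent
kubectl -n longhorn-system rollout restart deployment csi-snapshotter
kubectl -n longhorn-system rollout status deployment csi-snapshotter

# once reconciled, READYTOUSE should flip to true
kubectl -n default get volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py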

@derekbit

cc @PhanLe1010

@ejweber

ejweber commented May 22, 2024

it indicates LH returned the volumesnapshotcontent without error but readyToUse as false

I'm not sure yet on the full context of this issue, but it is correct for Longhorn to return no error with readyToUse false for bak type snapshots. This is because the actual snapshotting can be very fast, but the backup part can take a long time. The CSI spec does not want COs to freeze workloads (if they DO freeze workloads) while operations like uploads are progressing.

The cluster snapshotter components should periodically check back with the CSI plugin to confirm whether the snapshot has become readyToUse later.

https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter
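
In other words, readyToUse=false with no error is expected while the backup upload is still in progress; from the CLI one could simply wait for the snapshot to become ready, for example (a sketch using the VolumeSnapshot name from this thread):

kubectl -n default wait volumesnapshot/extra-default.off-volume-vm-extra-default-rootdisk-vp3py \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=10m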

@ejweber

ejweber commented May 22, 2024

I am investigating the permanent failure part, but I think the reason for the initial failure is somewhat clear.

This is a Harvester v1.2.2 cluster running Longhorn v1.5.5. As discussed in #4907, Longhorn v1.5.5 handles its own attaching and detaching for snapshots using an internal volume attachment controller. #4910 removes functionality from Harvester that previously automatically attached volumes for snapshots. HOWEVER, v1.2.2 does not appear to have that change. For example, the code at

func (h *ActionHandler) mountSourcePVC(volumeSnapshot *snapshotv1.VolumeSnapshot) error {
    if volumeSnapshot.Spec.Source.PersistentVolumeClaimName == nil {
        return nil
    }
    pvc, err := h.pvcCache.Get(volumeSnapshot.Namespace, *volumeSnapshot.Spec.Source.PersistentVolumeClaimName)
    if err != nil {
        if apierrors.IsNotFound(err) {
            return nil
        }
        return fmt.Errorf("can't find pvc %s/%s, err: %w", volumeSnapshot.Namespace, *volumeSnapshot.Spec.Source.PersistentVolumeClaimName, err)
    }
    volume, err := h.volumeCache.Get(util.LonghornSystemNamespaceName, pvc.Spec.VolumeName)
    if err != nil {
        return fmt.Errorf("failed to get volume %s/%s, error: %s", util.LonghornSystemNamespaceName, pvc.Spec.VolumeName, err.Error())
    }
    if volume.Status.State == lhv1beta2.VolumeStateDetached || volume.Status.State == lhv1beta2.VolumeStateDetaching {
        volCpy := volume.DeepCopy()
        volCpy.Spec.NodeID = volume.Status.OwnerID
        logrus.Infof("mount detached volume %s to the node %s", volCpy.Name, volCpy.Spec.NodeID)
        if _, err = h.volumes.Update(volCpy); err != nil {
            return err
        }
    }
    return nil
}

still exists. We see that this code executed at the time the failed backups were created:

2024-05-21T05:15:18.300855786Z time="2024-05-21T05:15:18Z" level=info msg="mount detached volume pvc-31c455ad-c207-4d2f-bc77-229619430349 to the node harvester-node-1"

This is very confusing to Longhorn, because the only component that is currently allowed to set volume.spec.nodeID is the internal volume attachment controller.

The volume attachment controller notices Harvester tried to attach the volume and decides to detach it.

2024-05-21T05:15:18.366260253Z time="2024-05-21T05:15:18Z" level=info msg="Should detach volume harvester-node-1 because there is no matching attachment ticket" func="controller.(*VolumeAttachmentController).shouldDoDetach" file="volume_attachment_controller.go:496" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1
2024-05-21T05:15:18.366357418Z time="2024-05-21T05:15:18Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 is selected to detach from node harvester-node-1" func="controller.(*VolumeAttachmentController).handleVolumeDetachment" file="volume_attachment_controller.go:455" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1
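
As a side note, the attachment tickets that the volume attachment controller reconciles against are visible on the Longhorn VolumeAttachment CR, which carries the same name as the volume (a sketch):

# the attachment tickets are listed in the spec of this object
kubectl -n longhorn-system get volumeattachments.longhorn.io pvc-31c455ad-c207-4d2f-bc77-229619430349 -o yaml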

But some processes are already starting.

2024-05-21T05:15:18.543496374Z time="2024-05-21T05:15:18Z" level=info msg="Creating instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff" func="controller.(*InstanceHandler).createInstance" file="instance_handler.go:472"
2024-05-21T05:15:18.546947067Z time="2024-05-21T05:15:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff\", UID:\"943a13c9-f00d-4fa9-87c4-86a79cf1f2ad\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"21850\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
2024-05-21T05:15:18.701191154Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, Storage IP 10.52.1.13" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:164"
2024-05-21T05:15:18.701256509Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, IP 10.52.1.13" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:169"
2024-05-21T05:15:18.701262747Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, Port 10021" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:173"

2024-05-21T05:15:18.405046754Z time="2024-05-21T05:15:18Z" level=info msg="Creating instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e" func="controller.(*InstanceHandler).createInstance" file="instance_handler.go:472"
2024-05-21T05:15:18.409616650Z time="2024-05-21T05:15:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e\", UID:\"1e19339c-752e-44f8-bd0b-b68ae3618d16\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"21838\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
2024-05-21T05:15:18.654000108Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, Storage IP 10.52.0.47" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:164"
2024-05-21T05:15:18.654201931Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, IP 10.52.0.47" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:169"
2024-05-21T05:15:18.654304898Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, Port 10010" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:173"

Then the volume attachment controller decides to attach the volume.

2024-05-21T05:15:18.753617612Z time="2024-05-21T05:15:18Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 is selected to attach to node harvester-node-1, ticket +&{snapshot-controller-snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 snapshot-controller harvester-node-1 map[disableFrontend:any] 0}" func="controller.(*VolumeAttachmentController).handleVolumeAttachment" file="volume_attachment_controller.go:576" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1

There is a flurry of process deletions and creations related to the conflicting actions. The engine process crashes because the replica processes that were started are shut down before it can use them.

2024-05-21T05:15:19.721120427Z time="2024-05-21T05:15:19Z" level=warning msg="pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a-e-0: time=\"2024-05-21T05:15:19Z\" level=error msg=\"no available backend due to service unavailable from the addresses [tcp://10.52.0.47:10020 tcp://10.52.1.13:10011]\" func=cmd.startController file=\"controller.go:186\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:458"

In the process of this happening, we try and fail to take a snapshot, kicking off the VM backup failures this ticket was opened with.

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@ejweber

ejweber commented May 22, 2024

As for the permanent backup failure, Longhorn mostly seems to be performing correctly, given the circumstances.

In the two node support bundle, one backup is taken while the volume is attached and one while it is detached.

While it is attached, we see multiple csi-snapshotter calls to create a snapshot:

2024-05-21T05:13:26.811565040Z I0521 05:13:26.807341       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.024469713Z I0521 05:13:35.024406       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.096448974Z I0521 05:13:35.096266       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.138410250Z I0521 05:13:35.138355       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.429538885Z I0521 05:13:35.429135       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...

On the csi-plugin side, we respond with no error each time. The last time, we respond with ready_to_use = true. This is due to the behavior I previously described. The csi-snapshotter keeps checking back until we complete the backup upload.

2024-05-21T05:13:34.996622158Z time="2024-05-21T05:13:34Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.045948912Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.126273263Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.157806757Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.620873156Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268411},\"ready_to_use\":true,\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"

While the volume is detached, we see something somewhat similar, except an error occurs. On the csi-snapshotter side, the first attempt fails. But it tries again.

2024-05-21T05:15:18.588724357Z I0521 05:15:18.588673       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...
2024-05-21T05:15:20.806632265Z I0521 05:15:20.806587       1 snapshot_controller.go:325] createSnapshotWrapper: CreateSnapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 returned error: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]
2024-05-21T05:15:20.888221499Z E0521 05:15:20.888006       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
2024-05-21T05:15:20.888275541Z E0521 05:15:20.888058       1 snapshot_controller_base.go:351] could not sync content "snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50": failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
2024-05-21T05:15:20.888648190Z I0521 05:15:20.888527       1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50", UID:"fd933a8d-5f30-4d08-bee9-2e15923e36e8", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"22008", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"

2024-05-21T05:15:21.889671854Z I0521 05:15:21.888949       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

On the csi-plugin side, we respond as follows:

2024-05-21T05:15:18.664498451Z time="2024-05-21T05:15:18Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"
2024-05-21T05:15:18.670883382Z time="2024-05-21T05:15:18Z" level=info msg="Creating Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:776" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
2024-05-21T05:15:20.806363293Z time="2024-05-21T05:15:20Z" level=error msg="Failed to create snapshot" func="csi.(*ControllerServer).CreateSnapshot.func1" file="controller_server.go:590" component=csi-controller-server error="rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" function=CreateSnapshot
2024-05-21T05:15:20.806384095Z time="2024-05-21T05:15:20Z" level=error msg="CreateSnapshot: err: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" func=csi.logGRPC file="server.go:138"

2024-05-21T05:15:21.948437174Z time="2024-05-21T05:15:21Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"
2024-05-21T05:15:33.958948776Z time="2024-05-21T05:15:33Z" level=info msg="Creating volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:794" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
2024-05-21T05:15:48.084897436Z time="2024-05-21T05:15:48Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup backup-e6b3ac5fe0fb463a of snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:813"
2024-05-21T05:15:48.085109292Z time="2024-05-21T05:15:48Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268527},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-e6b3ac5fe0fb463a\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"

So far, this behavior is correct. Now, the backup actually completes almost immediately after the backup CR is created.

2024-05-21T05:15:48.602742604Z time="2024-05-21T05:15:48Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349\", UID:\"1bbb7e42-b976-48c1-87e5-c550fee231c9\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"22888\", FieldPath:\"\"}): type: 'Normal' reason: 'Completed' Snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 backup backup-e6b3ac5fe0fb463a label map[KubernetesStatus:{\"pvName\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\",\"pvStatus\":\"Bound\",\"namespace\":\"default\",\"pvcName\":\"vm-extra-default-rootdisk-vp3py\",\"lastPVCRefAt\":\"\",\"workloadsStatus\":[{\"podName\":\"virt-launcher-vm-extra-default-64rgh\",\"podStatus\":\"Failed\",\"workloadName\":\"vm-extra-default\",\"workloadType\":\"VirtualMachineInstance\"}],\"lastPodRefAt\":\"2024-05-21T05:14:49Z\"} longhorn.io/volume-access-mode:rwx]" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"

In this case, it would have been nice if the csi-plugin had waited until the backup completed to return, so it could return a ready_to_use == true, but it does not due to longhorn/longhorn-manager#2104. (In general, there is no reason to assume uploading the backup will be quick, so the behavior is correct.) We expect the csi-snapshotter to try again, at which point, we will set ready_to_use = true and the csi-snapshotter will clear the error.

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

@PhanLe1010

Agree with @ejweber on the immediate cause of the CSI snapshot failure. A small nit: Harvester seems to use this logic to attach the volume instead:

func (h *ActionHandler) snapshot(_ context.Context, pvcNamespace, pvcName, snapshotName string) error {
    pvc, err := h.pvcCache.Get(pvcNamespace, pvcName)
    if err != nil {
        return err
    }
    provisioner := util.GetProvisionedPVCProvisioner(pvc)
    csiDriverInfo, err := settings.GetCSIDriverInfo(provisioner)
    if err != nil {
        return err
    }
    if provisioner == longhorntypes.LonghornDriverName {
        volume, err := h.volumeCache.Get(util.LonghornSystemNamespaceName, pvc.Spec.VolumeName)
        if err != nil {
            return fmt.Errorf("failed to get volume %s/%s, error: %s", pvc.Namespace, pvc.Spec.VolumeName, err.Error())
        }
        volCpy := volume.DeepCopy()
        if volume.Status.State == lhv1beta2.VolumeStateDetached || volume.Status.State == lhv1beta2.VolumeStateDetaching {
            volCpy.Spec.NodeID = volume.Status.OwnerID
        }
        if !reflect.DeepEqual(volCpy, volume) {
            logrus.Infof("mount detached volume %s to the node %s", volCpy.Name, volCpy.Spec.NodeID)
            if _, err = h.volumes.Update(volCpy); err != nil {
                return err
            }
        }
    }

@PhanLe1010

PhanLe1010 commented May 22, 2024

For context, this error in the CSI snapshotter:
2024-05-21T05:15:20.888648190Z I0521 05:15:20.888527 1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50", UID:"fd933a8d-5f30-4d08-bee9-2e15923e36e8", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"22008", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
means that the volume is currently faulted (there is an attach/detach war between the Longhorn AD controller and the Harvester controller, which tries to modify volume.Spec.NodeID manually). When the volume is faulted, the Longhorn API snapshotCRGet is not available: https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/api/model.go#L1456-L1461
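
A quick way to confirm the volume is in this faulted state (a sketch using the volume from this thread):

kubectl -n longhorn-system get volumes.longhorn.io pvc-31c455ad-c207-4d2f-bc77-229619430349 \
  -o jsonpath='{.status.state}{"  "}{.status.robustness}{"\n"}'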

I strongly agree that Harvester should get rid of all the logic that sets Longhorn volume.Spec.NodeID directly; it conflicts with the Longhorn AD controller and can make the volume faulted.

@ejweber

ejweber commented May 22, 2024

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

It may be related to kubernetes-csi/external-snapshotter#953.

@ejweber

ejweber commented May 22, 2024

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.

I agree with this workaround. I can trigger this with about 50% regularity using pure Longhorn now that we understand what can lead to it. This workaround has successfully gotten me unstuck.

@PhanLe1010

PhanLe1010 commented May 22, 2024

I use this approach to reproduce with Longhorn only (without Harvester):

  1. Deploy Longhorn v1.5.5
  2. Deploy a deployment with a Longhorn PVC: kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/deployment.yaml
  3. Edit the volume to make it have 1 replica. This makes it more likely to crash the only replica of the volume and make the volume faulted
  4. Run this script repeatedly:
    #!/bin/bash
    
    kubectl -n longhorn-system patch volumes.longhorn.io pvc-a9887842-fe5a-4c02-891b-6fd82df26c30 --type='merge' -p '{"spec":{"nodeID":"phan-v660-pool2-e685807c-zv582"}}' 
    
    cat <<EOF | kubectl apply -f-
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-csi-volume-snapshot-longhorn-backup
    spec:
      volumeSnapshotClassName: longhorn-backup-vsc
      source:
        persistentVolumeClaimName: mysql-pvc
    EOF
    sleep 10
    IS_READY=$(kubectl get volumesnapshots.snapshot.storage.k8s.io test-csi-volume-snapshot-longhorn-backup -o jsonpath='{.status.readyToUse}')
    if [ "$IS_READY" == "true" ]; then
      echo "The CSI snapshot is ready. Deleting it"
      kubectl delete volumesnapshots.snapshot.storage.k8s.io test-csi-volume-snapshot-longhorn-backup
    else
      echo " CSI snapshot is not ready."
    fi
  5. Repeat until you see that the volume is faulted and the CSI snapshot is stuck at READYTOUSE=false
    ➜  .kube k get volumesnapshots.snapshot.storage.k8s.io 
    NAME                                       READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS         SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    test-csi-volume-snapshot-longhorn-backup   false        mysql-pvc                           2Gi           longhorn-backup-vsc   snapcontent-4d9dd671-037b-4e75-ae29-dc1b6dff7833   4m10s          4m18s
    even though Longhorn backup is completed:
    ➜  .kube k get backups.longhorn.io -n longhorn-system                                             
    NAME                      SNAPSHOTNAME                                    SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
    backup-78adf5e748e041bb   snapshot-4d9dd671-037b-4e75-ae29-dc1b6dff7833   241172480      2024-05-22T20:39:26Z   Completed   2024-05-22T20:39:30Z

@PhanLe1010

PhanLe1010 commented May 22, 2024

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.

I agree with this workaround. I can trigger this with about 50% regularity using pure Longhorn now that we understand what can lead to it. This workaround has successfully gotten me unstuck.

I am wondering if we should recommend that users not take backups of stopped VMs in Harvester 1.2.2 at all, because doing so could make the volume faulted or trigger rebuilding (if not all replicas crashed at the time). WDYT @ejweber @WebberHuang1118

@bk201
Member

bk201 commented May 23, 2024

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@FrankYang0529 Please help backport the PR that removes the attach/detach logic in Harvester for backups: #5853

@WebberHuang1118
Member

In this case, it would have been nice if the csi-plugin had waited until the backup completed to return, so it could return a ready_to_use == true, but it does not due to longhorn/longhorn-manager#2104. (In general, there is no reason to assume uploading the backup will be quick, so the behavior is correct.) We expect the csi-snapshotter to try again, at which point, we will set ready_to_use = true and the csi-snapshotter will clear the error.

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

Hi @ejweber @PhanLe1010,

There are no further logs in csi-snapshotter for the problematic volumesnapshot.

IIUC, Harvester handles its own attaching and detaching, and this brings chaos to the volume's behavior. But since the LH snapshot and backup are eventually completed, why does the volumesnapshot stay at readyToUse == false? Is it because of kubernetes-csi/external-snapshotter#953? And does this only happen when the Harvester controller does its own volume attaching and detaching for the snapshot? Thanks.

@FrankYang0529
Member

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@FrankYang0529 Please help backport the PR that remove the attach/detach logic in Harvester for backups: #5853

#5854

@ejweber

ejweber commented May 23, 2024

IIUC, Harvester handles its own attaching and detaching, and this brings chaos to the volume's behavior. But since the LH snapshot and backup are eventually completed, why does the volumesnapshot stay at readyToUse == false? Is it because of kubernetes-csi/external-snapshotter#953? And does this only happen when the Harvester controller does its own volume attaching and detaching for the snapshot? Thanks.

It appears to be because of kubernetes-csi/external-snapshotter#953. The reporter there describes a race involving errors and readyToUse that stops external-snapshotter from continuing to reconcile. When we test Longhorn without the confusion of the extra attach and detach, we observe the following logs. Notice that:

  • The csi-plugin returns immediately when the backup CR is created with ready_to_use == false (the field is not logged).
  • The csi-snapshotter periodically polls the csi-plugin.
  • Eventually the csi-plugin returns ready_to_use == true.
[csi-snapshotter-59587776b-75wg7] I0523 13:22:10.274798       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.770779       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.851200       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.977362       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:33.111914       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:33.653617       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:34.256486       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...

[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="Creating Volume test snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:788" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:20Z" level=info msg="Creating volume test backup for snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:806" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="Volume test backup backup-a264a7f92c6143e0 of snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:825" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470549},\"ready_to_use\":true,\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"

Unfortunately, when we return an error at some point during the process, it appears to prevent further csi-snapshotter reconciliation (at least some of the time). We tested moving to a version of external-snapshotter that doesn't have the issue in longhorn/longhorn#8618 (comment). You can see the results there.

@khushboo-rancher khushboo-rancher modified the milestones: v1.2.2, v1.2.3 May 24, 2024
@WebberHuang1118
Member

WebberHuang1118 commented May 24, 2024

Thanks @ejweber @PhanLe1010 ,
we will backport #4907 to Harvester v1.2 and re-verify this.
