[BUG] Fail to backup a Stopped/Off VM due to volume error state #5841

Open
albinsun opened this issue May 21, 2024 · 26 comments
Labels
area/longhorn issues depends on the upstream longhorn kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/often Reproducible 10% to 99% of the time require/doc Improvements or additions to documentation require/investigate severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)

@albinsun

Describe the bug
Backups may fail on an Off/Stopped VM.
image

To Reproduce
Steps to reproduce the behavior:

  1. Create a custom storage class named custom (a YAML sketch of such a class is shown after these steps)

    image

  2. Create 3 VMs with different volume setups

    • vm-1 (Only 1 rootdisk)

      image

    • vm-2-default (1 rootdisk + 1 extra volume using default SC)

      image

    • vm-2-custom (1 rootdisk + 1 extra volume using custom SC)

      image

  3. Take a backup of each of the 3 VMs

  4. Stop the 3 VMs

  5. Take a backup of each of the 3 VMs again
    The backup may fail
    image
    The Longhorn volume replica fails
    image
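
For reference, the custom storage class from step 1 can also be created from the CLI. This is only a minimal sketch of a Longhorn-backed StorageClass; the parameter values (replica count, timeouts) are assumptions, not the exact settings shown in the screenshots:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: custom
provisioner: driver.longhorn.io        # Longhorn CSI driver
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"                # assumed value for the custom class
  staleReplicaTimeout: "30"
EOF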

Expected behavior
The VM can be backed up both when it is Running and when it is Off.

Support bundle

  1. 3 nodes: supportbundle_off-volume_3nodes-2.zip

Environment

  • Harvester
    • Version: v1.2.2
    • Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
    • ui-source: Auto

Additional context

  1. Seems hard to reproduce on a single-node Harvester.
  2. Can be reproduced on a 2-node cluster
    supportbundle_off-extra-disk_2nodes.zip
    image
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) area/longhorn issues depends on the upstream longhorn reproduce/often Reproducible 10% to 99% of the time labels May 21, 2024
@albinsun albinsun added this to the v1.2.2 milestone May 21, 2024
@innobead

innobead commented May 21, 2024

@albinsun do we cover this in our e2e auto test coverage? or just manual test coverage?

cc @khushboo-rancher

@WebberHuang1118
Member

Some findings from the 2-node support bundle (SB):

The failing vmbackup is default/extra-default.off:

apiVersion: harvesterhci.io/v1beta1
kind: VirtualMachineBackup
metadata:
  name: extra-default.off
  namespace: default
  resourceVersion: "1265"
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-extra-default
  type: backup
status:
  backupTarget:
    endpoint: nfs://192.168.123.1:/pcloud/
  conditions:
  - lastTransitionTime: "2024-05-21T05:15:21Z"
    message: VolumeSnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py
      in error state
    reason: Error
    status: "False"
    type: InProgress
  - lastTransitionTime: "2024-05-21T05:15:21Z"
    message: Not Ready
    status: "False"
    type: Ready
  csiDriverVolumeSnapshotClassNames:
    driver.longhorn.io: longhorn
  error:
    message: VolumeSnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py
      in error state
    time: "2024-05-21T05:15:21Z"
  readyToUse: false
  secretBackups:
  - data:
      userdata: I2Nsb3VkLWNvbmZpZwpwYWNrYWdlX3VwZGF0ZTogdHJ1ZQpwYWNrYWdlczoKICAtIHFlbXUtZ3Vlc3QtYWdlbnQKcnVuY21kOgogIC0gLSBzeXN0ZW1jdGwKICAgIC0gZW5hYmxlCiAgICAtIC0tbm93CiAgICAtIHFlbXUtZ3Vlc3QtYWdlbnQuc2VydmljZQo=
    name: vm-extra-default-cryzr
  source:
    metadata:
      annotations:
        harvesterhci.io/vmRunStrategy: RerunOnFailure
        harvesterhci.io/volumeClaimTemplates: '[{"metadata":{"name":"vm-extra-default-rootdisk-vp3py","annotations":{"harvesterhci.io/imageId":"default/image-c87wj"}},"spec":{"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"10Gi"}},"volumeMode":"Block","storageClassName":"longhorn-image-c87wj"}},{"metadata":{"name":"vm-extra-default-disk-1-oohjf"},"spec":{"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"4Gi"}},"volumeMode":"Block","storageClassName":"harvester-longhorn"}}]'
        kubevirt.io/latest-observed-api-version: v1
        kubevirt.io/storage-observed-api-version: v1
        network.harvesterhci.io/ips: '[]'
      labels:
        harvesterhci.io/creator: harvester
        harvesterhci.io/os: ubuntu
      name: vm-extra-default
      namespace: default
  sourceUID: a2e7f0e6-650a-46b5-bd4e-58f19e5b0a19
  volumeBackups:
  - creationTime: "2024-05-21T05:15:27Z"
    csiDriverName: driver.longhorn.io
    error:
      message: 'Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349:
        "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting
        for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action
        [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349
        volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349]
        map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach
        detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach
        salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"null'
      time: "2024-05-21T05:15:20Z"
    longhornBackupName: backup-e6b3ac5fe0fb463a
    name: extra-default.off-volume-vm-extra-default-rootdisk-vp3py
    persistentVolumeClaim:
      metadata:
        annotations:
          harvesterhci.io/imageId: default/image-c87wj
          harvesterhci.io/owned-by: '[{"schema":"kubevirt.io.virtualmachine","refs":["default/vm-extra-default"]}]'
          pv.kubernetes.io/bind-completed: "yes"
          pv.kubernetes.io/bound-by-controller: "yes"
          volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
          volume.kubernetes.io/storage-provisioner: driver.longhorn.io
        name: vm-extra-default-rootdisk-vp3py
        namespace: default
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 10Gi
        storageClassName: longhorn-image-c87wj
        volumeMode: Block
        volumeName: pvc-31c455ad-c207-4d2f-bc77-229619430349
    readyToUse: false
    volumeName: rootdisk
  - creationTime: "2024-05-21T05:15:26Z"
    csiDriverName: driver.longhorn.io
    error:
      message: 'Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated
        annotation from the content snapcontent-7387072c-f039-4d65-b989-152587fd8379:
        snapshot controller failed to update snapcontent-7387072c-f039-4d65-b989-152587fd8379
        on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io
        "snapcontent-7387072c-f039-4d65-b989-152587fd8379": the object has been modified;
        please apply your changes to the latest version and try again'
      time: "2024-05-21T05:15:20Z"
    longhornBackupName: backup-0ce511dbbb7244fb
    name: extra-default.off-volume-vm-extra-default-disk-1-oohjf
    persistentVolumeClaim:
      metadata:
        annotations:
          harvesterhci.io/owned-by: '[{"schema":"kubevirt.io.virtualmachine","refs":["default/vm-extra-default"]}]'
          pv.kubernetes.io/bind-completed: "yes"
          pv.kubernetes.io/bound-by-controller: "yes"
          volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
          volume.kubernetes.io/storage-provisioner: driver.longhorn.io
        name: vm-extra-default-disk-1-oohjf
        namespace: default
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 4Gi
        storageClassName: harvester-longhorn
        volumeMode: Block
        volumeName: pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a
    readyToUse: false
    volumeName: disk-1

This vmbackup tried to back up two of the VM's disks:

  • rootdisk
    • PVC vm-extra-default-rootdisk-vp3py
    • LH volume pvc-31c455ad-c207-4d2f-bc77-229619430349
  • disk1
    • PVC vm-extra-default-disk-1-oohjf
    • LH volume pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a

The backup is still in progress; however, these two volumes are already in the detached state, in which snapshots cannot be taken.

kubectl -n longhorn-system get volumes.longhorn.io 
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE   AGE
pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a   detached   unknown                  4294967296           3h50m
pvc-31c455ad-c207-4d2f-bc77-229619430349   detached   unknown                  10737418240          3h50m
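
The state of the individual replicas behind these volumes can be checked in the same way (a sketch; it assumes Longhorn's usual longhornvolume label on replica objects):

# list the replicas belonging to each of the two Longhorn volumes above
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-31c455ad-c207-4d2f-bc77-229619430349
kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a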

It seems there is some similarity to #3813. I will try to reproduce it for further investigation, thanks.

@bk201
Member

bk201 commented May 21, 2024

I can repro this with v1.3-head + Longhorn v1.6.2-rc2 while testing #5787.
Update: it seems to be due to a space issue, please ignore.

@albinsun
Author

FYR, I can reproduce on the dolphins cluster (4 nodes) on v1.2.2.
Will send you the support bundle in a message.

Backup

VolumeSnapshot vm2-custom-volume-vm2-custom-disk-1-d84nk in error state
image

Longhorn

  • Volume pvc-324a9e72-3061-4186-b88e-967a1f89ca61
    Replica 967a1f89ca61-r-7aa83157 / harvester-rrtfg Failed
    image

  • Volume pvc-1e1ddf14-9f0b-4ee5-9efe-dc0f6fa44991
    Replica dc0f6fa44991-r-64e568b0 / harvester-rrtfg Failed
    image

@albinsun
Author

albinsun commented May 21, 2024

@albinsun do we cover this in our e2e auto test coverage? or just manual test coverage?

cc @khushboo-rancher

After some checking, our current VM backup TCs are mainly located in harvester_e2e_tests/integrations/test_4_vm_backup_restore, but they seem to take backups only on a Running VM.

  1. TestBackupRestore::tests_backup_vm
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L331
  2. TestBackupRestoreOnMigration::test_backup_migrated_vm
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L659
  3. TestMultipleBackupRestore::test_backup_multiple
    https://github.com/harvester/tests/blob/1dc6fa5a07db143e69e339af9bef79471f49b94c/harvester_e2e_tests/integrations/test_4_vm_backup_restore.py#L836

Note that, in testing so far, it's harder to reproduce this in a bare-metal env,
so hardware performance may be one factor.

c.c @khushboo-rancher, @lanfon72

@irishgordo

@albinsun - I struggled to reproduce this on bare metal; I've not been able to reproduce it successfully.
This might be similar to what I noticed with:
#5427
It was impossible for me to reproduce 5427 on bare metal (on a system that met the "testing requirements") - only a KVM-provisioned env was successful.

@albinsun
Author

@albinsun - I struggled to reproduce this on bare metal; I've not been able to reproduce it successfully. This might be similar to what I noticed with: #5427 It was impossible for me to reproduce 5427 on bare metal (on a system that met the "testing requirements") - only a KVM-provisioned env was successful.

Thank you for checking!
This issue can be reproduced on bare metal, just less frequently.
Refer to #5841 (comment)

@irishgordo

@albinsun - I'll run a few more checks on it too, as it was just two runs - perhaps I'll be able to reproduce it with a few more iterations 👍

@lanfon72
Member

We don't have any test cases that back up the VM twice.

@derekbit

cc @ChanYiLin

@WebberHuang1118
Member

WebberHuang1118 commented May 22, 2024

TL;DR: LH CSI CreateSnapshot() returns without error but sets ReadyToUse to false; this makes the external-snapshotter stop reconciling, and the VolumeSnapshot stays in a non-ready state.


Taking the 2-node SB as an example

Data structure for one problematic volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py:

  • volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py (READYTOUSE=false)
    • volumesnapshotcontent snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 (READYTOUSE=false)
      • snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 (READYTOUSE=true)
      • backup backup-e6b3ac5fe0fb463a (state=completed)
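
For reference, each object in this chain can be inspected directly with kubectl (a sketch using the names from this example):

# CSI-level objects
kubectl -n default get volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py -o yaml
kubectl get volumesnapshotcontent snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 -o yaml

# Longhorn-level objects
kubectl -n longhorn-system get snapshots.longhorn.io snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 -o yaml
kubectl -n longhorn-system get backups.longhorn.io backup-e6b3ac5fe0fb463a -o yaml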

Timeline while trying to take the snapshot:

Around 2024-05-21T05:15:18, the CSI snapshotter tried to take the snapshot; however, the LH snapshot was not ready yet, as we can see from the external-snapshotter log:

2024-05-21T05:15:18.588724357Z I0521 05:15:18.588673       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

2024-05-21T05:15:20.806632265Z I0521 05:15:20.806587       1 snapshot_controller.go:325] createSnapshotWrapper: CreateSnapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 returned error: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]

The LH CSI plugin has matching logs:

2024-05-21T05:15:18.664498451Z time="2024-05-21T05:15:18Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",    \"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"

 2024-05-21T05:15:18.670883382Z time="2024-05-21T05:15:18Z" level=info msg="Creating Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:776" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup

2024-05-21T05:15:20.806363293Z time="2024-05-21T05:15:20Z" level=error msg="Failed to create snapshot" func="csi.(*ControllerServer).CreateSnapshot.func1" file="controller_server.go:590" component=csi-controller-server error="rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" function=CreateSnapshot

Around 2024-05-21T05:15:21, the CSI snapshotter tried to take the snapshot again:

2024-05-21T05:15:21.889671854Z I0521 05:15:21.888949       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

This time LH successfully took the snapshot and backup:

2024-05-21T05:15:21.948437174Z time="2024-05-21T05:15:21Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"

2024-05-21T05:15:33.958948776Z time="2024-05-21T05:15:33Z" level=info msg="Creating volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:794" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup

2024-05-21T05:15:48.084897436Z time="2024-05-21T05:15:48Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup backup-e6b3ac5fe0fb463a of snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:813"

However, even though the LH snapshot and backup were successfully created, the related volumesnapshot stayed at READYTOUSE=false. IMHO, this is because LH checks whether the backup is ready via the created field but reports ReadyToUse based on the state field; if at a specific point the LH backup has created != nil but state == false, this is likely to happen.

During the reproduction steps on my side, I turned the external-snapshotter log level up to V5 and found this log (from https://github.com/kubernetes-csi/external-snapshotter/blob/f7ea027c3cdc76537edd9a295c4a5dfd94cfdc8e/pkg/sidecar-controller/snapshot_controller.go#L336), which indicates LH returned the volumesnapshotcontent without error but with readyToUse as false:

csi-snapshotter-6b754d4b6c-5j4zp:I0522 06:18:06.232129 1 snapshot_controller.go:417] updateSnapshotContentStatus: updating VolumeSnapshotContent [snapcontent-08867fd6-3f70-4c59-b6b2-dce8fb144901], snapshotHandle bak://pvc-53d62a14-ad8d-499d-9fa2-9bd62e9d1095/backup-aad0e09c1cd84358, readyToUse false, createdAt 1716358667000000000, size 10737418240

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.
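
For completeness, a sketch of the workaround plus a quick check that the stuck VolumeSnapshot recovers (object names are the ones from this thread):

# restart the sidecar so it re-reconciles the stuck VolumeSnapshotContent
kubectl -n longhorn-system rollout restart deployment csi-snapshotter
kubectl -n longhorn-system rollout status deployment csi-snapshotter

# once reconciled, READYTOUSE should flip to true
kubectl -n default get volumesnapshot extra-default.off-volume-vm-extra-default-rootdisk-vp3py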

@derekbit

cc @PhanLe1010

@ejweber

ejweber commented May 22, 2024

it indicates LH returned the volumesnapshotcontent without error but readyToUse as false

I'm not sure yet on the full context of this issue, but it is correct for Longhorn to return no error with readyToUse false for bak type snapshots. This is because the actual snapshotting can be very fast, but the backup part can take a long time. The CSI spec does not want COs to freeze workloads (if they DO freeze workloads) while operations like uploads are progressing.

The cluster snapshotter components should periodically check back with the CSI plugin to confirm whether the snapshot has become readyToUse later.

https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter
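
In other words, readyToUse=false with no error is expected while the backup upload is still in progress; from the CLI one could simply wait for the snapshot to become ready, for example (a sketch using the VolumeSnapshot name from this thread):

kubectl -n default wait volumesnapshot/extra-default.off-volume-vm-extra-default-rootdisk-vp3py \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=10m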

@ejweber

ejweber commented May 22, 2024

I am investigating the permanent failure part, but I think the reason for the initial failure is somewhat clear.

This is a Harvester v1.2.2 cluster running Longhorn v1.5.5. As discussed in #4907, Longhorn v1.5.5 handles its own attaching and detaching for snapshots using an internal volume attachment controller. #4910 removes functionality from Harvester that previously automatically attached volumes for snapshots. HOWEVER, v1.2.2 does not appear to have that change. For example, the code at

func (h *ActionHandler) mountSourcePVC(volumeSnapshot *snapshotv1.VolumeSnapshot) error {
    if volumeSnapshot.Spec.Source.PersistentVolumeClaimName == nil {
        return nil
    }
    pvc, err := h.pvcCache.Get(volumeSnapshot.Namespace, *volumeSnapshot.Spec.Source.PersistentVolumeClaimName)
    if err != nil {
        if apierrors.IsNotFound(err) {
            return nil
        }
        return fmt.Errorf("can't find pvc %s/%s, err: %w", volumeSnapshot.Namespace, *volumeSnapshot.Spec.Source.PersistentVolumeClaimName, err)
    }
    volume, err := h.volumeCache.Get(util.LonghornSystemNamespaceName, pvc.Spec.VolumeName)
    if err != nil {
        return fmt.Errorf("failed to get volume %s/%s, error: %s", util.LonghornSystemNamespaceName, pvc.Spec.VolumeName, err.Error())
    }
    if volume.Status.State == lhv1beta2.VolumeStateDetached || volume.Status.State == lhv1beta2.VolumeStateDetaching {
        volCpy := volume.DeepCopy()
        volCpy.Spec.NodeID = volume.Status.OwnerID
        logrus.Infof("mount detached volume %s to the node %s", volCpy.Name, volCpy.Spec.NodeID)
        if _, err = h.volumes.Update(volCpy); err != nil {
            return err
        }
    }
    return nil
}

still exists. We see that this code executed at the time the failed backups were created:

2024-05-21T05:15:18.300855786Z time="2024-05-21T05:15:18Z" level=info msg="mount detached volume pvc-31c455ad-c207-4d2f-bc77-229619430349 to the node harvester-node-1"

This is very confusing to Longhorn, because the only component that is currently allowed to set volume.spec.nodeID is the internal volume attachment controller.

The volume attachment controller notices Harvester tried to attach the volume and decides to detach it.

2024-05-21T05:15:18.366260253Z time="2024-05-21T05:15:18Z" level=info msg="Should detach volume harvester-node-1 because there is no matching attachment ticket" func="controller.(*VolumeAttachmentController).shouldDoDetach" file="volume_attachment_controller.go:496" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1
2024-05-21T05:15:18.366357418Z time="2024-05-21T05:15:18Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 is selected to detach from node harvester-node-1" func="controller.(*VolumeAttachmentController).handleVolumeDetachment" file="volume_attachment_controller.go:455" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1
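
As a side note, the attachment tickets that the volume attachment controller reconciles against are visible on the Longhorn VolumeAttachment CR, which carries the same name as the volume (a sketch):

# the attachment tickets are listed in the spec of this object
kubectl -n longhorn-system get volumeattachments.longhorn.io pvc-31c455ad-c207-4d2f-bc77-229619430349 -o yaml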

But some processes are already starting.

2024-05-21T05:15:18.543496374Z time="2024-05-21T05:15:18Z" level=info msg="Creating instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff" func="controller.(*InstanceHandler).createInstance" file="instance_handler.go:472"
2024-05-21T05:15:18.546947067Z time="2024-05-21T05:15:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff\", UID:\"943a13c9-f00d-4fa9-87c4-86a79cf1f2ad\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"21850\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
2024-05-21T05:15:18.701191154Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, Storage IP 10.52.1.13" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:164"
2024-05-21T05:15:18.701256509Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, IP 10.52.1.13" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:169"
2024-05-21T05:15:18.701262747Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-62522cff starts running, Port 10021" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:173"

2024-05-21T05:15:18.405046754Z time="2024-05-21T05:15:18Z" level=info msg="Creating instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e" func="controller.(*InstanceHandler).createInstance" file="instance_handler.go:472"
2024-05-21T05:15:18.409616650Z time="2024-05-21T05:15:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e\", UID:\"1e19339c-752e-44f8-bd0b-b68ae3618d16\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"21838\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"
2024-05-21T05:15:18.654000108Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, Storage IP 10.52.0.47" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:164"
2024-05-21T05:15:18.654201931Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, IP 10.52.0.47" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:169"
2024-05-21T05:15:18.654304898Z time="2024-05-21T05:15:18Z" level=warning msg="Instance pvc-31c455ad-c207-4d2f-bc77-229619430349-r-ae361a1e starts running, Port 10010" func="controller.(*InstanceHandler).syncStatusWithInstanceManager" file="instance_handler.go:173"

Then the volume attachment controller decides to attach the volume.

2024-05-21T05:15:18.753617612Z time="2024-05-21T05:15:18Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 is selected to attach to node harvester-node-1, ticket +&{snapshot-controller-snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 snapshot-controller harvester-node-1 map[disableFrontend:any] 0}" func="controller.(*VolumeAttachmentController).handleVolumeAttachment" file="volume_attachment_controller.go:576" controller=longhorn-volume-attachment longhornVolumeAttachment=pvc-31c455ad-c207-4d2f-bc77-229619430349 node=harvester-node-1

There is a flurry of process deletions and creations related to the conflicting actions. The engine process crashes because the replica processes that were started are shut down before it can use them.

2024-05-21T05:15:19.721120427Z time="2024-05-21T05:15:19Z" level=warning msg="pvc-19b970f7-7dd0-4b39-be01-71a3e4f4873a-e-0: time=\"2024-05-21T05:15:19Z\" level=error msg=\"no available backend due to service unavailable from the addresses [tcp://10.52.0.47:10020 tcp://10.52.1.13:10011]\" func=cmd.startController file=\"controller.go:186\"" func="controller.(*InstanceHandler).printInstanceLogs" file="instance_handler.go:458"

In the process of this happening, we try and fail to take a snapshot, kicking off the VM backup failures this ticket was opened with.

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@ejweber

ejweber commented May 22, 2024

As for the permanent backup failure, Longhorn mostly seems to be performing correctly, given the circumstances.

In the two node support bundle, one backup is taken while the volume is attached and one while it is detached.

While it is attached, we see multiple csi-snapshotter calls to create a snapshot:

2024-05-21T05:13:26.811565040Z I0521 05:13:26.807341       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.024469713Z I0521 05:13:35.024406       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.096448974Z I0521 05:13:35.096266       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.138410250Z I0521 05:13:35.138355       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...
2024-05-21T05:13:35.429538885Z I0521 05:13:35.429135       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-9242229a-239c-4bdd-a3ca-4aaaff62be17 through the plugin ...

On the csi-plugin side, we respond with no error each time. The last time, we respond with ready_to_use = true. This is due to the behavior I previously described. The csi-snapshotter keeps checking back until we complete the backup upload.

2024-05-21T05:13:34.996622158Z time="2024-05-21T05:13:34Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.045948912Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.126273263Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.157806757Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268407},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"
2024-05-21T05:13:35.620873156Z time="2024-05-21T05:13:35Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268411},\"ready_to_use\":true,\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-a5902a2315624534\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"

While the volume is detached, we see something somewhat similar, except an error occurs. On the csi-snapshotter side, the first attempt fails. But it tries again.

2024-05-21T05:15:18.588724357Z I0521 05:15:18.588673       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...
2024-05-21T05:15:20.806632265Z I0521 05:15:20.806587       1 snapshot_controller.go:325] createSnapshotWrapper: CreateSnapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 returned error: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]
2024-05-21T05:15:20.888221499Z E0521 05:15:20.888006       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
2024-05-21T05:15:20.888275541Z E0521 05:15:20.888058       1 snapshot_controller_base.go:351] could not sync content "snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50": failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
2024-05-21T05:15:20.888648190Z I0521 05:15:20.888527       1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50", UID:"fd933a8d-5f30-4d08-bee9-2e15923e36e8", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"22008", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"

2024-05-21T05:15:21.889671854Z I0521 05:15:21.888949       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50 through the plugin ...

On the csi-plugin side, we respond as follows:

2024-05-21T05:15:18.664498451Z time="2024-05-21T05:15:18Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"
2024-05-21T05:15:18.670883382Z time="2024-05-21T05:15:18Z" level=info msg="Creating Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:776" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
2024-05-21T05:15:20.806363293Z time="2024-05-21T05:15:20Z" level=error msg="Failed to create snapshot" func="csi.(*ControllerServer).CreateSnapshot.func1" file="controller_server.go:590" component=csi-controller-server error="rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" function=CreateSnapshot
2024-05-21T05:15:20.806384095Z time="2024-05-21T05:15:20Z" level=error msg="CreateSnapshot: err: rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]" func=csi.logGRPC file="server.go:138"

2024-05-21T05:15:21.948437174Z time="2024-05-21T05:15:21Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}" func=csi.logGRPC file="server.go:132"
2024-05-21T05:15:33.958948776Z time="2024-05-21T05:15:33Z" level=info msg="Creating volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:794" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
2024-05-21T05:15:48.084897436Z time="2024-05-21T05:15:48Z" level=info msg="Volume pvc-31c455ad-c207-4d2f-bc77-229619430349 backup backup-e6b3ac5fe0fb463a of snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:813"
2024-05-21T05:15:48.085109292Z time="2024-05-21T05:15:48Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716268527},\"size_bytes\":10737418240,\"snapshot_id\":\"bak://pvc-31c455ad-c207-4d2f-bc77-229619430349/backup-e6b3ac5fe0fb463a\",\"source_volume_id\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\"}}" func=csi.logGRPC file="server.go:141"

So far, this behavior is correct. Now, the backup actually completes almost immediately after the backup CR is created.

2024-05-21T05:15:48.602742604Z time="2024-05-21T05:15:48Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-31c455ad-c207-4d2f-bc77-229619430349\", UID:\"1bbb7e42-b976-48c1-87e5-c550fee231c9\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"22888\", FieldPath:\"\"}): type: 'Normal' reason: 'Completed' Snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 backup backup-e6b3ac5fe0fb463a label map[KubernetesStatus:{\"pvName\":\"pvc-31c455ad-c207-4d2f-bc77-229619430349\",\"pvStatus\":\"Bound\",\"namespace\":\"default\",\"pvcName\":\"vm-extra-default-rootdisk-vp3py\",\"lastPVCRefAt\":\"\",\"workloadsStatus\":[{\"podName\":\"virt-launcher-vm-extra-default-64rgh\",\"podStatus\":\"Failed\",\"workloadName\":\"vm-extra-default\",\"workloadType\":\"VirtualMachineInstance\"}],\"lastPodRefAt\":\"2024-05-21T05:14:49Z\"} longhorn.io/volume-access-mode:rwx]" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"

In this case, it would have been nice if the csi-plugin had waited until the backup completed to return, so it could return a ready_to_use == true, but it does not due to longhorn/longhorn-manager#2104. (In general, there is no reason to assume uploading the backup will be quick, so the behavior is correct.) We expect the csi-snapshotter to try again, at which point, we will set ready_to_use = true and the csi-snapshotter will clear the error.

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

@PhanLe1010

Agree with @ejweber on the immediate cause of the CSI snapshot failure. A small nit: Harvester seems to use this logic to attach the volume instead:

func (h *ActionHandler) snapshot(_ context.Context, pvcNamespace, pvcName, snapshotName string) error {
    pvc, err := h.pvcCache.Get(pvcNamespace, pvcName)
    if err != nil {
        return err
    }
    provisioner := util.GetProvisionedPVCProvisioner(pvc)
    csiDriverInfo, err := settings.GetCSIDriverInfo(provisioner)
    if err != nil {
        return err
    }
    if provisioner == longhorntypes.LonghornDriverName {
        volume, err := h.volumeCache.Get(util.LonghornSystemNamespaceName, pvc.Spec.VolumeName)
        if err != nil {
            return fmt.Errorf("failed to get volume %s/%s, error: %s", pvc.Namespace, pvc.Spec.VolumeName, err.Error())
        }
        volCpy := volume.DeepCopy()
        if volume.Status.State == lhv1beta2.VolumeStateDetached || volume.Status.State == lhv1beta2.VolumeStateDetaching {
            volCpy.Spec.NodeID = volume.Status.OwnerID
        }
        if !reflect.DeepEqual(volCpy, volume) {
            logrus.Infof("mount detached volume %s to the node %s", volCpy.Name, volCpy.Spec.NodeID)
            if _, err = h.volumes.Update(volCpy); err != nil {
                return err
            }
        }
    }

@PhanLe1010

PhanLe1010 commented May 22, 2024

For context, this error in the CSI snapshotter:
2024-05-21T05:15:20.888648190Z I0521 05:15:20.888527 1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-2e32ae7b-005c-4100-89c8-c8465031fc50", UID:"fd933a8d-5f30-4d08-bee9-2e15923e36e8", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"22008", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume pvc-31c455ad-c207-4d2f-bc77-229619430349: "rpc error: code = Internal desc = waitForSnapshotToBeReady: error while waiting for snapshot snapshot-2e32ae7b-005c-4100-89c8-c8465031fc50 to be ready: Action [snapshotCRGet] not available on [&{pvc-31c455ad-c207-4d2f-bc77-229619430349 volume map[self:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349] map[attach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=attach detach:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=detach salvage:http://longhorn-backend:9500/v1/volumes/pvc-31c455ad-c207-4d2f-bc77-229619430349?action=salvage]}]"
means that the volume is currently faulted (there is an attach/detach war between the Longhorn AD controller and the Harvester controller, which tries to modify volume.Spec.NodeID manually). When the volume is faulted, the Longhorn API snapshotCRGet is not available: https://github.com/longhorn/longhorn-manager/blob/88074ee747e60d8edcd98072a793c8614d7bff86/api/model.go#L1456-L1461
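
A quick way to confirm the volume is in this faulted state (a sketch using the volume from this thread):

kubectl -n longhorn-system get volumes.longhorn.io pvc-31c455ad-c207-4d2f-bc77-229619430349 \
  -o jsonpath='{.status.state}{"  "}{.status.robustness}{"\n"}'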

I strongly agree that Harvester should get rid of all the logic that sets Longhorn volume.Spec.NodeID directly; it conflicts with the Longhorn AD controller and can make the volume faulted.

@ejweber

ejweber commented May 22, 2024

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

It may be related to kubernetes-csi/external-snapshotter#953.

@ejweber

ejweber commented May 22, 2024

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.

I agree with this workaround. I can trigger this with about 50% regularity using pure Longhorn now that we understand what can lead to it. This workaround has successfully gotten me unstuck.

@PhanLe1010

PhanLe1010 commented May 22, 2024

I use this approach to reproduce with Longhorn only (without Harvester):

  1. Deploy Longhorn v1.5.5
  2. Deploy a deployment with a Longhorn PVC: kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/deployment.yaml
  3. Edit the volume to make it have 1 replica. This makes it more likely to crash the only replica of the volume and make the volume faulted
  4. Run this script repeatedly:
    #!/bin/bash
    
    kubectl -n longhorn-system patch volumes.longhorn.io pvc-a9887842-fe5a-4c02-891b-6fd82df26c30 --type='merge' -p '{"spec":{"nodeID":"phan-v660-pool2-e685807c-zv582"}}' 
    
    cat <<EOF | kubectl apply -f-
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-csi-volume-snapshot-longhorn-backup
    spec:
      volumeSnapshotClassName: longhorn-backup-vsc
      source:
        persistentVolumeClaimName: mysql-pvc
    EOF
    sleep 10
    IS_READY=$(kubectl get volumesnapshots.snapshot.storage.k8s.io test-csi-volume-snapshot-longhorn-backup -o jsonpath='{.status.readyToUse}')
    if [ "$IS_READY" == "true" ]; then
      echo "The CSI snapshot is ready. Deleting it"
      kubectl delete volumesnapshots.snapshot.storage.k8s.io test-csi-volume-snapshot-longhorn-backup
    else
      echo " CSI snapshot is not ready."
    fi
  5. Repeat until you see that the volume is faulted and the CSI snapshot is stuck at READYTOUSE=false
    ➜  .kube k get volumesnapshots.snapshot.storage.k8s.io 
    NAME                                       READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS         SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    test-csi-volume-snapshot-longhorn-backup   false        mysql-pvc                           2Gi           longhorn-backup-vsc   snapcontent-4d9dd671-037b-4e75-ae29-dc1b6dff7833   4m10s          4m18s
    even though Longhorn backup is completed:
    ➜  .kube k get backups.longhorn.io -n longhorn-system                                             
    NAME                      SNAPSHOTNAME                                    SNAPSHOTSIZE   SNAPSHOTCREATEDAT      STATE       LASTSYNCEDAT
    backup-78adf5e748e041bb   snapshot-4d9dd671-037b-4e75-ae29-dc1b6dff7833   241172480      2024-05-22T20:39:26Z   Completed   2024-05-22T20:39:30Z

@PhanLe1010

PhanLe1010 commented May 22, 2024

A possible workaround:
Run $ kubectl -n longhorn-system rollout restart deployment csi-snapshotter to trigger the external-snapshotter to reconcile the volumesnapshotcontent.

I agree with this workaround. I can trigger this with about 50% regularity using pure Longhorn now that we understand what can lead to it. This workaround has successfully gotten me unstuck.

I am wondering if we should recommend that users not take backups of stopped VMs in Harvester 1.2.2 at all, because doing so could make the volume faulted or trigger rebuilding (if not all replicas crashed at the time). WDYT @ejweber @WebberHuang1118

@bk201
Member

bk201 commented May 23, 2024

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@FrankYang0529 Please help backport the PR that removes the attach/detach logic in Harvester for backups: #5853

@WebberHuang1118
Member

In this case, it would have been nice if the csi-plugin had waited until the backup completed to return, so it could return a ready_to_use == true, but it does not due to longhorn/longhorn-manager#2104. (In general, there is no reason to assume uploading the backup will be quick, so the behavior is correct.) We expect the csi-snapshotter to try again, at which point, we will set ready_to_use = true and the csi-snapshotter will clear the error.

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproduces? cc @bk201 @albinsun

Hi @ejweber @PhanLe1010,

There are no further logs in csi-snapshotter for the problematic volumesnapshot.

IIUC, Harvester handles its own attaching and detaching, and this brings chaos to the volume's behavior. But since the LH snapshot and backup are eventually completed, why does the volumesnapshot stay at readyToUse == false? Is it because of kubernetes-csi/external-snapshotter#953? And does this only happen when the Harvester controller does its own volume attaching and detaching for the snapshot? Thanks.

@FrankYang0529
Member

I think for offline backups to work reliably, versions of Harvester that use Longhorn v1.5.x need to adopt a backport of #4910.

@FrankYang0529 Please help backport the PR that remove the attach/detach logic in Harvester for backups: #5853

#5854

@ejweber

ejweber commented May 23, 2024

IIUC, Harvester handles its own attaching and detaching, and this brings chaos to the volume's behavior. But since the LH snapshot and backup are eventually completed, why does the volumesnapshot stay at readyToUse == false? Is it because of kubernetes-csi/external-snapshotter#953? And does this only happen when the Harvester controller does its own volume attaching and detaching for the snapshot? Thanks.

It appears to be because of kubernetes-csi/external-snapshotter#953. The reporter there describes a race involving errors and readyToUse that stops external-snapshotter from continuing to reconcile. When we test Longhorn without the confusion of the extra attach and detach, we observe the following logs. Notice that:

  • The csi-plugin returns immediately when the backup CR is created with ready_to_use == false (the field is not logged).
  • The csi-snapshotter periodically polls the csi-plugin.
  • Eventually the csi-plugin returns ready_to_use == true.
[csi-snapshotter-59587776b-75wg7] I0523 13:22:10.274798       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.770779       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.851200       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:32.977362       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:33.111914       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:33.653617       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...
[csi-snapshotter-59587776b-75wg7] I0523 13:22:34.256486       1 snapshot_controller.go:307] createSnapshotWrapper: Creating snapshot for content snapcontent-789b8e72-2e69-407f-bfeb-5c472ea32b00 through the plugin ...

[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:10Z" level=info msg="Creating Volume test snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:788" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:20Z" level=info msg="Creating volume test backup for snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:806" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="Volume test backup backup-a264a7f92c6143e0 of snapshot snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00 created" func="csi.(*ControllerServer).createCSISnapshotTypeLonghornBackup" file="controller_server.go:825" component=csi-controller-server function=createCSISnapshotTypeLonghornBackup
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:32Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:33Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470534},\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="GetPluginInfo: req: {}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.7.0-dev\"}" func=csi.logGRPC file="server.go:139"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\",\"source_volume_id\":\"test\"}" func=csi.logGRPC file="server.go:130"
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot is called with req source_volume_id:\"test\" name:\"snapshot-789b8e72-2e69-407f-bfeb-5c472ea32b00\" " func="csi.(*ControllerServer).CreateSnapshot" file="controller_server.go:599" component=csi-controller-server function=CreateSnapshot
[longhorn-csi-plugin-fmd8v longhorn-csi-plugin] time="2024-05-23T13:22:34Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1716470549},\"ready_to_use\":true,\"size_bytes\":21474836480,\"snapshot_id\":\"bak://test/backup-a264a7f92c6143e0\",\"source_volume_id\":\"test\"}}" func=csi.logGRPC file="server.go:139"

Unfortunately, when we return an error at some point during the process, it appears to prevent further csi-snapshotter reconciliation (at least some of the time). We tested moving to a version of external-snapshotter that doesn't have the issue in longhorn/longhorn#8618 (comment). You can see the results there.

@khushboo-rancher khushboo-rancher modified the milestones: v1.2.2, v1.2.3 May 24, 2024
@WebberHuang1118
Member

WebberHuang1118 commented May 24, 2024

Thanks @ejweber @PhanLe1010 ,
we will backport #4907 to Harvester v1.2 and re-verify this.
