
[BUG] Create backup failed: failed lock lock-*.lck type 1 acquisition #7744

Closed
yangchiu opened this issue Jan 22, 2024 · 28 comments
Labels
area/volume-backup-restore Volume backup restore backport/1.6.1 kind/bug priority/2 Nice to fix in this release (managed by PO) reproduce/often 80 - 50% reproducible

Comments

@yangchiu
Member

Describe the bug

Run test case test_recurring_job_restored_from_backup_target[nfs]. Backup creation can occasionally fail with the error message:

  error: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup
    snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1:
    rpc error: code = Internal desc = failed to create backup: failed to create backup
    to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job:
    rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
    type 1 acquisition'

https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/8/testReport/junit/tests/test_recurring_job/test_recurring_job_restored_from_backup_target_nfs_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6026/console

To Reproduce

Run test case test_recurring_job_restored_from_backup_target[nfs]

Expected behavior

Support bundle for troubleshooting

supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip

Environment

  • Longhorn version: v1.6.x-head (longhorn-manager 0c1a49) or master-head (longhorn-manager 8a8414)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) area/volume-backup-restore Volume backup restore labels Jan 22, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 22, 2024
@ChanYiLin
Contributor

Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
    type 1 acquisition

This happens when we try to create a backup while another backup is being deleted.
Maybe the previous test's cleanup hadn't finished before the next one started running.
Still looking into the root cause.

Wonder if this issue happened in the v1.5.x regression tests? @yangchiu

@ChanYiLin
Contributor

ChanYiLin commented Jan 22, 2024

Testing - Modify the test to use a generated volume name instead of a fixed name

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6054/console

@yangchiu
Member Author

Wonder if this issue happened in the v1.5.x regression tests? @yangchiu

Tested with the same test image and ran the test case 100 times; didn't hit this issue in v1.5.x: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6046/

@ChanYiLin
Contributor

Testing - Modify the test to use a generated volume name instead of a fixed name - 99/100 success

2024-01-22T11:12:00.855351723Z test_recurring_job.py::test_recurring_job_restored_from_backup_target[nfs-40-100] wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:12:01.982149653Z wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:12:07.017497292Z wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:15:00.432013583Z FAILED
        complete_backup_1_count = 0
        restore_snapshot_name = ""
        volume = client.by_id_volume(volume_name1)
        wait_for_backup_completion(client, volume_name1)
        for b in volume.backupStatus:
            if back1+"-" in b.snapshot:
                complete_backup_1_count += 1
                restore_snapshot_name = b.snapshot
    
>       assert complete_backup_1_count == 1
E       assert 0 == 1
E         -0
E         +1

But it actually created the backup successfully:
Volume: record-recur-longhorn-testvol-zsz8or
Snapshot: backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528
Backup: backup-89506bb825454b0b

There is no locking issue.
I need to print more information to see whether this is an issue in the test code.

2024-01-22T11:14:06.202150251Z [longhorn-instance-manager] time="2024-01-22T11:14:06Z" level=info msg="Backing up snapshot backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 to backup backup-89506bb825454b0b" func="proxy.(*Proxy).SnapshotBackup" file="backup.go:44" dataEngine=DATA_ENGINE_V1 engineName=record-recur-longhorn-testvol-zsz8or-e-0 serviceURL="10.42.2.9:10000" volumeName=record-recur-longhorn-testvol-zsz8or
2024-01-22T11:14:06.207589147Z [longhorn-instance-manager] time="2024-01-22T11:14:06Z" level=info msg="Backing up backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 on tcp://10.42.3.10:10000, to nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func="sync.(*Task).createBackup" file="backup.go:105"
2024-01-22T11:14:13.296369023Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/locks/lock-b861948cdd5b4837.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.297830520Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Stored lock backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/locks/lock-b861948cdd5b4837.lck type 1 on backupstore" func=backupstore.saveLock file="lock.go:190" pkg=backupstore
2024-01-22T11:14:13.297849720Z time="2024-01-22T11:14:13Z" level=info msg="Done initiating backup creation, received backupID: backup-89506bb825454b0b" func="rpc.(*SyncAgentServer).BackupCreate" file="server.go:734"
2024-01-22T11:14:13.297852380Z time="2024-01-22T11:14:13Z" level=info msg="Performing delta block backup" func=backupstore.CreateDeltaBlockBackup.func2 file="deltablock.go:266" destURL="nfs://longhorn-test-nfs-svc.default:/opt/backupstore" snapshot="&{backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 2024-01-22T11:14:06Z}" volume="&{record-recur-longhorn-testvol-zsz8or 16777216 map[RecurringJob:backup1 VolumeRecurringJobInfo:{\"backup1\":{\"jobSpec\":{\"name\":\"backup1\",\"groups\":[\"default\"],\"task\":\"backup\",\"cron\":\"*/2 * * * *\",\"retain\":1,\"concurrency\":1},\"fromGroup\":[\"default\"],\"fromJob\":true},\"backup2\":{\"jobSpec\":{\"name\":\"backup2\",\"groups\":[\"group01\"],\"task\":\"backup\",\"cron\":\"0 0 * * 0\",\"retain\":3,\"concurrency\":3},\"fromGroup\":[\"group01\"],\"fromJob\":false},\"snapshot1\":{\"jobSpec\":{\"name\":\"snapshot1\",\"task\":\"snapshot\",\"cron\":\"* * * * *\",\"retain\":2,\"concurrency\":1},\"fromJob\":true}} longhorn.io/volume-access-mode:rwo] 2024-01-22T11:14:06Z   0   lz4  v1}"
2024-01-22T11:14:13.297856994Z time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.303047166Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.303060769Z time="2024-01-22T11:14:13Z" level=info msg="Volume record-recur-longhorn-testvol-zsz8or Snapshot backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 is consist of 1 mappings and 1 blocks" func=backupstore.performBackup file="deltablock.go:496"
2024-01-22T11:14:13.322666229Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Created snapshot changed blocks: 1 mappings, 1 blocks and 1 new blocks" func=backupstore.performBackup file="deltablock.go:524" event=backup object=snapshot pkg=backupstore reason=complete snapshot=backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528
2024-01-22T11:14:13.324010768Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.328107638Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.329840496Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.329855652Z time="2024-01-22T11:14:13Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.329860528Z time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.334825337Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=complete

@ChanYiLin
Contributor

ChanYiLin commented Jan 23, 2024

I think I found the root cause

    # 2 from job_snap, 1 from job_backup, 1 volume-head
    wait_for_snapshot_count(volume, 4)

    complete_backup_1_count = 0
    restore_snapshot_name = ""
    volume = client.by_id_volume(volume_name1)
    wait_for_backup_completion(client, volume_name1)
    for b in volume.backupStatus:
        if back1+"-" in b.snapshot:
            complete_backup_1_count += 1
            restore_snapshot_name = b.snapshot

    assert complete_backup_1_count == 1

We shouldn't get the volume first and then wait for backup completion, because the volume we get might not have the latest backupStatus; the backup may not have completed yet.

The reason this issue is so rare might be that we first wait for 4 snapshots, and since our backup is small, most of the time it is already complete before we even call wait_for_backup_completion.

cc @mantissahz

Testing - get the volume after wait_for_backup_completion

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6057/console

@mantissahz
Contributor

mantissahz commented Jan 23, 2024

We shouldn't get the volume first and then wait for backup completion, because the volume we get might not have the latest backupStatus; the backup may not have completed yet.

Yes, it is.

@ChanYiLin
We should use the latest volume from wait_for_backup_completion, or get the volume by id after wait_for_backup_completion.
WDYT?

    complete_backup_1_count = 0
    restore_snapshot_name = ""
    #volume = client.by_id_volume(volume_name1)
    volume = wait_for_backup_completion(client, volume_name1)
    for b in volume.backupStatus:
        if back1+"-" in b.snapshot:
            complete_backup_1_count += 1
            restore_snapshot_name = b.snapshot

    assert complete_backup_1_count == 1

@ChanYiLin
Contributor

@mantissahz
Agreed, let me finish the test to see if it passes all 100 runs.
I will then create the PR.

@ChanYiLin
Contributor

I have run the test and confirmed that the root cause is that we get the volume before the backup is ready.
From the logs below, the Backup eventually became ready, but the volume's backupStatus was empty because we fetched the volume before the Backup was ready.

test_recurring_job.py::test_recurring_job_restored_from_backup_target[nfs-43-100] wait_for_volume_status: record-recur-longhorn-testvol-rclsga
wait_for_volume_status: record-recur-longhorn-testvol-rclsga
wait_for_volume_status: record-recur-longhorn-testvol-rclsga
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': 'nfs://longhorn-test-nfs-svc.default:/opt/backupstore?backup=backup-90a16d4a5ef544ba&volume=record-recur-longhorn-testvol-rclsga', 'error': '', 'progress': 100, 'replica': 'record-recur-longhorn-testvol-rclsga-r-d9b01af4', 'size': '2097152', 'snapshot': 'backup1--877b36f7-f959-4a26-967c-111dabce7492', 'state': 'Completed'}
volume.backupStatus: [{'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}]
FAILED

Conclusion

This issue comes down to 2 causes:

  1. We use the same volume name across tests. While the next test is creating a Backup, the last test might still be deleting its Backup at the same time, which leads to the lock issue. => Should use a random volume name
  2. We should get the volume after the Backup is ready, or Volume.BackupStatus will not be correct

@ChanYiLin
Contributor

ChanYiLin commented Jan 23, 2024

Another related test case failure: https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-2-stage-upgrade-tests-sles-arm64/420/testReport/tests/test_basic/test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress_s3_/

For this test it is also because we deleted the backup previously and the lock hadn't been released yet.
It is a test code issue, not a backup logic issue; we haven't changed the backup logic since v1.5.x.
Besides, test_dr_volume_with_backup_block_deletion and test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress have the same logic, but one failed and one passed.

@yangchiu
Member Author

yangchiu commented Jan 23, 2024

We use the same volume name across tests. While the next test is creating a Backup, the last test might still be deleting its Backup at the same time, which leads to the lock issue. => Should use a random volume name

Since no test cases run in parallel, this could imply that the backup cleanup after a test case completes doesn't really wait for the backups to be deleted.

But actually there is a wait mechanism in backupstore_cleanup:
https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-upgrade-tests-sles-arm64/728/testReport/junit/tests/test_statefulset/test_statefulset_restore_s3_/

    @pytest.fixture(params=BACKUPSTORE)
    def set_random_backupstore(request, client):
        if request.param == "s3":
            set_backupstore_s3(client)
        elif request.param == "nfs":
            set_backupstore_nfs(client)
            mount_nfs_backupstore(client)
    
        yield
        cleanup_all_volumes(client)
>       backupstore_cleanup(client)

backupstore.py:68: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0xffff9fb6f7f0>

    def backupstore_cleanup(client):
        backup_volumes = client.list_backup_volume()
    
        # we delete the whole backup volume, which skips block gc
        for backup_volume in backup_volumes:
            delete_backup_volume(client, backup_volume.name)
    
        backup_volumes = client.list_backup_volume()
>       assert backup_volumes.data == []
E       AssertionError

backupstore.py:181: AssertionError

In delete_backup_volume, it waits until the backup volume is not found, but after deleting all backup volumes and fetching the list of backup volumes again, the list is still not empty. More investigation needed.

@ChanYiLin
Contributor

Regarding the backupstore pollution, I found proof in the support bundle supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip you provided.

At 2024-01-22T03:22:08Z, the test failed to create the Backup

// Jack: Create Snapshot
2024-01-22T03:22:00.844965669Z time="2024-01-22T03:22:00Z" level=info msg="Creating snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d of volume record-recurring-job" func="controller.(*SnapshotController).handleSnapshotCreate" file="snapshot_controller.go:638" controller=longhorn-snapshot node=ip-10-0-2-144
2024-01-22T03:22:00.909633751Z time="2024-01-22T03:22:00Z" level=info msg="Created snapshot CR backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d with labels map[RecurringJob:backup1] for volume record-recurring-job" func="manager.(*VolumeManager).CreateSnapshotCR" file="snapshot.go:55"
2024-01-22T03:22:00.954379690Z time="2024-01-22T03:22:00Z" level=info msg="Created snapshot CR snapshot-14928467-237d-4984-a7fb-57fd5c419de5 with labels map[RecurringJob:snapshot1] for volume record-recurring-job" func="manager.(*VolumeManager).CreateSnapshotCR" file="snapshot.go:55"
2024-01-22T03:22:00.975294808Z time="2024-01-22T03:22:00Z" level=info msg="Creating snapshot snapshot-14928467-237d-4984-a7fb-57fd5c419de5 of volume record-recurring-job" func="controller.(*SnapshotController).handleSnapshotCreate" file="snapshot_controller.go:638" controller=longhorn-snapshot node=ip-10-0-2-144

// Jack: Failed to create Backup
2024-01-22T03:22:08.277583114Z time="2024-01-22T03:22:08Z" level=warning msg="Cannot take snapshot backup" func=engineapi.NewBackupMonitor file="backup_monitor.go:90" backup=backup-8d164319ad7140d1 controller=longhorn-backup error="proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1: rpc error: code = Internal desc = failed to create backup: failed to create backup to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job: rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 acquisition" node=ip-10-0-2-144
2024-01-22T03:22:08.277680672Z time="2024-01-22T03:22:08Z" level=warning msg="Failed to enable the backup monitor for backup backup-8d164319ad7140d1" func="controller.(*BackupController).reconcile" file="backup_controller.go:384" backup=backup-8d164319ad7140d1 controller=longhorn-backup error="proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1: rpc error: code = Internal desc = failed to create backup: failed to create backup to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job: rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 acquisition" node=ip-10-0-2-144

In longhorn-instance-manager we print out the lock information, and we can see that:

  • After mounting the NFS share
  • It tried to store a lock with type 1 (Backup Lock)
  • But found there was already another type 2 lock (Deletion Lock)
  • So it failed

And the type 2 lock belongs to lock-5a134b725f0d4bc9, which is from the previous backup at 2024-01-22 03:20:54, 2 minutes before.

record-recurring-job contains locks [{ volume: , name: lock-5a134b725f0d4bc9, type: 2, acquired: false, serverTime: 2024-01-22 03:20:54.911544093 +0000 UTC }
2024-01-22T03:22:06.176270280Z time="2024-01-22T03:22:06Z" level=info msg="Attempting mount for nfs path longhorn-test-nfs-svc.default:/opt/backupstore with nfsvers 4.2" func="nfs.(*BackupStoreDriver).mount" file="nfs.go:126" pkg=nfs
2024-01-22T03:22:06.176270680Z time="2024-01-22T03:22:06Z" level=info msg="Mounting NFS share nfs://longhorn-test-nfs-svc.default:/opt/backupstore on mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore with options [nfsvers=4.2 actimeo=1 soft timeo=300 retry=2]" func="nfs.(*BackupStoreDriver).mount" file="nfs.go:137" pkg=nfs
2024-01-22T03:22:06.241416701Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Loaded driver for nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func=nfs.initFunc file="nfs.go:91" pkg=nfs
2024-01-22T03:22:06.242994091Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:06.257240292Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T03:22:06.259027448Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Stored lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.saveLock file="lock.go:190" pkg=backupstore
2024-01-22T03:22:08.269615763Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:08.271305038Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck kind=nfs object=config pkg=backupstore reason=complete


2024-01-22T03:22:08.271316950Z time="2024-01-22T03:22:08Z" level=info msg="Loaded lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck type 2 on backupstore" func=backupstore.loadLock file="lock.go:171" pkg=backupstore
2024-01-22T03:22:08.272779660Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:08.272829487Z time="2024-01-22T03:22:08Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T03:22:08.274577879Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loaded lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.loadLock file="lock.go:171" pkg=backupstore

// Jack: Failed to create Backup
2024-01-22T03:22:08.274593755Z time="2024-01-22T03:22:08Z" level=info msg="Trying to acquire lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: record-recurring-job, name: lock-97560b52474d4658, type: 1, acquired: false, serverTime: 2024-01-22 03:22:06.2440863 +0000 UTC }" pkg=backupstore
2024-01-22T03:22:08.274598142Z time="2024-01-22T03:22:08Z" level=info msg="backupstore volume record-recurring-job contains locks [{ volume: , name: lock-5a134b725f0d4bc9, type: 2, acquired: false, serverTime: 2024-01-22 03:20:54.911544093 +0000 UTC } { volume: , name: lock-97560b52474d4658, type: 1, acquired: false, serverTime: 2024-01-22 03:22:06.2440863 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore
2024-01-22T03:22:08.274602574Z time="2024-01-22T03:22:08Z" level=info msg="Removed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
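
To make the rule behind these messages concrete, here is a minimal, self-contained Go sketch of the compatibility check; the constants, the fileLock struct, and canAcquire are simplified assumptions for illustration, not the actual backupstore implementation (which also tracks acquisition state and lock age):

    package main

    import "fmt"

    // Simplified lock types matching the numbers in the log messages above.
    const (
        backupLock   = 1 // taken while creating or restoring a backup
        deletionLock = 2 // taken while deleting backups or the backup volume
    )

    type fileLock struct {
        name     string
        lockType int
    }

    // canAcquire applies the simplified rule: a new lock can only be acquired if
    // every lock file already present in the volume's locks/ directory has the
    // same type. A leftover deletion lock therefore blocks a backup lock.
    func canAcquire(lockType int, existing []fileLock) bool {
        for _, l := range existing {
            if l.lockType != lockType {
                return false
            }
        }
        return true
    }

    func main() {
        // The situation from the logs: the previous test's deletion lock file was
        // never removed, so the new backup (type 1) lock cannot be acquired.
        existing := []fileLock{{name: "lock-5a134b725f0d4bc9", lockType: deletionLock}}
        if !canAcquire(backupLock, existing) {
            fmt.Println("failed lock lock-97560b52474d4658.lck type 1 acquisition")
        }
    }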

@ChanYiLin
Contributor

ChanYiLin commented Jan 24, 2024

I think it is because the two test cases ran so closely together that the cleanup hadn't finished yet; Longhorn then unmounted the NFS share and remounted it for the next test.

And it seems there is a chance the lock folder and volume folder are still there, somehow not yet deleted, maybe because the deletion got interrupted. The controller then creates the backup volume because it finds the volume folder there (but it should have been deleted).

// Jack: Complete Deleting backup volume
2024-01-22T03:20:52.789271975Z time="2024-01-22T03:20:52Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"
2024-01-22T03:20:54.870734350Z time="2024-01-22T03:20:54Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"


// Jack: unmount the target at 03:20:55 because last test just finished
2024-01-22T03:20:55.788434621Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.827466508Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.843256074Z time="2024-01-22T03:20:55Z" level=info msg="Updated setting backup-target to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"


// Set the Backup Target Back at 03:20:58 because the next test starts
2024-01-22T03:20:58.878151532Z time="2024-01-22T03:20:58Z" level=info msg="Updated setting backup-target to nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"
2024-01-22T03:20:58.878189441Z 10.42.2.16 - - [22/Jan/2024:03:20:58 +0000] "PUT /v1/settings/backup-target HTTP/1.1" 200 427 "" "python-requests/2.25.1"
2024-01-22T03:20:58.994814925Z time="2024-01-22T03:20:58Z" level=info msg="Updated setting backup-target-credential-secret to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"


// Found there is a backup volume in the backupstore that needs to be pulled at `2024-01-22T03:20:58`
2024-01-22T03:20:55.737126292Z time="2024-01-22T03:20:55Z" level=info msg="Stopped backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1069" component=backup-store-timer controller=longhorn-setting interval=30s node=ip-10-0-2-144
2024-01-22T03:20:55.770570172Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-144 url=
2024-01-22T03:20:55.832001360Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-144 url=
2024-01-22T03:20:58.871039708Z time="2024-01-22T03:20:58Z" level=info msg="Starting backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1052" component=backup-store-timer controller=longhorn-setting interval=5m0s node=ip-10-0-2-144
2024-01-22T03:20:58.925993794Z time="2024-01-22T03:20:58Z" level=info msg="Found 1 backup volumes in the backup target that do not exist in the cluster and need to be pulled" func="controller.(*BackupTargetController).syncBackupVolume" file="backup_target_controller.go:458" controller=longhorn-backup-target cred= interval=5m0s node=ip-10-0-2-144 url="nfs://longhorn-test-nfs-svc.default:/opt/backupstore"
2024-01-22T03:20:59.006396421Z time="2024-01-22T03:20:59Z" level=info msg="Stopped backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1069" component=backup-store-timer controller=longhorn-setting interval=5m0s node=ip-10-0-2-144
2024-01-22T03:20:59.006509678Z time="2024-01-22T03:20:59Z" level=info msg="Starting backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1052" component=backup-store-timer controller=longhorn-setting interval=30s node=ip-10-0-2-144

cc @mantissahz @derekbit

@ChanYiLin
Contributor

ChanYiLin commented Jan 24, 2024

Since we didn't change any backup logic in v1.6.0, I suspect this behavior has existed for a very long time.
From the testing side, we can apply the PR I provided to remove the possibility by using a different volume name for each test.
From the Longhorn side, I still need some time to figure out how it happens.

@yangchiu @innobead

@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 24, 2024
@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

I found the root causes:

  1. longhorn-test deletes the BackupVolume, then checks that the BackupVolume is gone, and then unmounts the NFS share
  2. If the BackupVolume only has one backup, the backup_controller will request one last reconcile of the BackupVolume: here
  3. So the if !backupVolume.DeletionTimestamp.IsZero() {} part will be executed twice: here
  4. So err := backupTargetClient.BackupVolumeDelete() will be executed twice
    • Once for the BackupVolume CRD delete event
    • Once because the backup_controller asks the BackupVolume CRD to reconcile again
  5. When removing a backup volume in backupstore, it creates a DELETION lock first
    func DeleteBackupVolume(volumeName string, destURL string) error {
    	bsDriver, err := GetBackupStoreDriver(destURL)
    	if err != nil {
    		return err
    	}
    	lock, err := New(bsDriver, volumeName, DELETION_LOCK)
    	if err != nil {
    		return err
    	}
    
    	if err := lock.Lock(); err != nil {
    		return err
    	}
    	defer lock.Unlock()
    	return removeVolume(volumeName, bsDriver)
    }
    
  6. Then it will delete all the folders of the backup volume
    func removeVolume(volumeName string, driver BackupStoreDriver) error {
    	...
    	if err := driver.Remove(volumeBackupsDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the backups for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeBlocksDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the blocks for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeLocksDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the locks for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeDir); err != nil {
    		return errors.Wrapf(err, "failed to remove backup volume %v directory in backupstore", volumeName)
    	}
    	...
    }
    
  7. There is a chance that:
    • the first reconcile has already finished the cleanup and removed the finalizer
    • before the GC, the second reconcile starts to call BackupVolumeDelete()
    • before it finishes BackupVolumeDelete(), k8s GCs the BackupVolume CR, because the finalizer was already removed by the first reconcile
    • longhorn-test then unsets the BackupTarget and Longhorn unmounts the NFS share in longhorn-manager
    • so the second BackupVolumeDelete doesn't really delete the lock file, because the NFS share is unmounted
  8. The lock file remains on the NFS share
That is why there are two "Complete deleting backup volume" messages in the log:

// Jack: Complete Deleting backup volume
2024-01-22T03:20:52.789271975Z time="2024-01-22T03:20:52Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"
2024-01-22T03:20:54.870734350Z time="2024-01-22T03:20:54Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"


// Jack: unmount the target at 03:20:55 because last test just finished
2024-01-22T03:20:55.788434621Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.827466508Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.843256074Z time="2024-01-22T03:20:55Z" level=info msg="Updated setting backup-target to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"

Thanks for the discussion @derekbit
cc @innobead

@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

The fix is quite easy: if !backupVolume.DeletionTimestamp.IsZero() && !util.FinalizerExists(longhornFinalizerKey, backupVolume), then we can return directly without doing the cleanup again.

@ChanYiLin
Contributor

No wait, the object the second reconcile gets could still have the finalizer on it, because it reads the object from the cache, while in etcd the finalizer is already gone. So that condition won't work: from the second reconcile's perspective, the object still has the finalizer.

We should instead skip the removal when the volumeDir is already gone in the backupstore.
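
A minimal sketch of that idea, assuming a hypothetical driver interface with FileExists/Remove methods (the real backupstore driver and helpers differ): if the volume directory is already gone, the second, racing delete becomes a no-op instead of taking a fresh DELETION lock that could be left behind on the remounted share.

    package main

    import "fmt"

    // Minimal stand-in for the backupstore driver; the real interface differs.
    type backupStoreDriver interface {
        FileExists(path string) bool
        Remove(path string) error
    }

    // fakeDriver simulates a share where the first reconcile already removed the
    // volume directory, which is the race described above.
    type fakeDriver struct{ existing map[string]bool }

    func (d fakeDriver) FileExists(path string) bool { return d.existing[path] }
    func (d fakeDriver) Remove(path string) error    { delete(d.existing, path); return nil }

    // deleteBackupVolume skips the whole removal, including taking a DELETION
    // lock, when the volume directory no longer exists.
    func deleteBackupVolume(volumeName, volumeDir string, driver backupStoreDriver) error {
        if !driver.FileExists(volumeDir) {
            fmt.Printf("backup volume %s directory is already gone, skipping removal\n", volumeName)
            return nil
        }
        // ...otherwise take the DELETION lock and run the existing removeVolume()
        // cleanup of the backups/, blocks/ and locks/ directories shown above...
        return driver.Remove(volumeDir)
    }

    func main() {
        driver := fakeDriver{existing: map[string]bool{}} // directory already deleted
        _ = deleteBackupVolume("record-recurring-job",
            "backupstore/volumes/c5/4e/record-recurring-job", driver)
    }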

@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    • Run the test test_recurring_job_restored_from_backup_target 100 * 3 times

PR:

@innobead innobead assigned chriscchien and ChanYiLin and unassigned yangchiu and ChanYiLin Jan 26, 2024
@yangchiu
Member Author

Verified passed on master-head (longhorn-engine 3c1301) and v1.6.x-head (longhorn-engine 5a9b9d) by running test case test_recurring_job_restored_from_backup_target 300 times. All test cases passed.

Test results:
master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6130/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6132/
v1.6.x-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6133/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6135/

I'll leave closing the issue to @chriscchien in case you're already in the middle of testing.

@derekbit
Member

This reverts commit 1d5129cd1074445964733aa74257d0acd4fc7d9e in backupstore.
The PR resulted in these failed test cases:

  • test_basic.py::test_backup_metadata_deletion[s3] FAILED
  • test_basic.py::test_backup_metadata_deletion[nfs] FAILED

I just reverted it; let's clarify this later. cc @ChanYiLin

@ChanYiLin
Contributor

Note

We should check the volume folder instead of volume.cfg when deleting the backup volume, because there are other tests that verify DeleteBackupVolume when volume.cfg is already gone.
Since S3 doesn't support folders, we need to change the implementation.
PR: longhorn/backupstore#160
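
A sketch of that direction, assuming the driver can list object keys by prefix: on S3 a "folder" only exists as a key prefix, so instead of checking volume.cfg (or a directory entry) the volume folder can be treated as present whenever anything still exists under its prefix. The lister interface and fakeS3 driver here are illustrative only, not the actual backupstore API.

    package main

    import (
        "fmt"
        "strings"
    )

    // Minimal prefix-listing interface; the real backupstore driver differs.
    type lister interface {
        List(prefix string) ([]string, error)
    }

    // fakeS3 stands in for a prefix-based object store.
    type fakeS3 struct{ keys []string }

    func (d fakeS3) List(prefix string) ([]string, error) {
        var matched []string
        for _, k := range d.keys {
            if strings.HasPrefix(k, prefix) {
                matched = append(matched, k)
            }
        }
        return matched, nil
    }

    // volumeFolderExists treats "the folder exists" as "at least one object still
    // lives under the volume's prefix", which works for NFS directories and S3
    // key prefixes alike and does not depend on volume.cfg being present.
    func volumeFolderExists(d lister, volumeDir string) (bool, error) {
        keys, err := d.List(volumeDir + "/")
        if err != nil {
            return false, err
        }
        return len(keys) > 0, nil
    }

    func main() {
        d := fakeS3{keys: []string{
            "backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck",
        }}
        exists, _ := volumeFolderExists(d, "backupstore/volumes/c5/4e/record-recurring-job")
        fmt.Println("volume folder exists:", exists) // true: the leftover lock file keeps the prefix alive
    }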

@yangchiu
Member Author

yangchiu commented Feb 5, 2024

Revisit this after the PR is merged.

@yangchiu
Member Author

Verified passed on master-head (longhorn-engine f241896) by running test case test_recurring_job_restored_from_backup_target 300 times.

Test results:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6199/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6200/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6201/
