
[BUG] Create backup failed: failed lock lock-*.lck type 1 acquisition #7744

Closed
yangchiu opened this issue Jan 22, 2024 · 28 comments
Labels
area/volume-backup-restore Volume backup restore backport/1.6.1 kind/bug priority/2 Nice to fix in this release (managed by PO) reproduce/often 80 - 50% reproducible

Comments

@yangchiu
Member

Describe the bug

Run test case test_recurring_job_restored_from_backup_target[nfs]. Backup creation can occasionally fail with the error message:

  error: 'proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup
    snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1:
    rpc error: code = Internal desc = failed to create backup: failed to create backup
    to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job:
    rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
    type 1 acquisition'

https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/8/testReport/junit/tests/test_recurring_job/test_recurring_job_restored_from_backup_target_nfs_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6026/console

To Reproduce

Run test case test_recurring_job_restored_from_backup_target[nfs]

Expected behavior

Support bundle for troubleshooting

supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip

Environment

  • Longhorn version: v1.6.x-head (longhorn-manager 0c1a49) or master-head (longhorn-manager 8a8414)
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) area/volume-backup-restore Volume backup restore labels Jan 22, 2024
@yangchiu yangchiu added this to the v1.6.0 milestone Jan 22, 2024
@ChanYiLin
Contributor

Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck
    type 1 acquisition

This happens when we try to create a backup while another backup is being deleted.
Maybe the previous test's cleanup hadn't finished before the next one started running.
Still looking into the root cause.

Wonder if this issue happened in the v1.5.x regression tests? @yangchiu

@ChanYiLin
Contributor

ChanYiLin commented Jan 22, 2024

Testing - Modify the test to use a generated volume name instead of a fixed name

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6054/console

@yangchiu
Member Author

Wonder if this issue happened in the v1.5.x regression tests? @yangchiu

Tested with the same test image and ran the test case 100 times; didn't hit this issue in v1.5.x: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6046/

@ChanYiLin
Contributor

Testing - Modify the test to use a generated volume name instead of a fixed name - 99/100 success

2024-01-22T11:12:00.855351723Z test_recurring_job.py::test_recurring_job_restored_from_backup_target[nfs-40-100] wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:12:01.982149653Z wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:12:07.017497292Z wait_for_volume_status: record-recur-longhorn-testvol-zsz8or
2024-01-22T11:15:00.432013583Z FAILED
        complete_backup_1_count = 0
        restore_snapshot_name = ""
        volume = client.by_id_volume(volume_name1)
        wait_for_backup_completion(client, volume_name1)
        for b in volume.backupStatus:
            if back1+"-" in b.snapshot:
                complete_backup_1_count += 1
                restore_snapshot_name = b.snapshot
    
>       assert complete_backup_1_count == 1
E       assert 0 == 1
E         -0
E         +1

But it actually created the backup successfully:
Volume: record-recur-longhorn-testvol-zsz8or
Snapshot: backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528
Backup: backup-89506bb825454b0b

There is no locking issue.
I need to print more information to see whether this is an issue in the test code.

2024-01-22T11:14:06.202150251Z [longhorn-instance-manager] time="2024-01-22T11:14:06Z" level=info msg="Backing up snapshot backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 to backup backup-89506bb825454b0b" func="proxy.(*Proxy).SnapshotBackup" file="backup.go:44" dataEngine=DATA_ENGINE_V1 engineName=record-recur-longhorn-testvol-zsz8or-e-0 serviceURL="10.42.2.9:10000" volumeName=record-recur-longhorn-testvol-zsz8or
2024-01-22T11:14:06.207589147Z [longhorn-instance-manager] time="2024-01-22T11:14:06Z" level=info msg="Backing up backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 on tcp://10.42.3.10:10000, to nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func="sync.(*Task).createBackup" file="backup.go:105"
2024-01-22T11:14:13.296369023Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/locks/lock-b861948cdd5b4837.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.297830520Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Stored lock backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/locks/lock-b861948cdd5b4837.lck type 1 on backupstore" func=backupstore.saveLock file="lock.go:190" pkg=backupstore
2024-01-22T11:14:13.297849720Z time="2024-01-22T11:14:13Z" level=info msg="Done initiating backup creation, received backupID: backup-89506bb825454b0b" func="rpc.(*SyncAgentServer).BackupCreate" file="server.go:734"
2024-01-22T11:14:13.297852380Z time="2024-01-22T11:14:13Z" level=info msg="Performing delta block backup" func=backupstore.CreateDeltaBlockBackup.func2 file="deltablock.go:266" destURL="nfs://longhorn-test-nfs-svc.default:/opt/backupstore" snapshot="&{backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 2024-01-22T11:14:06Z}" volume="&{record-recur-longhorn-testvol-zsz8or 16777216 map[RecurringJob:backup1 VolumeRecurringJobInfo:{\"backup1\":{\"jobSpec\":{\"name\":\"backup1\",\"groups\":[\"default\"],\"task\":\"backup\",\"cron\":\"*/2 * * * *\",\"retain\":1,\"concurrency\":1},\"fromGroup\":[\"default\"],\"fromJob\":true},\"backup2\":{\"jobSpec\":{\"name\":\"backup2\",\"groups\":[\"group01\"],\"task\":\"backup\",\"cron\":\"0 0 * * 0\",\"retain\":3,\"concurrency\":3},\"fromGroup\":[\"group01\"],\"fromJob\":false},\"snapshot1\":{\"jobSpec\":{\"name\":\"snapshot1\",\"task\":\"snapshot\",\"cron\":\"* * * * *\",\"retain\":2,\"concurrency\":1},\"fromJob\":true}} longhorn.io/volume-access-mode:rwo] 2024-01-22T11:14:06Z   0   lz4  v1}"
2024-01-22T11:14:13.297856994Z time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.303047166Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.303060769Z time="2024-01-22T11:14:13Z" level=info msg="Volume record-recur-longhorn-testvol-zsz8or Snapshot backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528 is consist of 1 mappings and 1 blocks" func=backupstore.performBackup file="deltablock.go:496"
2024-01-22T11:14:13.322666229Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Created snapshot changed blocks: 1 mappings, 1 blocks and 1 new blocks" func=backupstore.performBackup file="deltablock.go:524" event=backup object=snapshot pkg=backupstore reason=complete snapshot=backup1--cbdefdb9-b5ab-4e03-85d3-0592303c5528
2024-01-22T11:14:13.324010768Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.328107638Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/backups/backup_backup-89506bb825454b0b.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.329840496Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.329855652Z time="2024-01-22T11:14:13Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T11:14:13.329860528Z time="2024-01-22T11:14:13Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=start
2024-01-22T11:14:13.334825337Z [record-recur-longhorn-testvol-zsz8or-r-e905ddf3] time="2024-01-22T11:14:13Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/dc/56/record-recur-longhorn-testvol-zsz8or/volume.cfg kind=nfs object=config pkg=backupstore reason=complete

@ChanYiLin
Contributor

ChanYiLin commented Jan 23, 2024

I think I found the root cause

    # 2 from job_snap, 1 from job_backup, 1 volume-head
    wait_for_snapshot_count(volume, 4)

    complete_backup_1_count = 0
    restore_snapshot_name = ""
    volume = client.by_id_volume(volume_name1)
    wait_for_backup_completion(client, volume_name1)
    for b in volume.backupStatus:
        if back1+"-" in b.snapshot:
            complete_backup_1_count += 1
            restore_snapshot_name = b.snapshot

    assert complete_backup_1_count == 1

We shouldn't get the volume first and then wait for backup completion, because the volume we get might not have the latest backupStatus; the backup may not have completed yet.

The reason this issue is so rare might be that we first wait for 4 snapshots, and since our backup is small, most of the time it is already complete before we even call wait_for_backup_completion.

cc @mantissahz

Testing - get the volume after wait_for_backup_completion

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6057/console

@mantissahz
Contributor

mantissahz commented Jan 23, 2024

We shouldn't get the volume first and then wait for backup completion, because the volume we get might not have the latest backupStatus; the backup may not have completed yet.

Yes, it is.

@ChanYiLin
We should use the latest volume from wait_for_backup_completion, or get the volume by id after wait_for_backup_completion.
WDYT?

    complete_backup_1_count = 0
    restore_snapshot_name = ""
    #volume = client.by_id_volume(volume_name1)
    volume = wait_for_backup_completion(client, volume_name1)
    for b in volume.backupStatus:
        if back1+"-" in b.snapshot:
            complete_backup_1_count += 1
            restore_snapshot_name = b.snapshot

    assert complete_backup_1_count == 1

@ChanYiLin
Contributor

@mantissahz
Agreed, let me finish the test to see if it passes all 100 runs.
I will then create the PR.

@ChanYiLin
Contributor

I have run the test and confirmed that the root cause is that we get the volume before the backup is ready.
From the logs below, the Backup eventually became ready, but the volume's backupStatus was empty because we fetched the volume before the Backup was ready.

test_recurring_job.py::test_recurring_job_restored_from_backup_target[nfs-43-100] wait_for_volume_status: record-recur-longhorn-testvol-rclsga
wait_for_volume_status: record-recur-longhorn-testvol-rclsga
wait_for_volume_status: record-recur-longhorn-testvol-rclsga
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}
Volume record-recur-longhorn-testvol-rclsga
Backup {'backupURL': 'nfs://longhorn-test-nfs-svc.default:/opt/backupstore?backup=backup-90a16d4a5ef544ba&volume=record-recur-longhorn-testvol-rclsga', 'error': '', 'progress': 100, 'replica': 'record-recur-longhorn-testvol-rclsga-r-d9b01af4', 'size': '2097152', 'snapshot': 'backup1--877b36f7-f959-4a26-967c-111dabce7492', 'state': 'Completed'}
volume.backupStatus: [{'backupURL': '', 'error': '', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': ''}]
FAILED

Conclusion

This issue comes down to 2 causes:

  1. We use the same volume name across tests. While the next test is creating a Backup, the last test might still be deleting its Backup at the same time, which leads to the lock issue. => Should use a random volume name
  2. We should get the volume after the Backup is ready, or Volume.BackupStatus will not be correct

@ChanYiLin
Contributor

ChanYiLin commented Jan 23, 2024

Another related test case failure: https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-2-stage-upgrade-tests-sles-arm64/420/testReport/tests/test_basic/test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress_s3_/

For this test it is also because we deleted the backup previously and the lock hadn't been released yet.
It is a test code issue, not a backup logic issue; we haven't changed the backup logic since v1.5.x.
Besides, test_dr_volume_with_backup_block_deletion and test_dr_volume_with_backup_block_deletion_abort_during_backup_in_progress have the same logic, but one failed and one passed.

@yangchiu
Member Author

yangchiu commented Jan 23, 2024

We use the same volume name across tests. While the next test is creating a Backup, the last test might still be deleting its Backup at the same time, which leads to the lock issue. => Should use a random volume name

Since no test cases run in parallel, this could imply that the backup cleanup after a test case completes doesn't really wait for the backups to be deleted.

But actually there is a wait mechanism in backupstore_cleanup:
https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-upgrade-tests-sles-arm64/728/testReport/junit/tests/test_statefulset/test_statefulset_restore_s3_/

    @pytest.fixture(params=BACKUPSTORE)
    def set_random_backupstore(request, client):
        if request.param == "s3":
            set_backupstore_s3(client)
        elif request.param == "nfs":
            set_backupstore_nfs(client)
            mount_nfs_backupstore(client)
    
        yield
        cleanup_all_volumes(client)
>       backupstore_cleanup(client)

backupstore.py:68: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0xffff9fb6f7f0>

    def backupstore_cleanup(client):
        backup_volumes = client.list_backup_volume()
    
        # we delete the whole backup volume, which skips block gc
        for backup_volume in backup_volumes:
            delete_backup_volume(client, backup_volume.name)
    
        backup_volumes = client.list_backup_volume()
>       assert backup_volumes.data == []
E       AssertionError

backupstore.py:181: AssertionError

In delete_backup_volume, it waits until the backup volume is not found, but after deleting all backup volumes and fetching the list of backup volumes again, the list is still not empty. More investigation needed.

@ChanYiLin
Contributor

Regarding the backupstore pollution, I found proof in the support bundle supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip you provided.

At 2024-01-22T03:22:08Z, the test failed to create the Backup

// Jack: Create Snapshot
2024-01-22T03:22:00.844965669Z time="2024-01-22T03:22:00Z" level=info msg="Creating snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d of volume record-recurring-job" func="controller.(*SnapshotController).handleSnapshotCreate" file="snapshot_controller.go:638" controller=longhorn-snapshot node=ip-10-0-2-144
2024-01-22T03:22:00.909633751Z time="2024-01-22T03:22:00Z" level=info msg="Created snapshot CR backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d with labels map[RecurringJob:backup1] for volume record-recurring-job" func="manager.(*VolumeManager).CreateSnapshotCR" file="snapshot.go:55"
2024-01-22T03:22:00.954379690Z time="2024-01-22T03:22:00Z" level=info msg="Created snapshot CR snapshot-14928467-237d-4984-a7fb-57fd5c419de5 with labels map[RecurringJob:snapshot1] for volume record-recurring-job" func="manager.(*VolumeManager).CreateSnapshotCR" file="snapshot.go:55"
2024-01-22T03:22:00.975294808Z time="2024-01-22T03:22:00Z" level=info msg="Creating snapshot snapshot-14928467-237d-4984-a7fb-57fd5c419de5 of volume record-recurring-job" func="controller.(*SnapshotController).handleSnapshotCreate" file="snapshot_controller.go:638" controller=longhorn-snapshot node=ip-10-0-2-144

// Jack: Failed to create Backup
2024-01-22T03:22:08.277583114Z time="2024-01-22T03:22:08Z" level=warning msg="Cannot take snapshot backup" func=engineapi.NewBackupMonitor file="backup_monitor.go:90" backup=backup-8d164319ad7140d1 controller=longhorn-backup error="proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1: rpc error: code = Internal desc = failed to create backup: failed to create backup to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job: rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 acquisition" node=ip-10-0-2-144
2024-01-22T03:22:08.277680672Z time="2024-01-22T03:22:08Z" level=warning msg="Failed to enable the backup monitor for backup backup-8d164319ad7140d1" func="controller.(*BackupController).reconcile" file="backup_controller.go:384" backup=backup-8d164319ad7140d1 controller=longhorn-backup error="proxyServer=10.42.2.10:8501 destination=10.42.2.10:10000: failed to backup snapshot backup1--ef2a60d6-5123-4f7d-8da8-ea6eff29154d to backup-8d164319ad7140d1: rpc error: code = Internal desc = failed to create backup: failed to create backup to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume record-recurring-job: rpc error: code = Unknown desc = failed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 acquisition" node=ip-10-0-2-144

In longhorn-instance-manager we print out the lock information, and we can see that:

  • After mounting the NFS share
  • It tried to store a lock with type 1 (Backup Lock)
  • But found there was already another type 2 lock (Deletion Lock)
  • So it failed

And the type 2 lock belongs to lock-5a134b725f0d4bc9, which is from the previous backup at 2024-01-22 03:20:54, 2 minutes before.

record-recurring-job contains locks [{ volume: , name: lock-5a134b725f0d4bc9, type: 2, acquired: false, serverTime: 2024-01-22 03:20:54.911544093 +0000 UTC }
2024-01-22T03:22:06.176270280Z time="2024-01-22T03:22:06Z" level=info msg="Attempting mount for nfs path longhorn-test-nfs-svc.default:/opt/backupstore with nfsvers 4.2" func="nfs.(*BackupStoreDriver).mount" file="nfs.go:126" pkg=nfs
2024-01-22T03:22:06.176270680Z time="2024-01-22T03:22:06Z" level=info msg="Mounting NFS share nfs://longhorn-test-nfs-svc.default:/opt/backupstore on mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore with options [nfsvers=4.2 actimeo=1 soft timeo=300 retry=2]" func="nfs.(*BackupStoreDriver).mount" file="nfs.go:137" pkg=nfs
2024-01-22T03:22:06.241416701Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Loaded driver for nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func=nfs.initFunc file="nfs.go:91" pkg=nfs
2024-01-22T03:22:06.242994091Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:80" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:06.257240292Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:91" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T03:22:06.259027448Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:06Z" level=info msg="Stored lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.saveLock file="lock.go:190" pkg=backupstore
2024-01-22T03:22:08.269615763Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:08.271305038Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck kind=nfs object=config pkg=backupstore reason=complete


2024-01-22T03:22:08.271316950Z time="2024-01-22T03:22:08Z" level=info msg="Loaded lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck type 2 on backupstore" func=backupstore.loadLock file="lock.go:171" pkg=backupstore
2024-01-22T03:22:08.272779660Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:55" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=start
2024-01-22T03:22:08.272829487Z time="2024-01-22T03:22:08Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:66" filepath=backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck kind=nfs object=config pkg=backupstore reason=complete
2024-01-22T03:22:08.274577879Z [record-recurring-job-r-26827386] time="2024-01-22T03:22:08Z" level=info msg="Loaded lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.loadLock file="lock.go:171" pkg=backupstore

// Jack: Failed to create Backup
2024-01-22T03:22:08.274593755Z time="2024-01-22T03:22:08Z" level=info msg="Trying to acquire lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: record-recurring-job, name: lock-97560b52474d4658, type: 1, acquired: false, serverTime: 2024-01-22 03:22:06.2440863 +0000 UTC }" pkg=backupstore
2024-01-22T03:22:08.274598142Z time="2024-01-22T03:22:08Z" level=info msg="backupstore volume record-recurring-job contains locks [{ volume: , name: lock-5a134b725f0d4bc9, type: 2, acquired: false, serverTime: 2024-01-22 03:20:54.911544093 +0000 UTC } { volume: , name: lock-97560b52474d4658, type: 1, acquired: false, serverTime: 2024-01-22 03:22:06.2440863 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore
2024-01-22T03:22:08.274602574Z time="2024-01-22T03:22:08Z" level=info msg="Removed lock backupstore/volumes/c5/4e/record-recurring-job/locks/lock-97560b52474d4658.lck type 1 on backupstore" func=backupstore.removeLock file="lock.go:180" pkg=backupstore
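
To make the rule behind these messages concrete, here is a minimal, self-contained Go sketch of the compatibility check; the constants, the fileLock struct, and canAcquire are simplified assumptions for illustration, not the actual backupstore implementation (which also tracks acquisition state and lock age):

    package main

    import "fmt"

    // Simplified lock types matching the numbers in the log messages above.
    const (
        backupLock   = 1 // taken while creating or restoring a backup
        deletionLock = 2 // taken while deleting backups or the backup volume
    )

    type fileLock struct {
        name     string
        lockType int
    }

    // canAcquire applies the simplified rule: a new lock can only be acquired if
    // every lock file already present in the volume's locks/ directory has the
    // same type. A leftover deletion lock therefore blocks a backup lock.
    func canAcquire(lockType int, existing []fileLock) bool {
        for _, l := range existing {
            if l.lockType != lockType {
                return false
            }
        }
        return true
    }

    func main() {
        // The situation from the logs: the previous test's deletion lock file was
        // never removed, so the new backup (type 1) lock cannot be acquired.
        existing := []fileLock{{name: "lock-5a134b725f0d4bc9", lockType: deletionLock}}
        if !canAcquire(backupLock, existing) {
            fmt.Println("failed lock lock-97560b52474d4658.lck type 1 acquisition")
        }
    }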

@ChanYiLin
Contributor

ChanYiLin commented Jan 24, 2024

I think it is because the two test cases ran so closely together that the cleanup hadn't finished yet; Longhorn then unmounted the NFS share and remounted it for the next test.

And it seems there is a chance the lock folder and volume folder are still there, somehow not yet deleted, maybe because the deletion got interrupted. The controller then creates the backup volume because it finds the volume folder there (but it should have been deleted).

// Jack: Complete Deleting backup volume
2024-01-22T03:20:52.789271975Z time="2024-01-22T03:20:52Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"
2024-01-22T03:20:54.870734350Z time="2024-01-22T03:20:54Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"


// Jack: unmount the target at 03:20:55 because last test just finished
2024-01-22T03:20:55.788434621Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.827466508Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.843256074Z time="2024-01-22T03:20:55Z" level=info msg="Updated setting backup-target to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"


// Set the Backup Target Back at 03:20:58 because the next test starts
2024-01-22T03:20:58.878151532Z time="2024-01-22T03:20:58Z" level=info msg="Updated setting backup-target to nfs://longhorn-test-nfs-svc.default:/opt/backupstore" func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"
2024-01-22T03:20:58.878189441Z 10.42.2.16 - - [22/Jan/2024:03:20:58 +0000] "PUT /v1/settings/backup-target HTTP/1.1" 200 427 "" "python-requests/2.25.1"
2024-01-22T03:20:58.994814925Z time="2024-01-22T03:20:58Z" level=info msg="Updated setting backup-target-credential-secret to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"


// Found there is a backup volume in the backupstore that needs to be pulled at `2024-01-22T03:20:58`
2024-01-22T03:20:55.737126292Z time="2024-01-22T03:20:55Z" level=info msg="Stopped backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1069" component=backup-store-timer controller=longhorn-setting interval=30s node=ip-10-0-2-144
2024-01-22T03:20:55.770570172Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-144 url=
2024-01-22T03:20:55.832001360Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-144 url=
2024-01-22T03:20:58.871039708Z time="2024-01-22T03:20:58Z" level=info msg="Starting backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1052" component=backup-store-timer controller=longhorn-setting interval=5m0s node=ip-10-0-2-144
2024-01-22T03:20:58.925993794Z time="2024-01-22T03:20:58Z" level=info msg="Found 1 backup volumes in the backup target that do not exist in the cluster and need to be pulled" func="controller.(*BackupTargetController).syncBackupVolume" file="backup_target_controller.go:458" controller=longhorn-backup-target cred= interval=5m0s node=ip-10-0-2-144 url="nfs://longhorn-test-nfs-svc.default:/opt/backupstore"
2024-01-22T03:20:59.006396421Z time="2024-01-22T03:20:59Z" level=info msg="Stopped backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1069" component=backup-store-timer controller=longhorn-setting interval=5m0s node=ip-10-0-2-144
2024-01-22T03:20:59.006509678Z time="2024-01-22T03:20:59Z" level=info msg="Starting backup store timer" func="controller.(*BackupStoreTimer).Start" file="setting_controller.go:1052" component=backup-store-timer controller=longhorn-setting interval=30s node=ip-10-0-2-144

cc @mantissahz @derekbit

@ChanYiLin
Contributor

ChanYiLin commented Jan 24, 2024

Since we didn't change any backup logic in v1.6.0, I suspect this behavior has existed for a very long time.
From the testing side, we can apply the PR I provided to remove the possibility by using a different volume name for each test.
From the Longhorn side, I still need some time to figure out how it happens.

@yangchiu @innobead

@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 24, 2024
@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

I found the root causes:

  1. longhorn-test deletes the BackupVolume, then checks that the BackupVolume is gone, and then unmounts the NFS share
  2. If the BackupVolume only has one backup, the backup_controller will request one last reconcile of the BackupVolume: here
  3. So the if !backupVolume.DeletionTimestamp.IsZero() {} part will be executed twice: here
  4. So err := backupTargetClient.BackupVolumeDelete() will be executed twice
    • Once for the BackupVolume CRD delete event
    • Once because the backup_controller asks the BackupVolume CRD to reconcile again
  5. When removing a backup volume in backupstore, it creates a DELETION lock first
    func DeleteBackupVolume(volumeName string, destURL string) error {
    	bsDriver, err := GetBackupStoreDriver(destURL)
    	if err != nil {
    		return err
    	}
    	lock, err := New(bsDriver, volumeName, DELETION_LOCK)
    	if err != nil {
    		return err
    	}
    
    	if err := lock.Lock(); err != nil {
    		return err
    	}
    	defer lock.Unlock()
    	return removeVolume(volumeName, bsDriver)
    }
    
  6. Then it will delete all the folders of the backup volume
    func removeVolume(volumeName string, driver BackupStoreDriver) error {
    	...
    	if err := driver.Remove(volumeBackupsDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the backups for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeBlocksDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the blocks for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeLocksDirectory); err != nil {
    		return errors.Wrapf(err, "failed to remove all the locks for volume %v", volumeName)
    	}
    	if err := driver.Remove(volumeDir); err != nil {
    		return errors.Wrapf(err, "failed to remove backup volume %v directory in backupstore", volumeName)
    	}
    	...
    }
    
  7. There is a chance that:
    • the first reconcile has already finished the cleanup and removed the finalizer
    • before the GC, the second reconcile starts to call BackupVolumeDelete()
    • before it finishes BackupVolumeDelete(), k8s GCs the BackupVolume CR, because the finalizer was already removed by the first reconcile
    • longhorn-test then unsets the BackupTarget and Longhorn unmounts the NFS share in longhorn-manager
    • so the second BackupVolumeDelete doesn't really delete the lock file, because the NFS share is unmounted
  8. The lock file remains on the NFS share
That is why there are two "Complete deleting backup volume" messages in the log:

// Jack: Complete Deleting backup volume
2024-01-22T03:20:52.789271975Z time="2024-01-22T03:20:52Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"
2024-01-22T03:20:54.870734350Z time="2024-01-22T03:20:54Z" level=info msg="Complete deleting backup volume record-recurring-job" func="engineapi.(*BackupTargetClient).BackupVolumeDelete" file="backups.go:228"


// Jack: unmount the target at 03:20:55 because last test just finished
2024-01-22T03:20:55.788434621Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="rpc error: code = Internal desc = failed to unmount all mount points: failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/backupstore: no such file or directory; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.827466508Z time="2024-01-22T03:20:55Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:300" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-01-22T03:20:55Z\" level=info msg=\"Trying to force clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore\" func=util.cleanupMount file=\"util.go:285\" pkg=backupstore\ntime=\"2024-01-22T03:20:55Z\" level=warning msg=\"Failed to clean up mount points\" func=cmd.doCleanUpAllMounts file=\"cleanup_all_mounts.go:28\" Command=cleanup-mount error=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\"\ntime=\"2024-01-22T03:20:55Z\" level=error msg=\"failed to clean up mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: remove /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore: directory not empty; failed to get file info of /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: lstat /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_default/opt/backupstore/lost+found: no such file or directory\" func=main.ResponseLogAndError file=\"main.go:47\"\n: exit status 1" interval=30s node=ip-10-0-2-252 url=
2024-01-22T03:20:55.843256074Z time="2024-01-22T03:20:55Z" level=info msg="Updated setting backup-target to " func="manager.(*VolumeManager).CreateOrUpdateSetting" file="setting.go:55"

Thanks for the discussion @derekbit
cc @innobead

@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

The fix is quite easy: if !backupVolume.DeletionTimestamp.IsZero() && !util.FinalizerExists(longhornFinalizerKey, backupVolume), then we can return directly without doing the cleanup again.

@ChanYiLin
Contributor

No wait, the object the second reconcile gets could still have the finalizer on it, because it reads the object from the cache, while in etcd the finalizer is already gone. So that condition won't work: from the second reconcile's perspective, the object still has the finalizer.

We should instead skip the removal when the volumeDir is already gone in the backupstore.
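
A minimal sketch of that idea, assuming a hypothetical driver interface with FileExists/Remove methods (the real backupstore driver and helpers differ): if the volume directory is already gone, the second, racing delete becomes a no-op instead of taking a fresh DELETION lock that could be left behind on the remounted share.

    package main

    import "fmt"

    // Minimal stand-in for the backupstore driver; the real interface differs.
    type backupStoreDriver interface {
        FileExists(path string) bool
        Remove(path string) error
    }

    // fakeDriver simulates a share where the first reconcile already removed the
    // volume directory, which is the race described above.
    type fakeDriver struct{ existing map[string]bool }

    func (d fakeDriver) FileExists(path string) bool { return d.existing[path] }
    func (d fakeDriver) Remove(path string) error    { delete(d.existing, path); return nil }

    // deleteBackupVolume skips the whole removal, including taking a DELETION
    // lock, when the volume directory no longer exists.
    func deleteBackupVolume(volumeName, volumeDir string, driver backupStoreDriver) error {
        if !driver.FileExists(volumeDir) {
            fmt.Printf("backup volume %s directory is already gone, skipping removal\n", volumeName)
            return nil
        }
        // ...otherwise take the DELETION lock and run the existing removeVolume()
        // cleanup of the backups/, blocks/ and locks/ directories shown above...
        return driver.Remove(volumeDir)
    }

    func main() {
        driver := fakeDriver{existing: map[string]bool{}} // directory already deleted
        _ = deleteBackupVolume("record-recurring-job",
            "backupstore/volumes/c5/4e/record-recurring-job", driver)
    }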

@ChanYiLin
Contributor

ChanYiLin commented Jan 25, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    • Run the test test_recurring_job_restored_from_backup_target 100 * 3 times

PR:

@innobead innobead assigned chriscchien and ChanYiLin and unassigned yangchiu and ChanYiLin Jan 26, 2024
@yangchiu
Member Author

Verified passed on master-head (longhorn-engine 3c1301) and v1.6.x-head (longhorn-engine 5a9b9d) by running test case test_recurring_job_restored_from_backup_target 300 times. All test cases passed.

Test results:
master-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6130/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6132/
v1.6.x-head: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6133/ ~ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6135/

I'll leave closing the issue to @chriscchien in case you're already in the middle of testing.

@derekbit
Member

This reverts commit 1d5129cd1074445964733aa74257d0acd4fc7d9e in backupstore.
The PR resulted in these failed test cases:

  • test_basic.py::test_backup_metadata_deletion[s3] FAILED
  • test_basic.py::test_backup_metadata_deletion[nfs] FAILED

I just reverted it; let's clarify this later. cc @ChanYiLin

@ChanYiLin
Contributor

Note

We should check the volume folder instead of volume.cfg when deleting the backup volume, because there are other tests that verify DeleteBackupVolume when volume.cfg is already gone.
Since S3 doesn't support folders, we need to change the implementation.
PR: longhorn/backupstore#160
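
A sketch of that direction, assuming the driver can list object keys by prefix: on S3 a "folder" only exists as a key prefix, so instead of checking volume.cfg (or a directory entry) the volume folder can be treated as present whenever anything still exists under its prefix. The lister interface and fakeS3 driver here are illustrative only, not the actual backupstore API.

    package main

    import (
        "fmt"
        "strings"
    )

    // Minimal prefix-listing interface; the real backupstore driver differs.
    type lister interface {
        List(prefix string) ([]string, error)
    }

    // fakeS3 stands in for a prefix-based object store.
    type fakeS3 struct{ keys []string }

    func (d fakeS3) List(prefix string) ([]string, error) {
        var matched []string
        for _, k := range d.keys {
            if strings.HasPrefix(k, prefix) {
                matched = append(matched, k)
            }
        }
        return matched, nil
    }

    // volumeFolderExists treats "the folder exists" as "at least one object still
    // lives under the volume's prefix", which works for NFS directories and S3
    // key prefixes alike and does not depend on volume.cfg being present.
    func volumeFolderExists(d lister, volumeDir string) (bool, error) {
        keys, err := d.List(volumeDir + "/")
        if err != nil {
            return false, err
        }
        return len(keys) > 0, nil
    }

    func main() {
        d := fakeS3{keys: []string{
            "backupstore/volumes/c5/4e/record-recurring-job/locks/lock-5a134b725f0d4bc9.lck",
        }}
        exists, _ := volumeFolderExists(d, "backupstore/volumes/c5/4e/record-recurring-job")
        fmt.Println("volume folder exists:", exists) // true: the leftover lock file keeps the prefix alive
    }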

@yangchiu
Member Author

yangchiu commented Feb 5, 2024

Revisit this after the PR is merged.

@yangchiu
Member Author

Verified passed on master-head (longhorn-engine f241896) by running test case test_recurring_job_restored_from_backup_target 300 times.

Test results:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6199/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6200/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6201/
