
[BUG] Backup in error state can not be deleted #3620

Closed
Tracked by #3629
chriscchien opened this issue Feb 14, 2022 · 28 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore kind/bug kind/regression Regression which has worked before priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Milestone

Comments

@chriscchien
Contributor

Describe the bug

From e2e case test_backup_status_for_unavailable_replicas

When a backup is in the error state, it cannot be deleted; this behavior makes the e2e case fail.
This can be reproduced on Ubuntu and SLES.

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume with 2 Gi of storage and attach it to a node
  2. Write 500 Mi of data into the volume
  3. From the UI, create a backup of the volume
  4. While the backup is in progress, delete the volume replicas from the UI
  5. On the backup page, the backup will be in the error state
  6. Try to delete the errored backup from the UI; no error message is shown and the backup in the error state still exists

Expected behavior

A backup in the error state should be deletable.

Environment

  • Longhorn version: longhorn-manager:master-head
  • Installation method : Kubectl
  • Kubernetes distro and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
  • Underlying Infrastructure : AWS
  • Number of Longhorn volumes in the cluster: 1
@chriscchien chriscchien added kind/bug severity/3 Function working but has a major issue w/ workaround labels Feb 14, 2022
@innobead innobead added this to the v1.3.0 milestone Feb 14, 2022
@innobead
Member

innobead commented Feb 14, 2022

cc @jenting @kaxing

@cchien816 have you seen any logs from the longhorn managers about the error that causes the deletion to fail?

@innobead innobead added the area/volume-backup-restore Volume backup restore label Feb 14, 2022
@chriscchien
Contributor Author

chriscchien commented Feb 14, 2022

Attached are the longhorn-manager logs captured while I was testing.
longhorn-manager-z62fd_longhorn-manager.log
longhorn-manager-pzxq7_longhorn-manager.log
longhorn-manager-4nxh9_longhorn-manager.log

The log below, from longhorn-manager-z62fd_longhorn-manager.log, may be related to deleting the error-state backup:

ime="2022-02-14T09:55:04Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e4fec19a39ee4065" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-bbdf99ada05a47fd.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:04Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-bbdf99ada05a47fd.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:06 pm | time="2022-02-14T09:55:06Z" level=error msg="Error deleting remote backup" backup=backup-e4fec19a39ee4065 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:06Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:06 pm | E0214 09:55:06.662131 1 backup_controller.go:151] longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition
Mon, Feb 14 2022 5:55:06 pm | , stderr, time="2022-02-14T09:55:06Z" level=error msg="failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition"
Mon, Feb 14 2022 5:55:06 pm | , error exit status 1
Mon, Feb 14 2022 5:55:06 pm | time="2022-02-14T09:55:06Z" level=warning msg="Dropping Longhorn backup longhorn-system/backup-e4fec19a39ee4065 out of the queue" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:06Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:24 pm | time="2022-02-14T09:55:24Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:55:24 pm | 1.163.230.77 - - [14/Feb/2022:09:55:24 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:24 pm | 1.163.230.77 - - [14/Feb/2022:09:55:24 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:24 pm | time="2022-02-14T09:55:24Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:47 pm | time="2022-02-14T09:55:47Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:55:47 pm | 1.163.230.77 - - [14/Feb/2022:09:55:47 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:47 pm | 1.163.230.77 - - [14/Feb/2022:09:55:47 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:54 pm | time="2022-02-14T09:55:54Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139
Mon, Feb 14 2022 5:56:04 pm | time="2022-02-14T09:56:04Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:56:04 pm | 1.163.230.77 - - [14/Feb/2022:09:56:04 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:56:04 pm | 1.163.230.77 - - [14/Feb/2022:09:56:04 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:56:24 pm | time="2022-02-14T09:56:24Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139


@derekbit
Member

@cchien816
Do you have longhorn-admission-webhook and longhorn-conversion-webhook logs? I'd like to see if it is related to webhooks or caused by other issues.

@khushboo-rancher
Contributor

Raising the severity to 1 as this is a regression.

@khushboo-rancher khushboo-rancher added priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) kind/regression Regression which has worked before and removed severity/3 Function working but has a major issue w/ workaround labels Feb 14, 2022
@chriscchien
Contributor Author

chriscchien commented Feb 15, 2022

Update: This issue may be related to scheduling.
In my local environment, the error-state backup disappeared (was deleted) after a long period of time.

@derekbit
Member

derekbit commented Feb 15, 2022

After reverting the scheduling commits in #3083, the deletion works as expected.
I will come back to check again after the scheduling issue is fixed.

cc @PhanLe1010

@PhanLe1010
Contributor

I think this is not related to the scheduling issue #3626, so the PR for that issue doesn't fix this one.

I can see from the logs that the Longhorn UI set the deletion timestamp on the backup CR. The backup controller then tried to delete the backup from the backup target but failed with the error failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition. After 4 failed attempts (more than maxretries, which is 3), the backup controller gave up and forgot this event.

Action item:

  • Increasing the resync period helps surface a new hidden issue here: why does deleting the backup fail with the error failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition? cc @jenting
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=testvol1 controller=longhorn-backup-volume node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:36Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:36Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:38Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:38Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:38Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:38Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rlvlb longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:38 +0000] "GET /metrics HTTP/1.1" 200 4807 "" "Prometheus/2.16.0"
[longhorn-manager-wcpzc longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:39 +0000] "GET /metrics HTTP/1.1" 200 4748 "" "Prometheus/2.16.0"
[longhorn-manager-rxqdv longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:40 +0000] "GET /metrics HTTP/1.1" 200 5008 "" "Prometheus/2.16.0"
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:40Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:40Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:40Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:40Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-wcpzc longhorn-manager] time="2022-02-15T22:32:41Z" level=debug msg="Trigger sync backup target" component=backup-store-timer controller=longhorn-setting interval=5s node=phan-v117-worker-1
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:41Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=testvol1 controller=longhorn-backup-volume node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:42Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:42Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] E0215 22:32:42.783506       1 backup_controller.go:151] longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition
[longhorn-manager-rxqdv longhorn-manager] , stderr, time="2022-02-15T22:32:42Z" level=error msg="failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition"
[longhorn-manager-rxqdv longhorn-manager] , error exit status 1
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:42Z" level=warning msg="Dropping Longhorn backup longhorn-system/backup-e6aee84a44d24622 out of the queue" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:42Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2

@jenting
Contributor

jenting commented Feb 18, 2022

I think it's because the backup was still in progress when it was cancelled, so the backup lock still existed on the remote backup target.
When the deletion event comes in, longhorn-engine tries to acquire a lock but fails because the condition below does not pass.

We can only acquire a lock if there is no unexpired (1) lock of a different
type (2) that has priority (3).

  1. Locks are self-expiring: once the current time is after lock.serverTime + LOCK_DURATION, we no longer need to consider
    this lock as valid.
  2. Backup & Restore locks are mapped to compatible types, while Delete locks are mapped to a different type to be mutually exclusive with the others.
  3. Priority is based on the comparison order, where locks are compared by lock.Acquired, then by lock.serverTime, followed by lock.Name. Acquired locks are always sorted before non-acquired locks.

After 2.5 minutes, the lock acquisition will time out, and the backup operation will run into that error and fail.

Therefore, the cancelled backup operation would release the lock after 2.5 minutes.
Changing the resync period back to 30 seconds allows the deletion operation to acquire the lock after about 5 resyncs.

cc @joshimoo
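
For illustration, here is a minimal Go sketch of the acquisition rules quoted above; the type and field names are assumptions for readability, not the actual longhorn/backupstore implementation.

```go
package backuplock

import (
	"sort"
	"time"
)

// Assumed constant: the lock self-expiry window mentioned above (~2.5 minutes).
const lockDuration = 150 * time.Second

type lockType int

const (
	backupRestoreLock lockType = iota // backup & restore map to a compatible type
	deleteLock                        // delete is mutually exclusive with the others
)

type fileLock struct {
	Name       string
	Type       lockType
	Acquired   bool
	ServerTime time.Time
}

// expired implements rule (1): a lock is only valid until ServerTime + lockDuration.
func (l fileLock) expired(now time.Time) bool {
	return now.After(l.ServerTime.Add(lockDuration))
}

// canAcquire implements rules (2) and (3): our lock can be acquired only if no
// unexpired lock of a different type sorts ahead of it in priority order
// (Acquired first, then ServerTime, then Name).
func canAcquire(ours fileLock, existing []fileLock, now time.Time) bool {
	locks := append([]fileLock{ours}, existing...)
	sort.Slice(locks, func(i, j int) bool {
		if locks[i].Acquired != locks[j].Acquired {
			return locks[i].Acquired // acquired locks always sort first
		}
		if !locks[i].ServerTime.Equal(locks[j].ServerTime) {
			return locks[i].ServerTime.Before(locks[j].ServerTime)
		}
		return locks[i].Name < locks[j].Name
	})
	for _, l := range locks {
		if l.Name == ours.Name {
			return true // nothing with higher priority blocks us
		}
		if !l.expired(now) && l.Type != ours.Type {
			return false // an unexpired lock of a different type has priority
		}
	}
	return false
}
```

Under these rules, the delete operation's lock request stays blocked while the orphaned backup lock is unexpired, which is why the deletion only succeeds once the roughly 2.5-minute window has passed.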

@derekbit
Member

derekbit commented Feb 22, 2022

After discussing with @jenting, we need to investigate a few points first:

  • Why was the backup job in the error state after deleting the replicas?
    Deleting the replicas used for the backup results in the error state.

  • What does the error state show?
    There are several error states.

    failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [--url 10.42.1.9:10000 backup status backup-65fd1e3b49574b4f --replica tcp://10.42.1.10:10000], output , stderr, time="2022-02-22T07:39:27Z" level=fatal msg="Error querying backup status: Failed to get backup status on tcp://10.42.1.10:10000 for backup-65fd1e3b49574b4f: unknown replica"
    , error exit status 1

Ref: https://github.com/longhorn/longhorn-manager/blob/7bd89abdbae3c406ec56094364cf6c2da1d36d9d/engineapi/backup_monitor.go#L114

  error: cannot find matched snapshot /3d9f8138-d1c5-42b5-a701-b6bb859f81c2 of backup
    backup-70820393831648ef status in longhorn engine
  • After a replica is deleted and the backup becomes errored, what is the error handling in the longhorn-engine backup?
    There is none today. Deleting a replica terminates the backup process immediately, so the backup process has no chance to release the lock.
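
Since the orphaned lock simply expires after about 2.5 minutes, one possible direction (a sketch only, not necessarily what the eventual fix implements) is to keep retrying the remote deletion past the lock window instead of giving up within milliseconds; deleteRemoteBackup is a hypothetical helper standing in for the engine's backup rm call.

```go
package controller

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deleteBackupWithRetry keeps retrying the remote deletion until the stale
// backup lock expires (~2.5 minutes) or a different error occurs.
// deleteRemoteBackup is a hypothetical helper wrapping "longhorn backup rm".
func deleteBackupWithRetry(deleteRemoteBackup func() error) error {
	return wait.PollImmediate(30*time.Second, 3*time.Minute, func() (bool, error) {
		err := deleteRemoteBackup()
		if err == nil {
			return true, nil // deletion succeeded
		}
		if strings.Contains(err.Error(), "failed lock") {
			return false, nil // lock still held by the dead backup; retry later
		}
		return false, err // unrelated errors are surfaced immediately
	})
}
```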

@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 24, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • [ ] Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at backup: retry deletion after failing to acquire the lock longhorn-manager#1232

  • Which areas/issues this PR might have potential impacts on?
    Area backingimage
    Issues

  • [ ] If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • [ ] If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • [ ] If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • [ ] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • [ ] If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • [ ] If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • [ ] ~~If the fix introduces the code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?~~
The compatibility issue is filed at

@PhanLe1010
Contributor

PhanLe1010 commented Feb 28, 2022

I am wondering if we should remove the maxretries limit entirely. If we fail to sync a CR, I think we should retry until we succeed (with the backoff capped at 5 minutes). This is similar to when the CSI attacher fails a NodePublishVolume call: it retries indefinitely until it succeeds, with a maximum 5-minute backoff.

Is there a strong reason to keep the maxretries @joshimoo @shuo-wu ?
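
As a rough sketch of that idea (not the actual longhorn-manager code), the controller could drop the maxRetries check entirely and rely on client-go's exponential rate limiter with the backoff capped at 5 minutes:

```go
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newBackupQueue builds a workqueue whose per-item backoff starts at 5ms,
// doubles on every failure, and is capped at 5 minutes.
func newBackupQueue() workqueue.RateLimitingInterface {
	rateLimiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute)
	return workqueue.NewRateLimitingQueue(rateLimiter)
}

// handleErr requeues failed items forever instead of dropping them after a
// fixed number of retries.
func handleErr(queue workqueue.RateLimitingInterface, err error, key interface{}) {
	if err == nil {
		queue.Forget(key) // success resets the backoff counter
		return
	}
	queue.AddRateLimited(key) // no maxRetries comparison: always retry with backoff
}
```

With this setup, a backup CR whose remote deletion keeps failing on the stale lock would simply be retried with growing delays until the lock expires, instead of being dropped after the third attempt.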

@shuo-wu
Contributor

shuo-wu commented Mar 1, 2022

I guess at the beginning Longhorn just followed the basic pattern of Kubernetes controllers, which typically set a max retry limit.
Your consideration is fair, and some Kubernetes controllers do not set a max retry limit either. But we need to consider whether removed CRs or problematic CRs can be handled correctly rather than staying in the queue forever.

@PhanLe1010
Contributor

PhanLe1010 commented Mar 1, 2022

For the removed CRs, they will be dropped out of the queue (link).

Problematic CRs should be divided into 2 categories:

  • Recoverable errors (API error, failed precondition, ...): these should go back into the queue with a backoff limit
  • Non-recoverable errors (non-existing snapshot, bad naming format): these can be dropped out of the queue immediately, i.e., print the error message and drop the item

What do you think? @shuo-wu, @jenting, @derekbit
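
A minimal sketch of that split, assuming a hypothetical error classification rather than any existing longhorn-manager type:

```go
package controller

import (
	"errors"
	"log"

	"k8s.io/client-go/util/workqueue"
)

// errNonRecoverable is an illustrative sentinel for errors that retrying
// cannot fix (e.g. the snapshot no longer exists); it is not an existing
// longhorn-manager type.
var errNonRecoverable = errors.New("non-recoverable")

// handleSyncError assumes errors are recoverable by default and requeues them
// with backoff; only explicitly classified non-recoverable errors are logged
// and dropped immediately.
func handleSyncError(queue workqueue.RateLimitingInterface, key interface{}, err error) {
	switch {
	case err == nil:
		queue.Forget(key)
	case errors.Is(err, errNonRecoverable):
		log.Printf("dropping %v out of the queue: %v", key, err)
		queue.Forget(key)
	default:
		queue.AddRateLimited(key) // recoverable by default
	}
}
```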

@shuo-wu
Contributor

shuo-wu commented Mar 2, 2022

Sounds good. But the problem is, it's hard to figure out non-recoverable errors for each controller.

@PhanLe1010
Contributor

Can we take a fail-safe approach that assumes all errors are recoverable by default unless proven otherwise?
Btw, requeuing by calling AddRateLimited() is not expensive; the object will be reconsidered every 5 minutes at most.

@chriscchien chriscchien self-assigned this Mar 10, 2022
@shuo-wu
Contributor

shuo-wu commented Mar 11, 2022

@PhanLe1010
Contributor

Good observation! But the threshold is set to 0 by default, which means retrying forever: https://github.com/kubernetes-csi/external-provisioner/blob/32c27a57ce311cf7b0fedefd6cc850845a41e341/cmd/csi-provisioner/csi-provisioner.go#L359

@shuo-wu
Contributor

shuo-wu commented Mar 14, 2022

OK. Then I am fine with your idea.

@PhanLe1010
Contributor

cc @jenting What do you think about the above discussion?

@jenting
Contributor

jenting commented Mar 15, 2022

I agree with the above approach.

I have a question: a non-recoverable error will be dropped out of the queue immediately. Does that mean the CR will still be reconciled every resync period but simply return during that reconcile? Also, how do we determine whether a Backup CR's error is non-recoverable or not? By what criteria?

@PhanLe1010
Contributor

PhanLe1010 commented Mar 15, 2022

Does that mean the CR will still be reconciled every resync period but simply return during that reconcile?

Yes, the CR will be touched again every resync period (1h or more)

Also, how do we determine whether a Backup CR's error is non-recoverable or not? By what criteria?

Some ideas off the top of my head:

  • By default, we consider errors recoverable
  • Some obvious non-recoverable errors: the source volume doesn't exist; the backup target setting is empty; ...

@jenting
Contributor

jenting commented Mar 16, 2022

Okay, it looks like we need to define different error codes (like the Kubernetes conditions data structure) in the Backup CR to indicate whether an error is recoverable or non-recoverable.
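
For illustration only (this is not the actual Backup CRD), such a condition-style field might look like the sketch below, where the Recoverable flag is what the controller would use to decide between requeueing with backoff and dropping the item:

```go
package types

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BackupErrorCondition is a hypothetical condition-style record of the last
// sync error; the field names are illustrative, not part of the real CRD.
type BackupErrorCondition struct {
	Type               string      `json:"type"`        // e.g. "SyncError"
	Reason             string      `json:"reason"`      // e.g. "LockAcquisitionFailed", "SnapshotNotFound"
	Message            string      `json:"message"`
	Recoverable        bool        `json:"recoverable"` // drives requeue-vs-drop in the controller
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
}
```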

@innobead
Member

Before we have a way to define different types of errors, as per the above discussion, should we remove the max retry and just keep re-adding to the queue, assuming all errors are recoverable for now?

@derekbit
Member

derekbit commented Mar 31, 2022

We could increase the maxRetries number. The value is 3, and the corresponding retry intervals are 5 ms, 10 ms, and 20 ms, which are pretty short.

How about increasing the value so that the maximum interval becomes, e.g., 30 seconds? @PhanLe1010 @shuo-wu @jenting @innobead
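
For reference (assuming the interval keeps doubling from the 5 ms base as in the first three retries), the n-th retry waits 5 ms × 2^(n-1), so an interval of roughly 30 seconds is only reached around the 14th retry (5 ms × 2^13 ≈ 41 s); maxRetries would need to be raised to about 13 or 14 for the backoff to grow that long.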

@derekbit
Member

derekbit commented Apr 1, 2022

After discussing with @PhanLe1010, we will keep it as is.

@chriscchien
Contributor Author

Verified passed on master-head 20220407.

A backup in the error state can now be deleted.
The e2e test also passed, so closing this ticket.
