
[BUG] Backup in error state can not be deleted #3620

Closed
Tracked by #3629
chriscchien opened this issue Feb 14, 2022 · 28 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore kind/bug kind/regression Regression which has worked before priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Milestone

Comments

@chriscchien
Contributor

Describe the bug

From e2e case test_backup_status_for_unavailable_replicas

When a backup is in the error state, it cannot be deleted; this behavior makes the e2e case fail.
This can be reproduced on Ubuntu and SLES.

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume with 2 Gi of storage and attach it to a node
  2. Write 500 Mi of data into the volume
  3. From the UI, create a backup of the volume
  4. While the backup is in progress, delete the volume replicas from the UI
  5. On the backup page, the backup will be in the error state
  6. Try to delete the errored backup from the UI; no error message is shown and the backup in the error state still exists

Expected behavior

A backup in the error state should be deletable.

Environment

  • Longhorn version: longhorn-manager:master-head
  • Installation method : Kubectl
  • Kubernetes distro and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
  • Underlying Infrastructure : AWS
  • Number of Longhorn volumes in the cluster: 1
@chriscchien chriscchien added kind/bug severity/3 Function working but has a major issue w/ workaround labels Feb 14, 2022
@innobead innobead added this to the v1.3.0 milestone Feb 14, 2022
@innobead
Member

innobead commented Feb 14, 2022

cc @jenting @kaxing

@cchien816 have you seen any logs from the longhorn managers about the error that causes the deletion to fail?

@innobead innobead added the area/volume-backup-restore Volume backup restore label Feb 14, 2022
@chriscchien
Contributor Author

chriscchien commented Feb 14, 2022

Attached are the longhorn-manager logs captured while I was testing.
longhorn-manager-z62fd_longhorn-manager.log
longhorn-manager-pzxq7_longhorn-manager.log
longhorn-manager-4nxh9_longhorn-manager.log

The log below, from longhorn-manager-z62fd_longhorn-manager.log, may be related to deleting the error-state backup:

ime="2022-02-14T09:55:04Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e4fec19a39ee4065" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-bbdf99ada05a47fd.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:04Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-bbdf99ada05a47fd.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:06 pm | time="2022-02-14T09:55:06Z" level=error msg="Error deleting remote backup" backup=backup-e4fec19a39ee4065 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:06Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:06 pm | E0214 09:55:06.662131 1 backup_controller.go:151] longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition
Mon, Feb 14 2022 5:55:06 pm | , stderr, time="2022-02-14T09:55:06Z" level=error msg="failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition"
Mon, Feb 14 2022 5:55:06 pm | , error exit status 1
Mon, Feb 14 2022 5:55:06 pm | time="2022-02-14T09:55:06Z" level=warning msg="Dropping Longhorn backup longhorn-system/backup-e4fec19a39ee4065 out of the queue" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e4fec19a39ee4065: error deleting backup s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/backupstore?backup=backup-e4fec19a39ee4065&volume=vol1], output failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\n, stderr, time=\"2022-02-14T09:55:06Z\" level=error msg=\"failed lock backupstore/volumes/01/9c/vol1/locks/lock-5f2900d5b1454df1.lck type 2 acquisition\"\n, error exit status 1" node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:24 pm | time="2022-02-14T09:55:24Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:55:24 pm | 1.163.230.77 - - [14/Feb/2022:09:55:24 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:24 pm | 1.163.230.77 - - [14/Feb/2022:09:55:24 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:24 pm | time="2022-02-14T09:55:24Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139
Mon, Feb 14 2022 5:55:47 pm | time="2022-02-14T09:55:47Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:55:47 pm | 1.163.230.77 - - [14/Feb/2022:09:55:47 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:47 pm | 1.163.230.77 - - [14/Feb/2022:09:55:47 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:55:54 pm | time="2022-02-14T09:55:54Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139
Mon, Feb 14 2022 5:56:04 pm | time="2022-02-14T09:56:04Z" level=debug msg="Forwarding request to 10.42.0.34:9500"
Mon, Feb 14 2022 5:56:04 pm | 1.163.230.77 - - [14/Feb/2022:09:56:04 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:56:04 pm | 1.163.230.77 - - [14/Feb/2022:09:56:04 +0000] "POST /v1/backupvolumes/vol1?action=backupDelete HTTP/1.1" 200 546 "https://54.187.244.166/k8s/clusters/c-m-998ccn8t/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
Mon, Feb 14 2022 5:56:24 pm | time="2022-02-14T09:56:24Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=vol1 controller=longhorn-backup-volume node=ip-10-251-3-139


@derekbit
Member

@cchien816
Do you have longhorn-admission-webhook and longhorn-conversion-webhook logs? I'd like to see if it is related to webhooks or caused by other issues.

@khushboo-rancher
Contributor

Raising the severity to 1 as this is a regression.

@khushboo-rancher khushboo-rancher added priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) kind/regression Regression which has worked before and removed severity/3 Function working but has a major issue w/ workaround labels Feb 14, 2022
@chriscchien
Contributor Author

chriscchien commented Feb 15, 2022

Update: This issue may be related to scheduling.
In my local environment, the error-state backup disappeared (was deleted) after a long period of time.

@derekbit
Member

derekbit commented Feb 15, 2022

After reverting the scheduling commits in #3083, the deletion works as expected.
I will come back to check again after the scheduling issue is fixed.

cc @PhanLe1010

@PhanLe1010
Contributor

I think this is not related to the scheduling issue #3626, so the PR for that issue doesn't fix this one.

I can see from the logs that the Longhorn UI set the deletion timestamp on the backup CR. The backup controller then tried to delete the backup from the backup target but failed with the error failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition. After 4 failed attempts (more than maxretries, which is 3), the backup controller gave up and forgot this event.

Action item:

  • Increasing the resync period helps surface a new hidden issue here: why does deleting the backup fail with the error failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition? cc @jenting
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=testvol1 controller=longhorn-backup-volume node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:36Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:36Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:36Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-665c15f8463f4a06.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:38Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:38Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:38Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:38Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-316920feea8f4831.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rlvlb longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:38 +0000] "GET /metrics HTTP/1.1" 200 4807 "" "Prometheus/2.16.0"
[longhorn-manager-wcpzc longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:39 +0000] "GET /metrics HTTP/1.1" 200 4748 "" "Prometheus/2.16.0"
[longhorn-manager-rxqdv longhorn-manager] 10.42.3.57 - - [15/Feb/2022:22:32:40 +0000] "GET /metrics HTTP/1.1" 200 5008 "" "Prometheus/2.16.0"
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:40Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:40Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:40Z" level=warning msg="Error syncing Longhorn backup longhorn-system/backup-e6aee84a44d24622" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:40Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-59d0e333de734b27.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-wcpzc longhorn-manager] time="2022-02-15T22:32:41Z" level=debug msg="Trigger sync backup target" component=backup-store-timer controller=longhorn-setting interval=5s node=phan-v117-worker-1
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:41Z" level=info msg="Found 1 backups in the backup target that do not exist in the cluster and need to be pulled" backupVolume=testvol1 controller=longhorn-backup-volume node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:42Z" level=error msg="Error deleting remote backup" backup=backup-e6aee84a44d24622 controller=longhorn-backup error="error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:42Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2
[longhorn-manager-rxqdv longhorn-manager] E0215 22:32:42.783506       1 backup_controller.go:151] longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition
[longhorn-manager-rxqdv longhorn-manager] , stderr, time="2022-02-15T22:32:42Z" level=error msg="failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition"
[longhorn-manager-rxqdv longhorn-manager] , error exit status 1
[longhorn-manager-rxqdv longhorn-manager] time="2022-02-15T22:32:42Z" level=warning msg="Dropping Longhorn backup longhorn-system/backup-e6aee84a44d24622 out of the queue" controller=longhorn-backup error="longhorn-backup: fail to sync backup longhorn-system/backup-e6aee84a44d24622: error deleting backup s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [backup rm s3://backupbucket@us-east-1/?backup=backup-e6aee84a44d24622&volume=testvol1], output failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\n, stderr, time=\"2022-02-15T22:32:42Z\" level=error msg=\"failed lock backupstore/volumes/c5/7e/testvol1/locks/lock-1c6b8b3927194341.lck type 2 acquisition\"\n, error exit status 1" node=phan-v117-worker-2

@jenting
Contributor

jenting commented Feb 18, 2022

I think it's because the backup was still in progress when it was cancelled, so the backup lock still existed on the remote backup target.
When the deletion event comes in, longhorn-engine tries to acquire a lock but fails because the condition below does not pass.

We can only acquire a lock if there is no unexpired (1) lock of a different
type (2) that has priority (3).

  1. Locks are self-expiring: once the current time is after lock.serverTime + LOCK_DURATION, we no longer need to consider
    this lock as valid.
  2. Backup & Restore locks are mapped to compatible types, while Delete locks are mapped to a different type to be mutually exclusive with the others.
  3. Priority is based on the comparison order, where locks are compared by lock.Acquired, then by lock.serverTime, followed by lock.Name. Acquired locks are always sorted before non-acquired locks.

After 2.5 minutes, the lock acquisition will time out, and the backup operation will run into that error and fail.

Therefore, the cancelled backup operation would release the lock after 2.5 minutes.
Changing the resync period back to 30 seconds allows the deletion operation to acquire the lock after about 5 resyncs.

cc @joshimoo
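
For illustration, here is a minimal Go sketch of the acquisition rules quoted above; the type and field names are assumptions for readability, not the actual longhorn/backupstore implementation.

```go
package backuplock

import (
	"sort"
	"time"
)

// Assumed constant: the lock self-expiry window mentioned above (~2.5 minutes).
const lockDuration = 150 * time.Second

type lockType int

const (
	backupRestoreLock lockType = iota // backup & restore map to a compatible type
	deleteLock                        // delete is mutually exclusive with the others
)

type fileLock struct {
	Name       string
	Type       lockType
	Acquired   bool
	ServerTime time.Time
}

// expired implements rule (1): a lock is only valid until ServerTime + lockDuration.
func (l fileLock) expired(now time.Time) bool {
	return now.After(l.ServerTime.Add(lockDuration))
}

// canAcquire implements rules (2) and (3): our lock can be acquired only if no
// unexpired lock of a different type sorts ahead of it in priority order
// (Acquired first, then ServerTime, then Name).
func canAcquire(ours fileLock, existing []fileLock, now time.Time) bool {
	locks := append([]fileLock{ours}, existing...)
	sort.Slice(locks, func(i, j int) bool {
		if locks[i].Acquired != locks[j].Acquired {
			return locks[i].Acquired // acquired locks always sort first
		}
		if !locks[i].ServerTime.Equal(locks[j].ServerTime) {
			return locks[i].ServerTime.Before(locks[j].ServerTime)
		}
		return locks[i].Name < locks[j].Name
	})
	for _, l := range locks {
		if l.Name == ours.Name {
			return true // nothing with higher priority blocks us
		}
		if !l.expired(now) && l.Type != ours.Type {
			return false // an unexpired lock of a different type has priority
		}
	}
	return false
}
```

Under these rules, the delete operation's lock request stays blocked while the orphaned backup lock is unexpired, which is why the deletion only succeeds once the roughly 2.5-minute window has passed.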

@derekbit
Member

derekbit commented Feb 22, 2022

After discussing with @jenting, we need to investigate a few points first:

  • Why was the backup job in the error state after deleting the replicas?
    Deleting the replicas used for the backup results in the error state.

  • What does the error state show?
    There are several error states.

    failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [--url 10.42.1.9:10000 backup status backup-65fd1e3b49574b4f --replica tcp://10.42.1.10:10000], output , stderr, time="2022-02-22T07:39:27Z" level=fatal msg="Error querying backup status: Failed to get backup status on tcp://10.42.1.10:10000 for backup-65fd1e3b49574b4f: unknown replica"
    , error exit status 1

Ref: https://github.com/longhorn/longhorn-manager/blob/7bd89abdbae3c406ec56094364cf6c2da1d36d9d/engineapi/backup_monitor.go#L114

  error: cannot find matched snapshot /3d9f8138-d1c5-42b5-a701-b6bb859f81c2 of backup
    backup-70820393831648ef status in longhorn engine
  • After a replica is deleted and the backup becomes errored, what is the error handling in the longhorn-engine backup?
    There is none today. Deleting a replica terminates the backup process immediately, so the backup process has no chance to release the lock.
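
Since the orphaned lock simply expires after about 2.5 minutes, one possible direction (a sketch only, not necessarily what the eventual fix implements) is to keep retrying the remote deletion past the lock window instead of giving up within milliseconds; deleteRemoteBackup is a hypothetical helper standing in for the engine's backup rm call.

```go
package controller

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deleteBackupWithRetry keeps retrying the remote deletion until the stale
// backup lock expires (~2.5 minutes) or a different error occurs.
// deleteRemoteBackup is a hypothetical helper wrapping "longhorn backup rm".
func deleteBackupWithRetry(deleteRemoteBackup func() error) error {
	return wait.PollImmediate(30*time.Second, 3*time.Minute, func() (bool, error) {
		err := deleteRemoteBackup()
		if err == nil {
			return true, nil // deletion succeeded
		}
		if strings.Contains(err.Error(), "failed lock") {
			return false, nil // lock still held by the dead backup; retry later
		}
		return false, err // unrelated errors are surfaced immediately
	})
}
```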

@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 24, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • [ ] Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at backup: retry deletion after failing to acquire the lock longhorn-manager#1232

  • Which areas/issues this PR might have potential impacts on?
    Area backingimage
    Issues

  • [ ] If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • [ ] If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • [ ] If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • [ ] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • [ ] If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • [ ] If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • [ ] ~~If the fix introduces the code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?~~
The compatibility issue is filed at

@PhanLe1010
Contributor

PhanLe1010 commented Feb 28, 2022

I am wondering if we should remove the maxretries limit entirely. If we fail to sync a CR, I think we should retry until we succeed (with the backoff capped at 5 minutes). This is similar to when the CSI attacher fails a NodePublishVolume call: it retries indefinitely until it succeeds, with a maximum 5-minute backoff.

Is there a strong reason to keep the maxretries @joshimoo @shuo-wu ?
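
As a rough sketch of that idea (not the actual longhorn-manager code), the controller could drop the maxRetries check entirely and rely on client-go's exponential rate limiter with the backoff capped at 5 minutes:

```go
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newBackupQueue builds a workqueue whose per-item backoff starts at 5ms,
// doubles on every failure, and is capped at 5 minutes.
func newBackupQueue() workqueue.RateLimitingInterface {
	rateLimiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute)
	return workqueue.NewRateLimitingQueue(rateLimiter)
}

// handleErr requeues failed items forever instead of dropping them after a
// fixed number of retries.
func handleErr(queue workqueue.RateLimitingInterface, err error, key interface{}) {
	if err == nil {
		queue.Forget(key) // success resets the backoff counter
		return
	}
	queue.AddRateLimited(key) // no maxRetries comparison: always retry with backoff
}
```

With this setup, a backup CR whose remote deletion keeps failing on the stale lock would simply be retried with growing delays until the lock expires, instead of being dropped after the third attempt.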

@shuo-wu
Contributor

shuo-wu commented Mar 1, 2022

I guess at the beginning Longhorn just followed the basic pattern of Kubernetes controllers, which typically set a max retry limit.
Your consideration is fair, and some Kubernetes controllers do not set a max retry limit either. But we need to consider whether removed CRs or problematic CRs can be handled correctly rather than staying in the queue forever.

@PhanLe1010
Contributor

PhanLe1010 commented Mar 1, 2022

For the removed CRs, they will be dropped out of the queue (link).

Problematic CRs should be divided into 2 categories:

  • Recoverable errors (API error, failed precondition, ...): these should go back into the queue with a backoff limit
  • Non-recoverable errors (non-existing snapshot, bad naming format): these can be dropped out of the queue immediately, i.e., print the error message and drop the item

What do you think? @shuo-wu, @jenting, @derekbit
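
A minimal sketch of that split, assuming a hypothetical error classification rather than any existing longhorn-manager type:

```go
package controller

import (
	"errors"
	"log"

	"k8s.io/client-go/util/workqueue"
)

// errNonRecoverable is an illustrative sentinel for errors that retrying
// cannot fix (e.g. the snapshot no longer exists); it is not an existing
// longhorn-manager type.
var errNonRecoverable = errors.New("non-recoverable")

// handleSyncError assumes errors are recoverable by default and requeues them
// with backoff; only explicitly classified non-recoverable errors are logged
// and dropped immediately.
func handleSyncError(queue workqueue.RateLimitingInterface, key interface{}, err error) {
	switch {
	case err == nil:
		queue.Forget(key)
	case errors.Is(err, errNonRecoverable):
		log.Printf("dropping %v out of the queue: %v", key, err)
		queue.Forget(key)
	default:
		queue.AddRateLimited(key) // recoverable by default
	}
}
```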

@shuo-wu
Contributor

shuo-wu commented Mar 2, 2022

Sounds good. But the problem is, it's hard to figure out non-recoverable errors for each controller.

@PhanLe1010
Contributor

Can we take a fail-safe approach that assumes all errors are recoverable by default unless proven otherwise?
Btw, requeuing by calling AddRateLimited() is not expensive; the object will be reconsidered every 5 minutes at most.

@chriscchien chriscchien self-assigned this Mar 10, 2022
@shuo-wu
Contributor

shuo-wu commented Mar 11, 2022

@PhanLe1010
Contributor

Good observation! But the threshold is set to 0 by default, which means retrying forever: https://github.com/kubernetes-csi/external-provisioner/blob/32c27a57ce311cf7b0fedefd6cc850845a41e341/cmd/csi-provisioner/csi-provisioner.go#L359

@shuo-wu
Contributor

shuo-wu commented Mar 14, 2022

OK. Then I am fine with your idea.

@PhanLe1010
Contributor

cc @jenting What do you think about the above discussion?

@jenting
Contributor

jenting commented Mar 15, 2022

I agree with the above approach.

I have a question: a non-recoverable error will be dropped out of the queue immediately. Does that mean the CR will still be reconciled every resync period but simply return during that reconcile? Also, how do we determine whether a Backup CR's error is non-recoverable or not? By what criteria?

@PhanLe1010
Contributor

PhanLe1010 commented Mar 15, 2022

Does that mean the CR will still be reconciled every resync period but simply return during that reconcile?

Yes, the CR will be touched again every resync period (1h or more)

Also, how do we determine whether a Backup CR's error is non-recoverable or not? By what criteria?

Some ideas off the top of my head:

  • By default, we consider errors recoverable
  • Some obvious non-recoverable errors: the source volume doesn't exist; the backup target setting is empty; ...

@jenting
Contributor

jenting commented Mar 16, 2022

Okay, it looks like we need to define different error codes (like the Kubernetes conditions data structure) in the Backup CR to indicate whether an error is recoverable or non-recoverable.
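
For illustration only (this is not the actual Backup CRD), such a condition-style field might look like the sketch below, where the Recoverable flag is what the controller would use to decide between requeueing with backoff and dropping the item:

```go
package types

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BackupErrorCondition is a hypothetical condition-style record of the last
// sync error; the field names are illustrative, not part of the real CRD.
type BackupErrorCondition struct {
	Type               string      `json:"type"`        // e.g. "SyncError"
	Reason             string      `json:"reason"`      // e.g. "LockAcquisitionFailed", "SnapshotNotFound"
	Message            string      `json:"message"`
	Recoverable        bool        `json:"recoverable"` // drives requeue-vs-drop in the controller
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
}
```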

@innobead
Member

Before we have a way to define different types of errors, as per the above discussion, should we remove the max retry and just keep re-adding to the queue, assuming all errors are recoverable for now?

@derekbit
Member

derekbit commented Mar 31, 2022

We could increase the maxRetries number. The value is 3, and the corresponding retry intervals are 5 ms, 10 ms, and 20 ms, which are pretty short.

How about increasing the value so that the maximum interval becomes, e.g., 30 seconds? @PhanLe1010 @shuo-wu @jenting @innobead
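
For reference (assuming the interval keeps doubling from the 5 ms base as in the first three retries), the n-th retry waits 5 ms × 2^(n-1), so an interval of roughly 30 seconds is only reached around the 14th retry (5 ms × 2^13 ≈ 41 s); maxRetries would need to be raised to about 13 or 14 for the backoff to grow that long.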

@derekbit
Member

derekbit commented Apr 1, 2022

After discussing with @PhanLe1010, we will keep it as is.

@chriscchien
Contributor Author

Verified passed on master-head 20220407.

A backup in the error state can now be deleted.
The e2e test also passed, so closing this ticket.
