[BUG] Backup in error state can not be deleted #3620
Comments
Attached is a log from longhorn-manager captured while I was testing. The log below, in longhorn-manager-z62fd_longhorn-manager.log, may be related to deleting the error-state backup.
@cchien816
Raising the severity to 1 as this is a regression.
Update: This issue may be related to scheduling.
After reverting the scheduling commits in #3083, the deletion works as expected. cc @PhanLe1010
I think this is not related to the scheduling issue #3626, so the PR for that issue doesn't fix this one. I can see from the logs that the Longhorn UI set the deletion timestamp on the backup CR. The backup controller then tried to delete the backup from the backup target but failed to do so with an error. Action item:
I think it's because the backup was still in progress when it was cancelled, so the backup lock still existed on the remote backup target.
Therefore, the cancelled backup operation would only release the lock after 2.5 minutes. cc @joshimoo
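To make the lock behavior concrete, here is a minimal, hypothetical sketch of a lease-style lock on the backup target; the type names and the 150-second expiration are illustrative stand-ins, not the actual backupstore implementation.

```go
package example

import "time"

// lockExpiration mirrors the ~2.5 minutes mentioned above; the real value and
// lock layout live in the backupstore code, not here.
const lockExpiration = 150 * time.Second

// backupLock represents a lock file left on the remote backup target by an
// in-progress (possibly cancelled) backup.
type backupLock struct {
	acquiredAt time.Time
}

// expired reports whether a stale lock can be treated as released.
func (l backupLock) expired(now time.Time) bool {
	return now.Sub(l.acquiredAt) > lockExpiration
}

// canDeleteBackup illustrates why deletion fails right after cancelling:
// it has to wait until the cancelled backup's lock on the target expires.
func canDeleteBackup(l backupLock, now time.Time) bool {
	return l.expired(now)
}
```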
After discussing with @jenting, we need to investigate some points first.
Pre Ready-For-Testing Checklist
* ~~[ ] If the fix introduces the code for backward compatibility, has a separate issue been filed with the label?~~
I am thinking whether we should remove the maxRetries limit entirely. If we fail to sync a CR, I think we should retry until we succeed (with the backoff capped at 5 minutes). This is similar to when the CSI attacher fails to make a NodePublishVolume call: it retries indefinitely until it succeeds, with a maximum 5-minute backoff. Is there a strong reason to keep the maxRetries? @joshimoo @shuo-wu
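As a rough sketch of what that could look like (illustrative only, not Longhorn's actual controller code), client-go's workqueue already provides a per-item exponential rate limiter that can cap the backoff at 5 minutes while retrying a failing key indefinitely:

```go
package example

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// The backoff doubles on every failure and is capped at 5 minutes, so a
// failing backup CR keeps being retried until the sync finally succeeds.
var queue = workqueue.NewNamedRateLimitingQueue(
	workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute),
	"backup-example",
)

// handleErr never drops a failing key; it only resets the backoff on success.
func handleErr(err error, key interface{}) {
	if err == nil {
		queue.Forget(key)
		return
	}
	queue.AddRateLimited(key)
}
```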
I guess at the beginning Longhorn just followed the basic pattern of Kubernetes controllers, in which a max retry limit is typically set.
For the removed CRs, they will be dropped out of the queue (link). The problematic CRs should be divided into 2 categories:
Sounds good. But the problem is, it's hard to figure out the non-recoverable errors for each controller.
Can we do a fail-safe approach that assumes all errors are recoverable by default unless proven otherwise?
But there may be a retry threshold set in the CSI provisioner: https://github.com/kubernetes-csi/external-provisioner/blob/32c27a57ce311cf7b0fedefd6cc850845a41e341/vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go#L932-L945
Good observation! But the threshold is set to 0 by default, which means retrying forever: https://github.com/kubernetes-csi/external-provisioner/blob/32c27a57ce311cf7b0fedefd6cc850845a41e341/cmd/csi-provisioner/csi-provisioner.go#L359
OK. Then I am fine with your idea.
cc @jenting What do you think about the above discussion?
I agree with the above approach. I have a question: for a non-recoverable error, the CR will be dropped out of the queue immediately. Does that mean this CR would still be reconciled every resync period but just returned within the resync period? Also, how do we determine whether a Backup CR error is non-recoverable or not? By what criteria?
Yes, the CR will be touched again every resync period (1h or more).
Some ideas off the top of my head:
Okay, it looks like we need to define different error codes (like the Kubernetes conditions data structure) in the Backup CR to indicate whether it's a recoverable or non-recoverable error.
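As one possible illustration of that idea (hypothetical names, not an actual Longhorn API), a condition-like entry on the Backup CR could carry a machine-readable reason plus a flag marking the error as recoverable or not:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BackupErrorCondition mimics the Kubernetes condition data structure; the
// field names and the Recoverable flag are illustrative only.
type BackupErrorCondition struct {
	Type               string      `json:"type"`        // e.g. "Error"
	Status             string      `json:"status"`      // "True" or "False"
	Reason             string      `json:"reason"`      // machine-readable error code
	Message            string      `json:"message"`     // human-readable detail
	Recoverable        bool        `json:"recoverable"` // recoverable vs. non-recoverable
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
}

// shouldRequeue sketches how a controller could use the condition: keep
// retrying recoverable errors, drop non-recoverable ones until the next resync.
func shouldRequeue(c BackupErrorCondition) bool {
	return c.Status == "True" && c.Recoverable
}
```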
Before we have a way to define different types of errors, as per the above discussion, should we remove the max retry and just keep re-adding to the queue, assuming all errors are recoverable for now?
Maybe we can increase the maxRetries number instead. The current value is 3 and the corresponding time intervals are 5ms, 10ms, and 20ms, which are pretty short. How about increasing the value so that the maximum interval becomes, e.g., 30 seconds? @PhanLe1010 @shuo-wu @jenting @innobead
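For a sense of scale, assuming the backoff keeps doubling from the 5 ms base (5 ms, 10 ms, 20 ms, ...), the n-th retry waits roughly 5 ms × 2^(n−1); pushing the maximum interval to about 30 seconds would therefore take around 14 retries (the 13th interval is ≈ 20 s and the 14th ≈ 41 s), well beyond the current value of 3.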
As discussed with @PhanLe1010, keep it as is.
Verified pass on master-head 20220407. Backup in error state can be deleted.
Describe the bug
From e2e case test_backup_status_for_unavailable_replicas
When a backup is in an error state, it cannot be deleted; this behavior makes the e2e case fail.
Can be reproduced on Ubuntu and SLES.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A backup in an error state should be deletable.
Environment