[IMPROVEMENT] System restore should proceed to restore other volumes if restoring one volume keeps failing for a certain time. #5086
Comments
I believe this is now governed by the concurrent restore setting, so the timeout should kick in after 24 hours? @c3y1huang
@khushboo-rancher How did you simulate this case? It's a good one. cc @longhorn/qa
This actually happens at the rollout stage, not the restore stage. When a system restore rolls out a backup, it creates volumes one by one, and if one volume gets stuck, it never moves on to the next volume's rollout. I believe the 24-hour timeout applies to actually restoring the data onto the volume from the backupstore. I was able to simulate this with a backing image volume, because a backing image volume can't be rolled out under the current limitation.
Yeah, you are 100% right 👍 I feel the current implementation makes sense from a flow perspective: if something is broken, users need to fix it first to unblock the flow. From a user-experience perspective, though, it's not good enough, and we can probably do better, for example by moving forward with the restore of all working volumes and leaving the failed ones stuck.
Pre Ready-For-Testing Checklist
Tested with v1.4.0-rc2
I tested with a system backup containing one backing image volume and several other volumes, and tried to restore the backup in a cluster where the backing image didn't exist. The restore of the system backup completed without restoring the volume that uses the backing image.
logs: longhorn-system-rollout-system-backup-6-zw2bv_longhorn-system-rollout-system-backup-6.log
@c3y1huang Is this the expected behavior? cc: @innobead
Yes, the expectation is that Longhorn system restore ignores the Volume, and its associated PV and PVC, if the volume uses a backing image. We will include this in the doc.
@c3y1huang Do we emit events when these resources are ignored? Ideally, unrestorable items should stay in a failing state until users resolve them. However, it's good enough for now if we have events (a way to notify users).
Good, so let's add a note to the system backup & restore doc mentioning that the restore of any problematic resource, such as a volume, will be ignored.
Included in the prerequisite.
@roger-ryao Could you continue the testing because @khushboo-rancher is out today? Thanks.
Verified on v1.4.x-head 20221227 Pre-requisite
The test steps
Result Passed
Verified on master-head 20221227 The test steps Result Passed
Is your improvement request related to a feature? Please describe (👍 if you like this request)
If system restore gets stuck restoring a volume, it keeps retrying the same volume and never proceeds to the next one. As a result, all subsequent volumes also fail to restore.
Describe the solution you'd like
We could use a shorter timeout: after retrying one volume for a certain time, system restore should proceed to restore the next volume.
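To illustrate the idea, here is a minimal sketch of a per-volume timeout loop. It is not Longhorn's implementation: `rollout_with_timeout`, `try_rollout`, `fake_rollout`, the volume names, and the timeout values are all hypothetical, chosen only to show how a stuck volume could be skipped so that the remaining volumes still get a turn.

```python
import time
from typing import Callable, Iterable, List, Tuple


def rollout_with_timeout(
    volumes: Iterable[str],
    try_rollout: Callable[[str], bool],
    timeout_s: float = 5.0,
    retry_interval_s: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Try each volume until it succeeds or its own deadline passes.

    Returns (restored, skipped). A volume that keeps failing is recorded
    as skipped instead of blocking the volumes that come after it.
    """
    restored: List[str] = []
    skipped: List[str] = []
    for name in volumes:
        # Each volume gets its own deadline (hypothetical per-volume timeout).
        deadline = time.monotonic() + timeout_s
        while True:
            if try_rollout(name):
                restored.append(name)
                break
            if time.monotonic() >= deadline:
                skipped.append(name)  # give up on this one, move on
                break
            time.sleep(retry_interval_s)
    return restored, skipped


def fake_rollout(name: str) -> bool:
    # Stand-in for a real rollout attempt: the backing image volume never succeeds.
    return name != "vol-with-backing-image"


restored, skipped = rollout_with_timeout(
    ["vol-a", "vol-with-backing-image", "vol-b"],
    fake_rollout,
    timeout_s=0.2,
    retry_interval_s=0.05,
)
print("restored:", restored)
print("skipped:", skipped)
```

In this sketch the skipped list is returned to the caller, so the failed volumes could still be reported (for example as events) rather than silently dropped.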
To reproduce
Try to restore a volume that uses a backing image in a new cluster that doesn't have that backing image.