[IMPROVEMENT] System restore should proceed to restore other volumes if restoring one volume keeps failing for a certain time. #5086

khushboo-rancher · 2022-12-16T08:01:27Z

Is your improvement request related to a feature? Please describe (👍 if you like this request)

If system restore gets stuck restoring a volume, it keeps retrying the same volume and does not proceed to the next one. As a result, all subsequent volumes also fail to restore.

Describe the solution you'd like

We could use a shorter timeout: after retrying for a certain time, system restore should proceed to the next volume.
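A minimal sketch of the requested behavior, not Longhorn's actual implementation: each item gets its own deadline, and once the deadline passes the loop records the item and moves on. The names `rollout_with_deadline`, `try_create`, `per_item_timeout`, and `retry_interval` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Iterable, List, Tuple


def rollout_with_deadline(
    items: Iterable[str],
    try_create: Callable[[str], bool],
    per_item_timeout: float = 5.0,
    retry_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Retry each item until it succeeds or its own deadline passes, then move on."""
    created: List[str] = []
    gave_up: List[str] = []
    for name in items:
        deadline = clock() + per_item_timeout
        while True:
            if try_create(name):  # hypothetical callback: True means the item was created
                created.append(name)
                break
            if clock() >= deadline:
                # Stop retrying this item so it cannot block the remaining ones.
                gave_up.append(name)
                break
            sleep(retry_interval)
    return created, gave_up
```

The `clock` and `sleep` parameters are injectable so the loop can be exercised without real waiting.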

To reproduce

Try to restore a volume that has a backing image in a new cluster that doesn't have that backing image.

innobead · 2022-12-16T08:04:13Z

I believe this should be managed by the concurrent restore setting now, so the timeout should be after 24 hours? @c3y1huang

innobead · 2022-12-16T08:04:57Z

@khushboo-rancher How did you simulate this case? This is a good case.

cc @longhorn/qa

khushboo-rancher · 2022-12-16T08:16:36Z

This actually happens at the rollout stage, not the restore stage. When a system restore is rolling out a backup and creating volumes one by one, and one volume gets stuck, it never moves on to the next volume's rollout.

I believe the 24-hour timeout applies to actually restoring the volume data from the backupstore.

I was able to simulate this with a backing image volume, because a backing image volume can't be rolled out under the current limitation.

innobead · 2022-12-16T08:26:54Z

Yeah, you are 100% right 👍

I feel the current implementation makes sense from the flow perspective: if something is broken, users need to fix it first to unblock the flow. However, from a user-experience perspective it's not good enough, and we can probably do better, for example by moving forward with every volume restore that works and leaving only the failed ones stuck.

longhorn-io-github-bot · 2022-12-16T10:07:10Z

Pre Ready-For-Testing Checklist

c3y1huang · 2022-12-20T01:53:32Z

~~@khushboo-rancher has found volume from backing image with PV and PVC also get stuck. Need to fix them too.~~
Fixed.

khushboo-rancher · 2022-12-22T23:22:10Z

Tested with v1.4.0-rc2

I tested with a system backup containing one volume with a backing image plus other volumes. I tried to restore the backup in a cluster where the backing image didn't exist. The system restore completed without restoring the volume with the backing image.

logs:

time="2022-12-22T22:54:28Z" level=warning msg="Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found" Volume=volume-bi-1 accessMode=rwo controller=longhorn-system-rollout fromBackup="s3://khushboo-longhorn@us-west-1/?backup=backup-0af87c1de9d847ec&volume=volume-bi-1" frontend=blockdev migratable=false owner=khush-lh-wk2 state=attached volume=volume-bi-1
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolume=volume-bi-1 controller=longhorn-system-rollout
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedCreating: Volume volume-bi-1' Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found"
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'RolloutSkipped: PersistentVolume volume-bi-1' System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1"
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolumeClaim=volume-bi-1 controller=longhorn-system-rollout

longhorn-system-rollout-system-backup-6-zw2bv_longhorn-system-rollout-system-backup-6.log
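For anyone going through a rollout log like the one above, here is a small sketch that splits one of these `key=value` lines into fields, so failed or ignored items can be filtered by `level`, `msg`, or resource key. It assumes the lines follow the quoting seen above (double-quoted values with `\"` escapes). `parse_logfmt` and `SAMPLE_LINE` are illustrative names, not part of any Longhorn tooling.

```python
import re
from typing import Dict

# One key=value pair: the value is either a double-quoted string
# (which may contain \" escapes) or a run of non-space characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Copied from the rollout log above.
SAMPLE_LINE = (
    r'time="2022-12-22T22:54:28Z" level=debug '
    r'msg="System rollout ignoring item: cannot rollout volume-bi-1 '
    r'due to missing dependency: Volume volume-bi-1" '
    r'PersistentVolume=volume-bi-1 controller=longhorn-system-rollout'
)


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split a key=value log line into a dict, unquoting quoted values."""
    fields: Dict[str, str] = {}
    for key, raw in _PAIR.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


if __name__ == "__main__":
    fields = parse_logfmt(SAMPLE_LINE)
    print(fields["level"], fields["PersistentVolume"], fields["msg"])
```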

@c3y1huang Is this as per the expectation?

cc: @innobead

c3y1huang · 2022-12-23T02:29:37Z

@c3y1huang Is this as per the expectation?

Yes, the expectation is that Longhorn system restore ignores a Volume, along with its associated PV and PVC, if the Volume uses a backing image. We will include this in the doc.
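The skip behavior described here can be sketched roughly as follows. This is an illustration under assumptions, not the longhorn-manager code: items are rolled out in order with dependencies listed before their dependents, a failed create is recorded rather than retried, and anything depending on a failed or skipped item is skipped too. `Item`, `rollout`, `depends_on`, and `create` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple

Item = Tuple[str, str]  # (kind, name), e.g. ("Volume", "volume-bi-1")


def rollout(
    items: List[Item],
    depends_on: Dict[Item, Item],
    create: Callable[[Item], Optional[str]],
) -> Tuple[List[Item], Dict[Item, str], Dict[Item, Item]]:
    """Create items in order, skipping any item whose dependency was not rolled out.

    `create` returns None on success or an error message on failure.
    """
    created: List[Item] = []
    failed: Dict[Item, str] = {}    # item -> error message
    skipped: Dict[Item, Item] = {}  # item -> the dependency that is missing
    for item in items:
        dep = depends_on.get(item)
        if dep is not None and (dep in failed or dep in skipped):
            skipped[item] = dep
            continue
        err = create(item)
        if err is None:
            created.append(item)
        else:
            failed[item] = err
    return created, failed, skipped
```

With a Volume whose create call is rejected, its PV and PVC land in `skipped` while unrelated items are still created, which mirrors the `FailedCreating` and `RolloutSkipped` events in the log.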

innobead · 2022-12-23T02:38:41Z

@c3y1huang Do we have events for the ignored items? Ideally, we should keep unrestorable items there in a failing state until users resolve them. However, it's good enough for now if there are events (a way to notify users).

c3y1huang · 2022-12-23T02:44:28Z

@c3y1huang Do we have events for the ignored items? Ideally, we should keep unrestorable items there in a failing state until users resolve them. However, it's good enough for now if there are events (a way to notify users).

We have events for the Volume, PV and PVC.

innobead · 2022-12-23T03:00:40Z

Good, so let's add a note to the system backup & restore doc mentioning that the restore of any problematic resource, such as a volume, will be ignored.

c3y1huang · 2022-12-23T05:07:25Z

Good, so let's add a note to the system backup & restore doc mentioning that the restore of any problematic resource, such as a volume, will be ignored.

Included in the prerequisite.

innobead · 2022-12-27T01:26:21Z

@roger-ryao Could you continue the testing because @khushboo-rancher is out today? Thanks.

roger-ryao · 2022-12-27T08:37:55Z

Verified on v1.4.x-head 20221227

longhorn v1.4.x-head (c32192c)
longhorn-manager v1.4.x-head (ef97b2b)

Pre-requisite

Create vol-0 with backing image-0
Create vol-0 PV & PVC
Create vol-1 with backing image-1
Create vol-1 PV & PVC
Create a workload, attach to vol-0 & vol-1.
Create vol-2 with backing image-2
Attach vol-2 to the node-1
Create vol-3 without backing image
Attach vol-3 to node-1
Create vol-4 with backing image-4
Create vol-4 PV & PVC
Create vol-5 with backing image-5
Create vol-5 PV & PVC
Create a workload, attach to vol-4 & vol-5.
Create vol-6 without backing image
Attach vol-6 to node-1
Setup backup target and backup target credential secret
Create longhorn system backup

The test steps

Upload backing image-4, 5, 6 to Longhorn
Restore System on another Cluster

Result Passed

The system restore completed without restoring the volume with the backing image.

supportbundle_6b919ba5-cb98-4d8b-bdfc-e3a3f0639cef_2022-12-27T06-26-23Z.zip

Ref.
https://longhorn.io/docs/1.4.0/advanced-resources/system-backup-restore/restore-longhorn-system/#prerequisite

roger-ryao · 2022-12-27T08:42:39Z

Verified on master-head 20221227

longhorn master-head (62998ad)
longhorn-manager master-head (90c73e7)

The test steps
#5086 (comment)

Result Passed

The system restore completed without restoring the volume with the backing image.

supportbundle_a98e4d95-7574-4685-aca9-1ccfa2f3b6d6_2022-12-27T08-34-43Z.zip

CC @khushboo-rancher @innobead

khushboo-rancher added the severity/2 (Function working but has a major issue w/o workaround (a major incident with significant impact)), kind/improvement (Request for improvement of existing function), and area/system-backup-restore (Longhorn system backup restore) labels Dec 16, 2022

c3y1huang self-assigned this Dec 16, 2022

c3y1huang added this to the v1.4.0 milestone Dec 16, 2022

c3y1huang mentioned this issue Dec 16, 2022

fix(system-restore): add log when volume failed to create longhorn/longhorn-manager#1627

Merged

c3y1huang mentioned this issue Dec 19, 2022

[BACKPORT][v1.4.0] fix(system-restore): add log when volume failed to create longhorn/longhorn-manager#1639

Merged

innobead assigned khushboo-rancher Dec 19, 2022

c3y1huang mentioned this issue Dec 20, 2022

fix(system-restore): ignore PV, PVC due to missing Volume longhorn/longhorn-manager#1642

Merged

longhorn deleted a comment from longhorn-io-github-bot Dec 20, 2022

c3y1huang mentioned this issue Dec 21, 2022

[BACKPORT][v1.4.0] fix(system-restore): ignore PV, PVC due to missing Volume longhorn/longhorn-manager#1646

Merged

innobead assigned roger-ryao Dec 27, 2022

roger-ryao closed this as completed Dec 27, 2022
