Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMPROVEMENT] System restore should proceed to restore other volumes if restoring one volume keeps failing for a certain time. #5086

Closed
khushboo-rancher opened this issue Dec 16, 2022 · 15 comments
Assignees
Labels
area/system-backup-restore Longhorn system backup restore kind/improvement Request for improvement of existing function severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)
Milestone

Comments

@khushboo-rancher
Copy link
Contributor

Is your improvement request related to a feature? Please describe (馃憤 if you like this request)

If system restore stuck in restoring a volume, it keeps trying to restore the same volume and does not proceed for next volume restore. So, all the subsequent volumes also fail to restore.

Describe the solution you'd like

We can have a smaller time out and after trying for certain time we should proceed to restore the next volume.

To reproduce

Try to restore a volume having backing image in a new cluster which doesn't have the backing image.

@khushboo-rancher khushboo-rancher added severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) kind/improvement Request for improvement of existing function area/system-backup-restore Longhorn system backup restore labels Dec 16, 2022
@innobead
Copy link
Member

I believe this should be managed by the concurrent restore setting now, so the timeout should be after 24 hours? @c3y1huang

@innobead
Copy link
Member

innobead commented Dec 16, 2022

@khushboo-rancher How did you simulate this case? this is a good case.

cc @longhorn/qa

@c3y1huang c3y1huang self-assigned this Dec 16, 2022
@khushboo-rancher
Copy link
Contributor Author

This is actually at stage of rollout, not restore. So, when a system restore is rolling out a backup and creating volumes one by one and one volume is stuck, it never goes to next volume rollout.

I believe, the 24 hours timeout is for actual restoring the data on the volume from backupstore.

I was able to simulate this using backing image volume as backing image volume can't get rollout with current limitation.

@innobead
Copy link
Member

Yeah, you are 100% right 馃憤

I feel the current implementation makes sense from the flow perspective because if something is broken, users need to fix it first to unlock the flow. However, from the user experience, it's not good enough, and probably we can do better like moving forward with all working volume restore, then keeping failed ones stuck there.

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Dec 16, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at:

  • Which areas/issues this PR might have potential impacts on?
    Area system-restore, manager
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@c3y1huang
Copy link
Contributor

c3y1huang commented Dec 20, 2022

@khushboo-rancher has found volume from backing image with PV and PVC also get stuck. Need to fix them too.
Fixed.

@khushboo-rancher
Copy link
Contributor Author

khushboo-rancher commented Dec 22, 2022

Tested with v1.4.0-rc2

I tested with a system backup having 1 backing image volume and other volumes. Tried to restore the backup in a cluster where the backing image didn't exist. The system backup got completed without restoring the volume with backing image.

logs:

ime="2022-12-22T22:54:28Z" level=warning msg="Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found" Volume=volume-bi-1 accessMode=rwo controller=longhorn-system-rollout fromBackup="s3://khushboo-longhorn@us-west-1/?backup=backup-0af87c1de9d847ec&volume=volume-bi-1" frontend=blockdev migratable=false owner=khush-lh-wk2 state=attached volume=volume-bi-1
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolume=volume-bi-1 controller=longhorn-system-rollout
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedCreating: Volume volume-bi-1' Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found"
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'RolloutSkipped: PersistentVolume volume-bi-1' System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1"
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolumeClaim=volume-bi-1 controller=longhorn-system-rollout

longhorn-system-rollout-system-backup-6-zw2bv_longhorn-system-rollout-system-backup-6.log

@c3y1huang Is this as per the expectation?

cc: @innobead

@c3y1huang
Copy link
Contributor

Tested with v1.4.0-rc2

I tested with a system backup having 1 backing image volume and other volumes. Tried to restore the backup in a cluster where the backing image didn't exist. The system backup got completed without restoring the volume with backing image.

logs:

ime="2022-12-22T22:54:28Z" level=warning msg="Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found" Volume=volume-bi-1 accessMode=rwo controller=longhorn-system-rollout fromBackup="s3://khushboo-longhorn@us-west-1/?backup=backup-0af87c1de9d847ec&volume=volume-bi-1" frontend=blockdev migratable=false owner=khush-lh-wk2 state=attached volume=volume-bi-1
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolume=volume-bi-1 controller=longhorn-system-rollout
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedCreating: Volume volume-bi-1' Failed to create item: Volume volume-bi-1: admission webhook \"mutator.longhorn.io\" denied the request: failed to get backing image bi-1: backingimage.longhorn.io \"bi-1\" not found"
time="2022-12-22T22:54:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"SystemRestore\", Namespace:\"longhorn-system\", Name:\"system-backup-6\", UID:\"790125f5-e1e0-4698-970e-4f545cf009b4\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"4562303\", FieldPath:\"\"}): type: 'Warning' reason: 'RolloutSkipped: PersistentVolume volume-bi-1' System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1"
time="2022-12-22T22:54:28Z" level=debug msg="System rollout ignoring item: cannot rollout volume-bi-1 due to missing dependency: Volume volume-bi-1" PersistentVolumeClaim=volume-bi-1 controller=longhorn-system-rollout

longhorn-system-rollout-system-backup-6-zw2bv_longhorn-system-rollout-system-backup-6.log

@c3y1huang Is this as per the expectation?

Yes, the expectation is Longhorn system restore will ignore the Volume, associated PV, and PVC if it is using the backing image. We will include this in doc.

@innobead
Copy link
Member

@c3y1huang Do we have events for that ignoration? Ideally, we should keep unrestorable items keep there for failing until they are resolved by users. However, it's good for now if having events (a notice way for users).

@c3y1huang
Copy link
Contributor

c3y1huang commented Dec 23, 2022

@c3y1huang Do we have events for that ignoration? Ideally, we should keep unrestorable items keep there for failing until they are resolved by users. However, it's good for now if having events (a notice way for users).

We have events for the Volume, PV and PVC.

@innobead
Copy link
Member

Good, so let's have a note in the system backup & restore doc to mention any problematic resource restore will be ignored like volume.

@c3y1huang
Copy link
Contributor

c3y1huang commented Dec 23, 2022

Good, so let's have a note in the system backup & restore doc to mention any problematic resource restore will be ignored like volume.

Included in the prerequisite.

@innobead
Copy link
Member

@roger-ryao Could you continue the testing because @khushboo-rancher is out today? Thanks.

@roger-ryao
Copy link

Verified on v1.4.x-head 20221227

  • longhorn v1.4.x-head (c32192c)
  • longhorn-manager v1.4.x-head (ef97b2b)

Pre-requisite

  1. Create vol-0 with backing image-0
  2. Create vol-0 PV & PVC
  3. Create vol-1 with backing image-1
  4. Create vol-1 PV & PVC
  5. Create a workload, attach to vol-0 & vol-1.
  6. Create vol-2 with backing image-2
  7. Attach vol-2 to the node-1
  8. Create vol-3 without backing image
  9. Attach vol-3 to node-1
  10. Create vol-4 with backing image-4
  11. Create vol-4 PV & PVC
  12. Create vol-5 with backing image-5
  13. Create vol-5 PV & PVC
  14. Create a workload, attach to vol-4 & vol-5.
  15. Create vol-6 without backing image
  16. Attach vol-6 to node-1
  17. Setup backup target and backup target credential secret
  18. Create longhorn system backup

The test steps

  1. Upload backing image-4, 5, 6 to Longhorn
  2. Restore System on another Cluster

Result Passed

  1. The system backup got completed without restoring the volume with backing image.

Screenshot_20221227_143505
Screenshot_20221227_143525
supportbundle_6b919ba5-cb98-4d8b-bdfc-e3a3f0639cef_2022-12-27T06-26-23Z.zip

Ref.
https://longhorn.io/docs/1.4.0/advanced-resources/system-backup-restore/restore-longhorn-system/#prerequisite

@roger-ryao
Copy link

Verified on master-head 20221227

  • longhorn master-head (62998ad)
  • longhorn-manager master-head (90c73e7)

The test steps
#5086 (comment)

Result Passed

  1. The system backup got completed without restoring the volume with backing image.
    Screenshot_20221227_162811
    Screenshot_20221227_162829
    supportbundle_a98e4d95-7574-4685-aca9-1ccfa2f3b6d6_2022-12-27T08-34-43Z.zip

CC @khushboo-rancher @innobead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/system-backup-restore Longhorn system backup restore kind/improvement Request for improvement of existing function severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)
Projects
None yet
Development

No branches or pull requests

5 participants