
[BUG] DRV stuck in attaching state when restoring is interrupted by rebooting attached node #1328

Closed
khushboo-rancher opened this issue May 12, 2020 · 14 comments
Labels: component/longhorn-manager (Longhorn manager, control plane), kind/bug, reproduce/always (100% reproducible)

khushboo-rancher (Contributor) commented May 12, 2020:

Describe the bug
The DRV (Disaster Recovery Volume) gets stuck in the attaching state when the restoration of the volume is interrupted by rebooting one of the nodes.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the backup page.
  2. Select a backup and click 'Create Disaster Recovery Volume' with 3 replicas (an equivalent manifest sketch follows these steps).
  3. While the restore is in progress, reboot the node the volume is attached to.
  4. The volume gets stuck in the attaching state, and sometimes in a recursive attaching/detaching loop.
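
For reference, the same DR volume can also be created without the UI by applying a Longhorn Volume manifest. This is a minimal sketch, assuming the longhorn.io/v1beta1 CRD schema; the volume name, namespace, and backup URL below are hypothetical placeholders, so verify the exact field names against the Volume CRD in your cluster:

# Sketch only: the names and the backup URL are hypothetical placeholders.
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: drv-longhorn-1
  namespace: longhorn-system
spec:
  # Backup URL in the format reported by the backup store; replace with a real one.
  fromBackup: "s3://backupbucket@us-east-1/backupstore?backup=backup-xxxx&volume=volume-longhorn-1"
  # Standby: true marks this as a Disaster Recovery (standby) volume;
  # note the capitalized field name in the v1beta1 schema.
  Standby: true
  numberOfReplicas: 3
EOF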

[Screenshots: Screen Shot 2020-05-12 at 3.11.15 PM; Screen Shot 2020-05-12 at 3.11.47 PM]

Expected behavior
The DRV should get created with 2 replicas if 1 node is rebooting.

Log

time="2020-05-12T22:10:58Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"
time="2020-05-12T22:10:58Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"
time="2020-05-12T22:11:28Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"

longhorn-support-bundle_e8d217c6-3283-4664-9018-a84ee301ee96_2020-05-12T22-10-40Z.zip

Environment:

  • Longhorn version: master
  • Kubernetes version: v1.17.5
  • Node OS type and version: Ubuntu 18.04
@khushboo-rancher khushboo-rancher added kind/bug component/longhorn-manager Longhorn manager (control plane) reproduce/always 100% reproducible labels May 12, 2020
@khushboo-rancher khushboo-rancher added this to the v1.0.0 milestone May 12, 2020
khushboo-rancher (Author) commented:

This could be similar to #1188. Once #1188 is fixed, this bug can be retested to see whether it is fixed too.

khushboo-rancher (Author) commented:

Validated after the fix for #1260 was merged; this issue is still reproducible.

shuo-wu (Contributor) commented May 14, 2020:

This issue is almost the same as the reopened issue #1270, and the fix for that issue should handle this one.
Since the manual tests are almost identical (just replace the restore volume with a DR volume), I won't repeat them here.

yasker (Member) commented May 14, 2020:

@khushboo-rancher In this case, did you reboot the node that the engine or a replica was running on?

khushboo-rancher (Author) commented:

@yasker The node where a replica was running was rebooted.

yasker (Member) commented May 14, 2020:

@khushboo-rancher: and it's not the node that the volume was attached to? If so, this is a different case from #1270. IIUC, #1270 only deals with engine detachment.

khushboo-rancher (Author) commented:

Yes, the rebooted node was not the attached one.

khushboo-rancher (Author) commented:

@yasker Ignore my previous comment; I am seeing different behavior in the latest master build. I no longer see the volume stuck in the attaching state:

  1. Rebooted the node the volume was attached to: the volume got stuck in a recursive attaching/detaching state.
  2. Rebooted a node the volume was not attached to: the DRV completed the restore with degraded status.

yasker (Member) commented May 14, 2020:

@khushboo-rancher That's good news. So which case was this issue originally about? I want to make sure we have separate issues for separate cases.

khushboo-rancher (Author) commented:

Originally, the issue was that if a node where a replica was running (not the node the volume was attached to) rebooted while the restore was in progress, the volume was stuck in the attaching state forever.

Now, the above case no longer has a problem, but if the attached node is rebooted, the volume gets stuck in a recursive attaching/detaching state.

yasker (Member) commented May 15, 2020:

@khushboo-rancher Let's either open a new issue for the rebooting-the-attached-node case or repurpose this issue for it. If we repurpose this issue, please update the first comment to reflect that and make sure it is about rebooting the attached node only. Also mention that the update is due to a comment, e.g. #1328 (comment).

@khushboo-rancher khushboo-rancher changed the title [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a node [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a attached node May 15, 2020
@khushboo-rancher khushboo-rancher changed the title [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a attached node [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting attached node May 15, 2020
@yasker yasker modified the milestones: v1.0.0, v1.0.1 May 15, 2020
shuo-wu (Contributor) commented May 19, 2020:

Manual test 1:

  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get the md5sum (see the sketch after this list).
  3. Create the 1st backup for the volume.
  4. Create a DR volume from the backup.
  5. Wait for the DR volume to start the initial restore, then immediately reboot the node the DR volume is attached to.
  6. Wait for the DR volume to be detached and then reattached.
  7. Wait for the DR volume restore to complete after the reattachment.
  8. Activate the DR volume and check the data md5sum.
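
The data write and checksum steps (2 and 8) can be scripted. A minimal sketch, assuming a pod named test-pod that mounts the Longhorn volume at /data, and a second pod test-pod-dr that mounts the activated DR volume at the same path (all of these names are hypothetical):

# Step 2: write random data to the volume and record its checksum
# (test-pod and /data are hypothetical names).
kubectl exec test-pod -- sh -c 'dd if=/dev/urandom of=/data/test bs=1M count=100 && sync'
CHECKSUM=$(kubectl exec test-pod -- md5sum /data/test | awk '{print $1}')

# Step 8: after activating the DR volume and mounting it in test-pod-dr,
# recompute the checksum and compare it with the recorded one.
kubectl exec test-pod-dr -- md5sum /data/test | awk '{print $1}' \
  | grep -qx "$CHECKSUM" && echo "md5sum matches" || echo "md5sum MISMATCH"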

Manual test 2:

  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get the md5sum.
  3. Create the 1st backup for the volume.
  4. Create a DR volume from the backup.
  5. Wait for the DR volume to complete the initial restore.
  6. Write more data to the original volume and get the md5sum.
  7. Create the 2nd backup for the volume.
  8. Wait for the DR volume's incremental restore to be triggered, then immediately reboot the node the DR volume is attached to.
  9. Wait for the DR volume to be detached and then reattached (steps 9 and 10 can be observed with the sketch after this list).
  10. Wait for the DR volume restore to complete after the reattachment.
  11. Activate the DR volume and check the data md5sum.
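
Steps 9 and 10 here (and steps 6 and 7 of manual test 1) can be observed directly on the volume custom resource. A minimal sketch, assuming the DR volume is named drv-longhorn-1 and Longhorn runs in the longhorn-system namespace (both hypothetical):

# Watch the volume cycle through detached -> attached after the node reboot.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 -w

# Inspect the state and restore-related status fields until the restore completes.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 -o yaml | grep -iE 'state|restore'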

shuo-wu (Contributor) commented May 19, 2020:

Rebooting the attached node leads to the engine crashing unexpectedly.
Hence this case is similar to #1336, and the fix for that case should work here. (It passed the above tests in my cluster.)

@khushboo-rancher Can you verify if this issue is still reproducible after the PR gets merged?

@yasker yasker modified the milestones: v1.0.1, v1.0.0 May 19, 2020
khushboo-rancher (Author) commented:

Verified in the master build.
Verification: Passed

Verified the reboot scenarios during both the initial and incremental restore of a DR volume. It worked fine (a status-check sketch follows the list).
Scenarios:

  1. Rebooted the attached node while the initial restore was in progress. - The restore paused while the node was rebooting. Once the node came up, the volume went through detaching --> attaching and the restore started again. Data verified.

  2. Rebooted the attached node while the incremental restore was in progress. - The restore paused while the node was rebooting. Once the node came up, the volume went through detaching --> attaching and the restore started again. Data verified.

  3. Rebooted a replica node while the initial restore was in progress. - The restore continued with the healthy replicas on the nodes that stayed up and did not try to rebuild the replica on the rebooted node. Data verified.

  4. Rebooted a replica node while the incremental restore was in progress. - The restore continued with the healthy replicas on the nodes that stayed up and did not try to rebuild the replica on the rebooted node. Data verified.
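
For reference, the behavior in scenarios 3 and 4 (the restore continuing with the remaining healthy replicas) shows up as degraded robustness on the volume. A minimal sketch, again assuming a DR volume named drv-longhorn-1 (hypothetical):

# state should return to attached after the reboot; robustness reports
# degraded while the replica on the rebooted node is down.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 \
  -o jsonpath='{.status.state}{" "}{.status.robustness}{"\n"}'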
