
[BUG] DRV stuck in attaching state when restoring is interrupted by rebooting attached node #1328

Closed
khushboo-rancher opened this issue May 12, 2020 · 14 comments
Labels: component/longhorn-manager (Longhorn manager, control plane), kind/bug, reproduce/always (100% reproducible)

khushboo-rancher (Contributor) commented May 12, 2020:

Describe the bug
The DRV (Disaster Recovery Volume) gets stuck in the attaching state when the restoration of the volume is interrupted by rebooting one of the nodes.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the backup page.
  2. Select a backup and click 'Create Disaster Recovery Volume' with 3 replicas (an equivalent manifest sketch follows these steps).
  3. While the restore is in progress, reboot the node the volume is attached to.
  4. The volume gets stuck in the attaching state, and sometimes in a recursive attaching/detaching loop.
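
For reference, the same DR volume can also be created without the UI by applying a Longhorn Volume manifest. This is a minimal sketch, assuming the longhorn.io/v1beta1 CRD schema; the volume name, namespace, and backup URL below are hypothetical placeholders, so verify the exact field names against the Volume CRD in your cluster:

# Sketch only: the names and the backup URL are hypothetical placeholders.
kubectl apply -f - <<EOF
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: drv-longhorn-1
  namespace: longhorn-system
spec:
  # Backup URL in the format reported by the backup store; replace with a real one.
  fromBackup: "s3://backupbucket@us-east-1/backupstore?backup=backup-xxxx&volume=volume-longhorn-1"
  # Standby: true marks this as a Disaster Recovery (standby) volume;
  # note the capitalized field name in the v1beta1 schema.
  Standby: true
  numberOfReplicas: 3
EOF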

[Screenshots: Screen Shot 2020-05-12 at 3.11.15 PM; Screen Shot 2020-05-12 at 3.11.47 PM]

Expected behavior
The DRV should get created with 2 replicas if 1 node is rebooting.

Log

time="2020-05-12T22:10:58Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"
time="2020-05-12T22:10:58Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"
time="2020-05-12T22:11:28Z" level=warning msg="Cannot find the instance status in instance manager instance-manager-r-5ba69d85 for the running instance volume-longhorn-1-r-4b13e37f, will mark the instance as state ERROR"

longhorn-support-bundle_e8d217c6-3283-4664-9018-a84ee301ee96_2020-05-12T22-10-40Z.zip

Environment:

  • Longhorn version: master
  • Kubernetes version: v1.17.5
  • Node OS type and version: Ubuntu 18.04
@khushboo-rancher khushboo-rancher added kind/bug component/longhorn-manager Longhorn manager (control plane) reproduce/always 100% reproducible labels May 12, 2020
@khushboo-rancher khushboo-rancher added this to the v1.0.0 milestone May 12, 2020
khushboo-rancher (Author) commented:

This could be similar to #1188. Once #1188 is fixed, this bug can be retested to see whether it is fixed too.

khushboo-rancher (Author) commented:

Validated after the fix for #1260 was merged; this issue is still reproducible.

shuo-wu (Contributor) commented May 14, 2020:

This issue is almost the same as the reopened issue #1270, and the fix for that issue should handle this one.
Since the manual tests are almost identical (just replace the restore volume with a DR volume), I won't repeat them here.

yasker (Member) commented May 14, 2020:

@khushboo-rancher In this case, did you reboot the node that the engine or a replica was running on?

khushboo-rancher (Author) commented:

@yasker The node where a replica was running was rebooted.

yasker (Member) commented May 14, 2020:

@khushboo-rancher: and it's not the node that the volume was attached to? If so, this is a different case from #1270. IIUC, #1270 only deals with engine detachment.

khushboo-rancher (Author) commented:

Yes, the rebooted node was not the attached one.

khushboo-rancher (Author) commented:

@yasker Ignore my previous comment; I am seeing different behavior in the latest master build. I no longer see the volume stuck in the attaching state:

  1. Rebooted the node the volume was attached to: the volume got stuck in a recursive attaching/detaching state.
  2. Rebooted a node the volume was not attached to: the DRV completed the restore with degraded status.

yasker (Member) commented May 14, 2020:

@khushboo-rancher That's good news. So which case was this issue originally about? I want to make sure we have separate issues for separate cases.

khushboo-rancher (Author) commented:

Originally, the issue was that if a node where a replica was running (not the node the volume was attached to) rebooted while the restore was in progress, the volume was stuck in the attaching state forever.

Now, the above case no longer has a problem, but if the attached node is rebooted, the volume gets stuck in a recursive attaching/detaching state.

yasker (Member) commented May 15, 2020:

@khushboo-rancher Let's either open a new issue for the rebooting-the-attached-node case or repurpose this issue for it. If we repurpose this issue, please update the first comment to reflect that and make sure it is about rebooting the attached node only. Also mention that the update is due to a comment, e.g. #1328 (comment).

@khushboo-rancher khushboo-rancher changed the title [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a node [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a attached node May 15, 2020
@khushboo-rancher khushboo-rancher changed the title [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting a attached node [BUG] DRV stuck in attaching state when restoring is interrupted by rebooting attached node May 15, 2020
@yasker yasker modified the milestones: v1.0.0, v1.0.1 May 15, 2020
shuo-wu (Contributor) commented May 19, 2020:

Manual test 1:

  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get the md5sum (see the sketch after this list).
  3. Create the 1st backup for the volume.
  4. Create a DR volume from the backup.
  5. Wait for the DR volume to start the initial restore, then immediately reboot the node the DR volume is attached to.
  6. Wait for the DR volume to be detached and then reattached.
  7. Wait for the DR volume restore to complete after the reattachment.
  8. Activate the DR volume and check the data md5sum.
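
The data write and checksum steps (2 and 8) can be scripted. A minimal sketch, assuming a pod named test-pod that mounts the Longhorn volume at /data, and a second pod test-pod-dr that mounts the activated DR volume at the same path (all of these names are hypothetical):

# Step 2: write random data to the volume and record its checksum
# (test-pod and /data are hypothetical names).
kubectl exec test-pod -- sh -c 'dd if=/dev/urandom of=/data/test bs=1M count=100 && sync'
CHECKSUM=$(kubectl exec test-pod -- md5sum /data/test | awk '{print $1}')

# Step 8: after activating the DR volume and mounting it in test-pod-dr,
# recompute the checksum and compare it with the recorded one.
kubectl exec test-pod-dr -- md5sum /data/test | awk '{print $1}' \
  | grep -qx "$CHECKSUM" && echo "md5sum matches" || echo "md5sum MISMATCH"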

Manual test 2:

  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get the md5sum.
  3. Create the 1st backup for the volume.
  4. Create a DR volume from the backup.
  5. Wait for the DR volume to complete the initial restore.
  6. Write more data to the original volume and get the md5sum.
  7. Create the 2nd backup for the volume.
  8. Wait for the DR volume's incremental restore to be triggered, then immediately reboot the node the DR volume is attached to.
  9. Wait for the DR volume to be detached and then reattached (steps 9 and 10 can be observed with the sketch after this list).
  10. Wait for the DR volume restore to complete after the reattachment.
  11. Activate the DR volume and check the data md5sum.
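
Steps 9 and 10 here (and steps 6 and 7 of manual test 1) can be observed directly on the volume custom resource. A minimal sketch, assuming the DR volume is named drv-longhorn-1 and Longhorn runs in the longhorn-system namespace (both hypothetical):

# Watch the volume cycle through detached -> attached after the node reboot.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 -w

# Inspect the state and restore-related status fields until the restore completes.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 -o yaml | grep -iE 'state|restore'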

shuo-wu (Contributor) commented May 19, 2020:

Rebooting the attached node leads to the engine crashing unexpectedly.
Hence this case is similar to #1336, and the fix for that case should work here. (It passed the above tests in my cluster.)

@khushboo-rancher Can you verify if this issue is still reproducible after the PR gets merged?

@yasker yasker modified the milestones: v1.0.1, v1.0.0 May 19, 2020
khushboo-rancher (Author) commented:

Verified in the master build.
Verification: Passed

Verified the reboot scenarios during both the initial and incremental restore of a DR volume. It worked fine (a status-check sketch follows the list).
Scenarios:

  1. Rebooted the attached node while the initial restore was in progress. - The restore paused while the node was rebooting. Once the node came up, the volume went through detaching --> attaching and the restore started again. Data verified.

  2. Rebooted the attached node while the incremental restore was in progress. - The restore paused while the node was rebooting. Once the node came up, the volume went through detaching --> attaching and the restore started again. Data verified.

  3. Rebooted a replica node while the initial restore was in progress. - The restore continued with the healthy replicas on the nodes that stayed up and did not try to rebuild the replica on the rebooted node. Data verified.

  4. Rebooted a replica node while the incremental restore was in progress. - The restore continued with the healthy replicas on the nodes that stayed up and did not try to rebuild the replica on the rebooted node. Data verified.
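
For reference, the behavior in scenarios 3 and 4 (the restore continuing with the remaining healthy replicas) shows up as degraded robustness on the volume. A minimal sketch, again assuming a DR volume named drv-longhorn-1 (hypothetical):

# state should return to attached after the reboot; robustness reports
# degraded while the replica on the rebooted node is down.
kubectl -n longhorn-system get volumes.longhorn.io drv-longhorn-1 \
  -o jsonpath='{.status.state}{" "}{.status.robustness}{"\n"}'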
