
[BUG] Not able to attach a restored volume to a workload when the restore process was interrupted #1270

Closed
sowmyav27 opened this issue Apr 28, 2020 · 18 comments
Assignees
Labels
area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) kind/bug priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade))
Milestone

Comments

@sowmyav27

Describe the bug
A restored volume cannot be attached to a workload when the restore process was interrupted.

To Reproduce

  • Restore from a backup to vol-1.
  • While the restore is in progress, delete a replica.
  • Wait for the volume to reach the "Detached" state after the restore completes. Create a PV/PVC.
  • Deploy a workload and attach the volume vol-1.
  • Error seen (screenshot below; a scripted repro sketch follows it):

Screen Shot 2020-04-28 at 3 28 52 PM
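For reference, a rough scripted form of the reproduction, written the way the longhorn-tests suite drives the API. This is only a sketch: the longhorn client module is the one bundled with longhorn-tests, and the API endpoint, volume size, and backup URL are placeholders to adapt to your environment.

```python
import time

import longhorn  # API client module bundled with the longhorn-tests suite

# Placeholders: adjust to the cluster under test.
LONGHORN_API = "http://longhorn-frontend.longhorn-system/v1"
BACKUP_URL = "<backup URL copied from the Longhorn UI>"

client = longhorn.Client(url=LONGHORN_API)

# Restore from a backup to vol-1 with 3 replicas.
# The size should match the backed-up volume's size.
client.create_volume(name="vol-1", size=str(2 * 1024 ** 3),
                     numberOfReplicas=3, fromBackup=BACKUP_URL)

# While the restore is in progress, delete one replica.
vol = client.by_id_volume("vol-1")
while not any(rs.isRestoring for rs in (vol.restoreStatus or [])):
    time.sleep(1)
    vol = client.by_id_volume("vol-1")
vol.replicaRemove(name=vol.replicas[0].name)

# Wait for the volume to reach the detached state after the restore.
while client.by_id_volume("vol-1").state != "detached":
    time.sleep(5)

# At this point, create the PV/PVC and deploy the workload (e.g. via
# kubectl); before the fix, attaching vol-1 then failed with the error
# shown in the screenshot above.
```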

Expected behavior
The user should be able to attach the volume to a workload/pod successfully.

Note:
A similar issue is seen when, during the restore, one of the nodes is powered down, which triggers a replica rebuild.

Environment:

  • Longhorn version: master-04/28/2020
  • Kubernetes version: 1.17.5
  • Node OS type and version: RKE DO Linux OS cluster
@sowmyav27 sowmyav27 added this to the v1.0.0 milestone Apr 28, 2020
@yasker yasker added area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) labels Apr 29, 2020
@shuo-wu
Contributor

shuo-wu commented Apr 29, 2020

  1. If a replica is deleted during the restoration, a new replica will be rebuilt. But this rebuilt replica won't continue restoring data after the rebuild completes, so it's actually an invalid replica. BUG!
  2. The wrongly rebuilt replica messes up the whole volume (the meta info may be included) when the pod tries to use it, which triggers the error reported above.
  3. Deleting the invalid replica after the restoration completes can bring the volume back. This is a simple workaround (a sketch of it follows this list).
  4. Not sure if there is a similar issue for DR volumes.
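A minimal sketch of the workaround from point 3, assuming the client module bundled with longhorn-tests and that the invalid replica has already been identified (e.g. from the UI, as the replica created during the restore); the replica name below is a placeholder, not a real one.

```python
import longhorn  # client module bundled with longhorn-tests

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")
vol = client.by_id_volume("vol-1")

# Only do this once the restore has finished.
assert not any(rs.isRestoring for rs in (vol.restoreStatus or []))

# List the replicas so the wrongly rebuilt one can be identified; it is
# the replica created during the restore that never restored any data.
for r in vol.replicas:
    print(r.name, r.hostId)

# Placeholder name; substitute the invalid replica identified above.
vol.replicaRemove(name="restore-1-r-xxxxxxxx")
```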

Possible solution:
Disable replica rebuild for restoring volumes.

Actually, I am worried about what will happen if a replica is deleted and rebuilt while users keep writing data into the volume. I suspect there is a flaw/gap between the rebuild completing and the replica starting to receive writes in that case. If so, it may lead to data inconsistency.

@yasker
Member

yasker commented Apr 29, 2020

@shuo-wu replica rebuild and data writing should be fine. When we do a replica rebuild, there is a lock that prevents any writes until we've done the snapshot. You can try it too.

Regarding the replica rebuild during restoration, we can disable rebuild for restoring volumes for now. Though for the new replica, it should start the restore process, not the rebuild process. The same applies to the DR volume.

@shuo-wu
Contributor

shuo-wu commented Apr 30, 2020

OK. I checked the implementation. Replica rebuild with concurrent data writing looks fine.

I will disable the rebuild for this case in the manager part now. But when will we refactor/fix the rebuild?

@yasker
Member

yasker commented Apr 30, 2020

@shuo-wu can you check and give me an estimate of how long it will take to fix the rebuild during restoration? We can just let the new replica do a restoration rather than a rebuild.

@shuo-wu
Contributor

shuo-wu commented Apr 30, 2020

Maybe 3~5 days (review not included). I think it depends on the complexity of the DR volume part.

@yasker
Member

yasker commented Apr 30, 2020

@shuo-wu OK, let's stop the rebuilding on failure during restoration or DR. Can you file a bug to track the rebuild during restoration as an enhancement? Thanks.

@khushboo-rancher
Contributor

Verified in master - 05-04-2020

If there is any interruption while restoring, a replica rebuild doesn't get triggered and the volume later gets attached to a pod successfully.

Steps for verification (a test-style sketch follows Scenario 2):
Scenario 1:

  1. Restored a volume (3 replicas).
  2. Deleted one replica while the restore was in progress.
  3. A new replica didn't get rebuilt; the volume came up with 2 replicas.
  4. Created a PV/PVC.
  5. Attached to a pod successfully; the replica was then rebuilt and the data matched across replicas.

Scenario 2:

  1. Restored a volume (3 replicas).
  2. Powered down one node while the restore was in progress.
  3. A new replica didn't get rebuilt; the volume came up with 2 replicas.
  4. Created a PV/PVC.
  5. Attached to a pod successfully; the replica was then rebuilt and the data matched across replicas.
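Under the same client assumptions as the earlier sketch, the key assertions of these scenarios look roughly like this in test form (the volume name is a placeholder; the real automated coverage is in test_rebuild_with_restoration).

```python
import time

import longhorn  # same client-module assumption as the earlier sketch

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")

# After the interrupted restore has finished, no replacement replica
# should have been rebuilt: the volume comes up detached with 2 replicas.
vol = client.by_id_volume("vol-1")
assert vol.state == "detached"
assert len(vol.replicas) == 2

# The rebuild back to 3 replicas only happens once the volume is attached
# (here via the API; in the scenarios above via PV/PVC and a pod).
vol.attach(hostId=vol.replicas[0].hostId)
while len(client.by_id_volume("vol-1").replicas) < 3:
    time.sleep(5)
# Finally, compare the data in the pod against the backup source.
```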

@khushboo-rancher khushboo-rancher removed their assignment May 7, 2020
@yasker
Member

yasker commented May 7, 2020

E2E test has been implemented by Shuo.

@meldafrawi
Contributor

The tests test_rebuild_with_restoration and test_rebuild_with_inc passed for two consecutive runs: longhorn-tests/421 and longhorn-tests/422.

@sowmyav27 sowmyav27 reopened this May 12, 2020
@sowmyav27
Author

sowmyav27 commented May 12, 2020

Reopening this issue -

Validated on the latest master - 05/11/2020

Steps:

  • Restore from a backup to vol-1.
  • While the restore is in progress, power down a worker node.
  • A replica fails and gets deleted (as seen from the UI).
  • The other two replicas have been stuck in the "restoring" state for a long time. In the API, the failed replica still exists and is not removed (see the restoreStatus below; a sketch for watching it follows the JSON):
"restoreStatus": [
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "tcp://10.42.3.11:10015",
          "state": "in_progress"
        },
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "restore-1-r-b5e91be7",
          "state": "in_progress"
        },
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "restore-1-r-75585701",
          "state": "in_progress"
        }
      ],
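For anyone hitting this, a small sketch for watching the restoreStatus shown above and spotting replicas whose progress stops advancing. Same client-module assumption as the earlier sketches; the volume name is inferred from the replica names above and may need adjusting.

```python
import time

import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")

last_progress = {}
while True:
    vol = client.by_id_volume("restore-1")
    status = vol.restoreStatus or []
    for rs in status:
        if rs.state == "in_progress" and last_progress.get(rs.replica) == rs.progress:
            print("replica", rs.replica, "appears stuck at", rs.progress, "%")
        last_progress[rs.replica] = rs.progress
    if not any(rs.isRestoring for rs in status):
        print("restore finished")
        break
    time.sleep(30)
```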

Screen Shot 2020-05-11 at 6 50 01 PM

Logs:
longhorn-support-bundle_eed6ec20-4ae2-4c37-91ae-a01fa2b3605b_2020-05-13T01-31-18Z.zip

@shuo-wu
Contributor

shuo-wu commented May 13, 2020

There are actually 2 sub-cases for the node down:

  1. The restore volume attached node (the engine node) is down.
  2. The replica node of the restore volume is down.

I think what @sowmyav27 encountered is the 1st case. For a regular volume, the volume should be retained if the attached node somehow gets disconnected, hence the above behavior is what Longhorn currently expects. But for a restoring volume, I think Longhorn needs to directly mark the volume as Faulted, since the restoring volume won't be auto-reattached after the node comes back, and the engine process doesn't know how to reuse the incomplete snapshot to continue the full restore either. BTW, Longhorn also needs to mark DR volumes as Faulted for this node-down case.

[Updated] For the 2nd case, the replica on the down node should become failed and the restore can still be completed. After the restore completes, the restored data is correct. (Please use a big backup, e.g., a 5Gi backup, to test this scenario. Otherwise, the restore can finish before Longhorn detects the node down and removes the failed replica.)

I will fix the 1st scenario then.

@shuo-wu
Contributor

shuo-wu commented May 14, 2020

Manual test 1 (a test-style sketch of both manual tests follows Manual test 2):

  1. Enable the auto-salvage feature.
  2. Launch a volume and write some data (e.g., 5Gi of random data).
  3. Create a backup.
  4. Restore to a new volume.
  5. Power off the attached node of the restoring volume.
  6. Wait for the volume to become Faulted and all replicas to fail.
  7. Check the volume condition restore.
  8. Make sure auto-salvage is not triggered.

Manual test 2:

  1. Enable the auto-salvage feature.
  2. Launch a volume and write some data (e.g., 5Gi of random data).
  3. Create a backup.
  4. Restore to a new volume.
  5. Power off a node that contains one replica only. (The engine of the restoring volume shouldn't be on that node.)
  6. Wait for the volume to become Degraded and check the volume condition restore.
  7. Wait for the volume restore to complete. Make sure no replica is rebuilt during the restore.
  8. Check that the volume works fine and verify the restored data.
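A rough, test-style rendering of the two manual tests, under the same client-module assumption as the earlier sketches. Powering off the nodes stays manual, the volume names are placeholders, and the robustness/failedAt fields are assumptions taken from the API objects discussed in this issue, so double-check them against your version.

```python
import time

import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")


def wait_for(predicate, timeout=600, interval=5):
    """Small polling helper (not part of the client)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Manual test 1: power off the attached (engine) node of the restoring
# volume, then expect Faulted with all replicas failed and no auto-salvage
# (i.e. the volume stays Faulted instead of being reattached).
assert wait_for(lambda: client.by_id_volume("restore-1").robustness == "faulted")
vol = client.by_id_volume("restore-1")
assert all(r.failedAt != "" for r in vol.replicas)

# Manual test 2: power off a node holding only one replica, then expect
# Degraded, no replacement replica rebuilt while restoring, and a clean
# finish of the restore. Capture the replica set right after the power-off.
names_before = {r.name for r in client.by_id_volume("restore-2").replicas}
assert wait_for(lambda: client.by_id_volume("restore-2").robustness == "degraded")
assert wait_for(lambda: not any(
    rs.isRestoring for rs in (client.by_id_volume("restore-2").restoreStatus or [])))
assert {r.name for r in client.by_id_volume("restore-2").replicas} <= names_before
# Finally attach the volume and compare the restored data with the backup.
```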

@yasker
Member

yasker commented May 14, 2020

@sowmyav27 Which worker node did you power down? Is it a replica node or the node the volume is attached to?

@sowmyav27
Author

@yasker I deliberately powered down a node where a replica was deployed. I am not sure how to find out which node the volume is attached to. How can we check this?

@yasker
Member

yasker commented May 14, 2020

@yasker I deliberately powered down a node where a replica was deployed. I am not sure how to find out which node the volume is attached to. How can we check this?

See the Attached To field.
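For completeness, the same information can be read from the API. The controllers[0].hostId access below is how the longhorn-tests code reads the attached node, so treat the exact field as an assumption for your version.

```python
import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")
vol = client.by_id_volume("restore-1")

# "Attached To" in the UI is the node running the engine/controller,
# which is independent of where the replicas live.
attached_to = vol.controllers[0].hostId if vol.controllers else None
print("attached to:", attached_to)
print("replica nodes:", [r.hostId for r in vol.replicas])
```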

@sowmyav27
Author

Logged bug #1355 to track this issue separately (#1270 (comment)), since otherwise the original issue is considered fixed (#1270 (comment)).

@meldafrawi
Contributor

@shuo-wu

test_rebuild_with_restoration failed in longhorn-tests/457

@meldafrawi
Contributor

test_rebuild_with_restoration passed
