
[BUG] Not able to attach a restored volume to a workload when the restore process was interrupted #1270

Closed
sowmyav27 opened this issue Apr 28, 2020 · 18 comments
Assignees
Labels
area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) kind/bug priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade))
Milestone

Comments

@sowmyav27

Describe the bug
A restored volume cannot be attached to a workload when the restore process was interrupted.

To Reproduce

  • Restore from a backup to vol-1.
  • While the restore is in progress, delete a replica.
  • Wait for the volume to reach the "Detached" state after the restore completes. Create a PV/PVC.
  • Deploy a workload and attach the volume vol-1.
  • Error seen (screenshot below; a scripted repro sketch follows it):

Screen Shot 2020-04-28 at 3 28 52 PM
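For reference, a rough scripted form of the reproduction, written the way the longhorn-tests suite drives the API. This is only a sketch: the longhorn client module is the one bundled with longhorn-tests, and the API endpoint, volume size, and backup URL are placeholders to adapt to your environment.

```python
import time

import longhorn  # API client module bundled with the longhorn-tests suite

# Placeholders: adjust to the cluster under test.
LONGHORN_API = "http://longhorn-frontend.longhorn-system/v1"
BACKUP_URL = "<backup URL copied from the Longhorn UI>"

client = longhorn.Client(url=LONGHORN_API)

# Restore from a backup to vol-1 with 3 replicas.
# The size should match the backed-up volume's size.
client.create_volume(name="vol-1", size=str(2 * 1024 ** 3),
                     numberOfReplicas=3, fromBackup=BACKUP_URL)

# While the restore is in progress, delete one replica.
vol = client.by_id_volume("vol-1")
while not any(rs.isRestoring for rs in (vol.restoreStatus or [])):
    time.sleep(1)
    vol = client.by_id_volume("vol-1")
vol.replicaRemove(name=vol.replicas[0].name)

# Wait for the volume to reach the detached state after the restore.
while client.by_id_volume("vol-1").state != "detached":
    time.sleep(5)

# At this point, create the PV/PVC and deploy the workload (e.g. via
# kubectl); before the fix, attaching vol-1 then failed with the error
# shown in the screenshot above.
```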

Expected behavior
The user should be able to attach the volume to a workload/pod successfully.

Note:
A similar issue is seen when, during the restore, one of the nodes is powered down, which triggers a replica rebuild.

Environment:

  • Longhorn version: master-04/28/2020
  • Kubernetes version: 1.17.5
  • Node OS type and version: RKE DO Linux OS cluster
@sowmyav27 sowmyav27 added this to the v1.0.0 milestone Apr 28, 2020
@yasker yasker added area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) labels Apr 29, 2020
@shuo-wu
Contributor

shuo-wu commented Apr 29, 2020

  1. If a replica is deleted during the restoration, a new replica will be rebuilt. But this rebuilt replica won't continue restoring data after the rebuild completes, so it's actually an invalid replica. BUG!
  2. The wrongly rebuilt replica messes up the whole volume (the meta info may be included) when the pod tries to use it, which triggers the error reported above.
  3. Deleting the invalid replica after the restoration completes can bring the volume back. This is a simple workaround (a sketch of it follows this list).
  4. Not sure if there is a similar issue for DR volumes.
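A minimal sketch of the workaround from point 3, assuming the client module bundled with longhorn-tests and that the invalid replica has already been identified (e.g. from the UI, as the replica created during the restore); the replica name below is a placeholder, not a real one.

```python
import longhorn  # client module bundled with longhorn-tests

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")
vol = client.by_id_volume("vol-1")

# Only do this once the restore has finished.
assert not any(rs.isRestoring for rs in (vol.restoreStatus or []))

# List the replicas so the wrongly rebuilt one can be identified; it is
# the replica created during the restore that never restored any data.
for r in vol.replicas:
    print(r.name, r.hostId)

# Placeholder name; substitute the invalid replica identified above.
vol.replicaRemove(name="restore-1-r-xxxxxxxx")
```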

Possible solution:
Disable replica rebuild for restoring volumes.

Actually, I am worried about what will happen if a replica is deleted and rebuilt while users keep writing data into the volume. I suspect there is a flaw/gap between the rebuild completing and the replica starting to receive writes in that case. If so, it may lead to data inconsistency.

@yasker
Member

yasker commented Apr 29, 2020

@shuo-wu replica rebuild and data writing should be fine. When we do a replica rebuild, there is a lock that prevents any writes until we've done the snapshot. You can try it too.

Regarding the replica rebuild during restoration, we can disable rebuild for restoring volumes for now. Though for the new replica, it should start the restore process, not the rebuild process. The same applies to the DR volume.

@shuo-wu
Contributor

shuo-wu commented Apr 30, 2020

OK. I checked the implementation. Replica rebuild with concurrent data writing looks fine.

I will disable the rebuild for this case in the manager part now. But when will we refactor/fix the rebuild?

@yasker
Member

yasker commented Apr 30, 2020

@shuo-wu can you check and give me an estimate of how long it will take to fix the rebuild during restoration? We can just let the new replica do a restoration rather than a rebuild.

@shuo-wu
Contributor

shuo-wu commented Apr 30, 2020

Maybe 3~5 days (review not included). I think it depends on the complexity of the DR volume part.

@yasker
Member

yasker commented Apr 30, 2020

@shuo-wu OK, let's stop the rebuilding on failure during restoration or DR. Can you file a bug to track the rebuild during restoration as an enhancement? Thanks.

@khushboo-rancher
Contributor

Verified in master - 05-04-2020

If there is any interruption while restoring, a replica rebuild doesn't get triggered and the volume later gets attached to a pod successfully.

Steps for verification (a test-style sketch follows Scenario 2):
Scenario 1:

  1. Restored a volume (3 replicas).
  2. Deleted one replica while the restore was in progress.
  3. A new replica didn't get rebuilt; the volume came up with 2 replicas.
  4. Created a PV/PVC.
  5. Attached to a pod successfully; the replica was then rebuilt and the data matched across replicas.

Scenario 2:

  1. Restored a volume (3 replicas).
  2. Powered down one node while the restore was in progress.
  3. A new replica didn't get rebuilt; the volume came up with 2 replicas.
  4. Created a PV/PVC.
  5. Attached to a pod successfully; the replica was then rebuilt and the data matched across replicas.
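Under the same client assumptions as the earlier sketch, the key assertions of these scenarios look roughly like this in test form (the volume name is a placeholder; the real automated coverage is in test_rebuild_with_restoration).

```python
import time

import longhorn  # same client-module assumption as the earlier sketch

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")

# After the interrupted restore has finished, no replacement replica
# should have been rebuilt: the volume comes up detached with 2 replicas.
vol = client.by_id_volume("vol-1")
assert vol.state == "detached"
assert len(vol.replicas) == 2

# The rebuild back to 3 replicas only happens once the volume is attached
# (here via the API; in the scenarios above via PV/PVC and a pod).
vol.attach(hostId=vol.replicas[0].hostId)
while len(client.by_id_volume("vol-1").replicas) < 3:
    time.sleep(5)
# Finally, compare the data in the pod against the backup source.
```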

@khushboo-rancher khushboo-rancher removed their assignment May 7, 2020
@yasker
Member

yasker commented May 7, 2020

E2E test has been implemented by Shuo.

@meldafrawi
Contributor

The tests test_rebuild_with_restoration and test_rebuild_with_inc passed for two consecutive runs: longhorn-tests/421 and longhorn-tests/422.

@sowmyav27 sowmyav27 reopened this May 12, 2020
@sowmyav27
Author

sowmyav27 commented May 12, 2020

Reopening this issue -

Validated on the latest master - 05/11/2020

Steps:

  • Restore from a backup to vol-1.
  • While the restore is in progress, power down a worker node.
  • A replica fails and gets deleted (as seen from the UI).
  • The other two replicas have been stuck in the "restoring" state for a long time. In the API, the failed replica still exists and is not removed (see the restoreStatus below; a sketch for watching it follows the JSON):
"restoreStatus": [
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "tcp://10.42.3.11:10015",
          "state": "in_progress"
        },
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "restore-1-r-b5e91be7",
          "state": "in_progress"
        },
        {
          "actions": null,
          "backupURL": "<>",
          "error": "",
          "filename": "volume-snap-1b5e3063-b2da-463e-b5a5-230e73fec6db.img",
          "isRestoring": true,
          "lastRestored": "",
          "links": null,
          "progress": 12,
          "replica": "restore-1-r-75585701",
          "state": "in_progress"
        }
      ],
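For anyone hitting this, a small sketch for watching the restoreStatus shown above and spotting replicas whose progress stops advancing. Same client-module assumption as the earlier sketches; the volume name is inferred from the replica names above and may need adjusting.

```python
import time

import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")

last_progress = {}
while True:
    vol = client.by_id_volume("restore-1")
    status = vol.restoreStatus or []
    for rs in status:
        if rs.state == "in_progress" and last_progress.get(rs.replica) == rs.progress:
            print("replica", rs.replica, "appears stuck at", rs.progress, "%")
        last_progress[rs.replica] = rs.progress
    if not any(rs.isRestoring for rs in status):
        print("restore finished")
        break
    time.sleep(30)
```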

Screen Shot 2020-05-11 at 6 50 01 PM

Logs:
longhorn-support-bundle_eed6ec20-4ae2-4c37-91ae-a01fa2b3605b_2020-05-13T01-31-18Z.zip

@shuo-wu
Contributor

shuo-wu commented May 13, 2020

There are actually 2 sub-cases for the node down:

  1. The restore volume attached node (the engine node) is down.
  2. The replica node of the restore volume is down.

I think what @sowmyav27 encountered is the 1st case. For a regular volume, the volume should be retained if the attached node somehow gets disconnected, hence the above behavior is what Longhorn currently expects. But for a restoring volume, I think Longhorn needs to directly mark the volume as Faulted, since the restoring volume won't be auto-reattached after the node comes back, and the engine process doesn't know how to reuse the incomplete snapshot to continue the full restore either. BTW, Longhorn also needs to mark DR volumes as Faulted for this node-down case.

[Updated] For the 2nd case, the replica on the down node should become failed and the restore can still be completed. After the restore completes, the restored data is correct. (Please use a big backup, e.g., a 5Gi backup, to test this scenario. Otherwise, the restore can finish before Longhorn detects the node down and removes the failed replica.)

I will fix the 1st scenario then.

@shuo-wu
Contributor

shuo-wu commented May 14, 2020

Manual test 1 (a test-style sketch of both manual tests follows Manual test 2):

  1. Enable the auto-salvage feature.
  2. Launch a volume and write some data (e.g., 5Gi of random data).
  3. Create a backup.
  4. Restore to a new volume.
  5. Power off the attached node of the restoring volume.
  6. Wait for the volume to become Faulted and all replicas to fail.
  7. Check the volume condition restore.
  8. Make sure auto-salvage is not triggered.

Manual test 2:

  1. Enable the auto-salvage feature.
  2. Launch a volume and write some data (e.g., 5Gi of random data).
  3. Create a backup.
  4. Restore to a new volume.
  5. Power off a node that contains one replica only. (The engine of the restoring volume shouldn't be on that node.)
  6. Wait for the volume to become Degraded and check the volume condition restore.
  7. Wait for the volume restore to complete. Make sure no replica is rebuilt during the restore.
  8. Check that the volume works fine and verify the restored data.
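A rough, test-style rendering of the two manual tests, under the same client-module assumption as the earlier sketches. Powering off the nodes stays manual, the volume names are placeholders, and the robustness/failedAt fields are assumptions taken from the API objects discussed in this issue, so double-check them against your version.

```python
import time

import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")


def wait_for(predicate, timeout=600, interval=5):
    """Small polling helper (not part of the client)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Manual test 1: power off the attached (engine) node of the restoring
# volume, then expect Faulted with all replicas failed and no auto-salvage
# (i.e. the volume stays Faulted instead of being reattached).
assert wait_for(lambda: client.by_id_volume("restore-1").robustness == "faulted")
vol = client.by_id_volume("restore-1")
assert all(r.failedAt != "" for r in vol.replicas)

# Manual test 2: power off a node holding only one replica, then expect
# Degraded, no replacement replica rebuilt while restoring, and a clean
# finish of the restore. Capture the replica set right after the power-off.
names_before = {r.name for r in client.by_id_volume("restore-2").replicas}
assert wait_for(lambda: client.by_id_volume("restore-2").robustness == "degraded")
assert wait_for(lambda: not any(
    rs.isRestoring for rs in (client.by_id_volume("restore-2").restoreStatus or [])))
assert {r.name for r in client.by_id_volume("restore-2").replicas} <= names_before
# Finally attach the volume and compare the restored data with the backup.
```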

@yasker
Member

yasker commented May 14, 2020

@sowmyav27 Which worker node did you power down? Is it a replica node or the node the volume is attached to?

@sowmyav27
Author

@yasker I deliberately powered down a node where a replica was deployed. I am not sure how to find out which node the volume is attached to. How can we check this?

@yasker
Member

yasker commented May 14, 2020

@yasker I deliberately powered down a node where a replica was deployed. I am not sure how to find out which node the volume is attached to. How can we check this?

See the Attached To field.
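For completeness, the same information can be read from the API. The controllers[0].hostId access below is how the longhorn-tests code reads the attached node, so treat the exact field as an assumption for your version.

```python
import longhorn  # same client-module assumption as the earlier sketches

client = longhorn.Client(url="http://longhorn-frontend.longhorn-system/v1")
vol = client.by_id_volume("restore-1")

# "Attached To" in the UI is the node running the engine/controller,
# which is independent of where the replicas live.
attached_to = vol.controllers[0].hostId if vol.controllers else None
print("attached to:", attached_to)
print("replica nodes:", [r.hostId for r in vol.replicas])
```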

@sowmyav27
Author

Logged bug #1355 to track this issue separately (#1270 (comment)), since otherwise the original issue is considered fixed (#1270 (comment)).

@meldafrawi
Contributor

@shuo-wu

test_rebuild_with_restoration failed in longhorn-tests/457

@meldafrawi
Contributor

test_rebuild_with_restoration passed
