
[BUG]Restore will get stuck if the restore error is not captured #1188

Closed
shuo-wu opened this issue Apr 14, 2020 · 10 comments
Assignee: shuo-wu
Labels: area/v1-data-engine (v1 data engine, iSCSI tgt), component/longhorn-manager (Longhorn manager, control plane), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)


shuo-wu commented Apr 14, 2020

Describe the bug
Currently, some errors raised during a replica restore are not captured by the Longhorn manager and are not recorded in the engine restore status. As a result, there is no way to handle the error and continue the restore.

To Reproduce
Steps to reproduce the behavior:

  1. Set backupstore for Longhorn
  2. Create a regular Longhorn volume with the workload, then write data to the volume.
  3. Create a backup for the volume, then launch a DR volume based on the backup.
  4. Wait for the DR volume's initial restore to complete. Then create a non-empty directory named volume-delta-<last restored backup name>.img for one replica of the DR volume (e.g., mkdir -p <replica data path>/volume-delta-backup-c3edca9817c14f1a.img/dir). This directory will cause the next incremental restoration to fail for that replica (see the shell sketch after this list).
  5. Write data to the regular volume, then create a backup and wait for the DR volume's incremental restoration result.
  6. Check restoreStatus and lastRestoredBackup in the DR volume engine status: restoreStatus is missing and engine.status.lastRestoredBackup is not updated.
  7. Create 2 more backups for the regular volume. The DR volume won't continue restoring them.
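A minimal shell sketch of steps 4 and 6, assuming the default replica data path /var/lib/longhorn/replicas/ on the node hosting the chosen replica and kubectl access to the Longhorn CRs in the longhorn-system namespace; every <...> value is a placeholder and the field paths are assumptions to verify against the running Longhorn version:

    # Step 4: break the next incremental restore for one replica of the DR volume.
    REPLICA_DIR=/var/lib/longhorn/replicas/<dr-volume-replica-directory>
    LAST_BACKUP=<last restored backup name>        # e.g. backup-c3edca9817c14f1a
    mkdir -p "${REPLICA_DIR}/volume-delta-${LAST_BACKUP}.img/dir"

    # Step 6: after writing more data and creating another backup, inspect the
    # DR volume engine object. When the error is dropped, restoreStatus stays
    # empty and lastRestoredBackup is not updated.
    kubectl -n longhorn-system get engines.longhorn.io <dr-volume-engine-name> \
      -o jsonpath='{.status.lastRestoredBackup}{"\n"}'
    kubectl -n longhorn-system get engines.longhorn.io <dr-volume-engine-name> -o yaml | grep -A20 'restoreStatus'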

Expected behavior

  1. The error should be recorded in restoreStatus and engine.status.lastRestoredBackup should be updated.
  2. The replica that fails to restore the data should be marked as failed.

Environment:

  • Longhorn version: v1.0.0
  • Kubernetes version:
  • Node OS type and version:
@shuo-wu added the kind/bug, area/v1-data-engine, and component/longhorn-manager labels on Apr 14, 2020
@shuo-wu self-assigned this on Apr 14, 2020
@yasker added this to the v1.0.0 milestone on Apr 14, 2020
@yasker added the priority/0 and priority/1 labels and removed the priority/0 label on Apr 30, 2020
@yasker changed the title from "[BUG]DR volume may get stuck if there is one replica failing to restore the data" to "[BUG]DR volume will get stuck if there is one replica failing to restore the data" on May 5, 2020
@yasker added the require/auto-e2e-test and automation-engine-required labels on May 6, 2020

shuo-wu commented May 8, 2020

Integration test plan 1:
Test that the DR volume becomes Faulted when the backup data is corrupted.

  1. Set the backupstore.
  2. Create then attach a volume.
  3. Directly create a backup for the volume when there is no data in the volume.
  4. Create a DR volume from the backup.
  5. Write some data to the volume and create the 2nd backup.
  6. Delete some data blocks of the backup once the DR volume starts the incremental restoration. (Do not delete only the blocks created by the 1st backup.)
  7. Wait for the DR volume to become Faulted, then check the result (see the kubectl sketch after this list):
    --> Check if volume.conditions[restore].status == False && volume.conditions[restore].reason == "RestoreFailure".
    --> Check if volume.ready == false
    This test is similar to the 1st test in [BUG] State transitions for a failure in backup restore #1260
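A hedged kubectl sketch of the checks in step 7; the volume name is a placeholder, and the exact condition key and casing in the volume CR should be verified against the running Longhorn version (the volume.ready check is normally done through the Longhorn API client in the integration tests and is not shown here):

    VOL=<dr-volume-name>

    # Expect the restore condition to report status "False" with reason "RestoreFailure".
    kubectl -n longhorn-system get volumes.longhorn.io "${VOL}" \
      -o jsonpath='{.status.conditions}{"\n"}'

    # Expect the volume robustness to be reported as faulted.
    kubectl -n longhorn-system get volumes.longhorn.io "${VOL}" \
      -o jsonpath='{.status.robustness}{"\n"}'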

Integration test 2:
Test that a replica which fails the incremental restore is marked as ERROR.

  1. Set a random backupstore.
  2. Create a regular volume and a backup (with some data).
  3. Create a DR volume from the backup.
  4. Create the 2nd backup for the regular volume.
  5. Wait for the DR volume incremental restore to complete.
  6. Pick one replica of the DR volume and create a non-empty directory named volume-delta-<last restored backup>.img in the replica directory (see the sketch after this list). This operation will lead to an incremental restore failure for that replica later.
  7. Create the 3rd backup for the regular volume.
  8. Wait for the DR volume incremental restore to complete, then check:
    --> Check if the volume is in Degraded state and the replica from step 6 is in ERROR state.
    --> Check if volume.conditions[restore].status == True && volume.conditions[restore].reason == "RestoreInProgress".
  9. See if the DR volume still works fine (e.g., trigger more incremental restores, trigger expansion).
  10. Activate the DR volume and check the data.
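A rough shell sketch of steps 6 and 8, under the same assumptions as the sketch in the reproduce steps above (default replica data path, kubectl access to the Longhorn CRs); all <...> names are placeholders, and spec.failedAt is assumed to be the field the Longhorn manager sets when it marks a replica as failed:

    # Step 6: make the next incremental restore fail for one chosen replica only.
    REPLICA=<dr-volume-replica-name>
    REPLICA_DIR=/var/lib/longhorn/replicas/<directory-of-that-replica>
    LAST_BACKUP=<last restored backup name>
    mkdir -p "${REPLICA_DIR}/volume-delta-${LAST_BACKUP}.img/dir"

    # Step 8: after the 3rd backup triggers another incremental restore, the chosen
    # replica should be failed while the volume stays Degraded and keeps restoring.
    kubectl -n longhorn-system get replicas.longhorn.io "${REPLICA}" \
      -o jsonpath='{.spec.failedAt}{"\n"}'          # non-empty once marked failed
    kubectl -n longhorn-system get volumes.longhorn.io <dr-volume-name> \
      -o jsonpath='{.status.robustness}{"\n"}'      # expect "degraded"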


shuo-wu commented May 13, 2020

The PR for issue #1260 cannot completely fix this issue. The 1st test passes, but the 2nd one will still fail after the fix, since some errors during the incremental restore are not recorded in the restore status, so the Longhorn manager cannot mark the related replica as failed.
The following errors are not recorded in the sync agent server restore status:

  1. Parameter check failures before the restore status is initialized.
  2. Delta file check and cleanup errors before the incremental restoration starts.
  3. File coalesce errors in the post-incremental-restore function.
  4. Replica reload errors in the post-incremental-restore function.
    ...

Besides, there is no integration test covering those corner cases.

The regular restore implementation has a similar issue, too.

If we want to cover those errors in the restore status, we need to determine which errors are fatal for the replica (the Longhorn manager needs to fail the replica based on the restore status error message) and which errors don't affect the replica (the Longhorn manager doesn't need to fail the replica; simply retrying the restore may be enough). This means we need to refactor the whole restore flow, including the incremental restore. I don't think we have time to do that in this release.

Actually, I can fix the specific issue triggered by the reproduce steps, but I don't think that's what we want.


yasker commented May 13, 2020

Sounds like we need to move this out of 1.0.0.

Have we fixed #1328 along with #1260?


yasker commented May 13, 2020

@shuo-wu Can you assess the impact of missing the fix for this issue? E.g., in which cases will the user have trouble, what's the workaround, etc.


shuo-wu commented May 14, 2020

No, #1328 will be fixed along with other issues later.
The fix for #1260 only handles part of the restore failure cases. We will record all known cases and track the fixes in #1337.

@yasker modified the milestones: v1.0.0, v1.0.1 on May 15, 2020
@shuo-wu changed the title from "[BUG]DR volume will get stuck if there is one replica failing to restore the data" to "[BUG]Restore will get stuck if the restore error is not captured" on Jun 12, 2020

yasker commented Jul 2, 2020

Code merged, need the pre-merge checklist.


shuo-wu commented Jul 6, 2020

Pre-merged Checklist


khushboo-rancher commented Jul 9, 2020

Tried reproducing the issue from #1188 (comment):

  1. On v1.0.0, the DR volume got stuck in Restore in progress forever; it can't be activated.
    Volume name - volume-restore-stuck-test-1
    Logs:
    time="2020-07-09T23:46:24Z" level=error msg="BUG: engine volume-restore-stuck-test-1-e-197c85d1: different lastRestored values even though engine is not restoring"
    longhorn-support-bundle_e27c1f86-2285-474c-8127-9c41cc352773_2020-07-09T23-36-05Z.zip

  2. On master 97a869e, the DR volume became Degraded but then returned to Healthy. Also, I checked the data after activation; the data is intact.
    Volume name - volume-restore-stuck-test-2
    Logs:
    longhorn-support-bundle_516cbf10-706f-4852-860b-729747916e3d_2020-07-09T23-36-20Z.zip

In the above setup I don't see restoreStatus and lastRestoredBackup empty.
@shuo-wu Is this expected?


shuo-wu commented Jul 10, 2020

  1. The error log in the v1.0.0 test means the status stored in the engine process is inconsistent with the engine object status. Since the restore error cannot be handled correctly in v1.0.0, the whole volume gets messed up after the test, and many kinds of weird issues may be triggered.
  2. There will be a big refactor of the restore feature in the next release. This kind of error will be fixed after the refactor.

In general, I don't think we need to worry about the error you triggered in the v1.0.0 test for now.

@khushboo-rancher

Verified on v1.0.1-rc1

Validation - Passed

Followed the steps from #1188 (comment); the DR volume remains healthy.
Verified the restored data.
