
[BUG]Restore will get stuck if the restore error is not captured #1188

Closed
shuo-wu opened this issue Apr 14, 2020 · 10 comments
Assignee: shuo-wu
Labels: area/v1-data-engine (v1 data engine, iSCSI tgt), component/longhorn-manager (Longhorn manager, control plane), kind/bug, priority/1 (Highly recommended to fix in this release, managed by PO), require/auto-e2e-test (Require adding/updating auto e2e test cases if they can be automated)


shuo-wu commented Apr 14, 2020

Describe the bug
Currently, some errors raised during a replica restore are not captured by the Longhorn manager and are not recorded in the engine restore status. As a result, there is no way to handle the error and continue the restore.

To Reproduce
Steps to reproduce the behavior:

  1. Set backupstore for Longhorn
  2. Create a regular Longhorn volume with the workload, then write data to the volume.
  3. Create a backup for the volume, then launch a DR volume based on the backup.
  4. Wait for the DR volume's initial restore to complete. Then create a non-empty directory named volume-delta-<last restored backup name>.img for one replica of the DR volume (e.g., mkdir -p <replica data path>/volume-delta-backup-c3edca9817c14f1a.img/dir). This directory will cause the next incremental restoration to fail for that replica (see the shell sketch after this list).
  5. Write data to the regular volume, then create a backup and wait for the DR volume's incremental restoration result.
  6. Check restoreStatus and lastRestoredBackup in the DR volume engine status: restoreStatus is missing and engine.status.lastRestoredBackup is not updated.
  7. Create 2 more backups for the regular volume. The DR volume won't continue restoring them.
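A minimal shell sketch of steps 4 and 6, assuming the default replica data path /var/lib/longhorn/replicas/ on the node hosting the chosen replica and kubectl access to the Longhorn CRs in the longhorn-system namespace; every <...> value is a placeholder and the field paths are assumptions to verify against the running Longhorn version:

    # Step 4: break the next incremental restore for one replica of the DR volume.
    REPLICA_DIR=/var/lib/longhorn/replicas/<dr-volume-replica-directory>
    LAST_BACKUP=<last restored backup name>        # e.g. backup-c3edca9817c14f1a
    mkdir -p "${REPLICA_DIR}/volume-delta-${LAST_BACKUP}.img/dir"

    # Step 6: after writing more data and creating another backup, inspect the
    # DR volume engine object. When the error is dropped, restoreStatus stays
    # empty and lastRestoredBackup is not updated.
    kubectl -n longhorn-system get engines.longhorn.io <dr-volume-engine-name> \
      -o jsonpath='{.status.lastRestoredBackup}{"\n"}'
    kubectl -n longhorn-system get engines.longhorn.io <dr-volume-engine-name> -o yaml | grep -A20 'restoreStatus'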

Expected behavior

  1. The error should be recorded in restoreStatus and engine.status.lastRestoredBackup should be updated.
  2. The replica that fails to restore the data should be marked as failed.

Environment:

  • Longhorn version: v1.0.0
  • Kubernetes version:
  • Node OS type and version:
@shuo-wu added the kind/bug, area/v1-data-engine, and component/longhorn-manager labels on Apr 14, 2020
@shuo-wu self-assigned this on Apr 14, 2020
@yasker added this to the v1.0.0 milestone on Apr 14, 2020
@yasker added the priority/0 and priority/1 labels and removed the priority/0 label on Apr 30, 2020
@yasker changed the title from "[BUG]DR volume may get stuck if there is one replica failing to restore the data" to "[BUG]DR volume will get stuck if there is one replica failing to restore the data" on May 5, 2020
@yasker added the require/auto-e2e-test and automation-engine-required labels on May 6, 2020

shuo-wu commented May 8, 2020

Integration test plan 1:
Test that the DR volume becomes Faulted when the backup data is corrupted.

  1. Set the backupstore.
  2. Create then attach a volume.
  3. Directly create a backup for the volume when there is no data in the volume.
  4. Create a DR volume from the backup.
  5. Write some data to the volume and create the 2nd backup.
  6. Delete some data blocks of the backup once the DR volume starts the incremental restoration. (Do not delete only the blocks created by the 1st backup.)
  7. Wait for the DR volume to become Faulted, then check the result (see the kubectl sketch after this list):
    --> Check if volume.conditions[restore].status == False && volume.conditions[restore].reason == "RestoreFailure".
    --> Check if volume.ready == false
    This test is similar to the 1st test in [BUG] State transitions for a failure in backup restore #1260
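A hedged kubectl sketch of the checks in step 7; the volume name is a placeholder, and the exact condition key and casing in the volume CR should be verified against the running Longhorn version (the volume.ready check is normally done through the Longhorn API client in the integration tests and is not shown here):

    VOL=<dr-volume-name>

    # Expect the restore condition to report status "False" with reason "RestoreFailure".
    kubectl -n longhorn-system get volumes.longhorn.io "${VOL}" \
      -o jsonpath='{.status.conditions}{"\n"}'

    # Expect the volume robustness to be reported as faulted.
    kubectl -n longhorn-system get volumes.longhorn.io "${VOL}" \
      -o jsonpath='{.status.robustness}{"\n"}'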

Integration test 2:
Test that a replica which fails the incremental restore is marked as ERROR.

  1. Set a random backupstore.
  2. Create a regular volume and a backup (with some data).
  3. Create a DR volume from the backup.
  4. Create the 2nd backup for the regular volume.
  5. Wait for the DR volume incremental restore to complete.
  6. Pick one replica of the DR volume and create a non-empty directory named volume-delta-<last restored backup>.img in the replica directory (see the sketch after this list). This operation will lead to an incremental restore failure for that replica later.
  7. Create the 3rd backup for the regular volume.
  8. Wait for the DR volume incremental restore to complete, then check:
    --> Check if the volume is in Degraded state and the replica from step 6 is in ERROR state.
    --> Check if volume.conditions[restore].status == True && volume.conditions[restore].reason == "RestoreInProgress".
  9. See if the DR volume still works fine (e.g., trigger more incremental restores, trigger expansion).
  10. Activate the DR volume and check the data.
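A rough shell sketch of steps 6 and 8, under the same assumptions as the sketch in the reproduce steps above (default replica data path, kubectl access to the Longhorn CRs); all <...> names are placeholders, and spec.failedAt is assumed to be the field the Longhorn manager sets when it marks a replica as failed:

    # Step 6: make the next incremental restore fail for one chosen replica only.
    REPLICA=<dr-volume-replica-name>
    REPLICA_DIR=/var/lib/longhorn/replicas/<directory-of-that-replica>
    LAST_BACKUP=<last restored backup name>
    mkdir -p "${REPLICA_DIR}/volume-delta-${LAST_BACKUP}.img/dir"

    # Step 8: after the 3rd backup triggers another incremental restore, the chosen
    # replica should be failed while the volume stays Degraded and keeps restoring.
    kubectl -n longhorn-system get replicas.longhorn.io "${REPLICA}" \
      -o jsonpath='{.spec.failedAt}{"\n"}'          # non-empty once marked failed
    kubectl -n longhorn-system get volumes.longhorn.io <dr-volume-name> \
      -o jsonpath='{.status.robustness}{"\n"}'      # expect "degraded"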


shuo-wu commented May 13, 2020

The PR for issue #1260 cannot completely fix this issue. The 1st test passes, but the 2nd one will still fail after the fix, since some errors during the incremental restore are not recorded in the restore status, so the Longhorn manager cannot mark the related replica as failed.
The following errors are not recorded in the sync agent server restore status:

  1. Parameter check failures before the restore status is initialized.
  2. Delta file check and cleanup errors before the incremental restoration starts.
  3. File coalesce errors in the post-incremental-restore function.
  4. Replica reload errors in the post-incremental-restore function.
    ...

Besides, there is no integration test covering those corner cases.

The regular restore implementation has a similar issue, too.

If we want to cover those errors in the restore status, we need to determine which errors are fatal for the replica (the Longhorn manager needs to fail the replica based on the restore status error message) and which errors don't affect the replica (the Longhorn manager doesn't need to fail the replica; simply retrying the restore may be enough). This means we need to refactor the whole restore flow, including the incremental restore. I don't think we have time to do that in this release.

Actually, I can fix the specific issue triggered by the reproduce steps, but I don't think that's what we want.


yasker commented May 13, 2020

Sounds like we need to move this out of 1.0.0.

Have we fixed #1328 along with #1260?


yasker commented May 13, 2020

@shuo-wu Can you assess the impact of missing the fix for this issue? E.g., in which cases will the user have trouble, what's the workaround, etc.


shuo-wu commented May 14, 2020

No, #1328 will be fixed along with other issues later.
The fix for #1260 only handles part of the restore failure cases. We will record all known cases and track the fixes in #1337.

@yasker modified the milestones: v1.0.0, v1.0.1 on May 15, 2020
@shuo-wu changed the title from "[BUG]DR volume will get stuck if there is one replica failing to restore the data" to "[BUG]Restore will get stuck if the restore error is not captured" on Jun 12, 2020

yasker commented Jul 2, 2020

Code merged, need the pre-merge checklist.


shuo-wu commented Jul 6, 2020

Pre-merged Checklist


khushboo-rancher commented Jul 9, 2020

Tried reproducing the issue from #1188 (comment):

  1. On v1.0.0, the DR volume got stuck in Restore in progress forever; it can't be activated.
    Volume name - volume-restore-stuck-test-1
    Logs:
    time="2020-07-09T23:46:24Z" level=error msg="BUG: engine volume-restore-stuck-test-1-e-197c85d1: different lastRestored values even though engine is not restoring"
    longhorn-support-bundle_e27c1f86-2285-474c-8127-9c41cc352773_2020-07-09T23-36-05Z.zip

  2. On master 97a869e, the DR volume became Degraded but then returned to Healthy. Also, I checked the data after activation; the data is intact.
    Volume name - volume-restore-stuck-test-2
    Logs:
    longhorn-support-bundle_516cbf10-706f-4852-860b-729747916e3d_2020-07-09T23-36-20Z.zip

In the above setup I don't see restoreStatus and lastRestoredBackup empty.
@shuo-wu Is this expected?


shuo-wu commented Jul 10, 2020

  1. The error log in the v1.0.0 test means the status stored in the engine process is inconsistent with the engine object status. Since the restore error cannot be handled correctly in v1.0.0, the whole volume gets messed up after the test, and many kinds of weird issues may be triggered.
  2. There will be a big refactor of the restore feature in the next release. This kind of error will be fixed after the refactor.

In general, I don't think we need to worry about the error you triggered in the v1.0.0 test for now.

@khushboo-rancher

Verified on v1.0.1-rc1

Validation - Passed

Followed the steps from #1188 (comment); the DR volume remains healthy.
Verified the restored data.
