[BUG] Restore will get stuck if the restore error is not captured #1188
Comments
Integration test plan 1:
Integration test 2:
|
The PR for issue #1260 cannot completely fix this issue. The 1st test passes, but the 2nd one still fails after the fix, because some errors that occur during the incremental restore are not recorded in the restore status. As a result, the longhorn manager cannot mark the related replica as failed.
The regular restore implementation has a similar issue. To cover those errors in the restore status, we need to determine which errors are fatal for the replica (the longhorn manager needs to fail the replica according to the restore status error message) and which errors do not affect the replica (the longhorn manager does not need to fail the replica, and simply retrying the restore may be enough). That means refactoring the whole restore, including incremental restore. I don't think we have time to do that in this release. I can fix the issue triggered by the reproduce steps, but I don't think that's what we want. |
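The fatal versus retryable split described above could look roughly like the sketch below. It is only an illustration: the names `Severity`, `RestoreError`, and `plan_action` are hypothetical and are not part of the Longhorn code base, and which real errors belong in which class is exactly the open question raised in this comment.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    FATAL = "fatal"          # the replica should be failed
    RETRYABLE = "retryable"  # the replica is intact; retrying may be enough


@dataclass
class RestoreError:
    replica: str
    message: str
    severity: Severity


def plan_action(err: RestoreError) -> str:
    """Return the action a manager-like controller would take for one error."""
    if err.severity is Severity.FATAL:
        return f"fail replica {err.replica}"
    return f"retry restore on {err.replica}"


# The severities assigned here are assumptions made for the example only.
errors = [
    RestoreError("replica-a", "delta file path is a directory", Severity.FATAL),
    RestoreError("replica-b", "backup target temporarily unreachable", Severity.RETRYABLE),
]
actions = [plan_action(e) for e in errors]
print(actions)
```

The point of the sketch is that the decision can only be made if the error reaches the status with enough information to classify it.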
@shuo-wu Can you assess the impact of missing the fix for this issue? E.g. in which case the user will have trouble, what's the workaround, etc. |
Code merged, need the pre-merge checklist. |
Pre-merged Checklist
|
Tried reproducing issue from #1188 (comment)
In above set up I don't see |
In general, I don't think we need to worry about the error you triggered in v1.0.0 test now. |
Verified on v1.0.1-rc1. Validation: Passed. Followed the steps from #1188 (comment). The DR volume remains healthy. |
Describe the bug
Right now some errors caused by replica restore are not captured by the longhorn manager and are not recorded in the engine restore status. As a result, there is no way to handle the error and continue the restore.
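As a rough illustration of what "recorded in the restore status" means, the sketch below writes every failure into a status record instead of dropping it. It is not Longhorn code: `restore_delta`, `restore_with_status`, and the `error` key are hypothetical, and the failure is simulated by placing a directory where a delta file is expected.

```python
import os
import tempfile


def restore_delta(delta_path: str, data: bytes) -> None:
    """Write restored data to the delta file; raises OSError on failure."""
    with open(delta_path, "wb") as f:
        f.write(data)


def restore_with_status(delta_path: str, data: bytes, backup_name: str) -> dict:
    """Run one restore and always return a status record, even on failure."""
    status = {"lastRestoredBackup": "", "error": ""}
    try:
        restore_delta(delta_path, data)
        status["lastRestoredBackup"] = backup_name
    except OSError as exc:
        # Keep the error so a controller can react to it later.
        status["error"] = f"{type(exc).__name__}: {exc}"
    return status


with tempfile.TemporaryDirectory() as replica_dir:
    # A directory occupies the path where the delta file should be written.
    blocked = os.path.join(replica_dir, "volume-delta-backup-example.img")
    os.makedirs(os.path.join(blocked, "dir"))
    failed = restore_with_status(blocked, b"data", "backup-example")

    free = os.path.join(replica_dir, "volume-delta-backup-ok.img")
    ok = restore_with_status(free, b"data", "backup-ok")

print(failed)
print(ok)
```

With the error kept in the record, a caller can tell a restore that failed apart from one that has not finished yet.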
To Reproduce
Steps to reproduce the behavior:
Create a directory named volume-delta-<last restored backup name>.img for one replica of the DR volume. This directory will cause the following incremental restoration to fail for the replica (e.g., mkdir -p <replica data path>/volume-delta-backup-c3edca9817c14f1a.img/dir).
Check restoreStatus and lastRestoredBackup in the DR volume engine status. There is no restoreStatus and engine.status.lastRestoredBackup is not updated.
Expected behavior
restoreStatus and engine.status.lastRestoredBackup should be updated.
Environment: