New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628
Comments
This failure seems to be caused by a delay in the monitor stop. The monitor stop is best-effort, then it can happen after the volume recreation and lead to the replica mode `ERR, then leads to the restore command gets failed. Here are the logs:
BTW, I don't know why the |
@shuo-wu Another instance of this test failure: Looks like this test is still flaky, runs on local fails on remote. |
@shuo-wu Is the replica state something that would fix itself? instead of needing to go through the whole reset-volume ramp down/up again? |
Let me explain the case with more details:
The key point is that we don't know what the progress is in this monitoring stop goroutine. Hence checking the replica state is meaningless. |
Maybe we can add a lock to prevent the replica from being marked as ERR during the backend removal. |
We're hitting this issue more often now. Raise the priority. |
@keithalucas remember to move the issue to |
It looks like
We could create a mechanism to stop the |
Pre Ready-For-Testing Checklist
|
@keithalucas please update the checklist comment and also make sure the uncertain failed test cases got fixed before moving to |
This issue is dependent on timing so this procedure to duplicate it doesn't always work.
One the old codebase, it might not occur every time but, the controller will fail to recreate the replica because it has a mode of ERR. With the new codebase, the replica will always be recreated. |
I found another path that causes this problem. When the VolumeShutdown GRPC method is called on the controller, the replicas are closed without stopping the |
For the validation, can we rely on the test results or a conscious testing is required with steps from #1628 (comment)? |
@keithalucas Could you please help with backport PR for v1.1.3? |
I back ported with https://github.com/longhorn/longhorn-engine/pull/645/commits |
Closing this as the fix is validated and backported for v1.1.3 |
Describe the bug
We noticed engine integration test failures, during a restore operation.
The error is not dependent on the test method in question.
Potentially any restore operation in a test could trigger this.
To Reproduce
Expected behavior
A clear and concise description of what you expected to happen.
Log
If applicable, add the Longhorn managers' log when the issue happens.
Environment:
Additional context
Test failure
The text was updated successfully, but these errors were encountered: