[BUG] Race leaves snapshot CRs that cannot be deleted #6298
Comments
Do we need to backport to 1.4/1.3?
I think not; this one was introduced in v1.5.1 only.
@ejweber @PhanLe1010 Please don't forget to update the status of the ZenHub pipeline. It's important for us to know where we are.
Pre Ready-For-Testing Checklist
This appears to be reproducible during all rebuilds. On a three-node cluster:
We would expect only the new snapshot CR to exist in the cluster, as the old one is purged during the rebuild and no longer exists.
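One way to observe the duplicate while the rebuild runs (assuming the CRs live in the default `longhorn-system` namespace) is to watch the snapshot CRs directly:

```sh
# Watch snapshot CRs during the rebuild; a stale CR shows up alongside the
# new one instead of disappearing after the purge.
kubectl -n longhorn-system get snapshots.longhorn.io -w
```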
I think we should have an automated test case that catches this issue. I was curious why it wasn't caught in a case like test_snapshot, which has a step like:
However, that case uses an API method to get the snapshots the engine knows about. It is not aware of erroneous snapshot CRs in the cluster.
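A minimal sketch of a check that would catch it, assuming the `kubernetes` Python client and the `longhorn.io/v1beta2` Snapshot CRD (this is illustrative, not the actual suite code; the `spec.volume` filter and the Longhorn client's `snapshotList()` call are the assumed pieces):

```python
# Hedged sketch: compare snapshot CRs from the API server against the
# snapshots the engine reports, so a stale CR fails the assertion.
from kubernetes import client, config


def snapshot_cr_names(volume_name, namespace="longhorn-system"):
    """Names of the volume's snapshot CRs, straight from the API server."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    crs = api.list_namespaced_custom_object(
        group="longhorn.io", version="v1beta2",
        namespace=namespace, plural="snapshots",
    )
    return {
        item["metadata"]["name"]
        for item in crs["items"]
        if item["spec"]["volume"] == volume_name  # spec.volume assumed
    }


def assert_no_stale_snapshot_crs(volume):
    """Every snapshot CR must match a snapshot the engine knows about."""
    engine_snaps = {s.name for s in volume.snapshotList()}  # Longhorn client
    stale = snapshot_cr_names(volume.name) - engine_snaps
    assert not stale, f"snapshot CRs unknown to the engine: {stale}"
```

Listing the CRs directly from the API server, rather than through the Longhorn API, is what makes the stale objects visible.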
Verified as passed on v1.5.x-head (longhorn-manager 8f12052) following the test steps. After the second replica deletion/rebuild, the old snapshot is replaced by a new one; the two snapshots never exist at the same time.
Describe the bug (🐛 if you encounter this issue)
I discovered this while doing the iterative testing described in #6078 (comment). In a cluster with 100 volumes and `auto-cleanup-system-generated-snapshot=true`, in which an instance-manager has been force deleted ~50 times, there are ~2000 snapshot CRs that all look similar to the sketch below. Note that the deletion timestamp is set and that `status.error` reads "lost track of the corresponding snapshot info inside volume engine".
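A hedged reconstruction of such a CR (the actual dump was omitted above; names, timestamps, and the exact finalizer/children shape are illustrative):

```yaml
apiVersion: longhorn.io/v1beta2
kind: Snapshot
metadata:
  name: vol-1-abc123            # illustrative name
  namespace: longhorn-system
  deletionTimestamp: "2023-07-10T00:00:00Z"
  finalizers:
  - longhorn.io                 # never removed, so the CR cannot go away
spec:
  volume: vol-1
status:
  children:                     # map shape assumed
    volume-head: true
  error: lost track of the corresponding snapshot info inside volume engine
  readyToUse: false
```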
These snapshots do not exist in engine.status.snapshots nor do they exist on disk.
To Reproduce
I triggered it with the above, but a much simpler reproduction is described at #6298 (comment).
Expected behavior
Longhorn-manager should clear the finalizer and allow these snapshot CRs to be deleted from the cluster.
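Until that fix lands, a hedged manual escape hatch (not project guidance) is to strip the finalizer yourself:

```sh
# Remove the finalizer so the API server can finish deleting the stuck CR.
kubectl -n longhorn-system patch snapshots.longhorn.io <snapshot-name> \
  --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
```

This skips whatever cleanup the finalizer was guarding, which is only reasonable here because the snapshot no longer exists in the engine or on disk.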
Log or Support bundle
If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment
Additional context
Discussed with @PhanLe1010 and @james-munson. We cannot get out of the state we are in because:
@PhanLe1010 suggested that we got into this state because:

1. `snap-1` was created.
2. Longhorn-manager updated the `snap-1` CR with `status.children=volume-head`.
3. `snap-2` was created around the same time `snap-1` was purged.
4. Longhorn-manager deleted the `snap-1` CR and created the `snap-2` CR with `status.children=volume-head`.
5. Now both CRs have `status.children=volume-head`, and the `snap-1` CR has a deletion timestamp.
6. The `snap-1` CR's status is never updated because the snapshot is gone from the engine.

This fits well with the circumstances, as my iterative testing causes lots of purging/rebuilding in a short period of time.