[BUG] Longhorn may keep corrupted salvaged replicas and discard good ones #7425
Comments
I'm not sure our autosalvage mechanism is the main culprit here after all. In all relevant versions of Longhorn, we only autosalvage with replicas that have … As soon as we decide to rebuild a replica for the first time, we set … We only set … The corrupted replicas in our support bundle have …
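To make that concrete, here is a minimal Go sketch of the kind of selection being discussed: only replicas that are marked failed are considered for autosalvage, and the most recently healthy of them is preferred. The struct and helper names are invented for illustration; this is not the actual longhorn-manager code.

```go
package main

import (
	"fmt"
	"time"
)

// Replica is a stand-in for the fields discussed in this issue (the issue
// refers to them as healthyAt and failedAt); the type itself is invented.
type Replica struct {
	Name      string
	FailedAt  time.Time // zero means the replica is not currently marked failed
	HealthyAt time.Time // last time the replica was known to hold good data
}

// salvageCandidates picks, among replicas that are marked failed, the ones
// whose HealthyAt is close to the most recently healthy failed replica.
// Replicas without a FailedAt are never considered.
func salvageCandidates(replicas []Replica, window time.Duration) []Replica {
	var newest time.Time
	for _, r := range replicas {
		if !r.FailedAt.IsZero() && r.HealthyAt.After(newest) {
			newest = r.HealthyAt
		}
	}
	var candidates []Replica
	for _, r := range replicas {
		if !r.FailedAt.IsZero() && newest.Sub(r.HealthyAt) <= window {
			candidates = append(candidates, r)
		}
	}
	return candidates
}

func main() {
	now := time.Now()
	replicas := []Replica{
		{Name: "replica-good", FailedAt: now, HealthyAt: now.Add(-1 * time.Minute)},
		{Name: "replica-broken", FailedAt: now.Add(-30 * time.Minute), HealthyAt: now.Add(-2 * time.Hour)},
	}
	fmt.Println(salvageCandidates(replicas, 5*time.Minute))
}
```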
Hi @ejweber, I found this quite similar to this issue as well.
Thanks @ChanYiLin! I agree that the linked issue is quite relevant! I'm not seeing any real evidence that …
These (possibly unintentional) protections are thwarted, however, if the volume is ever detached. Then, we clean up the stale (but good) replica after …
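As a rough illustration of the cleanup path described above (the exact trigger is cut off in the comment, but it is presumably a staleness timeout along the lines of the volume's stale replica timeout), here is a sketch with invented names; it is not the project's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// staleCheckReplica is a hypothetical view of a replica for this sketch only.
type staleCheckReplica struct {
	Name     string
	FailedAt time.Time // zero means the replica is not marked failed
}

// shouldCleanUp mimics the behavior described above: once the volume is
// detached, any replica that has been marked failed for longer than the
// staleness timeout is eligible for deletion, regardless of whether its data
// on disk is actually the good copy.
func shouldCleanUp(r staleCheckReplica, volumeDetached bool, staleTimeout time.Duration) bool {
	if !volumeDetached || r.FailedAt.IsZero() {
		return false
	}
	return time.Since(r.FailedAt) > staleTimeout
}

func main() {
	goodButStale := staleCheckReplica{Name: "replica-good", FailedAt: time.Now().Add(-2 * time.Hour)}
	fmt.Println(shouldCleanUp(goodButStale, true, time.Hour)) // true: the good replica is discarded
}
```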
Pre Ready-For-Testing Checklist
Verified on master-head 20240306
Result of the test steps: Passed.
Describe the bug (🐛 if you encounter this issue)
In a Harvester cluster, we saw many volumes in an attach/detach loop. The `instance-manager-e` responsible for starting each engine logged the following on each loop:

At this point, there were two replicas, and both were somehow corrupted (missing a `.meta` file to correspond to a `.img` file). Both replicas appeared healthy to Longhorn, with `healthyAt` but no `failedAt`.

We were able to look at the data directory on disk for one of the replicas: `volume-snap-af2454a6-b3b5-4de6-a6a6-eb465abfca97.img` was missing a corresponding `.meta` file.
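For anyone inspecting a replica data directory, this kind of corruption can be spotted by listing snapshot `.img` files that lack a matching `.meta` file. The sketch below assumes the usual `<snapshot>.img.meta` naming and uses a made-up directory path; it is a diagnostic aid, not Longhorn code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// findOrphanedImages lists volume-snap-*.img files that have no matching .meta
// file in the same replica data directory, which is the corruption observed here.
func findOrphanedImages(dataDir string) ([]string, error) {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return nil, err
	}
	var orphans []string
	for _, e := range entries {
		name := e.Name()
		if !strings.HasPrefix(name, "volume-snap-") || !strings.HasSuffix(name, ".img") {
			continue
		}
		meta := filepath.Join(dataDir, name+".meta")
		if _, err := os.Stat(meta); os.IsNotExist(err) {
			orphans = append(orphans, name)
		}
	}
	return orphans, nil
}

func main() {
	// The path is an example only; a real replica directory lives under the
	// configured Longhorn data path on the node.
	orphans, err := findOrphanedImages("/var/lib/longhorn/replicas/pvc-example-abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, o := range orphans {
		fmt.Println("missing .meta for:", o)
	}
}
```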
We were lucky enough to capture the reason for this in the logs:

The rebuild timed out. We always transfer the `.meta` file after the `.img` file, so this replica was left in a broken state. It shouldn't have been a big deal. Presumably, we marked the replica as failed with a `failedAt` and attempted to rebuild it. (We can't tell from the logs whether this was the case.) The same thing probably happened to the other replica. So, at one point, we likely had three replicas. Two were broken and one was whole.

Somehow, we lost the whole replica and kept one broken one. I think it could have happened like this:

- We remove `failedAt` from the broken replicas and attempt an autosalvage.
- The whole replica still has a `failedAt` and the broken ones do not.
To Reproduce

I still need to try to reproduce this. We can probably:

- Corrupt (delete a `.meta` file from) one of the replicas, then crash it.
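If it helps, the corruption step could be simulated with something like the following sketch; the directory and snapshot names are placeholders, and this is untested rather than a ready-made reproduction script.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// corruptSnapshot simulates the failure mode above by deleting the .meta file
// that belongs to a snapshot .img inside a replica data directory.
func corruptSnapshot(dataDir, snapshotImg string) error {
	meta := filepath.Join(dataDir, snapshotImg+".meta")
	return os.Remove(meta)
}

func main() {
	// Both arguments are placeholders for whatever the reproduction actually uses.
	err := corruptSnapshot("/var/lib/longhorn/replicas/pvc-example-abc123", "volume-snap-example.img")
	fmt.Println(err)
}
```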
Expected behavior

We can distinguish the actual healthy replica from the replicas that failed to be salvaged. We do not delete the actual healthy replica. Eventually, the healthy replica is able to rebuild other healthy replicas.
Support bundle for troubleshooting
I cannot share the support bundle here, but we have one that we can re-examine.
Environment
Additional context
There is a `replica.spec.salvageRequested` field and a `replica.status.salvageExecuted` field that our current code doesn't use. Maybe we can use these to track that certain replicas were never successfully salvaged.
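A rough sketch of that idea, with all types and logic invented here rather than taken from longhorn-manager: treat a replica whose salvage was requested but never executed as one that should not be trusted over a genuinely healthy replica.

```go
package main

import "fmt"

// Illustrative only: these structs mirror the idea of the unused fields named
// above (replica.spec.salvageRequested / replica.status.salvageExecuted); the
// types and helper below are not Longhorn code.
type ReplicaSpec struct {
	SalvageRequested bool
}

type ReplicaStatus struct {
	SalvageExecuted bool
}

type TrackedReplica struct {
	Name   string
	Spec   ReplicaSpec
	Status ReplicaStatus
}

// neverSuccessfullySalvaged flags replicas for which a salvage was requested
// but never completed; a cleanup path could use this to avoid preferring such
// replicas over ones that were genuinely healthy before the failure.
func neverSuccessfullySalvaged(r TrackedReplica) bool {
	return r.Spec.SalvageRequested && !r.Status.SalvageExecuted
}

func main() {
	suspect := TrackedReplica{Name: "replica-broken", Spec: ReplicaSpec{SalvageRequested: true}}
	fmt.Println(neverSuccessfullySalvaged(suspect)) // true: do not prefer this replica's data
}
```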