[BUG] A replica may be incorrectly scheduled to a node with an existing failed replica #8043
In the Harvester cluster, the workaround was to delete enough of the incorrectly scheduled replicas (
@ejweber Does this mean that for a 3-node cluster, the issue won't happen? (Say a 3-node cluster has a power outage and all nodes power on again at the same time.)
@bk201, I think you are mostly correct. In a three-node cluster, all replicas should already be scheduled to some node, so the power outage should not result in any unexpected scheduling. If the outage is long enough, we may clean up some of the existing replicas and create new ones. From a brief review of the code, I don't think that is a path to hitting this issue, but I'm not sure I can rule it out. If it is a path, I think it will be much rarer.
Pre Ready-For-Testing Checklist
@ejweber I added the backport/1.5.5 label first. If, after checking, a backport turns out not to be required, just remove the label.
Describe the bug
Harvester QA hit a complete lockup of Longhorn after a hard node reboot in a single-node cluster.

Before the reboot:

- Each volume had `numReplicas == 3`, so all volumes were degraded.

After the reboot:
To Reproduce
Observe the root cause

1. In the single-node cluster, create a deployment with a block volume whose size is more than half of the node's `storageMaximum`. (I'm not sure block volume and deployment are important here, but they mimic the original context.) The volume is degraded and two replicas aren't scheduled.
2. Reboot the node with `reboot -f`.
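To make the degraded starting state concrete, here is a toy model of a disk-capacity check (my own simplified sketch, not Longhorn's actual code; the 100 GiB figure, the 100% over-provisioning percentage, and all names are assumptions for illustration). Ignoring anti-affinity entirely, capacity alone already blocks a second replica of a volume larger than half of `storageMaximum`:

```python
# Toy capacity-check model (NOT Longhorn's actual scheduler code; the names,
# sizes, and 100% over-provisioning figure are illustrative assumptions).

STORAGE_MAXIMUM = 100        # GiB on the single node's only disk
OVER_PROVISIONING_PCT = 100  # assumed over-provisioning setting

def fits(replica_size, storage_scheduled):
    """Can the disk accept another replica of replica_size?"""
    budget = STORAGE_MAXIMUM * OVER_PROVISIONING_PCT / 100
    return storage_scheduled + replica_size <= budget

size = 60  # > 50% of storageMaximum
scheduled = 0
placed = 0
for _ in range(3):  # numReplicas == 3, but only one node exists
    if fits(size, scheduled):
        scheduled += size
        placed += 1

print(placed, 3 - placed)  # 1 replica placed, 2 unscheduled -> degraded
```

With these numbers only one of the three replicas lands on the node, matching the degraded state described above.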
Cause a lockup
This is more complicated than I originally supposed. We need to create a situation in which the node becomes overscheduled. This cannot be done with a single volume with size > 50% of a node's `storageMaximum`, or even with a volume close to, but < 50% of, a node's `storageMaximum`, because the replica scheduler can easily recognize that a second replica will not fit. Instead, we need to ensure there are multiple volumes that, in aggregate, cause the node to be overscheduled (as was the cause of the lockup in the original context). It appears to be a bit racy as well. I reproduced 1/4 times with four volumes and never with only two volumes.

1. Create deployments with multiple block volumes that, in aggregate, exceed the node's `storageAvailable`. (I'm not sure block volume and deployment are important here, but they mimic the original context.) The volumes are degraded and two replicas each aren't scheduled.
2. Reboot the node with `reboot -f`.
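One way the raciness described above could produce overscheduling is if several volumes are checked against the same stale view of the node's scheduled storage before any of the new replicas are recorded. This is a hypothetical sketch of that pattern (my own assumption about the mechanism, not Longhorn's actual code; all names and sizes are illustrative):

```python
# Hypothetical illustration of racy overscheduling (NOT Longhorn's code).
# Each volume fits individually, but all four checks race against the same
# stale storageScheduled snapshot, so in aggregate the node is overscheduled.

STORAGE_MAXIMUM = 100

def fits(size, scheduled_snapshot):
    return scheduled_snapshot + size <= STORAGE_MAXIMUM

volume_sizes = [30, 30, 30, 30]  # each fits alone; together they do not
stale_snapshot = 0               # none of the concurrent checks sees the others

accepted = [s for s in volume_sizes if fits(s, stale_snapshot)]
print(sum(accepted), STORAGE_MAXIMUM)  # 120 GiB scheduled on a 100 GiB disk
```

A sequential scheduler that updated the snapshot after each placement would have rejected the later volumes, which may be why the reproduction only succeeds some of the time and needs several volumes.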
Expected behavior

Because `replicaSoftAntiAffinity == false` in the cluster, Longhorn should not have scheduled an additional replica for each volume. If an extra replica was truly desired for some reason, the user should have had to:

- set `replicaSoftAntiAffinity == true`, or

Targeting two potential fixes:
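The expected behavior implies a node filter along these lines (a hedged sketch of the rule, not Longhorn's actual types or functions; all names are assumptions): with `replicaSoftAntiAffinity == false`, a node should be excluded if any existing replica of the volume lives there, including a failed one.

```python
# Sketch of the anti-affinity rule the expected behavior implies
# (illustrative names only, not Longhorn's actual implementation).
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    failed: bool

def schedulable_nodes(nodes, replicas, soft_anti_affinity):
    if soft_anti_affinity:
        return list(nodes)
    # Counting only healthy replicas here would reproduce the bug: a node
    # holding a failed replica would look free and receive a second replica.
    used = {r.node for r in replicas}  # failed replicas must count too
    return [n for n in nodes if n not in used]

replicas = [Replica("node-1", failed=True)]
print(schedulable_nodes(["node-1"], replicas, soft_anti_affinity=False))  # []
```

Under this rule, the single node with an existing failed replica is not a candidate, so no extra replica is created after the reboot.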
Support bundle for troubleshooting
https://github.com/harvester/harvester/files/14390529/supportbundle_e5003761-8a04-41c3-8cbf-88a8c0e19116_2024-02-23T21-55-41Z.zip
Environment
Additional context
See harvester/harvester#5109 (comment) for the original context.