Scheduler pre-binding can cause race conditions with automated empty node removal #125491
Labels: kind/bug, needs-triage, sig/scheduling
What happened?
In a Google Kubernetes Engine (GKE) environment, a pod was requesting a large Persistent Volume Claim (PVC). After the appropriate node was identified for the pod, the pod became stuck in the prebinding stage for several minutes while the volume provisioning process completed. Since the node name was not assigned to the pod during this time, the Cluster Autoscaler perceived the node as unoccupied. Consequently, the Cluster Autoscaler initiated a scale-down of the node, unaware that the pending pod was scheduled to run there.
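A minimal sketch of the race described above, using simplified types (the `Pod` struct and `nodeAppearsEmpty` function are illustrative, not Cluster Autoscaler code): an emptiness check that only looks at pods whose node name is already set cannot see a pod that is stuck in prebinding.

```go
package main

import "fmt"

// Pod models only the field relevant here; in a real cluster it corresponds
// to pod.Spec.NodeName, which stays empty until the bind phase completes.
type Pod struct {
	Name     string
	NodeName string
}

// nodeAppearsEmpty sketches the emptiness check described above: it counts
// only pods already bound to the node, so a pod still in prebinding
// (NodeName == "") is invisible to it.
func nodeAppearsEmpty(node string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName == node {
			return false
		}
	}
	return true
}

func main() {
	// The scheduler has picked node-1 for this pod, but volume provisioning
	// is still running, so NodeName has not been written yet.
	pods := []Pod{{Name: "pvc-pod", NodeName: ""}}
	fmt.Println(nodeAppearsEmpty("node-1", pods)) // prints true: the node looks like a scale-down candidate
}
```

With this view of the cluster, the node is a scale-down candidate for the entire duration of the prebinding stage.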
What did you expect to happen?
I would expect that the Scheduler would communicate the intended binding of the pod to the identified node. This would enable the Cluster Autoscaler to recognize that the node is not actually empty and prevent it from being scaled down prematurely.
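A sketch of what such a signal could look like, under the assumption that the scheduler records its intended target on the pod as soon as one is chosen (Kubernetes does have a `nominatedNodeName` field in pod status today, though it is populated for preemption rather than prebinding; the types and function names below are illustrative):

```go
package main

import "fmt"

// Pod models the two fields a smarter emptiness check could consult:
// NodeName is set only after bind completes, while NominatedNodeName
// sketches a hint the scheduler could write as soon as it picks a node.
type Pod struct {
	Name              string
	NodeName          string
	NominatedNodeName string
}

// nodeIsEmpty treats a node as occupied if any pod is either bound to it
// or has it recorded as the intended target, closing the window in which
// a prebinding pod is invisible to scale-down.
func nodeIsEmpty(node string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName == node || p.NominatedNodeName == node {
			return false
		}
	}
	return true
}

func main() {
	// Prebinding is still in progress, but the intended node is recorded.
	pods := []Pod{{Name: "pvc-pod", NominatedNodeName: "node-1"}}
	fmt.Println(nodeIsEmpty("node-1", pods)) // prints false: the node is no longer a scale-down candidate
}
```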
How can we reproduce it (as minimally and precisely as possible)?
The issue arose in a large GKE cluster with pods requesting substantial PVCs, so reproducing it on demand may be difficult. However, the race window is clear from the description above: any pod whose prebinding (for example, slow volume provisioning) outlasts the autoscaler's empty-node scale-down delay can hit it.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response