What happened?
The scheduler gives the bind operation 30s to succeed; if no response arrives within 30s, the in-memory pod-to-node assignment in the scheduler cache expires.
A race condition can happen in the following case:
- pod1 is assigned to a node, the scheduler cache is updated with the assignment, and the bind operation is issued to the apiserver.
- If the apiserver is under heavy pressure, the bind takes more than 30s and the scheduler expires the cached pod-to-node assignment.
- The bind eventually succeeds, but because the apiserver is under heavy pressure, the pod update carrying the node name takes a long time to propagate back to the scheduler.
- Because the pod update took a long time to propagate and the cache entry had already expired, the scheduler is unaware that the assignment actually happened, so it happily assigns a second pod to the same node, a pod that would not fit had the scheduler known that the first pod was eventually bound there (see the sketch after this list).
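To illustrate the mechanism, here is a minimal, self-contained Go sketch of a TTL-based assumed-pod cache. It is not the actual kube-scheduler cache code; names such as `Cache`, `Assume`, `NodeFreeCPU`, and `assumeTTL` are made up for this example.

```go
// Minimal sketch (not the real kube-scheduler cache) of how a TTL on assumed
// pods can lose an in-flight assignment when the bind outlives the TTL.
package main

import (
	"fmt"
	"time"
)

type assumedPod struct {
	node     string
	cpu      int64     // requested CPU in millicores
	deadline time.Time // assumption expires if the bind is not confirmed by then
}

type Cache struct {
	assumeTTL time.Duration
	capacity  map[string]int64      // node -> allocatable CPU (millicores)
	assumed   map[string]assumedPod // pod -> assumed placement
}

// Assume records an in-memory pod-to-node assignment before the bind call returns.
func (c *Cache) Assume(pod, node string, cpu int64) {
	c.assumed[pod] = assumedPod{node: node, cpu: cpu, deadline: time.Now().Add(c.assumeTTL)}
}

// NodeFreeCPU ignores assumptions whose deadline has passed -- this is the step
// that reopens capacity on the node even though the bind may still succeed.
func (c *Cache) NodeFreeCPU(node string, now time.Time) int64 {
	free := c.capacity[node]
	for _, a := range c.assumed {
		if a.node == node && now.Before(a.deadline) {
			free -= a.cpu
		}
	}
	return free
}

func main() {
	c := &Cache{
		assumeTTL: 30 * time.Second,
		capacity:  map[string]int64{"node-1": 1000},
		assumed:   map[string]assumedPod{},
	}

	c.Assume("pod1", "node-1", 800)                  // bind request sent to the apiserver
	fmt.Println(c.NodeFreeCPU("node-1", time.Now())) // 200: a second 800m pod does not fit

	// Bind takes longer than the TTL under apiserver pressure; the assumption expires.
	later := time.Now().Add(35 * time.Second)
	fmt.Println(c.NodeFreeCPU("node-1", later)) // 1000: the second pod now appears to fit -> race
}
```

Once the assumption's deadline passes, the node's reported free capacity no longer accounts for pod1, so a second pod appears to fit even though the bind for pod1 may still complete.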
On the scheduler side, what we need to do is make the 30s expiry longer for large clusters, and ideally adaptive to cluster state; a rough sketch of the idea follows.
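Purely for illustration, one way an adaptive expiry could look is to scale the window with node count. kube-scheduler exposes no such knob today, and `baseAssumeTimeout`, `perNodeSlack`, and `maxAssumeTimeout` are invented names.

```go
// Hypothetical sketch of scaling the assume-expiry window with cluster size.
package main

import (
	"fmt"
	"time"
)

const (
	baseAssumeTimeout = 30 * time.Second
	perNodeSlack      = 10 * time.Millisecond // extra slack per node in the cluster
	maxAssumeTimeout  = 5 * time.Minute
)

// assumeTimeout grows with node count, so large clusters, where apiserver
// latency tends to be higher, keep assumptions alive longer before expiring them.
func assumeTimeout(nodeCount int) time.Duration {
	t := baseAssumeTimeout + time.Duration(nodeCount)*perNodeSlack
	if t > maxAssumeTimeout {
		t = maxAssumeTimeout
	}
	return t
}

func main() {
	fmt.Println(assumeTimeout(100))  // 31s
	fmt.Println(assumeTimeout(5000)) // 1m20s
}
```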
/sig scheduling
What did you expect to happen?
No race conditions.
How can we reproduce it (as minimally and precisely as possible)?
Create a large-scale cluster and put the apiserver under enough load that bind operations take longer than 30s.
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
# paste output here
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)