Scheduler will run into race conditions on large scale clusters #106361

Closed as not planned
@ahg-g

Description

What happened?

The scheduler has a 30s timeout for the bind operation to succeed; if we don't get a response within 30s, the in-memory assignment of pod to node in the scheduler cache expires.

A race condition will happen in the following case:

  1. pod1 is assigned to a node, the scheduler cache is updated with the assignment, and a bind operation is issued to the apiserver.

  2. the apiserver is under heavy pressure, so bind takes more than 30s and the scheduler expires the cached pod-to-node assignment.

  3. bind eventually succeeds, but because the apiserver is still under pressure, the pod update carrying the node name takes a long time to propagate to the scheduler.

  4. because the cache entry expired and the pod update has not arrived yet, the scheduler is unaware that the assignment actually happened. It then assigns a second pod to the same node, even though that pod would not fit if the scheduler knew the first pod had been bound there.

On the scheduler side, what we need to do is increase the 30s timeout for large clusters, and ideally make it adapt to cluster state.

/sig scheduling

What did you expect to happen?

No race conditions.

How can we reproduce it (as minimally and precisely as possible)?

Create a large scale cluster.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Assignees

Labels

kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.

Type

No type

Projects

Status

Closed

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
