
No proper scheduling retries could be made when Extender filters out some Nodes #122019

Closed
sanposhiho opened this issue Nov 23, 2023 · 6 comments · Fixed by #122022
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@sanposhiho
Member

sanposhiho commented Nov 23, 2023

What happened?

When an Extender filters out some Nodes, we don't record any unschedulable plugin for the Pod at all. It means the Extender is completely ignored during the requeueing process.

So, what's happening is:

  • If the Extender filters out all Nodes during scheduling, the Pod is retried soon, because it has no plugin name recorded in its unschedulable plugins.
  • If the Extender filters out some Nodes during scheduling and the plugins filter out all remaining Nodes, the Pod is retried based only on the plugins' QueueingHints. Even if a cluster event happens that could change the Extender's decision (but not any plugin's decision), the Pod won't be requeued to activeQ/backoffQ.

The latter case is serious because it can leave Pods stuck in the unschedulable Pod pool for up to 5 minutes.
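
Below is a self-contained, simplified sketch (in Go) of where the gap sits in the scheduling cycle. All names are illustrative stand-ins for kube-scheduler internals, not the actual types or functions; the point is only that the Extender's Filter call removes Nodes without recording anything the queue could later use for requeueing decisions.

```go
package main

import "fmt"

// diagnosis records which plugins rejected the Pod. The scheduling queue later
// uses unschedulablePlugins to decide which cluster events may requeue the Pod.
type diagnosis struct {
	unschedulablePlugins map[string]struct{}
}

type pod struct{ name string }
type node struct{ name string }

// runFilterPlugins stands in for the in-tree Filter plugins: a plugin that
// rejects the Pod records its own name in the diagnosis.
func runFilterPlugins(p *pod, nodes []*node, d *diagnosis) []*node {
	// ... plugin filtering; rejecting plugins add themselves to the set.
	return nodes
}

// filterWithExtenders stands in for the Extender Filter call. It removes Nodes
// but records nothing in the diagnosis -- this is the gap described above.
func filterWithExtenders(p *pod, nodes []*node, d *diagnosis) []*node {
	// Pretend the Extender rejects every Node.
	return nil
}

func main() {
	d := &diagnosis{unschedulablePlugins: map[string]struct{}{}}
	p := &pod{name: "example"}
	nodes := []*node{{name: "node-1"}, {name: "node-2"}}

	feasible := runFilterPlugins(p, nodes, d)
	feasible = filterWithExtenders(p, feasible, d)

	// The Pod ends up unschedulable, yet unschedulablePlugins is empty, so the
	// queue cannot tell that an Extender caused the rejection.
	fmt.Printf("feasible nodes: %d, unschedulable plugins recorded: %d\n",
		len(feasible), len(d.unschedulablePlugins))
}
```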

What did you expect to happen?

We should have a short-term solution for the latter case:
we can requeue Pods that were rejected by an Extender on any kind of cluster event, because we cannot know which events would make those Pods schedulable.

Eventually, this issue makes us wonder how Pods rejected by an Extender should be requeued in the long term. We cannot implement a QueueingHint equivalent in Extenders because calling the Extender every time any cluster event happens would be too slow. Probably we'd need to implement an EventsToRegister equivalent in Extenders somehow?
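
As a sketch of what the short-term rule above could look like: if the record of "who rejected this Pod" contains a marker for the Extender, any cluster event requeues the Pod instead of consulting QueueingHints. The marker name and helpers below are hypothetical, not the real kube-scheduler API.

```go
package main

import "fmt"

// extenderMarker is a hypothetical marker recorded when an Extender rejects a Pod.
const extenderMarker = "Extender"

// queueingHintsAllowRequeue stands in for the per-plugin QueueingHint check.
func queueingHintsAllowRequeue(rejectedBy map[string]struct{}, event string) bool {
	// ... consult each rejecting plugin's QueueingHint for this event.
	return false
}

// shouldRequeue decides whether a cluster event moves an unschedulable Pod
// back to activeQ/backoffQ.
func shouldRequeue(rejectedBy map[string]struct{}, event string) bool {
	if _, ok := rejectedBy[extenderMarker]; ok {
		// Rejected by an Extender: we cannot evaluate the event, so requeue on
		// any event rather than leaving the Pod in the unschedulable pool.
		return true
	}
	return queueingHintsAllowRequeue(rejectedBy, event)
}

func main() {
	rejectedBy := map[string]struct{}{extenderMarker: {}}
	fmt.Println(shouldRequeue(rejectedBy, "NodeAdded")) // true
}
```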

How can we reproduce it (as minimally and precisely as possible)?

Use an Extender that does something in Filter.
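
For example, the following is a minimal sketch of such an Extender: an HTTP server whose Filter verb rejects every Node. The JSON field names are a simplified, assumed view of the v1 extender Filter protocol (with nodeCacheCapable enabled, so only node names are exchanged); point the extenders section of your KubeSchedulerConfiguration at this server with filterVerb set to "filter".

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// extenderArgs is the (simplified) request body the scheduler POSTs to the
// Filter verb when nodeCacheCapable is enabled: only node names are sent.
type extenderArgs struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
}

// extenderFilterResult is the (simplified) response: no feasible Nodes, and
// every incoming Node listed as failed with a reason.
type extenderFilterResult struct {
	NodeNames   *[]string         `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Reject every Node we were asked about.
	failed := map[string]string{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			failed[name] = "rejected by the example extender"
		}
	}

	empty := []string{}
	resp := extenderFilterResult{NodeNames: &empty, FailedNodes: failed}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("failed to encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```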

Anything else we need to know?

No response

Kubernetes version

master

Cloud provider

n/a

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug Categorizes issue or PR as related to a bug. label Nov 23, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 23, 2023
@sanposhiho
Member Author

/assign

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 23, 2023
@neolit123
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 23, 2023
@sanposhiho
Member Author

/reopen

We'll close it when all cherry-picks are done.

@k8s-ci-robot k8s-ci-robot reopened this Dec 14, 2023
@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

We'll close it when all cherry-picks are done.


@sanposhiho
Member Author

/close

Cherry-picks are done.
