
No proper scheduling retries could be made when Extender filters out some Nodes #122019

Closed
sanposhiho opened this issue Nov 23, 2023 · 6 comments · Fixed by #122022
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@sanposhiho
Member

sanposhiho commented Nov 23, 2023

What happened?

When an Extender filters out some Nodes, we don't record any unschedulable plugin for the Pod at all. It means the Extender is completely ignored during the requeueing process.

So, what's happening is:

  • If the Extender filters out all Nodes during scheduling, the Pod is retried soon, because it has no plugin name recorded in its unschedulable plugins.
  • If the Extender filters out some Nodes during scheduling and the plugins filter out all remaining Nodes, the Pod is retried based only on the plugins' QueueingHints. Even if a cluster event happens that could change the Extender's decision (but not any plugin's decision), the Pod won't be requeued to activeQ/backoffQ.

The latter case is serious because it can leave Pods stuck in the unschedulable Pod pool for up to 5 minutes.
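
Below is a self-contained, simplified sketch (in Go) of where the gap sits in the scheduling cycle. All names are illustrative stand-ins for kube-scheduler internals, not the actual types or functions; the point is only that the Extender's Filter call removes Nodes without recording anything the queue could later use for requeueing decisions.

```go
package main

import "fmt"

// diagnosis records which plugins rejected the Pod. The scheduling queue later
// uses unschedulablePlugins to decide which cluster events may requeue the Pod.
type diagnosis struct {
	unschedulablePlugins map[string]struct{}
}

type pod struct{ name string }
type node struct{ name string }

// runFilterPlugins stands in for the in-tree Filter plugins: a plugin that
// rejects the Pod records its own name in the diagnosis.
func runFilterPlugins(p *pod, nodes []*node, d *diagnosis) []*node {
	// ... plugin filtering; rejecting plugins add themselves to the set.
	return nodes
}

// filterWithExtenders stands in for the Extender Filter call. It removes Nodes
// but records nothing in the diagnosis -- this is the gap described above.
func filterWithExtenders(p *pod, nodes []*node, d *diagnosis) []*node {
	// Pretend the Extender rejects every Node.
	return nil
}

func main() {
	d := &diagnosis{unschedulablePlugins: map[string]struct{}{}}
	p := &pod{name: "example"}
	nodes := []*node{{name: "node-1"}, {name: "node-2"}}

	feasible := runFilterPlugins(p, nodes, d)
	feasible = filterWithExtenders(p, feasible, d)

	// The Pod ends up unschedulable, yet unschedulablePlugins is empty, so the
	// queue cannot tell that an Extender caused the rejection.
	fmt.Printf("feasible nodes: %d, unschedulable plugins recorded: %d\n",
		len(feasible), len(d.unschedulablePlugins))
}
```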

What did you expect to happen?

We should have a short-term solution for the latter case:
we can requeue Pods that were rejected by an Extender on any kind of cluster event, because we cannot know which events would make those Pods schedulable.

Eventually, this issue makes us wonder how Pods rejected by an Extender should be requeued in the long term. We cannot implement a QueueingHint equivalent in Extenders because calling the Extender every time any cluster event happens would be too slow. Probably we'd need to implement an EventsToRegister equivalent in Extenders somehow?
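
As a sketch of what the short-term rule above could look like: if the record of "who rejected this Pod" contains a marker for the Extender, any cluster event requeues the Pod instead of consulting QueueingHints. The marker name and helpers below are hypothetical, not the real kube-scheduler API.

```go
package main

import "fmt"

// extenderMarker is a hypothetical marker recorded when an Extender rejects a Pod.
const extenderMarker = "Extender"

// queueingHintsAllowRequeue stands in for the per-plugin QueueingHint check.
func queueingHintsAllowRequeue(rejectedBy map[string]struct{}, event string) bool {
	// ... consult each rejecting plugin's QueueingHint for this event.
	return false
}

// shouldRequeue decides whether a cluster event moves an unschedulable Pod
// back to activeQ/backoffQ.
func shouldRequeue(rejectedBy map[string]struct{}, event string) bool {
	if _, ok := rejectedBy[extenderMarker]; ok {
		// Rejected by an Extender: we cannot evaluate the event, so requeue on
		// any event rather than leaving the Pod in the unschedulable pool.
		return true
	}
	return queueingHintsAllowRequeue(rejectedBy, event)
}

func main() {
	rejectedBy := map[string]struct{}{extenderMarker: {}}
	fmt.Println(shouldRequeue(rejectedBy, "NodeAdded")) // true
}
```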

How can we reproduce it (as minimally and precisely as possible)?

Use an Extender that does something in Filter.
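
For example, the following is a minimal sketch of such an Extender: an HTTP server whose Filter verb rejects every Node. The JSON field names are a simplified, assumed view of the v1 extender Filter protocol (with nodeCacheCapable enabled, so only node names are exchanged); point the extenders section of your KubeSchedulerConfiguration at this server with filterVerb set to "filter".

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// extenderArgs is the (simplified) request body the scheduler POSTs to the
// Filter verb when nodeCacheCapable is enabled: only node names are sent.
type extenderArgs struct {
	NodeNames *[]string `json:"nodenames,omitempty"`
}

// extenderFilterResult is the (simplified) response: no feasible Nodes, and
// every incoming Node listed as failed with a reason.
type extenderFilterResult struct {
	NodeNames   *[]string         `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Reject every Node we were asked about.
	failed := map[string]string{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			failed[name] = "rejected by the example extender"
		}
	}

	empty := []string{}
	resp := extenderFilterResult{NodeNames: &empty, FailedNodes: failed}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("failed to encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```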

Anything else we need to know?

No response

Kubernetes version

master

Cloud provider

n/a

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@sanposhiho sanposhiho added the kind/bug Categorizes issue or PR as related to a bug. label Nov 23, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 23, 2023
@sanposhiho
Member Author

/assign

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 23, 2023
@neolit123
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 23, 2023
@sanposhiho
Member Author

/reopen

We'll close it when all cherry-picks are done.

@k8s-ci-robot k8s-ci-robot reopened this Dec 14, 2023
@k8s-ci-robot
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

We'll close it when all cherry-picks are done.


@sanposhiho
Member Author

/close

Cherry-picks are done.
