dra: resourceclass missing #120213

Merged

Contributor

@pohly pohly commented Aug 28, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.
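
For readers unfamiliar with the mechanism, here is a minimal, self-contained sketch of the idea. It is a toy model, not the scheduler's code: the `pool` type, `park`, `notify` and the simplified `ClusterEvent` are hypothetical names made up for illustration and only loosely mirror the framework types. A pod that fails filtering is parked together with the events that should wake it up, and nothing is retried until a matching event arrives.

```go
package main

import (
	"fmt"
	"sort"
)

// ActionType is a bitmask of event actions (hypothetical, simplified).
type ActionType uint

const (
	Add ActionType = 1 << iota
	Update
)

// ClusterEvent names a resource kind and the actions a plugin cares about.
type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

// pool holds pods that failed filtering, keyed by pod name, together
// with the events that should wake them up again.
type pool struct {
	waiting map[string][]ClusterEvent
}

func newPool() *pool {
	return &pool{waiting: make(map[string][]ClusterEvent)}
}

// park records a pod as unschedulable until one of wakeOn occurs.
func (p *pool) park(pod string, wakeOn []ClusterEvent) {
	p.waiting[pod] = wakeOn
}

// notify removes every pod that registered interest in the given event
// and returns their names in sorted order. No timer is involved.
func (p *pool) notify(resource string, action ActionType) []string {
	var woken []string
	for pod, events := range p.waiting {
		for _, e := range events {
			if e.Resource == resource && e.ActionType&action != 0 {
				woken = append(woken, pod)
				delete(p.waiting, pod) // deleting during range is safe in Go
				break
			}
		}
	}
	sort.Strings(woken)
	return woken
}

func main() {
	p := newPool()
	p.park("pod-a", []ClusterEvent{{Resource: "ResourceClass", ActionType: Add | Update}})

	fmt.Println(p.notify("Node", Add))          // unrelated event: pod stays parked
	fmt.Println(p.notify("ResourceClass", Add)) // the missing class shows up: pod is woken
}
```

In this model the cost of a missing class is one map lookup per cluster event rather than a scheduling attempt per retry interval, which is the efficiency argument made above.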

Special notes for your reviewer:

Includes one other drive-by fix for event registration.

Does this PR introduce a user-facing change?

scheduler: pods that are unschedulable because a ResourceClass is missing are now handled slightly more efficiently, without relying on periodic retries

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/3063

Because of a misplaced `append` (it should have been inside the if clause, not after
it), a handler from a previous loop iteration was added again. This was
harmless because the resulting slice was only used to wait for cache sync,
but it is still worth fixing.
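
The pattern behind that drive-by fix can be sketched as follows. This is an illustrative reduction, not the actual scheduler code: `register`, `collectBuggy`, `collectFixed` and the `enabled` map are hypothetical stand-ins for the real handler registration.

```go
package main

import "fmt"

// register is a hypothetical stand-in for registering an event handler;
// it returns a handle named after the resource.
func register(resource string) string {
	return "handler-for-" + resource
}

// collectBuggy appends after the if clause, so an iteration that skips
// registration re-adds whatever handle the previous iteration left in h.
func collectBuggy(resources []string, enabled map[string]bool) []string {
	var handlers []string
	var h string
	for _, r := range resources {
		if enabled[r] {
			h = register(r)
		}
		handlers = append(handlers, h) // bug: runs even when nothing was registered
	}
	return handlers
}

// collectFixed appends inside the if clause, so only handles that were
// actually registered in this iteration end up in the slice.
func collectFixed(resources []string, enabled map[string]bool) []string {
	var handlers []string
	for _, r := range resources {
		if enabled[r] {
			handlers = append(handlers, register(r))
		}
	}
	return handlers
}

func main() {
	resources := []string{"Pod", "ResourceClass"}
	enabled := map[string]bool{"Pod": true} // ResourceClass handling is disabled

	fmt.Println(collectBuggy(resources, enabled)) // [handler-for-Pod handler-for-Pod]
	fmt.Println(collectFixed(resources, enabled)) // [handler-for-Pod]
}
```

Waiting on the same handle twice is redundant but not incorrect, which matches the observation that the duplicate was harmless for cache sync.
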
@k8s-ci-robot added labels on Aug 28, 2023:
- `release-note`: Denotes a PR that will be considered when it comes time to generate release notes.
- `kind/cleanup`: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
- `size/M`: Denotes a PR that changes 30-99 lines, ignoring generated files.
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `do-not-merge/needs-sig`: Indicates an issue or PR lacks a `sig/foo` label and requires one.
- `needs-triage`: Indicates an issue or PR lacks a `triage/foo` label and requires one.
- `needs-priority`: Indicates a PR lacks a `priority/foo` label and requires one.

Contributor Author

pohly commented Aug 28, 2023

/assign @sanposhiho

@k8s-ci-robot changed labels on Aug 28, 2023:
- added `area/test`
- added `sig/node`: Categorizes an issue or PR as relevant to SIG Node.
- added `sig/scheduling`: Categorizes an issue or PR as relevant to SIG Scheduling.
- added `sig/testing`: Categorizes an issue or PR as relevant to SIG Testing.
- removed `do-not-merge/needs-sig`: Indicates an issue or PR lacks a `sig/foo` label and requires one.

@@ -272,6 +272,8 @@ func (pl *dynamicResources) EventsToRegister() []framework.ClusterEventWithHint
 		// A resource might depend on node labels for topology filtering.
 		// A new or updated node may make pods schedulable.
 		{Event: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel}},
+		// A pod might be waiting for a class to get created or modified.
+		{Event: framework.ClusterEvent{Resource: framework.ResourceClass, ActionType: framework.Add | framework.Update}},

Member

In which scenario should rejected Pods be moved from the unschedulable pod pool to backoffQ/activeQ via Update events? I'm wondering whether it's OK to register only the Add event here.

Contributor Author

The node filter in the class might change. I have a hunch that a pod which is unschedulable because of that filter (no suitable nodes at all) might stay stuck without this Update event, even today (i.e. a bug).

Let me write a test for this scenario.

Contributor Author

It's as I suspected: the new test hangs when the Update event is not registered.
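
To illustrate why registering only Add is not enough, here is a minimal sketch of bitmask-style event matching. The `ActionType` constants and `shouldWake` are hypothetical and simplified; the assumption is that a registered action mask is matched against an incoming event with a bitwise AND, which is how the example models it rather than a statement about the framework's exact implementation.

```go
package main

import "fmt"

// ActionType is a bitmask of event actions (hypothetical, simplified).
type ActionType uint

const (
	Add ActionType = 1 << iota
	Update
)

// shouldWake reports whether a pod whose plugin registered the given
// action mask is woken up by an incoming event of the given action.
func shouldWake(registered, incoming ActionType) bool {
	return registered&incoming != 0
}

func main() {
	// Scenario from this thread: the class already exists, but its node
	// filter excludes every node. Later the class is modified.
	classModified := Update

	fmt.Println("Add only:     ", shouldWake(Add, classModified))        // false: pod stays stuck
	fmt.Println("Add | Update: ", shouldWake(Add|Update, classModified)) // true: pod is retried
}
```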

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Aug 29, 2023
@pohly pohly force-pushed the dra-scheduler-resourceclass-missing branch from 131c818 to a479431 Compare August 29, 2023 09:21
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Aug 29, 2023
Contributor

bart0sh commented Aug 29, 2023

/triage accepted
/priority important-longterm

@k8s-ci-robot changed labels on Aug 29, 2023:
- added `triage/accepted`: Indicates an issue or PR is ready to be actively worked on.
- added `priority/important-longterm`: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- removed `needs-triage`: Indicates an issue or PR lacks a `triage/foo` label and requires one.
- removed `needs-priority`: Indicates a PR lacks a `priority/foo` label and requires one.

@pohly pohly force-pushed the dra-scheduler-resourceclass-missing branch from a479431 to 16d3f2a Compare August 30, 2023 06:39
@mmiranda96 mmiranda96 moved this from Triage to Archive-it in SIG Node CI/Test Board Aug 30, 2023
@pohly pohly force-pushed the dra-scheduler-resourceclass-missing branch from 16d3f2a to c682d2b Compare September 6, 2023 09:14
Contributor Author

pohly commented Sep 6, 2023

@sanposhiho: can you take another look? The PR should be complete now and I fixed the gofmt issue.

Contributor Author

pohly commented Sep 6, 2023

/retest

Member

@sanposhiho sanposhiho left a comment

/lgtm
/approve
/retest

@k8s-ci-robot added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) on Sep 6, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 2179a2b696669f856adf43eb79f16dab6182acba

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, sanposhiho

@k8s-ci-robot added the `approved` label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 6, 2023
Member

pacoxu commented Sep 7, 2023

/retest-required

Member

pacoxu commented Sep 7, 2023

/retest
oomkiller failure is a top flake

@k8s-ci-robot k8s-ci-robot merged commit 2d5b6f1 into kubernetes:master Sep 7, 2023
18 checks passed
SIG Node CI/Test Board automation moved this from Archive-it to Done Sep 7, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Sep 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 7, 2023
Labels
- `approved`: Indicates a PR has been approved by an approver from all required OWNERS files.
- `area/test`
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/cleanup`: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
- `lgtm`: "Looks good to me", indicates that a PR is ready to be merged.
- `priority/important-longterm`: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- `release-note`: Denotes a PR that will be considered when it comes time to generate release notes.
- `sig/node`: Categorizes an issue or PR as relevant to SIG Node.
- `sig/scheduling`: Categorizes an issue or PR as relevant to SIG Scheduling.
- `sig/testing`: Categorizes an issue or PR as relevant to SIG Testing.
- `size/M`: Denotes a PR that changes 30-99 lines, ignoring generated files.
- `triage/accepted`: Indicates an issue or PR is ready to be actively worked on.