dra: resourceclass missing #120213

pohly · 2023-08-28T16:05:37Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

When filtering fails because a ResourceClass is missing, we can treat the pod
as "unschedulable" as long as we then also register a cluster event that wakes
up the pod. This is more efficient than periodically retrying.

Special notes for your reviewer:

Includes one other drive-by fix for event registration.

Does this PR introduce a user-facing change?

scheduler: handling of unschedulable pods because a ResourceClass is missing is a bit more efficient and no longer relies on periodic retries

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/3063

Because of a misplaced `append` (it should have been inside the if clause, not after it), a handler from a previous loop iteration was added again. This was harmless because the resulting slice was only used to wait for cache sync, but it is worth fixing anyway.
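The drive-by fix can be illustrated with a minimal, self-contained sketch. This is not the scheduler code: the `resource` type, the handler names, and both `collect` functions are hypothetical and only reproduce the shape of the bug, where an `append` placed after the if clause re-adds the value left over from an earlier iteration.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for one informer the loop iterates over.
type resource struct {
	name    string
	enabled bool // whether a handler gets registered for this resource
}

// collectBuggy appends after the if clause, so an iteration that registers
// nothing still appends the handler left over from an earlier iteration.
func collectBuggy(resources []resource) []string {
	var handlers []string
	var h string
	for _, r := range resources {
		if r.enabled {
			h = "handler-" + r.name
		}
		handlers = append(handlers, h) // misplaced: runs on every iteration
	}
	return handlers
}

// collectFixed appends inside the if clause, so only handlers that were
// actually registered end up in the slice.
func collectFixed(resources []resource) []string {
	var handlers []string
	for _, r := range resources {
		if r.enabled {
			h := "handler-" + r.name
			handlers = append(handlers, h)
		}
	}
	return handlers
}

func main() {
	resources := []resource{{"a", true}, {"b", false}, {"c", true}}
	fmt.Println(collectBuggy(resources)) // [handler-a handler-a handler-c]
	fmt.Println(collectFixed(resources)) // [handler-a handler-c]
}
```

In this sketch the duplicate entry has no effect beyond being checked twice, which matches the description above of why the original bug was harmless.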

pohly · 2023-08-28T16:05:48Z

/assign @sanposhiho

sanposhiho · 2023-08-29T00:33:37Z

pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go

@@ -272,6 +272,8 @@ func (pl *dynamicResources) EventsToRegister() []framework.ClusterEventWithHint
 		// A resource might depend on node labels for topology filtering.
 		// A new or updated node may make pods schedulable.
 		{Event: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel}},
+		// A pod might be waiting for a class to get created or modified.
+		{Event: framework.ClusterEvent{Resource: framework.ResourceClass, ActionType: framework.Add | framework.Update}},


In which scenario should rejected Pods be moved from unschedulable to backoffQ/activeQ via Update events? I'm wondering whether it's OK to register only the Add event here.

The node filter in the class might change. I have a hunch that a pod which is unschedulable because no node is suitable would then stay stuck without this Update event, and that this already happens today (i.e. a bug).

Let me write a test for this scenario.

It's as I suspected: the new test hangs when the Update event is not registered.
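Why the Update registration matters can be shown with a toy model of the queueing behavior discussed in this thread. It is a sketch under stated assumptions, not the scheduler implementation: the `queue`, `clusterEvent`, and `actionType` types are hypothetical simplifications, and the model ignores backoff and any periodic flushing. A parked pod is moved to the active queue only when an incoming event matches a registered resource and action.

```go
package main

import "fmt"

// actionType is a hypothetical bit mask, loosely modeled on framework.ActionType.
type actionType int

const (
	add actionType = 1 << iota
	update
)

// clusterEvent is a hypothetical registration: a resource plus the actions
// that should wake up unschedulable pods.
type clusterEvent struct {
	resource string
	actions  actionType
}

// matches reports whether an incoming event is covered by this registration.
func (reg clusterEvent) matches(resource string, action actionType) bool {
	return reg.resource == resource && reg.actions&action != 0
}

// queue is a toy scheduling queue with a pool of parked pods.
type queue struct {
	registered    []clusterEvent
	unschedulable []string
	active        []string
}

// onEvent moves all parked pods to the active queue if the event is registered.
func (q *queue) onEvent(resource string, action actionType) {
	for _, reg := range q.registered {
		if reg.matches(resource, action) {
			q.active = append(q.active, q.unschedulable...)
			q.unschedulable = nil
			return
		}
	}
}

func main() {
	addOnly := &queue{
		registered:    []clusterEvent{{resource: "ResourceClass", actions: add}},
		unschedulable: []string{"pod-a"},
	}
	addAndUpdate := &queue{
		registered:    []clusterEvent{{resource: "ResourceClass", actions: add | update}},
		unschedulable: []string{"pod-a"},
	}

	// The class's node filter changes: an Update event arrives.
	addOnly.onEvent("ResourceClass", update)
	addAndUpdate.onEvent("ResourceClass", update)

	fmt.Println(len(addOnly.active), len(addAndUpdate.active)) // 0 1
}
```

With only Add registered, the pod in this model stays parked after the class is updated, which mirrors the hanging test described above.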

bart0sh · 2023-08-29T09:44:07Z

/triage accepted
/priority important-longterm


pohly · 2023-09-06T09:15:23Z

@sanposhiho: can you take another look? The PR should be complete now and I fixed the gofmt issue.

pohly · 2023-09-06T10:37:49Z

/retest

sanposhiho

/lgtm
/approve
/retest

k8s-ci-robot · 2023-09-06T19:33:33Z

LGTM label has been added.

Git tree hash: 2179a2b696669f856adf43eb79f16dab6182acba

k8s-ci-robot · 2023-09-06T19:33:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [sanposhiho]
~~test/e2e/dra/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pacoxu · 2023-09-07T01:59:15Z

/retest-required

pacoxu · 2023-09-07T04:03:53Z

/retest
oomkiller failure is a top flake

k8s-ci-robot assigned sanposhiho Aug 28, 2023

k8s-ci-robot requested review from bart0sh and damemi August 28, 2023 16:06

pohly mentioned this pull request Aug 28, 2023

dynamic resource allocation: optimize class.SuitableNodes usage #114685

Merged

sanposhiho reviewed Aug 29, 2023


bart0sh added this to Triage in SIG Node PR Triage Aug 29, 2023

pohly force-pushed the dra-scheduler-resourceclass-missing branch from 131c818 to a479431 Compare August 29, 2023 09:21

bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Aug 29, 2023

pohly force-pushed the dra-scheduler-resourceclass-missing branch from a479431 to 16d3f2a Compare August 30, 2023 06:39

SergeyKanzhelev added this to Triage in SIG Node CI/Test Board Aug 30, 2023

mmiranda96 moved this from Triage to Archive-it in SIG Node CI/Test Board Aug 30, 2023

pacoxu mentioned this pull request Sep 5, 2023

dynamic resource allocation kubernetes/enhancements#3063

Open

34 tasks

scheduler: add ResourceClass events

c682d2b

When filtering fails because a ResourceClass is missing, we can treat the pod as "unschedulable" as long as we then also register a cluster event that wakes up the pod. This is more efficient than periodically retrying.

pohly force-pushed the dra-scheduler-resourceclass-missing branch from 16d3f2a to c682d2b Compare September 6, 2023 09:14

sanposhiho approved these changes Sep 6, 2023


k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2023

k8s-ci-robot merged commit 2d5b6f1 into kubernetes:master Sep 7, 2023
18 checks passed

SIG Node CI/Test Board automation moved this from Archive-it to Done Sep 7, 2023

SIG Node PR Triage automation moved this from Needs Reviewer to Done Sep 7, 2023

k8s-ci-robot added this to the v1.29 milestone Sep 7, 2023
