
Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails #114005

Closed

uniemimu opened this issue Nov 18, 2022 · 8 comments · Fixed by #118209
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@uniemimu
Contributor

What happened?

The combination of using the Pod spec nodeName field with the DynamicResources scheduler plugin (currently feature-gated) doesn't work. No surprise really, as the nodeName field effectively bypasses the DynamicResources scheduler plugin entirely, and something rather clever would need to be invented to make this work. But this is still a bit of a regression compared to the old device plugin model, which doesn't have a similar limitation. @pohly as the author of the dynamic resources feature can explain what could maybe be created to fix this.

Granted, forcing scheduling even with a device plugin resource may or may not succeed at the node level, depending on the availability of those plugin resources on the node, but with dynamic resources the failure currently seems to be guaranteed.

The use case here is that the Kubernetes scheduler is effectively preceded by some other scheduler which runs earlier (say, some HPC framework). That scheduler assigns the node names to the Pods and somehow knows that there will be enough resources; the workloads then ought to land on the specified node, and the container-related work is handled normally from there on. So Kubernetes wouldn't do much in terms of scheduling, but would take over from that point.

What did you expect to happen?

TBH I expected it to fail, and it did. The resource claims won't work, since the scheduler plugin never gets a chance to do its thing.

How can we reproduce it (as minimally and precisely as possible)?

Just use DRA together with the Pod spec nodeName field; a DRA driver is obviously needed. A minimal reproducer is sketched below.
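
For illustration, something roughly like the following should reproduce it. The resource class name, claim template name and node name are made up and depend on whichever DRA driver and cluster you have; API version as in the 1.26 resource.k8s.io/v1alpha1 alpha API:

$ kubectl apply -f - <<EOF
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: example-claim-template
spec:
  spec:
    resourceClassName: example.com-gpu       # whatever class the installed DRA driver provides
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-nodename-test
spec:
  nodeName: worker-node-1                    # bypasses kube-scheduler and thus the DynamicResources plugin
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: example-claim-template
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu
EOF

With nodeName set, nothing ever allocates the claim or reserves it for the pod, so the pod never starts on the node.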

Anything else we need to know?

No response

Kubernetes version

Working with the dynamic-resource-allocation branch from @pohly, which recently got merged for the upcoming 1.26 release
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"5932fc8c", GitTreeState:"clean", BuildDate:"2022-11-15T12:54:32Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"5932fc8c", GitTreeState:"clean", BuildDate:"2022-11-15T13:06:30Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@uniemimu uniemimu added the kind/bug Categorizes issue or PR as related to a bug. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2022
@pohly
Contributor

pohly commented Nov 18, 2022

/triage accepted
/priority backlog
/sig node

One possible solution would be to have a new controller which specifically watches for pods in this state (nodeName set and referencing claims). That controller then must trigger delayed allocation (if needed) by creating a PodScheduling object with selectedNode set. After allocation it must add the pod to the reservedFor list.

kubelet will immediately try to run the pod once it sees it, because of the nodeName field. It will keep refusing to run it until eventually all claims are allocated and reserved.
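
For illustration only (no such controller exists today), the objects such a controller would manage might look roughly as follows, using the 1.26 resource.k8s.io/v1alpha1 types and the made-up names from the reproducer above:

# PodScheduling object the controller would create (same name/namespace as the pod, owned by it):
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: dra-nodename-test
  namespace: default
spec:
  selectedNode: worker-node-1      # copied from pod.spec.nodeName to trigger delayed allocation

# ...and, once the DRA driver has allocated the claim, a patch of the ResourceClaim status
# to reserve it for the pod:
status:
  reservedFor:
  - resource: pods
    name: dra-nodename-test
    uid: <pod UID>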

Obviously using a scheduler which is not aware of dynamic resources is problematic. There is no guarantee that the chosen node even has the hardware that is needed for the pod and that it isn't in use. It would be good to hear from whoever has this use case whether the solution outlined above would work in practice.

The alternative is to build support for dynamic resource allocation into that other scheduler.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2022
@sftim
Contributor

sftim commented Dec 3, 2022

/retitle Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails

@k8s-ci-robot k8s-ci-robot changed the title Scheduling Pods with dynamic resources and using the Pod spec nodeName field fails Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails Dec 3, 2022
@alculquicondor
Member

Should this be a beta blocker? Or do we just want to mark this as unsupported behavior long-term?

@sftim
Contributor

sftim commented Feb 14, 2023

My approach: provide a sample / “contrib” implementation of a restriction that prohibits specifying .spec.nodeName - other than by binding the Pod - for a Pod that also requests dynamic resources.

If ValidatingAdmissionPolicy gains a way to restrict this, we can provide a contrib ValidatingAdmissionPolicy as well, and recommend that tools for managing cluster lifecycle put that in place.
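
For illustration, a contrib policy might look roughly like the sketch below, assuming that matching only pod CREATE requests is enough to leave binding (via the pods/binding subresource) alone; whether CEL can express the restriction well enough is exactly the open question above. Field names are per the alpha admissionregistration.k8s.io API of that time:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: no-nodename-with-dynamic-resources
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  validations:
  - expression: "!has(object.spec.nodeName) || !has(object.spec.resourceClaims)"
    message: "Pods that request dynamic resources must not set .spec.nodeName directly."

A matching ValidatingAdmissionPolicyBinding would still be needed to put the policy into effect.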

That'd be enough to move to beta, IMO.

@pohly
Contributor

pohly commented Feb 14, 2023

Should this be a beta blocker?

I'm not sure whether it needs to be a blocker, but my intention is to work on this soonish - definitely before beta.

@pohly
Contributor

pohly commented Feb 14, 2023

I intend to support this instead of prohibiting it.

@alculquicondor
Member

It sounds like it could be a non-trivial change to the design. Once you have something, please send a PR to the KEP.

@pohly
Contributor

pohly commented Jun 12, 2023

Code PR: #118209
KEP PR: kubernetes/enhancements#4063
