
Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails #114005

Closed

uniemimu opened this issue Nov 18, 2022 · 8 comments · Fixed by #118209
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@uniemimu
Contributor

What happened?

The combination of using the Pod spec nodeName field with the DynamicResources scheduler plugin (currently feature-gated) doesn't work. No surprise really, as the nodeName field effectively bypasses the DynamicResources scheduler plugin entirely, and something rather clever would need to be invented to make this work. But this is still a bit of a regression compared to the old device plugin model, which doesn't have a similar limitation. @pohly as the author of the dynamic resources feature can explain what could maybe be created to fix this.

Granted, forcing scheduling even with a device plugin resource may or may not succeed at the node level, depending on the availability of those plugin resources on the node, but with dynamic resources the failure currently seems to be guaranteed.

The use case here is that the Kubernetes scheduler is effectively preceded by some other scheduler which runs earlier (say, some HPC framework). That scheduler assigns the node names to the Pods and somehow knows that there will be enough resources; the workloads then ought to land on the specified node, and the container-related work is handled normally from there on. So Kubernetes wouldn't do much in terms of scheduling, but would take over from that point.

What did you expect to happen?

TBH I expected it to fail, and it did. The resource claims won't work, since the scheduler plugin never gets a chance to do its thing.

How can we reproduce it (as minimally and precisely as possible)?

Just use DRA together with the Pod spec nodeName field; a DRA driver is obviously needed. A minimal reproducer is sketched below.
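
For illustration, something roughly like the following should reproduce it. The resource class name, claim template name and node name are made up and depend on whichever DRA driver and cluster you have; API version as in the 1.26 resource.k8s.io/v1alpha1 alpha API:

$ kubectl apply -f - <<EOF
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: example-claim-template
spec:
  spec:
    resourceClassName: example.com-gpu       # whatever class the installed DRA driver provides
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-nodename-test
spec:
  nodeName: worker-node-1                    # bypasses kube-scheduler and thus the DynamicResources plugin
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: example-claim-template
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu
EOF

With nodeName set, nothing ever allocates the claim or reserves it for the pod, so the pod never starts on the node.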

Anything else we need to know?

No response

Kubernetes version

Working with the dynamic-resource-allocation branch from @pohly, which recently got merged for the upcoming 1.26 release
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"5932fc8c", GitTreeState:"clean", BuildDate:"2022-11-15T12:54:32Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"5932fc8c", GitTreeState:"clean", BuildDate:"2022-11-15T13:06:30Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@uniemimu uniemimu added the kind/bug Categorizes issue or PR as related to a bug. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2022
@pohly
Contributor

pohly commented Nov 18, 2022

/triage accepted
/priority backlog
/sig node

One possible solution would be to have a new controller which specifically watches for pods in this state (nodeName set and referencing claims). That controller then must trigger delayed allocation (if needed) by creating a PodScheduling object with selectedNode set. After allocation it must add the pod to the reservedFor list.

kubelet will immediately try to run the pod once it sees it, because of the nodeName field. It will keep refusing to run it until eventually all claims are allocated and reserved.
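
For illustration only (no such controller exists today), the objects such a controller would manage might look roughly as follows, using the 1.26 resource.k8s.io/v1alpha1 types and the made-up names from the reproducer above:

# PodScheduling object the controller would create (same name/namespace as the pod, owned by it):
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: dra-nodename-test
  namespace: default
spec:
  selectedNode: worker-node-1      # copied from pod.spec.nodeName to trigger delayed allocation

# ...and, once the DRA driver has allocated the claim, a patch of the ResourceClaim status
# to reserve it for the pod:
status:
  reservedFor:
  - resource: pods
    name: dra-nodename-test
    uid: <pod UID>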

Obviously using a scheduler which is not aware of dynamic resources is problematic. There is no guarantee that the chosen node even has the hardware that is needed for the pod and that it isn't in use. It would be good to hear from whoever has this use case whether the solution outlined above would work in practice.

The alternative is to build support for dynamic resource allocation into that other scheduler.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2022
@sftim
Contributor

sftim commented Dec 3, 2022

/retitle Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails

@k8s-ci-robot k8s-ci-robot changed the title Scheduling Pods with dynamic resources and using the Pod spec nodeName field fails Scheduling Pods with dynamic resources and using the Pod .spec.nodeName field fails Dec 3, 2022
@alculquicondor
Member

Should this be a beta blocker? Or do we just want to mark this as unsupported behavior long-term?

@sftim
Contributor

sftim commented Feb 14, 2023

My approach: provide a sample / “contrib” implementation of a restriction that prohibits specifying .spec.nodeName - other than by binding the Pod - for a Pod that also requests dynamic resources.

If ValidatingAdmissionPolicy gains a way to restrict this, we can provide a contrib ValidatingAdmissionPolicy as well, and recommend that tools for managing cluster lifecycle put that in place.
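
For illustration, a contrib policy might look roughly like the sketch below, assuming that matching only pod CREATE requests is enough to leave binding (via the pods/binding subresource) alone; whether CEL can express the restriction well enough is exactly the open question above. Field names are per the alpha admissionregistration.k8s.io API of that time:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: no-nodename-with-dynamic-resources
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  validations:
  - expression: "!has(object.spec.nodeName) || !has(object.spec.resourceClaims)"
    message: "Pods that request dynamic resources must not set .spec.nodeName directly."

A matching ValidatingAdmissionPolicyBinding would still be needed to put the policy into effect.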

That'd be enough to move to beta, IMO.

@pohly
Contributor

pohly commented Feb 14, 2023

Should this be a beta blocker?

I'm not sure whether it needs to be a blocker, but my intention is to work on this soonish - definitely before beta.

@pohly
Contributor

pohly commented Feb 14, 2023

I intend to support this instead of prohibiting it.

@alculquicondor
Member

It sounds like it could be a non-trivial change to the design. Once you have something, please send a PR to the KEP.

@pohly
Contributor

pohly commented Jun 12, 2023

Code PR: #118209
KEP PR: kubernetes/enhancements#4063
