dynamic resource allocation #3063

Open
25 of 34 tasks
pohly opened this issue Nov 30, 2021 · 115 comments
Assignees
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status

Comments

@pohly
Contributor

pohly commented Nov 30, 2021

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 30, 2021
@pohly
Contributor Author

pohly commented Nov 30, 2021

/assign @pohly
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2021
@ahg-g
Member

ahg-g commented Dec 20, 2021

do we have a discussion issue on this enhancement?

@pohly
Contributor Author

pohly commented Jan 10, 2022

@ahg-g: with discussion issue you mean a separate issue in some repo (where?) in which arbitrary comments are welcome?

No, not at the moment. I've also not seen that done elsewhere before. IMHO at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting.

@ahg-g
Member

ahg-g commented Jan 10, 2022

@ahg-g: with discussion issue you mean a separate issue in some repo (where?) in which arbitrary comments are welcome?

Yeah, this is what I was looking for, the issue would be under k/k repo.

No, not at the moment. I've also not seen that done elsewhere before.

That is actually the common practice, one starts a feature request issue where the community discusses initial ideas and the merits of the request (look for issues with label kind/feature). That is what I would expect in the discussion link.

IMHO at this point the open KEP PR is a good place to collect feedback and questions. I also intend to come to the next SIG-Scheduling meeting.

But the community has no idea what this is about yet, so it is better to have an issue that discusses "What would you like to be added?" and "Why is this needed?" beforehand. Also, meetings are attended by fairly small groups of contributors, so having an issue tracking the discussion is important IMO.

@pohly
Contributor Author

pohly commented Jan 10, 2022

In my work in SIG-Storage I've not seen much use of such a discussion issue. Instead I had the impression that the usage of "kind/feature" is discouraged nowadays.

https://github.com/kubernetes/kubernetes/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.yaml explicitly says

Feature requests are unlikely to make progress as issues. Please consider engaging with SIGs on slack and mailing lists, instead. A proposal that works through the design along with the implications of the change can be opened as a KEP.

This proposal was discussed with various people beforehand, now we are in the formal KEP phase. But I agree, it is hard to provide a good link to those prior discussions.

@ahg-g
Member

ahg-g commented Jan 10, 2022

We use that in sig-scheduling, and it does serve as a very good place for initial rounds of discussions; discussions on Slack and in meetings are hard to reference, as you pointed out.

I still have no idea what this is proposing, and I may not attend the next sig meeting for example...

@gracenng gracenng added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Jan 17, 2022
@gracenng gracenng added this to the v1.24 milestone Jan 17, 2022
@gracenng
Member

gracenng commented Jan 30, 2022

Hi! 1.24 Enhancements team here.
Checking in as we approach enhancements freeze in less than a week, at 18:00 PT on Thursday, Feb 3rd.
Here’s where this enhancement currently stands:

  • Updated KEP file using the latest template has been merged into the k/enhancements repo. KEP-3063: dynamic resource allocation #3064
  • KEP status is marked as implementable for this release with latest-milestone: 1.24
  • KEP has a test plan section filled out.
  • KEP has up to date graduation criteria.
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

The status of this enhancement is tracked as at risk.
Thanks!

@gracenng gracenng added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Feb 4, 2022
@gracenng
Member

gracenng commented Feb 4, 2022

The Enhancements Freeze is now in effect and this enhancement is removed from the release.
Please feel free to file an exception.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Feb 4, 2022
@gracenng
Member

gracenng commented Mar 1, 2022 via email

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2022
@kerthcet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@dchen1107 dchen1107 added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Jun 9, 2022
@dchen1107 dchen1107 added this to the v1.25 milestone Jun 9, 2022
@Priyankasaggu11929 Priyankasaggu11929 added tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team and removed tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team labels Jun 10, 2022
@marosset
Contributor

Hello @pohly 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze at 18:00 PST on Thursday, June 16, 2022.

Note: this enhancement is targeting stage alpha for 1.25 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP file using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable
  • KEP has an updated detailed test plan section filled out
  • KEP has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

It looks like #3064 will address everything in this list.

Note that the status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@marosset
Contributor

Hello @pohly 👋, just a quick check-in again, as we approach the 1.25 enhancements freeze.

Please plan to get #3064 reviewed and merged before enhancements freeze on Thursday, June 23, 2022 at 18:00 PT.

Note that the current status of the enhancement is at-risk. Thank you!

@sftim
Contributor

sftim commented Jan 29, 2024

My thinking: if .spec.resourceClaims goes beta for v1.30, the associated feature gate should be off by default. However, we can also ask cluster lifecycle tools to consider ways to help make it easy for people to experiment with DRA.

If we make something be enabled by default, and we one day find we regret doing that, then there's a lot more trouble for the cluster administrators who have to cope with reversion.

Even better, IMO: leave it all alpha until we know what the APIs all look like and we are confident they could and would work without changes.

@johnbelamaric
Member

johnbelamaric commented Jan 30, 2024

Yeah, I hear you @sftim. It's probably too much change in the code to go straight to beta. In that case, I think the target path would be:

1.30

  • Base KEP alpha
  • Template KEP alpha

1.31

  • Base KEP beta
  • Template KEP alpha2 (unless things go amazingly well and we go beta)
  • Escape Hatch KEP alpha (if we need it)

1.32

  • Base KEP stays beta (aspirationally GA?)
  • Template KEP beta
  • Escape Hatch KEP alpha2 (if we need it)

1.33

  • Base KEP GA
  • Template KEP stays beta (aspirationally GA)
  • Escape Hatch KEP alpha2 (if we need it)

@thockin
Member

thockin commented Jan 30, 2024

I appreciate the breakdown. That said -- beta doesn't really exist. There's alpha (off by default), GA with low-confidence, and GA with high(er) confidence. I'm very reluctant to "beta" (GA with low confidence) this if we don't have a plan for how it will evolve to support autoscaling.

@johnbelamaric
Copy link
Member

Template KEP is that plan

@thockin
Member

thockin commented Jan 30, 2024

I will keep reading

@pohly
Contributor Author

pohly commented Jan 30, 2024

Let's scope down this base KEP to a bare minimum viable product, which would target beta in 1.30 (this may be controversial):

  • ResourceClaim, with the addition of an optional nodeName field so the node can be manually determined when necessary
  • PodSpec.ResourceClaims field
    ...
    It's a little painful for Deployments, because you have to manually pre-create the resource claims associated with specific nodes, label those nodes, and then use that in the nodeSelector field of your deployment Pod template.

I don't find that a "minimum viable product". No-one is going to use this, so we are not going to get more feedback even if we do promote this subset to beta. It also sounds like we need to implement new functionality that never was available as alpha, so how can we go to beta with it straight away?

The other downside is that we have to start adding more feature gate checks for specific fields, with all the associated logic (drop alpha fields, but only if not already set). This is adding work and complexity, and thus a risk to introduce new bugs.
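
The "drop alpha fields, but only if not already set" logic mentioned above could look roughly like the following. This is a minimal sketch in Python, not the actual Kubernetes code; the function name, the dict-based spec, and the way `resourceClaims` is handled here are illustrative assumptions.

```python
from copy import deepcopy
from typing import Optional


def drop_disabled_fields(new_spec: dict, old_spec: Optional[dict],
                         gate_enabled: bool, field: str) -> dict:
    """Return a copy of new_spec without `field` when its gate is off.

    The field is kept if the stored object (old_spec) already uses it, so
    that turning a gate off does not strip data from existing objects.
    """
    result = deepcopy(new_spec)
    already_in_use = old_spec is not None and bool(old_spec.get(field))
    if not gate_enabled and not already_in_use:
        result.pop(field, None)
    return result


# Hypothetical pod spec that uses the gated field.
spec = {"containers": ["app"], "resourceClaims": [{"name": "gpu"}]}

# Create with the gate off: the field is dropped.
created = drop_disabled_fields(spec, None, gate_enabled=False, field="resourceClaims")
# Update of an object that already had the field: it is preserved.
updated = drop_disabled_fields(spec, spec, gate_enabled=False, field="resourceClaims")
# Gate on: nothing is dropped.
enabled = drop_disabled_fields(spec, None, gate_enabled=True, field="resourceClaims")
```

Every additional per-field gate needs a check of this kind plus coverage for the create and update paths, which is the extra work and risk referred to above.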

If we have to reduce the scope for beta, then I would slice up the KEP differently if (and only if) needed. But I am not (EDIT) going to dive into the how because of this:

I asked in #4384 (comment) how many different feature gates we need in 1.30 when everything is still alpha. Let me repeat the key point: perhaps we don't need to decide now?

We could continue to use the existing DynamicResourceAllocation feature gate for everything. Then before promotion to beta, we add additional feature gates for things that remain in alpha. It would change how things get enabled compared to 1.30, but IMHO that is okay because it is an alpha feature, which can change from one release to the next.

The practical advantage is that for 1.30 we can skip the entire discussion around how to promote this and instead have that discussion later, for example in a working session at the KubeCon EU 2024 contributor summit (I have submitted a session proposal). It also makes the 1.30 implementation simpler (no additional feature gate checks).

@pohly
Contributor Author

pohly commented Jan 30, 2024

We split out ResourceClaimTemplate into a new KEP, which would target alpha in 1.30.
[...]
Template KEP is that plan [for autoscaling]

The ResourceClaimTemplate is not what enables autoscaling. It solves the problem of per-pod resource claims when pods get generated by an app controller. This part also doesn't seem to be controversial, at least not anymore after I changed to dynamically generated names 😉.
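
To illustrate the per-pod claim point: a controller could stamp out one claim per generated pod from a shared template, each with a dynamically generated name. This is only a rough sketch; the dict layout is simplified, and `claim_for_pod`, the `owner` shorthand, and the class name are made up for illustration rather than taken from the KEP.

```python
import uuid
from copy import deepcopy


def claim_for_pod(template: dict, pod_name: str, pod_claim_name: str) -> dict:
    """Stamp out one ResourceClaim for one pod from a shared template."""
    suffix = uuid.uuid4().hex[:5]  # stands in for a server-generated name suffix
    return {
        "kind": "ResourceClaim",
        "metadata": {
            "name": f"{pod_name}-{pod_claim_name}-{suffix}",
            "owner": pod_name,  # hypothetical shorthand for an owner reference
        },
        "spec": deepcopy(template["spec"]),
    }


template = {
    "kind": "ResourceClaimTemplate",
    "spec": {"resourceClassName": "example-gpu-class"},  # made-up class name
}

pods = ["web-0", "web-1", "web-2"]
claims = [claim_for_pod(template, pod, "gpu") for pod in pods]
```

Each pod ends up with its own claim, so the pods are not forced to share one hardware resource.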

My plan for supporting autoscaling is numeric parameters.

@johnbelamaric
Member

We split out ResourceClaimTemplate into a new KEP, which would target alpha in 1.30.
[...]
Template KEP is that plan [for autoscaling]

The ResourceClaimTemplate is not what enables autoscaling. It solves the problem of per-pod resource claims when pods get generated by an app controller. This part also doesn't seem to be controversial, at least not anymore after I changed to dynamically generated names 😉.

My plan for supporting autoscaling is numeric parameters.

Yes - in the breakdown, the template and numerical parameter functionality is combined into one KEP. That's what I meant when I said that KEP is the plan. What's "controversial" isn't the template API per se, but the way it introduces complexity with scheduling. The numerical parameters will reduce that considerably.

I agree it was too aggressive to suggest even the scoped down thing in 1.30 for beta. You may be right that we can postpone the debate since we are staying all in alpha. But if we want a chance of delivering the solution in smaller, digestible chunks, I think we have to work out the right API now, which I don't think is quite there yet even for the basic ResourceClaim.

My suggestion is that the user-owned resource claim API is under-specified as written, because instead of the user specifying the node, it randomly picks one during scheduling. So, it's sort of unusable in the manual flow except for network-attached resources. Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work. And I do think if you give people an API that solves their use case, even with a little more manual prep-work / client-side work, people will use it.

Along those lines, the change is small. You just need to require the user to pick a node during the creation of the ResourceClaim (for non-network attached resources), and then users can pre-provision nodes with pools of associated resources, and label those sets of nodes. This makes it an actual usable API, and makes the functionality composable: the automation (templates) builds directly on top of the manual process.

In fact, I think we can even push delayed allocation out-of-scope for the MVP, and still have something very useful. Typical UX would be:

  • Users evaluate their workloads and decide on the set of resources needed on each node.
  • Users manually select the set of nodes on which they expect those workloads to run, and label those sets of nodes.
  • Users pre-provision the resources on those nodes using ResourceClaims that specify those nodes.
  • Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources.

This is a reasonable UX which will certainly be used. The scope of this is much, much simpler and smaller than the current base DRA KEP.
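
The manual flow in the list above can be sketched as plain manifests built in Python. This is a rough illustration only: the `nodeName` field on the claim is the optional field proposed in this thread, not a released API, and the label key, class name, and helper functions are hypothetical. How each pod is tied to the claim on its node is not spelled out in the list, so the sketch leaves that part out.

```python
def label_nodes(nodes: dict, selected: set, key: str, value: str) -> dict:
    """Return node labels with key=value added on the selected nodes."""
    return {
        name: ({**labels, key: value} if name in selected else dict(labels))
        for name, labels in nodes.items()
    }


def claims_for_nodes(selected: set, class_name: str) -> list:
    """One pre-provisioned ResourceClaim per selected node."""
    return [
        {
            "kind": "ResourceClaim",
            "metadata": {"name": f"gpu-{node}"},
            # nodeName is the field proposed in this thread; it is an
            # assumption here, not part of a released API.
            "spec": {"resourceClassName": class_name, "nodeName": node},
        }
        for node in sorted(selected)
    ]


def deployment_for_pool(name: str, replicas: int, key: str, value: str) -> dict:
    """Deployment whose pods are restricted to the labeled node pool."""
    return {
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "template": {"spec": {"nodeSelector": {key: value}}},
        },
    }


nodes = {"node-a": {}, "node-b": {}, "node-c": {}}
pool = {"node-a", "node-b"}
labeled = label_nodes(nodes, pool, "example.com/gpu-pool", "training")
claims = claims_for_nodes(pool, "example-gpu-class")
deployment = deployment_for_pool("trainer", 2, "example.com/gpu-pool", "training")
```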

@sftim
Contributor

sftim commented Jan 30, 2024

We can build on #3063 (comment) with a focused follow-up change to PodSchedulingContext: one that allows kubelets to demur to accept the Pod for arbitrary reasons.

In other words, a kubelet could look at the existing attached resources, and the node as it's running right now, and inform the control plane that there's no such GPU, or that a different Pod is already using that NUMA partition, or that the phase of the moon is wrong…

At that stage, this doesn't need to mean clever scheduling and doesn't actually count as dynamically allocating any resources. Maybe all the candidate nodes decline and the scheduler eventually gives up trying. Cluster autoscalers wouldn't be trying to make new nodes because the nodeSelector serves as proof that it doesn't help. In this story, a Pod that doesn't find a home on any node doesn't find a home, and doesn't run.

It's basic. However, just as @johnbelamaric explained, it's useful to some folk. The ability for a kubelet to demur through an update to PodSchedulingContext would support a bunch of related user stories, even if there are many others that still need work.
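
As a rough sketch of that story (an illustration under assumptions, not a proposal for the actual PodSchedulingContext mechanics): the candidate nodes are tried in turn, each node-side check may decline with a reason, and if all decline the Pod simply stays unscheduled. The names `place_pod`, `admit`, and `free_gpus` are made up for this example.

```python
from typing import Callable, Optional, Tuple


def place_pod(candidates: list,
              admit: Callable[[str], Tuple[bool, str]]) -> Tuple[Optional[str], dict]:
    """Try candidate nodes in order; return the first that accepts.

    `admit` stands in for the node-side check and returns (accepted, reason).
    If every node declines, the pod stays unscheduled (None).
    """
    declined = {}
    for node in candidates:
        accepted, reason = admit(node)
        if accepted:
            return node, declined
        declined[node] = reason
    return None, declined


# Hypothetical node-side state: free GPUs per node.
free_gpus = {"node-a": 0, "node-b": 1}


def admit(node: str) -> Tuple[bool, str]:
    """Decline when the node has no free GPU left."""
    if free_gpus.get(node, 0) < 1:
        return False, "no such GPU"
    return True, ""


chosen, reasons = place_pod(["node-a", "node-b"], admit)
nobody, all_reasons = place_pod(["node-a"], admit)
```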


If we go this route, where's a good place to take that discussion?

@pohly
Contributor Author

pohly commented Jan 30, 2024

What's "controversial" isn't the template API per se, but the way it introduces complexity with scheduling.

I don't get how templates add complexity for scheduling. The scheduler needs to wait for the created ResourceClaim, but that's all. That's the same as "wait for user to create ResourceClaim", it doesn't make the scheduling more complex. Templates are not related to which nodes are picked.

My suggestion is that the user-owned resource claim API is under-specified as written, because instead of the user specifying the node, it randomly picks one during scheduling.

The "I want this claim for node xyz" doesn't need to be in the resource.k8s.io API/ResourceClaim API. It can go into the claim parameters for the driver. After all, it is the driver which needs to evaluate that information, right? If users must manually create ResourceClaims, then they can also create claim parameters for each of those.

Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources.

So when a deployment is used, all pods reference the same ResourceClaim? Then all pods run on the same node, using the same hardware resource. I don't see how you intend to handle this. This will require some new kind of API, one which will become obsolete once we have what people really want (automatic scheduling). If you think that this is doable, then this deserves a separate KEP which explains all the details and what that API would look like. It's not just some reduced DRA KEP.

it's useful to some folk

Who are those folks? This seems very speculative to me.

The ability for a kubelet to demur through an update to PodSchedulingContext would support a bunch of related user stories

PodSchedulingContext is what people are trying to avoid...

@pohly
Contributor Author

pohly commented Jan 30, 2024

If we go this route, where's a good place to take that discussion?

Write a provisional KEP, submit it. We can then meet at KubeCon EU to discuss face-to-face or set up online meetings.

@johnbelamaric
Member

So when a deployment is used, all pods reference the same ResourceClaim? Then all

Yeah, I think you're right, this doesn't quite work and templates are probably the fix.

The goal, as you said, is to avoid pod scheduling context, not templates really.

I still think it's possible to create a scoped down but still useful API that accomplishes that.

@johnbelamaric
Member

Who are those folks? This seems very speculative to me.

Today people solve this by grabbing the whole node and/or running privileged pods. This API avoids this, allowing an administrator to pre-allocate resources via the node-side (privileged) drivers, without requiring the user pod to have those privileges. Those would be the users of this initial API.

@thockin
Member

thockin commented Jan 30, 2024

Before we automate something (i.e., add templating and automatic scheduling), we need the manual flow to work

This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.

  • Users evaluate their workloads and decide on the set of resources needed on each node.
  • Users manually select the set of nodes on which they expect those workloads to run, and label those sets of nodes.
  • Users pre-provision the resources on those nodes using ResourceClaims that specify those nodes.
  • Users create deployments or other workload controller resources to provision the pods, using a nodeSelector to map those to the set of nodes with the pre-provisioned resources.

In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?

IOW it uses node-management as a substitute for (coarse) resource management.

I am certainly seeing a trend of people running (effectively) one workload pod per node, which fits this model. If we tidy that up and codify it (even just saying "here's a pattern you should not feel bad about"), does it relieve some of the pressure?

@sftim
Contributor

sftim commented Jan 30, 2024

A manual (external) flow could work for things that you can attach to nodes, especially if they are then dedicated to the thing they attach to. Device plugins provide the ability for the thing you attach to work, but they don't attach it for you.

Something like https://cloud.google.com/compute/docs/gpus/add-remove-gpus - something triggers making a ResourceClaim, and automation fulfils it by attaching a GPU to a VM. NFD runs and labels the node. There is probably a device plugin in this story; anyway, we end up with a node that has some GPU capacity available.

Next, someone runs a Job that selects for that kind of GPU, and it works. However, what's possibly missing at this stage is the ability to reserve that GPU resource from the claim and avoid it being used for other Pods.

If we want to run a second Pod, maybe we're able to attach a second GPU, avoiding the one-pod-per-node problem. We aren't doing DRA but we have helped some stories and narrowed what still needs delivering.

@pohly
Contributor Author

pohly commented Jan 30, 2024

Pre-provisioned volumes are a lesson I took to heart.

Those were also what came to my mind when I read @johnbelamaric's outline. Pre-provisioned volumes have been replaced by CSI volume provisioning, but now that also is stuck having to support the complicated "volume binding" between PVC and PV. The result is that there are still race conditions that can lead to leaked volumes. Let's not repeat this for DRA.

@sftim
Contributor

sftim commented Jan 30, 2024

Here's why I think it could be different.

  • with storage, you can set up the volume manually (which, to be fair, the control plane doesn't stop you from doing even with newer mechanisms)
  • the PVC has to handle the case where something has provided the PV already

However:

  • I don't think anyone is proposing the equivalent model for general resources.

Let's say you're attaching GPUs to a node, and you make a ResourceClaim to specify there should be 2 GPUs attached. The helper finds there's already one GPU manually / previously attached. How about we specify that this is not only undefined behaviour, but that we expect drivers to taint the node if they see it. No need to have complex logic around manual and automatic provisioning; something is Wrong and the node might not be any good now.

If the helper is told to add 2 GPUs for a total of 2 attached GPUs and finds 0 attached GPUs: great! We get nodes with GPUs, no need to taint, and other consequences such as NFD and device plugin registration can all happen.

Does that work?
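
The rule described above is small enough to sketch. This is a hedged illustration only: `plan_gpu_attachment` and the taint key are made up, and a real driver would act on node objects rather than return a dict.

```python
def plan_gpu_attachment(desired_total: int, already_attached: int) -> dict:
    """Decide what the attach helper does for a claim asking for desired_total GPUs."""
    if already_attached > 0:
        # Something outside the claim attached hardware: do not try to merge
        # manual and automatic provisioning, just flag the node.
        return {"attach": 0, "taint": "example.com/unexpected-device"}  # made-up taint key
    return {"attach": desired_total, "taint": None}


# Claim asks for 2 GPUs, helper finds 1 already attached: taint, attach nothing.
conflict = plan_gpu_attachment(desired_total=2, already_attached=1)
# Claim asks for 2 GPUs, helper finds 0 attached: attach both, no taint.
clean = plan_gpu_attachment(desired_total=2, already_attached=0)
```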

@johnbelamaric
Member

This is a pretty big statement. I worry that the things we need for manual selection may be the things we DON'T WANT for automation. Giving the user too much control can be an attractive nuisance which makes the "real" solution harder. Pre-provisioned volumes are a lesson I took to heart.

Fair point. But I think the trouble comes in more when you don't have a clear sense of ownership - that is, mixed manual and automated flows. If the automated flows have full ownership (overriding anything the user may have done).

In this model, what is the value of ResourceClaims above simple Device Plugins? Or maybe I misunderstand the proposal? I read this as: label nodes with GPUs, use node selectors, use device plugins to get access to GPUs. Which I think is what people already do, right?

I am not sure there is one, except that we are rebuilding a model that can be further extended to the full support. Another alternative may be to extend those existing mechanisms, rather than invent a new one.

My main goal with spitballing some alternatives is to see if we can deliver incremental scope in a useful but digestible way. My thinking is:

  1. I can solve my use case with a combination of basic primitives and client-side tooling, but it requires me to know a lot of stuff about resources in my nodes.
  2. I can reduce the client side tooling with an API that incorporates basic client-side patterns I am using.
  3. I can automate scheduling by encoding what I know about resources in my nodes and exposing that information to the control plane.

2 and 3 may have to go together, but I was hoping to find a solution to 1. With this approach, the functionality delivered in 1 can be done earlier because it is simpler, and it does not need to materially change as we implement 2 and 3, reducing risk.

Admittedly I may be cutting things up wrong...but I think there is merit to the approach.

@pohly
Contributor Author

pohly commented Jan 30, 2024

  1. I can solve my use case with a combination of basic primitives and client-side tooling, but it requires me to know a lot of stuff about resources in my nodes.

This can be done with the current API minus PodSchedulingContext and without adding numeric parameters: write a DRA driver kubelet plugin and a corresponding controller, then use immediate allocation with some driver specific claim parameters that select the node (by name or by labels). There's no need for an API extension specifically for this mode of operation.
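
As a rough sketch of the driver side of that idea: the controller could read driver-specific claim parameters and pick the node by name or by labels. The parameter names `nodeName` and `nodeLabels` and the function `select_nodes` are hypothetical, since claim parameters are opaque to Kubernetes and defined by each driver.

```python
def select_nodes(claim_params: dict, nodes: dict) -> list:
    """Return the nodes on which a driver may allocate for this claim.

    claim_params is driver-defined; this sketch assumes either an exact
    `nodeName` or a `nodeLabels` map that a node must fully match.
    """
    name = claim_params.get("nodeName")
    if name is not None:
        return [name] if name in nodes else []
    wanted = claim_params.get("nodeLabels", {})
    return sorted(
        node for node, labels in nodes.items()
        if all(labels.get(key) == value for key, value in wanted.items())
    )


nodes = {
    "node-a": {"example.com/gpu-pool": "training"},
    "node-b": {"example.com/gpu-pool": "training"},
    "node-c": {"example.com/gpu-pool": "inference"},
}

by_name = select_nodes({"nodeName": "node-c"}, nodes)
by_labels = select_nodes({"nodeLabels": {"example.com/gpu-pool": "training"}}, nodes)
missing = select_nodes({"nodeName": "node-x"}, nodes)
```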

Perhaps some hardware vendor will even write such a DRA driver for you, if some of their customers want to use it like this - I can't speak for either of them. This is why the KEP has "collect feedback" as one of the graduation criteria.

This might even give you option 2 and 3. I am not sure whether I grasp the difference between them 🤷

@johnbelamaric
Member

This might even give you option 2 and 3. I am not sure whether I grasp the difference between them 🤷

I am not sure there is one!

@thockin
Member

thockin commented Jan 30, 2024

Immediate provisioning doesn't ensure that the node on which the resource was provisioned can fit the pod's other dimensions, but maybe that's OK? The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.
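
A small sketch of the difference, under the assumption that resources are plain counts per node (the resource name `example.com/gpu` and the helper names are made up): with counted resources the scheduler can check every dimension at once, whereas a claim provisioned up front on one node may land where the pod's other requests do not fit.

```python
def fits(free: dict, requests: dict) -> bool:
    """True if every requested amount is available on the node."""
    return all(free.get(name, 0) >= amount for name, amount in requests.items())


def feasible_nodes(nodes: dict, requests: dict) -> list:
    """Nodes that satisfy all counted requests of the pod."""
    return sorted(node for node, free in nodes.items() if fits(free, requests))


nodes = {
    "node-a": {"cpu": 2, "example.com/gpu": 1},
    "node-b": {"cpu": 16, "example.com/gpu": 1},
}
pod = {"cpu": 8, "example.com/gpu": 1}

# Counted resources: the scheduler sees GPU and CPU together.
candidates = feasible_nodes(nodes, pod)

# Immediate provisioning that picked node-a by GPU alone: the pod's CPU
# request no longer fits there.
pinned_ok = fits(nodes["node-a"], pod)
```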

What I am trying to get to, and I think John is aiming at the same goal (not that you're not, Patrick :), is to say what is the biggest problem people really experience today, and how can we make that better? A year ago I would have said it was GPU sharing. Now I understand (anecdotally) that sharing is far less important than simply getting out of the way for people who want to use whole GPUs.

Here's my uber-concern: k8s is the distillation of more than a decade of real-world experience running serving workloads (primarily) on mostly-fungible hardware. The game has changed, and we don't have a decade of experience. Anything we do right now has better than even odds of being wrong within a short period of time. The hardware is changing. Training vs. inference is changing. Capacity is crunched everywhere, and there's a sort of "gold rush" going on.

What can we do to make life better for people? How can we help them improve their efficiency and their time to market? Everything else seems secondary.

I ACK that this is GPU-centric and that DRA does more than just GPUs.

@pohly
Contributor Author

pohly commented Jan 30, 2024

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

@klueska, @byako : I'll punt this to you.

Just beware that it would mean that we need to rush another KEP for "PodSchedulingContext" for 1.30 and add a feature gate for that - I'm a bit worried that we are stretching ourselves too thin when we do that, and we also skip all of the usual "gather feedback" steps for beta. I'd much rather focus on numeric parameters...

@pohly
Contributor Author

pohly commented Jan 30, 2024

The advantage of simple, stupid, counted resources is that the scheduler has all the information it needs about all requests.

That sounds like "numeric parameters", which we cannot promote to beta in 1.30. When using "numeric parameters", PodSchedulingContext is indeed not needed.

@thockin
Member

thockin commented Jan 30, 2024

Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.

If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?

@johnbelamaric
Member

Right. The numeric parameters approach keeps emerging as a potentially better path forward, but I really don't think we know enough to proclaim any of this as beta. If there are things we can do that are smaller and more focused, which would solve some of the problems, I am eager to explore that.

Yeah. I concede nothing in beta in 1.30. That was my original position anyway, but I thought perhaps if we scoped something way down but kept continuity, we could do it. But it's clearly a no-go.

If we were starting from scratch RIGHT NOW, with no baggage, what would we be trying to achieve in 30?

Great question, that is what I am looking for along the lines of MVP. What we really need to do is go back to the use cases for that, which is pretty hard on this tight timeline. The better option may be to ask "what would we be trying to achieve" cutting out the "in 30", and defer that question to 31. In the meantime, maybe we make some incremental steps in the direction we need in 30 based on what we know so far - something like @pohly is saying here plus numerical models.

@alculquicondor
Member

alculquicondor commented Jan 30, 2024

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

Can you expand on this? What becomes core DRA?
Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.

@johnbelamaric
Member

I'm okay with promoting "core DRA minus PodSchedulingContext" to beta in 1.30, if that helps someone, somewhere.

Can you expand on this? What becomes core DRA?

Are we going to end up in the same inconsistent situation that the topology manager is in? The one where kube-scheduler sends a Pod to a Node and the kubelet just rejects it.

I think it's clear at this point that nothing is going beta in 1.30. We will make sure to avoid the issue you are describing. I think we should err on the side of "failing to schedule" rather than "scheduling and failing".

@salehsedghpour
Contributor

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.30 milestone Feb 9, 2024