
KEP 4381: add structured parameters for dynamic resource allocation #4384

Merged
merged 1 commit into kubernetes:master on Feb 9, 2024

Conversation

pohly
Contributor

@pohly pohly commented Jan 5, 2024

  • One-line PR description: initial draft

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 5, 2024
@pohly pohly force-pushed the dra-numeric-parameters branch 2 times, most recently from 76262cb to d5ceb77 on January 7, 2024 14:10
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 7, 2024
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jan 7, 2024
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Jan 9, 2024
@pohly pohly mentioned this pull request Jan 16, 2024
@johnbelamaric
Member

@pohly I think this and the related KEPs are going in the right direction. It will be interesting to see how many use cases we can cover with these.

I am wondering though if we can build a more structured and comprehensive model up front, rather than the model being a sort of flat list of parameters. The drivers would publish those to a resource just like you have here, but the structure would also allow some topological considerations to be baked in. The goal is to get to a place where the scheduler and autoscalers (workload or cluster) have enough information to get the right answer the vast, vast majority of the time. The drivers would likely also need to publish information on the valid "transformations" of the model when reservations are done.

I am also not sure about the way this "peers into" driver-specific data structures for parameters. I see why you are doing it that way, but it creates some fragility between resources when we have those sorts of field paths. I think we can explore some other approaches. For example, a schema of some sort for describing the resource model, which results in those parameter CRDs for the user to use, but attaches the resource management metadata to those types as well, for the scheduler and autoscalers to use. We can brainstorm some of this when we talk next.

@pohly
Contributor Author

pohly commented Jan 17, 2024

I am wondering though if we can build a more structured and comprehensive model up front, rather than the model being a sort of flat list of parameters.

The parameters don't need to be a flat list. Anything that is allowed in a CRD could be used. How complex the in-tree model needs to (and should) be is something that we can discuss in #4384, with https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit#heading=h.ljj9kaa144nr as a starting point.

a schema of some sort for describing the resource model, which results in those parameter CRDs for the user to use, but attaches the resource management metadata to those types as well, for the scheduler and autoscalers to use

So basically a configurable mapping of fields in a vendor CRD to the in-tree numeric model type? That adds more flexibility, but also complexity. I don't see how it addresses the "fragility between resources" - instead of one configurable field path, we would have many. Perhaps being explicit about all paths would help. 🤷

respond by returning the JSON encoding of this capacity type. A streaming gRPC
API allows drivers to update the capacity information.

For example, a simple counter could be described as:
Contributor Author

I'll add here that this example is what is being defined in #4385.

metav1.TypeMeta `json:",inline"`

Count int64 `json:"count"`
}
Contributor Author

At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.

One common additional criterion is some kind of selector. Let's discuss here what that might look like and then support it consistently in all builtin models.

What I came up with so far is to support a metav1.LabelSelector:

// Parameters defines how much and which resources are needed for a claim.
type Parameters struct {
	metav1.ObjectMeta `json:"metadata,omitempty"`
	metav1.TypeMeta   `json:",inline"`

	// Count is the required number of items.
	Count int64 `json:"count"`

	// In addition to the "In", "NotIn", "Exists", "DoesNotExist"
	// operators from [k8s.io/apimachinery/pkg/apis/meta/v1.LabelSelectorOperator],
	// additional operators from [k8s.io/dynamic-resource-allocation/apis/meta/v1.LabelSelectorOperator]
	// are also supported:
	//
	// - "Version>=", "Version>", "Version<=", "Version<", "Version=", "Version!=":
	//   the Selector.MatchExpressions values must
	//   contain a single semantic version string. The expression
	//   matches if the version in the label satisfies the condition
	//   as defined by semantic versioning 2.0.0.
	// - ">=", ">", "<=", "<", "=", "!=": the Selector.MatchExpressions values must
	//   contain a single string which can be parsed as [resource.Quantity],
	//   with comparison against the labels' value using the numeric operation.
	Selector metav1.LabelSelector `json:"selector"`
}
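For illustration, a claim parameters object of this shape might look like the following; the API group, kind, and label keys here are made up for the example and are not part of the proposal:

```yaml
# Hypothetical example of the extended selector; group, kind, and label keys
# are illustrative only.
apiVersion: counter.example.com/v1alpha1
kind: Parameters
metadata:
  name: two-matching-items
count: 2
selector:
  matchExpressions:
    - key: example.com/driverVersion
      operator: "Version>="
      values:
        - "1.2.3"
    - key: example.com/model
      operator: In
      values:
        - a100
```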

The implementation will be a fork of https://github.com/kubernetes/apimachinery/blob/3c8c1f22dc332a8a6b51033f6a7c0566b885d290/pkg/apis/meta/v1/helpers.go#L31-L72 and https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/labels/selector.go#L212-L267, with semantic versioning handled by github.com/blang/semver/v4.

Using a metav1.LabelSelector as type in the API is IMHO simpler for users than allowing a full expression (foo in (x y z), bar != 1) (https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement).

Another benefit is that validating the parameters becomes easier. The CRD could limit the keys to known, supported labels. Enforcing the correct operation for each label (i.e. only allowing semver ops for a label that has a version as its value) probably needs an admission webhook, but could be implemented that way.

Does this make sense?

cc @klueska @byako

Contributor

Have you considered expanding metav1.LabelSelector with these features vs. making another label selector?

Contributor

For NVIDIA GPU selection we will also require the following operations:

  • Globbing, e.g. (if model == "A100_PCIE_40GB", being able to select on model == "*a100*")
  • Comparison of booleans with possible values of only true/false

We also need to be able to combine these operations in (somewhat) complicated ways, e.g. a common selector people want to express is:

model == "*v100*" || (cudaComputeCapability >= 7.0 && memory >= 16GB)

We could maybe get away with not allowing arbitrary boolean expressions for now, but we do need globbing and boolean comparisons.

Contributor

At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.

What does this mean? What will the Request struct look like, exactly? Is this the difference between what is used in claims vs. what is passed over the gRPC protocol from the DRA plugin to the kubelet?

Contributor Author

Yes, that's the difference. The "capacity type" is used when publishing capacity of a node, the "parameter type" when asking for resources via the vendor CRD.

Member

Makes sense. I think it's covered in the proposal, but take a look tomorrow and see what you think. The proposal still has ResourceClaim with a reference to a vendor-specific CR. We translate that into a vendor neutral request object using CEL (which can operate on arbitrary CRs by treating them as Unstructured). Various selection criteria could be included there (they are not discussed in there now, beyond a boolean true/false).

Member

I didn't click thru everything yet, but tell me if I am off-target here: This seems like a place where we need to evaluate semi-arbitrary expressions. One might say we need to agree on a common language for those expressions, with extensible semantics.

On globbing: IMO globbing indicates that you have a piece of stringly-typed data, which would almost always be better as N discrete items. This is why label selectors DO NOT support globbing, and instead demand that you factor your data into labels. AND STILL, people choose to parse names, and inevitably it is causing pain and misery.

Coming up with a string which composes the model, vendor, subtype, etc. should be a very last resort. Globbing is a smell.

Contributor Author

I agree that we need a more comprehensive expression language than metav1.LabelSelector. metav1.LabelSelector only gives us <term1> && <term2> && ... where each term is one label check.

In @johnbelamaric's proposal, each hardware instance is described by more than one numeric model and the vendor CRD with the claim parameter then gets mapped to one parameter type per model. Each model is then checked independently and all must match an instance.

At first glance this looks appealing because one can "compose" the description of the hardware from different building blocks. But now consider mode switching. Each mode may have labels that need to match the selector provided by the user. The "selector model" is unaware of "modes", so it cannot be used. The "mode switch model" has to re-implement it. Alternatively, more details of the mode switch model need to be hoisted up into the resource.k8s.io API. This is a slippery slope where the common API becomes more and more complex.

Therefore I prefer to stick with "one model per driver" and instead implement label selection as a shared library that can be used by the different models. Both the "counter model" and the "mode switch model" would support it, with the difference that the "counter model" publishes a flat set of labels as part of its capacity while the "mode switch model" has labels for the entire instance and each mode.

Contributor

@klueska klueska Feb 3, 2024

Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name returned by nvidia-smi.

Examples of such model names include the following:

Tesla V100-SXM2-16GB
NVIDIA T4 32GB
NVIDIA A100-SXM4-40GB
NVIDIA H100 PCIe

As some background, it's common for people to prototype their workloads in docker on a specific GPU model before moving them to Kubernetes. When doing so, they are keenly aware of the GPU model they are running on, and want to ensure that they continue to run on this same GPU model once deployed under Kubernetes.

Now keep in mind, this is already possible today using the extended resource exposed by the current device plugin API (i.e. nvidia.com/gpu) in conjunction with a node label that advertises the GPU model (assuming all GPUs on a node are homogeneous, of course).

It would look something like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: gpu-example
      image: nvidia/cuda
      resources:
        limits:
          nvidia.com/gpu: 2
  nodeSelector:
    nvidia.com/gpu.model: NVIDIA A100-SXM4-40GB

Independent of the fact that the selection above occurs at the node level rather than the device level (which DRA proposes to solve), matching on an exact model name is problematic for two reasons:

  1. As evidenced above, the format of the model name is inconsistent -- meaning that users have to remember the differences between them when putting their selectors together.
  2. There is not just a single model name for, say, an A100 card on a given machine. The model name encodes other information, such as whether it's on a motherboard that connects devices within a PCIe hierarchy or via SXM, or how much memory the card has on it (e.g. 40GB, 80GB, or 96GB).

Some examples of (2) include:

NVIDIA A100-PCIE-40GB
NVIDIA A100-SXM4-40GB
NVIDIA A100-PCIE-64GB
NVIDIA A100-SXM-64GB
NVIDIA A100-PCIE-80GB
NVIDIA A100-SXM4-80GB

It would make the UX much nicer for one to be able to select on something like "*A100*" in cases where they only care about getting access to an A100 (and don't care about the interconnect or the amount of memory -- they should be using a separate selector for memory anyway).

For cases where they do care about being more specific, they still can be, by either matching on the full model name or on a different subset of it using an alternate glob expression.

Given the (revised) proposal we have of using a DRA controller to translate any vendor specific selection parameters into selectors that the scheduler will understand (apologies I don't know the right terms here yet to be more specific), I imagine the scheduler would be able to take additional modifier expressions that tell it to ignore case when performing its globbing operation, for example, and the DRA controller could decide if it wants to apply these modifiers or not when performing the translation.

Member

Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name

People want to glob, substring, prefix, or regex match kube selectors, too. Just say no.

The model name encodes other information

You have a stringly-typed structure. Ideally you would decompose this into actual atoms of information, and let the user select on those.

It would make the UX much nicer for one to be able to select on something like "A100" in cases where they only care about getting access to an A100

So name the concept that "A100" represents. "family"? "generation"? A glommed together name is great for dumping in a logfile, but we have data structures for a reason.

they should be using a separate selector for memory anyway

exactly

Anyway, I probably won't die on the hill of not allowing globbing, but I will push back strongly.

@swatisehgal
Contributor

/cc

Member

@johnbelamaric johnbelamaric left a comment

Some questions and comments but this is very much on the right track with respect to the concerns folks have expressed.

From a paperwork perspective, I think we should include all the models in this KEP, rather than separate KEPs. Even if we don't implement all the models in 1.30, we can decide which ones are required to go to beta in the beta criteria.

I think we can keep this separate from the larger DRA KEP for now at least.

I do think we should consider how far we can push this. If we can eliminate the need for the driver-by-driver back and forth communication via PodSchedulingContext that would go a long way towards alleviating concerns on the overall DRA KEP. I realize (at least I think this is true) that that process also covers node-side allocation failures pushing a pod back into the scheduling queue, and this wouldn't cover that. I need to think about that more.

@johnbelamaric
Member

To sort of answer my own question, I realized I got myself confused. We need three things, not two:

  • the capacity object (which will live in NodeResourceCapacity)
  • the capacity request object (which for counters is identical to the capacity object)
  • the capacity request expression object (defines the expressions used to generate a capacity request object from the CR)

I am assuming here that the capacity objects themselves, as stored in the NodeResourceCapacity, have a name or other identifier that we match to the capacity request object.

For how you "apply" the request to the current capacity, we could either have it be inherent in the numerical model type, or we can add an expression (and perhaps validation) object for that purpose too.
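To make the third item above a bit more concrete, here is a purely illustrative sketch; the field names are invented and are only meant to show the idea of deriving a capacity request from the vendor CR via a CEL expression:

```yaml
# Hypothetical sketch only; nothing here is proposed API.
capacityRequestExpressions:
  - capacityName: counter                    # matches a capacity object published by the driver
    countExpression: "params.spec.gpuCount"  # CEL evaluated against the vendor CR (as Unstructured)
```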

@pohly
Contributor Author

pohly commented Jan 26, 2024

From a paperwork perspective, I think we should include all the models in this KEP, rather than separate KEPs. Even if we don't implement all the models in 1.30, we can decide which ones are required to go to beta in the beta criteria.

I'm open to merging things. I'm just a bit worried that it will become harder to associate use cases and PRR sections with the right feature. I can try and we'll see what the outcome will be.

Related to this, how many feature gates do we want for this? We could make this granular (current proposal):

  • DynamicResourceAllocation = core DRA
  • DRANumericParameters = this extension
  • DRACounterNumericModel = counter model
  • DRAModeSwitchModel = mode switch model

Do we agree on this?

I do think we should consider how far we can push this. If we can eliminate the need for the driver-by-driver back and forth communication via PodSchedulingContext that would go a long way towards alleviating concerns on the overall DRA KEP.

We lose the following functionality if we remove PodSchedulingContext:

  • Support for network-attached resources: not covered because it would raise additional questions around how allocation can make the relevant resource configuration changes when there is no vendor-provided control plane controller.
  • Support use cases which don't fit into the proposed numeric models and don't need autoscaling. Such cases exist.

IMHO these are important enough to keep the core DRA as originally envisioned.

I realize (at least I think this is true) that that process also covers node-side allocation failures pushing a pod back into the scheduling queue, and this wouldn't cover that. I need to think about that more.

Triggering pod rescheduling in kubelet is going to be a separate KEP from @ffromani . DRA is avoiding the problem as much as possible by making sure that resources are set aside before the scheduling decision.

@ffromani
Contributor

Triggering pod rescheduling in kubelet is going to be a separate KEP from @ffromani . DRA is avoiding the problem as much as possible by making sure that resources are set aside before the scheduling decision.

Correct. We agreed to keep this part separate as it is related to, but not a dependency for, DRA progress. Of course, the needs and potential benefits for DRA enabled by rescheduling are going to be a major theme in the pod rescheduling conversation. I'm currently processing all the feedback received from the conversations with various SIGs and will update the design document ASAP.

@pohly
Contributor Author

pohly commented Jan 26, 2024

To sort of answer my own question, I realized I got myself confused. We need three things, not two:

Correct. But remember that we need them for each numeric model.

the capacity request object (which for counters is identical to the capacity object)

This made it a bad example 😢 The extended counter model with selectors will be a better example, because then capacity and parameter types will be different.

I am assuming here that the capacity objects themselves, as stored in the NodeResourceCapacity, have a name or other identifier that we match to the capacity request object.

They all have an "instance ID" chosen by the driver and that gets repeated in the allocation result, so when the scheduler picks one resource instance, the driver will know which one.

For how you "apply" the request to the current capacity, we could either have it be inherent in the numerical model type,

That's the current proposal, with some Go package implementing that logic. Because it's builtin, it doesn't need to be configurable.

@klueska
Contributor

klueska commented Feb 7, 2024

/assign @dchen1107
/assign @alculquicondor
/assign @MaciekPytel

Just want to make sure this is on all of your radars with the KEP freeze looming tomorrow. We need approvals from all of you before we can merge.

@thockin @johnbelamaric and I are circling around approving it from a reviewer's standpoint and hope to give it our sign-off sometime early tomorrow afternoon PST.

@johnbelamaric
Member

/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Feb 7, 2024
@johnbelamaric
Member

oh, that belongs on the issue not the PR

Member

@thockin thockin left a comment

My comments are a little all over the place, but I don't want to erase them. I think the priority here is to get something that helps us prove this overall model. As such, we might want to just defer things with a lot of nuance and instead focus on landing broad strokes, knowing full well that we will get some of it wrong and that it will lack the requisite sophistication.

There will be followup, so this can safely be less than MVP, so we can work with autoscaling and scheduling to make sure it works (and if not, we don't lose so much :)

I also think we need to get back in sync about selection via CEL. We probably didn't converge as much as I thought we had :)

parameters (i.e. parameters that are *not* needed for allocation but *are*
needed for configuration).

* With such a claim in place, DRA drivers "resolve" the contents of any
Member

For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?

Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.

Member

For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?

That was my original thought too. But Patrick pointed out that this would mean the vendor controllers all need write access to the claim status. We decided that the simplest way, from the point of view of RBAC, was the intermediate object, the ResourceClaimParameters. The vendor controllers can create and update those, which include the resolved claim parameters, plus the config data needed by the driver stored in a RawExtension for pass-through. The scheduler only needs read access to that resource, rather than to any vendor-specific resources.

Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.

Similar RBAC issues then, as now anything that needs to follow that reference needs RBAC to those vendor-specific objects.

Contributor Author

What if the resolved form of the vendor-specific params was in the status of the claim?

Besides the permission issue already mentioned by @johnbelamaric, there's also the "many claims referencing the same parameter object" aspect. This will happen with claim templates created by app controllers. Asking the DRA driver controller to constantly check for new claims and repeat the same information again and again seems sub-optimal to me.

OTOH, it has one potential advantage: the translation could be specific to the claim. But in the scenario above with one claim template per app, there wouldn't be any difference between the generated claims, so I think this doesn't matter that much.

Let's stick with the current approach.

object used in the request.

In this KEP, we define a single "semantic model" called
`PartitionableResources`, and use it throughout the discussion to help describe
Member

I don't think "Partitionable" adds enough value to justify the typing difficulty it imposes. I'll think about names as I do this read-thru

Member

Divisible?

Segmented?

Contributor Author

ModeSwitching?


### Publishing node resources

The kubelet publishes NodeResourceSlices to the API server with content
Member

Clarification:

Is this EXCLUSIVELY about kubelet, or would it make sense to expand this to something like:

"""
NodeResourceSlices are published to the API server to describe resources which are available on each node. Often they will be produced by kubelet, with information from on-node agents, but they can also be published by other agents. Access control through the node authorizer ...."
"""

Contributor Author

My intent at some point was to allow admins to create NodeResourceSlices for pod groups, as information for the autoscaler. @MaciekPytel preferred to have that information in the autoscaler configuration or to get it from cloud providers, so at this point only kubelet is expected to publish these objects.

Either way, kubelet would have been responsible for all objects with the same nodeName field as the node on which kubelet runs. I'll make that clearer:

The kubelet publishes NodeResourceSlices to the API server with content provided by DRA drivers running on its own node. It's also responsible for deleting or updating stale objects. The nodeName field determines which objects each kubelet instance is responsible for.
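For illustration, a minimal object following that description could look roughly like this; apart from the nodeName field and the per-driver split, the names are assumptions:

```yaml
# Rough sketch; nodeName and the one-object-per-driver split come from the
# text above, everything else is illustrative and may change.
apiVersion: resource.k8s.io/v1alpha2
kind: NodeResourceSlice
metadata:
  name: worker-1-gpu.example.com
nodeName: worker-1          # determines which kubelet instance manages this object
driverName: gpu.example.com # content is provided by this DRA driver
```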

Member

Is the following valid or invalid:

The cloud-provider knows all about the machines and hardware. They run a controller which creates NodeResourceSlices for each machine instead of kubelet.

I guess, at the end of it all, kubelet has to feed device allocation decisions back to the runtime, so it's sort of moot, but I just hate being over-prescriptive.

Member

I think it's valid. Especially considering that these are per-driver. So, kubelet and the on-node agents may know about some things, but for some drivers (e.g., vlan / vrf / routing policy configs on the ToR or distribution routers) they may actually not know, and a different agent populates this.


NodeResourceSlices are published separately for each driver, using a version of
the API that matches what that driver is using. This implies that the kubelet, DRA
driver, and apiserver must all support the *same* API version. It might be
Member

This is a scary paragraph which I don't fully understand. Version skew is a VERY real thing, and anything that demands exact same versions is too brittle.

Member

I don't think the paragraph is accurate. Round-tripping should work here just as well as anywhere else. So the functionality will revert to the oldest version understood by all components (in effect).

Contributor Author

This goes back to the discussion about strong typing vs. letting kubelet pass through opaque data.

What I was trying to point out here is that kubelet and the DRA drivers on the node must agree on the version of the data and its representation when the DRA driver passes it over gRPC. If kubelet is using "resource.k8s.io/v1alpha2", then the driver cannot give it data defined by "resource.k8s.io/v1beta1". It has to be "resource.k8s.io/v1alpha2".

I'll list that as the current approach. I have an idea how to support version skew better, but it's not required for 1.30 (just one alpha version...).

Member

API version negotiation for plugins is always a thing, but saying "the kubelet ... and apiserver must all support the same API version" is a little much. It's 100% normal that you can't use a new Pod API feature properly if kubelet is back-rev.

String: 1.2.3
- name: runtimeVersion
String: 11.1.42
- name: t1000-gpu
Member

Is this correct YAML? I read this as a list-item under commonAttributes

Edit: reading on I see what's happening and I don't love it.

apiGroup: dra.example.com
uid: foobar-uid

vendorParameters:
Member

I'm lost on why we are copying that CR into this and what schema you will use to represent it?

Contributor

These are configuration parameters that are not needed for scheduling / CA decisions, but need to be passed to the kubelet to do configuration on the node as part of resource "preparation". I believe the idea was to use a RawExtension to represent it.

Member

Yes, exactly. So, the vendor CR contains various configuration parameters that are totally unrelated to scheduling and resource allocation. Those can remain opaque to K8s, but they need to make their way from the class and claim all the way down to the on-node driver. They also need to be captured as they existed at the time of allocation. So, the translation of the CR to vendor-neutral form also stuffs away any of these "config" related values from the class and claim into a passthrough RawExtension. This is then copied at the time of allocation into the resource claim status, along with the rest of the allocation information.

If a user subsequently updates the CR, it doesn't matter, because we have captured the config data as it was at the time of allocation. The updated CR will be used for future allocations but not for existing ones.

Member

hmm, I think I get it. I don't love it, but I get it, I think.

// +optional
metav1.ObjectMeta

Spec NodeResourceSliceSpec
Member

@thockin thockin Feb 7, 2024

Well, will there be status?

which of those resources are reserved for claims.

However, despite the finalizer on the claims it could happen that a
sufficiently determined user deletes a claim while it is in use. Therefore
Member

"well intentioned but poorly informed" is my go-to phrasing

Partitionable *PartitionableResources
}

type PartitionableResources struct {
Member

I'll go one further - can we jettison anything we're unsure about to reduce scope? We're certainly going to do an alpha2, so I think it's valuable to just be sure the gears mesh overall...


### Immediate allocation

Because there is no separate controller anymore, claims with immediate
Member

We need to look at this anew before beta. Is this still representing user intention well?

Contributor

claims with immediate allocation remain allocated when no longer in use

I think this is the important bit -- what use cases might require this (and why)?

I don't have any "immediately" on hand (pun intended!).

Member

AIUI, part of the motivation for immediate allocation was for expensive initialization. Suppose you need to flash a card, which takes a long time, then want to run a whole bunch of workloads against it in series (rather than in parallel). @pohly ?

Contributor Author

I think that use case is still served better by immediate allocation, because it still ensures that this expensively configured hardware is kept in that configuration after it has been used by one pod. The name just doesn't fit that well anymore.

Contributor

Even in the "existing" DRA model, no configuration happens until the first pod that uses a claim lands on the node where it has been allocated. That first pod will always be delayed by the time to configure in both scenarios.

The real difference with this KEP is that now the allocation itself won't happen until a pod that references the claim is scheduled. Whereas before, the DRA controller would pick an arbitrary node to allocate the claim on as soon as the claim object itself was created, and the scheduler was forced to schedule any pods referencing that claim on that node.

Member

Are we saying that "allocation mode" is defunct?

Member

I think this is a different topic. I had suggested adding an optional nodeName to resource claim in an earlier model. I cannot abide the "randomly pick a node". The nodeName would allow immediate allocation without that, and allow then for pods to be attached to that same (shared) claim.

But I don't think that's key in this KEP, I think that's a discussion for how we evolve base DRA in 1.31.

Contributor

I agree we should hold off on this, but I would also like to express my disdain for "randomly pick a node". It's the main reason we never built support for this mode in the existing DRA driver for GPUs.

@johnbelamaric
Member

Overall response to Tim's comments: I said it in one of the inline comments, but I'll repeat it here. I think we may be better off if we go with some simplified version of the partitionable (that is hard to type) model. I don't want everyone focused on the details of the model for 1.30, I think that's a distraction.

As I see it, the point of alpha is to reduce overall risk and learn from doing. The thing I think 1.30 should focus on for this KEP is the overall concept of the semantic model, how the CRs are processed, and how the scheduler, job controllers, cluster autoscaler and other components interpret that, draw down on the allocations, store and reconcile those allocations, etc. I think we should keep the actual model really simple as a placeholder, and then have a solid model to propose in the 1.31 design phase.

I think we should set aside a LOT of time at the contributor summit (and in subsequent days if folks are still in Paris) to really hash through those models real time, in person.

@klueska
Contributor

klueska commented Feb 8, 2024

What if we strip the partitionable model down to its very basics (and call it something else of course).

i.e. have it provide a flat list of opaque strings with no attributes attached, each string representing a unique resource that is allocatable on the node. This would be a direct mapping of the existing device plugin model to a DRA semantic model.

This would mean we can punt on talking about any sort of selection API (because all resources are equal), and focus solely on the vendor specific CRDs providing configuration parameters.

We lose the ability to do any sort of dynamic partitioning of resources, but it is still a net gain over the existing device plugin API since users will have the ability to do controlled sharing via resource claims and custom configuration through claim parameters.
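A sketch of what that stripped-down model could look like, with a hypothetical field name for the flat list:

```yaml
# Hypothetical sketch of the simplified model described above: a flat list of
# opaque strings, each naming one allocatable resource on the node.
apiVersion: resource.k8s.io/v1alpha2
kind: NodeResourceSlice
metadata:
  name: worker-1-gpu.example.com
nodeName: worker-1
driverName: gpu.example.com
namedInstances:   # hypothetical field name
  - gpu-0
  - gpu-1
  - gpu-2
  - gpu-3
```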

@pohly
Contributor Author

pohly commented Feb 8, 2024

I've pushed my updates where I tried to address the editorial changes (wording, take out implementation details for autoscaler).

What we need to decide now is which semantic model we want to describe in the KEP, if we can't agree on simply treating this as preliminary and "defined during implementation". It wouldn't be the first time that a KEP gets accepted with one API and the implementation's API then looks very different.... just saying. More in #4384 (comment).

@klueska
Contributor

klueska commented Feb 8, 2024

I just pushed my own PR against the branch for this PR that describes a much more simplified model (and reiterates many times over that this is a hypothetical model, not one we are proposing to use):
pohly#17

"Structured parameters" describe resources in a format that enables the
Kubernetes components to make scheduling decisions.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
@pohly
Contributor Author

pohly commented Feb 8, 2024

I merged @klueska's update and then did another rename of "semantic parameters" into "structured parameters": that is a better name because we define some structure for the parameters and the associated mechanics, but not the actual semantics.

@pohly pohly changed the title KEP 4381: add semantic parameters for dynamic resource allocation KEP 4381: add structured parameters for dynamic resource allocation Feb 8, 2024
@klueska
Contributor

klueska commented Feb 8, 2024

Thanks @pohly for your patience and diligence in seeing this KEP through.

While I still think there is room for improvement (and there are still some unresolved differences in opinion between the primary reviewers on how this framework will eventually take shape), we need to start somewhere.

The KEP is in good enough shape and we are aligned enough, that I believe we can get started working on this in 1.30.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024
Member

@alculquicondor alculquicondor left a comment

/approve from sig-scheduling

I think this older approach is gone. But, for the record, having the scheduler watch arbitrary CRDs, even if just to pass them through, would have been a no-go for me.

- name: drivers
attributes:
- name: driverVersion
String: 1.2.3
Member

Suggested change:
- String: 1.2.3
+ string: 1.2.3

why not version though? What tells CEL that this can be compared using semantic versioning?

Contributor Author

You are right, this should have been "version". That field, and only that field, is something that CEL will be able to use in a semantic version comparison.
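With that correction, the attribute in the excerpt above would read something like:

```yaml
# Corrected form of the excerpt; only a "version" attribute is eligible for
# semantic-version comparisons in CEL.
attributes:
  - name: driverVersion
    version: "1.2.3"
```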

name: someArbitraryName
namespace: user-namespace

generatedFrom:
Member

why do we need this if the name, namespace and uid are already part of the vendorParameters below?

Contributor Author

Because the vendorParameters are an opaque RawExtension. It's not required that they contain the original object.
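For illustration, the relationship between the two fields might look roughly like this; the schema details and values are assumptions based on the excerpt above, not the final API:

```yaml
# Sketch only; the vendor kind and parameter values are hypothetical.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimParameters
metadata:
  name: someArbitraryName
  namespace: user-namespace
generatedFrom:              # identifies the vendor CR this object was generated from
  apiGroup: dra.example.com
  kind: GpuClaimParameters  # hypothetical vendor kind
  name: user-supplied-name
vendorParameters:           # opaque RawExtension, passed through to the on-node driver
  apiVersion: dra.example.com/v1alpha1
  kind: GpuClaimParameters
  sharing: true             # hypothetical vendor-specific setting
```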


Instead of defining a vendor-specific CRD, DRA driver authors (or
administrators) could decide to allow users to create and reference
`ResourceClaimParameters` directly within their `ResourceClaims`. This would
Member

What does the ResourceClaim look like?
How does it reference the parameters?

Contributor Author

The claim would reference the parameter object with "resource.k8s.io" as APIGroup, "ResourceClaimParameters" as kind, and then the name of the object. This object then gets used directly.
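In that case, a claim might look like this; the resource class name and object names are illustrative:

```yaml
# Illustrative sketch of a ResourceClaim referencing a ResourceClaimParameters
# object directly instead of a vendor CRD.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-claim
  namespace: user-namespace
spec:
  resourceClassName: gpu.example.com
  parametersRef:
    apiGroup: resource.k8s.io
    kind: ResourceClaimParameters
    name: someArbitraryName
```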

Like a normal DRA driver controller, the scheduler also sets a finalizer to
ensure that users cannot accidentally delete the allocated claim while a pod
is about to start which depends on it. That finalizer is
"structured.dra.k8s.io/delete-protection".
Member

why not just dra.k8s.io/delete-protection?

Contributor Author

That also works.

@dchen1107
Member

/approve from the SIG Node perspective, based on the review details from @johnbelamaric and others.

@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dchen1107, johnbelamaric, klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@johnbelamaric
Member

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7ee5a73 into kubernetes:master Feb 9, 2024
4 checks passed
SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 9, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 9, 2024
@eero-t eero-t left a comment

Noticed a few typos

example, a cloud provider controller could populate this based upon information
from the cloud provider API.

In the kubelet case, each kubelet publishes kubelet publishes a set of

s/kubelet publishes kubelet publishes/kubelet publishes/

object used in the request.

This KEP is specifically focused on defining the framework necessary to enable
different "structured models" to be added to Kuberenetes over time. It is out of

s/Kuberenetes/Kubernetes/


If a driver wanted to use a different structured model to represent its resources,
a new structured model would need to be defined inside Kuberenetes, and a field
would need to be added to this struct at the same level as

s/Kuberenetes/Kubernetes/

tells them that they cannot handle the object because the API has been extended.

Drivers can use different structured models by publishing multiple
`NodeResourceSlice` objects, as long as each model represents a distinct set of

"distinct" as in "non-overlapping"?

```

Where "gpu-0" represents one type of card and "gpu-1" represents another (with
the attributes hanging off each serving to "define" their individual

simpler:
s/attributes hanging off each serving to "define"/related attributes defining/


For each node, one or more NodeResourceSlice objects get created. The kubelet
publishes them with the node as the owner, so they get deleted when a node goes
down and then gets removed.

s/ and then gets removed//

// allocated resources, others can be arbitrary setup parameters.
VendorClaimParameters *runtime.RawExtension

// NodeName is the name of the node providing the necessary resources.

Inconsistent indent.

remaining structured difference compared to delayed allocation is that claims
with immediate allocation remain allocated when no longer in use.

### Simulation with CA

For readability (CA could be e.g. Claim Allocation):
s/CA/Cluster Autoscaler/

- All watchers of node objects get the information even if they don't need it.
- It puts complex alpha quality fields into a GA API.

### Injecting vendor logic into CA

s/CA/Cluster Autoscaler/

several different ways:

- Call out to a vendor server via some RPC mechanism (similar to scheduler
webhooks). The risk here is that simulation becomes to slow. Configuration

s/to/too/
