
KEP 4381: add structured parameters for dynamic resource allocation #4384

Merged
merged 1 commit into kubernetes:master on Feb 9, 2024

Conversation

pohly
Contributor

@pohly pohly commented Jan 5, 2024

  • One-line PR description: initial draft

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 5, 2024
@pohly pohly force-pushed the dra-numeric-parameters branch 2 times, most recently from 76262cb to d5ceb77 on January 7, 2024 14:10
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 7, 2024
@bart0sh bart0sh added this to Triage in SIG Node PR Triage Jan 7, 2024
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Jan 9, 2024
@pohly pohly mentioned this pull request Jan 16, 2024
@johnbelamaric
Member

@pohly I think this and the related KEPs are going in the right direction. It will be interesting to see how many use cases we can cover with these.

I am wondering though if we can build a more structured and comprehensive model up front, rather than the model being a sort of flat list of parameters. The drivers would publish those to a resource just like you have here, but the structure would also allow some topological considerations to be baked in. The goal is to get to a place where the scheduler and autoscalers (workload or cluster) have enough information to get the right answer the vast, vast majority of the time. The drivers would likely also need to publish information on the valid "transformations" of the model when reservations are done.

I am also not sure about the way this "peers into" driver-specific data structures for parameters. I see why you are doing it that way, but it creates some fragility between resources when we have those sorts of field paths. I think we can explore some other approaches. For example, a schema of some sort for describing the resource model, which results in those parameter CRDs for the user to use, but attaches the resource management metadata to those types as well, for the scheduler and autoscalers to use. We can brainstorm some of this when we talk next.

@pohly
Contributor Author

pohly commented Jan 17, 2024

I am wondering though if we can build a more structured and comprehensive model up front, rather than the model being a sort of flat list of parameters.

The parameters don't need to be a flat list. Anything that is allowed in a CRD could be used. How complex the in-tree model needs to (and should) be is something that we can discuss in #4384, with https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit#heading=h.ljj9kaa144nr as a starting point.

a schema of some sort for describing the resource model, which results in those parameter CRDs for the user to use, but attaches the resource management metadata to those types as well, for the scheduler and autoscalers to use

So basically a configurable mapping of fields in a vendor CRD to the in-tree numeric model type? That adds more flexibility, but also complexity. I don't see how it addresses the "fragility between resources" - instead of one configurable field path, we would have many. Perhaps being explicit about all paths would help. 🤷

respond by returning the JSON encoding of this capacity type. A streaming gRPC
API allows drivers to update the capacity information.

For example, a simple counter could be described as:
Contributor Author

I'll add here that this example is what is being defined in #4385.

metav1.TypeMeta `json:",inline"`

Count int64 `json:"count"`
}
Contributor Author

At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.

One common additional criterion is some kind of selector. Let's discuss here what that might look like and then support it consistently in all builtin models.

What I came up with so far is to support a metav1.LabelSelector:

// Parameters defines how much and which resources are needed for a claim.
type Parameters struct {
	metav1.ObjectMeta `json:"metadata,omitempty"`
	metav1.TypeMeta   `json:",inline"`

	// Count is the required number of items.
	Count int64 `json:"count"`

	// In addition to the "In", "NotIn", "Exists", "DoesNotExist"
	// operators from [k8s.io/apimachinery/pkg/apis/meta/v1.LabelSelectorOperator],
	// additional operators from [k8s.io/dynamic-resource-allocation/apis/meta/v1.LabelSelectorOperator]
	// are also supported:
	//
	// - "Version>=", "Version>", "Version<=", "Version<", "Version=", "Version!=":
	//   the Selector.MatchExpressions values must
	//   contain a single semantic version string. The expression
	//   matches if the version in the label satisfies the condition
	//   as defined by semantic versioning 2.0.0.
	// - ">=", ">", "<=", "<", "=", "!=": the Selector.MatchExpressions values must
	//   contain a single string which can be parsed as [resource.Quantity],
	//   with comparison against the labels' value using the numeric operation.
	Selector metav1.LabelSelector `json:"selector"`
}
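For illustration, a claim parameters object of this shape might look like the following; the API group, kind, and label keys here are made up for the example and are not part of the proposal:

```yaml
# Hypothetical example of the extended selector; group, kind, and label keys
# are illustrative only.
apiVersion: counter.example.com/v1alpha1
kind: Parameters
metadata:
  name: two-matching-items
count: 2
selector:
  matchExpressions:
    - key: example.com/driverVersion
      operator: "Version>="
      values:
        - "1.2.3"
    - key: example.com/model
      operator: In
      values:
        - a100
```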

The implementation will be a fork of https://github.com/kubernetes/apimachinery/blob/3c8c1f22dc332a8a6b51033f6a7c0566b885d290/pkg/apis/meta/v1/helpers.go#L31-L72 and https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/labels/selector.go#L212-L267, with semantic versioning handled by github.com/blang/semver/v4.

Using a metav1.LabelSelector as type in the API is IMHO simpler for users than allowing a full expression (foo in (x y z), bar != 1) (https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#set-based-requirement).

Another benefit is that validating the parameters becomes easier. The CRD could limit the keys to known, supported labels. Enforcing the correct operation for each label (i.e. only allowing semver ops for a label that has a version as its value) probably needs an admission webhook, but could be implemented that way.

Does this make sense?

cc @klueska @byako

Contributor

Have you considered expanding metav1.LabelSelector with these features vs. making another label selector?

Contributor

For NVIDIA GPU selection we will also require the following operations:

  • Globbing, e.g. (if model == "A100_PCIE_40GB", being able to select on model == "*a100*")
  • Comparison of booleans with possible values of only true/false

We also need to be able to combine these operations in (somewhat) complicated ways, e.g. a common selector people want to express is:

model == "*v100*" || (cudaComputeCapability >= 7.0 && memory >= 16GB)

We could maybe get away with not allowing arbitrary boolean expressions for now, but we do need globbing and boolean comparisons.

Contributor

At the moment, the simple counter model uses the same struct for publishing capacity and requesting it. Most other models will have different structs, so let's extend the simple counter a bit so that it also has two.

What does this mean? What will the Request struct look like, exactly? Is this the difference between what is used in claims vs. what is passed over the gRPC protocol from the DRA plugin to the kubelet?

Contributor Author

Yes, that's the difference. The "capacity type" is used when publishing capacity of a node, the "parameter type" when asking for resources via the vendor CRD.

Member

Makes sense. I think it's covered in the proposal, but take a look tomorrow and see what you think. The proposal still has ResourceClaim with a reference to a vendor-specific CR. We translate that into a vendor neutral request object using CEL (which can operate on arbitrary CRs by treating them as Unstructured). Various selection criteria could be included there (they are not discussed in there now, beyond a boolean true/false).

Member

I didn't click thru everything yet, but tell me if I am off-target here: This seems like a place where we need to evaluate semi-arbitrary expressions. One might say we need to agree on a common language for those expressions, with extensible semantics.

On globbing: IMO globbing indicates that you have a piece of stringly-typed data, which would almost always be better as N discrete items. This is why label selectors DO NOT support globbing, and instead demand that you factor your data into labels. AND STILL, people choose to parse names, and inevitably it is causing pain and misery.

Coming up with a string which composes the model, vendor, subtype, etc. should be a very last resort. Globbing is a smell.

Contributor Author

I agree that we need a more comprehensive expression language than metav1.LabelSelector. metav1.LabelSelector only gives us <term1> && <term2> && ... where each term is one label check.

In @johnbelamaric's proposal, each hardware instance is described by more than one numeric model and the vendor CRD with the claim parameter then gets mapped to one parameter type per model. Each model is then checked independently and all must match an instance.

At first glance this looks appealing because one can "compose" the description of the hardware from different building blocks. But now consider mode switching. Each mode may have labels that need to match the selector provided by the user. The "selector model" is unaware of "modes", so it cannot be used. The "mode switch model" has to re-implement it. Alternatively, more details of the mode switch model need to be hoisted up into the resource.k8s.io API. This is a slippery slope where the common API becomes more and more complex.

Therefore I prefer to stick with "one model per driver" and instead implement label selection as a shared library that can be used by the different models. Both the "counter model" and the "mode switch model" would support it, with the difference that the "counter model" publishes a flat set of labels as part of its capacity while the "mode switch model" has labels for the entire instance and each mode.

Contributor

@klueska klueska Feb 3, 2024

Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name returned by nvidia-smi.

Examples of such model names include the following:

Tesla V100-SXM2-16GB
NVIDIA T4 32GB
NVIDIA A100-SXM4-40GB
NVIDIA H100 PCIe

As some background, it's common for people to prototype their workloads in docker on a specific GPU model before moving them to Kubernetes. When doing so, they are keenly aware of the GPU model they are running on, and want to ensure that they continue to run on this same GPU model once deployed under Kubernetes.

Now keep in mind, this is already possible today using the extended resource exposed by the current device plugin API (i.e. nvidia.com/gpu) in conjunction with a node label that advertises the GPU model (assuming all GPUs on a node are homogeneous, of course).

It would look something like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: gpu-example
      image: nvidia/cuda
      resources:
        limits:
          nvidia.com/gpu: 2
  nodeSelector:
    nvidia.com/gpu.model: NVIDIA A100-SXM4-40GB

Independent of the fact that the selection above occurs at the node level rather than the device level (which DRA proposes to solve), matching on an exact model name is problematic for two reasons:

  1. As evidenced above, the format of the model name is inconsistent -- meaning that users have to remember the differences between them when putting their selectors together.
  2. There is not just a single model name for, say, an A100 card on a given machine. The model name encodes other information, such as whether it's on a motherboard that connects devices within a PCIe hierarchy or via SXM, or how much memory the card has on it (e.g. 40GB, 80GB, or 96GB).

Some examples of (2) include:

NVIDIA A100-PCIE-40GB
NVIDIA A100-SXM4-40GB
NVIDIA A100-PCIE-64GB
NVIDIA A100-SXM-64GB
NVIDIA A100-PCIE-80GB
NVIDIA A100-SXM4-80GB

It would make the UX much nicer for one to be able to select on something like "*A100*" in cases where they only care about getting access to an A100 (and don't care about the interconnect or the amount of memory -- they should be using a separate selector for memory anyway).

For cases where they do care about being more specific, they still can be, by either matching on the full model name or on a different subset of it using an alternate glob expression.

Given the (revised) proposal we have of using a DRA controller to translate any vendor specific selection parameters into selectors that the scheduler will understand (apologies I don't know the right terms here yet to be more specific), I imagine the scheduler would be able to take additional modifier expressions that tell it to ignore case when performing its globbing operation, for example, and the DRA controller could decide if it wants to apply these modifiers or not when performing the translation.

Member

Regarding globbing -- it's been a long-standing feature request for users to be able to request GPUs based on a subset of the model name

People want to glob, substring, prefix, or regex match kube selectors, too. Just say no.

The model name encodes other information

You have a stringly-typed structure. Ideally you would decompose this into actual atoms of information, and let the user select on those.

It would make the UX much nicer for one to be able to select on something like "A100" in cases where they only care about getting access to an A100

So name the concept that "A100" represents. "family"? "generation"? A glommed together name is great for dumping in a logfile, but we have data structures for a reason.

they should be using a separate selector for memory anyway

exactly

Anyway, I probably won't die on the hill of not allowing globbing, but I will push back strongly.

@swatisehgal
Contributor

/cc

Member

@johnbelamaric johnbelamaric left a comment

Some questions and comments but this is very much on the right track with respect to the concerns folks have expressed.

From a paperwork perspective, I think we should include all the models in this KEP, rather than separate KEPs. Even if we don't implement all the models in 1.30, we can decide which ones are required to go to beta in the beta criteria.

I think we can keep this separate from the larger DRA KEP for now at least.

I do think we should consider how far we can push this. If we can eliminate the need for the driver-by-driver back and forth communication via PodSchedulingContext that would go a long way towards alleviating concerns on the overall DRA KEP. I realize (at least I think this is true) that that process also covers node-side allocation failures pushing a pod back into the scheduling queue, and this wouldn't cover that. I need to think about that more.

@johnbelamaric
Member

To sort of answer my own question, I realized I got myself confused. We need three things, not two:

  • the capacity object (which will live in NodeResourceCapacity)
  • the capacity request object (which for counters is identical to the capacity object)
  • the capacity request expression object (defines the expressions used to generate a capacity request object from the CR)

I am assuming here that the capacity objects themselves, as stored in the NodeResourceCapacity, have a name or other identifier that we match to the capacity request object.

For how you "apply" the request to the current capacity, we could either have it be inherent in the numerical model type, or we can add an expression (and perhaps validation) object for that purpose too.
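To make the third item above a bit more concrete, here is a purely illustrative sketch; the field names are invented and are only meant to show the idea of deriving a capacity request from the vendor CR via a CEL expression:

```yaml
# Hypothetical sketch only; nothing here is proposed API.
capacityRequestExpressions:
  - capacityName: counter                    # matches a capacity object published by the driver
    countExpression: "params.spec.gpuCount"  # CEL evaluated against the vendor CR (as Unstructured)
```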

@pohly
Contributor Author

pohly commented Jan 26, 2024

From a paperwork perspective, I think we should include all the models in this KEP, rather than separate KEPs. Even if we don't implement all the models in 1.30, we can decide which ones are required to go to beta in the beta criteria.

I'm open to merging things. I'm just a bit worried that it will become harder to associate use cases and PRR sections with the right feature. I can try and we'll see what the outcome will be.

Related to this, how many feature gates do we want for this? We could make this granular (current proposal):

  • DynamicResourceAllocation = core DRA
  • DRANumericParameters = this extension
  • DRACounterNumericModel = counter model
  • DRAModeSwitchModel = mode switch model

Do we agree on this?

I do think we should consider how far we can push this. If we can eliminate the need for the driver-by-driver back and forth communication via PodSchedulingContext that would go a long way towards alleviating concerns on the overall DRA KEP.

We lose the following functionality if we remove PodSchedulingContext:

  • Support for network-attached resources: not covered because it would raise additional questions around how allocation can make the relevant resource configuration changes when there is no vendor-provided control plane controller.
  • Support use cases which don't fit into the proposed numeric models and don't need autoscaling. Such cases exist.

IMHO these are important enough to keep the core DRA as originally envisioned.

I realize (at least I think this is true) that that process also covers node-side allocation failures pushing a pod back into the scheduling queue, and this wouldn't cover that. I need to think about that more.

Triggering pod rescheduling in kubelet is going to be a separate KEP from @ffromani . DRA is avoiding the problem as much as possible by making sure that resources are set aside before the scheduling decision.

@ffromani
Contributor

Triggering pod rescheduling in kubelet is going to be a separate KEP from @ffromani . DRA is avoiding the problem as much as possible by making sure that resources are set aside before the scheduling decision.

Correct. We agreed to keep this part separate as it is related to, but not a dependency for, DRA progress. Of course, the needs and potential benefits for DRA enabled by rescheduling are going to be a major theme in the pod rescheduling conversation. I'm currently processing all the feedback received from the conversations with various SIGs and will update the design document ASAP.

@pohly
Contributor Author

pohly commented Jan 26, 2024

To sort of answer my own question, I realized I got myself confused. We need three things, not two:

Correct. But remember that we need them for each numeric model.

the capacity request object (which for counters is identical to the capacity object)

This made it a bad example 😢 The extended counter model with selectors will be a better example, because then capacity and parameter types will be different.

I am assuming here that the capacity objects themselves, as stored in the NodeResourceCapacity, have a name or other identifier that we match to the capacity request object.

They all have an "instance ID" chosen by the driver and that gets repeated in the allocation result, so when the scheduler picks one resource instance, the driver will know which one.

For how you "apply" the request to the current capacity, we could either have it be inherent in the numerical model type,

That's the current proposal, with some Go package implementing that logic. Because it's builtin, it doesn't need to be configurable.

@klueska
Contributor

klueska commented Feb 7, 2024

/assign @dchen1107
/assign @alculquicondor
/assign @MaciekPytel

Just want to make sure this is on all of your radars with the KEP freeze looming tomorrow. We need approvals from all of you before we can merge.

@thockin @johnbelamaric and I are circling around approving it from a reviewer's standpoint and hope to give it our sign-off sometime early tomorrow afternoon PST.

@johnbelamaric
Member

/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Feb 7, 2024
@johnbelamaric
Member

oh, that belongs on the issue not the PR

Member

@thockin thockin left a comment

My comments are a little all over the place, but I don't want to erase them. I think the priority here is to get something that helps us prove this overall model. As such, we might want to just defer things with a lot of nuance and instead focus on landing broad strokes, knowing full well that we will get some of it wrong and that it will lack the requisite sophistication.

There will be followup, so this can safely be less than MVP, so we can work with autoscaling and scheduling to make sure it works (and if not, we don't lose so much :)

I also think we need to get back in sync about selection via CEL. We probably didn't converge as much as I thought we had :)

parameters (i.e. parameters that are *not* needed for allocation but *are*
needed for configuration).

* With such a claim in place, DRA drivers "resolve" the contents of any
Member

For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?

Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.

Member

For consideration during impl: What if the resolved form of the vendor-specific params was in the status of the claim?

That was my original thought too. But Patrick pointed out that this would mean the vendor controllers all need write access to the claim status. We decided that the simplest way, from the point of view of RBAC, was the intermediate object, the ResourceClaimParameters. The vendor controllers can create and update those, which include the resolved claim parameters, plus the config data needed by the driver stored in a RawExtension for pass-through. The scheduler only needs read access to that resource, rather than to any vendor-specific resources.

Or another option, which I am not sure is better or worse UX: What if the Pod referenced a vendor-specific "claim resource" directly, and that claim resource's status referenced a generic claim? That makes the user's life slightly simpler, but relies on duck-typing.

Similar RBAC issues then, as now anything that needs to follow that reference needs RBAC to those vendor-specific objects.

Contributor Author

What if the resolved form of the vendor-specific params was in the status of the claim?

Besides the permission issue already mentioned by @johnbelamaric, there's also the "many claims referencing the same parameter object" aspect. This will happen with claim templates created by app controllers. Asking the DRA driver controller to constantly check for new claims and repeat the same information again and again seems sub-optimal to me.

OTOH, it has one potential advantage: the translation could be specific to the claim. But in the scenario above with one claim template per app, there wouldn't be any difference between the generated claims, so I think this doesn't matter that much.

Let's stick with the current approach.

object used in the request.

In this KEP, we define a single "semantic model" called
`PartitionableResources`, and use it throughout the discussion to help describe
Member

I don't think "Partitionable" adds enough value to justify the typing difficulty it imposes. I'll think about names as I do this read-thru

Member

Divisible?

Segmented?

Contributor Author

ModeSwitching?


### Publishing node resources

The kubelet publishes NodeResourceSlices to the API server with content
Member

Clarification:

Is this EXCLUSIVELY about kubelet, or would it make sense to expand this to something like:

"""
NodeResourceSlices are published to the API server to describe resources which are available on each node. Often they will be produced by kubelet, with information from on-node agents, but they can also be published by other agents. Access control through the node authorizer ...."
"""

Contributor Author

My intent at some point was to allow admins to create NodeResourceSlices for pod groups, as information for the autoscaler. @MaciekPytel preferred to have that information in the autoscaler configuration or to get it from cloud providers, so at this point only kubelet is expected to publish these objects.

Either way, kubelet would have been responsible for all objects with the same nodeName field as the node on which kubelet runs. I'll make that clearer:

The kubelet publishes NodeResourceSlices to the API server with content provided by DRA drivers running on its own node. It's also responsible for deleting or updating stale objects. The nodeName field determines which objects each kubelet instance is responsible for.
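For illustration, a minimal object following that description could look roughly like this; apart from the nodeName field and the per-driver split, the names are assumptions:

```yaml
# Rough sketch; nodeName and the one-object-per-driver split come from the
# text above, everything else is illustrative and may change.
apiVersion: resource.k8s.io/v1alpha2
kind: NodeResourceSlice
metadata:
  name: worker-1-gpu.example.com
nodeName: worker-1          # determines which kubelet instance manages this object
driverName: gpu.example.com # content is provided by this DRA driver
```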

Member

Is the following valid or invalid:

The cloud-provider knows all about the machines and hardware. They run a controller which creates NodeResourceSlices for each machine instead of kubelet.

I guess, at the end of it all, kubelet has to feed device allocation decisions back to the runtime, so it's sort of moot, but I just hate being over-prescriptive.

Member

I think it's valid. Especially considering that these are per-driver. So, kubelet and the on-node agents may know about some things, but for some drivers (e.g., vlan / vrf / routing policy configs on the ToR or distribution routers) they may actually not know, and a different agent populates this.


NodeResourceSlices are published separately for each driver, using a version of
the API that matches what that driver is using. This implies that the kubelet, DRA
driver, and apiserver must all support the *same* API version. It might be
Member

This is a scary paragraph which I don't fully understand. Version skew is a VERY real thing, and anything that demands exact same versions is too brittle.

Member

I don't think the paragraph is accurate. Round-tripping should work here just as well as anywhere else. So the functionality will revert to the oldest version understood by all components (in effect).

Contributor Author

This goes back to the discussion about strong typing vs. letting kubelet pass through opaque data.

What I was trying to point out here is that kubelet and the DRA drivers on the node must agree on the version of the data and its representation when the DRA driver passes it over gRPC. If kubelet is using "resource.k8s.io/v1alpha2", then the driver cannot give it data defined by "resource.k8s.io/v1beta1". It has to be "resource.k8s.io/v1alpha2".

I'll list that as the current approach. I have an idea how to support version skew better, but it's not required for 1.30 (just one alpha version...).

Member

API version negotiation for plugins is always a thing, but saying "the kubelet ... and apiserver must all support the same API version" is a little much. It's 100% normal that you can't use a new Pod API feature properly if kubelet is back-rev.

String: 1.2.3
- name: runtimeVersion
String: 11.1.42
- name: t1000-gpu
Member

Is this correct YAML? I read this as a list-item under commonAttributes

Edit: reading on I see what's happening and I don't love it.

apiGroup: dra.example.com
uid: foobar-uid

vendorParameters:
Member

I'm lost on why we are copying that CR into this and what schema you will use to represent it?

Contributor

These are configuration parameters that are not needed for scheduling / CA decisions, but need to be passed to the kubelet to do configuration on the node as part of resource "preparation". I believe the idea was to use a RawExtension to represent it.

Member

Yes, exactly. So, the vendor CR contains various configuration parameters that are totally unrelated to scheduling and resource allocation. Those can remain opaque to K8s, but they need to make their way from the class and claim all the way down to the on-node driver. They also need to be captured as they existed at the time of allocation. So, the translation of the CR to vendor-neutral form also stuffs away any of these "config" related values from the class and claim into a passthrough RawExtension. This is then copied at the time of allocation into the resource claim status, along with the rest of the allocation information.

If a user subsequently updates the CR, it doesn't matter, because we have captured the config data as it was at the time of allocation. The updated CR will be used for future allocations but not for existing ones.

Member

hmm, I think I get it. I don't love it, but I get it, I think.

// +optional
metav1.ObjectMeta

Spec NodeResourceSliceSpec
Member

@thockin thockin Feb 7, 2024

Well, will there be status?

which of those resources are reserved for claims.

However, despite the finalizer on the claims it could happen that a
sufficiently determined user deletes a claim while it is in use. Therefore
Member

"well intentioned but poorly informed" is my go-to phrasing

Partitionable *PartitionableResources
}

type PartitionableResources struct {
Member

I'll go one further - can we jettison anything we're unsure about to reduce scope? We're certainly going to do an alpha2, so I think it's valuable to just be sure the gears mesh overall...


### Immediate allocation

Because there is no separate controller anymore, claims with immediate
Member

We need to look at this anew before beta. Is this still representing user intention well?

Contributor

claims with immediate allocation remain allocated when no longer in use

I think this is the important bit -- what use cases might require this (and why)?

I don't have any "immediately" on hand (pun intended!).

Member

AIUI, part of the motivation for immediate allocation was for expensive initialization. Suppose you need to flash a card, which takes a long time, then want to run a whole bunch of workloads against it in series (rather than in parallel). @pohly ?

Contributor Author

I think that use case is still served better by immediate allocation, because it still ensures that this expensively configured hardware is kept in that configuration after it has been used by one pod. The name just doesn't fit that well anymore.

Contributor

Even in the "existing" DRA model, no configuration happens until the first pod that uses a claim lands on the node where it has been allocated. That first pod will always be delayed by the time to configure in both scenarios.

The real difference with this KEP is that now the allocation itself won't happen until a pod that references the claim is scheduled. Whereas before, the DRA controller would pick an arbitrary node to allocate the claim on as soon as the claim object itself was created, and the scheduler was forced to schedule any pods referencing that claim on that node.

Member

Are we saying that "allocation mode" is defunct?

Member

I think this is a different topic. I had suggested adding an optional nodeName to resource claim in an earlier model. I cannot abide the "randomly pick a node". The nodeName would allow immediate allocation without that, and allow then for pods to be attached to that same (shared) claim.

But I don't think that's key in this KEP, I think that's a discussion for how we evolve base DRA in 1.31.

Contributor

I agree we should hold off on this, but I would also like to express my disdain for "randomly pick a node". It's the main reason we never built support for this mode in the existing DRA driver for GPUs.

@johnbelamaric
Member

Overall response to Tim's comments: I said it in one of the inline comments, but I'll repeat it here. I think we may be better off if we go with some simplified version of the partitionable (that is hard to type) model. I don't want everyone focused on the details of the model for 1.30, I think that's a distraction.

As I see it, the point of alpha is to reduce overall risk and learn from doing. The thing I think 1.30 should focus on for this KEP is the overall concept of the semantic model, how the CRs are processed, and how the scheduler, job controllers, cluster autoscaler and other components interpret that, draw down on the allocations, store and reconcile those allocations, etc. I think we should keep the actual model really simple as a placeholder, and then have a solid model to propose in the 1.31 design phase.

I think we should set aside a LOT of time at the contributor summit (and in subsequent days if folks are still in Paris) to really hash through those models real time, in person.

@klueska
Contributor

klueska commented Feb 8, 2024

What if we strip the partitionable model down to its very basics (and call it something else of course).

i.e. have it provide a flat list of opaque strings with no attributes attached, each string representing a unique resource that is allocatable on the node. This would be a direct mapping of the existing device plugin model to a DRA semantic model.

This would mean we can punt on talking about any sort of selection API (because all resources are equal), and focus solely on the vendor specific CRDs providing configuration parameters.

We lose the ability to do any sort of dynamic partitioning of resources, but it is still a net gain over the existing device plugin API since users will have the ability to do controlled sharing via resource claims and custom configuration through claim parameters.
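A sketch of what that stripped-down model could look like, with a hypothetical field name for the flat list:

```yaml
# Hypothetical sketch of the simplified model described above: a flat list of
# opaque strings, each naming one allocatable resource on the node.
apiVersion: resource.k8s.io/v1alpha2
kind: NodeResourceSlice
metadata:
  name: worker-1-gpu.example.com
nodeName: worker-1
driverName: gpu.example.com
namedInstances:   # hypothetical field name
  - gpu-0
  - gpu-1
  - gpu-2
  - gpu-3
```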

@pohly
Contributor Author

pohly commented Feb 8, 2024

I've pushed my updates where I tried to address the editorial changes (wording, take out implementation details for autoscaler).

What we need to decide now is which semantic model we want to describe in the KEP, if we can't agree on simply treating this as preliminary and "defined during implementation". It wouldn't be the first time that a KEP gets accepted with one API and the implementation's API then looks very different.... just saying. More in #4384 (comment).

@klueska
Contributor

klueska commented Feb 8, 2024

I just pushed my own PR against the branch for this PR that describes a much more simplified model (and reiterates many times over that this is a hypothetical model, not one we are proposing to use):
pohly#17

"Structured parameters" describe resources in a format that enables the
Kubernetes components to make scheduling decisions.

Co-authored-by: Kevin Klues <kklues@nvidia.com>
@pohly
Contributor Author

pohly commented Feb 8, 2024

I merged @klueska's update and then did another rename of "semantic parameters" into "structured parameters": that is a better name because we define some structure for the parameters and the associated mechanics, but not the actual semantics.

@pohly pohly changed the title KEP 4381: add semantic parameters for dynamic resource allocation KEP 4381: add structured parameters for dynamic resource allocation Feb 8, 2024
@klueska
Contributor

klueska commented Feb 8, 2024

Thanks @pohly for your patience and diligence in seeing this KEP through.

While I still think there is room for improvement (and there are still some unresolved differences in opinion between the primary reviewers on how this framework will eventually take shape), we need to start somewhere.

The KEP is in good enough shape and we are aligned enough, that I believe we can get started working on this in 1.30.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024
Member

@alculquicondor alculquicondor left a comment

/approve from sig-scheduling

I think this older approach is gone. But, for the record, having the scheduler watch arbitrary CRDs, even if just to pass them through, would have been a no-go for me.

- name: drivers
attributes:
- name: driverVersion
String: 1.2.3
Member

Suggested change:
- String: 1.2.3
+ string: 1.2.3

why not version though? What tells CEL that this can be compared using semantic versioning?

Contributor Author

You are right, this should have been "version". That field, and only that field, is something that CEL will be able to use in a semantic version comparison.
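With that correction, the attribute in the excerpt above would read something like:

```yaml
# Corrected form of the excerpt; only a "version" attribute is eligible for
# semantic-version comparisons in CEL.
attributes:
  - name: driverVersion
    version: "1.2.3"
```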

name: someArbitraryName
namespace: user-namespace

generatedFrom:
Member

why do we need this if the name, namespace and uid are already part of the vendorParameters below?

Contributor Author

Because the vendorParameters are an opaque RawExtension. It's not required that they contain the original object.
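For illustration, the relationship between the two fields might look roughly like this; the schema details and values are assumptions based on the excerpt above, not the final API:

```yaml
# Sketch only; the vendor kind and parameter values are hypothetical.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimParameters
metadata:
  name: someArbitraryName
  namespace: user-namespace
generatedFrom:              # identifies the vendor CR this object was generated from
  apiGroup: dra.example.com
  kind: GpuClaimParameters  # hypothetical vendor kind
  name: user-supplied-name
vendorParameters:           # opaque RawExtension, passed through to the on-node driver
  apiVersion: dra.example.com/v1alpha1
  kind: GpuClaimParameters
  sharing: true             # hypothetical vendor-specific setting
```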


Instead of defining a vendor-specific CRD, DRA driver authors (or
administrators) could decide to allow users to create and reference
`ResourceClaimParameters` directly within their `ResourceClaims`. This would
Member

What does the ResourceClaim look like?
How does it reference the parameters?

Contributor Author

The claim would reference the parameter object with "resource.k8s.io" as APIGroup, "ResourceClaimParameters" as kind, and then the name of the object. This object then gets used directly.
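In that case, a claim might look like this; the resource class name and object names are illustrative:

```yaml
# Illustrative sketch of a ResourceClaim referencing a ResourceClaimParameters
# object directly instead of a vendor CRD.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-claim
  namespace: user-namespace
spec:
  resourceClassName: gpu.example.com
  parametersRef:
    apiGroup: resource.k8s.io
    kind: ResourceClaimParameters
    name: someArbitraryName
```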

Like a normal DRA driver controller, the scheduler also sets a finalizer to
ensure that users cannot accidentally delete the allocated claim while a pod
is about to start which depends on it. That finalizer is
"structured.dra.k8s.io/delete-protection".
Member

why not just dra.k8s.io/delete-protection?

Contributor Author

That also works.

@dchen1107
Member

/approve from the SIG Node perspective, based on the review details from @johnbelamaric and others.

@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dchen1107, johnbelamaric, klueska, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@johnbelamaric
Member

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7ee5a73 into kubernetes:master Feb 9, 2024
4 checks passed
SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 9, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 9, 2024
@eero-t eero-t left a comment

Noticed a few typos

example, a cloud provider controller could populate this based upon information
from the cloud provider API.

In the kubelet case, each kubelet publishes kubelet publishes a set of

s/kubelet publishes kubelet publishes/kubelet publishes/

object used in the request.

This KEP is specifically focused on defining the framework necessary to enable
different "structured models" to be added to Kuberenetes over time. It is out of

s/Kuberenetes/Kubernetes/


If a driver wanted to use a different structured model to represent its resources,
a new structured model would need to be defined inside Kuberenetes, and a field
would need to be added to this struct at the same level as

s/Kuberenetes/Kubernetes/

tells them that they cannot handle the object because the API has been extended.

Drivers can use different structured models by publishing multiple
`NodeResourceSlice` objects, as long as each model represents a distinct set of

"distinct" as in "non-overlapping"?

```

Where "gpu-0" represents one type of card and "gpu-1" represents another (with
the attributes hanging off each serving to "define" their individual

simpler:
s/attributes hanging off each serving to "define"/related attributes defining/


For each node, one or more NodeResourceSlice objects get created. The kubelet
publishes them with the node as the owner, so they get deleted when a node goes
down and then gets removed.

s/ and then gets removed//

// allocated resources, others can be arbitrary setup parameters.
VendorClaimParameters *runtime.RawExtension

// NodeName is the name of the node providing the necessary resources.

Inconsistent indent.

remaining structured difference compared to delayed allocation is that claims
with immediate allocation remain allocated when no longer in use.

### Simulation with CA

For readability (CA could be e.g. Claim Allocation):
s/CA/Cluster Autoscaler/

- All watchers of node objects get the information even if they don't need it.
- It puts complex alpha quality fields into a GA API.

### Injecting vendor logic into CA

s/CA/Cluster Autoscaler/

several different ways:

- Call out to a vendor server via some RPC mechanism (similar to scheduler
webhooks). The risk here is that simulation becomes to slow. Configuration

s/to/too/
