
Minimal changes for partitionable devices in DRA evolution prototype #27

Merged

Conversation

johnbelamaric (Contributor)

This PR adds the minimal fields needed to support partitionable devices. A few notes for consideration:

  • Terminology chosen is SharedAllocatable for the shared pooled resources, and SharedAllocatableConsumed for the device values that consume items from the pool.
  • SharedAllocatable is a []ResourceCapacity which is a struct with just name and quantity. This leaves out BlockSize and IntRange stuff to keep it as simple as possible.
  • SharedAllocatableConsumed is a map[string]resource.Quantity to mirror PodSpec requests. Given that this is the capacity model, not the claim model, consistency may not be needed here. In that case, we can probably change this to a struct instead, which would give us more room for expansion in the future.
  • Max in each list is arbitrarily set to 32.
  • Since SharedAllocatable is directly in ResourcePool, that means each partitionable device needs to be its own pool. We could consider two other options:
    • Create another type for groups of SharedAllocatables. Kevin had this in an earlier version. Then each physical GPU would be in one of those, and the devices would name the group in addition to the pool in their SharedAllocatableConsumed:
    type ResourcePool struct {
      ...
      Devices []Device
    
      SharedAllocatable []AllocatableGroup
    }
    type AllocatableGroup struct {
      Name string
      Allocatable []ResourceCapacity
    }
    type Device struct {
      ...
      SharedAllocatableConsumed []ResourceRequest
    }
    type ResourceRequest struct {
      AllocatableGroupName string
      ResourceName string
      Quantity resource.Quantity
    }
    • Instead of the pool containing Devices []Device, it would contain DeviceGroups []DeviceGroup, where:
    type DeviceGroup struct {
      SharedAllocatable []ResourceCapacity
      Devices []Device
    }
    • I am sort of OK with the pool-per-GPU, as it is the most minimal change. I could also see going with the AllocatableGroup option, since it is incremental. At this point I don't much like the last option. (A sketch of the baseline types described above follows this list.)
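A rough sketch of those baseline types (field shapes only; validation tags and other details omitted, so treat this as illustrative rather than the exact prototype code):

type ResourcePool struct {
  ...
  Devices []Device

  // Pooled resources that devices in this pool draw from.
  SharedAllocatable []ResourceCapacity
}

type ResourceCapacity struct {
  Name     string
  Capacity resource.Quantity
}

type Device struct {
  ...
  // Amount of each pooled resource this device consumes when allocated
  // (mirrors the map shape of PodSpec requests).
  SharedAllocatableConsumed map[string]resource.Quantity
}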

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 6, 2024
// Must not have more than 32 entries.
//
// +optional
SharedAllocatableConsumed map[string]resource.Quantity `json:"sharedAllocatableConsumed,omitempty"`
Contributor

Do we need a level of indirection here instead of just a map of Quantity? Otherwise we can never use anything other than a quantity in ResourceCapacity.

Contributor Author

probably, see the PR comment notes: #27 (comment)

Contributor Author

Would love your opinion specifically on the last question - pool-per-gpu or something else

@klueska (Contributor), Jun 11, 2024

I personally like the idea of grouping them with names. It means I'm not forced to have one GPU per ResourceSlice if I don't want to. This will likely be important for "simpler" devices that only partition on, say, one dimension.

For example with this, I can have the following in a single ResourceSlice (using a newly proposed name for this field):

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-capacity
    resources: 
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  - name: gpu-1-shared-capacity
    resources: 
    - name: memory
      capacity: 40Gi
    - name: compute
      capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 40Gi
      - name: compute
        capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-capacity
      resources: 
      - name: memory
        capacity: 20Gi
      - name: compute
        capacity: 90

@klueska (Contributor), Jun 11, 2024

It also makes it more flexible such that one device could pull from both the shared resources of a parent GPU and some (possibly) off-chip shared resources that all devices have access to (regardless of their parent).

Contributor Author

Agreed, I like this too.

Contributor Author

@klueska I pushed an update with this, PTAL.

Contributor

Patrick convinced me to go back to a flat list, which can accomplish the same with:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
metadata:
  name: <uuid>
spec:
  driver: gpu.nvidia.com
  poolName: node0
  nodeName: node0
  sharedCapacity:
  - name: gpu-0-shared-memory
    capacity: 40Gi
  - name: gpu-0-shared-compute
    capacity: 180
  - name: gpu-1-shared-memory
    capacity: 40Gi
  - name: gpu-1-shared-compute
    capacity: 180
  devices:
  - name: gpu-0
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 40Gi
    - name: gpu-0-shared-compute
      capacity: 180
  - name: gpu-0-first-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-0-second-half
    sharedCapacityConsumed:
    - name: gpu-0-shared-memory
      capacity: 20Gi
    - name: gpu-0-shared-compute
      capacity: 90
  - name: gpu-1
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 40Gi
    - name: gpu-1-shared-compute
      capacity: 180
  - name: gpu-1-first-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90
  - name: gpu-1-second-half
    sharedCapacityConsumed:
    - name: gpu-1-shared-memory
      capacity: 20Gi
    - name: gpu-1-shared-compute
      capacity: 90

I will add it in the KEP as such.

@johnbelamaric (Contributor Author) commented Jun 7, 2024

Partitionable devices, along with driver-side magic, can also support the idea of "compound devices". Here's how it would work.

Suppose we have two drivers, one for GPUs and one for NICs. We have nodes with 8 GPUs and 4 NICs. We want to allow certain "valid" combinations of these to be consumed as a unit. The "rules" here are:

  • NICs are not available without a GPU
  • GPUs are available with or without a NIC, but specific GPUs go along with specific NICs, like:
    • gpu0,1 go with nic0
    • gpu2,3 go with nic1
    • gpu4,5 go with nic2
    • gpu6,7 go with nic3

This implies the following valid "molecules" for the triplet (gpu0, gpu1, nic0):

  • gpu0
  • gpu0 + nic0 (leaving gpu1 only available by itself, without a NIC)
  • gpu1
  • gpu1 + nic0 (leaving gpu0 only available by itself, without a NIC)
  • gpu0 + gpu1 + nic0

Similarly, there are 5 valid combinations for each of the other triplets.

So, what we do is create a "compound device" driver that runs on the node but acts as an intermediary between the K8s control plane and the drivers for the underlying devices. It contains an in-process mini API server that serves the ResourcePool API, and we point the GPU and NIC drivers at that local instance. The compound device driver uses those to construct a new compound pool on top of those drivers that follows the rules above, using this partitionable model:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourcePool
metadata:
  name: node0-compound0
spec:
  driver: compound.example.com
  nodeName: node0
  devices:
  - name: gpu0
    sharedConsumed:
      gpu0: 1
  - name: gpu0-nic0
    sharedConsumed:
      gpu0: 1
      nic0: 1
  - name: gpu1
    sharedConsumed:
      gpu1: 1
  - name: gpu1-nic0
    sharedConsumed:
      gpu1: 1
      nic0: 1
  - name: gpu0-gpu1-nic0
    sharedConsumed:
      gpu0: 1
      gpu1: 1
      nic0: 1
...
  sharedConsumable:
  - name: gpu0
    capacity: 1
  - name: gpu1
    capacity: 1
  - name: gpu2
    capacity: 1
  - name: gpu3
    capacity: 1
  - name: gpu4
    capacity: 1
  - name: gpu5
    capacity: 1
  - name: gpu6
    capacity: 1
  - name: gpu7
    capacity: 1
  - name: nic0
    capacity: 1
  - name: nic1
    capacity: 1
  - name: nic2
    capacity: 1
  - name: nic3
    capacity: 1
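
Working through the arithmetic on the first triplet: allocating the gpu0-gpu1-nic0 device consumes gpu0, gpu1, and nic0 from the shared consumables, so gpu0, gpu1, gpu0-nic0, and gpu1-nic0 no longer fit; allocating just gpu0-nic0 consumes gpu0 and nic0, leaving gpu1 allocatable only by itself. That matches the list of valid molecules above.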

The compound device driver is the only one that actually publishes anything to the K8s control plane. It is also what kubelet makes calls to, and it in turn calls down to the other drivers.

There are lots of details to work out for this, of course. For example, ideally users don't need to know they are using this intermediary, except maybe based on the class they choose. This would mean that the CEL-based attributes they use should still be the ones used by the underlying devices, rather than ones particular to the compound device driver (which may also have some of its own). For that, we may need to make sure that attributes are always qualified, rather than allowing the short-hand of "unqualified means from the driver". Otherwise I can see a lot of confusion, especially in copy-and-paste situations.

There are also a few limitations:

  • Underlying devices will only be available as part of a compound device. You cannot make them available both as independent devices managed by their own drivers, and as part of a compound device.
  • Theoretically it would be possible to make a compound device that includes partitions from underlying partitionable devices, but this would likely be pretty difficult and is not something we should make a priority (in fact the compound driver probably should not allow it, at least for the foreseeable future).
  • Similarly, when we add allocatable device resources as a way of creating fractional devices, we probably will want to exclude that from the compound device driver functionality. It may theoretically be possible to support fractional devices in the compound device by utilizing fractional allocation of underlying devices, but that would be...complex.

@klueska (Contributor) commented Jun 11, 2024

I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later.
The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 11, 2024
Comment on lines 169 to 193
// SharedCapacity is a set of resources which can be drawn down upon.
type SharedCapacity struct {
// Name is the name of this set of shared capacity, which is unique
// within the resource pool.
// +required
Name string `json:"name"`

// Resources are the list of resources provided in this set.
Resources []ResourceCapacity `json:"resources,omitempty"`
}

// SharedCapacityRequest is a request to draw down resources from a particular
// SharedCapacity.
type SharedCapacityRequest struct {

// Name is the name of the SharedCapacity from which to draw the
// resources.
//
// +required
Name string `json:"name"`

// Resources are the list of resources and the amount of resources
// to draw down.
Resources []ResourceCapacity `json:"resources,omitempty"`
}
Contributor

Do you expect these to diverge at some point? Right now they are identical.

Contributor Author

Not really. I considered making them the same, but since their usage is different I kept them separate. I don't care which way we do it, whatever API reviewers prefer.

Comment on lines 83 to 85
// Name is the resource name/type.
// +required
Name string `json:"name"`
Contributor

I assume all of the Name fields should be a DNS label?

Contributor

Is there a mismatch between the type names? Above it's SharedCapacity, here it's ResourceCapacity.

Contributor Author

I am flexible on the various names. SharedCapacity is a named list of ResourceCapacity objects.

Contributor

Presenting structs in the order in which they are needed would help.

The first change is adding SharedCapacity []SharedCapacity, so here I would expect the definition of that SharedCapacity struct, not the ResourceCapacity that is currently here.

If I am following the structs correctly, SharedCapacity is a map[name1]map[name2]quantity, where both maps are lists due to how we do such maps in Kubernetes.

Can't this be flattened to map[name]quantity?
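For illustration (names borrowed from the example YAML earlier in this thread), the two shapes being compared, written as plain Go maps rather than the list-based form the API uses:

package shapes

import "k8s.io/apimachinery/pkg/api/resource"

// Nested shape: shared-capacity name -> resource name -> quantity.
var nested = map[string]map[string]resource.Quantity{
	"gpu-0-shared-capacity": {
		"memory":  resource.MustParse("40Gi"),
		"compute": resource.MustParse("180"),
	},
}

// Flattened shape: a single name -> quantity map, with the grouping folded
// into the resource name itself.
var flat = map[string]resource.Quantity{
	"gpu-0-shared-memory":  resource.MustParse("40Gi"),
	"gpu-0-shared-compute": resource.MustParse("180"),
}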

Contributor

I assume all of the Name fields should be a DNS label?

Can we make the validation identical to the identifier part of an attribute name? That's merely for consistency, not because the two are related. As explained elsewhere, these shared capacity names are not exposed to users.

@@ -64,12 +64,33 @@ type ResourcePoolSpec struct {
// Must not have more than 128 entries.
Devices []Device `json:"devices,omitempty"`

// SharedCapacity are pooled resources that are shared by all
// devices in the pool. This is typically used when representing a
Contributor

Is this the pool as defined in kubernetes/enhancements#4709 or the object?

@klueska (Contributor), Jun 11, 2024

This field would be directly in the ResourceSlice. The description is out of date. It's a shared set of resources, with a name, that devices in the Slice reference / consume when allocated. They are local to this specific slice and only consumable by devices listed in this slice.

Contributor

This becomes relevant when considering a future "reservation" protocol. Reserving a device and the corresponding shared capacity will be a lot easier if they are in the same object.

Contributor

You don't have to reserve the shared capacity -- just the device (as before). By reserving a device, you implicitly decrease the amount of resources available in the shared capacity.

Contributor

I was thinking towards a future protocol for allocation when there are multiple schedulers in a cluster which need to allocate the same devices.

Contributor Author

"in the pool" should read "in the ResourceSlice", so they are the same object, yes!

You get a whole device or not; that is what determines the subtraction from the shared set. It doesn't matter whether there are multiple schedulers; they just need to coordinate on the device allocation.

@klueska (Contributor), Jun 11, 2024

Hmm, yes, you're right. In that case, they would need to know how much of the overall shared capacity has already been depleted.

Contributor

Actually, so long as we can see the list of all devices that have already been reserved, then each scheduler can do the math to track the depleted capacity independently.

Contributor Author

Yes, that's exactly how we do it; this is the same with and without shared resources.

// resources.
//
// +required
Name string `json:"name"`
Contributor

Can you show some YAML examples of how this is going to be used? How is potential name ambiguity resolved once someone asks for some kind of device which could be provided by different vendors?

Contributor

My gut feeling is that this may depend on the same naming convention as for device attributes and also shouldn't overlap with those, i.e. "memory" in the gpu.dra.example.com/ scope can be either an attribute or a consumable resource, but not both.

Contributor

It will be used like this:
#27 (comment)

It is not referenceable outside of the devices in this particular slice, meaning it's not visible to CEL expressions. If you want it to be visible in a CEL expression, you need to duplicate it as an attribute.

Contributor

I think I was confused by the "request" part of the name. This isn't something that users put into a claim to "request" this amount.

Do users ever need to interact with this in a claim? How do they ask for "a GPU with at least 10Gi memory"?

Contributor

Overlapping comments... I think you answered my question. Still, including YAMLs in the PR would help to clarify this.

Contributor

To maybe be more clear: sharedCapacity exists for the scheduler, not the end user. The scheduler will track each SharedCapacity and see what's still available in it when considering a device for allocation that references it in its SharedCapacityConsumed. If there are not enough resources available in it (because they have already been consumed by previously allocated devices), then the device is not selectable.
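A rough sketch of that scheduler-side bookkeeping, treating the shared capacities as a flat name-to-quantity map as in the flat-list form shown earlier (the helper names here are hypothetical, not the actual allocator code):

package sched

import "k8s.io/apimachinery/pkg/api/resource"

// deviceFits reports whether a candidate device's SharedCapacityConsumed
// still fits within what remains of each shared capacity it references,
// after subtracting everything consumed by already-allocated devices.
func deviceFits(remaining, consumed map[string]resource.Quantity) bool {
	for name, want := range consumed {
		have, ok := remaining[name]
		if !ok || have.Cmp(want) < 0 {
			return false // capacity missing or already depleted
		}
	}
	return true
}

// allocate subtracts an allocated device's consumption from the remaining
// shared capacities, so later candidates see the depleted values.
func allocate(remaining, consumed map[string]resource.Quantity) {
	for name, want := range consumed {
		have := remaining[name]
		have.Sub(want)
		remaining[name] = have
	}
}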

Contributor

I will be sure to include YAMLs in the PR when adding text about this to kubernetes/enhancements#4709

Contributor Author

No, users never interact with this. It is internal to the device/capacity model. We can use a different word than "Request"...

Contributor Author

Nothing changes on the claim side for this. It just helps us keep track of which partitions (devices) are allowed and which are already used up, since they can be indirectly used up by another device using a shared resource.

@johnbelamaric (Contributor Author)

I'm not sold on the complex-device scenario you proposed here, but I think we could iterate on that later. The more important thing is to agree on the API for partitionable devices, and I'm fairly happy with the naming / structure I proposed in my comment here: https://github.com/kubernetes-sigs/wg-device-management/pull/27/files#r1634768662

Yeah, that is 100% on top of this without affecting what this looks like. It's something I want to prototype before too long - but it would be out-of-tree anyway :)

@johnbelamaric (Contributor Author) commented Jun 12, 2024 via email

@pohly (Contributor) commented Jun 12, 2024

The rationale for that:

  • The quantities are for machines, not humans, so efficiency trumps readability.
  • No nesting means that the maximum slice size can be higher, with the driver deciding how it wants to use that.


// Capacity is the total capacity of the named resource.
// +required
Capacity resource.Quantity `json:"capacity"`
Contributor

I prefer not using a resource.Quantity. As pointed out elsewhere, this is not a user-facing API. It's for the scheduler machinery. A simpler Count int64 would be easier to encode and work with.

If there ever is a need for a value < 1, then the driver can scale both capacity and consumed values up by a fixed factor (1000, 1000000).
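A tiny sketch of that scaling idea, with the factor of 1000 chosen purely for illustration:

package main

import "fmt"

// Capacities are plain int64 counts rather than resource.Quantity. If a
// driver ever needs values below 1, it scales both the capacity and the
// consumed values by a fixed factor, so 0.5 of a resource becomes 500.
const scale = 1000

func main() {
	capacity := int64(1 * scale)       // one whole shared resource
	halfDevice := int64(scale / 2)     // a partition consuming half of it
	fmt.Println(capacity - halfDevice) // 500 units remain for other partitions
}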

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 14, 2024
Comment on lines 29 to 30
- capacity: 40Gi
name: gpu-0-memory
Contributor

I don't think this needs to be a capacity, but it does need to be an attribute so that users can select on it.

Contributor

This does then need to have capacities for all of the memory blocks, though (which is probably more consistent anyway).

Comment on lines 8 to 26
- attributes:
- name: uuid
string: GPU-c2fe2ade-955a-421f-976e-3f1dd7254234
- name: product-name
string: Mock NVIDIA A100-SXM4-40GB
- name: brand
string: Nvidia
- name: architecture
string: Ampere
- name: cuda-compute-capability
version: 8.0.0
- name: driver-version
version: 550.54.15
- name: cuda-driver-version
version: 12.4.0
- bool: true
name: mig-capable
- bool: false
name: mig-enabled
Contributor

I imagine there to be quite a few more attributes here eventually. Specifically around NVLINKs, PCIERoot complexes, etc. These will also need to be duplicated in each MIG device (unless we build a section for common attributes, grouped by name, that can be referenced by different devices).

@thockin commented Jun 15, 2024

Since this is "what's in the KEP", I think we should merge it and rebase all the options on it, so they appear as diffs. But I screwed up and LGTM'ed option 2 (#29). I don't have super on this repo, so I cannot manually fix it.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 15, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2024
@johnbelamaric (Contributor Author)

Ok, this matches the KEP. Merging.

@johnbelamaric johnbelamaric merged commit fdc5827 into kubernetes-sigs:main Jun 16, 2024
1 of 2 checks passed