diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index f49900d7616..a6f6cd80523 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -64,7 +64,7 @@ If none of those approvers are still appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG Architecture for cross-cutting KEPs). --> -# [KEP-3063](https://github.com/kubernetes/enhancements/issues/3063): Dynamic resource allocation +# [KEP-3063](https://github.com/kubernetes/enhancements/issues/3063): Dynamic Resource Allocation with Control Plane Controller @@ -75,52 +75,21 @@ SIG Architecture for cross-cutting KEPs). - [Non-Goals](#non-goals) - [Proposal](#proposal) - [User Stories](#user-stories) - - [Cluster add-on development](#cluster-add-on-development) - - [Cluster configuration](#cluster-configuration) - - [Partial GPU allocation](#partial-gpu-allocation) - [Network-attached accelerator](#network-attached-accelerator) - [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions) - [Notes/Constraints/Caveats](#notesconstraintscaveats) - - [Risks and Mitigations](#risks-and-mitigations) - - [Feature not used](#feature-not-used) - - [Compromised node](#compromised-node) - - [Compromised resource driver plugin](#compromised-resource-driver-plugin) - - [User permissions and quotas](#user-permissions-and-quotas) - - [Usability](#usability) - [Design Details](#design-details) - - [Theory of operation](#theory-of-operation) - - [Components](#components) - - [State and communication](#state-and-communication) - - [Custom parameters](#custom-parameters) - - [Allocation modes](#allocation-modes) - - [Sharing a single ResourceClaim](#sharing-a-single-resourceclaim) - - [Ephemeral vs. 
persistent ResourceClaims lifecycle](#ephemeral-vs-persistent-resourceclaims-lifecycle) + - [ResourceClass extension](#resourceclass-extension) + - [ResourceClaim extension](#resourceclaim-extension) + - [ResourceClaimStatus extension](#resourceclaimstatus-extension) + - [ResourceHandle extensions](#resourcehandle-extensions) + - [PodSchedulingContext](#podschedulingcontext) - [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler) - [Resource allocation and usage flow](#resource-allocation-and-usage-flow) - [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims) - - [Handling non graceful node shutdowns](#handling-non-graceful-node-shutdowns) - - [API](#api) - - [resource.k8s.io](#resourcek8sio) - - [core](#core) - - [kube-controller-manager](#kube-controller-manager) - - [kube-scheduler](#kube-scheduler) - - [EventsToRegister](#eventstoregister) - - [PreEnqueue](#preenqueue) - - [Pre-filter](#pre-filter) - - [Filter](#filter) - - [Post-filter](#post-filter) - - [Pre-score](#pre-score) - - [Reserve](#reserve) - - [PreBind](#prebind) - - [Unreserve](#unreserve) - [Cluster Autoscaler](#cluster-autoscaler) - - [kubelet](#kubelet) - - [Managing resources](#managing-resources) - - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin) - - [NodePrepareResource](#nodeprepareresource) - - [NodeUnprepareResource](#nodeunprepareresource) - - [Implementing optional resources](#implementing-optional-resources) - - [Implementing a plugin for node resources](#implementing-a-plugin-for-node-resources) + - [Implementing a plugin for node resources](#implementing-a-plugin-for-node-resources) + - [Implementing optional resources](#implementing-optional-resources) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -140,20 +109,6 @@ SIG Architecture for cross-cutting KEPs). - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) -- [Alternatives](#alternatives) - - [Semantic Parameters instead of PodSchedulingContext](#semantic-parameters-instead-of-podschedulingcontext) - - [ResourceClaimTemplate](#resourceclaimtemplate) - - [Reusing volume support as-is](#reusing-volume-support-as-is) - - [Extend volume support](#extend-volume-support) - - [Extend Device Plugins](#extend-device-plugins) - - [Webhooks instead of ResourceClaim updates](#webhooks-instead-of-resourceclaim-updates) - - [ResourceDriver](#resourcedriver) - - [Complex sharing of ResourceClaim](#complex-sharing-of-resourceclaim) - - [Improving scheduling performance](#improving-scheduling-performance) - - [Optimize for network-attached resources](#optimize-for-network-attached-resources) - - [Moving blocking API calls into goroutines](#moving-blocking-api-calls-into-goroutines) - - [RPC calls instead of PodSchedulingContext](#rpc-calls-instead-of-podschedulingcontext) -- [Infrastructure Needed](#infrastructure-needed) ## Release Signoff Checklist @@ -196,119 +151,31 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Users are increasingly deploying Kubernetes as management solution for new -workloads (batch processing) and in new environments (edge computing). Such -workloads no longer need just RAM and CPU, but also access to specialized -hardware. 
With upcoming enhancements of data center interconnects, accelerators -can be installed outside of specific nodes and be connected to nodes -dynamically as needed. - -This KEP introduces a new API for describing which of these new resources -a pod needs. The API supports: - -- Network-attached resources. The existing [device plugin API](https://github.com/kubernetes/design-proposals-archive/blob/main/resource-management/device-plugin.md) - is limited to hardware on a node. -- Sharing of a resource allocation between multiple containers or pods. - The device manager API currently cannot share resources at all. It - could be extended to share resources between containers in a single pod, - but supporting sharing between pods would need a completely new - API similar to the one in this KEP. -- Using a resource that is expensive to initialize multiple times - in different pods. This is not possible at the moment. -- Custom parameters that describe resource requirements and initialization. - Parameters are not limited to a single, linear quantity that can be counted. - With the current Pod API, annotations have to be used to capture such - parameters and then hacks are needed to access them from a CSI driver or - device plugin. - -Support for new hardware will be provided by hardware vendor add-ons. It will -not be necessary anymore to modify Kubernetes itself. - -This KEP does not replace other means of requesting traditional resources -(RAM/CPU, volumes, extended resources). The scheduler will serve as coordinator -between the add-ons which own resources (CSI driver, resource driver) and the -resources owned and assigned by the scheduler (RAM/CPU, extended resources). +Originally, this KEP introduced DRA in Kubernetes 1.26 and the ["structured +parameters" KEP](../4381-dra-structured-parameters/README.md) added an +extension. Now the roles are reversed: #4381 defines the base functionality +and this KEP is an optional extension. + +With #4381, DRA drivers are limited by what the structured parameter model(s) +defined by Kubernetes support. New requirements for future hardware may depend +on changing Kubernetes first. + +With this KEP, parameters and resource availability are completely opaque +to Kubernetes. During scheduling of a pod, the kube-scheduler and any DRA +driver controller(s) handling claims for the pod communicate back-and-forth through the +apiserver by updating a `PodSchedulingContext` object, ultimately leading to the +allocation of all pending claims and the pod being scheduled onto a node. + +Beware that this approach poses a problem for the [Cluster +Autoscaler](https://github.com/kubernetes/autoscaler) (CA) or for any higher +level controller that needs to make decisions for a group of pods (e.g. a job +scheduler). It cannot simulate the effect of allocating or deallocating +claims over time. Only the third-party DRA drivers have the information +available to do this. Structured parameters from #4381 should be used +when cluster autoscaling is needed. ## Motivation -Originally, Kubernetes and its scheduler only tracked CPU and RAM as -resources for containers. Later, support for storage and discrete, -countable per-node extended resources was added. The kubelet device plugin -interface then made such local resources available to containers. But -for many newer devices, this approach and the Kubernetes API for -requesting these custom resources is too limited. 
This KEP addresses -limitations of the current approach for the following use cases: - -- *Device initialization*: When starting a workload that uses - an accelerator like an FPGA, I’d like to have the accelerator - reconfigured or reprogrammed for the workload before the workload - itself starts. For security reasons, workloads should not be able to - reconfigure devices directly. - - *Limitation*: Currently, it’s impossible to specify the desired - device properties that are required for reconfiguring devices. - For the FPGA example, a file containing the desired configuration - of the FPGA has to be referenced. - -- *Device cleanup*: When my workload is finished, I would like to have - a mechanism for cleanup of the device, that will ensure that device - does not contain traces/parameters/data from previous workloads and - appropriate power state/shutdown. For example, an FPGA might have - to be reset because its configuration for the workload was - confidential. - - *Limitation*: Post-stop actions are not supported. - -- *Partial allocation*: When deploying a container I’d like to be able - to use part of the shareable device inside a container and other - containers should be able to use other free resources on the same - device. - - *Limitation*: For example, newer generations of NVIDIA GPUs have a mode of - operation called MIG, that allow them to be sub-divided into a set of - mini-GPUs (called MIG devices) with varying amounts of memory and compute - resources provided by each. From a hardware-standpoint, configuring a GPU - into a set of MIG devices is highly-dynamic and creating a MIG device - tailored to the resource needs of a particular application is well - supported. However, with the current device plugin API, the only way to make - use of this feature is to pre-partition a GPU into a set of MIG devices and - advertise them to the kubelet in the same way a full / static GPU is - advertised. The user must then pick from this set of pre-partitioned MIG - devices instead of having one created for them on the fly based on their - particular resource constraints. Without the ability to create MIG devices - dynamically (i.e. at the time they are requested) the set of pre-defined MIG - devices must be carefully tuned to ensure that GPU resources do not go unused - because some of the pre-partioned devices are in low-demand. It also puts - the burden on the user to pick a particular MIG device type, rather than - declaring the resource constraints more abstractly. - -- *Optional allocation*: When deploying a workload I’d like to specify - soft(optional) device requirements. If a device exists and it’s - allocatable it will be allocated. If not - the workload will be run on - a node without a device. GPU and crypto-offload engines are - examples of this kind of device. If they’re not available, workloads - can still run by falling back to using only the CPU for the same - task. - - *Limitation*: Optional allocation is supported neither by the device - plugins nor by current Pod resource declaration. - -- *Support Over the Fabric devices*: When deploying a container, I’d - like to utilize devices available over the Fabric (network, special - links, etc). - - *Limitation*: The device plugin API is designed for node-local resources that - get discovered by a plugin running on the node. 
Projects like - [Akri](https://www.cncf.io/projects/akri/) have to work around that by - reporting the same network-attached resource on all nodes that it could - get attached to and then updating resource availability on all of those - nodes when resources get used. - -Several other limitations are addressed by -[CDI](https://github.com/container-orchestrated-devices/container-device-interface/), -a container runtime extension that this KEP is using to expose resources -inside a container. - ### Goals -* More flexibility: - * Arbitrary parameters for resource requests +* More flexibility beyond what is currently supported by structured parameters: + * Arbitrary parameters * Network-attached resources - * Arbitrary, resource-specific setup and cleanup actions - * Custom matching of resource requests with available resources, - including handling of optional resource requests -* User-friendly API for describing resource requests -* Allow resource management cluster add-ons that can be developed and deployed - without having to re-build or reconfigure core Kubernetes component - and that are independent of specific container runtimes. -* Rich enough semantic so that all current device plugins could - be implemented based on dynamic resource allocation + * Custom policies for matching of resource requests with available resources, + like handling of optional resource requests or application-specific + policies +* Prototyping future extensions with a control plane controller before + proposing them as Kubernetes enhancements for DRA with structured parameters ### Non-Goals -* Replace the device plugin API. For resources that fit into its model - of a single, linear quantity it is a good solution. Other resources - should use dynamic resource allocation. Both are expected to co-exist, with vendors - choosing the API that better suits their needs on a case-by-case - basis. Because the new API is going to be implemented independently of the - existing device plugin support, there's little risk of breaking stable APIs. - -* Extend the model that kube-scheduler has about - resources. Instead, it will need information from the resource driver for - each resource request to determine where a Pod using the resource - might run. The [Representing Compute Resources in Kubernetes - proposal](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc/edit#) - had some ideas what information the scheduler might need (“supports - overcommit”, “fractional”), but ultimately any choice regarding that - will only work for certain kinds of resources. - -* Standardize how to describe available resources. Only allocated - resources are visible through the APIs defined below. How to - advertise available resources is driver specific because it depends - on the kind of resource which attributes might be relevant. Drivers - should use and document their individual approach for this (for - example, defining a CRD and publishing through that). - -* Provide an abstraction layer for resource requests, i.e., something like a - “I want some kind of GPU”. Users will need to know about specific - resource drivers and which parameters they support. Portability of - workloads could be added on top of this proposal by introducing the - selection of a resource implementation through labels and - standardizing those labels and the associated parameters. 
The - [Resource Class - Proposal](https://docs.google.com/document/d/1qKiIVs9AMh2Ua5thhtvWqOqW0MSle_RV3lfriO1Aj6U/edit#heading=h.jzfmfdca34kj) - included such an approach. - - +* Supporting cluster autoscaling ## Proposal -The proposal is that a resource driver handles all operations that are specific -to the resources managed by that driver. This includes operations at -the control plane level (tracking where in the cluster resources are -available, helping with pod scheduling decisions, allocating resources -when requested) as well as the node level (preparing container -startup). Such a driver can be implemented in arbitrary programming -languages as long as it supports the resource allocation protocol and -gRPC interfaces defined in this KEP. Deploying it will not depend on -reconfiguring core Kubernetes components like the scheduler. - -New API objects define parameters for a resource request ("ResourceClaim" in -the API) and track the state of such a request. The pod spec gets extended to -reference one or more resource requests. A pod only gets scheduled onto a node -when all of its requests are reserved for the pod and available on the -node. This prevents scheduling a pod that then gets stuck on a node while -waiting for resources to become available. - -### User Stories - -#### Cluster add-on development - -As a hardware vendor, I want to make my hardware available also to applications -that run in a container under Kubernetes. I want to make it easy for a cluster -administrator to configure a cluster where some nodes have this hardware. - -I develop two components, one that runs as part of the Kubernetes control plane -and one that runs on each node, and package those inside container images. YAML -files describe how to deploy my software on a Kubernetes cluster that supports -dynamic resource allocation. - -Documentation for administrators explains how the nodes need to be set -up. Documentation for users explains which parameters control the behavior of -my hardware and how to use it inside a container. - -#### Cluster configuration - -As a cluster administrator, I want to make GPUs from vendor ACME available to users -of that cluster. I prepare the nodes and deploy the vendor's components with -`kubectl create`. - -I create a ResourceClass for the hardware with parameters that only I as the -administrator am allowed to choose, like for example running a command with -root privileges that does some cluster-specific initialization for each allocation: -``` -apiVersion: gpu.example.com/v1 -kind: GPUInit -metadata: - name: acme-gpu-init -# DANGER! This option must not be accepted for -# user-supplied parameters. A real driver might -# not even allow it for admins. This is just -# an example to show the conceptual difference -# between ResourceClass and ResourceClaim -# parameters. -initCommand: -- /usr/local/bin/acme-gpu-init -- --cluster -- my-cluster ---- -apiVersion: core.k8s.io/v1alpha2 -kind: ResourceClass -metadata: - name: acme-gpu -driverName: gpu.example.com -parametersRef: - apiGroup: gpu.example.com - kind: GPUInit - name: acme-gpu-init -``` - -#### Partial GPU allocation +A resource driver handles all operations that are specific to the allocation +and deallocation of a ResourceClaim. It does that in coordination with the +scheduler (for allocation) and kube-controller-manager (for deallocation). -As a user, I want to use a GPU as accelerator, but don't need exclusive access -to that GPU. Running my workload with just 2Gb of memory is sufficient. 
This is -supported by the ACME GPU hardware. I know that the administrator has created -an "acme-gpu" ResourceClass. - -For a simple trial, I create a Pod directly where two containers share the same subset -of the GPU: -``` -apiVersion: gpu.example.com/v1 -kind: GPURequirements -metadata: - name: device-consumer-gpu-parameters -memory: "2Gi" ---- -apiVersion: resource.k8s.io/v1alpha2 -kind: ResourceClaimTemplate -metadata: - name: device-consumer-gpu-template -spec: - metadata: - # Additional annotations or labels for the - # ResourceClaim could be specified here. - spec: - resourceClassName: "acme-gpu" - parametersRef: - apiGroup: gpu.example.com - kind: GPURequirements - name: device-consumer-gpu-parameters ---- -apiVersion: v1 -kind: Pod -metadata: - name: device-consumer -spec: - resourceClaims: - - name: "gpu" # this name gets referenced below under "claims" - template: - resourceClaimTemplateName: device-consumer-gpu-template - containers: - - name: workload - image: my-app - command: ["/bin/program"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - claims: - - "gpu" - - name: monitor - image: my-app - command: ["/bin/other-program"] - resources: - requests: - memory: "32Mi" - cpu: "25m" - limits: - memory: "64Mi" - cpu: "50m" - claims: - - "gpu" -``` +![components](./components.png) -This request triggers resource allocation on a node that has a GPU device with -2Gi of memory available and then the Pod runs on that node. The remaining -capacity of the GPU may be usable for other pods, with constrains like alignment -to segment sizes ensured by the resource driver. -The lifecycle of the resource -allocation is tied to the lifecycle of the Pod. -In production, a similar PodTemplateSpec in a Deployment will be used. +### User Stories #### Network-attached accelerator @@ -552,406 +245,54 @@ The hardware that is expected to need this more flexible allocation approach is likely to be used by pods that run for extended periods of time, so this is not a major concern. -### Risks and Mitigations - - - -#### Feature not used - -In a cluster where the feature is not used (no resource driver installed, no -pods using dynamic resource allocation) the impact is minimal, both for -performance and security. The scheduler plugin and resource controller will -return quickly without doing any work for pods. - -#### Compromised node - -Kubelet is intentionally limited to read-only access for all new API types -to prevent that a -compromised kubelet interferes with scheduling of pending pods, for example -by updating information normally published by the resource driver controller. -Faking such information could be used for a denial-of-service -attack against pods using those ResourceClaims, for example by overwriting -their allocation result with a node selector that matches no node. A -denial-of-service attack against the cluster and other pods is harder, but -still possible. For example, frequently updating ResourceClaims could cause new -scheduling attempts for pending pods. - -Another potential attack goal is to get pods with sensitive workloads to run on -a compromised node. For pods that don't use special resources nothing changes -in that regard. Such an attack is possible for pods with extended resources -because kubelet is in control of which capacity it reports for those: it could -publish much higher values than the device plugin reported and thus attract -pods to the node that normally would run elsewhere. 
With dynamic resource -allocation, such an attack is still possible, but the attack code would have to -be different for each resource driver because all of them will use their own, -custom approach for reporting resource availability. - -The security of those custom approaches is the responsibility of the resource -driver vendor. Solutions like Akri which establish their own control plane and -then communicate with Kubernetes through the device plugin API already need to -address this. - -#### Compromised resource driver plugin - -This is the result of an attack against the resource driver, either from a -container which uses a resource exposed by the driver, a compromised kubelet -which interacts with the plugin, or through a successful attack against the -node which led to root access. - -The resource driver plugin only needs read access to objects described in this -KEP, so compromising it does not interfere with dynamic resource allocation for -other drivers. It may need write access for [CRDs that communicate or -coordinate resource -availability](#implementing-a-plugin-for-node-resources). This could be used to -attack scheduling involving the driver as outlined in the previous section. - -A resource driver may need root access on the node to manage -hardware. Attacking the driver therefore may lead to root privilege -escalation. Ideally, driver authors should try to avoid depending on root -permissions and instead use capabilities or special permissions for the kernel -APIs that they depend on. - -A resource driver may also need privileged access to remote services to manage -network-attached devices. Resource driver vendors and cluster administrators -have to consider what the effect of a compromise could be for that and how such -privileges could get revoked. - -#### User permissions and quotas - -Similar to generic ephemeral inline volumes, the [ephemeral resource use -case](#ephemeral-vs-persistent-resourceclaims-lifecycle) gets covered by -creating ResourceClaims on behalf of the user automatically through -kube-controller-manager. The implication is that RBAC rules that are meant to -prevent creating ResourceClaims for certain users can be circumvented, at least -for ephemeral resources. Administrators need to be aware of this caveat when -designing user restrictions. - -A quota system that limits how much of the underlying resources a user may consume -needs to be supported by the resource driver. When a user has exhausted their -quota, the driver then would refuse to allocate further ResourceClaims. Such a -quota system cannot be implemented in core Kubernetes because Kubernetes has no -information about how much a certain ResourceClaim would count against the quota. - -What can be limited in Kubernetes itself is the number of ResourceClaims per -namespace. For this, two new ResourceQuota resource names get added: - -- `resourceclaims` limits the number of ResourceClaim objects in a namespace - across all resource class. -- `.resourceclass.node.k8s.io/resourceclaims` limits the - number of ResourceClaim objects for the specific resource class. - - -#### Usability - -Aside from security implications, usability and usefulness of dynamic resource -allocation also may turn out to be insufficient. Some risks are: - -- Slower pod scheduling due to the interaction with resource drivers. - -- Additional complexity when describing pod requirements because - separate objects must be created for the parameters. 
- -- Network-attached resources may have additional constraints that are not - captured yet (like limited number of nodes that they can be attached to). - -- Cluster autoscaling will not work as expected unless the DRA driver - uses [semantic parameters](https://github.com/kubernetes/enhancements/issues/4381). - -All of these risks will have to be evaluated by gathering feedback from users -and resource driver developers. - ## Design Details -### Theory of operation - -In general, this new API works as described below (more details and exploration -of corner cases will follow): - -* A user creates a ResourceClaim. This claim may be created by the user - directly (the user owns the resource's lifetime) or indirectly through a pod - (the pod owns the resource's lifetime). +### ResourceClass extension -* A resource driver observes the claim and allocates the underlying resource if - it can. +An optional field in ResourceClass enables using the DRA driver's control +plane controller: -* A pod references the claim in its spec. - -* When a claim is meant to be allocated for a specific pod, the scheduler and - the resource driver(s) coordinate to pick a viable node before the claim gets - allocated. - -* Once allocated, a pod which consumes the resource is scheduled to a node where - the resource is available. - -* Kubelet on that node communicates with the node-level part of the resource - driver to present the resource to the pod. - -* When the pod completes, the Kubelet on that node communicates with node-level - part of the resource driver to clean up. - -* When the claim is released (deleted), the resource driver can free the - underlying resource. +```go +type ResourceClass struct { + ... -### Components + // ControllerName defines the name of the dynamic resource driver that is + // used for allocation of a ResourceClaim that uses this class. If empty, + // structured parameters are used for allocating claims using this class. + // + // Resource drivers have a unique name in forward domain order + // (acme.example.com). + // + // This is an alpha field and requires enabling the DRAControlPlaneController + // feature gate. + // + // +optional + ControllerName string +} +``` -![components](./components.png) +### ResourceClaim extension -Several components must be implemented or modified in Kubernetes: -- The new API must be added to kube-apiserver. The ResourceQuota admission - plugin needs to check the new quota limits when ResourceClaims get created. -- A new controller in kube-controller-manager which creates - ResourceClaims from ResourceClaimTemplates, similar to - https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral. - It also removes the reservation entry for a user in `claim.status.reservedFor`, - the field that tracks who is allowed to use a claim, when that user no longer exists. -- A kube-scheduler plugin must detect Pods which reference a - ResourceClaim (directly or through a template) and ensure that the - resource is allocated before the Pod gets scheduled, similar to - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go -- Kubelet must be extended to retrieve information from ResourceClaims - and to call a resource kubelet plugin. That plugin returns CDI device ID(s) - which then must be passed to the container runtime. 
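+To make the `ControllerName` field above more concrete, a ResourceClass that
+delegates allocation to a control plane controller might look roughly like the
+following sketch. The `acme-gpu` class name, the `gpu.example.com` driver name
+and the API version are placeholders carried over from other examples in this
+document, not normative values, and other ResourceClass fields are omitted:
+
+```
+apiVersion: resource.k8s.io/v1alpha2
+kind: ResourceClass
+metadata:
+  name: acme-gpu
+# A non-empty controllerName hands allocation of claims using this class to
+# the named driver's control plane controller; if it is left empty,
+# structured parameters are used instead.
+controllerName: gpu.example.com
+```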
- -For a resource driver the following components are needed: -- *Resource driver controller*: a central component which handles resource allocation - by watching ResourceClaims and updating their status once it is done with - allocation. It may run inside the cluster or outside of it. The only - hard requirement is that it can connect to the API server. -- *Resource kubelet plugin*: a component which cooperates with kubelet to prepare - the usage of the resource on a node. - -A [utility library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation) for resource drivers was developed. -It does not have to be used by drivers, therefore it is not described further -in this KEP. - -### State and communication - -A ResourceClaim object defines what kind of resource is needed and what -the parameters for it are. It is owned by users and namespaced. Additional -parameters are provided by a cluster admin in ResourceClass objects. - -The ResourceClaim spec is immutable. The ResourceClaim -status is reserved for system usage and holds the current state of the -resource. The status must not get lost, which in the past was not ruled -out. For example, status could have been stored in a separate etcd instance -with lower reliability. To recover after a loss, status was meant to be recoverable. -A [recent KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/2527-clarify-status-observations-vs-rbac) -clarified that status will always be stored reliably and can be used as -proposed in this KEP. - -All relevant state of a ResourceClaim is captured inside that object -itself. For additional information that is needed only during pod scheduling, a -separate PodSchedulingContext gets created by the scheduler if needed. The -PodSchedulingContext has the same name and namespace as the pod and the pod as -its as owner. This ownership must be checked before using a PodSchedulingContext -to detect stale objects that do not match a recreated pod. Such stale -objects get deleted by the garbage collector or the scheduler, depending on who -gets to it first. - -Handling state and communication through objects has two advantages: -- Changes for a resource are atomic, which avoids race conditions. -- The only requirement for deployments is that the components can connect to - the API server. Direct communication is not needed, but in some cases - can be optionally used to improve performance. - -Using a single object is an intentional simplification compared to the -PersistentVolume/PersistentVolumeClaim model for volumes where the additional -PV object was used to capture status. That model allowed operations like -pre-provisioning volumes and then having Kubernetes bind those to claims that -get created later. For resources, the resource driver can and must handle such -pre-provisioning internally. Kubernetes wouldn't know how to match -pre-provisioned resources against claims because it has no understanding about the -parameters. - -The entire state of a resource can be determined by looking at its -ResourceClaim (see [API below](#api) for details), for example: - -- It is **allocated** if and only if `claim.status.allocation` is non-nil and - points to the `AllocationResult`, i.e. the struct where resource drivers - store information about a successful allocation. - -- It is in use if and only if `claim.status.reservedFor` contains one or - more users. 
It does not matter whether those users, usually pods, are - currently running because that could change at any time. - -- A resource is no longer needed when `claim.deletionTimestamp` is set. It must not - be deallocated yet when it is still in use. - -Some of the race conditions that need to be handled are: - -- A ResourceClaim gets created and deleted again while the resource driver - starts allocating it. Before it actually starts doing anything, the - driver adds a finalizer. Either adding the finalizer or removing the - ResourceClaim win. If the driver wins, it continues with allocation - and can either complete or abort the operation when it notices the non-nil - DeletionTimestamp. Otherwise, allocation gets aborted immediately. - - What this avoids is the situation where an allocation succeed without having - an object where the result can be stored. The driver can also be killed at - any time: when it restarts, the finalizer indicates that allocation may be in - progress and has to be completed or aborted. - - However, users may still force-delete a ResourceClaim, or the entire - cluster might get deleted. Driver implementations must store enough - information elsewhere to detect when some allocated resource is no - longer needed to recover from such scenarios. - -- A ResourceClaim gets deleted and recreated while the resource driver is - adding the finalizer. The driver can update the object to add the finalizer - and then will get a conflict error, which informs the driver that it must - work on a new instance of the claim. In general, patching a ResourceClaim - is only acceptable when it does not lead to race conditions. To detect - delete+recreate, the UID must be added as precondition for a patch. - To detect also potentially conflicting other changes, ResourceVersion - needs to be checked, too. - -- In a cluster with multiple scheduler instances, two pods might get - scheduled concurrently by different schedulers. When they reference - the same ResourceClaim which may only get used by one pod at a time, - only one pod can be scheduled. - - Both schedulers try to add their pod to the `claim.status.reservedFor` field, but only the - update that reaches the API server first gets stored. The other one fails - with a conflict error and the scheduler which issued it knows that it must - put the pod back into the queue, waiting for the ResourceClaim to become - usable again. - -- Two pods get created which both reference the same unallocated claim with - delayed allocation. A single scheduler could detect this special situation - and then trigger allocation only for one of the two pods. But it is simpler - to proceed with pod scheduling for both of them independently, which implies - trying to select a node and allocate for it in parallel. Depending on timing, - the resource driver will see one of the requests for allocation first and - execute it. The other pod then either can share the same resource (if - supported) or must wait until the first one is done with it and reallocate - it. - -- Scheduling a pod and allocating resources for it has been attempted, but one - claim needs to be reallocated to fit the overall resource requirements. A second - pod gets created which references the same claim that is in the process of - being deallocated. Because that is visible in the claim status, scheduling - of the second pod cannot proceed. 
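+As a rough illustration of the ResourceClaim extension described in this
+section, a standalone claim that requests the `Immediate` allocation mode
+discussed below might look like the following sketch. The names and the API
+version are placeholders, and other spec fields are omitted:
+
+```
+apiVersion: resource.k8s.io/v1alpha2
+kind: ResourceClaim
+metadata:
+  name: shared-gpu
+  namespace: default
+spec:
+  resourceClassName: acme-gpu
+  # Immediate only has an effect when the class delegates allocation to a
+  # control plane controller; with structured parameters it behaves like the
+  # default "WaitForFirstConsumer".
+  allocationMode: Immediate
+```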
- -### Custom parameters - -To support arbitrarily complex parameters, both ResourceClass and ResourceClaim -contain one field which references a separate object. The reference contains -API group, kind and name and thus is sufficient for generic clients to -retrieve the parameters. For ResourceClass, that object must be -cluster-scoped. For ResourceClaim, it must be in the same namespace as the -ResourceClaim and thus the Pod. Which kind of objects a resource driver accepts as parameters depends on -the driver. - -This approach was chosen because then validation of the parameters can be done -with a CRD and that validation will work regardless of where the parameters -are needed. - -A resource driver may support modification of the parameters while a resource -is in use ("online resizing"). It may update the ResourceClaim status to -reflect the modified state, for example by increasing the number of concurrent -users. The driver must not allow the state to be modified such -that a user of the resource no longer has access. - -Parameters may get deleted before the ResourceClaim or ResourceClass that -references them. In that case, a pending resource cannot be allocated until the -parameters get recreated. An allocated resource must remain usable and -deallocating it must be possible. To support this, resource drivers must copy -all relevant information: -- For usage, the `claim.status.allocation.resourceHandle` can be hold some copied information - because the ResourceClaim and thus this field must exist. -- For deallocation, drivers should use some other location to handle - cases where a user force-deletes a ResourceClaim or the entire - cluster gets removed. - -### Allocation modes - -Allocation of a resource happens either immediately when a ResourceClaim gets -created (“immediate allocation”) or when a Pod is getting scheduled which -needs the resource (“delayed allocation”), -depending on a flag in the ResourceClaim spec. +With structured parameters, allocation always happens only when a pod needs a +ResourceClaim ("delayed allocation"). With allocation through the driver, it +may also make sense to allocate a ResourceClaim as soon as it gets created +("immediate allocation"). Immediate allocation is useful when allocating a resource is expensive (for example, programming an FPGA) and the resource therefore is meant to be used by -multiple different Pods, either in parallel or one after the other. The -downside is that Pod resource requirements cannot be considered when choosing +multiple different Pods, either in parallel or one after the other. Another use +case is managing resource allocation in a third-party component which fully +understands optimal placement of everything that needs to run on a certain +cluster. + +The downside is that Pod resource requirements cannot be considered when choosing where to allocate. If a resource was allocated so that it is only available on one node and the Pod cannot run there because other resources like RAM or CPU are exhausted on that node, then the Pod cannot run elsewhere. The same applies to resources that are available on a certain subset of the nodes and those nodes are busy. -Delayed allocation solves this by integrating allocation with Pod scheduling: -an attempt to schedule a Pod triggers allocation of pending resources for nodes -that the scheduler has deemed suitable. Scheduling the pod is then put on hold -until all resources are allocated. 
This avoids scenarios where a Pod is -permanently assigned to a node which can't fit the pod because of the pod's -other resource requirements. - -### Sharing a single ResourceClaim - -Pods reference resource claims in a new `pod.spec.resourceClaims` list. Each -resource in that list can then be made available to one or more containers in -that Pod. Depending on the capabilities defined in the -`claim.status.allocation` by the driver, a ResourceClaim can be used exclusively -by one pod at a time or an unlimited number of pods. Support for additional -constraints (maximum number of pods, maximum number of nodes) could be -added once there are use cases for those. - -Users of a ResourceClaim don't need to be Pods. This KEP specifically supports -Pods as users and describes how kube-scheduler and kubelet will deal with Pods -that depend on a ResourceClaim, but the API and some custom resource driver -might also be useful for controllers to manage resources without using those -resources for Pods. - -### Ephemeral vs. persistent ResourceClaims lifecycle - -A persistent ResourceClaim has a lifecyle that is independent of any particular -pod. It gets created and deleted by the user. This is useful for resources -which are expensive to configure and that can be used multiple times by pods, -either at the same time or one after the other. Such persistent ResourceClaims -get referenced in the pod spec by name. When a PodTemplateSpec in an app -controller spec references a ResourceClaim by name, all pods created by that -controller also use that name and thus share the resources allocated for that -ResourceClaim. - -But often, each Pod is meant to have exclusive access to its own ResourceClaim -instance instead. To support such ephemeral resources without having to modify -all controllers that create Pods, an entry in the new PodSpec.ResourceClaims -list can also be a reference to a ResourceClaimTemplate. When a Pod gets created, such a -template will be used to create a normal ResourceClaim with the Pod as owner -with an -[OwnerReference](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#OwnerReference)), -and then the normal allocation of the resource takes place. Once the pod got -deleted, the Kubernetes garbage collector will also delete the -ResourceClaim. - -This mechanism documents ownership and serves as a fallback for scenarios where -dynamic resource allocation gets disabled in a cluster (for example, during a -downgrade). But it alone is not sufficient: for example, the job controller -does not delete pods immediately when they have completed, which would keep -their resources allocated. Therefore the resource controller watches for pods -that have completed and releases their resource allocations. - -The difference between persistent and ephemeral resources for kube-scheduler -and kubelet is that the name of the ResourceClaim needs to be determined -differently: the name of an ephemeral ResourceClaim is `-`. Ownership must be checked to detect accidental conflicts with -persistent ResourceClaims or previous incarnations of the same ephemeral -resource. This is the same approach that was chosen for [generic ephemeral -volumes](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1698-generic-ephemeral-volumes/README.md). -For a resource driver there is no difference. - -Different lifecycles can be combined with different allocation modes +Different lifecycles of a ResourceClaim can be combined with different allocation modes arbitrarily. 
Some combinations are more useful than others: ``` @@ -969,337 +310,17 @@ arbitrarily. Some combinations are more useful than others: +-----------+------------------------------------+---------------------------------+ ``` -### Coordinating resource allocation through the scheduler - -For immediate allocation, scheduling Pods is simple because the -resource is already allocated and determines the nodes on which the -Pod may run. The downside is that pod scheduling is less flexible. - -For delayed allocation, a node is selected tentatively by the scheduler -in an iterative process where the scheduler suggests some potential nodes -that fit the other resource requirements of a Pod and resource drivers -respond with information about whether they can allocate claims for those -nodes. This exchange of information happens through the `PodSchedulingContext` -for a Pod. The scheduler has to involve the drivers because it -doesn't know what claim parameters mean and where suitable resources are -currently available. - -Once the scheduler is confident that it has enough information to select -a node that will probably work for all claims, it asks the driver(s) to -allocate their resources for that node. If that -succeeds, the Pod can get scheduled. If it fails, the scheduler must -determine whether some other node fits the requirements and if so, -request allocation again. If no node fits because some resources were -already allocated for a node and are only usable there, then those -resources must be released and then get allocated elsewhere. - -This is a summary of the necessary [kube-scheduler changes](#kube-scheduler) in -pseudo-code: - -``` -while { - - if { - if { - - } - } else if { - - } else if { - - } -} -``` - -Randomly picking a node without knowing anything about the resource driver may -or may not succeed. To narrow the choice of suitable nodes for all claims using -a certain resource class, a node selector can be specified in that class. That -selector is static and typically will use labels that determine which nodes may -have resources available. - -To gather information about the current state of resource availability and to -trigger allocation of a claim, the scheduler creates one PodSchedulingContext -for each pod that uses claims. That object is owned by the pod and -will either get deleted by the scheduler when it is done with pod scheduling or -through the garbage collector. In the PodSchedulingContext, the scheduler posts -the list of all potential nodes that it was left with after considering all -other pod constraints and requirements. Resource drivers involved in the -scheduling of the pod respond by adding which of these nodes currently don't -have sufficient resources available. The next scheduling attempt is then more -likely to pick a node for which allocation succeeds. - -This scheduling information is optional and does not have to be in sync with -the current ResourceClaim state, therefore it is okay to store it -separately. - -Allowing the scheduler to trigger allocation in parallel to asking for more -information was chosen because for pods with a single resource claim, the cost -of guessing wrong is low: the driver just needs to inform the scheduler to try -again and provide the additional information. - -Additional heuristics are possible without changing the proposed API. For -example, the scheduler might ask for information and wait a while before -making a choice. 
This may be more suitable for pods using many different -resource claims because for those, allocation may succeed for some claims and -fail for others, which then may need to go through the recovery flow with -deallocating one or more claims. - -### Resource allocation and usage flow - -The following steps shows how resource allocation works for a resource that -gets defined in a ResourceClaimTemplate and referenced by a Pod. Several of these steps may fail without changing -the system state. They then must be retried until they succeed or something -else changes in the system, like for example deleting objects. - -* **user** creates Pod with reference to ResourceClaimTemplate -* **resource claim controller** checks ResourceClaimTemplate and ResourceClass, - then creates ResourceClaim with Pod as owner -* if *immediate allocation*: - * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress - * **resource driver** finishes allocation, sets `claim.status.allocation` -> claim ready for use by any pod -* if *pod is pending*: - * **scheduler** filters nodes based on built-in resources and the filter callback of plugins, - which includes constraints imposed by already allocated resources - * if *delayed allocation and resource not allocated yet*: - * if *at least one node fits pod*: - * **scheduler** creates or updates a `PodSchedulingContext` with `podSchedulingContext.spec.potentialNodes=` - * if *exactly one claim is pending (see below)* or *all drivers have provided information*: - * **scheduler** picks one node, sets `podSchedulingContext.spec.selectedNode=` - * if *resource is available for this selected node*: - * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress - * **resource driver** finishes allocation, sets `claim.status.allocation` and the - pod in `claim.status.reservedFor` -> claim ready for use and reserved for the pod - * else *scheduler needs to know that it must avoid this and possibly other nodes*: - * **resource driver** sets `podSchedulingContext.status.claims[name=name of claim in pod].unsuitableNodes` - * else *pod cannot be scheduled*: - * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocationRequested` to true - (see [pseudo-code above](#coordinating-resource-allocation-through-the-scheduler)) or wait - * if *pod not listed in `claim.status.reservedFor` yet* (can occur for immediate allocation): - * **scheduler** adds it to `claim.status.reservedFor` - * if *resource allocated and reserved*: - * **scheduler** sets node in Pod spec -> Pod ready to run - * **scheduler** deletes `PodSchedulingContext` if one exists -* if *node is set for pod*: - * if `resource not reserved for pod` (user might have set the node field): - * **kubelet** refuses to start the pod -> permanent failure - * else `pod may run`: - * **kubelet** asks driver to prepare the resource - * if `resource is prepared`: - * **kubelet** creates container(s) which reference(s) the resource through CDI -> Pod is running -* if *pod has terminated* and *pod deleted*: - * **kubelet** asks driver to unprepare the resource - * **kubelet** allows pod deletion to complete by clearing the `GracePeriod` -* if *pod removed*: - * **garbage collector** deletes ResourceClaim -> adds `claim.deletionTimestamp` because of finalizer -* if *ResourceClaim has `claim.deletionTimestamp` and `claim.status.reservedFor` is empty*: - * **resource driver** deallocates resource - * **resource driver** 
clears finalizer and `claim.status.allocation` - * **API server** removes ResourceClaim - -When exactly one claim is pending, it is safe to trigger the allocation: if the -node is suitable, the allocation will succeed and the pod can get scheduled -without further delays. If the node is not suitable, allocation fails and the -next attempt can do better because it has more information. The same should not -be done when there are multiple claims because allocation might succeed for -some, but not all of them, which would force the scheduler to recover by asking -for deallocation. It's better to wait for information in this case. - -The flow is similar for a ResourceClaim that gets created as a stand-alone -object by the user. In that case, the Pod reference that ResourceClaim by -name. The ResourceClaim does not get deleted at the end and can be reused by -another Pod and/or used by multiple different Pods at the same time (if -supported by the driver). The resource remains allocated as long as the -ResourceClaim doesn't get deleted by the user. - -If a Pod references multiple claims managed by the same driver, then the driver -can combine updating `podSchedulingContext.claims[*].unsuitableNodes` for all -of them, after considering all claims. - -### Scheduled pods with unallocated or unreserved claims - -There are several scenarios where a Pod might be scheduled (= `pod.spec.nodeName` -set) while the claims that it depends on are not allocated or not reserved for -it: - -* A user might manually create a pod with `pod.spec.nodeName` already set. -* Some special cluster might use its own scheduler and schedule pods without - using kube-scheduler. -* The feature might have been disabled in kube-scheduler while scheduling - a pod with claims. - -The kubelet is refusing to run such pods and reports the situation through -an event (see below). It's an error scenario that should better be avoided. - -Users should avoid this situation by not scheduling pods manually. If they need -it for some reason, they can use a node selector which matches only the desired -node and then let kube-scheduler do the normal scheduling. - -Custom schedulers should emulate the behavior of kube-scheduler and ensure that -claims are allocated and reserved before setting `pod.spec.nodeName`. - -The last scenario might occur during a downgrade or because of an -administrator's mistake. Administrators can fix this by deleting such pods or -ensuring that claims are usable by them. The latter is work that can be -automated in kube-controller-manager: - -- If `pod.spec.nodeName` is set, kube-controller-manager can be sure that - kube-scheduler is not doing anything for the pod. -- If such a pod has unallocated claims, kube-controller-manager can - create a `PodSchedulingContext` with only the `spec.selectedNode` field set - to the name of the node chosen for the pod. There is no need to list - suitable nodes because that choice is permanent, so resource drivers don't - need check for unsuitable nodes. All that they can do is to (re)try allocating - the claim until that succeeds. -- If such a pod has allocated claims that are not reserved for it yet, - then kube-controller-manager can (re)try to reserve the claim until - that succeeds. - -Once all of those steps are complete, kubelet will notice that the claims are -ready and run the pod. Until then it will keep checking periodically, just as -it does for other reasons that prevent a pod from running. 
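+For orientation, the `PodSchedulingContext` created for a pending pod during
+such a scheduling attempt might look roughly like the following sketch. This
+is illustrative only: the authoritative schema is defined in the
+[PodSchedulingContext](#podschedulingcontext) section, and the pod, node and
+claim names as well as the API version are placeholders:
+
+```
+apiVersion: resource.k8s.io/v1alpha2
+kind: PodSchedulingContext
+metadata:
+  # Same name and namespace as the pending pod, which is also the owner.
+  name: my-pod
+  namespace: default
+spec:
+  # Written by kube-scheduler: nodes which satisfy all other requirements.
+  potentialNodes:
+  - node-1
+  - node-2
+  # Written by kube-scheduler when it asks the driver(s) to allocate for
+  # one specific node.
+  selectedNode: node-1
+status:
+  resourceClaims:
+  # One entry per claim in the pod, written by the DRA driver controller.
+  - name: gpu
+    unsuitableNodes:
+    - node-2
+```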
- -### Handling non graceful node shutdowns - -When a node is shut down unexpectedly and is tainted with an `out-of-service` -taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown), -all running pods on the node will be deleted by the GC controller and the -resources used by the pods will be deallocated. However, they will not be -un-prepared as the node is down and Kubelet is not running on it. - -Resource drivers should be able to handle this situation correctly and -should not expect `UnprepareNodeResources` to be always called. -If resources are unprepared when `Deallocate` is called, `Deallocate` -might need to perform additional actions to correctly deallocate -resources. - -### API - -The PodSpec gets extended. To minimize the changes in core/v1, all new types -get defined in a new resource group. This makes it possible to revise those -more experimental parts of the API in the future. The new fields in the -PodSpec are gated by the DynamicResourceAllocation feature gate and can only be -set when it is enabled. Initially, they are declared as alpha. Even though they -are alpha, changes to their schema are discouraged and would have to be done by -using new field names. - -ResourceClaim, ResourceClass and ResourceClaimTemplate are new built-in types -in `resource.k8s.io/v1alpha2`. This alpha group must be explicitly enabled in -the apiserver's runtime configuration. Using builtin types was chosen instead -of using CRDs because core Kubernetes components must interact with the new -objects and installation of CRDs as part of cluster creation is an unsolved -problem. - -Secrets are not part of this API: if a resource driver needs secrets, for -example to access its own backplane, then it can define custom parameters for -those secrets and retrieve them directly from the apiserver. This works because -drivers are expected to be written for Kubernetes. - -#### resource.k8s.io - -``` -// ResourceClass is used by administrators to influence how resources -// are allocated. -// -// This is an alpha type and requires enabling the DynamicResourceAllocation -// feature gate. -type ResourceClass struct { - metav1.TypeMeta - // Standard object metadata - // +optional - metav1.ObjectMeta - - // DriverName defines the name of the dynamic resource driver that is - // used for allocation of a ResourceClaim that uses this class. - // - // Resource drivers have a unique name in forward domain order - // (acme.example.com). - DriverName string - - // ParametersRef references an arbitrary separate object that may hold - // parameters that will be used by the driver when allocating a - // resource that uses this class. A dynamic resource driver can - // distinguish between parameters stored here and and those stored in - // ResourceClaimSpec. - // +optional - ParametersRef *ResourceClassParametersReference - - // Only nodes matching the selector will be considered by the scheduler - // when trying to find a Node that fits a Pod when that Pod uses - // a ResourceClaim that has not been allocated yet. - // - // Setting this field is optional. If null, all nodes are candidates. - // +optional - SuitableNodes *core.NodeSelector -} -``` - -A copy of the driver name is necessary to enable usage -of the claim by the kubelet in case the ResourceClass gets -removed in the meantime. 
It also helps the resource driver -to determine whether it needs to handle a claim that got -marked for deletion. - -``` -// ResourceClaim describes which resources are needed by a resource consumer. -// Its status tracks whether the resource has been allocated and what the -// resulting attributes are. -// -// This is an alpha type and requires enabling the DynamicResourceAllocation -// feature gate. -type ResourceClaim struct { - metav1.TypeMeta - // Standard object metadata - // +optional - metav1.ObjectMeta - - // Spec describes the desired attributes of a resource that then needs - // to be allocated. It can only be set once when creating the - // ResourceClaim. - Spec ResourceClaimSpec - - // Status describes whether the resource is available and with which - // attributes. - // +optional - Status ResourceClaimStatus -} -``` - -The driver must set a finalizer in a ResourceClaim before it attempts to allocate -the resource. It removes the finalizer when a) the allocation -attempt has definitely failed or b) the allocated resource was -deallocated. This helps to ensure that resources are not leaked -during normal operation of the cluster. - -It cannot prevent force-deleting a ResourceClaim by clearing its -finalizers (something that users should never do without being aware -of the consequences) or help when the entire cluster gets deleted. - ``` -// ResourceClaimSpec defines how a resource is to be allocated. type ResourceClaimSpec struct { - // ResourceClassName references the driver and additional parameters - // via the name of a ResourceClass that was created as part of the - // driver deployment. - ResourceClassName string - - // ParametersRef references a separate object with arbitrary parameters - // that will be used by the driver when allocating a resource for the - // claim. - // - // The object must be in the same namespace as the ResourceClaim. - // +optional - ParametersRef *ResourceClaimParametersReference - - // Allocation can start immediately or when a Pod wants to use the - // resource. "WaitForFirstConsumer" is the default. - // +optional - AllocationMode AllocationMode + ... + + // Allocation can start immediately or when a Pod wants to use the + // resource. "WaitForFirstConsumer" is the default. + // +optional + // + // This is an alpha field and requires enabling the DRAControlPlaneController + // feature gate. + AllocationMode AllocationMode } // AllocationMode describes whether a ResourceClaim gets allocated immediately @@ -1310,57 +331,58 @@ type ResourceClaimSpec struct { type AllocationMode string const ( - // When a ResourceClaim has AllocationModeWaitForFirstConsumer, allocation is - // delayed until a Pod gets scheduled that needs the ResourceClaim. The - // scheduler will consider all resource requirements of that Pod and - // trigger allocation for a node that fits the Pod. - AllocationModeWaitForFirstConsumer AllocationMode = "WaitForFirstConsumer" - - // When a ResourceClaim has AllocationModeImmediate, allocation starts - // as soon as the ResourceClaim gets created. This is done without - // considering the needs of Pods that will use the ResourceClaim - // because those Pods are not known yet. - AllocationModeImmediate AllocationMode = "Immediate" + // When a ResourceClaim has AllocationModeWaitForFirstConsumer, allocation is + // delayed until a Pod gets scheduled that needs the ResourceClaim. The + // scheduler will consider all resource requirements of that Pod and + // trigger allocation for a node that fits the Pod. 
+ // + // The ResourceClaim gets deallocated as soon as it is not in use anymore. + AllocationModeWaitForFirstConsumer AllocationMode = "WaitForFirstConsumer" + + // When a ResourceClaim has AllocationModeImmediate and the ResourceClass + // uses a control plane controller, allocation starts + // as soon as the ResourceClaim gets created. This is done without + // considering the needs of Pods that will use the ResourceClaim + // because those Pods are not known yet. + // + // When structured parameters are used, nothing special is done for + // allocation and thus allocation happens when the scheduler handles + // first Pod which needs the ResourceClaim, as with "WaitForFirstConsumer". + // + // In both cases, claims remain allocated even when not in use. + AllocationModeImmediate AllocationMode = "Immediate" ) +``` + +### ResourceClaimStatus extension -// ResourceClaimStatus tracks whether the resource has been allocated and what -// the resulting attributes are. +``` type ResourceClaimStatus struct { - // DriverName is a copy of the driver name from the ResourceClass at - // the time when allocation started. - // +optional - DriverName string - - // Allocation is set by the resource driver once a resource or set of - // resources has been allocated successfully. If this is not specified, the - // resources have not been allocated yet. - // +optional - Allocation *AllocationResult - - // ReservedFor indicates which entities are currently allowed to use - // the claim. A Pod which references a ResourceClaim which is not - // reserved for that Pod will not be started. - // - // There can be at most 32 such reservations. This may get increased in - // the future, but not reduced. - // +optional - ReservedFor []ResourceClaimConsumerReference - - // DeallocationRequested indicates that a ResourceClaim is to be - // deallocated. - // - // The driver then must deallocate this claim and reset the field - // together with clearing the Allocation field. - // - // While DeallocationRequested is set, no new consumers may be added to - // ReservedFor. - // +optional - DeallocationRequested bool -} + ... + // ControllerName is a copy of the driver name from the ResourceClass at + // the time when allocation started. It is empty when the claim was + // allocated through structured parameters, + // + // This is an alpha field and requires enabling the DRAControlPlaneController + // feature gate. + // + // +optional + ControllerName string -// ReservedForMaxSize is the maximum number of entries in -// claim.status.reservedFor. -const ResourceClaimReservedForMaxSize = 32 + // DeallocationRequested indicates that a ResourceClaim is to be + // deallocated. + // + // The driver then must deallocate this claim and reset the field + // together with clearing the Allocation field. + // + // While DeallocationRequested is set, no new consumers may be added to + // ReservedFor. + // + // This is an alpha field and requires enabling the DRAControlPlaneController + // feature gate. + // + // +optional + DeallocationRequested bool ``` DeallocationRequested gets set by the scheduler when it detects @@ -1368,75 +390,49 @@ that pod scheduling cannot proceed because some claim was allocated for a node for which some other pending claims cannot be allocated because that node ran out of resources for those. -``` -// AllocationResult contains attributes of an allocated resource. 
-type AllocationResult struct { - // ResourceHandles contain the state associated with an allocation that - // should be maintained throughout the lifetime of a claim. Each - // ResourceHandle contains data that should be passed to a specific kubelet - // plugin once it lands on a node. This data is returned by the driver - // after a successful allocation and is opaque to Kubernetes. Driver - // documentation may explain to users how to interpret this data if needed. - // - // Setting this field is optional. It has a maximum size of 32 entries. - // If null (or empty), it is assumed this allocation will be processed by a - // single kubelet plugin with no ResourceHandle data attached. The name of - // the kubelet plugin invoked will match the DriverName set in the - // ResourceClaimStatus this AllocationResult is embedded in. - // - // +listType=atomic - ResourceHandles []ResourceHandle - - // This field will get set by the resource driver after it has allocated - // the resource to inform the scheduler where it can schedule Pods using - // the ResourceClaim. - // - // Setting this field is optional. If null, the resource is available - // everywhere. - // +optional - AvailableOnNodes *core.NodeSelector - - // Shareable determines whether the resource supports more - // than one consumer at a time. - // +optional - Shareable bool -} +It also gets set by kube-controller-manager when it detects that +a claim is no longer in use. -// AllocationResultResourceHandlesMaxSize represents the maximum number of -// entries in allocation.resourceHandles. -const AllocationResultResourceHandlesMaxSize = 32 +### ResourceHandle extensions + +Resource drivers can use each `ResourceHandle` to store data directly or +cross-reference some other place where information is stored. +This data is guaranteed to be available when a Pod is about +to run on a node, in contrast to the ResourceClass which +may have been deleted in the meantime. It's also protected from +modification by a user, in contrast to an annotation. +``` // ResourceHandle holds opaque resource data for processing by a specific kubelet plugin. type ResourceHandle struct { - // DriverName specifies the name of the resource driver whose kubelet - // plugin should be invoked to process this ResourceHandle's data once it - // lands on a node. This may differ from the DriverName set in - // ResourceClaimStatus this ResourceHandle is embedded in. - DriverName string - - // Data contains the opaque data associated with this ResourceHandle. It is - // set by the controller component of the resource driver whose name - // matches the DriverName set in the ResourceClaimStatus this - // ResourceHandle is embedded in. It is set at allocation time and is - // intended for processing by the kubelet plugin whose name matches - // the DriverName set in this ResourceHandle. - // - // The maximum size of this field is 16KiB. This may get increased in the - // future, but not reduced. - // +optional - Data string + ... + + // Data contains the opaque data associated with this ResourceHandle. It is + // set by the controller component of the resource driver whose name + // matches the DriverName set in the ResourceClaimStatus this + // ResourceHandle is embedded in. It is set at allocation time and is + // intended for processing by the kubelet plugin whose name matches + // the DriverName set in this ResourceHandle. + // + // The maximum size of this field is 16KiB. This may get increased in the + // future, but not reduced. 
+ // + // This is an alpha field and requires enabling the DRAControlPlaneController feature gate. + // + // +optional + Data string } // ResourceHandleDataMaxSize represents the maximum size of resourceHandle.data. const ResourceHandleDataMaxSize = 16 * 1024 ``` -Resource drivers can use each `ResourceHandle` to store data directly or -cross-reference some other place where information is stored. -This data is guaranteed to be available when a Pod is about -to run on a node, in contrast to the ResourceClass which -may have been deleted in the meantime. It's also protected from -modification by a user, in contrast to an annotation. + +### PodSchedulingContext + +PodSchedulingContexts get created by a scheduler when it processes a pod which +uses one or more unallocated ResourceClaims with delayed allocation and +allocation of those ResourceClaims is handled by control plane controllers. ``` // PodSchedulingContext holds information that is needed to schedule @@ -1444,25 +440,21 @@ modification by a user, in contrast to an annotation. // mode. // // This is an alpha type and requires enabling the DynamicResourceAllocation -// feature gate. +// and DRAControlPlaneController feature gates. type PodSchedulingContext struct { - metav1.TypeMeta - // Standard object metadata - // +optional - metav1.ObjectMeta + metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta - // Spec describes where resources for the Pod are needed. - Spec PodSchedulingContextSpec + // Spec describes where resources for the Pod are needed. + Spec PodSchedulingContextSpec - // Status describes where resources for the Pod can be allocated. - Status PodSchedulingContextStatus + // Status describes where resources for the Pod can be allocated. + Status PodSchedulingContextStatus } ``` -PodSchedulingContexts get created by a scheduler when it processes -a pod which uses one or more unallocated ResourceClaims with delayed -allocation. - The name of a PodSchedulingContext must be the same as the corresponding Pod. That Pod must be listed as an owner in OwnerReferences to ensure that the PodSchedulingContext gets deleted when no longer needed. Normally the scheduler @@ -1475,19 +467,19 @@ and will be removed soon. ``` // PodSchedulingContextSpec describes where resources for the Pod are needed. type PodSchedulingContextSpec struct { - // SelectedNode is the node for which allocation of ResourceClaims that - // are referenced by the Pod and that use "WaitForFirstConsumer" - // allocation is to be attempted. - SelectedNode string - - // PotentialNodes lists nodes where the Pod might be able to run. - // - // The size of this field is limited to 128. This is large enough for - // many clusters. Larger clusters may need more attempts to find a node - // that suits all pending resources. This may get increased in the - // future, but not reduced. - // +optional - PotentialNodes []string + // SelectedNode is the node for which allocation of ResourceClaims that + // are referenced by the Pod and that use "WaitForFirstConsumer" + // allocation is to be attempted. + SelectedNode string + + // PotentialNodes lists nodes where the Pod might be able to run. + // + // The size of this field is limited to 128. This is large enough for + // many clusters. Larger clusters may need more attempts to find a node + // that suits all pending resources. This may get increased in the + // future, but not reduced. 
+ // +optional + PotentialNodes []string } ``` @@ -1517,817 +509,275 @@ when the initial choice turns out to be unsuitable after all. Drivers must not reallocate for a different node when they see such a change because it would lead to race conditions. Instead, the scheduler -will trigger deallocation of specific claims as -needed through the ResourceClaimStatus.DeallocationRequested -field. - -The ResourceClass.SuiteableNodes node selector can be -used to filter out nodes based on labels. This prevents -adding nodes here that the driver then would need to -reject through UnsuitableNodes. - -``` -// PodSchedulingContextStatus describes where resources for the Pod can be allocated. -type PodSchedulingContextStatus struct { - // ResourceClaims describes resource availability for each - // pod.spec.resourceClaim entry where the corresponding ResourceClaim - // uses "WaitForFirstConsumer" allocation mode. - // +optional - ResourceClaims []ResourceClaimSchedulingStatus - - // If there ever is a need to support other kinds of resources - // than ResourceClaim, then new fields could get added here - // for those other resources. -} -``` - -Each resource driver is responsible for providing information about -those resources in the Pod that the driver manages. It can skip -adding this information once it has allocated the resource. - -A driver must add entries here for all its pending claims, even if -the ResourceSchedulingStatus.UnsuitabeNodes field is empty, -because the scheduler may decide to wait with selecting -a node until it has information from all drivers. - -``` -// ResourceClaimSchedulingStatus contains information about one particular -// ResourceClaim with "WaitForFirstConsumer" allocation mode. -type ResourceClaimSchedulingStatus struct { - // Name matches the pod.spec.resourceClaims[*].Name field. - Name string - - // UnsuitableNodes lists nodes that the ResourceClaim cannot be - // allocated for. - // - // The size of this field is limited to 128, the same as for - // PodSchedulingContextSpec.PotentialNodes. This may get increased in the - // future, but not reduced. - // +optional - UnsuitableNodes []string -} - -// PodSchedulingContextNodeListMaxSize defines the maximum number of entries in -// the node lists that are stored in PodSchedulingContexts. This limit is part -// of the API. -const PodSchedulingContextNodeListMaxSize = 256 -``` - -UnsuitableNodes lists nodes that the claim cannot be allocated for. -Nodes listed here will be ignored by the scheduler when selecting a -node for a Pod. All other nodes are potential candidates, either -because no information is available yet or because allocation might -succeed. - -A change to the PodSchedulingContextSpec.PotentialNodes field and/or a failed -allocation attempt triggers an update of this field: the driver -then checks all nodes listed in PotentialNodes and UnsuitableNodes -and updates UnsuitableNodes. - -It must include the prior UnsuitableNodes in this check because the -scheduler will not list those again in PotentialNodes but they might -still be unsuitable. - -This can change, so the driver must also refresh this information -periodically and/or after changing resource allocation for some -other ResourceClaim until a node gets selected by the scheduler. - -``` -// ResourceClaimTemplate is used to produce ResourceClaim objects. -type ResourceClaimTemplate struct { - metav1.TypeMeta - // Standard object metadata - // +optional - metav1.ObjectMeta - - // Describes the ResourceClaim that is to be generated. 
- // - // This field is immutable. A ResourceClaim will get created by the - // control plane for a Pod when needed and then not get updated - // anymore. - Spec ResourceClaimTemplateSpec -} - -// ResourceClaimTemplateSpec contains the metadata and fields for a ResourceClaim. -type ResourceClaimTemplateSpec struct { - // ObjectMeta may contain labels and annotations that will be copied into the PVC - // when creating it. No other fields are allowed and will be rejected during - // validation. - // +optional - metav1.ObjectMeta - - // Spec for the ResourceClaim. The entire content is copied unchanged - // into the ResourceClaim that gets created from this template. The - // same fields as in a ResourceClaim are also valid here. - Spec ResourceClaimSpec -} - -// ResourceClassParametersReference contains enough information to let you -// locate the parameters for a ResourceClass. -type ResourceClassParametersReference struct { - // APIGroup is the group for the resource being referenced. It is - // empty for the core API. This matches the group in the APIVersion - // that is used when creating the resources. - // +optional - APIGroup string - // Kind is the type of resource being referenced. This is the same - // value as in the parameter object's metadata. - Kind string - // Name is the name of resource being referenced. - Name string - // Namespace that contains the referenced resource. Must be empty - // for cluster-scoped resources and non-empty for namespaced - // resources. - // +optional - Namespace string -} - -// ResourceClaimParametersReference contains enough information to let you -// locate the parameters for a ResourceClaim. The object must be in the same -// namespace as the ResourceClaim. -type ResourceClaimParametersReference struct { - // APIGroup is the group for the resource being referenced. It is - // empty for the core API. This matches the group in the APIVersion - // that is used when creating the resources. - // +optional - APIGroup string - // Kind is the type of resource being referenced. This is the same - // value as in the parameter object's metadata, for example "ConfigMap". - Kind string - // Name is the name of resource being referenced. - Name string -} - -// ResourceClaimConsumerReference contains enough information to let you -// locate the consumer of a ResourceClaim. The user must be a resource in the same -// namespace as the ResourceClaim. -type ResourceClaimConsumerReference struct { - // APIGroup is the group for the resource being referenced. It is - // empty for the core API. This matches the group in the APIVersion - // that is used when creating the resources. - // +optional - APIGroup string - // Resource is the type of resource being referenced, for example "pods". - Resource string - // Name is the name of resource being referenced. - Name string - // UID identifies exactly one incarnation of the resource. - UID types.UID -} -``` - -`ResourceClassParametersReference` and `ResourceClaimParametersReference` use -the more user-friendly "kind" to identify the object type because those -references are provided by users. `ResourceClaimConsumerReference` is typically -set by the control plane and therefore uses the more technically correct -"resource" name. - -#### core - -``` -type PodSpec { - ... - // ResourceClaims defines which ResourceClaims must be allocated - // and reserved before the Pod is allowed to start. The resources - // will be made available to those containers which consume them - // by name. 
- // - // This is an alpha field and requires enabling the - // DynamicResourceAllocation feature gate. - // - // This field is immutable. - // - // +featureGate=DynamicResourceAllocation - // +optional - ResourceClaims []PodResourceClaim - ... -} - -type ResourceRequirements { - Limits ResourceList - Requests ResourceList - ... - // Claims lists the names of resources, defined in spec.resourceClaims, - // that are used by this container. - // - // This is an alpha field and requires enabling the - // DynamicResourceAllocation feature gate. - // - // This field is immutable. - // - // +featureGate=DynamicResourceAllocation - // +optional - Claims []ResourceClaim -} - -// ResourceClaim references one entry in PodSpec.ResourceClaims. -type ResourceClaim struct { - // Name must match the name of one entry in pod.spec.resourceClaims of - // the Pod where this field is used. It makes that resource available - // inside a container. - Name string -} -``` - -`Claims` is a list of structs with a single `Name` element because that struct -can be extended later, for example to add parameters that influence how the -resource is made available to a container. This wouldn't be possible if -it was a list of strings. - -``` -// PodResourceClaim references exactly one ResourceClaim through a ClaimSource. -// It adds a name to it that uniquely identifies the ResourceClaim inside the Pod. -// Containers that need access to the ResourceClaim reference it with this name. -type PodResourceClaim struct { - // Name uniquely identifies this resource claim inside the pod. - // This must be a DNS_LABEL. - Name string - - // Source describes where to find the ResourceClaim. - Source ClaimSource -} - -// ClaimSource describes a reference to a ResourceClaim. -// -// Exactly one of these fields should be set. Consumers of this type must -// treat an empty object as if it has an unknown value. -type ClaimSource struct { - // ResourceClaimName is the name of a ResourceClaim object in the same - // namespace as this pod. - ResourceClaimName *string - - // ResourceClaimTemplateName is the name of a ResourceClaimTemplate - // object in the same namespace as this pod. - // - // The template will be used to create a new ResourceClaim, which will - // be bound to this pod. When this pod is deleted, the ResourceClaim - // will also be deleted. The name of the ResourceClaim will be -, where is the - // PodResourceClaim.Name. Pod validation will reject the pod if the - // concatenated name is not valid for a ResourceClaim (e.g. too long). - // - // An existing ResourceClaim with that name that is not owned by the - // pod will not be used for the pod to avoid using an unrelated - // resource by mistake. Scheduling and pod startup are then blocked - // until the unrelated ResourceClaim is removed. - // - // This field is immutable and no changes will be made to the - // corresponding ResourceClaim by the control plane after creating the - // ResourceClaim. - ResourceClaimTemplateName *string -} -``` - -### kube-controller-manager - -The code that creates a ResourceClaim from a ResourceClaimTemplate will -be an almost verbatim copy of the [generic ephemeral volume -code](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral), -just with different types. 
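As a rough illustration of that copy, the hedged sketch below builds a ResourceClaim from a ResourceClaimTemplate for one `pod.spec.resourceClaims` entry. The types are simplified stand-ins for the API in this KEP and `NewClaimForPod` is an illustrative helper; the deterministic pod-derived name mirrors the original alpha behavior, which later releases replaced with generated names.

```go
// Hypothetical sketch of stamping out a ResourceClaim from a
// ResourceClaimTemplate, mirroring the generic ephemeral volume controller.
// The types are simplified stand-ins for the API defined in this KEP.
package claimcontroller

type ObjectMeta struct {
	Name        string
	Namespace   string
	Labels      map[string]string
	Annotations map[string]string
}

type ResourceClaimSpec struct {
	ResourceClassName string
	AllocationMode    string
}

type ResourceClaimTemplateSpec struct {
	ObjectMeta ObjectMeta        // only labels/annotations may be set here
	Spec       ResourceClaimSpec // copied unchanged into the generated claim
}

type ResourceClaimTemplate struct {
	ObjectMeta ObjectMeta
	Spec       ResourceClaimTemplateSpec
}

type OwnerReference struct {
	Kind string
	Name string
	UID  string
}

type ResourceClaim struct {
	ObjectMeta ObjectMeta
	Owner      OwnerReference // the Pod, so the claim is garbage collected with it
	Spec       ResourceClaimSpec
}

// NewClaimForPod builds the ResourceClaim for one pod.spec.resourceClaims
// entry. The deterministic name shown here reflects the original alpha
// behavior; since Kubernetes 1.28 the real controller uses a generated name.
func NewClaimForPod(podName, podUID, podClaimName string, tmpl ResourceClaimTemplate) ResourceClaim {
	return ResourceClaim{
		ObjectMeta: ObjectMeta{
			Name:        podName + "-" + podClaimName,
			Namespace:   tmpl.ObjectMeta.Namespace,
			Labels:      tmpl.Spec.ObjectMeta.Labels,
			Annotations: tmpl.Spec.ObjectMeta.Annotations,
		},
		Owner: OwnerReference{Kind: "Pod", Name: podName, UID: podUID},
		Spec:  tmpl.Spec.Spec,
	}
}
```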
- -kube-controller-manager will need new [RBAC -permissions](https://github.com/kubernetes/kubernetes/commit/ff3e5e06a79bc69ad3d7ccedd277542b6712514b#diff-2ad93af2302076e0bdb5c7a4ebe68dd3188eee8959c72832181a7597417cd196) that allow creating and updating ResourceClaims. - -kube-controller-manager also removes `claim.status.reservedFor` entries that reference -deleted objects or pods that have completed ("Phase" is "done"). -This is required for pods because kubelet does not have write -permission for ResourceClaimStatus. Pods as user is the common case, so special -code based on a shared pod informer will handle it. All other user types can -be handled through a generic informer or simply polling. - -In addition to updating `claim.status.reservedFor`, kube-controller-manager also deletes -ResourceClaims that are owned by a completed pod to ensure that they -get deallocated as soon as possible once they are not needed anymore. - -Finally, kube-controller-manager tries to make pods runnable that were -[scheduled to a node -prematurely](#scheduled-pods-with-unallocated-or-unreserved-claims) by -triggering allocation and reserving claims when it is certain that -kube-scheduler is not going to handle that. - -### kube-scheduler - -The scheduler plugin for ResourceClaims ("claim plugin" in this section) -needs to implement several extension points. It handles -communication with a resource driver through the apiserver. The [volume -binder -plugin](https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins/volumebinding) -can serve as a reference. - -Scheduling of a pod using a ResourceClaim may have to wait for a resource -driver to do something, typically allocating the resource. When the scheduler -notices this, the current scheduling attempt for the pod must stop and the pod -needs to be put back into the work queue. It then gets retried whenever a -ResourceClaim gets added or modified. - -The following extension points are implemented in the new claim plugin. Except -for some unlikely edge cases (see below) there are no API calls during the main -scheduling cycle. Instead, the plugin collects information and updates the -cluster in the separate goroutine which invokes PreBind. - - -#### EventsToRegister - -This registers all cluster events that might make an unschedulable pod -schedulable, like creating a claim that the pod needs or finishing the -allocation of a claim. - -[Queuing hints](https://github.com/kubernetes/enhancements/issues/4247) are -supported. These are callbacks that can limit the effect of a cluster event to -specific pods. For example, allocating a claim only makes those pods -scheduleable which reference the claim. There is no need to try scheduling a pod -which waits for some other claim. Hints are also used to trigger the next -scheduling cycle for a pod immediately when some expected and require event -like "drivers have provided information" occurs, instead of forcing the pod to -go through the backoff queue and the usually 5 second long delay associated -with that. - -Queuing hints are an optional feature of the scheduler, with (as of Kubernetes -1.29) their own `SchedulerQueueingHints` feature gate that defaults to -off. When turned off, performance of scheduling pods with resource claims is -slower compared to a cluster configuration where they are turned on. - -#### PreEnqueue - -This checks whether all claims referenced by a pod exist. 
If they don't, -scheduling the pod has to wait until the kube-controller-manager or user create -the claims. - -#### Pre-filter - -This checks whether a Pod uses any ResourceClaims. If there are ResourceClaims -with immediate binding that are not allocated yet, then the Pod will be marked -as unschedulable at the moment. - -#### Filter - -This checks whether the given node has access to those ResourceClaims which -were already allocated. - -For unallocated ResourceClaims with delayed allocation, only those nodes are -filtered out that are explicitly listed in their UnsuitableNodes field of their -PodSchedulingContext.Claims entry (if such an entry already exists) or that -don't match the optional ResourceClass.SuitableNodes node selector. - -There are several -reasons why such a deny list is more suitable than an allow list: -- Nodes for which no information is available must pass the filter phase to be - included in the list that will be passed to pre-score and to get copied - into the PodSchedulingContext.PotentialNodes field there. -- A node can already be chosen while there is no information yet and, if - allocation for that node actually works, the Pod can get scheduled sooner. -- Some resource drivers might not have any unsuitable nodes, for example - because they modify containers and that works on all nodes at all - times. Forcing such drivers to set an allow list would cause unnecessary - work. - -In its state for the Pod the claim plugin must remember when it rejected a -node because of UnsuitableNodes. That information will be used in Post-filter -to deallocate resources. - -#### Post-filter - -This is called when no suitable node could be found. If the Pod depends on ResourceClaims with delayed -allocation, then deallocating one or more of these ResourceClaims may make the -Pod schedulable after allocating the resource elsewhere. Therefore each -ResourceClaim with delayed allocation is checked whether all of the following -conditions apply: -- allocated -- not currently in use -- it was the reason why some node could not fit the Pod, as recorded earlier in - Filter - -One of the ResourceClaims satisfying these criteria is picked randomly and deallocation -is requested by setting the ResourceClaimStatus.DeallocationRequested field. The scheduler then needs to wait -for the resource driver to react to that change and deallocate the resource. - -This may make it possible to run the Pod -elsewhere. If it still doesn't help, deallocation may continue with another -ResourceClaim, if there is one. To prevent deallocating all resources merely -because the scheduler happens to check quickly, the next deallocation will only -requested when there is none currently pending. - -At the moment, the claim plugin has no information that might enable it to -prioritize which resource to deallocate first. Future extensions of this KEP -might attempt to improve this. - -This is currently using blocking API calls. It's quite rare because this -situation can only arise when there are multiple claims per pod and allocation -for one of them fails despite all drivers agreeing that a node should be -suitable, or when reusing a claim for multiple pods (not a common use case) and -the original node became unusable for the next pod. - -#### Pre-score - -This is passed a list of nodes that have passed filtering by the claim -plugin and the other plugins. 
That list is stored by the claim plugin and will -be copied to PodSchedulingContextSpec.PotentialNodes when the claim plugin -creates or updates the object in Reserve. - -Pre-score is not called when there is only a single potential node. In that -case Reserve will store the selected node in PodSchedulingContextSpec.PotentialNodes. - -#### Reserve - -A node has been chosen for the Pod. - -If using delayed allocation and one or more claims have not been allocated yet, -the claim plugin now needs to decide whether it wants to trigger allocation by -setting the PodSchedulingContextSpec.SelectedNode field. For a single unallocated -claim that is safe even if no information about unsuitable nodes is available -because the allocation will either succeed or fail. For multiple such claims -allocation only gets triggered when that information is available, to minimize -the risk of getting only some but not all claims allocated. In both cases the -PodSchedulingContext gets created or updated as needed. This is also where the -PodSchedulingContextSpec.PotentialNodes field gets set. - -If all resources have been allocated already, -the claim plugin ensures that the Pod is listed in the `claim.status.reservedFor` field -of its ResourceClaims. The driver can and should already have added -the Pod when specifically allocating the claim for it, so it may -be possible to skip this update. - -All the PodSchedulingContext and ResourceClaim updates are recorded in the -plugin state. They will be written to the cluster during PreBind. - -If some resources are not allocated yet or reserving an allocated resource -fails, the scheduling attempt needs to be aborted and retried at a later time -or when the statuses change. The Reserve call itself never fails. If resources -are not currently available, that information is recorded in the plugin state -and will cause the PreBind call to fail instead. - -#### PreBind - -This is called in a separate goroutine. The plugin now checks all the -information gathered earlier and updates the cluster accordingly. If some -claims are not allocated or not reserved, PreBind fails and the pod must be -retried. - -#### Unreserve - -The claim plugin removes the Pod from the `claim.status.reservedFor` field -because it cannot be scheduled after all. - -This is necessary to prevent a deadlock: suppose there are two stand-alone -claims that only can be used by one pod at a time and two pods which both -reference them. Both pods will get scheduled independently, perhaps even by -different schedulers. When each pod manages to allocate and reserve one claim, -then neither of them can get scheduled because they cannot reserve the other -claim. - -Giving up the reservations in Unreserve means that the next pod scheduling -attempts have a chance to succeed. It's non-deterministic which pod will win, -but eventually one of them will. Not giving up the reservations would lead to a -permanent deadlock that somehow would have to be detected and resolved to make -progress. - -Unreserve is called in two scenarios: -- In the main goroutine when scheduling a pod has failed: in that case the plugin's - Reserve call hasn't actually changed the claim status yet, so there is nothing - that needs to be rolled back. -- After binding has failed: this runs in a goroutine, so reverting the - `claim.status.reservedFor` with a blocking call is acceptable. 
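A minimal sketch of the Unreserve cleanup described above, assuming an abbreviated `ResourceClaimConsumerReference`; `Unreserve` is an illustrative helper, not the actual plugin code.

```go
// Hypothetical sketch of the Unreserve step: remove the pod from
// claim.status.reservedFor so that a competing pod can reserve the claim
// instead. ResourceClaimConsumerReference is abbreviated to the fields
// needed here.
package claimplugin

type ResourceClaimConsumerReference struct {
	Resource string // e.g. "pods"
	Name     string
	UID      string
}

// Unreserve returns reservedFor without the given pod and reports whether the
// claim status actually needs to be patched in the apiserver.
func Unreserve(reservedFor []ResourceClaimConsumerReference, podUID string) ([]ResourceClaimConsumerReference, bool) {
	kept := make([]ResourceClaimConsumerReference, 0, len(reservedFor))
	changed := false
	for _, ref := range reservedFor {
		if ref.Resource == "pods" && ref.UID == podUID {
			changed = true
			continue
		}
		kept = append(kept, ref)
	}
	return kept, changed
}
```

Because the result is idempotent, the plugin can safely retry the status update after a conflict or a failed binding attempt.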
- -### Cluster Autoscaler - -When [Cluster -Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler) -encounters a pod that uses a resource claim for node-local resources, it needs -to understand the parameters for the claim and available capacity in order -to simulate the effect of allocating claims as part of scheduling and of -creating or removing nodes. - -This is not possible with opaque parameters as described in this KEP. If a DRA -driver developer wants to support Cluster Autoscaler, they have to use semantic -parameters. Semantic parameters are an extension of this KEP that is defined in -[KEP #4381](https://github.com/kubernetes/enhancements/issues/4381). - -Semantic parameters are not necessary for network-attached resources because -adding or removing nodes doesn't change their availability and thus Cluster -Autoscaler does not need to understand their parameters. - -### kubelet - -#### Managing resources - -kubelet must ensure that resources are ready for use on the node before running -the first Pod that uses a specific resource instance and make the resource -available elsewhere again when the last Pod has terminated that uses it. For -both operations, kubelet calls a resource kubelet plugin as explained in the next -section. - -Pods that are not listed in ReservedFor or where the ResourceClaim doesn't -exist at all must not be allowed to run. Instead, a suitable event must be -emitted which explains the problem. Such a situation can occur as part of -downgrade scenarios. - -If this was the last Pod on the node that uses the specific -resource instance, then NodeUnprepareResource (see below) must have been called -successfully before allowing the pod to be deleted. This ensures that network-attached resource are available again -for other Pods, including those that might get scheduled to other nodes. It -also signals that it is safe to deallocate and delete the ResourceClaim. - - -![kubelet](./kubelet.png) - -#### Communication between kubelet and resource kubelet plugin - -Resource kubelet plugins are discovered through the [kubelet plugin registration -mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A -new "ResourcePlugin" type will be used in the Type field of the -[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo) -response to distinguish the plugin from device and CSI plugins. - -Under the advertised Unix Domain socket the kubelet plugin provides one of the -following supported gRPC interface versions. It was inspired by -[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md), -with “volume” replaced by “resource” and volume specific parts removed. - -Key difference between interface versions: +will trigger deallocation of specific claims as +needed through the ResourceClaimStatus.DeallocationRequested +field. -- [v1alpha2](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/dra/v1alpha2/api.proto) -interface provides resource claim information to a kubelet plugin one at a -time. **Note: v1alpha2 will be deprecared, switch to v1alpha3** -- [v1alpha3](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/dra/v1alpha3/api.proto) -interface provides information about all resource claims of a pod that belong -to a particular driver in a single call. 
This way the kubelet plugin of this driver can consider all -resources that need to be prepared or unprepared for the pod simultaneously. +The ResourceClass.SuiteableNodes node selector can be +used to filter out nodes based on labels. This prevents +adding nodes here that the driver then would need to +reject through UnsuitableNodes. +``` +// PodSchedulingContextStatus describes where resources for the Pod can be allocated. +type PodSchedulingContextStatus struct { + // ResourceClaims describes resource availability for each + // pod.spec.resourceClaim entry where the corresponding ResourceClaim + // uses "WaitForFirstConsumer" allocation mode. + // +optional + ResourceClaims []ResourceClaimSchedulingStatus + + // If there ever is a need to support other kinds of resources + // than ResourceClaim, then new fields could get added here + // for those other resources. +} +``` -##### `NodePrepareResource` +Each resource driver is responsible for providing information about +those resources in the Pod that the driver manages. It can skip +adding this information once it has allocated the resource. -This RPC is called by the kubelet when a Pod that wants to use the specified -resource is scheduled on a node. The Plugin SHALL assume that this RPC will be -executed on the node where the resource will be used. +A driver must add entries here for all its pending claims, even if +the ResourceSchedulingStatus.UnsuitabeNodes field is empty, +because the scheduler may decide to wait with selecting +a node until it has information from all drivers. -ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and -one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with -a matching DriverName should be passed to the Plugin as parameters to identify -the claim and perform resource preparation. +``` +// ResourceClaimSchedulingStatus contains information about one particular +// ResourceClaim with "WaitForFirstConsumer" allocation mode. +type ResourceClaimSchedulingStatus struct { + // Name matches the pod.spec.resourceClaims[*].Name field. + Name string -ResourceClaim parameters (namespace, UUID, name) are useful for debugging. -They enable the Plugin to retrieve the full ResourceClaim object, should it -ever be needed (normally it shouldn't). + // UnsuitableNodes lists nodes that the ResourceClaim cannot be + // allocated for. + // + // The size of this field is limited to 128, the same as for + // PodSchedulingContextSpec.PotentialNodes. This may get increased in the + // future, but not reduced. + // +optional + UnsuitableNodes []string +} -The Plugin SHALL return fully qualified device name[s]. +// PodSchedulingContextNodeListMaxSize defines the maximum number of entries in +// the node lists that are stored in PodSchedulingContexts. This limit is part +// of the API. +const PodSchedulingContextNodeListMaxSize = 256 +``` -The Plugin SHALL ensure that there are json file[s] in CDI format -for the allocated resource. These files SHALL be used by runtime to -update runtime configuration before creating containers that use the -resource. +UnsuitableNodes lists nodes that the claim cannot be allocated for. +Nodes listed here will be ignored by the scheduler when selecting a +node for a Pod. All other nodes are potential candidates, either +because no information is available yet or because allocation might +succeed. -This operation SHALL do as little work as possible as it’s called -after a pod is scheduled to a node. 
All potentially failing operations -SHALL be done during allocation phase. +A change to the PodSchedulingContextSpec.PotentialNodes field and/or a failed +allocation attempt triggers an update of this field: the driver +then checks all nodes listed in PotentialNodes and UnsuitableNodes +and updates UnsuitableNodes. -This operation MUST be idempotent. If the resource corresponding to -the `resource_id` has already been prepared, the Plugin MUST reply `0 -OK`. +It must include the prior UnsuitableNodes in this check because the +scheduler will not list those again in PotentialNodes but they might +still be unsuitable. -If this RPC failed, or kubelet does not know if it failed or not, it -MAY choose to call `NodePrepareResource` again, or choose to call -`NodeUnprepareResource`. +This can change, so the driver must also refresh this information +periodically and/or after changing resource allocation for some +other ResourceClaim until a node gets selected by the scheduler. -On a successful call this RPC should return set of fully qualified -CDI device names, which kubelet MUST pass to the runtime through the CRI -protocol. For version v1alpha3, the RPC should return multiple sets of -fully qualified CDI device names, one per claim that was sent in the input parameters. +### Coordinating resource allocation through the scheduler -###### v1alpha2 +For immediate allocation, scheduling Pods is simple because the +resource is already allocated and determines the nodes on which the +Pod may run. The downside is that pod scheduling is less flexible. -> [!WARNING] -> v1alpha2 will be deprecated, switch to v1alpha3. +For delayed allocation, a node is selected tentatively by the scheduler +in an iterative process where the scheduler suggests some potential nodes +that fit the other resource requirements of a Pod and resource drivers +respond with information about whether they can allocate claims for those +nodes. This exchange of information happens through the `PodSchedulingContext` +for a Pod. The scheduler has to involve the drivers because it +doesn't know what claim parameters mean and where suitable resources are +currently available. -
-v1alpha2 +Once the scheduler is confident that it has enough information to select +a node that will probably work for all claims, it asks the driver(s) to +allocate their resources for that node. If that +succeeds, the Pod can get scheduled. If it fails, the scheduler must +determine whether some other node fits the requirements and if so, +request allocation again. If no node fits because some resources were +already allocated for a node and are only usable there, then those +resources must be released and then get allocated elsewhere. -```protobuf -message NodePrepareResourceRequest { - // The ResourceClaim namespace (ResourceClaim.meta.Namespace). - // This field is REQUIRED. - string namespace = 1; - // The UID of the Resource claim (ResourceClaim.meta.UUID). - // This field is REQUIRED. - string claim_uid = 2; - // The name of the Resource claim (ResourceClaim.meta.Name) - // This field is REQUIRED. - string claim_name = 3; - // Resource handle (AllocationResult.ResourceHandles[*].Data) - // This field is REQUIRED. - string resource_handle = 4; -} +This is a summary of the necessary [kube-scheduler changes](#kube-scheduler) in +pseudo-code: -message NodePrepareResourceResponse { - // These are the additional devices that kubelet must - // make available via the container runtime. A resource - // may have zero or more devices. - repeated string cdi_device = 1; +``` +while { + + if { + if { + + } + } else if { + + } else if { + + } } ``` -
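Relating this to the PodSchedulingContext exchange described in this section, the following is a hedged sketch of how a driver might recompute the unsuitable-node list for one claim. The `available` callback is an assumption standing in for whatever resource inventory the driver keeps; none of these function names come from the KEP API.

```go
// Hypothetical sketch of the driver-side half of the PodSchedulingContext
// exchange: recompute unsuitableNodes for one claim from the scheduler's
// current potentialNodes plus the nodes reported as unsuitable earlier.
package dradriver

// UnsuitableNodes re-checks previously reported unsuitable nodes as well,
// because the scheduler does not list them again in potentialNodes even
// though they may still be relevant for the final node selection.
func UnsuitableNodes(potentialNodes, previousUnsuitable []string, available func(node string) bool) []string {
	seen := make(map[string]bool)
	var unsuitable []string
	check := func(nodes []string) {
		for _, node := range nodes {
			if seen[node] {
				continue
			}
			seen[node] = true
			if !available(node) {
				unsuitable = append(unsuitable, node)
			}
		}
	}
	check(potentialNodes)
	check(previousUnsuitable)
	return unsuitable
}
```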
- -###### v1alpha3 - -```protobuf -message NodePrepareResourcesRequest { - // The list of ResourceClaims that are to be prepared. - repeated Claim claims = 1; -} +Randomly picking a node without knowing anything about the resource driver may +or may not succeed. To narrow the choice of suitable nodes for all claims using +a certain resource class, a node selector can be specified in that class. That +selector is static and typically will use labels that determine which nodes may +have resources available. -message Claim { - // The ResourceClaim namespace (ResourceClaim.meta.Namespace). - // This field is REQUIRED. - string namespace = 1; - // The UID of the Resource claim (ResourceClaim.meta.UUID). - // This field is REQUIRED. - string uid = 2; - // The name of the Resource claim (ResourceClaim.meta.Name) - // This field is REQUIRED. - string name = 3; - // Resource handle (AllocationResult.ResourceHandles[*].Data) - // This field is REQUIRED. - string resource_handle = 4; -} +To gather information about the current state of resource availability and to +trigger allocation of a claim, the scheduler creates one PodSchedulingContext +for each pod that uses claims. That object is owned by the pod and +will either get deleted by the scheduler when it is done with pod scheduling or +through the garbage collector. In the PodSchedulingContext, the scheduler posts +the list of all potential nodes that it was left with after considering all +other pod constraints and requirements. Resource drivers involved in the +scheduling of the pod respond by adding which of these nodes currently don't +have sufficient resources available. The next scheduling attempt is then more +likely to pick a node for which allocation succeeds. -message NodePrepareResourcesResponse { - // The ResourceClaims for which preparation was done - // or attempted, with claim_uid as key. - // - // It is an error if some claim listed in NodePrepareResourcesRequest - // does not get prepared. NodePrepareResources - // will be called again for those that are missing. - map claims = 1; -} +This scheduling information is optional and does not have to be in sync with +the current ResourceClaim state, therefore it is okay to store it +separately. -message NodePrepareResourceResponse { - // These are the additional devices that kubelet must - // make available via the container runtime. A resource - // may have zero or more devices. - repeated string cdi_devices = 1 [(gogoproto.customname) = "CDIDevices"]; - // If non-empty, preparing the ResourceClaim failed. - // cdi_devices is ignored in that case. - string error = 2; -} -``` +Allowing the scheduler to trigger allocation in parallel to asking for more +information was chosen because for pods with a single resource claim, the cost +of guessing wrong is low: the driver just needs to inform the scheduler to try +again and provide the additional information. -CRI protocol MUST be extended for this purpose: - - * CDIDevice structure should be added to the CRI specification -```protobuf -// CDIDevice specifies a CDI device information. -message CDIDevice { - // Fully qualified CDI device name - // for example: vendor.com/gpu=gpudevice1 - // see more details in the CDI specification: - // https://github.com/container-orchestrated-devices/container-device-interface/blob/main/SPEC.md - string name = 1; -} -``` - * CDI devices should be added to the ContainerConfig structure: -```protobuf -// ContainerConfig holds all the required and optional fields for creating a -// container. 
-message ContainerConfig { - // Metadata of the container. This information will uniquely identify the - // container, and the runtime should leverage this to ensure correct - // operation. The runtime may also use this information to improve UX, such - // as by constructing a readable name. - ContainerMetadata metadata = 1 ; - // Image to use. - ImageSpec image = 2; - // Command to execute (i.e., entrypoint for docker) - repeated string command = 3; -... - // CDI devices for the container. - repeated CDIDevice cdi_devices = 17; -} -``` +Additional heuristics are possible without changing the proposed API. For +example, the scheduler might ask for information and wait a while before +making a choice. This may be more suitable for pods using many different +resource claims because for those, allocation may succeed for some claims and +fail for others, which then may need to go through the recovery flow with +deallocating one or more claims. -###### NodePrepareResource Errors - -If the plugin is unable to complete the NodePrepareResource call -successfully, it MUST return a non-ok gRPC code in the gRPC status. -If the conditions defined below are encountered, the plugin MUST -return the specified gRPC error code. Kubelet MUST implement the -specified error recovery behavior when it encounters the gRPC error -code. - -| Condition | gRPC Code | Description | Recovery Behavior | -|-----------|-----------|-------------|-------------------| -| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. | - - -##### `NodeUnprepareResource` - -A Kubelet Plugin MUST implement this RPC call. This RPC is a reverse -operation of `NodePrepareResource`. This RPC MUST undo the work by -the corresponding `NodePrepareResource`. This RPC SHALL be called by -kubelet at least once for each successful `NodePrepareResource`. The -Plugin SHALL assume that this RPC will be executed on the node where -the resource is being used. - -This RPC is called by the kubelet when the last Pod using the resource is being -deleted or has reached a final state ("Phase" is "done"). - -This operation MUST be idempotent. If this RPC failed, or kubelet does -not know if it failed or not, it can choose to call -`NodeUnprepareResource` again. - -###### v1alpha2 - -> [!WARNING] -> v1alpha2 will be deprecated, switch to v1alpha3. - -
-v1alpha2 - -```protobuf -message NodeUnprepareResourceRequest { - // The ResourceClaim namespace (ResourceClaim.meta.Namespace). - // This field is REQUIRED. - string namespace = 1; - // The UID of the Resource claim (ResourceClaim.meta.UUID). - // This field is REQUIRED. - string claim_uid = 2; - // The name of the Resource claim (ResourceClaim.meta.Name) - // This field is REQUIRED. - string claim_name = 3; - // Resource handle (AllocationResult.ResourceHandles[*].Data) - // This field is REQUIRED. - string resource_handle = 4; -} +### Resource allocation and usage flow -message NodeUnprepareResourceResponse { - // Intentionally empty. -} -``` +The following steps shows how resource allocation works for a resource that +gets defined in a ResourceClaimTemplate and referenced by a Pod. Several of these steps may fail without changing +the system state. They then must be retried until they succeed or something +else changes in the system, like for example deleting objects. -
+* **user** creates Pod with reference to ResourceClaimTemplate +* **resource claim controller** checks ResourceClaimTemplate and ResourceClass, + then creates ResourceClaim with Pod as owner +* if *immediate allocation*: + * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress + * **resource driver** finishes allocation, sets `claim.status.allocation` -> claim ready for use by any pod +* if *pod is pending*: + * **scheduler** filters nodes based on built-in resources and the filter callback of plugins, + which includes constraints imposed by already allocated resources + * if *delayed allocation and resource not allocated yet*: + * if *at least one node fits pod*: + * **scheduler** creates or updates a `PodSchedulingContext` with `podSchedulingContext.spec.potentialNodes=` + * if *exactly one claim is pending (see below)* or *all drivers have provided information*: + * **scheduler** picks one node, sets `podSchedulingContext.spec.selectedNode=` + * if *resource is available for this selected node*: + * **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress + * **resource driver** finishes allocation, sets `claim.status.allocation` and the + pod in `claim.status.reservedFor` -> claim ready for use and reserved for the pod + * else *scheduler needs to know that it must avoid this and possibly other nodes*: + * **resource driver** sets `podSchedulingContext.status.claims[name=name of claim in pod].unsuitableNodes` + * else *pod cannot be scheduled*: + * **scheduler** may trigger deallocation of some claim with delayed allocation by setting `claim.status.deallocationRequested` to true + (see [pseudo-code above](#coordinating-resource-allocation-through-the-scheduler)) or wait + * if *pod not listed in `claim.status.reservedFor` yet* (can occur for immediate allocation): + * **scheduler** adds it to `claim.status.reservedFor` + * if *resource allocated and reserved*: + * **scheduler** sets node in Pod spec -> Pod ready to run + * **scheduler** deletes `PodSchedulingContext` if one exists +* if *node is set for pod*: + * if `resource not reserved for pod` (user might have set the node field): + * **kubelet** refuses to start the pod -> permanent failure + * else `pod may run`: + * **kubelet** asks driver to prepare the resource + * if `resource is prepared`: + * **kubelet** creates container(s) which reference(s) the resource through CDI -> Pod is running +* if *pod has terminated* and *pod deleted*: + * **kubelet** asks driver to unprepare the resource + * **kubelet** allows pod deletion to complete by clearing the `GracePeriod` +* if *pod removed*: + * **garbage collector** deletes ResourceClaim -> adds `claim.deletionTimestamp` because of finalizer +* if *ResourceClaim has `claim.deletionTimestamp` and `claim.status.reservedFor` is empty*: + * **resource driver** deallocates resource + * **resource driver** clears finalizer and `claim.status.allocation` + * **API server** removes ResourceClaim -###### v1alpha3 +When exactly one claim is pending, it is safe to trigger the allocation: if the +node is suitable, the allocation will succeed and the pod can get scheduled +without further delays. If the node is not suitable, allocation fails and the +next attempt can do better because it has more information. The same should not +be done when there are multiple claims because allocation might succeed for +some, but not all of them, which would force the scheduler to recover by asking +for deallocation. 
It's better to wait for information in this case. -```protobuf -message NodeUnprepareResourcesRequest { - // The list of ResourceClaims that are to be unprepared. - repeated Claim claims = 1; -} +The flow is similar for a ResourceClaim that gets created as a stand-alone +object by the user. In that case, the Pod reference that ResourceClaim by +name. The ResourceClaim does not get deleted at the end and can be reused by +another Pod and/or used by multiple different Pods at the same time (if +supported by the driver). The resource remains allocated as long as the +ResourceClaim doesn't get deleted by the user. -message NodeUnprepareResourcesResponse { - // The ResourceClaims for which preparation was reverted. - // The same rules as for NodePrepareResourcesResponse.claims - // apply. - map claims = 1; -} +If a Pod references multiple claims managed by the same driver, then the driver +can combine updating `podSchedulingContext.claims[*].unsuitableNodes` for all +of them, after considering all claims. -message NodeUnprepareResourceResponse { - // If non-empty, unpreparing the ResourceClaim failed. - string error = 1; -} +### Scheduled pods with unallocated or unreserved claims -message Claim { - // The ResourceClaim namespace (ResourceClaim.meta.Namespace). - // This field is REQUIRED. - string namespace = 1; - // The UID of the Resource claim (ResourceClaim.meta.UUID). - // This field is REQUIRED. - string uid = 2; - // The name of the Resource claim (ResourceClaim.meta.Name) - // This field is REQUIRED. - string name = 3; - // Resource handle (AllocationResult.ResourceHandles[*].Data) - // This field is REQUIRED. - string resource_handle = 4; -} -``` +As with structured parameters, there are several scenarios where a Pod might be +scheduled (= `pod.spec.nodeName` set) while the claims that it depends on are +not allocated or not reserved for it. The kubelet is refusing to run such pods. + +In addition to the solutions described for structured parameters, using a control +plane controller provides one additional solution: +- When kube-controller-manager observes that allocation is missing, it creates + a `PodSchedulingContext` with only the `spec.selectedNode` field set to the + name of the node chosen for the pod. There is no need to list suitable nodes + because that choice is permanent, so resource drivers don't need check for + unsuitable nodes. All that they can do is to (re)try allocating the claim + until that succeeds. +- If such a pod has allocated claims that are not reserved for it yet, + then kube-controller-manager can (re)try to reserve the claim until + that succeeds. -###### NodeUnprepareResource Errors +Once all of those steps are complete, kubelet will notice that the claims are +ready and run the pod. Until then it will keep checking periodically, just as +it does for other reasons that prevent a pod from running. -If the plugin is unable to complete the NodeUprepareResource call -successfully, it MUST return a non-ok gRPC code in the gRPC status. -If the conditions defined below are encountered, the plugin MUST -return the specified gRPC error code. Kubelet MUST implement the -specified error recovery behavior when it encounters the gRPC error -code. +### Cluster Autoscaler -| Condition | gRPC Code | Description | Recovery Behavior | -|-----------|-----------|-------------|-------------------| -| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. 
| Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. | +When [Cluster +Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler) +encounters a pod that uses a resource claim for node-local resources, it needs +to understand the parameters for the claim and available capacity in order +to simulate the effect of allocating claims as part of scheduling and of +creating or removing nodes. -#### Implementing optional resources +This is not possible with opaque parameters as described in this KEP. If a DRA +driver developer wants to support Cluster Autoscaler, they have to use +structured parameters as defined in [KEP +#4381](https://github.com/kubernetes/enhancements/issues/4381). -This can be handled entirely by a resource driver: its parameters can support a -range starting at zero or a boolean flag that indicates that something is not a -hard requirement. When asked to filter nodes for delayed allocation, the driver -reports nodes where the resource is available and only falls back to those -without it when resources are exhausted. When asked to allocate, it reserves -actual resources if possible, but also proceeds with marking the ResourceClaim -as allocated if that is not possible. Kubernetes then can schedule a pod using -the ResourceClaim. The pod needs to determine through information passed in by -the resource driver which resources are actually available to it. +Structured parameters are not necessary for network-attached resources because +adding or removing nodes doesn't change their availability and thus Cluster +Autoscaler does not need to understand their parameters. -#### Implementing a plugin for node resources +### Implementing a plugin for node resources The proposal depends on a central resource driver controller. Implementing that part poses an additional challenge for drivers that manage resources @@ -2361,6 +811,19 @@ module. can unset the selected node field to trigger another allocation attempt elsewhere. +### Implementing optional resources + +This can be handled entirely by a resource driver: its parameters can support a +range starting at zero or a boolean flag that indicates that something is not a +hard requirement. When asked to filter nodes for delayed allocation, the driver +reports nodes where the resource is available and only falls back to those +without it when resources are exhausted. When asked to allocate, it reserves +actual resources if possible, but also proceeds with marking the ResourceClaim +as allocated if that is not possible. Kubernetes then can schedule a pod using +the ResourceClaim. The pod needs to determine through information passed in by +the resource driver which resources are actually available to it. + + ### Test Plan [X] I/we understand the owners of the involved components may require updates to @@ -2471,7 +934,7 @@ For beta: - In normal scenarios, scheduling pods with claims must not block scheduling of other pods by doing blocking API calls -- Implement integration with Cluster Autoscaler through semantic parameters +- Implement integration with Cluster Autoscaler through structured parameters - Gather feedback from developers and surveys - Positive acknowledgment from 3 would-be implementors of a resource driver, from a diversity of companies or projects @@ -2819,6 +1282,7 @@ instructions. 
- Kubernetes 1.28: API break (ResourceClaim names for claims created from a template are generated instead of deterministic), scheduler performance enhancements (no more backoff delays). +- Kubernetes 1.29, 1.30: most blocking API calls moved into Pod binding goroutine ## Drawbacks @@ -2826,197 +1290,5 @@ instructions. Why should this KEP _not_ be implemented? --> -## Alternatives - -### Semantic Parameters instead of PodSchedulingContext - -When a DRA driver uses semantic parameters, there is no need for a DRA driver controller -which allocates the claim -and no need for communication between scheduler and such a controller. The -PodSchedulingContext object and the associated support in the scheduler then -aren't needed. Once semantic parameters are mature enough and confirmed to be -sufficient for DRA drivers, it might become possible to remove the -PodSchedulingContext API from this KEP. - -It might still be needed for other drivers and use cases, which then can be -discussed in a new KEP which focuses specifically on those use cases. - -### ResourceClaimTemplate - -Instead of creating a ResourceClaim from a template, the -PodStatus could be extended to hold the same information as a -ResourceClaimStatus. Every component which works with that information -then needs permission and extra code to work with PodStatus. Creating -an extra object seems simpler. - -### Reusing volume support as-is - -ResourceClaims are similar to PersistentVolumeClaims and also a lot of -the associated logic is similar. An [early -prototype](https://github.com/intel/proof-of-concept-cdi) used a -custom CSI driver to manage resources. - -The user experience with that approach is poor because per-resource -parameters must be stored in annotations of a PVC due to the lack of -custom per-PVC parameters. Passing annotations as additional parameters was [proposed -before](https://github.com/kubernetes-csi/external-provisioner/issues/86) -but were essentially [rejected by -SIG-Storage](https://github.com/kubernetes-csi/external-provisioner/issues/86#issuecomment-465836185) -because allowing apps to set custom parameters would make apps -non-portable. - -The current volume support also has open issues that affect the -“volume as resource” approach: Multiple different Pods on a node are -allowed to use the same -volume. https://github.com/kubernetes/enhancements/pull/2489 will -address that, but is still work in progress. Recovery from a bad node -selection during delayed binding may get stuck when a Pod has multiple -volumes because volumes are not getting deleted after a partial -provisioning. A proposal to fix that needs further work -(https://github.com/kubernetes/enhancements/pull/1703). Each “fake” -CSI driver would have to implement and install a scheduler extender -because storage capacity tracking only considers volume size as -criteria for selecting nodes, which is not applicable for custom -resources. - -### Extend volume support - -The StorageClass and PersistentVolumeClaim structs could be extended -to allow custom parameters. Together with an extension of the CSI -standard that would address the main objection against the previous -alternative. - -However, SIG-Storage and the CSI community would have to agree to this -kind of reuse and accept that some of the code maintained by them -becomes more complex because of these new use cases. - -### Extend Device Plugins - -The device plugins API could be extended to implement some of the -requirements mentioned in the “Motivation” section of this -document. 
There were certain attempts to do it, for example an attempt -to [add ‘Deallocate’ API call](https://github.com/kubernetes/enhancements/pull/1949) and [pass pod annotations to 'Allocate' API call](https://github.com/kubernetes/kubernetes/pull/61775) - -However, most of the requirements couldn’t be satisfied using this -approach as they would require major incompatible changes in the -Device Plugins API. For example: partial and optional resource -allocation couldn’t be done without changing the way resources are -currently declared on the Pod and Device Plugin level. - -Extending the device plugins API to use [Container Device Interface](https://github.com/container-orchestrated-devices/container-device-interface) -would help address some of the requirements, but not all of them. - -NodePrepareResource and NodeUnprepareResource could be added to the Device Plugins API and only get called for -resource claims. - -However, this would mean that -developers of the device plugins would have to implement mandatory -API calls (ListAndWatch, Allocate), which could create confusion -as those calls are meaningless for the Dynamic Resource Allocation -purposes. - -Even worse, existing device plugins would have to implement the new -calls with stubs that return errors because the generated Go interface -will require them. - -It should be also taken into account that device plugins API is -beta. Introducing incompatible changes to it may not be accepted by -the Kubernetes community. - -### Webhooks instead of ResourceClaim updates - -In the current design, scheduler and the third-party resource driver communicate by -updating fields in a ResourceClaim. This has several advantages compared to an -approach were kube-scheduler retrieves information from the resource driver -via HTTP: -* No need for a new webhook API. -* Simpler deployment of a resource driver because all it needs are - credentials to communicate with the apiserver. -* Current status can be checked by querying the ResourceClaim. - -The downside is higher load on the apiserver and an increase of the size of -ResourceClaim objects. - -### ResourceDriver - -Similar to CSIDriver for storage, a separate object describing a resource -driver might be useful at some point. At the moment it is not needed yet and -therefore not part of the v1alpha2 API. If it becomes necessary to describe -optional features of a resource driver, such a ResourceDriver type might look -like this: - -``` -type ResourceDriver struct { - // The name of the object is the unique driver name. - ObjectMeta - - // Features contains a list of features supported by the driver. - // New features may be added over time and must be ignored - // by code that does not know about them. - Features []ResourceDriverFeature -} - -type ResourceDriverFeature struct { - // Name is one of the pre-defined names for a feature. - Name ResourceDriverFeatureName - // Parameters might provide additional information about how - // the driver supports the feature. Boolean features have - // no parameters, merely listing them indicates support. - Parameters runtime.RawExtension -} -``` - -### Complex sharing of ResourceClaim - -At the moment, the allocation result marks as a claim as either "shareable" by -an unlimited number of consumers or "not shareable". More complex scenarios -might be useful like "may be shared by a certain number of consumers", but so -far such use cases have not come up yet. If they do, the `AllocationResult` can -be extended with new fields as defined by a follow-up KEP. 
- -### Improving scheduling performance - -Some enhancements are possible which haven't been implemented yet because it is -unclear how important they would be in practice. All of the following ideas -could still be added later as they don't conflict with the underlying design, -either as part of this KEP or in follow-up KEPs. - -#### Optimize for network-attached resources - -When a network-attached resource is available on all nodes in a cluster, the -driver will never mark any nodes as unsuitable. If all claims for a pod fall -into that category, the scheduler a) does not need to wait for information and -b) does not need to publish "potential nodes". - -The `ResourceClass` could be extended with a `AvailableForNodes -*core.NodeSelector`. This can be a selector that matches all nodes or a -subset. Either way, if a potential node matches this selector, the scheduler -knows that claims using this class can be allocated and can do the optimization -outlined above. - -#### Moving blocking API calls into goroutines - -This [is being -discussed](https://github.com/kubernetes/kubernetes/issues/120502) and has been -[partially -implemented](https://github.com/kubernetes/kubernetes/pull/120963). That -implementation made the scheduler framework more complex, so [the -conclusion](https://kubernetes.slack.com/archives/C09TP78DV/p1696307377064469?thread_ts=1696246271.825109&cid=C09TP78DV) -was that using blocking calls is the lesser evil until user feedback indicates -that improvements are really needed. - -#### RPC calls instead of `PodSchedulingContext` - -The current design is not making it a hard requirement that admins change the -scheduler configuration to enable communication between scheduler and DRA -drivers. For scenarios where admins and vendors are willing to invest more -effort and doing so would provide performance benefits, a communication path -similar to scheduler extenders could be added. - -## Infrastructure Needed - -Initially, all development will happen inside the main Kubernetes -repository. The mock driver can be developed inside test/e2e/dra. For the -generic part of that driver, i.e. the code that other drivers can reuse, and -other common code a new staging repo `k8s.io/dynamic-resource-allocation` is -needed. +The flow of information between the scheduler and DRA drivers through the +PodSchedulingContext is complex. diff --git a/keps/sig-node/3063-dynamic-resource-allocation/components.png b/keps/sig-node/3063-dynamic-resource-allocation/components.png index 261133200f3..f64525d86bd 100644 Binary files a/keps/sig-node/3063-dynamic-resource-allocation/components.png and b/keps/sig-node/3063-dynamic-resource-allocation/components.png differ diff --git a/keps/sig-node/3063-dynamic-resource-allocation/components.puml b/keps/sig-node/3063-dynamic-resource-allocation/components.puml index badeda28b7a..c83c7649a80 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/components.puml +++ b/keps/sig-node/3063-dynamic-resource-allocation/components.puml @@ -11,7 +11,7 @@ component Kubernetes { component apiserver { file Pod file ResourceClaim - file PodScheduling + file PodSchedulingContext } component scheduler { component "resource plugin" as k8sresourceplugin @@ -29,13 +29,13 @@ controllermanager -[hidden]> kubelet drivercontroller -[hidden]> driverplugin Pod <. ResourceClaim: owned by\n(if created from template) -Pod <. PodScheduling: owned by +Pod <. 
PodSchedulingContext: owned by Pod -u-> k8sresourceclaimcontroller: read claim template\nfrom Pod spec -ResourceClaim <-u- k8sresourceclaimcontroller: create claim,\nclean up users +ResourceClaim <-u- k8sresourceclaimcontroller: create claim,\nclean up users,\ntrigger deallocation\n ResourceClaim <-u-> kubelet -k8sresourceplugin <-u-> PodScheduling +k8sresourceplugin <-u-> PodSchedulingContext Pod <--> scheduler ResourceClaim <--> k8sresourceplugin @@ -43,5 +43,5 @@ ResourceClaim <--> k8sresourceplugin ResourceClaim <-> drivercontroller pluginmanager <-> driverplugin resourcemanager <-> driverplugin -PodScheduling <-> drivercontroller +PodSchedulingContext <-> drivercontroller @enduml diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml index a7d62e682fb..0e557bd11a8 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml +++ b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml @@ -24,7 +24,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.30" +latest-milestone: "v1.31" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: @@ -37,6 +37,11 @@ feature-gates: - kube-controller-manager - kube-scheduler - kubelet + - name: DRAControlPlaneController + components: + - kube-apiserver + - kube-controller-manager + - kube-scheduler disable-supported: true metrics: diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kubelet.png b/keps/sig-node/3063-dynamic-resource-allocation/kubelet.png deleted file mode 100644 index a497a543336..00000000000 Binary files a/keps/sig-node/3063-dynamic-resource-allocation/kubelet.png and /dev/null differ diff --git a/keps/sig-node/4381-dra-structured-parameters/Makefile b/keps/sig-node/4381-dra-structured-parameters/Makefile new file mode 100644 index 00000000000..b489d179ddb --- /dev/null +++ b/keps/sig-node/4381-dra-structured-parameters/Makefile @@ -0,0 +1,35 @@ +# Copyright 2022 The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +IMAGES += components.png kubelet.png + +all: $(IMAGES) +clean: + rm -f $(IMAGES) + +# We use the http://plantuml.com/plantuml server to generate +# images. That way nothing needs to be installed besides Go. +DOC_PLANTUML_GO = $(shell go env GOPATH)/bin/plantuml-go + +%.png: %.puml $(DOC_PLANTUML_GO) + $(DOC_PLANTUML_GO) -format png $< + +%.svg: %.puml $(DOC_PLANTUML_GO) + $(DOC_PLANTUML_GO) -format svg $< + +# Builds the binary in GOPATH/bin. Changing into / first avoids +# modifying the project's go.mod file. 
+$(DOC_PLANTUML_GO): + cd / && go get github.com/acarlson99/plantuml-go diff --git a/keps/sig-node/4381-dra-structured-parameters/README.md b/keps/sig-node/4381-dra-structured-parameters/README.md index 22761292342..c754cd61572 100644 --- a/keps/sig-node/4381-dra-structured-parameters/README.md +++ b/keps/sig-node/4381-dra-structured-parameters/README.md @@ -59,7 +59,7 @@ should be approved by the remaining approvers and/or the owning SIG (or SIG Architecture for cross-cutting KEPs). --> -# [KEP-4381](https://github.com/kubernetes/enhancements/issues/4381): Structured Parameters for Dynamic Resource Allocation +# [KEP-4381](https://github.com/kubernetes/enhancements/issues/4381): Dynamic Resource Allocation with Structured Parameters - [Release Signoff Checklist](#release-signoff-checklist) @@ -68,19 +68,53 @@ SIG Architecture for cross-cutting KEPs). - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal) + - [User Stories](#user-stories) + - [Cluster add-on development](#cluster-add-on-development) + - [Cluster configuration](#cluster-configuration) + - [Partial GPU allocation](#partial-gpu-allocation) - [Publishing node resources](#publishing-node-resources) - [Using structured parameters](#using-structured-parameters) - [Communicating allocation to the DRA driver](#communicating-allocation-to-the-dra-driver) + - [Risks and Mitigations](#risks-and-mitigations) + - [Feature not used](#feature-not-used) + - [Compromised node](#compromised-node) + - [Compromised resource driver plugin](#compromised-resource-driver-plugin) + - [User permissions and quotas](#user-permissions-and-quotas) + - [Usability](#usability) - [Design Details](#design-details) - - [ResourceClass extension](#resourceclass-extension) - - [ResourceSlice](#resourceslice) - - [ResourceClaimParameters](#resourceclaimparameters) - - [ResourceClassParameters](#resourceclassparameters) - - [ResourceHandle extension](#resourcehandle-extension) - - [Implementation of structured models](#implementation-of-structured-models) - - [Scheduling + Allocation](#scheduling--allocation) - - [Deallocation](#deallocation) - - [Immediate allocation](#immediate-allocation) + - [Components](#components) + - [State and communication](#state-and-communication) + - [Custom parameters](#custom-parameters) + - [Sharing a single ResourceClaim](#sharing-a-single-resourceclaim) + - [Ephemeral vs. 
persistent ResourceClaims lifecycle](#ephemeral-vs-persistent-resourceclaims-lifecycle) + - [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims) + - [Handling non graceful node shutdowns](#handling-non-graceful-node-shutdowns) + - [API](#api) + - [resource.k8s.io](#resourcek8sio) + - [ResourceSlice](#resourceslice) + - [ResourceClass](#resourceclass) + - [ResourceClassParameters](#resourceclassparameters) + - [ResourceClaimParameters](#resourceclaimparameters) + - [Allocation result](#allocation-result) + - [ResourceClaimTemplate](#resourceclaimtemplate) + - [Object references](#object-references) + - [core](#core) + - [kube-controller-manager](#kube-controller-manager) + - [kube-scheduler](#kube-scheduler) + - [EventsToRegister](#eventstoregister) + - [PreEnqueue](#preenqueue) + - [Pre-filter](#pre-filter) + - [Filter](#filter) + - [Post-filter](#post-filter) + - [Reserve](#reserve) + - [PreBind](#prebind) + - [Unreserve](#unreserve) + - [kubelet](#kubelet) + - [Managing resources](#managing-resources) + - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin) + - [NodeListAndWatchResources](#nodelistandwatchresources) + - [NodePrepareResource](#nodeprepareresource) + - [NodeUnprepareResources](#nodeunprepareresources) - [Simulation with CA](#simulation-with-ca) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -105,6 +139,14 @@ SIG Architecture for cross-cutting KEPs). - [Alternatives](#alternatives) - [Publishing resource information in node status](#publishing-resource-information-in-node-status) - [Injecting vendor logic into CA](#injecting-vendor-logic-into-ca) + - [ResourceClaimTemplate](#resourceclaimtemplate-1) + - [Reusing volume support as-is](#reusing-volume-support-as-is) + - [Extend volume support](#extend-volume-support) + - [Extend Device Plugins](#extend-device-plugins) + - [Webhooks instead of ResourceClaim updates](#webhooks-instead-of-resourceclaim-updates) + - [ResourceDriver](#resourcedriver) + - [Complex sharing of ResourceClaim](#complex-sharing-of-resourceclaim) +- [Infrastructure Needed](#infrastructure-needed) ## Release Signoff Checklist @@ -151,35 +193,48 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Dynamic Resource Allocation (DRA) was added to Kubernetes as an alpha feature in -v1.26. It defines an alternative to the traditional device-plugin API for -requesting access to third-party resources. - -By design, DRA uses parameters for resources that are completely -opaque to core Kubernetes. They get interpreted by a DRA driver's controller -(for allocating claims) and a DRA driver's kubelet plugin (for configuring -resources on a node). During scheduling of a pod, the kube-scheduler and any DRA -driver controller(s) handling claims for the pod communicate back-and-forth through the -apiserver by updating a `PodSchedulingContext` object, ultimately leading to the -allocation of all pending claims and the pod being scheduled onto a node. - -This approach poses a problem for the [Cluster -Autoscaler](https://github.com/kubernetes/autoscaler) (CA) or for any higher -level controller that needs to make decisions for a group of pods (e.g. a job -scheduler). It cannot simulate the effect of allocating or deallocating -claims over time. Only the third-party DRA drivers have the information -available to do this. 
- -"Structured parameters" is an extension to DRA that addresses this problem by -making claim parameters less opaque. Instead of handling the semantics of all -claim parameters themselves, drivers now manage resources and describe them -using a specific "structured model" pre-defined by Kubernetes. This allows -components aware of this "structured model" to make decisions about these -resources without outsourcing them to some third-party controller. For example, -the scheduler is now able to allocate claims rapidly, without back-and-forth -communication with DRA drivers. - -At a high-level, this extension takes the following form: +This KEP originally defined an extension of the ["classic" DRA #3063 +KEP](../3063-dynamic-resource-allocation/README.md). Now the roles are +reversed: this KEP defines the base functionality and #3063 adds an optional +extension. + +Users are increasingly deploying Kubernetes as management solution for new +workloads (batch processing) and in new environments (edge computing). Such +workloads no longer need just RAM and CPU, but also access to specialized +hardware. With upcoming enhancements of data center interconnects, accelerators +can be installed outside of specific nodes and be connected to nodes +dynamically as needed. + +This KEP introduces a new API for describing which of these new resources +a pod needs. The API supports: + +- Network-attached resources. The existing [device plugin API](https://github.com/kubernetes/design-proposals-archive/blob/main/resource-management/device-plugin.md) + is limited to hardware on a node. However, further work is still + needed to actually use the new API with those. +- Sharing of a resource allocation between multiple containers or pods. + The device manager API currently cannot share resources at all. It + could be extended to share resources between containers in a single pod, + but supporting sharing between pods would need a completely new + API similar to the one in this KEP. +- Using a resource that is expensive to initialize multiple times + in different pods. This is not possible at the moment. +- Custom parameters that describe resource requirements and initialization. + Parameters are not limited to a single, linear quantity that can be counted. + With the current Pod API, annotations have to be used to capture such + parameters and then hacks are needed to access them from a CSI driver or + device plugin. + +Support for new hardware will be provided by hardware vendor add-ons. Those add-ons +are responsible for reporting available resources in a format defined and +understood by Kubernetes and for configuring hardware before it is used. Kubernetes +handles the allocation of those resources as part of pod scheduling. + +This KEP does not replace other means of requesting traditional resources +(RAM/CPU, volumes, extended resources). The scheduler will serve as coordinator +between the add-ons which own resources (CSI driver, resource driver) and the +resources owned and assigned by the scheduler (RAM/CPU, extended resources). + +At a high-level, DRA with structured parameters takes the following form: * DRA drivers publish their available resources in the form of a `ResourceSlice` object on a node-by-node basis according to one or more of the @@ -232,27 +287,242 @@ demonstrate the interest in a KEP within the wider Kubernetes community. 
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports --> +Originally, Kubernetes and its scheduler only tracked CPU and RAM as +resources for containers. Later, support for storage and discrete, +countable per-node extended resources was added. The kubelet device plugin +interface then made such local resources available to containers. But +for many newer devices, this approach and the Kubernetes API for +requesting these custom resources is too limited. This KEP may eventually +address limitations of the current approach for the following use cases: + +- *Device initialization*: When starting a workload that uses + an accelerator like an FPGA, I’d like to have the accelerator + reconfigured or reprogrammed without having to deploy my application + with full hardware access and/or root privileges. Running applications + with less privileges is better for overall security of the cluster. + + *Limitation*: Currently, it’s impossible to specify the desired + device properties that are required for reconfiguring devices. + For the FPGA example, a file containing the desired configuration + of the FPGA has to be referenced. + +- *Device cleanup*: When my workload is finished, I would like to have + a mechanism for cleanup of the device, that will ensure that device + does not contain traces/parameters/data from previous workloads and + appropriate power state/shutdown. For example, an FPGA might have + to be reset because its configuration for the workload was + confidential. + + *Limitation*: Post-stop actions are not supported. + +- *Partial allocation*: When workloads use only a portion of the device + capabilities, devices can be partitioned (e.g. using Nvidia MIG or SR-IOV) to + better match workload needs. Sharing the devices in this way can greatly + increase HW utilization / reduce costs. + +- *Limitation*: currently there's no API to request partial device + allocation. With the current device plugin API, devices need to be + pre-partitioned and advertised in the same way a full / static devices + are. User must then select a pre-partitioned device instead of having one + created for them on the fly based on their particular resource + constraints. Without the ability to create devices dynamically (i.e. at the + time they are requested) the set of pre-defined devices must be carefully + tuned to ensure that device resources do not go unused because some of the + pre-partioned devices are in low-demand. It also puts the burden on the user + to pick a particular device type, rather than declaring the resource + constraints more abstractly. + +- *Optional allocation*: When deploying a workload I’d like to specify + soft(optional) device requirements. If a device exists and it’s + allocatable it will be allocated. If not - the workload will be run on + a node without a device. GPU and crypto-offload engines are + examples of this kind of device. If they’re not available, workloads + can still run by falling back to using only the CPU for the same + task. + + *Limitation*: Optional allocation is supported neither by the device + plugins nor by current Pod resource declaration. + +- *Support Over the Fabric devices*: When deploying a container, I’d + like to utilize devices available over the Fabric (network, special + links, etc). + + *Limitation*: The device plugin API is designed for node-local resources that + get discovered by a plugin running on the node. 
Projects like + [Akri](https://www.cncf.io/projects/akri/) have to work around that by + reporting the same network-attached resource on all nodes that it could + get attached to and then updating resource availability on all of those + nodes when resources get used. + +Several other limitations are addressed by +[CDI](https://github.com/container-orchestrated-devices/container-device-interface/), +a container runtime extension that this KEP is using to expose resources +inside a container. + ### Goals - Enable cluster autoscaling when pods use resource claims, with correct decisions and changing the cluster size by more than one node at a time. -- Support node-local resources. Adding or removing nodes has no effect - on network-attached resources and therefore CA does not need to (and cannot) - simulate them. +- Support node-local resources -- Allow DRA driver developers to provide a user experience that is similar to - the one possible without structured parameters. Ideally, users should not notice +- Support claim parameters that are specified in a vendor CRD as + an alternative to letting users directly specify parameters with + the in-tree type. This provides a user experience that is similar to + what has been possible since Kubernetes 1.26. Ideally, users should not notice at all that a driver is using structured parameters under the hood. ### Non-Goals -- Scheduling performance is expected to become better compared to using the - PodSchedulingContext. However, this is not the reason for this KEP. - +* Replace the device plugin API. For resources that fit into its model + of a single, linear quantity it is a good solution. Other resources + should use dynamic resource allocation. Both are expected to co-exist, with vendors + choosing the API that better suits their needs on a case-by-case + basis. Because the new API is going to be implemented independently of the + existing device plugin support, there's little risk of breaking stable APIs. + +* Provide an abstraction layer for resource requests, i.e., something like a + “I want some kind of GPU”. Users will need to know about specific + resource drivers and which parameters they support. Portability of + workloads could be added on top of this proposal by introducing the + selection of a resource implementation through labels and + standardizing those labels and the associated parameters. The + [Resource Class + Proposal](https://docs.google.com/document/d/1qKiIVs9AMh2Ua5thhtvWqOqW0MSle_RV3lfriO1Aj6U/edit#heading=h.jzfmfdca34kj) + included such an approach. + +* Support network-attached resources ## Proposal +### User Stories + +#### Cluster add-on development + +As a hardware vendor, I want to make my hardware available also to applications +that run in a container under Kubernetes. I want to make it easy for a cluster +administrator to configure a cluster where some nodes have this hardware. + +I develop two components, one that runs as part of the Kubernetes control plane +and one that runs on each node, and package those inside container images. YAML +files describe how to deploy my software on a Kubernetes cluster that supports +dynamic resource allocation. + +Documentation for administrators explains how the nodes need to be set +up. Documentation for users explains which parameters control the behavior of +my hardware and how to use it inside a container. + +#### Cluster configuration + +As a cluster administrator, I want to make GPUs from vendor ACME available to users +of that cluster. 
I prepare the nodes and deploy the vendor's components with +`kubectl create`. + +I create a ResourceClass for the hardware with parameters that only I as the +administrator am allowed to choose, like for example running a command with +root privileges that does some cluster-specific initialization for each allocation: +``` +apiVersion: gpu.example.com/v1 +kind: GPUInit +metadata: + name: acme-gpu-init +# DANGER! This option must not be accepted for +# user-supplied parameters. A real driver might +# not even allow it for admins. This is just +# an example to show the conceptual difference +# between ResourceClass and ResourceClaim +# parameters. +initCommand: +- /usr/local/bin/acme-gpu-init +- --cluster +- my-cluster +--- +apiVersion: core.k8s.io/v1alpha2 +kind: ResourceClass +metadata: + name: acme-gpu +parametersRef: + apiGroup: gpu.example.com + kind: GPUInit + name: acme-gpu-init +``` + +#### Partial GPU allocation + +As a user, I want to use a GPU as accelerator, but don't need exclusive access +to that GPU. Running my workload with just 2Gb of memory is sufficient. This is +supported by the ACME GPU hardware. I know that the administrator has created +an "acme-gpu" ResourceClass. + +For a simple trial, I create a Pod directly where two containers share the same subset +of the GPU: +``` +apiVersion: gpu.example.com/v1 +kind: GPURequirements +metadata: + name: device-consumer-gpu-parameters +memory: "2Gi" +--- +apiVersion: resource.k8s.io/v1alpha2 +kind: ResourceClaimTemplate +metadata: + name: device-consumer-gpu-template +spec: + metadata: + # Additional annotations or labels for the + # ResourceClaim could be specified here. + spec: + resourceClassName: "acme-gpu" + parametersRef: + apiGroup: gpu.example.com + kind: GPURequirements + name: device-consumer-gpu-parameters +--- +apiVersion: v1 +kind: Pod +metadata: + name: device-consumer +spec: + resourceClaims: + - name: "gpu" # this name gets referenced below under "claims" + template: + resourceClaimTemplateName: device-consumer-gpu-template + containers: + - name: workload + image: my-app + command: ["/bin/program"] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + claims: + - "gpu" + - name: monitor + image: my-app + command: ["/bin/other-program"] + resources: + requests: + memory: "32Mi" + cpu: "25m" + limits: + memory: "64Mi" + cpu: "50m" + claims: + - "gpu" +``` + +This request triggers resource allocation on a node that has a GPU device with +2Gi of memory available and then the Pod runs on that node. The remaining +capacity of the GPU may be usable for other pods, with constrains like alignment +to segment sizes ensured by the resource driver. +The lifecycle of the resource +allocation is tied to the lifecycle of the Pod. + +In production, a similar PodTemplateSpec in a Deployment will be used. + ### Publishing node resources The resources available on a node need to be published to the API server. In @@ -549,24 +819,374 @@ namedResources: - gpu-1 ``` -## Design Details +### Risks and Mitigations -### ResourceClass extension + + +#### Feature not used + +In a cluster where the feature is not used (no resource driver installed, no +pods using dynamic resource allocation) the impact is minimal, both for +performance and security. The scheduler plugin will +return quickly without doing any work for pods. 
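+
+A minimal sketch of that early exit, assuming a small helper inside the
+scheduler plugin (the function name is illustrative, not the actual plugin
+code):
+
+```go
+package example
+
+import (
+	corev1 "k8s.io/api/core/v1"
+)
+
+// podUsesClaims is the cheap check that lets the scheduler plugin skip pods
+// which do not reference any ResourceClaims: for such pods the plugin returns
+// immediately and scheduling proceeds as if the feature did not exist.
+func podUsesClaims(pod *corev1.Pod) bool {
+	return len(pod.Spec.ResourceClaims) > 0
+}
+```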
+ +#### Compromised node + +Kubelet is intentionally limited to read-only access for ResourceClass and ResourceClaim +to prevent that a +compromised kubelet interferes with scheduling of pending pods, for example +by updating status information normally set by the scheduler. +Faking such information could be used for a denial-of-service +attack against pods using those ResourceClaims, for example by overwriting +their allocation result with a node selector that matches no node. A +denial-of-service attack against the cluster and other pods is harder, but +still possible. For example, frequently updating ResourceSlice objects could +cause new scheduling attempts for pending pods. + +Another potential attack goal is to get pods with sensitive workloads to run on +a compromised node. For pods that don't use special resources nothing changes +in that regard. Such an attack is possible for pods with extended resources +because kubelet is in control of which capacity it reports for those: it could +publish much higher values than the device plugin reported and thus attract +pods to the node that normally would run elsewhere. With dynamic resource +allocation, such an attack is still possible, but the attack code would have to +be different for each resource driver because all of them will use structured +parameters differently for reporting resource availability. + +#### Compromised resource driver plugin + +This is the result of an attack against the resource driver, either from a +container which uses a resource exposed by the driver, a compromised kubelet +which interacts with the plugin, or due to resource driver running on a node +with a compromised root account. + +The resource driver plugin only needs read access to objects described in this +KEP, so compromising it does not interfere with dynamic resource allocation for +other drivers. + +A resource driver may need root access on the node to manage +hardware. Attacking the driver therefore may lead to root privilege +escalation. Ideally, driver authors should try to avoid depending on root +permissions and instead use capabilities or special permissions for the kernel +APIs that they depend on. + +A resource driver may also need privileged access to remote services to manage +network-attached devices. Resource driver vendors and cluster administrators +have to consider what the effect of a compromise could be for that and how such +privileges could get revoked. + +#### User permissions and quotas + +Similar to generic ephemeral inline volumes, the [ephemeral resource use +case](#ephemeral-vs-persistent-resourceclaims-lifecycle) gets covered by +creating ResourceClaims on behalf of the user automatically through +kube-controller-manager. The implication is that RBAC rules that are meant to +prevent creating ResourceClaims for certain users can be circumvented, at least +for ephemeral resources. Administrators need to be aware of this caveat when +designing user restrictions. + +A quota system that is based on the information in the structured parameter model +could be implemented in Kubernetes. When a user has exhausted their +quota, the scheduler then refuses to allocate further ResourceClaims. + +#### Usability + +Aside from security implications, usability and usefulness of dynamic resource +allocation also may turn out to be insufficient. Some risks are: + +- Slower pod scheduling due to more complex decision making. 
+ +- Additional complexity when describing pod requirements because + separate objects must be created for the parameters. + +All of these risks will have to be evaluated by gathering feedback from users +and resource driver developers. + +## Design Details + +### Components + +![components](./components.png) + +Several components must be implemented or modified in Kubernetes: +- The new API must be added to kube-apiserver. +- A new controller in kube-controller-manager which creates + ResourceClaims from ResourceClaimTemplates, similar to + https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral. + It also removes the reservation entry for a consumer in `claim.status.reservedFor`, + the field that tracks who is allowed to use a claim, when that user no longer exists. + It clears the allocation and thus makes the underlying resources available again + when a ResourceClaim is no longer reserved. +- A kube-scheduler plugin must detect Pods which reference a + ResourceClaim (directly or through a template) and ensure that the + resource is allocated before the Pod gets scheduled, similar to + https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go +- Kubelet must be extended to retrieve information from ResourceClaims + and to call a resource kubelet plugin. That plugin returns CDI device ID(s) + which then must be passed to the container runtime. + +A resource driver can have the following components: +- *CRD controller* (optional): a central component which translates parameters + defined with a vendor CRD into in-tree parameter types. +- *kubelet plugin* (required): a component which cooperates with kubelet to + publish resource information and to prepare the usage of the resource on a node. + +When a resource driver doesn't use its own CRD for parameters, the CRD controller +is not needed and a ResourceClaim references ResourceClaimParameters directly. + +A [utility library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation) for resource drivers was developed. +It does not have to be used by drivers, therefore it is not described further +in this KEP. + +### State and communication + +A ResourceClaim object defines what kind of resource is needed and what +the parameters for it are. It is owned by users and namespaced. Additional +parameters are provided by a cluster admin in ResourceClass objects. + +The ResourceClaim spec is immutable. The ResourceClaim +status is reserved for system usage and holds the current state of the +resource. The status must not get lost, which in the past was not ruled +out. For example, status could have been stored in a separate etcd instance +with lower reliability. To recover after a loss, status was meant to be recoverable. +A [recent KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/2527-clarify-status-observations-vs-rbac) +clarified that status will always be stored reliably and can be used as +proposed in this KEP. + +Handling state and communication through objects has two advantages: +- Changes for a resource are (almost) atomic, which avoids race conditions. + One small exception is that changing finalizers and the status have to + be done in separate operations. +- The only requirement for deployments is that the components can connect to + the API server. 
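+
+As an illustration of the exception noted in the first bullet above, a
+component which allocates a claim has to issue two separate API calls: one to
+add the finalizer, one to store the allocation in the status. The following is
+only a hedged sketch, assuming a client-go clientset generated for
+`resource.k8s.io/v1alpha2`; `allocateClaim` and its error handling are
+illustrative, not the actual scheduler code:
+
+```go
+package example
+
+import (
+	"context"
+
+	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// allocateClaim adds a finalizer with a regular Update and then stores the
+// allocation result with UpdateStatus. Because these are two operations,
+// another actor may modify the claim in between; a conflict error means the
+// claim has to be re-read and the update retried.
+func allocateClaim(ctx context.Context, cs kubernetes.Interface, claim *resourcev1alpha2.ResourceClaim, result *resourcev1alpha2.AllocationResult, finalizer string) error {
+	claims := cs.ResourceV1alpha2().ResourceClaims(claim.Namespace)
+
+	// Step 1: protect the claim against deletion while allocation is in flight.
+	claim.Finalizers = append(claim.Finalizers, finalizer)
+	claim, err := claims.Update(ctx, claim, metav1.UpdateOptions{})
+	if err != nil {
+		return err
+	}
+
+	// Step 2: record the allocation result in the status.
+	claim.Status.Allocation = result
+	_, err = claims.UpdateStatus(ctx, claim, metav1.UpdateOptions{})
+	return err
+}
+```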
+
+The entire state of a resource can be determined by looking at its
+ResourceClaim (see [API below](#api) for details), for example:
+
+- It is **allocated** if and only if `claim.status.allocation` is non-nil and
+  points to the `AllocationResult`, i.e. the struct where information about
+  a successful allocation is stored.
+
+- It is in use if and only if `claim.status.reservedFor` contains one or
+  more consumers. It does not matter whether those users, usually pods, are
+  currently running because that could change at any time.
+
+- A resource is no longer needed when `claim.deletionTimestamp` is set. It must not
+  be deallocated yet when it is still in use.
+
+Some of the race conditions that need to be handled are:
+
+- A ResourceClaim gets created and deleted again while the scheduler
+  is allocating it. Before it actually starts doing anything, the
+  scheduler adds a finalizer. Either adding the finalizer or removing the
+  ResourceClaim wins. If the scheduler wins, it continues with allocation
+  and can either complete or abort the operation when it notices the non-nil
+  DeletionTimestamp. Otherwise, allocation gets aborted immediately.
+
+  What this avoids is the situation where an allocation succeeds without having
+  an object where the result can be stored. The driver can also be killed at
+  any time: when it restarts, the finalizer indicates that allocation may be in
+  progress and has to be completed or aborted.
+
+  However, users may still force-delete a ResourceClaim, or the entire
+  cluster might get deleted. Driver implementations must store enough
+  information elsewhere to detect when some allocated resource is no
+  longer needed to recover from such scenarios.
+
+- A ResourceClaim gets deleted and recreated while the resource driver is
+  adding the finalizer. The driver can update the object to add the finalizer
+  and then will get a conflict error, which informs the driver that it must
+  work on a new instance of the claim. In general, patching a ResourceClaim
+  is only acceptable when it does not lead to race conditions. To detect
+  delete+recreate, the UID must be added as a precondition for a patch.
+  To also detect other potentially conflicting changes, the ResourceVersion
+  needs to be checked, too.
+
+- In a cluster with multiple scheduler instances, two pods might get
+  scheduled concurrently by different schedulers. When they reference
+  the same ResourceClaim which may only get used by one pod at a time,
+  only one pod can be scheduled.
+
+  Both schedulers try to add their pod to the `claim.status.reservedFor` field, but only the
+  update that reaches the API server first gets stored. The other one fails
+  with a conflict error and the scheduler which issued it knows that it must
+  put the pod back into the queue, waiting for the ResourceClaim to become
+  usable again.
+
+- Two pods get created which both reference the same unallocated claim with
+  delayed allocation. A single scheduler can detect this special situation
+  and then do allocation only for one of the two pods. When the pods
+  are handled by different schedulers, only one will succeed with writing
+  back the `claim.status.allocation`.
+
+- Scheduling a pod and allocating resources for it has been attempted, but one
+  claim needs to be reallocated to fit the overall resource requirements. A second
+  pod gets created which references the same claim that is in the process of
+  being deallocated. Because that is visible in the claim status, scheduling
+  of the second pod cannot proceed.
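+
+Expressed as code, the state checks above boil down to a few field tests on
+the claim object. The following is a minimal sketch using the
+`resource.k8s.io/v1alpha2` Go types; the helper names are made up for
+illustration:
+
+```go
+package example
+
+import (
+	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
+)
+
+// allocated: claim.status.allocation is non-nil.
+func allocated(claim *resourcev1alpha2.ResourceClaim) bool {
+	return claim.Status.Allocation != nil
+}
+
+// inUse: claim.status.reservedFor lists at least one consumer.
+func inUse(claim *resourcev1alpha2.ResourceClaim) bool {
+	return len(claim.Status.ReservedFor) > 0
+}
+
+// noLongerNeeded: the deletion timestamp is set. Deallocation still has to
+// wait until the claim is no longer in use.
+func noLongerNeeded(claim *resourcev1alpha2.ResourceClaim) bool {
+	return claim.DeletionTimestamp != nil
+}
+```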
+
+### Custom parameters
+
+To support arbitrarily complex parameters, both ResourceClass and ResourceClaim
+contain one field which references a separate object. The reference contains
+API group, kind and name and thus is sufficient for generic clients to
+retrieve the parameters. For ResourceClass, that object must be
+cluster-scoped. For ResourceClaim, it must be in the same namespace as the
+ResourceClaim and thus the Pod. Which kind of objects a resource driver accepts as parameters depends on
+the driver.
+
+This approach was chosen because then validation of the parameters can be done
+with a CRD and that validation will work regardless of where the parameters
+are needed.
+
+It is the responsibility of the resource driver to convert these CRD parameters
+into in-tree ResourceClaimParameters and ResourceClassParameters. Kubernetes
+finds those generated parameters based on their `generatedFrom` back-reference.
+
+Parameters may get deleted before the ResourceClaim or ResourceClass that
+references them. In that case, a pending resource cannot be allocated until the
+parameters get recreated. An allocated resource must remain usable and
+deallocating it must be possible. To support this, resource drivers must copy
+all relevant information:
+- For usage, the `claim.status.allocation.resourceHandle` can hold some copied information
+  because the ResourceClaim and thus this field must exist.
+- For deallocation, drivers should use some other location to handle
+  cases where a user force-deletes a ResourceClaim or the entire
+  cluster gets removed.
+
+### Sharing a single ResourceClaim
+
+Pods reference resource claims in a new `pod.spec.resourceClaims` list. Each
+resource in that list can then be made available to one or more containers in
+that Pod. Depending on the capabilities defined in the
+`claim.status.allocation` by the driver, a ResourceClaim can be used exclusively
+by one pod at a time or an unlimited number of pods. Support for additional
+constraints (maximum number of pods, maximum number of nodes) could be
+added once there are use cases for those.
+
+Consumers of a ResourceClaim are listed in `claim.status.reservedFor`. They
+don't need to be Pods. At the moment, Kubernetes itself only handles Pods and
+allocation for Pods.
+
+### Ephemeral vs. persistent ResourceClaims lifecycle
+
+A persistent ResourceClaim has a lifecycle that is independent of any particular
+pod. It gets created and deleted by the user. This is useful for resources
+which are expensive to configure and that can be used multiple times by pods,
+either at the same time or one after the other. Such persistent ResourceClaims
+get referenced in the pod spec by name. When a PodTemplateSpec in an app
+controller spec references a ResourceClaim by name, all pods created by that
+controller also use that name and thus share the resources allocated for that
+ResourceClaim.
+
+But often, each Pod is meant to have exclusive access to its own ResourceClaim
+instance instead. To support such ephemeral resources without having to modify
+all controllers that create Pods, an entry in the new PodSpec.ResourceClaims
+list can also be a reference to a ResourceClaimTemplate. When a Pod gets created, such a
+template will be used to create a normal ResourceClaim with the Pod as owner
+with an
+[OwnerReference](https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#OwnerReference),
+and then the normal allocation of the resource takes place. 
Once the pod got +deleted, the Kubernetes garbage collector will also delete the +ResourceClaim. + +This mechanism documents ownership and serves as a fallback for scenarios where +dynamic resource allocation gets disabled in a cluster (for example, during a +downgrade). But it alone is not sufficient: for example, the job controller +does not delete pods immediately when they have completed, which would keep +their resources allocated. Therefore the resource controller watches for pods +that have completed and releases their resource allocations. + +The difference between persistent and ephemeral resources for kube-scheduler +and kubelet is that the name of the ResourceClaim needs to be determined +differently: the name of an ephemeral ResourceClaim is recorded in the Pod status. +Ownership must be checked to detect accidental conflicts with +persistent ResourceClaims or previous incarnations of the same ephemeral +resource. + +### Scheduled pods with unallocated or unreserved claims + +There are several scenarios where a Pod might be scheduled (= `pod.spec.nodeName` +set) while the claims that it depends on are not allocated or not reserved for +it: + +* A user might manually create a pod with `pod.spec.nodeName` already set. +* Some special cluster might use its own scheduler and schedule pods without + using kube-scheduler. +* The feature might have been disabled in kube-scheduler while scheduling + a pod with claims. + +The kubelet is refusing to run such pods and reports the situation through +an event (see below). It's an error scenario that should better be avoided. + +Users should avoid this situation by not scheduling pods manually. If they need +it for some reason, they can use a node selector which matches only the desired +node and then let kube-scheduler do the normal scheduling. + +Custom schedulers should emulate the behavior of kube-scheduler and ensure that +claims are allocated and reserved before setting `pod.spec.nodeName`. + +The last scenario might occur during a downgrade or because of an +administrator's mistake. Administrators can fix this by deleting such pods. + +### Handling non graceful node shutdowns + +When a node is shut down unexpectedly and is tainted with an `out-of-service` +taint with NoExecute effect as explained in the [Non graceful node shutdown KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown), +all running pods on the node will be deleted by the GC controller and the +resources used by the pods will be deallocated. However, they will not be +un-prepared as the node is down and Kubelet is not running on it. + +Resource drivers should be able to handle this situation correctly and +should not expect `UnprepareNodeResources` to be always called. +If resources are unprepared when `Deallocate` is called, `Deallocate` +might need to perform additional actions to correctly deallocate +resources. + +### API + +``` +<<[UNRESOLVED @pohly @johnbelamaric]>> +Before 1.31, we need to re-evaluate the API, including, but not limited to: +- Do we really need a separate ResourceClaim? +- Does "Device" instead of "Resource" make the API easier to understand? +- Avoid separate parameter objects if and when possible. +<<[/UNRESOLVED]>> ``` -### ResourceSlice +The PodSpec gets extended. To minimize the changes in core/v1, all new types +get defined in a new resource group. This makes it possible to revise those +more experimental parts of the API in the future. 
The new fields in the +PodSpec are gated by the DynamicResourceAllocation feature gate and can only be +set when it is enabled. Initially, they are declared as alpha. Even though they +are alpha, changes to their schema are discouraged and would have to be done by +using new field names. + +ResourceClaim, ResourceClass and ResourceClaimTemplate are new built-in types +in `resource.k8s.io/v1alpha2`. This alpha group must be explicitly enabled in +the apiserver's runtime configuration. Using builtin types was chosen instead +of using CRDs because core Kubernetes components must interact with the new +objects and installation of CRDs as part of cluster creation is an unsolved +problem. + +Secrets are not part of this API: if a resource driver needs secrets, for +example to access its own backplane, then it can define custom parameters for +those secrets and retrieve them directly from the apiserver. This works because +drivers are expected to be written for Kubernetes. + +#### resource.k8s.io + +##### ResourceSlice For each node, one or more ResourceSlice objects get created. The kubelet publishes them with the node as the owner, so they get deleted when a node goes @@ -693,8 +1313,213 @@ type NamedResourcesStringSlice struct { All names must be DNS sub-domains. This excludes the "/" character, therefore combining different names with that separator to form an ID is valid. +In the Go types above, all structs starting with `NamedResources` are part of +that structured model. Code generators (more specifically, the applyconfig +generator) assume that all Go types of an API are defined in the same Go +package. If it wasn't for that, defining those structs in their own package +without the `NamedResources` prefix would be possible and make the Go code +cleaner without affecting the Kubernetes API. + +##### ResourceClass + +```go +// ResourceClass is used by administrators to influence how resources +// are allocated. +// +// This is an alpha type and requires enabling the DynamicResourceAllocation +// feature gate. +type ResourceClass struct { + metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta + + // ParametersRef references an arbitrary separate object that may hold + // parameters that will be used by the driver when allocating a + // resource that uses this class. A dynamic resource driver can + // distinguish between parameters stored here and and those stored in + // ResourceClaimSpec. + // +optional + ParametersRef *ResourceClassParametersReference + + // Only nodes matching the selector will be considered by the scheduler + // when trying to find a Node that fits a Pod when that Pod uses + // a ResourceClaim that has not been allocated yet. + // + // Setting this field is optional. If null, all nodes are candidates. + // +optional + SuitableNodes *core.NodeSelector -### ResourceClaimParameters + // DefaultClaimParametersRef is an optional reference to an object that holds parameters + // used as default when allocating a claim which references this class. This field is utilized + // only when the ParametersRef of the claim is nil. If both ParametersRef + // and DefaultClaimParametersRef are nil, the claim requests no resources and thus + // can always be allocated. + // +optional + DefaultClaimParametersRef *ResourceClassParametersReference +} +``` + +##### ResourceClassParameters + +```go +// ResourceClassParameters defines resource requests for a ResourceClass in an +// in-tree format understood by Kubernetes. 
+type ResourceClassParameters struct { + metav1.TypeMeta + // Standard object metadata + metav1.ObjectMeta + + // If this object was created from some other resource, then this links + // back to that resource. This field is used to find the in-tree representation + // of the class parameters when the parameter reference of the class refers + // to some unknown type. + GeneratedFrom *ResourceClassParametersReference + + // VendorParameters are arbitrary setup parameters for all claims using + // this class. They are ignored while allocating the claim. There must + // not be more than one entry per driver. + VendorParameters []VendorParameters + + // Filters describes additional contraints that must be met when using the class. + Filters []ResourceFilter +} + +// ResourceFilter is a filter for resources from one particular driver. +type ResourceFilter struct { + // DriverName is the name used by the DRA driver kubelet plugin. + DriverName string + + ResourceFilterModel +} + +// ResourceFilterModel must have one and only one field set. +type ResourceFilterModel struct { + // NamedResources describes a resource filter using the named resources model. + NamedResources *NamedResourcesFilter +} + +// NamedResourcesFilter is used in ResourceFilterModel. +type NamedResourcesFilter struct { + // Selector is a CEL expression which must evaluate to true if a + // resource instance is suitable. The language is as defined in + // https://kubernetes.io/docs/reference/using-api/cel/ + // + // In addition, for each type in NamedResourcesAttributeValue there is a map that + // resolves to the corresponding value of the instance under evaluation. + // For example: + // + // attributes.quantity["a"].isGreaterThan(quantity("0")) && + // attributes.stringslice["b"].isSorted() + Selector string +} +``` + +###### ResourceClaim + + +```go +// ResourceClaim describes which resources are needed by a resource consumer. +// Its status tracks whether the resource has been allocated and what the +// resulting attributes are. +// +// This is an alpha type and requires enabling the DynamicResourceAllocation +// feature gate. +type ResourceClaim struct { + metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta + + // Spec describes the desired attributes of a resource that then needs + // to be allocated. It can only be set once when creating the + // ResourceClaim. + Spec ResourceClaimSpec + + // Status describes whether the resource is available and with which + // attributes. + // +optional + Status ResourceClaimStatus +} + +// Finalizer is the finalizer that gets set for claims +// which were allocated through a builtin controller. +const Finalizer = "dra.k8s.io/delete-protection" +``` + +The scheduler must set a finalizer in a ResourceClaim before it adds +an allocation. This ensures that an allocated, reserved claim cannot +be removed accidentally by a user. + +If storing the status fails, the scheduler will retry on the next +scheduling attempt. If the ResourceClaim gets deleted in the meantime, +the scheduler will not try to schedule again. This situation is handled +by the kube-controller-manager by removing the finalizer. + +Force-deleting a ResourceClaim by clearing its finalizers (something that users +should never do without being aware of the consequences) cannot be +prevented. Deleting the entire cluster also leaves resources allocated outside +of the cluster in an allocated state. + +```go +// ResourceClaimSpec defines how a resource is to be allocated. 
+type ResourceClaimSpec struct { + // ResourceClassName references the driver and additional parameters + // via the name of a ResourceClass that was created as part of the + // driver deployment. + // +optional + ResourceClassName string + + // ParametersRef references a separate object with arbitrary parameters + // that will be used by the driver when allocating a resource for the + // claim. + // + // The object must be in the same namespace as the ResourceClaim. + // +optional + ParametersRef *ResourceClaimParametersReference +} +``` + +The `ResourceClassName` field may be left empty. The parameters are sufficient +to determine which driver needs to provide resources. This leads to some corner cases: +- Empty `ResourceClassName` and nil `ParametersRef`: this is a claim which requests + no resources. Such a claim can always be allocated with an empty result. Allowing + this simplifies code generators which dynamically fill in the resource requests + because they are allowed to generate an empty claim. +- Non-empty `ResourceClassName`, nil `ParametersRef`, nil + `ResourceClass.DefaultClaimParametersRef`: this is handled the same way, the + only difference is that the cluster admin has decided that such claims need + no resources by not providing default parameters. + +There is no default ResourceClass. If that is desirable, then it can be +implemented with a mutating and/or admission webhook. + +``` +// ResourceClaimStatus tracks whether the resource has been allocated and what +// the resulting attributes are. +type ResourceClaimStatus struct { + // Allocation is set by the resource driver once a resource or set of + // resources has been allocated successfully. If this is not specified, the + // resources have not been allocated yet. + // +optional + Allocation *AllocationResult + + // ReservedFor indicates which entities are currently allowed to use + // the claim. A Pod which references a ResourceClaim which is not + // reserved for that Pod will not be started. + // + // There can be at most 32 such reservations. This may get increased in + // the future, but not reduced. + // +optional + ReservedFor []ResourceClaimConsumerReference +} + +// ReservedForMaxSize is the maximum number of entries in +// claim.status.reservedFor. +const ResourceClaimReservedForMaxSize = 32 +``` + +##### ResourceClaimParameters ```go // ResourceClaimParameters defines resource requests for a ResourceClaim in an @@ -757,7 +1582,7 @@ type NamedResourcesRequest struct { // resource instance is suitable. The language is as defined in // https://kubernetes.io/docs/reference/using-api/cel/ // - // In addition, for each type NamedResourcesin AttributeValue there is a map that + // In addition, for each type in NamedResourcesAttributeValue there is a map that // resolves to the corresponding value of the instance under evaluation. // For example: // @@ -767,71 +1592,49 @@ type NamedResourcesRequest struct { } ``` -### ResourceClassParameters - -```go -// ResourceClassParameters defines resource requests for a ResourceClass in an -// in-tree format understood by Kubernetes. -type ResourceClassParameters struct { - metav1.TypeMeta - // Standard object metadata - metav1.ObjectMeta - - // If this object was created from some other resource, then this links - // back to that resource. This field is used to find the in-tree representation - // of the class parameters when the parameter reference of the class refers - // to some unknown type. 
- GeneratedFrom *ResourceClassParametersReference - - // VendorParameters are arbitrary setup parameters for all claims using - // this class. They are ignored while allocating the claim. There must - // not be more than one entry per driver. - VendorParameters []VendorParameters - - // Filters describes additional contraints that must be met when using the class. - Filters []ResourceFilter -} +NamedResourcesFilter and NamedResourcesRequest currently have the same +content. Despite that, they are defined as separate structs because that might +change in the future. -// ResourceFilter is a filter for resources from one particular driver. -type ResourceFilter struct { - // DriverName is the name used by the DRA driver kubelet plugin. - DriverName string - ResourceFilterModel -} +##### Allocation result -// ResourceFilterModel must have one and only one field set. -type ResourceFilterModel struct { - // NamedResources describes a resource filter using the named resources model. - NamedResources *NamedResourcesFilter -} - -// NamedResourcesFilter is used in ResourceFilterModel. -type NamedResourcesFilter struct { - // Selector is a CEL expression which must evaluate to true if a - // resource instance is suitable. The language is as defined in - // https://kubernetes.io/docs/reference/using-api/cel/ +```go +// AllocationResult contains attributes of an allocated resource. +type AllocationResult struct { + // ResourceHandles contain the state associated with an allocation that + // should be maintained throughout the lifetime of a claim. Each + // ResourceHandle contains data that should be passed to a specific kubelet + // plugin once it lands on a node. // - // In addition, for each type NamedResourcesin AttributeValue there is a map that - // resolves to the corresponding value of the instance under evaluation. - // For example: + // Setting this field is optional. It has a maximum size of 32 entries. + // If null (or empty), it is assumed this allocation will be processed by a + // single kubelet plugin with no ResourceHandle data attached. The name of + // the kubelet plugin invoked will match the DriverName set in the + // ResourceClaimStatus this AllocationResult is embedded in. // - // attributes.quantity["a"].isGreaterThan(quantity("0")) && - // attributes.stringslice["b"].isSorted() - Selector string -} -``` + // +listType=atomic + ResourceHandles []ResourceHandle -NamedResourcesFilter and NamedResourcesRequest currently have the same -content. Despite that, they are defined as separate structs because that might -change in the future. + // This field will get set by the resource driver after it has allocated + // the resource to inform the scheduler where it can schedule Pods using + // the ResourceClaim. + // + // Setting this field is optional. If null, the resource is available + // everywhere. + // +optional + AvailableOnNodes *core.NodeSelector -### ResourceHandle extension + // Shareable determines whether the resource supports more + // than one consumer at a time. + // +optional + Shareable bool +} -The ResourceHandle is embedded inside the claim status. When using structured parameters, -a new field must get populated instead of the opaque driver data. +// AllocationResultResourceHandlesMaxSize represents the maximum number of +// entries in allocation.resourceHandles. +const AllocationResultResourceHandlesMaxSize = 32 -```go // ResourceHandle holds opaque resource data for processing by a specific kubelet plugin. 
type ResourceHandle struct { // DriverName specifies the name of the resource driver whose kubelet @@ -840,19 +1643,8 @@ type ResourceHandle struct { // ResourceClaimStatus this ResourceHandle is embedded in. DriverName string - // Data contains the opaque data associated with this ResourceHandle. It is - // set by the controller component of the resource driver whose name - // matches the DriverName set in the ResourceClaimStatus this - // ResourceHandle is embedded in. It is set at allocation time and is - // intended for processing by the kubelet plugin whose name matches - // the DriverName set in this ResourceHandle. - // - // The maximum size of this field is 16KiB. This may get increased in the - // future, but not reduced. - // +optional - Data string - - // If StructuredData is set, then it needs to be used instead of Data. + // StructuredData captures the result of the allocation for this + // particular driver. StructuredData *StructuredResourceHandle } @@ -890,58 +1682,601 @@ type AllocationResultModel struct { // NamedResources describes the allocation result when using the named resources model. NamedResources *NamedResourcesAllocationResult } +``` -// NamedResourcesAllocationResult is used in AllocationResultModel. -type NamedResourcesAllocationResult struct { - // Name is the name of the selected resource instance. +##### ResourceClaimTemplate + +```go +// ResourceClaimTemplate is used to produce ResourceClaim objects. +type ResourceClaimTemplate struct { + metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta + + // Describes the ResourceClaim that is to be generated. + // + // This field is immutable. A ResourceClaim will get created by the + // control plane for a Pod when needed and then not get updated + // anymore. + Spec ResourceClaimTemplateSpec +} + +// ResourceClaimTemplateSpec contains the metadata and fields for a ResourceClaim. +type ResourceClaimTemplateSpec struct { + // ObjectMeta may contain labels and annotations that will be copied into the PVC + // when creating it. No other fields are allowed and will be rejected during + // validation. + // +optional + metav1.ObjectMeta + + // Spec for the ResourceClaim. The entire content is copied unchanged + // into the ResourceClaim that gets created from this template. The + // same fields as in a ResourceClaim are also valid here. + Spec ResourceClaimSpec +} +``` + +##### Object references + +```go +// ResourceClassParametersReference contains enough information to let you +// locate the parameters for a ResourceClass. +type ResourceClassParametersReference struct { + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. + // +optional + APIGroup string + // Kind is the type of resource being referenced. This is the same + // value as in the parameter object's metadata. + Kind string + // Name is the name of resource being referenced. Name string + // Namespace that contains the referenced resource. Must be empty + // for cluster-scoped resources and non-empty for namespaced + // resources. + // +optional + Namespace string +} + +// ResourceClaimParametersReference contains enough information to let you +// locate the parameters for a ResourceClaim. The object must be in the same +// namespace as the ResourceClaim. +type ResourceClaimParametersReference struct { + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. 
This matches the group in the APIVersion + // that is used when creating the resources. + // +optional + APIGroup string + // Kind is the type of resource being referenced. This is the same + // value as in the parameter object's metadata, for example "ConfigMap". + Kind string + // Name is the name of resource being referenced. + Name string +} + +// ResourceClaimConsumerReference contains enough information to let you +// locate the consumer of a ResourceClaim. The user must be a resource in the same +// namespace as the ResourceClaim. +type ResourceClaimConsumerReference struct { + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. + // +optional + APIGroup string + // Resource is the type of resource being referenced, for example "pods". + Resource string + // Name is the name of resource being referenced. + Name string + // UID identifies exactly one incarnation of the resource. + UID types.UID } ``` -### Implementation of structured models +`ResourceClassParametersReference` and `ResourceClaimParametersReference` use +the more user-friendly "kind" to identify the object type because those +references are provided by users. `ResourceClaimConsumerReference` is typically +set by the control plane and therefore uses the more technically correct +"resource" name. -In the Go types above, all structs starting with `NamedResources` are part of -that structured model. Code generators (more specifically, the applyconfig -generator) assume that all Go types of an API are defined in the same Go -package. If it wasn't for that, defining those structs in their own package -without the `NamedResources` prefix would be possible and make the Go code -cleaner without affecting the Kubernetes API. +#### core -### Scheduling + Allocation +```go +type PodSpec { + ... + // ResourceClaims defines which ResourceClaims must be allocated + // and reserved before the Pod is allowed to start. The resources + // will be made available to those containers which consume them + // by name. + // + // This is an alpha field and requires enabling the + // DynamicResourceAllocation feature gate. + // + // This field is immutable. + // + // +featureGate=DynamicResourceAllocation + // +optional + ResourceClaims []PodResourceClaim + ... +} + +type ResourceRequirements { + Limits ResourceList + Requests ResourceList + ... + // Claims lists the names of resources, defined in spec.resourceClaims, + // that are used by this container. + // + // This is an alpha field and requires enabling the + // DynamicResourceAllocation feature gate. + // + // This field is immutable. + // + // +featureGate=DynamicResourceAllocation + // +optional + Claims []ResourceClaim +} -The dynamic resource scheduler plugin handles the common fields of -ResourceSlice, ResourceClaimParameters and StructuredResourceHandle. For the -structured model fields it calls out to code that is associated with the -corresponding model. +// ResourceClaim references one entry in PodSpec.ResourceClaims. +type ResourceClaim struct { + // Name must match the name of one entry in pod.spec.resourceClaims of + // the Pod where this field is used. It makes that resource available + // inside a container. + Name string +} +``` -During filtering it is decided which nodes have the necessary resources. If a -node is found, the scheduler plugin updates the resource claim status as part -of goroutine which handles pod binding. 
+`Claims` is a list of structs with a single `Name` element because that struct +can be extended later, for example to add parameters that influence how the +resource is made available to a container. This wouldn't be possible if +it was a list of strings. -Like a normal DRA driver controller, the scheduler also sets a finalizer to -ensure that users cannot accidentally delete the allocated claim while a pod -is about to start which depends on it. That finalizer is -"structured.dra.k8s.io/delete-protection". +```go +// PodResourceClaim references exactly one ResourceClaim through a ClaimSource. +// It adds a name to it that uniquely identifies the ResourceClaim inside the Pod. +// Containers that need access to the ResourceClaim reference it with this name. +type PodResourceClaim struct { + // Name uniquely identifies this resource claim inside the pod. + // This must be a DNS_LABEL. + Name string -### Deallocation + // Source describes where to find the ResourceClaim. + Source ClaimSource +} -Deallocation is handled by kube-controller-manager when its claim controller -observes that a claim is no longer in use *and* the claim has the special -"structured.dra.k8s.io/delete-protection" finalizer. This finalizer tells the -controller that it may clear the allocation result directly instead of setting -the `DeletionRequested` field, which is what it normally would do. +// ClaimSource describes a reference to a ResourceClaim. +// +// Exactly one of these fields should be set. Consumers of this type must +// treat an empty object as if it has an unknown value. +type ClaimSource struct { + // ResourceClaimName is the name of a ResourceClaim object in the same + // namespace as this pod. + ResourceClaimName *string + + // ResourceClaimTemplateName is the name of a ResourceClaimTemplate + // object in the same namespace as this pod. + // + // The template will be used to create a new ResourceClaim, which will + // be bound to this pod. When this pod is deleted, the ResourceClaim + // will also be deleted. The pod name and resource name, along with a + // generated component, will be used to form a unique name for the + // ResourceClaim, which will be recorded in pod.status.resourceClaimStatuses. + // + // This field is immutable and no changes will be made to the + // corresponding ResourceClaim by the control plane after creating the + // ResourceClaim. + ResourceClaimTemplateName *string +} +struct PodStatus { + ... + // Status of resource claims. + // +featureGate=DynamicResourceAllocation + // +optional + ResourceClaimStatuses []PodResourceClaimStatus +} + +// PodResourceClaimStatus is stored in the PodStatus for each PodResourceClaim +// which references a ResourceClaimTemplate. It stores the generated name for +// the corresponding ResourceClaim. +type PodResourceClaimStatus struct { + // Name uniquely identifies this resource claim inside the pod. + // This must match the name of an entry in pod.spec.resourceClaims, + // which implies that the string must be a DNS_LABEL. + Name string + + // ResourceClaimName is the name of the ResourceClaim that was + // generated for the Pod in the namespace of the Pod. If this is + // unset, then generating a ResourceClaim was not necessary. The + // pod.spec.resourceClaims entry can be ignored in this case. 
+    ResourceClaimName *string
+}
+```
+
+### kube-controller-manager
+
+The code that creates a ResourceClaim from a ResourceClaimTemplate started
+as an almost verbatim copy of the [generic ephemeral volume
+code](https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/volume/ephemeral),
+just with different types. Later, generating the name of the ephemeral ResourceClaim
+was added.
+
+kube-controller-manager needs [RBAC
+permissions](https://github.com/kubernetes/kubernetes/commit/ff3e5e06a79bc69ad3d7ccedd277542b6712514b#diff-2ad93af2302076e0bdb5c7a4ebe68dd3188eee8959c72832181a7597417cd196) that allow creating and updating ResourceClaims.
+
+kube-controller-manager also removes `claim.status.reservedFor` entries that reference
+removed pods or pods that have completed ("Phase" is "done" or will never start).
+This is required for pods because kubelet does not have write
+permission for ResourceClaimStatus. Pods as consumers are the common case, so special
+code based on a shared pod informer will handle them. Other consumers
+need to be handled by whatever controller added them.
+
+In addition to updating `claim.status.reservedFor`, kube-controller-manager also
+removes the allocation from ResourceClaims that are no longer in use. kube-scheduler observes the claim update during deallocation and learns that it can use the capacity set aside for the claim again. kube-controller-manager itself doesn't need to support specific structured models.
-### Immediate allocation
+
+### kube-scheduler
+
+The scheduler plugin for ResourceClaims ("claim plugin" in this section)
+needs to implement several extension points. It is responsible for
+ensuring that a ResourceClaim is allocated and reserved for a Pod before
+the final binding of a Pod to a node.
+
+The following extension points are implemented in the new claim plugin. Except
+for some unlikely edge cases (see below) there are no API calls during the main
+scheduling cycle. Instead, the plugin collects information and updates the
+cluster in the separate goroutine which invokes PreBind.
+
+
+#### EventsToRegister
+
+This registers all cluster events that might make an unschedulable pod
+schedulable, like creating a claim that the pod needs or finishing the
+allocation of a claim.
+
+[Queuing hints](https://github.com/kubernetes/enhancements/issues/4247) are
+supported. These are callbacks that can limit the effect of a cluster event to
+specific pods. For example, allocating a claim only makes those pods
+schedulable which reference the claim. There is no need to try scheduling a pod
+which waits for some other claim. Hints are also used to trigger the next
+scheduling cycle for a pod immediately when some expected and required event
+like "drivers have provided information" occurs, instead of forcing the pod to
+go through the backoff queue and the usual 5-second delay associated
+with that.
+
+Queuing hints are an optional feature of the scheduler, with (as of Kubernetes
+1.29) their own `SchedulerQueueingHints` feature gate that defaults to
+off. When turned off, scheduling pods with resource claims is
+slower compared to a cluster configuration where they are turned on.
+
+#### PreEnqueue
+
+This checks whether all claims referenced by a pod exist. If they don't,
+scheduling the pod has to wait until the kube-controller-manager or the user creates
+the claims. PreEnqueue tries to finish quickly because it is called from
+event handlers, so not everything is checked.
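+
+For illustration, the following sketch shows how such a check might resolve
+the ResourceClaim name for each entry in `pod.spec.resourceClaims` (either the
+explicit name or the generated name recorded in
+`pod.status.resourceClaimStatuses`) and verify that the claim exists. It is a
+minimal sketch based on the Go types shown in this KEP; the helper names and
+the lookup callback are illustrative and not the actual plugin code.
+
+```go
+package example
+
+import (
+	"fmt"
+
+	corev1 "k8s.io/api/core/v1"
+)
+
+// claimName resolves the ResourceClaim name for one entry in
+// pod.spec.resourceClaims: either the explicit name or the generated name
+// that kube-controller-manager recorded in pod.status.resourceClaimStatuses.
+func claimName(pod *corev1.Pod, podClaim corev1.PodResourceClaim) (string, error) {
+	if podClaim.Source.ResourceClaimName != nil {
+		return *podClaim.Source.ResourceClaimName, nil
+	}
+	for _, status := range pod.Status.ResourceClaimStatuses {
+		if status.Name == podClaim.Name && status.ResourceClaimName != nil {
+			return *status.ResourceClaimName, nil
+		}
+	}
+	return "", fmt.Errorf("claim %q: generated ResourceClaim not recorded in pod status yet", podClaim.Name)
+}
+
+// allClaimsExist reports whether every referenced claim can be found through
+// the supplied lookup, for example a ResourceClaim informer lister.
+func allClaimsExist(pod *corev1.Pod, lookup func(namespace, name string) bool) error {
+	for _, podClaim := range pod.Spec.ResourceClaims {
+		name, err := claimName(pod, podClaim)
+		if err != nil {
+			return err
+		}
+		if !lookup(pod.Namespace, name) {
+			return fmt.Errorf("waiting for ResourceClaim %s/%s", pod.Namespace, name)
+		}
+	}
+	return nil
+}
+```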
+
+#### Pre-filter
+
+This is a more thorough version of the checks done by PreEnqueue. It ensures
+that all information that is needed (ResourceClaim, ResourceClass, parameters)
+is available.
+
+Another reason why a Pod might not be schedulable is when it depends on claims
+which are in the process of being allocated. That process starts in Reserve and
+ends in PreBind or Unreserve (see below).
+
+It then prepares for filtering by converting information stored in various
+places (node filter in ResourceClass, available resources in ResourceSlices,
+allocated resources in ResourceClaim statuses, in-flight allocations) into a
+format that can be used efficiently by Filter.
+
+#### Filter
+
+This checks whether the given node has access to those ResourceClaims which
+were already allocated. For ResourceClaims that were not, it checks that the
+allocation can succeed for that node.
+
+#### Post-filter
+
+This is called when no suitable node could be found. If the Pod depends on ResourceClaims with delayed
+allocation, then deallocating one or more of these ResourceClaims may make the
+Pod schedulable after allocating the resource elsewhere. Therefore each
+ResourceClaim with delayed allocation is checked to determine whether all of the following
+conditions apply:
+- allocated
+- not currently in use
+- it was the reason why some node could not fit the Pod, as recorded earlier in
+  Filter
+
+One of the ResourceClaims satisfying these criteria is picked randomly and gets
+deallocated by clearing the allocation in its status. This may make it possible to run the Pod
+elsewhere. If it still doesn't help, deallocation may continue with another
+ResourceClaim, if there is one.
+
+This currently uses blocking API calls. It is quite rare because this
+situation can only arise when there are multiple claims per pod and writing
+the status of one of them fails, thus leaving the other claims in the
+allocated state.
+
+#### Reserve
+
+A node has been chosen for the Pod.
+
+For each unallocated claim, the actual allocation result is determined now. To
+avoid blocking API calls, that result is not written to the status yet. Instead,
+it gets stored in a map of in-flight claims.
+
+#### PreBind
+
+This is called in a separate goroutine. The plugin now checks all the
+information gathered earlier and updates the cluster accordingly. If some
+API request fails now, PreBind fails and the pod must be
+retried.
+
+Claims whose status got written back get removed from the in-flight claim map.
+
+#### Unreserve
+
+The claim plugin removes the Pod from the `claim.status.reservedFor` field if
+set there because it cannot be scheduled after all.
+
+This is necessary to prevent a deadlock: suppose there are two stand-alone
+claims that can only be used by one pod at a time and two pods which both
+reference them. Both pods will get scheduled independently, perhaps even by
+different schedulers. When each pod manages to allocate and reserve one claim,
+then neither of them can get scheduled because they cannot reserve the other
+claim.
+
+Giving up the reservations in Unreserve means that the next pod scheduling
+attempts have a chance to succeed. It's non-deterministic which pod will win,
+but eventually one of them will. Not giving up the reservations would lead to a
+permanent deadlock that somehow would have to be detected and resolved to make
+progress.
+
+All claims get removed from the in-flight claim map.
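+
+The in-flight claim bookkeeping shared by Reserve, PreBind and Unreserve could
+look roughly like the following. This is a hypothetical sketch, not the actual
+plugin code; the type and method names are made up, and the real implementation
+also has to coordinate with the scheduler's caches and error handling.
+
+```go
+package example
+
+import (
+	"sync"
+
+	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
+	"k8s.io/apimachinery/pkg/types"
+)
+
+// inFlightAllocations holds allocation results that Reserve computed but that
+// have not been written to the ResourceClaim status yet.
+type inFlightAllocations struct {
+	mutex  sync.Mutex
+	claims map[types.UID]*resourcev1alpha2.AllocationResult
+}
+
+// reserve records the computed allocation without any API call.
+func (a *inFlightAllocations) reserve(claimUID types.UID, allocation *resourcev1alpha2.AllocationResult) {
+	a.mutex.Lock()
+	defer a.mutex.Unlock()
+	if a.claims == nil {
+		a.claims = map[types.UID]*resourcev1alpha2.AllocationResult{}
+	}
+	a.claims[claimUID] = allocation
+}
+
+// take returns the pending allocation (if any) and removes the entry, as
+// PreBind does after writing the status back, and as Unreserve does when the
+// pod cannot be scheduled after all.
+func (a *inFlightAllocations) take(claimUID types.UID) (*resourcev1alpha2.AllocationResult, bool) {
+	a.mutex.Lock()
+	defer a.mutex.Unlock()
+	allocation, ok := a.claims[claimUID]
+	delete(a.claims, claimUID)
+	return allocation, ok
+}
+```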
+ +Unreserve is called in two scenarios: +- In the main goroutine when scheduling a pod has failed: in that case the plugin's + Reserve call hasn't actually changed the claim status yet, so there is nothing + that needs to be rolled back. +- After binding has failed: this runs in a goroutine, so reverting the + `claim.status.reservedFor` with a blocking call is acceptable. + +### kubelet + +#### Managing resources + +kubelet must ensure that resources are ready for use on the node before running +the first Pod that uses a specific resource instance and make the resource +available elsewhere again when the last Pod has terminated that uses it. For +both operations, kubelet calls a resource kubelet plugin as explained in the next +section. + +Pods that are not listed in ReservedFor or where the ResourceClaim doesn't +exist at all must not be allowed to run. Instead, a suitable event must be +emitted which explains the problem. Such a situation can occur as part of +downgrade scenarios. + +If this was the last Pod on the node that uses the specific +resource instance, then NodeUnprepareResource (see below) must have been called +successfully before allowing the pod to be deleted. This ensures that network-attached resource are available again +for other Pods, including those that might get scheduled to other nodes. It +also signals that it is safe to deallocate and delete the ResourceClaim. + + +![kubelet](./kubelet.png) + +#### Communication between kubelet and resource kubelet plugin + +Resource kubelet plugins are discovered through the [kubelet plugin registration +mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A +new "ResourcePlugin" type will be used in the Type field of the +[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo) +response to distinguish the plugin from device and CSI plugins. + +Under the advertised Unix Domain socket the kubelet plugin provides the +k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by +[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md), +with “volume” replaced by “resource” and volume specific parts removed. + +##### NodeListAndWatchResources + +NodeListAndWatchResources returns a stream of NodeResourcesResponse objects. +At the start and whenever resource availability changes, the +plugin must send one such object with all information to the kubelet. The +kubelet then syncs that information with ResourceSlice objects. + +``` +message NodeListAndWatchResourcesRequest { +} + +message NodeListAndWatchResourcesResponse { + repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1; +} +``` + +##### NodePrepareResource + +This RPC is called by the kubelet when a Pod that wants to use the specified +resource is scheduled on a node. The Plugin SHALL assume that this RPC will be +executed on the node where the resource will be used. + +ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and +one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with +a matching DriverName should be passed to the Plugin as parameters to identify +the claim and perform resource preparation. + +ResourceClaim parameters (namespace, UUID, name) are useful for debugging. +They enable the Plugin to retrieve the full ResourceClaim object, should it +ever be needed (normally it shouldn't). + +The Plugin SHALL return fully qualified device name[s]. 
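+
+A fully qualified CDI device name combines the CDI kind ("vendor/class") with
+the name of a device from the vendor's CDI spec, for example
+`vendor.com/gpu=gpudevice1`. The sketch below shows how a driver might build
+such names; the vendor, class and device names are assumptions for
+illustration only.
+
+```go
+package example
+
+import "fmt"
+
+// cdiKind is the "vendor/class" part of the CDI device names published by
+// this hypothetical driver.
+const cdiKind = "vendor.example.com/gpu"
+
+// qualifiedDeviceName builds the string that NodePrepareResources returns to
+// the kubelet, which then passes it on to the container runtime through the
+// CRI CDIDevice field shown below.
+func qualifiedDeviceName(device string) string {
+	return fmt.Sprintf("%s=%s", cdiKind, device)
+}
+```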
+
+The Plugin SHALL ensure that there are JSON file[s] in CDI format
+for the allocated resource. These files SHALL be used by the runtime to
+update the runtime configuration before creating containers that use the
+resource.
+
+This operation SHALL do as little work as possible as it's called
+after a pod is scheduled to a node. All potentially failing operations
+SHALL be done during the allocation phase.
+
+This operation MUST be idempotent. If the resource corresponding to
+the `resource_id` has already been prepared, the Plugin MUST reply `0
+OK`.
+
+If this RPC failed, or kubelet does not know if it failed or not, it
+MAY choose to call `NodePrepareResource` again, or choose to call
+`NodeUnprepareResource`.
+
+On a successful call this RPC should return a set of fully qualified
+CDI device names, which kubelet MUST pass to the runtime through the CRI
+protocol. For version v1alpha3, the RPC should return multiple sets of
+fully qualified CDI device names, one per claim that was sent in the input parameters.
+
+```protobuf
+message NodePrepareResourcesRequest {
+    // The list of ResourceClaims that are to be prepared.
+    repeated Claim claims = 1;
+}
+
+message Claim {
+    // The ResourceClaim namespace (ResourceClaim.meta.Namespace).
+    // This field is REQUIRED.
+    string namespace = 1;
+    // The UID of the Resource claim (ResourceClaim.meta.UUID).
+    // This field is REQUIRED.
+    string uid = 2;
+    // The name of the Resource claim (ResourceClaim.meta.Name)
+    // This field is REQUIRED.
+    string name = 3;
+    // Resource handle (AllocationResult.ResourceHandles[*].Data)
+    // This field is OPTIONAL.
+    string resource_handle = 4;
+    // Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
+    // This field is OPTIONAL. If present, it needs to be used
+    // instead of resource_handle. It will only have a single entry.
+    //
+    // Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
+    repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
+}
+```
+
+`resource_handle` and `structured_resource_handle` will be set depending on how
+the claim was allocated. See also KEP #3063.
+
+```protobuf
+message NodePrepareResourcesResponse {
+    // The ResourceClaims for which preparation was done
+    // or attempted, with claim_uid as key.
+    //
+    // It is an error if some claim listed in NodePrepareResourcesRequest
+    // does not get prepared. NodePrepareResources
+    // will be called again for those that are missing.
+    map<string, NodePrepareResourceResponse> claims = 1;
+}
+```
+
+The CRI protocol MUST be extended for this purpose:
+
+ * A CDIDevice structure should be added to the CRI specification:
+```protobuf
+// CDIDevice specifies CDI device information.
+message CDIDevice {
+    // Fully qualified CDI device name
+    // for example: vendor.com/gpu=gpudevice1
+    // see more details in the CDI specification:
+    // https://github.com/container-orchestrated-devices/container-device-interface/blob/main/SPEC.md
+    string name = 1;
+}
+```
+ * CDI devices should be added to the ContainerConfig structure:
+```protobuf
+// ContainerConfig holds all the required and optional fields for creating a
+// container.
+message ContainerConfig {
+    // Metadata of the container. This information will uniquely identify the
+    // container, and the runtime should leverage this to ensure correct
+    // operation. The runtime may also use this information to improve UX, such
+    // as by constructing a readable name.
+    ContainerMetadata metadata = 1 ;
+    // Image to use.
+    ImageSpec image = 2;
+    // Command to execute (i.e., entrypoint for docker)
+    repeated string command = 3;
+...
+    // CDI devices for the container.
+    repeated CDIDevice cdi_devices = 17;
+}
+```
+
+###### NodePrepareResource Errors
+
+If the plugin is unable to complete the NodePrepareResource call
+successfully, it MUST return a non-ok gRPC code in the gRPC status.
+If the conditions defined below are encountered, the plugin MUST
+return the specified gRPC error code. Kubelet MUST implement the
+specified error recovery behavior when it encounters the gRPC error
+code.
+
+| Condition | gRPC Code | Description | Recovery Behavior |
+|-----------|-----------|-------------|-------------------|
+| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. |
+
+
+##### NodeUnprepareResources
+
+A Kubelet Plugin MUST implement this RPC call. This RPC is the reverse
+operation of `NodePrepareResource`. It MUST undo the work done by
+the corresponding `NodePrepareResource`. This RPC SHALL be called by
+kubelet at least once for each successful `NodePrepareResource`. The
+Plugin SHALL assume that this RPC will be executed on the node where
+the resource is being used.
+
+This RPC is called by the kubelet when the last Pod using the resource is being
+deleted or has reached a final state ("Phase" is "done").
+
+This operation MUST be idempotent. If this RPC failed, or kubelet does
+not know if it failed or not, it can choose to call
+`NodeUnprepareResource` again.
+
+```protobuf
+message NodeUnprepareResourcesRequest {
+    // The list of ResourceClaims that are to be unprepared.
+    repeated Claim claims = 1;
+}
+
+message NodeUnprepareResourcesResponse {
+    // The ResourceClaims for which preparation was reverted.
+    // The same rules as for NodePrepareResourcesResponse.claims
+    // apply.
+    map<string, NodeUnprepareResourceResponse> claims = 1;
+}
+
+message NodeUnprepareResourceResponse {
+    // If non-empty, unpreparing the ResourceClaim failed.
+    string error = 1;
+}
+```
+
+###### NodeUnprepareResource Errors
+
+If the plugin is unable to complete the NodeUnprepareResource call
+successfully, it MUST return a non-ok gRPC code in the gRPC status.
+If the conditions defined below are encountered, the plugin MUST
+return the specified gRPC error code. Kubelet MUST implement the
+specified error recovery behavior when it encounters the gRPC error
+code.
+
+| Condition | gRPC Code | Description | Recovery Behavior |
+|-----------|-----------|-------------|-------------------|
+| Resource does not exist | 5 NOT_FOUND | Indicates that a resource corresponding to the specified `resource_id` does not exist. | Caller MUST verify that the `resource_id` is correct and that the resource is accessible and has not been deleted before retrying with exponential back off. |
-Because there is no separate controller anymore, claims with immediate
-allocation will only get allocated once there is a pod which needs them. The
-remaining structured difference compared to delayed allocation is that claims
-with immediate allocation remain allocated when no longer in use.
### Simulation with CA @@ -1014,17 +2349,13 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement.
--> -- ``: `` - `` +- `k8s.io/kubernetes/pkg/scheduler`: 2022-05-24 - 75.0% +- `k8s.io/kubernetes/pkg/scheduler/framework`: 2022-05-24 - 76.3% +- `k8s.io/kubernetes/pkg/controller`: 2022-05-24 - 69.4% +- `k8s.io/kubernetes/pkg/kubelet`: 2022-05-24 - 64.5% ##### Integration tests - - -- : - ##### e2e tests -- : +End-to-end testing depends on a working resource driver and a container runtime +with CDI support. A [test driver](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/dra/test-driver) +was developed in parallel to developing the +code in Kubernetes. + +That test driver simply takes parameters from ResourceClass +and ResourceClaim and turns them into environment variables that then get +checked inside containers. Tests for different behavior of an driver in various +scenarios can be simulated by running the control-plane part of it in the E2E +test itself. For interaction with kubelet, proxying of the gRPC interface can +be used, as in the +[csi-driver-host-path](https://github.com/kubernetes-csi/csi-driver-host-path/blob/16251932ab81ad94c9ec585867104400bf4f02e5/cmd/hostpathplugin/main.go#L61-L63): +then the kubelet plugin runs on the node(s), but the actual processing of gRPC +calls happens inside the E2E test. + +All tests that don't involve actually running a Pod can become part of +conformance testing. Those tests that run Pods cannot be because CDI support in +runtimes is not required. + +For beta: +- pre-merge with kind (optional, triggered for code which has an impact on DRA): https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kind-dra +- periodic with kind: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#ci-kind-dra +- pre-merge with CRI-O: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-node-dra +- periodic with CRI-O: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#ci-node-e2e-crio-dra + ### Graduation Criteria @@ -1400,10 +2753,13 @@ Major milestones might include: - when the KEP was retired or superseded --> +- Kubernetes 1.30: Code merged as "alpha" + ## Drawbacks DRA driver developers have to give up some flexibility with regards to -parameters. They have to learn and understand how structured models +parameters compared to opaque parameters in KEP #3063. +They have to learn and understand how structured models work to pick something which fits their needs. ## Alternatives @@ -1444,3 +2800,144 @@ several different ways: support of such a rebuilt CA binary. However, technically it [becomes possible](https://github.com/kubernetes-sigs/kube-scheduler-wasm-extension) with this KEP. + +### ResourceClaimTemplate + +Instead of creating a ResourceClaim from a template, the +PodStatus could be extended to hold the same information as a +ResourceClaimStatus. Every component which works with that information +then needs permission and extra code to work with PodStatus. Creating +an extra object seems simpler. + +### Reusing volume support as-is + +ResourceClaims are similar to PersistentVolumeClaims and also a lot of +the associated logic is similar. An [early +prototype](https://github.com/intel/proof-of-concept-cdi) used a +custom CSI driver to manage resources. + +The user experience with that approach is poor because per-resource +parameters must be stored in annotations of a PVC due to the lack of +custom per-PVC parameters. 
Passing annotations as additional parameters was [proposed +before](https://github.com/kubernetes-csi/external-provisioner/issues/86) +but were essentially [rejected by +SIG-Storage](https://github.com/kubernetes-csi/external-provisioner/issues/86#issuecomment-465836185) +because allowing apps to set custom parameters would make apps +non-portable. + +The current volume support also has open issues that affect the +“volume as resource” approach: Multiple different Pods on a node are +allowed to use the same +volume. https://github.com/kubernetes/enhancements/pull/2489 will +address that, but is still work in progress. Recovery from a bad node +selection during delayed binding may get stuck when a Pod has multiple +volumes because volumes are not getting deleted after a partial +provisioning. A proposal to fix that needs further work +(https://github.com/kubernetes/enhancements/pull/1703). Each “fake” +CSI driver would have to implement and install a scheduler extender +because storage capacity tracking only considers volume size as +criteria for selecting nodes, which is not applicable for custom +resources. + +### Extend volume support + +The StorageClass and PersistentVolumeClaim structs could be extended +to allow custom parameters. Together with an extension of the CSI +standard that would address the main objection against the previous +alternative. + +However, SIG-Storage and the CSI community would have to agree to this +kind of reuse and accept that some of the code maintained by them +becomes more complex because of these new use cases. + +### Extend Device Plugins + +The device plugins API could be extended to implement some of the +requirements mentioned in the “Motivation” section of this +document. There were certain attempts to do it, for example an attempt +to [add ‘Deallocate’ API call](https://github.com/kubernetes/enhancements/pull/1949) and [pass pod annotations to 'Allocate' API call](https://github.com/kubernetes/kubernetes/pull/61775) + +However, most of the requirements couldn’t be satisfied using this +approach as they would require major incompatible changes in the +Device Plugins API. For example: partial and optional resource +allocation couldn’t be done without changing the way resources are +currently declared on the Pod and Device Plugin level. + +Extending the device plugins API to use [Container Device Interface](https://github.com/container-orchestrated-devices/container-device-interface) +would help address some of the requirements, but not all of them. + +NodePrepareResource and NodeUnprepareResource could be added to the Device Plugins API and only get called for +resource claims. + +However, this would mean that +developers of the device plugins would have to implement mandatory +API calls (ListAndWatch, Allocate), which could create confusion +as those calls are meaningless for the Dynamic Resource Allocation +purposes. + +Even worse, existing device plugins would have to implement the new +calls with stubs that return errors because the generated Go interface +will require them. + +It should be also taken into account that device plugins API is +beta. Introducing incompatible changes to it may not be accepted by +the Kubernetes community. + +### Webhooks instead of ResourceClaim updates + +In the current design, scheduler and the third-party resource driver communicate by +updating fields in a ResourceClaim. 
This has several advantages compared to an
+approach where kube-scheduler retrieves information from the resource driver
+via HTTP:
+* No need for a new webhook API.
+* Simpler deployment of a resource driver because all it needs are
+  credentials to communicate with the apiserver.
+* Current status can be checked by querying the ResourceClaim.
+
+The downside is higher load on the apiserver and an increase in the size of
+ResourceClaim objects.
+
+### ResourceDriver
+
+Similar to CSIDriver for storage, a separate object describing a resource
+driver might be useful at some point. At the moment it is not needed and
+therefore not part of the v1alpha2 API. If it becomes necessary to describe
+optional features of a resource driver, such a ResourceDriver type might look
+like this:
+
+```go
+type ResourceDriver struct {
+    // The name of the object is the unique driver name.
+    ObjectMeta
+
+    // Features contains a list of features supported by the driver.
+    // New features may be added over time and must be ignored
+    // by code that does not know about them.
+    Features []ResourceDriverFeature
+}
+
+type ResourceDriverFeature struct {
+    // Name is one of the pre-defined names for a feature.
+    Name ResourceDriverFeatureName
+    // Parameters might provide additional information about how
+    // the driver supports the feature. Boolean features have
+    // no parameters, merely listing them indicates support.
+    Parameters runtime.RawExtension
+}
+```
+
+### Complex sharing of ResourceClaim
+
+At the moment, the allocation result marks a claim as either "shareable" by
+an unlimited number of consumers or "not shareable". More complex scenarios,
+like "may be shared by a certain number of consumers", might be useful, but so
+far such use cases have not come up. If they do, the `AllocationResult` can
+be extended with new fields as defined by a follow-up KEP.
+
+## Infrastructure Needed
+
+Initially, all development will happen inside the main Kubernetes
+repository. The mock driver can be developed inside test/e2e/dra. For the
+generic part of that driver, i.e. the code that other drivers can reuse, and
+other common code, a new staging repo `k8s.io/dynamic-resource-allocation` is
+needed.
diff --git a/keps/sig-node/4381-dra-structured-parameters/components.png b/keps/sig-node/4381-dra-structured-parameters/components.png new file mode 100644 index 00000000000..417fe21ac42 Binary files /dev/null and b/keps/sig-node/4381-dra-structured-parameters/components.png differ diff --git a/keps/sig-node/4381-dra-structured-parameters/components.puml b/keps/sig-node/4381-dra-structured-parameters/components.puml new file mode 100644 index 00000000000..36eb3afc16f --- /dev/null +++ b/keps/sig-node/4381-dra-structured-parameters/components.puml @@ -0,0 +1,60 @@ +@startuml +!theme reddress-lightblue +skinparam componentStyle rectangle +left to right direction + +cloud "resource driver" as resourcedriver { + component "CRD controller" as drivercrdcontroller + component "kubelet plugin" as driverplugin +} + +component Kubernetes { + component apiserver { + component namespaced { + file ResourceClaimTemplate + file Pod + file ResourceClaim + file DriverCRDParameters + file ResourceClaimParameters + } + component "cluster-scoped" as clusterscoped { + file ResourceClass + file ResourceSlice + } + } + component scheduler { + component "resource plugin" as k8sresourceplugin + } + component "controller-manager" as controllermanager { + component "resource claim controller" as k8sresourceclaimcontroller + } + component kubelet { + component "plugin manager" as pluginmanager + component "resource manager" as resourcemanager + } +} + +' Kubernetes ---> resourcedriver + +ResourceClaimTemplate <. Pod +Pod <. ResourceClaim: owned by\n(if created from template) +ResourceClaim .> ResourceClass +ResourceClaim .> DriverCRDParameters +ResourceClaimParameters .> DriverCRDParameters: generated from,\nowned by + +Pod -u-> k8sresourceclaimcontroller +ResourceClaimTemplate -u-> k8sresourceclaimcontroller +ResourceClaim <-u- k8sresourceclaimcontroller: create claim,\nclean up consumers,\ntrigger deallocation +k8sresourceplugin <- ResourceClaimParameters + +Pod <--> kubelet +Pod <--> scheduler +ResourceClaim <--> k8sresourceplugin + +ResourceClaimParameters <- drivercrdcontroller: create, update +DriverCRDParameters ---> drivercrdcontroller: read +resourcemanager <-> driverplugin: calls gRPC methods,\nreceives stream of\nresource information +ResourceSlice-> k8sresourceplugin: consumes +resourcemanager --> ResourceSlice: publishes +pluginmanager <-> driverplugin: registers +@enduml diff --git a/keps/sig-node/4381-dra-structured-parameters/kep.yaml b/keps/sig-node/4381-dra-structured-parameters/kep.yaml index 1c17a58f9ab..08b6dcf1c4b 100644 --- a/keps/sig-node/4381-dra-structured-parameters/kep.yaml +++ b/keps/sig-node/4381-dra-structured-parameters/kep.yaml @@ -25,7 +25,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.30" +latest-milestone: "v1.31" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: @@ -34,7 +34,12 @@ milestone: # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - # DynamicResourceAllocation, same as for DRA without this KEP + - name: DynamicResourceAllocation + components: + - kube-apiserver + - kube-controller-manager + - kube-scheduler + - kubelet disable-supported: true # The following PRR answers are required at beta release diff --git a/keps/sig-node/4381-dra-structured-parameters/kubelet.png b/keps/sig-node/4381-dra-structured-parameters/kubelet.png new file mode 100644 index 00000000000..3de5946d6d4 Binary files /dev/null and b/keps/sig-node/4381-dra-structured-parameters/kubelet.png differ diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kubelet.puml b/keps/sig-node/4381-dra-structured-parameters/kubelet.puml similarity index 100% rename from keps/sig-node/3063-dynamic-resource-allocation/kubelet.puml rename to keps/sig-node/4381-dra-structured-parameters/kubelet.puml