---
layout: blog
title: "Dynamic Resource Allocation in Kubernetes 1.26"
date:
slug: dynamic-resource-allocation
---

**Authors:** Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it
possible to:

- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource
  you are looking for,
- initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters
as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an *alpha feature* and is only enabled when the
`DynamicResourceAllocation` [feature
gate](/docs/reference/command-line-tools-reference/feature-gates/) and the
`resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} are enabled. For details, see the
`--feature-gates` and `--runtime-config` [kube-apiserver
parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
The kube-scheduler, kube-controller-manager, and kubelet components all need
the feature gate enabled as well.
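
For example, on a cluster that was set up with kubeadm, enabling all of this
might look roughly like the following sketch. This is an illustration only,
not part of the official instructions; how the flags get set depends on how
your cluster is deployed:

```yaml
# Sketch: enabling the feature on a kubeadm-based cluster.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
    runtime-config: "resource.k8s.io/v1alpha1=true"
controllerManager:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
---
# The kubelet reads its feature gates from its own configuration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
```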

The default configuration of kube-scheduler enables the "DynamicResources"
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
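
A custom configuration that includes the plugin could look roughly like the
sketch below; enabling it through `multiPoint` is one option, it can also be
enabled at the individual extension points:

```yaml
# Sketch: a custom scheduler configuration that enables the plugin.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: DynamicResources
```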

Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but it can also be run manually. See
[below](#running-the-test-driver) for step-by-step instructions.

## API

The new `resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} provides four new types:

ResourceClass
: Defines which resource driver handles a certain kind of
  resource and provides common parameters for it. ResourceClasses
  are created by a cluster administrator when installing a resource
  driver.

ResourceClaim
: Defines a particular resource instance that is required by a
  workload. Created by a user (lifecycle managed manually, can be shared
  between different Pods) or for individual Pods by the control plane based on
  a ResourceClaimTemplate (automatic lifecycle, typically used by just one
  Pod).

ResourceClaimTemplate
: Defines the spec and some metadata for creating
  ResourceClaims. Created by a user when deploying a workload.

PodScheduling
: Used internally by the control plane and resource drivers
  to coordinate pod scheduling when ResourceClaims need to be allocated
  for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a {{< glossary_tooltip
term_id="CustomResourceDefinition" text="CRD" >}} that was created when
installing a resource driver.
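
To make that split concrete, here is a sketch of a manually created
ResourceClaim whose driver-specific parameters live in such a separate object.
The names are borrowed from the fictional resource driver in the full example
further below:

```yaml
# Sketch: a user-created ResourceClaim; its driver-specific parameters
# are stored in a separate custom resource (fictional driver names).
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: shared-cat
spec:
  resourceClassName: resource.example.com
  parametersRef:
    apiGroup: cats.resource.example.com
    kind: ClaimParameters
    name: large-black-cat-claim-parameters
```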

The `core/v1` `PodSpec` defines the ResourceClaims that are needed for a Pod in
a new `resourceClaims` field. Entries in that list reference either a
ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all
Pods using this PodSpec (for example, inside a Deployment or StatefulSet) share
the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each
Pod gets its own instance.

The `resources.claims` list for container resources defines whether a container
gets access to these resource instances, which makes it possible to share
resources between containers.

Here is an example for a fictional resource driver. Two ResourceClaim objects
will get created for this Pod and each container gets access to one of them.

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
metadata:
  name: large-black-cat-claim-parameters
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cat-claim-parameters
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers:
  - name: container0
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: container1
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
```

## Scheduling

In contrast to native resources (CPU, RAM) and extended resources (managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. They mark ResourceClaims as "allocated" once the
resources for them are reserved. This also tells the scheduler where in the
cluster a ResourceClaim is available.

ResourceClaims can get allocated as soon as they are created ("immediate
allocation"), without considering which Pods will use them. The default is to
delay allocation until a Pod that needs the ResourceClaim gets scheduled
("wait for first consumer").
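
The mode is chosen per claim through the `allocationMode` field. As a sketch,
again reusing the fictional driver from above, a ResourceClaim that should be
allocated immediately could look like this:

```yaml
# Sketch: requesting immediate allocation instead of the default
# "WaitForFirstConsumer" mode (fictional driver names from above).
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: eager-cat
spec:
  resourceClassName: resource.example.com
  allocationMode: Immediate
```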

In that mode, the scheduler checks all ResourceClaims needed by a Pod and
creates a PodScheduling object where it informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left. Once the scheduler has that
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate their ResourceClaims so that the
resources will be available on that node. Once that is complete, the Pod
gets scheduled.
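
To illustrate that back and forth, here is a hypothetical snapshot of a
PodScheduling object in the middle of this negotiation. The node names are
made up; the fields come from the `resource.k8s.io/v1alpha1` API:

```yaml
# Hypothetical snapshot of the scheduler/driver negotiation for the
# example Pod from above.
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling          # same name and namespace as the Pod
metadata:
  name: pod-with-cats
spec:
  potentialNodes:            # written by the scheduler: candidate nodes
  - worker-1
  - worker-2
  selectedNode: worker-1     # the scheduler's choice, once it makes one
status:
  resourceClaims:
  - name: cat-0
    unsuitableNodes:         # reported back by the resource driver
    - worker-2
```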

As part of this process, ResourceClaims also get reserved for the Pod.
Currently, a ResourceClaim can be used either exclusively by a single Pod or
by an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where a
Pod gets scheduled onto one node and then cannot run there, which is bad
because such a pending Pod also blocks all other resources like RAM or CPU
that were set aside for it.

## Limitations

The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the `nodeName` field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to [remove this
limitation](https://github.com/kubernetes/kubernetes/issues/114005) in the
future.

## Writing a resource driver

A dynamic resource allocation driver consists of two separate-but-coordinating
components: a centralized controller and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. Kubernetes therefore provides the
following package, which includes APIs for invoking this boilerplate code as
well as a `Driver` interface that one can implement to provide the custom
business logic:

- [k8s.io/dynamic-resource-allocation/controller](https://github.com/kubernetes/dynamic-resource-allocation/tree/master/controller)

Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as to start a gRPC server that implements the kubelet
plugin API. The following package is provided for this purpose:

- [k8s.io/dynamic-resource-allocation/kubeletplugin](https://github.com/kubernetes/dynamic-resource-allocation/tree/master/kubeletplugin)

It is up to the driver developer to decide how these two components
communicate. The KEP outlines an [approach using
CRDs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#implementing-a-plugin-for-node-resources).
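
As a purely hypothetical sketch of that approach, the kubelet plugin could
publish per-node state in a custom resource that the centralized controller
reads when allocating claims. The kind and fields below are invented for
illustration, continuing the fictional cat driver:

```yaml
# Hypothetical: a per-node custom resource through which the node-local
# plugin reports available resources to the centralized controller.
apiVersion: cats.resource.example.com/v1
kind: NodeCatInventory
metadata:
  name: worker-1             # one object per node
spec:
  availableCats:
  - color: black
    size: large
  - color: white
    size: small
```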

We also plan to provide a complete [example
driver](https://github.com/kubernetes-sigs/dra-example-driver) that can serve
as a template for other drivers.

## Running the test driver

The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, a
[CDI-enabled](https://github.com/container-orchestrated-devices/container-device-interface)
container runtime must be installed on your system (for example, containerd
[v1.7+](https://github.com/containerd/containerd/releases/tag/v1.7.0-beta.0)
or CRI-O
[v1.23.2+](https://github.com/cri-o/cri-o/releases/tag/v1.23.2)). In the
example below, we use CRI-O.

First, inside the root directory of Kubernetes, run:

```console
$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
  FEATURE_GATES=DynamicResourceAllocation=true \
  DNS_ADDON="coredns" \
  CGROUP_DRIVER=systemd \
  CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
  LOG_LEVEL=6 \
  ENABLE_CSI_SNAPSHOTTER=false \
  API_SECURE_PORT=6444 \
  ALLOW_PRIVILEGED=1 \
  PATH=$(pwd)/third_party/etcd:$PATH \
  ./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:

  export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
```

Once the cluster is up, in another terminal run the test driver controller.
`KUBECONFIG` must be set for all of the following commands.

```console
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
```

In another terminal, run the kubelet plugin:

```console
$ sudo mkdir -p /var/run/cdi && \
  sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
```

Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore the
permissions with `sudo chmod go-w` when done. Alternatively, one can also
build the binary and run it as root.

Now the cluster is ready to create objects:

```console
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME                         RESOURCECLASSNAME   ALLOCATIONMODE         STATE                AGE
test-inline-claim-resource   example             WaitForFirstConsumer   allocated,reserved   8s

$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-inline-claim   0/2     Completed   0          21s
```

The test driver doesn't do much: it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so the log can be checked
to verify that everything worked:

```console
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
```

## {{% heading "whatsnext" %}}

- For more information on the design, see the
  [Dynamic Resource Allocation KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md).
- The Kubernetes documentation covers the feature under [Dynamic Resource
  Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
  in the "Scheduling, Preemption and Eviction" concepts section.
- This feature gets discussed in [SIG
  Node](https://github.com/kubernetes/community/blob/master/sig-node/README.md)
  and the [CNCF Container Orchestrated Device Working
  Group](https://github.com/cncf/tag-runtime/blob/master/wg/COD.md). There is
  a [project board](https://github.com/orgs/kubernetes/projects/95/views/1)
  with open issues that will get addressed next.
- In order to move this feature towards beta, we need feedback from hardware
  vendors, so here's a call to action: try out this feature, consider how it
  can help with problems that your users are having, and write resource
  drivers...