---
layout: blog
title: "Dynamic Resource Allocation in Kubernetes 1.26"
date:
slug: dynamic-resource-allocation
---

**Authors:** Patrick Ohly (Intel), Kevin Klues (NVIDIA)

Dynamic resource allocation is a new API for requesting resources. It is a
generalization of the persistent volumes API for generic resources, making it possible to:

- access the same resource instance in different pods and containers,
- attach arbitrary constraints to a resource request to get the exact resource
you are looking for,
- initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an *alpha feature* and only enabled when the
`DynamicResourceAllocation` [feature
gate](/docs/reference/command-line-tools-reference/feature-gates/) and the
`resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} are enabled. For details, see the
`--feature-gates` and `--runtime-config` [kube-apiserver
parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
The kube-scheduler, kube-controller-manager and kubelet components all need
the feature gate enabled as well.
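
How the gate and the API group get enabled depends on how you manage your
cluster. As an illustration only, here is a sketch for a kubeadm-based cluster
(kubeadm itself is not required for this feature; adapt the idea to your own
setup):

```yaml
# Hypothetical kubeadm configuration excerpt: enables the feature gate for
# kube-apiserver, kube-controller-manager and kube-scheduler, the alpha API
# group for kube-apiserver, and the feature gate for the kubelet.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
    runtime-config: "resource.k8s.io/v1alpha1=true"
controllerManager:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
```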

The default configuration of kube-scheduler enables the "DynamicResources"
plugin if and only if the feature gate is enabled. Custom configurations may
have to be modified to include it.
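
If you run the kube-scheduler with a custom configuration that lists plugins
explicitly, a sketch of such a modification could look like this (assuming the
`kubescheduler.config.k8s.io/v1` configuration API and the multi-point
extension mechanism):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      # Enable the DynamicResources plugin at all extension points it implements.
      - name: DynamicResources
```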

Once dynamic resource allocation is enabled, resource drivers can be installed
to manage certain kinds of hardware. Kubernetes has a test driver that is used
for end-to-end testing, but can also be run manually. See
[below](#running-the-test-driver) for step-by-step instructions.

## API

The new `resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
term_id="api-group" >}} provides four new types:

ResourceClass
: Defines which resource driver handles a certain kind of
resource and provides common parameters for it. ResourceClasses
are created by a cluster administrator when installing a resource
driver.

ResourceClaim
: Defines a particular resource instance that is required by a
workload. Created by a user (lifecycle managed manually, can be shared
between different Pods) or for individual Pods by the control plane based on
a ResourceClaimTemplate (automatic lifecycle, typically used by just one
Pod).

ResourceClaimTemplate
: Defines the spec and some metadata for creating
ResourceClaims. Created by a user when deploying a workload.

PodScheduling
: Used internally by the control plane and resource drivers
to coordinate pod scheduling when ResourceClaims need to be allocated
for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects,
typically using the type defined by a {{< glossary_tooltip
term_id="CustomResourceDefinition" text="CRD" >}} that was created when
installing a resource driver.

The `core/v1` `PodSpec` defines ResourceClaims that are needed for a Pod in a new
`resourceClaims` field. Entries in that list reference either a ResourceClaim
or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
this PodSpec (for example, inside a Deployment or StatefulSet) share the same
ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
its own instance.

The `resources.claims` list for container resources defines whether a container gets
access to these resource instances, which makes it possible to share resources
between containers.

Here is an example of a fictional resource driver. Two ResourceClaim objects
will get created for this Pod and each container gets access to one of them.

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: resource.example.com
driverName: resource-driver.example.com
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
metadata:
  name: large-black-cat-claim-parameters
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cat-claim-parameters
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers:
  - name: container0
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: container1
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
```

## Scheduling

In contrast to native resources (CPU, RAM) and extended resources (managed by a
device plugin, advertised by kubelet), the scheduler has no knowledge of what
dynamic resources are available in a cluster or how they could be split up to
satisfy the requirements of a specific ResourceClaim. Resource drivers are
responsible for that. They mark ResourceClaims as "allocated" once resources
for them are reserved. This also tells the scheduler where in the cluster a
ResourceClaim is available.

ResourceClaims can get allocated as soon as they are created ("immediate
allocation"), without considering which Pods will use them. The default is to
delay allocation until a Pod gets scheduled which needs the ResourceClaim
(i.e. "wait for first consumer").

In that mode, the scheduler checks all ResourceClaims needed by a Pod and
creates a PodScheduling object where it informs the resource drivers
responsible for those ResourceClaims about nodes that the scheduler considers
suitable for the Pod. The resource drivers respond by excluding nodes that
don't have enough of the driver's resources left. Once the scheduler has that
information, it selects one node and stores that choice in the PodScheduling
object. The resource drivers then allocate their ResourceClaims so that the
resources will be available on that node. Once that is complete, the Pod
gets scheduled.
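
Here is a rough sketch of what a PodScheduling object might contain during that
negotiation. The node names are made up; the object is created and updated by
the scheduler and the resource drivers, not by users:

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  name: pod-with-cats         # same name and namespace as the Pod
spec:
  potentialNodes:             # nodes the scheduler considers suitable
  - worker-1
  - worker-2
  selectedNode: worker-1      # the scheduler's current choice
status:
  resourceClaims:             # filled in by the resource drivers
  - name: cat-0
    unsuitableNodes:
    - worker-2
```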

As part of this process, ResourceClaims also get reserved for the
Pod. Currently ResourceClaims can either be used exclusively by a single Pod or
by an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of
their resources are allocated and reserved. This avoids the scenario where a Pod
gets scheduled onto one node and then cannot run there, which is bad because
such a pending Pod also blocks all other resources like RAM or CPU that were
set aside for it.

## Limitations

The scheduler plugin must be involved in scheduling Pods which use
ResourceClaims. Bypassing the scheduler by setting the `nodeName` field leads
to Pods that the kubelet refuses to start because the ResourceClaims are not
reserved or not even allocated. It may be possible to [remove this
limitation](https://github.com/kubernetes/kubernetes/issues/114005) in the
future.

## Writing a resource driver

A dynamic resource allocation driver consists of two separate-but-coordinating
components: a centralized controller and a DaemonSet of node-local kubelet
plugins. Most of the work required by the centralized controller to coordinate
with the scheduler can be handled by boilerplate code. Only the business logic
required to actually allocate ResourceClaims against the ResourceClasses owned
by the plugin needs to be customized. Kubernetes therefore provides the
following package, which includes APIs for invoking this boilerplate code as
well as a Driver interface that one can implement to provide the custom
business logic:

- [k8s.io/dynamic-resource-allocation/controller](https://github.com/kubernetes/dynamic-resource-allocation/tree/master/controller)

Likewise, boilerplate code can be used to register the node-local plugin with
the kubelet, as well as start a gRPC server to implement the kubelet plugin
API. The following package is provided for this purpose:

- [k8s.io/dynamic-resource-allocation/kubeletplugin](https://github.com/kubernetes/dynamic-resource-allocation/tree/master/kubeletplugin)

It is up to the driver developer to decide how these two components
communicate. The KEP outlines an [approach using
CRDs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#implementing-a-plugin-for-node-resources).

We also plan to provide a complete [example
driver](https://github.com/kubernetes-sigs/dra-example-driver) that can serve
as a template for other drivers.

## Running the test driver

The following steps bring up a local, one-node cluster directly from the
Kubernetes source code. As a prerequisite, a [CDI-enabled](https://github.com/container-orchestrated-devices/container-device-interface) container runtime must be installed on your system (e.g. containerd [v1.7+](https://github.com/containerd/containerd/releases/tag/v1.7.0-beta.0) or CRI-O [v1.23.2+](https://github.com/cri-o/cri-o/releases/tag/v1.23.2)). In the example below we use CRI-O.

First, inside the root directory of Kubernetes, run:

```console
$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
FEATURE_GATES=DynamicResourceAllocation=true \
DNS_ADDON="coredns" \
CGROUP_DRIVER=systemd \
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
LOG_LEVEL=6 \
ENABLE_CSI_SNAPSHOTTER=false \
API_SECURE_PORT=6444 \
ALLOW_PRIVILEGED=1 \
PATH=$(pwd)/third_party/etcd:$PATH \
./hack/local-up-cluster.sh -O
...
To start using your cluster, you can open up another terminal/tab and run:

export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
...
```

Once the cluster is up, run the test driver controller in another
terminal. `KUBECONFIG` must be set for all of the following commands.

```console
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller
```

In another terminal, run the kubelet plugin:

```console
$ sudo mkdir -p /var/run/cdi && \
sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
```

Changing the permissions of the directories makes it possible to run and (when
using delve) debug the kubelet plugin as a normal user, which is convenient
because it uses the already populated Go cache. Remember to restore permissions
with `sudo chmod go-w` when done. Alternatively, one can also build the binary
and run that as root.
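
For example (a sketch; the output path `/tmp/dra-test-driver` is arbitrary, and
`KUBECONFIG` points at the file printed by `local-up-cluster.sh` above):

```console
$ go build -o /tmp/dra-test-driver ./test/e2e/dra/test-driver
$ sudo KUBECONFIG=/var/run/kubernetes/admin.kubeconfig \
  /tmp/dra-test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin
```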

Now the cluster is ready to create objects:

```console
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME                         RESOURCECLASSNAME   ALLOCATIONMODE         STATE                AGE
test-inline-claim-resource   example             WaitForFirstConsumer   allocated,reserved   8s

$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-inline-claim   0/2     Completed   0          21s
```

The test driver doesn't do much: it only sets environment variables as defined
in the ConfigMap. The test pod dumps the environment, so the log can be checked
to verify that everything worked:

```console
$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'
```

## {{% heading "whatsnext" %}}

- For more information on the design, see the
[Dynamic Resource Allocation KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md).
- The Kubernetes documentation covers this feature under [Dynamic Resource
  Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
  in the "Scheduling, Preemption and Eviction" concepts section.
- This feature gets discussed in [SIG
Node](https://github.com/kubernetes/community/blob/master/sig-node/README.md)
and the [CNCF Container Orchestrated Device Working
Group](https://github.com/cncf/tag-runtime/blob/master/wg/COD.md). There is
a [project board](https://github.com/orgs/kubernetes/projects/95/views/1)
with open issues that will get addressed next.
- In order to move this feature towards beta, we need feedback from hardware
vendors, so here's a call to action: try out this feature, consider how it can help
with problems that your users are having, and write resource drivers...
