Document NFD for GPU Labeling
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez committed Jan 29, 2024
1 parent 54ab2e8 commit 31c34dc
Showing 1 changed file with 48 additions and 7 deletions: content/en/docs/tasks/manage-gpus/scheduling-gpus.md
@@ -64,7 +64,7 @@ spec:
gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```
## Manage clusters with different types of GPUs
If different nodes in your cluster have different types of GPUs, then you
can use [Node Labels and Node Selectors](/docs/tasks/configure-pod-container/assign-pods-nodes/)
@@ -83,10 +83,51 @@ a different label key if you prefer.
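
The context above (partly collapsed in this diff) describes labelling nodes by GPU type and selecting them from Pods. A minimal sketch, assuming an illustrative `gpu.example/type` label key that is not defined by Kubernetes or by any vendor:

```yaml
# Hypothetical label key and value, added manually (for example with `kubectl label node`).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    gpu.example/type: "example-gpu-x100"
---
apiVersion: v1
kind: Pod
metadata:
  name: needs-x100
spec:
  nodeSelector:
    gpu.example/type: "example-gpu-x100"   # only schedule onto nodes carrying this label
  containers:
    - name: example-gpu-workload
      image: "registry.example/example-gpu-workload:v1"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1
```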

## Automatic node labelling {#node-labeller}

If you're using AMD GPU devices, you can deploy
[Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller).
Node Labeller is a {{< glossary_tooltip text="controller" term_id="controller" >}} that automatically
labels your nodes with GPU device properties.
As an administrator, you can automatically discover and label all your GPU-enabled nodes
by deploying Kubernetes [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) (NFD).
NFD detects the hardware features that are available on each node in a Kubernetes cluster and advertises those features.
Typically, NFD adds node labels to advertise the features, but NFD can also add extended resources, annotations, and node taints.
NFD is compatible with all [supported versions](/releases/version-skew-policy/#supported-versions) of Kubernetes.
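
As an illustrative sketch (not output copied from a real cluster), a node that NFD has processed might carry labels like the following. The `feature.node.kubernetes.io` labels are added by NFD itself; the `gpu.gpu-vendor.example` labels mirror the example names used later on this page and would come from vendor or admin tooling layered on top:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    # Added by NFD: a display-class PCI device (class 0300) from vendor 10de was detected.
    feature.node.kubernetes.io/pci-0300_10de.present: "true"
    feature.node.kubernetes.io/kernel-version.major: "6"
    # Illustrative vendor labels, not produced by NFD itself.
    gpu.gpu-vendor.example/family: "steely-glint"
    gpu.gpu-vendor.example/installed-memory: "81920"
```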

Similar functionality for NVIDIA is provided by
[GPU feature discovery](https://github.com/NVIDIA/gpu-feature-discovery/blob/main/README.md).
Administrators can also use NFD to taint nodes that have specific features, so that only Pods which tolerate those taints can be scheduled onto them (a toleration sketch follows the example below).
Once the GPU nodes in your cluster are labeled, you can schedule Pods onto them by adding the following to your Pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  # You can then use Kubernetes node affinity to schedule pods on GPU nodes.
  # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "gpu.gpu-vendor.example/installed-memory"
                operator: Gt
                values: ["40535"]
              - key: "gpu.gpu-vendor.example/family"
                operator: In
                values: # examples are GCU families not GPU families 😉
                  - arbitrary
                  - armchair-traveller
                  - just-read-the-instructions
                  - steely-glint
  restartPolicy: Never
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
  nodeSelector:
    # Use Kubernetes node selector to schedule pods on GPU nodes.
    # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    gpu-vendor.example/example-gpu-present: "true"
```
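
If you also use NFD (or other tooling) to taint your GPU nodes, as mentioned above, a Pod additionally needs a matching toleration. A minimal sketch, assuming a hypothetical `gpu-vendor.example/gpu` taint with effect `NoSchedule`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add-tolerating
spec:
  # Without this toleration the Pod is repelled by nodes carrying the taint,
  # so only GPU workloads land on the tainted GPU nodes.
  tolerations:
    - key: "gpu-vendor.example/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  restartPolicy: Never
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1
```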
#### GPU vendor implementations
- NVIDIA [GPU feature discovery](https://github.com/NVIDIA/gpu-feature-discovery/#readme).
