K3S GPU Support #7601

ald0t1 · 2023-05-23T08:26:29Z

ald0t1
May 23, 2023

Environmental Info:
K3s Version: v1.25.8+k3s1

Node(s) CPU architecture, OS, and Version: Ubuntu 20.04.1 x86_64 GNU/Linux

Cluster Configuration: 1 server, 2 agents

Describe the bug:

The problem occurs when trying to give a node access to an NVIDIA GPU. Configuring the config.toml.tmpl file does not seem to work properly, and the node does not get access to GPU resources. One way I managed to make it work is to install containerd on the system and then apply a config.toml.tmpl file; that works (tested with an NVIDIA GPU job). But doing the same with the containerd built into the agent either puts the node into the NotReady state, with network issues reported when describing the node, or fails to allocate GPU resources.
It does work with some older versions, but I could not find any relevant information on how to set up the tmpl file to work with a GPU.

Steps To Reproduce:

Installed K3s:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -
Installed the NVIDIA drivers and container runtime following the docs: [NVIDIA K8s device plugin](https://github.com/NVIDIA/k8s-device-plugin)
sudo mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
sudo wget https://k3d.io/v4.4.8/usage/guides/cuda/config.toml.tmpl -O /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
Tried to modify the tmpl file to make it run with my K3s version, but with no success
sudo systemctl start k3s && sudo kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Expected behavior:

kubectl describe node gpu-node should list GPU resources, and any GPU job should run.

Actual behavior:

kubectl describe node gpu-node lists only CPU and memory, among others, as resources, and any GPU job pod is stuck in the Pending state with an error that it cannot find any nvidia/gpu.
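
For anyone reproducing this, the check on the node can be scripted. The following is only a minimal sketch: it assumes the device plugin registers the resource under the name nvidia.com/gpu (the name the NVIDIA device plugin uses), and gpu_advertised is a hypothetical helper name, not part of kubectl or K3s. It reads the describe output from stdin, so it can also be run against saved output.

```shell
#!/bin/sh
# gpu_advertised (hypothetical helper): reads `kubectl describe node` output
# on stdin and reports whether the nvidia.com/gpu resource appears in it.
gpu_advertised() {
    # -F matches the resource name literally, -q suppresses grep's own output.
    if grep -qF 'nvidia.com/gpu'; then
        echo "GPU resource advertised"
    else
        echo "no GPU resource found"
        return 1
    fi
}

# Usage, with the node name from this post:
#   kubectl describe node gpu-node | gpu_advertised
```

If the helper reports that no GPU resource was found, the device plugin has not registered the GPU with the kubelet, which matches the Pending pods described above.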

brandond · 2023-05-24T00:59:14Z

brandond
May 24, 2023
Collaborator

> Trying to configure the config.toml.tmpl file seems to not work properly and the node does not get resource access to Gpu

Why are you trying to configure the containerd config template? You don't need to do this for GPU runtime detection to work. The only reason to do it is to change the default runtime to one of the nvidia runtimes, but I wouldn't do that until you have confirmed that you can actually use the runtimes in a standard configuration, as described at https://docs.k3s.io/advanced#nvidia-container-runtime-support
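
As a rough sketch of the standard configuration referred to here: the linked docs describe checking the containerd config that K3s generates for an nvidia runtime entry, then selecting that runtime through a RuntimeClass. The helper names nvidia_runtime_detected and print_runtimeclass and the K3S_CONTAINERD_CONFIG variable are hypothetical, introduced only for illustration. The config path is the generated config.toml, not the config.toml.tmpl template.

```shell
#!/bin/sh
# Path of the containerd config that K3s generates (assumed default location).
K3S_CONTAINERD_CONFIG="${K3S_CONTAINERD_CONFIG:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"

# nvidia_runtime_detected (hypothetical helper): succeeds if the given config
# file, or the default one, mentions an nvidia runtime.
nvidia_runtime_detected() {
    grep -q 'nvidia' "${1:-$K3S_CONTAINERD_CONFIG}" 2>/dev/null
}

# print_runtimeclass (hypothetical helper): prints a RuntimeClass manifest
# that maps the name "nvidia" to the containerd handler of the same name.
print_runtimeclass() {
    cat <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
}

# Usage on the GPU node, after installing the NVIDIA container runtime
# packages and restarting K3s:
#   nvidia_runtime_detected && print_runtimeclass | sudo kubectl apply -f -
```

Pods that need the GPU would then set runtimeClassName: nvidia in their spec, which leaves the default runtime and the config template untouched.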
