K3S GPU Support
#7601
Replies: 1 comment
-
Why are you trying to configure the containerd config template? You don't need to do this for GPU runtime detection to work. The only reason to edit the template is to change the default runtime to one of the nvidia runtimes, and I wouldn't do that until you have confirmed that you can actually use the runtimes in a standard configuration, as described at https://docs.k3s.io/advanced#nvidia-container-runtime-support
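As a first check for the standard configuration mentioned above, you can look at whether the containerd config that K3s generated mentions an nvidia runtime at all. The following is a minimal sketch, not an official K3s tool: the function name `has_nvidia_runtime` is hypothetical, and the default path assumes the usual K3s data directory under /var/lib/rancher/k3s.

```shell
#!/bin/sh
# Hypothetical helper: report whether a containerd config file mentions
# an nvidia runtime. The default path assumes the standard K3s data dir,
# where K3s writes its generated config.toml.
has_nvidia_runtime() {
    config="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
    if [ ! -r "$config" ]; then
        # Return 2 so "unreadable file" is distinguishable from "not found".
        echo "cannot read $config" >&2
        return 2
    fi
    # grep -q exits 0 when at least one line contains "nvidia".
    grep -q 'nvidia' "$config"
}

# Usage on the GPU node (the file may only be readable by root):
#   has_nvidia_runtime && echo "nvidia runtime detected" || echo "no nvidia runtime found"
```

If nothing is found, the likely place to look is the NVIDIA container runtime installation on the host rather than the config template.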
-
Environmental Info:
K3s Version: v1.25.8+k3s1
Node(s) CPU architecture, OS, and Version: Ubuntu 20.04.1 x86_64 GNU/Linux
Cluster Configuration: 1 server, 2 agents
Describe the bug:
The bug occurs when trying to give a node access to an NVIDIA GPU. Configuring the config.toml.tmpl file does not seem to work properly, and the node does not get access to GPU resources. One way I managed to make it work is to install containerd on the system and then apply a config.toml.tmpl file (tested with an NVIDIA GPU job). But doing the same with the built-in containerd on the agent either breaks the node, which goes to NotReady with network issues shown when you describe it, or fails to allocate GPU resources.
It does work with some older versions, but I could not find any relevant information on how to set up the tmpl file to work with a GPU.
Steps To Reproduce:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -
sudo mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
sudo wget https://k3d.io/v4.4.8/usage/guides/cuda/config.toml.tmpl -O /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl start k3s && sudo kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Expected behavior:
The kubectl describe node gpu-node command should report GPU resources, and running any GPU job should work.
Actual behavior:
The kubectl describe node gpu-node command returns only CPU and memory, among others, as resources, and any GPU job pod is stuck in the Pending state with an error that it cannot find any nvidia/gpu.
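The expected-versus-actual check above can be scripted so it is easy to repeat after each configuration change. This is only a sketch: the function name `node_has_gpu` is hypothetical, and it assumes the node description lists extended resources as `nvidia.com/gpu:` followed by a count, which is how the Capacity and Allocatable sections are normally printed.

```shell
#!/bin/sh
# Hypothetical helper: read a node description on stdin and succeed only
# if it advertises a nonzero nvidia.com/gpu count.
# Assumes lines of the form "  nvidia.com/gpu:     1".
node_has_gpu() {
    awk '$1 == "nvidia.com/gpu:" && $2 + 0 > 0 { found = 1 }
         END { exit !found }'
}

# Usage (gpu-node is the node name from the report above):
#   kubectl describe node gpu-node | node_has_gpu \
#       && echo "GPU advertised" || echo "no GPU resource on node"
```

A node that never shows the resource points at the device plugin or runtime setup, while a node that shows it with pods still Pending points at the pod spec or scheduling.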