
Using in clusters which contain both GPU nodes and non-GPU nodes #9

Closed
idealhack opened this issue Nov 22, 2017 · 9 comments

@idealhack

When using daemon sets in this kind of cluster, non-GPU nodes will complain:

```
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=ALL --utility --compute --pid=16424 /var/lib/docker/overlay/a86473af4c52afb44dfdfdcc817edb45316d520cccfb086d87cc227314d09015/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""
```

It's straightforward to use taints (which could be documented), but how about also handling it in this plugin (i.e. better error handling)?
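For reference, a minimal sketch of what the taint side of that approach could look like; the node name and the taint key/value below are placeholders, not anything this plugin defines:

```yaml
# Sketch only: taint the GPU nodes so that pods without a matching
# toleration (i.e. ordinary, non-GPU workloads) are not scheduled there.
# Imperative equivalent:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1          # placeholder node name
spec:
  taints:
  - key: nvidia.com/gpu     # placeholder taint key
    value: present
    effect: NoSchedule
```

GPU workloads (and this plugin's DaemonSet) would then need a matching toleration to be scheduled onto those nodes.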

@flx42
Member

flx42 commented Nov 22, 2017

Sure, it's a problem. But it looks like you also installed nvidia-container-runtime on this node?

@flx42
Member

flx42 commented Nov 22, 2017

@RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right? We already have a case for this, but it assumes NVML is present and working:
https://github.com/NVIDIA/k8s-device-plugin/blob/master/main.go#L46-L49

@RenaudWasTaken
Contributor

> @RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right?

The way I expect device plugins to work is to stop when they detect that no devices are available on the node.
I also expect the restart policy to be OnFailure.

@flx42
Member

flx42 commented Nov 22, 2017

Fixed by 9b54e91

@flx42 closed this as completed Nov 22, 2017
@idealhack
Author

@RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template

As of now, a node without a GPU will be in a crash loop.

@RenaudWasTaken
Contributor

> @RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template

Looks like we'll have to update the k8s docs.

> As of now, a node without a GPU will be in a crash loop.

I just pushed a fix for that :)

@idealhack
Author

idealhack commented Nov 23, 2017

@RenaudWasTaken I'm getting `The DaemonSet "nvidia-device-plugin-daemonset" is invalid: spec.template.spec.restartPolicy: Unsupported value: "OnFailure": supported values: Always` when applying the manifest on k8s 1.8.1, which matches the docs. Also, I couldn't find any indication that this has changed in newer versions.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Nov 23, 2017

@idealhack thanks for noticing this mistake.
Looks like you have to manually label your nodes if you don't want the plugin to run on every node.
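As a sketch of that labeling approach: the label key/value, image reference, and apps/v1 apiVersion below are illustrative only (the manifest shipped with this repo may differ, and on k8s 1.8 the DaemonSet API group is still extensions/v1beta1):

```yaml
# Sketch only: label the GPU nodes, e.g.
#   kubectl label nodes gpu-node-1 hardware-type=NVIDIAGPU
# then restrict the plugin DaemonSet to those nodes with a nodeSelector.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        hardware-type: NVIDIAGPU          # example label; any key/value works
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin   # placeholder; use the released image/tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

With the nodeSelector in place, the DaemonSet controller never creates a plugin pod on unlabeled (non-GPU) nodes, so there is nothing left to crash-loop.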

@idealhack
Author

So crashing is actually the right behavior for the GPU plugin on a non-GPU node, right?

Speaking of docs, how about we add a note to the README about using taints to handle this kind of cluster, like https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#example-use-cases?
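For such a README note, a rough sketch of the toleration side, assuming the GPU nodes carry an example taint such as nvidia.com/gpu=present:NoSchedule (as in the sketch earlier in this thread); the fragment below would go into the pod templates of GPU workloads and of the plugin DaemonSet itself:

```yaml
# Sketch only: pod-template fragment tolerating the example taint, so the
# pod can still be scheduled onto the tainted GPU nodes.
tolerations:
- key: nvidia.com/gpu       # must match the taint key used on the GPU nodes
  operator: Equal
  value: present
  effect: NoSchedule
```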
