
Using in clusters which contain both GPU nodes and non-GPU nodes #9

Closed
idealhack opened this issue Nov 22, 2017 · 9 comments

@idealhack

When using daemon sets in this kind of cluster, non-GPU nodes will complain:

```
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=ALL --utility --compute --pid=16424 /var/lib/docker/overlay/a86473af4c52afb44dfdfdcc817edb45316d520cccfb086d87cc227314d09015/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""
```

It's straightforward to use taints (which could be documented), but how about also handling it in this plugin (i.e. better error handling)?
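For reference, a minimal sketch of what the taint side of that approach could look like; the node name and the taint key/value below are placeholders, not anything this plugin defines:

```yaml
# Sketch only: taint the GPU nodes so that pods without a matching
# toleration (i.e. ordinary, non-GPU workloads) are not scheduled there.
# Imperative equivalent:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1          # placeholder node name
spec:
  taints:
  - key: nvidia.com/gpu     # placeholder taint key
    value: present
    effect: NoSchedule
```

GPU workloads (and this plugin's DaemonSet) would then need a matching toleration to be scheduled onto those nodes.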

@flx42
Member

flx42 commented Nov 22, 2017

Sure, it's a problem. But it looks like you also installed nvidia-container-runtime on this node?

@flx42
Member

flx42 commented Nov 22, 2017

@RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right? We already have a case for this, but it assumes NVML is present and working:
https://github.com/NVIDIA/k8s-device-plugin/blob/master/main.go#L46-L49

@RenaudWasTaken
Contributor

> @RenaudWasTaken when deploying to a node with no GPU, we should wait indefinitely, right?

The way I expect device plugins to work is to stop when they detect that no devices are available on the node.
I also expect the restart policy to be OnFailure.

@flx42
Member

flx42 commented Nov 22, 2017

Fixed by 9b54e91

@flx42 closed this as completed Nov 22, 2017
@idealhack
Author

@RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template

As of now, a node without a GPU will be in a crash loop.

@RenaudWasTaken
Contributor

> @RenaudWasTaken “A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.” https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#pod-template

Looks like we'll have to update the k8s docs.

> As of now, a node without a GPU will be in a crash loop.

I just pushed a fix for that :)

@idealhack
Author

idealhack commented Nov 23, 2017

@RenaudWasTaken I'm getting `The DaemonSet "nvidia-device-plugin-daemonset" is invalid: spec.template.spec.restartPolicy: Unsupported value: "OnFailure": supported values: Always` when applying the manifest on k8s 1.8.1, which matches the docs. Also, I couldn't find any indication that this has changed in newer versions.

@RenaudWasTaken
Contributor

RenaudWasTaken commented Nov 23, 2017

@idealhack thanks for noticing this mistake.
Looks like you have to manually label your nodes if you don't want the plugin to run on every node.
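As a sketch of that labeling approach: the label key/value, image reference, and apps/v1 apiVersion below are illustrative only (the manifest shipped with this repo may differ, and on k8s 1.8 the DaemonSet API group is still extensions/v1beta1):

```yaml
# Sketch only: label the GPU nodes, e.g.
#   kubectl label nodes gpu-node-1 hardware-type=NVIDIAGPU
# then restrict the plugin DaemonSet to those nodes with a nodeSelector.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeSelector:
        hardware-type: NVIDIAGPU          # example label; any key/value works
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin   # placeholder; use the released image/tag
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

With the nodeSelector in place, the DaemonSet controller never creates a plugin pod on unlabeled (non-GPU) nodes, so there is nothing left to crash-loop.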

@idealhack
Author

So crashing is actually the right behavior for the GPU plugin on a non-GPU node, right?

Speaking of docs, how about we add a note to the README about using taints to handle this kind of cluster, like https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#example-use-cases?
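For such a README note, a rough sketch of the toleration side, assuming the GPU nodes carry an example taint such as nvidia.com/gpu=present:NoSchedule (as in the sketch earlier in this thread); the fragment below would go into the pod templates of GPU workloads and of the plugin DaemonSet itself:

```yaml
# Sketch only: pod-template fragment tolerating the example taint, so the
# pod can still be scheduled onto the tainted GPU nodes.
tolerations:
- key: nvidia.com/gpu       # must match the taint key used on the GPU nodes
  operator: Equal
  value: present
  effect: NoSchedule
```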
