
Issues when requesting for more than 1 GPU #25

Closed
jonathan-goh opened this issue Feb 1, 2018 · 9 comments

@jonathan-goh

Hi there,

My Kubernetes cluster is as such

Master (no GPU)
Node 1 (GPU)
Node 2 (GPU)
Node 3 (GPU)
Node 4 (GPU)

Nodes 1 - 4 have Nvidia drivers (384) and nvidia docker 2 installed.

First issue:
When I run the command:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml

The nvidia device plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?

Second issue:
I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod utilising another GPU, the pod status gets stuck on Pending, stating that there are insufficient GPU resources.

How do I solve this? Thanks.

@pineking

pineking commented Feb 1, 2018

> The nvidia device plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?

Correct.
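
For reference, the plugin is deployed as a DaemonSet, so by default its pods are scheduled on every node, including the master. If you do want to keep it off non-GPU nodes, a minimal sketch (an assumption, not part of the upstream manifest) is to label your GPU nodes yourself and add a nodeSelector to the DaemonSet pod template:

```yaml
# Hypothetical excerpt of the device-plugin DaemonSet pod template.
# Assumes GPU nodes were labelled beforehand, e.g.:
#   kubectl label node node1 accelerator=nvidia
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia   # schedule the plugin pods on labelled GPU nodes only
```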

> I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod utilising another GPU, the pod status gets stuck on Pending, stating that there are insufficient GPU resources.

Please paste your YAML file and the output of `kubectl describe node` for each node.

@jonathan-goh
Author

OK, I ran nvidia_pod.yaml and got the following error:
Message: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.

Attached are the descriptions of each node.

node2.txt
node3.txt
node4.txt
node1.txt
nvidia_pod.yml.txt

@jonathan-goh
Author

Am I supposed to have both:

alpha.kubernetes.io/nvidia-gpu: 1
nvidia.com/gpu: 1

in the Capacity and Allocatable fields?

@pineking

pineking commented Feb 1, 2018

@jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.
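
For illustration, a minimal single-GPU pod sketch that fits on any one of these nodes (the names and image are placeholders, not taken from this issue):

```yaml
# Minimal single-GPU pod sketch; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base        # any CUDA-enabled image would do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1              # must not exceed a single node's capacity (1 here)
```

Requesting `nvidia.com/gpu: 2` in one pod can never be satisfied on this cluster, since no single node has 2 GPUs, which is exactly what the "Insufficient nvidia.com/gpu" event reports.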

@RenaudWasTaken
Contributor

Hello @jonathan-goh !

> @jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.

Looks like it, thanks for handling this issue @pineking!

> Am I supposed to have both:
> alpha.kubernetes.io/nvidia-gpu: 1
> nvidia.com/gpu: 1
> in the Capacity and Allocatable fields?

Ideally it's better if you don't enable the Accelerators flag on the kubelet, though it shouldn't have any impact.

@jonathan-goh
Author

@pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

@RenaudWasTaken That is the thing, I already removed it, reloaded, and restarted the kubelet, but it is still there.

@RenaudWasTaken
Contributor

> @pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

If you are talking about MPI, that's not supported yet, but we are working on it :)

@pineking

pineking commented Feb 2, 2018

> If you are talking about MPI, that's not supported yet, but we are working on it :)

@RenaudWasTaken Are there any issues on GitHub or docs/links to track the progress?

> @pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

@jonathan-goh For distributed training, you can create more than one pod (worker), each with 1 GPU. See https://github.com/kubeflow/kubeflow and https://github.com/tensorflow/k8s.
For TensorFlow and MPI, see https://github.com/uber/horovod.
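
As a rough sketch of that idea (placeholder names and image; a real setup would use one of the projects above, e.g. a TFJob from tensorflow/k8s, to wire the workers together), two independent worker pods each requesting a single GPU could look like:

```yaml
# Rough sketch: two single-GPU worker pods; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:1.5.0-gpu   # example image only
    resources:
      limits:
        nvidia.com/gpu: 1                    # one GPU per worker, one worker per node
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-1
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:1.5.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```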

> @RenaudWasTaken That is the thing, I already removed it, reloaded, and restarted the kubelet, but it is still there.

@jonathan-goh I think you can ignore it.

@RenaudWasTaken
Contributor

> @RenaudWasTaken Are there any issues on GitHub or docs/links to track the progress?

Nope, it's on our roadmap, but it really depends on getting the Resource Class API merged.
