
Issues when requesting for more than 1 GPU #25

Closed
jonathan-goh opened this issue Feb 1, 2018 · 9 comments

@jonathan-goh

Hi there,

My Kubernetes cluster is as such

Master (no GPU)
Node 1 (GPU)
Node 2 (GPU)
Node 3 (GPU)
Node 4 (GPU)

Nodes 1 - 4 have Nvidia drivers (384) and nvidia docker 2 installed.

First issue:
When I run the command:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml

The nvidia device plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?

Second issue:
I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod utilising another GPU, the pod status gets stuck on Pending, stating that there are insufficient GPU resources.

How do I solve this? Thanks.

@pineking

pineking commented Feb 1, 2018

> The nvidia device plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?

Correct.
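
For reference, the plugin is deployed as a DaemonSet, so by default its pods are scheduled on every node, including the master. If you do want to keep it off non-GPU nodes, a minimal sketch (an assumption, not part of the upstream manifest) is to label your GPU nodes yourself and add a nodeSelector to the DaemonSet pod template:

```yaml
# Hypothetical excerpt of the device-plugin DaemonSet pod template.
# Assumes GPU nodes were labelled beforehand, e.g.:
#   kubectl label node node1 accelerator=nvidia
spec:
  template:
    spec:
      nodeSelector:
        accelerator: nvidia   # schedule the plugin pods on labelled GPU nodes only
```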

> I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod utilising another GPU, the pod status gets stuck on Pending, stating that there are insufficient GPU resources.

Please paste your YAML file and the output of `kubectl describe node` for each node.

@jonathan-goh
Author

OK, I ran nvidia_pod.yaml and got the following error:
Message: 0/5 nodes are available: 5 Insufficient nvidia.com/gpu.

Attached are the descriptions of each node.

node2.txt
node3.txt
node4.txt
node1.txt
nvidia_pod.yml.txt

@jonathan-goh
Author

Am I supposed to have both:

alpha.kubernetes.io/nvidia-gpu: 1
nvidia.com/gpu: 1

in the Capacity and Allocatable fields?

@pineking

pineking commented Feb 1, 2018

@jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.
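
For illustration, a minimal single-GPU pod sketch that fits on any one of these nodes (the names and image are placeholders, not taken from this issue):

```yaml
# Minimal single-GPU pod sketch; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base        # any CUDA-enabled image would do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1              # must not exceed a single node's capacity (1 here)
```

Requesting `nvidia.com/gpu: 2` in one pod can never be satisfied on this cluster, since no single node has 2 GPUs, which is exactly what the "Insufficient nvidia.com/gpu" event reports.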

@RenaudWasTaken
Contributor

Hello @jonathan-goh !

> @jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.

Looks like it, thanks for handling this issue @pineking!

> Am I supposed to have both:
> alpha.kubernetes.io/nvidia-gpu: 1
> nvidia.com/gpu: 1
> in the Capacity and Allocatable fields?

Ideally it's better if you don't enable the Accelerators flag on the kubelet, though it shouldn't have any impact.

@jonathan-goh
Author

@pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

@RenaudWasTaken That is the thing, I already removed it, reloaded, and restarted the kubelet, but it is still there.

@RenaudWasTaken
Contributor

> @pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

If you are talking about MPI, that's not supported yet, but we are working on it :)

@pineking

pineking commented Feb 2, 2018

> If you are talking about MPI, that's not supported yet, but we are working on it :)

@RenaudWasTaken Are there any issues on GitHub or docs/links to track the progress?

> @pineking Oh, OK. Sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment?

@jonathan-goh For distributed training, you can create more than one pod (worker), each with 1 GPU. See https://github.com/kubeflow/kubeflow and https://github.com/tensorflow/k8s.
For TensorFlow and MPI, see https://github.com/uber/horovod.
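
As a rough sketch of that idea (placeholder names and image; a real setup would use one of the projects above, e.g. a TFJob from tensorflow/k8s, to wire the workers together), two independent worker pods each requesting a single GPU could look like:

```yaml
# Rough sketch: two single-GPU worker pods; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:1.5.0-gpu   # example image only
    resources:
      limits:
        nvidia.com/gpu: 1                    # one GPU per worker, one worker per node
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-1
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:1.5.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
```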

> @RenaudWasTaken That is the thing, I already removed it, reloaded, and restarted the kubelet, but it is still there.

@jonathan-goh I think you can ignore it.

@RenaudWasTaken
Contributor

> @RenaudWasTaken Are there any issues on GitHub or docs/links to track the progress?

Nope, it's on our roadmap, but it really depends on getting the Resource Class API merged.
