Issues when requesting for more than 1 GPU #25
Comments
Correct.
Paste your YAML file and the output of
OK. I ran nvidia_pod.yaml and got the following error: Attached are the descriptions of each node.
Am I supposed to have alpha.kubernetes.io/nvidia-gpu: 1 in both the capacity and allocatable fields?
@jonathan-goh There is only 1 GPU on each node, so you cannot request 2 GPUs in one pod.
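For context, a node that exposes a single GPU reports it under both the capacity and allocatable fields of its status, which is also what the question above asks about. A rough sketch of the relevant part of `kubectl describe node` output (values are illustrative; with the v1.9 device plugin the resource name is `nvidia.com/gpu`, while the older Accelerators path used `alpha.kubernetes.io/nvidia-gpu`):

```yaml
# Illustrative snippet of a node's status with one GPU.
# Because allocatable shows only 1 GPU, a pod requesting 2 GPUs
# can never be scheduled onto this node and stays Pending.
status:
  capacity:
    cpu: "8"
    memory: 32780028Ki
    nvidia.com/gpu: "1"
  allocatable:
    cpu: "8"
    memory: 32677628Ki
    nvidia.com/gpu: "1"
```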
Hello @jonathan-goh!
Looks like it, thanks for handling this issue @pineking!
Ideally it's better if you don't enable the Accelerators flag on kubelet, though it shouldn't have any impact.
@pineking Oh OK, sorry! I did not know that, as I am really new to this! But let's say I want to do distributed learning on my cluster, how do I do that? Do I use a Deployment? @RenaudWasTaken That is the thing, I already removed it, reloaded and restarted the kubelet, but it is still there.
If you are talking about MPI, that's not supported yet but we are working on it :) |
@RenaudWasTaken Are there some issues on GitHub or docs/links to track the progress?
@jonathan-goh For distributed training, you can create more than one pod (worker), each with 1 GPU. See https://github.com/kubeflow/kubeflow and https://github.com/tensorflow/k8s
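A minimal sketch of that idea using plain Kubernetes objects rather than a Kubeflow/TFJob manifest: a Deployment with several replicas, each replica requesting exactly one GPU, so the scheduler spreads the workers across the GPU nodes. The image name and replica count below are placeholders.

```yaml
# Hypothetical multi-worker setup: each pod asks for a single GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-workers
spec:
  replicas: 2                       # one worker per GPU you want to use
  selector:
    matchLabels:
      app: tf-worker
  template:
    metadata:
      labels:
        app: tf-worker
    spec:
      containers:
      - name: worker
        image: tensorflow/tensorflow:latest-gpu   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1       # exactly one GPU per worker pod
```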
@jonathan-goh I think you can ignore it.
Nope, it's on our roadmap but really depends on getting the Resource Class API merged. |
Hi there,
My Kubernetes cluster is as such
Master (no GPU)
Node 1 (GPU)
Node 2 (GPU)
Node 3 (GPU)
Node 4 (GPU)
Nodes 1-4 have NVIDIA drivers (384) and nvidia-docker 2 installed.
First issue:
When I run the command
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
the NVIDIA device plugin is also running on the master node, which has no NVIDIA drivers or nvidia-docker installed. Is this behaviour correct?
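On the first issue: the plugin is deployed as a DaemonSet, so by default it is scheduled on every node, including GPU-less masters, where it simply finds no devices to advertise. If you prefer to keep it off nodes without GPUs, one common approach is to label the GPU nodes and add a nodeSelector to the DaemonSet's pod template; the label below is made up for illustration.

```yaml
# Hypothetical: after labelling GPU nodes (e.g. kubectl label node node1 hardware=gpu),
# constrain the device-plugin DaemonSet's pod template to those nodes only.
spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu   # made-up label; use whatever label marks your GPU nodes
```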
Second issue:
I can only run 1 GPU on my cluster at a time. For example, if I run the TensorFlow notebook with 1 GPU, it works. But if I deploy another pod using another GPU, the pod gets stuck in Pending, stating that there are insufficient GPU resources.
How do I solve this? Thanks.
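On the second issue: with the v1.9 device plugin, pods should request the `nvidia.com/gpu` resource; the legacy `alpha.kubernetes.io/nvidia-gpu` resource belongs to the older Accelerators feature gate and is tracked separately, so mixing the two can make GPUs look unavailable. A rough sketch of a single-GPU pod (the image is a placeholder):

```yaml
# Minimal single-GPU pod; any GPU node with a free GPU should be able to run it.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base    # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request exactly one GPU
```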