GPU not consuming for Katib experiment - GKE Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory #1915
The problem is with the image that you have created, not with Katib. Did you install GPU drivers in the image?
I am able to execute "nvidia-smi" on the image and get the correct output. For this to happen, shouldn't the drivers already be installed in the image? Just to be sure, can you provide me with details on how to use GPU drivers in the image?
You can use NVIDIA NGC containers based on your framework: https://catalog.ngc.nvidia.com/containers
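For example, one way to verify that an NGC image can see the GPU before using it in Katib (the image tag below is only illustrative; pick a current one for your framework from the NGC catalog):

```shell
# Pull an NGC TensorFlow image and confirm the driver is visible from inside it.
# The tag 23.03-tf2-py3 is an example; check the NGC catalog for current tags.
docker pull nvcr.io/nvidia/tensorflow:23.03-tf2-py3
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:23.03-tf2-py3 nvidia-smi
```

If `nvidia-smi` prints the GPU table here but not inside the trial pod, the problem is in scheduling or resource requests, not in the image.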
I have tried using the Nvidia NGC containers as mentioned below
PS: I have pulled the same image in both containers of my pipeline but I am still getting this problem. Also, I have a question. I am setting a GPU limit on my pipeline component using .set_gpu_limit(1) as given below.
and the ARGO_CONTAINER is showing nvidia.com/gpu : 1. So my question is: do I need to specify a GPU request in my trial spec in Katib as well, like below?
Also, I kindly request you to help me solve this GPU usage problem.
I haven't tried a GPU limit with Pipelines. The easiest way is to check the experiment YAML using kubectl. The trial spec needs a GPU limit if the trial pod needs to access the GPU.
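To inspect what Katib actually generated, something like the following works (experiment name and namespace are placeholders, and the trial-pod label name may vary by Katib version):

```shell
# Dump the generated experiment to verify the GPU limit reached the trial template.
kubectl get experiment my-experiment -n kubeflow-user -o yaml

# Check the trial pods and why they are (or are not) being scheduled.
kubectl describe pods -n kubeflow-user -l katib.kubeflow.org/experiment=my-experiment
```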
This is what happens when I specify GPU request in the trial spec but not in the pipeline component.
This is my kubectl describe node
Any idea how I can add a toleration to this taint and make the pod allocate the GPU? This is my pod yaml
and this is my katib experiment yaml
even though it shows running, it will time out eventually. What am I missing here?
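For reference, a toleration matching the GPU node taint can be added to the trial's pod spec. A minimal sketch in the same dict style as the trial spec later in this thread; the key/value below assume GKE's default GPU taint (nvidia.com/gpu=present:NoSchedule), so match whatever `kubectl describe node` actually shows, and the container name/image are placeholders:

```python
# Hypothetical fragment of a Katib trial_spec pod template.
# The toleration must match the taint reported by `kubectl describe node`;
# nvidia.com/gpu=present:NoSchedule is GKE's default for GPU node pools.
pod_spec = {
    "tolerations": [
        {
            "key": "nvidia.com/gpu",
            "operator": "Equal",
            "value": "present",
            "effect": "NoSchedule",
        }
    ],
    "containers": [
        {
            "name": "training-container",      # placeholder name
            "image": "YOUR_IMAGE_HERE",        # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }
    ],
}
```

Note that when a pod requests `nvidia.com/gpu` in its limits, GKE's device plugin normally adds the matching toleration automatically, so a missing toleration usually indicates the limit never made it into the trial pod spec.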
This is not specific to Katib. It means that the trials could not find a node that satisfies these resource requirements to start the pod.
Here is the gist of my working sample. You can ignore the node selector stuff; it just helps schedule the pod on the GPU node I want (dedicated to training in my case): trial_spec={
"apiVersion": "batch/v1",
"kind": "Job",
"spec": {
"template": {
"metadata": {
"annotations": {
"sidecar.istio.io/inject": "false"
}
},
"spec": {
"affinity": {
"nodeAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": {
"nodeSelectorTerms": [
{
"matchExpressions": [
{
"key": "k8s.amazonaws.com/accelerator",
"operator": "In",
"values": [
"nvidia-tesla-v100"
]
},
{
"key": "ai-gpu-2",
"operator": "In",
"values": [
"true"
]
}
]
}
]
}
}
},
"containers": [
{
"resources" : {
"limits" : {
"nvidia.com/gpu" : 1
}
},
"name": training_container_name,
"image": "xxxxxxxxxxxxxxxxxxxxx__YOUR_IMAGE_HERE_xxxxxxxxxxxxxx",
"imagePullPolicy": "Always",
"command": train_params + [
"--learning_rate=${trialParameters.learning_rate}",
"--optimizer=${trialParameters.optimizer}",
"--batch_size=${trialParameters.batch_size}",
"--max_epochs=${trialParameters.max_epochs}"
]
}
],
"restartPolicy": "Never",
"serviceAccountName": "default-editor"
}
}
}
}
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Feel free to re-open an issue if you have any follow-up problems.
/kind bug
What steps did you take and what happened:
I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on GKE clusters. I created a cluster using the commands below
I then created a kubeflow pipeline:
These are my two containers.
gcr.io/.............../hptunekatibclient:v7
Dockerfile
gcr.io/.............../hptunekatib:v7
Dockerfile
The pipeline runs but it does not use the GPU, and this piece of code
returns an empty list and an empty string. It is as if the GPU does not exist. I am attaching the logs of the container
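Given the "Could not load dynamic library 'libcuda.so.1'" error in the title, a stdlib-only way to confirm whether the driver library is visible inside the running container, independent of TensorFlow, is a sketch like this:

```python
# Quick stdlib-only diagnostic: can the NVIDIA driver library (libcuda.so.1,
# the same library TensorFlow fails to load) be found and dlopen'ed in this
# container? On a node without the driver mounted, both checks come up empty.
import ctypes
import ctypes.util


def cuda_driver_visible():
    """Return True if libcuda.so.1 can be loaded, else False."""
    try:
        ctypes.CDLL("libcuda.so.1")
        return True
    except OSError:
        return False


print("find_library('cuda'):", ctypes.util.find_library("cuda"))
print("libcuda.so.1 loadable:", cuda_driver_visible())
```

If this prints None/False inside the trial pod, the pod was scheduled onto a node without the driver exposed (or without the GPU resource granted), which matches TensorFlow seeing no GPU devices.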
What did you expect to happen:
I expected the pipeline stage to use the GPU and run the text classification on the GPU, but it doesn't.
Anything else you would like to add:
Environment:
Kubernetes version (kubectl version): 1.22.8-gke.202
OS (uname -a): Linux / COS in containers