Crash loop backoff with Error: Failed to initialize NVML on GKE
#59
Comments
Just to clarify, the libnvidia-ml.so files were found inside the container? I just pulled them and checked and did not find them. Also, have you followed the integration guide for Kubernetes found here? https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#integrating-gpu-telemetry-into-kubernetes
@glowkey the Nvidia drivers were already installed on the node with the The
@praveenperera,
@nikkon-dev yes sorry
@praveenperera,
Hey @nikkon-dev this is the output I get: https://gist.github.com/praveenperera/48ca14a4a898ef9a51d9e8b91b5076b1 And the output of
Was there any progress on this in the last month? We're seeing the exact same issue on GKE, and it would be great to get some actual metrics from the GPUs.
I took a look at your configuration, and here are some issues I noticed:
I don't believe the nvidia docker runtime is in play in GKE, but I could be wrong. As far as I'm aware, this is all native
I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment) My helm chart values:
and...
I'll try that, thanks!
Thanks a lot for sharing the values. I also got it running on GKE with that setup.
Thanks for sharing the values. I did the same thing, bumping the memory to 256Mi, and things are working now. It's weird, though: there are 17 pods, and 4 of them are still stuck in CrashLoopBackOff. Does anyone have a clue why?
Similar to issue: #27
My daemonset.yaml
What I've tried
nvidia-smi in container: same error
ldconfig -p | grep -i libnvidia-ml.so: the library was found in /usr/local/nvidia/lib64/
/usr/bin/nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug
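For anyone repeating the library check above, here is a minimal sketch that lists the NVML shared libraries in a directory without relying on the ldconfig cache. The helper name `find_nvml` is hypothetical and not part of dcgm-exporter; the only path it assumes is `/usr/local/nvidia/lib64`, the one reported in this issue. It only confirms that the file exists, not that NVML can actually initialize.

```shell
#!/bin/sh
# Hypothetical helper (not part of dcgm-exporter): print every NVML shared
# library found directly inside the given directory.
# Returns 0 if at least one file was found, 1 otherwise.
find_nvml() {
  dir="$1"
  found=1
  for f in "$dir"/libnvidia-ml.so*; do
    # If the glob matches nothing, the pattern stays literal, so check existence.
    if [ -e "$f" ]; then
      echo "$f"
      found=0
    fi
  done
  return "$found"
}

# /usr/local/nvidia/lib64 is the directory reported in this issue.
find_nvml /usr/local/nvidia/lib64 || echo "libnvidia-ml.so not found in /usr/local/nvidia/lib64"
```

Running this inside the dcgm-exporter container (for example through `kubectl exec`) should show whether the driver libraries from the node are visible there at all, which seems to be the first question raised in the comments.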