v1.5_cuda-11.6_ubuntu-20.04_python-only NVML version mismatch, 1.4 works fine #106
I was able to reproduce your issue on another server; however, I can't update packages or reboot at the moment. Some users say that a reboot helped. Could you tell me whether a reboot helps?
gpu-jupyter/src/Dockerfile.gpulibs, Lines 44 to 49 in 69a81e3
Thanks for the help! Rebooting alone did not work, but updating the NVIDIA drivers on the VM from 510 to 530 resolved the NVML mismatch issue I was seeing. This also updated the VM's CUDA version to 12.1. The image now reports CUDA version 12.1 instead of 11.6, which is what I was expecting based on the name.
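For anyone who wants to confirm the same mismatch before upgrading drivers, a minimal diagnostic sketch is shown below. The image tag is the one from this issue; the comparison between the host kernel module and the NVML library loaded inside the container is an illustration, not part of the project's tooling:

```bash
# Host side: version of the NVIDIA kernel module the VM is actually running.
cat /proc/driver/nvidia/version

# Container side: driver version reported through NVML by nvidia-smi.
docker run --rm --gpus all \
  cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only \
  nvidia-smi --query-gpu=driver_version --format=csv,noheader

# If the container command fails with "Driver/library version mismatch" while the
# host reports e.g. a 510.x module, the image is loading a newer libnvidia-ml
# than the kernel module provides.
```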
That is strange behavior that I also see in my setup: I'm installing CUDA and the driver for 11.6.2 as described in the Medium blog post.
nvcc --version shows the correct version, but nvidia-smi shows CUDA 12.1 (and NVIDIA driver 530), as seen here. It seems nvidia-smi can report a different version than nvcc, as noted in the NVIDIA forum. So I suppose the current installation in version 1.5 requires CUDA 11.6.2. I found that one of the installed packages is responsible; I will find and downgrade this package!
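For context, the two commands report different things, so they can legitimately disagree: nvcc prints the CUDA toolkit installed in the image, while the "CUDA Version" in the nvidia-smi header is the highest CUDA API level supported by the installed driver, not the toolkit. A quick way to see both side by side (plain shell, nothing project-specific; the example versions are the ones mentioned in this thread):

```bash
# CUDA toolkit version installed in the image (what code is compiled against):
nvcc --version    # e.g. "Cuda compilation tools, release 11.6"

# Driver version and the highest CUDA version that driver supports:
nvidia-smi        # header shows e.g. "Driver Version: 530.xx   CUDA Version: 12.1"
```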
It seems the installation of nvtop is the problem here:
gpu-jupyter/src/Dockerfile.gpulibs, Lines 39 to 42 in 69a81e3
It installs dependencies that pull in the newer NVIDIA libraries. After removing nvtop it works; however, nvcc does not work anymore. I will figure out a solution here.
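One possible workaround, sketched below, is to build nvtop from source instead of installing it via apt, so apt cannot pull in a driver-coupled libnvidia-compute-* package that shadows the NVML library injected by the NVIDIA container runtime. This is only an illustration of the idea, not the fix that was merged; the package list and build steps are assumptions and may need adjusting for the nvtop version in use:

```dockerfile
# Sketch: build nvtop from source so apt does not install driver-coupled
# libnvidia-compute-* dependencies alongside it.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential cmake git libncurses5-dev libncursesw5-dev && \
    git clone https://github.com/Syllo/nvtop.git /tmp/nvtop && \
    mkdir -p /tmp/nvtop/build && cd /tmp/nvtop/build && \
    cmake .. && make && make install && \
    rm -rf /tmp/nvtop && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
```

Whether this builds cleanly depends on the nvtop release (newer versions load NVML at runtime via dlopen, older ones want the NVML headers at build time), so treat it as a direction rather than a drop-in replacement.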
Interestingly, nvtop has already caused some trouble before, see here. I'm looking forward to getting rid of it; hopefully all tests pass.
I did notice that nvtop wasn't working in my custom build, but it was low on my priority list of things to fix, haha.
Another interesting insight is that the NVIDIA driver version on the node where the image is built affects the subsequent installations.
However, if the same Dockerfile is built on a node with driver 520, it works. I'll build and push the images now on the server with driver version 520 and hope it's upward compatible! @njacobson-nci please check if you can successfully build and run it.
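A quick smoke test, just a sketch reusing the commands already discussed in this thread, to confirm the pushed image is usable on a node with the older driver by checking both the driver path and the toolkit in one go:

```bash
# Should print the GPU table (driver/NVML OK) followed by the CUDA 11.6 toolkit version.
docker run --rm --gpus all \
  cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only \
  bash -c "nvidia-smi && nvcc --version"
```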
@ChristophSchranz Sorry for the delay, but I'm able to run the pushed 1.5 image on a VM with driver version 510 and CUDA 11.6 now. Thanks for all the help!
I found this while debugging a different bitsandbytes issue, but figured I should post my finding.
This image does not work:

```bash
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only
```

This works:

```bash
docker run --gpus all nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
```

This also works:

```bash
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.4_cuda-11.6_ubuntu-20.04_python-only
```
This is the output from the VM, not the containers.
This is from the 1.4 container.