
v1.5_cuda-11.6_ubuntu-20.04_python-only NVML version mismatch, 1.4 works fine #106

Closed
njacobson-nci opened this issue Mar 22, 2023 · 8 comments

Comments

@njacobson-nci

I found this while debugging a different bitsandbytes issue, but figured I should post my finding.

This image does not work
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only

Failed to initialize NVML: Driver/library version mismatch
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

This works
docker run --gpus all nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04 nvidia-smi

This works
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.4_cuda-11.6_ubuntu-20.04_python-only

This is the output from the VM, not the containers.

Wed Mar 22 17:08:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:13:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    537570      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    537570      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

This is from the 1.4 container.

(base) root@52c19100d278:~# nvidia-smi
Wed Mar 22 20:58:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:13:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
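
For completeness, here is one way to compare the host's kernel driver module with the user-space NVML library a container actually uses (a minimal diagnostic sketch; it assumes a Debian/Ubuntu-based image and the usual library path):

# on the VM: kernel module version loaded by the host
cat /proc/driver/nvidia/version
# inside the container: which libnvidia-ml the runtime mounted or the image installed
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
dpkg -l | grep -i libnvidia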
@ChristophSchranz
Collaborator

Hi @njacobson-nci

I was able to reproduce your issue on another server; however, I can't update packages or reboot at the moment.

Some users say that a reboot helped.
Otherwise, it seems that version 1.5 requires an updated NVIDIA driver. You are using version 510, but 530 might be required here.

Could you tell me if a reboot helped?
I'm about to rebuild it on the server where I could reproduce your issue; maybe this will help. Another critical part could be a new installation step to fix the ptxas issue (#93):

USER $NB_UID
RUN conda install -c nvidia cuda-nvcc -y && \
    conda clean --all -f -y && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
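
For reference, a quick way to check inside the running container which nvcc this conda package provides (a small sketch; paths assume the default docker-stacks layout with conda on the PATH):

which nvcc            # should resolve to the conda-provided binary under $CONDA_DIR/bin
nvcc --version        # toolkit version installed from the nvidia channel
conda list cuda-nvcc  # exact package version that was pulled in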

@njacobson-nci
Author

Thanks for the help!

Rebooting alone did not work, but updating the NVIDIA driver on the VM from 510 to 530 resolved the NVML mismatch issue I was seeing. This also updated the VM's CUDA version to 12.1.

The image now reports CUDA 12.1 rather than the 11.6 I was expecting based on the name.


(base) root@9aa6ea4bbbce:~# nvidia-smi
Thu Mar 23 16:03:38 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            On | 00000000:0B:00.0 Off |                    0 |
| N/A   27C    P0               25W / 250W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB            On | 00000000:13:00.0 Off |                    0 |
| N/A   27C    P0               25W / 250W|      4MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(base) root@9aa6ea4bbbce:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

@ChristophSchranz
Collaborator

That is strange behavior that I also see in my setup: I'm installing CUDA and the driver, both version 11.6.2, as described in the Medium blog post:

sudo apt update
apt policy cuda  # check available versions of cuda
sudo apt-get install cuda=11.6.2-1
apt policy nvidia-gds  # check available versions of nvidia-gds
sudo apt install nvidia-gds=11.6.2-1

nvcc --version shows the correct version, but nvidia-smi reports CUDA 12.1 (and NVIDIA driver 530), as seen here:
[screenshot: nvidia-smi output showing driver 530 and CUDA 12.1]

It seems nvidia-smi can show a different version than nvcc, as noted in the NVIDIA forum: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the version of the installed toolkit.

So I suppose the current installation in version 1.5 requires CUDA 11.6.2. I found that one of the packages in src/Dockerfile.gpulibs forces CUDA to upgrade, which causes the failure on host systems with a CUDA version below 11.6.2.

I will find and downgrade this package!
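
One way to spot the offending package before rebuilding is a simulated install, sketched here for a Debian/Ubuntu base image (<package> is a placeholder for each candidate from src/Dockerfile.gpulibs):

# dry run: list any driver/CUDA libraries the candidate would pull in
apt-get install -s --no-install-recommends <package> | grep -iE 'libnvidia|cuda'
# optionally keep the host's CUDA packages from being upgraded as a side effect
sudo apt-mark hold cuda nvidia-gds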

@ChristophSchranz
Collaborator

It seems the installation of nvtop is the problem here:

RUN apt-get update && \
    apt-get install -y --no-install-recommends cmake libncurses5-dev libncursesw5-dev git nvtop && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

It installs the dependencies libnvidia-compute-418, libnvidia-compute-430, and libnvidia-compute-530, which might be incompatible.

After removing nvtop, it works; however, nvcc no longer works. I will figure out a solution here.
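
One possible workaround, sketched here only as an untested idea: build nvtop from source with the cmake and ncurses headers that are already installed, so apt never pulls in the libnvidia-compute-* packages (depending on the nvtop version, additional NVML headers may be needed at build time):

RUN git clone https://github.com/Syllo/nvtop.git /tmp/nvtop && \
    mkdir -p /tmp/nvtop/build && cd /tmp/nvtop/build && \
    cmake .. && make && make install && \
    rm -rf /tmp/nvtop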

@ChristophSchranz
Collaborator

ChristophSchranz commented Mar 23, 2023

Interestingly, nvtop has already caused some trouble before, see here.

I'm looking forward to getting rid of it; hopefully all tests pass.

@njacobson-nci
Author

I did notice that nvtop wasn't working in my custom build, but it was low on my priority list of things to fix, haha.

@ChristophSchranz
Collaborator

ChristophSchranz commented Mar 24, 2023

Another interesting insight is that the NVIDIA driver version on the node where the image is built affects the subsequent installations.
An image built on a node with driver version 530 leads to this error on a node with version 520:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

However, if the same Dockerfile is built on the node with driver 520, it works.

I'll build and push the images now on the server with version 520 and hope it's upwards compatible!

@njacobson-nci please check whether you can successfully build and run nvidia-smi with the merged changes on the driver version 510 you are using, and whether it also works with the pulled image that will be pushed to Docker Hub in the next few hours :)
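
For example, something along these lines should be enough to verify it on the 510 host once the image is on Docker Hub (the tag is the one from this issue; adjust it if a different tag gets pushed):

docker pull cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only
docker run --rm --gpus all cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only nvidia-smi
docker run --rm --gpus all cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only nvcc --version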

@njacobson-nci
Author

@ChristophSchranz Sorry for the delay, but I'm able to run the pushed 1.5 image on a VM with driver version 510 and CUDA 11.6 now.

Thanks for all the help!
