[BUG] Docker Image CUDA ERROR #104

aerdem4 · 2023-05-10T13:12:36Z

🐛 Bug

I am getting the warning below, and the nightly Docker image doesn't see my GPU. I have an RTX 3090 on the host machine with Driver Version: 470.182.03 and CUDA Version: 11.4.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)

The Docker image's CUDA version seems to be 11.8, and my driver version should support it.
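Whether a driver "should support" a newer 11.x toolkit can be checked against NVIDIA's minor-version compatibility rule, under which CUDA 11.x applications are expected to run on Linux drivers at or above 450.80.02. The sketch below encodes only that rule. The function names and the threshold constant are illustrative assumptions, and the check says nothing about the forward-compatibility path named in error 804.

```python
# Minimum Linux driver for CUDA 11.x minor-version compatibility.
# Assumption: 450.80.02, as stated in NVIDIA's compatibility documentation.
MIN_DRIVER_CUDA11 = (450, 80, 2)


def parse_version(text: str) -> tuple:
    """Turn '470.182.03' into (470, 182, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_cuda11_minimum(driver_version: str) -> bool:
    """True if the driver is new enough for CUDA 11.x minor-version compatibility."""
    return parse_version(driver_version) >= MIN_DRIVER_CUDA11


if __name__ == "__main__":
    # Driver versions reported in this thread.
    for driver in ("470.182.03", "510.108.03", "530.41.03"):
        print(driver, driver_meets_cuda11_minimum(driver))
```

All three drivers mentioned in this thread pass this check, so a failure here points to something other than a driver that is simply too old.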

To Reproduce

sudo docker run --runtime=nvidia --shm-size=64g --init --rm -p 10101:10101 -v `pwd`/data:/workspace/data -v `pwd`/output:/workspace/output gcr.io/vorvan/h2oai/h2o-llmstudio:nightly


pascal-pfeiffer · 2023-05-10T19:37:06Z

Thank you for reporting, @aerdem4

I am receiving the same error on my machine with a 3090 and the host cuda:
Driver Version: 510.108.03 CUDA Version: 11.6

As everything runs smoothly on all other tested machines, I expected this to be a rare issue. It seems I was wrong. I'll investigate more and try to find a solution. Do you have any special ENV vars set on your host machine regarding CUDA? That is one thing that is set differently on my machine where the Docker container can't initialize the GPU.
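One way to answer the ENV question is to list the variables that usually influence CUDA library lookup. This is a minimal sketch: the helper name `cuda_related_env` and the keyword filter are assumptions made for illustration, not part of any tool mentioned in this thread.

```python
import os


def cuda_related_env(env=None):
    """Return environment variables that commonly influence CUDA library lookup."""
    env = os.environ if env is None else env
    keywords = ("CUDA", "NVIDIA")
    found = {
        name: value
        for name, value in env.items()
        if any(word in name.upper() for word in keywords)
    }
    # LD_LIBRARY_PATH matters too: it can point the loader at a different libcudart.
    if "LD_LIBRARY_PATH" in env:
        found["LD_LIBRARY_PATH"] = env["LD_LIBRARY_PATH"]
    return found


if __name__ == "__main__":
    for name, value in sorted(cuda_related_env().items()):
        print(f"{name}={value}")
```

Running it on both the host and inside the container makes it easier to spot a difference between a working and a failing machine.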

aerdem4 · 2023-05-10T21:13:31Z

> Do you have any special ENV vars set on your host machine regarding cuda?

I don't think so. Maybe 11.8 is just not compatible with the 3090? Were any of the successful tests on a 3090?

pascal-pfeiffer · 2023-05-11T12:20:58Z

> I don't think so. Maybe 11.8 is just not compatible with the 3090? Were any of the successful tests on a 3090?

No tests that I am aware of. The other tests covered A100, A10G, A6000, and V100 (all successful).

I just tested a docker build with nvidia/cuda:11.6.2-devel-ubuntu20.04 and that seems to work. Could you maybe test it on your machine, too? I'll merge the fix/downgrade if you confirm (#105).

aerdem4 · 2023-05-11T17:49:54Z

It didn't work for me. Same error.

aerdem4 · 2023-05-11T18:28:59Z

When I change the base image to my host machine CUDA version, it works.
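For anyone repeating this workaround, the base image swap can be scripted. The sketch below assumes the Dockerfile selects its base image in the first FROM line. The helper name `set_base_image` is hypothetical, and the full image tag has to be supplied by the caller, since the right tag depends on the host.

```python
def set_base_image(dockerfile_text: str, new_image: str) -> str:
    """Replace the image in the first FROM line, keeping flags and any 'AS <stage>' suffix."""
    lines = dockerfile_text.splitlines()
    for i, line in enumerate(lines):
        parts = line.split()
        if parts and parts[0].upper() == "FROM":
            # Skip option flags such as --platform=... to find the image reference.
            idx = next(
                (j for j in range(1, len(parts)) if not parts[j].startswith("--")),
                None,
            )
            if idx is None:
                raise ValueError("FROM line has no image reference")
            parts[idx] = new_image
            lines[i] = " ".join(parts)
            break
    else:
        # The loop finished without hitting a FROM line.
        raise ValueError("no FROM line found")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample Dockerfile text for illustration only, not the project's real file.
    sample = "FROM example/base:old AS build\nRUN echo hi\n"
    print(set_base_image(sample, "nvidia/cuda:11.6.2-devel-ubuntu20.04"))
```

The tag used in the example is the one tested earlier in this thread. A host on a different CUDA version would need its own matching tag.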

psinger · 2023-05-12T10:00:29Z

Yeah, unfortunately bitsandbytes tries to use the global CUDA installation instead of the local PyTorch CUDA, which has caused all sorts of issues. This one might be related to it.

I hope that this PR fixes it:
TimDettmers/bitsandbytes#375

For now I am hesitant to try to address this too much, as it seems this is also only an issue on Docker for some setups. If manually fixing the cuda version fixes it for you, it sounds like a good workaround.

Otherwise, it might also be a good idea to run it outside of Docker with the make commands.

krzysztofantczak · 2023-05-21T11:41:34Z

It seems to be working for me on a 3090 with Docker, but I seem to have different versions of things.

Docker image used: gcr.io/vorvan/h2oai/h2o-llmstudio:nightly (appears to be created on May 21, 2023, 6:12:32 AM)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
INFO:     127.0.0.1:33352 - "POST / HTTP/1.1" 200 OK
2023-05-21 09:17:29,190 - INFO: Initializing app ...
2023-05-21 09:17:29,201 - INFO: Initializing app ... done
2023-05-21 09:17:29,201 - INFO: Initializing client None
2023-05-21 09:17:29,239 - INFO: User name: anon
2023-05-21 09:17:29,242 - INFO: Downloading default dataset...
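When comparing setups, the relevant values can be pulled out of a startup log like the one above. This is a minimal sketch that relies only on the two "CUDA SETUP" line formats visible in this log. The helper name `parse_bnb_setup` is hypothetical, and reading "118" as 11.8 assumes the minor version is a single digit.

```python
import re


def parse_bnb_setup(log_text: str) -> dict:
    """Extract the detected CUDA version and compute capability from CUDA SETUP lines."""
    info = {}
    version = re.search(r"Detected CUDA version (\d+)", log_text)
    if version:
        digits = version.group(1)
        # The log prints '118' for CUDA 11.8; assume the last digit is the minor version.
        info["cuda_version"] = f"{digits[:-1]}.{digits[-1]}"
    capability = re.search(
        r"Highest compute capability among GPUs detected: ([\d.]+)", log_text
    )
    if capability:
        info["compute_capability"] = capability.group(1)
    return info


if __name__ == "__main__":
    sample = (
        "CUDA SETUP: Highest compute capability among GPUs detected: 8.6\n"
        "CUDA SETUP: Detected CUDA version 118\n"
    )
    print(parse_bnb_setup(sample))
```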

nvidia-smi (run from inside the container)

Sun May 21 11:40:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:00:10.0 Off |                  N/A |
|  0%   48C    P8               10W / 350W|      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
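The header of this output is what distinguishes the setups in this thread: the failing hosts report CUDA 11.4 and 11.6, below the image's 11.8 toolkit, while this host reports 12.1. The sketch below extracts those two header fields and makes that comparison. Both helper names are hypothetical, and the comparison is only consistent with the reports here, not a proven diagnosis.

```python
import re


def parse_smi_header(text: str) -> dict:
    """Pull the driver version and the CUDA version the driver reports from nvidia-smi output."""
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    if match is None:
        raise ValueError("nvidia-smi header not found")
    return {"driver": match.group(1), "cuda": match.group(2)}


def _as_tuple(version: str) -> tuple:
    """Convert '11.8' to (11, 8) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def toolkit_newer_than_driver(toolkit: str, driver_cuda: str) -> bool:
    """True when the container toolkit is newer than the CUDA version the driver reports."""
    return _as_tuple(toolkit) > _as_tuple(driver_cuda)


if __name__ == "__main__":
    header = "| NVIDIA-SMI 530.41.03   Driver Version: 530.41.03    CUDA Version: 12.1     |"
    info = parse_smi_header(header)
    print(info, toolkit_newer_than_driver("11.8", info["cuda"]))
```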

psinger · 2023-08-18T08:06:21Z

We updated bitsandbytes to 0.41.0, which should solve this:

> Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We trust PyTorch to find the proper CUDA binaries and use those.

Please reopen if issues still persist.

aerdem4 added the type/bug Bug in code label May 10, 2023

pascal-pfeiffer linked a pull request May 11, 2023 that will close this issue

downgrade cuda to 11.6.2 in Dockerfile #105

Closed

psinger closed this as completed Aug 18, 2023
