[BUG] Docker Image CUDA ERROR #104

Closed
aerdem4 opened this issue May 10, 2023 · 8 comments
Labels
type/bug Bug in code

Comments

aerdem4 commented May 10, 2023

🐛 Bug

I am getting the warning below, and the nightly Docker image doesn't see my GPU. I have an RTX 3090 with Driver Version: 470.182.03 and CUDA Version: 11.4 on the host machine.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)

The Docker image's CUDA version seems to be 11.8, and my driver version should support it.

To Reproduce

sudo docker run --runtime=nvidia --shm-size=64g --init --rm -p 10101:10101 -v "$(pwd)/data":/workspace/data -v "$(pwd)/output":/workspace/output gcr.io/vorvan/h2oai/h2o-llmstudio:nightly
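
The mismatch described in this report (host driver reporting CUDA 11.4, image built on CUDA 11.8) can be checked mechanically. The following is a minimal Python sketch and not part of H2O LLM Studio: the helper names `parse_versions`, `as_tuple`, and `image_is_newer` are hypothetical, and the image version `11.8` is passed in by hand because it comes from the report rather than from any detection logic. A newer image is not automatically broken; the check only flags the situation in which this error appeared.

```python
import re

# Matches the header fields printed by nvidia-smi, e.g.
# "Driver Version: 470.182.03 CUDA Version: 11.4"
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>\d+\.\d+)"
)


def parse_versions(text):
    """Return (driver, cuda) version strings found in nvidia-smi output."""
    match = HEADER_RE.search(text)
    if match is None:
        raise ValueError("no 'Driver Version / CUDA Version' header found")
    return match.group("driver"), match.group("cuda")


def as_tuple(version):
    """Turn '11.8' into (11, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def image_is_newer(host_cuda, image_cuda):
    """True when the image's CUDA version is newer than the one the host driver reports."""
    return as_tuple(image_cuda) > as_tuple(host_cuda)


# Values copied from the report; "11.8" is the image version stated there.
host_report = "Driver Version: 470.182.03 CUDA Version: 11.4"
driver, host_cuda = parse_versions(host_report)
print(driver, host_cuda, image_is_newer(host_cuda, "11.8"))
```

The same `parse_versions` helper also accepts the full table printed by `nvidia-smi`, since it only searches for the header fields.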

aerdem4 added the type/bug Bug in code label May 10, 2023
Collaborator

pascal-pfeiffer commented May 10, 2023

Thank you for reporting, @aerdem4

I am getting the same error on my machine with a 3090 and the following host CUDA setup:
Driver Version: 510.108.03 CUDA Version: 11.6

Since everything runs smoothly on all other tested machines, I expected this to be a rare issue. It seems I was wrong. I'll investigate further and try to find a solution. Do you have any special ENV vars set on your host machine regarding CUDA? That is one thing I have set differently on the machine where the Docker container can't initialize the GPU.

Author

aerdem4 commented May 10, 2023

Do you have any special ENV vars set on your host machine regarding cuda?

I don't think so. Maybe 11.8 is just not compatible with the 3090? Were any of the successful tests on a 3090?

@pascal-pfeiffer
Collaborator

I don't think so. Maybe 11.8 is just not compatible with the 3090? Were any of the successful tests on a 3090?

No tests that I am aware of. The other tests included A100, A10G, A6000, and V100 (all successful).

I just tested a docker build with nvidia/cuda:11.6.2-devel-ubuntu20.04 and that seems to work. Could you maybe test it on your machine, too? I'll merge the fix/downgrade if you confirm (#105).

pascal-pfeiffer linked a pull request May 11, 2023 that will close this issue
Author

aerdem4 commented May 11, 2023

It didn't work for me. Same error.

Author

aerdem4 commented May 11, 2023

When I change the base image to match my host machine's CUDA version, it works.
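
For anyone applying the same workaround, swapping the base image comes down to editing the `FROM` line of the Dockerfile before rebuilding. The following is a rough Python sketch under stated assumptions: `set_base_image` is a hypothetical helper, the original `FROM` line shown is a guess at what an 11.8 base would look like, and the replacement tag is the one mentioned earlier in this thread. Substitute the tag that matches your own host.

```python
import re

# Captures "FROM " plus any leading flags such as --platform=..., then the image name.
FROM_RE = re.compile(r"^(FROM\s+(?:--\S+\s+)*)(\S+)", flags=re.MULTILINE | re.IGNORECASE)


def set_base_image(dockerfile_text, new_image):
    """Replace the image on the first FROM line, keeping flags and any 'AS stage' suffix."""
    updated, count = FROM_RE.subn(
        lambda m: m.group(1) + new_image, dockerfile_text, count=1
    )
    if count == 0:
        raise ValueError("no FROM line found")
    return updated


# Hypothetical Dockerfile content; only the FROM line matters here.
original = "FROM nvidia/cuda:11.8.0-devel-ubuntu20.04\nRUN echo build steps\n"
print(set_base_image(original, "nvidia/cuda:11.6.2-devel-ubuntu20.04"))
```

Editing the line by hand works just as well; the helper is only useful when the swap has to be scripted across several machines.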

Collaborator

psinger commented May 12, 2023

Yeah, unfortunately bitsandbytes tries to use the global CUDA installation instead of the CUDA that ships with the local PyTorch, which has caused all sorts of issues. This one might be related.

I hope that this PR fixes it:
TimDettmers/bitsandbytes#375

For now I am hesitant to try to address this too much, as this also seems to be an issue only on Docker for some setups. If manually fixing the CUDA version fixes it for you, it sounds like a good workaround.

Otherwise, it might also be a good idea to run it outside of Docker with the make commands.

krzysztofantczak commented May 21, 2023

It seems to be working for me on a 3090 with Docker, but I seem to have different versions (see the nvidia-smi output below).

Docker image used: gcr.io/vorvan/h2oai/h2o-llmstudio:nightly (appears to be created on May 21, 2023, 6:12:32 AM)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/workspace-dqq3IVyd/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
INFO:     127.0.0.1:33352 - "POST / HTTP/1.1" 200 OK
2023-05-21 09:17:29,190 - INFO: Initializing app ...
2023-05-21 09:17:29,201 - INFO: Initializing app ... done
2023-05-21 09:17:29,201 - INFO: Initializing client None
2023-05-21 09:17:29,239 - INFO: User name: anon
2023-05-21 09:17:29,242 - INFO: Downloading default dataset...

nvidia-smi (run from inside the container)

Sun May 21 11:40:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:00:10.0 Off |                  N/A |
|  0%   48C    P8               10W / 350W|      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
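
When comparing setups like the ones in this thread, the `CUDA SETUP` lines in the bitsandbytes startup log are the quickest way to see which runtime was picked up. The following is a minimal Python sketch that pulls those fields out of a log; `summarize_cuda_setup` is a hypothetical helper, and the patterns are based only on the log lines quoted above, so other bitsandbytes versions may word them differently.

```python
import re

# Patterns derived from the log lines quoted in this thread.
PATTERNS = {
    "runtime_path": r"CUDA runtime path found:\s*(\S+)",
    "compute_capability": r"Highest compute capability among GPUs detected:\s*([\d.]+)",
    "cuda_version": r"Detected CUDA version\s*(\d+)",
}


def summarize_cuda_setup(log_text):
    """Collect runtime path, compute capability, and CUDA version from a CUDA SETUP log."""
    summary = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        # Missing fields are reported as None instead of raising.
        summary[key] = match.group(1) if match else None
    return summary


sample = """\
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
"""
print(summarize_cuda_setup(sample))
```

In the log above, the runtime path points at `/usr/local/cuda`, which is the image-wide CUDA installation rather than anything inside the Python environment.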

Collaborator

psinger commented Aug 18, 2023

We updated bitsandbytes to 0.41.0, which should solve this. From the release notes:

Release 0.41.0 features an overhaul of the CUDA_SETUP routine. We trust PyTorch to find the proper CUDA binaries and use those.

Please reopen if issues still persist.
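
To confirm that an environment already has the bitsandbytes release mentioned above, the installed version can be compared against 0.41.0. The following is a small Python sketch using only the standard library; `version_tuple`, `has_new_cuda_setup`, and `installed_bitsandbytes_version` are hypothetical helper names, and the comparison deliberately ignores suffixes such as `.post1`.

```python
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric parts of '0.41.0' into (0, 41, 0), padded to three fields."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts + [0] * (3 - len(parts)))


def has_new_cuda_setup(installed, minimum="0.41.0"):
    """True when the installed version is at least the release named in this thread."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_bitsandbytes_version():
    """Return the installed version string, or None when the package is absent."""
    try:
        return metadata.version("bitsandbytes")
    except metadata.PackageNotFoundError:
        return None


installed = installed_bitsandbytes_version()
if installed is None:
    print("bitsandbytes is not installed in this environment")
else:
    print(installed, has_new_cuda_setup(installed))
```

Run inside the container, this reports whether the image was built before or after the update.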

psinger closed this as completed Aug 18, 2023
4 participants