
CUDA exception! Error code: no CUDA-capable device is detected when training Lora #270

Open
luojiesi opened this issue Mar 9, 2023 · 4 comments

Comments

@luojiesi

luojiesi commented Mar 9, 2023

I was using https://github.com/derrian-distro/LoRA_Easy_Training_Scripts to train my LoRA embedding and encountered this issue. I believe the problem lies in the underlying sd-scripts, which is why I am posting here. Let me know if I have it wrong.

This is the full output:

/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:94: UserWarning: /home/jiesiluo/anaconda3/envs/loratraining did not contain libcudart.so as expected! Searching further paths...
warn(
/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/cv2/../../lib64')}
warn(
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA exception! Error code: no CUDA-capable device is detected
CUDA exception! Error code: initialization error
CUDA SETUP: Highest compute capability among GPUs detected: None
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda121_nocublaslt.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
Traceback (most recent call last):
File "/home/jiesiluo/LoRA_Easy_Training_Scripts/main.py", line 147, in
main()
File "/home/jiesiluo/LoRA_Easy_Training_Scripts/main.py", line 80, in main
train_network.train(args)
File "/home/jiesiluo/LoRA_Easy_Training_Scripts/sd_scripts/train_network.py", line 168, in train
optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable_params)
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/library/train_util.py", line 1707, in get_optimizer
import bitsandbytes as bnb
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in
from .autograd._functions import (
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 5, in
import bitsandbytes.functional as F
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/functional.py", line 13, in
from .cextension import COMPILED_WITH_CUDA, lib
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 41, in
lib = CUDALibrary_Singleton.get_instance().lib
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
cls._instance.initialize()
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 27, in initialize
raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!
Traceback (most recent call last):
File "/home/jiesiluo/anaconda3/envs/loratraining/bin/accelerate", line 8, in
sys.exit(main())
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/jiesiluo/anaconda3/envs/loratraining/bin/python', 'main.py', '--load_json_path', '/home/jiesiluo/training/config.json']' returned non-zero exit status 1.

I have removed and reinstalled CUDA and the CUDA driver (12.1 only) following the directions here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-ubuntu. I have verified the CUDA installation with the following commands:

(loratraining) jiesiluo@SFF-PC:~/LoRA_Easy_Training_Scripts$ /usr/local/cuda/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(loratraining) jiesiluo@SFF-PC:~/LoRA_Easy_Training_Scripts$ nvidia-smi
Thu Mar 9 01:01:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 531.18 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 0% 36C P8 5W / 450W| 1725MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 39 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+

I am using commit 7b0af4f. Can someone help? Thanks.
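As an extra datapoint, here is a small stdlib-only script (my own diagnostic sketch, not part of bitsandbytes) that mimics its search for libcudart.so along LD_LIBRARY_PATH; more than one hit would point at the "multiple conflicting CUDA libraries" case the setup message mentions:

```python
import os
from pathlib import Path

def find_libcudart(search_paths: str) -> list[Path]:
    """Return every libcudart.so found in a colon-separated path list.

    More than one hit usually means conflicting CUDA installations,
    which is one of the failure modes the bitsandbytes setup warns about.
    """
    hits = []
    for entry in search_paths.split(":"):
        if not entry:
            continue  # skip empty entries from leading/trailing colons
        candidate = Path(entry) / "libcudart.so"
        if candidate.is_file():
            hits.append(candidate)
    return hits

if __name__ == "__main__":
    paths = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_libcudart(paths) or "no libcudart.so on LD_LIBRARY_PATH")
```

On a correctly configured machine this should report exactly one path (in my log, the copy under /usr/local/cuda/lib64).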

@1500256797

I have the same problem. Did you solve it?

@kohya-ss
Owner

kohya-ss commented Apr 9, 2023

File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 27, in initialize
raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

It seems that bitsandbytes is causing the error. Unfortunately I haven't tested with Ubuntu or CUDA 12.X, but the latest bitsandbytes seems to support CUDA 12.1, so updating to that version may solve the issue.

@luojiesi
Author

File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 27, in initialize
raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

It seems that bitsandbytes is causing the error. Unfortunately I haven't tested with Ubuntu or CUDA 12.X, but the latest bitsandbytes seems to support CUDA 12.1, so updating to that version may solve the issue.

Thanks for the response. Yes, that was the issue. I was able to work around it by choosing a combination of params that doesn't call bitsandbytes. Can you share which commit of bitsandbytes resolved the issue? I'll try again.
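For anyone who lands here: the workaround works because, as the traceback shows, bitsandbytes is only imported inside train_util.get_optimizer for the 8-bit optimizer types. A minimal sketch of that dispatch (the function name and return values here are illustrative, not the real sd-scripts API):

```python
def optimizer_backend(optimizer_type: str) -> str:
    """Simplified sketch of the dispatch in sd-scripts' train_util.get_optimizer:
    only the 8-bit optimizer types trigger `import bitsandbytes`, so choosing
    e.g. "AdamW" instead of "AdamW8bit" never runs the failing CUDA setup
    in the first place."""
    if optimizer_type.lower().endswith("8bit"):
        # this branch would do `import bitsandbytes`, which is where
        # "CUDA SETUP: Setup Failed!" is raised in this issue
        return "bitsandbytes"
    # plain-PyTorch optimizers need no bitsandbytes import
    return "torch.optim"
```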

@ArtificialCleverness

I solved this problem by removing the bitsandbytes version constraint. Try:
pip uninstall bitsandbytes
./setup.sh
I also created a PR (#465) that removes the version constraint.

File "/home/jiesiluo/anaconda3/envs/loratraining/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 27, in initialize
raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

It seems that bitsandbytes is causing the error. Unfortunately I haven't tested with Ubuntu or CUDA 12.X, but the latest bitsandbytes seems to support CUDA 12.1, so updating to that version may solve the issue.
