[BUG] fatal error: cusolverDn.h: No such file or directory #2684

Closed
IamHussain503 opened this issue Jan 10, 2023 · 14 comments
Labels: bug, training

Comments

@IamHussain503

Describe the bug
I installed DeepSpeed along with its gcc and g++ dependencies, following these guides:

https://lindevs.com/install-gcc-on-ubuntu
https://lindevs.com/install-g-on-ubuntu

I then ran the following in a Python environment:
import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()

This should load cpu_adam successfully. Instead, the build fails with:
fatal error: cusolverDn.h: No such file or directory
and the final error is:
RuntimeError: Error building extension 'cpu_adam'

I downloaded the following packages from https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/:
cuda-license-10-0_10.0.130-1_amd64.deb
cuda-cublas-dev-10-0_10.0.130-1_amd64.deb
cuda-cublas-10-0_10.0.130-1_amd64.deb
cuda-cusolver-10-0_10.0.130-1_amd64.deb
cuda-cusolver-dev-10-0_10.0.130-1_amd64.deb
cuda-curand-10-0_10.0.130-1_amd64.deb

I installed them all, but the error persists.

import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:11,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
[2/3] /opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/bitten/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load
return self.jit_load(verbose)
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load
op_module = load(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

To Reproduce
Steps to reproduce the behavior:
OS: Ubuntu 18.04
(bitten) root@C.5718699:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
(bitten) root@C.5718699:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:04:00.0 Off |                  Off |
| 30%   28C    P8    18W / 230W |      1MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:44:00.0 Off |                  Off |
| 30%   27C    P8    19W / 230W |      1MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior
cpu_adam should build and load without errors.

ds_report output
Please run ds_report to give us details about your setup.
(bitten) root@C.5718699:~$ ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Please help, thanks.

IamHussain503 added the bug and training labels on Jan 10, 2023
@HeyangQin
Contributor

Hello @Shaukat-Hussain
You are using nvcc from /opt/conda/envs/bitten/bin/nvcc. Are you sure this is the nvcc you want to use? Also, the compile command does not include the system CUDA include dir /usr/local/cuda/include/, where cusolverDn.h is located.

Could you try export PATH=/usr/local/cuda/bin:$PATH to see if that fixes the problem? (Replace /usr/local/cuda/ with your cuda dir)
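To see the same thing without reading the long command by eye, here is a minimal Python sketch that pulls the -I and -isystem directories out of a compile command copied from the log and reports which of them contain cusolverDn.h. The function names are made up for this illustration and are not part of DeepSpeed or PyTorch.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def include_dirs(command: str) -> list[str]:
    """Collect directories passed as -I<dir>, -I <dir>, or -isystem <dir>."""
    tokens = shlex.split(command)
    dirs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-I", "-isystem"):
            # The directory is the next token, if there is one.
            if i + 1 < len(tokens):
                dirs.append(tokens[i + 1])
            i += 2
        else:
            if tok.startswith("-I"):
                dirs.append(tok[2:])
            i += 1
    return dirs


def dirs_with_header(command: str, header: str = "cusolverDn.h") -> list[str]:
    """Return the include directories from `command` that contain `header`."""
    return [d for d in include_dirs(command) if (Path(d) / header).is_file()]


if __name__ == "__main__":
    # Shortened from the failing c++ command in the log; paste the full line instead.
    cmd = (
        "c++ -I/opt/conda/envs/bitten/include "
        "-isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include "
        "-c cpu_adam.cpp -o cpu_adam.o"
    )
    print(include_dirs(cmd))
    print(dirs_with_header(cmd) or "no include dir on this command line has cusolverDn.h")
```

An empty result means the compiler was never pointed at a directory that holds the header, which matches the error in the report.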

@HeyangQin
Contributor

To follow up on this issue: the root cause is on the PyTorch side. They accidentally shipped nvcc with their conda package, which breaks the toolchain. The issue has been reported to the PyTorch team and it should be fixed in the next release.

For now, please use this temporary workaround: export PATH=/usr/local/cuda/bin:$PATH

Ref: https://discuss.pytorch.org/t/not-able-to-include-cusolverdn-h/169122

Please feel free to reopen the issue if the above solution doesn't work.
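If changing the shell environment is awkward (for example in a notebook), the same workaround can be applied from inside Python. This is only a sketch: the helper name is hypothetical, /usr/local/cuda/bin is the location assumed in the workaround above, and it assumes the change is made before deepspeed or torch's extension builder is imported, since the CUDA location may already be resolved by then.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` first on PATH, dropping any later duplicate of it."""
    env = os.environ if env is None else env
    rest = [p for p in env.get("PATH", "").split(os.pathsep) if p and p != directory]
    env["PATH"] = os.pathsep.join([directory] + rest)
    return env["PATH"]


# Assumed toolkit location, taken from the workaround above; adjust as needed.
CUDA_BIN = "/usr/local/cuda/bin"

if __name__ == "__main__":
    prepend_to_path(CUDA_BIN)
    # Only now import the packages that trigger the JIT build, e.g.:
    #   import deepspeed
    #   deepspeed.ops.op_builder.CPUAdamBuilder().load()
    print(os.environ["PATH"].split(os.pathsep)[0])
```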

@cocohao715

sudo apt install nvidia-cuda-dev

@tornikeo

tornikeo commented Apr 4, 2023

sudo apt install nvidia-cuda-dev

This can lead to "Failed to initialize NVML: Driver/library version mismatch". Use with caution.

@tornikeo

tornikeo commented Apr 4, 2023

I solved this issue by swapping the Docker base image.

I used pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
instead of pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel.

And the issue went away. Hope this helps.

@charush12

Along with updating $PATH, make sure CUDA_HOME points to the same CUDA installation as the nvcc you are using. That resolved the issue for me.
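A quick way to check that PATH and CUDA_HOME agree is sketched below. The function name and the messages are invented for this example; the only assumptions are that nvcc lives in CUDA_HOME/bin and the headers in CUDA_HOME/include, which is the layout of a standard toolkit install.

```python
import os
import shutil
from pathlib import Path


def cuda_env_problems(env=None, header="cusolverDn.h"):
    """Return human-readable problems with CUDA_HOME / PATH; empty if none found."""
    env = os.environ if env is None else env
    problems = []
    cuda_home = env.get("CUDA_HOME")
    nvcc = shutil.which("nvcc", path=env.get("PATH", ""))
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    if nvcc is None:
        problems.append("nvcc was not found on PATH")
    if cuda_home and nvcc:
        # Resolve symlinks so /usr/local/cuda and /usr/local/cuda-11.6 compare equal.
        expected = (Path(cuda_home) / "bin").resolve()
        if Path(nvcc).resolve().parent != expected:
            problems.append(f"nvcc is {nvcc}, which is not under {expected}")
    if cuda_home and not (Path(cuda_home) / "include" / header).is_file():
        problems.append(f"{header} not found in {Path(cuda_home) / 'include'}")
    return problems


if __name__ == "__main__":
    for line in cuda_env_problems() or ["no problems found"]:
        print(line)
```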

@thanhlong1997

thanhlong1997 commented Jun 26, 2023

@HeyangQin Could you help me, please? I have checked that the nvcc dir is correct and CUDA is already on $PATH, but I still get this error.

ERROR TraceBack

Installed CUDA version 11.2 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:8:
/opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/valle/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jovyan/vall-e/train.py", line 128, in <module>
main()
File "/home/jovyan/vall-e/train.py", line 119, in main
trainer.train(
File "/home/jovyan/vall-e/vall_e/utils/trainer.py", line 125, in train
engines = engines_loader()
File "/home/jovyan/vall-e/train.py", line 21, in load_engines
model=trainer.Engine(
File "/home/jovyan/vall-e/vall_e/utils/engines.py", line 22, in __init__
super().__init__(None, *args, **kwargs)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1283, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1360, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 73, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 485, in load
return self.jit_load(verbose)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 520, in jit_load
op_module = load(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

DS_REPORT:

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/valle/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

@HeyangQin
Contributor

Hi @thanhlong1997. Could you manually check if cusolverDn.h exists in the include dir?
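One way to do that check is sketched below in Python. The default search roots (/usr/local/cuda*, $CUDA_HOME, $CONDA_PREFIX) and the targets/<arch>/include layout are assumptions about common installs, not something confirmed for this machine, and the function name is made up.

```python
import glob
import os


def find_header(header="cusolverDn.h", roots=None):
    """Return include directories under `roots` that contain `header`."""
    if roots is None:
        # Assumed default locations; pass your own list to override.
        roots = sorted(glob.glob("/usr/local/cuda*"))
        roots += [os.environ[v] for v in ("CUDA_HOME", "CONDA_PREFIX") if os.environ.get(v)]
    found = []
    for root in roots:
        # Some installs keep headers in targets/<arch>/include rather than include.
        for pattern in ("include", os.path.join("targets", "*", "include")):
            for inc in glob.glob(os.path.join(root, pattern)):
                real = os.path.realpath(inc)  # collapse symlinked duplicates
                if os.path.isfile(os.path.join(real, header)) and real not in found:
                    found.append(real)
    return found


if __name__ == "__main__":
    hits = find_header()
    print("\n".join(hits) if hits else "cusolverDn.h not found in the searched roots")
```

If the header turns up in a directory that is missing from the failing compile command, that directory is the one the build needs to see.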

@YeSho-cpp

Hi @thanhlong1997. Could you manually check if cusolverDn.h exists in the include dir?

Hello. Regarding export PATH=/usr/local/cuda/bin:$PATH, how do I find my CUDA dir? This is the output of which nvcc:
~/miniconda3/envs/myseg/bin/nvcc
This is the error message:
---share/home/ncu10/miniconda3/envs/myseg/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~

@raoshashank

For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH
(solution provided by ChatGPT)

@ChdDongyang

For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH (solution provided by ChatGPT)

Awesome! This solved the problem for me!

@IcarusWizard

Another solution, if you still want to use conda to manage CUDA: install libcusolver-dev from the nvidia channel for your CUDA version. For example, I am using CUDA 11.6.1, so I can run conda install nvidia/label/cuda-11.6.1::libcusolver-dev

@Ethan-Chen-plus

conda install nvidia/label/cuda-11.6.1::libcusolver-dev

That doesn't work for me 😭

@Ethan-Chen-plus

I found that one of the best methods is to build DeepSpeed from source with the cpu_adam op pre-compiled:

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed/
DS_BUILD_CPU_ADAM=1 python setup.py build_ext -j8 bdist_wheel
pip install dist/deepspeed-0.14.3+b6e24adb-cp312-cp312-linux_x86_64.whl
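The wheel filename in the last command depends on the DeepSpeed version, commit, and Python version that were built, so it will differ on other machines. A small Python sketch that picks up whatever wheel the build produced (helper names are hypothetical; it assumes it runs from the DeepSpeed checkout, where the build step writes to dist/):

```python
import glob
import os
import subprocess
import sys


def newest_wheel(dist_dir="dist", prefix="deepspeed-"):
    """Return the most recently modified matching wheel in `dist_dir`, or None."""
    wheels = glob.glob(os.path.join(dist_dir, prefix + "*.whl"))
    return max(wheels, key=os.path.getmtime) if wheels else None


def pip_install(wheel):
    """Install `wheel` into the interpreter that is running this script."""
    subprocess.run([sys.executable, "-m", "pip", "install", wheel], check=True)


# Usage, from inside the DeepSpeed checkout after the build step:
#   wheel = newest_wheel()
#   if wheel:
#       pip_install(wheel)
```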
