[BUG] fatal error: cusolverDn.h: No such file or directory #2684

IamHussain503 · 2023-01-10T10:13:38Z

Describe the bug
I installed DeepSpeed and its dependencies gcc and g++ following these guides:

https://lindevs.com/install-gcc-on-ubuntu
https://lindevs.com/install-g-on-ubuntu

Then I ran the following in Python:
import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()

This should build and load cpu_adam successfully. Instead, it fails with
fatal error: cusolverDn.h: No such file or directory
and the final error is:
RuntimeError: Error building extension 'cpu_adam'

I downloaded the following packages from
https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/
cuda-license-10-0_10.0.130-1_amd64.deb
cuda-cublas-dev-10-0_10.0.130-1_amd64.deb
cuda-cublas-10-0_10.0.130-1_amd64.deb

cuda-cusolver-10-0_10.0.130-1_amd64.deb
cuda-cusolver-dev-10-0_10.0.130-1_amd64.deb

cuda-curand-10-0_10.0.130-1_amd64.deb

and installed them all, but the error does not go away.

import deepspeed
deepspeed.ops.op_builder.CPUAdamBuilder().load()
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:11,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
[2/3] /opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/opt/conda/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/envs/bitten/include -isystem /opt/conda/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:3:0,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16,
from /opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:1:
/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
#include <cusolverDn.h>
^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/bitten/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load
return self.jit_load(verbose)
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load
op_module = load(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

To Reproduce
Steps to reproduce the behavior:
OS: Ubuntu 18.04
(bitten) root@C.5718699:$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
(bitten) root@C.5718699:$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:04:00.0 Off |                  Off |
| 30%   28C    P8    18W / 230W |      1MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:44:00.0 Off |                  Off |
| 30%   27C    P8    19W / 230W |      1MiB / 24256MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior
CPUAdamBuilder().load() should build and load cpu_adam without errors.

ds_report output
(bitten) root@C.5718699:~$ ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/opt/conda/envs/bitten/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Please help, thanks.


HeyangQin · 2023-01-10T19:23:51Z

Hello @Shaukat-Hussain
You are using nvcc from /opt/conda/envs/bitten/bin/nvcc. Are you sure this is the nvcc you want to use? Also, the compile command line does not include the system CUDA include directory /usr/local/cuda/include/, which is where cusolverDn.h is located.

Could you try export PATH=/usr/local/cuda/bin:$PATH to see if that fixes the problem? (Replace /usr/local/cuda/ with your CUDA directory.)
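The observation about the missing include directory can be checked mechanically. Below is a minimal sketch using only the Python standard library; include_dirs and dirs_with_header are hypothetical helpers (not part of DeepSpeed or PyTorch). It extracts the -I and -isystem directories from a compile command copied out of the build log, assuming POSIX shell quoting, and reports which of them contain cusolverDn.h.

```python
from __future__ import annotations

import os
import shlex


def include_dirs(command: str) -> list[str]:
    """Return the -I and -isystem directories of a compiler command line, in order."""
    tokens = shlex.split(command)
    dirs: list[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-I", "-isystem") and i + 1 < len(tokens):
            # Separated form, e.g. "-isystem /usr/local/cuda/include"
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-isystem") and len(tok) > len("-isystem"):
            # Joined form, e.g. "-isystem/usr/include"
            dirs.append(tok[len("-isystem"):])
        elif tok.startswith("-I") and len(tok) > 2:
            # Joined form, e.g. "-I/opt/conda/envs/bitten/include"
            dirs.append(tok[2:])
        i += 1
    return dirs


def dirs_with_header(command: str, header: str = "cusolverDn.h") -> list[str]:
    """Return the include directories from `command` that actually contain `header`."""
    return [d for d in include_dirs(command) if os.path.isfile(os.path.join(d, header))]


if __name__ == "__main__":
    # Paste one of the failing "c++ ..." or "nvcc ..." lines from the build log here.
    cmd = "c++ -I/opt/conda/envs/bitten/include -isystem /usr/local/cuda/include -c cpu_adam.cpp"
    print(include_dirs(cmd))
    print(dirs_with_header(cmd) or "cusolverDn.h is not in any of these directories")
```

An empty result from dirs_with_header on the real log line means no include path on that command line provides the header, which matches the error above.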

HeyangQin · 2023-01-12T00:48:05Z

To follow up on this issue: the root cause is on the PyTorch side. They accidentally shipped nvcc in their conda package, which breaks the toolchain. The issue has been reported to the PyTorch team and should be fixed in the next release.
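If this is what is happening on your machine, the first nvcc on PATH will live inside the active conda environment, as /opt/conda/envs/bitten/bin/nvcc does in the log above. A minimal check, assuming conda sets CONDA_PREFIX for the active environment; nvcc_on_path and is_inside are hypothetical helper names.

```python
from __future__ import annotations

import os
import shutil


def nvcc_on_path(path: str | None = None) -> str | None:
    """Return the resolved path of the first nvcc on PATH (or on `path` if given)."""
    found = shutil.which("nvcc", path=path)
    return os.path.realpath(found) if found else None


def is_inside(child: str, parent: str) -> bool:
    """True if `child` lives under directory `parent`."""
    child = os.path.realpath(child)
    parent = os.path.realpath(parent)
    return os.path.commonpath([child, parent]) == parent


if __name__ == "__main__":
    nvcc = nvcc_on_path()
    prefix = os.environ.get("CONDA_PREFIX")  # assumed to be set by `conda activate`
    print("nvcc on PATH:", nvcc)
    if nvcc and prefix and is_inside(nvcc, prefix):
        print("This nvcc comes from the active conda environment:", prefix)
```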

For now, please use this temporary workaround: export PATH=/usr/local/cuda/bin:$PATH
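When training is launched from a Python script, the same workaround can be applied in-process. This is a sketch under two assumptions: the system CUDA toolkit is at /usr/local/cuda, and the environment is changed before torch or deepspeed is imported, because torch.utils.cpp_extension resolves CUDA_HOME when it is first imported. prepend_to_path_var is a hypothetical helper.

```python
from __future__ import annotations

import os
from collections.abc import MutableMapping

# Assumed system CUDA location; replace it with your own CUDA directory.
CUDA_DIR = "/usr/local/cuda"


def prepend_to_path_var(env: MutableMapping[str, str], key: str, directory: str) -> None:
    """Put `directory` first in the os.pathsep-separated variable `key`, dropping duplicates."""
    parts = [p for p in env.get(key, "").split(os.pathsep) if p and p != directory]
    env[key] = os.pathsep.join([directory] + parts)


if __name__ == "__main__":
    # This must run before `import torch` or `import deepspeed` in the same process.
    prepend_to_path_var(os.environ, "PATH", os.path.join(CUDA_DIR, "bin"))
    os.environ["CUDA_HOME"] = CUDA_DIR
    # import deepspeed
    # deepspeed.ops.op_builder.CPUAdamBuilder().load()
```

Exporting the variables in the shell before starting Python is simpler and avoids the import-order caveat.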

Ref: https://discuss.pytorch.org/t/not-able-to-include-cusolverdn-h/169122

Please feel free to reopen the issue if the above solution doesn't work.

cocohao715 · 2023-02-08T02:53:55Z

sudo apt install nvidia-cuda-dev

tornikeo · 2023-04-04T08:23:22Z

Regarding "sudo apt install nvidia-cuda-dev": this can lead to "Failed to initialize NVML: Driver/library version mismatch". Use with caution.

tornikeo · 2023-04-04T09:33:43Z

I solved this issue by swapping out the Docker base image.

I used pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
instead of pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel.

And the issue went away. Hope this helps.

charush12 · 2023-06-25T17:27:27Z

Along with adding CUDA to $PATH, make sure CUDA_HOME also points to the CUDA installation that provides that nvcc. That resolved the issue for me.
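Since nvcc normally sits at <CUDA_HOME>/bin/nvcc, one way to follow this advice is to derive CUDA_HOME from the nvcc you intend to use and compare it with the current value. A small sketch; the helper names are hypothetical and the bin/nvcc layout is assumed, not guaranteed for every packaging.

```python
from __future__ import annotations

import os
import shutil


def cuda_home_for(nvcc_path: str) -> str:
    """Assuming the usual <CUDA_HOME>/bin/nvcc layout, return <CUDA_HOME>."""
    return os.path.dirname(os.path.dirname(os.path.realpath(nvcc_path)))


def cuda_home_mismatch(nvcc_path: str, cuda_home: str | None) -> bool:
    """True if CUDA_HOME is unset or points somewhere other than nvcc's installation."""
    if not cuda_home:
        return True
    return os.path.realpath(cuda_home) != cuda_home_for(nvcc_path)


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc is not on PATH")
    else:
        expected = cuda_home_for(nvcc)
        current = os.environ.get("CUDA_HOME")
        print("nvcc:", nvcc, "-> expected CUDA_HOME:", expected)
        if cuda_home_mismatch(nvcc, current):
            print("CUDA_HOME is", current, "; consider: export CUDA_HOME=" + expected)
```

Both sides are passed through os.path.realpath, so a /usr/local/cuda symlink and the versioned directory it points to compare as equal.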

thanhlong1997 · 2023-06-26T07:04:38Z

@HeyangQin Could you help me, please? I have checked that the nvcc directory is correct and CUDA is already on $PATH, but I still get this error.

Error traceback:

Installed CUDA version 11.2 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:8:
/opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/envs/valle/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/valle/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/valle/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jovyan/vall-e/train.py", line 128, in <module>
main()
File "/home/jovyan/vall-e/train.py", line 119, in main
trainer.train(
File "/home/jovyan/vall-e/vall_e/utils/trainer.py", line 125, in train
engines = engines_loader()
File "/home/jovyan/vall-e/train.py", line 21, in load_engines
model=trainer.Engine(
File "/home/jovyan/vall-e/vall_e/utils/engines.py", line 22, in __init__
super().__init__(None, *args, **kwargs)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1283, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1360, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 73, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 485, in load
return self.jit_load(verbose)
File "/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 520, in jit_load
op_module = load(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/valle/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'

DS_REPORT:

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/valle/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/opt/conda/envs/valle/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

HeyangQin · 2023-06-29T18:40:23Z

Hi @thanhlong1997. Could you manually check whether cusolverDn.h exists in your CUDA include directory?
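For that manual check, a sketch that searches a few candidate roots recursively, since CUDA headers can sit in nested directories (for example targets/x86_64-linux/include) rather than directly in include/. The candidate roots (CUDA_HOME, CONDA_PREFIX, /usr/local/cuda) are assumptions, and find_header is a hypothetical helper.

```python
from __future__ import annotations

import glob
import os


def find_header(roots: list[str], name: str = "cusolverDn.h") -> list[str]:
    """Recursively search each existing root directory for `name`; return sorted matches."""
    hits = set()
    for root in roots:
        if root and os.path.isdir(root):
            # "**" with recursive=True also matches zero directories, so root/name counts too.
            pattern = os.path.join(glob.escape(root), "**", name)
            hits.update(glob.glob(pattern, recursive=True))
    return sorted(hits)


if __name__ == "__main__":
    # Candidate roots are assumptions; add your own CUDA directory if it lives elsewhere.
    roots = [os.environ.get("CUDA_HOME", ""), os.environ.get("CONDA_PREFIX", ""), "/usr/local/cuda"]
    matches = find_header(roots)
    print("\n".join(matches) if matches else "cusolverDn.h not found under any candidate root")
```

If the header is found only in a directory that does not appear on the failing compile command, that directory is what PATH, CUDA_HOME, or CPATH needs to point at.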

YeSho-cpp · 2023-06-30T13:26:42Z

Quoting @HeyangQin: "Hi @thanhlong1997. Could you manually check if cusolverDn.h exists in the include dir?"

Hello. Regarding export PATH=/usr/local/cuda/bin:$PATH, how do I find my CUDA directory? Running which nvcc gives
~/miniconda3/envs/myseg/bin/nvcc
and this is the error message:
share/home/ncu10/miniconda3/envs/myseg/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
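On the question of where the CUDA directory is: NVIDIA's installers typically place toolkits under /usr/local as cuda-X.Y directories, often with a cuda symlink. A hypothetical helper, assuming that layout, that lists such directories containing bin/nvcc. If it finds nothing, the machine may have no standalone toolkit and the conda nvcc may be the only one.

```python
from __future__ import annotations

import glob
import os


def cuda_installs(root: str = "/usr/local") -> list[str]:
    """Return sorted cuda / cuda-* directories under `root` that contain bin/nvcc."""
    candidates = glob.glob(os.path.join(glob.escape(root), "cuda*"))
    return sorted(d for d in candidates if os.path.isfile(os.path.join(d, "bin", "nvcc")))


if __name__ == "__main__":
    found = cuda_installs()
    print("\n".join(found) if found else "no CUDA toolkit with bin/nvcc found under /usr/local")
```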

raoshashank · 2023-10-06T09:42:07Z

For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH
(solution provided by ChatGPT)
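CPATH is an environment variable that GCC reads as a list of extra include directories, searched as if given with -I but after any -I options on the command line. This is likely why it fills the gap left by the missing /usr/local/cuda/include. To avoid changing your shell profile, you can set it for a single child process, as in the sketch below. The include path is an assumption, and the child command is a placeholder for your own training command.

```python
from __future__ import annotations

import os
import subprocess
import sys
from collections.abc import Mapping


def env_with_cpath(include_dir: str, base: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return a copy of `base` (default: os.environ) with include_dir prepended to CPATH."""
    env = dict(os.environ if base is None else base)
    old = env.get("CPATH", "")
    env["CPATH"] = include_dir + (os.pathsep + old if old else "")
    return env


if __name__ == "__main__":
    env = env_with_cpath("/usr/local/cuda/include")  # assumed CUDA include directory
    # Placeholder child command; replace it with e.g. your training script.
    subprocess.run([sys.executable, "-c", "import os; print(os.environ['CPATH'])"], env=env, check=True)
```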

ChdDongyang · 2023-11-14T07:18:00Z

Quoting the comment above: "For me this solved the issue: export CPATH=/usr/local/cuda/include:$CPATH (solution provided by ChatGPT)"

Awesome! I solved the problem this way!

IcarusWizard · 2024-02-07T15:41:24Z

Another solution, if you still want to use conda to manage CUDA: install libcusolver-dev from the nvidia channel for your CUDA version. For example, I am using CUDA 11.6.1, so I can run conda install nvidia/label/cuda-11.6.1::libcusolver-dev.

Ethan-Chen-plus · 2024-06-12T06:44:23Z

Quoting the above: "conda install nvidia/label/cuda-11.6.1::libcusolver-dev"

That didn't work for me 😭

Ethan-Chen-plus · 2024-06-12T07:05:48Z

One of the best methods I have found is to build the wheel from source:

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed/
DS_BUILD_CPU_ADAM=1 python setup.py build_ext -j8 bdist_wheel
pip install dist/deepspeed-0.14.3+b6e24adb-cp312-cp312-linux_x86_64.whl
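The wheel filename in the last command (version, commit hash, cp312 tag) depends on the checkout and the Python version, so it will differ on other machines. The sketch below runs the same two steps from Python and installs whichever deepspeed wheel ends up in dist/. build_and_install and newest_wheel are hypothetical helpers, and the build flags are copied from the commands above.

```python
from __future__ import annotations

import glob
import os
import subprocess
import sys


def newest_wheel(dist_dir: str = "dist") -> str:
    """Return the most recently modified deepspeed-*.whl in dist_dir."""
    wheels = glob.glob(os.path.join(glob.escape(dist_dir), "deepspeed-*.whl"))
    if not wheels:
        raise FileNotFoundError(f"no deepspeed wheel found in {dist_dir!r}")
    return max(wheels, key=os.path.getmtime)


def build_and_install(repo_dir: str) -> None:
    """Build a wheel with cpu_adam prebuilt (DS_BUILD_CPU_ADAM=1), then pip-install it."""
    env = dict(os.environ, DS_BUILD_CPU_ADAM="1")
    subprocess.run(
        [sys.executable, "setup.py", "build_ext", "-j8", "bdist_wheel"],
        cwd=repo_dir, env=env, check=True,
    )
    wheel = newest_wheel(os.path.join(repo_dir, "dist"))
    subprocess.run([sys.executable, "-m", "pip", "install", wheel], check=True)
```

Usage, after git clone: build_and_install("DeepSpeed"). Because sys.executable is used for both steps, the wheel is built and installed for the same interpreter.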

IamHussain503 added the bug (Something isn't working) and training labels on Jan 10, 2023.
tjruwase assigned HeyangQin on Jan 11, 2023.
HeyangQin closed this as completed on Jan 12, 2023.

loadams mentioned this issue on Jan 13, 2023 in: Cannot find cuda_profiler_api.h when building cpu_adam #2622 (Closed)
zyuh mentioned this issue on Apr 3, 2023 in: About deepspeed and fsdp speed differences？ OptimalScale/LMFlow#24 (Closed)
Godofnothing mentioned this issue on Apr 5, 2023 in: fatal error: cusolverDn.h: No such file or directory Dao-AILab/flash-attention#157 (Closed)
mcale6 mentioned this issue on Jun 19, 2023 in: Building Flash Attention from source: Fatal error: cusolverDn.h: No such file or directory Dao-AILab/flash-attention#280 (Open)
freecraver mentioned this issue on Nov 14, 2023 in: [Bug]: what cuda version dlib supports so far davisking/dlib#2832 (Open)
