
[BUG]: CUDA extension build skipped when installing from source #891

Closed
imabackstabber opened this issue Apr 27, 2022 · 12 comments
Labels
bug Something isn't working

Comments

@imabackstabber

🐛 Describe the bug

Hi, I used the Install From Source option to install Colossal-AI, but I encountered a problem like:
/path/to/myconda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
I have installed torch 1.11+cu113 and am using CUDA 11.1.
Any suggestions?

Environment

PyTorch 1.11+cu113
CUDA 11.1

@imabackstabber imabackstabber added the bug Something isn't working label Apr 27, 2022
@FrankLeeeee
Contributor

Hi, you should use torch built with CUDA 11.1, as your CUDA version is 11.1.

@FrankLeeeee
Contributor

Generally, Colossal-AI does allow building with torch 1.11+cu113 against CUDA 11.1. However, I think the problem is that you don't have a GPU available on your system, as the log says CUDA is not available.

@FrankLeeeee
Contributor

You can check this by

python -c "import torch;print(torch.cuda.is_available());"
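Beyond `is_available()`, it can also help to compare the CUDA version PyTorch was built with (`torch.version.cuda`) against the local toolkit version that `nvcc` reports, since a mismatch is a common cause of extension build failures. A minimal sketch (the `same_major_minor` helper is hypothetical, not part of torch):

```python
def same_major_minor(a, b):
    """Return True when two version strings share the same major.minor."""
    return a.split(".")[:2] == b.split(".")[:2]

if __name__ == "__main__":
    import torch  # assumes a torch install is present

    print("CUDA available:      ", torch.cuda.is_available())
    # torch.version.cuda is a string like "11.3" (None for CPU-only builds);
    # compare it against the toolkit version that `nvcc --version` reports.
    print("torch built with CUDA:", torch.version.cuda)
    print("versions compatible?  e.g.", same_major_minor("11.3", "11.1"))
```

For example, `same_major_minor("11.3", "11.1")` is False, matching the mismatch discussed later in this thread.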

@imabackstabber
Author

You can check this by

python -c "import torch;print(torch.cuda.is_available());"

Thanks for your reply. I have now made sure my CUDA is available, but I got an error like:

/mnt/cache/share/cuda-11.1/bin/nvcc -I/tmp/pip-req-build-2nvgz3bf/colossalai/kernel/cuda_native/csrc/kernels/include -I/path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include -I/path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/TH -I/mnt/cache/yangwendi.vendor/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/THC -I/mnt/cache/share/cuda-11.1/include -I/path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/include/python3.7m -c colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu -o build/temp.linux-x86_64-3.7/colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=colossal_scaled_upper_triang_masked_softmax -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h: In instantiation of ‘std::shared_ptr<torch::nn::Module> torch::nn::Cloneable<Derived>::clone(const c10::optional<c10::Device>&) const [with Derived = torch::nn::CrossMapLRN2dImpl]’:
    /tmp/tmpxft_000187da_00000000-6_scaled_upper_triang_masked_softmax_cuda.compute_80.cudafe1.stub.c:209:27:   required from here
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:57:59: error: invalid static_cast from type ‘const torch::OrderedDict<std::basic_string<char>, at::Tensor>’ to type ‘torch::OrderedDict<std::basic_string<char>, at::Tensor>&’
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:69:61: error: invalid static_cast from type ‘const torch::OrderedDict<std::basic_string<char>, std::shared_ptr<torch::nn::Module> >’ to type ‘torch::OrderedDict<std::basic_string<char>, std::shared_ptr<torch::nn::Module> >&’
    ...
    ...
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/optim/sgd.h:49:48:   required from here
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:57:59: error: invalid static_cast from type ‘const torch::OrderedDict<std::basic_string<char>, at::Tensor>’ to type ‘torch::OrderedDict<std::basic_string<char>, at::Tensor>&’
    /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:69:61: error: invalid static_cast from type ‘const torch::OrderedDict<std::basic_string<char>, std::shared_ptr<torch::nn::Module> >’ to type ‘torch::OrderedDict<std::basic_string<char>, std::shared_ptr<torch::nn::Module> >&’
    error: command '/mnt/cache/share/cuda-11.1/bin/nvcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-2nvgz3bf/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-2nvgz3bf/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-b988udni/install-record.txt --single-version-externally-managed --compile --install-headers /path/to/conda/anaconda3/envs/py37-pt111-cu111-colai/include/python3.7m/colossalai Check the logs for full command output.

@imabackstabber
Author

Any possible solution? @FrankLeeeee

@FrankLeeeee
Contributor

Any possible solution? @FrankLeeeee

Hi, one possible reason is that your PyTorch is built with CUDA 11.3 while your CUDA toolkit is 11.1. Can you try this version?

pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
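To confirm which toolkit version the build is actually picking up, the release number can be extracted from the `nvcc --version` banner and compared against the `+cuXYZ` tag of the installed torch wheel. A sketch under the assumption that `nvcc` prints the standard `release X.Y` banner (the `parse_nvcc_release` helper is hypothetical):

```python
import re
import subprocess

def parse_nvcc_release(banner):
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    m = re.search(r"release (\d+\.\d+)", banner)
    return m.group(1) if m else None

if __name__ == "__main__":
    # Assumes nvcc is on PATH; this is the toolkit the extension build uses.
    out = subprocess.run(["nvcc", "--version"],
                         capture_output=True, text=True).stdout
    print("nvcc release:", parse_nvcc_release(out))
    # If this differs from torch's CUDA tag (e.g. 11.1 vs cu113), the
    # extension build can fail as in the log above.
```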

@imabackstabber
Author

Any possible solution? @FrankLeeeee

Hi, one possible reason is that your PyTorch is built with CUDA 11.3 while your CUDA toolkit is 11.1. Can you try this version?

pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

Thanks for the prompt reply, I will try it.

@imabackstabber
Author

Well, I tried the above command line and tested it like:

>>> import torch
>>> torch.__version__
'1.10.0+cu111'
>>> import torchvision
>>> torchvision.__version__
'0.11.2+cu111'

and I still can't compile it. Maybe you have a prebuilt package for release?

@feifeibear
Contributor

@imabackstabber Sorry that you are struggling to install our software. We will solve the CUDA dependency problems ASAP. Could you please use the docker container to install and run in the meantime?

@FrankLeeeee
Contributor

@imabackstabber We are building a self-hosted pip source; we will let you know ASAP.

@imabackstabber
Author

OK, looking forward to your reply!

@FrankLeeeee
Contributor

Colossal-AI pre-built with CUDA extension is provided at https://www.colossalai.org/download now.
