ImportError: ./mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd #385

Open
KorovkoAlexander opened this issue Mar 10, 2019 · 18 comments

Comments

@KorovkoAlexander
commented Mar 10, 2019

Hi, I'm facing a problem with training:

alexander@alexander-desktop:~/Code/Projects/mmdetection$ ./tools/dist_train.sh configs/retinanet_r101_fpn_1x.py 1
Traceback (most recent call last):
  File "./tools/train.py", line 7, in <module>
    from mmdet.datasets import get_dataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/__init__.py", line 1, in <module>
    from .custom import CustomDataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/custom.py", line 11, in <module>
    from .extra_aug import ExtraAugmentation
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/extra_aug.py", line 5, in <module>
    from mmdet.core.evaluation.bbox_overlaps import bbox_overlaps
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/__init__.py", line 6, in <module>
    from .post_processing import *  # noqa: F401, F403
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/__init__.py", line 1, in <module>
    from .bbox_nms import multiclass_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/bbox_nms.py", line 3, in <module>
    from mmdet.ops.nms import nms_wrapper
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/__init__.py", line 5, in <module>
    from .nms import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/__init__.py", line 1, in <module>
    from .nms_wrapper import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 4, in <module>
    from .gpu_nms import gpu_nms
ImportError: /home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd

I'm using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04.
Everything compiled without errors during installation.

@KorovkoAlexander changed the title from "ImportError: /home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd" to "ImportError: ./mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd" on Mar 10, 2019
@donglao

commented Mar 10, 2019

I just solved the same issue. Check that your PyTorch and CUDA versions are compatible.

@Hesene

commented Mar 11, 2019

> I just solved the same issue. Check that your PyTorch and CUDA versions are compatible.

Hi, thank you for sharing.
I also have this problem. Does it work with CUDA 10 and PyTorch 1.0? And which CUDA and PyTorch versions are you using?
Thank you for your answer!

@donglao

commented Mar 11, 2019

>> I just solved the same issue. Check that your PyTorch and CUDA versions are compatible.
>
> Hi, thank you for sharing.
> I also have this problem. Does it work with CUDA 10 and PyTorch 1.0? And which CUDA and PyTorch versions are you using?
> Thank you for your answer!

I'm using CUDA 10.1 and PyTorch 1.0.

@hellock

Contributor

commented Mar 11, 2019

For reference, we have tried mmdetection with CUDA 9.0/9.2/10.0 on PyTorch 1.0, and with CUDA 9.0/9.2 on PyTorch 0.4.1.
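
The combinations listed in this comment can be written down as a small lookup for a quick sanity check. This is only a sketch: the table holds just the pairs mentioned here, the helper name `is_tested_combo` is made up for illustration, and treating a release such as 1.0.1.post2 as "PyTorch 1.0" is an assumption, not something the maintainers stated.

```python
# Combinations reported as tried in this comment (PyTorch prefix -> CUDA versions).
TESTED_COMBOS = {
    "1.0": {"9.0", "9.2", "10.0"},
    "0.4.1": {"9.0", "9.2"},
}


def is_tested_combo(torch_version, cuda_version):
    """Return True if the pair matches one of the combinations listed above.

    Assumption: a release such as "1.0.1.post2" counts as PyTorch "1.0".
    """
    torch_parts = torch_version.split(".")
    for prefix, cuda_versions in TESTED_COMBOS.items():
        prefix_parts = prefix.split(".")
        if torch_parts[: len(prefix_parts)] == prefix_parts:
            return cuda_version in cuda_versions
    return False


# The setup from the original report is not in the list.
print(is_tested_combo("1.0.1.post2", "10.1"))  # False
print(is_tested_combo("1.0.1.post2", "10.0"))  # True
```

A `False` result does not prove a combination is broken; it only means nobody in this comment reported trying it.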

@Baby47

commented Apr 4, 2019

Hello @hellock, @donglao, @Hesene, I've run into a similar problem:
  File "./mmdetection/mmdet/ops/dcn/__init__.py", line 1, in <module>
    from .functions.deform_conv import deform_conv, modulated_deform_conv
  File "./mmdetection/mmdet/ops/dcn/functions/deform_conv.py", line 5, in <module>
    from .. import deform_conv_cuda
ImportError: ./mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-35m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd

I'm using CUDA 8.0 and PyTorch 1.0. The GCC version is 5.4.0.
I wonder whether you have tried mmdetection with this configuration.
Thanks a lot!

@ruiyuanlu

commented Apr 15, 2019

I've run into the same issue.
I'm also using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04.
Note that CUDA 8.0 is not available for Ubuntu 18.04. I tried compiling PyTorch 1.0.1.post2 from source with CUDA 10.1 and installing it, but the error "undefined symbol: __cudaRegisterFatBinaryEnd" still occurred.

I also tried CUDA 9.0 with PyTorch 1.0.1.post2 and got the error "undefined symbol: __cudaPopCallConfiguration". Any tips?

@yhcao6

Contributor

commented Apr 15, 2019

What is your GCC version?

@ruiyuanlu

commented Apr 15, 2019

> What is your GCC version?

gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0

@yhcao6

Contributor

commented Apr 15, 2019

I compiled it successfully with GCC 5.4 and CUDA 9.

@ruiyuanlu

commented Apr 15, 2019

> I compiled it successfully with GCC 5.4 and CUDA 9.

I'll try GCC 5.4 tomorrow.
Have you tried CUDA 10.0 or CUDA 10.1? The only related code I could find is in:
https://github.com/llvm-mirror/clang/blob/master/lib/CodeGen/CGCUDANV.cpp

where line 699 creates the function "__cudaRegisterFatBinaryEnd" and line 292 creates the function "__cudaPopCallConfiguration".

As for __cudaRegisterFatBinaryEnd, I found that clang defines a new feature flag for CUDA 10.1 at:
https://clang.llvm.org/doxygen/include_2clang_2Basic_2Cuda_8h_source.html#l00108

 //  Various SDK-dependent features that affect CUDA compilation
 enum class CudaFeature {
   // CUDA-9.2+ uses a new API for launching kernels.
   CUDA_USES_NEW_LAUNCH,
   // CUDA-10.1+ needs explicit end of GPU binary registration.
   CUDA_USES_FATBIN_REGISTER_END,
 };

CUDA_USES_FATBIN_REGISTER_END is then checked at line 663 of
https://github.com/llvm-mirror/clang/blob/master/lib/CodeGen/CGCUDANV.cpp#L663:

// Call __cudaRegisterFatBinaryEnd(Handle) if this CUDA version needs it.
if (CudaFeatureEnabled(CGM.getTarget().getSDKVersion(),
                       CudaFeature::CUDA_USES_FATBIN_REGISTER_END)) {
  // void __cudaRegisterFatBinaryEnd(void **);
  llvm::FunctionCallee RegisterFatbinEndFunc = CGM.CreateRuntimeFunction(
      llvm::FunctionType::get(VoidTy, VoidPtrPtrTy, false),
      "__cudaRegisterFatBinaryEnd");
  CtorBuilder.CreateCall(RegisterFatbinEndFunc, RegisterFatbinCall);
}

Is there a mismatch between CUDA 10.1 and PyTorch?
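
Going by the clang comments quoted above, the missing symbol hints at which CUDA toolkit compiled the extension. The sketch below pulls the symbol out of the ImportError text and looks it up. The helper name `diagnose_undefined_symbol` and the table are illustrative only, and tying `__cudaPopCallConfiguration` to the "CUDA-9.2+" launch API is my reading of the clang source, not something confirmed in this thread.

```python
import re

# Minimum CUDA toolkit that emits each symbol, per the clang comments quoted above.
# Assumption: __cudaPopCallConfiguration belongs to the "new launch" API (CUDA 9.2+).
SYMBOL_MIN_CUDA = {
    "__cudaRegisterFatBinaryEnd": "10.1",
    "__cudaPopCallConfiguration": "9.2",
}


def diagnose_undefined_symbol(message):
    """Return (symbol, min_cuda) parsed from an ImportError message, or None."""
    match = re.search(r"undefined symbol: (\w+)", message)
    if match is None:
        return None
    symbol = match.group(1)
    # min_cuda is None when the symbol is not in the table.
    return symbol, SYMBOL_MIN_CUDA.get(symbol)


error = (
    "ImportError: ./mmdetection/mmdet/ops/nms/"
    "gpu_nms.cpython-36m-x86_64-linux-gnu.so: "
    "undefined symbol: __cudaRegisterFatBinaryEnd"
)
print(diagnose_undefined_symbol(error))  # ('__cudaRegisterFatBinaryEnd', '10.1')
```

If the reported version is newer than the CUDA runtime that PyTorch loads, the extension was probably compiled with a newer toolkit than PyTorch was.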

@hellock

Contributor

commented Apr 16, 2019

@ruiyuanlu I have tried CUDA 10 with GCC 7.x before, and that combination works.

@ruiyuanlu

commented Apr 16, 2019

I've solved this issue on my machine using PyTorch 1.1.0 (the latest version on GitHub).

GCC 5.x doesn't help, because some compile options in the CMakeLists.txt of PyTorch 1.1.0 are not supported by GCC 5.x, while GCC 7.x compiles PyTorch fine.

It seems that CUDA 10.0 differs slightly from CUDA 10.1. This issue is caused by a mismatch between the CUDA version PyTorch was compiled against and the CUDA version present at run time. Other mismatches produce other errors; for example, "undefined symbol: __cudaPopCallConfiguration" can occur with an earlier CUDA version. My solution is therefore to recompile PyTorch against the run-time CUDA version. Changing the run-time CUDA version might also work, but I didn't test that. Here is how I fixed it.

(Ubuntu 18.04 only)

1. Uninstall pytorch if it doesn't work:

pip uninstall torch  # or: conda uninstall pytorch, if you use conda

2. Install CUDA-10.0 (optional)

This step is optional. Other CUDA versions should be OK, as long as the CUDA version at compile time matches the run-time version.

Follow the runfile instructions here:

Then check the nvcc version:

nvcc -V

The output should be something like (release 10.0):

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
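
If you want to read the release number programmatically, a regular expression over the `nvcc -V` output is enough. A small sketch that uses the sample output above as input; `parse_nvcc_release` is a made-up helper name.

```python
import re

# Sample `nvcc -V` output, copied from the text above.
SAMPLE_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
"""


def parse_nvcc_release(output):
    """Return the release string (e.g. '10.0') from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


print(parse_nvcc_release(SAMPLE_OUTPUT))  # 10.0
```

On a real machine with Python 3.7+, the input could come from `subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout`.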

Note that a symlink is needed after installing CUDA:

sudo rm -f /usr/local/cuda # optional, only if you already have this symlink
sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda

Then add these paths to your ~/.bashrc file. They will be used when compiling PyTorch.

export CUDA_HOME=/usr/local/cuda
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/local/cuda/lib64"

Run source so that the paths above are loaded:

source ~/.bashrc
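
Before starting the long PyTorch build, it can be worth confirming that the four variables really are set in the current shell. A minimal sketch, assuming the /usr/local/cuda symlink from above; `check_cuda_env` is a hypothetical helper, not part of PyTorch or mmdetection.

```python
import os


def check_cuda_env(env, cuda_home="/usr/local/cuda"):
    """Return a list of problems found in the CUDA-related environment variables."""
    problems = []
    if env.get("CUDA_HOME") != cuda_home:
        problems.append("CUDA_HOME is not set to " + cuda_home)
    if cuda_home + "/bin" not in env.get("PATH", "").split(":"):
        problems.append(cuda_home + "/bin is missing from PATH")
    for name in ("LD_LIBRARY_PATH", "LIBRARY_PATH"):
        if cuda_home + "/lib64" not in env.get(name, "").split(":"):
            problems.append(cuda_home + "/lib64 is missing from " + name)
    return problems


# Check the current environment; an empty list means all four entries are in place.
print(check_cuda_env(os.environ))
```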

3. Compile pytorch

The instructions can be found here, but some details might be different.

Note that mkl=2019.3 is required. Details can be found in this issue.

conda install numpy pyyaml mkl=2019.3 mkl-include setuptools cmake cffi typing
conda install -c pytorch magma-cuda100 # optional step
# clone the pytorch source code
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
make clean # make clean is needed in my case
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
sudo python setup.py install # sudo is needed in my case.

After all of the steps above, it finally worked.

@wyhcqq

commented Apr 22, 2019

> I've run into the same issue.
> I'm also using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04.
> Note that CUDA 8.0 is not available for Ubuntu 18.04. I tried compiling PyTorch 1.0.1.post2 from source with CUDA 10.1 and installing it, but the error "undefined symbol: __cudaRegisterFatBinaryEnd" still occurred.
>
> I also tried CUDA 9.0 with PyTorch 1.0.1.post2 and got the error "undefined symbol: __cudaPopCallConfiguration". Any tips?

Have you solved this problem? Can you help me?

@ruiyuanlu

commented Apr 23, 2019

>> I've run into the same issue.
>> I'm also using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04.
>> Note that CUDA 8.0 is not available for Ubuntu 18.04. I tried compiling PyTorch 1.0.1.post2 from source with CUDA 10.1 and installing it, but the error "undefined symbol: __cudaRegisterFatBinaryEnd" still occurred.
>> I also tried CUDA 9.0 with PyTorch 1.0.1.post2 and got the error "undefined symbol: __cudaPopCallConfiguration". Any tips?
>
> Have you solved this problem? Can you help me?

I've already solved this problem by recompiling PyTorch, and the solution has been posted here. Just follow my steps; they work on my machine. If they don't work for you, please share more details.

@wyhcqq

commented Apr 23, 2019

@ruiyuanlu Do I need to uninstall CUDA 10.1 and install CUDA 10.0? I have been using CUDA 10.1 on Ubuntu 16.04.
(1. Uninstall pytorch if it doesn't work:
pip uninstall torch # or: conda uninstall pytorch, if you use conda
2. Install CUDA-10.0)

@ruiyuanlu

commented Apr 23, 2019

@wyhcqq Not necessarily. I didn't test CUDA 10.1 on my machine, but it should be OK to use CUDA 10.1 as long as the CUDA version at compile time matches the run-time version.
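
One way to check that condition is to compare the CUDA version PyTorch was built with (`torch.version.cuda`) against the release reported by `nvcc -V`. A minimal sketch: the helper names are illustrative, and comparing only major.minor is an assumption about what counts as a match.

```python
def major_minor(version):
    """Reduce a version string such as '10.0.130' to (10, 0)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(build_cuda, toolkit_cuda):
    """True when the build-time and installed CUDA versions agree on major.minor."""
    if build_cuda is None:  # torch.version.cuda is None for CPU-only builds
        return False
    return major_minor(build_cuda) == major_minor(toolkit_cuda)


try:
    import torch
    build_cuda = torch.version.cuda
except (ImportError, OSError):  # PyTorch missing or failing to load
    build_cuda = None

# Replace "10.1" with the release printed by `nvcc -V` on your machine.
print("PyTorch built with CUDA:", build_cuda)
print("Matches toolkit 10.1:", cuda_versions_match(build_cuda, "10.1"))
```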

@clearRain

commented Jun 28, 2019

I have the same problem. My versions are PyTorch 1.1, CUDA 9.1, and GCC 5.4. Must I switch from CUDA 9.1 to 9.0 or another version?

@Shudeng

commented Oct 15, 2019

I encountered the same problem when running the following example:

python tools/test.py configs/faster_rcnn_r50_fpn_1x.py \
   checkpoints/faster_rcnn_r50_fpn_1x_20181010-3d1b3351.pth \
   --show

I have tried CUDA 10.0/10.1 with PyTorch 1.3.0/1.2.0 and torchvision 0.4.1/0.4.0, using GCC 8.0. Disappointingly, I'm still stuck on this problem. Next I want to try Docker.

Finally, I solved this problem by using the official Dockerfile in this project.
