ERROR Trying to train 3DSSD #426

Closed
RolandoAvides opened this issue Apr 8, 2021 · 20 comments

Comments

@RolandoAvides

Hello,

I got this error when trying to run train.py for the 3DSSD model:

File "/home/rmoreira/.local/lib/python3.8/site-packages/mmdet3d-0.11.0-py3.8-linux-x86_64.egg/mmdet3d/ops/ball_query/ball_query.py", line 4, in <module>
    from . import ball_query_ext
ImportError: libcudart.so.10.1: cannot open shared object file: No such file or directory

I hope someone knows what is going on here.

Thanks in advance!

@Tai-Wang
Member

Tai-Wang commented Apr 8, 2021

It seems something is wrong with your CUDA environment. Please check that your CUDA version, GPU driver, and PyTorch version are compatible.

@RolandoAvides
Author

Thanks! I'll look into it.

However, I strictly followed the installation steps.

@Tai-Wang
Member

Tai-Wang commented Apr 8, 2021

Even so, problems can still come from the choice of CUDA/PyTorch versions to install. You can check whether torch.cuda.is_available() returns True.
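
A minimal sketch of that check, assuming it is run with the same interpreter (and on the same node) that will run train.py. The helper name cuda_report is made up for illustration; torch.__version__, torch.version.cuda and torch.cuda.is_available() are the only PyTorch names used.

```python
import importlib


def cuda_report():
    """Collect basic CUDA facts from PyTorch, or note that PyTorch is missing."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # False when no usable driver/GPU is visible to this process.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False while built_with_cuda is set, the build itself supports CUDA and the problem is more likely the driver or the machine the check was run on.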

@RolandoAvides
Author

RolandoAvides commented Apr 11, 2021

Does mmdet3D have specific requirements for cuda and pytorch version?

I am currently using CUDA 11.0 and PyTorch 1.7.1.

@Tai-Wang
Member

Ah yes, if you use PyTorch 1.7, you may need to update to the latest master or at least 0.12.0, because we fixed some errors in scatter_points_cuda.cu that are triggered with pytorch>=1.7. But that does not seem to be related to your bug. I think you should double-check your CUDA environment (the CUDA version and driver, as well as the CUDA versions used to compile mmdet3d and PyTorch).
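
One way to read the original ImportError: the compiled extension asks for libcudart.so.10.1, which suggests it was built against the CUDA 10.1 runtime, while the reported toolkit is 11.0. The sketch below only parses the message and compares version strings; both helper names are hypothetical, and comparing major.minor is a rough heuristic because library file names do not always follow the toolkit's minor version.

```python
import re


def required_cudart_version(error_message):
    """Return the libcudart version named in an ImportError message, or None."""
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", error_message)
    return match.group(1) if match else None


def matches_installed(required, installed):
    """Rough check: compare major.minor of the two version strings."""
    return required.split(".")[:2] == installed.split(".")[:2]


# Message copied from the traceback in the first post.
message = ("ImportError: libcudart.so.10.1: cannot open shared object file: "
           "No such file or directory")
required = required_cudart_version(message)
print(required)                             # 10.1
print(matches_installed(required, "11.0"))  # False
```

A mismatch like this usually means the extension has to be rebuilt in the environment it will run in.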

@RolandoAvides
Author

I checked my versions and everything seems to be alright. Do you suggest using another pytorch and cuda version?

@Tai-Wang
Member

Can you run python mmdet3d/utils/collect_env.py to collect the environment information? Please post the output if it succeeds. You can also refer to #438 and give it a try.

@RolandoAvides
Author

Traceback (most recent call last):
  File "mmdet3d/utils/collect_env.py", line 5, in <module>
    import mmdet3d
ModuleNotFoundError: No module named 'mmdet3d'

I reinstalled mmdet3d but still get this error, even though mmdet3d is installed in my env.

@RolandoAvides
Author

After conda list:

mmdet3d 0.12.0 dev_0

@Tai-Wang
Member

Try to run PYTHONPATH="$(dirname $0)":$PYTHONPATH python mmdet3d/utils/collect_env.py.
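
The command above appears intended to put the repository root on the import path. To see which interpreter is running and where (or whether) it would find mmdet3d, a small sketch along these lines may help. The helper name locate is made up for illustration; importlib.util.find_spec is part of the standard library.

```python
import importlib.util
import sys


def locate(package):
    """Return the path a package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(package)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("mmdet3d:", locate("mmdet3d"))
```

If the interpreter path is not the one inside the conda env, or mmdet3d prints None, the package listed by conda list is not visible to the process running the script.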

@RolandoAvides
Author

/home/rmoreira/.conda/envs/thesis/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: False
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2
OpenCV: 4.5.1
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: not available
MMDetection: 2.11.0
MMDetection3D: 0.12.0+d055876

Since I am using a Slurm server, I have to run each Python script with srun. That is why it didn't find any GPU at first.

@RolandoAvides
Author

Also, when I start python from the command line, I can import mmcv, mmdet and mmdet3d successfully.

@Tai-Wang
Member

Then please use srun to run the command.
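
To tell whether a command actually ran inside a Slurm job rather than on the login node, a rough sketch like the following can be run both with and without srun. The function name node_summary is hypothetical. SLURM_JOB_ID is set by Slurm for processes inside a job; whether CUDA_VISIBLE_DEVICES is set depends on how the cluster is configured, so treat that field as an assumption.

```python
import os
import shutil


def node_summary(environ=None):
    """Summarize whether this process looks like it runs inside a Slurm job."""
    environ = os.environ if environ is None else environ
    return {
        # Slurm sets SLURM_JOB_ID for processes launched inside a job.
        "slurm_job_id": environ.get("SLURM_JOB_ID"),
        # Set on many clusters when GPUs are assigned to the job; may be absent.
        "cuda_visible_devices": environ.get("CUDA_VISIBLE_DEVICES"),
        # Path to nvidia-smi if the driver utilities are installed on this node.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


if __name__ == "__main__":
    for key, value in node_summary().items():
        print(f"{key}: {value}")
```

A missing slurm_job_id together with a missing nvidia_smi would match the "Found no NVIDIA driver" warning seen when the script is run directly on the login node.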

@RolandoAvides
Author

I got an error. However, I reinstalled mmdetection3d and got a new error:

File "/home/rmoreira/.conda/envs/thesis/lib/python3.8/site-packages/mmcv/ops/nms.py", line 19, in forward
    inds = ext_module.nms(
RuntimeError: nms is not compiled with GPU support

Thanks for your help.

@Tai-Wang
Member

You need to use srun to compile mmcv/mmdet/mmdet3d. Otherwise they are built in the local environment, which has no GPU, and only the CPU-only versions of the packages get compiled.
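
After rebuilding, one way to confirm that the mmcv ops were compiled with CUDA is to ask mmcv directly, which appears to be what the "MMCV CUDA Compiler" line of collect_env.py reports. This is a sketch assuming an mmcv 1.x install that exposes mmcv.ops.get_compiling_cuda_version; the wrapper name mmcv_cuda_compiler is made up for illustration.

```python
def mmcv_cuda_compiler():
    """Return the CUDA version mmcv ops were compiled with, or a fallback string."""
    try:
        # Only importable when the compiled mmcv ops are installed.
        from mmcv.ops import get_compiling_cuda_version
    except ImportError:
        return "mmcv ops not importable"
    return get_compiling_cuda_version()


if __name__ == "__main__":
    print("MMCV CUDA Compiler:", mmcv_cuda_compiler())
```

A result of "not available", as in the environment dump earlier in this thread, would indicate a CPU-only build.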

@RolandoAvides
Author

RolandoAvides commented Apr 14, 2021

Ok. I'm running this command:

srun python demo/pcd_demo.py /home/rmoreira/mmdetection3d/demo/kitti_000008.bin /home/rmoreira/mmdetection3d/configs/3dssd/3dssd_kitti-3d-car.py /home/rmoreira/mmdetection3d/checkpoints/3dssd_kitti-3d-car_20210324_122002-07e9a19b.pth

So I need to reinstall mmcv and mmdet with srun, right? Can I use srun pip....?

@Tai-Wang
Member

I typically use srun python -m pip install ... but you can give srun pip a try.

@RolandoAvides
Author

Ok, I will try! Thank you! I'll soon give some feedback.

@RolandoAvides
Author

RolandoAvides commented Apr 14, 2021

@Tai-Wang I managed to run the demo! Thank you!

I'll take this opportunity to show you the results. I found the first warning about incompatibility strange...

srun python demo/pcd_demo.py /home/rmoreira/mmdetection3d/demo/kitti_000008.bin /home/rmoreira/mmdetection3d/configs/3dssd/3dssd_kitti-3d-car.py /home/rmoreira/mmdetection3d/checkpoints/3dssd_kitti-3d-car_20210324_122002-07e9a19b.pth
Use load_from_local loader
The model and loaded state dict do not match exactly

size mismatch for bbox_head.conv_pred.conv_cls.weight: copying a param with shape torch.Size([1, 128, 1]) from checkpoint, the shape in current model is torch.Size([3, 128, 1]).
size mismatch for bbox_head.conv_pred.conv_cls.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3]).
[{'boxes_3d': LiDARInstance3DBoxes(
    tensor([[ 55.7404,  -8.6077,  -0.4224,  ...,   4.5783,   1.8301,   1.0405],
        [ 25.6243, -10.9809,  -1.5819,  ...,   3.7168,   1.4777,  -1.0689],
        [ 66.4182, -26.8103,  -1.1402,  ...,   3.7059,   1.5027,   1.6769],
        ...,
        [ 29.5463, -12.6929,  -3.2484,  ...,   4.1319,   1.8958,   3.1369],
        [ 33.4877,  -7.3111,  -1.3107,  ...,   4.0879,   1.6466,   1.7741],
        [ 20.6169,   0.4046,  -1.3966,  ...,   3.7587,   1.5981,   1.9495]])), 'scores_3d': tensor([0.7574, 0.6495, 0.6280, 0.6921, 0.7645, 0.7266, 0.7165, 0.5575, 0.5733,
        0.7091, 0.8246, 0.7201, 0.5388, 0.7428, 0.6678, 0.7330, 0.4847, 0.6177,
        0.7114, 0.6585, 0.6402, 0.5568, 0.7016, 0.5209, 0.6761, 0.6874, 0.7599,
        0.6817, 0.4953, 0.6884, 0.7174, 0.5856, 0.4783, 0.7312, 0.5483, 0.5400,
        0.5401, 0.4533, 0.6866, 0.6345, 0.7415, 0.7145, 0.5942, 0.7193, 0.6310,
        0.6636, 0.7538, 0.7477, 0.5689, 0.7458, 0.7574, 0.6495, 0.6280, 0.6921,
        0.7645, 0.7266, 0.7165, 0.5575, 0.5733, 0.7091, 0.8246, 0.7201, 0.5388,
        0.7428, 0.6678, 0.7330, 0.4847, 0.6177, 0.7114, 0.6585, 0.6402, 0.5568,
        0.7016, 0.5209, 0.6761, 0.6874, 0.7599, 0.6817, 0.4953, 0.6884, 0.7174,
        0.5856, 0.4783, 0.7312, 0.5483, 0.5400, 0.5401, 0.4533, 0.6866, 0.6345,
        0.7415, 0.7145, 0.5942, 0.7193, 0.6310, 0.6636, 0.7538, 0.7477, 0.5689,
        0.7458, 0.7574, 0.6495, 0.6280, 0.6921, 0.7645, 0.7266, 0.7165, 0.5575,
        0.5733, 0.7091, 0.8246, 0.7201, 0.5388, 0.7428, 0.6678, 0.7330, 0.4847,
        0.6177, 0.7114, 0.6585, 0.6402, 0.5568, 0.7016, 0.5209, 0.6761, 0.6874,
        0.7599, 0.6817, 0.4953, 0.6884, 0.7174, 0.5856, 0.4783, 0.7312, 0.5483,
        0.5400, 0.5401, 0.4533, 0.6866, 0.6345, 0.7415, 0.7145, 0.5942, 0.7193,
        0.6310, 0.6636, 0.7538, 0.7477, 0.5689, 0.7458]), 'labels_3d': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2])}]
/home/rmoreira/mmcv/mmcv/cnn/bricks/conv_module.py:107: UserWarning: ConvModule has norm and bias at the same time
  warnings.warn('ConvModule has norm and bias at the same time')
/home/rmoreira/.conda/envs/thesis/lib/python3.8/site-packages/torch/nn/functional.py:1639: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/home/rmoreira/mmcv/mmcv/utils/misc.py:303: UserWarning: "iou_thr" is deprecated in `nms`, please use "iou_threshold" instead
  warnings.warn(

While I got this using PointPillars...

[{'boxes_3d': LiDARInstance3DBoxes(
    tensor([[  7.6368,   6.0337,  -0.7049,   0.5380,   1.6368,   1.6816,   1.1690],
        [  8.0813,   1.2360,  -1.5221,   1.5556,   3.5950,   1.5098,  -1.2870],
        [  6.4173,  -3.8148,  -1.7652,   1.4713,   3.1078,   1.4624,   1.8872],
        [ 14.7695,  -1.1130,  -1.5709,   1.5431,   3.8069,   1.4875,   1.8945],
        [ 33.3250,  -7.0627,  -1.2610,   1.6582,   4.0907,   1.6339,  -1.2844],
        [ 20.3003,  -8.4587,  -1.6727,   1.5208,   2.6914,   1.6241,   1.9094],
        [  3.6622,   2.7378,  -1.5460,   1.5642,   3.6437,   1.4970,   1.7988],
        [ 28.6279,  -1.6022,  -1.0425,   1.5140,   3.8013,   1.4418,   0.3095],
        [ 55.6062, -20.1399,  -1.3588,   1.6496,   4.0910,   1.6164,  -1.2947],
        [ 24.9041, -10.1028,  -1.6406,   1.6510,   3.6680,   1.4994,   1.8840],
        [ 40.7069,  -9.7905,  -1.3112,   1.5934,   3.9164,   1.5893,  -1.2865]])), 'scores_3d': tensor([0.4375, 0.9567, 0.9528, 0.9437, 0.8850, 0.8328, 0.7940, 0.7637, 0.6648,
        0.6040, 0.4795]), 'labels_3d': tensor([1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}]

Thanks!

@RolandoAvides
Author

Never mind about the first error. I forgot that I had changed the 3DSSD config file.
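
For anyone who hits the same "size mismatch" message: the two shapes in the log differ only in the first dimension of conv_cls (1 in the checkpoint, 3 in the current model), which would be consistent with the edited config building a head for a different number of classes than the checkpoint was trained with. That reading is an inference from the log, not something confirmed in this thread. A small sketch of the comparison, using plain tuples copied from the log and a hypothetical helper name:

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """List parameters present in both mappings whose shapes differ."""
    return {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in checkpoint_shapes
        if name in model_shapes and checkpoint_shapes[name] != model_shapes[name]
    }


# Shapes copied from the size-mismatch lines in the demo output.
checkpoint = {
    "bbox_head.conv_pred.conv_cls.weight": (1, 128, 1),
    "bbox_head.conv_pred.conv_cls.bias": (1,),
}
model = {
    "bbox_head.conv_pred.conv_cls.weight": (3, 128, 1),
    "bbox_head.conv_pred.conv_cls.bias": (3,),
}
for name, (saved, current) in shape_mismatches(checkpoint, model).items():
    print(f"{name}: checkpoint {saved} vs model {current}")
```

Restoring the original config, or retraining with the changed one, should make the two sides agree again.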
