RuntimeError: CUDA error: an illegal memory access was encountered #19

Closed
wxggzz opened this issue Feb 21, 2022 · 9 comments


wxggzz commented Feb 21, 2022


 File "/workspace/mmrotate/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 133, in forward
    roi_feats[inds] = roi_feats_t
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1607370141920/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f16c35478b2 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f16c3799982 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f16c3532b7d in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fea0a (0x7f1700884a0a in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5feab6 (0x7f1700884ab6 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf0 (0x7f1724052830 in /lib/x86_64-linux-gnu/libc.so.6)

The above is the error output; the GPU is a 2080 Ti.
Below is the command I ran:

python tools/train.py ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --gpu-ids 5  --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221

Environment: CUDA 10.1, PyTorch 1.7.1

Author

wxggzz commented Feb 21, 2022

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.105
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.0a0
OpenCV: 4.5.5
MMCV: 1.4.5
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.1
MMRotate: 0.1.0+6519a36

The above is the collected environment information.

Collaborator

zytx121 commented Feb 21, 2022

Hi @wxggzz
Please run the following command and upload your error report again.

CUDA_LAUNCH_BLOCKING=1 python tools/train.py ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --gpu-ids 5  --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221

yangxue0827 changed the title from "hello: RuntimeError: CUDA error: an illegal memory access was encountered at the start of training" to "RuntimeError: CUDA error: an illegal memory access was encountered" on Feb 21, 2022
Author

wxggzz commented Feb 21, 2022

Hi @wxggzz Please run the following command and upload your error report again.

CUDA_LAUNCH_BLOCKING=1 python tools/train.py ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --gpu-ids 5  --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221

Hello, I still get the same error:

 --------------------
2022-02-21 14:15:20,008 - mmrotate - INFO - workflow: [('train', 1)], max: 12 epochs
2022-02-21 14:15:20,008 - mmrotate - INFO - Checkpoints will be saved to /workspace/mmrotate/ /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221 by HardDiskBackend.
/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmdet/models/dense_heads/anchor_head.py:123: UserWarning: DeprecationWarning: anchor_generator is deprecated, please use "prior_generator" instead
  warnings.warn('DeprecationWarning: anchor_generator is deprecated, '
Traceback (most recent call last):
  File "tools/train.py", line 182, in <module>
    main()
  File "tools/train.py", line 178, in main
    meta=meta)
  File "/workspace/mmrotate/mmrotate/apis/train.py", line 156, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/workspace/mmrotate/mmrotate/models/detectors/two_stage.py", line 150, in forward_train
    **kwargs)
  File "/workspace/mmrotate/mmrotate/models/roi_heads/oriented_standard_roi_head.py", line 74, in forward_train
    img_metas)
  File "/workspace/mmrotate/mmrotate/models/roi_heads/oriented_standard_roi_head.py", line 97, in _bbox_forward_train
    bbox_results = self._bbox_forward(x, rois)
  File "/workspace/mmrotate/mmrotate/models/roi_heads/rotate_standard_roi_head.py", line 170, in _bbox_forward
    x[:self.bbox_roi_extractor.num_inputs], rois)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/workspace/mmrotate/mmrotate/models/roi_heads/roi_extractors/rotate_single_level_roi_extractor.py", line 132, in forward
    roi_feats_t = self.roi_layers[i](feats[i], rois_)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 177, in forward
    self.clockwise)
  File "/root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/ops/roi_align_rotated.py", line 78, in forward
    clockwise=clockwise)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1607370141920/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f55f20eb8b2 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f55f233d982 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f55f20d6b7d in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fea0a (0x7f562f428a0a in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5feab6 (0x7f562f428ab6 in /root/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf0 (0x7f5652bf6830 in /lib/x86_64-linux-gnu/libc.so.6)

Below is the log file
20220221_064433.log

Collaborator

yangxue0827 commented Feb 21, 2022

try

CUDA_VISIBLE_DEVICES=5 PORT=29808 ./tools/dist_train.sh ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221 1

Author

wxggzz commented Feb 22, 2022

try

CUDA_VISIBLE_DEVICES=5 PORT=29808 ./tools/dist_train.sh ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221 1

Training with this command works with no problem; now I will run the test.

Author

wxggzz commented Feb 22, 2022

try

CUDA_VISIBLE_DEVICES=5 PORT=29808 ./tools/dist_train.sh ./configs/oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py --work-dir /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221 1

And testing hits the same problem, so you have to use dist_test.py as well.

Collaborator

yangxue0827 commented Feb 22, 2022

CUDA_VISIBLE_DEVICES=5 PORT=29807 \
       ./tools/dist_test.sh oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py \
        /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221/epoch_12.pth 1 --format-only \
        --eval-options submission_dir=/workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221/Task1_results

Author

wxggzz commented Feb 23, 2022

CUDA_VISIBLE_DEVICES=5 PORT=29807 \
       ./tools/dist_test.sh oriented_rcnn/oriented_rcnn_r50_fpn_1x_dota_le90.py \
        /workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221/epoch_12.pth 1 --format-only \
        --eval-options submission_dir=/workspace/mmrotate/work_dirs/ms/oriented_rcnn/0221/Task1_results

thanks

@yangxue0827
Collaborator

A successful solution is to set a smaller nms_pre:

test_cfg=dict(
        nms_pre=1000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(iou_thr=0.4),
        max_per_img=2000))
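
For reference, the same change can usually be applied without editing the base config by putting the override into a small user config. The sketch below is a minimal, unverified example that assumes the standard mmdetection/mmrotate _base_ inheritance mechanism; the file name is made up, and for a two-stage model such as Oriented R-CNN the nms_pre value normally sits inside the rpn/rcnn sub-dicts of test_cfg, so mirror the layout of the config you inherit from rather than copying this verbatim.

# configs/oriented_rcnn/oriented_rcnn_small_nms_pre.py  (hypothetical file name)
# Minimal sketch: inherit the original config and override only the NMS-related
# test settings. mmcv's Config merges nested dicts recursively, so every other
# setting is taken from the base config unchanged.
_base_ = ['./oriented_rcnn_r50_fpn_1x_dota_le90.py']

model = dict(
    test_cfg=dict(
        rpn=dict(nms_pre=1000),     # keep fewer proposals before RPN NMS
        rcnn=dict(nms_pre=1000)))   # keep fewer boxes before the final NMS

The new config file can then be passed to tools/train.py or tools/dist_test.sh in place of the original one.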
