
RuntimeError: CUDA error: device-side assert triggered #5801

Closed
wedlight opened this issue Aug 5, 2021 · 4 comments

wedlight commented Aug 5, 2021

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.

Describe the bug
RuntimeError: CUDA error: device-side assert triggered

Reproduction

  1. What command or script did you run?
    CUDA_VISIBLE_DEVICES=3 python tools/train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py

  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    I didn't make any modifications to the code or config.

  3. What dataset did you use?
    COCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
    CUDA available: True
    GPU 0,1,2,3: GeForce GTX 1080 Ti
    CUDA_HOME: /disk1/huim/softwares/cuda-10.1
    NVCC: Cuda compilation tools, release 10.1, V10.1.243
    GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    PyTorch: 1.8.1
    PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210617 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.1, CUDNN_VERSION=7.6.3, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.1
OpenCV: 4.5.3
MMCV: 1.3.9
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.1
MMDetection: 2.15.0+62a1cd3

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
      conda install pytorch cudatoolkit=10.1 torchvision -c pytorch

Error traceback

2021-08-05 23:34:13,140 - mmdet - INFO - Epoch [1][50/58633] lr: 1.978e-03, eta: 3 days, 7:04:43, time: 0.405, data_time: 0.054, memory: 3786, loss_rpn_cls: 0.5552, loss_rpn_bbox: nan, loss_cls: 0.7884, acc: 88.8535, loss_bbox: 0.1492, loss: nan
/opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [32,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
[... the same assertion repeated for 31 more threads in block [0,0,0] ...]
Traceback (most recent call last):
File "tools/train.py", line 188, in
main()
File "tools/train.py", line 184, in main
meta=meta)
File "/disk1/huim/projects/mmdetection/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/base.py", line 237, in train_step
losses = self(**data)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/base.py", line 171, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/two_stage.py", line 140, in forward_train
proposal_cfg=proposal_cfg)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 54, in forward_train
losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 74, in loss
gt_bboxes_ignore=gt_bboxes_ignore)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 463, in loss
label_channels=label_channels)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 345, in get_targets
unmap_outputs=unmap_outputs)
File "/disk1/huim/projects/mmdetection/mmdet/core/utils/misc.py", line 29, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 236, in _get_targets_single
sampling_result.pos_bboxes, sampling_result.pos_gt_bboxes)
File "/disk1/huim/projects/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 59, in encode
encoded_bboxes = bbox2delta(bboxes, gt_bboxes, self.means, self.stds)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/utils/parrots_jit.py", line 21, in wrapper_inner
return func(*args, **kargs)
File "/disk1/huim/projects/mmdetection/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py", line 136, in bbox2delta
means = deltas.new_tensor(means).unsqueeze(0)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554827596/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa0e48092f2 in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa0e480667b in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fa0e4a62219 in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa0e47f13a4 in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e6aba (0x7fa12ca4baba in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e6b61 (0x7fa12ca4bb61 in /disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #24: __libc_start_main + 0xf0 (0x7fa152495840 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

I have not found the reason for the error.
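
For context, the IndexKernel.cu assertion above is PyTorch's generic failure mode for out-of-range indexing on the GPU. A minimal sketch reproducing the same class of error (an illustration only, not the code path from this issue):

    import torch

    # Out-of-range indexing on a CUDA tensor trips the same
    # IndexKernel.cu device-side assert seen in the log above.
    x = torch.zeros(10, device='cuda')
    idx = torch.tensor([42], device='cuda')  # 42 is out of bounds for size 10
    y = x[idx]                # the bad access is enqueued asynchronously
    torch.cuda.synchronize()  # the RuntimeError surfaces here, not at x[idx]

Because CUDA kernels run asynchronously, the Python traceback usually points at a later operation (here, deltas.new_tensor(means) in bbox2delta) rather than the indexing that actually failed.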

@jshilong
Collaborator

jshilong commented Aug 7, 2021

The error message indicates that some data structure (a list or Tensor) is being indexed out of bounds. You can find the more precise location of the error by adding CUDA_LAUNCH_BLOCKING=1 before your command.
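
For example, prepending the variable to the command from the report:

    CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=3 python tools/train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py

This forces kernel launches to run synchronously, so the traceback points at the operation that actually failed instead of a later, unrelated call.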

@jshilong
Collaborator

Feel free to reopen the issue if there are any further questions.

@huimlight

> Feel free to reopen the issue if there are any further questions.

Traceback (most recent call last):
File "tools/train.py", line 189, in
main()
File "tools/train.py", line 185, in main
meta=meta)
File "/disk1/huim/projects/mmdetection/mmdet/apis/train.py", line 174, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
losses = self(**data)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/detectors/single_stage.py", line 84, in forward_train
gt_labels, gt_bboxes_ignore)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 55, in forward_train
losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 490, in loss
num_total_samples=num_total_samples)
File "/disk1/huim/projects/mmdetection/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/disk1/huim/projects/mmdetection/mmdet/models/dense_heads/anchor_head.py", line 405, in loss_single
cls_score, labels, label_weights, avg_factor=num_total_samples)
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/disk1/huim/projects/mmdetection/mmdet/models/losses/focal_loss.py", line 178, in forward
avg_factor=avg_factor)
File "/disk1/huim/projects/mmdetection/mmdet/models/losses/focal_loss.py", line 87, in sigmoid_focal_loss
alpha, None, 'none')
File "/disk1/huim/softwares/anaconda3/envs/openmmlab/lib/python3.7/site-packages/mmcv/ops/focal_loss.py", line 56, in forward
input, target, weight, output, gamma=ctx.gamma, alpha=ctx.alpha)
RuntimeError: target.max().item<int64_t>() <= (int64_t)num_classes INTERNAL ASSERT FAILED at "/tmp/pip-install-rnu39oc3/mmcv-full_c3244ef0af18442e9c75bebe181bc703/mmcv/ops/csrc/pytorch/cuda/focal_loss_cuda.cu":13, please report a bug to PyTorch. target label should smaller or equal than num classes

@huimlight

> Feel free to reopen the issue if there are any further questions.

I added CUDA_LAUNCH_BLOCKING=1 before the command as suggested; the traceback above is the error produced after adding it.
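
The assertion from mmcv's focal loss ("target label should smaller or equal than num classes") means a ground-truth label index exceeds the number of classes the head was built for, commonly a mismatch between the dataset's categories and num_classes in the config. A minimal sketch for checking a COCO-format annotation file (the path and class count below are placeholders, not values from this issue):

    import json

    ANN_FILE = 'data/coco/annotations/instances_train2017.json'  # placeholder path
    NUM_CLASSES = 80  # num_classes configured in the model head

    with open(ANN_FILE) as f:
        coco = json.load(f)

    # Map raw COCO category ids to contiguous training labels 0..K-1,
    # mirroring what the dataset wrapper does internally.
    cat_ids = sorted(c['id'] for c in coco['categories'])
    cat2label = {cid: i for i, cid in enumerate(cat_ids)}

    bad = [a['id'] for a in coco['annotations']
           if cat2label.get(a['category_id'], NUM_CLASSES) >= NUM_CLASSES]
    print(f'{len(cat_ids)} categories; {len(bad)} annotations map to '
          f'labels outside [0, {NUM_CLASSES - 1}]')

If the category count here does not match num_classes in the model config, the focal loss target can index past the classification logits and trigger exactly this device-side assert.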
