
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed. #3

Open
fancy-chenyao opened this issue Aug 31, 2023 · 1 comment

Comments

@fancy-chenyao

Hello, I ran into this error while debugging your code with my own dataset. Could you tell me where the problem might be? The full error output is shown below. Looking forward to your reply!
/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1634272126608/work/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes failed.
0%| | 0/1125 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    processor.run()
  File "/root/cascade-rcnn/solver/ddp_mix_solver.py", line 213, in run
    self.train(epoch)
  File "/root/cascade-rcnn/solver/ddp_mix_solver.py", line 113, in train
    targets={"target": targets_tensor, "batch_len": batch_len})
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 702, in forward
    box_predicts, cls_predicts, roi_losses = self.cascade_head(feature_dict, boxes, valid_size, targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 621, in forward
    boxes, cls, loss = self.roi_heads[i](feature_dict, boxes, valid_size, targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 590, in forward
    cls_loss, box_loss = self.compute_loss(proposals, cls_predicts, box_predicts, targets)
  File "/root/cascade-rcnn/nets/cascade_rcnn.py", line 564, in compute_loss
    cls_loss = self.ce(loss_cls_predicts, loss_cls_targets)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "/root/miniconda3/envs/my-env/lib/python3.7/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2085) of binary: /root/miniconda3/envs/my-env/bin/python
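For context, the failing assertion `t >= 0 && t < n_classes` means that at least one target value reaching `cls_loss = self.ce(loss_cls_predicts, loss_cls_targets)` is negative or not smaller than the number of classifier outputs, which commonly happens with custom datasets whose labels are 1-based or whose configured class count does not match the annotations. Below is a minimal sketch of a range check that could be run on the targets just before the loss call; `check_targets`, `num_classes`, and the example tensors are hypothetical names for illustration, not part of this repository:

```python
import torch

def check_targets(targets: torch.Tensor, num_classes: int) -> None:
    """Raise a readable error if any class index is outside [0, num_classes)."""
    bad = (targets < 0) | (targets >= num_classes)
    if bad.any():
        raise ValueError(
            f"out-of-range class indices {targets[bad].unique().tolist()}; "
            f"expected values in [0, {num_classes})"
        )

# Passes: every label lies in [0, 5).
check_targets(torch.tensor([0, 2, 4]), num_classes=5)

# Would raise: 5 is not a valid index for a 5-class classifier
# (e.g. 1-based labels, or a class count that forgot the background class).
# check_targets(torch.tensor([0, 2, 5]), num_classes=5)
```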

@yuanfangshang888

My feeling is that the author trained with distributed training, while you are training on a single GPU, and that mismatch is what triggers this problem. I say this because I noticed ERROR:torch.distributed.elastic.multiprocessing.api:failed in your log. This is only a guess; when you have time, you could try investigating along these lines.
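Whichever explanation turns out to be right, the generic "device-side assert triggered" message hides where the bad value comes from, because CUDA kernels run asynchronously. A hedged sketch of two common ways to surface a readable error; the shapes and values below are made up for illustration:

```python
# Option 1: make CUDA errors synchronous so the traceback points at the real
# failing line. This must be set before CUDA is initialized (e.g. at the very
# top of main.py) or exported in the shell that launches training.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torch.nn.functional as F

# Option 2: reproduce the loss call on CPU, where an out-of-range target raises
# a plain IndexError naming the offending value instead of a device assert.
logits = torch.randn(4, 3)            # hypothetical classifier with 3 classes
targets = torch.tensor([0, 1, 2, 3])  # "3" is invalid for 3 classes
# F.cross_entropy(logits, targets)    # uncomment to see the readable CPU error
```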
