train error #436

Closed

wizholy opened this issue Mar 22, 2019 · 2 comments

Comments

@wizholy

wizholy commented Mar 22, 2019

2019-03-22 10:55:37,793 - INFO - Start running, host: wizard@wizard-W560-G20, work_dir: /home/wizard/projects/mmdetection041/work_dirs/ssd512_voc
2019-03-22 10:55:37,793 - INFO - workflow: [('train', 1)], max: 24 epochs
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [15,0,0], thread: [882,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f81f9dbe780>>
Traceback (most recent call last):
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/reduction.py", line 153, in recvfds
msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "tools/train.py", line 90, in
main()
File "tools/train.py", line 86, in main
logger=logger)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 59, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 121, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmcv/runner/runner.py", line 355, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmcv/runner/runner.py", line 261, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 37, in batch_processor
losses = model(**data)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/detectors/base.py", line 80, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/detectors/single_stage.py", line 49, in forward_train
losses = self.bbox_head.loss(*loss_inputs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/anchor_heads/ssd_head.py", line 183, in loss
cfg=cfg)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/core/utils/misc.py", line 24, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/anchor_heads/ssd_head.py", line 113, in loss_single
pos_inds = (labels > 0).nonzero().view(-1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'at::Error'
what(): CUDA error: invalid device pointer (CudaCachingDeleter at /pytorch/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7f81eb3160d4 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7f81eb3b57df in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7f81c7fbb579 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x291 (0x7f81f3422411 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7f81f3422589 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: + 0x777989 (0x7f81f343b989 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: + 0x777a34 (0x7f81f343ba34 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

frame #26: __libc_start_main + 0xf0 (0x7f8210097830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@wizholy
Author

wizholy commented Mar 22, 2019

Fixed it. The number of classes in the config has to be the number of dataset classes + 1 (the extra class is background).
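
For anyone hitting the same device-side assert, here is a minimal sketch of what that fix looks like in an mmdetection 0.x-style config. The keys and numbers below are illustrative (they assume a VOC-like dataset with 20 foreground classes), not copied from the exact ssd512_voc config in this repo:

# Illustrative mmdetection 0.x-style config fragment (assumed layout).
# With N foreground classes, num_classes must be N + 1 so that label 0 is
# background and every ground-truth label satisfies 0 <= label < num_classes,
# which is exactly the condition in the CUDA assert shown in the log above.

num_foreground_classes = 20  # e.g. the 20 PASCAL VOC categories

model = dict(
    type='SingleStageDetector',
    bbox_head=dict(
        type='SSDHead',
        num_classes=num_foreground_classes + 1,  # +1 for the background class
        # ... anchor and target settings left as in the original config ...
    ),
    # ... backbone, neck, train_cfg, test_cfg unchanged ...
)

If num_classes is set to the foreground count only, labels for the last class fall outside [0, num_classes) and ClassNLLCriterion fires the assert seen above.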

@wizholy wizholy closed this as completed Mar 22, 2019
@datvtn

datvtn commented Sep 16, 2020

How can I increase the number of classes?

FANGAreNotGnu pushed a commit to FANGAreNotGnu/mmdetection that referenced this issue Oct 23, 2023
* fix arg names and orders

* address comments