train error #436

Closed

wizholy opened this issue Mar 22, 2019 · 2 comments

Comments

@wizholy

wizholy commented Mar 22, 2019

2019-03-22 10:55:37,793 - INFO - Start running, host: wizard@wizard-W560-G20, work_dir: /home/wizard/projects/mmdetection041/work_dirs/ssd512_voc
2019-03-22 10:55:37,793 - INFO - workflow: [('train', 1)], max: 24 epochs
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensor<Dtype, 2, int, DefaultPtrTraits>, THCDeviceTensor<long, 1, int, DefaultPtrTraits>, THCDeviceTensor<Dtype, 1, int, DefaultPtrTraits>, Dtype *, int, int) [with Dtype = float]: block: [15,0,0], thread: [882,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f81f9dbe780>>
Traceback (most recent call last):
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/multiprocessing/reduction.py", line 153, in recvfds
msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "tools/train.py", line 90, in
main()
File "tools/train.py", line 86, in main
logger=logger)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 59, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 121, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmcv/runner/runner.py", line 355, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmcv/runner/runner.py", line 261, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/apis/train.py", line 37, in batch_processor
losses = model(**data)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/detectors/base.py", line 80, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/detectors/single_stage.py", line 49, in forward_train
losses = self.bbox_head.loss(*loss_inputs)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/anchor_heads/ssd_head.py", line 183, in loss
cfg=cfg)
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/core/utils/misc.py", line 24, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/mmdet-0.5.7+unknown-py3.6.egg/mmdet/models/anchor_heads/ssd_head.py", line 113, in loss_single
pos_inds = (labels > 0).nonzero().view(-1)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'at::Error'
what(): CUDA error: invalid device pointer (CudaCachingDeleter at /pytorch/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7f81eb3160d4 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7f81eb3b57df in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7f81c7fbb579 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x291 (0x7f81f3422411 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7f81f3422589 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: + 0x777989 (0x7f81f343b989 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: + 0x777a34 (0x7f81f343ba34 in /home/wizard/anaconda3/envs/pytorch041/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

frame #26: __libc_start_main + 0xf0 (0x7f8210097830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@wizholy
Author

wizholy commented Mar 22, 2019

Fixed it. The number of classes in the config has to be the number of dataset classes + 1 (the extra class is background).
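
For anyone hitting the same device-side assert, here is a minimal sketch of what that fix looks like in an mmdetection 0.x-style config. The keys and numbers below are illustrative (they assume a VOC-like dataset with 20 foreground classes), not copied from the exact ssd512_voc config in this repo:

# Illustrative mmdetection 0.x-style config fragment (assumed layout).
# With N foreground classes, num_classes must be N + 1 so that label 0 is
# background and every ground-truth label satisfies 0 <= label < num_classes,
# which is exactly the condition in the CUDA assert shown in the log above.

num_foreground_classes = 20  # e.g. the 20 PASCAL VOC categories

model = dict(
    type='SingleStageDetector',
    bbox_head=dict(
        type='SSDHead',
        num_classes=num_foreground_classes + 1,  # +1 for the background class
        # ... anchor and target settings left as in the original config ...
    ),
    # ... backbone, neck, train_cfg, test_cfg unchanged ...
)

If num_classes is set to the foreground count only, labels for the last class fall outside [0, num_classes) and ClassNLLCriterion fires the assert seen above.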

@wizholy wizholy closed this as completed Mar 22, 2019
@datvtn

datvtn commented Sep 16, 2020

How can I increase the number of classes?

FANGAreNotGnu pushed a commit to FANGAreNotGnu/mmdetection that referenced this issue Oct 23, 2023
* fix arg names and orders

* address comments