
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one #2153

Closed
vincentwei0919 opened this issue Feb 25, 2020 · 33 comments

@vincentwei0919

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run?
    I changed the config name from faster_rcnn_r50_fpn_1x.py to element.py, then ran:
    CUDA_VISIBLE_DEVICES=1,2,3 ./tools/dist_train.sh configs/element.py 3 --autoscale-lr
  2. Did you make any modifications to the code or config? Did you understand what you modified?
    Only num_classes and work_dir in the config.
  3. What dataset did you use?
    My own dataset, organized in the same way as the VOC format.

Environment
(environment information attached as a screenshot)

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
  2. You may add additional information that may be helpful for locating the problem, such as

    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f92f4501441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f92f4500d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f92f4de983c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f92f4ddf2bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f92f484acfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7f92f8173830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

Traceback (most recent call last):
  File "./tools/train.py", line 142, in <module>
    main()
  File "./tools/train.py", line 138, in main
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 102, in train_detector
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 371, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 275, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fcaf0f72441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fcaf0f71d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7fcaf185a83c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7fcaf18502bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fcaf12bbcfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7fcaf4be4830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

^CTraceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 228, in main
    process.wait()
  File "/usr/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
root@83403c5335c7:mmdetection_v2# ^C

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@ZwwWayne
Collaborator

Hi @vincentwei0919 ,
For now, you can set find_unused_parameters=True around line 215 of train.py to get the experiment running, as indicated in issue #2117.
We will also try to debug this and find the root cause in the meantime.
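For anyone on a different version: below is a minimal sketch of what that change amounts to, written against the plain torch.nn.parallel.DistributedDataParallel API. The surrounding code, and the assumption that model is the already-built detector and that mmdet's wrapper forwards keyword arguments to DDP, are mine and not the exact mmdet source.

import torch
from torch.nn.parallel import DistributedDataParallel

# Assumption: `model` is the detector already built from the config.
model = DistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False,
    # Ask DDP to scan the autograd graph each iteration and skip
    # parameters that did not contribute to the output, instead of
    # raising the "Expected to have finished reduction" error.
    find_unused_parameters=True)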

@237014845

@ZwwWayne I have a similar problem. Can you explain which line of code find_unused_parameters=True should go before? Everyone has a different version of mmdet.

@ZwwWayne
Collaborator

ZwwWayne commented Mar 2, 2020

@237014845 ,
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/apis/train.py#L135, when calling the MMDistributedDataParallel.

@237014845

@ZwwWayne thx

@vincentwei0919
Author

Thank you! Before I opened this issue, I had already tried adding the parameter as suggested in issue #2117. After restarting, this error seemed to disappear; the log only showed the usual parameter-mismatch messages. But then my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage and no model was loaded, so after waiting about half an hour I gave up.
I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is that something you can make use of? Thank you again!

@237014845

@vincentwei0919 I have a similar problem

@Mxbonn
Contributor

Mxbonn commented Mar 6, 2020

@ZwwWayne Would it be possible to change the train API to support, for example, the user setting this via a flag in the config file or as an additional argument?

@jajajajaja121

Thank you! Before I opened this issue, I had already tried adding the parameter as suggested in issue #2117. After restarting, this error seemed to disappear; the log only showed the usual parameter-mismatch messages. But then my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage and no model was loaded, so after waiting about half an hour I gave up.
I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is that something you can make use of? Thank you again!

I also ran into the same situation. When I use non-distributed training with two cards, it raises ValueError: All dicts must have the same number of keys.

@ZwwWayne
Collaborator

Hi @jajajajaja121 ,
Could you use distributed training and see whether your bug still exists?
DataParallel does not have an option for unused parameters, so we recommend using distributed training.

@huuquan1994

I also got the same problem. Even when I set the flag find_unused_parameters=True, the problem doesn't disappear; training keeps freezing without any error logs.
Also, I found that the code you published half a year ago worked really well, as @jajajajaja121 mentioned.
I hope we can fix this issue soon.

@ZwwWayne
Collaborator

Hi @huuquan1994 ,
Could you provide reproducible scripts and configs?
We have not encountered your case before, so we are not sure where the bug is, and we cannot be of much help with limited information.

@jajajajaja121

Hi @jajajajaja121 ,
Could you use distributed training and see whether your bug still exists?
DataParallel does not have an option for unused parameters, so we recommend using distributed training.

Yes, this bug still exists, and here is my config:
cascade_rcnn_r50_fpn_1x.zip

@jajajajaja121

Hi @jajajajaja121 ,
Could you use distributed training and see whether your bug still exists?
DataParallel does not have an option for unused parameters, so we recommend using distributed training.

And this is my environment
(environment information attached as a screenshot)

@jajajajaja121

My bug disappeared after I reinstalled the latest version of mmdetection; you can try that.

@huuquan1994

@ZwwWayne Sorry for the late reply!
Since I had switched to an older version of mmdetection, I will try reinstalling the newest version as @jajajajaja121 mentioned and check whether the bug is gone.

@vincentwei0919
Author

Oh, thanks, I will try that, @jajajajaja121.

@laycoding

laycoding commented Apr 20, 2020

Have you solved the problem? @vincentwei0919 @huuquan1994 I used the newest version of mmdet and still got stuck when trying to train HTC on my own dataset.
(screenshot of the stuck training log attached)
And the strangest part is that after I set find_unused_parameters=True, it always gets stuck at a particular iteration without any log info (epoch 34, 1850/6735) :(

@huuquan1994

huuquan1994 commented Apr 21, 2020

@laycoding
I tried to install the newest mmdetection but got the same error you mentioned, even when I set the flag find_unused_parameters=True.

@ZwwWayne
I used Docker and tried to build two different Docker images as follows:

  1. I built my first Docker image using the Dockerfile in your master branch.

  2. I changed the Dockerfile config (PyTorch 1.3 to PyTorch 1.2, CUDA 10.1 to CUDA 10.0) and built the second image.
    Here is my environment on the second Docker image:

sys.platform: linux
Python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0,1,2,3,4,5,6,7,8,9: GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
PyTorch: 1.2.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.18.1 (Git Hash 7de7e5d02bf687f971e7668963649728356e0c20)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.0
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.2
  - Magma 2.5.0
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.4.0a0+6b959ee
OpenCV: 4.2.0
MMCV: 0.4.3
MMDetection: 1.1.0+63887bb
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.0

Both Docker images produced the same error log that @laycoding mentioned. Training got stuck indefinitely at frame #50 (please see the error log below).
I used the training config faster_rcnn_x101_64x4d_fpn_1x.py on my custom VOC dataset.

2020-04-20 07:55:57,096 - mmdet - INFO - Epoch [3][1050/5501]   lr: 0.01000, eta: 3 days, 4:15:46, time: 1.041, data_time: 0.007, memory: 8559, loss_rpn_cls: 0.0161, loss_rpn_bbox: 0.0184, loss_cls: 0.1549, acc: 93.3047, loss_bbox: 0.1351, loss: 0.3244
2020-04-20 07:56:49,137 - mmdet - INFO - Epoch [3][1100/5501]   lr: 0.01000, eta: 3 days, 4:14:51, time: 1.041, data_time: 0.007, memory: 8559, loss_rpn_cls: 0.0147, loss_rpn_bbox: 0.0184, loss_cls: 0.1481, acc: 93.7173, loss_bbox: 0.1308, loss: 0.3120
2020-04-20 07:57:41,310 - mmdet - INFO - Epoch [3][1150/5501]   lr: 0.01000, eta: 3 days, 4:13:58, time: 1.043, data_time: 0.007, memory: 8559, loss_rpn_cls: 0.0159, loss_rpn_bbox: 0.0183, loss_cls: 0.1540, acc: 93.2119, loss_bbox: 0.1330, loss: 0.3212
Traceback (most recent call last):
  File "./tools/train.py", line 151, in <module>
    main()
  File "./tools/train.py", line 147, in main
    meta=meta)
  File "/mmdetection/mmdet/apis/train.py", line 165, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 359, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 263, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/mmdetection/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 459, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1570910687230/work/torch/csrc/distributed/c10d/reducer.cpp:518)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f442f259687 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x7b7 (0x7f44635c2c97 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x7ca341 (0x7f44635b1341 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x206aa6 (0x7f4462fedaa6 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyCFunction_FastCallDict + 0x154 (0x56335e43fc54 in /opt/conda/bin/python)
frame #5: <unknown function> + 0x199c0e (0x56335e4c7c0e in /opt/conda/bin/python)
frame #6: _PyEval_EvalFrameDefault + 0x30a (0x56335e4ea75a in /opt/conda/bin/python)
frame #7: <unknown function> + 0x192e66 (0x56335e4c0e66 in /opt/conda/bin/python)
frame #8: _PyFunction_FastCallDict + 0x3d8 (0x56335e4c2598 in /opt/conda/bin/python)
frame #9: _PyObject_FastCallDict + 0x26f (0x56335e44001f in /opt/conda/bin/python)
frame #10: _PyObject_Call_Prepend + 0x63 (0x56335e444aa3 in /opt/conda/bin/python)
frame #11: PyObject_Call + 0x3e (0x56335e43fa5e in /opt/conda/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x19e7 (0x56335e4ebe37 in /opt/conda/bin/python)
frame #13: <unknown function> + 0x192e66 (0x56335e4c0e66 in /opt/conda/bin/python)
frame #14: _PyFunction_FastCallDict + 0x3d8 (0x56335e4c2598 in /opt/conda/bin/python)
frame #15: _PyObject_FastCallDict + 0x26f (0x56335e44001f in /opt/conda/bin/python)
frame #16: _PyObject_Call_Prepend + 0x63 (0x56335e444aa3 in /opt/conda/bin/python)
frame #17: PyObject_Call + 0x3e (0x56335e43fa5e in /opt/conda/bin/python)
frame #18: <unknown function> + 0x16b371 (0x56335e499371 in /opt/conda/bin/python)
frame #19: PyObject_Call + 0x3e (0x56335e43fa5e in /opt/conda/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x19e7 (0x56335e4ebe37 in /opt/conda/bin/python)
frame #21: PyEval_EvalCodeEx + 0x329 (0x56335e4c29b9 in /opt/conda/bin/python)
frame #22: <unknown function> + 0x1958e6 (0x56335e4c38e6 in /opt/conda/bin/python)
frame #23: PyObject_Call + 0x3e (0x56335e43fa5e in /opt/conda/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x19e7 (0x56335e4ebe37 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x192e66 (0x56335e4c0e66 in /opt/conda/bin/python)
frame #26: _PyFunction_FastCallDict + 0x3d8 (0x56335e4c2598 in /opt/conda/bin/python)
frame #27: _PyObject_FastCallDict + 0x26f (0x56335e44001f in /opt/conda/bin/python)
frame #28: _PyObject_Call_Prepend + 0x63 (0x56335e444aa3 in /opt/conda/bin/python)
frame #29: PyObject_Call + 0x3e (0x56335e43fa5e in /opt/conda/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x19e7 (0x56335e4ebe37 in /opt/conda/bin/python)
frame #31: <unknown function> + 0x192e66 (0x56335e4c0e66 in /opt/conda/bin/python)
frame #32: <unknown function> + 0x193ed6 (0x56335e4c1ed6 in /opt/conda/bin/python)
frame #33: <unknown function> + 0x199b95 (0x56335e4c7b95 in /opt/conda/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x30a (0x56335e4ea75a in /opt/conda/bin/python)
frame #35: <unknown function> + 0x19329e (0x56335e4c129e in /opt/conda/bin/python)
frame #36: <unknown function> + 0x193ed6 (0x56335e4c1ed6 in /opt/conda/bin/python)
frame #37: <unknown function> + 0x199b95 (0x56335e4c7b95 in /opt/conda/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x10cc (0x56335e4eb51c in /opt/conda/bin/python)
frame #39: <unknown function> + 0x193c5b (0x56335e4c1c5b in /opt/conda/bin/python)
frame #40: <unknown function> + 0x199b95 (0x56335e4c7b95 in /opt/conda/bin/python)
frame #41: _PyEval_EvalFrameDefault + 0x30a (0x56335e4ea75a in /opt/conda/bin/python)
frame #42: PyEval_EvalCodeEx + 0x329 (0x56335e4c29b9 in /opt/conda/bin/python)
frame #43: PyEval_EvalCode + 0x1c (0x56335e4c375c in /opt/conda/bin/python)
frame #44: <unknown function> + 0x215744 (0x56335e543744 in /opt/conda/bin/python)
frame #45: PyRun_FileExFlags + 0xa1 (0x56335e543b41 in /opt/conda/bin/python)
frame #46: PyRun_SimpleFileExFlags + 0x1c3 (0x56335e543d43 in /opt/conda/bin/python)
frame #47: Py_Main + 0x613 (0x56335e547833 in /opt/conda/bin/python)
frame #48: main + 0xee (0x56335e41188e in /opt/conda/bin/python)
frame #49: __libc_start_main + 0xf0 (0x7f447232c830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #50: <unknown function> + 0x1c3160 (0x56335e4f1160 in /opt/conda/bin/python)

@huuquan1994

@laycoding
I'm using mmdetection v1.0rc0 and it works perfectly.
You can give it a try.

@laycoding

@laycoding
I'm using mmdetection v1.0rc0 and it works perfectly.
You can give it a try.

Thx, I will try it!

@SystemErrorWang

I got the same error when training Mask R-CNN on mmdetection 2.0.0. When I switch to non-distributed training, it works fine. I would like to know what caused this problem.

@mdv3101
Contributor

mdv3101 commented May 20, 2020

@SystemErrorWang I am also facing the same problem. When I set find_unused_parameters = cfg.get('find_unused_parameters', True), the error disappeared, but my training process got stuck.
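For reference, a sketch of how that flag is typically plumbed from the config into the wrapper in mmdet 2.x-style training code; the exact call site and the cfg object are assumptions here, so check your own train.py rather than treating this as the upstream implementation.

import torch
from mmcv.parallel import MMDistributedDataParallel

# Assumption: `cfg` is the parsed mmcv Config and `model` the built detector.
find_unused_parameters = cfg.get('find_unused_parameters', False)

model = MMDistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False,
    find_unused_parameters=find_unused_parameters)

With something like this in place, adding find_unused_parameters = True to the config file toggles the behaviour without editing the code.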

@Kaeseknacker

I am getting the same error after adding the following code to fpn.py: (I want to freeze the FPN weights)

def _freeze_stages(self):
    # Put every conv layer into eval mode and stop gradient updates.
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

def train(self, mode=True):
    # Re-apply the freeze whenever the model is switched to train mode,
    # so runner hooks calling .train() do not undo it.
    super(FPN, self).train(mode)
    if self.freeze_weights:
        self._freeze_stages()

Setting find_unused_parameters = True also solved my problem.

@mdv3101
Contributor

mdv3101 commented Jun 4, 2020

I am using the latest version of mmdetection, but it is still showing the error. When I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help solve this?

Traceback (most recent call last):
  File "./tools/train.py", line 161, in <module>
    main()
  File "./tools/train.py", line 157, in main
    meta=meta)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 383, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 282, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 74, in batch_processor
    losses = model(**data)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 464, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x14e1cc446193 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x14e217e956f1 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa168ea (0x14e217e818ea in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x295a74 (0x14e217700a74 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x56198c3af004 in /home/madhav3101/torch-env/bin/python3)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x56198c3af121 in /home/madhav3101/torch-env/bin/python3)
frame #6: _PyEval_EvalFrameDefault + 0x532e (0x56198c40b40e in /home/madhav3101/torch-env/bin/python3)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #8: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #9: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #10: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #11: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #13: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #14: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #15: <unknown function> + 0x170cca (0x56198c3a6cca in /home/madhav3101/torch-env/bin/python3)
frame #16: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #19: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #22: _PyFunction_FastCallDict + 0x1d4 (0x56198c34cfe4 in /home/madhav3101/torch-env/bin/python3)
frame #23: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #24: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x56198c3ae337 in /home/madhav3101/torch-env/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x535 (0x56198c406615 in /home/madhav3101/torch-env/bin/python3)
frame #29: _PyEval_EvalCodeWithName + 0xba9 (0x56198c34c7c9 in /home/madhav3101/torch-env/bin/python3)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x56198c3ae337 in /home/madhav3101/torch-env/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x14f5 (0x56198c4075d5 in /home/madhav3101/torch-env/bin/python3)
frame #32: _PyFunction_FastCallKeywords + 0xfb (0x56198c3ae0ab in /home/madhav3101/torch-env/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x6f6 (0x56198c4067d6 in /home/madhav3101/torch-env/bin/python3)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #35: PyEval_EvalCodeEx + 0x44 (0x56198c34cdd4 in /home/madhav3101/torch-env/bin/python3)
frame #36: PyEval_EvalCode + 0x1c (0x56198c34cdfc in /home/madhav3101/torch-env/bin/python3)
frame #37: <unknown function> + 0x22f9e4 (0x56198c4659e4 in /home/madhav3101/torch-env/bin/python3)
frame #38: PyRun_FileExFlags + 0xa1 (0x56198c46fbd1 in /home/madhav3101/torch-env/bin/python3)
frame #39: PyRun_SimpleFileExFlags + 0x1c3 (0x56198c46fdc3 in /home/madhav3101/torch-env/bin/python3)
frame #40: <unknown function> + 0x23aedb (0x56198c470edb in /home/madhav3101/torch-env/bin/python3)
frame #41: _Py_UnixMain + 0x3c (0x56198c470fbc in /home/madhav3101/torch-env/bin/python3)
frame #42: __libc_start_main + 0xf0 (0x14e21e95f830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #43: <unknown function> + 0x1dfed2 (0x56198c415ed2 in /home/madhav3101/torch-env/bin/python3)

Traceback (most recent call last):
  File "./tools/train.py", line 161, in <module>
    main()
  File "./tools/train.py", line 157, in main
    meta=meta)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 383, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 282, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 74, in batch_processor
    losses = model(**data)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 464, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x1503c9261193 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x150414cb06f1 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa168ea (0x150414c9c8ea in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x295a74 (0x15041451ba74 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x56157b7d7004 in /home/madhav3101/torch-env/bin/python3)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x56157b7d7121 in /home/madhav3101/torch-env/bin/python3)
frame #6: _PyEval_EvalFrameDefault + 0x532e (0x56157b83340e in /home/madhav3101/torch-env/bin/python3)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #8: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #9: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #10: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #11: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #13: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #14: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #15: <unknown function> + 0x170cca (0x56157b7cecca in /home/madhav3101/torch-env/bin/python3)
frame #16: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #19: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #22: _PyFunction_FastCallDict + 0x1d4 (0x56157b774fe4 in /home/madhav3101/torch-env/bin/python3)
frame #23: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #24: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x56157b7d6337 in /home/madhav3101/torch-env/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x535 (0x56157b82e615 in /home/madhav3101/torch-env/bin/python3)
frame #29: _PyEval_EvalCodeWithName + 0xba9 (0x56157b7747c9 in /home/madhav3101/torch-env/bin/python3)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x56157b7d6337 in /home/madhav3101/torch-env/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x14f5 (0x56157b82f5d5 in /home/madhav3101/torch-env/bin/python3)
frame #32: _PyFunction_FastCallKeywords + 0xfb (0x56157b7d60ab in /home/madhav3101/torch-env/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x6f6 (0x56157b82e7d6 in /home/madhav3101/torch-env/bin/python3)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #35: PyEval_EvalCodeEx + 0x44 (0x56157b774dd4 in /home/madhav3101/torch-env/bin/python3)
frame #36: PyEval_EvalCode + 0x1c (0x56157b774dfc in /home/madhav3101/torch-env/bin/python3)
frame #37: <unknown function> + 0x22f9e4 (0x56157b88d9e4 in /home/madhav3101/torch-env/bin/python3)
frame #38: PyRun_FileExFlags + 0xa1 (0x56157b897bd1 in /home/madhav3101/torch-env/bin/python3)
frame #39: PyRun_SimpleFileExFlags + 0x1c3 (0x56157b897dc3 in /home/madhav3101/torch-env/bin/python3)
frame #40: <unknown function> + 0x23aedb (0x56157b898edb in /home/madhav3101/torch-env/bin/python3)
frame #41: _Py_UnixMain + 0x3c (0x56157b898fbc in /home/madhav3101/torch-env/bin/python3)
frame #42: __libc_start_main + 0xf0 (0x15041b77a830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #43: <unknown function> + 0x1dfed2 (0x56157b83ded2 in /home/madhav3101/torch-env/bin/python3)

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/home/madhav3101/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/madhav3101/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/madhav3101/torch-env/bin/python3', '-u', './tools/train.py', '--local_rank=1', 'configs/dcn/db_cascade_mask_rcnn_r101_fpn_dconv_c3-c5_1x_coco.py', '--launcher', 'pytorch', '--work-dir', '/ssd_scratch/cvit/madhav/train_dataset/coco/logs/', '--gpus', '2']' returned non-zero exit status 1.

@hanjianhua44

@mdv3101 Any solutions? I met the same problem with mmdetection 2.0.0 and mmcv 0.6.2.

I am using the latest version of mmdetection, but it is still showing the error. When I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help solve this?

@hyhyni

hyhyni commented Jul 11, 2020

@mdv3101 Any solutions? I met the same problem with mmdetection 2.0.0 and mmcv 0.6.2.

I am using the latest version of mmdetection, but it is still showing the error. When I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help solve this?

I met the same problem. Have you solved it? Thank you.

@MarsJunhaoHu

Thank you! Before I opened this issue, I had already tried adding the parameter as suggested in issue #2117. After restarting, this error seemed to disappear; the log only showed the usual parameter-mismatch messages. But then my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage and no model was loaded, so after waiting about half an hour I gave up.
I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is that something you can make use of? Thank you again!

I also ran into the same situation. When I use non-distributed training with two cards, it raises ValueError: All dicts must have the same number of keys.

@jajajajaja121 Hi. I read all your comments in mmdetection. I met exactly the same problem as you. Have you solved it? Is it a bug related to the custom dataset?

@ghost

ghost commented Jul 16, 2020

Same here, find_unused_parameters = True does not exist in train.py.

@yinghuang

I met the same issue, but I solved it.
The reason was that in my model class I defined an FPN module with 5 levels of output feature maps in the init function, but in the forward function I only used 4 of them.
When I used all of them, the problem was solved.
My conclusion: you should use every output of each module in the forward function.
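A minimal, self-contained sketch of that failure mode (the module, shapes, and names below are invented for illustration): five heads are registered in __init__, but only four reach the output, so under DistributedDataParallel the reducer waits for a gradient that never arrives unless find_unused_parameters=True is set.

import torch
import torch.nn as nn

class FiveHeads(nn.Module):
    """Toy module: five heads are defined, but forward() uses only four."""

    def __init__(self):
        super().__init__()
        # Five parallel 1x1 convolutions standing in for five FPN levels.
        self.heads = nn.ModuleList(nn.Conv2d(8, 8, 1) for _ in range(5))

    def forward(self, x):
        # heads[4] has parameters that never enter the autograd graph,
        # which is exactly what the DDP error message complains about.
        return sum(self.heads[i](x) for i in range(4))

Using all five heads in forward(), removing the unused head, or enabling find_unused_parameters=True are the usual ways out.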

@edgarschnfld

I met the same issue, but I solved it.
The reason was that in my model class I defined an FPN module with 5 levels of output feature maps in the init function, but in the forward function I only used 4 of them.
When I used all of them, the problem was solved.
My conclusion: you should use every output of each module in the forward function.

This was helpful. I encountered the same error message in a custom architecture. Here is how you can solve it without changing the module: if you define 5 layers but only use the output of the 4th layer to calculate a specific loss, you can solve the problem by multiplying the output of the 5th layer by zero and adding it to the loss. This way you trick PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the 5th layer is not an option in my case, because I need the output of this layer in most training steps (but not all).

loss = your_loss_function(output_layer_4) + 0*output_layer_5.mean()
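A tiny runnable illustration of the same trick, with hypothetical stand-in layers rather than anything from mmdetection:

import torch
import torch.nn as nn

layer_4 = nn.Linear(16, 16)
layer_5 = nn.Linear(16, 16)   # not needed for the loss in this step
x = torch.randn(4, 16)

# The zero-weighted term pulls layer_5's parameters into the autograd
# graph without changing the loss value, so DDP's reducer sees every
# parameter as used.
loss = layer_4(x).pow(2).mean() + 0 * layer_5(x).mean()
loss.backward()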

@JeffWang987

I met the same issue, but I solved it.
The reason was that in my model class I defined an FPN module with 5 levels of output feature maps in the init function, but in the forward function I only used 4 of them.
When I used all of them, the problem was solved.
My conclusion: you should use every output of each module in the forward function.

You saved my ass

@shallowtoil

I am getting the same error after adding the following code to fpn.py: (I want to freeze the FPN weights)

def _freeze_stages(self):
    # Put every conv layer into eval mode and stop gradient updates.
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

def train(self, mode=True):
    # Re-apply the freeze whenever the model is switched to train mode,
    # so runner hooks calling .train() do not undo it.
    super(FPN, self).train(mode)
    if self.freeze_weights:
        self._freeze_stages()

Setting find_unused_parameters = True also solved my problem.

Freezing the layers during initialization, or before distributing the model with MMDistributedDataParallel, will solve the issue!
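A sketch of that ordering, assuming model is the detector and model.neck is the FPN (the helper below is illustrative, not mmdetection API): parameters whose requires_grad is already False when the model is wrapped are never registered by DDP's reducer, so no reduction is expected for them.

import torch
from torch.nn.parallel import DistributedDataParallel

def freeze_convs(module):
    # Illustrative helper: stop gradient updates for every conv layer.
    for m in module.modules():
        if isinstance(m, torch.nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

freeze_convs(model.neck)              # 1) freeze first
model = DistributedDataParallel(      # 2) then wrap the model
    model.cuda(),
    device_ids=[torch.cuda.current_device()])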

liuhuiCNN pushed a commit to liuhuiCNN/mmdetection that referenced this issue May 21, 2021
@baishiruyue

Maybe the reason for the bug is that you did not pass the newly defined classes to the data and test config items. (I faced the same problem when training on my own dataset without passing the new classes to the config.)
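For anyone hitting that, a sketch of what passing custom classes through the dataset config usually looks like in mmdetection-style configs; the class names below are placeholders, and the dicts are meant to be merged into your existing dataset settings.

# Hypothetical label names; replace with your own.
classes = ('cat', 'dog')

data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))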
