RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one #2153
Comments
Hi @vincentwei0919 ,
@ZwwWayne I have a similar problem. Can you explain before which line of code `find_unused_parameters=True` should be added? Because everyone has a different version of mmdet.
@237014845 ,
@ZwwWayne thx
Thank you! Before I raised this issue, I had tried adding the parameter as indicated in issue #2117. After I restarted, the error seemed to disappear; the log only showed the usual parameter mismatches. But my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage and no model was loaded, so after waiting about half an hour I gave up!
@vincentwei0919 I have a similar problem
@ZwwWayne Would it be possible to change the train API so that the user can set this via a flag in the config file or as an additional argument?
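A minimal sketch of what such a flag could look like in the distributed branch of the train API. The config key name `find_unused_parameters`, the import path, and the surrounding `cfg`/`model` variables are assumptions; the exact code depends on the mmdet/mmcv version:

```python
import torch
from mmcv.parallel import MMDistributedDataParallel  # import path may differ by version

# Read an optional flag from the config; default to False to keep the current behaviour.
find_unused_parameters = cfg.get('find_unused_parameters', False)

model = MMDistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False,
    find_unused_parameters=find_unused_parameters)
```

With something like this in place, a user would only need to add `find_unused_parameters = True` at the top level of their config file.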
I also meet the same situation. When I use non-distributed training but with two cards, it raises ValueError: All dicts must have the same number of keys.
Hi @jajajajaja121 ,
I also got the same problem, even when I set the flag
Hi @huuquan1994 ,
Yes, this bug still exists, and here is my config
My bug disappeared after I reinstalled the latest version of mmdetection; you can try this method.
@ZwwWayne Sorry for the late reply!
Oh, thanks, I will try it @jajajajaja121
Have you solved the problem? @vincentwei0919 @huuquan1994 I used the newest version of mmdet and still got stuck when trying to train HTC on my own dataset.
@laycoding @ZwwWayne
The two Docker images got the same error log as @laycoding mentioned. Training got stuck infinitely at the

@laycoding
Thx, I will try it!
Got the same error when training Mask R-CNN on mmdetection 2.0.0. When I switch to non-distributed training, it works fine. Would like to know what caused this problem.
@SystemErrorWang I am also facing the same problem. When I set
I am getting the same error after adding the following code to fpn.py (I want to freeze the FPN weights):

```python
def _freeze_stages(self):
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

def train(self, mode=True):
    super(FPN, self).train(mode)
    if self.freeze_weights:
        self._freeze_stages()
```

Setting
I am using the latest version of mmdetection, but it is still showing the error. And when I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help in solving it?
@mdv3101 Any solutions? I meet the same problem with mmdetection 2.0.0 and mmcv 0.6.2.

> I am using the latest version of mmdetection, but it is still showing the error. And when I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help in solving it?
I meet the same problem. Have you solved it? Thank you.
@jajajajaja121 Hi. I read all your comments in mmdetection. I meet exactly the same problem as you. Have you solved the problem? Is it a bug in the custom dataset?
Same here, find_unused_parameters = True does not exist in train.py.
I met the same issue.
This was helpful. I encountered the same error message in a custom architecture. Here is how you can solve it without changing the module: if you define 5 layers but only use the output of the 4th layer to calculate a specific loss, you can solve the problem by multiplying the output of the 5th layer by zero and adding it to the loss. This way you trick PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the 5th layer is not an option in my case, because I need the output of this layer in most training steps (but not all).
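A minimal, self-contained sketch of that trick; the toy module and tensor names are made up for illustration and are not mmdet code:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Five layers, but the loss is normally computed from layer 4's output only."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(5)])

    def forward(self, x):
        for layer in self.layers[:4]:
            x = layer(x)
        out4 = x
        out5 = self.layers[4](out4)
        return out4, out5

model = ToyModel()
x = torch.randn(8, 16)
target = torch.randn(8, 16)

out4, out5 = model(x)
loss = nn.functional.mse_loss(out4, target)
# Multiply the unused output by zero and add it to the loss: layer 5 stays in
# the autograd graph (with zero gradient), so DDP's reducer sees every
# parameter participate and does not raise the reduction error.
loss = loss + 0.0 * out5.sum()
loss.backward()
```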
You saved my ass
Freezing the layers during initialization, or before distributing the model with MMDistributedDataParallel, will solve the issue!
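A sketch of that ordering; `build_detector`, the `model.neck` attribute (the FPN), and the MMDistributedDataParallel arguments are assumptions based on mmdet 2.x and may need adjusting for other versions:

```python
import torch
from mmcv.parallel import MMDistributedDataParallel
from mmdet.models import build_detector

# Build the detector first (cfg as in mmdet's train script).
model = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)

# Freeze the FPN weights *before* wrapping the model. DDP only registers
# parameters that require grad at construction time, so parameters frozen
# here are never expected to produce gradients and the reduction error
# does not occur.
for p in model.neck.parameters():
    p.requires_grad = False

model = MMDistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False)
```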
Maybe the reason for the bug is that you did not pass the newly defined classes to the data and test config items (I faced the same problem when training on my own dataset without passing the new classes to the config item).
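For reference, a minimal sketch of passing custom classes through the config; the field names follow the mmdet 2.x custom-dataset convention and the class names are placeholders:

```python
# In the dataset part of the config (placeholder class names):
classes = ('person', 'car')

data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))

# The head must agree with the dataset, e.g. for a two-stage detector:
model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=2)))
```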
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
A clear and concise description of what the bug is.
Reproduction
I have changed the config name from faster_rcnn_r50_fpn_1x.py to element.py
Did you make any modifications on the code or config? Did you understand what you have modified?
only num_classes and work_dir in config
What dataset did you use?
![image](https://user-images.githubusercontent.com/24283259/75251117-eb63d500-5814-11ea-9bbc-361989af994e.png)
My own dataset, which is made in the same format as VOC.
Environment
Please run `python mmdet/utils/collect_env.py` to collect the necessary environment information and paste it here. You may add additional information that may be helpful for locating the problem, such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.
Error traceback
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!