Error occurred during training BasicVSR #273

Closed
NK-CS-ZZL opened this issue Apr 23, 2021 · 17 comments

@NK-CS-ZZL
Contributor

NK-CS-ZZL commented Apr 23, 2021

During training, the error below occurred. I ran it with the official config "configs/restorers/basicvsr/basicvsr_reds4.py".

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument "find_unused_parameters=True" to "torch.nn.parallel.DistributedDataParallel"; (2) making sure all "forward" function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's "forward" function. Please include the loss function and the structure of the return value of "forward" of your module when reporting this issue (e.g. list, dict, iterable).

Though I worked around it by adding find_unused_parameters=True at line 128 of "apis/train.py" and removing self.generator.find_unused_parameters = False at line 84 of "models/restorers/basicvsr.py", I still don't know why the error happened.

In "models/restorers/basicvsr.py", author declares a variable "self.generator.find_unused_parameters = False", which seems that shouldn't have influence on program, for it's never used as a parameter in program. However, it indeed causes the error mentioned above.

@ckkelvinchan
Member

Did you use the provided configuration? This error occurs because some trainable parameters do not receive gradients.

ckkelvinchan self-assigned this Apr 23, 2021
@NK-CS-ZZL
Contributor Author

NK-CS-ZZL commented Apr 23, 2021

Did you use the provided configuration? This error occurs because some trainable parameters do not receive gradients.

Yes, I used the configuration in "configs/restorers/basicvsr/basicvsr_reds4.py". If I don't add find_unused_parameters=True, the program crashes after the first iteration, and if I don't remove self.generator.find_unused_parameters = False, it crashes when self.step_counter == self.fix_iter.
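For readers hitting the same crash: the timing at self.step_counter == self.fix_iter lines up with how a warm-up phase of this kind typically works. While the flow network is frozen, its parameters never receive gradients, which is exactly the situation DDP rejects unless find_unused_parameters=True. A rough sketch of that pattern (my paraphrase, assuming SPyNet is frozen for the first fix_iter iterations; not the repository's exact train_step):

```python
import torch.nn as nn


class FixIterSchedule:
    """Freeze the flow estimator for the first fix_iter steps, then unfreeze."""

    def __init__(self, generator: nn.Module, fix_iter: int = 5000):
        self.generator = generator
        self.fix_iter = fix_iter
        self.step_counter = 0
        self.is_weight_fixed = False

    def before_step(self) -> None:
        if self.step_counter < self.fix_iter:
            if not self.is_weight_fixed:
                self.is_weight_fixed = True
                # Frozen parameters get no gradients, so DDP needs
                # find_unused_parameters=True during this phase.
                for name, param in self.generator.named_parameters():
                    if 'spynet' in name:
                        param.requires_grad_(False)
        elif self.step_counter == self.fix_iter:
            # Unfreeze everything once the warm-up is over.
            self.generator.requires_grad_(True)
        self.step_counter += 1
```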

@ckkelvinchan
Member

I just tried and did not encounter the error. Could you provide me with more details? What command do you use for training? And did you modify any code in the model or architecture?

@NK-CS-ZZL
Contributor Author

NK-CS-ZZL commented Apr 23, 2021

Environment:
python==3.7.10
pytorch==1.8.1
torchvision==0.9.1
openCV==4.5.1
mmcv==1.3.1
mmediting==0.6.0+b6a516c

command for training: ./tools/dist_train.sh ./configs/restorers/basicvsr/basicvsr_reds4.py 2

I double-checked the configuration file and made sure that I used the official configuration. Here is my training log.

And did you modify any code in the model or architecture?

No, I only removed self.generator.find_unused_parameters = False so that training could continue after 5k iterations (fix_iter == 5k). All other components are kept unchanged.

20210423_214330.log

@ckkelvinchan
Member

Could you revert to the original code (don't add or remove self.generator.find_unused_parameters), try fix_iter==-1, and let me know what happens?

@NK-CS-ZZL
Contributor Author

I tried the original code with fix_iter==-1 and it works well. However, if I change fix_iter to a positive number, the same error occurs again.

@ckkelvinchan
Member

I just tried to run ./tools/dist_train.sh ./configs/restorers/basicvsr/basicvsr_reds4.py 2 and did not encounter any error. I am using PyTorch 1.6, and that may be the reason.

Can you please try to print out the value of self.generator.find_unused_parameters before line 91 (i.e. optimizer['generator'].zero_grad())?
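The check requested above boils down to something like the following helper (illustrative only; restorer and optimizer mirror the names quoted in this thread, and the line number is the one in the reporter's copy of the file):

```python
def log_ddp_flag(restorer, optimizer) -> None:
    """Print the DDP flag right before zeroing the generator's gradients."""
    flag = getattr(restorer.generator, 'find_unused_parameters', None)
    print(f'find_unused_parameters = {flag}')
    optimizer['generator'].zero_grad()
```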

@NK-CS-ZZL
Contributor Author

It is True.

@ckkelvinchan
Member

It is True but you still encounter the error after the first iteration, even when fix_iter==5000?

@NK-CS-ZZL
Contributor Author

Yes...I'm also very confused about that.

@NK-CS-ZZL
Contributor Author

It seems that someone in the MMDetection repository encountered a similar issue.

@ckkelvinchan
Member

That's strange, I will look into that.

@NK-CS-ZZL
Contributor Author

Thanks for your detailed reply.

@ckkelvinchan
Member

ckkelvinchan commented Apr 25, 2021

@NK-CS-ZZL Hi, would you be able to test whether the problem exists in PyTorch 1.6.0? I want to see whether it is the problem of PyTorch version, but my CUDA driver is too old to install PyTorch 1.8.1. It may take some time to reconfigure the server.

@NK-CS-ZZL
Contributor Author

@NK-CS-ZZL Hi, would you be able to test whether the problem exists in PyTorch 1.6.0? I want to see whether it is the problem of PyTorch version, but my CUDA driver is too old to install PyTorch 1.8.1. It may take some time to reconfigure the server.

I tried it and the code works well. After 5000 iterations, SPyNet updates as expected.
My current environment:
PyTorch: 1.6.0
TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.3.2
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
CUDA Runtime 10.1
CuDNN 7.6.3
MMEditing: 0.6.0+67ac739

20210430_120021.log

@ckkelvinchan
Member

Then it's probably because of the difference in the PyTorch version. I will further look into the problem. Thanks for your report.

@ckkelvinchan
Member

Fixed in #290. Thank you for your report. Please reopen if there are still any issues.
