Error occurred during training BasicVSR #273

Closed
NK-CS-ZZL opened this issue Apr 23, 2021 · 17 comments

@NK-CS-ZZL
Contributor

NK-CS-ZZL commented Apr 23, 2021

During training, the error below occurred. I ran it with the official config "configs/restorers/basicvsr/basicvsr_reds4.py".

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument "find_unused_parameters=True" to "torch.nn.parallel.DistributedDataParallel"; (2) making sure all "forward" function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's "forward" function. Please include the loss function and the structure of the return value of "forward" of your module when reporting this issue (e.g. list, dict, iterable).

Though I worked around it by adding find_unused_parameters=True at line 128 of "apis/train.py" and removing self.generator.find_unused_parameters = False at line 84 of "models/restorers/basicvsr.py", I still don't know why the error happened.

In "models/restorers/basicvsr.py", author declares a variable "self.generator.find_unused_parameters = False", which seems that shouldn't have influence on program, for it's never used as a parameter in program. However, it indeed causes the error mentioned above.

@ckkelvinchan
Member

Did you use the provided configuration? This error occurs because some trainable parameters do not receive gradients.

ckkelvinchan self-assigned this Apr 23, 2021
@NK-CS-ZZL
Contributor Author

NK-CS-ZZL commented Apr 23, 2021

Did you use the provided configuration? This error occurs because some trainable parameters do not receive gradients.

Yes, I used the configuration in "configs/restorers/basicvsr/basicvsr_reds4.py". If I don't add find_unused_parameters=True, the program crashes after the first iteration, and if I don't remove self.generator.find_unused_parameters = False, it crashes when self.step_counter == self.fix_iter.
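For readers hitting the same crash: the timing at self.step_counter == self.fix_iter lines up with how a warm-up phase of this kind typically works. While the flow network is frozen, its parameters never receive gradients, which is exactly the situation DDP rejects unless find_unused_parameters=True. A rough sketch of that pattern (my paraphrase, assuming SPyNet is frozen for the first fix_iter iterations; not the repository's exact train_step):

```python
import torch.nn as nn


class FixIterSchedule:
    """Freeze the flow estimator for the first fix_iter steps, then unfreeze."""

    def __init__(self, generator: nn.Module, fix_iter: int = 5000):
        self.generator = generator
        self.fix_iter = fix_iter
        self.step_counter = 0
        self.is_weight_fixed = False

    def before_step(self) -> None:
        if self.step_counter < self.fix_iter:
            if not self.is_weight_fixed:
                self.is_weight_fixed = True
                # Frozen parameters get no gradients, so DDP needs
                # find_unused_parameters=True during this phase.
                for name, param in self.generator.named_parameters():
                    if 'spynet' in name:
                        param.requires_grad_(False)
        elif self.step_counter == self.fix_iter:
            # Unfreeze everything once the warm-up is over.
            self.generator.requires_grad_(True)
        self.step_counter += 1
```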

@ckkelvinchan
Member

I just tried and did not encounter the error. Could you provide me with more details? What command do you use for training? And did you modify any code in the model or architecture?

@NK-CS-ZZL
Contributor Author

NK-CS-ZZL commented Apr 23, 2021

Environment:
python==3.7.10
pytorch==1.8.1
torchvision==0.9.1
openCV==4.5.1
mmcv==1.3.1
mmediting==0.6.0+b6a516c

command for training: ./tools/dist_train.sh ./configs/restorers/basicvsr/basicvsr_reds4.py 2

I double-checked the configuration file and made sure that I used the official configuration. Here is my training log.

And did you modify any code in the model or architecture?

No, I only removed self.generator.find_unused_parameters = False so that training could continue after 5k iterations (fix_iter == 5k). All other components are kept unchanged.

20210423_214330.log

@ckkelvinchan
Member

Could you revert to the original code (don't add or remove self.generator.find_unused_parameters), try fix_iter==-1, and let me know what happens?

@NK-CS-ZZL
Contributor Author

I tried the original code with fix_iter==-1 and it works well. However, if I change fix_iter to a positive number, the same error occurs again.

@ckkelvinchan
Member

I just tried to run ./tools/dist_train.sh ./configs/restorers/basicvsr/basicvsr_reds4.py 2 and did not encounter any error. I am using PyTorch 1.6, and that may be the reason.

Can you please try to print out the value of self.generator.find_unused_parameters before line 91 (i.e. optimizer['generator'].zero_grad())?
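The check requested above boils down to something like the following helper (illustrative only; restorer and optimizer mirror the names quoted in this thread, and the line number is the one in the reporter's copy of the file):

```python
def log_ddp_flag(restorer, optimizer) -> None:
    """Print the DDP flag right before zeroing the generator's gradients."""
    flag = getattr(restorer.generator, 'find_unused_parameters', None)
    print(f'find_unused_parameters = {flag}')
    optimizer['generator'].zero_grad()
```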

@NK-CS-ZZL
Contributor Author

It is True.

@ckkelvinchan
Member

It is True but you still encounter the error after the first iteration, even when fix_iter==5000?

@NK-CS-ZZL
Contributor Author

Yes...I'm also very confused about that.

@NK-CS-ZZL
Contributor Author

It seems that someone in the MMDetection repository encountered a similar issue.

@ckkelvinchan
Member

That's strange, I will look into that.

@NK-CS-ZZL
Contributor Author

Thanks for your detailed reply.

@ckkelvinchan
Member

ckkelvinchan commented Apr 25, 2021

@NK-CS-ZZL Hi, would you be able to test whether the problem exists in PyTorch 1.6.0? I want to see whether it is the problem of PyTorch version, but my CUDA driver is too old to install PyTorch 1.8.1. It may take some time to reconfigure the server.

@NK-CS-ZZL
Contributor Author

@NK-CS-ZZL Hi, would you be able to test whether the problem exists in PyTorch 1.6.0? I want to see whether it is the problem of PyTorch version, but my CUDA driver is too old to install PyTorch 1.8.1. It may take some time to reconfigure the server.

I tried it and the code works well. After 5000 iterations, SPyNet updates as expected.
My current environment:
PyTorch: 1.6.0
TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.3.2
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
CUDA Runtime 10.1
CuDNN 7.6.3
MMEditing: 0.6.0+67ac739

20210430_120021.log

@ckkelvinchan
Member

Then it's probably because of the difference in the PyTorch version. I will further look into the problem. Thanks for your report.

@ckkelvinchan
Member

Fixed in #290. Thank you for your report. Please reopen if there are still any issues.
