NaNs during CycleGAN training #20
I have run about a dozen test runs (using train.py) on two datasets (maps, and my own custom dataset converting Synthia to Cityscapes). Every run so far has produced NaNs after a couple of epochs: sometimes after more than 70 epochs, sometimes after only a handful. Up to the point where I get only NaNs, actual learning does seem to happen, as evidenced e.g. by looking at the transformed images over the epochs. I have also played with various learning rates, but even at pretty low lr the NaNs seem to eventually occur.

My questions: Is this something others have also observed? And, in case this is "normal" and e.g. due to the difficulties of training GANs (min-max), what would be the critical parameters to vary to keep training from breaking down?
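For anyone trying to pin down where the NaNs first appear, here is a minimal debugging sketch. It is not code from train.py: the check_finite helper and the loss name are my own, and PyTorch's anomaly detection is a general tool, not something this repository enables.

```python
import torch

# Enable anomaly detection: the backward pass will then raise an error
# naming the forward operation that produced a NaN gradient. Expect a
# noticeable slowdown, so use it only while debugging.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Raise as soon as a loss becomes NaN/inf, so the failing iteration
    # (and the input batch that caused it) can be inspected directly.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} became non-finite: {tensor}")

# Example usage inside a training loop (loss_G is a hypothetical name):
loss_G = torch.tensor(1.234)    # stand-in for a real generator loss
check_finite("loss_G", loss_G)  # passes silently; raises on NaN/inf
```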
Comments

Hi revilokeb, I believe it is due to repeatedly applying normalization to images with low variance. With a normalization like InstanceNorm, the gradients tend to blow up quickly when an image has low variance, and the effect compounds as activations pass through multiple normalizations in a deep network. The problem is more frequent in the CycleGAN architecture because we use InstanceNorm in a deep network with many normalization layers. I personally ran into this issue when one image in the dataset was uniformly black due to file corruption. I think the problem can be alleviated by increasing the value of epsilon in the normalization layers, or by removing the few images that cause it.
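A sketch of both suggestions, assuming a standard PyTorch/torchvision setup. The dataset path, the variance threshold, and the eps value are illustrative guesses, not values from this repository:

```python
import glob

import torch.nn as nn
from PIL import Image
from torchvision.transforms.functional import to_tensor

# 1) Scan the dataset for near-uniform images. InstanceNorm divides by
# sqrt(var + eps), so an almost-constant image yields a tiny denominator
# and huge activations/gradients downstream.
VAR_THRESHOLD = 1e-4  # illustrative cutoff, tune for your data

for path in glob.glob("datasets/maps/trainA/*.jpg"):
    img = to_tensor(Image.open(path).convert("RGB"))  # (3, H, W) in [0, 1]
    if img.reshape(3, -1).var(dim=1).min() < VAR_THRESHOLD:
        print(f"low-variance image, consider removing: {path}")

# 2) Or build normalization layers with a larger eps than the PyTorch
# default of 1e-5, which bounds the denominator away from zero.
norm = nn.InstanceNorm2d(64, eps=1e-3)
```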
I am running into the same problem. So are you saying that images with low variance may lead to exploding gradients? Could some type of gradient clipping be used to avoid that? I've been trying to debug this, but it is hard to isolate the problem =/
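Gradient clipping is easy to try. A minimal, self-contained sketch follows; the toy network, learning rate, and max_norm are placeholders, not values from this codebase:

```python
import torch
import torch.nn as nn

# Toy stand-in for a generator; the point is only where the clipping call goes.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.InstanceNorm2d(8), nn.ReLU())
opt = torch.optim.Adam(net.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)
loss = net(x).mean()

opt.zero_grad()
loss.backward()
# Rescale all gradients in place so their global norm is at most max_norm;
# a single pathological batch then cannot produce an arbitrarily large step.
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
opt.step()
```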
The problem was solved by upgrading the version of torch. |
Which version of torch should I choose? The default is torch 1.4.
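The thread does not say which release fixes this, so that detail stays open. For reference, you can check what is currently installed like so:

```python
import torch

# Print the installed version; upgrade with e.g. `pip install --upgrade torch`.
# (Which release actually resolves the NaNs was not specified above.)
print(torch.__version__)
```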