
Training Error #33

Closed
EyalMichaeli opened this issue Jun 22, 2022 · 16 comments

@EyalMichaeli commented Jun 22, 2022

First, thank you for the great work, really inspiring!

To the point:
I'm trying to use EPE on my own data (Carla as the source/fake domain, a set of real images as the real domain).
I created fake G-buffers, created patches, matched them, and everything is working correctly.

For some reason, at an iteration a little above 5000, the function clip_gradient_norm throws an error/warning, and from that point on the reconstructed images are black and all outputs are 0/NaN.
I checked, and clip_gradient_norm produces a NaN value, hence the error.

Looking at the tensor itself, it seems that most values (weights) are indeed very close to 0.
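
(For anyone debugging the same thing, here is a minimal sketch of how one might check the gradient norms for non-finite values right before clipping; `model` and the function name are illustrative, not part of the EPE code.)

```python
import torch

def report_bad_gradients(model: torch.nn.Module) -> None:
    """Print every parameter whose gradient norm is NaN or Inf, just before clipping."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        grad_norm = param.grad.norm()
        if not torch.isfinite(grad_norm):
            print(f"non-finite gradient norm in {name}: {grad_norm.item()}")

# call right before torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```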

My question is: what do you think could cause this?
A few notes that might be relevant:

  1. The source domain is RGB, the target is grayscale (I don't see why that would be a problem, actually).
  2. I have (currently, just as a test) 100 images from each domain. In general, I have a total of 100k images from each domain, so that won't be a problem...

Thanks.

@EyalMichaeli (Author) commented Aug 16, 2022

Forgot to update here: in case someone is struggling with this, I solved it eventually.

The reason for the NaNs was lpips.

Instructions to solve it:

  1. Set torch.autograd.set_detect_anomaly(True) at the beginning of EPEExperiment.py so PyTorch tells you where the NaNs come from (I saw they came from lpips), then
  2. usually it's in ops like sqrt, pow, etc.; simply add an epsilon to each one and re-run. I used epsilon=1e-08 (see the sketch below).
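
A minimal sketch of what that looks like in practice, assuming the offending op is an L2-style normalization; the function below is illustrative, not the actual lpips or EPE source:

```python
import torch

torch.autograd.set_detect_anomaly(True)  # step 1: PyTorch reports the op that produced the NaN

EPS = 1e-08

def normalize(feat: torch.Tensor) -> torch.Tensor:
    # step 2: sqrt at exactly 0 has an infinite gradient, which turns into NaNs downstream,
    # so add a small epsilon inside the sqrt and in the division
    norm = torch.sqrt(torch.sum(feat ** 2, dim=1, keepdim=True) + EPS)
    return feat / (norm + EPS)
```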

Closed.

@nmaanvi commented Sep 17, 2022

@EyalMichaeli Thank you for coming back to update us on how you fixed the NaN problem. I am facing the same issue.
My losses and gradient norms are becoming NaN, so I added eps=1e-08 to the normalized tensor in the LPIPS.forward() function in the lpips library. I even added 'eps' to discriminator_losses.py in /../code/epe/network.
The NaN problem still persists. I was wondering if you could shed some light on where exactly you added eps, and whether you did anything else to fix this problem.

@EyalMichaeli (Author)

@nmaanvi Did you set torch.autograd.set_detect_anomaly(True) at the beginning of EPEExperiment.py as stated in step 1?
This should result in PyTorch telling you exactly which computation does the harm (normally it's a specific sqrt or other math function).
So you should simply follow that debugging loop: run, add eps, run, add eps to a different function... until there are no NaNs.
This took me a few training cycles.

Tip: increase the LR to reach the problematic phase faster (not by too much, though).

@nmaanvi commented Sep 17, 2022

Thank you for the prompt reply. Yes, I did set set_detect_anomaly(True) at the beginning. DivBackward0 returns NaN; the problem is thrown in loss.backward() going through _run_generator(), and the corresponding loss is the LPIPS loss.
Thanks for the tip to increase the LR!

@KacperKazan

Hey @nmaanvi, I'm running into similar issues with NaNs. Did you manage to fix it?
I'm adding epsilons as you describe here, but it's also not working.


@nmaanvi commented Sep 30, 2022

Hello @KacperKazan, I was able to train the model for some more iterations by reducing the learning rates of both the generator and the discriminator. However, the NaN problem comes up again after 100K iterations.

@yuanhaorannnnnn

In EPEExperiment.py, the original code uses "loss = 0" to initialize the loss value, so my idea is just to change this to "loss = 1e-08". Do you think that's OK or not?
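
As an illustration of the proposed change only (the loss terms below are hypothetical stand-ins for whatever the experiment accumulates, not the actual EPEExperiment.py code):

```python
import torch

# hypothetical per-term losses standing in for the values the training step accumulates
loss_terms = [torch.tensor(0.5, requires_grad=True), torch.tensor(0.2, requires_grad=True)]

# before: loss = 0
loss = 1e-08  # proposed: start from a tiny constant so the accumulated loss is never exactly zero
for term in loss_terms:
    loss = loss + term
loss.backward()
```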

@KacperKazan

Hey @nmaanvi, hmm, I really don't understand the NaN problem :( Maybe we could help each other out to solve this issue.

Is your training able to produce any good results? Also, which generator network do you use, 'hr' or 'hr_new'?

For example, when I use 'hr' the training crashes after about 5000 iterations on a NaN error in the loss backward. However, when I use 'hr_new' it doesn't crash, but most of the time it just outputs black images, and recently it started outputting noise like what can be seen below.
[two attached screenshots of noisy generator output]

Also, @EyalMichaeli, were you able to resolve this issue fully? Just adding epsilon to functions doesn't seem to work for me.

@KacperKazan

In EPEExperiment.py, the original code uses "loss = 0" to initialize the loss value, so my idea is just to change this to "loss = 1e-08". Do you think that's OK or not?

I guess it wouldn't hurt to try it. Does it solve the NaN issue?

@nmaanvi commented Oct 3, 2022

@KacperKazan, I used 'hr', and with a reduced learning rate I could train the model (with pretty good results) until 100K iterations, after which the same NaN problem crops up. When I used 'hr_new', NaNs come up after 5K iterations. I am currently trying to solve this NaN problem without changing the LR, and also to train the model as per the authors' suggestion (1M iterations).

@EyalMichaeli (Author)

Hey guys, please try the lpips version I used (I forked it here: https://github.com/EyalMichaeli/PerceptualSimilarity).
Simply install it locally (editable, so you can change it if you like) and try running with it; perhaps it'll work. If it doesn't, let me know and I'll try to figure out what else I changed.

@KacperKazan Regarding your question, yes. I've been training smoothly since I posted the comment here.
Side note: I'm using a different set of datasets, and I think each dataset might hit different functions that are prone to producing NaNs, so make sure you put eps wherever PyTorch anomaly detection tells you.

@nmaanvi commented Oct 6, 2022

@KacperKazan, I tried working with 'hr_new' again and see that the generator (PassThruGen) output has NaNs, which results in NaNs everywhere else. After seeing your question here (#45), I just added the same sigmoid function used in 'hr', and the model at least no longer gives black images and is training (albeit poorly). How have you handled the unbounded output from the generator?
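
A minimal sketch of the kind of change described here, i.e. squashing an otherwise unbounded generator output with a sigmoid; the wrapper module below is illustrative and not the actual PassThruGen code:

```python
import torch
import torch.nn as nn

class BoundedGenerator(nn.Module):
    """Wrap a generator so its output is squashed into (0, 1) instead of being unbounded."""
    def __init__(self, generator: nn.Module):
        super().__init__()
        self.generator = generator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.generator(x))
```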

@yuanhaorannnnnn commented Oct 8, 2022

I will try your lpips, and give you feedback later.

@yuanhaorannnnnn commented Oct 9, 2022

No luck; it crashed at 3725 iterations. Everything else I left at default.

@yuanhaorannnnnn commented Oct 9, 2022

I just set "spectral = False" in the function "make_conv_layer" in network_factory.py, and the training process has been going well so far.
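
For reference, a sketch of what toggling that flag amounts to; this is an illustrative conv-layer factory, not the actual network_factory.py code:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_conv(in_channels: int, out_channels: int, kernel_size: int = 3, spectral: bool = False) -> nn.Module:
    """Build a conv layer, optionally wrapped in spectral normalization."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
    # spectral=False skips the spectral_norm wrapper, which is the workaround described above
    return spectral_norm(conv) if spectral else conv
```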

@yuanhaorannnnnn commented Oct 10, 2022

It works. My solution is ugly (dropping the spectral norm), and I'll try with real G-buffers to see if I can get a good result.
