NaNs during CycleGAN training #20
I have run about a dozen test runs (using train.py) on two datasets (maps, and my own custom dataset converting Synthia to Cityscapes). Every run so far has produced NaNs after a couple of epochs: sometimes after more than 70 epochs, sometimes after only a handful. Up to the point where I get only NaNs, actual learning does seem to happen, as evidenced e.g. by looking at the transformed images over the epochs. I have also played with various learning rates, but even at pretty low lr the NaNs seem to eventually occur.

My questions: Is this something others have also observed? And, in case this is "normal" and e.g. due to the difficulties of training GANs (min-max), what would be the critical parameters to vary to keep training from breaking down?
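For anyone trying to pin down where the NaNs first appear, here is a minimal debugging sketch. It is not code from train.py: the check_finite helper and the loss name are my own, and PyTorch's anomaly detection is a general tool, not something this repository enables.

```python
import torch

# Enable anomaly detection: the backward pass will then raise an error
# naming the forward operation that produced a NaN gradient. Expect a
# noticeable slowdown, so use it only while debugging.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Raise as soon as a loss becomes NaN/inf, so the failing iteration
    # (and the input batch that caused it) can be inspected directly.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} became non-finite: {tensor}")

# Example usage inside a training loop (loss_G is a hypothetical name):
loss_G = torch.tensor(1.234)    # stand-in for a real generator loss
check_finite("loss_G", loss_G)  # passes silently; raises on NaN/inf
```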
Comments

Hi revilokeb, I believe it is due to repeatedly applying normalization to images with low variance. With a normalization like InstanceNorm, the gradients tend to blow up quickly when an image has low variance, and the effect compounds as activations pass through multiple normalizations in a deep network. The problem is more frequent in the CycleGAN architecture because we use InstanceNorm in a deep network with many normalization layers. I personally ran into this issue when one image in the dataset was uniformly black due to file corruption. I think the problem can be alleviated by increasing the value of epsilon in the normalization layers, or by removing the few images that cause it.
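A sketch of both suggestions, assuming a standard PyTorch/torchvision setup. The dataset path, the variance threshold, and the eps value are illustrative guesses, not values from this repository:

```python
import glob

import torch.nn as nn
from PIL import Image
from torchvision.transforms.functional import to_tensor

# 1) Scan the dataset for near-uniform images. InstanceNorm divides by
# sqrt(var + eps), so an almost-constant image yields a tiny denominator
# and huge activations/gradients downstream.
VAR_THRESHOLD = 1e-4  # illustrative cutoff, tune for your data

for path in glob.glob("datasets/maps/trainA/*.jpg"):
    img = to_tensor(Image.open(path).convert("RGB"))  # (3, H, W) in [0, 1]
    if img.reshape(3, -1).var(dim=1).min() < VAR_THRESHOLD:
        print(f"low-variance image, consider removing: {path}")

# 2) Or build normalization layers with a larger eps than the PyTorch
# default of 1e-5, which bounds the denominator away from zero.
norm = nn.InstanceNorm2d(64, eps=1e-3)
```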
I am running into the same problem. So are you saying that images with low variance may lead to exploding gradients? Could some type of gradient clipping be used to avoid that? I've been trying to debug this, but it is hard to isolate the problem =/
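Gradient clipping is easy to try. A minimal, self-contained sketch follows; the toy network, learning rate, and max_norm are placeholders, not values from this codebase:

```python
import torch
import torch.nn as nn

# Toy stand-in for a generator; the point is only where the clipping call goes.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.InstanceNorm2d(8), nn.ReLU())
opt = torch.optim.Adam(net.parameters(), lr=2e-4)

x = torch.randn(4, 3, 64, 64)
loss = net(x).mean()

opt.zero_grad()
loss.backward()
# Rescale all gradients in place so their global norm is at most max_norm;
# a single pathological batch then cannot produce an arbitrarily large step.
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
opt.step()
```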
The problem was solved by upgrading the version of torch. |
Which version of torch should I choose? The default is torch 1.4.
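The thread does not say which release fixes this, so that detail stays open. For reference, you can check what is currently installed like so:

```python
import torch

# Print the installed version; upgrade with e.g. `pip install --upgrade torch`.
# (Which release actually resolves the NaNs was not specified above.)
print(torch.__version__)
```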