
Curious about Glow implementation: some weights look frozen? #20

Open
christabella opened this issue Dec 20, 2019 · 9 comments
@christabella
Contributor

christabella commented Dec 20, 2019

Hi Krzysztof,

When visualizing the distribution of weights and gradients of each tensor over training, I noticed that some of the weights don't seem to be updating, e.g. InvertibleConv1x1Layer's U_mat, L_mat, and log_S.
[screenshot: histograms of U_mat, L_mat, and log_S over training]

My first thought was that maybe the gradients are too small, but it doesn't look like that's the case:
[screenshot: gradient histograms]
Weights remain mostly constant:
[screenshots: weight histograms over training]

But gradients are... pretty explosive 😔
[screenshot: gradient histograms]

I didn't change the core code and used the high-level API, but I trained it on a different task, with the flow plugged into a larger model.

I will try running the original example you provided and report back, but in the meantime I was wondering whether you (or anyone else) had any early ideas about this. Thanks!

@kmkolasinski
Owner

This is interesting, but unfortunately I have no immediate idea. Do you have some minimal example of code to run or test, like a small model with fake input data? Are you sure you are plotting the right thing?
It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn some expected distribution, like a multivariate Gaussian with a nontrivial covariance matrix.
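Something like this, for example (a rough TF1-style sketch, not code from this repo; the stand-in loss and the variable-name substrings `U_mat`, `L_mat`, `log_S` are assumptions and should be adapted to the real model):

```python
import numpy as np
import tensorflow as tf

# Stand-in graph; replace with the real flow model and its total_loss.
x = tf.placeholder(tf.float32, [None, 4])
with tf.variable_scope("InvertibleConv1x1"):
    L_mat = tf.get_variable("L_mat", shape=[4, 4])  # assumed variable name
total_loss = tf.reduce_mean(tf.square(tf.matmul(x, L_mat)))

# Collect the 1x1-conv variables by name and build an update op for them only.
conv_vars = [v for v in tf.trainable_variables()
             if any(k in v.name for k in ("U_mat", "L_mat", "log_S"))]
opt = tf.train.AdamOptimizer(1e-4)
grads_and_vars = opt.compute_gradients(total_loss, var_list=conv_vars)
train_op = opt.apply_gradients(grads_and_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.randn(16, 4).astype(np.float32)}
    before = sess.run(conv_vars)
    grad_vals = sess.run([g for g, _ in grads_and_vars], feed)
    sess.run(train_op, feed)
    after = sess.run(conv_vars)
    for v, g, b, a in zip(conv_vars, grad_vals, before, after):
        # If the update norm stays ~0 while the grad norm is non-zero,
        # the optimizer is not actually applying these gradients.
        print(v.name,
              "grad norm:", np.linalg.norm(g),
              "update norm:", np.linalg.norm(a - b))
```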

@christabella
Contributor Author

Thank you very much for the reply! I just tried visualizing the weights in the provided notebook Celeba48x48_22steps.ipynb without changing anything else, and some gradients are already exploding even at step 2:

[screenshot: gradient histograms at step 2]

> It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn some expected distribution, like a multivariate Gaussian with a nontrivial covariance matrix.

If I understand your suggestion correctly, I will try to reduce the flow's complexity (number of steps etc.) and plot the weights again. Maybe on MNIST instead of CelebA...

@kmkolasinski
Owner

Yes, good idea. Maybe it would be better to start from a much smaller learning rate and a simpler architecture. MNIST is a good idea too. Can you check this?

@geosada

geosada commented Jan 7, 2020

I'm not sure whether this will be useful, but let me share what I encountered while playing with the code for CelebA 64x64.
Increasing the number of flow steps from 22 (Krzysztof's original setting) to 32 caused the loss to become NaN, which makes me suspect that a gradient explosion happened somewhere.

Actually, at that time I had carelessly changed the training dynamics from the original: I ran the training with num_steps=1000 and lr_ph: 0.0005 from the very beginning, and it turned out that this change is what caused the NaN error.

In short, the warm-up strategy (for example, cells In [33] and In [34] in Celeba64x64_22steps.ipynb) seems important.
So, Christabella, I also think it is worth trying to start the training with a much smaller lr_ph, as Krzysztof suggested in his last comment.
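For example, something like this (just a sketch of the idea, not the notebook's exact code; the step boundaries and values only roughly mirror cells In [33] and In [34], and feeding the result through lr_ph each step is assumed):

```python
def warmup_lr(step, warmup_steps=100, warmup_value=1e-4, base_value=5e-4):
    """Use a very small LR for the first few steps, then the normal LR.
    The values roughly mirror In [33] / In [34] and are not exact."""
    return warmup_value if step < warmup_steps else base_value

# Assumed usage inside the training loop: feed the schedule through lr_ph.
# for step in range(num_steps):
#     sess.run(train_op, feed_dict={lr_ph: warmup_lr(step), **batch_feed})
```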

@kmkolasinski
Owner

@geosada thanks for your comment 👍

@christabella
Contributor Author

Thanks a lot @geosada for pointing out the importance of the warm-up strategy! I have heard of starting with a small learning rate that goes up and then comes back down, but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering: why that schedule?

```
/\    instead of  /\
  \/             /  \
```

[33]: lr=0.0001 for 1 x 100 steps
[34]: lr=0.0005 for 5 x 1000 steps
[35]: lr=0.0001 for 5 x 1000 steps
[36]: lr=0.00005 for 5 x 1000 steps
[37]: lr=0.0001 for 5 x 1000 steps

@christabella
Contributor Author

I also noticed that the CelebA notebook uses a per-pixel loss:

```python
loss_per_pixel = loss / image_size / image_size
total_loss = l2_loss + loss_per_pixel
```

while for MNIST, it's the sum over all pixels:

```python
total_loss = l2_loss + loss
```

Maybe that's why the l2 loss goes up in the MNIST notebook: the loss summed over all pixels overpowers the l2 loss:
[screenshot: l2 loss curve from the MNIST notebook]

Furthermore, in the official GLOW implementation, log p(x) (bits_x) is also divided by the number of subpixels.
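Roughly, the conversion I mean looks like this (a paraphrase of the bits-per-subpixel normalization rather than the official code, with my own variable names):

```python
import numpy as np

def bits_per_subpixel(neg_log_likelihood_nats, height, width, channels):
    """Convert a negative log-likelihood in nats into bits per subpixel
    by dividing by log(2) times the number of subpixels (H * W * C)."""
    num_subpixels = height * width * channels
    return neg_log_likelihood_nats / (np.log(2.0) * num_subpixels)

# e.g. for MNIST: bits_per_subpixel(nll, height=28, width=28, channels=1)
```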

But when I tried to divide the MNIST loss by image_size=28, there was instability at around 10-15K steps:
[screenshots: loss curves showing instability around 10-15K steps]
The two runs hit NaN at around step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because that variable doesn't look abnormal:
[screenshot: histogram of the biases variable]

This was with lr=0.005, which is maybe too high, as @geosada mentioned, and the MNIST notebook did not have a warm-up strategy. I will try using a learning-rate warm-up.


Another unrelated thought: I wonder whether actnorm, if it acts like batch norm, will interact with l2 regularization to produce an effective adaptive learning rate.

@kmkolasinski
Owner

> but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering: why that schedule?

Hi, I believe I was adjusting the LR manually, so there is no specific reason for this schedule. When experimenting in Jupyter notebooks, I usually start from a small LR, e.g. 0.0001, and test the model for a few epochs to check whether it trains or not. If so, I increase the LR and then systematically decrease it, e.g. [0.001, 0.0005, 0.0001, 0.00005]. This time I increased the LR at the end, probably because I noticed that the model was not learning fast enough and I wanted to help it, or maybe it is just a typo. Sorry for the confusion.

@kmkolasinski
Owner

Wow, you're doing great detective work :)

> The two runs hit NaN at around step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because that variable doesn't look abnormal.

I believe you should work with a smaller LR. Normalizing flows are very complicated networks with many fragile parts, and they don't train fast (at least in my experience). With bad luck you can always sample some hard minibatch which generates a large loss and breaks the whole training. You can try to overcome this with gradient clipping techniques, or maybe one could write a custom optimizer which rejects updates for which the loss is far from the current running mean.
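For example, something along these lines (a rough TF1 sketch, not code from the notebooks; `total_loss` and `lr_ph` are assumed to already exist as in the training setup, and clip_norm=5.0 is just a guess):

```python
import tensorflow as tf

# Clip gradients by global norm before applying them, so that a single
# hard minibatch cannot produce an arbitrarily large update.
opt = tf.train.AdamOptimizer(learning_rate=lr_ph)
grads_and_vars = opt.compute_gradients(total_loss)
grads = [g for g, _ in grads_and_vars]
variables = [v for _, v in grads_and_vars]
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = opt.apply_gradients(list(zip(clipped, variables)))
```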
