
Curious about Glow implementation: some weights look frozen? #20

Open
christabella opened this issue Dec 20, 2019 · 9 comments
@christabella
Contributor

christabella commented Dec 20, 2019

Hi Krzysztof,

When visualizing the distribution of weights and gradients of each tensor over training, I noticed that some of the weights don't seem to be updating, e.g. InvertibleConv1x1Layer's U_mat, L_mat, and log_S.
[screenshot: histograms of U_mat, L_mat, and log_S over training]

My first thought was that maybe the gradients are too small, but it doesn't look like that's the case:
[screenshot: gradient histograms]
Weights remain mostly constant:
[screenshots: weight histograms over training]

But gradients are... pretty explosive 😔
[screenshot: gradient histograms]

I didn't change the core code and used the high-level API, but I trained it on a different task, with the flow plugged into a larger model.

I will try running the original example you provided and report back, but in the meantime I was wondering whether you (or anyone else) had any early ideas about this. Thanks!

@kmkolasinski
Owner

This is interesting, but unfortunately I have no immediate idea. Do you have some minimal example of code to run or test, like a small model with fake input data? Are you sure you are plotting the right thing?
It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn some expected distribution, like a multivariate Gaussian with a nontrivial covariance matrix.
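Something like this, for example (a rough TF1-style sketch, not code from this repo; the stand-in loss and the variable-name substrings `U_mat`, `L_mat`, `log_S` are assumptions and should be adapted to the real model):

```python
import numpy as np
import tensorflow as tf

# Stand-in graph; replace with the real flow model and its total_loss.
x = tf.placeholder(tf.float32, [None, 4])
with tf.variable_scope("InvertibleConv1x1"):
    L_mat = tf.get_variable("L_mat", shape=[4, 4])  # assumed variable name
total_loss = tf.reduce_mean(tf.square(tf.matmul(x, L_mat)))

# Collect the 1x1-conv variables by name and build an update op for them only.
conv_vars = [v for v in tf.trainable_variables()
             if any(k in v.name for k in ("U_mat", "L_mat", "log_S"))]
opt = tf.train.AdamOptimizer(1e-4)
grads_and_vars = opt.compute_gradients(total_loss, var_list=conv_vars)
train_op = opt.apply_gradients(grads_and_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.randn(16, 4).astype(np.float32)}
    before = sess.run(conv_vars)
    grad_vals = sess.run([g for g, _ in grads_and_vars], feed)
    sess.run(train_op, feed)
    after = sess.run(conv_vars)
    for v, g, b, a in zip(conv_vars, grad_vals, before, after):
        # If the update norm stays ~0 while the grad norm is non-zero,
        # the optimizer is not actually applying these gradients.
        print(v.name,
              "grad norm:", np.linalg.norm(g),
              "update norm:", np.linalg.norm(a - b))
```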

@christabella
Contributor Author

Thank you very much for the reply! I just tried visualizing the weights in the provided notebook Celeba48x48_22steps.ipynb without changing anything else, and some gradients are already exploding even at step 2:

[screenshot: gradient histograms at step 2]

> It would probably be worth checking whether the optimizer actually applies these gradients to the L, U, and scale weights. Maybe we could start from a single layer with a 1D invertible conv and check whether it can learn some expected distribution, like a multivariate Gaussian with a nontrivial covariance matrix.

If I understand your suggestion correctly, I will try to reduce the flow's complexity (number of steps etc.) and plot the weights again. Maybe on MNIST instead of CelebA...

@kmkolasinski
Owner

Yes, good idea. Maybe it would be better to start from a much smaller learning rate and a simpler architecture. MNIST is a good idea too. Can you check this?

@geosada

geosada commented Jan 7, 2020

I'm not sure whether this will be useful, but let me share what I encountered while playing with the code for CelebA 64x64.
Increasing the number of flow steps from 22 (Krzysztof's original setting) to 32 caused the loss to become NaN, which makes me suspect that a gradient explosion happened somewhere.

Actually, at that time I had carelessly changed the training dynamics from the original: I ran the training with num_steps=1000 and lr_ph: 0.0005 from the very beginning, and it turned out that this change is what caused the NaN error.

In short, the warm-up strategy (for example, cells In [33] and In [34] in Celeba64x64_22steps.ipynb) seems important.
So, Christabella, I also think it is worth trying to start the training with a much smaller lr_ph, as Krzysztof suggested in his last comment.
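For example, something like this (just a sketch of the idea, not the notebook's exact code; the step boundaries and values only roughly mirror cells In [33] and In [34], and feeding the result through lr_ph each step is assumed):

```python
def warmup_lr(step, warmup_steps=100, warmup_value=1e-4, base_value=5e-4):
    """Use a very small LR for the first few steps, then the normal LR.
    The values roughly mirror In [33] / In [34] and are not exact."""
    return warmup_value if step < warmup_steps else base_value

# Assumed usage inside the training loop: feed the schedule through lr_ph.
# for step in range(num_steps):
#     sess.run(train_op, feed_dict={lr_ph: warmup_lr(step), **batch_feed})
```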

@kmkolasinski
Owner

@geosada thanks for your comment 👍

@christabella
Contributor Author

Thanks a lot @geosada for pointing out the importance of the warm-up strategy! I have heard of starting with a small learning rate that goes up and then comes back down, but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering: why that schedule?

```
/\    instead of  /\
  \/             /  \
```

[33]: lr=0.0001 for 1 x 100 steps
[34]: lr=0.0005 for 5 x 1000 steps
[35]: lr=0.0001 for 5 x 1000 steps
[36]: lr=0.00005 for 5 x 1000 steps
[37]: lr=0.0001 for 5 x 1000 steps

@christabella
Contributor Author

I also noticed that the CelebA notebook uses a per-pixel loss:

```python
loss_per_pixel = loss / image_size / image_size
total_loss = l2_loss + loss_per_pixel
```

while for MNIST, it's the sum over all pixels:

```python
total_loss = l2_loss + loss
```

Maybe that's why the l2 loss goes up in the MNIST notebook: the loss summed over all pixels overpowers the l2 loss:
[screenshot: l2 loss curve from the MNIST notebook]

Furthermore, in the official GLOW implementation, log p(x) (bits_x) is also divided by the number of subpixels.
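Roughly, the conversion I mean looks like this (a paraphrase of the bits-per-subpixel normalization rather than the official code, with my own variable names):

```python
import numpy as np

def bits_per_subpixel(neg_log_likelihood_nats, height, width, channels):
    """Convert a negative log-likelihood in nats into bits per subpixel
    by dividing by log(2) times the number of subpixels (H * W * C)."""
    num_subpixels = height * width * channels
    return neg_log_likelihood_nats / (np.log(2.0) * num_subpixels)

# e.g. for MNIST: bits_per_subpixel(nll, height=28, width=28, channels=1)
```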

But when I tried to divide the MNIST loss by image_size=28, there was instability at around 10-15K steps:
[screenshots: loss curves showing instability around 10-15K steps]
The two runs hit NaN at around step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because that variable doesn't look abnormal:
[screenshot: histogram of the biases variable]

This was with lr=0.005, which is maybe too high, as @geosada mentioned, and the MNIST notebook did not have a warm-up strategy. I will try using a learning-rate warm-up.


Another unrelated thought: I wonder whether actnorm, if it acts like batch norm, will interact with l2 regularization to produce an effective adaptive learning rate.

@kmkolasinski
Owner

> but in Celeba64x64_22steps.ipynb the learning rate schedule goes up, down, down, up. Just wondering: why that schedule?

Hi, I believe I was adjusting the LR manually, so there is no specific reason for this schedule. When experimenting in Jupyter notebooks, I usually start from a small LR, e.g. 0.0001, and test the model for a few epochs to check whether it trains or not. If so, I increase the LR and then systematically decrease it, e.g. [0.001, 0.0005, 0.0001, 0.00005]. This time I increased the LR at the end, probably because I noticed that the model was not learning fast enough and I wanted to help it, or maybe it is just a typo. Sorry for the confusion.

@kmkolasinski
Owner

Wow, you're doing great detective work :)

> The two runs hit NaN at around step 8K and 4K respectively. The NaN first pops up in gradient/ChainLayer/Scale3/ChainLayer/Step5/ChainLayer//AffineCouplingLayer/Step5/Conv/biases_0, which is really strange because that variable doesn't look abnormal.

I believe you should work with a smaller LR. Normalizing flows are very complicated networks with many fragile parts, and they don't train fast (at least in my experience). With bad luck you can always sample some hard minibatch which generates a large loss and breaks the whole training. You can try to overcome this with gradient clipping techniques, or maybe one could write a custom optimizer which rejects updates for which the loss is far from the current running mean.
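For example, something along these lines (a rough TF1 sketch, not code from the notebooks; `total_loss` and `lr_ph` are assumed to already exist as in the training setup, and clip_norm=5.0 is just a guess):

```python
import tensorflow as tf

# Clip gradients by global norm before applying them, so that a single
# hard minibatch cannot produce an arbitrarily large update.
opt = tf.train.AdamOptimizer(learning_rate=lr_ph)
grads_and_vars = opt.compute_gradients(total_loss)
grads = [g for g, _ in grads_and_vars]
variables = [v for _, v in grads_and_vars]
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = opt.apply_gradients(list(zip(clipped, variables)))
```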
