SELU seems a good drop-in replacement for PReLu #29

Closed
yu45020 opened this issue May 31, 2018 · 10 comments

@yu45020

yu45020 commented May 31, 2018

I implemented your model in PyTorch (with some changes) and ran some experiments on self-collected images. I have very limited GPU resources, so I was not able to test SELU thoroughly in the model. The results might be an illusion, but they seem worth more testing.

Model specification:

  • upscale: 2
  • feature-extraction layers: 12
  • first feature-extraction filters: 196
  • last feature-extraction filters: 48 (the number of filters decreases by exponential decay)
  • reconstruction filters ("nin_filters"): 128
  • upsampling filters ("reconstruct_filters"): 32

No dropout, no weight decay, no ensemble. (A sketch of this stack is below.)
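For reference, a minimal PyTorch sketch of the feature-extraction stack described above; only the layer count, the exponential filter decay (196 → 48), and the SELU activation come from the spec, while the class name, the exact decay formula, and the padding are my own assumptions:

```python
import torch.nn as nn

def exp_decay_filters(first=196, last=48, layers=12):
    """Filter counts decayed exponentially from `first` down to `last`."""
    ratio = (last / first) ** (1.0 / (layers - 1))
    return [round(first * ratio ** i) for i in range(layers)]

class FeatureExtraction(nn.Module):
    """Sketch of the 12-layer, 3x3-conv feature-extraction stack with SELU."""
    def __init__(self, in_channels=3, layers=12, first=196, last=48):
        super().__init__()
        body, prev = [], in_channels
        for f in exp_decay_filters(first, last, layers):
            body += [nn.Conv2d(prev, f, kernel_size=3, padding=1), nn.SELU(inplace=True)]
            prev = f
        self.body = nn.Sequential(*body)

    def forward(self, x):
        return self.body(x)
```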

Dataset:
High-resolution PNG images are cropped into non-overlapping 96x96 patches and downsampled with bicubic interpolation. They are then converted to JPEG at 50% quality as the input data. In the final reconstruction part, the input images are upsampled with bilinear interpolation.
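A rough sketch of that patch-preparation pipeline with Pillow; the function name, the patch stride, and the exact ordering of the JPEG round-trip are assumptions on my part:

```python
import io
from PIL import Image

def make_training_pairs(hr_path, patch=96, scale=2, jpeg_quality=50):
    """Crop non-overlapping 96x96 HR patches, bicubic-downsample them,
    and round-trip through JPEG at 50% quality to build the LR inputs."""
    hr = Image.open(hr_path).convert("RGB")
    pairs = []
    for top in range(0, hr.height - patch + 1, patch):
        for left in range(0, hr.width - patch + 1, patch):
            hr_patch = hr.crop((left, top, left + patch, top + patch))
            lr = hr_patch.resize((patch // scale, patch // scale), Image.BICUBIC)
            buf = io.BytesIO()
            lr.save(buf, format="JPEG", quality=jpeg_quality)  # simulate compression artifacts
            lr = Image.open(buf).convert("RGB")
            # The bilinear-upsampled copy of the input is what the final
            # reconstruction step adds back to the network output.
            lr_up = lr.resize((patch, patch), Image.BILINEAR)
            pairs.append((lr, lr_up, hr_patch))
    return pairs
```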

Optimizer:
Adam, with an initial learning rate of 5e-4 and betas (0.9, 0.999).

Here are the results:
The y-axis is the SSIM score and the x-axis is the number of epochs.

SELU with L1 loss:
Initial weights are drawn from a normal distribution with mean 0 and variance 1/(number of incoming weights), as recommended by SELU's creators.
[Figures: initialization and SSIM curve for SELU + L1 loss (12 layers, 196→48 filters, nin_filters=128, upsampling=32)]

PReLU with L1 loss
[Figure: SSIM curve for PReLU + L1 loss (same configuration)]

PReLU with MSE loss
[Figure: SSIM curve for PReLU + MSE loss (same configuration)]
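The initialisation noted under "SELU with L1 loss" above (zero-mean normal with variance 1/fan-in, i.e. LeCun normal), together with the Adam settings, could look roughly like this in PyTorch; `init_selu` is a hypothetical helper name:

```python
import math
import torch
import torch.nn as nn

def init_selu(module):
    """Zero-mean normal with variance 1/fan_in (LeCun normal), as recommended for SELU."""
    if isinstance(module, nn.Conv2d):
        fan_in = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(1.0 / fan_in))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = FeatureExtraction()  # the sketch shown earlier in this thread
model.apply(init_selu)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
```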

Some extra potential benefits of using SELU:
Batch norm, dropout, and alpha dropout may have a negative impact during training. I added alpha dropout with a drop probability of 0.02 and found the model had a much higher loss and lower SSIM. In addition, gradient clipping seems redundant.

When I set the reconstruction filters, or "nin_filters" in your model's terms, to 64, the model is unstable in the first 100 epochs. But as I increase that number, the instability fades away.
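For reference, the alpha-dropout experiment above amounts to inserting `nn.AlphaDropout` after the SELU activations; the exact placement shown here is an assumption:

```python
import torch.nn as nn

# Alpha dropout is the SELU-compatible dropout variant; p=0.02 as in the
# experiment above (which found a higher loss and lower SSIM with it enabled).
block = nn.Sequential(
    nn.Conv2d(196, 172, kernel_size=3, padding=1),
    nn.SELU(inplace=True),
    nn.AlphaDropout(p=0.02),
)
```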

@jiny2001
Owner

Hi yu45020,

This is great, thx!

I was interested in SELU but had never tried it, since I thought it would behave something like leaky ReLU.
I'll try SELU and report the results here.
I also think gradient clipping may be redundant, since the performance doesn't change when I don't use it. But I feel there may be cases where it makes training faster.

BTW, I'll try SELU with several parameter sets. Thank you so much for the suggestions!

@yu45020
Author

yu45020 commented May 31, 2018

I monitor every layer's gradient values, including the mean of absolute values, the mean, and the L2 norm, in every epoch. Only when I set the initial learning rate above 5e-3 do the gradients explode. Similar phenomena happen when I use SELU to train some other CNN architectures.

My experiments with your model are very limited. With 12 layers, ReLU, leaky ReLU, and PReLU seem to hit the vanishing-gradient problem earlier than SELU. But when I train the model with more than 40k image patches, the gradients still approach 0 quickly. The feature-extraction part may not be optimal to go deep (I am probably wrong); 8 layers seems to be a sweet spot for personal applications.
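For reference, a minimal sketch of this per-layer monitoring in PyTorch; the printed fields mirror the statistics dump later in the thread, but the function itself is my own illustration:

```python
import torch

@torch.no_grad()
def log_gradient_stats(model):
    """Call once per epoch, after loss.backward(), to print per-parameter gradient stats."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad
        print(f"{name}  Weight size {tuple(p.shape)}")
        print(f"Mean of gradient {g.mean().item():.5e}")
        print(f"Mean of absolute gradient {g.abs().mean().item():.5e}")
        print(f"Grad norm {g.norm(2).item():.5e}")
        print("=================")
```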

@jiny2001
Owner

jiny2001 commented Jun 2, 2018

Hi, thank you for the interesting info!

I tried SELU instead of PReLU, and here is the result for layers=8, filters=96, and dataset=yang91+bsd200. (You are right; I actually use that parameter set for testing the model's performance.)

SELU (blue) vs. PReLU (red); x: epoch, y: PSNR on the test dataset (Set5)
[Figure: PSNR vs. epoch, SELU vs. PReLU]

From the graph so far, SELU seems to have rather stable performance in the early training epochs.

I added an option for using SELU to my code. Since PReLU has extra parameters to train, it is reasonable that it converges more slowly. So I will check other parameters/conditions to see whether SELU can suppress the vanishing-gradient problem better than other activations. Thx!

@yu45020
Author

yu45020 commented Jun 3, 2018

Such a graph is exactly what I expect. I am not able to reproduce the instability caused by SELU in the 12-layer model, so I guess it was caused by my incorrect data augmentation.

I am reading the Inception-v4 paper, where the authors scale the residual connections (with over 1,000 filters) to speed up training. In your model's design, the reconstruction part takes in over 1,000 filters stacked from the feature-extraction part, so I am running experiments to check whether rescaling that part helps.

12-layer model: the model is trained with Adam first, and then at epoch 100 I switch to SGD with a 5e-2 learning rate, momentum, and Nesterov, which causes the first jump. At epoch 171 I rescale the concatenated features by 0.1. It doesn't hurt, but it doesn't help either.
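A sketch of what "rescale the concatenated features by 0.1" could look like at the junction between feature extraction and reconstruction; the 0.1 factor follows the Inception-v4 residual-scaling idea, and everything else (the names, the ~1281-channel concatenation) is an assumption about the reimplementation:

```python
import torch

def reconstruct(feature_maps, reconstruction_net, scale=0.1):
    """Concatenate all feature-extraction outputs (~1281 channels here) and
    damp them before the 1x1 reconstruction convolutions; scale=1.0 is the original behaviour."""
    stacked = torch.cat(feature_maps, dim=1)  # [N, 1281, H, W]
    return reconstruction_net(stacked * scale)
```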

[Figure: loss vs. epoch]

(The first 100 epochs are omitted.)
[Figure: SSIM vs. epoch]

Statistics for each layer's weights and gradients at epoch 170:

# I don't think the model has converged by the 170th epoch, since it performs worse than the 8-layer model after the same number of epochs.

++++++++++++++++    Feature Extraction  ++++++++++++++++
Mean weights 2.00643e-02
Weight size torch.Size([196, 3, 3, 3])
Mean of gradient -4.32021e-07
Mean of absolute gradient 7.84731e-05
Grad norm 1.94507e-02
=================
Mean weights 6.94653e-03
Weight size torch.Size([196])
Mean of gradient 1.02367e-05
Mean of absolute gradient 1.01716e-04
Grad norm 4.58856e-03
=================
Mean weights 1.68120e-02
Weight size torch.Size([172, 196, 3, 3])
Mean of gradient -6.75329e-08
Mean of absolute gradient 9.96305e-07
Grad norm 2.00953e-03
=================
Mean weights 6.76388e-03
Weight size torch.Size([172])
Mean of gradient -2.90675e-05
Mean of absolute gradient 1.02409e-04
Grad norm 3.19141e-03
=================
Mean weights 2.09343e-02
Weight size torch.Size([152, 172, 3, 3])
Mean of gradient 2.67625e-08
Mean of absolute gradient 1.38833e-06
Grad norm 1.60710e-03
=================
Mean weights 8.35950e-03
Weight size torch.Size([152])
Mean of gradient -3.47598e-05
Mean of absolute gradient 1.44659e-04
Grad norm 2.87709e-03
=================
Mean weights 2.88402e-02
Weight size torch.Size([134, 152, 3, 3])
Mean of gradient 2.30347e-08
Mean of absolute gradient 1.53296e-06
Grad norm 1.21241e-03
=================
Mean weights 1.08884e-02
Weight size torch.Size([134])
Mean of gradient 6.08737e-06
Mean of absolute gradient 4.01889e-05
Grad norm 7.26781e-04
=================
Mean weights 1.87330e-02
Weight size torch.Size([118, 134, 3, 3])
Mean of gradient -1.22249e-09
Mean of absolute gradient 1.77477e-06
Grad norm 1.17263e-03
=================
Mean weights 9.69623e-03
Weight size torch.Size([118])
Mean of gradient 4.94964e-06
Mean of absolute gradient 3.85554e-05
Grad norm 5.88135e-04
=================
Mean weights 2.07184e-02
Weight size torch.Size([103, 118, 3, 3])
Mean of gradient -3.03622e-08
Mean of absolute gradient 2.08999e-06
Grad norm 1.06449e-03
=================
Mean weights 7.29633e-03
Weight size torch.Size([103])
Mean of gradient 3.76601e-06
Mean of absolute gradient 3.24164e-05
Grad norm 4.99544e-04
=================
Mean weights 2.57879e-02
Weight size torch.Size([91, 103, 3, 3])
Mean of gradient 3.44454e-08
Mean of absolute gradient 2.08558e-06
Grad norm 8.80915e-04
=================
Mean weights 8.27794e-03
Weight size torch.Size([91])
Mean of gradient 4.34465e-07
Mean of absolute gradient 2.28164e-05
Grad norm 3.12153e-04
=================
Mean weights 2.58527e-02
Weight size torch.Size([80, 91, 3, 3])
Mean of gradient 2.88046e-08
Mean of absolute gradient 2.21727e-06
Grad norm 8.25830e-04
=================
Mean weights 8.25959e-03
Weight size torch.Size([80])
Mean of gradient 4.78102e-06
Mean of absolute gradient 2.72991e-05
Grad norm 3.57171e-04
=================
Mean weights 2.57654e-02
Weight size torch.Size([70, 80, 3, 3])
Mean of gradient 1.11181e-07
Mean of absolute gradient 2.29656e-06
Grad norm 7.88695e-04
=================
Mean weights 6.58082e-03
Weight size torch.Size([70])
Mean of gradient 5.85765e-06
Mean of absolute gradient 3.38972e-05
Grad norm 4.05154e-04
=================
Mean weights 2.62148e-02
Weight size torch.Size([62, 70, 3, 3])
Mean of gradient 2.89599e-08
Mean of absolute gradient 2.04895e-06
Grad norm 6.12328e-04
=================
Mean weights 6.81924e-03
Weight size torch.Size([62])
Mean of gradient 9.27251e-06
Mean of absolute gradient 4.54503e-05
Grad norm 5.32394e-04
=================
Mean weights 3.24337e-02
Weight size torch.Size([55, 62, 3, 3])
Mean of gradient -8.63035e-08
Mean of absolute gradient 1.47375e-06
Grad norm 3.90908e-04
=================
Mean weights 1.15499e-02
Weight size torch.Size([55])
Mean of gradient 1.13744e-05
Mean of absolute gradient 3.50520e-05
Grad norm 3.43914e-04
=================
Mean weights 4.34008e-02
Weight size torch.Size([48, 55, 3, 3])
Mean of gradient 2.54653e-08
Mean of absolute gradient 9.23437e-07
Grad norm 2.15546e-04
=================
Mean weights 1.46988e-02
Weight size torch.Size([48])
Mean of gradient 2.39356e-06
Mean of absolute gradient 1.67525e-05
Grad norm 1.51227e-04
=================

++++++++++++++++		Reconstruction		++++++++++++++++

----------------	Part A	-------------------
Mean weights 1.90763e-02
Weight size torch.Size([128, 1281, 1, 1])
Mean of gradient 3.62798e-09
Mean of absolute gradient 5.81404e-07
Grad norm 4.13150e-04
=================
Mean weights 3.71320e-03
Weight size torch.Size([128])
Mean of gradient 1.18073e-05
Mean of absolute gradient 1.80647e-04
Grad norm 2.87256e-03
=================

----------------	Part B	-------------------

Mean weights 2.59529e-02
Weight size torch.Size([64, 1281, 1, 1])
Mean of gradient -1.99345e-08
Mean of absolute gradient 1.55461e-06
Grad norm 7.71659e-04
=================
Mean weights 6.55973e-03
Weight size torch.Size([64])
Mean of gradient -2.91245e-05
Mean of absolute gradient 2.88596e-04
Grad norm 2.91292e-03
=================
Mean weights 2.20177e-02
Weight size torch.Size([128, 64, 3, 3])
Mean of gradient 2.86761e-08
Mean of absolute gradient 1.54201e-06
Grad norm 6.14671e-04
=================
Mean weights 2.88305e-03
Weight size torch.Size([128])
Mean of gradient 9.47305e-06
Mean of absolute gradient 7.90976e-05
Grad norm 1.28537e-03
=================

++++++++++++++++		Upsampling		++++++++++++++++
Mean weights 2.94045e-02
Weight size torch.Size([128, 256, 3, 3])
Mean of gradient 2.16205e-08
Mean of absolute gradient 8.94768e-07
Grad norm 8.28572e-04
=================
Mean weights 7.13767e-03
Weight size torch.Size([128])
Mean of gradient 1.16159e-05
Mean of absolute gradient 1.90500e-04
Grad norm 3.82092e-03
=================
Mean weights 1.41379e-02
Weight size torch.Size([3, 32, 3, 3])
Mean of gradient -5.13495e-06
Mean of absolute gradient 9.34122e-05
Grad norm 3.54389e-03
=================

@jiny2001
Owner

jiny2001 commented Jun 4, 2018

Hi, I had never thought about switching the optimizer partway through training. This is very interesting.

For clarification, I have a question: does the L1 loss in the graph show only the L1 loss, or does it contain MSE loss + scaled L1 loss?

Thank you for the data. I will add the gradient distribution and norm to TensorBoard and try to watch them.

BTW, I tried 12 layers and 196 filters with SELU; 1 epoch = 120,000 images (the dataset is DIV2K).
The performance is almost the same as shown below, but I feel like SELU will converge faster. I'll try a bigger learning rate with a smaller dataset.

=== [set5] MSE:12.887093, PSNR:37.774630 ===
=== [set14] MSE:46.922695, PSNR:33.253676 ===
=== [bsd100] MSE:61.218412, PSNR:32.000580 ===

The loss transition is here.
loss = MSE + 0.0001 * l2_norm_of_CNN_weights
loss_l2 = 0.0001 * l2_norm_of_CNN_weights
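In PyTorch terms (the repository itself is TensorFlow, so this is only a sketch of the same composition, and whether the penalty uses the squared norm is an assumption):

```python
import torch

def training_loss(output, target, model, decay=1e-4):
    """loss = MSE + 0.0001 * (sum of squared CNN weights); biases are excluded."""
    mse = torch.mean((output - target) ** 2)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.dim() > 1)
    loss_l2 = decay * l2           # the "loss_l2" curve
    return mse + loss_l2, loss_l2  # the "loss" curve is the sum
```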

[Figure: loss transition]

And the performance (PSNR) is here.

[Figure: PSNR vs. epoch]

@yu45020
Author

yu45020 commented Jun 4, 2018

The model is trained with L1 loss only, without dropout or weight decay. Adding weight decay almost doubles the training time and kills my GPU budget. I tested MSE on the 8-layer version with a small dataset; both PSNR and SSIM look similar to the L1-loss results. But this paper shows L1 loss is better. In addition, PyTorch's implementation of L1 loss is a few seconds faster than MSE.

I usually train a model with Adam (with AMSGrad) and monitor the loss plot. Once the loss curve flattens, I make a checkpoint and switch to SGD or averaged SGD. Then the loss usually keeps going down for a while.

The paper "On the Convergence of Adam and Beyond" shows Adam might have a hard time converging, but AMSGrad doesn't always save the day.
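A sketch of that Adam-then-SGD schedule in PyTorch; the plateau detection is left out, and the momentum value is an assumption (the thread only says "momentum and Nesterov"):

```python
import torch

def make_optimizer(model, phase):
    """Phase 1: Adam with AMSGrad; phase 2 (after the loss flattens): SGD (or torch.optim.ASGD)
    restarted from a checkpoint, with the 5e-2 Nesterov setup mentioned earlier in the thread."""
    if phase == "adam":
        return torch.optim.Adam(model.parameters(), lr=5e-4,
                                betas=(0.9, 0.999), amsgrad=True)
    return torch.optim.SGD(model.parameters(), lr=5e-2,
                           momentum=0.9, nesterov=True)
```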

@jiny2001
Owner

jiny2001 commented Jun 5, 2018

Thank you for your explanation; now I get it. I have heard L1 loss can be better but had never tried it.

For now, I added two features to my repository: one for logging each layer's mean/max gradient, and the other for using L1 loss instead of L2 (MSE).

I'll check the performance with L1 loss and report back. Thx!

@jiny2001
Owner

Hi, sorry for the late reply.

The paper you provided is really interesting.
I'm testing switching the loss function during training. So far, I have had no luck; I think it's because the performance I get is already reaching my model's potential. I also tried training with L1 loss, but the result was mostly the same as L2 (MSE).

So I believe I must improve my model first. Will try digging deeper.

BTW, I'm now testing SELU/PReLU with and without gradient clipping. Here is the result.
Both use layers=8, features=96, and the Adam optimizer. C=NA means no gradient clipping; C=5 means gradients are clipped at 5.0.

[Figure: SELU vs. PReLU, with and without gradient clipping]

It is interesting that:

  • SELU learns faster without gradient clipping
  • gradient clipping makes convergence much faster in this case

I also observed that when I train without gradient clipping, the gradients of the earlier layers' weights and biases sometimes go to 0. But as the epochs go on and the learning rate gets lower, the gradients suddenly recover. The graph below shows the first CNN layer's gradients (weight and bias) with mean/std_dev.

[Figure: first CNN layer's gradient mean/std_dev, no clipping]

I believe it happens because I'm not using the proper learning rate, initial variables, or optimizer settings. It looks like the earlier CNN layers begin learning only after the pixel shuffler and the A/B layers have learned properly. However, those issues can be mitigated by gradient clipping.
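In PyTorch terms, the C=5 runs above amount to something like the following inside the training loop (a sketch, not the repository's TensorFlow code):

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, clip_value=5.0):
    """clip_value=5.0 corresponds to "C=5" above; pass None for the "C=NA" runs."""
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        if clip_value is not None:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
        optimizer.step()
```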

Though there is still a lot to think about and test, I'm going to close this issue for now. I will play more with SELU and L1 loss since they make training and calculation faster. Thx!

@yu45020
Author

yu45020 commented Jun 11, 2018

Excellent! Thank you very much for the results.

For the loss-switching issue, I got similar results in other architectures: both L1 and MSE produce similar results after the models converge.

As for gradient clipping, I guess the value can be far smaller, say 0.5 for the first few epochs. Then I can set the initial learning rate as large as 5e-3, but with any rate larger than that, the model fails to converge.

Thank you again for the interesting plots.

@jiny2001
Owner

Yeah, the maximum clipping value can be far smaller, since the actual gradients are much smaller than that.

But so far in my experiments, I believe 1.0-5.0 works well in the very early stages of training. I have tried far smaller and far larger clipping values, but the result was mostly the same. I will try decaying the clipping value as you suggest.

It is very confusing that if I change the balance of initial weights / learning rate / optimizer / input normalization or other settings, the performance easily degrades. Let's enjoy! :)
