SELU seems a good drop-in replacement for PReLU #29
Comments
Hi yu45020, this is great, thx! I was interested in SELU but had never tried it, since I thought it would behave something like leaky ReLU. BTW, I'll try SELU with some parameters. Thank you so much for the suggestions!
I monitor every layer's gradient statistics, including the mean absolute value, mean, and L2 norm, in every epoch. The gradient only explodes when I set the initial learning rate above 5e-3. Similar phenomena happen when I use SELU to train some other CNN architectures. My experiments with your model are very limited. At 12 layers, ReLU, leaky ReLU, and PReLU seem to hit the vanishing-gradient problem earlier than SELU. But when I train the model with more than 40k image patches, the gradient still approaches 0 quickly. The feature extraction part may not be optimal for going deep (I am probably wrong); 8 layers seems to be a sweet spot for personal applications.
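The per-layer monitoring described above can be sketched in PyTorch roughly as follows; the model here is a toy placeholder, and the function name is my own, not from either repository.

```python
# Hypothetical sketch of per-layer gradient monitoring: mean |grad|,
# mean grad, and L2 norm for every parameter, collected after backward().
import torch
import torch.nn as nn

def gradient_stats(model: nn.Module):
    """Return {param_name: {"mean_abs", "mean", "l2"}} for all grads."""
    stats = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        stats[name] = {
            "mean_abs": g.abs().mean().item(),
            "mean": g.mean().item(),
            "l2": g.norm(2).item(),
        }
    return stats

# Usage: call once per epoch after loss.backward().
model = nn.Sequential(nn.Linear(4, 4), nn.SELU(), nn.Linear(4, 1))
model(torch.randn(8, 4)).sum().backward()
for layer, s in gradient_stats(model).items():
    print(layer, s["l2"])
```

Logging these values to TensorBoard each epoch makes vanishing or exploding gradients visible as soon as they start.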
Hi, I have never thought about changing the activation partway through training. This is very interesting. For clarification, I have a question: does the L1 loss in the graph show only the L1-norm loss, or does it contain MSE loss + scaled L1 loss? Thank you for the data. I will add gradient distributions and norms to TensorBoard and try to watch them. BTW, I tried 12 layers with 196 filters with SELU. 1 epoch = 120,000 images (the dataset is DIV2K). === [set5] MSE:12.887093, PSNR:37.774630 === The loss transition is here, and the performance (PSNR) is here.
The model is trained with L1 loss only, without dropout or weight decay. Adding weight decay almost doubles the training time and kills my GPU budget. I tested MSE on the 8-layer version with a small dataset; both PSNR and SSIM look similar to the L1-loss results, but this paper shows L1 loss is better. In addition, the PyTorch implementation of L1 loss is a few seconds faster than MSE. I usually train a model with Adam (with amsgrad) and monitor the loss plot. Once the loss curve becomes flat, I make a checkpoint and switch to SGD or averaged SGD; then the loss usually keeps going down for a while. The paper "On the Convergence of Adam and Beyond" shows Adam might have a hard time converging, but AMSGrad doesn't always save the day.
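The schedule described above — Adam with amsgrad until the loss flattens, then a checkpoint and a switch to SGD — might look like this. The model, data, and plateau test are simplified placeholders of my own, not the author's code.

```python
# Minimal sketch: train with Adam (amsgrad=True) and L1 loss, then switch
# to SGD once the loss curve becomes flat. Plateau detection is a naive
# assumption here; in practice the author eyeballs the loss plot.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, amsgrad=True)

def plateaued(history, window=5, tol=1e-4):
    """True if the loss improved less than `tol` over the last `window` epochs."""
    return len(history) >= window and history[-window] - history[-1] < tol

history, checkpoint = [], None
for epoch in range(20):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    history.append(loss.item())
    if plateaued(history) and isinstance(optimizer, torch.optim.Adam):
        # Checkpoint in memory first, then hand the parameters to SGD.
        checkpoint = {k: v.clone() for k, v in model.state_dict().items()}
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

`torch.optim.ASGD` could be substituted for plain SGD to get the averaged-SGD variant mentioned above.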
Thank you for your explanation; now I get it. I have heard L1 loss can be better but never tried it. For now, I added 2 functions to my repository: one for logging each layer's mean/max gradient, and the other for using L1 loss instead of L2 (MSE). I'll check the performance with L1 loss and report back. Thx!
Hi, sorry for the late reply. The paper you provided is really interesting, so I believe I must improve my model first. Will try digging deeper. BTW, I'm now testing SELU/PReLU with and without gradient clipping. Here is the result, and it is interesting.
I also observed that when I train without gradient clipping, the earlier layers' weight and bias gradients sometimes go to 0. But as the epochs go on and the learning rate gets lower, the gradients suddenly recover. The graph below shows the first CNN layer's gradients (weight and bias) with mean/std dev. I believe it happens because I'm not using the proper learning rate, initial weights, or optimizer settings. It looks like the earlier CNN layers begin learning only after the Pixel Shuffler and A/B layers have learned properly. However, those issues can be mitigated by gradient clipping. Though there is still a lot to think about and test here, I'm going to close this issue for now. Will play more with SELU and L1 loss, since they will make training and calculation faster. Thx!
Excellent! Thank you very much for the results. On the loss-switching issue, I got similar results in other architectures: both L1 and MSE produce similar results after the models converge. For the gradient clipping, I guess the value can be far smaller, say 0.5 for the first few epochs. Then I can set the initial learning rate as large as 5e-3, but with any rate larger than that the model fails to converge. Thank you again for the interesting plots.
Yeah, the maximum clipping value can be far smaller, since the actual gradients are much smaller than that. But so far in my experiments, I find 1.0-5.0 works well in the very early stage of training. I tried far smaller and far larger clipping values, but the results were mostly the same. Will try decaying the clipping value as you suggest. It is very confusing that if I change the balance of initial weights / learning rate / optimizer / input normalization or anything else, the performance easily degrades. Let's enjoy! :)
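The decaying-clip idea discussed above could be sketched like this in PyTorch. The linear schedule and its endpoints (starting at the 5.0 that worked early on, decaying toward the 0.5 suggested above) are my own assumptions.

```python
# Sketch: gradient clipping whose maximum norm decays over the first
# epochs. The schedule shape (linear, 5.0 -> 0.5 over 10 epochs) is a
# placeholder assumption, not a tuned value from the thread.
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)

def clip_value(epoch, start=5.0, end=0.5, decay_epochs=10):
    """Linearly decay the clip threshold, then hold it at `end`."""
    if epoch >= decay_epochs:
        return end
    return start + (end - start) * epoch / decay_epochs

for epoch in range(3):
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(x), y)
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value(epoch))
    optimizer.step()
```

`clip_grad_norm_` rescales all gradients together when their combined norm exceeds the threshold, which is why one decaying scalar is enough.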
I implemented your model in PyTorch (with some changes) and ran some experiments on self-collected images. I have very limited GPU resources, so I am not able to test SELU thoroughly in the model. The results might be an illusion, but they seem worth more testing.
Model specification:
No dropout, no weight decay, no ensemble.
Dataset:
High-resolution PNG images are cropped into 96x96 patches (non-overlapping) and downsampled with bicubic interpolation. They are then converted into JPEG at 50% quality as input data. In the final reconstruction part, input images are upsampled bilinearly.
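The degradation pipeline above (non-overlapping 96x96 crops, bicubic downsampling, JPEG re-encoding at quality 50) can be sketched with Pillow. The 2x scale factor and function names are assumptions for illustration.

```python
# Rough sketch of the dataset preparation: crop HR patches, then produce
# degraded LR inputs via bicubic downsampling + JPEG quality-50 round-trip.
import io
from PIL import Image

def make_patches(img, size=96):
    """Yield non-overlapping size x size crops of a PIL image."""
    w, h = img.size
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            yield img.crop((left, top, left + size, top + size))

def degrade(patch, scale=2, quality=50):
    """Bicubic downsample, then re-encode through JPEG at `quality`."""
    small = patch.resize((patch.width // scale, patch.height // scale),
                         Image.BICUBIC)
    buf = io.BytesIO()
    small.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

The JPEG round-trip is done in memory here; writing the compressed patches to disk once and reusing them would avoid re-encoding every epoch.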
optimizer:
Adam, with an initial learning rate of 5e-4 and betas of (0.9, 0.999).
Here are the results:
The y axis is the SSIM score, and the x axis is the number of epochs.
SELU with L1 loss:
Initial weights are drawn from a normal distribution with mean 0 and variance 1 / (number of input weights), as recommended by SELU's creators.
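That initialization (zero mean, variance 1/fan_in, i.e. std = fan_in^-0.5) can be applied in PyTorch roughly as below; the helper name and the toy model are my own, not from the repositories discussed here.

```python
# Sketch of the SELU-recommended init: N(0, 1/fan_in) on conv/linear
# weights, zero biases. fan_in is the number of inputs per output unit.
import torch
import torch.nn as nn

def selu_init(module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        # Conv2d weight: (out, in, kH, kW); Linear weight: (out, in).
        # In both cases weight[0].numel() is the fan-in.
        fan_in = module.weight[0].numel()
        nn.init.normal_(module.weight, mean=0.0, std=fan_in ** -0.5)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SELU())
model.apply(selu_init)
```

This matches the self-normalizing property SELU depends on: with unit-variance inputs and these weights, activations stay near zero mean and unit variance through the layers.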
PReLU with L1 loss
PReLU with MSE loss
Some extra potential benefits of using SELU:
Batch norm, dropout, and alpha dropout may have a negative impact during training. I added alpha dropout with a drop probability of 0.02 and found the model had a much higher loss and lower SSIM. In addition, gradient clipping seems redundant.
When I set the reconstruction filters ("nin_filters" in your model's terms) to 64, the model is unstable in the first 100 epochs. But as I increase the number, the instability fades away.
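For reference, the alpha-dropout variant tested above (p = 0.02) would be inserted after a SELU activation like this; the surrounding layers are placeholders, and per the observation above this configuration actually hurt both loss and SSIM in these experiments.

```python
# AlphaDropout is the SELU-compatible dropout: it preserves the
# self-normalizing (zero-mean, unit-variance) property that plain
# dropout would break.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.SELU(),
    nn.AlphaDropout(p=0.02),  # drop probability from the experiment above
)
out = block(torch.randn(1, 64, 16, 16))
```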