Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates #66

howardyclo commented Jan 18, 2020

Metadata

Although this paper hasn't been accepted at an ML conference, its usefulness is quite under-estimated by the ML community. The technique it presents has been adopted by fast.ai. More importantly, the paper provides insights related to understanding generalization in deep learning. Overall, I think this paper is very well-written and of high quality.

howardyclo commented Jan 18, 2020

Preliminary


See "Cyclical Learning Rates for Training Neural Networks" by Smith (WACV 2017).

TL;DR

  • Present a phenomenon called "super-convergence": DNNs can be trained an order of magnitude faster, often with a performance boost. E.g., on Cifar-10, a resnet-56 reaches 92.4% accuracy in 10k iterations vs. 91.2% accuracy in 80k iterations with standard training.
  • The method is inspired by the regularization effect of a large learning rate (LR); however, other forms of regularization must then be reduced. Why?
  • Present a method (the LR range test) that can estimate the optimal LR.
  • Show that adaptive LR methods do not lead to "super-convergence" by themselves. However, with the "one-cycle policy", adaptive methods (except Adam) can lead to super-convergence.

fast.ai shows that Adam (with correctly implemented weight decay, called "AdamW") can actually lead to "super-convergence" when its hyper-parameters are well-tuned (see https://www.fast.ai/2018/07/02/adam-weight-decay/).
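As a concrete illustration (not the fast.ai code), here is a minimal sketch of combining AdamW with a one-cycle schedule in plain PyTorch; `model`, `train_loader`, and `num_epochs` are assumed to be defined elsewhere, and the LR/weight-decay values are placeholders.

```python
# Minimal sketch (not the fast.ai implementation): AdamW combined with a
# one-cycle LR schedule in plain PyTorch. `model`, `train_loader`, and
# `num_epochs` are assumed to be defined; the LR values are placeholders.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,                       # peak LR, e.g. found via the LR range test
    epochs=num_epochs,
    steps_per_epoch=len(train_loader),
)

for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()               # one-cycle: update the LR every batch
```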

  • Test resnet, wide-resnet, densenet and inception architectures on Cifar-10/100, MNIST & ImageNet.

LR Range Test

  • Train the DNN starting from zero or a very small LR that is slowly increased linearly.
  • Pick as the maximum LR the value at which test accuracy shows a distinct peak.
  • Pick a minimum LR that is 10^3 or 10^4 times smaller than the maximum LR.
  • The optimal base learning rate for ordinary training usually falls in this range (see the sketch below).
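A minimal sketch of the LR range test, assuming `model` and `train_loader` exist; the paper monitors test accuracy, while this simplified version records training loss as a proxy, and the bounds/step count are placeholder values.

```python
# Sketch of the LR range test, assuming `model` and `train_loader` exist.
# The paper monitors test accuracy; here training loss is recorded instead
# as a simpler proxy, and the bounds/step count are placeholder values.
import copy
import torch
import torch.nn.functional as F

def lr_range_test(model, train_loader, min_lr=1e-5, max_lr=10.0, num_steps=1000):
    model = copy.deepcopy(model)       # throwaway copy; keep the real model intact
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr, momentum=0.9)
    lrs, losses, step = [], [], 0
    while step < num_steps:
        for x, y in train_loader:
            lr = min_lr + (max_lr - min_lr) * step / num_steps   # linear ramp
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            lrs.append(lr)
            losses.append(loss.item())
            step += 1
            if step >= num_steps:
                break
    return lrs, losses                 # plot losses (or accuracy) vs. lrs
```

Plotting the recorded losses (or accuracies) against the learning rates gives the curve from which the maximum and minimum LR are read off.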

Unusual Behavior: Achieving high accuracy using unusually high learning rates

  • The figure in the paper shows an unusual behavior: DNNs can maintain consistently high accuracy over this unusual range of large learning rates.

Q: Maybe a warmup is needed to make this behavior possible?

One-cycle Policy

From the paper: "Here we suggest a slight modification of cyclical learning rate policy for super-convergence; always use one cycle that is smaller than the total number of iterations/epochs and allow the learning rate to decrease several orders of magnitude less than the initial learning rate for the remaining iterations. We named this learning rate policy '1cycle' and in our experiments this policy allows an improvement in the accuracy."
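An illustrative sketch of such a "1cycle" schedule (my reading of the policy, not the authors' code): one triangular cycle over most of training, followed by a final phase that anneals the LR several orders of magnitude below the initial LR. All values and fractions are placeholders.

```python
# Illustrative "1cycle" schedule (a reading of the policy, not the authors'
# code): a triangular cycle over the first ~90% of training, then a final
# phase annealing the LR well below the initial LR. Values are placeholders.
def one_cycle_lr(step, total_steps, base_lr=0.1, max_lr=1.0,
                 cycle_frac=0.9, final_lr=1e-4):
    cycle_steps = int(total_steps * cycle_frac)
    half = max(1, cycle_steps // 2)
    if step < half:                                    # linear ramp up
        return base_lr + (max_lr - base_lr) * step / half
    elif step < cycle_steps:                           # linear ramp back down
        return max_lr - (max_lr - base_lr) * (step - half) / max(1, cycle_steps - half)
    else:                                              # final annealing phase
        frac = (step - cycle_steps) / max(1, total_steps - cycle_steps)
        return base_lr + (final_lr - base_lr) * frac

# Example: lrs = [one_cycle_lr(s, total_steps=10_000) for s in range(10_000)]
```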

Principle of balancing different forms of regularization

  • Besides the learning rate, there are other forms of regularization such as a small batch size, weight decay and dropout. The paper empirically shows that balancing these forms of regularization is important to get super-convergence (illustrative sketch below).
  • Reducing other forms of regularization and regularizing with a very large learning rate makes training significantly more efficient.
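To make the balancing principle concrete, here is a purely illustrative pair of hyper-parameter settings; the numbers are hypothetical and only meant to show the direction of the trade-off (large LR up, other regularizers down), not values reported in the paper.

```python
# Illustrative only: the kind of trade-off the paper argues for. The specific
# numbers are hypothetical placeholders, not values reported in the paper.
standard_training = dict(max_lr=0.1, weight_decay=1e-3, dropout=0.5, batch_size=128)
super_convergence = dict(
    max_lr=3.0,          # very large peak LR acts as the main regularizer
    weight_decay=1e-4,   # so other forms of regularization are dialed down
    dropout=0.0,
    batch_size=1024,     # a larger batch size helps tolerate the large LR
)
```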

But... what if we want to preserve the regularization benefit of weight decay?

The amount of performance boost from super-convergence (the setting on the left of each ">" benefits more)

  • Small labeled data > Large labeled data
  • Shallow NNs > Deep NNs
  • Large batch size > small batch size. Note: this may be because a large batch size makes it possible to use a large LR.
  • With batch norm > Without batch norm

Question by myself: Isn't a small batch size (i.e., more gradient noise) better for generalization? That is the same intuition as using a large learning rate, yet it conflicts with the paper's results.
