Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates #66

howardyclo commented Jan 18, 2020

Metadata

Although this paper hasn't been accepted at an ML conference, its usefulness is quite under-estimated by the ML community. The technique it presents has been adopted by fast.ai. More importantly, the paper provides insights related to understanding generalization in deep learning. Overall, I think this paper is very well-written and of high quality.

howardyclo commented Jan 18, 2020

Preliminary


See "Cyclical Learning Rates for Training Neural Networks" by Smith (WACV 2017).

TL;DR

  • Present a phenomenon called "super-convergence": DNNs can be trained an order of magnitude faster, often with a performance boost. E.g., on Cifar-10, a resnet-56 reaches 92.4% accuracy in 10k iterations vs. 91.2% accuracy in 80k iterations with standard training.
  • The method is inspired by the regularization effect of a large learning rate (LR); however, other forms of regularization must then be reduced. Why?
  • Present a method (the LR range test) that can estimate the optimal LR.
  • Show that adaptive LR methods do not lead to "super-convergence" by themselves. However, with the "one-cycle policy", adaptive methods (except Adam) can lead to super-convergence.

fast.ai shows that Adam (with correctly implemented weight decay, called "AdamW") can actually lead to "super-convergence" when its hyper-parameters are well-tuned (see https://www.fast.ai/2018/07/02/adam-weight-decay/).
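As a concrete illustration (not the fast.ai code), here is a minimal sketch of combining AdamW with a one-cycle schedule in plain PyTorch; `model`, `train_loader`, and `num_epochs` are assumed to be defined elsewhere, and the LR/weight-decay values are placeholders.

```python
# Minimal sketch (not the fast.ai implementation): AdamW combined with a
# one-cycle LR schedule in plain PyTorch. `model`, `train_loader`, and
# `num_epochs` are assumed to be defined; the LR values are placeholders.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,                       # peak LR, e.g. found via the LR range test
    epochs=num_epochs,
    steps_per_epoch=len(train_loader),
)

for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()               # one-cycle: update the LR every batch
```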

  • Test resnet, wide-resnet, densenet and inception architectures on Cifar-10/100, MNIST & ImageNet.

LR Range Test

  • Train the DNN starting from zero or a very small LR that is slowly increased linearly.
  • Pick as the maximum LR the value at which test accuracy shows a distinct peak.
  • Pick a minimum LR that is 10^3 or 10^4 times smaller than the maximum LR.
  • The optimal base learning rate for ordinary training usually falls in this range (see the sketch below).
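A minimal sketch of the LR range test, assuming `model` and `train_loader` exist; the paper monitors test accuracy, while this simplified version records training loss as a proxy, and the bounds/step count are placeholder values.

```python
# Sketch of the LR range test, assuming `model` and `train_loader` exist.
# The paper monitors test accuracy; here training loss is recorded instead
# as a simpler proxy, and the bounds/step count are placeholder values.
import copy
import torch
import torch.nn.functional as F

def lr_range_test(model, train_loader, min_lr=1e-5, max_lr=10.0, num_steps=1000):
    model = copy.deepcopy(model)       # throwaway copy; keep the real model intact
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr, momentum=0.9)
    lrs, losses, step = [], [], 0
    while step < num_steps:
        for x, y in train_loader:
            lr = min_lr + (max_lr - min_lr) * step / num_steps   # linear ramp
            for group in optimizer.param_groups:
                group["lr"] = lr
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            lrs.append(lr)
            losses.append(loss.item())
            step += 1
            if step >= num_steps:
                break
    return lrs, losses                 # plot losses (or accuracy) vs. lrs
```

Plotting the recorded losses (or accuracies) against the learning rates gives the curve from which the maximum and minimum LR are read off.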

Unusual Behavior: Achieving high accuracy using unusually high learning rates

  • The figure in the paper shows an unusual behavior: DNNs can maintain consistently high accuracy over this unusual range of large learning rates.

Q: Maybe a warmup is needed to make this behavior possible?

One-cycle Policy

From the paper: "Here we suggest a slight modification of cyclical learning rate policy for super-convergence; always use one cycle that is smaller than the total number of iterations/epochs and allow the learning rate to decrease several orders of magnitude less than the initial learning rate for the remaining iterations. We named this learning rate policy '1cycle' and in our experiments this policy allows an improvement in the accuracy."
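An illustrative sketch of such a "1cycle" schedule (my reading of the policy, not the authors' code): one triangular cycle over most of training, followed by a final phase that anneals the LR several orders of magnitude below the initial LR. All values and fractions are placeholders.

```python
# Illustrative "1cycle" schedule (a reading of the policy, not the authors'
# code): a triangular cycle over the first ~90% of training, then a final
# phase annealing the LR well below the initial LR. Values are placeholders.
def one_cycle_lr(step, total_steps, base_lr=0.1, max_lr=1.0,
                 cycle_frac=0.9, final_lr=1e-4):
    cycle_steps = int(total_steps * cycle_frac)
    half = max(1, cycle_steps // 2)
    if step < half:                                    # linear ramp up
        return base_lr + (max_lr - base_lr) * step / half
    elif step < cycle_steps:                           # linear ramp back down
        return max_lr - (max_lr - base_lr) * (step - half) / max(1, cycle_steps - half)
    else:                                              # final annealing phase
        frac = (step - cycle_steps) / max(1, total_steps - cycle_steps)
        return base_lr + (final_lr - base_lr) * frac

# Example: lrs = [one_cycle_lr(s, total_steps=10_000) for s in range(10_000)]
```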

Principle of balancing different forms of regularization

  • Besides the learning rate, there are other forms of regularization such as a small batch size, weight decay and dropout. The paper empirically shows that balancing these forms of regularization is important to get super-convergence (illustrative sketch below).
  • Reducing other forms of regularization and regularizing with a very large learning rate makes training significantly more efficient.
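To make the balancing principle concrete, here is a purely illustrative pair of hyper-parameter settings; the numbers are hypothetical and only meant to show the direction of the trade-off (large LR up, other regularizers down), not values reported in the paper.

```python
# Illustrative only: the kind of trade-off the paper argues for. The specific
# numbers are hypothetical placeholders, not values reported in the paper.
standard_training = dict(max_lr=0.1, weight_decay=1e-3, dropout=0.5, batch_size=128)
super_convergence = dict(
    max_lr=3.0,          # very large peak LR acts as the main regularizer
    weight_decay=1e-4,   # so other forms of regularization are dialed down
    dropout=0.0,
    batch_size=1024,     # a larger batch size helps tolerate the large LR
)
```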

But... what if we want to preserve the regularization benefit of weight decay?

The amount of performance boost from super-convergence (the setting on the left of each ">" benefits more)

  • Small labeled data > Large labeled data
  • Shallow NNs > Deep NNs
  • Large batch size > small batch size. Note: this may be because a large batch size makes it possible to use a large LR.
  • With batch norm > Without batch norm

Question by myself: Isn't a small batch size (i.e., more gradient noise) better for generalization? That is the same intuition as using a large learning rate, yet it conflicts with the paper's results.
