Sorry, I didn't notice it: the flat + cosine anneal training curve. So does that mean keep the LR unchanged for ~72% of the steps, and after that anneal the LR with a cosine function, right?
Hi @avostryakov,
Correct, we found a cosine anneal after 72% or so works best with Ranger.
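For reference, a minimal sketch of such a "flat then cosine anneal" schedule (the function name, the 0.72 default flat fraction, and annealing all the way to zero are my assumptions for illustration, not an official part of the Ranger repo):

```python
import math

def flat_cosine_lr(step, total_steps, base_lr, flat_frac=0.72):
    """Hold base_lr for flat_frac of training, then cosine-anneal to 0."""
    flat_steps = int(total_steps * flat_frac)
    if step < flat_steps:
        return base_lr
    # fraction of the annealing phase completed, in [0, 1]
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch you could wire a function like this into training via `torch.optim.lr_scheduler.LambdaLR` (passing a multiplier of `base_lr`), or just set `param_group["lr"]` manually each step.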
There is no need for warmup with Ranger - it uses the RAdam rectifier to manage the variance of the adaptive learning rate automatically.
Note, as a bit of a preview, I'm currently testing a new calibrated adaptive LR for Ranger, so there may be an updated version in a few days.
Hope this info helps!
For AdamW, people usually add some sort of learning rate decay: linear, cosine, triangular, etc. Warmup steps are also popular.
Do we need any of these with Ranger, or can we just use a fixed learning rate?