Potentially redundant learning rate scheduling #195

Closed · nikitakit opened this issue Jan 14, 2019 · 12 comments

@nikitakit commented Jan 14, 2019

In the two code snippets below:

https://github.com/huggingface/pytorch-pretrained-BERT/blob/647c98353090ee411e1ef9016b2a458becfe36f9/examples/run_lm_finetuning.py#L570-L573

https://github.com/huggingface/pytorch-pretrained-BERT/blob/647c98353090ee411e1ef9016b2a458becfe36f9/examples/run_lm_finetuning.py#L611-L613

it appears that learning rate warmup is being done twice: once in the example file and once inside the BertAdam class. Am I reading this wrong? I'm fairly sure the BertAdam class performs its own warm-up when initialized with those arguments.

Here is an excerpt from the BertAdam class, where warm-up is also applied:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/647c98353090ee411e1ef9016b2a458becfe36f9/pytorch_pretrained_bert/optimization.py#L146-L150

This also applies to other examples, e.g.
https://github.com/huggingface/pytorch-pretrained-BERT/blob/647c98353090ee411e1ef9016b2a458becfe36f9/examples/run_squad.py#L848-L851
https://github.com/huggingface/pytorch-pretrained-BERT/blob/647c98353090ee411e1ef9016b2a458becfe36f9/examples/run_squad.py#L909-L911
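
For concreteness, here is a minimal sketch of how the two applications would compound, assuming warmup_linear is the simple x/warmup ramp followed by a linear decay, as in optimization.py (the hyperparameters below are illustrative, not the example defaults):

```python
# Sketch of the compounding effect: the example script scales the base LR by
# warmup_linear, and BertAdam then scales the stored LR by the same schedule again.
def warmup_linear(x, warmup=0.002):
    if x < warmup:
        return x / warmup
    return 1.0 - x

base_lr = 3e-5   # illustrative values
warmup = 0.1
t_total = 10000

for step in (100, 500, 1000, 5000):
    progress = step / t_total
    lr_outer = base_lr * warmup_linear(progress, warmup)       # written into optimizer.param_groups by the script
    lr_effective = lr_outer * warmup_linear(progress, warmup)  # scaled again inside BertAdam.step()
    print(f"step {step}: script sets {lr_outer:.2e}, effective LR is {lr_effective:.2e}")
```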

@thomwolf (Member)

Hmm, could be the case indeed. What do you think about this, @tholor?

@nikitakit (Author)

As far as I can tell this was introduced in c8ea286 as a byproduct of adding float16 support, and was then copied to other example files as well.

@tholor (Contributor) commented Jan 17, 2019

I agree, there seems to be double LR scheduling, so the applied LR is lower than intended. A quick plot of the LR set in the outer scope (i.e. in run_squad or run_lm_finetuning) vs. the inner one (in BertAdam) shows this:

[Image: lr_schedule_debug — plot comparing the LR set in the outer scope with the LR applied inside BertAdam]
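
For reference, curves like these can also be reproduced analytically. The following is only a sketch with illustrative hyperparameters, assuming warmup_linear is the x/warmup ramp followed by a linear decay from optimization.py:

```python
# Sketch reproducing the outer vs. inner LR curves analytically.
import matplotlib.pyplot as plt

def warmup_linear(x, warmup=0.002):
    if x < warmup:
        return x / warmup
    return 1.0 - x

base_lr, warmup, t_total = 3e-5, 0.1, 10000
steps = range(t_total)
outer = [base_lr * warmup_linear(s / t_total, warmup) for s in steps]             # set in the example script
inner = [lr * warmup_linear(s / t_total, warmup) for s, lr in zip(steps, outer)]  # applied again in BertAdam

plt.plot(steps, outer, label="LR set in the example script")
plt.plot(steps, inner, label="effective LR inside BertAdam")
plt.xlabel("training step")
plt.ylabel("learning rate")
plt.legend()
plt.show()
```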

In addition, I have noticed two further candidates for cleanup:

  1. I don't see a reason why the function warmup_linear() is implemented in two places: in optimization.py and in each example script.
  2. Is the method optimizer.get_lr() ever called? It applies yet another copy of the LR schedule (see the sketch below this list):
    https://github.com/huggingface/pytorch-pretrained-BERT/blob/f040a43cb3954e14dc47a815de012ac3f87a85d0/pytorch_pretrained_bert/optimization.py#L79-L92
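
As a small sketch of that difference, using a dummy parameter and assuming the BertAdam constructor arguments used in the examples (lr, warmup, t_total): get_lr() reports the internally scheduled value, while param_groups keeps the base LR.

```python
# Dummy-parameter sketch: compare the stored base LR with the value reported
# by get_lr() after BertAdam's internal schedule has been applied.
import torch
from pytorch_pretrained_bert.optimization import BertAdam

param = torch.nn.Parameter(torch.zeros(10))
optimizer = BertAdam([param], lr=3e-5, warmup=0.1, t_total=1000)

for _ in range(5):
    param.grad = torch.zeros_like(param)  # dummy gradient so step() advances the internal state
    optimizer.step()
    print(optimizer.param_groups[0]["lr"],  # stays at the base LR (3e-05)
          optimizer.get_lr()[0])            # base LR scaled by BertAdam's own warmup schedule
```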

@matej-svejda (Contributor)

There is also an additional problem that causes the learning rate not to be set correctly in run_classifier.py. I created a pull request for that (and the double warmup problem): #218

@kugwzk commented Jan 27, 2019

Has anything been done about this double warmup bug?

@tholor (Contributor) commented Jan 27, 2019

Yes, @matej-svejda worked on this in #218

@kugwzk commented Jan 27, 2019

I see that, but it isn't merged yet?

@tholor (Contributor) commented Jan 27, 2019

No, not yet. As you can see in the PR, it's still WIP and he committed only 4 hours ago. If you need the fix urgently, you can easily apply the changes locally; it's quite a small fix.
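
If anyone wants to patch this locally in the meantime, one possible workaround (only a sketch with variable names modeled on the example scripts, not necessarily what #218 ends up doing) is to apply the manual warmup only on the fp16 path, which does not use BertAdam and its internal schedule:

```python
# Sketch of a local workaround in the example training loop (illustrative
# variable names; not necessarily the fix implemented in #218).
if args.fp16:
    # The fp16 optimizer has no internal warmup, so keep the manual schedule here only.
    lr_this_step = args.learning_rate * warmup_linear(global_step / t_total,
                                                      args.warmup_proportion)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
```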

@kugwzk commented Jan 27, 2019

Sorry, I forgot to check the time :)

@kugwzk commented Jan 28, 2019

By the way, how can I plot the LR schedule for BERT like yours? If I use print(optimizer.param_groups['lr']), the learning rate always stays at the value I initialized it with.

@tholor (Contributor) commented Jan 29, 2019

I have plotted optimizer.param_groups[0]["lr"] from here:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/f040a43cb3954e14dc47a815de012ac3f87a85d0/examples/run_lm_finetuning.py#L610-L616

and lr_scheduled from here:
https://github.com/huggingface/pytorch-pretrained-BERT/blob/f040a43cb3954e14dc47a815de012ac3f87a85d0/pytorch_pretrained_bert/optimization.py#L145-L152

Your code above should actually throw an exception, because optimizer.param_groups is a list. Try optimizer.param_groups[0]["lr"] or lr_this_step.
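
To make the indexing point concrete, here is a tiny self-contained sketch with a plain torch optimizer (any torch optimizer exposes param_groups as a list of parameter-group dicts):

```python
# param_groups is a list of dicts, one per parameter group, so index it first.
import torch

param = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.SGD([param], lr=3e-5)

# optimizer.param_groups["lr"]          # raises TypeError: list indices must be integers or slices, not str
print(optimizer.param_groups[0]["lr"])  # 3e-05
```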

@thomwolf (Member) commented Feb 5, 2019

Ok this should be fixed in master now!
