Hi,
I found that the learning rate for pretraining Stage I reported in the paper is 1e-3, with a batch size of 600. The scripts in this repo suggest 1e-4 and 1920, but usually the learning rate should be increased along with the batch size. In Stage II, the batch sizes in the paper and in the scripts are very different (48 vs 960). Considering that hyperparameter search takes a lot of time in pretraining, I'm not sure which parameters should be used. Is there some misunderstanding?
Setting the Stage I learning rate to either 1e-3 or 1e-4 is fine. We tested both in Stage I and they work well; however, 1e-4 is more stable, so we used 1e-4 when writing the README.md.
For the batch size, our principle was to fill up the GPUs, given our limited resources for pretraining. Sorry that we missed introducing the gradient_accumulation_steps of Stage II in the paper; 48 is the forward batch size of each step.
In summary: 1) Ignore our batch size and use as much of your GPU memory as possible, and set gradient_accumulation_steps to speed things up a little. 2) Setting 1e-4 for both stages is fine.
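To illustrate the relationship between the per-step forward batch size and the effective batch size, here is a minimal framework-agnostic sketch in plain Python. The numbers are taken from this thread (forward batch 48, scripts' batch 960), and `grad_accum_steps = 20` is an inferred value (960 / 48), not one confirmed by the paper; `compute_grad` and `apply_update` are hypothetical placeholders for your framework's backward pass and optimizer step.

```python
# Assumed Stage II numbers from this thread:
per_step_batch = 48    # forward batch size of each step (paper)
grad_accum_steps = 20  # inferred: 960 / 48, to match the scripts' batch size
effective_batch = per_step_batch * grad_accum_steps
print(effective_batch)  # 960

def train_step(micro_batches, compute_grad, apply_update):
    """Accumulate averaged gradients over micro-batches, then apply one
    optimizer update, so the update sees the full effective batch."""
    accum = 0.0
    for mb in micro_batches:
        # Divide by the number of micro-batches so the accumulated
        # gradient equals the average over the effective batch.
        accum += compute_grad(mb) / len(micro_batches)
    apply_update(accum)
```

The point is that memory usage is set by `per_step_batch` alone, while the optimization dynamics follow `effective_batch`, so you can fill your GPU memory and then tune `grad_accum_steps` independently.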