Hyper-parameter in pretraining #18

Closed
zhangliang-04 opened this issue Sep 4, 2021 · 2 comments

zhangliang-04 commented Sep 4, 2021

Hi,
I found that the learning rate for pretraining Stage I reported in the paper is 1e-3 with a batch size of 600, while the scripts in this repo suggest 1e-4 and 1920; usually the learning rate should be increased along with the batch size. In Stage II, the batch sizes in the paper and in the scripts are very different (48 vs. 960). Considering that a hyper-parameter search would take a lot of time in pretraining, I'm not sure which parameters should be used. Am I misunderstanding something?

ArrowLuo (Contributor) commented Sep 4, 2021

Hi @zhangliang-04,

Setting the Stage I lr to either 1e-3 or 1e-4 is fine. We tested both at Stage I and they work well; however, 1e-4 is more stable, so we used 1e-4 when writing the README.md.

For the batch size, our principle is to fill up the GPUs, given our limited resources for pretraining. Sorry that we missed describing the gradient_accumulation_steps of Stage II in the paper; 48 is the forward batch size of each step.

In summary: 1) ignore our batch size, use as much of your GPU memory as possible, and set gradient_accumulation_steps accordingly to make training a little faster; 2) setting both stages to 1e-4 is fine.
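For concreteness, here is a minimal PyTorch sketch of the gradient-accumulation idea (illustrative values only, not the repo's actual training loop): a forward batch of 48 with gradient_accumulation_steps=20 gives an effective batch size of 960.

```python
import torch

# Stand-ins for the real model/data; every value below is illustrative.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

per_step_batch = 48                  # what fits in GPU memory per forward pass
gradient_accumulation_steps = 20     # 48 x 20 = 960 effective batch size

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(per_step_batch, 10)   # dummy batch in place of real features
    y = torch.randn(per_step_batch, 1)
    loss = loss_fn(model(x), y) / gradient_accumulation_steps  # average over micro-batches
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                   # one parameter update per 960 samples
        optimizer.zero_grad()
```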

Best,

@zhangliang-04 (Author)

Thanks for your suggestions!
