Hi,
I found that the learning rate for pretraining Stage I reported in the paper is 1e-3, with a batch size of 600. The scripts in this repo suggest 1e-4 and 1920, but usually the learning rate should be increased along with the batch size. In Stage II, the batch sizes in the paper and in the scripts are very different (48 vs 960). Considering that hyperparameter search takes a lot of time in pretraining, I'm not sure which parameters should be used. Is there some misunderstanding?
Setting the Stage I learning rate to either 1e-3 or 1e-4 is fine. We tested both in Stage I and they work well; however, 1e-4 is more stable, so we used 1e-4 when writing the README.md.
For the batch size, our principle was to fill up the GPUs, given our limited resources for pretraining. Sorry that we missed introducing the gradient_accumulation_steps of Stage II in the paper; 48 is the forward batch size of each step.
In summary: 1) Ignore our batch size and use as much of your GPU memory as possible, and set gradient_accumulation_steps to speed things up a little. 2) Setting 1e-4 for both stages is fine.
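To illustrate the relationship between the per-step forward batch size and the effective batch size, here is a minimal framework-agnostic sketch in plain Python. The numbers are taken from this thread (forward batch 48, scripts' batch 960), and `grad_accum_steps = 20` is an inferred value (960 / 48), not one confirmed by the paper; `compute_grad` and `apply_update` are hypothetical placeholders for your framework's backward pass and optimizer step.

```python
# Assumed Stage II numbers from this thread:
per_step_batch = 48    # forward batch size of each step (paper)
grad_accum_steps = 20  # inferred: 960 / 48, to match the scripts' batch size
effective_batch = per_step_batch * grad_accum_steps
print(effective_batch)  # 960

def train_step(micro_batches, compute_grad, apply_update):
    """Accumulate averaged gradients over micro-batches, then apply one
    optimizer update, so the update sees the full effective batch."""
    accum = 0.0
    for mb in micro_batches:
        # Divide by the number of micro-batches so the accumulated
        # gradient equals the average over the effective batch.
        accum += compute_grad(mb) / len(micro_batches)
    apply_update(accum)
```

The point is that memory usage is set by `per_step_batch` alone, while the optimization dynamics follow `effective_batch`, so you can fill your GPU memory and then tune `grad_accum_steps` independently.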