BERT Loss not decreasing #29
Comments
Hi @jaredcasper @raulpuric, could you please offer some help? When I pretrain BERT_BASE with Megatron-LM v0.2 using the latest English Wikipedia dump, I get a low NSP loss, but the MLM loss does not decrease.
Chen-Chang pushed a commit to Chen-Chang/Megatron-LM that referenced this issue on May 18, 2021:
adding example for ZeRO-2 CPU offload (NVIDIA#29)
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
@JF-D, could I check what the issue was for you here? My loss is also not converging, and I am trying to figure out possible reasons.
I train the BERT_BASE model on 16 V100 GPUs using the English Wikipedia dataset. I find that the NSP loss decreases from 0.7 to 0.3, but the MLM loss only decreases from 10.0 to 6.8. During training, I use
create_pretraining_data.py
from google-research/bert to pre-create the BERT pretraining examples, because with the default setting, which creates training samples in the dataset on the fly, the speed is quite slow and it sometimes takes hours to train 100 iterations. With pre-created pretraining data and the lazy dataloader, the training speed is normal, but the loss is a big problem for me. I pretrain the BERT_BASE model using the scheme from the BERT paper (batch size 256, per-GPU batch size 16, 900,000 steps at sequence length 128, then 100,000 steps at sequence length 512), and when I then evaluate on SQuAD, the F1 score is quite low. Could you please offer some help? Thanks!
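For reference, the two-phase schedule described above can be sanity-checked with a quick sketch of the token-count arithmetic. The step counts, batch size, and sequence lengths below are taken from the comment itself, not from any repo config; the helper name is my own:

```python
# Sketch: tokens processed by the two-phase BERT_BASE schedule described above
# (batch size 256; 900,000 steps at seqlen 128, then 100,000 steps at seqlen 512).

def tokens_processed(steps: int, batch_size: int, seq_len: int) -> int:
    """Total tokens seen in one training phase (steps x batch x sequence length)."""
    return steps * batch_size * seq_len

phase1 = tokens_processed(900_000, 256, 128)  # seqlen-128 phase
phase2 = tokens_processed(100_000, 256, 512)  # seqlen-512 phase

print(f"phase 1: {phase1:,} tokens")                  # 29,491,200,000
print(f"phase 2: {phase2:,} tokens")                  # 13,107,200,000
print(f"total:   {phase1 + phase2:,} tokens")         # 42,598,400,000
```

If the MLM loss plateaus at ~6.8 over a token budget of this size, that usually points at a data-pipeline or masking problem rather than insufficient training, which is worth checking before adjusting hyperparameters.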