BERT Loss not decreasing #29

Closed
JF-D opened this issue Apr 18, 2020 · 2 comments
JF-D commented Apr 18, 2020

I am training the BERT_BASE model on 16 V100 GPUs with the English Wikipedia dataset. The NSP loss decreases from 0.7 to 0.3, but the MLM loss only goes from 10.0 to 6.8.

I use create_pretraining_data.py from google-research/bert to pre-create the pretraining examples, because with the default setting, which builds the training samples inside the dataset, the speed is very slow and it can take hours to run 100 iterations. With the pre-created pretraining data and a lazy dataloader the training speed is normal, but the loss is still a big problem for me. I pretrained BERT_BASE with the scheme from the BERT paper (batch size 256, per-GPU batch size 16, 900,000 steps with sequence length 128, then 100,000 steps with sequence length 512), and when I then evaluate on SQuAD the F1 score is quite low.

Could you please offer some help? Thanks!
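For reference, a minimal sketch of the two-phase schedule described above, assuming pure data parallelism across the 16 GPUs (so 16 × 16 = 256 global batch size, as stated); the numbers are taken directly from the comment:

```python
# Rough sketch of the schedule described above (not the training script itself).
n_gpus = 16                      # V100s, pure data parallelism assumed
per_gpu_batch = 16
global_batch = n_gpus * per_gpu_batch          # 256, matching the BERT paper
phases = [(900_000, 128), (100_000, 512)]      # (steps, sequence length)

for steps, seq_len in phases:
    tokens = global_batch * steps * seq_len
    print(f"seqlen {seq_len}: {steps:,} steps ~ {tokens / 1e9:.1f}B tokens")
```

This works out to roughly 29.5B tokens in the seqlen-128 phase and 13.1B tokens in the seqlen-512 phase.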

JF-D changed the title from "Loss not decreasing" to "BERT Loss not decreasing" on Apr 18, 2020

JF-D commented Apr 20, 2020

Hi @jaredcasper @raulpuric, could you please offer some help? When I pretrain BERT_BASE with Megatron-LM v0.2 using the latest English Wikipedia dump, I get a low NSP loss, but the MLM loss does not decrease.
[attached screenshot of the loss curves]
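For anyone comparing numbers: a quick reference point, assuming the default uncased WordPiece vocabulary of 30,522 tokens, is the cross-entropy of a uniform guess over the vocabulary, which is roughly where the MLM loss reported above starts:

```python
import math

# Assumption: the default uncased BERT WordPiece vocabulary (30,522 tokens).
vocab_size = 30_522

# Cross-entropy of a uniform prediction over the vocabulary; an MLM loss that
# stays near this value means the masked-token head has learned essentially nothing.
uniform_mlm_loss = math.log(vocab_size)
print(f"uniform-guess MLM loss ~ {uniform_mlm_loss:.2f} nats")   # ~ 10.33
```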

JF-D closed this as completed on Apr 27, 2020
Chen-Chang pushed a commit to Chen-Chang/Megatron-LM that referenced this issue May 18, 2021
* adding example for ZeRO-2 CPU offload  (NVIDIA#29)

Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

chaitanyabaranwal commented Apr 1, 2022

@JF-D, could I check what the issue turned out to be for you here? My loss is also not converging, and I am trying to figure out possible reasons.
