BERT Loss not decreasing #29
Comments
Hi @jaredcasper @raulpuric, could you please offer some help? When I pretrain BERT_BASE with Megatron-LM v0.2 using the latest English Wikipedia dump, I get a low NSP loss, but the MLM loss does not decrease.
Chen-Chang pushed a commit to Chen-Chang/Megatron-LM that referenced this issue on May 18, 2021:
adding example for ZeRO-2 CPU offload (NVIDIA#29)
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
@JF-D, could I check what the issue was for you here? My loss is also not converging, and I am trying to figure out possible reasons.
I train the BERT_BASE model on 16 V100 GPUs using the English Wikipedia dataset. I find that the NSP loss decreases from 0.7 to 0.3, but the MLM loss only decreases from 10.0 to 6.8. During training, I use
create_pretraining_data.py
from google-research/bert to pre-create the BERT pretraining examples, because with the default setting, which creates training samples in the dataset on the fly, the speed is quite slow and it sometimes takes hours to train 100 iterations. With pre-created pretraining data and the lazy dataloader, the training speed is normal, but the loss is a big problem for me. I pretrain the BERT_BASE model using the scheme from the BERT paper (batch size 256, per-GPU batch size 16, 900,000 steps at sequence length 128, then 100,000 steps at sequence length 512), and when I then evaluate on SQuAD, the F1 score is quite low. Could you please offer some help? Thanks!
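For reference, the two-phase schedule described above can be sanity-checked with a quick sketch of the token-count arithmetic. The step counts, batch size, and sequence lengths below are taken from the comment itself, not from any repo config; the helper name is my own:

```python
# Sketch: tokens processed by the two-phase BERT_BASE schedule described above
# (batch size 256; 900,000 steps at seqlen 128, then 100,000 steps at seqlen 512).

def tokens_processed(steps: int, batch_size: int, seq_len: int) -> int:
    """Total tokens seen in one training phase (steps x batch x sequence length)."""
    return steps * batch_size * seq_len

phase1 = tokens_processed(900_000, 256, 128)  # seqlen-128 phase
phase2 = tokens_processed(100_000, 256, 512)  # seqlen-512 phase

print(f"phase 1: {phase1:,} tokens")                  # 29,491,200,000
print(f"phase 2: {phase2:,} tokens")                  # 13,107,200,000
print(f"total:   {phase1 + phase2:,} tokens")         # 42,598,400,000
```

If the MLM loss plateaus at ~6.8 over a token budget of this size, that usually points at a data-pipeline or masking problem rather than insufficient training, which is worth checking before adjusting hyperparameters.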