
Inquiry about Mosaic-BERT and BERT-Base Sequence Lengths #407

Closed
mscherrmann opened this issue Jul 3, 2023 · 9 comments

Comments

@mscherrmann commented Jul 3, 2023

I have been exploring the Mosaic-BERT model and noticed that it is trained with a sequence length of 128. My understanding is that this length can easily be extrapolated at inference time thanks to Attention with Linear Biases (ALiBi). However, in one of your blog posts you compare Mosaic-BERT with the Hugging Face BERT-Base model, and I'm unclear about the sequence length used for training the BERT-Base model.

Specifically, I would like to know whether the BERT-Base model used as a benchmark for Mosaic-BERT, for example in the attached figure, is trained with a sequence length of 128 or 512. If it is trained with a sequence length of 128, what steps are necessary to obtain a Mosaic-BERT model that matches the performance of a BERT-Base model trained with a sequence length of 512?

Thank you for your attention to this matter. I look forward to your response and clarification.
[Figure: BertComparisonMNLI]

@dakinggg (Collaborator)

Apologies if I haven't totally understood your question.

From the blogpost:
"For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4."

To fully pretrain a model with 512 sequence length, you'll just need to follow our guide, but change the max_seq_len param to 512.

Because of ALiBi, you can also start with a model trained with sequence length 128 and change max_seq_len to 512 to adapt it.
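
For illustration, a minimal sketch of that override in a pretraining config; apart from max_seq_len, the keys shown are placeholders modeled on the example YAMLs rather than the exact file:

```yaml
# Hypothetical excerpt of a pretraining YAML.
# Only max_seq_len is the parameter discussed above; the other keys are illustrative.
max_seq_len: 512                  # raised from 128; with ALiBi a 128-trained model can also be adapted
tokenizer_name: bert-base-uncased
```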

@mscherrmann (Author)

Thank you!

@mscherrmann (Author)

Hi,

I have one follow-up question:

What do I have to consider regarding "global_train_batch_size" and "device_train_microbatch_size" if I want to train with a sequence length of 512 instead of 128 tokens? If I leave everything as in the yamls/main/hf-bert-base-uncased.yaml file, I will probably run into memory problems. Do you have any tips in this regard? Or, even better, do you have a YAML for this case? I am training on 8x NVIDIA A100 80 GB.

Trial and error unfortunately works poorly for me, because I always have to wait quite a long time before I get GPU access. Hence the question. Thanks a lot!

@dakinggg (Collaborator)

global_train_batch_size is an optimization-related setting, and you may or may not want to change it; if you increase the sequence length, you see more tokens per batch. device_train_microbatch_size does not affect the math and is only related to memory. I'm not sure what setting will work on the exact setup you describe, but you can try device_train_microbatch_size=auto, which will determine it for you.
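
A hedged sketch of those two settings in YAML; the global batch size shown is a placeholder, not a tested recommendation for 8x A100 80 GB:

```yaml
# Hypothetical batch-size overrides for training at sequence length 512.
# global_train_batch_size is an optimization choice; this value is a placeholder.
global_train_batch_size: 4096
# device_train_microbatch_size only affects memory, not the math;
# 'auto' lets the trainer pick the largest microbatch that fits on each device.
device_train_microbatch_size: auto
```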

@mscherrmann (Author)

Perfect, thank you for your quick response!

@mscherrmann (Author)

I ran into another issue, sorry...

As mosaic-bert is not finetunable, I use hf-bert. I follow the approach of the original BERT paper: train 90% of the steps with a sequence length of 128 and the remaining 10% of the steps with a sequence length of 512.

To accomplish this with your code, I run the "main" script for pretraining twice. The first run completes without any issue. However, in the second run, when I load the previous checkpoint with "load_path" and change the sequence length to 512, I get the following error:

ValueError: Reused local directory: ['/mnt/data/train'] vs ['/ mnt/data/train']. Provide a different one.

The data is stored locally. Do you have any idea why this error occurs?

Thank you very much!

@karan6181 (Contributor)

Hi @FinTexIFB, what do the remote and local parameters that you are passing to StreamingDataset look like? Since your dataset resides locally, you can provide your local directory to the local parameter and set remote=None. For example, local='/mnt/data/train' and remote=None.
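
For reference, a hedged sketch of how that could look in the dataloader section of the YAML; the key nesting is an assumption based on the example configs:

```yaml
# Hypothetical dataloader excerpt; nesting assumed from the example YAMLs.
train_loader:
  dataset:
    local: /mnt/data/train   # the dataset already resides here
    remote: null             # remote=None, so nothing is downloaded or copied
```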

@mscherrmann (Author) commented Aug 3, 2023

Hi @karan6181,

thank you for your response. Yes, setting local='/mnt/data/train' and remote=None is exactly what I've done.

However, I found a workaround by simply creating a new container from the same Mosaic Docker image and installing all dependencies. Now it works, but only once: when I try to continue pre-training from an existing checkpoint afterwards, I get the error again. Maybe that is a bug.

@jacobfulano (Contributor)

@FinTexIFB, mosaic-bert is finetunable, as can be seen in this yaml. Does this work for your use case?
