
Inquiry about Mosaic-BERT and BERT-Base Sequence Lengths #407

Closed
mscherrmann opened this issue Jul 3, 2023 · 9 comments

Comments

@mscherrmann commented Jul 3, 2023

I have been exploring the Mosaic-BERT model and noticed that it is trained with a sequence length of 128. My understanding is that this length can easily be extrapolated at inference time thanks to Attention with Linear Biases (ALiBi). However, in one of your blog posts you compare Mosaic-BERT with the Hugging Face BERT-Base model, and I'm unclear about the sequence length used for training the BERT-Base model.

Specifically, I would like to know whether the BERT-Base model used as a benchmark for Mosaic-BERT, for example in the attached figure, is trained with a sequence length of 128 or 512. If it is trained with a sequence length of 128, what steps are necessary to obtain a Mosaic-BERT model that matches the performance of a BERT-Base model trained with a sequence length of 512?

Thank you for your attention to this matter. I look forward to your response and clarification.
[Figure: BertComparisonMNLI]

@dakinggg (Collaborator)

Apologies if I haven't totally understood your question.

From the blogpost:
"For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4."

To fully pretrain a model with 512 sequence length, you'll just need to follow our guide, but change the max_seq_len param to 512.

Because of ALiBi, you can also start with a model trained with sequence length 128 and change max_seq_len to 512 to adapt it.
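
For illustration, a minimal sketch of that override in a pretraining config; apart from max_seq_len, the keys shown are placeholders modeled on the example YAMLs rather than the exact file:

```yaml
# Hypothetical excerpt of a pretraining YAML.
# Only max_seq_len is the parameter discussed above; the other keys are illustrative.
max_seq_len: 512                  # raised from 128; with ALiBi a 128-trained model can also be adapted
tokenizer_name: bert-base-uncased
```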

@mscherrmann (Author)

Thank you!

@mscherrmann (Author)

Hi,

I have one follow-up question:

What do I have to consider regarding "global_train_batch_size" and "device_train_microbatch_size" if I want to train with a sequence length of 512 instead of 128 tokens? If I leave everything as in the yamls/main/hf-bert-base-uncased.yaml file, I will probably run into memory problems. Do you have any tips in this regard? Or, even better, do you have a YAML for this case? I am training on 8x NVIDIA A100 80 GB.

Trial and error unfortunately works poorly for me, because I always have to wait quite a long time before I get GPU access. Hence the question. Thanks a lot!

@dakinggg (Collaborator)

global_train_batch_size is an optimization-related setting, and you may or may not want to change it; if you increase the sequence length, you see more tokens per batch. device_train_microbatch_size does not affect the math and is only related to memory. I'm not sure what setting will work on the exact setup you describe, but you can try device_train_microbatch_size=auto, which will determine it for you.
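
A hedged sketch of those two settings in YAML; the global batch size shown is a placeholder, not a tested recommendation for 8x A100 80 GB:

```yaml
# Hypothetical batch-size overrides for training at sequence length 512.
# global_train_batch_size is an optimization choice; this value is a placeholder.
global_train_batch_size: 4096
# device_train_microbatch_size only affects memory, not the math;
# 'auto' lets the trainer pick the largest microbatch that fits on each device.
device_train_microbatch_size: auto
```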

@mscherrmann (Author)

Perfect, thank you for your quick response!

@mscherrmann (Author)

I ran into another issue, sorry...

As mosaic-bert is not finetunable, I use hf-bert. I follow the approach of the original BERT paper: train 90% of the steps with a sequence length of 128 and the remaining 10% of the steps with a sequence length of 512.

To accomplish this with your code, I run the "main" script for pretraining twice. The first run completes without any issue. However, in the second run, when I load the previous checkpoint with "load_path" and change the sequence length to 512, I get the following error:

ValueError: Reused local directory: ['/mnt/data/train'] vs ['/ mnt/data/train']. Provide a different one.

The data is stored locally. Do you have any idea why this error occurs?

Thank you very much!

@karan6181 (Contributor)

Hi @FinTexIFB, what do the remote and local parameters that you are passing to StreamingDataset look like? Since your dataset resides locally, you can provide your local directory to the local parameter and set remote=None. For example, local='/mnt/data/train' and remote=None.
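
For reference, a hedged sketch of how that could look in the dataloader section of the YAML; the key nesting is an assumption based on the example configs:

```yaml
# Hypothetical dataloader excerpt; nesting assumed from the example YAMLs.
train_loader:
  dataset:
    local: /mnt/data/train   # the dataset already resides here
    remote: null             # remote=None, so nothing is downloaded or copied
```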

@mscherrmann (Author) commented Aug 3, 2023

Hi @karan6181,

thank you for your response. Yes, setting local='/mnt/data/train' and remote=None is exactly what I've done.

However, I found a workaround by simply creating a new container from the same Mosaic Docker image and installing all dependencies. Now it works, but only once: when I try to continue pre-training from an existing checkpoint afterwards, I get the error again. Maybe that is a bug.

@jacobfulano (Contributor)

@FinTexIFB, mosaic-bert is finetunable, as can be seen in this yaml. Does this work for your use case?
