Training details? #16

Closed
abhi-mosaic opened this issue Jul 6, 2023 · 1 comment

Comments

@abhi-mosaic

Hi InternLM team, thank you for this open source contribution! InternLM looks like a really strong 7B model.

I think the research community would greatly benefit from learning about the training details of InternLM. Are you open to sharing the token budget and global batch size used for this model?

In the README I see this comment which suggests a token budget over 1T tokens:

> It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.

And in the training performance README I see that peak performance was achieved at 16k tokens per GPU. If that setting were used across 1024 GPUs for pretraining, it would imply a global batch size of 16M tokens, which is larger than I've seen before (especially for 7B models).
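
Just to spell out that arithmetic (a quick illustrative check, not code from this repo):

```python
# Back-of-the-envelope check of the implied global batch size (illustrative only).
tokens_per_gpu = 16 * 1024   # 16k tokens per GPU, the peak-throughput setting from the performance README
num_gpus = 1024              # hypothetical pretraining cluster size

global_batch_tokens = tokens_per_gpu * num_gpus
print(f"{global_batch_tokens:,} tokens (~{global_batch_tokens / 2**20:.0f}M) per global batch")
# -> 16,777,216 tokens (~16M) per global batch
```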

Thank you again!

@SolenoidWGT
Collaborator

Hi, thank you for your interest in our project. As you mentioned, a global batch size of 16M is indeed quite large. We commonly use configurations such as 512 GPUs with a global batch size of 4M, or 1024 GPUs with a global batch size of 8M. We have included information in our README about performance testing at these global batch sizes.
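
In other words (assuming "M" here means 2^20 tokens), both of those configurations work out to roughly 8k tokens per GPU; a quick illustrative check:

```python
# Tokens per GPU implied by the configurations above (illustrative only).
for global_batch_tokens, num_gpus in [(4 * 2**20, 512), (8 * 2**20, 1024)]:
    per_gpu = global_batch_tokens // num_gpus
    print(f"{num_gpus} GPUs, {global_batch_tokens // 2**20}M global batch -> {per_gpu} tokens per GPU")
# Both cases come out to 8192 tokens per GPU, i.e. half of the 16k-per-GPU
# setting that gave peak throughput in the performance README.
```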

As the number of GPUs increases, the batch size per GPU gradually decreases, so communication accounts for a larger share of the total time. This inevitably lowers TGS (tokens per GPU per second) and TFLOPS. Additionally, options such as pack_sample_into_one, bfloat16/float16, and reduce_bucket_size can cause fluctuations in TGS, and the network status of the cluster can also introduce minor disturbances. We are continuing to work on reducing the communication overhead at small per-GPU batch sizes and will keep you updated on any new developments.
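
To illustrate the options mentioned above, here is a rough, hypothetical config sketch; the field names and their placement are assumptions for illustration, so please refer to the config files shipped in the repository for the authoritative schema:

```python
# Hypothetical training-config sketch illustrating the options discussed above.
# Field names and placement are assumptions for illustration only; see the
# config files in the repository for the real schema.
data = dict(
    seq_len=2048,               # per-sample sequence length (assumed value)
    pack_sample_into_one=False, # packing samples into one sequence affects TGS
)

model = dict(
    dtype="torch.bfloat16",     # bfloat16/float16 choice also changes throughput
)

optimizer = dict(
    reduce_bucket_size=512 * 1024 * 1024,  # gradient-reduce bucket size in bytes (assumed placement)
)
```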
