Hi InternLM team, thank you for this open source contribution! InternLM looks like a really strong 7B model.
I think the research community would greatly benefit from learning about the training details of InternLM. Are you open to sharing the token budget and global batch size used for this model?
In the README I see this comment which suggests a token budget over 1T tokens:
It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
And in the training performance README I see that the max performance was achieved at 16k tokens per GPU. If this was used across 1024 GPUs for pretraining it would imply a global batch size of 16M tokens which is larger than I've seen before (especially for 7B models).
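The arithmetic behind that estimate can be sketched as follows (the 16k-tokens-per-GPU and 1024-GPU figures come from this thread; the cluster size is my assumption about the pretraining setup):

```python
# Back-of-the-envelope check of the implied global batch size.
tokens_per_gpu = 16 * 1024  # 16k tokens per GPU at peak throughput
num_gpus = 1024             # assumed pretraining cluster size

global_batch_tokens = tokens_per_gpu * num_gpus
print(f"{global_batch_tokens / 2**20:.0f}M tokens")  # -> 16M tokens
```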
Thank you again!
Hi, thank you for your interest in our project. As you mentioned, a global batch of 16M tokens is indeed quite large. We commonly use configurations such as 512 GPUs with a global batch size of 4M, or 1024 GPUs with a global batch size of 8M, for training. We have included information in our README regarding performance testing under certain global batch sizes.
As the number of GPUs increases, the batch size per GPU gradually decreases, so communication overhead accounts for a larger share of each step. This inevitably reduces TGS and TFLOPS. Additionally, enabling options such as pack_sample_into_one, bfloat16/float16, and reduce_bucket_size can cause fluctuations in TGS, and the network status of the cluster can also introduce minor disturbances. We are continuously working to reduce the communication overhead at small per-GPU batch sizes and will keep you updated on any new developments.
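The effect described above can be illustrated numerically. Assuming a fixed global batch (the 4M-token figure quoted earlier; the GPU counts are illustrative), the per-GPU batch halves each time the cluster doubles, which is why communication becomes a larger fraction of step time:

```python
# Per-GPU token count for a fixed global batch as the cluster scales out.
global_batch = 4 * 2**20  # fixed 4M-token global batch (example figure)

for gpus in (512, 1024, 2048):
    per_gpu = global_batch // gpus
    print(f"{gpus} GPUs -> {per_gpu} tokens/GPU")
# -> 512 GPUs -> 8192 tokens/GPU
# -> 1024 GPUs -> 4096 tokens/GPU
# -> 2048 GPUs -> 2048 tokens/GPU
```

Smaller per-GPU batches mean less compute per communication round, so the all-reduce cost is amortized over less work.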