
Question about the calculation of train_batch_size #3876

Closed Answered by ShadenSmith
formath asked this question in Q&A

Hi @formath, great question 😸. In both cases the last term represents the degree of data parallelism used in training (since each data-parallel replica has its own data pipeline contributing to the train_batch_size).

With pure data parallelism, that's simply the number of GPUs. If we are also training with pipeline (model) parallelism, the resulting degree of data parallelism is the number of GPUs divided by the number of pipeline stages.
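The arithmetic above can be sketched in a few lines. This is an illustrative sketch, not the DeepSpeed implementation; the function and parameter names below are assumptions chosen for clarity.

```python
# Sketch of the train_batch_size arithmetic described above.
# Names (data_parallel_degree, train_batch_size, etc.) are illustrative
# assumptions, not part of the DeepSpeed API.

def data_parallel_degree(num_gpus: int, pipeline_stages: int = 1) -> int:
    """Data-parallel replicas: GPUs divided evenly across pipeline stages."""
    assert num_gpus % pipeline_stages == 0, "GPUs must divide evenly into stages"
    return num_gpus // pipeline_stages

def train_batch_size(micro_batch_per_gpu: int,
                     grad_accum_steps: int,
                     num_gpus: int,
                     pipeline_stages: int = 1) -> int:
    """Global batch = micro-batch * accumulation steps * data-parallel degree."""
    dp = data_parallel_degree(num_gpus, pipeline_stages)
    return micro_batch_per_gpu * grad_accum_steps * dp

# Pure data parallelism on 8 GPUs: dp degree is 8.
print(train_batch_size(4, 2, num_gpus=8))                     # 4 * 2 * 8 = 64

# Adding 4 pipeline stages: dp degree drops to 8 / 4 = 2.
print(train_batch_size(4, 2, num_gpus=8, pipeline_stages=4))  # 4 * 2 * 2 = 16
```

Note how adding pipeline stages shrinks the effective global batch for the same per-GPU settings, since fewer independent data pipelines are feeding the model.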

Answer selected by formath