
Add DistributedSamplerWithLoop #10746

Merged
sgugger merged 3 commits into master from new_distributed_sampler on Mar 16, 2021

Conversation

@sgugger (Collaborator) commented Mar 16, 2021

What does this PR do?

This PR adds a new distributed sampler that provides each process with a number of samples that is a round multiple of the batch size, by looping back to the beginning of the (shuffled) dataset. This is useful:

  • for TPUs to avoid triggering a new XLA compilation for the last training batch
  • for model parallelism to have batches of the same size on all processes

This PR also refactors some logic around the world_size and process_rank in the TrainingArguments, and adds a test of the new DistributedSamplerWithLoop.
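
For readers unfamiliar with the trick, here is a minimal sketch of the looping idea, assuming PyTorch's torch.utils.data.DistributedSampler as the base class. The class name and batch_size argument match the PR title and description, but the body below is an illustration under those assumptions, not necessarily the exact code merged in this PR.

```python
# Minimal sketch (assumption: DistributedSampler from PyTorch is the base class;
# this is not necessarily identical to the implementation merged in this PR).
from torch.utils.data import DistributedSampler


class DistributedSamplerWithLoop(DistributedSampler):
    """DistributedSampler that pads each process's index list to a round
    multiple of batch_size by looping back to the start of its (shuffled) indices."""

    def __init__(self, dataset, batch_size, **kwargs):
        super().__init__(dataset, **kwargs)
        self.batch_size = batch_size

    def __iter__(self):
        # Indices already split across processes (and shuffled) by DistributedSampler.
        indices = list(super().__iter__())
        # Number of extra samples needed so len(indices) is a multiple of batch_size.
        remainder = (-len(indices)) % self.batch_size
        # Loop back to the beginning of this process's shuffled indices.
        indices += indices[:remainder]
        return iter(indices)
```

For example, with a per-device batch size of 8 and 30 indices assigned to a process, this yields 32 indices by reusing the first 2 shuffled ones, so the final batch is full-sized and keeps the same shape as the others (avoiding a new XLA compilation on TPU).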

Tested on:

  • single-GPU
  • multi-GPU
  • TPU
  • SageMaker MP

@LysandreJik (Member) left a comment

Yes, LGTM. Thanks for adding a test.

@sgugger sgugger merged commit a0a027c into master Mar 16, 2021
@sgugger sgugger deleted the new_distributed_sampler branch March 16, 2021 15:22
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* Add DistributedSamplerWithLoop

* Fix typo

* Test and small fix