
Add DistributedSamplerWithLoop #10746

Merged
sgugger merged 3 commits into master from new_distributed_sampler on Mar 16, 2021

Conversation

@sgugger (Collaborator) commented Mar 16, 2021

What does this PR do?

This PR adds a new distributed sampler that provides each process with a number of samples that is a round multiple of the batch size, by looping back to the beginning of the (shuffled) dataset. This is useful:

  • for TPUs to avoid triggering a new XLA compilation for the last training batch
  • for model parallelism to have batches of the same size on all processes

This PR also refactors some logic around the world_size and process_rank in the TrainingArguments, and adds a test of the new DistributedSamplerWithLoop.
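
For readers unfamiliar with the trick, here is a minimal sketch of the looping idea, assuming PyTorch's torch.utils.data.DistributedSampler as the base class. The class name and batch_size argument match the PR title and description, but the body below is an illustration under those assumptions, not necessarily the exact code merged in this PR.

```python
# Minimal sketch (assumption: DistributedSampler from PyTorch is the base class;
# this is not necessarily identical to the implementation merged in this PR).
from torch.utils.data import DistributedSampler


class DistributedSamplerWithLoop(DistributedSampler):
    """DistributedSampler that pads each process's index list to a round
    multiple of batch_size by looping back to the start of its (shuffled) indices."""

    def __init__(self, dataset, batch_size, **kwargs):
        super().__init__(dataset, **kwargs)
        self.batch_size = batch_size

    def __iter__(self):
        # Indices already split across processes (and shuffled) by DistributedSampler.
        indices = list(super().__iter__())
        # Number of extra samples needed so len(indices) is a multiple of batch_size.
        remainder = (-len(indices)) % self.batch_size
        # Loop back to the beginning of this process's shuffled indices.
        indices += indices[:remainder]
        return iter(indices)
```

For example, with a per-device batch size of 8 and 30 indices assigned to a process, this yields 32 indices by reusing the first 2 shuffled ones, so the final batch is full-sized and keeps the same shape as the others (avoiding a new XLA compilation on TPU).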

Tested on:

  • single-GPU
  • multi-GPU
  • TPU
  • SageMaker MP

@LysandreJik (Member) left a comment

Yes, LGTM. Thanks for adding a test.

@sgugger sgugger merged commit a0a027c into master Mar 16, 2021
@sgugger sgugger deleted the new_distributed_sampler branch March 16, 2021 15:22
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* Add DistributedSamplerWithLoop

* Fix typo

* Test and small fix