DDP Dataloader shuffle order not synchronized between different DDP workers #2268
Comments
Hi @Teoge, thanks for reporting. I'm unable to reproduce the error. I'm using 5 processes with 25 elements. Do you know what might be happening, @muellerzr?
I am able to reproduce it on another machine. It can only be reproduced in 0.25.0, not in 0.24.0 or lower versions.
Thanks for testing again @Teoge! I can indeed reproduce it now. I was probably on another version when I first tested! To add more details, it seems that it also only happens when
EDIT: I was able to find the PR #2126 that caused this. Also noticed that if you set the seed, the issue is solved, but that's not intuitive at all:

```python
from accelerate.utils import set_seed

set_seed(42)
```
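For context, a minimal sketch of where that workaround would sit, assuming a shuffled DataLoader prepared with Accelerate (the dataset, batch size, and seed value here are placeholders, not from the issue):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import set_seed

# Workaround from the comment above: call set_seed in every process
# before the shuffled DataLoader is prepared, so the shuffle order
# is identical across DDP workers.
set_seed(42)

accelerator = Accelerator()
dataset = TensorDataset(torch.arange(25))  # placeholder dataset
dataloader = DataLoader(dataset, batch_size=5, shuffle=True)
dataloader = accelerator.prepare(dataloader)
```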
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
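The original reproduction script is not included in this excerpt. As a rough illustration only, a minimal sketch along the lines described in the comments (5 processes, 25 elements; the dataset, batch size, and script name are assumptions) might look like:

```python
# repro.py -- hypothetical sketch, launched with:
#   accelerate launch --num_processes 5 repro.py
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# 25 elements shared across 5 DDP workers via a shuffled DataLoader.
dataset = TensorDataset(torch.arange(25))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=1, shuffle=True))

seen = []
for (x,) in dataloader:
    seen.extend(x.tolist())

# With a synchronized shuffle, the union of `seen` across ranks covers 0-24
# exactly once; with the bug, some values repeat and others are missing.
print(f"rank {accelerator.process_index}: {sorted(seen)}")
```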
Expected behavior
I expect the dataset to be distributed across all DDP workers without replacement. But the results show that the sampling order of the different DDP workers is not the same, leading to repeated sampling: for example, "9" is sampled three times and "3" is sampled twice.