
DDP Dataloader shuffle order not synchronized between different DDP workers #2268

Closed
Teoge opened this issue Dec 20, 2023 · 3 comments · Fixed by #2319

Comments


Teoge commented Dec 20, 2023

System Info

accelerate version: 0.25.0
torch version: 1.12.1
accelerate's configuration:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

from accelerate import Accelerator
from torch.utils.data import DataLoader
import time

accelerator = Accelerator()

# Toy dataset: 24 integers, shuffled, one sample per batch
dataloader = DataLoader(list(range(24)), shuffle=True, batch_size=1)

# prepare() shards the dataloader so that each DDP worker should see a disjoint subset
dataloader = accelerator.prepare(dataloader)

for batch in dataloader:
    print(batch)
    time.sleep(1)
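(The launch command is not given in the report; presumably something like accelerate launch repro.py with the configuration above, i.e. num_processes: 8, where repro.py is a hypothetical name for the snippet.)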

Expected behavior

I expect the dataset to be distributed across all DDP workers without replacement, but the results show that the shuffle order differs between DDP workers, leading to repeated sampling: for example, "9" is sampled three times and "3" is sampled twice in the output below (a gather-based check is sketched after it).

tensor([1], device='cuda:7')
tensor([9], device='cuda:3')
tensor([15], device='cuda:5')
tensor([3], device='cuda:1')
tensor([23], device='cuda:6')
tensor([16], device='cuda:2')
tensor([3], device='cuda:0')
tensor([14], device='cuda:4')
tensor([4], device='cuda:7')
tensor([12], device='cuda:3')
tensor([10], device='cuda:5')
tensor([0], device='cuda:1')
tensor([20], device='cuda:6')
tensor([13], device='cuda:2')
tensor([4], device='cuda:0')
tensor([19], device='cuda:4')
tensor([14], device='cuda:7')
tensor([7], device='cuda:3')
tensor([9], device='cuda:5')
tensor([9], device='cuda:1')
tensor([10], device='cuda:6')
tensor([10], device='cuda:2')
tensor([5], device='cuda:0')
tensor([8], device='cuda:4')
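
A minimal sketch (not part of the original report) that makes the duplication easier to spot: gather the samples seen by every process and count them on the main process. It reuses the 24-element toy dataset from the reproduction above, so every process runs the same number of steps and accelerator.gather() stays in sync.

from collections import Counter

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
dataloader = accelerator.prepare(DataLoader(list(range(24)), shuffle=True, batch_size=1))

seen = []
for batch in dataloader:
    # gather() concatenates the per-process batches across all processes
    seen.append(accelerator.gather(batch))

if accelerator.is_main_process:
    counts = Counter(torch.cat(seen).tolist())
    duplicates = {sample: n for sample, n in counts.items() if n > 1}
    # With correct sharding this prints an empty dict; with the bug it lists the repeats
    print("duplicated samples:", duplicates)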
Teoge changed the title from "DDP Dataloader shuffle order not synchronize between different devices" to "DDP Dataloader shuffle order not synchronize between different ddp workers" on Dec 20, 2023

SunMarc commented Dec 20, 2023

Hi @Teoge, thanks for reporting. I'm unable to reproduce the error. I'm using 5 processes with 25 elements.

tensor([2], device='cuda:4')
tensor([19], device='cuda:3')
tensor([18], device='cuda:4')
tensor([9], device='cuda:3')
tensor([11], device='cuda:4')
tensor([23], device='cuda:3')
tensor([4], device='cuda:4')
tensor([20], device='cuda:3')
tensor([0], device='cuda:4')
tensor([15], device='cuda:3')
tensor([3], device='cuda:2')
tensor([1], device='cuda:2')
tensor([14], device='cuda:2')
tensor([5], device='cuda:2')
tensor([13], device='cuda:2')
tensor([21], device='cuda:1')
tensor([10], device='cuda:0')
tensor([24], device='cuda:0')
tensor([7], device='cuda:0')
tensor([12], device='cuda:0')
tensor([22], device='cuda:0')
tensor([8], device='cuda:1')
tensor([16], device='cuda:1')
tensor([6], device='cuda:1')
tensor([17], device='cuda:1')

Do you know what might be happening, @muellerzr?


Teoge commented Dec 27, 2023

I am able to reproduce it on another machine, and it can only be reproduced with 0.25.0, not with 0.24.0 or lower versions.


SunMarc commented Dec 27, 2023

Thanks for testing again @Teoge! I can indeed reproduce it now; I was probably on another version when I first tested. To add more details, it seems that it also only happens when shuffle=True. We will fix this ASAP. cc @muellerzr

EDIT: I was able to find the PR #2126 that caused this. I also noticed that setting the seed resolves the issue, but that's not intuitive at all:

from accelerate.utils import set_seed
set_seed(42)
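
For reference, here is a sketch of that workaround applied to the reproduction script above; the seed value and placing the call before the dataloader is built are assumptions, not details from the comment.

from accelerate import Accelerator
from accelerate.utils import set_seed
from torch.utils.data import DataLoader

# Seeding every process identically keeps the shuffle order in sync,
# so the prepared dataloader shards the dataset without overlap (assumed placement).
set_seed(42)

accelerator = Accelerator()
dataloader = accelerator.prepare(DataLoader(list(range(24)), shuffle=True, batch_size=1))

for batch in dataloader:
    print(batch)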
