[Bug] DataLoader always returns batches in the same order (expected random order) #2157

Closed

allanj opened this issue Nov 16, 2023 · 8 comments

allanj commented Nov 16, 2023

System Info

`accelerate==0.24.1`

The error does not happen with accelerate `0.23.0`.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Using the following example script:

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from accelerate import Accelerator


class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# Example data: 10 samples, 5 features each
data = np.random.randn(10, 5)

# Convert to a PyTorch tensor
data = torch.tensor(data, dtype=torch.float32)

accelerator = Accelerator()
dataset = CustomDataset(data)

# Create the data loader with shuffling enabled
data_loader = DataLoader(dataset, batch_size=3, shuffle=True)
data_loader = accelerator.prepare(data_loader)

epochs = 2
for ep in range(epochs):
    for batch_idx, batch in enumerate(data_loader):
        if accelerator.is_main_process:
            print(batch)

    if accelerator.is_main_process:
        print("end batch")

Run the above in a distributed environment by running:

accelerate launch --main_process_port 9205 example.py

Expected behavior

We expect different epochs to iterate over the batches in a different order (and with different batch compositions).

But currently the two epochs produce exactly the same batches in the same order:

tensor([[ 0.7063, -0.0190, -2.6028,  0.7356,  0.9286],
        [ 0.1154, -1.0464, -0.0675, -0.3442,  0.9731],
        [ 0.8266, -1.0352, -0.6741, -0.8682, -0.5972]], device='cuda:0')
tensor([[ 0.8992, -1.0027,  0.1585,  1.6470, -1.2425],
        [-0.3033,  0.3626, -0.8370,  0.0797,  0.5549],
        [-1.3101,  0.2919, -0.5106,  0.7952,  2.1747]], device='cuda:0')
end batch
tensor([[ 0.7063, -0.0190, -2.6028,  0.7356,  0.9286],
        [ 0.1154, -1.0464, -0.0675, -0.3442,  0.9731],
        [ 0.8266, -1.0352, -0.6741, -0.8682, -0.5972]], device='cuda:0')
tensor([[ 0.8992, -1.0027,  0.1585,  1.6470, -1.2425],
        [-0.3033,  0.3626, -0.8370,  0.0797,  0.5549],
        [-1.3101,  0.2919, -0.5106,  0.7952,  2.1747]], device='cuda:0')
end batch
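
For comparison, here is a minimal standalone sketch (not part of the original report) of the expected behaviour with a plain PyTorch DataLoader and no Accelerate wrapping: with shuffle=True, the batch order should differ between epochs.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten easily identifiable samples so the shuffled order is visible in the output.
data = torch.arange(10).float().unsqueeze(1)
loader = DataLoader(TensorDataset(data), batch_size=3, shuffle=True)

for ep in range(2):
    # Each pass over the loader should visit the samples in a new random order.
    order = [batch[0].squeeze(1).tolist() for batch in loader]
    print(f"epoch {ep}: {order}")

The report above is that, after accelerator.prepare, the two epochs print identical output instead.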

Quick fixes that we tried:

  1. The data sampler seems to be the problem; calling the following before each epoch restores per-epoch shuffling (see the sketch after this list):

     data_loader.batch_sampler.batch_sampler.sampler.set_epoch(ep)

  2. Downgrading to accelerate 0.23.0 also works.
  3. Simply commenting out the following code in accelerate also works:

     if isinstance(sampler, RandomSampler):
         # When iterating through the dataloader during distributed processes
         # we want to ensure that on each process we are iterating through the same
         # samples in the same order if a seed is set. This requires a tweak
         # to the `torch.utils.data.RandomSampler` class (if used).
         sampler = SeedableRandomSampler(
             data_source=sampler.data_source,
             replacement=sampler.replacement,
             num_samples=sampler._num_samples,
             generator=getattr(sampler, "generator", torch.Generator()),
         )
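
Below is a minimal sketch of how workaround 1 slots into the reproduction script above (not verbatim from our runs). It assumes the data_loader, epochs, and accelerator objects defined there, and guards the set_epoch call since the wrapper nesting around the sampler is version-dependent.

for ep in range(epochs):
    # Attribute path taken from workaround 1 above; it may differ across accelerate versions.
    sampler = data_loader.batch_sampler.batch_sampler.sampler
    if hasattr(sampler, "set_epoch"):
        sampler.set_epoch(ep)  # re-seeds the shuffle for this epoch
    for batch in data_loader:
        if accelerator.is_main_process:
            print(batch)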

allanj (Author) commented Nov 16, 2023

It would be great to fix this quickly, as it seems to affect a lot of previous experiments that used the newer version of accelerate.

hyang0511 commented Nov 16, 2023

I have the same issue with accelerate version 0.24.1.
It would be great if this could be fixed ASAP.

muellerzr self-assigned this Nov 16, 2023

allanj (Author) commented Nov 18, 2023

I checked the master branch and the output seems fine. Does that mean the master branch fixes it? (Though it may not be an elegant fix.)

muellerzr (Collaborator) commented:

Thanks for the update! We’ll be pushing a new release next week that should include the fix. #2126

ashawkey commented Nov 30, 2023

I can confirm the master branch fixes this problem, while the latest release (0.24.1) still doesn't.
It's quite an important issue, so I hope the fix is released soon.

allanj (Author) commented Nov 30, 2023

Yes. For now I've switched back to 0.23.0, since the fix hasn't made it into a released version yet.

muellerzr (Collaborator) commented:

Sorry for the delay on this; a release will be happening Friday with the fix included!

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jan 1, 2024