
Dataloader on multi-GPU jobs only supports manipulation on local_rank=0; is there a way to manipulate on every device? #19892

Open
renren7111 opened this issue May 22, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Waiting to be triaged by maintainers) · ver: 2.2.x

Comments

@renren7111

Bug description

With the DDP strategy, the dataloader on multi-GPU jobs only supports manipulation on local_rank=0, so local rank 0 needs a very large amount of memory even though every device has the same memory. The more devices there are, the higher the memory use on local rank 0. As a result we cannot run some large cases, which need more memory, or we waste the memory on the other devices. Can you give me advice on how to use Lightning for large cases?
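A minimal sketch of the usual workaround, assuming the heavy per-device work can move out of prepare_data(): the setup() hook runs on every process, so each rank can build or load its own data there while prepare_data() stays reserved for one-time downloads. The class name and the dummy tensors below are hypothetical stand-ins:

import lightning.pytorch as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class PerDeviceDataModule(pl.LightningDataModule):
    def prepare_data(self) -> None:
        # Runs on a single process only (local_rank=0, or once per node when
        # prepare_data_per_node is True). Reserve it for one-time work such as
        # downloading to shared storage; avoid building large objects here.
        pass

    def setup(self, stage: str) -> None:
        # Runs on EVERY process, so each GPU/rank can build or load its own
        # dataset (or shard) here instead of concentrating memory on rank 0.
        data = torch.randn(1000, 8)                    # hypothetical stand-in data
        labels = torch.zeros(1000, dtype=torch.long)   # hypothetical stand-in labels
        self.train_set = TensorDataset(data, labels)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_set, batch_size=32)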

What version are you seeing the problem on?

master

How to reproduce the bug

Maybe the code below needs to be fixed:
def prepare_data(self) -> None:
    trainer = self.trainer

    # on multi-gpu jobs we only want to manipulate (download, etc) on node_rank=0, local_rank=0
    # or in the case where each node needs to do its own manipulation in which case just local_rank=0
    local_rank_zero = trainer.local_rank == 0
    global_rank_zero = trainer.local_rank == 0 and trainer.node_rank == 0

    datamodule = trainer.datamodule
    lightning_module = trainer.lightning_module
    # handle datamodule prepare data:
    if datamodule is not None and is_overridden("prepare_data", datamodule):
        prepare_data_per_node = datamodule.prepare_data_per_node
        with _InfiniteBarrier():
            if (prepare_data_per_node and local_rank_zero) or (not prepare_data_per_node and global_rank_zero):
                call._call_lightning_datamodule_hook(trainer, "prepare_data")
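For reference, prepare_data_per_node is the real Lightning attribute gating that conditional: setting it to True on the datamodule makes prepare_data() run on local_rank=0 of every node, though still not on every device. A minimal sketch (the class name is hypothetical):

import lightning.pytorch as pl

class NodeLocalDataModule(pl.LightningDataModule):
    def __init__(self) -> None:
        super().__init__()
        # When True, prepare_data() runs on local_rank=0 of each node rather
        # than only on global rank 0; there is no built-in flag to run it on
        # every device.
        self.prepare_data_per_node = True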

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

@renren7111 added the bug and needs triage labels on May 22, 2024