
Dataloader on multi-GPU jobs only supports manipulation on local_rank=0; is there a way to manipulate on every device? #19892

Open
renren7111 opened this issue May 22, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Waiting to be triaged by maintainers) · ver: 2.2.x

Comments

@renren7111

Bug description

With the DDP strategy, the dataloader on multi-GPU jobs only supports manipulation on local_rank=0, so local rank 0 needs a very large amount of memory even though every device has the same memory. The more devices there are, the higher the memory use on local rank 0. As a result we cannot run some large cases, which need more memory, or we waste the memory on the other devices. Can you give me advice on how to use Lightning for large cases?
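A minimal sketch of the usual workaround, assuming the heavy per-device work can move out of prepare_data(): the setup() hook runs on every process, so each rank can build or load its own data there while prepare_data() stays reserved for one-time downloads. The class name and the dummy tensors below are hypothetical stand-ins:

import lightning.pytorch as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class PerDeviceDataModule(pl.LightningDataModule):
    def prepare_data(self) -> None:
        # Runs on a single process only (local_rank=0, or once per node when
        # prepare_data_per_node is True). Reserve it for one-time work such as
        # downloading to shared storage; avoid building large objects here.
        pass

    def setup(self, stage: str) -> None:
        # Runs on EVERY process, so each GPU/rank can build or load its own
        # dataset (or shard) here instead of concentrating memory on rank 0.
        data = torch.randn(1000, 8)                    # hypothetical stand-in data
        labels = torch.zeros(1000, dtype=torch.long)   # hypothetical stand-in labels
        self.train_set = TensorDataset(data, labels)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_set, batch_size=32)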

What version are you seeing the problem on?

master

How to reproduce the bug

Maybe the code below needs to be fixed:
def prepare_data(self) -> None:
    trainer = self.trainer

    # on multi-gpu jobs we only want to manipulate (download, etc) on node_rank=0, local_rank=0
    # or in the case where each node needs to do its own manipulation in which case just local_rank=0
    local_rank_zero = trainer.local_rank == 0
    global_rank_zero = trainer.local_rank == 0 and trainer.node_rank == 0

    datamodule = trainer.datamodule
    lightning_module = trainer.lightning_module
    # handle datamodule prepare data:
    if datamodule is not None and is_overridden("prepare_data", datamodule):
        prepare_data_per_node = datamodule.prepare_data_per_node
        with _InfiniteBarrier():
            if (prepare_data_per_node and local_rank_zero) or (not prepare_data_per_node and global_rank_zero):
                call._call_lightning_datamodule_hook(trainer, "prepare_data")
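For reference, prepare_data_per_node is the real Lightning attribute gating that conditional: setting it to True on the datamodule makes prepare_data() run on local_rank=0 of every node, though still not on every device. A minimal sketch (the class name is hypothetical):

import lightning.pytorch as pl

class NodeLocalDataModule(pl.LightningDataModule):
    def __init__(self) -> None:
        super().__init__()
        # When True, prepare_data() runs on local_rank=0 of each node rather
        # than only on global rank 0; there is no built-in flag to run it on
        # every device.
        self.prepare_data_per_node = True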

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

@renren7111 added the bug and needs triage labels on May 22, 2024