Bug description
With the DDP strategy, data preparation on multi-GPU jobs can only be done on local_rank=0, so that process needs far more memory than the others, even though every device has the same amount of memory. The more devices there are, the higher the memory pressure on local rank 0. As a result, we cannot run some large cases that need more memory, or we end up wasting memory. Can you give me advice on how to use Lightning for large cases?
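One common workaround is to keep `prepare_data` limited to one-time, on-disk work (downloads, preprocessing) and move in-memory loading into `setup()`, which Lightning runs on every rank, so each process loads only its own shard instead of rank 0 holding everything. A minimal stdlib-only sketch of contiguous sharding (the helper name `rank_shard` is mine, not a Lightning API):

```python
def rank_shard(n_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the contiguous sample indices this rank should load.

    Each process keeps only ~1/world_size of the data in memory,
    instead of concentrating the full dataset on local_rank=0.
    """
    per_rank = (n_samples + world_size - 1) // world_size  # ceil division
    start = rank * per_rank
    return list(range(start, min(start + per_rank, n_samples)))
```

Inside a `LightningDataModule.setup()` you would call something like `rank_shard(len(dataset), trainer.world_size, trainer.global_rank)` and materialize only those indices; `torch.utils.data.DistributedSampler` achieves a similar split at the sampler level without loading per-rank copies up front.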
What version are you seeing the problem on?
master
How to reproduce the bug
Maybe we need to fix the code below:

```python
def prepare_data(self) -> None:
    trainer = self.trainer
    # on multi-gpu jobs we only want to manipulate (download, etc) on node_rank=0, local_rank=0
    # or in the case where each node needs to do its own manipulation, in which case just local_rank=0
    local_rank_zero = trainer.local_rank == 0
    global_rank_zero = trainer.local_rank == 0 and trainer.node_rank == 0
    datamodule = trainer.datamodule
    lightning_module = trainer.lightning_module
    # handle datamodule prepare data:
    if datamodule is not None and is_overridden("prepare_data", datamodule):
        prepare_data_per_node = datamodule.prepare_data_per_node
        with _InfiniteBarrier():
            if (prepare_data_per_node and local_rank_zero) or (not prepare_data_per_node and global_rank_zero):
                call._call_lightning_datamodule_hook(trainer, "prepare_data")
```
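The rank-gating condition in that hook reduces to a small predicate; a stdlib-only sketch for clarity (the function name `should_prepare` is mine, not part of Lightning):

```python
def should_prepare(prepare_data_per_node: bool, local_rank: int, node_rank: int) -> bool:
    """Decide whether this process should run prepare_data.

    prepare_data_per_node=True  -> local_rank 0 on every node runs it.
    prepare_data_per_node=False -> only local_rank 0 on node 0 runs it.
    """
    local_rank_zero = local_rank == 0
    global_rank_zero = local_rank == 0 and node_rank == 0
    return (prepare_data_per_node and local_rank_zero) or (
        not prepare_data_per_node and global_rank_zero
    )
```

Either way, exactly the processes selected here pay the memory cost of whatever `prepare_data` materializes, which is why heavy in-memory loading belongs in `setup()` instead.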
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response