# Speed analysis using distributed data parallel on 4 GPUs (single machine)

When using PyTorch Lightning for multi-GPU training instead of a self-built solution, I ran into severe speed problems. I built this repo to replicate the issues described here: Lightning-AI/pytorch-lightning#1925

What I found is that the slowdown mostly depends on the data structure used in the Dataset. We cache larger amounts of data in memory and use a list of tensors to do so, because our real tensors all have different sizes (in this example they have the same size for simplicity). As the numbers below show, a list of tensors does not work at all with the PyTorch Lightning solution; the speed impairments are severe. With a self-built solution that simply runs 4 independent processes and does not use PyTorch multiprocessing, these impairments do not occur.
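To make the two caching strategies concrete, here is a minimal sketch of such a Dataset. The class name, sample count, and tensor shape are made up for illustration; in the repo, the list variant exists because the real tensors differ in size:

```python
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Toy dataset that caches all samples in memory up front."""

    def __init__(self, num_samples=10000, use_list=True):
        if use_list:
            # One tensor object per sample: under torch.multiprocessing,
            # each entry becomes its own shared-memory allocation.
            self.data = [torch.randn(128) for _ in range(num_samples)]
        else:
            # One contiguous tensor: a single shared-memory allocation,
            # indexed per sample.
            self.data = torch.randn(num_samples, 128)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```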

PyTorch Lightning uses PyTorch multiprocessing, which moves tensors into shared memory. With a list, it opens a shared-memory file handle for each tensor entry, so you will probably need to raise your open-file limit: `ulimit -n 99999`.
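You can inspect the current file-descriptor limit and PyTorch's tensor-sharing strategy from Python before training. This is a small diagnostic sketch, not code from the repo; the `file_system` strategy mentioned in the comment is a general PyTorch option, not something the repo uses:

```python
import resource
import torch.multiprocessing as mp

# Soft/hard limits on open file descriptors for this process
# (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The default 'file_descriptor' strategy keeps one descriptor open per
# shared tensor; 'file_system' addresses segments by file name instead
# and needs far fewer descriptors, at the cost of leaking shared-memory
# files if a process crashes.
print("sharing strategy:", mp.get_sharing_strategy())
# mp.set_sharing_strategy("file_system")  # possible workaround
```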

Here are the speed results:

```bash
python minimal.py --gpus 4                            # Training time: 105 seconds.
python minimal.py --gpus 4 --use_list                 # Training time: 310 seconds, so ~3x slower.

# For the custom implementation, you need to start the run 4 times, once per GPU:
python custom.py --world_size 4 --rank 0              # Training time: 98 seconds.
python custom.py --world_size 4 --rank 1
python custom.py --world_size 4 --rank 2
python custom.py --world_size 4 --rank 3

python custom.py --world_size 4 --rank 0 --use_list   # Training time: 97 seconds.
python custom.py --world_size 4 --rank 1 --use_list
python custom.py --world_size 4 --rank 2 --use_list
python custom.py --world_size 4 --rank 3 --use_list
```

As you can see, the custom implementation has comparable runtimes in both scenarios, while the list-of-tensors approach is not feasible in the multiprocessing setup.
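For reference, a self-built setup along these lines boils down to initializing `torch.distributed` manually in each of the 4 independently started processes. The following is a hedged sketch of that general pattern, not the actual contents of `custom.py`; the address, port, and flag handling are assumptions:

```python
import argparse
import os

import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--world_size", type=int, default=4)
    parser.add_argument("--rank", type=int, required=True)
    args = parser.parse_args()

    # Each process is launched by hand (no mp.spawn), so cached tensors
    # are never pushed through torch.multiprocessing's shared memory.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=args.rank, world_size=args.world_size)

    torch.cuda.set_device(args.rank)
    # ... build the model, wrap it in
    # torch.nn.parallel.DistributedDataParallel, and run the training
    # loop on cuda:{rank} ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```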
