Distributed modes
-----------------
Lightning allows multiple ways of training; a short sketch of selecting a backend is shown after the list.

- Data Parallel (`distributed_backend='dp'`) (multiple-gpus, 1 machine)
- DistributedDataParallel (`distributed_backend='ddp'`) (multiple-gpus across many machines (python script based)).
- DistributedDataParallel (`distributed_backend='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
- DistributedDataParallel 2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- Horovod (`distributed_backend='horovod'`) (multi-machine, multi-gpu, configured at runtime)
- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)
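
As a quick orientation, here is a minimal sketch of selecting a backend through the `Trainer`; the GPU and node
counts are illustrative only.

.. code-block:: python

    from pytorch_lightning import Trainer

    # DataParallel: split each batch across the GPUs of a single machine
    trainer = Trainer(gpus=8, distributed_backend='dp')

    # DistributedDataParallel: one process per GPU, optionally across several nodes
    trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')

    # ddp2: dp inside each machine, ddp across machines
    trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp2')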

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # train on 32 GPUs (4 nodes)
    trainer = Trainer(gpus=8, distributed_backend='ddp', num_nodes=4)

This Lightning implementation of ddp calls your script under the hood multiple times with the correct environment
variables. If your code does not support this (ie: jupyter notebook, colab, or a nested script without a root package),
use `dp` or `ddp_spawn` instead.

.. code-block:: bash

    # example for 3 GPUs ddp
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --gpus 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --gpus 3 --etc
    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --gpus 3 --etc

The reason we use ddp this way is that `ddp_spawn` has a few limitations (because of Python and PyTorch):

1. Since `.spawn()` trains the model in subprocesses, the model on the main process does not get updated.
2. Dataloader(num_workers=N), where N is large, bottlenecks training with ddp...
   ie: it will be VERY slow or not work at all. This is a PyTorch limitation.
3. Forces everything to be picklable.

However, if you don't mind these limitations, please use `ddp_spawn`.

Distributed Data Parallel 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^
In certain cases, it's advantageous to use all batches on the same machine instead of a subset.
In this case, we can use ddp2, which behaves like dp in a machine and ddp across machines.

.. code-block:: python

    # train on 32 GPUs (4 nodes)
    trainer = Trainer(gpus=8, distributed_backend='ddp2', num_nodes=4)

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`ddp_spawn` is exactly like `ddp` except that it uses `.spawn()` to start the training processes.

.. warning:: It is STRONGLY recommended to use `ddp` for speed and performance.

.. code-block:: python

    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))

Here's how to call this.

.. code-block:: python

    # train on 8 GPUs (same machine (ie: node))
    trainer = Trainer(gpus=8, distributed_backend='ddp_spawn')

Use this method if your script does not support being called from the command line (ie: it is nested without a root
project module). However, we STRONGLY discourage this use because it has limitations (because of Python and PyTorch):

1. The model you pass in will not update. Please save a checkpoint and restore from there (see the sketch below).
2. Set Dataloader(num_workers=0) or it will bottleneck training.

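A minimal sketch of the save-and-restore workaround for point 1; the checkpoint path is a placeholder and `YourModel`
stands in for your own LightningModule.

.. code-block:: python

    model = YourModel()
    trainer = Trainer(gpus=2, distributed_backend='ddp_spawn')
    trainer.fit(model)

    # `model` in this (main) process was NOT updated by the spawned
    # subprocesses, so reload the weights from a checkpoint written
    # during training
    model = YourModel.load_from_checkpoint('path/to/your/checkpoint.ckpt')
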

`ddp` is MUCH faster than `ddp_spawn`. We recommend you install a top-level module for your project using setup.py:

.. code-block:: python

    # setup.py
    #!/usr/bin/env python

    from setuptools import setup, find_packages

    setup(
        name='src',
        version='0.0.1',
        description='Describe Your Cool Project',
        author='',
        author_email='',
        url='https://github.com/YourSeed',  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
        install_requires=[
            'pytorch-lightning',
        ],
        packages=find_packages(),
    )

Then set up your project like so:

.. code-block:: bash

    /project
        /src
            some_file.py
            /or_a_folder
        setup.py

Then install it as a root-level package:

.. code-block:: bash

    cd /project
    pip install -e .

Now you can call your scripts anywhere:

.. code-block:: bash

    cd /project/src
    python some_file.py --distributed_backend 'ddp' --gpus 8

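To tie this together, here is a hypothetical sketch of what `some_file.py` could look like so that it accepts the
flags used above; the argument parsing is an assumption (not part of Lightning), and `YourModel` stands in for your
own LightningModule.

.. code-block:: python

    # src/some_file.py (hypothetical)
    from argparse import ArgumentParser

    from pytorch_lightning import Trainer

    def main(args):
        model = YourModel()  # your own LightningModule
        trainer = Trainer(gpus=args.gpus, distributed_backend=args.distributed_backend)
        trainer.fit(model)

    if __name__ == '__main__':
        parser = ArgumentParser()
        parser.add_argument('--gpus', type=int, default=1)
        parser.add_argument('--distributed_backend', type=str, default='ddp')
        main(parser.parse_args())
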

Horovod
^^^^^^^
`Horovod <http://horovod.ai>`_ allows the same training script to be used for single-GPU,
multi-GPU, and multi-node training.

See the official `PytorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.

Jupyter Notebooks
-----------------
Unfortunately, none of the `ddp_` backends are supported in Jupyter notebooks. Please use `dp` for multiple GPUs.
This is a known Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
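
As a minimal sketch of the notebook-friendly path (assuming the notebook kernel can see 2 GPUs, and `YourModel` is
your own LightningModule):

.. code-block:: python

    # in a notebook cell: `dp` keeps training in the current process,
    # which is why it works where `ddp`/`ddp_spawn` do not
    model = YourModel()
    trainer = Trainer(gpus=2, distributed_backend='dp')
    trainer.fit(model)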

Pickle Errors
-------------
Multi-GPU training sometimes requires your model to be pickled. If you run into an issue with pickling,
try the following to figure out the issue:

.. code-block:: python

    import pickle

    model = YourModel()
    pickle.dumps(model)

However, if you use `ddp` the pickling requirement is not there and you should be fine. If you use `ddp_spawn`, the
pickling requirement remains. This is a limitation of Python.
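
For illustration, a common culprit is an attribute that Python cannot pickle, such as a lambda; `activation_fn`
below is a made-up attribute used only to trigger the error.

.. code-block:: python

    import pickle

    model = YourModel()
    model.activation_fn = lambda x: x * 2  # lambdas cannot be pickled

    try:
        pickle.dumps(model)
    except (pickle.PicklingError, AttributeError) as err:
        # depending on where the lambda is defined, pickle raises
        # PicklingError or AttributeError; the message names the culprit
        print(err)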