## Using `mpify` to Train Fastai2/Course-V4 Examples "Distributedly" on Multiple GPUs

### Training a `fastai2` learner on multiple processes inside a Jupyter notebook

The key is to ensure that each process has its own copy of the model, its own access to its GPU, and its own dataloader.

The `DataLoader` must be created fresh in each process, because any CUDA GPU context it initializes cannot be reused in another process.
Likewise, CUDA PyTorch tensors created in the parent Jupyter process should not be passed to the subprocesses; otherwise they incur 600MB of memory, *per subprocess*, on the original GPU associated with the tensor.

Other variables (such as `path`, the location of the untar'ed dataset, or `df`, a loaded DataFrame) and the many helper functions can be passed to the distributed training API `in_torchddp()` via its `imports=` and `need=` parameters.
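
As a rough mental model of what `imports=` and `need=` convey, the sketch below builds a fresh namespace the way a worker process might: it executes the import statements, then copies the named objects across. The helper `build_worker_namespace` and the `available` mapping are hypothetical names used only for illustration. This is not `mpify`'s implementation, and reading `need` as a space-separated list of names is an assumption based on how it is used later in this notebook.

```python
from pathlib import Path


def build_worker_namespace(imports, need, available):
    """Hypothetical helper (not part of mpify): build the globals a worker starts with."""
    namespace = {}
    # Run the import statements inside the fresh namespace.
    exec(imports, namespace)
    # Copy each space-separated name listed in `need` from the parent's objects.
    for name in need.split():
        namespace[name] = available[name]
    return namespace


# Objects that live in the parent (notebook) process.
path = Path("data") / "images"


def double(x):
    return 2 * x


imports = '''
import math
from pathlib import Path
'''
need = 'path double'

ns = build_worker_namespace(imports, need, {"path": path, "double": double})
print(sorted(name for name in ns if not name.startswith("__")))
```

Anything the worker uses that is neither imported nor listed in `need` would be missing from this namespace, which is why both parameters appear in the calls in this notebook.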

### Quick links to chapters `mpify`ed

[01_intro.ipynb](/examples/fastai2_course-v4_01_intro_distrib.ipynb)

[05_pet_breeds.ipynb](/examples/fastai2_course-v4_05_pet_breeds_distrib.ipynb)

[06_multicat.ipynb](/examples/fastai2_course-v4_06_multicat_distrib.ipynb)

[07_sizing_and_tta.ipynb](/examples/fastai2_course-v4_07_sizing_tta_distrib.ipynb)

[08_collab.ipynb](/examples/fastai2_course-v4_08_collab_distrib.ipynb)


### Below is the distributed training of the examples corresponding to fastai2 course-v4 <a href='https://github.com/fastai/course-v4/blob/master/nbs/05_pet_breeds.ipynb' target='_blank'>`05_pet_breeds.ipynb`</a>

### 05_pet_breeds.ipynb


In [None]:
from utils import *
from fastai2.data import *
from fastai2.vision.all import *

path = untar_data(URLs.PETS)

# The above are defined earlier in the notebook, included here for convenience

from mpify import in_torchddp
ngpus = 3

def fine_tune(learn:Learner, nepochs, *args, **kwargs):
    with learn.distrib_ctx(): learn.fine_tune(nepochs, *args, **kwargs)
    return learn
    
def one_cycle(learn:Learner, nepochs, *args, **kwargs):
    with learn.distrib_ctx(): learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn

def trainer(train_fn, nepochs, *args, load:str=None, **kwargs):
    pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                     get_items = get_image_files, 
                     splitter  = RandomSplitter(seed=42),
                     get_y     = using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                     item_tfms = Resize(460),
                     batch_tfms= aug_transforms(size=224, min_scale=0.75))
    dls = pets.dataloaders(path/"images")
    
    learn = cnn_learner(dls, resnet34, metrics=error_rate)
    
    if load:
        learn.load(load)
        print(f'Model and state loaded from {load}')

    learn = train_fn(learn, nepochs, *args, **kwargs)
    return learn

imports='''
from utils import *
from fastai2.data import *
from fastai2.vision.all import *
from fastai2.distributed import *
'''

need='path fine_tune one_cycle'

The cell first repeats items defined earlier in the notebook, which are needed to construct the `DataLoader` and during training.

In this chapter, `Learner.fine_tune()` and `Learner.fit_one_cycle()` are introduced and interleaved throughout.

Thus I created a wrapper for each (`fine_tune()` and `one_cycle()`), and pass the chosen wrapper to `in_torchddp()` as the first positional argument after `trainer` itself.

In [None]:
learn = in_torchddp(ngpus, trainer, fine_tune, 1, base_lr=0.1, imports=imports, need=need)
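
The calls in this notebook suggest that everything after `trainer` in the `in_torchddp()` argument list (apart from `imports=` and `need=`) is handed on to `trainer()`, which in turn hands the remainder to the chosen wrapper. The single-process sketch below traces that forwarding with stand-ins: `FakeLearner` and `run_in_one_process` are hypothetical and only record what they receive. They are not `mpify` or fastai2 code, and the forwarding rule itself is an assumption inferred from the calls shown here.

```python
class FakeLearner:
    """Hypothetical stand-in for a fastai2 Learner: records calls, trains nothing."""

    def __init__(self):
        self.calls = []

    def fine_tune(self, nepochs, *args, **kwargs):
        self.calls.append(("fine_tune", nepochs, args, kwargs))

    def fit_one_cycle(self, nepochs, *args, **kwargs):
        self.calls.append(("fit_one_cycle", nepochs, args, kwargs))


def fine_tune(learn, nepochs, *args, **kwargs):
    learn.fine_tune(nepochs, *args, **kwargs)
    return learn


def one_cycle(learn, nepochs, *args, **kwargs):
    learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn


def trainer(train_fn, nepochs, *args, load=None, **kwargs):
    learn = FakeLearner()  # stands in for building the DataLoaders and cnn_learner
    if load:
        learn.calls.append(("load", load))
    return train_fn(learn, nepochs, *args, **kwargs)


def run_in_one_process(ngpus, fn, *args, imports=None, need=None, **kwargs):
    """Hypothetical stand-in for in_torchddp(): no subprocesses, just forwarding."""
    return fn(*args, **kwargs)


learn = run_in_one_process(3, trainer, one_cycle, 6, lr_max=1e-5, imports="", need="")
print(learn.calls)
```

Note that `load=` is consumed by `trainer()` because it is a named parameter there, while every other keyword argument travels on to the wrapper.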

In [None]:
lr_min,lr_steep = learn.lr_find()

In [None]:
# to perform learn.fine_tune(2, base_lr=3e-3) on multiple GPUs in DDP

learn = in_torchddp(ngpus, trainer, fine_tune, 2, base_lr=3e-3, imports=imports, need=need)

In [None]:
# to perform learn.fit_one_cycle(3, 3e-3) on multiple GPUs in DDP

learn = in_torchddp(ngpus, trainer, one_cycle, 3, 3e-3, imports=imports, need=need)

In [None]:
learn.unfreeze()
learn.lr_find()

In [None]:
# to perform learn.fit_one_cycle(6, lr_max=1e-5) on multiple GPUs in DDP

learn = in_torchddp(ngpus, trainer, one_cycle, 6, lr_max=1e-5, imports=imports, need=need)

### 05 pet breeds: Discriminative Learning Rates

To perform multiple stages of training, e.g.:

```python
    learn.fit_one_cycle(3, 3e-3)
    learn.unfreeze()
    learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))
```

we need to *save the model state* before subsequent calls to `fit_one_cycle()`. Then we tell `in_torchddp()` to load from that file using `load=file`.

In [None]:
learn = in_torchddp(ngpus, trainer, one_cycle, 3, 3e-3,
                    imports=imports, need="one_cycle")
learn.unfreeze()
learn.save("after_unfreeze", with_opt=True, pickle_protocol=4)

learn = in_torchddp(ngpus, trainer, one_cycle, 12, lr_max=slice(1e-6,1e-4),
                    load="after_unfreeze", imports=imports, need="one_cycle")
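
The hand-off through a file is needed because each `in_torchddp()` call runs `trainer()` again, and `trainer()` builds a new learner every time. The toy below mimics that with plain Python objects. `ToyLearner` and `toy_trainer` are hypothetical stand-ins, and `pickle` takes the place of fastai2's checkpoint format, so this is an illustration of the idea rather than of the real save/load mechanics.

```python
import pickle
import tempfile
from pathlib import Path


class ToyLearner:
    """Hypothetical learner whose 'training' just adds the epoch count to its weights."""

    def __init__(self):
        self.weights = [0.0, 0.0]
        self.epochs_trained = 0

    def fit(self, nepochs):
        self.weights = [w + nepochs for w in self.weights]
        self.epochs_trained += nepochs

    def save(self, file):
        Path(file).write_bytes(pickle.dumps(self.__dict__))

    def load(self, file):
        self.__dict__.update(pickle.loads(Path(file).read_bytes()))


def toy_trainer(nepochs, load=None):
    learn = ToyLearner()  # every call starts from a freshly built learner
    if load:
        learn.load(load)  # restore the earlier stage before training further
    learn.fit(nepochs)
    return learn


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "stage1.pkl"
    stage1 = toy_trainer(3)
    stage1.save(checkpoint)

    without_load = toy_trainer(12)                # forgets stage 1
    with_load = toy_trainer(12, load=checkpoint)  # continues from stage 1

print(without_load.epochs_trained, with_load.epochs_trained)
```

Without `load=`, the second stage starts from scratch; with it, the second stage continues from the state saved after the first.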

### 05 pet breeds: Deeper Architectures

To perform:
```python
from fastai2.callback.fp16 import *
learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
```

We copy `trainer()` into a new function `trainer_fp16_resnet50()`, replacing `resnet34` with `resnet50` and adding `.to_fp16()`, then pass it to `in_torchddp()`. Because the new function calls `untar_data()` itself, `need` only has to list `fine_tune`:

In [None]:
def trainer_fp16_resnet50(train_fn, nepochs, *args, load:str=None, **kwargs):

    path = untar_data(URLs.PETS)

    pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                     get_items = get_image_files, 
                     splitter  = RandomSplitter(seed=42),
                     get_y     = using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                     item_tfms = Resize(460),
                     batch_tfms= aug_transforms(size=224, min_scale=0.75))
    dls = pets.dataloaders(path/"images")
    
    # Use resnet50, and half precision.
    learn = cnn_learner(dls, resnet50, metrics=error_rate).to_fp16()
    
    if load:
        learn.load(load)
        print(f'Model and state loaded from {load}')

    learn = train_fn(learn, nepochs, *args, **kwargs)
    return learn

learn = in_torchddp(ngpus, trainer_fp16_resnet50, fine_tune, 6, freeze_epochs=3,
                    imports=imports, need="fine_tune")