## Using `mpify` to Train Fastai2/Course-V4 Examples "Distributedly" on Multiple GPUs

### To train a `fastai2` learner in multiple processes inside a Jupyter notebook
1. Ensure each process has its own copy of the model, its own dataloader, and access to its own GPU.

    * The `DataLoader` must be created fresh in each process, because any CUDA GPU context it initializes cannot be reused in another process.
    CUDA PyTorch tensors created in the parent Jupyter process should not be passed to the subprocesses; otherwise each one takes up 600MB of memory, *per subprocess*, on the original GPU associated with the tensor.

    * Other variables (such as `path`, the location of the untar'ed dataset, or `df`, a loaded DataFrame) and the many helper functions can be passed to the distributed training API `in_torchddp()` via its `imports=` and `need=` parameters.

2. In each process, run `from fastai2.distributed import *`, and wrap the fitting call in `with learn.distrib_ctx():`.

### Quick links to course-v4 chapters `mpify`-ed:

[01_intro.ipynb](/examples/fastai2_course-v4-01_intro_distrib.ipynb)

[05_pet_breeds.ipynb](#01petbreeds)

[06_multicat.ipynb](/examples/fastai2_course-v4_06_multicat_distrib.ipynb)

[07_sizing_and_tta.ipynb](/examples/fastai2_course-v4_07_sizing_tta_distrib.ipynb)

[08_collab.ipynb](/examples/fastai2_course-v4_08_collab_distrib.ipynb)

### Below are the distributed training examples corresponding to fastai2 course-v4 <a href='https://github.com/fastai/course-v4/blob/master/nbs/07_sizing_and_tta.ipynb' target='_blank'>`07_sizing_and_tta.ipynb`</a>

### <a name='07sizingtta'></a> 07 Sizing and TTA

In this chapter, the `Imagenette` dataset is trained with incremental optimizations: training tricks are added one by one. First normalization, then progressive resizing, then a choice between `learn.fine_tune()` and `learn.fit_one_cycle()`.

For clarity, I created two pairs of `get_dls*()` and `train_imgnette*()` functions:

The first pair is for the base case: no normalization, one image size, and training with `fit_one_cycle()`.

The second pair adds normalization, allows a custom batch size and image size, offers a choice of `fit_one_cycle` vs `fine_tune`, and supports loading from a saved model state file.



In [None]:
from mpify import in_torchddp
from fastai2.vision.all import *

path = untar_data(URLs.IMAGENETTE)

# First pair: dataloader factory and training function, with all values hard-coded.
def get_dls_basic():
    dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=aug_transforms(size=224, min_scale=0.75))
    dls = dblock.dataloaders(path, bs=64)
    return dls

def train_imgnette_basic(nepochs, lr, *args, **kwargs):
    dls = get_dls_basic() 
    model = xresnet50()
    learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
    
    with learn.distrib_ctx():
        learn.fit_one_cycle(nepochs, lr)
    
    return learn

# Second pair allows several customizable parameters
def get_dls_general(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                            Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)

def train_imgnette_general(nepochs, lr, dls_bs, dls_size, *args, fine_tune:bool=False, load:str=None, **kwargs):
    dls = get_dls_general(dls_bs, dls_size) 
    model = xresnet50()
    learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
    
    if load:
        learn.load(load)
        print(f"Model state loaded from {load}")
        
    with learn.distrib_ctx():
        if fine_tune: learn.fine_tune(nepochs, lr)
        else: learn.fit_one_cycle(nepochs, lr)
    
    return learn


imports='''
from utils import *
from fastai2.vision.all import *
from fastai2.distributed import *
'''

need='get_dls_basic get_dls_general path'
ngpus=3
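
One way to picture what the `imports` string and the `need` name list above are for: each subprocess starts with an empty namespace, so the import statements have to be re-run there and the named objects copied over. The sketch below is a conceptual illustration only and is not `mpify`'s actual implementation; `build_worker_namespace` and the toy `data_root` / `double` names are hypothetical.

```python
def build_worker_namespace(imports, need, caller_ns):
    """Build a fresh namespace the way a worker process might see it.

    Hypothetical helper: runs the import statements in `imports`, then
    copies each space-separated name in `need` from `caller_ns`.
    """
    ns = {}
    exec(imports, ns)  # re-run the import statements in the new namespace
    for name in need.split():
        if name not in caller_ns:
            raise NameError(f"'{name}' is listed in need= but not defined by the caller")
        ns[name] = caller_ns[name]
    return ns


# Toy stand-ins for the notebook's `imports`, `path`, and helper functions.
toy_imports = '''
import math
from os import path as osp
'''
data_root = "/tmp/data"          # hypothetical dataset location

def double(x):
    return 2 * x

caller_ns = {"data_root": data_root, "double": double}
ns = build_worker_namespace(toy_imports, "data_root double", caller_ns)
print(sorted(k for k in ns if not k.startswith("__")))
```

Seen this way, a name that the training function uses but that appears in neither `imports` nor `need` would simply be undefined inside the subprocess.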


In [None]:
# Base case
nepochs, lr = 5, 3e-3
learn = in_torchddp(ngpus, train_imgnette_basic, nepochs, lr, imports=imports, need=need)

# Then use normalization
nepochs, lr = 5, 3e-3
dls_bs, dls_size = 64, 224
learn = in_torchddp(ngpus, train_imgnette_general, nepochs, lr, dls_bs, dls_size, 
                    imports=imports, need=need)

### 07 Sizing & TTA: Progressive resizing

Save the model after one stage, then fine-tune at a different size, starting from that saved state.

In [None]:
# Progressive Resizing.  Save the intermediate model state

nepochs, lr = 4, 3e-3
dls_bs, dls_size = 128, 128
learn = in_torchddp(ngpus, train_imgnette_general, nepochs, lr, dls_bs, dls_size,
                    imports=imports, need=need)

saved = '4epochs_128_128'
learn.save(saved)

# Then fine_tune at a different size, starting from the saved state
nepochs, lr = 5, 1e-3
dls_bs, dls_size = 64, 224
learn = in_torchddp(ngpus, train_imgnette_general, nepochs, lr, dls_bs, dls_size,
                    load=saved, fine_tune=True, imports=imports, need=need)
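
The two-stage sequence above generalizes to any number of sizes. The sketch below shows one possible way to drive it from a list of stages; `Stage`, `run_progressive`, and the `launch` callable are hypothetical names, not part of `mpify` or `fastai2`. The launcher is passed in as a parameter, so the control flow can be dry-run with a fake before spending GPU time.

```python
from collections import namedtuple

# One progressive-resizing stage: epochs, learning rate, batch size, image size.
Stage = namedtuple("Stage", "nepochs lr bs size")


def run_progressive(launch, stages):
    """Run each stage via `launch`, saving between stages and resuming from the save."""
    learn, saved = None, None
    for i, stage in enumerate(stages):
        # Every stage after the first resumes from the previous checkpoint.
        extra = {"load": saved, "fine_tune": True} if saved else {}
        learn = launch(stage.nepochs, stage.lr, stage.bs, stage.size, **extra)
        if i < len(stages) - 1:
            saved = f"{stage.nepochs}epochs_{stage.bs}_{stage.size}"
            learn.save(saved)
    return learn


stages = [Stage(4, 3e-3, 128, 128), Stage(5, 1e-3, 64, 224)]

# Dry run with a fake launcher and learner that only record what happens.
class _FakeLearner:
    def __init__(self):
        self.saved = []
    def save(self, name):
        self.saved.append(name)

calls, fake = [], _FakeLearner()

def fake_launch(*args, **kwargs):
    calls.append((args, kwargs))
    return fake

run_progressive(fake_launch, stages)

# With the real API, the launcher could be (assuming the names defined earlier):
# launch = lambda *a, **kw: in_torchddp(ngpus, train_imgnette_general, *a,
#                                       imports=imports, need=need, **kw)
```

With the two stages listed, the dry run reproduces the calls made by hand in the cell above: a first run at 128px, one save named `4epochs_128_128`, then a `fine_tune` run at 224px loading that file.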