## Using `mpify` to Train Fastai2/Course-V4 Examples "Distributedly" on Multiple GPUs

###  To train a `fastai2` learner on multiple processes inside a Jupyter notebook

The key is to ensure that each process has its own copy of the model, its own access to a GPU, and its own dataloader.

The `DataLoader` must be created fresh in each process, because any CUDA GPU context it initializes cannot be reused in another process.
CUDA PyTorch tensors created in the parent Jupyter process should not be passed to the subprocesses either; otherwise each one costs about 600MB of memory, *per subprocess*, on the original GPU associated with the tensor.
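
One way to picture "each process has its own access to its GPU" is the per-process environment that PyTorch's `env://` rendezvous reads. The sketch below is an illustration under assumptions: `ddp_env` is a hypothetical helper and not part of `mpify`, the address and port defaults are arbitrary, and a single machine with one GPU per rank is assumed.

```python
def ddp_env(rank: int, world_size: int,
            addr: str = "127.0.0.1", port: int = 29500) -> dict:
    """Environment variables for one worker process (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    return {
        "MASTER_ADDR": addr,            # where rank 0 listens for the others
        "MASTER_PORT": str(port),       # arbitrary port (assumption)
        "WORLD_SIZE": str(world_size),  # total number of worker processes
        "RANK": str(rank),              # this worker's global index
        "LOCAL_RANK": str(rank),        # single machine: GPU index == rank
    }

# One environment per worker, for three GPUs.
envs = [ddp_env(rank, 3) for rank in range(3)]
```

Each worker would then select the GPU named by its own `LOCAL_RANK` and build its model and dataloader there, rather than inheriting CUDA state from the notebook process.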

Other variables (such as `path`, the location of the untar'ed dataset, or `df`, a loaded DataFrame) and the many helper functions can be passed to the distributed training API `in_torchddp()` via its `imports=` and `need=` parameters.
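
To make the roles of `imports=` and `need=` concrete, here is a minimal in-process sketch of the idea. It is not `mpify`'s implementation: `build_worker_namespace` and `parent_scope` are hypothetical names, and a real launcher would also have to serialize the objects and send them to another process.

```python
def build_worker_namespace(imports: str, need: str, parent_scope: dict) -> dict:
    """Rebuild, in a fresh dict, the names a target function expects to find."""
    ns = {}
    exec(imports, ns)                    # run the import statements
    for name in need.split():            # `need` is a space-separated list of names
        ns[name] = parent_scope[name]    # raises KeyError if a name is missing
    return ns

# Stand-ins for objects created in the notebook before training starts.
parent_scope = {"path": "/tmp/ml-100k", "ratings": [(196, 242, 3)]}

imports = """
import math
from collections import Counter
"""

ns = build_worker_namespace(imports, "path ratings", parent_scope)
```

The import statements are replayed rather than copied because modules generally cannot be shipped between processes, while plain data such as a path or a DataFrame can.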

Links to `mpify`-ed `fastai2` course-v4 notebook examples:

[01_intro.ipynb](/examples/fastai2_course-v4_01_intro_distrib.ipynb)

[05_pet_breeds.ipynb](/examples/fastai2_course-v4_05_pet_breeds_distrib.ipynb)

[06_multicat.ipynb](/examples/fastai2_course-v4_06_multicat_distrib.ipynb)

[07_sizing_and_tta.ipynb](/examples/fastai2_course-v4_07_sizing_tta_distrib.ipynb)

[08_collab.ipynb](/examples/fastai2_course-v4_08_collab_distrib.ipynb)

### Below are distributed training examples corresponding to fastai2 course-v4 <a href='https://github.com/fastai/course-v4/blob/master/nbs/08_collab.ipynb' target='_blank'>`08_collab.ipynb`</a>

### <a name='08collab'></a> 08 Collab

In the first five training examples in notebook 08_collab, several variables depend on `dls`, so we move them inside the target function, after the dataloader `dls` is created.

Variables created before `dls` stay at the global scope and are passed in via `need=`.

Since three versions of the `DotProduct` class are used, we pass the class in as the parameter `dp`.

In [None]:
from utils import *
from fastai2.collab import *
from fastai2.tabular.all import *

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])

movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)

ratings = ratings.merge(movies)

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

def train_dotproduct(nepochs, *args, load:str=None, dp=DotProduct, **kwargs):

    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
    
    n_users  = len(dls.classes['user'])
    n_movies = len(dls.classes['title'])
    n_factors = 5
    user_factors = torch.randn(n_users, n_factors)
    movie_factors = torch.randn(n_movies, n_factors)

    model = dp(n_users, n_movies, 50)
    learn = Learner(dls, model, loss_func=MSELossFlat())
    
    if load: learn.load(load); print(f'Model and state loaded from {load}')

    with learn.distrib_ctx():
        learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn

from mpify import in_torchddp
ngpus = 3

imports='''
from utils import *
from fastai2.collab import *
from fastai2.tabular.all import *
from fastai2.distributed import *
'''

need="path ratings"

learn = in_torchddp(ngpus, train_dotproduct, 5, 5e-3, dp=DotProduct, imports=imports, need=need)
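
As a worked illustration of what `DotProduct.forward` computes, the plain-Python sketch below scores a batch of (user, movie) index pairs. The function name and the tiny factor tables are made up for the example; the real model does the same arithmetic on `Embedding` tensors.

```python
def dot_product_scores(user_factors, movie_factors, batch):
    """Score each (user, movie) index pair by the dot product of its factor rows."""
    scores = []
    for user_idx, movie_idx in batch:
        u = user_factors[user_idx]       # like self.user_factors(x[:,0])
        m = movie_factors[movie_idx]     # like self.movie_factors(x[:,1])
        scores.append(sum(a * b for a, b in zip(u, m)))  # (users * movies).sum(dim=1)
    return scores

user_factors = [[1.0, 0.0], [0.5, 2.0]]                # 2 users, 2 factors each
movie_factors = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]   # 3 movies, 2 factors each
batch = [(0, 0), (1, 1), (1, 2)]                       # (user index, movie index)

scores = dot_product_scores(user_factors, movie_factors, batch)
# scores == [2.0, 6.0, 2.5]
```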

The notebook then modifies the `DotProduct` class.

In [None]:
# A new dotproduct class
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

learn = in_torchddp(ngpus, train_dotproduct, 5, 5e-3, dp=DotProduct, imports=imports, need=need)
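
The `sigmoid_range` call squashes the raw score into the interval given by `y_range`. A scalar sketch of that mapping, assuming the usual definition `sigmoid(x) * (high - low) + low` (the name `sigmoid_range_scalar` is ours, not fastai2's):

```python
import math

def sigmoid_range_scalar(x: float, low: float, high: float) -> float:
    """Map any real score into the open interval (low, high)."""
    sigmoid = 1.0 / (1.0 + math.exp(-x))   # fine for moderate x; overflows for very negative x
    return sigmoid * (high - low) + low

mid_score = sigmoid_range_scalar(0.0, 0, 5.5)     # a raw score of 0 lands mid-range
high_score = sigmoid_range_scalar(10.0, 0, 5.5)   # large scores approach 5.5
low_score = sigmoid_range_scalar(-10.0, 0, 5.5)   # very negative scores approach 0
```

Because the sigmoid never quite reaches its limits, the upper bound of `(0, 5.5)` sits slightly above the top rating of 5, so a prediction of 5 remains reachable.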

For the new variant `DotProductBias`, we update `dp` in the `in_torchddp(...)` call:

In [None]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
    
learn = in_torchddp(ngpus, train_dotproduct, 5, 5e-3, dp=DotProductBias, imports=imports, need=need)


### 08 Collab . Weight Decay, and Create Our Own Embedding Module

The chapter proceeds to add weight decay to the training loop. The extra keyword argument `wd` given to `in_torchddp()` lands in the target function's `**kwargs`, which forwards it to `fit_one_cycle()`.

In [None]:
learn = in_torchddp(ngpus, train_dotproduct, 5, 5e-3, wd=0.1, dp=DotProductBias, imports=imports, need=need)
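
The path that `wd=0.1` takes can be sketched with stand-in functions. Everything here is hypothetical: `launch` only mimics the calling convention of a launcher such as `in_torchddp()` (it consumes its own options and forwards the rest), and `fit` stands in for `fit_one_cycle`.

```python
def launch(n_workers, fn, *args, imports=None, need=None, **kwargs):
    """Hypothetical launcher: keep its own options, forward everything else to fn."""
    # A real launcher would start `n_workers` processes; this sketch calls fn once.
    return fn(*args, **kwargs)

def fit(nepochs, lr, wd=None):
    """Stand-in for the fit call; it just reports what it received."""
    return {"nepochs": nepochs, "lr": lr, "wd": wd}

def train(nepochs, *args, load=None, **kwargs):
    # Forwarding **kwargs here is what lets `wd` reach the fit call.
    return fit(nepochs, *args, **kwargs)

result = launch(3, train, 5, 5e-3, wd=0.1, imports="import math", need="ratings")
# result == {"nepochs": 5, "lr": 0.005, "wd": 0.1}
```

If `train` accepted `**kwargs` but did not pass them on, `wd` would be dropped silently and training would run with the default weight decay.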

To further customize `DotProductBias` with a new local function `create_params`, add the function to `need=`:

In [None]:
# Creating Our Own Embedding Module
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

need = f"{need} create_params"  # add create_params
learn = in_torchddp(ngpus, train_dotproduct, 5, 5e-3, wd=0.1, dp=DotProductBias, imports=imports, need=need)


The tedious steps are all abstracted away in `collab_learner`. See how much shorter the new training function `train_collab()` is:

In [None]:
def train_collab(nepochs, *args, load:str=None, **kwargs):
    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)

    learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
    
    if load: learn.load(load); print(f'Model and state loaded from {load}')

    # Forward **kwargs so that options such as wd=0.1 actually reach fit_one_cycle
    with learn.distrib_ctx(): learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn

learn = in_torchddp(ngpus, train_collab, 5, 5e-3, wd=0.1, imports=imports, need=need)


### 08 Collab . Bootstrapping Collab Model

The notebook then builds the model in two ways.

1. Build a PyTorch model using the embedding sizes from the dataloader, and pass it to `Learner`. We have to add `embs` and `CollabNN` to `need=`.
2. Alternatively, let `collab_learner(.., use_nn=True, y_range=.., layers=[..])` take care of the same details.


In [None]:
embs = get_emb_sz(learn.dls)

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

def train_bootstrap(nepochs, *args, load:str=None, **kwargs):
    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)

    model = CollabNN(*embs)
    
    learn = Learner(dls, model, loss_func=MSELossFlat())    
    if load: learn.load(load); print(f'Model and state loaded from {load}')

    # Forward **kwargs so that options such as wd=0.1 actually reach fit_one_cycle
    with learn.distrib_ctx(): learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn

def train_collab_learner(nepochs, *args, load:str=None, **kwargs):
    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
    
    learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])    
    if load: learn.load(load); print(f'Model and state loaded from {load}')

    # Forward **kwargs so that options such as wd=0.1 actually reach fit_one_cycle
    with learn.distrib_ctx(): learn.fit_one_cycle(nepochs, *args, **kwargs)
    return learn

need='ratings CollabNN embs'
learn = in_torchddp(ngpus, train_bootstrap, 5, 5e-3, wd=0.1, imports=imports, need=need)

learn = in_torchddp(ngpus, train_collab_learner, 5, 5e-3, wd=0.1, imports=imports, need=need)
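
The `embs` passed to `CollabNN` are `(number of categories, embedding width)` pairs produced by `get_emb_sz`. The sketch below re-implements the sizing heuristic in plain Python, assuming fastai2's default rule `min(600, round(1.6 * n_cat**0.56))`. The helper `emb_sizes` and the two cardinalities are illustrative, not taken from a run of this notebook.

```python
def emb_sz_rule(n_cat: int) -> int:
    """Embedding width for a categorical variable with n_cat levels (assumed default rule)."""
    return min(600, round(1.6 * n_cat ** 0.56))

def emb_sizes(cardinalities):
    """Pair each cardinality with its embedding width, like the entries of `embs`."""
    return [(n, emb_sz_rule(n)) for n in cardinalities]

sizes = emb_sizes([944, 1635])    # illustrative user and item counts
# sizes == [(944, 74), (1635, 101)]

# CollabNN concatenates both embeddings, so its first Linear layer takes this many inputs:
first_layer_inputs = sizes[0][1] + sizes[1][1]
```

With these sizes, `nn.Linear(user_sz[1]+item_sz[1], n_act)` in `CollabNN` would receive 175 input features.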