In this series, I want to discuss the creation of a small library for training neural networks: `nntrain`. It's based on the excellent [part 2](https://course.fast.ai/) of Practical Deep Learning for Coders by Jeremy Howard, which develops the `miniai` library over (roughly) lessons 13 to 18.

The library will build upon PyTorch. However, we'll try as much as possible to build from scratch, mainly to understand how it all works. Once the main functionality of a component is implemented and verified, we can switch over to PyTorch's version. This is similar to how things are done in the course. However, this is not just a "copy / paste" of the course: on many occasions I take a different route, and most of the code is my own. In my not so humble opinion the narrative presented here is slightly better 🤷‍♂️😱.

As we'll see, the library will be built using [`nbdev`](https://nbdev.fast.ai/), another great project from the fastai community. With this software, it becomes very straightforward to create Python libraries which are exported from Jupyter notebooks. This may sound a bit weird, but it has the advantage that we can create the source code for our library in the very same environment in which we want to experiment and interact with our methods, objects and structure **while we are building the library**. For more details on why this is a good idea, see [here](https://www.fast.ai/posts/2022-07-28-nbdev2.html).

So without further ado, let's start with where we left off in the previous [post](https://lucasvw.github.io/posts/08_nntrain_setup/):

## End of last post:

In [None]:
from datasets import load_dataset,load_dataset_builder
import torchvision.transforms.functional as TF   # to transform from PIL to tensor
import torch
import torch.nn as nn
import torch.nn.functional as F

name = "fashion_mnist"
ds_builder = load_dataset_builder(name)
ds_hf = load_dataset(name, split='train')

x_train = torch.stack([TF.to_tensor(i).view(-1) for i in ds_hf['image']])
y_train = torch.stack([torch.tensor(i) for i in ds_hf['label']])

def fit(epochs):
    for epoch in range(epochs):
        for i in range(0,len(x_train), bs):
            xb = x_train[i:i+bs]
            yb = y_train[i:i+bs]

            preds = model(xb)
            acc = accuracy(preds, yb)
            loss = loss_func(preds, yb)
            loss.backward()

            opt.step()
            opt.zero_grad()
        print(f'{epoch=} | {loss=:.3f} | {acc=:.3f}')

def accuracy(preds, targs):
    return (preds.argmax(dim=1) == targs).float().mean()        

def get_model_opt():
    layers = [nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
    model = nn.Sequential(*layers)
    
    opt = torch.optim.SGD(model.parameters(), lr)
    
    return model, opt

n_in  = 28*28
n_h   = 50
n_out = 10
lr    = 0.01
bs    = 1024
loss_func = F.cross_entropy

model, opt = get_model_opt()
fit(5)

epoch=0 | loss=2.108 | acc=0.393
epoch=1 | loss=1.854 | acc=0.503
epoch=2 | loss=1.597 | acc=0.607
epoch=3 | loss=1.390 | acc=0.622
epoch=4 | loss=1.238 | acc=0.640


## Datasets:

For the next refactor, we want to tackle the minibatch construct we currently have in the training loop:

In [None]:
# ...
# for i in range(0,len(x_train), bs):
#     xb = x_train[i:i+bs]
#     yb = y_train[i:i+bs]
# ...

Instead of doing this manually, we would like to do something like this:

In [None]:
# ...
# for i in range(0,len(x_train), bs):
#     xb, yb = dataset[i:i+bs]
# ...

This is pretty straightforward: we just need something we can index into that returns a tuple of the data:

In [None]:
class Dataset():
    
    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
        
    def __getitem__(self, i):
        return self.x_train[i], self.y_train[i]
    
    def __len__(self):
        return len(self.x_train)

In [None]:
ds = Dataset(x_train, y_train)
print([i.shape for i in ds[0]])

[torch.Size([784]), torch.Size([])]


Because tensors behave very similarly to NumPy arrays, this simple class already gives us the behavior we need, i.e. slicing into the dataset:

In [None]:
ds[0:5]

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([9, 0, 0, 3, 0]))
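
As a side note on why slicing works without any extra code: Python hands whatever sits between the square brackets to `__getitem__` unchanged, and our `Dataset` simply forwards it to the tensors. A minimal sketch of that mechanism, using a hypothetical `Probe` class (not part of the library) that returns the index object it receives:

```python
class Probe:
    """Hypothetical helper: returns whatever index object it receives."""

    def __getitem__(self, i):
        return i


p = Probe()
print(p[0])       # 0
print(p[0:5])     # slice(0, 5, None)
print(p[[3, 1]])  # [3, 1]
```

An integer, a `slice` object or a list of indices all arrive as-is, so as long as the underlying tensors accept them, the dataset does too. The list-of-indices case becomes relevant once we start shuffling.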

Next, we want to further improve the training loop and get to this behavior:

In [None]:
# ...
# for xb, yb in dataloader:
# ...

So our dataloader needs to wrap the dataset, and provide some kind of an iterator returning batches of data, based on the specified batch size. Let's create one:

In [None]:
class DataLoader():
    
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        
    def __iter__(self):
        for i in range(0,len(self.dataset),self.batch_size):
            yield self.dataset[i:i+self.batch_size]

In [None]:
dl = DataLoader(ds, bs)

In [None]:
def fit(epochs):
    for epoch in range(epochs):
        for xb, yb in dl:
            preds = model(xb)
            acc = accuracy(preds, yb)
            loss = loss_func(preds, yb)
            loss.backward()

            opt.step()
            opt.zero_grad()
        print(f'{epoch=} | {loss=:.3f} | {acc=:.3f}')

In [None]:
model, opt = get_model_opt()
fit(5)

epoch=0 | loss=2.072 | acc=0.326
epoch=1 | loss=1.837 | acc=0.572
epoch=2 | loss=1.580 | acc=0.635
epoch=3 | loss=1.361 | acc=0.650
epoch=4 | loss=1.205 | acc=0.656
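
One detail worth spelling out: the same `dl` object is iterated once per epoch. This works because `__iter__` is a generator function, so every `for` loop receives a fresh iterator. Here is a small sketch with a hypothetical `ListLoader`, backed by a plain list instead of our tensor dataset:

```python
class ListLoader:
    """Hypothetical stand-in for the DataLoader above, backed by a plain list."""

    def __init__(self, items, batch_size):
        self.items = items
        self.batch_size = batch_size

    def __iter__(self):
        # Generator function: each call returns a brand-new iterator.
        for i in range(0, len(self.items), self.batch_size):
            yield self.items[i:i + self.batch_size]


loader = ListLoader(list(range(10)), batch_size=4)
first_epoch = [len(batch) for batch in loader]
second_epoch = [len(batch) for batch in loader]
print(first_epoch, second_epoch)  # [4, 4, 2] [4, 4, 2]
```

Note also that the last batch is smaller whenever the dataset size is not a multiple of the batch size.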


## Next up: shuffling the data

The above training loop already looks pretty good: it's small, concise and fairly generic. The next improvement doesn't change the code of the training loop, but improves the training of the model. So far, we cycle through the data in the exact same order each epoch, which means every training sample is always batched together with the exact same other samples. This is not good for training our model. Instead, we want to shuffle the data, so that each epoch produces batches with a different composition. This additional variety generally helps the model to generalize.

The simplest implementation would be to create a list of indices, which we put in between the dataset and the sampling of the mini-batches. In case we don't need to shuffle, this list will just be `[0, 1, ..., len(dataset) - 1]`.

In [None]:
import random

class DataLoader():
    
    def __init__(self, dataset, batch_size, shuffle):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        
    def __iter__(self):
        self.indices = list(range(0, len(self.dataset)))
        if self.shuffle: 
            random.shuffle(self.indices)
            
        for i in range(0,len(self.dataset),self.batch_size):
            yield self.dataset[self.indices[i:i+self.batch_size]]

In [None]:
model, opt = get_model_opt()
dl = DataLoader(ds, bs, shuffle=True)
fit(5)

epoch=0 | loss=2.099 | acc=0.400
epoch=1 | loss=1.843 | acc=0.521
epoch=2 | loss=1.583 | acc=0.625
epoch=3 | loss=1.364 | acc=0.678
epoch=4 | loss=1.249 | acc=0.638
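
To make the effect of shuffling visible without training anything, here is a small sketch of the same index logic in isolation. The helper `epoch_batches` and the seeded `random.Random` instance are assumptions made for this illustration only; the `DataLoader` above uses the module-level `random.shuffle`:

```python
import random


def epoch_batches(n_items, batch_size, rng):
    """Yield index batches covering range(n_items) in a freshly shuffled order."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    for i in range(0, n_items, batch_size):
        yield indices[i:i + batch_size]


rng = random.Random(0)  # seeded only so the sketch is reproducible
epoch_1 = list(epoch_batches(10, 4, rng))
epoch_2 = list(epoch_batches(10, 4, rng))

print(epoch_1)
print(epoch_2)
```

Each epoch still visits every index exactly once, but the grouping into batches changes from epoch to epoch.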


This works just fine, but how can we encapsulate this logic in a separate class? Let's start with a simple `Sampler` class that we can iterate through and either gives indices in order, or shuffled:

In [None]:
class Sampler():
    def __init__(self, ds, shuffle=False):
        self.range = list(range(0, len(ds)))
        self.shuffle = shuffle
        
    def __iter__(self):
        if self.shuffle: random.shuffle(self.range)
        for i in self.range:
            yield i

In [None]:
s = Sampler(ds, False)           # shuffle = False
for i, sample in enumerate(s): 
    print(sample, end=', ')
    if i == 5: break

0, 1, 2, 3, 4, 5, 

In [None]:
s = Sampler(ds, True)            # shuffle = TRUE
for i, sample in enumerate(s): 
    print(sample, end=', ')
    if i == 5: break

44880, 31024, 12590, 21343, 47390, 5890, 

Next, let's create a BatchSampler that does the same, but returns the indexes in batches. For that we can use the `islice()` function from the `itertools` module:

In [None]:
from itertools import islice

def printlist(this): print(list(this))

lst = list(range(0, 10))         # create a list of 10 numbers

printlist(islice(lst, 0, 3))     # with islice we can get a slice out of the list
printlist(islice(lst, 5, 10))

[0, 1, 2]
[5, 6, 7, 8, 9]


In [None]:
printlist(islice(lst, 4))        # we can also get the "next" 4 elements
printlist(islice(lst, 4))        # doing that twice gives the same first 4 elements

[0, 1, 2, 3]
[0, 1, 2, 3]


In [None]:
lst = iter(lst)                  # however if we put an iterator on the list:

printlist(islice(lst, 4))        # first 4 elements
printlist(islice(lst, 4))        # second 4 elements
printlist(islice(lst, 4))        # remaining 2 elements
printlist(islice(lst, 4))        # iterator has finished..

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
[]


And thus we create our `BatchSampler`:

In [None]:
class BatchSampler():
    def __init__(self, sampler, batch_size):
        self.sampler = sampler
        self.batch_size = batch_size
        
    def __iter__(self):
        it = iter(self.sampler)
        while True:
            res = list(islice(it, self.batch_size))
            if len(res) == 0:
                return
            yield res

In [None]:
s = Sampler(list(range(0,10)), shuffle=False)
batchs = BatchSampler(s, 4)
for i in batchs:
    print(list(i))

[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
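
For comparison, the same batching logic can also be written as a plain generator function. This is only an alternative sketch (the name `batched_indices` is made up here, and the assignment expression needs Python 3.8 or later), not what the library uses:

```python
from itertools import islice


def batched_indices(indices, batch_size):
    """Yield consecutive lists of at most batch_size items from indices."""
    it = iter(indices)  # one shared iterator, so every islice call continues where the last stopped
    while batch := list(islice(it, batch_size)):
        yield batch


print(list(batched_indices(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The loop ends as soon as `islice` returns an empty list, which plays the same role as the `len(res) == 0` check in the class. Since Python 3.12 the standard library also provides `itertools.batched`, which does the same but yields tuples.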


And let's incorporate it into the DataLoader:

In [None]:
class DataLoader():
    
    def __init__(self, dataset, batch_sampler):
        self.dataset = dataset
        self.batch_sampler = batch_sampler
        
    def __iter__(self):
        for batch in self.batch_sampler:
            yield self.dataset[batch]

In [None]:
s = Sampler(ds, shuffle=True)
dl = DataLoader(ds, BatchSampler(s, bs))

model, opt = get_model_opt()
fit(5)

epoch=0 | loss=2.134 | acc=0.367
epoch=1 | loss=1.870 | acc=0.579
epoch=2 | loss=1.594 | acc=0.632
epoch=3 | loss=1.417 | acc=0.635
epoch=4 | loss=1.279 | acc=0.646


And this works pretty well. However, there is one caveat. At the very beginning of this post we did:

```
x_train = torch.stack([TF.to_tensor(i).view(-1) for i in ds_hf['image']])
y_train = torch.stack([torch.tensor(i) for i in ds_hf['label']])
```

And that is something we would ideally also like to be part of the DataLoader / Dataset paradigm. So instead of first transforming the Hugging Face dataset into `x_train` and `y_train`, we want to use the dataset directly. We can do so by adding a collate function. The DataLoader fetches the individual entries of a batch from the dataset and passes them to this function as a list of individual x,y tuples (`[(x_1, y_1), (x_2, y_2), ..]`). In that function, we decide how to treat these items and turn them into a form that suits our needs, i.e.:

- batch the `x` and `y`, so that we transform from `[(x_1,y_1), (x_2, y_2), ..]`  to `[(x_1, x_2, ..), (y_1, y_2, y_3, ..)]`
- move individual items `x_i` and `y_i` to tensors
- stack the `x` tensors and `y` tensors respectively into one big tensor
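
The first step in the list above, regrouping a list of pairs into one group of `x` values and one group of `y` values, is what `zip(*...)` does. A tiny sketch with plain Python values standing in for images and labels (the numbers are made up):

```python
# three (x, y) samples: a "flattened image" and a label each
batch = [([0.1, 0.2], 9), ([0.3, 0.4], 0), ([0.5, 0.6], 3)]

xs, ys = zip(*batch)  # transpose: list of pairs -> pair of tuples
print(xs)  # ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6])
print(ys)  # (9, 0, 3)
```

With real data, each group is then converted to tensors and stacked.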

In [None]:
class DataLoader():
    
    def __init__(self, dataset, batch_sampler, collate_func):
        self.dataset = dataset
        self.batch_sampler = batch_sampler
        self.collate_func = collate_func
        
    def __iter__(self):
        for batch in self.batch_sampler:
            yield self.collate_func(self.dataset[i] for i in batch)

In this case in the `collate_func` we transform from PIL to tensor, get rid of the dictionary and zip up the results as is expected from the dataloader:

In [None]:
def collate_func(data):
    data = [(TF.to_tensor(el['image']).view(-1), torch.tensor(el['label'])) for el in data]
    x, y = zip(*data)
    return torch.stack(x), torch.stack(y)

In [None]:
s = Sampler(ds_hf, shuffle=True)
dl = DataLoader(ds_hf, BatchSampler(s, bs), collate_func)

model, opt = get_model_opt()
fit(5)

epoch=0 | loss=2.045 | acc=0.359
epoch=1 | loss=1.794 | acc=0.495
epoch=2 | loss=1.588 | acc=0.569
epoch=3 | loss=1.352 | acc=0.641
epoch=4 | loss=1.242 | acc=0.646


This is pretty nice: we have replicated the main logic of PyTorch's DataLoader. PyTorch's version has a slightly different API: we don't have to pass a `BatchSampler`; instead we can just pass `batch_size` and `shuffle=True`:

In [None]:
from torch.utils.data import DataLoader

dl = DataLoader(ds_hf, batch_size=bs, shuffle=True, collate_fn=collate_func)  # no Sampler needed: PyTorch creates it internally

model, opt = get_model_opt()
fit(5)

epoch=0 | loss=2.121 | acc=0.363
epoch=1 | loss=1.890 | acc=0.595
epoch=2 | loss=1.660 | acc=0.605
epoch=3 | loss=1.428 | acc=0.635
epoch=4 | loss=1.284 | acc=0.630


## Validation set

Let's add a validation set to make sure we validate on data we are not training on. For that we are going to pull the data from the datasets library without the `split` argument, which gives us a dataset dictionary containing both a training and a test dataset:

In [None]:
ds_hf = load_dataset(name)
ds_hf

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 10000
    })
})

And let's create two dataloaders, one for the train and one for the validation set. For the validation loader we can double the batch size: we won't compute gradients there, so no activations have to be kept around for a backward pass, which frees up memory:

In [None]:
train_loader = DataLoader(ds_hf['train'], batch_size=bs, shuffle=True, collate_fn=collate_func)
valid_loader = DataLoader(ds_hf['test'], batch_size=2*bs, shuffle=False, collate_fn=collate_func)

We change the training loop in a couple of ways:

- compute loss and metrics more correctly, by taking care of the batch-size and taking the average over all data
- add a separate forward pass for the validation set

In [None]:
def fit(epochs):
    for epoch in range(epochs):
        model.train()                                       # put the model in "train" mode
        n_t = train_loss_s = 0                              # initialize variables for computing averages
        for xb, yb in train_loader:
            preds = model(xb)
            train_loss = loss_func(preds, yb)
            train_loss.backward()
            
            n_t += len(xb)
            train_loss_s += train_loss.item() * len(xb)
            
            opt.step()
            opt.zero_grad()
        
        model.eval()                                        # put the model in "eval" mode
        n_v = valid_loss_s = acc_s = 0                      # initialize variables for computing averages
        for xb, yb in valid_loader: 
            with torch.no_grad():                           # no need to compute gradients on validation set
                preds = model(xb)
                valid_loss = loss_func(preds, yb)
                
                n_v += len(xb)
                valid_loss_s += valid_loss.item() * len(xb)
                acc_s += accuracy(preds, yb) * len(xb)
        
        train_loss = train_loss_s / n_t                     # compute averages of loss and metrics
        valid_loss = valid_loss_s / n_v
        acc = acc_s / n_v
        print(f'{epoch=} | {train_loss=:.3f} | {valid_loss=:.3f} | {acc=:.3f}')

In [None]:
model, opt = get_model_opt()

fit(5)

epoch=0 | train_loss=2.185 | valid_loss=2.072 | acc=0.526
epoch=1 | train_loss=1.952 | valid_loss=1.823 | acc=0.635
epoch=2 | train_loss=1.685 | valid_loss=1.551 | acc=0.653
epoch=3 | train_loss=1.430 | valid_loss=1.330 | acc=0.657
epoch=4 | train_loss=1.243 | valid_loss=1.179 | acc=0.659
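
The `* len(xb)` weighting in the loop above matters because the last batch is usually smaller than the others: with 60,000 training samples and `bs = 1024`, the final batch holds only 608 samples. Simply averaging the per-batch losses would give that batch too much weight. A small sketch with made-up numbers:

```python
batch_sizes = [4, 4, 2]          # the last batch is smaller
batch_losses = [1.0, 1.0, 4.0]   # mean loss within each batch (made-up values)

# naive: average of the batch means, every batch counts equally
naive = sum(batch_losses) / len(batch_losses)

# weighted: turn each batch mean back into a sum, then divide by the sample count
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(naive)               # 2.0
print(round(weighted, 3))  # 1.6
```

Only the weighted version equals the mean loss over all individual samples.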


And that's it for this post (almost)! We have seen a lot of details on Datasets, Dataloaders and the transformation of data. We have used these concepts to improve our training loop: shuffling the training data on each epoch, and the computation of the metrics on the validation set. But before we close off, let's make our very first exports into the library, so that next time we can continue where we finished off.

## First exports

When exporting code to a module with `nbdev`, the first thing we need to do is declare the `default_exp` directive. This makes sure that when we run the export, the module will be exported to `dataloaders.py`:

In [None]:
#| default_exp dataloaders

Next, we can export any code into the module by adding `#|export` on top of the cell we want to export. For example:

In [None]:
#|export

def print_hello():
    print('hello')

To export, we simply execute:

In [None]:
import nbdev; nbdev.nbdev_export()

This will create a file called `dataloaders.py` in the library folder (in my case `nntrain`) with the contents:

```
# AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/01_dataloaders.ipynb.

# %% auto 0
__all__ = ['print_hello']

# %% ../nbs/01_dataloaders.ipynb 59
def print_hello():
    print('hello')

```

So what do we want to export here? Let's see if we can create some generic code for loading data from the Huggingface datasets library into a PyTorch Dataloader:

In [None]:
#|export

import torchvision.transforms.functional as TF
from torch.utils.data import DataLoader
import torch
import PIL

In [None]:
#|export

def hf_ds_collate_func(data):
    '''
    Collation function for building a PyTorch DataLoader from a Hugging Face dataset.
    Tries to convert every item of a dataset entry to a tensor.
    PIL images are converted to tensor.
    '''

    def to_tensor(i):
        if isinstance(i, PIL.Image.Image):
            return TF.to_tensor(i).view(-1)
        else:
            return torch.tensor(i)
    
    data = [map(to_tensor, el.values()) for el in data]  # map each item from a dataset entry through to_tensor()
    data = zip(*data)                                    # zip data of any length not just (x,y) but also (x,y,z)
    return (torch.stack(i) for i in data)                
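
Two properties of this function are easy to miss: it handles entries with any number of fields, and it returns a generator rather than a tuple. The sketch below mimics its structure with plain Python values; the third field `weight` and the helper name `collate` are made up for the illustration:

```python
entries = [
    {"image": [0.0, 1.0], "label": 9, "weight": 0.5},
    {"image": [1.0, 0.0], "label": 0, "weight": 2.0},
]


def collate(data):
    """Same shape as hf_ds_collate_func, with lists standing in for stacked tensors."""
    columns = zip(*[el.values() for el in data])  # one column per field
    return (list(col) for col in columns)         # a generator, like the original


images, labels, weights = collate(entries)  # unpacking consumes the generator once
print(images, labels, weights)

batch = collate(entries)
first = list(batch)
second = list(batch)  # the generator is exhausted now, so this is empty
print(len(first), second)  # 3 []
```

Unpacking with `xb, yb = ...` in the training loop works fine, since each batch is consumed exactly once. If a batch ever needs to be read twice, wrapping the return value in `tuple(...)` would be one option.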

In [None]:
#|export
class DataLoaders:
    def __init__(self, train, valid):
        self.train = train
        self.valid = valid
    
    @classmethod
    def _get_dls(cls, train_ds, valid_ds, bs, collate_fn):
        return (DataLoader(train_ds, batch_size=bs, shuffle=True, collate_fn=collate_fn),
                DataLoader(valid_ds, batch_size=bs*2, collate_fn=collate_fn))
        
    @classmethod
    def from_hf_dd(cls, dd, batch_size):
        return cls(*cls._get_dls(*dd.values(), batch_size, hf_ds_collate_func))

In [None]:
def fit(epochs):
    for epoch in range(epochs):
        model.train()                                       
        n_t = train_loss_s = 0                              
        for xb, yb in dls.train:
            preds = model(xb)
            train_loss = loss_func(preds, yb)
            train_loss.backward()
            
            n_t += len(xb)
            train_loss_s += train_loss.item() * len(xb)
            
            opt.step()
            opt.zero_grad()
        
        model.eval()                                        
        n_v = valid_loss_s = acc_s = 0                      
        for xb, yb in dls.valid: 
            with torch.no_grad():                           
                preds = model(xb)
                valid_loss = loss_func(preds, yb)
                
                n_v += len(xb)
                valid_loss_s += valid_loss.item() * len(xb)
                acc_s += accuracy(preds, yb) * len(xb)
        
        train_loss = train_loss_s / n_t                     
        valid_loss = valid_loss_s / n_v
        acc = acc_s / n_v
        print(f'{epoch=} | {train_loss=:.3f} | {valid_loss=:.3f} | {acc=:.3f}')

In [None]:
dls = DataLoaders.from_hf_dd(ds_hf, bs)

In [None]:
model, opt = get_model_opt()

fit(5)

epoch=0 | train_loss=2.175 | valid_loss=2.050 | acc=0.406
epoch=1 | train_loss=1.917 | valid_loss=1.780 | acc=0.536
epoch=2 | train_loss=1.651 | valid_loss=1.532 | acc=0.616
epoch=3 | train_loss=1.427 | valid_loss=1.340 | acc=0.637
epoch=4 | train_loss=1.262 | valid_loss=1.203 | acc=0.648


In [None]:
import nbdev; nbdev.nbdev_export()