### Why?

Implementing something from first principles is a good way to make sure you understand at a deep level how that thing works, and trying to build it from scratch will quickly make you aware of what parts you still don't understand.

### Backstory

I started working with ML, and more precisely DL, in 2016. My personal interests are NLP and Reinforcement Learning.

From 2016 to 2019 I worked on everything from Seq2Seq models to ConvNets, focused on text prediction tasks. I was particularly interested in the attention mechanism.

I wrote a few papers as part of independent studies and during a stint as a research assistant in 2018. Prior to that I had been working full-time as a SWE while taking a break from college. From the end of 2018 to the start of 2020 I kept up ML as a hobby. From 2020 to 2021 I worked on my startup for better code search and documentation, which ultimately didn't go anywhere. From September 2021 to June 2022 I was burnt out, just doing my day job and avoiding computers outside of it.

In 2020 I used BERT and variants like RoBERTa for semantic search, but I did not take the time to understand the architecture. More generally I started getting rusty, and that continued through my period of burnout.

Now that I'm no longer burnt out and am working in a job that I enjoy, I've been taking the time to get back into ML-related projects, and as part of that I'm doing a couple of different refresher courses.

As of this writing I've just finished Lesson 5 of "Practical Deep Learning for Coders" by FastAI and wanted to implement a full NN from scratch, from memory, to test my fundamentals. Below is the result of that.

In [73]:
!pip install jupyter_contrib_nbextensions
!pip install autopep8
!jupyter nbextension enable code_prettify/code_prettify



Enabling notebook extension code_prettify/code_prettify...
      - Validating: OK


In [66]:
from torch.nn import functional as F
from torch import Tensor
import torch
from abc import ABC, abstractmethod


class Dataset():

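    def __init__(self, train, valid, test=None):
        # Each argument is a zero-argument callable that returns a fresh batch iterator.
        self._train_set = train
        self._valid_set = valid
        self._test_set = test  # optional; stored but not used yet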

    def train(self):
        return self._train_set()

    def valid(self):
        return self._valid_set()


class Optimizer():

    def __init__(self, model_params, lr):
        self.model_params = model_params
        self.lr = lr

    def step(self):
        for param in self.model_params:
            param.data -= param.grad.data * self.lr

    def zero_grad(self):
        for param in self.model_params:
            param.grad = None

            
class Model():

    def __init__(self, layers=None):
        self.layers = []
        self.params = []

        # A None default avoids sharing one mutable list across instances.
        for layer in layers or []:
            self._register_layer(layer)

    def parameters(self):
        return self.params

    def _register_layer(self, layer):
        if self.layers:
            layer.set_output_shape(self.layers[-1])
            layer.create_bias(self.layers[-1])
        else:
            layer.set_output_shape(None)
                    
        if layer.parameters:
            self.params.extend(layer.parameters)
            
        self.layers.append(layer)

    def infer(self, x):
        l1 = self.layers[0]
        o = l1.call(x)
        for layer in self.layers[1:]:
            o = layer.call(o)

        return o
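
The `step` method in `Optimizer` is plain gradient descent: each parameter moves against its gradient, scaled by the learning rate. It writes through `.data`; a commonly used alternative is to wrap the update in `torch.no_grad()`. Below is a minimal sketch of that rule on a toy quadratic. The objective, the starting value, and the learning rate are made up for illustration and are not part of the notebook.

```python
import torch

# Toy objective: f(w) = (w - 3)^2, minimized at w = 3.
w = torch.tensor([0.0], requires_grad=True)
lr = 0.1  # illustrative learning rate, not taken from the notebook

for _ in range(100):
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()           # fills w.grad with d(loss)/dw
    with torch.no_grad():     # keep the update itself out of the autograd graph
        w -= lr * w.grad      # the SGD rule: w <- w - lr * grad
    w.grad = None             # the same reset that zero_grad() performs

final_w = w.item()            # ends up very close to 3.0
```

Resetting the gradient every iteration matters because `backward()` accumulates into `.grad` rather than overwriting it.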

In [67]:
import torchvision

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    lambda x: x.view(x.shape[0], -1),
    lambda x: x.squeeze()
])

mnist_data = torchvision.datasets.MNIST("./data/mnist/train",
                                        train=True,
                                        download=True,
                                        transform=transform)
test_mnist_data = torchvision.datasets.MNIST("./data/mnist/test",
                                             train=False,
                                             download=True,
                                             transform=transform)

data_loader = torch.utils.data.DataLoader(mnist_data,
                                          batch_size=1024,
                                          shuffle=True,
                                          num_workers=12)

test_data_loader = torch.utils.data.DataLoader(test_mnist_data,
                                               batch_size=1024,
                                               shuffle=True,
                                               num_workers=12)


def train_fn():
    return iter(data_loader)


def valid_fn():
    return iter(test_data_loader)


ds = Dataset(train=train_fn, valid=valid_fn)
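
To make the transform pipeline concrete, here is a small shape trace. It uses a random tensor as a stand-in for what `ToTensor()` returns for one MNIST image (channels first, so `(1, 28, 28)`), so it runs without downloading anything. The variable names are illustrative only.

```python
import torch

# Stand-in for one MNIST image after ToTensor(): (channels, height, width).
img = torch.rand(1, 28, 28)

flat = img.view(img.shape[0], -1)  # (1, 784): keep the channel dim, flatten the pixels
vec = flat.squeeze()               # (784,): drop the singleton channel dim

# The DataLoader then stacks batch_size such vectors into one batch.
batch = torch.stack([vec] * 1024)  # (1024, 784), matching Input(1024, 28 * 28)
```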

In [68]:
class Layer(ABC):

    def __init__(self):
        self.parameters = []
        
    @abstractmethod
    def set_output_shape(self, prev_layer):
        pass
    
    @abstractmethod
    def create_bias(self, prev_layer):
        pass

    @abstractmethod
    def call(self, layer):
        pass
    
class Input(Layer):
    def __init__(self, r, c):
        super().__init__()
        self.shape = (r, c)
         
    def create_bias(self, prev_layer):
        pass
    
    def set_output_shape(self, prev_layer):
        self.output_shape = self.shape
    
    def call(self, layer):
        return layer
    
class NonLinear(Layer):

    def __init__(self):
        super().__init__()
    
    def create_bias(self, prev_layer):
        pass
    
    def set_output_shape(self, prev_layer):
        self.output_shape = prev_layer.output_shape

    @abstractmethod
    def func(self, x):
        pass

    def a(self, zed):
        return self.func(zed)

    def call(self, layer):
        return self.a(layer)


class ReLu(NonLinear):

    def func(self, x):
        return torch.relu(x)


class LogSoftmax(NonLinear):

    def func(self, x):
        return torch.log(torch.exp(x) / torch.sum(torch.exp(x)))
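
One thing worth checking in `LogSoftmax.func`: `torch.sum(torch.exp(x))` has no `dim` argument, so it normalizes over the whole batch instead of per example. With 1024 rows of 10 classes and near-uniform logits, that gives log-probabilities around -log(10 * 1024) ≈ -9.23 instead of -log(10) ≈ -2.30, which looks consistent with the first loss value logged further down. The sketch below compares the two; `log_softmax_rows` is a hypothetical per-row variant, written with `torch.logsumexp` for numerical stability, and not the notebook's own code.

```python
import torch

def log_softmax_rows(x):
    # Subtracting logsumexp over the class dimension normalizes each row
    # independently and avoids overflow from exponentiating large logits.
    return x - torch.logsumexp(x, dim=-1, keepdim=True)

def log_softmax_batchwide(x):
    # Same formula as `func` above: one normalizer shared by the whole tensor.
    return torch.log(torch.exp(x) / torch.sum(torch.exp(x)))

logits = torch.zeros(1024, 10)  # uniform logits, batch size as in the notebook
per_row = log_softmax_rows(logits)         # every entry is -log(10)
batchwide = log_softmax_batchwide(logits)  # every entry is -log(10 * 1024)
```

The two versions differ only by a constant per batch, so the argmax and therefore the accuracy metric are unaffected, but the reported loss is shifted and the gradients are scaled differently.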

In [69]:
class Linear(Layer):
    def __init__(self, r, c):
        super().__init__()
        self.shape = (r, c)
        # Weights start uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)].
        self.w = torch.zeros(r, c, requires_grad=True)
        init_y = 1 / r ** 0.5
        self.w.data.uniform_(-init_y, init_y)
        # The bias is created later by create_bias, once the output shape is known.
        self.b = None

        self.parameters = [self.w]

    def zed(self, a_prev):
        # Pre-activation: z = a_prev @ w + b
        return a_prev @ self.w + self.b

    def set_output_shape(self, prev_layer):
        self.output_shape = (prev_layer.output_shape[0], self.shape[1])

    def create_bias(self, prev_layer):
        # The bias has shape (batch, out), so inputs must match the batch size exactly.
        self.b = torch.zeros(self.output_shape[0], self.output_shape[1], requires_grad=True)
        init_y = 1 / self.output_shape[0] ** 0.5
        self.b.data.uniform_(-init_y, init_y)
        self.parameters.append(self.b)

    def call(self, layer):
        return self.zed(layer)
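
`create_bias` allocates a bias of shape `(batch, out)`, which ties the layer to one batch size. A more conventional layout is one bias value per output unit, relying on broadcasting to add it to every row. The sketch below shows that variant. `BroadcastLinear` is a hypothetical name, and the zero bias initialization is my assumption rather than what the notebook does.

```python
import math
import torch

class BroadcastLinear:
    """Hypothetical variant of Linear with one bias value per output unit."""

    def __init__(self, n_in, n_out):
        bound = 1 / math.sqrt(n_in)  # same uniform bound the notebook uses for w
        self.w = torch.empty(n_in, n_out).uniform_(-bound, bound).requires_grad_()
        # Shape (n_out,) instead of (batch, n_out); zero init is an assumption.
        self.b = torch.zeros(n_out, requires_grad=True)
        self.parameters = [self.w, self.b]

    def call(self, x):
        # (batch, n_in) @ (n_in, n_out) -> (batch, n_out); b broadcasts over the batch dim
        return x @ self.w + self.b

layer = BroadcastLinear(28 * 28, 128)
out_full = layer.call(torch.rand(1024, 28 * 28))   # full batch
out_partial = layer.call(torch.rand(7, 28 * 28))   # a short final batch also works
```

With this layout the bias count no longer depends on the batch size, and there would be no need to skip incomplete batches.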

In [70]:
model = Model([
    Input(1024, 28 * 28),
    Linear(28 * 28, 128),
    ReLu(),
    Linear(128, 64),
    ReLu(),
    Linear(64, 10),
    LogSoftmax()
])

In [71]:
class Learner():

    def __init__(self, model, optimizer, loss, dataset, metric):
        self.opt = optimizer
        self.loss = loss
        self.dataset = dataset
        self.model = model
        self.metric = metric
        
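    def _validate(self):
        accs = []

        # Gradients are not needed for evaluation.
        with torch.no_grad():
            for (xb, yb) in self.dataset.valid():
                # Skip the short final batch: the bias shape is tied to the batch size.
                if xb.shape[0] < self.model.layers[0].shape[0]:
                    continue
                preds = self.model.infer(xb)

                accs.append(self.metric(preds, yb))

        return round(torch.stack(accs).mean().item(), 4)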

    def fit(self, epochs=10, lr=0.01):
        loss_fn = self.loss
        opt = self.opt(self.model.parameters(), lr)
        for _ in range(0, epochs):
            running_loss = 0
            total_batches = 0
            for (x, y) in self.dataset.train():
                # Skip the short final batch: the bias shape is tied to the batch size.
                if x.shape[0] < self.model.layers[0].shape[0]:
                    continue
                opt.zero_grad()
                out = self.model.infer(x)
                loss = loss_fn(out, y)
                loss.backward()
                opt.step()
                running_loss += loss.item()
                total_batches += 1
            print(running_loss/(total_batches))
            valid = self._validate()
            print(f"Running metric on validation set after epoch: {valid}")
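
The `loss_fn(out, y)` call in `fit` is left to whatever loss gets passed in. In the from-scratch spirit, a negative log-likelihood loss over log-probabilities can be sketched by hand and checked against PyTorch's built-in. `nll_loss_by_hand` is a hypothetical helper, and the batch below is random data for illustration.

```python
import torch
import torch.nn.functional as F

def nll_loss_by_hand(log_probs, targets):
    # Pick each row's log-probability at its target class, negate, and average.
    rows = torch.arange(log_probs.shape[0])
    return -log_probs[rows, targets].mean()

torch.manual_seed(0)
log_probs = torch.log_softmax(torch.randn(8, 10), dim=-1)  # made-up batch of 8 examples
targets = torch.randint(0, 10, (8,))

by_hand = nll_loss_by_hand(log_probs, targets)
reference = F.nll_loss(log_probs, targets)  # default reduction is "mean"
```

The two values should agree up to floating-point error, since the built-in with mean reduction does the same gather, negate, and average.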

In [72]:
opt = Optimizer

loss = torch.nn.NLLLoss()

def mnist_acc(preds, targets):
    return (preds.argmax(-1) == targets.squeeze()).float().mean()

learner = Learner(model, opt, loss, ds, mnist_acc)

learner.fit(epochs=20)

9.229732825838287
Running metric on validation set after epoch: 0.1627
9.212893338039004
Running metric on validation set after epoch: 0.3114
9.191571449411326
Running metric on validation set after epoch: 0.4355
9.162451711194269
Running metric on validation set after epoch: 0.5101
9.123250566679856
Running metric on validation set after epoch: 0.5642
9.06949758529663
Running metric on validation set after epoch: 0.6144
8.99477088862452
Running metric on validation set after epoch: 0.6521
8.89283456473515
Running metric on validation set after epoch: 0.6897
8.760742220385321
Running metric on validation set after epoch: 0.7119
8.604740093494284
Running metric on validation set after epoch: 0.7303
8.443349706715551
Running metric on validation set after epoch: 0.7446
8.295679848769616
Running metric on validation set after epoch: 0.7643
8.176097639675799
Running metric on validation set after epoch: 0.7885
8.086660450902478
Running metric on validation set after epoch: 0.8024
8.0187598