In [13]:
import numpy as np
import pytest
import bettertimeit

# Design your own Neural Net

## ~~Ray Hettinger~~

## Varun Nayyar


## This is

- A fun mix of ML and Software
- A deeper dive into Neural Nets than pytorch
- Mostly iterative design and analysis
- A lot of live coding (that I'm going to regret)

## Why? 

- Neural Nets are easy
- Everyone loves Neural Nets
- Good way to illustrate good software mixed with ML

## This is not

- A good way to implement a Neural Net Library in 2019
- Building computational graphs
- Automatic Differentiation (autograd) or how to do it
- GPU programming
- See @chewxy (if he ever returns) for the above


## Neural Nets

![nn.png](resources/nn.png)

## Forward 

- Fully Connected Layer
    - $y=Wx + b$
    - This is just a matrix multiplication
- Activation
    - $y = \tanh(x)$


- Backprop was invented independently 3 times.
- It was thought to be useless for a long time - Hinton spent many years on approximate methods

In [2]:
class Layer:
    def forward(self, x):
        pass

    def backward(self, dldy):
        pass


In [3]:
class Tanh(Layer):
    def forward(self, x):
        return np.tanh(x)


## Aside

- $Wx+b$ has these shapes
    - x is (Indim,)
    - W is (Outdim, Indim)
    - b is (Outdim,)
- We want to batch our x; we don't want to loop over it one vector at a time like this
- How do we initialize the W and b?


In [None]:
def forward(x, W, b):
    # Naive version: apply the layer to one input vector at a time
    y = []
    for vector in x:
        y.append(W @ vector + b)
    return y
        

In [10]:
class FullyConnected(Layer):
    def __init__(self, indim, hiddendim):
        self.W = np.ones((hiddendim, indim))
        self.b = np.zeros(hiddendim)

    def forward(self, x):
        y = self.W @ x + self.b
        return y


## Let's quickly test

In [11]:
N = 100
x = np.random.randn(N, 10)
l = FullyConnected(10, 32)
l.forward(x)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 100 is different from 10)

## Hmm

- What should the shapes of $x$ and $W$ be?
    - if $x$ is (N, I) then W should be (I, O)
        - `x @ W`
    - if $x$ is (I, N) then W should be (O, I)
        - `W @ x`
- Is there a compute consideration here?
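Before timing anything, it helps to confirm that the two layouts compute the same numbers and differ only by a transpose. A minimal sketch; the sizes and the seeded generator are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the check is repeatable
N, I, O = 5, 3, 4                # batch size, input dim, output dim (arbitrary)

x = rng.standard_normal((N, I))  # one sample per row
W = rng.standard_normal((I, O))
b = rng.standard_normal(O)

# Option 1: samples as rows, (N, I) @ (I, O) -> (N, O)
y_rows = x @ W + b

# Option 2: samples as columns, (O, I) @ (I, N) -> (O, N)
# b needs a trailing axis so it broadcasts across the N columns
y_cols = W.T @ x.T + b[:, None]

# Reference: the unbatched loop, one vector at a time
y_loop = np.stack([W.T @ v + b for v in x])

print(y_rows.shape, y_cols.shape)
print(np.allclose(y_rows, y_cols.T), np.allclose(y_rows, y_loop))
```

Note that in the column layout the bias no longer broadcasts for free, which is one small ergonomic point in favour of `x @ W`.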

In [17]:
def forward():
    import numpy as np
    N = 3000
    indim = 20
    hiddendim = 40

    w = np.random.randn(indim, hiddendim)
    x = np.random.randn(N, indim)

    def timeit_opt1():
        x @ w

    wT = w.T
    # wopp = np.random.randn(hiddendim, indim)
    xT = x.T
    # xopp = np.random.randn(indim, N)

    def timeit_opt2():
        wT @ xT

bettertimeit.bettertimeit(forward, 10)

opt1: 10000 loops, best of 10: 110 usec per loop
opt2: 10000 loops, best of 10: 123 usec per loop


## Design

- N >> I (usually)
- $x$ being (I, N) is a column-major format
    - natural mathematical form
    - the native layout if this was Fortran, MATLAB or Julia
    - (which is why these column-major languages show up a lot in mathy applications)
- Python and C are row major = (N, I)
    - slightly faster in the timing above (110 vs 123 usec)
    - more natural to have N the leading index anyway
    - Coda 
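To make "row major" concrete, NumPy exposes the memory layout of an array directly. A small sketch, assuming the default `float64` dtype and the same (N, I) = (3000, 20) shape used in the timing cell:

```python
import numpy as np

x = np.zeros((3000, 20))           # (N, I) in NumPy's default layout

print(x.flags["C_CONTIGUOUS"])     # True: row major, each sample's features are adjacent
print(x.strides)                   # (160, 8): bytes to step one row / one column

xT = x.T                           # a view with swapped strides, no data copied
print(xT.flags["F_CONTIGUOUS"])    # True: the transpose looks column major
print(xT.strides)                  # (8, 160)
print(np.shares_memory(x, xT))     # True
```

So `wT` and `xT` in the timing cell are views over the same buffers, not freshly laid-out arrays.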

In [22]:
class FullyConnected(Layer):
    def __init__(self, indim, hiddendim):
        self.W = np.ones((indim, hiddendim))
        self.b = np.zeros(hiddendim)

    def forward(self, x):
        y = x @ self.W + self.b
        return y


## Quick test

In [25]:
N = 100
x = np.random.randn(N, 10)
l = FullyConnected(10, 32)
y = l.forward(x)
y.shape

(100, 32)

## Initialisation Design

- zeros is a bad idea (very slow backprop - will come back to)
- Many different approaches (Xavier, He, etc)
    - Xavier is random normal(0, scale) where scale (the standard deviation) is $\sqrt{2/(I+O)}$
    
- Classmethod or init arg?

## My opinion

- Classmethod
    - choices would show up in methods (less reliance on doc)
    - user needs to know all the options 
    - many methods would be very similar - code duplication
    - init would either be user unfriendly (takes W and b) or have a bad default
- init arg
    - classmethods are best when we have very different arguments
    - lot of code can be shared
    - Maybe even allow it to be a function?
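The "allow it to be a function" idea could look roughly like this. It is only a sketch: `INITIALISERS` and `resolve_scale` are hypothetical names introduced for illustration, and the contract "a callable takes `(indim, hiddendim)` and returns a scale" is an assumption.

```python
import numpy as np

def xavier(indim, hiddendim):
    return np.sqrt(2 / (indim + hiddendim))

def he(indim, hiddendim):
    return np.sqrt(2 / indim)

# Hypothetical registry of named initialisers
INITIALISERS = {"xavier": xavier, "he": he}

def resolve_scale(init, indim, hiddendim):
    """Accept a known name or any callable (indim, hiddendim) -> scale."""
    if callable(init):
        return init(indim, hiddendim)
    try:
        return INITIALISERS[init](indim, hiddendim)
    except KeyError:
        raise ValueError(f"Unknown initialiser: {init}") from None

print(resolve_scale("he", 10, 32))
print(resolve_scale(lambda indim, hiddendim: 0.01, 10, 32))
```

The string names stay discoverable in one place, while a user with an unusual scheme can pass their own function without touching the class.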
    

In [27]:
class FullyConnected(Layer):
    def __init__(self, indim, hiddendim, init="xavier"):
        if init == "xavier":
            scale = np.sqrt(2/(indim+hiddendim))
        elif init == "he":
            scale = np.sqrt(2/indim)
        else:
            raise ValueError(f"Unknown initialiser: {init}")
        self.W = np.random.randn(indim, hiddendim) * scale
        self.b = np.zeros(hiddendim)

    def forward(self, x):
        y = x @ self.W + self.b
        return y


In [28]:
## Test

N = 100
x = np.random.randn(N, 10)
l = FullyConnected(10, 32)
y = l.forward(x)
y.shape

(100, 32)
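The shape test says nothing about the initial scale, so a second quick check is to compare the empirical standard deviation of `W` with the intended one. A sketch under assumptions: the layer sizes are arbitrary (chosen large so the estimate is tight), and a seeded `default_rng` is used instead of `np.random.randn` purely for repeatability.

```python
import numpy as np

rng = np.random.default_rng(0)
indim, hiddendim = 400, 300        # arbitrary, large enough for a stable estimate

scale = np.sqrt(2 / (indim + hiddendim))             # Xavier standard deviation
W = rng.standard_normal((indim, hiddendim)) * scale

print(scale, W.std())              # the two values should be close
```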

# Backward Pass

## The Equations


- Fully Connected
    - $\frac{dy}{dx} = W^T$
    - $\frac{dy}{dW} = x^T$
    - $\frac{dy}{db} = 1$
    - Chain Rule + Matrix math
        - $dL/dy$ is same shape as y - (N,O)
        - $\frac{dL}{dx} = \frac{dL}{dy} W^T$
        - $\frac{dL}{dW} = x^T\frac{dL}{dy}$
        - $\frac{dL}{db} = \sum_{n=1}^{N} \left(\frac{dL}{dy}\right)_{n}$ (sum over the batch axis, giving shape (O,))
- Tanh
    - $\frac{dy}{dx} = 1-\tanh^2(x)$
    - $\frac{dL}{dx} = (1-\tanh^2(x)) * \frac{dL}{dy}$
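These formulas are easy to get subtly wrong, so it is worth checking them against finite differences. A minimal sketch, assuming the (N, I) row layout and a made-up loss $L = \sum G \odot y$, where the random matrix `G` stands in for $dL/dy$; `numeric_grad` is a hypothetical helper. Because `b` is broadcast over all N rows, its gradient collects a contribution from every row, hence the sum over the batch axis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, I, O = 4, 3, 2
x = rng.standard_normal((N, I))
W = rng.standard_normal((I, O))
b = rng.standard_normal(O)
G = rng.standard_normal((N, O))    # stands in for dL/dy

def numeric_grad(f, a, eps=1e-6):
    """Central finite differences of the scalar f() with respect to array a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + eps
        up = f()
        a[idx] = old - eps
        down = f()
        a[idx] = old               # restore the original entry
        grad[idx] = (up - down) / (2 * eps)
    return grad

# Fully connected layer: L = sum(G * (xW + b)), so dL/dy is exactly G
f = lambda: np.sum(G * (x @ W + b))
dldx = G @ W.T                     # (N, O) @ (O, I) -> (N, I)
dldW = x.T @ G                     # (I, N) @ (N, O) -> (I, O)
dldb = G.sum(axis=0)               # (O,)

# Tanh: L = sum(G * tanh(z)), elementwise
z = rng.standard_normal((N, O))
f_tanh = lambda: np.sum(G * np.tanh(z))
dldz = (1 - np.tanh(z) ** 2) * G

for name, analytic, numeric in [
    ("dL/dx", dldx, numeric_grad(f, x)),
    ("dL/dW", dldW, numeric_grad(f, W)),
    ("dL/db", dldb, numeric_grad(f, b)),
    ("tanh ", dldz, numeric_grad(f_tanh, z)),
]:
    print(name, np.abs(analytic - numeric).max())
```

Each printed value is the largest absolute disagreement between the analytic and numeric gradient, and should be tiny.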

In [30]:

class Tanh(Layer):
    
    def backward(self, dldy):
        dldx = (1 - (np.tanh(x)) ** 2) * dldy
        return dldx

class FullyConnected(Layer):
    
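    def backward(self, dldy):
        # NOTE: x (the forward-pass input) is not in scope here; see the next slide
        dldw = x.T @ dldy           # (I, N) @ (N, O) -> (I, O), same shape as W
        dldb = dldy.sum(axis=0)     # sum over the batch -> (O,), same shape as b
        dldx = dldy @ self.W.T      # (N, O) @ (O, I) -> (N, I), same shape as x
        return dldx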

## Wait

- We don't have access to the input, $x$, in the backward pass!
- How should we solve this?
    - Cache it on the forward pass?
    - Save it to a 
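As a sketch of the first option only (caching on the forward pass), each layer can stash what it needs on `self`. The attribute names `_x`, `_y`, `dldW` and `dldb` are hypothetical, the `Layer` base class is omitted to keep the block self-contained, and this is one possible design rather than the one the talk settles on.

```python
import numpy as np

class Tanh:
    def forward(self, x):
        # Cache the output: the derivative 1 - tanh^2(x) only needs tanh(x)
        self._y = np.tanh(x)
        return self._y

    def backward(self, dldy):
        return (1 - self._y ** 2) * dldy

class FullyConnected:
    def __init__(self, indim, hiddendim):
        scale = np.sqrt(2 / (indim + hiddendim))     # Xavier scale
        self.W = np.random.randn(indim, hiddendim) * scale
        self.b = np.zeros(hiddendim)

    def forward(self, x):
        self._x = x                  # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, dldy):
        # Keep the parameter gradients around for a later update step
        self.dldW = self._x.T @ dldy
        self.dldb = dldy.sum(axis=0)
        return dldy @ self.W.T

x = np.random.randn(100, 10)
fc, act = FullyConnected(10, 32), Tanh()
y = act.forward(fc.forward(x))
dldx = fc.backward(act.backward(np.ones_like(y)))
print(y.shape, dldx.shape, fc.dldW.shape, fc.dldb.shape)
```

The cost of this approach is hidden state: `backward` is only valid right after the matching `forward`, and the cached arrays stay alive until they are overwritten.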
