In [1]:
#|default_exp nb_02

# Impractical Deep Learning for Coders Lesson 1, The forward and backward passes (part 2)
> Building our first model

In [18]:
#|export
from notes.nb_01 import MNIST_URL
from fastdownload import FastDownload
import pickle, gzip
from torch import tensor
import torch, math
import matplotlib.pyplot as plt

def get_data():
    fd = FastDownload(base="~/.fastai")
    path = fd.download(MNIST_URL)
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, mean, std): return (x-mean)/std

In [9]:
x_train,y_train,x_valid,y_valid = get_data()
train_mean,train_std = x_train.mean(), x_train.std()
train_mean,train_std

(tensor(0.1304), tensor(0.3073))

We need to normalize our data (mean ~= 0, std ~= 1) using the **training** set's statistics for both splits, so that they end up on the same scale. If each split were normalized with its own statistics, they would effectively be treated as two different datasets rather than parts of the same one.
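To make the point above concrete, here is a minimal sketch on synthetic data (the `train` and `valid` tensors below are made-up stand-ins, not the MNIST tensors). It shows that the same raw value maps to one normalized value when both splits share the training statistics, and to two different values when each split uses its own.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-ins for two splits (assumption: not the MNIST tensors above)
train = torch.rand(1000) * 0.5          # values in [0, 0.5)
valid = torch.rand(200) * 0.5 + 0.25    # same kind of data, slightly shifted

def normalize(x, mean, std): return (x - mean) / std

pixel = torch.tensor(0.3)  # the same raw value, seen in either split

# Shared (training) statistics: one raw value -> one normalized value
shared = normalize(pixel, train.mean(), train.std())

# Per-split statistics: the same raw value lands in two different places
own_train = normalize(pixel, train.mean(), train.std())
own_valid = normalize(pixel, valid.mean(), valid.std())
print(shared, own_train, own_valid)
```

With per-split statistics the model would see the same input encoded differently depending on which split it came from.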

In [10]:
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

In [11]:
train_mean,train_std = x_train.mean(), x_train.std()
train_mean,train_std

(tensor(2.1425e-08), tensor(1.))

In [12]:
#|export
def test_near_zero(a,tol=1e-3): assert a.abs()<tol, f"Near zero: {a}"

In [13]:
test_near_zero(x_train.mean())
test_near_zero(1-x_train.std())

#### Code

In [14]:
n,m = x_train.shape
c = y_train.max()+1

#### Explanation

In [16]:
{
    "n":"Size of the training set",
    "m":"The length of one input",
    "c":"Number of output activations (classes) we will eventually classify into"
};

In [17]:
n,m,c

(50000, 784, tensor(10))

## Foundations version

### Basic architecture

- One hidden layer
- Mean squared error to keep things simplified rather than cross entropy

We initialize with a simplified version of Kaiming init (also called He init).

#### Code

In [19]:
nh = 50
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

#### Explanation

In [24]:
{
    "nh":"The size of our fully-connected hidden layer (nodes)",
    "w1":"The weight matrix of the first layer, shape (784,50)",
    "b1":"The bias for the first layer",
    "w2":"The weight matrix of the second layer, shape (50,1)",
    "b2":"The bias for the second layer",
    "torch.randn(a,b)/math.sqrt(a)":"Simplified kaiming init/he init"
};

In [21]:
w1.shape, b1.shape, w2.shape, b2.shape

(torch.Size([784, 50]), torch.Size([50]), torch.Size([50, 1]), torch.Size([1]))

In [25]:
test_near_zero(w1.mean())
test_near_zero(w1.std()-1/math.sqrt(m))

In [26]:
# This should be ~ (0,1) (mean,std)
x_valid.mean(),x_valid.std()

(tensor(-0.0059), tensor(0.9924))

In [27]:
def lin(inp, weight, bias): return inp@weight + bias

In [28]:
t = lin(x_valid, w1, b1)

In [29]:
# So should this because we used kaiming init which is designed to have this effect
t.mean(), t.std()

(tensor(0.1195), tensor(0.9740))
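As a rough check of why dividing by `math.sqrt(m)` has this effect, the sketch below uses synthetic unit-variance inputs (an assumption; the real `x_valid` is only approximately so) and compares unscaled weights with the simplified init. Each output is a sum of `m` products, so without scaling its standard deviation grows like `sqrt(m)`.

```python
import math
import torch

torch.manual_seed(0)
n, m, nh = 2000, 784, 50          # hypothetical sizes mirroring the notebook
x = torch.randn(n, m)             # synthetic inputs with mean 0, std 1

w_raw = torch.randn(m, nh)                    # no scaling
w_scaled = torch.randn(m, nh) / math.sqrt(m)  # simplified init from above

# Each output is a sum of m products, so its variance is about m * var(w)
raw_std = (x @ w_raw).std()
scaled_std = (x @ w_scaled).std()
print(raw_std)     # roughly sqrt(784) = 28
print(scaled_std)  # roughly 1
```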

#### Code

In [1]:
def relu(inp): return inp.clamp_min(0.)

#### Explanation

In [31]:
{
    ".clamp_min":"A ReLU activation will turn all negatives into zero"
};

> While there are other ways of writing that, if you can find a function attached to a tensor for the thing you want to do, it will almost always be faster because it will be written in C - Jeremy Howard
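For reference, here are a few of the "other ways of writing that", sketched on a synthetic tensor. All of them give the same result as `clamp_min`. No timings are claimed here; running `%timeit` on each line is the way to see the speed difference on your own machine.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000, 50)   # synthetic activations, not the notebook's data

a = x.clamp_min(0.)                             # tensor method used above
b = torch.where(x > 0, x, torch.zeros_like(x))  # explicit mask
c = torch.relu(x)                               # built-in ReLU
# Pure-Python loop over every element, for comparison only
d = torch.tensor([[max(v, 0.) for v in row] for row in x.tolist()])

assert torch.equal(a, b) and torch.equal(a, c) and torch.equal(a, d)
```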

In [32]:
t = relu(lin(x_valid, w1, b1))

In [33]:
t.mean(), t.std()

(tensor(0.4477), tensor(0.6152))

Uh oh! What went wrong?

- Whiteboard session starts at 1:31:00, [YouTube link](https://youtu.be/4u8FxNEDUeg?list=PLfYUBJiXbdtTIdtE1U8qgyxo4Jy2Y91uj&t=5473)

Basically, ReLU took every activation below zero and threw it away. We lost the information those values carried, and both the mean and the standard deviation shifted drastically as a result.
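The size of that shift can be sketched with an idealized input: a standard normal sample standing in for the layer output (an assumption; the real activations are only roughly normal). For that case the mean and std after ReLU have closed forms, and the second moment is cut exactly in half.

```python
import math
import torch

torch.manual_seed(0)
z = torch.randn(1_000_000)     # stand-in for a layer output with mean 0, std 1
r = z.clamp_min(0.)            # same operation as relu above

# Closed-form values for a standard normal input
expected_mean = 1 / math.sqrt(2 * math.pi)          # about 0.399
expected_std = math.sqrt(0.5 - 1 / (2 * math.pi))   # about 0.584

print(r.mean(), r.std())
print((r ** 2).mean())         # about 0.5: half of the original second moment
```

These idealized numbers are in the same ballpark as the `(0.4477, 0.6152)` measured above.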

$$\mathrm{std}=\sqrt{\frac{2}{\left(1+a^2\right) \times \mathrm{fan\_in}}}$$
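As a sanity check on the formula, the sketch below compares it with `torch.nn.init.calculate_gain`, reading `a` as the negative slope of a leaky ReLU (so `a = 0` is plain ReLU). `he_std` is a hypothetical helper written for this note, not part of PyTorch.

```python
import math
from torch.nn import init

def he_std(fan_in, a=0.):
    "Hypothetical helper: the formula above, with `a` the negative slope"
    return math.sqrt(2 / ((1 + a**2) * fan_in))

m = 784
# a = 0 is plain ReLU, which reduces to sqrt(2/fan_in)
assert math.isclose(he_std(m), math.sqrt(2 / m))

# PyTorch exposes the same factor as a "gain", to be divided by sqrt(fan)
for a in (0., 0.01, 0.2):
    gain = init.calculate_gain('leaky_relu', a)
    assert math.isclose(he_std(m, a), gain / math.sqrt(m))
```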

For plain ReLU (a = 0), the solution is to put a 2 in the numerator:

In [200]:
std = math.sqrt(2/m)

In [181]:
w1 = torch.randn(m,nh)*std
t = relu(lin(x_valid, w1,b1))

t.mean(), t.std()

(tensor(0.7099), tensor(1.1645))

While this fixes the standard deviation, our mean is now around a half or more (0.71 in the run above) instead of zero, because ReLU still deleted everything below zero.

In [419]:
# What if...?
def relu_v2(x): return x.clamp_min(0.) - 0.5
def relu_v3(x): return (torch.pow(x.clamp_min(0.), 0.9)) - 0.5

In [420]:
w1 = torch.randn(m,nh)*std
t = relu_v2(lin(x_valid, w1,b1))

t.mean(), t.std()

(tensor(0.0090), tensor(0.7708))

In [421]:
t = relu_v3(lin(x_valid, w1,b1))

t.mean(), t.std()

(tensor(-0.0069), tensor(0.7123))

Jeremy tried subtracting 0.5 from the ReLU output just to see what would happen, and it seems to help bring the mean back toward zero, as the outputs above show.

How well does this work in practice? To test it, I should build a very basic CNN, train it on ImageWoof, and vary only the ReLU variant being used.

In [422]:
from torch.nn import init
w1 = torch.zeros(m,nh)
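# w1 is (in, out), the transpose of nn.Linear's (out, in) layout,
# so "fan_out" here is m=784, which is the fan-in of our layer
init.kaiming_normal_(w1, mode="fan_out")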
t = relu(lin(x_valid, w1, b1))

In [423]:
w1.mean(),w1.std()

(tensor(3.2332e-05), tensor(0.0504))

In [424]:
t.mean(),t.std()

(tensor(0.5996), tensor(1.1000))
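The choice of `mode="fan_out"` can look backwards, so here is a small sketch of what each mode does for a tensor shaped like our `w1`. PyTorch treats a 2-D tensor as `(out_features, in_features)`, which is how `nn.Linear` stores its weight, while our matrix is `(m, nh)`.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
m, nh = 784, 50

w_out = init.kaiming_normal_(torch.empty(m, nh), mode="fan_out")
w_in = init.kaiming_normal_(torch.empty(m, nh), mode="fan_in")

# PyTorch reads a 2-D tensor as (out_features, in_features), so for our
# (m, nh) matrix "fan_out" is m and "fan_in" is nh.
print(w_out.std(), math.sqrt(2 / m))    # both about 0.0505
print(w_in.std(), math.sqrt(2 / nh))    # both about 0.2
```

So with weights stored as `(in, out)`, `fan_out` is the mode that reproduces `math.sqrt(2/m)`.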

In [425]:
w1 = torch.randn(m,nh)*math.sqrt(2./m)
t = relu_v2(lin(x_valid, w1,b1))

t.mean(), t.std()

(tensor(0.0214), tensor(0.8005))

In [426]:
t = relu_v3(lin(x_valid, w1,b1))

t.mean(), t.std()

(tensor(0.0032), tensor(0.7366))

In [427]:
def model(xb, v2=True):
    l1 = lin(xb, w1, b1)
    l2 = relu_v2(l1) if v2 else relu_v3(l1)
    l3 = lin(l2, w2, b2)
    return l3

In [430]:
%timeit -n 10 _=model(x_valid)

2.21 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [431]:
%timeit -n 10 _=model(x_valid, False)

3.23 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [432]:
assert model(x_valid).shape == torch.Size([x_valid.shape[0],1])

## Loss function: MSE

In [433]:
model(x_valid).shape

torch.Size([10000, 1])

#### Code

In [435]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

#### Explanation

In [436]:
{
    ".squeeze()":"Opposite of unsqueeze, removes a dimension. We use it to remove the trailing `[1]`"
};

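> Note: better to pass an explicit dimension (-1 or 1) than to call bare `squeeze()`, which removes every size-1 dimension and so would also drop the batch dimension when the batch size is 1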
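Two shape pitfalls sit behind that `squeeze(-1)`; the sketch below shows both on tiny dummy tensors (not the model's real outputs).

```python
import torch

out = torch.zeros(4, 1)      # dummy model output for a batch of 4
targ = torch.ones(4)

# Without squeezing, (4,1) - (4,) broadcasts to (4,4): a silent bug
assert (out - targ).shape == torch.Size([4, 4])
assert (out.squeeze(-1) - targ).shape == torch.Size([4])

# With a batch of one, bare squeeze() also drops the batch dimension
single = torch.zeros(1, 1)
assert single.squeeze().shape == torch.Size([])      # 0-d tensor
assert single.squeeze(-1).shape == torch.Size([1])   # batch dim kept
```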

In [437]:
y_train,y_valid = y_train.float(),y_valid.float()

In [438]:
preds_a = model(x_train)
preds_b = model(x_train,False)

In [440]:
preds_a.shape

torch.Size([50000, 1])

In [441]:
mse(preds_a, y_train)

tensor(25.9262)

In [442]:
mse(preds_b, y_train)

tensor(26.0106)

## Gradients and backward pass

Chain rule, chain rule, chain rule!
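As a small warm-up, here is a sketch of the chain rule on a single linear layer followed by the same MSE form as above, checked against autograd. The tensors are tiny synthetic ones and the names `d_out`, `d_w`, `d_b` are made up for this example.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3)                       # tiny synthetic batch
targ = torch.randn(8)
w = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

out = x @ w + b                                 # linear layer
loss = (out.squeeze(-1) - targ).pow(2).mean()   # same form as mse above
loss.backward()

# Chain rule by hand: d(loss)/d(out) first, then back through the linear layer
with torch.no_grad():
    d_out = 2 * (out.squeeze(-1) - targ).unsqueeze(-1) / x.shape[0]
    d_w = x.t() @ d_out
    d_b = d_out.sum(0)

assert torch.allclose(w.grad, d_w, atol=1e-6)
assert torch.allclose(b.grad, d_b, atol=1e-6)
```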