# Intro to PyTorch

Yesterday, we coded up the forward pass and backpropagation _from scratch_. Today, we're going to use an automatic differentiation framework :) We checked our manual gradients in `jax` b/c its syntax is very transparent for these types of gradient checks, but we'll use `pytorch` for the rest of the block course: it strikes a great balance between ease of use for projects and still being easy to dive back into the matrix / tensor manipulation code (🥸) if needed (🤓).

**Table of Contents**
1. Build a simple MLP
2. (?)
3. Loss functions
4. Softmax interlude
5. Train the NN (with Adam)

In [1]:
import torch
import matplotlib.pyplot as plt
import numpy as np

**Pytorch Lecture notes**

In [2]:
# How to access the gradient of a tensor
w = torch.tensor([0.1,  0.2,  2, 1, 0.1],
                 requires_grad=True)
print(w.grad)  # None for now
# Note: `.grad` gets filled in once we call `.backward()` on something computed from `w`

None


In [3]:
# Our new favorite fct <3 
# torch.einsum?
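
Since the cell above only opens the help page, here is a small sketch of what `torch.einsum` does. The tensors `A`, `B`, `u`, `v` are made up for illustration (they are not part of the lecture data): an index that is repeated in the inputs but missing from the output gets summed over, and all other indices are kept.

```python
import torch

A = torch.arange(6.0).reshape(2, 3)    # shape (2, 3)
B = torch.arange(12.0).reshape(3, 4)   # shape (3, 4)

# Matrix product: j appears in both inputs but not in the output, so it is summed
C = torch.einsum('ij,jk->ik', A, B)    # shape (2, 4), same as A @ B

# Batched outer product: no index is dropped, so nothing is summed
u = torch.randn(5, 3)
v = torch.randn(5, 4)
outer = torch.einsum('ni,nj->nij', u, v)   # shape (5, 3, 4)

print(C.shape, outer.shape)
```

The second pattern is the same kind of "keep the example axis, broadcast the rest" operation used for the sample-wise gradients later in this notebook.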

### 0) Load in our "data generator" (same as the last notebook, `Our-first-NN.ipynb`).

In [4]:
coeffs_true = [5, 4, -2, -0.7]

def generate_data(N):
    '''
    Same function as yesterday
    '''
    x = np.random.uniform(low=-1, high=1, size=N)
    
    # y ~ N(mu=f(x), std=0.2)
    mu = np.polyval(coeffs_true,x)
    noise = 0.2 * np.random.randn(N)  # zero-mean Gaussian noise with std 0.2
    y = mu + noise
    
    return x,y

def make_features(N, degree=4):
    x,y = generate_data(N)
    X = np.column_stack([x**i for i in reversed(range(degree+1))])
    return X,y

In [5]:
N=200
X_np,y_np = make_features(N)

print('X',X_np.shape)
print('y',y_np.shape)


X (200, 5)
y (200,)


Type cast the NumPy arrays to torch tensors.

In [6]:
N=200
X = torch.tensor(X_np,dtype=torch.float32)
y = torch.tensor(y_np,dtype=torch.float32)
y = y[:,None] # add a trailing axis so y has the same (N, 1) shape as the NN output

print('X',X.shape)
print('y',y.shape)


X torch.Size([200, 5])
y torch.Size([200, 1])
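
A small sketch of why the cast above passes `dtype=torch.float32` explicitly. The array `a_np` is a made-up stand-in for the lecture data. NumPy defaults to `float64`, while `nn.Linear` creates `float32` parameters by default, so keeping the NumPy dtype would lead to a dtype mismatch in the forward pass. The sketch also contrasts `torch.tensor` (copies) with `torch.from_numpy` (shares memory with the array).

```python
import numpy as np
import torch

a_np = np.linspace(-1.0, 1.0, 5)   # NumPy default dtype: float64

a_shared = torch.from_numpy(a_np)                  # shares memory, stays float64
a_copy = torch.tensor(a_np, dtype=torch.float32)   # copies the data and casts

a_np[0] = 42.0   # in-place change of the NumPy array

print(a_shared.dtype, a_shared[0].item())   # the shared tensor sees the change
print(a_copy.dtype, a_copy[0].item())       # the copy does not
```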


### 1) Build the simple MLP in `pytorch` that we've been playing with yesterday
- Input $X \in \mathbb{R}^{N \times d}$, d=5
- NN with a single hidden layer, $H=16$ hidden units
- ReLU nonlinearity
- Output $y \in \mathbb{R}^N$

In [7]:
from torch import nn

d = X.shape[1]
H = 16

In [8]:
'''
Option 1: With Sequential
'''
f = nn.Sequential(nn.Linear(d,H), nn.ReLU(), nn.Linear(H,1))

In [9]:
# Test the evaluation, does it have the shape you expect??
f(X).shape

torch.Size([200, 1])

In [10]:
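'''
Option 2: Subclass nn.Module and write the forward pass explicitly
'''

class myMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(d,H)  # input -> hidden
        self.lin2 = nn.Linear(H,1)  # hidden -> output
        
    def forward(self,x):
        z = self.lin1(x)   # pre-activations, shape (N, H)
        h = torch.relu(z)  # ReLU nonlinearity
        y = self.lin2(h)   # output, shape (N, 1)
        return y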
        

ff = myMLP()

In [11]:
ff(X).shape

torch.Size([200, 1])

OK, it's great we have both implementations, but we'll just use `f` going forward.
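
To connect either implementation back to yesterday's matrix code, here is a minimal sketch that pulls the weights out of an `nn.Sequential` model and redoes the forward pass by hand. The names `f_demo` and `X_demo` are stand-ins invented for this example (random inputs, not the polynomial features). Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, hence the transposes.

```python
import torch
from torch import nn

torch.manual_seed(0)
d, H, N = 5, 16, 200
X_demo = torch.randn(N, d)   # stand-in inputs, assumed for illustration
f_demo = nn.Sequential(nn.Linear(d, H), nn.ReLU(), nn.Linear(H, 1))

# Layers of a Sequential can be indexed; index 1 is the ReLU (no parameters)
W1, b1 = f_demo[0].weight, f_demo[0].bias   # shapes (H, d) and (H,)
W2, b2 = f_demo[2].weight, f_demo[2].bias   # shapes (1, H) and (1,)

h = torch.relu(X_demo @ W1.T + b1)   # hidden activations, shape (N, H)
out_manual = h @ W2.T + b2           # output, shape (N, 1)

print(torch.allclose(out_manual, f_demo(X_demo), atol=1e-5))
```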

### 2) Mean Squared Error loss

Note, torch records the computation graph during the forward pass, but the gradients only get computed (and stored in `.grad`) when we call `.backward()`.

Let's illustrate this w/ a linear model!

In [12]:
x = torch.tensor([1.,2.])
w = torch.tensor([.2,.3],requires_grad=True)

f_lin = w @ x

In [13]:
f_lin

tensor(0.8000, grad_fn=<DotBackward0>)

In [14]:
print(w.grad)

None


In [15]:
f_lin.backward()

In [16]:
print(w.grad) # is x

tensor([1., 2.])
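
A short sketch of where the graph comes from, using fresh copies of the same two tensors (named `x_demo` and `w_demo` here so nothing above gets overwritten). The `grad_fn` attribute shows that the graph is already attached to the result of the forward computation, and `torch.no_grad()` switches the recording off.

```python
import torch

x_demo = torch.tensor([1., 2.])
w_demo = torch.tensor([.2, .3], requires_grad=True)

out = w_demo @ x_demo
print(out.grad_fn)   # a backward node exists before .backward() is ever called

with torch.no_grad():
    out_ng = w_demo @ x_demo   # same numbers, but nothing is recorded
print(out_ng.grad_fn, out_ng.requires_grad)
```

Calling `.backward()` on `out_ng` would raise an error, since there is no graph to walk back through.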


Another pytorch "gotcha": when you call `.backward()` multiple times, the new gradients get _added_ to whatever is already stored in `.grad` (i.e., you _sum up the gradients_).

What does this look like??

In [17]:
# Similar to the ex above, a lin model with just two weights
m = nn.Sequential(nn.Linear(2,1))

In [18]:
next(m.parameters())

Parameter containing:
tensor([[0.3902, 0.1299]], requires_grad=True)

In [19]:
m(x)

tensor([0.6206], grad_fn=<AddBackward0>)

In [20]:
for i in range(10):
    fx = m(x)
    fx.backward()
    print(f'Iter {i} df/dw =',next(m.parameters()).grad)

Iter 0 df/dw = tensor([[1., 2.]])
Iter 1 df/dw = tensor([[2., 4.]])
Iter 2 df/dw = tensor([[3., 6.]])
Iter 3 df/dw = tensor([[4., 8.]])
Iter 4 df/dw = tensor([[ 5., 10.]])
Iter 5 df/dw = tensor([[ 6., 12.]])
Iter 6 df/dw = tensor([[ 7., 14.]])
Iter 7 df/dw = tensor([[ 8., 16.]])
Iter 8 df/dw = tensor([[ 9., 18.]])
Iter 9 df/dw = tensor([[10., 20.]])


**Fix:** Need to zero out the gradients between calls to `.backward()`.

In [21]:
for i in range(10):
    fx = m(x)
    m.zero_grad()
    fx.backward()
    print(f'Iter {i} df/dw =',next(m.parameters()).grad)

Iter 0 df/dw = tensor([[1., 2.]])
Iter 1 df/dw = tensor([[1., 2.]])
Iter 2 df/dw = tensor([[1., 2.]])
Iter 3 df/dw = tensor([[1., 2.]])
Iter 4 df/dw = tensor([[1., 2.]])
Iter 5 df/dw = tensor([[1., 2.]])
Iter 6 df/dw = tensor([[1., 2.]])
Iter 7 df/dw = tensor([[1., 2.]])
Iter 8 df/dw = tensor([[1., 2.]])
Iter 9 df/dw = tensor([[1., 2.]])
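
The same zeroing pattern is what a typical optimization loop does once per step. Below is a minimal sketch on toy data; `X_demo`, `y_demo`, `model`, the choice of plain SGD, and the learning rate are all assumptions made for this illustration, not the training setup of this course.

```python
import torch
from torch import nn

torch.manual_seed(0)
X_demo = torch.randn(64, 2)                      # toy inputs
y_demo = X_demo @ torch.tensor([[1.0], [2.0]])   # noise-free toy targets, shape (64, 1)

model = nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(50):
    opt.zero_grad()                        # clear gradients from the previous step
    loss = loss_fn(model(X_demo), y_demo)  # forward pass records the graph
    loss.backward()                        # fill .grad for every parameter
    opt.step()                             # update the parameters using .grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Without the `zero_grad()` call, each step would use the sum of all gradients seen so far instead of the current one.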


#### Task for you!

Calculate the loss of the simple MLP `f` defined above.

In [22]:
# expects input, target to be passed to the layer
loss = nn.MSELoss()(f(X),y )
loss

tensor(2.9017, grad_fn=<MseLossBackward0>)

In [23]:
torch.mean((f(X)-y)**2)

tensor(2.9017, grad_fn=<MeanBackward0>)
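
The manual version above only matches because `y` was reshaped to (N, 1) earlier. A small sketch of what goes wrong otherwise, with made-up tensors `pred` and `target_flat`: subtracting an (N,) target from an (N, 1) prediction broadcasts to an (N, N) matrix of all pairwise differences, and the mean of its squares is in general not the MSE.

```python
import torch

N_demo = 4
pred = torch.zeros(N_demo, 1)      # model-style output, shape (N, 1)
target_flat = torch.ones(N_demo)   # target without the trailing axis, shape (N,)

diff_bad = pred - target_flat            # broadcasts to shape (N, N)
diff_ok = pred - target_flat[:, None]    # shape (N, 1), one residual per example

print(diff_bad.shape, diff_ok.shape)
```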

In [24]:
# Backpropagate through the computational graph to fill in .grad
f.zero_grad()
loss.backward()

In [25]:
# Print and save it to a dictionary
keys = ['W1','b1','W2','b2']

grad_torch = {}

for k, p in zip(keys,f.parameters()):
    print(k,p.shape)
    print(p.grad)

    grad_torch[k] = p.grad.clone()  # copy, so later zero_grad() calls can't modify it

W1 torch.Size([16, 5])
tensor([[-0.1322, -0.1241, -0.1698, -0.1363, -0.1666],
        [ 0.0119, -0.0148,  0.0190, -0.0253,  0.0350],
        [-0.0214,  0.0244, -0.0278,  0.0319, -0.0367],
        [-0.1359, -0.1044, -0.1739, -0.1131, -0.1743],
        [ 0.0133,  0.0167,  0.0202,  0.0219,  0.0062],
        [ 0.0125, -0.0156,  0.0199, -0.0255,  0.0291],
        [ 0.0035, -0.0061,  0.0104, -0.0162,  0.0143],
        [-0.1083, -0.1201, -0.1330, -0.1440, -0.1406],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0434,  0.0333,  0.0555,  0.0361,  0.0556],
        [-0.1152, -0.1216, -0.1284, -0.1359, -0.1439],
        [ 0.0188, -0.0234,  0.0302, -0.0403,  0.0561],
        [-0.0199,  0.0233, -0.0276,  0.0328, -0.0395],
        [-0.0880, -0.0677, -0.1127, -0.0732, -0.1129],
        [-0.0219,  0.0255, -0.0300,  0.0355, -0.0424],
        [-0.1896, -0.2092, -0.2322, -0.2595, -0.2922]])
b1 torch.Size([16])
tensor([-0.1666,  0.0350, -0.0367, -0.1743,  0.0062,  0.0291,  0.0143, -0.14

#### Differentiable Detective

We're now in a place where we can use autodiff to check the computational graph solution for $\nabla_{W1} \mathcal{L}$, $\nabla_{b1} \mathcal{L}$, $\nabla_{W2} \mathcal{L}$, $\nabla_{b2} \mathcal{L}$.

Note, getting $\nabla_{W1} f$, $\nabla_{b1} f$, $\nabla_{W2} f$, $\nabla_{b2} f$ is a little annoying in pytorch b/c `.backward()` wants to differentiate a single scalar, whereas the NN output is an (N,1) array, where N is the number of examples.

The code snippet below gets you the sample-wise gradient. For the scope of this lecture, it's not expected that you need to understand the details of this code snippet, just that you can use the output to check your worksheet calculation. 
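
For reference, the check relies on the chain rule applied to the MSE loss. Writing $q_i = f(x_i)$ for the NN output on example $i$ (this notation is assumed here to match the variable name `dl_dq` used in the code):

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (q_i - y_i)^2
\quad \Rightarrow \quad
\nabla_{W1} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} 2\,(q_i - y_i)\, \nabla_{W1} q_i
$$

So the sample-wise gradients $\nabla_{W1} q_i$ get weighted by $2(q_i - y_i)$ and averaged over the batch, and the same holds for $b1$, $W2$ and $b2$.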

In [26]:
# Init the dict
grad_dict_f = {k:[] for k in keys}

# Loop over each example in the batch
for i in range(N):
    # Take the grad w/r.t. the example
    # A.k.a, set up a computation graph for the example

    # Warning! Need to zero out the gradients first!!
    f.zero_grad()
    
    f(X)[i].backward()

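    # Append a copy of the gradients to the list
    # (.clone() so a later in-place zeroing of p.grad can't overwrite what we stored)
    for k, p in zip(keys,f.parameters()):
        grad_dict_f[k].append(p.grad.clone())

# Stack the per-example gradients along a new leading (example) axis
for k in keys:
    grad_dict_f[k] = torch.stack(grad_dict_f[k],dim=0)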

In [27]:
(2*(f(X)-y)).shape

torch.Size([200, 1])

In [28]:
grad_dict_f['W1'].shape

torch.Size([200, 16, 5])

In [29]:
dl_dq = 2*(f(X)-y)
dl_dw1_batch = dl_dq[...,None] * grad_dict_f['W1']
print(dl_dw1_batch.shape)

torch.Size([200, 16, 5])


In [30]:
dl_dw1 = torch.mean(dl_dw1_batch,axis=0)
dl_dw1

tensor([[-0.1322, -0.1241, -0.1698, -0.1363, -0.1666],
        [ 0.0119, -0.0148,  0.0190, -0.0253,  0.0350],
        [-0.0214,  0.0244, -0.0278,  0.0319, -0.0367],
        [-0.1359, -0.1044, -0.1739, -0.1131, -0.1743],
        [ 0.0133,  0.0167,  0.0202,  0.0219,  0.0062],
        [ 0.0125, -0.0156,  0.0199, -0.0255,  0.0291],
        [ 0.0035, -0.0061,  0.0104, -0.0162,  0.0143],
        [-0.1083, -0.1201, -0.1330, -0.1440, -0.1406],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0434,  0.0333,  0.0555,  0.0361,  0.0556],
        [-0.1152, -0.1216, -0.1284, -0.1359, -0.1439],
        [ 0.0188, -0.0234,  0.0302, -0.0403,  0.0561],
        [-0.0199,  0.0233, -0.0276,  0.0328, -0.0395],
        [-0.0880, -0.0677, -0.1127, -0.0732, -0.1129],
        [-0.0219,  0.0255, -0.0300,  0.0355, -0.0424],
        [-0.1896, -0.2092, -0.2322, -0.2595, -0.2922]],
       grad_fn=<MeanBackward1>)

In [31]:
grad_torch['W1']

tensor([[-0.1322, -0.1241, -0.1698, -0.1363, -0.1666],
        [ 0.0119, -0.0148,  0.0190, -0.0253,  0.0350],
        [-0.0214,  0.0244, -0.0278,  0.0319, -0.0367],
        [-0.1359, -0.1044, -0.1739, -0.1131, -0.1743],
        [ 0.0133,  0.0167,  0.0202,  0.0219,  0.0062],
        [ 0.0125, -0.0156,  0.0199, -0.0255,  0.0291],
        [ 0.0035, -0.0061,  0.0104, -0.0162,  0.0143],
        [-0.1083, -0.1201, -0.1330, -0.1440, -0.1406],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0434,  0.0333,  0.0555,  0.0361,  0.0556],
        [-0.1152, -0.1216, -0.1284, -0.1359, -0.1439],
        [ 0.0188, -0.0234,  0.0302, -0.0403,  0.0561],
        [-0.0199,  0.0233, -0.0276,  0.0328, -0.0395],
        [-0.0880, -0.0677, -0.1127, -0.0732, -0.1129],
        [-0.0219,  0.0255, -0.0300,  0.0355, -0.0424],
        [-0.1896, -0.2092, -0.2322, -0.2595, -0.2922]])

Nice! Visually, they look the same... let's check the rest of the parameters!

In [34]:
dl_dw2 = torch.mean(dl_dq[...,None] * grad_dict_f['W2'], axis=0)

dl_db1 = torch.mean(dl_dq * grad_dict_f['b1'], axis=0)
dl_db2 = torch.mean(dl_dq * grad_dict_f['b2'], axis=0)

In [35]:
for k, manual_grad in zip(keys, [dl_dw1, dl_db1, dl_dw2, dl_db2]):
    print(torch.all(torch.isclose(manual_grad,grad_torch[k])))

tensor(True)
tensor(True)
tensor(True)
tensor(True)
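
As an aside, the sample-wise gradients can also be collected with `torch.autograd.grad`, which returns the gradients directly instead of accumulating them into `.grad`, so there is nothing to zero out and one forward pass is enough. This is only a sketch on a small stand-in problem: `net`, `X_demo`, `y_demo` and the sizes are made up for the example, and only the `W1` check is repeated.

```python
import torch
from torch import nn

torch.manual_seed(0)
N_demo, d_demo, H_demo = 8, 5, 16
X_demo = torch.randn(N_demo, d_demo)   # stand-in inputs
y_demo = torch.randn(N_demo, 1)        # stand-in targets
net = nn.Sequential(nn.Linear(d_demo, H_demo), nn.ReLU(), nn.Linear(H_demo, 1))
params = list(net.parameters())        # order: W1, b1, W2, b2

out = net(X_demo)   # single forward pass, shape (N, 1)

# One backward pass per example; retain_graph keeps the graph alive for the next one
per_example = [
    torch.autograd.grad(out[i, 0], params, retain_graph=True)
    for i in range(N_demo)
]
df_dW1 = torch.stack([g[0] for g in per_example], dim=0)   # shape (N, H, d)

# Chain rule for the MSE loss: weight by 2 * (q - y), then average over examples
dl_dq = 2 * (out.detach() - y_demo)                 # shape (N, 1)
dl_dW1 = (dl_dq[..., None] * df_dW1).mean(dim=0)    # shape (H, d)

# Reference: gradient of the loss itself w.r.t. W1
loss = nn.MSELoss()(net(X_demo), y_demo)
ref = torch.autograd.grad(loss, params)[0]
print(torch.allclose(dl_dW1, ref, atol=1e-5))
```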
