# Intro to PyTorch

Yesterday, we coded up the forward pass and backpropagation _from scratch_. Today, we're going to use an automatic differentiation framework :) We checked our manual gradients before in `jax` b/c its syntax is very transparent for these types of gradient checks, but we'll use `pytorch` for the rest of the block course because it strikes a great balance: it's easy to use for projects, while still making it easy to dive back into the matrix / tensor manipulation code (🥸) if needed (🤓).

**Table of Contents**
1. Build a simple MLP
2. Mean Squared Error Loss
3. Gradient check of the loss
4. Train the NN (with Adam)

In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

### 0) Load in our "data generator" (same as the last notebook, `Our-first-NN.ipynb`).

In [None]:
coeffs_true = [5, 4, -2, -0.7]

def generate_data(N):
    '''
    Same function as yesterday
    '''
    x = np.random.uniform(low=-1, high=1, size=N)
    
    # y ~ N(mu=f(x), std=0.2)
    mu = np.polyval(coeffs_true,x)
    std = 0.2 * np.random.randn(N)
    y = mu + std
    
    return x,y

def make_features(N, degree=4):
    x,y = generate_data(N)
    X = np.column_stack([x**i for i in reversed(range(degree+1))])
    return X,y

In [None]:
N=200
X_np,y_np = make_features(N)

print('X',X_np.shape)
print('y',y_np.shape)


Typecast the NumPy arrays to torch tensors.

In [None]:
N=200
X = torch.tensor(X_np,dtype=torch.float32)
y = torch.tensor(y_np,dtype=torch.float32)
y = y[:,None] # add a trailing dim so y has shape (N,1), matching the NN output

print('X',X.shape)
print('y',y.shape)


### 1) Build the simple MLP in `pytorch` that we played with yesterday
- Input $X \in \mathbb{R}^{N \times d}$, $d=5$
- NN with a single hidden layer, $H=16$ hidden units
- ReLU nonlinearity
- Output $y \in \mathbb{R}^N$

In [None]:
from torch import nn

d = X.shape[1]
H = 16

In [None]:
'''
Your turn! Define a NN
'''
f =

In [None]:
# Test the evaluation, does it have the shape you expect??
f(X).shape
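
If you get stuck, here is a minimal sketch of one possible way to assemble a one-hidden-layer MLP with `nn.Sequential`. The names `mlp_sketch`, `X_demo`, `d_demo`, `H_demo`, and `N_demo` are made up for this illustration and the input is random, so treat it as a hint rather than the reference solution.

```python
import torch
from torch import nn

# Hypothetical sizes mirroring the bullet list above (d=5, H=16, N=200)
d_demo, H_demo, N_demo = 5, 16, 200
X_demo = torch.randn(N_demo, d_demo)  # random stand-in for the real features

# Linear -> ReLU -> Linear, with a single output unit
mlp_sketch = nn.Sequential(
    nn.Linear(d_demo, H_demo),
    nn.ReLU(),
    nn.Linear(H_demo, 1),
)

out_demo = mlp_sketch(X_demo)
print(out_demo.shape)  # torch.Size([200, 1])

# Parameters come out in the order W1, b1, W2, b2.
# Note that nn.Linear stores its weight as (out_features, in_features).
param_shapes = [tuple(p.shape) for p in mlp_sketch.parameters()]
print(param_shapes)  # [(16, 5), (16,), (1, 16), (1,)]
```

The output has shape `(N, 1)` rather than `(N,)`, which is why `y` was given a trailing dimension earlier.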

### 2) Mean Squared Error loss

Note, torch builds the computation graph during the forward pass, and only computes the gradients through it when we call `.backward()`.

Let's illustrate this w/ a linear model!

In [None]:
x = torch.tensor([1.,2.])
w = torch.tensor([.2,.3],requires_grad=True)

f_lin = w @ x

In [None]:
f_lin

In [None]:
print(w.grad) # None: nothing has been backpropagated yet

In [None]:
f_lin.backward()

In [None]:
print(w.grad) # is x
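
To see that the graph already exists before `.backward()` is called, you can inspect the `grad_fn` attribute of the output. A small sketch, using the made-up names `x_demo`, `w_demo`, and `out` so it does not clash with the cells above:

```python
import torch

x_demo = torch.tensor([1., 2.])
w_demo = torch.tensor([.2, .3], requires_grad=True)

# Forward pass: autograd records the operation that produced `out`
out = w_demo @ x_demo
print(out.grad_fn)   # a backward node, so the graph is already there
print(w_demo.grad)   # None, because no gradient has been computed yet

# Backward pass: walk the recorded graph and fill in .grad
out.backward()
print(w_demo.grad)   # d(w.x)/dw = x
```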

Another pytorch "gotcha": when you call `.backward()` multiple times, the gradients are not overwritten... you _sum up the gradients_ in `.grad`.

What does this look like??

In [None]:
# Similar to the ex above, a lin model with just two weights
m = nn.Sequential(nn.Linear(2,1))

In [None]:
next(m.parameters())

In [None]:
m(x)

In [None]:
for i in range(10):
    fx = m(x)
    fx.backward()
    print(f'Iter {i} df/dw =', next(m.parameters()).grad)

**Fix:** Need to zero out the gradient between calls to `.backward()`.

In [None]:
for i in range(10):
    fx = m(x)
    m.zero_grad()
    fx.backward()
    print(f'Iter {i} df/dw =', next(m.parameters()).grad)
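
The same accumulation happens for a bare tensor that is not wrapped in an `nn.Module`. There is no `zero_grad()` method on a tensor, so one option is to reset `.grad` by hand. A minimal sketch with the hypothetical names `x_demo` and `w_demo`:

```python
import torch

x_demo = torch.tensor([1., 2.])
w_demo = torch.tensor([.2, .3], requires_grad=True)

# Three backward calls without resetting: the gradients add up
for _ in range(3):
    (w_demo @ x_demo).backward()
grad_accumulated = w_demo.grad.clone()
print(grad_accumulated)  # 3 * x_demo

# Reset by hand, then backpropagate once more
w_demo.grad = None
(w_demo @ x_demo).backward()
grad_fresh = w_demo.grad.clone()
print(grad_fresh)  # x_demo
```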

#### Task for you!

Calculate the loss of the simple MLP `f` defined above.

Note, the final loss should be the average over all $N=200$ examples. Do you have the dimensionality that you expect?
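
A common way for the dimensionality to go wrong is broadcasting: subtracting an `(N,)` target from an `(N, 1)` prediction silently produces an `(N, N)` array. The sketch below uses random stand-ins (`pred`, `target` are hypothetical names, not the notebook's `f(X)` and `y`) to show the shapes, and compares a hand-written mean squared error against `torch.nn.functional.mse_loss`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred = torch.randn(200, 1)   # stand-in for a network output of shape (N, 1)
target = torch.randn(200)    # stand-in for a target of shape (N,)

# Mismatched shapes broadcast to (N, N): almost never what you want
bad = pred - target
print(bad.shape)   # torch.Size([200, 200])

# Matching shapes give one residual per example
good = pred - target[:, None]
print(good.shape)  # torch.Size([200, 1])

# Mean squared error: a single scalar averaged over all examples
mse = (good ** 2).mean()
ref = F.mse_loss(pred, target[:, None])
print(mse.shape, torch.isclose(mse, ref))
```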

In [None]:
loss = 

In [None]:
# Backpropagate through the computational graph to fill in the .grad attributes
f.zero_grad()
loss.backward()

In [None]:
# Print and save it to a dictionary
keys = ['W1','b1','W2','b2']

grad_torch = {}

for k, p in zip(keys,f.parameters()):
    print(k,p.shape)
    print(p.grad)

    grad_torch[k] = p.grad.clone() # clone, so a later zero_grad() can't modify our saved copy

Nice!! We have $\nabla_{W1} \mathcal{L}$ now, just like we always wanted!

### 3) Gradient check, a.k.a. Differentiable Detective

We're now in a place where we can use autodiff to check the computational graph solution for $\nabla_{W1} \mathcal{L}$, $\nabla_{b1} \mathcal{L}$, $\nabla_{W2} \mathcal{L}$, $\nabla_{b2} \mathcal{L}$ that we derived at the beginning of the lecture.

Note, getting $\nabla_{W1} f$, $\nabla_{b1} f$, $\nabla_{W2} f$, $\nabla_{b2} f$ is a little annoying in pytorch b/c it wants to calculate the gradient of a single scalar, while the NN output is an (N,1) array, where N is the number of examples.

The code snippet below gets you the sample-wise gradient. For the scope of this lecture, it's not expected that you need to understand the details of this code snippet, just that you can use the output to check your worksheet calculation. 

In [None]:
# Init the dict
grad_dict_f = {k:[] for k in keys}

# Loop over each example in the batch
for i in range(N):
    # Take the grad w/r.t. the example
    # A.k.a, set up a computation graph for the example

    # Warning! Need to zero out the gradients first!!
    f.zero_grad()
    
    f(X)[i].backward()

    # Append the gradients to the list
    for k, p in zip(keys,f.parameters()):
        grad_dict_f[k].append(p.grad.clone()) # clone: zero_grad() may reuse the same tensor

# stack each list into a single tensor with a leading dimension of size N
for k in keys:
    grad_dict_f[k] = torch.stack(grad_dict_f[k],dim=0)
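
To see how sample-wise gradients of the output relate to the gradient of the averaged squared-error loss, here is a sketch on a tiny linear model (not the MLP `f`). It assumes the loss is the mean of the squared residuals, so the chain rule gives $\nabla_w \mathcal{L} = \frac{1}{N}\sum_i 2\,(f(x_i) - y_i)\,\nabla_w f(x_i)$. All names (`lin`, `X_demo`, `y_demo`, `df_dw`, ...) are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
N_demo = 8
X_demo = torch.randn(N_demo, 3)
y_demo = torch.randn(N_demo, 1)
lin = nn.Linear(3, 1)  # toy model standing in for the MLP

# Reference: let autograd differentiate the mean squared error directly
lin.zero_grad()
loss = ((lin(X_demo) - y_demo) ** 2).mean()
loss.backward()
grad_auto = lin.weight.grad.clone()

# Sample-wise gradients of the output, one backward call per example
per_sample = []
for i in range(N_demo):
    lin.zero_grad()
    lin(X_demo)[i].backward()
    per_sample.append(lin.weight.grad.clone())
df_dw = torch.stack(per_sample, dim=0)  # shape (N, 1, 3)

# Chain rule: weight each sample's gradient by 2 * residual, then average
with torch.no_grad():
    residual = lin(X_demo) - y_demo     # shape (N, 1)
grad_chain = (2 * residual[:, :, None] * df_dw).mean(dim=0)

print(torch.allclose(grad_chain, grad_auto, atol=1e-6))
```

The extra `None` index on `residual` is only there so that it broadcasts against the leading sample dimension of `df_dw`.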

In [None]:
dl_dw1 = # your code here
dl_dw1

In [None]:
grad_torch['W1']

^ Above you should visually compare the gradients for the torch calc and the formula you circled this morning on your worksheet.

In [None]:
dl_dw2 = # your code here

dl_db1 = # your code here
dl_db2 = # your code here

In [None]:
for k, manual_grad in zip(keys, [dl_dw1, dl_db1, dl_dw2, dl_db2]):
    print(torch.all(torch.isclose(manual_grad,grad_torch[k])))

### 4) Train in pytorch

Train your tiny MLP regression model on this polynomial dataset:
- Use the Adam optimizer 
- Monitor a training and test dataset
- If time permits, explore the dependence on the training dataset size
    * Is our model underfitting or overfitting?
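
As a starting point, a full-batch training loop with Adam could look roughly like the sketch below. The helper `make_split`, the learning rate `1e-2`, the 500 steps, and the test set size are all assumptions chosen for illustration, not values prescribed by the course. The data generator is re-implemented here only so that the block runs on its own.

```python
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

def make_split(n, degree=4):
    """Hypothetical helper: draw n noisy samples from the same cubic as above."""
    x = rng.uniform(-1, 1, size=n)
    y = np.polyval([5, 4, -2, -0.7], x) + 0.2 * rng.standard_normal(n)
    X = np.column_stack([x**i for i in reversed(range(degree + 1))])
    return (torch.tensor(X, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32)[:, None])

X_train, y_train = make_split(200)
X_test, y_test = make_split(200)

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # assumed learning rate
loss_fn = nn.MSELoss()

train_losses, test_losses = [], []
for step in range(500):  # assumed number of full-batch steps
    optimizer.zero_grad()                    # clear gradients from the last step
    loss = loss_fn(model(X_train), y_train)  # forward pass on the training set
    loss.backward()                          # backpropagate
    optimizer.step()                         # Adam parameter update

    with torch.no_grad():                    # no graph needed for monitoring
        test_loss = loss_fn(model(X_test), y_test)
    train_losses.append(loss.item())
    test_losses.append(test_loss.item())

print(f'train {train_losses[-1]:.4f}  test {test_losses[-1]:.4f}')
```

Plotting `train_losses` and `test_losses` against the step index (for example with `plt.plot`) and rerunning with a different training set size is one way to look at the under- vs. overfitting question.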