# Finding the perfect model

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nhartman94/TTT-model-building/blob/main/MNIST-shallow-regularization.ipynb)

**Goal:** In the lecture, we learned a set of tricks for building a model that's "just right" for the data!

In this tutorial, we're going to learn how to use these training and model building tricks to train a model in PyTorch on a toy dataset.

We'll focus on understanding what's happening under the hood to gain intuition for what these "Occam's razor" regularization tricks are doing.

### Table of Contents

1. Data visualization 

- **Q1:** Plot the avg images

2. Model setup

- **Q2:** What's the loss before training (analytical calc)
- **Q3:** What's the loss before training (code check)

3. First training

- **Q4:** Compare with the validation loss

4. Regularization techniques
    - 4a) Batch normalization

        - **Q5:** Visualize the output from the activations in the last step
        - **Q6:** Implement batch norm in your model
        - **Q7:** Compare the activations from the batch normalized model

    - 4b) Dropout
        - **Q8:** Implement dropout in the model

    - 4c) (Bonus) L2 norm
        - **Q9:** Visualize the weights for the models trained earlier
        - **Q10:** Implement the L2 loss

5. Evaluate on the test set

In [None]:
import torch
from torch import nn
from torch.nn import functional as F

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split

In [None]:
# %pip install scikit-learn

## 1. Data setup

We're going to use the same MNIST dataset as in Israt's tutorial yesterday.

(Caveat: here we're trying out a smaller version, with 8x8 images instead of 28x28.)

In [None]:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html

digits = datasets.load_digits()

In [None]:
# digits

In [None]:
X = digits['data']
y = digits["target"]

In [None]:
# train / val / test (60/20/20) split

# 20% test data
X_tr, X_test, y_tr, y_test = train_test_split(
    torch.FloatTensor(X), torch.LongTensor(y), test_size=0.2
)

# split validation set, 
# 25% of remaining data goes into the validation set

X_tr, X_val, y_tr, y_val = \
    train_test_split(X_tr, y_tr, test_size=0.25)

In [None]:
X_tr.shape
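
As a quick sanity check on the split sizes, the arithmetic behind the two successive `train_test_split` calls can be written out. This is only a sketch of the fractions (the variable names are made up here), not of the exact sample counts, which depend on how scikit-learn rounds.

```python
# Fractions implied by two successive splits (values taken from the cell above).
test_frac = 0.20            # first split: 20% held out for testing
val_frac_of_rest = 0.25     # second split: 25% of what remains goes to validation

remaining = 1.0 - test_frac
val_frac = remaining * val_frac_of_rest
train_frac = remaining * (1.0 - val_frac_of_rest)

print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

So the validation set ends up the same size as the test set, and the remaining 60% is used for training.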

**Q1** Plot the avg image for each of the classes 
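
If boolean masking is new to you, here is a minimal sketch of per-class averaging on a toy tensor. The names `X_toy`, `y_toy` and `class_mean_image` are hypothetical and not part of the tutorial code; the random data only stands in for the digits.

```python
import torch

# Toy stand-in for the digits data: 6 flattened 8x8 "images" with labels 0..2.
torch.manual_seed(0)
X_toy = torch.rand(6, 64)
y_toy = torch.tensor([0, 1, 2, 0, 1, 2])

def class_mean_image(X, y, label, side=8):
    """Average all rows of X whose label matches, reshaped to side x side."""
    mask = (y == label)  # boolean mask selecting a single class
    return X[mask].mean(dim=0).reshape(side, side)

mean_img = class_mean_image(X_toy, y_toy, label=0)
print(mean_img.shape)
```

The resulting 2D tensor can be passed straight to `ax.imshow`.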

In [None]:
fig, axs = plt.subplots(2,5,figsize=(16,7))

for i,ax in enumerate(axs.flatten()):
    # mean img
    xi =  ... # your code here 
    ax.imshow(xi,cmap='Greys')
    ax.set_title(f'y={i}')


In [None]:
'''
This function draws a random batch of samples from the training set
'''

N_tr = len(y_tr)
print('# of training evts',N_tr)

def get_train_data(bs=128):

    # Sample `bs` indices (with replacement) from the training set
    idx = np.random.choice(N_tr, bs)

    return X_tr[idx], y_tr[idx]


## 2. Model setup

We're at the ERUM deep learning course... so ofc we want to train a NN for classification 😃


In [None]:
# Utility function: count the # of parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

Below we remind you how to build a model with two hidden layers of 64 hidden units each:

💡 powers of 2

In [None]:
in_dim = 64
out_dim = 10

# Starter code
class myNN(nn.Module):
    """
    We'll keep adding functionality to this NN as we go thru the
    next exercises, but this is the starting skeleton
    """

    def __init__(self, H=64):
        super(myNN, self).__init__()

        # In __init__, define the layers that hold the trainable weights
        self.in_layer = nn.Linear(in_dim, H)
        self.hid_layer = nn.Linear(H, H)
        self.out_layer = nn.Linear(H, out_dim)

    def forward(self, x):
        
        # First linear transformation
        x = self.in_layer(x)
        x = nn.ReLU()(x)

        x = self.hid_layer(x)
        x = nn.ReLU()(x)

        x = self.out_layer(x)

        return x


In [None]:
m = myNN(64)

with torch.no_grad():
    out = m(X_tr)
print(out.shape)
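
To build some intuition for the model size, the parameter count can be worked out by hand and compared against PyTorch. The sketch below uses a hypothetical `nn.Sequential` stand-in with the same layer sizes as `myNN(64)`, so it runs on its own.

```python
from torch import nn

# Hypothetical stand-in with the same layer sizes as myNN(H=64).
in_dim, H, out_dim = 64, 64, 10
layers = nn.Sequential(
    nn.Linear(in_dim, H), nn.ReLU(),
    nn.Linear(H, H), nn.ReLU(),
    nn.Linear(H, out_dim),
)

# Each nn.Linear(a, b) holds a*b weights plus b biases; ReLU has no parameters.
by_hand = (in_dim * H + H) + (H * H + H) + (H * out_dim + out_dim)
by_torch = sum(p.numel() for p in layers.parameters())
print(by_hand, by_torch)
```

Both numbers should agree, and most of the parameters sit in the two 64x64 weight matrices.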

### Loss function: Categorical cross entropy

For targets $y \in \{0, 1, \ldots, 9\}$, our model outputs $z \in \mathbb{R}^{10}$, the logits (unnormalized log-probabilities) for these 10 classes.

We want to interpret the output of the model probabilistically, which we can do via the softmax:

$$\mathrm{Softmax}(z) \rightarrow p_i = \frac{\exp(z_i)}{\sum_{j=1}^K\exp(z_j)}$$

The **cross entropy** loss function is then the mean negative log likelihood of the training data:

$$
\mathcal{L} = - \frac{1}{N} \sum_i \log p_{y_i},
$$

where $p_{y_i}$ is the predicted probability of the true target class.
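
To make the formula concrete, here is a small sketch on made-up logits with $K=3$ classes (not the tutorial's 10, and not the network `m`). It computes the loss by hand from the softmax probabilities and compares it with `F.cross_entropy`, which takes the raw logits directly.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
K = 3                                  # toy number of classes (the tutorial uses 10)
logits = torch.randn(5, K)             # 5 examples of unnormalized scores
targets = torch.tensor([0, 2, 1, 1, 0])

# Softmax, then pick out the probability assigned to each true class.
probs = torch.softmax(logits, dim=1)
p_true = probs[torch.arange(len(targets)), targets]
manual_loss = -torch.log(p_true).mean()

# The built-in version applies log-softmax internally.
builtin_loss = F.cross_entropy(logits, targets)

# With identical logits, every class gets probability 1/K, so the loss is log(K).
uniform_loss = F.cross_entropy(torch.zeros(5, K), targets)
print(float(manual_loss), float(builtin_loss), float(uniform_loss), math.log(K))
```

The last case is a useful reference point: a model that spreads its probability evenly over the classes has a loss that depends only on the number of classes.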

**Q2:** What's the loss of this randomly initialized network?

In [None]:
# Warm-up: plot the Softmax prob for the network `m`
# (hint: use nn.Softmax() )


**Q3:** 💻 Check your calculation: what is the loss for your model?

**Tip:** You can either code up the loss fct yourself or use `F.cross_entropy`


In [None]:
"""
Q3: YOUR CODE HERE
"""



## 3. First training

OK... can we improve this loss by training? ⚙️⚙️

Since you trained NNs in the tutorial yday we'll give you some starter code.

In [None]:
# Soln to Q10 
def get_L2_loss(model):
    
    return 0

In [None]:
# Starter code
def train_model(model, lr=1e-3):

    print(f"training model with {count_parameters(model)} parameters")
    train_losses = []
    val_losses = []

    opt = torch.optim.Adam(model.parameters(), lr)

    for i in range(1000):  # 1k training steps

        model.train()

        xi, yi = get_train_data(128)  # Draw 128 samples
        logits = model(xi)
        loss = nn.CrossEntropyLoss()(logits, yi)

        opt.zero_grad()
        loss.backward()
        opt.step()

        train_losses.append(float(loss))

        # Q4: Calc loss on validation set

        if i % 200 == 0:
            print(float(loss))
    return (model, train_losses, val_losses) 


Train the model and plot the loss

In [None]:
m, tr_loss, val_loss = train_model(m)

**Q4:** What about the validation loss?

- [ ] Add the functionality to `train_model`
- [ ] Draw the plot

In [None]:
m.eval()
with torch.no_grad():
    print(f"Val loss: {F.cross_entropy(m(X_val), y_val):.4f}")

In [None]:
plt.plot(tr_loss,color='C0',label='train')
plt.plot(val_loss,color='C0',label='val',ls='--')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.legend()  # show the train / val labels

**Q4b:** What do you think? Are we overfitting or underfitting?

## 4. Regularization techniques

### 4a) Batch normalization 

**Q5:** Plot the activations of the hidden units right before the ReLU

(Fun fact: there should be 64 of them, b/c this is the size of the hidden latent space!)

**Hint:** For this it will be a lot easier to use the functional API than the sequential one.

In [None]:
"""
Q5: Plot the activations (on the validation set)

(For inspiration, we show you the first layer)

"""
fig,axs = plt.subplots(1,3,figsize=(10,3))

# Step 1: apply the input xform
with torch.no_grad():
    x = m.in_layer(X_val)  
    
axs[0].hist(x, 100, histtype="step", alpha=0.5)

# Step 2: apply the hidden activations

... #<-  your code here

# Step 3: Last output transform

for i, ax in enumerate(axs):
    ax.set_title(f'Layer {i+1} activations')

plt.show()

**What's your take away?**

**Q6:** Add batch norm to the model architecture.
- Tip: `nn.BatchNorm1d`
- Put it right before the ReLU nonlinearity.
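
To see what the layer does in isolation, here is a minimal sketch on made-up, badly scaled inputs (the tensor `x` and the `block` below are illustrative, not the model you are asked to build). In training mode, `nn.BatchNorm1d` normalizes each feature using the statistics of the current batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = 5.0 * torch.randn(128, 64) + 3.0   # batch of badly scaled pre-activations

bn = nn.BatchNorm1d(64)
bn.train()                             # training mode: normalize with batch statistics
out = bn(x)

batch_mean = out.mean(dim=0)                 # per-feature mean over the batch
batch_var = out.var(dim=0, unbiased=False)   # per-feature (biased) variance
print(float(batch_mean.abs().max()), float(batch_var.mean()))

# Placement suggested above: linear layer, then batch norm, then ReLU.
block = nn.Sequential(nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU())
print(block(x).shape)
```

In evaluation mode the layer switches to its running averages instead of the batch statistics, which is one reason calling `model.eval()` before computing the validation loss matters.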

Train the new model.

How do the training and validation losses compare?

In [None]:
m_bn = ... # <- your code here

**What do you think?**

**Q7:** Now plot the activations of the model trained with batch norm (i.e., plot the model output right after the batch norm layer).

Are they closer to 0 mean, unit variance?

In [None]:
"""
Q7: Revisit Q5, but w/ the model trained w/ bn
"""


**What do you think?**

### 4b) Dropout

**Q8:** Add dropout to the model and compare trainings.

Where to put it? Put it after the ReLU nonlinearities.
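
Before wiring it into the model, it can help to look at `nn.Dropout` on its own. The sketch below uses a made-up input of ones and an assumed drop probability `p = 0.5`, and shows the difference between training and evaluation mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.5                    # assumed drop probability, for illustration only
drop = nn.Dropout(p)
x = torch.ones(1000, 64)

drop.train()
train_out = drop(x)        # entries zeroed with prob. p, survivors scaled by 1/(1-p)

drop.eval()
eval_out = drop(x)         # dropout is the identity at evaluation time

frac_zero = (train_out == 0).float().mean().item()
print(train_out.unique(), frac_zero)
```

The rescaling of the surviving entries keeps the expected activation the same in both modes, so no extra correction is needed when you switch to `model.eval()`.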

In [None]:
'''
Q8: Your code here
'''

**What do you think?**

### 4c) (Bonus) L2 regularization

**Q9:** Visualize the weights for our trained models

In [None]:
# Tip: To get the weight and bias of the first layer..
print('W',m.in_layer.weight)
print("b", m.in_layer.bias)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(10, 3))

# input layer
for i, ax in zip(range(3),axs):


    ax.set_title(f"Linear{i+1}")



**What do you think?**

**Q10:** Implement the L2 loss and compare...

Question for you... _where_ and how will you implement the L2 loss?
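
One possible way to think about it is as an extra term added to the data loss. This is a sketch under assumptions: the helper name `l2_penalty`, the coefficient `lam`, and the choice to penalize only the weight matrices (not the biases) are all made up here.

```python
import torch
from torch import nn

def l2_penalty(model, lam=1e-4):
    """Hypothetical helper: lam * sum of squared entries of all weight matrices."""
    # Only parameters with 2+ dimensions (weights) are penalized; biases are skipped by choice.
    return lam * sum((p ** 2).sum() for p in model.parameters() if p.dim() > 1)

# Tiny layer with known weights so the result can be checked by hand.
layer = nn.Linear(2, 1)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[3.0, 4.0]]))

penalty = l2_penalty(layer, lam=0.1)   # 0.1 * (3^2 + 4^2)
print(float(penalty))

# In a training step, this term would be added to the data loss before backward():
# loss = F.cross_entropy(logits, yi) + l2_penalty(model)
```

An alternative worth comparing is the `weight_decay` argument of `torch.optim.Adam`, which adds a similar penalty at the gradient level and applies to every parameter it is given.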


In [None]:
# Your code here

**Compare with the model weights after training w/ L2**

In [None]:
# Your code here

## 5. Evaluate on the test set

Now that you're done w/ the optimizations on the val set... how did we do on the test set?
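
Besides the loss, classification accuracy is a natural number to report. The `accuracy` helper and the toy tensors below are hypothetical; for the real evaluation you would pass in the model's logits on `X_test` (computed in `eval()` mode under `torch.no_grad()`) and `y_test`.

```python
import torch

def accuracy(logits, targets):
    """Fraction of examples whose highest-scoring class equals the target."""
    preds = logits.argmax(dim=1)
    return (preds == targets).float().mean().item()

# Toy check: the first two rows are classified correctly, the third is not.
toy_logits = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 1.0]])
toy_targets = torch.tensor([0, 1, 1])
acc = accuracy(toy_logits, toy_targets)
print(acc)
```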

In [None]:
# Fill in after you're done w/ any final optimizations you want to do!

## Final thoughts

Great job! We've been diving into some of the guts of NNs and classic training techniques.

Something you might have noticed is that your experiments are a bit noisy, e.g., rerunning w/ a new random seed can produce different results.

For a more robust study, we'd repeat each experiment 5-10 times and report the mean and error bar (the "deep ensembles error" that we discussed in lecture)... but the point here was just to get some hands-on-keyboard understanding of the concepts we were covering.
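
A repeated-seed study could be organized roughly as follows. Here `run_experiment` is a hypothetical stand-in that just returns a noisy number. In practice it would seed the random number generators, train a fresh model, and return its validation loss. The standard error of the mean is one simple choice of error bar.

```python
import math
import torch

def run_experiment(seed):
    """Hypothetical stand-in for one full training run; returns a noisy 'val loss'."""
    gen = torch.Generator().manual_seed(seed)
    return 0.10 + 0.02 * torch.randn(1, generator=gen).item()

losses = [run_experiment(seed) for seed in range(5)]
mean = sum(losses) / len(losses)

# Sample standard deviation, then the standard error of the mean.
std = math.sqrt(sum((l - mean) ** 2 for l in losses) / (len(losses) - 1))
sem = std / math.sqrt(len(losses))
print(f"val loss = {mean:.3f} +/- {sem:.3f}")
```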

Until next time... happy training 🌞 🚊