<a href="https://colab.research.google.com/github/harvard-visionlab/psy1410/blob/master/psy1410_week02_anns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Psy1410 - Week02 - Artificial Neural Networks with PyTorch

This week we're going to use PyTorch to create and train ANNs on the MNIST digit recognition task.

For this workshop, set your Runtime to GPU!

In [None]:
#required
import torch 

torch.cuda.is_available()

## Import and define Helper Functions

Here we'll define any helper functions that we can use as we go. We'll probably add to this as we find a need for new helper functions.

In [None]:
#required
%config InlineBackend.figure_format = 'retina'

In [None]:
#required
import numpy as np 
from PIL import Image 
from IPython.core.debugger import set_trace 
import matplotlib.pyplot as plt

def show_image(img):
  # scale [0, 1] pixel values to [0, 255]; multiplying by 256 would wrap 1.0 around to 0 in uint8
  return Image.fromarray( (img * 255).squeeze().numpy().astype(np.uint8) )

def show_weights(m):
  idx = -1
  fig, axs = plt.subplots(2, 5, figsize=(15, 6))
  for row in axs:
    for ax in row:
      idx += 1
      if hasattr(m, 'weight') and len(m.weight.shape) == 4:
        shape = m.weight[idx].shape[1:]
        w = m.weight[idx].detach().reshape(*shape).cpu()
        ax.imshow(w, extent=[0, 1, 0, 1], cmap='gray')
      elif hasattr(m, 'weight'):
        w = m.weight[idx].detach().reshape(28,28).cpu()
        ax.imshow(w, extent=[0, 1, 0, 1], cmap='coolwarm')
      else:
        w = m.fc.weight[idx].detach().reshape(28,28).cpu()
        ax.imshow(w, extent=[0, 1, 0, 1], cmap='coolwarm')
      ax.set_title(f"unit={idx}")
      ax.grid(True)
      ax.axes.get_xaxis().set_visible(False)
      ax.axes.get_yaxis().set_visible(False)
  plt.show() 

## A Minimal ANN

Let's start by defining a very minimal artificial neural network, with a single fully-connected linear layer that directly maps the input (1x28x28 pixels) to the output categories (10 digit categories).
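
To make the "weighted sum" idea concrete, here is a minimal sketch (not part of the workshop code) suggesting that a fully-connected layer is just a matrix multiply plus a bias. The names `layer`, `x`, and `manual` are made up for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# a fully-connected layer mapping 784 pixel values to 10 category scores
layer = nn.Linear(in_features=784, out_features=10)

# a batch of 4 fake "images", flattened from 1x28x28 to 784 values each
x = torch.rand(4, 1, 28, 28).view(4, -1)

# each output is a weighted sum of all 784 pixels, plus one bias per category
manual = x @ layer.weight.T + layer.bias

print(manual.shape)                                 # torch.Size([4, 10])
print(torch.allclose(layer(x), manual, atol=1e-5))  # do both routes agree?
```

If the two results agree, the layer is doing nothing more mysterious than the formula in the comment.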

In [None]:
import torch
import torch.nn as nn

class MyNet(nn.Module):
  def __init__(self):
    super(MyNet, self).__init__()
    # in_features = 784, because the input image is 1x28x28 = 784
    # out_features = 10, because there are 10 output categories (digits 0-9)
    self.fc = nn.Linear(in_features=784, out_features=10)
  
  def forward(self, x):
    # in the "forward pass", we take an input (a batch of images, x)
    # then first we flatten it into batchSize x 784, 
    batchSize = x.shape[0] # first dimension of x is "batchSize"
    x = x.view(batchSize, -1) # the -1 tells pytorch to flatten the tensor to be batchSize x "whatever size fits"

    # finally, we pass the flattened input into our fully-connected layer 
    # which will compute the weighted sum of the input for each of the 10 
    # categories
    x = self.fc(x)

    return x

In [None]:
# create an instance of MyNet
model = MyNet()
model

In [None]:
# test on random data (100 random images)
fake_imgs = torch.rand(100,1,28,28)
out = model(fake_imgs)
out.shape

In [None]:
# why is the output shape "100x10"?

In [None]:
# inspect the "learnable parameters" of your network
# You should find 2 sets of parameters: 10 x 784 weights, and 10 biases
params = list(model.parameters())
print(len(params))

In [None]:
for param in params:
  print(param.shape)

## Inspect/visualize the weights of your randomly initialized network

Remember that each output node has a weight on each of the 28x28 pixels. We can visualize these weights by color-coding the pixels according to the weight (negatives in blue, positives in red; brighter colors for larger weights).

In [None]:
# we can directly access modules of the model, and their params, like so:
model.fc.weight.shape, model.fc.bias.shape

In [None]:
# grab the weights for the `zero` output node
w = model.fc.weight[0].detach().reshape(28,28)
w.shape

In [None]:
plt.imshow(w, extent=[0, 1, 0, 1], cmap='coolwarm');

In [None]:
show_weights(model.fc)

## Let's Train this Model!

We'll need:
- [x] a model
- [ ] a dataset (MNIST), with train/test split
- [ ] a loss function (Cross Entropy Loss)
- [ ] an optimizer (which uses the gradients from the `back-propagation of errors` to modify the weights)
- [ ] we need a training function
- [ ] useful to have a validation function too (to test how well the model generalizes to data outside of the training set)

## MNIST Dataset

- we'll start with the standard MNIST dataset

In [None]:
from torchvision import datasets, transforms

transform = transforms.Compose([
  transforms.ToTensor(),
])

In [None]:
train_dataset = datasets.MNIST('./data/MNIST', train=True, download=True, transform=transform)
train_dataset

In [None]:
test_dataset = datasets.MNIST('./data/MNIST', train=False, download=True, transform=transform)
test_dataset

In [None]:
train_dataset[0][0].shape

In [None]:
from torch.utils.data import DataLoader

In [None]:
train_loader = DataLoader(train_dataset, batch_size=256, 
                          num_workers=4, pin_memory=True, shuffle=True)
train_loader

In [None]:
test_loader = DataLoader(test_dataset, batch_size=256, 
                         num_workers=4, pin_memory=True, shuffle=False)
test_loader

In [None]:
imgs, labels = next(iter(train_loader))

In [None]:
imgs.shape, labels.shape
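
To see how a `DataLoader` carves a dataset into mini-batches, here is a small sketch that uses fake data in place of MNIST (the `fake_dataset` and `loader` names are only for this example). With the 60,000 MNIST training images and `batch_size=256`, the same arithmetic gives 235 batches per epoch, the last one holding the leftover 96 images.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for MNIST: 1000 fake 1x28x28 images with random labels 0-9
fake_dataset = TensorDataset(torch.rand(1000, 1, 28, 28),
                             torch.randint(0, 10, (1000,)))
loader = DataLoader(fake_dataset, batch_size=256, shuffle=True)

batch_shapes = [tuple(imgs.shape) for imgs, labels in loader]
print(len(loader), math.ceil(1000 / 256))  # both 4
print(batch_shapes[0], batch_shapes[-1])   # a full batch, then a smaller last batch
```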

In [None]:
output = model(imgs)
output.shape

In [None]:
idx = 10
actual = labels[idx].item()
print(actual)
show_image(imgs[idx])

In [None]:
softmax = output[idx].exp()/output[idx].exp().sum()
softmax

In [None]:
predicted = softmax.argmax().item() 
print(f"predicted={predicted}, actual={actual}")

## Loss Function

Let's use the standard cross-entropy loss function
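
As a rough sketch of what cross-entropy measures: it is the negative log of the softmax probability that the model assigns to the correct label, averaged over the batch. The example below uses made-up scores (`logits`) and labels (`targets`) and compares a by-hand calculation against `nn.CrossEntropyLoss`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 10)               # fake model outputs: 5 images x 10 categories
targets = torch.tensor([3, 0, 7, 1, 9])   # fake correct labels

# softmax turns each row of scores into probabilities that sum to 1
probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)

# loss for one image = -log(probability assigned to its correct label)
manual_loss = -probs[torch.arange(5), targets].log().mean()

builtin_loss = nn.CrossEntropyLoss()(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

Note that `nn.CrossEntropyLoss` applies the softmax internally, which is why the model outputs raw scores rather than probabilities.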

In [None]:
import torch 
import torch.nn as nn

In [None]:
# create a fresh instance of your model 
model = MyNet()

In [None]:
# define loss function (criterion)
criterion = nn.CrossEntropyLoss()

In [None]:
# pass some images through your model, get the outputs
# why is the output 256 x 10?
imgs, labels = next(iter(train_loader))
output = model(imgs)
output.shape

In [None]:
loss = criterion(output, labels)
loss 

## Define the Optimizer

In [None]:
# define the optimizer
# this updates the weights for us using gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=.03)
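
For plain SGD (no momentum or weight decay), the update is simply "new weight = old weight - lr * gradient". The sketch below, using a throwaway `layer` and fake data, checks that by-hand rule against what `optimizer.step()` does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.03
layer = nn.Linear(784, 10)
x, y = torch.rand(8, 784), torch.randint(0, 10, (8,))

# compute gradients of the loss with respect to the weights and biases
loss = nn.CrossEntropyLoss()(layer(x), y)
loss.backward()

# what a plain SGD step should produce: move each parameter against its gradient
expected = [p.detach() - lr * p.grad for p in layer.parameters()]

# let the optimizer perform the same update in place
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)
optimizer.step()

matches = all(torch.allclose(p, e) for p, e in zip(layer.parameters(), expected))
print(matches)
```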

## The training loop

In [None]:
#required
def train(model, train_loader, criterion, optimizer, mb=None):
  # use gpu if available
  device = 'cuda' if torch.cuda.is_available() else 'cpu'  
  model.to(device)
  criterion.to(device)

  # place model in "train mode" (turns on training-only behavior such as dropout)
  model.train()
  
  # loop through ALL images
  losses = []
  for imgs,labels in progress_bar(train_loader, parent=mb):
    # put images and labels on gpu if available
    imgs = imgs.to(device)
    labels = labels.to(device)

    # forward pass (pass images through model)
    output = model(imgs)

    # compute the loss 
    loss = criterion(output, labels)

    # backward pass (compute gradients, do backprop)
    optimizer.zero_grad() # zero out any existing gradients
    loss.backward()       # compute gradients (tells us which direction to change weights)
    optimizer.step()      # modify learnable parameters (optimizer decides how much to update weights, in direction of gradients)

    losses.append(loss.item())

  return torch.tensor(losses).mean().item()

## The "test" or "validation" loop

In [None]:
#required
def validate(model, test_loader, criterion, optimizer, mb=None):
  # use gpu if available
  device = 'cuda' if torch.cuda.is_available() else 'cpu'
  model.to(device)
  criterion.to(device)

  # place the model in "eval" mode (turns off training-only behavior such as dropout)
  model.eval()

  # iterate over batches, compute loss and accuracy for each batch
  losses = []
  correct = []
  # no_grad() is what actually stops gradients from being tracked during testing
  with torch.no_grad():
    for imgs,labels in progress_bar(test_loader, parent=mb):
      imgs = imgs.to(device)
      labels = labels.to(device)

      # forward pass
      output = model(imgs)

      # calculate loss and classification accuracy
      loss = criterion(output, labels)
      _, correct_k = accuracy(output, labels, topk=(1,))

      losses.append(loss.item())
      correct.append(correct_k)

  top1 = torch.cat(correct).mean()

  return torch.tensor(losses).mean().item(), top1.mean().item()

def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        acc = []
        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float()
            acc.append(correct_k)            
            res.append(correct_k.sum(0, keepdim=True).mul_(100.0 / batch_size))
        return res, acc[0]
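
For the top-1 case used here, the `accuracy` helper above boils down to "does the highest-scoring category match the label?". A tiny hand-checkable sketch with made-up scores:

```python
import torch

# fake scores for 4 images x 3 categories, and their correct labels
output = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.1],
                       [1.0, 0.2, 0.4]])
target = torch.tensor([0, 2, 0, 0])

pred = output.argmax(dim=1)          # index of the highest score per image
correct = (pred == target).float()   # 1.0 where the prediction is right
top1 = correct.mean().item()
print(pred.tolist(), top1)           # [0, 2, 1, 0] 0.75
```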

## Finally, the main function

This function runs the train/validate functions for N epochs (however many you want). One epoch is a "full pass" through the entire training set, updating weights & biases after each mini-batch of data. At the end of each epoch, we also test the model on "held out validation data" to make sure we aren't over-learning idiosyncrasies of the training set (we want our model to generalize to new data!).
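
Stripped of progress bars and plotting, the epoch structure looks roughly like the sketch below. It runs on a small synthetic dataset (not MNIST) so it finishes in a moment; `make_split`, `net`, `train_dl`, and `val_dl` are names invented for this example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# synthetic stand-in data: 20 features, label = whether the features sum above 0
def make_split(n):
    x = torch.randn(n, 20)
    return TensorDataset(x, (x.sum(dim=1) > 0).long())

train_dl = DataLoader(make_split(512), batch_size=64, shuffle=True)
val_dl = DataLoader(make_split(128), batch_size=64)

net = nn.Linear(20, 2)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(net.parameters(), lr=0.1)

history = []
for epoch in range(5):
    net.train()
    for x, y in train_dl:            # one epoch = one pass over all training batches
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()                   # weights change after every mini-batch

    net.eval()
    with torch.no_grad():            # validation only measures; it never updates weights
        val_loss = sum(loss_fn(net(x), y).item() for x, y in val_dl) / len(val_dl)
    history.append(val_loss)
    print(f"epoch {epoch}: val_loss {val_loss:.3f}")
```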

In [None]:
#required
from fastprogress.fastprogress import master_bar, progress_bar 

def train_model(num_epochs):
  mb = master_bar( range(num_epochs) )
  mb.names = ['train_loss', 'val_loss']
  xs,y1,y2 = [], [], []
  for epoch in mb:
    train_loss = train(model, train_loader, criterion, optimizer, mb=mb)
    val_loss, top1 = validate(model, test_loader, criterion, optimizer, mb=mb)
    # print(f"Epoch {epoch}: Train Loss {train_loss}, Val Loss {val_loss} Top1 {top1}")

    # graph results
    xs.append(epoch)
    y1.append(train_loss)
    y2.append(val_loss)
    graphs = [[xs,y1], [xs,y2]]
    x_bounds = [0, num_epochs]
    y_bounds = [0,max(max(y1),max(y2))*1.1]
    mb.update_graph(graphs, x_bounds, y_bounds)
  print("All Done!")
  print(f"Epoch {epoch}: Train Loss {train_loss:3.3f}, Val Loss {val_loss:3.3f} Top1 {top1:3.3f}")

## Exercise 1 - Train the Model to Recognize Digits!

In [None]:
# create a fresh instance of our model
model = MyNet()
model

In [None]:
show_weights(model)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=.03)
train_model(num_epochs=15)

In [None]:
show_weights(model)

## Exercise 2 - Improve your Model by training longer (e.g., 30 epochs)


In [None]:
torch.cuda.is_available()

In [None]:
model = MyNet()
model

In [None]:
show_weights(model)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=.03)
train_model(num_epochs=30)

In [None]:
show_weights(model)

## Exercise 2 - Improve your Model by using a better optimizer (e.g., Adam, Adadelta), or by varying the learning rate, or both

Save a record of the results for each variant you try (you can just create a new +Code cell for each run).

SGD often shows the "best generalization" but can also take longer to converge. Adam and Adadelta are adaptive (they adjust the step size for each parameter as training goes), but Adam has been reported to generalize worse than SGD in some settings.
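
One way to keep a record of each variant is to wrap the run in a function and store the result under a descriptive name. The sketch below does this on a toy synthetic problem, so the numbers say nothing about MNIST; `final_loss` and `results` are hypothetical names, and the learning rates simply mirror the ones used in this exercise.

```python
import torch
import torch.nn as nn

def final_loss(make_optimizer, steps=100):
    """Train a fresh linear model on fixed synthetic data; return the last loss."""
    torch.manual_seed(0)                       # same data and init for every optimizer
    x = torch.randn(256, 20)
    y = (x.sum(dim=1) > 0).long()
    net = nn.Linear(20, 2)
    opt = make_optimizer(net.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# one entry per variant, so results can be compared side by side
results = {
    "SGD(lr=0.03)": final_loss(lambda p: torch.optim.SGD(p, lr=0.03)),
    "Adam(lr=0.03)": final_loss(lambda p: torch.optim.Adam(p, lr=0.03)),
    "Adadelta(lr=1.0)": final_loss(lambda p: torch.optim.Adadelta(p, lr=1.0)),
}
for name, loss in results.items():
    print(f"{name}: {loss:.3f}")
```

Fixing the seed inside the function means the only thing that differs between runs is the optimizer, which is the point of changing one thing at a time.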

In [None]:
model = MyNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.03)
#optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model)

In [None]:
# let's try Adam with a higher learning rate (matching the default for Adadelta)
model = MyNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
#optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model)

In [None]:
# now let's try Adadelta (with its default learning rate of 1.0)
model = MyNet()
criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model)

Compare the output weights (the visualizations) for SGD (exercise one), Adam, and Adadelta. What do you notice?

## Exercise 3 - Improve your Model by adding one or more hidden layers, with or without ReLU activations.
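
Why might the ReLU matter? Without a nonlinearity between them, two stacked linear layers can be rewritten as a single linear layer, so the hidden layer adds no new expressive power. A small sketch with throwaway layers `fc1` and `fc2` (sizes chosen to match this exercise):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(784, 128)
fc2 = nn.Linear(128, 10)
x = torch.rand(4, 784)

# two stacked linear layers with no activation in between...
stacked = fc2(fc1(x))

# ...equal one linear layer with combined weights and bias
W = fc2.weight @ fc1.weight                 # shape 10 x 784
b = fc2.weight @ fc1.bias + fc2.bias        # shape 10
collapsed = x @ W.T + b
print(torch.allclose(stacked, collapsed, atol=1e-4))

# with a ReLU in between, the collapsed form no longer matches
with_relu = fc2(torch.relu(fc1(x)))
print(torch.allclose(with_relu, collapsed, atol=1e-4))
```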

In [None]:
class MyShallowNet(nn.Module):
  def __init__(self, use_relu=True):
    super(MyShallowNet, self).__init__()
    self.use_relu = use_relu
    # in_features = 784, because the input image is 1x28x28 = 784
    # out_features = 128, the number of hidden units (a design choice you can vary)
    self.fc = nn.Linear(in_features=784, out_features=128)
    if self.use_relu:
      self.relu1 = nn.ReLU()
    self.fc2 = nn.Linear(in_features=128, out_features=10)
    
  def forward(self, x):
    # in the "forward pass", we take an input (a batch of images, x)
    # then first we flatten it into batchSize x 784, 
    batchSize = x.shape[0] # first dimension of x is "batchSize"
    x = x.view(batchSize, -1) # the -1 tells pytorch to flatten the tensor to be batchSize x "whatever size fits"
    
    x = self.fc(x)
    if self.use_relu:
      x = self.relu1(x)
    x = self.fc2(x)
    return x

In [None]:
model = MyShallowNet()
model

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model)

In [None]:
model = MyShallowNet(use_relu=False)
model

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
class MyDeepNet(nn.Module):
  def __init__(self, use_relu=True, use_dropout=True):
    super(MyDeepNet, self).__init__()
    self.use_relu = use_relu
    self.use_dropout = use_dropout

    # in_features = 784, because the input image is 1x28x28 = 784
    # out_features = 128, the number of hidden units (a design choice you can vary)
    self.fc = nn.Linear(in_features=784, out_features=128)
    if self.use_relu:
      self.relu1 = nn.ReLU()
    
    if self.use_dropout:
      self.dropout1 = nn.Dropout(0.50)  # plain Dropout: Dropout2d is meant for conv feature maps

    self.fc2 = nn.Linear(in_features=128, out_features=128)
    if self.use_relu:
      self.relu2 = nn.ReLU()
    
    if self.use_dropout:
      self.dropout2 = nn.Dropout(0.50)  # plain Dropout: Dropout2d is meant for conv feature maps

    self.fc3 = nn.Linear(in_features=128, out_features=10)

  def forward(self, x):
    # in the "forward pass", we take an input (a batch of images, x)
    # then first we flatten it into batchSize x 784, 
    batchSize = x.shape[0] # first dimension of x is "batchSize"
    x = x.view(batchSize, -1) # the -1 tells pytorch to flatten the tensor to be batchSize x "whatever size fits"
    
    x = self.fc(x)
    if self.use_relu:
      x = self.relu1(x)
    
    if self.use_dropout:
      x = self.dropout1(x)

    x = self.fc2(x)
    if self.use_relu:
      x = self.relu2(x)

    if self.use_dropout:
      x = self.dropout2(x)

    x = self.fc3(x)

    return x

In [None]:
model = MyDeepNet(use_relu=True, use_dropout=True)
model

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model)

In [None]:
# your turn. Two of us try without relu, two without dropout, two without both. 

In [None]:
model = MyDeepNet(use_relu=False, use_dropout=True)
model

## Gather measures from everyone to plot a graph

In [None]:
import pandas as pd
import seaborn as sns

scores = [
  (True, True, 0.14, 0.09, 0.97),
]
# build one row (a dict) per model, then create the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for relu,dropout,train_loss,val_loss,top1 in scores:
  rows.append({
      "model_name": f'relu{relu}_dropout{dropout}',
      "relu": relu,
      "dropout": dropout,
      "train_loss": train_loss,
      "val_loss": val_loss,
      "top1": top1,
  })
df = pd.DataFrame(rows, columns=['model_name','relu','dropout','train_loss','val_loss','top1'])
df

In [None]:
ax = sns.barplot(data=df, x="relu", y="train_loss", hue="dropout", 
                 order=[True,False], hue_order=[True,False]); 

In [None]:
ax = sns.barplot(data=df, x="relu", y="val_loss", hue="dropout", 
                 order=[True,False], hue_order=[True,False]); 

In [None]:
ax = sns.barplot(data=df, x="relu", y="top1", hue="dropout", 
                 order=[True,False], hue_order=[True,False]); 

## Exercise 4 - Improve your Model by using convolutional layers

Save a record of the results for each variant you try.
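
The `CNN` class in this exercise feeds 9216 features into its first linear layer. That number is not arbitrary: it follows from how each layer changes the image size. The sketch below traces the shapes with standalone layers that mirror the `conv1`, `conv2`, and `pool2` settings (the variable names are just for illustration).

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 28, 28)                 # one fake MNIST-sized image

conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1)
pool = nn.MaxPool2d(2)

# a 3x3 kernel with no padding trims 2 pixels per dimension: 28 -> 26 -> 24
h1 = conv1(x)
h2 = conv2(h1)
# 2x2 max pooling halves each spatial dimension: 24 -> 12
h3 = pool(h2)

print(h1.shape, h2.shape, h3.shape)
flat = torch.flatten(h3, 1)
print(flat.shape)                            # torch.Size([1, 9216]), i.e. 64 * 12 * 12
```

If you change a kernel size or add a pooling layer, rerunning a trace like this is a quick way to find the new input size for the first linear layer.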

In [None]:
import torch.nn as nn
from collections import OrderedDict
# reference: https://github.com/pytorch/examples/blob/master/mnist/main.py
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.cnn_backbone = nn.Sequential(OrderedDict([
             ('conv1', nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1)),
             ('relu1', nn.ReLU()),
             ('conv2', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1)),
             ('relu2', nn.ReLU()),
             ('pool2', nn.MaxPool2d(2)),
             ('dropout2', nn.Dropout2d(0.25))
        ]))
        self.head = nn.Sequential(OrderedDict([
            ('fc3', nn.Linear(9216, 128)),
            ('relu3', nn.ReLU()),
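            ('dropout3', nn.Dropout(0.50)),
            # no ReLU after fc4: CrossEntropyLoss expects raw scores (logits),
            # and a final ReLU would clip every negative score to zero
            ('fc4', nn.Linear(128, 10)),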
        ]))

    def forward(self, x):
        x = self.cnn_backbone(x)
        x = torch.flatten(x, 1)
        x = self.head(x)
        return x

In [None]:
model = CNN()
model

In [None]:
fake_imgs = torch.rand(10,1,28,28)
out = model(fake_imgs)
out.shape

In [None]:
criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=.03)
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
show_weights(model.cnn_backbone.conv1)

## Exercise 5 - Challenge your model by adding position and scale variation, and see how this affects learning and test performance.

If you get an "out of memory" error, you might have to go to "Runtime" => "Restart runtime". However, after you restart a runtime, everything is wiped from memory, including functions like "train_model", which are annoyingly scattered throughout this notebook.

So you have to go back and rerun the cells needed to train a model. I've added the text "#required" to the start of any cell that's needed to train a model, so that you can find them and just execute those cells (skipping all the exercises interspersed). So, press command+f to find text, and search for "#required", then for each of those cells press "shift + enter" to execute it. 

BUT WAIT. Why did we run out of memory? Could be a bunch of other variables that you don't need hogging GPU space. It could be that your current model is TOO BIG to fit on the GPU, or your images are too large, or you are trying to run too many of them through the model at once.

So, how do you troubleshoot? Try the following, in order, but remember to restart the runtime and run #required cells before each troubleshooting step:
- just restart and try again (don't change your model or training code): restart runtime, load only the required cells and anything you need for your new model
- try reducing your batch size (but if this gets much below 64, training may become noisier and less stable)
- try reducing the input image size
- try reducing your model size
- buy a bigger GPU
- Pay Amazon or Google to rent their bigger GPUs
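
Before working through the list above, it can help to check how much GPU memory is actually in use. The sketch below is one possible way to do that with PyTorch's CUDA memory utilities; `gpu_memory_mb` and `free_gpu_memory` are hypothetical helper names, and on a machine without a GPU they simply report zeros.

```python
import gc
import torch

def gpu_memory_mb():
    """Return (allocated, reserved) GPU memory in MB, or (0.0, 0.0) without a GPU."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mb = 1024 ** 2
    return torch.cuda.memory_allocated() / mb, torch.cuda.memory_reserved() / mb

def free_gpu_memory():
    """Collect unreferenced Python objects, then release PyTorch's cached GPU blocks."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

before = gpu_memory_mb()
# delete big objects you no longer need first, e.g. `del model`, then:
free_gpu_memory()
after = gpu_memory_mb()
print(before, after)
```

Memory held by variables that are still referenced (an old model, a stored batch of outputs) will not be released by this, which is why restarting the runtime is the blunt but reliable option.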

In [None]:
#required
import torch
import numpy as np
import torchvision.transforms.functional as F
from torchvision import datasets, transforms
from PIL import Image

def random_size(img, sizes=(28,56,128)):
  # pick one of the sizes at random; cast to a plain int for PIL/torchvision
  s = int(np.random.choice(sizes))
  return F.resize(img, (s, s))

def embed_image_centered(img, bg_size=(224,224)):
  img_w, img_h = img.size
  background = Image.new('L', bg_size, color=0)
  bg_w, bg_h = background.size
  # centered, but we want to randomly position
  offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)
  background.paste(img, offset)
  return background  

def embed_image_random(img, bg_size=(224,224)):
  img_w, img_h = img.size
  background = Image.new('L', bg_size, color=0)
  bg_w, bg_h = background.size
  h_shift = (bg_w - img_w) * np.random.rand()
  v_shift = (bg_h - img_h) * np.random.rand()
  offset = (int(h_shift), int(v_shift))
  background.paste(img, offset)
  return background  

In [None]:
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
  random_size,
  embed_image_random,
  transforms.ToTensor(),
])

In [None]:
train_dataset = datasets.MNIST('./data/MNIST', train=True, download=True, transform=transform)
train_dataset

In [None]:
train_dataset[0][0]

In [None]:
test_dataset = datasets.MNIST('./data/MNIST', train=False, download=True, transform=transform)
test_dataset

In [None]:
train_loader = DataLoader(train_dataset, batch_size=200, 
                          num_workers=4, pin_memory=True, shuffle=True)
train_loader

In [None]:
test_loader = DataLoader(test_dataset, batch_size=200, 
                         num_workers=4, pin_memory=True, shuffle=False)
test_loader

In [None]:
import torch.nn as nn
from collections import OrderedDict
# reference: https://github.com/pytorch/examples/blob/master/mnist/main.py
class CNN_224(nn.Module):
    def __init__(self):
        super(CNN_224, self).__init__()
        self.cnn_backbone = nn.Sequential(OrderedDict([
             ('conv1', nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1)),
             ('relu1', nn.ReLU()),
             ('pool1', nn.MaxPool2d(2)),
             ('conv2', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1)),
             ('relu2', nn.ReLU()),
             ('pool2', nn.MaxPool2d(2)),
             ('dropout2', nn.Dropout2d(0.25)),             
        ]))
        self.downsample = nn.AdaptiveAvgPool2d((6,6))
        self.head = nn.Sequential(OrderedDict([
            ('fc3', nn.Linear(64*6*6, 128)),
            ('relu3', nn.ReLU()),
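            ('dropout3', nn.Dropout(0.50)),
            # no ReLU after fc4: CrossEntropyLoss expects raw scores (logits),
            # and a final ReLU would clip every negative score to zero
            ('fc4', nn.Linear(128, 10)),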
        ]))

    def forward(self, x):
        x = self.cnn_backbone(x)
        x = self.downsample(x)
        x = torch.flatten(x, 1)
        x = self.head(x)
        return x        
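
The `downsample` layer is what lets `CNN_224` use a fixed-size linear layer: `nn.AdaptiveAvgPool2d((6,6))` averages whatever feature-map size it receives down to 6x6. For a 224x224 input, the backbone above should produce 54x54 maps (224 -> 222 -> 111 -> 109 -> 54). A quick sketch with fake feature maps of several sizes:

```python
import torch
import torch.nn as nn

downsample = nn.AdaptiveAvgPool2d((6, 6))

# fake feature maps of different spatial sizes, each with 64 channels
flat_sizes = []
for size in (12, 27, 54):
    features = torch.rand(2, 64, size, size)
    pooled = downsample(features)
    flat = torch.flatten(pooled, 1)
    flat_sizes.append(flat.shape[1])
    print(size, tuple(pooled.shape), flat.shape[1])
```

Every input size ends up with the same 64 * 6 * 6 = 2304 features, matching the `nn.Linear(64*6*6, 128)` layer in the head.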

In [None]:
model = CNN_224()
criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=.03)
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
train_model(num_epochs=15)

In [None]:
train_model(num_epochs=15)

In [None]:
train_model(num_epochs=70)

## Bonus Exercises

Experiment with varying the kernel_size, or out_channels of different layers in your network. Use our default settings above as your "baseline". Then only adjust one parameter at a time, so you can isolate which factor accounts for any changes in performance (if you change two things, you have no way of knowing which one "caused" the change in performance). Try visualizing your kernels (whether you vary kernel_size or out_channels) to see whether the tuning functions change in any obvious way. Once you have a sense for how individual parameters affect your model, you can experiment with more dramatic changes (changing multiple parameters at once). 
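
When you vary one setting at a time, it is also worth tracking how the number of learnable parameters changes. Here is a small sketch, assuming a first conv layer like the baseline `conv1` (the `count_params` helper is not part of the notebook):

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())

# baseline first conv layer from the exercises: 1 -> 32 channels, 3x3 kernels
baseline = nn.Conv2d(1, 32, kernel_size=3)
bigger_kernel = nn.Conv2d(1, 32, kernel_size=5)   # vary only kernel_size
more_channels = nn.Conv2d(1, 64, kernel_size=3)   # vary only out_channels

counts = {}
for name, layer in [("baseline", baseline),
                    ("kernel_size=5", bigger_kernel),
                    ("out_channels=64", more_channels)]:
    counts[name] = count_params(layer)            # weights plus one bias per channel
    print(name, counts[name])                     # 320, 832, 640
```

Keep in mind that changing `out_channels` in one layer means the next layer's `in_channels` has to match, and in the `CNN` class a different `kernel_size` also changes the flattened size that `fc3` expects.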

Coordinate with each other if you would like to collate results (since it takes a while to run even one model!).
