# Instructions

For this assignment you will use PyTorch instead of EDF to implement and train neural networks. The experiments in this assignment will take a long time to run without a GPU, but you can run the notebook remotely on Google Colab and have access to GPUs for free -- in this case you don't have to worry about installing PyTorch as it is available by default in Google Colab's environment.

If you run the experiments on your own machine, you should install PyTorch -- there are multiple tutorials online, and it is especially easy if you're using Anaconda. Check https://pytorch.org/tutorials/ for some PyTorch tutorials -- this assignment assumes that you know the basics, such as defining models with multiple modules and writing functions to train models with PyTorch optimizers.

To use Google Colab, you should access https://colab.research.google.com/ and upload this notebook to your workspace. To use a GPU, go to Edit -> Notebook settings and select GPU as the accelerator.

Unlike previous assignments, in this one you will have to do some writing instead of just coding. Try to keep your answers short and precise; you are encouraged to write equations if needed (you can do that using markdown cells). You can also use code as part of your answers (plotting, printing, etc.). Blue text indicates questions or things that you should discuss or comment on, and there will be red "ANSWER (BEGIN)" and "ANSWER (END)" markdown cells to indicate that you should add cells with your writeup between the two. **Make sure not to redefine variables or functions in your writeup, which can change the behavior of the next cells.**

Finally, you might have to make minor changes to the provided code due to differences in Python/PyTorch versions. You can post on Piazza if there's a major, non-trivial change that you had to make (so other students can be aware of it and how to proceed), but for minor changes you should just apply them and keep working on the assignment.

In [None]:
import torch, math, copy
import numpy as np
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F

# From Shallow to Deep Neural Networks

The main goal of this assignment is to develop a better understanding of how the depth of a network interacts with its trainability and performance.

In the previous assignment you likely observed difficulties in training sigmoid and ReLU networks with over ~8 layers, which is typically associated with 'vanishing' or 'exploding' gradients. As you will see, some of the biggest achievements in deep learning have been the development of techniques that enable deeper networks to be successfully trained, and without them deep networks are notoriously difficult to train successfully.

You will be working with the MNIST dataset, which will be downloaded and loaded in the cell below.

In [None]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

test_dataset = datasets.MNIST("data", train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False)


Fill in the missing code below. In both train_epoch and test, total_correct should be the total number of correctly classified samples, while total_samples should be the total number of samples that have been iterated over.

In [None]:
def train(epochs, model, criterion, optimizer, train_loader, test_loader):
    for epoch in range(epochs):
        train_err = train_epoch(model, criterion, optimizer, train_loader)
        test_err = test(model, test_loader)
        print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch, epochs, train_err*100, test_err*100))
    return train_err, test_err
    
def train_epoch(model, criterion, optimizer, loader):
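    total_correct = 0.
    total_samples = 0.
    # test() switches the model to eval mode; switch back to training mode
    # so that layers such as BatchNorm use batch statistics while training
    model.train()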
    
    for batch_idx, (data, target) in enumerate(loader):

        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()

        # insert code to feed the data to the model and collect its output
        output = model(data)
        # insert code to compute the loss from output and the true target
        loss = criterion(output, target)

        # insert code to update total_correct and total_samples
        # total_correct: total number of correctly classified samples
        # total_samples: total number of samples seen so far
        _, pred = torch.max(output, 1)
        total_correct += (pred==target).sum().item()
        total_samples += len(target)

        # insert code to update the parameters using optimizer
        # be careful in this part as an incorrect implementation will affect
        # all your experiments and have a significant impact on your grade!
        # in particular, note that pytorch does --not-- automatically
        # clear the parameter's gradients: check tutorials to see
        # how this can be done with a single method call.

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - total_correct/total_samples
    
def test(model, loader):
    total_correct = 0.
    total_samples = 0.
    model.eval()
    
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()

            # insert code to feed the data to the model and collect its output
            output = model(data)

            # insert code to update total_correct and total_samples
            # total_correct: total number of correctly classified samples
            # total_samples: total number of samples seen so far
            _,pred = torch.max(output, 1)
            total_correct += (pred==target).sum().item()
            total_samples += target.size(0)

    return 1 - total_correct/total_samples
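
The comment in train_epoch about clearing gradients can be checked in isolation. The following is a minimal sketch, assuming a single scalar parameter `w` that is purely illustrative and not part of the assignment; it shows that `backward()` adds to the stored gradient rather than overwriting it.

```python
import torch

# Hypothetical scalar parameter, used only for illustration.
w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()              # d(2w)/dw = 2
grad_first = w.grad.item()

(2 * w).backward()              # a second backward pass adds to the stored gradient
grad_accumulated = w.grad.item()

w.grad.zero_()                  # clear the stored gradient in place
grad_cleared = w.grad.item()

print(grad_first, grad_accumulated, grad_cleared)
```

In a training loop, `optimizer.zero_grad()` plays the role of the manual `w.grad.zero_()` call for every parameter the optimizer manages (depending on the PyTorch version it resets the gradients to zero or to `None`).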

### CNN with Tanh activations

Next, you should implement a baseline model so you can check how increasing the number of layers can make a network considerably harder to train, given that no additional methods such as residual connections and normalization layers are adopted.

Finish the implementation of CNNtanh below, carefully following the specifications:

The model should have exactly 'k' many convolutional layers, followed by a linear (fully-connected) layer that actually outputs the logits for each of the 10 MNIST classes.

The network should consist of 3 stages, each with k/3 many convolutional layers (you can assume k is divisible by 3). Each conv layer should have a 3x3 kernel, a stride of 1 and a padding of 1 pixel (such that the output of the convolution has the same height and width as its input).

It should also have an average pooling layer at the end of each stage, with a 2x2 window (hence halving the spatial dimensions), and the number of channels should double from one stage to the other (starting with 4 in the first stage). Moreover, a Tanh activation should follow each convolution layer.

When k=3, for example, the network should be:

1. Stage 1 (1x28x28 input, 4x14x14 output):
    1. Conv layer with 1 input channel and 4 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
2. Stage 2 (4x14x14 input, 8x7x7 output):
    1. Conv layer with 4 input channels and 8 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
3. Stage 3 (8x7x7 input, 16x3x3 output):
    1. Conv layer with 8 input channels and 16 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
4. Fully-connected layer with 16 * 3 * 3=144 input dimension and 10 output dimension
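
The shapes listed above can be reproduced with a few lines of arithmetic. This is a small sketch in plain Python; the helper names `conv_out`, `pool_out` and `stage_shapes` are made up for this example and use the usual floor formulas for convolution and pooling output sizes.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after average pooling without padding."""
    return (size - kernel) // stride + 1

def stage_shapes(k, in_size=28, first_channels=4, stages=3):
    """Return (channels, height, width) after each stage of a k-layer network."""
    size, channels, shapes = in_size, first_channels, []
    for _ in range(stages):
        for _ in range(k // stages):
            size = conv_out(size)      # 3x3, stride 1, padding 1 keeps the size
        size = pool_out(size)          # 2x2 average pooling halves it (rounding down)
        shapes.append((channels, size, size))
        channels *= 2
    return shapes

shapes = stage_shapes(k=3)
c, h, w = shapes[-1]
print(shapes, c * h * w)
```

Because the convolutions preserve the spatial size, the result does not depend on k: only the three pooling layers shrink the feature maps, which is why the linear layer always has 144 inputs.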

Note that the model should not have any activation after the fully-connected layer: the PyTorch loss module that will be adopted takes logits as input and not class probabilities.

In contrast to the network exemplified above with k=3, when k=6 it should have two conv layers per stage instead of one (each one with a tanh activation following it).

Lastly, do not change the code block with a for loop at the end of init: its purpose is to randomly initialize the parameters of the conv layers by sampling from a Gaussian with zero mean and a standard deviation of 0.05.

In [None]:
from torch.nn.modules.pooling import AvgPool2d
from torch.nn.modules.activation import Tanh
from torch.nn.modules.conv import Conv2d
class CNNtanh(nn.Module):
    def __init__(self, k):
        super(CNNtanh, self).__init__()
        
        #### Stage 1:
        self.conv1=nn.Sequential(Conv2d(1,4,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv1=nn.Sequential(self.conv1, 
                                   Conv2d(4,4,3,1,1),
                                   Tanh())
        self.avgpool1 = AvgPool2d(2,2)

        #### Stage 2:
        self.conv2=nn.Sequential(Conv2d(4,8,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv2=nn.Sequential(self.conv2, 
                                   Conv2d(8,8,3,1,1),
                                   Tanh())
        self.avgpool2 = AvgPool2d(2 ,2)

        #### Stage 3:
        self.conv3=nn.Sequential(Conv2d(8,16,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv3=nn.Sequential(self.conv3, 
                                   Conv2d(16,16,3,1,1),
                                   Tanh())
        self.avgpool3 = AvgPool2d(2 ,2)

        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()
        
    def forward(self, input):
   
        u = self.avgpool1(self.conv1(input))
        u = self.avgpool2(self.conv2(u))
        u = self.avgpool3(self.conv3(u))
        
        u = u.reshape(u.shape[0],-1)
        u = self.feedforward(u)
      
        return u

The line below just instantiates the PyTorch Cross Entropy loss, whose inputs should be logits, which is why the CNN should not have an activation after the last (feedforward) layer.

In [None]:
criterion = torch.nn.CrossEntropyLoss()

Now, you should train CNNtanh with different values for k: your goal is to find the largest value for k such that the network achieves less than 20% error (either train or test) in 3 epochs. You should also choose an appropriate learning rate (but do not change the optimizer or the momentum settings!).

Note that CNNs can easily achieve under 2% test error on MNIST, but we're choosing 20% as a threshold since you will be training each network for only 3 epochs.

Remember to use values for k that are divisible by 3. When submitted, your notebook should have the training log of a network with two consecutive values for k (for example, 6 and 9) such that the network is 'trainable' with the smaller one but not 'trainable' with the larger one. It is fine for the training log to include runs with more than two values of k.

In [None]:
ks = [9,12]
lrs = [.1, .01]

for k  in ks:
  for lr in lrs:
    print("Training Tanh CNN with {} layers".format(k))
    model = CNNtanh(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)
    print("\n")

Training Tanh CNN with 9 layers
Epoch 000/003, Train Error 89.21% || Test Error 89.90%
Epoch 001/003, Train Error 73.62% || Test Error 12.92%
Epoch 002/003, Train Error 6.99% || Test Error 3.41%


Training Tanh CNN with 9 layers
Epoch 000/003, Train Error 88.95% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 88.76% || Test Error 88.65%


Training Tanh CNN with 12 layers
Epoch 000/003, Train Error 89.00% || Test Error 88.65%
Epoch 001/003, Train Error 89.03% || Test Error 89.72%
Epoch 002/003, Train Error 89.18% || Test Error 89.90%


Training Tanh CNN with 12 layers
Epoch 000/003, Train Error 89.11% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 88.76% || Test Error 88.65%




### Better Initialization

Next, we will change the initialization of the conv layers and see how it affects the trainability of deep networks. Instead of sampling from a Gaussian with a deviation of 0.05, you should sample from a Gaussian with a deviation $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{out}}}$ or $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{in}}}$, where $k$ is the kernel size ($k=3$ for 3x3 convolutions), $C_{in}$ is the number of input channels, and $C_{out}$ the number of output channels.
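
A rough sketch of where this choice of $\sigma$ comes from, under simplifying assumptions that are not stated in the assignment: the weights $w_i$ and inputs $x_i$ are independent with zero mean, and the activation behaves approximately linearly near zero (as Tanh does). Each output of a conv layer is then a sum over $k^2 \cdot C_{in}$ terms:

$$
y = \sum_{i=1}^{k^2 C_{in}} w_i x_i
\quad\Longrightarrow\quad
\mathrm{Var}(y) = k^2 \cdot C_{in} \cdot \sigma^2 \cdot \mathrm{Var}(x),
\qquad\text{so}\qquad
\sigma = \sqrt{\frac{1}{k^2 \cdot C_{in}}}
\;\Longrightarrow\;
\mathrm{Var}(y) = \mathrm{Var}(x).
$$

The same argument applied to the backward pass, where each input gradient sums over $k^2 \cdot C_{out}$ terms, gives the variant with $C_{out}$.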

The model below should be exactly like CNNtanh except for the standard deviation of the normal distribution used to initialize the conv layers.

The paper 'Understanding the difficulty of training deep feedforward neural networks' by Glorot and Bengio provides some intuition behind such a choice for $\sigma$.
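
To get a feeling for the numbers, the sketch below evaluates the $C_{in}$ variant of the formula for the input-channel counts that occur in this network. The helper name `init_sigma` is made up for this example.

```python
import math

def init_sigma(fan_channels, kernel_size=3):
    """sigma = sqrt(1 / (k^2 * C)), where C is C_in (or C_out)."""
    return math.sqrt(1.0 / (kernel_size ** 2 * fan_channels))

# Input-channel counts that occur in the three-stage network described above.
sigmas = {c_in: init_sigma(c_in) for c_in in (1, 4, 8, 16)}
for c_in, sigma in sigmas.items():
    print(f"C_in={c_in:2d}  sigma={sigma:.4f}")
```

All of these values are larger than the 0.05 used by the naive initialization. Under a linear approximation, a layer with 4 input channels and $\sigma = 0.05$ scales the variance of its input by roughly $9 \cdot 4 \cdot 0.05^2 = 0.09$, so the signal shrinks quickly as layers are stacked.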

In [None]:
class CNNtanh_newinit(nn.Module):
    def __init__(self, k):
        super(CNNtanh_newinit, self).__init__()
        ##### Stage 1
        self.conv1=nn.Sequential(Conv2d(1,4,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv1=nn.Sequential(self.conv1, 
                                   Conv2d(4,4,3,1,1),
                                   Tanh())
        self.avgpool1 = AvgPool2d(2 ,2)

        #### Stage 2:
        self.conv2=nn.Sequential(Conv2d(4,8,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv2=nn.Sequential(self.conv2, 
                                   Conv2d(8,8,3,1,1),
                                   Tanh())
        self.avgpool2 = AvgPool2d(2 ,2)

        #### Stage 3:
        self.conv3=nn.Sequential(Conv2d(8,16,3,1,1),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv3=nn.Sequential(self.conv3, 
                                   Conv2d(16,16,3,1,1),
                                   Tanh())
        self.avgpool3 = AvgPool2d(2 ,2)

        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
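                # sigma = sqrt(1 / (k^2 * C_in)), with k the kernel size
                # (the weight tensor has shape C_out x C_in x k x k)
                c_in, kernel = m.weight.shape[1], m.weight.shape[2]
                sigma = 1 / np.sqrt(kernel**2 * c_in)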
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        u = self.avgpool1(self.conv1(input))
  
        u = self.avgpool2(self.conv2(u))
        u = self.avgpool3(self.conv3(u))
        u = u.reshape(u.shape[0],-1)
        u = self.feedforward(u)
        
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanh_newinit.

In [None]:
ks = [84, 87, 90, 93]
lrs = [.001, .0001, .00001]

for k in ks:
  for lr in lrs:
    print("\nTraining Tanh CNN + new init with {} layers".format(k))
    model = CNNtanh_newinit(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training Tanh CNN + new init with 84 layers
Epoch 000/003, Train Error 89.77% || Test Error 88.65%
Epoch 001/003, Train Error 89.09% || Test Error 89.90%
Epoch 002/003, Train Error 89.20% || Test Error 90.20%

Training Tanh CNN + new init with 84 layers
Epoch 000/003, Train Error 88.19% || Test Error 80.66%
Epoch 001/003, Train Error 73.27% || Test Error 68.70%
Epoch 002/003, Train Error 64.26% || Test Error 62.23%

Training Tanh CNN + new init with 84 layers
Epoch 000/003, Train Error 89.87% || Test Error 90.16%
Epoch 001/003, Train Error 89.97% || Test Error 90.16%
Epoch 002/003, Train Error 90.11% || Test Error 90.86%

Training Tanh CNN + new init with 87 layers
Epoch 000/003, Train Error 64.84% || Test Error 38.96%
Epoch 001/003, Train Error 25.91% || Test Error 17.20%
Epoch 002/003, Train Error 13.43% || Test Error 12.11%

Training Tanh CNN + new init with 87 layers
Epoch 000/003, Train Error 89.86% || Test Error 90.01%
Epoch 001/003, Train Error 89.91% || Test Error 89.38%
Epoch

### CNN with ELU activations

In this section you should replace the Tanh activations of the previous network with Exponential Linear Units (ELUs). Complete CNNelu below, which should be exactly like CNNtanh_newinit except for ELU activations instead of Tanh (ELUs are readily available in PyTorch; check its documentation for more details).
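
For reference, the sketch below compares the two activations on a few inputs in plain Python. The `elu` helper is written out by hand for illustration, with `alpha=1.0` matching the default of PyTorch's `nn.ELU`.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):7.4f}  elu={elu(x):7.4f}")
```

Tanh saturates on both sides, while ELU is bounded below by -alpha but grows without bound for positive inputs.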

In [None]:
class CNNelu(nn.Module):
    def __init__(self, k):
        super(CNNelu, self).__init__()

        #### Stage 1
        self.conv1=nn.Sequential(Conv2d(1,4,3,1,1),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv1=nn.Sequential(self.conv1, 
                                   Conv2d(4,4,3,1,1),
                                   nn.ELU())
        self.avgpool1 = AvgPool2d(2 ,2)

        #### Stage 2:
        self.conv2=nn.Sequential(Conv2d(4,8,3,1,1),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv2=nn.Sequential(self.conv2, 
                                   Conv2d(8,8,3,1,1),
                                   nn.ELU())
        self.avgpool2 = AvgPool2d(2 ,2)

        #### Stage 3:
        self.conv3=nn.Sequential(Conv2d(8,16,3,1,1),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv3=nn.Sequential(self.conv3, 
                                   Conv2d(16,16,3,1,1),
                                   nn.ELU())
          
        self.avgpool3 = AvgPool2d(2,2)
        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
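                # sigma = sqrt(1 / (k^2 * C_in)), with k the kernel size
                # (the weight tensor has shape C_out x C_in x k x k)
                c_in, kernel = m.weight.shape[1], m.weight.shape[2]
                sigma = 1 / np.sqrt(kernel**2 * c_in)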
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):

        u = self.avgpool1(self.conv1(input))
        u = self.avgpool2(self.conv2(u))
        u = self.avgpool3(self.conv3(u))
        u = u.reshape(u.shape[0],-1)

        u = self.feedforward(u)
        
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu.

In [None]:
ks = [42, 45, 48, 51]  # k must be divisible by 3
lrs = [0.001, .0001, .00001]

for k in ks:
  for lr in lrs:
    print("\nTraining ELU CNN, with {} layers".format(k))
    model = CNNelu(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ELU CNN, with 42 layers
Epoch 000/003, Train Error 50.73% || Test Error 25.29%
Epoch 001/003, Train Error 13.59% || Test Error 9.14%
Epoch 002/003, Train Error 8.43% || Test Error 8.34%

Training ELU CNN, with 42 layers
Epoch 000/003, Train Error 88.09% || Test Error 86.11%
Epoch 001/003, Train Error 83.00% || Test Error 76.92%
Epoch 002/003, Train Error 64.71% || Test Error 47.53%

Training ELU CNN, with 42 layers
Epoch 000/003, Train Error 89.94% || Test Error 90.12%
Epoch 001/003, Train Error 89.40% || Test Error 90.03%
Epoch 002/003, Train Error 89.20% || Test Error 89.36%

Training ELU CNN, with 45 layers
Epoch 000/003, Train Error 90.10% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%

Training ELU CNN, with 45 layers
Epoch 000/003, Train Error 88.45% || Test Error 86.02%
Epoch 001/003, Train Error 81.31% || Test Error 76.55%
Epoch 002/003, Train Error 69.23% || Test Error 60.44%

Trainin

### CNN with Batch Normalization

Next, you will check how batch normalization can make deep networks easier to train. Implement the network below, which should be exactly like CNNelu except for additional BatchNorm2d layers after each convolution (before the ELU activation).

Note that BatchNorm2d modules require the number of channels as argument -- see the PyTorch documentation for more details.
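
As a minimal sketch of how the pieces fit together (the batch of random inputs is hypothetical and only serves to exercise the modules), a single 'Conv->BatchNorm->ELU' block could look as follows. The argument of BatchNorm2d has to match the number of output channels of the preceding convolution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

out_channels = 4
conv = nn.Conv2d(1, out_channels, kernel_size=3, stride=1, padding=1)
bn = nn.BatchNorm2d(out_channels)    # num_features must match the conv's output channels
elu = nn.ELU()

x = torch.randn(8, 1, 28, 28)         # hypothetical batch: 8 single-channel 28x28 images
normalized = bn(conv(x))              # modules start in training mode, so batch statistics are used
y = elu(normalized)

# Per-channel statistics are taken over the batch and both spatial dimensions.
channel_means = normalized.mean(dim=(0, 2, 3))
channel_vars = normalized.var(dim=(0, 2, 3), unbiased=False)
print(y.shape, channel_means, channel_vars)
```

With the default affine parameters (scale 1, shift 0), each channel of the normalized tensor has approximately zero mean and unit variance before the ELU is applied.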

In [None]:
class CNNeluBN(nn.Module):
    def __init__(self, k):
        super(CNNeluBN, self).__init__()

        self.conv1=nn.Sequential(Conv2d(1,4,3,1,1),
                                 nn.BatchNorm2d(4),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv1=nn.Sequential(self.conv1, 
                                   Conv2d(4,4,3,1,1),
                                   nn.BatchNorm2d(4),
                                   nn.ELU())
        self.avgpool1 = AvgPool2d(2 ,2)

        
        #### Stage 2:
        self.conv2=nn.Sequential(Conv2d(4,8,3,1,1),
                                 nn.BatchNorm2d(8),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv2=nn.Sequential(self.conv2, 
                                   Conv2d(8,8,3,1,1),
                                   nn.BatchNorm2d(8),
                                   nn.ELU())
        self.avgpool2 = AvgPool2d(2 ,2)

        #### Stage 3:
        self.conv3=nn.Sequential(Conv2d(8,16,3,1,1),
                                 nn.BatchNorm2d(16),
                                   nn.ELU())
        for i in range(1, int(k/3)):
          self.conv3=nn.Sequential(self.conv3, 
                                   Conv2d(16,16,3,1,1),
                                   nn.BatchNorm2d(16),
                                   nn.ELU())
          
        self.avgpool3 = AvgPool2d(2,2)
        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
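                # sigma = sqrt(1 / (k^2 * C_in)), with k the kernel size
                c_in, kernel = m.weight.shape[1], m.weight.shape[2]
                sigma = 1 / np.sqrt(kernel**2 * c_in)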
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        
        u = self.avgpool1(self.conv1(input))
        u = self.avgpool2(self.conv2(u))
        u = self.avgpool3(self.conv3(u))
      
        u = u.reshape(u.shape[0],-1)

        u = self.feedforward(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNeluBN.

In [None]:
ks = [117,120, 123, 126]
lrs = [0.001, .0001, .00001]
for k in ks:
  for lr in lrs:
    print("\nTraining ELU CNN + BN with {} layers".format(k))
    model = CNNeluBN(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ELU CNN + BN with 117 layers
Epoch 000/003, Train Error 85.98% || Test Error 78.26%
Epoch 001/003, Train Error 53.18% || Test Error 25.31%
Epoch 002/003, Train Error 19.96% || Test Error 14.83%

Training ELU CNN + BN with 117 layers
Epoch 000/003, Train Error 89.64% || Test Error 89.49%
Epoch 001/003, Train Error 89.54% || Test Error 89.42%
Epoch 002/003, Train Error 89.86% || Test Error 89.10%

Training ELU CNN + BN with 117 layers
Epoch 000/003, Train Error 90.13% || Test Error 90.84%
Epoch 001/003, Train Error 90.12% || Test Error 89.85%
Epoch 002/003, Train Error 90.11% || Test Error 90.11%

Training ELU CNN + BN with 120 layers
Epoch 000/003, Train Error 89.61% || Test Error 88.92%
Epoch 001/003, Train Error 62.20% || Test Error 23.75%
Epoch 002/003, Train Error 16.88% || Test Error 11.82%

Training ELU CNN + BN with 120 layers
Epoch 000/003, Train Error 89.86% || Test Error 90.14%
Epoch 001/003, Train Error 89.65% || Test Error 90.18%
Epoch 002/003, Train Error 89.65% |

### Residual Networks

Finally, you will experiment with adding residual connections to a CNN.

To implement the model below, you should add a 'skip connection' to 'Conv->BatchNorm->ELU' blocks whenever the block's input and output have the same shape: this will be the case for every such block except the first one in each stage, which changes the number of channels.

More specifically, you should change $u = ELU(BatchNorm(Conv(x)))$ to $u = ELU(BatchNorm(Conv(x))) + x$, where $x$ and $u$ denote the block's input and output, respectively.
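
One possible way to express this formula as a module is sketched below. The class name `ResidualBlock` and the input sizes are illustrative assumptions, not part of the provided code.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BatchNorm -> ELU with an identity skip connection (illustrative helper)."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ELU(),
        )

    def forward(self, x):
        # u = ELU(BatchNorm(Conv(x))) + x; valid because input and output shapes match.
        return self.body(x) + x

block = ResidualBlock(channels=4)
x = torch.randn(2, 4, 14, 14)          # hypothetical stage-2-sized input
u = block(x)
print(u.shape)
```

The addition only works when the block keeps the number of channels and the spatial size unchanged, which is why the first block of each stage has no skip connection.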

You should take your CNNeluBN implementation and add skip-connections as described above.

Note that there are key differences between the resulting model and the actual ResNet proposed by He et al. in 'Deep Residual Learning for Image Recognition', for example the use of ELU activations instead of ReLU and the exact position of skip-connections.

In [None]:
class Block(nn.Module):
  def __init__(self, in_channels, out, kern, stride):
        super(Block, self).__init__()
        self.block = nn.Sequential( Conv2d(in_channels=in_channels, out_channels=out, kernel_size=kern, stride=stride, padding=1), 
                                   nn.BatchNorm2d(out),
                                   nn.ELU())
  def forward(self, input):
    return self.block(input)

class ResNet(nn.Module):
    def __init__(self, k):
        super(ResNet, self).__init__()
        ### Stage 1:
        self.conv1=[Block(1,4,3,1)]
        for i in range(1, int(k/3)):
          self.conv1.append(Block(4,4,3,1))
        self.avgpool1 = AvgPool2d(2 ,2)
        self.conv1 = nn.Sequential(*self.conv1)
        

        #### Stage 2:
        self.conv2=[Block(4,8,3,1)]
        for i in range(1, int(k/3)):
          self.conv2.append(Block(8,8,3,1))
        self.avgpool2 = AvgPool2d(2 ,2)
        self.conv2 = nn.Sequential(*self.conv2)

        #### Stage 3:
        self.conv3=[Block(8,16,3,1)]
        for i in range(1, int(k/3)):
          self.conv3.append(Block(16, 16, 3, 1))
        self.avgpool3 = AvgPool2d(2,2)
        self.conv3=nn.Sequential(*self.conv3)

        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
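                # sigma = sqrt(1 / (k^2 * C_in)), with k the kernel size
                c_in, kernel = m.weight.shape[1], m.weight.shape[2]
                sigma = 1 / np.sqrt(kernel**2 * c_in)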
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
    
    def forward(self, input):
        
        #### Stage 1
        # the first block of a stage changes the channel count, so it has no skip
        u = self.conv1[0](input)
        for block in self.conv1[1:]:
            # skip connection: u = ELU(BatchNorm(Conv(x))) + x
            u = block(u) + u
        u = self.avgpool1(u)

        #### Stage 2
        u = self.conv2[0](u)
        for block in self.conv2[1:]:
            u = block(u) + u
        u = self.avgpool2(u)

        #### Stage 3
        u = self.conv3[0](u)
        for block in self.conv3[1:]:
            u = block(u) + u
        u = self.avgpool3(u)

        u = u.reshape(u.shape[0],-1)
        u = self.feedforward(u)

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with the 'ResNet' model.

In [None]:
k = 800  # not divisible by 3: int(k/3) = 266 blocks per stage, i.e. 798 conv layers
lr = .001

print("\nTraining ResNet with {} layers".format(k))
model = ResNet(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ResNet with 800 layers
Epoch 000/003, Train Error 45.97% || Test Error 19.96%
Epoch 001/003, Train Error 41.45% || Test Error 15.70%
Epoch 002/003, Train Error 11.98% || Test Error 9.10%


**<font color='blue'>
    Summarize your results and observations regarding the experiments above. What was the maximum number of layers for each of the five models such that training remained successful? Briefly discuss why you think each modification helped/harmed the trainability of deep models.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER 

CNNtanh performed the worst in my experiments: with the naive initialization it could be trained successfully with up to 9 convolutional layers, using a learning rate of 0.1.

For the second experiment (CNNtanh_newinit) I lowered the learning rate to 0.001. The model did considerably better and remained trainable with up to 87 convolutional layers. The improvement comes from the new initialization: sampling the conv weights from a Gaussian with the $\sigma$ defined above is meant to keep the scale of the activations and gradients roughly constant from layer to layer, which makes training more stable.

CNNelu, which is identical except that it uses ELU activations instead of Tanh, did not perform as well. It was trainable with up to 42 convolutional layers at a learning rate of 0.001, a clear regression from the previous experiment. A likely explanation is that, without batch norm, ELU has no upper bound (unlike Tanh), so the activations and gradients in a deep network can grow very large, which hurts trainability.

With batch normalization, the next experiment performed substantially better: CNNeluBN could be trained with up to 126 convolutional layers at a learning rate of 0.001, a large improvement over the previous best of 87. The normalization layer after each convolution standardizes the activations of each channel to zero mean and unit variance over the batch. Because the input distribution of each layer depends on all preceding layers, it shifts as those layers are updated, an effect known as internal covariate shift. Batch normalization makes the network less sensitive to these shifts, so training is more stable.

The biggest change, however, comes from the ResNet model, which adds residual connections. In my experiments it could still be trained with k = 800 at a learning rate of 0.001, a dramatic improvement over the other models. The skip connections pass information from earlier layers forward directly, so during backpropagation the gradients reach the lower layers more easily, which greatly improves the trainability of deep networks.


---------------------------------------------------------------------
</font>**




**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Initialization

Intuitively, batch norm should make the model more robust to changes in the magnitude of the network's weights: informally, scaling up all the elements of a conv layer's filters by a factor of 10 would not affect the network's output as long as there is a batch norm layer following such convolution, as the normalization would undo the scaling.
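
The intuition can be checked numerically on a single channel. The sketch below is plain Python rather than PyTorch; the `normalize` helper and the activation values are made up for illustration, and the conv bias is assumed to be zero (as it is at initialization here), so multiplying the filter weights by 10 multiplies the pre-normalization outputs by 10.

```python
import math

def normalize(values, eps=1e-5):
    """Standardize a list of activations: subtract the mean, divide by sqrt(var + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical pre-normalization outputs of one conv channel over a batch.
activations = [0.3, -1.2, 0.8, 2.1, -0.5]
scaled = [10.0 * a for a in activations]   # same channel with all filter weights multiplied by 10

max_diff = max(abs(a - b) for a, b in zip(normalize(activations), normalize(scaled)))
print(max_diff)
```

The two normalized outputs differ only by a tiny amount that comes from the `eps` term in the denominator; without it they would be identical.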

To check how this intuition translates to practical settings, you should change the original 'CNNtanh' model so that it incorporates batch norm layers (like you have done when modifying 'CNNelu' into 'CNNeluBN').

The model below should adopt the naive initialization procedure of sampling from a Gaussian with a standard deviation of 0.05, not the more sophisticated one that you implemented previously.

In [None]:
class CNNtanhBN_oldinit(nn.Module):
    def __init__(self, k):
        super(CNNtanhBN_oldinit, self).__init__()
        self.conv1=nn.Sequential(Conv2d(1,4,3,1,1),
                                 nn.BatchNorm2d(4),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv1=nn.Sequential(self.conv1, 
                                   Conv2d(4,4,3,1,1),
                                   nn.BatchNorm2d(4),
                                   Tanh())
        self.avgpool1 = AvgPool2d(2 ,2)

        
        #### Stage 2:
        self.conv2=nn.Sequential(Conv2d(4,8,3,1,1),
                                 nn.BatchNorm2d(8),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv2=nn.Sequential(self.conv2, 
                                   Conv2d(8,8,3,1,1),
                                   nn.BatchNorm2d(8),
                                   Tanh())
        self.avgpool2 = AvgPool2d(2 ,2)

        #### Stage 3:
        self.conv3=nn.Sequential(Conv2d(8,16,3,1,1),
                                 nn.BatchNorm2d(16),
                                   Tanh())
        for i in range(1, int(k/3)):
          self.conv3=nn.Sequential(self.conv3, 
                                   Conv2d(16,16,3,1,1),
                                   nn.BatchNorm2d(16),
                                   Tanh())
        self.avgpool3 = AvgPool2d(2 ,2)

        self.feedforward= nn.Linear(144, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()
        
    def forward(self, input):

        u = self.conv1(input)
        u = self.avgpool1(u)

        u = self.conv2(u)
        u = self.avgpool2(u)
     
        u = self.conv3(u)
        u = self.avgpool3(u)
      
        u = u.reshape(u.shape[0],-1)

        u = self.feedforward(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanhBN_oldinit.

In [None]:
ks = [93, 96, 99]
lrs = [.01, 0.001, .0001, .00001]
for k in ks:
  for lr in lrs:
    print("\nTraining Tanh CNN + BN + naive init with {} layers".format(k))
    model = CNNtanhBN_oldinit(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training Tanh CNN + BN + naive init with 93 layers
Epoch 000/003, Train Error 43.31% || Test Error 18.91%
Epoch 001/003, Train Error 24.43% || Test Error 15.45%
Epoch 002/003, Train Error 13.09% || Test Error 9.01%

Training Tanh CNN + BN + naive init with 93 layers
Epoch 000/003, Train Error 48.57% || Test Error 26.26%
Epoch 001/003, Train Error 18.81% || Test Error 14.06%
Epoch 002/003, Train Error 11.94% || Test Error 11.27%

Training Tanh CNN + BN + naive init with 93 layers
Epoch 000/003, Train Error 85.58% || Test Error 76.30%
Epoch 001/003, Train Error 72.74% || Test Error 66.61%
Epoch 002/003, Train Error 60.48% || Test Error 52.56%

Training Tanh CNN + BN + naive init with 93 layers
Epoch 000/003, Train Error 89.96% || Test Error 89.97%
Epoch 001/003, Train Error 90.28% || Test Error 89.85%
Epoch 002/003, Train Error 90.00% || Test Error 89.42%

Training Tanh CNN + BN + naive init with 96 layers
Epoch 000/003, Train Error 71.80% || Test Error 66.23%
Epoch 001/003, Train Error

**<font color='blue'>
    Compare CNNtanh (model with naive initialization and no batch norm), CNNtanh_newinit (model with better initialization and no batch norm), and CNNtanhBN_oldinit (model with naive initialization and batch norm), in terms of how deep each could be while being trainable, and discuss your thoughts on how batch norm interacts with the way parameters are initialized.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

Using batch norm with a naive initialization can still outperform a model with a better initialization but no batch norm layers. The batch norm layers act as a stabilizer for the activations, which is the same effect we were aiming for with the new initialization. With batch norm, the scale of the initial weights is normalized away, so the choice of initialization technique should matter much less. CNNtanhBN_oldinit outperforms CNNtanh_newinit because batch norm re-normalizes the activations at every training step, not only at the beginning of training. Overall, the best model was CNNtanhBN_oldinit, followed by CNNtanh_newinit and CNNtanh. These models could be trained with up to 96, 84 and 9 convolutional layers, respectively.

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Residual Connections

Lastly, implement and train a CNN with residual connections but without batch normalization layers -- the goal here is to check how residuals interact with normalization.

The model below should be exactly like ResNet, except that it should not have batch norm layers.

In [None]:
class Block(nn.Module):
  def __init__(self, in_channels, out, kern, stride):
        super(Block, self).__init__()
        self.block = nn.Sequential( Conv2d(in_channels=in_channels, out_channels=out, kernel_size=kern, stride=stride, padding=1), 
                                   nn.ELU())
  def forward(self, input):
    return self.block(input)


class ResNet_noBN(nn.Module):
    def __init__(self, k):
        super(ResNet_noBN, self).__init__()

        ### Stage 1:
        self.conv1=[Block(1,4,3,1)]
        for i in range(1, int(k/3)):
          self.conv1.append(Block(4,4,3,1))
        self.avgpool1 = AvgPool2d(2 ,2)
        self.conv1 = nn.Sequential(*self.conv1)
        

        #### Stage 2:
        self.conv2=[Block(4,8,3,1)]
        for i in range(1, int(k/3)):
          self.conv2.append(Block(8,8,3,1))
        self.avgpool2 = AvgPool2d(2 ,2)
        self.conv2 = nn.Sequential(*self.conv2)

        #### Stage 3:
        self.conv3=[Block(8,16,3,1)]
        for i in range(1, int(k/3)):
          self.conv3.append(Block(16, 16, 3, 1))
        self.avgpool3 = AvgPool2d(2,2)
        self.conv3=nn.Sequential(*self.conv3)

        self.feedforward= nn.Linear(144, 10)


        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = 1/np.sqrt(3*m.weight.shape[1])
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        # Loop bounds come from the stage lengths (len(self.convN) == int(k/3)),
        # so forward() does not depend on a global variable k.

        #### Stage 1
        u = self.conv1[0](input)
        res = u
        for i in range(1, len(self.conv1)):
          u = self.conv1[i](u)+res  # residual connection
          res = u
        u = self.avgpool1(u)

        #### Stage 2
        u = self.conv2[0](u)
        res = u
        for i in range(1, len(self.conv2)):
          u = self.conv2[i](u)+res
          res = u
        u = self.avgpool2(u)

        #### Stage 3
        u = self.conv3[0](u)
        res = u
        for i in range(1, len(self.conv3)):
          u = self.conv3[i](u)+res
          res = u
        u = self.avgpool3(u)

        # Flatten to (batch, 144) for the linear classifier
        u = u.reshape(u.shape[0],-1)
        u = self.feedforward(u)

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with ResNet_noBN.

In [None]:
ks = [15, 18, 21, 24]
lrs = [.01, .001, .0001, .00001]
for k in ks:
  for lr in lrs:
    print("\nTraining ResNet w/o BN with {} layers".format(k))
    model = ResNet_noBN(k).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ResNet w/o BN with 15 layers
Epoch 000/003, Train Error 90.09% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%

Training ResNet w/o BN with 15 layers
Epoch 000/003, Train Error 90.16% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%

Training ResNet w/o BN with 15 layers
Epoch 000/003, Train Error 69.43% || Test Error 48.11%
Epoch 001/003, Train Error 37.09% || Test Error 29.14%
Epoch 002/003, Train Error 23.40% || Test Error 19.26%

Training ResNet w/o BN with 15 layers
Epoch 000/003, Train Error 77.57% || Test Error 64.73%
Epoch 001/003, Train Error 56.83% || Test Error 48.03%
Epoch 002/003, Train Error 43.51% || Test Error 37.25%

Training ResNet w/o BN with 18 layers
Epoch 000/003, Train Error 90.14% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% |

**<font color='blue'>
    Compare ResNet and ResNet_noBN in terms of how deep each could be while being trainable, and discuss your thoughts on how batch norm interacts with residual connections.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

When using ResNets, it is important to use batch norm because adding the skip connection to the output of each convolutional block increases its magnitude: the result is a cumulative sum over blocks. Without batch norm, the inputs to each layer can become very large and destabilize the network when there are many layers. This is why, as we have seen, the ResNet without batch norm performed far worse than the one with batch norm, both using a learning rate of 0.001.
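
The accumulation effect can be illustrated with a rough sketch. It is not the ResNet_noBN model itself: it assumes a single stage of 4-channel residual blocks with ELU, random input, and the same sigma rule as the initialization code above, and it only tracks the standard deviation of the activations at initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

channels, depth = 4, 20
u = torch.randn(16, channels, 14, 14)   # random stand-in for stage inputs
stds = []

with torch.no_grad():
    for _ in range(depth):
        conv = nn.Conv2d(channels, channels, 3, 1, 1)
        # Same sigma rule as in ResNet_noBN: 1 / sqrt(3 * in_channels)
        conv.weight.normal_(0, 1 / (3 * channels) ** 0.5)
        conv.bias.zero_()
        u = F.elu(conv(u)) + u           # residual block without batch norm
        stds.append(u.std().item())

for layer in (1, 5, 10, 20):
    print("activation std after block {}: {:.3g}".format(layer, stds[layer - 1]))
```

Under these assumptions the activation scale keeps growing with depth, which is consistent with the training failures observed above at the larger learning rates.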

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### (Optional) Multiple Loss Heads

In this optional section, your goal is to incorporate the idea of having multiple loss heads throughout the network, distributed across its depth.

For the CNNelu_multihead model below, you should take the CNNelu model that you implemented previously and add two additional classification heads, connected to the outputs of stages 1 and 2.

More specifically, the outputs of stages 1 and 2, with shapes 4x14x14 and 8x7x7, should be connected to new fully-connected layers that map them to a 10-dimensional vector (logits for the 10 MNIST classes). The network should output three logit vectors (the original one at the end of the network plus the two new ones) instead of just one, and the loss should be computed as the average of the cross entropies between the true target and each of the three predictions.

Note that you will likely have to change the implementation of train_epoch() and test() to accommodate the fact that this model will output three logit vectors instead of one.
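
A minimal sketch of the averaged loss described above, which could be called from a modified train_epoch(). The helper name `multihead_loss` is hypothetical, the logits are random placeholders rather than model outputs, and taking predictions from the final head only is an assumption, since the assignment does not specify how the error should be computed.

```python
import torch
import torch.nn.functional as F

def multihead_loss(logits_per_head, target):
    """Average the cross entropy over all heads (hypothetical helper)."""
    losses = [F.cross_entropy(logits, target) for logits in logits_per_head]
    return torch.stack(losses).mean()

torch.manual_seed(0)
target = torch.randint(0, 10, (5,))      # 5 fake labels for 10 classes

# Placeholders for the three logit vectors returned by the model
u1, u2, u3 = (torch.randn(5, 10, requires_grad=True) for _ in range(3))

loss = multihead_loss((u1, u2, u3), target)
loss.backward()                          # gradients flow into all three heads

# Assumption: classification error is measured on the final head only
predictions = u3.argmax(dim=1)
```

In the training loop, `u1, u2, u3 = model(input)` would replace the random placeholders, and test() would need the same change when computing the error.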

In [None]:
class CNNelu_multihead(nn.Module):
    def __init__(self, k):
        super(CNNelu_multihead, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = 
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'

        return u1, u2, u3

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu_multihead.

In [None]:
k = 
lr = 

print("\nTraining ELU CNN + multiloss with {} layers".format(k))
model = CNNelu_multihead(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

**<font color='blue'>
    Did the adoption of multiple loss heads help train deeper models? How did it compare to the adoption of batch normalization, in terms of how deeper each of the two approaches enabled the network to be while staying trainable?
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**