# Tutorial 2: Convolutional Neural Networks

## Introduction

In this tutorial, there are two parts:

First:
- Convolutional layers
- Pooling layers
- LeNet architecture example

Second:
- Exercise with CIFAR-10

## Convolutional Layers

### A few things to understand about convolutional layers:
1. Each neuron is spatially localized and operates on the **full depth** dimension of its input layer.
1. Neurons that are at the same depth in the grid **share the same weights** (parameters $W$,$b$).

   <img src="tutorial2_img/cnn_layers.jpeg" width="800" />

In the above image, the colors of the neurons represent their weights.

### Hyperparameters

Assume an input tensor of dimensions $(C_{\mathrm{in}}, H_{\mathrm{in}}, W_{\mathrm{in}})$, i.e. channels, height, width. 

A convolutional layer requires four hyperparameters:

- Number of filters, $K$.
- Spatial extent (size) of each filter, $F$. 
- Stride $S$: spatial distance between consecutive applications of a filter.
- Amount of zero padding, $P$.
 
The output tensor is of dimensions $C_{\mathrm{out}} \times H_{\mathrm{out}} \times W_{\mathrm{out}}$ where: <br><br>
\begin{equation}W_{\mathrm{out}} = \frac{W_{\mathrm{in}} - F + 2P}{S} + 1 \end{equation}<br>
\begin{equation}H_{\mathrm{out}} = \frac{H_{\mathrm{in}} - F + 2P}{S} + 1 \end{equation}<br>
\begin{equation}C_{\mathrm{out}} = K \end{equation}
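
As a quick sanity check of these formulas, the sketch below computes the output size along one spatial dimension. The helper name `conv_output_size` is hypothetical (it is not part of PyTorch), and the sketch assumes floor division when the stride does not divide evenly, which is how `nn.Conv2d` rounds.

```python
def conv_output_size(size_in, F, S=1, P=0):
    """Output size along one spatial dimension: (size_in - F + 2P) / S + 1."""
    if size_in + 2 * P < F:
        raise ValueError("filter is larger than the padded input")
    # Floor division: filter positions that would overrun the input are dropped.
    return (size_in - F + 2 * P) // S + 1


# 5x5 filter, stride 1, padding 2 keeps a 28x28 input at 28x28.
print(conv_output_size(28, F=5, S=1, P=2))
# 5x5 filter, stride 1, no padding shrinks 28 to 24.
print(conv_output_size(28, F=5, S=1, P=0))
```

With $S=1$, choosing $P = (F-1)/2$ for an odd filter size preserves the spatial size of the input.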

The number of parameters in the layer will be:

$$
\underbrace{K}_{\mathrm{filters}} \cdot \left(
\underbrace{C_{\mathrm{in}} \cdot F^2}_{\mathrm{filter\ size}} + \underbrace{1}_{\mathrm{bias\ term}}
\right)
$$

**Example**: Input image is 256x256x3, and the first conv layer has 16 filters of size 3x3. The number of parameters in the first layer will be: $16 (3 * 3^2 + 1) = 448$
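
The parameter count can be reproduced with a few lines of plain Python. This is only an illustrative sketch; `conv_num_params` is a hypothetical helper name, not a PyTorch function.

```python
def conv_num_params(K, C_in, F, bias=True):
    """Learnable parameters in a conv layer with K filters of size C_in x F x F."""
    per_filter = C_in * F * F + (1 if bias else 0)
    return K * per_filter


print(conv_num_params(K=16, C_in=3, F=3))  # the example above: 448
```

Note that the count does not depend on the height or width of the input, which is a direct consequence of weight sharing.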


### PyTorch `Conv2d` layer example

In [1]:
# Setup
import os
import torch
import torchvision
import torchvision.transforms as transforms


# expanduser('~') also works where the HOME environment variable is not set
data_dir = os.path.join(os.path.expanduser('~'), 'cs460/datasets')
# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root=data_dir,
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


In [2]:
# Load first MNIST image
x0,y0 = train_dataset[0]
# add batch dim
x0 = x0.unsqueeze(0)
print('x0 shape with batch dim:', x0.shape)

def num_params(layer):
    return sum([p.numel() for p in layer.parameters()])

x0 shape with batch dim: torch.Size([1, 1, 28, 28])


In [3]:
import torch.nn as nn

# First conv layer: works on input image volume
conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
print(f'conv1: {num_params(conv1)} parameters')
conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
print(f'conv2: {num_params(conv2)} parameters')

print(f'{"Input image shape:":25s}{x0.shape}')
print(f'{"After first conv layer:":25s}{conv1(x0).shape}')
print(f'{"After second conv layer:":25s}{conv2(conv1(x0)).shape}')

conv1: 156 parameters
conv2: 2416 parameters
Input image shape:       torch.Size([1, 1, 28, 28])
After first conv layer:  torch.Size([1, 6, 28, 28])
After second conv layer: torch.Size([1, 16, 24, 24])


## Pooling Layers

Assume an input tensor of dimensions $(C_{\mathrm{in}}, H_{\mathrm{in}}, W_{\mathrm{in}})$, i.e. channels, height, width. 

A pooling layer requires two hyperparameters:

- Spatial extent (size) of each pooling filter, $F$.
- Stride, $S$.

The output tensor is of dimensions $C_{\mathrm{out}} \times H_{\mathrm{out}} \times W_{\mathrm{out}}$ where: <br><br>
\begin{equation}W_{\mathrm{out}} = \frac{W_{\mathrm{in}} - F}{S} + 1 \end{equation}<br>
\begin{equation}H_{\mathrm{out}} = \frac{H_{\mathrm{in}} - F}{S} + 1 \end{equation}<br>
\begin{equation}C_{\mathrm{out}} = C_{\mathrm{in}} \end{equation}

**Example**: $\max$-pooling with $F=2,~S=2$ performing a factor-2 downsample:

<img src="tutorial2_img/maxpool.png" width="600" />
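
To make the operation concrete, here is a minimal single-channel sketch of max-pooling in plain Python. The function `max_pool2d` below is a hypothetical helper written for illustration (it is not the PyTorch function of the same name), and the input values are made up.

```python
def max_pool2d(x, F=2, S=2):
    """Max-pool a 2D list of numbers with an F x F window and stride S."""
    H, W = len(x), len(x[0])
    H_out = (H - F) // S + 1
    W_out = (W - F) // S + 1
    return [
        [
            max(x[i * S + di][j * S + dj] for di in range(F) for dj in range(F))
            for j in range(W_out)
        ]
        for i in range(H_out)
    ]


x = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool2d(x))  # [[6, 8], [3, 4]]
```

Each output value is the maximum of one non-overlapping 2x2 window, so a 4x4 input becomes 2x2. Pooling has no learnable parameters.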

### PyTorch `MaxPool2d` layer example

In [4]:
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(f'{"After second conv layer:":25s}{conv2(conv1(x0)).shape}')
print(f'{"After max-pool:":25s}{pool(conv2(conv1(x0))).shape}')

After second conv layer: torch.Size([1, 16, 24, 24])
After max-pool:          torch.Size([1, 16, 12, 12])


## LeNet Architecture Example

Let's implement **LeNet**, arguably the first successful CNN model for MNIST (LeCun et al., 1998).
<img src="tutorial2_img/lenet.png" width="1000" />

In [5]:
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)
        self.fc1 = nn.Linear(in_features=16*5*5, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

In [6]:
net = LeNet()
print(net)
print('LeNet(x0)=', net(x0))
print('shape=', net(x0).shape)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
LeNet(x0)= tensor([[ 0.0597, -0.0771, -0.0585,  0.0164,  0.0034, -0.0529,  0.1519,  0.0478,
         -0.0613,  0.0879]], grad_fn=<ThAddmmBackward>)
shape= torch.Size([1, 10])
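
The `in_features=16*5*5` of `fc1` follows from the output-size formulas given earlier. The sketch below traces the spatial size through the network in plain Python; `out_size` is a hypothetical helper used only for this illustration.

```python
def out_size(size_in, F, S=1, P=0):
    # Same formula for conv and pooling layers: (size_in - F + 2P) / S + 1
    return (size_in - F + 2 * P) // S + 1


size = 28                              # MNIST images are 28x28
size = out_size(size, F=5, S=1, P=2)   # conv1    -> 28
size = out_size(size, F=2, S=2)        # max-pool -> 14
size = out_size(size, F=5, S=1)        # conv2    -> 10
size = out_size(size, F=2, S=2)        # max-pool -> 5
fc1_in_features = 16 * size * size     # 16 channels of 5x5 -> 400
print(size, fc1_in_features)           # 5 400
```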


## Exercise

This exercise has 4 parts. You will learn PyTorch at different levels of abstraction, which will help you understand it better.

1. Preparation: we will use the CIFAR-10 dataset.
2. PyTorch Module API: we will use `nn.Module` to define arbitrary neural network architectures.
3. PyTorch Sequential API: we will use `nn.Sequential` to define a linear feed-forward network very conveniently. 
4. CIFAR-10 open-ended challenge: please implement your own network to get as high accuracy as possible on CIFAR-10. You can experiment with any layer, optimizer, hyperparameters or other advanced features. 

Here is a table of comparison:

| API           | Flexibility | Convenience |
|---------------|-------------|-------------|
| `nn.Module`     | High        | Medium      |
| `nn.Sequential` | Low         | High        |

### Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple of minutes the first time you do it, but the files should stay cached after that.

In [39]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [40]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
data_dir = os.path.join(os.path.expanduser('~'), 'cs460/datasets')
cifar10_train = dset.CIFAR10(data_dir, train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10(data_dir, train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10(data_dir, train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
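
The train/validation split above comes entirely from the two index ranges passed to `SubsetRandomSampler`. The plain-Python sketch below checks the resulting sizes; the name `NUM_TOTAL` is introduced here for illustration only.

```python
import math

NUM_TRAIN = 49000
NUM_TOTAL = 50000   # size of the CIFAR-10 training set
BATCH_SIZE = 64

train_indices = range(NUM_TRAIN)
val_indices = range(NUM_TRAIN, NUM_TOTAL)

# The two index sets do not overlap, so no validation image is trained on.
overlap = set(train_indices) & set(val_indices)

# Number of minibatches (iterations) in one training epoch.
batches_per_epoch = math.ceil(NUM_TRAIN / BATCH_SIZE)

print(len(train_indices), len(val_indices), len(overlap), batches_per_epoch)
# 49000 1000 0 766
```

The validation set therefore holds 1000 images, and one epoch runs for 766 iterations.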


You can **use a GPU by setting the flag below to True**. A GPU is not necessary for this exercise. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fall back to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [41]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cuda


### Part II. PyTorch Module API

PyTorch provides the `nn.Module` API for you to define arbitrary network architectures, while tracking every learnable parameter for you. PyTorch also provides the `torch.optim` package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. You can refer to the [doc](http://pytorch.org/docs/master/optim.html) for the exact specifications of each optimizer.

#### Module API: Two-Layer Network
Here is a concrete example of a 2-layer fully connected network:

In [42]:
def flatten(x):
    N = x.shape[0] # read in N, C, H, W
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # assign layer objects to class attributes
        self.fc1 = nn.Linear(input_size, hidden_size)
        # nn.init package contains convenient initialization methods
        # http://pytorch.org/docs/master/nn.html#torch-nn-init 
        nn.init.kaiming_normal_(self.fc1.weight)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        nn.init.kaiming_normal_(self.fc2.weight)
    
    def forward(self, x):
        # forward always defines connectivity
        x = flatten(x)
        scores = self.fc2(F.relu(self.fc1(x)))
        return scores

def test_TwoLayerFC():
    input_size = 50
    x = torch.zeros((64, input_size), dtype=dtype)  # minibatch size 64, feature dimension 50
    model = TwoLayerFC(input_size, 42, 10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]
test_TwoLayerFC()

torch.Size([64, 10])


#### Module API: Three-Layer ConvNet
It's your turn to implement a three-layer ConvNet: two convolutional layers followed by a fully connected layer. The network should have the following architecture:

1. Convolutional layer with `channel_1` 5x5 filters with zero-padding of 2
2. ReLU
3. Convolutional layer with `channel_2` 3x3 filters with zero-padding of 1
4. ReLU
5. Fully-connected layer to `num_classes` classes

You should initialize the weight matrices of the model using the Kaiming normal initialization method.

**HINT**: http://pytorch.org/docs/stable/nn.html#conv2d

After you implement the three-layer ConvNet, the `test_ThreeLayerConvNet` function will run your implementation; it should print `torch.Size([64, 10])` for the shape of the output scores.

In [43]:
class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        ########################################################################
        # TODO: Set up the layers you need for a three-layer ConvNet with the  #
        # architecture defined above.                                          #
        ########################################################################
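        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=channel_1, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(in_channels=channel_1, out_channels=channel_2, kernel_size=3, padding=1)
        # Both convs preserve the 32x32 spatial size, so the FC input is channel_2 * 32 * 32
        self.fc = nn.Linear(channel_2 * 32 * 32, num_classes)
        # Kaiming normal initialization of the weights, as the exercise asks
        for layer in (self.conv1, self.conv2, self.fc):
            nn.init.kaiming_normal_(layer.weight)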
        ########################################################################
        #                          END OF YOUR CODE                            #       
        ########################################################################

    def forward(self, x):
        scores = None
        ########################################################################
        # TODO: Implement the forward function for a 3-layer ConvNet. you      #
        # should use the layers you defined in __init__ and specify the        #
        # connectivity of those layers in forward()                            #
        ########################################################################
        conv1_relu = F.relu(self.conv1(x))
        conv2_relu = F.relu(self.conv2(conv1_relu))
        scores = self.fc(flatten(conv2_relu))
        ########################################################################
        #                             END OF YOUR CODE                         #
        ########################################################################
        return scores


def test_ThreeLayerConvNet():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size [3, 32, 32]
    model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]
test_ThreeLayerConvNet()

torch.Size([64, 10])


#### Module API: Check Accuracy
Given the validation or test set, we can check the classification accuracy of a neural network. 
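
The accuracy computation itself is simple: take the highest-scoring class for each example and compare it with the label. Here is a minimal plain-Python sketch of that idea; the `accuracy` helper and the score values are made up for illustration and are not part of the notebook's code.

```python
def accuracy(scores, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    num_correct = 0
    for row, label in zip(scores, labels):
        pred = max(range(len(row)), key=lambda c: row[c])  # argmax over classes
        num_correct += int(pred == label)
    return num_correct / len(labels)


scores = [
    [0.1, 2.0, -1.0],   # predicts class 1
    [1.5, 0.2, 0.3],    # predicts class 0
    [0.0, 0.1, 0.9],    # predicts class 2
    [0.4, 0.3, 0.2],    # predicts class 0
]
labels = [1, 0, 1, 0]
print(accuracy(scores, labels))  # 0.75
```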

In [44]:
def check_accuracy(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

#### Module API: Training Loop
We use an Optimizer object from the `torch.optim` package, which abstracts the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks.

In [45]:
def train_part(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                check_accuracy(loader_val, model)
                print()
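
To see what the loss in the training loop measures, the sketch below computes the cross-entropy for a single example in plain Python, i.e. $-\log(\mathrm{softmax}(\text{scores})_{y})$. The helper `cross_entropy` here is a hypothetical stand-in written for illustration; `F.cross_entropy` additionally averages over the minibatch by default.

```python
import math


def cross_entropy(scores, label):
    """Cross-entropy loss for one example: -log(softmax(scores)[label])."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


# With 10 classes and equal scores, every class gets probability 1/10.
loss = cross_entropy([0.0] * 10, label=3)
print(round(loss, 4))  # 2.3026, i.e. ln(10)
```

A 10-class model whose initial scores are close to uniform is therefore expected to start with a loss near $\ln 10 \approx 2.3$.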

#### Module API: Train a Two-Layer Network
Now we are ready to run the training loop. 

Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `TwoLayerFC`. 

You also need to define an optimizer that tracks all the learnable parameters inside `TwoLayerFC`.

You don't need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch.

In [46]:
hidden_layer_size = 4000
learning_rate = 1e-2
model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

train_part(model, optimizer)

Iteration 0, loss = 3.4579
Checking accuracy on validation set
Got 125 / 1000 correct (12.50)

Iteration 100, loss = 2.7790
Checking accuracy on validation set
Got 309 / 1000 correct (30.90)

Iteration 200, loss = 1.7359
Checking accuracy on validation set
Got 376 / 1000 correct (37.60)

Iteration 300, loss = 2.0947
Checking accuracy on validation set
Got 361 / 1000 correct (36.10)

Iteration 400, loss = 2.1214
Checking accuracy on validation set
Got 397 / 1000 correct (39.70)

Iteration 500, loss = 1.8188
Checking accuracy on validation set
Got 424 / 1000 correct (42.40)

Iteration 600, loss = 1.7932
Checking accuracy on validation set
Got 443 / 1000 correct (44.30)

Iteration 700, loss = 1.7630
Checking accuracy on validation set
Got 438 / 1000 correct (43.80)



#### Module API: Train a Three-Layer ConvNet
You should now use the Module API to train a three-layer ConvNet on CIFAR-10. This should look very similar to training the two-layer network! You don't need to tune any hyperparameters, but you should achieve above 45% accuracy after training for one epoch.

You should train the model using stochastic gradient descent without momentum.

In [47]:
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16

model = None
optimizer = None
################################################################################
# TODO: Instantiate your ThreeLayerConvNet model and a corresponding optimizer #
################################################################################
model = ThreeLayerConvNet(in_channel=3, channel_1=channel_1, channel_2=channel_2, num_classes=10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

train_part(model, optimizer)

Iteration 0, loss = 2.3270
Checking accuracy on validation set
Got 136 / 1000 correct (13.60)

Iteration 100, loss = 1.9296
Checking accuracy on validation set
Got 322 / 1000 correct (32.20)

Iteration 200, loss = 1.6600
Checking accuracy on validation set
Got 357 / 1000 correct (35.70)

Iteration 300, loss = 1.9171
Checking accuracy on validation set
Got 378 / 1000 correct (37.80)

Iteration 400, loss = 1.6615
Checking accuracy on validation set
Got 416 / 1000 correct (41.60)

Iteration 500, loss = 1.5577
Checking accuracy on validation set
Got 430 / 1000 correct (43.00)

Iteration 600, loss = 1.8113
Checking accuracy on validation set
Got 451 / 1000 correct (45.10)

Iteration 700, loss = 1.7183
Checking accuracy on validation set
Got 435 / 1000 correct (43.50)



### Part III. PyTorch Sequential API

Part II introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity. 

For simple models like a stack of feed forward layers, you still need to go through 3 steps: subclass `nn.Module`, assign layers to class attributes in `__init__`, and call each layer one by one in `forward()`. Fortunately, PyTorch provides a container Module called `nn.Sequential`, which merges the above steps into one. It is not as flexible as `nn.Module`, because you cannot specify more complex topology than a feed-forward stack, but it's good enough for many use cases.
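
Conceptually, a sequential container just feeds the output of each layer into the next. The sketch below illustrates that idea in plain Python; `TinySequential` is a hypothetical stand-in and not how `nn.Sequential` is actually implemented (the real class also registers submodules and their parameters).

```python
class TinySequential:
    """Hypothetical stand-in for nn.Sequential: applies callables in order."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)  # the output of one layer is the input of the next
        return x


# Plain functions stand in for layers in this illustration.
model = TinySequential(
    lambda x: x + 1,
    lambda x: x * 2,
)
print(model(3))  # (3 + 1) * 2 = 8
```

Because the data flow is fixed to a straight chain, topologies with branches or skip connections still need a custom `nn.Module`.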

#### Sequential API: Two-Layer Network
Let's see how to rewrite our two-layer fully connected network example with `nn.Sequential`, and train it using the training loop defined above.

Again, you don't need to tune any hyperparameters here, but you should achieve above 40% accuracy after one epoch of training.

In [48]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

hidden_layer_size = 4000
learning_rate = 1e-2

model = nn.Sequential(
    Flatten(),
    nn.Linear(3 * 32 * 32, hidden_layer_size),
    nn.ReLU(),
    nn.Linear(hidden_layer_size, 10),
)

# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)

train_part(model, optimizer)

Iteration 0, loss = 2.3830
Checking accuracy on validation set
Got 162 / 1000 correct (16.20)

Iteration 100, loss = 1.8040
Checking accuracy on validation set
Got 372 / 1000 correct (37.20)

Iteration 200, loss = 1.8015
Checking accuracy on validation set
Got 430 / 1000 correct (43.00)

Iteration 300, loss = 1.7279
Checking accuracy on validation set
Got 422 / 1000 correct (42.20)

Iteration 400, loss = 1.5839
Checking accuracy on validation set
Got 437 / 1000 correct (43.70)

Iteration 500, loss = 1.8470
Checking accuracy on validation set
Got 428 / 1000 correct (42.80)

Iteration 600, loss = 2.0405
Checking accuracy on validation set
Got 433 / 1000 correct (43.30)

Iteration 700, loss = 1.7106
Checking accuracy on validation set
Got 453 / 1000 correct (45.30)



#### Sequential API: Three-Layer ConvNet
Here you should use `nn.Sequential` to define and train a three-layer ConvNet with the same architecture we used in Part II:

1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes

##### Initialization
Let's write a couple utility methods to initialize the weight matrices for our models.

- `random_weight(shape)` initializes a weight tensor with the Kaiming normal initialization method.
- `zero_weight(shape)` initializes a weight tensor with all zeros. Useful for instantiating bias parameters.

The `random_weight` function uses the Kaiming normal initialization method, described in:

He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852
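
As a small worked example of the scaling used by this method, the sketch below computes the standard deviation $\sqrt{2 / \mathrm{fan\_in}}$ for the layers of the network above. `kaiming_std` is a hypothetical helper name chosen for this illustration.

```python
import math


def kaiming_std(fan_in):
    """Standard deviation for Kaiming normal init with ReLU: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


# fan_in of a conv filter is in_channels * kH * kW.
print(kaiming_std(3 * 5 * 5))      # first conv layer: 3 input channels, 5x5 filters
print(kaiming_std(32 * 3 * 3))     # second conv layer: 32 input channels, 3x3 filters
print(kaiming_std(16 * 32 * 32))   # FC layer fed by 16 channels of 32x32
```

Layers with more inputs get proportionally smaller initial weights, which keeps the variance of the activations roughly constant from layer to layer.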

You should initialize your weight matrices using the `random_weight` function defined below, and you should initialize your bias vectors using the `zero_weight` function defined below.

You should optimize your model using stochastic gradient descent with Nesterov momentum 0.9.

Again, you don't need to tune any hyperparameters but you should see accuracy above 55% after one epoch of training.
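
For intuition about the optimizer, here is a scalar sketch of one SGD step with Nesterov momentum. It follows the update rule as I understand it from the `torch.optim.SGD` documentation (with zero dampening and no weight decay); treat it as an assumption-laden illustration rather than the library's implementation. The function name `sgd_nesterov_step` is hypothetical.

```python
def sgd_nesterov_step(p, grad, buf, lr=1e-2, momentum=0.9):
    """One SGD step with Nesterov momentum on a scalar parameter."""
    buf = momentum * buf + grad            # update the velocity buffer
    p = p - lr * (grad + momentum * buf)   # step along the look-ahead direction
    return p, buf


# Minimize f(p) = p^2, whose gradient is 2p.
p, buf = 1.0, 0.0
for _ in range(200):
    p, buf = sgd_nesterov_step(p, grad=2 * p, buf=buf)
print(abs(p) < 1e-3)
```

Setting `momentum=0` in this sketch reduces it to plain SGD, `p = p - lr * grad`.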

In [49]:
def random_weight(shape):
    """
    Create random Tensors for weights; setting requires_grad=True means that we
    want to compute gradients for these Tensors during the backward pass.
    We use Kaiming normal initialization: std = sqrt(2 / fan_in)
    """
    if len(shape) == 2:  # FC weight, assumed stored as [in_features, out_features]
        fan_in = shape[0]
    else:
        fan_in = np.prod(shape[1:]) # conv weight [out_channel, in_channel, kH, kW]
    # randn is standard normal distribution generator. 
    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
    w.requires_grad = True
    return w

def zero_weight(shape):
    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)

# create a weight of shape [3 x 5]
# you should see the type `torch.cuda.FloatTensor` if you use GPU. 
# Otherwise it should be `torch.FloatTensor`
random_weight((3, 5))

tensor([[-1.1018, -1.1779, -0.7266, -0.2086, -0.5028],
        [-0.8231,  0.8430,  0.3966,  0.6094,  0.6051],
        [-0.8805, -1.0446,  1.4365, -0.6604, -0.6944]],
       device='cuda:0', requires_grad=True)

In [50]:
channel_1 = 32
channel_2 = 16
learning_rate = 1e-2

model = None
optimizer = None

################################################################################
# TODO: Rewrite the three-layer ConvNet with bias from Part II with the        #
# Sequential API.                                                              #
################################################################################
model = nn.Sequential(
    nn.Conv2d(3, channel_1, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, kernel_size=3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, 10),
)

optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)

# Weight initialization
# Ref: http://pytorch.org/docs/stable/nn.html#torch.nn.Module.apply
def init_weights(m):
    # Kaiming normal weights and zero biases for every conv and linear layer
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        nn.init.constant_(m.bias, 0.)

model.apply(init_weights)
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

train_part(model, optimizer)

Iteration 0, loss = 2.4663
Checking accuracy on validation set
Got 115 / 1000 correct (11.50)

Iteration 100, loss = 1.8741
Checking accuracy on validation set
Got 420 / 1000 correct (42.00)

Iteration 200, loss = 1.6008
Checking accuracy on validation set
Got 516 / 1000 correct (51.60)

Iteration 300, loss = 1.4181
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Iteration 400, loss = 1.3120
Checking accuracy on validation set
Got 536 / 1000 correct (53.60)

Iteration 500, loss = 1.2413
Checking accuracy on validation set
Got 556 / 1000 correct (55.60)

Iteration 600, loss = 1.0920
Checking accuracy on validation set
Got 580 / 1000 correct (58.00)

Iteration 700, loss = 1.3343
Checking accuracy on validation set
Got 582 / 1000 correct (58.20)



### Part IV. CIFAR-10 open-ended challenge

In this section, you can experiment with whatever ConvNet architecture you'd like on CIFAR-10. 

Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves **at least 70%** accuracy on the CIFAR-10 **validation** set within 10 epochs. You can use the `check_accuracy` and `train_part` functions from above. You can use either the `nn.Module` or the `nn.Sequential` API.

#### Things you might try:
- **Filter size**: Above we used 5x5; would smaller filters be more efficient?
- **Number of filters**: Above we used 32 filters. Do more or fewer do better?
- **Pooling vs Strided Convolution**: Do you use max pooling or just stride convolutions?
- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?
- **Network architecture**: The network above has three layers of trainable parameters. Can you do better with a deeper network? Good architectures to try include:
    - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then apply an average pooling operation that reduces each feature map to 1x1, giving a tensor of shape (1, 1, Filter#), which is then reshaped into a (Filter#) vector. This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).
- **Regularization**: Add L2 weight regularization, or perhaps use Dropout.
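
The global average pooling idea from the list above can be sketched in a few lines of plain Python. The helper `global_avg_pool` and the feature values are made up for illustration; in PyTorch you would use a pooling layer on tensors instead of nested lists.

```python
def global_avg_pool(x):
    """Average each channel's H x W feature map down to a single number.

    x is a nested list of shape (C, H, W); the result has length C.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in x
    ]


# Two channels of 2x2 features (made-up values).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
print(global_avg_pool(features))  # [2.5, 2.0]
```

The output length equals the number of channels regardless of the spatial size, so the classifier that follows needs far fewer weights than one fed by a flattened feature map.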

#### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these, but don't miss the fun if you have time!

- Alternative optimizers: you can try Adam, Adagrad, RMSprop, etc.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.
  - [DenseNets](https://arxiv.org/abs/1608.06993) where inputs into previous layers are concatenated together.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)

In [51]:
# A 4-layer convolutional network
# (conv -> batchnorm -> relu -> maxpool) * 3 -> fc
layer1 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),
#     nn.ReLU(),
    nn.BatchNorm2d(16),
    nn.ReLU(),
#     nn.Conv2d(16, 16, kernel_size=4, padding=1, stride=2),    
#     nn.ReLU()
    nn.MaxPool2d(2,2)
)

layer2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
#     nn.Conv2d(32, 32, kernel_size=4, padding=1, stride=2),    
#     nn.ReLU()
    nn.MaxPool2d(2,2)
)

layer3 = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
#     nn.Conv2d(64, 64, kernel_size=4, padding=1, stride=2),    
#     nn.ReLU()
    nn.MaxPool2d(2,2)
)

fc = nn.Linear(64*4*4, 10)

model = nn.Sequential(
    layer1,
    layer2,
    layer3,
    Flatten(),
    fc
)

learning_rate = 1e-3

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print_every = 10000


train_part(model, optimizer, epochs=10)

Iteration 0, loss = 2.3252
Checking accuracy on validation set
Got 135 / 1000 correct (13.50)

Iteration 0, loss = 1.1006
Checking accuracy on validation set
Got 638 / 1000 correct (63.80)

Iteration 0, loss = 0.9431
Checking accuracy on validation set
Got 674 / 1000 correct (67.40)

Iteration 0, loss = 0.8599
Checking accuracy on validation set
Got 717 / 1000 correct (71.70)

Iteration 0, loss = 0.6269
Checking accuracy on validation set
Got 738 / 1000 correct (73.80)

Iteration 0, loss = 0.4490
Checking accuracy on validation set
Got 731 / 1000 correct (73.10)

Iteration 0, loss = 0.8030
Checking accuracy on validation set
Got 733 / 1000 correct (73.30)

Iteration 0, loss = 0.5838
Checking accuracy on validation set
Got 736 / 1000 correct (73.60)

Iteration 0, loss = 0.4378
Checking accuracy on validation set
Got 754 / 1000 correct (75.40)

Iteration 0, loss = 0.6335
Checking accuracy on validation set
Got 749 / 1000 correct (74.90)



#### Test set -- run this only once

Now that we've gotten a result we're happy with, we evaluate our final model (which you should store in `best_model`) on the test set. Think about how this compares to your validation set accuracy.

In [52]:
best_model = model
check_accuracy(loader_test, best_model)

Checking accuracy on test set
Got 7480 / 10000 correct (74.80)
