In [1]:
import numpy as np
import pandas as pd
from scipy.special import expit

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Torch
import torch
import torch.nn as nn
import torch.nn.functional as F 
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch import tensor
import torch.optim as optim

# Plotting
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Artificial Neural Network - Convolutional Neural Networks

### Summary
The exercises here aim to build some understanding of, and hands-on experience with, CNNs. Before these exercises, I also suggest that you have a look at exampleFilters.ipynb to see some simple filters that detect vertical and horizontal edges.

### Exercise 1
Consider a CNN that takes in 32 × 32 grayscale images and has a single convolution layer with three 5 × 5 convolution filters (without padding).
- What is the size of the feature map? (feature map is the output of one filter applied to the previous layer)
- What padding size gives a feature map with the same size as the input (this is sometimes called "same padding")?
- How many parameters are in this model?
- Explain how this model can be thought of as an ordinary feedforward neural network with the individual pixels as inputs. Are there any constraints on the weights?

Assumptions
- As input we take a 32 x 32 grayscale image
- We run the image through three 5x5 filters
- We will use no padding around the image
- **The stride length is 1**

Every time we run the image through a filter, each of its dimensions shrinks by 4.

To compute the size of the output channel, we use that
- input dimension is $N=32$
- filter dimension is $F=5$
- we use no padding, therefore the input dimension $N$ will shrink by $F-1$.

Therefore the output dimension will be

$$
O=N-(F-1)
$$

$$
O=32-(5-1)=28
$$

Each feature map is therefore 28 x 28.

If we want to end up with an output of the same dimension as the input, then we need to add a padding of 2 to each side of the image. 

We can compute the desired dimension of the input, $N$, as

$$
\begin{align}
O &= N-(F-1) \\
N &= O+(F-1)
\end{align}
$$

Given the desired output size $O$ and the filter size $F$, we can compute the needed input size $N$.

$$
N = 32 + (5-1)=36
$$

We need to add 4 pixels to each row and column of the input, which means adding 2 pixels on each side of the input image. Using the padded image, we obtain an output image with the same dimensions as the original input.
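
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper names `conv_output_size` and `same_padding_stride1` are made up for illustration, and the second helper assumes an odd filter size and stride 1.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output size along one dimension; padding is the amount added per side."""
    return (n - f + 2 * padding) // stride + 1


def same_padding_stride1(f):
    """Per-side padding that keeps the size unchanged (stride 1, odd f)."""
    return (f - 1) // 2


no_pad = conv_output_size(32, 5)                  # no padding
pad = same_padding_stride1(5)                     # padding per side
with_pad = conv_output_size(32, 5, padding=pad)   # padded input
print(no_pad, pad, with_pad)  # 28 2 32
```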

Each filter has $5 \times 5 = 25$ weights. Since there are three filters, we have 75 weights in total (78 parameters if each filter also has a bias term).


You can think of it as a feedforward neural network, where each pixel in a convolutional layer is a neuron and the filters are sets of weights. Each neuron takes 25 pixels as input (either from the input layer or from the previous convolutional layer). But in contrast to a FFNN, each neuron is connected only to a small patch of the input, and all neurons in the same feature map use the same weights. This significantly reduces the number of parameters.
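
To make the feedforward view concrete, the sketch below writes one 5 x 5 filter as a dense weight matrix acting on the 1024 flattened pixels and checks that it gives the same result as sliding the filter over the image. The helper names are hypothetical, the image and kernel are random stand-ins, and the operation is cross-correlation (no kernel flip), as is usual in deep learning libraries.

```python
import numpy as np


def conv_as_dense(kernel, n):
    """Dense (o*o, n*n) matrix equivalent to a 'valid' convolution on an n x n image."""
    f = kernel.shape[0]
    o = n - f + 1
    W = np.zeros((o * o, n * n))
    for r in range(o):
        for c in range(o):
            for i in range(f):
                for j in range(f):
                    # Output neuron (r, c) sees input pixel (r+i, c+j) with weight kernel[i, j]
                    W[r * o + c, (r + i) * n + (c + j)] = kernel[i, j]
    return W


def conv2d_valid(image, kernel):
    """Direct sliding-window computation, for comparison."""
    f = kernel.shape[0]
    o = image.shape[0] - f + 1
    out = np.empty((o, o))
    for r in range(o):
        for c in range(o):
            out[r, c] = np.sum(image[r:r + f, c:c + f] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
kernel = rng.normal(size=(5, 5))

W = conv_as_dense(kernel, 32)
dense_out = (W @ image.ravel()).reshape(28, 28)
direct_out = conv2d_valid(image, kernel)

n_nonzero = np.count_nonzero(W)
n_distinct = len(np.unique(W[W != 0]))
print(W.shape, np.allclose(dense_out, direct_out), n_nonzero, n_distinct)
```

The dense matrix has 784 x 1024 entries, but only 25 of them per row are non-zero and the same 25 values repeat in every row. These are the two constraints on the weights: local connectivity and weight sharing.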

Last but not least, why do we actually need a convolutional layer? The input image has 1024 pixels. If we fed these directly to our FFNN, the network would treat them as a flat vector and ignore the spatial arrangement of the pixels. The core idea behind a convolutional layer is therefore to have the model learn local patterns from the image and THEN feed these patterns as input to our FFNN.

### Exercise 2
Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and "same" padding. The lowest layer outputs 100 feature maps (i.e., channels), the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels.
- What is the total number of parameters in the CNN? If we are using 32-bit floats for every parameter, at least how much RAM will this network require when making a prediction for a single instance?
- What about when training on a mini-batch of 50 images?
- Why would you want to add a max pooling layer rather than a convolutional layer (with the same stride)?

Link for output dimensions: https://kvirajdatt.medium.com/calculating-output-dimensions-in-a-cnn-for-convolution-and-pooling-layers-with-keras-682960c73870

#### Same padding and output size (Convolution layer)
We can compute the convolution output dimension as

$$
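O = \left\lfloor \frac{I-F+2\cdot P}{S} \right\rfloor + 1
$$

where $I$ is the input size along one dimension (for an $i \times i$ image), $F$ is the size of the filter/kernel ($f \times f$), $S$ is the stride and $P$ is the padding added to each side. The output has depth $D$ (the number of feature maps), so the full output volume is $O \times O \times D$.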

If you are using same padding with stride > 1, the padding is the smallest total amount for which the output size equals the value given below.

Same padding means that 

$$
O=\left\lceil \frac{I}{S} \right\rceil
$$
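
As a quick check, the sketch below applies this rule three times with stride 2 to the 200 x 300 input of the exercise. The helper name `same_output_size` is made up for illustration.

```python
import math


def same_output_size(i, s):
    """Output size along one dimension with 'same' padding and stride s."""
    return math.ceil(i / s)


size = (200, 300)
sizes = []
for _ in range(3):
    size = tuple(same_output_size(d, 2) for d in size)
    sizes.append(size)

print(sizes)  # [(100, 150), (50, 75), (25, 38)]
```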

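Writing $p$ for the total padding along one dimension (both sides combined, so $p = 2P$ when the padding is symmetric), the generic output formula is

$$
O = \left\lfloor \frac{I-F+p}{S} \right\rfloor + 1
$$

If you solve for $p$, you get the range of total padding that produces output size $O$:

$$
\begin{align}
p_{min} &= (O-1)S-I+F \\
p_{max} &= O\cdot S-I+F-1
\end{align}
$$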
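
The bounds can be verified numerically. The sketch below assumes that $p$ denotes the total padding along one dimension (both sides combined) and that the target output size is $\lceil I/S \rceil$; the function names are hypothetical.

```python
import math


def total_padding_range(i, f, s):
    """Target 'same' output size and the range of total padding that achieves it."""
    o = math.ceil(i / s)
    p_min = max((o - 1) * s - i + f, 0)
    p_max = o * s - i + f - 1
    return o, p_min, p_max


def output_size(i, f, s, total_padding):
    """Generic output formula with total padding along one dimension."""
    return (i - f + total_padding) // s + 1


# Input sizes that occur in this exercise (3 x 3 kernel, stride 2)
for i in (200, 300, 75):
    o, p_min, p_max = total_padding_range(i, 3, 2)
    # Every padding inside the range gives o; one step outside does not
    assert all(output_size(i, 3, 2, p) == o for p in range(p_min, p_max + 1))
    assert output_size(i, 3, 2, p_max + 1) != o
    print(i, o, p_min, p_max)
```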

#### Output size (Max pooling layer)
For a pooling layer, one specifies only the filter/kernel size ($F$) and the stride ($S$); the depth $D$ is unchanged.

$$
O = \left\lfloor \frac{I-F}{S} \right\rfloor + 1
$$

#### Number of parameters

We have 3 convolution layers with $3\times 3$ filters, outputting 100, 200 and 400 feature maps. 

The first convolution layer outputs images of size $100 \times 150$. The second convolution layer outputs images of size $50 \times 75$. The third convolution layer outputs images of size $25 \times 38$. 

As an input we take RGB images of resolution $200 \times 300$ pixels.

Let's start by computing the number of parameters for each convolutional layer (bias terms are ignored here):
- **Convolutional layer 1:** We take an RGB image as input and transform it into 100 new channels. To obtain each of these channels, we need one $3 \times 3$ filter per input channel. Therefore per output channel we will have $3 \times 3 \times 3 = 27$ parameters. In total we need to train **2700 parameters** for the first convolutional layer.
- **Convolutional layer 2:** Here we take 100 channels as input. Therefore, for each output channel we need $100 \times 3 \times 3 = 900$ parameters. Since we have 200 output channels, we will need to train **180,000 parameters**.
- **Convolutional layer 3:** Here we take 200 channels as input, therefore for each output channel we will need $200 \times 3 \times 3 = 1800$ parameters. With 400 output channels, in total we will need to train **720,000 parameters**.

Summing over all parameters, our network has **902,700 trainable parameters**. Each of these parameters will be represented as a 32-bit float (4 bytes), so all these parameters will take up $3,610,800$ bytes or about $3.6$ MB of RAM.
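
The counts can be reproduced with a short script. This is a sketch with a hypothetical helper `conv_params`; it also shows the total when one bias per output channel is included, which the tally above leaves out.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=False):
    """Number of trainable parameters in one convolutional layer."""
    per_filter = in_channels * kernel * kernel + (1 if bias else 0)
    return per_filter * out_channels


channels = [3, 100, 200, 400]   # RGB input followed by the three layers
pairs = list(zip(channels, channels[1:]))

weights = [conv_params(a, b) for a, b in pairs]
total = sum(weights)
total_with_bias = sum(conv_params(a, b, bias=True) for a, b in pairs)

print(weights, total)     # [2700, 180000, 720000] 902700
print(total_with_bias)    # 903400
print(total * 4)          # 3610800 bytes with 32-bit floats
```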

#### Prediction for a single instance
Next we estimate how much memory the feature maps take.
- **Convolutional layer 1:** We take a 3D array with $3\times 200 \times 300$ values and output 100 channels. With stride 2, the filter response for each input channel (R, G, B) has size $100\times 150$. These three responses are summed into a single map of the same size, so per output channel we need $(3+1) \times 100 \times 150$ values. Going through the first convolutional layer will require:

$$
(3+1)\times 100 \times 150 \times 100 = 6,000,000 \text{ values}
$$

- **Convolution layer 2:** Similarly for this layer

$$
(100+1) \times 50 \times 75 \times 200 = 75,750,000 \text{ values}
$$

- **Convolution layer 3:** Similarly for this layer

$$
(200+1) \times 25 \times 38 \times 400 = 76,380,000 \text{ values}
$$

So in total, if we wanted to fit all values into main memory, we would need:

$$
(6,000,000 + 75,750,000 + 76,380,000)\times 4 /1000 /1000 \approx 632.5 \text{ MB}
$$

Adding the 3.6 MB for the parameters, we would need approximately **636 MB of memory** to predict a single instance.

#### Training on a mini batch of 50 images
Training on a mini-batch of 50 images requires the feature maps of all 50 images plus a single copy of the parameters:

$$
632.5\times 50 + 3.6 \approx 31,629 \text{ MB} \approx 31.6 \text{ GB}
$$
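
The tally can be re-derived from the layer shapes. The sketch below follows the accounting used in this section (one map per input channel plus the summed map, for every output channel) and takes the third layer's map size as 25 x 38; the variable names are made up for illustration.

```python
BYTES_PER_FLOAT = 4

# (input channels, output channels, output height, output width)
layers = [(3, 100, 100, 150), (100, 200, 50, 75), (200, 400, 25, 38)]

# Values held per layer under the accounting of this section
values = [(c_in + 1) * h * w * c_out for c_in, c_out, h, w in layers]

activation_mb = sum(values) * BYTES_PER_FLOAT / 1e6
param_mb = 902_700 * BYTES_PER_FLOAT / 1e6

single_mb = activation_mb + param_mb              # one instance
batch_gb = (50 * activation_mb + param_mb) / 1e3  # mini-batch of 50

print(values)
print(round(activation_mb, 1), round(single_mb, 1), round(batch_gb, 1))
```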

#### Max pooling layer
Pooling layers help us reduce the dimensions of the output layers by summarizing neighbouring values, shrinking each feature map by the stride factor along each dimension.

The purpose of pooling layers is to
- reduce the computational load
- reduce memory usage
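- reduce the number of parameters (thereby limiting the risk of overfitting): a pooling layer has no weights of its own, whereas a convolutional layer with the same stride adds new ones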
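
The difference in parameter count can be seen directly in PyTorch. This is a small sketch on a dummy tensor shaped like the output of the first layer in Exercise 2; the helper `count_parameters` and the choice of a 3 x 3 convolution with padding 1 as the comparison are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def count_parameters(module):
    """Total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())


# Dummy batch: 1 image, 100 channels, 100 x 150 pixels
x = torch.zeros(1, 100, 100, 150)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
conv = nn.Conv2d(100, 100, kernel_size=3, stride=2, padding=1)

with torch.no_grad():
    pool_shape = tuple(pool(x).shape)
    conv_shape = tuple(conv(x).shape)

print(pool_shape, count_parameters(pool))
print(conv_shape, count_parameters(conv))
```

Both layers halve the spatial size, but only the convolution brings extra weights (and one bias per output channel) that have to be trained and stored.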

### Exercise 3
#### Solving Fashion-MNIST with the LeNet architecture
Here you will implement a LeNet architecture of Convolutional Neural Networks (as briefly introduced in the lecture and shown in LeNet.pdf here) with PyTorch (you may get inspired by the code in exampleCNN.ipynb).

First you will download the Fashion-MNIST dataset. Split it into train/validation/test datasets and train the network. Finally, plot the learning curves (train/validation loss and accuracy) and show the confusion matrix.
 1. Download Fashion-MNIST
 2. Split the data into train / validation / test subsets. Make mini-batches if necessary.
 3. Build a CNN model
 4. Train the model on the dataset
 5. Plot the training curves (Loss and accuracy)
 6. Show the confusion matrix and accuracy on the test dataset.
 7. Try adding Dropout layers; play with the hyperparameters. Use cross-validation to find the best hyperparameters
 8. When you train the best model, visualize the filters of the first convolutional layer. You may look at an example on how to visualize filters in PyTorch: https://stackoverflow.com/questions/55594969/how-to-visualise-filters-in-a-cnn-with-pytorch


#### Split data into train and validation

In [2]:
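# Load the training set first so that `data` is defined before the split
data = datasets.FashionMNIST(root='data', train=True, transform=ToTensor(), download=True)
train_set, val_set = torch.utils.data.random_split(data, [50000, 10000])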

#### Make mini batches

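Mini-batches split each epoch into small random subsets of the training data. The weights are updated after every subset instead of once per pass over the whole dataset, which keeps memory usage bounded on large datasets and gives more frequent, slightly noisy gradient updates.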
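
A minimal sketch of how `DataLoader` produces mini-batches, using zero-filled stand-in tensors with the Fashion-MNIST image shape instead of the real dataset (the 250 samples and batch size of 100 are arbitrary choices for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 250 grayscale images of 28 x 28 pixels with dummy labels
images = torch.zeros(250, 1, 28, 28)
labels = torch.zeros(250, dtype=torch.long)

loader = DataLoader(TensorDataset(images, labels), batch_size=100, shuffle=True)

# One pass over the loader is one epoch; each iteration yields one mini-batch
batch_shapes = [tuple(X.shape) for X, y in loader]
print(len(loader), batch_shapes)
```

The last batch is smaller because 250 is not a multiple of 100; passing `drop_last=True` to `DataLoader` would discard it.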

#### Build a CNN model

In [22]:
class CNN(nn.Module):
    def __init__(self, **INPUT):

        # Inherit from nn module
        super(CNN, self).__init__()

        # Define activation function
        self.af = INPUT.get('af')

        # Define drop ratio
        self.dpr = INPUT.get('dpr')

        # Define neural network architecture
        self.nn = nn.Sequential(
                # C1 6@28x28
                nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm2d(6),
                self.af(),
                nn.AvgPool2d(kernel_size=2, stride=2),
                
                # C2: 16@10x10
                nn.Dropout(self.dpr),
                nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
                nn.BatchNorm2d(16),
                self.af(),
                nn.AvgPool2d(kernel_size=2, stride=2),
                
                # Apply flattening on the output
                nn.Flatten(),
                
                # Dense part
                # L1
                nn.Dropout(self.dpr),
                nn.Linear(16 * 5 * 5, 120),
                nn.BatchNorm1d(120),
                self.af(),
                
                # L2
                nn.Dropout(self.dpr),
                nn.Linear(120, 84),
                nn.BatchNorm1d(84),
                self.af(),
                
                # L3
                nn.Dropout(self.dpr),
                nn.Linear(84, 10))
        
        # Define batch size
        self.batch_size = INPUT.get('batch_size')

        # Define datasets
        self.training_data = DataLoader(
            INPUT.get('trd'), batch_size=self.batch_size, shuffle=True, num_workers=1)
        self.validation_data = DataLoader(
            INPUT.get('vd'), batch_size=self.batch_size, shuffle=True, num_workers=1)
        self.test_data = DataLoader(
            INPUT.get('ted'), batch_size=self.batch_size, shuffle=True, num_workers=1)

        # Define loss function
        self.loss_fn = INPUT.get('loss_fn')

        # Define learning rate
        self.lr = INPUT.get('lr')

        # Define number of epochs
        self.epochs = INPUT.get('epochs')

        # Define optimizer
        self.optimizer = INPUT.get('optim')(self.parameters(), lr=self.lr)

        # Save training progress
        self.loss_history = []
        self.acc_history = []
    
    def forward(self, x):
        logits = self.nn(x)
        return logits
    
    def train_loop(self):
        
        size = len(self.training_data.dataset)
        for batch, (X, y) in enumerate(self.training_data):

            # Compute prediction and loss
            pred = self.forward(X)
            loss = self.loss_fn(pred, y)

            # Backpropagation
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            if batch % 100 == 0:
                loss, current = loss.item(), batch * len(X)
                print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    def val_loop(self):
        size = len(self.validation_data.dataset)
        num_batches = len(self.validation_data)
        test_loss, correct = 0, 0

        with torch.no_grad():
            for X, y in self.validation_data:
                pred = self.forward(X)
                test_loss += self.loss_fn(pred, y).item()
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()
        
        test_loss /= num_batches
        correct /= size

        print(
            f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

        # Save it to history
        self.acc_history.append(correct)
        self.loss_history.append(test_loss)

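    def visualize(self):
        # One point per completed epoch (validation accuracy and loss)
        x = list(range(1, len(self.acc_history) + 1))
        plt.plot(x, self.acc_history, label='validation accuracy')
        plt.plot(x, self.loss_history, label='validation loss')
        plt.xlabel('epoch')
        plt.legend()
        plt.show()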

    def fit(self):
        for t in range(self.epochs):
            print(f"Epoch {t+1}\n-------------------------------")
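            # Training mode: dropout active, batch norm uses batch statistics
            self.train()
            self.train_loop()
            # Evaluation mode: dropout off, batch norm uses running statistics
            self.eval()
            self.val_loop()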
        print("Done!")

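    def predict(self, x):
        # Inference: no dropout, running batch-norm statistics, no gradients
        was_training = self.training
        self.eval()
        with torch.no_grad():
            logits = self.forward(x)
        self.train(was_training)
        # argmax of the logits equals argmax of the softmax
        return logits.argmax(1)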

    def test(self):
        # Collect labels and predictions over the whole test set, batch by batch
        ys, y_hats = [], []
        for X, y in self.test_data:
            ys.append(y)
            y_hats.append(self.predict(X))
        y, y_hat = torch.cat(ys), torch.cat(y_hats)

        print("Accuracy score for test data")
        print("-"*60)
        print(f"Acc: {accuracy_score(y, y_hat)*100} %")
        print()
        print("Confusion matrix for test data")
        print("-"*60)
        print(confusion_matrix(y, y_hat))
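
The `16 * 5 * 5` input size of the first linear layer follows from the shapes produced by the convolutional part. The sketch below traces a dummy 28 x 28 image through stand-alone copies of those layers; batch norm, activations and dropout are left out because they do not change the shape.

```python
import torch
import torch.nn as nn

# Shape-changing layers of the convolutional part, in order
layers = [
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
]

x = torch.zeros(1, 1, 28, 28)   # one dummy Fashion-MNIST-sized image
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")
```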

#### Train model on dataset

In [23]:
# Get training data
data = datasets.FashionMNIST(
    root = 'data',
    train = True,                         
    transform = ToTensor(), 
    download = True,            
)

# Get test data
test_data = datasets.FashionMNIST(
    root = 'data', 
    train = False, 
    transform = ToTensor()
)

# Split training into training and validation (seeded generator for a reproducible split)
g_cpu = torch.Generator()
g_cpu.manual_seed(3)
training_data, val_data = torch.utils.data.random_split(data, [50000, 10000], generator=g_cpu)

# Initialize model
INPUT = {
    'batch_size': 100,
    'trd': training_data,
    'vd': val_data,
    'ted': test_data,
    'loss_fn': nn.CrossEntropyLoss(),
    'lr': 1e-1,
    'epochs': 5,
    'af': nn.Sigmoid,
    'optim': torch.optim.Adam,
    'dpr': 1e-3

}
model = CNN(**INPUT)

# Train model
model.fit()

Epoch 1
-------------------------------
loss: 2.382711  [    0/50000]
loss: 0.812746  [10000/50000]
loss: 0.464937  [20000/50000]
loss: 0.500483  [30000/50000]
loss: 0.361832  [40000/50000]
Test Error: 
 Accuracy: 83.8%, Avg loss: 0.439631 

Epoch 2
-------------------------------
loss: 0.639330  [    0/50000]
loss: 0.496878  [10000/50000]
loss: 0.476845  [20000/50000]
loss: 0.400529  [30000/50000]
loss: 0.300162  [40000/50000]
Test Error: 
 Accuracy: 85.9%, Avg loss: 0.383186 

Epoch 3
-------------------------------
loss: 0.272244  [    0/50000]
loss: 0.332273  [10000/50000]
loss: 0.286407  [20000/50000]
loss: 0.312728  [30000/50000]
loss: 0.198159  [40000/50000]
Test Error: 
 Accuracy: 87.6%, Avg loss: 0.336496 

Epoch 4
-------------------------------
loss: 0.171026  [    0/50000]
loss: 0.361318  [10000/50000]
loss: 0.343658  [20000/50000]
loss: 0.326028  [30000/50000]
loss: 0.266119  [40000/50000]
Test Error: 
 Accuracy: 86.7%, Avg loss: 0.346077 

Epoch 5
------------------------

#### Plot the training curves
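
One possible way to plot the curves is sketched below, using the accuracy and average loss printed after each of the four completed epochs in the log above. These are computed on the validation split (the class prints them under "Test Error"). The training loss is only printed per batch and not stored, so it is not included here.

```python
import matplotlib.pyplot as plt

# Values copied from the printed log for epochs 1 to 4
val_acc = [0.838, 0.859, 0.876, 0.867]
val_loss = [0.439631, 0.383186, 0.336496, 0.346077]
epochs = range(1, len(val_acc) + 1)

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, val_loss, marker="o")
ax_loss.set(xlabel="epoch", ylabel="validation loss")
ax_acc.plot(epochs, val_acc, marker="o")
ax_acc.set(xlabel="epoch", ylabel="validation accuracy")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.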

#### Show the confusion matrix and accuracy on the test data

#### Optimise model
Try adding Dropout layers; play with the hyperparameters. Use cross-validation to find the best hyperparameters.

#### Visualize the filters of the first convolution layer
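
A minimal sketch of how the six 5 x 5 filters of the first layer could be drawn. It uses an untrained `nn.Conv2d` as a stand-in; with the class above, the trained layer would be `model.nn[0]`, the first entry of the `nn.Sequential`.

```python
import matplotlib.pyplot as plt
import torch.nn as nn

# Stand-in for the trained first convolutional layer (assumption: untrained weights)
first_conv = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2)

# Weight tensor has shape (out_channels, in_channels, height, width)
filters = first_conv.weight.detach().cpu().numpy()

fig, axes = plt.subplots(1, filters.shape[0], figsize=(12, 2))
for idx, ax in enumerate(axes):
    ax.imshow(filters[idx, 0], cmap="gray")
    ax.set_title(f"filter {idx}")
    ax.axis("off")
fig.tight_layout()
```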