# Convolutional Neural Networks #

Convolutional Neural Networks (CNNs) are deep learning architectures that are commonly used for image classification, though they are used in other settings as well. CNNs can be more useful for this task than a general deep feed-forward network because they preserve relative positioning, say between two pixels, when learning. This is because pixels "near" each other are processed together by the convolutional layer described below. There are also other layers involved in the basic CNN structure, often including pooling layers and the usual fully connected layers.

### Convolutional Layer ###

The basic component of a CNN is the filter, called a kernel. There are many kernels in a CNN, and through training they eventually learn features useful for classification. For example, we may be trying to create a classifier that determines if an image contains a person, dog, or cat. The network would have many kernels that may each learn to pick out specific parts of an image useful for distinguishing between these three classes. For example, a kernel may learn the feature "pointed ears on top of a head". In that case, if this feature is in an image, the classifier may consider it more likely that the image is of a cat, possibly a dog, but definitely not a human.

A kernel itself is a matrix of size $k \times k$ that "sweeps" over the layer input, in this case say an image that is $n \times m \times 3$. That is, the image has a height $n$, a width $m$, and 3 channels (signifying it is an RGB image). The kernel moves over the first channel of size $n \times m$ by moving across groups of $k \times k$ pixels. For each group, it calculates the dot product of the input and the kernel weights, and that value is added to an output matrix. An example of a $5 \times 5$ image with a $3 \times 3$ kernel is shown below (borrowed from: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53). The current part of the image being processed is highlighted in yellow and the kernel weights are shown in the corner of the yellow squares in red.

After calculating the dot product, the yellow matrix slides over some number of squares to the right (determined by the kernel setting called the step size or stride) and calculates the new dot product, which is added to the output matrix being constructed. This continues over the green image matrix until the entire image has been processed (the same is done for any additional channels, and in a standard convolutional layer the per-channel results are summed into a single output matrix). Then the output matrix is passed on to the next layer. By varying the stride length, we can get filters that focus on different amounts of detail in the image. A shorter stride length yields a larger output (in terms of dimension) and can capture more local details. A larger stride length yields a smaller output and can capture more global details in the image.
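
The sweep described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single channel with no padding, not how a deep learning library implements it; the function name `convolve2d` and the toy image and kernel values are assumptions made for the example.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Sweep a k x k kernel over a single-channel image (no padding)."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + k, left:left + k]
            # dot product of the current window and the kernel weights
            output[i, j] = np.sum(window * kernel)
    return output

# toy 5x5 "image" and a 3x3 kernel of ones (illustrative values only)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))

out_stride1 = convolve2d(image, kernel, stride=1)
out_stride2 = convolve2d(image, kernel, stride=2)
print(out_stride1.shape)  # (3, 3)
print(out_stride2.shape)  # (2, 2)
```

With the same image and kernel, the longer stride visits fewer windows and so produces the smaller output.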

A single convolutional layer is usually made up of many filters, each producing its own output matrix, so the output has one channel per filter. Therefore, the stride length of the kernels and the number of kernels are both things to consider when it comes to computational efficiency and learning trade-offs.
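
To make the trade-off concrete, the output shape of a convolutional layer can be estimated from the usual formula $\lfloor (n + 2p - k)/s \rfloor + 1$ per spatial dimension, where $p$ is the padding and $s$ the stride. The helper below is a hypothetical name introduced for this sketch, and the filter counts are only example values.

```python
def conv_output_shape(height, width, kernel_size, stride, num_filters, padding=0):
    """Return (channels, height, width) of a convolutional layer's output."""
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (num_filters, out_h, out_w)

# an 8x8 single-channel input with a 3x3 kernel and no padding
for stride in (1, 2):
    for num_filters in (15, 30):
        shape = conv_output_shape(8, 8, kernel_size=3, stride=stride,
                                  num_filters=num_filters)
        num_values = shape[0] * shape[1] * shape[2]
        print(f"stride={stride}, filters={num_filters}: shape={shape}, values={num_values}")
```

Doubling the number of filters doubles the number of output values, while moving from a stride of 1 to a stride of 2 cuts it by a factor of four here.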

### Pooling Layers ###

A pooling layer reduces the dimensions (downsamples) in order to reduce the number of values, and therefore the number of parameters required in later layers of the model. This is also achieved with a kernel filter, which in this case just sweeps over the layer to determine which parts of a given matrix are "pooled" together. There are two common ways of pooling. In max pooling, the maximum value "covered" by the kernel is passed on to the next layer. In average pooling, the average of the values "covered" by the kernel is passed on. In this way, for example, a $5 \times 5$ area of an image (25 pixels) is reduced to 1 pixel. Pooling usually happens after some convolutional layers are applied, which generally increase the dimensions from the original input.
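
A minimal NumPy sketch of both pooling modes follows. The function name `pool2d` and the toy input are assumptions for illustration; the window here is $2 \times 2$ with a stride of 2, so each group of 4 values is reduced to 1.

```python
import numpy as np

def pool2d(matrix, size=2, stride=2, mode="max"):
    """Downsample a 2D matrix by taking the max or mean of each window."""
    out_h = (matrix.shape[0] - size) // stride + 1
    out_w = (matrix.shape[1] - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = matrix[top:top + size, left:left + size]
            output[i, j] = window.max() if mode == "max" else window.mean()
    return output

matrix = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(matrix, mode="max")
avg_pooled = pool2d(matrix, mode="average")
print(max_pooled)  # [[ 5.  7.] [13. 15.]]
print(avg_pooled)  # [[ 2.5  4.5] [10.5 12.5]]
```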

### Fully-Connected Layer ###

These are the normal layers of a general deep neural network. They are usually applied at the end to return the final classification of a given input.
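
As a rough sketch of what this final step does, the feature maps are flattened into a vector and multiplied by a weight matrix with one row per class. The feature-map shape and the random, untrained weights below are hypothetical values chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical output of the last pooling layer: 30 channels of 3x3
feature_maps = rng.normal(size=(30, 3, 3))
flat = feature_maps.reshape(-1)  # flatten to a vector of 270 values

# one row of (untrained, random) weights per class, plus a bias per class
weights = rng.normal(size=(10, flat.size))
bias = np.zeros(10)

scores = weights @ flat + bias           # one score per class
predicted_class = int(np.argmax(scores))  # the class with the highest score
print(flat.shape, scores.shape)
```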

## Demonstration and Data Set ##

The implementation of a basic CNN will be shown using the built-in scikit-learn handwritten digits data set.

### Import Data Set ###

Each image is $8 \times 8$ and grayscale (only 1 color channel). There are 10 possible classes and 1,797 images in the data set.

In [134]:
from sklearn import datasets, model_selection
import numpy as np
import torch

data = datasets.load_digits()
X = data['data']
Y = data['target']

# check format of the data
print(X[0])

# each image is stored as a flat vector of 64 values;
# reshape it to (channels, height, width) = (1, 8, 8)
X = [x.reshape(1, 8, 8) for x in X]
print(X[0])

[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
[[[ 0.  0.  5. 13.  9.  1.  0.  0.]
  [ 0.  0. 13. 15. 10. 15.  5.  0.]
  [ 0.  3. 15.  2.  0. 11.  8.  0.]
  [ 0.  4. 12.  0.  0.  8.  8.  0.]
  [ 0.  5.  8.  0.  0.  9.  8.  0.]
  [ 0.  4. 11.  0.  1. 12.  7.  0.]
  [ 0.  2. 14.  5. 10. 12.  0.  0.]
  [ 0.  0.  6. 13. 10.  0.  0.  0.]]]


In [141]:
# create a training and testing split
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y, train_size=0.7, stratify=Y)

# create a validation set from the training data
X_train,X_val,Y_train,Y_val = model_selection.train_test_split(X_train, Y_train, train_size=0.7)

print('Training size:', len(X_train))
print('Validation size:', len(X_val))
print('Testing size:', len(X_test))

# make data pytorch tensors
# (the X splits are lists of arrays, so stack each into one array first;
#  building a tensor directly from a list of arrays is slow)
X_train = torch.tensor(np.array(X_train), dtype=torch.float32)
Y_train = torch.tensor(Y_train, dtype=torch.long).reshape(-1, 1)
X_val = torch.tensor(np.array(X_val), dtype=torch.float32)
Y_val = torch.tensor(Y_val, dtype=torch.long).reshape(-1, 1)
X_test = torch.tensor(np.array(X_test), dtype=torch.float32)
Y_test = torch.tensor(Y_test, dtype=torch.long).reshape(-1, 1)

Training size: 879
Validation size: 378
Testing size: 540


### CNN Code ###

We will create a small CNN with the following structure:

* Convolutional layer with ReLU
* Batch normalization
* Max pooling
* Convolutional layer with ReLU
* Batch normalization
* Max pooling
* Flatten
* Fully connected layer with ReLU
* Fully connected layer with Softmax

The convolutional layers include a padding of 1, which pads around the "image" with 0s along all dimensions (as if the image were surrounded with 1 layer of 0s on all sides). This can be helpful to ensure the kernels are able to "see" and get information from all the pixels on the edges of the image, which otherwise would not be included in as many sweeps of the kernels (a kernel often starts at an edge and moves on, while interior pixels are swept over multiple times by a single kernel). In this case, we also use it because the images are quite small and we want to make sure the image stays a "usable" size for the different layers.
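
The effect of padding on edge pixels can be checked by counting how many kernel windows cover each original pixel. The sketch below assumes a $3 \times 3$ kernel with a stride of 1 on an $8 \times 8$ image; `coverage_counts` is a hypothetical helper written for this illustration.

```python
import numpy as np

def coverage_counts(size, kernel_size, padding):
    """Count how many kernel windows (stride 1) cover each original pixel."""
    padded = size + 2 * padding
    counts = np.zeros((padded, padded), dtype=int)
    for r in range(padded - kernel_size + 1):
        for c in range(padded - kernel_size + 1):
            counts[r:r + kernel_size, c:c + kernel_size] += 1
    # crop away the padding so only the original pixels remain
    return counts[padding:padded - padding, padding:padded - padding]

no_padding = coverage_counts(8, 3, padding=0)
with_padding = coverage_counts(8, 3, padding=1)

print(no_padding[0, 0], no_padding[0, 4], no_padding[4, 4])        # 1 3 9
print(with_padding[0, 0], with_padding[0, 4], with_padding[4, 4])  # 4 6 9
```

Without padding, a corner pixel is seen by a single window while an interior pixel is seen by nine; with a padding of 1 the corner is seen four times, and the output keeps the input's $8 \times 8$ size instead of shrinking to $6 \times 6$.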

In [138]:
import torch
from torch import nn, flatten, optim

device = ("cuda" if torch.cuda.is_available() else "cpu")

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.model_stack = nn.Sequential(
            # 1x8x8 --> 15x8x8 (padding of 1 keeps the 8x8 size)
            nn.Conv2d(1, 15, kernel_size=(3,3), stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(15),
            # 15x8x8 --> 15x7x7
            nn.MaxPool2d((2,2), stride=1),
            # 15x7x7 --> 30x7x7
            nn.Conv2d(15, 30, (3,3), stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(30),
            # 30x7x7 --> 30x3x3
            nn.MaxPool2d((2,2), stride=(2,2)),
            # 30x3x3 --> 270
            nn.Flatten(),
            nn.Linear(270, 150),
            nn.ReLU(),
            nn.Linear(150, 10),
            # note: nn.CrossEntropyLoss applies log-softmax itself and expects raw scores,
            # so with this Softmax the outputs are normalized twice when the loss is computed
            nn.Softmax(dim=1)
        )
    
    def forward(self, x):
        output = self.model_stack(x)
        
        return output

# train the model
def train(X_data, Y_data, batch_size, model, loss_fn, optimizer):
    num_data = len(X_data)
    epoch_loss = 0
    epoch_correct = 0
    
    model.train()
    
    # move each batch to wherever the model's parameters live
    model_device = next(model.parameters()).device

    # train in batches of size batch_size (slicing handles a short final batch)
    for ba in range(0, num_data, batch_size):
        X_batch = X_data[ba:ba+batch_size].to(model_device)
        Y_batch = Y_data[ba:ba+batch_size].to(model_device).flatten()

        # forward pass
        pred = model(X_batch)
        loss = loss_fn(pred, Y_batch)

        # backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # update training tracking
        # (.item() stores plain numbers rather than graph-attached tensors)
        epoch_loss += loss.item()
        epoch_correct += (pred.argmax(1) == Y_batch).sum().item()

    # epoch_loss is the sum of the per-batch mean losses
    return epoch_loss,epoch_correct

    
# test performance on test set
def test(X_data, Y_data, model, loss_fn):
    loss = 0
    correct = 0
    
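    with torch.no_grad():
        model.eval()
        # evaluate on the same device as the model's parameters
        model_device = next(model.parameters()).device
        X_data = X_data.to(model_device)
        Y_data = Y_data.to(model_device).flatten()
        pred = model(X_data)

        # update tracking
        loss = loss_fn(pred, Y_data).item()
        correct = (pred.argmax(1) == Y_data).sum().item()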

    return loss,correct
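
One way to check the size of the tensor at each stage is to push a dummy batch through the layers one at a time and print the shapes. The sketch below rebuilds the convolutional part of the stack so that it is self-contained; the batch of 4 zero images is an arbitrary choice for illustration.

```python
import torch
from torch import nn

# same layer sequence as the convolutional part of the model above
layers = [
    nn.Conv2d(1, 15, kernel_size=(3, 3), stride=1, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(15),
    nn.MaxPool2d((2, 2), stride=1),
    nn.Conv2d(15, 30, (3, 3), stride=1, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(30),
    nn.MaxPool2d((2, 2), stride=(2, 2)),
    nn.Flatten(),
]

x = torch.zeros(4, 1, 8, 8)  # dummy batch: 4 single-channel 8x8 images
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append((type(layer).__name__, tuple(x.shape)))
        print(shapes[-1])
```

The flattened size of 270 (that is, $30 \times 3 \times 3$) is what the first `nn.Linear(270, 150)` layer expects as input.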
        


### Training and Testing the CNN ###

Using the model, and training and testing code above, a CNN will be trained on the handwritten digits data. The Adam optimizer and Cross-entropy loss will be used. The model will be trained using batch processing with size 64 and trained for 10 epochs.
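
One detail worth keeping in mind when reading the loss values: `nn.CrossEntropyLoss` applies a log-softmax to its input, so it expects raw scores (logits). Because the model above already ends in `nn.Softmax`, the loss sees probabilities between 0 and 1, and even a perfectly confident correct prediction cannot push the loss below $\ln(1 + 9/e) \approx 1.46$ for 10 classes. The sketch below illustrates this with a made-up logit vector; it is an illustration of the loss function's behavior, not a change to the training code.

```python
import math
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([3])

# a confident, correct prediction expressed as raw logits (illustrative values)
logits = torch.zeros(1, 10)
logits[0, 3] = 20.0

loss_on_logits = loss_fn(logits, target).item()
loss_on_probs = loss_fn(torch.softmax(logits, dim=1), target).item()
floor = math.log(1 + 9 / math.e)

print(f"loss on logits:        {loss_on_logits:.4f}")
print(f"loss on probabilities: {loss_on_probs:.4f}")
print(f"theoretical floor:     {floor:.4f}")
```

This would be consistent with a validation loss that levels off just under 1.5 while accuracy keeps improving. Dropping the final `nn.Softmax` and feeding logits to the loss is the more common arrangement, with `softmax` applied separately whenever probabilities are needed.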

In [143]:
if __name__ == '__main__':
    
    # initialize model, optimizer, and loss function
    model = Model().to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    
    # track training
    train_size = len(X_train)
    val_size = len(X_val)
    
    # train the model
    for e in range(10):
        print('Epoch: {0}'.format(e))
        
        # train on the whole training data set
        train_loss,train_correct = train(X_train, Y_train, 64, model, loss_fn, optimizer)
        # test on validation set
        val_loss,val_correct = test(X_val, Y_val, model, loss_fn)
        # print epoch results
        print('----Training Loss: {0:.2f}; Training Accuracy: {1:.2f}'.format(train_loss, train_correct/train_size))
        print('----Validation Loss: {0:.2f}; Validation Accuracy: {1:.2f}'.format(val_loss, val_correct/val_size))
        
    # test model
    test_loss,test_correct = test(X_test, Y_test, model, loss_fn)
    print('*****************************************************')
    print('Testing Results:')
    print('----Loss: {0:.2f}'.format(test_loss))
    print('----Accuracy: {0:.2f}'.format(test_correct/len(Y_test)))

Epoch: 0
----Training Loss: 29.58; Training Accuracy: 0.52
----Validation Loss: 2.05; Validation Accuracy: 0.76
Epoch: 1
----Training Loss: 23.79; Training Accuracy: 0.83
----Validation Loss: 1.66; Validation Accuracy: 0.88
Epoch: 2
----Training Loss: 22.03; Training Accuracy: 0.92
----Validation Loss: 1.55; Validation Accuracy: 0.97
Epoch: 3
----Training Loss: 20.98; Training Accuracy: 0.98
----Validation Loss: 1.50; Validation Accuracy: 0.98
Epoch: 4
----Training Loss: 20.70; Training Accuracy: 0.99
----Validation Loss: 1.49; Validation Accuracy: 0.98
Epoch: 5
----Training Loss: 20.58; Training Accuracy: 1.00
----Validation Loss: 1.49; Validation Accuracy: 0.98
Epoch: 6
----Training Loss: 20.52; Training Accuracy: 1.00
----Validation Loss: 1.49; Validation Accuracy: 0.99
Epoch: 7
----Training Loss: 20.49; Training Accuracy: 1.00
----Validation Loss: 1.49; Validation Accuracy: 0.99
Epoch: 8
----Training Loss: 20.48; Training Accuracy: 1.00
----Validation Loss: 1.48; Validation Accurac