# MNIST with Convolutional Neural Network (CNN)

## Resources

The Stanford CS231n course notes (http://cs231n.github.io/) are a really good resource if you want a better understanding of the topics covered here.

## Import Libraries

In [None]:
# Libraries for building network
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Libraries for dataset
import torchvision
import torchvision.transforms as transforms

# Miscellaneous libraries for analysis
import time

## Define Network

In [None]:
class tutorial_model(nn.Module):
    def __init__(self):
        """ Initialize all layers of model """
        super(tutorial_model, self).__init__()
    
        # Convolution Layer 1
        self.convolution1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)  
        # Convolution Layer 2
        self.convolution2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)    
        
        # Dropout Layer
        self.conv_dropout = nn.Dropout()                                                 
        
        # Fully Connected Layer 1
        self.fc1 = nn.Linear(in_features=16*4*4, out_features=30)  
        # Fully Connected Layer 2
        self.fc2 = nn.Linear(in_features=30, out_features=10)                            
        
        # Max Pooling Layer
        self.pool = nn.MaxPool2d(2, 2)                                                   
        
    def forward(self, x):
        """ Chain all layers together """
        # input_x -> Convolution 1 -> RELU -> Pooling -> x_1
        x = self.pool(F.relu(self.convolution1(x)))    
        # x_1 -> Dropout -> x_2
        x = self.conv_dropout(x)               
        # x_2 -> Convolution 2 -> RELU -> Pooling -> x_3
        x = self.pool(F.relu(self.convolution2(x)))    
        
        # Flatten each sample's 16x4x4 feature maps into a 256-element vector
        x = x.view(-1, 16*4*4)                            
        
        # x_3 -> Fully Connected 1 -> x_4 
        x = self.fc1(x)
        # x_4 -> Fully Connected 2 -> y
        y = self.fc2(x)                                
        return y

### Architecture

The architecture I implemented is a very simple one that is common in introductory examples. Here is a visualization.

![](http://playagricola.com/Kaggle/extract.png)

For more information about this architecture and other potential architectures, check out this link (https://www.kaggle.com/cdeotte/how-to-choose-cnn-architecture-mnist).

### Convolution

Each channel in a convolution represents a different kernel convolving over the input image. The "learning" in a CNN is primarily about changing the weights of these kernels. Here are some visualizations of convolutions.

![](https://www.learnopencv.com/wp-content/uploads/2017/11/convolution-example-matrix.gif)


![](https://colah.github.io/posts/2014-07-Understanding-Convolutions/img/RiverTrain-ImageConvDiagram.png)


This is the equation for determining the dimensions of an image after a convolution. It is useful when doing the math for kernel sizes when developing a network.

![](https://miro.medium.com/max/660/1*D47ER7IArwPv69k3O_1nqQ.png)

For more information/visualizations, check out this link (https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1).
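As a rough sketch of how that equation applies to this network, the snippet below traces the spatial size of an MNIST image through the layers defined earlier. It assumes the standard formula `(size + 2 * padding - kernel_size) // stride + 1`; the helper name `conv_output_size` is made up for illustration and is not part of PyTorch.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling window is applied."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28                                                # MNIST input is 28x28
size = conv_output_size(size, kernel_size=5)             # convolution1: 28 -> 24
size = conv_output_size(size, kernel_size=2, stride=2)   # pool:         24 -> 12
size = conv_output_size(size, kernel_size=5)             # convolution2: 12 -> 8
size = conv_output_size(size, kernel_size=2, stride=2)   # pool:          8 -> 4

# 16 output channels from convolution2, each of size 4x4
flattened = 16 * size * size
print(size, flattened)
```

This is where the `16*4*4` in the first fully connected layer comes from.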

### Max Pooling

Max pooling works like a convolution whose stride equals its kernel size, except that each window outputs its maximum value instead of a weighted sum. Here is a visualization.

![](https://miro.medium.com/max/2344/1*ReZNSf_Yr7Q1nqegGirsMQ@2x.png)
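To make this concrete, here is a minimal pure-Python sketch of 2x2 max pooling with stride 2, the same window settings as `nn.MaxPool2d(2, 2)` in the model. The function `max_pool_2d` and the sample values are made up for illustration; in practice PyTorch does this for you on tensors.

```python
def max_pool_2d(image, kernel_size=2, stride=2):
    """Slide a kernel_size x kernel_size window over a 2D list and keep each window's max."""
    rows = range(0, len(image) - kernel_size + 1, stride)
    cols = range(0, len(image[0]) - kernel_size + 1, stride)
    return [
        [
            max(image[r + i][c + j]
                for i in range(kernel_size)
                for j in range(kernel_size))
            for c in cols
        ]
        for r in rows
    ]


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool_2d(image)
print(pooled)  # [[6, 4], [8, 9]]
```

Each 2x2 block of the 4x4 input collapses to a single value, so the height and width are halved.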

### RELU

RELU is one of the many activation units used to introduce non-linearity to a network. Non-linearity is what allows a network to model complex functions. RELU is one of the most commonly used activation units. Here is a graph of RELU.

![](https://miro.medium.com/max/357/1*oePAhrm74RNnNEolprmTaQ.png)

For more information about activation units, check out this link (https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0).
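In code, RELU is simply `max(0, x)` applied to every value. The sketch below uses a hypothetical helper `relu` on a plain list to show the effect; `F.relu` in the model does the same thing element-wise on tensors.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)


values = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in values]
print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```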

### Dropout

Dropout is used to combat overfitting. During training, randomly chosen nodes are ignored (their outputs are set to zero) on each forward pass, which discourages the network from relying too heavily on any single node. For more information, check out this link (https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/dropout_layer.html).
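The following is a minimal sketch of the idea, not PyTorch's actual implementation. It assumes the "inverted dropout" convention that `nn.Dropout` follows: each value is zeroed with probability `p` (0.5 by default), the survivors are scaled by `1 / (1 - p)`, and nothing is changed at evaluation time. The function name `dropout` and its `rng` parameter are made up for illustration.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p while training; pass through otherwise."""
    if not training:
        return list(values)
    scale = 1.0 / (1.0 - p)  # keeps the expected value of each output unchanged
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, training=False)
print(train_out)
print(eval_out)
```

This is why the notebook switches between `model.train()` and `model.eval()`: the dropout layer only drops values in training mode.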

### Fully Connected Layer

It's difficult to explain a fully connected layer in a couple of lines. A lot of the concepts related to FC layers are important throughout ML, so I would encourage you to do your own research to understand the topic. Some possible sources (https://towardsdatascience.com/understanding-neural-networks-19020b758230, https://www.youtube.com/watch?v=aircAruvnKk).

To just understand the code, we use fully connected layers to convert the features calculated through the convolutions into a classification. That is, we take the 16x4x4 tensor produced by the last convolution and pooling step, flatten it into a vector of length 256, and then feed it into two fully connected layers, which output 10 values. We can then feed these 10 values to a softmax to calculate the probability that the input image is a specific digit.
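As a small illustration of that last step, here is a pure-Python sketch of softmax turning 10 raw output values into probabilities. The `softmax` helper and the example `logits` are made up for illustration; they are not outputs of the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest value avoids overflow in exp without changing the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of the final fully connected layer, one per digit 0-9
logits = [0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 1.1, 0.4, -2.0, 0.2]
probs = softmax(logits)
predicted_digit = probs.index(max(probs))
print(predicted_digit)  # 3
```

Softmax preserves the ordering of the values, which is why the training and testing loops can take the maximum of the raw outputs directly to get the predicted digit.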

## Set Hyperparameters

In [None]:
# Number of epochs for training
num_epochs = 3           

# Batch Size for training/testing
batch_size = 64          

# Learning Rate for optimizer
learning_rate = 0.01     

# Momentum for optimizer (not used by the Adam optimizer below; only relevant for e.g. SGD)
momentum = 0.5           

# Dimensions of MNIST
dim = 28                 

## Setup Data Loader

In [None]:
# Transformation for training data
transform_train = torchvision.transforms.Compose([
    transforms.ToTensor(),                            # Convert grayscale image to pytorch tensor
    transforms.Normalize((0.5,), (0.5,)),             # Normalize grayscale data
])

# Transformation for testing data
transform_test = transforms.Compose([
    transforms.ToTensor(),                            # Convert grayscale image to pytorch tensor
    transforms.Normalize((0.5,), (0.5,)),             # Normalize grayscale data
])

In [None]:
# Download training data
trainset = torchvision.datasets.MNIST(root='./files', train=True, download=True, 
                                      transform=transform_train)

# Initialize dataloader for training data
train_loader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, 
                                           num_workers=8)

# Download testing Data
testset = torchvision.datasets.MNIST(root='./files', train=False, download=False, 
                                     transform=transform_test)

# Initialize dataloader for testing data
test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, 
                                          num_workers=8)

### MNIST Dataset
![](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/05/Examples-from-the-MNIST-dataset.png)

## Initialize Model/Optimizer

In [None]:
# Initialize previously defined model
model = tutorial_model()                                               

# Define loss function (Cross Entropy Loss)
criterion = nn.CrossEntropyLoss()                                      

# Initialize Optimizer (ADAM)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)     

# Set model to training mode (enables dropout)
model.train();                                                        

### Cross Entropy Loss

This is the mathematical formulation of cross entropy loss.

![](https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/35_blog_image_38.png)

This gives us a way of measuring how far the model's predictions are from the true labels: the lower the loss, the better the predictions. For more information about loss functions in general, check out this link (https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html). For more information about cross entropy loss specifically, check out this link (https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a), which walks through the binary case.
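For a single sample, the loss can be sketched in a few lines of pure Python. This assumes the multi-class form that `nn.CrossEntropyLoss` computes from raw outputs (the negative log of the softmax probability assigned to the correct class); the helper `cross_entropy` and the example values are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


# A model that has learned nothing: equal scores for all 10 digits
uniform = cross_entropy([0.0] * 10, target=3)

# A model that is very confident the digit is 3
confident_logits = [0.0] * 10
confident_logits[3] = 10.0
confident_right = cross_entropy(confident_logits, target=3)  # true label is 3
confident_wrong = cross_entropy(confident_logits, target=5)  # true label is 5

print(uniform, confident_right, confident_wrong)
```

The uniform case equals `log(10)`, roughly 2.3, so a loss near that value early in training is what random guessing over 10 digits would produce.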

### ADAM

This is the mathematical formulation of the ADAM Optimizer. It is one of the more commonly used optimizers.

![](https://i.stack.imgur.com/08VZN.png)

Ultimately, most of machine learning comes down to an optimization problem: you are trying to minimize a loss function. Most optimization methods (including ADAM) revolve around calculating the gradient of the loss function and then taking a step to update the weights. For more information about ADAM, check out this link (https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/).
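Here is a minimal sketch of one ADAM update on a single scalar weight, following the standard published update rule rather than PyTorch's internals. The function `adam_step` and the toy loss `x ** 2` are made up for illustration; the `beta1`, `beta2` and `eps` values are the commonly used defaults, and `lr` matches `learning_rate` in this notebook.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update and return the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize the toy loss f(x) = x ** 2, whose gradient is 2 * x
x, m, v = 1.0, 0.0, 0.0
history = [x]
for t in range(1, 501):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t)
    history.append(x)

print(history[0], history[1], history[-1])
```

On the very first step the update has size close to `lr` regardless of how large the gradient is, because the gradient is divided by the square root of its own squared average.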

## Train Model

In [None]:
# Store time to calculate train time
start_time = time.time()

# Store loss and accuracy data
loss = []
accuracy = []

# Train the model
# Loop for number of epochs
for epoch in range(num_epochs):
    # Loop through data in batch sized increments
    for batch_idx, (X_train_batch, Y_train_batch) in enumerate(train_loader):
        # Skip the final batch of an epoch if it has fewer than batch_size samples
        if Y_train_batch.shape[0] < batch_size:
            continue

        # Forward pass through network
        output = model(X_train_batch)                           
        # Calculate loss of predictions
        curr_loss = criterion(output, Y_train_batch)            
        # Store loss
        loss.append(curr_loss.item())                           

        
        # Clear last calculation
        optimizer.zero_grad()                                   
        # Calculate gradient based on loss
        curr_loss.backward()                                    
        # Update model weights
        optimizer.step()                                        

        # Extract model predictions
        _, predicted = torch.max(output.data, 1) 
        # Calculate number of correct predictions
        correct = (predicted == Y_train_batch).sum().item()     
        # Calculate/store accuracy
        accuracy.append(correct/Y_train_batch.size(0))          
        
        # Intermittently print statistics
        if batch_idx % 100 == 0:
            print('Epoch: ' + str(epoch+1) + '/' + str(num_epochs) + ', Step: ' 
                  + str(batch_idx+1) + '/' + str(len(train_loader)) + ', Loss: ' 
                  + str(curr_loss.item()) + ', Accuracy: ' 
                  + str(correct/Y_train_batch.size(0)*100) + '%')

# Store time to calculate train time
end_time = time.time()

# Print train time
print('Run Time: ' + str(end_time - start_time))

## Test Model

In [None]:
# Test the model
# Set model to evaluation mode (disables dropout)
model.eval()

with torch.no_grad():
    # Store number of correct/total samples in test data
    correct = 0
    total = 0
    
    # Loop through test data
    for X_test_batch, Y_test_batch in test_loader:
        # Forward pass through network
        output = model(X_test_batch)  
        
        # Extract prediction
        _, predicted = torch.max(output.data, 1)    
        
        # Update total number of samples
        total += Y_test_batch.size(0)  
        
        # Update number of correct predictions
        correct += (predicted == Y_test_batch).sum().item()     

print('Test Accuracy: ' + str((correct/total) * 100) + '%')