# Convolutional Neural Network for MNIST Digit Recognition in PyTorch

## Some of the required libraries and data

(Much of the code here is adapted from a tutorial whose source I have since lost track of. I am not trying to take credit for it: I used the tutorial to learn and then modified the code to get the best performance I could from the model. If you read this and recognize your code, reach out and I will gladly credit you. That said, this is fairly generic MNIST tutorial code, and most such tutorials look alike.)

Additionally, I am still a student myself and I'm here to learn. If you are a knowledgeable person who finds an error in my explanation, please don't hesitate to reach out to me, or fork this notebook on GitHub, edit it, and submit a pull request explaining what you found wrong and why your changes correct it.

The first cell is for Google Colab users only. If you are not on Colab, you will still need to import the torch module, but you will not need any of the other lines in that cell. (The cell pins PyTorch 0.3.0, the version this notebook was written against.)

The second cell imports the remaining libraries. torchvision is PyTorch's computer vision library. torchvision.transforms provides transformations for images and tensors (tensors are N-dimensional arrays), and torchvision.datasets is where we get the MNIST data, which torchvision downloads for us.


In [0]:
# http://pytorch.org/
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())

accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.3.0.post4-{platform}-linux_x86_64.whl torchvision
import torch

In [0]:
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as datasets

In this third cell, we set the batch size to 50 and load our data with PyTorch's DataLoader class. The arguments should be mostly self-explanatory, with the possible exception of 'transform=transforms.ToTensor()': it converts each input image to a tensor, with pixel values scaled to the range [0, 1], so that we can feed it to the network. The test loader uses a larger batch size of 1000 and does not shuffle. 'classes' is a tuple of strings, one for each class in our dataset; MNIST contains handwritten digits 0 through 9.

In [3]:
batch_size = 50
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transforms.ToTensor()),
    batch_size=1000)

classes = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
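
To see what a loader hands back without downloading anything, here is a minimal sketch that uses random tensors as a stand-in for MNIST. The names `fake_images` and `fake_labels` are made up for illustration, and the snippet assumes a recent PyTorch release (it uses `torch.randint`, which the 0.3.0 version pinned above does not have).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (an assumption, not real MNIST): 200 grayscale 28x28 "images"
# with values in [0, 1], which is the range ToTensor() produces.
fake_images = torch.rand(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))

loader = DataLoader(TensorDataset(fake_images, fake_labels),
                    batch_size=50, shuffle=True)

# Each iteration yields one (data, target) pair holding a whole batch.
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(len(loader))       # number of batches per epoch
print(batch_shapes[0])   # shapes of one (data, target) pair
```

With 200 samples and a batch size of 50 the loader produces 4 batches, each shaped (50, 1, 28, 28) for the data and (50,) for the targets. The real MNIST loaders behave the same way, just with 60000 training images.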


## Defining the network

Here we define the network as a class. Subclassing 'nn.Module' tells PyTorch that this class is a network, so that it can keep track of the layers and their parameters. The init function is where each layer is initialized. To explain some of the calls and their arguments: 'nn.Conv2d' takes (input channels, output channels, kernel size, stride (default 1), padding (default 0; we pass 2)). It accepts more arguments, which are described in the PyTorch documentation here: (http://pytorch.org/docs/master/index.html). Following our two convolutional layers are three fully connected layers. Fully connected layers are declared with a call to 'nn.Linear', and the arguments given to them here are (input features, output features).

Now, on to the 'forward' function. This function takes an argument 'x' and passes it through our network, so 'x' is the data we are training or testing the network on. Each call to 'F.max_pool2d' takes the output of a convolutional layer and pools it, halving its height and width. The 'F.relu' call inside the max pooling call applies the ReLU activation function to the output of the convolutional layer, and the argument '2' tells the max pooling layer to use a 2 by 2 kernel. 'x.view' reshapes the output of the last pooling step into one vector per image for the fully connected layers. The '-1' tells view to infer that dimension, which works out to the batch size, while $64 \times 7 \times 7$ is the number of values per image after 'conv2' and the pooling that follows it. Each subsequent call to 'F.relu' applies the ReLU activation function to the output of the layer called within it. The call to 'F.dropout' applies dropout to the 'x' passed into it, and the 'training' argument tells the function whether the net is in training or evaluation mode, since dropout should only be active during training. Finally, 'F.log_softmax' applies the log-softmax function to the output of the final fully connected layer, turning the 10 raw scores into log-probabilities.
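
To make the last step concrete, here is a small sketch of log-softmax written by hand in plain Python. It is only meant to show the arithmetic; the `log_softmax` helper and the example scores are made up for illustration and are not PyTorch's implementation.

```python
import math

def log_softmax(scores):
    """Log-softmax of a list of raw scores, computed in a numerically stable way."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

scores = [2.0, 1.0, 0.1]   # made-up raw outputs for three classes
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)   # every entry is <= 0
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

The largest raw score still gets the largest log-probability, so taking the maximum over the outputs picks the same class before and after this step.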

'net = Net()' creates an instance of our class 'Net', while 'net.cuda()' moves its parameters to the GPU so that the network runs there. Comment out the line 'net.cuda()' if you do not have access to a GPU.
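
If you would rather not comment lines in and out, newer PyTorch releases (0.4 and later, so not the version pinned above) offer a device object that works on both CPU and GPU. This is a minimal sketch with a tiny stand-in model rather than the 'Net' class from this notebook.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny stand-in model (an assumption, not the Net class from this notebook).
model = nn.Linear(4, 2).to(device)
batch = torch.rand(3, 4).to(device)   # inputs must live on the same device

out = model(batch)
print(tuple(out.shape), out.device.type)
```

The same '.to(device)' call would replace each '.cuda()' call in the cells below.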

**Note that before giving the output of a convolutional layer to a linear layer, that output MUST be reshaped into a vector (hence the $64 \times 7 \times 7$ weirdness next to the definition of 'fc1'). For me, figuring out why $64 \times 7 \times 7$ was the correct size took multiple hours and the help of two graduate students, so I'll just tell you here why it is correct. In the definition of layer 'conv2' we declare 64 output channels, so that is where the 64 comes from. Now, why $7 \times 7$? Because at this point in our network, each image has been reduced to 7 by 7. Let's follow the size of the image from start to finish. Initially we have images of size $28 \times 28$. The convolutions themselves do not change this size, because a 5 by 5 kernel with padding of 2 and stride of 1 produces an output as large as its input. After 'conv1', the first max pooling layer halves the size to $14 \times 14$. After 'conv2', the second max pooling layer halves it again to $7 \times 7$, and these are the numbers we use when reshaping the output to $(-1, 64 \times 7 \times 7)$. In summary: $28 \times 28$ --> $14 \times 14$ --> $7 \times 7$.**
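
The same bookkeeping can be done in a few lines of plain Python. This sketch uses the standard output-size formula for convolution and pooling layers, floor((size + 2 * padding - kernel) / stride) + 1; the helper name `conv_out` is made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                              # MNIST images are 28x28
size = conv_out(size, kernel=5, stride=1, padding=2)   # conv1 -> 28
size = conv_out(size, kernel=2, stride=2)              # max pool -> 14
size = conv_out(size, kernel=5, stride=1, padding=2)   # conv2 -> 14
size = conv_out(size, kernel=2, stride=2)              # max pool -> 7

flat_features = 64 * size * size   # 64 output channels from conv2
print(size, flat_features)         # 7 3136
```

The 3136 is the same number that shows up as 'in_features' for 'fc1' when the network is printed.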

In [4]:
import torch.nn as nn
import torch.nn.functional as F

# With padding=2 on both conv layers, fc1 takes 64*7*7 input features.

class Net(nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.conv1 = nn.Conv2d(1, 150, 5, stride=1, padding=2)
    self.conv2 = nn.Conv2d(150, 64, 5, stride=1, padding=2)
    self.fc1 = nn.Linear(64*7*7, 1024)
    self.fc2 = nn.Linear(1024, 500)
    self.fc3 = nn.Linear(500, 10)
    
  def forward(self, x):
    x = F.max_pool2d(F.relu(self.conv1(x)), 2)
    x = F.max_pool2d(F.relu(self.conv2(x)), 2)
    x = x.view(-1, 64*7*7)
    x = F.relu(self.fc1(x))
    x = F.dropout(x, training=self.training)
    x = F.relu(self.fc2(x))
    x = F.dropout(x, training=self.training)
    x = self.fc3(x)
    return F.log_softmax(x, dim=1)  # dim=1: normalize across the 10 class scores

net = Net()
net.cuda()

Net(
  (conv1): Conv2d (1, 150, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv2): Conv2d (150, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (fc1): Linear(in_features=3136, out_features=1024)
  (fc2): Linear(in_features=1024, out_features=500)
  (fc3): Linear(in_features=500, out_features=10)
)

In [5]:
print(net)

Net(
  (conv1): Conv2d (1, 150, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv2): Conv2d (150, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (fc1): Linear(in_features=3136, out_features=1024)
  (fc2): Linear(in_features=1024, out_features=500)
  (fc3): Linear(in_features=500, out_features=10)
)


## Loss, Optimization, and Training

In the cell directly below, we import the torch optimization library as optim and then define the loss function, which here is cross-entropy loss. (PyTorch's 'nn.CrossEntropyLoss' applies log-softmax internally, so combining it with a network that already ends in log-softmax is redundant; 'nn.NLLLoss' is the usual partner for log-softmax outputs. The redundancy is harmless, because applying log-softmax to log-probabilities leaves them unchanged.) Then the optimizer is initialized. I used stochastic gradient descent (SGD), which takes the network's parameters (given by 'net.parameters()'), the learning rate 'lr', and the 'momentum'. I chose 1e-2 for the learning rate because I have seen it work well before and it seems to be a common starting value. I chose 0.95 for the momentum because I saw a tutorial using 0.9, played around with different values, and seemed to get the best results with 0.95.
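
For intuition about what the momentum value does, here is a sketch of a single SGD-with-momentum update in plain Python. It follows the update rule as I understand it from the PyTorch documentation (velocity = momentum * velocity + gradient, then a step of size lr along the velocity), without the optional dampening or Nesterov variants. The function name and the toy problem are made up for illustration.

```python
def sgd_momentum_step(w, grad, velocity, lr=1e-2, momentum=0.95):
    """One parameter update: accumulate a velocity, then step along it."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy problem (an assumption for illustration): minimize f(w) = w**2,
# whose gradient is 2*w, starting from w = 5.
w, velocity = 5.0, 0.0
for _ in range(1000):
    w, velocity = sgd_momentum_step(w, grad=2 * w, velocity=velocity)

print(abs(w))  # very close to 0 after enough steps
```

Because the velocity keeps a running memory of past gradients, a higher momentum lets consistent gradients build up speed, at the cost of some overshoot and oscillation around the minimum.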

In [0]:
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
#optimizer = optim.Adam(net.parameters(), lr=1e-2)
optimizer = optim.SGD(net.parameters(), lr=1e-2, momentum=0.95)
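
A quick way to check how 'nn.CrossEntropyLoss' relates to log-softmax is to compare it with 'nn.NLLLoss' on made-up data. This sketch assumes a recent PyTorch release; the scores and labels are random stand-ins, not outputs of the network above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # made-up raw scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up labels

# Cross-entropy on raw scores ...
ce_on_logits = nn.CrossEntropyLoss()(logits, targets)
# ... equals negative log-likelihood on log-softmax outputs ...
nll_on_log_probs = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
# ... and also equals cross-entropy on log-softmax outputs, which is the
# combination this notebook uses.
ce_on_log_probs = nn.CrossEntropyLoss()(F.log_softmax(logits, dim=1), targets)

print(ce_on_logits.item(), nll_on_log_probs.item(), ce_on_log_probs.item())
```

All three numbers agree up to floating point rounding, which is why ending the network in log-softmax and then using cross-entropy loss still trains correctly.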

Here we define our training loop. First, we import 'Variable', which is part of PyTorch's autograd package. I have only begun to scratch the surface of autograd, so the best place to find information on it is here: (http://pytorch.org/docs/master/notes/autograd.html). Wrapping our data in a Variable lets PyTorch record the operations performed on it and compute gradients during backpropagation. (From PyTorch 0.4 onward, tensors do this themselves and the Variable wrapper is no longer needed.) The '.cuda()' call moves the data to the GPU. If you don't have a GPU to run this on, you'll need to delete the calls to '.cuda()' below. 'train_loss' and 'train_acc' are lists that collect our training loss and training accuracy at each iteration, while 'i' is an iteration counter that starts at 0. Now, on to the for loop itself. The outer loop runs once per epoch; to change the number of epochs, change the number inside the call to 'range(3)'. The line 'for data, target in train_loader:' steps through our training data in batches of 50. 'data' holds the input tensors and 'target' holds the labels associated with those inputs.

On the first line of the loop body, we wrap our input data and targets in Variables and move them to the GPU (again, delete the calls to '.cuda()' if you do not have a CUDA-enabled GPU to run on).

Our second line zeros the gradients for our optimizer.

The third line passes our data through the network and generates output. 

The fourth line calculates the loss function for the output we generated and the target classification. 

The fifth line performs backpropagation of our loss.

The sixth line simply appends our current loss to the train_loss list.

The seventh line has the optimizer update the weights using the gradients computed by backpropagation.

The eighth, ninth, and tenth lines are all focused around computing our current accuracy and appending it to the train_acc list.

The 'if' statement is used to print out training updates every 1000 iterations.
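
For readers on a newer PyTorch release, the steps above can be sketched without 'Variable' and with 'loss.item()' in place of 'loss.data[0]'. This is a minimal CPU-only sketch, not the notebook's actual training run: the tiny linear model and the single random batch are stand-ins so that it finishes in well under a second.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-ins (assumptions): a tiny linear classifier and one random batch.
model = nn.Linear(28 * 28, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.95)
data = torch.rand(50, 28 * 28)
target = torch.randint(0, 10, (50,))

losses = []
for step in range(100):
    optimizer.zero_grad()               # clear gradients from the last step
    outputs = model(data)               # forward pass
    loss = criterion(outputs, target)   # compare predictions with labels
    loss.backward()                     # backpropagate
    optimizer.step()                    # update the weights
    losses.append(loss.item())          # .item() replaces loss.data[0]

    # Batch accuracy: the predicted class is the index of the largest score.
    prediction = outputs.argmax(dim=1)
    accuracy = (prediction == target).float().mean().item() * 100

print(losses[0], losses[-1], accuracy)
```

Since the model sees the same batch over and over, the loss falls simply because it memorizes those 50 samples. The point here is the order of the calls, not the numbers.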

In [7]:
from torch.autograd import Variable
train_loss = []
train_acc = []
i = 0
for epoch in range(3):
  for data, target in train_loader:
    data, target = Variable(data.cuda()), Variable(target.cuda())
    optimizer.zero_grad()
    outputs = net(data)
    loss = criterion(outputs, target)
    loss.backward()
    train_loss.append(loss.data[0])
    optimizer.step()
    prediction = outputs.data.max(1)[1]
    accuracy = prediction.eq(target.data).sum()/batch_size*100
    train_acc.append(accuracy)
    if i % 1000 == 0:
      print('Train Step: {}\tLoss: {:.10f}\tAccuracy: {:.10f}'.format(i, loss.data[0], accuracy))
    i += 1
    



Train Step: 0	Loss: 2.3147785664	Accuracy: 6.0000000000
Train Step: 1000	Loss: 0.0469076298	Accuracy: 98.0000000000
Train Step: 2000	Loss: 0.0813489035	Accuracy: 98.0000000000
Train Step: 3000	Loss: 0.0053700125	Accuracy: 100.0000000000


## A tiny bit of network evaluation!

Here we import NumPy so that we can use one of its functions to generate n random integers. These integers are used to index into our test labels and our predictions so that we can compare targets with actual outputs while seeing different digits almost every time. To show more or fewer digits at a time, change the third argument to 'np.random.randint'; currently it shows 10. The upper bound of 1000 matches the batch size of the test loader, since we only look at the first test batch. Then we create an iterator over the test loader and pull the first batch from it, before finally using the print function and a generator expression to look up the labels at our random indices and show them to us. Almost all code from here down, minus the random indices, has been borrowed from various PyTorch tutorials. Thanks for helping us learn PyTorch!

In [8]:
import numpy as np
rand_check = np.random.randint(0, 1000, 10)
dataiter = iter(test_loader)
images, labels = next(dataiter)  # built-in next() works on any Python iterator
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in rand_check))

GroundTruth:      1     3     5     5     7     7     9     3     6     0
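
Here is the same indexing trick on its own, with a made-up label array standing in for the test batch so that it runs without the dataset. Only NumPy is needed.

```python
import numpy as np

# Ten random indices in the range [0, 1000); the upper bound is exclusive.
rand_check = np.random.randint(0, 1000, 10)

# Made-up stand-in for a batch of 1000 labels: the digits 0-9 repeated.
labels = np.arange(1000) % 10
classes = tuple(str(d) for d in range(10))

# The generator expression picks out one class name per random index,
# and '%5s' right-aligns each name in a 5-character column.
line = ' '.join('%5s' % classes[labels[j]] for j in rand_check)
print('GroundTruth: ', line)
```

Because the indices are drawn fresh on every run, rerunning the cell shows a different selection of digits each time.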


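This cell generates outputs from our test data. Calling 'net.eval()' first puts the network in evaluation mode so that dropout is switched off.

In [9]:
net.eval()  # evaluation mode: disables dropout
outputs = net(Variable(images).cuda())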



Here, we step through our outputs produced in the line above and print out what our network believes the digit is. You can compare the printed out predictions to the printed out ground truth above and see if you find any mistakes in your model.

In [10]:
_, predicted = torch.max(outputs.data, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in rand_check))

Predicted:      1     3     5     5     7     7     9     3     6     0
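
If the two return values of 'torch.max' look mysterious, this small sketch shows what they are. The scores are made up (3 samples and 4 classes instead of the notebook's 10), and it assumes a PyTorch release that has 'torch.tensor'.

```python
import torch

# Made-up scores for 3 samples and 4 classes (the notebook has 10 classes).
scores = torch.tensor([[0.1, 2.0, 0.3, 0.0],
                       [1.5, 0.2, 0.1, 0.4],
                       [0.0, 0.1, 0.2, 3.0]])

# Along dimension 1, torch.max returns (largest value, index of largest value)
# for every row; the index is the predicted class.
values, predicted = torch.max(scores, 1)
print(values.tolist())     # [2.0, 1.5, 3.0]
print(predicted.tolist())  # [1, 0, 3]
```

We throw away the values with '_' because only the index, which is the predicted digit, matters for comparing against the ground truth.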


Finally, in this cell we simply compute our total test accuracy. We use a for loop to step through our test data, generate outputs, compare them to the ground truth, and then finally compute the fraction of correct classifications and multiply it by 100 to make it a percentage.

In [11]:
net.eval()  # evaluation mode: disables dropout
correct = 0
total = 0
for images, labels in test_loader:
    outputs = net(Variable(images).cuda())
    outputs = outputs.cpu()  # bring predictions back to the CPU to compare with labels
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100.0 * correct / total))



Accuracy of the network on the 10000 test images: 99 %
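
On a newer PyTorch release, the usual pattern for this kind of scoring is to switch the model to evaluation mode and turn off gradient tracking. The following is a minimal sketch under those assumptions, with a small stand-in model and random data instead of the notebook's 'Net' and the MNIST test set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model with dropout (an assumption, not the notebook's Net).
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(32, 10))
images = torch.rand(100, 20)
labels = torch.randint(0, 10, (100,))

model.eval()               # dropout becomes a no-op in evaluation mode
with torch.no_grad():      # no gradients are needed for scoring
    first = model(images)
    second = model(images)

predicted = first.argmax(dim=1)
accuracy = 100.0 * (predicted == labels).sum().item() / labels.size(0)
print(torch.equal(first, second), accuracy)
```

In evaluation mode the two forward passes give identical outputs. If the model were left in training mode, dropout would zero a different random set of activations on each pass and the predictions could change from run to run.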
