# MNIST handwritten digits classification with MLPs

In this notebook, we'll train a multi-layer perceptron model to classify MNIST digits using PyTorch.

First, the needed imports.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
# Variable is a thin no-op wrapper since PyTorch 0.4; kept for older code
from torch.autograd import Variable
import progressbar

## Defining the hyperparameters of the network
Hyperparameters are configuration values that are not learned from the data. They control how the model parameters are estimated, are usually specified by the practitioner (often starting from heuristics), and are tuned for the predictive modeling problem at hand.

Hyperparameters are usually fixed before the actual training process begins.

In [None]:
# Size of the input layer: each 28x28 image is flattened to 784 values
input_size = 784

# Number of units in the hidden layer
hidden_size = 500

# Number of different classes in the MNIST classification task
num_classes = 10

# Number of epochs (full passes over the training set)
num_epochs = 5

# Number of examples fed to the network at once
batch_size = 100

# Step size used by the optimizer when updating the weights
learning_rate = 0.001


## Data
Next we'll load the MNIST data. The first time, the data may have to be downloaded, which can take a while.

#### Data loader
A data loader combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.
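
To see how a data loader splits a dataset into batches, here is a minimal sketch on synthetic data. The names `fake_images`, `fake_labels`, `toy_dataset` and `toy_loader` are made up for this illustration, and the tensors are all-zero placeholders rather than real MNIST digits.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST (assumption): 10 blank "images" with labels 0..9
fake_images = torch.zeros(10, 1, 28, 28)
fake_labels = torch.arange(10)
toy_dataset = TensorDataset(fake_images, fake_labels)

# batch_size sets how many examples each iteration yields;
# shuffle=False keeps the original order so the result is predictable
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_shapes = [tuple(images.shape) for images, labels in toy_loader]
print(len(toy_loader))  # number of batches per pass over the dataset
print(batch_shapes)     # the last batch is smaller when the sizes do not divide evenly
```

With 10 examples and a batch size of 4, the loader yields three batches, the last one holding the remaining 2 examples.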

In [None]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=batch_size, shuffle=True)



The train and test data are provided via data loaders that provide iterators over the datasets.

Each batch drawn from the training loader is a pair `(images, labels)`. `images` is a 4th-order tensor of size (batch_size, 1, 28, 28), i.e. a batch of single-channel images of 28x28 pixels.

`labels` is a vector containing the correct class (0, 1, ..., 9) for each training digit.
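
A fully connected layer expects one flat vector per example, so a batch of shape (batch_size, 1, 28, 28) has to be reshaped to (batch_size, 784) first. The sketch below shows that reshape on a random tensor; `batch` and `flat` are illustrative names and the data is not real MNIST.

```python
import torch

# Random stand-in for one batch of 100 single-channel 28x28 images (assumption)
batch = torch.rand(100, 1, 28, 28)

# -1 lets PyTorch infer the batch dimension; each image becomes a 784-vector
flat = batch.view(-1, 28 * 28)
print(tuple(batch.shape), '->', tuple(flat.shape))
```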

In [None]:
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transforms.ToTensor()),
    batch_size=1000)

In [None]:
print('==>>> total training batch number: {}'.format(len(train_loader)))
print('==>>> total testing batch number: {}'.format(len(test_loader)))

## Visualization

Here is one batch of training digits:

In [None]:
images,_ = next(iter(train_loader))
i = torchvision.utils.make_grid(images).numpy()
i = np.transpose(i,(1,2,0))
plt.imshow(i)

# MLP network definition

Let's define the network as a Python class. 

We have to write the `__init__()` and `forward()` methods; PyTorch's autograd then derives the backward pass for computing the gradients automatically.

### nn.Linear()
Applies a linear transformation to the incoming data: y = xA^T + b, where A is the weight matrix of shape (out_features x in_features)

Parameters:	
    in_features – size of each input sample
    out_features – size of each output sample
    bias – If set to False, the layer will not learn an additive bias. Default: True

Variables:	
    weight – the learnable weights of the module of shape (out_features x in_features)
    bias – the learnable bias of the module of shape (out_features)
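
As a quick check of the shapes listed above, the following sketch builds a layer with the same sizes as the first layer of this notebook and reproduces its output by hand. The variable names `layer`, `x`, `y` and `manual` are chosen for this example only, and the input is random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=784, out_features=500)

x = torch.rand(3, 784)  # a batch of 3 flattened inputs (random placeholder data)
y = layer(x)            # shape (3, 500)

# The same computation written out: multiply by the transposed weight, add the bias
manual = x @ layer.weight.t() + layer.bias

print(tuple(layer.weight.shape), tuple(layer.bias.shape), tuple(y.shape))
print(torch.allclose(y, manual, atol=1e-4))
```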

### nn.ReLU()
Applies the rectified linear unit function element-wise: ReLU(x) = max(0, x)
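
A small illustration of the element-wise behaviour, using a hand-picked input tensor (the values are arbitrary):

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
out = relu(x)
print(out)  # negative entries are clamped to 0, non-negative entries pass through
```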

In [None]:
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

In [None]:
net = Net(input_size, hidden_size, num_classes)
net

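Finally, we define a loss function and an optimizer to update the model parameters based on the computed gradients.

We select Adam, which adapts the step size of each parameter using running estimates of the first and second moments of its gradient, and set the learning rate to 0.001 (the `learning_rate` defined above).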

Note that there are several different options for the optimizer in PyTorch that we could use instead of ADAM.
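
Swapping the optimizer only changes the line that constructs it; the rest of the training step stays the same. The sketch below runs a single update on a hypothetical two-class toy model (`toy_model`, `x` and `target` are invented for this example) and mentions SGD with momentum as one possible alternative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy_model = nn.Linear(4, 2)  # hypothetical tiny model, not the notebook's network
before = toy_model.weight.detach().clone()

# Alternative (assumption about what one might try instead):
#   optimizer = optim.SGD(toy_model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

x = torch.rand(8, 4)                # random inputs
target = torch.randint(0, 2, (8,))  # random class labels
loss = nn.CrossEntropyLoss()(toy_model(x), target)

optimizer.zero_grad()  # clear old gradients
loss.backward()        # compute new gradients
optimizer.step()       # apply one parameter update

changed = not torch.equal(before, toy_model.weight.detach())
print(changed)
```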

In [None]:
train_loss_mlp = []
train_accu_mlp = []

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

bar = progressbar.ProgressBar()
for epoch in bar(range(num_epochs)):
    for i, (images, labels) in enumerate(train_loader):
        # Flatten each 1x28x28 image into a 784-element vector
        images = images.view(-1, 28*28)

        # Forward + Backward + Optimize
        optimizer.zero_grad()  # zero the gradient buffer
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()

        train_loss_mlp.append(loss.item())
        optimizer.step()

        # max(1) returns (values, indices); the indices are the predicted classes
        prediction = outputs.data.max(1)[1]
        accuracy = prediction.eq(labels).sum().item() / labels.size(0) * 100
        train_accu_mlp.append(accuracy)

        if (i+1) % 500 == 0:
            print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f'
                  % (epoch+1, num_epochs, i+1, len(train_loader), loss.item()))

Loss is a function of the difference of the network output and the target values. We are minimizing the loss function during training so it should decrease over time.
Accuracy is the classification accuracy for the test data.
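
To make the two quantities concrete, here is a small sketch that computes the cross-entropy loss and the accuracy for four hand-written score vectors over three classes. The numbers are invented for illustration and have nothing to do with MNIST.

```python
import torch
import torch.nn as nn

# Raw network outputs (logits) for 4 examples and 3 classes, chosen by hand
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0],
                       [1.0, 2.0, 0.0],
                       [0.1, 0.2, 0.3]])
labels = torch.tensor([0, 2, 0, 2])

loss = nn.CrossEntropyLoss()(logits, labels)  # lower means better agreement with labels
predicted = logits.argmax(dim=1)              # class with the highest score
accuracy = (predicted == labels).float().mean().item() * 100

print(predicted.tolist(), accuracy, loss.item())
```

Three of the four predictions match their labels, so the accuracy is 75%.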

In [None]:
correct = 0
total = 0
with torch.no_grad():  # gradients are not needed for evaluation
    for images, labels in test_loader:
        images = images.view(-1, 28*28)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Let's now visualize how the training progressed.

In [None]:
plt.plot(np.arange(len(train_loss_mlp)), train_loss_mlp)

In [None]:
plt.plot(np.arange(len(train_accu_mlp)), train_accu_mlp)

### Let's load the dataset and check the predictions of the network visually

In [None]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=1, shuffle=True)


#### Take one example from the dataset at a time and predict on it to see the output

In [None]:
data, target = next(iter(train_loader))

i = torchvision.utils.make_grid(data).numpy()
i = np.transpose(i, (1, 2, 0))
plt.imshow(i)

with torch.no_grad():  # inference only, no gradients needed
    output = net(data.view(-1, 28*28))
prediction = output.max(1)[1]  # index of the highest score = predicted class

print(prediction)

# MNIST CNN

We start by loading the data again.

In [None]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=100, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transforms.ToTensor()),
    batch_size=1000)

## nn.Conv2d

    Applies a 2D convolution over an input signal composed of several input planes.
    stride controls the stride for the cross-correlation, a single number or a tuple.
    padding controls the amount of implicit zero padding added to both sides of each spatial dimension.
    dilation controls the spacing between the kernel points; also known as the à trous algorithm.
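
The effect of kernel size, padding and pooling on the feature map size can be checked directly. This sketch pushes one random 28x28 input through two convolutions with a 5x5 kernel and padding=2, each followed by 2x2 max pooling; the intermediate names `a` to `d` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.rand(1, 1, 28, 28)  # one random single-channel image (placeholder data)
conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)
conv2 = nn.Conv2d(32, 64, kernel_size=5, padding=2)

a = conv1(x)            # padding=2 with a 5x5 kernel keeps 28x28
b = F.max_pool2d(a, 2)  # pooling halves it to 14x14
c = conv2(b)            # still 14x14
d = F.max_pool2d(c, 2)  # halved again to 7x7

for t in (a, b, c, d):
    print(tuple(t.shape))
```

The final shape (1, 64, 7, 7) is why a fully connected layer placed after these two blocks needs 64*7*7 input features.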

In [None]:
class MnistModel(nn.Module):
    def __init__(self):
        super(MnistModel, self).__init__()
        
        # input is 28x28
        # padding=2 for same padding
        self.conv1 = nn.Conv2d(1, 32, 5, padding=2)
        
        # feature map size is 14*14 by pooling
        # padding=2 for same padding
        self.conv2 = nn.Conv2d(32, 64, 5, padding=2)
        
        # feature map size is 7*7 by pooling
        self.fc1 = nn.Linear(64*7*7, 1024)
        self.fc2 = nn.Linear(1024, 10)
        
    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, 64*7*7)   # flatten the feature maps for the fully connected layers
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)  # log-probabilities over the 10 classes
    
model = MnistModel()
model

Print the size of each parameter tensor in the model, layer by layer (weights first, then biases):
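
The sizes translate directly into a parameter count. As a sketch, the snippet below does this for a single convolution layer with 1 input channel, 32 output channels and a 5x5 kernel; `layer`, `sizes` and `total` are names made up for this example.

```python
import torch.nn as nn

layer = nn.Conv2d(1, 32, 5, padding=2)

# One weight tensor of shape (32, 1, 5, 5) and one bias vector of shape (32,)
sizes = [tuple(p.size()) for p in layer.parameters()]
total = sum(p.numel() for p in layer.parameters())

print(sizes)
print(total)  # 32*1*5*5 weights + 32 biases
```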

In [None]:
for p in model.parameters():
    print(p.size())

## Optimization:
Use the optim package to define an Optimizer that will update the weights of the model for us. 

Here we will use Adam; the optim package contains many other optimization algorithms.
The first argument to the Adam constructor tells the optimizer which parameters (tensors) it should update.

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.0001)

## Training the network that we just defined

#### Using the same steps for training:

    Define the input and the target
    Zero out the gradients accumulated from the previous step
    Pass the data to the model for a forward pass
    Calculate the loss
    Backpropagate
    Optimize (update the weights)
    Repeat
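
The steps above can be sketched on a toy problem that trains in a fraction of a second. Everything here is an assumption made for illustration: `toy_model` is a single linear layer with a log-softmax, and the data is one fixed batch of random inputs with random labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(20, 5), nn.LogSoftmax(dim=1))
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

# Define the input and the target (random placeholder data)
data = torch.rand(32, 20)
target = torch.randint(0, 5, (32,))

losses = []
for step in range(50):                 # Repeat
    toy_optimizer.zero_grad()          # Zero out the old gradients
    output = toy_model(data)           # Forward pass
    loss = F.nll_loss(output, target)  # Calculate the loss
    loss.backward()                    # Backpropagate
    toy_optimizer.step()               # Optimize
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss on this fixed batch should go down
```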

In [None]:
model.train()
train_loss = []
train_accu = []
i = 0
epoch_bar = progressbar.ProgressBar()
for epoch in epoch_bar(range(3)):
    batch_bar = progressbar.ProgressBar()
    for data, target in batch_bar(train_loader):

        # Zero the gradients, then pass the data to the model for a forward pass
        optimizer.zero_grad()
        output = model(data)

        loss = F.nll_loss(output, target)  # calculate the loss
        loss.backward()    # compute gradients, i.e. backpropagation

        train_loss.append(loss.item())
        optimizer.step()   # update the weights

        # The index of the max log-probability is the predicted class
        prediction = output.data.max(1)[1]
        train_accu.append(prediction.eq(target).sum().item() / target.size(0) * 100)
        if i % 200 == 0:
            print('Train Step: {}\tLoss: {:.3f}'.format(i, loss.item()))
        i += 1

Let's now visualize how the training progressed.

In [None]:
plt.plot(np.arange(len(train_loss)), train_loss)


Loss is a function of the difference of the network output and the target values. We are minimizing the loss function during training so it should decrease over time. Accuracy is the classification accuracy for the test data.
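
A model that ends in `log_softmax` is paired with the negative log-likelihood loss, which together compute the same value as the cross-entropy loss applied to raw scores. The sketch below checks this on random logits; `two_step` and `one_step` are illustrative names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)           # random raw scores for 6 examples, 10 classes
target = torch.randint(0, 10, (6,))   # random class labels

# log_softmax followed by nll_loss ...
two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)
# ... matches cross_entropy applied directly to the logits
one_step = F.cross_entropy(logits, target)

print(two_step.item(), one_step.item())
```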

In [None]:
def test():
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, target in test_loader:
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

test()

### Let's load the dataset and check the predictions of the network visually

In [None]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=1, shuffle=True)



#### Take one example at a time and predict on it

In [None]:
data, target = next(iter(train_loader))

i = torchvision.utils.make_grid(data).numpy()
i = np.transpose(i, (1, 2, 0))
plt.imshow(i)

model.eval()  # disable dropout for inference
with torch.no_grad():  # inference only, no gradients needed
    output = model(data)
prediction = output.max(1)[1]  # index of the max log-probability = predicted class

print(prediction)