## Build, train, evaluate MLP

PyTorch provides two high-level features:
* Tensor computing (like NumPy) with strong acceleration via graphics processing units (GPU)
* Deep neural networks built on an automatic differentiation system
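
As a minimal sketch of the first point (the variable names are illustrative and not part of the tutorial), tensors support NumPy-style arithmetic and can be placed on a GPU when PyTorch detects one:

```python
import numpy as np
import torch

# Pick a GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([[1., 2.], [3., 4.]], device=device)
b = torch.ones(2, 2, device=device)

c = a @ b + 1            # matrix product, then broadcasted addition
c_np = c.cpu().numpy()   # move back to host memory and view as a NumPy array

print(c_np)
```

The same code runs unchanged on a CPU-only machine, which is why the device is chosen once and passed around.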

In [1]:
# import packages
import torch
import torch.nn as nn
from torchvision import datasets, transforms
import numpy as np

### Some utilities

In [2]:
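# accuracy
def AccuarcyCompute(pred, label):
    # pred: (batch_size, num_classes) scores; label: (batch_size,) class indices
    pred = pred.detach().cpu().numpy()
    label = label.detach().cpu().numpy()
    # a sample is correct when its highest-scoring class matches the label
    test_np = (np.argmax(pred, 1) == label)
    test_np = np.float32(test_np)
    return np.mean(test_np)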

### Load MNIST dataset
#### Handwritten Digit Recognition
In this tutorial, we'll walk step by step through building a handwritten digit classifier using the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. For someone new to deep learning, this exercise is arguably the "Hello World" equivalent.

MNIST is a widely used dataset for the hand-written digit classification task. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.  

![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/example/mnist.png) 
<!---![png](mnist.png)-->

**Figure 1:** Sample images from the MNIST dataset.

### Build MLP 

![png](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mlp_mnist.png)
<!-- ![png](mlp_mnist.png) -->

**Figure 2:** MLP network architecture for MNIST. (Last layer is Dense layer without activation)

The last fully connected layer usually has its output size equal to the number of classes in the dataset. We could apply a softmax activation to map its outputs to a probability for each class, but instead we use a loss function that does this during training: it passes the layer's output through a softmax and then computes the [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) between the predicted probability distribution (the softmax output) and the true probability distribution given by the label.
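
The fused behaviour described above can be checked on random data. The sketch below assumes a batch of 4 samples and 10 classes (both arbitrary) and compares `nn.CrossEntropyLoss` on raw outputs with an explicit log-softmax followed by `nn.NLLLoss`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # stand-in for the raw outputs of the last Linear layer
labels = torch.tensor([3, 0, 7, 1])   # integer class indices, not one-hot vectors

# Fused version: softmax and cross-entropy in a single call.
fused = nn.CrossEntropyLoss()(logits, labels)

# Two-step version: log-softmax, then negative log-likelihood.
log_probs = nn.LogSoftmax(dim=1)(logits)
two_step = nn.NLLLoss()(log_probs, labels)

print(fused.item(), two_step.item())
```

Both values agree up to floating-point error, which is why the network itself ends with a plain `nn.Linear` layer.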

In [3]:
def build_mlp():
    mlp = nn.Sequential(
        nn.Linear(784, 50),
        nn.ReLU(),
        nn.Linear(50, 10)
    )
    print(f'You have defined the following MLP components: \n{mlp}')
    return mlp

mlp = build_mlp()

You have defined the following MLP components: 
Sequential(
  (0): Linear(in_features=784, out_features=50, bias=True)
  (1): ReLU()
  (2): Linear(in_features=50, out_features=10, bias=True)
)
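
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand and compared with what PyTorch reports. This sketch rebuilds the same three layers so that it runs on its own:

```python
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))

# Each Linear layer holds in_features * out_features weights plus out_features biases.
expected = (784 * 50 + 50) + (50 * 10 + 10)
n_params = sum(p.numel() for p in mlp.parameters())

print(n_params, expected)
```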


### In PyTorch, loading MNIST is simple: use torchvision.datasets.MNIST() and torch.utils.data.DataLoader()

 - torchvision.datasets.MNIST(): downloads and loads the dataset
 - torch.utils.data.DataLoader(): wraps the dataset in an iterable of mini-batches that a for loop can consume

In [4]:
# Prepare dataset
train_set = datasets.MNIST("data/", train=True, transform=transforms.ToTensor(), download=True)
test_set = datasets.MNIST("data/", train=False, transform=transforms.ToTensor(), download=True)

# batch_size is the number of samples processed in one iteration
train_dataset = torch.utils.data.DataLoader(train_set, batch_size=100)
test_dataset = torch.utils.data.DataLoader(test_set, batch_size=100)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST\raw\train-images-idx3-ubyte.gz


Extracting data/MNIST\raw\train-images-idx3-ubyte.gz to data/MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST\raw\train-labels-idx1-ubyte.gz


Extracting data/MNIST\raw\train-labels-idx1-ubyte.gz to data/MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST\raw\t10k-images-idx3-ubyte.gz


Extracting data/MNIST\raw\t10k-images-idx3-ubyte.gz to data/MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST\raw\t10k-labels-idx1-ubyte.gz


Extracting data/MNIST\raw\t10k-labels-idx1-ubyte.gz to data/MNIST\raw
Processing...
Done!



Image batches are commonly represented by a 4-D array with shape `(batch_size, num_channels, height, width)`. For the MNIST dataset, since the images are grayscale, there is only one color channel. Also, the images are 28x28 pixels, so each image has height and width equal to 28. Therefore, the shape of the input is `(batch_size, 1, 28, 28)`. Another important consideration is the order of the input samples. When feeding training examples, it is critical that we don't feed long runs of samples with the same label in succession, since doing so can slow down training.
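
One common way to avoid label-ordered batches is to let the `DataLoader` shuffle the samples. The sketch below uses random tensors as a stand-in for MNIST (the names `fake_set` and `loader` are illustrative) so that it runs without a download, and it also shows the batch shape before and after flattening:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 1,000 random "images" with random labels.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
fake_set = TensorDataset(images, labels)

# shuffle=True draws a new random sample order at the start of every epoch.
loader = DataLoader(fake_set, batch_size=100, shuffle=True)

batch_images, batch_labels = next(iter(loader))
flat = batch_images.view(-1, 28 * 28)  # flatten each image for a fully connected layer

print(batch_images.shape, flat.shape)
```

Shuffling is usually applied to the training loader only; the order of the test samples does not affect the measured accuracy.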

### Loss function and optimizer

To optimize the model, we need to define an objective that measures the gap between the prediction and the target.

A simple choice is the squared-error loss. However, it does not take into account that the model outputs represent class probabilities.

Instead, for classification tasks, we often use the **cross-entropy (CE) loss**.

Suppose the target is a one-hot vector $\vec{t}$ with $t_k=1$, and the model prediction (the softmax output) is denoted $\vec{v}$. The cross-entropy loss over the $C$ classes is then defined as:
$$L_{\text{CE}}(\vec{v}, \vec{t}) = -\sum_{j=1}^C t_j \log(v_j) = -\log(v_k)$$
Since all elements of $\vec{v}$ are between 0 and 1, we always have $\log(v_j)\le 0$, so the objective is always non-negative.

Moreover, $L_{\text{CE}}(\vec{v}, \vec{t}) = 0$ if and only if $v_k=1$ (that is, $\vec{v}=\vec{t}$).
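
As a small worked example with made-up numbers, take $C=3$ classes, the target $\vec{t}=(0,1,0)$ (so $k=2$) and a fairly confident correct prediction $\vec{v}=(0.1, 0.7, 0.2)$:
$$L_{\text{CE}}(\vec{v}, \vec{t}) = -\log(v_2) = -\log(0.7) \approx 0.357$$
If the model instead puts most of its mass on the wrong class, say $\vec{v}=(0.7, 0.1, 0.2)$, the loss grows to
$$L_{\text{CE}}(\vec{v}, \vec{t}) = -\log(0.1) \approx 2.303$$
Here $\log$ is the natural logarithm, which is also what PyTorch uses.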

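`mlp.parameters()` returns all trainable parameters of the MLP defined above; these are what the optimizer updates.
The learning rate `lr` should be tuned: too large a value makes training unstable, while too small a value makes it slow.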
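
The effect of the learning rate can be illustrated on a toy problem. The sketch below is not part of the tutorial's model: it minimizes $f(w)=w^2$ with plain `torch.optim.SGD` (no momentum, unlike the optimizer defined next), and the helper name `final_w` is hypothetical:

```python
import torch

def final_w(lr, steps=50):
    """Run plain SGD on f(w) = w**2 starting from w = 1 and return the final |w|."""
    w = torch.tensor(1.0, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = w ** 2
        loss.backward()
        opt.step()
    return abs(w.item())

too_small = final_w(0.001)   # barely moves away from the starting point
reasonable = final_w(0.1)    # shrinks towards the minimum at 0
too_big = final_w(1.5)       # overshoots and grows at every step

print(too_small, reasonable, too_big)
```

On this quadratic each step multiplies `w` by `1 - 2 * lr`, so any `lr` above 1 makes the iterates diverge. Real networks have no such closed form, which is why `lr` is tuned empirically.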

In [5]:
# define the optimizer; optimizers live in the torch.optim package
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01, momentum=0.9)

# define the loss function; loss functions live in the torch.nn package
lossfunc = torch.nn.CrossEntropyLoss()

### Train model


Typically, one runs the training until convergence, which means that we have learned a good set of model parameters (weights + biases) from the training data. For the purpose of this tutorial, we’ll run training for 3 epochs and stop. An epoch is one full pass over the entire training data.

We will take the following steps for training:

- Use accuracy as the evaluation metric on the training data.
- Loop over the mini-batches of inputs in every epoch.
- Forward the input through the network to get the output.
- Compute the loss from the output and the label.
- Backpropagate to compute the gradients.
- Update the parameters with gradient descent and report the evaluation metric.

The loss function takes a mini-batch of (output, label) pairs and, by default, returns a single scalar: the per-sample losses averaged over the mini-batch. This scalar measures how far the outputs are from the labels.
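
With its default settings, `nn.CrossEntropyLoss` averages the per-sample losses over the mini-batch. The individual values can still be inspected by passing `reduction="none"`, as in this sketch on random stand-in data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
outputs = torch.randn(5, 10)             # stand-in for mlp(inputs) on a batch of 5
labels = torch.tensor([1, 4, 4, 0, 9])   # integer class indices

per_sample = nn.CrossEntropyLoss(reduction="none")(outputs, labels)  # one loss per sample
batch_mean = nn.CrossEntropyLoss()(outputs, labels)                  # default: mean over the batch

print(per_sample.shape, batch_mean.shape)
```

The averaged scalar is what `loss.backward()` differentiates in the training loop.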

In [6]:
epochs = 3
for i_epoch in range(epochs):
    
    for idx, (inputs, labels) in enumerate(train_dataset):
        
        # convert from shape (batch_size, 1, 28, 28) to shape (batch_size, 784)
        inputs = inputs.view(-1, 28*28)
        
        # zero the parameter gradients
        optimizer.zero_grad()
        # make prediction
        outputs = mlp(inputs)
        # compute the loss
        loss = lossfunc(outputs, labels)
        # backpropagation to compute the gradients
        loss.backward()
        # update the model parameters
        optimizer.step()
        
    # note: this is the accuracy on the last mini-batch of the epoch only
    print(i_epoch, "epoch, training accuracy ", AccuarcyCompute(outputs,labels))

0 epoch, training accuracy  0.93
1 epoch, training accuracy  0.95
2 epoch, training accuracy  0.96


## Test our model, prediction

After training completes, we can evaluate the trained model by running predictions on the test dataset. Since the dataset also has labels for all test images, we can compute the accuracy over the test data as follows:

In [7]:
acc_list = []

with torch.no_grad():
    for i, (inputs, labels) in enumerate(test_dataset):
        
        # again, reshape input
        inputs = inputs.view(-1, 28*28)
        
        # make prediction
        outputs = mlp(inputs)
        
        # compute and store accuracy score
        acc_list.append(AccuarcyCompute(outputs, labels))

print("Testing accuracy:", sum(acc_list) / len(acc_list))

Testing accuracy: 0.9402000027894973
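
Averaging per-batch accuracies matches the overall accuracy only when every batch has the same size. A variant that counts correct predictions directly is sketched below; the helper name `evaluate` and the toy data are assumptions made for illustration, and an identity "model" on one-hot inputs is used so the expected result is known:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()  # switch layers such as dropout to inference behaviour, if present
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs.view(inputs.size(0), -1))  # flatten each sample
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Toy check: an identity "model" fed one-hot inputs classifies every sample correctly.
labels = torch.randint(0, 10, (250,))
one_hot = nn.functional.one_hot(labels, num_classes=10).float()
toy_loader = DataLoader(TensorDataset(one_hot, labels), batch_size=100)  # last batch has 50 samples
toy_acc = evaluate(nn.Identity(), toy_loader)

print(toy_acc)
```

For the MLP in this tutorial the call would take the form `evaluate(mlp, test_dataset)`.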


## Automatic differentiation, computational graph

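Variable & autograd (since PyTorch 0.4, `Variable` has been merged into `Tensor`: a tensor created with `requires_grad=True` tracks gradients directly, and the `Variable` wrapper still works but simply returns a `Tensor`)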

In [8]:
import torch

# Create tensors that track gradients (no Variable wrapper needed since PyTorch 0.4)
x = torch.tensor(3., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph
y = w * x + b

# Compute gradients
y.backward()

# Print out the gradients
print("x.grad", x.grad)
print("w.grad", w.grad)
print("b.grad", b.grad)

x.grad tensor(2.)
w.grad tensor(3.)
b.grad tensor(1.)
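
These values match the analytic derivatives of $y = wx + b$: $\partial y/\partial x = w = 2$, $\partial y/\partial w = x = 3$ and $\partial y/\partial b = 1$. One detail worth seeing in isolation is that `.grad` accumulates across calls to `backward()`, which is why the training loop calls `optimizer.zero_grad()` at every step. A minimal sketch:

```python
import torch

x = torch.tensor(3., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

(w * x + b).backward()
first = x.grad.item()    # dy/dx = w = 2

(w * x + b).backward()   # a second backward pass adds to the stored gradient
second = x.grad.item()   # 2 + 2 = 4

x.grad.zero_()           # reset by hand; optimizer.zero_grad() does this for its parameters
cleared = x.grad.item()

print(first, second, cleared)
```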


In [9]:
import torch

# Create a tensor that tracks gradients and apply the sigmoid by hand
x = torch.tensor(-1., requires_grad=True)
y = 1. / (1 + torch.exp(-x))

# Compute gradients
y.backward()

# Print out the gradients
print("x.grad:", x.grad)
print("torch.sigmoid:", torch.sigmoid(x)*(1-torch.sigmoid(x)))
print("manually compute:", 1./(1+torch.exp(-x)) * (1-1./(1+torch.exp(-x))))

x.grad: tensor(0.1966)
torch.sigmoid: tensor(0.1966, grad_fn=<MulBackward0>)
manually compute: tensor(0.1966, grad_fn=<MulBackward0>)
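
All three numbers agree because the derivative of the sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$ has a closed form:
$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)\,\bigl(1-\sigma(x)\bigr)$$
At $x=-1$ this gives
$$\sigma(-1) = \frac{1}{1+e} \approx 0.2689, \qquad \sigma'(-1) \approx 0.2689 \times 0.7311 \approx 0.1966$$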
