# 02. Character recognition of MNIST dataset using CNN


---
## Purpose

Carry out character recognition of the MNIST dataset using a Convolutional Neural Network (CNN). For evaluation, calculate the recognition rate of each class using a confusion matrix.

Also, run the neural network computations on the GPU.

## Preparations

### Confirm and change Google Colaboratory settings

In this tutorial, we use PyTorch to implement a neural network and carry out training and evaluation.
**To run computations on the GPU, go to the menu bar at the top of the screen and choose Runtime -> Change runtime type -> Hardware accelerator -> GPU.**

## Import modules

First, import the necessary modules.

In [None]:
from time import time
import torch
import torch.nn as nn

import torchvision
import torchvision.transforms as transforms

import torchsummary

### Confirm GPU settings

Confirm that GPU computation is enabled.


If `Use CUDA: True` is displayed, PyTorch can use the GPU for computation. If `Use CUDA: False` is displayed, go back to "Confirm and change Google Colaboratory settings" above, change the settings, and then import the modules again.


In [None]:
use_cuda = torch.cuda.is_available()
print('Use CUDA:', use_cuda)

## Read and confirm dataset

Load the MNIST training and test data.

In [None]:
train_data = torchvision.datasets.MNIST(root="./", train=True, transform=transforms.ToTensor(), download=True)
test_data = torchvision.datasets.MNIST(root="./", train=False, transform=transforms.ToTensor(), download=True)

print(type(train_data.data), type(train_data.targets))
print(type(test_data.data), type(test_data.targets))
print(train_data.data.size(), train_data.targets.size())
print(test_data.data.size(), test_data.targets.size())

## Define network model

Define the convolutional neural network.


The network in this tutorial consists of two convolutional layers and three fully connected layers.

The first convolutional layer has 1 input channel, 16 output feature maps, and a 3x3 convolution filter. The second convolutional layer has 16 input channels, 32 output feature maps, and also a 3x3 convolution filter. The first fully connected layer has `7*7*32` input units and 1024 output units. The next fully connected layer has 1024 input units and 1024 output units. The output layer has 1024 input units and 10 output units. For the activation function, we define a ReLU function in `self.act`. In addition, we define `self.pool` to carry out pooling; for this example, we use 2x2 max pooling. The composition of each layer is defined in the `__init__` function.


The `forward` function describes how the defined layers are connected and applied. Its parameter `x` represents the input data.
The input `x` is first passed to `conv1`, which was defined in `__init__`.
The output is passed to the activation function `self.act`.
The result is then passed to `self.pool`, and the pooled output is stored in `h`.
The second convolutional layer is applied in the same way.


After the convolutional layers, the feature map is passed to the fully connected layers, which output the classification result. First, the feature map of each sample, with shape (channels x height x width), is flattened into a one-dimensional array by calling `view()` on `h`. The first argument, `h.size()[0]`, keeps the first dimension (the batch size) unchanged. The second argument, `-1`, tells PyTorch to infer the remaining size automatically. This reshapes `h` to (batch size x number of features per sample). Finally, the flattened `h` is passed through the fully connected layers and the activation function in sequence, and the class scores are returned.


In [None]:
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.l1 = nn.Linear(7*7*32, 1024)
        self.l2 = nn.Linear(1024, 1024)
        self.l3 = nn.Linear(1024, 10)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        h = self.pool(self.act(self.conv1(x)))
        h = self.pool(self.act(self.conv2(h)))
        h = h.view(h.size()[0], -1)
        h = self.act(self.l1(h))
        h = self.act(self.l2(h))
        h = self.l3(h)
        return h
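
The `7*7*32` input size of `l1` follows from the layer settings in the cell above. As a rough check in plain Python (the helper `conv_out` is a hypothetical name introduced only for this sketch, not a PyTorch function), the usual output-size formula for convolution and pooling can be applied step by step:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                             # MNIST images are 28x28
size = conv_out(size, kernel=3, stride=1, padding=1)  # after conv1
size = conv_out(size, kernel=2, stride=2)             # after first max pooling
size = conv_out(size, kernel=3, stride=1, padding=1)  # after conv2
size = conv_out(size, kernel=2, stride=2)             # after second max pooling

flat_features = size * size * 32  # conv2 outputs 32 feature maps
print(size, flat_features)
```

If the kernel size, padding, or number of pooling steps is changed, the first argument of `nn.Linear` in `l1` has to be changed to match.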

## Create neural network

Create the neural network defined by the program above.

Instantiate the `CNN` class to create the neural network model. If the GPU is used (`use_cuda == True`), the model is placed in GPU memory so that its operations run on the GPU.

We use stochastic gradient descent with momentum (SGD with momentum) as the optimization method for training. We set the learning rate (`lr`) to 0.01 and the momentum (`momentum`) to 0.9.
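
To give a feel for what these two values do, here is a minimal plain-Python sketch of a momentum update on a single scalar parameter. It assumes the basic form of the update (no dampening, no Nesterov momentum, no weight decay); the function name `sgd_momentum_step` and the toy objective are hypothetical and are not part of PyTorch.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v + grad, then param <- param - lr * v."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Toy problem (hypothetical): minimise f(p) = p**2, whose gradient is 2*p.
p, v = 1.0, 0.0
for _ in range(100):
    p, v = sgd_momentum_step(p, 2 * p, v)

print(abs(p))  # much closer to the minimum at 0 than the starting value 1.0
```

The velocity accumulates past gradients, so consecutive steps in the same direction grow larger than with plain SGD at the same learning rate.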

Finally, `torchsummary.summary()` is used to display detailed information about the defined network. 



In [None]:
model = CNN()
if use_cuda:
    model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# display the detailed information about the defined network
if use_cuda:
    torchsummary.summary(model, (1, 28, 28), device='cuda')
else:
    torchsummary.summary(model, (1, 28, 28), device='cpu')
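
The parameter counts reported by the summary can be cross-checked by hand. The sketch below assumes the layer sizes of the `CNN` class defined earlier and counts one bias per output channel or unit; the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights (in_ch * out_ch * kernel * kernel) plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights (in_features * out_features) plus one bias per output unit."""
    return in_features * out_features + out_features

counts = {
    "conv1": conv_params(1, 16, 3),
    "conv2": conv_params(16, 32, 3),
    "l1": linear_params(7 * 7 * 32, 1024),
    "l2": linear_params(1024, 1024),
    "l3": linear_params(1024, 10),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Most of the parameters sit in the fully connected layers, which is worth keeping in mind when changing the network structure in the problems at the end.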

## Training

Train the created neural network on the loaded MNIST dataset.

We set the mini-batch size (the number of samples used to compute the error for one update) to 100 and the number of training epochs to 10.

Next, we define the data loader. The data loader takes the training dataset (`train_data`) loaded above and creates an object that reads the data one mini-batch at a time. For training, we set `shuffle=True` so that the data is read in a different random order in every epoch.

Next, we set the error function. Because we are dealing with a classification problem here, we define `criterion` to be `CrossEntropyLoss` to calculate cross entropy error.

Begin training.

For each update, the input data and the teacher labels are named `image` and `label`, respectively. The model is given the images and outputs a score `y` for each class. The error between the class scores `y` and the teacher labels is calculated by `criterion`. The recognition accuracy is also calculated. The error is then backpropagated by the `backward` function, and `optimizer.step()` updates the network parameters.



In [None]:
# set the mini-batch size and training epochs
batch_size = 100
epoch_num = 10

# define data loader
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

# set the error (loss) function
criterion = nn.CrossEntropyLoss()
if use_cuda:
    criterion.cuda()

# switch the network into training mode
model.train()

# begin training
train_start = time()
for epoch in range(1, epoch_num+1):
    sum_loss = 0.0
    count = 0

    for image, label in train_loader:

        if use_cuda:
            image = image.cuda()
            label = label.cuda()

        y = model(image)

        loss = criterion(y, label)
        model.zero_grad()
        loss.backward()
        optimizer.step()

        sum_loss += loss.item()

        pred = torch.argmax(y, dim=1)
        count += torch.sum(pred == label)

    print("epoch: {}, mean loss: {}, mean accuracy: {}, elapsed time: {}".format(epoch, sum_loss/len(train_loader), count.item()/len(train_data), time() - train_start))
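
The model returns raw class scores rather than probabilities, and the cross entropy error is computed from those scores. As a plain-Python sketch of the per-sample calculation (softmax followed by the negative log of the correct class's probability; the function name `cross_entropy` and the example scores are hypothetical):

```python
import math

def cross_entropy(scores, label):
    """-log(softmax(scores)[label]), computed stably via log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]

uniform = [0.0] * 10      # no preference among the 10 classes
confident = [0.0] * 10
confident[3] = 10.0       # strongly prefers class 3

print(cross_entropy(uniform, 3))    # equals ln(10): the chance-level loss
print(cross_entropy(confident, 3))  # small: confident and correct
print(cross_entropy(confident, 5))  # large: confident and wrong
```

With its default settings, `nn.CrossEntropyLoss` averages this value over the mini-batch, so a mean loss near ln(10) ≈ 2.3 in the first epoch would indicate a network that is still guessing.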

## Testing

Evaluate by using the trained network model on the testing data. 

Call `model.eval()` to switch the network to evaluation mode. Some layers (e.g. dropout) behave differently in evaluation mode than in training mode.
Wrap the evaluation in `torch.no_grad()` so that operations run without storing the gradient information that is only needed during training.


In [None]:
# define data loader
test_loader = torch.utils.data.DataLoader(test_data, batch_size=100, shuffle=False)

# switch the network configuration into evaluation mode
model.eval()

# begin evaluation
count = 0
with torch.no_grad():
    for image, label in test_loader:

        if use_cuda:
            image = image.cuda()
            label = label.cuda()
            
        y = model(image)

        pred = torch.argmax(y, dim=1)
        count += torch.sum(pred == label)

print("test accuracy: {}".format(count.item() / len(test_data)))
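
The Purpose section mentions per-class recognition rates from a confusion matrix, which the evaluation cell above does not compute. One possible way to do it is sketched below in plain Python. The function names and the toy data are hypothetical; in the evaluation loop, the two lists could be collected with `label.tolist()` and `pred.tolist()`.

```python
def confusion_matrix(labels, preds, num_classes=10):
    """matrix[t][p] = number of samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(labels, preds):
        matrix[t][p] += 1
    return matrix

def per_class_rate(matrix):
    """Recognition rate per class: diagonal count / row total (None for an empty row)."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]

# Toy data (hypothetical) with 3 classes.
labels = [0, 0, 1, 1, 1, 2]
preds = [0, 1, 1, 1, 0, 2]

cm = confusion_matrix(labels, preds, num_classes=3)
rates = per_class_rate(cm)
for true_class, (row, rate) in enumerate(zip(cm, rates)):
    print(true_class, row, rate)
```

Each row corresponds to a true class, so off-diagonal entries show which digits are confused with which.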

## Problems


### 1. Confirm the difference in computation time when training using GPU compared with using CPU.

**Hint: You can switch between GPU and CPU by changing the value of `use_cuda` (`True` or `False`) in the "Confirm GPU settings" cell (near the top of this page).**



### 2. Change the neural network structure and confirm the change in recognition accuracy. 

**Hint: The following items change the neural network structure.**

* The number of units in intermediate layers
* Number of layers
* The activation function
  * For example, `nn.Sigmoid()`, `nn.Tanh()`, or `nn.LeakyReLU()`.
  * Other activation functions that can be used in PyTorch are summarized on [this page](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity).

\* After changing the neural network structure, use the function `torchsummary.summary()` to view changes in the number of parameters.


### 3. Change training settings and confirm the change in recognition accuracy.

**Hint: The following settings can be changed in the program.**
* Mini-batch size
* Number of training cycles (number of epochs)
* Learning rate
* Optimization method
  * Choices include `torch.optim.Adagrad()` and `torch.optim.Adam()`.
  * Optimization methods that can be used in PyTorch are summarized on [this page](https://pytorch.org/docs/stable/optim.html#algorithms).

