# **COMP3010 - Machine Learning**

The tutorial contains two parts: (theoretical) discussion and (practical) coding. The discussion part consists of important concepts, advanced topics, or open-ended questions, for which we want an in-depth discussion. The coding part contains programming exercises for you to gain hands-on experience.

## **Tutorial 06**
Learning outcomes:

*   Training and evaluating an MLP for MNIST
*   Visualising data with t-SNE
*   Deepening your understanding of NNs with Playground


## **Discussion**


1.   Explain the "Universal Approximation Theorem" of NNs. If an NN with one hidden layer can fit anything, why do we need deep NNs?
2.   Where does the term "deep learning" come from? How does it differ from "shallow learning"?
3.   You are training an NN and observe that the training loss becomes NaN after several iterations. What might be the causes, and how would you address them?
4.   Explain the concept of _batch_size_, _iteration_, and _epoch_ in training NN. How does batch size affect the training of NNs?
5.   Discuss regularization techniques for addressing overfitting of NN.

## **Coding**


In this part, you will

*   implement an MLP for recognising MNIST handwritten digits (0-9).
*   implement a data visualization technique named [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).
*   have fun with [NN Playground](https://playground.tensorflow.org/).



## Outline
- [ 1 - MLP for MNIST ](#1)
  - [ 1.1 Load data](#1.1)
  - [ 1.2 - Visualize a batch of data](#1.2)
  - [ 1.3 - Define network architecture](#1.3)
    - [Exercise 01](#ex01)
  - [ 1.4 - Specify loss function and optimizer](#1.4)
  - [ 1.5 - Model Training](#1.5)
  - [ 1.6 - Model Evaluation](#1.6)
  - [ 1.7 - (Optional) Improve model performance](#1.7)
- [ 2 - T-SNE visualization](#2)
  - [ 2.1 Load data](#2.1)
  - [ 2.2 - Visualising the raw data](#2.2)
  - [ 2.3 - Visualising learned features](#2.3)
- [ 3 - NN Playground](#3)
  - [Exercise 02](#ex02)
  - [Exercise 03](#ex03)
  - [Exercise 04](#ex04)


<a name="1"></a>
## 1 - MLP for MNIST digits recognition
---
In this section, we will train an MLP to classify images from the [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digit database.

The process will be broken down into the following steps:
1. Load and visualize the data
2. Define a neural network
3. Train the model
4. Evaluate the performance of our trained model on a test dataset!

Before we begin, let's import the necessary libraries for working with data and PyTorch.


In [None]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import torch
import torch.nn as nn
from torchvision import datasets
import torchvision.transforms as transforms

<a name="1.1"></a>
### 1.1 Load data

We will download the full MNIST data from torchvision. Downloading may take a few moments, and you should see your progress as the data is loading. You may also choose to change the `batch_size` if you want to load more data at a time.

This cell will create DataLoaders for each of our [datasets](http://pytorch.org/docs/stable/torchvision/datasets.html).

In [None]:
from torchvision import datasets
import torchvision.transforms as transforms

# how many examples per batch to load
batch_size = 16

# convert data to torch.FloatTensor
transform = transforms.ToTensor()

# choose the training and test datasets
train_data = datasets.MNIST(root='DATA', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='DATA', train=False, download=True, transform=transform)

# prepare data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size)

Note that we do not process the full dataset at once; instead we load the data batch by batch, with 16 images in each batch. The PyTorch `DataLoader` handles many cumbersome details for us, which makes it very useful when dealing with large datasets.
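To make the batching concrete, here is a minimal sketch of how a `DataLoader` splits a dataset into batches. It uses a small random stand-in dataset (the names `fake_data` and `demo_loader` are hypothetical, not part of the tutorial) so that it runs without downloading MNIST.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 100 random "images" with random labels
fake_images = torch.rand(100, 1, 28, 28)
fake_labels = torch.randint(0, 10, (100,))
fake_data = TensorDataset(fake_images, fake_labels)

demo_loader = DataLoader(fake_data, batch_size=16)

# len(loader) is the number of batches, i.e. iterations per epoch
n_batches = len(demo_loader)
print(n_batches, math.ceil(100 / 16))

# The last batch is smaller because 100 is not a multiple of 16
batch_sizes = [images.size(0) for images, _ in demo_loader]
print(batch_sizes)
```

By the same arithmetic, the 60,000 MNIST training images with a batch size of 16 give 60000 / 16 = 3750 iterations per epoch.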

<a name="1.2"></a>
### 1.2 - Visualize a batch of data

The first step in a classification task is to take a look at the data, make sure it is loaded in correctly, then make any initial observations about patterns in that data.

In [None]:
# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.numpy()

# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(20, 4))
for idx in np.arange(batch_size):
    ax = fig.add_subplot(2, batch_size//2, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    # print out the correct label for each image
    # .item() gets the value contained in a Tensor
    ax.set_title(str(labels[idx].item()))

Check the image dimensions:

In [None]:
print(images.shape)

PyTorch uses the ``BCHW`` **(batch, channel, height, width)** convention to store image tensors. The above shape tells us that the ``images`` tensor contains a batch of 16 grayscale images of resolution $28\times28$, i.e., 784 pixels per image.
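As a quick illustration of this layout, the sketch below flattens a batch of the same shape into 784-dimensional vectors, which is the form an MLP expects. The tensor `demo_batch` is a hypothetical placeholder filled with zeros.

```python
import torch

# Hypothetical batch with the same shape as the MNIST batch above: (B, C, H, W)
demo_batch = torch.zeros(16, 1, 28, 28)

# Flatten every image into a 784-dim vector; -1 lets PyTorch infer the batch size
demo_flat = demo_batch.view(-1, 28 * 28)
print(demo_flat.shape)
```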

<a name="1.3"></a>
### 1.3 - Define network architecture

The architecture takes as input a 784-dim Tensor of pixel values for each image and produces a Tensor of length 10 (our number of classes) that holds the class scores for that image. This information determines the input and output layers of the neural network. We as users need to define the hidden layers of our NN, with the number of layers and the number of neurons per layer as hyperparameters.


<a name="ex01"></a>
### Exercise 01

Let's define an MLP with structure s=[784, 256, 64, 10], that is, two hidden layers of size 256 and 64 respectively.

In [None]:
import torch.nn as nn

## Define the NN architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        ### START CODE HERE ###

        ### END CODE HERE ###

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)

        ### START CODE HERE ###

        ### END CODE HERE ###
        return x

# initialize the NN
model = Net()
print(model)
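To get a feel for the size of this network, the following sketch counts its trainable parameters from the structure s=[784, 256, 64, 10] alone. It assumes every layer is fully connected and has a bias term, which is an assumption about how you fill in the exercise.

```python
# Layer sizes from Exercise 01
sizes = [784, 256, 64, 10]

# Each fully connected layer has n_in * n_out weights plus n_out biases
# (assumption: all layers use a bias)
n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(n_params)
```

Once your `Net` is defined, you can compare this number with `sum(p.numel() for p in model.parameters())`; the two should agree if your layers match the assumption above.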

<a name="1.4"></a>
###  1.4 - Specify [Loss Function](http://pytorch.org/docs/stable/nn.html#loss-functions) and [Optimizer](http://pytorch.org/docs/stable/optim.html)

It's recommended that you use cross-entropy loss for classification. If you look at the documentation (linked above), you can see that PyTorch's cross-entropy function applies a softmax function to the output layer *and* then calculates the log loss.

In [None]:
## Specify loss and optimization functions

# specify loss function
criterion = nn.CrossEntropyLoss()

# specify optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
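The statement that cross-entropy combines softmax with the log loss can be checked numerically. The sketch below compares `nn.CrossEntropyLoss` with an explicit log-softmax followed by the negative log-likelihood loss; the logits and targets are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
demo_logits = torch.randn(4, 10)            # hypothetical raw class scores for 4 images
demo_targets = torch.tensor([3, 0, 7, 1])   # hypothetical ground-truth labels

loss_ce = nn.CrossEntropyLoss()(demo_logits, demo_targets)
loss_manual = F.nll_loss(F.log_softmax(demo_logits, dim=1), demo_targets)

print(loss_ce.item(), loss_manual.item())
print(torch.allclose(loss_ce, loss_manual))
```

A practical consequence is that the network's `forward` should return raw scores (logits) rather than softmax probabilities when this loss is used.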

<a name="1.5"></a>
###  1.5 - Model Training

The steps for training/learning from a batch of data are described in the comments below:
1. Clear the gradients of all optimized variables
2. Forward pass: compute predicted outputs by passing inputs to the model
3. Calculate the loss
4. Backward pass: compute gradient of the loss with respect to model parameters
5. Perform a single optimization step (parameter update)
6. Update average training loss

The following loop trains for 10 epochs as a quick demo; feel free to change this number. For now, we suggest somewhere between 20 and 50 epochs. As you train, take a look at how the values for the training loss decrease over time. We want it to decrease while also avoiding overfitting the training data. Note that the training process is slow as we are iterating through 60,000 images with a batch size of 16 in each epoch. You can speed up this process by switching from CPU to GPU for this notebook.

In [None]:
# number of epochs to train the model
n_epochs = 10  # 10 for fast demo, suggest training between 20-50 epochs

model.train() # prep model for training

for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0

    ###################
    # train the model #
    ###################
    for data, target in train_loader:
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item()*data.size(0)

    # print training statistics
    # calculate average loss over an epoch
    train_loss = train_loss/len(train_loader.dataset)

    print('Epoch: {} \tTraining Loss: {:.6f}'.format(
        epoch+1,
        train_loss
        ))
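If a GPU is available, one common pattern for using it is sketched below: pick a device, move the model to it once, and move every batch to it inside the loop. The model and data here (`demo_model`, `demo_data`, `demo_target`) are hypothetical stand-ins to keep the sketch self-contained; in the notebook you would apply the same `.to(device)` calls to `model`, `data`, and `target`.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small model and batch, used only for illustration
demo_model = nn.Linear(784, 10).to(device)
demo_criterion = nn.CrossEntropyLoss()
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=0.001)

demo_data = torch.rand(16, 784)
demo_target = torch.randint(0, 10, (16,))

# Inputs and labels must live on the same device as the model
demo_data, demo_target = demo_data.to(device), demo_target.to(device)

# One training step, following the six steps listed above
demo_optimizer.zero_grad()
demo_loss = demo_criterion(demo_model(demo_data), demo_target)
demo_loss.backward()
demo_optimizer.step()
print(device, demo_loss.item())
```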

<a name="1.6"></a>
###  1.6 - Model Evaluation

Finally, we test our best model on previously unseen **test data** and evaluate its performance. Testing on unseen data is a good way to check that our model generalizes well. It may also be useful to be granular in this analysis and take a look at how this model performs on each class as well as looking at its overall loss and accuracy.

#### `model.eval()`

`model.eval()` will set all the layers in your model to evaluation mode, which behaves differently from training mode for some layers, such as ``dropout``.
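The difference between the two modes is easy to see on a single dropout layer. The following sketch applies `nn.Dropout` to a vector of ones in both modes; the layer `drop` and input `ones` are illustrative names only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()
out_train = drop(ones)   # roughly half the entries are zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
out_eval = drop(ones)    # dropout does nothing in evaluation mode

print(out_train.unique())
print(torch.equal(out_eval, ones))
```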

In [None]:
# initialize lists to monitor test loss and accuracy
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

model.eval() # prep model for *evaluation*
with torch.no_grad():  # turn off gradient to save memory
  for data, target in test_loader:
      # forward pass: compute predicted outputs by passing inputs to the model
      output = model(data)
      # calculate the loss
      loss = criterion(output, target)
      # update test loss
      test_loss += loss.item()*data.size(0)
      # convert output probabilities to predicted class
      _, pred = torch.max(output, 1)
      # compare predictions to true label
      correct = pred.eq(target)
      # calculate test accuracy for each object class
      for i in range(target.size(0)):  # actual batch size; the last batch may be smaller
          label = target.data[i]
          class_correct[label] += correct[i].item()
          class_total[label] += 1

# calculate and print avg test loss
test_loss = test_loss/len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of digit %5s: %2d%% (%2d/%2d)' % (
            str(i), 100 * class_correct[i] / class_total[i],
            np.sum(class_correct[i]), np.sum(class_total[i])))
    else:
        print('Test Accuracy of digit %5s: N/A (no test examples)' % str(i))

print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))
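As an alternative to the inner Python loop, the per-class counts can be computed in a vectorised way with `torch.bincount`. This is only a sketch on made-up predictions and labels (`demo_pred`, `demo_target`), not a replacement for the cell above.

```python
import torch

# Hypothetical predictions and ground-truth labels for a 3-class problem
demo_pred = torch.tensor([0, 1, 1, 2, 2, 2])
demo_target = torch.tensor([0, 1, 2, 2, 2, 1])
n_classes = 3

# Number of examples per class
demo_total = torch.bincount(demo_target, minlength=n_classes)
# Number of correctly classified examples per class
demo_correct = torch.bincount(demo_target[demo_pred == demo_target], minlength=n_classes)

print(demo_total.tolist(), demo_correct.tolist())
print((demo_correct / demo_total).tolist())
```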

### Visualize Sample Test Results


This cell displays test images and their labels in this format: `predicted (ground-truth)`. The text will be green for accurately classified examples and red for incorrect predictions.

In [None]:
# obtain one batch of test images
dataiter = iter(test_loader)
images, labels = next(dataiter)

# get sample outputs
output = model(images)
# convert output probabilities to predicted class
_, preds = torch.max(output, 1)
# prep images for display
images = images.cpu().numpy()

# plot the images in the batch, along with predicted and true labels
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(batch_size):
    ax = fig.add_subplot(2, batch_size//2, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title("{} ({})".format(str(preds[idx].item()), str(labels[idx].item())),
                 color=("green" if preds[idx]==labels[idx] else "red"))

<a name="1.7"></a>
###  1.7 - (Optional) Improve model performance by trying different hyperparameters

You can try to improve the model performance by using a different network architecture and longer training.

<a name="2"></a>
## 2 - T-SNE visualization
---

[T-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is a powerful dimensionality reduction technique that can be used for data visualisation.

The idea is that we as humans cannot make sense of data with dimensionality higher than 3. Therefore, the goal of t-SNE is to reduce high-dimensional data to a low-dimensional space (ideally 2-D) so that we can visualize and understand the data.

Obviously, the dimensionality reduction process needs to preserve the (manifold) structure of the data, in a way that close examples in the high-dimensional space are still close in the low-dimensional space.
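Before applying t-SNE to MNIST, it may help to see it on a toy problem. The sketch below embeds two synthetic clusters from 50 dimensions into 2; the data and all variable names are hypothetical, and only the default `TSNE` settings plus `perplexity` and `random_state` are used.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two hypothetical clusters of 30 points each in 50 dimensions
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
cluster_b = rng.normal(loc=10.0, scale=1.0, size=(30, 50))
points = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples
demo_tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = demo_tsne.fit_transform(points)

print(points_2d.shape)
```

Plotting `points_2d` with a scatter plot, coloured by cluster, lets you check whether points that were close in 50 dimensions remain close in the embedding.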

<a name="2.1"></a>
### 2.1 Load data

We will use 3000 images from the test set of MNIST for this exercise.

In [None]:
# how many samples per batch to load
batch_size = 3000

# transforms to be applied when loading data
transform = transforms.ToTensor()

# download the test data if not already present
test_data = datasets.MNIST(root='DATA', train=False, download=True, transform=transform)

# prepare data loaders
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size)

# load data
dataIter = iter(test_loader)
X, label = next(dataIter)

# Convert to numpy and flatten images
X_np = X.numpy().reshape(3000, -1)
print("The current data shape is: ", X_np.shape)

<a name="2.2"></a>
### 2.2 - Visualizing the raw data

In the code below, we will use t-SNE to reduce the data `X_np` of shape (3000, 784) to `X_embeded` of shape (3000, 2) and then draw `X_embeded` as a scatter plot, color coded by class label.

In [None]:
from sklearn.manifold import TSNE
import seaborn as sns; sns.set_theme()

tsne = TSNE(n_components=2,        # number of dimensions after embedding
            perplexity=30,         # roughly the effective number of nearest neighbours (to be tuned)
            learning_rate='auto',
            init='pca',
            n_iter=300)            # note: newer scikit-learn releases rename this to max_iter

X_embeded = tsne.fit_transform(X_np)

# Create the figure
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1, 1, 1, title='TSNE of MNIST TEST SET' )

# Create the scatter
sns.scatterplot(x=X_embeded[:, 0], y=X_embeded[:, 1], hue=label, palette="deep")

As seen, some digits are well separated from the others, such as '0', '1', and '2', while some digits are mixed with each other, such as '4' and '9'. These observations make sense, as handwritten 4s and 9s can look quite similar.


<a name="2.3"></a>
### 2.3 - Visualizing the learned features

We mentioned in the lecture that a neural network learns a latent representation (embedding) in its hidden layers. Let's visualise this learned representation and see if it is better than the raw representation.

We will use the features after the 2nd hidden layer ``fc2`` (which has a dimensionality of 64). Since our NN model has already been defined and trained, we need to use a ``hook`` in PyTorch to extract the intermediate result after ``fc2``.

In [None]:
extracted_features = []

def hook_function(module, input, output):
    extracted_features.append(output.clone().detach())

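hook = model.fc2.register_forward_hook(hook_function)
with torch.no_grad():   # no gradients are needed for feature extraction
    model_output = model(X)
hook.remove()           # detach the hook so later forward passes do not append more features
X_fc2 = extracted_features[0]  # The features are stored in a list
print("The data after fc2 has the shape: ", X_fc2.shape)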
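The hook mechanism can be tried in isolation on a tiny network. The sketch below registers a forward hook on the first layer of a hypothetical `nn.Sequential` model (`demo_net`), captures its output, and then removes the hook; none of these names come from the tutorial's `Net`.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model, used only to illustrate the hook mechanism
demo_net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

captured = []

def save_output(module, inputs, output):
    # Called after the hooked module's forward pass; store a detached copy
    captured.append(output.detach().clone())

handle = demo_net[0].register_forward_hook(save_output)
with torch.no_grad():
    _ = demo_net(torch.rand(5, 8))
handle.remove()   # without this, every later forward pass appends another tensor

_ = demo_net(torch.rand(5, 8))   # hook removed: nothing is captured here
print(len(captured), captured[0].shape)
```

Note that a hook on a linear layer captures its output before any activation function that follows it.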

We will then visualise this 64 dimensional data using t-SNE again.

In [None]:
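tsne = TSNE(n_components=2,        # number of dimensions after embedding
            perplexity=30,         # roughly the effective number of nearest neighbours (to be tuned)
            learning_rate='auto',
            init='pca',
            n_iter=300)            # note: newer scikit-learn releases rename this to max_iter

# convert the torch tensor to a numpy array before passing it to scikit-learn
X_embeded = tsne.fit_transform(X_fc2.numpy())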

# Create the figure
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1, 1, 1, title='TSNE of MNIST TEST SET after fc2')

# Create the scatter
sns.scatterplot(x=X_embeded[:, 0], y=X_embeded[:, 1], hue=label, palette="deep")

As shown, the data are much better separated. In particular, the digits '4' and '9' are well separated. Applying a classifier to this representation should give better performance than applying it to the raw representation. This is the power of feature learning, which is the essence of deep learning. We will see more examples in later lectures and tutorials.

Note that the t-SNE embedding can also be tuned through its hyperparameters, in particular ``perplexity``. You can also try t-SNE on other features, e.g., those after fc3.

<a name="3"></a>
## 3 - NN Playground


The [neural network playground](https://playground.tensorflow.org/) provides a web-browser-based tool for playing with neural networks. In particular, we can visualize how an NN is trained and what the **decision boundary** looks like.

You can explore it freely or work through the following exercises.
               

<a name="ex02"></a>
### Exercise 02: XOR data (The 2nd one)

**Task 1**: Start with a model that has 1 hidden layer containing 1 neuron with linear activation. The NN model combines our two input features into a single neuron. Will this model learn any nonlinearities? Run it to confirm your guess.

**Task 2**: Try increasing the number of neurons in the hidden layer from 1 to 2, and also try changing from a Linear activation to a nonlinear activation like ReLU. Can you create a model that can learn nonlinearities? Can it model the data effectively?

**Task 3**: Try increasing the number of neurons in the hidden layer from 2 to 3, using a nonlinear activation like ReLU. Can it model the data effectively? How does model quality vary from run to run?

**Task 4**: Continue experimenting by adding or removing hidden layers and neurons per layer. Also feel free to change learning rates, regularization, and other learning settings. What is the smallest number of neurons and layers you can use that gives test loss of 0.177 or lower?

Does increasing the model size improve the fit, or how quickly it converges? Does this change how often it converges to a good model? For example, try the following architecture:

- First hidden layer with 3 neurons.
- Second hidden layer with 3 neurons.
- Third hidden layer with 2 neurons.
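One way to sanity-check your Task 1 guess outside the Playground is the small PyTorch sketch below. It builds a hypothetical network with one linear hidden neuron and compares its output with a single linear map constructed from the same weights; the names `hidden` and `out_layer` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = nn.Linear(2, 1)      # 1 hidden neuron with linear (identity) activation
out_layer = nn.Linear(1, 1)   # output neuron

x = torch.rand(6, 2)
y_two_layers = out_layer(hidden(x))

# Candidate single linear map: W = W2 W1, b = W2 b1 + b2
W = out_layer.weight @ hidden.weight
b = out_layer.weight @ hidden.bias + out_layer.bias
y_one_layer = x @ W.T + b

print(torch.allclose(y_two_layers, y_one_layer, atol=1e-6))
```

Replacing the identity activation with `torch.relu` between the two layers breaks this equivalence, which is what Task 2 explores.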

<a name="ex03"></a>
### Exercise 03: XOR data

This exercise uses the XOR data again, but looks at the repeatability of training neural nets and the importance of initialization. Use an NN with 1 hidden layer of 3 neurons and ReLU activation.

**Task 1**: Run the model as given four or five times. Before each trial, hit the Reset the network button to get a new random initialization. (The Reset the network button is the circular reset arrow just to the left of the Play button.) Let each trial run for at least 500 steps to ensure convergence. What shape does each model output converge to? What does this say about the role of initialization in non-convex optimization?

**Task 2**: Try making the model slightly more complex by adding a layer and a couple of extra nodes. Repeat the trials from Task 1. Does this add any additional stability to the results?
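The Playground's reset button draws a new random initialization each time. A rough PyTorch analogue, sketched below under the assumption that seeding with `torch.manual_seed` controls the initial weights, shows that different seeds start training from different points while the same seed reproduces the same start. The helper `make_net` is hypothetical.

```python
import torch
import torch.nn as nn

def make_net(seed):
    """Build a 2-3-1 ReLU network whose initial weights depend on the seed."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))

net_a, net_b, net_a_again = make_net(0), make_net(1), make_net(0)

# Compare the first-layer weights of the three networks
same_seed_equal = torch.equal(net_a[0].weight, net_a_again[0].weight)
diff_seed_equal = torch.equal(net_a[0].weight, net_b[0].weight)
print(same_seed_equal, diff_seed_equal)
```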

<a name="ex04"></a>
### Exercise 04: Neural Net Spiral (the last one)

This data set is a noisy spiral. Obviously, a linear model will fail here, but even manually defined feature crosses may be hard to construct.

**Task 1**: Train the best model you can, using just X1 and X2. Feel free to add or remove layers and neurons, change learning settings like learning rate, regularization rate, and batch size. What is the best test loss you can get? How smooth is the model output surface?

**Task 2**: Even with Neural Nets, some amount of feature engineering is often needed to achieve best performance. Try adding in additional cross product features or other transformations like sin(X1) and sin(X2). Do you get a better model? Is the model output surface any smoother?
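For reference, the kind of feature engineering described in Task 2 can be written in a few lines of NumPy. The sketch below builds squared, cross-product, and sine features from two raw inputs; the function name `engineered_features` and the sample points are hypothetical.

```python
import numpy as np

def engineered_features(X):
    """Augment raw inputs (x1, x2) with squares, a cross product, and sines."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2, np.sin(x1), np.sin(x2)])

# Two made-up 2-D points
X_demo = np.array([[0.0, 1.0], [2.0, -1.0]])
F_demo = engineered_features(X_demo)
print(F_demo.shape)
print(F_demo[1])
```

A model trained on such features receives the nonlinear transformations as inputs instead of having to learn them in its hidden layers.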