# ANN PyTorch Implementation

### MNIST Dataset

Input: 28x28 pixel image (gray scale)

Output: Binary classification
- output = 1: if digit is small (0, 1, 2)
- output = 0: otherwise (3,4,5,6,7,8,9)
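
The labelling rule above can be sketched in a few lines of plain Python. The helper name `is_small_digit` is hypothetical and only used for illustration:

```python
def is_small_digit(digit: int) -> int:
    """Return 1 for the digits 0, 1, 2 and 0 for the digits 3 through 9."""
    return int(digit < 3)

# Binary target for every possible MNIST digit
labels = [is_small_digit(d) for d in range(10)]
print(labels)  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```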

![Alt text](images/img38.png)

Supervised or unsupervised? 
- Supervised (since we have access to labels)
- Classification: Outputs are categorical and not continuous

### ANN in PyTorch

Import libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
import matplotlib.pyplot as plt     # for plotting
import torch.optim as optim         # for optimizer
torch.manual_seed(1)                # fix the random seed while developing so that results are reproducible:
                                    # the same random numbers are drawn on every run, so we can tell whether
                                    # a change in results comes from an edited configuration or from the seed

Define our NN architecture

- We usually define the network as a class that inherits from nn.Module
    - class nn.Linear defines a fully-connected layer

- **forward()** method defines how to make a prediction

In [None]:
# define a 2-layer neural network
class NN(nn.Module):
    def __init__(self):
        super(NN, self).__init__()
        self.layer1 = nn.Linear(28 * 28, 30)    # 30 hidden units is a hyperparameter choice. Input x is a flattened 28*28 = 784 vector, weight is (30, 784), bias is (30,)
        self.layer2 = nn.Linear(30, 1)          # 30 inputs, 1 output since we're doing binary classification
        
    def forward(self, img):
        flattened = img.view(-1, 28 * 28)
        activation1 = self.layer1(flattened)
        activation1 = F.relu(activation1)
        activation2 = self.layer2(activation1)
        return activation2
        
model = NN()        # instantiated model
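
To make the layer shapes concrete, here is a minimal sketch that inspects a standalone `nn.Linear(28 * 28, 30)` layer and passes a dummy batch through it. The batch of zeros is a made-up stand-in for real images:

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(28 * 28, 30)
print(layer1.weight.shape)  # torch.Size([30, 784])
print(layer1.bias.shape)    # torch.Size([30])

batch = torch.zeros(4, 1, 28, 28)     # 4 dummy grayscale images
flattened = batch.view(-1, 28 * 28)   # each image becomes a 784-long row
out = layer1(flattened)
print(out.shape)            # torch.Size([4, 30])
```

The `-1` in `view` lets PyTorch infer the batch size, which is why the same `forward` works for one image or many.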

NOTE: where is the final function to map the result to probability? (output is a linear layer)

### Loss Function and Sigmoid Activation

- For binary classification you would expect to see a sigmoid activation applied to the output layer. Indeed this would be the case if we used:

    `criterion = nn.BCELoss()`

- For numerical stability, we will use:

    `criterion = nn.BCEWithLogitsLoss()`       --> <mark>RECOMMENDED</mark>

This applies the **sigmoid activation internally**, so the model itself returns raw scores (logits).
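
As a quick check of the relationship between the two criteria, the sketch below compares `nn.BCEWithLogitsLoss` on raw outputs with `nn.BCELoss` applied after an explicit `torch.sigmoid`. The logits and targets are made-up values:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])   # raw model outputs (made-up values)
targets = torch.tensor([[1.0], [0.0], [1.0]])   # made-up binary labels

# Sigmoid handled inside the loss
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Sigmoid applied by hand, then plain binary cross-entropy
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_manual.item())
```

Both values should agree up to floating-point error. The fused version is preferred because it avoids computing the log of a probability that has been rounded to 0 or 1.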

### PyTorch: Load Data

Load MNIST data: (there are also other data loaders)

In [None]:
# load commonly used MNIST dataset
mnist_data = datasets.MNIST('data', train=True, download=True)
mnist_data = list(mnist_data)
mnist_train = mnist_data[:1000]     # training data
mnist_val = mnist_data[1000:2000]   # validation data
img_to_tensor = transforms.ToTensor()

### Forward and Backward Pass

Forward pass: **makes a prediction**
- e.g. `model(input)`, which calls the model's `forward` method
- Information flows forwards from input to output layer

Backward pass: **computes gradients** for making changes to weights
- e.g. `loss.backward()`
- Information flows backwards from output to input layer
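
A minimal sketch of both passes on a one-weight "network", using made-up numbers so the gradient can be checked by hand:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)   # a single trainable weight
x = torch.tensor(2.0)                       # input
y = torch.tensor(10.0)                      # target

pred = w * x              # forward pass: prediction is 6.0
loss = (pred - y) ** 2    # squared error is 16.0
loss.backward()           # backward pass: fills w.grad with d(loss)/dw

# By hand: d/dw (w*x - y)^2 = 2 * (w*x - y) * x = 2 * (-4) * 2 = -16
print(w.grad)
```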

In [None]:
### Training code for binary classification problems

# define loss function and optimizer settings
criterion = nn.BCEWithLogitsLoss()          # NOTE: this is why we do not need a sigmoid at the end of the model
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9)  # or adam instead of SGD

for (image, label) in mnist_train:
    # get ground truth: is the digit less than 3?
    actual = torch.tensor(label < 3).reshape([1,1]).type(torch.FloatTensor)
    
    out = model(img_to_tensor(image))       # make prediction (through all layers)
    loss = criterion(out, actual)           # calculate loss
    loss.backward()                         # obtain gradients -> Get all gradients for all parameters in model (all layers)
    optimizer.step()                        # updates parameters -> Update all weights and biases 
                                                # based on SGD with lr and momentum (throughout all layers)
    optimizer.zero_grad()                   # a clean up step - important! This sets gradient to 0
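
To see why the clean-up step matters, the sketch below calls `backward()` repeatedly on a single made-up weight. PyTorch adds each new gradient to the stored one until it is reset:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()          # gradient of 2*w with respect to w

(2 * w).backward()
accumulated = w.grad.item()    # the second gradient was added to the first

w.grad.zero_()                 # what optimizer.zero_grad() does for every parameter
(2 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 2.0 4.0 2.0
```

Without the reset, each update would be based on the sum of all gradients seen so far rather than on the current example.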

### Training and Validation Error

**NOW, the model is trained** based on mnist_train.

We can assess model performance by tracking error rate and accuracy

In [None]:
# computing the error and accuracy on the training set
error = 0
for (image, label) in mnist_train:
    prob = torch.sigmoid(model(img_to_tensor(image)))                   # For each image, pass it to sigmoid to get prediction
    if (prob < 0.5 and label < 3) or (prob >= 0.5 and label >= 3):      # assume that class 0 < 0.5, class 1 >= 0.5
        error += 1
print("Training Error Rate:", error/len( mnist_train))
print("Training Accuracy:", 1 - error/len( mnist_train))

# computing the error and accuracy on the validation set
error = 0
for (image, label) in mnist_val:
    prob = torch.sigmoid(model(img_to_tensor(image)))
    if (prob < 0.5 and label < 3) or (prob >= 0.5 and label >= 3):
        error += 1
print("Validation Error Rate:", error/len(mnist_val))
print("Validation Accuracy:", 1 - error/len(mnist_val))
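
The two loops above differ only in the dataset, so one option is to wrap the logic in a helper. This is a sketch under assumptions: `binary_error_rate`, `AlwaysSmall` and `fake_data` are hypothetical names, and the data items are already tensors rather than PIL images:

```python
import torch
import torch.nn as nn

def binary_error_rate(model, data):
    """Fraction of (image_tensor, label) pairs where the prediction disagrees with `label < 3`."""
    errors = 0
    with torch.no_grad():                        # gradients are not needed for evaluation
        for image, label in data:
            prob = torch.sigmoid(model(image))   # probability of the "small digit" class
            predicted_small = prob.item() >= 0.5
            if predicted_small != (label < 3):
                errors += 1
    return errors / len(data)

# Stand-in model and data, for illustration only
class AlwaysSmall(nn.Module):
    def forward(self, img):
        return torch.ones(1, 1)                  # positive logit: always predicts "small"

fake_data = [(torch.zeros(1, 28, 28), digit) for digit in range(10)]
rate = binary_error_rate(AlwaysSmall(), fake_data)
print(rate)  # 0.7, since only digits 0, 1, 2 are actually small
```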

# QUESTION:

- `model(img_to_tensor(image))` produces the raw output that was previously compared against `actual` to compute the loss. Why do we need to pass it through `torch.sigmoid` before thresholding at 0.5?

***
## Multi-Class Classification

Classify each image as its corresponding digit (0-9)

Use one-hot encoding for the class labels

This requires minor changes to our PyTorch code:

1. The final output layer has **as many neurons as classes.**
2. Apply the **softmax activation function** on the final layer to obtain class probabilities
3. Use the **multiclass cross-entropy** loss function
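
For reference, a one-hot encoding of a few made-up digit labels can be sketched with `F.one_hot`:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 3, 9])               # made-up digit labels
one_hot = F.one_hot(labels, num_classes=10)    # one row per label, one column per class
print(one_hot)
```

Note that `nn.CrossEntropyLoss` accepts integer class indices as targets, so an explicit one-hot tensor is usually not needed when training with that criterion.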

In [None]:
class MNISTClassifier(nn.Module):
    def __init__(self):
        super(MNISTClassifier, self).__init__()
        self.layer1 = nn.Linear(28 * 28, 50)    # the widths (50, 20) and the depth (3 layers) are hyperparameter choices
        self.layer2 = nn.Linear(50, 20)
        self.layer3 = nn.Linear(20, 10)         # one output neuron per digit
        
    def forward(self, img):
        flattened = img.view(-1, 28 * 28)
        activation1 = F.relu(self.layer1(flattened))
        activation2 = F.relu(self.layer2(activation1))
        output = self.layer3(activation2)
        return output                       # NO SOFTMAX FUNCTION here?
model = MNISTClassifier()

### Loss Function and Softmax Activation

You would expect to see the softmax activation applied to the output layer. Indeed, a log-softmax output would be needed if we used:

    `criterion = nn.NLLLoss()`        (expects log-probabilities, e.g. from `F.log_softmax`)

For numerical stability, we will use: (**returns the loss, but has a softmax embedded**)

    `criterion = nn.CrossEntropyLoss()`     --> RECOMMENDED

$\implies$ Applies (log-)softmax internally, so the model returns raw scores (logits)!
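
The relationship between the two criteria can be checked numerically. This sketch uses made-up logits for a 3-class problem and compares `nn.CrossEntropyLoss` on raw scores with `nn.NLLLoss` on explicit log-softmax outputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])     # made-up raw scores for 2 examples
targets = torch.tensor([0, 2])               # correct class index per example

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())
```

The two losses should match up to floating-point error.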

### Output Probabilities

To obtain output probabilities we have to apply the softmax:

In [None]:
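output = model(img_to_tensor(mnist_train[0][0]))   # raw scores (logits) for one image, shape (1, 10)
prob = F.softmax(output, dim=1)                    # softmax over the class dimension
print(prob)
print(prob[0].sum())                               # the probabilities sum to 1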
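
If only the predicted class is needed, a sketch like the following takes the `argmax` of the probabilities. The scores are made-up values for a 3-class case:

```python
import torch
import torch.nn.functional as F

output = torch.tensor([[1.0, 3.0, 0.5]])   # made-up raw scores for one example
prob = F.softmax(output, dim=1)
pred = prob.argmax(dim=1)                  # index of the most probable class

print(pred)  # tensor([1])
```

Softmax preserves the ordering of the scores, so `output.argmax(dim=1)` gives the same prediction without computing probabilities.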

***
## Evaluation and Debugging

### Debugging Neural Networks

- Make sure your model can overfit
    - Make sure you can get the loss to decrease on the training data
- Make sure that your network is training, i.e. the loss is going down.
    - Sanity check!
- These checks ensure that you are using the right variable names, and rule out other programming bugs that are difficult to distinguish from architecture issues.
- Confusion Matrix
    - True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)
- 2D Projections of Data
    - PCA, t-SNE
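
A minimal sketch of the "can it overfit?" check, assuming a tiny synthetic dataset in place of MNIST. The random inputs, random labels, learning rate and step count are all illustrative choices:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.randn(8, 28 * 28)                  # 8 random "images" (synthetic stand-in)
y = torch.randint(0, 2, (8, 1)).float()      # random binary labels

model = nn.Sequential(nn.Linear(28 * 28, 30), nn.ReLU(), nn.Linear(30, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

initial_loss = criterion(model(x), y).item()
for _ in range(300):                         # repeatedly fit the same 8 examples
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
final_loss = criterion(model(x), y).item()

print(initial_loss, final_loss)              # the loss should drop clearly
```

If the loss does not go down even on a handful of examples, the problem is more likely a bug in the training code than a weakness of the architecture.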


### <mark>**Confusion Matrix**</mark>

(If we have 10 classes, we will have a 10x10 matrix)

The correct predictions are on the diagonal

![Alt text](images/img39.png)

**NOTE**: Accuracy is only a reliable metric if the classes are roughly balanced. **Otherwise, it can be badly misleading.**

> e.g. We want to predict cancer: the minority class has cancer, the majority does not.  
> If the model predicts well for people who DO NOT have cancer but does badly for people who HAVE cancer,  
> the accuracy is still high, but we might still have **built a bad model**.
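
The cancer example can be made concrete with made-up numbers. This sketch assumes accuracy is computed as (TP + TN) / total, and the helper `confusion_counts` is a hypothetical name:

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

actual = [1] * 5 + [0] * 95     # 5 patients with cancer, 95 without (made-up numbers)
predicted = [0] * 100           # a model that always predicts "no cancer"

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)

print(tp, fp, tn, fn)  # 0 0 95 5
print(accuracy)        # 0.95
```

The model reaches 95% accuracy while missing every single cancer case, which is exactly the failure mode described above.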

### <mark>**F1 Score**</mark>

Good for imbalanced datasets

![Alt text](images/img40.png)

<mark>**Precision**</mark>: Of all examples predicted positive, the fraction that are actually positive: TP / (TP + FP)  

<mark>**Recall**</mark>: Of all examples that are actually positive, the fraction that were predicted positive: TP / (TP + FN)
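
A short sketch of these metrics, assuming the usual definition of F1 as the harmonic mean of precision and recall. The helper name `precision_recall_f1` and the counts are made up, and returning 0.0 when a denominator is zero is a convention chosen here for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 3 true positives, 1 false positive, 2 false negatives
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(p, r, f1)  # precision 0.75, recall 0.6, F1 about 0.667
```

For the always-negative model (TP = 0), this F1 is 0.0 even though its accuracy may look high.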

### MNIST 2D Visualization

If the visualization shows clear separation between the classes, it suggests that **your model is doing well**

![Alt text](images/img41.png)
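
One way to produce such a 2D projection is PCA via a singular value decomposition. This is a sketch on synthetic features, not the method used for the figure above. In practice `features` would hold flattened images or hidden-layer activations:

```python
import torch

torch.manual_seed(0)
features = torch.randn(100, 28 * 28)          # synthetic stand-in for 100 flattened images

centered = features - features.mean(dim=0)    # PCA assumes zero-mean columns
_, _, vh = torch.linalg.svd(centered, full_matrices=False)
projected = centered @ vh[:2].T               # coordinates along the top 2 principal components

print(projected.shape)  # torch.Size([100, 2])
```

The two columns of `projected` can then be drawn as a scatter plot, for example with `plt.scatter`, coloring each point by its digit label.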