# 03 - Neural Networks

This notebook explores building AI models, starting with a simple shallow neural network (NN). In the forward pass, a neural network sends input features through its layers: each layer multiplies its inputs by weights, offsets them by bias parameters, and applies a non-linear activation function. The output can be a single value, as in a binary classifier, or a combination of values. The difference between expected and actual outputs is fed back through a backward pass, and the parameters are updated by gradient descent to minimize the cost function, bringing the outputs closer to the expected values on later passes.
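
As a minimal sketch of the cycle described above, the following runs one forward pass, one backward pass, and one gradient descent update on a single layer. The toy data, the sigmoid activation, the step size of 0.1, and the names `forward` and `cost` are illustrative assumptions, not part of the notebook's model.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 4 samples with 3 features each, and binary labels.
x = torch.randn(4, 3)
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# One layer: weights W and bias b are the learnable parameters.
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def forward(inputs):
    # Linear step (multiply by weights, offset by bias), then a sigmoid non-linearity.
    return torch.sigmoid(inputs @ W + b)

def cost(pred, target):
    # Binary cross-entropy between predictions and expected labels.
    return torch.nn.functional.binary_cross_entropy(pred, target)

loss_before = cost(forward(x), y)
loss_before.backward()              # backward pass: gradients of the cost w.r.t. W and b

with torch.no_grad():               # gradient descent update with a small step size
    W -= 0.1 * W.grad
    b -= 0.1 * b.grad

loss_after = cost(forward(x), y)
print(loss_before.item(), loss_after.item())
```

With a small enough step, the cost after the update should be lower than before it, which is the behavior the rest of this notebook relies on.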

In [None]:
import torch

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using {device} device")

## Load Dataset

The data is labeled and separated into train, validation, and test sets. Training data is used to fit the layer parameters. Validation (dev) data helps select better model hyperparameters and detect overfitting (high variance). Test data estimates how well the model will perform on real, unseen data.

In [None]:
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
from collections import OrderedDict

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor()
     #,transforms.Normalize((0.5,), (0.5,))
    ])

train_ds  = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=transform
)

test_ds  = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=transform
)
print(train_ds)

### Split training set into training and validation

For datasets with fewer than 100K items, an 80% train / 20% dev split is generally a good choice. For larger datasets, both the dev and test ratios can be reduced.

In [None]:
RATIO_VALIDATION = 0.2

In [None]:
train_num = len(train_ds)
indices = list(range(train_num))
np.random.shuffle(indices)
split = int(np.floor(RATIO_VALIDATION * train_num))
val_idx, train_idx = indices[:split], indices[split:]
len(val_idx), len(train_idx)
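
An alternative to shuffling indices by hand is `torch.utils.data.random_split`. The sketch below uses a small fake dataset (an assumption, so that it runs without downloading anything) and a fixed seed of 42 chosen only for reproducibility.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for the labeled training set: 1000 fake 1x28x28 images.
fake_ds = TensorDataset(torch.zeros(1000, 1, 28, 28), torch.zeros(1000, dtype=torch.long))

n_val = int(0.2 * len(fake_ds))          # 20% held out for validation
n_train = len(fake_ds) - n_val           # remaining 80% used for training

generator = torch.Generator().manual_seed(42)   # fixed seed so the split is reproducible
train_part, val_part = random_split(fake_ds, [n_train, n_val], generator=generator)

print(len(train_part), len(val_part))
```

Both results are `Subset` objects that can be passed directly to a `DataLoader`.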

### Prepare dataloaders

In [None]:
BATCH_SIZE = 64

In [None]:
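from torch.utils.data import DataLoader, SubsetRandomSampler
# Use the shuffled index lists so training and validation batches come from disjoint samples
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, sampler=SubsetRandomSampler(train_idx))
val_dl   = DataLoader(train_ds, batch_size=BATCH_SIZE, sampler=SubsetRandomSampler(val_idx))
test_dl  = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=True)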

In [None]:
# Show sample
image, label = next(iter(train_dl))
print(image[0].shape, label.shape)
classes = ['T-shirt/top','Trouser','Pullover','Dress','Coat','Sandal','Shirt','Sneaker','Bag','Ankle Boot']
print(classes[label[0].item()])
plt.imshow(image[0].numpy().squeeze(), cmap='gray');

## Shallow Neural Network

A shallow NN has a single hidden layer: weights and biases are applied to the input, the result passes through an activation function, and is then sent to the output layer.

The hidden layer here is a fully connected layer with a Rectified Linear Unit (ReLU) non-linear activation.

The output layer activation is a softmax function, generally used for categorical classifiers, where each output is the scaled probability of the input matching that category. The model below uses `nn.LogSoftmax`, which outputs the logarithm of those probabilities.

![](../media/neural_networks/Shallow_Neural_network.jpg)
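
To make the output activation concrete, here is a small sketch comparing softmax and log-softmax on one made-up row of raw scores. The score values are arbitrary assumptions.

```python
import torch

# Hypothetical raw scores (logits) for one input over three categories.
logits = torch.tensor([[2.0, 1.0, 0.1]])

probs = torch.softmax(logits, dim=1)          # scaled to probabilities
log_probs = torch.log_softmax(logits, dim=1)  # same values nn.LogSoftmax(dim=1) would output

print(probs.sum(dim=1))        # each row of probabilities sums to 1
print(probs.argmax(dim=1))     # index of the most likely category
# exp() recovers the probabilities from the log-probabilities
print(torch.allclose(torch.exp(log_probs), probs))
```

This is why the evaluation code later in the notebook applies `torch.exp` to the model output before picking the top class.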

In [None]:
# We are classifying against 10 classes
OUTPUTS = 10

In [None]:
# Number of units (neurons) in the single hidden layer of this shallow neural network.
HIDDEN_PARAMETERS = 128

# Total input features: color channels * height * width of the image, in pixels
input_features = image[0].shape[0] * image[0].shape[1] * image[0].shape[2]

In [None]:
from torch import nn

In [None]:
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_PARAMETERS)), # Fully connected NN
                                   ('relu1', nn.ReLU()), # Activation function
                                   ('output', nn.Linear(HIDDEN_PARAMETERS, OUTPUTS)), # Fully connected NN
                                   ('logsoftmax', nn.LogSoftmax(dim=1))])) # Log-softmax activation for categorization
# Use GPU if available
model = model.to(device)

In [None]:
print(model)

### Loss Function (Criterion) and Optimizer

In [None]:
# The learning rate is a model hyperparameter. Values that are too large can fail to minimize the cost function; values that are too small need more iterations to converge.
LEARNING_RATE = 0.003
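
The trade-off in the comment above can be seen on a toy problem. This sketch minimizes `f(w) = w**2` with plain gradient descent; the function, the three learning rates, and the name `gradient_descent` are illustrative assumptions unrelated to the model in this notebook.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = gradient_descent(lr=0.001)   # barely moves toward the minimum at 0
good = gradient_descent(lr=0.1)      # gets close to 0
large = gradient_descent(lr=1.1)     # overshoots more on every step and diverges

print(abs(small), abs(good), abs(large))
```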

In [None]:
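# Select our loss function
# Cross-entropy loss is the standard loss function for multi-class classification.
# Note: CrossEntropyLoss applies log-softmax internally. The model already ends in
# LogSoftmax, so the extra step is redundant but does not change the result.
loss_fn = torch.nn.CrossEntropyLoss()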
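
As a quick check of how this loss relates to the model's `LogSoftmax` output, the sketch below compares `CrossEntropyLoss` on raw scores, `NLLLoss` on log-probabilities, and `CrossEntropyLoss` on log-probabilities. The random scores and target indices are made-up assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 10)              # hypothetical raw scores: 5 samples, 10 classes
targets = torch.tensor([0, 3, 9, 1, 7])  # hypothetical true class indices

log_probs = nn.LogSoftmax(dim=1)(logits)

ce_on_logits = nn.CrossEntropyLoss()(logits, targets)
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)  # log-softmax applied twice

print(ce_on_logits.item(), nll_on_log_probs.item(), ce_on_log_probs.item())
```

All three values agree, because applying log-softmax to values that are already log-probabilities leaves them unchanged.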

In [None]:
# Select our optimizer
# Stochastic Gradient Descent (SGD) is the classic baseline optimizer
optimizer = torch.optim.SGD(model.parameters(), lr = LEARNING_RATE)

### Model training

### Training Loop (Explained)

Training a neural network involves iteratively updating its weights to minimize the loss function. This process is typically achieved using gradient descent optimization algorithms. Here's an in-depth explanation of the training loop:

1. **Epochs**: An epoch represents one complete forward and backward pass over all the training examples. The number of epochs (`n_epochs`) is the number of times the learning algorithm works through the entire training dataset. It is a hyperparameter chosen by the user.

2. **Model Training Mode**: Neural networks can operate in different modes - training and evaluation. Some layers, like dropout, behave differently in these modes. Setting the model to training mode ensures that layers like dropout function correctly.

3. **Batch Processing**: Instead of updating weights after every training example (stochastic gradient descent) or after the entire dataset (batch gradient descent), we often update weights after a set of training examples known as a batch (mini-batch gradient descent).

4. **Zeroing Gradients**: In PyTorch, gradients accumulate by default. Before calculating the new gradients in the current batch, we need to set the previous gradients to zero.

5. **Forward Pass**: The input data (images) are passed through the network, layer by layer, until we get the output. This process is called the forward pass.

6. **Calculate Loss**: Once we have the network's predictions (outputs), we compare them to the true labels using a loss function. This gives a measure of how well the network's predictions match the actual labels.

7. **Backward Pass**: To update the weights, we need to know the gradient of the loss function with respect to each weight. The backward pass computes these gradients.

8. **Update Weights**: The optimizer updates the weights based on the gradients computed in the backward pass.
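
Step 4 is easy to overlook, so here is a minimal sketch of gradient accumulation on a single scalar parameter. The parameter `w` and the function `w ** 2` are illustrative assumptions.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()          # d(w^2)/dw = 2w = 4
first = w.grad.item()

(w ** 2).backward()          # without zeroing, the new gradient is added to the old one
accumulated = w.grad.item()

w.grad.zero_()               # reset, the role optimizer.zero_grad() plays for every parameter
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

The second call doubles the stored gradient, which would distort the weight update if the gradients were not cleared before each batch.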

In [None]:
# Define the training and validation loop, parameterized by loss function and optimizer
def train_validate(model, loss_fn, optimizer, trainloader, testloader, device, n_epochs=25):
    train_losses = []
    test_losses = []
    for epoch in range(n_epochs):
        model.train() # Set mode to training - layers such as dropout are active in this mode
        train_epoch_loss = 0
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            # flatten the images to batch_size x 784
            images = images.view(images.shape[0], -1)
            # forward pass
            outputs = model(images)
            # compute the loss
            train_batch_loss = loss_fn(outputs, labels)
            # backpropagation: clear old gradients, then compute new ones
            optimizer.zero_grad()
            train_batch_loss.backward()
            # Weight updates
            optimizer.step()
            train_epoch_loss += train_batch_loss.item()
        # One epoch of training complete
        # calculate average training epoch loss
        train_epoch_loss = train_epoch_loss/len(trainloader)

        # Now validate on the validation set
        with torch.no_grad():
            test_epoch_acc = 0
            test_epoch_loss = 0
            model.eval() # Set mode to eval - layers such as dropout are disabled in this mode
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)                    
                # flatten images to batch_size x 784
                images = images.view(images.shape[0], -1)
                # make predictions 
                test_outputs = model(images)
                # calculate validation loss; .item() stores a Python float instead of a tensor
                test_batch_loss = loss_fn(test_outputs, labels)
                test_epoch_loss += test_batch_loss.item()

                # get probabilities, extract the class associated with highest probability
                proba = torch.exp(test_outputs)
                _, pred_labels = proba.topk(1, dim=1)

                # compare actual labels and predicted labels
                result = pred_labels == labels.view(pred_labels.shape)
                batch_acc = torch.mean(result.float())
                test_epoch_acc += batch_acc.item()
            # One epoch of training and validation done
            # calculate average validation epoch loss (already a float, so it can be plotted)
            test_epoch_loss = test_epoch_loss/len(testloader)
            # calculate accuracy as the average of the per-batch accuracies
            test_epoch_acc = test_epoch_acc/len(testloader)
            # save epoch losses for plotting
            train_losses.append(train_epoch_loss)
            test_losses.append(test_epoch_loss)
            # print stats for this epoch
            print(f'Epoch: {epoch:02} -> train_loss: {train_epoch_loss:.10f}, val_loss: {test_epoch_loss:.10f}, ',
                  f'val_acc: {test_epoch_acc*100:.2f}%')
    # Finally plot losses
    plt.plot(train_losses, label='train-loss')
    plt.plot(test_losses, label='val-loss')
    plt.legend()
    plt.show()

In [None]:
# Number of epochs to run
EPOCHS = 25

In [None]:
# Train and validate
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS)

### Evaluate on test set

Once our model is trained, it's crucial to evaluate its performance on unseen data. We'll:

1. Generate predictions for the test set.
2. Compute the overall accuracy.
3. Optionally, examine the model's performance in detail using a confusion matrix and classification report (the function below reports accuracy only).

These tools will provide insights into specific areas where the model excels or might need improvement.

Note: We don't want to compute gradients, so we use `torch.no_grad()`.
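
As a sketch of the confusion matrix mentioned in step 3, the helper below counts true versus predicted classes. The function name `confusion_matrix` and the made-up labels are assumptions for illustration; in this notebook the predictions would come from the trained model on the test set.

```python
import torch

def confusion_matrix(true_labels, pred_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(true_labels.tolist(), pred_labels.tolist()):
        matrix[t, p] += 1
    return matrix

# Hypothetical labels and predictions for a 3-class problem.
true_labels = torch.tensor([0, 0, 1, 1, 2, 2])
pred_labels = torch.tensor([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(true_labels, pred_labels, num_classes=3)
accuracy = cm.diag().sum().item() / cm.sum().item()            # correct / total
per_class_recall = cm.diag().float() / cm.sum(dim=1).float()   # correct / true count per class

print(cm)
print(accuracy, per_class_recall)
```

Off-diagonal entries show which pairs of classes the model confuses, which a single accuracy number hides.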

In [None]:
# Evaluate on the test set
def test_validate(model, test_dl, device):
    with torch.no_grad(): # No need to calculate the backward pass for the test set.
        batch_acc = []
        model.eval()
        for images, labels in test_dl:
            images, labels = images.to(device), labels.to(device)
            # flatten images to batch_size x 784
            images = images.view(images.shape[0], -1)
            # make predictions and get probabilities
            proba = torch.exp(model(images))
            # extract the class associated with highest probability
            _, pred_labels = proba.topk(1, dim=1)
            # compare actual labels and predicted labels
            result = pred_labels == labels.view(pred_labels.shape)
            acc = torch.mean(result.float())
            batch_acc.append(acc.item())
        # report the mean of the per-batch accuracies
        print(f'Test Accuracy: {torch.mean(torch.tensor(batch_acc))*100:.2f}%')
        return batch_acc

In [None]:
test_validate(model, test_dl, device);

## Shallow NN Hyperparameters

We have some hyperparameters, defined as upper-case constants, that can be changed to see if we can get better and faster results. Let's change some of them here.

In [None]:
# Let's assume we change our loss function and optimizer to better suit our classification problem. If so, we can increase the learning rate to train faster.
LEARNING_RATE = 0.01

In [None]:
# Select our loss function
# Negative log likelihood loss, useful to train a classification problem. It expects log-probabilities, which the LogSoftmax output layer provides.
loss_fn = nn.NLLLoss()

In [None]:
# Select our optimizer
# Adam (Adaptive Moment Estimation) optimizer. A robust method that uses momentum and per-parameter adaptive step sizes, based on the history of past gradients, to speed up training.
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
# And because we might train faster, we assume we'll need fewer epochs
EPOCHS = 15

In [None]:
# Train with new parameters (note: this continues from the weights trained above; the model is not re-initialized)
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS)

In [None]:
test_validate(model, test_dl, device);

We increased our learning rate by a factor of about 3.3 (from 0.003 to 0.01) and reduced the number of epochs by 40%, but training shows signs of diverging rather than converging.

### Hidden parameters

The number of units in the hidden layer can also change, and might give different results. We'll reduce it by half. The test set accuracy will probably hold.
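
To see what halving the hidden layer does to model size, this sketch counts the weights and biases of a one-hidden-layer network. The helper name `count_parameters` is hypothetical, and the defaults of 784 input features and 10 outputs are assumptions matching flattened 1x28x28 images and 10 classes.

```python
from torch import nn

def count_parameters(hidden_units, input_features=784, outputs=10):
    """Count weights and biases of a one-hidden-layer fully connected network."""
    net = nn.Sequential(
        nn.Linear(input_features, hidden_units),
        nn.ReLU(),
        nn.Linear(hidden_units, outputs),
    )
    return sum(p.numel() for p in net.parameters())

print(count_parameters(128))   # 784*128 + 128 + 128*10 + 10 = 101770
print(count_parameters(64))    # 784*64 + 64 + 64*10 + 10 = 50890
```

Halving the hidden units roughly halves the total, since the first layer's weights dominate the count.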

In [None]:
HIDDEN_PARAMETERS = 64 # Decrease by factor of 2

In [None]:
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_PARAMETERS)),
                                   ('relu1', nn.ReLU()),
                                   ('output', nn.Linear(HIDDEN_PARAMETERS, OUTPUTS)),
                                   ('logsoftmax', nn.LogSoftmax(dim=1))]))
# Use GPU if available
model = model.to(device)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS)

In [None]:
test_validate(model, test_dl, device);

## Deep Neural Networks

Deep neural networks contain more than one hidden layer. Each layer is usually a linear transformation with weights and a bias, coupled with a non-linear activation function. Let's redefine our model with two hidden layers and keep all other hyperparameters the same.

![](../media/neural_networks/Deep_Neural_network.jpg)
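
Since each deeper model repeats the same Linear plus ReLU pattern, one possible way to avoid writing every layer by hand is a small builder function. The name `build_mlp` is a hypothetical helper, not something this notebook defines, and the input size of 784 is an assumption matching flattened 28x28 images.

```python
from collections import OrderedDict

import torch
from torch import nn

def build_mlp(input_features, hidden_sizes, outputs):
    """Stack a Linear+ReLU pair per hidden size, then a Linear output with LogSoftmax."""
    layers = OrderedDict()
    in_features = input_features
    for i, size in enumerate(hidden_sizes, start=1):
        layers[f'fc{i}'] = nn.Linear(in_features, size)
        layers[f'relu{i}'] = nn.ReLU()
        in_features = size
    layers['output'] = nn.Linear(in_features, outputs)
    layers['logsoftmax'] = nn.LogSoftmax(dim=1)
    return nn.Sequential(layers)

deep_model = build_mlp(784, [64, 32], 10)
batch = torch.zeros(8, 784)            # hypothetical batch of 8 flattened images
print(deep_model(batch).shape)
```

Passing `[64, 48, 24]` instead would produce a three-hidden-layer variant with the same layer naming scheme.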

### Two layer DNN

In [None]:
HIDDEN_LAYER_PARAMETERS = [64, 32]

In [None]:
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_LAYER_PARAMETERS[0])),
                                   ('relu1', nn.ReLU()),
                                   ('fc2', nn.Linear(HIDDEN_LAYER_PARAMETERS[0], HIDDEN_LAYER_PARAMETERS[1])),
                                   ('relu2', nn.ReLU()),
                                   ('output', nn.Linear(HIDDEN_LAYER_PARAMETERS[1], OUTPUTS)),
                                   ('logsoftmax', nn.LogSoftmax(dim=1))]))
# Use GPU if available
model = model.to(device)
print(model)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS)

In [None]:
test_validate(model, test_dl, device);

### Three layer DNN

In [None]:
HIDDEN_LAYER_PARAMETERS = [64, 48, 24]

In [None]:
model = nn.Sequential(OrderedDict([('fc1', nn.Linear(input_features, HIDDEN_LAYER_PARAMETERS[0])),
                                   ('relu1', nn.ReLU()),
                                   ('fc2', nn.Linear(HIDDEN_LAYER_PARAMETERS[0], HIDDEN_LAYER_PARAMETERS[1])),
                                   ('relu2', nn.ReLU()),
                                   ('fc3', nn.Linear(HIDDEN_LAYER_PARAMETERS[1], HIDDEN_LAYER_PARAMETERS[2])),
                                   ('relu3', nn.ReLU()),
                                   ('output', nn.Linear(HIDDEN_LAYER_PARAMETERS[2], OUTPUTS)),
                                   ('logsoftmax', nn.LogSoftmax(dim=1))]))
# Use GPU if available
model = model.to(device)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [None]:
train_validate(model, loss_fn, optimizer, train_dl, val_dl, device, n_epochs = EPOCHS)

In [None]:
test_validate(model, test_dl, device);

We can modify our NN further or train it for longer, but you can probably see that our validation loss fluctuates a lot and that the model underperforms on the test set in comparison. We're overfitting our data. Next, we'll take some measures to prevent that.

**Next Notebook: [04-Regularization](04-Regularization.ipynb)**