# Demo of MLP Library Usage

## INSTRUCTIONS

For more detailed instructions on using MLPLibrary, consult the documentation. This demo gives a practical demonstration of MLPLibrary functions: constructing MLP models, setting hyperparameters, and evaluating training and testing performance on the data provided for COMP5329 Assignment 1.

### Installing MLPLibrary

As described in the README.md in the main folder, MLPLibrary may be installed locally as a portable library (similar to NumPy or PyTorch) by navigating into the network directory (i.e. ```'MLPLibrary/network/'```) and executing the following statement:
```
pip install ..
```
This will install all the required dependencies stated in the ```setup.py``` file and allow usage of our library functions anywhere on your device locally.

After this step, you may perform the following imports. If you decide not to install MLPLibrary, make sure you are within the main folder of MLPLibrary.

## IMPORTING MLPLibrary

In [None]:
# Importing all the modules and functions related to constructing MLP network models

from network.net import Net                   # Net class, base class for constructing MLP networks
from network.layer import Linear              # Linear class, child class of parent class Layer 
from network.loss import CrossEntropyLoss     # CrossEntropyLoss class, child class of parent class Loss
from network.activ import ReLU, LeakyReLU     # ReLU, LeakyReLU classes, child classes of parent class Activation
from network.optim import SGD, Adam           # SGD, Adam classes, child classes of parent class Optimizer

In [None]:
# Importing all the modules and functions related to data processing including loaders for the assignment data

# Process module contains functions relating to data processing:
from network.loader.process import (
    train_test_split,        # Function to split data with chosen ratio, data can be shuffled
    normalize,               # Normalizes data to have mean of zero and unit variance
    standardize,             # Normalizes data to be between range 0-1, i.e. standardizes data
    one_hot,                 # One hot encoding: 100% prob of 2 is [0, 0, 1] with 3 classes
    pca                      # Reduces data to chosen K principal components
) 

# Data module for loading the assignment data
from network.dataset.source import (
    get_data_from_file,   # Loads assignment data from file (must be within main directory)
    get_data_from_url     # Loads assignment data from public GitHub repo that stores data
)

# Data loader module for automating processing of and loading of assignment data based on parameter selections
from network.loader.data_loader import load_train_val_test  # Parameter selections decide method of processing                            

### Importing Standard Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as pl
import pandas as pd 
import seaborn as sns

# setting random seed
np.random.seed(88)

### Example Data Loading and Preprocessing

#### Parameter Selections:

In [None]:
SOURCE_DATA = "url"          # Choose "url" or "file" to source the assignment data ("file" requires the data in the main folder)
NORM_METHOD = "standardize"  # Choose "standardize", "normalize" or "none" ("none" applies no normalization)
PCA_N_COMPONENTS = 0         # If PCA_N_COMPONENTS > 0:
                             #   - normalization is skipped, as mean centering is applied before PCA
                             #   - PCA_N_COMPONENTS must not exceed the number of dimensions of the input dataset
N_CATEGORIES = 10            # If N_CATEGORIES > 0, categorical one-hot encoding is applied to the label data
SPLIT_RATIO = 0.2            # Ratio used when obtaining a train/test split (default is 0.2)
SHUFFLE_DATA = True          # If True, the data is shuffled prior to splitting by taking random indices
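
The two scaling options and the PCA step selected above can be illustrated with plain NumPy. The sketch below is not MLPLibrary's implementation: `zscore`, `minmax` and `pca_project` are hypothetical helper names, and the library's own `normalize`, `standardize` and `pca` may differ in detail. It shows zero-mean/unit-variance scaling, scaling into the 0-1 range, and the mean centering that is applied before projecting onto principal components.

```python
import numpy as np

def zscore(X, eps=1e-12):
    """Scale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def minmax(X, eps=1e-12):
    """Scale each feature into the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)

def pca_project(X, k):
    """Project X onto its first k principal components (k <= n_features)."""
    Xc = X - X.mean(axis=0)                      # mean-centre each feature
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(88)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 8))   # made-up data

Z = zscore(X_demo)
M = minmax(X_demo)
P = pca_project(X_demo, k=3)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
print(M.min(), M.max())
print(P.shape)
```

Because the projection starts from mean-centred data, the projected columns also have zero mean, which is why a separate centering step beforehand would be redundant.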

#### Obtaining loaded and processed data based on the chosen parameter selections:

In [None]:
# Note: loading the data from the URL takes longer than loading it from file.
train_set, valid_set, test_set = load_train_val_test(
    source = SOURCE_DATA,
    method = NORM_METHOD,      
    pca_N = PCA_N_COMPONENTS,
    n_categories = N_CATEGORIES,
    ratio = SPLIT_RATIO,
    shuffle = SHUFFLE_DATA
)

With a ratio of 0.2 and shuffling enabled, the validation set is split from the training set by shuffling the data and taking random indices in proportion to the ratio.
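
A shuffled hold-out split of this kind can be sketched in a few lines of NumPy. This is an illustration only: `shuffled_split` and its `seed` argument are hypothetical and are not the signature of the library's `train_test_split`.

```python
import numpy as np

def shuffled_split(X, y, ratio=0.2, shuffle=True, seed=88):
    """Hold out a `ratio` fraction of the rows of (X, y)."""
    n = len(X)
    idx = np.arange(n)
    if shuffle:
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n)        # random ordering of the row indices
    n_held = int(round(n * ratio))      # size of the held-out split
    held, kept = idx[:n_held], idx[n_held:]
    return (X[kept], y[kept]), (X[held], y[held])

X_demo = np.arange(50).reshape(10, 5)   # row i starts with the value 5 * i
y_demo = np.arange(10)
(train_X, train_y), (val_X, val_y) = shuffled_split(X_demo, y_demo, ratio=0.2)
print(train_X.shape, val_X.shape)   # (8, 5) (2, 5)
```

Indexing the data and the labels with the same index array keeps each row paired with its label after shuffling.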

In [None]:
line_sep = "------------------------------------------------"
print(line_sep)
print(f"Shape of Training Data: {train_set[0].shape}")
print(f"Shape of Training Labels: {train_set[1].shape}")
print(line_sep)
print(f"Shape of Validation Data: {valid_set[0].shape}")
print(f"Shape of Validation Labels: {valid_set[1].shape}")
print(line_sep)
print(f"Shape of Test Data: {test_set[0].shape}")
print(f"Shape of Test Labels: {test_set[1].shape}")
print(line_sep + '\n')
print(f"First 10 one-hot encoded labels of test data:\n\n{test_set[1][:10]}")
print(f"\nEquivalent to: {[np.argmax(label) for label in test_set[1][:10]]}")
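
The relationship between one-hot rows and integer labels printed above can be reproduced on a toy example. The helper `one_hot_encode` below is a hypothetical stand-in written for this sketch, not the library's `one_hot` function.

```python
import numpy as np

def one_hot_encode(labels, n_classes):
    """Return an (n_samples, n_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int).ravel()
    encoded = np.zeros((labels.size, n_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

labels = np.array([2, 0, 1])
encoded = one_hot_encode(labels, n_classes=3)
decoded = np.argmax(encoded, axis=1)    # position of the 1 recovers the label
print(encoded[0])   # [0. 0. 1.]
print(decoded)      # [2 0 1]
```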

## CONSTRUCTING MLP MODELS

### OPTIMIZER SELECTION

##### Optimizer 1: Stochastic Gradient Descent
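
The cell below configures a learning rate, weight decay, momentum and a step schedule. As a rough guide to what these terms usually mean, here is a minimal sketch of one common formulation of SGD with momentum, L2-style weight decay and step decay. The functions `step_decay` and `sgd_momentum_step` are hypothetical, the momentum default of 0.9 is chosen for the illustration, and MLPLibrary's `SGD` class may implement the update differently.

```python
import numpy as np

def step_decay(base_lr, epoch, step_size=25, factor=0.95):
    """Learning rate after multiplying by `factor` once per `step_size` epochs."""
    return base_lr * factor ** (epoch // step_size)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0):
    """One update of SGD with momentum and L2-style weight decay."""
    grad = grad + weight_decay * w             # weight decay pulls weights towards zero
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

for epoch in (0, 24, 25, 50):
    print(epoch, step_decay(0.04, epoch))

# Minimise f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for epoch in range(100):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lr=step_decay(0.04, epoch))
print(np.abs(w).max())
```

With a step size of 25 and a factor of 0.95, the rate stays at 0.04 for epochs 0 to 24 and is multiplied by 0.95 at epoch 25 and again at epoch 50.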

In [None]:
LEARNING_RATE = 0.04
WEIGHT_DECAY = 0.001
MOMENTUM_TERM = 0.999
LR_SCHEDULER = "step"
STEP_TERMS = (25, .95)  # lr is multiplied by 0.95 (a 5% drop) every 25 epochs

sgd = SGD(
    learning_rate = LEARNING_RATE,
    weight_decay = WEIGHT_DECAY,
    momentum = MOMENTUM_TERM,
    lr_decay = LR_SCHEDULER,
    step_terms = STEP_TERMS,
)
print(f"Optimizer: {sgd} is initialized and ready to be deployed.")

##### Optimizer 2: Adam
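
The cell below sets the learning rate, the two beta terms and the stability epsilon. For reference, a minimal sketch of the standard Adam update with bias correction follows. `adam_step` is a hypothetical helper written for this illustration; the library's `Adam` class is assumed, not confirmed, to follow the same scheme.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.0008, beta1=0.9, beta2=0.999, eps=1e-9):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w1, m, v = adam_step(w, grad=w, m=m, v=v, t=1)
print(w - w1)   # each coordinate moves by roughly lr on the first step
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the step size is set by the learning rate rather than by the gradient's magnitude.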

In [None]:
# The learning rate can be set even smaller than the default of 0.001, which is
# allowed in adaptive learning rate methods such as Adam. Training takes longer
# to converge but may reach a lower validation loss. With lr = 0.001 the loss
# usually drops sharply, which is good to see but does not produce the best results.
LEARNING_RATE = 0.0008
BETA_TERM_ONE = 0.900
BETA_TERM_TWO = 0.999
EPS_STABILITY = 1e-09

adam = Adam(
    learning_rate = LEARNING_RATE,
    beta1 = BETA_TERM_ONE,
    beta2 = BETA_TERM_TWO,
    epsilon = EPS_STABILITY
)
print(f"Optimizer: {adam} is initialized and ready to be deployed.")

### MODEL ARCHITECTURE

##### Network Hyperparameters
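
The cell below pairs the network with `CrossEntropyLoss` and an L2 regularisation term. As a rough illustration of what such a criterion computes, the sketch below implements softmax cross-entropy with an optional L2 penalty in NumPy. The function names and the `0.5 * l2` scaling of the penalty are assumptions made for this example and may not match the library.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot_labels, weights=(), l2=0.0):
    """Mean cross-entropy over a batch plus an optional L2 penalty."""
    probs = softmax(logits)
    data_loss = -np.mean(np.sum(one_hot_labels * np.log(probs + 1e-12), axis=1))
    penalty = 0.5 * l2 * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

logits = np.zeros((4, 10))                 # an uninformative output over 10 classes
labels = np.eye(10)[[3, 1, 4, 1]]          # one-hot targets
loss = cross_entropy(logits, labels)
print(loss)                                # ln(10), about 2.3026
```

A loss near ln(10) is therefore what a 10-class network that has learned nothing would be expected to report.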

In [None]:
OPTIMIZER = sgd
CRITERION = CrossEntropyLoss()
BATCH_NORM = True
ALPHA_TERM = 0.9
L2_REG_TERM = 0.004

mlp = Net(
    optimizer = OPTIMIZER,
    criterion = CRITERION,
    batch_norm = BATCH_NORM,
    alpha = ALPHA_TERM,
    L2_reg_term = L2_REG_TERM
)

##### Adding Hidden Linear and Activation Layers 
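
When stacking layers as in the cell below, each layer's `indim` must equal the previous layer's `outdim`. The NumPy sketch below checks the 128 → 1024 → 64 → 32 → 10 chain with a plain forward pass. It is an illustration under assumptions: the He-style initialisation and the placement of inverted dropout after each ReLU are choices made for this example, not a description of how `Linear` handles its `dropout` argument.

```python
import numpy as np

rng = np.random.default_rng(88)
layer_dims = [(128, 1024), (1024, 64), (64, 32), (32, 10)]
params = [(rng.normal(0.0, np.sqrt(2.0 / i), size=(i, o)), np.zeros(o))
          for i, o in layer_dims]

def forward(x, params, dropout=0.0, training=False):
    """Linear -> ReLU for every layer except the last, which returns logits."""
    for idx, (W, b) in enumerate(params):
        x = x @ W + b
        if idx < len(params) - 1:
            x = np.maximum(x, 0.0)                       # ReLU
            if training and dropout > 0.0:
                keep = rng.random(x.shape) >= dropout    # inverted dropout mask
                x = x * keep / (1.0 - dropout)
    return x

batch = rng.normal(size=(5, 128))
logits = forward(batch, params)
n_params = sum(W.size + b.size for W, b in params)
print(logits.shape, n_params)   # (5, 10) 200106
```

A mismatch between adjacent dimensions would surface here as a shape error in the matrix product, which makes a quick forward pass a cheap sanity check before training.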

In [None]:
mlp.add(Linear(indim = 128, outdim = 1024, dropout = 0.2))
mlp.add(ReLU())
mlp.add(Linear(indim = 1024, outdim = 64, dropout = 0.2))
mlp.add(ReLU())
mlp.add(Linear(indim = 64, outdim = 32, dropout = 0.2))
mlp.add(ReLU())
mlp.add(Linear(indim = 32, outdim = 10))

mlp.set_name("SGD_small_network")
print(f"{mlp.model_name} is initialized and ready to be trained.")

### Plotting Helper Functions

In [None]:
def plot_results(stats):
    ep, tl, ta, vl, va = stats
    # Create the 2x2 grid directly; a separate pl.figure() call would leave an empty figure behind
    fig, ((ax1, ax2), (ax3, ax4)) = pl.subplots(2, 2, figsize = (10, 7))
    fig.suptitle(f'Training Results, best model found @ Epoch {ep}')

    ax1.plot(tl)
    ax1.set_title('Training Loss')

    ax2.plot(vl, 'tab:orange')
    ax2.set_title('Validation Loss')

    ax3.plot(ta, 'tab:green')
    ax3.set_title('Training Accuracy')

    ax4.plot(va, 'tab:red')
    ax4.set_title('Validation Accuracy')
    
    for ax in fig.get_axes():
        ax.label_outer()

    pl.show()

## NETWORK TRAINING

### TRAINING TILL CONVERGENCE
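
The cell below describes two stopping rules in its comments. One possible reading of them is sketched here in plain Python. `has_converged` is a hypothetical helper: the exact checks inside `train_convergence` are not shown in this demo, so the details are assumptions.

```python
def has_converged(train_losses, val_losses, threshold=1e-25, last_check=10):
    """Return True when either stopping rule is met (inputs are lists of floats)."""
    # Rule 1: relative improvement in training loss fell below the threshold
    if len(train_losses) >= 2:
        prev, curr = train_losses[-2], train_losses[-1]
        if (prev - curr) / max(abs(prev), 1e-12) < threshold:
            return True
    # Rule 2: the best validation loss is at least `last_check` epochs old
    if len(val_losses) > last_check:
        best_epoch = val_losses.index(min(val_losses))
        if len(val_losses) - 1 - best_epoch >= last_check:
            return True
    return False

improving = [1.0, 0.8, 0.6]
print(has_converged(improving, improving, last_check=3))              # False

train = [1.0, 0.9, 0.8, 0.7, 0.6]
stalled_val = [1.0, 0.5, 0.6, 0.7, 0.8]
print(has_converged(train, stalled_val, last_check=3))                # True

print(has_converged([1.0, 1.1], [1.0, 1.1], last_check=3))            # True
```

Under this reading, a threshold as small as 1e-25 means rule 1 fires essentially only when the training loss fails to decrease at all, so rule 2 does most of the work.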

In [None]:
# Convergence is considered achieved when either:
#   (1) the training loss is not reduced in the next model by the chosen percentage change; OR
#   (2) the minimum validation loss since the last best model is not beaten by any of the next N models
PERCENT_CHANGE_IN_LOSS = 1e-25
CHECK_LAST_N_MODELS = 10
BATCH_SIZE = 500
MAX_EPOCHS = 200
REPORTING_INTERVAL = 1

stats = mlp.train_convergence(
    train_set = train_set,
    valid_set = valid_set,
    batch_size = BATCH_SIZE,
    planned_epochs = MAX_EPOCHS,
    last_check = CHECK_LAST_N_MODELS,
    threshold = PERCENT_CHANGE_IN_LOSS,
    report_interval = REPORTING_INTERVAL
)

### Plotting Epoch-wise Loss and Accuracy Curves

In [None]:
plot_results(stats)

### Checking Accuracy of Best Model

In [None]:
# Loading best model found:

best_model = Net.load_model("model/" + mlp.model_name)
best_model.test_network(train_set, "train data")
best_model.test_network(valid_set, "valid data")
best_model.test_network(test_set, "test data")

### Can We Do Better?

Let's nudge training to see if we can obtain a better model in the next 50 epochs. The learning rate is also reset, which could allow us to escape a local minimum.

In [None]:
best_model.set_name("SGD_small_network_alt")

stats = best_model.train_network(
    train_set = train_set,
    valid_set = valid_set,
    batch_size = 250,        # reduced batch size to vary any possible patterns learned
    epochs = 50,
    report_interval = 1
)

### Plotting Epoch-wise Loss and Accuracy Curves

In [None]:
plot_results(stats)

### Let's see how our new model does

In [None]:
# Loading new best model found:

new_model = Net.load_model("model/SGD_small_network_alt")
new_model.test_network(train_set, "train data")
new_model.test_network(valid_set, "valid data")
new_model.test_network(test_set, "test data")

### Confusion Matrix Helper Function

In [None]:
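def confusion_matrix(pred, label, n_classes=None):
    # Rows index predicted classes, columns index true classes.
    # Size the matrix by the largest class index rather than the number of
    # unique values, so a class that is never predicted cannot cause an IndexError.
    if n_classes is None:
        n_classes = int(max(np.max(pred), np.max(label))) + 1
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for m, n in zip(pred, label):
        matrix[m, n] += 1
    return matrix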

In [None]:
# Plot the confusion matrix of training data
pred = new_model.predict(train_set[0], train_set[1].shape[1])
pred_train_labels = np.argmax(pred, axis=1)

matrix = confusion_matrix(pred_train_labels, np.argmax(train_set[1], axis=1))
matrix_df = pd.DataFrame(matrix, index = np.arange(10), columns = np.arange(10))

pl.figure(figsize = (10,7))
sns.heatmap(matrix_df, annot=True, fmt=".0f")  # plot the labelled DataFrame and show whole counts
pl.show()

In [None]:
# Plot the confusion matrix on test data
pred = new_model.predict(test_set[0], test_set[1].shape[1])
pred_test_labels = np.argmax(pred, axis=1)

matrix = confusion_matrix(pred_test_labels, np.argmax(test_set[1], axis=1))
matrix_df = pd.DataFrame(matrix, index = np.arange(10), columns = np.arange(10))

pl.figure(figsize = (10,7))
sns.heatmap(matrix_df, annot=True, fmt=".0f")  # plot the labelled DataFrame and show whole counts
pl.show()

In [None]:
print('Predicted labels:', pred_test_labels.shape, '\n', pred_test_labels)
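
Once a confusion matrix is available, summary metrics can be read directly from it. The sketch below uses made-up predictions and a hypothetical vectorised helper, `confusion_counts`, that follows the same convention as the helper above (rows are predicted classes, columns are true classes). It is an illustration, not part of MLPLibrary.

```python
import numpy as np

def confusion_counts(pred, label, n_classes):
    """Rows index predicted classes, columns index true classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (pred, label), 1)     # accumulate repeated index pairs
    return matrix

pred_demo = np.array([0, 1, 1, 2, 2, 2])    # made-up predictions
true_demo = np.array([0, 1, 2, 2, 2, 0])    # made-up ground truth
cm = confusion_counts(pred_demo, true_demo, n_classes=3)

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=0)       # column sums = true count per class
precision = np.diag(cm) / cm.sum(axis=1)    # row sums = predicted count per class
print(cm)
print(accuracy, recall, precision)
```

`np.add.at` is used instead of `matrix[pred, label] += 1` because plain fancy-index assignment counts a repeated (predicted, true) pair only once.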

## Using Adam

In [None]:
OPTIMIZER = adam
L2_REG_TERM = 0.001


mlp = Net(
    optimizer = OPTIMIZER,
    criterion = CRITERION,
    batch_norm = BATCH_NORM,
    alpha = ALPHA_TERM,
    L2_reg_term = L2_REG_TERM
)

mlp.add(Linear(128, 1024, dropout=0.4))
mlp.add(ReLU())
mlp.add(Linear(1024, 512, dropout=0.2))
mlp.add(ReLU())
mlp.add(Linear(512, 64, dropout=0.2))
mlp.add(ReLU())
mlp.add(Linear(64, 16, dropout=0.2))
mlp.add(ReLU())
mlp.add(Linear(16, 10))  


mlp.set_name("Adam_network")
print(f"{mlp.model_name} is initialized and ready to be trained.")

### TRAINING TILL CONVERGENCE

In [None]:
stats = mlp.train_convergence(
    train_set = train_set,
    valid_set = valid_set,
    batch_size = BATCH_SIZE,
    planned_epochs = MAX_EPOCHS,
    last_check = CHECK_LAST_N_MODELS,
    threshold = PERCENT_CHANGE_IN_LOSS,
    report_interval = REPORTING_INTERVAL
)

### Plotting Epoch-wise Loss and Accuracy Curves

In [None]:
plot_results(stats)

### Checking Accuracy of Best Model Found

In [None]:
# Loading best model found:

best_model = Net.load_model("model/" + mlp.model_name)
best_model.test_network(train_set, "train data")
best_model.test_network(valid_set, "valid data")
best_model.test_network(test_set, "test data")

### Can we do better?

In [None]:
best_model.set_name("Adam_network_alt")

stats = best_model.train_network(
    train_set = train_set,
    valid_set = valid_set,
    batch_size = 250,        # reduced batch size to vary any possible patterns learned
    epochs = 50,
    report_interval = 1
)

### Is there potential for improvement?

In [None]:
plot_results(stats)

### Let's see how our new model does

In [None]:
# Loading "new best model found":

new_model = Net.load_model("model/Adam_network_alt")
new_model.test_network(train_set, "train data")
new_model.test_network(valid_set, "valid data")
new_model.test_network(test_set, "test data")


### Confusion Matrix

In [None]:

# Plot the confusion matrix of training data
pred = new_model.predict(train_set[0], train_set[1].shape[1])
pred_train_labels = np.argmax(pred, axis=1)

matrix = confusion_matrix(pred_train_labels, np.argmax(train_set[1], axis=1))
matrix_df = pd.DataFrame(matrix, index = np.arange(10), columns = np.arange(10))

pl.figure(figsize = (10,7))
sns.heatmap(matrix_df, annot=True, fmt=".0f")  # plot the labelled DataFrame and show whole counts
pl.show()

In [None]:
# Plot the confusion matrix on test data
pred = new_model.predict(test_set[0], test_set[1].shape[1])
pred_test_labels = np.argmax(pred, axis=1)

matrix = confusion_matrix(pred_test_labels, np.argmax(test_set[1], axis=1))
matrix_df = pd.DataFrame(matrix, index = np.arange(10), columns = np.arange(10))

pl.figure(figsize = (10,7))
sns.heatmap(matrix_df, annot=True, fmt=".0f")  # plot the labelled DataFrame and show whole counts
pl.show()

##### Continuing training for 10 more epochs

In [None]:
best_model.set_name("Adam_network_alt")

stats = best_model.train_network(
    train_set = train_set,
    valid_set = valid_set,
    batch_size = 250,        # reduced batch size to vary any possible patterns learned
    epochs = 10,
    report_interval = 1
)

In [None]:
plot_results(stats)

new_model = Net.load_model("model/Adam_network_alt")
new_model.test_network(train_set, "train data")
new_model.test_network(valid_set, "valid data")
new_model.test_network(test_set, "test data")

## Other Models