In [None]:
### Install requirements
from os.path import isfile

repository   = "https://github.com/lmingari/olot-course.git"
requirements = "requirements-section2-1.txt"

if not isfile(requirements):
    !git clone {repository}
    %cd olot-course
    !pip install -r {requirements}

# 2.1 Training a Neural Network with PyTorch
***

## Supervised learning workflow

| A general workflow for supervised learning |
| --- |
| ![](figs/supervised.svg) |

## Implementation of a neural network in PyTorch: General overview

* __Python__ is a popular programming language that emphasizes code readability
* __PyTorch__ is a Python-based scientific computing package for deep learning
    
| Programming language | Deep learning libraries |
| ------ | ------- |
| ![](figs/python_logo.svg) | ![](figs/pytorch_logo.svg) |

| PyTorch typical workflow |
| --- |
| <img src="figs/pytorch-workflow.svg" width=600px> |

#### Required objects

We need to instantiate the four objects listed below. Together they define how the model learns from data. The __model__ predicts, the __criterion__ evaluates, the __optimizer__ improves, and the __dataloader__ keeps the data flowing:

* __Model__: Defines the architecture of the neural network (layers, activations, etc.). It transforms input data into predictions
* __DataLoader__: Provides an efficient way to feed data to the model in batches, with optional shuffling and parallel loading. It wraps around datasets for training and testing
* __Criterion (Loss Function)__: Measures how far the model’s predictions are from the target values. It’s used to guide the learning process by computing an error signal
* __Optimizer__: Updates the model’s parameters based on the gradients computed from the loss. It implements the learning rule (e.g., SGD, Adam)

## A multilayer perceptron

A __perceptron__ is the basic unit of a neural network. It combines its inputs through weighted connections and a bias, then passes the result through an activation function $\varphi$. In the classic perceptron, $\varphi$ is a step function, so the output is binary:

| Single perceptron unit   |
| ------------------------ |
| <img src="figs/perceptron.svg" width=400/> |

| Perceptron unit output  |
| ----------------------- |
| $$ f(x_0, \dots, x_n) = \varphi \left( \sum_{i=0}^{n} w_i x_i \right) $$ |
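
The formula above can be sketched in a few lines of plain Python. This is only an illustration: it assumes the common convention that $x_0 = 1$, so that $w_0$ plays the role of the bias, and it uses a step function for $\varphi$. The weights are hypothetical and were chosen so that the unit behaves like a logical AND.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, activation=step):
    """Return activation(sum_i w_i * x_i).

    Assumption: x[0] == 1, so w[0] acts as the bias.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    return activation(z)

# Hypothetical weights implementing a logical AND of x1 and x2
w = [-1.5, 1.0, 1.0]  # w[0] is the bias weight

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron([1, x1, x2], w))
```

Only the input `(1, 1)` gives a weighted sum above zero, so it is the only case where the unit outputs 1.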

A **multilayer perceptron (MLP)** is a supervised learning model that learns
a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^o$ by training on a dataset,
where $m$ is the number of input features and $o$ is the number of outputs.
Given a set of features $X = \{x_1, x_2, \dots, x_m\}$ and a target $y$,
it can learn a non-linear function approximator for either classification or regression.

| Multilayer perceptron |
| ----------------------|
| <img src="figs/mlp.svg" width=600/> |
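
To make the idea of a non-linear function $f: \mathbb{R}^m \rightarrow \mathbb{R}^o$ concrete, here is a minimal sketch of the forward pass of an MLP with one hidden layer, written in plain Python. The parameters are hypothetical and hand-picked (not learned) so that the network computes $|x_1 - x_2|$, something a single linear layer cannot represent.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)

def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer: f(x) = w2 . relu(W1 x + b1) + b2."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Hypothetical parameters: m = 2 inputs, 2 hidden units, o = 1 output
W1 = [[1.0, -1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
b2 = 0.0

print(mlp_forward([3.0, 1.0], W1, b1, w2, b2))  # |3 - 1| = 2.0
print(mlp_forward([1.0, 4.0], W1, b1, w2, b2))  # |1 - 4| = 3.0
```

In practice the weights are not set by hand. They are found by training, which is what the rest of this section covers.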

[mlp]: https://scikit-learn.org/stable/modules/neural_networks_supervised.html "MLP"

## Implementing an MLP

Let's check the installation by importing the essential PyTorch modules and sub-modules:

In [None]:
import torch                                               # main PyTorch library for tensor computation
import torch.nn as nn                                      # building blocks for creating and training neural networks
import torch.optim as optim                                # implementation of various optimization algorithms
from torch.utils.data import Dataset, DataLoader           # dataset utilities

## 1. Loading raw data and feature scaling

Feature scaling is a method used to normalize the range of independent variables or features of data.
In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

* __Normalization__ (min-max scaling): rescales each feature to a fixed range, here $[0, 1]$

$$ X \to \dfrac{X - X_{min}}{X_{max} - X_{min}} $$

* __Standardization__: transforms data to have a mean of 0 and a standard deviation of 1

$$ X \to \dfrac{X-\mu}{\sigma} $$

where $\mu$ and $\sigma$ are the mean and standard deviation of each feature.
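
Both transformations can be sketched with a few tensor operations. The example below is illustrative only: the raw feature matrix is random and hypothetical, and the statistics are computed per feature (column-wise, `dim=0`).

```python
import torch

torch.manual_seed(0)
X = 10.0 * torch.rand(100, 4) + 5.0  # hypothetical raw features in [5, 15)

# Min-max normalization: each feature (column) is mapped to [0, 1]
X_min = X.min(dim=0).values
X_max = X.max(dim=0).values
X_norm = (X - X_min) / (X_max - X_min)

# Standardization: each feature gets zero mean and unit standard deviation
mu = X.mean(dim=0)
sigma = X.std(dim=0)
X_std = (X - mu) / sigma

print("Normalized min/max:", X_norm.min(dim=0).values, X_norm.max(dim=0).values)
print("Standardized mean/std:", X_std.mean(dim=0), X_std.std(dim=0))
```

When the data is later split into subsets, the usual practice is to compute `X_min`, `X_max`, `mu` and `sigma` on the training set only and reuse them for the other subsets.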

In [None]:
nfeatures = 4
ntargets  = 1
nexamples = 100

X = torch.rand(nexamples,nfeatures) ## Features
y = torch.ones(nexamples,ntargets)  ## Targets

print("Features array: ", X.shape)
print("Target vector: ", y.shape)

## 2. Create Dataset

Data splitting refers to the practice of dividing a dataset into distinct subsets to facilitate the training, testing, and evaluation of machine learning models. By separating the data, we ensure that the model is trained on one set, validated on another, and tested on a final, independent set.

1. __Training dataset__: Used to optimize the model parameters (learn weights)
2. __Validation dataset__: Used during training to monitor performance (hyperparameter tuning, early stopping, etc.)
3. __Test dataset__: Used once to evaluate the final chosen model for a fair (unbiased) reporting of model performance
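
One possible way to produce the three subsets is `torch.utils.data.random_split`. The sketch below is an illustration rather than part of the workflow used in this section: the 70/15/15 split and the seed are arbitrary assumptions, and `TensorDataset` is used only to keep the example self-contained.

```python
import torch
from torch.utils.data import TensorDataset, random_split

X = torch.rand(100, 4)  # hypothetical features
y = torch.ones(100, 1)  # hypothetical targets
full_dataset = TensorDataset(X, y)

# A seeded generator makes the random split reproducible
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    full_dataset, [70, 15, 15], generator=generator
)

print(len(train_set), len(val_set), len(test_set))
```

Each returned subset is itself a dataset, so it can be wrapped in its own `DataLoader`.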

In [None]:
class myDataset(Dataset):
    def __init__(self, X, y, transform = None):
        """
        X: tensor of shape (n_samples, n_features)
        y: tensor of shape (n_samples, n_targets)
        transform: optional callable applied to each feature vector
        """
        self.X = X
        self.y = y
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform:
            x = self.transform(x)
        return x, self.y[idx]

In [None]:
## Let's define our training dataset for now
dataset = myDataset(X,y)
len(dataset)

In [None]:
# Return the 4th data sample (new names, so the full X and y tensors are not overwritten)
x_sample, y_sample = dataset[3]
print(f'Features: {x_sample}')
print(f'Target: {y_sample}')

## 3. Create DataLoader

While training a model, we typically want to pass samples in "minibatches". `DataLoader` is an iterable that abstracts this complexity for us in an easy API:

In [None]:
# Create a data loader with batch_size of 16
loader = DataLoader(dataset, batch_size=16, shuffle=True)

In [None]:
for xb, yb in loader:
    print("**** Mini-batch of features: \n", xb)
    print("**** Mini-batch of targets: \n", yb)
    break

## 4. Define a model

In [None]:
# Define the model for a multilayer perceptron with a single hidden layer
class MultiLayerPerceptron(nn.Module):
    def __init__(self, nfeatures=4, nhidden=16):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(nfeatures, nhidden), # input layer -> hidden layer
            nn.ReLU(),
            nn.Linear(nhidden, 1),         # hidden layer -> output layer
        )

    def forward(self, x):
        return self.model(x)

In [None]:
from torchsummary import summary

# Instantiate model
model = MultiLayerPerceptron()
summary(model, (nfeatures,))

In [None]:
for xb, yb in loader:
    # Make a prediction
    yp = model(xb)
    print("**** Mini-batch of predictions: ", yp.shape)
    print("**** Mini-batch of targets: ", yb.shape)
    break

## 5. Loss function

Let's define a loss function based on the mean squared error:

$$ L = \dfrac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2 $$
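
As a quick sanity check, the formula can be evaluated by hand and compared with `nn.MSELoss`, whose default reduction is the mean over all elements. The targets and predictions below are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn

y  = torch.tensor([[1.0], [2.0], [3.0]])  # hypothetical targets
yp = torch.tensor([[1.5], [2.0], [2.0]])  # hypothetical predictions

# Direct evaluation of L = (1/n) * sum_i (y_i - yp_i)^2
manual = ((y - yp) ** 2).mean()

# Same quantity using the built-in criterion
builtin = nn.MSELoss()(yp, y)

print(manual.item(), builtin.item())
```

Here the squared errors are 0.25, 0 and 1, so both values equal $1.25 / 3$.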

In [None]:
# Creates a criterion that measures the mean squared error
criterion = nn.MSELoss()

In [None]:
# minibatch loop
for xb, yb in loader:
    # Make a prediction
    yp = model(xb)
    # Compute MSE loss
    loss = criterion(yp,yb)
    print("**** Averaged loss: ", loss)
    break

## 6. Optimizer

According to the _gradient descent (GD)_ optimization algorithm, the weights $\mathbf{w}$ are updated incrementally by taking a step in the direction opposite to the gradient of the loss:

$$ \mathbf{w}(t+1) = \mathbf{w}(t) - \eta \dfrac{\partial L}{\partial \mathbf{w}} (t) $$

where $\eta$ is the learning rate.

* The __learning rate__ is a key __hyperparameter__ in neural networks that controls how quickly the model learns during training
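
The update rule can be illustrated on a toy problem with a single weight. The loss $L(w) = (w - 3)^2$, the starting point and the learning rate below are assumptions made only for this sketch. The gradient is written out analytically instead of being computed by PyTorch.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Analytical derivative dL/dw of the toy loss."""
    return 2.0 * (w - 3.0)

eta = 0.1  # learning rate
w = 0.0    # initial weight

for t in range(50):
    # w(t+1) = w(t) - eta * dL/dw(t)
    w = w - eta * grad(w)

print(f"w = {w:.5f}, loss = {loss(w):.2e}")
```

With this learning rate the weight moves steadily towards 3. A much larger value (for example `eta = 1.1`) makes the iteration diverge on this problem, which is why the learning rate has to be chosen with care.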

| Gradient Descent optimization |
| --- |
| <img src="figs/optimization.svg" width=600/> |

In [None]:
# Create an Adam optimizer (an adaptive, gradient-based method) with learning rate 1E-3 = 0.001
optimizer = optim.Adam(model.parameters(), lr=1E-3)

## A template for the training routine

In [None]:
def train_epoch(model, loader, criterion, optimizer):
    # Set training mode
    model.train()
    
    total_loss = 0.0
    
    for xb, yb in loader:
        # Make a prediction
        yp = model(xb)
        # Compute MSE loss
        loss = criterion(yp,yb)

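        # Reset gradients, then backpropagate
        optimizer.zero_grad()
        loss.backward()

        # Update parameters
        optimizer.step()
        
        # Accumulate the summed loss of this mini-batch
        # (criterion returns the batch mean, so weight it by the batch size)
        total_loss += loss.item() * xb.size(0)
    # Mean loss per sample over the epoch
    return total_loss / len(loader.dataset)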

## Training loop

In [None]:
NEPOCHS = 500

for epoch in range(NEPOCHS):
    loss = train_epoch(model, loader, criterion, optimizer)
    if epoch%10 == 0:
        print(f"Epoch {epoch+1:03d} -> Train loss {loss:.3e}")

## Performing inference

In [None]:
## Set inference mode
model.eval()

## New, unseen inputs (a stand-in for a validation dataset)
X_val = torch.rand((10,nfeatures))

with torch.no_grad():
    yp = model(X_val)
yp

## Summary of hyperparameters

The following table lists the hyperparameters used in this lecture:

| Hyperparameter   | Description | Context | Value |
| ---------------- | :---------: | :-----: | :---: |
| _Number of layers_      | Depth of the network | Model Architecture | ? |
| _Neurons per layer_     | Width of each layer | Model Architecture | ? |
| _Activation functions_  | e.g., ReLU, sigmoid, tanh, ... | Model Architecture | ? |
| _Learning rate_, $\eta$ | Size of the update step            | Training | ? |
| _Optimizer_  | Optimization algorithm: SGD, Adam, etc.       | Training | ? |
| _Batch size_ | Samples processed for every update of weights | Training | ? |
| _Number of epochs_ | Iterations over the entire dataset      | Training | ? |

> &#9998; **Exercise:** <br>
> Complete the __Value__ column in the table with the hyperparameters used in this section