# Introduction

This notebook trains a simple multilayer perceptron (MLP) model on the data prepared in Notebook 1.

# Load Libraries

In [None]:
import pandas as pd
import numpy as np
import torch
from torch import nn, optim
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Load Data

First, the data prepared in Notebook 1 (data combined.csv) are loaded, and any rows with missing values are dropped (just in case any crept in!).

Next, we specify the inputs and outputs. Note that to_drop lists every column that is not a molecular descriptor. It also contains M, T and p, which are added back to the inputs through extra_tags. The inputs therefore consist of the molecular descriptors plus M, T and p.
The output is the density.

The output is divided by 1000 to convert the unit from kg/m3 to g/cm3 (this will be important in the PINN).
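The column selection and unit conversion described above can be sketched on a toy table. This is only an illustration: the descriptor names desc_a and desc_b and all numeric values are made up, and to_drop is shortened to the columns present in the toy data.

```python
import pandas as pd

# Toy stand-in for the combined dataset; 'desc_a' and 'desc_b' are hypothetical descriptors
toy = pd.DataFrame({
    'IL ID': [1, 2],
    'desc_a': [0.1, 0.4],
    'desc_b': [3.0, 5.0],
    'M g/mol': [284.2, 391.3],
    'T / K': [298.15, 323.15],
    'p / MPa': [0.1, 10.0],
    'ρ / kg/m3': [1210.0, 1385.0],
})

extra_tags = ['M g/mol', 'T / K', 'p / MPa']
to_drop = ['IL ID', 'T / K', 'p / MPa', 'ρ / kg/m3', 'M g/mol']

# Whatever survives the drop is treated as a molecular descriptor
molecular_descriptors = toy.drop(to_drop, axis=1).columns
# M, T and p are appended after the descriptors
inputs = list(molecular_descriptors) + extra_tags

X_toy = toy[inputs]
y_toy = toy[['ρ / kg/m3']] / 1000  # kg/m3 -> g/cm3

print(inputs)
print(y_toy)
```

Dropping M, T and p first and appending them afterwards fixes their position at the end of the input vector, regardless of where they sit in the CSV.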

In [None]:
df = pd.read_csv('./Data/data combined.csv', index_col = 0).dropna(axis = 0)
print('Loaded dataset:')
display(df)

extra_tags = ['M g/mol', 'T / K', 'p / MPa']
to_drop = ['Dataset ID', 'IL ID', 'Cation', 'Anion', 'Cationic family', 'Anionic family',
           'Excluded IL', 'Accepted dataset', 'T / K', 'p / MPa', 'ρ / kg/m3', 'SWMLR (v0) + FFANN (f)',
           'SWMLR (v0) + FFANN (f).1', 'FFANN (v0) + FFANN (f)', 'FFANN (v0) + FFANN (f).1',
           'LSSVM (v0) + FFANN (f)', 'LSSVM (v0) + FFANN (f).1', 'M g/mol']
molecular_descriptors = df.drop(to_drop, axis = 1).columns
inputs = np.hstack((molecular_descriptors, 
                    extra_tags))

outputs = ['ρ / kg/m3']

print(f'\n\nInputs {len(inputs)}:\n{inputs}')
print(f'\n\nOutputs {len(outputs)}:\n{outputs}')

X = df[inputs]
y = df[outputs]/1000

print('\n\nInput data:')
display(X)

print('\n\nOutput data:')
display(y)

# Scale and Split Data

The code defines two functions, scale and descale, for normalising and denormalising data. The scale function adjusts the input data by subtracting a shift value and dividing by a factor for each column, effectively normalising the data. The descale function reverses this process, restoring the original data values by multiplying by the factor and adding the shift.

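In the preparation section, a dictionary named scaler is created to store the shift and factor values for both the input features (X) and the output values (y). The shift is the column-wise minimum and the factor is the column-wise range (maximum minus minimum), so the scaled inputs and outputs lie in [0, 1]. scaler_range records this target range but is not used by scale or descale.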

The scaled data are then stored in the dataset dictionary. Input columns that are entirely NaN after scaling are dropped; this happens for constant columns, where the factor is zero. The dataset is split into training and testing sets with a 15% test size and a fixed random_state of 23, and the shapes of the resulting sets are printed to confirm the split.
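As a minimal sketch of the min-max scaling and its inverse, the toy example below applies the same arithmetic as scale and descale inline. The column names and values are assumptions chosen for illustration; the column named const never varies, so it shows why an all-NaN column can appear.

```python
import pandas as pd

# Toy data; column names are hypothetical. 'const' never varies.
X_toy = pd.DataFrame({'T / K': [300.0, 350.0, 400.0],
                      'const': [2.0, 2.0, 2.0]})

shift = X_toy.min(axis=0)                       # column-wise minimum
factor = X_toy.max(axis=0) - X_toy.min(axis=0)  # column-wise range

# Scale: (x - min) / (max - min); the constant column becomes 0/0 = NaN
X_scaled = (X_toy - shift) / factor
X_kept = X_scaled.dropna(axis=1, how='all')

# Descale: x_scaled * (max - min) + min, using only the surviving columns
X_back = X_kept * factor[X_kept.columns] + shift[X_kept.columns]

print(X_scaled)
print(X_back)
```

Indexing shift and factor by the surviving columns mirrors what scale and descale do with data_in.columns, so the round trip still works after columns have been dropped.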

In [None]:
def scale(data_in, shift, factor):
    """
    Scale the input data by shifting and dividing by a factor.
    
    Parameters:
    data_in (array-like or DataFrame): The input data to be scaled.
    shift (Series or DataFrame): The shift values for each column in the data.
    factor (Series or DataFrame): The factor values for each column in the data.
    
    Returns:
    DataFrame: The scaled data.
    """
    # Convert the input data to a DataFrame
    data_in = pd.DataFrame(data_in)
    
    # Scale the data by subtracting the shift and dividing by the factor for each column
    return (data_in - shift[data_in.columns]) / factor[data_in.columns]

def descale(data_in, shift, factor):
    """
    Descale the input data by multiplying by a factor and adding a shift.
    
    Parameters:
    data_in (array-like or DataFrame): The input data to be descaled.
    shift (Series or DataFrame): The shift values for each column in the data.
    factor (Series or DataFrame): The factor values for each column in the data.
    
    Returns:
    DataFrame: The descaled data.
    """
    # Convert the input data to a DataFrame
    data_in = pd.DataFrame(data_in)
    
    # Descale the data by multiplying by the factor and adding the shift for each column
    return data_in * factor[data_in.columns] + shift[data_in.columns]

# Prepare scaler
scaler = {}
scaler_range = (0, 1)
scaler['X_shift'] = X.min(axis=0)
scaler['X_factor'] = X.max(axis=0) - X.min(axis=0)
scaler['y_shift'] = y.min(axis=0)
scaler['y_factor'] = y.max(axis=0) - y.min(axis=0)

# Scale inputs X and output y
dataset = {}
dataset['X_scaled'] = scale(X, scaler['X_shift'], scaler['X_factor']).dropna(axis = 1, how = 'all')
dataset['Y_scaled'] = scale(y, scaler['y_shift'], scaler['y_factor'])
print('Scaled input data size:', dataset['X_scaled'].shape)
print('Scaled outputs data size:', dataset['Y_scaled'].shape)

# Split data into training and testing
dataset['X_train_scaled'], dataset['X_test_scaled'], dataset['Y_train_scaled'], dataset['Y_test_scaled'] = train_test_split(dataset['X_scaled'], dataset['Y_scaled'], test_size=0.15, random_state = 23)
print('Training size:', dataset['X_train_scaled'].shape, dataset['Y_train_scaled'].shape)
print('Testing size:', dataset['X_test_scaled'].shape, dataset['Y_test_scaled'].shape)

# Define Model and Training Functions

The code defines a simple multi-layer perceptron (MLP) model using PyTorch, along with functions for training and evaluating the model.

MLPModel_torch Class:

* Defines a basic MLP with two hidden layers, each containing 100 neurons, and an output layer with a single neuron for regression tasks.
* Uses ReLU activation functions in the hidden layers.
* Includes methods for initialisation and forward pass through the network.

closure Function:

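* Zeroes the gradients, makes predictions, calculates the mean loss and performs backpropagation. It is passed to optimizer.step, which then updates the parameters.
* Relies on the global variables model, optimizer, loss_fn, x_batch and y_batch.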

get_metrics Function:

* Computes various performance metrics for the model, including training and testing loss, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
* Returns the computed metrics as a list.

print_progress Function:

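* Prints the training progress, showing the current epoch's training and testing loss, MSE, and MAE.
* Marks epochs that achieve the best test loss so far with an asterisk; otherwise prints the number of epochs since the last improvement (read from the global variable last_loss).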
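To make the architecture concrete, here is a hedged sketch of the same layer sizes written with nn.Sequential, followed by a shape check. It is not the class used in this notebook; input_dim = 5 is a hypothetical feature count chosen only for the example.

```python
import torch
from torch import nn

input_dim = 5  # hypothetical number of input features

# Same layer sizes as described above, written with nn.Sequential for brevity
sketch = nn.Sequential(
    nn.Linear(input_dim, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)

x = torch.rand(4, input_dim)  # batch of 4 samples
with torch.no_grad():         # no gradients needed for a shape check
    y = sketch(x)

# Count trainable weights and biases across all three layers
n_params = sum(p.numel() for p in sketch.parameters())

print(y.shape)
print(n_params)
```

The output has one column per sample because the final layer has a single neuron and no activation, which suits a regression target such as the scaled density.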

In [None]:
class MLPModel_torch(nn.Module):
    """
    A simple multi-layer perceptron (MLP) model using PyTorch.
    
    Parameters:
    input_dim (int): The number of input features.
    """
    def __init__(self, input_dim):
        super(MLPModel_torch, self).__init__()
        # Define the first fully connected layer with 100 neurons
        self.dense1 = nn.Linear(input_dim, 100)
        # Define the second fully connected layer with 100 neurons
        self.dense2 = nn.Linear(100, 100)
        # Define the output layer with 1 neuron for regression output
        self.dense_density = nn.Linear(100, 1)

    def forward(self, x):
        """
        Forward pass through the network.
        
        Parameters:
        x (Tensor): Input tensor.
        
        Returns:
        Tensor: Output tensor.
        """
        # Apply ReLU activation function to the output of the first layer
        x = torch.relu(self.dense1(x))
        # Apply ReLU activation function to the output of the second layer
        x = torch.relu(self.dense2(x))
        # Generate the final output
        y = self.dense_density(x)
        return y

def closure():
    """
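    Compute the loss and its gradients for the current mini-batch.

    Intended to be passed to optimizer.step(closure); relies on the globals
    model, optimizer, loss_fn, x_batch and y_batch.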
    
    Returns:
    Tensor: The loss value.
    """
    # Zero the gradients of the optimizer
    optimizer.zero_grad()
    # Forward pass: compute predicted y by passing x_batch to the model
    y_pred = model(x_batch)
    # Compute the loss
    loss = loss_fn(y_pred, y_batch).mean()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    return loss

@torch.no_grad()  # metrics are only evaluated, so gradient tracking is not needed
def get_metrics(model, X_train, X_test, y_train, y_test, loss_fn):
    """
    Calculate and return various performance metrics for the model.
    
    Parameters:
    model (nn.Module): The trained model.
    X_train (Tensor): Training input data.
    X_test (Tensor): Testing input data.
    y_train (Tensor): Training target data.
    y_test (Tensor): Testing target data.
    loss_fn (function): Loss function.
    
    Returns:
    list: A list of metrics [train loss, test loss, train MSE, test MSE, train MAE, test MAE].
    """
    # Predict on the training data
    y_pred_train = model(X_train)
    # Compute the training loss
    loss_train = loss_fn(y_pred_train, y_train)
    
    # Predict on the testing data
    y_pred_test = model(X_test)
    # Compute the testing loss
    loss_test = loss_fn(y_pred_test, y_test)

    # Calculate Mean Squared Error (MSE) for training data
    MSE = torch.mean((y_pred_train - y_train) ** 2)
    # Calculate Mean Absolute Error (MAE) for training data
    MAE = torch.mean(torch.abs(y_pred_train - y_train))

    # Calculate Mean Squared Error (MSE) for testing data
    MSE_test = torch.mean((y_pred_test - y_test) ** 2)
    # Calculate Mean Absolute Error (MAE) for testing data
    MAE_test = torch.mean(torch.abs(y_pred_test - y_test))
    
    # Return the computed metrics as a list
    return [loss_train.mean().item(), 
            loss_test.mean().item(), 
            MSE.item(), 
            MSE_test.item(),
            MAE.item(),
            MAE_test.item()]

def print_progress(epoch, loss_info, best_loss):
    """
    Print the progress of training, including various metrics.
    
    Parameters:
    epoch (int): The current epoch number.
    loss_info (DataFrame): DataFrame containing the loss information.
    best_loss (float): The best loss achieved so far.
    
    Returns:
    None
    """
    # Check if the current epoch's test loss is better than the best loss
    if loss_info.loc[epoch, 'loss test'] < best_loss:
        # Print progress with an asterisk indicating a new best test loss
        print('{}   {:.3e}/{:.3e}   {:.3e}/{:.3e}   {:.3e}/{:.3e} *'.format(epoch,
            loss_info.loc[epoch, 'loss train'],
            loss_info.loc[epoch, 'loss test'],
            loss_info.loc[epoch, 'MSE train'], 
            loss_info.loc[epoch, 'MSE test'],
            loss_info.loc[epoch, 'MAE train'], 
            loss_info.loc[epoch, 'MAE test']))
    else:
        # Print progress followed by the number of epochs since the last
        # improvement (last_loss is a global set in the training loop)
        print('{}   {:.3e}/{:.3e}   {:.3e}/{:.3e}   {:.3e}/{:.3e}    {}'.format(epoch,
            loss_info.loc[epoch, 'loss train'],
            loss_info.loc[epoch, 'loss test'],
            loss_info.loc[epoch, 'MSE train'], 
            loss_info.loc[epoch, 'MSE test'],
            loss_info.loc[epoch, 'MAE train'], 
            loss_info.loc[epoch, 'MAE test'],
            epoch - last_loss))

# Train Model

This section trains the MLP model with mini-batch Adam updates and evaluates it after every epoch, using early stopping based on the test loss.

Firstly, the device is set to use GPU ('cuda') if available, allowing for faster computation. An instance of the MLPModel_torch class is then created and moved to the specified device. The Adam optimizer is set up with the model parameters and a learning rate of 0.0001, and the loss function is defined as Mean Squared Error (MSE) without reduction.
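The effect of defining the loss without reduction can be sketched as follows. With reduction='none', MSELoss returns one squared error per element, which is why the code calls .mean() explicitly before backpropagation. The tensors below are made-up values for illustration.

```python
import torch

loss_none = torch.nn.MSELoss(reduction='none')
loss_mean = torch.nn.MSELoss()  # default reduction='mean'

y_pred = torch.tensor([[1.0], [2.0], [4.0]])
y_true = torch.tensor([[1.0], [0.0], [2.0]])

# One squared error per element, same shape as the inputs
per_sample = loss_none(y_pred, y_true)

print(per_sample.shape)
print(per_sample.mean(), loss_mean(y_pred, y_true))
```

Keeping the per-sample values leaves room to weight or mask individual samples before averaging, although this notebook simply takes the plain mean.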

The training and testing data are converted to PyTorch tensors and moved to the specified device. A DataFrame is initialized to store loss information, including training and testing loss, Mean Squared Error (MSE), and Mean Absolute Error (MAE).

Several training parameters are defined: the number of epochs is set to 100,000, with early stopping patience set to 150 epochs. Initial values for the last improved epoch and best loss are set, and the batch size is defined as 50.

The training loop begins with shuffling the training data for each epoch. Mini-batch training is performed by dividing the data into batches and performing optimization steps on each batch. After each epoch, performance metrics are calculated and stored. The training progress is printed, highlighting if the current epoch achieved the best test loss so far. The model is saved if the current test loss is the best seen so far. Early stopping is implemented to terminate training if no improvement in test loss is observed for a specified number of epochs (patience).

Early stopping limits overfitting by halting training once the test loss stops improving, and the saved checkpoint corresponds to the epoch with the lowest test loss.
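The early stopping rule can be isolated from the training loop in a small sketch. The helper run_with_early_stopping and the loss values are hypothetical; they only replay a fixed sequence of per-epoch test losses through the same bookkeeping (best loss, epoch of last improvement, patience).

```python
def run_with_early_stopping(test_losses, patience):
    """Return (stop_epoch, best_epoch, best_loss) for a sequence of per-epoch test losses."""
    best_loss = float('inf')
    last_improved = 0
    for epoch, loss in enumerate(test_losses):
        if loss < best_loss:
            best_loss = loss
            last_improved = epoch  # a checkpoint would be saved here
        # Stop once 'patience' epochs have passed without improvement
        if epoch - last_improved >= patience:
            return epoch, last_improved, best_loss
    # Sequence exhausted before the patience limit was reached
    return len(test_losses) - 1, last_improved, best_loss

losses = [0.9, 0.5, 0.6, 0.4, 0.45, 0.41, 0.42, 0.43]
stop_epoch, best_epoch, best_loss = run_with_early_stopping(losses, patience=3)
print(stop_epoch, best_epoch, best_loss)
```

Note that the model at the stopping epoch is not the best one; the best weights are those saved at best_epoch, which is why the checkpoint is written on every improvement.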

In [None]:
# Set the device to GPU if available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize the model and move it to the specified device
model = MLPModel_torch(dataset['X_train_scaled'].shape[1]).to(device)

# Set up the optimizer (Adam) with the model parameters and a learning rate
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Define the loss function (Mean Squared Error) without reduction
loss_fn = torch.nn.MSELoss(reduction='none')

# Convert training and testing data to PyTorch tensors and move to the specified device
x_train = torch.tensor(dataset['X_train_scaled'].values, dtype=torch.float32, device=device)
y_train = torch.tensor(dataset['Y_train_scaled'].values, dtype=torch.float32, device=device)
x_test = torch.tensor(dataset['X_test_scaled'].values, dtype=torch.float32, device=device)
y_test = torch.tensor(dataset['Y_test_scaled'].values, dtype=torch.float32, device=device)

# Initialize a DataFrame to store loss information
loss_info = pd.DataFrame(columns=['loss train', 'loss test', 'MSE train', 'MSE test', 'MAE train', 'MAE test'])

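# Set training parameters
epochs = 100000      # maximum number of epochs
patience = 150       # epochs without test-loss improvement before stopping
last_loss = 0        # epoch at which the test loss last improved
batch_size = 50      # samples per mini-batch
best_loss = np.inf   # lowest test loss seen so far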

# Training loop
for epoch in range(epochs):
    # Shuffle the training data
    x_shuffle, y_shuffle = shuffle(x_train, y_train)
    
    # Mini-batch training
    for i in range(0, x_train.shape[0], batch_size):
        x_batch = x_shuffle[i:i + batch_size]
        y_batch = y_shuffle[i:i + batch_size]
        # Perform a single optimization step
        optimizer.step(closure)
    
    # Calculate and store metrics for the current epoch
    loss_info.loc[epoch, :] = get_metrics(model, x_train, x_test, y_train, y_test, loss_fn)
    # Print progress for the current epoch
    print_progress(epoch, loss_info, best_loss)
    
    # Save the model if the current test loss is the best seen so far
    if loss_info.loc[epoch, 'loss test'] < best_loss:
        torch.save(model.state_dict(), './Models/model_MLP.pt')
        best_loss = loss_info.loc[epoch, 'loss test']
        last_loss = epoch
        
    # Early stopping if no improvement in test loss for 'patience' epochs
    if epoch - last_loss >= patience:
        print('Terminated')
        break