# Artificial neural networks for QSAR
_by David Holmberg (February 2023)_
#### Dataset
For this exercise we will use the same dataset of aqueous solubility of 1142 diverse chemical compounds as you previously explored during the QSAR lab last week. However, here we will only use the PhysChem descriptors.

#### Modelling comparisons
1. Compare the results of linear regression to those of a simple neural network with no hidden layers and no non-linear activation functions
2. Compare the results of a random forest regressor, a support vector regressor, and a neural network with two hidden layers (with non-linear activations) and dropout.

#### Aims
* to see the link between neural networks and linear regression
* to learn the basics of how to define, train and evaluate neural networks via PyTorch.

#### Note
We will be using the open-source machine learning framework PyTorch (https://pytorch.org) for our neural networks. PyTorch was originally developed by Facebook's AI Research lab (now Meta AI) and is today one of the most widely used machine learning frameworks in research and industry. In PyTorch a model is defined as a subclass of `nn.Module`, and training is written as an explicit loop: forward pass, loss computation, backward pass and optimizer step. For the graph neural network at the end of the lab we will also use PyTorch Geometric (https://pytorch-geometric.readthedocs.io), an extension library built on top of PyTorch.

## Load packages

In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

# Pytorch and Pytorch Geometric
import torch as tch
import torch.nn as nn
import torch.optim as optim

from torch_geometric.nn import GCNConv
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives in torch_geometric.loader in PyG 2.x


# Helper libraries
from torchsummary import summary
import random
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


## Functions
Run these cells to have access to the necessary functions for the lab.

In [None]:

def plot_history(train_losses, val_losses, model_name):
    fig = plt.figure(figsize=(15, 5), facecolor='w')
    ax = fig.add_subplot(121)
    ax.plot(train_losses)
    ax.plot(val_losses)
    ax.set(title=model_name + ': Model loss', ylabel='Loss', xlabel='Epoch')
    ax.legend(['Train', 'Test'], loc='upper right')
    ax = fig.add_subplot(122)
    ax.plot(np.log(train_losses))
    ax.plot(np.log(val_losses))
    ax.set(title=model_name + ': Log model loss', ylabel='Log loss', xlabel='Epoch')
    ax.legend(['Train', 'Test'], loc='upper right')
    plt.show()
    plt.close()

## Load and preprocess data

#### Load and check shape of X and y

In [None]:
X = np.load('data/X_qsar.npy')
y = np.load('data/y_qsar.npy')
print(X.shape)
print(y.shape)

#### Split into training and test sets and standardize the data
Here we will just have a training and test set, so our results will not be quite as rigorous as those you got with cross-validation in the supervised machine learning lab.

In [None]:
n_train = int(len(y) * 0.7) # 70% of data for training and 30% for testing

random.seed(1234)
indices = np.arange(len(y))
random.shuffle(indices)

# X_train0 is our training data prior to standardization
X_train0, X_test0 = X[indices[:n_train]], X[indices[n_train:]]
y_train, y_test = y[indices[:n_train]], y[indices[n_train:]]

# standardize X_train0 and X_test0 to give X_train and X_test
scaler = StandardScaler().fit(X_train0)
X_train = scaler.transform(X_train0)
X_test = scaler.transform(X_test0)
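
To make the standardization step concrete, here is a minimal NumPy sketch of what `StandardScaler` does when it is fit on the training set only. The toy arrays (`X_train_toy`, `X_test_toy`) are invented for illustration and are not the lab data.

```python
import numpy as np

# Toy data, invented for illustration only (not the QSAR descriptors)
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_toy = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

# "fit": estimate per-column mean and standard deviation on the training set only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print(X_train_std.mean(axis=0).round(3))  # close to 0 by construction
print(X_test_std.mean(axis=0).round(3))   # near 0, but generally not exactly 0
```

Using the training statistics for the test set keeps information about the test data out of the preprocessing, which is why the scaler above is fit on `X_train0` alone.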

## Linear Regression, Random Forest Regressor & Support Vector Regressor
For comparison with the neural networks we explore later, we will first build three baseline models: a linear regression, a random forest regressor and a support vector regressor. For all three we will just use the default hyperparameter settings, which are often a good place to start. This means that you will just have () after the model definition, as in LinearRegression(). To change the hyperparameters from the defaults, specify them inside the parentheses.

The code cells for the random forest and support vector regressors have been left blank below. You should fill in these cells. You should define the models, fit them, make predictions from them, compute their MSEs and print out the results.

* hint 1: look to the cell where we 'Load packages' to get the right model definition for the two machine learning methods
* hint 2: look at the cell with Linear Regression. It should be similar.

In [None]:
#Linear Regression
LR_model = LinearRegression()
LR_model.fit(X_train, y_train)
LR_pred = LR_model.predict(X_test)
LR_mse = mean_squared_error(y_test, LR_pred)
print('Linear Regression: MSE = ' + str(np.round(LR_mse, 3)))

#### Random Forest Regressor

In [None]:
RF_model = RandomForestRegressor()
RF_model.fit(X_train, y_train)
RF_pred = RF_model.predict(X_test)
RF_mse = mean_squared_error(y_test, RF_pred)
print('Random Forest Regressor: MSE = ' + str(np.round(RF_mse, 3)))

#### Support Vector Regressor

In [None]:
SV_model = SVR()
SV_model.fit(X_train, y_train)
SV_pred = SV_model.predict(X_test)
SV_mse = mean_squared_error(y_test, SV_pred)
print('Support Vector Regressor: MSE = ' + str(np.round(SV_mse, 3)))

## Artificial neural network as a linear regression
If we define a neural network with no hidden layers and no non-linear activations we essentially get the same results as we do with basic linear regression. The results below should help clarify that (there are some minor differences, however, so the MSE for the neural network will not be _exactly_ the same as the result above for linear regression, but the two are nevertheless very close).

<p>
    <img src="figs/lin-reg.png" alt="drawing" style="width:1200px;"/>
    <center>Figure 1. Our neural network version of linear regression.</center>
</p>
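
The link to linear regression can be checked on a small example. The sketch below uses synthetic data (invented here, not the lab data) and plain NumPy: it fits ordinary least squares directly, then fits the same linear model by gradient descent on the MSE, which is what a network with a single linear layer does during training.

```python
import numpy as np

# Synthetic data, invented for illustration: y = 2*x0 - 1*x1 + 0.5 + small noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=200)

# 1) Ordinary least squares, with a column of ones acting as the bias term
X_aug = np.hstack([X_toy, np.ones((200, 1))])
w_ols, *_ = np.linalg.lstsq(X_aug, y_toy, rcond=None)

# 2) The same model as a "network": one linear layer, MSE loss, gradient descent
w_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    residual = X_aug @ w_gd - y_toy
    grad = 2.0 * X_aug.T @ residual / len(y_toy)  # gradient of the MSE w.r.t. the weights
    w_gd -= lr * grad

print(np.round(w_ols, 3))  # weights and bias from least squares
print(np.round(w_gd, 3))   # weights and bias from gradient descent
```

Both approaches minimise the same loss, so they arrive at the same weights when gradient descent is run to convergence. In the lab the network is trained for a fixed number of epochs with a different optimizer, which is one likely reason its MSE is close to, but not identical with, the linear regression result.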

#### Define model
You will need to convert the data into tensors and define the model in order to use neural networks. In this particular case you'll get some help, but take note of how it's done: you *will* have to do it yourselves in the rest of the assignment.

In [None]:
X_train_tensor = tch.tensor(X_train, dtype=tch.float32)
y_train_tensor = tch.tensor(y_train, dtype=tch.float32).unsqueeze(1)  # Add dimension for regression
X_test_tensor = tch.tensor(X_test, dtype=tch.float32)
y_test_tensor = tch.tensor(y_test, dtype=tch.float32).unsqueeze(1)

In [None]:

class ANN1(nn.Module):
    """Linear regression as a neural network: no hidden layers, no activations."""
    def __init__(self, input_dim):
        super(ANN1, self).__init__()
        # A single linear layer maps the descriptors straight to one output value
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.fc1(x)

#### Initialize the model and set optimizer and learning rate hyperparameters
The learning rate and optimizer chosen below can both be changed when exploring hyperparameter options and different architectures. Below we use a learning rate (lr) of 0.001 (a common default learning rate) and the 'Adam' optimizer.
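
To give a feel for what the optimizer does with the learning rate, here is a minimal pure-Python sketch of the Adam update rule applied to a one-dimensional toy problem. The function name `adam_minimize` and the toy objective are hypothetical, chosen for illustration; in the lab `optim.Adam` does this for every model parameter.

```python
import math

def adam_minimize(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a 1-D function with the Adam update rule (illustrative only)."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective (hypothetical): f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3)
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta0=0.0, lr=0.1, steps=1000)
print(round(theta_final, 3))
```

The learning rate scales every step, so a larger value moves faster but can overshoot, while a smaller value is more stable but needs more epochs.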

In [None]:
input_dim = X_train.shape[1]
ann_model = ANN1(input_dim)
summary(ann_model, input_size=(input_dim,))
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(ann_model.parameters(), lr=0.001)  # Adjust learning rate as needed


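#### Train the model
Here we train the model on the full training set: each epoch does one forward pass, computes the loss, backpropagates and takes one optimizer step. The test loss is recorded after every epoch so that it can be plotted later.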

In [None]:
num_epochs = 1000  # Adjust as needed
train_losses = []
val_losses = []
for epoch in range(num_epochs):
    ann_model.train()
    optimizer.zero_grad()
    outputs = ann_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())
    
    ann_model.eval()
    with tch.no_grad():
        val_outputs = ann_model(X_test_tensor)
        val_loss = criterion(val_outputs, y_test_tensor)
        val_losses.append(val_loss.item())
    
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Val Loss: {val_loss.item():.4f}')


#### Evaluate the model

In [None]:
# Evaluate the model
ann_model.eval()
with tch.no_grad():
    y_pred = ann_model(X_test_tensor)
    # Convert the tensors to NumPy arrays before handing them to scikit-learn
    ann_mse = mean_squared_error(y_test_tensor.numpy(), y_pred.numpy())
    print('ANN Regression: MSE = {:.3f}'.format(ann_mse))

# Plot training history
plot_history(train_losses, val_losses, 'ANN Regression')

## Going deeper with ANNs
In the cells below we define, train and evaluate a neural network model with:
* two hidden layers, each with 32 neurons and non-linear activations (relu)
* dropout after each hidden layer, with a dropout rate of 0.2

<p>
    <img src="figs/relu-activation.png" alt="drawing" style="width:500px;"/>
    <center>Figure 2. relu activation.</center>
</p>

Dropout can help to avoid overfitting, much as L1 and L2 regularizations do (as you explored in the supervised machine learning lab). In the model loss plots (below) this helps to keep the test loss from increasing as you train for more epochs.

Some quotes from a paper co-authored by members of our group called "Deep Learning in Image Cytometry: A Review" (https://onlinelibrary.wiley.com/doi/full/10.1002/cyto.a.23701):

"_Overfitting occurs when the parameters of a model fit too closely to the input training data, without capturing the underlying distribution, and thus reducing the model’s ability to generalize to other datasets_".

DROPOUT: "_A regularization technique that reduces the interdependent learning among the neurons to prevent overfitting. Some neurons are randomly “dropped,” or disconnected from other neurons, at every training iteration, removing their influence on the optimization of the other neurons. Dropout creates a sparse network composed of several networks—each trained with a subset of the neurons. This transformation into an ensemble of networks hugely decreases the possibility of overfitting, and can lead to better generalization and increased accuracy_".

<p>
    <img src="figs/dropout.png" alt="drawing" style="width:1200px;"/>
    <center>Figure 3. Dropout.</center>
</p>
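
The mechanism described in the quote can be sketched in a few lines of NumPy. This is a minimal illustration of "inverted" dropout, the variant in which the surviving activations are rescaled during training. The function name `dropout` and the toy activations are invented for this example.

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Inverted dropout: zero each element with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x  # at evaluation time dropout is the identity
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= p      # True with probability 1 - p
    return x * keep_mask / (1.0 - p)          # rescale so the expected value is unchanged

# Toy activations, invented for illustration
activations = np.ones((1000, 32))
dropped = dropout(activations, p=0.2, training=True, rng=np.random.default_rng(0))

print((dropped == 0).mean())   # fraction of activations zeroed, roughly 0.2
print(dropped.mean())          # roughly 1.0, thanks to the 1 / (1 - p) rescaling
print(np.array_equal(dropout(activations, training=False), activations))  # True
```

Because dropout behaves differently in training and evaluation, the training loops in this lab switch between `model.train()` and `model.eval()` before computing the training and test losses.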



In [None]:
class ANN2(nn.Module):
    def __init__(self, input_dim):
        super(ANN2, self).__init__()
        self.fc1 = nn.Linear(input_dim, 32)  # First hidden layer
        self.fc2 = nn.Linear(32, 32)         # Second hidden layer
        self.fc3 = nn.Linear(32, 1)          # Output layer: one predicted value per compound
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)       # Dropout rate of 0.2, active only in train() mode
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

In [None]:
# Initialize the ANN2 model
input_dim = X_train.shape[1]
ann2_model = ANN2(input_dim)
summary(ann2_model, input_size=(input_dim,))
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(ann2_model.parameters(), lr=0.001)  # Adjust learning rate as needed


In [None]:
num_epochs = 1000  # Adjust as needed
train_losses = []
val_losses = []
for epoch in range(num_epochs):
    ann2_model.train()
    optimizer.zero_grad()
    outputs = ann2_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())
    
    ann2_model.eval()
    with tch.no_grad():
        val_outputs = ann2_model(X_test_tensor)
        val_loss = criterion(val_outputs, y_test_tensor)
        val_losses.append(val_loss.item())
    
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Val Loss: {val_loss.item():.4f}')


In [None]:
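ann2_model.eval()
with tch.no_grad():
    y_pred = ann2_model(X_test_tensor)
    # Convert the tensors to NumPy arrays before handing them to scikit-learn
    ann2_mse = mean_squared_error(y_test_tensor.numpy(), y_pred.numpy())
    print('ANN2 Regression: MSE = {:.3f}'.format(ann2_mse))

# Plot training history
plot_history(train_losses, val_losses, 'ANN2 Regression')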

## Testing GNNs
So, now you've tested regression on molecular descriptors with ANNs. Another option that is gaining traction in research is Graph Neural Networks (GNNs), of which Graph Convolutional Networks (GCNs) are one widely used type. You will be using an extension library called PyTorch Geometric for this. Training and evaluation look much the same as before.
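
To show what a graph convolution computes, here is a minimal NumPy sketch of one GCN propagation step on a tiny hand-made graph. The three-atom chain, the feature values and the function name `gcn_layer` are all hypothetical; `GCNConv` implements the same kind of normalised neighbour aggregation, with learned weights.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: normalise (A + I), aggregate neighbours, apply W, then ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    deg = A_hat.sum(axis=1)                        # node degrees, including the self-loop
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU activation

# Toy "molecule" (hypothetical): 3 atoms in a chain 0-1-2, one feature per atom
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1.0]])                              # identity weight, to expose the aggregation

H_next = gcn_layer(H, A, W)
print(H_next.round(3))

# A graph-level prediction (e.g. solubility) needs a readout over atoms, here a mean
print(H_next.mean(axis=0))
```

Each atom's new feature is a weighted mix of its own value and its neighbours' values, so stacking layers lets information travel further along the bonds. Note that this requires a graph per molecule (atoms as nodes, bonds as edges), which is a different input from the descriptor vectors used for the ANNs above.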

In [None]:
class GNN1(nn.Module):
    def __init__(self, input_dim):
        super(GNN1, self).__init__()
        self.conv1 = GCNConv(input_dim, 32)
        self.conv2 = GCNConv(32, 32)
        self.fc3 = nn.Linear(32, 1)  # Output layer: one predicted value per node
        self.relu = nn.ReLU()
    
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.relu(self.conv1(x, edge_index))
        x = self.relu(self.conv2(x, edge_index))
        x = self.fc3(x)
        return x

In [None]:
# Initialize the GNN1 model
input_dim = X_train.shape[1]
gnn1_model = GNN1(input_dim)
print(gnn1_model)  # torchsummary feeds a plain tensor, but this model expects a graph Data object
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(gnn1_model.parameters(), lr=0.001)  # Adjust learning rate as needed

In [None]:
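num_epochs = 1000  # Adjust as needed
# ASSUMPTION: the dataset holds descriptor vectors only, not molecular graphs.
# As a placeholder, each compound is one isolated node (empty edge_index), so
# GCNConv only sees the self-loop it adds by default. Swap in real per-molecule
# graphs to get actual message passing between atoms.
no_edges = tch.empty((2, 0), dtype=tch.long)
train_data = Data(x=X_train_tensor, edge_index=no_edges)
test_data = Data(x=X_test_tensor, edge_index=no_edges)

train_losses = []
val_losses = []
for epoch in range(num_epochs):
    gnn1_model.train()
    optimizer.zero_grad()
    outputs = gnn1_model(train_data)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())
    
    gnn1_model.eval()
    with tch.no_grad():
        val_outputs = gnn1_model(test_data)
        val_loss = criterion(val_outputs, y_test_tensor)
        val_losses.append(val_loss.item())
    
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Val Loss: {val_loss.item():.4f}')

In [None]:
gnn1_model.eval()
with tch.no_grad():
    y_pred = gnn1_model(test_data)
    # Convert the tensors to NumPy arrays before handing them to scikit-learn
    gnn1_mse = mean_squared_error(y_test_tensor.numpy(), y_pred.numpy())
    print('GNN1 Regression: MSE = {:.3f}'.format(gnn1_mse))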