# Workshop 3 - ML Predictions of Aqueous Solubility

## Introduction

In this workshop, you will get hands-on practice applying some of the major machine learning (ML) models to a chemical dataset.

### The data

AqSolDB ([Sorkun et al.](https://doi.org/10.1038/s41597-019-0151-1)) is a curated dataset of experimentally-determined aqueous solubility values, with calculated descriptors for the molecules.

The paper gives details on how the data was acquired and processed, and on its availability on a number of platforms, including [GitHub](https://github.com/mcsorkun/AqSolDB).


### The task

Prepare the data for training and evaluating a set of machine learning models to predict the solubility of the compounds based on the features supplied (and others, if you would like to calculate additional descriptors as features).

You will use scikit-learn to train and evaluate the following Supervised Learning models:

- Linear regression
- Logistic regression
- k-Nearest neighbors

### Steps

1. Load the data
2. Perform some EDA to gain initial understanding of the distribution of features and relationships between features, and with the target.

For each model (some models may require additional stages):

3. Prepare the data 
4. Train the model
5. Make predictions
6. Evaluate performance

7. Analyse the performance of the models. Draw conclusions about the chemical problem, e.g. from the feature importances.


### Some Possibly Useful Reading

- The [scikit-learn documentation](https://scikit-learn.org/stable/) contains a huge range of examples
- The [Hundred-Page Machine Learning Book](https://themlbook.com/wiki/doku.php) goes through the ML methods we use here exceptionally clearly.
- The Course Book contains both a [Chapter](https://joeforth.github.io/chem502_book/ml-intro/ml-intro/) and a [Workbook](https://joeforth.github.io/chem502_book/ml-intro/ml-demo/) with examples relevant to this Workshop.

## 0. Import Modules

We'll be using a Python package called [scikit-learn](https://scikit-learn.org/stable/). It is a large, complex piece of software, so we'll import only the few modules we need from it.

In [None]:
from random import choice

import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt

from sklearn import linear_model

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
# If you do not have scikit-learn installed, uncomment the following line
# !conda install -y -c conda-forge scikit-learn

If you have issues with VSCode notebook cell outputs being truncated:

- Go to Settings (via menubar or cmd-, on Mac)
- Search for cell output settings: try @tag:notebookOutputLayout
- Adjust settings, e.g. scrolling, number of lines to display

## 1. Load and Clean the Data

First, perform some initial exploratory analysis of the dataset using some of the methods you've used in previous workshops.

In addition to looking for distribution and patterns in the data, look at what the columns actually contain. Some will include metadata about the source of the observation and its processing, which will not be relevant to the target variable.
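The checks described above can be sketched on a small made-up table before you apply them to the real file. This is only an illustration: the DataFrame `toy` and its column names (`ID`, `MolWt`, `NumRings`, `logS`) are assumptions for the example, not the actual AqSolDB columns, and dropping rows with missing values is just one possible way of handling them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; column names are illustrative only
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3", "A-4"],   # metadata, not a predictor
    "MolWt": [18.0, 46.1, np.nan, 78.1],  # contains one missing value
    "NumRings": [0.0, 0.0, 0.0, 1.0],     # whole numbers stored as floats
    "logS": [1.7, 1.1, -0.5, -1.6],
})

print(toy.dtypes)        # data type of each column
print(toy.isna().sum())  # number of missing values per column
print(toy.describe())    # summary statistics for the numeric columns

# Drop the metadata column and any rows with missing values,
# then store the ring count as an integer
clean = toy.drop(columns=["ID"]).dropna()
clean = clean.astype({"NumRings": int})
print(clean.dtypes)
```

The same calls (`dtypes`, `isna`, `describe`, `drop`, `astype`) work unchanged on a DataFrame loaded with `pd.read_csv`.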

In [None]:
# TODO: 
# 1. Check the data and load into a DataFrame
# 2. Check the data types
# 3. Check for missing values
# 4. Check summary statistics
# 5. Identify and drop redundant columns (e.g., those containing experiment metadata)
# 6. Identify and convert columns that should store data in a more appropriate format (e.g., int)

## 2. Perform Exploratory Data Analysis

Now get a feel for your dataset - perform exploratory data analysis to see what correlations and outliers there are.
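As a starting point for the correlation part of the analysis, here is a minimal sketch on synthetic data. The column names `feat_a`, `feat_b` and `target` are hypothetical; in the workshop you would use the real feature and solubility columns instead.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on feat_a and not on feat_b
rng = np.random.default_rng(0)
n = 200
feat_a = rng.normal(size=n)
feat_b = rng.normal(size=n)
toy = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_b,
    "target": 2.0 * feat_a + rng.normal(scale=0.5, size=n),
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

# Correlation of each feature with the target, strongest first
target_corr = corr["target"].drop("target").sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(target_corr)
```

The full matrix `corr` can be passed to `sns.heatmap(corr, annot=True)` for a visual overview. Remember that the Pearson coefficient only captures linear relationships, so scatter plots remain worth inspecting.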

In [None]:
# TODO:
# 1. Visualise the data to look for distributions of features, check for outliers
# 2. Visualise the data to look for correlations
# 3. Visualise the data to look for relationships between features and target

## Questions for Part 2

In your submission - add a block of markdown discussing each of these.

- Explain your approach to EDA for the dataset. What questions can this process answer, and how can it aid the subsequent analysis and modelling?
- What are the most significant correlations in this dataset? Discuss any strong relationships between the features and the target variable that are apparent.
- If you were selecting features from the data, are there any that you would remove? Explain why/why not.


#### Things to consider

- Note the distributions of values of features (e.g. the measures of centre, the magnitude and shape of the distribution and the range of the values).


## 3. Linear regression

The first model we will apply is a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

Linear regression models the relationship between input features and a continuous target variable using a linear function.

The function looks like:  

$$
y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n + \epsilon
$$

Where:  
- $ y $ = **Predicted output** (target variable)  
- $ x_1, x_2, \dots, x_n $ = **Input features** (independent variables)  
- $ w_0 $ = **Intercept** (bias term)  
- $ w_1, w_2, \dots, w_n $ = **Coefficients** (weights)  
- $ \epsilon $ = **Error term** (accounts for noise in data)  

The goal is to find the weights $w_{i}$ that minimise the error between the predictions and the actual values, typically using Ordinary Least Squares (OLS), which minimises the sum of the squared differences. With a single feature the result is a best-fit line; with several features it is a best-fit hyperplane.

It is widely used for trend analysis, forecasting, and understanding feature impact on outcomes.
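To make the least-squares idea concrete, the sketch below fits the same synthetic data in two ways: by solving the least-squares problem directly with NumPy, and with scikit-learn's `LinearRegression`. The data and the "true" weights (1, 3, -2) are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 1 + 3*x1 - 2*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Direct least squares: prepend a column of ones so w[0] is the intercept w_0
X_design = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The same fit using scikit-learn
model = LinearRegression().fit(X, y)

print("lstsq weights:  ", w)
print("sklearn weights:", model.intercept_, model.coef_)
```

Both approaches should recover weights close to the values used to generate the data, and they should agree with each other to numerical precision.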

You may find the contents of the [Notebook on Machine Learning](https://joeforth.github.io/chem502_book/ml-intro/ml-demo/) in the course book useful here.

## 3.1 Prepare data

To prepare the data, create a new dataframe containing only the numerical features of the AqSolDB dataset. Then, separate your data into the features (the predictor variables) and target (the variable you want to predict).

In [None]:
# TODO:
# 1 - Read the target column into a separate variable
# 2 - Read the feature columns into a different variable - remember to drop the target column


## 3.2 Create the training and test sets

Run [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to create separate training and test sets, with 20% of the samples in the test set.
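A minimal sketch of the call on a tiny made-up array, so that the resulting shapes are easy to check by hand. The arrays `X` and `y` here are placeholders, and `random_state=42` is an arbitrary choice that simply makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(50).reshape(25, 2)  # 25 samples, 2 features
y = np.arange(25)                 # 25 target values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 20% of 25 samples is 5, so 20 samples remain for training
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```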

In [None]:
# TODO 
# 
# 3 - Split the data into training (80%) and testing (20%) sets and check the size of the resulting datasets


((7985, 17), (1997, 17), (7985,), (1997,))

## 3.3 Training the model

It is time to train the first ML model.

You will need to create a new LinearRegression model and train it using its `fit` method on the training data's features.

The [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) on Linear Regression may be useful here.

In [1]:
# TODO: 4 - Create a linear regression model

# TODO: 5 - Fit the model to the training data


## 3.4 Test the model's performance on unseen data

You can now get the model to predict the solubilities for the subset of data you withheld for the test set.

In [None]:
# TODO: 6 - Predict the solubility of the test set


## 3.5 Evaluating the model's performance

We can visualise how closely the predicted solubility values match the real values for the training set, the test set, or both.

***
Hint - You will also need to generate predictions for the training set if you want to visualise them.
***

There are a variety of metrics that can be used to quantify the model's performance. 

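One commonly used metric for regression tasks is $r^2$, which expresses how well the model fits the data. A value of 1 indicates a perfect fit, and 0 corresponds to a model that always predicts the mean of the target. On unseen data the value can be negative if the model performs worse than predicting the mean.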
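The behaviour of this metric can be checked on a few hand-made numbers using scikit-learn's `r2_score`, which computes $1 - SS_{res}/SS_{tot}$. The values below are invented for the illustration. Note that the score can be negative when the predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions
perfect = r2_score(y_true, y_true)

# Always predicting the mean of the true values
mean_only = r2_score(y_true, np.full(4, y_true.mean()))

# Predictions in reverse order: worse than predicting the mean
poor = r2_score(y_true, y_true[::-1])

print(perfect, mean_only, poor)
```

Here the three scores are 1, 0 and -3 respectively: for the reversed predictions the residual sum of squares is 20 and the total sum of squares is 5, giving $1 - 20/5 = -3$.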

In [None]:
# TODO: 
# 
# 7 - Calculate r^2 value of how well your model fits the data
# 8 - Plot the predicted vs. actual values for the training set
# 9 - Plot the predicted vs. actual values for the test set


## Questions for Part 3

In your submission - add a block of markdown discussing each of these.

- What other metrics might be useful for evaluating the model's performance? Choose one other metric and calculate it for the model's performance on the test data. Briefly explain the form and meaning of the metric.
- Comment on the performance of the model on the training vs. the test data. Is there anything you can infer from the comparison?
- What information can you gain from the model coefficients? 

The [Book Section on ML](https://joeforth.github.io/chem502_book/ml-intro/ml-intro/) as well as the [Introduction to ML Notebook](https://joeforth.github.io/chem502_book/ml-intro/ml-demo/) may be helpful here. 

How could you use this information to improve the model or the training process?


## 4. Logistic Regression

[Logistic regression](https://www.geeksforgeeks.org/understanding-logistic-regression/) is used for binary classification: predicting which of two classes an input belongs to. See scikit-learn's [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for details of its `LogisticRegression` classifier. The Python you write is very similar to what you have seen for the linear regression task.

In the context of AqSolDB, we can convert solubility values (logS) into two classes:

- Soluble (1): logS at or above a chosen threshold (e.g., logS >= -2)
- Insoluble (0): logS below the threshold

This allows us to predict solubility as a classification problem.

### 4.1 Prepare data

Make a copy of the dataframe as it was after you dropped the non-numeric features.

You will need to add a new target variable based on the current `solubility` column, where the new column value is:

`1` if `logS >= -2`  

`0` if `logS < -2`
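The thresholding rule can be sketched on a few made-up values. The column names `logS` and `soluble` are placeholders for this example; use the names in your own DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"logS": [-5.2, -2.0, -1.9, 0.3]})

# The comparison gives True/False; casting to int gives 1/0
toy["soluble"] = (toy["logS"] >= -2).astype(int)
print(toy)
```

A value exactly at the threshold (logS = -2) is labelled `1`, matching the `>=` in the rule above.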


In [34]:
from sklearn.linear_model import LogisticRegression

In [None]:
# TODO: 
# 1. Create a copy of the original DataFrame with numeric columns only
# 2. Add a new column with binary solubility values
# 3. Drop the original solubility column

### 4.2 Separate features and target and test-train split

Follow the same process as for the linear regression and separate the target and feature columns.

Then split the data into training and testing sets. Make sure you run this with `stratify=<name of your target array>`. (What does [`stratify`](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms#:~:text=stratified%20train-test%20split) do?)
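To see the effect of `stratify` for yourself, a small experiment on an imbalanced, made-up target is a useful check. The arrays here are placeholders; 20% of the samples belong to class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 samples of class 0 and 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fraction of class 1 in each subset
print(y_train.mean(), y_test.mean())
```

Try removing `stratify=y` and changing `random_state` a few times to see how the class fractions in the two subsets vary without it.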

Both logistic regression and the next model (k-nearest neighbours) are sensitive to the scale of the features: scikit-learn's `LogisticRegression` applies regularisation by default, and k-NN relies on distances between samples. You therefore need to standardise your data before training these models. For this workshop, we recommend [scikit-learn's StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Some example code showing you how to do this is shown below:

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42, stratify=y_full)

# Create a scaler
scaler = StandardScaler()

# Fit the scaler to the training data
scaler.fit(X_train)

# Transform the training data
X_train_scaled = scaler.transform(X_train)

# Alternatively, you can fit the scaler and transform the training data (the two steps above) at once
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data
X_test_scaled = scaler.transform(X_test)


In [None]:
# TODO:
# 4. Separate features and target column
# 5. Split the data into training and testing sets - split before scaling
# 6. Scale the features using StandardScaler - fit the scaler on the training set only, then use it to transform both sets


In [None]:
# TODO: 
# 7. Create a logistic regression model
# 8. Fit the model to the training data
# 9. Predict the solubility of the test set
# 10. Calculate the accuracy of the model



The [`classification report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) provides a set of metrics for classification tasks.
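A minimal sketch of how the report and the related confusion matrix are produced, using hand-made labels rather than real model output. The class names passed to `target_names` are placeholders.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted labels for 10 samples
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
print(classification_report(y_true, y_pred, target_names=["insoluble", "soluble"]))
```

In this example 7 of the 10 labels are predicted correctly, so the accuracy is 0.7. The confusion matrix shows where the three errors fall.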

## Questions for Part 4

In your submission - add a block of markdown discussing each of these.

- Briefly explain the meaning of the metrics in the classification report.
- Compare the performance of the regression and classification models. Why might recasting a regression task as a classification task be useful for some types of problems?

## 5. k-NN classification

Over to you for this one. Here is the documentation for sklearn's [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

The workflow is very similar to the one you followed for the models you have already seen.

You can either stick with the binary classification or use the classes described in the AqSolDB paper:


| Category | logS range |
|------------|----------|
|**Highly soluble** | logS ≥ 0 |
|**Soluble** | 0 > logS ≥ -2 |
|**Slightly soluble**  | -2 > logS ≥ -4 |
|**Insoluble** | logS < -4 |
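One possible way to assign these four classes is `pd.cut`, sketched below on made-up values. Note the assumption in this sketch: a value that falls exactly on a boundary (0, -2 or -4) is assigned to the more soluble class, which is what `right=False` does here.

```python
import numpy as np
import pandas as pd

logS = pd.Series([-5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0])

# Bin edges from least to most soluble; with right=False each bin is [low, high)
bins = [-np.inf, -4, -2, 0, np.inf]
labels = ["Insoluble", "Slightly soluble", "Soluble", "Highly soluble"]

classes = pd.cut(logS, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"logS": logS, "class": classes}))
```

Calling `classes.value_counts()` is a quick way to check how balanced the resulting classes are before training.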


There are a few important points:


1. Make sure you use `stratify` when you split the data and pass it the full target array.
2. You must scale the features using StandardScaler after splitting.
3. k-NN has a hyperparameter, so you will need to use cross-validation to adjust the value of k. There is a quick tutorial [here](https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn).
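The three points above can be combined in one sketch. It uses synthetic data from `make_classification` rather than AqSolDB, and the candidate values of k and the use of `GridSearchCV` with a pipeline are illustrative choices, not the only way to do this. Putting the scaler inside the pipeline means it is refitted on the training portion of each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)

# Stratified split so both subsets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scaling and k-NN chained together; the step name is the lowercased class name
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}

# 5-fold cross-validation on the training set for each candidate k
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_acc = search.score(X_test, y_test)  # accuracy of the refitted best model
print(best_k, test_acc)
```

The test set is only used once, at the end, so the reported accuracy is not influenced by the choice of k.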

#### Devise your evaluation strategy for the k-NN model

Scikit-learn has a variety of methods to [measure and present](https://scikit-learn.org/stable/api/sklearn.metrics.html) model performance, e.g.

- classification report 
- confusion matrix

## Questions for Part 5

In your submission - add a block of markdown discussing each of these.

- Based on the models you have trained and tested, how would you decide which model is most appropriate for predicting solubility categories? Consider the evaluation metrics, feature selection, and any limitations you observed.

- Suggest one way that you could iteratively refine your approach (e.g. adjusting models, features, or preprocessing steps) to improve predictive performance.

## 6. Summary

As you have worked through this notebook, you have:

- Used exploratory analysis to identify and understand the structure of and trends within a moderately-sized chemical dataset.

- Prepared a dataset to apply predictive modelling.

- Trained, tested and evaluated some frequently-used ML models to predict a chemical property.

In addition to the practical and technical skills you will have acquired in applying machine learning for this task, as part of the process, you have seen that it is important to critically consider how best to use the data you have available to address the scientific problem that you have.

The process of structuring your data, selecting a model, selecting features, etc. can be a highly iterative process. It is important to think critically about how your data is being processed, how the model is learning from it, and how well the model’s predictions align with the real-world problem you are addressing.

Machine learning is not a black-box tool but a structured approach that requires careful decision-making at every stage. This includes selecting appropriate features, choosing a suitable model, and ensuring rigorous evaluation of performance. Model results should not be taken at face value: it is essential to assess accuracy, biases, and generalisation.

By approaching ML critically and iteratively, you can refine your models, improve predictions, and ensure that the insights gained are scientifically meaningful and reliable.