# Assignment - ML

## Introduction

In this assignment, you are asked to apply a machine learning (ML) model to a chemical dataset. You should use approaches similar to those you have seen in the previous Jupyter notebooks.

### The data

The data is the same data you looked at for classification modelling. It contains information relating to Epidermal Growth Factor Receptor (EGFR) kinase. The *csv* has the following columns:

* CHEMBL-ID
* SMILES string of the corresponding compound
* Measured affinity: pIC50
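
As a reminder of what the target means, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The short sketch below assumes the IC50 is given in nanomolar; the helper name `ic50_nm_to_pic50` is illustrative and not part of the assignment code.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# An IC50 of 1 nM corresponds to a pIC50 of 9; lower IC50 means higher pIC50
print(round(ic50_nm_to_pic50(1.0), 6))
print(round(ic50_nm_to_pic50(0.003), 6))
```

A higher pIC50 therefore indicates a more potent compound, and a difference of one pIC50 unit corresponds to a tenfold change in IC50.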

### EGFR Kinase

The Epidermal Growth Factor Receptor (EGFR) is a transmembrane receptor protein present on the cell membrane (https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor). It is a member of the ErbB family of receptors. EGFR plays an important role in controlling normal cell growth, apoptosis and other cellular functions. It is activated by ligand binding to its extracellular domain. Upon activation, EGFR undergoes a transition from an inactive monomeric form to an active homodimer. EGFR is upregulated in various types of tumors or cancers, so EGFR inhibition is a type of biological therapy that might stop cancer cells from growing.

### The task

Prepare the data, then train and evaluate a machine learning model to predict the pIC50 of the compounds based on the features supplied (and others, if you would like to calculate additional descriptors as features).

You will use scikit-learn to train and evaluate the following model:

**Supervised learning**

- Random Forest

### What you have to do

Complete the code cells below indicated with "# TODO" or "# TO DO" statements. The previous notebooks should be useful in these tasks.

#### Steps

1. Load the data
2. Perform some EDA to gain initial understanding of the distribution of features and relationships between features, and with the target.
3. Prepare the data 
4. Train the model
5. Make predictions
6. Evaluate performance
7. Analyse the performance of the models. Draw conclusions about the chemical problem, e.g. from the feature importances.

### Load libraries

In [120]:
# If you do not have scikit-learn installed, uncomment the following line
# !conda install -y -c conda-forge scikit-learn
#!conda install -y -c conda-forge numpy

In [121]:
from random import choice
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor 
from sklearn.metrics import (root_mean_squared_error,
                            mean_absolute_error,
                            mean_squared_error,
                            r2_score)
from rdkit import Chem 
from rdkit.Chem import AllChem, rdFingerprintGenerator, Draw

If you have issues with VSCode notebook cell outputs being truncated:

- Go to Settings (via menubar or cmd-, on Mac)
- Search for cell output settings: try @tag:notebookOutputLayout
- Adjust settings, e.g. scrolling, number of lines to display

## Load the data and perform exploratory analysis

You can perform some initial exploratory analysis of the dataset using some of the methods you saw last week.

In addition to looking for distribution and patterns in the data, look at what the columns actually contain. Some will include metadata about the source of the observation and its processing, which will not be relevant to the target variable.
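
The checks described above can be sketched on a small toy dataframe. The column names mirror those in the dataset, but the identifiers and values here are hypothetical placeholders, and `demo_df` is only a stand-in for the real data.

```python
import pandas as pd

# Toy dataframe with hypothetical values, standing in for the real CSV
demo_df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "units": ["nM", "nM", "nM"],
    "logp": [5.29, None, 3.60],
    "pIC50": [11.5, 11.2, 11.2],
})

# Data type of each column
print(demo_df.dtypes)

# Number of missing values per column
missing = demo_df.isna().sum()
print(missing)

# Columns holding a single repeated value carry no information for the model
constant_cols = [c for c in demo_df.columns if demo_df[c].nunique(dropna=False) == 1]
print(constant_cols)
```

In this toy case the `units` column is constant, so it would be treated as metadata rather than as a descriptor.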

In [None]:
# TODO: 
# - Read the data into a DataFrame
# - Check the data types
# - Check for missing values
# - Identify the target variable
# - Identify any metadata
# - Identify any descriptors
# - Examine summary statistics of target variable and descriptors and make some short comments
# - Identify any redundant columns

#### Visualisations

In [None]:
# TODO:
# - Create a dataframe with just the dependent and independent variables
# - Visualise the data to look for distributions of features, check for outliers and make some short comments
# - Visualise the data to look for correlations and make some short comments
# - Visualise the data to look for relationships between features and target and make some short comments

### Molecular Fingerprints

Use the function below to add ECFP (Morgan) fingerprints to the dataframe.

In [137]:
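def smiles_to_fp(smiles):
    """Return the Morgan fingerprint of a SMILES string as a NumPy bit array."""
    # Convert the SMILES string to an RDKit Mol object
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # Radius 2 with 2048 bits gives a hashed ECFP4-style bit vector
    fpg = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
    return np.array(fpg.GetFingerprint(mol))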

Apply this function to the dataframe and look at the first three molecules.

In [138]:
chembl_df["fp"] = chembl_df["smiles"].apply(smiles_to_fp)
chembl_df.head(3)

| Unnamed: 0 | molecule_chembl_id | IC50 | units | smiles | pIC50 | molecular_weight | n_hba | n_hbd | logp | ro5_fulfilled | fp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL63786 | 0.003 | nM | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 | 349.021459 | 3 | 1 | 5.2891 | True | `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...` |
| 1 | CHEMBL35820 | 0.006 | nM | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 | 387.058239 | 5 | 1 | 4.9333 | True | `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...` |
| 2 | CHEMBL53711 | 0.006 | nM | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 | 343.043258 | 5 | 1 | 3.5969 | True | `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...` |


#### Separate features and target

You can now separate your data into the features (the predictor variables) and the target (the variable you want to predict). Run the code below to build `X_full`.

In [None]:
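# num_chembl_df is assumed to be the numeric dataframe you created during the exploratory analysis
df1 = num_chembl_df[["logp", "molecular_weight", "n_hba", "n_hbd"]]
# Expand each fingerprint array into one column per bit
df2 = pd.DataFrame(chembl_df.fp.tolist())
# Column names: descriptor names followed by the bit indices as strings
# (named feature_names to avoid shadowing the built-in list)
feature_names = df1.columns.to_list() + [str(col) for col in df2.columns]
X_full = pd.DataFrame(np.hstack([df1, df2]), columns=feature_names)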


In [141]:
# TODO:
# - Read the target column into a separate variable y_full

#### Create the training and test sets

Run [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to create separate training and test sets, with 20% of the samples in the test set.
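
As a minimal illustration of how `train_test_split` divides the data, the sketch below uses toy arrays (`X_demo` and `y_demo` are placeholders, not the assignment data) and the same 80/20 split with a fixed `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target (not the assignment data)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 20 samples, this leaves 16 rows for training and 4 for testing. Fixing `random_state` means the same rows end up in each set every time the cell is run.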

In [None]:
# TODO 
# Split the data into training (80%) and testing (20%) sets using random_state=42
# Check the size of the resulting datasets


#### Feature Engineering

In [None]:
# TO DO Feature engineering: Scale the variables using StandardScaler()
# TO DO Calculate the mean and standard deviation of the training and test sets
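
One common way to scale is to fit the scaler on the training set only and then reuse those statistics for the test set, so that no information from the test set leaks into training. The sketch below shows this on randomly generated toy data; the array names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test matrices drawn from the same distribution
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

# Training columns have mean 0 and standard deviation 1 by construction
print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
# Test columns are usually close to, but not exactly, 0 and 1
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

Tree-based models such as random forests are insensitive to this kind of per-feature rescaling, so the step mainly matters when comparing against scale-sensitive models.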

#### Building a Random Forest Model

In [None]:
# TODO 
# Build a random forest model
# Make predictions on the test set
# Calculate MSE and r2 for the training and test sets
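
The general pattern of fitting a random forest and comparing training and test metrics might look like the sketch below. It uses synthetic data and arbitrary hyperparameters (`n_estimators=100`), so the numbers say nothing about the EGFR dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: the target depends on the first two features only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compute the same metrics on both sets so they can be compared directly
scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    y_pred = model.predict(X_part)
    scores[name] = {
        "mse": mean_squared_error(y_part, y_pred),
        "r2": r2_score(y_part, y_pred),
    }
    print(name, scores[name])
```

A large gap between the training and test scores is a typical sign of overfitting, which is worth commenting on in your analysis.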

#### Plot the training and test sets

In [None]:
# TO DO Plot the experimental values against the predicted values for the training and test sets, using different colours for the two sets

### Feature Importance

In [None]:
# TO DO Look at feature importance using Gini importance and make some comments
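
In scikit-learn, the impurity-based (Gini) importances of a fitted forest are exposed through the `feature_importances_` attribute. The sketch below shows how they can be ranked, using synthetic data in which only the hypothetical feature `f0` drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on f0 only, the other features are noise
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2", "f3"]
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y_demo = 3.0 * X_demo["f0"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalised to sum to 1. With 2048 fingerprint bits plus the descriptors, it is usually more readable to inspect only the top-ranked features, for example with `importances.head(20)`.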

In [None]:
# TO DO Create a bar plot for feature importance