# **The Project**

The final project of this lecture uses the TOX21 dataset, which contains molecular `SMILES` strings, to train a model that can classify a molecule as toxic or non-toxic.

The `TOX21` Data challenge has been the largest effort of the scientific community to compare computational methods for toxicity prediction. 
[[link]](https://tripod.nih.gov/tox21/challenge/data.jsp#)

Here we provide a bit of help to structure your project:

- Load both the training and test CSV files and explore them.
- Build a function that uses RDKit to parse the SMILES representation present in the CSV files.
- From that list of SMILES, extract the Morgan fingerprints as the representation of each molecule. [[more Help]](https://www.rdkit.org/docs/GettingStartedInPython.html)
- Choose a couple of reasonable models from scikit-learn that can classify those molecules as toxic or non-toxic.
- Train them and assess their performance by classifying the molecules in the test set.
- Try to improve the model performance:
    - Cross-validation strategy
    - Hyperparameter optimization
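
The last two points can be combined in scikit-learn. The following is a minimal sketch rather than a reference solution: it assumes random bit vectors with a toy label rule as a stand-in for real fingerprints, and the grid of `C` values is an arbitrary illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in data (assumption): random bit vectors instead of real fingerprints
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 2, size=(200, 64))
# Toy label rule so that there is a signal to learn
y_demo = (x_demo[:, :8].sum(axis=1) > 4).astype(int)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative grid over the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` with the best parameters found.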

In [None]:
# Import libraries here
import pickle
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw

from IPython.display import HTML, display

## 1.0 Load CSV files
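
A minimal sketch of this step, with assumptions: the real file names are not given here, so an in-memory CSV with made-up toy labels stands in for the training file. The `smile` and `target` column names are the ones used in the cells of this notebook.

```python
import io

import pandas as pd

# Stand-in for the training CSV (assumption, toy labels only);
# in the project, pass the path of your CSV file to pd.read_csv instead
csv_text = "smile,target\nCCO,0\nc1ccccc1,1\nCC(=O)O,0\n"
tox_demo = pd.read_csv(io.StringIO(csv_text))

# Quick exploration: size, missing values, and class balance
print(tox_demo.shape)
print(tox_demo.isna().sum())
print(tox_demo["target"].value_counts())
```

Checking the class balance early is useful because it tells you what accuracy a trivial "always predict the majority class" model would reach.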

In [None]:
# Load CSV files with pandas


In [None]:
# Convert the smile and target columns of the DataFrame to lists
smile_list = tox_train["smile"].tolist()
target_list = tox_train["target"].tolist()

## 2.0 Helper functions

Build the functions that convert a SMILES string to an RDKit molecule object and extract the Morgan fingerprint from that molecule object, as two separate functions.
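
As a hedged illustration of the RDKit calls involved (not the solution to the exercise), the sketch below uses the hypothetical names `smiles_to_mol` and `mol_to_morgan`, with an assumed `radius=2` and `n_bits=2048`. Newer RDKit releases may print a deprecation warning for `GetMorganFingerprintAsBitVect` and point to `rdFingerprintGenerator` instead.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_mol(smiles):
    """Parse a SMILES string; RDKit returns None if it cannot be parsed."""
    return Chem.MolFromSmiles(smiles)

def mol_to_morgan(mol, radius=2, n_bits=2048):
    """Return the Morgan fingerprint of a molecule as a 0/1 NumPy array."""
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.uint8)
    # GetOnBits() yields the indices of the bits that are set
    arr[list(fp.GetOnBits())] = 1
    return arr

mol = smiles_to_mol("CCO")  # ethanol
fingerprint = mol_to_morgan(mol)
print(fingerprint.shape, int(fingerprint.sum()))
```

Because `Chem.MolFromSmiles` returns `None` for an unparsable string, it is worth checking for `None` before computing a fingerprint.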

In [None]:
# We need functions that convert a SMILES string to an rdkit molecule and extract its fingerprint

def _smile_2_mol(smile):
    """Helper function to convert smile to rdkit mol"""
    return

def _morgan_representation(mol, radius=6):
    """Helper function to convert an rdkit mol to Morgan fingerprints"""
    return

def _save_model(model, filename="./best_model.pkl"):
    """Helper function to save a model into a pickle file"""
    # The with-block closes the file automatically
    with open(filename, "wb") as p_file:
        pickle.dump(model, p_file)

def _load_model(filename):
    """Helper function to load an ML model from a pickle file"""
    with open(filename, "rb") as p_file:
        model = pickle.load(p_file)
    return model

## 3.0 Process the training data

Now use the previous functions to convert the dataset into Morgan fingerprints and store the representations and targets in a dictionary.
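
One possible shape for this loop, sketched under assumptions: `build_dataset` and `toy_featurize` are hypothetical names, and the toy featurizer only stands in for your SMILES-to-fingerprint helpers so that the example runs without RDKit. Entries for which the featurizer returns `None` (for example an unparsable SMILES) are skipped together with their target, which keeps inputs and labels aligned.

```python
import numpy as np

def build_dataset(smiles, targets, featurize):
    """Collect representations and targets, skipping failed molecules."""
    dataset = {"rep": [], "target": []}
    for smile, target in zip(smiles, targets):
        rep = featurize(smile)
        if rep is None:
            continue  # drop the molecule and its label together
        dataset["rep"].append(rep)
        dataset["target"].append(target)
    return dataset

def toy_featurize(smile):
    """Hypothetical stand-in: character counts instead of a fingerprint."""
    if not smile:
        return None
    return [smile.count("C"), smile.count("O"), smile.count("N")]

demo = build_dataset(["CCO", "", "CN"], [0, 1, 1], toy_featurize)
x_demo = np.array(demo["rep"])
y_demo = np.array(demo["target"])
print(x_demo.shape, y_demo.shape)  # (2, 3) (2,)
```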

In [None]:
# Collect fingerprint and target as dict
dataset = {"rep": [], "target": []}


In [None]:
# Arrays for input and target
x_tox21 = np.array(dataset["rep"])
y_tox21 = np.array(dataset["target"])

## 4.0 Train the model

Implement the code to train a given model using a `KFold` strategy and save the best model as a pickle file.
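
A minimal sketch of one way to do this, assuming that "best model" means the fold model with the highest validation accuracy. The data is a random stand-in rather than TOX21, and the model is written to a temporary directory so the example does not overwrite `./best_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Stand-in data (assumption): random bit vectors with a toy label rule
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 2, size=(200, 64))
y_demo = (x_demo[:, :8].sum(axis=1) > 4).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
best_model, best_score = None, -1.0

for train_idx, val_idx in kfold.split(x_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(x_demo[train_idx], y_demo[train_idx])
    score = accuracy_score(y_demo[val_idx], model.predict(x_demo[val_idx]))
    # Keep the model with the highest validation accuracy
    if score > best_score:
        best_model, best_score = model, score

# Save to a temporary location for this demo
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(model_path, "wb") as p_file:
    pickle.dump(best_model, p_file)

print(f"Best fold accuracy: {best_score:.3f}")
```

Picking the best fold is only one reading of the task. An alternative is to use cross-validation purely to compare models or settings and then refit the chosen one on the full training set before saving it.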

In [None]:
# In this case, we are going to use logistic regression as the model
from sklearn.model_selection import train_test_split, KFold, cross_val_score


## 5.0 Process the test data for TOX21

Process the test dataset the same way you processed the training dataset, so you can measure the actual accuracy of your model on data it has not seen.

In [None]:
# Convert the smile and target columns of the DataFrame to lists
smile_test_list = tox_test["smile"].tolist()
target_test_list = tox_test["target"].tolist()

# Build the test dataset as a dict
dataset_test = {"rep": [], "target": []}


## 6.0 Load the best model

Load the best model using the `_load_model` function and assess its accuracy on the test dataset.

In [None]:
# Best model
model = _load_model(filename="./best_model.pkl")
print(model)

In [None]:
# Performance
y_pred = model.predict(x_test_tox21)

# accuracy_score expects the true labels first, then the predictions
print(f"The accuracy is {accuracy_score(y_test_tox21, y_pred)}")

## 7.0 Play with your model!

Use your model to predict whether caffeine and pentachlorophenol are toxic or not!
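
One detail that is easy to miss here: scikit-learn's `predict` expects a 2D array of shape `(n_samples, n_features)`, so a single fingerprint has to be reshaped into one row first. A minimal sketch, assuming a stand-in model and a random bit vector in place of a real fingerprint:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on random bit vectors (assumption, not the TOX21 model)
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 2, size=(100, 32))
y_demo = (x_demo[:, 0] == 1).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(x_demo, y_demo)

# A single "fingerprint" is 1D, so reshape it to (1, n_features) before predicting
single_fp = rng.integers(0, 2, size=32)
prediction = demo_model.predict(single_fp.reshape(1, -1))[0]

print("toxic" if prediction == 1 else "non-toxic")
```

The fingerprint passed to `predict` must be built with the same radius and length as the fingerprints used for training.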

In [None]:
# Caffeine
caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
caff_img = Draw.MolToImage(caffeine)
display(caff_img)

In [None]:
# Convert mol to fingerprints


# Predict value


if caff_tox == 0:
    print("Caffeine is non-toxic!")
else:
    print("Caffeine is toxic!")

In [None]:
# Pentachlorophenol
cloro_phenol = Chem.MolFromSmiles("C1(=C(C(=C(C(=C1Cl)Cl)Cl)Cl)Cl)O")
cloro_phenol_img = Draw.MolToImage(cloro_phenol)
display(cloro_phenol_img)

In [None]:
# Convert mol to fingerprints


# Predict value


if cloro_phenol_tox == 0:
    print("Pentachlorophenol is non-toxic!")
else:
    print("Pentachlorophenol is toxic!")

Use this model to predict the toxicity of a molecule that you normally use in your lab!