In [None]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Draw

# Classification

Classification is a type of supervised learning in which the target of the model is a discrete class label rather than a continuous number. In this notebook, you will build a classification model to determine whether a given compound is considered toxic or non-toxic. While toxicity can certainly be quantified numerically, a simpler model that performs the binary classification between toxic and non-toxic is an extremely useful first step in drug development projects.

To train such a model, we will use the ToxCast dataset. ToxCast is a large dataset containing toxicology data for many compounds based on in vitro high-throughput screening. A model that predicts this toxicological data lets us flag molecules that are likely to be toxic, and therefore poor drug candidates, without needing to synthesize them.

To build the model, you will need to follow the same steps we saw for regression:
 1. Prepare data
 2. Featurization
 3. Splitting the Data
 4. Train the model
 5. Evaluate the model
 6. Apply the model
 
 


## 1. Prepare Data

First, load the data file, `toxcast_data.csv` located in the `data` folder, into a pandas dataframe. Name your dataframe `df_tox_unprocessed`, and print the first three rows in the dataframe.
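As a reminder of how `pd.read_csv` behaves, here is a minimal sketch on an in-memory stand-in. The column `ASSAY_A` and the three rows are made up for illustration; the real file has many more assay columns. Note that an empty field is read as `NaN`, which matters later when we drop missing values.

```python
import io
import pandas as pd

# Hypothetical two-column stand-in for the real file (illustration only).
toy_csv = io.StringIO(
    "smiles,ASSAY_A\n"
    "CCO,0.0\n"
    "c1ccccc1,1.0\n"
    "CC(=O)O,\n"  # empty field: no measurement for this compound
)

df_toy = pd.read_csv(toy_csv)  # for the exercise, pass a file path instead
print(df_toy.head(3))
print(df_toy.shape)
```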

In [None]:
df_tox_unprocessed = 

We see that there are a lot of columns. For simplicity, we'll only use the `TOX21_TR_LUC_GH3_Antagonist` column. 

Execute the cell below to update our dataframe so that it contains only the data we need.

In [None]:
df_tox = df_tox_unprocessed.loc[:,["smiles","TOX21_TR_LUC_GH3_Antagonist"]].dropna()
df_tox.columns = ["smiles", "toxic"]
print(df_tox.head(5))

In the cell below, print the total number of molecules, and print how many of them are toxic. Then, execute the cell that follows to visualize and print a few of the molecules.
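One possible way to count labels is sketched below on a toy dataframe (the five rows are invented; your counts should come from `df_tox`). Comparing the column to `1` gives a boolean mask, and summing the mask counts the toxic rows.

```python
import pandas as pd

# Toy labels only; the real counts come from df_tox.
df_demo = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCl"],
    "toxic": [0.0, 1.0, 0.0, 0.0, 1.0],
})

n_total = len(df_demo)
n_toxic = int((df_demo["toxic"] == 1).sum())  # True counts as 1
print(f"{n_total} molecules, {n_toxic} toxic ({n_toxic / n_total:.0%})")
```

Keeping an eye on this ratio is useful later: if only a small fraction of molecules is toxic, accuracy alone can look deceptively good.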

In [None]:
# Visualize some of the molecules of this dataset
n = 6
# Sample the rows once so each structure stays paired with its own label
sample = df_tox.sample(n)
smiles = sample.smiles.values
legend = sample.toxic.values
molecs = [Chem.MolFromSmiles(s) for s in smiles]

Draw.MolsToGridImage(
    molecs,
    subImgSize=(600,300),
    legends=["Toxic" if i==1 else "Non toxic" for i in legend])

## 2. Featurization

Next, we need to generate features. For this exercise, we'll only use the auto-featurizer.

In the cell below, generate the features for all molecules. We cannot do this in one line as before, since some of our SMILES strings are invalid (an unfortunate truth of large datasets). So, you'll need to generate the feature vector manually:

   1. Create a list of the SMILES strings from the dataframe
   2. For each element in this list, generate the features using the RDKitDescriptors auto-featurizer (name the featurizer object `featurizer`, since a later cell relies on that name)
   3. Append these features to a list, called `auto_features`
   4. Print the length
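The looping pattern in these steps can be sketched without any chemistry library. Here `toy_featurize` is a hypothetical stand-in, not the DeepChem API, and the empty array it returns for a bad input is only an assumption about how a failure might look. The point is that the loop keeps exactly one entry per input, so positions in the feature list still line up with the rows of the dataframe.

```python
import numpy as np

def toy_featurize(smiles):
    """Hypothetical stand-in for a real featurizer: returns a (1, 2) array,
    or an empty (1, 0) array when the input cannot be processed."""
    if not smiles or "?" in smiles:
        return np.empty((1, 0))  # mimic a failed featurization
    return np.array([[len(smiles), smiles.count("C")]], dtype=float)

smiles_list = ["CCO", "C?C", "CCCC"]  # the middle entry is deliberately invalid

features = []
for s in smiles_list:
    # Append even on failure so index n still refers to molecule n
    features.append(toy_featurize(s))

print(len(features))
```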
    

In [None]:
from deepchem.feat import RDKitDescriptors



Since some of our SMILES are invalid, you'll notice that the code produced a few warnings. DeepChem could not calculate features for these molecules, so we'll need to remove them and the corresponding toxicity label.

Run the cell below to filter the correct molecules. We'll store the valid features as `X` and the corresponding targets as `y`.


In [None]:
## Do not edit this cell

tox_vals = df_tox['toxic'].values

feat_len = featurizer.featurize("CCC").shape[1]

bad_mols = []
X = []
y = []

for n,item in enumerate(auto_features):
    if np.isnan(item[0]).any():
        continue    
    elif len(item[0]) == feat_len:
        X.append(item[0])
        y.append(tox_vals[n])
    else:
        bad_mols.append(n)

X = np.asarray(X)
y = np.asarray(y)
print(X.shape)


In the cell below, perform the feature selection and print the number of features retained.
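As one illustration of a feature selector, the sketch below uses sklearn's `VarianceThreshold` on a tiny made-up matrix. This is an assumption about which selector to use; any fitted sklearn transformer follows the same `fit_transform` / `transform` pattern. The middle column is constant, so it carries no information and is dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: 3 samples, 3 features; the second column is constant.
X_demo = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 1.0],
    [3.0, 0.0, 2.0],
])

selector = VarianceThreshold(threshold=0.0)  # drops zero-variance columns
X_demo_sel = selector.fit_transform(X_demo)

print(X_demo_sel.shape)        # number of retained features is shape[1]
print(selector.get_support())  # boolean mask of the kept columns
```

Keep the fitted selector object around: the same object has to be applied to any new molecule you want to classify later.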

## 3. Data Splitting

In the cell below, split your data set into training and testing subsets. Then, normalize your features.
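A minimal sketch of splitting and scaling on synthetic data is shown below. The array names, the 80/20 split, and the use of `stratify` are illustrative choices, not requirements of the exercise. The important detail is that the scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced toy data: 80 negatives and 20 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learn min/max from the training data only
X_te_s = scaler.transform(X_te)      # reuse those values for the test data

print(X_tr_s.shape, X_te_s.shape)
print(y_te.mean())  # fraction of positives in the test set
```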

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
from sklearn.preprocessing import MinMaxScaler


## 4. Model Selection and Training

We're going to test three different models to do our classification.

In the cell below, initialize a random forest classifier, using 300 estimators and a max depth of 10. Then, train the model. You do not need to create a function for this.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = 


Repeat this step for two other models. You can use any classification model, including Logistic Regression, Support Vector Machines, Gradient Boosting, or any other you find on the sklearn documentation.
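All sklearn classifiers share the same `fit` / `predict` / `score` interface, so trying several models is mostly a matter of swapping the class. The sketch below uses synthetic data from `make_classification` and two arbitrarily chosen models with mostly default hyperparameters; none of these choices are prescribed by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the featurized molecules.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)                     # same call for every classifier
    print(name, model.score(X_te, y_te))      # mean accuracy on the test set
```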

## 5. Model Evaluation

To evaluate our models, we are going to use three metrics:

 - Accuracy Score. The accuracy score is the fraction of correctly classified labels.
 - AUC ROC. This is the area under the receiver operating characteristic curve. The x-axis is the **false positive rate**, and the y-axis is the **true positive rate**. A larger AUC indicates that a model is both sensitive (few false negatives) and specific (few false positives) in its classification.
 - F1 score. This is the harmonic mean of the precision and recall of a model. It is often more informative than accuracy alone when the classes are imbalanced, because it does not reward a model for true negatives.
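To see why accuracy can mislead on imbalanced data, here is a small hand calculation in plain Python. The labels are invented for illustration: 10 test molecules, 2 of them toxic. A "lazy" model that always predicts non-toxic reaches the same accuracy as a model that actually finds one of the toxic molecules, but its F1 score is zero.

```python
# Toy labels: 1 = toxic, 0 = non-toxic (illustration only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_lazy = [0] * 10                          # always predicts "non-toxic"
y_better = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # one hit, one miss, one false alarm

def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 (for the positive class) from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return acc, f1

print(accuracy_and_f1(y_true, y_lazy))    # (0.8, 0.0)
print(accuracy_and_f1(y_true, y_better))  # (0.8, 0.5)
```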


First, we need to make the predictions on our testing set. Complete the cell below by passing your testing set features into the `predict` function. Then, repeat the call to make predictions with your remaining two models.
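It can help to know that sklearn classifiers usually offer two kinds of output: `predict` returns hard class labels, while `predict_proba` returns estimated class probabilities. The sketch below, on synthetic data with an arbitrarily sized forest, shows both. ROC AUC can be computed from either, but only the probabilities let it use the full ranking of the test samples rather than a single threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (about 80% class 0).
X_demo, y_demo = make_classification(
    n_samples=300, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

labels = clf.predict(X_te)              # hard 0/1 labels
scores = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

print(roc_auc_score(y_te, labels))  # AUC from a single decision threshold
print(roc_auc_score(y_te, scores))  # AUC from the full ranking of samples
```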

In [None]:
y_pred_rf = rf_clf.predict()


Now, calculate the accuracy score, ROC AUC, and F1 score for each of your models. I'll demonstrate how to do it for the random forest model.

In [None]:
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    f1_score
)


acc_rf = accuracy_score(ytest, y_pred_rf)
print(f"Accuracy of Random Forest Classifier is {acc_rf:.3f}")
# Note: this AUC is computed from hard 0/1 predictions, not probability scores
auc_rf = roc_auc_score(ytest, y_pred_rf)
print(f"ROC-AUC of Random Forest Classifier is {auc_rf:.3f}")
f1s_rf = f1_score(ytest, y_pred_rf)
print(f"F1 Score of Random Forest Classifier is {f1s_rf:.3f}")

# Calculate the three metrics for your remaining two models



A very convenient way to analyze a classification model is using a confusion matrix. It is perhaps the most complete way to analyze a classification model. Confusion matrices plot the numbers of true positives (good), false positives (bad), true negatives (good), and false negatives (bad). 
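As a small illustration with invented labels, sklearn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns. For a binary problem with labels `[0, 1]`, flattening the matrix gives the four counts in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = toxic, 0 = non-toxic (illustration only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm_demo.ravel()  # row-major order: [[tn, fp], [fn, tp]]

print(cm_demo)
print(tn, fp, fn, tp)  # 7 1 1 1
```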

Complete the function call below to plot the confusion matrix for one of your models. Then, copy the code in the remaining two cells to plot the confusion matrices for your remaining models.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# This function takes the correct test values as the first parameter
# and the predicted values as the second parameter
cm = confusion_matrix( , , labels=rf_clf.classes_)

#Don't change any of this below
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=rf_clf.classes_) 
disp.plot()

## 6. Apply the Model

Now, we need to apply the model to molecules of interest. We'll use a function that we can call many times for different molecules. The function will take two inputs: first a SMILES string for a molecule, and second the trained model you want to use. Complete the template below to make this function.

In [None]:
from IPython.display import display

def is_this_toxic(smiles, model):
    """
    Ask model if the input molecule (smiles) is toxic or not.
    
    """
    
    # Generate a Molecule from SMILES
    mol = 

    # Calculate features
    X_my_mol = 
    
    # I'll do some cleaning
    X_my_mol = X_my_mol[:, ~np.isnan(X_my_mol).any(axis=0)]
    
    
    # Perform the feature selection using the same fitted
    # transformer you used on the training data, then
    # normalize with the same fitted scaler (transform only, do not refit)
    X_my_mol = 
    X_my_mol = 

    
    # Get model prediction
    is_toxic =

    # The rest of the function will draw the molecule and print a statement about its toxicity
    is_toxic = "This molecule is toxic!" if is_toxic else "This is not toxic :)"

    img = Draw.MolsToGridImage(
        [mol],
        subImgSize=(600,300),
        legends=[is_toxic],
        molsPerRow=1
    )
    display(img)
    

In the cells below, test your function on five poisonous molecules, and see what your models predict.

Do your results surprise you? Why or why not?

Looking back at the test metrics in part 5, do you think a high accuracy on our testing data is reliable? Why might the accuracy be high compared to the other metrics?

How do the confusion matrices compare across all three models?

Based on all metrics, do you think your models are very susceptible to false positives, false negatives, or both? Use evidence in your answer. Finally, discuss which of your models you think is the best to use for toxicity classification.

Discuss the limitations of your models. In what contexts are your models most appropriately applied? Also, how might you improve your model?