In [1]:
import numpy as np
import pandas as pd
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools

# Supervised Learning

Supervised learning algorithms in chemistry make predictions on observables of molecular systems. They are built using a set of training data, usually composed of molecular features and desired observables. This training data can be experimentally obtained, or obtained with computational techniques. Many mathematical models exist which relate the molecular input to a desired output:
 - k-Nearest Neighbors
 - Linear Regression
 - Support Vector Machines (SVM)
 - Decision Trees
 - Ensemble Methods

There are two major types of SL algorithm: **Classification** and **Regression**. Classification models predict a non-numerical class-label based on inputs. They are very useful to predict certain properties of molecules that don't have inherent numerical values. Regression models predict a real number--a continuous variable--from chemical features.
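
To make the distinction concrete, here is a minimal sketch on toy data. The single descriptor, the labels, and the values are invented for illustration (they are not real chemistry), and k-nearest neighbors is used only because it is the first model in the list above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy feature matrix: one made-up descriptor per "molecule" (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification target: a non-numerical class label
y_class = np.array(["insoluble", "insoluble", "insoluble",
                    "soluble", "soluble", "soluble"])

# Regression target: a continuous number
y_reg = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y_reg)

label = clf.predict([[4.2]])[0]   # a class label (a string)
value = reg.predict([[2.5]])[0]   # a real number (mean of the two nearest targets)
print(label, value)
```

The classifier returns one of the labels it was trained on, while the regressor returns a number that need not appear in the training targets.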

In this notebook, we will learn and execute the main components of building SL models:
 1. Analyze and define the task
 2. Process training data
 3. Train the model
 4. Evaluate the model
 5. Optimize the model
 6. Apply the model

## Predicting Aqueous Solubility

### 1. Defining the Task

Solubility in water is a crucial molecular property in the context of both drug discovery and agrochemistry. It determines the uptake and distribution of compounds throughout the body and our environment. Measuring solubility for a large number of compounds is a very time-consuming undertaking, so we'd like to use ML to avoid the need for physical samples.

What type of model should we use in this task?

### 2. Processing Training Data

### Getting the Data
We will use the ESOL dataset, which we saw briefly in a previous notebook. The entire dataset is contained in the `esol.csv` file in the `data` folder. Load it into a dataframe called `esol_df`, and print the first five rows and the total number of molecules.

We can see that our dataset has two columns, and 1128 molecules. We'll need some sort of descriptor for these molecules. In the cell below:
 1. Store the SMILES strings for all molecules into an array
 2. Store the solubilities into an array.
 3. Make a new array containing Morgan fingerprints for each molecule. Be sure to convert this to a numpy array.

### Featurizers

Oftentimes, we can use a program to automate the selection of features. While we can use just the fingerprints we have generated, we can also have our program automatically generate several different descriptors, and use the most important ones.

Execute the cell below to download `deepchem`, which we'll use to auto-generate some features.

In [None]:
!pip install deepchem

Now, execute the following cell to have `deepchem` come up with some features. We'll build models with both these and our fingerprints to evaluate which features lead to better models.

In [4]:
from deepchem.feat import RDKitDescriptors
featurizer = RDKitDescriptors()
auto_features = featurizer.featurize(smiles)
print(f"Number of generated molecular descriptors: {auto_features.shape[1]}")


Number of generated molecular descriptors: 209


It's possible that the program fails to compute some descriptors for some molecules, leaving `NaN` values behind. These invalid values will cause errors in our models, so let's remove every descriptor column that contains one.

In [5]:
auto_features = auto_features[:, ~np.isnan(auto_features).any(axis=0)]
print(f"Number of molecular descriptors without invalid values: {auto_features.shape[1]}")


Number of molecular descriptors without invalid values: 209
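
The indexing expression above is compact, so here is a small sketch of what it does, using a made-up 3x3 array in place of the real descriptor matrix. Rows stand for molecules and columns for descriptors.

```python
import numpy as np

# Made-up descriptor matrix: 3 molecules (rows) x 3 descriptors (columns)
toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
])

# True for every column that contains at least one NaN
nan_columns = np.isnan(toy).any(axis=0)

# Keep all rows, but only the columns without NaN
cleaned = toy[:, ~nan_columns]
print(nan_columns)
print(cleaned.shape)
```

Note that this drops whole descriptors (columns) rather than molecules (rows), so the number of molecules stays the same.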


### Feature Selection

For any set of features, the next step is feature selection, an essential part of data preprocessing. It involves analyzing all features in a dataset and removing any that are redundant or unimportant. The result is a set of features that tends to produce less noisy and more accurate models. Luckily, feature selection is largely automated in `sklearn`.

`sklearn` is one of the main workhorse modules in data science. We will use it constantly, as it automates most of the routines we'll come across. Execute the cell below to perform feature selection on our set of fingerprints. `VarianceThreshold(threshold=0.0)` drops every feature whose value is identical for all molecules; we'll also print the number of features we retain.

In [6]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.0)
features = selector.fit_transform(fps)

print(f"Number of descriptors retained: {features.shape[1]}")

Number of descriptors retained: 1013


Now, repeat this procedure for the auto-generated features:

In [7]:
selector = VarianceThreshold(threshold=0.0)
auto_features = selector.fit_transform(auto_features)
print(f"Number of descriptors retained: {auto_features.shape[1]}")

Number of descriptors retained: 190


We can see that some of the descriptors were removed: 190 of the 209 auto-generated descriptors remain.
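
To see which kind of feature gets dropped, here is a minimal sketch on a made-up bit matrix standing in for fingerprints. The array and the names `toy_bits` and `toy_selector` are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up "fingerprints": 3 molecules x 4 bits
toy_bits = np.array([
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

toy_selector = VarianceThreshold(threshold=0.0)
reduced = toy_selector.fit_transform(toy_bits)

# Boolean mask of the columns that were kept
kept = toy_selector.get_support()
print(kept)
print(reduced.shape)
```

The first bit is always 0 and the last is always 1, so neither can help a model tell molecules apart, and both are removed.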


### Data Partitioning
Once all of our molecules are featurized, we need to split the data into the **training** and **testing** sets. The training set is the set of features/values used to build the model. The testing set is used to evaluate the model. These datasets need to be kept separate to avoid any bias in building the model. Again, `sklearn` has a very useful function to automatically split the dataset based on a user-specified size.

The function call is filled in for you; add a line that prints the sizes of the training and testing sets.

In [8]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(
    features, y, train_size=0.8, random_state=0)



902 226
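
As a quick check of where these sizes come from, the sketch below splits placeholder arrays with the same number of rows as our dataset (1128). The arrays contain dummy values and only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_molecules = 1128

# Placeholder features and targets; the values are dummies
X_demo = np.zeros((n_molecules, 4))
y_demo = np.arange(n_molecules, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# 80% of 1128 is 902.4, which is rounded down for the training set
print(len(X_tr), len(X_te))

# Every molecule ends up in exactly one of the two sets
overlap = set(y_tr) & set(y_te)
print(len(overlap))
```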


Now generate test/train sets of the auto-featurized data:

We still have one step to do before we can train a model, especially since we will be building some models from auto-featurization. Features of different types (say, molecular weight, electrochemical potential, etc.) can have values that span very different orders of magnitude. Such differences can cause instabilities in many ML models, so we need to normalize the features.
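
As a minimal sketch of what min-max normalization does, the example below rescales two made-up descriptors that live on very different scales. The numbers are invented, and `demo_scaler` is a throwaway name; each feature is mapped to $(x - x_{min}) / (x_{max} - x_{min})$ using the minimum and maximum seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up descriptors on very different scales
train = np.array([[100.0, 0.1],
                  [200.0, 0.3],
                  [300.0, 0.5]])
test = np.array([[250.0, 0.6]])

# Learn each column's min and max from the training rows
demo_scaler = MinMaxScaler().fit(train)
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# The same result computed by hand for the training rows
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled)
print(test_scaled)
```

After scaling, both training columns span the range 0 to 1. A test value can land outside that range when it exceeds what the scaler saw during fitting, as the second test feature does here.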

In the cell below, we'll normalize the features of our fingerprint-only set:

In [11]:
from sklearn.preprocessing import MinMaxScaler

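# Learn each feature's minimum and maximum from the training set
scaler = MinMaxScaler()
scaler.fit(xtrain)

# Keep the unscaled arrays for reference
x_train_ori = xtrain
x_test_ori = xtest

# Rescale both sets with the fitted scaler
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)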

and the auto-generated set:

Do you think it matters that we normalize the training data separately from the test data? Why or why not?

What are the main steps in data preparation? Why is each one needed?

### 3. Training and Evaluating the Model

Finally! As you can see, the bulk of the work in building an ML model is in carefully preparing the data. When building a model, it's always best to try a few and see which performs best for the task at hand. Here, we'll test a random forest and a gradient boosting regressor, which are both ensemble models.

Initialize the models below.

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Both models use 10 trees and seed=0
ranf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
gb_reg = GradientBoostingRegressor(n_estimators=10, random_state=0)


Next, we'll define a function that trains a model, tests it, and prints the error in test set predictions. To evaluate the models, we'll use the root mean squared error,
$$ RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} $$
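
As a worked instance of this formula, the sketch below computes the RMSE by hand for four made-up predictions, and compares it with the mean absolute error (a different metric, shown only for contrast).

```python
import numpy as np

# Made-up true values and predictions; only the last prediction is off
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

squared_errors = (y_true - y_pred) ** 2   # [0, 0, 0, 4]
rmse = np.sqrt(squared_errors.mean())     # sqrt(4 / 4) = 1.0
mae = np.abs(y_true - y_pred).mean()      # 2 / 4 = 0.5

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than it raises the mean absolute error.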

Execute the cell below to initialize the function.

In [14]:
from sklearn.metrics import mean_squared_error

def train_test_model(model, X_train, y_train, X_test, y_test):
    """
    Train a model, then print its RMSE on the training and test sets.
    Inputs: sklearn model, training features and targets, test features and targets
    """
    # Train model
    model.fit(X_train, y_train)

    # Predict on both the training and test sets
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Calculate RMSE as the square root of the mean squared error
    model_train_mse = mean_squared_error(y_train, y_pred_train)
    model_test_mse = mean_squared_error(y_test, y_pred_test)
    model_train_rmse = model_train_mse ** 0.5
    model_test_rmse = model_test_mse ** 0.5
    print(f"RMSE on train set: {model_train_rmse:.3f}, and test set: {model_test_rmse:.3f}.\n")


Now let's use this function. In the cell below, complete the function calls with the appropriate data sets. Then execute the cell to see how they do.

In [16]:
# Train and test the random forest model
print("Model: Random Forest")
print("Descriptor: Morgan Fingerprint")
train_test_model()

# Train and test Gradient Boost model
print("Model: Gradient Boost")
print("Descriptor: Morgan Fingerprint")
train_test_model( )

# Train and test the random forest model
print("Model: Random Forest")
print("Descriptor: Auto-generated")
train_test_model()

print("Model: Gradient Boost")
print("Descriptor: Auto-generated")
train_test_model()

Model: Random Forest
Descriptor: Morgan Fingerprint
RMSE on train set: 0.542, and test set: 1.386.

Model: Gradient Boost
Descriptor: Morgan Fingerprint
RMSE on train set: 1.628, and test set: 1.889.

Model: Random Forest
Descriptor: Auto-generated
RMSE on train set: 0.278, and test set: 0.723.

Model: Gradient Boost
Descriptor: Auto-generated
RMSE on train set: 1.029, and test set: 1.189.



Which ML algorithm was the most accurate? How can you evaluate this?

Which featurization led to more accurate models? Why?

### 5. Optimizing the Model

Nearly all ML models have settings that are chosen by the user rather than learned from the training data. For example, in both ensemble models we used, we had to provide an `n_estimators` parameter, which sets the number of trees in tree-based models. We can also limit the tree depth with a `max_depth` parameter. Using automated tools, we can optimize both of these *hyperparameters* to improve our models.

Execute the cell below to perform the optimization. You'll need to choose the training data. This step may take a few minutes.

In [17]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 20, 30]
}

# Use 5-fold cross-validation during the grid search
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)

# Put in your x and y training values here
grid_search.fit()


# Print the parameters
print(grid_search.best_params_)

{'max_depth': 20, 'n_estimators': 50}
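
To show the mechanics of a grid search end to end, here is a small sketch on synthetic data, with a deliberately tiny grid so it runs in seconds. The data, the grid values, `cv=3`, and the choice of `scoring="neg_root_mean_squared_error"` are all assumptions made for this illustration. The cell above uses the default scoring, which for regressors is $R^2$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not solubility data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(120, 3))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(0, 0.05, size=120)

# A tiny grid: 2 x 2 = 4 parameter combinations
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,                                    # 3 folds per combination
    scoring="neg_root_mean_squared_error",   # higher (less negative) is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
```

Each combination is trained and scored once per fold, so the cost grows with the product of the grid sizes and the number of folds. By default the search then refits the best combination on all of the data it was given, and that model is available as `search.best_estimator_`.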


In the cell below, use your optimized parameters to train and test a new model.

RMSE on train set: 0.245, and test set: 0.687.



How did the hyperparameter optimization affect your testing and training errors? Were you surprised by these results at all?

### 6. Applying the Model

Now that you've trained and optimized your model, we can try to apply it to some new molecules. 

**1. Get your Molecules.** We are going to apply our ML model to some preclinical drug molecules currently in development to treat SARS-CoV-2. To find these molecules, go to the ChEMBL website, https://www.ebi.ac.uk/chembl/. On the home screen, click the right arrow on the center image until "Explore SARS-CoV2 Data" appears. Then, click the icon that says "Compounds".

You will then see a grid of many molecules. Use the filter on the left to display only small-molecule, preclinical compounds. Then, store the SMILES strings of **five** compounds in the cell below.

**2. Visualize your Compounds.** Using your SMILES strings, display your compounds in a grid. You do not need to label them.

**3. Generate Features.** Again using your SMILES strings, generate features for each molecule, and use the selector from above to perform feature selection.

In [25]:
## 1. Recall which function we used to get features from SMILES strings
## Auto-generate your features here


## 2. Next, use the selector to perform feature selection. Above, we fit the selector
##    to our dataset. Here, we just use the same fitted selector to transform the new data.
##    The command we need is selector.transform()
## Perform feature selection here


## 3. Now, pass these features to the model. The model can take the whole list of molecules
##    and do all predictions together. The function you need is your_model.predict(your_features)
##    This will return an array of predictions, in the order given
## Predict solubilities below, print the results



**4. Chemical Analysis.** How soluble are the molecules you found? What implications might this have on their efficacy as therapeutics?

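**5. Model Analysis.** On ChEMBL, the chemical properties table lists AlogP, a calculated octanol-water partition coefficient. AlogP measures lipophilicity rather than aqueous solubility, but the two tend to be inversely related, so it can serve as a rough point of comparison. How well do the trends in your model's results match those on ChEMBL? What might account for any differences?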

**6. More Model Analysis.** Suggest one test you could run to determine why your model does or does not accurately predict the solubilities of the compounds you found.