# Chemprop

**Tutorial by Kevin P. Greenman (Ph.D. Candidate, MIT Department of Chemical Engineering)**

This notebook demonstrates how to use the Python interface of Chemprop, a package for molecular property prediction with directed message-passing neural networks (D-MPNNs). The source code is available on [GitHub](https://github.com/chemprop/chemprop), and the full documentation is hosted on [Read the Docs](https://chemprop.readthedocs.io/en/latest/). The GitHub repository contains instructions for installing Chemprop on a local machine, either from source or from PyPI using `pip`.

# Acknowledgements

Chemprop was first described in the following [paper](https://doi.org/10.1021/acs.jcim.9b00237):

```
Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling. 2019 Jul 30; 59(8):3370-88. DOI: 10.1021/acs.jcim.9b00237.
```

Numerous researchers at MIT and in the open-source community have contributed to Chemprop to expand its functionality since then. In particular, I acknowledge the work of Lior Hirschfeld, Charles McGill, Esther Heid, Florence Vermeire, Max Liu, David Graff, Oscar Wu, Yunsie Chung, Yanfei Guan, Michael Forsuelo, and Gabriele Scalia. The PIs associated with this work include Regina Barzilay, Tommi Jaakkola, Klavs Jensen, Connor Coley, William Green, and Rafael Gómez-Bombarelli. The development of Chemprop is funded largely by the [Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) Consortium](https://mlpds.mit.edu/).

# Applications

Chemprop has been applied in many subsequent publications, e.g.:
* [A Deep Learning Approach to Antibiotic Discovery](https://doi.org/10.1016/j.cell.2020.01.021)
* [Machine Learning of Reaction Properties via Learned Representations of the Condensed Graph of Reaction](https://doi.org/10.1021/acs.jcim.1c00975)
* [Predicting Infrared Spectra with Message Passing Neural Networks](https://doi.org/10.1021/acs.jcim.1c00055)
* [Group Contribution and Machine Learning Approaches to Predict Abraham Solute Parameters, Solvation Free Energy, and Solvation Enthalpy](https://doi.org/10.1021/acs.jcim.1c01103)
* [Multi-fidelity prediction of molecular optical peaks with deep learning](https://doi.org/10.1039/D1SC05677H)

# Setup

In [None]:
import os
import chemprop
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.decomposition import PCA

In [None]:
def plot_parity(y_true, y_pred, y_pred_unc=None):
    
    axmin = min(min(y_true), min(y_pred)) - 0.1*(max(y_true)-min(y_true))
    axmax = max(max(y_true), max(y_pred)) + 0.1*(max(y_true)-min(y_true))
    
    mae = mean_absolute_error(y_true, y_pred)
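    # Take the square root explicitly; the `squared=False` keyword is deprecated in recent scikit-learn releases
    rmse = mean_squared_error(y_true, y_pred) ** 0.5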
    
    plt.plot([axmin, axmax], [axmin, axmax], '--k')

    plt.errorbar(y_true, y_pred, yerr=y_pred_unc, linewidth=0, marker='o', markeredgecolor='w', alpha=1, elinewidth=1)
    
    plt.xlim((axmin, axmax))
    plt.ylim((axmin, axmax))
    
    ax = plt.gca()
    ax.set_aspect('equal')
    
    at = AnchoredText(
        f"MAE = {mae:.2f}\nRMSE = {rmse:.2f}", prop=dict(size=10), frameon=True, loc='upper left')
    at.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
    ax.add_artist(at)
    
    plt.xlabel('True')
    plt.ylabel('Chemprop Predicted')
    
    plt.show()
    
    return
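
The MAE and RMSE shown in the annotation box of `plot_parity` reduce to two short formulas. The following is a minimal pure-Python sketch for reference only; the helper names `mae` and `rmse` are illustrative and are not part of Chemprop or scikit-learn, and the numbers are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up values for illustration
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(f"MAE = {mae(y_true, y_pred):.2f}, RMSE = {rmse(y_true, y_pred):.2f}")
```

Because RMSE squares each residual before averaging, a single large error (the third point here) raises it more than it raises MAE.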

# Train regression model

In [None]:
arguments = [
    '--data_path', '../data/examples/regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_reg',
    '--epochs', '5',
    '--save_smiles_splits'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

# Predict from file

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_reg/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_reg.csv',
    '--checkpoint_dir', 'test_checkpoints_reg'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

In [None]:
df = pd.read_csv('test_checkpoints_reg/fold_0/test_full.csv')
df['preds'] = [x[0] for x in preds]
df

In [None]:
plot_parity(df.logSolubility, df.preds)

# Predict from SMILES list

In [None]:
smiles = [['CCC'], ['CCCC'], ['OCC']]
arguments = [
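    '--test_path', '/dev/null',   # required argument; not read when `smiles` is passed directly
    '--preds_path', '/dev/null',  # discard the output file; predictions are returned in memory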
    '--checkpoint_dir', 'test_checkpoints_reg'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args, smiles=smiles)
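
Note that `smiles` above is a list of lists: each inner list holds the SMILES strings for one datapoint (a single string here, since the model takes one molecule per input). If the molecules start out as a flat list of strings, a small helper can produce this nested shape. This is a minimal sketch; `wrap_smiles` is a hypothetical name, not a Chemprop function.

```python
def wrap_smiles(flat_smiles):
    """Wrap each SMILES string in its own list: one inner list per datapoint."""
    return [[s] for s in flat_smiles]

smiles = wrap_smiles(['CCC', 'CCCC', 'OCC'])
print(smiles)  # [['CCC'], ['CCCC'], ['OCC']]
```

The returned predictions follow the same nesting, which is why the earlier cells extract the first task's value with `x[0]`.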

# Load model once, predict multiple times

In [None]:
arguments = [
    '--test_path', '/dev/null',
    '--preds_path', '/dev/null',
    '--checkpoint_dir', 'test_checkpoints_reg'
]

args = chemprop.args.PredictArgs().parse_args(arguments)

# Load the checkpoints once and reuse them for every prediction call below
model_objects = chemprop.train.load_model(args=args)

smiles = [['CCC'], ['CCCC'], ['OCC']]
preds = chemprop.train.make_predictions(args=args, smiles=smiles, model_objects=model_objects)

smiles = [['CCCC'], ['CCCCC'], ['COCC']]
preds = chemprop.train.make_predictions(args=args, smiles=smiles, model_objects=model_objects)

# Reactions

In [None]:
reaction_reg_df = pd.read_csv('../data/examples/reaction_regression.csv')
reaction_reg_df

In [None]:
arguments = [
    '--data_path', '../data/examples/reaction_regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_reaction',
    '--epochs', '5',
    '--reaction',
    '--save_smiles_splits'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_reaction/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_reaction.csv',
    '--checkpoint_dir', 'test_checkpoints_reaction'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

In [None]:
df = pd.read_csv('test_checkpoints_reaction/fold_0/test_full.csv')
df['preds'] = [x[0] for x in preds]

plot_parity(df.ea, df.preds)

# Multiple-Molecule Inputs

In [None]:
multimolecule_df = pd.read_csv('../data/examples/classification_multimolecule.csv')
multimolecule_df

In [None]:
arguments = [
    '--data_path', '../data/examples/classification_multimolecule.csv',
    '--dataset_type', 'classification',
    '--save_dir', 'test_checkpoints_multimolecule',
    '--epochs', '5',
    '--save_smiles_splits',
    '--number_of_molecules', '2',
    '--split_key_molecule', '1'  # index of the molecule column used for splitting; defaults to 0 (first column)
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_multimolecule/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_multimolecule.csv',
    '--checkpoint_dir', 'test_checkpoints_multimolecule',
    '--number_of_molecules', '2',
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

# Split Type

In [None]:
arguments = [
    '--data_path', '../data/examples/regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_splits',
    '--epochs', '5',
    '--split_type', 'scaffold_balanced',
    '--save_smiles_splits'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)
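
The idea behind a scaffold-based split is that molecules sharing a scaffold are kept together, so the test set contains scaffolds the model never saw during training. The sketch below is a simplified illustration of that idea, not Chemprop's implementation: it assumes scaffold labels are already computed (the strings `'A'` to `'D'` are placeholders), and `scaffold_split` is a hypothetical helper name.

```python
from collections import defaultdict

def scaffold_split(scaffolds, sizes=(0.8, 0.1, 0.1)):
    """Assign whole scaffold groups to train/val/test, largest groups first.

    `scaffolds` holds one scaffold label per molecule (placeholder strings here;
    in practice they would come from a cheminformatics toolkit).
    Returns three lists of molecule indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n = len(scaffolds)
    train_cap, val_cap = sizes[0] * n, sizes[1] * n
    train, val, test = [], [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train += members      # group fits in the training budget
        elif len(val) + len(members) <= val_cap:
            val += members        # otherwise try the validation budget
        else:
            test += members       # everything else goes to the test set
    return train, val, test

# Ten molecules with made-up scaffold labels
scaffolds = ['A'] * 6 + ['B'] * 2 + ['C', 'D']
train, val, test = scaffold_split(scaffolds)
print(len(train), len(val), len(test))
```

Because groups are assigned whole, the realized split fractions can deviate from the requested ones when a few scaffolds dominate the dataset.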

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_splits/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_splits.csv',
    '--checkpoint_dir', 'test_checkpoints_splits',
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

In [None]:
df = pd.read_csv('test_checkpoints_splits/fold_0/test_full.csv')
df['preds'] = [x[0] for x in preds]

plot_parity(df.logSolubility, df.preds)

# Ensembling and Uncertainty

In [None]:
arguments = [
    '--data_path', '../data/examples/reaction_regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_ensemble',
    '--epochs', '5',
    '--reaction',
    '--save_smiles_splits',
    '--ensemble_size', '5',
    '--split_type', 'scaffold_balanced'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_ensemble/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_ensemble.csv',
    '--checkpoint_dir', 'test_checkpoints_ensemble',
    '--ensemble_variance'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

In [None]:
preds_df = pd.read_csv('test_preds_ensemble.csv')
preds_df

In [None]:
df = pd.read_csv('test_checkpoints_ensemble/fold_0/test_full.csv')
plot_parity(df.ea, preds_df.ea, preds_df.ea_epi_unc)
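
The error bars above come from the spread of predictions across the five ensemble members. As a rough sketch of that calculation, assuming the reported uncertainty is the variance of the individual model predictions (whether the population or sample convention is used is an assumption here, and the numbers are made up):

```python
import statistics

# Hypothetical predictions for one reaction from a 5-model ensemble (made-up values)
member_preds = [10.0, 12.0, 11.0, 13.0, 9.0]

ensemble_mean = statistics.fmean(member_preds)      # the ensemble prediction
ensemble_var = statistics.pvariance(member_preds)   # population variance (assumed convention)
ensemble_std = ensemble_var ** 0.5                  # same units as the prediction

print(ensemble_mean, ensemble_var, ensemble_std)
```

A variance has squared units, so taking its square root gives a standard deviation that can be compared directly with the predicted values when drawing error bars.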

# Fingerprint

In [None]:
arguments = [
    '--test_path', '../data/examples/regression.csv',
    '--preds_path', 'test_preds_fingerprint.csv',
    '--checkpoint_dir', 'test_checkpoints_reg',
    '--fingerprint_type', 'MPN'
]

args = chemprop.args.FingerprintArgs().parse_args(arguments)
preds = chemprop.train.molecule_fingerprint.molecule_fingerprint(args=args)

In [None]:
preds.squeeze().shape

In [None]:
df = pd.read_csv('../data/examples/regression.csv')

pca = PCA(n_components=2)
pca_xy = pca.fit_transform(preds.squeeze())
plt.scatter(pca_xy[:,0], pca_xy[:,1], s=30, c=df.logSolubility, edgecolor='w')
plt.colorbar(label='logSolubility')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()

# Spectra

In [None]:
arguments = [
    '--data_path', '../data/examples/spectra.csv',
    '--dataset_type', 'spectra',
    '--save_dir', 'test_checkpoints_spectra',
    '--epochs', '5',
    '--features_path', '../data/examples/spectra_features.csv',
    '--split_type', 'random_with_repeated_smiles',
    '--save_smiles_splits'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_spectra/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_spectra.csv',
    '--checkpoint_dir', 'test_checkpoints_spectra',
    '--features_path', '../data/examples/spectra_features.csv'
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)

# Pretraining / Transfer Learning

In [None]:
arguments = [
    '--data_path', '../data/examples/regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_transfer',
    '--epochs', '5',
    '--save_smiles_splits'
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--data_path', '../data/examples/regression.csv',
    '--dataset_type', 'regression',
    '--save_dir', 'test_checkpoints_transfer',
    '--epochs', '5',
    '--checkpoint_frzn', 'test_checkpoints_transfer/fold_0/model_0/model.pt'  # load the pretrained weights and freeze them
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)

In [None]:
arguments = [
    '--test_path', 'test_checkpoints_transfer/fold_0/test_smiles.csv',
    '--preds_path', 'test_preds_transfer.csv',
    '--checkpoint_dir', 'test_checkpoints_transfer',
]

args = chemprop.args.PredictArgs().parse_args(arguments)
preds = chemprop.train.make_predictions(args=args)