# Molecular property prediction

Install dependencies if necessary (uncomment and run):

In [None]:
# !pip install -r requirements.txt

## 1. Molecular property prediction dataset

We will use the BACE dataset from the [MoleculeNet benchmark](https://arxiv.org/abs/1703.00564). Throughout this notebook, we will also make heavy use of the [scikit-fingerprints library](https://github.com/scikit-fingerprints/scikit-fingerprints), a scikit-learn compatible library built around RDKit.

The task is to classify inhibitors of Beta-Secretase 1 (BACE-1), an enzyme that plays a significant role in the development of Alzheimer’s disease. This is binary graph classification: a molecule either inhibits the enzyme or it does not.

For more information, see: ["Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors Using Ligand Based Approaches" G. Subramanian et al.](https://pubs.acs.org/doi/10.1021/acs.jcim.6b00290)

scikit-fingerprints has its own data loaders for several popular datasets.

In [None]:
from skfp.datasets.moleculenet import load_bace

smiles_list, y = load_bace()

print(f"Example molecule: {smiles_list[0]}")
print(f"Example class: {y[0]}")

## 2. Preprocessing

### SMILES -> Mol conversion

We already covered converting SMILES to RDKit `Mol` objects with pure RDKit. Its function works on one string at a time.

scikit-fingerprints allows us to do that for an entire dataset with `MolFromSmilesTransformer`. It has the same `.transform()` method as other scikit-learn objects. Note that most classes in scikit-fingerprints do not need a `.fit()` call before first use.

In [None]:
from skfp.preprocessing import MolFromSmilesTransformer

mol_from_smiles = MolFromSmilesTransformer()

mols = mol_from_smiles.transform(smiles_list)

### Scaffold Split

In molecular property prediction, we typically **don't** use random or stratified random split.

In real-world drug design problems, a trained ML model has to perform well on newly designed molecules. Such molecules often differ significantly from the ones seen in the training set, e.g. to be patentable. We need a splitting strategy that will force and test **out-of-distribution (OOD) generalization**.

We can group the molecules by the similarity of their core internal structure, called **scaffold**. This effectively splits the data into sets that differ from one another.

It's important to note that there are slight variations of the scaffold split, which differ in how they define the "core" part of a molecule. Most benchmarks provide explicit splits for datasets, e.g. Open Graph Benchmark ([OGB](https://ogb.stanford.edu/)) provides standardized scaffold splits for MoleculeNet.
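To make the grouping idea concrete, here is a minimal sketch of a scaffold split in plain Python. It assumes the scaffold of each molecule has already been computed (the `scaffolds` labels are made up for illustration), and it uses one common deterministic variant: scaffold groups are sorted from largest to smallest and assigned to the training set while they fit, so smaller groups tend to land in the test set. The function name `toy_scaffold_split` is hypothetical, and `scaffold_train_test_split` may differ in details.

```python
from collections import defaultdict


def toy_scaffold_split(scaffolds, test_size=0.2):
    """Split sample indices so that no scaffold is shared between train and test."""
    # group sample indices by their scaffold
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # largest scaffold groups first; ties broken by first occurrence
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = len(scaffolds) - int(round(test_size * len(scaffolds)))
    train_idxs, test_idxs = [], []
    for group in ordered:
        # keep whole groups together: a group goes to train only if it still fits
        if len(train_idxs) + len(group) <= n_train_target:
            train_idxs.extend(group)
        else:
            test_idxs.extend(group)
    return train_idxs, test_idxs


# made-up scaffold labels for 10 molecules
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idxs, test_idxs = toy_scaffold_split(scaffolds, test_size=0.2)
print(train_idxs, test_idxs)  # [0, 1, 2, 3, 4, 5, 6, 9] [7, 8]
```

Because whole groups are moved together, every scaffold appears in exactly one of the two sets, which is what forces the model to generalize to unseen core structures.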

In [None]:
import numpy as np
from skfp.model_selection import scaffold_train_test_split


train_idxs, test_idxs = scaffold_train_test_split(
    mols, test_size=0.2, return_indices=True
)

# split mols and labels
mols_train = np.array(mols)[train_idxs]
mols_test = np.array(mols)[test_idxs]

y_train = y[train_idxs]
y_test = y[test_idxs]

print(f"Train set size: {len(mols_train)}")
print(f"Test set size: {len(mols_test)}")

In [None]:
mols_train[0]

In [None]:
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol


GetScaffoldForMol(mols_train[0])

### Standardize

The molecular standardizer performs basic sanitization and standardization of the molecule object. It makes sure that e.g. certain functional groups or electric charges are represented in a uniform way.

In [None]:
from skfp.preprocessing import MolStandardizer


standardizer = MolStandardizer()

mols_train = standardizer.transform(mols_train)
mols_test = standardizer.transform(mols_test)

## 3. Classification with molecular fingerprints

To transform molecules into feature vectors, we will use molecular fingerprints. scikit-fingerprints implements many of them.

We will start with the popular ECFP fingerprint, a hashed fingerprint based on circular substructures. This turns our dataset into a typical tabular classification problem. Then, we can use any off-the-shelf classifier, like Random Forest.
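The hashing step can be illustrated with a toy example. The sketch below is not ECFP: real ECFP derives identifiers from circular atom neighborhoods, whereas here the substructure names are made-up strings and the function name is hypothetical. It only shows how arbitrary substructure identifiers can be folded into a fixed-length vector, and how the binary and count variants differ.

```python
import zlib


def toy_hashed_fingerprint(substructures, fp_size=16, count=False):
    """Fold substructure identifiers into a fixed-length vector by hashing."""
    vector = [0] * fp_size
    for name in substructures:
        # stable hash of the identifier, folded to a position in the vector
        position = zlib.crc32(name.encode("utf-8")) % fp_size
        if count:
            vector[position] += 1  # count variant: number of occurrences
        else:
            vector[position] = 1  # binary variant: presence only
    return vector


# made-up substructure identifiers; "C-O" occurs twice
substructures = ["C-C", "C-O", "C-O", "C=O", "c1ccccc1"]

binary_fp = toy_hashed_fingerprint(substructures, count=False)
count_fp = toy_hashed_fingerprint(substructures, count=True)
print(binary_fp)
print(count_fp)
```

Different substructures can collide on the same position, which is one reason why the vector length is worth treating as a tunable choice.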

This is a single-task dataset. In chemistry, we often have **multitask** datasets, where we perform many classifications at once. For example, a molecule can be toxic in many different ways. `multioutput_auroc_score` in scikit-fingerprints works in both single-task and multitask cases.
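As a rough illustration of what a multitask AUROC involves, the sketch below computes AUROC separately for each task (column) and averages the results. The helper names are hypothetical and this is plain Python, not the scikit-fingerprints implementation, which may handle details such as missing labels differently.

```python
def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # ties between a positive and a negative count as half a correct pair
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def mean_auroc_over_tasks(y_true, y_score):
    """Average per-task AUROC; rows are molecules, columns are tasks."""
    n_tasks = len(y_true[0])
    scores = []
    for task in range(n_tasks):
        task_true = [row[task] for row in y_true]
        task_score = [row[task] for row in y_score]
        scores.append(auroc(task_true, task_score))
    return sum(scores) / n_tasks


# made-up labels and predicted probabilities: 4 molecules, 2 tasks
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.6], [0.6, 0.4], [0.7, 0.5]]
print(mean_auroc_over_tasks(y_true, y_score))  # 0.75
```

In both tasks, three of the four positive-negative pairs are ranked correctly, so each task scores 0.75 and so does the average.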

In [None]:
from skfp.fingerprints import ECFPFingerprint


# create fingerprint transformer object
ecfp_fp = ECFPFingerprint()

# transform molecules into feature vectors
X_train_ecfp = ecfp_fp.transform(mols_train)
X_test_ecfp = ecfp_fp.transform(mols_test)

print(f"Fingerprint data shape: {X_train_ecfp.shape}")
print(f"Example vector: {X_train_ecfp[0]}")

Let's train the classifier, using fingerprint features as inputs.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from skfp.metrics import multioutput_auroc_score


clf = RandomForestClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train_ecfp, y_train)

y_pred = clf.predict_proba(X_test_ecfp)[:, 1]
auroc = multioutput_auroc_score(y_test, y_pred)

print(f"ECFP AUROC for Random Forest: {auroc:.2%}")

We can also use other models, such as k-nearest neighbours. Note that we have either binary or count vectors, so we should use an appropriate distance. Typically, this is the Tanimoto distance.

Binary Tanimoto distance formula:

$$
\text{dist}(\vec{a}, \vec{b}) = 1 - \frac{|\vec{a} \cap \vec{b}|}{|\vec{a}| + |\vec{b}| - |\vec{a} \cap \vec{b}|}
$$

We can also define Tanimoto distance for count data:

$$
\text{dist}(\vec{a}, \vec{b}) = 1 - \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|^2 + \|\vec{b}\|^2 - \vec{a} \cdot \vec{b}}
$$
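Both formulas can be checked by hand on tiny vectors. The following is a minimal plain-Python sketch with made-up vectors; the function names are hypothetical and it assumes that at least one vector is non-zero, an edge case that a library implementation has to handle separately.

```python
def tanimoto_binary_dist(a, b):
    """Binary Tanimoto distance for 0/1 vectors of equal length."""
    intersection = sum(x & y for x, y in zip(a, b))
    union = sum(a) + sum(b) - intersection
    return 1 - intersection / union


def tanimoto_count_dist(a, b):
    """Count Tanimoto distance for non-negative integer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1 - dot / denominator


# binary: 2 shared bits, 3 bits set in each vector -> 1 - 2 / (3 + 3 - 2)
print(tanimoto_binary_dist([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # 0.5

# count: dot product 3, squared norms 5 and 3 -> 1 - 3 / (5 + 3 - 3)
print(tanimoto_count_dist([2, 0, 1], [1, 1, 1]))  # 0.4
```

On 0/1 vectors the two functions agree, since the dot product equals the intersection size and each squared norm equals the number of set bits.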

If you're interested in why Tanimoto distance is used, see e.g. ["Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?" D. Bajusz et al.](https://doi.org/10.1186/s13321-015-0069-3).

### Exercise 1

Implement kNN classifier, using count ECFP fingerprints and Tanimoto distance. Use [scikit-fingerprints documentation on Tanimoto distance](https://scikit-fingerprints.github.io/scikit-fingerprints/) if necessary.

In [None]:
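# import kNN classifier and Tanimoto count distance from scikit-fingerprints
from sklearn.neighbors import KNeighborsClassifier
from skfp.distances import tanimoto_count_distance

# compute count ECFP fingerprints (the default ECFP variant is binary)
ecfp_count_fp = ECFPFingerprint(count=True)
X_train_ecfp_count = ecfp_count_fp.transform(mols_train)
X_test_ecfp_count = ecfp_count_fp.transform(mols_test)

# create kNN model with appropriate metric
clf = KNeighborsClassifier(n_jobs=-1, metric=tanimoto_count_distance)

# fit, predict
clf.fit(X_train_ecfp_count, y_train)
y_pred = clf.predict_proba(X_test_ecfp_count)[:, 1]

# calculate and print AUROC score
print(f"ECFP AUROC for kNN: {multioutput_auroc_score(y_test, y_pred):.2%}")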

### Exercise 2

scikit-fingerprints implements over 30 different algorithms. Another popular choice besides ECFP is MACCS, a substructure-based fingerprint.

Implement Random Forest model on binary MACCS fingerprint. Use [scikit-fingerprints documentation](https://scikit-fingerprints.github.io/scikit-fingerprints/) as needed.

In [None]:
# import MACCS fingerprint
from skfp.fingerprints import MACCSFingerprint

# create MACCS transformer, calculate fingerprints
maccs_fp = MACCSFingerprint(n_jobs=-1)

X_train_maccs = maccs_fp.transform(mols_train)
X_test_maccs = maccs_fp.transform(mols_test)

# train Random Forest, predict
clf = RandomForestClassifier(n_jobs=-1, random_state=0)
clf.fit(X_train_maccs, y_train)

y_pred = clf.predict_proba(X_test_maccs)[:, 1]

# calculate and print AUROC score
print(f"MACCS AUROC for RF: {multioutput_auroc_score(y_test, y_pred):.2%}")

## 4. Hyperparameter tuning

Molecular fingerprints have many hyperparameters that can be tuned. It's rarely done in papers, but can sometimes greatly improve performance.

Examples are:
- binary vs count variant
- length / number of bits
- ECFP radius

scikit-fingerprints implements `FingerprintEstimatorGridSearch` class for this. It can speed up the tuning of fingerprint and classifier, by avoiding unnecessary recalculations of the fingerprint.
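The saving from avoiding recalculation can be seen by counting. The sketch below is only a conceptual illustration, not the library's implementation: it enumerates the fingerprint and classifier grids used in this section without training anything, and compares how often fingerprints would be computed when they are cached per fingerprint setting versus recomputed for every combination. Cross-validation folds are ignored for simplicity, and the helper name `combinations` is hypothetical.

```python
from itertools import product

fp_grid = {"fp_size": [1024, 2048, 4096], "radius": [2, 3], "count": [False, True]}
clf_grid = {"n_estimators": [200, 500, 1000]}


def combinations(grid):
    """Expand a parameter grid into a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


fp_computations = 0
models_trained = 0

for fp_params in combinations(fp_grid):
    # fingerprints depend only on fp_params, so they are computed once here ...
    fp_computations += 1
    for clf_params in combinations(clf_grid):
        # ... and reused for every classifier setting
        models_trained += 1

# a naive joint grid would recompute fingerprints for every combination
naive_fp_computations = len(combinations(fp_grid)) * len(combinations(clf_grid))
print(fp_computations, naive_fp_computations, models_trained)  # 12 36 36
```

The number of trained models is the same either way; only the number of fingerprint computations shrinks, by a factor equal to the size of the classifier grid.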

First, we will set up the tuning of the classifier.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

clf = RandomForestClassifier(n_jobs=-1, random_state=0)

clf_param_grid = {"n_estimators": [200, 500, 1000]}

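# note: with default settings, make_scorer passes hard class predictions from .predict() to the metric
scorer = make_scorer(multioutput_auroc_score, greater_is_better=True)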

gridsearch_cv = GridSearchCV(
    estimator=clf,
    param_grid=clf_param_grid,
    scoring=scorer,
    verbose=2,
    cv=5,
    n_jobs=-1,
)

Now, we will set up the tuning of the fingerprint. `FingerprintEstimatorGridSearch` takes the fingerprint, its parameter grid, and the classifier grid search object.

In [None]:
from skfp.model_selection import FingerprintEstimatorGridSearch

ecfp_fp = ECFPFingerprint()

fp_grid = {
    "fp_size": [1024, 2048, 4096],
    "radius": [2, 3],
    "count": [False, True],
}

fp_estimator_cv = FingerprintEstimatorGridSearch(
    fingerprint=ecfp_fp,
    fp_param_grid=fp_grid,
    estimator_cv=gridsearch_cv,
    greater_is_better=True,
)

Let's perform the tuning now. We will also check the time, best hyperparameters, and performance after tuning.

In [None]:
from time import time

start = time()
fp_estimator_cv.fit(mols_train, y_train)
end = time()
print(f"scikit-learn tuning time : {end - start:.2f}s")

In [None]:
print("Best fingerprint hyperparameters:", fp_estimator_cv.best_fp_params_)

In [None]:
print("Best model hyperparameters:", fp_estimator_cv.best_estimator_cv_.best_params_)

In [None]:
y_pred = fp_estimator_cv.predict_proba(mols_test)[:, 1]

print(f"ECFP AUROC : {multioutput_auroc_score(y_test, y_pred):.2%}")

## 5. Multioutput prediction

Some molecular datasets focus on multiple tasks at a time. This often results in better classifiers, due to more general knowledge and built-in regularization. Random Forest from scikit-learn, combined with molecular fingerprints, is a natural solution for such problems.

One such multi-output dataset is SIDER. Its tasks concern adverse drug reactions (ADRs), i.e. drug side effects, grouped into 27 system organ classes of the MedDRA classification.

For details, see ["Low Data Drug Discovery with One-Shot Learning" H. Altae-Tran et al.](https://pubs.acs.org/doi/10.1021/acscentsci.6b00367).

### Exercise 3

Perform model training with molecular fingerprints on SIDER dataset.

This consists of the following steps:
1. Load data
2. Parse as RDKit molecules
3. Scaffold split
4. Compute fingerprints
5. Train Random Forest
6. Evaluate

Code templates have been prepared for you below. Note that they sometimes assume variable names. `extract_pos_proba` function will be useful. Use the previous notebook code and [scikit-fingerprints docs](https://scikit-fingerprints.github.io/scikit-fingerprints/) as necessary.
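Some context on why such a helper is useful: for multioutput problems, `predict_proba` of scikit-learn's Random Forest returns a list with one array per task, each with one column per class, rather than a single matrix. The sketch below mimics that layout with nested lists and collects the positive-class column of every task into one molecules-by-tasks table. The function name is hypothetical and this only illustrates the idea; it is not the scikit-fingerprints implementation.

```python
def toy_extract_pos_proba(per_task_proba):
    """Turn a list of (n_samples x 2) tables into one (n_samples x n_tasks) table."""
    n_samples = len(per_task_proba[0])
    return [
        [task_proba[i][1] for task_proba in per_task_proba]  # column 1 = positive class
        for i in range(n_samples)
    ]


# made-up probabilities for 3 molecules and 2 tasks: rows are [P(class 0), P(class 1)]
per_task_proba = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],  # task 0
    [[0.3, 0.7], [0.5, 0.5], [0.6, 0.4]],  # task 1
]
print(toy_extract_pos_proba(per_task_proba))
# [[0.1, 0.7], [0.6, 0.5], [0.8, 0.4]]
```

The result has the same shape as the label matrix, so it can be passed to a multioutput metric together with `y_test`.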

In [None]:
# import necessary things
from skfp.datasets.moleculenet import load_sider

# load SIDER dataset: SMILES strings and labels
smiles_list_sider, y = load_sider()

print(f"Example molecule: {smiles_list_sider[0]}")
print(f"Example classes: {y[0]}")
print(f"Number of outputs: {y[0].shape[0]}")

In [None]:
# import necessary things
from skfp.model_selection import scaffold_train_test_split
from skfp.preprocessing import MolFromSmilesTransformer, MolStandardizer

# parse SMILES as molecules
mol_from_smiles = MolFromSmilesTransformer()
mols = mol_from_smiles.transform(smiles_list_sider)

# scaffold split with 80-20% proportion
train_idxs, test_idxs = scaffold_train_test_split(
    mols, test_size=0.2, return_indices=True
)

mols_train = np.array(mols)[train_idxs]
mols_test = np.array(mols)[test_idxs]

y_train = y[train_idxs]
y_test = y[test_idxs]

# create standardizer and standardize molecules
standardizer = MolStandardizer()

mols_train = standardizer.transform(mols_train)
mols_test = standardizer.transform(mols_test)

In [None]:
# import necessary things
from skfp.fingerprints import ECFPFingerprint
from skfp.metrics import extract_pos_proba, multioutput_auroc_score
from sklearn.ensemble import RandomForestClassifier

# compute fingerprints
ecfp_fp = ECFPFingerprint()

X_train_ecfp = ecfp_fp.transform(mols_train)
X_test_ecfp = ecfp_fp.transform(mols_test)

# train RF classifier
rf_clf = RandomForestClassifier(n_jobs=-1, random_state=0)
rf_clf.fit(X_train_ecfp, y_train)

# predict, extract positive class probabilities
y_pred = rf_clf.predict_proba(X_test_ecfp)
y_pred = extract_pos_proba(y_pred)

# calculate AUROC
auroc = multioutput_auroc_score(y_test, y_pred)
print(f"SIDER ECFP+RF AUROC: {auroc:.2%}")

## 6. Pipelines

In ML, we often have multi-step pipelines, with preprocessing, training multiple models, merging their ensembles etc. We can do the same with scikit-fingerprints. Using many different fingerprints often helps.
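Conceptually, combining fingerprints means concatenating their feature vectors for each molecule, which is what scikit-learn's `FeatureUnion` does with the outputs of its transformers. A minimal sketch with made-up vectors follows; the two "fingerprints" are hypothetical stand-ins, not real fingerprint values, and the function name is illustrative only.

```python
def concatenate_features(*feature_tables):
    """Join several per-molecule feature tables side by side."""
    n_samples = len(feature_tables[0])
    return [
        [value for table in feature_tables for value in table[i]]
        for i in range(n_samples)
    ]


# made-up features for 2 molecules from two different fingerprints
fp_a = [[1, 0, 1], [0, 0, 1]]  # 3 features per molecule
fp_b = [[2, 5], [0, 1]]  # 2 features per molecule

X = concatenate_features(fp_a, fp_b)
print(X)  # [[1, 0, 1, 2, 5], [0, 0, 1, 0, 1]]
```

The combined table has one row per molecule and as many columns as all fingerprints together, so a downstream classifier sees the information from every fingerprint at once.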

Here, we will use an 80-10-10% train-valid-test scaffold split, so that the results are comparable with other papers.

In [None]:
from skfp.datasets.moleculenet import load_bace
from skfp.model_selection import scaffold_train_valid_test_split

smiles_list, y = load_bace()

train_idxs, valid_idxs, test_idxs = scaffold_train_valid_test_split(
    smiles_list, train_size=0.8, valid_size=0.1, test_size=0.1, return_indices=True
)

# split SMILES strings and labels
smiles_train = np.array(smiles_list)[train_idxs]
smiles_test = np.array(smiles_list)[test_idxs]

y_train = y[train_idxs]
y_test = y[test_idxs]

In [None]:
from skfp.fingerprints import ECFPFingerprint, TopologicalTorsionFingerprint
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

fps_union = FeatureUnion(
    [
        ("tt_fp", TopologicalTorsionFingerprint()),
        ("ecfp_fp", ECFPFingerprint()),
    ]
)

rf_clf = RandomForestClassifier(class_weight="balanced", n_jobs=-1, random_state=0)

pipeline = Pipeline(
    [
        ("mol_from_smiles", MolFromSmilesTransformer()),
        ("mol_standardizer", MolStandardizer()),
        ("fps_union", fps_union),
        ("scaler", MinMaxScaler()),
        ("random_forest", rf_clf),
    ]
)
pipeline.fit(smiles_train, y_train)

y_pred = pipeline.predict_proba(smiles_test)[:, 1]

auroc = multioutput_auroc_score(y_test, y_pred)
print(f"Feature union AUROC for Random Forest: {auroc:.2%}")

A major breakthrough in pretraining graph neural networks (GNNs) in chemistry was the paper ["Strategies for Pre-training Graph Neural Networks" W. Hu et al.](https://arxiv.org/abs/1905.12265). Their best AUROC on BACE was 84.5%. Therefore, we managed to outperform a complex, pretrained GNN using simple molecular fingerprints without any pretraining or sophisticated methods.

## Bonus exercise

Peptides are small proteins, typically with up to a few dozen amino acids. While they are frequently analyzed as text sequences, they are, like everything in chemistry and biology, molecules built from atoms. Peptide function prediction is an important task in bioinformatics and in the design of peptide-based drugs.

One dataset from this domain comes from the Long Range Graph Benchmark (LRGB), which proposed a 10-task peptide function prediction dataset with over 15k peptide molecules. For details, see ["Long Range Graph Benchmark" V. Dwivedi et al.](https://arxiv.org/abs/2206.08164). Perform molecule classification on this dataset, using predetermined benchmark splits. It uses AUPRC (Average Precision, AP) as a metric.

1. Load dataset and splits.
2. Parse SMILES strings as molecules.
3. Split molecules and labels. You can ignore the validation set.
4. Calculate fingerprints, e.g. ECFP, Topological Torsion, EState.
5. Train Random Forest classifier. Since the dataset is quite large, use 500 trees, instead of the default 100.
6. Calculate multioutput AUPRC score of the model.
7. Compare results with the paper.

Use [scikit-fingerprints documentation](https://scikit-fingerprints.github.io/scikit-fingerprints/) as necessary. If you want to expand this further, you can perform hyperparameter tuning on the predetermined validation split.
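For reference, Average Precision for a single task can be computed by hand: rank the samples by predicted score and, at every positive sample, add the precision at that rank weighted by the resulting increase in recall. The sketch below is plain Python with made-up labels and scores; it assumes no tied scores, and the function name is hypothetical. In a multioutput setting, such per-task values would then be aggregated across tasks.

```python
def average_precision(y_true, y_score):
    """AP = sum over positives of (recall increase) * (precision at that rank)."""
    # rank samples from highest to lowest score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)

    true_positives = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision = true_positives / rank
            ap += precision / n_pos  # recall increases by 1 / n_pos at each positive
    return ap


y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
# positives are found at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision(y_true, y_score))
```

Unlike AUROC, this metric is sensitive to how many negatives are ranked above the positives in absolute terms, which makes it informative for imbalanced tasks.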

In [None]:
import numpy as np
from skfp.datasets.lrgb import load_peptides_func, load_lrgb_mol_splits
from skfp.preprocessing import MolFromSmilesTransformer


smiles_list, y = load_peptides_func()
train_idxs, valid_idxs, test_idxs = load_lrgb_mol_splits("Peptides-func")

mol_from_smiles = MolFromSmilesTransformer(n_jobs=-1)
mols = mol_from_smiles.transform(smiles_list)

mols = np.array(mols)

mols_train = mols[train_idxs]
mols_test = mols[test_idxs]

y_train = y[train_idxs]
y_test = y[test_idxs]

In [None]:
from skfp.fingerprints import ECFPFingerprint, EStateFingerprint, TopologicalTorsionFingerprint
from skfp.metrics import extract_pos_proba, multioutput_auprc_score
from sklearn.ensemble import RandomForestClassifier


for fp_name, fp_cls in [
    ("ECFP", ECFPFingerprint),
    ("EState", EStateFingerprint),
    ("TopologicalTorsion", TopologicalTorsionFingerprint),
]:
    fp = fp_cls(n_jobs=-1)
    X_train = fp.transform(mols_train)
    X_test = fp.transform(mols_test)

    clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    y_pred = clf.predict_proba(X_test)
    y_pred = extract_pos_proba(y_pred)

    auprc = multioutput_auprc_score(y_test, y_pred)
    print(f"{fp_name} AUPRC: {auprc:.2%}")
