# Virtual screening

Ligand-based virtual screening (LBVS) predicts the bioactivity of large collections of compounds. It involves several steps, each of which is straightforward to perform with scikit-fingerprints.

---

## Notebook setup

Import and load the malaria_hts dataset, which comes from high-throughput screening (HTS) campaigns from around 2010-2012.

Let's load and preprocess the data into our desired format: SMILES strings stored in a list and their corresponding labels stored in a NumPy array, with ambiguous labels removed.

In [1]:
import pandas as pd
import numpy as np

# load malaria dataset
df = pd.read_parquet("data/malaria_hts_train.parquet")

# Remove ambiguous labels
df = df[df["label"] != "ambiguous"]

# map labels to binary
df["label"] = df["label"].map({"false": 0, "true": 1})

# Extract data into smiles and labels
smiles_list = df["SMILES"].values
labels = df["label"].values

# subsample
pos_idx = np.where(labels == 1)[0]
neg_idx = np.random.choice(np.where(labels == 0)[0], size=100_000, replace=False)
idx = np.concatenate([pos_idx, neg_idx])
smiles_list = smiles_list[idx].tolist()
labels = labels[idx]

len(smiles_list), labels.shape[0]

(101528, 101528)
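
The subsampling above draws a different set of negatives on every run, because `np.random.choice` is called without a seed. A minimal sketch of a reproducible variant, assuming a seeded `np.random.default_rng` generator is acceptable; the toy labels and the seed value are illustrative and not taken from the dataset:

```python
import numpy as np

# Toy labels standing in for the real dataset: 5 actives, 95 inactives
toy_labels = np.array([1] * 5 + [0] * 95)

# A seeded generator makes the negative sample identical across runs
rng = np.random.default_rng(seed=0)

pos_idx = np.where(toy_labels == 1)[0]
neg_pool = np.where(toy_labels == 0)[0]
neg_idx = rng.choice(neg_pool, size=20, replace=False)

# Keep every active and only the sampled inactives
idx = np.concatenate([pos_idx, neg_idx])
sub_labels = toy_labels[idx]

print(len(sub_labels), int(sub_labels.sum()))  # 25 5
```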

Display example SMILES strings.

In [2]:
smiles_list[:10]

['CC(C)CN(CC(C)C)S(=O)(=O)c1ccc(cc1)C(=O)Nc2nc(cs2)c3ccccn3',
 'Clc1cccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c(OC)c1',
 'COc1ccc(OC)c(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(Cl)cc1NC(=O)CSc2nc(ns2)c3ccccc3Cl',
 'Brc1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CCOC(=O)C1=C(C)N=C2S\\C(=C\\c3ccccc3O)\\C(=O)N2C1c4ccccc4',
 'CCOC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'OC(=O)c1cccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)c1']

### Exercise 1

This dataset contains invalid molecules. Use `MolFromSmilesTransformer` to remove them by setting the `valid_only` parameter to `True`.

Notice that if we remove invalid molecules, the label array will be longer than the list of molecules. We need to filter both together by using the `.transform_x_y()` method.

In case of issues, consult the `MolFromSmilesTransformer` [documentation](https://scikit-fingerprints.readthedocs.io/latest/api_reference.html), which you will find in the preprocessing section.

In [3]:
# import MolFromSmilesTransformer
from skfp.preprocessing import MolFromSmilesTransformer

# Create MolFromSmilesTransformer
mol_from_smiles = MolFromSmilesTransformer(
    valid_only=True, n_jobs=-1, batch_size=1000, verbose=True
)

# Run the dataset transformation to create molecules and labels
molecules, labels = mol_from_smiles.transform_x_y(smiles_list, labels)

print(f"Removed molecules: {len(smiles_list) - len(molecules)}")


Removed molecules: 1
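
The reason `.transform_x_y()` matters is that dropping a molecule must also drop its label at the same position. A minimal pure-Python sketch of that idea follows. Note that `is_parsable` is a hypothetical stand-in for a real SMILES parser (it only checks for balanced parentheses) and is not how scikit-fingerprints validates molecules:

```python
def is_parsable(smiles: str) -> bool:
    """Hypothetical validity check: non-empty with balanced parentheses."""
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def filter_x_y(smiles_list, labels):
    """Drop unparsable entries from both sequences at the same positions."""
    kept = [(smi, lab) for smi, lab in zip(smiles_list, labels) if is_parsable(smi)]
    kept_smiles = [smi for smi, _ in kept]
    kept_labels = [lab for _, lab in kept]
    return kept_smiles, kept_labels


toy_smiles = ["CCO", "CC(=O", "c1ccccc1"]
toy_labels = [1, 0, 0]
valid_smiles, valid_labels = filter_x_y(toy_smiles, toy_labels)
print(valid_smiles, valid_labels)  # ['CCO', 'c1ccccc1'] [1, 0]
```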


---

## Molecular filters

The first step in virtual screening (VS) workflows is filtering. scikit-fingerprints provides many physicochemical and substructural molecular filters.

First, take a look at how we would implement Lipinski's Rule of 5 filtering without scikit-fingerprints.

In [4]:
from tqdm import tqdm
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBA, CalcNumLipinskiHBD
from rdkit.Chem.Descriptors import MolWt

# Initialize lists for filtered molecules and labels
molecules_filtered = []
labels_filtered = []

# iterate over molecules
for i, mol in tqdm(enumerate(molecules), total=len(molecules)):

    # check for Lipinski's Rule of 5 conditions
    rules = [
        MolWt(mol) <= 500,  # molecular weight
        CalcNumLipinskiHBA(mol) <= 10,  # HBA
        CalcNumLipinskiHBD(mol) <= 5,  # HBD
        MolLogP(mol) <= 5,  # logP
    ]

    # Append data and label to the end of the output list
    if all(rules):
        molecules_filtered.append(mol)
        labels_filtered.append(labels[i])

100%|██████████| 101527/101527 [00:22<00:00, 4553.64it/s]
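
The loop above rejects a molecule as soon as any single rule fails. Rule of 5 implementations sometimes tolerate a limited number of violations instead, so it is worth checking how a given filter counts them. A minimal sketch on precomputed descriptors, where the dictionary keys and the `max_violations` parameter are illustrative assumptions rather than a library API:

```python
def count_ro5_violations(desc: dict) -> int:
    """Count how many Rule of 5 conditions a molecule breaks."""
    checks = [
        desc["mol_weight"] <= 500,
        desc["hba"] <= 10,
        desc["hbd"] <= 5,
        desc["logp"] <= 5,
    ]
    return sum(not ok for ok in checks)


def passes_ro5(desc: dict, max_violations: int = 0) -> bool:
    """max_violations=0 reproduces the strict all-rules check."""
    return count_ro5_violations(desc) <= max_violations


# Illustrative descriptor values, not taken from the dataset
heavy_but_fine = {"mol_weight": 540.0, "hba": 7, "hbd": 2, "logp": 3.1}

print(passes_ro5(heavy_but_fine))                    # False (strict check)
print(passes_ro5(heavy_but_fine, max_violations=1))  # True (one violation allowed)
```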


### Exercise 2

Instead of filtering the molecules manually, we can use the scikit-learn compatible interface provided by scikit-fingerprints.

Import and initialize `LipinskiFilter`, then run `.transform_x_y()` on the molecules and labels, similarly to how we transformed both SMILES and labels before.

*Note that without labels we could still filter the dataset using the* `.transform()` *method.*



In [5]:
from skfp.filters import LipinskiFilter

# Named `lipinski_filter` to avoid shadowing the built-in `filter`
lipinski_filter = LipinskiFilter(n_jobs=-1, verbose=True)
molecules_filtered, labels_filtered = lipinski_filter.transform_x_y(molecules, labels)




Display how the number of molecules changes.

In [6]:
print(f"Original size     : {len(molecules)}")
print(f"Filtered size     : {len(molecules_filtered)}")
print(f"Removed molecules : {len(molecules) - len(molecules_filtered)}")

Original size     : 101527
Filtered size     : 100918
Removed molecules : 609


### Exercise 3

Let's see how we can easily apply several filters in a loop.

- First, import `PAINSFilter` from the `filters` submodule of `skfp`.
- Create a list with 3 instances of `PAINSFilter`, passing a different `variant` argument to each: `"A"`, `"B"` and `"C"`.
- In a for loop, apply `transform_x_y()` to `mols` (molecules) and `y` (labels).
- Make sure to overwrite `mols` and `y` in each iteration of the loop.

In [7]:
# Create new names to avoid overriding `molecules` and `labels`
mols, y = molecules, labels

print(f"Molecules before filtering: {len(mols)}")

Molecules before filtering: 101527


In [8]:
from skfp.filters import PAINSFilter

# Create a list of pains filters
pains_filters = [
    PAINSFilter(variant="A", n_jobs=-1),
    PAINSFilter(variant="B", n_jobs=-1),
    PAINSFilter(variant="C", n_jobs=-1),
]

# Iterate over filters and perform filtering
for pains_filter in pains_filters:
    mols, y = pains_filter.transform_x_y(mols, y)

In [9]:
n_active = y.sum()
n_inactive = len(y) - n_active

print(f"Molecules after filtering: {len(mols)}")
print(f"Final inactive molecules: {n_inactive}")
print(f"Final active molecules: {n_active}")

Molecules after filtering: 94224
Final inactive molecules: 93009
Final active molecules: 1215
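
When chaining filters, it is often useful to know how many molecules each stage removed. A minimal sketch of that bookkeeping, with plain Python callables standing in for filter objects; the two predicates are hypothetical and only illustrate the loop structure:

```python
def apply_filters(items, labels, predicates):
    """Apply named keep-predicates in order, keeping items and labels aligned."""
    report = []
    for name, keep in predicates:
        before = len(items)
        pairs = [(x, y) for x, y in zip(items, labels) if keep(x)]
        items = [x for x, _ in pairs]
        labels = [y for _, y in pairs]
        report.append((name, before - len(items)))
    return items, labels, report


# Hypothetical stand-ins for substructure filters
toy_predicates = [
    ("no_bromine", lambda smi: "Br" not in smi),
    ("no_nitro", lambda smi: "[N+](=O)[O-]" not in smi),
]

toy_smiles = ["CCO", "CCBr", "c1ccccc1[N+](=O)[O-]", "CCN"]
toy_labels = [0, 1, 0, 1]

kept, kept_labels, report = apply_filters(toy_smiles, toy_labels, toy_predicates)
print(kept, kept_labels)  # ['CCO', 'CCN'] [0, 1]
print(report)             # [('no_bromine', 1), ('no_nitro', 1)]
```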


### All filters

Here you can find all filters currently available in scikit-fingerprints:

In [10]:
# from skfp.filters import (
#     BeyondRo5Filter,
#     BMSFilter,
#     BrenkFilter,
#     FAF4DruglikeFilter,
#     FAF4LeadlikeFilter,
#     GhoseFilter,
#     GlaxoFilter,
#     GSKFilter,
#     HaoFilter,
#     InpharmaticaFilter,
#     LINTFilter,
#     LipinskiFilter,
#     MLSMRFilter,
#     MolecularWeightFilter,
#     NIBRFilter,
#     NIHFilter,
#     OpreaFilter,
#     PAINSFilter,
#     PfizerFilter,
#     REOSFilter,
#     RuleOfFourFilter,
#     RuleOfThreeFilter,
#     RuleOfTwoFilter,
#     RuleOfVeberFilter,
#     RuleOfXuFilter,
#     SureChEMBLFilter,
#     TiceHerbicidesFilter,
#     TiceInsecticidesFilter,
#     ValenceDiscoveryFilter,
#     ZINCBasicFilter,
#     ZINCDruglikeFilter,
# )

---

## Data splits

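A quick word for anyone unfamiliar with data splits: a split divides the dataset into training, validation and test subsets, so that a model is evaluated on molecules it has not seen during training.

Previously, we loaded pre-computed splits provided by benchmarks.

Sometimes we want to compute our own splits. In that case, implementing them ourselves can get messy.

Take a look at a scaffold split computed without scikit-fingerprints.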

In [11]:
# Without skfp

from copy import deepcopy
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit import Chem
from collections import defaultdict

data_size = len(molecules)

# Compute sizes of the output dataset
test_size = int(0.1 * data_size)
valid_size = int(0.1 * data_size)
train_size = data_size - test_size - valid_size

# Determine indices that correspond to individual scaffold sets
scaffold_sets = defaultdict(list)
for idx, mol in enumerate(molecules):
    mol = deepcopy(mol)
    Chem.RemoveStereochemistry(mol)
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
    scaffold_sets[scaffold].append(idx)

# Sort scaffold sets
scaffold_sets = list(scaffold_sets.values())
scaffold_sets.sort(key=len)

# Create indices out of the created scaffold sets
train_idxs: list[int] = []
valid_idxs: list[int] = []
test_idxs: list[int] = []
for scaffold_set in scaffold_sets:
    if len(test_idxs) < test_size:
        test_idxs.extend(scaffold_set)
    elif len(valid_idxs) < valid_size:
        valid_idxs.extend(scaffold_set)
    else:
        train_idxs.extend(scaffold_set)

# Extract data
molecules_train = [molecules[i] for i in train_idxs]
molecules_valid = [molecules[i] for i in valid_idxs]
molecules_test = [molecules[i] for i in test_idxs]

labels_train = labels[train_idxs]
labels_valid = labels[valid_idxs]
labels_test = labels[test_idxs]

In [12]:
len(molecules_train), len(molecules_valid), len(molecules_test)

(81220, 10154, 10153)
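
The key property of the split above is that all molecules sharing a scaffold land in the same subset. A compact sketch of the same greedy assignment on toy group keys (plain strings standing in for scaffold SMILES), with the smallest groups assigned first as in the cell above:

```python
from collections import defaultdict


def group_split(keys, test_frac=0.2, valid_frac=0.2):
    """Greedy split that never divides a group between subsets."""
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)

    # Smallest groups first, so rare scaffolds fill test and validation
    ordered = sorted(groups.values(), key=len)

    test_size = int(test_frac * len(keys))
    valid_size = int(valid_frac * len(keys))

    train, valid, test = [], [], []
    for members in ordered:
        if len(test) < test_size:
            test.extend(members)
        elif len(valid) < valid_size:
            valid.extend(members)
        else:
            train.extend(members)
    return train, valid, test


toy_keys = ["A", "A", "A", "A", "B", "B", "C", "D", "A", "B"]
train_idx, valid_idx, test_idx = group_split(toy_keys)
print(train_idx, valid_idx, test_idx)  # [0, 1, 2, 3, 8] [4, 5, 9] [6, 7]
```

Because whole groups are assigned at once, a subset can overshoot its target size (the validation set here holds 3 items instead of 2). This is why scaffold splits only approximate the requested proportions.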

### Exercise 4

Fortunately, scikit-fingerprints provides ready-made splitting functions.

Let's split our original `molecules` and `labels` data into training, validation and test sets in an 8:1:1 proportion.

To do that, use `scaffold_train_valid_test_split` from the `model_selection` submodule of `skfp`.

You can find more information about how to use splits in the `model_selection` section of the [documentation](https://scikit-fingerprints.readthedocs.io/latest/api_reference.html).

In [13]:
# With skfp

from skfp.model_selection import scaffold_train_valid_test_split

(
    molecules_train,
    molecules_valid,
    molecules_test,
    labels_train,
    labels_valid,
    labels_test,
) = scaffold_train_valid_test_split(
    molecules, labels, train_size=0.8, valid_size=0.1, test_size=0.1
)

Note that if we don't want a validation set, we can import `scaffold_train_test_split` instead.

In [14]:
len(molecules_train), len(molecules_valid), len(molecules_test)

(81220, 10154, 10153)

All splits available in skfp:

In [15]:
# from skfp.model_selection import (
#     butina_train_test_split,
#     butina_train_valid_test_split,
#     maxmin_train_test_split,
#     maxmin_train_valid_test_split,
#     randomized_scaffold_train_test_split,
#     randomized_scaffold_train_valid_test_split,
#     scaffold_train_test_split,
#     scaffold_train_valid_test_split,
# )

---

## Similarity, distance & evaluation metrics



### Similarities & distances

In some cases we want to measure how similar two molecules are to each other, or how far apart they are in some feature space. To do that, we can compute a similarity or distance between their molecular fingerprints.

scikit-fingerprints provides both binary and count variants of similarity and distance functions.

Let's compute the similarity between two binary vectors using `tanimoto_binary_similarity`.

In [16]:
bit_vec_1 = np.array([1, 1, 0, 0])
bit_vec_2 = np.array([0, 1, 1, 0])

In [17]:
from skfp.distances import tanimoto_binary_similarity

binary_similarity = tanimoto_binary_similarity(bit_vec_1, bit_vec_2)

print(f"similarity of binary vectors: {binary_similarity}")

similarity of binary vectors: 0.3333333333333333
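
For binary vectors, the Tanimoto similarity is the number of bits set in both vectors divided by the number of bits set in either. A minimal pure-Python sketch of that definition, written for illustration and not taken from the library's implementation:

```python
def tanimoto_binary(vec_a, vec_b) -> float:
    """|A and B| / |A or B| for equal-length 0/1 sequences."""
    both = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    either = sum(1 for a, b in zip(vec_a, vec_b) if a or b)
    # Two all-zero vectors are treated as identical here (a convention choice)
    return both / either if either else 1.0


toy_sim = tanimoto_binary([1, 1, 0, 0], [0, 1, 1, 0])
print(toy_sim)  # one shared bit out of three bits set overall, i.e. 1/3
```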


For count vectors, we need to use `tanimoto_count_similarity` instead.

In [18]:
count_vec_1 = np.array([2, 3, 4, 0])
count_vec_2 = np.array([2, 3, 4, 2])

In [19]:
from skfp.distances import tanimoto_count_similarity

# Pass the count vectors defined above, not the binary ones
count_similarity = tanimoto_count_similarity(count_vec_1, count_vec_2)

print(f"similarity of count vectors: {count_similarity:.4f}")

similarity of count vectors: 0.8788
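
For count vectors, a common generalization of Tanimoto similarity replaces set sizes with dot products: `dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))`. The sketch below assumes this dot-product form; check the library documentation for the exact definition it uses. It is applied to the count vectors `[2, 3, 4, 0]` and `[2, 3, 4, 2]` defined above:

```python
def tanimoto_count(vec_a, vec_b) -> float:
    """Dot-product form of Tanimoto similarity for non-negative counts."""
    dot_ab = sum(a * b for a, b in zip(vec_a, vec_b))
    dot_aa = sum(a * a for a in vec_a)
    dot_bb = sum(b * b for b in vec_b)
    denominator = dot_aa + dot_bb - dot_ab
    return dot_ab / denominator if denominator else 1.0


toy_count_sim = tanimoto_count([2, 3, 4, 0], [2, 3, 4, 2])
print(round(toy_count_sim, 4))  # 29 / 33, about 0.8788
```

On 0/1 vectors this formula reduces to the binary definition, since each dot product then counts set bits.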


We can also use bulk similarity computation to speed up heavy computations between two sets of molecules.

If one set has X molecules and the other has Y, computing distances or similarities for all X × Y pairs one at a time can be very slow. The bulk functions in scikit-fingerprints compute them efficiently.

In [20]:
from skfp.distances.tanimoto import (
    bulk_tanimoto_binary_similarity
)

arr_1 = np.array(
    [
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
    ]
)

arr_2 = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(arr_1, arr_2)
sim

array([[0.66666667, 0.66666667, 1.        ],
       [0.5       , 0.5       , 0.33333333],
       [0.66666667, 0.66666667, 1.        ]])
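
One reason bulk computation can be fast is that all pairwise intersections of binary rows can be obtained with a single matrix product. A minimal NumPy sketch of that idea follows; it illustrates the principle and makes no claim about how scikit-fingerprints implements its bulk functions:

```python
import numpy as np


def bulk_tanimoto_binary(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pairwise binary Tanimoto between rows of x and rows of y."""
    x = x.astype(float)
    y = y.astype(float)
    shared = x @ y.T  # number of bits set in both rows, for every pair
    total = x.sum(axis=1)[:, None] + y.sum(axis=1)[None, :]
    union = total - shared
    # Pairs of all-zero rows keep the default value of 1.0
    return np.divide(shared, union, out=np.ones_like(shared), where=union > 0)


toy_a = np.array([[1, 1, 1], [0, 0, 1], [1, 1, 1]])
toy_b = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1]])
toy_bulk = bulk_tanimoto_binary(toy_a, toy_b)
print(toy_bulk.round(3))
```

On these inputs the sketch reproduces the matrix shown above.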

Sometimes we want to compute the similarity between every pair of molecules in a dataset. We can do that by passing just one argument.

In [21]:
X = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(X)
sim

array([[1.        , 0.33333333, 0.66666667],
       [0.33333333, 1.        , 0.66666667],
       [0.66666667, 0.66666667, 1.        ]])

### Performance

Let's compare the time taken by a manual pairwise loop and by the bulk function when computing the similarity matrix between all molecules in a small dataset of 300 molecules.

In [22]:
from skfp.fingerprints import ECFPFingerprint

ecfp_fingerprint = ECFPFingerprint(count=True, n_jobs=-1, verbose=True)

fps = ecfp_fingerprint.transform(molecules[:300])

  0%|          | 0/25 [00:00<?, ?it/s]

Manual implementation

In [23]:
%timeit -r 3 -n 10 [tanimoto_count_similarity(fps[i], fps[j]) for i in range(len(fps)) for j in range(len(fps))]

1.73 s ± 11.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


Bulk implementation

In [24]:
from skfp.distances.tanimoto import bulk_tanimoto_count_similarity

In [25]:
%timeit -r 3 -n 10 [bulk_tanimoto_count_similarity(fps)]

30.6 ms ± 19.8 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


### Exercise 5

Similarities and distances have very similar implementations. Now that we know how to use similarities, let's train a k-nearest neighbors model using the binary variant of `ECFPFingerprint`. To do that, we'll use the BACE dataset from the MoleculeNet benchmark.

- Load SMILES strings and labels from the BACE dataset.
- Perform a scaffold train-test split with an 8:2 ratio.
- Build a scikit-learn pipeline using the ECFP fingerprint and `KNeighborsClassifier`. Remember to pass the binary variant of Tanimoto **distance** as the `metric` parameter of the KNN model.
- Fit the pipeline on the training data and make predictions.

In [26]:
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

# Complete needed imports!
from skfp.distances import tanimoto_binary_distance
from skfp.model_selection import scaffold_train_test_split
from skfp.datasets.moleculenet import load_bace

# load bace dataset
smiles, labels = load_bace()

# perform train/test split. (we don't want to use validation set!)
smiles_train, smiles_test, labels_train, labels_test = scaffold_train_test_split(
    smiles, labels, test_size=0.2
)

# initialize a pipeline consisting of binary ECFP fingerprint
# and KNeighborsClassifier with correct Tanimoto distance metric
pipeline = make_pipeline(
    ECFPFingerprint(n_jobs=-1),
    KNeighborsClassifier(metric=tanimoto_binary_distance, n_jobs=-1),
)

# train the pipeline and predict positive-class probabilities
pipeline.fit(smiles_train, labels_train)
labels_pred = pipeline.predict_proba(smiles_test)[:, 1]

# Compute auroc metric
auroc_score = roc_auc_score(labels_test, labels_pred)

print(f"AUROC score: {auroc_score}")

AUROC score: 0.7775590551181102


In [27]:
auroc_score

0.7775590551181102
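
Conceptually, the classifier above finds the training fingerprints closest to a query under the Tanimoto distance (one minus the similarity) and lets them vote. A minimal brute-force sketch of that idea in pure Python; it illustrates the principle only and is not how scikit-learn implements neighbor search:

```python
from collections import Counter


def tanimoto_distance(vec_a, vec_b) -> float:
    """1 - |A and B| / |A or B| for equal-length 0/1 sequences."""
    both = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    either = sum(1 for a, b in zip(vec_a, vec_b) if a or b)
    return 1.0 - (both / either if either else 1.0)


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors nearest to the query."""
    ranked = sorted(
        range(len(train_x)), key=lambda i: tanimoto_distance(train_x[i], query)
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy fingerprints and labels, made up for this example
toy_x = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
toy_y = [1, 1, 0, 0, 1]
toy_prediction = knn_predict(toy_x, toy_y, query=[1, 1, 0, 1])
print(toy_prediction)  # 1: two of the three nearest neighbors are labeled 1
```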

### Multioutput evaluation metrics

Sometimes the dataset we use has multiple labels per molecule.

In these scenarios we want to compute a metric for each task and aggregate the results.

For that purpose, scikit-fingerprints provides multioutput scoring metrics.

### Exercise 6

- Load the SIDER dataset from the MoleculeNet benchmark as SMILES strings and labels.
- Perform a scaffold train-test split with an 8:2 ratio.
- Create and fit a pipeline consisting of the **count** variant of the ECFP fingerprint and `RandomForestClassifier`, similarly to the previous exercise.
- Make a probability prediction using the `.predict_proba()` method.
- For multioutput problems, the predicted probabilities come back with one block of class probabilities per task, so they are effectively 3-dimensional. The `extract_pos_proba()` function from the `metrics` submodule of `skfp` extracts the positive-class probabilities easily.
- Pass the true labels alongside the extracted probabilities to `multioutput_auprc_score` from the `metrics` submodule of `skfp`.

In [28]:
from sklearn.ensemble import RandomForestClassifier

# complete imports
from skfp.datasets.moleculenet import load_sider
from skfp.metrics import multioutput_auprc_score, extract_pos_proba

smiles, labels = load_sider()

smiles_train, smiles_test, labels_train, labels_test = scaffold_train_test_split(
    smiles, labels, test_size=0.2
)

pipeline = make_pipeline(ECFPFingerprint(n_jobs=-1, count=True), RandomForestClassifier(n_jobs=-1))

pipeline.fit(smiles_train, labels_train)

labels_proba = pipeline.predict_proba(smiles_test)
labels_proba = extract_pos_proba(labels_proba)

auprc_score = multioutput_auprc_score(labels_test, labels_proba)

print(f"AUPRC score: {auprc_score}")



AUPRC score: 0.6865983292077823


In [29]:
auprc_score

0.6865983292077823
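
For a multi-output classifier, scikit-learn's `predict_proba` yields one `(n_samples, n_classes)` array per task, and a multioutput metric is typically the per-task score averaged over tasks. A minimal NumPy sketch of both steps follows. The helpers `positive_proba`, `average_precision` and `mean_over_tasks` are written for this example and are not the skfp functions; edge cases such as tied scores or tasks without positives are ignored:

```python
import numpy as np


def positive_proba(per_task_proba):
    """Stack the positive-class column of each task into (n_samples, n_tasks)."""
    return np.column_stack([proba[:, 1] for proba in per_task_proba])


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive sample."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_rank[hits == 1].mean())


def mean_over_tasks(y_true, y_score):
    """Average the per-task score across label columns."""
    scores = [
        average_precision(y_true[:, t], y_score[:, t])
        for t in range(y_true.shape[1])
    ]
    return float(np.mean(scores))


# Two tasks, three samples; each array holds [P(class 0), P(class 1)] per row
toy_proba = [
    np.array([[0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]),
]
toy_true = np.array([[1, 1], [0, 0], [1, 0]])

toy_pos = positive_proba(toy_proba)
toy_score = mean_over_tasks(toy_true, toy_pos)
print(toy_pos.shape, round(toy_score, 4))  # (3, 2) 0.6667
```

Here the first task is ranked perfectly (score 1) while the only positive of the second task is ranked last (score 1/3), so the average is 2/3.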

---