# Virtual screening

Ligand-based virtual screening (LBVS) predicts the bioactivity of large collections of compounds. It consists of several steps, each of which is easy to perform with scikit-fingerprints.

Notebook inspired by [Tutorial for the Teach-Discover-Treat (TDT) Competition 2014](https://github.com/sriniker/TDT-tutorial-2014/tree/master) by Sereina Riniker and Gregory Landrum.

First, we will load the Malaria HTS dataset from the above competition. It was created in a series of high-throughput screening (HTS) campaigns from around 2010-2012. It contains already binarized bioactivity labels.

For shorter calculations in this tutorial, we will subsample the negative class to 100 thousand samples.

In [1]:
import pandas as pd
import numpy as np

# load malaria dataset
df = pd.read_parquet("data/malaria_hts_train.parquet")

# Remove ambiguous labels
df = df[df["label"] != "ambiguous"]

# map labels to binary
df["label"] = df["label"].map({"false": 0, "true": 1})

# Extract data into smiles and labels
smiles_list = df["SMILES"].values
labels = df["label"].values

# subsample
pos_idx = np.where(labels == 1)[0]
neg_idx = np.random.choice(np.where(labels == 0)[0], size=100_000, replace=False)
idx = np.concatenate([pos_idx, neg_idx])
smiles_list = smiles_list[idx].tolist()
labels = labels[idx]

print(f"Positive samples: {len(pos_idx)}")
print(f"Negative samples: {len(neg_idx)}")
print(f"Shapes: SMILES list {len(smiles_list)}, labels {len(labels)}")

Positive samples: 1528
Negative samples: 100000
Shapes: SMILES list 101528, labels 101528
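
The subsampling above draws a different set of negatives on every run, because `np.random.choice` is not seeded. A minimal sketch of a reproducible variant on toy labels, assuming NumPy's `default_rng` generator. The seed value, the toy label array, and the helper name `subsample_negatives` are illustrative assumptions, not part of the original notebook:

```python
import numpy as np


def subsample_negatives(labels, n_negatives, seed=0):
    """Return sorted indices of all positives plus a random subset of negatives."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    picked_neg = rng.choice(neg_idx, size=n_negatives, replace=False)
    return np.sort(np.concatenate([pos_idx, picked_neg]))


# Toy labels: 3 positives and 7 negatives (hypothetical data)
toy_labels = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
idx_a = subsample_negatives(toy_labels, n_negatives=4, seed=0)
idx_b = subsample_negatives(toy_labels, n_negatives=4, seed=0)

print(len(idx_a))                    # 3 positives + 4 negatives
print(np.array_equal(idx_a, idx_b))  # same seed gives the same subset
```

Fixing the seed keeps the reported dataset sizes and downstream metrics comparable between runs.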


### Exercise 1

This dataset contains some invalid molecules, e.g. breaking valence rules in modern RDKit. Use `MolFromSmilesTransformer`, and note that:
- some molecules may error out, so use `valid_only` and `.transform_x_y()`
- you can parallelize it with `n_jobs` and `batch_size` for efficiency, just like fingerprints.

Use [documentation](https://scikit-fingerprints.readthedocs.io/latest/api_reference.html) if necessary.

In [2]:
# import MolFromSmilesTransformer
from skfp.preprocessing import MolFromSmilesTransformer

# Create MolFromSmilesTransformer
mol_from_smiles = MolFromSmilesTransformer(
    valid_only=True, n_jobs=-1, batch_size=1000, verbose=True
)

# Run the dataset transformation to create molecules and labels
mols, labels = mol_from_smiles.transform_x_y(smiles_list, labels)

print(f"Removed molecules: {len(smiles_list) - len(mols)}")

Removed molecules: 1


---

## Molecular filters

Typically, in LBVS we have huge collections of molecules. Many of those are highly reactive, promiscuous compounds, dyes, or contain toxic functional groups. Thus, we use **molecular filters** to remove them. They are sets of rules to remove unwanted compounds.

They can be divided into 2 groups:

1. **Physicochemical filters** - define allowed ranges of physicochemical descriptors (e.g. molecular weight, logP, HBA, HBD) and keep only molecules fulfilling those requirements. Examples include [Lipinski's rule of 5](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.filters.LipinskiFilter.html#skfp.filters.LipinskiFilter) and [REOS](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.filters.REOSFilter.html#skfp.filters.REOSFilter).

2. **Substructural filters** - define unwanted substructures with SMARTS and keep only molecules that do not contain those patterns. Examples include [PAINS](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.filters.PAINSFilter.html) and [Brenk filter](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.filters.BrenkFilter.html).

RDKit contains 10 substructural filters, but no physicochemical filters. Scikit-fingerprints implements 31 filters: 21 physicochemical and 10 substructural. Its API is based on [feature-engine](https://feature-engine.trainindata.com/en/latest/) and other scikit-learn compatible libraries for preprocessing data.

Let's see an example for a substructural filter with RDKit.

In [3]:
from rdkit.Chem import FilterCatalog
from rdkit.Chem.rdfiltercatalog import FilterCatalogParams
from tqdm.notebook import tqdm

filter_catalog_pains_a = FilterCatalogParams.FilterCatalogs.PAINS_A
params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(filter_catalog_pains_a)
filters = FilterCatalog.FilterCatalog(params)

mols_filtered_rdkit = []
labels_filtered_rdkit = []
for mol, label in tqdm(zip(mols, labels), total=len(mols)):
    matches = filters.GetMatches(mol)
    if not matches:
        mols_filtered_rdkit.append(mol)
        labels_filtered_rdkit.append(label)




And with scikit-fingerprints:

In [4]:
from skfp.filters import PAINSFilter

filter_pains = PAINSFilter(variant="A", n_jobs=-1, verbose=True)
mols_filtered_skfp, labels_filtered_skfp = filter_pains.transform_x_y(mols, labels)

### Exercise 2

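Lipinski's rule of 5 is arguably the first and most famous molecular filter. It is a physicochemical filter built from four conditions. In its usual formulation a molecule may violate at most one of them; the solution below applies the stricter variant that allows no violations (`allow_one_violation=False`):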

- molecular weight $\leq 500$
- hydrogen bond acceptors (HBA) $\leq 10$
- hydrogen bond donors (HBD) $\leq 5$
- logP $\leq 5$

Implement Lipinski's rule using RDKit and a `for` loop, and filter the molecules. Remember to also filter the labels for consistency! Relevant RDKit descriptors are in the [rdkit.Chem.Descriptors](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html), [rdkit.Chem.rdMolDescriptors](https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html), and [rdkit.Chem.Crippen](https://www.rdkit.org/docs/source/rdkit.Chem.Crippen.html) modules.

Then, use the [scikit-fingerprints LipinskiFilter](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.filters.LipinskiFilter.html#skfp.filters.LipinskiFilter) and compare the two.

In [5]:
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.Descriptors import MolWt
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBD, CalcNumLipinskiHBA

mols_filtered_rdkit = []
labels_filtered_rdkit = []
for mol, label in tqdm(zip(mols, labels), total=len(mols)):
    mol_weight = MolWt(mol)
    hba = CalcNumLipinskiHBA(mol)
    hbd = CalcNumLipinskiHBD(mol)
    logp = MolLogP(mol)

    if mol_weight <= 500 and hba <= 10 and hbd <= 5 and logp <= 5:
        mols_filtered_rdkit.append(mol)
        labels_filtered_rdkit.append(label)


In [6]:
print(f"Original size     : {len(mols)}")
print(f"Filtered size     : {len(mols_filtered_rdkit)}")
print(f"Removed molecules : {len(mols) - len(mols_filtered_rdkit)}")

Original size     : 101527
Filtered size     : 90891
Removed molecules : 10636


In [7]:
from skfp.filters import LipinskiFilter

lipinski_filter = LipinskiFilter(n_jobs=-1, allow_one_violation=False)
mols_filtered_skfp, labels_filtered_skfp = lipinski_filter.transform_x_y(mols, labels)

Display how the number of molecules changes:

In [8]:
print(f"Original size     : {len(mols)}")
print(f"Filtered size     : {len(mols_filtered_skfp)}")
print(f"Removed molecules : {len(mols) - len(labels_filtered_skfp)}")

Original size     : 101527
Filtered size     : 90891
Removed molecules : 10636
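
The strict check above rejects a molecule on any single violation, whereas the classic formulation tolerates one. A minimal pure-Python sketch of that difference, operating on already-computed descriptor values. The function names and the example descriptor values are hypothetical, and this is not the scikit-fingerprints implementation:

```python
def count_lipinski_violations(mol_weight, hba, hbd, logp):
    """Count how many of the four Lipinski conditions are broken."""
    conditions = [
        mol_weight <= 500,
        hba <= 10,
        hbd <= 5,
        logp <= 5,
    ]
    return sum(not ok for ok in conditions)


def passes_lipinski(mol_weight, hba, hbd, logp, allow_one_violation=True):
    """Apply the rule with or without tolerance for a single violation."""
    max_violations = 1 if allow_one_violation else 0
    return count_lipinski_violations(mol_weight, hba, hbd, logp) <= max_violations


# Hypothetical descriptor values: only the molecular weight is too high
descriptors = dict(mol_weight=520.0, hba=8, hbd=2, logp=3.1)

lenient = passes_lipinski(**descriptors, allow_one_violation=True)
strict = passes_lipinski(**descriptors, allow_one_violation=False)
print(lenient)  # True: one violation is tolerated
print(strict)   # False: no violations are tolerated
```

Counting violations first, and thresholding afterwards, keeps both variants of the rule in a single code path.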


### Exercise 3

Scikit-learn `Pipeline` does not support `.transform_x_y()`, so in order to use multiple filters, we need to apply them one after another. A common example is using multiple PAINS sets A, B, C.

Create a list of `PAINSFilter` objects with variants A, B and C. Apply them to the molecules and labels one after another.

Save final results in variables `mols_filtered` and `labels_filtered`, which we will use in the rest of the notebook.

In [9]:
n_active = labels.sum()
n_inactive = len(labels) - n_active

print(f"Molecules before filtering: {len(mols)}")
print(f"Inactive molecules: {n_inactive}")
print(f"Active molecules: {n_active}")

Molecules before filtering: 101527
Inactive molecules: 99999
Active molecules: 1528


In [10]:
from skfp.filters import PAINSFilter

pains_filters = [
    PAINSFilter(variant="A", n_jobs=-1),
    PAINSFilter(variant="B", n_jobs=-1),
    PAINSFilter(variant="C", n_jobs=-1),
]

mols_filtered = mols
labels_filtered = labels

for pains_filter_variant in pains_filters:
    mols_filtered, labels_filtered = pains_filter_variant.transform_x_y(
        mols_filtered, labels_filtered
    )

In [11]:
n_active = labels_filtered.sum()
n_inactive = len(labels_filtered) - n_active

print(f"Number of molecules: before {len(mols)}, after {len(mols_filtered)}")
print(f"Inactive molecules: {n_inactive}")
print(f"Active molecules: {n_active}")

Number of molecules: before 101527, after 94354
Inactive molecules: 93139
Active molecules: 1215


### All filters

Here you can find all filters currently available in scikit-fingerprints:

In [12]:
# from skfp.filters import (
#     BeyondRo5Filter,
#     BMSFilter,
#     BrenkFilter,
#     FAF4DruglikeFilter,
#     FAF4LeadlikeFilter,
#     GhoseFilter,
#     GlaxoFilter,
#     GSKFilter,
#     HaoFilter,
#     InpharmaticaFilter,
#     LINTFilter,
#     LipinskiFilter,
#     MLSMRFilter,
#     MolecularWeightFilter,
#     NIBRFilter,
#     NIHFilter,
#     OpreaFilter,
#     PAINSFilter,
#     PfizerFilter,
#     REOSFilter,
#     RuleOfFourFilter,
#     RuleOfThreeFilter,
#     RuleOfTwoFilter,
#     RuleOfVeberFilter,
#     RuleOfXuFilter,
#     SureChEMBLFilter,
#     TiceHerbicidesFilter,
#     TiceInsecticidesFilter,
#     ValenceDiscoveryFilter,
#     ZINCBasicFilter,
#     ZINCDruglikeFilter,
# )

---

## Data splits

Before, we loaded pre-computed splits provided by benchmarks.

Sometimes we want to compute our own splits. In this case, implementing them by ourselves can get messy.

Take a look at MaxMin split computation without scikit-fingerprints.

In [13]:
# Without skfp
from math import ceil
from rdkit.SimDivFilters import MaxMinPicker
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

data_size = len(mols_filtered)

# Compute sizes of the output dataset
test_size = ceil(0.2 * data_size)
train_size = data_size - test_size

random_state = 42
fps = GetMorganGenerator(radius=2, fpSize=2048).GetFingerprints(mols_filtered)

picker = MaxMinPicker()
test_idxs = picker.LazyBitVectorPick(
    fps,
    poolSize=data_size,
    pickSize=test_size,
    seed=random_state,
)
test_idxs = list(test_idxs)
train_idxs = list(set(range(data_size)) - set(test_idxs))

# Extract data
mols_train_rdkit = [mols_filtered[i] for i in train_idxs]
mols_test_rdkit = [mols_filtered[i] for i in test_idxs]

# Index the filtered labels, which are aligned with mols_filtered
labels_train = labels_filtered[train_idxs]
labels_test = labels_filtered[test_idxs]

In [14]:
len(mols_train_rdkit), len(mols_test_rdkit)

(75483, 18871)

### Exercise 4

Fortunately, scikit-fingerprints provides ready-made splitting functionality.

Let's split our filtered `mols_filtered` and `labels_filtered` data into training and testing sets in an 8:2 proportion.

To do that, use `maxmin_train_test_split` from the `model_selection` submodule of `skfp`.

You can find more information about how to use splits in the `model_selection` section of the [documentation](https://scikit-fingerprints.readthedocs.io/latest/api_reference.html).

In [15]:
# With skfp

from skfp.model_selection import maxmin_train_test_split

(
    mols_train,
    mols_test,
    labels_train,
    labels_test,
) = maxmin_train_test_split(
    mols_filtered, labels_filtered, train_size=0.8, test_size=0.2
)

Note that if we want a validation set as well, we can import `maxmin_train_valid_test_split` instead.

In [16]:
len(mols_train), len(mols_test)

(75483, 18871)
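
For intuition about what a MaxMin split does: it greedily picks the molecule whose smallest distance to the already-picked set is largest, so the picked subset is spread across chemical space. A simplified pure-Python sketch on toy fingerprints stored as sets of on-bits. The helper names are hypothetical, the first pick is fixed to index 0 for determinism, and neither RDKit's `MaxMinPicker` nor scikit-fingerprints is claimed to work exactly this way:

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def maxmin_pick(fps, n_pick, first=0):
    """Greedily pick n_pick indices that maximize the minimum distance to earlier picks."""
    picked = [first]
    # min_dist[i] = distance from fingerprint i to its closest picked fingerprint
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(fps)) if i not in picked]
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[best]))
    return picked


# Toy fingerprints: two near-duplicates (0, 1) and two dissimilar ones (2, 3)
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {3, 4, 5}]
test_idx = maxmin_pick(toy_fps, n_pick=2)
train_idx = [i for i in range(len(toy_fps)) if i not in test_idx]

print(test_idx)   # [0, 2]
print(train_idx)  # [1, 3]
```

The near-duplicate of the first pick (index 1) is never selected early, which is why the picked subset is structurally diverse.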

All splits available in skfp:

In [17]:
# from skfp.model_selection import (
#     butina_train_test_split,
#     butina_train_valid_test_split,
#     maxmin_train_test_split,
#     maxmin_train_valid_test_split,
#     randomized_scaffold_train_test_split,
#     randomized_scaffold_train_valid_test_split,
#     scaffold_train_test_split,
#     scaffold_train_valid_test_split,
# )

---

## Similarity, distance & evaluation metrics



Now that we have training and testing datasets, we can perform similarity-search virtual screening. To do that, let's take a look at the distances and similarities provided by scikit-fingerprints. They quantify how similar, or different, two molecules are.

scikit-fingerprints provides both binary and count variants of its similarity and distance functions.

Let's compute the similarity between two binary vectors using `tanimoto_binary_similarity`:

In [18]:
bit_vec_1 = np.array([1, 1, 0, 0])
bit_vec_2 = np.array([0, 1, 1, 0])

In [19]:
from skfp.distances import tanimoto_binary_similarity

binary_similarity = tanimoto_binary_similarity(bit_vec_1, bit_vec_2)
print(f"similarity of binary vectors: {binary_similarity}")

similarity of binary vectors: 0.3333333333333333
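
The value 1/3 can be checked by hand: binary Tanimoto similarity is the number of positions set in both vectors divided by the number of positions set in at least one. A minimal sketch in plain Python, where `tanimoto_binary_by_hand` is a hypothetical helper and not part of scikit-fingerprints:

```python
def tanimoto_binary_by_hand(vec_a, vec_b):
    """Binary Tanimoto: |A and B| / |A or B| over positions holding a 1."""
    both = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    either = sum(1 for a, b in zip(vec_a, vec_b) if a or b)
    # Two all-zero vectors are treated as identical here (an assumption of this sketch)
    return both / either if either else 1.0


similarity = tanimoto_binary_by_hand([1, 1, 0, 0], [0, 1, 1, 0])
print(similarity)  # one shared bit out of three set bits, i.e. 1/3
```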


If we use count vectors, we need to use `tanimoto_count_similarity`:

In [20]:
count_vec_1 = np.array([2, 3, 4, 0])
count_vec_2 = np.array([2, 3, 4, 2])

In [21]:
from skfp.distances import tanimoto_count_similarity

# Use the count vectors defined above, not the binary ones
count_similarity = tanimoto_count_similarity(count_vec_1, count_vec_2)
print(f"similarity of count vectors: {count_similarity}")

similarity of count vectors: 0.8787878787878788
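
For count vectors, Tanimoto similarity is commonly defined with dot products. Assuming that definition (it is stated here from general knowledge, not quoted from the notebook), the similarity of two count vectors $a$ and $b$ is:

$$
T(a, b) = \frac{a \cdot b}{\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b}
$$

For `count_vec_1` $= (2, 3, 4, 0)$ and `count_vec_2` $= (2, 3, 4, 2)$ we get $a \cdot b = 4 + 9 + 16 + 0 = 29$, $\lVert a \rVert^2 = 29$ and $\lVert b \rVert^2 = 33$, so:

$$
T(a, b) = \frac{29}{29 + 33 - 29} = \frac{29}{33} \approx 0.879
$$

For vectors containing only 0 and 1 this formula reduces to the binary Tanimoto similarity.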


We can also use bulk similarity computation to speed up very heavy computations between two sets of molecules.

If one set holds X molecules and the other Y, computing distances or similarities for every pair across the two sets one at a time can be very slow. The scikit-fingerprints implementation computes them efficiently.

In [22]:
from skfp.distances.tanimoto import (
    bulk_tanimoto_binary_similarity
)

arr_1 = np.array(
    [
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
    ]
)

arr_2 = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(arr_1, arr_2)
sim

array([[0.66666667, 0.66666667, 1.        ],
       [0.5       , 0.5       , 0.33333333],
       [0.66666667, 0.66666667, 1.        ]])

Sometimes we want to compute the similarity between every two molecules in a single dataset. We can do that by passing just one argument.

In [23]:
X = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(X)
sim

array([[1.        , 0.33333333, 0.66666667],
       [0.33333333, 1.        , 0.66666667],
       [0.66666667, 0.66666667, 1.        ]])
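
One way such a bulk computation can be vectorized is with a single matrix product: for binary rows, `X @ Y.T` counts the shared bits for every pair at once. A minimal NumPy sketch of that idea, where `pairwise_tanimoto_binary` is a hypothetical helper. The actual scikit-fingerprints implementation may differ:

```python
import numpy as np


def pairwise_tanimoto_binary(X, Y=None):
    """Pairwise binary Tanimoto similarity between rows of X and rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    intersection = X @ Y.T            # shared on-bits for every pair of rows
    bits_x = X.sum(axis=1)[:, None]   # on-bits in each row of X
    bits_y = Y.sum(axis=1)[None, :]   # on-bits in each row of Y
    union = bits_x + bits_y - intersection
    # Pairs of all-zero rows have an empty union; treat them as identical (assumption)
    return np.divide(intersection, union, out=np.ones_like(union), where=union > 0)


X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 1]])
sim_manual = pairwise_tanimoto_binary(X)
print(np.round(sim_manual, 4))
```

On the 3x3 example above, this reproduces the matrix returned by `bulk_tanimoto_binary_similarity(X)`.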

### Performance

Let's compare the time of a manual, pair-by-pair computation of the similarity matrix with the bulk computation, for all molecules in a small dataset of 300 molecules.

In [24]:
from skfp.fingerprints import ECFPFingerprint

ecfp_fingerprint = ECFPFingerprint(count=True, n_jobs=-1, verbose=True)

fps = ecfp_fingerprint.transform(mols_train[:300])

Manual implementation

In [25]:
%timeit -r 3 -n 10 [tanimoto_count_similarity(fps[i], fps[j]) for i in range(len(fps)) for j in range(len(fps))]

1.75 s ± 9.44 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


Bulk implementation

In [26]:
from skfp.distances import bulk_tanimoto_count_similarity

In [27]:
%timeit -r 3 -n 10 [bulk_tanimoto_count_similarity(fps)]

30.7 ms ± 18.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


### Exercise 5

Now that we know how to compute similarities in bulk, let's perform a MAX fusion similarity search for virtual screening.

- First, extract the positive samples from the training dataset and transform them to vectors using molecular fingerprints. You can use a molecular fingerprint of your choice. We recommend ECFP.
- After that, transform the test set molecules using the same fingerprint.
- Using `bulk_tanimoto_count_similarity()`, compute similarities between the positive training samples and all test samples.
- Determine the class probability for each molecule in the test set: choose its highest similarity to any positive-class molecule from the training dataset. That is why this method is called "MAX fusion".
- Compute AUROC, BEDROC and Enrichment Factor at 5%. You can read about the relevant metrics in the [metrics section of documentation](https://scikit-fingerprints.readthedocs.io/latest/modules/metrics.html).


In [28]:
from sklearn.metrics import roc_auc_score

# Complete needed imports!
from skfp.metrics import enrichment_factor, bedroc_score

positive_train_mols = [mols_train[i] for i in np.where(labels_train)[0]]

ecfp_fingerprint = ECFPFingerprint(count=True, n_jobs=-1)
positive_train_fps = ecfp_fingerprint.transform(positive_train_mols)

test_fps = ecfp_fingerprint.transform(mols_test)

bulk_similarity = bulk_tanimoto_count_similarity(positive_train_fps, test_fps)

test_proba = bulk_similarity.max(axis=0)

auroc = roc_auc_score(labels_test, test_proba)
bedroc = bedroc_score(labels_test, test_proba, alpha=20)
ef = enrichment_factor(labels_test, test_proba, fraction=0.05)

print(f"AUROC: {auroc}")
print(f"EF5%: {ef}")
print(f"bedroc score: {bedroc}")

AUROC: 0.7394705692456688
EF5%: 7.645476509069284
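
The enrichment factor compares the hit rate among the top-scored fraction of the test set with the hit rate in the whole set, so an EF5% of about 7.6 means that the top 5% holds roughly 7.6 times more actives than a random pick of the same size would. A simplified pure-Python sketch of that definition on hypothetical toy data. The helper name is made up, and rounding and tie handling in `skfp.metrics.enrichment_factor` may differ:

```python
from math import ceil


def enrichment_factor_by_hand(y_true, y_score, fraction=0.05):
    """EF = (actives in top fraction / size of top fraction) / (all actives / N)."""
    n = len(y_true)
    n_top = max(1, ceil(fraction * n))
    # Rank molecules from highest to lowest score and keep the top slice
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    actives_top = sum(label for _, label in ranked[:n_top])
    actives_all = sum(y_true)
    return (actives_top / n_top) / (actives_all / n)


# Toy data: 20 molecules, 2 actives, and the single top-scored molecule is active
y_true = [1] + [0] * 18 + [1]
y_score = [0.9] + [0.5 - 0.01 * i for i in range(18)] + [0.1]

ef_toy = enrichment_factor_by_hand(y_true, y_score, fraction=0.05)
print(ef_toy)  # top 5% is 1 molecule and it is active: (1/1) / (2/20) = 10.0
```

A value of 1.0 corresponds to random ranking, and the maximum possible value depends on the share of actives in the dataset.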


---