# Virtual screening

Ligand-based virtual screening (LBVS) predicts the bioactivity of large collections of compounds based on known ligands. It involves multiple steps, which are easy to perform with scikit-fingerprints.

---

## Notebook setup

Load the malaria HTS dataset from a Parquet file, remove ambiguously labeled compounds, and subsample it, keeping all actives and 100,000 randomly chosen inactives.

In [1]:
import pandas as pd
import numpy as np

# load malaria dataset
df = pd.read_parquet("data/malaria_hts_train.parquet")

# Remove ambiguous labels
df = df[df["label"] != "ambiguous"]

# map labels to binary
df["label"] = df["label"].map({"false": 0, "true": 1})

# Extract data into smiles and labels
smiles_list = df["SMILES"].values
labels = df["label"].values

# subsample
pos_idx = np.where(labels == 1)[0]
neg_idx = np.random.choice(np.where(labels == 0)[0], size=100_000, replace=False)
idx = np.concatenate([pos_idx, neg_idx])
smiles_list = smiles_list[idx].tolist()
labels = labels[idx]

len(smiles_list), labels.shape

(101528, (101528,))
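The subsampling above draws a different set of inactives on every run, because no random seed is set. A minimal sketch of a reproducible variant follows; the toy label array, the sample size, and the seed value are assumptions made for illustration, not values from the dataset.

```python
import numpy as np

# Toy labels: 5 actives and 95 inactives (illustrative, not the real dataset)
toy_labels = np.array([1] * 5 + [0] * 95)

# A seeded generator makes the draw repeatable; the seed value is arbitrary
rng = np.random.default_rng(0)

pos_idx = np.where(toy_labels == 1)[0]
neg_idx = rng.choice(np.where(toy_labels == 0)[0], size=20, replace=False)

# Keep all actives plus the sampled inactives
idx = np.concatenate([pos_idx, neg_idx])
subsampled = toy_labels[idx]

print(len(subsampled), int(subsampled.sum()))
```

Re-running the cell with the same seed selects the same inactives, which keeps downstream counts comparable between runs.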

In [2]:
smiles_list[:10]

['CC(C)CN(CC(C)C)S(=O)(=O)c1ccc(cc1)C(=O)Nc2nc(cs2)c3ccccn3',
 'Clc1cccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c(OC)c1',
 'COc1ccc(OC)c(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(Cl)cc1NC(=O)CSc2nc(ns2)c3ccccc3Cl',
 'Brc1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CCOC(=O)C1=C(C)N=C2S\\C(=C\\c3ccccc3O)\\C(=O)N2C1c4ccccc4',
 'CCOC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'OC(=O)c1cccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)c1']

This dataset may contain invalid molecules. We use `MolFromSmilesTransformer` and remove them by setting the `valid_only` parameter to `True`.

Note that removing invalid molecules from the list alone would leave the label array longer than the list of molecules. To keep both in sync, we transform them together with the `.transform_x_y()` method.

In [3]:
from skfp.preprocessing import MolFromSmilesTransformer

# Create MolFromSmilesTransformer
mol_from_smiles = MolFromSmilesTransformer(
    valid_only=True, n_jobs=-1, batch_size=1000, verbose=True
)

# Run the dataset transformation
molecules, labels = mol_from_smiles.transform_x_y(smiles_list, labels)

print(f"Removed molecules: {len(smiles_list) - len(molecules)}")

  0%|          | 0/101 [00:00<?, ?it/s]

Removed molecules: 0
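Why inputs and labels have to be filtered together can be sketched in plain Python: one validity mask is applied to both, otherwise they drift out of alignment. The helper `filter_x_y` and the toy validity check below are hypothetical stand-ins, not part of scikit-fingerprints.

```python
def filter_x_y(items, labels, is_valid):
    """Keep only the (item, label) pairs whose item passes `is_valid`."""
    mask = [is_valid(item) for item in items]
    kept_items = [item for item, keep in zip(items, mask) if keep]
    kept_labels = [label for label, keep in zip(labels, mask) if keep]
    return kept_items, kept_labels


# Toy data: the empty string stands in for an unparsable SMILES
toy_smiles = ["CCO", "", "c1ccccc1", "CC(=O)O"]
toy_labels = [1, 0, 0, 1]

kept_smiles, kept_labels = filter_x_y(toy_smiles, toy_labels, is_valid=bool)

print(kept_smiles)
print(kept_labels)
```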


---

## Molecular filters

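Molecular filters remove compounds that are unlikely to be useful screening candidates, for example because of unfavorable physicochemical properties or problematic substructures. Applying them early keeps the screened library focused on compounds worth following up.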

### Physicochemical filters

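Physicochemical filters keep molecules whose computed properties fall within given ranges. A well-known example is Lipinski's Rule of 5, which requires a molecular weight of at most 500 Da, at most 10 hydrogen bond acceptors, at most 5 hydrogen bond donors, and a logP of at most 5.

First, an example without scikit-fingerprints, using RDKit directly: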

In [4]:
from tqdm import tqdm
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBA, CalcNumLipinskiHBD
from rdkit.Chem.Descriptors import MolWt

# Initialize lists for filtered molecules and labels
molecules_filtered = []
labels_filtered = []

# iterate over molecules
for i, mol in tqdm(enumerate(molecules), total=len(molecules)):

    # check for Lipinski's Rule of 5 conditions
    rules = [
        MolWt(mol) <= 500,  # molecular weight
        CalcNumLipinskiHBA(mol) <= 10,  # HBA
        CalcNumLipinskiHBD(mol) <= 5,  # HBD
        MolLogP(mol) <= 5,  # logP
    ]

    # Append data and label to the end of the output list
    if all(rules):
        molecules_filtered.append(mol)
        labels_filtered.append(labels[i])

100%|██████████| 101528/101528 [00:23<00:00, 4231.31it/s]
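The rule check itself does not depend on RDKit once the descriptors are known. The sketch below applies the same four thresholds to precomputed property values. The property values, the dictionary keys, and the `max_violations` parameter are illustrative assumptions. Lipinski's rule is often applied with one violation tolerated, whereas the loop above requires all four conditions to hold.

```python
def passes_lipinski(props, max_violations=0):
    """Check Lipinski's Rule of 5 on a dict of precomputed descriptors."""
    rules = [
        props["mol_weight"] <= 500,  # molecular weight
        props["hba"] <= 10,  # hydrogen bond acceptors
        props["hbd"] <= 5,  # hydrogen bond donors
        props["logp"] <= 5,  # logP
    ]
    violations = rules.count(False)
    return violations <= max_violations


# Hypothetical descriptor values for a molecule that is slightly too lipophilic
toy_props = {"mol_weight": 420.5, "hba": 6, "hbd": 2, "logp": 5.4}

strict = passes_lipinski(toy_props)  # all rules must hold
lenient = passes_lipinski(toy_props, max_violations=1)  # one violation tolerated

print(strict, lenient)
```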


Instead of filtering the molecules manually, we can use the scikit-learn compatible interface provided by scikit-fingerprints.

Initialize `LipinskiFilter` and call `.transform_x_y()` on the molecules and labels:

In [5]:
from skfp.filters import LipinskiFilter

filter = LipinskiFilter(n_jobs=-1, verbose=True)
molecules_filtered, labels_filtered = filter.transform_x_y(molecules, labels)



  0%|          | 0/24 [00:00<?, ?it/s]

If we don't have labels, we can still filter the molecules alone with `.transform()`.

In [6]:
molecules_filtered = filter.transform(molecules)

  0%|          | 0/24 [00:00<?, ?it/s]

Display how the number of molecules changes:

In [7]:
print(f"Original size     : {len(molecules)}")
print(f"Filtered size     : {len(molecules_filtered)}")
print(f"Removed molecules : {len(molecules) - len(molecules_filtered)}")

Original size     : 101528
Filtered size     : 100946
Removed molecules : 582


### Substructural filters

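Substructural filters remove molecules that contain unwanted substructures. A common example is PAINS (pan-assay interference compounds), which tend to give false positive results in high-throughput screens. The PAINS patterns come in three sets, A, B, and C, which we apply one after another.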

In [8]:
# Create new names to avoid overriding `molecules` and `labels`
mols, y = molecules, labels

In [9]:
from skfp.filters import PAINSFilter

# Create a list of pains filters
verbosity_args = dict(batch_size=1000, n_jobs=-1, verbose=True)
pains_filters = [
    PAINSFilter(variant="A", **verbosity_args),
    PAINSFilter(variant="B", **verbosity_args),
    PAINSFilter(variant="C", **verbosity_args),
]

print(f"Molecules before filtering: {len(mols)}")

# Iterate over filters and perform filtering
for i, filter in enumerate(pains_filters):
    mols, y = filter.transform_x_y(mols, y)
    print(f"Molecules after PAINS {i}: {len(mols)}")

n_active = y.sum()
n_inactive = len(y) - n_active
print(f"Final inactive molecules: {n_inactive}")
print(f"Final active molecules: {n_active}")

Molecules before filtering: 101528


  0%|          | 0/101 [00:00<?, ?it/s]

Molecules after PAINS 0: 96483


  0%|          | 0/96 [00:00<?, ?it/s]

Molecules after PAINS 1: 94850


  0%|          | 0/94 [00:00<?, ?it/s]

Molecules after PAINS 2: 94356
Final inactive molecules: 93141
Final active molecules: 1215
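The counts printed above also show how the filters affect each class. The setup kept all actives plus 100,000 sampled inactives, so the 101,528 starting molecules include 1,528 actives. A short calculation of the retained fractions, using only the numbers from this run:

```python
# Counts taken from the notebook output above
n_total_before = 101_528
n_inactive_before = 100_000  # size of the inactive subsample
n_active_before = n_total_before - n_inactive_before

n_inactive_after = 93_141
n_active_after = 1_215

active_retained = n_active_after / n_active_before
inactive_retained = n_inactive_after / n_inactive_before

print(f"Actives retained  : {active_retained:.1%}")
print(f"Inactives retained: {inactive_retained:.1%}")
```

With these counts, the PAINS filters remove about a fifth of the actives but only about 7% of the inactives, which is worth keeping in mind when filtering a screening library.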


### All filters in scikit-fingerprints

In [10]:
# from skfp.filters import (
#     BeyondRo5Filter,
#     BMSFilter,
#     BrenkFilter,
#     FAF4DruglikeFilter,
#     FAF4LeadlikeFilter,
#     GhoseFilter,
#     GlaxoFilter,
#     GSKFilter,
#     HaoFilter,
#     InpharmaticaFilter,
#     LINTFilter,
#     LipinskiFilter,
#     MLSMRFilter,
#     MolecularWeightFilter,
#     NIBRFilter,
#     NIHFilter,
#     OpreaFilter,
#     PAINSFilter,
#     PfizerFilter,
#     REOSFilter,
#     RuleOfFourFilter,
#     RuleOfThreeFilter,
#     RuleOfTwoFilter,
#     RuleOfVeberFilter,
#     RuleOfXuFilter,
#     SureChEMBLFilter,
#     TiceHerbicidesFilter,
#     TiceInsecticidesFilter,
#     ValenceDiscoveryFilter,
#     ZINCBasicFilter,
#     ZINCDruglikeFilter,
# )

---

## Data splits

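Data splits divide a dataset into training, validation, and test subsets, so that a model is evaluated on molecules it has not seen during training.

Previously, we used pre-computed splits provided by benchmarks. Sometimes we want to compute our own splits, for example a scaffold split, which assigns molecules with the same Bemis-Murcko scaffold to the same subset.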

In [12]:
# Without skfp

from copy import deepcopy
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit import Chem
from collections import defaultdict

data_size = len(molecules)

# Compute sizes of the output dataset
test_size = int(0.1 * data_size)
valid_size = int(0.1 * data_size)
train_size = data_size - test_size - valid_size

# Determine indices that correspond to individual scaffold sets
scaffold_sets = defaultdict(list)
for idx, mol in enumerate(molecules):
    mol = deepcopy(mol)
    Chem.RemoveStereochemistry(mol)
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
    scaffold_sets[scaffold].append(idx)

# Sort scaffold sets
scaffold_sets = list(scaffold_sets.values())
scaffold_sets.sort(key=len)

# Create indices out of the created scaffold sets
train_idxs: list[int] = []
valid_idxs: list[int] = []
test_idxs: list[int] = []
for scaffold_set in scaffold_sets:
    if len(test_idxs) < test_size:
        test_idxs.extend(scaffold_set)
    elif len(valid_idxs) < valid_size:
        valid_idxs.extend(scaffold_set)
    else:
        train_idxs.extend(scaffold_set)

# Extract data
molecules_train = [molecules[i] for i in train_idxs]
molecules_valid = [molecules[i] for i in valid_idxs]
molecules_test = [molecules[i] for i in test_idxs]

labels_train = labels[train_idxs]
labels_valid = labels[valid_idxs]
labels_test = labels[test_idxs]

In [13]:
len(molecules_train), len(molecules_valid), len(molecules_test)

(81223, 10152, 10153)
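The point of a scaffold split is that no scaffold appears in more than one subset. Below is a minimal sketch of the same greedy assignment on toy data, followed by that check. The scaffold labels and group sizes are made up for illustration, and only a train/test split is shown to keep it short.

```python
from collections import defaultdict

# Toy scaffold label per molecule; the labels are illustrative only
toy_scaffolds = ["A", "A", "A", "A", "A", "A", "B", "B", "C", "D"]

# Group molecule indices by scaffold
groups = defaultdict(list)
for idx, scaffold in enumerate(toy_scaffolds):
    groups[scaffold].append(idx)

# Smallest groups first, so rare scaffolds end up in the test set
sorted_groups = sorted(groups.values(), key=len)

test_size = int(0.2 * len(toy_scaffolds))
train_idxs, test_idxs = [], []
for group in sorted_groups:
    if len(test_idxs) < test_size:
        test_idxs.extend(group)
    else:
        train_idxs.extend(group)

# No scaffold may be shared between the two subsets
train_scaffolds = {toy_scaffolds[i] for i in train_idxs}
test_scaffolds = {toy_scaffolds[i] for i in test_idxs}
print(sorted(train_scaffolds), sorted(test_scaffolds))
print(train_scaffolds.isdisjoint(test_scaffolds))
```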

In [14]:
# With skfp

from skfp.model_selection import scaffold_train_valid_test_split

# We can also import scaffold_train_test_split!

(
    molecules_train,
    molecules_valid,
    molecules_test,
    labels_train,
    labels_valid,
    labels_test,
) = scaffold_train_valid_test_split(
    molecules, labels, train_size=0.8, valid_size=0.1, test_size=0.1
)

In [15]:
len(molecules_train), len(molecules_valid), len(molecules_test)

(81223, 10152, 10153)

All split functions available in scikit-fingerprints:

In [16]:
# from skfp.model_selection import (
#     butina_train_test_split,
#     butina_train_valid_test_split,
#     maxmin_train_test_split,
#     maxmin_train_valid_test_split,
#     randomized_scaffold_train_test_split,
#     randomized_scaffold_train_valid_test_split,
#     scaffold_train_test_split,
#     scaffold_train_valid_test_split,
# )

---

## Similarity, distance & evaluation metrics



### Similarities & distances

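Fingerprints are compared with similarity or distance measures. The most common one is the Tanimoto (Jaccard) similarity, with the distance defined as one minus the similarity. scikit-fingerprints provides variants for binary and count vectors, as well as bulk versions that operate on whole matrices.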

In [17]:
bit_vec_1 = np.array([1, 1, 0, 0])
bit_vec_2 = np.array([0, 1, 1, 0])

In [18]:
from skfp.distances import tanimoto_binary_distance

binary_distance = tanimoto_binary_distance(bit_vec_1, bit_vec_2)

print(f"distance of binary vectors: {binary_distance}")

distance of binary vectors: 0.6666666666666667
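For binary vectors, the Tanimoto similarity is the number of bits set in both vectors divided by the number of bits set in at least one of them, and the distance is one minus that value. A hand computation for the two vectors above, written without scikit-fingerprints as a sanity check:

```python
vec_a = [1, 1, 0, 0]
vec_b = [0, 1, 1, 0]

# Bits set in both vectors, and bits set in at least one of them
intersection = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
union = sum(1 for a, b in zip(vec_a, vec_b) if a or b)

similarity = intersection / union  # 1 / 3
distance = 1 - similarity  # 2 / 3

print(f"similarity: {similarity:.4f}")
print(f"distance:   {distance:.4f}")
```

The value 0.6667 printed by the cell above is therefore the distance; the corresponding similarity is 0.3333.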


In [19]:
count_vec_1 = np.array([2, 3, 4, 0])
count_vec_2 = np.array([2, 3, 4, 2])

In [20]:
from skfp.distances import tanimoto_count_distance

count_distance = tanimoto_count_distance(count_vec_1, count_vec_2)

print(f"distance of count vectors: {count_distance:.4f}")

distance of count vectors: 0.1212
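For count vectors, a common generalization of Tanimoto similarity replaces set operations with dot products: `sim(a, b) = a·b / (|a|^2 + |b|^2 - a·b)`. The sketch below applies that formula to the values of `count_vec_1` and `count_vec_2`. It is an assumption, not verified here, that the library's count variant uses exactly this definition.

```python
count_a = [2, 3, 4, 0]
count_b = [2, 3, 4, 2]

dot_ab = sum(a * b for a, b in zip(count_a, count_b))  # 4 + 9 + 16 = 29
dot_aa = sum(a * a for a in count_a)  # 29
dot_bb = sum(b * b for b in count_b)  # 29 + 4 = 33

similarity = dot_ab / (dot_aa + dot_bb - dot_ab)  # 29 / 33
distance = 1 - similarity  # 4 / 33

print(f"similarity: {similarity:.4f}")
print(f"distance:   {distance:.4f}")
```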


In [21]:
from skfp.distances import bulk_tanimoto_binary_similarity

arr_1 = np.array(
    [
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
    ]
)

arr_2 = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(arr_1, arr_2)
sim

array([[0.66666667, 0.66666667, 1.        ],
       [0.5       , 0.5       , 0.33333333],
       [0.66666667, 0.66666667, 1.        ]])

In [22]:
X = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(X)
sim

array([[1.        , 0.33333333, 0.66666667],
       [0.33333333, 1.        , 0.66666667],
       [0.66666667, 0.66666667, 1.        ]])
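Pairwise binary Tanimoto similarities can be computed for whole matrices at once, which avoids Python-level loops over pairs of rows. A NumPy sketch of one way to do this with a matrix product follows. It illustrates the idea on the matrix `X` from above and is not the library's actual implementation.

```python
import numpy as np

X = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

# Number of shared on-bits for every pair of rows
intersection = X @ X.T

# On-bits per row; union = |a| + |b| - |a AND b|
row_sums = X.sum(axis=1)
union = row_sums[:, None] + row_sums[None, :] - intersection

# Assumes no all-zero rows, so the union is never zero
sim = intersection / union
print(sim)
```

The result matches the matrix returned by `bulk_tanimoto_binary_similarity(X)` above.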

In [23]:
from skfp.fingerprints import ECFPFingerprint
from skfp.distances import bulk_tanimoto_count_similarity, tanimoto_count_similarity

ecfp_fingerprint = ECFPFingerprint(count=True, n_jobs=-1, verbose=True)

fps = ecfp_fingerprint.transform(molecules[:300])

  0%|          | 0/25 [00:00<?, ?it/s]

In [24]:
%timeit -r 3 -n 10 [tanimoto_count_similarity(fps[i], fps[j]) for i in range(len(fps)) for j in range(len(fps))]

1.64 s ± 1.56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [25]:
%timeit -r 3 -n 10 [bulk_tanimoto_count_similarity(fps)]

29.5 ms ± 17 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [26]:
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

# Complete needed imports!
from skfp.model_selection import scaffold_train_test_split
from skfp.datasets.moleculenet import load_bace

# load bace dataset
smiles, labels = load_bace()

# perform train/test split. (we don't want to use validation set!)
smiles_train, smiles_test, labels_train, labels_test = scaffold_train_test_split(
    smiles, labels, test_size=0.2
)

# initialize a pipeline consisting of binary ECFP fingerprint,
# and KNeighborsClassifier with correct tanimoto distance metric
pipeline = make_pipeline(
    ECFPFingerprint(n_jobs=-1),
    KNeighborsClassifier(metric=tanimoto_binary_distance, n_jobs=-1),
)

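# train the pipeline and predict positive-class probabilities,
# since AUROC is a ranking metric and needs scores, not hard labels
pipeline.fit(smiles_train, labels_train)
labels_proba = pipeline.predict_proba(smiles_test)[:, 1]

# compute the AUROC metric
auroc = roc_auc_score(labels_test, labels_proba)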
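AUROC measures how well a model ranks positives above negatives, so it is normally computed from continuous scores, such as the positive-class probability. Hard 0/1 predictions collapse the ranking into two levels. A small pure-Python sketch of the pairwise definition, with made-up labels and scores, shows the difference:

```python
def pairwise_auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
y_true = [1, 1, 0, 0, 0]
y_proba = [0.9, 0.4, 0.45, 0.3, 0.1]

# Hard predictions at a 0.5 threshold lose the ordering within each group
y_pred = [1 if p >= 0.5 else 0 for p in y_proba]

auroc_from_proba = pairwise_auroc(y_true, y_proba)
auroc_from_labels = pairwise_auroc(y_true, y_pred)

print(f"{auroc_from_proba:.4f} {auroc_from_labels:.4f}")
```

The two values differ because thresholding turns most pairs into ties, so the score no longer reflects the ranking the model actually produced.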

### Multioutput evaluation metrics

In [27]:
from sklearn.ensemble import RandomForestClassifier
from skfp.datasets.moleculenet import load_sider
from skfp.metrics import multioutput_auprc_score, extract_pos_proba

smiles, labels = load_sider()

smiles_train, smiles_test, labels_train, labels_test = scaffold_train_test_split(
    smiles, labels, test_size=0.2
)

pipeline = make_pipeline(ECFPFingerprint(n_jobs=-1), RandomForestClassifier(n_jobs=-1))

pipeline.fit(smiles_train, labels_train)

labels_proba = pipeline.predict_proba(smiles_test)
labels_proba = extract_pos_proba(labels_proba)

auprc_score = multioutput_auprc_score(labels_test, labels_proba)



In [28]:
auprc_score

0.671021176167776
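For multioutput classifiers with binary tasks, scikit-learn's `predict_proba` returns one `(n_samples, 2)` array per task, while multioutput metrics need a single `(n_samples, n_tasks)` matrix of positive-class probabilities. The sketch below shows that reshaping and a per-task average using NumPy only. The helper names and data are hypothetical, and treating the multioutput score as the unweighted mean of per-task scores is an assumption about what `multioutput_auprc_score` computes. Ties and missing labels are ignored here.

```python
import numpy as np


def positive_proba_matrix(proba_per_task):
    """Stack the positive-class column of each task into one matrix."""
    return np.column_stack([proba[:, 1] for proba in proba_per_task])


def average_precision(y_true, y_score):
    """Average precision: mean of the precision values at each positive hit."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())


# Made-up probabilities for 2 tasks and 4 samples, columns = [P(0), P(1)]
proba_per_task = [
    np.array([[0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]),
    np.array([[0.6, 0.4], [0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
]
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

y_proba = positive_proba_matrix(proba_per_task)

# One score per task, then the unweighted mean over tasks
per_task = [
    average_precision(y_true[:, t], y_proba[:, t]) for t in range(y_true.shape[1])
]
mean_score = float(np.mean(per_task))

print(y_proba.shape, per_task, mean_score)
```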

---