# Virtual screening

Ligand-based virtual screening (LBVS) predicts the bioactivity of large collections of compounds from the structures of known active molecules. The workflow consists of several steps, each of which is straightforward to perform with scikit-fingerprints.

---

## Notebook setup

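Import the libraries and load the malaria HTS dataset from a local Parquet file (the code below uses it in place of the BACE dataset from MoleculeNet shown in notebook 1). All active compounds are kept and the inactive ones are subsampled to 100,000.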

In [1]:
import pandas as pd
import numpy as np

# load malaria dataset
df = pd.read_parquet("data/malaria_hts_train.parquet")

# Remove ambiguous labels; copy so the column assignment below does not act on a view
df = df[df["label"] != "ambiguous"].copy()

# map labels to binary
df["label"] = df["label"].map({"false": 0, "true": 1})

# Extract data into smiles and labels
smiles_list = df["SMILES"].values
labels = df["label"].values

# subsample
pos_idx = np.where(labels == 1)[0]
neg_idx = np.random.choice(np.where(labels == 0)[0], size=100_000, replace=False)
idx = np.concatenate([pos_idx, neg_idx])
smiles_list = smiles_list[idx].tolist()
labels = labels[idx]

len(smiles_list), labels.shape

(101528, (101528,))

In [2]:
smiles_list[:10]

['CC(C)CN(CC(C)C)S(=O)(=O)c1ccc(cc1)C(=O)Nc2nc(cs2)c3ccccn3',
 'Clc1cccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c(OC)c1',
 'COc1ccc(OC)c(NC(=O)CSc2nc(ns2)c3ccccc3Cl)c1',
 'COc1ccc(Cl)cc1NC(=O)CSc2nc(ns2)c3ccccc3Cl',
 'Brc1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CCOC(=O)C1=C(C)N=C2S\\C(=C\\c3ccccc3O)\\C(=O)N2C1c4ccccc4',
 'CCOC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'CC(=O)c1ccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)cc1',
 'OC(=O)c1cccc(NC2=C(C(=O)c3ccccc3C2=O)n4nnc5ccccc45)c1']

This dataset may contain SMILES strings that cannot be parsed into valid molecules. We use `MolFromSmilesTransformer` and remove invalid molecules by setting the `valid_only` parameter to `True`.

Note that if invalid molecules are removed, the label array becomes longer than the list of molecules. To keep them aligned, we filter both at once with the `.transform_x_y()` method.

In [3]:
from skfp.preprocessing import MolFromSmilesTransformer

# Create MolFromSmilesTransformer
mol_from_smiles = MolFromSmilesTransformer(valid_only=True, n_jobs=-1, batch_size=1000, verbose=True)

# Run the dataset transformation
molecules, labels = mol_from_smiles.transform_x_y(smiles_list, labels)

print(f"Removed molecules: {len(smiles_list) - len(molecules)}")

  0%|          | 0/101 [00:00<?, ?it/s]

Removed molecules: 0


---

## Molecular filters

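Molecular filters remove compounds that are unlikely to be useful in later stages of screening, for example molecules with poor drug-like properties or with substructures known to interfere with assays. Applying them early shrinks the library and reduces the number of false positives.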

### Physicochemical filters

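Physicochemical filters keep only molecules whose properties, such as molecular weight, logP, and the number of hydrogen bond donors and acceptors, fall within given ranges. A well-known example is Lipinski's Rule of 5.

First, an example written by hand with RDKit: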

In [4]:
from tqdm import tqdm
from rdkit.Chem.Crippen import MolLogP
from rdkit.Chem.rdMolDescriptors import CalcNumLipinskiHBA, CalcNumLipinskiHBD
from rdkit.Chem.Descriptors import MolWt

# Initialize lists for filtered molecules and labels
molecules_filtered = []
labels_filtered = []

# iterate over molecules
for i, mol in tqdm(enumerate(molecules), total=len(molecules)):

    # check for Lipinski's Rule of 5 conditions
    rules = [
        MolWt(mol) <= 500,  # molecular weight
        CalcNumLipinskiHBA(mol) <= 10,  # HBA
        CalcNumLipinskiHBD(mol) <= 5,  # HBD
        MolLogP(mol) <= 5,  # logP
    ]

    # Append data and label to the end of the output list
    if all(rules):
        molecules_filtered.append(mol)
        labels_filtered.append(labels[i])


100%|██████████| 101528/101528 [00:22<00:00, 4460.80it/s]


Instead of filtering the molecules by hand, we can use the scikit-learn compatible interface provided by scikit-fingerprints.

Initialize `LipinskiFilter` and call `.transform_x_y()` on the molecules and labels.

In [5]:
from skfp.filters import LipinskiFilter

# Named `lipinski_filter` to avoid shadowing the built-in `filter`
lipinski_filter = LipinskiFilter(n_jobs=-1, verbose=True)
molecules_filtered, labels_filtered = lipinski_filter.transform_x_y(molecules, labels)

  0%|          | 0/24 [00:00<?, ?it/s]

If we don't have labels, we can still filter the molecules alone.

In [6]:
molecules_filtered = lipinski_filter.transform(molecules)

  0%|          | 0/24 [00:00<?, ?it/s]

Display how the number of molecules changes.

In [7]:
print(f"Original size     : {len(molecules)}")
print(f"Filtered size     : {len(molecules_filtered)}")
print(f"Removed molecules : {len(molecules) - len(molecules_filtered)}")

Original size     : 101528
Filtered size     : 100923
Removed molecules : 605


### Substructural filters

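Substructural filters remove molecules that contain unwanted substructures. PAINS (pan-assay interference compounds) filters target fragments that tend to give false positive results across many assays. Below, the three PAINS variants A, B, and C are applied one after another.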

In [8]:
# Create new names to avoid overriding `molecules` and `labels`
mols, y = molecules, labels

In [9]:
from skfp.filters import PAINSFilter

# Create a list of PAINS filters sharing the same settings
filter_kwargs = dict(batch_size=1000, n_jobs=-1, verbose=True)
pains_filters = [
    PAINSFilter(variant="A", **filter_kwargs),
    PAINSFilter(variant="B", **filter_kwargs),
    PAINSFilter(variant="C", **filter_kwargs),
]

print(f"Molecules before filtering: {len(mols)}")

# Apply the filters one after another, keeping molecules and labels aligned
for i, pains_filter in enumerate(pains_filters):
    mols, y = pains_filter.transform_x_y(mols, y)
    print(f"Molecules after PAINS {i}: {len(mols)}")

n_active = y.sum()
n_inactive = len(y) - n_active
print(f"Final inactive molecules: {n_inactive}")
print(f"Final active molecules: {n_active}")

Molecules before filtering: 101528


  0%|          | 0/101 [00:00<?, ?it/s]

Molecules after PAINS 0: 96477


  0%|          | 0/96 [00:00<?, ?it/s]

Molecules after PAINS 1: 94841


  0%|          | 0/94 [00:00<?, ?it/s]

Molecules after PAINS 2: 94385
Final inactive molecules: 93170
Final active molecules: 1215


---

## Data splits

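A data split divides the dataset into training and test subsets. For molecules, a scaffold split, which keeps all molecules sharing the same core scaffold in one subset, usually gives a more realistic performance estimate than a random split.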

In [10]:
"""TODO here:
    - Show scaffold split without skfp
    - Remind that we can use load_X_splits
    - Show manual split computation with skfp
"""

'TODO here:\n    - Show scaffold split without skfp\n    - Remind that we can use load_X_splits\n    - Show manual split computation with skfp\n'
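
Until the cell above is filled in, here is a minimal sketch of the idea behind a scaffold split, written without skfp. It assumes that a scaffold string has already been computed for every molecule (for example a Bemis-Murcko scaffold SMILES obtained with RDKit). The helper name `scaffold_split_indices` and the toy scaffold list are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def scaffold_split_indices(scaffolds, test_size=0.2):
    """Split sample indices so that no scaffold appears in both subsets."""
    # Group molecule indices by their scaffold string
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    # Fill the training set with whole groups, send the rest to the test set
    n_train_target = int(round((1 - test_size) * len(scaffolds)))
    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= n_train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx


# Hypothetical scaffold SMILES, one per molecule (toy data, not from the dataset)
scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccncc1",
    "C1CCCCC1", "c1ccoc1", "c1ccsc1", "c1ccccc1", "c1ccncc1",
]
train_idx, test_idx = scaffold_split_indices(scaffolds, test_size=0.3)
print("train:", sorted(train_idx))
print("test :", sorted(test_idx))
```

Because whole groups are assigned at once, the resulting test fraction is only approximately equal to `test_size`. For real use, scikit-fingerprints offers `scaffold_train_test_split` in `skfp.model_selection`, which should be preferred over a hand-written version like this one.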

---

## Similarity searching & evaluation metrics



### Similarity search

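Similarity search ranks the molecules in a library by their fingerprint similarity to a query molecule, typically a known active compound.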

In [11]:
"""TODO here:
    - Similarity search without skfp
    - Similarity search - find similar molecules
    - KNN and compatibility
    - Bulk similarity computation
    - Times
"""

'TODO here:\n    - Similarity search without skfp\n    - Similarity search - find similar molecules\n    - KNN and compatibility\n    - Bulk similarity computation\n    - Times\n'
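
As a placeholder for the first items in the list above, the following sketch shows a similarity search without skfp, using plain NumPy and the Tanimoto coefficient on binary fingerprints. The function names `bulk_tanimoto` and `top_k_similar` are hypothetical, and the tiny bit vectors are toy data. In the notebook, the fingerprints would instead come from a fingerprint transformer such as `ECFPFingerprint`.

```python
import numpy as np


def bulk_tanimoto(query, fingerprints):
    """Tanimoto similarity between one binary vector and each row of a binary matrix."""
    query = np.asarray(query, dtype=bool)
    fingerprints = np.asarray(fingerprints, dtype=bool)
    intersection = (fingerprints & query).sum(axis=1)
    union = (fingerprints | query).sum(axis=1)
    # Convention assumed here: two all-zero vectors have similarity 1
    return np.where(union == 0, 1.0, intersection / np.maximum(union, 1))


def top_k_similar(query, fingerprints, k=2):
    """Return indices and similarities of the k most similar library entries."""
    sims = bulk_tanimoto(query, fingerprints)
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]


# Toy library of 6-bit fingerprints (made up for illustration)
library = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0],
])
query = np.array([1, 1, 0, 0, 1, 0])

indices, similarities = top_k_similar(query, library, k=2)
print(indices, similarities)
```

The whole library is compared against the query in one vectorized operation, which is the same idea as the bulk similarity computation listed in the TODO.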

### Evaluation metrics

In [12]:
"""TODO here:
    - Show example metric. like auroc
    - Show multioutput
    - Make scorer for grid search using our metric. Like mcc
"""

'TODO here:\n    - Show example metric. like auroc\n    - Show multioutput\n    - Make scorer for grid search using our metric. Like mcc\n'
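
To make the first two items of the list above concrete, here is a small NumPy sketch of AUROC and of a multioutput variant that averages AUROC over tasks while skipping missing labels. Both functions, `auroc` and `multioutput_auroc`, are hypothetical helpers written for this illustration; the handling of missing labels as NaN is an assumption. In practice one would use `roc_auc_score` from scikit-learn and the multioutput metrics in `skfp.metrics`.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    # Compare every active with every inactive; ties count as half
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def multioutput_auroc(Y_true, Y_score):
    """Mean AUROC over tasks (columns), ignoring NaN labels within each task."""
    Y_true = np.asarray(Y_true, dtype=float)
    Y_score = np.asarray(Y_score, dtype=float)
    scores = []
    for task in range(Y_true.shape[1]):
        known = ~np.isnan(Y_true[:, task])
        scores.append(auroc(Y_true[known, task] == 1, Y_score[known, task]))
    return float(np.mean(scores))


# Single task: toy labels and predicted scores
single = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Two tasks: the second task has one missing label (NaN)
Y_true = [[0, 1], [0, np.nan], [1, 0], [1, 1]]
Y_score = [[0.1, 0.9], [0.4, 0.5], [0.35, 0.2], [0.8, 0.7]]
multi = multioutput_auroc(Y_true, Y_score)

print(single, multi)
```

The pairwise formulation is quadratic in the number of molecules, so it is suitable only for small examples like this one.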

---