# Part 3: Filter compounds for PAINS and other unwanted substructures

Promising compounds should not cause unwanted side effects. In this part, compounds with known unwanted substructures are filtered out of the dataset.

First, pan-assay interference compounds (PAINS) are removed from the dataset, as these structures bind nonspecifically, which can lead to misleading assay results and unwanted side effects. PAINS can be identified by filtering on the substructures described by Baell et al. (2010). In addition, Brenk et al. (2008) provide a list of other unwanted substructures, which is used for further substructure filtering.

Import required libraries

In [1]:
from pathlib import Path

import pandas as pd
from tqdm.auto import tqdm
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

Define paths relative to this notebook

In [2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

In [14]:
print(DATA)

/Users/Jurren/Documents/GitHub/ACMDD/data
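
`_dh` is a variable that IPython/Jupyter defines (its directory history), so the path cell above only works inside a notebook session. A minimal sketch of a fallback for plain Python scripts follows; the helper name `notebook_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def notebook_dir() -> Path:
    """Return the notebook directory, or the working directory outside IPython."""
    try:
        # `_dh` only exists in IPython/Jupyter sessions (directory history)
        return Path(_dh[-1])  # noqa: F821
    except NameError:
        # Plain Python: fall back to the current working directory
        return Path.cwd()


HERE = notebook_dir()
DATA = HERE / "data"
print(DATA)
```

When run as a script, this assumes the script is started from the directory that contains the `data` folder.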


Read data from Part 2

In [3]:
data = pd.read_csv(DATA/"BACE_compounds_lipinski.csv",
    index_col=0,
)
data

Unnamed: 0,molecule_chembl_id,IC50,units,smiles,pIC50,molecular_weight,n_hba,n_hbd,logp,ro5_fulfilled
0,CHEMBL3969403,2.000000e-04,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3ccc(C#N)cn3)c...,12.698970,429.127089,7,2,2.12408,True
1,CHEMBL3937515,9.000000e-04,nM,COc1cnc(C(=O)Nc2ccc(F)c([C@]3(C)CS(=O)(=O)C(C)...,12.045757,435.137653,8,2,1.65600,True
2,CHEMBL3949213,1.000000e-03,nM,C[C@@]1(c2cc(NC(=O)c3ccc(C#N)cn3)ccc2F)CS(=O)(...,12.000000,455.142739,7,2,2.65828,True
3,CHEMBL3955051,1.800000e-03,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3cnc(C(F)F)cn3...,11.744727,455.123895,7,2,2.58500,True
4,CHEMBL3936264,5.700000e-03,nM,C[C@@]1(c2cc(NC(=O)c3ccc(OC(F)F)cn3)ccc2F)CS(=...,11.244125,442.092261,7,2,2.07520,True
...,...,...,...,...,...,...,...,...,...,...
6686,CHEMBL1222034,2.854000e+06,nM,Nc1nc2cc(Cl)ccc2n1CCCC(=O)NCC1CC1,2.544546,306.124739,4,2,2.57830,True
6687,CHEMBL1934194,3.442000e+06,nM,COc1c2occc2c(OCC=C(C)C)c2ccc(=O)oc12,2.463189,300.099774,5,0,3.89280,True
6688,CHEMBL3586134,4.000000e+06,nM,NC1=NC2CCCCC2CS1,2.397940,170.087769,3,1,1.60670,True
6689,CHEMBL3261080,8.200000e+06,nM,CC1=CSC(N)=NN1,2.086186,129.036068,4,2,0.41380,True


Drop the columns molecular_weight, n_hbd, n_hba and logp, as they are no longer needed. They were only used to check whether a compound fulfills the rule of five (Ro5).

In [4]:
print("Dataframe shape:", data.shape)
data.drop(columns=["molecular_weight", "n_hbd", "n_hba", "logp"], inplace=True)
data.head()

Dataframe shape: (6691, 10)


Unnamed: 0,molecule_chembl_id,IC50,units,smiles,pIC50,ro5_fulfilled
0,CHEMBL3969403,0.0002,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3ccc(C#N)cn3)c...,12.69897,True
1,CHEMBL3937515,0.0009,nM,COc1cnc(C(=O)Nc2ccc(F)c([C@]3(C)CS(=O)(=O)C(C)...,12.045757,True
2,CHEMBL3949213,0.001,nM,C[C@@]1(c2cc(NC(=O)c3ccc(C#N)cn3)ccc2F)CS(=O)(...,12.0,True
3,CHEMBL3955051,0.0018,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3cnc(C(F)F)cn3...,11.744727,True
4,CHEMBL3936264,0.0057,nM,C[C@@]1(c2cc(NC(=O)c3ccc(OC(F)F)cn3)ccc2F)CS(=...,11.244125,True


Add molecule column

In [5]:
PandasTools.AddMoleculeColumnToFrame(data, smilesCol="smiles")
data.head()

Unnamed: 0,molecule_chembl_id,IC50,units,smiles,pIC50,ro5_fulfilled,ROMol
0,CHEMBL3969403,0.0002,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3ccc(C#N)cn3)c...,12.69897,True,<rdkit.Chem.rdchem.Mol object at 0x7f8108992680>
1,CHEMBL3937515,0.0009,nM,COc1cnc(C(=O)Nc2ccc(F)c([C@]3(C)CS(=O)(=O)C(C)...,12.045757,True,<rdkit.Chem.rdchem.Mol object at 0x7f81089926e0>
2,CHEMBL3949213,0.001,nM,C[C@@]1(c2cc(NC(=O)c3ccc(C#N)cn3)ccc2F)CS(=O)(...,12.0,True,<rdkit.Chem.rdchem.Mol object at 0x7f81089925c0>
3,CHEMBL3955051,0.0018,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3cnc(C(F)F)cn3...,11.744727,True,<rdkit.Chem.rdchem.Mol object at 0x7f8108992800>
4,CHEMBL3936264,0.0057,nM,C[C@@]1(c2cc(NC(=O)c3ccc(OC(F)F)cn3)ccc2F)CS(=...,11.244125,True,<rdkit.Chem.rdchem.Mol object at 0x7f81089924a0>
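
`Chem.MolFromSmiles` returns `None` for a SMILES string it cannot parse, and such entries would break the substructure searches later on. The sketch below shows one way to spot them; the SMILES strings are made-up examples, not rows from the BACE data.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse errors for this demo

# Hypothetical example input: two valid SMILES and one invalid string
smiles_list = ["CCO", "c1ccccc1", "not_a_smiles"]

# Parse every string; unparsable ones come back as None
molecules = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
invalid = [smiles for smiles, mol in zip(smiles_list, molecules) if mol is None]

print("Unparsable SMILES:", invalid)
```

The same check could be applied to the `ROMol` column before filtering, assuming that dropping unparsable rows is acceptable for the analysis.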


## Filter for PAINS

Initialize filter

[Explanation: this is a filter built into RDKit]

In [6]:
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)
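
To see what the catalog reports for a single molecule, the sketch below queries it with two test structures. The hydrazone SMILES is the example used in the RDKit documentation for the PAINS filter, and benzene serves as a negative control. The helper `pains_alerts` is a hypothetical name introduced here for illustration.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)


def pains_alerts(smiles):
    """Return the descriptions of all PAINS entries matched by a SMILES string."""
    molecule = Chem.MolFromSmiles(smiles)
    return [entry.GetDescription() for entry in catalog.GetMatches(molecule)]


hydrazone = "O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)ccc2c1cccc2"
print(pains_alerts(hydrazone))   # PAINS test structure
print(pains_alerts("c1ccccc1"))  # benzene as negative control
```

Unlike `GetFirstMatch`, `GetMatches` returns every matching catalog entry, which is useful when a compound triggers more than one alert.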

Search for PAINS in the dataset and keep the molecules without PAINS

In [7]:
matches = []
clean = []
for index, row in tqdm(data.iterrows(), total=data.shape[0]):
    molecule = Chem.MolFromSmiles(row.smiles)
    entry = catalog.GetFirstMatch(molecule)  # Get the first matching PAINS
    if entry is not None:
        # store PAINS information
        matches.append(
            {
                "chembl_id": row.molecule_chembl_id,
                "rdkit_molecule": molecule,
                "pains": entry.GetDescription().capitalize(),
            }
        )
    else:
        # collect indices of molecules without PAINS
        clean.append(index)

matches = pd.DataFrame(matches)
data = data.loc[clean]  # keep molecules without PAINS

  0%|          | 0/6691 [00:00<?, ?it/s]

In [8]:
n_total = len(matches) + len(data)  # dataset size before PAINS filtering
print(f"Number of compounds with PAINS: {len(matches)}")
print(f"Number of compounds without PAINS: {len(data)}")
print(f"Percentage of compounds with PAINS: {round(len(matches) / n_total * 100, 2)}%")


Number of compounds with PAINS: 133
Number of compounds without PAINS: 6558
Percentage of compounds with PAINS: 1.99%


The compounds in the dataset were filtered for PAINS to avoid unwanted side effects. Of the 6691 compounds, 133 (1.99%) contained PAINS and were therefore removed from the dataset.
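
The share of flagged compounds depends on the denominator. As a worked check that uses only the counts printed above (133 flagged, 6558 retained), the sketch below compares the two possible choices; the variable names are illustrative.

```python
n_pains = 133    # compounds flagged by the PAINS filter (printed above)
n_clean = 6558   # compounds kept after filtering (printed above)
n_total = n_pains + n_clean  # size of the dataset before filtering

share_of_total = n_pains / n_total * 100   # flagged / all compounds
ratio_to_clean = n_pains / n_clean * 100   # flagged / retained compounds

print(f"Flagged, relative to all {n_total} compounds: {share_of_total:.2f}%")
print(f"Flagged, relative to the {n_clean} retained compounds: {ratio_to_clean:.2f}%")
```

Only the first value is the fraction of the original dataset that was removed.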

## Filter for other unwanted substructures

Read the file with the unwanted substructures taken from Brenk et al. (ChemMedChem (2008), 3, 435-44).

The table below shows the unwanted substructures.

In [9]:
substructures = pd.read_csv(DATA / "unwanted_substructures.csv", sep=r"\s+")  # raw string avoids an invalid escape sequence
substructures["rdkit_molecule"] = substructures.smarts.apply(Chem.MolFromSmarts)
print("Number of unwanted substructures in collection:", len(substructures))
display(substructures) # Show the substructures that are unwanted 

Number of unwanted substructures in collection: 104


Unnamed: 0,name,smarts,rdkit_molecule
0,>2EsterGroups,"C(=O)O[C,H1].C(=O)O[C,H1].C(=O)O[C,H1]",<rdkit.Chem.rdchem.Mol object at 0x7f81091e7160>
1,2-haloPyridine,"n1c([F,Cl,Br,I])cccc1",<rdkit.Chem.rdchem.Mol object at 0x7f81091e6f80>
2,acidHalide,"C(=O)[Cl,Br,I,F]",<rdkit.Chem.rdchem.Mol object at 0x7f81091e7340>
3,acyclic-C=C-O,C=[C!r]O,<rdkit.Chem.rdchem.Mol object at 0x7f81091e71c0>
4,acylCyanide,N#CC(=O),<rdkit.Chem.rdchem.Mol object at 0x7f81091e74c0>
...,...,...,...
99,thiol,[SH],<rdkit.Chem.rdchem.Mol object at 0x7f810920d660>
100,Three-membered-heterocycle,"*1[O,S,N]*1",<rdkit.Chem.rdchem.Mol object at 0x7f810920d6c0>
101,triflate,OS(=O)(=O)C(F)(F)F,<rdkit.Chem.rdchem.Mol object at 0x7f810920d720>
102,triphenyl-methylsilyl,"[SiR0,CR0](c1ccccc1)(c2ccccc2)(c3ccccc3)",<rdkit.Chem.rdchem.Mol object at 0x7f810920d780>
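
Each row of the table pairs a name with a SMARTS pattern, and a molecule is flagged when the pattern occurs anywhere in its structure. A small sketch with two patterns copied from the table (`acidHalide` and `thiol`) and a few hand-picked test molecules that are not part of the BACE dataset:

```python
from rdkit import Chem

# Two SMARTS patterns taken from the table above
patterns = {
    "acidHalide": Chem.MolFromSmarts("C(=O)[Cl,Br,I,F]"),
    "thiol": Chem.MolFromSmarts("[SH]"),
}

# Illustrative test molecules (assumed examples, not from the dataset)
test_smiles = {
    "acetyl chloride": "CC(=O)Cl",
    "ethanethiol": "CCS",
    "ethanol": "CCO",
}

hits = {}
for label, smiles in test_smiles.items():
    molecule = Chem.MolFromSmiles(smiles)
    # Collect the names of all patterns found in this molecule
    hits[label] = [
        name for name, pattern in patterns.items()
        if molecule.HasSubstructMatch(pattern)
    ]
    print(label, "->", hits[label])
```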


Search for these unwanted substructures in the dataset and keep the molecules without unwanted substructures

In [10]:
matches = []
clean = []
for index, row in tqdm(data.iterrows(), total=data.shape[0]):
    molecule = Chem.MolFromSmiles(row.smiles)
    match = False
    for _, substructure in substructures.iterrows():
        if molecule.HasSubstructMatch(substructure.rdkit_molecule):
            matches.append(
                {
                    "chembl_id": row.molecule_chembl_id,
                    "rdkit_molecule": molecule,
                    "substructure": substructure.rdkit_molecule,
                    "substructure_name": substructure["name"],
                }
            )
            match = True
    if not match:
        clean.append(index)

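matches = pd.DataFrame(matches)
n_before = len(data)  # number of compounds before substructure filtering
data = data.loc[clean]

  0%|          | 0/6558 [00:00<?, ?it/s]

In [11]:
n_removed = n_before - len(data)  # compounds with at least one unwanted substructure
print(f"Number of unwanted substructure matches: {len(matches)}")
print(f"Number of compounds with unwanted substructures: {n_removed}")
print(f"Number of compounds without unwanted substructures: {len(data)}")
print(f"Percentage of compounds with unwanted substructures: {round(n_removed / n_before * 100, 2)}%")


Number of unwanted substructure matches: 2319
Number of compounds with unwanted substructures: 1735
Number of compounds without unwanted substructures: 4823
Percentage of compounds with unwanted substructures: 26.46%


The search found 2319 matches against the list of unwanted substructures from Brenk et al. (2008). Because a single compound can contain several unwanted substructures, these matches belong to 1735 of the 6558 compounds (26.46%). These compounds were removed, resulting in a dataset of 4823 remaining compounds.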
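
Because the inner loop records every matching substructure, one compound can contribute several rows to `matches`. The toy records below (hypothetical IDs, not taken from the dataset) illustrate the difference between counting matches and counting compounds:

```python
# Hypothetical match records in the same shape as the rows collected above
toy_matches = [
    {"chembl_id": "CHEMBL_A", "substructure_name": "imine"},
    {"chembl_id": "CHEMBL_A", "substructure_name": "triple-bond"},
    {"chembl_id": "CHEMBL_B", "substructure_name": "imine"},
]

n_matches = len(toy_matches)  # one row per (compound, substructure) pair
n_compounds = len({record["chembl_id"] for record in toy_matches})  # distinct IDs

print(f"Substructure matches: {n_matches}")
print(f"Distinct compounds flagged: {n_compounds}")
```

Counting distinct IDs only equals the number of removed rows if every compound appears once in the dataset, which is an assumption here.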

Frequency with which each unwanted substructure occurs in the data

In [12]:
groups = matches.groupby("substructure_name")
group_frequencies = groups.size()
group_frequencies.sort_values(ascending=False, inplace=True)
group_frequencies

substructure_name
imine                              438
triple-bond                        367
2-haloPyridine                     330
Aliphatic-long-chain               318
Michael-acceptor                   128
Oxygen-nitrogen-single-bond        111
isolate-alkene                     103
betaketo/anhydride                  86
aniline                             65
cumarine                            53
nitro-group                         41
Carbo-cation/anion                  37
charged-oxygen/sulfur-atoms         25
Thiocarbonyl-group                  22
quaternary-nitrogen                 15
polyene                             14
aldehyde                            13
heavy-metal                         13
acyclic-C=C-O                       13
iodine                              12
perfluorinated-chain                12
het-C-het-not-in-ring               11
halogenated-ring                    11
Sulfonic-acid                       10
phosphor-P-phthalimide              10
hydroqu

The most common unwanted substructures were imines (438 matches), triple bonds (367) and 2-halopyridines (330).

Save the filtered data to a CSV file

In [13]:
data = data.drop("ROMol", axis=1)
data.to_csv(DATA/"BACE_compounds_part3.csv")
data.head()

Unnamed: 0,molecule_chembl_id,IC50,units,smiles,pIC50,ro5_fulfilled
0,CHEMBL3969403,0.0002,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3ccc(C#N)cn3)c...,12.69897,True
1,CHEMBL3937515,0.0009,nM,COc1cnc(C(=O)Nc2ccc(F)c([C@]3(C)CS(=O)(=O)C(C)...,12.045757,True
2,CHEMBL3949213,0.001,nM,C[C@@]1(c2cc(NC(=O)c3ccc(C#N)cn3)ccc2F)CS(=O)(...,12.0,True
3,CHEMBL3955051,0.0018,nM,CC1(C)C(N)=N[C@](C)(c2cc(NC(=O)c3cnc(C(F)F)cn3...,11.744727,True
4,CHEMBL3936264,0.0057,nM,C[C@@]1(c2cc(NC(=O)c3ccc(OC(F)F)cn3)ccc2F)CS(=...,11.244125,True
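
`to_csv` writes the DataFrame index as an unnamed first column, which is why the file from Part 2 was read with `index_col=0`. A minimal round-trip sketch with a toy DataFrame and a temporary file (both are placeholders, not the real dataset):

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame(
    {"molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B"], "pIC50": [7.5, 6.1]},
    index=[0, 4],  # non-contiguous index, as left behind by filtering
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.csv"
    toy.to_csv(path)                            # index is written as the first column
    reloaded = pd.read_csv(path, index_col=0)   # restore that column as the index

print(reloaded)
```

Reading the file without `index_col=0` would instead produce an extra `Unnamed: 0` column next to a fresh 0-based index.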


This filtering results in a dataset of 4823 compounds, which will be used to draw a scaffold and to train and validate the machine learning models.