In this Jupyter notebook, we build a model that predicts whether a molecule will bind to a specific target. For the first iteration of this notebook, we focus on predicting whether molecules bind to the 5-hydroxytryptamine (serotonin) receptors (5-HTR). We draw inspiration from the following paper: https://pdfs.semanticscholar.org/c108/970bcb96967af5f5ba2783738bd39e751725.pdf

Let's start by importing the necessary dependencies.

In [1]:
# Base packages
import os
import string
import numpy as np
import pandas as pd
import seaborn as sns
from numpy import interp  # scipy.interp was a deprecated alias of numpy.interp
import matplotlib.pyplot as plt
from google.cloud import bigquery

# Scikit-learn packages
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# RDkit, a cheminformatics library
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

We generate our analysis dataset by querying the ChEMBL database, which is publicly available through BigQuery on the Google Cloud Platform (GCP). To query it from Python, we need to define the location of our GCP credentials and initialize a BigQuery client.

In [2]:
# Define relative paths
NOTEBOOKS = os.getcwd()
WKDIR = NOTEBOOKS.replace('/02 Notebooks', '')
INPUT = WKDIR + '/01 Input'
DATA = WKDIR + '/03 Data'

# Define location of credentials and initialize client
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = INPUT + "/TestProject21-1c3d2ad24a1d.json"
EBI_CHEMBL = "patents-public-data.ebi_chembl"
client = bigquery.Client()
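
The string replacement in the cell above assumes POSIX-style path separators and a notebook directory named exactly `02 Notebooks`. As a hedged alternative, the same folder layout could be derived with `pathlib`. This is only a sketch: the helper name `project_paths` is hypothetical, and it assumes the three numbered folders sit directly under one working directory, as the cell above implies.

```python
from pathlib import Path

def project_paths(notebooks_dir):
    """Derive sibling project folders from the notebooks directory.

    Assumes '01 Input', '02 Notebooks' and '03 Data' all sit directly
    under the same working directory (hypothetical helper, not part of
    the original notebook).
    """
    wkdir = Path(notebooks_dir).parent
    return {
        "WKDIR": wkdir,
        "INPUT": wkdir / "01 Input",
        "DATA": wkdir / "03 Data",
    }

# Example with a made-up location; in the notebook this would be Path.cwd()
paths = project_paths("/tmp/project/02 Notebooks")
print(paths["INPUT"])
print(paths["DATA"])
```

Joining with `/` on `Path` objects avoids hand-built separators, and `.parent` does not depend on the notebook folder's name.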

Let's first define the specific targets whose binding profile we want to characterize with our model.

In [3]:
target_query = f"""
SELECT DISTINCT 
    pref_name
FROM 
    `{EBI_CHEMBL}.target_dictionary_24`
WHERE 
    pref_name like "%5-HT%"   AND
    organism = "Homo sapiens" AND
    target_type = "SINGLE PROTEIN"
"""
target_list = client.query(target_query).to_dataframe()
target_list

Unnamed: 0,pref_name
0,Serotonin 1a (5-HT1a) receptor
1,Serotonin 1d (5-HT1d) receptor
2,Serotonin 1b (5-HT1b) receptor
3,Serotonin 2a (5-HT2a) receptor
4,Serotonin 2c (5-HT2c) receptor
5,Serotonin 3a (5-HT3a) receptor
6,Serotonin 4 (5-HT4) receptor
7,Serotonin 2b (5-HT2b) receptor
8,Serotonin 1f (5-HT1f) receptor
9,Serotonin 7 (5-HT7) receptor


Let's start by characterizing the binding profile of the serotonin 1a receptor.

In [10]:
ACTIVITIES = f"{EBI_CHEMBL}.activities_24"
ASSAYS = f"{EBI_CHEMBL}.assays_24"
CMPD_STR = f"{EBI_CHEMBL}.compound_structures_24"
MOL_DICT = f"{EBI_CHEMBL}.molecule_dictionary_24"
TGT_DICT = f"{EBI_CHEMBL}.target_dictionary_24"

analysis_qtext = f"""
SELECT DISTINCT
    mol_dict.chembl_id        AS mol_chembl_id,
    act.molregno              AS molregno,
    act.standard_type         AS standard_type,
    act.standard_relation     AS standard_relation,
    act.standard_value        AS standard_value,
    act.standard_units        AS standard_units,
    cmpd_str.canonical_smiles AS canonical_smiles,
    tgt_dict.chembl_id        AS tgt_chembl_id,
    tgt_dict.pref_name        AS tgt_name,
    assays.chembl_id          AS assay_chembl_id,
    assays.description        AS assay_desc
FROM `{ACTIVITIES}` act
    INNER JOIN `{MOL_DICT}` mol_dict ON mol_dict.molregno = act.molregno
    INNER JOIN `{ASSAYS}`   assays   ON assays.assay_id = act.assay_id
    INNER JOIN `{TGT_DICT}` tgt_dict ON tgt_dict.tid = assays.tid
    INNER JOIN `{CMPD_STR}` cmpd_str ON cmpd_str.molregno = act.molregno
WHERE
    act.standard_units = "nM"               AND
    act.standard_relation in ("<", "=")     AND
    act.potential_duplicate = '0'           AND
    assays.assay_type = "B"                 AND
    assays.confidence_score >= '8'          AND
    tgt_dict.organism = "Homo sapiens"      AND
    tgt_dict.target_type = "SINGLE PROTEIN" AND
    tgt_dict.pref_name = "Serotonin 1a (5-HT1a) receptor" AND
    act.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency')
"""
data = client.query(analysis_qtext).to_dataframe()

# Convert the standard value to a numeric variable
data['standard_value'] = pd.to_numeric(data['standard_value'])

# Save out a CSV
data.to_csv(DATA + '/ChEMBL Activity data (5-HT1a).csv', index=False)

# View the data
data.head()

Unnamed: 0,mol_chembl_id,molregno,standard_type,standard_relation,standard_value,standard_units,canonical_smiles,tgt_chembl_id,tgt_name,assay_chembl_id,assay_desc
0,CHEMBL270176,438512,Ki,=,1081.0,nM,CCC1NC(=Nc2cccc(Cl)c12)N,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL930818,Displacement of [3H]8-OH-DPAT from human recom...
1,CHEMBL2431282,1583337,Ki,=,13.8,nM,CCCc1ccccc1OC(C)C2=NCCN2,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL2434417,Binding affinity at human 5HT1A receptor
2,CHEMBL18928,22152,IC50,=,72.44,nM,NCCC1COc2ccc(cc12)C(=O)N,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL873463,In Vitro Binding affinity againist 5-HT1A rece...
3,CHEMBL146942,243209,Ki,=,1.995,nM,NCCCC1(CCCN)CCc2cccc(O)c2C1,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL616227,Binding affinity against 5-hydroxytryptamine 1...
4,CHEMBL131736,220874,IC50,=,12200.0,nM,CC(N)Cc1cn(C)c2ccc3OCCCc3c12,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL616274,In vitro inhibitory concentration required aga...


The query returns 458 duplicate molecules and 6 duplicate molecule-activity pairs. We handle the latter by simply dropping the duplicates. For the former, we keep, for each molecule, the record with the lowest activity value.

In [11]:
print("No. of duplicate molecules: " + str(data.duplicated(subset=['mol_chembl_id']).sum()))
print("No. of duplicate molecule-activity values: " + str(data.duplicated(subset=['mol_chembl_id', 'standard_value']).sum()))

# Handle duplicate molecule-activity values
# (take an explicit copy so the in-place operations below act on an independent DataFrame)
data_nodup = data.drop_duplicates(subset=['mol_chembl_id', 'standard_value']).copy()

# Handle duplicate molecules: sort ascending so that the lowest activity value
# for each molecule comes first, then keep only that first record
assert data_nodup.duplicated(subset=['mol_chembl_id', 'standard_value']).sum()==0
data_nodup.sort_values(by=['standard_value', 'mol_chembl_id'], ascending=True, inplace=True)
data_nodup.drop_duplicates(subset=['mol_chembl_id'], keep='first', inplace=True)
assert data_nodup.duplicated(subset=['mol_chembl_id']).sum()==0

# Save out a dataset without duplicates
data_nodup.to_csv(DATA + '/ChEMBL Activity Data (5-HT1a, no dup).csv', index=False)
print("Data dimensions after removing duplicates: " + str(data_nodup.shape))

No. of duplicate molecules: 458
No. of duplicate molecule-activity values: 6
Data dimensions after removing duplicates: (3477, 11)
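
To make the two-step de-duplication concrete, here is a minimal sketch on a toy DataFrame. The molecule IDs and activity values are made up for illustration and are not taken from ChEMBL; only the column names match the data above.

```python
import pandas as pd

# Toy records (made-up values): CHEMBL_A was measured three times,
# and one of those measurements is an exact repeat.
toy = pd.DataFrame({
    "mol_chembl_id": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
    "standard_value": [50.0, 50.0, 5.0, 200.0],
})

deduped = (
    # Step 1: drop exact molecule-activity repeats
    toy.drop_duplicates(subset=["mol_chembl_id", "standard_value"])
       # Step 2: sort ascending so the lowest value per molecule comes first
       .sort_values(by=["standard_value", "mol_chembl_id"])
       # Step 3: keep only that first (lowest) record per molecule
       .drop_duplicates(subset=["mol_chembl_id"], keep="first")
)
print(deduped)
```

Each molecule ends up with a single row carrying its lowest reported value, which is the behavior the cell above relies on.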


The biological activity of a drug is typically reported as the concentration required to produce a given response. For example, to assess whether a molecule binds a specific target, researchers run experimental assays to determine the concentration of the molecule needed to produce a measurable effect associated with that target. We specified the activity types of interest in our query above (act.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency')).

We assume that a molecule binds to the target if its activity value (i.e., the concentration needed to induce a response) is less than or equal to 100 nM. This threshold is based on the activity threshold for GPCRs given here: https://druggablegenome.net/ProteinFam

In [25]:
def set_active(row):
    """Return 1 if the activity value is at or below the 100 nM threshold, else 0."""
    return int(row['standard_value'] <= 100)

# Create the 'active' variable in both the duplicated and non-duplicated datasets
data['active'] = data.apply(set_active, axis=1)
data_nodup['active'] = data_nodup.apply(set_active, axis=1)

# View data
data_nodup.head()

Unnamed: 0,mol_chembl_id,molregno,standard_type,standard_relation,standard_value,standard_units,canonical_smiles,tgt_chembl_id,tgt_name,assay_chembl_id,assay_desc,active
2459,CHEMBL1258761,707774,Ki,=,0.0141,nM,Fc1cccc2cccc(N3CCN(CCCOc4ccc5CNC(=O)c5c4)CC3)c12,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL1260560,Displacement of [3H]-8-OH-DPAT from human 5HT1...,1
2432,CHEMBL1258533,707544,Ki,=,0.0152,nM,O=C1NCc2ccc(OCCCN3CCN(CC3)c4cccc5ccccc45)cc12,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL1260560,Displacement of [3H]-8-OH-DPAT from human 5HT1...,1
1483,CHEMBL511857,515018,Ki,=,0.028,nM,COc1cccc(c1)[C@@H]2CC[C@H](CC2)N3CCN(CC3)c4ccccn4,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL3594589,Agonist activity at human 5-HT1A receptor expr...,1
1149,CHEMBL271987,441890,Ki,=,0.03,nM,COc1ccccc1N2CCN(CCCCCNc3ccc4ccccc4n3)CC2,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL950989,Displacement of [3H]8-OH-DPAT from human 5HT1A...,1
1615,CHEMBL1290716,717135,EC50,=,0.03162,nM,CN1C(=O)C=Cc2c(CCN3CCN(CC3)c4cccc5nc(C)ccc45)c...,CHEMBL214,Serotonin 1a (5-HT1a) receptor,CHEMBL1293144,Agonist activity at human 5-HT1A receptor expr...,1


The probability that a given molecule binds to a target is dependent on, among other factors, the structure of the molecule itself. We can encode the structures of the molecules in our datasets using circular fingerprints. Circular fingerprints flatten the 2-D molecule structure into a 1-D vector by exhaustively enumerating and hashing all circular fragments grown radially from each heavy atom of the molecule up to a given radius. Figure 1 below illustrates this algorithm:
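
Independently of the figure, the enumerate-and-hash idea can be sketched in plain Python on a hand-written molecular graph. This is a simplified toy, not RDKit's Morgan fingerprint implementation: the function name `toy_circular_fingerprint` is hypothetical, bond orders, charges and duplicate-fragment removal are ignored, and the 64-bit length is an arbitrary choice for illustration.

```python
import zlib

def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy circular fingerprint over a hand-written molecular graph.

    atoms: element symbols of the heavy atoms, e.g. ["C", "C", "O"]
    bonds: (i, j) index pairs; bond orders are ignored in this sketch
    Returns a list of n_bits values, each 0 or 1.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def stable_hash(obj):
        # zlib.crc32 is deterministic across runs, unlike built-in hash() on str
        return zlib.crc32(repr(obj).encode("utf-8"))

    # Radius 0: an atom's identifier depends only on its element and degree
    ids = [stable_hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    seen = set(ids)

    # Each iteration grows every fragment by one bond: the new identifier
    # hashes the atom's previous identifier with its neighbors' identifiers
    for _ in range(radius):
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        seen.update(ids)

    # Fold all fragment identifiers into a fixed-length bit vector
    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1
    return bits

# Ethanol (SMILES: CCO) written out as a three-atom graph
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sum(ethanol), "bits set out of", len(ethanol))
```

Because neighbor identifiers are sorted before hashing and the final identifiers are collected in a set, the resulting vector does not depend on the order in which the atoms are listed.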

In [None]:
def 