# Database Building

This notebook contains the code used to build my database by downloading and sorting data from ChEMBL (https://www.ebi.ac.uk/chembl/). EGFR is used as the example throughout these notebooks, but the same code was applied to all 3 of my target receptors and their inhibitors.

In [2]:
# These libraries were used for all 3 stages of my project
# pandas, NumPy, and RDKit are all open source
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from rdkit import DataStructs

## Downloading the data from ChEMBL

The first step is to download all of the data associated with the target and the appropriate assay. In this case, the target is the epidermal growth factor receptor (EGFR) and the assays are those that report activity as IC50. The index is then set to the ChEMBL ID of each inhibitor, and the table is displayed to check that all of the data has been loaded.
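As a minimal sketch of this loading step, the index can also be set while the file is read, using the `index_col` argument of `pd.read_csv`. The in-memory CSV below is only a stand-in for the exported file: it reuses two rows and a few column names from the table shown in this notebook, and the names `csv_text`, `df_demo` and `only_ic50` are illustrative.

```python
import io

import pandas as pd

# Tiny stand-in for the exported ChEMBL file (two rows taken from the table in this notebook).
csv_text = (
    "Molecule ChEMBL ID,Smiles,Standard Type,Standard Relation,Standard Value\n"
    "CHEMBL299881,Cc1cccc(Nc2ncnc3cnc(NCCN4CCOCC4)nc23)c1,IC50,'=',2.3\n"
    "CHEMBL57462,N#CC(C#N)=C1C(=O)Nc2cc(O)c(O)cc21,IC50,'>',16000.0\n"
)

# Read the data and set the compound ID as the index in one step.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col="Molecule ChEMBL ID")

# Quick sanity check: confirm that only IC50 records are present.
only_ic50 = df_demo[df_demo["Standard Type"] == "IC50"]
print(only_ic50.shape)
```

Checking the shape and the `Standard Type` column straight after loading is a cheap way to confirm that the export contains what was expected.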

In [5]:
df_full = pd.read_csv(r'/Users/isobelhamilton-burns/full_chembl_data.csv')
df_full = df_full.set_index('Molecule ChEMBL ID')

In [6]:
df_full

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL386296,,0,428.95,1,5.58,3l,COc1cc2c(cc1OC)Sc1nc(C)nc(Nc3ccc(Cl)c(C)c3)c1NC2,IC50,'>',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1149491,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2006.0,,
CHEMBL57462,,0,227.18,0,0.85,16,N#CC(C#N)=C1C(=O)Nc2cc(O)c(O)cc21,IC50,'>',16000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1125266,1,Scientific Literature,J. Med. Chem.,1991.0,,
CHEMBL133147,,0,263.34,0,4.01,12,Cc1cc(C)c2ncnc(N(C)c3ccccc3)c2c1,IC50,'>',20000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129840,1,Scientific Literature,Bioorg. Med. Chem. Lett.,1997.0,,
CHEMBL299881,,0,365.44,0,2.22,6g,Cc1cccc(Nc2ncnc3cnc(NCCN4CCOCC4)nc23)c1,IC50,'=',2.3,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1130057,1,Scientific Literature,J. Med. Chem.,1997.0,,
CHEMBL171144,,0,385.46,0,4.78,I17,CC(COC(=O)c1cc(/N=C/c2cc(O)ccc2O)ccc1O)CC(C)(C)C,IC50,'=',10000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1127676,1,Scientific Literature,J. Med. Chem.,1994.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL4749064,,0,448.55,0,3.56,69,CN(C)CCNCc1cc(-c2cc3c(N[C@H](CO)c4ccccc4)ncnc3...,IC50,'=',1.2,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4715760,1,Scientific Literature,Eur J Med Chem,2016.0,,
CHEMBL2110732,DACOMITINIB,4,469.95,1,5.16,Dacomitinib,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1NC(=O)/C=C/C...,IC50,'=',6.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4725379,1,Scientific Literature,Bioorg Med Chem,2020.0,,
CHEMBL4520788,BAY-294,0,448.59,1,5.58,EUB0000743,CNCc1ccccc1-c1csc([C@H](C)Nc2nc(C)nc3cc(OC)c(O...,IC50,'>',20.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4507273,54,Donated Chemical Probes - SGC Frankfurt,,2021.0,,Num of experiments None 2.0 None
CHEMBL4753700,,0,1038.29,,,14o,C=CC(=O)Nc1cccc(-n2c(=O)cc(C)c3cnc(Nc4ccc(N5CC...,IC50,'>',1000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4680227,1,Scientific Literature,Eur J Med Chem,2020.0,,TIME = 1.0 hr


## Cleaning up the data

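Any inhibitors whose activities have not been replicated can be removed from the dataset. In practice, the filter below keeps only the rows whose 'Assay ChEMBL ID' appears more than once, so it drops records from assays that contributed a single measurement. This removes about 500 compounds from this list.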

In [7]:
# keep=False flags every row whose 'Assay ChEMBL ID' occurs more than once
df_full_a = df_full[df_full.duplicated(subset=['Assay ChEMBL ID'], keep=False)]
df_full_a

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL386296,,0,428.95,1,5.58,3l,COc1cc2c(cc1OC)Sc1nc(C)nc(Nc3ccc(Cl)c(C)c3)c1NC2,IC50,'>',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1149491,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2006.0,,
CHEMBL57462,,0,227.18,0,0.85,16,N#CC(C#N)=C1C(=O)Nc2cc(O)c(O)cc21,IC50,'>',16000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1125266,1,Scientific Literature,J. Med. Chem.,1991.0,,
CHEMBL133147,,0,263.34,0,4.01,12,Cc1cc(C)c2ncnc(N(C)c3ccccc3)c2c1,IC50,'>',20000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129840,1,Scientific Literature,Bioorg. Med. Chem. Lett.,1997.0,,
CHEMBL299881,,0,365.44,0,2.22,6g,Cc1cccc(Nc2ncnc3cnc(NCCN4CCOCC4)nc23)c1,IC50,'=',2.3,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1130057,1,Scientific Literature,J. Med. Chem.,1997.0,,
CHEMBL171144,,0,385.46,0,4.78,I17,CC(COC(=O)c1cc(/N=C/c2cc(O)ccc2O)ccc1O)CC(C)(C)C,IC50,'=',10000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1127676,1,Scientific Literature,J. Med. Chem.,1994.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL4785569,,0,358.45,1,5.2,42,CCOc1ccccc1-c1cc2c(N[C@H](C)c3ccccc3)ncnc2[nH]1,IC50,'=',0.3,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4715760,1,Scientific Literature,Eur J Med Chem,2016.0,,
CHEMBL4749064,,0,448.55,0,3.56,69,CN(C)CCNCc1cc(-c2cc3c(N[C@H](CO)c4ccccc4)ncnc3...,IC50,'=',1.2,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4715760,1,Scientific Literature,Eur J Med Chem,2016.0,,
CHEMBL2110732,DACOMITINIB,4,469.95,1,5.16,Dacomitinib,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1NC(=O)/C=C/C...,IC50,'=',6.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4725379,1,Scientific Literature,Bioorg Med Chem,2020.0,,
CHEMBL4520788,BAY-294,0,448.59,1,5.58,EUB0000743,CNCc1ccccc1-c1csc([C@H](C)Nc2nc(C)nc3cc(OC)c(O...,IC50,'>',20.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4507273,54,Donated Chemical Probes - SGC Frankfurt,,2021.0,,Num of experiments None 2.0 None
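
The filter in the cell above works on 'Assay ChEMBL ID'. If the aim is instead to keep only compounds that have more than one measurement, the same idea can be applied to the index. The sketch below compares the two on hypothetical toy records; `toy`, `by_assay` and `by_compound` are illustrative names and the values are invented.

```python
import pandas as pd

# Hypothetical toy records: compound A measured twice, compound B measured once.
toy = pd.DataFrame(
    {
        "Assay ChEMBL ID": ["ASSAY1", "ASSAY2", "ASSAY1"],
        "Standard Value": [10.0, 12.0, 500.0],
    },
    index=pd.Index(["CHEMBL_A", "CHEMBL_A", "CHEMBL_B"], name="Molecule ChEMBL ID"),
)

# Assay-level filter: keep rows whose assay ID occurs more than once.
by_assay = toy[toy.duplicated(subset=["Assay ChEMBL ID"], keep=False)]

# Compound-level filter: keep rows whose compound ID (the index) occurs more than once.
by_compound = toy[toy.index.duplicated(keep=False)]

print(list(by_assay.index))
print(list(by_compound.index))
```

On this toy data the two filters keep different rows, so it is worth checking which behaviour is intended before relying on the row counts.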


ChEMBL draws data from many sources in the literature. To keep the activities as comparable as possible, the data can be split by assay type (the 'BAO Label' column), and any assay types with too few entries can be removed at this step. For EGFR, the assay groups containing the most compounds were 'single protein', 'protein', 'assay format', and 'cell based'. The 'cell membrane' and 'cell free' groups have only ~30 entries each and are removed before pair analysis.
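One way to see how many entries each assay group has before splitting is to count the 'BAO Label' values. The sketch below uses hypothetical toy data and an illustrative `min_entries` threshold; it is not the cutoff used in this project.

```python
import pandas as pd

# Hypothetical toy data; the label strings follow the 'BAO Label' values used in this notebook.
toy = pd.DataFrame(
    {"BAO Label": ["assay format"] * 4 + ["cell-based format"] * 3 + ["cell-free format"]}
)

min_entries = 3  # illustrative threshold only

# Number of rows per assay group, largest first.
counts = toy["BAO Label"].value_counts()

# Keep only the groups that meet the threshold.
kept_labels = counts[counts >= min_entries].index
toy_kept = toy[toy["BAO Label"].isin(kept_labels)]

print(counts)
```

Counting first makes the choice of which groups to carry forward explicit, rather than discovering small groups after they have been split out.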

In [8]:
#'assay format' assays 
df_af = df_full_a[df_full_a['BAO Label'] == 'assay format']
df_af

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL386296,,0,428.95,1,5.58,3l,COc1cc2c(cc1OC)Sc1nc(C)nc(Nc3ccc(Cl)c(C)c3)c1NC2,IC50,'>',100000.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1149491,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2006.0,,
CHEMBL57462,,0,227.18,0,0.85,16,N#CC(C#N)=C1C(=O)Nc2cc(O)c(O)cc21,IC50,'>',16000.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1125266,1,Scientific Literature,J. Med. Chem.,1991.0,,
CHEMBL133147,,0,263.34,0,4.01,12,Cc1cc(C)c2ncnc(N(C)c3ccccc3)c2c1,IC50,'>',20000.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129840,1,Scientific Literature,Bioorg. Med. Chem. Lett.,1997.0,,
CHEMBL299881,,0,365.44,0,2.22,6g,Cc1cccc(Nc2ncnc3cnc(NCCN4CCOCC4)nc23)c1,IC50,'=',2.300,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1130057,1,Scientific Literature,J. Med. Chem.,1997.0,,
CHEMBL171144,,0,385.46,0,4.78,I17,CC(COC(=O)c1cc(/N=C/c2cc(O)ccc2O)ccc1O)CC(C)(C)C,IC50,'=',10000.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1127676,1,Scientific Literature,J. Med. Chem.,1994.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL321494,,0,376.46,1,5.74,1,CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1,IC50,'=',130.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4011631,1,Scientific Literature,J Med Chem,2017.0,,
CHEMBL4102224,,0,461.57,1,5.64,6b,CCC(=O)NCCSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3cccc...,IC50,'=',3.200,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4011631,1,Scientific Literature,J Med Chem,2017.0,,
CHEMBL321494,,0,376.46,1,5.74,1,CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1,IC50,'=',2.500,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4011631,1,Scientific Literature,J Med Chem,2017.0,,
CHEMBL4868614,,0,548.65,2,5.63,30,C#Cc1cccc(Nc2ncnc3cc(OC)c(OCCCCCCCN/C(=N\C#N)N...,IC50,'=',0.473,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4818995,1,Scientific Literature,Eur J Med Chem,2021.0,,


In [9]:
#'cell membrane' assays
df_cm = df_full_a[df_full_a['BAO Label'] == 'cell membrane format']
df_cm

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL543361,,0,409.0,0,3.32,16,CN(C)CC(CSCCN)C(=O)c1ccc(OCc2ccccc2)cc1.Cl,IC50,'=',8800.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL542893,,0,361.87,0,3.17,12,Cl.O=C(CCN1CCOCC1)c1ccc(OCc2ccccc2)cc1,IC50,'=',1500.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL542891,,0,431.96,0,4.11,13,CCOC(=O)C1CCN(CCC(=O)c2ccc(OCc3ccccc3)cc2)CC1.Cl,IC50,'=',300.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL544067,,0,373.88,0,3.34,5,C=C(CN1CCOCC1)C(=O)c1ccc(OCc2ccccc2)cc1.Cl,IC50,'=',1100.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL544074,,0,371.91,0,4.49,4,C=C(CN1CCCCC1)C(=O)c1ccc(OCc2ccccc2)cc1.Cl,IC50,'=',1100.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL2442319,,0,305.33,0,2.99,5b,C#Cc1cccc(-n2ccc3cc(OC)cc(OC)c3c2=O)c1,IC50,'=',3510.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL2440093,1,Scientific Literature,Bioorg. Med. Chem.,2013.0,CHEMBL3307523,
CHEMBL543600,,0,427.91,0,2.53,21,CC(CN(C)C)C(=O)c1ccc(OS(=O)(=O)c2ccc(C(=O)O)cc...,IC50,'=',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL80030,,0,389.43,0,2.45,7,C=C(CN(C)C)C(=O)c1ccc(OS(=O)(=O)c2ccc(C(=O)O)c...,IC50,'=',230.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL542887,,0,347.84,0,3.27,3,C=C(CN(C)C)C(=O)c1ccc(OCc2ccccc2)c(O)c1.Cl,IC50,'=',670.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,
CHEMBL544538,,0,372.9,0,2.91,6,C=C(CN1CCNCC1)C(=O)c1ccc(OCc2ccccc2)cc1.Cl,IC50,'=',1500.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128740,1,Scientific Literature,J. Med. Chem.,1995.0,,


In [10]:
#'cell free' assays 
df_cf = df_full_a[df_full_a['BAO Label'] == 'cell-free format']
df_cf

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL435054,,0,564.7,2,6.76,7r,Cn1c(SSc2c(C(=O)Nc3ccccn3)c3ccccc3n2C)c(C(=O)N...,IC50,'=',47000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL345622,,0,574.78,2,8.1,7u,Cn1c(SSc2c(C(=O)Nc3cccs3)c3ccccc3n2C)c(C(=O)Nc...,IC50,'>',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL156020,,0,590.77,2,8.94,8j,CCn1c(SSc2c(C(=O)Nc3ccccc3)c3ccccc3n2CC)c(C(=O...,IC50,'>',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL79704,,0,440.55,1,5.48,1,O=C(O)CCc1c(SSc2[nH]c3ccccc3c2CCC(=O)O)[nH]c2c...,IC50,'=',4200.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL160207,,0,466.63,0,4.87,7c,CN(C)C(=O)c1c(SSc2c(C(=O)N(C)C)c3ccccc3n2C)n(C...,IC50,'=',21200.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL158275,,0,590.77,2,7.33,7g,Cn1c(SSc2c(C(=O)NCc3ccccc3)c3ccccc3n2C)c(C(=O)...,IC50,'=',1700.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL158879,,0,352.53,1,6.09,7aa,Cc1c(SSc2c(C)c3ccccc3n2C)n(C)c2ccccc12,IC50,'>',100000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL348554,,0,552.77,1,4.05,7f,CN(C)CCNC(=O)c1c(SSc2c(C(=O)NCCN(C)C)c3ccccc3n...,IC50,'=',17500.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL346553,,0,648.76,2,7.5,7y,COC(=O)c1ccc(C(=O)c2c(SSc3c(C(=O)c4ccc(C(=O)OC...,IC50,'=',6100.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,
CHEMBL347160,,0,562.72,2,7.31,6g,O=C(NCc1ccccc1)c1c(SSc2[nH]c3ccccc3c2C(=O)NCc2...,IC50,'=',15000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1128344,1,Scientific Literature,J. Med. Chem.,1995.0,CHEMBL3307523,


In [11]:
#'protein' assays 
df_p = df_full_a[df_full_a['BAO Label'] == 'protein format']
df_p

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL1173655,AFATINIB,4,485.95,0,4.39,"9, BIBW-2992",CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2...,IC50,'=',15.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1177739,1,Scientific Literature,J. Med. Chem.,2010.0,,
CHEMBL1272113,,0,392.48,0,2.74,S-13,CC#CC(=O)N1CCc2c(sc3ncnc(N[C@H](CO)c4ccccc4)c2...,IC50,'>',10000.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1269003,1,Scientific Literature,J. Med. Chem.,2010.0,,
CHEMBL1271618,,0,451.6,0,2.71,S-26,CN(C)C/C=C/C(=O)N1CCc2c(sc3ncnc(N[C@H](CO)Cc4c...,IC50,'=',4107.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1269003,1,Scientific Literature,J. Med. Chem.,2010.0,,
CHEMBL1090362,,0,614.78,2,6.59,28,CCN1CCN(c2ccc(Nc3nccc(-c4c(-c5cccc(NC(=O)Cc6cc...,IC50,'=',16.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1156213,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2010.0,,
CHEMBL1093486,,0,484.97,1,6.09,4,Nc1nc2ccc(-c3c(-c4cccc(NC(=O)c5ccccc5Cl)c4)nc4...,IC50,'>',12500.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1156213,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2010.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL3814878,,0,485.6,0,2.9,4,C=CC(=O)NCCCn1c(-c2nc(-c3cnn(C4CCN(C)CC4)c3)cn...,IC50,'=',200.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL3813644,1,Scientific Literature,ACS Med. Chem. Lett.,2016.0,,
CHEMBL3742128,,0,485.93,0,4.65,15c,CN(C)C/C=C/C(=O)Nc1cccc(Nc2nc(N/N=C/c3ccc(F)cc...,IC50,'=',2520.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL3739350,1,Scientific Literature,Eur. J. Med. Chem.,2015.0,,
CHEMBL3741436,,0,467.94,0,4.52,15a,CN(C)C/C=C/C(=O)Nc1cccc(Nc2nc(N/N=C/c3ccccc3F)...,IC50,'=',3270.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL3739350,1,Scientific Literature,Eur. J. Med. Chem.,2015.0,,
CHEMBL3741324,,0,464.96,0,4.16,15m,C/C(=N\Nc1ncc(Cl)c(Nc2cccc(NC(=O)/C=C/CN(C)C)c...,IC50,,,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL3739350,1,Scientific Literature,Eur. J. Med. Chem.,2015.0,,


In [12]:
#'single protein' assays 
df_sp = df_full_a[df_full_a['BAO Label'] == 'single protein format']
df_sp

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL2448065,,0,467.37,0,4.9,15,COc1cc2c(Nc3ccc(Cl)cc3F)ncnc2cc1OCC1CCN(C)CC1.Cl,IC50,'=',300.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1135889,1,Scientific Literature,J. Med. Chem.,2002.0,,
CHEMBL153573,,0,389.25,1,5.07,64,Nc1cc2sc3c(Nc4cccc(Br)c4)ncnc3c2cc1F,IC50,'=',1.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1132555,1,Scientific Literature,J. Med. Chem.,1999.0,,
CHEMBL286160,,0,377.8,0,4.2,3,COCCOc1cc2ncnc(Nc3ccc(Cl)cc3F)c2cc1OC,IC50,'=',100.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1135889,1,Scientific Literature,J. Med. Chem.,2002.0,,
CHEMBL217092,SARACATINIB,3,542.04,1,3.94,"33, AZD-0530",CN1CCN(CCOc2cc(OC3CCOCC3)c3c(Nc4c(Cl)ccc5c4OCO...,IC50,'=',2590.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1137334,1,Scientific Literature,J. Med. Chem.,2006.0,,
CHEMBL80809,,0,456.58,0,4.81,41,CC(=O)NCCNCc1ccc(-c2cc3ncnc(Nc4ccc5[nH]ccc5c4)...,IC50,'=',5.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1146693,1,Scientific Literature,Bioorg. Med. Chem. Lett.,2004.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL4785569,,0,358.45,1,5.2,42,CCOc1ccccc1-c1cc2c(N[C@H](C)c3ccccc3)ncnc2[nH]1,IC50,'=',0.3,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4715760,1,Scientific Literature,Eur J Med Chem,2016.0,,
CHEMBL4749064,,0,448.55,0,3.56,69,CN(C)CCNCc1cc(-c2cc3c(N[C@H](CO)c4ccccc4)ncnc3...,IC50,'=',1.2,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4715760,1,Scientific Literature,Eur J Med Chem,2016.0,,
CHEMBL2110732,DACOMITINIB,4,469.95,1,5.16,Dacomitinib,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1NC(=O)/C=C/C...,IC50,'=',6.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4725379,1,Scientific Literature,Bioorg Med Chem,2020.0,,
CHEMBL4520788,BAY-294,0,448.59,1,5.58,EUB0000743,CNCc1ccccc1-c1csc([C@H](C)Nc2nc(C)nc3cc(OC)c(O...,IC50,'>',20.0,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4507273,54,Donated Chemical Probes - SGC Frankfurt,,2021.0,,Num of experiments None 2.0 None


In [13]:
#'cell-based' assays 
df_cb = df_full_a[df_full_a['BAO Label'] == 'cell-based format']
df_cb

Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,Standard Value,...,Target Name,Target Organism,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties
CHEMBL25610,,0,432.88,0,3.89,8,COc1cc2c(Nc3ccc(Cl)cc3F)ncnc2cc1OCCN1CCOCC1,IC50,'=',300.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1135889,1,Scientific Literature,J. Med. Chem.,2002.0,,
CHEMBL31630,,0,500.96,1,3.81,26,C=CC(=O)N(C)c1nc2c(Nc3ccc(F)c(Cl)c3)ncnc2cc1OC...,IC50,'=',88.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1133162,1,Scientific Literature,J. Med. Chem.,2000.0,CHEMBL3307523,
CHEMBL89940,,0,236.28,0,2.96,9,Nc1ccc2ncnc(Nc3ccccc3)c2c1,IC50,'=',770.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129291,1,Scientific Literature,J. Med. Chem.,1996.0,CHEMBL3307523,
CHEMBL306081,,0,272.74,0,3.97,13,Cc1[nH]c2ncnc(Nc3cccc(Cl)c3)c2c1C,IC50,'=',300.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129072,1,Scientific Literature,J. Med. Chem.,1996.0,CHEMBL3307523,
CHEMBL29197,,0,360.21,0,4.15,3,COc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OC,IC50,'=',0.029,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL1129564,1,Scientific Literature,J. Med. Chem.,1996.0,CHEMBL3307523,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CHEMBL4743464,,0,322.3,0,3.16,3,N#Cc1cccc(Nc2ncnc3cc4c(cc23)OCCO4)c1F,IC50,'=',28.200,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4725422,1,Scientific Literature,ACS Med Chem Lett,2020.0,,TIME = 1.0 hr
CHEMBL4467033,,0,439.87,1,5.76,16; CA314,COc1cc2ncnc(Nc3ccc(OCc4cccc(F)c4)c(Cl)c3)c2cc1OC,IC50,'<',15.000,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4393671,1,Scientific Literature,J Med Chem,2019.0,,
CHEMBL4869409,,0,471.49,0,3.7,19c,C#CC(=O)N1CCC(c2cnn3c(C(N)=O)c(-c4ccc(Oc5ccccc...,IC50,'=',16.300,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4842353,1,Scientific Literature,Eur J Med Chem,2021.0,,TIME = 3.0 hr
CHEMBL4856592,,0,483.53,0,3.57,19f,C#CC(=O)N1CCC(c2cnn3c(C(N)=O)c(-c4ccc(Oc5ccc(O...,IC50,'=',50.800,...,Epidermal growth factor receptor erbB1,Homo sapiens,SINGLE PROTEIN,CHEMBL4842353,1,Scientific Literature,Eur J Med Chem,2021.0,,TIME = 3.0 hr


## Finding the median IC50 and pIC50 values for each compound

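A compound can appear several times within an assay group, so for each compound the activity is summarised as the median of its 'Standard Value' (IC50) entries and the median of its 'pChEMBL Value' (pIC50) entries.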
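As a small illustration of what this grouping does, the sketch below applies the same `groupby(...).median()` pattern to hypothetical toy data in which one compound has three measurements and another has a single measurement with no pChEMBL value. The compound IDs and numbers are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: compound A has three measurements, compound B has one without a pChEMBL value.
toy = pd.DataFrame(
    {
        "Standard Value": [100.0, 300.0, 200.0, 10000.0],
        "pChEMBL Value": [7.0, 6.52, 6.7, np.nan],
    },
    index=pd.Index(
        ["CHEMBL_A", "CHEMBL_A", "CHEMBL_A", "CHEMBL_B"], name="Molecule ChEMBL ID"
    ),
)

# One row per compound; each column's median is computed independently and skips NaN.
toy_median = toy.groupby("Molecule ChEMBL ID")[["Standard Value", "pChEMBL Value"]].median()
print(toy_median)
```

Compound B keeps its 'Standard Value' median but has NaN in 'pChEMBL Value', because none of its rows carries a pChEMBL value.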

In [15]:
df_af_median = df_af.groupby('Molecule ChEMBL ID')[['Standard Value', 'pChEMBL Value']].median()

df_p_median = df_p.groupby('Molecule ChEMBL ID')[['Standard Value', 'pChEMBL Value']].median()

df_sp_median = df_sp.groupby('Molecule ChEMBL ID')[['Standard Value', 'pChEMBL Value']].median()

df_cb_median = df_cb.groupby('Molecule ChEMBL ID')[['Standard Value', 'pChEMBL Value']].median()

In [16]:
df_af_median

Molecule ChEMBL ID,Standard Value,pChEMBL Value
CHEMBL113900,250.0,6.60
CHEMBL113901,5220.0,5.28
CHEMBL113902,7500.0,5.12
CHEMBL113985,5390.0,5.27
CHEMBL113996,13600.0,4.87
...,...,...
CHEMBL75232,10000.0,
CHEMBL77030,7000.0,5.16
CHEMBL89363,50000.0,
CHEMBL91867,0.2,9.70
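
As a sanity check on the two columns, several rows in the table above agree with the usual definition pIC50 = -log10(IC50 in mol/L), which is 9 - log10(IC50 in nM). The sketch below assumes that 'Standard Value' is reported in nM (the units column is not shown in the output above); `pic50_from_nm` is an illustrative helper, not part of the original notebook.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


# (Standard Value, pChEMBL Value) pairs copied from the median table above.
rows = [(250.0, 6.60), (5220.0, 5.28), (7500.0, 5.12), (0.2, 9.70)]

for ic50_nm, pchembl in rows:
    # Print the IC50, the recomputed pIC50, and the tabulated pChEMBL value side by side.
    print(ic50_nm, round(pic50_from_nm(ic50_nm), 2), pchembl)
```

Because the two medians are computed column by column, a compound's median pChEMBL value need not match the value recomputed from its median IC50 exactly, so small differences in other rows are not necessarily errors.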


Some compounds have no median in the 'pChEMBL Value' column (NaN), as with CHEMBL75232 and CHEMBL89363 in the table above. These cannot be used in this project, so they are removed from the dataset.

In [17]:
df_af_median_c = df_af_median.dropna(subset=['pChEMBL Value'])

df_p_median_c = df_p_median.dropna(subset=['pChEMBL Value'])

df_sp_median_c = df_sp_median.dropna(subset=['pChEMBL Value'])

df_cb_median_c = df_cb_median.dropna(subset=['pChEMBL Value'])

These tables can then be exported to CSV files.

In [18]:
df_af_median_c.to_csv('assayformat_median.csv')

df_p_median_c.to_csv('protein_median.csv')

df_sp_median_c.to_csv('singleprotein_median.csv')

df_cb_median_c.to_csv('cellbased_median.csv')
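
The median, cleaning and export steps above are repeated for each assay group. One way to reduce the repetition is to loop over the groups, as sketched below. This assumes hypothetical toy data standing in for `df_full_a`, an illustrative helper `median_per_compound`, and a temporary output directory instead of the working directory.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def median_per_compound(df: pd.DataFrame) -> pd.DataFrame:
    """Median IC50 and pChEMBL per compound, dropping compounds with no pChEMBL median."""
    cols = ["Standard Value", "pChEMBL Value"]
    medians = df.groupby("Molecule ChEMBL ID")[cols].median()
    return medians.dropna(subset=["pChEMBL Value"])


# Hypothetical stand-in for df_full_a.
toy = pd.DataFrame(
    {
        "BAO Label": ["assay format", "assay format", "cell-based format", "cell-based format"],
        "Standard Value": [100.0, 1000.0, 50.0, 20000.0],
        "pChEMBL Value": [7.0, 6.0, 7.3, np.nan],
    },
    index=pd.Index(
        ["CHEMBL_A", "CHEMBL_A", "CHEMBL_B", "CHEMBL_C"], name="Molecule ChEMBL ID"
    ),
)

# Map each 'BAO Label' value to the file stem used for its export.
labels = {"assay format": "assayformat", "cell-based format": "cellbased"}

out_dir = Path(tempfile.mkdtemp())
results = {}
for label, stem in labels.items():
    subset = toy[toy["BAO Label"] == label]
    results[stem] = median_per_compound(subset)
    results[stem].to_csv(out_dir / f"{stem}_median.csv")
```

Keeping the label-to-filename mapping in one dictionary means a new assay group only needs one extra entry.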

## Creating SMILES tables for each compound

These filtered datasets can also be reduced to tables that contain only each compound's ChEMBL ID (the index) and its associated SMILES.

In [20]:
# Select the 'Smiles' column by name (it is the column at position 6)
df_af_smiles = df_af[['Smiles']]

df_p_smiles = df_p[['Smiles']]

df_sp_smiles = df_sp[['Smiles']]

df_cb_smiles = df_cb[['Smiles']]

In [21]:
df_af_smiles

Molecule ChEMBL ID,Smiles
CHEMBL386296,COc1cc2c(cc1OC)Sc1nc(C)nc(Nc3ccc(Cl)c(C)c3)c1NC2
CHEMBL57462,N#CC(C#N)=C1C(=O)Nc2cc(O)c(O)cc21
CHEMBL133147,Cc1cc(C)c2ncnc(N(C)c3ccccc3)c2c1
CHEMBL299881,Cc1cccc(Nc2ncnc3cnc(NCCN4CCOCC4)nc23)c1
CHEMBL171144,CC(COC(=O)c1cc(/N=C/c2cc(O)ccc2O)ccc1O)CC(C)(C)C
...,...
CHEMBL321494,CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1
CHEMBL4102224,CCC(=O)NCCSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3cccc...
CHEMBL321494,CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1
CHEMBL4868614,C#Cc1cccc(Nc2ncnc3cc(OC)c(OCCCCCCCN/C(=N\C#N)N...
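
A compound that was measured more than once appears more than once in these tables (CHEMBL321494 is listed twice above). If one row per compound is wanted, the repeated index entries can be dropped. The sketch below reuses two SMILES strings from the table above; `toy_smiles` and `unique_smiles` are illustrative names.

```python
import pandas as pd

# Rows copied from the table above; CHEMBL321494 appears twice, as in the real output.
toy_smiles = pd.DataFrame(
    {
        "Smiles": [
            "CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1",
            "Cc1cc(C)c2ncnc(N(C)c3ccccc3)c2c1",
            "CSc1nc(-c2ccc(F)cc2)c(-c2ccnc(Nc3ccccc3)c2)[nH]1",
        ]
    },
    index=pd.Index(
        ["CHEMBL321494", "CHEMBL133147", "CHEMBL321494"], name="Molecule ChEMBL ID"
    ),
)

# Keep the first row for each compound ID and drop later repeats.
unique_smiles = toy_smiles[~toy_smiles.index.duplicated(keep="first")]
print(unique_smiles)
```

Whether to de-duplicate depends on how the SMILES tables are used downstream. A unique index makes later joins against the median tables one-to-one.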


These tables can also be exported to CSV files.

In [22]:
df_af_smiles.to_csv('assayformat_smiles.csv')

df_p_smiles.to_csv('protein_smiles.csv')

df_sp_smiles.to_csv('singleprotein_smiles.csv')

df_cb_smiles.to_csv('cellbased_smiles.csv')