# Part 1:
## Using the ChEMBL Database
The ChEMBL database contains curated bioactivity data for more than 2 million compounds.

I installed the ChEMBL web service package so that I can retrieve bioactivity data from the database.

In this Jupyter notebook I will build a real-life data science project: a machine learning model trained on ChEMBL bioactivity data.

In [1]:
# importing libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

# Search for target protein
The target is the protein that the drug will act on.

Biologically, a compound comes into contact with the target protein (or organism) and modulates its activity, for example by activating or inhibiting it.

The code in this section is the equivalent of going to the ChEMBL website and typing a search term into the search bar.

In [8]:
# TODO: Target search for Leukemia:

# get the target endpoint of the ChEMBL web client
target = new_client.target

# search ChEMBL for targets matching the query term
target_query = target.search('Myeloproliferative leukemia')

# build a dataframe from the search results
targets = pd.DataFrame.from_dict(target_query)

# display the contents of the dataframe
targets

# for this part I will be using the single protein and chimeric protein targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P40238', 'xref_name': None, 'xre...",Homo sapiens,Thrombopoietin receptor,26.0,False,CHEMBL1864,"[{'accession': 'P40238', 'component_descriptio...",SINGLE PROTEIN,9606.0
1,[],Mus musculus,Thrombopoietin receptor,26.0,False,CHEMBL1075309,"[{'accession': 'Q08351', 'component_descriptio...",SINGLE PROTEIN,10090.0
2,"[{'xref_id': 'Thrombopoietin', 'xref_name': No...",Homo sapiens,Thrombopoietin,23.0,False,CHEMBL1293256,"[{'accession': 'P40225', 'component_descriptio...",SINGLE PROTEIN,9606.0
3,[],Homo sapiens,Myeloid leukemia factor 2,14.0,False,CHEMBL4295830,"[{'accession': 'Q15773', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,[],Homo sapiens,Leukemia cells,13.0,False,CHEMBL614844,[],CELL-LINE,9606.0
...,...,...,...,...,...,...,...,...,...
58,[],Mus musculus,Runt-related transcription factor 2,5.0,False,CHEMBL1681609,"[{'accession': 'Q08775', 'component_descriptio...",SINGLE PROTEIN,10090.0
59,[],Homo sapiens,MLL1-ASH2L/RbBP5/WDR5/DPY30,5.0,False,CHEMBL4106124,"[{'accession': 'P61964', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606.0
60,[],Homo sapiens,Runt-related transcription factor 1/Core-bindi...,4.0,False,CHEMBL2093862,"[{'accession': 'Q01196', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606.0
61,[],Homo sapiens,Baculoviral IAP repeat-containing protein 2/BC...,4.0,False,CHEMBL4296119,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606.0
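
The search returns several kinds of targets (single proteins, cell lines, protein-protein interactions) from more than one organism. A minimal sketch of how the results could be narrowed down before picking a target is shown here. `targets_sample` is a hypothetical stand-in that copies a few rows from the output above, so the snippet runs without calling the web service.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` dataframe; only the columns
# needed for the filter are reproduced from the output above.
targets_sample = pd.DataFrame({
    "organism": ["Homo sapiens", "Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Thrombopoietin receptor", "Thrombopoietin receptor",
                  "Thrombopoietin", "Leukemia cells"],
    "target_chembl_id": ["CHEMBL1864", "CHEMBL1075309",
                         "CHEMBL1293256", "CHEMBL614844"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN",
                    "SINGLE PROTEIN", "CELL-LINE"],
})

# Keep human single-protein targets only.
mask = (targets_sample["target_type"] == "SINGLE PROTEIN") & (
    targets_sample["organism"] == "Homo sapiens"
)
human_single_proteins = targets_sample[mask]

# Look the target up by name instead of relying on its row position.
selected = human_single_proteins.loc[
    human_single_proteins["pref_name"] == "Thrombopoietin receptor",
    "target_chembl_id",
].iloc[0]
print(selected)
```

Selecting by name rather than by position keeps working if the order of the search results changes.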


## Select and retrieve bioactivity data for Myeloproliferative leukemia

In [9]:
# Assign the first entry (index 0), which corresponds to the target protein,
# Thrombopoietin receptor, to the selected_target variable.
selected_target = targets.target_chembl_id[0]
selected_target

# CHEMBL1864 is the unique ChEMBL ID of the target

'CHEMBL1864'

In [18]:
# Here I retrieve only the bioactivity data for the Thrombopoietin receptor (CHEMBL1864),
# the target found by the 'Myeloproliferative leukemia' search, that are reported as
# EC50 values. The standard_value column is given in nM (nanomolar).

# get the activity endpoint of the ChEMBL web client
activity = new_client.activity

# keep only the records whose standard_type is EC50
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="EC50")

In [20]:
# show all columns, build a dataframe from res, and display the first rows
pd.set_option('display.max_columns', None)

df = pd.DataFrame.from_dict(res)
df.head()

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1184334,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,Cc1[nH]n(-c2ccc([N+](=O)[O-])cc2)c(=O)c1/N=N/c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.63', 'le': '0.26', 'lle': '2.50', ...",CHEMBL3144682,,CHEMBL3144682,6.4,0,http://www.openphacts.org/units/Nanomolar,234728,=,1,1,=,,EC50,nM,,400.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,0.4
1,,1187111,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)ccc3cc(O)ccc23)c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.89', 'le': '0.27', 'lle': '0.93', ...",CHEMBL125996,,CHEMBL125996,5.7,0,http://www.openphacts.org/units/Nanomolar,234684,=,1,1,=,,EC50,nM,,2000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,2.0
2,,1187113,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,Cc1[nH]n(-c2ccc(C(C)C)cc2)c(=O)c1/N=N/c1c(O)cc...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '15.34', 'le': '0.30', 'lle': '2.03', ...",CHEMBL3144783,,CHEMBL3144783,7.16,0,http://www.openphacts.org/units/Nanomolar,234726,=,1,1,=,,EC50,nM,,70.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,0.07
3,,1188428,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)c(O)cc3ccccc23)c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.89', 'le': '0.27', 'lle': '0.93', ...",CHEMBL122504,,CHEMBL122504,5.7,0,http://www.openphacts.org/units/Nanomolar,234736,=,1,1,=,,EC50,nM,,2000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,2.0
4,,1188430,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,CCNS(=O)(=O)c1cccc(-n2[nH]c(C)c(/N=N/c3c(O)cc(...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '9.41', 'le': '0.19', 'lle': '1.71', '...",CHEMBL3144667,,CHEMBL3144667,5.0,0,http://www.openphacts.org/units/Nanomolar,234722,=,1,1,=,,EC50,nM,,10000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,10.0


In [21]:
# Make sure that the standard_type column only contains EC50.
# Restricting the data to a single standard type makes the data set more uniform.
df['standard_type']

0      EC50
1      EC50
2      EC50
3      EC50
4      EC50
       ... 
327    EC50
328    EC50
329    EC50
330    EC50
331    EC50
Name: standard_type, Length: 332, dtype: object

In [22]:
# The standard_value column holds the potency of the compound (EC50 in nM). The lower the number, the more potent the compound; the higher the number, the less potent it is.

# Ideally this number is as low as possible: a low EC50 (the concentration that produces 50% of the maximal effect) means that a lower concentration of the compound is needed to elicit half of the maximal response at the target protein.

df['standard_value']

0        400.0
1       2000.0
2         70.0
3       2000.0
4      10000.0
        ...   
327       22.0
328       31.0
329       26.0
330       42.0
331       46.0
Name: standard_value, Length: 332, dtype: object
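
The `dtype: object` in the output above suggests that the values are not stored as numbers (most likely as strings), which is why a later cell wraps each value in `float()`. A minimal sketch of converting the column once, assuming string values, is shown here. `sample` is a hypothetical stand-in for the real dataframe.

```python
import pandas as pd

# Hypothetical sample that mimics an object-typed standard_value column.
sample = pd.DataFrame({"standard_value": ["400.0", "2000.0", None, "70.0"]})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
sample["standard_value"] = pd.to_numeric(sample["standard_value"], errors="coerce")

print(sample["standard_value"].dtype)
print(sample["standard_value"].min())
```

After the conversion, numeric operations such as `min()` or comparisons against a threshold work on the whole column at once.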

In [23]:
# Finally, I will save the resulting bioactivity data to a CSV file called bioactivity_data.csv
df.to_csv('data/bioactivity_data.csv', index=False)
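
Writing to `data/bioactivity_data.csv` fails if the `data` folder does not exist yet. A minimal sketch of creating the folder first is shown here. The temporary base directory and the `sample` dataframe are assumptions made so the snippet can run anywhere; in the notebook the base would simply be the working directory.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical stand-in for the bioactivity dataframe.
sample = pd.DataFrame(
    {"molecule_chembl_id": ["CHEMBL3144682"], "standard_value": [400.0]}
)

# A temporary directory keeps this sketch from touching the project folder.
base = Path(tempfile.mkdtemp())
out_path = base / "data" / "bioactivity_data.csv"

# Create the parent folder if it is missing, then write the file.
out_path.parent.mkdir(parents=True, exist_ok=True)
sample.to_csv(out_path, index=False)

print(out_path.exists())
```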

# Handling missing data
Any compound with a missing value in the standard_value column is dropped.

In [24]:
df2 = pd.read_csv('data/bioactivity_data.csv')

In [26]:
df2 = df2[df2.standard_value.notna()]
df2
# rows with a missing standard_value, if any, have now been removed.

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1184334,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,Cc1[nH]n(-c2ccc([N+](=O)[O-])cc2)c(=O)c1/N=N/c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.63', 'le': '0.26', 'lle': '2.50', ...",CHEMBL3144682,,CHEMBL3144682,6.40,0,http://www.openphacts.org/units/Nanomolar,234728,=,1,1,=,,EC50,nM,,400.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,0.40
1,,1187111,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)ccc3cc(O)ccc23)c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.89', 'le': '0.27', 'lle': '0.93', ...",CHEMBL125996,,CHEMBL125996,5.70,0,http://www.openphacts.org/units/Nanomolar,234684,=,1,1,=,,EC50,nM,,2000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,2.00
2,,1187113,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,Cc1[nH]n(-c2ccc(C(C)C)cc2)c(=O)c1/N=N/c1c(O)cc...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '15.34', 'le': '0.30', 'lle': '2.03', ...",CHEMBL3144783,,CHEMBL3144783,7.16,0,http://www.openphacts.org/units/Nanomolar,234726,=,1,1,=,,EC50,nM,,70.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,0.07
3,,1188428,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)c(O)cc3ccccc23)c...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '13.89', 'le': '0.27', 'lle': '0.93', ...",CHEMBL122504,,CHEMBL122504,5.70,0,http://www.openphacts.org/units/Nanomolar,234736,=,1,1,=,,EC50,nM,,2000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,2.00
4,,1188430,[],CHEMBL815867,Effective concentration for thrombopoietin luc...,B,,,BAO_0000188,BAO_0000219,cell-based format,CCNS(=O)(=O)c1cccc(-n2[nH]c(C)c(/N=N/c3c(O)cc(...,,,CHEMBL1134668,J. Med. Chem.,2001,"{'bei': '9.41', 'le': '0.19', 'lle': '1.71', '...",CHEMBL3144667,,CHEMBL3144667,5.00,0,http://www.openphacts.org/units/Nanomolar,234722,=,1,1,=,,EC50,nM,,10000.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,uM,UO_0000065,,10.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
327,,3499609,[],CHEMBL1260125,Inhibition of human TPOR expressed in human Ba...,B,,,BAO_0000188,BAO_0000219,cell-based format,CNC(=O)c1ccc2c(c1)CCN2C(=S)NN/C(C)=C1\C(=O)N(c...,,,CHEMBL1255541,Bioorg. Med. Chem. Lett.,2010,"{'bei': '16.07', 'le': '0.31', 'lle': '4.56', ...",CHEMBL1257846,,CHEMBL1257846,7.66,0,http://www.openphacts.org/units/Nanomolar,962762,=,1,1,=,,EC50,nM,,22.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,nM,UO_0000065,,22.00
328,,3499610,[],CHEMBL1260125,Inhibition of human TPOR expressed in human Ba...,B,,,BAO_0000188,BAO_0000219,cell-based format,CC1=NN(c2ccc(C)c(C)c2)C(=O)/C1=C(/C)NNC(=S)N1C...,,,CHEMBL1255541,Bioorg. Med. Chem. Lett.,2010,"{'bei': '14.10', 'le': '0.27', 'lle': '4.30', ...",CHEMBL1257847,,CHEMBL1257847,7.51,0,http://www.openphacts.org/units/Nanomolar,962763,=,1,1,=,,EC50,nM,,31.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,nM,UO_0000065,,31.00
329,,3499611,[],CHEMBL1260125,Inhibition of human TPOR expressed in human Ba...,B,,,BAO_0000188,BAO_0000219,cell-based format,CCNC(=O)c1ccc2c(c1)CCN2C(=S)NN/C(C)=C1\C(=O)N(...,,,CHEMBL1255541,Bioorg. Med. Chem. Lett.,2010,"{'bei': '15.46', 'le': '0.30', 'lle': '4.10', ...",CHEMBL1257962,,CHEMBL1257962,7.58,0,http://www.openphacts.org/units/Nanomolar,962892,=,1,1,=,,EC50,nM,,26.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,nM,UO_0000065,,26.00
330,,3499612,[],CHEMBL1260125,Inhibition of human TPOR expressed in human Ba...,B,,,BAO_0000188,BAO_0000219,cell-based format,CC1=NN(c2ccc(C)c(C)c2)C(=O)/C1=C(/C)NNC(=O)N1C...,,,CHEMBL1255541,Bioorg. Med. Chem. Lett.,2010,"{'bei': '16.48', 'le': '0.31', 'lle': '4.10', ...",CHEMBL1257963,,CHEMBL1257963,7.38,0,http://www.openphacts.org/units/Nanomolar,962893,=,1,1,=,,EC50,nM,,42.0,CHEMBL1864,Homo sapiens,Thrombopoietin receptor,9606,,,EC50,nM,UO_0000065,,42.00
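
To check whether the filter actually removed anything, the missing values can be counted before dropping them. This is a minimal sketch on a hypothetical `sample` dataframe (the molecule IDs `A` to `D` are placeholders), using `dropna` as an alternative to the `notna()` mask above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing potency values.
sample = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "standard_value": [400.0, np.nan, 70.0, np.nan],
})

# Count the missing values first, then drop those rows.
n_missing = int(sample["standard_value"].isna().sum())
cleaned = sample.dropna(subset=["standard_value"])

print(f"dropped {n_missing} of {len(sample)} rows")
print(cleaned.index.tolist())
```

Note that the surviving rows keep their original index labels, so gaps in the index are a sign that rows were removed.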


## Data pre-processing of the bioactivity data

Here I will label each compound as active, inactive, or intermediate.

The bioactivity data are EC50 values in nM. Compounds with values of 1,000 nM or less are considered active, those with values of 10,000 nM or more are considered inactive, and those in between are labelled intermediate.

In [27]:
# These labels let a machine learning model classify compounds into 3 categories: active, inactive, or intermediate.

bioactivity_class = []

for i in df2.standard_value:
    if float(i) >= 10000:
        bioactivity_class.append('inactive')
    elif float(i) <= 1000:
        bioactivity_class.append('active')
    else:
        bioactivity_class.append('intermediate')
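
The same thresholds can also be applied to the whole column at once. The following is a sketch of that alternative using `numpy.select`; the function name `classify_bioactivity` and its parameters are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def classify_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label each EC50 value (nM) as 'active', 'inactive', or 'intermediate'."""
    values = pd.to_numeric(pd.Series(values), errors="coerce")
    # Conditions are checked in order; anything matching neither condition
    # gets the default label. Missing values also fall through to
    # 'intermediate', so they should be dropped beforehand.
    conditions = [values >= inactive_min, values <= active_max]
    choices = ["inactive", "active"]
    labels = np.select(conditions, choices, default="intermediate")
    return pd.Series(labels, index=values.index)

labels = classify_bioactivity([400.0, 2000.0, 70.0, 10000.0, 1000.0])
print(labels.tolist())
```

Keeping the thresholds as parameters makes it easy to try different cut-offs without editing the loop.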

In [28]:
# Collect the molecule_chembl_id values into a list:
# this goes through the molecule IDs and puts them in the mol_cid variable.
# Since this dataset contains many molecular structures (drug compounds), I need to keep them for future analysis.
# These compounds act on the target protein to produce the desired effect of the medication.
mol_cid = []
for i in df2.molecule_chembl_id:
    mol_cid.append(i)

In [29]:
mol_cid

['CHEMBL3144682',
 'CHEMBL125996',
 'CHEMBL3144783',
 'CHEMBL122504',
 'CHEMBL3144667',
 'CHEMBL3144785',
 'CHEMBL3144672',
 'CHEMBL3144775',
 'CHEMBL3144679',
 'CHEMBL340040',
 'CHEMBL3144788',
 'CHEMBL3144786',
 'CHEMBL3144789',
 'CHEMBL124367',
 'CHEMBL3144776',
 'CHEMBL3144677',
 'CHEMBL121918',
 'CHEMBL3144778',
 'CHEMBL3144665',
 'CHEMBL3144780',
 'CHEMBL3144666',
 'CHEMBL3144790',
 'CHEMBL3144676',
 'CHEMBL3144836',
 'CHEMBL3144675',
 'CHEMBL3144787',
 'CHEMBL3144670',
 'CHEMBL3144781',
 'CHEMBL3144668',
 'CHEMBL3144663',
 'CHEMBL3144664',
 'CHEMBL3144844',
 'CHEMBL445374',
 'CHEMBL331220',
 'CHEMBL124855',
 'CHEMBL123373',
 'CHEMBL3144662',
 'CHEMBL3144661',
 'CHEMBL3144852',
 'CHEMBL3144669',
 'CHEMBL3144680',
 'CHEMBL121996',
 'CHEMBL420168',
 'CHEMBL3144784',
 'CHEMBL122867',
 'CHEMBL426663',
 'CHEMBL204667',
 'CHEMBL380407',
 'CHEMBL377292',
 'CHEMBL202796',
 'CHEMBL382806',
 'CHEMBL205027',
 'CHEMBL206304',
 'CHEMBL204160',
 'CHEMBL206393',
 'CHEMBL247013',
 'CHEMBL397519'

In [31]:
# Collect the canonical_smiles values into a list:
canonical_smiles = []
for i in df2.canonical_smiles:
    canonical_smiles.append(i)

In [32]:
# Collect the standard_value values into a list:
standard_value = []
for i in df2.standard_value:
    standard_value.append(i)

In [33]:
# Combine the 4 lists into a dataframe
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])
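
The loops and `zip` can be replaced by selecting the columns directly. This is a sketch of that alternative, not the approach used in the notebook. `df2_sample`, `labels`, and the placeholder SMILES strings are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df2, with a gap in the index as left by a row filter.
df2_sample = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL3144682", "CHEMBL125996", "CHEMBL3144667"],
        "canonical_smiles": ["C", "CC", "CCC"],  # placeholders, not the real structures
        "standard_value": [400.0, 2000.0, 10000.0],
        "assay_type": ["B", "B", "B"],
    },
    index=[0, 1, 4],
)
labels = ["active", "intermediate", "inactive"]

# Select the columns of interest, renumber the rows, and attach the labels.
df3_sample = (
    df2_sample[["molecule_chembl_id", "canonical_smiles", "standard_value"]]
    .reset_index(drop=True)
    .assign(bioactivity_class=labels)
)

# Reorder to match the column order used in the notebook.
df3_sample = df3_sample[
    ["molecule_chembl_id", "canonical_smiles", "bioactivity_class", "standard_value"]
]
print(df3_sample)
```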

In [34]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL3144682,Cc1[nH]n(-c2ccc([N+](=O)[O-])cc2)c(=O)c1/N=N/c...,active,400.0
1,CHEMBL125996,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)ccc3cc(O)ccc23)c...,intermediate,2000.0
2,CHEMBL3144783,Cc1[nH]n(-c2ccc(C(C)C)cc2)c(=O)c1/N=N/c1c(O)cc...,active,70.0
3,CHEMBL122504,O=S(=O)(O)c1cc(O)c(/N=N/c2c(O)c(O)cc3ccccc23)c...,intermediate,2000.0
4,CHEMBL3144667,CCNS(=O)(=O)c1cccc(-n2[nH]c(C)c(/N=N/c3c(O)cc(...,inactive,10000.0
...,...,...,...,...
323,CHEMBL1257846,CNC(=O)c1ccc2c(c1)CCN2C(=S)NN/C(C)=C1\C(=O)N(c...,active,22.0
324,CHEMBL1257847,CC1=NN(c2ccc(C)c(C)c2)C(=O)/C1=C(/C)NNC(=S)N1C...,active,31.0
325,CHEMBL1257962,CCNC(=O)c1ccc2c(c1)CCN2C(=S)NN/C(C)=C1\C(=O)N(...,active,26.0
326,CHEMBL1257963,CC1=NN(c2ccc(C)c(C)c2)C(=O)/C1=C(/C)NNC(=O)N1C...,active,42.0


In [35]:
# save pre-processed dataframe to CSV file
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)