<a href="https://colab.research.google.com/github/misgana30/alpha/blob/master/Copy_of_CDD_ML_Part_1_Bioactivity_Data_Concised_cryptosporidium.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 1] IMPDH of Cryptosporidium (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client


## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Cryptosporidium**

In [None]:
# Target search for Cryptosporidium
target = new_client.target
target_query = target.search('cryptosporidium')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Cryptosporidium,Cryptosporidium,18.0,False,CHEMBL615036,[],ORGANISM,5806
1,[],Cryptosporidium parvum,Cryptosporidium parvum,16.0,False,CHEMBL612855,[],ORGANISM,5807
2,[],Cryptosporidium hominis,Cryptosporidium hominis,16.0,False,CHEMBL613353,[],ORGANISM,237895
3,"[{'xref_id': 'Q7YY76', 'xref_name': None, 'xre...",Cryptosporidium parvum,Histone deacetylase,10.0,False,CHEMBL3626,"[{'accession': 'Q7YY76', 'component_descriptio...",SINGLE PROTEIN,5807
4,"[{'xref_id': 'Q7YY07', 'xref_name': None, 'xre...",Cryptosporidium parvum,Histone deacetylase,10.0,False,CHEMBL4740,"[{'accession': 'Q7YY07', 'component_descriptio...",SINGLE PROTEIN,5807
5,"[{'xref_id': 'Q5CGA3', 'xref_name': None, 'xre...",Cryptosporidium hominis,Dihydrofolate reductase,10.0,False,CHEMBL3327,"[{'accession': 'Q27552', 'component_descriptio...",SINGLE PROTEIN,237895
6,"[{'xref_id': 'Q8T6T2', 'xref_name': None, 'xre...",Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",10.0,False,CHEMBL6145,"[{'accession': 'Q8T6T2', 'component_descriptio...",SINGLE PROTEIN,5807
7,[],Cryptosporidium parvum,Pyruvate:ferredoxin oxidoreductase,10.0,False,CHEMBL2364026,"[{'accession': 'Q968X7', 'component_descriptio...",SINGLE PROTEIN,5807
8,[],Cryptosporidium parvum,Bifunctional dihydrofolate reductase-thymidyla...,10.0,False,CHEMBL2366489,"[{'accession': 'Q27552', 'component_descriptio...",SINGLE PROTEIN,5807
9,[],Cryptosporidium parvum (strain Iowa II),"Cell division control protein 28, putative",10.0,False,CHEMBL4295532,"[{'accession': 'A3FPL9', 'component_descriptio...",SINGLE PROTEIN,353152


### **Select and retrieve bioactivity data for *Cryptosporidium parvum* IMPDH (seventh entry)**

We will assign the seventh entry (index 6, which corresponds to the target protein, *Cryptosporidium parvum* inosine-5'-monophosphate dehydrogenase, or IMPDH) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL6145'
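
Selecting by a hard-coded row index can break if the search results come back in a different order. As a minimal sketch, the target could instead be picked by name and type. The `targets_demo` table below is a hypothetical three-row stand-in for `targets` (it only reuses IDs and names shown in the output above), and the substring used for matching is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` table (only the columns used here).
targets_demo = pd.DataFrame({
    "pref_name": [
        "Cryptosporidium",
        "Histone deacetylase",
        "Inosine-5'-monophosphate dehydrogenase, probable",
    ],
    "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL615036", "CHEMBL3626", "CHEMBL6145"],
})

# Keep single-protein targets whose preferred name mentions IMPDH's full name.
mask = (
    (targets_demo["target_type"] == "SINGLE PROTEIN")
    & targets_demo["pref_name"].str.contains(
        "monophosphate dehydrogenase", case=False, regex=False
    )
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the match is empty or ambiguous instead of picking silently.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The same mask could be applied to the real `targets` DataFrame in place of `targets.target_chembl_id[6]`.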

Here, we will retrieve only the bioactivity data for *Cryptosporidium parvum* IMPDH (CHEMBL6145) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,2646090,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '20.88', 'le': '0.42', 'lle': '2.79', ...",CHEMBL554067,,CHEMBL554067,8.05,False,http://www.openphacts.org/units/Nanomolar,816745,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
1,,2646091,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cccc2ncccc12)c1cn(-c2ccc(Cl)cc2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '22.93', 'le': '0.44', 'lle': '3.44', ...",CHEMBL549610,,CHEMBL549610,8.05,False,http://www.openphacts.org/units/Nanomolar,816775,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
2,,2646092,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '21.50', 'le': '0.41', 'lle': '4.04', ...",CHEMBL549612,,CHEMBL549612,7.89,False,http://www.openphacts.org/units/Nanomolar,816800,=,1,True,=,,IC50,nM,,13.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.013
3,,2646093,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(Cl)c(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.30', 'le': '0.39', 'lle': '3.24', ...",CHEMBL563640,,CHEMBL563640,7.75,False,http://www.openphacts.org/units/Nanomolar,816777,=,1,True,=,,IC50,nM,,18.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.018
4,,2646094,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.98', 'le': '0.40', 'lle': '2.44', ...",CHEMBL562828,,CHEMBL562828,7.70,False,http://www.openphacts.org/units/Nanomolar,816717,=,1,True,=,,IC50,nM,,20.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439,,15122617,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](Oc1cccc(Cl)c1Cl)C(=O)Nc1cccc(NC(=O)c2cc...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,,CHEMBL2348631,,CHEMBL2348631,,False,http://www.openphacts.org/units/Nanomolar,2263927,>,1,True,>,,IC50,nM,,5000.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,5000.0
440,,15122618,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](C(=O)Nc1ccc2oc(-c3ccncc3)nc2c1)c1ccccc1,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,,CHEMBL2348634,,CHEMBL2348634,,False,http://www.openphacts.org/units/Nanomolar,2263928,>,1,True,>,,IC50,nM,,5000.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,5000.0
441,,15122619,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,CC(Oc1cccc([N+](=O)[O-])c1Cl)C(=O)Nc1ccc2oc(-c...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '16.66', 'le': '0.32', 'lle': '2.45', ...",CHEMBL3329564,,CHEMBL3329564,7.31,False,http://www.openphacts.org/units/Nanomolar,2263929,=,1,True,=,,IC50,nM,,49.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,49.0
442,,15122620,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](Oc1cccc(Cl)c1Cl)C(=O)Nc1ccc2nc(-c3ccncc...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '21.53', 'le': '0.43', 'lle': '3.62', ...",CHEMBL2348627,,CHEMBL2348627,9.22,False,http://www.openphacts.org/units/Nanomolar,2263930,=,1,True,=,,IC50,nM,,0.6,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,0.6


Finally, we will save the resulting bioactivity data to a CSV file, **IMDH_data_raw.csv**.

In [None]:
df.to_csv('IMDH_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
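# Keep rows that have an IC50 value
df2 = df[df.standard_value.notna()]
# Keep rows that have a SMILES string (the mask is built from df2 so the indexes align)
df2 = df2[df2.canonical_smiles.notna()]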
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,2646090,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '20.88', 'le': '0.42', 'lle': '2.79', ...",CHEMBL554067,,CHEMBL554067,8.05,False,http://www.openphacts.org/units/Nanomolar,816745,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
1,,2646091,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cccc2ncccc12)c1cn(-c2ccc(Cl)cc2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '22.93', 'le': '0.44', 'lle': '3.44', ...",CHEMBL549610,,CHEMBL549610,8.05,False,http://www.openphacts.org/units/Nanomolar,816775,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
2,,2646092,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '21.50', 'le': '0.41', 'lle': '4.04', ...",CHEMBL549612,,CHEMBL549612,7.89,False,http://www.openphacts.org/units/Nanomolar,816800,=,1,True,=,,IC50,nM,,13.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.013
3,,2646093,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(Cl)c(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.30', 'le': '0.39', 'lle': '3.24', ...",CHEMBL563640,,CHEMBL563640,7.75,False,http://www.openphacts.org/units/Nanomolar,816777,=,1,True,=,,IC50,nM,,18.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.018
4,,2646094,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.98', 'le': '0.40', 'lle': '2.44', ...",CHEMBL562828,,CHEMBL562828,7.70,False,http://www.openphacts.org/units/Nanomolar,816717,=,1,True,=,,IC50,nM,,20.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439,,15122617,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](Oc1cccc(Cl)c1Cl)C(=O)Nc1cccc(NC(=O)c2cc...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,,CHEMBL2348631,,CHEMBL2348631,,False,http://www.openphacts.org/units/Nanomolar,2263927,>,1,True,>,,IC50,nM,,5000.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,5000.0
440,,15122618,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](C(=O)Nc1ccc2oc(-c3ccncc3)nc2c1)c1ccccc1,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,,CHEMBL2348634,,CHEMBL2348634,,False,http://www.openphacts.org/units/Nanomolar,2263928,>,1,True,>,,IC50,nM,,5000.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,5000.0
441,,15122619,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,CC(Oc1cccc([N+](=O)[O-])c1Cl)C(=O)Nc1ccc2oc(-c...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '16.66', 'le': '0.32', 'lle': '2.45', ...",CHEMBL3329564,,CHEMBL3329564,7.31,False,http://www.openphacts.org/units/Nanomolar,2263929,=,1,True,=,,IC50,nM,,49.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,49.0
442,,15122620,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C[C@H](Oc1cccc(Cl)c1Cl)C(=O)Nc1ccc2nc(-c3ccncc...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '21.53', 'le': '0.43', 'lle': '3.62', ...",CHEMBL2348627,,CHEMBL2348627,9.22,False,http://www.openphacts.org/units/Nanomolar,2263930,=,1,True,=,,IC50,nM,,0.6,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,0.6
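
The two-step filter above can also be written as a single `dropna` call with a `subset` argument. This is a minimal sketch on a hypothetical four-row table (the IDs `A` to `D` are placeholders, not ChEMBL IDs).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row B lacks a SMILES string, row C lacks an IC50 value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [9.0, 13.0, np.nan, 20.0],
})

# Drop a row if either of the two required columns is missing.
cleaned = demo.dropna(subset=["standard_value", "canonical_smiles"])
print(cleaned)
```

Only rows `A` and `D` remain, because each of the other two rows is missing one of the required fields.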


In [None]:
len(df2.canonical_smiles.unique())

231

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,2646090,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '20.88', 'le': '0.42', 'lle': '2.79', ...",CHEMBL554067,,CHEMBL554067,8.05,False,http://www.openphacts.org/units/Nanomolar,816745,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
1,,2646091,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cccc2ncccc12)c1cn(-c2ccc(Cl)cc2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '22.93', 'le': '0.44', 'lle': '3.44', ...",CHEMBL549610,,CHEMBL549610,8.05,False,http://www.openphacts.org/units/Nanomolar,816775,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.009
2,,2646092,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,C[C@@H](Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '21.50', 'le': '0.41', 'lle': '4.04', ...",CHEMBL549612,,CHEMBL549612,7.89,False,http://www.openphacts.org/units/Nanomolar,816800,=,1,True,=,,IC50,nM,,13.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.013
3,,2646093,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(Cl)c(C...,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.30', 'le': '0.39', 'lle': '3.24', ...",CHEMBL563640,,CHEMBL563640,7.75,False,http://www.openphacts.org/units/Nanomolar,816777,=,1,True,=,,IC50,nM,,18.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.018
4,,2646094,[],CHEMBL1032991,Inhibition of Cryptosporidium parvum recombina...,B,BAO_0000190,BAO_0000019,assay format,CC(Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2)nn1,,,CHEMBL1158492,J. Med. Chem.,2009,"{'bei': '19.98', 'le': '0.40', 'lle': '2.44', ...",CHEMBL562828,,CHEMBL562828,7.70,False,http://www.openphacts.org/units/Nanomolar,816717,=,1,True,=,,IC50,nM,,20.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,uM,UO_0000065,,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
428,,15122606,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,O=C(COc1cccc2ccccc12)Nc1ccc(Br)cc1,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,,CHEMBL3329560,,CHEMBL3329560,,False,http://www.openphacts.org/units/Nanomolar,2263916,>,1,True,>,,IC50,nM,,5000.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,5000.0
432,,15122610,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(Cl)c2)c1,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '17.67', 'le': '0.37', 'lle': '1.47', ...",CHEMBL3329561,,CHEMBL3329561,6.72,False,http://www.openphacts.org/units/Nanomolar,2263920,=,1,True,=,,IC50,nM,,190.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,190.0
433,,15122611,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Br)cc2)c1,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '20.62', 'le': '0.46', 'lle': '3.35', ...",CHEMBL3329562,,CHEMBL3329562,8.05,False,http://www.openphacts.org/units/Nanomolar,2263921,=,1,True,=,,IC50,nM,,9.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,9.0
436,,15122614,[],CHEMBL3362092,Inhibition of Cryptosporidium IMPDH preincubat...,B,BAO_0000190,BAO_0000357,single protein format,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(C(=O)...,,,CHEMBL3351409,ACS Med. Chem. Lett.,2014,"{'bei': '13.73', 'le': '0.27', 'lle': '2.50', ...",CHEMBL3329563,,CHEMBL3329563,6.48,False,http://www.openphacts.org/units/Nanomolar,2263924,=,1,True,=,,IC50,nM,,330.0,CHEMBL6145,Cryptosporidium parvum,"Inosine-5'-monophosphate dehydrogenase, probable",5807,,,IC50,nM,UO_0000065,,330.0
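
Note that `drop_duplicates` keeps the first row it encounters for each SMILES string, so when a compound was measured more than once, only the first reported value survives. The sketch below illustrates this on hypothetical toy data, and also shows one possible alternative (taking the median per structure), which is an assumption for illustration and not what this notebook does.

```python
import pandas as pd

# Hypothetical toy data: "CCO" was measured twice with different IC50 values.
demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "standard_value": [9.0, 50.0, 20.0],
})

# Number of distinct structures.
n_unique = demo["canonical_smiles"].nunique()

# Same approach as the cell above: the first occurrence wins.
first_kept = demo.drop_duplicates(subset=["canonical_smiles"])

# Alternative (assumption): aggregate repeated measurements by their median.
median_agg = demo.groupby(
    "canonical_smiles", as_index=False, sort=False
)["standard_value"].median()

print(n_unique)
print(first_kept)
print(median_agg)
```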


## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL554067,C[C@@H](Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2...,9.0
1,CHEMBL549610,C[C@@H](Oc1cccc2ncccc12)c1cn(-c2ccc(Cl)cc2)nn1,9.0
2,CHEMBL549612,C[C@@H](Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(C...,13.0
3,CHEMBL563640,CC(Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(Cl)c(C...,18.0
4,CHEMBL562828,CC(Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2)nn1,20.0
...,...,...,...
428,CHEMBL3329560,O=C(COc1cccc2ccccc12)Nc1ccc(Br)cc1,5000.0
432,CHEMBL3329561,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(Cl)c2)c1,190.0
433,CHEMBL3329562,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Br)cc2)c1,9.0
436,CHEMBL3329563,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(C(=O)...,330.0


Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('impdh_02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The **standard_value** column holds IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [None]:
df4 = pd.read_csv('impdh_02_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL554067,C[C@@H](Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2...,9.0,active
1,CHEMBL549610,C[C@@H](Oc1cccc2ncccc12)c1cn(-c2ccc(Cl)cc2)nn1,9.0,active
2,CHEMBL549612,C[C@@H](Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(C...,13.0,active
3,CHEMBL563640,CC(Oc1cc[n+]([O-])c2ccccc12)c1cn(-c2ccc(Cl)c(C...,18.0,active
4,CHEMBL562828,CC(Oc1ccnc2ccccc12)c1cn(-c2ccc(Cl)c(Cl)c2)nn1,20.0,active
...,...,...,...,...
226,CHEMBL3329560,O=C(COc1cccc2ccccc12)Nc1ccc(Br)cc1,5000.0,intermediate
227,CHEMBL3329561,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(Cl)c2)c1,190.0,active
228,CHEMBL3329562,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Br)cc2)c1,9.0,active
229,CHEMBL3329563,C/C(=N\O)c1cccc(C(C)(C)NC(=O)Nc2ccc(Cl)c(C(=O)...,330.0,active
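
The labeling loop above can also be expressed without an explicit Python loop. The following is a minimal sketch using `numpy.select` with the same thresholds as the loop (10,000 nM or more is inactive, 1,000 nM or less is active). The helper name `label_bioactivity` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def label_bioactivity(ic50_nm: pd.Series) -> pd.Series:
    """Label IC50 values (in nM) as active, intermediate, or inactive."""
    values = ic50_nm.astype(float)
    # Conditions are checked in order, mirroring the if/elif chain in the loop.
    conditions = [values >= 10000, values <= 1000]
    labels = np.select(conditions, ["inactive", "active"], default="intermediate")
    return pd.Series(labels, index=ic50_nm.index, name="class")

# Hypothetical values, including both boundary cases.
demo_values = pd.Series([9.0, 1000.0, 5000.0, 10000.0, 25000.0])
demo_labels = label_bioactivity(demo_values)
print(demo_labels.tolist())
```

Because the returned Series reuses the input index, it can be attached with `pd.concat([df4, label_bioactivity(df4.standard_value)], axis=1)` in the same way as `bioactivity_class` above.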


Save the DataFrame to a CSV file.

In [None]:
df5.to_csv('impdh_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip acetylcholinesterase.zip *.csv

  adding: IMDH_data_raw.csv (deflated 94%)
  adding: impdh_02_bioactivity_data_preprocessed.csv (deflated 83%)
  adding: impdh_03_bioactivity_data_curated.csv (deflated 85%)


In [None]:
! ls -l

total 324
-rw-r--r-- 1 root root  21941 Jan 10 15:19 acetylcholinesterase.zip
-rw-r--r-- 1 root root 264176 Jan 10 15:17 IMDH_data_raw.csv
-rw-r--r-- 1 root root  15375 Jan 10 15:18 impdh_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  17466 Jan 10 15:19 impdh_03_bioactivity_data_curated.csv
drwxr-xr-x 1 root root   4096 Jan  6 18:10 sample_data


---