# **Computational Drug Discovery [Part 1] Download Bioactivity Data**


---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client



## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Monoamine oxidase A**

In [None]:
# Target search for Monoamine oxidase A
target = new_client.target
target_query = target.search('Monoamine oxidase A')
targets = pd.DataFrame.from_dict(target_query)
targets

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Rattus norvegicus | Monoamine oxidase | 32.0 | False | CHEMBL2095196 | [{'accession': 'P19643', 'component_descriptio... | PROTEIN FAMILY | 10116.0 |
| 1 | [] | Homo sapiens | Monoamine oxidase | 32.0 | False | CHEMBL2095205 | [{'accession': 'P21397', 'component_descriptio... | PROTEIN FAMILY | 9606.0 |
| 2 | [] | Mus musculus | Monoamine oxidase | 32.0 | False | CHEMBL2111442 | [{'accession': 'Q8BW75', 'component_descriptio... | PROTEIN FAMILY | 10090.0 |
| 3 | [] | Bos taurus | Monoamine oxidase | 32.0 | False | CHEMBL2111399 | [{'accession': 'P56560', 'component_descriptio... | PROTEIN FAMILY | 9913.0 |
| 4 | [] | Homo sapiens | Monoamine oxidase A | 31.0 | False | CHEMBL1951 | [{'accession': 'P21397', 'component_descriptio... | SINGLE PROTEIN | 9606.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7459 | [] | Homo sapiens | CDK15/Cyclin Y | 0.0 | False | CHEMBL5483184 | [{'accession': 'Q96Q40', 'component_descriptio... | PROTEIN COMPLEX | 9606.0 |
| 7460 | [] | Homo sapiens | CDK5/Cyclin D3 | 0.0 | False | CHEMBL5483185 | [{'accession': 'Q00535', 'component_descriptio... | PROTEIN COMPLEX | 9606.0 |
| 7461 | [] | Homo sapiens | CDK11A/Cyclin L2 | 0.0 | False | CHEMBL5483187 | [{'accession': 'Q9UQ88', 'component_descriptio... | PROTEIN COMPLEX | 9606.0 |
| 7462 | [] | Homo sapiens | CDK11B/Cyclin L2 | 0.0 | False | CHEMBL5483188 | [{'accession': 'P21127', 'component_descriptio... | PROTEIN COMPLEX | 9606.0 |


### **Select and retrieve bioactivity data for *Monoamine oxidase A* (fifth entry)**

We assign the fifth entry (index 4, which corresponds to the target protein monoamine oxidase A from *Homo sapiens*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[4]
selected_target

'CHEMBL1951'
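
Selecting by position works only as long as the search returns its results in the same order. As a minimal sketch, the same target can be picked by its contents instead. The `targets_demo` frame below is a hand-made stand-in holding three rows copied from the search output above, and the names `targets_demo`, `mask`, and `selected_target_demo` are illustrative only.

```python
import pandas as pd

# Stand-in for the search result: three rows copied from the output above,
# restricted to the columns used for the selection.
targets_demo = pd.DataFrame({
    "organism": ["Rattus norvegicus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Monoamine oxidase", "Monoamine oxidase", "Monoamine oxidase A"],
    "target_type": ["PROTEIN FAMILY", "PROTEIN FAMILY", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL2095196", "CHEMBL2095205", "CHEMBL1951"],
})

# Select by content rather than by row position.
mask = (
    (targets_demo["organism"] == "Homo sapiens")
    & (targets_demo["target_type"] == "SINGLE PROTEIN")
    & (targets_demo["pref_name"] == "Monoamine oxidase A")
)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the filter does not identify exactly one target.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

Applied to the real `targets` DataFrame, the same mask would replace the hard-coded index 4.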

Here, we retrieve only the bioactivity data for *monoamine oxidase A* (CHEMBL1951) that are reported as IC$_{50}$ values in nM (nanomolar) units.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(3)

| | action_type | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 184068 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 5.47 |
| 1 | | | 185442 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 1.36 |
| 2 | | | 189322 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 30.8 |


The lower the IC50 value, the greater the potency of the compound.
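
Comparing potencies only makes sense when every value is in the same unit. In the preview above the `value` column is in uM, whereas the thresholds used later are in nM. The sketch below shows the conversion on the three previewed values. The `TO_NM` lookup table and the `value_nM` column are hypothetical names introduced for this illustration. In the downloaded data the `standard_value` column already carries the converted figure, so this is only a way to check the arithmetic.

```python
import pandas as pd

# Multiplicative factors that convert a concentration into nM.
# Hypothetical helper table for this sketch; extend it as needed.
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}

# The three values shown in the preview, all reported in uM.
demo = pd.DataFrame({
    "value": [5.47, 1.36, 30.8],
    "units": ["uM", "uM", "uM"],
})

# Look up the factor for each row's unit and scale the value.
demo["value_nM"] = demo["value"] * demo["units"].map(TO_NM)
print(demo)
```

A unit that is missing from the lookup table maps to NaN, which makes unexpected units easy to spot with `demo["value_nM"].isna()`.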

In [None]:
df.standard_type.unique()

array(['IC50'], dtype=object)

Finally, we save the resulting bioactivity data to the CSV file **bioactivity_data.csv**.

In [None]:
df.to_csv('bioactivity_data.csv', index=False)

## **Copying files to Google Drive**

First, we mount Google Drive in Colab so that we can access our Drive files from within the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


In [None]:
!ls '/content/gdrive/My Drive/Colab Notebooks'


'Copy of Final copy of RLO_Group4.ipynb'   MAO_CDD-ML-Part-1-bioactivity-data.ipynb
 data					   rna_seq.ipynb


Next, we create a **data** folder in our **Colab Notebooks** folder on Google Drive.

In [None]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data’: File exists


In [None]:
! cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 3454
-rw------- 1 root root 3270887 Feb  9 05:55 bioactivity_data.csv
-rw------- 1 root root  265217 Feb  9 05:57 bioactivity_preprocessed_data.csv
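
When the notebook is run a second time, `mkdir` reports that the folder already exists. A minimal Python sketch of the same two steps (create the folder, then copy the file) that tolerates an existing folder is shown below. The helper name `copy_to_folder` is hypothetical, and the demonstration runs in a temporary directory rather than on a mounted Drive.

```python
import os
import shutil
import tempfile


def copy_to_folder(src_file, dest_dir):
    """Copy src_file into dest_dir, creating dest_dir first if needed."""
    os.makedirs(dest_dir, exist_ok=True)    # no error if the folder exists
    return shutil.copy(src_file, dest_dir)  # returns the path of the copy


# Demonstration in a temporary directory. In Colab, dest_dir would be the
# data folder inside the mounted Drive.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "bioactivity_data.csv")
    with open(src, "w") as fh:
        fh.write("molecule_chembl_id,standard_value\n")

    dest = os.path.join(tmp, "data")
    first = copy_to_folder(src, dest)
    second = copy_to_folder(src, dest)  # repeat call overwrites the copy
    copied_ok = os.path.exists(first) and first == second

print(copied_ok)
```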


Let's see the CSV files that we have so far.

In [None]:
! ls

bioactivity_data.csv  bioactivity_preprocessed_data.csv  gdrive  sample_data


Let's take a glimpse of the **bioactivity_data.csv** file that we've just created.

In [None]:
! head bioactivity_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,184068,[],CHEMBL715193,Compound was tested for inhibition of monoamine oxidase-A (MAO-A).,B,,,BAO_0000190,BAO_0000357,single protein format,C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21,,,CHEMBL1151505,Bioorg Med Chem Lett,1996.0,"{'bei': '12.34', 'le': '0.27', 'lle': '0.67', 'sei': '9.46'}",CHEMBL156630
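
Note that nested fields such as `ligand_efficiency` are written to the CSV as the text form of a Python dictionary. If those values are needed later, one possible way to recover them is `ast.literal_eval`, sketched here on the string shown in the row above. The variable names are illustrative.

```python
import ast

# The ligand_efficiency cell exactly as it appears in the CSV row above.
cell = "{'bei': '12.34', 'le': '0.27', 'lle': '0.67', 'sei': '9.46'}"

# literal_eval parses Python literals only, so it does not execute code.
parsed = ast.literal_eval(cell)

# The individual metrics are stored as strings; convert them to floats.
metrics = {name: float(value) for name, value in parsed.items()}
print(metrics)
```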

## **Handling missing data**
Drop any compound that has a missing value in the **standard_value** or **canonical_smiles** column.

In [None]:
df = pd.read_csv('bioactivity_data.csv')
df2 = df.dropna(subset=['standard_value'])

# Drop rows where 'canonical_smiles' is NaN
df2 = df2.dropna(subset=['canonical_smiles'])

len(df2)

4222
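
The two `dropna` calls above can also be written as a single call that lists both columns. The sketch below uses a small made-up frame (the IDs `A` to `D` are placeholders, not ChEMBL records) and first counts the missing entries per column so that the effect of the filter is visible.

```python
import numpy as np
import pandas as pd

# Toy data: row B lacks a SMILES string, row C lacks a standard_value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [120.0, 50.0, np.nan, 9000.0],
})

required = ["standard_value", "canonical_smiles"]

# How many values are missing in each required column?
print(demo[required].isna().sum())

# Drop a row if ANY of the required columns is missing (the default how='any').
clean = demo.dropna(subset=required)
print(len(demo), "->", len(clean))
```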

In [None]:
len(df2.canonical_smiles.unique())

3562

If the number of unique values is smaller than the number of rows, the data contain duplicate structures. Here, there are 3562 unique SMILES strings among 4222 rows, so we drop the duplicates and keep one row per structure.

In [None]:
df2 = df2.drop_duplicates(['canonical_smiles'])
df2 = df2.reset_index(drop=True)
df2


| | action_type | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 184068 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 5.470 |
| 1 | | | 185442 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 1.360 |
| 2 | | | 189322 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 30.800 |
| 3 | | | 190494 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 22.500 |
| 4 | | | 191747 | [] | CHEMBL715193 | Compound was tested for inhibition of monoamin... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 0.180 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3557 | {'action_type': 'INHIBITOR', 'description': 'N... | | 25658317 | [{'comments': None, 'relation': '=', 'result_f... | CHEMBL5372889 | Inhibition of human recombinant MAO-A expresse... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 1.500 |
| 3558 | {'action_type': 'INHIBITOR', 'description': 'N... | | 25658318 | [{'comments': None, 'relation': '=', 'result_f... | CHEMBL5372889 | Inhibition of human recombinant MAO-A expresse... | B | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 1.180 |
| 3559 | {'action_type': 'INHIBITOR', 'description': 'N... | | 25731050 | [] | CHEMBL5393270 | Inhibition of recombinant MAO-A (unknown origi... | T | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 37.756 |
| 3560 | {'action_type': 'INHIBITOR', 'description': 'N... | | 25731051 | [] | CHEMBL5393270 | Inhibition of recombinant MAO-A (unknown origi... | T | | | BAO_0000190 | ... | Homo sapiens | Monoamine oxidase A | 9606 | | | IC50 | uM | UO_0000065 | | 25.431 |
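
The duplicate check and removal can be traced on a toy frame. One point worth keeping in mind is that `drop_duplicates` keeps the first occurrence by default, so for a structure measured several times the retained `standard_value` is simply whichever row came first. The data below are made up for illustration.

```python
import pandas as pd

# Toy data: ethanol (CCO) and ethylamine (CCN) each appear twice.
demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1", "CCN", "CCN"],
    "standard_value": [100.0, 250.0, 5000.0, 30.0, 45.0],
})

# Fewer unique SMILES than rows means that duplicates are present.
n_rows = len(demo)
n_unique = demo["canonical_smiles"].nunique()
print(n_rows, n_unique)

# Which structures were measured more than once?
counts = demo["canonical_smiles"].value_counts()
print(counts[counts > 1])

# keep='first' is the pandas default: the first row per structure survives.
deduped = (
    demo.drop_duplicates(subset=["canonical_smiles"], keep="first")
        .reset_index(drop=True)
)
print(deduped)
```

If the replicate measurements matter, aggregating them (for example with a median per structure) would be an alternative to keeping the first row. That is a design choice and not what the notebook does.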


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

df_value_bioactivity = pd.DataFrame({
    'standard_value': df2.standard_value,
    'bioactivity_class': bioactivity_class
})
print(df_value_bioactivity)

      standard_value bioactivity_class
0             5470.0      intermediate
1             1360.0      intermediate
2            30800.0          inactive
3            22500.0          inactive
4              180.0            active
...              ...               ...
3557          1500.0      intermediate
3558          1180.0      intermediate
3559         37756.0          inactive
3560         25431.0          inactive
3561         30000.0          inactive

[3562 rows x 2 columns]
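
The loop above can also be written without explicit iteration. The following is a minimal sketch using `numpy.select` with the same thresholds and the same inclusive boundaries. The function name `label_bioactivity` and its parameters are hypothetical, and the first five demo values are the first five `standard_value` entries printed above.

```python
import numpy as np
import pandas as pd


def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (in nM) as active, intermediate, or inactive."""
    s = pd.Series(values)
    v = pd.to_numeric(s).to_numpy(dtype=float)
    # Conditions are checked in order; anything matching neither condition
    # falls through to the default label.
    labels = np.select(
        [v >= inactive_min, v <= active_max],
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=s.index, name="bioactivity_class")


# First five values from the output above, plus the two boundary values.
demo_values = [5470.0, 1360.0, 30800.0, 22500.0, 180.0, 1000.0, 10000.0]
labels = label_bioactivity(demo_values)
print(labels.tolist())
```

With this helper, the class column could be produced in one step as `label_bioactivity(df2.standard_value)`.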


### **Combine molecule_chembl_id, canonical_smiles, standard_value, and bioactivity_class into a new DataFrame**

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
# Take an explicit copy so that adding a column later does not write to a view of df2
df3 = df2[selection].copy()
df3

| | molecule_chembl_id | canonical_smiles | standard_value |
|---|---|---|---|
| 0 | CHEMBL156630 | `C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21` | 5470.0 |
| 1 | CHEMBL155754 | `C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NC)c(Cl)c21` | 1360.0 |
| 2 | CHEMBL348083 | `CC(C)/N=C1/CCc2c1n(C)c1ccccc21` | 30800.0 |
| 3 | CHEMBL157182 | `C/N=C1/CCc2c1n(C)c1cc(Cl)c(OC(=O)NC)cc21` | 22500.0 |
| 4 | CHEMBL160347 | `COc1cc(Br)c2oc(C3CCNCC3)cc2c1` | 180.0 |
| ... | ... | ... | ... |
| 3557 | CHEMBL5398630 | `O=C(O)/C=C/C(=O)O.[2H]C([2H])(Cc1c[nH]c2ccccc1...` | 1500.0 |
| 3558 | CHEMBL3183055 | `CN(C)CCc1c[nH]c2ccccc12.O=C(O)/C=C/C(=O)O` | 1180.0 |
| 3559 | CHEMBL91 | `Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1` | 37756.0 |
| 3560 | CHEMBL104 | `Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1` | 25431.0 |


In [None]:
print(len(bioactivity_class))
print(len(df3))
df3.loc[:, 'bioactivity_class'] = bioactivity_class
df3 = df3.reset_index(drop=True)
df3


3562
3562


| | molecule_chembl_id | canonical_smiles | standard_value | bioactivity_class |
|---|---|---|---|---|
| 0 | CHEMBL156630 | `C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21` | 5470.0 | intermediate |
| 1 | CHEMBL155754 | `C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NC)c(Cl)c21` | 1360.0 | intermediate |
| 2 | CHEMBL348083 | `CC(C)/N=C1/CCc2c1n(C)c1ccccc21` | 30800.0 | inactive |
| 3 | CHEMBL157182 | `C/N=C1/CCc2c1n(C)c1cc(Cl)c(OC(=O)NC)cc21` | 22500.0 | inactive |
| 4 | CHEMBL160347 | `COc1cc(Br)c2oc(C3CCNCC3)cc2c1` | 180.0 | active |
| ... | ... | ... | ... | ... |
| 3557 | CHEMBL5398630 | `O=C(O)/C=C/C(=O)O.[2H]C([2H])(Cc1c[nH]c2ccccc1...` | 1500.0 | intermediate |
| 3558 | CHEMBL3183055 | `CN(C)CCc1c[nH]c2ccccc12.O=C(O)/C=C/C(=O)O` | 1180.0 | intermediate |
| 3559 | CHEMBL91 | `Clc1ccc(COC(Cn2ccnc2)c2ccc(Cl)cc2Cl)c(Cl)c1` | 37756.0 | inactive |
| 3560 | CHEMBL104 | `Clc1ccccc1C(c1ccccc1)(c1ccccc1)n1ccnc1` | 25431.0 | inactive |


Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)
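
Passing `index=False` matters when the file is read back later. Without it, pandas writes the row index as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows both cases on a two-row frame built from the first and fifth rows of the table above. It writes to in-memory buffers, not to disk, and the variable names are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the preprocessed table above.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL156630", "CHEMBL160347"],
    "canonical_smiles": [
        "C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21",
        "COc1cc(Br)c2oc(C3CCNCC3)cc2c1",
    ],
    "standard_value": [5470.0, 180.0],
    "bioactivity_class": ["intermediate", "active"],
})

# Case 1: index=False, the columns survive the round trip unchanged.
buf = io.StringIO()
demo.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)

# Case 2: default index=True, the index comes back as an extra column.
buf_idx = io.StringIO()
demo.to_csv(buf_idx)
buf_idx.seek(0)
with_index = pd.read_csv(buf_idx)

print(list(restored.columns))
print(list(with_index.columns))
```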

In [None]:
! ls -l

total 3464
-rw-r--r-- 1 root root 3271526 Feb  9 06:12 bioactivity_data.csv
-rw-r--r-- 1 root root  265217 Feb  9 06:27 bioactivity_preprocessed_data.csv
drwx------ 7 root root    4096 Feb  9 06:12 gdrive
drwxr-xr-x 1 root root    4096 Feb  6 14:19 sample_data


Let's copy the preprocessed file to Google Drive.

In [None]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv  bioactivity_preprocessed_data.csv


---