# **Computational Drug Discovery [Part 1] Downloading Bioactivity Data from ChEMBL**

This notebook is the first step toward building a machine learning model from ChEMBL bioactivity data.
Based on coursework by data science professor Chanin Nantasenamat.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that "brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs."

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [0]:
pip install chembl_webresource_client

Note: you may need to restart the kernel to use updated packages.


## **Importing libraries**

In [0]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for BCR::ABL1 gene fusion**

The BCR::ABL1 fusion is a genetic abnormality resulting from a translocation between chromosomes 9 and 22, forming the Philadelphia chromosome. This fusion gene encodes a constitutively active tyrosine kinase that drives uncontrolled cell proliferation and survival, contributing to pediatric B-cell acute lymphoblastic leukemia (B-ALL). It is associated with a more aggressive disease course but is treatable with targeted therapies like tyrosine kinase inhibitors (TKIs). Early detection and combining TKIs with chemotherapy have significantly improved outcomes in affected children. 




In [0]:
# Target search for the BCR::ABL1 fusion
target = new_client.target
target_query = target.search('BCR::ABL1')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,BCR/ABL1,30.0,False,CHEMBL4523652,"[{'accession': 'P00520', 'component_descriptio...",CHIMERIC PROTEIN,10090
1,[],Homo sapiens,VHL/BCR-ABL1,26.0,False,CHEMBL4523751,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
2,[],Mus musculus,Cereblon-BCR-ABL,24.0,False,CHEMBL5483195,"[{'accession': 'P00520', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,10090
3,[],Homo sapiens,Bcr/Abl fusion protein,23.0,False,CHEMBL2096618,"[{'accession': 'P00519', 'component_descriptio...",CHIMERIC PROTEIN,9606
4,[],Homo sapiens,Cereblon/BCR/ABL,22.0,False,CHEMBL4296137,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
5,[],Homo sapiens,BCR/ABL p210 fusion protein,19.0,False,CHEMBL6105,"[{'accession': 'A1Z199', 'component_descriptio...",CHIMERIC PROTEIN,9606
6,[],Homo sapiens,Tyrosine-protein kinase ABL,15.0,False,CHEMBL1862,"[{'accession': 'P00519', 'component_descriptio...",SINGLE PROTEIN,9606
7,[],Mus musculus,Tyrosine-protein kinase ABL,15.0,False,CHEMBL3099,"[{'accession': 'P00520', 'component_descriptio...",SINGLE PROTEIN,10090
8,[],Homo sapiens,Baculoviral IAP repeat-containing protein 2/BC...,15.0,False,CHEMBL4296119,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
9,[],Homo sapiens,E3 ubiquitin-protein ligase XIAP/BCR/ABL,15.0,False,CHEMBL4296120,"[{'accession': 'P00519', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


### **We proceed with the bioactivity data for *"Bcr/Abl fusion protein"* with the ID CHEMBL2096618**

We assign the fourth entry (index 3), which corresponds to the target protein *"Bcr/Abl fusion protein"*, to the ***selected_target*** variable.

This makes the BCR::ABL1 fusion protein the target for all downstream steps.

In [0]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL2096618'
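
Selecting by position depends on the order of the search results, which may not stay the same between queries or database releases. A minimal sketch of selecting by name and organism instead is shown below. It uses a small hand-built stand-in for the `targets` dataframe (three rows copied from the table above), and the helper `pick_target` is hypothetical, not part of the ChEMBL client.

```python
import pandas as pd

# Stand-in for the search result; rows copied from the table above.
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["BCR/ABL1", "Bcr/Abl fusion protein", "Tyrosine-protein kinase ABL"],
    "target_chembl_id": ["CHEMBL4523652", "CHEMBL2096618", "CHEMBL1862"],
    "target_type": ["CHIMERIC PROTEIN", "CHIMERIC PROTEIN", "SINGLE PROTEIN"],
})

def pick_target(targets, pref_name, organism="Homo sapiens"):
    """Return the ChEMBL ID of the single row matching name and organism."""
    match = targets[(targets["pref_name"] == pref_name) & (targets["organism"] == organism)]
    if len(match) != 1:
        # Fail loudly instead of silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(match)}")
    return match["target_chembl_id"].iloc[0]

selected_target_demo = pick_target(targets_demo, "Bcr/Abl fusion protein")
print(selected_target_demo)
```

With the real search result, the same call would take `targets` in place of `targets_demo`.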

Here, we will retrieve only bioactivity data for *"Bcr/Abl fusion protein"* that are reported as IC50 values in nM (nanomolar).

In [0]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(type="IC50").filter(units="nM")

Put the info into a dataframe named `df`

In [0]:
df = pd.DataFrame.from_dict(res)

Display the dataframe we just created.

In [0]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1767602,[],CHEMBL913216,Inhibition of Bcr-Abl fusion protein,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,20.0
1,,,1767603,[],CHEMBL913216,Inhibition of Bcr-Abl fusion protein,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,2.0
2,,,1778588,[],CHEMBL910557,Inhibition of Bcr-Abl in presence of ATP,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,3.0
3,,,1778589,[],CHEMBL910557,Inhibition of Bcr-Abl in presence of ATP,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
4,,,1841809,[],CHEMBL914494,Inhibition of Bcr-Abl kinase,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668989,[],CHEMBL5375724,Inhibition of human BCR-ABL1 T315I mutant by r...,B,P00519,T315I,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,44.0
324,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668990,[],CHEMBL5375724,Inhibition of human BCR-ABL1 T315I mutant by r...,B,P00519,T315I,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,47.0
325,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668991,[],CHEMBL5375725,Inhibition of human BCR-ABL1 Y253F mutant by r...,B,P00519,Y253F,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
326,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668992,[],CHEMBL5375725,Inhibition of human BCR-ABL1 Y253F mutant by r...,B,P00519,Y253F,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0


The lower the IC50 value (in nM), the more potent the compound: a lower concentration is needed to inhibit 50% of the target protein's activity.

In [0]:
df.standard_type.unique()

array(['IC50'], dtype=object)

Next, we save the resulting bioactivity data to a CSV file, **bioactivity_data.csv**. We pass `index=False` so that the dataframe index is not written to the file.

In [0]:
df.to_csv('bioactivity_data.csv', index=False)
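
To illustrate what `index=False` changes, here is a small sketch that writes a toy dataframe to an in-memory buffer both ways and reads it back. The two rows are placeholders, not real ChEMBL records. Without `index=False`, the index comes back as an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Toy frame with placeholder values (not real ChEMBL records).
demo = pd.DataFrame({"molecule_chembl_id": ["A", "B"], "standard_value": [20.0, 2.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```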

We store the new bioactivity_data.csv in our "data" folder.

In [0]:
! mkdir -p data

In [0]:
! cp bioactivity_data.csv data/


View the first lines of the **bioactivity_data.csv** file in the data folder.

In [0]:
! head /home/drug_discovery/Drug-Discovery-with-Python-and-Machine-Learning/data/bioactivity_data.csv
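
The folder-and-copy step can also be done in Python, which avoids depending on the shell and on whether the folder already exists. The sketch below is one possible approach, not the notebook's own method. The helper name `copy_into_folder` is hypothetical, and a temporary directory with a one-line stand-in file is used so the example does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path

def copy_into_folder(src, folder):
    """Create `folder` if needed and copy `src` into it; return the new path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return Path(shutil.copy(src, folder))

# Demo in a throwaway directory with a stand-in CSV file.
workdir = Path(tempfile.mkdtemp())
src = workdir / "bioactivity_data.csv"
src.write_text("molecule_chembl_id,standard_value\n")

copied = copy_into_folder(src, workdir / "data")
print(copied.name)
```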

## **Drop missing values...if any**
If any compound has a missing value in the **standard_value** column, drop it.

In [0]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1767602,[],CHEMBL913216,Inhibition of Bcr-Abl fusion protein,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,20.0
1,,,1767603,[],CHEMBL913216,Inhibition of Bcr-Abl fusion protein,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,2.0
2,,,1778588,[],CHEMBL910557,Inhibition of Bcr-Abl in presence of ATP,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,3.0
3,,,1778589,[],CHEMBL910557,Inhibition of Bcr-Abl in presence of ATP,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
4,,,1841809,[],CHEMBL914494,Inhibition of Bcr-Abl kinase,B,,,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668989,[],CHEMBL5375724,Inhibition of human BCR-ABL1 T315I mutant by r...,B,P00519,T315I,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,44.0
324,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668990,[],CHEMBL5375724,Inhibition of human BCR-ABL1 T315I mutant by r...,B,P00519,T315I,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,47.0
325,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668991,[],CHEMBL5375725,Inhibition of human BCR-ABL1 Y253F mutant by r...,B,P00519,Y253F,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
326,"{'action_type': 'INHIBITOR', 'description': 'N...",,25668992,[],CHEMBL5375725,Inhibition of human BCR-ABL1 Y253F mutant by r...,B,P00519,Y253F,BAO_0000190,...,Homo sapiens,Bcr/Abl fusion protein,9606,,,IC50,nM,UO_0000065,,1.0
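
A slightly more defensive variant is sketched below. It assumes the values may arrive as strings (the `float(i)` call in the labeling step suggests this) and that some entries may not be parseable as numbers. The toy rows and IDs are placeholders, not real ChEMBL records.

```python
import pandas as pd

# Toy frame with placeholder IDs; one missing and one non-numeric value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M2", "M3", "M4"],
    "standard_value": ["20.0", None, "not reported", "1.0"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
demo["standard_value"] = pd.to_numeric(demo["standard_value"], errors="coerce")

# Drop rows without a usable value and renumber the index from 0.
clean = demo.dropna(subset=["standard_value"]).reset_index(drop=True)
print(clean)
```

Resetting the index keeps later positional operations, such as zipping with separately built lists, aligned with the remaining rows.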


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. The thresholds you pick will depend on the target, the disease, and other factors.

**Generally**, compounds with values below 1,000 nM are considered **active**, those above 10,000 nM are considered **inactive**, and those between 1,000 and 10,000 nM are referred to as **intermediate**.

**For the BCR::ABL1** example, we will be more conservative and use the IC50 values of imatinib as a starting point. Reaching IC50 values in this range or lower is critical to outperform existing therapies and to provide efficacy against potential resistance mutations such as T315I. Thus, compounds with IC50 >= 1,000 nM are labeled inactive, those with IC50 <= 100 nM are labeled active, and those between 100 and 1,000 nM are labeled intermediate.

In [0]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 1000:
    bioactivity_class.append("inactive")
  elif float(i) <= 100:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
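
The same thresholds can be applied without an explicit loop. The sketch below uses `numpy.select` with the cutoffs from the loop above. The function name `label_bioactivity` and its parameter names are hypothetical, and the input values are illustrative.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values, active_max=100.0, inactive_min=1000.0):
    """Label IC50 values (nM) as active, intermediate, or inactive."""
    v = pd.to_numeric(pd.Series(values), errors="raise").astype(float).to_numpy()
    labels = np.select(
        [v >= inactive_min, v <= active_max],  # checked in this order
        ["inactive", "active"],
        default="intermediate",
    )
    return labels.tolist()

labels_demo = label_bioactivity(["20.0", 100, 500, 1000, 13500.0])
print(labels_demo)
```

Keeping the cutoffs as parameters makes it easy to switch back to the general 1,000 / 10,000 nM convention for other targets.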

### **Iterate the *molecule_chembl_id* to a list**

In [None]:
df2.molecule_chembl_id

0       CHEMBL288441
1       CHEMBL386051
2       CHEMBL364623
3      CHEMBL5416410
4      CHEMBL5416410
           ...      
323     CHEMBL288441
324    CHEMBL5435819
325     CHEMBL288441
326    CHEMBL5435819
327    CHEMBL1852688
Name: molecule_chembl_id, Length: 328, dtype: object

In [0]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

### **Iterate *canonical_smiles* to a list**

In [0]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

### **Iterate *standard_value* to a list**

In [0]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)
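
The three loops above each copy one column into a list. As a shorter equivalent, a pandas column can be converted directly with `.tolist()`. The sketch below uses a two-row toy frame; the pairing of IDs and values is illustrative only.

```python
import pandas as pd

# Toy frame; the ID/value pairing is illustrative, not real ChEMBL data.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL288441", "CHEMBL386051"],
    "standard_value": ["20.0", "2.0"],
})

# One call per column replaces the explicit for-loop.
mol_cid_demo = demo["molecule_chembl_id"].tolist()
standard_value_demo = demo["standard_value"].tolist()
print(mol_cid_demo, standard_value_demo)
```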

### **Combine the 4 lists into a dataframe**

In [0]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

In [0]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,intermediate,7200.0
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,intermediate,9400.0
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,inactive,13500.0
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,inactive,13110.0
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],intermediate,2000.0
...,...,...,...,...
128,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,inactive,10600.0
129,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,inactive,10100.0
130,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,inactive,11500.0
131,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,inactive,10700.0


### **Alternative method**

In [0]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,7200.0
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,9400.0
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,13500.0
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,13110.0
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],2000.0
...,...,...,...
128,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,10600.0
129,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,10100.0
130,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,11500.0
131,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,10700.0


In [0]:
pd.concat([df3,pd.Series(bioactivity_class)], axis=1)

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value,0
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,intermediate,7200.0,intermediate
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,intermediate,9400.0,intermediate
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,inactive,13500.0,inactive
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,inactive,13110.0,inactive
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],intermediate,2000.0,intermediate
...,...,...,...,...,...
128,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,inactive,10600.0,inactive
129,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,inactive,10100.0,inactive
130,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,inactive,11500.0,inactive
131,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,inactive,10700.0,inactive
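
One caveat with `pd.concat(..., axis=1)` is that it aligns rows by index label, not by position. If rows were dropped earlier, the filtered dataframe's index has gaps, while a freshly built `pd.Series` is numbered from 0, so the two no longer line up. The sketch below shows the effect on a hypothetical three-row frame with placeholder IDs, and one way around it, assuming the label list is in the same row order as the dataframe.

```python
import pandas as pd

# Hypothetical filtered frame: row label 1 was dropped, so the index is 0, 2, 3.
df_demo = pd.DataFrame(
    {"molecule_chembl_id": ["M1", "M3", "M4"], "standard_value": [20.0, 500.0, 13500.0]},
    index=[0, 2, 3],
)
labels = ["active", "intermediate", "inactive"]

# Naive concat aligns on index labels, producing extra rows and NaN values.
naive = pd.concat([df_demo, pd.Series(labels)], axis=1)

# Resetting the index first and assigning the list keeps rows aligned by position.
aligned = df_demo.reset_index(drop=True).assign(bioactivity_class=labels)
print(aligned)
```

Note also that the result has to be assigned back to a variable, otherwise a later `to_csv` call writes the frame without the new column.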


Save the dataframe to a CSV file.

In [0]:
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [0]:
! ls -l

total 92
-rw-r--r-- 1 root root 70010 Apr 29 17:07 bioactivity_data.csv
-rw-r--r-- 1 root root  9326 Apr 29 17:24 bioactivity_preprocessed_data.csv
drwx------ 4 root root  4096 Apr 29 17:08 gdrive
drwxr-xr-x 1 root root  4096 Apr  3 16:24 sample_data


Let's copy the file to Google Drive.

In [0]:
! cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [0]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv  bioactivity_preprocessed_data.csv


---