# **Computational Drug Discovery [Part 1] Downloading Bioactivity Data from ChEMBL**

This notebook is the first step toward building a machine learning model from ChEMBL bioactivity data.
Based on coursework from data science professor Chanin Nantasenamat.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a database that "brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs."

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [0]:
pip install chembl_webresource_client

## **Importing libraries**

In [0]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for BCR::ABL1 gene fusion**

The BCR::ABL1 fusion is a genetic abnormality resulting from a translocation between chromosomes 9 and 22, forming the Philadelphia chromosome. This fusion gene encodes a constitutively active tyrosine kinase that drives uncontrolled cell proliferation and survival, contributing to pediatric B-cell acute lymphoblastic leukemia (B-ALL). It is associated with a more aggressive disease course but is treatable with targeted therapies like tyrosine kinase inhibitors (TKIs). Early detection and combining TKIs with chemotherapy have significantly improved outcomes in affected children. 




In [0]:
# Target search for the BCR::ABL1 fusion
target = new_client.target
target_query = target.search('BCR::ABL1')
targets = pd.DataFrame.from_dict(target_query)
targets

### **We proceed with the bioactivity data for *"Bcr/Abl fusion protein"* with the ID CHEMBL2096618**

We assign the fourth entry, which corresponds to the target protein *"Bcr/Abl fusion protein"*, to the ***selected_target*** variable.

This makes the BCR::ABL1 fusion protein the target for all downstream steps. Because indexing is zero-based, the fourth entry has index 3.

In [0]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL2096618'
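Selecting by position works as long as the search results keep the same order. As a hedged alternative, the sketch below picks the row by name instead. The toy `targets` table and its placeholder rows are hypothetical, and the `pref_name` column is assumed to hold the target name shown in the search results.

```python
import pandas as pd

# Hypothetical stand-in for the search results; the real table has more
# columns and rows, and its row order is not guaranteed to stay the same.
targets = pd.DataFrame({
    "pref_name": ["Placeholder target A", "Placeholder target B",
                  "Placeholder target C", "Bcr/Abl fusion protein"],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C", "CHEMBL2096618"],
})

# Select by name rather than by position.
matches = targets.loc[targets["pref_name"] == "Bcr/Abl fusion protein",
                      "target_chembl_id"]
if matches.empty:
    raise ValueError("Target name not found in search results")

selected_target = matches.iloc[0]
print(selected_target)
```

Failing loudly when the name is missing avoids silently continuing with the wrong target.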

Here, we will retrieve only bioactivity data for *"Bcr/Abl fusion protein"* that are reported as IC50 values in nM (nanomolar).

In [0]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(type="IC50").filter(units="nM")

Put the results into a DataFrame named `df`.

In [0]:
df = pd.DataFrame.from_dict(res)

Display the DataFrame we just made.

In [0]:
df

The lower the IC50 value (here in nM), the more potent the drug: a lower concentration is needed to inhibit the target's activity by 50%.
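To make the relationship concrete, here is a minimal sketch that assumes a simple one-site inhibition model with a Hill slope of 1. That model, the function name `fraction_inhibited`, and the example concentrations are illustrative assumptions, not something taken from the ChEMBL data.

```python
def fraction_inhibited(concentration_nm: float, ic50_nm: float) -> float:
    """Fraction of target activity inhibited, assuming a Hill slope of 1."""
    return concentration_nm / (concentration_nm + ic50_nm)


# Compare three hypothetical compounds at the same 100 nM dose.
dose_nm = 100.0
for ic50_nm in (10.0, 100.0, 1000.0):
    inhibited = fraction_inhibited(dose_nm, ic50_nm)
    print(f"IC50 = {ic50_nm:>6.0f} nM -> {inhibited:.0%} inhibited at {dose_nm:.0f} nM")
```

Under this assumption, the compound with the lowest IC50 inhibits the most at a given dose, and a compound dosed exactly at its IC50 inhibits 50%.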

In [0]:
df.standard_type.unique()

array(['IC50'], dtype=object)

Next, we save the resulting bioactivity data to a CSV file, **bioactivity_data.csv**. We use `index=False` because we don't want the DataFrame index written to the CSV file.

In [0]:
! mkdir "/data"
%cd "/path/to/data"

mkdir: cannot create directory '/data': File exists
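The "File exists" message appears whenever the cell is re-run. A minimal sketch of an alternative that stays quiet on re-runs uses `pathlib`. The relative `data` directory here is a hypothetical location chosen for illustration, so substitute your own path.

```python
from pathlib import Path

# Hypothetical output location; substitute the directory you actually use.
data_dir = Path("data")

# exist_ok=True makes this safe to re-run; parents=True creates missing parents.
data_dir.mkdir(parents=True, exist_ok=True)

# Build the full output path once and reuse it when saving.
csv_path = data_dir / "bioactivity_data.csv"
print(csv_path)
```

With this approach, `df.to_csv(csv_path, index=False)` writes to the intended folder without changing the working directory.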


In [0]:
df.to_csv('bioactivity_data.csv', index=False)

View the first lines of the **bioactivity_data.csv** file in /data.

In [0]:
! head bioactivity_data.csv

## **Drop missing values...if any**
If any compound has a missing value in the **standard_value** column, drop it.

In [0]:
df2 = df[df.standard_value.notna()]
df2
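The same idea can be sketched with `dropna` on a small, hypothetical table. The toy rows are invented for illustration. Also checking `canonical_smiles` and converting the values to numbers are assumptions about what later steps may need (the labeling loop further down calls `float()` on each value, which suggests the values may arrive as strings).

```python
import pandas as pd

# Hypothetical toy rows standing in for the downloaded bioactivity data.
df = pd.DataFrame({
    "molecule_chembl_id": ["MOL_1", "MOL_2", "MOL_3"],
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
    "standard_value": ["250.0", "12.0", None],
})

# Keep only rows where both columns are present.
df2 = df.dropna(subset=["standard_value", "canonical_smiles"])

# Convert the string values to numbers for later comparisons.
df2 = df2.assign(standard_value=pd.to_numeric(df2["standard_value"]))
print(df2)
```

Only the row with both a structure and a measured value survives in this toy case.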

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. The cutoffs you pick will depend on the target, the disease, and other factors.

**Generally**, compounds with values below 1,000 nM are considered **active**, those above 10,000 nM are considered **inactive**, and those between 1,000 and 10,000 nM are referred to as **intermediate**.

**For the BCR::ABL1** example we will be more conservative and use IC50 values for Imatinib as a starting point. IC50 values in this range or lower are critical for outperforming existing therapies and providing efficacy against potential resistance mutations, such as T315I. We want our drug discovery efforts to be on par with or better than existing standard-of-care therapies like Imatinib. Thus compounds with IC50 >= 1000 nM will be labeled inactive, those with IC50 <= 100 nM active, and those between 100 and 1000 nM intermediate.

#### **Create a list of bioactivity_class with your selected parameters**

In [0]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 1000:
    bioactivity_class.append("inactive")
  elif float(i) <= 100:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
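Because the cutoffs depend on the target, it can help to keep them as parameters. The sketch below wraps the same comparisons in a function. The name `classify_ic50` and its parameter names are hypothetical, and the sample values are made up for illustration.

```python
def classify_ic50(value_nm, active_max=100.0, inactive_min=1000.0):
    """Label an IC50 value (in nM) using two cutoffs."""
    value = float(value_nm)
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"


# Stricter BCR::ABL1 cutoffs from this section (the defaults).
strict = [classify_ic50(v) for v in ["50", "100", "500", "1000", "20000"]]
print(strict)

# General-purpose cutoffs mentioned above (1,000 nM and 10,000 nM).
general = [classify_ic50(v, active_max=1000, inactive_min=10000)
           for v in ["50", "5000", "20000"]]
print(general)
```

Note that the boundaries are inclusive here, matching the loop above. A value of exactly 1,000 nM would count as active under the general cutoffs, so adjust the comparisons if strict inequalities are wanted.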

#### **Copy the *molecule_chembl_id* column into a list**

In [0]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

#### **Copy the *canonical_smiles* column into a list**

In [0]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

#### **Copy the *standard_value* column into a list**

In [0]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)

#### **Combine the 4 lists into a dataframe called df3**

In [0]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

#### **Alternative method**

In [0]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

In [0]:
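# Reset the index so rows align by position, name the new column, and keep the result
df3 = pd.concat([df3.reset_index(drop=True), pd.Series(bioactivity_class, name='bioactivity_class')], axis=1)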
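One point worth checking with `pd.concat(..., axis=1)` is that it aligns rows by index label, not by position. After rows have been dropped, the filtered frame can have gaps in its index while a freshly built Series is numbered 0, 1, 2, and so on. The sketch below uses hypothetical toy data to show the difference.

```python
import pandas as pd

# Hypothetical filtered frame: index label 1 is missing, as it would be
# after dropping a row with a missing standard_value.
df3 = pd.DataFrame(
    {"molecule_chembl_id": ["MOL_1", "MOL_3"], "standard_value": ["50", "5000"]},
    index=[0, 2],
)
bioactivity_class = ["active", "inactive"]
labels = pd.Series(bioactivity_class, name="bioactivity_class")

# Concatenating directly aligns on index labels 0, 1, 2 and introduces NaN.
misaligned = pd.concat([df3, labels], axis=1)

# Resetting the index first lines the rows up by position.
aligned = pd.concat([df3.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result the second label ends up on a row with no compound, which is why resetting the index before concatenating is the safer choice.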

Save df3 to a CSV file in /data.

In [0]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

---