# **Computational Drug Discovery [Part 1]: Downloading Bioactivity Data**


Hamza Ahmed

---

### **Install Libraries**

In [None]:

# Install the ChEMBL web resource client
! pip install chembl_webresource_client

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

### **Search for Target Proteins**

In [None]:
""" Searching for coronavirus """

# Initialise the target resource from the new_client object
target = new_client.target
# Search for "coronavirus"; this returns a list of dictionaries, each describing one matching target
target_query_coronavirus = target.search("coronavirus")
# Convert the dictionaries into a pandas data frame
targets_coronavirus = pd.DataFrame.from_dict(target_query_coronavirus)
# Display the data frame
targets_coronavirus

### **Select and retrieve bioactivity data for *SARS coronavirus 3C-like proteinase* (entry at index 6)**

The 3C-like protease is an attractive target for antiviral intervention because of its essential role in processing the polyproteins translated from viral RNA. Its structure is conserved across coronavirus variants.

We will assign the entry at index 6 (which corresponds to the target protein *coronavirus 3C-like proteinase*) to the `selected_target` variable.

In [None]:
selected_target = targets_coronavirus.target_chembl_id[6]
selected_target
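
Selecting by position depends on the order in which the search returns its results, and that order may change. A minimal sketch of selecting by name instead, assuming the results include a `pref_name` column. The small data frame is a hypothetical stand-in for `targets_coronavirus`, and `CHEMBL_DEMO_1` is a placeholder, not a real identifier:

```python
import pandas as pd

# Hypothetical stand-in for targets_coronavirus (two columns only)
targets_demo = pd.DataFrame({
    "pref_name": ["Coronavirus", "SARS coronavirus 3C-like proteinase"],
    "target_chembl_id": ["CHEMBL_DEMO_1", "CHEMBL3927"],
})

# Keep rows whose name mentions the protein of interest
mask = targets_demo["pref_name"].str.contains("3C-like proteinase", case=False, regex=False)
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the name is missing or ambiguous
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_by_name = matches.iloc[0]
print(selected_target_by_name)
```

Raising an error when the match is not unique keeps a silent mis-selection from propagating into the later steps.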

Here we will retrieve only the bioactivity data for *coronavirus 3C-like proteinase* (CHEMBL3927) that are reported as $IC_{50}$ values in nM (nanomolar) units. $IC_{50}$ is a quantitative measure of how much of an inhibitory substance (drug) is needed to inhibit a biological target in vitro by 50%.

In [None]:
activity = new_client.activity
# For the selected target protein (3C-like proteinase), retrieve the records reported as IC50 values
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df.head(3)
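
The query above filters on `standard_type` only, so it may be worth confirming that the returned values really are in nM. A minimal sketch, assuming the activity records carry a `standard_units` column. The rows below are invented for illustration and stand in for `df`:

```python
import pandas as pd

# Invented rows standing in for the data frame returned by the activity query
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["7200.0", "9400.0", "13.5"],
})

# Count how many records use each unit
print(df_demo["standard_units"].value_counts())

# Keep only the records reported in nM
df_nm = df_demo[df_demo["standard_units"] == "nM"]
print(len(df_nm))
```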

The potency of a drug is given in the *standard_value* column of the previous cell's output. The lower the value, the less of the drug is needed to inhibit the target protein's activity by 50%.

### **Saving the Data in CSV Format**

Let's now save the bioactivity data for our selected target to a CSV file.

In [27]:
# Index set to false prevents the index number appearing in the resulting csv file
df.to_csv('3C-proteinase_data.csv', index=False)

In [None]:
# Create a data directory to hold all our data (-p avoids an error if it already exists)
! mkdir -p data

In [None]:
! mv 3C-proteinase_data.csv data/

In [None]:
! ls -l data/3C-proteinase_data.csv
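
The shell commands above assume a Unix-like environment. A minimal sketch of the same step in pure Python, writing the file straight into the `data` directory. The small data frame is a stand-in for `df`, and a temporary base directory is used so the example does not touch your working files:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the bioactivity data frame df
df_demo = pd.DataFrame({"molecule_chembl_id": ["MOL_A"], "standard_value": ["7200.0"]})

# Temporary base directory for the demo; in the notebook this could simply be Path(".")
base_dir = Path(tempfile.mkdtemp())
data_dir = base_dir / "data"

# exist_ok=True makes the cell safe to re-run
data_dir.mkdir(parents=True, exist_ok=True)

out_path = data_dir / "3C-proteinase_data.csv"
df_demo.to_csv(out_path, index=False)
print(out_path.exists())
```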

## **Preprocessing**

### *Handle Missing Data*

If any compound has a missing value in the `standard_value` column, drop it.

In [None]:
# Drop all rows with a missing standard value
df2 = df[df.standard_value.notna()]
df2
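
The labelling code later in this notebook calls `float()` on each value, which suggests the column may hold strings. A minimal sketch, under that assumption, of converting the column to numbers and dropping rows that are missing or unparseable in one step. The rows are invented and stand in for `df`:

```python
import pandas as pd

# Invented rows: one valid value, one missing, one unparseable
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "standard_value": ["7200.0", None, "not reported"],
})

# Unparseable entries become NaN instead of raising an error
numeric_values = pd.to_numeric(df_demo["standard_value"], errors="coerce")

# Keep only the rows with a usable numeric value
df_clean = df_demo.assign(standard_value=numeric_values).dropna(subset=["standard_value"])
print(df_clean)
```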

### *Labelling compounds as either being active, inactive or intermediate*

For the purposes of training a machine learning model later on, we will label compounds as active, inactive or intermediate.
- Compounds with an IC50 value of `1,000 nM` or less will be labelled **active**
- Compounds with an IC50 value above `1,000 nM` and up to `10,000 nM` will be labelled **intermediate**
- Compounds with an IC50 value of more than `10,000 nM` will be labelled **inactive**

In [39]:
bioactivity_class = []
for i in df2.standard_value:
    if float(i) > 10000:
        bioactivity_class.append("inactive")
    elif float(i) <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")
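
The same thresholds can be applied without an explicit loop. A minimal sketch using `numpy.select`, with the cut-offs taken from the loop above. The values are invented:

```python
import numpy as np
import pandas as pd

# Invented IC50 values in nM, stored as strings and converted to float
values = pd.Series(["500", "1000", "5000", "10000", "25000"]).astype(float)

# Conditions are checked in order, mirroring the if / elif branches
conditions = [values > 10000, values <= 1000]
labels = ["inactive", "active"]

# Anything matching neither condition falls through to the default
bioactivity_demo = np.select(conditions, labels, default="intermediate").tolist()
print(bioactivity_demo)
```

Note that the boundary values follow the loop: exactly 1000 counts as active and exactly 10000 as intermediate.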

In [40]:
# Iterate over the molecule IDs and collect them in a list
mol_id = []
for i in df2.molecule_chembl_id:
    mol_id.append(i)

In [43]:
# iterate over the canonical_smiles
smiles = []
for i in df2.canonical_smiles:
    smiles.append(i)

In [44]:
# iterate over the standard value
standard_value = []
for i in df2.standard_value:
    standard_value.append(i)

In [None]:
# Combine into a single data frame: 
data_tuples = list(zip(mol_id, smiles, standard_value, bioactivity_class))
df3 = pd.DataFrame(data_tuples, columns=['mol_id', 'structure', 'IC50_values', 'bioactivity_class'])
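
The intermediate lists are not strictly needed. A minimal sketch of building an equivalent data frame by selecting and renaming columns directly. `df2_demo` and `labels_demo` are invented stand-ins for `df2` and `bioactivity_class`:

```python
import pandas as pd

# Invented stand-ins for df2 and bioactivity_class
df2_demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "standard_value": ["500.0", "25000.0"],
    "standard_type": ["IC50", "IC50"],
})
labels_demo = ["active", "inactive"]

df3_demo = (
    df2_demo[["molecule_chembl_id", "canonical_smiles", "standard_value"]]
    .rename(columns={
        "molecule_chembl_id": "mol_id",
        "canonical_smiles": "structure",
        "standard_value": "IC50_values",
    })
    .assign(bioactivity_class=labels_demo)  # a list is assigned by position
    .reset_index(drop=True)                 # fresh 0..n-1 index, as with the zip approach
)
print(df3_demo.columns.tolist())
```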

In [60]:
# Save the data as a csv file
df3.to_csv('data/3C-proteinase_processed_date.csv', index=False)