# **Project Description**

# **Section 1: Downloading Bioactivity Data**

Install the ChEMBL web service package to extract bioactivity data from the [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/). As of August 2024, ChEMBL contains roughly 2.4 million compounds.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting easydict (from chembl_webresource_client)
  Downloading easydict-1.13-py3-none-any.whl.metadata (4.2 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl.metadata (10 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)

# **Importing Libraries**

In [5]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

# **Target Protein Selection**

### **Protein target search: coronavirus**
### **Another example: aromatase protein target search (aromatase inhibitors (AIs) are drugs used to lower breast cancer risk)**
### **Another example: acetylcholinesterase protein target for Alzheimer's disease**
Here I stick with coronavirus.

In [9]:
# Target search for coronavirus
target = new_client.target
target_query = target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | Coronavirus | Coronavirus | 17.0 | False | CHEMBL613732 | [] | ORGANISM | 11119 |
| 1 | [] | Feline coronavirus | Feline coronavirus | 14.0 | False | CHEMBL612744 | [] | ORGANISM | 12663 |
| 2 | [] | Murine coronavirus | Murine coronavirus | 14.0 | False | CHEMBL5209664 | [] | ORGANISM | 694005 |
| 3 | [] | Canine coronavirus | Canine coronavirus | 14.0 | False | CHEMBL5291668 | [] | ORGANISM | 11153 |
| 4 | [] | Human coronavirus 229E | Human coronavirus 229E | 13.0 | False | CHEMBL613837 | [] | ORGANISM | 11137 |
| 5 | [] | Human coronavirus OC43 | Human coronavirus OC43 | 13.0 | False | CHEMBL5209665 | [] | ORGANISM | 31631 |
| 6 | [{'xref_id': 'P0C6U8', 'xref_name': None, 'xre... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 10.0 | False | CHEMBL3927 | [{'accession': 'P0C6U8', 'component_descriptio... | SINGLE PROTEIN | 227859 |
| 7 | [] | Middle East respiratory syndrome-related coron... | Middle East respiratory syndrome-related coron... | 9.0 | False | CHEMBL4296578 | [] | ORGANISM | 1335626 |
| 8 | [{'xref_id': 'P0C6X7', 'xref_name': None, 'xre... | SARS coronavirus | Replicase polyprotein 1ab | 4.0 | False | CHEMBL5118 | [{'accession': 'P0C6X7', 'component_descriptio... | SINGLE PROTEIN | 227859 |
| 9 | [] | Severe acute respiratory syndrome coronavirus 2 | Replicase polyprotein 1ab | 4.0 | False | CHEMBL4523582 | [{'accession': 'P0DTD1', 'component_descriptio... | SINGLE PROTEIN | 2697049 |


I use a single-protein target for further investigation.
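
Restricting the search results to single-protein targets can be sketched with a boolean mask on the `target_type` column. This is only an illustration, not one of the notebook's own cells: `targets_demo` is a hypothetical stand-in built from three rows of the search output, so the snippet runs without querying ChEMBL.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame (three rows copied from the search output)
targets_demo = pd.DataFrame(
    {
        "pref_name": [
            "Coronavirus",
            "SARS coronavirus 3C-like proteinase",
            "Replicase polyprotein 1ab",
        ],
        "target_chembl_id": ["CHEMBL613732", "CHEMBL3927", "CHEMBL5118"],
        "target_type": ["ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
    }
)

# Keep only rows whose target_type is exactly "SINGLE PROTEIN"
single_protein = targets_demo[targets_demo["target_type"] == "SINGLE PROTEIN"]
print(single_protein[["pref_name", "target_chembl_id"]])
```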

### **Extract bioactivity data for *SARS coronavirus 3C-like proteinase* (7th entry), which has the target type single protein**
I assign the ChEMBL ID of the 7th entry (row index 6, the target protein *SARS coronavirus 3C-like proteinase*) to the ***selected_target*** variable.

In [10]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL3927'
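
Positional indexing depends on the order in which ChEMBL returns the results, which may differ between releases. A more defensive sketch looks the ID up by `pref_name` instead. Both `targets_demo` and the helper `chembl_id_for` are hypothetical names used only for illustration; the rows are copied from the search output above.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame (rows 6, 8 and 9 of the search output)
targets_demo = pd.DataFrame(
    {
        "pref_name": [
            "SARS coronavirus 3C-like proteinase",
            "Replicase polyprotein 1ab",
            "Replicase polyprotein 1ab",
        ],
        "target_chembl_id": ["CHEMBL3927", "CHEMBL5118", "CHEMBL4523582"],
    }
)


def chembl_id_for(targets: pd.DataFrame, pref_name: str) -> str:
    """Return the target_chembl_id of the single row whose pref_name matches."""
    matches = targets.loc[targets["pref_name"] == pref_name, "target_chembl_id"]
    if len(matches) != 1:
        # Zero or several hits: refuse to guess which target was meant
        raise ValueError(
            f"expected exactly one match for {pref_name!r}, found {len(matches)}"
        )
    return matches.iloc[0]


selected_target_demo = chembl_id_for(targets_demo, "SARS coronavirus 3C-like proteinase")
print(selected_target_demo)
```

Raising on ambiguous names is a deliberate choice here: "Replicase polyprotein 1ab" appears twice in the search output (for SARS coronavirus and for SARS-CoV-2), so a name alone does not always identify one target.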

I consider only bioactivity data for *SARS coronavirus 3C-like proteinase* (CHEMBL3927) that are reported as IC$_{50}$ values in nM (nanomolar) units.
The lower the IC$_{50}$ value (standard_value), the more potent the drug, because a lower concentration is needed to achieve 50% inhibition. Below, ***activity*** retrieves the measured activity of small molecules against the target protein stored in ***selected_target***.

In [17]:

activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df.head(3)

| | action_type | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 1480935 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 7.2 |
| 1 | | | 1480936 | [] | CHEMBL829584 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 9.4 |
| 2 | | | 1481061 | [] | CHEMBL830868 | In vitro inhibitory concentration against SARS... | B | | | BAO_0000190 | ... | SARS coronavirus | SARS coronavirus 3C-like proteinase | 227859 | | | IC50 | uM | UO_0000065 | | 13.5 |
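
The `value` column in this output is given in uM, while the analysis is stated in nM. The conversion (1 uM = 1000 nM) and the resulting potency ordering can be sketched as follows. `ic50_demo` is a hypothetical frame holding only the three values shown, and the column name `value_nM` and the `TO_NM` lookup are made up for this illustration.

```python
import pandas as pd

# Hypothetical frame with the three IC50 values from the output
ic50_demo = pd.DataFrame(
    {
        "activity_id": [1480935, 1480936, 1481061],
        "value": [7.2, 9.4, 13.5],
        "units": ["uM", "uM", "uM"],
    }
)

# Multiplicative factors that convert each unit to nanomolar
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

# Look up the factor per row and scale the reported value
ic50_demo["value_nM"] = ic50_demo["value"] * ic50_demo["units"].map(TO_NM)

# A lower IC50 means higher potency, so ascending order puts the most potent first
ranked = ic50_demo.sort_values("value_nM")
print(ranked[["activity_id", "value_nM"]])
```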


Finally, I save the resulting bioactivity data to a CSV file, **bioactivity_data.csv**, for further preprocessing before training the ML model.

In [18]:
df.to_csv('bioactivity_data.csv', index=False)

I then moved the generated file into a folder named "Extracted_ChEMBL_data".
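
Writing the CSV straight into that folder would avoid the manual move. A minimal sketch, assuming the folder should sit in the current working directory; `save_bioactivity` is a hypothetical helper, not part of pandas or the ChEMBL client.

```python
from pathlib import Path

import pandas as pd


def save_bioactivity(
    df: pd.DataFrame,
    folder: str = "Extracted_ChEMBL_data",
    filename: str = "bioactivity_data.csv",
) -> Path:
    """Write df as CSV inside folder, creating the folder if needed."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)  # index=False matches the original cell
    return out_path
```

Calling `save_bioactivity(df)` in the notebook would then write `Extracted_ChEMBL_data/bioactivity_data.csv` in one step.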

# **Section 2: Data Preprocessing**

## **Handling missing data**
If any compound has a missing value in the **standard_value** column (the IC$_{50}$ value), drop it.

In [27]:
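# Keep only rows that have an IC50 value in standard_value
df2 = df[df.standard_value.notna()]
# Row count of the original DataFrame; compare with df2.shape[0] to see how many rows were dropped
print(df.shape[0])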

133


Fortunately, this dataset has no missing data.
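
Whether any rows were actually dropped can be checked by counting missing values and comparing row counts before and after the filter. The sketch below uses a hypothetical frame `demo` with one deliberately missing entry so the effect is visible; the compound labels are placeholders and the real dataset is not queried.

```python
import pandas as pd

# Hypothetical data: the second compound has no IC50 value
demo = pd.DataFrame(
    {
        "compound": ["cmpd_1", "cmpd_2", "cmpd_3"],  # placeholder labels
        "standard_value": [7200.0, None, 13500.0],
    }
)

# Count missing entries before filtering
n_missing = int(demo["standard_value"].isna().sum())

# Same filter as in the notebook cell, applied to the demo frame
cleaned = demo[demo["standard_value"].notna()]
n_dropped = len(demo) - len(cleaned)

print(f"missing: {n_missing}, rows before: {len(demo)}, rows after: {len(cleaned)}")
```

When the row counts before and after the filter are equal, the column has no missing values.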