<a href="https://colab.research.google.com/github/mehdimerbah/CompDrugDiscovery/blob/main/CDD_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Insatalling the ChEMBL Library
The ChEMBL library is a programatic way to access the ChEMBL database and retrieve disease/drug targets for a specific condition.


In [2]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
[?25l[K     |██████                          | 10 kB 20.7 MB/s eta 0:00:01[K     |███████████▉                    | 20 kB 23.0 MB/s eta 0:00:01[K     |█████████████████▊              | 30 kB 28.2 MB/s eta 0:00:01[K     |███████████████████████▋        | 40 kB 32.7 MB/s eta 0:00:01[K     |█████████████████████████████▌  | 51 kB 36.5 MB/s eta 0:00:01[K     |████████████████████████████████| 55 kB 3.6 MB/s 
[?25hCollecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 45.4 MB/s 
[?25hCollecting itsdangerous>=2.0.1
  Downloading itsdan

In [3]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [4]:
target = new_client.target
target_query= target.search('coronavirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Coronavirus,Coronavirus,17.0,False,CHEMBL613732,[],ORGANISM,11119
1,[],SARS coronavirus,SARS coronavirus,15.0,False,CHEMBL612575,[],ORGANISM,227859
2,[],Feline coronavirus,Feline coronavirus,15.0,False,CHEMBL612744,[],ORGANISM,12663
3,[],Human coronavirus 229E,Human coronavirus 229E,13.0,False,CHEMBL613837,[],ORGANISM,11137
4,"[{'xref_id': 'P0C6U8', 'xref_name': None, 'xre...",SARS coronavirus,SARS coronavirus 3C-like proteinase,10.0,False,CHEMBL3927,"[{'accession': 'P0C6U8', 'component_descriptio...",SINGLE PROTEIN,227859
5,[],Middle East respiratory syndrome-related coron...,Middle East respiratory syndrome-related coron...,9.0,False,CHEMBL4296578,[],ORGANISM,1335626
6,"[{'xref_id': 'P0C6X7', 'xref_name': None, 'xre...",SARS coronavirus,Replicase polyprotein 1ab,4.0,False,CHEMBL5118,"[{'accession': 'P0C6X7', 'component_descriptio...",SINGLE PROTEIN,227859
7,[],Severe acute respiratory syndrome coronavirus 2,Replicase polyprotein 1ab,4.0,False,CHEMBL4523582,"[{'accession': 'P0DTD1', 'component_descriptio...",SINGLE PROTEIN,2697049


## Select for Bioactivity Data for SARS-CoV Proteinase
This is where we filter out the target receptor protein data for COVID-19


In [5]:
selected_protein_targets = targets.target_chembl_id[4]
selected_protein_targets

'CHEMBL3927'

### IC50 Measurements
Half-maximal inhibitory concentration (IC50) is the most widely used and informative measure of a drug's efficacy. It indicates how much drug is needed to inhibit a biological process by half, thus providing a measure of potency of an antagonist drug in pharmacological research. (https://pubmed.ncbi.nlm.nih.gov/27365221/)

In [23]:
# Retrieve Bioactivity data for the selected targets
bioactivity_data = new_client.activity
# Filter data for those specific targets and set standard measuement unit to IC50 measurements
filtered_data = bioactivity_data.filter(target_chembl_id=selected_protein_targets).filter(standat_type="IC50")
# Create DataFrame from the filtered data stored in a dictionary, remove None/NA and then store it in a csv file for reusability
bioactivity_DF = pd.DataFrame.from_dict(filtered_data)
bioactivity_DF = bioactivity_DF[bioactivity_DF.standard_value.notna()]
bioactivity_DF.to_csv('raw_bioactivity_data.csv', index = False)
bioactivity_DF.head(5)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1480934,[],CHEMBL831837,In vitro percent inhibition against SARS coron...,B,,,BAO_0000201,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,Inhibition,%,UO_0000187,,25.0
1,,1480935,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,7.2
2,,1480936,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,9.4
3,,1481061,[],CHEMBL830868,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.5
4,,1481062,[],CHEMBL832053,In vitro percent inhibition against SARS coron...,B,,,BAO_0000201,BAO_0000357,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,Inhibition,%,UO_0000187,,13.0


### Activity Level Filtering
Now that we have the compounds involved, we can label them as being either active of inactive relative to a certain activity measurement threshold. We would label three classes:    
-  Active: activity<1000 nM.   
-  Inactive: activity>10000 nM.     
-  Moderate: 1000<activity<10000 nM.

In [27]:
activity_classes = []
for i in bioactivity_DF.standard_value:
  if float(i) >= 10000:
    activity_classes.append("inactive")
  elif float(i) <= 1000:
    activity_classes.append("active")
  else:
    activity_classes.append("moderate")

activity_classes = pd.Series(activity_classes, name='activity_classes')


### Selecting Relevant Columns

In [31]:
## Select columns of interest
bioactivity_DF = bioactivity_DF[['molecule_chembl_id','canonical_smiles', 'standard_value']]
bioactivity_DF = pd.concat([bioactivity_DF, activity_classes], axis=1)
bioactivity_DF

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,activity_classes
0,CHEMBL372889,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C(N)=O)ccc21,25.0,active
1,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,7200.0,moderate
2,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,9400.0,moderate
3,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,13500.0,inactive
4,CHEMBL188484,O=C1C(=O)N(Cc2cc3ccccc3o2)c2ccc(I)cc21,13.0,active
...,...,...,...,...
304,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,10600.0,
305,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,10100.0,
306,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,11500.0,
307,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,10700.0,


In [32]:
## Saving preprocessed data to csv
bioactivity_DF.to_csv('preprocessed_bioactivity_data.csv', index= False)