### **Bioinformatics Project - Computational Drug Discovery [Part 1]**

## **ChEMBL Database**
ChEMBL is a database of curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications. [Data as of March 25, 2020; ChEMBL version 26].

### **1. Install Libraries**
Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting attrs<22.0,>=21.2
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: url-normalize, attrs, requests-cache, chembl_webresource_client
  Attempting uninstall: attrs
    Found existing installation: attrs 22.2.0
    Uninstalling attrs-22.2.0:
      Successfully uninstalled attrs-22.2.0
Successfully installed attrs-

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

### **2. Search for Target Protein**
**Target search for Acetylcholinesterase**

In [None]:
target = new_client.target
target_query = target.search('acetylcholinesterase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,27.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,27.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,18.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,[],Bemisia tabaci,AChE2,16.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038
4,[],Leptinotarsa decemlineata,Acetylcholinesterase,16.0,False,CHEMBL2366490,"[{'accession': 'Q27677', 'component_descriptio...",SINGLE PROTEIN,7539
5,"[{'xref_id': 'P04058', 'xref_name': None, 'xre...",Torpedo californica,Acetylcholinesterase,15.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
6,"[{'xref_id': 'P21836', 'xref_name': None, 'xre...",Mus musculus,Acetylcholinesterase,15.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
7,"[{'xref_id': 'P37136', 'xref_name': None, 'xre...",Rattus norvegicus,Acetylcholinesterase,15.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
8,"[{'xref_id': 'O42275', 'xref_name': None, 'xre...",Electrophorus electricus,Acetylcholinesterase,15.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
9,"[{'xref_id': 'P23795', 'xref_name': None, 'xre...",Bos taurus,Acetylcholinesterase,15.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913


### **Select and retrieve bioactivity data for Human Acetylcholinesterase**
We will assign the first entry (index 0, which corresponds to the target protein, Human Acetylcholinesterase) to the **selected_target** variable.


In [None]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL220'
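
Selecting by position depends on the order in which the search returns its results. As a minimal sketch of a more explicit selection, the rows can be filtered on the `organism`, `pref_name`, and `target_type` columns shown in the output above. The small frame below is a hand-copied subset of that output, and `pick_target` is a hypothetical helper, not part of the ChEMBL client:

```python
import pandas as pd

# Hand-copied subset of the search results shown above (illustration only).
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Drosophila melanogaster"],
    "pref_name": ["Acetylcholinesterase", "Cholinesterases; ACHE & BCHE", "Acetylcholinesterase"],
    "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL220", "CHEMBL2095233", "CHEMBL2242744"],
})

def pick_target(targets, organism, pref_name, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single row matching all three fields."""
    mask = (
        (targets["organism"] == organism)
        & (targets["pref_name"] == pref_name)
        & (targets["target_type"] == target_type)
    )
    matches = targets.loc[mask, "target_chembl_id"]
    # Fail loudly instead of silently picking the wrong target.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected_demo = pick_target(targets_demo, "Homo sapiens", "Acetylcholinesterase")
```

With the real `targets` frame, the same call would replace the positional lookup, assuming the search still returns exactly one human single-protein entry with that name.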

Here, we retrieve only the bioactivity data for Human Acetylcholinesterase (CHEMBL220) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

df = pd.DataFrame.from_dict(res)
df.head()

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8


In [None]:
df.shape

(8502, 45)

### **3. Handling Missing data**
Drop any compound that has a missing value in the standard_value or canonical_smiles column.

In [None]:
df_1 = df[df.standard_value.notna()]
# Build the mask from df_1 (not df) so its index matches the frame being filtered
df_2 = df_1[df_1.canonical_smiles.notna()]
df_2.head(3)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0


In [None]:
df_2.shape

(7226, 45)

In [None]:
# Number of unique canonical_smiles values
len(df_2.canonical_smiles.unique())

5905

In [None]:
df_2.duplicated(['canonical_smiles']).sum()

1321

In [None]:
df_2 = df_2.drop_duplicates(['canonical_smiles'])

In [None]:
df_2.shape

(5905, 45)
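
The missing-value filter and the de-duplication above can also be written as one chain. This is a minimal sketch on a toy frame (the IDs and values are made up for illustration), using the standard pandas `dropna` and `drop_duplicates` methods:

```python
import pandas as pd

# Toy stand-in for the activity table (values are made up for illustration).
raw = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D", "E"],
    "canonical_smiles": ["CCO", None, "CCO", "c1ccccc1", "CCN"],
    "standard_value": ["750.0", "100.0", "300.0", None, "50000.0"],
})

cleaned = (
    raw
    .dropna(subset=["standard_value", "canonical_smiles"])  # both fields must be present
    .drop_duplicates(subset=["canonical_smiles"])            # keep the first row per structure
)
```

Note that `drop_duplicates` keeps the first row it sees for each SMILES string, so which measurement survives for a repeated compound depends on row order.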

### **4. Data pre-processing of the bioactivity data**
Keep only the 3 columns needed downstream (molecule_chembl_id, canonical_smiles, standard_value). The bioactivity class is added as a fourth column in the next step.

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df_2 = df_2[selection]
df_2.sample(10)

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
4584,CHEMBL3085881,CCn1c(=O)n(C)s/c1=N\CCC1CCN(Cc2ccccc2)CC1,140.0
6069,CHEMBL4099340,CCN(CC)CCCCCCOc1ccc(/C=C/C(=O)O)cc1OC,2670.0
4090,CHEMBL2019049,Cl.O=C(NCCCCCCCCCCNc1c2c(nc3cc(Cl)ccc13)CCCC2)...,0.065
846,CHEMBL139259,C[N+]1(C)CCO[C@@](O)(c2ccc(N)cc2)C1,690239.8
4218,CHEMBL2237994,Nc1cc(-c2ccc(Cl)cc2)nc2nc(-c3ccccc3)cc(C(F)(F)...,520.0
4443,CHEMBL2396904,CCCCN(C)c1nc2ccccc2n1C,100000.0
4481,CHEMBL2413558,CC(C)N(CCOc1ccc(-c2nc3ccccc3[nH]2)cc1)C(C)C,23420.0
2157,CHEMBL610219,COc1ccc2c(c1)C/C(=C\c1ccc(CN(C)Cc3ccccc3)cc1)C2=O,28700.0
6348,CHEMBL4064304,COc1cc(/C=N/O)c(O)c(CN2CCC(CN3CCc4cc(OC)c(OC)c...,543400.0
1142,CHEMBL343780,C/C=C(\C)C(=O)N[C@H]1CC[C@@]2(C)C(=CC[C@H]3[C@...,7079.46


In [None]:
df_2.shape

(5905, 3)

### **5. Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements in nM. Compounds with values of 1,000 nM or less are considered active, while those with values of 10,000 nM or more are considered inactive. Values between 1,000 and 10,000 nM are labeled intermediate.

In [None]:
bioactivity_threshold = []

for i in df_2.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append('inactive')
  elif float(i) <= 1000:
    bioactivity_threshold.append('active')
  else:
    bioactivity_threshold.append('intermediate')
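
The same cutoffs can be applied without an explicit loop. The following is a sketch, not the notebook's own method: `label_bioactivity` is a hypothetical helper built on `pandas.to_numeric` and `numpy.select`, and the demo values are made up.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (nM) with the same cutoffs as the loop above."""
    ic50 = pd.to_numeric(values)  # standard_value may arrive as strings
    labels = np.select(
        [ic50 >= inactive_min, ic50 <= active_max],
        ["inactive", "active"],
        default="intermediate",
    )
    # Reuse the input index so each label stays attached to its own row.
    return pd.Series(labels, index=values.index, name="class")

demo = pd.Series(["750.0", "1000", "5900.0", "10000", "500000.0"], index=[3, 7, 8, 20, 42])
demo_labels = label_bioactivity(demo)
```

Because the returned Series carries the index of its input, it can be attached to the source frame without relying on row position.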

In [None]:
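bioactivity_class = pd.Series(bioactivity_threshold,name='class')
# pd.concat(axis=1) matches rows by index label, not by position. df_2 still carries its
# original labels from df, while bioactivity_class is labeled 0..n-1.
df1 = pd.concat([df_2, bioactivity_class], axis=1)
df1.sample(10)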

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
8331,CHEMBL4853093,COc1ccc(CN2CCC(c3oc4ccccc4c3C(=O)c3ccccc3)CC2)cc1,97130.0,
6437,CHEMBL4087364,CCN(CC)c1ccc(/C=C/c2cc(N3CCN(CCO)CC3)c3ccccc3[...,5000.0,
6607,CHEMBL4175003,COc1ccc(C(=O)CCc2cc[n+](Cc3ccccc3C(F)(F)F)cc2C...,275.0,
7289,CHEMBL4593316,O=C(Nc1cc(CCc2cn(CCNc3c4c(nc5ccccc35)CCCC4)nn2...,23.6,
654,,,,active
4356,,,,active
1828,CHEMBL109018,Cc1c(C)c2ccc(OCc3cccc(Cl)c3)cc2oc1=O,3390.0,inactive
5622,,,,intermediate
2981,,,,intermediate
5724,CHEMBL1801816,COc1ccc(C(=O)N2c3ccccc3Sc3ccc(Cl)cc32)cc1,5900.0,active


In [None]:
df1.shape

(8009, 4)

In [None]:
df1.canonical_smiles.isnull().sum()

2104

In [None]:
df1 = df1.dropna()
df1

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0,active
...,...,...,...,...
5899,CHEMBL3972214,C#CCN(Cc1ccc2cccc(O)c2n1)C(C#N)CC1CCN(Cc2ccccc...,3300.0,active
5900,CHEMBL3910142,C#CCN(Cc1ccc2cccc(O)c2n1)C(C#N)CCC1CCN(Cc2cccc...,29.0,active
5901,CHEMBL8706,C#CCN(C)CCCOc1ccc(Cl)cc1Cl,500000.0,active
5902,CHEMBL972,C#CCN(C)[C@H](C)Cc1ccccc1,500000.0,active


In [None]:
df1.shape

(3801, 4)
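
The recorded output shows 8,009 rows after the concatenation, more than the 5,905 rows in df_2, along with rows of missing values and labels that disagree with the value beside them (for example, 500000.0 labeled active). This is consistent with `pd.concat(axis=1)` aligning on index labels rather than position, in which case `dropna` would discard valid compounds and keep mislabeled ones. A sketch on toy data (values made up) showing the effect and one way to avoid it, by giving the label Series the same index as the frame:

```python
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by filtering and drop_duplicates.
frame = pd.DataFrame(
    {"standard_value": [750.0, 5900.0, 500000.0]},
    index=[0, 4, 9],
)
labels = ["active", "intermediate", "inactive"]

# Default index 0..2: concat matches on labels, so only row 0 lines up.
misaligned = pd.concat([frame, pd.Series(labels, name="class")], axis=1)

# Reusing the frame's index keeps each label next to its own value.
aligned = pd.concat([frame, pd.Series(labels, name="class", index=frame.index)], axis=1)
```

Here `misaligned` has five rows with missing values on both sides, while `aligned` has the original three rows and no missing values. Calling `reset_index(drop=True)` on the frame before concatenating would be an alternative, at the cost of discarding the original row labels.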

In [None]:
df1.to_csv('acetylcholinesterase_bioactivity_data.csv', index=False)