<a href="https://colab.research.google.com/github/mounsifelatouch/code/blob/master/python/CDD/PFE/CDD_ML_Part_1_Campylobacterpylori_Bioactivity_Data_Concised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Mounsif EL ATOUCH


In this Jupyter notebook, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.


## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for about 2.4 million compounds. It is compiled from more than 86,000 documents and 1.5 million assays, and spans 15,000 targets, 2,000 cells, and 45,000 indications.
[Data as of January 25, 2022; ChEMBL version 32].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0 (from chembl_webresource_client)
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting attrs<22.0,>=21.2 (from requests-cache~=0.7.0->chembl_webresource_client)
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting url-normalize<2.0,>=1.4 (from requests-cache~=0.7.0->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: url-normalize, attrs, requests-cache, chembl_webresource_client
  Attempting uninstall: attrs
    F

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Urease**

In [3]:
# Target search for urease
target = new_client.target
target_query = target.search('Urease')
targets = pd.DataFrame.from_dict(target_query)
targets

| | cross_references | organism | pref_name | score | species_group_flag | target_chembl_id | target_components | target_type | tax_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [{'xref_id': 'P07374', 'xref_name': None, 'xre... | Canavalia ensiformis | Urease | 21.0 | False | CHEMBL4161 | [{'accession': 'P07374', 'component_descriptio... | SINGLE PROTEIN | 3823 |
| 1 | [] | Bacteria | Bacterial urease | 19.0 | True | CHEMBL2364683 | [{'accession': 'Q03282', 'component_descriptio... | PROTEIN COMPLEX | 2 |
| 2 | [{'xref_id': 'Q0PXQ5', 'xref_name': None, 'xre... | Helicobacter pylori | Urease | 18.0 | False | CHEMBL5325 | [{'accession': 'Q0PXQ5', 'component_descriptio... | SINGLE PROTEIN | 210 |
| 3 | [] | Helicobacter pylori (strain ATCC 700392 / 2669... | Urease subunit alpha/Urease subunit beta | 18.0 | False | CHEMBL3885651 | [{'accession': 'P69996', 'component_descriptio... | PROTEIN COMPLEX | 85962 |


### **Select and retrieve bioactivity data for *Helicobacter pylori (strain ATCC 700392 / 26695) (Campylobacter pylori)* (fourth entry)**

We will assign the ChEMBL ID of the fourth entry (the urease target from *Helicobacter pylori (strain ATCC 700392 / 26695) (Campylobacter pylori)*, at index 3) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL3885651'
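
Selecting by position depends on the order in which the search returns its rows, which may not be stable over time. As a minimal sketch, the same ID can be picked by attribute instead. The `toy_targets` frame below is a hand-built stand-in for `targets` (values copied from the search output above), and `pick_target` is a hypothetical helper, not part of the ChEMBL client. It also assumes `tax_id` is stored as an integer.

```python
import pandas as pd

# Hand-built stand-in for the `targets` DataFrame (values from the output above).
toy_targets = pd.DataFrame({
    "pref_name": [
        "Urease",
        "Bacterial urease",
        "Urease",
        "Urease subunit alpha/Urease subunit beta",
    ],
    "target_chembl_id": ["CHEMBL4161", "CHEMBL2364683", "CHEMBL5325", "CHEMBL3885651"],
    "target_type": ["SINGLE PROTEIN", "PROTEIN COMPLEX", "SINGLE PROTEIN", "PROTEIN COMPLEX"],
    "tax_id": [3823, 2, 210, 85962],  # assumption: integer dtype
})


def pick_target(targets, tax_id, target_type):
    """Hypothetical helper: return the one target_chembl_id matching both fields."""
    mask = (targets["tax_id"] == tax_id) & (targets["target_type"] == target_type)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


selected_target_by_attr = pick_target(toy_targets, 85962, "PROTEIN COMPLEX")
print(selected_target_by_attr)
```

The helper raises an error when zero or several rows match, so a change in the search results is noticed rather than silently selecting a different target.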

Here, we will retrieve only bioactivity data for *Helicobacter pylori (strain ATCC 700392 / 26695) (Campylobacter pylori)* (CHEMBL3885651) that are reported as IC50 values.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [7]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1944249,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.03
1,,1944250,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.017
2,,1944251,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.14
3,,1944874,[],CHEMBL893890,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,1.48
4,,1944876,[],CHEMBL893891,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,1.48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,,19445670,[],CHEMBL4432507,Inhibition of urease in Helicobacter pylori J9...,B,,,BAO_0000190,BAO_0000223,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,83.5
500,,19445671,[],CHEMBL4432507,Inhibition of urease in Helicobacter pylori J9...,B,,,BAO_0000190,BAO_0000223,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,11.8
501,Not Active,19445672,[],CHEMBL4432507,Inhibition of urease in Helicobacter pylori J9...,B,,,BAO_0000190,BAO_0000223,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,,,,
502,Not Active,19445673,[],CHEMBL4432507,Inhibition of urease in Helicobacter pylori J9...,B,,,BAO_0000190,BAO_0000223,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,,,,


Finally, we will save the resulting bioactivity data to the CSV file **bioactivity_data_raw.csv**.

In [8]:
df.to_csv('bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [9]:
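# Keep only rows that have a standard_value
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 itself; a mask built from df has a different
# index and makes pandas emit a reindexing UserWarning
df2 = df2[df2.canonical_smiles.notna()]
df2.shape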


(362, 45)
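
The two filters above can also be written as a single `dropna` call. This is a minimal sketch on a toy frame: only the `CHEMBL734` row comes from the data in this notebook, and the `toy_*` rows are made up for illustration.

```python
import pandas as pd

# Toy frame: first row is from this notebook, the other rows are invented.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL734", "toy_1", "toy_2", "toy_3"],
    "canonical_smiles": ["CC(=O)NO", None, "CCO", None],
    "standard_value": [17000.0, 5000.0, None, None],
})

# Drop a row if either of the two columns is missing.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(clean.shape)
```

Applied to the real data, this would read `df2 = df.dropna(subset=['standard_value', 'canonical_smiles'])`.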

In [10]:
len(df2.canonical_smiles.unique())

254

In [11]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,1944249,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.03
1,,1944250,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.017
2,,1944251,[],CHEMBL893889,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.14
10,,2306335,[],CHEMBL960362,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.017
11,,2306336,[],CHEMBL960362,Inhibition of Helicobacter pylori ATCC 43504 u...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,mM,UO_0000065,,0.047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445,,19445616,[],CHEMBL4432505,Inhibition of recombinant Helicobacter pylori ...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,108.0
448,,19445619,[],CHEMBL4432505,Inhibition of recombinant Helicobacter pylori ...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,35.4
449,,19445620,[],CHEMBL4432505,Inhibition of recombinant Helicobacter pylori ...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,85.0
450,,19445621,[],CHEMBL4432505,Inhibition of recombinant Helicobacter pylori ...,B,,,BAO_0000190,BAO_0000019,...,Helicobacter pylori (strain ATCC 700392 / 2669...,Urease subunit alpha/Urease subunit beta,85962,,,IC50,uM,UO_0000065,,20.1
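
Note that `drop_duplicates` keeps the first row for each SMILES string and discards any later measurements of the same compound. The sketch below shows that behaviour on toy data (the numeric values are made up), next to one possible alternative that takes the median per compound. Which of the two is appropriate is a modelling choice; this notebook uses the first.

```python
import pandas as pd

# Toy data: the same SMILES measured twice with different (invented) values.
toy = pd.DataFrame({
    "canonical_smiles": ["CC(=O)NO", "CC(=O)NO", "CCO"],
    "standard_value": [17000.0, 27000.0, 500.0],
})

# What the notebook does: keep the first occurrence of each SMILES.
first_kept = toy.drop_duplicates(["canonical_smiles"])

# Alternative (not used in this notebook): median value per SMILES.
median_per_smiles = toy.groupby(
    "canonical_smiles", as_index=False, sort=False
)["standard_value"].median()

print(first_kept["standard_value"].tolist())
print(median_per_smiles["standard_value"].tolist())
```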


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [12]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

| | molecule_chembl_id | canonical_smiles | standard_value |
|---|---|---|---|
| 0 | CHEMBL243822 | Oc1ccc(CCc2ccc(O)c(O)c2O)cc1 | 30000.0 |
| 1 | CHEMBL734 | CC(=O)NO | 17000.0 |
| 2 | CHEMBL242739 | O=c1c(-c2ccc(O)cc2)coc2c(O)c(O)ccc12 | 140000.0 |
| 10 | CHEMBL503157 | N=C(Cc1ccc(O)cc1)c1ccc(O)c(O)c1 | 17000.0 |
| 11 | CHEMBL412199 | NC(Cc1ccc(O)cc1)c1ccc(O)c(O)c1O | 47000.0 |
| ... | ... | ... | ... |
| 445 | CHEMBL4539240 | CCC(CO)NC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NC(C... | 108000.0 |
| 448 | CHEMBL4461450 | O=C(O)CCCNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NC... | 35400.0 |
| 449 | CHEMBL4471428 | COC(=O)CNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NCC... | 85000.0 |
| 450 | CHEMBL4575318 | COC(=O)CCCNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)N... | 20100.0 |


Save the DataFrame to a CSV file.

In [13]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, with **standard_value** expressed in nM. Compounds with values of 1,000 nM or less are labeled **active**, those with values of 10,000 nM or more are labeled **inactive**, and those in between are labeled **intermediate**.

In [14]:
df4 = pd.read_csv('bioactivity_data_preprocessed.csv')

In [15]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
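
The loop above can also be written without explicit iteration. This is a minimal sketch using `numpy.select` with the same boundaries (`>= 10000` inactive, `<= 1000` active). The `values` series is a toy input: the first two entries appear in this dataset, and the rest are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Toy IC50 values in nM; 30000.0 and 17000.0 occur in this dataset,
# the remaining values are invented to cover each branch.
values = pd.Series([30000.0, 17000.0, 10000.0, 5000.0, 1000.0, 250.0],
                   name="standard_value")

arr = values.to_numpy()
conditions = [arr >= 10000, arr <= 1000]   # checked in order, like if/elif
labels = ["inactive", "active"]

bioactivity_class_vec = pd.Series(
    np.select(conditions, labels, default="intermediate"),
    index=values.index,
    name="class",
)
print(bioactivity_class_vec.tolist())
```

Applied to the real data, `values` would be `df4.standard_value`, and the resulting series could be concatenated to `df4` in the same way as in the next cell.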

In [16]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

| | molecule_chembl_id | canonical_smiles | standard_value | class |
|---|---|---|---|---|
| 0 | CHEMBL243822 | Oc1ccc(CCc2ccc(O)c(O)c2O)cc1 | 30000.0 | inactive |
| 1 | CHEMBL734 | CC(=O)NO | 17000.0 | inactive |
| 2 | CHEMBL242739 | O=c1c(-c2ccc(O)cc2)coc2c(O)c(O)ccc12 | 140000.0 | inactive |
| 3 | CHEMBL503157 | N=C(Cc1ccc(O)cc1)c1ccc(O)c(O)c1 | 17000.0 | inactive |
| 4 | CHEMBL412199 | NC(Cc1ccc(O)cc1)c1ccc(O)c(O)c1O | 47000.0 | inactive |
| ... | ... | ... | ... | ... |
| 249 | CHEMBL4539240 | CCC(CO)NC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NC(C... | 108000.0 | inactive |
| 250 | CHEMBL4461450 | O=C(O)CCCNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NC... | 35400.0 | inactive |
| 251 | CHEMBL4471428 | COC(=O)CNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)NCC... | 85000.0 | inactive |
| 252 | CHEMBL4575318 | COC(=O)CCCNC(=O)c1ccccc1[Se][Se]c1ccccc1C(=O)N... | 20100.0 | inactive |


Save the DataFrame to a CSV file.

In [17]:
df5.to_csv('bioactivity_data_curated.csv', index=False)

In [18]:
! zip Campylobacterpylori.zip *.csv

  adding: bioactivity_data_curated.csv (deflated 79%)
  adding: bioactivity_data_preprocessed.csv (deflated 77%)
  adding: bioactivity_data_raw.csv (deflated 94%)


In [19]:
! ls -l

total 364
-rw-r--r-- 1 root root  15665 May  2 12:06 bioactivity_data_curated.csv
-rw-r--r-- 1 root root  13185 May  2 12:06 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 304469 May  2 12:06 bioactivity_data_raw.csv
-rw-r--r-- 1 root root  25631 May  2 12:06 Campylobacterpylori.zip
drwxr-xr-x 1 root root   4096 Apr 28 13:35 sample_data


---