<a href="https://colab.research.google.com/github/maronem/CDD_ML_Bioactivity_Project/blob/main/CDD_ML_bioactivity_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Computational Drug Discovery Using Bioactivity Data

The goal of this notebook and project is to build a machine learning (ML) model for drug discovery using bioactivity data from the ChEMBL database.


## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data spans 13,000 targets, 1,800 cells, and 33,000 indications.

### Install libraries

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0 (from chembl_webresource_client)
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting attrs<22.0,>=21.2 (from requests-cache~=0.7.0->chembl_webresource_client)
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting url-normalize<2.0,>=1.4 (from requests-cache~=0.7.0->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: url-normalize, attrs, requests-cache, chembl_webresource_client
  Attempting uninstall: attrs
    Found existing installation: attrs 23.1.0
    Uninstalling attrs-23.1.0:
      Successfully uninstal

### Search target and create dataframe

In [2]:
# Import libraries

import pandas as pd
from chembl_webresource_client.new_client import new_client

In [6]:
# Target search for BRCA1

target = new_client.target
target_query = target.search('brca1')
targets = pd.DataFrame.from_dict(target_query)
targets


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P38398', 'xref_name': None, 'xre...",Homo sapiens,Breast cancer type 1 susceptibility protein,19.0,False,CHEMBL5990,"[{'accession': 'P38398', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Lys-63-specific deubiquitinase BRCC36,19.0,False,CHEMBL4105965,"[{'accession': 'P46736', 'component_descriptio...",SINGLE PROTEIN,9606
2,"[{'xref_id': 'BAP1', 'xref_name': None, 'xref_...",Homo sapiens,Ubiquitin carboxyl-terminal hydrolase BAP1,15.0,False,CHEMBL1293314,"[{'accession': 'Q92560', 'component_descriptio...",SINGLE PROTEIN,9606


In [8]:
# Select chembl_id for BRCA1-susceptibility protein

selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL5990'
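
Taking row 0 works here because the BRCA1 protein happens to be the top search hit. As a slightly more defensive sketch, the row can be selected by its preferred name instead. The small DataFrame below is rebuilt by hand from the search output above so the example runs offline; `targets_demo`, `target_name`, and `selected_target_demo` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hand-built stand-in for the search results shown above (two columns only).
targets_demo = pd.DataFrame({
    "pref_name": [
        "Breast cancer type 1 susceptibility protein",
        "Lys-63-specific deubiquitinase BRCC36",
        "Ubiquitin carboxyl-terminal hydrolase BAP1",
    ],
    "target_chembl_id": ["CHEMBL5990", "CHEMBL4105965", "CHEMBL1293314"],
})

# Select by name rather than by position, so a change in hit order cannot
# silently pick a different target.
target_name = "Breast cancer type 1 susceptibility protein"
matches = targets_demo.loc[targets_demo["pref_name"] == target_name, "target_chembl_id"]

if len(matches) != 1:
    raise ValueError(f"Expected exactly one match for {target_name!r}, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The explicit length check makes the cell fail loudly if the search ever returns zero or several rows with that name.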

In [19]:
# Retrieve bioactivity data for CHEMBL5990 (reported as IC50 values in nM)

activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [20]:
df = pd.DataFrame.from_dict(res)
df.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,6222842,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,4.6
1,,,6222843,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,250.0
2,,,6222844,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,52.8
3,,,6222845,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,250.0
4,,,6222846,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,15.0


In [21]:
df.standard_type.unique()

array(['IC50'], dtype=object)

### Save bioactivity data as csv

In [27]:
df.to_csv('CHEMBL5990_bioactivity_data.csv', index=False)

### Copying files to Google Drive

In [28]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [29]:
! mkdir -p "/content/gdrive/MyDrive/Colab Notebooks/data"

In [30]:
! cp CHEMBL5990_bioactivity_data.csv "/content/gdrive/MyDrive/Colab Notebooks/data"

In [32]:
!ls -l "/content/gdrive/MyDrive/Colab Notebooks/data"

total 11
-rw------- 1 root root 10974 Oct  2 19:55 CHEMBL5990_bioactivity_data.csv


# Preprocessing Data

In [33]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,6222842,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,4.6
1,,,6222843,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,250.0
2,,,6222844,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,52.8
3,,,6222845,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,250.0
4,,,6222846,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,15.0
5,,,6222847,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,35.0
6,,,6222848,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,3.2
7,,,6222849,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,30.1
8,,,6222850,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,7.1
9,,,6222851,[],CHEMBL1785941,Inhibition of BRCA1 by fluorescence polarizati...,B,,,BAO_0000190,...,Homo sapiens,Breast cancer type 1 susceptibility protein,9606,,,IC50,uM,UO_0000065,,18.4


In [37]:
df.standard_value.notna()

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
Name: standard_value, dtype: bool

There are no missing IC50 values in this dataset.
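
For larger datasets, reading a long column of booleans is impractical. A minimal sketch of a more compact check follows, using a small made-up frame (`demo`) with one deliberately missing entry. The values are stored as strings here on the assumption that they arrive that way, which the `float()` casts in the labeling loop further down suggest.

```python
import pandas as pd

# Made-up stand-in for df, with one missing standard_value on purpose.
demo = pd.DataFrame({"standard_value": ["4600.0", None, "52800.0"]})

# Count missing entries instead of inspecting each row.
n_missing = int(demo["standard_value"].isna().sum())
print(f"Missing standard_value entries: {n_missing}")

# Drop rows without a value and convert the rest to floats.
clean = demo.dropna(subset=["standard_value"]).copy()
clean["standard_value"] = pd.to_numeric(clean["standard_value"])
print(clean)
```

Applied to the real `df`, the same `isna().sum()` call would be expected to report 0, matching the all-`True` output above.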

## Labeling compounds as being active, inactive, or intermediate

The bioactivity data is reported as IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those with values between 1,000 and 10,000 nM are labeled **intermediate**.

In [39]:
bioactivity_class = []

for value in df.standard_value:
  if float(value) >= 10000:
    bioactivity_class.append("inactive")
  elif float(value) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
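
The same thresholds can also be applied without an explicit loop. The sketch below uses `numpy.select` on a short, hand-picked series (`values`) that includes both boundary cases; it is an alternative formulation under the same cutoffs, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hand-picked IC50 values in nM, including the two boundary cases.
values = pd.Series([4600.0, 250000.0, 52800.0, 1000.0, 10000.0])

# Conditions are evaluated in order; anything matching neither is "intermediate".
conditions = [values >= 10000, values <= 1000]
choices = ["inactive", "active"]
labels = np.select(conditions, choices, default="intermediate")

print(labels.tolist())
```

Because the comparisons are `>=` and `<=`, a compound at exactly 1,000 nM is labeled active and one at exactly 10,000 nM is labeled inactive, which mirrors the loop above.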

In [42]:
bioactivity_class

['intermediate',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'intermediate',
 'inactive',
 'intermediate',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'inactive',
 'active',
 'intermediate']

In [51]:
len(df.molecule_chembl_id.unique()) == len(df.molecule_chembl_id)

True

There are no repeated ChEMBL IDs in our dataset.
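
If the check had returned `False`, the next step would be finding which IDs repeat. A minimal sketch, assuming a contrived series (`ids`) with one deliberate duplicate:

```python
import pandas as pd

# Contrived example: the first ID appears twice on purpose.
ids = pd.Series(
    ["CHEMBL1784774", "CHEMBL1784771", "CHEMBL1784774"],
    name="molecule_chembl_id",
)

# is_unique is equivalent to comparing the unique count with the total count.
all_unique = ids.is_unique

# duplicated(keep=False) flags every occurrence of a repeated value.
repeated = ids[ids.duplicated(keep=False)].unique().tolist()

print(all_unique, repeated)
```

On the real data, `df.molecule_chembl_id.is_unique` should give the same answer as the length comparison above.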

In [55]:
# Collect the ChEMBL IDs into a list

mol_cid = []

for chembl_id in df.molecule_chembl_id:  # avoid shadowing the built-in id()
  mol_cid.append(chembl_id)

In [59]:
# Collect the canonical SMILES into a list

canon_smiles = []

for i in df.canonical_smiles:
  canon_smiles.append(i)

In [60]:
# Collect the standard values into a list

std_value = []

for value in df.standard_value:
  std_value.append(value)

In [63]:
# Combine the four data lists into a dataframe

bioactivity_df = pd.DataFrame(list(zip(mol_cid, bioactivity_class, canon_smiles, std_value)),
                              columns=['Molecular cID', 'Bioactivity Class', 'Canonical Smiles', 'Standard Value'])
bioactivity_df

Unnamed: 0,Molecular cID,Bioactivity Class,Canonical Smiles,Standard Value
0,CHEMBL1784774,intermediate,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,4600.0
1,CHEMBL1784771,inactive,CC(=O)N[C@@H](CCC(=O)O)C(=O)N1CCC[C@H]1C(=O)N[...,250000.0
2,CHEMBL1784772,inactive,CC(=O)N[C@@H](CC(C(=O)O)C(=O)O)C(=O)N1CCC[C@H]...,52800.0
3,CHEMBL1784773,inactive,CC(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)...,250000.0
4,CHEMBL1784704,inactive,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N[C@@H](C)C(=O)...,15000.0
5,CHEMBL1784770,inactive,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,35000.0
6,CHEMBL1784703,intermediate,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,3200.0
7,CHEMBL1784775,inactive,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,30100.0
8,CHEMBL1784776,intermediate,CC[C@H](C)[C@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H](C...,7100.0
9,CHEMBL1784777,inactive,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,18400.0
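
An equivalent table can be built without the intermediate lists by selecting the columns of interest and inserting the class labels. The sketch below uses a two-row stand-in (`raw`) with placeholder SMILES strings, not the real structures, and keeps ChEMBL's original column names rather than the renamed headers used above.

```python
import pandas as pd

# Two-row stand-in for df; the SMILES are placeholders, not the real compounds.
raw = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1784774", "CHEMBL1784771"],
    "canonical_smiles": ["CCO", "CC(=O)O"],
    "standard_value": ["4600.0", "250000.0"],
    "assay_type": ["B", "B"],
})
labels = ["intermediate", "inactive"]  # one label per row, in row order

# Keep only the needed columns, then place the labels in second position.
tidy = raw[["molecule_chembl_id", "canonical_smiles", "standard_value"]].copy()
tidy.insert(1, "bioactivity_class", labels)

print(tidy)
```

Selecting columns directly keeps each row's values aligned by index, so there is no risk of the separate lists drifting out of order.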


### Save bioactivity data df to csv

In [64]:
bioactivity_df.to_csv('bioactivity_preprocessed_data.csv', index=False)

In [66]:
!ls -l

total 24
-rw-r--r-- 1 root root  2442 Oct  2 20:31 bioactivity_preprocessed_data.csv
-rw-r--r-- 1 root root 10974 Oct  2 19:41 CHEMBL5990_bioactivity_data.csv
drwx------ 5 root root  4096 Oct  2 19:42 gdrive
drwxr-xr-x 1 root root  4096 Sep 29 13:23 sample_data


In [67]:
# Copy the preprocessed CSV to the data folder on Google Drive

!cp bioactivity_preprocessed_data.csv "/content/gdrive/MyDrive/Colab Notebooks/data"

In [68]:
!ls "/content/gdrive/MyDrive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv  CHEMBL5990_bioactivity_data.csv


# **Exploratory Data Analysis**

## Install conda and rdkit

RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.

In [69]:
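# Download the Miniconda installer (Python 3.7 build) and install it into /usr/local
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
# Install RDKit from the rdkit conda channel
! conda install -c rdkit rdkit -y
# Make the conda-installed packages importable from this notebook
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')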

--2023-10-02 20:52:37--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2023-10-02 20:52:38 (335 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - done
Solving environment: | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447c_0
    - conda=

## Calculate Lipinski descriptors

Lipinski's rule of five, also known as Pfizer's rule of five or simply the rule of five (RO5), is a rule of thumb to evaluate druglikeness or determine if a chemical compound with a certain pharmacological or biological activity has chemical properties and physical properties that would likely make it an orally active drug in humans.

The rule was formulated by Christopher A. Lipinski, based on the observation that most orally administered drugs are relatively small and moderately lipophilic molecules. - https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five#:~:text=Lipinski's%20rule%20states%20that%2C%20in,all%20nitrogen%20or%20oxygen%20atoms

Lipinski's rule states that, in general, an orally active drug has no more than one violation of the following criteria:

* No more than 5 hydrogen bond donors (the total number of nitrogen–hydrogen and oxygen–hydrogen bonds)
* No more than 10 hydrogen bond acceptors (all nitrogen or oxygen atoms)
* A molecular mass less than 500 daltons
* A calculated octanol-water partition coefficient (Clog P) < 5
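
The four criteria above can be expressed as a small violation counter. This is a sketch, not this notebook's implementation: the function names and the descriptor values in the usage lines are hypothetical. In practice the inputs could come from RDKit, for example `Descriptors.MolWt`, `Descriptors.MolLogP`, `Lipinski.NumHDonors`, and `Lipinski.NumHAcceptors`.

```python
def count_ro5_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five criteria a compound breaks."""
    violations = [
        h_donors > 5,        # more than 5 hydrogen bond donors
        h_acceptors > 10,    # more than 10 hydrogen bond acceptors
        mol_weight >= 500,   # molecular mass not below 500 daltons
        logp >= 5,           # partition coefficient not below 5
    ]
    return sum(violations)


def passes_ro5(mol_weight, logp, h_donors, h_acceptors):
    """An orally active drug is expected to have no more than one violation."""
    return count_ro5_violations(mol_weight, logp, h_donors, h_acceptors) <= 1


# Hypothetical descriptor values, for illustration only.
small_molecule = dict(mol_weight=350.0, logp=2.1, h_donors=2, h_acceptors=5)
large_peptide = dict(mol_weight=720.5, logp=5.6, h_donors=7, h_acceptors=12)

print(count_ro5_violations(**small_molecule), passes_ro5(**small_molecule))
print(count_ro5_violations(**large_peptide), passes_ro5(**large_peptide))
```

The thresholds follow the list as written above, so a mass of exactly 500 daltons or a Clog P of exactly 5 counts as a violation.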



In [71]:
bioactivity_df.head()

Unnamed: 0,Molecular cID,Bioactivity Class,Canonical Smiles,Standard Value
0,CHEMBL1784774,intermediate,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N1CCC[C@H]1C(=O...,4600.0
1,CHEMBL1784771,inactive,CC(=O)N[C@@H](CCC(=O)O)C(=O)N1CCC[C@H]1C(=O)N[...,250000.0
2,CHEMBL1784772,inactive,CC(=O)N[C@@H](CC(C(=O)O)C(=O)O)C(=O)N1CCC[C@H]...,52800.0
3,CHEMBL1784773,inactive,CC(=O)N[C@H](C(=O)N1CCC[C@H]1C(=O)N[C@H](C(=O)...,250000.0
4,CHEMBL1784704,inactive,CC(=O)N[C@@H](COP(=O)(O)O)C(=O)N[C@@H](C)C(=O)...,15000.0


## Import Libraries

In [74]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.5 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.3.2


In [76]:
import numpy as np
import rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

rdkit.__version__

'2023.03.2'