# **Bioinformatics Project - Computational Drug Discovery for Breast Cancer**

This is a real-world data science and machine learning project in which we build a machine learning model from ChEMBL bioactivity data.

# ***PART 1: Data Collection and Data Pre-Processing***

## Installing and Importing Libraries

I am using the ChEMBL database for this project. It contains curated bioactivity data for more than 2 million compounds.

In [1]:
pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting attrs<22.0,>=21.2
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting itsdangerous>=2.0.1

In [2]:
# Importing necessary libraries

import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for the Target Protein 'aromatase' in the ChEMBL Database

Aromatase is the enzyme that catalyzes the final step of estrogen synthesis, and inhibiting it is an established strategy against hormone-dependent breast cancer. So my goal is to identify drug compounds that can inhibit aromatase.

Here, the target is the protein or organism that the drug (compound) acts on. The compound produces a biological effect that either activates the target or deactivates (inhibits) it.

In [3]:
# Searching for the target protein in the ChEMBL database.

target = new_client.target
target_query = target.search('aromatase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P11511', 'xref_name': None, 'xre...",Homo sapiens,Cytochrome P450 19A1,20.0,False,CHEMBL1978,"[{'accession': 'P11511', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'P22443', 'xref_name': None, 'xre...",Rattus norvegicus,Cytochrome P450 19A1,20.0,False,CHEMBL3859,"[{'accession': 'P22443', 'component_descriptio...",SINGLE PROTEIN,10116


As we can see above, the search returns one SINGLE PROTEIN target for Homo sapiens and one for Rattus norvegicus.

## Select and Retrieve Bioactivity Data for Homo Sapiens

In [4]:
# Getting the ChEMBL ID of the human SINGLE PROTEIN target (first row of the search result)

selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1978'
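
Selecting the target by position (`[0]`) relies on the human entry being listed first. A slightly more defensive sketch filters on the `organism` and `target_type` columns shown in the search output instead. The small `targets` frame below is a hand-built stand-in for the real search result, and `pick_human_target` is a hypothetical helper name, not part of the ChEMBL client.

```python
import pandas as pd

# Hand-built stand-in for the search result shown above (only the columns needed here).
targets = pd.DataFrame({
    "organism": ["Homo sapiens", "Rattus norvegicus"],
    "pref_name": ["Cytochrome P450 19A1", "Cytochrome P450 19A1"],
    "target_chembl_id": ["CHEMBL1978", "CHEMBL3859"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN"],
})

def pick_human_target(df):
    """Return the ChEMBL ID of the first human single-protein target (hypothetical helper)."""
    mask = (df["organism"] == "Homo sapiens") & (df["target_type"] == "SINGLE PROTEIN")
    matches = df.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError("no human single-protein target found")
    return matches.iloc[0]

selected_target = pick_human_target(targets)
print(selected_target)
```

With this approach the result does not change if the search returns the rows in a different order.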

In [5]:
# Retrieving the bioactivity data for the selected target, keeping only records reported as IC50 values

activity = new_client.activity
result = activity.filter(target_chembl_id = selected_target).filter(standard_type = "IC50")

df = pd.DataFrame.from_dict(result)
df

# The last column shown, 'value', is the reported IC50 (here in uM); the 'standard_value' column holds the same measurement standardized to nM.
# IC50 is the potency of the compound: the concentration needed to inhibit the target's activity by 50%.
# So the lower the value, the better, because a lower concentration of the compound is required to do the job.

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2907,,23277437,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.9
2908,,23277438,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,3.7
2909,,23277439,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,2.4
2910,,23277440,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.023
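
The `value` column in the table above is reported in uM, while the thresholds used later in this notebook are stated in nM. As a small worked example of the conversion, the first row's 7.1 uM corresponds to 7100 nM. The helper name `um_to_nm` is my own and is not part of pandas or the ChEMBL client.

```python
def um_to_nm(value_um):
    """Convert a concentration from micromolar to nanomolar (1 uM = 1000 nM)."""
    return float(value_um) * 1000.0

# The first three 'value' entries from the table above, given as strings.
for v in ["7.1", "50.0", "0.238"]:
    print(v, "uM =", round(um_to_nm(v), 3), "nM")
```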


Storing the IC50 bioactivity data in a CSV file

In [6]:
df.to_csv('bioactivity_data_1.csv', index=False)

1. Mounting Google Drive in the Colab notebook
2. Creating a folder named 'data' inside the 'Colab Notebooks' directory
3. Copying the bioactivity_data_1.csv file into the 'data' folder


In [7]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
# ! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

In [8]:
! cp bioactivity_data_1.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [9]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data_1.csv	bioactivity_data.csv  bioactivity_preprocessed_data.csv
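
The create-folder and copy steps can also be done from Python, which avoids an error when the folder already exists. This is only a sketch: `copy_to_data_dir` is a hypothetical helper, and a temporary folder stands in for the Drive path so the example runs outside Colab.

```python
import os
import shutil
import tempfile

def copy_to_data_dir(src_path, data_dir):
    """Create data_dir if needed and copy src_path into it; returns the new file path."""
    os.makedirs(data_dir, exist_ok=True)  # no error if the folder already exists
    return shutil.copy(src_path, data_dir)

# Demo: a temporary folder stands in for "/content/gdrive/My Drive/Colab Notebooks".
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "bioactivity_data_1.csv")
with open(src, "w") as fh:
    fh.write("molecule_chembl_id,standard_value\n")

dest = copy_to_data_dir(src, os.path.join(workdir, "data"))
print(os.listdir(os.path.dirname(dest)))
```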


# Handling Missing Data in the Dataset

In [21]:
# Dropping missing values from the standard_value column

df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2907,,23277437,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.9
2908,,23277438,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,3.7
2909,,23277439,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,2.4
2910,,23277440,[],CHEMBL4836470,Inhibition of aromatase (unknown origin),B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.023
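
The same filter can be written with `DataFrame.dropna`. The sketch below also drops rows without a `canonical_smiles` string; that extra column is an assumption about what later steps need, not something the cell above does. The toy frame and its `MOL_*` IDs are made up for illustration.

```python
import pandas as pd

# Made-up rows: one complete, one missing its SMILES, one missing its standard_value.
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "canonical_smiles": ["CCO", None, "c1ccccc1"],
    "standard_value": ["7100.0", "50000.0", None],
})

# Keep only rows where both columns are present, then renumber the index.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"]).reset_index(drop=True)
print(clean)
```

Resetting the index is optional; without it the original row numbers are kept, as in the output above.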


# Data Pre-processing of the bioactivity dataset

## Labeling compounds as either being active, inactive or intermediate

The bioactivity data are IC50 values, and 'standard_value' is given in nM.
Compounds that have values of 1,000 nM or less will be labeled '**active**', compounds that have values of 10,000 nM or more will be labeled '**inactive**', and compounds that have values between 1,000 nM and 10,000 nM will be labeled '**intermediate**'.

In [22]:
# standard_value is in nM, so the thresholds are 1,000 nM and 10,000 nM
bioactivity_class = []
for i in df2.standard_value:
  if float(i) <= 1000:
    bioactivity_class.append("active")
  elif float(i) >= 10000:
    bioactivity_class.append("inactive")
  else:
    bioactivity_class.append("intermediate")
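
The labeling rule described above (1,000 nM and 10,000 nM thresholds) can also be written as a small function with named thresholds, which makes the units explicit. This is a sketch: the names `ACTIVE_MAX_NM`, `INACTIVE_MIN_NM` and `label_bioactivity` are my own, and the boundary handling follows the `<=` / `>=` comparisons used in the loop.

```python
import pandas as pd

ACTIVE_MAX_NM = 1000.0      # at or below this IC50 (nM): active
INACTIVE_MIN_NM = 10000.0   # at or above this IC50 (nM): inactive

def label_bioactivity(standard_value_nm):
    """Map an IC50 in nM (number or numeric string) to a bioactivity class."""
    value = float(standard_value_nm)
    if value <= ACTIVE_MAX_NM:
        return "active"
    if value >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"

# The first four standard_value entries of this dataset, as strings.
values = pd.Series(["7100.0", "50000.0", "238.0", "57.0"])
labels = values.apply(label_bioactivity)
print(labels.tolist())  # ['intermediate', 'inactive', 'active', 'active']
```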

## Collect the *molecule_chembl_id* values into a list

In [23]:
# The same drug can be listed several times in the database under the same 'molecule_chembl_id'.
# This cell only copies the IDs into a list in row order; repeated IDs are kept so the list stays aligned with the other columns.

# df2.molecule_chembl_id // to view all the molecule_chembl_id in the database

mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

mol_cid

['CHEMBL341591',
 'CHEMBL2111947',
 'CHEMBL431859',
 'CHEMBL113637',
 'CHEMBL112021',
 'CHEMBL324070',
 'CHEMBL41761',
 'CHEMBL111868',
 'CHEMBL111888',
 'CHEMBL112074',
 'CHEMBL324326',
 'CHEMBL37321',
 'CHEMBL353068',
 'CHEMBL41066',
 'CHEMBL166709',
 'CHEMBL424556',
 'CHEMBL1630273',
 'CHEMBL1630261',
 'CHEMBL169251',
 'CHEMBL168636',
 'CHEMBL90585',
 'CHEMBL1629805',
 'CHEMBL433728',
 'CHEMBL38877',
 'CHEMBL169449',
 'CHEMBL39275',
 'CHEMBL39513',
 'CHEMBL2112738',
 'CHEMBL289116',
 'CHEMBL289116',
 'CHEMBL39782',
 'CHEMBL304903',
 'CHEMBL488',
 'CHEMBL488',
 'CHEMBL168434',
 'CHEMBL352645',
 'CHEMBL2112739',
 'CHEMBL440930',
 'CHEMBL1630274',
 'CHEMBL1629804',
 'CHEMBL39661',
 'CHEMBL39661',
 'CHEMBL38550',
 'CHEMBL39152',
 'CHEMBL166789',
 'CHEMBL3349856',
 'CHEMBL1630275',
 'CHEMBL1630267',
 'CHEMBL38439',
 'CHEMBL38510',
 'CHEMBL415753',
 'CHEMBL3349864',
 'CHEMBL1629806',
 'CHEMBL169664',
 'CHEMBL168762',
 'CHEMBL275594',
 'CHEMBL39781',
 'CHEMBL413344',
 'CHEMBL168444',
 ...
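
The list above still contains repeated IDs (for example `CHEMBL289116` and `CHEMBL488` each appear twice in a row). If one row per compound is wanted, a minimal sketch with `DataFrame.drop_duplicates` follows. Keeping the first occurrence is an assumption; other choices, such as averaging repeated measurements, may suit the data better. The four rows are entries 27 to 30 of the ID and `standard_value` lists printed in this notebook.

```python
import pandas as pd

# Four consecutive rows taken from the lists printed in this notebook (entries 27 to 30).
rows = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL2112738", "CHEMBL289116", "CHEMBL289116", "CHEMBL39782"],
    "standard_value": ["11000.0", "10000.0", "45000.0", "30000.0"],
})

# Keep the first row seen for each compound (assumption: first occurrence wins).
unique_rows = rows.drop_duplicates(subset="molecule_chembl_id", keep="first").reset_index(drop=True)
print(unique_rows)
```

Deduplicating the DataFrame, rather than one list at a time, keeps the ID, SMILES and value of each row together.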

## Collect the *canonical_smiles* values into a list

In [24]:
canonical_smiles = []

for i in df2.canonical_smiles:
  canonical_smiles.append(i)

canonical_smiles

['CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12',
 'C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43C)[C@@H]1CC[C@@H]2[C@H]1CN1',
 'CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21',
 'CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21',
 'Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21',
 'Cc1ccc(S(=O)(=O)n2cc(C(c3ccccc3)n3ccnc3)c3ccccc32)cc1',
 'CCn1ccc2cc(C(c3ccc(F)cc3)n3ccnc3)ccc21',
 'Cn1cc(C(c2ccc(F)cc2)n2ccnc2)c2cc(Br)ccc21',
 'CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2cc(Br)ccc21',
 'CCn1ccc2cc(C(c3ccccc3)n3ccnc3)ccc21',
 'N#Cc1ccc(Cn2cc(Cn3ccnc3)c3ccccc32)cc1',
 'CCCCCCN1C(=O)CCC(CC)(c2ccncc2)C1=O',
 'c1ccc2c(c1)CCC1C(c3cc[nH]n3)C21',
 'CCCCCCCC1(c2ccncc2)CCC(=O)NC1=O',
 'O=C1/C(=C/c2cccnn2)CCc2ccccc21',
 'O=C1/C(=C/c2ccnnc2)CCc2ccccc21',
 'C[C@]12CC[C@H]3[C@@H](CC=C4[C@H](O)CCC[C@@]43CO)[C@@H]1CCC2=O',
 'C[C@]12CC[C@H]3[C@@H](C[C@@H](O)C4=CCCC[C@@]43C)[C@@H]1CCC2=O',
 'O=C1/C(=C\\c2c[nH]cn2)CCc2ccccc21',
 'O=C1/C(=C/c2cccnc2)CCc2ccccc21',
 'O=C1/C(=C/c2ccncc2)CCc2ccccc21',
 ...

## Collect the *standard_value* values into a list

In [25]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)

standard_value

['7100.0',
 '50000.0',
 '238.0',
 '57.0',
 '54.0',
 '5400.0',
 '41.0',
 '78.5',
 '51.8',
 '205.0',
 '50.0',
 '6600.0',
 '51000.0',
 '3200.0',
 '250000.0',
 '103000.0',
 '6800.0',
 '50.0',
 '170.0',
 '9200.0',
 '4600.0',
 '49.0',
 '250000.0',
 '31000.0',
 '250000.0',
 '40000.0',
 '21000.0',
 '11000.0',
 '10000.0',
 '45000.0',
 '30000.0',
 '370.0',
 '8000.0',
 '14000.0',
 '260.0',
 '600.0',
 '1000.0',
 '13000.0',
 '31.0',
 '60.0',
 '800.0',
 '5000.0',
 '3000.0',
 '3600.0',
 '250000.0',
 '860.0',
 '150000.0',
 '940.0',
 '1500.0',
 '245000.0',
 '18000.0',
 '280.0',
 '190.0',
 '250000.0',
 '250000.0',
 '6000.0',
 '385000.0',
 '38000.0',
 '27000.0',
 '2500.0',
 '1100.0',
 '250000.0',
 '250000.0',
 '2400.0',
 '1400.0',
 '700.0',
 '10000.0',
 '250000.0',
 '6900.0',
 '4000.0',
 '250000.0',
 '250000.0',
 '660.0',
 '330.0',
 '600.0',
 '100000.0',
 '100000.0',
 '100000.0',
 '28500.0',
 '187.5',
 '176.0',
 '620.0',
 '3300.0',
 '31.0',
 '8000.0',
 '800.0',
 '56.2',
 '5200.0',
 '275.0',
 '1600.0',
 ...
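
The values above are strings (note the quotes), so they need converting before any numeric comparison. `pd.to_numeric` converts a whole column at once. Using `errors="coerce"` to turn unparseable entries into NaN is an assumption on my part, and the `"not available"` entry is made up to show the effect.

```python
import pandas as pd

# Three real entries from the list above plus one made-up unparseable entry.
raw = pd.Series(["7100.0", "50000.0", "238.0", "not available"])

numeric = pd.to_numeric(raw, errors="coerce")  # invalid strings become NaN
print(numeric.dtype)
print(int(numeric.isna().sum()), "value(s) could not be parsed")
```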

## Creating the pre-processed dataset with only the **4 required columns**

In [26]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,intermediate,7100.0
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,inactive,50000.0
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,active,238.0
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,active,57.0
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,active,54.0
...,...,...,...,...
2831,CHEMBL4874928,C=C[C@@]1(C)CC(=O)C2=C(CC[C@H]3C(C)(C)CCC[C@]2...,intermediate,7900.0
2832,CHEMBL4852023,CC(C)c1cc(O)c2c(c1)CC[C@H]1C(C)(C)CCC[C@]21C,intermediate,3700.0
2833,CHEMBL75,CC(=O)N1CCN(c2ccc(OC[C@H]3CO[C@](Cn4ccnc4)(c4c...,intermediate,2400.0
2834,CHEMBL1200374,C=C1C[C@@H]2[C@H](CC[C@]3(C)C(=O)CC[C@@H]23)[C...,active,23.0
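
Since all of the lists come from the same DataFrame, the four-column table can also be assembled by selecting columns directly, without intermediate lists. The sketch below uses a two-row stand-in for `df2` (rows 2 and 3 of the dataset) and the 1,000 nM / 10,000 nM thresholds stated earlier. The names `df2_toy`, `df3_toy` and `label` are hypothetical.

```python
import pandas as pd

# Two-row stand-in for df2, with one extra column that should be left out.
df2_toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL431859", "CHEMBL113637"],
    "canonical_smiles": [
        "CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21",
        "CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21",
    ],
    "standard_value": ["238.0", "57.0"],
    "assay_type": ["B", "B"],
})

def label(v):
    """Classify an IC50 in nM using the 1,000 / 10,000 nM thresholds."""
    v = float(v)
    return "active" if v <= 1000 else "inactive" if v >= 10000 else "intermediate"

cols = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
df3_toy = df2_toy[cols].reset_index(drop=True)
# Insert the class column at position 2 to match the column order used above.
df3_toy.insert(2, "bioactivity_class", df3_toy["standard_value"].map(label))
print(df3_toy)
```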


Saving the new DataFrame to a CSV file

In [27]:
df3.to_csv('bioactivity_preprocessed_data_1.csv', index=False)

In [28]:
! ls -l

total 1620
-rw-r--r-- 1 root root 1445319 Oct 29 22:47 bioactivity_data_1.csv
-rw-r--r-- 1 root root  204242 Oct 29 22:55 bioactivity_preprocessed_data_1.csv
drwx------ 5 root root    4096 Oct 29 22:48 gdrive
drwxr-xr-x 1 root root    4096 Oct 27 13:28 sample_data
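
Besides listing the file, a quick way to check the export is to read it back and look at its columns and shape. This is a sketch on a two-row stand-in written to a temporary folder; `df3_toy` is a hypothetical name. Note that `standard_value` comes back as a float column even though it was stored as strings.

```python
import os
import tempfile
import pandas as pd

# Two-row stand-in for df3.
df3_toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL431859", "CHEMBL113637"],
    "canonical_smiles": [
        "CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21",
        "CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21",
    ],
    "bioactivity_class": ["active", "active"],
    "standard_value": ["238.0", "57.0"],
})

# Write to a temporary folder, then read the file back.
path = os.path.join(tempfile.mkdtemp(), "bioactivity_preprocessed_data_1.csv")
df3_toy.to_csv(path, index=False)

reloaded = pd.read_csv(path)
print(list(reloaded.columns))
print(reloaded.shape)
print(reloaded["standard_value"].dtype)
```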


Copying the bioactivity_preprocessed_data_1.csv file to the 'data' folder in Google Drive

In [29]:
! cp bioactivity_preprocessed_data_1.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [30]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data_1.csv	bioactivity_preprocessed_data_1.csv
bioactivity_data.csv	bioactivity_preprocessed_data.csv
