
# **Prediction of Novel Small Molecules as Potential TGR5/GLP1 Agonists in Type 2 Diabetes Treatment [Part 3] Descriptor Calculation and Dataset Preparation**

Moses Ainembabazi
[*'mozey256' GitHub*](https://github.com/mozey256/TGR5)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model from ChEMBL bioactivity data.

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-02-20 15:35:41--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-02-20 15:35:42--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-02-20 15:35:44 (47.2 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-02-20 15:35:44--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (g

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_data_3class_pEC50.csv** file, which contains the pEC50 values that serve as the target for the regression model.
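
For reference, pEC50 is the negative base-10 logarithm of EC50 expressed in molar units. The sketch below shows that conversion, assuming the EC50 values are given in nM. The helper name `pec50_from_nM` is ours for illustration and is not taken from Parts 1 and 2.

```python
import math

def pec50_from_nM(ec50_nM):
    """Convert an EC50 in nanomolar to pEC50 = -log10(EC50 in molar)."""
    ec50_molar = ec50_nM * 1e-9  # nM -> M (assumes the input unit is nM)
    return -math.log10(ec50_molar)

# Illustrative inputs only; these are not values from the dataset
for ec50 in (10000.0, 1000.0, 100.0):
    print(ec50, 'nM ->', round(pec50_from_nM(ec50), 3))
```

A tenfold drop in EC50 raises pEC50 by one unit, so more potent compounds have larger pEC50 values.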

In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('/content/bioactivity_data_3class_pEC50.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pEC50
0,CHEMBL566315,CC[C@H]1[C@@H](O)[C@@H]2[C@H](CC[C@]3(C)[C@@H]...,active,420.634,5.1140,3.0,3.0,6.122053
1,CHEMBL388679,C[C@H](C[C@@H](C)[C@H]1CC[C@H]2[C@@H]3[C@H](O)...,inactive,422.606,3.6947,4.0,4.0,4.284833
2,CHEMBL245001,C[C@H](C[C@@H](C)[C@H]1CC[C@H]2[C@@H]3[C@H](O)...,inactive,406.607,4.7239,3.0,3.0,4.593460
3,CHEMBL244785,CC[C@@H]1C2C[C@H](O)CC[C@]2(C)[C@H]2CC[C@]3(C)...,active,434.661,5.3600,3.0,3.0,7.022276
4,CHEMBL244784,C[C@H](C[C@H](C)C(=O)O)[C@H]1CC[C@H]2[C@@H]3[C...,active,420.634,4.9699,3.0,3.0,6.853872
...,...,...,...,...,...,...,...,...
714,CHEMBL5189181,CCCOc1cccc(OCc2ccc3ccccc3n2)c1,active,293.366,4.6026,0.0,3.0,6.000000
715,CHEMBL5192002,CCC(C)Oc1cccc(OCc2ccc3ccccc3n2)c1,active,307.393,4.9911,0.0,3.0,7.000000
716,CHEMBL18775,CCCCOc1cccc(OCc2ccc3ccccc3n2)c1,active,307.393,4.9927,0.0,3.0,6.301030
717,CHEMBL5173976,CCC(C)COc1cccc(OCc2ccc3ccccc3n2)c1,active,321.420,5.2387,0.0,3.0,6.769551


In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! head -5 molecule.smi

CC[C@H]1[C@@H](O)[C@@H]2[C@H](CC[C@]3(C)[C@@H]([C@H](C)CCC(=O)O)CC[C@@H]23)[C@@]2(C)CC[C@@H](O)C[C@@H]12	CHEMBL566315
C[C@H](C[C@@H](C)[C@H]1CC[C@H]2[C@@H]3[C@H](O)CC4C[C@H](O)CC[C@]4(C)[C@H]3C[C@H](O)[C@]12C)C(=O)O	CHEMBL388679
C[C@H](C[C@@H](C)[C@H]1CC[C@H]2[C@@H]3[C@H](O)CC4C[C@H](O)CC[C@]4(C)[C@H]3CC[C@]12C)C(=O)O	CHEMBL245001
CC[C@@H]1C2C[C@H](O)CC[C@]2(C)[C@H]2CC[C@]3(C)[C@@H]([C@H](C)C[C@H](C)C(=O)O)CC[C@H]3[C@@H]2[C@@H]1O	CHEMBL244785
C[C@H](C[C@H](C)C(=O)O)[C@H]1CC[C@H]2[C@@H]3[C@H](O)[C@H](C)C4C[C@H](O)CC[C@]4(C)[C@H]3CC[C@]12C	CHEMBL244784


In [None]:
! wc -l < molecule.smi

719
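
The SMILES file written above has two tab-separated columns, the SMILES string followed by the compound ID, with no header row. The sketch below reproduces that layout on toy data and checks it. The IDs and SMILES are made up for illustration, and the file is written to an in-memory buffer rather than to disk.

```python
import io
import pandas as pd

# Toy stand-in for df3 (illustrative IDs and SMILES only)
toy = pd.DataFrame({
    'molecule_chembl_id': ['MOL_A', 'MOL_B'],
    'canonical_smiles': ['CCO', 'c1ccccc1'],
    'pEC50': [5.0, 6.0],
})

# Same call pattern as above: SMILES first, ID second, tab-separated, no header or index
buffer = io.StringIO()
toy[['canonical_smiles', 'molecule_chembl_id']].to_csv(
    buffer, sep='\t', index=False, header=False)
smi_lines = buffer.getvalue().splitlines()

# Every line should split into exactly two fields: SMILES and name
parsed = [line.split('\t') for line in smi_lines]
assert all(len(fields) == 2 for fields in parsed)
print(parsed)
```

The line count of the file should equal the number of rows in the DataFrame, which is what the `wc -l` check above confirms.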


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
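
The script is a single `java` call. As we read the flags, `-descriptortypes` selects the fingerprint definition file, `-dir` is the folder scanned for input structures, and `-file` names the output CSV. The sketch below builds the same argument list in Python so that the fingerprint XML can be swapped, for example for `MACCSFingerprinter.xml` from the unzipped archive. The helper `padel_command` is hypothetical and is not part of PaDEL.

```python
import shlex

def padel_command(fingerprint_xml, output_csv='descriptors_output.csv'):
    """Build the argument list from padel.sh, with the fingerprint XML as a parameter."""
    return [
        'java', '-Xms1G', '-Xmx1G', '-Djava.awt.headless=true',
        '-jar', './PaDEL-Descriptor/PaDEL-Descriptor.jar',
        '-removesalt', '-standardizenitro', '-fingerprints',
        '-descriptortypes', f'./PaDEL-Descriptor/{fingerprint_xml}',
        '-dir', './',
        '-file', output_csv,
    ]

cmd = padel_command('PubchemFingerprinter.xml')
print(shlex.join(cmd))  # should match the line in padel.sh

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess; subprocess.run(cmd, check=True)
```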


In [None]:
! bash padel.sh

Processing CHEMBL388679 in molecule.smi (2/719). 
Processing CHEMBL566315 in molecule.smi (1/719). 
Processing CHEMBL245001 in molecule.smi (3/719). Average speed: 7.51 s/mol.
Processing CHEMBL244785 in molecule.smi (4/719). Average speed: 4.01 s/mol.
Processing CHEMBL244784 in molecule.smi (5/719). Average speed: 3.21 s/mol.
Processing CHEMBL135 in molecule.smi (7/719). Average speed: 2.20 s/mol.
Processing CHEMBL205596 in molecule.smi (6/719). Average speed: 2.55 s/mol.
Processing CHEMBL259898 in molecule.smi (9/719). Average speed: 1.70 s/mol.
Processing CHEMBL386630 in molecule.smi (8/719). Average speed: 1.91 s/mol.
Processing CHEMBL407717 in molecule.smi (10/719). Average speed: 1.52 s/mol.
Processing CHEMBL269897 in molecule.smi (11/719). Average speed: 1.42 s/mol.
Processing CHEMBL1254991 in molecule.smi (12/719). Average speed: 1.31 s/mol.
Processing CHEMBL1254990 in molecule.smi (13/719). Average speed: 1.24 s/mol.
Processing CHEMBL408445 in molecule.smi (14/719). Average spe

In [None]:
! ls -l

total 26612
-rw-r--r-- 1 root root   108718 Feb 20 15:30 bioactivity_data_3class_pEC50.csv
-rw-r--r-- 1 root root  1289543 Feb 20 15:41 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Feb 20 15:36 __MACOSX
-rw-r--r-- 1 root root    60542 Feb 20 15:36 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Feb 20 15:35 padel.sh
-rw-r--r-- 1 root root 25768637 Feb 20 15:35 padel.zip
drwxr-xr-x 1 root root     4096 Feb 14 14:28 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL388679,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL566315,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL245001,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL244785,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL244784,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
714,CHEMBL5189181,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
715,CHEMBL5192002,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
716,CHEMBL18775,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
717,CHEMBL5173976,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
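# Realign rows to df3 by ChEMBL ID (PaDEL's output order can differ from the input order); this also drops Name
df3_X = df3_X.drop_duplicates('Name').set_index('Name').reindex(df3['molecule_chembl_id']).reset_index(drop=True)
df3_X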

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
714,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
715,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
716,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
717,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pEC50 values were already computed in the earlier parts, so here we only select the `pEC50` column.

In [None]:
df3_Y = df3['pEC50']
df3_Y

0      6.122053
1      4.284833
2      4.593460
3      7.022276
4      6.853872
         ...   
714    6.000000
715    7.000000
716    6.301030
717    6.769551
718    5.000000
Name: pEC50, Length: 719, dtype: float64

## **Combining the X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pEC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.122053
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.284833
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.593460
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.022276
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.853872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
714,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.000000
715,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
716,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.301030
717,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.769551
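
`pd.concat(..., axis=1)` pairs rows by index label, not by compound ID, so X and Y must list the same compounds in the same order. The PaDEL log above shows molecules being processed slightly out of order, so this is worth checking. The sketch below checks and realigns on toy data. The IDs, column names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins: the descriptor rows come back in a different order than the activity rows
activity = pd.DataFrame({'molecule_chembl_id': ['MOL_A', 'MOL_B', 'MOL_C'],
                         'pEC50': [6.1, 4.3, 4.6]})
descriptors = pd.DataFrame({'Name': ['MOL_B', 'MOL_A', 'MOL_C'],
                            'FP0': [0, 1, 1]})

# Check whether the two tables list the compounds in the same order
same_order = descriptors['Name'].tolist() == activity['molecule_chembl_id'].tolist()
print('Same order:', same_order)

# Realign the descriptor rows to the activity table by ID
aligned = (descriptors.set_index('Name')
                      .reindex(activity['molecule_chembl_id'])
                      .reset_index(drop=True))

# Positional concat is now safe: row i refers to the same compound in both tables
combined = pd.concat([aligned, activity['pEC50']], axis=1)
print(combined)
```

Without the realignment step, `MOL_A` would silently receive the fingerprint of `MOL_B`, and no error would be raised.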


In [None]:
dataset3.to_csv('06_bioactivity_data_3class_pEC50_pubchem_fp.csv', index=False)
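
The saved file holds the fingerprint columns followed by `pEC50` as the last column. The sketch below shows how a later notebook might reload it and split it back into X and Y. It uses a tiny in-memory CSV as a stand-in for the real file, and its values are illustrative.

```python
import io
import pandas as pd

# Toy stand-in for the saved CSV: fingerprint columns, then pEC50 last
csv_text = "PubchemFP0,PubchemFP1,pEC50\n1,0,6.0\n1,1,7.0\n"
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.drop(columns=['pEC50'])  # descriptor matrix
Y = dataset['pEC50']                 # target vector
print(X.shape, Y.shape)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `'06_bioactivity_data_3class_pEC50_pubchem_fp.csv'`.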