# **Bioinformatics Project - Computational Drug Discovery - Descriptor Calculation and Dataset Preparation**


In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset, and assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-04-08 06:56:43--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-04-08 06:56:44--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-04-08 06:56:45 (194 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-04-08 06:56:45--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('bioactivity_data_pIC50.csv')

In [5]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL156630,C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21,inactive,426.314,4.59450,1.0,4.0,3.000000
1,CHEMBL155754,C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NC)c(Cl)c21,inactive,305.765,2.91500,1.0,4.0,3.000000
2,CHEMBL350093,N#CCCN1CC(=O)OC(c2ccc(OCc3ccccc3)cc2)=N1,active,335.363,2.69968,0.0,6.0,7.744727
3,CHEMBL161907,O=c1c(=O)c2ccc(OCCCC(F)(F)F)cc2c1=O,active,286.205,1.51730,0.0,4.0,8.045757
4,CHEMBL17079,N#CCCn1nc(-c2ccc(OCc3ccccc3)cc2)oc1=S,active,337.404,4.36527,0.0,6.0,8.356547
...,...,...,...,...,...,...,...,...
3930,CHEMBL5075834,C#CCNC(=O)/C=C/c1ccc(O)c(OC)c1,inactive,231.251,1.16340,2.0,3.0,4.301030
3931,CHEMBL4284618,C#CCNC(=O)/C=C/c1ccc(O)c(O)c1,inactive,217.224,0.86040,3.0,3.0,4.301030
3932,CHEMBL5077905,C#CCNC/C=C/c1ccc(O)c(OC)c1,inactive,217.268,1.63680,2.0,3.0,4.301030
3933,CHEMBL165,Oc1ccc(/C=C/c2cc(O)cc(O)c2)cc1,inactive,228.247,2.97380,3.0,3.0,4.524329


In [6]:
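# Write a .smi file for PaDEL: SMILES first, then the molecule ID,
# tab-separated, with no header row and no index column.
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)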

In [7]:
! head -5 molecule.smi

C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21	CHEMBL156630
C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NC)c(Cl)c21	CHEMBL155754
N#CCCN1CC(=O)OC(c2ccc(OCc3ccccc3)cc2)=N1	CHEMBL350093
O=c1c(=O)c2ccc(OCCCC(F)(F)F)cc2c1=O	CHEMBL161907
N#CCCn1nc(-c2ccc(OCc3ccccc3)cc2)oc1=S	CHEMBL17079


In [8]:
! cat molecule.smi | wc -l

3935
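
The same check can be done from Python. This is a minimal sketch, assuming a toy two-row frame in place of `df3` and a temporary file in place of `molecule.smi` (both `toy` and `smi_path` are hypothetical names). It writes SMILES and ID tab-separated without a header, then confirms that the file has one line per molecule and two fields per line.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3; the two SMILES/ID pairs are copied from the table above.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL156630", "CHEMBL155754"],
    "canonical_smiles": [
        "C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NCc3ccccc3)c(Br)c21",
        "C/N=C1/CCc2c1n(C)c1ccc(OC(=O)NC)c(Cl)c21",
    ],
})

# Hypothetical temporary path used instead of molecule.smi.
smi_path = os.path.join(tempfile.mkdtemp(), "toy_molecule.smi")

# SMILES first, ID second, tab-separated, no header or index.
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    smi_path, sep="\t", index=False, header=False
)

with open(smi_path) as handle:
    lines = handle.read().splitlines()

# One line per molecule, each with exactly two tab-separated fields.
n_lines = len(lines)
fields_per_line = [len(line.split("\t")) for line in lines]
first_id = lines[0].split("\t")[1]
print(n_lines, fields_per_line, first_id)
```

On the real data, the line count should equal `len(df3)`, which matches the 3935 reported by `wc -l` above.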


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
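
For readers who prefer to drive PaDEL from Python, the command in `padel.sh` can be assembled as an argument list. This is a sketch only: `build_padel_command` is a hypothetical helper, and the flag descriptions marked "assumed" in the comments are our reading of the PaDEL-Descriptor command-line options rather than anything stated in this notebook. The command is printed rather than executed, because running it requires Java and the unzipped `PaDEL-Descriptor` folder.

```python
import shlex


def build_padel_command(padel_dir="./PaDEL-Descriptor",
                        input_dir="./",
                        output_csv="descriptors_output.csv",
                        heap="1G"):
    """Return an argument list equivalent to the padel.sh one-liner."""
    return [
        "java",
        f"-Xms{heap}", f"-Xmx{heap}",      # initial and maximum JVM heap size
        "-Djava.awt.headless=true",        # run without a graphical display
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt",                     # assumed: strip salts / counter-ions
        "-standardizenitro",               # assumed: normalise nitro groups
        "-fingerprints",                   # assumed: compute fingerprints
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",
        "-dir", input_dir,                 # assumed: folder scanned for structure files
        "-file", output_csv,               # assumed: CSV that receives the results
    ]


cmd = build_padel_command()
command_line = shlex.join(cmd)
print(command_line)
# To actually run it (needs Java installed): subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues if the paths ever contain spaces.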


In [10]:
! bash padel.sh

Processing CHEMBL155754 in molecule.smi (2/3935). 
Processing CHEMBL156630 in molecule.smi (1/3935). 
Processing CHEMBL350093 in molecule.smi (3/3935). Average speed: 6.91 s/mol.
Processing CHEMBL161907 in molecule.smi (4/3935). Average speed: 8.03 s/mol.
Processing CHEMBL17079 in molecule.smi (5/3935). Average speed: 3.12 s/mol.
Processing CHEMBL348083 in molecule.smi (6/3935). Average speed: 2.48 s/mol.
Processing CHEMBL157182 in molecule.smi (7/3935). Average speed: 2.30 s/mol.
Processing CHEMBL160347 in molecule.smi (8/3935). Average speed: 1.94 s/mol.
Processing CHEMBL347197 in molecule.smi (9/3935). Average speed: 1.80 s/mol.
Processing CHEMBL160219 in molecule.smi (10/3935). Average speed: 1.61 s/mol.
Processing CHEMBL445916 in molecule.smi (11/3935). Average speed: 1.48 s/mol.
Processing CHEMBL158041 in molecule.smi (12/3935). Average speed: 1.45 s/mol.
Processing CHEMBL434261 in molecule.smi (13/3935). Average speed: 1.33 s/mol.
Processing CHEMBL348607 in molecule.smi (14/3935

In [11]:
! ls -l

total 32696
-rw-r--r-- 1 root root   465867 Apr  8 06:56 bioactivity_data_pIC50.csv
-rw-r--r-- 1 root root  7006259 Apr  8 07:08 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Apr  8 06:56 __MACOSX
-rw-r--r-- 1 root root   216170 Apr  8 06:56 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Apr  8 06:56 padel.sh
-rw-r--r-- 1 root root 25768637 Apr  8 06:56 padel.zip
drwxr-xr-x 1 root root     4096 Apr  6 13:39 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
df3_X = pd.read_csv('descriptors_output.csv')

In [13]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL155754,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL156630,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL350093,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL161907,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL348083,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3930,CHEMBL5075834,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3931,CHEMBL4284618,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3932,CHEMBL5077905,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3933,CHEMBL86304,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3930,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3931,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3932,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3933,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
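
Before using the matrix, it can be worth confirming that every remaining column is a binary fingerprint bit with no missing values. The sketch below runs on a toy frame: the three-column `toy_X` is invented for illustration, whereas the real `df3_X` has columns `PubchemFP0` through `PubchemFP880`.

```python
import pandas as pd

# Toy stand-in for the PaDEL output (values invented for illustration).
toy_X = pd.DataFrame({
    "Name": ["CHEMBL155754", "CHEMBL156630", "CHEMBL350093"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [1, 0, 1],
    "PubchemFP2": [0, 0, 1],
})

# Drop the identifier column, as in the cell above.
features = toy_X.drop(columns=["Name"])

# Every column should be a fingerprint bit holding only 0 or 1.
all_fp_columns = all(col.startswith("PubchemFP") for col in features.columns)
is_binary = bool(features.isin([0, 1]).all().all())
n_missing = int(features.isna().sum().sum())
print(features.shape, all_fp_columns, is_binary, n_missing)
```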


### **Y variable**

In [15]:
df3_Y = df3['pIC50']
df3_Y

0       3.000000
1       3.000000
2       7.744727
3       8.045757
4       8.356547
          ...   
3930    4.301030
3931    4.301030
3932    4.301030
3933    4.524329
3934    4.000000
Name: pIC50, Length: 3935, dtype: float64

## **Combining X and Y variables**

In [16]:
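# Note: concat with axis=1 pairs rows by row index, not by molecule ID,
# so this assumes descriptors_output.csv lists molecules in the same order as df3.
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3 = dataset3.dropna()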
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000000
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000000
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.744727
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.045757
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.356547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3930,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3931,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3932,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3933,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.524329
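
The PaDEL output shown earlier does not list molecules in the same order as `df3`: CHEMBL155754 comes first in `descriptors_output.csv`, while CHEMBL156630 comes first in `df3`. Pairing rows by position can therefore attach a pIC50 value to the wrong fingerprint. One way to guard against this is to join on the molecule ID before dropping the `Name` column. The sketch below uses small toy frames; `toy_fp`, `toy_bio`, `positional`, and `merged` are hypothetical names, and the row order of `toy_fp` is invented to make the mismatch visible.

```python
import pandas as pd

# Toy fingerprint table in an order that differs from the bioactivity table.
toy_fp = pd.DataFrame({
    "Name": ["CHEMBL161907", "CHEMBL155754", "CHEMBL350093"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [1, 1, 1],
    "PubchemFP2": [0, 1, 1],
})

toy_bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL155754", "CHEMBL350093", "CHEMBL161907"],
    "pIC50": [3.000000, 7.744727, 8.045757],
})

# Positional pairing: row 0 of toy_fp receives row 0 of toy_bio.
positional = pd.concat([toy_fp, toy_bio["pIC50"]], axis=1)

# Key-based pairing: each fingerprint row receives the pIC50 of its own ID.
merged = toy_fp.merge(
    toy_bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
).drop(columns=["Name", "molecule_chembl_id"])

# CHEMBL161907 is wrongly paired with 3.0 in the positional version.
print(positional.loc[0, ["Name", "pIC50"]].tolist())
print(merged["pIC50"].tolist())
```

`validate="one_to_one"` makes the merge raise an error if an ID appears more than once on either side; whether duplicates occur in the real data is not something this notebook shows.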


In [17]:
dataset3.to_csv('bioactivity_data_pIC50_pubchem_fp.csv', index=False)

In [18]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [19]:
! cp bioactivity_data_pIC50_pubchem_fp.csv "/content/gdrive/My Drive/Colab Notebooks/AI project"