# **Descriptor Calculation and Dataset Preparation**

In this section, we calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

zsh:1: command not found: wget
zsh:1: command not found: wget


In [2]:
! unzip padel.zip

unzip:  cannot find or open padel.zip, padel.zip.zip or padel.zip.ZIP.
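
Both shell commands above failed because `wget` was not available in that shell, so `padel.zip` was never downloaded. As a rough alternative, the same two steps can be sketched with the Python standard library alone. The helper names `fetch` and `extract` are hypothetical and not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch(url, dest):
    """Download `url` to `dest` unless the file is already present (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, path)
    return path


def extract(zip_path, target_dir="."):
    """Unpack a zip archive into `target_dir` and return the archived member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Intended use with the URLs from the cell above (needs network access):
# base = "https://github.com/dataprofessor/bioinformatics/raw/master/"
# extract(fetch(base + "padel.zip", "padel.zip"))
# fetch(base + "padel.sh", "padel.sh")
```

Skipping the download when the file already exists keeps re-running the cell cheap; delete the local file to force a fresh copy.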


## **Load bioactivity data**

In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('/content/df_2class-2.csv')

In [5]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL1771409,Cc1cc(N/N=C/c2ccc(O)c(O)c2)nc2ccccc12,inactive,293.326,3.40042,3.0,5.0,4.879426
1,2,CHEMBL1933288,C[C@@H]1CCNC(=O)c2cc3ccc(C(=O)Nc4nc5ccccc5n4CC...,active,458.566,3.88950,2.0,6.0,7.795880
2,3,CHEMBL2012582,COc1cc(C(=O)N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc2c...,active,570.698,3.43870,1.0,9.0,7.886057
3,4,CHEMBL509032,COc1cc(N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc(Cl)c(N...,active,614.216,5.02410,2.0,10.0,8.107905
4,5,CHEMBL2012582,COc1cc(C(=O)N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc2c...,active,570.698,3.43870,1.0,9.0,8.221849
...,...,...,...,...,...,...,...,...,...
1950,2157,CHEMBL5086901,Cc1nn(CC(F)(F)F)cc1Nc1ncc(C2CC2)c(NCCc2c[nH]cn...,active,406.416,3.54252,3.0,7.0,7.050610
1951,2158,CHEMBL5079601,Cc1nn(CC(F)(F)F)cc1Nc1ncc(C2CC2)c(NCCCNC(=O)C2...,active,451.497,3.88312,3.0,7.0,7.721246
1952,2159,CHEMBL4446892,Cc1nn(CC(F)(F)F)cc1Nc1ncc(Br)c(NCc2ccc(S(N)(=O...,active,520.335,3.30942,3.0,8.0,6.798603
1953,2160,CHEMBL5086901,Cc1nn(CC(F)(F)F)cc1Nc1ncc(C2CC2)c(NCCc2c[nH]cn...,inactive,406.416,3.54252,3.0,7.0,5.000000


In [6]:
# Write SMILES and ChEMBL ID as tab-separated columns (no header, no index) for PaDEL input
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [7]:
! cat molecule.smi | head -5

Cc1cc(N/N=C/c2ccc(O)c(O)c2)nc2ccccc12	CHEMBL1771409
C[C@@H]1CCNC(=O)c2cc3ccc(C(=O)Nc4nc5ccccc5n4CCCN(C)C)cc3n21	CHEMBL1933288
COc1cc(C(=O)N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc2c(n1)N(C)c1ccccc1C(=O)N2C	CHEMBL2012582
COc1cc(N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc(Cl)c(Nc2ccccc2S(=O)(=O)C(C)C)n1	CHEMBL509032
COc1cc(C(=O)N2CCC(N3CCN(C)CC3)CC2)ccc1Nc1ncc2c(n1)N(C)c1ccccc1C(=O)N2C	CHEMBL2012582


In [8]:
! cat molecule.smi | wc -l

1955


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [9]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
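
The script is a single `java` invocation. For readers who prefer to launch it from Python, the sketch below rebuilds the same argument list so that the paths and heap size become parameters. The flags are copied verbatim from `padel.sh`; the function name `padel_command` and its parameter names are hypothetical.

```python
import subprocess


def padel_command(jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
                  descriptor_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
                  input_dir="./",
                  output_csv="descriptors_output.csv",
                  heap="1G"):
    """Return the argument list equivalent to the java line in padel.sh."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]


cmd = padel_command()
print(" ".join(cmd))
# To actually run it (requires Java and the unpacked PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a path ever contains spaces.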


In [10]:
! bash padel.sh

Processing CHEMBL1933288 in molecule.smi (2/1955). 
Processing CHEMBL1771409 in molecule.smi (1/1955). 
Processing CHEMBL509032 in molecule.smi (4/1955). Average speed: 3.22 s/mol.
Processing CHEMBL2012582 in molecule.smi (3/1955). Average speed: 6.19 s/mol.
Processing CHEMBL2012582 in molecule.smi (5/1955). Average speed: 2.88 s/mol.
Processing CHEMBL509032 in molecule.smi (6/1955). Average speed: 2.22 s/mol.
Processing CHEMBL509032 in molecule.smi (7/1955). Average speed: 2.32 s/mol.
Processing CHEMBL509032 in molecule.smi (8/1955). Average speed: 1.99 s/mol.
Processing CHEMBL2170016 in molecule.smi (9/1955). Average speed: 1.86 s/mol.
Processing CHEMBL2170016 in molecule.smi (10/1955). Average speed: 1.66 s/mol.
Processing CHEMBL2170016 in molecule.smi (11/1955). Average speed: 1.51 s/mol.
Processing CHEMBL2178141 in molecule.smi (13/1955). Average speed: 1.27 s/mol.
Processing CHEMBL2170016 in molecule.smi (12/1955). Average speed: 1.38 s/mol.
Processing CHEMBL2178139 in molecule.s

In [11]:
! ls -l

total 28968
-rw-r--r-- 1 root root  3487317 Apr  3 11:07 descriptors_output.csv
-rw-r--r-- 1 root root   261196 Apr  3 11:01 df_2class-2.csv
drwxr-xr-x 3 root root     4096 Apr  3 11:01 __MACOSX
-rw-r--r-- 1 root root   122346 Apr  3 11:01 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Apr  3 11:01 padel.sh
-rw-r--r-- 1 root root 25768637 Apr  3 11:01 padel.zip
drwxr-xr-x 1 root root     4096 Mar 30 13:53 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
df3_X = pd.read_csv('descriptors_output.csv')

In [13]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1771409,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL1933288,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL509032,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL2012582,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL2012582,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1950,CHEMBL5086901,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1951,CHEMBL4446892,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1952,CHEMBL5079601,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1953,CHEMBL5086901,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
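
Note that the `Name` column above does not follow the order of `molecule.smi` everywhere: rows 2 and 3 list CHEMBL509032 before CHEMBL2012582, whereas the input file has them the other way round. Before discarding `Name`, it may be worth checking the order explicitly. A minimal sketch, using only the first five IDs shown in the outputs above (`same_order` is a hypothetical helper):

```python
import pandas as pd


def same_order(ids_in, ids_out):
    """True if both ID sequences have the same length and the same element at every position."""
    return list(ids_in) == list(ids_out)


# First five IDs as written to molecule.smi
ids_in = pd.Series(["CHEMBL1771409", "CHEMBL1933288", "CHEMBL2012582",
                    "CHEMBL509032", "CHEMBL2012582"])
# First five IDs as they appear in descriptors_output.csv
ids_out = pd.Series(["CHEMBL1771409", "CHEMBL1933288", "CHEMBL509032",
                     "CHEMBL2012582", "CHEMBL2012582"])

print(same_order(ids_in, ids_out))        # False
mismatch = ids_in[ids_in != ids_out]      # rows whose IDs differ by position
print(mismatch.index.tolist())            # [2, 3]
```

On the full data, the same comparison would be `same_order(df3['molecule_chembl_id'], df3_X['Name'])`, run before the `Name` column is dropped.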


In [14]:
# Drop the identifier column so that only the fingerprint bits remain as features
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1950,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1951,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1952,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1953,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


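### **Y variable**

The bioactivity data already contains a `pIC50` column, so it is selected directly as the Y variable; no conversion from IC50 is carried out in this section.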

In [15]:
# Use the precomputed pIC50 column as the response variable
df3_Y = df3['pIC50']
df3_Y

0       4.879426
1       7.795880
2       7.886057
3       8.107905
4       8.221849
          ...   
1950    7.050610
1951    7.721246
1952    6.798603
1953    5.000000
1954    5.000000
Name: pIC50, Length: 1955, dtype: float64
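
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. A minimal sketch of that conversion, assuming the IC50 values are given in nM (the units of the original measurements are not shown in this section, and `pic50_from_nM` is a hypothetical helper):

```python
import numpy as np


def pic50_from_nM(ic50_nM):
    """Convert IC50 values in nM to pIC50 = -log10(IC50 in mol/L)."""
    molar = np.asarray(ic50_nM, dtype=float) * 1e-9  # nM -> M
    return -np.log10(molar)


# 10,000 nM corresponds to a pIC50 of about 5; 10 nM to about 8
values = pic50_from_nM([10000.0, 10.0])
print(values)
```

Higher pIC50 therefore means a more potent compound, which is why the log scale is convenient as a regression target.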

## **Combining X and Y variables**

In [16]:
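# concat with axis=1 pairs rows by index label, so df3_X and df3_Y must list molecules in the same order
dataset3 = pd.concat([df3_X,df3_Y], axis=1)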
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.879426
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.795880
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.886057
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.107905
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.221849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1950,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.050610
1951,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.721246
1952,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.798603
1953,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
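
`pd.concat(..., axis=1)` pairs rows purely by index label, so it assumes the descriptor rows are in the same order as the bioactivity rows. The `Name` column displayed earlier deviates from the `molecule.smi` order in places (rows 2 and 3 are swapped), so joining on the molecule ID may be the safer option. The sketch below is one possible way to do this on toy data, not the notebook's own method. It assumes that rows sharing a ChEMBL ID also share a structure, and hence a fingerprint, as is the case for the CHEMBL2012582 rows shown above. `align_by_id` and the toy column names `FP0`/`FP1` are hypothetical.

```python
import pandas as pd


def align_by_id(activity, descriptors,
                id_col="molecule_chembl_id", name_col="Name", y_col="pIC50"):
    """Attach fingerprints to each activity row by molecule ID, keeping the activity row order."""
    # One fingerprint row per ID is enough under the assumption stated above
    unique_fp = descriptors.drop_duplicates(subset=name_col)
    merged = activity[[id_col, y_col]].merge(
        unique_fp, left_on=id_col, right_on=name_col,
        how="left", validate="many_to_one",
    )
    fp_cols = [c for c in unique_fp.columns if c != name_col]
    return merged[fp_cols + [y_col]]


# Toy data: descriptor rows arrive in a different order than the activity rows
activity = pd.DataFrame({"molecule_chembl_id": ["A", "B", "A"],
                         "pIC50": [5.0, 7.0, 6.0]})
descriptors = pd.DataFrame({"Name": ["B", "A", "A"],
                            "FP0": [1, 0, 0],
                            "FP1": [1, 1, 1]})

aligned = align_by_id(activity, descriptors)
print(aligned)
```

A left join keeps one output row per activity row in the original order, so repeated measurements of the same compound each retain their own pIC50 value.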


In [17]:
dataset3.to_csv('dardarin_class_pIC50_pubchem_fp.csv', index=False)