<a href="https://colab.research.google.com/github/michaufsc/betalactam/blob/main/beta03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation
Chanin Nantasenamat

In Part 3, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.


---



Introduction: PaDEL-Descriptor is a software for calculating molecular descriptors and fingerprints. The software currently calculates 797 descriptors (663 1D, 2D descriptors, and 134 3D descriptors) and 10 types of fingerprints. These descriptors and fingerprints are calculated mainly using The Chemistry Development Kit. Some additional descriptors and fingerprints were added, which include atom type electrotopological state descriptors, McGowan volume, molecular linear free energy relation descriptors, ring counts, count of chemical substructures identified by Laggner, and binary fingerprints and count of chemical substructures identified by Klekota and Roth.
Methods: PaDEL-Descriptor was developed using the Java language and consists of a library component and an interface component. The library component allows it to be easily integrated into quantitative structure activity relationship software to provide the descriptor calculation feature while the interface component allows it to be used as a standalone software. The software uses a Master/Worker pattern to take advantage of the multiple CPU cores that are present in most modern computers to speed up calculations of molecular descriptors.
Results: The software has several advantages over existing standalone molecular descriptor calculation software. It is free and open source, has both graphical user interface and command line interfaces, can work on all major platforms (Windows, Linux, MacOS), supports more than 90 different molecular file formats, and is multithreaded.
Conclusion: PaDEL-Descriptor is a useful addition to the currently available molecular descriptor calculation software. The software can be downloaded at http://padel.nus.edu.sg/software/padeldescriptor.
© 2010 Wiley Periodicals, Inc. J Comput Chem 32: 1466–1474, 2011

In [23]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-08-10 23:47:44--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-08-10 23:47:44--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-08-10 23:47:45 (219 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-08-10 23:47:45--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [24]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

Load bioactivity data
Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the betalactamase_04_bioactivity_data_3class_pIC50.csv file, which contains the pIC50 values that will be used to build a regression model.

In [13]:
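# A github.com/.../blob/... URL returns the HTML page for the file (note "[text/html]" in the log below);
# the /raw/ form of the URL fetches the CSV itself.
! wget https://github.com/michaufsc/betalactam/raw/main/betalactamase_04_bioactivity_data_3class_pIC50.csv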

--2022-08-10 23:31:21--  https://github.com/michaufsc/betalactam/blob/main/betalactamase_04_bioactivity_data_3class_pIC50.csv
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘betalactamase_04_bioactivity_data_3class_pIC50.csv’

betalactamase_04_bi     [ <=>                ] 168.46K  --.-KB/s    in 0.02s   

2022-08-10 23:31:21 (8.43 MB/s) - ‘betalactamase_04_bioactivity_data_3class_pIC50.csv’ saved [172500]
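
The wget log above reports `[text/html]`, which suggests that the saved file is GitHub's web page rather than the CSV data. A minimal sketch of a check that could be run before `pd.read_csv`; `looks_like_html` is a hypothetical helper, and the two files written here are small stand-ins, not the real download.

```python
import os
import tempfile

def looks_like_html(path, probe_bytes=512):
    """Return True if the start of the file looks like an HTML page."""
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Stand-in files for illustration only (assumed content, not the real download)
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")
html_path = os.path.join(tmpdir, "page.csv")
with open(csv_path, "w") as fh:
    fh.write("molecule_chembl_id,pIC50\nCHEMBL777,7.530178\n")
with open(html_path, "w") as fh:
    fh.write("\n<!DOCTYPE html>\n<html><head></head><body></body></html>\n")

print(looks_like_html(csv_path))   # False
print(looks_like_html(html_path))  # True
```

If the check returns True for a downloaded file, the URL most likely points at a web page and not at the raw file.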



In [10]:
import pandas as pd

In [16]:
df3 = pd.read_csv('/content/drive/MyDrive/Alfavaca/betalactamase_04_bioactivity_data_3class_pIC50.csv')

In [17]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL3949733,COC(=O)C1/C(=C(\C=O)C/C=C/c2ccccc2)O[C@@H]2CC(...,active,327.336,1.6731,0.0,5.0,6.194227
1,1,CHEMBL777,O=C(O)[C@H]1/C(=C/CO)O[C@@H]2CC(=O)N21,active,199.162,-1.0956,2.0,4.0,7.530178
2,2,CHEMBL403,CC1(C)[C@H](C(=O)O)N2C(=O)C[C@H]2S1(=O)=O,intermediate,233.245,-0.795,1.0,4.0,5.972284
3,3,CHEMBL404,C[C@]1(Cn2ccnn2)[C@H](C(=O)O)N2C(=O)C[C@H]2S1(...,active,300.296,-1.5232,1.0,7.0,7.764472
4,4,CHEMBL4114663,C=CC/C(C=O)=C1\O[C@@H]2CC(=O)N2C1C(=O)OCc1ccccc1,active,327.336,1.7161,0.0,5.0,7.176526
5,5,CHEMBL4114809,C=CCC1(C(=O)OCc2ccccc2)/C(=C/C=O)O[C@@H]2CC(=O...,active,327.336,1.7161,0.0,5.0,6.024384
6,6,CHEMBL4114670,C=CC/C(C=O)=C1\O[C@@H]2CC(=O)N2C1(CC=C)C(=O)OC...,intermediate,367.401,2.6624,0.0,5.0,5.07643
7,7,CHEMBL4114713,O=C/C(C/C=C/c1ccccc1)=C1/O[C@@H]2CC(=O)N2C1C(=...,active,403.434,3.2435,0.0,5.0,7.05061
8,8,CHEMBL3917939,O=C/C=C1\O[C@@H]2CC(=O)N2C1(C/C=C/c1ccccc1)C(=...,intermediate,403.434,3.2435,0.0,5.0,5.307744
9,9,CHEMBL4114757,O=C/C(C/C=C/CO)=C1/O[C@@H]2CC(=O)N2C1C(=O)OCc1...,active,357.362,1.0786,1.0,6.0,7.105684


In [18]:
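# A .smi file holds one molecule per line: the SMILES string, a separator, then the molecule name
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
# Tab-separated, without the index or a header row, so every line is a molecule
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)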

In [25]:
! cat molecule.smi

COC(=O)C1/C(=C(\C=O)C/C=C/c2ccccc2)O[C@@H]2CC(=O)N12	CHEMBL3949733
O=C(O)[C@H]1/C(=C/CO)O[C@@H]2CC(=O)N21	CHEMBL777
CC1(C)[C@H](C(=O)O)N2C(=O)C[C@H]2S1(=O)=O	CHEMBL403
C[C@]1(Cn2ccnn2)[C@H](C(=O)O)N2C(=O)C[C@H]2S1(=O)=O	CHEMBL404
C=CC/C(C=O)=C1\O[C@@H]2CC(=O)N2C1C(=O)OCc1ccccc1	CHEMBL4114663
C=CCC1(C(=O)OCc2ccccc2)/C(=C/C=O)O[C@@H]2CC(=O)N21	CHEMBL4114809
C=CC/C(C=O)=C1\O[C@@H]2CC(=O)N2C1(CC=C)C(=O)OCc1ccccc1	CHEMBL4114670
O=C/C(C/C=C/c1ccccc1)=C1/O[C@@H]2CC(=O)N2C1C(=O)OCc1ccccc1	CHEMBL4114713
O=C/C=C1\O[C@@H]2CC(=O)N2C1(C/C=C/c1ccccc1)C(=O)OCc1ccccc1	CHEMBL3917939
O=C/C(C/C=C/CO)=C1/O[C@@H]2CC(=O)N2C1C(=O)OCc1ccccc1	CHEMBL4114757
O=C/C=C1\O[C@@H]2CC(=O)N2C1(C/C=C/CO)C(=O)OCc1ccccc1	CHEMBL4114673
CC(=O)OC/C=C/C/C(C=O)=C1\O[C@@H]2CC(=O)N2C1C(=O)OCc1ccccc1	CHEMBL4114790
CC(=O)OC/C=C/CC1(C(=O)OCc2ccccc2)/C(=C/C=O)O[C@@H]2CC(=O)N21	CHEMBL4114806
COC(=O)C1(C/C=C/c2ccccc2)/C(=C/C=O)O[C@@H]2CC(=O)N21	CHEMBL4114662
C=CC/C(C=O)=C1\O[C@@H]2CC(=O)N2C1C(=O)OC	CHEMBL4114788
C=CCC1(C(=O)OC)/C(=C/C=

In [26]:
! cat molecule.smi | wc -l

49
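
To make the layout of `molecule.smi` concrete, here is a small sketch that writes two rows (copied from the `df3` preview above) to an in-memory buffer with the same `to_csv` settings and splits each line back into its SMILES and ID. The frame `df_demo` is a stand-in for illustration only.

```python
import io
import pandas as pd

# Two rows copied from the df3 preview; df_demo is a stand-in, not the full dataset
df_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL777", "CHEMBL403"],
    "canonical_smiles": [
        "O=C(O)[C@H]1/C(=C/CO)O[C@@H]2CC(=O)N21",
        "CC1(C)[C@H](C(=O)O)N2C(=O)C[C@H]2S1(=O)=O",
    ],
})

# Same settings as the notebook cell: SMILES first, tab-separated, no header or index
buffer = io.StringIO()
df_demo[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

lines = buffer.getvalue().splitlines()
for line in lines:
    smiles, name = line.split("\t")
    print(name, "->", smiles)

# One line per molecule, which is what `wc -l` counts
print(len(lines) == len(df_demo))
```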


Calculate fingerprint descriptors
Calculate PaDEL descriptors

In [27]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
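
In this command, `-descriptortypes` points to the XML file that selects which fingerprint is computed, `-dir` is the folder scanned for input molecules, and `-file` is the output CSV. The sketch below assembles the same argument list in Python so that the fingerprint XML can be swapped, for example for `MACCSFingerprinter.xml`, which appears in the unzip listing above. `build_padel_command` is a hypothetical helper and `maccs_output.csv` an assumed file name; actually running the command needs Java and the unzipped `PaDEL-Descriptor` folder.

```python
import shlex

def build_padel_command(fingerprint_xml="PubchemFingerprinter.xml",
                        input_dir="./",
                        output_csv="descriptors_output.csv",
                        heap="1G"):
    """Assemble the argument list from padel.sh with a swappable fingerprint XML."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"./PaDEL-Descriptor/{fingerprint_xml}",
        "-dir", input_dir,
        "-file", output_csv,
    ]

# Defaults reproduce the command line stored in padel.sh
cmd = build_padel_command()
print(shlex.join(cmd))

# Same pipeline with a different fingerprint definition (output name is an assumption)
maccs_cmd = build_padel_command("MACCSFingerprinter.xml", output_csv="maccs_output.csv")
print(shlex.join(maccs_cmd))

# To execute (not done here): import subprocess; subprocess.run(maccs_cmd, check=True)
```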


In [28]:
! bash padel.sh

Processing CHEMBL3949733 in molecule.smi (1/49). 
Processing CHEMBL777 in molecule.smi (2/49). 
Processing CHEMBL403 in molecule.smi (3/49). Average speed: 1.47 s/mol.
Processing CHEMBL404 in molecule.smi (4/49). Average speed: 0.87 s/mol.
Processing CHEMBL4114663 in molecule.smi (5/49). Average speed: 0.68 s/mol.
Processing CHEMBL4114809 in molecule.smi (6/49). Average speed: 0.52 s/mol.
Processing CHEMBL4114713 in molecule.smi (8/49). Average speed: 0.53 s/mol.
Processing CHEMBL4114670 in molecule.smi (7/49). Average speed: 0.53 s/mol.
Processing CHEMBL3917939 in molecule.smi (9/49). Average speed: 0.46 s/mol.
Processing CHEMBL4114757 in molecule.smi (10/49). Average speed: 0.44 s/mol.
Processing CHEMBL4114673 in molecule.smi (11/49). Average speed: 0.45 s/mol.
Processing CHEMBL4114790 in molecule.smi (12/49). Average speed: 0.41 s/mol.
Processing CHEMBL4114806 in molecule.smi (13/49). Average speed: 0.42 s/mol.
Processing CHEMBL4114662 in molecule.smi (14/49). Average speed: 0.38 s/

In [29]:
! ls -l

total 25464
-rw-r--r-- 1 root root   172500 Aug 10 23:31 betalactamase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root    98450 Aug 10 23:48 descriptors_output.csv
drwx------ 5 root root     4096 Aug 10 23:20 drive
drwxr-xr-x 3 root root     4096 Aug 10 23:47 __MACOSX
-rw-r--r-- 1 root root     3239 Aug 10 23:44 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Aug 10 23:47 padel.sh
-rw-r--r-- 1 root root 25768637 Aug 10 23:47 padel.zip
drwxr-xr-x 1 root root     4096 Aug  3 20:21 sample_data


Preparing the X and Y Data Matrices
X data matrix

In [30]:
df3_X = pd.read_csv('descriptors_output.csv')

In [31]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL777,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL403,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL3949733,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL404,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL4114809,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,CHEMBL4114663,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,CHEMBL4114670,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,CHEMBL4114713,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,CHEMBL3917939,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,CHEMBL4114757,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Y variable
The pIC50 values were already computed from IC50 during pre-processing, so here we only select the pIC50 column.

In [34]:
df3_Y = df3['pIC50']
df3_Y

0     6.194227
1     7.530178
2     5.972284
3     7.764472
4     7.176526
5     6.024384
6     5.076430
7     7.050610
8     5.307744
9     7.105684
10    5.248552
11    6.854182
12    5.136279
13    4.707035
14    7.185752
15    4.877470
16    7.567031
17    6.060181
18    7.514279
19    8.036212
20    6.370488
21    6.915424
22    7.663540
23    5.806458
24    7.354578
25    5.096042
26    5.384587
27    4.626693
28    9.698970
29    8.455932
30    8.769551
31    8.619789
32    8.744727
33    8.376751
34    6.595166
35    7.137869
36    7.013228
37    3.926637
38    4.749848
39    7.974694
40    8.853872
41    5.357308
42    4.484126
43    5.259637
44    8.522879
45    8.301030
46    3.903090
47    5.468521
48    4.252588
Name: pIC50, dtype: float64
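
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes the IC50 values were recorded in nM (an assumption about the earlier parts, which are not shown here), in which case pIC50 = 9 - log10(IC50 in nM). The function names are illustrative.

```python
import math

def pic50_from_nanomolar(ic50_nM):
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this is 9 - log10(IC50_nM)."""
    return 9.0 - math.log10(ic50_nM)

def nanomolar_from_pic50(pic50):
    """Inverse transform: recover the IC50 in nM from a pIC50 value."""
    return 10 ** (9.0 - pic50)

# Under the nM assumption, 0.2 nM maps to the largest pIC50 in the column above
print(round(pic50_from_nanomolar(0.2), 6))       # 9.69897
# and a pIC50 of 8.301030 maps back to about 5 nM
print(round(nanomolar_from_pic50(8.301030), 3))  # 5.0
```

A higher pIC50 therefore means a lower IC50, that is, a more potent compound.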

Combining the X and Y variables

In [35]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.194227
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.530178
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.972284
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.764472
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.176526
5,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.024384
6,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.07643
7,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.05061
8,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.307744
9,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.105684
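
One thing to check before using this table: `pd.concat(..., axis=1)` pairs rows by index position, not by molecule. In the `df3_X` preview the `Name` column starts with CHEMBL777, CHEMBL403, CHEMBL3949733, while `df3` starts with CHEMBL3949733, CHEMBL777, CHEMBL403, and the PaDEL log also shows molecules finishing out of order (8/49 before 7/49). The fingerprints and pIC50 values therefore appear to be paired with the wrong molecules. A minimal sketch of pairing by ChEMBL ID instead, using small stand-in frames built from the values shown above; in the notebook this would have to happen before the `Name` column is dropped.

```python
import pandas as pd

# Stand-in for df3: IDs and pIC50 values copied from the preview above
df_activity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL3949733", "CHEMBL777", "CHEMBL403"],
    "pIC50": [6.194227, 7.530178, 5.972284],
})

# Stand-in for descriptors_output.csv, in the row order PaDEL wrote it
df_fp = pd.DataFrame({
    "Name": ["CHEMBL777", "CHEMBL403", "CHEMBL3949733"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP2": [0, 0, 1],
})

# Positional concat: row i of X is paired with row i of Y, whatever the IDs are
by_position = pd.concat([df_fp.drop(columns="Name"), df_activity["pIC50"]], axis=1)

# Merge on the ID; validate= raises if an ID occurs more than once on either side
by_id = df_fp.merge(
    df_activity[["molecule_chembl_id", "pIC50"]],
    left_on="Name", right_on="molecule_chembl_id",
    how="inner", validate="one_to_one",
).drop(columns=["Name", "molecule_chembl_id"])

print(by_position)  # CHEMBL777's fingerprint ends up next to CHEMBL3949733's pIC50
print(by_id)        # each fingerprint row carries its own molecule's pIC50
```

After such a merge it is worth confirming that the row count still matches the number of molecules, since an inner join silently drops IDs that are missing from either side.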


In [36]:
dataset3.to_csv('betalactamase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

Download the CSV file to your local computer for use in Part 4 (Model Building).