# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Based on coursework from data science professor: Chanin Nantasenamat

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

We will use PaDEL-Descriptor to calculate the molecular descriptors.

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [None]:
! unzip padel.zip
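
PaDEL-Descriptor is a Java program (the `padel.sh` script used in this notebook invokes `java -jar`), so a Java runtime has to be on the `PATH` before the descriptors can be calculated. A minimal sketch of a pre-flight check follows; the helper name `java_available` is hypothetical and not part of the original notebook.

```python
import shutil


def java_available(executable="java"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not java_available():
    print("No 'java' executable found on PATH; PaDEL-Descriptor will not run.")
```

Running this before `padel.sh` gives a clearer message than a failed shell command would.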

## **Load bioactivity data from Part 2**

In [17]:
import os

# Change the working directory
%cd /home/drug_discovery/Drug-Discovery-with-Python-and-Machine-Learning/data

# Verify the current working directory
print(os.getcwd())

/home/drug_discovery/Drug-Discovery-with-Python-and-Machine-Learning/data
/home/drug_discovery/Drug-Discovery-with-Python-and-Machine-Learning/data


Here we will be using the **bioactivity_data_preprocessed_pIC50_3class.csv** file from Part 2 that contains the pIC50 values for building a regression model.

In [3]:
import pandas as pd

In [9]:
# The working directory is already the data folder, so no 'data/' prefix is needed
df3 = pd.read_csv('bioactivity_data_preprocessed_pIC50_3class.csv')

In [10]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL288441,COc1cc(Nc2c(C#N)cnc3cc(OCCCN4CCN(C)CC4)c(OC)cc...,active,530.456,5.19038,1.0,8.0,7.698970
1,CHEMBL386051,CSc1cccc(Nc2ncc3cc(-c4c(Cl)cccc4Cl)c(=O)n(C)c3...,active,443.359,5.76780,1.0,6.0,8.698970
2,CHEMBL364623,Cc1cc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)nc(C)n1,active,373.869,4.50766,2.0,6.0,8.522879
3,CHEMBL5416410,Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(C...,active,506.032,2.48884,3.0,9.0,9.000000
4,CHEMBL5416410,Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(C...,active,506.032,2.48884,3.0,9.0,9.000000
...,...,...,...,...,...,...,...,...
323,CHEMBL288441,COc1cc(Nc2c(C#N)cnc3cc(OCCCN4CCN(C)CC4)c(OC)cc...,active,530.456,5.19038,1.0,8.0,7.356547
324,CHEMBL5435819,COc1cc(Nc2c(C#N)cnc3cc(OCCON4CCN(C)CC4)c(OC)cc...,active,532.428,4.73188,1.0,9.0,7.327902
325,CHEMBL288441,COc1cc(Nc2c(C#N)cnc3cc(OCCCN4CCN(C)CC4)c(OC)cc...,active,530.456,5.19038,1.0,8.0,9.000000
326,CHEMBL5435819,COc1cc(Nc2c(C#N)cnc3cc(OCCON4CCN(C)CC4)c(OC)cc...,active,532.428,4.73188,1.0,9.0,9.000000


In [11]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [12]:
! cat molecule.smi | head -5

COc1cc(Nc2c(C#N)cnc3cc(OCCCN4CCN(C)CC4)c(OC)cc23)c(Cl)cc1Cl	CHEMBL288441
CSc1cccc(Nc2ncc3cc(-c4c(Cl)cccc4Cl)c(=O)n(C)c3n2)c1	CHEMBL386051
Cc1cc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)nc(C)n1	CHEMBL364623
Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(CCO)CC2)n1.O	CHEMBL5416410
Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(CCO)CC2)n1.O	CHEMBL5416410


In [13]:
! cat molecule.smi | wc -l

328
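
Each line of the `.smi` file holds a SMILES string, a tab, and the molecule name, which is why the columns are selected in that order and written without a header or index. The sketch below repeats the export on a two-row toy table (the rows are copied from the preview above) and checks that the file has one line per molecule. The helper names `write_smi` and `read_smi_lines` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3; both rows are taken from the preview above
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL364623", "CHEMBL386051"],
    "canonical_smiles": [
        "Cc1cc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)nc(C)n1",
        "CSc1cccc(Nc2ncc3cc(-c4c(Cl)cccc4Cl)c(=O)n(C)c3n2)c1",
    ],
})


def write_smi(df, path):
    """Write SMILES<TAB>ID lines, with no header and no index column."""
    df[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        path, sep="\t", index=False, header=False
    )


def read_smi_lines(path):
    """Read the file back as a list of [smiles, name] pairs."""
    with open(path) as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    smi_path = os.path.join(tmp, "toy.smi")
    write_smi(toy, smi_path)
    records = read_smi_lines(smi_path)

# One line per molecule, two fields per line
assert len(records) == len(toy)
assert all(len(fields) == 2 for fields in records)
```

The same comparison on the real data is `len(df3)` against the `wc -l` count shown above.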


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [14]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


The script runs PaDEL-Descriptor on the structure files in the current directory. It first cleans up each structure by removing salts (`-removesalt`) and standardizing nitro groups (`-standardizenitro`), then computes the PubChem fingerprint (http://yapcwsoft.com/dd/padeldescriptor/). The results are written to a file called ``descriptors_output.csv``.
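
The same command can also be assembled in Python, which makes each flag easier to see and to change. This is only a sketch: the helper `build_padel_command` and its parameter names are hypothetical, the flags are copied from `padel.sh` above, and the comments give my reading of what each flag does.

```python
import subprocess


def build_padel_command(padel_dir="./PaDEL-Descriptor",
                        input_dir="./",
                        output_file="descriptors_output.csv"):
    """Build the argument list used in padel.sh (flags copied from the script)."""
    return [
        "java", "-Xms1G", "-Xmx1G",        # JVM heap size: 1 GB initial and maximum
        "-Djava.awt.headless=true",        # run without a graphical display
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt",                     # strip salts from each structure
        "-standardizenitro",               # standardize nitro groups
        "-fingerprints",                   # calculate fingerprints
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",
        "-dir", input_dir,                 # directory containing the input structures
        "-file", output_file,              # CSV file that receives the results
    ]


command = build_padel_command()
print(" ".join(command))

# To run it (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems if a path contains spaces.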

In [15]:
! bash padel.sh

Processing CHEMBL288441 in molecule.smi (1/328). 
Processing CHEMBL386051 in molecule.smi (2/328). 
Processing CHEMBL364623 in molecule.smi (3/328). 
Processing CHEMBL5416410 in molecule.smi (4/328). 
Processing CHEMBL5416410 in molecule.smi (5/328). 
Processing CHEMBL5416410 in molecule.smi (6/328). 
Processing CHEMBL941 in molecule.smi (7/328). 
Processing CHEMBL1077317 in molecule.smi (8/328). 
Processing CHEMBL1081312 in molecule.smi (9/328). 
Processing CHEMBL483847 in molecule.smi (10/328). 
Processing CHEMBL941 in molecule.smi (11/328). 
Processing CHEMBL941 in molecule.smi (12/328). 
Processing CHEMBL941 in molecule.smi (13/328). Average speed: 3.13 s/mol.
Processing CHEMBL255863 in molecule.smi (15/328). Average speed: 1.16 s/mol.
Processing CHEMBL255863 in molecule.smi (14/328). Average speed: 1.73 s/mol.
Processing CHEMBL941 in molecule.smi (16/328). Average speed: 0.91 s/mol.
Processing CHEMBL255863 in molecule.smi (17/328). Average speed: 0.80 s/mol.
Processing CHEMBL17603

In [18]:
! ls -l

total 1224
-rw-r--r-- 1 root root 236503 Dec 29 00:07 bioactivity_data.csv
-rw-r--r-- 1 root root  31636 Dec 29 20:01 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  49152 Jan  6 19:49 bioactivity_data_preprocessed_pIC50_3class.csv
-rw-r--r-- 1 root root 594397 Jan  6 20:01 descriptors_output.csv
-rw-r--r-- 1 root root    123 Jan  5 22:21 mannwhitneyu_LogP.csv
-rw-r--r-- 1 root root    120 Jan  5 22:20 mannwhitneyu_MW.csv
-rw-r--r-- 1 root root    131 Jan  5 22:22 mannwhitneyu_NumHAcceptors.csv
-rw-r--r-- 1 root root    129 Jan  5 22:21 mannwhitneyu_NumHDonors.csv
-rw-r--r-- 1 root root    124 Jan  5 22:18 mannwhitneyu_pIC50.csv
-rw-r--r-- 1 root root  26670 Jan  6 19:50 molecule.smi
-rw-r--r-- 1 root root  14080 Jan  5 22:21 plot_LogP.pdf
-rw-r--r-- 1 root root  13463 Jan  5 22:20 plot_MW.pdf
-rw-r--r-- 1 root root  54289 Jan  5 22:09 plot_MW_vs_LogP.pdf
-rw-r--r-- 1 root root  16248 Jan  5 22:22 plot_NumHAcceptors.pdf
-rw-r--r-- 1 root root  15069 Jan  5 22:21 plot_NumHDono

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [19]:
df3_X = pd.read_csv('descriptors_output.csv')

In [20]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL5416410,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL5416410,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL5416410,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL364623,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL483847,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,CHEMBL5435819,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
324,CHEMBL288441,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
325,CHEMBL1852688,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
326,CHEMBL288441,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
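
Note that the rows of `descriptors_output.csv` are not in the same order as `df3`: row 0 of the PaDEL output above is CHEMBL5416410, whereas row 0 of `df3` is CHEMBL288441. If the fingerprints are joined to `pIC50` purely by row position, they can end up paired with the wrong molecules. The sketch below shows one way to restore the original order. It assumes that every occurrence of a ChEMBL ID carries the same structure and therefore the same fingerprint; `align_descriptors` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def align_descriptors(descriptors, ids, name_col="Name"):
    """Reorder descriptor rows so that row i belongs to ids[i].

    Assumes rows sharing the same name have identical descriptor values,
    so only the first occurrence of each name is kept as the lookup entry.
    """
    unique = descriptors.drop_duplicates(subset=name_col).set_index(name_col)
    missing = sorted(set(ids) - set(unique.index))
    if missing:
        raise KeyError(f"No descriptors found for: {missing}")
    # Look up one row per requested ID (repeated IDs yield repeated rows)
    return unique.loc[list(ids)].reset_index(drop=True)


# Toy example: IDs in the original order, descriptors in a shuffled order
ids = pd.Series(["A", "B", "A", "C"])
descriptors = pd.DataFrame({
    "Name": ["B", "A", "C", "A"],
    "FP0": [0, 1, 1, 1],
    "FP1": [1, 0, 1, 0],
})

aligned = align_descriptors(descriptors, ids)
print(aligned)
```

With the real data, the call would be `align_descriptors(df3_X, df3['molecule_chembl_id'])`, made before the `Name` column is dropped.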


In [21]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
324,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
325,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
326,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed from IC50 in Part 2, so here we only select the `pIC50` column.

In [23]:
df3_Y = df3['pIC50']
df3_Y

0      7.698970
1      8.698970
2      8.522879
3      9.000000
4      9.000000
         ...   
323    7.356547
324    7.327902
325    9.000000
326    9.000000
327    5.659159
Name: pIC50, Length: 328, dtype: float64

## **Combining the X and Y variables**

In [24]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.698970
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.698970
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.522879
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.356547
324,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.327902
325,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
326,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
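
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, so a mismatch between the two indexes silently produces `NaN` values instead of an error. A minimal sketch of a guarded version on toy data follows; `combine_checked` is a hypothetical helper.

```python
import pandas as pd


def combine_checked(X, y):
    """Concatenate X and y column-wise only if their indexes are identical."""
    if len(X) != len(y) or not X.index.equals(y.index):
        raise ValueError("X and y must have identical indexes before concatenation")
    return pd.concat([X, y], axis=1)


X = pd.DataFrame({"FP0": [1, 0, 1]})
y_same = pd.Series([7.7, 8.7, 8.5], name="pIC50")
y_shifted = pd.Series([7.7, 8.7, 8.5], index=[1, 2, 3], name="pIC50")

# Matching indexes: a clean 3 x 2 table
combined = combine_checked(X, y_same)

# Mismatched indexes: plain concat pads the gaps with NaN
naive = pd.concat([X, y_shifted], axis=1)
print(naive.isna().sum().sum())
```

This check only covers the index labels. It does not confirm that row *i* of X describes the same molecule as row *i* of Y.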


Finally, write the combined dataset to a CSV file for use in Part 4.

In [26]:
dataset3.to_csv('bioactivity_data_preprocessed_pIC50_3class_pubchem_fp_BCR-ABL1.csv', index=False)