# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then combine them with the bioactivity values into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2025-12-15 17:18:06--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-12-15 17:18:07 ERROR 404: Not Found.

--2025-12-15 17:18:07--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2025-12-15 17:18:08 ERROR 404: Not Found.



In [2]:
! unzip padel.zip

unzip:  cannot find or open padel.zip, padel.zip.zip or padel.zip.ZIP.
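Both downloads above returned 404, so every later cell that depends on `padel.zip` or `padel.sh` fails in turn. A small guard, sketched here with the standard library only, makes that explicit before the later steps run. The helper name `missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_files(names, directory="."):
    """Return the names that are absent or empty in `directory`."""
    base = Path(directory)
    missing = []
    for name in names:
        path = base / name
        # A zero-byte file usually means an interrupted or failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


absent = missing_files(["padel.zip", "padel.sh"])
if absent:
    print("Missing required files:", ", ".join(absent))
else:
    print("All required files are present.")
```

Running a check like this right after the download cell shows the problem at its source rather than at the later `FileNotFoundError`.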


## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [3]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2025-12-15 17:18:08--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.1’


2025-12-15 17:18:08 (9.11 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.1’ saved [655414/655414]
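Because a file with the same name already existed, `wget` saved this download with a `.1` suffix, while the cells that follow read the name without the suffix. One way to avoid that ambiguity is to download to an explicit path and replace whatever is there. The sketch below uses only the standard library; the helper name `fetch` is hypothetical and not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, destination):
    """Download `url` to `destination`, replacing any existing file.

    urllib raises an exception on HTTP errors such as 404, so a failed
    download stops here instead of surfacing in a later cell.
    """
    destination = Path(destination)
    # Write to a temporary name first so a partial download never
    # masquerades as the finished file.
    tmp = destination.with_name(destination.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(destination)
    return destination


# Example call (URL taken from the wget cell above; not executed here):
# fetch(
#     "https://raw.githubusercontent.com/dataprofessor/data/master/"
#     "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv",
#     "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv",
# )
```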



In [4]:
import pandas as pd

In [5]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL463210,CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl,intermediate,350.591,4.7181,0.0,5.0,5.737549
1,1,CHEMBL2252723,CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,455.557,6.3177,0.0,6.0,3.947999
2,2,CHEMBL2252722,CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,441.53,5.9276,0.0,6.0,4.425969
3,3,CHEMBL2252721,CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,427.503,5.5375,0.0,6.0,5.346787
4,4,CHEMBL2252851,CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,413.476,5.1474,0.0,6.0,5.735182
5,5,CHEMBL2252850,CCOP(=O)(OCC)SCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,399.449,4.7573,0.0,6.0,5.419075
6,6,CHEMBL2252849,CCOP(=O)(OCC)SCCCCCN1C(=O)c2ccccc2C1=O,inactive,385.422,4.3672,0.0,6.0,4.908685
7,7,CHEMBL2252848,CCOP(=O)(OCC)SCCCCN1C(=O)c2ccccc2C1=O,intermediate,371.395,3.9771,0.0,6.0,5.003488
8,8,CHEMBL2252847,CCOP(=O)(OCC)SCCCN1C(=O)c2ccccc2C1=O,intermediate,357.368,3.587,0.0,6.0,5.081445
9,9,CHEMBL2252846,CCOP(=O)(OCC)SCCCCCCCCCCSP(=O)(OCC)OCC,intermediate,478.594,7.9358,0.0,8.0,5.754487


In [7]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
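The cell above writes one molecule per line, SMILES string first and ChEMBL identifier second, separated by a tab and with no header row. To make that layout explicit, here is a minimal sketch of the same format using the standard `csv` module. The helper name `write_smi` and the file name `example.smi` are hypothetical; the two rows are taken from the table above.

```python
import csv


def write_smi(rows, path):
    """Write (smiles, identifier) pairs as tab-separated lines without a header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


rows = [
    ("CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl", "CHEMBL463210"),
    ("CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O", "CHEMBL2252723"),
]
write_smi(rows, "example.smi")
```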

In [8]:
! head -5 molecule.smi

CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl	CHEMBL463210
CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252723
CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252722
CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252721
CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252851


In [9]:
! cat molecule.smi | wc -l

18


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

cat: padel.sh: No such file or directory


In [11]:
! bash padel.sh

bash: padel.sh: No such file or directory


In [12]:
! ls -l

total 101572
-rw-r--r-- 1 leeva leeva      9828 Dec 15 17:16 acetylcholinesterase_01_bioactivity_data_raw.csv
-rw-r--r-- 1 leeva leeva      1093 Dec 15 17:16 acetylcholinesterase_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 leeva leeva      1303 Dec 15 17:16 acetylcholinesterase_03_bioactivity_data_curated.csv
-rw-r--r-- 1 leeva rayhan   385924 Dec 15 17:07 acetylcholinesterase_03_bioactivity_data_curated.csv.1
-rw-r--r-- 1 leeva rayhan   385924 Dec 15 17:16 acetylcholinesterase_03_bioactivity_data_curated.csv.2
-rw-r--r-- 1 leeva leeva      2345 Dec 15 17:16 acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 leeva rayhan   655414 Dec 15 17:18 acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv.1
-rw-r--r-- 1 leeva leeva       932 Dec 15 17:16 acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv
-rw-r--r-- 1 leeva leeva   8368605 Dec 13 19:09 acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv
-rw-r--r-- 1 leeva leeva   6327445 Dec 1

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [13]:
df3_X = pd.read_csv('descriptors_output.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'descriptors_output.csv'

In [None]:
df3_X

In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

### **Y variable**

The loaded file already contains a **pIC50** column, so no IC50-to-pIC50 conversion is needed here; we simply select it.

In [None]:
df3_Y = df3['pIC50']
df3_Y
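For reference, pIC50 is conventionally defined as the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nM, which is an assumption since the unit is not stated in this part; the function name `pic50_from_nm` is hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    molar = ic50_nm * 1e-9  # nM -> M (assumed input unit)
    return -math.log10(molar)


print(round(pic50_from_nm(1000.0), 3))  # 1 µM, prints 6.0
```

A higher pIC50 corresponds to a lower IC50, that is, a more potent compound.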

## **Combining the X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3
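`pd.concat(..., axis=1)` pairs rows by index label, so the result is only correct if the descriptor file lists the same molecules, in the same order, as `df3`. When that cannot be guaranteed, joining on the identifier is a safer alternative. The sketch below assumes the `Name` column of the descriptor file holds the ChEMBL ID written to `molecule.smi`; the fingerprint column names `FP0` and `FP1` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames; fingerprint values are illustrative only.
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL463210", "CHEMBL2252723", "CHEMBL2252722"],
    "pIC50": [5.737549, 3.947999, 4.425969],
})
# Descriptor rows arrive in a different order, and one molecule is missing.
descriptors = pd.DataFrame({
    "Name": ["CHEMBL2252722", "CHEMBL463210"],
    "FP0": [1, 0],
    "FP1": [0, 1],
})

# An inner join on the identifier keeps only molecules present in both frames.
# validate="one_to_one" raises if an identifier appears more than once.
merged = descriptors.merge(
    bioactivity,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)
dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

With a positional `concat`, the second descriptor row here would have been paired with the wrong pIC50 value; the join pairs each fingerprint row with its own molecule.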

In [None]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**