# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

**MOUNSIF EL ATOUCH**

In this Jupyter notebook, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [20]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2023-05-27 03:15:08--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-05-27 03:15:08--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-05-27 03:15:10 (136 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-05-27 03:15:10--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gi

In [21]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Installing libraries**

In [4]:
! pip install rdkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rdkit-2023.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29.7/29.7 MB 48.8 MB/s eta 0:00:00
Installing collected packages: rdkit
Successfully installed rdkit-2023.3.1


## **Importing libraries**

In [38]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_combined_data.csv** file, which contains the pIC50 values and the activity labels (the `target` column) that we will use to build a classification model.

In [39]:
df = pd.read_csv('bioactivity_combined_data.csv')

Convert the `target` labels to numbers with a mapping function: "active" becomes 1 and every other label (here, "inactive") becomes 0.

In [40]:
def label_to_numeric(label):
    """Map the 'active' label to 1 and any other label to 0."""
    if label == 'active':
        return 1
    else:
        return 0

Apply the mapping function to the target variable.

In [41]:
df['target'] = df['target'].apply(label_to_numeric)
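One thing worth noting about `label_to_numeric`: any value other than `'active'` (a typo, different capitalization, a missing value) is silently mapped to 0. A minimal alternative sketch, assuming the only valid labels are `active` and `inactive`, uses a dictionary with `Series.map` so that unexpected labels surface as `NaN` instead. The toy `labels` Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy labels for illustration only; the real values come from df['target'].
labels = pd.Series(['active', 'inactive', 'Active', 'inactive'])

# Explicit mapping: anything not listed here becomes NaN instead of 0.
label_map = {'active': 1, 'inactive': 0}
numeric = labels.map(label_map)

# Collect labels that were not recognised so they can be inspected.
unexpected = labels[numeric.isna()]
print(unexpected.tolist())
```

If `unexpected` is empty on the real column, the two approaches give the same result.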

Save the transformed dataset to a new CSV file.

In [42]:
df.to_csv('transformed_dataset.csv', index=False)

In [43]:
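# Keep only the SMILES string and the ChEMBL ID of each molecule.
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
# Write a tab-separated .smi file (SMILES, then ID) with no header or index.
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)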

In [44]:
! cat molecule.smi | head -5

Oc1ccc(CCc2ccc(O)c(O)c2O)cc1	CHEMBL243822
CC(=O)NO	CHEMBL734
O=c1c(-c2ccc(O)cc2)coc2c(O)c(O)ccc12	CHEMBL242739
N=C(Cc1ccc(O)cc1)c1ccc(O)c(O)c1	CHEMBL503157
NC(Cc1ccc(O)cc1)c1ccc(O)c(O)c1O	CHEMBL412199


In [45]:
! cat molecule.smi | wc -l

254
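The same check can be done from Python, together with a check of the file layout written above: one molecule per line, SMILES first, then a tab, then the ChEMBL ID, with no header row. The sketch below uses a hypothetical helper (`check_smi` is not part of the original notebook) and demonstrates it on a small temporary file built from two of the records shown above.

```python
import os
import tempfile

def check_smi(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding='utf-8') as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip('\n').split('\t')
            # Each record must be exactly: SMILES <tab> identifier
            if len(fields) != 2 or not all(fields):
                raise ValueError(f'line {line_no} is malformed: {line!r}')
            count += 1
    return count

# Demo on a temporary file containing two records from the output above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.smi')
    with open(demo_path, 'w', encoding='utf-8') as out:
        out.write('CC(=O)NO\tCHEMBL734\n')
        out.write('Oc1ccc(CCc2ccc(O)c(O)c2O)cc1\tCHEMBL243822\n')
    n_records = check_smi(demo_path)

print(n_records)
```

On the real file, `check_smi('molecule.smi')` should return the same number as `len(df)`.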


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [46]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
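For readers who prefer to stay in Python, the command in `padel.sh` can be assembled as an argument list, which also makes it easy to swap in one of the other fingerprint definition files unpacked from `padel.zip` (for example `MACCSFingerprinter.xml`). This is a sketch only: `build_padel_command` is a hypothetical helper, the flags are copied verbatim from `padel.sh`, and the command is built but not executed here.

```python
def build_padel_command(fingerprint_xml='PubchemFingerprinter.xml',
                        output_csv='descriptors_output.csv',
                        padel_dir='./PaDEL-Descriptor'):
    """Assemble the PaDEL-Descriptor call from padel.sh as an argument list."""
    return [
        'java',
        '-Xms1G', '-Xmx1G',              # JVM initial and maximum heap size
        '-Djava.awt.headless=true',      # run Java without a display
        '-jar', f'{padel_dir}/PaDEL-Descriptor.jar',
        '-removesalt',
        '-standardizenitro',
        '-fingerprints',
        '-descriptortypes', f'{padel_dir}/{fingerprint_xml}',
        '-dir', './',                    # folder scanned for input structures
        '-file', output_csv,             # CSV file the descriptors are written to
    ]

cmd = build_padel_command()
print(' '.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)`, assuming Java and the unpacked `PaDEL-Descriptor` folder are available in the working directory.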


In [47]:
! bash padel.sh

Processing CHEMBL243822 in molecule.smi (1/254). 
Processing CHEMBL734 in molecule.smi (2/254). 
Processing CHEMBL242739 in molecule.smi (3/254). Average speed: 1.50 s/mol.
Processing CHEMBL503157 in molecule.smi (4/254). Average speed: 1.22 s/mol.
Processing CHEMBL593513 in molecule.smi (6/254). Average speed: 0.89 s/mol.
Processing CHEMBL412199 in molecule.smi (5/254). Average speed: 1.00 s/mol.
Processing CHEMBL595619 in molecule.smi (7/254). Average speed: 0.94 s/mol.
Processing CHEMBL1276206 in molecule.smi (8/254). Average speed: 0.87 s/mol.
Processing CHEMBL1276207 in molecule.smi (9/254). Average speed: 0.93 s/mol.
Processing CHEMBL1276035 in molecule.smi (10/254). Average speed: 0.91 s/mol.
Processing CHEMBL1276036 in molecule.smi (11/254). Average speed: 0.89 s/mol.
Processing CHEMBL1276037 in molecule.smi (12/254). Average speed: 0.85 s/mol.
Processing CHEMBL1276338 in molecule.smi (13/254). Average speed: 0.90 s/mol.
Processing CHEMBL1276359 in molecule.smi (14/254). Averag

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [48]:
df_X = pd.read_csv('descriptors_output.csv')

In [49]:
df_X = df_X.drop(columns=['Name'])

In [50]:
df_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
250,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
251,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
252,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

In [51]:
df_Y = df['target']

In [52]:
df_Y

0      0
1      0
2      0
3      0
4      0
      ..
249    0
250    0
251    0
252    0
253    0
Name: target, Length: 254, dtype: int64
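Every row visible in this preview is 0, so it may be worth checking how many actives and inactives the target actually contains before building a classifier. A minimal sketch with an invented toy Series follows; the real class counts are not shown in this notebook, so the numbers here are purely illustrative.

```python
import pandas as pd

# Invented toy target; replace with df['target'] in the notebook.
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='target')

# Absolute counts per class, and the fraction of each class.
counts = toy_target.value_counts()
fractions = toy_target.value_counts(normalize=True)

print(counts)
print(fractions)
```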

## **Combining the X and Y variables**

In [53]:
data = pd.concat([df_X, df_Y], axis=1)
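Note that `pd.concat(..., axis=1)` pairs rows by index, so it assumes that `descriptors_output.csv` lists the molecules in the same order as the bioactivity table. The processing log above shows molecules being handled slightly out of order, so this may be worth verifying. If the order turns out to differ, one option is to join on the identifier instead. The sketch below uses small invented frames and assumes the `Name` column of the descriptor file holds the ChEMBL ID written to `molecule.smi`.

```python
import pandas as pd

# Toy stand-in for descriptors_output.csv: rows arrive in a different order.
descriptors = pd.DataFrame({
    'Name': ['CHEMBL734', 'CHEMBL243822', 'CHEMBL242739'],
    'PubchemFP0': [1, 1, 1],
    'PubchemFP1': [0, 1, 1],
})

# Toy stand-in for the bioactivity table (labels invented for illustration).
bioactivity = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL243822', 'CHEMBL734', 'CHEMBL242739'],
    'target': [0, 1, 0],
})

# Join on the identifier so each fingerprint row gets its own label,
# regardless of row order; validate= raises if an ID is duplicated.
merged = descriptors.merge(
    bioactivity,
    left_on='Name',
    right_on='molecule_chembl_id',
    how='inner',
    validate='one_to_one',
)

# Keep only the fingerprint bits and the label, as in the notebook.
data_aligned = merged.drop(columns=['Name', 'molecule_chembl_id'])
print(data_aligned)
```

This approach requires keeping the `Name` column until after the join, and `validate='one_to_one'` will raise an error if the same ChEMBL ID appears more than once in either table.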

In [54]:
data

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,target
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
250,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
251,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
252,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [55]:
data.to_csv('bioactivity_data_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**
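The pip output earlier in this notebook suggests it was run on Google Colab. Assuming that is the case, the file can be sent to the browser with `files.download` from the `google.colab` module. The wrapper below, `download_if_colab`, is a hypothetical helper added for illustration; outside Colab it simply reports that the file is already on disk.

```python
def download_if_colab(path='bioactivity_data_pubchem_fp.csv'):
    """Trigger a browser download in Colab; return False elsewhere."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import files
    except ImportError:
        print(f'Not running in Colab; {path} stays in the working directory.')
        return False
    files.download(path)
    return True

in_colab = download_if_colab()
```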