# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

**MOUNSIF EL ATOUCH**

In this Jupyter notebook, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://raw.githubusercontent.com/mounsifelatouch/cdd/master/padel/padel.zip
! wget https://raw.githubusercontent.com/mounsifelatouch/cdd/master/padel/padel.sh

--2023-05-31 06:34:00--  https://raw.githubusercontent.com/mounsifelatouch/cdd/master/padel/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-05-31 06:34:00 (318 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-05-31 06:34:00--  https://raw.githubusercontent.com/mounsifelatouch/cdd/master/padel/padel.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 233 [text/plain]
Saving to: ‘padel.sh’


2023-05-31 06:34:00 (10.4 MB/s) - ‘padel.sh’ saved [233/233]



In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Installing libraries**

In [3]:
! pip install rdkit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdkit
  Downloading rdkit-2023.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29.7/29.7 MB 50.3 MB/s eta 0:00:00
Installing collected packages: rdkit
Successfully installed rdkit-2023.3.1


## **Importing libraries**

In [5]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_data_curated.csv** file, which contains the pIC50 values that we will use to build a classification model.

In [6]:
df = pd.read_csv('bioactivity_data_curated.csv')

In [7]:
selection = ['canonical_smiles','chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! cat molecule.smi | head -5

O=c1oc2ccccc2c(O)c1Cc1c(O)c2ccccc2oc1=O	CHEMBL1466
O=c1oc2ccccc2c(O)c1C(c1ccc[nH]1)c1c(O)c2ccccc2oc1=O	CHEMBL260998
CCOc1cc(C(c2c(O)c3ccccc3oc2=O)c2c(O)c3ccccc3oc2=O)ccc1O	CHEMBL260997
O=c1oc2ccccc2c(O)c1C(c1c(O)c2ccccc2oc1=O)c1c[nH]c2ccccc12	CHEMBL258733
O=c1c(C(c2ccc([N+](=O)[O-])cc2)c2c(O)oc3ccccc3c2=O)c(O)oc2ccccc12	CHEMBL81935


In [9]:
! cat molecule.smi | wc -l

648
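
Each line of `molecule.smi` holds a SMILES string and a ChEMBL ID separated by a tab, and the PaDEL log further down reports molecules by that ID. The sketch below is one way to sanity-check the layout from Python using only the standard library. The two sample lines are copied from the `head` output above; the variable names are illustrative and not part of the original notebook.

```python
import csv
import io

# Two lines copied from the head of molecule.smi (tab-separated: SMILES, ChEMBL ID).
sample = (
    "O=c1oc2ccccc2c(O)c1Cc1c(O)c2ccccc2oc1=O\tCHEMBL1466\n"
    "O=c1oc2ccccc2c(O)c1C(c1ccc[nH]1)c1c(O)c2ccccc2oc1=O\tCHEMBL260998\n"
)

# Parse the text as tab-delimited rows; each row should have exactly two fields.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
malformed = [row for row in rows if len(row) != 2]

print(len(rows), "rows,", len(malformed), "malformed")
for smiles, chembl_id in rows:
    print(chembl_id, "->", smiles)
```

To check the real file, `io.StringIO(sample)` could be swapped for `open('molecule.smi', newline='')`.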


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/GraphOnlyFingerprinter.xml -dir ./ -file descriptors_output.csv
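
For readers who prefer to launch PaDEL from Python rather than through `padel.sh`, the following sketch assembles the same command as an argument list. The comments on each flag reflect the PaDEL-Descriptor command-line help as we understand it, so treat them as a reading aid rather than an authoritative reference. The helper name `build_padel_command` and its parameters are hypothetical.

```python
def build_padel_command(
    jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml="./PaDEL-Descriptor/GraphOnlyFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
):
    """Return the command from padel.sh as a list of arguments (hypothetical helper)."""
    return [
        "java",
        "-Xms1G", "-Xmx1G",             # initial and maximum JVM heap size
        "-Djava.awt.headless=true",     # run without a display
        "-jar", jar,
        "-removesalt",                  # strip salts from each molecule
        "-standardizenitro",            # standardize nitro groups
        "-fingerprints",                # calculate fingerprints
        "-descriptortypes", descriptor_xml,  # which fingerprint type to compute
        "-dir", input_dir,              # directory containing the structure files
        "-file", output_csv,            # where to write the results
    ]


cmd = build_padel_command()
print(" ".join(cmd))
```

Where Java is available, the list could be passed to `subprocess.run(cmd, check=True)`. Swapping `descriptor_xml` for another of the XML files unpacked from `padel.zip` would select a different fingerprint type.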


In [11]:
%%time
! bash padel.sh

Processing CHEMBL1466 in molecule.smi (1/648). 
Processing CHEMBL260998 in molecule.smi (2/648). 
Processing CHEMBL260997 in molecule.smi (3/648). Average speed: 2.54 s/mol.
Processing CHEMBL258733 in molecule.smi (4/648). Average speed: 1.35 s/mol.
Processing CHEMBL81935 in molecule.smi (5/648). Average speed: 0.91 s/mol.
Processing CHEMBL409739 in molecule.smi (6/648). Average speed: 0.73 s/mol.
Processing CHEMBL430228 in molecule.smi (7/648). Average speed: 0.61 s/mol.
Processing CHEMBL409439 in molecule.smi (8/648). Average speed: 0.52 s/mol.
Processing CHEMBL259477 in molecule.smi (9/648). Average speed: 0.47 s/mol.
Processing CHEMBL259153 in molecule.smi (10/648). Average speed: 0.41 s/mol.
Processing CHEMBL259154 in molecule.smi (11/648). Average speed: 0.39 s/mol.
Processing CHEMBL409740 in molecule.smi (12/648). Average speed: 0.36 s/mol.
Processing CHEMBL261558 in molecule.smi (13/648). Average speed: 0.33 s/mol.
Processing CHEMBL261789 in molecule.smi (14/648). Average speed

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [12]:
df_X = pd.read_csv('descriptors_output.csv')

In [13]:
df_X = df_X.drop(columns=['Name'])

In [14]:
df_X

Unnamed: 0,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,GraphFP10,...,GraphFP1015,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
644,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
645,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
646,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


## **Y variable: Transformation into a binary variable**
The bioactivity data is in pIC50 units. Compounds with values >= 6 will be considered **active = `1`**, while those < 6 will be considered **inactive = `0`**.

In [15]:
def get_class(y, thres=6):
  # Label a pIC50 value as active ('1') when it is at or above the threshold, else inactive ('0').
  labels = {0: '0', 1: '1'}
  return labels[1] if y >= thres else labels[0]

In [16]:
df_Y = df['pIC50']

In [17]:
df_Y

0      4.820000
1      4.330000
2      4.040000
3      4.210000
4      4.150000
         ...   
643    3.966576
644    4.450997
645    4.070581
646    4.696804
647    3.100179
Name: pIC50, Length: 648, dtype: float64

In [18]:
df_Y = df_Y.apply(get_class)
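
It can help to confirm how `get_class` treats the boundary and how balanced the resulting classes are before moving on. The sketch below restates the function in a compact, equivalent form and applies it to a short list. The first five values come from the `df_Y` output above; the last two (6.0 and 7.5) are hypothetical values added only to exercise the threshold.

```python
from collections import Counter


def get_class(y, thres=6):
    # Values at or above the threshold are labelled '1', everything else '0'.
    return '1' if y >= thres else '0'


# First five pIC50 values from the notebook, plus two hypothetical ones.
pic50_values = [4.82, 4.33, 4.04, 4.21, 4.15, 6.0, 7.5]

labels = [get_class(value) for value in pic50_values]
counts = Counter(labels)

print(labels)
print(counts)
```

On the full data, `df_Y.value_counts()` gives the same kind of summary. A heavily skewed count would be worth keeping in mind when building the classifier in Part 4.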

## **Combining X and Y variable**

In [19]:
data = pd.concat([df_X, df_Y], axis=1)
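
`pd.concat(..., axis=1)` pairs rows by index, so it relies on `descriptors_output.csv` listing the molecules in the same order as the curated dataframe. If that order is ever in doubt, one alternative is to join on the molecule ID before the `Name` column is dropped. The sketch below illustrates the idea on two small hypothetical frames; the fingerprint bits are made up, and only the IDs and pIC50 values are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for descriptors_output.csv, deliberately out of order.
descriptors = pd.DataFrame({
    "Name": ["CHEMBL260998", "CHEMBL1466"],
    "GraphFP1": [0, 1],
    "GraphFP2": [1, 0],
})

# Stand-in for the curated bioactivity data.
activity = pd.DataFrame({
    "chembl_id": ["CHEMBL1466", "CHEMBL260998"],
    "pIC50": [4.82, 4.33],
})

# Join on the molecule ID; validate raises if an ID appears more than once on either side.
merged = descriptors.merge(
    activity,
    left_on="Name",
    right_on="chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns once the rows are paired correctly.
merged = merged.drop(columns=["Name", "chembl_id"])
print(merged)
```

An inner join also drops any molecule missing from either table, so comparing `len(merged)` with the original row count is a quick way to notice compounds that PaDEL failed to process.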

In [20]:
data = data.rename(columns={'pIC50': 'Activity'})

In [21]:
data

Unnamed: 0,GraphFP1,GraphFP2,GraphFP3,GraphFP4,GraphFP5,GraphFP6,GraphFP7,GraphFP8,GraphFP9,GraphFP10,...,GraphFP1016,GraphFP1017,GraphFP1018,GraphFP1019,GraphFP1020,GraphFP1021,GraphFP1022,GraphFP1023,GraphFP1024,Activity
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
644,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
645,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
646,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
data.to_csv('bioactivity_data_GraphOnlyFingerprinter.csv', index=False)

In [24]:
! zip lab3.zip *

  adding: bioactivity_data_curated.csv (deflated 77%)
  adding: bioactivity_data_GraphOnlyFingerprinter.csv (deflated 95%)
  adding: descriptors_output.csv (deflated 95%)
  adding: __MACOSX/ (stored 0%)
  adding: molecule.smi (deflated 82%)
  adding: PaDEL-Descriptor/ (stored 0%)
  adding: padel.sh (deflated 36%)
  adding: padel.zip (stored 0%)
  adding: sample_data/ (stored 0%)
