<a href="https://colab.research.google.com/github/mr-nahash/drug-discovery-antipsychotics-D2DR/blob/main/CDD_ML_Part_3_sigma1_molecular_Descriptor_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-02-06 04:35:49--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-02-06 04:35:49--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-02-06 04:35:50 (166 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-02-06 04:35:50--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **sigma1_bioactivity_data_preprocessed_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [None]:
! gdown --id 1s8VA5GXBWBzmZ5-We_PA0c1MT7hDnDJE

Downloading...
From: https://drive.google.com/uc?id=1s8VA5GXBWBzmZ5-We_PA0c1MT7hDnDJE
To: /content/sigma1_bioactivity_data_preprocessed_pIC50.csv
100% 112k/112k [00:00<00:00, 7.65MB/s]


In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('sigma1_bioactivity_data_preprocessed_pIC50.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,M Weight,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL67010,C/C(=N\C1CCCCC1)NC12CC3CC(CC(C3)C1)C2,active,274.452,4.29590,1.0,1.0,7.142668
1,CHEMBL542638,C/C(=N\C12CC3CC(CC(C3)C1)C2)Nc1ccccc1C.Cl,active,318.892,5.21592,1.0,1.0,8.221849
2,CHEMBL544054,C/C(=N\C1CCCCC1)Nc1ccccc1C.Cl,active,266.816,4.57982,1.0,1.0,8.045757
3,CHEMBL67388,C/C(=N\C12CC3CC(CC(C3)C1)C2)NC12CC3CC(CC(C3)C1)C2,active,326.528,4.93200,1.0,1.0,7.795880
4,CHEMBL538754,C/C(=N\c1ccccc1C)Nc1ccccc1C.Cl,active,274.795,4.88724,1.0,1.0,7.823909
...,...,...,...,...,...,...,...,...
1037,CHEMBL60542,CC(C)=CCN1CC[C@]2(C)c3cc(O)ccc3C[C@H]1[C@H]2C,,,,,,8.084073
1038,CHEMBL177952,COc1ccc2c(c1)CCCC2CCCCN1CCC(C)CC1,,,,,,7.917215
1039,CHEMBL176941,COc1ccc2c(CCCCN3CCC(C)CC3)cccc2c1,,,,,,7.974694
1040,CHEMBL60542,CC(C)=CCN1CC[C@]2(C)c3cc(O)ccc3C[C@H]1[C@H]2C,,,,,,8.119758


In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

C/C(=N\C1CCCCC1)NC12CC3CC(CC(C3)C1)C2	CHEMBL67010
C/C(=N\C12CC3CC(CC(C3)C1)C2)Nc1ccccc1C.Cl	CHEMBL542638
C/C(=N\C1CCCCC1)Nc1ccccc1C.Cl	CHEMBL544054
C/C(=N\C12CC3CC(CC(C3)C1)C2)NC12CC3CC(CC(C3)C1)C2	CHEMBL67388
C/C(=N\c1ccccc1C)Nc1ccccc1C.Cl	CHEMBL538754


In [None]:
! cat molecule.smi | wc -l

1042
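PaDEL reads `molecule.smi` as a headerless, tab-separated file with the SMILES string first and the molecule ID second. The sketch below wraps the export step in a small helper that refuses to write rows with a missing SMILES string or ID. The helper name `write_smi` and the two-row toy table are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_smi(df, path, smiles_col="canonical_smiles", id_col="molecule_chembl_id"):
    """Write a headerless, tab-separated SMILES file and return the row count."""
    subset = df[[smiles_col, id_col]]
    # A row without a SMILES string or an ID cannot be processed downstream.
    n_missing = int(subset.isna().any(axis=1).sum())
    if n_missing:
        raise ValueError(f"{n_missing} rows lack a SMILES string or an ID")
    subset.to_csv(path, sep="\t", index=False, header=False)
    return len(subset)


# Toy table reusing two rows shown in the notebook output above.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL67010", "CHEMBL544054"],
    "canonical_smiles": [
        r"C/C(=N\C1CCCCC1)NC12CC3CC(CC(C3)C1)C2",
        r"C/C(=N\C1CCCCC1)Nc1ccccc1C.Cl",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    smi_path = Path(tmp) / "molecule.smi"
    n_written = write_smi(toy, smi_path)
    lines = smi_path.read_text().splitlines()

print(n_written, lines[0].split("\t"))
```

In the notebook itself, the equivalent call would be `write_smi(df3, 'molecule.smi')`.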


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

#the  program will remove salts and impurities.

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
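The command in `padel.sh` can also be assembled from Python, which makes it easier to swap in another fingerprint definition from the `PaDEL-Descriptor` folder (for example `MACCSFingerprinter.xml`, listed in the unzip output above). This is a minimal sketch: the flags are copied verbatim from `padel.sh`, while the helper name `build_padel_command` and its parameters are assumptions made for illustration.

```python
import shlex


def build_padel_command(descriptor_xml, output_csv,
                        jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
                        input_dir="./"):
    """Return the PaDEL command line as an argument list (flags taken from padel.sh)."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,
        "-file", output_csv,
    ]


cmd = build_padel_command(
    "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    "descriptors_output.csv",
)
# Print the command for inspection; it should match the line in padel.sh.
print(shlex.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# import subprocess
# subprocess.run(cmd, check=True)
```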


In [None]:
! bash padel.sh

Processing CHEMBL542638 in molecule.smi (2/32). 
Processing CHEMBL67010 in molecule.smi (1/32). 
Processing CHEMBL67388 in molecule.smi (4/32). Average speed: 1.04 s/mol.
Processing CHEMBL544054 in molecule.smi (3/32). Average speed: 2.02 s/mol.
Processing CHEMBL538754 in molecule.smi (5/32). Average speed: 0.86 s/mol.
Processing CHEMBL63508 in molecule.smi (6/32). Average speed: 0.77 s/mol.
Processing CHEMBL67665 in molecule.smi (7/32). Average speed: 0.78 s/mol.
Processing CHEMBL282433 in molecule.smi (8/32). Average speed: 0.59 s/mol.
Processing CHEMBL159967 in molecule.smi (9/32). Average speed: 0.52 s/mol.
Processing CHEMBL26320 in molecule.smi (10/32). Average speed: 0.51 s/mol.
Processing CHEMBL159608 in molecule.smi (11/32). Average speed: 0.51 s/mol.
Processing CHEMBL159320 in molecule.smi (12/32). Average speed: 0.57 s/mol.
Processing CHEMBL164037 in molecule.smi (13/32). Average speed: 0.49 s/mol.
Processing CHEMBL281594 in molecule.smi (14/32). Average speed: 0.49 s/mol.
Pr

In [None]:
! ls -l

total 25416
-rw-r--r-- 1 root root    68208 Feb  6 04:41 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Feb  6 04:35 __MACOSX
-rw-r--r-- 1 root root    50425 Feb  6 04:40 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Feb  6 04:35 padel.sh
-rw-r--r-- 1 root root 25768637 Feb  6 04:35 padel.zip
drwxr-xr-x 1 root root     4096 Feb  1 14:32 sample_data
-rw------- 1 root root   112455 Feb  6 04:38 sigma1_bioactivity_data_preprocessed_pIC50.csv


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

### **Y variable**

The pIC50 values are already present in the pre-processed file, so here we simply select the `pIC50` column.

In [None]:
df3_Y = df3['pIC50']
df3_Y

0       7.142668
1       8.221849
2       8.045757
3       7.795880
4       7.823909
          ...   
1037    8.084073
1038    7.917215
1039    7.974694
1040    8.119758
1041    6.450997
Name: pIC50, Length: 1042, dtype: float64
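For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so a higher pIC50 means a more potent compound. The sketch below shows the arithmetic under the assumption that the IC50 is given in nM; the function name `ic50_nm_to_pic50` and the 72 nM input are illustrative, not values taken from the dataset.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar (assumed unit) to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    molar = ic50_nm * 1e-9  # nM -> mol/L
    return -math.log10(molar)


pic50 = ic50_nm_to_pic50(72.0)
print(round(pic50, 4))
```

Under this assumption an IC50 of 1000 nM (1 µM) maps to a pIC50 of 6, and each tenfold drop in IC50 raises pIC50 by 1.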

## **Combining the X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3
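`pd.concat(..., axis=1)` pairs rows by index, so it silently assumes that the descriptor table and the bioactivity table have the same number of rows in the same order. The PaDEL log above lists molecules out of input order and reports 32 molecules, while `molecule.smi` has 1042 lines, so a check is worthwhile. The following sketch is one possible sanity check, run on the descriptor table as read from `descriptors_output.csv` before its `Name` column is dropped. The helper name `check_alignment` and the toy values (including the `PubchemFP0` column) are assumptions for illustration.

```python
import pandas as pd


def check_alignment(descriptors, bioactivity,
                    name_col="Name", id_col="molecule_chembl_id"):
    """Raise ValueError unless both tables have the same IDs in the same row order."""
    if len(descriptors) != len(bioactivity):
        raise ValueError(
            f"row count mismatch: {len(descriptors)} descriptor rows "
            f"vs {len(bioactivity)} bioactivity rows"
        )
    # Compare positionally, ignoring whatever index each table carries.
    names = descriptors[name_col].reset_index(drop=True)
    ids = bioactivity[id_col].reset_index(drop=True)
    same = names.eq(ids)
    if not same.all():
        first_bad = int((~same).to_numpy().argmax())
        raise ValueError(f"ID mismatch starting at row {first_bad}")
    return True


# Toy tables with made-up fingerprint bits.
X = pd.DataFrame({"Name": ["CHEMBL67010", "CHEMBL542638"], "PubchemFP0": [1, 1]})
y = pd.DataFrame({"molecule_chembl_id": ["CHEMBL67010", "CHEMBL542638"],
                  "pIC50": [7.142668, 8.221849]})

aligned = check_alignment(X, y)
print(aligned)
```

If the check fails, concatenating by position would pair fingerprints with the wrong pIC50 values or pad the result with `NaN`.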

In [None]:
dataset3.to_csv('sigma1-bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

## **Copy the CSV file to Google Drive for the model-building part**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Drive is mounted at /content/drive, so the target folder lives under that mount point.
! mkdir -p "/content/drive/MyDrive/Colab Notebooks/data3"
! cp sigma1-bioactivity_data_3class_pIC50_pubchem_fp.csv "/content/drive/MyDrive/Colab Notebooks/data3"
! ls -l "/content/drive/MyDrive/Colab Notebooks/data3"
! ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
descriptors_output.csv	padel.sh
drive			padel.zip
__MACOSX		sample_data
molecule.smi		sigma1-bioactivity_data_3class_pIC50_pubchem_fp.csv
PaDEL-Descriptor	sigma1_bioactivity_data_preprocessed_pIC50.csv
