<a href="https://colab.research.google.com/github/kake08/chembl_ml/blob/main/code/CDD_ML_Part_3_Descriptor_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In Part 3, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.

In [2]:
! pip install padelpy
# padelpy is a Python wrapper for the PaDEL-Descriptor software.



In [4]:
from padelpy import padeldescriptor

In [23]:
! unzip /content/PaDEL-Descriptor.zip

Archive:  /content/PaDEL-Descriptor.zip
  inflating: Descriptors.xls         
  inflating: descriptors.xml         
replace lib/ambit2-base-2.4.7-SNAPSHOT.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/ambit2-base-2.4.7-SNAPSHOT.jar  
replace lib/ambit2-core-2.4.7-SNAPSHOT.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/ambit2-core-2.4.7-SNAPSHOT.jar  
replace lib/ambit2-smarts-2.4.7-SNAPSHOT.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/ambit2-smarts-2.4.7-SNAPSHOT.jar  
replace lib/appframework-1.0.3.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/appframework-1.0.3.jar  
replace lib/cdk-1.4.15.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/cdk-1.4.15.jar      
replace lib/commons-cli-1.2.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/commons-cli-1.2.jar  
replace lib/guava-17.0.jar? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: lib/guava-17.0.jar      
replace lib/jama.jar? [y]es, [n]o, [

Create a `padel.sh` script that runs `PaDEL-Descriptor.jar` with the arguments needed for descriptor calculation.



In [40]:
%%writefile padel.sh
#!/bin/bash
java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./descriptors.xml -dir ./ -file descriptors_output.csv

Overwriting padel.sh


`padel.sh` now runs `PaDEL-Descriptor.jar` with salt removal, nitro-group standardization, and fingerprint calculation, and saves the output to `descriptors_output.csv`.
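To make each flag in `padel.sh` explicit, the same command line can be assembled in Python. This is only a sketch: `build_padel_command` and its parameter names are hypothetical helpers, not part of padelpy or PaDEL-Descriptor, and the paths simply repeat the ones written to `padel.sh`.

```python
import shlex


def build_padel_command(jar="./PaDEL-Descriptor.jar",
                        descriptor_types="./descriptors.xml",
                        input_dir="./",
                        output_file="descriptors_output.csv",
                        heap="1G"):
    """Return an argv list equivalent to the java call in padel.sh (hypothetical helper)."""
    return [
        "java",
        f"-Xms{heap}", f"-Xmx{heap}",          # initial and maximum JVM heap size
        "-Djava.awt.headless=true",            # run without a display
        "-jar", jar,
        "-removesalt",                         # salt removal
        "-standardizenitro",                   # nitro-group standardization
        "-fingerprints",                       # fingerprint calculation
        "-descriptortypes", descriptor_types,  # XML file selecting what to compute
        "-dir", input_dir,                     # directory holding the structure files
        "-file", output_file,                  # CSV file that receives the results
    ]


cmd = build_padel_command()
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` when a shell script is not wanted; that call is not executed here because it needs Java and the unzipped PaDEL files.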

Make `padel.sh` executable with `chmod +x`, which adds execute permission to the file.



In [25]:
! chmod +x padel.sh

In [26]:
import pandas as pd

In [30]:
df3 = pd.read_csv('/content/bioactivity_data_preprocessed.csv')

In [31]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL480,Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1,active,369.368,3.51522,1.0,4.0,6.408935
1,CHEMBL178459,Cc1c(-c2cnccn2)ssc1=S,active,226.351,3.30451,0.0,5.0,6.677781
2,CHEMBL3545157,O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1,active,334.400,3.26220,0.0,5.0,7.096910
3,CHEMBL4303595,O=C1C=Cc2cc(Br)ccc2C1=O,active,237.052,2.22770,0.0,2.0,7.397940
4,CHEMBL55400,Nc1ccc2cc3ccc(N)cc3nc2c1,active,209.252,2.55240,2.0,3.0,6.443697
...,...,...,...,...,...,...,...,...
4028,CHEMBL6065869,COc1cccc2[nH]c(C(=O)N(C)[C@@H](CC(C)C)C(=O)N3C...,active,530.625,3.37740,1.0,5.0,7.000000
4029,CHEMBL5958100,COc1cccc2[nH]c(C(=O)N(C)[C@@H](CC(C)(C)F)C(=O)...,active,531.588,3.76988,2.0,5.0,6.259637
4030,CHEMBL6031391,C#C[C@@H]1C[C@@]2(CN1C(=O)[C@H](CC(C)C)N(C)C(=...,active,526.637,3.81170,1.0,4.0,7.000000
4031,CHEMBL5820046,CC(C)C[C@@H](C(=O)N1C[C@]2(CC1C#N)C(=O)Nc1cccc...,active,473.577,3.25728,2.0,4.0,7.000000


In [32]:
# Write a PaDEL input file: SMILES first, then the molecule ID, tab-separated, no header
selection = ['canonical_smiles', 'molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [33]:
! cat molecule.smi | head -5

Cc1c(OCC(F)(F)F)ccnc1C[S+]([O-])c1nc2ccccc2[nH]1	CHEMBL480
Cc1c(-c2cnccn2)ssc1=S	CHEMBL178459
O=c1sn(-c2cccc3ccccc23)c(=O)n1Cc1ccccc1	CHEMBL3545157
O=C1C=Cc2cc(Br)ccc2C1=O	CHEMBL4303595
Nc1ccc2cc3ccc(N)cc3nc2c1	CHEMBL55400


In [34]:
! cat molecule.smi | wc -l

4033
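Beyond counting lines, it can help to confirm that every record has both a SMILES string and an ID, and that no ID appears twice, since duplicated or malformed records make it harder to match descriptors back to activities later. The sketch below uses only the standard library; `check_smi` is a hypothetical helper, and the sample records are illustrative (two are copied from the file preview, while the repeated and incomplete lines are made up to trigger the checks).

```python
from collections import Counter


def check_smi(lines):
    """Validate tab-separated 'SMILES<TAB>ID' records and summarize problems."""
    malformed = []   # 1-based line numbers that do not hold exactly two non-empty fields
    ids = Counter()
    n_records = 0
    for lineno, line in enumerate(lines, start=1):
        n_records += 1
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(fields):
            malformed.append(lineno)
            continue
        ids[fields[1]] += 1
    duplicates = sorted(name for name, count in ids.items() if count > 1)
    return {"records": n_records, "malformed": malformed, "duplicate_ids": duplicates}


sample = [
    "Cc1c(-c2cnccn2)ssc1=S\tCHEMBL178459\n",
    "Nc1ccc2cc3ccc(N)cc3nc2c1\tCHEMBL55400\n",
    "Nc1ccc2cc3ccc(N)cc3nc2c1\tCHEMBL55400\n",  # deliberate duplicate for illustration
    "CHEMBL480\n",                               # deliberate record with a missing field
]
report = check_smi(sample)
print(report)
```

For the real file, pass an open file handle instead of the list, for example `with open('molecule.smi') as fh: check_smi(fh)`.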


# Calculate Fingerprint Descriptors

In [38]:
! cat padel.sh

#!/bin/bash
java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./descriptors.xml -dir ./ -file descriptors_output.csv


In [41]:
! bash padel.sh

Processing CHEMBL480 in molecule.smi (1/4033). 
Processing CHEMBL178459 in molecule.smi (2/4033). 
Processing CHEMBL3545157 in molecule.smi (3/4033). 
Processing CHEMBL4303595 in molecule.smi (4/4033). Average speed: 8.72 s/mol.
Processing CHEMBL55400 in molecule.smi (5/4033). Average speed: 3.10 s/mol.
Processing CHEMBL1886408 in molecule.smi (6/4033). Average speed: 3.31 s/mol.
Processing CHEMBL505670 in molecule.smi (7/4033). Average speed: 2.00 s/mol.
Processing CHEMBL460499 in molecule.smi (8/4033). Average speed: 1.71 s/mol.
Processing CHEMBL1096979 in molecule.smi (9/4033). Average speed: 1.72 s/mol.
Processing CHEMBL164 in molecule.smi (10/4033). Average speed: 1.50 s/mol.
Processing CHEMBL1422849 in molecule.smi (11/4033). Average speed: 1.17 s/mol.
Processing CHEMBL284861 in molecule.smi (12/4033). Average speed: 1.10 s/mol.
Processing CHEMBL3963349 in molecule.smi (13/4033). Average speed: 1.10 s/mol.
Processing CHEMBL3797437 in molecule.smi (14/4033). Average speed: 0.93 s/

# Preparing X and Y Data Matrices

In [42]:
df3_X = pd.read_csv('descriptors_output.csv')

In [43]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL178459,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL480,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL4303595,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL3545157,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL55400,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4028,CHEMBL6065869,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4029,CHEMBL5958100,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4030,CHEMBL6031391,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4031,CHEMBL5820046,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4028,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4029,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4030,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4031,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
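Before using the matrix as model input, a quick sanity check can confirm that only fingerprint columns remain, that every value is 0 or 1, and that nothing is missing. The following is a sketch under assumptions: `check_fingerprint_matrix` is a hypothetical helper, and the three-bit toy frame only stands in for `df3_X`.

```python
import pandas as pd


def check_fingerprint_matrix(X, prefix="PubchemFP"):
    """Return basic sanity facts about a fingerprint feature matrix."""
    fp_cols = [c for c in X.columns if str(c).startswith(prefix)]
    bits = X[fp_cols]
    return {
        "n_rows": len(X),
        "n_bits": len(fp_cols),
        "other_columns": [c for c in X.columns if c not in fp_cols],
        "is_binary": bool(bits.isin([0, 1]).all().all()),
        "n_missing": int(bits.isna().sum().sum()),
    }


# Toy stand-in for df3_X (values are illustrative only).
toy = pd.DataFrame({
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [0, 1, 0],
    "PubchemFP2": [0, 0, 1],
})
report = check_fingerprint_matrix(toy)
print(report)
```

On the real matrix, `n_bits` would be expected to equal 881, matching the `PubchemFP0` to `PubchemFP880` columns shown above, and `other_columns` should be empty once `Name` has been dropped.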


# Y Variable

In [45]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,6.408935
1,6.677781
2,7.096910
3,7.397940
4,6.443697
...,...
4028,7.000000
4029,6.259637
4030,7.000000
4031,7.000000


# Combine X and Y Variables

In [46]:
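# Caution: concat(axis=1) pairs rows by index, not by compound ID, and the PaDEL
# rows above are not in df3's order (compare Name with molecule_chembl_id).
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3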

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.408935
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.677781
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.096910
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.397940
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.443697
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4028,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
4029,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.259637
4030,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
4031,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.000000
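Because `pd.concat(..., axis=1)` pairs rows by index position, it only gives correct results when the descriptor file lists compounds in the same order as the bioactivity table. In the outputs above the first two descriptor rows are `CHEMBL178459` and `CHEMBL480`, while the bioactivity table starts with `CHEMBL480` and `CHEMBL178459`, so a positional join would attach each pIC50 to the other compound's fingerprint. A minimal sketch of a key-based alternative follows, using just those two compounds with values copied from the tables in this notebook; the variable names are hypothetical, and `validate="one_to_one"` assumes each ChEMBL ID occurs once on both sides.

```python
import pandas as pd

# Bioactivity table, in the order shown for df3.
df_bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL480", "CHEMBL178459"],
    "pIC50": [6.408935, 6.677781],
})

# Descriptor table, in the order shown for descriptors_output.csv.
df_fp = pd.DataFrame({
    "Name": ["CHEMBL178459", "CHEMBL480"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
})

# Positional join: row 0 pairs CHEMBL178459's bits with CHEMBL480's pIC50.
naive = pd.concat([df_fp.drop(columns=["Name"]), df_bio["pIC50"]], axis=1)

# Key-based join: match on the compound ID, then drop the ID columns.
merged = df_fp.merge(
    df_bio[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",  # raises if an ID is duplicated on either side
)
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])

print(naive)
print(aligned)
```

In this sketch the merge has to happen before the `Name` column is dropped. If IDs repeat in the real data, `validate="one_to_one"` will raise and the duplicates need to be resolved first. PaDEL-Descriptor also offers a `-retainorder` option intended to keep the input order in the output file; it is not used in this notebook and its effect here is untested.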


In [49]:
dataset3.to_csv('replicase_polyprotein_1ab_bioactivity_data_2class_pIC50_pubchem_fp.csv', index=False)

Target: Replicase polyprotein 1ab. Organism: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).