## **Descriptor Calculations and Dataset Preparation**

### **Import Libraries and Load Data**

In [24]:
import pandas as pd
from padelpy import padeldescriptor

df_load = pd.read_csv('egfr_01_biactivity_data_with_Lipinski.csv')
df_load.head(5)

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL304929,C#Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1,inactive,305.337,3.3719,1.0,5.0,6.853872
1,CHEMBL1092250,C#Cc1cccc(Nc2ccnc3cc(OC)c(OC)cc23)c1,inactive,304.349,3.9769,1.0,4.0,5.823909
2,CHEMBL553,C#Cc1cccc(Nc2ncnc3cc(OCCOC)c(OCCOC)cc23)c1,inactive,393.443,3.4051,1.0,7.0,6.0
3,CHEMBL1089203,C#Cc1cccc(Nc2ccnc3cc(OCCOC)c(OCCOC)cc23)c1,inactive,392.455,4.0101,1.0,6.0,5.823909
4,CHEMBL1088901,CN(C)CCCC(=O)Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1,inactive,428.334,4.4162,2.0,5.0,6.721246


### **Generate Fingerprint Descriptor File**
Use the padelpy library to extract [PubChem Substructure Fingerprints](https://web.cse.ohio-state.edu/~zhang.10631/bak/drugreposition/list_fingerprints.pdf) for every molecule from its SMILES string.

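This outputs a binary table with 881 fingerprint bits per molecule (columns `PubchemFP0` to `PubchemFP880`), plus a `Name` column holding the molecule ID. Each bit flags the presence of a substructure feature such as an element count, a ring system, an atom pair, or an atom's nearest-neighbor pattern.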

In [25]:
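# PaDEL reads .smi files as "<SMILES><tab><name>", one molecule per line, no header
padel_cols = ['canonical_smiles', 'molecule_chembl_id']
df_padel = df_load[padel_cols]
df_padel.to_csv('molecules.smi', sep='\t', index=False, header=False)
# fingerprints=True enables fingerprint calculation; results are written to descriptors.csv
padeldescriptor(mol_dir='molecules.smi', d_file='descriptors.csv', fingerprints=True)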
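
Before building the model matrices, it can be worth confirming that the descriptor file has the expected shape. The sketch below is not part of the original notebook: `check_fingerprint_table` is a hypothetical helper, and it assumes the PaDEL output carries a `Name` column followed by 881 columns of 0/1 values (the PubChem substructure fingerprint defines 881 bits, `PubchemFP0` to `PubchemFP880`). It is demonstrated on a toy table so that it runs without PaDEL or Java.

```python
import pandas as pd


def check_fingerprint_table(df, name_col="Name", n_bits=881):
    """Return a list of problems found in a PaDEL-style fingerprint table.

    Hypothetical helper: an empty list means the table looks as expected.
    """
    problems = []
    if name_col in df.columns:
        bits = df.drop(columns=[name_col])
    else:
        problems.append(f"missing '{name_col}' column")
        bits = df
    if bits.shape[1] != n_bits:
        problems.append(f"expected {n_bits} bit columns, found {bits.shape[1]}")
    # isin() is False for NaN, so missing values are reported here as well
    if not bits.isin([0, 1]).all().all():
        problems.append("non-binary values present")
    return problems


# Toy table with 3 bits per molecule (real output is assumed to have 881)
toy = pd.DataFrame({
    "Name": ["CHEMBL304929", "CHEMBL553"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [0, 1],
    "PubchemFP2": [0, 0],
})
toy_problems = check_fingerprint_table(toy, n_bits=3)

# The same toy table fails the check when a non-binary value slips in
bad = toy.assign(PubchemFP2=[0, 2])
bad_problems = check_fingerprint_table(bad, n_bits=3)
```

On the real file, the call would be `check_fingerprint_table(pd.read_csv('descriptors.csv'))`, with an empty list meaning no problems were found.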

### **Preparing Data Matrices for Model**
X: inputs - fingerprint descriptors of molecule

Y: output - single value of pIC50 for bioactivity

In [26]:
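df_descriptors = pd.read_csv('descriptors.csv')

# 'Name' holds the molecule ID written to molecules.smi; drop it to keep only the fingerprint bits
df_x = df_descriptors.drop(columns=['Name'])
df_y = df_load['pIC50']

# axis=1 concat aligns on the row index, so this assumes descriptors.csv
# lists the molecules in the same order as df_load
df_dataset = pd.concat([df_x, df_y], axis=1)
df_dataset.head(5)

df_dataset.to_csv('egfr_02_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)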
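
The positional concatenation above is only correct if the rows of `descriptors.csv` come back in the same order as `df_load`. If that order is ever in doubt (for example, when descriptor calculation runs on several threads), a more defensive option is to join on the molecule ID instead. The following is a minimal sketch under that assumption, not the notebook's own method: `align_descriptors` is a hypothetical helper, and it relies on the `Name` column containing the `molecule_chembl_id` values written to `molecules.smi`.

```python
import pandas as pd


def align_descriptors(df_desc, df_act, id_col="molecule_chembl_id", target="pIC50"):
    """Attach the target column to the fingerprint table by molecule ID.

    Hypothetical helper. validate="one_to_one" makes pandas raise a
    MergeError if an ID appears more than once on either side.
    """
    merged = df_desc.merge(
        df_act[[id_col, target]],
        left_on="Name",
        right_on=id_col,
        how="inner",
        validate="one_to_one",
    )
    # Drop both ID columns so the result is fingerprint bits followed by the target
    return merged.drop(columns=["Name", id_col])


# Toy data: the descriptor rows are deliberately in a different order
toy_desc = pd.DataFrame({
    "Name": ["CHEMBL553", "CHEMBL304929"],
    "PubchemFP0": [1, 0],
})
toy_act = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL304929", "CHEMBL553"],
    "pIC50": [6.853872, 6.0],
})
toy_dataset = align_descriptors(toy_desc, toy_act)
```

With the notebook's variables, the call would be `align_descriptors(df_descriptors, df_load)`. Comparing the length of the result with `len(df_load)` then shows whether any molecule failed descriptor calculation and was dropped by the inner join.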