# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will prepare these descriptors as a dataset for subsequent model building in Part 4.

---

In [1]:
import os
print(os.getcwd())

C:\Users\uddin\Documents\bioinformatics_freecodecamp


In [2]:
import os

print("Files in current directory:")
print(os.listdir())

Files in current directory:
['.git', '.gitignore', '.ipynb_checkpoints', 'acetylcholinesterase_01_bioactivity_data_raw.csv', 'acetylcholinesterase_02_bioactivity_data_preprocessed.csv', 'acetylcholinesterase_03_bioactivity_data_curated.csv', 'acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv', 'acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv', 'acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', 'acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv', 'anaconda_projects', 'CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb', 'CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb', 'CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb', 'CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb', 'mannwhitneyu_LogP.csv', 'mannwhitneyu_MW.csv', 'mannwhitneyu_NumHAcceptors.csv', 'mannwhitneyu_NumHDonors.csv', 'mannwhitneyu_pIC50.csv', 'molecule.smi', 'plot_bioactivity_class.pdf', 'plot_ic5

In [3]:
os.chdir(r"C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp")
print("📂 Changed working directory to:", os.getcwd())

📂 Changed working directory to: C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp


In [4]:
import requests

# Download padel.zip
url_zip = "https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip"
zip_filename = "padel.zip"
r = requests.get(url_zip, timeout=60)
r.raise_for_status()  # stop here if the download failed
with open(zip_filename, "wb") as f:
    f.write(r.content)
print(f"✅ Downloaded: {zip_filename}")

# Download padel.sh
url_sh = "https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh"
sh_filename = "padel.sh"
r = requests.get(url_sh, timeout=60)
r.raise_for_status()  # stop here if the download failed
with open(sh_filename, "wb") as f:
    f.write(r.content)
print(f"✅ Downloaded: {sh_filename}")

✅ Downloaded: padel.zip
✅ Downloaded: padel.sh
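
The cell above saves `padel.zip` but does not unpack it, while a later cell looks for `PaDEL-Descriptor.jar` inside a `padel` folder. A minimal sketch of unpacking the archive with the standard library is shown here. The helper name `extract_archive` and the destination directory are assumptions, and the archive's internal layout is not shown in this notebook, so inspect the printed member names before relying on any path.

```python
import os
import zipfile


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Runs only when the archive from the previous cell is present.
# Extracting into the current directory is an assumption; change it if needed.
if os.path.exists("padel.zip"):
    members = extract_archive("padel.zip", ".")
    print(f"Extracted {len(members)} entries, for example: {members[:5]}")
```

If the extracted folder is not named `padel`, adjust the `padel_dir` variable used later in this notebook to match.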


## **Download PaDEL-Descriptor**

In [5]:
# ✅ Install and use padelpy (no ZIP needed)
import sys, subprocess
subprocess.check_call([sys.executable, "-m", "pip", "install", "padelpy"])

print("✅ padelpy installed")

✅ padelpy installed


In [7]:
from padelpy import padeldescriptor
import os

smi_file = os.path.join(os.getcwd(), "molecule.smi")

padeldescriptor(
    mol_dir=smi_file,
    d_file="descriptors_output.csv",
    fingerprints=True
)

print("✅ descriptors_output.csv created")
print("📄 Files:", os.listdir())

ReferenceError: Java JRE 6+ not found (required for PaDEL-Descriptor)
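
The error above means that no Java runtime could be located. Before retrying, it may help to confirm whether a `java` executable is visible to the notebook kernel. The following is a small sketch using only the standard library; the helper name `find_java` is hypothetical.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("Java was not found on PATH; install a JRE or JDK and restart the kernel.")
else:
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(result.stderr.strip() or result.stdout.strip())
```

If Java was installed after the kernel started, the kernel usually needs a restart before the updated PATH is picked up.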

In [None]:
import pandas as pd
df3_X = pd.read_csv("descriptors_output.csv")
df3_X.head()

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [None]:
import requests

url = "https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"
filename = "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"

r = requests.get(url, timeout=60)
r.raise_for_status()  # stop here if the download failed
with open(filename, "wb") as f:
    f.write(r.content)

print(f"✅ Downloaded dataset: {filename}")

In [None]:
import pandas as pd

In [None]:
import os

# If the file was downloaded in the same folder as the notebook:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

# If that fails, uncomment the line below and use the absolute path instead:
# df3 = pd.read_csv(r"C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp\acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv")

print("✅ File loaded successfully")
df3.head()

In [None]:
df3

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
# Show first 5 lines of molecule.smi
with open("molecule.smi", "r") as f:
    for i in range(5):
        print(f.readline().strip())

In [None]:
# Count total lines in molecule.smi
with open("molecule.smi", "r") as f:
    count = sum(1 for _ in f)
print(f"📄 Total lines in molecule.smi: {count}")
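
The `to_csv` call above writes one SMILES string and one identifier per line, separated by a tab. A missing SMILES value or a stray separator would produce a malformed line, so a quick format check can be useful before running the descriptor calculation. The sketch below uses a hypothetical helper, `find_malformed_smi_lines`, and made-up sample lines.

```python
def find_malformed_smi_lines(lines):
    """Return 1-based numbers of lines lacking exactly two non-empty tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            bad.append(number)
    return bad


# Made-up sample lines: the identifiers are placeholders, not real ChEMBL IDs.
sample = [
    "CCO\tEXAMPLE_ID_1\n",        # well formed
    "c1ccccc1 EXAMPLE_ID_2\n",    # space instead of tab
    "\tEXAMPLE_ID_3\n",           # empty SMILES field
]
print(find_malformed_smi_lines(sample))
```

To check the real file, pass an open file handle, for example `find_malformed_smi_lines(open("molecule.smi"))`; an empty list means every line has both fields.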

## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
# Print padel.sh file contents
with open("padel.sh", "r") as f:
    print(f.read())

In [None]:
import subprocess
import os

padel_dir = os.path.join(os.getcwd(), "padel")
cmd = [
    "java",
    "-Xms2G",
    "-Xmx2G",
    "-Djava.awt.headless=true",
    "-jar",
    os.path.join(padel_dir, "PaDEL-Descriptor.jar"),
    "-removesalt",
    "-standardizenitro",
    "-fingerprints",
    "-descriptortypes",
    os.path.join(padel_dir, "PubchemFingerprinter.xml"),
    "-dir",
    os.getcwd(),
    "-file",
    "descriptors_output.csv"
]

subprocess.run(cmd, check=True)  # raise CalledProcessError if PaDEL-Descriptor exits with an error

In [None]:
!dir

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
import pandas as pd
import os

print(os.listdir())  # check if file exists in working dir

df3_X = pd.read_csv('descriptors_output.csv')
print("✅ descriptors_output.csv loaded")
df3_X.head()

In [None]:
df3_X

In [None]:
if 'Name' in df3_X.columns:
    df3_X = df3_X.drop(columns=['Name'])
    print("✅ Dropped column: Name")
else:
    print("⚠️ Column 'Name' not found — skipping")

df3_X.head()

## **Y variable**

### **Convert IC50 to pIC50**

In [None]:
df3_Y = df3['pIC50']
df3_Y
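
The `pIC50` column is already present in the loaded file, so the cell above only selects it. For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar; the function name `ic50_nm_to_pic50` is hypothetical and is not taken from the earlier parts of this series.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# An IC50 of 1000 nM is 1e-6 M, so the result is approximately 6.
print(ic50_nm_to_pic50(1000.0))
```

A higher pIC50 corresponds to a lower IC50, that is, a more potent compound.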

## **Combining X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3.head()
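
Note that `pd.concat(..., axis=1)` pairs rows by index, so the result is only correct when the descriptor file contains the same molecules in the same order as `df3`. A more defensive alternative is to join on the molecule identifier before the `Name` column is dropped. The sketch below uses toy data frames with made-up values, and it assumes that the `Name` column of the descriptor file carries the identifier written to the second column of `molecule.smi`.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are made up for illustration.
descriptors = pd.DataFrame({
    "Name": ["ID_B", "ID_A"],      # deliberately in a different order
    "PubchemFP0": [1, 0],
    "PubchemFP1": [0, 1],
})
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["ID_A", "ID_B", "ID_C"],
    "pIC50": [6.1, 7.3, 5.0],
})

# Join on the identifier; validate raises if either side has duplicate keys.
merged = descriptors.merge(
    bioactivity[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Keep only the fingerprint columns and the target.
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])
print(aligned)
```

With `validate="one_to_one"`, pandas raises an error if an identifier appears more than once on either side, which flags duplicates that would otherwise silently multiply rows.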

In [None]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
print("✅ File saved successfully")

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**