# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for the subsequent model building in Part 4.
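
To make "quantitative description" concrete: a fingerprint descriptor encodes a molecule as a fixed-length vector of 0/1 bits, where each bit records whether a particular structural feature is present. The sketch below is a toy illustration only. It assumes a hypothetical four-entry key list (`TOY_KEYS`) and plain substring matching on the SMILES text, whereas real substructure fingerprints such as the PubChem fingerprints computed later in this notebook use proper chemical substructure search and far more keys.

```python
# Toy illustration only: real fingerprints use chemical substructure search,
# not substring matching on the SMILES text.
TOY_KEYS = ["N", "O", "Cl", "(=O)"]  # hypothetical feature keys

def toy_fingerprint(smiles):
    """Return one 0/1 bit per key: 1 if the key text occurs in the SMILES."""
    return [1 if key in smiles else 0 for key in TOY_KEYS]

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(toy_fingerprint(aspirin))
```

Each molecule becomes one row of bits, so a whole dataset becomes a matrix that a machine learning model can consume.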

---

In [1]:
import os

# CHANGE THIS ONLY if your repo lives somewhere else
REPO_DIR = r"C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp"

os.chdir(REPO_DIR)
print("📂 Working directory set to:", os.getcwd())
print("📄 Files here now:", os.listdir())

C:\Users\uddin\Documents\bioinformatics_freecodecamp


In [2]:
import sys, subprocess

def pip_install(pkg):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet", pkg])

for pkg in ["pandas", "requests", "padelpy"]:
    pip_install(pkg)

import pandas as pd
import requests
from padelpy import padeldescriptor

print("✅ pandas/requests/padelpy ready")

C:\Users\uddin\anaconda3\python.exe


In [3]:
import shutil, subprocess

java_path = shutil.which("java")
print("🔎 Java path seen by Jupyter:", java_path)

if java_path is None:
    raise SystemExit("❌ Java not found in PATH for Jupyter.\n"
                     "Fix: Add your Java bin folder to PATH, then restart Jupyter launched from CMD.\n"
                     "Example PATH: C:\\Program Files\\Eclipse Adoptium\\jdk-17.x.x\\bin")

# Show version (Java prints version to stderr typically)
ver = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(ver.stderr or ver.stdout)
print("✅ Java available")

✅ requests is installed and working


In [4]:
import os, requests

DATA_URL = "https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"
DATA_CSV = "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"

if not os.path.exists(DATA_CSV):
    print("⬇️ Downloading dataset …")
    r = requests.get(DATA_URL, timeout=60)  # do not hang forever on a stalled connection
    r.raise_for_status()
    with open(DATA_CSV, "wb") as f:
        f.write(r.content)
    print("✅ Downloaded:", DATA_CSV)
else:
    print("✅ Already present:", DATA_CSV)

Files in current directory:
['.git', '.gitignore', '.ipynb_checkpoints', 'acetylcholinesterase_01_bioactivity_data_raw.csv', 'acetylcholinesterase_02_bioactivity_data_preprocessed.csv', 'acetylcholinesterase_03_bioactivity_data_curated.csv', 'acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv', 'acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv', 'acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', 'acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv', 'anaconda_projects', 'CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb', 'CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb', 'CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb', 'CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb', 'descriptors_output.csv', 'mannwhitneyu_LogP.csv', 'mannwhitneyu_MW.csv', 'mannwhitneyu_NumHAcceptors.csv', 'mannwhitneyu_NumHDonors.csv', 'mannwhitneyu_pIC50.csv', 'molecule.smi', 'padel.sh', '

In [5]:
import pandas as pd

df_source = pd.read_csv("acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv")

# Normalize likely column names across tutorial variants
cols = {c.lower(): c for c in df_source.columns}

# Try to discover SMILES + ID columns
smiles_col = None
for cand in ["canonical_smiles", "smiles", "molecule_smiles"]:
    if cand in cols:
        smiles_col = cols[cand]
        break

id_col = None
for cand in ["molecule_chembl_id", "chembl_id", "name", "compound_name", "molecule_id", "id"]:
    if cand in cols:
        id_col = cols[cand]
        break

# Fallbacks if not found
if smiles_col is None:
    raise SystemExit("❌ Could not find a SMILES column in the dataset. "
                     "Expected one of: canonical_smiles/smiles/molecule_smiles.")
if id_col is None:
    # Create a stable ID from index if no obvious ID col
    id_col = "__generated_name__"
    df_source[id_col] = [f"MOL_{i:06d}" for i in range(len(df_source))]

print("✅ Using columns → SMILES:", smiles_col, "| NAME:", id_col)
df_source[[smiles_col, id_col]].head()

📂 Changed working directory to: C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp
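
The column discovery in the cell above repeats the same loop for the SMILES and ID columns. A minimal sketch of factoring it into one case-insensitive lookup follows; `find_column` is a hypothetical helper name, and the column list is made up for the demo.

```python
from typing import Iterable, Optional

def find_column(columns: Iterable[str], candidates: Iterable[str]) -> Optional[str]:
    """Return the first column whose lower-cased name matches a candidate, else None."""
    lookup = {name.lower(): name for name in columns}
    for cand in candidates:
        if cand.lower() in lookup:
            return lookup[cand.lower()]
    return None

# Made-up column names for illustration
columns = ["molecule_chembl_id", "Canonical_SMILES", "class", "pIC50"]
print(find_column(columns, ["canonical_smiles", "smiles"]))
print(find_column(columns, ["inchi"]))
```

Candidates are tried in order, so the most specific name should come first in the list.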


In [6]:
SMI_FILE = "molecule.smi"

n_written = 0
with open(SMI_FILE, "w", encoding="utf-8") as f:
    for smi, name in zip(df_source[smiles_col], df_source[id_col]):
        # Skip rows with a missing SMILES string or name
        if pd.notna(smi) and pd.notna(name):
            f.write(f"{smi} {str(name).strip()}\n")
            n_written += 1

# Count while writing instead of reopening the file without closing it
print(f"✅ Wrote {SMI_FILE} with {n_written} lines")

✅ Downloaded: padel.zip
✅ Downloaded: padel.sh


## **Download PaDEL-Descriptor**

In [7]:
from padelpy import padeldescriptor
import os, shutil, sys, subprocess

# Verify Java availability
java_path = shutil.which("java")
if not java_path:
    print("⚠️ Java not found in PATH. Attempting to locate manually...")
    possible_path = r"C:\Program Files\Eclipse Adoptium\jdk-25\bin"
    if os.path.exists(possible_path):
        os.environ["PATH"] += os.pathsep + possible_path
        print("✅ Added Java to PATH temporarily.")
    else:
        sys.exit("❌ Java not found. Please install JRE 8+ or update PATH.")

# Confirm Java version
print("🧠 Using Java from:", shutil.which("java"))
subprocess.run(["java", "-version"])

# Run PaDEL
smi_file = os.path.join(os.getcwd(), "molecule.smi")

padeldescriptor(
    mol_dir=smi_file,
    d_file="descriptors_output.csv",
    fingerprints=True
)

print("✅ descriptors_output.csv created successfully.")
print("📄 Files in directory:", os.listdir())

C:\Program Files\Eclipse Adoptium\jdk-25.0.0.36-hotspot\bin\java.EXE


In [8]:
import pandas as pd

df3_X = pd.read_csv("descriptors_output.csv")

# Drop 'Name' if present (to avoid duplicate on merge later)
if "Name" in df3_X.columns:
    df3_X = df3_X.drop(columns=["Name"])

print("✅ Descriptors shape:", df3_X.shape)
df3_X.head()

✅ padelpy installed


In [9]:
import pandas as pd

# Rebuild the same 'Name' we used in molecule.smi
df_labels = df_source[[id_col]].copy()
df_labels = df_labels.rename(columns={id_col: "Name"})

# Bring over activity columns if present
for cand in ["class", "pIC50", "pic50", "activity_class"]:
    if cand in df_source.columns:
        df_labels[cand] = df_source[cand]

# Ensure 'Name' exists for merge key
if "Name" not in df_labels.columns:
    raise SystemExit("❌ Could not construct 'Name' column for labels table.")

print("✅ Labels preview:")
df_labels.head()



In [None]:
import pandas as pd

# Re-load descriptors WITH 'Name' to merge properly, then drop afterward
dfX_full = pd.read_csv("descriptors_output.csv")
if "Name" not in dfX_full.columns:
    raise SystemExit("❌ 'Name' column missing in descriptors_output.csv (unexpected).")

dataset3 = pd.merge(dfX_full, df_labels, on="Name", how="inner")

# Now drop Name column if you don't need it further
if "Name" in dataset3.columns:
    dataset3 = dataset3.drop(columns=["Name"])

OUT_FINAL = "acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv"
dataset3.to_csv(OUT_FINAL, index=False)

print("✅ Final dataset saved:", OUT_FINAL)
print("🧮 Shape:", dataset3.shape)
dataset3.head()

In [None]:
# Windows-compatible dir listing (handy confirmation)
!dir

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.
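
As a reminder of what the pIC50 column holds: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so a more potent compound (lower IC50) has a higher pIC50. The sketch below assumes the IC50 is given in nM, which is an assumption about the upstream units rather than something this notebook states; `pic50_from_nm` is a hypothetical helper name.

```python
import math

def pic50_from_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)

for ic50_nm in (10.0, 1000.0, 10000.0):
    print(ic50_nm, "nM ->", round(pic50_from_nm(ic50_nm), 2))
```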

In [None]:
import requests

url = "https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"
filename = "acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv"

r = requests.get(url, timeout=60)
r.raise_for_status()  # stop here rather than save an HTTP error page as the CSV
with open(filename, "wb") as f:
    f.write(r.content)

print(f"✅ Downloaded dataset: {filename}")

In [None]:
import pandas as pd

In [None]:
import os

# If the file was downloaded in the same folder as the notebook:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

# If that fails, uncomment the line below and use the absolute path instead:
# df3 = pd.read_csv(r"C:\Users\uddin\Documents\first_repo\bioinformatics_freecodecamp\acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv")

print("✅ File loaded successfully")
df3.head()

In [None]:
df3

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
# Show first 5 lines of molecule.smi
with open("molecule.smi", "r") as f:
    for i in range(5):
        print(f.readline().strip())

In [None]:
# Count total lines in molecule.smi
with open("molecule.smi", "r") as f:
    count = sum(1 for _ in f)
print(f"📄 Total lines in molecule.smi: {count}")
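
Beyond counting lines, it can help to confirm that every line of the `.smi` file has exactly two whitespace-separated fields, a SMILES string and a name. The following is a minimal sketch under that assumption (names containing spaces would be flagged); `check_smi` is a hypothetical helper, and the demo file uses made-up identifiers.

```python
import os
import tempfile

def check_smi(path):
    """Return (n_lines, bad_line_numbers); a good line is '<SMILES><whitespace><name>'."""
    bad = []
    n = 0
    with open(path, "r", encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if len(line.split()) != 2:
                bad.append(n)
    return n, bad

# Self-contained demo on a throwaway file with made-up identifiers
demo_path = os.path.join(tempfile.mkdtemp(), "demo.smi")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("CCO\tMOL_000000\n")
    f.write("c1ccccc1 MOL_000001\n")
    f.write("CCN\n")  # missing name: should be flagged

print(check_smi(demo_path))
```

Calling `check_smi("molecule.smi")` in the notebook would apply the same check to the real file.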

## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
# Print padel.sh file contents
with open("padel.sh", "r") as f:
    print(f.read())

In [None]:
import subprocess
import os

padel_dir = os.path.join(os.getcwd(), "padel")
cmd = [
    "java",
    "-Xms2G",
    "-Xmx2G",
    "-Djava.awt.headless=true",
    "-jar",
    os.path.join(padel_dir, "PaDEL-Descriptor.jar"),
    "-removesalt",
    "-standardizenitro",
    "-fingerprints",
    "-descriptortypes",
    os.path.join(padel_dir, "PubchemFingerprinter.xml"),
    "-dir",
    os.getcwd(),
    "-file",
    "descriptors_output.csv"
]

subprocess.run(cmd, check=True)  # raise CalledProcessError if Java exits with a non-zero status
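
A missing jar or fingerprint XML only shows up once Java has already started. One way to catch it earlier, sketched below, is to build the argument list in a function and check the referenced files before running. Both helper names (`build_padel_command`, `missing_inputs`) are hypothetical, and the `padel` folder layout is assumed from the cell above.

```python
import os

def build_padel_command(padel_dir, work_dir, out_file, heap="2G"):
    """Assemble the same argument list as the cell above (hypothetical helper)."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", os.path.join(padel_dir, "PaDEL-Descriptor.jar"),
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", os.path.join(padel_dir, "PubchemFingerprinter.xml"),
        "-dir", work_dir,
        "-file", out_file,
    ]

def missing_inputs(cmd):
    """Return the .jar/.xml paths in the command that do not exist on disk."""
    return [arg for arg in cmd
            if arg.endswith((".jar", ".xml")) and not os.path.exists(arg)]

cmd = build_padel_command("padel", os.getcwd(), "descriptors_output.csv")
print(missing_inputs(cmd))  # an empty list means both files were found
```

If the list is empty, `cmd` can be passed to `subprocess.run` exactly as before.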

In [None]:
!dir

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
import pandas as pd
import os

print(os.listdir())  # check if file exists in working dir

df3_X = pd.read_csv('descriptors_output.csv')
print("✅ descriptors_output.csv loaded")
df3_X.head()

In [None]:
df3_X

In [None]:
if 'Name' in df3_X.columns:
    df3_X = df3_X.drop(columns=['Name'])
    print("✅ Dropped column: Name")
else:
    print("⚠️ Column 'Name' not found — skipping")

df3_X.head()

## **Y variable**

### **Select the pIC50 column**

In [None]:
df3_Y = df3['pIC50']
df3_Y

## **Combining X and Y variable**

In [None]:
# Positional join: assumes the rows of descriptors_output.csv are in the same order as df3
dataset3 = pd.concat([df3_X, df3_Y], axis=1)
dataset3.head()
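
Combining X and Y by position only pairs the right rows if the descriptor file lists molecules in the same order as the bioactivity table. The sketch below uses made-up identifiers and values (hypothetical, not taken from the dataset) to contrast a positional `pd.concat` with a merge keyed on `Name`. Here `validate="one_to_one"` asks pandas to raise if an identifier is repeated on either side.

```python
import pandas as pd

# Made-up labels in their original order
labels = pd.DataFrame({
    "Name": ["CHEMBL1", "CHEMBL2", "CHEMBL3"],
    "pIC50": [6.1, 4.8, 7.3],
})

# Made-up descriptor rows, deliberately in a different order
descriptors = pd.DataFrame({
    "Name": ["CHEMBL3", "CHEMBL1", "CHEMBL2"],
    "PubchemFP0": [1, 0, 1],
    "PubchemFP1": [0, 0, 1],
})

# Positional join: row 0 pairs CHEMBL3's bits with CHEMBL1's pIC50
positional = pd.concat([descriptors.drop(columns=["Name"]), labels["pIC50"]], axis=1)

# Keyed join: each fingerprint row is matched to its own label
keyed = descriptors.merge(labels, on="Name", how="inner", validate="one_to_one")

print(positional)
print(keyed)
```

When the two orders are known to agree, both approaches give the same pairing; the keyed merge simply removes that assumption.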

In [None]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
print("✅ File saved successfully")

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**