# 02 - Molecular Featurization with RDKit

In this step, we convert each molecule into a machine-readable numerical format using RDKit. We'll generate:
- 2D physicochemical descriptors
- Morgan fingerprints (ECFP)

These features will later be used to train machine learning models to predict pIC50 values.


## Step 1: Load Cleaned Data with SMILES

In the previous step, we fetched the SMILES strings from ChEMBL. Now we load the cleaned CSV file, whose key columns are:

- molecule_chembl_id
- canonical_smiles
- pIC50

From here, we begin featurization directly using RDKit.

In [3]:
import pandas as pd

df = pd.read_csv("../data/chembl_egfr_clean.csv")
df.head()


| Unnamed: 0 | molecule_chembl_id | canonical_smiles | IC50 | units | pIC50 |
|---|---|---|---|---|---|
| 0 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 41.0 | nM | 7.387 |
| 1 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 300.0 | nM | 6.523 |
| 2 | CHEMBL68920 | `Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...` | 7820.0 | nM | 5.107 |
| 3 | CHEMBL69960 | `Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...` | 170.0 | nM | 6.77 |
| 4 | CHEMBL69960 | `Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...` | 40.0 | nM | 7.398 |
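
As a quick sanity check on the loaded data, the pIC50 column can be compared against the IC50 column. The sketch below assumes the previous step used the common definition pIC50 = -log10(IC50 in mol/L) with IC50 given in nM; that definition is not stated in this notebook, and `pic50_from_nm` is a hypothetical helper name. The five IC50 values are taken from the preview above.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L).

    Assumption: this is the definition used when the CSV was created.
    """
    return -math.log10(ic50_nm * 1e-9)

# IC50 values (nM) from the first five rows of the preview
preview_ic50_nm = [41.0, 300.0, 7820.0, 170.0, 40.0]
recomputed = [round(pic50_from_nm(v), 3) for v in preview_ic50_nm]
print(recomputed)
```

If the recomputed values match the pIC50 column to three decimals, the units and the transformation are consistent.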


## Step 2: Convert SMILES to RDKit Objects

In [5]:
from rdkit import Chem

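# Parse each SMILES string; MolFromSmiles returns None when parsing fails
df["mol"] = df["canonical_smiles"].apply(Chem.MolFromSmiles)
# Drop rows with invalid SMILES; copy() avoids a SettingWithCopyWarning when columns are added later
df = df[df["mol"].notnull()].copy()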

## Step 3: Generate Descriptors

As examples, we compute four common descriptors: molecular weight, logP, and the numbers of hydrogen-bond donors and acceptors.

In [7]:
from rdkit.Chem import Descriptors

df["MolWt"] = df["mol"].apply(Descriptors.MolWt)  # Molecular weight
df["LogP"] = df["mol"].apply(Descriptors.MolLogP) # LogP (octanol-water partition coefficient)
df["NumHDonors"] = df["mol"].apply(Descriptors.NumHDonors)  # Number of hydrogen bond donors
df["NumHAcceptors"] = df["mol"].apply(Descriptors.NumHAcceptors)  # Number of hydrogen bond acceptors

## Step 4: Generate Morgan Fingerprints (ECFP4)

Morgan fingerprints are RDKit's implementation of Extended-Connectivity Fingerprints (ECFP), a way to represent a molecule's structure numerically based on the substructures it contains.

They capture circular patterns around each atom, similar to how a human might notice functional groups or atomic environments.

- The “4” in ECFP4 refers to a diameter of 4 bonds, or a radius of 2.
- This means the fingerprint captures information up to 2 bonds away from each atom.

**How it works:**

1. Start at each atom in the molecule.
2. Iteratively collect information about neighboring atoms within a certain radius (e.g., radius = 2).
3. Hash each unique substructure to a position in a fixed-length binary vector (e.g., 2048 bits) and set that bit to 1.
4. The final vector tells you which substructures are present in the molecule.
5. Each set bit corresponds to one or more substructures, because different substructures can hash to the same position (a bit collision).
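
The hashing step can be illustrated with a toy sketch in plain Python. This is not RDKit's algorithm: the environment labels are made up for illustration, `fold_to_bits` is a hypothetical helper, and SHA-256 stands in for whatever hash RDKit uses internally. It only shows how identifiers are folded into a fixed-length bit vector and why a shorter vector forces collisions.

```python
import hashlib

def fold_to_bits(substructure_ids, n_bits=2048):
    """Toy folding step: hash each identifier to a position and set that bit."""
    bits = [0] * n_bits
    for ident in substructure_ids:
        digest = hashlib.sha256(ident.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits

# Hypothetical atom-environment labels; RDKit derives real ones from the molecular graph
environments = ["C(-C)(-O)", "O(-C)", "c(:c)(:c)(-N)", "N(-c)(-C)"]

wide = fold_to_bits(environments, n_bits=2048)
narrow = fold_to_bits(environments, n_bits=2)

# Four environments cannot occupy more than two bits in the narrow vector,
# so at least two of them must share a bit there.
print(sum(wide), sum(narrow))
```

Larger values of `nBits` make collisions rarer at the cost of a wider feature matrix.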

In [8]:
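import numpy as np
from rdkit import DataStructs
from rdkit.Chem import AllChem

def mol_to_fp(mol, radius=2, nBits=2048):
    """Return the Morgan fingerprint of `mol` as a 0/1 NumPy vector."""
    # radius=2 corresponds to ECFP4
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nBits)
    arr = np.zeros((nBits,), dtype=int)
    DataStructs.ConvertToNumpyArray(fp, arr)  # fills arr in place
    return arr

# Stack the per-molecule vectors into a matrix of shape (n_molecules, nBits)
fp_array = np.array([mol_to_fp(m) for m in df["mol"]])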


## Step 5: Save Features for Modeling

In [11]:
# Save descriptors together with the pIC50 target
desc_cols = ["MolWt", "LogP", "NumHDonors", "NumHAcceptors", "pIC50"]
df[desc_cols].to_csv("../data/descriptors.csv", index=False)

# Save fingerprints as NumPy array and labels separately
np.save("../data/fingerprints.npy", fp_array)
df["pIC50"].to_csv("../data/labels.csv", index=False)
