<a href="https://colab.research.google.com/github/lmVl12/AI_and_Drug_Discovery_Course_2026/blob/main/Assignment_3_Task2_3D_descriptors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook calculates molecular descriptors for the preprocessed ChEMBL bioactivity dataset and assembles the final table for QSAR modeling.

# **Part 3: Descriptor Calculation**

The choice of a descriptor calculation tool depends on two requirements: a high-dimensional feature space and a robust software environment. Candidates were evaluated by process of elimination:

* Commercial software (e.g., **Dragon, alvaDesc**) offers the most extensive descriptor sets (5,000+) but was excluded due to licensing constraints.

* General-purpose libraries (e.g., **RDKit**) provide high-quality descriptors but only about 200 of them, with a limited variety of topological indices, which may not capture sufficient structural complexity for this target.

* **Mordred** offers over 1,800 descriptors and was a strong candidate. It was dismissed because it depends on outdated package versions and is no longer actively maintained, which puts reproducibility in current environments at risk.

Consequently, **PaDEL-Descriptor** was selected. It provides 1,444 1D and 2D descriptors and, crucially, runs as a standalone tool. Because it does not depend on the version of an underlying cheminformatics toolkit, it offers greater stability and reliability for the QSAR modeling pipeline.

## **1. Technical Framework**

In [None]:
!pip install padelpy

Collecting padelpy
  Downloading padelpy-0.1.16-py3-none-any.whl.metadata (7.7 kB)
Downloading padelpy-0.1.16-py3-none-any.whl (20.9 MB)
Installing collected packages: padelpy
Successfully installed padelpy-0.1.16


In [None]:
import pandas as pd
import numpy as np
from google.colab import files
from padelpy import padeldescriptor

## **2. Load dataset**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


Loading the preprocessed Lipinski-filtered dataset from Google Drive and inspecting its structure.

In [None]:
results_path = "/content/gdrive/My Drive/Colab Notebooks/data/"
df = pd.read_csv(results_path + 'df_lipinski.csv')
df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL4208168,inactive,5.922632,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1,459.348,3.8681,1.0,8.0
1,CHEMBL1173420,inactive,4.69897,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3Cc2ccccc2)cc1,465.355,4.2173,1.0,7.0
2,CHEMBL6005160,active,7.070581,Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1,344.216,2.5425,2.0,5.0
3,CHEMBL3900620,inactive,5.0,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(C)c3)n...,399.454,3.8188,0.0,7.0
4,CHEMBL3939018,inactive,5.004971,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(CC4CCO...,483.572,4.7084,0.0,8.0


In [None]:
data = df[['canonical_smiles', 'molecule_chembl_id']]
data.head()

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1,CHEMBL4208168
1,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3Cc2ccccc2)cc1,CHEMBL1173420
2,Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1,CHEMBL6005160
3,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(C)c3)n...,CHEMBL3900620
4,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(CC4CCO...,CHEMBL3939018


## **3. Convert to .smi format**

SMILES strings are converted into a .smi format to ensure a standardized input for the PaDEL software, allowing for consistent parsing of chemical structures.

In [None]:
# to_csv returns None when given a path, so the result is not assigned
data['canonical_smiles'].to_csv('smiles_chembl.smi', index=False, header=False)

In [None]:
!head smiles_chembl.smi

Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1
Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3Cc2ccccc2)cc1
Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1
C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(C)c3)nc2c1
C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(CC4CCOCC4)c3)nc2c1
C#CCN(c1ccc2ncc(-c3cnn(C)c3)nc2c1)c1c(Cl)c(OC)cc(OC)c1Cl
C#CCN(c1ccc2ncc(-c3cnn(C)c3)nc2c1)c1c(F)c(OC)cc(OC)c1F
C#CCN1CCc2cc(Nc3ncc(C)c(-c4cnn(C(C)C)c4)n3)ccc2C1
C#CCOc1ccc(Nc2ccc3ncc(N4CCOCC4)nc3c2C#N)cc1OC
C#CCn1cc(-c2ccc(NC(=O)Nc3cc(C(C)(C)C)on3)c(F)c2)c2c(N)ncnc21
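The file above holds SMILES only, so PaDEL labels each molecule with an auto-generated name. A common `.smi` convention is one molecule per line with the SMILES, a tab, and then an identifier. The sketch below writes that layout for a two-molecule toy frame; `data_demo` and the output file name are illustrative, and whether PaDEL picks up the second field as the molecule name is an assumption worth checking on a small file before relying on it.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (two molecules from the table above)
data_demo = pd.DataFrame({
    "canonical_smiles": [
        "Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1",
        "Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1",
    ],
    "molecule_chembl_id": ["CHEMBL4208168", "CHEMBL6005160"],
})

# One molecule per line: SMILES, a tab, then the identifier
data_demo[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    "smiles_with_ids_demo.smi", sep="\t", index=False, header=False
)

# Read the file back to inspect the layout
with open("smiles_with_ids_demo.smi") as handle:
    smi_lines = handle.read().splitlines()

print(smi_lines[0])
```

If the identifiers do carry through, the descriptor table can later be joined to the activity data by ID instead of by row position.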


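## **4. Calculate 2D molecular descriptors using the "padeldescriptor" function**


PaDEL 1D and 2D descriptors (for example atom and bond counts, topological indices, and estimated properties such as ALogP and XLogP) are calculated for every molecule in the .smi file. Fingerprints are switched off in this step, so the output contains numerical descriptors rather than binary substructure bits.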

In [None]:
padeldescriptor(mol_dir="smiles_chembl.smi",
                d_file='descriptors_2d.csv',
                d_2d=True,
                fingerprints=False, # explicitly disabled: only 2D descriptors are needed here
                retainorder=True
                )

In [None]:
!ls -lh descriptors_2d.csv

-rw-r--r-- 1 root root 82M Feb 11 12:25 descriptors_2d.csv


In [None]:
df_2d = pd.read_csv("descriptors_2d.csv")
df_2d.head()

Unnamed: 0,Name,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,nHeavyAtom,...,AMW,WTPT-1,WTPT-2,WTPT-3,WTPT-4,WTPT-5,WPATH,WPOL,XLogP,Zagreb
0,AUTOGEN_smiles_chembl_1,0,-0.4872,0.237364,56.4066,61.790239,15,16,52,29,...,8.809742,60.830308,2.097597,28.478208,6.09021,19.865004,2185.0,46.0,3.273,158.0
1,AUTOGEN_smiles_chembl_2,0,0.3987,0.158962,42.8752,63.174653,21,22,51,30,...,9.099922,62.82446,2.094149,25.368555,3.013874,19.831713,2464.0,46.0,5.333,162.0
2,AUTOGEN_smiles_chembl_3,0,0.034,0.001156,30.9194,44.285102,15,17,35,21,...,9.801236,43.715787,2.081704,18.354576,0.0,15.808257,946.0,30.0,2.638,116.0
3,AUTOGEN_smiles_chembl_4,0,0.8016,0.642563,38.6007,61.586653,21,23,51,30,...,7.826853,61.630301,2.054343,21.656568,5.595335,16.061233,2502.0,48.0,2.867,158.0
4,AUTOGEN_smiles_chembl_5,0,-0.3021,0.091264,57.2857,76.522997,21,23,65,36,...,7.434262,74.557462,2.071041,25.049685,8.597274,16.452411,4425.0,57.0,3.415,190.0
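Because the descriptor table is later combined with the activity data by row position, it can help to confirm that the rows came back in input order. The sketch below assumes the auto-generated names end in a 1-based running index, as in the output above; `autogen_rows_in_order` is a hypothetical helper, not part of padelpy.

```python
import pandas as pd

def autogen_rows_in_order(names: pd.Series) -> bool:
    """Return True if the names end in _1, _2, ..., _n in exactly that order."""
    # Take the text after the last underscore and read it as an integer
    suffixes = names.str.rsplit("_", n=1).str[-1].astype(int)
    return suffixes.tolist() == list(range(1, len(names) + 1))

# Toy examples mimicking the 'Name' column of the descriptor table
ordered = pd.Series(
    ["AUTOGEN_smiles_chembl_1", "AUTOGEN_smiles_chembl_2", "AUTOGEN_smiles_chembl_3"]
)
shuffled = pd.Series(
    ["AUTOGEN_smiles_chembl_2", "AUTOGEN_smiles_chembl_1", "AUTOGEN_smiles_chembl_3"]
)

ordered_ok = autogen_rows_in_order(ordered)
shuffled_ok = autogen_rows_in_order(shuffled)
print(ordered_ok, shuffled_ok)
```

In the notebook this would be called as `autogen_rows_in_order(df_2d["Name"])`, together with a check that `len(df_2d) == len(df)`.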


## **5. Prepare Dataset for ML**

In [None]:
df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL4208168,inactive,5.922632,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1,459.348,3.8681,1.0,8.0
1,CHEMBL1173420,inactive,4.69897,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3Cc2ccccc2)cc1,465.355,4.2173,1.0,7.0
2,CHEMBL6005160,active,7.070581,Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1,344.216,2.5425,2.0,5.0
3,CHEMBL3900620,inactive,5.0,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(C)c3)n...,399.454,3.8188,0.0,7.0
4,CHEMBL3939018,inactive,5.004971,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(CC4CCO...,483.572,4.7084,0.0,8.0


Zero-variance descriptors are removed because a column that takes the same value for every molecule carries no information for the model. Dropping them also reduces computation and potential model noise.

The remaining descriptors are then merged with the compound identifiers and activity labels (bioactivity class and pIC50) to build the final dataset for subsequent QSAR modeling.
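As a toy illustration of what zero-variance filtering does, the pandas-only sketch below drops a constant column from a three-row frame. The values are taken from the first rows of the descriptor table; `X_demo` is an illustrative name, and the population variance (`ddof=0`) is used on the assumption that it mirrors how a variance threshold of 0 behaves.

```python
import pandas as pd

# Three descriptors for three molecules; 'nAcid' is constant
X_demo = pd.DataFrame({
    "nAcid": [0, 0, 0],
    "nAtom": [52, 51, 35],
    "Zagreb": [158.0, 162.0, 116.0],
})

# Population variance per column; constant columns have variance 0
variances = X_demo.var(ddof=0)

# Keep only the columns whose variance is strictly positive
kept = X_demo.loc[:, variances > 0]
print(list(kept.columns))  # ['nAtom', 'Zagreb']
```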

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Drop the 'Name' column (PaDEL's auto-generated molecule label)
X = df_2d.drop(columns='Name')

# Remove constant (zero-variance) descriptors
selector = VarianceThreshold(threshold=0)
X_reduced = selector.fit_transform(X)

# fit_transform returns a bare array, so restore the descriptor names
selected_cols = X.columns[selector.get_support()]
df_2d_clean = pd.DataFrame(X_reduced, columns=selected_cols)

print(f"Total number of 2D descriptors: {X.shape[1]}")
print(f"Number of unique descriptors:   {df_2d_clean.shape[1]}")

# Keep only the identifier and activity columns needed for ML
meta_cols = df[['molecule_chembl_id', 'bioactivity_class', 'pIC50']]

# Reset both indexes so that rows are paired by position
meta_cols = meta_cols.reset_index(drop=True)
df_2d_clean = df_2d_clean.reset_index(drop=True)

# Combine identifiers, labels, and descriptors into one table
# (relies on retainorder=True keeping the descriptor rows in input order)
combined_df = pd.concat([meta_cols, df_2d_clean], axis=1)
combined_df.head()

Total number of 2D descriptors: 1444
Number of unique descriptors:   1209


Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,...,AMW,WTPT-1,WTPT-2,WTPT-3,WTPT-4,WTPT-5,WPATH,WPOL,XLogP,Zagreb
0,CHEMBL4208168,inactive,5.922632,0.0,-0.4872,0.237364,56.4066,61.790239,15.0,16.0,...,8.809742,60.830308,2.097597,28.478208,6.09021,19.865004,2185.0,46.0,3.273,158.0
1,CHEMBL1173420,inactive,4.69897,0.0,0.3987,0.158962,42.8752,63.174653,21.0,22.0,...,9.099922,62.82446,2.094149,25.368555,3.013874,19.831713,2464.0,46.0,5.333,162.0
2,CHEMBL6005160,active,7.070581,0.0,0.034,0.001156,30.9194,44.285102,15.0,17.0,...,9.801236,43.715787,2.081704,18.354576,0.0,15.808257,946.0,30.0,2.638,116.0
3,CHEMBL3900620,inactive,5.0,0.0,0.8016,0.642563,38.6007,61.586653,21.0,23.0,...,7.826853,61.630301,2.054343,21.656568,5.595335,16.061233,2502.0,48.0,2.867,158.0
4,CHEMBL3939018,inactive,5.004971,0.0,-0.3021,0.091264,57.2857,76.522997,21.0,23.0,...,7.434262,74.557462,2.071041,25.049685,8.597274,16.452411,4425.0,57.0,3.415,190.0
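Positional concatenation only pairs the right rows when both tables have the same length and order. A minimal guard, assuming nothing beyond pandas, is to fail loudly on a length mismatch before combining. `concat_aligned` is a hypothetical helper and the frames below are toy data.

```python
import pandas as pd

def concat_aligned(meta: pd.DataFrame, desc: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise, refusing tables with different row counts."""
    if len(meta) != len(desc):
        raise ValueError(
            f"Row count mismatch: {len(meta)} metadata rows vs {len(desc)} descriptor rows"
        )
    # Reset both indexes so rows are paired strictly by position
    return pd.concat(
        [meta.reset_index(drop=True), desc.reset_index(drop=True)], axis=1
    )

meta_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL4208168", "CHEMBL1173420", "CHEMBL6005160"],
    "pIC50": [5.922632, 4.69897, 7.070581],
})
desc_demo = pd.DataFrame({
    "nAtom": [52, 51, 35],
    "Zagreb": [158.0, 162.0, 116.0],
})

combined_demo = concat_aligned(meta_demo, desc_demo)
print(combined_demo.shape)
```

Without such a check, `pd.concat` pads the shorter table with missing values, which a later `dropna()` would remove without any warning.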


## **6. Check for completeness of dataset after calculation**

In [None]:
nan_count = combined_df.isnull().sum().sum()
if nan_count > 0:
    print(f"{nan_count} missing values found; dropping the affected rows")
    combined_df = combined_df.dropna()
else:
    print("No missing values")

print(f"Final records count: {combined_df.shape[0]}")

No missing values
Final records count: 4647
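`isnull()` does not flag infinite values, which most scikit-learn estimators also reject. Whether this dataset contains any is not shown here, so the sketch below is only a hedged extension of the check above on toy data: it counts infinite entries, converts them to NaN, and drops the affected rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one infinite and one missing descriptor value
demo = pd.DataFrame({
    "pIC50": [5.0, 7.1, 6.2],
    "ALogP": [0.4, np.inf, -0.3],
    "AMR": [42.9, 30.9, np.nan],
})

# Count infinite entries in the numeric columns
numeric = demo.select_dtypes(include="number")
n_inf = int(np.isinf(numeric).sum().sum())

# Treat infinities as missing, then drop incomplete rows
cleaned = demo.replace([np.inf, -np.inf], np.nan).dropna()

print(f"Infinite values: {n_inf}, rows kept: {len(cleaned)}")
```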


## **7. Save and download the dataset**

In [None]:
# Save as CSV
combined_df.to_csv(results_path + 'QSAR_dataset_2d.csv', index=False)
print("Combined dataset saved as QSAR_dataset_2d.csv")

# Download the file through the Colab browser session
files.download(results_path + 'QSAR_dataset_2d.csv')

Combined dataset saved as QSAR_dataset_2d.csv

