<a href="https://colab.research.google.com/github/lmVl12/AI_and_Drug_Discovery_Course_2026/blob/main/Assignment_2_QSAR_data_curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**


# **Part 1: Data Collection & Curation**

### **1.1 Environment Setup**
The Google Colab environment is integrated with Google Drive to facilitate persistent data storage and project organization. The chembl_webresource_client is installed to provide programmatic access to the ChEMBL database API.


In [1]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [2]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"

#### **Library Imports**
The **chembl_webresource_client** and **rdkit** packages are installed first. The libraries needed for data manipulation, database access, and structure curation are then imported:
* pandas for data handling
* new_client (from chembl_webresource_client) for targeted queries against the ChEMBL database
* rdkit (Chem and SaltRemover) for chemical structure curation

In [3]:
!pip install chembl_webresource_client
!pip install rdkit


Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.3.0-py3-none-any.whl (70 kB)

In [4]:
import pandas as pd
from chembl_webresource_client.new_client import new_client
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover
remover = SaltRemover()

### **1.2 Target Identification (FLT3)**
The study focuses on the Receptor-type tyrosine-protein kinase FLT3 (Homo sapiens). To ensure biological accuracy, the target is queried using its UniProt ID (P36888). ChEMBL ID CHEMBL1974 is identified as the primary target for bioactivity data retrieval.

In [5]:
target = new_client.target
target_query = target.search("P36888")
targets = pd.DataFrame.from_dict(target_query)
targets.head()


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,17.0,False,CHEMBL1974,"[{'accession': 'P36888', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Aurora kinase B/Receptor-type tyrosine-protein...,15.0,False,CHEMBL3430908,"[{'accession': 'P36888', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Homo sapiens,Protein cereblon/Tyrosine-protein kinase recep...,15.0,False,CHEMBL4630730,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,von Hippel-Lindau disease tumor suppressor/FLT3,13.0,False,CHEMBL4523735,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


### **1.3 Bioactivity Data Retrieval**
Experimental records for CHEMBL1974 are retrieved, specifically filtering for IC50 measurements. The raw dataset is exported as bioactivity_raw_data.csv and backed up to the Google Drive /data directory to ensure data integrity.

In [6]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1974'
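
Taking row 0 works here because the single-protein entry has the highest search score. The search also returns a selectivity group and two protein-protein interaction targets that contain the same accession. The following is a minimal sketch of a more explicit selection, assuming records shaped like the search hits shown above. The helper `pick_single_protein` and the trimmed `sample_hits` list are illustrative and are not part of the ChEMBL client.

```python
# Hypothetical helper: choose the SINGLE PROTEIN target for a UniProt accession,
# independent of the order in which the search returns its hits.
def pick_single_protein(records, accession):
    """Return the ChEMBL ID of the first SINGLE PROTEIN record containing `accession`."""
    for rec in records:
        if rec.get("target_type") != "SINGLE PROTEIN":
            continue
        components = rec.get("target_components") or []
        if any(comp.get("accession") == accession for comp in components):
            return rec["target_chembl_id"]
    return None


# Trimmed copies of the four hits above (only the fields used here),
# deliberately reordered so that the single protein is not first.
sample_hits = [
    {"target_chembl_id": "CHEMBL3430908", "target_type": "SELECTIVITY GROUP",
     "target_components": [{"accession": "P36888"}]},
    {"target_chembl_id": "CHEMBL4630730", "target_type": "PROTEIN-PROTEIN INTERACTION",
     "target_components": [{"accession": "P36888"}]},
    {"target_chembl_id": "CHEMBL1974", "target_type": "SINGLE PROTEIN",
     "target_components": [{"accession": "P36888"}]},
    {"target_chembl_id": "CHEMBL4523735", "target_type": "PROTEIN-PROTEIN INTERACTION",
     "target_components": [{"accession": "P36888"}]},
]

selected = pick_single_protein(sample_hits, "P36888")
print(selected)  # CHEMBL1974
```

In the notebook, the same helper could be applied to `targets.to_dict("records")` instead of the sample list.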

In [7]:
# Retrieving IC50 bioactivity data for selected target
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [9]:
# Initial data assembly for curation
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.128
1,,,872532,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.22
2,,,872564,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,8.79
3,,,879718,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,1.91
4,,,884645,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,30.0


In [10]:
# Confirm that every retrieved record is an IC50 measurement
df1.standard_type.unique()

array(['IC50'], dtype=object)

In [11]:
# Exporting raw database results for traceability
df1.to_csv('bioactivity_raw_data.csv', index=False)

In [12]:
# Archiving raw data to Google Drive project folder
! cp bioactivity_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [13]:
# Verifying successful archival
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 5659
-rw------- 1 root root 5793945 Jan 23 10:00 bioactivity_raw_data.csv


In [14]:
! head bioactivity_raw_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-3 cytoplasmic domain phosphorylation in CHO cells,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc2cc1OCCCN1CCC(C)CC1,,,CHEMBL1135998,J Med Chem,2002,"{'bei': '11.95', 'le': '0.22', 'lle': '1.61'

### **1.4 Data Preprocessing & Categorization**
#### **Cleaning and Bioactivity Classification**
The dataset undergoes systematic curation to ensure data quality:
* Handling Missing Values: Rows lacking a standard_value are removed so that only records with a measured IC50 are analyzed.
* Unit Filtering: Only records whose standard_units is nM are kept.
* Duplicate Handling: Repeated measurements of the same molecule_chembl_id are collapsed to their median IC50.
* Classification: Based on IC50 potency, compounds are categorized into:
  * Active: $\text{IC50} \le 1000 \text{ nM}$
  * Inactive: $\text{IC50} \ge 10000 \text{ nM}$
  * Intermediate: $1000 \text{ nM} < \text{IC50} < 10000 \text{ nM}$
* Feature Selection: Relevant columns (Molecule ChEMBL ID, Canonical SMILES, standard value, and Bioactivity Class) are extracted.
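
The classification rule above can be written as a small standalone function. This is a sketch only: the function name `classify_ic50` and the two constant names are illustrative, while the thresholds are the ones listed above. Note that both boundary values are inclusive, so exactly 1000 nM counts as active and exactly 10000 nM as inactive.

```python
ACTIVE_MAX_NM = 1000.0     # IC50 <= 1000 nM  -> active
INACTIVE_MIN_NM = 10000.0  # IC50 >= 10000 nM -> inactive


def classify_ic50(ic50_nm):
    """Map an IC50 value in nM to a potency class."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"


# Example values in nM, including both boundary cases
examples = [50.0, 1000.0, 1800.0, 10000.0, 30000.0]
labels = [classify_ic50(v) for v in examples]
print(labels)
# ['active', 'active', 'intermediate', 'inactive', 'inactive']
```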

In [15]:
# Convert IC50 values to numeric; unparseable entries become None (NaN in the column)
def to_numeric(val):
    try:
        if isinstance(val, list):
            return float(val[0])
        return float(val)
    except (TypeError, ValueError, IndexError):
        return None
df1['standard_value'] = df1['standard_value'].apply(to_numeric)

In [16]:
# Filter the dataset to remove records lacking bioactivity values (IC50)
df2 = df1.dropna(subset=['standard_value'])
# Monitoring curation efficiency
print(f"Initial records: {len(df1)}")
print(f"Only complete records: {len(df2)}")
df2.head()

Initial records: 7096
Only complete records: 6962


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.128
1,,,872532,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.22
2,,,872564,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,8.79
3,,,879718,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,1.91
4,,,884645,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,30.0


In [17]:
# Keep only records standardized to nM and report the originally reported units of the excluded rows
excluded_df = df2[df2['standard_units'] != 'nM']
df2 = df2[df2['standard_units'] == 'nM']
print(f"Filter complete. Rows remaining: {len(df2)} | Units check: {df2.standard_units.unique()}")
if not excluded_df.empty:
    print("\nReasons for exclusion (Units found):")
    print(excluded_df['units'].value_counts())
else:
    print("\nNo rows were excluded.")

Filter complete. Rows remaining: 6946 | Units check: ['nM']

Reasons for exclusion (Units found):
units
10'-6g/ml      5
10'-7g/ml      4
ug ml-1        4
10'-5g/ml      2
10^-5 mol/L    1
Name: count, dtype: int64


In [18]:
# Collapse repeated measurements of the same compound (median IC50 per ChEMBL ID)
df2 = df2.groupby('molecule_chembl_id').agg({
    'canonical_smiles': 'first',    # Keep one SMILES per ID
    'standard_value': 'median'      # Median IC50 across repeated measurements
}).reset_index()
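
The choice of the median can be illustrated with a small worked example in plain Python. The molecule IDs and IC50 values below are invented for illustration and do not come from the dataset. A single outlying measurement shifts the mean into a different potency range but leaves the median almost untouched.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical replicate measurements: (molecule ID, IC50 in nM)
measurements = [
    ("MOL_A", 10.0),
    ("MOL_A", 12.0),
    ("MOL_A", 5000.0),  # one outlying assay result
    ("MOL_B", 230.0),
]

# Group the values by molecule ID
values_by_id = defaultdict(list)
for mol_id, ic50_nm in measurements:
    values_by_id[mol_id].append(ic50_nm)

# Compare the two aggregates for each molecule
summary = {
    mol_id: {"median": median(vals), "mean": mean(vals)}
    for mol_id, vals in values_by_id.items()
}

for mol_id, stats in summary.items():
    print(f"{mol_id}: median={stats['median']:.1f} nM, mean={stats['mean']:.1f} nM")
# MOL_A: median=12.0 nM, mean=1674.0 nM
# MOL_B: median=230.0 nM, mean=230.0 nM
```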

In [19]:
# Categorizing compounds into potency classes based on IC50 values (nM)
bioactivity_class = []
for value in df2.standard_value:
    value = float(value)
    if value >= 10000:
        bioactivity_class.append("inactive")
    elif value <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")

In [20]:
# Attach the class labels to the curated table
df3 = df2.copy()
df3['bioactivity_class'] = bioactivity_class

df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL102301,COc1cc2ncnc(N3CCN(/C(S)=N\Cc4ccc5c(c4)OCO5)CC3...,14900.0,inactive
1,CHEMBL102346,COc1cc2ncnc(N3CCN(C(=O)Nc4ccc(Oc5ccccc5)cc4)CC...,230.0,active
2,CHEMBL103307,O=C1Nc2ccccc2/C1=C\c1ccc(O)cc1,1800.0,intermediate
3,CHEMBL103667,Cc1ccc(-n2nc(C(C)(C)C)cc2NC(=O)Nc2ccc(OCCN3CCO...,30000.0,inactive
4,CHEMBL104067,COc1cc2ncnc(N3CCN(C(=O)Nc4ccc(C(C)C)cc4)CC3)c2...,50.0,active


### **1.5 Chemical Structure Validation**
A final structural check removes compounds without valid chemical structures: rows whose SMILES is NaN, empty, "None", or cannot be parsed by RDKit are dropped. Salt removal and SMILES canonicalization then ensure that the dataset is compatible with the molecular descriptor calculations in Part 2.

In [21]:
# Return a salt-stripped canonical SMILES, or None for missing or unparseable input
def clean_molecule(smiles):
    if not isinstance(smiles, str) or smiles.strip().lower() in ["none", ""]:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = remover.StripMol(mol)  # Salt removal
    if mol.GetNumAtoms() == 0:   # Nothing left after stripping
        return None
    return Chem.MolToSmiles(mol, canonical=True)  # Canonicalization

# Standardization of SMILES
df3['canonical_smiles'] = df3['canonical_smiles'].apply(clean_molecule)
df3 = df3.dropna(subset=["canonical_smiles"])
duplicate_structures = df3.duplicated(subset=['canonical_smiles']).sum()
print(f"Number of duplicate structures with different IDs: {duplicate_structures}")

# Merge chemically identical structures registered under different IDs (median IC50)
df4 = df3.groupby('canonical_smiles').agg({
    'molecule_chembl_id': 'first',  # Keep one ID as a reference
    'standard_value': 'median',     # Median IC50 across the different IDs
    'bioactivity_class': 'first'    # Class of the first record; may not match the merged median
}).reset_index()

print(f"Records in df3: {len(df3)}")
print(f"Records in final df4: {len(df4)}")
print(f"Successfully collapsed {len(df3) - len(df4)} hidden chemical duplicates.")

Number of duplicate structures with different IDs: 57
Records in df3: 4704
Records in final df4: 4647
Successfully collapsed 57 hidden chemical duplicates.
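
When two IDs with different potencies are merged, the class carried over from the first record may no longer agree with the merged median IC50. The plain-Python sketch below shows the mismatch and one possible remedy, which is to recompute the class from the merged value. The records, IDs, and helper names are hypothetical, and this is not necessarily how the notebook's final table was produced.

```python
from statistics import median


def classify_ic50(ic50_nm):
    """Same thresholds as in section 1.4 (values in nM)."""
    if ic50_nm <= 1000:
        return "active"
    if ic50_nm >= 10000:
        return "inactive"
    return "intermediate"


# Hypothetical records: ID_1 and ID_2 resolve to the same canonical SMILES
records = [
    {"smiles": "CCO", "id": "ID_1", "ic50_nm": 800.0},
    {"smiles": "CCO", "id": "ID_2", "ic50_nm": 20000.0},
    {"smiles": "CCN", "id": "ID_3", "ic50_nm": 50.0},
]

# Group the records by structure
grouped = {}
for rec in records:
    grouped.setdefault(rec["smiles"], []).append(rec)

# Merge each group and compare the inherited class with the recomputed one
merged = []
for smiles, rows in grouped.items():
    merged_ic50 = median([r["ic50_nm"] for r in rows])
    merged.append({
        "smiles": smiles,
        "id": rows[0]["id"],                                   # reference ID
        "ic50_nm": merged_ic50,                                # median across IDs
        "first_class": classify_ic50(rows[0]["ic50_nm"]),      # inherited label
        "recomputed_class": classify_ic50(merged_ic50),        # label from median
    })

for row in merged:
    print(row["smiles"], row["ic50_nm"], row["first_class"], row["recomputed_class"])
# CCO 10400.0 active inactive
# CCN 50.0 active active
```

In the DataFrame workflow, the equivalent step would be to reassign `bioactivity_class` from `standard_value` after the final `groupby`.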


### **1.6 Export of Curated Dataset**
The preprocessed dataset is saved as bioactivity_preprocessed_data.csv. This file is synchronized with Google Drive to serve as the input for subsequent QSAR modeling.

In [22]:
# Monitoring curation efficiency
print(f"Initial records: {len(df1)}")
print(f"Final curated records: {len(df4)}")
df4.head()

Initial records: 7096
Final curated records: 4647


Unnamed: 0,canonical_smiles,molecule_chembl_id,standard_value,bioactivity_class
0,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3C2CCCCO2)cc1,CHEMBL4208168,1195.0,intermediate
1,Brc1ccc(Nc2nc(N3CCOCC3)nc3c2ncn3Cc2ccccc2)cc1,CHEMBL1173420,20000.0,inactive
2,Brc1ccc2ncc(-c3cccc(NC4CNC4)n3)n2c1,CHEMBL6005160,85.0,active
3,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(C)c3)n...,CHEMBL3900620,10000.0,inactive
4,C#CCN(c1cc(OC)cc(OC)c1)c1ccc2ncc(-c3cnn(CC4CCO...,CHEMBL3939018,9886.185,intermediate


In [23]:
df4.to_csv("bioactivity_preprocessed_data.csv", index=False)

!cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv  bioactivity_raw_data.csv


Check whether the classes are balanced, since class imbalance affects how a QSAR classifier trained on this dataset will behave.

In [24]:
# Calculate counts and percentages
class_counts = df4['bioactivity_class'].value_counts()
class_pcts = df4['bioactivity_class'].value_counts(normalize=True) * 100

print("--- Bioactivity Class Balance ---")
for cls in class_counts.index:
    print(f"{cls.capitalize()}: {class_counts[cls]} ({class_pcts[cls]:.2f}%)")

# Calculate the ratio of each class to the smallest class
ratios = class_counts / class_counts.min()

print(f"Act : Inact : Inter = {ratios['active']:.2f} : {ratios['inactive']:.2f} : 1.00")

--- Bioactivity Class Balance ---
Active: 2936 (63.18%)
Inactive: 979 (21.07%)
Intermediate: 732 (15.75%)
Act : Inact : Inter = 4.01 : 1.34 : 1.00


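The dataset is "active-heavy" and moderately imbalanced (about four actives per intermediate compound), which is common for well-studied targets such as FLT3. A classifier trained on it will tend to favor the active class over the inactive and intermediate classes. The imbalance is still manageable, however, because the curated dataset remains large enough (4647 compounds) for modeling.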
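
One common way to compensate for such an imbalance is to weight each class inversely to its frequency, using the "balanced" heuristic `n_samples / (n_classes * class_count)`. The sketch below applies it to the class counts reported above. Whether the Part 2 model accepts per-class weights is an assumption, and the variable names are illustrative.

```python
# Class counts taken from the balance report above
class_counts = {"active": 2936, "inactive": 979, "intermediate": 732}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Balanced heuristic: rarer classes receive proportionally larger weights
class_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in class_counts.items()
}

for cls, weight in class_weights.items():
    print(f"{cls}: {weight:.3f}")
# active: 0.528
# inactive: 1.582
# intermediate: 2.116
```

With these weights, each class contributes the same total weight (n_samples / n_classes) to a weighted loss, so the majority class no longer dominates training.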

## **End of Part 1: Data Collection and Curation**