<a href="https://colab.research.google.com/github/lmVl12/AI_and_Drug_Discovery_Course_2026/blob/main/Assignment_2_QSAR_data_curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling.

# **Part 1: Data Collection & Curation**

### **1.1 Environment Setup**
Google Drive is mounted in the Google Colab environment so that data files persist between sessions and stay organized in a project folder. A `data` directory is then created inside `Colab Notebooks` to hold the exported files.


In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [None]:
# -p avoids an error when the directory already exists (e.g. on re-runs)
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"

#### **Library Imports**
The **chembl_webresource_client** package is installed to provide programmatic access to the ChEMBL database API. Two imports are then needed for data manipulation and database interaction:
* `pandas` for data handling
* `new_client` for targeted queries against the ChEMBL database

In [3]:
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.3.0-py3-none-any.whl (70 kB)

In [4]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

### **1.2 Target Identification (FLT3)**
The study focuses on the receptor-type tyrosine-protein kinase FLT3 (Homo sapiens). To avoid ambiguity between similarly named proteins, the target is queried by its UniProt accession (P36888). Among the results, CHEMBL1974 (target type SINGLE PROTEIN) is identified as the primary target for bioactivity data retrieval.

In [5]:
target = new_client.target
target_query = target.search("P36888")
targets = pd.DataFrame.from_dict(target_query)
targets.head()


Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,17.0,False,CHEMBL1974,"[{'accession': 'P36888', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Aurora kinase B/Receptor-type tyrosine-protein...,15.0,False,CHEMBL3430908,"[{'accession': 'P36888', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Homo sapiens,Protein cereblon/Tyrosine-protein kinase recep...,15.0,False,CHEMBL4630730,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,von Hippel-Lindau disease tumor suppressor/FLT3,13.0,False,CHEMBL4523735,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
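
The search returns several entries that contain P36888, and only one of them is a single-protein target. Relying on row order to pick it works here, but a more explicit selection can be sketched as below. The helper `pick_single_protein` and the toy records are hypothetical illustrations shaped like the output above (fields trimmed); they are not part of the ChEMBL client.

```python
def pick_single_protein(records, accession):
    """Return the ChEMBL ID of the first SINGLE PROTEIN target whose
    components include the given UniProt accession, or None if absent."""
    for record in records:
        if record.get("target_type") != "SINGLE PROTEIN":
            continue
        components = record.get("target_components") or []
        if any(c.get("accession") == accession for c in components):
            return record.get("target_chembl_id")
    return None

# Toy records shaped like the search output (hypothetical, fields trimmed).
example_records = [
    {"target_chembl_id": "CHEMBL3430908", "target_type": "SELECTIVITY GROUP",
     "target_components": [{"accession": "P36888"}]},
    {"target_chembl_id": "CHEMBL1974", "target_type": "SINGLE PROTEIN",
     "target_components": [{"accession": "P36888"}]},
]
selected = pick_single_protein(example_records, "P36888")
print(selected)
```

In the notebook, the same helper could be fed with `targets.to_dict("records")`, which turns the DataFrame rows back into dictionaries.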


### **1.3 Bioactivity Data Retrieval**
Experimental records for CHEMBL1974 are retrieved and filtered to IC50 measurements. The raw dataset is exported as bioactivity_raw_data.csv and copied to the Google Drive /data directory so that the unmodified query results remain available for traceability.

In [11]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1974'

In [12]:
# Retrieving IC50 bioactivity data for selected target
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [13]:
# Initial data assembly for curation
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.128
1,,,872532,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.22
2,,,872564,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,8.79
3,,,879718,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,1.91
4,,,884645,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,30.0


In [14]:
df1.standard_type.unique()

array(['IC50'], dtype=object)
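
The potency thresholds applied in section 1.4 are expressed in nM, so a similar check on the reported units may be worthwhile before classification. The sketch below tallies unit strings in a list of records. The helper `count_units` and the example records are hypothetical and are not taken from the FLT3 results.

```python
from collections import Counter

def count_units(records, field="standard_units"):
    """Tally the unit strings found in activity records (None if missing)."""
    return Counter(record.get(field) for record in records)

# Hypothetical records for illustration only.
example_records = [
    {"standard_value": "128.0", "standard_units": "nM"},
    {"standard_value": "220.0", "standard_units": "nM"},
    {"standard_value": "15.0", "standard_units": "ug.mL-1"},
    {"standard_value": None, "standard_units": None},
]
unit_counts = count_units(example_records)

# Anything not reported in nM would need conversion or exclusion.
non_nm = {unit: n for unit, n in unit_counts.items() if unit != "nM"}
print(unit_counts["nM"], non_nm)
```

On the DataFrame itself, `df1["standard_units"].value_counts(dropna=False)` gives the equivalent tally.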

In [15]:
# Exporting raw database results for traceability
df1.to_csv('bioactivity_raw_data.csv', index=False)

In [None]:
# Archiving raw data to Google Drive project folder
! cp bioactivity_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
# Verifying successful archival of the ~5.8 MB dataset (5,793,932 bytes)
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 5659
-rw------- 1 root root 5793932 Jan 21 15:26 bioactivity_raw_data.csv


In [None]:
! head bioactivity_raw_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-3 cytoplasmic domain phosphorylation in CHO cells,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc2cc1OCCCN1CCC(C)CC1,,,CHEMBL1135998,J Med Chem,2002,"{'bei': '11.95', 'le': '0.22', 'lle': '1.61'

### **1.4 Data Preprocessing & Categorization**
#### **Cleaning and Bioactivity Classification**
The dataset undergoes systematic curation to ensure data quality:
* Handling Missing Values: Rows lacking a standard_value are removed so that only records with a measured IC50 are analyzed.
* Classification: Based on IC50 potency, compounds are categorized into:
  * Active: $\text{IC50} \le 1000 \text{ nM}$
  * Inactive: $\text{IC50} \ge 10000 \text{ nM}$
  * Intermediate: $1000 \text{ nM} < \text{IC50} < 10000 \text{ nM}$
* Feature Selection: Relevant columns (Molecule ChEMBL ID, Canonical SMILES, Standard Value, and Bioactivity Class) are extracted.

In [None]:
# Sanity check: counting records that lack a standard_type label
df1["standard_type"].isna().sum()

np.int64(0)

In [17]:
# Filtering the dataset to remove records lacking bioactivity values (IC50)
df2 = df1[df1["standard_value"].notna()]
df2.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,866063,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.128
1,,,872532,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,0.22
2,,,872564,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,8.79
3,,,879718,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,1.91
4,,,884645,[],CHEMBL766072,Inhibition of chimeric PDGF receptor with FLT-...,B,,,BAO_0000190,...,Homo sapiens,Receptor-type tyrosine-protein kinase FLT3,9606,,,IC50,uM,UO_0000065,,30.0


In [18]:
# Categorizing compounds into potency classes based on IC50 values (nM)
bioactivity_class = []
for value in df2.standard_value:
    value = float(value)
    if value >= 10000:
        bioactivity_class.append("inactive")
    elif value <= 1000:
        bioactivity_class.append("active")
    else:
        bioactivity_class.append("intermediate")
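
The same thresholds can also be wrapped in a small function, which makes the boundary behavior easy to test in isolation. This is a minimal sketch; the names `classify_ic50`, `ACTIVE_MAX_NM`, and `INACTIVE_MIN_NM` are introduced here for illustration and assume the input is already in nM.

```python
ACTIVE_MAX_NM = 1000.0      # IC50 <= 1000 nM  -> active
INACTIVE_MIN_NM = 10000.0   # IC50 >= 10000 nM -> inactive

def classify_ic50(value_nm):
    """Map an IC50 in nM (number or numeric string) to a potency class."""
    value = float(value_nm)
    if value >= INACTIVE_MIN_NM:
        return "inactive"
    if value <= ACTIVE_MAX_NM:
        return "active"
    return "intermediate"

# Example IC50 values in nM, including both boundary cases (1000 and 10000).
example_values = ["128.0", "220.0", "8790.0", "1910.0", "30000.0", 1000, 10000]
labels = [classify_ic50(v) for v in example_values]
print(labels)
```

Applied to the notebook's data, `df2["standard_value"].map(classify_ic50)` should produce the same labels as the loop.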

In [19]:
# Consolidating molecular identifiers, structures, and potency labels
molecule_ids = df2.molecule_chembl_id.tolist()
canonical_smiles = df2.canonical_smiles.tolist()
standard_values = df2.standard_value.tolist()

In [20]:
data = list(zip(
    molecule_ids,
    canonical_smiles,
    standard_values,
    bioactivity_class,
))

In [21]:
# Constructing the curated dataframe for subsequent analysis
df3 = pd.DataFrame(
    data,
    columns=[
        "molecule_chembl_id",
        "canonical_smiles",
        "standard_value",
        "bioactivity_class",
    ]
)
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL330863,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,128.0,active
1,CHEMBL124660,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,220.0,active
2,CHEMBL126699,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,8790.0,intermediate
3,CHEMBL445636,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,1910.0,intermediate
4,CHEMBL941,Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc...,30000.0,inactive


### **1.5 Chemical Structure Validation**
A final check removes compounds without a recorded chemical structure: any rows whose SMILES value is NaN, empty, or the string "None" are dropped. This ensures that every remaining record has a SMILES string available for the molecular descriptor calculations in Part 2.

In [22]:
# Structural validation: dropping entries with missing or empty SMILES strings
df3 = df3.dropna(subset=["canonical_smiles"])
df3 = df3[df3["canonical_smiles"].str.lower() != "none"]
df3 = df3[df3["canonical_smiles"].str.strip() != ""]
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL330863,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,128.0,active
1,CHEMBL124660,COc1cc2c(N3CCN(C(=O)Nc4ccc(OC(C)C)cc4)CC3)ncnc...,220.0,active
2,CHEMBL126699,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,8790.0,intermediate
3,CHEMBL445636,COc1cc2c(N3CCN(C(=O)Nc4ccc(C#N)cc4)CC3)ncnc2cc...,1910.0,intermediate
4,CHEMBL941,Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc...,30000.0,inactive
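
The three filters above can be folded into a single predicate, which also makes the edge cases explicit. The function `has_usable_smiles` is a hypothetical helper written for this sketch. It only checks that a SMILES string is present; it does not parse the structure, so chemical validity would still need a cheminformatics toolkit.

```python
import math

def has_usable_smiles(value):
    """True when value is a non-empty SMILES string other than the text 'None'."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    text = str(value).strip()
    return text != "" and text.lower() != "none"

# One present SMILES string followed by each kind of missing value.
examples = ["CCO", None, float("nan"), "", "   ", "None", "none"]
flags = [has_usable_smiles(v) for v in examples]
print(flags)
```

With pandas, `df3[df3["canonical_smiles"].map(has_usable_smiles)]` would keep only the rows that pass.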


In [29]:
# Monitoring curation efficiency (Initial vs. Final records)
len(df1)

7096

In [30]:
len(df3)

6960
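
The two counts can be turned into a short curation summary. The arithmetic below uses only the record counts reported above; the variable names are illustrative.

```python
initial_records = 7096   # len(df1), as reported above
final_records = 6960     # len(df3), as reported above

removed = initial_records - final_records
retained_pct = 100 * final_records / initial_records
print(f"removed {removed} records; retained {retained_pct:.1f}%")
```

This shows that 136 records were removed across the missing-value and SMILES filters, leaving about 98.1% of the raw data.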

### **1.6 Export of Curated Dataset**
The preprocessed dataset is saved as bioactivity_preprocessed_data.csv. This file is synchronized with Google Drive to serve as the input for subsequent QSAR modeling.

In [None]:
df3.to_csv("bioactivity_preprocessed_data.csv", index=False)

!cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv  bioactivity_raw_data.csv


## **End of Part 1: Data Collection and Curation**