<a href="https://colab.research.google.com/github/imranasalisu1/AI_And_Drug_Discovery_Course_2026/blob/main/assignment_2_QSAR_data_curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**AI and Drug Discovery Course: QSAR Modeling**

This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling.

**Part 1: Data Collection & Curation**

**First, connect Google Colab to Google Drive so that Drive files are accessible from within Colab.**

This allows us to:

1. Save datasets
2. Reload data across sessions
3. Organize project files

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


**Now create a "data" folder inside the "Colab Notebooks" folder on Google Drive.**

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"


**Install and Import Required Libraries**

We install the ChEMBL web service client package so that we can retrieve bioactivity data.

In [None]:
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-25.3.0-py3-none-any.whl (70 kB)

**Import Libraries**


* pandas for data handling
* new_client from chembl_webresource_client for querying the ChEMBL database


In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

**Step 1: Search for Target Protein**

**Target Identification (5O1C)**

*Search ChEMBL with the query "5O1C" and select the most relevant entry. As the table below shows, the search returns p53-related targets.*

In [None]:
target = new_client.target
target_query = target.search("5O1C")
targets = pd.DataFrame.from_dict(target_query)
targets.head()

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Cellular tumor antigen p53,4.0,False,CHEMBL4096,"[{'accession': 'P04637', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cellular tumor antigen p53/Peptidyl-prolyl cis...,4.0,False,CHEMBL3885544,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
2,[],Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,3.0,False,CHEMBL1907611,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,Tumour suppressor protein p53/Mdm4,3.0,False,CHEMBL2221344,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
4,[],Homo sapiens,CREB-binding protein/p53,3.0,False,CHEMBL3301383,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


**Step 2: Retrieve Bioactivity Data for the Selected Target**

In [None]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL1907611'
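
Selecting by positional index (`[2]`) depends on the order in which the search returns its results, and that order is not guaranteed to stay the same. A minimal sketch of a more explicit selection follows, assuming the `pref_name` string shown in the table above. The `targets_demo` frame and `selected_target_demo` name are hypothetical stand-ins so the example runs without a network connection.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` DataFrame, using three rows
# copied from the search output shown above.
targets_demo = pd.DataFrame({
    "pref_name": [
        "Cellular tumor antigen p53",
        "Tumour suppressor p53/oncoprotein Mdm2",
        "CREB-binding protein/p53",
    ],
    "target_type": [
        "SINGLE PROTEIN",
        "PROTEIN-PROTEIN INTERACTION",
        "PROTEIN-PROTEIN INTERACTION",
    ],
    "target_chembl_id": ["CHEMBL4096", "CHEMBL1907611", "CHEMBL3301383"],
})

# Select the row by its preferred name instead of its position.
mask = targets_demo["pref_name"] == "Tumour suppressor p53/oncoprotein Mdm2"
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the name is missing or ambiguous.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

Applied to the real `targets` frame, the same mask would replace the `[2]` lookup, provided the preferred name is still spelled the same way in ChEMBL.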

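**Now retrieve only the bioactivity records with reported IC50 values for the selected target, Tumour suppressor p53/oncoprotein Mdm2 (CHEMBL1907611), found with the "5O1C" query. The filter restricts the activity type only; in the records shown later, standard_value is in nM (nanomolar), so a raw value of 27.0 uM appears as 27000.0.**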

In [None]:
activity = new_client.activity
results = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df1 = pd.DataFrame.from_dict(results)
df1.head(5)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1424130,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,27.0
1,,,1424131,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,66.0
2,,,1424132,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,90.0
3,,,1424135,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,85.0
4,,,1424136,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,92.0


In [None]:
df1.standard_type.unique()

array(['IC50'], dtype=object)
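
The check above confirms the activity type, but the query does not filter on units. The CSV header printed later in this notebook includes a `standard_units` column, which can be inspected in the same way. The sketch below uses a hypothetical toy frame (`demo`) rather than the real `df1`, and the unit strings in it are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy records; the real `df1` has the same two columns.
demo = pd.DataFrame({
    "standard_value": ["27000.0", "66000.0", "1.5", None],
    "standard_units": ["nM", "nM", "ug.mL-1", None],
})

# Count how often each unit occurs, including missing units.
unit_counts = demo["standard_units"].value_counts(dropna=False)
print(unit_counts)

# Keep only records whose standardized value is expressed in nM.
demo_nm = demo[demo["standard_units"] == "nM"]
print(len(demo_nm), "of", len(demo), "records are in nM")
```

If the real data contains units other than nM, the class thresholds used later would not apply to those rows, so filtering first may be the safer choice.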

**Finally, save the resulting bioactivity data to a CSV file, bioactivity_raw_data.csv.**

In [None]:
df1.to_csv('bioactivity_raw_data.csv', index=False)

**Now copy the "bioactivity_raw_data.csv" file to the "data" folder on Google Drive.**

In [None]:
! cp bioactivity_raw_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 2371
-rw------- 1 root root 2427770 Jan 27 19:12 bioactivity_raw_data.csv
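
The copy and listing steps above can also be done from Python, which makes it easy to confirm that the saved file reloads correctly in a later session. This is a sketch only: a temporary directory stands in for the Drive folder so the example runs anywhere, and the two-row dataset is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

# A temporary directory stands in for the Drive "data" folder so the
# sketch runs anywhere; in Colab the destination would be the Drive path.
work_dir = Path(tempfile.mkdtemp())
data_dir = work_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Write a tiny hypothetical dataset, then copy it into the data folder.
src = work_dir / "bioactivity_raw_data.csv"
pd.DataFrame({"activity_id": [1424130, 1424131]}).to_csv(src, index=False)
dst = Path(shutil.copy(src, data_dir))

# Reload the copy to confirm it is readable in a later session.
reloaded = pd.read_csv(dst)
print(dst.name, reloaded.shape)
```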


In [None]:
! head bioactivity_raw_data.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,1424130,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 interaction in SJSA human sarcoma cells,B,,,BAO_0000190,BAO_0000219,cell-based format,CCCN1C(=O)c2ccccc2C1(NC(=O)c1ccc(C(C)(C)C)cc1)c1ccc(OCOCC[Si](C)(C)C)cc1,,,CHEMBL1142026,Bioorg Med Chem Lett,2005,"{'bei': '7.98', 'le': None, 'lle': None,

**Step 3: Bioactivity Data Preprocessing (IC50)**

**Inspect and clean the IC50 bioactivity data retrieved for the selected target.**

**Inspect Missing Values**

In [23]:
df1["standard_type"].isna().sum()

np.int64(0)
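
The check above covers `standard_type` only, while the later steps rely on `standard_value` and `canonical_smiles`. A minimal sketch of counting missing entries for all three columns at once follows, using a hypothetical toy frame (`demo`) in place of `df1`.

```python
import pandas as pd

# Hypothetical toy frame with the columns used later in this notebook.
demo = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_value": ["27000.0", None, "90000.0"],
    "canonical_smiles": ["CCO", "c1ccccc1", None],
})

# Count missing entries per column in one call.
cols = ["standard_type", "standard_value", "canonical_smiles"]
missing = demo[cols].isna().sum()
print(missing)
```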

**Filter Rows with Valid Bioactivity Values**

In [27]:
df2 = df1[df1["standard_value"].notna()]
df2.head()

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1424130,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,27.0
1,,,1424131,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,66.0
2,,,1424132,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,90.0
3,,,1424135,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,85.0
4,,,1424136,[],CHEMBL828042,Inhibitory concentration against MDM2-p53 inte...,B,,,BAO_0000190,...,Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,9606,,,IC50,uM,UO_0000065,,92.0


**Assign Bioactivity Classes.** Define active (IC50 ≤ 1,000 nM), inactive (IC50 ≥ 10,000 nM), and intermediate (everything in between) classes based on IC50 values.

In [28]:
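bioactivity_class = []
for value in df2.standard_value:
    value = float(value)  # cast in case the value is stored as a string
    if value >= 10000:  # IC50 of 10,000 nM or more
        bioactivity_class.append("inactive")
    elif value <= 1000:  # IC50 of 1,000 nM or less
        bioactivity_class.append("active")
    else:  # strictly between 1,000 and 10,000 nM
        bioactivity_class.append("intermediate")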
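
The loop above can also be written without explicit iteration. The sketch below applies the same thresholds with `numpy.select`; the function name `classify_ic50` and its parameter names are hypothetical, and the input list is illustrative.

```python
import numpy as np
import pandas as pd

def classify_ic50(values, active_max=1000.0, inactive_min=10000.0):
    """Label IC50 values (nM) with the same thresholds as the loop above."""
    # Convert strings such as "27000.0" to floats in one step.
    numeric = pd.to_numeric(pd.Series(values), errors="raise").to_numpy()
    # Conditions are checked in order; anything unmatched is intermediate.
    labels = np.select(
        [numeric >= inactive_min, numeric <= active_max],
        ["inactive", "active"],
        default="intermediate",
    )
    return labels.tolist()

demo_labels = classify_ic50(["27000.0", "1000", "5000", "10000", "10.0"])
print(demo_labels)
```

This version assumes missing values have already been removed, as in `df2`; a missing value would match neither condition and fall into the "intermediate" class.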

**Extract Relevant Columns**

In [29]:
molecule_ids = df2.molecule_chembl_id.tolist()
canonical_smiles = df2.canonical_smiles.tolist()
standard_values = df2.standard_value.tolist()

In [30]:
data = list(zip(
    molecule_ids,
    canonical_smiles,
    standard_values,
    bioactivity_class,
))

In [31]:
df3 = pd.DataFrame(
    data,
    columns=[
        "molecule_chembl_id",
        "canonical_smiles",
        "standard_value",
        "bioactivity_class",
    ]
)
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL179549,CCCN1C(=O)c2ccccc2C1(NC(=O)c1ccc(C(C)(C)C)cc1)...,27000.0,inactive
1,CHEMBL360920,CCCN1C(=O)c2ccccc2C1(NC(=O)c1ccccc1)c1ccc(C(C)...,66000.0,inactive
2,CHEMBL182052,CCCOC1(c2ccccc2)c2ccccc2C(=O)N1C(Cc1ccc(O)cc1)...,90000.0,inactive
3,CHEMBL179662,OCC(NC1(c2ccccc2)c2ccccc2CC1Cc1ccccc1)C(O)c1cc...,85000.0,inactive
4,CHEMBL181688,CC(C)(C)c1ccc(COC2(c3ccccc3)c3ccccc3C(=O)N2Cc2...,92000.0,inactive


**Remove Compounds without Valid SMILES.** Drop rows with **NaN**, **empty** or **None** SMILES values.

In [32]:
df3 = df3.dropna(subset=["canonical_smiles"])
df3 = df3[df3["canonical_smiles"].str.lower() != "none"]
df3 = df3[df3["canonical_smiles"].str.strip() != ""]
df3.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL179549,CCCN1C(=O)c2ccccc2C1(NC(=O)c1ccc(C(C)(C)C)cc1)...,27000.0,inactive
1,CHEMBL360920,CCCN1C(=O)c2ccccc2C1(NC(=O)c1ccccc1)c1ccc(C(C)...,66000.0,inactive
2,CHEMBL182052,CCCOC1(c2ccccc2)c2ccccc2C(=O)N1C(Cc1ccc(O)cc1)...,90000.0,inactive
3,CHEMBL179662,OCC(NC1(c2ccccc2)c2ccccc2CC1Cc1ccccc1)C(O)c1cc...,85000.0,inactive
4,CHEMBL181688,CC(C)(C)c1ccc(COC2(c3ccccc3)c3ccccc3C(=O)N2Cc2...,92000.0,inactive
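
The three filters above can be combined into a single boolean mask, which also makes the rule easy to test on its own. This is a sketch with a hypothetical helper name (`has_valid_smiles`) and hypothetical molecule IDs. It only checks that a SMILES string is present; it does not check that the string is chemically valid.

```python
import pandas as pd

def has_valid_smiles(smiles: pd.Series) -> pd.Series:
    """Return True where a SMILES entry is present and not a placeholder."""
    present = smiles.notna()
    # Replace missing entries with "" so string methods work on every row.
    text = smiles.fillna("").astype(str).str.strip()
    return present & (text != "") & (text.str.lower() != "none")

demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],  # hypothetical IDs
    "canonical_smiles": ["CCO", None, "  ", "None"],
})
clean = demo[has_valid_smiles(demo["canonical_smiles"])]
print(clean)
```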


**Save Preprocessed Bioactivity Data.** Save the cleaned dataset to CSV and copy to Google Drive.

In [33]:
df3.to_csv("bioactivity_preprocessed_data.csv", index=False)

!cp bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"
!ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_preprocessed_data.csv  bioactivity_raw_data.csv


In [34]:
df3.describe()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
count,1819,1819,1819.0,1819
unique,1495,1495,878.0,3
top,CHEMBL191334,COc1ccc(C2=N[C@@H](c3ccc(Cl)cc3)[C@@H](c3ccc(C...,10.0,active
freq,20,20,58.0,1355
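
The summary above reports `unique`, `top`, and `freq` for `standard_value`, which is what pandas shows for non-numeric columns, so the values appear to be stored as strings. It also shows 1819 rows but only 1495 unique molecule IDs, so some molecules were measured more than once. The sketch below, on a hypothetical toy frame (`demo`), converts the column to floats and counts repeated molecules.

```python
import pandas as pd

# Hypothetical toy frame mimicking `df3`: values stored as strings and
# one molecule measured twice.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL179549", "CHEMBL360920", "CHEMBL179549"],
    "standard_value": ["27000.0", "66000.0", "30000.0"],
})

# Convert to floats; unparseable entries become NaN instead of raising.
demo["standard_value"] = pd.to_numeric(demo["standard_value"], errors="coerce")
print(demo["standard_value"].describe())

# Count rows that repeat an already-seen molecule ID.
n_repeats = int(demo["molecule_chembl_id"].duplicated().sum())
print("rows repeating an earlier molecule:", n_repeats)
```

How repeated measurements are handled (for example, keeping one record or aggregating them) is a modeling decision that this sketch leaves open.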
