# 01 - Data Collection: ChEMBL Bioactivity Data

In this notebook we will:
- Download bioactivity data for EGFR (CHEMBL203) from the ChEMBL API
- Filter for IC50 values
- Retrieve SMILES for each compound
- Convert IC50 values to pIC50
- Save the cleaned dataset to a CSV file

In [1]:
import pandas as pd
import requests

In [2]:
target_id = "CHEMBL203"
base_url = "https://www.ebi.ac.uk/chembl/api/data/activity.json"

params = {
    "target_chembl_id": target_id,
    "standard_type": "IC50",
    "limit": 1000
}

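# Note: this fetches only the first page of results (up to `limit` records)
response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status() # Stop early on HTTP errors instead of parsing an error page
data = response.json() # Parse the JSON response body into a Python dictionary
data["activities"][:2] # Show the first 2 bioactivity records returned by ChEMBL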

[{'action_type': None,
  'activity_comment': None,
  'activity_id': 32260,
  'activity_properties': [],
  'assay_chembl_id': 'CHEMBL674637',
  'assay_description': 'Inhibitory activity towards tyrosine phosphorylation for the epidermal growth factor-receptor kinase',
  'assay_type': 'B',
  'assay_variant_accession': None,
  'assay_variant_mutation': None,
  'bao_endpoint': 'BAO_0000190',
  'bao_format': 'BAO_0000357',
  'bao_label': 'single protein format',
  'canonical_smiles': 'Cc1cc(C)c(/C=C2\\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1',
  'data_validity_comment': None,
  'data_validity_description': None,
  'document_chembl_id': 'CHEMBL1134862',
  'document_journal': 'Bioorg Med Chem Lett',
  'document_year': 2002,
  'ligand_efficiency': {'bei': '19.25',
   'le': '0.37',
   'lle': '2.94',
   'sei': '8.93'},
  'molecule_chembl_id': 'CHEMBL68920',
  'molecule_pref_name': None,
  'parent_molecule_chembl_id': 'CHEMBL68920',
  'pchembl_value': '7.39',
  'potential_duplicate': 0,
  'qudt_un
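
The request above returns a single page of results, so targets with more than `limit` records need further requests. The following is a minimal sketch of how the remaining pages could be collected. It assumes each response carries a `page_meta` object whose `next` field holds the relative URL of the following page (and is empty on the last page). The helper names `fetch_json` and `iter_activities` are hypothetical, and the standard-library `urllib` stands in for `requests` so that the block is self-contained.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

API_ROOT = "https://www.ebi.ac.uk"
ACTIVITY_PATH = "/chembl/api/data/activity.json"


def fetch_json(url):
    """Download one page and parse it as JSON (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_activities(params, fetch=fetch_json, max_pages=None):
    """Yield activity records page by page until `next` is empty.

    Assumes each page looks like
    {"activities": [...], "page_meta": {"next": "/chembl/api/...", ...}}.
    """
    url = urljoin(API_ROOT, ACTIVITY_PATH) + "?" + urlencode(params)
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        page = fetch(url)
        yield from page.get("activities", [])
        # Follow the relative link to the next page, if there is one
        next_path = page.get("page_meta", {}).get("next")
        url = urljoin(API_ROOT, next_path) if next_path else None
        pages_read += 1
```

Calling `list(iter_activities(params))` with the `params` dictionary defined earlier would then gather every page into one list. The `max_pages` argument is an optional safety cap for trial runs.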

In [11]:
records = []

for entry in data["activities"]:
    # Keep only entries that have an IC50 value, a molecule ID, and a SMILES string
    if entry["standard_value"] and entry["molecule_chembl_id"] and entry["canonical_smiles"]:
        records.append({
            "molecule_chembl_id": entry["molecule_chembl_id"],
            "canonical_smiles": entry["canonical_smiles"],
            "IC50": entry["standard_value"],
            "units": entry["standard_units"],
        })

df = pd.DataFrame(records) 

print(df.head())
print(df.shape)

  molecule_chembl_id                                   canonical_smiles  \
0        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...   
1        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...   
2        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...   
3        CHEMBL69960  Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...   
4        CHEMBL69960  Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...   

     IC50 units  
0    41.0    nM  
1   300.0    nM  
2  7820.0    nM  
3   170.0    nM  
4    40.0    nM  
(979, 4)


## IC50 to pIC50 Conversion

To use IC50 values as a continuous variable in regression modeling, we convert them to **pIC50** values. IC50 values span several orders of magnitude (40 nM to 7820 nM in the first five rows alone), so taking the negative base-10 logarithm puts them on a more evenly spread scale that suits statistical learning better. On this scale, a higher pIC50 means a more potent compound.

The formula used for the conversion is:

$$
\text{pIC}_{50} = -\log_{10} \left( \frac{\text{IC}_{50} \ (\text{in nM})}{10^9} \right) = 9 - \log_{10}(\text{IC}_{50})
$$

This converts **IC50 values in nanomolar (nM)** into their corresponding **pIC50 values**.

We also remove invalid values (missing, non-numeric, or non-positive IC50s) to ensure a clean dataset for modeling.
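
As a quick check of the formula, the sketch below applies it to single values using only the standard library. The function name `pic50_from_nm` is hypothetical and not part of the notebook. It assumes the input is already in nM and treats missing or non-positive values as invalid.

```python
import math


def pic50_from_nm(ic50_nm):
    """Return pIC50 for an IC50 given in nM, or None if the value is invalid."""
    if ic50_nm is None or ic50_nm <= 0:
        return None  # log10 is undefined for zero or negative concentrations
    return 9 - math.log10(ic50_nm)


# The first three IC50 values from the dataset preview
for value in (41.0, 300.0, 7820.0):
    print(value, round(pic50_from_nm(value), 3))
# 41.0 7.387
# 300.0 6.523
# 7820.0 5.107
```

Because the scale is logarithmic, a tenfold drop in IC50 raises pIC50 by one unit: 1000 nM corresponds to 6 and 100 nM to 7.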


In [12]:
import numpy as np

# Convert IC50 to numeric (force errors to NaN)
df["IC50"] = pd.to_numeric(df["IC50"], errors="coerce")

df = df.dropna(subset=["IC50"]) # Remove rows with NaN values in the IC50 column

df = df[df["IC50"] > 0].copy() # Keep only positive IC50 values; copy so new columns can be added without a SettingWithCopyWarning

df["IC50_M"] = df["IC50"] * 1e-9  # Convert IC50 values from nM to M

df["pIC50"] = -np.log10(df["IC50_M"]) # Calculate pIC50 values from IC50 in M

df["pIC50"] = df["pIC50"].round(3)  # Round pIC50 values to 3 decimal places

df = df.drop(columns=["IC50_M"])  # Drop the IC50_M column as it's no longer needed

df.head()




  molecule_chembl_id                                   canonical_smiles    IC50 units  pIC50
0        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...    41.0    nM  7.387
1        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...   300.0    nM  6.523
2        CHEMBL68920  Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...  7820.0    nM  5.107
3        CHEMBL69960  Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...   170.0    nM  6.770
4        CHEMBL69960  Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...    40.0    nM  7.398


In [None]:
df.to_csv("../data/chembl_egfr_clean.csv", index=False) # Save the cleaned DataFrame to a CSV file
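
`to_csv` does not create missing directories, so the save step fails if `../data` does not exist yet. A minimal sketch of a guard, using a hypothetical helper `ensure_parent_dir` built on the standard-library `pathlib`:

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

With this helper, the save step could be written as `df.to_csv(ensure_parent_dir("../data/chembl_egfr_clean.csv"), index=False)`.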