### Introductory information

ZINC and ChEMBL are two widely used databases in cheminformatics and drug discovery that differ in content and purpose:
  
**ZINC**:
- **Content**: A database containing millions of commercially available chemical compounds, mainly small molecules. It includes information on chemical structure (e.g., SMILES, SDF), physicochemical properties (e.g., logP, molecular weight), and commercial availability.
- **Purpose**: Primarily used for virtual screening, i.e. searching compound libraries for potential drug candidates. Compounds from ZINC often serve as starting points for designing new drugs or optimizing existing ones.

**ChEMBL**:
- **Content**: A bioactivity database containing information on chemical compounds, their biological targets (e.g., proteins, enzymes), biological activity (e.g., IC₅₀, Kᵢ), and data from clinical studies.
- **Purpose**: Mainly used for research on structure–activity relationships (SAR), analyzing trends in drug discovery, identifying new biological targets, and validating computational models.

The aim of this project is to predict binding affinity, expressed as **pKᵢ**, for the human 5-HT2A receptor (gene *HTR2A*), the target covered by both data files.
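
For reference, pKᵢ is the negative base-10 logarithm of Kᵢ expressed in mol/L, so a Kᵢ given in nM converts as 9 − log10(Kᵢ). The helper below is a minimal sketch of that conversion; the function name `ki_nm_to_pki` is hypothetical and not part of this notebook.

```python
import math


def ki_nm_to_pki(ki_nm: float) -> float:
    """Convert an inhibition constant given in nanomolar (nM) to pKi."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive to take a logarithm")
    # pKi = -log10(Ki [M]); with Ki in nM this becomes 9 - log10(Ki [nM])
    return 9.0 - math.log10(ki_nm)


print(round(ki_nm_to_pki(400.0), 3))  # 400 nM -> 6.398
```

A lower Kᵢ (tighter binding) gives a higher pKᵢ: 1 nM corresponds to pKᵢ = 9, and 1 µM to pKᵢ = 6.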

### Imports

In [121]:
import pandas as pd
import numpy as np
from pathlib import Path

### Loading the Data  

In [122]:
zinc_data = pd.read_csv(Path("data/raw/ZINC_data_5HT2A.csv"))
chembl_data = pd.read_csv(Path("data/raw/Chembl_data_5HT2A.csv"))
descriptors_data = pd.read_csv(Path("data/raw/Mordred_descriptors_database.csv"))


### Looking at the Data

In [123]:
zinc_data.head()

Unnamed: 0,zinc_id,smiles,ortholog_name,gene_name,affinity,chembldocid,title,reference.pubmed_id,reference.doi,reference.chembl_id,reference.journal,reference.year,pKi_numeric
0,ZINC000029038589,CC(=O)N[C@@H]1CCc2ccc(CCN3CCN(c4nsc5ccccc45)CC...,5HT2A_HUMAN,HTR2A,11.0,38266,,18160289.0,,CHEMBL1140829,Bioorg. Med. Chem. Lett.,2008.0,11.0
1,ZINC000029038591,CC(=O)N[C@H]1CCc2ccc(CCN3CCN(c4nsc5ccccc45)CC3...,5HT2A_HUMAN,HTR2A,11.0,38266,,18160289.0,,CHEMBL1140829,Bioorg. Med. Chem. Lett.,2008.0,11.0
2,ZINC000029038592,CCC(=O)N[C@@H]1CCc2ccc(CCN3CCN(c4nsc5ccccc45)C...,5HT2A_HUMAN,HTR2A,11.0,38266,,18160289.0,,CHEMBL1140829,Bioorg. Med. Chem. Lett.,2008.0,11.0
3,ZINC000029038594,CCC(=O)N[C@H]1CCc2ccc(CCN3CCN(c4nsc5ccccc45)CC...,5HT2A_HUMAN,HTR2A,11.0,38266,,18160289.0,,CHEMBL1140829,Bioorg. Med. Chem. Lett.,2008.0,11.0
4,ZINC000000597400,O=S1(=O)c2cccc3cccc(c23)N1CCCN1CCN(c2ccc(F)cc2...,5HT2A_HUMAN,HTR2A,10.4,16293,,11170639.0,,CHEMBL1134342,J. Med. Chem.,2001.0,10.4


In [124]:
chembl_data.head(10)

Unnamed: 0,Molecule ChEMBL ID,Molecule Name,Molecule Max Phase,Molecular Weight,#RO5 Violations,AlogP,Compound Key,Smiles,Standard Type,Standard Relation,...,Target Type,Document ChEMBL ID,Source ID,Source Description,Document Journal,Document Year,Cell ChEMBL ID,Properties,relation_clean,pKi_numeric
0,CHEMBL4212943,,,452.53,0.0,4.42,8b,CC(=O)c1c(OCCCCN2CCN(c3cccc(F)c3)CC2)ccc2c(C)c...,Ki,'=',...,SINGLE PROTEIN,CHEMBL4184192,1,Scientific Literature,Bioorg Med Chem,2018.0,,,=,6.4
1,CHEMBL316527,,,235.28,0.0,1.07,9,COc1c2c(c(CCN)c3c1OCC3)CCO2,Ki,'=',...,SINGLE PROTEIN,CHEMBL1130147,1,Scientific Literature,J Med Chem,1997.0,,,=,5.35
2,CHEMBL4591410,,,336.4,0.0,3.49,22; PKSN-240,COc1ccc2[nH]cc(CCNCc3ccc(-c4cn[nH]c4)o3)c2c1,Ki,'=',...,SINGLE PROTEIN,CHEMBL4312034,1,Scientific Literature,Eur J Med Chem,2020.0,CHEMBL3307715,,=,7.55
3,CHEMBL4584504,,,350.39,0.0,4.6,34; PKSN-222,Oc1ccc(-c2ccc(CNCCc3c[nH]c4cc(F)ccc34)o2)cc1,Ki,'=',...,SINGLE PROTEIN,CHEMBL4312034,1,Scientific Literature,Eur J Med Chem,2020.0,CHEMBL3307715,,=,7.02
4,CHEMBL180010,,,481.43,1.0,5.16,"6, PG01037",O=C(NC/C=C/CN1CCN(c2cccc(Cl)c2Cl)CC1)c1ccc(-c2...,Ki,'=',...,SINGLE PROTEIN,CHEMBL1148745,1,Scientific Literature,J Med Chem,2007.0,,,=,7.21
5,CHEMBL3771331,,,328.42,0.0,1.78,22,CCNC(=O)c1cccc(NC2=NC(N)=NC3(CCCCC3)N2)c1,Ki,'=',...,SINGLE PROTEIN,CHEMBL3769382,1,Scientific Literature,J Med Chem,2016.0,,,=,5.88
6,CHEMBL398619,,,483.64,0.0,4.47,12j,O=C(NC1CCc2ccc(CCN3CCN(c4nsc5ccccc45)CC3)cc21)...,Ki,'=',...,SINGLE PROTEIN,CHEMBL1140829,1,Scientific Literature,Bioorg Med Chem Lett,2008.0,CHEMBL3307716,,=,9.02
7,CHEMBL442290,,,500.64,2.0,5.22,12i,O=C(NC1CCc2ccc(CCN3CCN(c4nsc5ccccc45)CC3)cc21)...,Ki,'=',...,SINGLE PROTEIN,CHEMBL1140829,1,Scientific Literature,Bioorg Med Chem Lett,2008.0,CHEMBL3307716,,=,8.61
8,CHEMBL275451,,,382.48,0.0,3.76,12,O=C1c2ccccc2CCCN1CCN1CCC(Oc2ccc(F)cc2)CC1,Ki,'=',...,SINGLE PROTEIN,CHEMBL1136643,1,Scientific Literature,Bioorg Med Chem Lett,2003.0,,,=,7.5
9,CHEMBL316527,,,235.28,0.0,1.07,9,COc1c2c(c(CCN)c3c1OCC3)CCO2,Ki,'=',...,SINGLE PROTEIN,CHEMBL1130147,1,Scientific Literature,J Med Chem,1997.0,,,=,6.47


In [125]:
descriptors_data.head()

Unnamed: 0,smiles,ABC,ABCGG,nAcid,nBase,SpAbs_A,SpMax_A,SpDiam_A,SpAD_A,SpMAD_A,...,SRW10,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2
0,CC(=O)c1c(OCCCCN2CCN(c3cccc(F)c3)CC2)ccc2c(C)c...,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,42.06466278808458,2.4743940799154185,4.948788159830838,42.06466278808458,1.2746867511540785,...,10.410697,69.176088,452.211136,7.293728,3940,53,172.0,201.0,10.47222222222222,7.25
1,COc1c2c(c(CCN)c3c1OCC3)CCO2,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,22.566899740452094,2.5393429456080714,4.980643062207491,22.566899740452094,1.327464690614829,...,9.956033,66.014285,235.120843,6.915319,452,28,92.0,114.0,4.916666666666667,3.916667
2,COc1ccc2[nH]cc(CCNCc3ccc(-c4cn[nH]c4)o3)c2c1,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,33.40302700153903,2.42142639341503,4.721184948178399,33.40302700153903,1.3361210800615613,...,9.992277,76.384627,336.158626,7.470192,1815,33,132.0,155.0,6.027777777777779,5.583333
3,Oc1ccc(-c2ccc(CNCCc3c[nH]c4cc(F)ccc34)o2)cc1,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,34.27789102037956,2.4149865997702755,4.699237272854239,34.27789102037955,1.318380423860752,...,10.067263,76.660405,350.143056,7.780957,2065,36,138.0,161.0,6.888888888888889,5.666667
4,O=C(NC/C=C/CN1CCN(c2cccc(Cl)c2Cl)CC1)c1ccc(-c2...,module 'numpy' has no attribute 'float'.\n`np....,module 'numpy' has no attribute 'float'.\n`np....,0,1,43.73290574033282,2.375216679151583,4.750433358303166,43.73290574033282,1.325239567888873,...,10.267193,68.855801,480.148367,8.138108,4316,50,168.0,194.0,9.25,7.388889


In [126]:
zinc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2877 entries, 0 to 2876
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   zinc_id              2877 non-null   object 
 1   smiles               2877 non-null   object 
 2   ortholog_name        2877 non-null   object 
 3   gene_name            2877 non-null   object 
 4   affinity             2877 non-null   float64
 5   chembldocid          2877 non-null   int64  
 6   title                97 non-null     object 
 7   reference.pubmed_id  2701 non-null   float64
 8   reference.doi        116 non-null    object 
 9   reference.chembl_id  2877 non-null   object 
 10  reference.journal    2780 non-null   object 
 11  reference.year       2780 non-null   float64
 12  pKi_numeric          2877 non-null   float64
dtypes: float64(4), int64(1), object(8)
memory usage: 292.3+ KB


In [127]:
chembl_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5454 entries, 0 to 5453
Data columns (total 47 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Molecule ChEMBL ID          5454 non-null   object 
 1   Molecule Name               1198 non-null   object 
 2   Molecule Max Phase          1045 non-null   float64
 3   Molecular Weight            5454 non-null   float64
 4   #RO5 Violations             5423 non-null   float64
 5   AlogP                       5423 non-null   float64
 6   Compound Key                5454 non-null   object 
 7   Smiles                      5454 non-null   object 
 8   Standard Type               5454 non-null   object 
 9   Standard Relation           4467 non-null   object 
 10  Standard Value              4490 non-null   float64
 11  Standard Units              4504 non-null   object 
 12  pChEMBL Value               3949 non-null   float64
 13  Data Validity Comment       20 no

In [128]:
descriptors_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009 entries, 0 to 7008
Columns: 1614 entries, smiles to mZagreb2
dtypes: bool(2), float64(457), int64(312), object(843)
memory usage: 86.2+ MB


In [129]:
zinc_data.columns

Index(['zinc_id', 'smiles', 'ortholog_name', 'gene_name', 'affinity',
       'chembldocid', 'title', 'reference.pubmed_id', 'reference.doi',
       'reference.chembl_id', 'reference.journal', 'reference.year',
       'pKi_numeric'],
      dtype='object')

In [130]:
chembl_data.columns

Index(['Molecule ChEMBL ID', 'Molecule Name', 'Molecule Max Phase',
       'Molecular Weight', '#RO5 Violations', 'AlogP', 'Compound Key',
       'Smiles', 'Standard Type', 'Standard Relation', 'Standard Value',
       'Standard Units', 'pChEMBL Value', 'Data Validity Comment', 'Comment',
       'Uo Units', 'Ligand Efficiency BEI', 'Ligand Efficiency LE',
       'Ligand Efficiency LLE', 'Ligand Efficiency SEI', 'Potential Duplicate',
       'Assay ChEMBL ID', 'Assay Description', 'Assay Type', 'BAO Format ID',
       'BAO Label', 'Assay Organism', 'Assay Tissue ChEMBL ID',
       'Assay Tissue Name', 'Assay Cell Type', 'Assay Subcellular Fraction',
       'Assay Parameters', 'Assay Variant Accession', 'Assay Variant Mutation',
       'Target ChEMBL ID', 'Target Name', 'Target Organism', 'Target Type',
       'Document ChEMBL ID', 'Source ID', 'Source Description',
       'Document Journal', 'Document Year', 'Cell ChEMBL ID', 'Properties',
       'relation_clean', 'pKi_numeric'],
      

In [131]:
descriptors_data.columns

Index(['smiles', 'ABC', 'ABCGG', 'nAcid', 'nBase', 'SpAbs_A', 'SpMax_A',
       'SpDiam_A', 'SpAD_A', 'SpMAD_A',
       ...
       'SRW10', 'TSRW10', 'MW', 'AMW', 'WPath', 'WPol', 'Zagreb1', 'Zagreb2',
       'mZagreb1', 'mZagreb2'],
      dtype='object', length=1614)

### Processing the ChEMBL Data

In [132]:
chembl_data = chembl_data.rename(columns={'Smiles': 'smiles'})

In [133]:
chembl_data["Standard Relation"].unique()

array(["'='", nan, "'>'", "'<'", "'>='"], dtype=object)

In [134]:
chembl_data_filtered = chembl_data[chembl_data["Standard Relation"]=="'='"]

In [135]:
chembl_data_filtered["Standard Relation"].unique()

array(["'='"], dtype=object)

In [136]:
if 'Standard Value' in chembl_data_filtered.columns:
    
    chembl_data_calc = chembl_data_filtered[chembl_data_filtered['Standard Value'] > 0].copy()

    # pKi = -log10(Ki [M]) = 9 - log10(Ki [nM])
    # Assumes the values in the 'Standard Value' column are Ki in nanomolar (nM)
    chembl_data_calc['pKi_calc'] = 9 - np.log10(chembl_data_calc['Standard Value'])

    print("Calculated pKi for ChEMBL data:")
    print(chembl_data_calc[['Standard Value', 'pKi_calc']].head())

else:
    print("No 'Standard Value' column in chembl_data_filtered")

Calculated pKi for ChEMBL data:
   Standard Value  pKi_calc
0           400.0  6.397940
1          4443.0  5.352324
2            28.0  7.552842
3            96.0  7.017729
4            62.4  7.204815
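
The conversion above is only valid when Kᵢ is reported in nM. Which values `Standard Units` takes in this export is not shown, so the following is a hedged sketch on a toy frame (illustrative values, not taken from the dataset) of how the unit could be checked before taking the logarithm.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filtered ChEMBL table; the values are illustrative only.
toy = pd.DataFrame({
    "Standard Value": [400.0, 0.4, 28.0, -1.0],
    "Standard Units": ["nM", "uM", "nM", "nM"],
})

# Keep only positive Ki values that are explicitly reported in nM.
mask = (toy["Standard Units"] == "nM") & (toy["Standard Value"] > 0)
toy_nm = toy.loc[mask].copy()

# Same formula as in the cell above: pKi = 9 - log10(Ki [nM]).
toy_nm["pKi_calc"] = 9 - np.log10(toy_nm["Standard Value"])
print(toy_nm)
```

Rows with other units would need their own conversion factor rather than being silently treated as nM.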


In [137]:
chembl_data_calc.columns

Index(['Molecule ChEMBL ID', 'Molecule Name', 'Molecule Max Phase',
       'Molecular Weight', '#RO5 Violations', 'AlogP', 'Compound Key',
       'smiles', 'Standard Type', 'Standard Relation', 'Standard Value',
       'Standard Units', 'pChEMBL Value', 'Data Validity Comment', 'Comment',
       'Uo Units', 'Ligand Efficiency BEI', 'Ligand Efficiency LE',
       'Ligand Efficiency LLE', 'Ligand Efficiency SEI', 'Potential Duplicate',
       'Assay ChEMBL ID', 'Assay Description', 'Assay Type', 'BAO Format ID',
       'BAO Label', 'Assay Organism', 'Assay Tissue ChEMBL ID',
       'Assay Tissue Name', 'Assay Cell Type', 'Assay Subcellular Fraction',
       'Assay Parameters', 'Assay Variant Accession', 'Assay Variant Mutation',
       'Target ChEMBL ID', 'Target Name', 'Target Organism', 'Target Type',
       'Document ChEMBL ID', 'Source ID', 'Source Description',
       'Document Journal', 'Document Year', 'Cell ChEMBL ID', 'Properties',
       'relation_clean', 'pKi_numeric', 'pKi_ca

In [138]:
chembl_data_calc = chembl_data_calc[['smiles', 'pKi_calc']]

### Processing the ZINC Data

In [139]:
zinc_data.columns

Index(['zinc_id', 'smiles', 'ortholog_name', 'gene_name', 'affinity',
       'chembldocid', 'title', 'reference.pubmed_id', 'reference.doi',
       'reference.chembl_id', 'reference.journal', 'reference.year',
       'pKi_numeric'],
      dtype='object')

In [140]:
zinc_data = zinc_data[['smiles', 'pKi_numeric']]

### Merging ZINC & ChEMBL Datasets

In [141]:
zinc_df = zinc_data.set_index('smiles')
chembl_df = chembl_data_calc.set_index('smiles')

combined_data = zinc_df.combine_first(chembl_df)

combined_data = combined_data.reset_index()

In [142]:
combined_data.tail(15)

Unnamed: 0,smiles,pKi_calc,pKi_numeric
5574,c1ccc2c(c1)Cc1ccccc1N1O[C@@H](CN3CCCC3)C[C@H]21,,7.6
5575,c1ccc2c(c1)Cc1ccccc1N1O[C@@H](CN3CCOCC3)C[C@H]21,,6.85
5576,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCCC3)C[C@H]21,,7.6
5577,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCOCC3)C[C@@H]21,,6.85
5578,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCOCC3)C[C@H]21,,6.85
5579,c1ccc2c(c1)Cc1ccccc1[C@H]1OC(CN3CCOCC3)C[C@H]21,6.363512,
5580,c1ccc2c(c1)Cc1ccccc1[C@H]1O[C@@H](CN3CCOCC3)C[...,,6.36
5581,c1ccc2c(c1)Cc1ccccc1[C@H]1O[C@H](CN3CCOCC3)C[C...,,6.36
5582,c1ccc2c(c1)cc1n2CCNC1,5.982967,5.98
5583,c1ccc2c(c1)cc1n2CCNC1,6.197226,5.98


In [143]:
# Fill missing ZINC pKi values with the pKi calculated from the ChEMBL data
combined_data['pKi_numeric'] = combined_data['pKi_numeric'].fillna(combined_data['pKi_calc'])

print("\nDataFrame after filling NaNs:")
combined_data.tail(15)


DataFrame after filling NaNs:


Unnamed: 0,smiles,pKi_calc,pKi_numeric
5574,c1ccc2c(c1)Cc1ccccc1N1O[C@@H](CN3CCCC3)C[C@H]21,,7.6
5575,c1ccc2c(c1)Cc1ccccc1N1O[C@@H](CN3CCOCC3)C[C@H]21,,6.85
5576,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCCC3)C[C@H]21,,7.6
5577,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCOCC3)C[C@@H]21,,6.85
5578,c1ccc2c(c1)Cc1ccccc1N1O[C@H](CN3CCOCC3)C[C@H]21,,6.85
5579,c1ccc2c(c1)Cc1ccccc1[C@H]1OC(CN3CCOCC3)C[C@H]21,6.363512,6.363512
5580,c1ccc2c(c1)Cc1ccccc1[C@H]1O[C@@H](CN3CCOCC3)C[...,,6.36
5581,c1ccc2c(c1)Cc1ccccc1[C@H]1O[C@H](CN3CCOCC3)C[C...,,6.36
5582,c1ccc2c(c1)cc1n2CCNC1,5.982967,5.98
5583,c1ccc2c(c1)cc1n2CCNC1,6.197226,5.98
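
The last two rows of the output show the same SMILES (`c1ccc2c(c1)cc1n2CCNC1`) twice, with two different `pKi_calc` values and the same ZINC value in `pKi_numeric`. This means repeated SMILES survive the `combine_first` step, and the ZINC value takes priority wherever both sources have one. If one row per structure is wanted, a possible alternative is to stack both tables and aggregate per SMILES. The sketch below uses toy frames, and the choice of the median is an assumption, not the approach used in this notebook.

```python
import pandas as pd

# Toy frames mimicking the two-column tables built above (values illustrative).
zinc_toy = pd.DataFrame({"smiles": ["CCO", "CCN"], "pKi_numeric": [5.98, 7.60]})
chembl_toy = pd.DataFrame({
    "smiles": ["CCO", "CCO", "CCC"],
    "pKi_calc": [5.982967, 6.197226, 6.363512],
})

# Stack both sources under one column name.
stacked = pd.concat(
    [zinc_toy, chembl_toy.rename(columns={"pKi_calc": "pKi_numeric"})],
    ignore_index=True,
)

# Collapse repeated SMILES into a single row using the median pKi.
merged_toy = stacked.groupby("smiles", as_index=False)["pKi_numeric"].median()
print(merged_toy)
```

Unlike `combine_first`, this treats ZINC and ChEMBL measurements equally. Note also that both approaches match on the raw SMILES string, so the same molecule written in two different ways would still be counted as two structures.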


In [144]:
combined_data = combined_data.drop(columns=['pKi_calc'])

In [145]:
combined_data.shape

(5589, 2)

In [146]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5589 entries, 0 to 5588
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   smiles       5589 non-null   object 
 1   pKi_numeric  5589 non-null   float64
dtypes: float64(1), object(1)
memory usage: 87.5+ KB


In [147]:
combined_data.to_csv('data/processed/combined_zinc_chembl_data.csv', index=False)

### Processing Mordred Descriptors Database

The Mordred_descriptors dataset contains molecular descriptors computed using the [Mordred](https://mordred-descriptor.github.io/documentation/v0.5.0/descriptors.html) software package. 

In [None]:
numeric_cols = descriptors_data.select_dtypes(include=['number']).columns.tolist()
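
`descriptors_data.info()` reports 843 `object` columns out of 1614, and `head()` shows that columns such as `ABC` and `ABCGG` hold the error message `module 'numpy' has no attribute 'float'` instead of numbers. `select_dtypes(include=['number'])` skips every such column, including any that contain mostly valid numbers stored as text. A hedged sketch of one way to recover them follows; it runs on a toy frame, and the column contents are illustrative.

```python
import pandas as pd

# Toy stand-in for descriptors_data; the error text mirrors the strings seen in head().
desc_toy = pd.DataFrame({
    "smiles": ["CCO", "CCN"],
    "ABC": ["module 'numpy' has no attribute 'float'."] * 2,
    "SpAbs_A": ["42.06", "22.57"],
    "MW": [452.21, 235.12],
})

features = desc_toy.drop(columns=["smiles"])

# Non-numeric entries (such as the error messages) become NaN.
coerced = features.apply(pd.to_numeric, errors="coerce")

# Drop descriptors that failed for every molecule.
usable = coerced.dropna(axis=1, how="all")
print(usable.columns.tolist())
```

Columns that are only partly NaN after coercion would still need an explicit decision, for example dropping them or imputing the missing values.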