# Ligand-based experiments to predict binding affinities in ChEMBL

This notebook featurizes the ChEMBL dataset with `MorganFingerprintFeaturizer` and writes one on-disk `npz` file per kinase and measurement type.

Output files are written to `_output/`, for example:

* `_output/ChEMBL__O00141__pIC50Measurement.npz`
* `_output/ChEMBL__O00141__pKdMeasurement.npz`
* `_output/ChEMBL__O00141__pKiMeasurement.npz`
* `_output/ChEMBL__O00238__pIC50Measurement.npz`
* `_output/ChEMBL__O00238__pKdMeasurement.npz`

Each `npz` file contains two `np.ndarray` objects: `X` (featurized systems) and `y` (associated measurements).
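
A downstream notebook could read one of these files back with `np.load`. The sketch below first writes a small synthetic file so that it runs on its own; the file name, the temporary directory, and the array shapes are placeholders, not real ChEMBL data.

```python
import tempfile
from pathlib import Path

import numpy as np

# Synthetic stand-in for one output file: 5 "systems" with 1024-bit fingerprints.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(5, 1024)).astype("uint8")
y_demo = rng.normal(6.0, 1.0, size=5).astype("float32")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ChEMBL__EXAMPLE__pKdMeasurement.npz"  # hypothetical name
    np.savez(path, X=X_demo, y=y_demo)

    # np.load returns an NpzFile; index it with the keyword names passed to savez.
    with np.load(path) as data:
        X, y = data["X"], data["y"]

print(X.shape, y.shape, y.dtype)
```

Closing the `NpzFile` (here via the `with` block) releases the underlying file handle once the arrays have been read.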

In [1]:
# automated logging to weights&biases
WITH_WANDB = True
if WITH_WANDB:
    import os
    os.environ["WANDB_MODE"] = "dryrun"
    import wandb
    wandb.login()
    wandb.init(entity="kinoml-experiments", project="ligand-based",
               name="ChEMBL/MorganFingerprint", job_type="featurization",
               dir="_output/")

wandb: Offline run mode, not syncing to the cloud.
wandb: Tracking run with wandb version 0.10.2
wandb: W&B is disabled in this directory.  Run `wandb on` to enable cloud syncing.
wandb: Run data is saved locally in _output/wandb/offline-run-20200923_150633-2dqlw5cp





In [2]:
# Filter out some warnings thrown by openforcefield and rdkit
import warnings
warnings.simplefilter("ignore")
import logging
logging.basicConfig(level=logging.ERROR)

import numpy as np
import os
from pathlib import Path

In [3]:
# _dh is IPython's directory history; its last entry is the current working directory
HERE = Path(_dh[-1])

In [4]:
from kinoml.datasets.chembl import ChEMBLDatasetProvider
chembl = ChEMBLDatasetProvider.from_source()



In [4]:
chembl

<ChEMBLDatasetProvider with 203380 measurements (pIC50Measurement=170121, pKdMeasurement=17050, pKiMeasurement=16209), and 162584 systems (AminoAcidSequence=422, SmilesLigand=103097)>

In [5]:
df = chembl.to_dataframe()
df

| | Systems | n_components | Measurement | MeasurementType |
|---|---|---|---|---|
| 0 | `P00533 & Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F...` | 2 | 7.387216 | pIC50Measurement |
| 1 | `P35968 & Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F...` | 2 | 4.782516 | pIC50Measurement |
| 2 | `P00533 & Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)...` | 2 | 6.769551 | pIC50Measurement |
| 3 | `P06239 & Nc1ncnc2c1c(-c1cccc(Oc3ccccc3)c1)cn2C...` | 2 | 6.853872 | pIC50Measurement |
| 4 | `P06239 & Nc1ncnc2c1c(-c1cccc(Oc3ccccc3)c1)cn2C...` | 2 | 5.928118 | pIC50Measurement |
| ... | ... | ... | ... | ... |
| 203375 | `P42345 & Nc1cc(C(F)F)c(-c2nc(N3CCOCC3)cc(N3CCO...` | 2 | 7.376751 | pKiMeasurement |
| 203376 | `P42345 & Nc1cc(C(F)(F)F)c(-c2cc(N3C4CCC3COC4)n...` | 2 | 7.522879 | pKiMeasurement |
| 203377 | `P42345 & Nc1cc(C(F)F)c(-c2cc(N3C4CCC3COC4)nc(N...` | 2 | 7.920819 | pKiMeasurement |
| 203378 | `P42345 & Nc1cc(C(F)(F)F)c(-c2nc(N3C4CCC3COC4)c...` | 2 | 6.361511 | pKiMeasurement |


This featurization pipeline consists of:

- Promoting the SMILES wrapper objects returned by ChEMBL to full OpenForceField molecules
- Converting each one to an RDKit molecule
- Generating the Morgan fingerprint with `nbits=1024` and `radius=2`
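
For orientation, the last two steps can be reproduced with RDKit alone. This is a minimal sketch, not KinoML's implementation: it skips the OpenForceField step and parses the SMILES string directly with RDKit, and the example molecule (aspirin) is an arbitrary choice. KinoML's featurizers may differ in detail.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, an arbitrary example ligand
mol = Chem.MolFromSmiles(smiles)

# Morgan (circular) fingerprint with radius 2, folded to 1024 bits.
bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

# Copy the RDKit bit vector into a NumPy array of 0/1 values
# (ConvertToNumpyArray resizes the target array as needed).
fp = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(bitvect, fp)

print(fp.shape, int(fp.sum()))
```

Folding to a fixed length means different substructures can collide on the same bit; a larger `nBits` reduces collisions at the cost of wider feature vectors.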

In [6]:
from kinoml.features.ligand import SmilesToLigandFeaturizer, MorganFingerprintFeaturizer
from kinoml.features.protein import AminoAcidCompositionFeaturizer
from kinoml.features.core import ScaleFeaturizer, Concatenated, Pipeline

morgan_featurizer = Pipeline([SmilesToLigandFeaturizer(), MorganFingerprintFeaturizer(nbits=1024, radius=2)])

In [7]:
# prefeaturize everything
chembl.featurize(morgan_featurizer, processes=6);


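Remove systems that couldn't be featurized, i.e. measurements whose system has no `last` entry in `featurizations`. Here, 2 of the 203380 measurements are dropped.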

In [8]:
from kinoml.datasets.groups import CallableGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(chembl, overwrite=True)
groups = chembl.split_by_groups()
len(groups.get('valid', [])), len(groups.get('invalid', []))

(203378, 2)
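
The grouping pattern used here, a callable that maps each measurement to a label followed by a split on that label, can be sketched in plain Python. The helper `split_by_key` and the dictionary-based records are hypothetical stand-ins for illustration; they are not part of KinoML.

```python
from collections import defaultdict

def split_by_key(items, key):
    """Group items into a dict of lists using the label returned by `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

# Toy records standing in for measurements; the field names are made up.
measurements = [
    {"protein": "P00533", "featurizations": {"last": [0, 1, 1]}},
    {"protein": "P35968", "featurizations": {}},  # featurization failed
    {"protein": "P00533", "featurizations": {"last": [1, 0, 0]}},
]

# First split: keep only records that carry a "last" featurization.
by_status = split_by_key(
    measurements,
    lambda m: "valid" if "last" in m["featurizations"] else "invalid",
)

# The same helper works for any other label, such as a protein identifier.
by_protein = split_by_key(by_status["valid"], lambda m: m["protein"])

print({k: len(v) for k, v in by_status.items()})
```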

Split by kinase name, since these models are ligand-based (so one model per kinase).

In [12]:
grouper = CallableGrouper(lambda measurement: measurement.system.protein.name)
grouper.assign(groups['valid'], overwrite=True)
groups_by_kinase = groups['valid'].split_by_groups()

Split each kinase group by measurement type too. We need two for-loops for that:

In [None]:
type_grouper = CallableGrouper(lambda measurement: type(measurement).__name__)

output = HERE / "_output/"
output.mkdir(parents=True, exist_ok=True)

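# Process the largest kinase groups first
for kinase, ds in sorted(groups_by_kinase.items(), key=lambda kv: len(kv[1]), reverse=True):
    # Label each measurement with its class name, e.g. pIC50Measurement
    type_grouper.assign(ds, overwrite=True)
    types = ds.split_by_groups()
    for mtype, ds_ in types.items():
        # X: featurized systems; y: measurement values, stored as float32
        X = np.asarray(ds_.featurized_systems())
        y = ds_.measurements_as_array()
        np.savez(output / f"ChEMBL__{kinase}__{mtype}.npz", X=X, y=y.astype('float32'))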
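
Because the kinase and measurement type are encoded in each file name, a later step could rebuild an index of available datasets from the directory listing alone. The sketch below assumes that neither field contains a double underscore; `parse_output_name` is a hypothetical helper, and the files it scans are empty placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path

def parse_output_name(path):
    """Split 'ChEMBL__<kinase>__<mtype>.npz' into (kinase, mtype)."""
    prefix, kinase, mtype = Path(path).stem.split("__")
    if prefix != "ChEMBL":
        raise ValueError(f"unexpected file name: {path}")
    return kinase, mtype

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Empty placeholder files named after examples listed at the top of this notebook.
    for name in (
        "ChEMBL__O00141__pIC50Measurement.npz",
        "ChEMBL__O00141__pKdMeasurement.npz",
        "ChEMBL__O00238__pIC50Measurement.npz",
    ):
        (out / name).touch()

    # Map each kinase to the measurement types that have a file on disk.
    index = {}
    for path in sorted(out.glob("ChEMBL__*__*.npz")):
        kinase, mtype = parse_output_name(path)
        index.setdefault(kinase, []).append(mtype)

print(index)
```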

Collect the observation models for `pytorch` and the loss adapters for `xgboost`; later notebooks need both.

In [14]:
observation_models_pytorch = chembl.observation_models(backend="pytorch")
loss_adapters_xgboost = chembl.loss_adapters(backend="xgboost")
display(*observation_models_pytorch)
print()
display(*loss_adapters_xgboost)

<function kinoml.core.measurements.pIC50Measurement._observation_model_pytorch(dG_over_KT, substrate_conc=1e-06, michaelis_constant=1, standard_conc=1, **kwargs)>

<function kinoml.core.measurements.pKdMeasurement._observation_model_pytorch(dG_over_KT, standard_conc=1, **kwargs)>

<function kinoml.core.measurements.pKiMeasurement._observation_model_pytorch(dG_over_KT, standard_conc=1, **kwargs)>




<function kinoml.core.measurements.pIC50Measurement._loss_adapter_xgboost__mse(dG_over_KT, dmatrix, substrate_conc=1e-06, michaelis_constant=1, standard_conc=1, **kwargs)>

<function kinoml.core.measurements.pKdMeasurement._loss_adapter_xgboost__mse(dG_over_KT, dmatrix, standard_conc=1, **kwargs)>

<function kinoml.core.measurements.pKiMeasurement._loss_adapter_xgboost__mse(dG_over_KT, dmatrix, standard_conc=1, **kwargs)>
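
The signatures above take a dimensionless free energy `dG_over_KT` plus concentration parameters, but the function bodies are not shown. As a rough guide to what such a mapping could look like, the sketch below uses textbook relations: Kd = C0 · exp(ΔG/kT) for pKd, and the Cheng-Prusoff correction IC50 = Ki · (1 + [S]/Km) for pIC50. This is an assumption for illustration, written with plain `math` rather than `pytorch`; the function names are hypothetical and the code is not taken from KinoML.

```python
import math

def pkd_from_dg(dG_over_KT, standard_conc=1.0):
    """pKd = -log10(Kd), with Kd = standard_conc * exp(dG / kT)."""
    return -(dG_over_KT + math.log(standard_conc)) / math.log(10)

def pic50_from_dg(dG_over_KT, substrate_conc=1e-6,
                  michaelis_constant=1.0, standard_conc=1.0):
    """pIC50 via Cheng-Prusoff: IC50 = Ki * (1 + [S] / Km)."""
    factor = 1.0 + substrate_conc / michaelis_constant
    return -(dG_over_KT + math.log(factor * standard_conc)) / math.log(10)

# Dimensionless free energy of a binder with a 1 nM dissociation constant.
dg = math.log(1e-9)

pkd = pkd_from_dg(dg)
pic50 = pic50_from_dg(dg)
print(pkd, pic50)
```

With the default parameters from the printed signatures, [S]/Km is 1e-6, so under these assumed relations pIC50 sits only marginally below pKd for the same free energy.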