# Ligand-based experiments to predict binding affinities in ChEMBL

This notebook featurizes the ChEMBL dataset with `MorganFingerprintFeaturizer` and writes on-disk `npz` files, one for each kinase and measurement type.

Output files are written to `_output/`, as in:

* `_output/ChEMBL__O00141__pIC50Measurement.npz`
* `_output/ChEMBL__O00141__pKdMeasurement.npz`
* `_output/ChEMBL__O00141__pKiMeasurement.npz`
* `_output/ChEMBL__O00238__pIC50Measurement.npz`
* `_output/ChEMBL__O00238__pKdMeasurement.npz`

Each `npz` file contains two `np.ndarray` objects: `X` (featurized systems) and `y` (associated measurements).
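
As a minimal sketch of how such a file can be read back, the snippet below writes a toy `npz` with the same two keys and reloads it. The array contents and the `EXAMPLE` file name are placeholders, not real ChEMBL data.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for a real output file: 3 systems, 1024-bit fingerprints.
X = np.zeros((3, 1024), dtype="uint8")
y = np.array([6.5, 7.2, 5.9], dtype="float32")

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name that follows the naming pattern listed above.
    path = Path(tmp) / "ChEMBL__EXAMPLE__pIC50Measurement.npz"
    np.savez(path, X=X, y=y)

    # np.load returns a lazy NpzFile; the context manager closes it afterwards.
    with np.load(path) as data:
        X_loaded, y_loaded = data["X"], data["y"]

print(X_loaded.shape, y_loaded.shape)
```

Indexing the loaded archive by key reads each array fully into memory, so `X_loaded` and `y_loaded` stay usable after the file is closed.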

In [None]:
# Filter out some warnings thrown by openforcefield and rdkit
import warnings
warnings.simplefilter("ignore")
import logging
logging.basicConfig(level=logging.ERROR)

import numpy as np
import os
from pathlib import Path

from kinoml.utils import seed_everything
seed_everything()

In [None]:
HERE = Path(_dh[-1])

In [None]:
from kinoml.datasets.chembl import ChEMBLDatasetProvider
chembl = ChEMBLDatasetProvider.from_source()

In [None]:
chembl

In [None]:
df = chembl.to_dataframe()
df

This featurization pipeline consists of:

- Promoting the SMILES wrapper objects returned by ChEMBL to full OpenForceField molecules
- Converting them to RDKit molecules
- Generating Morgan fingerprints with `nbits=1024` and `radius=2`
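
The chaining idea behind these steps can be sketched in plain Python: each step receives the output of the previous one. Everything here is a hypothetical stand-in. `run_pipeline`, `tokenize` and `to_bits` are not part of kinoml, and the toy hash is not the Morgan algorithm.

```python
from __future__ import annotations

from functools import reduce
from typing import Any, Callable, Sequence


def run_pipeline(steps: Sequence[Callable[[Any], Any]], value: Any) -> Any:
    """Feed `value` through each step in order, passing each output on."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical stand-in steps (no real chemistry involved).
def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into single characters."""
    return list(smiles)


def to_bits(tokens: list[str], nbits: int = 16) -> list[int]:
    """Fold tokens into a fixed-length bit vector with a toy hash."""
    bits = [0] * nbits
    for token in tokens:
        bits[ord(token) % nbits] = 1
    return bits


fingerprint = run_pipeline([tokenize, to_bits], "CCO")
print(fingerprint)
```

The result is a fixed-length vector regardless of the input length, which is the property that lets fingerprints of different molecules be stacked into a single array.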

In [None]:
from kinoml.features.ligand import SmilesToLigandFeaturizer, MorganFingerprintFeaturizer
from kinoml.features.protein import AminoAcidCompositionFeaturizer
from kinoml.features.core import ScaleFeaturizer, Concatenated, Pipeline

morgan_featurizer = Pipeline([SmilesToLigandFeaturizer(), MorganFingerprintFeaturizer(nbits=1024, radius=2)])

In [None]:
# prefeaturize everything
chembl.featurize(morgan_featurizer, processes=6);

Flag systems that could not be featurized, so that only the valid ones are used from here on:

In [None]:
from kinoml.datasets.groups import CallableGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(chembl, overwrite=True)
groups = chembl.split_by_groups()
len(groups.get('valid', [])), len(groups.get('invalid', []))

Split by kinase name, since these models are ligand-based (so one model per kinase).

In [None]:
grouper = CallableGrouper(lambda measurement: measurement.system.protein.name)
grouper.assign(groups['valid'], overwrite=True)
groups_by_kinase = groups['valid'].split_by_groups()

Split each kinase group by measurement type too. We need two for-loops for that:

In [None]:
type_grouper = CallableGrouper(lambda measurement: type(measurement).__name__)

output = HERE / "_output/"
output.mkdir(parents=True, exist_ok=True)

for kinase, ds in sorted(groups_by_kinase.items(), key=lambda kv: len(kv[1]), reverse=True):
    type_grouper.assign(ds, overwrite=True)
    types = ds.split_by_groups()
    for mtype, ds_ in types.items():
        X = np.asarray(ds_.featurized_systems())
        y = ds_.measurements_as_array()
        np.savez(output / f"ChEMBL__{kinase}__{mtype}.npz", X=X, y=y.astype('float32'))
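
The nested loop above is a two-level group-by: first by kinase, then by measurement type within each kinase. A plain-Python sketch of the same pattern follows, using hypothetical `(kinase, measurement type, value)` tuples instead of kinoml dataset objects.

```python
from collections import defaultdict

# Hypothetical records: (kinase identifier, measurement type name, value)
records = [
    ("O00141", "pIC50Measurement", 6.5),
    ("O00141", "pKdMeasurement", 7.1),
    ("O00238", "pIC50Measurement", 5.8),
    ("O00141", "pIC50Measurement", 7.0),
]

# First level: group records by kinase.
by_kinase = defaultdict(list)
for record in records:
    by_kinase[record[0]].append(record)

values_by_file = {}
# Largest kinase groups first, mirroring the sort in the loop above.
for kinase, group in sorted(by_kinase.items(), key=lambda kv: len(kv[1]), reverse=True):
    # Second level: group this kinase's records by measurement type.
    by_type = defaultdict(list)
    for _, mtype, value in group:
        by_type[mtype].append(value)
    for mtype, values in by_type.items():
        values_by_file[f"ChEMBL__{kinase}__{mtype}.npz"] = values

for name, values in values_by_file.items():
    print(name, values)
```

Each key of `values_by_file` corresponds to one output file, so a kinase with several measurement types produces several files.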

Collect the observation models for `pytorch` and the loss adapters for `xgboost` (we will need these in the next notebooks):

In [None]:
observation_models_pytorch = chembl.observation_models(backend="pytorch")
loss_adapters_xgboost = chembl.loss_adapters(backend="xgboost")
display(*observation_models_pytorch)
print()
display(*loss_adapters_xgboost)

# Reproducibility logs

In [None]:
from kinoml.utils import watermark
watermark()