# Ligand-based experiments to predict binding affinities in PKIS2

This notebook featurizes the PKIS2 dataset with `MorganFingerprintFeaturizer` and writes one on-disk `npz` file per kinase and measurement type.

Output files are written to `_output/`, as in:

* `_output/PKIS2__YES__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__YSK1__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__YSK4__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__ZAK__PercentageDisplacementMeasurement.npz`


Each `npz` will contain two `np.ndarray` objects: `X` (featurized systems) and `y` (associated measurements), plus the train/test/validation indices.
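
As a rough sketch of how such a file can be read back, the example below writes a toy `npz` with the same keys (`X`, `y`, `idx_train`, `idx_test`, `idx_val`) and reloads it with NumPy. The kinase name `EXAMPLE`, the array shapes and all values are made up for illustration; only the key names come from this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for one kinase / measurement-type file (values are made up)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10, 512)).astype("uint8")  # 10 fake fingerprints
y = np.linspace(0, 100, num=10, dtype="float32")        # 10 fake measurements
indices = {
    "idx_train": np.arange(0, 8),
    "idx_test": np.array([8]),
    "idx_val": np.array([9]),
}

# Hypothetical file name following the PKIS2__<kinase>__<type>.npz pattern
path = Path(tempfile.mkdtemp()) / "PKIS2__EXAMPLE__PercentageDisplacementMeasurement.npz"
np.savez(path, X=X, y=y, **indices)

# Reload the arrays and select the training subset
with np.load(path) as data:
    keys = sorted(data.files)
    X_train = data["X"][data["idx_train"]]
    y_train = data["y"][data["idx_train"]]

print(keys)
print(X_train.shape, y_train.shape)
```

The index arrays select rows of `X` and `y`, so the same file can serve the training, test and validation subsets without storing the data three times.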

In [None]:
# Filter out some warnings thrown by openforcefield and rdkit
import warnings
warnings.simplefilter("ignore")
import logging
logging.basicConfig(level=logging.ERROR)

import numpy as np
import os
from pathlib import Path

import pytorch_lightning as pl
pl.seed_everything(1234);

In [None]:
HERE = Path(_dh[-1])

In [None]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider
pkis2 = PKIS2DatasetProvider.from_source()

In [None]:
pkis2

In [None]:
df = pkis2.to_dataframe()
df

This featurization pipeline consists of two steps:

- Converting each ligand to an RDKit molecule (`SmilesToLigandFeaturizer(style="rdkit")`)
- Generating the Morgan fingerprint with `nbits=512` and `radius=2`

In [None]:
from kinoml.features.ligand import SmilesToLigandFeaturizer, MorganFingerprintFeaturizer
from kinoml.features.core import Pipeline

morgan_featurizer = Pipeline(
    [
        SmilesToLigandFeaturizer(style="rdkit"),
        MorganFingerprintFeaturizer(nbits=512, radius=2),
    ]
)

In [None]:
# prefeaturize everything -- use single process for this dataset to benefit from the LRU cache!
pkis2.featurize(morgan_featurizer, processes=1);
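
The comment in the cell above refers to caching. As a minimal sketch of why a single process can help, the example below uses `functools.lru_cache` as a stand-in: when the same ligand appears in several measurements, the expensive step runs only once per distinct input. The `featurize` function and the toy data are hypothetical, and how KinoML implements its cache internally is an assumption here, not something this notebook shows.

```python
from functools import lru_cache

calls = []  # records how often the expensive step actually runs


@lru_cache(maxsize=None)
def featurize(smiles: str) -> tuple:
    """Hypothetical stand-in for an expensive ligand featurizer."""
    calls.append(smiles)
    return tuple(ord(char) % 2 for char in smiles)


# The same ligand measured against several (made-up) kinases
measurements = [
    ("CCO", "kinase_A"),
    ("CCO", "kinase_B"),
    ("CCO", "kinase_C"),
    ("c1ccccc1", "kinase_A"),
]
features = [featurize(smiles) for smiles, _ in measurements]

print(len(features), len(calls))
print(featurize.cache_info())
```

An in-memory cache like this one lives inside a single Python process, so work spread over several worker processes would start from empty caches and repeat some of the featurization.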

Remove systems that couldn't be featurized. A system counts as valid only if its `featurizations` dictionary contains the `last` key.

In [None]:
from kinoml.datasets.groups import CallableGrouper, RandomGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(pkis2, overwrite=True)
groups = pkis2.split_by_groups()
len(groups.get('valid', [])), len(groups.get('invalid', []))

Split by kinase name, since these models are ligand-based (so one model per kinase).
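
The idea of grouping by a callable can be pictured in plain Python, without any KinoML objects. In the sketch below, `split_by_key` and the toy measurement dictionaries are hypothetical and only illustrate the pattern; they are not the library's implementation.

```python
from collections import defaultdict


def split_by_key(items, key):
    """Group items into a dict of lists, keyed by the result of key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


# Made-up measurements; only the kinase names are taken from the file list above
measurements = [
    {"kinase": "YES", "ligand": "CCO", "value": 12.0},
    {"kinase": "ZAK", "ligand": "CCO", "value": 55.0},
    {"kinase": "YES", "ligand": "c1ccccc1", "value": 80.0},
]

by_kinase = split_by_key(measurements, key=lambda m: m["kinase"])
print({kinase: len(group) for kinase, group in by_kinase.items()})
```

Each resulting group can then be treated as its own small dataset, which is what a one-model-per-kinase setup needs.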

In [None]:
grouper = CallableGrouper(lambda measurement: measurement.system.protein.name)
grouper.assign(groups['valid'], overwrite=True)
groups_by_kinase = groups['valid'].split_by_groups()

Split each kinase group by measurement type too. We need two for-loops for that:
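
For intuition, a random 80/10/10 split into named index arrays could look like the sketch below. The function `random_split_indices` and its `seed` parameter are hypothetical; this is one common way to do such a split with NumPy, not necessarily what `RandomGrouper` does internally.

```python
import numpy as np


def random_split_indices(n_items, ratios, seed=1234):
    """Shuffle range(n_items) and cut it into named groups by the given ratios."""
    if not np.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_items)
    # Cumulative cut points; the last group takes whatever remains
    fractions = np.cumsum(list(ratios.values()))[:-1]
    cuts = np.round(fractions * n_items).astype(int)
    parts = np.split(shuffled, cuts)
    return dict(zip(ratios.keys(), parts))


indices = random_split_indices(
    100, {"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1}
)
print({name: len(idx) for name, idx in indices.items()})
```

The index arrays only make sense for the dataset they were computed from, so they should be generated from the same subset that provides `X` and `y`.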

In [None]:
type_grouper = CallableGrouper(lambda measurement: type(measurement).__name__)
random_grouper = RandomGrouper({"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1})

output = HERE / "_output/"
output.mkdir(parents=True, exist_ok=True)

# Process the kinases with the most measurements first
for kinase, ds in sorted(groups_by_kinase.items(), key=lambda kv: len(kv[1]), reverse=True):
    type_grouper.assign(ds, overwrite=True)
    types = ds.split_by_groups()
    for mtype, ds_ in types.items():
        # Compute the split on the per-type subset so the indices match X and y
        indices = random_grouper.indices(ds_)
        X = np.asarray(ds_.featurized_systems())
        y = ds_.measurements_as_array()
        np.savez(output / f"PKIS2__{kinase}__{mtype}.npz", X=X, y=y.astype('float32'), **indices)

Annotate observation models for `pytorch` and `xgboost` (we will need these in the next notebooks).

In [None]:
observation_model_pytorch = pkis2.observation_model(backend="pytorch")
display(observation_model_pytorch)

# Reproducibility logs

In [None]:
from kinoml.utils import watermark
watermark()