# Ligand-based experiments to predict binding affinities in PKIS2

This notebook featurizes the PKIS2 dataset with `MorganFingerprintFeaturizer` and writes one on-disk `npz` file per kinase and measurement type.

Output files are written to `_output/`, as in:

* `_output/PKIS2__YES__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__YSK1__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__YSK4__PercentageDisplacementMeasurement.npz`
* `_output/PKIS2__ZAK__PercentageDisplacementMeasurement.npz`


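Each `npz` file contains two `np.ndarray` objects, `X` (featurized systems) and `y` (associated measurements), plus the index arrays of the data split. With the split configured in this notebook these are `idx_train` and `idx_test`; the validation split is commented out.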
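
A minimal sketch of how such a file could be read back with NumPy. The file name `PKIS2__EXAMPLE__...` and the small arrays written here are placeholders so the example runs on its own; only the key names (`X`, `y`, `idx_train`, `idx_test`) follow this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np

# Placeholder file standing in for a real output file (values are illustrative only)
path = Path(tempfile.mkdtemp()) / "PKIS2__EXAMPLE__PercentageDisplacementMeasurement.npz"
np.savez(
    path,
    X=np.zeros((4, 512), dtype="uint8"),
    y=np.array([14.0, 28.0, 20.0, 5.0], dtype="float32"),
    idx_train=np.array([0, 1, 2]),
    idx_test=np.array([3]),
)

# Read the arrays back and apply the stored split indices
with np.load(path) as data:
    X, y = data["X"], data["y"]
    X_train, y_train = X[data["idx_train"]], y[data["idx_train"]]
    X_test, y_test = X[data["idx_test"]], y[data["idx_test"]]
```

Storing the indices next to `X` and `y` keeps the split reproducible without re-running the grouper.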

In [1]:
# Filter out some warnings thrown by openforcefield and rdkit
import warnings
warnings.simplefilter("ignore")
import logging
logging.basicConfig(level=logging.ERROR)

import numpy as np
import os
from pathlib import Path

import pytorch_lightning as pl
pl.seed_everything(1234);

In [2]:
HERE = Path(_dh[-1])

In [3]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider
pkis2 = PKIS2DatasetProvider.from_source()



In [4]:
pkis2

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems (AminoAcidSequence=403, SmilesLigand=640)>

In [5]:
df = pkis2.to_dataframe()
df

| | Systems | n_components | PercentageDisplacementMeasurement |
|---|---|---|---|
| 0 | AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2... | 2 | 14.0 |
| 1 | ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc... | 2 | 28.0 |
| 2 | ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc... | 2 | 20.0 |
| 3 | ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2... | 2 | 5.0 |
| 4 | ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C... | 2 | 0.0 |
| ... | ... | ... | ... |
| 261865 | ZAP70 & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc12)... | 2 | 0.0 |
| 261866 | p38-alpha & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c... | 2 | 0.0 |
| 261867 | p38-beta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc... | 2 | 0.0 |
| 261868 | p38-delta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c... | 2 | 0.0 |


The featurization pipeline has two steps:

- Converting each ligand's SMILES into an RDKit molecule (`SmilesToLigandFeaturizer(style="rdkit")`)
- Generating the Morgan fingerprint with `nbits=512` and `radius=2` (`MorganFingerprintFeaturizer`)
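
The two steps can be sketched with plain RDKit as follows. This is an illustration under assumptions, not KinoML's implementation: the helper `morgan_bits` is hypothetical, the SMILES (phenol) is an arbitrary example rather than a PKIS2 ligand, and the featurizer may differ in details such as bit vectors versus counts.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, nbits=512):
    """Hypothetical helper: SMILES -> RDKit molecule -> Morgan bit vector as a NumPy array."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=nbits)
    arr = np.zeros((nbits,), dtype=np.uint8)
    DataStructs.ConvertToNumpyArray(bitvect, arr)  # fills arr with 0/1 values
    return arr


fp = morgan_bits("c1ccccc1O")  # arbitrary example molecule (phenol)
```

Each ligand then becomes a fixed-length binary vector, which is what ends up as one row of `X`.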

In [6]:
from kinoml.features.ligand import SmilesToLigandFeaturizer, MorganFingerprintFeaturizer
from kinoml.features.protein import AminoAcidCompositionFeaturizer
from kinoml.features.core import ScaleFeaturizer, Concatenated, Pipeline

morgan_featurizer = Pipeline([SmilesToLigandFeaturizer(style="rdkit"), MorganFingerprintFeaturizer(nbits=512, radius=2)])

In [7]:
# prefeaturize everything -- use single process for this dataset to benefit from the LRU cache!
pkis2.featurize(morgan_featurizer, processes=1);


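Flag systems that could not be featurized so they can be dropped. A system counts as valid when its `featurizations` dictionary contains the `'last'` key.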

In [13]:
from kinoml.datasets.groups import CallableGrouper, RandomGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(pkis2, overwrite=True)
groups = pkis2.split_by_groups()
len(groups.get('valid', [])), len(groups.get('invalid', []))


(261870, 0)

Split by kinase name, since these models are ligand-based (so one model per kinase).

In [14]:
grouper = CallableGrouper(lambda measurement: measurement.system.protein.name)
grouper.assign(groups['valid'], overwrite=True)
groups_by_kinase = groups['valid'].split_by_groups()
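
The idea behind grouping with a callable can be sketched in plain Python. This is a conceptual stand-in, not how `CallableGrouper` is implemented: `split_by_key` is a hypothetical helper and the `(kinase, value)` tuples are illustrative placeholders for measurement objects.

```python
from collections import defaultdict

# Illustrative placeholders for measurements: (kinase name, measured value)
measurements = [("AAK1", 14.0), ("ABL2", 5.0), ("AAK1", 0.0), ("ZAP70", 0.0)]


def split_by_key(items, key):
    """Hypothetical helper: collect items into a dict keyed by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


# One group per kinase name, mirroring the lambda passed to the grouper above
by_kinase = split_by_key(measurements, key=lambda m: m[0])
```

The same pattern covers the valid/invalid split: only the key function changes.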


Split each kinase group by measurement type too. This takes two nested for-loops:

In [None]:
type_grouper = CallableGrouper(lambda measurement: type(measurement).__name__)
random_grouper = RandomGrouper({"idx_train": 0.75, "idx_test": 0.25}) #, "idx_val": 0.1})

output = HERE / "_output/"
output.mkdir(parents=True, exist_ok=True)

for kinase, ds in sorted(groups_by_kinase.items(), key=lambda kv: len(kv[1]), reverse=True):
    type_grouper.assign(ds, overwrite=True)
    types = ds.split_by_groups()
    for mtype, ds_ in types.items():
        indices = random_grouper._assign(ds_)  # split the same subset that X and y are built from
        X = np.asarray(ds_.featurized_systems())
        y = ds_.measurements_as_array()
        np.savez(output / f"PKIS2__{kinase}__{mtype}.npz", X=X, y=y.astype('float32'), **indices)
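
For intuition, a 75/25 random split into named index arrays could look like the sketch below. It is not `RandomGrouper`'s implementation: `random_split_indices` is a hypothetical helper, and the item count of 640 is only an example value (the number of ligands in the provider summary above).

```python
import numpy as np


def random_split_indices(n_items, ratios, seed=1234):
    """Hypothetical helper: shuffle range(n_items) and cut it according to ratios."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_items)
    names = list(ratios)
    # Cumulative cut points; the last group absorbs any rounding remainder
    bounds = np.floor(np.cumsum([ratios[name] for name in names]) * n_items).astype(int)
    bounds[-1] = n_items
    parts = np.split(permuted, bounds[:-1])
    return {name: np.sort(part) for name, part in zip(names, parts)}


split = random_split_indices(640, {"idx_train": 0.75, "idx_test": 0.25})
```

The resulting dictionary has the same shape as the `indices` unpacked into `np.savez` above: one integer array per split name.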

Annotate the observation models for `pytorch` and `xgboost` (the next notebooks will need them).

In [20]:
observation_model_pytorch = pkis2.observation_model(backend="pytorch")
display(observation_model_pytorch)

<function kinoml.core.measurements.PercentageDisplacementMeasurement._observation_model_pytorch(dG_over_KT, inhibitor_conc=1, standard_conc=1, **kwargs)>




# Reproducibility logs

In [23]:
from kinoml.utils import watermark
watermark()

Watermark
---------
pytorch_lightning 0.9.0
logging           0.5.1.2
numpy             1.19.1
kinoml            0+untagged.194.ga840398.dirty
last updated: 2020-10-01 21:39:02 CEST 2020-10-01T21:39:02+02:00

CPython 3.7.8
IPython 7.17.0

compiler   : GCC 7.5.0
system     : Linux
release    : 4.19.128-microsoft-standard
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
host name  : jrodriguez
Git hash   : 17e0526faa2b817814304742ac61ee3bd12d7abd
watermark 2.0.2

conda
-----
sys.version: 3.7.6 | packaged by conda-forge | (defau...
sys.prefix: /opt/miniconda
sys.executable: /opt/miniconda/bin/python
conda location: /opt/miniconda/lib/python3.7/site-packages/conda
conda-build: /opt/miniconda/bin/conda-build
conda-convert: /opt/miniconda/bin/conda-convert
conda-debug: /opt/miniconda/bin/conda-debug
conda-develop: /opt/miniconda/bin/conda-develop
conda-env: /opt/miniconda/bin/conda-env
conda-index: /opt/miniconda/bin/conda-index
conda-inspect: /opt/miniconda/bin/cond