# Contents

In this notebook, we will learn how to load the PKIS2 dataset into memory using a `DatasetProvider` object. Then we will apply some standard featurization to obtain ML-compatible representations, and use a simple model to compute some activity predictions.

1. [X] Loading the data
2. [X] Featurizing the data
3. [ ] Exporting the featurized data to PyTorch
4. [ ] Build and train the model
5. [ ] Analyze results nicely

Disable stereochemistry _warnings_ generated by `openforcefield`.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import warnings
warnings.simplefilter("ignore") 
import logging
logging.basicConfig(level=logging.ERROR)

# 1. Loading the data as a DatasetProvider

In [4]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider



Let's initialize the PKIS2 dataset provider. Instead of a regular `ClassName()` instantiation, we need to use `.from_source()` (with convenient default arguments).

__Why?__

This is due to the design of `BaseDatasetProvider.__init__`, which expects a list of `System` objects (or subclasses thereof). A `System` object is a set of one or more `MolecularComponent` objects and a `BaseMeasurement`.

* The `MolecularComponent` class is the base object for all proteins and ligands, regardless of their representation (e.g. sequence vs 3D structure, SMILES vs molecular graph). `MolecularComponent` is immediately subclassed by:
  * `BaseProtein`, the abstract model which is subclassed by more concrete classes, like `AminoAcidSequence` and `ProteinStructure`.
  * `BaseLigand`, the abstract model which is subclassed by more concrete classes, like `Ligand` (based on `openforcefield.topology.Molecule`).
* A `System` is abstract enough to not impose any restrictions on its composition, but its subclasses can.
  * This is the case for the `Complex` object, which requires at least one `BaseProtein` and one `BaseLigand` object.
* The `BaseMeasurement` class is normally subclassed by more specific measurements (like `PercentageDisplacementMeasurement`), but the design is the same. It takes:
  * `values`: an array of numeric values (single measurements are _arrayfied_ into single-element arrays). These could be replicates of the same value for statistical purposes, or values obtained under different concentrations of a reactant (this is still an open question).
  * `conditions`: an instance of `AssayConditions`. This class provides all the properties required to reproduce the experiment (say `pH`, `temperature`, `concentration`, etc). This should be paired (somehow) with the dimensionality of `values`, but I haven't thought much about that yet.
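
To make the hierarchy above concrete, here is a minimal sketch using plain dataclasses. It only mirrors the names and relationships described in the list; it is not KinoML's actual implementation, and the field names (`components`, `measurement`, the `pH` default) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class MolecularComponent:
    """Base object for anything that can be part of a system."""
    name: str


@dataclass
class BaseProtein(MolecularComponent):
    """Abstract protein; concrete subclasses would pick a representation."""


@dataclass
class BaseLigand(MolecularComponent):
    """Abstract ligand; concrete subclasses would pick a representation."""


@dataclass
class AssayConditions:
    """Properties needed to reproduce the experiment (illustrative field only)."""
    pH: float = 7.0


@dataclass
class BaseMeasurement:
    values: Union[float, List[float]]
    conditions: AssayConditions = field(default_factory=AssayConditions)

    def __post_init__(self):
        # Single measurements are "arrayfied" into one-element lists.
        if isinstance(self.values, (int, float)):
            self.values = [float(self.values)]


@dataclass
class System:
    """Imposes no restrictions on its composition."""
    components: List[MolecularComponent]
    measurement: Optional[BaseMeasurement] = None


@dataclass
class Complex(System):
    """Requires at least one protein and at least one ligand."""

    def __post_init__(self):
        if not any(isinstance(c, BaseProtein) for c in self.components):
            raise ValueError("Complex needs at least one BaseProtein")
        if not any(isinstance(c, BaseLigand) for c in self.components):
            raise ValueError("Complex needs at least one BaseLigand")


# Usage: a protein-ligand complex with a single measured value.
complex_ = Complex(
    components=[BaseProtein(name="ABL1"), BaseLigand(name="CCO")],
    measurement=BaseMeasurement(values=14.0),
)
```

In this sketch a bare `System` accepts any list of components, while `Complex` rejects anything lacking a protein or a ligand, which is the kind of restriction the text attributes to subclasses.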

Anyway, all those details are not needed to start using the provider. Right now there's no lazy behavior, so it will take _a bit_ to build all sequences and ligands. On my machine, it's about 10 seconds for all ~260K datapoints.

In [5]:
%%time
provider = PKIS2DatasetProvider.from_source()

CPU times: user 10.1 s, sys: 230 ms, total: 10.3 s
Wall time: 10.3 s


You can export a convenient dataframe with the `to_dataframe` method. Note that this just uses the default implementation in the base class, which relies on the different `__repr__` methods and `.name` attributes of the objects involved. For prettier dataframes, one can always override `to_dataframe` in a subclass to provide a better presentation.

In [6]:
df = provider.to_dataframe()
df

| | Systems | n_components | PercentageDisplacementMeasurement |
|---|---|---|---|
| 0 | AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2... | 2 | 14.0 |
| 1 | ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc... | 2 | 28.0 |
| 2 | ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc... | 2 | 20.0 |
| 3 | ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2... | 2 | 5.0 |
| 4 | ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C... | 2 | 0.0 |
| ... | ... | ... | ... |
| 261865 | ZAP70 & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc12)... | 2 | 0.0 |
| 261866 | p38-alpha & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c... | 2 | 0.0 |
| 261867 | p38-beta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc... | 2 | 0.0 |
| 261868 | p38-delta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c... | 2 | 0.0 |


Notice how the string representations try to be a bit informative.

In [7]:
provider

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>

In [8]:
one_random_system = next(iter(provider.systems))
one_random_system

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=ICK>, <Ligand name=COc1cccc(c1)C1=C(Nc2ccc(Cl)c(c2)C(O)=O)C(=O)NC1=O>)>

Some areas still need improvement. For example, this is how you currently get all the entries that have wild-type kinases (which is all of them in PKIS2). Maybe some Django-style queries would help here?

In [9]:
wt = [ms for ms in provider.measurements if ms.system.protein._provenance["mutations"] is None]
wt_provider = PKIS2DatasetProvider(measurements=wt)
wt_provider

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>
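
As a rough idea of what a Django-style query could look like, the sketch below filters objects by nested attributes using double-underscore paths. Everything here is hypothetical: `filter_by` is not part of KinoML, and the stand-in objects expose a public `mutations` attribute instead of the `_provenance` dictionary used above.

```python
from types import SimpleNamespace


def filter_by(items, **lookups):
    """Keep the items whose nested attributes equal every requested value.

    Keys use Django-like double underscores to walk attributes,
    e.g. system__protein__name="ABL1" reads item.system.protein.name.
    """
    def resolve(obj, path):
        for part in path.split("__"):
            obj = getattr(obj, part)
        return obj

    return [
        item
        for item in items
        if all(resolve(item, path) == expected for path, expected in lookups.items())
    ]


def fake_measurement(name, mutations):
    """Build a stand-in measurement (hypothetical layout, for illustration only)."""
    protein = SimpleNamespace(name=name, mutations=mutations)
    return SimpleNamespace(system=SimpleNamespace(protein=protein))


measurements = [
    fake_measurement("ABL1", None),
    fake_measurement("ABL1", "T315I"),
    fake_measurement("AAK1", None),
]

# Wild-type entries only, expressed as a keyword lookup.
wild_type = filter_by(measurements, system__protein__mutations=None)
```

The appeal of this style is that the traversal logic lives in one place, so callers do not need to know how deeply an attribute is nested or spell out a list comprehension each time.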

# 2. Featurizing the data

We will be using:

- MorganFingerprint n=2048 bits, r=2
- OneHotEncoding of protein sequence
- ... or composition of binding site

An isolated featurizer takes one system and returns the raw data:

In [10]:
from kinoml.features.ligand import MorganFingerprintFeaturizer
featurizer = MorganFingerprintFeaturizer(nbits=2048, radius=2)
fp = featurizer.featurize(next(iter(provider.systems)))
print(fp.shape, *fp[:100], "...")

(2048,) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...


Within a dataset provider, each system stores that raw data in its own internal dictionary (`.featurizations`). The dataset contains a huge amount of duplication (the same ligand appears in many systems), so without caching this step would take ~30 minutes. Thanks to LRU caching at the `featurizer` level, each `Ligand` is only featurized once!

In [11]:
%%time
provider.featurize(featurizer)

Featurizing systems...: 100%|██████████| 257920/257920 [00:06<00:00, 40124.86it/s]

CPU times: user 6.49 s, sys: 30.8 ms, total: 6.52 s
Wall time: 6.48 s
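
The effect of caching on duplicated ligands can be illustrated with the standard library's `functools.lru_cache`. This is only a sketch of the idea: the toy `featurize_ligand` function, its fake "fingerprint", and the example SMILES strings are assumptions, not the way `MorganFingerprintFeaturizer` is actually implemented.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive step really runs


@lru_cache(maxsize=None)
def featurize_ligand(smiles: str) -> tuple:
    """Stand-in for an expensive featurization, cached by SMILES string."""
    global calls
    calls += 1
    # Toy "fingerprint": parity of each character code (not a Morgan fingerprint).
    return tuple(ord(ch) % 2 for ch in smiles)


# Three systems, but only two distinct ligands.
systems = [("ABL1", "CCO"), ("ABL2", "CCO"), ("AAK1", "c1ccccc1")]
features = [featurize_ligand(smiles) for _, smiles in systems]

print(calls)                          # number of real computations
print(featurize_ligand.cache_info())  # hits and misses recorded by the cache
```

Here the second system reuses the cached result of the first one, so the expensive function body runs once per distinct ligand rather than once per system.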





In [12]:
for system in provider.systems[:5]:
    print(system, "...\n  ", system.featurizations, "\n")

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=ICK>, <Ligand name=COc1cccc(c1)C1=C(Nc2ccc(Cl)c(c2)C(O)=O)C(=O)NC1=O>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=IGF1R>, <Ligand name=COc1cccc(c1)C1=C(Nc2ccc(Cl)c(c2)C(O)=O)C(=O)NC1=O>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=IKK-alpha>, <Ligand name=COc1cccc(c1)C1=C(Nc2ccc(Cl)c(c2)C(O)=O)C(=O)NC1=O>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=IKK-beta>, <Ligand name=COc1cccc(c1)C1=C(Nc2ccc(Cl)c(c2)C(O)=O)C(=O)NC1=O>)> ...
   {'MorganFingerp

# 3. Export to PyTorch

In [13]:
dataset = provider.to_pytorch()
dataset

<kinoml.datasets.torch_datasets.TorchDataset at 0x7f5d5fd04f10>

This PyTorch dataset implements the `Dataset` protocol and provides two attributes: `measurements` and (featurized) `systems`:

In [14]:
dataset.measurements

array([14., 28., 20., ...,  0.,  0., 34.])

In [15]:
print(dataset.systems[0].shape)
dataset.systems[:20]

(2048,)


[array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8),
 array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)]
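
For readers unfamiliar with the `Dataset` protocol, a map-style dataset essentially needs `__getitem__` and `__len__`. The class below is a dependency-free sketch of that idea; `PairedDataset` is a hypothetical name, and returning a `(features, target)` pair from `__getitem__` is an assumption about what a training loop would want, not a description of `TorchDataset`.

```python
class PairedDataset:
    """Minimal map-style dataset: supports len() and integer indexing."""

    def __init__(self, systems, measurements):
        if len(systems) != len(measurements):
            raise ValueError("systems and measurements must have equal length")
        self.systems = systems            # featurized systems, one entry per datapoint
        self.measurements = measurements  # matching target values

    def __len__(self):
        return len(self.measurements)

    def __getitem__(self, index):
        # Return one (features, target) pair.
        return self.systems[index], self.measurements[index]


# Usage with two tiny fake fingerprints and their measurements.
toy = PairedDataset(systems=[[0, 1, 0], [1, 0, 0]], measurements=[14.0, 28.0])
features, target = toy[1]
```

Because only these two methods are required, an object like this can be indexed and iterated by generic batching code without that code knowing how the features were produced.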

# Ensuring we have an adequate mapping
The measurement type shared by _all_ measurements in the provider exposes a `mapping` method that returns a dispatched callable, configurable per backend (default: `"pytorch"`).

In [16]:
pct_displacement_mapping = provider.measurement_type.mapping(backend="pytorch")
help(provider.measurement_type.mapping)

Help on method mapping in module kinoml.core.measurements:

mapping(backend='pytorch') method of builtins.type instance
    For the percent displacement measurements available from KinomeScan, we make the assumption (see JDC's notes) that
    
    $$
    D([I]) \approx \frac{1}{1 + \frac{K_d}{[I]}}
    $$
    
    For KinomeSCAN assays, all assays are usually performed at a single concentration, $ [I] \sim 1 \mu M $.
    
    We therefore define the following function:
    
    $$
    \mathbf{F}_{KinomeScan}(\Delta g, [I]) = \frac{1}{1 + \frac{exp[-\Delta g] * 1[M]}{[I]}}.
    $$
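
The function in the docstring can be written in a few lines of plain Python. This is a sketch under assumptions: `kinomescan_mapping` and its `inhibitor_conc` parameter are hypothetical names, and the default of 1 (in molar units) is a guess chosen because the formula then reduces to the logistic sigmoid, which matches values printed further down in this notebook (0 maps to 0.5, 3 to 0.9526, 5 to 0.9933).

```python
import math


def kinomescan_mapping(delta_g: float, inhibitor_conc: float = 1.0) -> float:
    """F(dg, [I]) = 1 / (1 + exp(-dg) * 1[M] / [I]), with [I] given in molar.

    The default concentration of 1.0 is an assumption, not KinoML's documented value.
    """
    return 1.0 / (1.0 + math.exp(-delta_g) / inhibitor_conc)


for dg in (0.0, 3.0, 5.0):
    print(dg, round(kinomescan_mapping(dg), 4))
# 0.0 0.5
# 3.0 0.9526
# 5.0 0.9933
```

With a smaller concentration, such as the $ [I] \sim 1 \mu M $ mentioned in the docstring, the same input would map to a much smaller value, so the choice of concentration matters when interpreting the output.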



The callable returned by `mapping` expects objects native to its backend (or NumPy arrays):

In [17]:
dataset.measurements

array([14., 28., 20., ...,  0.,  0., 34.])

In [18]:
mapped = pct_displacement_mapping(dataset.measurements)
print("% displ | mapped dG?")
print(*list(zip(dataset.measurements, mapped))[:200:4], sep="\n")

% displ | mapped dG?
(14.0, tensor(1.0000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(9.0, tensor(0.9999, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(3.0, tensor(0.9526, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(5.0, tensor(0.9933, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(26.0, tensor(1.0000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(0.0, tensor(0.5000, dtype=torch.float64))
(6.0, tensor(0.9975, dtype=torc

# 4. Building and training the model

- `DNNModel`

# 5. Analyze results