# Contents

In this notebook, we will learn how to load the PKIS2 dataset into memory using a `DatasetProvider` object. Then we will apply some standard featurization to obtain ML-compatible representations, and use a simple model to compute some activity predictions.

1. [X] Loading the data
2. [X] Featurizing the data
3. [X] Exporting the featurized data to PyTorch
4. [X] Build and train the model
5. [ ] Analyze results nicely

Disable stereochemistry _warnings_ generated by `openforcefield`.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import warnings
warnings.simplefilter("ignore") 
import logging
logging.basicConfig(level=logging.ERROR)
import numpy as np

# 1. Loading the data as a DatasetProvider

In [4]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider



Let's initialize the PKIS2 dataset provider. Instead of a regular `ClassName()` instantiation, we need to use `.from_source()` (with convenient default arguments).

__Why?__

This due to the design of `BaseDatasetProvider.__init__`, which expects a list of `BaseMeasurement` objects (or subclasses of). 

* A `BaseMeasurement` class is normally subclassed by more specific measurements (like `BaseMeasurement`), but the design is the same. It takes:
  * `values`: an array of numeric values (single measurements are _arrayfied_ into single-element arrays). This can be replicates of the value for statistical purposes, or under different concentrations of a reactant?
  * `conditions`: instance of `AssayConditions`. This class provides all the properties required to reproduce the experiment (say `pH`, `temperature`, `concentration`, etc). This should be paired (somehow) to the dimensionality of `values`, but I haven't though much of that yet.
  * `system`: instance of a `System` class or subclass. The subclasses can restrict which type of `MolecularComponent` objects are allowed (e.g. `ProteinLigandComplex` only takes a `Protein` and a `Ligand`).
* A `System` is abstract enough to not impose any restrictions on the composition, but its subclasses can be. 
  * This is the case of the `Complex` object, which requires at least one `BaseProtein` and one `BaseLigand` objects.
* The `MolecularComponent` class is the base object all proteins and ligands, regardless their representation (e.g. sequence vs 3D structure, smiles vs molecular graph). `MolecularComponent` is immediately subclassed by:
  * `BaseProtein`, the abstract model which is subclassed by more concrete classes, like `AminoAcidSequence` and `ProteinStructure`.
  * `BaseLigand`, the abstract model which is subclassed by more concrete classes, like `Ligand` (based on `openforcefield.topology.Molecule`).


Anyway, all those details are not needed to start using the provider. Right now there's no lazy behavior, so it will take _a bit_ to build all sequences and ligands. In my machine, it's about 12 seconds for all 160K datapoints.

In [5]:
%%time
pkis2 = PKIS2DatasetProvider.from_source()

CPU times: user 17.5 s, sys: 341 ms, total: 17.8 s
Wall time: 17.8 s


You can export a convenient dataframe with this method. Take into account this is just using the default implementation in the base class, which relies on the different `__repr__` methods and `.name` attributes of the objects involved. For prettier dataframes, one can always subclass `to_dataframe` to provide a better presentation.

In [6]:
df = pkis2.to_dataframe()
df

Unnamed: 0,Systems,n_components,PercentageDisplacementMeasurement
0,AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,14.0
1,ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...,2,28.0
2,ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...,2,20.0
3,ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,5.0
4,ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C...,2,0.0
...,...,...,...
261865,ZAP70 & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc12)...,2,0.0
261866,p38-alpha & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...,2,0.0
261867,p38-beta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)cc...,2,0.0
261868,p38-delta & CCn1c(nc2c(nc(OC[C@H](N)c3ccccc3)c...,2,0.0


Notice how the string representations try to be a bit informative.

In [7]:
pkis2

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>

In [8]:
one_random_system = next(iter(pkis2.systems))
one_random_system

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=CDK3>, <Ligand name=CC(=O)NC1=NC2=CN=C(NC3=C(C)C=CC(NC(=O)C4=CC=CC(=C4Cl)C4(CC4)C#N)=C3)N=C2S1>)>

Some areas do need improvement... this is how you get all the entries that have wild-type kinases (all of them in PKIS2). Maybe some Django-style queries?

In [9]:
wt = [ms for ms in pkis2.measurements if ms.system.protein.metadata["mutations"] is None]
wt_provider = PKIS2DatasetProvider(measurements=wt)
wt_provider

<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 257920 systems>

#  2. Featurizing the data

We will be using:

- MorganFingerprint n=2048 bits, r=2
- OneHotEncoding of protein sequence
- ... or composition of binding site

An isolated featurizer takes one system and returns the raw data:

In [10]:
from kinoml.features.ligand import MorganFingerprintFeaturizer
featurizer = MorganFingerprintFeaturizer(nbits=1024, radius=2)
fp = featurizer.featurize(next(iter(pkis2.systems)))
print(fp.shape, *fp[:100], "...")

(1024,) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...


In the context of a dataset provider, each system will store that raw data in an internal dictionary (`.featurizations`) for _each_ system. Without caching, this would take ~30 minutes, given the huge amount of duplication in the dataset. Thanks to LRU caching at the `featurizer` level, each `Ligand` is only featurized once!

In [11]:
%%time
pkis2.featurize(featurizer)

Featurizing systems...: 100%|██████████| 257920/257920 [00:08<00:00, 30189.91it/s]

CPU times: user 8.59 s, sys: 50.1 ms, total: 8.64 s
Wall time: 8.6 s





In [12]:
for system in pkis2.systems[:5]:
    print(system, "...\n  ", system.featurizations, "\n")

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=CDK3>, <Ligand name=CC(=O)NC1=NC2=CN=C(NC3=C(C)C=CC(NC(=O)C4=CC=CC(=C4Cl)C4(CC4)C#N)=C3)N=C2S1>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 1, 1, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 1, 1, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=GAK>, <Ligand name=NC(=O)Nc1sc(cc1C(N)=O)-c1ccc(F)cc1>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=FER>, <Ligand name=CCOc1nccc(n1)-c1c(ncn1C1CCNCC1)-c1ccc(F)cc1>)> ...
   {'MorganFingerprintFeaturizer': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), 'last': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)} 

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=CDK4>, <Ligand name=CC(=O)NC1=NC2=CN=C(NC3=C(C)C=CC(NC(=O)C4=CC=CC(=C4Cl)C4(CC4)C#N)=C3)N=C2S1>)> ...


# 3. Export to PyTorch

In [13]:
dataset = pkis2.to_pytorch()
dataset

<kinoml.datasets.torch_datasets.TorchDataset at 0x7f88433ec490>

This pytorch dataset implements the `Dataset` protocol and provides two attributes: `measurements` and (featurized) `systems`:

In [14]:
dataset.measurements

tensor([14., 28., 20.,  ...,  0.,  0., 34.], dtype=torch.float64)

In [15]:
print(dataset.systems[0].shape)
dataset.systems[:20]

torch.Size([1024])


tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

**TODO**: Look into specifying datatypes per featurizer to use memory more efficiently.

# 4. Ensuring we have an adequate observation model
The underlying measurement type common to _all_ measurements contains an `observation_model` method that returns a dispatched callable, configurable per backend (default=pytorch).

In [16]:
pct_displacement_model = pkis2.measurement_type.observation_model(backend="pytorch")
pct_displacement_model??

[0;31mSignature:[0m [0mpct_displacement_model[0m[0;34m([0m[0mvalues[0m[0;34m,[0m [0minhibitor_conc[0m[0;34m=[0m[0;36m1e-06[0m[0;34m,[0m [0;34m**[0m[0mkwargs[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m <no docstring>
[0;31mSource:[0m   
    [0;34m@[0m[0mstaticmethod[0m[0;34m[0m
[0;34m[0m    [0;32mdef[0m [0m_observation_model_pytorch[0m[0;34m([0m[0mvalues[0m[0;34m,[0m [0minhibitor_conc[0m[0;34m=[0m[0;36m1e-6[0m[0;34m,[0m [0;34m**[0m[0mkwargs[0m[0;34m)[0m[0;34m:[0m[0;34m[0m
[0;34m[0m        [0;32mimport[0m [0mtorch[0m[0;34m[0m
[0;34m[0m[0;34m[0m
[0;34m[0m        [0;31m# values = torch.from_numpy(values)[0m[0;34m[0m
[0;34m[0m        [0;32mreturn[0m [0;36m100[0m [0;34m/[0m [0;34m([0m[0;36m1[0m [0;34m+[0m [0mtorch[0m[0;34m.[0m[0mexp[0m[0;34m([0m[0mvalues[0m[0;34m)[0m [0;34m/[0m [0minhibitor_conc[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mFile:[0m      ~/devel/p

The `observation_model` function expects native objects to their dataset (or numpy arrays):

## For multiple dataset providers

I am still not very convinced about the need for this. Looks like a wrapper around several providers, but in the end we will use the exported datasets independently in the learning loop so...?

In [17]:
from kinoml.datasets.chembl import ChEMBLDatasetProvider
chembl = ChEMBLDatasetProvider.from_source(measurement_types=("IC50",))

In [18]:
chembl

Unnamed: 0,activities.activity_id,target_dictionary.chembl_id,activities.standard_type,activities.standard_value,activities.standard_units,compound_structures.canonical_smiles,component_sequences.sequence,assays.confidence_score,docs.doc_id,docs.year,UniprotID
0,32260,CHEMBL203,IC50,41.0,nM,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,8,4959,2002.0,P00533
1,32262,CHEMBL279,IC50,16500.0,nM,Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)...,MQSKVLLAVALWLCVETRAASVGLPSVSLDLPRLSIQKDILTIKAN...,8,4959,2002.0,P35968
2,32267,CHEMBL203,IC50,170.0,nM,Cc1cc(C(=O)N2CCOCC2)[nH]c1/C=C1\C(=O)Nc2ncnc(N...,MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...,8,4959,2002.0,P00533
3,32330,CHEMBL258,IC50,140.0,nM,Nc1ncnc2c1c(-c1cccc(Oc3ccccc3)c1)cn2C1CCCC1,MGCGCSSHPEDDWMENIDVCENCHYPIVPLDGKGTLLIRNGSEVRD...,9,3891,2000.0,P06239
4,32331,CHEMBL258,IC50,1180.0,nM,Nc1ncnc2c1c(-c1cccc(Oc3ccccc3)c1)cn2C1CCCC1,MGCGCSSHPEDDWMENIDVCENCHYPIVPLDGKGTLLIRNGSEVRD...,9,3891,2000.0,P06239
...,...,...,...,...,...,...,...,...,...,...,...
203939,18813164,CHEMBL2842,Ki,42.0,nM,Nc1cc(C(F)F)c(-c2nc(N3CCOCC3)cc(N3CCOCC3)n2)cn1,MLGTGPAAATTAATTSSNVSVLQQFASGLKSRNEETRAKAAKELQH...,9,110134,2018.0,P42345
203940,18813165,CHEMBL2842,Ki,30.0,nM,Nc1cc(C(F)(F)F)c(-c2cc(N3C4CCC3COC4)nc(N3C4CCC...,MLGTGPAAATTAATTSSNVSVLQQFASGLKSRNEETRAKAAKELQH...,9,110134,2018.0,P42345
203941,18813166,CHEMBL2842,Ki,12.0,nM,Nc1cc(C(F)F)c(-c2cc(N3C4CCC3COC4)nc(N3C4CCC3CO...,MLGTGPAAATTAATTSSNVSVLQQFASGLKSRNEETRAKAAKELQH...,9,110134,2018.0,P42345
203942,18813167,CHEMBL2842,Ki,435.0,nM,Nc1cc(C(F)(F)F)c(-c2nc(N3C4CCC3COC4)cc(N3C4CCC...,MLGTGPAAATTAATTSSNVSVLQQFASGLKSRNEETRAKAAKELQH...,9,110134,2018.0,P42345


In [56]:
from kinoml.datasets.core import MultiDatasetProvider
multi = MultiDatasetProvider([chembl, pkis2])
multi

AttributeError: 'DataFrame' object has no attribute 'measurements'

In [None]:
multi.observation_models()

In [None]:
multi.to_dataframe()

In [None]:
multi.measurements_as_array()

In [None]:
len(multi.systems)

In [None]:
multi.to_pytorch()

# 4. Building and training the model

- `DNNModel`

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNetworkRegression(nn.Module):
    """
    Builds a Pytorch model (a Dense Neural Network) and a feed-forward pass
    """
    def __init__(self, input_size=1024, hidden_size=100, output_size=1):
        super(NeuralNetworkRegression, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.fully_connected_1 = nn.Linear(self.input_size, self.hidden_size) # Fully connected layer 
        self.fully_connected_out = nn.Linear(self.hidden_size, self.output_size) # Output

    def forward(self, x):
        """
        Defines the foward pass for a given input 'x'
        """
        x = F.relu(self.fully_connected_1(x)) # Activations are ReLU
        return self.fully_connected_out(x)


## 4.1 Optimization loop

In [33]:
# Release some memory
del pkis2

In [34]:
model_inputs = dataset.systems[:20000].type(torch.FloatTensor)
targeted_measurements = dataset.measurements[:20000].type(torch.FloatTensor)
print(model_inputs.shape)
print(targeted_measurements.shape)

torch.Size([20000, 1024])
torch.Size([20000])


In [35]:
full_loss = []
nb_epoch = 100

model = NeuralNetworkRegression(input_size=1024)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.MSELoss() # Mean squared error

# TODO: Consider stochastic mini-batches, adlala maybe
for epoch in range(nb_epoch):

    # Clear gradients
    optimizer.zero_grad()

    # Obtain model prediction given model input
    delta_g = model(model_inputs)
    
    # with observation model
    prediction = pct_displacement_model(delta_g)
    loss = loss_function(prediction, targeted_measurements)

    # Obtain loss for the predicted output
    full_loss.append(loss)

    # Gradients w.r.t to parameters
    loss.backward()

    # Optimizer
    optimizer.step()
    
    if epoch % 10 == 0:
        print(f'epoch {epoch} : loss {loss}')
print("Done!")

epoch 0 : loss 800.419921875
epoch 10 : loss 800.4111938476562
epoch 20 : loss 800.321044921875
epoch 30 : loss 798.2498779296875
epoch 40 : loss 716.9710083007812
epoch 50 : loss 662.4512939453125
epoch 60 : loss 637.2263793945312
epoch 70 : loss 625.5025024414062
epoch 80 : loss 620.917236328125
epoch 90 : loss 617.6267700195312
Done!


# 5. Analyze results