Disable stereochemistry _warnings_ generated by `openforcefield`.

In [1]:
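import warnings
# Silence all Python warnings for this session
warnings.simplefilter("ignore")
import logging
# Only let log records at ERROR level or above through
logging.basicConfig(level=logging.ERROR)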

# Loading the data as a DatasetProvider

In [2]:
from kinoml.datasets.kinomescan.pkis2 import PKIS2DatasetProvider



Let's initialize the PKIS2 dataset provider. Instead of a regular `ClassName()` instantiation, we need to use `.from_source()` (with convenient default arguments).

__Why?__

This is due to the design of `BaseDatasetProvider.__init__`, which expects a list of `System` objects (or subclasses thereof). A `System` object is a set of one or more `MolecularComponent` objects and a `BaseMeasurement`.

* The `MolecularComponent` class is the base object for all proteins and ligands, regardless of their representation (e.g. sequence vs 3D structure, SMILES vs molecular graph). `MolecularComponent` is immediately subclassed by:
  * `BaseProtein`, the abstract model which is subclassed by more concrete classes, like `AminoAcidSequence` and `ProteinStructure`.
  * `BaseLigand`, the abstract model which is subclassed by more concrete classes, like `Ligand` (based on `openforcefield.topology.Molecule`).
* A `System` is abstract enough not to impose any restrictions on its composition, but its subclasses can.
  * This is the case for the `Complex` object, which requires at least one `BaseProtein` and one `BaseLigand` object.
* A `BaseMeasurement` class is normally subclassed by more specific measurements (like `PercentageDisplacementMeasurement`), but the design is the same. It takes:
  * `values`: an array of numeric values (single measurements are _arrayfied_ into single-element arrays). These can be replicates of the same value for statistical purposes, or possibly values obtained under different concentrations of a reactant (this is still an open question).
  * `conditions`: an instance of `AssayConditions`. This class provides all the properties required to reproduce the experiment (say `pH`, `temperature`, `concentration`, etc.). This should be paired (somehow) with the dimensionality of `values`, but I haven't thought much about that yet.
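
The relationships described above can be sketched with a few toy classes. This is only an illustration of the design, not the `kinoml` implementation: the class bodies, the `ToyDatasetProvider` name and the hard-coded rows in `from_source` are all assumptions made up for this example.

```python
class MolecularComponent:
    """Toy base class for anything that can be part of a system."""
    def __init__(self, name):
        self.name = name


class BaseProtein(MolecularComponent):
    pass


class BaseLigand(MolecularComponent):
    pass


class AssayConditions:
    """Toy conditions object; only pH is modelled here."""
    def __init__(self, pH=7.0):
        self.pH = pH


class BaseMeasurement:
    def __init__(self, values, conditions=None):
        # Single measurements are "arrayfied" into single-element lists
        if isinstance(values, (int, float)):
            values = [values]
        self.values = [float(v) for v in values]
        self.conditions = conditions if conditions is not None else AssayConditions()


class System:
    """No restrictions on composition at this level."""
    def __init__(self, components, measurement):
        self.components = list(components)
        self.measurement = measurement


class Complex(System):
    """Subclass that does restrict composition: one protein and one ligand, at least."""
    def __init__(self, components, measurement):
        super().__init__(components, measurement)
        has_protein = any(isinstance(c, BaseProtein) for c in self.components)
        has_ligand = any(isinstance(c, BaseLigand) for c in self.components)
        if not (has_protein and has_ligand):
            raise ValueError("A Complex needs at least one protein and one ligand")


class ToyDatasetProvider:
    def __init__(self, systems):
        # __init__ expects ready-made System objects...
        self.systems = list(systems)

    @classmethod
    def from_source(cls, rows=(("AAK1", "CCO", 14.0), ("ABL2", "CCO", 5.0))):
        # ...so the alternative constructor builds them from raw rows
        # (hypothetical default data: kinase name, SMILES, measured value)
        systems = [
            Complex([BaseProtein(kinase), BaseLigand(smiles)], BaseMeasurement(value))
            for kinase, smiles, value in rows
        ]
        return cls(systems)


provider = ToyDatasetProvider.from_source()
print(len(provider.systems), provider.systems[0].measurement.values)  # 2 [14.0]
```

Keeping `__init__` limited to already-built systems means the parsing logic lives entirely in the `from_source` classmethod, which is why a plain `ClassName()` call is not enough.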

Anyway, none of those details are needed to start using the provider. Right now there's no lazy behavior, so it will take _a bit_ to build all sequences and ligands. On my machine, it takes about 80 seconds for all 160K datapoints.

In [7]:
%%time
provider = PKIS2DatasetProvider.from_source()

CPU times: user 87.5 ms, sys: 10.8 ms, total: 98.2 ms
Wall time: 96 ms


You can export a convenient dataframe with this method. Note that this just uses the default implementation in the base class, which relies on the different `__repr__` methods and `.name` attributes of the objects involved. For prettier dataframes, one can always override `to_dataframe` in a subclass to provide a better presentation.

In [4]:
provider.to_dataframe()

,ProteinLigandComplex,n_components,Avg PercentageDisplacementMeasurement
0,AAK1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,14.0
1,ABL1-nonphosphorylated & Clc1cccc(Cn2c(nn3c2nc...,2,28.0
2,ABL1-phosphorylated & Clc1cccc(Cn2c(nn3c2nc(cc...,2,20.0
3,ABL2 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2...,2,5.0
4,ACVR1 & Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C...,2,0.0
...,...,...,...
95,ACVR1B & COc1cc2ncn(-c3cc(OCc4ccc(cc4)S(C)(=O)...,2,0.0
96,ACVR2A & COc1cc2ncn(-c3cc(OCc4ccc(cc4)S(C)(=O)...,2,0.0
97,ACVR2B & COc1cc2ncn(-c3cc(OCc4ccc(cc4)S(C)(=O)...,2,0.0
98,ACVRL1 & COc1cc2ncn(-c3cc(OCc4ccc(cc4)S(C)(=O)...,2,4.0
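
The idea of a default export that leans on `.name` attributes, plus a subclass that overrides it for a nicer layout, can be sketched as follows. Everything here is hypothetical: `to_rows` stands in for `to_dataframe` so that the example needs neither `kinoml` nor pandas, and the toy classes are not the library's.

```python
class Named:
    """Toy component exposing only a .name attribute."""
    def __init__(self, name):
        self.name = name


class ToySystem:
    def __init__(self, components, value):
        self.components = components
        self.value = value


class ToyProvider:
    def __init__(self, systems):
        self.systems = systems

    def to_rows(self):
        # Default presentation: join component names into a single column
        return [
            {
                "system": " & ".join(c.name for c in s.components),
                "n_components": len(s.components),
                "value": s.value,
            }
            for s in self.systems
        ]


class PrettyProvider(ToyProvider):
    def to_rows(self):
        # Overridden presentation: one column per component
        # (assumes exactly one protein followed by one ligand)
        return [
            {
                "protein": s.components[0].name,
                "ligand": s.components[1].name,
                "value": s.value,
            }
            for s in self.systems
        ]


systems = [ToySystem([Named("AAK1"), Named("CCO")], 14.0)]
default_rows = ToyProvider(systems).to_rows()
pretty_rows = PrettyProvider(systems).to_rows()
print(default_rows[0]["system"])  # AAK1 & CCO
print(pretty_rows[0]["protein"], pretty_rows[0]["ligand"])  # AAK1 CCO
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(...)` to obtain a table.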


Notice how the string representations try to be a bit informative.

In [8]:
provider

<PKIS2DatasetProvider with 100 systems>

In [9]:
provider.systems[0]

<ProteinLigandComplex with 2 components (<AminoAcidSequence name=AAK1>, <Ligand name=Clc1cccc(Cn2c(nn3c2nc(cc3=O)N2CCOCC2)C2CC2)c1Cl>) and <PercentageDisplacementMeasurement values=14.0 conditions=<AssayConditions pH=7.0>>>

# Featurizing the data

_WIP_