# Featurize a dataset

Any machine learning model will expect tensorial representations of the chemical data. This notebook provides a workflow to achieve that goal.

`kinoml.dataset.DatasetProvider` objects need to be available to deal with your collection of raw measurements for protein:ligand systems. These objects are, roughly, a list of `kinoml.core.BaseMeasurement` objects, each containing a set of `.values` and some extra metadata, like the `system` objects to be featurized here.

In ligand-based models, protein information is only considered marginally, and most of the action happens at the ligand level. Ligands usually start as a string representation such as SMILES, or a database identifier such as a PubChem ID. They are then promoted to (usually) RDKit objects and transformed into a tensor of some form (e.g. fingerprints, a molecular graph as an adjacency matrix, etc.).
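
As a rough illustration of the "molecular graph as an adjacency matrix" idea, the sketch below builds the matrix for a hand-written ethanol graph using only the standard library. The atom and bond lists are assumptions made for this example; in the real workflow a toolkit such as RDKit would derive them from the SMILES string, and the KinoML featurizers may represent the result differently.

```python
# Hypothetical hand-written heavy-atom graph for ethanol (SMILES "CCO").
# A cheminformatics toolkit would normally derive this from the SMILES string.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices


def adjacency_matrix(n_atoms, bonds):
    """Return a symmetric n_atoms x n_atoms 0/1 matrix as nested lists."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


adjacency = adjacency_matrix(len(atoms), bonds)
for row in adjacency:
    print(row)
```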

Available featurizers can be found under `kinoml.features`.

## How to use

Run `python run_notebook.py --help` for more information.

In [None]:
# If this is the template file (and not a copy) and you are introducing changes,
# update VERSION with the current date (YYYY.MM.DD)
VERSION = "2021.05.18" 

## ✏ Define hyperparameters

In [None]:
# TEMPLATE VALUES -- these are overridden (see below if executed) by papermill using a YAML or Python file as input
DATASET_CLS = "import.path.to.DatasetProvider"
DATASET_KWARGS = {"option": "value", "option2": "value2"}

FEATURIZE_KWARGS = {}

PIPELINES = {
    "someuniquekey": [
        ("import.path.to.SomeFeaturizer", {"option": "value", "option2": "value2"}),
        ("import.path.to.SomeOtherFeaturizer", {"option": "value", "option2": "value2"}),
    ]
}
PIPELINES_AGG = "kinoml.features.core.Concatenated"
PIPELINES_AGG_KWARGS = {}

GROUPS = [
    ("kinoml.datasets.groups.CallableGrouper", {"function": "lambda something: something.attribute"}),
    ("kinoml.datasets.groups.CallableGrouper", {"function": "lambda otherthing: otherthing.attribute2"})
]

TRAIN_TEST_VAL_KWARGS = {"idx_train": 0.8, "idx_test": 0.1, "idx_val": 0.1}

## IGNORE THIS ONE
HERE = _dh[-1]

⚠ From here on, you should _not_ need to modify anything else 🤞

---

Define key paths for data and outputs:

In [None]:
from pathlib import Path

HERE = Path(HERE)
REPO = None  # stays None if no parent directory contains `.github/`
for parent in HERE.parents:
    if next(parent.glob(".github/"), None):
        REPO = parent
        break

# Generate paths for this pipeline
featurizer_path = []
for name, branch in PIPELINES.items():
    featurizer_path.append(name)
    for clsname, kwargs in branch:
        clsname = clsname.rsplit(".", 1)[1]
        kwargs = [f"{k}={''.join(c for c in str(v) if c.isalnum())}" for k,v in kwargs.items()]
        featurizer_path.append("_".join([clsname] + kwargs))

OUT = HERE / "_output"  / "__".join(featurizer_path) / DATASET_CLS.rsplit('.', 1)[1]
OUT.mkdir(parents=True, exist_ok=True)

print(f"This notebook:           HERE = {HERE}")
print(f"This repo:               REPO = {REPO}")
print(f"Outputs in:               OUT = {OUT}")

In [None]:
# Nasty trick: save all-caps local variables (CONSTANTS working as hyperparameters) so far in a dict to save it later
_hparams = {key: value for key, value in locals().items() if key.upper() == key and not key.startswith(("_", "OE_"))}

## Setup is finished, start working

In [None]:
from warnings import warn
import os
import sys
from pathlib import Path
from datetime import datetime

import numpy as np
import awkward as ak

from kinoml.utils import seed_everything, import_object
seed_everything();
print("Run started at", datetime.now())

## Load raw data

> This `import_object` function allows us to take a `str` containing a Python import path (e.g. `kinoml.datasets.chembl.ChEMBLDatasetProvider`) and obtain the imported object directly. That's how we can encode classes in JSON-only `papermill` inputs.
>
> See the help message `import_object?` for more info.
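
To make the idea concrete, here is a minimal sketch of how an import-path string can be resolved with the standard library. The name `import_object_sketch` is hypothetical and this is not necessarily how `kinoml.utils.import_object` is implemented; it only shows the general technique.

```python
from importlib import import_module


def import_object_sketch(import_path):
    """Resolve 'package.module.Name' to the object called Name."""
    module_path, _, name = import_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted import path, got {import_path!r}")
    return getattr(import_module(module_path), name)


# Standard-library target, so the example runs without KinoML installed
ordered_dict_cls = import_object_sketch("collections.OrderedDict")
print(ordered_dict_cls)
```

Because the class is referenced by a plain string, it can travel through JSON or YAML parameter files and only be turned into a Python object at execution time.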

In [None]:
dataset = import_object(DATASET_CLS).from_source(**DATASET_KWARGS)
dataset

In [None]:
df = dataset.to_dataframe()
df

## Featurize

In [None]:
# build pipeline
from kinoml.features.core import Pipeline

pipelines = []
for key, pipeline_instructions in PIPELINES.items():
    print(f"Building featurizer `{key}` with instructions:")
    featurizers = []
    for featurizer_import_str, kwargs in pipeline_instructions:
        kwargs = kwargs or {}  # make sure empty values (None, "") turn into {} so we can do **kwargs below
        print(f"  Instantiating `{featurizer_import_str}` with options `{kwargs}`")
        featurizers.append(import_object(featurizer_import_str)(**kwargs))
    pipelines.append(Pipeline(featurizers))
print("Resulting pipelines:", *pipelines)
aggregated_pipeline = import_object(PIPELINES_AGG)(pipelines, **PIPELINES_AGG_KWARGS)
print("Aggregated pipelines:", aggregated_pipeline)

In [None]:
# prefeaturize everything
aggregated_pipeline.featurize(dataset.systems, **FEATURIZE_KWARGS);

## Filter

Remove systems that couldn't be featurized. Successful featurizations are stored in `measurement.system.featurizations['last']`, so we test for the existence of that key.
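
The key-existence test can be sketched with plain dictionaries. Everything below is a toy stand-in: the dictionaries mimic `measurement.system.featurizations`, and the empty dictionary used for the failed system is an assumption, not what KinoML necessarily stores on failure.

```python
from collections import defaultdict

# Hypothetical stand-ins for `measurement.system.featurizations`
featurizations_per_system = [
    {"last": [0.1, 0.2]},
    {},  # assumed shape of a failed featurization: no 'last' key
    {"last": [0.3, 0.4]},
]


def validity_label(featurizations):
    """Label a system by whether a final featurization was stored."""
    return "valid" if "last" in featurizations else "invalid"


toy_groups = defaultdict(list)
for index, featurizations in enumerate(featurizations_per_system):
    toy_groups[validity_label(featurizations)].append(index)

print(dict(toy_groups))
```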

In [None]:
from kinoml.datasets.groups import CallableGrouper, RandomGrouper
grouper = CallableGrouper(lambda measurement: 'invalid' if 'last' not in measurement.system.featurizations else 'valid')
grouper.assign(dataset, overwrite=True, progress=False)
groups = dataset.split_by_groups()
if "invalid" in groups:
    _invalid = groups.pop("invalid")
    warn(f"{len(_invalid)} entries could not be featurized! Possible errors:")
    warn(f"{_invalid[0].system.featurizations}")

## Groups

Cumulatively apply groups.
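
"Cumulatively" means that each grouper subdivides the groups produced by the previous one, so group keys grow into tuples. The following sketch reproduces that logic on plain dictionaries. The records, the `kinase` and `assay` attributes, and the two grouping functions are all hypothetical.

```python
# Hypothetical records standing in for measurements
records = [
    {"kinase": "ABL1", "assay": "IC50"},
    {"kinase": "ABL1", "assay": "Kd"},
    {"kinase": "EGFR", "assay": "IC50"},
    {"kinase": "ABL1", "assay": "IC50"},
]
# Hypothetical grouping functions, applied one after the other
groupers = [lambda r: r["kinase"], lambda r: r["assay"]]

nested = {("valid",): records}
for grouper in groupers:
    # Iterate over a snapshot of the keys because the dict changes in the loop
    for key in list(nested.keys()):
        members = nested.pop(key)
        for member in members:
            nested.setdefault(key + (grouper(member),), []).append(member)

sizes = {key: len(members) for key, members in nested.items()}
for key, size in sizes.items():
    print(key, size)
```

After two groupers, every key is a 3-tuple: the initial `"valid"` label followed by one label per grouper.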

In [None]:
groups[("valid",)] = groups.pop("valid")
if GROUPS:
    for grouper_str, grouper_kwargs in GROUPS:
        grouper_cls = import_object(grouper_str)
        ## We need this because lambda functions are not JSON-serializable
        if issubclass(grouper_cls, CallableGrouper):
            for k, v in list(grouper_kwargs.items()):
                if k == "function" and isinstance(v, str):
                    grouper_kwargs[k] = eval(v)  # sorry :)
        ## End of lambda hack
        grouper = grouper_cls(**grouper_kwargs)        
        for group_key in list(groups.keys()):
            grouper.assign(groups[group_key], overwrite=True, progress=False)
            for subkey, subgroup in groups.pop(group_key).split_by_groups().items():
                groups[group_key + (subkey,)] = subgroup
print("First 10 group keys:", *list(groups.keys())[:10], sep="\n")

## Write tensors to disk

Output files are written to `_output/<PIPELINE>/<DATASET>/<GROUP>.parquet` files.
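
The sketch below shows how such a path could be assembled from the hyperparameters, following the naming logic used earlier in this notebook. The featurizer path, its options, the dataset path and the group key are placeholders chosen for illustration only.

```python
from pathlib import Path

# Hypothetical hyperparameters; the import paths are placeholders
pipelines = {
    "ligand": [
        ("import.path.to.SomeFeaturizer", {"radius": 2, "nbits": 512}),
    ]
}
dataset_cls = "import.path.to.DatasetProvider"
group = ("valid", "ABL1")  # hypothetical group key

parts = []
for name, branch in pipelines.items():
    parts.append(name)
    for cls_path, kwargs in branch:
        cls_name = cls_path.rsplit(".", 1)[1]
        # Keep only alphanumeric characters of each option value
        options = [
            f"{k}={''.join(c for c in str(v) if c.isalnum())}"
            for k, v in kwargs.items()
        ]
        parts.append("_".join([cls_name] + options))

filename = "__".join(g for g in group if g != "valid") + ".parquet"
output_file = (
    Path("_output") / "__".join(parts) / dataset_cls.rsplit(".", 1)[1] / filename
)
print(output_file.as_posix())
```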

Each `parquet` will contain at least two array-like objects. The dimensionality of the parquet files is built as `(systems, X_or_y, ...)`. For example, the first X vector for the first system is accessed like `parquet[0, "0"]`. Notice how the 2nd index is a string! (`awkward` design).

- `"0"` (X, featurized systems). See `DatasetProvider.to_awkward` for more info.
- `"1"` (y, associated measurements)

If `X` is composed of more than one array (e.g. connectivity matrix + node features), these are flattened to `"0"`, `"1"`, `"2"`, and so on. `y` is ALWAYS the last one in that list (accessible via `data.fields`).
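
The field numbering can be mimicked with plain Python dictionaries. This is only a sketch of the naming convention, not of `awkward` itself: `zip_fields` is a hypothetical helper and the data are made up.

```python
def zip_fields(X, y):
    """One record per system; X arrays come first, y is the last field."""
    columns = [*X, y]
    return [
        {str(field): column[i] for field, column in enumerate(columns)}
        for i in range(len(y))
    ]


# Hypothetical data: two systems, X made of two arrays per system
connectivity = [[[0, 1], [1, 0]], [[0]]]
node_features = [[[6.0], [8.0]], [[7.0]]]
y = [7.2, 5.1]

toy_records = zip_fields([connectivity, node_features], y)
fields = list(toy_records[0])
print(fields)                      # field names are strings
print(toy_records[0]["0"])         # first X array of the first system
print(toy_records[0][fields[-1]])  # y is always the last field
```

With two `X` arrays, `y` ends up under `"2"` rather than `"1"`, which is why looking up the last entry of the field list is safer than hard-coding the name.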

In [None]:
random_grouper = RandomGrouper(TRAIN_TEST_VAL_KWARGS)

parquets = []
for group, ds in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
    indices = random_grouper.indices(ds)
    X, y = ds.to_awkward()
    parquet = ak.zip([*X, y], depth_limit=1)
    path = OUT / f"{'__'.join([g for g in group if g != 'valid'])}.parquet"
    parquets.append(path)
    ak.to_parquet(parquet, path)
    # TODO: Missing indices?

Preview generated Parquet files:

In [None]:
from kinoml.datasets.torch_datasets import AwkwardArrayDataset
awk = AwkwardArrayDataset.from_parquet(parquets[0])
awk

In [None]:
# X, y = awk[0]  # (multi-X) and y tensors for first system

In [None]:
# X

In [None]:
# y

In [None]:
print("Run finished at", datetime.now())

## Reproducibility logs

In [None]:
# Free some memory first
del awk, parquets, groups, dataset

In [None]:
from kinoml.utils import watermark
w = watermark()

In [None]:
%%capture cap --no-stderr
w = watermark()

In [None]:
import json

with open(OUT / "watermark.txt", "w") as f:
    f.write(cap.stdout)

with open(OUT / "hparams.json", "w") as f:
    json.dump(_hparams, f, default=str, indent=2)