In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys; sys.path.extend(["../src", ".."])
from sensai.util import logging

logging.configureLogging(level=logging.INFO)

# Custom Models and Feature Generators

In this notebook we will demonstrate some of sensAI's main features by training a model together
with feature extractors and custom normalization rules. This will also demonstrate how easy it is to wrap one's
own model declaration into a sensAI model.

In [None]:
import sensai
import pandas as pd
import numpy as np
import sensai as sn
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler
from sensai import VectorRegressionModel
from sensai.data_transformation import DFTNormalisation
from sensai.evaluation.eval_util import createVectorModelEvaluator
from sensai.data import InputOutputData
from sensai.tracking.clearml_tracking import ClearMLExperiment
import sensai.featuregen as fgen
import matplotlib.pyplot as plt
from config import get_config

cfg = get_config(reload=True)

First, let us load a dataset.

In [None]:
housing_data = cfg.datafile_path("boston_housing.csv", stage=cfg.RAW)
housing_df = pd.read_csv(housing_data)

housing_df.head()

In [None]:
X = housing_df.copy()
y = pd.DataFrame({"nox": X.pop("nox")})

In [None]:
print("We will use this as target")
y.head()

## Creating a Custom Model

Although sensAI provides implementations of many models across major frameworks (scikit-learn, TensorFlow,
PyTorch), we took special care to make it easy for you to bring your own model. The `VectorModel`-based
classes provide abstractions which can be used for most learning problems of the type "datapoint in,
row of predictions out". The row of predictions can contain a vector of class probabilities, one or multiple
regression targets and so on. For problems of the type "datapoint in, multidimensional tensor out", see the
tutorial in TBA.

We will use `VectorRegressionModel`, a `VectorModel` subclass for regression, to wrap scikit-learn's implementation of a multi-layer perceptron.

In [None]:
class CustomModel(VectorRegressionModel):
    def __init__(self):
        super().__init__()
        self.model = MLPRegressor()

    def _predict(self, x: pd.DataFrame) -> pd.DataFrame:
        values = self.model.predict(x)
        return pd.DataFrame({"nox": values}, index=x.index)

    def _fit(self, X: pd.DataFrame, Y: pd.DataFrame):
        self.model.fit(X, Y.values.ravel())

## Feature Generation and Normalization

Some of sensAI's core design principles include explicitness and safety of data wrangling. Special care was taken to
ensure that everything that happens to input data during feature extraction, preprocessing, training and inference was
intended by the user. Since for many projects feature engineering is decisive for model performance, it is absolutely
crucial that the developer has full control over all transformations that are going on during training and inference.


The feature generation and normalization modules help with this, allowing fine-grained control over each step in the
processing pipeline. Since the feature generators and the normalization data frame transformers can be bound to a sensAI
model, it is guaranteed that the data pipeline at inference will work exactly as intended.
If something unexpected happens at inference time, like an unknown column or a wrong order of columns, an error will be
raised. Errors will also be raised (unless specifically disabled) if there are columns for which no normalization rules
have been provided.
This ensures that the user has thought about how to deal with each feature and that no surprises happen.
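
To make the idea of strict checks concrete, here is a minimal sketch in plain Python. It is not sensAI's implementation: the function `check_columns` and its error messages are hypothetical and only illustrate the kind of validation described above (unknown columns, missing columns, wrong order).

```python
from typing import Sequence


def check_columns(expected: Sequence[str], actual: Sequence[str]) -> None:
    """Raise ValueError unless `actual` matches `expected` in names and order.

    Hypothetical helper for illustration; not part of sensAI.
    """
    unknown = [c for c in actual if c not in expected]
    missing = [c for c in expected if c not in actual]
    if unknown or missing:
        raise ValueError(f"unknown columns: {unknown}, missing columns: {missing}")
    if list(actual) != list(expected):
        raise ValueError(
            f"wrong column order: expected {list(expected)}, got {list(actual)}"
        )


# Columns seen during fitting (names taken from the housing example).
fitted_columns = ["tax", "crim", "age"]

# Same names, same order: passes silently.
check_columns(fitted_columns, ["tax", "crim", "age"])

# Same names, different order: rejected instead of silently mixing up features.
try:
    check_columns(fitted_columns, ["crim", "tax", "age"])
except ValueError as e:
    order_error = str(e)
```

Failing loudly in this way is what prevents a model from silently receiving features in an order it was not trained on.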

This level of control comes at the price of verbosity. sensAI classes and arguments tend to have long names,
explaining exactly what they do and what the intended use case looks like.

Below we will show an example of feature engineering.


### Defining Feature Generators

Below we will define two feature generators. The first compensates the tax for fraud: it assumes that if the declared
tax in the dataframe is above a threshold, a fixed value that was lied about has to be subtracted. The threshold
(the median of the tax column) is extracted from the dataframe when the feature generator is fit. The resulting
column is marked for normalization with a `MinMaxScaler`.

The second feature generator simply takes the columns "crim" and "age" as is and, via `skip=True`, marks them to be
skipped during normalization.
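
Before wrapping it in a feature generator, the compensation rule can be tried out in plain Python. This is only a sketch: the tax values are made up for illustration and are not taken from the dataset, and the standalone function takes the threshold as an explicit argument.

```python
from statistics import median

# Made-up declared tax values, for illustration only.
declared_tax = [200.0, 300.0, 400.0, 500.0, 600.0]
value_lied_about = 12.0

# "Fitting": the threshold is the median of the observed values.
threshold = median(declared_tax)


def compensate_for_fraud(tax: float, threshold: float, value_lied_about: float) -> float:
    """Subtract the fixed amount only from values strictly above the threshold."""
    if tax > threshold:
        return tax - value_lied_about
    return tax


compensated = [
    compensate_for_fraud(t, threshold, value_lied_about) for t in declared_tax
]
print(compensated)  # [200.0, 300.0, 400.0, 488.0, 588.0]
```

Note that a value equal to the threshold is left unchanged, since the comparison is strict.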

In [None]:
class TaxFraudFeaturegen(fgen.FeatureGenerator):
    def __init__(self, tax_column="tax", value_lied_about=12.0):
        self.value_lied_about = value_lied_about
        self.tax_column = tax_column
        self.threshold = None
        super().__init__(
            normalisationRuleTemplate=DFTNormalisation.RuleTemplate(
                transformer=MinMaxScaler()
            )
        )

    def _fit(self, X: pd.DataFrame, Y: pd.DataFrame, ctx=None):
        self.threshold = X[self.tax_column].median()

    def compensate_for_fraud(self, tax: float):
        if tax > self.threshold:
            tax = tax - self.value_lied_about
        return tax

    def _generate(self, df: pd.DataFrame, ctx=None) -> pd.DataFrame:
        result = pd.DataFrame()
        result[self.tax_column] = df[self.tax_column].apply(self.compensate_for_fraud)
        return result


crime_age_featuregen = fgen.FeatureGeneratorTakeColumns(
    columns=["crim", "age"],
    normalisationRuleTemplate=DFTNormalisation.RuleTemplate(skip=True),
)

### The Feature Generator Registry

We could simply take the feature generators as they are and plug them into our model, but instead we demonstrate
one more class in sensAI: the feature generator registry. Creating a registry is convenient for rapid experimentation
and for keeping track of useful features in a large project. You might not know which ones will be useful for which
model, so the registry abstraction helps you check features into git and stay organized.

Here we create a dedicated registry for the housing features. The registry holds factories
of feature generators, which will create singleton instances when called within the training/inference pipeline
(this is optional).
The collector is pinned to a registry and allows you to refer to the registered features by name (if desired).
This might not make much sense in a notebook, but imagine having a central feature registry somewhere in your code. This
way you can combine the registered features with some features that you cooked up in a script, all in a few lines of code.
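
The registry pattern itself can be sketched in a few lines of plain Python. The classes and functions below (`SimpleRegistry`, `collect`, `TaxFeature`) are hypothetical stand-ins, not sensAI's API; they only show how factories registered by name can yield singleton instances and be mixed with ad-hoc objects.

```python
from typing import Any, Callable, Dict, List, Union


class SimpleRegistry:
    """Hypothetical stand-in for a feature registry: maps names to factories."""

    def __init__(self, use_singletons: bool = True):
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._instances: Dict[str, Any] = {}
        self._use_singletons = use_singletons

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def get(self, name: str) -> Any:
        if not self._use_singletons:
            # A fresh instance on every call.
            return self._factories[name]()
        if name not in self._instances:
            # Create the instance on first use, then reuse it.
            self._instances[name] = self._factories[name]()
        return self._instances[name]


def collect(items: List[Union[str, Any]], registry: SimpleRegistry) -> List[Any]:
    """Resolve names via the registry; pass ready-made objects through unchanged."""
    return [registry.get(i) if isinstance(i, str) else i for i in items]


class TaxFeature:
    """Placeholder for a feature generator class."""


registry = SimpleRegistry(use_singletons=True)
registry.register("tax", TaxFeature)  # register the class itself as the factory

ad_hoc_feature = object()  # something "cooked up in a script"
features = collect(["tax", ad_hoc_feature], registry)
```

With singletons enabled, every lookup of `"tax"` returns the same instance, so state learned during fitting is shared wherever the feature is used.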

In [None]:
housing_feature_registry = fgen.FeatureGeneratorRegistry(useSingletons=True)

housing_feature_registry.tax = TaxFraudFeaturegen

feature_collector = fgen.FeatureCollector("tax", crime_age_featuregen, registry=housing_feature_registry)

### Normalization of Input and Target

Now we come to the issue of normalization. In each feature generator we have declared how the resulting
columns should be normalized. We can use this information by instantiating `DFTNormalisation`.
If a rule for some column is missing, the normalization object will raise an error. This error can be
circumvented by setting `requireAllHandled` to `False`. In that case, you should probably
use a `defaultTransformerFactory` to normalize all remaining columns. However, we recommend explicitly passing
all normalization rules to the feature generators themselves, just to be sure that nothing is missing.

The target has to be normalized with an invertible transformer, so that predictions can be mapped back to the
original scale. We will use the `MaxAbsScaler` here.
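
Why invertibility matters can be seen in a small plain-Python sketch of max-abs scaling, which divides by the largest absolute value and can therefore be undone by multiplying with it again. The class `MaxAbsScaling` below is a hypothetical single-column stand-in written for illustration, not scikit-learn's `MaxAbsScaler`, and the target values are made up.

```python
from typing import List


class MaxAbsScaling:
    """Hypothetical single-column sketch of max-abs scaling."""

    def fit(self, values: List[float]) -> "MaxAbsScaling":
        # Fall back to 1.0 so that an all-zero column does not cause division by zero.
        self.scale = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [v / self.scale for v in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [v * self.scale for v in values]


targets = [0.4, 0.5, 0.8]  # made-up target values
scaler = MaxAbsScaling().fit(targets)

scaled = scaler.transform(targets)            # what the model is trained on
restored = scaler.inverse_transform(scaled)   # what the user gets back at inference
```

The model only ever sees `scaled`; applying the inverse to its predictions is what returns them in the units of the original target.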

In [None]:
dft_normalisation = sn.data_transformation.DFTNormalisation(
    feature_collector.getNormalizationRules(),
    requireAllHandled=True)

target_transformer = sn.data_transformation.DFTSkLearnTransformer(MaxAbsScaler())


## Combining Everything with the Model

Now we can plug all these components into our vector model and enjoy a safe and robust pipeline that will
work during training and inference. The model already has methods for saving and loading and is ready to
be deployed.

In [None]:
custom_model = CustomModel()

custom_model = custom_model \
    .withFeatureCollector(feature_collector) \
    .withInputTransformers(dft_normalisation) \
    .withTargetTransformer(target_transformer) \
    .withName("housing_predictor")

### Evaluating the Model, Tracking Results Online

We evaluate the model using an evaluation util as usual, but this time we will additionally track the results online using ClearML.

In [None]:
try:
    clearmlExperiment = ClearMLExperiment(projectName="sensai_demo", taskName="custom_model")
except Exception:
    # allow the notebook to run without ClearML credentials being present
    clearmlExperiment = None

evalUtil = sensai.evaluation.RegressionEvaluationUtil(InputOutputData(X, y))
evalData = evalUtil.performSimpleEvaluation(custom_model, showPlots=True, trackedExperiment=clearmlExperiment)

You will find the URL under which the results are stored online in the log.

If you missed the evaluation metrics in the log output, here they are:

In [None]:
evalData.getEvalStats().metricsDict()