# Models with Modular Data Pipelines

In [None]:
%load_ext autoreload
%autoreload 2

import sys; sys.path.append("../../src")

In [None]:
import sensai
import pandas as pd

## VectorModel

The backbone of supervised learning implementations is the `VectorModel` abstraction.
It is so named because, in computer science, a *vector* is an array of data,
and vector models map such vectors to the desired outputs, i.e. regression targets or
classes.

It is important to note that this does *not* limit vector models to tabular data, because the data within
a vector can take arbitrary forms (in contrast to vectors as they are defined in mathematics).
Every element of an input vector could itself be arbitrarily
complex and could, in the most general sense, be any kind of object.
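
As a minimal sketch of this idea (plain Python, no sensAI), each input vector below holds a string, a date and a variable-length list, and a toy rule maps it to a class label. The `Review` record and the `predict_is_positive` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Review:
    """A hypothetical input vector whose elements are arbitrary objects."""
    text: str            # free text
    written_on: date     # a date object
    ratings: List[int]   # a variable-length list


def predict_is_positive(vector: Review) -> str:
    """Toy rule-based 'model': maps one input vector to a class label."""
    mean_rating = sum(vector.ratings) / len(vector.ratings)
    return "positive" if mean_rating >= 3 else "negative"


vectors = [
    Review("Works as advertised", date(2023, 5, 1), [4, 5]),
    Review("Broke after a week", date(2023, 6, 12), [1, 2, 2]),
]
labels = [predict_is_positive(v) for v in vectors]
print(labels)
```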

### The VectorModel Class Hierarchy

`VectorModel` is an abstract base class.
From it, abstract base classes for classification (`VectorClassificationModel`) and regression (`VectorRegressionModel`) are derived. Furthermore, base classes for rule-based models (`RuleBasedVectorClassificationModel`, `RuleBasedVectorRegressionModel`) facilitate the implementation of models that do not require learning.

These base classes are, in turn, specialised in order to provide direct access to model implementations based on widely used machine learning libraries such as scikit-learn, XGBoost, PyTorch, etc.
Use your IDE's hierarchy view to inspect them.

<!-- TODO: hierarchical bullet item list with hierarchy (or maybe auto-generate?) -->

### DataFrame-Based Interfaces

Vector models use pandas DataFrames as the fundamental input and output data structures.
Every row in a data frame corresponds to a vector of data, and an entire data frame can thus be viewed as a dataset or batch of data. Data frames are a good base representation for input data because
  * they provide rudimentary meta-data in the form of column names, avoiding ambiguity.
  * they can contain arbitrarily complex data, yet in the simplest of cases, they can directly be mapped to a data matrix (2D array) of features that simple models can directly process.

The `fit` and `predict` methods of `VectorModel` take data frames as input, and the latter furthermore returns its predictions as a data frame.
Note that the DataFrame-based interface does not limit the scope of the models that can be applied. A key principle of vector models is that they may define arbitrary model-specific transformations of the data originally contained in a data frame (e.g. a conversion from complex objects in data frames to one or more tensors for neural networks), as we shall see below.

Here's the particularly simple Iris dataset for flower species classification, where the features are measurements of petals and sepals:

In [None]:
dataset = sensai.data.dataset.DataSetClassificationIris()
io_data = dataset.load_io_data()
io_data.to_df().sample(8)

Here, `io_data` is an instance of `InputOutputData`, which contains two data frames, `inputs` and `outputs`.
The `to_df` method merges the two data frames into one for easier visualisation.

Let's split the dataset and apply a model to it:

In [None]:
# load and split a dataset
splitter = sensai.data.DataSplitterFractional(0.8)
train_io_data, test_io_data = splitter.split(io_data)

# train a model
model = sensai.sklearn.classification.SkLearnRandomForestVectorClassificationModel(
    n_estimators=15)
model.fit_input_output_data(train_io_data)

# make predictions
predictions = model.predict(test_io_data.inputs)

The `fit_input_output_data` method is just a convenience method to pass an `InputOutputData` instance instead of two data frames. It is equivalent to

```python
model.fit(train_io_data.inputs, train_io_data.outputs)
```

where the two data frames containing inputs and outputs are passed separately.

Now let's compare the ground truth to some of the predictions:

In [None]:
pd.concat((test_io_data.outputs, predictions), axis=1).sample(8)

### Implementing Custom Models

It is straightforward to implement your own model. Simply subclass the appropriate base class depending on the type of model you want to implement.

For example, let us implement a simple classifier where we always return the a priori probability of each class in the training data, ignoring the input data for predictions. For this case, we inherit from `VectorClassificationModel` and implement the two abstract methods it defines.

In [None]:
class PriorProbabilityVectorClassificationModel(sensai.VectorClassificationModel):
    def _fit_classifier(self, x: pd.DataFrame, y: pd.DataFrame):
        self._prior_probabilities = y.iloc[:, 0].value_counts(normalize=True).to_dict()

    def _predict_class_probabilities(self, x: pd.DataFrame) -> pd.DataFrame:
        values = [self._prior_probabilities[cls] for cls in self.get_class_labels()]
        return pd.DataFrame([values] * len(x), columns=self.get_class_labels(), index=x.index)

Adapting a model implementation from another machine learning library typically takes just a few lines. For models that adhere to the scikit-learn interfaces for learning and prediction, there are abstract base classes that make the adaptation particularly straightforward.
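
To make the logic of the prior-probability classifier above concrete, here is a dependency-free sketch of the same computation. It is an illustration only, not sensAI code: plain lists stand in for the data frames, and the function names `fit_priors` and `predict_priors` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def fit_priors(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each class in the training labels
    (the quantity that value_counts(normalize=True) yields on a pandas Series)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def predict_priors(priors: Dict[Hashable, float], n_rows: int) -> List[Dict[Hashable, float]]:
    """Return the same class-probability row for every input, ignoring the inputs."""
    return [dict(priors) for _ in range(n_rows)]


train_labels = ["setosa", "setosa", "versicolor", "virginica"]
priors = fit_priors(train_labels)
probabilities = predict_priors(priors, n_rows=3)
print(priors)
```

Every predicted row is identical and sums to one, which is exactly why such a model is useful mainly as a baseline.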

### Configuration

Apart from the parameters passed at construction, which are specific to the type of model in question, all vector models can be flexibly configured via methods that can be called post-construction.
These methods all have the `with_` prefix, indicating that they return the instance itself (akin to the builder pattern), allowing calls to be chained in a single statement.
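
The chaining mechanism itself can be sketched in a few lines. The class below is a hypothetical stand-in and not a sensAI class; it only illustrates, under that assumption, how returning the instance from each `with_` method allows calls to be chained.

```python
from typing import List, Optional


class ToyModel:
    """Hypothetical stand-in for a configurable model; not a sensAI class."""

    def __init__(self, n_estimators: int = 10):
        self.n_estimators = n_estimators  # construction-time parameter
        self.name: Optional[str] = None
        self.feature_transformers: List[str] = []

    def with_name(self, name: str) -> "ToyModel":
        self.name = name
        return self  # returning self is what enables chaining

    def with_feature_transformers(self, *transformers: str) -> "ToyModel":
        self.feature_transformers = list(transformers)
        return self


# Construct and configure in a single statement
toy_model = ToyModel(n_estimators=15).with_name("RandomForest").with_feature_transformers("scale")
print(toy_model.name, toy_model.feature_transformers)
```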

The most relevant such methods are:

* `with_name` to name the model (for reporting purposes)
* `with_raw_input_transformers` for adding initial transformations of the raw input data
* `with_feature_generator` and `with_feature_collector` for specifying how to generate features from the input data
* `with_feature_transformers` for specifying how the generated features shall be transformed

The latter three points are essential for defining modular input pipelines and will be addressed in detail below.

All configured options are fully reflected in the model's string representation, which can be pretty-printed with the `pprint` method.

In [None]:
str(model.with_name("RandomForest"))

In [None]:
model.pprint()

## Modular Pipelines

A key principle of sensAI's vector models is that data pipelines
* can be **strongly associated with a model**. This is critically important if several heterogeneous models are to be applied to the same use case. Typically, every model has different requirements regarding the data it can process and the representation it requires to process it optimally.
* are to be **modular**, meaning that a pipeline can be composed from reusable and user-definable components.

An input pipeline typically serves the purpose of answering the following questions:

* **How shall the data be pre-processed?**

  It might be necessary to process the data before we can use it and extract data from it.
  We may need to filter or clean the data;
  we may need to establish a usable representation from raw data (e.g. convert a string-based representation of a date into a proper data structure);
  or we may need to infer/impute missing data.

  The relevant abstraction for this task is `DataFrameTransformer`, which, as the name suggests, can arbitrarily transform a data frame.
  In sensAI, all concrete implementations have the class name prefix `DFT` and are thus easily discovered through auto-completion.

  A `VectorModel` can be configured to apply a pre-processing transformation via the method `with_raw_input_transformers`.

* **What is the data used by the model?**

  The relevant abstraction is `FeatureGenerator`. Via `FeatureGenerator` instances, a model can define which set of features is to be used. Moreover, these instances can hold meta-data on the respective features, which can be leveraged for downstream representation. 
  In sensAI, the class names of all feature generator implementations use the prefix `FeatureGenerator`.

  A `VectorModel` can be configured to answer this question via the method `with_feature_generator` (or `with_feature_collector`).

* **How does that data need to be represented?**

  Different models can require different representations of the same data. For example, some models might require all features to be numeric, thus requiring categorical features to be encoded, while others might work better with the original representation.
  Furthermore, some models might work better with numerical features normalised or scaled in a certain way while it makes no difference to others.
  We can address these requirements by adding model-specific transformations.

  The relevant abstraction is, once again, `DataFrameTransformer`.

  A `VectorModel` can be configured to apply a transformation to its features via the method `with_feature_transformers`.

The three pipeline stages are applied in the order presented above, and all components are optional. If a model does not define any raw input transformers, the original data remains unmodified; if it defines no feature generator, the set of features is given by the full input data frame; and so on.
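
The order of the three stages can be sketched as follows. This is a conceptual illustration only, not sensAI's implementation: rows are plain dictionaries standing in for data frame rows, and all names (`parse_dates`, `select_features`, `scale_features`, `apply_pipeline`) are hypothetical.

```python
from datetime import date
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Stage = Callable[[List[Row]], List[Row]]


def parse_dates(rows: List[Row]) -> List[Row]:
    """Raw input transformation: turn ISO date strings into date objects."""
    return [{**r, "sold_on": date.fromisoformat(r["sold_on"])} for r in rows]


def select_features(rows: List[Row]) -> List[Row]:
    """Feature generation: decide which data the model uses."""
    return [{"month": r["sold_on"].month, "price": r["price"]} for r in rows]


def scale_features(rows: List[Row]) -> List[Row]:
    """Feature transformation: divide prices by the maximum price."""
    max_price = max(r["price"] for r in rows)
    return [{**r, "price": r["price"] / max_price} for r in rows]


def apply_pipeline(rows: List[Row],
                   raw_input_transformer: Optional[Stage] = None,
                   feature_generator: Optional[Stage] = None,
                   feature_transformer: Optional[Stage] = None) -> List[Row]:
    """Apply the stages in order; a stage that is None leaves the data unchanged."""
    for stage in (raw_input_transformer, feature_generator, feature_transformer):
        if stage is not None:
            rows = stage(rows)
    return rows


raw = [{"sold_on": "2023-01-15", "price": 50.0, "note": "n/a"},
       {"sold_on": "2023-03-02", "price": 200.0, "note": "gift"}]
features = apply_pipeline(raw, parse_dates, select_features, scale_features)
print(features)
```

Because each stage is an independent callable, a second model could reuse `parse_dates` and `select_features` while swapping in a different feature transformation, which is the kind of modularity described above.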

