# The baselining pipeline

Global non-linear models, such as Random Forests, Gradient Boosting Trees or Neural Networks, can be very effective at predicting building energy consumption. This is evident in the winning solutions of the [ASHRAE’s Great Energy Predictor III competition](https://www.kaggle.com/c/ashrae-energy-prediction) hosted by Kaggle: the top five approaches combined Gradient Boosting Trees (LightGBM, CatBoost, XGBoost) with Neural Networks (feed-forward and convolutional networks).

However, interpreting and auditing a global non-linear model is not trivial. Here, interpretation refers to the transparency of an algorithm’s decisions: the ability to identify what the algorithm has learned and which subset of the observations most influenced what it learned.

An alternative to global non-linear models is an ensemble of local linear models. The general recipe for developing an ensemble of such models comprises the following steps:

 1. Select the linear model to be used as the local estimator (i.e. the building block of the ensemble);
 2. Define a way to quantify the notion of locality, i.e. when two observations are close enough to be handled by the same local model;
 3. Define a strategy for selecting the observations used to train each local model and for combining the results of all individual models into one prediction.

The way `eensight` implements these steps is explained here.
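
As a rough illustration of the three-step recipe (not of `eensight`'s actual implementation), the sketch below uses ordinary least squares as the local estimator, fixed temperature bins as a stand-in notion of locality, and one model per bin whose predictions are written back into a single array. The data, the bin edge and the helper name `design` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: consumption falls with temperature up to 18 degrees, then rises.
temperature = rng.uniform(0, 35, size=500)
consumption = np.where(
    temperature < 18, 50 - 2 * temperature, 14 + 1.5 * (temperature - 18)
)
consumption = consumption + rng.normal(scale=0.5, size=temperature.size)

# Step 2 (locality): observations in the same temperature bin share a local model.
# The bin edge is a hypothetical choice made for this example only.
groups = np.digitize(temperature, bins=[18.0])


def design(x):
    """Step 1 (local estimator): a straight line with an intercept."""
    return np.column_stack([np.ones_like(x), x])


# Step 3 (training and combination): fit one model per group and write each
# model's predictions back to the rows it is responsible for.
prediction = np.empty_like(consumption)
for g in np.unique(groups):
    rows = groups == g
    coef, *_ = np.linalg.lstsq(design(temperature[rows]), consumption[rows], rcond=None)
    prediction[rows] = design(temperature[rows]) @ coef

# A single global line cannot follow the V shape, so its error is larger.
global_coef, *_ = np.linalg.lstsq(design(temperature), consumption, rcond=None)
global_prediction = design(temperature) @ global_coef

rmse_local = np.sqrt(np.mean((consumption - prediction) ** 2))
rmse_global = np.sqrt(np.mean((consumption - global_prediction) ** 2))
print(f"local: {rmse_local:.2f}, global: {rmse_global:.2f}")
```

Each local model stays a plain linear regression, so its coefficients remain directly interpretable even though the combined prediction is non-linear.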


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [15]:
import functools

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from joblib import Parallel, delayed
from sklearn.base import clone
from datetime import time, datetime
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pd.plotting.register_matplotlib_converters()

In [240]:
from eensight.utils.jupyter import load_catalog

from eensight.features.generate import DatetimeFeatures, MMCFeatures
from eensight.features.encode import CategoricalEncoder, SplineEncoder, ICatSplineEncoder
from eensight.pipelines.model_selection import cvrmse, nmbe, optimize
from eensight.features.cluster import ClusterFeatures
from eensight.models import (
    LinearPredictor,
    GroupedPredictor,
    CompositePredictor,
    EnsemblePredictor
)

In [4]:
catalog = load_catalog('demo')

data_train = catalog.load('train.model_input_data')

# We train the predictive models without outliers:
X_train = data_train.loc[~data_train['consumption_outlier'], ['temperature']]
y_train = data_train.loc[X_train.index, ['consumption']]

## The linear model used as the local estimator

The simplest option is a model whose only feature is the hour of the week. The hour of the week is a categorical feature that can be encoded in [one-hot form](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features):

In [46]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X_train)[['hourofweek']]
dmatrix = CategoricalEncoder(feature='hourofweek', encode_as='onehot').fit_transform(features)

We can fit a linear model:

In [50]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y_train.values)

... and evaluate it in-sample:

In [51]:
pred = model.predict(dmatrix)

In [None]:
y_true = y_train.values

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
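
The exact definitions used by `eensight.pipelines.model_selection.cvrmse` and `nmbe` are not shown in this notebook. A common formulation, which the sketch below assumes, normalizes the root mean squared error and the mean error by the mean of the observed values (some guidelines additionally adjust the denominator for the number of model parameters). The `_sketch` suffix marks the functions as stand-ins.

```python
import numpy as np


def cvrmse_sketch(y_true, y_pred):
    """Root mean squared error divided by the mean of the observations."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)


def nmbe_sketch(y_true, y_pred):
    """Mean error divided by the mean of the observations.

    With this sign convention a positive value means under-prediction.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean(y_true - y_pred) / np.mean(y_true)


# Made-up observations and predictions.
y_obs = np.array([10.0, 12.0, 8.0, 10.0])
y_hat = np.array([9.0, 12.0, 9.0, 9.0])

print(f"CV(RMSE) (%): {cvrmse_sketch(y_obs, y_hat) * 100:.2f}")
print(f"NMBE (%): {nmbe_sketch(y_obs, y_hat) * 100:.2f}")
```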

The degrees of freedom of the model are:

In [None]:
np.linalg.matrix_rank(dmatrix)
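
For a one-hot design matrix fitted without an intercept, the least-squares coefficients are simply the per-category means of the target, and the rank equals the number of categories present in the data. A small self-contained check with made-up numbers:

```python
import numpy as np

categories = np.array([0, 1, 2, 0, 1, 2, 0, 1])  # a tiny stand-in for hour-of-week
target = np.array([1.0, 4.0, 7.0, 3.0, 6.0, 9.0, 2.0, 5.0])

# One-hot design matrix: one column per category, exactly one 1 per row.
onehot = np.eye(3)[categories]

# Least squares without an intercept.
coef, *_ = np.linalg.lstsq(onehot, target, rcond=None)

# The same numbers obtained directly as per-category means.
means = np.array([target[categories == c].mean() for c in range(3)])

rank = np.linalg.matrix_rank(onehot)
```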

The impact of the hour of the week on energy consumption is then:

In [None]:
pred = pd.DataFrame(data=pred, index=y_train.index, columns=['hourofweek_impact'])

date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')
to_plot = date_enc.fit_transform(pred).groupby('hourofweek').mean()


colors = ['#8c510a', '#d8b365', '#f6e8c3', '#f5f5f5', '#c7eae5', '#5ab4ac', '#01665e']

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    intervals = np.split(to_plot.index, 7)
    for i, item in enumerate(intervals):
        ax.axvspan(item[0], item[-1], alpha=0.3, color=colors[i])
    
    to_plot.plot(ax=ax)
    ax.set_xlabel('Hour of week')
    ax.legend(['Average contribution of hour-of-week feature'], fancybox=True, frameon=True)

`eensight` includes functionality for reducing the number of categories in a categorical feature while retaining as much of the feature's predictive capability as possible:

In [58]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X_train)[['hourofweek']]

enc = CategoricalEncoder(feature='hourofweek', encode_as='onehot', max_n_categories=60)
dmatrix = enc.fit_transform(features, y_train)

In [None]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y_train.values)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")

This is practically the same performance with roughly one third of the degrees of freedom:

In [None]:
np.linalg.matrix_rank(dmatrix)

The local model uses this functionality to control the degrees of freedom of the ensemble model.
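
How `CategoricalEncoder` merges categories internally is not covered in this notebook. One plausible way to obtain a similar effect, shown purely as an assumption-laden sketch, is to replace each category with its mean target value and let a shallow regression tree group categories with similar means. The data and the helper name `reduce_categories` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def reduce_categories(categories, target, max_n_categories, min_samples_leaf=1):
    """Map each original category to one of at most `max_n_categories` groups."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)

    # Target encoding: each category is represented by its mean target value.
    category_means = target.groupby(categories).mean()
    encoded = categories.map(category_means).to_numpy().reshape(-1, 1)

    # Each leaf of the tree becomes one merged category.
    tree = DecisionTreeRegressor(
        max_leaf_nodes=max_n_categories, min_samples_leaf=min_samples_leaf
    ).fit(encoded, target)
    leaves = tree.apply(category_means.to_numpy().reshape(-1, 1))
    return dict(zip(category_means.index, leaves))


rng = np.random.default_rng(0)
hour_of_week = np.tile(np.arange(168), 20)

# Hypothetical load shape: high during weekday working hours, low otherwise.
working = (hour_of_week % 24 >= 8) & (hour_of_week % 24 < 18) & (hour_of_week < 120)
load = np.where(working, 100.0, 40.0) + rng.normal(scale=2.0, size=hour_of_week.size)

mapping = reduce_categories(hour_of_week, load, max_n_categories=2)
print(len(set(mapping.values())), "merged categories")
```

In this toy example the 168 hours of the week collapse into two groups, working and non-working hours, which is all the information the synthetic load shape contains.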

Another component to include in the model is an interaction between the hour of the week and the temperature. The [TOWT model](https://ieeexplore.ieee.org/document/5772947/) estimates the temperature effect separately for periods of the day with high and with low energy consumption in order to distinguish between occupied and unoccupied building periods. 

To this end, a flexible curve is fitted to the `consumption~temperature` relationship. If more than 65% of the data points that correspond to a specific hour of the week lie above the fitted curve, that hour is flagged as “Occupied”; otherwise it is flagged as “Unoccupied”.

We can apply this approach using `eensight` functionality:

In [157]:
enc = SplineEncoder(feature='temperature', degree=1, strategy='uniform').fit(X_train)
dmatrix = enc.transform(X_train)

In [158]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y_train.values)
pred = model.predict(dmatrix)
pred = pd.DataFrame(data=pred, index=y_train.index, columns=y_train.columns)

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    ax.scatter(X_train['temperature'], y_train['consumption'], s=1, alpha=0.2)
    
    X_train_ = X_train.sort_values(by='temperature')
    ax.plot(X_train_, pred.loc[X_train_.index, 'consumption'], c='#cc4c02')

In [160]:
resid = y_train - pred
mask = resid > 0
mask = DatetimeFeatures(subset='hourofweek').fit_transform(mask)
occupied = mask.groupby('hourofweek')['consumption'].mean() > 0.65
occupied = occupied.to_dict()

In [161]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X_train)
features['occupied'] = features['hourofweek'].map(lambda x: occupied[x])

In [162]:
enc_temp = SplineEncoder(feature='temperature', degree=1)
enc_occ = CategoricalEncoder(feature='occupied', encode_as='onehot')
enc_occ = enc_occ.fit(features)

enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)

In [None]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y_train.values)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")

In [None]:
np.linalg.matrix_rank(dmatrix)

Alternatively, we can rely on `eensight`'s functionality to split the hours of the week into the two categories that are most dissimilar in terms of energy consumption, given temperature information:

In [165]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X_train)

enc_temp = SplineEncoder(feature='temperature', degree=1, strategy='uniform')
enc_occ = CategoricalEncoder(feature='hourofweek', max_n_categories=2,
                             stratify_by='temperature', min_samples_leaf=15)
enc_occ = enc_occ.fit(features, y_train)

enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)

In [None]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y_train.values)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")

The prediction results are better, while the number of degrees of freedom is the same:

In [None]:
np.linalg.matrix_rank(dmatrix)

Then, the `consumption~temperature` curves per category of hour of week are:

In [None]:
date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')

intervals = pd.concat(
    ( pd.cut(X_train['temperature'], 15, precision=0), 
      pd.DataFrame(data=pred, index=X_train.index, columns=['temperature_impact'])
    ), 
    axis=1
)

enc_cat = enc_occ.feature_pipeline_['reduce_dimension']
intervals = date_enc.fit_transform(intervals)
intervals['hourofweek'] = intervals['hourofweek'].map(lambda x: enc_cat.mapping_[x])

to_plot = (
    intervals.groupby(['hourofweek', 'temperature'])['temperature_impact']
             .mean()
             .unstack()
)

colors = ['#8c510a', '#df65b0']

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    for i, (idx, values) in enumerate(to_plot.iterrows()):
        values.plot(ax=ax, lw=2, alpha=0.6, label=f'category {idx}', color=colors[i])

    ax.xaxis.set_major_locator(plt.MaxNLocator(10))
    ax.set_xlabel('Temperature intervals')
    ax.legend(fancybox=True, frameon=True)

The only two additions to the above components are:

1. A categorical feature for the different months of the dataset
2. A linear term for the temperature as a main effect. The interaction term between temperature and the hour of the week "corrects" the predictions of the temperature's linear term in the main effects.

The default local linear model used in `eensight` is defined in *eensight/conf/base/models/towt.yaml* as:

```yaml
add_features:
  time:
    type: datetime
    subset: month, hourofweek
  
regressors:
  month:
    feature: month
    type: categorical
    encode_as: onehot

  tow:
    feature: hourofweek
    type: categorical
    max_n_categories: 60  
    encode_as: onehot
  
  lin_temperature:
    feature: temperature
    type: linear
  
  flex_temperature:
    feature: temperature
    type: spline
    n_knots: 5
    degree: 1
    strategy: uniform 
    extrapolation: constant
    interaction_only: true

interactions:
  tow, flex_temperature:
    tow:
      max_n_categories: 2 
      stratify_by: temperature 
      min_samples_leaf: 15 
```
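
To make the `flex_temperature` and `interactions` entries more concrete, here is a minimal NumPy sketch of a degree-1 spline basis on uniformly spaced knots with constant extrapolation, followed by its interaction with a two-level categorical feature. It only illustrates the general technique; the helper names are hypothetical and `eensight`'s encoders may construct their design matrices differently.

```python
import numpy as np


def linear_spline_basis(x, n_knots=5):
    """Degree-1 B-spline ("hat function") basis on uniformly spaced knots.

    Values outside the knot range are clamped to the boundary, which
    corresponds to constant extrapolation. For brevity the knots are derived
    from `x` itself; a real encoder would store them when it is fitted.
    """
    x = np.asarray(x, dtype=float)
    knots = np.linspace(x.min(), x.max(), n_knots)
    identity = np.eye(n_knots)
    # Interpolating the j-th unit vector over the knots gives the j-th hat function.
    return np.column_stack([np.interp(x, knots, identity[j]) for j in range(n_knots)])


def interact(onehot, basis):
    """Row-wise products: one copy of the spline basis per category."""
    n = onehot.shape[0]
    return (onehot[:, :, None] * basis[:, None, :]).reshape(n, -1)


temperature = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 12.5])
occupied = np.array([1, 0, 1, 1, 0, 0])

basis = linear_spline_basis(temperature)  # shape (6, 5)
onehot = np.eye(2)[occupied]              # shape (6, 2)
design = interact(onehot, basis)          # shape (6, 10)
```

Each row of `design` is non-zero only in the block that belongs to its own category, so a linear model fitted on it estimates a separate piecewise-linear temperature curve per category.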


We can apply the local model to the whole demo dataset. This functionality is provided by `eensight.models.LinearPredictor`:

### `eensight.models.LinearPredictor` 

    Parameters
    ----------
    model_structure : dict
        The model configuration
    alpha : float (default=0.1)
        Regularization strength of the underlying ridge regression; must be a positive 
        float. Regularization improves the conditioning of the problem and reduces the 
        variance of the estimates. Larger values specify stronger regularization.
    fit_intercept : bool (default=False)
        Whether to fit the intercept for this model. If set to false, no intercept will 
        be used in calculations.

In [184]:
catalog = load_catalog('demo', model='towt')
model_structure = catalog.load('model_structure')

Create the model:

In [194]:
local_reg = LinearPredictor(model_structure=model_structure, alpha=0.01)

Fit with training data:

In [None]:
%%time
local_reg = local_reg.fit(X_train, y_train)

Evaluate the model in-sample:

In [None]:
%%time
pred = local_reg.predict(X_train)

In [None]:
y_true = y_train

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true[['consumption']], pred[['consumption']])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true[['consumption']], pred[['consumption']])*100}")

We can plot the standardized residuals:

In [271]:
resid = y_true[['consumption']] - pred[['consumption']]
resid = StandardScaler().fit_transform(resid).squeeze()
resid = pd.Series(data=resid, index=y_train.index)

In [None]:
with plt.style.context('seaborn-whitegrid'):
    grid = sns.jointplot(x=pred['consumption'], y=resid)
    grid.fig.set_figwidth(12)
    grid.fig.set_figheight(5)
    grid.set_axis_labels(xlabel='Predicted Value', ylabel='Standardized Residuals')

The effective number of parameters (i.e. the degrees of freedom) is:

In [None]:
local_reg.n_parameters
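
How `n_parameters` is computed inside `LinearPredictor` is not shown here. For ridge regression, a standard definition of the effective degrees of freedom is the trace of the hat matrix, which equals the sum of `d_i**2 / (d_i**2 + alpha)` over the singular values `d_i` of the design matrix. The sketch below evaluates that formula on a made-up matrix, under the assumption that this is the quantity of interest.

```python
import numpy as np


def ridge_effective_dof(X, alpha):
    """Trace of the ridge hat matrix X (X'X + alpha I)^-1 X'."""
    singular_values = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(singular_values**2 / (singular_values**2 + alpha)))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

dof_unregularized = ridge_effective_dof(X, alpha=0.0)  # equals the rank of X
dof_light = ridge_effective_dof(X, alpha=0.01)
dof_heavy = ridge_effective_dof(X, alpha=100.0)
```

Without regularization the value coincides with `np.linalg.matrix_rank(X)`; increasing `alpha` shrinks it towards zero.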

This is how the design matrix of the regression corresponds to each regressor:

In [None]:
local_reg.composer_.component_matrix

This makes it easy to decompose the prediction into components (the regularization term `alpha=0.01` in the `LinearPredictor` was used primarily so that the individual components have reasonable values):

In [None]:
%%time
pred = local_reg.predict(X_train, include_components=True)
pred.head()

In [200]:
assert np.allclose(pred['consumption'],
            pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)
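
The decomposition works because a linear prediction is a sum over the columns of the design matrix, so summing only the columns that belong to one regressor gives that regressor's contribution. A minimal sketch with a hypothetical column-to-component assignment (in `eensight` the real assignment is what `component_matrix` exposes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
design = rng.normal(size=(5, 4))
coef = np.array([1.0, -2.0, 0.5, 3.0])

# Hypothetical assignment of design-matrix columns to named regressors.
columns_of = {"month": [0, 1], "tow": [2], "lin_temperature": [3]}

# Contribution of each regressor: its columns times its coefficients.
components = pd.DataFrame(
    {name: design[:, cols] @ coef[cols] for name, cols in columns_of.items()}
)
components.insert(0, "consumption", design @ coef)
```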

## A notion of locality


The local model is meant to be applied to different clusters of the dataset. Since each local model in the ensemble predicts on a different subset of the input data (an observation cannot belong to more than one cluster), the final prediction is generated by vertically concatenating the predictions of the individual models.
 

<img src="images/grouped.png?modified=12345678" alt="grouped" width="550"/>

This functionality is provided by the combination of:

- `eensight.features.ClusterFeatures`
- `eensight.models.GroupedPredictor`


### `eensight.features.ClusterFeatures` 

A composite transformer model that uses [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/) (Hierarchical Density-Based 
Spatial Clustering of Applications with Noise) to cluster the input data and 
a [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to predict the clusters for unseen inputs.

    Parameters
    ----------
    min_cluster_size : int, optional (default=5)
        The minimum size of clusters; single linkage splits that contain
        fewer points than this will be considered points "falling out" of a
        cluster rather than a cluster splitting into two new clusters.
    min_samples : int, optional (default=5)
        The number of samples in a neighbourhood for a point to be
        considered a core point. This parameter controls what the clusterer
        identifies as noise.
    metric : string, or callable, optional (default='euclidean')
        The metric to use when calculating distance between instances in a
        feature array. If metric is a string or callable, it must be one of
        the options allowed by metrics.pairwise.pairwise_distances for its
        metric parameter. If metric is "precomputed", X is assumed to be a
        distance matrix and must be square.
    transformer : An object that implements a `fit_transform` method (default=None)
        The `fit_transform` method is used for transforming the input into a form
        that is understood by the distance metric.
    memory : Instance of joblib.Memory or string (optional)
        Used to cache the output of the computation of the tree.
        If a string is given, it is the path to the caching directory.
    allow_single_cluster : bool, optional (default=True)
        By default HDBSCAN will not produce a single cluster, setting this
        to True will override this and allow single cluster results in
        the case that you feel this is a valid result for your dataset.
    cluster_selection_method : string, optional (default='eom')
        The method used to select clusters from the condensed tree. The
        standard approach for HDBSCAN* is to use an Excess of Mass algorithm
        to find the most persistent clusters. Alternatively you can instead
        select the clusters at the leaves of the tree -- this provides the
        most fine grained and homogeneous clusters. Options are:
            * ``eom``
            * ``leaf``
    n_neighbors : int, default=1
        Number of neighbors to use by default for :meth:`kneighbors` queries.
    weights : {'uniform', 'distance'} or callable, default='uniform'
        weight function used in prediction. Possible values:
        - 'uniform' : uniform weights.  All points in each neighborhood
        are weighted equally.
        - 'distance' : weight points by the inverse of their distance.
        In this case, closer neighbors of a query point will have a
        greater influence than neighbors which are further away.
        - [callable] : a user-defined function which accepts an
        array of distances, and returns an array of the same shape
        containing the weights.
    output_name : str, default='cluster'
        The name of the output dataframe's column that includes the cluster
        information.



### `eensight.models.GroupedPredictor` 

Constructs one estimator per data group. Splits the data by the values of a
single column and fits one estimator per distinct value.

    Parameters
    ----------
    model_structure : dict
        A configuration dictionary that includes information about the base model's
        structure.
    group_feature : str
        The name of the column of the input dataframe to use as the grouping set.
    estimator_params : dict or tuple of tuples, default=tuple()
        The parameters to use when instantiating a new base estimator. If none are given,
        default parameters are used.
    fallback : bool (default=False)
        Whether or not to fall back to a global model in case a group parameter is not
        found during `.predict()`.
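
The behaviour described above (one estimator per group, an optional global fallback, and predictions stitched back together by row) can be illustrated with a small class. `GroupedLeastSquares` is a hypothetical stand-in written for this example, not the `GroupedPredictor` implementation.

```python
import numpy as np
import pandas as pd


class GroupedLeastSquares:
    """Fit one least-squares line per value of `group_feature`."""

    def __init__(self, group_feature, fallback=False):
        self.group_feature = group_feature
        self.fallback = fallback

    @staticmethod
    def _design(frame):
        values = frame.to_numpy(dtype=float)
        return np.column_stack([np.ones(len(values)), values])

    def _fit_one(self, X, y):
        coef, *_ = np.linalg.lstsq(self._design(X), y.to_numpy(dtype=float), rcond=None)
        return coef

    def fit(self, X, y):
        features = X.drop(columns=self.group_feature)
        self.coefs_ = {
            group: self._fit_one(features.loc[rows.index], y.loc[rows.index])
            for group, rows in X.groupby(self.group_feature)
        }
        # The global model is only used for groups that were not seen in `fit`.
        self.global_coef_ = self._fit_one(features, y) if self.fallback else None
        return self

    def predict(self, X):
        features = X.drop(columns=self.group_feature)
        parts = []
        for group, rows in X.groupby(self.group_feature):
            coef = self.coefs_.get(group, self.global_coef_)
            if coef is None:
                raise ValueError(f"Unknown group {group!r} and no fallback model")
            block = features.loc[rows.index]
            parts.append(pd.Series(self._design(block) @ coef, index=block.index))
        # Vertically concatenate the per-group predictions and restore row order.
        return pd.concat(parts).reindex(X.index)


# Made-up data: cluster 0 follows y = 1 + x, cluster 1 follows y = 10 - 2x.
X = pd.DataFrame(
    {"temperature": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0], "cluster": [0, 1, 0, 1, 0, 1]}
)
y = pd.Series([1.0, 10.0, 2.0, 8.0, 3.0, 6.0])

grouped = GroupedLeastSquares(group_feature="cluster").fit(X, y)
pred = grouped.predict(X)
```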
    
  

We already have the `distance_metrics` in the catalog:

In [202]:
distance_metrics = catalog.load('train.distance_metrics')

In [None]:
print(distance_metrics.keys())

Select a time interval:

In [207]:
start_time = time(8, 0)

In [208]:
end_time = distance_metrics[start_time]['end_time']

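metric_components = distance_metrics[start_time]['metric_components']
# NOTE: `metric_function` is not imported in any of the cells above; it has to be
# made available (e.g. imported from `eensight`) before this cell can run.
metric = functools.partial(metric_function, metric_components)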

In [210]:
mcc_features = Pipeline([
        ('dates', DatetimeFeatures(subset=['month', 'dayofweek'])),
        ('features', MMCFeatures()),
])

In [211]:
clusterer = ClusterFeatures(min_cluster_size=20, 
                            transformer=mcc_features, 
                            output_name='cluster', 
                            metric=metric)

In [212]:
X_train_int = X_train.loc[start_time:end_time]
y_train_int = y_train.loc[start_time:end_time]

clusters = clusterer.fit_transform(X_train_int)

X_train_int = pd.concat((X_train_int, clusters), axis=1)

In [213]:
reg_grouped = GroupedPredictor(
                    model_structure=model_structure, 
                    group_feature='cluster',
                    estimator_params=(('alpha', 0.01), ('fit_intercept', False)),
)

In [None]:
%%time
reg_grouped = reg_grouped.fit(X_train_int, y_train_int)

The `GroupedPredictor` applies the feature generation transformers directly on the dataset before it is split per cluster:

In [None]:
reg_grouped.transformers_

In addition, it fits all categorical encoders in ordinal form, and passes the encoded data (but not the actual encoders) to each cluster estimator:

In [None]:
for name, encoder in reg_grouped.encoders_['main_effects'].items():
    print('--->', name)
    print(encoder)

In [None]:
for pair_name, encoder in reg_grouped.encoders_['interactions'].items():
    print('--->', pair_name)
    print(encoder)

In [None]:
%%time
pred = reg_grouped.predict(X_train_int)

In [None]:
y_true = y_train_int

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true[['consumption']], pred[['consumption']])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true[['consumption']], pred[['consumption']])*100}")

The number of parameters was:

In [None]:
reg_grouped.n_parameters

... and the number of observations used to fit the model was:

In [None]:
X_train_int.shape[0]

The process of combining distance metrics per time interval, clusters and base models is encapsulated into the `CompositePredictor`:


### `eensight.models.CompositePredictor` 

Linear regression model that combines a clusterer (an estimator that answers 
the question "*To which cluster should I allocate a given observation's target?*")
and a grouped regressor (a regressor for predicting the target given information 
about the clusters).

    Parameters
    ----------
    distance_metrics : dict
        Dictionary containing time interval information of the form:
        key: interval start time, values: interval end time, and components of 
        the corresponding distance metric.
    base_clusterer : eensight.features.cluster.ClusterFeatures
        An estimator that answers the question "To which cluster should I allocate
        a given observation's target?".
    base_regressor : eensight.models.grouped.GroupedPredictor
        A regressor for predicting the target given information about the clusters.
    group_feature : str, default='cluster'
        The name of the feature to use as the grouping set.

In [226]:
model = CompositePredictor(distance_metrics=distance_metrics, 
                           base_clusterer=clusterer,
                           base_regressor=reg_grouped
)

In [None]:
%%time
model = model.fit(X_train, y_train)

In [None]:
%%time
pred = model.predict(X_train)

In [None]:
y_true = y_train

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true[['consumption']], pred[['consumption']])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true[['consumption']], pred[['consumption']])*100}")

The number of parameters used was:

In [None]:
model.n_parameters

We can ask for the cluster information to be included in the prediction (cluster information is in the form `# of interval: # of cluster`):

In [None]:
%%time
pred = model.predict(X_train, include_clusters=True)
pred.head()

... and/or the components:

In [None]:
%%time
pred = model.predict(X_train, include_components=True)
pred.head()

In [233]:
assert np.allclose(pred['consumption'],
            pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)

The parameter `min_cluster_size` of `ClusterFeatures` can (and generally should) be optimized for out-of-sample performance.

The relevant functionality is provided by `eensight.pipelines.model_selection.optimize`:

### `eensight.pipelines.model_selection.optimize` 

    Parameters
    ----------
    estimator : Any regressor with scikit-learn API (i.e. with fit and
        predict methods) 
        The object to use to fit the data.
    X : pandas dataframe of shape (n_samples, n_features)
        The input data to optimize on.
    y : pandas dataframe of shape (n_samples, 1)
        The training target data to optimize on.
    n_repeats : int, default=2
        Number of times to repeat the train/test data split process.
    test_size : float, default=0.25
        The proportion of the dataset to include in the test split. Should be between
        0.0 and 1.0.
    target_name : str, default='consumption'
        It is expected that both y and the predictions of the `estimator` are
        dataframes with a single column, the name of which is the one provided
        for `target_name`.
    budget : int, default=20
        The number of trials. If this argument is set to `None`, there is no
        limitation on the number of trials. If `timeout` is also set to `None`,
        the study continues to create trials until it receives a termination
        signal such as Ctrl+C or SIGTERM.
    timeout : int, default=None
        Stop study after the given number of second(s). If this argument is set to
        `None`, the study is executed without time limitation.
    scorers : dict, default=None
        dict mapping scorer name to a callable. The callable object
        should have signature ``scorer(y_true, y_pred)``.
        The default value is:
        `OrderedDict(
            {
                "CVRMSE": lambda y_true, y_pred:
                    eensight.pipelines.model_selection.cvrmse(
                        y_true[target_name], y_pred[target_name]
                    ),
                "ExVAR": lambda y_true, y_pred: sklearn.metrics.explained_variance_score(
                    y_true[target_name], y_pred[target_name]
                ),
            }
        )`
    directions : list, default=None
        A sequence of directions during multi-objective optimization. Set
        ``minimize`` for minimization and ``maximize`` for maximization.
        The default value is ['minimize', 'maximize'].
    optimization_space : callable, default=None
        A function that takes an `optuna.trial.Trial` as input and returns
        a parameter combination to try. If it is None, the `estimator` should
        have an `optimization_space` function.
    multivariate: bool, default=False
        If `True`, the multivariate TPE (Tree-structured Parzen Estimator)
        is used when suggesting parameters.
    out_of_sample: bool, default=True
        Whether the optimization should be based on out-of-sample (if `True`) or
        in-sample (if `False`) performance.
    verbose : bool, default=False
        Flag to show progress bars or not.
    tags: str or list of str, default=None
        Tags are returned by the function as-is and are useful as a way to distinguish
        the results when running the function many times in parallel.
    opt_space_kwards : dict
        Additional keyworded parameters to pass to the `optimization_space`
        function.

The optimization can be carried out once with `cluster_selection_method='eom'` and once with `cluster_selection_method='leaf'`:

In [234]:
def optimize_for_tag(model, X_train, y_train, budget, tag):
    model.set_params(**{"base_clusterer__assign_clusters__cluster_selection_method": tag})
    return optimize(model, X_train, y_train, budget=budget, 
                    tags=tag, multivariate=False)


In [235]:
clusterer = ClusterFeatures(min_cluster_size=20, 
                            transformer=mcc_features, 
                            output_name='cluster')

reg_grouped = GroupedPredictor(model_structure=model_structure, 
                               group_feature='cluster',
                               estimator_params=(('alpha', 0.01), ('fit_intercept', False)),
)

model = CompositePredictor(distance_metrics=distance_metrics, 
                           base_clusterer=clusterer,
                           base_regressor=reg_grouped,
                           group_feature='cluster'
)

In [None]:
%%time
budget = 15

parallel = Parallel(n_jobs=2)
results = parallel(
        delayed(optimize_for_tag)(
                clone(model), 
                X_train, 
                y_train,
                budget, 
                tag
        ) for tag in ['eom', 'leaf']
)

In [None]:
scores = None
params= None

for res in results:
    scores = pd.concat((scores, res['scores']), ignore_index=True)
    params = pd.concat(
    ( params,
      res['params'].assign(**{'assign_clusters__cluster_selection_method': res['tags']})
    ), ignore_index=True)

print(scores)

The optimization is multi-objective, so the resulting parameter combinations constitute a Pareto front. We can keep the best of them (here, the five with the lowest CV(RMSE)) and create an ensemble out of them:

In [238]:
to_include = scores.sort_values(by='CVRMSE').iloc[:5]
params_ = params.loc[to_include.index]
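
With two objectives (minimize CV(RMSE), maximize explained variance), a trial is on the Pareto front if no other trial is at least as good on both objectives and strictly better on one. The helper below illustrates the definition on made-up scores; it is not part of `eensight`.

```python
import pandas as pd


def pareto_front(scores, minimize="CVRMSE", maximize="ExVAR"):
    """Return the rows of `scores` that are not dominated by any other row."""
    keep = []
    for _, row in scores.iterrows():
        dominated = (
            (scores[minimize] <= row[minimize])
            & (scores[maximize] >= row[maximize])
            & ((scores[minimize] < row[minimize]) | (scores[maximize] > row[maximize]))
        ).any()
        keep.append(not dominated)
    return scores[keep]


# Made-up scores for four hypothetical trials.
trial_scores = pd.DataFrame(
    {"CVRMSE": [0.20, 0.18, 0.25, 0.18], "ExVAR": [0.90, 0.85, 0.80, 0.88]}
)
front = pareto_front(trial_scores)
print(front)
```

Trials 1 and 2 are dominated by trial 3, so only trials 0 and 3 remain: neither is better than the other on both objectives at once.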

In [241]:
en_model = EnsemblePredictor(base_estimator=model,
                             ensemble_parameters=params_.to_dict('records')
)

In [None]:
%%time
en_model = en_model.fit(X_train, y_train)

In [None]:
%%time
pred = en_model.predict(X_train)

In [None]:
y_true = y_train

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true[['consumption']], pred[['consumption']])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true[['consumption']], pred[['consumption']])*100}")

The number of parameters is:

In [None]:
en_model.n_parameters

The individual components propagate from the local models:

In [None]:
pred = en_model.predict(X_train, include_components=True)
pred.head()

In [247]:
assert np.allclose(pred['consumption'],
    pred[[col for col in pred.columns if col not in ('consumption', 'cluster')]].sum(axis=1)
)

We can plot the standardized residuals, adding this time information about the matrix profile scores: 

In [275]:
resid = y_true[['consumption']] - pred[['consumption']]
resid = StandardScaler().fit_transform(resid).squeeze()
resid = pd.Series(data=resid, index=y_train.index)

In [276]:
mp_scores = catalog.load('train.matrix_profile_scores')

discords = mp_scores[mp_scores['nnd'] >= mp_scores['nnd'].quantile(0.99)]
discords = data_train.loc[np.isin(data_train.index.date, discords.index.date)]

In [None]:
discord_resids = resid.loc[discords.index]

with plt.style.context('seaborn-whitegrid'):
    grid = sns.jointplot(x=pred['consumption'], y=resid)
    sns.scatterplot(x=pred.loc[discord_resids.index, 'consumption'], y=discord_resids, 
                    color='#feb24c', ax=grid.ax_joint)
    grid.fig.set_figwidth(12)
    grid.fig.set_figheight(5)
    grid.set_axis_labels(xlabel='Predicted Value', ylabel='Standardized Residuals')

This is the main value of identifying discords: discords that the model does not handle well are the subsets of the data for which additional information would be most beneficial. For the dataset at hand, getting this information is easy, because all discords correspond to holidays:



In [None]:
discords['holiday'].value_counts()

`eensight` includes a model specification for datasets with holiday information at *eensight/conf/base/models/towt_holidays.yaml*:

```yaml
add_features:
  time:
    type: datetime
    subset: month, hourofweek, hour
  
regressors:
  month:
    feature: month
    type: categorical
    encode_as: onehot

  tow:
    feature: hourofweek
    type: categorical
    max_n_categories: 60  
    encode_as: onehot

  hour:
    feature: hour
    type: categorical
    max_n_categories: 12  
    encode_as: onehot
    interaction_only: true

  holidays:
    feature: holiday
    type: categorical
    max_n_categories: 3
    excluded_categories: _novalue_ # default value for imputing missing categorical data
    stratify_by: hour 
    min_samples_leaf: 1
    interaction_only: true
  
  lin_temperature:
    feature: temperature
    type: linear
  
  flex_temperature:
    feature: temperature
    type: spline
    n_knots: 5
    degree: 1
    strategy: uniform 
    extrapolation: constant
    interaction_only: true

interactions:
  hour, holidays: ~
  tow, flex_temperature:
    tow:
      max_n_categories: 2 
      stratify_by: temperature 
      min_samples_leaf: 15 
```

This model adds to the `towt` model an interaction term between the holiday feature and the hour of the day, in order to correct the contribution of the hour of the week depending on whether the day is a holiday or not.

In [296]:
catalog = load_catalog('demo', model='towt_holidays')
model_structure = catalog.load('model_structure')

In [297]:
X_train = data_train.loc[~data_train['consumption_outlier'], ['temperature', 'holiday']]
y_train = data_train.loc[X_train.index, ['consumption']]

In [298]:
clusterer = ClusterFeatures(min_cluster_size=20, 
                            transformer=mcc_features, 
                            output_name='cluster')

reg_grouped = GroupedPredictor(model_structure=model_structure, 
                               group_feature='cluster',
                               estimator_params=(('alpha', 0.01), ('fit_intercept', False)),
)

model = CompositePredictor(distance_metrics=distance_metrics, 
                           base_clusterer=clusterer,
                           base_regressor=reg_grouped,
                           group_feature='cluster'
)

In [None]:
%%time
budget = 15

parallel = Parallel(n_jobs=2)
results = parallel(
        delayed(optimize_for_tag)(
                clone(model), 
                X_train, 
                y_train,
                budget, 
                tag
        ) for tag in ['eom', 'leaf']
)

In [None]:
scores = None
params= None

for res in results:
    scores = pd.concat((scores, res['scores']), ignore_index=True)
    params = pd.concat(
    ( params,
      res['params'].assign(**{'assign_clusters__cluster_selection_method': res['tags']})
    ), ignore_index=True)

print(scores)

In [301]:
to_include = scores.sort_values(by='CVRMSE').iloc[:5]
params_ = params.loc[to_include.index]

In [302]:
en_model = EnsemblePredictor(base_estimator=model,
                             ensemble_parameters=params_.to_dict('records')
)

In [None]:
%%time
en_model = en_model.fit(X_train, y_train)

In [None]:
%%time
pred = en_model.predict(X_train)

In [None]:
y_true = y_train

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true[['consumption']], pred[['consumption']])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true[['consumption']], pred[['consumption']])*100}")

In [None]:
en_model.n_parameters

In [None]:
resid = y_true[['consumption']] - pred[['consumption']]
resid = StandardScaler().fit_transform(resid).squeeze()
resid = pd.Series(data=resid, index=y_train.index)

In [None]:
discord_resids = resid.loc[discords.index]

with plt.style.context('seaborn-whitegrid'):
    grid = sns.jointplot(x=pred['consumption'], y=resid)
    sns.scatterplot(x=pred.loc[discord_resids.index, 'consumption'], y=discord_resids, 
                    color='#feb24c', ax=grid.ax_joint)
    grid.fig.set_figwidth(12)
    grid.fig.set_figheight(5)
    grid.set_axis_labels(xlabel='Predicted Value', ylabel='Standardized Residuals')

-----------------