# The baselining pipeline with a distance-based predictive model

Global non-linear models, such as Random Forests, Gradient Boosting Trees or Neural Networks, can be very effective at predicting building energy consumption. This is evident in the winning solutions of the [ASHRAE Great Energy Predictor III competition](https://www.kaggle.com/c/ashrae-energy-prediction) hosted on Kaggle: the top five approaches used combinations of Gradient Boosting Trees (LightGBM, CatBoost, XGBoost) and Neural Networks (feed-forward and convolutional networks).

However, interpreting and auditing a global non-linear model is not trivial. Here, model interpretation refers to the transparency of an algorithm's decisions: the ability to identify what the algorithm has learned and which subset of the observations influenced it the most.

An alternative to global non-linear models is an ensemble of local linear models. The general recipe for developing an ensemble of such models comprises the following steps:

1. Select the linear model to be used as the local estimator (i.e. the building block of the ensemble);
2. Define a way to quantify the notion of locality, i.e. when two observations are close enough to be handled by the same local model;
3. Define a strategy for selecting the observations to train each local model and for combining the results of all individual models into one prediction.
 
Given a distance metric, a general template for creating such a predictive model ensemble is:

1. Find a small set of anchors that adequately summarize the dataset 
2. Define a way to quantify the distance between each anchor and all remaining observations in the dataset 
3. Convert the distances of the previous step into weights
4. Use the weights to randomly sample from the available dataset, creating one sample per anchor. Observations that are close to an anchor are more likely to be selected than those far away.
5. Fit one model per anchor using the relevant sample from step 4. 
6. Predict using the weighted sum of the models' predictions.
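
The six steps above can be sketched end to end on toy data. This is a minimal illustration and not the `eensight` implementation: the one-dimensional data, the absolute-difference distance, the Gaussian kernel, the kernel width and the helper names (`fit_ridge`, `kernel`, `predict`, `rmse`) are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the slope of the relationship changes halfway through the range of x.
x = np.linspace(0.0, 10.0, 400)
y = np.where(x < 5.0, 2.0 * x, 15.0 - x) + rng.normal(0.0, 0.1, x.size)


def fit_ridge(x, y, alpha=0.1):
    """Closed-form ridge fit of y = b0 + b1 * x (b0 is penalized too, for brevity)."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(A.T @ A + alpha * np.eye(2), A.T @ y)


def kernel(dist, width):
    """Gaussian kernel: distance 0 maps to weight 1, large distances to ~0."""
    return np.exp(-(dist ** 2) / (2.0 * width ** 2))


# Steps 1-2: anchors spread uniformly over x; distance is the absolute difference.
anchors = np.linspace(x.min(), x.max(), 4)
dist = np.abs(x[:, None] - anchors[None, :])  # shape (n_samples, n_anchors)

# Step 3: convert distances into weights.
width = 0.5 * dist.std()
weights = kernel(dist, width)

# Steps 4-5: draw one weighted sample per anchor and fit one model on it.
coefs = []
for j in range(anchors.size):
    p = weights[:, j] / weights[:, j].sum()
    idx = rng.choice(x.size, size=x.size, replace=True, p=p)
    coefs.append(fit_ridge(x[idx], y[idx]))
coefs = np.array(coefs)  # shape (n_anchors, 2)


# Step 6: combine the local predictions with normalized kernel weights.
def predict(x_new):
    x_new = np.asarray(x_new, dtype=float)
    w = kernel(np.abs(x_new[:, None] - anchors[None, :]), width)
    w = w / w.sum(axis=1, keepdims=True)
    local = coefs[:, 0] + coefs[:, 1] * x_new[:, None]  # (n_new, n_anchors)
    return (w * local).sum(axis=1)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


b0, b1 = fit_ridge(x, y)  # a single global linear model, for comparison
print("global linear RMSE:", rmse(b0 + b1 * x, y))
print("local ensemble RMSE:", rmse(predict(x), y))
```

Each local model stays a plain linear fit that can be inspected on its own, while the kernel weights make explicit which observations influenced it.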



## The linear model used as the local estimator

The default local linear model used in `eensight` is defined in *eensight/conf/base/models/towt.yaml* as:

```yaml
add_features:
  time:
    feature: null 
    type: datetime
    remainder: passthrough
    subset: month, hourofweek

regressors:
  month:
    feature: month
    type: categorical
    max_n_categories: null
    encode_as: onehot 
  
  tow:
    feature: hourofweek
    type: categorical
    max_n_categories: null 
    stratify_by: null 
    excluded_categories: null 
    encode_as: onehot 
  
  lin_temperature:
    feature: temperature
    type: linear
    include_bias: false 

  spl_temperature:
    feature: temperature
    type: spline
    n_knots: 4
    degree: 2
    strategy: quantile
    extrapolation: constant
    interaction_only: True

interactions:
  tow, spl_temperature:
    tow:
      max_n_categories: 5
      stratify_by: temperature 
      min_samples_leaf: 20
```

We can apply the local model to the whole demo dataset. This functionality is provided by `eensight.models.LinearPredictor`:

### `eensight.models.LinearPredictor` 

    Parameters
    ----------
    model_structure : dict
        The model configuration
    alpha : float (default=0.1)
        Regularization strength of the underlying ridge regression; must be a positive 
        float. Regularization improves the conditioning of the problem and reduces the 
        variance of the estimates. Larger values specify stronger regularization.
    fit_intercept : bool (default=False)
        Whether to fit the intercept for this model. If set to false, no intercept will 
        be used in calculations.
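
To illustrate what `alpha` does, here is a minimal sketch of ridge regression without an intercept, solved in closed form with NumPy. The synthetic data and the helper name `ridge_coefficients` are assumptions of this example; it does not reproduce the internals of `LinearPredictor`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)


def ridge_coefficients(X, y, alpha):
    """Minimize ||y - X b||^2 + alpha * ||b||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Stronger regularization shrinks the coefficient vector towards zero.
for alpha in (0.1, 10.0, 1000.0):
    b = ridge_coefficients(X, y, alpha)
    print(alpha, np.round(b, 3), round(float(np.linalg.norm(b)), 3))
```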

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

In [3]:
from eensight.utils.jupyter import load_catalog
from eensight.models import LinearPredictor, CalendarEnsemble, ParetoEnsemble
from eensight.pipelines.model_selection import cvrmse, nmbe, optimize


### Load the demo catalog

In [24]:
catalog = load_catalog('demo', model='towt')


### Load the training data from the catalog

In [5]:
data_train = catalog.load('train.preprocessed_data')
data_train = data_train.dropna()

X_train = data_train.loc[~data_train['consumption_outlier'], ['temperature']]
y_train = data_train.loc[X_train.index, ['consumption']]

In [19]:
model_structure = catalog.load('model_structure')

local_reg = LinearPredictor(model_structure=model_structure, alpha=0.1)

Fit the model on the training data:

In [None]:
%%time
local_reg = local_reg.fit(X_train, y_train)

Evaluate the model in-sample:

In [None]:
%%time
pred = local_reg.predict(data_train[['temperature']])

In [9]:
resid = data_train[['consumption']] - pred[['consumption']]
resid = StandardScaler().fit_transform(resid).squeeze()
resid = pd.Series(data=resid, index=data_train.index)

In [None]:
outliers = resid[data_train['consumption_outlier']]

with plt.style.context('seaborn-whitegrid'):
    grid = sns.jointplot(x=pred['consumption'], y=resid)
    sns.scatterplot(x=pred.loc[outliers.index, 'consumption'], y=outliers, 
                    color='red', ax=grid.ax_joint)
    grid.fig.set_figwidth(12)
    grid.fig.set_figheight(5)
    grid.set_axis_labels(xlabel='Predicted Value', ylabel='Standardized Residuals')

In [None]:
y_true = data_train[['consumption']]

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
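
For reference, the two metrics can be sketched as follows. These are common textbook definitions and an assumption of this example: the `cvrmse` and `nmbe` functions of `eensight` may use a different sign convention or a degrees-of-freedom correction in the denominator.

```python
import numpy as np


def cvrmse_sketch(y_true, y_pred):
    """Root mean squared error divided by the mean of the observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true))


def nmbe_sketch(y_true, y_pred):
    """Mean residual (observed minus predicted) divided by the mean observed value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_true - y_pred) / np.mean(y_true))


print(cvrmse_sketch([10, 20, 30], [12, 18, 30]))
print(nmbe_sketch([10, 20, 30], [9, 19, 29]))
```

CV(RMSE) measures the spread of the errors, while NMBE measures their systematic bias; positive and negative errors cancel out in NMBE but not in CV(RMSE).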

The number of parameters is:

In [None]:
local_reg.n_parameters

This is our benchmark. Any additional model complexity is justified only if it improves both metrics.

## A notion of locality

In the simplest case, the anchors can be chosen as days that uniformly cover the calendar span of the data, and the distance metric as the difference in calendar days between two observations.
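
A minimal sketch of this idea is given below, with one year of daily observations represented by their day index. Placing the anchors at the centres of equal-length calendar segments is an assumption of this example; how `CalendarEnsemble` actually places its anchors is not shown here.

```python
import numpy as np

n_estimators = 6
days = np.arange(365)  # day index of each observation (one observation per day)

# Anchors: centres of `n_estimators` equal-length segments of the calendar span.
edges = np.linspace(days.min(), days.max(), n_estimators + 1)
anchors = (edges[:-1] + edges[1:]) / 2

# Distance: absolute difference in calendar days between observation and anchor.
dist = np.abs(days[:, None] - anchors[None, :])  # shape (n_days, n_estimators)

# The closest anchor of each day changes monotonically along the calendar.
nearest = dist.argmin(axis=1)
print(np.round(anchors, 1))
print(np.bincount(nearest))
```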

The relevant functionality is provided by `eensight.models.CalendarEnsemble`:

### `eensight.models.CalendarEnsemble` 

A TOWT-like model using ensemble prediction.

    Parameters
    ----------
    model_structure : dict
        The base model's configuration
    n_estimators : int (default=1)
        The number of estimators.
    sigma : float (default=0.5)
        It controls the kernel width that generates weights for sampling the dataset for
        each estimator. Generally, only values between 0.1 and 2 make practical sense.
    weight_method : str, default='softmin'
        Defines how the individual predictions from the estimators are weighted to
        generate the final prediction. It can be 'softmin' or 'argmin'.
    cache_location : str, pathlib.Path or None
        The path to use as a data store for the metric calculations. If None is given, no
        caching of the calculations is done.
    alpha : float (default=0.1)
        Regularization strength of the underlying ridge regression; must be a positive 
        float. Regularization improves the conditioning of the problem and reduces the 
        variance of the estimates. Larger values specify stronger regularization.
    fit_intercept : bool (default=False)
        Whether to fit the intercept for this model. If set to false, no intercept will be 
        used in calculations.
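
The two values of `weight_method` can be read as two ways of turning anchor distances into prediction weights. The sketch below is one plausible interpretation and an assumption of this example: `CalendarEnsemble` may scale the distances or define the weights differently.

```python
import numpy as np


def softmin_weights(dist):
    """Softmax of the negated distances: closer anchors get larger weights."""
    z = -np.asarray(dist, dtype=float)
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def argmin_weights(dist):
    """All the weight goes to the single closest anchor."""
    dist = np.asarray(dist, dtype=float)
    w = np.zeros_like(dist)
    w[np.arange(dist.shape[0]), dist.argmin(axis=1)] = 1.0
    return w


dist = np.array([[0.0, 1.0, 2.0],
                 [3.0, 0.5, 0.6]])
print(np.round(softmin_weights(dist), 3))
print(argmin_weights(dist))
```

Under this reading, 'softmin' blends neighbouring estimators smoothly, whereas 'argmin' switches abruptly from one estimator to the next.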

In [51]:
model = CalendarEnsemble(model_structure=model_structure,
                         n_estimators=6,
                         sigma=1,
                         weight_method="softmin",
                         cache_location=None,
                         alpha=0.1,
                         fit_intercept=False
)

In [None]:
%%time
model = model.fit(X_train, y_train)

We can visualize the weights that were used by each of the ensemble's estimators to sample the dataset:

In [53]:
X_for_metric = np.array(model.metric_transformer.fit_transform(X_train))

distances = pairwise_distances(
    X_for_metric, model.anchors_, metric=model.metric_
)

weights = np.exp(-(distances ** 2) / (2*(model.sigma * distances.std())**2))
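
Written out, the last line of the cell above is a Gaussian kernel applied to the distance matrix. With $d_{ij}$ the distance between observation $i$ and anchor $j$, $s_d$ the standard deviation of all entries of the distance matrix, and $\sigma$ the value of `model.sigma`:

$$
w_{ij} = \exp\left(-\frac{d_{ij}^{2}}{2\,(\sigma\, s_d)^{2}}\right)
$$

An observation that coincides with an anchor gets weight 1, and the weight decays towards 0 as the distance grows. Smaller values of $\sigma$ give a narrower kernel, so each estimator concentrates on observations closer to its anchor.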

We plot the weights of the first four estimators in the ensemble: 

In [None]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 7), dpi=96)

    # Show the weights of at most the first four estimators
    n_shown = min(4, weights.shape[1])
    layout = (2 + n_shown, 1)

    # Top panel: the consumption time series
    ax0 = plt.subplot2grid(layout, (0, 0), rowspan=2)
    ax0.plot(y_train.index, y_train.values, alpha=0.5)

    # One panel per estimator: the sampling weights over time
    for i in range(n_shown):
        ax = plt.subplot2grid(layout, (2 + i, 0))
        ax.plot(y_train.index, weights[:, i])
        ax.legend(['weights'], frameon=True, shadow=True)

fig.tight_layout()

Evaluate the model in-sample:

In [None]:
%%time
pred = model.predict(data_train[['temperature']])

The number of parameters is:

In [None]:
model.n_parameters

In [None]:
y_true = data_train['consumption']

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred['consumption'])*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred['consumption'])*100}")

We can access the individual components of the predictions too:

In [None]:
%%time
pred = model.predict(data_train[['temperature']], include_components=True)
pred.head()

In [23]:
assert np.allclose(pred['consumption'],
            pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)

The parameters `n_estimators` and `sigma` can and should be optimized. The relevant functionality is provided by `eensight.pipelines.model_selection.optimize`:

### `eensight.pipelines.model_selection.optimize` 

    Parameters
    ----------
    estimator : Any regressor with scikit-learn API (i.e. with fit and
        predict methods) 
        The object to use to fit the data.
    X : pandas dataframe of shape (n_samples, n_features)
        The input data to optimize on.
    y : pandas dataframe of shape (n_samples, 1)
        The training target data to optimize on.
    n_repeats : int, default=2
        Number of times to repeat the train/test data split process.
    test_size : float, default=0.2
        The proportion of the dataset to include in the test split. Should be between
        0.0 and 1.0.
    target_name : str, default='consumption'
        It is expected that both y and the predictions of the `estimator` are
        dataframes with a single column, the name of which is the one provided
        for `target_name`.
    budget : int, default=20
        The number of trials. If this argument is set to `None`, there is no
        limitation on the number of trials. If `timeout` is also set to `None`,
        the study continues to create trials until it receives a termination
        signal such as Ctrl+C or SIGTERM.
    timeout : int, default=None
        Stop study after the given number of second(s). If this argument is set to
        `None`, the study is executed without time limitation.
    scorers : dict, default=None
        dict mapping scorer name to a callable. The callable object
        should have signature ``scorer(y_true, y_pred)``.
        The default value is:
        `OrderedDict(
            {
                "CVRMSE": lambda y_true, y_pred:
                    eensight.pipelines.model_selection.cvrmse(
                        y_true[target_name], y_pred[target_name]
                    ),
                "AbsNMBE": lambda y_true, y_pred: np.abs(
                    eensight.pipelines.model_selection.nmbe(
                        y_true[target_name], y_pred[target_name]
                    )
                ),
            }
        )`
    directions : list, default=None
        A sequence of directions during multi-objective optimization. Set
        ``minimize`` for minimization and ``maximize`` for maximization.
        The default value is ['minimize', 'minimize'].
    optimization_space : callable, default=None
        A function that takes an `optuna.trial.Trial` as input and returns
        a parameter combination to try. If it is None, the `estimator` should
        have an `optimization_space` function.
    multivariate: bool, default=False
        If `True`, the multivariate TPE (Tree-structured Parzen Estimator)
        is used when suggesting parameters.
    out_of_sample: bool, default=True
        Whether the optimization should be based on out-of-sample (if `True`) or
        in-sample (if `False`) performance.
    verbose : bool, default=False
        Flag to show progress bars or not.
    tags: str or list of str, default=None
        Tags are returned by the function as-is and are useful as a way to distinguish
        the results when running the function many times in parallel.
    opt_space_kwards : dict
        Additional keyword arguments to pass to the `optimization_space`
        function.

The default is to optimize for out-of-sample performance. If the goal is to have the best fit on the available data, `out_of_sample` should be set to `False`.
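
The roles of `n_repeats` and `test_size` in an out-of-sample evaluation can be sketched as a repeated hold-out loop. This is a hypothetical illustration: the helper `repeated_holdout`, the random (rather than time-ordered) split and the straight-line stand-in estimator are assumptions of this example, not a description of how `optimize` splits the data.

```python
import numpy as np


def repeated_holdout(fit, predict, X, y, scorer, n_repeats=2, test_size=0.2, seed=0):
    """Average a score over `n_repeats` random train/test splits."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_size * len(y))))
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        test, train = order[:n_test], order[n_test:]
        model = fit(X[train], y[train])  # fit on the training part only
        scores.append(scorer(y[test], predict(model, X[test])))
    return float(np.mean(scores))


# Usage with a straight-line fit as a stand-in estimator.
X = np.linspace(0.0, 1.0, 100)
y = 3.0 * X + 1.0
score = repeated_holdout(
    fit=lambda X, y: np.polyfit(X, y, deg=1),
    predict=lambda coef, X: np.polyval(coef, X),
    X=X,
    y=y,
    scorer=lambda y_true, y_pred: np.sqrt(np.mean((y_true - y_pred) ** 2)),
)
print(score)
```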

The `CalendarEnsemble` has a static method `optimization_space`:

```python
def optimization_space(trial, **kwargs):
    n_estimators_lower = kwargs.get("n_estimators_lower") or 2
    n_estimators_upper = kwargs.get("n_estimators_upper") or 12
    sigma_lower = kwargs.get("sigma_lower") or 0.1
    sigma_upper = kwargs.get("sigma_upper") or 2.0

    param_space = {
        "n_estimators": trial.suggest_int(
            "n_estimators", n_estimators_lower, n_estimators_upper
        ),
        "sigma": trial.suggest_float("sigma", sigma_lower, sigma_upper),
    }
    return param_space
```
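
According to the `optimization_space` parameter documented above, a custom function with the same signature can be supplied instead. The sketch below shows what such a function might look like. The function name, the bounds and the log-uniform sampling of `sigma` are assumptions chosen for illustration, and `MidpointTrial` is a stand-in for `optuna.trial.Trial` that exists only so that the example runs without Optuna.

```python
def narrow_optimization_space(trial, **kwargs):
    """Hypothetical search space: few estimators, log-uniform `sigma`."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 2, 6),
        "sigma": trial.suggest_float("sigma", 0.1, 1.0, log=True),
    }


class MidpointTrial:
    """Stand-in for `optuna.trial.Trial`: always suggests the middle of the range."""

    def suggest_int(self, name, low, high):
        return (low + high) // 2

    def suggest_float(self, name, low, high, log=False):
        # Geometric mean on a log scale, arithmetic mean otherwise.
        return (low * high) ** 0.5 if log else (low + high) / 2


print(narrow_optimization_space(MidpointTrial()))
```

Sampling `sigma` on a log scale spends more trials on narrow kernels, which may be useful when small values are expected to matter most.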

In [None]:
%%time
res = optimize(model, X_train, y_train, budget=20, test_size=0.2, n_repeats=2,
               out_of_sample=True, multivariate=True)

The optimization results constitute a Pareto front over the two objectives, the CVRMSE and the absolute NMBE:

In [None]:
pd.concat((res.scores, res.params), axis=1)
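
A Pareto front contains the trials that no other trial beats on both objectives at once. The sketch below extracts such a front from a list of score pairs. The helper `pareto_front` and the example scores are made up for illustration; they are not results of the optimization above.

```python
def pareto_front(points):
    """Return the indices of non-dominated points (every objective is minimized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(qk <= pk for qk, pk in zip(q, p))
            and any(qk < pk for qk, pk in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Illustrative (CVRMSE, AbsNMBE) pairs.
scores = [(0.20, 0.010), (0.18, 0.030), (0.25, 0.005), (0.22, 0.020), (0.18, 0.040)]
print(pareto_front(scores))
```

A trial is dropped only if another trial is at least as good on both metrics and strictly better on at least one of them.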

We can take the five parameter combinations with the lowest CVRMSE and create an ensemble out of them:

In [26]:
to_include = res.scores.sort_values(by='CVRMSE').iloc[:5]
params_ = res.params.loc[to_include.index]

In [29]:
model_ens = ParetoEnsemble(
                base_estimator=model,
                ensemble_parameters=params_.to_dict('records')
)

In [None]:
%%time
model_ens = model_ens.fit(X_train, y_train)

In [None]:
%%time
pred = model_ens.predict(data_train[['temperature']])

In [None]:
y_true = data_train[['consumption']]

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")

In [None]:
model_ens.n_parameters

The individual components propagate from the ensemble's individual models:

In [None]:
%%time
pred = model_ens.predict(data_train[['temperature']], include_components=True)
pred.head()

In [36]:
assert np.allclose(pred['consumption'],
            pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)

Finally, we can visualize the residuals:

In [37]:
resid = data_train[['consumption']] - pred[['consumption']]
resid = StandardScaler().fit_transform(resid).squeeze()
resid = pd.Series(data=resid, index=data_train.index)

In [None]:
outliers = resid[data_train['consumption_outlier']]

with plt.style.context('seaborn-whitegrid'):
    grid = sns.jointplot(x=pred['consumption'], y=resid)
    sns.scatterplot(x=pred.loc[outliers.index, 'consumption'], y=outliers, 
                    color='red', ax=grid.ax_joint)
    grid.fig.set_figwidth(12)
    grid.fig.set_figheight(5)
    grid.set_axis_labels(xlabel='Predicted Value', ylabel='Standardized Residuals')

----------