# Design principles

In [1]:
import os
import json
import pprint
import feature_encoders.settings

from omegaconf import OmegaConf

In [2]:
import eensight.settings

from eensight.config import OmegaConfigLoader
from eensight.utils.jupyter import load_catalog

## Data Pipelines

All the functionality in `eensight` is organized around data pipelines. Each pipeline consumes data and other artifacts (such as models) produced by a previous pipeline, and produces new data and artifacts for its successor pipelines. Up to this point, there are six (6) pipelines:

| Pipeline name 	| Pipeline description                                                                                                                                                                                                                                                                              	|
|:---------------	|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|
| *preprocess*    	| Merges (if necessary) and validates input data, identifies potential data drift, identifies potential outliers, evaluates data adequacy, and imputes missing values.                                                                                                                                              	|
| *daytype*       	| Finds consumption profile prototypes and estimates a distance metric that translates calendar information (month of year and day of week) to daily or sub-daily consumption profile similarity. Prototypes are a small set of daily or sub-daily profiles that adequately summarize the available data.              	|
| *baseline*      	| Optimizes a predictive model for out-of-sample performance, fits the optimized model on the available training data, and evaluates its performance in-sample. 	|
| *validate*      	| Cross-validates the optimized predictive model and builds a conformal predictor to construct uncertainty intervals. 	|
| *predict*       	|     Uses the optimized predictive model and the conformal model of the previous stage to generate predictions on pre- and post-retrofit data, adding uncertainty intervals with user-provided confidence levels.                                                                                                               	|
| *compare*       	| Estimates cumulative savings given the optimized predictive model and post-retrofit data, while adding uncertainty intervals with user-provided confidence levels.                                                                                                                                 	|


Each pipeline has up to two (2) different versions, each corresponding to one of the two (2) **namespaces** in `eensight`:

* **train**: Any dataset or pipeline that exists in the `train` namespace will be used for model training and evaluation. 


* **apply**: Any dataset or pipeline that exists in the `apply` namespace will be used for counterfactual prediction and savings estimation. 

The associations between pipelines and namespaces are summarized below:

|            | train    | apply    |
|------------|----------|----------|
| preprocess | &#10004; | &#10004; |
| daytype    | &#10004; |          |
| baseline   | &#10004; |          |
| validate   | &#10004; |          |
| predict    | &#10004; | &#10004; |
| compare    |          | &#10004; |


`eensight` is a [Kedro](https://github.com/quantumblacklabs/kedro)-based application, so all pipelines are [Kedro pipelines](https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipeline_introduction.html).
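
The association between pipelines and namespaces can also be read as a plain lookup. The following is a minimal sketch, not part of `eensight`'s API: the `PIPELINE_NAMESPACES` dictionary and the `pipelines_for` helper are hypothetical names introduced only to restate the table in code.

```python
# Hypothetical restatement of the pipeline/namespace table; not eensight code.
PIPELINE_NAMESPACES = {
    "preprocess": {"train", "apply"},
    "daytype": {"train"},
    "baseline": {"train"},
    "validate": {"train"},
    "predict": {"train", "apply"},
    "compare": {"apply"},
}


def pipelines_for(namespace):
    """Return the pipelines that have a version in the given namespace."""
    return [
        name
        for name, namespaces in PIPELINE_NAMESPACES.items()
        if namespace in namespaces
    ]


train_pipelines = pipelines_for("train")
apply_pipelines = pipelines_for("apply")
```

Under this reading, a dataset placed in the `apply` namespace can only ever be touched by the `preprocess`, `predict` and `compare` pipelines.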

## Counterfactual prediction

With the forecast approach to counterfactual prediction, a predictive model is trained on the data from before the energy efficiency intervention and applied to the data from after the intervention:

<img src="images/forward.png" alt="catalogs" width="550"/>

For the `eensight` user, the application of the forecast approach requires only the assignment of the available datasets to the correct namespaces. The forecast approach requires that the `train` namespace corresponds to data before the energy efficiency intervention and the `apply` namespace corresponds to data after the energy efficiency intervention.
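
The idea behind the forecast approach can be illustrated with a toy example. This is a sketch under strong simplifying assumptions: a single-variable least-squares line stands in for the actual baseline model, the numbers are made up, and `fit_line` is a hypothetical helper rather than anything provided by `eensight`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# `train` namespace: made-up observations before the intervention
train_temperature = [0.0, 5.0, 10.0, 15.0]
train_consumption = [100.0, 90.0, 80.0, 70.0]

# `apply` namespace: made-up observations after the intervention
apply_temperature = [2.0, 8.0]
apply_consumption = [85.0, 75.0]

# Train on pre-intervention data ...
slope, intercept = fit_line(train_temperature, train_consumption)

# ... and predict what consumption would have been without the intervention
counterfactual = [intercept + slope * t for t in apply_temperature]

# Savings are the gap between the counterfactual and the observed consumption
savings = [c - a for c, a in zip(counterfactual, apply_consumption)]
cumulative_savings = sum(savings)
```

The model never sees post-intervention consumption during training, which is why assigning each dataset to the correct namespace matters.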

## Data catalogs

All pipelines get and store their datasets, parameters and models by interacting with a [data catalog](https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html). However, `eensight` extends and customizes the relevant functionality of Kedro so that multiple catalogs can coexist. While Kedro assumes a unique catalog for any Kedro project, `eensight` allows users to define a new catalog for each building site. The goal is to support the very common use case where M&V practitioners have access to a set of CSV or Excel files for each building they want to analyze. As a result, all `eensight` commands need the name of the catalog with which they will interact.

<img src="images/catalogs.png" alt="catalogs" width="550"/>

A data catalog is a YAML file that includes: 

- **The name of the building/site**

    Example:
    ```yaml
    site_name : demo
    ```

    `eensight` assumes that the data is located at a relative path composed of the site name, the name of the folder according to Kedro's [data engineering convention](https://kedro.readthedocs.io/en/stable/12_faq/01_faq.html#what-is-data-engineering-convention), and the corresponding namespace. The root of the relative path is defined in `eensight.settings.DATA_ROOT`.


- **Whether or not datasets and models should be versioned**. In order to enable [versioning](https://kedro.readthedocs.io/en/stable/05_data/02_kedro_io.html#versioning), you need to update the catalog file and set the versioned attribute to true.   

    Example:
    ```yaml
    versioned : true
    ```


- **The location of the building**. This information is used for automatically generating holiday information, if the selected model requires a "holiday" feature.

    Example:
    ```yaml
    location:
      country   : Greece  
      province  : null 
      state     : null 
    ```

- **A map of feature names**. This is a mapping between the feature names that `eensight` expects (consumption, temperature, holiday, timestamp) and the actual names used in the available datasets.

    Example:
    ```yaml
    rebind_names:
      consumption : eload
      temperature : temp 
      holiday     : null 
      timestamp   : dates
    ```

- **The data sources consumed and produced by `eensight`**. The information needed to describe a data source can be found in Kedro's documentation: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html

    Adding the appropriate namespace at the beginning of a dataset's name allows us to automate the process of selecting and running pipelines:

    ```yaml
    train.root_input:
      type: PartitionedDataSet
      path: demo/01_raw/train
      dataset:
        type: pandas.CSVDataSet
        load_args:
          sep: ','
          index_col: 0
      filename_suffix: '.csv'

    apply.root_input:
      type: PartitionedDataSet
      path: demo/01_raw/apply
      dataset:
        type: pandas.CSVDataSet
        load_args:
          sep: ','
          index_col: 0
      filename_suffix: '.csv'
    ```
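
To make the catalog conventions above more concrete, here is a rough sketch of how a namespaced dataset name, a data path and a `rebind_names` mapping could be interpreted. The helpers `split_namespace`, `data_path` and `rebind` are hypothetical and only illustrate the conventions; they are not `eensight` functions, and the `root` argument merely stands in for `eensight.settings.DATA_ROOT`.

```python
import posixpath


def split_namespace(dataset_name):
    """Split 'train.root_input' into its namespace and its dataset name."""
    namespace, _, name = dataset_name.partition(".")
    return namespace, name


def data_path(site_name, folder, namespace, root=""):
    """Compose <root>/<site name>/<convention folder>/<namespace>."""
    return posixpath.join(root, site_name, folder, namespace)


def rebind(record, rebind_names):
    """Rename the dataset's own column names to the expected feature names.

    Entries mapped to None (YAML null) are left untouched.
    """
    reverse = {
        actual: expected
        for expected, actual in rebind_names.items()
        if actual is not None
    }
    return {reverse.get(key, key): value for key, value in record.items()}


namespace, name = split_namespace("train.root_input")
path = data_path("demo", "01_raw", namespace)

rebind_names = {
    "consumption": "eload",
    "temperature": "temp",
    "holiday": None,
    "timestamp": "dates",
}
row = rebind({"dates": "2021-01-01", "eload": 12.5, "temp": 3.0}, rebind_names)
```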

One of the changes that `eensight` has introduced to Kedro's standard approach is a custom `ConfigLoader` (`eensight.config.OmegaConfigLoader`) that uses [OmegaConf](https://github.com/omry/omegaconf) as its backend. `OmegaConf` makes it easy to use [variable interpolation](https://omegaconf.readthedocs.io/en/latest/usage.html#variable-interpolation) when writing the configuration files. As an example, the `globals.yaml` file contains values that can be reused in other files:

*globals.yaml:*
```yaml
types:
  csv      : pandas.CSVDataSet
  json     : json.JSONDataSet
  multiple : PartitionedDataSet
  pickle   : pickle.PickleDataSet

folders:
  raw          : 01_raw
  intermediate : 02_intermediate
  primary      : 03_primary
  feature      : 04_feature
  model_input  : 05_model_input
  model        : 06_models
  model_output : 07_model_output
  report       : 08_reporting
```

... and then in *some_catalog.yaml:*
```yaml
site_name : demo

train.root_input:
    type: ${globals:types.multiple}
    path: ${site_name}/${globals:folders.raw}/train
    dataset:
      type: ${globals:types.csv}
      load_args:
        sep: ','
        index_col: 0
    filename_suffix: '.csv'
```
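
Conceptually, interpolation replaces each `${...}` reference with the value it points to. The sketch below mimics this with the standard library only. It is an illustration of the idea, not how `OmegaConf` or `OmegaConfigLoader` resolve references, and the `lookup` and `interpolate` helpers are hypothetical.

```python
import re

# A subset of the values from globals.yaml, written as a Python dictionary
GLOBALS = {
    "types": {"csv": "pandas.CSVDataSet", "multiple": "PartitionedDataSet"},
    "folders": {"raw": "01_raw"},
}

# Matches ${key} and ${globals:dotted.key}
_REFERENCE = re.compile(r"\$\{(?:(globals):)?([^}]+)\}")


def lookup(mapping, dotted_key):
    """Follow a dotted key such as 'folders.raw' through nested dictionaries."""
    value = mapping
    for part in dotted_key.split("."):
        value = value[part]
    return value


def interpolate(text, catalog, global_values):
    """Replace every reference in `text` with the value it points to."""

    def replace(match):
        source = global_values if match.group(1) else catalog
        return str(lookup(source, match.group(2)))

    return _REFERENCE.sub(replace, text)


catalog = {"site_name": "demo"}
path = interpolate("${site_name}/${globals:folders.raw}/train", catalog, GLOBALS)
dataset_type = interpolate("${globals:types.multiple}", catalog, GLOBALS)
```

Here `path` resolves to `demo/01_raw/train` and `dataset_type` to `PartitionedDataSet`, which are the literal values one would otherwise write by hand.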

It is not necessary to use variable interpolation in `eensight`'s configuration files, but it simplifies the process of writing them. One could also have defined *some_catalog.yaml* as:

```yaml
site_name : demo

train.root_input:
    type: PartitionedDataSet
    path: demo/01_raw/train
    dataset:
      type: pandas.CSVDataSet
      load_args:
        sep: ','
        index_col: 0
    filename_suffix: '.csv'
```

Variable interpolation also makes it easy to define **partial catalogs**. Since the largest part of a catalog declares intermediate and final artifacts, users may write only the part of the catalog that corresponds to the raw input data:

```yaml
site_name : demo
versioned : false   

train.root_input:
  type: PartitionedDataSet
  path: demo/01_raw/train
  dataset:
    type: pandas.CSVDataSet
    load_args:
      sep: ','
      index_col: 0
  filename_suffix: '.csv'

apply.root_input:
  type: PartitionedDataSet
  path: demo/01_raw/apply
  dataset:
    type: pandas.CSVDataSet
    load_args:
      sep: ','
      index_col: 0
  filename_suffix: '.csv'
```

> **Note**: The `site_name` and `versioned` entries are **required**, whereas `location` and `rebind_names` are not.

All commands and methods that take the name of a catalog as input also have an option to indicate whether the catalog is partial or complete. If the catalog is partial, it is merged with the template from `eensight.settings.CONF_ROOT/base/templates/_base.yaml` and interpolated.

In [3]:
base_path = os.path.join(eensight.settings.CONF_ROOT, "base")

if not os.path.isabs(base_path):
    base_path = os.path.join(eensight.settings.PROJECT_PATH, base_path)

The demo catalog is a partial catalog:

In [4]:
catalog_conf = OmegaConf.load(os.path.join(base_path, "catalogs", "demo", "catalog.yaml"))
print(json.dumps(OmegaConf.to_container(catalog_conf), indent=4))

{
    "site_name": "demo",
    "versioned": false,
    "location": {
        "country": "Italy",
        "province": null,
        "state": null
    },
    "rebind_names": {
        "consumption": null,
        "temperature": null,
        "holiday": null,
        "timestamp": null
    },
    "train.root_input": {
        "type": "PartitionedDataSet",
        "path": "demo/01_raw/train",
        "dataset": {
            "type": "pandas.CSVDataSet",
            "load_args": {
                "sep": ",",
                "index_col": 0
            }
        },
        "filename_suffix": ".csv"
    },
    "apply.root_input": {
        "type": "PartitionedDataSet",
        "path": "${site_name}/01_raw/apply",
        "dataset": {
            "type": "pandas.CSVDataSet",
            "load_args": {
                "sep": ",",
                "index_col": 0
            }
        },
        "filename_suffix": ".csv"
    }
}


The `eensight.framework.context.CustomContext` completes a partial catalog using an approach that is similar to the following:

In [5]:
config_loader = OmegaConfigLoader(conf_paths=[base_path], globals_pattern="globals*")

selected_catalog = "demo"
catalog_is_partial = True

catalog_search = [
    f"catalogs/{selected_catalog}.*",
    f"catalogs/{selected_catalog}/**",
    f"catalogs/**/{selected_catalog}.*",
]

if catalog_is_partial:
    catalog_search.append("templates/_base.*")

conf_catalog = config_loader.get(*catalog_search)
print(json.dumps(
            {
                key: val for key, val in conf_catalog.items() 
                if "model_prediction" in key
            },
         indent=4
      )
)

{
    "train.model_prediction": {
        "type": "pandas.CSVDataSet",
        "filepath": "demo/07_model_output/train/model_prediction.csv",
        "load_args": {
            "sep": ",",
            "index_col": 0,
            "parse_dates": [
                0
            ]
        },
        "save_args": {
            "index": true
        },
        "versioned": false
    },
    "apply.model_prediction": {
        "type": "pandas.CSVDataSet",
        "filepath": "demo/07_model_output/apply/model_prediction.csv",
        "load_args": {
            "sep": ",",
            "index_col": 0,
            "parse_dates": [
                0
            ]
        },
        "save_args": {
            "index": true
        },
        "versioned": false
    }
}


Another way to create a data catalog is by using the command `python -m eensight catalog create`. This command guides the user through a series of questions. It is a simple approach that is meant to work for input data in the form of CSV files.


## Energy consumption baseline modelling approach


Global non-linear models – such as Random Forest Trees, Gradient Boosting Trees or Neural Nets – can be very effective in terms of predicting building energy consumption. This is evident when one looks at the winning solutions of the [ASHRAE’s Great Energy Predictor III competition](https://www.kaggle.com/c/ashrae-energy-prediction) hosted by Kaggle; the top five (5) approaches utilized combinations of Gradient Boosting Trees (LightGBM, CatBoost, XGBoost) and Neural Nets (feed-forward networks and convolutional networks). 

However, interpretation and auditing of a global non-linear model is not trivial. Here, model interpretation is related to the transparency of an algorithm’s decisions, and the ability to identify what the algorithm has learned and what subset of the observations was most influential on what the algorithm learned.   

An alternative to global non-linear models is an ensemble of local linear models. The general recipe that `eensight` uses for developing an ensemble of such models comprises the following steps:

 1.	Select the linear base model to be used as the local estimator (i.e. the building block of the ensemble);
 2.	Define a way to quantify the notion of locality, i.e. when two (2) observations are close enough to be handled by the same local model;
 3. Find a rule to assign all observations into neighborhoods in feature space;
 4. Fit one model per neighborhood;
 5. Predict using the average of all the models' predictions. If a test data point is outside the neighborhood of a model, that model does not contribute to the prediction. 
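
The recipe above can be sketched in a few lines. This is a deliberately simplified illustration and not `eensight`'s implementation: there is a single feature, locality is defined by hand-picked intervals (the neighborhoods), and `fit_line`, `fit_ensemble` and `predict` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Step 1: the linear base model, y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def fit_ensemble(xs, ys, neighborhoods):
    """Steps 2-4: assign observations to neighborhoods and fit one model each."""
    models = []
    for low, high in neighborhoods:
        members = [(x, y) for x, y in zip(xs, ys) if low <= x <= high]
        local_xs, local_ys = zip(*members)
        slope, intercept = fit_line(local_xs, local_ys)
        models.append((low, high, slope, intercept))
    return models


def predict(models, x):
    """Step 5: average the predictions of the models whose neighborhood holds x."""
    votes = [
        intercept + slope * x
        for low, high, slope, intercept in models
        if low <= x <= high
    ]
    if not votes:
        raise ValueError(f"{x} is outside every neighborhood")
    return sum(votes) / len(votes)


# A V-shaped target that no single line can fit, but two local lines can
xs = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
ys = [abs(x) for x in xs]
models = fit_ensemble(xs, ys, neighborhoods=[(-4, 0), (0, 4)])
left, right = predict(models, -2), predict(models, 3)
```

Each local model remains an ordinary linear model, so its coefficients and the observations it was fitted on can be inspected directly.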

## Base model configuration

The structure of the linear base model can be defined using YAML files. These files have three sections: (a) added features, (b) regressors and (c) interactions.

### 1. Added features

The information in this section is passed to one of the feature generators. Most of the feature generators are provided by the [`feature-encoders` library](https://github.com/hebes-io/feature-encoders). There is extensive [documentation](https://feature-encoders.readthedocs.io/en/latest/Tutorial_Feature_Composition.html) on how feature generators are defined in `feature-encoders`, and how additional feature generators can be developed.  

As an example, to include a [`feature_encoders.generate.DatetimeFeatures`](https://feature-encoders.readthedocs.io/en/latest/feature_encoders.generate.html#feature_encoders.generate.DatetimeFeatures) generator in a model, one would have to add in the model YAML specification:

```yaml
add_features:
  time:
    type: datetime
    subset: month, hourofweek
```

`eensight` gets the information on how to map feature generator types to the classes for input validation and creation of the corresponding generators using an approach similar to the one below: 

In [6]:
feature_path = feature_encoders.settings.CONF_ROOT

if not os.path.isabs(feature_path):
    feature_path = os.path.join(
        feature_encoders.settings.PROJECT_PATH, feature_path
    )

In [7]:
config_loader = OmegaConfigLoader(conf_paths=[feature_path, base_path], 
                                  globals_pattern="globals*"
)

conf_features = config_loader.get(
    "features*", "features/**", "**/features*"
)

`TrendFeatures`, `DatetimeFeatures` and `CyclicalFeatures` generators are provided by `feature-encoders` (no need for a fully qualified class name), while `HolidayFeatures` and `OccupancyFeatures` are provided by `eensight.features` (fully qualified class names must be provided). 

In [8]:
print(json.dumps(conf_features, indent=4))

{
    "trend": {
        "validate": "validate.TrendSchema",
        "generate": "generate.TrendFeatures"
    },
    "datetime": {
        "validate": "validate.DatetimeSchema",
        "generate": "generate.DatetimeFeatures"
    },
    "cyclical": {
        "validate": "validate.CyclicalSchema",
        "generate": "generate.CyclicalFeatures"
    },
    "holidays": {
        "validate": "eensight.features.HolidaySchema",
        "generate": "eensight.features.HolidayFeatures"
    },
    "occupancy": {
        "validate": "eensight.features.OccupancySchema",
        "generate": "eensight.features.OccupancyFeatures"
    }
}


Of all the parameters under the name of a feature generator in a model configuration,

```yaml
add_features:
  time:
    type: datetime
    subset: month, hourofweek
```

the `type` is used to identify the correct classes for validation and creation. The remaining parameters are passed to the validation class and then to the `__init__` method of the creation class.
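
The dispatch on `type` can be sketched as follows. This is a simplified illustration, not the code `eensight` runs: the validation step is omitted, the helpers `load_obj` and `build_generator` are hypothetical, and a standard-library class (`datetime.timedelta`) stands in for a feature generator so that the example runs without `feature-encoders` installed.

```python
import importlib


def load_obj(dotted_path):
    """Import and return the object named by a fully qualified dotted path."""
    module_path, _, name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_path), name)


def build_generator(params, feature_map):
    """Use `type` to pick the class; pass everything else to its __init__."""
    params = dict(params)  # do not mutate the caller's configuration
    generator_type = params.pop("type")
    generator_class = load_obj(feature_map[generator_type]["generate"])
    return generator_class(**params)


# Stand-in feature map: `datetime.timedelta` plays the role of a generator class
feature_map = {"duration": {"generate": "datetime.timedelta"}}

generator = build_generator({"type": "duration", "hours": 2}, feature_map)
```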

### 2. Regressors

The information for each regressor includes its name, the name of the feature to encode in order to create this regressor, the type of the encoder (linear, spline or categorical), and the parameters to pass to the corresponding encoder class from `feature-encoders`:

- `IdentityEncoder` for type linear 
- `SplineEncoder` for type spline
- `CategoricalEncoder` for type categorical

 
Example:

```yaml
regressors:
  month:              # name of the regressor
    feature: month    # name of the feature 
    type: categorical # type of the encoder to apply on `feature` and create `regressor`
    encode_as: onehot # parameter for encoder's __init__
    
  tow:                     # name of the regressor
    feature: hourofweek    # name of the feature 
    type: categorical
    max_n_categories: 60   # lump together 168 categories into 60 by target similarity
    encode_as: onehot 
    
  flex_temperature:
    feature: temperature
    type: spline
    n_knots: 5
    degree: 1
    strategy: uniform 
    extrapolation: constant
    interaction_only: true # do not include it in main effects

```
 

### 3. Interactions

Interactions can introduce new regressors, reuse regressors already defined in the regressors section, as well as alter the parameters of regressors that are already defined in the regressors section. Interactions are always pairwise and always between encoders (and not features).

This functionality is also provided by [`feature-encoders`](https://feature-encoders.readthedocs.io/en/latest/Tutorial_Interactions.html).

```yaml
interactions:
  tow, flex_temperature:
    tow: # update the parameters of the tow encoder
      max_n_categories: 2 
      stratify_by: temperature 
      min_samples_leaf: 15 
```
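
A common way to build a pairwise interaction between two encoded features is to multiply every column of one encoding with every column of the other. The sketch below shows this for a single observation. It assumes this column-wise product construction, which may differ in detail from what `feature-encoders` does; `one_hot` and `interact` are hypothetical helpers, and the category labels and the two-column temperature encoding are made up.

```python
def one_hot(value, categories):
    """Encode a categorical value as one indicator column per category."""
    return [1.0 if value == category else 0.0 for category in categories]


def interact(left, right):
    """Pairwise interaction: every left column times every right column."""
    return [l * r for l in left for r in right]


# `tow` lumped into two categories (made-up labels)
tow_encoded = one_hot("occupied", ["occupied", "unoccupied"])

# A made-up two-column encoding of temperature for the same observation
temperature_encoded = [1.0, 18.5]

interaction = interact(tow_encoded, temperature_encoded)
```

With a one-hot encoding on one side, the interaction amounts to a separate set of temperature coefficients for each `tow` category, which is why lumping `tow` into fewer categories keeps the number of interaction columns manageable.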


`eensight` gets the information on model configurations using an approach similar to the one below: 

In [9]:
config_loader = OmegaConfigLoader(conf_paths=[feature_path, base_path], 
                                  globals_pattern="globals*"
)

selected_model = "towt"

conf_model = config_loader.get(
    f"base_models/{selected_model}.*",
    f"base_models/{selected_model}/**",
    f"**/base_models/{selected_model}.*",
)

print(json.dumps(conf_model, indent=4))

{
    "add_features": {
        "time": {
            "type": "datetime",
            "subset": "month, hourofweek"
        }
    },
    "regressors": {
        "month": {
            "feature": "month",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "tow": {
            "feature": "hourofweek",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "lin_temperature": {
            "feature": "temperature",
            "type": "linear"
        },
        "flex_temperature": {
            "feature": "temperature",
            "type": "spline",
            "n_knots": 5,
            "degree": 1,
            "strategy": "uniform",
            "extrapolation": "constant",
            "interaction_only": true
        }
    },
    "interactions": {
        "tow, flex_temperature": {
            "tow": {
                "max_n_categories": 2,
                "stratify_by": "temperature",
                "min_samples_leaf

## Parameter values

`eensight` pipelines get their parameter settings from YAML files in the *conf/base/parameters* directory.

```
conf
│   README.md 
│
└───base
│   │   globals.yaml
│   │   logging.yaml
│   │
│   └── parameters
│       ├── preprocess.yaml
│       ├── ...
   
```

Parameters are accessed and treated exactly as prescribed by the Kedro documentation: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#parameters

`eensight` gets the information for parameters using an approach similar to the one below:

In [10]:
config_loader = OmegaConfigLoader(conf_paths=[feature_path, base_path], 
                                  globals_pattern="globals*"
)

# we merge all YAML files under conf/base/parameters
params = config_loader.get(
    "parameters*", "parameters*/**", "**/parameters*"
)

pp = pprint.PrettyPrinter(indent=2)
pp.pprint(list(params.keys()))

[ 'of_optimize_model',
  'of_apply_compare',
  'of_prepare_data',
  'of_find_prototypes',
  'of_distance_learning',
  'daytyping_window',
  'of_apply_predict',
  'of_drift_detection',
  'find_outliers_for',
  'of_global_filter',
  'of_seasonal_decompose',
  'of_global_outlier',
  'of_local_outlier',
  'of_linear_impute',
  'max_missing_pct',
  'max_outlier_pct',
  'of_cross_validate',
  'of_conformal_predictor']


## Putting it all together

`load_catalog` is a utility function that loads a data catalog in the same way that `eensight.framework.context.CustomContext` does it:

In [11]:
catalog = load_catalog("demo", partial_catalog=True, base_model="towt")

The catalog includes data:

In [12]:
pp.pprint([item for item in catalog.list() if "model_prediction" in item])

['train.model_prediction', 'apply.model_prediction']


... information about the feature generators:

In [13]:
print(json.dumps(catalog.load("feature_map"), indent=4))

2021-11-21 16:23:46,077 - kedro.io.data_catalog - INFO - Loading data from `feature_map` (MemoryDataSet)...
{
    "trend": {
        "validate": "validate.TrendSchema",
        "generate": "generate.TrendFeatures"
    },
    "datetime": {
        "validate": "validate.DatetimeSchema",
        "generate": "generate.DatetimeFeatures"
    },
    "cyclical": {
        "validate": "validate.CyclicalSchema",
        "generate": "generate.CyclicalFeatures"
    },
    "holidays": {
        "validate": "eensight.features.HolidaySchema",
        "generate": "eensight.features.HolidayFeatures"
    },
    "occupancy": {
        "validate": "eensight.features.OccupancySchema",
        "generate": "eensight.features.OccupancyFeatures"
    }
}


... the selected model configuration:

In [14]:
print(json.dumps(catalog.load("model_config"), indent=4))

2021-11-21 16:23:48,959 - kedro.io.data_catalog - INFO - Loading data from `model_config` (MemoryDataSet)...
{
    "add_features": {
        "time": {
            "type": "datetime",
            "subset": "month, hourofweek"
        }
    },
    "regressors": {
        "month": {
            "feature": "month",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "tow": {
            "feature": "hourofweek",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "lin_temperature": {
            "feature": "temperature",
            "type": "linear"
        },
        "flex_temperature": {
            "feature": "temperature",
            "type": "spline",
            "n_knots": 5,
            "degree": 1,
            "strategy": "uniform",
            "extrapolation": "constant",
            "interaction_only": true
        }
    },
    "interactions": {
        "tow, flex_temperature": {
            "tow": {
         

... and all the parameters:

In [15]:
pp.pprint(list(catalog.load("parameters").keys()))

2021-11-21 16:23:54,022 - kedro.io.data_catalog - INFO - Loading data from `parameters` (MemoryDataSet)...
[ 'of_optimize_model',
  'of_apply_compare',
  'of_prepare_data',
  'of_find_prototypes',
  'of_distance_learning',
  'daytyping_window',
  'of_apply_predict',
  'of_drift_detection',
  'find_outliers_for',
  'of_global_filter',
  'of_seasonal_decompose',
  'of_global_outlier',
  'of_local_outlier',
  'of_linear_impute',
  'max_missing_pct',
  'max_outlier_pct',
  'of_cross_validate',
  'of_conformal_predictor']


## Run command arguments

The primary way of using `eensight` is through the command line: `python -m eensight run`

The command line interface is [Click](https://click.palletsprojects.com/)-based, so the `-h` and `--help` options provide information about the expected usage.

| Option 	| Description 	|
|:---	|:---	|
| catalog 	| The name of the catalog configuration file to load.  	|
| partial-catalog 	| Whether the catalog includes information only about the raw input data |
| base-model 	| The name of the base model configuration file to load 	|
| pipeline 	| The name of the modular pipeline to run 	|
| runner 	| The runner that you want to run the pipeline with 	|
| parallel 	| Flag to run the pipeline using the `ParallelRunner` 	|
| async 	| Whether node inputs and outputs should be loaded and saved asynchronously with threads 	|
| env 	| Kedro configuration environment name 	|
| from-inputs 	| A list of dataset names which should be used as a starting point 	|
| to-outputs 	| A list of dataset names which should be used as an end point 	|
| from-nodes 	| A list of node (pipeline step) names which should be used as a starting point 	|
| to-nodes 	| A list of node names which should be used as an end point 	|
| nodes 	| A list with node names. The `run` function will run only these nodes 	|
| tags 	| List of tags. Construct a pipeline from nodes having any of these tags 	|
| load-versions 	| A mapping between dataset names and versions to load 	|
| params 	| These values will override the values in the `parameters` configuration files |
| run-config| A YAML configuration file to load the missing (if any) `run` command arguments from. |


