# The preprocessing pipeline

## The goal of the preprocessing pipeline

The data preprocessing pipeline assumes that energy consumption and outdoor air temperature are the only data available for measurement and verification (M&V). We consider an energy consumption and outdoor air temperature dataset valid if:
    
    
1. **There are no duplicate values in the dataset’s timestamps**. Duplicate timestamps are treated separately for energy consumption and for temperature data. In both cases, if the range of the energy consumption or temperature values that share a timestamp is small (according to a user-defined threshold), the values are replaced by their average. Otherwise, they are treated as missing values.


2. **There are no missing values in the dataset’s timestamps**. If there are missing timestamps, they are added and the respective data is treated as missing values.


3. **Potential outliers are identified and marked**. Outlier detection is carried out separately for energy consumption and for temperature data. 


4. **There is enough data available for the energy consumption of the building under study**. Baseline energy consumption data must cover at least one full year before any energy efficiency intervention. In addition, and adopting the data requirements of the [CalTRACK](https://www.caltrack.org/) set of methods, data must be available for over 90% of hours in each calendar month – ***after excluding the potential outliers***.


5. **There are no missing values in the outdoor air temperature data**. If temperature data is missing, the missing values are imputed. The outdoor air temperature changes smoothly from one hour to the next, so interpolating over a 6-hour window around a missing observation is a sensible approach for imputation. This is in line with CalTRACK's requirement that temperature data may not be missing for more than six (6) consecutive hours.
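
The first two requirements can be illustrated with plain pandas. The snippet below is only a sketch under stated assumptions: the helper names `resolve_duplicates` and `add_missing_timestamps` and the `threshold` argument are hypothetical and are not part of the `eensight` API, and the data is assumed to be hourly.

```python
import numpy as np
import pandas as pd


def resolve_duplicates(series: pd.Series, threshold: float) -> pd.Series:
    """Collapse duplicate timestamps (hypothetical helper, not eensight's API)."""
    grouped = series.groupby(level=0)
    value_range = grouped.max() - grouped.min()
    averaged = grouped.mean()
    # Keep the average where the duplicates agree closely, else mark as missing.
    return averaged.where(value_range <= threshold, np.nan)


def add_missing_timestamps(series: pd.Series) -> pd.Series:
    """Reindex on a regular hourly grid so that absent timestamps become NaN."""
    full_index = pd.date_range(
        series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1)
    )
    return series.reindex(full_index)


index = pd.to_datetime(
    ["2016-01-01 00:00", "2016-01-01 00:00", "2016-01-01 01:00",
     "2016-01-01 01:00", "2016-01-01 03:00"]
)
raw = pd.Series([10.0, 10.2, 5.0, 50.0, 7.0], index=index)

clean = add_missing_timestamps(resolve_duplicates(raw, threshold=1.0))
print(clean)
```

In this toy series the duplicates at 00:00 are averaged, the conflicting duplicates at 01:00 become missing, and the absent 02:00 timestamp is added as a missing value.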

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

from datetime import datetime, timedelta

pd.plotting.register_matplotlib_converters()

In [3]:
from eensight.utils.jupyter import load_catalog
from eensight.pipelines.preprocessing import validate_input_data
from eensight.pipelines.preprocessing import (global_filter, global_outlier_detect, 
                                              local_outlier_detect)

from eensight.pipelines.preprocessing import decompose_consumption, decompose_temperature
from eensight.pipelines.preprocessing.validation import check_column_values_not_null
from eensight.pipelines.preprocessing import linear_impute

## Load the data catalog for the demo building

In [4]:
catalog = load_catalog('demo')

## Preprocess the training data

In [5]:
train_input = catalog.load('train.root_input')
merged_data = validate_input_data(train_input, rebind_names=catalog.load('rebind_names'), 
                                          location=catalog.load('location'))

### Select the consumption data

In [6]:
consumption =  merged_data['consumption']

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    consumption.loc[consumption.notna()].plot(ax=ax, alpha=0.5)
    ax.set_xlabel('Hours')

### Outlier identification

The proposed approach for outlier identification is outlined next:

#### Step 1: Global filter

The first step screens for physically implausible values as well as unlikely values in the data. 

For power consumption data, negative and zero values are filtered out. 

For both consumption and temperature data, values that are at least 10 times larger than the median value are also removed. The threshold of ten times the median value aims at removing the most extreme outliers. 

Furthermore, long streaks of constant values are filtered out as well (here *long* is defined in hours by `no_change_window`).
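
The screening rules above can be sketched as follows. This is an illustration only and not the implementation of `global_filter`: the function name `sketch_global_filter` and the `max_median_ratio` argument are hypothetical, the comparison with the median uses absolute values as an assumption, and a streak is assumed to be *long* when it contains at least `no_change_window` identical consecutive values.

```python
import numpy as np
import pandas as pd


def sketch_global_filter(series, no_change_window=3, allow_zero=False,
                         allow_negative=False, max_median_ratio=10):
    """Illustrative screening only; not the eensight implementation."""
    out = series.astype(float).copy()
    if not allow_negative:
        out[out < 0] = np.nan
    if not allow_zero:
        out[out == 0] = np.nan

    # Assumption: compare absolute values with the absolute median.
    median = out.median()
    out[out.abs() >= max_median_ratio * abs(median)] = np.nan

    # Label runs of identical consecutive values and measure their length.
    run_id = out.ne(out.shift()).cumsum()
    run_length = out.groupby(run_id).transform("count")
    out[run_length >= no_change_window] = np.nan
    return out


index = pd.date_range("2016-01-01", periods=10, freq=pd.Timedelta(hours=1))
raw = pd.Series([5, 0, -1, 6, 6, 6, 7, 500, 8, 6], index=index)

filtered = sketch_global_filter(raw, no_change_window=3)
print(filtered)
```

Here the zero, the negative value, the value of 500 and the three consecutive readings of 6 are all replaced by NaN, while the isolated reading of 6 at the end is kept.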


In [107]:
consumption = global_filter(consumption, 
                            no_change_window=3,
                            allow_zero=False, 
                            allow_negative=False
)

#### Step 2: Seasonal filter

The second step captures the seasonal cycle of the data through a trend and seasonality decomposition approach that utilizes a Fourier series expansion of the form:

$$y(t)=\alpha+bt+\sum_{n=1}^{N} \left(\alpha_n\cos\left(\frac{2\pi nt}{P}\right) + b_n\sin\left(\frac{2\pi nt}{P}\right)\right)$$

where:

$\alpha$    is the offset of the linear trend

$b$    is the slope of the linear trend

$t$    is the time in days since a pre-specified epoch. For hourly data, $t$ takes fractional values.

$N$    is a parameter that controls the flexibility of the expansion. Suggested values are N=4 for daily seasonality, N=10 for yearly seasonality (see [Taylor S. J. and Letham B. (2018) "Forecasting at scale," The American Statistician 72(1), pp. 37-45](https://peerj.com/preprints/3190/))

$P$    is the length of the seasonality: P=1 for daily seasonality, P=365.25 for yearly seasonality. For energy consumption data, we fit a different daily seasonality component for each day of the week.  

$\alpha_n, b_n$    are the regression coefficients for the Fourier series expansion terms.
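
To make the expansion concrete, the sketch below builds the Fourier terms for a single daily seasonality ($P=1$, $N=4$) and estimates the coefficients by ordinary least squares on synthetic data. It is a minimal illustration under stated assumptions: the function `fourier_design_matrix` is hypothetical, the data is simulated, and `decompose_consumption` may build and fit its features differently.

```python
import numpy as np


def fourier_design_matrix(t, period, order):
    """Columns cos(2*pi*n*t/P) and sin(2*pi*n*t/P) for n = 1..order."""
    n = np.arange(1, order + 1)
    angles = 2.0 * np.pi * np.outer(t, n) / period
    return np.hstack([np.cos(angles), np.sin(angles)])


# t is measured in days, so hourly data gives fractional values.
t = np.arange(0, 28 * 24) / 24.0
rng = np.random.default_rng(0)
y = 100 + 0.5 * t + 20 * np.sin(2 * np.pi * t) + rng.normal(0, 1, t.size)

# Design matrix: offset, linear trend, daily seasonality (P=1, N=4).
X = np.column_stack(
    [np.ones_like(t), t, fourier_design_matrix(t, period=1.0, order=4)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

print(f"offset={coef[0]:.2f}, slope={coef[1]:.2f}, residual std={resid.std():.2f}")
```

The columns of `X` are ordered as offset, slope, the four cosine terms and the four sine terms, so `coef[6]` is the coefficient of $\sin(2\pi t)$ that generated the synthetic signal.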


The reason for applying seasonal decomposition before outlier identification can be seen in the figure below: 

In [109]:
def fit_pdf(x, data, distribution=stats.norm):
    # fit dist to data
    params = distribution.fit(data)

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Evaluate the fitted PDF on the grid x
    pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
    return params, pdf

In [None]:
consumption_ = consumption.dropna().values
x_d = np.linspace(consumption_.min(), consumption_.max(), 2000)
params, pdf = fit_pdf(x_d, consumption_)

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    consumption.plot(kind='hist', bins=100, density=True, alpha=0.3, ax=ax)
    pd.Series(pdf, x_d).plot(ax=ax)
    
    ax.legend(['Fitted Normal distribution', 'Actual distribution of power consumption'], 
              frameon=True, shadow=True, fontsize=12)

Since seasonality leads to multimodal distributions, methods that rely on the assumption that the data follows a Normal distribution – such as simple three-sigma rules, the Grubbs test or the Extreme Studentized Deviate (ESD) test  – should generally be used only ***after*** a seasonal filter has been applied to the data.

In [113]:
results = decompose_consumption(consumption.dropna().to_frame("consumption"),
                                return_model=True)

In [None]:
results.model.composer_.component_names_

In [115]:
pred = results.transformed['yhat']
resid = results.transformed['resid']

In [None]:
print(f'CV(RMSE): {np.sqrt(np.mean(resid**2)) / np.mean(consumption)}')

The next plot shows the actual and the predicted power consumption for the first and the last month of 2016:

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 6), dpi=96)
    layout = (2, 1)
    ax1 = plt.subplot2grid(layout, (0, 0))
    ax2 = plt.subplot2grid(layout, (1, 0))

    start = datetime(2016, 1, 1, 0)
    end = datetime(2016, 2, 1, 0)
    consumption.loc[start:end].plot(ax=ax1, alpha=0.6)
    pred.loc[start:end].plot(ax=ax1, alpha=0.4)
    ax1.set_xlabel('Hours')
    ax1.legend(['Power consumption', 'Seasonal prediction'], frameon=True, shadow=True)
    
    start = datetime(2016, 12, 1, 0)
    end = datetime(2017, 1, 1, 0)
    consumption.loc[start:end].plot(ax=ax2, alpha=0.6)
    pred.loc[start:end].plot(ax=ax2, alpha=0.4)
    ax2.set_xlabel('Hours')
    ax2.legend(['Power consumption', 'Seasonal prediction'], frameon=True, shadow=True)

fig.tight_layout()

The next plot shows the distribution of the residuals, that is, the differences between the actual and the predicted power consumption. The distribution of the residuals resembles a Student’s t distribution, which is unimodal and therefore easier to work with when detecting outliers.

In [None]:
residuals_ = resid.dropna()
x_d = np.linspace(residuals_.min(), residuals_.max(), 2000)

_, pdf_t = fit_pdf(x_d, residuals_, distribution=stats.t)


with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    resid.plot(kind='hist', bins=100, density=True, alpha=0.3, ax=ax)
    pd.Series(pdf_t, x_d).plot(ax=ax)
    
    ax.legend(['Fitted Student\'s t distribution', 'Distribution of residuals'], 
              frameon=True, shadow=True, fontsize=12)

#### Step 3: Global outlier detection

The third step of the outlier detection process identifies observations in the available dataset as potential outliers if the value of their corresponding residuals lies outside the range defined by:

$$[median^{all} - c\times mad^{all}, median^{all} + c\times mad^{all}]$$

where:

$median^{all}$ is the median of all the residual values

$mad^{all}$ is the median absolute deviation of all the residual values

$c$ is a user-defined parameter (suggested value is 5).
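
The range above translates directly into a few lines of pandas. The sketch below uses the raw (unscaled) median absolute deviation, which is an assumption: `global_outlier_detect` may scale the MAD or treat edge cases differently, and the name `sketch_global_outliers` is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_global_outliers(resid: pd.Series, c: float = 5) -> pd.Series:
    """Flag residuals outside [median - c * mad, median + c * mad]."""
    median = resid.median()
    mad = (resid - median).abs().median()
    lower, upper = median - c * mad, median + c * mad
    return (resid < lower) | (resid > upper)


# Deterministic toy residuals with two injected extreme values.
values = np.tile([-1.0, -0.5, 0.0, 0.5, 1.0], 20)
values[10] = 4.0
values[50] = -3.0
resid_demo = pd.Series(values)

flags = sketch_global_outliers(resid_demo, c=5)
print(flags[flags].index.tolist())
```

For this toy series the median is 0 and the MAD is 0.5, so the accepted range is [-2.5, 2.5] and only the two injected values are flagged.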


In [122]:
outliers_global = global_outlier_detect(resid, c=5)

The next plot shows the potential outliers in power consumption identified by the global outlier detection for three periods of 2016: January, August together with the first five days of September, and December:

In [None]:
subset = consumption.loc[consumption.index.isin(outliers_global[outliers_global].index)]

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 8), dpi=96)
    layout = (3, 1)
    ax1 = plt.subplot2grid(layout, (0, 0))
    ax2 = plt.subplot2grid(layout, (1, 0))
    ax3 = plt.subplot2grid(layout, (2, 0))
    
    start = datetime(2016, 1, 1, 0)
    end = datetime(2016, 2, 1, 0)
    consumption.loc[start:end].plot(ax=ax1, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax1, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass 
    
    start = datetime(2016, 8, 1, 0)
    end = datetime(2016, 9, 6, 0)
    consumption.loc[start:end].plot(ax=ax2, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax2, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    
    start = datetime(2016, 12, 1, 0)
    end = datetime(2017, 1, 1, 0)
    consumption.loc[start:end].plot(ax=ax3, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax3, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    ax3.set_xlabel('Hours')
    
fig.tight_layout()

#### Step 4: Local outlier detection

The final step of the outlier detection process retains from the outliers identified in the previous step only those that can still be characterised as outliers when their values are also compared with the other observations on the same day. 

The rationale for this approach can be explained by looking at the next plot, which shows the actual and the predicted power consumption during the first two (2) weeks of 2016 in the dataset. An important observation from the plot is that the distance from the seasonal model’s predictions is not by itself enough for detecting outliers when the whole day is misrepresented by the model (here a holiday is treated as a normal day).

In [None]:
start = datetime(2016, 1, 1, 0)
end = datetime(2016, 1, 1, 0) + timedelta(days=14)

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    consumption.loc[start:end].plot(ax=ax, alpha=0.8)
    pred.loc[start:end].plot(ax=ax, alpha=0.4)
    
    ax.set_ylim(top=7000)
    ax.annotate(' First day of year ', xy=(datetime(2016, 1, 1, 12), 2200),  xycoords='data',
             xytext=(40, 140), textcoords='offset points',
             size=13, ha='center', va="center",
             bbox=dict(boxstyle="round", alpha=0.3),
             arrowprops=dict(arrowstyle="wedge,tail_width=0.5", alpha=0.3))
    
    ax.set_xlabel('Hours')
    ax.legend(['Power consumption', 'Seasonal prediction'], frameon=True, shadow=True)

Accordingly, the observations in the available dataset are marked as potential outliers if the value of their corresponding residuals lies outside the range defined by:

$$[median^{day} - c\times mad^{day}, median^{day} + c\times mad^{day}]$$

where:

$median^{day}$ is the median of all the residual values in the corresponding day

$mad^{day}$ is the median absolute deviation of all the residual values in the corresponding day

$c$ is a user-defined parameter (suggested value is 5).


This step is parameterised by `min_samples`, the minimum percentage of observations that must be available for any given day for its daily statistics to be taken into account. If the number of available observations is lower than this threshold, only the global outlier detection results are considered.
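
A day-by-day version of the same rule can be sketched as below. Several points are assumptions rather than documented behaviour of `local_outlier_detect`: the data is taken to be hourly (24 observations per day), the MAD is unscaled, and days with too few observations are flagged entirely so that, once combined with the global result, only the global detection decides for them. The name `sketch_local_outliers` is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_local_outliers(resid: pd.Series, min_samples: float = 0.6,
                          c: float = 5) -> pd.Series:
    """Flag residuals outside the daily median +/- c * daily MAD."""
    day = resid.index.floor("D")
    grouped = resid.groupby(day)
    median = grouped.transform("median")
    mad = (resid - median).abs().groupby(day).transform("median")
    outside = (resid < median - c * mad) | (resid > median + c * mad)

    # Fraction of hourly observations available in each day.
    coverage = grouped.transform("count") / 24.0
    # Sparse days defer to the global detection (assumption, see lead-in).
    return outside.where(coverage >= min_samples, True)


index = pd.date_range("2016-01-01", periods=72, freq=pd.Timedelta(hours=1))
pattern = np.tile([-1.0, 0.0, 1.0], 8)
values = np.concatenate([pattern - 50.0, pattern, pattern])
resid_demo = pd.Series(values, index=index)
resid_demo.iloc[12] = 40.0     # isolated spike inside the shifted day
resid_demo.iloc[54:] = np.nan  # third day keeps only 6 of 24 observations

flags = sketch_local_outliers(resid_demo, min_samples=0.6, c=5)
print(flags.groupby(flags.index.floor("D")).sum())
```

The first day is shifted as a whole by 50 units, as a holiday would be, yet only its isolated spike is flagged because the daily median absorbs the shift. The second day has no flags, and the third day does not have enough observations for its daily statistics to be used.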

In [125]:
outliers_local = local_outlier_detect(resid, min_samples=0.6, c=5)

The next plot shows the potential outliers in power consumption identified by the local outlier detection for three periods of 2016: January, August together with the first five days of September, and December:

In [None]:
subset = consumption.loc[consumption.index.isin(outliers_local[outliers_local].index)]

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 8), dpi=96)
    layout = (3, 1)
    ax1 = plt.subplot2grid(layout, (0, 0))
    ax2 = plt.subplot2grid(layout, (1, 0))
    ax3 = plt.subplot2grid(layout, (2, 0))
    
    start = datetime(2016, 1, 1, 0)
    end = datetime(2016, 2, 1, 0)
    consumption.loc[start:end].plot(ax=ax1, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax1, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    
    start = datetime(2016, 8, 1, 0)
    end = datetime(2016, 9, 6, 0)
    consumption.loc[start:end].plot(ax=ax2, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax2, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    
    start = datetime(2016, 12, 1, 0)
    end = datetime(2017, 1, 1, 0)
    consumption.loc[start:end].plot(ax=ax3, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax3, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    ax3.set_xlabel('Hours')
    
fig.tight_layout()

For an observation to be marked as an outlier, both global and local results must agree. 

In [128]:
outliers = np.logical_and(outliers_global, outliers_local)

In [None]:
subset = consumption.loc[consumption.index.isin(outliers[outliers].index)]

with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 8), dpi=96)
    layout = (3, 1)
    ax1 = plt.subplot2grid(layout, (0, 0))
    ax2 = plt.subplot2grid(layout, (1, 0))
    ax3 = plt.subplot2grid(layout, (2, 0))
    
    start = datetime(2016, 1, 1, 0)
    end = datetime(2016, 2, 1, 0)
    consumption.loc[start:end].plot(ax=ax1, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax1, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    
    start = datetime(2016, 8, 1, 0)
    end = datetime(2016, 9, 6, 0)
    consumption.loc[start:end].plot(ax=ax2, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax2, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    
    start = datetime(2016, 12, 1, 0)
    end = datetime(2017, 1, 1, 0)
    consumption.loc[start:end].plot(ax=ax3, alpha=0.6)
    try:
        subset.loc[start:end].plot(ax=ax3, style='o', ms=4, c='red', alpha=0.4)
    except IndexError:
        pass
    ax3.set_xlabel('Hours')
    
fig.tight_layout()

The next plot shows the potential outliers identified in the whole consumption dataset:

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    consumption.plot(ax=ax, alpha=0.4, style='.', ms=2)
    
    subset = consumption.loc[consumption.index.isin(outliers[outliers].index)]
    subset.plot(ax=ax, style='o', ms=3, c='red')
    
    ax.set_xlabel('Hours')
    ax.legend(['Power consumption', 'Potential outliers'], frameon=True, shadow=True)

In [134]:
merged_data['consumption'] = consumption
merged_data['consumption_outlier'] = outliers
merged_data['consumption_outlier'] = merged_data['consumption_outlier'].fillna(value=False)

### Repeat the process for temperature data

In [8]:
temperature = merged_data['temperature']

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    temperature.loc[temperature.notna()].plot(ax=ax, alpha=0.5)
    ax.set_xlabel('Hours')

In [137]:
temperature = global_filter(temperature, 
                            no_change_window=3,
                            allow_zero=True, 
                            allow_negative=True)

We apply seasonal decomposition to the temperature data:

In [139]:
results = decompose_temperature(temperature.dropna().to_frame("temperature"),
                                return_model=True)

In [None]:
results.model.composer_.component_names_

In [143]:
resid = results.transformed['resid']

The distribution of the residuals resembles a Student’s t distribution:

In [None]:
residuals_ = resid.dropna()
x_d = np.linspace(residuals_.min(), residuals_.max(), 2000)

_, pdf_t = fit_pdf(x_d, residuals_, distribution=stats.t)


with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    resid.plot(kind='hist', bins=100, density=True, alpha=0.3, ax=ax)
    pd.Series(pdf_t, x_d).plot(ax=ax)
    
    ax.legend(['Fitted Student\'s t distribution', 'Distribution of residuals'], 
              frameon=True, shadow=True, fontsize=12)

Outliers found in the dataset:

In [None]:
outliers_global = global_outlier_detect(resid, c=5)
outliers_local = local_outlier_detect(resid, min_samples=0.6, c=5)
outliers = np.logical_and(outliers_global, outliers_local)

print(f'Number of outliers found: {outliers.sum()}')

In [146]:
merged_data['temperature'] = temperature
merged_data['temperature_outlier'] = outliers
merged_data['temperature_outlier'] = merged_data['temperature_outlier'].fillna(value=False)

Outliers in all features except consumption are replaced by NaN. Consumption outliers stay in the data and remain flagged in the `consumption_outlier` column.

In [147]:
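# Columns that flag outliers are named '<feature>_outlier'
columns = merged_data.filter(like='outlier', axis=1).columns
to_drop = []

for col in columns:
    # Split on the last underscore so that feature names may contain underscores
    feature, _ = col.rsplit('_', 1)
    if feature != 'consumption':
        # Replace flagged values by NaN and discard the flag column
        merged_data[feature] = merged_data[feature].mask(merged_data[col], np.nan)
        to_drop.append(col)
        
merged_data = merged_data.drop(to_drop, axis=1)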

### Impute missing values in the temperature data

In [None]:
print('Number of missing temperature values before: {}'
          .format(merged_data['temperature'].isna().sum())
)

In [151]:
merged_data['temperature'] = linear_impute(merged_data['temperature'], window=6)
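
For readers without access to `linear_impute`, the idea of gap-limited interpolation can be sketched with pandas alone. The sketch assumes that a window of 6 means that only gaps of at most six consecutive missing hours are filled, in line with the CalTRACK requirement quoted earlier. The actual semantics of the `window` argument may differ, and `sketch_linear_impute` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def sketch_linear_impute(series: pd.Series, max_gap: int = 6) -> pd.Series:
    """Interpolate only gaps of at most `max_gap` consecutive missing values."""
    is_missing = series.isna()
    # Label each run of consecutive (non-)missing values and measure the gaps.
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    gap_length = is_missing.astype(int).groupby(run_id).transform("sum")

    interpolated = series.interpolate(method="time", limit_area="inside")
    # Keep interpolated values only where the gap is short enough.
    fill = is_missing & (gap_length <= max_gap)
    return series.where(~fill, interpolated)


index = pd.date_range("2016-01-01", periods=24, freq=pd.Timedelta(hours=1))
temperature_demo = pd.Series(np.arange(24, dtype=float), index=index)
temperature_demo.iloc[3:5] = np.nan    # short gap of 2 hours
temperature_demo.iloc[10:18] = np.nan  # long gap of 8 hours

imputed = sketch_linear_impute(temperature_demo, max_gap=6)
print(f"Missing values after imputation: {imputed.isna().sum()}")
```

The two-hour gap is filled by linear interpolation in time, whereas the eight-hour gap is left missing. Calling `Series.interpolate(limit=6)` directly would instead fill the first six hours of the long gap, which is why the gap lengths are computed explicitly.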

In [None]:
print('Number of missing temperature values after: {}'
          .format(merged_data['temperature'].isna().sum())
)

### Ensure that enough training data is available

In [153]:
missing_condition = (  merged_data['consumption_outlier'] 
                     | merged_data['consumption'].isna() 
                     | merged_data['temperature'].isna()
)

In [154]:
missing = merged_data[['consumption']].mask(missing_condition, np.nan)

In [156]:
avail_data = dict()

for month_year, group in missing.groupby([lambda x: x.year, lambda x: x.month]):
    check = check_column_values_not_null(data=group, column='consumption', mostly=0.9)
    avail_data[month_year] = check.result['unexpected_percent']

avail_data = {f'{key[0]}M{key[1]}' :val for key, val in avail_data.items()}
avail_data = pd.DataFrame.from_dict(avail_data, orient='index', columns=['values'])

In [None]:
print('Months with not enough data are:')
print(avail_data[avail_data['values'] > 0.1])
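
The same availability check can be cross-checked with pandas alone. The sketch below computes the fraction of available hours per calendar month and compares it with the 90% requirement. It assumes a complete hourly index in which missing hours appear as NaN, and the helper name `monthly_availability` is hypothetical.

```python
import numpy as np
import pandas as pd


def monthly_availability(series: pd.Series) -> pd.Series:
    """Fraction of non-missing observations in each calendar month."""
    periods = series.index.to_period("M")
    return series.notna().groupby(periods).mean()


index = pd.date_range(
    "2016-01-01", "2016-02-29 23:00", freq=pd.Timedelta(hours=1)
)
consumption_demo = pd.Series(1.0, index=index)
consumption_demo.iloc[744:744 + 120] = np.nan  # first 5 days of February

availability = monthly_availability(consumption_demo)
insufficient = availability[availability < 0.9]
print(insufficient)
```

January is fully available, while February has 576 of its 696 hours available (about 83%) and is therefore reported as a month without enough data.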

In [None]:
with plt.style.context('seaborn-whitegrid'):    
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    
    subset = avail_data.mask(avail_data['values'] <= 0.1, 0) 
    subset.plot.bar(rot=25, ax=ax, color='#C71585', legend=False)
    
    subset = avail_data.mask(avail_data['values'] > 0.1, 0)
    subset.plot.bar(rot=25, ax=ax, color='#4682B4', legend=False)

### Save to the catalog

In [159]:
catalog.save('train.preprocessed_data', merged_data)

--------------------------