# Power Laws: Detecting Anomalies in Usage
Energy consumption of buildings has steadily increased. There is an increasing realization that many buildings do not perform as intended by their designers. Typical buildings consume 20% more energy than necessary due to faulty construction, malfunctioning equipment, incorrectly configured control systems and inappropriate operating procedures.

The building systems may fail to meet the performance expectations due to various faults. Poorly maintained, degraded, and improperly controlled equipment wastes an estimated 15% to 30% of energy used in commercial buildings.

Therefore, it is of great potential to develop automatic, quick-responding, accurate and reliable fault detection and to provide diagnosis schemes to ensure the optimal operations of systems to save energy.

Schneider Electric already has relevant offers, but would like to determine if alternative techniques can add new detections / functionalities, bring gain in precision, or operate with less data.

https://search.library.northwestern.edu/primo-explore/fulldisplay?docid=01NWU_HATHI_TRUSTMIU01-100664356&context=L&vid=NULVNEW&search_scope=NWU&tab=default_tab&lang=en_US

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import power_utils

%matplotlib inline

In [34]:
train, test = power_utils.import_data()

  mask |= (ar1 == a)


In [35]:
train = train[train.meter_description.isin(test.meter_description.unique().tolist())]
train = train[train.activity.isin(test.activity.unique().tolist())]

train.sort_values(by=['meter_id', 'timestamp'], inplace=True)

# Convert Wh to kWh
train['values'] = train.apply(lambda r: (r['values'] / 1000) if r['units'] == 'Wh' else r['values'], axis=1)
train['units'] = train.apply(lambda r: 'kWh' if r['units'] == 'Wh' else r['units'], axis=1)

In [20]:
train = train.set_index('timestamp')

In [31]:
train = train.drop_duplicates(['meter_id'])

In [36]:
train.head()

Unnamed: 0,meter_id,timestamp,values,date,site_id,meter_description,units,surface,activity,holiday,temperature,distance
0,2,2015-06-11 00:00:00,2.035,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,20.033333,16.317674
1,2,2015-06-11 00:15:00,2.074,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
2,2,2015-06-11 00:30:00,2.062,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
3,2,2015-06-11 00:45:00,2.025,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
4,2,2015-06-11 01:00:00,2.034,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,


In [51]:
lagging = train.set_index('timestamp')

In [52]:
preproc.head()

Unnamed: 0_level_0,meter_id,values,date,site_id,meter_description,units,surface,activity,holiday,temperature,distance
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2015-06-11 00:00:00,2,2.035,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,20.033333,16.317674
2015-06-11 00:15:00,2,2.074,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
2015-06-11 00:30:00,2,2.062,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
2015-06-11 00:45:00,2,2.025,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,
2015-06-11 01:00:00,2,2.034,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,


In [53]:
s = preproc.groupby('meter_id')['values'].shift(1, freq='B').reset_index()

In [58]:
s = s.rename({'values': 'values_bus_day_lag_1'}, axis='columns')

In [59]:
s.head()

Unnamed: 0,meter_id,timestamp,values_bus_day_lag_1
0,2,2015-06-12 00:00:00,2.035
1,2,2015-06-12 00:15:00,2.074
2,2,2015-06-12 00:30:00,2.062
3,2,2015-06-12 00:45:00,2.025
4,2,2015-06-12 01:00:00,2.034


In [60]:
train = pd.merge(train, s, how='left', on=['meter_id', 'timestamp'])

In [64]:
train['obs_id'] = train.index

In [67]:
train.head()

Unnamed: 0,meter_id,timestamp,values,date,site_id,meter_description,units,surface,activity,holiday,temperature,distance,obs_id
0,2,2015-06-11 00:00:00,2.035,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,20.033333,16.317674,0
1,2,2015-06-11 00:15:00,2.074,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,,1
2,2,2015-06-11 00:30:00,2.062,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,,2
3,2,2015-06-11 00:45:00,2.025,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,,3
4,2,2015-06-11 01:00:00,2.034,2015-06-11,334_61,main meter,kWh,2000.0,office,0.0,,,4


In [68]:
# Lags
lagging = train.set_index('timestamp')
s = lagging.groupby('meter_id')['values'].shift(1, freq='B').reset_index()
s = s.rename({'values': 'values_business_day_lag_1'}, axis='columns')
train = pd.merge(train, s, how='left', on=['meter_id', 'timestamp'])

s = lagging.groupby('meter_id')['values'].shift(1, freq='D').reset_index()
s = s.rename({'values': 'values_day_lag_1'}, axis='columns')
train = pd.merge(train, s, how='left', on=['meter_id', 'timestamp'])

s = lagging.groupby('meter_id')['values'].shift(7, freq='D').reset_index()
s = s.rename({'values': 'values_day_lag_7'}, axis='columns')
train = pd.merge(train, s, how='left', on=['meter_id', 'timestamp'])

# Differencing
train['values_business_day_diff_1'] = train['values'] - train['values_business_day_lag_1']
train['values_day_diff_1'] = train['values'] - train['values_day_lag_1']
train['values_day_diff_7'] = train['values'] - train['values_day_lag_7']

# Day of the week
train['date'] = pd.to_datetime(train['date'])
train['dow'] = train['date'].dt.dayofweek
train['wom'] = (train['date'].dt.day - 1) // 7 + 1
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
train['day'] = train['date'].dt.month
train['hour'] = train['date'].dt.hour
train['minute'] = train['date'].dt.minute
train.drop('date', inplace=True, axis=1)

In [71]:
train.head()

Unnamed: 0,meter_id,timestamp,values,surface,holiday,temperature,distance,obs_id,values_business_day_lag_1,values_day_lag_1,...,minute,site_id_038,site_id_234_203,site_id_334_61,meter_description_main_meter,meter_description_virtual_main,meter_description_virtual_meter,units_kWh,activity_general,activity_office
0,2,2015-06-11 00:00:00,2.035,2000.0,0.0,20.033333,16.317674,0,,,...,0,0,0,1,1,0,0,1,0,1
1,2,2015-06-11 00:15:00,2.074,2000.0,0.0,,,1,,,...,0,0,0,1,1,0,0,1,0,1
2,2,2015-06-11 00:30:00,2.062,2000.0,0.0,,,2,,,...,0,0,0,1,1,0,0,1,0,1
3,2,2015-06-11 00:45:00,2.025,2000.0,0.0,,,3,,,...,0,0,0,1,1,0,0,1,0,1
4,2,2015-06-11 01:00:00,2.034,2000.0,0.0,,,4,,,...,0,0,0,1,1,0,0,1,0,1


In [70]:
from sklearn.preprocessing import MinMaxScaler

# Dummy coding
categorical_vars = ['site_id', 'meter_description', 'units', 'activity']
for var in categorical_vars:
    s = pd.get_dummies(train[var], prefix=var)
    s.columns = s.columns.str.replace(" ", "_")
    train.drop(var, inplace=True, axis=1)
    train = pd.concat([train, s], axis=1)

# Center and scale variables
numerical_vars = ['values', 'values_business_day_lag_1', 'values_day_lag_1', 'values_day_lag_7', 
                  'values_business_day_diff_1', 'values_day_diff_1', 'values_day_diff_7', 
                  'dow', 'wom', 'year', 'month', 'day', 'hour', 'minute', 'surface', 'temperature', 'distance']


NameError: name 'MinMaxScaler' is not defined

In [72]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
s = pd.DataFrame(min_max_scaler.fit_transform(train[numerical_vars].fillna(-1)), columns=numerical_vars)
s.index = train.index
train.drop(numerical_vars, inplace=True, axis=1)
train = pd.concat([train, s], axis=1)

# Split the data set
train = train
train.to_csv('./tmp/train_prepared.csv', index=False)