# 4.1 Prediction of the aggregate demand with feature models

In this section, we will compare the performance of different algorithms for predicting the aggregate demand, treating standard and ToU users separately. First, we will use feature models and then exponential smoothing, an algorithm designed specifically for time series.

In general we will train our model with 2011-2013 data and validate it with 2014 data.
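
A chronological split of this kind can be sketched as follows. The toy frame and the names `train` and `valid` are assumptions made for illustration; only the `DateTime` column name is taken from the data used below.

```python
import pandas as pd

# Toy half-hourly data spanning the end of 2013 and the start of 2014
# (illustrative values, not the real consumption data).
toy = pd.DataFrame({
    "DateTime": pd.date_range("2013-12-31 22:00", periods=8, freq="30min"),
})
toy["mean"] = range(len(toy))

# Train on everything before 2014, validate on 2014.
is_2014 = toy["DateTime"].dt.year == 2014
train = toy[~is_2014]
valid = toy[is_2014]
```

Splitting by date rather than at random keeps every validation timestamp later than every training timestamp.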

We will use the aggregate statistics we computed in Section 2 with Spark.

Feature models are an alternative approach to time series forecasting. They consist of applying conventional machine learning models to variables constructed from the data (e.g. mean consumption in the previous day). These are popular in the literature for electrical demand prediction [2], [4].
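
As a minimal sketch of the idea, the snippet below builds two lagged features from a synthetic series and fits ordinary least squares as a stand-in for a conventional model. The series, the feature names `lag_1d` and `mean_prev_2h`, and the choice of least squares are all assumptions for illustration. Note that `shift` only works as a time lag when the index has no gaps.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly series with a daily cycle (48 steps per day).
idx = pd.date_range("2013-01-01", periods=48 * 10, freq="30min")
steps = np.arange(len(idx))
demand = pd.Series(0.2 + 0.1 * np.sin(2 * np.pi * steps / 48), index=idx)

# Constructed features: value at the same time on the previous day and
# mean of the previous four half-hour steps.
features = pd.DataFrame({
    "lag_1d": demand.shift(48),
    "mean_prev_2h": demand.shift(1).rolling(4).mean(),
}).dropna()
target = demand.loc[features.index]

# Ordinary least squares with an intercept, fitted on the features.
X = np.column_stack([np.ones(len(features)), features.to_numpy()])
coef, *_ = np.linalg.lstsq(X, target.to_numpy(), rcond=None)
pred = X @ coef
```

Because the synthetic series repeats exactly every day, `lag_1d` alone explains the target here; real demand is of course noisier.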

In this section we will use different algorithms to fit our feature model. Later, we will predict the aggregate demand with exponential smoothing models in R using the "forecast" package.

### Preparing the dataset

In [64]:
import pandas as pd
pd.options.display.max_columns = 999

In [67]:
df = pd.read_csv('outputs/agg_stats.csv', index_col=0)

In [68]:
df.head()

Unnamed: 0,DateTime,Tariff,ToU_User,count,sum,min,mean,max,std_dev
0,2012-10-15 21:00:00,Std,0,4210,1270.587,0.0,0.301802,5.335,0.31913
1,2012-10-21 21:00:00,Std,0,4287,1239.813999,0.0,0.289203,6.095,0.320852
2,2012-10-27 13:00:00,Std,0,4402,1159.695,0.0,0.263447,3.554,0.336398
3,2012-10-28 17:30:00,Std,0,4402,1665.008001,0.0,0.378239,8.04,0.447315
4,2012-11-04 00:30:00,Std,0,4403,994.769,0.0,0.22593,6.072,0.424697


We keep only the columns we need: DateTime, Tariff and mean consumption. We are ultimately interested in predicting the aggregate demand (i.e. the sum), but because the number of readings (count) differs between timestamps, we will use the mean, which is comparable across timestamps.
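
The relation between the two quantities can be checked on the first two rows of the table above. This is only an illustrative check; the frame `rows` and the name `recomputed` are not part of the original analysis.

```python
import pandas as pd

# First two rows of the table above.
rows = pd.DataFrame({
    "count": [4210, 4287],
    "sum": [1270.587, 1239.813999],
    "mean": [0.301802, 0.289203],
})

# The mean is the sum normalised by the number of readings, so it stays
# comparable when `count` changes from one timestamp to the next.
recomputed = rows["sum"] / rows["count"]
```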

In [69]:
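# .copy() makes df an independent DataFrame, so the column assignments
# below do not trigger a SettingWithCopyWarning.
df = df[['DateTime','Tariff','mean']].copy()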

In [70]:
df['DateTime'] = pd.to_datetime(df['DateTime'])

In [71]:
type(df['DateTime'][2])

pandas.tslib.Timestamp

In [72]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean
41397,2013-08-12 16:00:00,ToU,0.170519
37965,2013-04-25 16:00:00,Std,0.185175
60759,2013-05-06 12:30:00,ToU,0.179697
65336,2013-11-30 10:00:00,Std,0.262507
30346,2013-06-03 05:30:00,Std,0.129813


### Feature engineering

We will derive the following features:

-Mean aggregate demand per hour in the previous day (1 variable).

-Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In total we have 13 derived features.

Furthermore, we will include weather and time variables in the model:

-Time (30-min resolution)

-Day of week (7 levels)

-Month (12 levels)

-Holiday (binary)

-Temperature (ºC, continuous)

-Relative Humidity (%, continuous)

-Cloud cover (%, continuous)

-Atmospheric Pressure (mbar, continuous)

These variables have been calculated in Section 1 and stored in 'weather_no_na.csv'.

The last variable to be included in the model is:

-Tariff (p/kWh)

Thus, in total we will have 22 variables in our models.


First, we will build the time variables.

In [73]:
df['DoW'] = df['DateTime'].apply(lambda x: x.weekday())

In [74]:
df['Time'] = df['DateTime'].apply(lambda x: x.time())

In [75]:
df['Date'] = df['DateTime'].apply(lambda x: x.date())

In [76]:
df['Date'] = pd.to_datetime(df['Date'])

In [77]:
df['Month']= df['DateTime'].apply(lambda x: x.month)

In [78]:
df['Year']= df['DateTime'].apply(lambda x: x.year)

In [79]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean,DoW,Time,Date,Month,Year
23836,2013-08-06 14:00:00,ToU,0.137946,1,14:00:00,2013-08-06,8,2013
56017,2013-02-02 04:30:00,ToU,0.113117,5,04:30:00,2013-02-02,2,2013
68097,2013-06-22 23:30:00,Std,0.155765,5,23:30:00,2013-06-22,6,2013
35334,2012-04-02 07:00:00,Std,0.241241,0,07:00:00,2012-04-02,4,2012
13270,2013-10-31 05:30:00,Std,0.117387,3,05:30:00,2013-10-31,10,2013
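
The same time variables can also be obtained with the vectorized `.dt` accessor instead of `apply`. The sketch below uses a single timestamp from the sample above; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# One timestamp taken from the sample above.
toy = pd.DataFrame({"DateTime": pd.to_datetime(["2013-08-06 14:00:00"])})

toy["DoW"] = toy["DateTime"].dt.weekday        # Monday = 0
toy["Time"] = toy["DateTime"].dt.time          # datetime.time objects
toy["Date"] = toy["DateTime"].dt.normalize()   # midnight, still datetime64
toy["Month"] = toy["DateTime"].dt.month
toy["Year"] = toy["DateTime"].dt.year
```

Using `normalize()` keeps `Date` as a datetime column, so no second `pd.to_datetime` call is needed.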


Now it is easier to derive the consumption features.

###### Mean aggregate demand per hour in the previous day (1 variable).

First, we will calculate the daily mean aggregating by 'Date'. Then we will add one day to the 'Date' variable of our new DataFrame and join it with the main DataFrame df.

In [80]:
import datetime

In [81]:
daily_mean = df.groupby(['Date', 'Tariff']).agg({'mean': 'mean'}).reset_index()

In [82]:
daily_mean['Date'] = pd.to_datetime(daily_mean['Date'])

In [83]:
type(daily_mean['Date'][2])

pandas.tslib.Timestamp

In [20]:
daily_mean.sample(5)

Unnamed: 0,Date,Tariff,mean
1221,2013-07-25,ToU,0.156114
297,2012-04-19,ToU,0.216483
995,2013-04-03,ToU,0.226597
599,2012-09-17,ToU,0.169745
421,2012-06-20,ToU,0.158177


In [84]:
daily_mean['Date'] = daily_mean['Date'] + datetime.timedelta(days=1)

In [85]:
df = df.merge(daily_mean, how = 'left', on = ['Date','Tariff'])

In [86]:
df = df.rename(columns = {'mean_x' : 'mean_cons',
          'mean_y' : 'mean_prev_day'})

In [87]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day
41825,2014-02-09 14:30:00,ToU,0.268367,6,14:30:00,2014-02-09,2,2014,0.223289
72993,2013-07-13 12:30:00,ToU,0.167876,5,12:30:00,2013-07-13,7,2013,0.157449
20011,2014-02-21 10:30:00,Std,0.222306,4,10:30:00,2014-02-21,2,2014,0.224884
70734,2013-03-15 03:00:00,Std,0.14699,4,03:00:00,2013-03-15,3,2013,0.249186
23288,2012-02-29 16:00:00,Std,0.207436,2,16:00:00,2012-02-29,2,2012,0.244607
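
The shift-and-join step is easier to follow on a toy example. The values below are made up; the column names follow the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-01", "2013-01-01",
                            "2013-01-02", "2013-01-02"]),
    "Tariff": ["Std"] * 4,
    "mean": [0.1, 0.3, 0.2, 0.4],
})

daily = toy.groupby(["Date", "Tariff"], as_index=False)["mean"].mean()
# Moving each daily mean one day forward lines it up with the next day.
daily["Date"] = daily["Date"] + pd.Timedelta(days=1)

out = toy.merge(daily, how="left", on=["Date", "Tariff"],
                suffixes=("_cons", "_prev_day"))
```

The rows of the first day have no previous day in the data, so their `mean_prev_day` is NaN.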


###### Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

In [88]:
df.head()

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day
0,2012-10-15 21:00:00,Std,0.301802,0,21:00:00,2012-10-15,10,2012,0.227159
1,2012-10-21 21:00:00,Std,0.289203,6,21:00:00,2012-10-21,10,2012,0.214069
2,2012-10-27 13:00:00,Std,0.263447,5,13:00:00,2012-10-27,10,2012,0.225192
3,2012-10-28 17:30:00,Std,0.378239,6,17:30:00,2012-10-28,10,2012,0.23787
4,2012-11-04 00:30:00,Std,0.22593,6,00:30:00,2012-11-04,11,2012,0.24095


In [89]:
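def prev_ts(df_in, initial, final, step):
    # Lagged copies of 'mean_cons' for offsets initial..final (in minutes),
    # both ends included.
    df_out = df_in[['DateTime', 'Tariff']].copy()
    for i in range(initial, final + 1, step):
        aux = df_in[['DateTime', 'Tariff', 'mean_cons']].copy()
        # Shifting the timestamps forward by i minutes aligns each reading
        # with the row i minutes later.
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = i)
        # The suffix is only applied on a name clash, so the first offset keeps
        # the plain name 'mean_cons'; later ones become 'mean_cons_-<i>'.
        df_out = df_out.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'], suffixes = ('','_-%d' %(i)))
    return df_out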

In [93]:
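# Offsets of 1350, 1380, 1410 and 1440 minutes (22.5 h to 24 h before).
tmp = prev_ts(df, 22*60+30, 24*60, 30)
# The 1350-minute lag is returned as the unsuffixed 'mean_cons' column and is
# not selected here, so three lagged columns are added to df.
df = df.merge(tmp[['DateTime','Tariff','mean_cons_-1380', 'mean_cons_-1410', 'mean_cons_-1440']],
         on = ['DateTime','Tariff'], suffixes = ('',''))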

In [94]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440
12786,2013-05-10 03:30:00,Std,0.10288,4,03:30:00,2013-05-10,5,2013,0.184996,0.11212,0.103446,0.10084
1590,2012-03-21 20:00:00,ToU,0.345253,2,20:00:00,2012-03-21,3,2012,0.222125,0.35635,0.374243,0.36891
26408,2013-09-21 14:00:00,Std,0.198668,5,14:00:00,2013-09-21,9,2013,0.187236,0.173513,0.167166,0.167441
10029,2011-12-04 13:30:00,ToU,0.2842,6,13:30:00,2011-12-04,12,2011,0.183636,0.1536,0.1268,0.184267
30062,2011-12-31 15:00:00,Std,0.30154,5,15:00:00,2011-12-31,12,2011,0.261003,0.314867,0.292593,0.284088
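
A variant of the lag-by-merge technique that names every lag column explicitly, instead of relying on merge suffixes, could look like the sketch below. The helper `add_lags` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_lags(df_in, lags_min, value="mean_cons"):
    """Hypothetical helper: one explicitly named column per lag (minutes)."""
    keys = ["DateTime", "Tariff"]
    out = df_in.copy()
    for lag in lags_min:
        shifted = (df_in[keys + [value]]
                   .rename(columns={value: "%s_-%d" % (value, lag)})
                   # A reading taken at t is matched to the row at t + lag.
                   .assign(DateTime=lambda d: d["DateTime"]
                           + pd.Timedelta(minutes=lag)))
        out = out.merge(shifted, how="left", on=keys)
    return out

toy = pd.DataFrame({
    "DateTime": pd.date_range("2013-01-01", periods=4, freq="30min"),
    "Tariff": "Std",
    "mean_cons": [0.1, 0.2, 0.3, 0.4],
})
lagged = add_lags(toy, [30, 60])
```

Merging on shifted timestamps, rather than shifting rows, keeps the lags correct even when some timestamps are missing or the rows are not sorted.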


###### Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

In [150]:
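# aux2 is created once, outside both loops, so it keeps the lagged columns
# merged for earlier values of h: the mean for offset h*30 is taken over
# every column added so far, not only the three for that offset.
aux2 = df[['DateTime', 'Tariff']].copy()
means_3d = pd.DataFrame(index=df.index)
for h in range(0,4):
    for d in range(1,4):
        aux = df[['DateTime', 'Tariff', 'mean_cons']].copy()
        # Reading taken d days earlier, moved h half-hour steps towards the present.
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = d*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'])
    # Row-wise mean over the lagged columns (NaN values are skipped).
    means_3d = means_3d.join(aux2.mean(axis=1).rename('mean_last3d_-%d' %(h*30)))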

In [151]:
means_3d.head()

Unnamed: 0,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
0,0.286278,0.276157,0.264642,0.251189
1,0.276418,0.269546,0.258608,0.246133
2,0.209606,0.206111,0.204395,0.204098
3,0.348101,0.352302,0.352996,0.350607
4,0.203514,0.191968,0.181703,0.17278


In [152]:
df = df.join(means_3d)

In [157]:
df.sample(3)

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
2891,2011-11-28 18:30:00,ToU,0.359167,0,18:30:00,2011-11-28,11,2011,0.162781,0.180833,0.238833,0.303167,0.287056,0.288167,0.263352,0.244431
65844,2012-01-22 22:00:00,Std,0.351664,6,22:00:00,2012-01-22,1,2012,0.269654,0.26367,0.30433,0.321241,0.32671,0.318403,0.304173,0.29077
59869,2012-07-25 04:30:00,Std,0.105824,2,04:30:00,2012-07-25,7,2012,0.167339,0.125067,0.110885,0.104079,0.100616,0.103466,0.108712,0.115253
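
For reference, a mean restricted to exactly the lags of one offset can be sketched as below. The helper `mean_of_lags` and the daily toy data are hypothetical; a fresh helper frame is built on every call, so each mean only uses the lags passed in.

```python
import pandas as pd

def mean_of_lags(df_in, lags_min, value="mean_cons"):
    """Hypothetical helper: row-wise mean of `value` at the given lags."""
    keys = ["DateTime", "Tariff"]
    cols = []
    out = df_in[keys].copy()
    for lag in lags_min:
        name = "lag_%d" % lag
        shifted = (df_in[keys + [value]]
                   .rename(columns={value: name})
                   .assign(DateTime=lambda d: d["DateTime"]
                           + pd.Timedelta(minutes=lag)))
        out = out.merge(shifted, how="left", on=keys)
        cols.append(name)
    # Only the lag columns enter the mean; missing lags are skipped.
    return out[cols].mean(axis=1)

day = 24 * 60
toy = pd.DataFrame({
    "DateTime": pd.date_range("2013-01-01", periods=4, freq="D"),
    "Tariff": "Std",
    "mean_cons": [1.0, 2.0, 3.0, 4.0],
})
toy["mean_last3d_-0"] = mean_of_lags(toy, [1 * day, 2 * day, 3 * day])
```

For an offset of `h` half-hour steps, the lags would be `[d * day - h * 30 for d in (1, 2, 3)]`; weekly lags follow the same pattern with `7 * day`.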


###### Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In [203]:
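# Same pattern as for the 3-day means, with weekly offsets. aux2 is again
# created once, so the mean for offset h*30 also includes the columns
# merged for smaller values of h.
aux2 = df[['DateTime', 'Tariff']].copy()
means_3w = pd.DataFrame(index=df.index)
for h in range(0,4):
    for w in range(1,4):
        aux = df[['DateTime', 'Tariff', 'mean_cons']].copy()
        # Reading taken w weeks earlier, moved h half-hour steps towards the present.
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = w*7*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'])
    means_3w = means_3w.join(aux2.mean(axis=1).rename('mean_last3w_-%d' %(h*30)))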

In [198]:
df = df.join(means_3w)

In [199]:
df.sample(3)

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90,mean_last3w_-0,mean_last3w_-30,mean_last3w_-60,mean_last3w_-90
18271,2013-09-19 06:00:00,ToU,0.157808,3,06:00:00,2013-09-19,9,2013,0.172362,0.188479,0.17475,0.162415,0.158145,0.168493,0.174458,0.176029,0.138159,0.147067,0.153653,0.157703
78681,2013-06-13 08:30:00,Std,0.191496,3,08:30:00,2013-06-13,6,2013,0.182262,0.175158,0.182603,0.183667,0.18319,0.183547,0.182362,0.18141,0.192418,0.192987,0.192568,0.19151
48971,2013-08-27 14:30:00,ToU,0.149615,1,14:30:00,2013-08-27,8,2013,0.156622,0.168038,0.159808,0.163097,0.173369,0.171835,0.173441,0.174855,0.156015,0.156083,0.156904,0.159982


Note that DataFrame.mean skips NaN values by default. Therefore, for the first days/weeks of the data the means are calculated from whatever values are available (1 or 2 values).
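
The effect of this default can be seen on a small made-up frame; the column names `d1` to `d3` stand for the three lagged values and are assumptions for illustration.

```python
import numpy as np
import pandas as pd

lags = pd.DataFrame({
    "d1": [0.2, 0.2, np.nan],
    "d2": [0.4, np.nan, np.nan],
    "d3": [0.6, np.nan, np.nan],
})

default_mean = lags.mean(axis=1)               # NaN values are skipped
strict_mean = lags.mean(axis=1, skipna=False)  # any NaN gives NaN
n_used = lags.notna().sum(axis=1)              # values behind each mean
```

Keeping a count such as `n_used` is one way to identify rows whose mean rests on fewer than three values.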

###### Weather variables

Now that we have built the derived features, we just have to add the weather variables. We downloaded and treated them in Section 1.

In [212]:
weather = pd.read_csv('data/weather_no_na.csv')

In [213]:
weather.shape

(20446, 13)

We will use the following variables:

-Temperature (ºC)

-Relative Humidity (%)

-Cloud cover (%)

-Atmospheric Pressure (mbar)

For a discussion of why these variables were chosen and how NA values were filled, refer to Section 1.

In [214]:
weather.columns

Index(['time', 'apparentTemperature', 'cloudCover', 'dewPoint', 'humidity',
       'icon', 'precipType', 'pressure', 'summary', 'temperature',
       'visibility', 'windBearing', 'windSpeed'],
      dtype='object')

In [215]:
weather = weather[['time','temperature','humidity','cloudCover','pressure']]\
    .rename(columns = {'time' : 'DateTime'})

In [216]:
weather['DateTime'] = pd.to_datetime(weather['DateTime'])

In [217]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
2,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
3,2011-11-01 03:00:00,14.18,0.88,0.43,1006.4
4,2011-11-01 04:00:00,14.2,0.9,0.38,1006.05


The weather data has a frequency of 1 hour while the consumption data has a frequency of 30 minutes, so we need to resample it. We will carry the last valid observation forward (e.g. 2012-01-01 10:30 will have the same data as 2012-01-01 10:00) => method = 'ffill'.

In [218]:
weather = weather.set_index('DateTime').resample('30min').fillna(method = 'ffill')\
    .reset_index()

In [219]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 00:30:00,13.54,0.87,0.27,1008.01
2,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
3,2011-11-01 01:30:00,12.74,0.93,0.32,1007.76
4,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
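
The upsampling step can be reproduced on three hourly rows taken from the table above. The sketch uses `Resampler.ffill()`, which performs the same forward fill; the frame names are assumptions for illustration.

```python
import pandas as pd

hourly = pd.DataFrame({
    "DateTime": pd.to_datetime(["2011-11-01 00:00",
                                "2011-11-01 01:00",
                                "2011-11-01 02:00"]),
    "temperature": [13.54, 12.74, 13.68],
})

# Upsample to 30 minutes; each new half-hour row copies the previous hour.
half_hourly = (hourly.set_index("DateTime")
               .resample("30min")
               .ffill()
               .reset_index())
```

The resampled index stops at the last observed hour, so the half hour following the final observation (02:30 here) is not created.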


In [221]:
df = df.merge(weather, how = 'left', on = 'DateTime')

###### Bank Holidays

This is the last variable we will include in our model. Bank Holidays will be encoded as a separate binary variable rather than being treated as Sundays.

Bank Holidays in England from 2011 to 2014 were downloaded in Section 1 using the Python holidays library (and cross-checked with the official information on www.gov.uk) and saved to a CSV file.

In [241]:
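# Series.from_csv was deprecated in pandas 0.21 and removed in 1.0; on newer
# versions the file can be read with pd.read_csv(..., header=None, index_col=0).
holidays = pd.Series.from_csv('data/bank_holidays.csv')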

In [245]:
holidays = pd.to_datetime(holidays)

In [246]:
holidays.head()

0   2011-01-01
1   2011-01-03
2   2011-04-22
3   2011-04-25
4   2011-04-29
dtype: datetime64[ns]

In [255]:
df['Holiday'] = 0
df.loc[df['Date'].isin(holidays),'Holiday'] = 1

In [256]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean_cons,DoW,Time,Date,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90,mean_last3w_-0,mean_last3w_-30,mean_last3w_-60,mean_last3w_-90,temperature,humidity,cloudCover,pressure,Holiday
57606,2012-10-22 04:00:00,ToU,0.098568,0,04:00:00,2012-10-22,10,2012,0.219287,0.101445,0.098709,0.096792,0.096584,0.098743,0.101519,0.106834,0.095958,0.099982,0.104822,0.112867,12.62,1.0,0.98,1017.85,0
27650,2012-02-06 08:00:00,Std,0.314526,0,08:00:00,2012-02-06,2,2012,0.336885,0.283092,0.251566,0.222558,0.259396,0.269528,0.275886,0.280448,0.288593,0.282882,0.274363,0.266951,1.72,0.96,0.88,1032.46,0
73119,2012-10-21 14:00:00,Std,0.273623,6,14:00:00,2012-10-21,10,2012,0.214069,0.23685,0.221746,0.217536,0.203899,0.204081,0.208642,0.213785,0.232186,0.233525,0.234909,0.237676,12.42,0.91,0.85,1017.5,0
19257,2013-12-02 21:00:00,Std,0.354043,0,21:00:00,2013-12-02,12,2013,0.260058,0.319056,0.333102,0.348323,0.335521,0.329491,0.322749,0.31461,0.353596,0.345071,0.335667,0.324362,6.64,0.77,0.07,1032.28,0
1590,2012-03-21 20:00:00,ToU,0.345253,2,20:00:00,2012-03-21,3,2012,0.222125,0.35635,0.374243,0.36891,0.374848,0.37013,0.363942,0.356219,0.39619,0.388477,0.382342,0.376333,9.68,0.71,0.27,1032.78,0
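
The holiday indicator can also be written in a single step by casting the boolean mask. The sketch uses two dates from the holiday list above plus one ordinary day; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Two dates from the holiday list above and one ordinary day in between.
holidays = pd.to_datetime(pd.Series(["2011-04-22", "2011-04-25"]))
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2011-04-22", "2011-04-23", "2011-04-25"]),
})

# isin returns booleans; casting to int gives the 0/1 indicator.
toy["Holiday"] = toy["Date"].isin(holidays).astype(int)
```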


Now that we have created all the features for our model, we can save the DataFrame to a file for easier access. Then we can start the prediction.

In [257]:
df.to_csv('outputs/feature_model.csv')
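
When the file is read back, the dates have to be parsed again because CSV stores text only. A minimal round-trip sketch follows, with an in-memory buffer standing in for 'outputs/feature_model.csv' and made-up rows.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "DateTime": pd.to_datetime(["2013-01-01 00:00", "2013-01-01 00:30"]),
    "Tariff": ["Std", "ToU"],
    "mean_cons": [0.1, 0.2],
})

buffer = io.StringIO()  # stands in for the file on disk
toy.to_csv(buffer)
buffer.seek(0)

# Reuse the saved index and ask for the timestamps to be parsed.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["DateTime"])
```

Columns such as `Date` would need to be listed in `parse_dates` as well, and `Time` comes back as plain strings.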

### Prediction