# 4 Prediction of the aggregate demand 

In this section, we will compare the performance of different algorithms for predicting the aggregate demand, treating standard and ToU users separately. First we will use feature models and then exponential smoothing, an algorithm designed specifically for time series.

After adding the external variables (holidays, weather) we will use different algorithms to fit our feature model. Later, we will predict the aggregate demand with exponential smoothing models in R using the "forecast" package.

In general, we will train our models on data from Nov 2011 to Nov 2013 and validate them on data from Nov 2013 to Feb 2014.
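As a rough illustration of this split, the sketch below separates a DataFrame by date. The cut-off date (2013-11-01) and the toy data are assumptions made only for this example; the exact boundary used later may differ.

```python
import pandas as pd

# Toy half-hourly data standing in for the aggregate demand (values are made up)
toy = pd.DataFrame({
    'DateTime': pd.date_range('2013-10-31 22:00', periods=8, freq='30min'),
    'mean': [0.20, 0.19, 0.18, 0.17, 0.16, 0.15, 0.15, 0.14],
})

# Assumed cut-off between the training and validation periods
cutoff = pd.Timestamp('2013-11-01')

train = toy[toy['DateTime'] < cutoff]   # everything before the cut-off
valid = toy[toy['DateTime'] >= cutoff]  # the cut-off and everything after it
```

Splitting by time rather than at random keeps the validation period strictly after the training period, which mirrors how a forecast would be used.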

We will use the aggregate statistics we computed in Section 2 with Spark.

## 4.1 Preparing the dataset

First, we will prepare our data so it can be used in the different algorithms proposed. We will start by including the time and weather variables, as these are needed for all models, and then we will add the engineered features for the feature models.

In [15]:
import pandas as pd
pd.options.display.max_columns = 999
import datetime

In [16]:
df = pd.read_csv('outputs/agg_stats.csv', index_col=0)

In [17]:
df.head()

Unnamed: 0,DateTime,Tariff,ToU_User,count,sum,min,mean,max,std_dev
0,2012-10-15 21:00:00,Std,0,4210,1270.587,0.0,0.301802,5.335,0.31913
1,2012-10-21 21:00:00,Std,0,4287,1239.813999,0.0,0.289203,6.095,0.320852
2,2012-10-27 13:00:00,Std,0,4402,1159.695,0.0,0.263447,3.554,0.336398
3,2012-10-28 17:30:00,Std,0,4402,1665.008001,0.0,0.378239,8.04,0.447315
4,2012-11-04 00:30:00,Std,0,4403,994.769,0.0,0.22593,6.072,0.424697


We keep only the columns we need: DateTime, Tariff and mean consumption. We are interested in predicting the aggregate demand (i.e. the sum), but because the number of readings differs from one timestamp to the next, we will use the mean instead.
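The relation between the two quantities can be checked on the first two rows of the table above: the mean is the sum divided by the count, so a predicted mean can be converted back to an aggregate by multiplying by the number of users. A minimal sketch, with the values copied from the output of df.head():

```python
import pandas as pd

# First two rows of the aggregate statistics shown above
agg = pd.DataFrame({
    'count': [4210, 4287],
    'sum': [1270.587, 1239.813999],
    'mean': [0.301802, 0.289203],
})

# Recompute the mean from sum and count and compare with the stored value
recomputed = agg['sum'] / agg['count']
max_diff = (recomputed - agg['mean']).abs().max()

# Going the other way: aggregate demand recovered from the mean
aggregate_from_mean = agg['mean'] * agg['count']
```

The small differences are due to the rounding of the printed mean.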

In [18]:
df = df[['DateTime','Tariff','mean']].copy()  # copy so later column assignments do not act on a slice

In [19]:
df['DateTime'] = pd.to_datetime(df['DateTime'])

In [20]:
type(df['DateTime'][2])

pandas.tslib.Timestamp

In [21]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean
69664,2014-02-09 10:30:00,Std,0.274834
32885,2014-02-20 02:30:00,Std,0.132424
75045,2012-04-09 05:00:00,ToU,0.100082
29813,2012-01-01 20:00:00,ToU,0.359456
28597,2013-11-20 06:00:00,ToU,0.130753


### 4.1.1. Bank Holidays

Bank Holidays probably have an impact on household electricity consumption, as discussed in Section 1. They will be encoded as a binary variable, separately from Sundays.

Bank Holidays in England from 2011 to 2014 were obtained in Section 1 using the Python holidays library (cross-checked against the official information at www.gov.co.uk) and saved to a CSV file.

In [22]:
holidays = pd.Series.from_csv('data/bank_holidays.csv')

In [23]:
holidays = pd.to_datetime(holidays)

In [24]:
holidays.head()

0   2011-01-01
1   2011-01-03
2   2011-04-22
3   2011-04-25
4   2011-04-29
dtype: datetime64[ns]
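Series.from_csv is not available in recent pandas releases (it was deprecated and later removed). A possible equivalent with read_csv is sketched below. The file layout (no header, an integer index column followed by one date per row) is an assumption based on the output above, the column names idx and date are hypothetical, and an in-memory string stands in for 'data/bank_holidays.csv'.

```python
import io
import pandas as pd

# Assumed layout of the CSV: no header, integer index, one date per row
csv_text = "0,2011-01-01\n1,2011-01-03\n2,2011-04-22\n"

# 'idx' and 'date' are hypothetical column names introduced for this sketch
holidays_new = pd.read_csv(io.StringIO(csv_text), header=None,
                           names=['idx', 'date'], index_col=0)['date']
holidays_new = pd.to_datetime(holidays_new)
```

In the notebook, io.StringIO(csv_text) would be replaced by the path to the CSV file.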

In [28]:
df['Date'] = df['DateTime'].apply(lambda x: x.date())

In [29]:
df['Date'] = pd.to_datetime(df['Date'])

In [31]:
df['Holiday'] = 0
df.loc[df['Date'].isin(holidays),'Holiday'] = 1
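The same flag can be built without an intermediate apply by using the .dt accessor. This is only a sketch on toy timestamps; the two holiday dates are taken from the outputs in this section.

```python
import pandas as pd

# Two bank holidays that appear in the outputs of this section
toy_holidays = pd.to_datetime(pd.Series(['2013-05-06', '2013-08-26']))

# One timestamp on a bank holiday and one on the following day
stamps = pd.to_datetime(pd.Series(['2013-08-26 04:30:00', '2013-08-27 04:30:00']))

# normalize() sets the time part to midnight, so it can be compared with dates
dates = stamps.dt.normalize()
holiday_flag = dates.isin(toy_holidays).astype(int)
```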

In [32]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean,Holiday,Date
34367,2011-12-13 22:00:00,ToU,0.364237,0,2011-12-13
60551,2013-05-31 00:30:00,Std,0.137393,0,2013-05-31
21706,2012-06-16 00:30:00,Std,0.156315,0,2012-06-16
5813,2012-02-24 19:30:00,Std,0.375746,0,2012-02-24
18028,2013-09-03 13:00:00,Std,0.162294,0,2013-09-03


In [33]:
df[df['Holiday'] == 1].sample(5)

Unnamed: 0,DateTime,Tariff,mean,Holiday,Date
44888,2013-08-26 04:30:00,ToU,0.088565,1,2013-08-26
31348,2012-01-02 03:00:00,Std,0.163754,1,2012-01-02
52370,2013-08-26 11:30:00,Std,0.185649,1,2013-08-26
69442,2013-05-06 22:30:00,ToU,0.178713,1,2013-05-06
22814,2013-05-06 17:00:00,Std,0.22442,1,2013-05-06


### 4.1.3 Weather variables

Now we have to add the weather variables, which will be used in every model. The weather variables were discussed, downloaded and cleaned in Section 1.

In [34]:
weather = pd.read_csv('data/weather_no_na.csv')

In [35]:
weather.shape

(20446, 13)

We will use the following variables:

-Temperature (ºC)

-Relative Humidity (stored as a fraction from 0 to 1)

-Cloud cover (stored as a fraction from 0 to 1)

-Atmospheric Pressure (mbar)

For a discussion of why these variables were chosen and how NA values were filled, refer to Section 1.

In [36]:
weather.columns

Index(['time', 'apparentTemperature', 'cloudCover', 'dewPoint', 'humidity',
       'icon', 'precipType', 'pressure', 'summary', 'temperature',
       'visibility', 'windBearing', 'windSpeed'],
      dtype='object')

In [37]:
weather = weather[['time','temperature','humidity','cloudCover','pressure']]\
    .rename(columns = {'time' : 'DateTime'})

In [38]:
weather['DateTime'] = pd.to_datetime(weather['DateTime'])

In [39]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
2,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
3,2011-11-01 03:00:00,14.18,0.88,0.43,1006.4
4,2011-11-01 04:00:00,14.2,0.9,0.38,1006.05


The weather data is hourly while the consumption data is half-hourly, so we need to resample the weather data to 30-minute steps. We will forward-fill with the last valid observation (e.g. 2012-01-01 10:30 will have the same data as 2012-01-01 10:00).

In [40]:
weather = weather.set_index('DateTime').resample('30min').ffill()\
    .reset_index()

In [41]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 00:30:00,13.54,0.87,0.27,1008.01
2,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
3,2011-11-01 01:30:00,12.74,0.93,0.32,1007.76
4,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
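A toy version of this upsampling, using the first three hourly temperatures from the table above, shows what the forward fill produces. Note that no row is created after the last hourly timestamp, so the final half hour of the series has no weather data of its own.

```python
import pandas as pd

# First three hourly observations from the weather table
hourly = pd.DataFrame({
    'DateTime': pd.to_datetime(['2011-11-01 00:00', '2011-11-01 01:00',
                                '2011-11-01 02:00']),
    'temperature': [13.54, 12.74, 13.68],
})

# Upsample to 30 minutes and carry the last observation forward
half_hourly = (hourly.set_index('DateTime')
                     .resample('30min')
                     .ffill()
                     .reset_index())
```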


In [42]:
df = df.merge(weather, how = 'left', on = 'DateTime')

In [43]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean,Holiday,Date,temperature,humidity,cloudCover,pressure
1687,2013-08-03 08:00:00,Std,0.163616,0,2013-08-03,17.35,0.79,0.64,1014.18
40225,2012-03-29 06:00:00,ToU,0.173429,0,2012-03-29,7.02,0.83,0.2,1028.81
19891,2012-03-13 01:30:00,ToU,0.127986,0,2012-03-13,7.48,0.94,0.89,1035.0
58240,2013-10-30 07:30:00,Std,0.196863,0,2013-10-30,4.46,0.92,0.22,1020.98
49474,2012-05-10 23:30:00,ToU,0.135004,0,2012-05-10,14.22,0.88,0.75,1013.28
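Because this is a left join, any consumption timestamp without a weather record ends up with NaN weather values. One way to check for this is sketched below on toy data (all values are made up); validate and indicator are standard parameters of DataFrame.merge.

```python
import pandas as pd

# Toy consumption data: two tariffs at two timestamps
demand = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-01-01 10:00', '2012-01-01 10:00',
                                '2012-01-01 10:30', '2012-01-01 10:30']),
    'Tariff': ['Std', 'ToU', 'Std', 'ToU'],
    'mean': [0.25, 0.24, 0.26, 0.23],
})

# Toy weather data that deliberately misses the 10:30 timestamp
toy_weather = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-01-01 10:00']),
    'temperature': [8.1],
})

# validate raises an error if a weather timestamp is duplicated;
# indicator adds a '_merge' column telling where each row came from
merged = demand.merge(toy_weather, how='left', on='DateTime',
                      validate='many_to_one', indicator=True)
n_unmatched = int((merged['_merge'] == 'left_only').sum())
```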


Now that we have added all the external variables our models need, we can save the dataset to a CSV file.

In [44]:
df.to_csv('outputs/tseries_model.csv')

We will use this dataset as is for time series modelling. For the feature models we still need to add the engineered features.

### 4.1.4. Feature engineering

Feature models are an alternative approach to time series forecasting. They consist of applying conventional machine learning models to variables constructed from the data (e.g. mean consumption in the previous day). They are popular in the literature on electrical demand prediction [2], [4].

We will derive the following features:

-Mean aggregate demand per hour in the previous day (1 variable).

-Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In total we have 13 derived features.

Furthermore, we will include weather and time variables in the model:

-Time (30-min resolution)

-Day of week (7 levels)

-Month (12 levels)

-Holiday (binary)

-Temperature (ºC, continuous)

-Relative Humidity (stored as a fraction from 0 to 1, continuous)

-Cloud cover (stored as a fraction from 0 to 1, continuous)

-Atmospheric Pressure (mbar, continuous)

The weather variables were prepared in Section 1 and stored in 'weather_no_na.csv'.

The last variable to be included in the model is:

-Tariff (p/kWh)

Thus, in total we will have 22 variables in our models.


First, we will build the time variables, which will make it easier to derive the consumption features.

#### Adding time variables

In [45]:
df['DoW'] = df['DateTime'].apply(lambda x: x.weekday())

In [46]:
df['Time'] = df['DateTime'].apply(lambda x: x.time())

In [47]:
df['Month']= df['DateTime'].apply(lambda x: x.month)

In [48]:
df['Year']= df['DateTime'].apply(lambda x: x.year)

In [49]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year
46081,2012-09-25 21:00:00,ToU,0.271262,0,2012-09-25,11.76,0.84,0.33,988.92,1,21:00:00,9,2012
9795,2012-07-13 03:00:00,Std,0.099904,0,2012-07-13,14.34,0.95,0.85,1003.32,4,03:00:00,7,2012
60584,2013-11-19 01:30:00,Std,0.151778,0,2013-11-19,6.29,0.81,0.0,1009.46,1,01:30:00,11,2013
23199,2013-04-19 20:00:00,Std,0.303767,0,2013-04-19,8.03,0.76,0.37,1031.83,4,20:00:00,4,2013
73239,2014-02-14 20:00:00,Std,0.368049,0,2014-02-14,11.75,0.68,0.75,978.16,4,20:00:00,2,2014
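The same four time variables can also be obtained with the vectorised .dt accessor instead of apply. A short sketch on two timestamps taken from the sample above:

```python
import pandas as pd

# Two timestamps from the sample output above
stamps = pd.Series(pd.to_datetime(['2012-09-25 21:00:00', '2014-02-14 20:00:00']))

time_vars = pd.DataFrame({
    'DateTime': stamps,
    'DoW': stamps.dt.dayofweek,   # Monday = 0, same convention as weekday()
    'Time': stamps.dt.time,
    'Month': stamps.dt.month,
    'Year': stamps.dt.year,
})
```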


###### Mean aggregate demand per hour in the previous day (1 variable).

First, we will calculate the daily mean by aggregating over 'Date' and 'Tariff'. Then we will add one day to the 'Date' variable of the new DataFrame, so that each day's mean is keyed by the following day, and join it with the main DataFrame df.

In [50]:
daily_mean = df.groupby(['Date', 'Tariff']).agg({'mean': 'mean'}).reset_index()

In [51]:
daily_mean['Date'] = pd.to_datetime(daily_mean['Date'])

In [52]:
type(daily_mean['Date'][2])

pandas.tslib.Timestamp

In [53]:
daily_mean.sample(5)

Unnamed: 0,Date,Tariff,mean
538,2012-08-18,Std,0.169564
191,2012-02-26,ToU,0.239881
956,2013-03-15,Std,0.245509
1556,2014-01-09,Std,0.235771
590,2012-09-13,Std,0.174227


In [54]:
daily_mean['Date'] = daily_mean['Date'] + datetime.timedelta(days=1)

In [55]:
df = df.merge(daily_mean, how = 'left', on = ['Date','Tariff'])

In [56]:
df = df.rename(columns = {'mean_x' : 'mean_cons',
          'mean_y' : 'mean_prev_day'})

In [57]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean_cons,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year,mean_prev_day
68505,2014-01-18 09:00:00,Std,0.241734,0,2014-01-18,8.12,0.82,1.0,991.83,5,09:00:00,1,2014,0.236838
56254,2013-09-04 03:30:00,Std,0.097991,0,2013-09-04,14.56,0.95,0.0,1024.28,2,03:30:00,9,2013,0.169922
34701,2013-03-08 14:30:00,ToU,0.19777,0,2013-03-08,9.64,0.95,0.86,998.0,4,14:30:00,3,2013,0.219322
46003,2012-08-22 22:30:00,Std,0.184515,0,2012-08-22,16.11,0.74,0.28,1017.89,2,22:30:00,8,2012,0.167222
21347,2012-01-05 00:00:00,Std,0.227402,0,2012-01-05,9.83,0.8,0.67,1002.92,3,00:00:00,1,2012,0.265686


###### Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

In [58]:
df.head()

Unnamed: 0,DateTime,Tariff,mean_cons,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year,mean_prev_day
0,2012-10-15 21:00:00,Std,0.301802,0,2012-10-15,11.8,0.82,0.38,1000.87,0,21:00:00,10,2012,0.227159
1,2012-10-21 21:00:00,Std,0.289203,0,2012-10-21,12.0,0.96,0.87,1017.99,6,21:00:00,10,2012,0.214069
2,2012-10-27 13:00:00,Std,0.263447,0,2012-10-27,7.46,0.73,0.57,1016.85,5,13:00:00,10,2012,0.225192
3,2012-10-28 17:30:00,Std,0.378239,0,2012-10-28,9.09,0.86,0.43,1012.33,6,17:30:00,10,2012,0.23787
4,2012-11-04 00:30:00,Std,0.22593,0,2012-11-04,3.08,0.91,0.13,999.88,6,00:30:00,11,2012,0.24095


In [59]:
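def prev_ts(df_in, initial, final, step):
    # Builds one lagged copy of 'mean_cons' per lag i (in minutes), for
    # i = initial, initial + step, ..., final (both ends are included).
    df_out = df_in[['DateTime', 'Tariff']].copy()
    for i in range(initial, final + 1, step):
        aux = df_in[['DateTime', 'Tariff', 'mean_cons']].copy()
        # Shift the timestamps forward by i minutes so that, after the merge,
        # each row receives the value observed i minutes earlier.
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = i)
        # Suffixes are only applied on a column-name clash, so the first lag
        # keeps the plain name 'mean_cons'; later lags become 'mean_cons_-<i>'.
        df_out = df_out.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'], suffixes = ('','_-%d' %(i)))
    return df_out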

In [60]:
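# Lags from 22.5 h to 24 h in 30-minute steps (1350, 1380, 1410 and 1440 minutes)
tmp = prev_ts(df, 22*60+30, 24*60, 30)
# Only the three suffixed columns are selected here: the 1350-minute lag comes
# back from prev_ts under the unsuffixed name 'mean_cons' and is not carried over.
df = df.merge(tmp[['DateTime','Tariff','mean_cons_-1380', 'mean_cons_-1410', 'mean_cons_-1440']],
         on = ['DateTime','Tariff'], suffixes = ('',''))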

In [61]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean_cons,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440
30618,2013-08-15 22:30:00,ToU,0.173125,0,2013-08-15,18.0,0.8,0.33,1017.02,3,22:30:00,8,2013,0.152033,0.125588,0.143171,0.16584
17245,2014-01-01 13:30:00,Std,0.3257,1,2014-01-01,8.74,0.9,0.07,990.23,2,13:30:00,1,2014,0.252249,0.287366,0.288366,0.290529
31051,2013-08-25 08:30:00,ToU,0.183269,0,2013-08-25,15.42,0.96,1.0,1011.28,6,08:30:00,8,2013,0.164833,0.172518,0.177876,0.170593
65833,2011-12-17 22:00:00,Std,0.346586,0,2011-12-17,3.64,0.8,0.48,1012.41,5,22:00:00,12,2011,0.285098,0.341941,0.366468,0.376191
2356,2013-04-17 18:30:00,ToU,0.240286,0,2013-04-17,17.46,0.59,0.39,1009.15,2,18:30:00,4,2013,0.184304,0.280125,0.263716,0.25955
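In prev_ts the merge suffix is only applied when two columns share a name, so the first lag is returned as plain 'mean_cons' and only three lag columns reach df. If all four lags are wanted, one option is to name each column explicitly before merging. The helper below, lagged_columns, is a hypothetical variant written for illustration on toy data, not the function used to produce the outputs above.

```python
import pandas as pd

def lagged_columns(df_in, lags_minutes, value_col='mean_cons'):
    """Hypothetical helper: one explicitly named column per lag (in minutes)."""
    df_out = df_in[['DateTime', 'Tariff']].copy()
    for lag in lags_minutes:
        aux = df_in[['DateTime', 'Tariff', value_col]].copy()
        # Shift timestamps forward so each row receives the value seen `lag` minutes earlier
        aux['DateTime'] = aux['DateTime'] + pd.Timedelta(minutes=lag)
        # Rename before merging so no lag depends on merge suffixes
        aux = aux.rename(columns={value_col: '%s_-%d' % (value_col, lag)})
        df_out = df_out.merge(aux, how='left', on=['DateTime', 'Tariff'])
    return df_out

# Toy data: three days of half-hourly values 0, 1, 2, ... for a single tariff
idx = pd.date_range('2013-01-01', periods=3 * 48, freq='30min')
toy = pd.DataFrame({'DateTime': idx, 'Tariff': 'Std',
                    'mean_cons': list(range(len(idx)))})

lags = lagged_columns(toy, [1350, 1380, 1410, 1440])
```

With this naming, the first row of the second day (position 48) receives the value from 24 hours earlier in 'mean_cons_-1440' and the value from 22.5 hours earlier in 'mean_cons_-1350'.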


###### Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

In [62]:
aux2 = df[['DateTime', 'Tariff']].copy()
means_3d = pd.DataFrame(index=df.index)
for h in range(0,4):
    for d in range(1,4):
        aux = df[['DateTime', 'Tariff', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = d*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'])
    means_3d = means_3d.join(aux2.mean(axis=1).rename('mean_last3d_-%d' %(h*30)))

In [63]:
means_3d.head()

Unnamed: 0,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
0,0.286278,0.276157,0.264642,0.251189
1,0.276418,0.269546,0.258608,0.246133
2,0.209606,0.206111,0.204395,0.204098
3,0.348101,0.352302,0.352996,0.350607
4,0.203514,0.191968,0.181703,0.17278


In [64]:
df = df.join(means_3d)

In [65]:
df.sample(3)

Unnamed: 0,DateTime,Tariff,mean_cons,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
49044,2012-03-12 07:00:00,ToU,0.194819,0,2012-03-12,5.93,0.96,0.47,1035.29,0,07:00:00,3,2012,0.23902,0.214299,0.184868,0.118326,0.159968,0.18519,0.205788,0.218541
43381,2011-12-09 13:00:00,ToU,0.2604,0,2011-12-09,7.68,0.62,0.31,1010.92,4,13:00:00,12,2011,0.257634,0.273115,0.292962,0.309731,0.266072,0.252653,0.247568,0.245618
71943,2013-01-05 05:30:00,Std,0.128027,0,2013-01-05,8.17,0.94,0.86,1035.38,5,05:30:00,1,2013,0.240086,0.155742,0.140181,0.128496,0.133038,0.138189,0.145585,0.151697


###### Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In [66]:
aux2 = df[['DateTime', 'Tariff']].copy()
means_3w = pd.DataFrame(index=df.index)
for h in range(0,4):
    for w in range(1,4):
        aux = df[['DateTime', 'Tariff', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = w*7*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','Tariff'])
    means_3w = means_3w.join(aux2.mean(axis=1).rename('mean_last3w_-%d' %(h*30)))

In [67]:
df = df.join(means_3w)

In [68]:
df.sample(3)

Unnamed: 0,DateTime,Tariff,mean_cons,Holiday,Date,temperature,humidity,cloudCover,pressure,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90,mean_last3w_-0,mean_last3w_-30,mean_last3w_-60,mean_last3w_-90
26278,2014-02-18 00:30:00,ToU,0.153197,0,2014-02-18,8.13,0.87,0.07,1009.44,1,00:30:00,2,2014,0.215931,0.121131,0.136476,0.154438,0.165929,0.156137,0.147634,0.141069,0.159337,0.148464,0.139922,0.133518
73166,2013-05-01 20:30:00,Std,0.300039,0,2013-05-01,10.58,0.48,0.31,1022.83,2,20:30:00,5,2013,0.188458,0.2665,0.290514,0.303346,0.30572,0.298633,0.288665,0.275618,0.305824,0.297761,0.289023,0.276853
8180,2014-01-28 05:00:00,Std,0.129436,0,2014-01-28,5.81,0.87,0.07,985.22,1,05:00:00,1,2014,0.249824,0.154049,0.142309,0.133349,0.129292,0.131501,0.134825,0.140999,0.127401,0.131363,0.136468,0.14563


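Note that the pandas mean method skips NaN values by default. Therefore, in the first days/weeks the means are calculated with the available data (1 or 2 values). Note also that aux2 is created once, before the outer loop, and keeps the columns merged for earlier values of h. As a result, the -30, -60 and -90 columns average over all the offsets processed so far (6, 9 and 12 lagged values respectively) rather than over 3 values each.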
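If each of the four columns should average exactly three lagged values, the accumulator has to be re-created for every offset. The sketch below does this on toy data with a hypothetical helper, mean_over_lags; it keeps the same time shifts as the loops above (d*24*60 - h*30 minutes) and changes only where the accumulator is initialised. It is an illustration, not the code that produced the outputs in this section.

```python
import pandas as pd

def mean_over_lags(df_in, base_lags_min, offsets_min, prefix, value_col='mean_cons'):
    """Hypothetical helper: for each offset, average the values at the given base lags."""
    out = pd.DataFrame(index=df_in.index)
    for off in offsets_min:
        # Fresh accumulator for every offset, so columns do not pile up
        lagged = df_in[['DateTime', 'Tariff']].copy()
        lag_cols = []
        for k, base in enumerate(base_lags_min):
            aux = df_in[['DateTime', 'Tariff', value_col]].copy()
            aux['DateTime'] = aux['DateTime'] + pd.Timedelta(minutes=base - off)
            aux = aux.rename(columns={value_col: 'lag_%d' % k})
            lagged = lagged.merge(aux, how='left', on=['DateTime', 'Tariff'])
            lag_cols.append('lag_%d' % k)
        # mean skips NaN, so early rows use whatever lags are available
        out['%s_-%d' % (prefix, off)] = lagged[lag_cols].mean(axis=1).to_numpy()
    return out

# Toy data: five days of half-hourly values 0, 1, 2, ... for a single tariff
idx = pd.date_range('2013-01-01', periods=5 * 48, freq='30min')
toy = pd.DataFrame({'DateTime': idx, 'Tariff': 'Std',
                    'mean_cons': list(range(len(idx)))})

one_day = 24 * 60
means_3d_toy = mean_over_lags(toy, [one_day, 2 * one_day, 3 * one_day],
                              [0, 30, 60, 90], 'mean_last3d')
```

For the weekly version the base lags would be 7, 14 and 21 days expressed in minutes.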

Now that we have built all the features, we can save this dataset to a CSV file for later use.

In [69]:
df.to_csv('outputs/features_model.csv')