# 4 Prediction of the aggregate demand 

In this section, we will compare the performance of different algorithms in predicting the aggregate demand, separating standard and ToU users. First, we will use feature models and then exponential smoothing, which is a time-series dedicated algorithm.

After adding the external variables (holidays, weather) we will use different algorithms to fit our feature model. Later, we will predict the aggregate demand with exponential smoothing models in R using the "forecast" package.

In general we will train our model with Nov 2011- Nov 2013 data and validate it with Nov 2013 - Feb 2014 data.

We will use the aggregate statistics we computed in Section 2 with Spark.

## 4.1 Preparing the dataset

First, we will prepare our data so it can be use in the different algorithms proposed. We will start including time and weather variables as these are needed for all models and then we will add the engineered features for the feature models.

In [1]:
import pandas as pd
pd.options.display.max_columns = 999
import datetime

In [2]:
df = pd.read_csv('outputs/agg_stats.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,DateTime,Tariff,ToU_User,count,sum,min,mean,max,std_dev
0,2012-10-15 21:00:00,Std,0,4210,1270.587,0.0,0.301802,5.335,0.31913
1,2012-10-21 21:00:00,Std,0,4287,1239.813999,0.0,0.289203,6.095,0.320852
2,2012-10-27 13:00:00,Std,0,4402,1159.695,0.0,0.263447,3.554,0.336398
3,2012-10-28 17:30:00,Std,0,4402,1665.008001,0.0,0.378239,8.04,0.447315
4,2012-11-04 00:30:00,Std,0,4403,994.769,0.0,0.22593,6.072,0.424697


We keep only the columns we need: DateTime, Tariff and mean consumption. We are interested in predicting the aggregate demand (i.e. the sum) but as the data count in each Timestamp is different, we will use the mean.

In [4]:
df = df[['DateTime','Tariff','mean']]

In [5]:
df['DateTime'] = pd.to_datetime(df['DateTime'])

In [6]:
type(df['DateTime'][2])

pandas.tslib.Timestamp

In [7]:
df.sample(5)

Unnamed: 0,DateTime,Tariff,mean
67700,2013-10-26 02:00:00,Std,0.109328
75539,2014-01-06 18:00:00,Std,0.376058
70913,2012-07-04 22:00:00,ToU,0.214773
56670,2013-09-12 10:00:00,Std,0.179702
5260,2012-03-02 08:30:00,ToU,0.254258


Variable Tariff does not actually contain the Tariff the users are subject to in that timestamp, but the user group (whether they were included in dToU in 2013 or not). Therefore, we will change the variable name to User_group for clarity.

In [8]:
df = df.rename(columns = {'Tariff' : 'User_group'})

### 4.1.1 Tariff

On of the main variables to include in our dataset is the price of electricity. We have grouped standard flat rate and dToU users in two different groups but we need the actual values of the tariff. 

For standard users, the price is always equal to 14.228 p/kWh.

For ToU users, the tariff level for each 30-min interval is given in an Excel sheet we downloaded in Section 3.2. The corresponding values for each level are:

-High = 67.20p/kWh

-Normal = 11.76p/kWh

-Low = 3.99p/kWh

In [9]:
tariffs = pd.read_excel('data/Tariffs.xlsx')

In [10]:
tariffs.loc[tariffs['Tariff'] == 'Normal', 'Tariff_value'] = 11.76
tariffs.loc[tariffs['Tariff'] == 'High', 'Tariff_value'] = 67.20
tariffs.loc[tariffs['Tariff'] == 'Low', 'Tariff_value'] = 3.99

In [11]:
tariffs = tariffs.rename(columns = {'TariffDateTime' : 'DateTime'})

In [12]:
tariffs.sample(5)

Unnamed: 0,DateTime,Tariff,Tariff_value
11221,2013-08-22 18:30:00,Normal,11.76
1724,2013-02-05 22:00:00,Low,3.99
13428,2013-10-07 18:00:00,Normal,11.76
3596,2013-03-16 22:00:00,High,67.2
3605,2013-03-17 02:30:00,Low,3.99


In [13]:
tariffs = tariffs.drop('Tariff', axis=1)

In [14]:
df = df.merge(tariffs, how = 'left', on = 'DateTime')

In [15]:
df['Tariff_value'] = df['Tariff_value'].fillna(14.228) #This is the standard tariff

In [16]:
df[['DateTime','User_group','Tariff_value']].sample(5)

Unnamed: 0,DateTime,User_group,Tariff_value
43033,2012-11-13 20:30:00,Std,14.228
13753,2012-06-05 17:00:00,ToU,14.228
43766,2012-01-10 16:30:00,ToU,14.228
72554,2012-09-04 23:00:00,ToU,14.228
22106,2012-02-21 09:00:00,Std,14.228


### 4.1.2. Bank Holidays

Bank Holidays probably have an impact in household electricity consumption as it was discussed in Section 1. Bank Holidays will be treated separately from Sundays as a binary variable. 

Bank Holidays in England from 2011 to 2014 were downloaded in Section 1 using the Python holidays library (and cross-checked with the official information in www.gov.co.uk) and saved to a csv file.

In [17]:
holidays = pd.Series.from_csv('data/bank_holidays.csv')

In [18]:
holidays = pd.to_datetime(holidays)

In [19]:
holidays.head()

0   2011-01-01
1   2011-01-03
2   2011-04-22
3   2011-04-25
4   2011-04-29
dtype: datetime64[ns]

In [20]:
df['Date'] = df['DateTime'].apply(lambda x: x.date())

In [21]:
df['Date'] = pd.to_datetime(df['Date'])

In [22]:
df['Holiday'] = 0
df.loc[df['Date'].isin(holidays),'Holiday'] = 1

In [23]:
df.sample(5)

Unnamed: 0,DateTime,User_group,mean,Tariff_value,Date,Holiday
61991,2014-01-16 23:30:00,ToU,0.211941,14.228,2014-01-16,0
67446,2012-07-20 16:30:00,ToU,0.199608,14.228,2012-07-20,0
63808,2012-03-07 18:30:00,Std,0.36382,14.228,2012-03-07,0
39054,2012-01-25 00:30:00,ToU,0.161405,14.228,2012-01-25,0
30817,2013-09-30 21:30:00,Std,0.248219,11.76,2013-09-30,0


In [24]:
df[df['Holiday'] == 1].sample(5)

Unnamed: 0,DateTime,User_group,mean,Tariff_value,Date,Holiday
19542,2012-12-26 23:00:00,Std,0.265504,14.228,2012-12-26,1
78647,2013-01-01 20:30:00,Std,0.359184,11.76,2013-01-01,1
60759,2013-05-06 12:30:00,ToU,0.179697,11.76,2013-05-06,1
74076,2011-12-25 06:00:00,Std,0.142732,14.228,2011-12-25,1
46890,2012-06-04 14:00:00,ToU,0.202942,14.228,2012-06-04,1


### 4.1.3. Weather variables

Now we have to add the weather variables, which will be used in any model. Weather variables were discussed, downloaded and treated in Section 1.

In [25]:
weather = pd.read_csv('data/weather_no_na.csv')

In [26]:
weather.shape

(20446, 13)

We will use the following variables:

-Temperature (ºC)

-Relative Humidity (%)

-Cloud cover (%)

-Atmospheric Pressure (mbar)

For a discussion on the reasons why these variables have been choosen and how NA values have been filled refer to Section 1.

In [27]:
weather.columns

Index(['time', 'apparentTemperature', 'cloudCover', 'dewPoint', 'humidity',
       'icon', 'precipType', 'pressure', 'summary', 'temperature',
       'visibility', 'windBearing', 'windSpeed'],
      dtype='object')

In [28]:
weather = weather[['time','temperature','humidity','cloudCover','pressure']]\
    .rename(columns = {'time' : 'DateTime'})

In [29]:
weather['DateTime'] = pd.to_datetime(weather['DateTime'])

In [30]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
2,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
3,2011-11-01 03:00:00,14.18,0.88,0.43,1006.4
4,2011-11-01 04:00:00,14.2,0.9,0.38,1006.05


The frequency of the weather data is of 1 hour while that of the consumption data is 30 minutes, thus we need to resample it. We will take the last valid observation (e.g. 2012-01-01 10:30 will have the same data as 2012-01-01 10:00) => method = 'ffill'.

In [31]:
weather = weather.set_index('DateTime').resample('30min').fillna(method = 'ffill')\
    .reset_index()

In [32]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 00:30:00,13.54,0.87,0.27,1008.01
2,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
3,2011-11-01 01:30:00,12.74,0.93,0.32,1007.76
4,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97


In [33]:
df = df.merge(weather, how = 'left', on = 'DateTime')

In [34]:
df.sample(5)

Unnamed: 0,DateTime,User_group,mean,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure
57972,2012-10-11 08:00:00,ToU,0.205669,14.228,2012-10-11,0,10.84,0.85,0.77,1005.95
13860,2013-08-22 21:00:00,ToU,0.228711,11.76,2013-08-22,0,18.03,0.91,0.12,1018.95
21548,2012-05-08 07:00:00,ToU,0.194039,14.228,2012-05-08,0,11.52,0.93,0.6,1007.43
49844,2012-04-14 18:30:00,ToU,0.335103,14.228,2012-04-14,0,9.98,0.56,0.32,1009.28
26817,2013-10-08 17:30:00,Std,0.273431,11.76,2013-10-08,0,18.08,0.79,0.75,1023.96


Now that we have added all external variables our models need, we can save it to a csv file. 

In [35]:
df.to_csv('outputs/tseries_model.csv')

We will use this model as is in time series modelling. For feature models we need to add the features.

## 4.1.4. Feature engineering

Feature models are an alternative for time series forecasting. They consist in applying conventional machine learning models to variables constructed from the data (e.g. mean consumption in the previous day). These are popular in the literature for electrical demand prediction [2], [4].

We will derive the following features:

-Mean aggregate demand per hour in the previous day (1 variable).

-Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In total we have 13 derived features.

Furthermore, we will include weather and time variables in the model:

-Time (30-min resolution)

-Day of week (7 levels)

-Month (12 levels)

-Holiday (binary)

-Temperature (ºC, continuous)

-Relative Humidity (%, continuous)

-Cloud cover (%, continuous)

-Atmospheric Pressure (mbar, continuous)

These variables have been calculated in Section 1 and stored in 'weather_no_na.csv'.

And the last variable to be included in the model:
-Tariff (p/kWh)

Thus, in total we will have 22 variables in our models.


First, we will build the time variables.

Now it is easier to derive the consumption features.

#### Adding time variables

In [36]:
df['DoW'] = df['DateTime'].apply(lambda x: x.weekday())

In [37]:
df['Time'] = df['DateTime'].apply(lambda x: x.time())

In [38]:
df['Day'] = df['DateTime'].apply(lambda x: x.day)

In [39]:
df['Month']= df['DateTime'].apply(lambda x: x.month)

In [40]:
df['Year']= df['DateTime'].apply(lambda x: x.year)

In [41]:
df.sample(5)

Unnamed: 0,DateTime,User_group,mean,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year
71394,2013-06-12 15:00:00,ToU,0.202382,3.99,2013-06-12,0,16.44,0.86,1.0,1011.31,2,15:00:00,12,6,2013
36270,2013-07-10 15:00:00,ToU,0.144002,11.76,2013-07-10,0,22.62,0.54,0.31,1024.67,2,15:00:00,10,7,2013
42660,2013-04-04 02:00:00,Std,0.157437,11.76,2013-04-04,0,1.87,0.76,0.2,1014.82,3,02:00:00,4,4,2013
50504,2013-04-19 22:00:00,ToU,0.226941,11.76,2013-04-19,0,6.26,0.83,0.0,1032.81,4,22:00:00,19,4,2013
9380,2012-05-10 03:30:00,Std,0.109917,14.228,2012-05-10,0,15.35,0.93,0.85,1006.6,3,03:30:00,10,5,2012


###### Mean aggregate demand per hour in the previous day (1 variable).

First, we will calculate the daily mean aggregating by 'Date'. Then we will add one day to the 'Date' variable of our new DataFrame and join it with the main DataFrame df.

In [42]:
daily_mean = df.groupby(['Date', 'User_group']).agg({'mean': 'mean'}).reset_index()

In [43]:
daily_mean['Date'] = pd.to_datetime(daily_mean['Date'])

In [44]:
type(daily_mean['Date'][2])

pandas.tslib.Timestamp

In [45]:
daily_mean.sample(5)

Unnamed: 0,Date,User_group,mean
565,2012-08-31,ToU,0.161208
668,2012-10-22,Std,0.215742
427,2012-06-23,ToU,0.168983
495,2012-07-27,ToU,0.162748
1641,2014-02-20,ToU,0.204214


In [46]:
daily_mean['Date'] = daily_mean['Date'] + datetime.timedelta(days=1)

In [47]:
df = df.merge(daily_mean, how = 'left', on = ['Date','User_group'])

In [48]:
df = df.rename(columns = {'mean_x' : 'mean_cons',
          'mean_y' : 'mean_prev_day'})

In [49]:
df.sample(5)

Unnamed: 0,DateTime,User_group,mean_cons,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year,mean_prev_day
15312,2012-05-13 15:00:00,Std,0.21938,14.228,2012-05-13,0,16.21,0.41,0.42,1027.68,6,15:00:00,13,5,2012,0.193864
37134,2012-05-21 15:30:00,ToU,0.189677,14.228,2012-05-21,0,15.49,0.72,0.43,1009.22,0,15:30:00,21,5,2012,0.195976
41180,2013-09-19 07:30:00,Std,0.20807,11.76,2013-09-19,0,9.6,0.82,0.75,1012.26,3,07:30:00,19,9,2013,0.188706
11533,2013-12-11 16:00:00,ToU,0.242378,11.76,2013-12-11,0,4.27,0.92,0.0,1028.16,2,16:00:00,11,12,2013,0.218056
37925,2012-10-28 05:30:00,Std,0.131853,14.228,2012-10-28,0,1.91,0.92,0.13,1018.52,6,05:30:00,28,10,2012,0.23787


###### Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

In [50]:
df.head()

Unnamed: 0,DateTime,User_group,mean_cons,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year,mean_prev_day
0,2012-10-15 21:00:00,Std,0.301802,14.228,2012-10-15,0,11.8,0.82,0.38,1000.87,0,21:00:00,15,10,2012,0.227159
1,2012-10-21 21:00:00,Std,0.289203,14.228,2012-10-21,0,12.0,0.96,0.87,1017.99,6,21:00:00,21,10,2012,0.214069
2,2012-10-27 13:00:00,Std,0.263447,14.228,2012-10-27,0,7.46,0.73,0.57,1016.85,5,13:00:00,27,10,2012,0.225192
3,2012-10-28 17:30:00,Std,0.378239,14.228,2012-10-28,0,9.09,0.86,0.43,1012.33,6,17:30:00,28,10,2012,0.23787
4,2012-11-04 00:30:00,Std,0.22593,14.228,2012-11-04,0,3.08,0.91,0.13,999.88,6,00:30:00,4,11,2012,0.24095


In [51]:
def prev_ts(df_in, initial, final, step):
    #Both initial and final are included in the loop
    df_out = df_in[['DateTime', 'User_group']].copy()
    for i in range(initial, final + 1, step):
        aux = df_in[['DateTime', 'User_group', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = i)
        df_out = df_out.\
            merge(aux, how = 'left', on = ['DateTime','User_group'], suffixes = ('','_-%d' %(i)))
    return(df_out)

In [52]:
tmp = prev_ts(df, 22*60+30, 24*60, 30)
df = df.merge(tmp[['DateTime','User_group','mean_cons_-1380', 'mean_cons_-1410', 'mean_cons_-1440']],
         on = ['DateTime','User_group'], suffixes = ('',''))

In [53]:
df.sample(5)

Unnamed: 0,DateTime,User_group,mean_cons,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440
77739,2013-04-01 19:00:00,ToU,0.335453,11.76,2013-04-01,1,2.98,0.55,0.24,1011.3,0,19:00:00,1,4,2013,0.232052,0.327668,0.331465,0.31294
30016,2012-05-18 23:00:00,Std,0.170107,14.228,2012-05-18,0,12.46,0.87,0.8,1006.01,4,23:00:00,18,5,2012,0.187861,0.133632,0.147103,0.1681
74989,2013-11-06 18:00:00,ToU,0.324134,11.76,2013-11-06,0,14.75,0.89,1.0,998.79,2,18:00:00,6,11,2013,0.215806,0.343887,0.345528,0.31733
51400,2011-12-17 17:00:00,ToU,0.356791,14.228,2011-12-17,0,5.01,0.78,0.66,1009.74,5,17:00:00,17,12,2011,0.266792,0.434357,0.358881,0.306643
35304,2013-12-28 12:30:00,Std,0.26213,11.76,2013-12-28,0,7.68,0.67,0.31,997.99,5,12:30:00,28,12,2013,0.246842,0.286465,0.286564,0.283092


###### Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

In [54]:
aux2 = df[['DateTime', 'User_group']].copy()
means_3d = pd.DataFrame(index=df.index)
for h in range(0,4):
    for d in range(1,4):
        aux = df[['DateTime', 'User_group', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = d*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','User_group'])
    means_3d = means_3d.join(aux2.mean(axis=1).rename('mean_last3d_-%d' %(h*30)))

In [55]:
means_3d.head()

Unnamed: 0,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
0,0.286278,0.276157,0.264642,0.251189
1,0.276418,0.269546,0.258608,0.246133
2,0.209606,0.206111,0.204395,0.204098
3,0.348101,0.352302,0.352996,0.350607
4,0.203514,0.191968,0.181703,0.17278


In [56]:
df = df.join(means_3d)

In [57]:
df.sample(3)

Unnamed: 0,DateTime,User_group,mean_cons,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
7917,2013-01-17 10:00:00,ToU,0.245599,11.76,2013-01-17,0,-1.48,0.94,0.29,1019.05,3,10:00:00,17,1,2013,0.255242,0.241431,0.24138,0.245743,0.247607,0.249267,0.247726,0.24428
29390,2013-09-08 21:30:00,ToU,0.242798,3.99,2013-09-08,0,11.31,0.88,0.12,1018.01,6,21:30:00,8,9,2013,0.161068,0.171402,0.189144,0.208065,0.214351,0.204704,0.194844,0.184387
16137,2012-01-23 09:00:00,Std,0.246718,14.228,2012-01-23,0,5.44,0.81,0.67,1019.03,0,09:00:00,23,1,2012,0.286122,0.277311,0.270628,0.254351,0.251637,0.256738,0.257626,0.261012


###### Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In [58]:
aux2 = df[['DateTime', 'User_group']].copy()
means_3w = pd.DataFrame(index=df.index)
for h in range(0,4):
    for w in range(1,4):
        aux = df[['DateTime', 'User_group', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = w*7*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','User_group'])
    means_3w = means_3w.join(aux2.mean(axis=1).rename('mean_last3w_-%d' %(h*30)))

In [59]:
df = df.join(means_3w)

In [60]:
df.sample(3)

Unnamed: 0,DateTime,User_group,mean_cons,Tariff_value,Date,Holiday,temperature,humidity,cloudCover,pressure,DoW,Time,Day,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90,mean_last3w_-0,mean_last3w_-30,mean_last3w_-60,mean_last3w_-90
39328,2013-03-19 05:00:00,ToU,0.113938,11.76,2013-03-19,0,3.41,0.92,0.85,996.03,1,05:00:00,19,3,2013,0.230172,0.132957,0.125223,0.122639,0.116693,0.118908,0.123211,0.130759,0.124516,0.129489,0.135068,0.144665
11265,2013-03-14 16:00:00,Std,0.235265,11.76,2013-03-14,0,6.64,0.42,0.42,1015.35,3,16:00:00,14,3,2013,0.256019,0.276247,0.256562,0.244199,0.264173,0.272195,0.281919,0.294824,0.247574,0.255133,0.265479,0.278642
18036,2013-09-26 04:30:00,Std,0.105459,11.76,2013-09-26,0,14.67,0.91,0.0,1014.79,3,04:30:00,26,9,2013,0.180931,0.131999,0.114884,0.107164,0.106555,0.110063,0.117714,0.128857,0.108695,0.11272,0.1204,0.130938


Note that pandas.mean skips NaN values by default. Therefore, in the first days/weeks the means are calculated with the available data (1 or 2 values).

Now that we have built all features we can save this model to a csv file for later use.

In [61]:
df.to_csv('outputs/features_model.csv')