# 4 Prediction of the aggregate demand 

In this section, we will compare the performance of different algorithms in predicting the aggregate demand, separating standard and ToU users. First, we will use feature models and then exponential smoothing, which is a time-series dedicated algorithm.

After adding the external variables (holidays, weather) we will use different algorithms to fit our feature model. Later, we will predict the aggregate demand with exponential smoothing models in R using the "forecast" package.

In general we will train our model with Nov 2011- Nov 2013 data and validate it with Nov 2013 - Feb 2014 data.

We will use the aggregate statistics we computed in Section 2 with Spark.

## 4.1 Preparing the dataset

First, we will prepare our data so it can be use in the different algorithms proposed. We will start including time and weather variables as these are needed for all models and then we will add the engineered features for the feature models.

In [1]:
import pandas as pd
pd.options.display.max_columns = 999
import datetime

In [2]:
df = pd.read_csv('outputs/agg_stats.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,DateTime,Tariff,ToU_User,count,sum,min,mean,max,std_dev
0,2012-10-15 21:00:00,Std,0,4210,1270.587,0.0,0.301802,5.335,0.31913
1,2012-10-21 21:00:00,Std,0,4287,1239.813999,0.0,0.289203,6.095,0.320852
2,2012-10-27 13:00:00,Std,0,4402,1159.695,0.0,0.263447,3.554,0.336398
3,2012-10-28 17:30:00,Std,0,4402,1665.008001,0.0,0.378239,8.04,0.447315
4,2012-11-04 00:30:00,Std,0,4403,994.769,0.0,0.22593,6.072,0.424697


We keep only the columns we need: DateTime, Tariff and mean consumption. We are interested in predicting the aggregate demand (i.e. the sum) but as the data count in each Timestamp is different, we will use the mean. However, we cannot take the mean column directly as the value of Tariff is the user group rather than the tariff applied. In 2011, 2012 and 2014 we are interested in the aggregate demand of both groups combined so we will group and calculate the mean with sum and count.

In [4]:
df = df[['DateTime','Tariff','sum','count']]

In [5]:
df['DateTime'] = pd.to_datetime(df['DateTime'])

In [6]:
type(df['DateTime'][2])

pandas.tslib.Timestamp

Column Tariff actually represents the user group, not the tariff applied in that timestamp so we will rename it for clarity.

In [7]:
df = df.rename(columns={'Tariff':'User_group'})

In [8]:
df.sample(5)

Unnamed: 0,DateTime,User_group,sum,count
66886,2013-06-02 03:30:00,Std,425.513,4261
73156,2013-04-01 12:00:00,Std,1312.155,4303
66217,2012-08-14 11:30:00,Std,642.326,3739
56372,2012-06-02 23:30:00,ToU,106.372,739
31743,2012-01-08 20:00:00,Std,138.97,365


### 4.1.1 Tariff

On of the main variables to include in our dataset is the price of electricity. We have grouped standard flat rate and dToU users in two different groups but we need the actual values of the tariff. 

For standard users, the price is always equal to 14.228 p/kWh.

For ToU users, the tariff level for each 30-min interval is given in an Excel sheet we downloaded in Section 3.2. The corresponding values for each level are:

-High = 67.20p/kWh

-Normal = 11.76p/kWh

-Low = 3.99p/kWh

In [9]:
tariffs = pd.read_excel('data/Tariffs.xlsx')

In [10]:
tariffs.loc[tariffs['Tariff'] == 'Normal', 'Tariff_value'] = 11.76
tariffs.loc[tariffs['Tariff'] == 'High', 'Tariff_value'] = 67.20
tariffs.loc[tariffs['Tariff'] == 'Low', 'Tariff_value'] = 3.99

In [11]:
tariffs = tariffs.rename(columns = {'TariffDateTime' : 'DateTime'})

In [12]:
tariffs.sample(5)

Unnamed: 0,DateTime,Tariff,Tariff_value
14895,2013-11-07 07:30:00,Normal,11.76
11845,2013-09-04 18:30:00,Normal,11.76
15087,2013-11-11 07:30:00,Normal,11.76
1335,2013-01-28 19:30:00,Low,3.99
2725,2013-02-26 18:30:00,Normal,11.76


In [13]:
tariffs = tariffs.drop('Tariff', axis=1)

In [14]:
df = df.merge(tariffs, how = 'left', on = 'DateTime')

In [15]:
df['Tariff_value'] = df['Tariff_value'].fillna(14.228) #This is the standard tariff

In [16]:
df['Tariff_value'] = df['Tariff_value'].fillna(14.228) #This is the standard tariff
df.loc[df['User_group'] == 'Std','Tariff_value'] = 14.228

In [17]:
df.sample(5)

Unnamed: 0,DateTime,User_group,sum,count,Tariff_value
61954,2013-09-05 07:00:00,ToU,181.446,1061,11.76
71883,2012-02-03 01:00:00,ToU,18.62,92,14.228
77058,2013-02-14 07:30:00,Std,1139.262,4367,14.228
61667,2013-02-01 23:00:00,Std,1204.893999,4374,14.228
2580,2013-12-10 13:00:00,Std,870.163,4066,14.228


In [18]:
df = df.groupby(['DateTime', 'Tariff_value']).agg({'sum': 'sum',
                                            'count': 'sum'}).reset_index()

Now we can calculate the mean for each data point.

In [19]:
df['mean'] = df['sum']/df['count']

We also add a new variable "ToU" which indicates whether a consumption data point is for a user in which ToU was applied. As 14.228 p/kWh is the flat rate tariff (and this is not a possible value of ToU tariff), we know that if Tariff_value = 14.228, this is a standard point, and otherwise a ToU point.

In [20]:
df.loc[df['Tariff_value'] == 14.228,'ToU'] = 0
df.loc[df['Tariff_value'] != 14.228,'ToU'] = 1

In [21]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean,ToU
19797,2013-01-04 22:00:00,14.228,1374.838999,4397,0.312677,0.0
2004,2012-01-04 03:00:00,14.228,65.535,418,0.156782,0.0
16056,2012-10-22 22:00:00,14.228,1295.035,5411,0.239334,0.0
37194,2013-07-05 04:30:00,11.76,120.076,1084,0.110771,1.0
16866,2012-11-08 19:00:00,14.228,2043.925001,5503,0.37142,0.0


### 4.1.2. Bank Holidays

Bank Holidays probably have an impact in household electricity consumption as it was discussed in Section 1. Bank Holidays will be treated separately from Sundays as a binary variable. 

Bank Holidays in England from 2011 to 2014 were downloaded in Section 1 using the Python holidays library (and cross-checked with the official information in www.gov.co.uk) and saved to a csv file.

In [22]:
holidays = pd.Series.from_csv('data/bank_holidays.csv')

In [23]:
holidays = pd.to_datetime(holidays)

In [24]:
holidays.head()

0   2011-01-01
1   2011-01-03
2   2011-04-22
3   2011-04-25
4   2011-04-29
dtype: datetime64[ns]

In [25]:
df['Date'] = df['DateTime'].apply(lambda x: x.date())

In [26]:
df['Date'] = pd.to_datetime(df['Date'])

In [27]:
df['Holiday'] = 0
df.loc[df['Date'].isin(holidays),'Holiday'] = 1

In [28]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean,ToU,Date,Holiday
2392,2012-01-12 05:00:00,14.228,60.286,429,0.140527,0.0,2012-01-12,0
18130,2012-12-05 03:00:00,14.228,802.139,5525,0.145184,0.0,2012-12-05,0
6078,2012-03-29 01:00:00,14.228,237.63,1226,0.193825,0.0,2012-03-29,0
17660,2012-11-25 08:00:00,14.228,1047.084,5529,0.18938,0.0,2012-11-25,0
53923,2013-12-26 10:30:00,14.228,1051.565,4058,0.259134,0.0,2013-12-26,1


In [29]:
df[df['Holiday'] == 1].sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean,ToU,Date,Holiday
33438,2013-05-27 01:30:00,11.76,105.686,1091,0.096871,1.0,2013-05-27,1
28074,2013-04-01 04:30:00,11.76,128.1,1099,0.116561,1.0,2013-04-01,1
19476,2013-01-01 14:00:00,11.76,272.567,1111,0.245335,1.0,2013-01-01,1
53841,2013-12-25 14:00:00,14.228,1494.554,4057,0.368389,0.0,2013-12-25,1
53904,2013-12-26 06:00:00,11.76,119.834,1041,0.115114,1.0,2013-12-26,1


### 4.1.3. Weather variables

Now we have to add the weather variables, which will be used in any model. Weather variables were discussed, downloaded and treated in Section 1.

In [30]:
weather = pd.read_csv('data/weather_no_na.csv')

In [31]:
weather.shape

(20446, 13)

We will use the following variables:

-Temperature (ºC)

-Relative Humidity (%)

-Cloud cover (%)

-Atmospheric Pressure (mbar)

For a discussion on the reasons why these variables have been choosen and how NA values have been filled refer to Section 1.

In [32]:
weather.columns

Index(['time', 'apparentTemperature', 'cloudCover', 'dewPoint', 'humidity',
       'icon', 'precipType', 'pressure', 'summary', 'temperature',
       'visibility', 'windBearing', 'windSpeed'],
      dtype='object')

In [33]:
weather = weather[['time','temperature','humidity','cloudCover','pressure']]\
    .rename(columns = {'time' : 'DateTime'})

In [34]:
weather['DateTime'] = pd.to_datetime(weather['DateTime'])

In [35]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
2,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97
3,2011-11-01 03:00:00,14.18,0.88,0.43,1006.4
4,2011-11-01 04:00:00,14.2,0.9,0.38,1006.05


The frequency of the weather data is of 1 hour while that of the consumption data is 30 minutes, thus we need to resample it. We will take the last valid observation (e.g. 2012-01-01 10:30 will have the same data as 2012-01-01 10:00) => method = 'ffill'.

In [36]:
weather = weather.set_index('DateTime').resample('30min').fillna(method = 'ffill')\
    .reset_index()

In [37]:
weather.head()

Unnamed: 0,DateTime,temperature,humidity,cloudCover,pressure
0,2011-11-01 00:00:00,13.54,0.87,0.27,1008.01
1,2011-11-01 00:30:00,13.54,0.87,0.27,1008.01
2,2011-11-01 01:00:00,12.74,0.93,0.32,1007.76
3,2011-11-01 01:30:00,12.74,0.93,0.32,1007.76
4,2011-11-01 02:00:00,13.68,0.91,0.25,1006.97


In [38]:
df = df.merge(weather, how = 'left', on = 'DateTime')

In [39]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure
38687,2013-07-20 17:30:00,14.228,874.772,4225,0.207047,0.0,2013-07-20,0,22.19,0.62,0.64,1021.76
50334,2013-11-19 01:30:00,11.76,120.305,1047,0.114904,1.0,2013-11-19,0,6.29,0.81,0.0,1009.46
552,2011-12-04 21:00:00,14.228,33.436,110,0.303964,0.0,2011-12-04,0,5.34,0.85,0.31,999.53
23674,2013-02-14 07:30:00,11.76,265.491,1105,0.240263,1.0,2013-02-14,0,7.63,0.94,0.63,1007.47
1880,2012-01-01 13:00:00,14.228,112.841,411,0.274552,0.0,2012-01-01,1,12.46,0.81,0.63,1003.86


Now that we have added all external variables our models need, we can save it to a csv file. 

We will split the dataset into train and test samples: from the dataset start to 2013-09-31 will be included in the train set and from 2013-10-01 until the end of the time series in the test set. This means we will have 9 months of ToU in train and 3 in test. The limitation is not all months will be included in the test set and we will not have one whole year of ToU data in the train set.

In [40]:
df.loc[df['DateTime'] < '2013-10-01 00:00:00','Set'] = 'Train'
df.loc[df['DateTime'] >= '2013-10-01 00:00:00','Set'] = 'Test'

In [41]:
df.to_csv('outputs/tseries_model.csv', index = False)

We will use this model as is in time series modelling. For feature models we need to add the features.

## 4.1.4. Feature engineering

Feature models are an alternative for time series forecasting. They consist in applying conventional machine learning models to variables constructed from the data (e.g. mean consumption in the previous day). These are popular in the literature for electrical demand prediction [2], [4].

We will derive the following features:

-Mean aggregate demand per hour in the previous day (1 variable).

-Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

-Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In total we have 13 derived features.

Furthermore, we will include weather and time variables in the model:

-Day of week (7 levels)

-Holiday (binary)

-Temperature (ºC, continuous)

-Relative Humidity (%, continuous)

-Cloud cover (%, continuous)

-Atmospheric Pressure (mbar, continuous)

These variables have been calculated in Section 1 and stored in 'weather_no_na.csv'.

And the last variable to be included in the model:

-Tariff (p/kWh)

Thus, in total we will have 22 variables in our models.


First, we will build the time variables.

Now it is easier to derive the consumption features.

#### Adding time variables

Let us add the time variables to be used in lieu of DateTime in our prediction model: time, day of week, month, year.

We will not use the day of the month as it would introduce a categorical variable with 31 levels and it does not seem relevant in the exploratory data analysis in Tableau.

In [42]:
df['DoW'] = df['DateTime'].apply(lambda x: x.strftime("%A"))

In [43]:
df['Time'] = df['DateTime'].apply(lambda x: x.time())

In [44]:
df['Month']= df['DateTime'].apply(lambda x: x.strftime("%B"))

In [45]:
df['Year']= df['DateTime'].apply(lambda x: x.year)

In [46]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year
27152,2013-03-22 13:00:00,11.76,257.836,1100,0.234396,1.0,2013-03-22,0,5.04,0.73,0.37,1008.29,Train,Friday,13:00:00,March,2013
12690,2012-08-13 19:00:00,14.228,1139.832001,4761,0.23941,0.0,2012-08-13,0,18.88,0.84,0.5,1010.62,Train,Monday,19:00:00,August,2012
34813,2013-06-10 09:00:00,14.228,806.840999,4251,0.1898,0.0,2013-06-10,0,11.47,0.7,1.0,1016.17,Train,Monday,09:00:00,June,2013
1738,2011-12-29 14:00:00,14.228,110.374,398,0.277322,0.0,2011-12-29,0,9.49,0.71,0.77,1018.43,Train,Thursday,14:00:00,December,2011
41687,2013-08-20 23:30:00,14.228,576.522,4193,0.137496,0.0,2013-08-20,0,16.09,0.73,0.0,1026.73,Train,Tuesday,23:30:00,August,2013


###### Mean aggregate demand per hour in the previous day (1 variable).

First, we will calculate the daily mean aggregating by 'Date'. Then we will add one day to the 'Date' variable of our new DataFrame and join it with the main DataFrame df.

In [47]:
daily_mean = df.groupby(['Date', 'ToU']).agg({'mean': 'mean'}).reset_index()

In [48]:
daily_mean['Date'] = pd.to_datetime(daily_mean['Date'])

In [49]:
type(daily_mean['Date'][2])

pandas.tslib.Timestamp

In [50]:
daily_mean.sample(5)

Unnamed: 0,Date,ToU,mean
980,2013-10-15,1.0,0.182846
430,2013-01-13,1.0,0.260546
1174,2014-02-09,0.0,0.254605
495,2013-02-15,0.0,0.244258
371,2012-11-28,0.0,0.249383


In [51]:
daily_mean['Date'] = daily_mean['Date'] + datetime.timedelta(days=1)

In [52]:
df = df.merge(daily_mean, how = 'left', on = ['Date','ToU'])

In [53]:
df = df.rename(columns = {'mean_x' : 'mean_cons',
          'mean_y' : 'mean_prev_day'})

In [54]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean_cons,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year,mean_prev_day
6958,2012-04-16 09:00:00,14.228,313.72,1516,0.206939,0.0,2012-04-16,0,7.5,0.48,0.31,1026.99,Train,Monday,09:00:00,April,2012,0.231841
5479,2012-03-16 12:30:00,14.228,230.458,1026,0.224618,0.0,2012-03-16,0,9.26,0.76,0.89,1019.24,Train,Friday,12:30:00,March,2012,0.230902
1001,2011-12-14 05:30:00,14.228,30.982,223,0.138933,0.0,2011-12-14,0,5.06,0.78,0.2,990.84,Train,Wednesday,05:30:00,December,2011,0.253518
6375,2012-04-04 05:30:00,14.228,208.212,1290,0.161405,0.0,2012-04-04,0,4.13,0.92,0.35,1002.59,Train,Wednesday,05:30:00,April,2012,0.213136
28781,2013-04-08 13:00:00,67.2,195.924,1099,0.178275,1.0,2013-04-08,0,10.08,0.47,0.55,1004.06,Train,Monday,13:00:00,April,2013,0.213706


###### Aggregate demand in the previous day at the same time and at the previous 3 time steps (4 variables).

In [55]:
df.head()

Unnamed: 0,DateTime,Tariff_value,sum,count,mean_cons,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year,mean_prev_day
0,2011-11-23 09:00:00,14.228,0.569,2,0.2845,0.0,2011-11-23,0,4.84,0.99,0.32,1027.34,Train,Wednesday,09:00:00,November,2011,
1,2011-11-23 09:30:00,14.228,0.561,2,0.2805,0.0,2011-11-23,0,4.84,0.99,0.32,1027.34,Train,Wednesday,09:30:00,November,2011,
2,2011-11-23 10:00:00,14.228,0.92,6,0.153333,0.0,2011-11-23,0,5.69,0.98,0.56,1027.72,Train,Wednesday,10:00:00,November,2011,
3,2011-11-23 10:30:00,14.228,0.588,6,0.098,0.0,2011-11-23,0,5.69,0.98,0.56,1027.72,Train,Wednesday,10:30:00,November,2011,
4,2011-11-23 11:00:00,14.228,0.772,7,0.110286,0.0,2011-11-23,0,7.66,0.88,0.32,1027.59,Train,Wednesday,11:00:00,November,2011,


In [56]:
def prev_ts(df_in, initial, final, step):
    #Both initial and final are included in the loop
    df_out = df_in[['DateTime', 'ToU']].copy()
    for i in range(initial, final + 1, step):
        aux = df_in[['DateTime', 'ToU', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = i)
        df_out = df_out.\
            merge(aux, how = 'left', on = ['DateTime','ToU'], suffixes = ('','_-%d' %(i)))
    return(df_out)

In [57]:
tmp = prev_ts(df, 22*60+30, 24*60, 30)
df = df.merge(tmp[['DateTime','ToU','mean_cons_-1380', 'mean_cons_-1410', 'mean_cons_-1440']],
         on = ['DateTime','ToU'], suffixes = ('',''))

In [58]:
df.sample(5)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean_cons,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440
19181,2012-12-27 00:30:00,14.228,1165.175,5505,0.211658,0.0,2012-12-27,0,7.73,0.82,0.27,1003.06,Train,Thursday,00:30:00,December,2012,0.243411,0.168923,0.191019,0.218676
9603,2012-06-10 11:30:00,14.228,863.077,3903,0.221132,0.0,2012-06-10,0,16.83,0.49,0.41,1008.6,Train,Sunday,11:30:00,June,2012,0.178811,0.205025,0.211188,0.210592
11484,2012-07-19 16:00:00,14.228,899.899001,4732,0.190173,0.0,2012-07-19,0,20.27,0.5,0.42,1011.91,Train,Thursday,16:00:00,July,2012,0.172349,0.217391,0.20956,0.196699
47930,2013-10-25 00:30:00,11.76,109.611,1050,0.104391,1.0,2013-10-25,0,13.18,0.93,0.0,1009.45,Test,Friday,00:30:00,October,2013,0.180342,0.097941,0.099941,0.107938
22407,2013-02-01 02:30:00,14.228,615.995,4374,0.140831,0.0,2013-02-01,0,8.03,0.8,0.26,1009.66,Train,Friday,02:30:00,February,2013,0.239651,0.130583,0.135504,0.144816


###### Mean value of the aggregate demand of the previous 3 days at the same time and at the previous 3 time steps (4 variables).

In [59]:
aux2 = df[['DateTime', 'ToU']].copy()
means_3d = pd.DataFrame(index=df.index)
for h in range(0,4):
    for d in range(1,4):
        aux = df[['DateTime', 'ToU', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = d*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','ToU'])
    means_3d = means_3d.join(aux2.mean(axis=1).rename('mean_last3d_-%d' %(h*30)))

In [60]:
df = df.join(means_3d)

In [61]:
df.sample(3)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean_cons,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90
26740,2013-03-18 06:00:00,11.76,146.386,1101,0.132957,1.0,2013-03-18,0,0.26,0.99,0.39,989.94,Train,Monday,06:00:00,March,2013,0.248919,0.15928,0.140001,0.124374,0.349582,0.265628,0.239471,0.232349
52774,2013-12-14 11:30:00,11.76,236.137,1042,0.226619,1.0,2013-12-14,0,9.57,0.77,0.75,1025.91,Test,Saturday,11:30:00,December,2013,0.224444,0.218281,0.206332,0.207092,0.40639,0.323074,0.291731,0.273723
48894,2013-11-04 01:30:00,11.76,110.779,1047,0.105806,1.0,2013-11-04,0,8.84,0.96,0.07,983.07,Test,Monday,01:30:00,November,2013,0.220749,0.103288,0.10894,0.11629,0.336013,0.238084,0.197292,0.17442


###### Mean value of the aggregate demand on the same day of week of the previous 3 weeks at the same time and at the previous 3 time steps (4 variables).

In [62]:
aux2 = df[['DateTime', 'ToU']].copy()
means_3w = pd.DataFrame(index=df.index)
for h in range(0,4):
    for w in range(1,4):
        aux = df[['DateTime', 'ToU', 'mean_cons']].copy()
        aux['DateTime'] = aux['DateTime'] + datetime.timedelta(minutes = w*7*24*60-h*30)
        aux2 = aux2.\
            merge(aux, how = 'left', on = ['DateTime','ToU'])
    means_3w = means_3w.join(aux2.mean(axis=1).rename('mean_last3w_-%d' %(h*30)))

In [63]:
df = df.join(means_3w)

In [64]:
df.sample(3)

Unnamed: 0,DateTime,Tariff_value,sum,count,mean_cons,ToU,Date,Holiday,temperature,humidity,cloudCover,pressure,Set,DoW,Time,Month,Year,mean_prev_day,mean_cons_-1380,mean_cons_-1410,mean_cons_-1440,mean_last3d_-0,mean_last3d_-30,mean_last3d_-60,mean_last3d_-90,mean_last3w_-0,mean_last3w_-30,mean_last3w_-60,mean_last3w_-90
17549,2012-11-23 00:30:00,14.228,1060.851999,5517,0.192288,0.0,2012-11-23,0,8.89,0.9,0.7,1006.03,Train,Friday,00:30:00,November,2012,0.229276,0.157621,0.17325,0.194312,0.142922,0.154024,0.153316,0.149413,0.144864,0.155881,0.1546,0.150727
51691,2013-12-03 04:30:00,14.228,496.815,4078,0.121828,0.0,2013-12-03,0,6.92,0.73,0.07,1030.53,Test,Tuesday,04:30:00,December,2013,0.239061,0.130004,0.121593,0.119379,0.094642,0.108695,0.115891,0.121118,0.09092,0.104996,0.112908,0.119902
12565,2012-08-11 04:30:00,14.228,476.461,4756,0.100181,0.0,2012-08-11,0,15.71,0.83,0.09,1023.48,Train,Saturday,04:30:00,August,2012,0.166081,0.120835,0.106896,0.10035,0.075134,0.088692,0.097879,0.105719,0.074422,0.086245,0.093618,0.099686


Note that pandas.dataframe.mean skips NaN values by default. Therefore, if some values are missing, the mean is still calculated with the remaining ones. Nonetheless, in the first 3 weeks, none of the values needed to calculate some features will exist so there will be missing values. These values will be dropped from the model. This is not needed in dedicated time series algorithms.

In [65]:
df = df.dropna().reset_index(drop=True)

Time can be encoded as a categorical variable or as a continuous numerical variable.

Looking at the exploratory data analysis in Tableau, aggregating data by time, we can distinguish a plateau between 8:00 and 15:59 during weekdays, an increasing trend up to 19:00, then decreasing to 2:00 and a plateau between 2:00 and 5:59 before increasing to 8:00. During weekends, the midday plateau is not so constant and seems to start at 9:00.

Considering time with 30-min frequency would result in a categorical variable with 48 levels. This resolution is probably to large for our purpose taking into account that we will derive features based on consumption in previous time steps and at the same time on previous comparable days, as we will see later. 

Considering time as a continuous numerical variable is probably a better approach, but from the exploratory data analysis we now its relationship with the target will be strongly non-linear.

Therefore, we propose to use it as a numerical variable in the interval [0,24) in models which can grasp non-linearities and to drop it in linear models at is has been used to derived the features.

We will also remove the year as we only have 2 complete years (2012 and 2013) and part of our test data is in 2014, which is absent from the train set.

In [66]:
def time2float(time):
    return time.hour + time.minute / 60.0

In [67]:
df['Time'] = df['Time'].apply(lambda t: time2float(t))

Now that we have built all features we can save this model to a csv file for use in R.

In [68]:
df.to_csv('outputs/features_model.csv', index = False)