In [12]:
from scipy import stats
import numpy as np
import matplotlib as mpl
import pandas as pd
import statsmodels as sm
import sklearn as sk

#import theano as t
import tensorflow as tf
import keras as k

All original code from: https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/

## Motivation:

I am getting familiar with energy demand forecasting, and LSTMs are a cutting-edge architecture for time series forecasting.

# Defining the Problem

Given the rise of smart electricity meters and the wide adoption of electricity generation technology like solar panels, there is a wealth of electricity usage data available.

This data represents a multivariate time series of power-related variables that, in turn, could be used to model and even **forecast future electricity consumption.**

In this model build, I will be using a household power consumption dataset for multi-step time series forecasting.

# The Data

The Household Power Consumption dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

- global_active_power: The total active power consumed by the household (kilowatts).
- global_reactive_power: The total reactive power consumed by the household (kilowatts).
- voltage: Average voltage (volts).
- global_intensity: Average current intensity (amps).
- sub_metering_1: Active energy for kitchen (watt-hours of active energy).
- sub_metering_2: Active energy for laundry (watt-hours of active energy).
- sub_metering_3: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

In general terms, the active energy is the real power consumed by the household, whereas the reactive energy is the unused power in the lines.

We can see that the dataset provides the active power as well as some division of the active power by main circuit in the house, specifically the kitchen, laundry, and climate control. These are not all the circuits in the household.

The remaining watt-hours can be calculated from the active power by first converting each one-minute reading from kilowatts to watt-hours (multiply by 1000, divide by 60), then subtracting the other sub-metered active energy in watt-hours, as follows:

### sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
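As a quick sanity check of the formula, here is a minimal worked example using the first row of the dataset (2006-12-16 17:24:00). The variable names are just for illustration.

```python
# First observation in the dataset (2006-12-16 17:24:00)
global_active_power = 4.216   # kilowatts, averaged over one minute
sub_metering_1 = 0.0          # watt-hours (kitchen)
sub_metering_2 = 1.0          # watt-hours (laundry)
sub_metering_3 = 17.0         # watt-hours (climate control)

# kW -> W (x1000), then one minute of use -> watt-hours (/60)
total_wh = global_active_power * 1000 / 60
remainder = total_wh - (sub_metering_1 + sub_metering_2 + sub_metering_3)
print(round(total_wh, 3), round(remainder, 3))
```

The remainder of roughly 52.267 watt-hours agrees with the sub_metering_4 value computed for that row later in the notebook.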

In [19]:
# load all data
dataset = pd.read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
# summarize
dataset.shape
dataset.head()

datetime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


Handle the '?' placeholders: convert them to NaN so that every column can be cast to floating-point values.

In [24]:
dataset.replace('?', np.nan, inplace=True)
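An alternative, not used in this notebook, is to let pandas treat '?' as missing while reading the file, via the `na_values` parameter of `read_csv`. The sketch below uses a tiny in-memory sample in the same ';'-separated layout; the two rows and the reduced column set are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the second row uses '?' for missing readings
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
)

# na_values='?' turns the placeholders into NaN, so the columns parse as floats
sample = pd.read_csv(raw, sep=';', na_values='?')
print(sample.dtypes)
print(sample.isnull().sum())
```

With this approach the numeric columns arrive as float64 directly, so no separate replace step is needed.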

Create a new column, sub_metering_4, holding the sub_metering_remainder defined above.

In [25]:
# add a column for the remainder of sub metering
values = dataset.values.astype('float32')
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

In [26]:
#Save transformed dataset as .csv since this format is easiest to work with when loading as Pandas DF
dataset.to_csv('household_power_consumption.csv')

In [29]:
#Check that it was converted properly by reloading dataset as csv.
dataset = pd.read_csv('household_power_consumption.csv', header=0)
dataset.head()

,datetime,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3,sub_metering_4
0,2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0,52.26667
1,2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0,72.333336
2,2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0,70.566666
3,2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0,71.8
4,2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0,43.1


Great! The headers are right and sub_metering_4 has been added to the DataFrame.

It is also worth checking that the '?' entries were replaced with NaN.

In [30]:
dataset.isnull().sum()

datetime                     0
Global_active_power      25979
Global_reactive_power    25979
Voltage                  25979
Global_intensity         25979
Sub_metering_1           25979
Sub_metering_2           25979
Sub_metering_3           25979
sub_metering_4           25979
dtype: int64

In [31]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   datetime               object 
 1   Global_active_power    float64
 2   Global_reactive_power  float64
 3   Voltage                float64
 4   Global_intensity       float64
 5   Sub_metering_1         float64
 6   Sub_metering_2         float64
 7   Sub_metering_3         float64
 8   sub_metering_4         float64
dtypes: float64(8), object(1)
memory usage: 142.5+ MB


NaNs affect 25,979 of 2,075,259 rows, which is about 1.25% of the dataset (roughly 1 in 80 rows). This is something to keep in mind when assessing model performance.

Now, let's explore the data to get a look at what we are trying to model.
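The share of missing rows can be checked from the counts reported by `dataset.isnull().sum()` and `dataset.info()`. The second half of this sketch shows the same calculation on a frame using `isnull().mean()`; the toy values there are made up.

```python
import numpy as np
import pandas as pd

# Counts reported by dataset.isnull().sum() and dataset.info()
missing_rows = 25979
total_rows = 2075259
missing_pct = 100 * missing_rows / total_rows
print(f"{missing_pct:.2f}% of rows have missing readings")

# The same share computed directly on a frame (toy data, for illustration only)
toy = pd.DataFrame({'Voltage': [234.8, np.nan, 233.3, 233.7]})
print(toy.isnull().mean())
```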

# EDA

## Patterns in Observations Over Time

The best way to start understanding a time series is to make line plots.

Start by making a line plot for each variable.

In [None]:
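from matplotlib import pyplot

# the CSV reload left 'datetime' as a plain string column; parse it and make it
# the index so the date-based selections in later cells work
dataset['datetime'] = pd.to_datetime(dataset['datetime'])
dataset = dataset.set_index('datetime')

# line plot for each variable, one subplot per column
pyplot.figure()
for i in range(len(dataset.columns)):
    pyplot.subplot(len(dataset.columns), 1, i+1)
    name = dataset.columns[i]
    pyplot.plot(dataset[name])
    pyplot.title(name, y=0)
pyplot.show()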

Some initial observations:

1. sub_metering_3 (climate control) shows an interesting pattern that may not map directly onto hot or cold years. Maybe new systems were installed.

2. sub_metering_4 trends downward over time, which may be related to the steady increase in sub_metering_3 towards the end of the series.

3. These observations are a reminder to respect the temporal ordering of subsequences when fitting and evaluating any model.

4. A wave of seasonal effect may be visible in 'Global_active_power' and some other variables; subsequent EDA looks at this.

5. Some of the spiky activity may be due to weekend household activity.
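One practical consequence of the temporal-ordering point is that train and test sets should be split chronologically rather than by random sampling. A minimal sketch, assuming a toy minute-level series and an arbitrary 80/20 split (neither comes from the original notebook):

```python
import numpy as np
import pandas as pd

# Toy minute-level series standing in for Global_active_power (values are made up)
index = pd.date_range('2007-01-01', periods=1000, freq='min')
series = pd.Series(np.random.default_rng(0).random(1000), index=index)

# Hold out the most recent 20% instead of sampling rows at random,
# so the model is never evaluated on data that precedes its training data
split_point = int(len(series) * 0.8)
train, test = series.iloc[:split_point], series.iloc[split_point:]
print(train.index.max(), '<', test.index.min())
```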

Now, zooming in on 'Global_active_power'.

In [None]:
# plot active power for each year
years = ['2007', '2008', '2009', '2010']
pyplot.figure()
for i in range(len(years)):
	# prepare subplot
	ax = pyplot.subplot(len(years), 1, i+1)
	# determine the year to plot
	year = years[i]
	# get all observations for the year
	result = dataset.loc[year]
	# plot the active power for the year
	pyplot.plot(result['Global_active_power'])
	# add a title to the subplot
	pyplot.title(str(year), y=0, loc='left')
pyplot.show()

Each line plot shows one year of 'Global_active_power'.

We can see:
1. Some common gross patterns: usage drops significantly between February and March and between August and September.

2. Consumption is lower in the middle of the year (summer months) and higher at the beginning and end of the year (winter months). This may indicate an annual seasonal pattern in consumption.

3. Some patches of missing data.

We can continue to zoom in and look at Global_active_power for each month of 2007. This may reveal seasonality at the monthly, weekly, and daily levels.
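The year, month, and day selections in these cells rely on pandas partial string indexing, which requires a DatetimeIndex and is done through `.loc`. A minimal sketch on a toy daily frame (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy daily frame spanning two years (values are made up)
index = pd.date_range('2007-01-01', '2008-12-31', freq='D')
frame = pd.DataFrame(
    {'Global_active_power': np.arange(len(index), dtype=float)},
    index=index,
)

year_2007 = frame.loc['2007']      # every row in 2007
jan_2007 = frame.loc['2007-01']    # every row in January 2007
print(len(year_2007), len(jan_2007))
```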

In [None]:
# plot active power for each 2007 month
months = [x for x in range(1, 13)]
pyplot.figure()
for i in range(len(months)):
	# prepare subplot
	ax = pyplot.subplot(len(months), 1, i+1)
	# determine the month to plot
	month = '2007-%02d' % months[i]
	# get all observations for the month
	result = dataset.loc[month]
	# plot the active power for the month
	pyplot.plot(result['Global_active_power'])
	# add a title to the subplot
	pyplot.title(month, y=0, loc='left')
pyplot.show()

At this granularity, a daily sine-like wave in power consumption becomes visible, which confirms a daily pattern.

There are also stretches of days with little or no consumption, such as in February, April, August, and October. These could be vacation days when the home was not occupied.

Finally, we can look at an even more granular level: consumption within individual days.
We expect a pattern of consumption within each day and differences in consumption from day to day.

In [None]:
# plot active power for first 20 days in Jan 2007
days = [x for x in range(1, 21)]
pyplot.figure()
for i in range(len(days)):
	# prepare subplot
	ax = pyplot.subplot(len(days), 1, i+1)
	# determine the day to plot
	day = '2007-01-%02d' % days[i]
	# get all observations for the day
	result = dataset.loc[day]
	# plot the active power for the day
	pyplot.plot(result['Global_active_power'])
	# add a title to the subplot
	pyplot.title(day, y=0, loc='left')
pyplot.show()

Each plot is one of the first 20 days of January 2007.

What we notice is:
1. Most of the consumption starts in the early morning (around 6 to 7 AM).

2. Some days show a drop in the middle of the day, plausibly because most people are out of the house during this time.

3. Some overnight consumption may be due to heating running overnight.

Therefore, to model the data well, it is worth considering the seasons and the weather they bring, since both affect consumption.
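The within-day pattern can also be summarized numerically by averaging readings per hour of the day. This is a sketch on synthetic data, not a result from the household dataset: the step at 06:00 and the 0.3 / 1.5 kW levels are made up to mimic a morning ramp-up.

```python
import numpy as np
import pandas as pd

# Two days of made-up minute readings with a higher level from 06:00 onwards
index = pd.date_range('2007-01-01', periods=2 * 24 * 60, freq='min')
values = np.where(index.hour >= 6, 1.5, 0.3)
power = pd.Series(values, index=index, name='Global_active_power')

# Average reading for each hour of the day, pooled across days
hourly_profile = power.groupby(power.index.hour).mean()
print(hourly_profile.head(8))
```

On the real data, the same `groupby` on `dataset.index.hour` would give an average daily profile, provided the index is a DatetimeIndex.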

Another important view of the data is its distributions, so let's look at the distributions of the time series variables.

# Time Series Data Distributions

It would be useful to know whether the variable distributions are Gaussian.

We can look into this by creating histograms.

Let's start with a histogram for each variable in the time series dataset.

In [None]:
# histogram plot for each variable
pyplot.figure()
for i in range(len(dataset.columns)):
	pyplot.subplot(len(dataset.columns), 1, i+1)
	name = dataset.columns[i]
	dataset[name].hist(bins=100)
	pyplot.title(name, y=0)
pyplot.show()

We can see that all variables except voltage have distributions skewed towards small watt-hour or kilowatt values.

Global active power appears to be bimodal (two groups of observations, each with its own mean). We look into this further by splitting its data into four distributions, one per year (2007-2010).

The Voltage variable, by contrast, looks strongly Gaussian.
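A histogram reading can be backed up with a simple statistic such as sample skewness, which is near 0 for a symmetric distribution and clearly positive for one with a long right tail. The sketch below uses synthetic samples rather than the household data; the distribution parameters are arbitrary stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Symmetric sample, loosely standing in for a Voltage-like variable
gaussian_like = rng.normal(loc=240.0, scale=3.0, size=100_000)
# Sample with most mass near zero and a long right tail
right_skewed = rng.exponential(scale=1.0, size=100_000)

print(round(stats.skew(gaussian_like), 2), round(stats.skew(right_skewed), 2))
```

On the real columns, NaN values would need to be dropped first (for example with `dropna()`), since they would otherwise propagate into the statistic.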

In [None]:
# histogram of active power for each year
years = ['2007', '2008', '2009', '2010']
pyplot.figure()
for i in range(len(years)):
	# prepare subplot
	ax = pyplot.subplot(len(years), 1, i+1)
	# determine the year to plot
	year = years[i]
	# get all observations for the year
	result = dataset.loc[year]
	# histogram of the active power for the year
	result['Global_active_power'].hist(bins=100)
	# zoom in on the distribution
	ax.set_xlim(0, 5)
	# add a title to the subplot
	pyplot.title(str(year), y=0, loc='right')
pyplot.show()

A bimodal distribution of Global_active_power is evident in each year.

There is one peak around 0.3 kW and perhaps another around 1.3 kW.

There is also a long tail stretching into higher kW usage.

Separating the first peak, second peak, and long tail into discretized groups of daily or hourly usage may be helpful in developing a predictive model.
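One way such a discretization could be sketched is with `pd.cut`. The cut points of 0.8 kW and 2.0 kW and the group labels below are illustrative guesses, not values derived from the data, and the readings are made up.

```python
import numpy as np
import pandas as pd

# Made-up readings in kW; the 0.8 and 2.0 cut points are illustrative guesses
readings = pd.Series([0.3, 0.25, 1.3, 1.4, 4.2])
groups = pd.cut(
    readings,
    bins=[0.0, 0.8, 2.0, np.inf],
    labels=['first_peak', 'second_peak', 'long_tail'],
    include_lowest=True,   # keep readings of exactly 0.0 in the first bin
)
print(groups.value_counts().sort_index())
```

In practice the boundaries would be chosen by inspecting the histograms, and they may need to differ by season.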

These groups may vary across the seasons of the year.

We can look into this by plotting the distribution for each month of one year (2007).

In [None]:
months = [x for x in range(1, 13)]
pyplot.figure()
for i in range(len(months)):
	# prepare subplot
	ax = pyplot.subplot(len(months), 1, i+1)
	# determine the month to plot
	month = '2007-%02d' % months[i]
	# get all observations for the month
	result = dataset.loc[month]
	# histogram of the active power for the month
	result['Global_active_power'].hist(bins=100)
	# zoom in on the distribution
	ax.set_xlim(0, 5)
	# add a title to the subplot
	pyplot.title(month, y=0, loc='right')
pyplot.show()

We can see that the bimodal shape is strong in all months after March.
For January to March, there appears to be a third peak.

The peaks (particularly the second one) are higher in colder months and lower in warmer months.

Thicker tails are also evident in the colder months.

Also, the y-axis scaling differs for July and August, which suggests relatively higher consumption in those months.