## Recurrent Neural Networks to predict electricity usage

This notebook takes raw observations of electrical usage and creates a daily prediction for each of the next seven days. The dataset is the household_power_consumption dataset hosted by the UCI Machine Learning Repository. The target value to predict is 'Global_active_power'.

#### Steps
##### Data Loading and Cleaning
+ download the data
+ unzip the data and create a pandas data frame from 'household_power_consumption.txt' in the zip file
+ combine date and time into a datetime index for the data frame
+ aggregate the data from the observation level (one row per minute) to daily
+ reshape the data so that an entire week of data is used to predict the next week
    + RNN input data needs to be 3D: (observations, timesteps, n_cols)
+ split the data into training and test sets

##### Modeling

+ design a neural network architecture
+ compile the model
+ fit the model

##### Deployment
+ write a function that combines the cleaning steps to create predictions
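
The cleaning steps listed above could be combined into a single function along these lines. This is a minimal sketch, not the notebook's final deployment code: the function name `clean_raw_frame` and the three-row frame are hypothetical, and it assumes the raw `Date` column is day-first (dd/mm/yyyy).

```python
import pandas as pd

def clean_raw_frame(raw):
    """Turn a raw frame with string 'Date' and 'Time' columns into daily sums.

    Hypothetical helper: the name and the day-first date format are assumptions.
    """
    df = raw.copy()
    # combine date and time into a datetime index
    df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                    format='%d/%m/%Y %H:%M:%S')
    df = df.set_index('Datetime').drop(columns=['Date', 'Time'])
    # coerce the remaining string columns to float; unparseable entries become NaN
    df = df.apply(pd.to_numeric, errors='coerce')
    # aggregate from one-minute observations to daily totals (NaN values are skipped)
    return df.resample('D').sum()

# made-up rows in the same layout as the raw file
raw = pd.DataFrame({
    'Date': ['16/12/2006', '16/12/2006', '17/12/2006'],
    'Time': ['17:24:00', '17:25:00', '00:00:00'],
    'Global_active_power': ['4.216', '5.360', '?'],
})
daily = clean_raw_frame(raw)
print(daily)
```

The output of such a function would still need to be reshaped into weekly windows before it can be passed to the model.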


In [55]:
import requests, zipfile
import pandas as pd
from io import BytesIO
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip'

# download the zip file from the UCI Machine Learning Repository
request = requests.get(url)
file = zipfile.ZipFile(BytesIO(request.content))

# read the txt file from the zip as a list of byte strings
with file.open('household_power_consumption.txt') as f:
    txt = f.readlines()

# decode bytes to strings and strip the carriage-return/newline characters
txt_decoded = [row.decode("utf-8").replace('\r\n', '') for row in txt]

# extract the column names from the header row
cols = txt_decoded[0].split(';')

# create a data frame from the first 99,999 observations
df = pd.DataFrame(columns=cols, data=[row.split(';') for row in txt_decoded[1:100000]])

# combine the Date and Time columns into a single datetime column
# NOTE: the raw dates are day-first (dd/mm/yyyy) and pd.to_datetime is not told so here.
# Depending on the pandas version, ambiguous dates such as 01/02/2007 may be read month-first;
# passing dayfirst=True would parse them as written.
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# set the datetime column as the index
df.set_index('Datetime', inplace=True)

# drop the separate Date and Time columns
df.drop(['Date', 'Time'], axis=1, inplace=True)

# coerce all the string columns to float; unparseable entries become NaN
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    


In [56]:
# Basic information about the data
# daily_data is only built in a later cell, so take the calendar dates straight from df
dates = df.index.normalize()
min_date = dates.min()
max_date = dates.max()
print('min date', min_date)
print('max date', max_date)
print('data shape', df.shape)
df.head()

min date 2006-12-16 00:00:00
max date 2007-12-02 00:00:00
data shape (99999, 7)


| Datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


##### About the data
As the index shows, each row is a one-minute observation of power usage:
+ Global_active_power: the total active power consumed by the household (kilowatts). This is the target.
+ Global_reactive_power: the total reactive power consumed by the household (kilowatts).
+ Voltage: average voltage (volts).
+ Global_intensity: average current intensity (amps).
+ Sub_metering_1: active energy for the kitchen (watt-hours of active energy).
+ Sub_metering_2: active energy for the laundry (watt-hours of active energy).
+ Sub_metering_3: active energy for climate control systems (watt-hours of active energy).
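
Because the readings are one minute apart, a power reading in kilowatts can be turned into the energy used during that minute, in watt-hours, by multiplying by 1000/60. That puts it on the same scale as the sub-metering columns. A small sketch using the first row of the table above; the variable names are illustrative only:

```python
# first observation from the table above
global_active_power_kw = 4.216
sub_metering_wh = [0.0, 1.0, 17.0]  # Sub_metering_1..3, in watt-hours

# kW sustained for one minute -> watt-hours used in that minute
energy_wh = global_active_power_kw * 1000 / 60

# energy not attributed to any of the three sub-meters
unmetered_wh = energy_wh - sum(sub_metering_wh)

print(round(energy_wh, 2), round(unmetered_wh, 2))
```

The same factor explains the scale of the daily sums later on: summing kilowatt readings over a day gives kilowatt-minutes, so dividing a daily total by 60 would give kilowatt-hours.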


In [57]:
# group the one-minute observations by calendar day
daily_groups = df.resample('D')
# aggregate each day by summing its observations
daily_data = daily_groups.sum()
daily_data.head()

| Datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 2006-12-16 | 1209.176 | 34.922 | 93552.53 | 5180.8 | 0.0 | 546.0 | 4926.0 |
| 2006-12-17 | 3390.46 | 226.006 | 345725.32 | 14398.6 | 2033.0 | 4187.0 | 13341.0 |
| 2006-12-18 | 2203.826 | 161.792 | 347373.64 | 9247.2 | 1063.0 | 2621.0 | 14018.0 |
| 2006-12-19 | 1666.194 | 150.942 | 348479.01 | 7094.0 | 839.0 | 7602.0 | 6197.0 |
| 2006-12-20 | 2225.748 | 160.998 | 348923.61 | 9313.0 | 0.0 | 2648.0 | 14063.0 |


In [7]:

# yields date ranges for the x (input week) and y (target week) sides of each sample
def get_data_ranges(dates, d=7):
    min_date = min(dates)
    max_date = max(dates)
    horizon = min_date
    # stop once a full input window plus a full target window no longer fit
    while horizon + pd.Timedelta(days=d*2) <= max_date:
        x_date_range = pd.date_range(horizon, periods=d)
        y_date_range = pd.date_range(horizon + pd.Timedelta(days=d), periods=d)
        horizon = horizon + pd.Timedelta(days=d)
        yield x_date_range, y_date_range

date_ranges = list(get_data_ranges(daily_data.index, d=7))
date_ranges[0]


(DatetimeIndex(['2006-12-16', '2006-12-17', '2006-12-18', '2006-12-19',
                '2006-12-20', '2006-12-21', '2006-12-22'],
               dtype='datetime64[ns]', freq='D'),
 DatetimeIndex(['2006-12-23', '2006-12-24', '2006-12-25', '2006-12-26',
                '2006-12-27', '2006-12-28', '2006-12-29'],
               dtype='datetime64[ns]', freq='D'))

In [None]:
# TODO: fill missing values with the previous observation (forward fill)
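
The fill-with-previous step noted above could be done with pandas' `ffill`. This is only a sketch, assuming that a gap should simply repeat the last valid reading; the small frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# made-up one-minute readings with a two-minute gap
idx = pd.date_range('2006-12-16 17:24', periods=4, freq='min')
toy = pd.DataFrame({'Global_active_power': [4.216, np.nan, np.nan, 5.388]}, index=idx)

# replace each NaN with the most recent earlier value
filled = toy.ffill()
print(filled)
```

Applied before the daily aggregation, this would keep missing minutes from silently lowering a day's total, since `sum` skips NaN values.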

In [78]:
import numpy as np
# generator that yields (x, y) pairs, each holding d consecutive weeks (one flattened week per row)
def array_gen(df, d=7, targetCol='Global_active_power'):
    # generates the date ranges for the input and target weeks
    gen = get_data_ranges(df.index, d=d)
    try:
        while True: # runs until StopIteration is raised (no more date ranges to use)
            data_list = [next(gen) for _ in range(d)]
            x_ranges = [v[0] for v in data_list]
            y_ranges = [v[1] for v in data_list]
            try: # a date range that is missing from df raises a KeyError
                # uses label-based indexing (.loc) to pull each date range from df
                x = np.reshape([df.loc[r].values for r in x_ranges], (d,-1))
                y = np.reshape([df.loc[r, targetCol].values for r in y_ranges],  (d, -1))
                yield x, y
                
            except KeyError:
                pass
    except StopIteration:
        print('array gen completed')
        
# creates the data generator
days = 7
x_cols = daily_data.shape[1]
g = array_gen(daily_data, d=days)

# creates a list of (x, y) tuples
data_list = list(g)

# reshapes the list of (x, y) tuples into a 3d x array and a 2d y array for training
x = np.reshape([v[0] for v in data_list], (-1, days, x_cols)) 
y = np.reshape([v[1] for v in data_list], (-1, days)) 

print('x_shape', x.shape)
print('y_shape', y.shape)

array gen completed
x_shape (49, 7, 7)
y_shape (49, 7)


In [82]:
# data splitting

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
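
`train_test_split` shuffles the samples by default, so test weeks can fall between training weeks. For time series, a common alternative is to hold out the most recent samples instead. The following is a minimal sketch of that alternative (it is not what this notebook does), with placeholder arrays in the same shapes as `x` and `y`; all names ending in `_demo` are made up.

```python
import numpy as np

# placeholder arrays with the shapes printed above: 49 weekly samples
x_demo = np.zeros((49, 7, 7))
y_demo = np.zeros((49, 7))

# keep the first 67% of samples for training and the last 33% for testing
split = int(len(x_demo) * 0.67)
X_train_demo, X_test_demo = x_demo[:split], x_demo[split:]
y_train_demo, y_test_demo = y_demo[:split], y_demo[split:]

print(X_train_demo.shape, X_test_demo.shape)
```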

#### Data Shape for Recurrent Neural Networks
Input to an LSTM or GRU layer needs to be three-dimensional, with shape

(observations, timesteps, n_cols)
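
As an illustration of that shape, the sketch below stacks a made-up table of daily values into non-overlapping one-week samples. The array sizes are arbitrary and the variable names are not from the notebook.

```python
import numpy as np

n_days, n_cols, timesteps = 21, 7, 7
# made-up daily table: one row per day, one column per feature
daily = np.arange(n_days * n_cols, dtype=float).reshape(n_days, n_cols)

# group consecutive days into weeks: (observations, timesteps, n_cols)
weekly = daily.reshape(-1, timesteps, n_cols)

print(weekly.shape)  # (3, 7, 7)
```

Each `weekly[i]` is one sample: seven consecutive days, each with seven feature columns.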


In [83]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Expected input batch shape: (batch_size, timesteps, data_dim)
# The network is not stateful, so the batch dimension is left as None
# and any batch size can be used when fitting or predicting.

model = Sequential()
# first LSTM layer returns the full sequence so the second LSTM receives one vector per timestep
model.add(LSTM(4, return_sequences=True, stateful=False,
               batch_input_shape=(None, days, x_cols)))
# second LSTM layer returns only its final output
model.add(LSTM(4, stateful=False, return_sequences=False))
# one linear output per day of the predicted week
model.add(Dense(7, activation='linear'))

model.compile(loss='mean_squared_logarithmic_error',
              optimizer='adam',
              metrics=['mae'])
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_16 (LSTM)               (None, 7, 4)              192       
_________________________________________________________________
lstm_17 (LSTM)               (None, 4)                 144       
_________________________________________________________________
dense_7 (Dense)              (None, 7)                 35        
Total params: 371
Trainable params: 371
Non-trainable params: 0
_________________________________________________________________


In [84]:
# train on the training weeks
model.fit(X_train, y_train, batch_size=10, epochs=10)
# evaluate on the held-out weeks; returns [loss, mae]
score = model.evaluate(X_test, y_test, batch_size=16)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f693daa9978>
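
`model.evaluate` returns the loss followed by the metrics, here mean squared logarithmic error and mean absolute error. The sketch below shows how those two numbers are defined, using NumPy on made-up values rather than the model's actual predictions. Keras' own implementation additionally clips very small values before taking the log, which this sketch omits.

```python
import numpy as np

# made-up daily totals for one predicted week (not real model output)
y_true = np.array([1200.0, 3400.0, 2200.0, 1700.0, 2200.0, 2000.0, 2500.0])
y_pred = np.array([1000.0, 3000.0, 2500.0, 1700.0, 2000.0, 2100.0, 2400.0])

# mean squared logarithmic error: squared difference of log(1 + value)
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# mean absolute error, in the same units as the target
mae = np.mean(np.abs(y_true - y_pred))

print(msle, mae)
```

Because the loss works on logarithms, it penalizes relative rather than absolute errors, while the MAE metric stays in the units of the daily totals.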