### Time Series Forecasting

- A time series is data collected periodically, over time.
- Time series forecasting is the task of predicting future data points, given some historical data.
- It is commonly used in a variety of tasks, including weather forecasting, retail and sales forecasting, stock market prediction, and behavior prediction (such as predicting the flow of car traffic over a day).
- There is a lot of time series data out there, and recognizing patterns in that data is an active area of machine learning research!

* In this notebook, we'll focus on one method for finding time-based patterns: using SageMaker's supervised learning model, DeepAR.

#### DeepAR

- DeepAR uses a recurrent neural network (RNN), which is designed to accept a sequence of data points as historical input and produce a predicted sequence of points. So, how does this model learn?

- During training, you'll provide a training dataset (made of several time series) to a DeepAR estimator. The estimator looks at all the training time series and tries to identify similarities across them.

- It trains by randomly sampling training examples from the training time series.

- Each training example consists of a pair of adjacent context and prediction windows of fixed, predefined lengths.
    - The context_length parameter controls how far in the past the model can see.
    - The prediction_length parameter controls how far in the future predictions can be made.

- In any forecasting task, you should choose a context window that gives the model enough relevant information to produce accurate predictions.

- In general, the data closest to the prediction time frame contains the information that most strongly influences that prediction.

- In many forecasting applications, like forecasting sales month-to-month, the context and prediction windows will be the same size, but sometimes it will be useful to have a larger context window to notice longer-term patterns in data.
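To make the context and prediction windows described above concrete, here is a minimal sketch of drawing one training example from a toy series. It only illustrates the idea of adjacent, fixed-length windows; it is not DeepAR's internal sampler, and the function name `sample_window_pair` is hypothetical.

```python
import numpy as np

def sample_window_pair(series, context_length, prediction_length, rng):
    """Pick one random pair of adjacent (context, prediction) windows."""
    total = context_length + prediction_length
    if len(series) < total:
        raise ValueError("series is shorter than context + prediction windows")
    # choose a random start so that both windows fit inside the series
    start = rng.integers(0, len(series) - total + 1)
    context = series[start:start + context_length]
    prediction = series[start + context_length:start + total]
    return context, prediction

rng = np.random.default_rng(0)
series = np.arange(100, dtype=float)  # toy series: 0, 1, ..., 99

context, prediction = sample_window_pair(
    series, context_length=30, prediction_length=10, rng=rng
)
print(len(context), len(prediction))
```

The prediction window always starts right where the context window ends, which is what "adjacent" means here.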

#### Energy Consumption Data

- The data we'll be working with in this notebook describes the electric power consumption of a single household. The dataset is originally taken from [Kaggle](https://www.kaggle.com/datasets/uciml/electric-power-consumption-data-set) and covers several years of measurements, from 2006 to 2010.

#### Machine Learning Workflow

This notebook approaches time series forecasting in a number of steps:

- Loading and exploring the data
- Creating training and test sets of time series
- Formatting data as JSON files and uploading to S3
- Instantiating and training a DeepAR estimator
- Deploying a model and creating a predictor
- Evaluating the predictor
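As a preview of the JSON formatting step, DeepAR reads JSON Lines input in which each line describes one time series with a `start` timestamp and a `target` list of values. The sketch below converts a toy pandas Series into one such line; the helper name `series_to_json_line` and the toy values are assumptions for illustration only.

```python
import json
import pandas as pd

def series_to_json_line(ts):
    """Convert a Series with a DatetimeIndex into one JSON Lines record."""
    record = {"start": str(ts.index[0]), "target": ts.tolist()}
    return json.dumps(record)

# toy daily series, not taken from the real dataset
ts = pd.Series(
    [1.0, 2.5, 3.0],
    index=pd.date_range("2007-01-01", periods=3, freq="D"),
)

line = series_to_json_line(ts)
print(line)
```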

#### Import packages

In [None]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Load the data

In [None]:
n_lines = 10

with open('household_power_consumption.txt') as file:
    head = [next(file) for line in range(n_lines)]
    
display(head)

#### Pre-Process the Data

- The 'household_power_consumption.txt' file has the following attributes:

    - The various data features are separated by semicolons (;)
    - Some values are 'nan' or '?', and we'll treat these both as NaN values
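As a rough sketch of how a file with these attributes could be read, `pd.read_csv` accepts a `sep` argument for the semicolon delimiter and an `na_values` list for the missing-value markers. The sample rows and column names below are made up for illustration, and this is not necessarily how `preprocessing_methods.py` loads the file.

```python
import io
import pandas as pd

# two made-up rows in a semicolon-separated layout; '?' marks a missing value
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "01/01/2007;00:00:00;1.5;240.0\n"
    "01/01/2007;00:01:00;?;?\n"
)

# treat both 'nan' and '?' as NaN while parsing
sample_df = pd.read_csv(sample, sep=";", na_values=["nan", "?"])
print(sample_df.isna().sum())
```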

##### Managing NaN values

- This DataFrame does include some data points that have missing values.
- So far, we've mainly been dropping these values, but there are other ways to handle NaN values, as well.
- One technique is to just fill the missing column values with the mean value from that column; this way the added value is likely to be realistic.

- The preprocessing_methods.py module will help to load in the original text file as a DataFrame and fill in any NaN values, per column, with the mean feature value.
- This technique will be fine for long-term forecasting; if I wanted to do an hourly analysis and prediction, I'd consider dropping the NaN values or taking an average over a small, sliding window rather than an entire column of data.
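The two filling strategies mentioned above can be compared on a toy column. This is a minimal sketch with made-up values, not the implementation in `preprocessing_methods.py`; the window size of 3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# toy column with one missing value and one large outlier
toy = pd.DataFrame({"power": [1.0, 2.0, np.nan, 4.0, 5.0, 100.0]})

# option 1: fill with the mean of the entire column
col_filled = toy.fillna(toy.mean())

# option 2: fill with the mean of a small, centered sliding window
window_mean = toy.rolling(window=3, center=True, min_periods=1).mean()
win_filled = toy.fillna(window_mean)

print(col_filled.loc[2, "power"], win_filled.loc[2, "power"])
```

The column mean is pulled upward by the outlier, while the sliding-window mean uses only the neighboring values, which is why the second option is more attractive for short-term analysis.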

In [None]:
os.getcwd()

In [None]:
sys.path.insert(0, os.getcwd())

In [None]:
from preprocessing_methods import DataFrameOperations

In [None]:
df_opt = DataFrameOperations('household_power_consumption.txt')

In [None]:
# fill NaN column values with *average* column value
df = df_opt.fill_nan_with_mean()

In [None]:
df.shape

In [None]:
df.head()

In [None]:
n_lines = 10

with open('new_household_power_consumption.txt') as file:
    head = [next(file) for line in range(n_lines)]
    
display(head)

In [None]:
df = pd.read_csv("new_household_power_consumption.txt")

In [None]:
lst = ["Date-Time", "Global_active_power", "Global_reactive_power", "Voltage",
       "Global_intensity", "Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]

##### Global Active Power

- In this example, we'll want to predict the global active power, which is the household's total minute-averaged active power (in kilowatts). So, below, I am getting just that column of data and displaying the resultant plot.

In [None]:
power_df = df['Global_active_power'].copy()
power_df.shape

In [None]:
# display the data 
plt.figure(figsize=(12,6))
# all data points
power_df.plot(title='Global active power', color='blue') 
plt.show()

- Since the data is recorded each minute, the above plot contains a lot of values. So, I'm also showing just a slice of data, below.

In [None]:
# plot a slice of the minute-level data covering one day
end_mins = 1440 # 1440 mins = 1 day

plt.figure(figsize=(12,6))
power_df[0:end_mins].plot(title='Global active power, over one day', color='blue') 
plt.show()

##### Hourly vs Daily

There is a lot of data, collected every minute, and so I could go one of two ways with my analysis:
1. Create many, short time series, say a week or so long, in which I record energy consumption every hour, and try to predict the energy consumption over the following hours or days.
2. Create fewer, long time series with data recorded daily that I could use to predict usage in the following weeks or months.

- Both tasks are interesting! It depends on whether you want to predict time patterns over a day/week or over a longer time period, like a month.
- With the amount of data I have, I think it would be interesting to see longer, recurring trends that happen over several months or over a year.
- So, I will resample the 'Global active power' values, recording daily data points as averages over 24-hr periods.
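The resampling idea can be checked on synthetic data first. The sketch below builds two days of made-up minute-level readings and reduces them to one mean value per day; note that `resample` requires a `DatetimeIndex` and that an aggregation such as `.mean()` is needed to get values back.

```python
import numpy as np
import pandas as pd

# two days of synthetic minute-level readings (not the real dataset)
idx = pd.date_range("2007-01-01", periods=2 * 1440, freq="min")
values = np.r_[np.full(1440, 1.0), np.full(1440, 3.0)]
minute_power = pd.Series(values, index=idx)

# group the readings into daily bins and average each bin
daily_power = minute_power.resample("D").mean()
print(daily_power)
```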

In [None]:
power_df

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df['Date-Time'] = pd.to_datetime(df['Date-Time'])

In [None]:
df.info()

In [None]:
datatime_df = df['Date-Time'].copy()
datatime_df

In [None]:
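# a Series with a default integer index cannot be resampled; resample the DataFrame on its 'Date-Time' column instead
mean_datatime_df = df.resample("D", on="Date-Time").mean(numeric_only=True)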

In [None]:
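# resample over 24-hour periods (one data point per day)
freq = '24h'
# resample needs a DatetimeIndex, so index the power values by their timestamps
power_df.index = pd.DatetimeIndex(df['Date-Time'])
# calculate the mean active power for a day
mean_power_df = power_df.resample(freq).mean()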

In [None]:
# display the mean values
plt.figure(figsize=(15, 8))
mean_power_df.plot(title='Global active power, mean per day', color='blue')
plt.tight_layout()
plt.show()