# Time Series Forecasting in Python:

## Objective:

* Combine, clean, and prepare the energy and temperature datasets so that exploration and modeling in the upcoming sections are easier.

## Data Description:
* hr_temp_20170201-20200131_subset.csv – This is a dataset containing hourly (variable DATE) temperature data (variable HourlyDryBulbTemperature) at a weather station near the area you are forecasting energy for.

* hrl_load_metered - 20170201-20200131.csv – This is a dataset containing hourly (variable datetime_beginning_ept) megawatt usage data (variable mw) for the area in Pennsylvania centered around Duquesne. We are using only three years of data because we want to make sure that we look at recent energy patterns that are still applicable to our current customers.

In [913]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt

In [914]:
import os
for dirname, _, filenames in os.walk('/kaggle/input/milestone1'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 1.0 Load Data:
* First, let's make sure date-related columns are imported with the 'datetime64' data type for easy manipulation.

In [915]:
meter = pd.read_csv('/kaggle/input/milestone1/hrl_load_metered - 20170201-20200131.csv')#, parse_dates=['datetime_beginning_utc', 'datetime_beginning_ept'])
weather = pd.read_csv('/kaggle/input/milestone1/hr_temp_20170201-20200131_subset.csv', parse_dates=['DATE'])
weather1 = pd.read_csv('/kaggle/input/milestone1/hr_temp_20200201-20200229_subset.csv', parse_dates=['DATE'])

In [916]:
for column in ['datetime_beginning_utc', 'datetime_beginning_ept']:
    meter[column] = pd.to_datetime(meter[column])

weather = weather.sort_values('DATE')
meter = meter.sort_values('datetime_beginning_ept')
print(f'weather: {weather.shape}')
print(f'meter: {meter.shape}')

In [917]:
weather.dtypes

In [918]:
meter.dtypes
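
The cells above convert dates in two ways: `parse_dates` at read time for the weather file, and `pd.to_datetime` after loading for the meter file. The sketch below, which assumes a made-up two-row CSV held in memory (the column name `temp` and the values are hypothetical), suggests that both routes should give the same datetime column.

```python
import io
import pandas as pd

# Hypothetical sample data, not taken from the real files
csv_text = "DATE,temp\n2017-02-01 00:00:00,37\n2017-02-01 01:00:00,36\n"

# Route 1: parse the date column while reading
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

# Route 2: read the column as text, then convert it afterwards
parsed_later = pd.read_csv(io.StringIO(csv_text))
parsed_later["DATE"] = pd.to_datetime(parsed_later["DATE"])

print(parsed_on_read["DATE"].dtype, parsed_later["DATE"].dtype)
```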

## 2.0 Preprocessing:

### 2.1 Drop/Rename columns:
* Let's drop unnecessary columns and rename some columns for simplicity.

In [919]:
#weather
weather = weather.drop(columns=['STATION','REPORT_TYPE','SOURCE'])

In [920]:
#meter
meter = meter.rename(columns={'datetime_beginning_ept':'DATE'})
meter = meter.drop(columns=['datetime_beginning_utc','nerc_region','mkt_region','zone','load_area','is_verified'])
meter.head()

### 2.2 Index DateTime:
* Let's extract date components (year, month, day, hour, day_of_week) from the 'DATE' column.
* Then, we will create a new column for each of them.
* Finally, we will set these columns as the index of each dataframe.
* A common index makes combining our two dataframes (energy & weather) easier.

In [921]:
meter['day_of_week'] = meter.DATE.dt.dayofweek
meter['hour'] = meter.DATE.dt.hour
meter['day'] = meter.DATE.dt.day
meter['month'] = meter.DATE.dt.month
meter['year'] = meter.DATE.dt.year
meter.head(3)

In [922]:
weather['day_of_week'] = weather.DATE.dt.dayofweek
weather['hour'] = weather.DATE.dt.hour
weather['day'] = weather.DATE.dt.day
weather['month'] = weather.DATE.dt.month
weather['year'] = weather.DATE.dt.year
weather.head(3)

In [923]:
weather0 = weather.set_index(['year','month','day','hour','day_of_week'])
meter0 = meter.set_index(['year','month','day','hour','day_of_week'])

weather0 = weather0.drop(columns=['DATE'])
meter0 = meter0.drop(columns=['DATE'])
df = weather0.join(meter0, how='outer')
df = df.rename(columns={'HourlyDryBulbTemperature':'temp'})
df
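
To see what the outer join on a shared index does, here is a minimal sketch on two toy frames. The values and the helper `add_time_index` are hypothetical and only mirror the steps above in simplified form (without `day_of_week`). A timestamp present in only one frame should survive the join with `NaN` in the other frame's column.

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration
temps = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-02-01 00:00", "2017-02-01 01:00"]),
    "temp": [37.0, 36.0],
})
loads = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-02-01 01:00", "2017-02-01 02:00"]),
    "mw": [1400.0, 1380.0],
})

def add_time_index(frame):
    """Index a frame by (year, month, day, hour) taken from its DATE column."""
    out = frame.copy()
    out["year"] = out["DATE"].dt.year
    out["month"] = out["DATE"].dt.month
    out["day"] = out["DATE"].dt.day
    out["hour"] = out["DATE"].dt.hour
    return out.drop(columns="DATE").set_index(["year", "month", "day", "hour"])

# Outer join keeps every hour that appears in either frame
combined = add_time_index(temps).join(add_time_index(loads), how="outer")
print(combined)
```

Only the 01:00 row has both values here; the 00:00 and 02:00 rows each carry one `NaN`, which is the kind of gap the next section looks for.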

### 2.3 Identify Missing Data:
* Not all of the temperature data is recorded, as the stations would occasionally not report.
* So, we will fill in these missing values using linear interpolation.
* Let's look at what data we are missing.

In [924]:
df.isna().sum()

In [925]:
#missing mw values (Mar 12 2017, Mar 11 2018, Mar 10 2019)
# each date is the second Sunday of March, when US daylight saving time begins;
# the skipped local hour in the EPT timestamps is a more likely cause than meter downtime
missing_mw = df[df['mw'].isna()]
missing_mw
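
One way to probe the three dates above is to check whether they contain a local hour that does not exist on the clock. This is a rough sketch, assuming that the `America/New_York` time zone approximates Eastern Prevailing Time and that the gap falls on the 02:00 hour (neither is confirmed by the data shown here).

```python
import pandas as pd

# The three dates with missing mw values, at the assumed 02:00 local hour
candidates = pd.DatetimeIndex(
    ["2017-03-12 02:00", "2018-03-11 02:00", "2019-03-10 02:00"]
)

# Assumption: America/New_York stands in for Eastern Prevailing Time.
# nonexistent="NaT" marks wall-clock times skipped by a clock change.
localized = candidates.tz_localize("America/New_York", nonexistent="NaT")
print(localized.isna())
```

A `True` entry means that wall-clock hour was skipped, so no meter reading could carry that timestamp.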

In [926]:
# missing temp values (Oct 25 2018, Dec 25 2018, May 26 2019)
# the weather temp sensor doesn't exhibit any particular yearly maintenance downtime.
# we will fill these temp values with interpolation
missing_temp = df[df['temp'].isna()]
missing_temp

### 2.4 Treat Missing Data:
* We are missing:
    * 37 temperature values [temp (unit: degree F)]
    * 3 energy consumption values [mw (unit: MWh)]
* Let's fill in the gaps using linear interpolation in the forward direction.

In [937]:
df['temp'] = df['temp'].interpolate(method='linear', limit_direction = 'forward')
df['mw'] = df['mw'].interpolate(method='linear', limit_direction = 'forward')
df.isnull().sum()
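
As a small illustration of what the interpolation call does, the sketch below applies the same arguments to a made-up series with a two-value gap. The numbers are hypothetical; the missing entries are placed evenly between the neighboring known values.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures with two consecutive missing readings
s = pd.Series([30.0, np.nan, np.nan, 36.0])

# Same arguments as in the cell above
filled = s.interpolate(method="linear", limit_direction="forward")
print(filled.tolist())  # [30.0, 32.0, 34.0, 36.0]
```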

### 2.5 Re-index with Date:
* We will dissolve the multi-index (year, month, day, hour, day_of_week).
* Then, we will create a new index, 'date', which combines the year, month, day, and hour values.

In [928]:
df = df.reset_index()
df.head(3)

In [929]:
df['date'] = pd.to_datetime(dict(year=df.year, month=df.month, day=df.day, hour=df.hour))
print(df.dtypes)
df.head(3)

In [930]:
df = df.set_index('date')
df = df.drop(columns='day')
df = df.rename(columns={'day_of_week':'weekday'})
df.head(3)
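
The rebuild of a timestamp from its components can be tried in isolation. This is a minimal sketch on a hypothetical two-row frame; `pd.to_datetime` accepts a frame (or dict) whose columns are named `year`, `month`, `day`, and optionally `hour`.

```python
import pandas as pd

# Hypothetical date components, not taken from the real data
parts = pd.DataFrame({
    "year": [2017, 2017],
    "month": [2, 3],
    "day": [1, 12],
    "hour": [0, 23],
})

# Assemble one timestamp per row from the component columns
stamps = pd.to_datetime(parts[["year", "month", "day", "hour"]])
print(stamps.tolist())
```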

### 2.6 Train/Test Split:
* train = from 2017-02-01 to 2019-12-31
* test = from 2020-01-01 to 2020-01-31

In [931]:
train = df[df.index.to_series().between('2017-02-01 00:00:00','2019-12-31 23:59:59')]
test = df[df.index.to_series().between('2020-01-01 00:00:00','2020-01-31 23:59:59')]

In [932]:
train

In [933]:
test

In [934]:
# let's confirm that we didn't miss any rows when splitting dataset to train/test
print(f'df: {df.shape[0]}')
print(f'train: {train.shape[0]}')
print(f'test: {test.shape[0]}')
print(f'train+test = df: {train.shape[0]+test.shape[0]} = {df.shape[0]}')
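
An alternative way to split, shown here only as a sketch and not as the method used above, is a single cutoff timestamp with two complementary masks. Because every row is either before the cutoff or not, no row can fall between the two ranges. The toy frame and the names `toy`, `cutoff`, `train_toy`, and `test_toy` are hypothetical.

```python
import pandas as pd

# Hypothetical hourly index covering four days around the year boundary
idx = pd.date_range("2019-12-30", periods=96, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"mw": range(96)}, index=idx)

# One cutoff, two complementary masks
cutoff = pd.Timestamp("2020-01-01")
train_toy = toy[toy.index < cutoff]
test_toy = toy[toy.index >= cutoff]

print(len(train_toy), len(test_toy), len(toy))
```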

## Remark:
* The train (25539 rows) / test (744 rows) split completed with no data loss.
* Now, we are ready to submit this assignment.
