# Forecasting Energy Demand - Milestone 1

This Jupyter notebook serves as a partial solution to Milestone 1 of the liveProject on forecasting energy demand in Python. It is the full solution with all of the code removed.

## Importing Necessary Libraries and Functions

The first thing we need to do is import the functions and libraries that we will be working with throughout the project. We should also load all of the necessary data sets here instead of loading them as we go. We will be using energy production data from PJM Interconnection, a regional transmission organization that coordinates the movement of wholesale electricity in parts of the United States. Specifically, we will focus on a region of Pennsylvania. We will also be using temperature data collected by the National Oceanic and Atmospheric Administration (NOAA).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Preparing the Energy and Temperature Data

In [3]:
temp = pd.read_csv('hr_temp_20170201-20200131_subset.csv')
temp.head()

Unnamed: 0,STATION,DATE,REPORT_TYPE,SOURCE,HourlyDryBulbTemperature
0,72520514762,2017-02-01T00:53:00,FM-15,7,37.0
1,72520514762,2017-02-01T01:53:00,FM-15,7,37.0
2,72520514762,2017-02-01T02:53:00,FM-15,7,36.0
3,72520514762,2017-02-01T03:53:00,FM-15,7,36.0
4,72520514762,2017-02-01T04:53:00,FM-15,7,36.0


In [4]:
load = pd.read_csv('hrl_load_metered - 20170201-20200131.csv')
load.head()

Unnamed: 0,datetime_beginning_utc,datetime_beginning_ept,nerc_region,mkt_region,zone,load_area,mw,is_verified
0,2/1/2017 5:00,2/1/2017 0:00,RFC,WEST,DUQ,DUQ,1419.881,True
1,2/1/2017 6:00,2/1/2017 1:00,RFC,WEST,DUQ,DUQ,1379.505,True
2,2/1/2017 7:00,2/1/2017 2:00,RFC,WEST,DUQ,DUQ,1366.106,True
3,2/1/2017 8:00,2/1/2017 3:00,RFC,WEST,DUQ,DUQ,1364.453,True
4,2/1/2017 9:00,2/1/2017 4:00,RFC,WEST,DUQ,DUQ,1391.265,True


In [5]:
data = pd.DataFrame({
    'date': temp['DATE'],
    'temp': temp['HourlyDryBulbTemperature'],
    'mw': load['mw']
})
data.head()

Unnamed: 0,date,temp,mw
0,2017-02-01T00:53:00,37.0,1419.881
1,2017-02-01T01:53:00,37.0,1379.505
2,2017-02-01T02:53:00,36.0,1366.106
3,2017-02-01T03:53:00,36.0,1364.453
4,2017-02-01T04:53:00,36.0,1391.265
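
The cell above combines the two tables by row position, so it relies on both files covering the same hours in the same order. The following is a minimal sketch of how that assumption could be checked. It uses toy frames built from the previews above, and the names `temp_demo`, `load_demo`, `same_length`, and `rows_match` are hypothetical. Comparing against the ```datetime_beginning_ept``` column is also an assumption, and on the full data set a check like this may flag rows, for example around daylight saving time changes.

```python
import pandas as pd

# Toy stand-ins for the two raw tables (values copied from the previews above).
temp_demo = pd.DataFrame({
    "DATE": ["2017-02-01T00:53:00", "2017-02-01T01:53:00", "2017-02-01T02:53:00"],
    "HourlyDryBulbTemperature": [37.0, 37.0, 36.0],
})
load_demo = pd.DataFrame({
    "datetime_beginning_ept": ["2/1/2017 0:00", "2/1/2017 1:00", "2/1/2017 2:00"],
    "mw": [1419.881, 1379.505, 1366.106],
})

# Floor the temperature timestamps to the hour so they are comparable
# with the hour-beginning timestamps in the load table.
temp_hours = pd.to_datetime(temp_demo["DATE"]).dt.floor("h")
load_hours = pd.to_datetime(load_demo["datetime_beginning_ept"],
                            format="%m/%d/%Y %H:%M")

# Positional alignment is only safe if both checks pass.
same_length = len(temp_demo) == len(load_demo)
rows_match = bool((temp_hours == load_hours).all())
print(same_length, rows_match)
```

If either check failed on the real files, merging the two tables on an hourly timestamp would be a safer alternative than combining them by position.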


One of the problems when loading a data set you want to run time series analysis on is the type of object Python sees for the "date" variable. Let's look at the pandas data frame data types for each of our variables.

In [6]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    26280 non-null  object 
 1   temp    26243 non-null  float64
 2   mw      26280 non-null  float64
dtypes: float64(2), object(1)
memory usage: 616.1+ KB


Here we can see that the ```date``` column has the generic ```object``` dtype, so pandas is treating it as plain text rather than as a date. We can change that with the pandas function ```to_datetime``` as we have below.

In [7]:
data['date'] = pd.to_datetime(data['date'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    26280 non-null  datetime64[ns]
 1   temp    26243 non-null  float64       
 2   mw      26280 non-null  float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 616.1 KB
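
As a small illustration of the conversion, the sketch below parses a toy series that includes one deliberately malformed value. The ```errors="coerce"``` option is not used in this notebook; it is shown here only as one way to surface unparseable strings as missing values instead of raising an error. The names `raw`, `parsed`, and `n_bad` are hypothetical.

```python
import pandas as pd

# Two valid timestamps in the same format as the temperature file,
# plus one invalid string added for illustration.
raw = pd.Series(["2017-02-01T00:53:00", "2017-02-01T01:53:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT rather than raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Count how many entries failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```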


Good! Now that we have a ```datetime64``` column in our data set, we can easily create other date variables. The hour of day, day of week, month of year, and possibly even the year itself might all impact the energy usage. Let's extract these variables from our date column so that we can use them in our analysis. Pandas makes this easy with the ```hour```, ```weekday```, ```month```, and ```year``` attributes of the ```.dt``` accessor. Then let's inspect the first few observations to make sure things look correct.

In [8]:
data['hour'] = data['date'].dt.hour
data['weekday'] = data['date'].dt.weekday
data['month'] = data['date'].dt.month
data['year'] = data['date'].dt.year
data.head()

Unnamed: 0,date,temp,mw,hour,weekday,month,year
0,2017-02-01 00:53:00,37.0,1419.881,0,2,2,2017
1,2017-02-01 01:53:00,37.0,1379.505,1,2,2,2017
2,2017-02-01 02:53:00,36.0,1366.106,2,2,2,2017
3,2017-02-01 03:53:00,36.0,1364.453,3,2,2,2017
4,2017-02-01 04:53:00,36.0,1391.265,4,2,2,2017
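
One detail worth confirming is the weekday convention: pandas numbers the days from Monday = 0 to Sunday = 6, which is why Wednesday, February 1, 2017 shows up as 2 above. A minimal sketch on two hand-picked timestamps (the names `stamps` and `parts` are hypothetical):

```python
import pandas as pd

# A Wednesday and a Sunday, chosen for illustration.
stamps = pd.Series(pd.to_datetime(["2017-02-01 00:53:00", "2017-02-05 13:53:00"]))

# Pull the calendar components out through the .dt accessor.
parts = pd.DataFrame({
    "hour": stamps.dt.hour,
    "weekday": stamps.dt.weekday,  # Monday = 0, ..., Sunday = 6
    "month": stamps.dt.month,
    "year": stamps.dt.year,
})
print(parts)
```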


Everything looks good in the first few observations above. If you still aren't convinced you could pull different pieces of the data frame to make sure that other observations are structured correctly.

Now we should set this Python date object as the index of our data set. This will make it easier for plotting as well as forecasting later. We can use the ```set_index``` function for this.

In [9]:
data.set_index('date', inplace=True)
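
With a datetime index in place, a few quick sanity checks become possible. The sketch below runs on a toy frame, and the names `demo`, `is_sorted`, `has_duplicates`, and `feb_first` are hypothetical. It checks that the index is sorted and free of duplicate timestamps, then selects a whole day using a partial date string.

```python
import pandas as pd

# Toy frame with a datetime index, mimicking the structure of `data`.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2017-02-01 00:53:00",
                            "2017-02-01 01:53:00",
                            "2017-02-02 00:53:00"]),
    "mw": [1419.881, 1379.505, 1400.0],  # last value is made up for the demo
}).set_index("date")

# A time series index should normally be sorted and free of duplicates.
is_sorted = demo.index.is_monotonic_increasing
has_duplicates = demo.index.has_duplicates

# Partial string indexing: select every row that falls on one day.
feb_first = demo.loc["2017-02-01"]
print(is_sorted, has_duplicates, len(feb_first))
```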

Good! Now that we have our data structured as we would like, we can start cleaning the data. First, let's check whether there are any missing values in each column, paying particular attention to temperature. The ```isnull``` function will help us here.

In [10]:
data.isnull().sum()

temp       37
mw          0
hour        0
weekday     0
month       0
year        0
dtype: int64

Looks like there are 37 missing values in our temperature data. We should impute those. However, we don't just want to put the average temperature in these spots, as the overall average across three years probably isn't a good guess for any one hour. The temperature of the hours on either side of the missing observation would be more helpful. Let's do a linear interpolation across missing values to help with this. This will essentially draw a straight line between the two known points to fill in the missing values. We can use the ```interpolate(method='linear')``` function for this.

In [11]:
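# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to the original frame.
data['temp'] = data['temp'].interpolate(method='linear')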
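
To see what linear interpolation does, here is a minimal sketch on a toy series (the names `temps` and `filled` are hypothetical). Note that ```method='linear'``` treats the observations as equally spaced and ignores the index values, which is a reasonable fit for hourly readings.

```python
import numpy as np
import pandas as pd

# Two known temperatures with a two-hour gap between them.
temps = pd.Series([36.0, np.nan, np.nan, 42.0])

# Linear interpolation places the missing points on the straight line
# joining the known neighbours: 36 -> 38 -> 40 -> 42.
filled = temps.interpolate(method="linear")
print(filled.tolist())
```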

Now let's see if we have any more missing temperature values.

In [12]:
data['temp'].isnull().sum()

0

No more! The energy data needs the same check. The ```isnull().sum()``` output above already reports 0 missing values for ```mw```, so there are no missing values there either. Perfect.

Now it is time to split the data into two pieces - training and testing. The training data set is the data set we will be building our model on, while the testing data set is what we will be reporting results on since the model wouldn't have seen it ahead of time. Using the date index we can easily do this in our data frame.

In [13]:
train = data[:'2019-12-31'].copy()
test = data['2020-01':].copy()
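
The split relies on partial string slicing of the datetime index, where an end label such as ```'2019-12-31'``` includes every timestamp on that day. A minimal sketch on a toy hourly frame (the names `demo`, `train_demo`, and `test_demo` are hypothetical) shows how one might confirm that the two pieces cover all rows without overlapping:

```python
import pandas as pd

# Three days of hourly timestamps straddling the train/test boundary.
idx = pd.date_range("2019-12-30 00:53", periods=72, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"mw": range(72)}, index=idx)

# Same slicing pattern as the cell above, written with .loc.
train_demo = demo.loc[:"2019-12-31"].copy()
test_demo = demo.loc["2020-01":].copy()

# Every row lands in exactly one piece, and training ends before testing begins.
print(len(train_demo), len(test_demo))
print(train_demo.index.max(), test_demo.index.min())
```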

Now let's look at the first few observations for our training data set.

In [14]:
train.head()

date,temp,mw,hour,weekday,month,year
2017-02-01 00:53:00,37.0,1419.881,0,2,2,2017
2017-02-01 01:53:00,37.0,1379.505,1,2,2,2017
2017-02-01 02:53:00,36.0,1366.106,2,2,2,2017
2017-02-01 03:53:00,36.0,1364.453,3,2,2,2017
2017-02-01 04:53:00,36.0,1391.265,4,2,2,2017


Everything looks good there!

Now let's do the same for our testing data set.

In [15]:
test.head()

date,temp,mw,hour,weekday,month,year
2020-01-01 00:53:00,31.0,1363.428,0,2,1,2020
2020-01-01 01:53:00,29.0,1335.975,1,2,1,2020
2020-01-01 02:53:00,30.0,1296.817,2,2,1,2020
2020-01-01 03:53:00,30.0,1288.403,3,2,1,2020
2020-01-01 04:53:00,31.0,1292.263,4,2,1,2020


Excellent! We now have our data cleaned and split. Having combined and cleaned the data sets, we will find both the exploration and the modeling of the data much easier in the upcoming sections!