# Analysis of Hourly Energy Consumption Data

## Introduction

This notebook details my attempt at time series forecasting, using historical hourly energy consumption data from different regions of the United States. The dataset I am analyzing comes from the following source: https://www.kaggle.com/robikscube/hourly-energy-consumption/home.

First, I will perform a preliminary analysis of the dataset to check how clean it is and whether it contains any missing values. I will then visualize the data and compute some basic statistics to understand it before building any models.

Second, I will apply neural network models to the dataset to see whether the final year of energy usage can be predicted.

Let's import some libraries.

In [2]:
import pandas as pd
import pandas_profiling

In [4]:
# Load the dataset
dataset = pd.read_csv('./hourly-energy-consumption/DAYTON_hourly.csv', delimiter=',')

# Determine the shape of the dataset
print('The shape of the dataset is :', dataset.shape)

# Convert the row count to whole days and years (assumes one row per hour and 365-day years)
num_hours = dataset.shape[0]
print('Including', num_hours, 'hours,', num_hours // 24, 'days, and', num_hours // (24 * 365), 'years')

The shape of the dataset is : (121275, 2)
Including 121275 hours, 5053 days, and 13 years
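The day and year counts above come from the row count alone, so they are only as accurate as the assumption of exactly one row per hour. As a rough cross-check, the span can be measured from the timestamps themselves. This is a minimal sketch, assuming the CSV's `Datetime` column parses with `pd.to_datetime`; the four-row frame and its values are made-up stand-ins, not the Dayton file.

```python
import pandas as pd

# Stand-in rows for illustration only; the real file has 121275 rows.
sample = pd.DataFrame({
    "Datetime": ["2004-12-31 01:00:00", "2004-12-31 02:00:00",
                 "2005-01-01 00:00:00", "2005-01-02 01:00:00"],
    "DAYTON_MW": [1596.0, 1517.0, 1500.0, 1490.0],
})

# Parse the strings into timestamps so that date arithmetic works.
sample["Datetime"] = pd.to_datetime(sample["Datetime"])

# Span covered by the data, measured from the timestamps themselves.
first, last = sample["Datetime"].min(), sample["Datetime"].max()
span = last - first
span_hours = int(span.total_seconds() // 3600)

print("First timestamp:", first)
print("Last timestamp:", last)
print("Span:", span_hours, "hours,", span.days, "days")
```

On the real data, a mismatch between this span and the row count would hint at missing or repeated hours.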


## Analysis and Visualization

Let's get some basic statistics for the dataset, such as the mean, standard deviation, median (the 50% row), min, and max energy consumption.

In [5]:
dataset.describe()

Statistic,DAYTON_MW
count,121275.0
mean,2037.85114
std,393.403153
min,982.0
25%,1749.0
50%,2009.0
75%,2279.0
max,3746.0


The median (2009) is close to the mean (about 2038), but the max lies much farther from the mean (about 1708 MW above it) than the min does (about 1056 MW below it). This suggests that the distribution may be right skewed.
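The comparison of the two tails can be worked out directly from the summary statistics. This is a small sketch using the numbers copied from the `describe()` output; comparing the extremes is only a rough heuristic for skew, not a formal test.

```python
# Summary statistics copied from the describe() output.
mean, median = 2037.85114, 2009.0
minimum, maximum = 982.0, 3746.0

upper_tail = maximum - mean   # distance from the mean up to the max
lower_tail = mean - minimum   # distance from the mean down to the min

print(f"max - mean    = {upper_tail:.2f}")
print(f"mean - min    = {lower_tail:.2f}")
print(f"tail ratio    = {upper_tail / lower_tail:.2f}")
print(f"mean - median = {mean - median:.2f}")
```

An upper tail longer than the lower tail, together with a mean above the median, is consistent with a right-skewed distribution.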

The package "pandas_profiling" can give us some more information about the dataset, such as the number of missing values, and a histogram of the energy consumption.

In [6]:
pandas_profiling.ProfileReport(dataset)

Dataset info,Value
Number of variables,2
Number of observations,121275
Total Missing (%),0.0%
Total size in memory,1.9 MiB
Average record size in memory,16.0 B

Variable type,Count
Numeric,1
Categorical,1
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

DAYTON_MW (numeric),Value
Distinct count,2361
Unique (%),1.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

DAYTON_MW summary,Value
Mean,2037.9
Minimum,982
Maximum,3746
Zeros (%),0.0%

DAYTON_MW quantile statistics,Value
Minimum,982
5-th percentile,1454
Q1,1749
Median,2009
Q3,2279
95-th percentile,2742
Maximum,3746
Range,2764
Interquartile range,530

DAYTON_MW descriptive statistics,Value
Standard deviation,393.4
Coef of variation,0.19305
Kurtosis,0.25418
Mean,2037.9
MAD,311.67
Skewness,0.5238
Sum,247140000
Variance,154770
Memory size,947.5 KiB

Most frequent values,Count,Frequency (%)
1990.0,161,0.1%
1962.0,161,0.1%
1984.0,160,0.1%
1978.0,160,0.1%
1921.0,156,0.1%
1976.0,154,0.1%
1996.0,154,0.1%
1943.0,154,0.1%
1955.0,153,0.1%
1919.0,153,0.1%

Smallest values,Count,Frequency (%)
982.0,1,0.0%
1000.0,1,0.0%
1011.0,1,0.0%
1015.0,1,0.0%
1021.0,1,0.0%

Largest values,Count,Frequency (%)
3712.0,1,0.0%
3722.0,1,0.0%
3724.0,2,0.0%
3741.0,1,0.0%
3746.0,1,0.0%

Datetime (categorical),Value
Distinct count,121271
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0

Datetime value,Count
2014-11-02 02:00:00,2
2016-11-06 02:00:00,2
2017-11-05 02:00:00,2
Other values (121268),121269

Datetime value,Count,Frequency (%)
2014-11-02 02:00:00,2,0.0%
2016-11-06 02:00:00,2,0.0%
2017-11-05 02:00:00,2,0.0%
2015-11-01 02:00:00,2,0.0%
2008-10-28 22:00:00,1,0.0%
2011-09-30 19:00:00,1,0.0%
2007-04-15 07:00:00,1,0.0%
2016-11-16 02:00:00,1,0.0%
2006-06-07 02:00:00,1,0.0%
2008-02-26 09:00:00,1,0.0%

Row,Datetime,DAYTON_MW
0,2004-12-31 01:00:00,1596.0
1,2004-12-31 02:00:00,1517.0
2,2004-12-31 03:00:00,1486.0
3,2004-12-31 04:00:00,1469.0
4,2004-12-31 05:00:00,1472.0


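As expected, the data is slightly right skewed: the report gives a skewness of about 0.52, and the mean (2037.9) sits a little above the median (2009).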
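The report's headline figures (missing values, repeated timestamps, and skewness) can also be cross-checked with plain pandas. This is a minimal sketch on made-up stand-in rows, assuming the same `Datetime` and `DAYTON_MW` column names as the CSV; it is not the report's own implementation, and its skewness estimate is not guaranteed to match the report's figure exactly.

```python
import pandas as pd

# Made-up stand-in data; the repeated 02:00 timestamp mimics the
# duplicated early-November hours listed in the report.
df = pd.DataFrame({
    "Datetime": ["2014-11-02 01:00:00", "2014-11-02 02:00:00",
                 "2014-11-02 02:00:00", "2014-11-02 03:00:00",
                 "2014-11-02 04:00:00"],
    "DAYTON_MW": [1500.0, 1450.0, 1460.0, 1480.0, 2100.0],
})

# Missing values per column.
missing = df.isna().sum()

# Timestamps that occur more than once.
dup_mask = df["Datetime"].duplicated(keep=False)
duplicated_times = df.loc[dup_mask, "Datetime"].unique()

# Sample skewness; a positive value indicates a longer right tail.
skewness = df["DAYTON_MW"].skew()

print(missing)
print("Duplicated timestamps:", list(duplicated_times))
print("Skewness:", skewness)
```

Running the same three checks on `dataset` would show whether the repeated timestamps need handling before model building.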