<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/05-Time_Series/B-ERCOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing electricity usage from Electric Reliability Council of Texas (ERCOT)

We will be analyzing the electricity usage data from ERCOT.

ERCOT releases the [electricity usage of its grid](http://www.ercot.com/gridinfo/load) and [provides archival copies of the data](http://www.ercot.com/gridinfo/load/load_hist).
(ERCOT also publishes its own load forecasts, which make a good baseline for any model.)

In [None]:
#@title Setup

!pip install -U -q statsmodels PyMySQL sqlalchemy

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

from sqlalchemy import create_engine
from sqlalchemy import text

In [None]:
#@title Plotting Setup

%config InlineBackend.figure_format = 'retina'

# Change the graph defaults
plt.rcParams['figure.figsize'] = (8, 3)  # Default figure size of 8x3 inches
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.color'] = 'lightgray'
plt.rcParams['font.size'] = 10  # Default font size of 10 points
plt.rcParams['lines.linewidth'] = 1  # Default line width of 1 point
plt.rcParams['lines.markersize'] = 2  # Default marker size of 2 points
plt.rcParams['legend.fontsize'] = 10  # Default legend font size of 10 points

# Load and plot the data


In [None]:
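# Connection string for the MySQL database that hosts the ERCOT data
# (the charset is already part of the URL, so no extra argument is needed)
conn_string = 'mysql+pymysql://{user}:{password}@{host}/{db}?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'student',
    password = 'dwdstudent2015',
    db = 'ercot')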

engine = create_engine(conn_string)

# This query loads the dataset from the DB into the dataframe
with engine.connect() as con:
  sql = "SELECT * FROM ercot.electricity"
  df = pd.read_sql(text(sql), con=con)
  df = df.set_index('DATE_TIME')

In [None]:
df

In [None]:
df.plot(
    logy=True,
    title='ERCOT Consumption Data',
    ylabel="Consumption"
)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left') # move the legend out of the chart

In [None]:
df.plot(
    y = 'ERCOT',
    title='ERCOT Consumption Data',
    ylabel="Consumption"
)

# Potential questions:

* We are trying to perform capacity planning. How will demand evolve over time? Can we make projections for the next 5 years?
* We care about the maximum capacity of our system as we need to avoid blackouts. Make projections for the total capacity necessary to avoid blackouts. Ideally, provide confidence intervals showing how much maximum capacity we need.
* Perform the analysis on a regional basis, and plan capacity for each region (COAST, WEST, etc.).
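
As a starting point for the first question, here is a minimal sketch of a log-linear trend projection. It runs on a synthetic series, not the ERCOT data: the starting level, the 2% growth rate, and the names `t_years`, `trend`, and `projected` are all assumptions made for illustration.

```python
import numpy as np

HOURS_PER_YEAR = 24 * 365

# Hypothetical smoothed trend: 3 years of hourly values growing 2% per year
t_years = np.arange(3 * HOURS_PER_YEAR) / HOURS_PER_YEAR
trend = 40_000 * 1.02 ** t_years

# Fit a straight line to log(trend): log(T) = intercept + slope * t
slope, intercept = np.polyfit(t_years, np.log(trend), deg=1)
annual_growth = np.exp(slope) - 1

# Extrapolate 5 years past the end of the observed data
horizon = t_years[-1] + 5
projected = np.exp(intercept + slope * horizon)

print(f"Estimated annual growth: {annual_growth:.2%}")
print(f"Projected level in 5 years: {projected:,.0f}")
```

With the real data, a smoothed trend such as `T_y` (computed later in this notebook) would take the place of the synthetic `trend`. A constant growth rate is a strong assumption, so the extrapolation is best treated as a rough baseline.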

## Extracting Time Series Components: Trend, Seasonal, Residual

In [None]:
df['ERCOT'].autocorr()

In [None]:
pd.plotting.lag_plot(df.ERCOT, lag=1, s=1, alpha=0.1 )

### Extracting Daily Seasonal Component

In [None]:
Y = df.ERCOT

decompose = seasonal_decompose(Y,
                                model='multiplicative',
                                period=24,
                                extrapolate_trend=24)

T_d, S_d, R_d = decompose.trend, decompose.seasonal, decompose.resid

In [None]:
# plot the trend, after we remove daily seasonality
T_d.plot()

In [None]:
(
    S_d # the daily seasonal component
    .head(24 * 5) # plot the first five days
    .plot()
)

We also plot the residuals to see how well the seasonal component was removed. The outcome is not great: our techniques are fairly naive and do not account for the fluctuating magnitude of the changes (aka "_clustered volatility_"). Dealing with clustered volatility requires more advanced models than the ones we are currently using.

In [None]:
(
    R_d # the residual factors after removing the daily seasonal
    .head(24 * 30) # plot the first 30 days
    .plot()
)
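
The clustered volatility mentioned above can also be checked numerically. The sketch below uses synthetic residuals with a calm first half and a turbulent second half (the series, the noise levels, and the variable names are assumptions, not ERCOT values). It computes a rolling standard deviation and the lag-1 autocorrelation of the squared log residuals, two simple indicators of clustering.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical multiplicative residuals: calm first half, turbulent second half
n = 2000
sigma = np.where(np.arange(n) < n // 2, 0.01, 0.05)
resid = pd.Series(np.exp(rng.normal(0.0, sigma)))

# Work in logs, since the residuals are multiplicative factors around 1
log_resid = np.log(resid)

# Rolling standard deviation over one week of hourly values:
# roughly flat if the volatility is constant over time
rolling_vol = log_resid.rolling(window=24 * 7).std()

# Lag-1 autocorrelation of the squared log residuals:
# close to 0 when there is no volatility clustering
sq_autocorr = (log_resid ** 2).autocorr(lag=1)

print(rolling_vol.iloc[n // 2 - 1], rolling_vol.iloc[-1])
print(sq_autocorr)
```

Applied to the notebook's data, `R_d` would replace the synthetic `resid`.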

### Analyzing further the $T_d$ Component: Identifying Weekly Patterns

After removing the daily seasonality, we have three time series: trend ($T_d$), seasonal ($S_d$), and residual ($R_d$). Now we decompose the trend $T_d$ further to extract the weekly component, which has a period of `24 * 7` hours.
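
Written out under the multiplicative model used in these cells, the nesting looks as follows (the symbols match the variable names `T_d`, `S_d`, `R_d`, `T_w`, `S_w`, `R_w`):

$$
Y = T_d \cdot S_d \cdot R_d, \qquad T_d = T_w \cdot S_w \cdot R_w
\quad\Longrightarrow\quad
Y = T_w \cdot S_w \cdot S_d \cdot R_w \cdot R_d
$$

Each further decomposition replaces the current trend with a smoother trend, a seasonal factor, and a residual factor.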

In [None]:
decompose = seasonal_decompose(T_d,
                                model='multiplicative',
                                period=24 * 7,
                                extrapolate_trend=24 * 7)

T_w, S_w, R_w = decompose.trend, decompose.seasonal, decompose.resid

In [None]:
# This is the remaining trend component after removing daily and weekly fluctuations
T_w.plot()

In [None]:
# This shows the weekly seasonality
S_w.head(24*7).plot()

In [None]:
# This shows the residual, after removing the daily and weekly seasonality
R_w.head(24*365).plot()

### Analyzing further the $T_w$ Component: Identifying Yearly Patterns

In [None]:
decompose = seasonal_decompose(T_w,
                                model='multiplicative',
                                period=24 * 365,
                                extrapolate_trend=24 * 365)

T_y, S_y, R_y = decompose.trend, decompose.seasonal, decompose.resid

In [None]:
# This shows the overall trend, after removing daily, weekly, and yearly seasonality
T_y.plot()

In [None]:
# This shows the yearly seasonality
S_y.head(24*365).plot()

## Summary

In [None]:
# This shows the overall trend, after removing daily, weekly, and yearly seasonality
T_y.plot()

In [None]:
# This shows the yearly seasonality. We show the
# first 365 days * 24 hours as the pattern repeats in subsequent periods
S_y.head(24*365).plot()

In [None]:
# This shows the weekly seasonality. We show the first
# 7 days * 24 hours as the pattern repeats in subsequent periods
S_w.head(24*7).plot()

In [None]:
# This shows the daily seasonality. We show the first
# 24 hours, as the pattern repeats in subsequent periods
S_d.head(24).plot()

In [None]:
# This is the time series with the overall trend, plus seasonality
(T_y * S_y * S_w * S_d).plot()

In [None]:
# This is the residual, that is not captured by the trend or seasonality
# When we are modeling, we really talk about forecasting the trend and
# potentially modeling this time series, which has been de-trended
# and de-seasonalized.
( R_y * R_w * R_d ).plot(figsize=(16,4), linewidth=0.5)

In [None]:
# A high autocorrelation means that consumption is
# still clustered in time periods.
(R_y * R_w * R_d).autocorr()

In [None]:
ax = pd.plotting.lag_plot((R_y * R_w * R_d), lag=1, s=1)

In [None]:
# Here is the histogram of the residuals.
# Since these are multiplicative factors, it is a good idea
# to also take the log and plot them again.
( R_y * R_w * R_d ).hist(bins=1000)

In [None]:
np.log2( R_y * R_w * R_d ).hist(bins=1000)
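
To see why the log is useful for multiplicative factors, consider a tiny example with made-up values (not taken from the data): a factor of 0.5 and a factor of 2 are equally large deviations in multiplicative terms, but only the log scale treats them symmetrically.

```python
import numpy as np

# Hypothetical residual factors: half, unchanged, double
factors = np.array([0.5, 1.0, 2.0])

# On the raw scale the deviations from 1 are lopsided ...
raw_dev = factors - 1.0      # [-0.5, 0.0, 1.0]

# ... while in log2 they are symmetric around 0
log_dev = np.log2(factors)   # [-1.0, 0.0, 1.0]

print(raw_dev)
print(log_dev)
```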

In [None]:
# Examine the quantiles of the residual distribution
# These are the values with which we need to multiply our
# trend and seasonality projections to estimate maximum capacity
#
# 99% = For 87.6 hours in a year, consumption is above this level
# 99.9% = For 8.76 hours in a year, consumption is above this level
# 99.99% = For 52 mins in a year, consumption is above this level
# 99.999% = For 5.2 mins in a year, consumption is above this level

q=[
    0.00001, 0.0001,0.001,0.01,0.1,0.25,
    0.5,
    0.75,0.9,0.99,0.999,0.9999,0.99999
]
(R_y * R_w * R_d).quantile(q)
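
The arithmetic behind the comments above, and the way a quantile turns into a capacity estimate, can be sketched as follows. The projected load, the synthetic residuals, and the helper `hours_above` are hypothetical and only illustrate the calculation.

```python
import numpy as np

HOURS_PER_YEAR = 24 * 365  # 8760

def hours_above(quantile):
    """Expected hours per year that the residual exceeds the given quantile."""
    return (1 - quantile) * HOURS_PER_YEAR

for q in [0.99, 0.999, 0.9999, 0.99999]:
    print(f"{q}: {hours_above(q):.2f} hours/year")

# Hypothetical numbers, not taken from the ERCOT data
projected_load = 60_000.0  # trend * seasonality for some future hour
rng = np.random.default_rng(0)
# Synthetic stand-in for the residual series R_y * R_w * R_d
residuals = np.exp(rng.normal(0.0, 0.05, size=100_000))

# Multiply the projection by the 99.9% residual factor to get the
# capacity that would be exceeded in roughly 8.76 hours per year
factor_999 = np.quantile(residuals, 0.999)
required_capacity = projected_load * factor_999
print(f"Required capacity: {required_capacity:,.0f}")
```

The result is in the same units as the load series. Choosing a higher quantile trades extra capacity for fewer expected hours of shortfall.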

In [None]:
# Plots the histogram of the log of the time series after removing the trend
# The division Y / T_y removes the long-term trend from the series and leaves
# the multiplicative factors S_y * S_w * S_d * R_y * R_w * R_d
np.log(Y / T_y).hist(bins=1000, alpha=0.75)

# Now let's remove the seasonal components as well and see the difference
# Plots the histogram of the log of the residuals after removing trend and seasonality
np.log(Y / (T_y * S_y * S_w * S_d)).hist(bins=1000, alpha=0.75)