<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/05-Time_Series/C-ERCOT_using_Prophet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing electricity usage from the Electric Reliability Council of Texas (ERCOT)

ERCOT releases the electricity usage of its grid at http://www.ercot.com/gridinfo/load

The archives are at: http://www.ercot.com/gridinfo/load/load_hist

ERCOT also publishes its own load forecasts, which make a good baseline for any model.

In [None]:
#@title Setup

!pip install -U -q PyMySQL sqlalchemy
# prophet is not installed here; add it to the line above if it is missing from your environment

import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

from sqlalchemy import create_engine
from sqlalchemy import text

import seaborn as sns

from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly, plot_cross_validation_metric
from prophet.diagnostics import cross_validation, performance_metrics



In [None]:
#@title Plotting Setup

%config InlineBackend.figure_format = 'retina'

# Change the graph defaults
plt.rcParams['figure.figsize'] = (8, 3)  # Default figure size of 8x3 inches
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.color'] = 'lightgray'
plt.rcParams['font.size'] = 10  # Default font size of 10 points
plt.rcParams['lines.linewidth'] = 1  # Default line width of 1 point
plt.rcParams['lines.markersize'] = 2  # Default marker size of 2 points
plt.rcParams['legend.fontsize'] = 10  # Default legend font size of 10 points

## Load and plot the data


In [None]:
# Build the connection string; the charset is set directly in the URL
conn_string = 'mysql+pymysql://{user}:{password}@{host}/{db}?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'student',
    password = 'dwdstudent2015',
    db = 'ercot')

engine = create_engine(conn_string)

# This query loads the dataset from the DB into the dataframe
with engine.connect() as con:
  sql = "SELECT * FROM ercot.electricity WHERE DATE_TIME < '2023-06-01'"
  df = pd.read_sql(text(sql), con=con)
  df = df.set_index('DATE_TIME')

In [None]:
df

In [None]:
df.plot(
    logy=True,
    title='ERCOT Consumption Data',
    ylabel="Consumption"
)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left') # move the legend out of the chart

In [None]:
df.plot(
    y = 'ERCOT',
    title='ERCOT Consumption Data',
    ylabel="Consumption"
)

## Analyzing the time series using Prophet

For full and current documentation [please check the webpage of the project](https://facebook.github.io/prophet/).

In [None]:
# We can remove the resampling part and use hourly data
# but it takes ~10 mins on Colab to process the time series
# with hourly data. The tradeoff of working with daily data
# is that we do not extract the seasonality component within
# the day.

edf = (
    df
    .resample('1D').sum() # we will work with daily data
    .reset_index() # make the datetime index a regular column
    .filter( items = ['DATE_TIME', 'ERCOT']) # keep only datetime and ERCOT
    .rename( # prophet requires specific names for time ("ds") and for the time series ("y")
        {
          'DATE_TIME': 'ds',
          'ERCOT': 'y'
        },
        axis="columns" )
)


# This dataframe is ready for Prophet
edf


In [None]:
# Plot the daily usage:
edf.plot(y='y', x='ds')

In [None]:
m = Prophet(seasonality_mode='multiplicative')

# Also include US holidays, so the model can estimate their effect
m.add_country_holidays(country_name='US')

# Take as input the time series and extract the components
m.fit(edf)

In [None]:
# Setup for hourly forecasts, one year in the future
# future = m.make_future_dataframe(periods=365 * 24, freq='H')

# Setup for daily forecasts, one year in the future
future = m.make_future_dataframe(periods=365)
future.tail()

In [None]:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()


In [None]:
fig1 = m.plot(forecast)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
plot_plotly(m, forecast)

In [None]:
plot_components_plotly(m, forecast)

## Performance Evaluation

Prophet includes functionality for time series cross validation to measure forecast error using historical data. This is done by selecting cutoff points in the history, and for each of them fitting the model using data only up to that cutoff point. We can then compare the forecasted values to the actual values.
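
To make the idea of moving cutoffs concrete, here is a minimal sketch of how a set of cutoff dates could be derived from an initial training period, a horizon, and a spacing between cutoffs. The function `rolling_cutoffs` and the five-year date range are hypothetical and used only for illustration; this is a simplification of the idea, not Prophet's own implementation.

```python
import pandas as pd

def rolling_cutoffs(start, end, initial, horizon, period):
    """Return cutoff dates, working backwards from the end of the history."""
    initial, horizon, period = map(pd.Timedelta, (initial, horizon, period))
    cutoffs = []
    cutoff = end - horizon  # the last cutoff leaves one full horizon of actual values
    while cutoff >= start + initial:  # keep at least `initial` worth of training data
        cutoffs.append(cutoff)
        cutoff -= period  # step back by the spacing between cutoffs
    return sorted(cutoffs)

# Hypothetical five-year daily history
cutoffs = rolling_cutoffs(
    start=pd.Timestamp('2018-01-01'),
    end=pd.Timestamp('2022-12-31'),
    initial='730 days',
    horizon='365 days',
    period='180 days',
)
for c in cutoffs:
    print(c.date())
```

For each cutoff, a model is fitted on the data up to that date and evaluated on the `horizon` that follows it, so a longer history or a shorter `period` yields more backtests.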

### Training, validation, forecast

* **Training**: Data in the *initial* or *estimation* or *training* period are used to help select the model and to estimate its parameters. Forecasts made in this period are not really "honest" because data on both sides of each observation are used to help determine the forecast.

* **Validation**: Data in the *validation* or *horizon* period are not given to the algorithm while creating the model and are *held out* during model training. Instead, we use the data in this period to evaluate the quality of the forecasting. Often the results of the evaluation are called ***backtests***.

* **Forecast**: This is the time period for which we make our actual forecasts. Note that even if we do have data for the forecasting period, they should not be used to guide the choice of algorithm or its settings.
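
The three periods above can be sketched with a single cutoff on a synthetic series. Everything in this example is an assumption made for illustration: the data are randomly generated, the cutoff and horizon are arbitrary, and the "model" is just the training mean standing in for a real forecaster.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical data, for illustration only)
n = 1000
ds = pd.date_range('2020-01-01', periods=n, freq='D')
rng = np.random.default_rng(0)
y = 100 + 10 * np.sin(2 * np.pi * np.arange(n) / 365.25) + rng.normal(0, 1, n)
series = pd.DataFrame({'ds': ds, 'y': y})

cutoff = pd.Timestamp('2022-01-01')
horizon = pd.Timedelta('180 days')

# Training period: everything up to and including the cutoff
train = series[series['ds'] <= cutoff]

# Validation period: the horizon right after the cutoff, held out from fitting
validation = series[(series['ds'] > cutoff) & (series['ds'] <= cutoff + horizon)]

# Stand-in "model": predict the training mean for every validation day
yhat = train['y'].mean()
backtest_mae = (validation['y'] - yhat).abs().mean()

print(len(train), len(validation), round(backtest_mae, 2))
```

The error measured on `validation` is an honest backtest because none of those rows were available when `yhat` was computed.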

<img src="https://facebook.github.io/prophet/static/diagnostics_files/diagnostics_4_0.png">

In [None]:
# Method 1: Cross-validation with moving cutoffs
df_cv = cross_validation(m,
      initial='730 days', # We will take the first two years of the data
      horizon = '365 days', # and we will make predictions for one year
      period='180 days', # then we will move the cutoff 180 days forward and continue
                         # until we reach the end of the time series
                         # (for our series with 21 years of data, minus two
                         # years for starting the training, this results
                         # in 38 different cutoff dates used for evaluation)
      parallel="processes" # speedup using parallelism
      )


In [None]:
df_cv # This shows our predictions (yhat) against the actual values (y)
      # and also shows the lower and upper estimates (yhat_lower, yhat_upper).

In [None]:
# Method 2: We explicitly specify the cutoffs

cutoffs = pd.to_datetime(['2019-01-01', '2020-01-01', '2021-01-01'])

# Use the three cutoffs above and make predictions for 365 days after the cutoff
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
df_cv2

### Evaluation metrics

There are many metrics that can be used to evaluate the quality of the forecasts:

* **MSE (Mean Squared Error)**: Measures the average squared difference between the predicted and actual values in a time series.
* **RMSE (Root Mean Squared Error)**: The square root of the MSE, providing a measure of the average magnitude of the prediction errors.
* **MAE (Mean Absolute Error)**: Calculates the average absolute difference between the predicted and actual values, ignoring the direction of errors.
* **MAPE (Mean Absolute Percentage Error)**: Computes the average percentage difference between the predicted and actual values, relative to the actual values.
* **MDAPE (Median Absolute Percentage Error)**: Similar to MAPE, but uses the median instead of the mean, making it more robust to outliers.
* **SMAPE (Symmetric Mean Absolute Percentage Error)**: A variant of MAPE that divides the absolute error by the average magnitude of the actual and predicted values, which keeps it bounded and makes it less sensitive to actual values near zero.
* **COVERAGE**: Represents the proportion of observed values that fall within a certain prediction interval or confidence interval, indicating the reliability of the forecasts.
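
As a rough illustration of the definitions above, the following sketch computes each metric with NumPy on a tiny, made-up set of actual values, predictions, and prediction intervals. The numbers are hypothetical, and the SMAPE formula shown is one common definition (other variants exist).

```python
import numpy as np

# Hypothetical actual values, predictions, and interval bounds
y = np.array([100.0, 200.0, 300.0, 400.0])
yhat = np.array([110.0, 190.0, 330.0, 400.0])
yhat_lower = yhat - 20
yhat_upper = yhat + 20

err = y - yhat                 # signed error of each prediction
ape = np.abs(err) / np.abs(y)  # absolute percentage error of each prediction

metrics = {
    'mse': np.mean(err ** 2),
    'rmse': np.sqrt(np.mean(err ** 2)),
    'mae': np.mean(np.abs(err)),
    'mape': np.mean(ape),
    'mdape': np.median(ape),
    # One common SMAPE definition: error relative to the average magnitude
    'smape': np.mean(2 * np.abs(err) / (np.abs(y) + np.abs(yhat))),
    # Share of actual values that fall inside the prediction interval
    'coverage': np.mean((y >= yhat_lower) & (y <= yhat_upper)),
}

for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```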


The popularity and usage of specific metrics for time series evaluation can vary depending on the context and the specific requirements of the problem at hand. However, some commonly used metrics are:

* MSE (Mean Squared Error): It is a widely used metric for measuring the overall accuracy of predictions and is often used in regression tasks.
* RMSE (Root Mean Squared Error): RMSE is frequently used as it provides an easily interpretable measure of the average prediction error in the same units as the target variable.
* MAE (Mean Absolute Error): MAE is popular due to its simplicity and ease of interpretation, as it gives an average of the absolute differences between predicted and actual values.
* MAPE (Mean Absolute Percentage Error): MAPE is commonly used when it is important to understand the percentage error in predictions relative to the actual values.
* SMAPE (Symmetric Mean Absolute Percentage Error): SMAPE is utilized when it is necessary to account for both overestimations and underestimations in the predictions.

### Manual calculation of performance metrics

In [None]:
df_cv

In [None]:
# Absolute percentage error of each prediction: abs(actual - prediction) / abs(actual)
# (MAPE is the mean of these values; we aggregate them in the cells below)
df_cv['mape'] = np.abs(df_cv['y'] - df_cv['yhat']) / np.abs(df_cv['y'])

In [None]:
df_cv['horizon'] = df_cv['ds'] - df_cv['cutoff']

In [None]:
df_cv.pivot_table(
    index = 'horizon',
    values = 'mape',
    aggfunc = 'mean'
).plot(
    figsize = (12,2)
)

In [None]:
# Now, let's visualize the MAPE metric as a function of the horizon length
# using a violin plot, to understand not only the mean but also the distribution


# Create categories/bins for grouping the x-axis values
# Grouping into 7 bins
bins = [0, 50, 100, 150, 200, 250, 300, 365]
labels = ['0-50', '50-100', '100-150', '150-200', '200-250', '250-300', '300-365']
# Convert horizon to a number of days (instead of an "x days" timedelta)
df_cv['horizon_length'] = df_cv['horizon'].dt.total_seconds() / (24 * 60 * 60)
# and group the numbers into predefined ranges
df_cv['horizon_length'] = pd.cut(df_cv['horizon_length'], bins=bins, labels=labels)

plt.figure(figsize=(12, 3))
ax = sns.violinplot(
    data = df_cv,
    x = 'horizon_length',
    y = 'mape',
)

### Automatic calculation of performance metrics

Now let's see how we can calculate these metrics using the Prophet package.

The `performance_metrics` utility can be used to compute some useful statistics of the prediction performance (`yhat`, `yhat_lower`, and `yhat_upper` compared to `y`), as a function of the distance from the cutoff (how far into the future the prediction was). The statistics computed are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percent error (MAPE), median absolute percent error (MDAPE) and coverage of the `yhat_lower` and `yhat_upper` estimates. These are computed on a rolling window of the predictions in `df_cv` after sorting by horizon (`ds` minus `cutoff`). By default 10% of the predictions will be included in each window, but this can be changed with the `rolling_window` argument.
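
The rolling-window idea can be approximated in a few lines of pandas. This is a simplified sketch on synthetic data, not Prophet's exact algorithm (which, for instance, has its own handling of predictions that share the same horizon). The dataframe `cv`, its values, and the variable `w` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical cross-validation output: 3 cutoffs, each with a 100-day horizon
rng = np.random.default_rng(1)
horizon_days = np.tile(np.arange(1, 101), 3)
y = rng.uniform(900, 1100, size=horizon_days.size)
yhat = y * (1 + rng.normal(0, 0.001 * horizon_days))  # error grows with the horizon

cv = pd.DataFrame({
    'horizon': pd.to_timedelta(horizon_days, unit='D'),
    'y': y,
    'yhat': yhat,
})
cv['ape'] = (cv['y'] - cv['yhat']).abs() / cv['y'].abs()

rolling_window = 0.1                       # fraction of predictions per window
w = max(1, int(rolling_window * len(cv)))  # number of predictions per window

# Sort by horizon, then average the error over a sliding window of w predictions
smoothed = (
    cv.sort_values('horizon')
      .set_index('horizon')['ape']
      .rolling(w).mean()
      .dropna()
)
print(smoothed.head())
```

A larger `rolling_window` averages over more predictions and gives a smoother curve, at the cost of losing the shortest horizons, since the first full window ends later.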

In [None]:
df_p = performance_metrics(df_cv)

In [None]:
df_p

In [None]:
df_p.plot(
    x = 'horizon',
    y = 'mape',
    figsize = (16,2)
)

In [None]:
# A Prophet-provided visualization that shows the average value of the metric
# (blue line) against the various measurements in different points in time

fig = plot_cross_validation_metric(df_cv, metric='mape')
# Get the Axes object from the Figure
ax = fig.get_axes()[0]
ax.set_ylim(0, 0.2)