In [None]:
from numpy.testing import assert_equal, assert_almost_equal

# HW1: Forecasting Electricity Demand

In this homework, you will be forecasting the daily electricity demand for the entire Luzon grid.

Specifically, your goal is to build a <u>7-day ahead forecaster</u> using an ARIMA model.

<div class="alert alert-info">

**Important Note**
    
Make sure that you are running `statsmodels 0.12.2` when answering this homework.
</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import itertools

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Set figure size
plt.rcParams["figure.figsize"] = (15,5)

df = pd.read_csv('elecdaily_luzon.csv', index_col=0)
df.index = pd.to_datetime(df.index)

# My personal preference is to use the pandas.Series, but you can use a pandas.DataFrame as well.
ts = df['GW']
ts

## Q1.

Plot the series together with its ACF plot. 

Comment on the seasonality of the time series and its other interesting characteristics.
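
As a reminder of what an ACF plot displays, the sketch below computes the sample autocorrelation by hand for a toy series (not the homework data). The names `sample_acf` and `toy` are hypothetical and only serve this illustration. It uses the standard biased estimator, which should match what `plot_acf` draws by default.

```python
def sample_acf(x, lag):
    """Sample autocorrelation of x at the given lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    # Sum of products of deviations that are `lag` steps apart
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return num / denom

# Hypothetical toy series: a pattern of period 4 repeated 10 times
toy = [1, 5, 2, 8] * 10

acf_values = {lag: sample_acf(toy, lag) for lag in range(1, 9)}
for lag, value in acf_values.items():
    print(lag, round(value, 3))
```

In this toy case the autocorrelation peaks at lags 4 and 8, the multiples of the period. A similar pattern of regularly spaced peaks in an ACF plot is a common sign of seasonality.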

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Q2.

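Verify that the series is NOT stationary using an augmented Dickey-Fuller (ADF) test. The null hypothesis of the ADF test is that the series has a unit root, so a non-stationary series should fail to reject it.

Store the test statistic in the `adf_stat` variable.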

In [None]:
adf_stat = None

# YOUR CODE HERE
raise NotImplementedError()

adf_stat

In [None]:
# Sanity check
assert adf_stat is not None, 'Put the ADF statistic in this variable!'

# Hidden tests

## Q3. 

Apply the appropriate seasonal differencing to make the series stationary. Verify the result using an ADF test on the differenced series.

Store the seasonal differencing lag in `m` and the ADF statistic in `adf_stat`.
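
For intuition, seasonal differencing with lag $m$ replaces each value $y_t$ with $y_t - y_{t-m}$. The sketch below applies it to a toy series with a period of 4 and a linear trend (not the homework data). The helper `seasonal_difference` and the toy values are hypothetical.

```python
def seasonal_difference(x, lag):
    """Return x[t] - x[t - lag] for t = lag, ..., len(x) - 1."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Hypothetical toy series: period-4 pattern plus a linear trend of 0.5 per step
pattern = [1.0, 5.0, 2.0, 8.0]
toy = [pattern[t % 4] + 0.5 * t for t in range(20)]

diffed = seasonal_difference(toy, lag=4)
print(diffed)
```

Differencing at the seasonal lag removes the repeating pattern, and here it also reduces the linear trend to a constant. On a `pandas.Series`, `Series.diff(periods=lag)` performs the same subtraction and leaves `NaN` in the first `lag` positions.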

In [None]:
m = None
adf_stat = None

# YOUR CODE HERE
raise NotImplementedError()
adf_stat

In [None]:
# Sanity check
assert m is not None, 'Put the differencing parameter in this variable!'

# Hidden tests

In [None]:
# Sanity check
assert adf_stat is not None, 'Put the ADF statistic in this variable!'

# Hidden tests

## Interlude

Recall that your goal is to build a <u>7-day ahead forecaster</u>.

First, we'll hold out the last 84 observations (approx. 3 months) to use as a test set.

In [None]:
h = 7
test_size = 84

ts_train = ts[:-test_size]
ts_test = ts[-test_size:]

## Q4.

Create a grid for the $(p,d,q)$ parameters using the following values:

- $p=0,1,2$
- $d=0,1$
- $q=0,1,2$

Fill in the `pdq_grid` variable with a list of tuples.
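
One common way to enumerate every combination of parameter values is `itertools.product`. The sketch below uses a hypothetical two-parameter toy grid (`a_values`, `b_values`) rather than the homework's three parameters.

```python
import itertools

# Hypothetical toy grid with two parameters (not the homework's three)
a_values = range(2)  # 0, 1
b_values = range(3)  # 0, 1, 2

# product() yields one tuple per combination, varying the last value fastest
toy_grid = list(itertools.product(a_values, b_values))
print(toy_grid)
```

The result is a plain list of tuples, which is the structure the sanity checks expect for `pdq_grid`.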

In [None]:
pdq_grid = None

# YOUR CODE HERE
raise NotImplementedError()

pdq_grid

In [None]:
# Sanity check
assert pdq_grid is not None, 'Put the (p,d,q) grid in this variable!'
assert isinstance(pdq_grid, list), 'pdq_grid should be a list of tuples!'
assert all(isinstance(_, tuple) for _ in pdq_grid), 'pdq_grid should be a list of tuples!'

# Hidden tests

## Q5.

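Using the training set, run a grid search with a 4-fold time series split (validation size of 28) to select the best $(p,d,q)$ by minimizing the average RMSE:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Here, $y_i$ is the observed value, $\hat{y}_i$ is the forecast, and $n$ is the number of observations in the fold.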
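
The RMSE formula above can be checked by hand on a few numbers. The sketch below is a plain-Python illustration. The `rmse` helper and the toy values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy values: squared errors are 1, 0, 4, 0, so the MSE is 1.25
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]
print(rmse(y_true, y_pred))
```

The `mean_squared_error` function already imported in this notebook returns the MSE by default, so taking its square root gives the same quantity.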

For each $(p,d,q)$, place its average RMSE score in `df_results`. Make sure to follow the pre-set structure of `df_results`.

In addition, fill in the `tskfold` variable.

Finally, use the following settings when fitting the ARIMA model:

1. `ARIMA(..., enforce_stationarity=False, enforce_invertibility=False)`
2. `.fit(method_kwargs={'maxiter': 200})`

The first setting suppresses warnings about parameter combinations that result in non-stationary or non-invertible models. For the technical details, see the FPP3 sections on [autoregressive models](https://otexts.com/fpp3/AR.html) and [moving average models](https://otexts.com/fpp3/MA.html).

The second setting addresses the MLE convergence warnings by allowing up to 200 optimizer iterations.
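
To picture what a 4-fold time series split with a validation size of 28 looks like, the sketch below builds the folds by hand. It is a hand-rolled illustration that is meant to mirror the behavior expected from `TimeSeriesSplit(n_splits=4, test_size=28)`, not the library's implementation. The function name `expanding_window_splits` and the series length of 200 are assumptions.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with a growing training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits.")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training data is everything before the validation block
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Hypothetical series length; the homework's training set has its own length
splits = list(expanding_window_splits(n_samples=200, n_splits=4, test_size=28))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

Each fold trains only on observations that precede its validation block, so no future information leaks into the fit. The per-fold RMSE values are then averaged to score one $(p,d,q)$ candidate.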

In [None]:
df_results = pd.DataFrame({'(p,d,q)': pdq_grid, 'Avg. RMSE': np.zeros(len(pdq_grid))})
df_results

In [None]:
tskfold = None

# YOUR CODE HERE
raise NotImplementedError()

df_results

In [None]:
# Sanity check
assert isinstance(tskfold, TimeSeriesSplit), 'tskfold should be a TimeSeriesSplit object!'
assert df_results['(p,d,q)'].tolist() == pdq_grid, 'df_results appears to be out of order. Do not sort it.'
assert all(isinstance(_, float) for _ in df_results['Avg. RMSE']), 'df_results should contain floats!'

# Hidden tests (Checks Avg. RMSE up to 3 decimal places)

## Q6.

Using the best $(p,d,q)$, evaluate its performance on the test set using cross-validation. 

This time, use a 12-fold time series split and calculate the average RMSE.

*Note: 12 folds × 7 steps = 84 observations, which is the test size!*

Fill in the `p`, `d`, `q`, and `test_error` variables.

In [None]:
p = None
d = None
q = None

test_error = None

# YOUR CODE HERE
raise NotImplementedError()

print('Test Avg. RMSE =', test_error)

In [None]:
# Sanity check
assert p is not None, 'Put the best p in this variable!'
assert d is not None, 'Put the best d in this variable!'
assert q is not None, 'Put the best q in this variable!'
assert test_error is not None, 'Put the average RMSE in this variable!'

# Hidden tests (Checks Avg. RMSE up to 3 decimal places)

## Q7.

Evaluate the performance of the naive and seasonal naive baselines on the test set, following the same 12-fold strategy as in Q6.

Fill in the `test_error_naive` and `test_error_snaive` variables.
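
As a refresher, the naive method repeats the last observed value for every forecast step, while the seasonal naive method repeats the values from the last observed season. The sketch below shows both on a toy series with a period of 4 (not the homework data). The helper names, the toy values, and the choices `h=6` and `m=4` are hypothetical.

```python
def naive_forecast(history, h):
    """Forecast h steps ahead by repeating the last observed value."""
    return [history[-1]] * h

def seasonal_naive_forecast(history, h, m):
    """Forecast h steps ahead by cycling through the last m observed values."""
    last_season = history[-m:]
    return [last_season[i % m] for i in range(h)]

# Hypothetical toy history: a period-4 pattern observed three times
history = [1, 5, 2, 8] * 3

naive = naive_forecast(history, h=6)
snaive = seasonal_naive_forecast(history, h=6, m=4)
print(naive)
print(snaive)
```

When the horizon `h` exceeds the season length `m`, the seasonal naive forecast wraps around and reuses the same season.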

In [None]:
test_error_naive = None
test_error_snaive = None

# YOUR CODE HERE
raise NotImplementedError()

print('  Naive Avg. RMSE =', test_error_naive)
print('S.Naive Avg. RMSE =', test_error_snaive)

In [None]:
# Sanity check
assert test_error_naive is not None, 'Put the average RMSE for the Naive method in this variable!'

# Hidden tests (Checks Avg. RMSE up to 3 decimal places)

In [None]:
# Sanity check
assert test_error_snaive is not None, 'Put the average RMSE for the S.Naive method in this variable!'

# Hidden tests (Checks Avg. RMSE up to 3 decimal places)