### Forecasting with Time Series

**OBJECTIVES**

- Build Holt-Winters models on time series data
- Test for stationarity using the Augmented Dickey-Fuller (ADF) and KPSS tests
- Build SARIMA models on time series data


In [97]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import requests

### Examining Retail Sales Data

Using the Alpha Vantage API, extract the retail sales data and structure it as a `DataFrame` with a `datetime` index and a `value` column containing numeric sales data.

- https://www.alphavantage.co/documentation/#retail-sales

### Problem 1: Getting and formatting data

To begin, extract retail sales data for the years 2015 to the present using the Alpha Vantage retail sales endpoint. Create a datetime index and make sure the values have a numeric datatype. Plot the resulting data.
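One possible way to structure the response is sketched here. It assumes the endpoint returns JSON with a `data` list of `{"date", "value"}` records and is queried with `function=RETAIL_SALES`, as the linked documentation describes; the helper names `retail_sales_frame` and `fetch_retail_sales` are hypothetical, and the sample payload is hand-made for illustration rather than real sales data.

```python
import pandas as pd

def retail_sales_frame(payload):
    """Turn an Alpha Vantage-style payload into a DataFrame with a datetime index."""
    df = pd.DataFrame(payload["data"])  # assumed record keys: "date" and "value"
    df["date"] = pd.to_datetime(df["date"])
    # Values arrive as strings; anything non-numeric becomes NaN
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    return df.set_index("date").sort_index()

def fetch_retail_sales(api_key):
    """Request the series; needs a network connection and a valid API key."""
    import requests  # imported here so the parsing helper also works offline
    resp = requests.get(
        "https://www.alphavantage.co/query",
        params={"function": "RETAIL_SALES", "apikey": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return retail_sales_frame(resp.json())

# Hand-made payload in the assumed shape (made-up numbers, not real data)
sample = {"data": [{"date": "2015-02-01", "value": "340000"},
                   {"date": "2015-01-01", "value": "350000"},
                   {"date": "2014-12-01", "value": "."}]}
sales = retail_sales_frame(sample)
sales_2015_on = sales.loc["2015":]  # keep 2015 to present
```

With a real key, `fetch_retail_sales(api_key).loc["2015":].plot()` would produce the plot the problem asks for.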

### Problem 2: Train and Test split

Split the data into train and test sets at February of 2020, so that the training set ends just before that month and the test set begins with it.
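A minimal sketch of a date-based split, using a stand-in series in place of the retail sales data. Placing February 2020 in the test set is an assumption; shift the slice labels if you prefer it in the training set.

```python
import numpy as np
import pandas as pd

# Stand-in monthly series; replace with the retail sales DataFrame from Problem 1
idx = pd.date_range("2015-01-01", "2021-12-01", freq="MS")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Label slices on a DatetimeIndex include both endpoints, so these two
# slices do not overlap: train ends with January 2020, test starts with February 2020.
train = df.loc[:"2020-01"]
test = df.loc["2020-02":]

print(len(train), len(test))
```

Unlike a random split, this keeps the observations in time order, which is what a forecasting model needs.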

### Problem 3: Holt Winters Model

Below, fit a Holt-Winters model on your training data using all the default settings, then use it to forecast over the test period. Discuss the quality of these predictions in terms of both the **ROOT MEAN SQUARED ERROR** on the test set and the **AIC** score of the fitted model.

In [173]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

### Problem 4: Grid Searching Parameters

Now, try fitting different versions of the Holt-Winters model, using different values for the `trend`, `seasonal` and `seasonal_periods` arguments. Assemble your results in a DataFrame with columns for the **RMSE** and **AIC** of each model, as well as its parameters. The code below is meant to get you started.

In [186]:
from sklearn.metrics import mean_squared_error

In [194]:
import warnings
warnings.filterwarnings('ignore')

In [195]:
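aics = []
mses = []  # holds RMSE values despite the name; becomes the 'rmse' column below
params = []
for trend in ['add', 'mul', None]:
    for season in ['add', 'mul', None]:
        for p in [12, 15, 6]:
            # create and fit the model (assumes `train` and `test` from Problem 2)
            hw = ExponentialSmoothing(train, trend=trend, seasonal=season,
                                      seasonal_periods=p).fit()
            # get aic
            aics.append(hw.aic)
            # get rmse of the forecast over the test period
            preds = hw.forecast(len(test))
            mses.append(np.sqrt(mean_squared_error(test, preds)))
            # record the parameter combination for this fit
            params.append([trend, season, p])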

In [197]:
result_df = pd.DataFrame({'aic': aics, 'rmse': mses, 'parameters': params})

In [199]:
result_df.sort_values('aic').head(2)

| | aic | rmse | parameters |
|---|---|---|---|
| 18 | 1336.437862 | 31927.299236 | [None, add, 12] |
| 9 | 1340.411896 | 37165.58645 | [mul, add, 12] |


### Problem 4: Optimal Model

Based on your grid search, refit the model using the parameter combination that gave the lowest **AIC**. Make predictions and plot these against the test data.

### Problem 5: Why are predictions what they are?

Can you explain the underperformance of the retail sales predictions?  Is this because of the model we are using or something else?

### Problem 6: Stationarity

As discussed in class, `SARIMA` models offer a different approach: they work like our regression models, with added seasonal components. Before building the model, test the data for stationarity. Below, use the `adfuller` and `kpss` tests to determine whether the time series is stationary. Is the data stationary? Why or why not?

In [227]:
from statsmodels.tsa.stattools import adfuller, kpss

### Problem 7: SARIMA Models

To build and identify "good" parameters for a SARIMA model, you are to use `pmdarima` and the `auto_arima` function given below.  Fit the data and determine what were identified as optimal parameters.  How does your model compare to Holt-Winters in terms of `aic` and `RMSE`?

In [234]:
import pmdarima as pm

In [243]:
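model = pm.auto_arima(train,
                      start_p=1,
                      max_p=3,
                      max_d=3,    # d is chosen by a unit-root test, up to max_d
                      start_q=0,
                      max_q=3,
                      start_P=1,
                      start_Q=1,
                      max_P=2,
                      max_D=1,    # D is chosen by a seasonal test, up to max_D
                      max_Q=2,
                      m=12)       # 12 observations per seasonal cycle (monthly data)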

### BONUS: Extending the Model

Maybe you believe that COVID played a part in the retail sales data. There seems to be a different trend in sales before and after COVID. Build two separate datasets for the pre- and post-COVID periods, conduct a train/test split on each, and see whether these models perform better than the model that uses the pre- and post-COVID data together. What do you think this means for retail sales moving forward?