# Semi-automated ARIMA model selection 

In this notebook, we demonstrate model selection over the `order` parameter of the `Arima` class of recursive forecast models. The order is the triple (p, d, q) specifying the autoregressive order, the degree of differencing, and the moving-average order. Fully automatic order selection (as in `auto.arima` in R) is not currently supported, but the `BestOfForecaster` estimator can select the best order from a user-supplied set of candidates.
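
To make the role of the differencing order d concrete, here is a minimal, library-free sketch. The helper `difference` and the sample values are hypothetical and are not part of the toolkit; they only illustrate that each unit of d replaces the series with its consecutive changes before the autoregressive and moving-average terms are fit.

```python
def difference(series, d=1):
    """Apply first-order differencing d times to a list of numbers."""
    values = list(series)
    for _ in range(d):
        # Each pass replaces the series with consecutive changes,
        # shortening it by one observation.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Illustrative values with a quadratic trend (hypothetical, not toolkit data)
y = [10.0, 12.0, 15.0, 19.0, 24.0]
print(difference(y, d=1))  # [2.0, 3.0, 4.0, 5.0]
print(difference(y, d=2))  # [1.0, 1.0, 1.0]
```

In this toy series one round of differencing leaves a linear trend and a second round leaves a constant, which is why a larger d is typically paired with stronger trends.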

In [13]:
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error

from ftk import TimeSeriesDataFrame, ForecastDataFrame
from ftk.models import Arima, BestOfForecaster
from ftk.model_selection import RollingOriginValidator

from ftk.data import load_dow_jones_dataset
print('imports done.')

imports done.


Here, we load some sample data from the Dow Jones revenue data set and keep only the AAPL series.

In [14]:
train_df, test_df = load_dow_jones_dataset()
train_df = train_df[train_df.grain_index.isin(['AAPL'])]
test_df = test_df[test_df.grain_index.isin(['AAPL'])]
train_df.tail()

quarter_start,company_ticker,revenue
2013-04-01,AAPL,35323.0
2013-07-01,AAPL,37472.0
2013-10-01,AAPL,57594.0
2014-01-01,AAPL,45646.0
2014-04-01,AAPL,37432.0


We'll start by fitting a single Arima model with default settings. We specify only the time series frequency, which is quarterly (`'QS'`) with the time points anchored to the start of each quarter.

In [15]:
arima_model = Arima(freq='QS')

In [17]:
arima_model

Arima(freq=<QuarterBegin: startingMonth=1>, order=[1, 0, 0],
   origin_time_colname='origin',
   pred_dist_colname='DistributionForecastArima',
   pred_point_colname='PointForecastArima', seasonality=None)

In [18]:
arima_model.fit(train_df)
validate_df = arima_model.predict(test_df)
validate_df

quarter_start,company_ticker,origin,revenue,DistributionForecastArima,PointForecastArima
2014-07-01,AAPL,2014-04-01,42123.0,<scipy.stats._distn_infrastructure.rv_frozen o...,36222.98
2014-10-01,AAPL,2014-04-01,74599.0,<scipy.stats._distn_infrastructure.rv_frozen o...,35082.25
2015-01-01,AAPL,2014-04-01,58010.0,<scipy.stats._distn_infrastructure.rv_frozen o...,34005.95
2015-04-01,AAPL,2014-04-01,49605.0,<scipy.stats._distn_infrastructure.rv_frozen o...,32990.43
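
With the default order [1, 0, 0] the model is a first-order autoregression. Assuming a stationary AR(1) with a mean term, multi-step point forecasts made from a single origin move geometrically toward the series mean, which is consistent with the steadily shrinking `PointForecastArima` values above. The sketch below illustrates that recursion; the function `ar1_forecast` and all parameter values are hypothetical and are not the coefficients fitted by the toolkit.

```python
def ar1_forecast(last_value, mean, phi, horizon):
    """Iterate the AR(1) point-forecast recursion for `horizon` steps."""
    forecasts = []
    current = last_value
    for _ in range(horizon):
        # Each step shrinks the distance to the mean by the factor phi
        current = mean + phi * (current - mean)
        forecasts.append(current)
    return forecasts


# Hypothetical parameters, NOT the values fitted by the toolkit
preds = ar1_forecast(last_value=100.0, mean=20.0, phi=0.5, horizon=3)
print(preds)  # [60.0, 40.0, 30.0]
```

Because such forecasts cannot track a trend or a seasonal peak, orders that include differencing are natural candidates to compare against this baseline.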


Next, we build a list of Arima forecasters, one for each candidate order. The `BestOfForecaster.fit` method uses rolling origin cross-validation to evaluate the candidates. Calling its `predict` method then selects the model with the best cross-validation performance and generates predictions from it. Here, models are scored by mean absolute error; a different evaluation metric can be supplied through the `metric_fun` fit parameter.
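
The idea behind rolling origin evaluation and best-of selection can be sketched without the toolkit. Everything below is an illustrative assumption: the helpers `rolling_origin_splits`, `mae`, and `select_best`, the expanding-window fold layout, and the two toy forecasters standing in for Arima candidates. The toolkit's `RollingOriginValidator` may construct its folds differently.

```python
def rolling_origin_splits(n_obs, n_splits, horizon=1):
    """Yield (train_indices, test_indices) pairs with an expanding training window."""
    for k in range(n_splits, 0, -1):
        origin = n_obs - k * horizon  # first index held out in this fold
        yield list(range(origin)), list(range(origin, origin + horizon))


def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_best(series, forecasters, n_splits=3, horizon=1):
    """Return the name with the lowest average out-of-sample MAE, plus all scores."""
    scores = {}
    for name, forecast_fn in forecasters:
        fold_errors = []
        for train_idx, test_idx in rolling_origin_splits(len(series), n_splits, horizon):
            train = [series[i] for i in train_idx]
            actual = [series[i] for i in test_idx]
            # Every fold trains only on data that precedes its test window
            fold_errors.append(mae(actual, forecast_fn(train, horizon)))
        scores[name] = sum(fold_errors) / len(fold_errors)
    return min(scores, key=scores.get), scores


# Toy forecasters (hypothetical stand-ins for the Arima candidates)
def naive(train, horizon):
    return [train[-1]] * horizon


def drift(train, horizon):
    step = train[-1] - train[-2]
    return [train[-1] + step * (h + 1) for h in range(horizon)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
best_name, scores = select_best(series, [("naive", naive), ("drift", drift)])
print(best_name)  # drift
```

The key property is that no fold ever scores a model on observations earlier than its training data, so the scores reflect genuine out-of-sample forecasting error.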

In [19]:
# Make a list of Arima forecasters to evaluate
forecaster_list = []
order_list = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [2, 1, 1], [2, 2, 0]]
for order in order_list:
    # Make name for each Arima model from its order setting
    order_str = ''.join([str(el) for el in order])
    mod_name = 'arima' + order_str
    forecaster_list.append((mod_name, Arima(freq='QS', order=order)))

# Use a rolling origin validator for making temporal cross-validation folds
validator = RollingOriginValidator(n_splits=5)

# Use BestOfForecaster to select the best Arima model (based on out-of-sample errors)
best_forecaster = BestOfForecaster(forecaster_list)
best_forecaster.fit(train_df, validator=validator, metric_fun=mean_absolute_error)
validate_df = best_forecaster.predict(test_df)

The `ModelName` column of the predictions shows that the (2,1,1) Arima model was selected, meaning it had the best cross-validation performance among the candidates.

In [20]:
validate_df

quarter_start,company_ticker,ForecastOriginTime,ModelName,PointForecast,DistributionForecast,revenue
2014-07-01,AAPL,2014-04-01,arima211,47416.08,<scipy.stats._distn_infrastructure.rv_frozen o...,42123.0
2014-10-01,AAPL,2014-04-01,arima211,51486.73,<scipy.stats._distn_infrastructure.rv_frozen o...,74599.0
2015-01-01,AAPL,2014-04-01,arima211,46640.86,<scipy.stats._distn_infrastructure.rv_frozen o...,58010.0
2015-04-01,AAPL,2014-04-01,arima211,46629.44,<scipy.stats._distn_infrastructure.rv_frozen o...,49605.0


In [21]:
validate_df.calc_error(err_name='MAPE')

17.28621890952115
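
As a sanity check, the reported figure can be reproduced by hand from the values printed in the prediction tables above. This sketch assumes `calc_error` uses the standard definition of MAPE (mean of absolute errors divided by actuals, in percent); the helper `mape` is hypothetical and not part of the toolkit. Because the displayed forecasts are rounded to two decimals, only approximate agreement should be expected.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Values copied from the prediction tables above
revenue = [42123.0, 74599.0, 58010.0, 49605.0]
arima211 = [47416.08, 51486.73, 46640.86, 46629.44]  # selected model
arima100 = [36222.98, 35082.25, 34005.95, 32990.43]  # default-order baseline

print(round(mape(revenue, arima211), 2))  # 17.29
print(round(mape(revenue, arima100), 2))
```

Under this definition the selected (2,1,1) model matches the reported value, while the default-order baseline comes out at roughly 35%, about twice the error.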