## Auto ARIMA Pros & Cons

#### Pros
- Saves time
- Removes ambiguity
- Reduces the risk of human error

#### Cons
- Blindly putting our faith in one criterion
- Never really see how well the other models perform
- Sidelines topic expertise that could inform the model choice

## Packages

In [1]:
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
# statsmodels.tsa.arima_model was removed in statsmodels 0.13; this is the current import path
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model
import yfinance
import warnings
warnings.filterwarnings('ignore')
sns.set()

In [2]:
raw_data = yfinance.download(
    tickers = '^GSPC ^FTSE ^N225 ^GDAXI',
    start = '1994-01-07', end = '2018-01-29',
    interval = '1d', group_by = 'ticker',
    auto_adjust = True, threads = True
)

[*********************100%***********************]  4 of 4 completed


In [3]:
df_comp = raw_data.copy()

In [4]:
df_comp['spx'] = df_comp['^GSPC'].Close[:]
df_comp['dax'] = df_comp['^GDAXI'].Close[:]
df_comp['ftse'] = df_comp['^FTSE'].Close[:]
df_comp['nikkei'] = df_comp['^N225'].Close[:]

In [5]:
df_comp = df_comp.iloc[1:]
del df_comp['^GSPC']
del df_comp['^GDAXI']
del df_comp['^FTSE']
del df_comp['^N225']
# Reindex to business-day frequency, then carry the last known price into any gaps
df_comp = df_comp.asfreq('B')
df_comp = df_comp.ffill()
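
To make the last two preprocessing steps concrete, here is a minimal sketch on a hypothetical four-day price series (the values and dates are made up for illustration, not taken from the downloaded data). It shows how setting a business-day frequency exposes a missing date and how forward filling closes the gap.

```python
import pandas as pd

# Hypothetical closing prices; Wednesday 2018-01-03 is missing on purpose.
prices = pd.Series(
    [100.0, 101.0, 103.0, 102.0],
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05"]),
    name="close",
)

# Reindex to every business day; dates with no observation become NaN.
regular = prices.asfreq("B")

# Forward fill copies the last known price into the gap.
filled = regular.ffill()

print(regular)
print(filled)
```

Forward filling assumes the price did not move on the missing day, which is a simplification that works reasonably for market holidays.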

## Creating Returns

In [6]:
df_comp['ret_spx'] = df_comp.spx.pct_change(1) * 100
df_comp['ret_ftse'] = df_comp.ftse.pct_change(1) * 100
df_comp['ret_dax'] = df_comp.dax.pct_change(1) * 100
df_comp['ret_nikkei'] = df_comp.nikkei.pct_change(1) * 100
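
As a quick check of what `pct_change(1) * 100` computes, the sketch below applies it to a hypothetical three-point price series and compares it with the explicit formula $(P_t / P_{t-1} - 1) \times 100$. The prices are illustrative only.

```python
import pandas as pd

# Hypothetical prices, chosen so the returns are easy to verify by hand.
prices = pd.Series([100.0, 102.0, 99.96])

# Percentage returns as computed in the notebook.
ret = prices.pct_change(1) * 100

# The same quantity written out explicitly.
manual = (prices / prices.shift(1) - 1) * 100

print(ret)
print(manual)
```

The first return is always `NaN` because there is no previous price to compare against, which is presumably why the return series is sliced with `[1:]` before fitting.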

## Splitting the Data

In [7]:
size = int(len(df_comp) * 0.8)
df, df_test = df_comp.iloc[:size], df_comp.iloc[size:]
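
The split above is chronological rather than random: the first 80% of rows form the training set and the last 20% the test set. A minimal sketch on a hypothetical ten-row frame (names and values are illustrative) shows the resulting sizes and confirms that no test date precedes a training date.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with ten business days of data.
idx = pd.bdate_range("2018-01-01", periods=10)
data = pd.DataFrame({"ret": np.arange(10, dtype=float)}, index=idx)

# Same rule as in the notebook: cut at 80% of the rows, keeping time order.
size = int(len(data) * 0.8)
train, test = data.iloc[:size], data.iloc[size:]

print(len(train), len(test))
print(train.index.max() < test.index.min())
```

Keeping the order intact matters for time series, since shuffling would let the model train on observations that come after the ones it is tested on.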

## Fitting a Model

In [8]:
from pmdarima.arima import auto_arima

In [9]:
model_auto = auto_arima(df.ret_ftse[1:])

In [10]:
model_auto

ARIMA(callback=None, disp=0, maxiter=None, method=None, order=(4, 0, 5),
      out_of_sample_size=0, scoring='mse', scoring_args={},
      seasonal_order=(0, 0, 0, 1), solver='lbfgs', start_params=None,
      with_intercept=True)

In [11]:
model_auto.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 5019 |
| Model: | SARIMAX(4, 0, 5) | Log Likelihood | -7882.658 |
| Date: | Wed, 17 Feb 2021 | AIC | 15787.316 |
| Time: | 19:41:53 | BIC | 15859.047 |
| Sample: | 0 - 5019 | HQIC | 15812.452 |
| Covariance Type: | opg |  |  |

|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 0.0309 | 0.025 | 1.246 | 0.213 | -0.018 | 0.080 |
| ar.L1 | 0.0135 | 0.082 | 0.165 | 0.869 | -0.147 | 0.174 |
| ar.L2 | -0.6690 | 0.077 | -8.645 | 0.000 | -0.821 | -0.517 |
| ar.L3 | -0.1616 | 0.072 | -2.260 | 0.024 | -0.302 | -0.021 |
| ar.L4 | 0.1898 | 0.074 | 2.553 | 0.011 | 0.044 | 0.335 |
| ma.L1 | -0.0384 | 0.081 | -0.471 | 0.637 | -0.198 | 0.121 |
| ma.L2 | 0.6205 | 0.078 | 7.933 | 0.000 | 0.467 | 0.774 |
| ma.L3 | 0.0592 | 0.069 | 0.858 | 0.391 | -0.076 | 0.194 |
| ma.L4 | -0.1836 | 0.073 | -2.510 | 0.012 | -0.327 | -0.040 |

|  |  |  |  |
|---|---|---|---|
| Ljung-Box (Q): | 67.77 | Jarque-Bera (JB): | 6360.08 |
| Prob(Q): | 0.0 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 2.0 | Skew: | -0.19 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 8.5 |


In [12]:
## A few quick comments
# The rules of model selection are rules of thumb rather than fixed laws
# By default, auto_arima ranks candidate models by a single criterion: the AIC
# We could easily have overfitted while going through the models in the previous exercises
# The default arguments of the function restrict the number of AR and MA components
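
Since the selection hinges on an information criterion, it helps to see how these numbers arise. The sketch below recomputes AIC, BIC and HQIC from the log likelihood and sample size reported in the summary above, using the standard definitions $\mathrm{AIC} = 2k - 2\ln L$, $\mathrm{BIC} = k \ln n - 2\ln L$ and $\mathrm{HQIC} = 2k \ln(\ln n) - 2\ln L$. The parameter count `k = 11` is an assumption: the intercept, four AR terms, five MA terms and the error variance.

```python
import math

log_likelihood = -7882.658  # "Log Likelihood" in the summary
n_obs = 5019                # "No. Observations" in the summary

# Assumption: intercept + 4 AR + 5 MA + error variance.
k = 1 + 4 + 5 + 1

# Each criterion penalizes the number of parameters differently.
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood
hqic = 2 * k * math.log(math.log(n_obs)) - 2 * log_likelihood

print(round(aic, 3), round(bic, 3), round(hqic, 3))
```

Under that assumption the three values agree with the AIC, BIC and HQIC shown in the summary. The BIC penalty per parameter, $\ln n \approx 8.5$, is much harsher than the AIC's 2, so the two criteria can prefer different orders on the same data.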

## Important Arguments

Now let us loosen some of these restrictions and see how to get more out of the `auto_arima` function.

In [None]:
# exogenous => outside factors (e.g. other time series); newer pmdarima releases call this argument X
# m => seasonal cycle length
# max_order => maximum total number of AR and MA terms (p + q + P + Q) when the search is not stepwise; None removes the cap
# max_p => maximum AR components
# max_q => maximum MA components
# max_d => maximum integrations
# maxiter => maximum number of iterations the optimizer gets to converge on the coefficients (convergence gets harder as the order increases)
# return_valid_fits => if True, return every valid fit as a list instead of only the best model
# alpha => level of significance for the statistical tests, default is 5%, which we should be using most of the time
# n_jobs => how many models to fit in parallel (-1 indicates "as many as possible"); only used when stepwise = False
# trend => 'c' = constant, 't' = linear time trend, 'ct' = both
# information_criterion => 'aic', 'aicc', 'bic', 'hqic', 'oob'
#                          (Akaike Information Criterion, Corrected Akaike Information Criterion,
#                          Bayesian Information Criterion, Hannan-Quinn Information Criterion, or
#                          "out of bag" -- for validation scoring -- respectively)
# out_of_sample_size => number of observations held out from the end of the series to validate the model selection
#                       (pass the entire dataset and hold out the last 20%, given as a count)

model_auto = auto_arima(
    df.ret_ftse[1:],
    exogenous = df[['ret_spx', 'ret_dax', 'ret_nikkei']][1:],
    # 5 business days
    m = 5,
    max_order = None,
    max_p = 7,
    max_q = 7,
    max_d = 2,
    # [start] Seasonal orders
    max_P = 4,
    max_Q = 4,
    max_D = 2,
    # [end] Seasonal orders
    maxiter = 50,
    alpha = 0.05,
    n_jobs = -1,
    trend = 'ct'
)
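
To get a feel for how much these limits widen the search, the sketch below counts the distinct `(p, q, P, Q)` combinations a full grid would contain under the bounds used above, and compares that with a hypothetical cap of 5 on `p + q + P + Q`. It is a back-of-the-envelope illustration only: it ignores `d` and `D`, and a stepwise search visits only a small part of this grid.

```python
from itertools import product

# Upper bounds taken from the auto_arima call above.
max_p, max_q, max_P, max_Q = 7, 7, 4, 4

# Every combination of non-seasonal and seasonal AR/MA orders within the bounds.
grid = list(
    product(range(max_p + 1), range(max_q + 1), range(max_P + 1), range(max_Q + 1))
)

# Hypothetical cap on the total order, for comparison with max_order = None.
capped = [orders for orders in grid if sum(orders) <= 5]

print(len(grid), len(capped))
```

With no cap the grid holds 1600 combinations, against 124 under the cap of 5, which is why removing `max_order` can make an exhaustive search far slower and raises the risk of selecting an overfitted model.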