## Lecture 22-23 BIC
* **Define a function that computes the BIC for a given p and selects the "optimal" p for an AR(p) model**

Define a function named `bicAR`.
* `lagmat2ds(x, maxlag0)` (`statsmodels.tsa.tsatools.lagmat2ds`): Generates a lag matrix for a 2d array, with columns arranged by variable (each column is the series shifted by a given lag). `x` is the original data. `maxlag0` is the largest lag to generate, so for a single series the output has `maxlag0 + 1` columns, one for each lag from 0 through `maxlag0`. See Example 1.
* `statsmodels.api.OLS()`:
    * One of its arguments is `missing`. Available options are 'none', 'drop', and 'raise'. If 'none', no NaN checking is done. If 'drop', any observations with NaNs are dropped. If 'raise', an error is raised. The default is 'none'.
    * The fitted results also report the BIC, which can be read from the `.bic` attribute of the results or from the summary table. See Example 2.
    * Here we need to include the constant term (the estimate of $\mu$) in the regression by generating a vector of 1s.

In [1]:
def bicAR(Y, pmax):
    """
    Calculate the Bayesian Information Criterion for a univariate AR(p) model.
    
    Inputs:
        Y   : Time series data of length n. A list or an array.
        pmax: User-specified upper bound on the number of lags
        
    Outputs:
        BIC           : Smallest Bayesian Information Criterion over p = 1, ..., pmax
        p_optimal     : The p that attains BIC
        BIC_statsmodel: Smallest BIC reported by statsmodels
        p_statsmodel  : The p that attains BIC_statsmodel
    """
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.tsa.tsatools import lagmat2ds
    
    Y          = np.array(Y)    # in case the input is of other datatypes
    bic_aux    = np.zeros(pmax)
    T          = len(Y)    
    mu_aux     = np.transpose(np.matrix(np.ones(T)))
    bic_sm_aux = []
    
    for i_p in range(pmax): # 0,1,2..., pmax-1, so actual lag is i_p+1
        Ylag   = lagmat2ds(x=Y, maxlag0=i_p+1) # The first column is with lag 0   
        exogen = np.array(np.concatenate((mu_aux, Ylag[:,1:]), axis=1))

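        # lagmat2ds pads unavailable lags with 0; mark those entries as NaN so
        # that missing='drop' discards the incomplete rows.
        # Caution: genuine zeros in the data would be marked as missing too.
        exogen[exogen == 0] = np.nan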
                    
        reg1         = sm.OLS(endog=Y, exog=exogen, missing='drop')
        results      = reg1.fit()
        bic_sm_aux.append(results.bic)
        
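        OLS_residual = results.resid     # length T-i_p-1: rows containing NaN were already dropped
        aux_resid    = OLS_residual[i_p+1:]       # discards a further i_p+1 leading residuals
        part_SSR     = np.sum(aux_resid**2)
        # log(SSR/T) + k*log(T)/T, with k = i_p+2 coefficients (constant + i_p+1 lags)
        bic_aux[i_p] = np.log(part_SSR/T) + (i_p+2)*np.log(T)/T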
    
    BIC_statsmodel = min(bic_sm_aux)
    p_statsmodel   = bic_sm_aux.index(BIC_statsmodel)+1
    BIC            = np.nanmin(bic_aux)
    p_optimal      = np.where(bic_aux==BIC)[0][0]+1
    
    
    return BIC, p_optimal, BIC_statsmodel, p_statsmodel
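
As a cross-check on the logic above, the same selection rule can be sketched with NumPy alone. This is a minimal sketch, not the function used in this lecture: the name `bic_ar_numpy` is hypothetical, and it assumes the criterion log(SSR/n) + (p+1)·log(n)/n evaluated on the n = T − p usable observations for each p, so its values will not match `bicAR` exactly.

```python
import numpy as np

def bic_ar_numpy(y, pmax):
    """Hypothetical helper: BIC(p) = log(SSR/n) + (p+1)*log(n)/n for p = 1..pmax."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    bic = np.empty(pmax)
    for p in range(1, pmax + 1):
        target = y[p:]                                    # y_t for t = p, ..., T-1
        lags = [y[p - k:T - k] for k in range(1, p + 1)]  # y_{t-k}, aligned with target
        X = np.column_stack([np.ones(T - p)] + lags)      # constant + p lags
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        n = T - p                                         # observations actually used
        bic[p - 1] = np.log(resid @ resid / n) + (p + 1) * np.log(n) / n
    return bic, int(np.argmin(bic)) + 1

# Illustrative AR(2) data (coefficients chosen only for this sketch)
rng = np.random.default_rng(0)
T = 5000
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.3 * y[t - 1] + 0.5 * y[t - 2] + eps[t]

bic, p_hat = bic_ar_numpy(y, pmax=5)
print(bic.round(4), p_hat)
```

Because the lags are built by slicing, no zero padding is created and no NaN conversion step is needed.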


**Example 1**
* Be careful: where a lag is not available at the start of the sample, `lagmat2ds` fills the entry with 0, not None/NaN! So before we run the OLS regression, we must change those 0 values into NaN and then drop those data points in the regression.

In [2]:
import numpy as np
from statsmodels.tsa.tsatools import lagmat2ds
import pandas as pd

max_lag = 4

X = np.array([1,2,3,4,5,6,7])
Xlag = lagmat2ds(x=X, maxlag0=max_lag)
pd.DataFrame(Xlag, index = ['t=1','t=2','t=3','t=4','t=5','t=6','t=7'])

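|     | 0   | 1   | 2   | 3   | 4   |
|-----|-----|-----|-----|-----|-----|
| t=1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| t=2 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| t=3 | 3.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| t=4 | 4.0 | 3.0 | 2.0 | 1.0 | 0.0 |
| t=5 | 5.0 | 4.0 | 3.0 | 2.0 | 1.0 |
| t=6 | 6.0 | 5.0 | 4.0 | 3.0 | 2.0 |
| t=7 | 7.0 | 6.0 | 5.0 | 4.0 | 3.0 |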


* Convert 0 values into None (stored as NaN, since the array holds floats)

In [3]:
for i in range(max_lag+1):
    for j in range(len(X)):
        if Xlag[j,i] == 0:
            Xlag[j,i] = None
pd.DataFrame(Xlag, index = ['t=1','t=2','t=3','t=4','t=5','t=6','t=7'])

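|     | 0   | 1   | 2   | 3   | 4   |
|-----|-----|-----|-----|-----|-----|
| t=1 | 1.0 | NaN | NaN | NaN | NaN |
| t=2 | 2.0 | 1.0 | NaN | NaN | NaN |
| t=3 | 3.0 | 2.0 | 1.0 | NaN | NaN |
| t=4 | 4.0 | 3.0 | 2.0 | 1.0 | NaN |
| t=5 | 5.0 | 4.0 | 3.0 | 2.0 | 1.0 |
| t=6 | 6.0 | 5.0 | 4.0 | 3.0 | 2.0 |
| t=7 | 7.0 | 6.0 | 5.0 | 4.0 | 3.0 |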
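
Scanning for zeros also flags observations that are genuinely equal to 0. One way around this, sketched below under the assumption that a plain NumPy helper is acceptable, is to build the lag matrix with NaN padding from the start. The function name `lag_matrix_nan` is hypothetical and is not part of statsmodels.

```python
import numpy as np

def lag_matrix_nan(x, maxlag):
    """Hypothetical helper: column k holds x lagged by k periods, NaN where no data exists."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.full((n, maxlag + 1), np.nan)   # start with everything missing
    for k in range(maxlag + 1):
        out[k:, k] = x[:max(n - k, 0)]       # rows k.. receive x shifted down by k
    return out

X = np.array([0, 2, 3, 4, 5, 6, 7])          # note the genuine zero at t=1
Xlag = lag_matrix_nan(X, 4)
print(Xlag)
```

The zero at t=1 stays in column 0 and reappears in the lag-4 column at t=5, while only the entries with no available lag are NaN.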


**Invoke the bicAR function, taking AR(3) as an example.**

* Set up parameters

In [4]:
class AR3_param:
    def __init__(self):
        self.phi1 = 0.5
        self.phi2 = 0.3
        self.phi3 = 0.1
        self.sigma2 = 2

AR3 = AR3_param()

In [5]:
print(AR3.phi1)
print(AR3.phi2)
print(AR3.phi3)
print(AR3.sigma2)

0.5
0.3
0.1
2


* Simulate data

In [6]:
import numpy as np
import statsmodels.api as sm

T = 10000
ar_param = np.array([1, -AR3.phi1, -AR3.phi2, -AR3.phi3])
ma_param = np.array([1])

np.random.seed(1) 
# Newer statsmodels releases name the standard-deviation argument `scale` (older ones used `sigma`)
Y = sm.tsa.arma_generate_sample(ar=ar_param, ma=ma_param, nsample=T, scale=AR3.sigma2**0.5)
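
The AR coefficients are passed with a leading 1 and flipped signs because the function expects the lag polynomial 1 − φ1·L − φ2·L² − φ3·L³. To make the implied recursion explicit, here is a minimal NumPy sketch. The helper `simulate_ar`, its burn-in length, and its seed are assumptions for illustration, and it will not reproduce the draws generated above.

```python
import numpy as np

def simulate_ar(phi, sigma2, T, burnin=200, seed=1):
    """Hypothetical helper: y_t = phi_1*y_{t-1} + ... + phi_p*y_{t-p} + eps_t."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=np.sqrt(sigma2), size=T + burnin)  # N(0, sigma2) innovations
    y = np.zeros(T + burnin)
    for t in range(T + burnin):
        for k in range(1, min(p, t) + 1):     # use only the lags that exist so far
            y[t] += phi[k - 1] * y[t - k]
        y[t] += eps[t]
    return y[burnin:], eps[burnin:]           # discard the burn-in period

y_sim, eps_sim = simulate_ar([0.5, 0.3, 0.1], sigma2=2, T=1000)

# The recursion holds with positive phi's, i.e. the negatives of ar_param[1:]
implied = y_sim[3:] - 0.5 * y_sim[2:-1] - 0.3 * y_sim[1:-2] - 0.1 * y_sim[:-3]
print(np.allclose(implied, eps_sim[3:]))
```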

* **Example 2**: results of the OLS regression (p = 2 plays the role of the loop index in `bicAR`, so the regression uses p + 1 = 3 lags, i.e. AR(3))

In [7]:
p = 2
T = len(Y)
mu_aux = np.transpose(np.matrix(np.ones(T)))
Ylag   = lagmat2ds(x=Y, maxlag0=p+1)
exogen = np.array(np.concatenate((mu_aux, Ylag[:,1:]), axis=1))

for i in range(p+2):
    for j in range(T):
        if exogen[j,i] == 0:
            exogen[j,i] = None 

reg0         = sm.OLS(endog=Y, exog=exogen, missing='drop')
results      = reg0.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.731
Model:                            OLS   Adj. R-squared:                  0.731
Method:                 Least Squares   F-statistic:                     9074.
Date:                Tue, 09 Apr 2019   Prob (F-statistic):               0.00
Time:                        01:44:38   Log-Likelihood:                -17637.
No. Observations:                9997   AIC:                         3.528e+04
Df Residuals:                    9993   BIC:                         3.531e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0135      0.014      0.954      0.3

* Select p using bicAR

In [8]:
from bic import bicAR
(BIC, p_optimal, BIC_statsmodel, p_statsmodel) = bicAR(Y,6)

p   = [p_optimal,p_statsmodel]
bic = [BIC, BIC_statsmodel]
import pandas as pd
pd.DataFrame({'selected p': p, 'BIC': bic}, index=['bicAR','statsmodel'])

|            | selected p | BIC          |
|------------|------------|--------------|
| bicAR      | 3          | 0.693306     |
| statsmodel | 3          | 35311.476681 |
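
The two BIC values are on very different scales even though both select p = 3. A sketch of why, assuming statsmodels computes the OLS BIC from the Gaussian log-likelihood $\ell$ as $-2\ell + k\log n$, where $n$ is the number of observations used and $k$ the number of regressors including the constant:

```latex
\begin{aligned}
\ell &= -\frac{n}{2}\left[\log(2\pi) + \log\frac{SSR}{n} + 1\right] \\
\mathrm{BIC}_{\text{statsmodels}} &= -2\ell + k\log n
  = n\left[\log\frac{SSR}{n} + \frac{k\log n}{n}\right] + n\left[\log(2\pi) + 1\right]
\end{aligned}
```

The first bracket has the same form as the criterion computed in `bicAR`, so dividing the statsmodels value by $n$ and subtracting $\log(2\pi) + 1 \approx 2.8379$ should roughly recover it. With $n = 9997$ from Example 2, $35311.48 / 9997 - 2.8379 \approx 0.694$, which is close to the 0.693 reported by `bicAR`. The small gap is expected because `bicAR` divides by $T = 10000$ and uses slightly fewer residuals.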
