## Autoregressive Models Tutorial

The purpose of this notebook is to demonstrate coding autoregressive (AR) models by hand. The `AutoReg` class from `statsmodels` provides a relatively simple interface for fitting these models. However, it is not straightforward to add observation weights or a custom loss function. The goal here is to fit the same models with the linear regression model from `sklearn`, so that weights and a custom loss can be added more easily.

For a few different AR models, we will reproduce the `AutoReg` results with linear regression.
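As a minimal sketch of why the regression route is convenient, the block below fits a lag-1 model with and without observation weights by passing `sample_weight` to `LinearRegression.fit`. The simulated series, its coefficients, and the linearly increasing weighting scheme are all assumptions made for illustration; they are not the data or the weights used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical AR(1) series: y_t = 2 + 0.8 * y_{t-1} + noise (illustration only)
n = 200
y = np.empty(n)
y[0] = 10.0
for t in range(1, n):
    y[t] = 2.0 + 0.8 * y[t - 1] + rng.normal(scale=0.5)

X = y[:-1].reshape(-1, 1)  # lag-1 predictor
target = y[1:]             # response aligned with its lag

# Hypothetical weighting scheme: more recent observations count more
weights = np.linspace(0.1, 1.0, len(target))

unweighted = LinearRegression().fit(X, target)
weighted = LinearRegression().fit(X, target, sample_weight=weights)

print("unweighted:", unweighted.intercept_, unweighted.coef_[0])
print("weighted:  ", weighted.intercept_, weighted.coef_[0])
```

The only change between the two fits is the `sample_weight` argument, which is the kind of extension that is harder to express through `AutoReg`.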

In [1]:
import sys
import numpy as np
from numpy import random
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import os
import os.path as osp
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Custom modules
sys.path.append(osp.join(os.getcwd(),"src")) # Add src subdirectory to python path
from data_funcs import synthetic_data

In [2]:
# Generate Data
# Sim data, no rain for simplicity
random.seed(456)

hours = 400 # Total number of time steps
dat = synthetic_data(max_rain = 0, data_noise = .5)  # Sim data from FMDA project code
fm = dat['fm'][0:hours]
h=np.arange(0, hours)

# Manually edit sim data to illustrate point about ROS
fm = fm + 20 - .07*np.arange(0, hours) # Shift up by 20, add decreasing trend

# Split to training and test
# Model 1 fit with OLS on FM
h = h.reshape(-1, 1)
h2 = 300
fmtr=fm[0:h2]
fmte=fm[h2:len(fm)]

## Lag 1 AR model with constant trend
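
For reference, the lag-1 model with a constant can be written as below. The symbols $c$, $\phi$, and $\varepsilon_t$ are notation introduced here for illustration, with $y_t$ standing for the FM value at hour $t$:

$$
y_t = c + \phi \, y_{t-1} + \varepsilon_t
$$

Under this reading, regressing $y_t$ on $y_{t-1}$ with an intercept estimates $c$ (the intercept) and $\phi$ (the slope). The first observation has no lag, so a training series of length $n$ contributes $n-1$ rows to the regression.
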

In [3]:
## Autoreg Model, lag 1 and default of constant trend 
ar1 = AutoReg(fmtr, lags=1).fit() # fmtr (training FM) is the modeled response
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # in-sample fitted values
fit1 = fit1[1:h2]  # drop the leading NaN, where the lag can't be calculated
preds1 = ar1.predict(start=len(fmtr), end=len(fmtr)+len(fmte)-1, dynamic=False) # out-of-sample forecasts, one per test hour

## Reproduce with LinearRegression, with default constant mean (same as const trend)

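# Build the lag-1 design matrix; the first row has no lag and is dropped
X = pd.DataFrame({'rs': fmtr})
X['lag1'] = X['rs'].shift(1)
X = X.drop(['rs'], axis=1)
X = X.dropna().to_numpy()

# Drop the first response value so it lines up with the lagged predictor
mod = LinearRegression().fit(X, np.delete(fmtr, 0))
fits = mod.predict(X)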

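# One-step-ahead test predictions from the observed test lags.
# The first test hour has no lag within fmte, so dropna() leaves len(fmte)-1 rows.
Xte = pd.DataFrame({'rs': fmte})
Xte['lag1'] = Xte['rs'].shift(1)
Xte = Xte.drop(['rs'], axis=1)
Xte = Xte.dropna().to_numpy()

preds = mod.predict(Xte)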

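We then compare the results up to rounding error. We expect a maximum difference within a few orders of magnitude of double-precision machine epsilon (about $2.2 \times 10^{-16}$), since rounding error accumulates and scales with the magnitude of the values.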
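
To give a sense of scale for "rounding error", the short sketch below prints the double-precision machine epsilon and the spacing between adjacent floating-point numbers near 20. The value 20 is an assumption chosen only because the series was shifted up by 20; the actual magnitudes in `fm` may differ.

```python
import numpy as np

# Machine epsilon: gap between 1.0 and the next representable float64
eps = np.finfo(np.float64).eps
print(eps)  # 2.220446049250313e-16

# Gap between adjacent float64 values near 20 (an assumed typical magnitude)
gap_near_20 = np.spacing(20.0)
print(gap_near_20)

# A difference of a few such gaps is consistent with pure rounding error
print(8 * gap_near_20)
```

A maximum difference on the order of $10^{-14}$ is therefore plausible for two algebraically identical fits on values of this size.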

In [9]:
## Compare Results up to rounding error
def max_err(x, y):
    return np.max(np.abs(x-y))

print(f'Training Max Difference: {max_err(fits, fit1)}')
print(f'Test Max Difference: {max_err(preds, preds1)}')

Training Max Difference: 2.842170943040401e-14


ValueError: operands could not be broadcast together with shapes (99,) (100,) 
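
The shape mismatch appears to come from comparing two different kinds of prediction. Building lags from the test set alone yields one-step-ahead predictions with one fewer row than the test set, while the out-of-sample `AutoReg` predictions cover every test hour and, to our understanding of `statsmodels`, are computed recursively from earlier forecasts. The sketch below illustrates the difference with plain NumPy on a simulated series. The series, the variable names, and the use of `np.linalg.lstsq` in place of `LinearRegression` are assumptions for illustration, not the notebook's data or method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in series (not the FMDA data)
n, n_train = 400, 300
y = np.empty(n)
y[0] = 10.0
for t in range(1, n):
    y[t] = 2.0 + 0.8 * y[t - 1] + rng.normal(scale=0.5)
train, test = y[:n_train], y[n_train:]

# OLS fit of y_t on [1, y_{t-1}] using the training data only
A = np.column_stack([np.ones(n_train - 1), train[:-1]])
c, phi = np.linalg.lstsq(A, train[1:], rcond=None)[0]

# One-step-ahead predictions from test lags: the first test point has no lag
one_step = c + phi * test[:-1]
print(one_step.shape)  # (99,)

# Recursive forecast: start from the last training value and feed each
# forecast back in as the next lag
forecast = np.empty(len(test))
prev = train[-1]
for k in range(len(test)):
    prev = c + phi * prev
    forecast[k] = prev
print(forecast.shape)  # (100,)
```

If the recursive reading is right, replacing the test-lag predictions with a loop like this one, applied to the fitted `LinearRegression` model, should give an array of the same length as the `AutoReg` forecasts and allow the comparison to run.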