# Models Tutorial

The purpose of this notebook is to demonstrate how to code the autoregressive (AR) models used in this project by hand. The typical AR modeling packages in Python do not make it easy to implement custom loss functions or other weighting schemes.

In [1]:
import sys
import numpy as np
from numpy import random
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import os
import os.path as osp
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Custom modules
sys.path.append(osp.join(os.getcwd(),"src")) # Add src subdirectory to python path
from data_funcs import synthetic_data

In [2]:
# Generate Data
# Sim data, no rain for simplicity
random.seed(456)

hours = 400 # Total number of time steps
dat = synthetic_data(max_rain = 0, data_noise = .5)  # Sim data from FMDA project code
fm = dat['fm'][0:hours]
h=np.arange(0, hours)

# Manually edit sim data to illustrate point about ROS
fm = fm + 20 - .07*np.arange(0, hours) # Shift up by 20, add decreasing trend

# Split to training and test
# Model 1 fit with OLS on FM
# h = h.reshape(-1, 1)
h2 = 300
fmtr=fm[0:h2]
fmte=fm[h2:len(fm)]

## Autoregressive Models

The `AutoReg` class from `statsmodels` provides a relatively simple interface for fitting AR models. However, it is not straightforward to add observation weights or a custom loss function. The goal here is to reproduce the same models with `LinearRegression` from `sklearn`, as a starting point from which weights and custom losses can be added more easily.

For a few different AR models, we will reproduce the `AutoReg` results with linear regression. The mathematical form of an AR model with time lags $k=1, 2, ..., K$ is:

$$
y_t = \beta_0 + \sum_{k=1}^K \beta_k y_{t-k} +\epsilon_t
$$

## Lag 1 AR model with constant trend

The mathematical specification is:

$$
y_t = \beta_0 + \beta_1 y_{t-1} +\epsilon_t
$$
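Because this specification is an ordinary linear regression of $y_t$ on $y_{t-1}$, it can be sketched with plain NumPy before turning to the library calls. The series below is simulated with made-up coefficients (1.0 and 0.8), and the variable names are illustrative only; none of it is the tutorial data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a hypothetical AR(1) series: y_t = 1.0 + 0.8 * y_{t-1} + noise
n = 500
y = np.zeros(n)
y[0] = 5.0
for t in range(1, n):
    y[t] = 1.0 + 0.8 * y[t - 1] + rng.normal(scale=0.1)

# Design matrix: a column of ones (intercept) and the lag-1 values
A = np.column_stack((np.ones(n - 1), y[:-1]))
target = y[1:]

# Least-squares estimates of (beta_0, beta_1)
beta, *_ = np.linalg.lstsq(A, target, rcond=None)
print(beta)
```

The first observation has no lagged value, so the regression uses `n - 1` rows. The same row loss appears in the cells that follow.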

In [3]:
## Autoreg Model, lag 1 and default of constant trend 
ar1 = AutoReg(fmtr, lags=1).fit()
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # in-sample fits; the first value is NaN because its lag can't be calculated
fit1 = fit1[1:h2]  # drop the leading NaN

## Reproduce with LinearRegression, with default constant mean (same as const trend)

X = pd.DataFrame({'rs': fmtr})
X['lag1'] = X['rs'].shift(1)
X = X.drop(['rs'], axis=1)
X = X.dropna().to_numpy()

mod = LinearRegression().fit(X, np.delete(fmtr, 0))
fits = mod.predict(X)

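We then compare the results. The two sets of fitted values should agree up to floating-point rounding error. Double-precision machine epsilon is about $2.2 \times 10^{-16}$, and rounding error accumulates over the computation, so for values of this magnitude we expect a maximum difference of roughly $10^{-14}$ or smaller, not exactly zero. The model parameters should likewise be the same up to rounding error.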

In [4]:
## Compare Results up to rounding error
def max_err(x, y):
    return np.max(np.abs(x-y))

print(f'Training Max Difference: {max_err(fits, fit1)}')

Training Max Difference: 2.842170943040401e-14
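To put the size of that difference in context, here is a small sketch that prints double-precision machine epsilon and shows a tolerance-based comparison with `np.allclose` as an alternative to inspecting the maximum difference by eye. The arrays `a` and `b` are made up for illustration, and the tolerance `1e-10` is an arbitrary choice.

```python
import numpy as np

# Machine epsilon for float64: the gap between 1.0 and the next larger float
eps = np.finfo(np.float64).eps
print(eps)  # approximately 2.22e-16

# Two hypothetical arrays that differ only by a few units of rounding error
a = np.array([20.0, 15.5, 10.25])
b = a + 10 * eps * a

print(np.max(np.abs(a - b)))                     # absolute difference, tiny but nonzero
print(np.allclose(a, b, rtol=1e-10, atol=0.0))   # relative-tolerance check
```

A relative tolerance scales with the magnitude of the values, which is why a difference near $10^{-14}$ on values around 20 is still consistent with rounding error.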


In [5]:
# Params same to 8 decimal places
print(np.round([mod.intercept_, mod.coef_[0]], 8))
print(ar1.params)

[0.17953995 0.99114216]
[0.17953995 0.99114216]


For prediction with the linear regression model, we have to call the predict function iteratively, starting from the last observation and then moving forward one forecasted value at a time.

In [6]:
# Predict with built-in AR function
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, dynamic=False)
# preds11 = ar1.forecast(hours-h2) # Note: equivalent

Below we write a function that forecasts the AR model into the future. This must be done iteratively: for each time step that is forecasted, that value is fed back into the model.

In [7]:
def predict_ts(m, f, ts):
    """
    m: fitted model object with a predict method
    f: observed series (1-D array)
    ts: number of time steps to forecast
    """

    preds = np.zeros(ts) # initialize array of forecasts for return value
    
    Xtemp = f[-1].reshape(1, 1) # model matrix with last fitted value
    
    preds[0]=m.predict(Xtemp)

    # Loop through remaining time steps and predict using last value
    for i in range(1, ts):
        Xtemp = preds[i-1].reshape(1, 1)
        preds[i]=m.predict(Xtemp)
    
    return preds

In [8]:
preds=predict_ts(mod, fmtr, len(fm)-h2)

In [9]:
np.max(np.abs(preds - preds1)) # Expect small value

1.5205614545266144e-12
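For the lag 1 model, the iterative forecast also has a closed form when $\beta_1 \neq 1$: $\hat y_{T+h} = \beta_0 \frac{1-\beta_1^h}{1-\beta_1} + \beta_1^h y_T$. The sketch below checks the recursion against that formula. The coefficients are the rounded values printed above; the starting value `10.0` and the function names are hypothetical.

```python
import numpy as np

def forecast_iterative(b0, b1, y_last, steps):
    """Feed each forecast back in as the next lag-1 input."""
    preds = np.zeros(steps)
    prev = y_last
    for i in range(steps):
        prev = b0 + b1 * prev
        preds[i] = prev
    return preds

def forecast_closed_form(b0, b1, y_last, steps):
    """Geometric-series form of the same recursion (requires b1 != 1)."""
    h = np.arange(1, steps + 1)
    return b0 * (1 - b1 ** h) / (1 - b1) + b1 ** h * y_last

b0, b1 = 0.17953995, 0.99114216  # rounded parameters from the output above
it = forecast_iterative(b0, b1, 10.0, 100)
cf = forecast_closed_form(b0, b1, 10.0, 100)
print(np.max(np.abs(it - cf)))  # expect a value near rounding error
```

The closed form does not extend as neatly to models with several lags or covariates, which is one reason to keep the iterative version.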

This is a poor model for the given data, which has an obvious time trend and seasonal effect, but the aim here is only to illustrate the inner workings.

## Lag K AR model with constant trend

$$
y_t = \beta_0 + \sum_{k=1}^K \beta_k y_{t-k} +\epsilon_t
$$

In [10]:
# Custom functions to recreate AR

def build_lags(v, lags):
    """
    v: data vector to lag
    lags: list of integers
    """
    
    X = pd.DataFrame({'x': v})
    for l in lags:
        X[f"lag{l}"] = X['x'].shift(l)
    X = X.drop(['x'], axis=1)
    X = X.dropna().to_numpy()
    return X
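To see what the lag matrix looks like, the following sketch applies the same `shift` and `dropna` steps to a short made-up vector. The values and the choice of two lags are for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical series
v = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# Same steps as build_lags, written inline for lags 1 and 2
df = pd.DataFrame({"x": v})
for l in (1, 2):
    df[f"lag{l}"] = df["x"].shift(l)
X = df.drop(columns="x").dropna().to_numpy()

# The response must drop the first max(lags) observations to line up with X
y = v[2:]

print(X)
print(y)
```

Each row of `X` holds the most recent lag first, and the number of rows shrinks by the largest lag. This is why the response is sliced as `fmtr[5:h2]` when five lags are used.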

In [11]:
## Autoreg Model, lag 5 and default of constant trend 
ar1 = AutoReg(fmtr, lags=5).fit() 
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # in-sample fits; the first 5 values are NaN because lags can't be calculated
fit1 = fit1[1:h2]  # drops only the first NaN; the remaining 4 are skipped when comparing below
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, dynamic=False)

In [12]:
## Recreate with LM
X = build_lags(fmtr, lags = np.arange(1, 6))
mod = LinearRegression().fit(X, fmtr[5:h2])
fits = mod.predict(X)

In [13]:
# Params same to 8 decimal places
print(np.round([mod.intercept_, *mod.coef_], 8))
print(ar1.params)
print(np.max(np.abs(fits - fit1[4:h2]))) # Expect small value

[ 0.71105127  1.07287614  0.39574979 -0.20867182 -0.42885378  0.1382032 ]
[ 0.71105127  1.07287614  0.39574979 -0.20867182 -0.42885378  0.1382032 ]
7.815970093361102e-14


## AR 1 with Time trend

The mathematical specification is:

$$
y_t = \beta_0 + \beta_1 t + \beta_2 y_{t-1} +\epsilon_t
$$

In [14]:
## Autoreg Model, lag 1 and time dependent trend and overall mean 
ar1 = AutoReg(fmtr, lags=1, trend="ct").fit()
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # ignore NAN values at beginning when lags can't be calculated
fit1 = fit1[1:h2]  # ignore NAN values at beginning when lags can't be calculated
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, dynamic=False)

In [15]:
## Reproduce with LinearRegression, with constant mean and time trend mean

X = pd.DataFrame({'rs': fmtr.tolist(), 't': h[0:h2].tolist()})
X['lag1'] = X['rs'].shift(1)
X = X.drop(['rs'], axis=1)
X = X.dropna().to_numpy()

mod = LinearRegression().fit(X, np.delete(fmtr, 0))
fits = mod.predict(X)

In [16]:
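# Trend and lag params same to 8 decimal places; the intercepts differ by exactly the trend
# coefficient, consistent with AutoReg's time trend starting at 1 while h starts at 0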
print(np.round([mod.intercept_, *mod.coef_], 8))
print(ar1.params)
print(np.max(np.abs(fits - fit1))) # Expect small value

[ 2.03533262 -0.00452552  0.94083945]
[ 2.03985814 -0.00452552  0.94083945]
2.842170943040401e-14
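The two intercepts printed above differ by the trend coefficient ($2.03985814 - 0.00452552 = 2.03533262$). That is what one would expect if the two fits index time differently, for example a 0-based `h` versus a trend that starts at 1. The sketch below reproduces the effect with made-up data and a hypothetical helper `ols`; it does not call `AutoReg`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
t0 = np.arange(n)   # time index starting at 0
t1 = t0 + 1         # same index shifted to start at 1
y = 3.0 - 0.05 * t0 + rng.normal(scale=0.1, size=n)  # hypothetical trending data

def ols(t, y):
    """Least-squares fit of y on an intercept and a time index."""
    A = np.column_stack((np.ones(len(t)), t))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

b_zero = ols(t0, y)  # (intercept, slope) with 0-based time
b_one = ols(t1, y)   # (intercept, slope) with 1-based time

print(b_zero, b_one)
print(b_zero[0] - b_one[0], b_one[1])  # intercept gap equals the slope
```

The fitted values are identical in both cases; only the intercept absorbs the shift in the time origin.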


We modify the predict function from before to add a time trend term.

In [17]:
def predict_ar1t(m, f, ts):
    """
    m: fitted model object with a predict method
    f: observed series (1-D array)
    ts: number of time steps to forecast
    """

    t = np.arange(len(f), len(f)+ts)
    
    preds = np.zeros(ts) # initialize array of forecasts for return value
    
    Xtemp = np.column_stack((t[0], f[-1])) # model matrix with last fitted value

    preds[0]=m.predict(Xtemp)

    # Loop through remaining time steps and predict using last value
    for i in range(1, ts):
        Xtemp = np.column_stack((t[i], preds[i-1]))
        # Xtemp = preds[i-1].reshape(1, 1) # join with time index
        preds[i]=m.predict(Xtemp)
    
    return preds

In [18]:
preds=predict_ar1t(mod, fmtr, len(fm)-h2)

In [19]:
np.max(np.abs(preds - preds1)) # Expect small value

3.907985046680551e-14

## AR 1 with Time trend and Covariates

The mathematical specification for $P$ predictors is:

$$
y_t = \beta_0 + \beta_1 t + \beta_2 y_{t-1} + \sum_{j=1}^P\alpha_j x_{j, t} +\epsilon_t
$$

The covariates I'll include here for illustration are an hour-of-day index and a randomly generated value, call it $z$. The random value is not part of the data-generating process; it is included only to show how to code multiple covariates.

In [20]:
hour = np.resize(range(1, 24), hours) # cycles through 1-23, since range(1, 24) excludes 24 (times here aren't real)
z = random.normal(10, 10, size=hours)
XX = np.column_stack((hour, z))

In [21]:
## Autoreg Model, lag 1, time dependent trend and overall mean, two covariates 
ar1 = AutoReg(fmtr, lags=1, trend="ct", exog = XX[0:h2]).fit()
fit1 = ar1.predict(start=0, end=h2-1, exog = XX[0:h2], dynamic=False) # ignore NAN values at beginning when lags can't be calculated
fit1 = fit1[1:h2]  # ignore NAN values at beginning when lags can't be calculated
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, exog_oos=XX[h2:hours], dynamic=False)

In [22]:
## Reproduce with LinearRegression: intercept, time trend, two covariates, and lag 1

X = pd.DataFrame({'rs': fmtr.tolist(), 't': h[0:h2].tolist(), 'hour': hour[0:h2].tolist(), 'z': z[0:h2].tolist()})
X['lag1'] = X['rs'].shift(1)
X = X.drop(['rs'], axis=1)
X = X.dropna().to_numpy()

mod = LinearRegression().fit(X, np.delete(fmtr, 0))
fits = mod.predict(X)

In [23]:
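# Params same to 8 decimal places, different order; the intercepts differ by the trend
# coefficient, consistent with AutoReg's time trend starting at 1 while h starts at 0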
print(np.round([mod.intercept_, *mod.coef_], 8))
print(ar1.params)
print(np.max(np.abs(fits - fit1))) # Expect small value

[ 1.11494846 -0.00403312  0.04223127  0.00770899  0.95175982]
[ 1.11898158 -0.00403312  0.95175982  0.04223127  0.00770899]
5.684341886080802e-14


Again, we must slightly modify the predict function.

In [24]:
def predict_ar1t(m, f, XX, ts):
    """
    m: fitted model object with a predict method
    f: observed series (1-D array)
    XX: covariate DataFrame for the forecast period, one row per step
    ts: number of time steps to forecast
    """
    
    preds = np.zeros(ts) # initialize array of forecasts for return value
    
    Xtemp = np.column_stack((XX.loc[0:0], f[-1])) # model matrix with last fitted value

    preds[0]=m.predict(Xtemp)

    # Loop through remaining time steps and predict using last value
    for i in range(1, ts):
        Xtemp = np.column_stack((XX.loc[i:i], preds[i-1]))
        # Xtemp = preds[i-1].reshape(1, 1) # join with time index
        preds[i]=m.predict(Xtemp)
    
    return preds

In [25]:
X2 = pd.DataFrame({'t': h[h2:hours].tolist(), 'hour': hour[h2:hours].tolist(), 'z': z[h2:hours].tolist()})

preds=predict_ar1t(mod, fmtr, X2, len(fm)-h2)

In [26]:
np.max(np.abs(preds - preds1)) # Expect small value

1.297184581972033e-12

## Lag K AR model with time trend and covariates

For a model with $K$ time lags and $P$ other covariates, the mathematical specification is:

$$
y_t = \beta_0 + \beta_1 t + \sum_{k=1}^K \beta_{k+1} y_{t-k} + \sum_{j=1}^P\alpha_j x_{j, t} +\epsilon_t
$$

In [27]:
## Autoreg Model, lag 5, time trend, and two covariates
ar1 = AutoReg(fmtr, lags=5, trend="ct", exog = XX[0:h2]).fit()
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # in-sample fits; the first 5 values are NaN because lags can't be calculated
fit1 = fit1[1:h2]  # drops only the first NaN; the remaining 4 are skipped when comparing below
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, exog_oos=XX[h2:hours], dynamic=False)

In [28]:
## Reproduce with LinearRegression
lags=5
X = build_lags(fmtr, lags = np.arange(1, lags+1))
X = pd.DataFrame(X)
X['t'] = h[lags:h2].tolist()
X['hour'] = hour[lags:h2].tolist()
X['z'] = z[lags:h2].tolist()
X = X.to_numpy()

In [29]:
mod = LinearRegression().fit(X, fmtr[lags:h2])
fits = mod.predict(X)

In [30]:
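# Params same to 8 decimal places, different order; the intercepts differ by the trend
# coefficient, consistent with AutoReg's time trend starting at 1 while h starts at 0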
print(np.round([mod.intercept_, *mod.coef_], 8))
print(ar1.params) # Params (9 excluding error var): intercept, time trend, 5x lags, 2x other
print(np.max(np.abs(fits - fit1[(lags-1):h2]))) # Expect small value

[ 4.62876349e+00  8.27666360e-01  4.64901500e-01 -1.56592400e-02
 -3.48889520e-01 -7.26099200e-02 -9.07512000e-03  1.10529800e-02
  1.22160000e-04]
[ 4.63783861e+00 -9.07512346e-03  8.27666358e-01  4.64901504e-01
 -1.56592385e-02 -3.48889518e-01 -7.26099162e-02  1.10529791e-02
  1.22161569e-04]
1.1013412404281553e-13


In [31]:
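def predict_ar(m, K, f, XX, ts):
    """
    m: fitted model object with a predict method
    K: number of time lag terms in m
    f: observed series (1-D array)
    XX: covariate DataFrame for the forecast period, one row per step
    ts: number of time steps to forecast
    """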
    
    preds = np.zeros(ts) # initialize array of forecasts for return value
    
    Xtemp = np.column_stack((np.flip(f[-K:]).reshape(1, K), XX.loc[0:0])) # model matrix with last fitted value

    preds[0]=m.predict(Xtemp)

    # Loop through remaining time steps and predict using last value
    for i in range(1, ts):
        if i < K: # build lags using training data if necessary
            x = np.concatenate((f[-(K-i):], preds[0:i]))
        else: 
            x = preds[(i-K):i]
        x = np.flip(x)
        Xtemp = np.column_stack((x.reshape(1, K), XX.loc[i:i]))
        # Xtemp = preds[i-1].reshape(1, 1) # join with time index
        preds[i]=m.predict(Xtemp)
    
    return preds

In [32]:
X2 = pd.DataFrame({'t': h[h2:hours].tolist(), 'hour': hour[h2:hours].tolist(), 'z': z[h2:hours].tolist()})

preds=predict_ar(mod, lags, fmtr, X2, len(fm)-h2)

In [33]:
np.max(np.abs(preds - preds1)) # Expect small value

8.348877145181177e-13
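The lag bookkeeping in `predict_ar` can be easier to follow with the covariates stripped away. The sketch below is a simplified stand-in, not the function above: it takes the intercept and lag coefficients directly (hypothetical values) and keeps a rolling buffer of the last $K$ values, ordered most recent first to match the `lag1, lag2, ...` column order produced by `build_lags`.

```python
import numpy as np

def forecast_ar(intercept, lag_coefs, history, steps):
    """Recursive AR(K) forecast without trend or covariates."""
    K = len(lag_coefs)
    buf = list(history[-K:])           # last K values, oldest to newest
    preds = np.zeros(steps)
    for i in range(steps):
        recent_first = buf[::-1]       # y_{t-1}, y_{t-2}, ..., y_{t-K}
        preds[i] = intercept + np.dot(lag_coefs, recent_first)
        buf = buf[1:] + [preds[i]]     # drop the oldest value, append the new forecast
    return preds

# Hypothetical AR(2): y_t = 1 + 0.5 * y_{t-1} + 0.3 * y_{t-2}
p = forecast_ar(1.0, np.array([0.5, 0.3]), np.array([2.0, 4.0]), 3)
print(p)
```

For the first $K-1$ steps the buffer still contains observed values, which corresponds to the `if i < K` branch in `predict_ar`.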

As a further check, we repeat the comparison with nine lags and a third covariate.

In [34]:
## Autoreg Model, lag 9 and time trend and another random predictor
np.random.seed(123)
hour = np.resize(range(1, 24), hours) # cycles through 1-23, since range(1, 24) excludes 24 (times here aren't real)
z = random.normal(10, 10, size=hours)
y = random.normal(10, 10, size=hours)
XX = np.column_stack((hour, z, y))

ar1 = AutoReg(fmtr, lags=9, trend="ct", exog = XX[0:h2]).fit() 
fit1 = ar1.predict(start=0, end=h2-1, dynamic=False) # in-sample fits; the first 9 values are NaN because lags can't be calculated
fit1 = fit1[1:h2]  # drops only the first NaN; the remaining 8 are skipped when comparing below
preds1 = ar1.model.predict(ar1.params, start=len(fmtr), end=len(fm)-1, exog_oos=XX[h2:hours], dynamic=False)

In [35]:
## Reproduce with LinearRegression
lags=9
X = build_lags(fmtr, lags = np.arange(1, lags+1))
X = pd.DataFrame(X)
X['t'] = h[lags:h2].tolist()
X['hour'] = hour[lags:h2].tolist()
X['z'] = z[lags:h2].tolist()
X['y'] = y[lags:h2].tolist()
X = X.to_numpy()

In [36]:
mod = LinearRegression().fit(X, fmtr[lags:h2])
fits = mod.predict(X)

In [37]:
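# Check params the same up to ordering; the intercepts differ by the trend coefficient,
# consistent with AutoReg's time trend starting at 1 while h starts at 0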
print(np.round([mod.intercept_, *mod.coef_], 8))
print(ar1.params) 
print(np.max(np.abs(fits - fit1[(lags-1):h2]))) # Expect small value

[ 6.80131277e+00  7.38909280e-01  4.47106490e-01  9.04955900e-02
 -3.00207340e-01 -1.22535420e-01 -8.13481200e-02  7.53617000e-02
 -1.13412170e-01  6.25961000e-02 -1.33262700e-02  1.04575000e-03
 -1.46969000e-03  1.70286000e-03]
[ 6.81463904e+00 -1.33262691e-02  7.38909276e-01  4.47106492e-01
  9.04955853e-02 -3.00207341e-01 -1.22535422e-01 -8.13481153e-02
  7.53616994e-02 -1.13412169e-01  6.25961022e-02  1.04574624e-03
 -1.46968512e-03  1.70286136e-03]
1.1723955140041653e-13


In [38]:
X2 = pd.DataFrame({'t': h[h2:hours].tolist(), 'hour': hour[h2:hours].tolist(), 'z': z[h2:hours].tolist(), 'y': y[h2:hours].tolist()})

preds=predict_ar(mod, lags, fmtr, X2, len(fm)-h2)

In [39]:
np.max(np.abs(preds - preds1)) # Expect small value

8.43769498715119e-13

## AR Model Loss Functions

The mathematical form of an AR model at time $t$ with time lags $k=1, 2, ..., K$ is:

$$
y_t = \beta_0 + \sum_{k=1}^K \beta_k y_{t-k} +\epsilon_t
$$

To fit the model by least squares, the coefficients $\beta_0, ..., \beta_K$ are found by minimizing the residual sum of squares (RSS). For times $t=1, ..., T$, observed response $y_t$, and modeled response $\hat y_t$:

$$
RSS = \sum_{t=1}^{T}(y_t - \hat y_t)^2
$$
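Written as code, the RSS is a function of the coefficient vector, and it is this function that would be swapped out for a custom loss. The sketch below uses a made-up series and hypothetical helper names (`lag_matrix`, `rss`), and sums only over the time steps where all $K$ lags are available. It checks that the least-squares solution gives a smaller RSS than a perturbed coefficient vector.

```python
import numpy as np

def lag_matrix(y, K):
    """Columns y_{t-1}, ..., y_{t-K} for every t where all K lags exist."""
    T = len(y)
    return np.column_stack([y[K - k:T - k] for k in range(1, K + 1)])

def rss(beta, y, K):
    """RSS of an AR(K) model with intercept beta[0] and lag coefficients beta[1:]."""
    y_hat = beta[0] + lag_matrix(y, K) @ beta[1:]
    return np.sum((y[K:] - y_hat) ** 2)

# Hypothetical series for illustration
rng = np.random.default_rng(3)
y = 10 + 0.1 * np.cumsum(rng.normal(size=300))
K = 2

# Least-squares solution via the design matrix [1, lags]
A = np.column_stack((np.ones(len(y) - K), lag_matrix(y, K)))
beta_ols, *_ = np.linalg.lstsq(A, y[K:], rcond=None)

print(rss(beta_ols, y, K))          # RSS at the least-squares solution
print(rss(beta_ols + 0.01, y, K))   # larger RSS at a perturbed coefficient vector
```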

A weighted sum of squares (WSS) procedure instead minimizes the following, for weights $w_t$, $t = 1, ..., T$:

$$
WSS = \sum_{t=1}^{T}w_t(y_t - \hat y_t)^2
$$
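One way to minimize the WSS with the tools used in this notebook is the `sample_weight` argument of `LinearRegression.fit`. The sketch below fits a weighted lag 1 model to a simulated series and checks the result against the weighted normal equations. The series, its coefficients, and the exponential weighting scheme are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate a hypothetical AR(1) series: y_t = 2.0 + 0.8 * y_{t-1} + noise
rng = np.random.default_rng(2)
n = 200
y = np.zeros(n)
y[0] = 10.0
for t in range(1, n):
    y[t] = 2.0 + 0.8 * y[t - 1] + rng.normal(scale=0.5)

X = y[:-1].reshape(-1, 1)  # lag-1 predictor
target = y[1:]             # response aligned with the lag

# Hypothetical weights: more weight on recent observations
w = np.exp(np.linspace(-2.0, 0.0, len(target)))

# Weighted least squares through sklearn
wls = LinearRegression().fit(X, target, sample_weight=w)

# Same estimate from the weighted normal equations (A' W A) beta = A' W y
A = np.column_stack((np.ones(len(target)), X))
beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * target))

print(wls.intercept_, wls.coef_[0])
print(beta)
```

Setting all weights to 1 recovers the unweighted RSS fit, so the earlier comparisons against `AutoReg` are the special case $w_t = 1$.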