# Autoregression

When modeling an outcome vector from a feature matrix, autoregression uses one or more past outcome values as features to predict future outcomes.

Observed outcome vector:
$$
y = [y_0, y_1, \ldots, y_T], \quad t = 0, \ldots, T
$$

Lag 1 Model:
$$
y_t = \beta_0 + \beta_1 y_{t-1} + \epsilon_t
$$

$$
\hat y_t = \beta_0 + \beta_1 y_{t-1}
$$

The errors are usually assumed to be iid, $\epsilon_t \sim N(0, \sigma^2)$.
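As a minimal sketch of the lag 1 model, the following simulates a series and recovers $\beta_0$ and $\beta_1$ by ordinary least squares. The function names (`simulate_ar1`, `fit_ar1`), the parameter values, and the choice of least squares as the estimator are assumptions made for illustration; the notes do not prescribe a fitting method.

```python
import random

def simulate_ar1(beta0, beta1, sigma, n, seed=0):
    """Draw y_0..y_{n-1} from y_t = beta0 + beta1 * y_{t-1} + eps_t."""
    rng = random.Random(seed)
    # Start at the stationary mean (assumes |beta1| < 1).
    y = [beta0 / (1.0 - beta1)]
    for _ in range(n - 1):
        y.append(beta0 + beta1 * y[-1] + rng.gauss(0.0, sigma))
    return y

def fit_ar1(y):
    """Least-squares regression of y_t on y_{t-1}; returns (beta0_hat, beta1_hat)."""
    prev, curr = y[:-1], y[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    beta1_hat = cov / var
    beta0_hat = mean_curr - beta1_hat * mean_prev
    return beta0_hat, beta1_hat

# Illustrative values only.
y = simulate_ar1(beta0=1.0, beta1=0.5, sigma=1.0, n=5000)
beta0_hat, beta1_hat = fit_ar1(y)
```

With a long enough series the estimates should land near the simulated values, though the exact numbers depend on the seed.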

Lag K model:

$$
y_t = \beta_0 + \sum_{k=1}^K\beta_k y_{t-k} + \epsilon_t
$$

Arbitrary Lag model:

For example, with lag set $K = \{1, 24\}$ (here $K$ denotes a set of lags rather than a maximum lag):

$$
y_t = \beta_0 + \sum_{k \in K}\beta_k y_{t-k} + \epsilon_t
$$
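To make the arbitrary-lag idea concrete, here is a small sketch that builds the lagged feature rows for a chosen lag set. The helper name `lagged_design` and the toy series are hypothetical, introduced only for illustration.

```python
def lagged_design(y, lags):
    """Return (rows, targets): for each usable t, the feature row
    [y[t-k] for k in lags] and the target y[t].
    The first max(lags) time steps are dropped because they lack a full row."""
    start = max(lags)
    rows = [[y[t - k] for k in lags] for t in range(start, len(y))]
    targets = [y[t] for t in range(start, len(y))]
    return rows, targets

# Toy series y_t = t, so each lagged value is easy to read off.
y = list(range(30))
rows, targets = lagged_design(y, lags=(1, 24))
```

With lags 1 and 24 (for instance, the previous hour and the same hour on the previous day in hourly data), the first usable target is $y_{24}$, whose features are $y_{23}$ and $y_0$.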

## Multivariate Linear Autoregression

Suppose outcome $y_t$ is a vector of length $m$:

$$
y_t = \begin{pmatrix}y_{1t} \\ y_{2t}\\ \vdots \\ y_{mt} \end{pmatrix}
$$

$$
y_t = \beta_0 + \sum_{i=1}^K \beta_i y_{t-i} + \epsilon_t
$$

$$
\beta_0 = \begin{pmatrix}\beta^{(0)}_{1} \\ \vdots \\ \beta^{(0)}_{m} \end{pmatrix}
$$

$$
\beta_i = \begin{pmatrix} \beta^{(i)}_{11} & \cdots & \beta^{(i)}_{1m} \\
\beta^{(i)}_{21} & \cdots & \beta^{(i)}_{2m} \\
\vdots & \ddots & \vdots \\
\beta^{(i)}_{m1} & \cdots & \beta^{(i)}_{mm}
\end{pmatrix}
$$
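The sketch below shows how the vector intercept $\beta_0$ and the $m \times m$ matrices $\beta_i$ could combine into a one-step prediction, assuming the model has the form $\hat y_t = \beta_0 + \sum_i \beta_i y_{t-i}$. The helper names (`matvec`, `var_predict`) and all numeric values are hypothetical.

```python
def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def var_predict(beta0, betas, history):
    """One-step prediction y_hat_t = beta0 + sum_i betas[i-1] @ y_{t-i}.
    history[0] is y_{t-1}, history[1] is y_{t-2}, and so on."""
    y_hat = list(beta0)
    for beta_i, y_lag in zip(betas, history):
        contrib = matvec(beta_i, y_lag)
        y_hat = [a + b for a, b in zip(y_hat, contrib)]
    return y_hat

# Illustrative example with m = 2 outcomes and 2 lags.
beta0 = [1.0, 0.0]
betas = [
    [[0.5, 0.1], [0.0, 0.2]],  # beta_1
    [[0.0, 0.0], [0.3, 0.0]],  # beta_2
]
history = [[2.0, 10.0], [4.0, 0.0]]  # y_{t-1}, y_{t-2}
y_hat = var_predict(beta0, betas, history)
```

Entry $(r, c)$ of $\beta_i$ controls how much component $c$ of $y_{t-i}$ contributes to component $r$ of the prediction, so off-diagonal entries couple the outcome series to each other.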

## Multivariate Nonlinear Autoregression

$$
\hat y_t = f(y_{t-1}, \ldots, y_{t-K}, \pmb\beta)
$$

## Multivariate Nonlinear Autoregression with Additional Inputs

Suppose $x_t$ is a vector of feature inputs at time $t$.

$$
y_t = f(x_t, y_{t-1}, \ldots, y_{t-K}, \pmb\beta) + \epsilon_t
$$

EXERCISE: 

Write the model above in the form below, with $y$ transformed to the stacked vector $z$:

$$z_t=\begin{pmatrix} y_t \\ y_{t-1}\\\vdots \\ y_{t-K+1} \end{pmatrix}$$

$$
z_t = g(x_t, z_{t-1}, \pmb\beta) + \epsilon_t
$$

Prove they are equivalent.

*NOTE:* the motivation is similar to expressing a higher-order ODE as a system of first-order ODEs.

We have:

$$z_{t-1}=\begin{pmatrix} y_{t-1} \\ y_{t-2}\\\vdots \\ y_{t-K} \end{pmatrix}$$
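One way to see the construction is a sketch of the deterministic part in code: given any $K$-lag update `f`, build a one-lag update `g` on the stacked state. The wrapper `make_g` and the particular scalar `f` below are hypothetical, and the noise term is left out (it would enter only the first block of $z_t$).

```python
def make_g(f, K):
    """Lift y_t = f(x_t, y_{t-1}, ..., y_{t-K}, beta) to a one-lag update
    on the stacked state z_{t-1} = (y_{t-1}, ..., y_{t-K})."""
    def g(x_t, z_prev, beta):
        assert len(z_prev) == K
        y_t = f(x_t, *z_prev, beta)
        # New state: the fresh y_t on top, then the K-1 most recent old values.
        return (y_t,) + tuple(z_prev[:-1])
    return g

# Hypothetical scalar f with K = 2, chosen only for illustration.
def f(x_t, y1, y2, beta):
    return beta[0] * x_t + beta[1] * y1 * y2

g = make_g(f, K=2)
beta = (2.0, 0.5)
z = (3.0, 4.0)            # (y_{t-1}, y_{t-2})
z_next = g(1.0, z, beta)  # (y_t, y_{t-1})
```

The first block of the new state is exactly what `f` returns, and the remaining blocks are copies of old values. That shift structure is the core of the equivalence argument.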


### Linear Case

$$
y_t = \beta_0 + \sum_{k=1}^K\beta_k y_{t-k} + \epsilon_t
$$

where $\epsilon_t \sim \text{MVNorm}(\pmb 0, \Sigma_t)$ and $\Sigma_t$ is the covariance matrix at time $t$. NOTE: although $\Sigma_t$ is $m\times m$, each time step yields an $m\times 1$ realization of $\epsilon_t$.

$$
\begin{pmatrix}y_{t1}\\ \vdots \\ y_{tm}\end{pmatrix} = \begin{pmatrix}\beta^{(0)}_{1} \\ \vdots \\ \beta^{(0)}_{m} \end{pmatrix} + \sum_{k=1}^K \begin{pmatrix} \beta^{(k)}_{11} & \ldots & \beta^{(k)}_{1m}  \\
\vdots & \ddots & \vdots \\
\beta^{(k)}_{m1} & \ldots & \beta^{(k)}_{mm}
\end{pmatrix} \begin{pmatrix}y_{(t-k)1}\\ \vdots \\ y_{(t-k)m} \end{pmatrix} + \begin{pmatrix} \epsilon_{t1}\\ \vdots \\ \epsilon_{tm} \end{pmatrix}
$$

$$
=\begin{pmatrix}\beta^{(0)}_{1} \\ \vdots \\ \beta^{(0)}_{m} \end{pmatrix} + \begin{pmatrix} \beta_1 & \beta_2 & \cdots & \beta_K \end{pmatrix} \begin{pmatrix} y_{t-1}\\ \vdots \\ y_{t-K} \end{pmatrix} + \begin{pmatrix} \epsilon_{t1}\\ \vdots \\ \epsilon_{tm} \end{pmatrix}
$$

Here $\begin{pmatrix} \beta_1 & \cdots & \beta_K \end{pmatrix}$ is the $m \times mK$ matrix formed by placing the coefficient matrices side by side, so that its product with the stacked $mK \times 1$ vector equals $\sum_{k=1}^K \beta_k y_{t-k}$.

Let $z_{t-1} = \begin{pmatrix} y_{t-1}\\ \vdots \\ y_{t-K} \end{pmatrix}$, so

$$
y_t = \beta_0 + \begin{pmatrix} \beta_1 & \beta_2 & \cdots & \beta_K \end{pmatrix}z_{t-1} + \epsilon_t
$$
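To finish the exercise in the linear case, the update for $y_t$ can be extended to the whole stacked state. The following is a sketch of one standard construction (the block companion form), where $I_m$ denotes the $m \times m$ identity matrix and $0$ a zero block of matching size; both symbols are introduced here and are not part of the original notation. The sum $\sum_{k=1}^K \beta_k y_{t-k}$ appears as the first block row acting on $z_{t-1}$:

$$
z_t = \begin{pmatrix}\beta_0 \\ 0 \\ \vdots \\ 0\end{pmatrix} + \begin{pmatrix}\beta_1 & \beta_2 & \cdots & \beta_{K-1} & \beta_K \\ I_m & 0 & \cdots & 0 & 0 \\ 0 & I_m & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I_m & 0\end{pmatrix} z_{t-1} + \begin{pmatrix}\epsilon_t \\ 0 \\ \vdots \\ 0\end{pmatrix}
$$

The first block row reproduces $y_t = \beta_0 + \sum_{k=1}^K \beta_k y_{t-k} + \epsilon_t$, and each remaining block row copies one of $y_{t-1}, \ldots, y_{t-K+1}$ from $z_{t-1}$ into the next position of $z_t$, so the lag $K$ model becomes a lag 1 model in $z$.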

## Linear Case with External Inputs

Ignoring $\epsilon$ for now...

$$
y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t
$$

And,

$$
y_{t-1} = \beta_0 + \beta_1 y_{t-2} + \beta_2 x_{t-1}
$$

So,

$$
y_t = \beta_0 + \beta_1 [\beta_0 + \beta_1 y_{t-2} + \beta_2 x_{t-1}] + \beta_2 x_t
$$
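Expanding the substitution makes the pattern easier to see. Assuming scalar $\beta_1$ for simplicity, collecting terms gives

$$
y_t = \beta_0(1+\beta_1) + \beta_1^2 y_{t-2} + \beta_2\left(x_t + \beta_1 x_{t-1}\right)
$$

and repeating the substitution $n$ times in total gives

$$
y_t = \beta_0\sum_{j=0}^{n-1}\beta_1^j + \beta_1^n y_{t-n} + \beta_2\sum_{j=0}^{n-1}\beta_1^j x_{t-j}
$$

Setting $n=1$ recovers the original one-step model, which is a quick check on the indices. Each older input $x_{t-j}$ is weighted by $\beta_1^j$, so when $|\beta_1| < 1$ the influence of past inputs and of the starting state decays geometrically.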

Thus, with $x_t$ the feature input at time $t$, $y_{t}$ the recurrent state at time $t$, and $d_t$ the observed data at time $t$:

$y_t = f(x_t, y_{t-1}, \beta), \; L(\beta)=\|y_t-d_t\|^2_2$

The function $f$ is understood to be a function of $x$, $y$, and $\beta$.

Suppose, for clarity, that $y_t\in\mathbb R$.

$$
\begin{align*}
    \frac{d}{d\beta} L(\beta)&=\frac{d}{d\beta}(y_t-d_t)^2\\
    &= 2(y_t-d_t)\cdot \frac{d}{d\beta} y_t \\
    &= 2(y_t-d_t)\cdot \frac{d}{d\beta} f(x_t, y_{t-1}, \beta)\\
    &= 2(y_t-d_t)\cdot \left(\frac{\partial f}{\partial y}(x_t, y_{t-1}, \beta)\cdot \frac{d}{d\beta} y_{t-1}+\frac{\partial f}{\partial \beta}(x_t, y_{t-1}, \beta)\right)
\end{align*}
$$

The last step is the chain rule: $\beta$ enters $f$ directly and also through $y_{t-1}$, which itself depends on $\beta$. This requires a recursive relationship:

$$
\frac{d}{d\beta}y_{t} = \frac{\partial f}{\partial y}(x_t, y_{t-1}, \beta)\cdot \frac{d}{d\beta} y_{t-1}+\frac{\partial f}{\partial \beta}(x_t, y_{t-1}, \beta)
$$
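The recursion can be sketched in code by carrying the derivative of $y_t$ with respect to $\beta$ forward alongside $y_t$ itself. The example below assumes the scalar linear model $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t$, a fixed initial state $y_0$ (so its derivative is zero), and a loss on the final time step only. The function names and numeric values are hypothetical, and a finite-difference comparison is included as an independent check.

```python
def forward_with_sensitivity(beta, y0, xs):
    """Run y_t = b0 + b1*y_{t-1} + b2*x_t and carry s_t = dy_t/dbeta along.
    xs[0] is x_1, xs[1] is x_2, and so on; y0 is fixed, so s_0 = 0."""
    b0, b1, b2 = beta
    y, s = y0, [0.0, 0.0, 0.0]
    for x in xs:
        # Direct partial of f with respect to (b0, b1, b2) at (x_t, y_{t-1}).
        direct = [1.0, y, x]
        # Partial of f with respect to y is b1, so s_t = b1 * s_{t-1} + direct.
        s = [b1 * s_j + d_j for s_j, d_j in zip(s, direct)]
        y = b0 + b1 * y + b2 * x
    return y, s

def loss_and_grad(beta, y0, xs, d_final):
    """Squared error at the final step and its gradient 2 * err * dy/dbeta."""
    y, s = forward_with_sensitivity(beta, y0, xs)
    err = y - d_final
    return err ** 2, [2.0 * err * s_j for s_j in s]

# Illustrative values only.
beta = [0.3, 0.8, -0.5]
xs = [1.0, 2.0, 0.5, -1.0]
loss, grad = loss_and_grad(beta, y0=1.0, xs=xs, d_final=0.0)

# Central finite differences as an independent check of the recursion.
h = 1e-6
fd = []
for j in range(3):
    up = list(beta)
    up[j] += h
    dn = list(beta)
    dn[j] -= h
    loss_up = loss_and_grad(up, 1.0, xs, 0.0)[0]
    loss_dn = loss_and_grad(dn, 1.0, xs, 0.0)[0]
    fd.append((loss_up - loss_dn) / (2 * h))
```

Propagating the derivative forward in time like this is one option; reverse-mode differentiation through the unrolled computation is the other common choice and gives the same gradient.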

To update parameters in a machine learning context, we need to calculate the gradient of $L(\beta)$ with respect to $\beta$. Writing $e_t = y_t - d_t$ for the residual:

$$
\nabla_\beta L(\beta)  = 2 e_t^T \nabla_\beta f(x_t, y_{t-1}, \beta)
$$


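Unrolling the recursion by substituting for $y_{t-1}$, and then for $y_{t-2}$:

$y_t = f(x_t, f(x_{t-1}, y_{t-2}, \beta), \beta)$

$y_t = f(x_t, f(x_{t-1}, f(x_{t-2}, y_{t-3}, \beta), \beta), \beta)$

Continuing this, and writing $f_n$ for the $n$-fold composition:

$y_t=f_2(x_t, x_{t-1}, y_{t-2}, \beta)$

$y_t=f_3(x_t, x_{t-1}, x_{t-2}, y_{t-3}, \beta)$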
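The composition can be sketched as a simple loop that applies the one-step update repeatedly. The helper `unroll` and the particular nonlinear `f` below are hypothetical, chosen only to check that the loop and the explicitly nested call agree.

```python
def unroll(f, xs, y_start, beta):
    """Apply y <- f(x, y, beta) over xs in time order (oldest input first).
    With n inputs this evaluates the n-fold composition of f."""
    y = y_start
    for x in xs:
        y = f(x, y, beta)
    return y

# Hypothetical nonlinear f, used only to exercise the composition.
def f(x, y_prev, beta):
    return beta[0] + beta[1] * y_prev * y_prev + beta[2] * x

beta = (0.5, 0.25, 2.0)
x_tm1, x_t, y_tm2 = 1.0, 3.0, 2.0

# Explicit nesting: f(x_t, f(x_{t-1}, y_{t-2}, beta), beta).
nested = f(x_t, f(x_tm1, y_tm2, beta), beta)
# The same value computed by the loop with two inputs.
f2 = unroll(f, [x_tm1, x_t], y_tm2, beta)
```

Because the same $\beta$ is reused at every step, the gradient of the unrolled expression with respect to $\beta$ collects a contribution from each application of `f`.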