# Tutorial #2 :: The Extended Kalman Filter
The Kalman filter (and the Extended Kalman Filter) is an incredibly powerful estimation technique. We will begin by describing the linear Kalman filter (this is what is meant when we say "Kalman filter" as opposed to "Extended" or "Unscented" Kalman filters), and then we'll show how this can be extended to nonlinear cases via the extended Kalman filter. 

The Kalman filter, though it does act as a filter, is most useful as a state estimation technique. In fact, given a linear system whose state and noise variables are zero-mean and Gaussian distributed, the Kalman filter is the optimal estimator in the mean-squared-error sense. We'll provide some rough math below to give some intuition into the code that will follow, but for a more complete understanding of Kalman filters and linear estimation in general, see *Adaptive Filters*, by Ali H. Sayed.

Let's begin! 

## The Normal Equation for Optimal Linear Estimation
Linear estimation deals with estimating a random, possibly vector-valued variable $x$ given observations $y$. By default, we assume a vector-valued quantity is a column vector. Traditionally, these estimators are designed to be optimal in the *least squares sense*. In other words, they aim to minimize the error

<center>$\tilde{x} = x - \hat{x}$</center>

where $\hat{x}$ is the estimate of vector $x$. The *mean-squared-error* or MSE of $\tilde{x}$ is given as $\mathbb{E}\|\tilde{x}\|^2$, which reduces to $\mathbb{E}(\tilde{x}^2)$ for scalar $x$. One intuitive way of reasoning about $\hat{x}$ is that it must be constructed in such a way that $\tilde{x}$ is uncorrelated with observations $y$ or any function thereof, say $f(y)$. If $\tilde{x}$ *were* correlated with $y$, then additional information could still be gleaned from $y$, indicating that $\hat{x}$ is suboptimal. More formally, the following condition must hold:

<center>
$\tilde{x} \perp f(y) = \hat{x}$
</center>

using the simple observation that $\hat{x}$ is itself a function of $y$. This is commonly stated as the *orthogonality principle*, because it implies that $\mathbb{E}\tilde{x}f(y)^* = 0$ for complex valued $x$ and $y$, or simply $\mathbb{E}\tilde{x}f(y)^T = 0$ for real values. Finally, it can be shown that the optimal (minimum-MSE) estimate $\hat{x} = f(y)$ given observations $y$ is the conditional expectation

<center> $\hat{x} = \mathbb{E}(x~|~y)$ </center>

Linear estimation limits $f(y)$ to be a linear (strictly speaking, affine) function, namely

<center> $f(y) = Ky + b$ </center>

for matrix $K$ and vector $b$. More specifically, we often assume zero-mean random variables $x$ and $y$ such that $\hat{x} = f(y) = Ky$. Given the orthogonality principle, the following must hold for an optimal estimator, denoted $K_o$:

<center>
$\begin{align}
\mathbb{E}(\tilde{x}y^T) &= 0\\
\mathbb{E}((x - K_oy)y^T) &= 0\\
\mathbb{E}(xy^T) - K_o\mathbb{E}(yy^T) &= 0\\
\end{align}$
</center>

or, more compactly,

<center>
$K_oR_y = R_{xy}$
</center>

where $R_{xy} = \mathbb{E}(xy^T)$ is the cross-covariance matrix of random variables $x$ and $y$, and $R_y = \mathbb{E}(yy^T)$ is the covariance matrix of $y$. The above equation is often referred to as the normal equation for optimal linear estimation. In summary, an optimal linear estimator $K_o$ is any matrix that satisfies the normal equation; when $R_y$ is invertible, the solution is unique and given by $K_o = R_{xy}R_y^{-1}$.
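
To make the normal equation concrete, here is a minimal numerical sketch. It draws zero-mean samples, forms sample estimates of $R_{xy}$ and $R_y$, solves $K_oR_y = R_{xy}$, and checks that the resulting error is uncorrelated with the observations. The observation matrix `H`, the noise level, and the helper name `optimal_linear_estimator` are assumptions made purely for illustration.

```python
import numpy as np


def optimal_linear_estimator(R_xy, R_y):
    """Solve the normal equation K_o R_y = R_xy for K_o."""
    # Transposing gives R_y^T K_o^T = R_xy^T, which we can hand to a
    # linear solver instead of explicitly inverting R_y.
    return np.linalg.solve(R_y.T, R_xy.T).T


rng = np.random.default_rng(0)
n_samples = 50_000

# Hypothetical setup: a 2-dimensional zero-mean x observed through a
# 3x2 matrix H with additive zero-mean noise (all values are made up).
H = np.array([[1.0, 0.5],
              [0.0, 1.0],
              [1.0, -1.0]])
x = rng.standard_normal((2, n_samples))
v = 0.1 * rng.standard_normal((3, n_samples))
y = H @ x + v

# Sample estimates of the (cross-)covariance matrices.
R_xy = x @ y.T / n_samples
R_y = y @ y.T / n_samples

K_o = optimal_linear_estimator(R_xy, R_y)

# Orthogonality principle: the error should be uncorrelated with y.
x_tilde = x - K_o @ y
cross = x_tilde @ y.T / n_samples
print(np.abs(cross).max())  # numerically zero, up to round-off
```

Because `K_o` is computed from the same sample covariances used in the check, the sample cross-correlation between the error and `y` vanishes up to floating-point round-off, which mirrors the derivation above step by step.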

## Introducing the Linear Kalman Filter
We know from the previous section that we're seeking an optimal estimator $K_o$ such that $\hat{x} = K_oy$, where $K_oR_y = R_{xy}$.

Now we're going to consider a transformation on $y$ into a new variable $e$:

<center>
$e = Ay$
</center>

for a lower triangular and invertible matrix $A$. The key is that we want each entry in the vector $e$ to be uncorrelated with all previous entries. That is, we want to extract from each new observation in $y$ only the information not already contained in the earlier observations, removing any cross-correlation. These entries are called the *innovations* of $y$. Equivalently, we want $R_e$ to be diagonal.

We are now concerned with estimating $x$ from $e$: $\hat{x}_{|e} = R_{xe}R_e^{-1}e$. In fact, we can show algebraically that the estimators are equivalent--that is, estimating $x$ from $e$ is the same as estimating $x$ from $y$, but now $R_e$ is diagonal (block diagonal when each $e_i$ is itself a vector), meaning the estimator can nicely be split into a sum of individual estimators! In fact, we can write the following iterative formula:

<center>
$\begin{align}
\hat{x}_{|N} &= \hat{x}_{|N-1} + \hat{x}_{|e_N}\\
&= \hat{x}_{|N-1} + \mathbb{E}(xe^T_N)R_{e,N}^{-1}e_N
\end{align}$
</center>

where $\hat{x}_{|N}$ denotes the estimate of $x$ using only the first $N$ observations in $y$. At this point we have an iterative, causal way of estimating $x$, but we haven't yet explained how to get the innovations $e$ from $y$--we've just assumed they exist. In fact, we can construct each innovation $e_i$ by *whitening* the observations $y$: $e_i = y_i - \hat{y}_{i|i-1}$, where $\hat{y}_{i|i-1}$ is the optimal linear estimate of $y_i$ given the observations up to index $i-1$. We won't dig into the details here, referring the reader again to *Adaptive Filters* by Sayed.
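
The equivalence claimed above follows from $R_{xe} = R_{xy}A^T$ and $R_e = AR_yA^T$, so that $R_{xe}R_e^{-1}e = R_{xy}A^T(AR_yA^T)^{-1}Ay = R_{xy}R_y^{-1}y$. The sketch below checks this numerically. It is only one possible construction: we assume scalar observations and obtain $A$ from a Cholesky factorization of $R_y$ rescaled to have a unit diagonal, and the covariance values themselves are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint covariance of (x, y): x is a scalar, y has four entries.
G = rng.standard_normal((5, 5))
R = G @ G.T + np.eye(5)          # symmetric positive definite
R_xy = R[0:1, 1:]                # cross-covariance E(x y^T), shape (1, 4)
R_y = R[1:, 1:]                  # covariance of y, shape (4, 4)

# Factor R_y = L D L^T with L unit lower triangular and D diagonal.
C = np.linalg.cholesky(R_y)      # R_y = C C^T, C lower triangular
d = np.diag(C)
L = C / d                        # divide column j by d[j] -> unit diagonal
A = np.linalg.inv(L)             # lower triangular and invertible

# Statistics of the innovations e = A y.
R_e = A @ R_y @ A.T              # diagonal, equal to diag(d**2)
R_xe = R_xy @ A.T

# One hypothetical realization of the observations.
y = rng.standard_normal(4)
e = A @ y

# Batch estimate straight from y.
x_hat_batch = R_xy @ np.linalg.solve(R_y, y)

# Recursive estimate: add one innovation at a time.
x_hat = np.zeros(1)
for N in range(4):
    x_hat = x_hat + R_xe[:, N] / R_e[N, N] * e[N]

print(x_hat, x_hat_batch)        # the two estimates agree
```

Because $A$ has a unit diagonal here, the first innovation is simply the first observation, and each later innovation is the corresponding observation minus a linear combination of the earlier ones, which matches the form $e_i = y_i - \hat{y}_{i|i-1}$.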

## State-Space Models
In order to understand Kalman filters, we must first understand state-space representations. State-space models describe the iterative progression of the state, $x$, as well as the observations $y$ as a function of the state. Specifically, we often write

<center>
$\begin{align}
x_{k+1} &= A_kx_k + B_ku_k \\
y_{k} &= C_kx_k + v_k
\end{align}$
</center>

where $k$ is the discrete time index, $u_k$ is the input to the system, $v_k$ is the measurement noise, and $A$, $B$, and $C$ describe the evolution of the state and the observations--these depend on the physics of the system and the sensors used. Note that the matrices themselves *can* vary with time as well, although they certainly don't have to. Finally, $u_k$ and $v_k$ *must* be zero-mean and white.
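
As a rough illustration of the model above, the following sketch simulates a time-invariant state-space system. Everything specific here is an assumption chosen for the example: a constant-velocity model with position and velocity as the state, a position-only sensor, and made-up noise levels `q` and `r`.

```python
import numpy as np

rng = np.random.default_rng(2)

dt = 0.1                                  # hypothetical sample period
A = np.array([[1.0, dt],
              [0.0, 1.0]])                # state transition (position, velocity)
B = np.array([[0.5 * dt ** 2],
              [dt]])                      # how the input u_k enters the state
C = np.array([[1.0, 0.0]])                # we only observe position
q, r = 0.5, 0.2                           # assumed std devs of u_k and v_k

n_steps = 100
x = np.zeros((n_steps + 1, 2))            # x[k] is the state at time k
y = np.zeros((n_steps, 1))                # y[k] is the observation at time k

for k in range(n_steps):
    u_k = q * rng.standard_normal(1)      # zero-mean, white input
    v_k = r * rng.standard_normal(1)      # zero-mean, white measurement noise
    y[k] = C @ x[k] + v_k                 # y_k = C x_k + v_k
    x[k + 1] = A @ x[k] + B @ u_k         # x_{k+1} = A x_k + B u_k

print(x[-1], y[-1])
```

Here the matrices are held constant over time for simplicity; a time-varying model would index `A`, `B`, and `C` by `k` inside the loop.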

## The Kalman Filter Equations

