# T4 - Filtering & time series
Before we look at the full (multivariate) Kalman filter,
let's get more familiar with time-dependent (temporal/sequential) problems.
$
% START OF MACRO DEF
% DO NOT EDIT IN INDIVIDUAL NOTEBOOKS, BUT IN macros.py
%
\newcommand{\Reals}{\mathbb{R}}
\newcommand{\Expect}[0]{\mathbb{E}}
\newcommand{\NormDist}{\mathcal{N}}
%
\newcommand{\DynMod}[0]{\mathscr{M}}
\newcommand{\ObsMod}[0]{\mathscr{H}}
%
\newcommand{\mat}[1]{{\mathbf{{#1}}}}
%\newcommand{\mat}[1]{{\pmb{\mathsf{#1}}}}
\newcommand{\bvec}[1]{{\mathbf{#1}}}
%
\newcommand{\trsign}{{\mathsf{T}}}
\newcommand{\tr}{^{\trsign}}
\newcommand{\tn}[1]{#1}
\newcommand{\ceq}[0]{\mathrel{≔}}
%
\newcommand{\I}[0]{\mat{I}}
\newcommand{\K}[0]{\mat{K}}
\newcommand{\bP}[0]{\mat{P}}
\newcommand{\bH}[0]{\mat{H}}
\newcommand{\bF}[0]{\mat{F}}
\newcommand{\R}[0]{\mat{R}}
\newcommand{\Q}[0]{\mat{Q}}
\newcommand{\B}[0]{\mat{B}}
\newcommand{\C}[0]{\mat{C}}
\newcommand{\Ri}[0]{\R^{-1}}
\newcommand{\Bi}[0]{\B^{-1}}
\newcommand{\X}[0]{\mat{X}}
\newcommand{\A}[0]{\mat{A}}
\newcommand{\Y}[0]{\mat{Y}}
\newcommand{\E}[0]{\mat{E}}
\newcommand{\U}[0]{\mat{U}}
\newcommand{\V}[0]{\mat{V}}
%
\newcommand{\x}[0]{\bvec{x}}
\newcommand{\y}[0]{\bvec{y}}
\newcommand{\z}[0]{\bvec{z}}
\newcommand{\q}[0]{\bvec{q}}
\newcommand{\br}[0]{\bvec{r}}
\newcommand{\bb}[0]{\bvec{b}}
%
\newcommand{\bx}[0]{\bvec{\bar{x}}}
\newcommand{\by}[0]{\bvec{\bar{y}}}
\newcommand{\barB}[0]{\mat{\bar{B}}}
\newcommand{\barP}[0]{\mat{\bar{P}}}
\newcommand{\barC}[0]{\mat{\bar{C}}}
\newcommand{\barK}[0]{\mat{\bar{K}}}
%
\newcommand{\D}[0]{\mat{D}}
\newcommand{\Dobs}[0]{\mat{D}_{\text{obs}}}
\newcommand{\Dmod}[0]{\mat{D}_{\text{obs}}}
%
\newcommand{\ones}[0]{\bvec{1}}
\newcommand{\AN}[0]{\big( \I_N - \ones \ones\tr / N \big)}
%
% END OF MACRO DEF
$

In [None]:
import resources.workspace as ws
%matplotlib inline
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
plt.ion();

## Hidden Markov Models (HMM)

It is generally reasonable to assume that
tomorrow depends on today, and only today.
For example, if you know the state of the atmosphere today ($k$),
then you don't need to know anything about yesterday ($k-1$)
to make (initialize) your forecast for tomorrow ($k+1$).
In the presence of uncertainty, this is stated formally/symbolically by the **Markov** assumption:
$$p(\x_{k+1} | \x_k, \ldots, \x_0) = p(\x_{k+1} | \x_k) \,. \tag{HMM1}$$

This Markovian/dynamic *transition* density (kernel) is assumed known.
However, we say that the states are ***hidden*** because they are not directly observed.
Instead, we only gain information on them through the observations.
It is generally reasonable to assume that the measurement at time $k$
only depends on the state at time $k$, i.e.
$$p(\y_k | \x_k, \ldots, \x_0, \y_{k-1}, \ldots, \y_1) = p(\y_k | \x_k) \,, \tag{HMM2}$$
which is today's observation likelihood, or *emission* pdf.

These two assumptions form what we call a "Hidden Markov Model", and are illustrated below, for time $k=0, \ldots, K$.
The arrows indicate causality.
   
<img width="80%" src="./resources/HMM.svg" alt='Hidden Markov Models'/>

*PS: If nature is insufficiently modelled, then these assumptions become tenuous.*

*PS: You may have seen a different diagram illustrating HMMs,
   with arrows pointing in all kinds of directions, even forming loops.
   They are the same thing, but focus only on a single one of the time steps shown above,
   for which they show all of the (necessarily discrete) transition probabilities.*
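
To make the two assumptions concrete, here is a minimal sketch that simulates a scalar HMM. The linear-Gaussian transition and emission densities, the parameter values, and the names `simulate_hmm`, `M`, `Q`, `R` are assumptions chosen purely for illustration; they are not prescribed by the text above.

```python
import numpy as np

def simulate_hmm(K, M=0.9, Q=0.5, R=1.0, seed=0):
    """Simulate x_0..x_K and y_1..y_K from an (assumed) scalar linear-Gaussian HMM."""
    rng = np.random.default_rng(seed)
    states = np.zeros(K + 1)
    obs = np.full(K + 1, np.nan)      # there is no observation at k=0
    states[0] = rng.normal(0, 1)      # draw from the initial density p(x_0)
    for k in range(1, K + 1):
        # (HMM1): x_k depends on the past only through x_{k-1}
        states[k] = M * states[k - 1] + np.sqrt(Q) * rng.normal()
        # (HMM2): y_k depends only on x_k
        obs[k] = states[k] + np.sqrt(R) * rng.normal()
    return states, obs

xx_demo, yy_demo = simulate_hmm(K=10)
```

Note that the loop body never looks further back than index `k-1`, which is exactly what the arrows in the diagram express.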

While the above assumptions are very abstract,
they still imbue the problem of estimating $\x_{0:K}$ (shorthand for $\x_0, \ldots, \x_K$) or parts of it,
with a structure that we should be able to exploit,
even on this most abstract of levels.
Indeed, the above HMM assumptions produce the following factorisation
$$ \begin{align*}p(\x_{0:K} | \y_{1:K})
%&\propto p(\x_{0:K}) \, p(\y_{1:K} | \x_{0:K})\\\
&\propto p(\x_0) \prod_{k=1}^K p(\x_k | \x_{k-1})  \, p(\y_k | \x_k) \tag{HMM3}
\,.\end{align*}$$
This decomposition means that $p(\x_{0:K} | \y_{1:K})$ is not an arbitrary function on the space of dimension $\dim(\x_{0:K})$ (imagine discretising and representing it numerically!), but rather a product of $2K + 1$ functions, each involving at most two consecutive states, i.e. defined on a space of dimension at most $2 \dim(\x_k)$.

In addition, the decomposition has a particular sequential structure, which we can further exploit.
Suppose we wish to forecast tomorrow's weather (or atmospheric state), $\x_{k+1}$,
based on all previous observations, $\y_{1:k}$, i.e. $ p(\x_{k+1}|\y_{1:k})$.
Then we need to consider (i.e. integrate over)
all possible $\x_k$ producing a given value of $\x_{k+1}$,
weighted by the probability of that $\x_k$ and the transition kernel:
$$ p(\x_{k+1}|\y_{1:k}) = \int  p(\x_{k+1}| \x_k ) \, p(\x_k|\y_{1:k})\, d \x_k \,. \tag{HMM4}$$
This **forecast** equation is also known as the Chapman-Kolmogorov equation; for continuous-time diffusion processes it is equivalent to the PDE known as the Fokker-Planck equation. Either way, the forecast must be "initialized" by the density $p(\x_k|\y_{1:k})$, which is called today's **analysis**.
It can, in turn, be written using Bayes' rule for today's observations:
$$ p(\x_k| \y_{1:k}) \propto p(\y_k | \x_k ) \, p(\x_k| \y_{1:k-1}) \,. \tag{HMM5}$$
Note that the prior is actually *yesterday's forecast*, $p(\x_k| \y_{1:k-1})$.
Thus, the above two steps (forecast and conditioning/update) can be applied for sequentially increasing time, building on the previous estimates. They are known as the Bayesian **filtering** recursions. The benefit of them being recursive is that we do not need to re-do the whole computation every time, but can build our next estimate based on the previous one.
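
As a rough illustration of the two recursions, the sketch below runs them by brute force on a 1D grid. The scalar linear-Gaussian model, the parameter values, the observations, and the helper names (`gauss`, `forecast`, `analysis`) are all made up for this example.

```python
import numpy as np

def gauss(x, mean, var):
    """Gaussian pdf N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical scalar model: x_{k+1} = M x_k + q,  y_k = x_k + r
M, Q, R = 0.9, 0.5, 1.0
grid = np.linspace(-10, 10, 401)
dx = grid[1] - grid[0]

# Transition kernel: kernel[i, j] = p(x_{k+1}=grid[i] | x_k=grid[j])
kernel = gauss(grid[:, None], M * grid[None, :], Q)

def forecast(p_analysis):
    """Eqn. (HMM4): integrate the kernel against today's analysis."""
    return kernel @ p_analysis * dx

def analysis(p_forecast, y):
    """Eqn. (HMM5): multiply by the likelihood, then normalise."""
    p = gauss(y, grid, R) * p_forecast
    return p / (p.sum() * dx)

p = gauss(grid, 0.0, 4.0)          # assumed initial density p(x_0)
for y in [1.2, 0.7, 1.5]:          # made-up observations y_1, y_2, y_3
    p = analysis(forecast(p), y)   # one filtering cycle

post_mean = np.sum(grid * p) * dx
post_var = np.sum((grid - post_mean) ** 2 * p) * dx
```

Each cycle only needs the previous density `p`, never the full history of observations. The cost of storing `kernel` grows quickly with the state dimension, which is one motivation for parametric filters such as the Kalman filter.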

**Exc (optional):** Derive eqns. (HMM3), (HMM4), and (HMM5) from eqns. (HMM1) and (HMM2).

## A straight-line example

Consider the straight line ($x_k$) for time index $k=1, 2, \ldots, K$, specified by
$$\begin{align}
x_k = a k \, , \tag{1}
\end{align}$$
where the slope ($a$) is unknown.
Also suppose we have observations ($y$) of the line, but corrupted by noise ($r$):
$$\begin{align}
y_k &= x_k + r_k \, , \tag{2}
\end{align}$$
where $r_k \sim \mathcal{N}(0, R)$ for some $R>0$.
The code below sets up an experiment based on eqns. (1) and (2).

In [None]:
# Parameters
a = 0.4
K = 10
R = 1

# Naming convention: xx and yy hold time series of x and y.
xx = np.zeros(K+1) # truth states
yy = np.zeros(K+1) # obs

# Simulate synthetic truth (x) and obs(y)
for k in 1+np.arange(K):
    xx[k] = a*k
    yy[k] = xx[k] + np.sqrt(R)*rnd.randn()

# The obs at k==0 should not be used (since we know xx[0]==0, it is worthless).
yy[0] = np.nan

Let's visualize the experiment:

In [None]:
@ws.interact(k=ws.IntSlider(min=1, max=K))
def plot_experiment(k):
    plt.figure(figsize=(10, 6))
    kk = np.arange(k+1)
    plt.plot(kk, xx[kk], 'k' , label='true state ($x$)')
    plt.plot(kk, yy[kk], 'k*', label='noisy obs ($y$)')

    ### Uncomment this block AFTER doing the Exc 3.4 ###
    # plt.plot(kk, kk*lin_reg(k), 'r', label='Linear regress.')

    ### Uncomment this block AFTER doing the Exc 3.8 ###
    # pw_bb, pw_xxhat = ws.weave_fa(bb, xxhat)
    # pw_kf, pw_ka    = ws.weave_fa(np.arange(K+1))
    # plt.plot(pw_kf[:3*k], pw_bb[:3*k]   , 'c'  , label='KF forecasts')
    # plt.plot(pw_ka[:3*k], pw_xxhat[:3*k], 'b'  , label='KF analyses')
    # #plt.plot(kk, kk*xxhat[k]/k         , 'g--', label='KF extrapolated')

    plt.xlim([0, 1.01*K])
    plt.ylim([-1, 1.2*a*K])
    plt.xlabel('time index (k)')
    plt.ylabel(r'$x$, $y$, and $\hat{x}$')
    plt.legend(loc='upper left')
    plt.show()

### Estimation by linear regression
The observation eqn. (2)
yields the likelihood
$$\begin{align}
p(y_k|x_k) = \mathcal{N}(y_k \mid x_k, R) \, . \tag{3}
\end{align}$$
Hopefully this is intuitive; otherwise, a derivation is provided in T3.

(Least-squares) linear regression minimizes the cost/objective function
$$\begin{align}
J_K(a) = \sum_{k=1}^K (y_k - a k)^2 \, .  \tag{4}
\end{align}$$
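
To get a feel for this cost function, the following sketch minimises it in two independent ways on freshly generated synthetic data: with NumPy's generic least-squares solver, and by brute-force evaluation on a grid of candidate slopes. The variable names (`a_true`, `K_demo`, `R_demo`, `yy_demo`) and the random seed are assumptions for this example only, and the data are not those of the experiment above.

```python
import numpy as np

rng = np.random.default_rng(42)
a_true, K_demo, R_demo = 0.4, 10, 1.0   # same values as the experiment, re-declared here
kk = np.arange(1, K_demo + 1)
yy_demo = a_true * kk + np.sqrt(R_demo) * rng.normal(size=K_demo)

def J(a):
    """Cost function (4), evaluated for the synthetic data above."""
    return np.sum((yy_demo - a * kk) ** 2)

# Generic least-squares solve of kk * a ~ yy_demo (design matrix with a single column)
design = kk[:, None].astype(float)
a_lstsq = np.linalg.lstsq(design, yy_demo, rcond=None)[0][0]

# Brute-force check: evaluate J on a fine grid of candidate slopes
candidates = np.linspace(-1, 2, 3001)
a_grid = candidates[np.argmin([J(a) for a in candidates])]

print(f"lstsq: {a_lstsq:.3f}, grid search: {a_grid:.3f}")
```

The two estimates should agree to within the grid spacing (here 0.001), since $J_K$ is a parabola in $a$ with a single minimum.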


**Exc 3.2:** Use eqns. (1) and (2) and the logarithm to derive $J_K(a)$ from the likelihood $p\, (y_1, \ldots, y_K \;|\; a)$.  
Explain (prove) that their optimum points will be the same.

In [None]:
# ws.show_answer('LinReg deriv a')

**Exc 3.3:** Show that the optimisation yields the estimator
$$\begin{align}
\hat{a} = \frac{\sum_{k=1}^K {k} y_{k}}{\sum_{k=1}^K {k}^2} \, . \tag{6}
\end{align}$$

In [None]:
# ws.show_answer('LinReg deriv b')

**Exc 3.4:** Code up the linear regression estimator (6).  
Then, go back to the animation above and uncomment the block that plots its estimates.
If you did it right, then the estimated line should look reasonable.

In [None]:
def lin_reg(k):
    "Linear regression estimator based on observations y_1, ..., y_k."
    # PS: the observations (yy) are not among the input args
    #     because you can just grab them from the global namespace.
    ### INSERT ANSWER HERE ###
    return a

In [None]:
# ws.show_answer('LinReg_k')

In the following we tackle the same problem, but using the Kalman filter.

## Estimation by the (univariate) Kalman filter (KF)
The KF assumes that the ("true/nature") state, $x_k$, evolves recursively in time (indexed by $k$) according to
$$\begin{align}
x_{k} = \DynMod_{k-1} x_{k-1} + q_{k-1} \, , \tag{Dyn}
\end{align}$$
where $\DynMod_{k-1}$ is called the "dynamical" model, and $q_k$ is a random noise (process) that accounts for "model errors".
For now, $\DynMod_{k-1}$ is just a given number (function of $k$). In later tutorials we will  generalize it to matrices, and eventually nonlinear operators (functions).

####  The forecast step
Suppose that $\quad\quad\;\;\quad x_{k-1} \sim \mathcal{N}(\hat{x}_{k-1}, P_{k-1})$,  
and that (independently) $q_{k-1} \sim \mathcal{N}(0, Q_{k-1})$.

By eqn. (Dyn), the mean of $x_{k}$, i.e. $b_k = \mathbb{E}[x_k]$, is a linear function of the mean of $x_{k-1}$:
$$\begin{align}
b_k &= \DynMod_{k-1} \hat{x}_{k-1} \,, \tag{9}
\end{align}$$
since taking the expectation, $\mathbb{E}$, is a [linear operation](https://en.wikipedia.org/wiki/Expected_value#Properties).  
Meanwhile, by the [properties of variance](https://en.wikipedia.org/wiki/Variance#Propagation),
the model gets squared in the variance of $x_{k}$,
and the model error variance is an added term:
$$\begin{align}
B_k &= \DynMod_{k-1}^2 P_{k-1} + Q_{k-1} \,. \tag{10}
\end{align}$$

It can also be shown that [the sum of two independent Gaussian random variables](https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables#Proof_using_convolutions)
is also Gaussian.
In summary, eqn. (Dyn) yields
$$x_k \sim \mathcal{N}(b_k, B_k) \,, \tag{8}$$
for some $b_k$ and $B_k$ which we can compute from the previous mean and variance, i.e. $\hat{x}_{k-1}$ and $P_{k-1}$.
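
The moment formulae (9) and (10) can be sanity-checked by Monte Carlo sampling, as in the sketch below. The numerical values of the previous mean and variance, of the model, and of $Q$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**6                                  # number of Monte Carlo samples

# Made-up values for the previous analysis and the model
xhat_prev, P_prev = 2.0, 0.5
M, Q = 1.5, 0.3

x_prev = xhat_prev + np.sqrt(P_prev) * rng.normal(size=N)
q_prev = np.sqrt(Q) * rng.normal(size=N)   # drawn independently of x_prev
x_next = M * x_prev + q_prev               # eqn. (Dyn), applied sample by sample

b = M * xhat_prev                          # eqn. (9)
B = M**2 * P_prev + Q                      # eqn. (10)

print(f"mean: sample {x_next.mean():.3f}, formula {b:.3f}")
print(f"var : sample {x_next.var():.3f}, formula {B:.3f}")
```

The sample statistics should match the formulae up to sampling error, which shrinks like $1/\sqrt{N}$.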

In [None]:
# ws.show_answer('RV sums')

**Exc 3.6 (a):** For the KF we want to reformulate our *example problem* of estimating the parameter $a$ as the problem of estimating $x_k$.

Derive the "forecast/dynamical model" $\DynMod_k$, as well as $q_k$, such that eqn. (Dyn) is equivalent to eqn (1).

Then implement it below:

In [None]:
def Mod(k):
    return ### INSERT ANSWER HERE ###

In [None]:
# ws.show_answer('Sequential 2 Recursive')

**Exc 3.6 (b):** The KF may seem like "overkill" for our simple example problem.
But this "heavy machinery" can do a lot more, and will pay off later.
Based on the above, *why* can we say that the KF can do more?

#### The analysis step
"updates" the prior (forecast), $\mathcal{N}(x_k \mid \; b_k,\; B_k)$, given by eqns. (8), (9), (10),  
based on the likelihood, $\quad\;\;\;\, \mathcal{N}(y_k \mid \, x_k, \; R)$,  
into the posterior (analysis), $\; \; \, \mathcal{N}(x_k \mid \; \hat{x}_{k}, \, P_{k})$, given by
the update formulae derived as the Gaussian-Gaussian Bayes' rule in [Exc 2.18 of the previous tutorial](T3%20-%20Bayesian%20inference.ipynb#Exc--2.18-'Gaussian-Gaussian-Bayes':).

This completes the KF cycle, which can then restart with the forecast from $k$ to $k+1$.
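
As a sanity check of the Gaussian-Gaussian update, the sketch below compares it with a brute-force application of Bayes' rule on a grid. The update is written in the "precision" form that also appears in eqns. (11) and (12) further below, i.e. without the Kalman gain. All numerical values and the helper `gauss` are assumptions made for this example.

```python
import numpy as np

def gauss(x, mean, var):
    """Gaussian pdf N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

b, B = 1.0, 2.0      # made-up prior (forecast) mean and variance
y, R = 2.5, 1.0      # made-up observation and its error variance

# Gaussian-Gaussian Bayes' rule in "precision" form (no Kalman gain)
P = 1 / (1 / B + 1 / R)
xhat = P * (b / B + y / R)

# Brute-force Bayes' rule on a grid, for comparison
grid = np.linspace(-15, 15, 3001)
dx = grid[1] - grid[0]
post = gauss(grid, b, B) * gauss(y, grid, R)   # prior times likelihood
post /= post.sum() * dx                        # normalise
grid_mean = np.sum(grid * post) * dx
grid_var = np.sum((grid - grid_mean) ** 2 * post) * dx

print(f"mean: formula {xhat:.4f}, grid {grid_mean:.4f}")
print(f"var : formula {P:.4f}, grid {grid_var:.4f}")
```

Note that the posterior variance `P` is smaller than both `B` and `R`: conditioning on an observation never increases the uncertainty in this Gaussian setting.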

In [None]:
Q = 0 # Dynamical model noise strength

# Allocation
bb    = np.zeros(K+1) # mean estimates -- prior/forecast values
xxhat = np.zeros(K+1) # mean estimates -- post./analysis values
BB    = np.zeros(K+1) # var  estimates -- prior/forecast values
PP    = np.zeros(K+1) # var  estimates -- post./analysis values

**Exc 3.8:** Following the pattern of the code blocks below,
implement the KF to estimate $x_k$ for a given $k$ based on the estimate at $k-1$.

<mark><font size="-1">
<b>NB:</b> for this example, do not use the "Kalman gain" form of the analysis update.
This problem involves the peculiar, unrealistic situation of infinities
(related to "improper priors") at `k==1`, yielding platform-dependent behaviour.
These peculiarities are mainly of academic interest.
</font></mark>

In [None]:
def KF(k):
    "Cycle k of the Kalman filter"
    # Forecast
    if k==1:
        BB[k] = np.inf # The "initial" prior uncertainty is infinite...
        bb[k] = 0      # ... thus the corresponding mean is inconsequential.
    else:
        BB[k] = ### INSERT ANSWER HERE ###
        bb[k] = ### INSERT ANSWER HERE ###
    # Analysis
    PP[k]    = ### INSERT ANSWER HERE ###
    xxhat[k] = ### INSERT ANSWER HERE ###

In [None]:
# ws.show_answer('KF_k')

Run the estimation computations:

In [None]:
for k in 1+np.arange(K):
    KF(k)

**Exc 3.10:** Go back to the animation above and uncomment the block that plots the KF estimates.  
Visually: what is the relationship between the estimates provided by the KF and by linear regression?

In [None]:
# ws.show_answer('LinReg compare')

**Exc 3.12 (optional):** This exercise proves (on paper) the conclusion of the previous exercise.

Firstly, note that the KF forecast step (here with $Q=0$) can be inserted in the analysis step, forming a single pair of recursions:
$$\begin{align}
\hat{x}_k &= P_k \big(y_k/R \;+\; \DynMod_{k-1} \hat{x}_{k-1} / [\DynMod_{k-1}^2 P_{k-1}] \big) \tag{11} \, , \\\
P_k &= 1/\big(1/R \;+\; 1/[\DynMod_{k-1}^2 P_{k-1}]\big) \tag{12} \, .
\end{align}$$

Use this and Exc 3.6 (a) to show that
$$\begin{align}
&\text{firstly,} &P_K &= R\frac{K^2}{\sum_{k=1}^K k^2} \, , \tag{13} \\\
&\text{secondly,} &\hat{x}_K &= K\frac{\sum_{k=1}^K k y_k}{\sum_{k=1}^K k^2} = K \hat{a}_K \tag{14} \, ,
\end{align}$$
where $\hat{a}_K$
is given by eqn. (6).
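
Before attempting the proof, it can be reassuring to check eqns. (13) and (14) numerically. The sketch below iterates the recursions (11) and (12) on freshly generated synthetic data, using $\DynMod_{k-1} = k/(k-1)$ for the straight line (this follows from the value of $\DynMod_k$ stated further below). The names `a_true`, `K_demo`, `R_demo`, `yy_demo` and the seed are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
a_true, K_demo, R_demo = 0.4, 10, 1.0
kk = np.arange(1, K_demo + 1)
yy_demo = a_true * kk + np.sqrt(R_demo) * rng.normal(size=K_demo)  # y_1, ..., y_K

# k = 1: the prior variance is infinite, so the analysis is just the observation
xhat, P = yy_demo[0], R_demo
for k in range(2, K_demo + 1):
    M = k / (k - 1)                       # M_{k-1} for the straight line
    B = M**2 * P                          # forecast variance (Q = 0)
    P_new = 1 / (1 / R_demo + 1 / B)      # eqn. (12)
    xhat = P_new * (yy_demo[k - 1] / R_demo + M * xhat / B)  # eqn. (11)
    P = P_new

P_closed = R_demo * K_demo**2 / np.sum(kk**2)                 # eqn. (13)
xhat_closed = K_demo * np.sum(kk * yy_demo) / np.sum(kk**2)   # eqn. (14)

print(f"P_K   : recursion {P:.6f}, closed form {P_closed:.6f}")
print(f"xhat_K: recursion {xhat:.6f}, closed form {xhat_closed:.6f}")
```

Agreement for one random data set is of course not a proof, but a mismatch would immediately reveal an error in the recursions.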

In [None]:
# ws.show_answer('x_KF == x_LinReg')

#### Exc 3.14:
Set $Q=0$ in eqn (Dyn) so that $x_{k+1} = \DynMod x_k$ *for some constant $\DynMod>1$*.

What does the sequence of $P_k$ converge to?  
*Hint: Start from eqn (12) [eqn (13) is for the straight-line example only] and find its "fixed point".*

In [None]:
# ws.show_answer('Asymptotic P when M>1')

#### Exc 3.15:
Redo Exc 3.14, but assuming  
 * (a) $\DynMod = 1$.
 * (b) $\DynMod < 1$.

In these cases it is not so fruitful to use the fixed point equation.

In [None]:
# ws.show_answer('Asymptotic P when M=1')
# ws.show_answer('Asymptotic P when M<1')

Thus, if $\DynMod>1$ (and $Q=0$), the KF's uncertainty variance, $P_k$, does not converge to 0. This is because, even though you keep gaining more information, the gain is balanced out by the growth in uncertainty during the forecast. On the other hand, if $\DynMod \leq 1$, then the error variance converges to zero.
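
This behaviour can be observed numerically by simply iterating eqn. (12) with a constant model and $Q=0$. The helper `iterate_P`, the initial variance `P0`, the number of steps, and the three values of `M` below are arbitrary choices for illustration.

```python
import numpy as np

def iterate_P(M, R=1.0, P0=1.0, n_steps=200):
    """Iterate eqn. (12) with constant M and Q=0; return P_0, ..., P_n."""
    P_seq = np.empty(n_steps + 1)
    P_seq[0] = P0
    for k in range(1, n_steps + 1):
        P_seq[k] = 1 / (1 / R + 1 / (M**2 * P_seq[k - 1]))
    return P_seq

for M in [1.5, 1.0, 0.5]:
    P_seq = iterate_P(M)
    print(f"M = {M}: P after {len(P_seq) - 1} steps is {P_seq[-1]:.3g}")
```

Plotting `P_seq` against the step index (e.g. on a logarithmic axis) also shows how differently fast the variance decays in the cases $\DynMod = 1$ and $\DynMod < 1$.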

In general, however, $\DynMod$, $Q$, $R$  depend on time, $k$ (often to parameterize exogenous/outside factors/forces/conditions), and there is no limit value that the state distribution (and its parameters) converges to.

A particular exception is the above straight-line example. As we found above, $\DynMod_k =\frac{k+1}{k}$, which depends on time, and yet the limiting value of $P_k$ can be found through eqn. (13). Indeed, eqn. (13) and [the pyramidal sum](https://en.wikipedia.org/wiki/Square_pyramidal_number) can be used to show that $P_k \rightarrow 0$, even though $\DynMod_k > 1$ for all $k$.

**Exc 3.18 (optional):** Set $Q$ to 1 or more in the KF code, and re-compute its estimates. Explain why the KF estimate is now closer to the obs (always at the latest time instance) than the linear regression estimate.

**Exc 3.20 (optional):** Reset $Q$ to 0, and now change $R$ (but don't re-run the simulation of the truth and obs). The KF estimates should not change (in this particular example). Why?

### Summary
The KF consists of two steps:
 * Forecast
 * Analysis
 
In each step, the mean and variance must be updated.

As an example, we saw that the linear regression estimate is reproduced by the KF, although it is a bit tricky to initialize the KF with infinite uncertainty. However, the KF (i.e. state estimation) is much more general.

### Next: [T5 - The Kalman filter (multivariate)](T5%20-%20Kalman%20filter%20(multivariate).ipynb)