In [2]:
#math and linear algebra stuff
import numpy as np

#plots
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (15.0, 15.0)
#mpl.rc('text', usetex = True)
import matplotlib.pyplot as plt
%matplotlib inline

# The recursive least squares algorithm

## Some notation

By default, we consider one-dimensional time series, but the approach can later be extended to multidimensional data.

### Input

Let's begin with the set of input samples:
$$
    \{ u(0), u(1), \dots, u(N) \}
$$
This time series is the input of the RLS algorithm. In the real world, time series are often corrupted by noise, and we are interested in obtaining, for each time point $n$, an estimator $y(n)$ of the original, noise-free signal $d(n)$.

For convenience, we assume that the input is defined at all earlier time points and is zero before the first sample, i.e.:
$$
    u(k) = 0 \quad \forall k < 0
$$

We also call
$$
    \{ d(0), d(1), \dots, d(N) \}
$$
the desired response, which we wish to recover from $u$.

### Output

Let's call
$$
    y(n) = \sum_{k=0}^{M-1} w_k u(n-k)
$$
an estimator of the perfect time series, obtained by applying a linear system, or filter, to the $M$ most recent elements of the input time series (the current sample and the $M-1$ before it).

The $w_k$ are the coefficients of the filter, and we would like to find the ones that are best suited to our estimator.
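
To make the definition of $y(n)$ concrete, here is a minimal sketch that evaluates the sum directly, assuming $u(k) = 0$ for $k < 0$. The function name `filter_output` and the demo values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def filter_output(u, w):
    """Compute y(n) = sum_{k=0}^{M-1} w[k] * u(n - k), with u(k) = 0 for k < 0."""
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    M = len(w)
    # Prepend M-1 zeros so that u(n - k) is defined for the first samples.
    padded = np.concatenate([np.zeros(M - 1), u])
    y = np.empty(len(u))
    for n in range(len(u)):
        # padded[n:n + M] holds u(n-(M-1)), ..., u(n); reverse it to align with w_0, ..., w_{M-1}.
        y[n] = np.dot(w, padded[n:n + M][::-1])
    return y

u_demo = np.array([1.0, 2.0, 3.0, 4.0])
w_demo = np.array([0.5, 0.25])  # hypothetical coefficients, M = 2
y_demo = filter_output(u_demo, w_demo)
```

The same values can be obtained by truncating `np.convolve(u_demo, w_demo)` to the length of the input, which is a convenient cross-check.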


## The (recursive) least squares estimator

The least squares estimator is widely used in many fields of science and engineering. Part of its success comes from the fact that it can be derived from simple Bayesian reasoning under a Gaussian noise model. In our case it reads:

$$
\begin{align*}
    \tilde{y}(n) &= \underset{y}{\operatorname{argmin}} \, \| y(n) - d(n) \|_{2}^{2} \\
    \tilde{y}(n) &= \langle \vec{w}, \vec{u_n} \rangle
    \text{ when } \vec{w} = \underset{\vec{w}}{\operatorname{argmin}} \, \| \langle \vec{w}, \vec{u_n} \rangle - d(n) \|_{2}^{2}
\end{align*}
$$

where $\vec{u_n}$ gathers the $M$ most recent inputs $u(n), \dots, u(n-(M-1))$, so that $\langle \vec{w}, \vec{u_n} \rangle = \sum_{k=0}^{M-1} w_k u(n-k)$.

This approach only takes into account $y(n)$ and $d(n)$ at a single time point to derive the filter $\vec{w}$, which makes it sensitive to a large deviation in a single sample.
We can instead look for a $\vec{w}$ that is optimal over the last $N$ samples, including the current one, and we can even use a forgetting factor $\beta$ so that the quadratic fidelity term gives more weight to the most recent samples:

$$
    0 < \beta(n,i) \leq 1, i=n-(N-1),n-(N-1)+1,\dots,n
$$

An exponentially decreasing $\beta$ may, for instance, be a good choice:

$$
   \beta(n,i) = \lambda^{n-i}, \qquad 0 < \lambda \leq 1
$$

We can then define our **_recursive_** least squares estimator using a few linear operators, starting with the data matrix:

$$ R^{n} =
    \begin{pmatrix}
        u(n-(N-1)-(M-1)) & u(n-(N-1)-(M-2)) & \dots & u(n-(N-1)) \\
        u(n-(N-2)-(M-1)) & u(n-(N-2)-(M-2)) & \dots & u(n-(N-2)) \\
        \vdots           & \vdots  & \ddots & \vdots   \\
        u(n-(M-1))       & u(n-(M-2))  & \dots & u(n)     \\
    \end{pmatrix}
$$

The content of this matrix is straightforward to describe:
* The row associated with time $T$ contains the $M$ inputs from $u(T-(M-1))$ to $u(T)$, i.e. the current sample and the $M-1$ previous ones, which are used to predict the output $d(T)$
* There are $N$ rows, one for each of the $N$ samples $T = n-(N-1), \dots, n$ under consideration
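
As a sketch of how such a data matrix could be assembled, the helper below builds one row per time index in the window and one column per filter coefficient, with the oldest input first in each row. The name `data_matrix` and its parameters `n_taps` (filter length) and `window` (number of samples) are hypothetical, and inputs with a negative index are taken to be zero.

```python
import numpy as np

def data_matrix(u, n, n_taps, window):
    """Stack, for T = n-(window-1), ..., n, the rows [u(T-(n_taps-1)), ..., u(T)].

    Assumes 0 <= n < len(u) and u(k) = 0 for k < 0.
    """
    u = np.asarray(u, dtype=float)
    offset = n_taps + window
    # Zero padding in front makes negative time indices valid.
    padded = np.concatenate([np.zeros(offset), u])
    rows = []
    for T in range(n - (window - 1), n + 1):
        rows.append(padded[offset + T - (n_taps - 1): offset + T + 1])
    return np.array(rows)

u_demo = np.arange(1.0, 9.0)            # u(0) = 1, ..., u(7) = 8
R_demo = data_matrix(u_demo, n=5, n_taps=3, window=4)
R_start = data_matrix(u_demo, n=1, n_taps=3, window=2)  # exercises the zero padding

# With coefficients ordered from last to first, R @ w gives the filter outputs y(T).
w_demo = np.array([0.5, 0.25, 0.125])   # hypothetical w_0, w_1, w_2
y_window = R_demo @ w_demo[::-1]
```

Because each row lists the oldest input first, the coefficient vector has to be ordered from its last coefficient to its first for the product to reproduce $y(T)$.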


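If we also want to include the forgetting function $\beta$ in our model, we can use $B$, a diagonal matrix of size $N$ whose entries follow the same ordering as the samples, from the oldest to the most recent:
$$ B =
    \begin{pmatrix}
        \beta(n,n-(N-1)) & 0 & \dots & 0 \\
        0 & \beta(n,n-(N-2)) & \dots & 0 \\
        \vdots           & \vdots  & \ddots & \vdots   \\
        0 & 0 & \dots & \beta(n,n) \\
    \end{pmatrix}
$$
With the exponential choice $\beta(n,i) = \lambda^{n-i}$, the diagonal reads $\lambda^{N-1}, \lambda^{N-2}, \dots, 1$, which does not depend on $n$.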
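
A minimal sketch of this weighting matrix for the exponential forgetting factor, assuming the diagonal is ordered from the oldest sample to the most recent one. The function name `forgetting_matrix` and the demo values are illustrative assumptions.

```python
import numpy as np

def forgetting_matrix(window, lam):
    """Diagonal matrix with entries lam**(window-1), ..., lam**1, lam**0 (oldest sample first)."""
    exponents = np.arange(window - 1, -1, -1)
    return np.diag(lam ** exponents)

B_demo = forgetting_matrix(window=4, lam=0.5)
# For a diagonal matrix with non-negative entries, the elementwise square root
# is also the matrix square root.
B_sqrt_demo = np.sqrt(B_demo)
```

The most recent sample always gets weight 1, and a smaller $\lambda$ makes older samples fade faster.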

So at each time step $n$, we would like to solve:
$$
\begin{align*}
    \vec{w_n} = &\underset{\vec{w}}{\operatorname{argmin}} \, \| B^{\frac{1}{2}}(R^{n} \vec{w} - \vec{d_n}) \|_{2}^{2} \\
    = &\underset{\vec{w}}{\operatorname{argmin}} \, \| B^{\frac{1}{2}}R^{n} \vec{w} - B^{\frac{1}{2}}\vec{d_n} \|_{2}^{2} \\
\end{align*}
$$
where $\vec{w_n}$ is the current estimate of $\vec{w}$ at step $n$, and
$$\vec{w} = \begin{pmatrix}w_{M-1} \\ w_{M-2} \\ \vdots \\ w_{0}\end{pmatrix}$$
$$\vec{d_n} = \begin{pmatrix}d(n-(N-1)) \\ d(n-(N-2)) \\ \vdots \\ d(n)\end{pmatrix}$$

This is a simple linear least squares problem, which can be solved with the Moore-Penrose pseudoinverse.
We recall that the minimizer of the functional $\frac{1}{2}\|Ax-b\|_2^2$ is $x = (A^t A)^{-1}A^t b$ when $A^t A$ is invertible, which in this case gives:
$$
\begin{align*}
\vec{w_n} = &((B^{\frac{1}{2}}R^{n})^t B^{\frac{1}{2}}R^{n})^{-1} (B^{\frac{1}{2}}R^{n})^t B^{\frac{1}{2}}\vec{d_n} \\
 = & A_n^{-1} b_n
\end{align*}
$$
With
* $A_n = (B^{\frac{1}{2}}R^{n})^t B^{\frac{1}{2}}R^{n} = (R^{n})^t B R^{n}$
* $b_n = (B^{\frac{1}{2}}R^{n})^t B^{\frac{1}{2}}\vec{d_n} = (R^{n})^t B \vec{d_n}$
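
The closed form above can be checked numerically. The sketch below uses synthetic data as a stand-in for $R^{n}$ and $\vec{d_n}$ (the random matrix, `w_true`, the noise level and $\lambda = 0.95$ are all assumptions made for illustration), solves the normal equations $A_n \vec{w_n} = b_n$, and compares the result with `np.linalg.lstsq` applied to the weighted system.

```python
import numpy as np

rng = np.random.default_rng(0)
window, n_taps, lam = 50, 3, 0.95

# Synthetic stand-ins for R^n and d_n: one row per sample, one column per coefficient.
R = rng.standard_normal((window, n_taps))
w_true = np.array([0.2, -0.5, 1.0])
d = R @ w_true + 0.01 * rng.standard_normal(window)

# Diagonal of B, oldest sample first: lam**(window-1), ..., lam**0.
beta = lam ** np.arange(window - 1, -1, -1)

# Normal equations: A = R^t B R and b = R^t B d.
A = R.T @ (beta[:, None] * R)
b = R.T @ (beta * d)
w_normal = np.linalg.solve(A, b)

# Same problem solved as an ordinary least squares fit on the weighted system.
sqrt_beta = np.sqrt(beta)
w_lstsq, *_ = np.linalg.lstsq(sqrt_beta[:, None] * R, sqrt_beta * d, rcond=None)
```

Using `np.linalg.solve` rather than forming $A_n^{-1}$ explicitly is a common choice for numerical stability; both routes should agree up to rounding error when $A_n$ is well conditioned.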


## When the recursivity kicks in

We are now going to look at the link between $\vec{w_n} = A_n^{-1} b_n$ and $\vec{w_{n-1}} = A_{n-1}^{-1} b_{n-1}$.

Let's first take a closer look at
$$ R^{n} =
    \begin{pmatrix}
        u(n-(N-1)-(M-1)) & u(n-(N-1)-(M-2)) & \dots & u(n-(N-1)) \\
        u(n-(N-2)-(M-1)) & u(n-(N-2)-(M-2)) & \dots & u(n-(N-2)) \\
        \vdots           & \vdots  & \ddots & \vdots   \\
        u(n-(M-1))       & u(n-(M-2))  & \dots & u(n)     \\
    \end{pmatrix}
$$

$$ R^{n-1} =
    \begin{pmatrix}
        u(n-N-(M-1)) & u(n-N-(M-2)) & \dots & u(n-N) \\
        u(n-(N-1)-(M-1)) & u(n-(N-1)-(M-2)) & \dots & u(n-(N-1)) \\
        \vdots           & \vdots  & \ddots & \vdots   \\
        u(n-1-(M-1))       & u(n-1-(M-2))  & \dots & u(n-1)     \\
    \end{pmatrix}
$$
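
A quick numerical comparison of the two matrices, under the same assumptions as before (zero input for negative indices, one row per time index). The helper `data_matrix` and the demo signal are hypothetical and only serve to illustrate how the rows of consecutive matrices overlap.

```python
import numpy as np

def data_matrix(u, n, n_taps, window):
    """Stack, for T = n-(window-1), ..., n, the rows [u(T-(n_taps-1)), ..., u(T)]."""
    u = np.asarray(u, dtype=float)
    offset = n_taps + window
    padded = np.concatenate([np.zeros(offset), u])  # u(k) = 0 for k < 0
    rows = [padded[offset + T - (n_taps - 1): offset + T + 1]
            for T in range(n - (window - 1), n + 1)]
    return np.array(rows)

u_demo = np.arange(1.0, 21.0)
R_n = data_matrix(u_demo, n=10, n_taps=3, window=4)
R_prev = data_matrix(u_demo, n=9, n_taps=3, window=4)

# Every row of R^n except the newest one already appears in R^{n-1}, one row lower.
rows_shared = np.array_equal(R_n[:-1], R_prev[1:])
```

Only the last row of `R_n` is new, and only the first row of `R_prev` has been dropped.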

Matrix $R^{n-1}$ is like $R^{n}$ that would have been shifted 