#### Estimator of parameters: frequentist

##### Bias

`Bias` of an estimator $\hat{x}$ of a parameter $x$ (fixed, but unknown) is defined as

$$B(\hat{x})=\mathbb{E}[\hat{x}]-x$$

The expected value is computed over the distribution of the data. An estimator is said to be unbiased when $B(\hat{x})=0$

##### Variance

`Variance` of an estimator $\hat{x}$ of a parameter $x$ is defined as

$$\text{var}(\hat{x})=\sigma_{\hat{x}}^2=\mathbb{E}\left[\left(\hat{x}-\mathbb{E}[\hat{x}]\right)^2\right]$$

We would like variance to be as small as possible
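Both quantities can be approximated by repeating an experiment many times. The sketch below is only an illustration: it assumes noisy observations $y_i=x+v_i$ with Gaussian noise, uses the sample mean as the example estimator, and picks arbitrary values for `x_true`, `sigma`, `n_obs` and `n_trials`

```python
import numpy as np

rng = np.random.default_rng(0)

x_true = 2.0        # fixed but "unknown" parameter (assumed value for the demo)
sigma = 1.0         # noise standard deviation (assumed)
n_obs = 10          # observations per experiment
n_trials = 100_000  # number of repeated experiments

# Each row is one experiment: y_i = x + v_i, v_i ~ N(0, sigma^2)
y = x_true + sigma * rng.standard_normal((n_trials, n_obs))

# Example estimator: the sample mean of each experiment
x_hat = y.mean(axis=1)

bias = x_hat.mean() - x_true  # approximates E[x_hat] - x
variance = x_hat.var()        # approximates E[(x_hat - E[x_hat])^2]

print(f"bias     ~ {bias:.4f}")
print(f"variance ~ {variance:.4f} (theory: {sigma**2 / n_obs:.4f})")
```

For this estimator the bias should be close to zero and the variance close to $\sigma^2/N$, up to Monte Carlo error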

##### Distribution and likelihood

Assume we have a system with input $x$ and output $y$, and

$$y=x+v,\quad v\sim N(0, \sigma^2)$$

Suppose we obtain a `single` observation $y$, then we can write

$$p(y|x)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp \left[-\frac{1}{2\sigma^2}(y-x)^2\right]$$

Here, $x$ is the `parameter` we want to estimate and $y$ is the observed value, or `data`

When $x$ is given, this is a probability `distribution` (i.e., PDF) of $y$ parameterized by $x$

However, when $y$ is given, this is a `likelihood function` of parameter $x$ (which is no longer a distribution and does not have to integrate to one over $x$)

The $\hat{x}$ obtained by maximizing this likelihood function is known as the `maximum likelihood` estimator (MLE)

From the expression, it is not difficult to see that, for a Gaussian distribution, the MLE also happens to minimize the mean squared error (MSE)

$$\hat{x}_{\text{MLE}}=\arg \max_{x}p(y|x) = \arg \min_x (y-x)^2= y$$

Here, the MSE refers to the `empirical (sum of) squared residual(s)` between the data and the estimator

If we consider $y$ as a random variable (due to noise $v$), we can draw many independent samples $y_{1:N}$. In this case

$$\begin{align*}\hat{x}_{\text{MLE}}&=\arg \max_x p(y_1, \cdots, y_N|x)\\
&=\arg \max_x \prod_{i=1}^N p(y_i|x) \\
&=\arg \max_x \sum_{i=1}^N \log p(y_i|x) \\
&=\arg \min_x \sum_{i=1}^N (y_i-x)^2 \\
& \text{take derivative w.r.t. x and set to zero} \\
&=\frac{1}{N}\sum_{i=1}^Ny_i
\end{align*}$$
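As a quick numerical check of this result, the sketch below maximizes the Gaussian log-likelihood by brute force over a grid and compares the maximizer with the sample mean. The values of `x_true`, `sigma`, `n_obs` and the grid range are assumptions chosen only for the demo

```python
import numpy as np

rng = np.random.default_rng(1)
x_true, sigma, n_obs = 1.5, 0.8, 50  # assumed demo values
y = x_true + sigma * rng.standard_normal(n_obs)

def log_likelihood(x, y, sigma):
    """Sum of Gaussian log densities log p(y_i | x) for a candidate x."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (y - x) ** 2 / (2 * sigma**2))

# Brute-force search over a grid of candidate values (step 0.001)
grid = np.linspace(0.0, 3.0, 3001)
ll = np.array([log_likelihood(x, y, sigma) for x in grid])
x_mle_grid = grid[np.argmax(ll)]

# Closed form from the derivation: the sample mean
x_mle_closed = y.mean()

print(x_mle_grid, x_mle_closed)
```

The two values should agree to within the grid resolution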

##### MSE of estimator

Treating $y$ as a random variable, the `MSE of the estimator` is defined as

$$\begin{align*}
\text{MSE}&=\mathbb{E}_{y|x}\left[\left(\hat{x}(y)-x\right)^2\right]\\
&=\int \left(\hat{x}(y)-x\right)^2p(y|x)dy
\end{align*}$$

That is, the `expectation of squared difference` between estimator and true $x$ considering all possible $y$ generated by the distribution parameterized by this $x$

We can evaluate this by sampling $y_i\sim p(y|x)$

$$\begin{align*}
\text{MSE}&\approx\frac{1}{N}\sum_{i=1}^N\left(\hat{x}(y_i)-x\right)^2
\end{align*}$$

It can be shown that, for the Gaussian problem, the MLE also minimizes this MSE among all unbiased estimators
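The sampling estimate can be sketched as follows. Here each sample is a whole data set of `n_obs` observations, and the sample mean (the MLE) is compared with a second unbiased estimator that keeps only the first observation. Both the second estimator and all numerical values are assumptions made for illustration

```python
import numpy as np

rng = np.random.default_rng(2)
x_true, sigma = 0.5, 2.0       # assumed demo values
n_obs, n_trials = 5, 200_000   # data set size and number of sampled data sets

# Each row is one data set drawn from p(y | x)
y = x_true + sigma * rng.standard_normal((n_trials, n_obs))

def mse(x_hat, x):
    """Monte Carlo estimate of E[(x_hat(y) - x)^2]."""
    return np.mean((x_hat - x) ** 2)

x_hat_mean = y.mean(axis=1)  # MLE: sample mean of each data set
x_hat_first = y[:, 0]        # unbiased, but ignores all other observations

mse_mean = mse(x_hat_mean, x_true)
mse_first = mse(x_hat_first, x_true)

print(f"MSE of sample mean      ~ {mse_mean:.3f} (theory: {sigma**2 / n_obs:.3f})")
print(f"MSE of first observation ~ {mse_first:.3f} (theory: {sigma**2:.3f})")
```

Both estimators are unbiased, so their MSE equals their variance, and the sample mean comes out with the smaller value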

#### Estimator of parameters: Bayesian

The Bayesian approach considers $x$ a random variable, rather than a fixed but unknown value, with its own distribution

This means that in the MSE, the expected value is evaluated over the joint PDF $p(x,y)$

$$\begin{align*}
\text{MSE}(\hat{x})&=\iint \left(\hat{x}-x\right)^2p(x, y)dxdy \\
& \text{split using } p(x, y)=p(x|y)p(y)\\
&=\int \left[\int\left(\hat{x}-x\right)^2p(x|y)dx \right]p(y)dy
\end{align*}$$

Since $p(y)\geq 0$, minimizing the term in the bracket for each $y$ also minimizes the Bayesian MSE

We take the derivative of the bracketed term w.r.t. $\hat{x}$

$$\begin{align*}
\frac{\partial}{\partial \hat {x}}\int (x-\hat{x})^2p(x|y)dx &=\int\frac{\partial}{\partial \hat {x}} (x-\hat{x})^2p(x|y)dx \\
&=-2\int(x-\hat{x})p(x|y)dx \\
&=-2 \int xp(x|y)dx +2\hat{x}\int p(x|y) dx \\
&=-2 \int xp(x|y)dx + 2\hat{x}
\end{align*}$$

Set it to zero and we have

$$\hat{x}=\int x p(x|y)dx=\mathbb{E}[x|y]$$

We see that the estimator minimizing the Bayesian MSE is the mean of the `posterior` PDF $\mathbb{E}[x|y]$, after data has been observed
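A small numerical check of this result, assuming a Gaussian prior $x\sim N(\mu_0,\tau^2)$ together with the model $y=x+v$ (the prior, the observed value `y_obs` and the grids are all assumptions for the demo). The posterior is evaluated on a grid, and the candidate minimizing the bracketed term is compared with the posterior mean and with the known closed form for this conjugate Gaussian case

```python
import numpy as np

mu0, tau = 0.0, 2.0  # assumed Gaussian prior x ~ N(mu0, tau^2)
sigma = 1.0          # noise standard deviation in y = x + v
y_obs = 1.2          # a single observed value (assumed)

x = np.linspace(-10.0, 10.0, 20001)  # grid over the parameter
dx = x[1] - x[0]

prior = np.exp(-0.5 * ((x - mu0) / tau) ** 2)
likelihood = np.exp(-0.5 * ((y_obs - x) / sigma) ** 2)
posterior = prior * likelihood
posterior /= posterior.sum() * dx  # normalize so it integrates to one

post_mean = np.sum(x * posterior) * dx  # E[x | y] on the grid

# Bracketed term of the Bayesian MSE as a function of a candidate estimate
candidates = np.linspace(-2.0, 4.0, 601)  # step 0.01
inner = np.array([np.sum((c - x) ** 2 * posterior) * dx for c in candidates])
best = candidates[np.argmin(inner)]

# Closed-form posterior mean for a Gaussian prior and Gaussian likelihood
closed_form = (tau**2 * y_obs + sigma**2 * mu0) / (tau**2 + sigma**2)

print(post_mean, best, closed_form)
```

The three printed values should agree up to the resolution of the candidate grid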

#### State space tracking

For a system with state $x$ and observation $y$, we define `state space` model $f_n$ and `measurement` model $h_n$ as

$$\begin{align*}x_{n+1}&=f_n(x_n, u_n)\\
y_n&=h_n(x_n, v_n)
\end{align*}$$

Primary goal: estimate state of system $x_n$ at some time step $n$, given a sequence of observations $y_{0:n}$, with $u_n, v_n$ being noise terms

Often, compact notation $\hat{x}_{n|n+l}$ is used, denoting estimate of state at step $n$, based on $y_{0:n+l}$ observations

* when $l=-1$: `predicted` estimate
* when $l=0$: `filtered` estimate (our focus)

We want to do it `recursively`, so we don't have to retain all past measurements

Bayesian estimation is all about finding the `(marginal) posterior` $p(x_n|y_{0:n})$ on top of which we can compute any metric we want, mean, median, whatever

We assume `Markov` property for state space model, that is, if we know what state is at step $n$, then there is no additional information gained from knowing previous values of the state

$$p(x_{n+1}|x_{0:n})=p(x_{n+1}|x_n)$$

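Also for the measurement model: given the current state, the observation does not depend on earlier states (nor, as used in the derivations below, on earlier observations)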

$$p(y_n|x_{0:n})= p(y_n|x_{n})$$

However

$$p(x_n|y_{0:n})\neq p(x_n|y_{n})$$

#### Joint Bayesian recursion

We can start with the joint PDF. The goal is to find a recursive equation from $p(x_{0:n-1}|y_{0:n-1})$ to $p(x_{0:n}|y_{0:n})$

First, we split the following

$$\begin{align*}p(x_{0:n}, y_n|y_{0:n-1})&=p(x_{0:n}|y_n,y_{0:n-1})p(y_n|y_{0:n-1})\\
&=p(x_{0:n}|y_{0:n})p(y_n|y_{0:n-1})
\end{align*}$$

We can also split it the other way

$$\begin{align*}p(x_{0:n}, y_n|y_{0:n-1})&=p(y_n|y_{0:n-1}, x_{0:n})p(x_{0:n}|y_{0:n-1}) \\
& \text{Markov property} \\
&=p(y_n|x_n)p(x_{0:n}|y_{0:n-1})\\
&=p(y_n|x_n)p(x_n, x_{0:n-1}|y_{0:n-1})\\
& \text{split the second term} \\
&=p(y_n|x_n)p(x_n|x_{0:n-1}, y_{0:n-1})p(x_{0:n-1}|y_{0:n-1})\\
& \text{Markov property} \\
&=p(y_n|x_n)p(x_n|x_{n-1})p(x_{0:n-1}|y_{0:n-1})\\
\end{align*}$$

Combining the two results, we have our recursive equation for the joint posterior PDF

$$p(x_{0:n}|y_{0:n})=p(x_{0:n-1}|y_{0:n-1})\frac{p(y_n|x_n)p(x_n|x_{n-1})}{p(y_n|y_{0:n-1})}$$

with likelihood $p(y_n|x_n)$ and prior $p(x_n|x_{n-1})$ in the numerator. The denominator is a normalizing constant that does not depend on the state

In theory, we can obtain the marginal through integration

$$p(x_n|y_{0:n})=\int p(x_{0:n}|y_{0:n})dx_{0:n-1}$$

But this defeats the purpose, which is to avoid computing the joint PDF in the first place

#### Marginal Bayesian recursion

Now, we start directly with the marginal, and split it in two ways (with the Markov property)

$$\begin{align*}
p(x_n, y_n|y_{0:n-1})&=p(x_n|y_n,y_{0:n-1})p(y_n|y_{0:n-1})\\
&=p(x_n|y_{0:n})p(y_n|y_{0:n-1})\\
p(x_n, y_n|y_{0:n-1})&=p(y_n|x_n,y_{0:n-1})p(x_n|y_{0:n-1})\\
&=p(y_n|x_n)p(x_n|y_{0:n-1})\\
\end{align*}$$

Combining these two, we have

$$p(x_n|y_{0:n})=\frac{p(y_n|x_n)p(x_n|y_{0:n-1})}{p(y_n|y_{0:n-1})}$$

where $p(y_n|x_n)$ is the likelihood, and $p(y_n|y_{0:n-1})$ does not depend on $x_n$

Now, we deal with $p(x_n|y_{0:n-1})$

$$\begin{align*}
p(x_n|y_{0:n-1})&=\int p(x_n, x_{n-1}|y_{0:n-1})dx_{n-1}\\
&=\int p(x_n| x_{n-1},y_{0:n-1})p(x_{n-1}|y_{0:n-1})dx_{n-1}\\
&\text{Markov property}\\
&=\int p(x_n| x_{n-1})p(x_{n-1}|y_{0:n-1})dx_{n-1}\\
\end{align*}$$

We see we have the prior $p(x_n| x_{n-1})$ and posterior from previous step $p(x_{n-1}|y_{0:n-1})$

So in summary, for marginal Bayesian recursion, we have

* `Prediction` step

$$\begin{align*}
p(x_n|y_{0:n-1})=\int p(x_n| x_{n-1})p(x_{n-1}|y_{0:n-1})dx_{n-1}\\
\end{align*}$$

* `Update` step

$$p(x_n|y_{0:n})=\frac{p(y_n|x_n)}{p(y_n|y_{0:n-1})}p(x_n|y_{0:n-1})$$

These are elegant equations, but not useful if the integral is impossible to evaluate
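One case where the recursion can be evaluated exactly is a state taking finitely many values, so that the prediction integral becomes a sum. The sketch below uses a hypothetical 3-state system with binary observations. The `transition` and `emission` tables, the uniform `prior0` and the observation sequence are all made up for illustration

```python
import numpy as np

# Hypothetical 3-state system; all numbers are invented for the demo
# transition[i, j] = p(x_n = j | x_{n-1} = i)
transition = np.array([[0.8, 0.1, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7]])
# emission[j, k] = p(y_n = k | x_n = j)
emission = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])
prior0 = np.array([1 / 3, 1 / 3, 1 / 3])  # p(x_0) before any observation

def predict(posterior_prev):
    """Prediction: sum over x_{n-1} of p(x_n | x_{n-1}) p(x_{n-1} | y_{0:n-1})."""
    return posterior_prev @ transition

def update(predicted, y):
    """Update: multiply by the likelihood, then divide by p(y_n | y_{0:n-1})."""
    unnormalized = emission[:, y] * predicted
    return unnormalized / unnormalized.sum()

observations = [0, 0, 1, 1, 1]
posterior = update(prior0, observations[0])    # p(x_0 | y_0)
for y in observations[1:]:
    posterior = update(predict(posterior), y)  # p(x_n | y_{0:n})

print(posterior)
```

Only the previous posterior is carried from one step to the next, so past measurements do not need to be stored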

#### Numerical integration

Suppose we wish to evaluate the integral

$$\begin{align*}
p(x_n|y_{0:n-1})=\int_{x_{\min}}^{x_{\max}} p(x_n| x_{n-1})p(x_{n-1}|y_{0:n-1})dx_{n-1}\\
\end{align*}$$

We can approximate the integral using the trapezoidal rule with $n_{b}$ bins

$$\begin{align*}
p(x_n|y_{0:n-1})&=\int_{x_{\min}}^{x_{\max}} p(x_n| x_{n-1})p(x_{n-1}|y_{0:n-1})dx_{n-1}\\
&=\int_{x_{\min}}^{x_{\max}} g(x;x_n)dx \\
&\approx w\left[\frac{g(x_{\min};x_n)+g(x_{\max};x_n)}{2}+\sum_{k=1}^{n_b-1}g(x_{\min}+kw;x_n)\right]
\end{align*}$$

where $w=(x_{\max}-x_{\min})/n_b$ is the width of each bin, and $g(x;x_n)=p(x_n|x)p(x|y_{0:n-1})$ is the integrand with $x$ standing in for $x_{n-1}$
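A minimal sketch of this approximation for the prediction step, assuming a random-walk transition $p(x_n|x_{n-1})=N(x_{n-1}, q)$ and a Gaussian previous posterior $N(m, P)$. These models, the helper names and the integration limits are assumptions chosen because the exact answer, $N(m, P+q)$, is known and can be used as a check

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def trapezoid(g, x_min, x_max, n_b):
    """Trapezoidal rule with n_b bins of width w = (x_max - x_min) / n_b."""
    w = (x_max - x_min) / n_b
    interior = sum(g(x_min + k * w) for k in range(1, n_b))
    return w * ((g(x_min) + g(x_max)) / 2 + interior)

# Assumed models for the demo
m, P = 1.0, 0.5  # previous posterior p(x_{n-1} | y_{0:n-1}) = N(m, P)
q = 0.3          # transition p(x_n | x_{n-1}) = N(x_{n-1}, q)

def predicted_density(x_n, x_min=-10.0, x_max=10.0, n_b=400):
    """Approximate p(x_n | y_{0:n-1}) at a single point x_n."""
    def g(x_prev):
        return gaussian_pdf(x_n, x_prev, q) * gaussian_pdf(x_prev, m, P)
    return trapezoid(g, x_min, x_max, n_b)

x_n = 1.4
numeric = predicted_density(x_n)
exact = gaussian_pdf(x_n, m, P + q)  # convolution of two Gaussians

print(numeric, exact)
```

Note that one such integral is needed for every value of $x_n$ at which the predicted density is evaluated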