### The problem

Let's prove another really useful result that crops up in many areas of Bayesian probability, notably in the derivation of the Kalman filter.

Suppose we have a situation where

$$X \sim N(\mu, \sigma^2)$$

and the distribution of $Y$ <b>conditional on the value of</b> $X$ is

$$ Y | X \sim N(aX + b, \tau^2)$$

(in words: when we know the value of $X$, $Y$ is Normally distributed with a mean of $aX + b$, and a variance of $\tau^2$)

The question we want to answer is: in this situation, what is the conditional distribution of $X$ given $Y$?

<hr>

### Bayes' rule

As you might imagine, when we want to "invert the conditionality" like this (go from $P(Y | X)$ to $P(X | Y)$), Bayes' rule is the key.

By Bayes' rule, 

$$
\begin{align}
P(X = x | Y = y) = \frac{ P(Y = y | X = x) P(X = x) }{ P(Y = y) }
\end{align}
$$

### Shape of the distribution

Notice that the distribution we are interested in is for $X$. What we care about is the _shape_ of this distribution: how it changes as $x$ changes. And only terms on the right-hand side that involve $x$ can influence that.

So, in fact, we can forget about the denominator for now, because it doesn't have any $x$ terms in it, so it is effectively just a multiplicative constant. It affects the _height_ of the final distribution, but not its shape.

(Since we know that the final distribution we find must integrate to 1, we can use that condition to determine its height later.)

So
$$
P(X = x | Y = y) \propto P(Y = y | X = x) P(X = x)
$$

### Expand terms

We can now expand the two terms on the right-hand side:

$$
\begin{align}
P(X = x | Y = y) & \propto P(Y = y | X = x) P(X = x) \\
            & = \frac{1}{\sqrt{2 \pi} \tau} \exp \left\{ -\frac{1}{2} \frac{(y - (ax + b))^2}{\tau^2} \right\} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right\}  \\
             & \propto \exp \left\{ -\frac{1}{2} \left[ \frac{(y - (ax + b))^2}{\tau^2} + \frac{(x - \mu)^2}{\sigma^2} \right] \right\} \\
\end{align}
$$

In the last line here we have once again discarded any constant terms - remember, it's only terms that involve $x$ that can affect the shape of the distribution.

### Focus on the exponent

Let's focus on the term inside the square brackets in the exponent. Let's expand all the brackets.

$$
\begin{align}
E_1 & = \frac{(y - (ax + b))^2}{\tau^2} + \frac{(x - \mu)^2}{\sigma^2} \\
  & = \frac{1}{\tau^2 \sigma^2} \left[ \sigma^2 (y - (ax + b))^2 + \tau^2 (x - \mu)^2 \right] \\
  & = \frac{1}{\tau^2 \sigma^2} \left[ \sigma^2 \left(y^2 - 2(ax + b)y + (ax + b)^2 \right) + \tau^2 \left(x^2 - 2x\mu + \mu^2 \right) \right] \\
  & = \frac{1}{\tau^2 \sigma^2} \left[ \sigma^2 \left(y^2 - 2axy -2by + a^2x^2 + 2abx + b^2 \right) + \tau^2 \left(x^2 - 2x\mu + \mu^2 \right) \right] \\
\end{align}
$$

### Collect terms

Now we want to separate all the terms involving $x^2$, all those involving $x$, and all constant terms.

$$
\begin{align}
E_1 & = \frac{1}{\tau^2 \sigma^2} \left[ (\sigma^2 a^2 + \tau^2) x^2 - 2 (\sigma^2ay - \sigma^2 ab + \tau^2\mu) x + \sigma^2 (y^2 - 2by + b^2) + \tau^2 \mu^2 \right] \\
\end{align}
$$

Notice that the last two terms here, $\sigma^2 (y^2 - 2by + b^2)$ and $\tau^2 \mu^2$, once again do not involve $x$. When $e$ is raised to the power of these terms in the full pdf expression, they simply become another multiplicative constant. So we can forget about them - they do not affect the _shape_ of the distribution we are trying to find.

So let's focus on 

$$
\begin{align}
E_2 & = \frac{1}{\tau^2 \sigma^2} \left[ (\sigma^2 a^2 + \tau^2) x^2 - 2 (\sigma^2ay - \sigma^2 ab + \tau^2\mu) x \right] \\
    & = \frac{(\sigma^2 a^2 + \tau^2)}{\tau^2 \sigma^2} \left[ x^2 - 2 \left( \frac{\sigma^2a(y - b) + \tau^2\mu}{\sigma^2 a^2 + \tau^2} \right)  x \right] \\
    & = A \left[ x^2 - 2 B x \right] \\
\end{align}
$$

where 

$$
\begin{align}
A & = \frac{(\sigma^2 a^2 + \tau^2)}{\tau^2 \sigma^2}  \\
B & = \frac{\sigma^2a(y - b) + \tau^2\mu}{\sigma^2 a^2 + \tau^2} \\
\end{align}
$$

### Complete the square

Now we can "complete the square". This is a really useful technique that allows us to simplify an expression with both $x^2$ and $x$ terms into an expression with a single $(x - \dots)^2$ term.

$$
\begin{align}
E_2 & = A \left[ x^2 - 2 B x \right] \\
    & = A \left[ (x - B)^2 - B^2 \right] \\
    & = A (x - B)^2 - A B^2 \\
\end{align}
$$

To see that the first and second lines of the above equation are equal, just expand the round bracket in the second line.

Once again, because $A$ and $B$ do not contain $x$, we can discard the final $A B^2$ term here (it just becomes a multiplicative constant in front of our distribution). 

So we have found that

$$
P(X = x | Y = y) \propto \exp \left\{ -\frac{1}{2} A (x - B)^2 \right\}
$$
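As a quick sanity check on the algebra so far, here is a minimal numerical sketch. The parameter values are arbitrary choices made for illustration (they are not from the derivation), and the function names `e1` and `a_and_b` are hypothetical helpers. If the collecting of terms and the completing of the square are right, then $E_1 - A(x - B)^2$ should come out as the same number whatever $x$ we plug in.

```python
def e1(x, y, mu, sigma2, a, b, tau2):
    """The full bracketed exponent E_1 from the derivation."""
    return (y - (a * x + b)) ** 2 / tau2 + (x - mu) ** 2 / sigma2


def a_and_b(y, mu, sigma2, a, b, tau2):
    """The coefficients A and B as defined in the text."""
    denom = sigma2 * a ** 2 + tau2
    A = denom / (tau2 * sigma2)
    B = (sigma2 * a * (y - b) + tau2 * mu) / denom
    return A, B


# Arbitrary illustrative values (assumptions, not taken from the text).
params = dict(y=1.3, mu=0.5, sigma2=2.0, a=1.5, b=-0.4, tau2=0.7)
A, B = a_and_b(**params)

# Whatever is left over after subtracting A (x - B)^2 should not depend on x.
leftovers = [e1(x, **params) - A * (x - B) ** 2 for x in (-3.0, 0.0, 0.25, 4.0)]
spread = max(leftovers) - min(leftovers)
print(spread)  # should be zero up to floating-point rounding
```

The leftover is exactly the collection of constant terms we discarded, which is why it is safe to ignore when we only care about shape.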

### Relabel 

Notice that in the case we are considering, $\sigma^2$, $\tau^2$ and $a^2$ are all $> 0$.

$\sigma^2$ and $\tau^2$ are variances, so must be greater than zero for the problem to be interesting. $a$ must be non-zero, otherwise the mean of $Y$ does not actually depend on $X$ at all, and observing $Y$ tells us nothing about $X$.

This means that the quantity $A$ is always greater than 0. So it's legitimate to define 

$$
D = 1 / \sqrt{A} \iff D^2 = 1 / A \iff A = 1/D^2
$$

This gives

$$
\begin{align}
D^2 = \frac{\tau^2 \sigma^2}{\sigma^2 a^2 + \tau^2}
\end{align}
$$

and we then have

$$
\begin{align}
P(X = x | Y = y) \propto \exp \left\{ -\frac{1}{2} \frac{(x - B)^2}{D^2} \right\}
\end{align}
$$

### A fantastically useful trick

Here, we use another really useful trick that mathematicians working with probability distributions use a lot: we _recognise_ that this is the pdf of the Normal distribution. 

We have shown that $X|Y$ has a pdf which is proportional to 

$$
\begin{align}
\exp \left\{ -\frac{1}{2} \frac{(x - B)^2}{D^2} \right\}
\end{align}
$$

and we know that this is the shape of the pdf of a Normally distributed random variable with mean $B$ and variance $D^2$.

### Constants taken care of

So all of those multiplicative constants we discarded along the way don't matter at all. We know that the shape of the distribution is Normal($B, D^2$), and so we know that the normalising constant must be

$$
\begin{align}
\frac{1}{\sqrt{2 \pi} D}
\end{align}
$$

If we had retained all those multiplicative constants throughout our calculation, and gone to the effort of simplifying them, the result would have been exactly $\frac{1}{\sqrt{2 \pi} D}$, since this is the only constant that makes the final distribution integrate to 1.
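We can check this claim numerically. The sketch below integrates the unnormalised shape $\exp\{-\frac{1}{2}(x - B)^2 / D^2\}$ with a simple midpoint rule and compares the reciprocal of the area with $\frac{1}{\sqrt{2 \pi} D}$. The values of `B` and `D`, the integration range and the helper names are assumptions chosen purely for illustration.

```python
import math


def unnormalised(x, B, D):
    """The shape of the pdf, with no normalising constant."""
    return math.exp(-0.5 * (x - B) ** 2 / D ** 2)


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Illustrative values only (assumptions, not from the text).
B, D = 0.8, 1.7

# Twelve standard deviations either side captures essentially all the area.
area = integrate(lambda x: unnormalised(x, B, D), B - 12 * D, B + 12 * D)

constant = 1.0 / area                              # constant that makes the area 1
expected = 1.0 / (math.sqrt(2 * math.pi) * D)      # the textbook Normal constant
print(constant, expected)
```

The two printed numbers should agree to many decimal places, whatever values of `B` and `D` we pick.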

So we're done!

$$
X | Y \sim N \left(B, D^2 \right)
$$

### Final result

So, we have shown that when

$$X \sim N(\mu, \sigma^2)$$

and the distribution of $Y$ <b>conditional on the value of</b> $X$ is

$$ Y | X \sim N(aX + b, \tau^2)$$

then the distribution of $X$ conditional on the value of $Y$ is

$$X | Y \sim N(\mu', \sigma'^2)$$

where 

$$
\begin{align}
\mu' = \frac{\sigma^2 a(y - b) + \tau^2 \mu}{\sigma^2 a^2 + \tau^2}
\end{align}
$$

and

$$
\begin{align}
\sigma'^2 = \frac{\tau^2 \sigma^2}{\sigma^2 a^2 + \tau^2}
\end{align}
$$
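To make the result concrete, here is a minimal sketch that codes up the two formulas and checks them against a brute-force application of Bayes' rule on a grid. The function names, the parameter values and the grid settings are all assumptions made for illustration; only the formulas for $\mu'$ and $\sigma'^2$ come from the derivation above.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def posterior(y, mu, sigma2, a, b, tau2):
    """Mean and variance of X | Y = y, using the closed-form result."""
    denom = sigma2 * a ** 2 + tau2
    mean = (sigma2 * a * (y - b) + tau2 * mu) / denom
    var = tau2 * sigma2 / denom
    return mean, var


# Illustrative values only (assumptions, not from the text).
mu, sigma2, a, b, tau2, y = 0.5, 2.0, 1.5, -0.4, 0.7, 1.3
post_mean, post_var = posterior(y, mu, sigma2, a, b, tau2)

# Brute force: evaluate likelihood * prior on a fine grid of x values,
# then normalise numerically instead of using the closed form.
n = 20_000
lo, hi = mu - 10 * math.sqrt(sigma2), mu + 10 * math.sqrt(sigma2)
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
weights = [normal_pdf(y, a * x + b, tau2) * normal_pdf(x, mu, sigma2) for x in xs]
total = sum(weights)
grid_mean = sum(x * w for x, w in zip(xs, weights)) / total
grid_var = sum((x - grid_mean) ** 2 * w for x, w in zip(xs, weights)) / total

print(post_mean, grid_mean)
print(post_var, grid_var)
```

The grid-based mean and variance should match the closed-form values closely, without the grid calculation ever using the fact that the answer is Normal.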

### Special Cases

There are a couple of special cases of the above formula that will be very useful in the proof of the 1D Kalman filter. Let's look at them now.

### Special case: 1

Suppose we have a state $X$

$$X \sim N(\mu, \sigma^2)$$

and an observation $Z$ of $X$, that is centered on the true state, but with some error:

$$Z | X \sim N(X, r^2)$$

Then we can replace $Y$ with $Z$ in the formulas above, and we have $a = 1, b = 0$ and $\tau^2 = r^2$. Therefore

$$X | Z \sim N(\mu', \sigma'^2)$$

where 

$$
\begin{align}
\mu' = \frac{\sigma^2 z + r^2 \mu}{\sigma^2 + r^2}
\end{align}
$$

and

$$
\begin{align}
\sigma'^2 = \frac{r^2 \sigma^2}{\sigma^2 + r^2}
\end{align}
$$

This result will be very useful in deriving the Kalman filter equations.

We can rearrange the formula for $\mu'$ a bit:

$$
\begin{align}
\mu' & = \frac{\sigma^2 z + r^2 \mu}{\sigma^2 + r^2} \\
     & = \frac{\sigma^2 z + r^2 \mu + \sigma^2 \mu - \sigma^2 \mu}{\sigma^2 + r^2} \\
     & = \frac{\sigma^2 z + (\sigma^2 + r^2) \mu - \sigma^2 \mu}{\sigma^2 + r^2} \\
     & = \mu + \frac{\sigma^2 z - \sigma^2 \mu}{\sigma^2 + r^2} \\
     & = \mu + \frac{\sigma^2 }{\sigma^2 + r^2} (z - \mu)\\
\end{align}
$$

#### Kalman gain

So we can see here that when we update from our prior $P(X=x)$ to our posterior $P(X=x | Z=z)$, our new mean is the prior mean $\mu$, adjusted by an amount equal to

$$
\begin{align}
\frac{\sigma^2}{\sigma^2 + r^2}(z - \mu)
\end{align}
$$

The quantity 

$$
\begin{align}
K = \frac{\sigma^2}{\sigma^2 + r^2}
\end{align}
$$

tells us how far we move _from_ our prior mean $\mu$ _towards_ the observation $z$. This quantity is often called the "Kalman gain" when using a Kalman filter.

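- if the Kalman gain is 1 (the limiting case of a noise-free observation, $r^2 = 0$), we trust our observation completely, and our posterior is centered on the observation itself
- if the Kalman gain is 0 (the limit of an infinitely noisy observation, $r^2 \to \infty$), we do not trust our observation at all, and our posterior is centered on the prior mean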
- if the Kalman gain $K$ is between 0 and 1, our posterior mean is located a proportion $K$ towards the observation from the prior mean
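
The update in terms of the gain can be sketched in a few lines. This is an illustrative helper, not a full Kalman filter: the function name `measurement_update` and the numbers used are assumptions, and only the formulas for $K$, $\mu'$ and $\sigma'^2$ come from the text.

```python
def measurement_update(mu, sigma2, z, r2):
    """Update a N(mu, sigma2) prior on X with an observation z, where Z | X ~ N(X, r2)."""
    K = sigma2 / (sigma2 + r2)                 # Kalman gain
    post_mean = mu + K * (z - mu)              # move a proportion K towards z
    post_var = r2 * sigma2 / (sigma2 + r2)     # posterior variance from the text
    return K, post_mean, post_var


# Illustrative values only: a fairly vague prior and a fairly precise observation.
mu, sigma2 = 10.0, 4.0
z, r2 = 12.0, 1.0
K, post_mean, post_var = measurement_update(mu, sigma2, z, r2)

# The same mean computed with the original weighted-average form.
weighted_mean = (sigma2 * z + r2 * mu) / (sigma2 + r2)

print(K, post_mean, post_var, weighted_mean)
```

With these numbers the prior is four times as uncertain as the observation, so the gain is close to 1 and the posterior mean lands much nearer to $z$ than to $\mu$.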

#### Precision instead of variance

We can also rearrange the formula for the posterior variance. Suppose that we had parameterised our Normal distributions not by their variances, but by their _precisions_, where 

$$
\begin{align}
\text{precision} = 1/\text{variance}
\end{align}
$$

Therefore,
- precision is large when variance is small (that is, when we know the likely state very accurately)
- precision is small when variance is large

Then the precision for our posterior is given by 

$$
\begin{align}
\frac{1}{\sigma'^2} 
& = \frac{\sigma^2 + r^2}{r^2 \sigma^2} \\
& = \frac{1}{r^2} + \frac{1}{\sigma^2} 
\end{align}
$$

Incredibly, the precision for our posterior is simply the sum of 
- the precision of our prior, and 
- the precision of the observation

What a beautifully simple formula!

Note that in this special case, the posterior precision is _larger_ than the prior precision (by $1/r^2$), so our posterior is a narrower distribution than our prior. This makes sense: we can be more confident about $X$ when we have the extra information provided by the observation $Z$.
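A short numerical sketch of the same fact, using made-up values for $\sigma^2$ and $r^2$ (assumptions for illustration only): the reciprocal of the posterior variance should equal the sum of the two reciprocals.

```python
def posterior_variance(sigma2, r2):
    """Posterior variance for the observation model Z | X ~ N(X, r2)."""
    return r2 * sigma2 / (sigma2 + r2)


# Illustrative values only (assumptions, not from the text).
sigma2, r2 = 4.0, 1.0

prior_precision = 1.0 / sigma2
observation_precision = 1.0 / r2
posterior_precision = 1.0 / posterior_variance(sigma2, r2)

# Posterior precision vs. the sum of prior and observation precisions.
print(posterior_precision, prior_precision + observation_precision)
```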

### Special case: 2

Suppose we have a state $X_1$

$$X_1 \sim N(\mu_1, \sigma_1^2)$$

and a next state $X_2$ that depends on $X_1$ via the relationship

$$ X_2 | X_1 \sim N(X_1 + u, q^2)$$

Then we can replace $Y$ in our original derivation with $X_2$ and $X$ with $X_1$ (so that $\mu = \mu_1$ and $\sigma^2 = \sigma_1^2$), and we have $a = 1, b = u$ and $\tau^2 = q^2$. Therefore

$$ X_1 | X_2 \sim N(\mu', \sigma'^2)$$

where 

$$
\begin{align}
\mu' = \frac{\sigma_1^2 (x_2 - u) + q^2 \mu_1}{\sigma_1^2 + q^2}
\end{align}
$$

and

$$
\begin{align}
\sigma'^2 = \frac{q^2 \sigma_1^2}{\sigma_1^2 + q^2}
\end{align}
$$

This will be a very useful result in the derivation of the Kalman smoother.
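
As a minimal sketch of this second special case, the helper below computes the distribution of $X_1$ given an observed value $x_2$ of the next state. The function name `condition_on_next_state`, the argument names (`mu1` and `sigma1_sq` for the mean and variance of $X_1$) and the numbers are assumptions for illustration; the formulas are the general result with $a = 1$, $b = u$ and $\tau^2 = q^2$.

```python
def condition_on_next_state(mu1, sigma1_sq, x2, u, q2):
    """Mean and variance of X_1 | X_2 = x2, where X_2 | X_1 ~ N(X_1 + u, q2)."""
    denom = sigma1_sq + q2
    mean = (sigma1_sq * (x2 - u) + q2 * mu1) / denom
    var = q2 * sigma1_sq / denom
    return mean, var


# Illustrative values only (assumptions, not from the text).
mu1, sigma1_sq = 0.0, 1.0   # prior on X_1
u, q2 = 1.0, 1.0            # known shift and transition variance
x2 = 3.0                    # value of the next state

mean, var = condition_on_next_state(mu1, sigma1_sq, x2, u, q2)
print(mean, var)
```

Here $x_2 - u = 2$ is where $X_1$ "should" have been if the transition were noise-free, and because the prior and transition variances are equal, the result sits halfway between that value and the prior mean.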