# Conditional Gaussian Distributions

An important property of Gaussian distributions is that if two sets of random variables are jointly Gaussian, then the conditional distribution of one set given the other is also Gaussian. This section derives that property.


## Showing $p(x_a | x_b)$ is Gaussian

First consider some $x \in \mathbb{R}^D$ with distribution $\mathcal{N}(x|\mu, \Sigma)$. Let us partition $x$ into two disjoint subsets: $x_a$, containing $M$ components, and $x_b$, containing the remaining $D - M$ components, such that

$$
x = \begin{pmatrix}x_a \\ x_b \end{pmatrix}
$$

and similarly for the parameters of $\mathcal{N}(x|\mu, \Sigma)$, namely the mean

$$
\mu = \begin{pmatrix}\mu_a \\ \mu_b \end{pmatrix}
$$

and

$$
\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
$$

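We also define the **precision matrix** as the inverse of the covariance matrix, $\Lambda = \Sigma^{-1}$, partitioned into blocks $\Lambda_{aa}$, $\Lambda_{ab}$, $\Lambda_{ba}$ and $\Lambda_{bb}$ in the same way as $\Sigma$. Because $\Sigma$ is symmetric, $\Lambda$ is symmetric as well, as proven in the appendix.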

Using this partitioning we can deduce the mean $\mu_{a|b}$ and covariance $\Sigma_{a|b}$ of the conditional distribution $p(x_a | x_b)$ by considering the exponent of the joint Gaussian, which is $-\frac{1}{2}$ times the squared **Mahalanobis distance** $\Delta^2 = (x - \mu)^T\Sigma^{-1}(x - \mu)$. Written in terms of the partitioned precision matrix, this is

$$
-\frac{1}{2}\Delta^2 = -\frac{1}{2}(x_a - \mu_a)^T\Lambda_{aa}(x_a - \mu_a) - \frac{1}{2}(x_a - \mu_a)^T\Lambda_{ab}(x_b - \mu_b) -\frac{1}{2}(x_b - \mu_b)^T\Lambda_{ba}(x_a - \mu_a) - \frac{1}{2}(x_b - \mu_b)^T\Lambda_{bb}(x_b - \mu_b)
$$

For a fixed value of $x_b$, the conditional $p(x_a | x_b)$ is proportional to the joint $p(x_a, x_b)$, and the exponent above is a quadratic function of $x_a$. Thus $p(x_a | x_b)$ is a Gaussian distribution as well!

## Deriving Mean and Covariance

Next we find the mean $\mu_{a|b}$ and covariance $\Sigma_{a|b}$ of this conditional Gaussian. We can "complete the square" by matching the terms of the exponent above that depend on $x_a$ against the corresponding terms in the exponent of a general Gaussian, which can be expanded as

$$
-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) = -\frac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu + \text{const}
$$

where all terms independent of $x$ are collected into a constant for simplicity. Thus, when an exponent is written in this form, the coefficient matrix of the second-order term in $x$ is the inverse covariance $\Sigma^{-1}$, and the coefficient of the linear term in $x$ is $\Sigma^{-1}\mu$, from which the mean can be read off.

We first pick out the second order term in $x_a$,

$$
-\frac{1}{2}x_a^T\Lambda_{aa}x_a
$$

so $\Lambda_{aa}$ is the inverse covariance of $p(x_a|x_b)$, that is, $\Sigma_{a|b} = \Lambda_{aa}^{-1}$.

Collecting all the terms linear in $x_a$, and using $\Lambda_{ba}^T = \Lambda_{ab}$, we get

$$
x_a^T (\Lambda_{aa}\mu_a - \Lambda_{ab}(x_b - \mu_b))
$$

So equating this to the linear term for a general Gaussian above gives us

$$
\mu_{a|b} = \Sigma_{a|b}(\Lambda_{aa}\mu_a - \Lambda_{ab}(x_b - \mu_b)) = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b)
$$
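
The precision-matrix form of these results can be tried out numerically. The following is a minimal NumPy sketch rather than a reference implementation: the function name `conditional_from_precision`, its index arguments, and the example values are all illustrative assumptions that do not come from the derivation itself.

```python
import numpy as np

def conditional_from_precision(mu, Sigma, idx_a, idx_b, x_b):
    """Return (mu_{a|b}, Sigma_{a|b}) using the precision-matrix form.

    idx_a and idx_b are lists of indices selecting the components of
    x_a and x_b (hypothetical arguments chosen for this sketch).
    """
    Lambda = np.linalg.inv(Sigma)            # precision matrix
    L_aa = Lambda[np.ix_(idx_a, idx_a)]      # block Lambda_aa
    L_ab = Lambda[np.ix_(idx_a, idx_b)]      # block Lambda_ab
    Sigma_cond = np.linalg.inv(L_aa)         # Sigma_{a|b} = Lambda_aa^{-1}
    # mu_{a|b} = mu_a - Lambda_aa^{-1} Lambda_ab (x_b - mu_b)
    mu_cond = mu[idx_a] - Sigma_cond @ L_ab @ (x_b - mu[idx_b])
    return mu_cond, Sigma_cond

# Illustrative 3-D example (made-up values): x_a is the first component,
# x_b is the remaining two.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
x_b = np.array([2.5, 2.0])
mu_cond, Sigma_cond = conditional_from_precision(mu, Sigma, [0], [1, 2], x_b)
```

Inverting the full covariance matrix is fine for a small illustration like this one, but it is not necessarily the most numerically careful way to obtain the precision blocks.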

These results can also be expressed in terms of the partitioned covariance matrix. The partitioned covariance and precision matrices are related by

$$
\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}
$$

We then make use of the identity for the inverse of a partitioned matrix, stated in the appendix, to get

$$
\begin{align}
    \Lambda_{aa} &= (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} \\
    \Lambda_{ab} &= -(\Sigma_{aa} - \Sigma_{ab}\Sigma^{-1}_{bb}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1}
\end{align}
$$

Plugging these into the derived expressions for $\mu_{a|b}$ and $\Sigma_{a|b}$ gives

$$
\begin{align}
\mu_{a|b} &= \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\\
\Sigma_{a|b} &= \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}
\end{align}
$$

since

$$
\Lambda_{aa}^{-1}\Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} = -\Sigma_{ab}\Sigma_{bb}^{-1}
$$
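
As a rough sketch, the covariance-form expressions for $\mu_{a|b}$ and $\Sigma_{a|b}$ can be evaluated directly with NumPy. The helper name `condition_gaussian`, its argument layout, and the 2-D example values are assumptions made for illustration only.

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b):
    """Return (mu_{a|b}, Sigma_{a|b}) from the covariance blocks."""
    # Sigma_ab Sigma_bb^{-1}, computed with a linear solve instead of an
    # explicit inverse; this relies on Sigma_bb being symmetric.
    gain = np.linalg.solve(S_bb, S_ab.T).T
    mu_cond = mu_a + gain @ (x_b - mu_b)
    Sigma_cond = S_aa - gain @ S_ab.T        # Sigma_ba = Sigma_ab^T
    return mu_cond, Sigma_cond

# Made-up 2-D example: x_a and x_b are both scalars.
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[1.0]])
S_ab = np.array([[0.8]])
S_bb = np.array([[2.0]])
x_b = np.array([3.0])

mu_cond, Sigma_cond = condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b)
# By hand: gain = 0.8 / 2 = 0.4, so mu_cond is about 0 + 0.4 * (3 - 1) = 0.8
# and Sigma_cond is about 1 - 0.4 * 0.8 = 0.68.
```

In this example, observing $x_b = 3$ (above its mean of 1) pulls the conditional mean of $x_a$ upward because the two variables are positively correlated, and the conditional variance is smaller than the prior variance of $x_a$.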

These equations give better intuition for how conditioning transforms one Gaussian into another. For example, we see that the mean of $x_a$ is shifted by a term that is linear in the deviation $x_b - \mu_b$ and scaled by the covariance blocks involving $x_b$, while the conditional covariance does not depend on the observed value of $x_b$ at all.

Note that the conditional mean is an **affine** function of $x_b$. Affine transformations are a set of transformations that preserve lines and parallelism but not necessarily scale or distances. One can think of them as one step beyond linear transformations: they include all linear transformations, but also include translations that move the points.
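
To make the affine structure explicit, the conditional mean can be regrouped into a linear part plus a constant offset. The symbols $K$ and $c$ below are introduced only for this illustration and do not appear elsewhere in these notes.

$$
\mu_{a|b} = K x_b + c, \qquad K = \Sigma_{ab}\Sigma_{bb}^{-1}, \qquad c = \mu_a - \Sigma_{ab}\Sigma_{bb}^{-1}\mu_b
$$

The map $x_b \mapsto K x_b$ is linear, and adding the offset $c$ is what makes the overall map affine rather than linear.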

# Marginal Gaussian Distributions

Another important property of Gaussian distributions is that, if the joint distribution $p(x_a, x_b)$ is Gaussian, then the marginal distribution given by

$$
p(x_a) = \int p(x_a, x_b)dx_b
$$

is also Gaussian.
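
The standard result, quoted here without derivation, is that the marginal is $\mathcal{N}(x_a|\mu_a, \Sigma_{aa})$. The following is a small sampling sketch that is consistent with this claim but does not prove it; the seed, sample size, and parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed for repeatability

# Made-up joint Gaussian over (x_a, x_b), both scalars.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

samples = rng.multivariate_normal(mu, Sigma, size=200_000)

# Keeping only the x_a column and ignoring x_b is the sampling analogue
# of integrating x_b out of the joint distribution.
x_a = samples[:, 0]
emp_mean = x_a.mean()                        # should be close to mu_a = 0
emp_var = x_a.var()                          # should be close to Sigma_aa = 1
```

Matching the first two moments does not by itself show that the marginal is Gaussian; it only checks the claimed mean and variance.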

## Showing $p(x_a)$ is Gaussian

Consider again the quadratic functional form of the joint Gaussian distribution $p(x_a, x_b)$. 

# Appendix

*Statement: (PRML Exercise 2.22)* The inverse of a symmetric matrix is also symmetric.

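*Proof:* Let $A$ be an invertible symmetric matrix, i.e. $A^T = A$. We have $AA^{-1} = A^{-1}A = I$, and the identity $I$ is symmetric. Taking the transpose of $AA^{-1}$, using $A^T = A$, and then multiplying on the right by $A^{-1}$ gives

$$
\begin{align}
    AA^{-1} &= (AA^{-1})^T\\
    AA^{-1} &= (A^{-1})^TA^T = (A^{-1})^TA\\
    AA^{-1}A^{-1} &= (A^{-1})^TAA^{-1}\\
    A^{-1} &= (A^{-1})^T
\end{align}
$$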

So the inverse of $A$ is also symmetric. $\blacksquare$

*Statement: (PRML Exercise 2.24)* The inverse of a partitioned matrix is given by

$$
\begin{pmatrix}A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix}M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{pmatrix}
$$

for $M = (A - BD^{-1}C)^{-1}$, where $M^{-1} = A - BD^{-1}C$ is called the **Schur complement** of the matrix on the left side above with respect to the block $D$.

*Proof:* **TODO**
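
Until the proof is written up, the identity can at least be sanity-checked numerically. The sketch below is a check on one randomly generated matrix, not a proof; the block sizes, the seed, and the choice of a symmetric positive definite test matrix (which guarantees that $D$ and the Schur complement are invertible) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b = 2, 3                              # arbitrary block sizes
n = n_a + n_b

# Random symmetric positive definite matrix, so every inverse below exists.
X = rng.standard_normal((n, n))
P = X @ X.T + n * np.eye(n)

A, B = P[:n_a, :n_a], P[:n_a, n_a:]
C, D = P[n_a:, :n_a], P[n_a:, n_a:]

D_inv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ D_inv @ C)         # inverse of the Schur complement

# Assemble the right-hand side of the partitioned-inverse identity.
block_inverse = np.block([
    [M,               -M @ B @ D_inv],
    [-D_inv @ C @ M,  D_inv + D_inv @ C @ M @ B @ D_inv],
])

matches = np.allclose(block_inverse, np.linalg.inv(P))
```

If the identity holds for this matrix, `matches` is `True`, and `block_inverse @ P` is numerically the identity matrix.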

**NOTE:** This Stack Exchange answer has a very interesting geometric explanation of regression and conditional Gaussians as simply affine transformations to a circle: https://stats.stackexchange.com/questions/71260/what-is-the-intuition-behind-conditional-gaussian-distributions