# Geometric View of Gaussians

The Gaussian distribution is one of the most widely used distributions in machine learning, as we will see as these notes progress. For a single variable, the Gaussian is defined as

$$
\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi \sigma^2)^{\frac{1}{2}}}e^{-\frac{1}{2\sigma^2}(x - \mu)^2}
$$

where $\mu$ is the mean and $\sigma^2$ is the variance. The multivariate Gaussian is defined as

$$
\mathcal{N}(\pmb{x}|\pmb{\mu}, \pmb{\Sigma}) = \frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{|\pmb{\Sigma}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\pmb{x} - \pmb{\mu})^T\pmb{\Sigma}^{-1}(\pmb{x} - \pmb{\mu})}
$$

where $\pmb{\mu}$ is a $D$-dimensional mean vector, $\pmb{\Sigma}$ is a $D \times D$ covariance matrix and $|\pmb{\Sigma}|$ is the determinant of the covariance. For ease of notation, I will drop the convention of using bold text to represent vector values and instead use context to indicate when we are talking about a vector or scalar value.
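As a minimal sketch of the definition above, the density can be evaluated directly with NumPy. The function name `gaussian_pdf` is hypothetical (it is not from the notes), and the code assumes $\Sigma$ is symmetric positive definite.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) at a single point x (hypothetical helper)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    D = mu.shape[0]
    diff = x - mu
    # Solve Sigma z = diff rather than forming the explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```

With $D = 1$ the function reduces to the univariate formula, which is a quick way to check the normalization constant.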

The multivariate Gaussian depends on $x$ only through the quadratic form in the exponent, which defines the squared **Mahalanobis distance** $\Delta$ between $x$ and the mean:

$$
\Delta^2 = (x - \mu)^T \Sigma^{-1}(x - \mu)
$$
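A small sketch of this quantity in NumPy follows. The helper name `mahalanobis_sq` and the example values are assumptions chosen for illustration. When $\Sigma$ is the identity, the result reduces to the squared Euclidean distance.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(diff @ np.linalg.solve(Sigma, diff))

# Illustrative values (not from the notes).
mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 0.0], [0.0, 1.0]])
x = np.array([3.0, 0.0])

delta_sq = mahalanobis_sq(x, mu, Sigma)             # 2^2 / 4 + 1^2 / 1 = 2
euclid_sq = mahalanobis_sq(x, mu, np.eye(2))        # 2^2 + 1^2 = 5
```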

***Statement:*** The precision matrix $\Sigma^{-1}$ can be taken to be symmetric without loss of generality.

***Proof: (PRML Exercise 2.17)*** Let $A = \Sigma^{-1}$ for notational purposes. We can express the precision matrix as a sum of its symmetric and antisymmetric parts. Explicitly:

$$
A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)
$$

Then the exponent in the Gaussian definition is

$$
(x - \mu)^TA(x - \mu) = x^TAx - x^TA\mu - \mu^TAx + \mu^TA\mu
$$

An expression of the form $x^TAx$ is called a **quadratic form** in the matrix $A$. We will show that the quadratic form of the antisymmetric part of a matrix is 0.

$$
\begin{align*}
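x^TAx &= x^T\left(\frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)\right)x \\
&= \frac{1}{2}[x^TAx + x^TA^Tx] + \frac{1}{2}[x^TAx - x^TA^Tx] \\
&= \frac{1}{2}x^T(A + A^T)x + 0 && \text{since } x^TA^Tx = (x^TAx)^T = x^TAx \\
&= x^T\left[\frac{1}{2}(A + A^T)\right]x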
\end{align*}
$$

So the only part that survives is the symmetric part of the matrix. The same argument applies with $x$ replaced by $x - \mu$, so the exponent $(x - \mu)^TA(x - \mu)$ depends only on the symmetric part of $A$. Thus we can treat the precision matrix as symmetric without loss of generality. $\blacksquare$
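The statement can also be checked numerically. The sketch below uses an arbitrary random matrix and vector (assumed values, seeded for reproducibility) and splits the matrix into its symmetric and antisymmetric parts.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))      # a generic, non-symmetric matrix
A_sym = 0.5 * (A + A.T)          # symmetric part
A_anti = 0.5 * (A - A.T)         # antisymmetric part
z = rng.normal(size=3)           # plays the role of x - mu

q_full = z @ A @ z
q_sym = z @ A_sym @ z
q_anti = z @ A_anti @ z          # vanishes up to rounding error
```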

Consider the eigenvector equation for the covariance matrix

$$
\Sigma u_i = \lambda_i u_i
$$

for $i = 1, \dots, D$. Using the results from the appendix, we know that because $\Sigma$ is real and symmetric, its eigenvalues are real and its eigenvectors can be chosen to form an orthonormal set

$$
u_i^Tu_j = I_{ij}
$$

where $I_{ij}$ is the $(i, j)$ element of the identity matrix.
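A short sketch of the eigenvector equation and the orthonormality condition, using `np.linalg.eigh` (NumPy's routine for symmetric matrices). The covariance values are an assumption chosen for illustration.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # illustrative symmetric positive definite matrix

# eigh returns real eigenvalues in ascending order and the
# corresponding orthonormal eigenvectors as the columns of U.
eigvals, U = np.linalg.eigh(Sigma)

gram = U.T @ U                   # should be the identity matrix
```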

Note also that since $\Sigma$ is real and symmetric, it is orthogonally diagonalizable: $\Sigma = UDU^T$ with $U, D \in \mathbb{R}^{D \times D}$, where the columns of $U$ are the eigenvectors $u_i$ and $D$ is the diagonal matrix of eigenvalues $\lambda_i$ (the matrix $D$ should not be confused with the dimensionality $D$). Applying this to an arbitrary $x \in \mathbb{R}^D$ gives

$$
\begin{align}
    UDU^Tx &= UD \begin{bmatrix}u_1^Tx \\ \vdots \\ u_D^Tx\end{bmatrix} \\
           &= U \begin{bmatrix}\lambda_1 u_1^Tx \\ \vdots \\ \lambda_D u_D^Tx\end{bmatrix} \\
           &= \sum_{i=1}^D \lambda_i u_i u_i^T x
\end{align}
$$

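Note that this is consistent with the eigenvector equation above: when $x = u_j$, orthonormality leaves only the $i = j$ term, giving $\sum_{i=1}^D \lambda_i u_i u_i^T u_j = \lambda_j u_j$. Since $x$ was arbitrary, $\Sigma = \sum_{i=1}^D \lambda_i u_i u_i^T$.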
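The expansion of the covariance as a sum of rank-one terms $\lambda_i u_i u_i^T$ can be sketched as follows. The matrix values are assumed for illustration.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
eigvals, U = np.linalg.eigh(Sigma)

# Rebuild Sigma from the outer products of its eigenvectors (columns of U).
Sigma_rebuilt = sum(lam * np.outer(u, u) for lam, u in zip(eigvals, U.T))

# Applying the sum to an eigenvector returns that eigenvector scaled by its eigenvalue.
applied = Sigma_rebuilt @ U[:, 0]
```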

Similarly, the inverse covariance matrix can be expressed as 
$$
\begin{align}
\Sigma^{-1} &= (UDU^T)^{-1} \\
            &= (U^T)^{-1}D^{-1}U^{-1} \\
            &= U D^{-1} U^T
\end{align}
$$

since $U$ is orthogonal, so $U^{-1} = U^T$ and $(U^T)^{-1} = U$. Applying this to an arbitrary $x \in \mathbb{R}^D$ gives

$$
\begin{align}
    \Sigma^{-1}x &= UD^{-1} \begin{bmatrix}u_1^Tx \\ \vdots \\ u_D^Tx\end{bmatrix} \\
           &= U \begin{bmatrix}\lambda_1^{-1} u_1^Tx \\ \vdots \\ \lambda_D^{-1} u_D^Tx\end{bmatrix} \\
           &= \sum_{i=1}^D \lambda_i^{-1} u_i u_i^T x
\end{align}
$$
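A minimal check of the inverse in terms of the eigendecomposition, assuming the eigenvectors are stored as the columns of `U` (NumPy's convention) and all eigenvalues are nonzero. The matrix and vector values are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
eigvals, U = np.linalg.eigh(Sigma)

# Precision matrix from the eigendecomposition: invert the eigenvalues only.
Sigma_inv = U @ np.diag(1.0 / eigvals) @ U.T

# Equivalent sum of rank-one terms applied to an arbitrary vector.
x = np.array([0.3, -1.2])
applied = sum((u @ x) / lam * u for lam, u in zip(eigvals, U.T))
```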

Substituting $\Sigma^{-1} = \sum_{i=1}^D \lambda_i^{-1} u_i u_i^T$ into the quadratic form $\Delta^2$ gives

$$
\Delta^2 = \sum_{i=1}^D\frac{y_i^2}{\lambda_i}
$$

where 

$$
y_i = u_i^T(x - \mu)
$$

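which can be interpreted as a new coordinate system with orthonormal basis vectors $u_i$, origin at $\mu$, and scaling factors $\lambda_i^{\frac{1}{2}}$ along each axis. When all eigenvalues are positive, surfaces of constant density are therefore ellipsoids centered at $\mu$ with axes along the $u_i$, which gives us a geometric understanding of the Gaussian distribution.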
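The change of coordinates can be sketched numerically: project $x - \mu$ onto the eigenvectors and compare the weighted sum of squares with the direct quadratic form. All values below are assumptions chosen for illustration.

```python
import numpy as np

mu = np.array([1.0, 0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([0.3, -1.2])

eigvals, U = np.linalg.eigh(Sigma)

# y_i = u_i^T (x - mu): coordinates of x in the eigenvector basis centered at mu.
y = U.T @ (x - mu)

delta_sq_eig = np.sum(y ** 2 / eigvals)
delta_sq_direct = (x - mu) @ np.linalg.solve(Sigma, x - mu)
```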

Lastly, in the appendix we see that the Jacobian $J$ of the change of variables from $x$ to $y$ satisfies $|J| = 1$, and that $|\Sigma|$ is the product of its eigenvalues, $|\Sigma| = \prod_{j=1}^D \lambda_j$. Thus the Gaussian in this geometric coordinate system is:

$$
p(y) = \prod_{j=1}^D \frac{1}{(2\pi \lambda_j)^{\frac{1}{2}}}e^{-\frac{y_j^2}{2 \lambda_j}}
$$

This representation can be useful since it essentially breaks down the multivariate Gaussian into a product of D independent univariate Gaussian distributions.
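To illustrate the factorization, the sketch below evaluates the density once from the full multivariate formula and once as a product of univariate Gaussians in the $y$ coordinates. The function names and test values are hypothetical.

```python
import numpy as np

def pdf_direct(x, mu, Sigma):
    """Multivariate Gaussian density from the full formula."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))

def pdf_factorized(x, mu, Sigma):
    """Same density as a product of D univariate Gaussians with variances lambda_j."""
    eigvals, U = np.linalg.eigh(Sigma)
    y = U.T @ (x - mu)
    return np.prod(np.exp(-y ** 2 / (2 * eigvals)) / np.sqrt(2 * np.pi * eigvals))

mu = np.array([1.0, 0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([0.3, -1.2])
p_direct = pdf_direct(x, mu, Sigma)
p_factorized = pdf_factorized(x, mu, Sigma)
```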

**TODO: Plot a 2D Gaussian with its principal axes to show the geometry (with $\lambda$ and $\mu$ as parameters). Also plot the axis-aligned and isotropic covariance cases to show their limitations, a la page 84.**

# Understanding $\mu$ and $\Sigma$

We briefly show that $\mu$ and $\Sigma$ in the multivariate Gaussian can be interpreted as the mean and covariance, respectively.

We define the **moment generating function (mgf)** for some n-dimensional random vector $X \in \mathbb{R}^n$ as

$$
M_X(t) := E[e^{t^TX}] 
$$

For a multivariate Gaussian, the mgf is (derivation in appendix):

$$
M_X(t) = \exp\left(t^T\mu + \frac{1}{2}t^T\Sigma t\right)
$$

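Thus we can find the first and second moments by differentiating the mgf with respect to $t$ and evaluating at $t = 0$:

$$
E[X] = \nabla_t M_X(t)\big\vert_{t=0} = \mu, \qquad E[XX^T] = \nabla_t \nabla_t^T M_X(t)\big\vert_{t=0} = \Sigma + \mu\mu^T
$$

so that $\text{cov}[X] = E[XX^T] - E[X]E[X]^T = \Sigma$.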
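As a numerical sanity check on the moment calculation, the sketch below differentiates the Gaussian mgf at $t = 0$ with central finite differences. The gradient should recover $\mu$ and the Hessian should recover $\Sigma + \mu\mu^T$. The helper names, step size `h`, and parameter values are assumptions chosen for illustration.

```python
import numpy as np

def mgf(t, mu, Sigma):
    """Gaussian moment generating function exp(t^T mu + t^T Sigma t / 2)."""
    return np.exp(t @ mu + 0.5 * t @ Sigma @ t)

def grad_and_hessian_at_zero(f, n, h=1e-4):
    """Central finite-difference gradient and Hessian of f at t = 0."""
    E = np.eye(n)
    grad = np.array([(f(h * E[i]) - f(-h * E[i])) / (2 * h) for i in range(n)])
    hess = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            hess[i, j] = (f(h * E[i] + h * E[j]) - f(h * E[i] - h * E[j])
                          - f(-h * E[i] + h * E[j]) + f(-h * E[i] - h * E[j])) / (4 * h * h)
    return grad, hess

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
first, second = grad_and_hessian_at_zero(lambda t: mgf(t, mu, Sigma), 2)
cov = second - np.outer(first, first)   # second moment minus outer product of the mean
```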

**TODO: FINISH THIS SECTION!!**

# Appendix
*Statement: (PRML Exercise 1.35)* The entropy of a univariate Gaussian 

$$
p(x) = \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$

is given by

$$
H[x] = \frac{1}{2}(1 + \ln(2\pi\sigma^2))
$$

*Proof:* **TODO**
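Until the proof is filled in, a numerical check of the entropy formula can be sketched by integrating $-p(x)\ln p(x)$ on a fine grid. The parameter values, grid width, and resolution are assumptions.

```python
import numpy as np

mu, sigma = 0.5, 1.5                       # illustrative parameters
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
dx = x[1] - x[0]

# Work with the log density directly to avoid taking log of tiny numbers.
log_p = -(x - mu) ** 2 / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2)
p = np.exp(log_p)

H_numeric = -np.sum(p * log_p) * dx        # Riemann-sum approximation of the integral
H_closed = 0.5 * (1 + np.log(2 * np.pi * sigma ** 2))
```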

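*Statement: (PRML Exercise 2.13)* The KL divergence between two multivariate Gaussians $p(x) = \mathcal{N}(x|\mu, \Sigma)$ and $q(x) = \mathcal{N}(x|m, L)$ is

$$
KL(p\|q) = \frac{1}{2}\left[\ln\frac{|L|}{|\Sigma|} - D + \text{Tr}(L^{-1}\Sigma) + (\mu - m)^TL^{-1}(\mu - m)\right]
$$

where $D$ is the dimensionality of $x$.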

*Proof:* **TODO**
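A sketch of the KL divergence between two Gaussians follows. It uses the standard closed form $\frac{1}{2}[\ln(|L|/|\Sigma|) - D + \text{Tr}(L^{-1}\Sigma) + (\mu - m)^TL^{-1}(\mu - m)]$, which is stated here as an assumption rather than derived. The function name and test values are hypothetical.

```python
import numpy as np

def kl_gaussians(mu, Sigma, m, L):
    """KL(p || q) for p = N(mu, Sigma) and q = N(m, L), using the standard closed form."""
    D = mu.shape[0]
    diff = mu - m
    _, logdet_L = np.linalg.slogdet(L)
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.solve(L, Sigma))   # Tr(L^{-1} Sigma)
    quad_term = diff @ np.linalg.solve(L, diff)        # (mu - m)^T L^{-1} (mu - m)
    return 0.5 * (logdet_L - logdet_Sigma - D + trace_term + quad_term)

mu = np.array([1.0, 0.5])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
m = np.array([0.0, 0.0])
L = np.eye(2)

kl_same = kl_gaussians(mu, Sigma, mu, Sigma)   # identical distributions
kl_diff = kl_gaussians(mu, Sigma, m, L)
```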

*Statement: (PRML Exercise 2.14)* The multivariate distribution with maximum entropy, for a given covariance, is a multivariate Gaussian distribution.

*Proof:* **TODO**

*Statement: (PRML Exercise 2.15)* The entropy of a multivariate Gaussian $\mathcal{N}(x|\mu, \Sigma)$ is given by

$$
H[x] = \frac{1}{2}\ln|\Sigma| + \frac{D}{2}(1 + \ln(2\pi))
$$

where D is the dimensionality of x. 

*Proof:* **TODO**
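The entropy formula can be sketched in a few lines. Two consistency checks are natural: for $D = 1$ it should reduce to the univariate entropy, and for a diagonal covariance it should equal the sum of the univariate entropies. The function name and values are assumptions.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Entropy of N(x | mu, Sigma); it does not depend on the mean."""
    D = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)   # log-determinant, numerically stable
    return 0.5 * logdet + 0.5 * D * (1 + np.log(2 * np.pi))

variances = np.array([0.5, 2.0, 3.0])
H_diag = gaussian_entropy(np.diag(variances))
H_sum = np.sum(0.5 * (1 + np.log(2 * np.pi * variances)))
```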

*Statement: (PRML Exercise 2.18)* Given a real symmetric matrix $A$ with eigenvalue equation $Au_i = \lambda_i u_i$, the eigenvalues $\lambda_i$ are real and the eigenvectors satisfying this eigenvalue equation can be chosen to be orthonormal.

*Proof:* **TODO**

*Statement:* Every real, symmetric matrix is orthogonally diagonalizable. 

*Proof:* **TODO**

*Statement:* The determinant of the Jacobian matrix going from the coordinate system $x_i$ to $y_i$ for a Gaussian has absolute value 1, and the determinant of the covariance matrix is given by the product of its eigenvalues, $|\Sigma| = \prod_{i=1}^D \lambda_i$.

*Proof:* **TODO**
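A numerical check of both claims, treating the determinant as the product of the eigenvalues (the sum of the eigenvalues gives the trace instead). The Jacobian of the map $y = U^T(x - \mu)$ is the orthogonal matrix $U$ itself. The matrix values are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
eigvals, U = np.linalg.eigh(Sigma)

det_from_eigs = np.prod(eigvals)        # product of eigenvalues
trace_from_eigs = np.sum(eigvals)       # sum of eigenvalues
jac_det = abs(np.linalg.det(U))         # |J| for the orthogonal change of variables
```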

*Statement:* The moment generating function of a multivariate normal distribution is given by

$$
M_X(t) = \exp\left(t^T\mu + \frac{1}{2}t^T\Sigma t\right)
$$

*Proof:* **TODO**