# An Example: A Linear-Gaussian Latent Feature Model with Binary Features



## Finite Linear-Gaussian Model

$\mathbf{x}_i$ is generated from a Gaussian with:

- mean $\mathbf{z}_i \mathbf{A}$
- covariance $\Sigma_X = \sigma^2_X \mathbf{I}$

Where:

- $\mathbf{z}_i$ is a $K$-dimensional binary vector
- $\mathbf{A}$ is a $K$ x $D$ matrix of weights

Since the mean of each $\mathbf{x}_i$ is $\mathbf{z}_i \mathbf{A}$, the expectation of $\mathbf{X}$, $\mathbb{E}(\mathbf{X})$, is $\mathbf{Z} \mathbf{A}$.

Given that:

- $\mathbf{X}$ is an $N$ x $D$ matrix

So:

- $\mathbf{Z}$ is an $N$ x $K$ matrix

For the sake of thinking this through: each observation $\mathbf{x}_i$ could be low-dimensional, e.g. $D = 1$ or $2$, whilst the number of latent features $K$ (the number of columns of $\mathbf{Z}$) could be large, e.g. 1000 or more.

Note that the covariance matrix is a scalar multiple of the identity (isotropic, not merely diagonal), so the Gaussians have spherical isocontours.  This constrains the solution, simplifying inference.

As per the tutorial, and since each $\mathbf{x}_i$ is distributed as a symmetric Gaussian, the distribution of $\mathbf{X}$, given $\mathbf{Z}$, $\mathbf{A}$ and $\sigma_X$ is:

$$
p(\mathbf{X} \mid \mathbf{Z}, \mathbf{A}, \sigma_X)
=
\frac
  {1}
  {(2\pi \sigma_X^2)^{ND/2}}
\exp \left(
  -
  \frac{1}
    {2\sigma_X^2}
    \mathrm{tr}
    (
      (\mathbf{X} - \mathbf{Z}\mathbf{A})^T
      (\mathbf{X} - \mathbf{Z}\mathbf{A})
    )
\right)
$$
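
To convince myself the trace form is right, here is a minimal NumPy sketch that samples from the model and evaluates the log of this likelihood two ways: with the trace, and as a plain sum of $ND$ independent one-dimensional Gaussian log densities. The sizes, the value of `sigma_X`, the seed and the function names are all my own illustrative choices, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and noise scale (assumptions, not from the tutorial)
N, K, D = 5, 3, 2
sigma_X = 0.5

Z = rng.integers(0, 2, size=(N, K))             # binary feature matrix, N x K
A = rng.normal(size=(K, D))                     # weights, K x D
X = Z @ A + sigma_X * rng.normal(size=(N, D))   # row x_i ~ N(z_i A, sigma_X^2 I)


def log_lik_trace(X, Z, A, sigma_X):
    """Log of p(X | Z, A, sigma_X), using the trace form."""
    N, D = X.shape
    R = X - Z @ A
    return (-0.5 * N * D * np.log(2 * np.pi * sigma_X**2)
            - np.trace(R.T @ R) / (2 * sigma_X**2))


def log_lik_elementwise(X, Z, A, sigma_X):
    """Same quantity, as a sum of N*D independent 1-D Gaussian log densities."""
    R = X - Z @ A
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_X**2)
                  - R**2 / (2 * sigma_X**2))


print(log_lik_trace(X, Z, A, sigma_X), log_lik_elementwise(X, Z, A, sigma_X))
```

The two printed numbers should agree up to floating-point error, because $\mathrm{tr}(\mathbf{R}^T\mathbf{R})$ is just the sum of squared entries of $\mathbf{R} = \mathbf{X} - \mathbf{Z}\mathbf{A}$.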


Note that since $\mathbf{Z}$ has a dimension in $N$, the mean of each $\mathbf{x}_i$ can be different.  $\mathbf{A}$ does not have a dimension in $N$, and gives the properties of each mixing component. (I think).

The tutorial suggests that we should integrate out the weights $\mathbf{A}$, which we can do once we define a prior on them.  The tutorial uses a Gaussian prior:

$$
p(\mathbf{A} \mid \sigma_A) =
\frac{1}
  {(2 \pi \sigma_A^2)^{KD/2}}
\exp \left(
  - \frac{1}{2\sigma_A^2}
  \mathrm{tr}(\mathbf{A}^T \mathbf{A})
\right)
$$

Multiplying these two probabilities we get:

$$
p(\mathbf{X}, \mathbf{A} \mid \mathbf{Z}, \sigma_X, \sigma_A)
= p(\mathbf{X} \mid \mathbf{Z}, \mathbf{A}, \sigma_X) \, p(\mathbf{A} \mid \sigma_A)
$$
$$
= 
\frac
  {1}
  {(2\pi \sigma_X^2)^{ND/2}}
\exp \left(
  -
  \frac{1}
    {2\sigma_X^2}
    \mathrm{tr}
    (
      (\mathbf{X} - \mathbf{Z}\mathbf{A})^T
      (\mathbf{X} - \mathbf{Z}\mathbf{A})
    )
\right)
\cdot
\frac{1}
  {(2 \pi \sigma_A^2)^{KD/2}}
\exp \left(
  - \frac{1}{2\sigma_A^2}
  \mathrm{tr}(\mathbf{A}^T \mathbf{A})
\right)
$$

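Gradually working through the tutorial expressions: dropping the normalizing constants (which do not depend on $\mathbf{X}$ or $\mathbf{A}$), expanding the product inside the first trace, and merging the two exponentials: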

$$
\propto
\exp
\left(
- \frac{1}{2}
  \mathrm{tr} \left(
    \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X}
    - \frac{1}{\sigma_X^2} \mathbf{A}^T\mathbf{Z}^T\mathbf{X}
    - \frac{1}{\sigma_X^2} \mathbf{X}^T \mathbf{Z} \mathbf{A}
    + \frac{1}{\sigma_X^2} \mathbf{A}^T\mathbf{Z}^T\mathbf{Z}\mathbf{A}
    + \frac{1}{\sigma_A^2} \mathbf{A}^T \mathbf{A}
  \right)
\right)
$$

$$
= \exp \left(
  - \frac{1}{2}
  \mathrm{tr} \left(
    \frac{1}{\sigma_X^2}\mathbf{X}^T\mathbf{X}
    - \frac{1}{\sigma_X^2} \mathbf{A}^T\mathbf{Z}^T\mathbf{X}
    - \frac{1}{\sigma_X^2} \mathbf{X}^T \mathbf{Z} \mathbf{A}
    + \mathbf{A}^T(\frac{1}{\sigma_X^2}\mathbf{Z}^T\mathbf{Z} + \frac{1}{\sigma_A^2}\mathbf{I})\mathbf{A}
  \right)
\right)
$$
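
As a sanity check on the algebra, the following sketch compares the exponent before expanding (likelihood term plus prior term) with the grouped form above, on random matrices. The sizes, scales and variable names (`before`, `after`, `P`) are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 6, 4, 3             # illustrative sizes (assumed)
sigma_X, sigma_A = 0.7, 1.3   # illustrative noise / prior scales (assumed)

X = rng.normal(size=(N, D))
Z = rng.integers(0, 2, size=(N, K)).astype(float)
A = rng.normal(size=(K, D))

# Exponent before expanding: likelihood term plus prior term
R = X - Z @ A
before = (-np.trace(R.T @ R) / (2 * sigma_X**2)
          - np.trace(A.T @ A) / (2 * sigma_A**2))

# Exponent after expanding and grouping the terms that are quadratic in A
P = Z.T @ Z / sigma_X**2 + np.eye(K) / sigma_A**2
after = -0.5 * np.trace(
    X.T @ X / sigma_X**2
    - A.T @ Z.T @ X / sigma_X**2
    - X.T @ Z @ A / sigma_X**2
    + A.T @ P @ A
)

print(before, after)
```

The two values should match up to rounding, which confirms that the $\mathbf{A}^T\mathbf{Z}^T\mathbf{Z}\mathbf{A}$ and $\mathbf{A}^T\mathbf{A}$ terms have been combined correctly.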

### Interlude: Matrix and Gaussian revision

At this point, I had to reach out to revise some properties of matrices and Gaussians that I ~~had forgotten~~ didn't know.  Some of the resources I used for this section:

- "Bayesian Linear Regression", Minka, 1998 (revised 2010)
- https://en.wikipedia.org/wiki/Gaussian_integral
- https://en.wikipedia.org/wiki/Multivariate_normal_distribution
- https://en.wikipedia.org/wiki/Matrix_normal_distribution
- [https://en.wikipedia.org/wiki/Vectorization_(mathematics)](https://en.wikipedia.org/wiki/Vectorization_(mathematics%29)
- https://en.wikipedia.org/wiki/Conjugate_transpose

#### Matrix normal distribution

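So, going through bit by bit: the Gaussian integral over a $D$-dimensional vector $\mathbf{x}$, for a symmetric positive-definite $D$ x $D$ matrix $\mathbf{A}$ (a generic matrix here, not the weight matrix above), is: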

$$
\int_{-\infty}^{\infty}
\exp \left(
   -\frac{1}{2}
   (\mathbf{x}^T \mathbf{A} \mathbf{x})
\right)
\,
d\mathbf{x}
=
\sqrt{\frac{(2\pi)^D}{\mathrm{det}\,\mathbf{A}}}
$$
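
A quick numerical check of this formula for $D = 2$, using a brute-force Riemann sum on a grid. The matrix `A`, the grid limits and the grid resolution are arbitrary choices of mine; the only requirement is that `A` is symmetric positive-definite and the grid is wide enough for the integrand to be negligible at the edges.

```python
import numpy as np

# A small symmetric positive-definite matrix (arbitrary, for illustration)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
D = A.shape[0]

# Riemann sum on a square grid
grid = np.linspace(-10.0, 10.0, 801)
dx = grid[1] - grid[0]
x1, x2 = np.meshgrid(grid, grid, indexing="ij")

# x^T A x written out for the 2-D case
quad = A[0, 0] * x1**2 + 2 * A[0, 1] * x1 * x2 + A[1, 1] * x2**2
numeric = np.sum(np.exp(-0.5 * quad)) * dx**2

closed_form = np.sqrt((2 * np.pi) ** D / np.linalg.det(A))

print(numeric, closed_form)
```

Both numbers should agree closely, since the integrand is smooth and decays very quickly.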

*However*, in this case, $\mathbf{x}$ is a vector, but we will have $\mathbf{X}$, or something of this sort, i.e. a matrix.  So what we will in fact have is a matrix normal distribution.  A matrix normal distribution, per the Wikipedia article, has the form:

$$
p(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V})
= \frac
  {\exp \left(
     -\frac{1}{2}
     \mathrm{tr} \left(
       \mathbf{V}^{-1}
       (\mathbf{X} - \mathbf{M})^T
       \mathbf{U}^{-1}
       (\mathbf{X} - \mathbf{M})
     \right)
  \right)}
  {(2\pi)^{NK/2} \left| \mathbf{V} \right|^{N/2} \left| \mathbf{U} \right|^{K/2}}
$$

where:

- $\mathrm{tr}$ denotes "trace"
- $\mathbf{M}$ is $N$ x $K$ (the same shape as $\mathbf{X}$)
- $\mathbf{U}$ is $N$ x $N$ (so: square)
- $\mathbf{V}$ is $K$ x $K$ (also square)

(where I've changed $n$ in the Wikipedia article to $N$, and $p$ to $K$, in line with the notation we are using elsewhere)

The article then states that the relationship to Multivariate normal is:

$$
\mathrm{vec}(\mathbf{X})
\sim
\mathcal{N}_{NK}(\mathrm{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U})
$$

where:

- $\otimes$ is Kronecker Product
- $\mathrm{vec}$ is vectorization

#### Vectorization

What is the Kronecker product, and what is vectorization?  The [Wikipedia article](https://en.wikipedia.org/wiki/Vectorization_(mathematics%29) gives a good description of vectorization.  You stack the columns of the matrix on top of each other, so it becomes a single column vector:

If:
$$
\mathbf{A}
=
\begin{bmatrix} a & b \\
c & d \\
\end{bmatrix}
$$
Then:
$$
\mathrm{vec}(\mathbf{A})
=
\begin{bmatrix}
a \\
c \\
b \\
d \\
\end{bmatrix}
$$

So, if $\mathbf{A}$ is $M$ x $N$, and the result of $\mathrm{vec}(\mathbf{A})$ is the column vector $\mathbf{c}$ of length $MN$, then, with 1-based indices:
$$
a_{i,j}
=
c_{(j-1)M + i}
$$
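
In NumPy, column-stacking vectorization corresponds to flattening in column-major (Fortran) order. A small sketch, with an example matrix of my own choosing; note that NumPy indices are 0-based, so element `A[i, j]` lands at position `j * M + i`.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # M = 2 rows, N = 3 columns
M, N = A.shape

# order="F" (column-major) reads down each column in turn
c = A.reshape(-1, order="F")   # c is [1, 4, 2, 5, 3, 6]

# 0-based index relationship: A[i, j] == c[j * M + i]
index_ok = all(A[i, j] == c[j * M + i] for i in range(M) for j in range(N))

print(c, index_ok)
```

The default `A.reshape(-1)` flattens row by row instead, which is not the $\mathrm{vec}$ used here.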


#### Kronecker Product

The Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$ is formed by tiling the matrices $\mathbf{A}$ and $\mathbf{B}$ as follows, and then forming the per-element product.  If we have the following matrices:

$$
\mathbf{A}
=
\begin{bmatrix} a & b \\
c & d \\
\end{bmatrix}
$$

$$
\mathbf{B}
=
\begin{bmatrix} e & f \\
g & h \\
\end{bmatrix}
$$

Then matrix $\mathbf{A}$ will be tiled like:

$$
\begin{bmatrix}
a & a & b & b \\
a & a & b & b \\
c & c & d & d \\
c & c & d & d \\
\end{bmatrix}
$$

Matrix $\mathbf{B}$ will be tiled like:

$$
\begin{bmatrix}
e & f & e & f \\
g & h & g & h \\
e & f & e & f \\
g & h & g & h \\
\end{bmatrix}
$$

... and the Kronecker product is the Hadamard (per-element) product of these tiled matrices:

$$
\begin{bmatrix}
ae & af & be & bf \\
ag & ah & bg & bh \\
ce & cf & de & df \\
cg & ch & dg & dh \\
\end{bmatrix}
$$
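
The tiling picture can be reproduced directly in NumPy and compared against the built-in `np.kron`. The numeric matrices and the names `A_tiled`, `B_tiled` are just my own example.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Repeat each element of A into a block with the shape of B
A_tiled = np.repeat(np.repeat(A, B.shape[0], axis=0), B.shape[1], axis=1)
# Tile whole copies of B, one copy per element of A
B_tiled = np.tile(B, A.shape)

kron_by_tiling = A_tiled * B_tiled   # per-element (Hadamard) product
kron_builtin = np.kron(A, B)

print(kron_by_tiling)
print(np.array_equal(kron_by_tiling, kron_builtin))
```

The first row of the result is `[5, 6, 10, 12]`, i.e. $ae, af, be, bf$ with $a=1, b=2, e=5, f=6$.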



Let's compare this with the ordinary matrix product $\mathbf{A}\mathbf{B}$ of these two matrices.  This is:

$$
\begin{bmatrix}
ae + bg & af + bh \\
ce + dg & cf + dh \\
\end{bmatrix}
$$

#### Relationship between vectorization and inner product

The matrix product above involves 8 pairwise products, and the Kronecker product involves 16, so it is not obvious how to relate the two, e.g. via vectorization.

Wikipedia however states that we can form a relationship between vectorization and the matrix product. For square, real matrices:

$$
\mathrm{tr}
(\mathbf{A}^T \mathbf{B})
= \mathrm{vec}(\mathbf{A})^T \mathrm{vec}(\mathbf{B})
$$


Let's try this, for the example matrices above.

$\mathrm{vec}(\mathbf{A})$ is:

$$
\begin{bmatrix}
a \\
c \\
b \\
d \\
\end{bmatrix}
$$

$\mathrm{vec}(\mathbf{B})$ is:

$$
\begin{bmatrix}
e \\
g \\
f \\
h \\
\end{bmatrix}
$$



So:

$$\mathrm{vec}(\mathbf{A})^T \mathrm{vec}(\mathbf{B}) = ae + cg + bf + dh$$


$\mathbf{A}^T$ is:
$$
\begin{bmatrix} a & c \\
b & d \\
\end{bmatrix}
$$

...and $\mathbf{B}$ is:

$$
\begin{bmatrix} e & f \\
g & h \\
\end{bmatrix}
$$


So, $\mathbf{A}^T \mathbf{B}$ is:

$$
\begin{bmatrix}
ae + cg & af + ch \\
be + dg & bf + dh \\
\end{bmatrix}
$$

$\mathrm{tr}$ is the sum of the diagonal, for a square matrix.  For example, for an $m$ x $m$ matrix $\mathbf{X}$, $\mathrm{tr}(\mathbf{X})$ is:

$$
\mathrm{tr}(\mathbf{X}) = \sum_{i=1}^m x_{i,i}
$$

So, $\mathrm{tr}(\mathbf{A}^T \mathbf{B})$, in our example above, is:

$$
\mathrm{tr}(\mathbf{A}^T \mathbf{B})
= ae + cg + bf + dh
$$

... which matches the result for $\mathrm{vec}(\mathbf{A})^T \mathrm{vec}(\mathbf{B})$

More generally, let's try for two $n$ x $n$ matrices $\mathbf{A}$ and $\mathbf{B}$.  With 1-based indices, the vectorizations will look like:

$$
a_{i,j} = \mathrm{vec}(\mathbf{A})_{(j-1)n + i}
$$

$$
b_{k,l} = \mathrm{vec}(\mathbf{B})_{(l-1)n + k}
$$

To form the inner product of the vectorizations, let's use two nested sums.  The innermost sum will be over each row in a column, and the outermost will be over columns.  So this will give:

$$
\sum_{i=1}^{n} 
\sum_{j=1}^{n}
a_{j,i} b_{j,i}
$$

Meanwhile, $\mathbf{A}^T$ is:

$$
(\mathbf{A}^T)_{i,j} = a_{j,i}
$$

... and the matrix product $\mathbf{A}^T\mathbf{B}$ is:

$$
(\mathbf{A}^T\mathbf{B})_{i,j} = \sum_{k=1}^n a_{k,i} b_{k,j}
$$
The trace of this, is the sum over the diagonal, ie the sum of terms where $i = j$.  This gives:

$$
\mathrm{tr}(\mathbf{A}^T\mathbf{B}) = \sum_{l=1}^n \sum_{k=1}^n a_{k,l} b_{k,l}
$$

By inspection, this is identical to the expression for $\mathrm{vec}(\mathbf{A})^T \mathrm{vec}(\mathbf{B})$
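
The identity is also easy to check numerically. A minimal sketch on random matrices (size and seed are arbitrary choices of mine); it also compares against the plain sum of element-wise products, which is what both double sums above reduce to.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

# Column-stacking vectorization
vec_A = A.reshape(-1, order="F")
vec_B = B.reshape(-1, order="F")

lhs = np.trace(A.T @ B)        # tr(A^T B)
rhs = vec_A @ vec_B            # vec(A)^T vec(B)
elementwise = np.sum(A * B)    # sum over all i, j of a_ij * b_ij

print(lhs, rhs, elementwise)
```

All three numbers should agree up to floating-point error.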

### Old notes, not working yet:

The tutorial says we should complete the square now, ~~since we want to find the mean and covariance matrix of the resulting Gaussian:~~  I had to think about this.  We're actually completing the square in terms of $\mathbf{A}$, which I guess is because we intend to integrate over $\mathbf{A}$ later?

$$
= \exp \left(
 - \frac{1}{2}
 \mathrm{tr}
   \left(
     ()^T()
   \right)
 \right)
$$