## Low-rank approximation 

We consider an approximate posterior covariance of the form

\begin{equation}\begin{aligned}
\boldsymbol\Sigma &\approx \mathbf X \mathbf X^\top,\,\,\,\,\mathbf X\in\mathbb R^{L^2\times K}
\end{aligned}\end{equation}

where $\mathbf X$ is a tall, thin matrix with as many rows as there are spatial bins ($L^2$) and $K<L^2$ columns. We view $\mathbf X$ as being composed of $L^2$ length-$K$ row vectors $\mathbf x_i$, $i=1,\dots,L^2$:

\begin{equation}\begin{aligned}
\mathbf X^\top &= \{ \mathbf x_1^\top,...,\mathbf x_{L^2}^\top \}
\end{aligned}\end{equation}

Neglecting terms that do not depend on $\mathbf X$, the loss to be minimized is:

\begin{equation}\begin{aligned}
\mathcal L(\mathbf X)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right] - \ln|\mathbf X^\top \mathbf X |\right\}
\end{aligned}\end{equation}

Note that we have used $\ln|\mathbf X^\top \mathbf X|$ above, rather than $\ln|\mathbf X\mathbf X^\top|$. Since $\mathbf X \mathbf X^\top$ has rank $K<L^2$, it always has a null space of dimension $L^2-K$ with zero eigenvalues. This null space doesn't affect the gradient, but it does make the log-determinant diverge to $-\infty$. $\ln|\mathbf X^\top \mathbf X|$ is the log-determinant restricted to the $K$-dimensional space spanned by the columns of $\mathbf X$, and remains finite as long as $\mathbf X$ has full column rank.
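The difference between the two log-determinants can be seen numerically. The following is a minimal sketch using NumPy; the sizes and the random $\mathbf X$ are arbitrary stand-ins chosen for illustration, not values from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
L2, K = 6, 2                      # hypothetical sizes: 6 spatial bins, rank 2
X = rng.standard_normal((L2, K))  # tall, thin factor

# K x K Gram matrix: full rank for generic X, so its log-determinant is finite
sign_small, logdet_small = np.linalg.slogdet(X.T @ X)

# L2 x L2 outer product: rank K < L2, so L2 - K eigenvalues are numerically zero
eigvals = np.linalg.eigvalsh(X @ X.T)          # ascending order
n_zero = int(np.sum(eigvals < 1e-10 * eigvals.max()))

# The K nonzero eigenvalues of X X^T coincide with the eigenvalues of X^T X,
# so summing their logs reproduces ln|X^T X|
logdet_nonzero = float(np.sum(np.log(eigvals[-K:])))
```

Here `logdet_small` and `logdet_nonzero` agree, while the full determinant of $\mathbf X\mathbf X^\top$ is zero because of the `n_zero` vanishing eigenvalues.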

\begin{equation}\begin{aligned}
%%%% GRADIENT IN X
\operatorname{\nabla}_{\mathbf X}
\mathcal L &=
\left(
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top
\end{aligned}\end{equation}


The derivatives of $\operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right]$ and $\ln|\mathbf X^\top \mathbf X|$ can be found in The Matrix Cookbook. The gradient of $\mathbf n^\top \exp\left(\boldsymbol\mu+ \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)$ can be derived by considering single elements of $\mathbf X$.
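For reference, the two standard identities can be written out as follows. This is a sketch assuming that $\boldsymbol\Sigma_0^{-1}$ is symmetric and that $\mathbf X$ has full column rank, so that $\mathbf X^\top\mathbf X$ is invertible and $\mathbf X^{+}=(\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top$.

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\mathbf X}\, \tfrac 1 2 \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right]
&=
\tfrac 1 2 \left(\boldsymbol\Sigma_0^{-1} + \boldsymbol\Sigma_0^{-\top}\right)\mathbf X
=
\boldsymbol\Sigma_0^{-1}\mathbf X
\\
\operatorname{\nabla}_{\mathbf X}\, \tfrac 1 2 \ln|\mathbf X^\top \mathbf X|
&=
\mathbf X \left(\mathbf X^\top \mathbf X\right)^{-1}
=
{\mathbf X^{+}}^\top
\end{aligned}\end{equation}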

\begin{equation}\begin{aligned}
\tfrac{d}{dx_{ij}}
\mathbf n^\top \exp\left(\boldsymbol\mu+ \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)
&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12 \tfrac{d}{dx_{ij}} \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]_l
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12\sum_k \tfrac{d}{dx_{ij}} \mathbf x_{lk}^2
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12\sum_k 2 \mathbf x_{lk} \delta_{ij=lk}
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\mathbf x_{lj} \delta_{i=l}
\\&=
n_i\langle\lambda_i\rangle \mathbf x_{ij}
\\&=
\left[\operatorname{diag}[\mathbf n\circ \langle\boldsymbol\lambda\rangle]\, \mathbf X\right]_{ij}
\end{aligned}\end{equation}

Writing down the Hessian in $\mathbf X$ is cumbersome, since it is a four-dimensional array. In practice, however, we only need the product of the Hessian with a "vector", which in this case is a matrix $\mathbf M$ with the same shape as $\mathbf X$. For the rate term $\mathbf n^\top\langle\boldsymbol\lambda\rangle$ this product is:

\begin{equation}\begin{aligned}
{}\left[\operatorname{H}_{\mathbf X}\, \mathbf n^\top\langle\boldsymbol\lambda\rangle \right]{\mathbf M}
=\operatorname{diag}\left[\mathbf n \circ \langle\boldsymbol\lambda\rangle\right]
\left[ \left(\mathbf I\circ\mathbf M\mathbf X^\top\right)\mathbf X + \mathbf M \right]
\end{aligned}\end{equation}

The derivation is tedious. The Hessian-vector product of the full loss can be obtained by differentiating the scalar product between the gradient and a matrix $\mathbf M$, $\operatorname{tr}\left\{\left[\operatorname{\nabla}_{\mathbf X} \mathcal L\right]^\top \mathbf M\right\}$:

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\mathbf X} \operatorname{tr}\left\{
\left[\operatorname{\nabla}_{\mathbf X}
\mathcal L\right]^\top \mathbf M\right\} 
&=
\operatorname{\nabla}_{\mathbf X} \operatorname{tr}\left\{
\left[\left(\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top\right]^\top \mathbf M\right\}
\\&=
\operatorname{\nabla}_{\mathbf X} \operatorname{tr}\left\{
\left[\mathbf X^\top\left(\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right) - \mathbf X^{+}\right] \mathbf M\right\}
\\&=
\operatorname{\nabla}_{\mathbf X} 
\operatorname{tr}\left\{
\mathbf X^\top\left(\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf M - \mathbf X^{+}\mathbf M\right\}
\\
&=
\operatorname{\nabla}_{\mathbf X}\operatorname{tr}\left\{
\mathbf X^\top\operatorname{diag}\left[
\mathbf n\circ\langle\boldsymbol\lambda\rangle
\right]\mathbf M
\right\}
+
\operatorname{\nabla}_{\mathbf X}\operatorname{tr}\left\{
\mathbf X^\top\boldsymbol\Sigma_0^{-1}\mathbf M
\right\}
- 
\operatorname{\nabla}_{\mathbf X}\operatorname{tr}\left\{
\mathbf M\mathbf X^{+}
\right\}
\end{aligned}\end{equation}

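The rate term needs more care than the steps above suggest: $\langle\boldsymbol\lambda\rangle$ itself depends on $\mathbf X$, so $\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]$ cannot be treated as a constant when differentiating. It is necessary to derive this term element-wise, starting from the gradient: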

\begin{equation}\begin{aligned}
\frac{d}{d x_{ij}} \mathbf n^\top \langle\boldsymbol\lambda\rangle
&=
\frac{d}{d x_{ij}} \mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)
\\&=
\frac{d}{d x_{ij}} \sum_k n_k \exp\left(\mu_k + \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]_k\right)
\\&=
\frac{d}{d x_{ij}} \sum_l n_l \exp\left(\mu_l + \tfrac 1 2 \sum_k x_{lk}^2 \right)
\\&=
\sum_l n_l \frac{d}{d x_{ij}}\exp\left(\mu_l + \tfrac 1 2 \sum_k x_{lk}^2 \right)
\\&=
\sum_l n_l \langle\lambda_l\rangle \tfrac 1 2 \sum_k \frac{d}{d x_{ij}} x_{lk}^2
\\&=
\sum_l n_l \langle\lambda_l\rangle \sum_k x_{lk} \delta_{lk=ij}
\\&=
\sum_l n_l \langle\lambda_l\rangle x_{lj} \delta_{l=i}
\\&=
n_i \langle\lambda_i\rangle x_{ij}
\end{aligned}\end{equation}

\begin{equation}\begin{aligned}
\frac{d}{d x_{kl}}
\frac{d}{d x_{ij}} \mathbf n^\top \langle\boldsymbol\lambda\rangle
&=
\frac{d}{d x_{kl}} n_i \langle\lambda_i\rangle x_{ij}
\\&=
n_i x_{ij} \frac{d}{d x_{kl}} \langle\lambda_i\rangle 
+n_i \langle\lambda_i\rangle \frac{d}{d x_{kl}} x_{ij}
\\&=
n_i \langle\lambda_i\rangle x_{ij} \tfrac 1 2 \frac{d}{d x_{kl}} \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]_i
+n_i \langle\lambda_i\rangle \delta_{ij=kl}
\\&=
n_i \langle\lambda_i\rangle x_{ij} \tfrac 1 2 
\sum_m \frac{d}{d x_{kl}} x_{im}^2
+n_i \langle\lambda_i\rangle \delta_{ij=kl}
\\&=
n_i \langle\lambda_i\rangle x_{ij}
\sum_m x_{im} \delta_{im=kl}
+n_i \langle\lambda_i\rangle \delta_{ij=kl}
\\&=
n_i \langle\lambda_i\rangle
x_{ij} x_{il} \delta_{i=k}
+n_i \langle\lambda_i\rangle \delta_{ij=kl}
\\&=
\delta_{i=k}\,
n_i \langle\lambda_i\rangle
\left( x_{ij} x_{il} + \delta_{j=l} \right)
\end{aligned}\end{equation}

Taking the scalar product with $\mathbf M$ (with elements $m_{ij} = (\mathbf M)_{ij}$), that is, summing over the indices $ij$, gives element $kl$ of the Hessian-vector product. Because the Hessian is symmetric, it does not matter on which side $\mathbf M$ is applied.

\begin{equation}\begin{aligned}
\left<\frac{d}{d x_{kl}}
\frac{d}{d x_{ij}} \mathbf n^\top \langle\boldsymbol\lambda\rangle, 
\, 
\mathbf M \right>_{kl}
&=
\sum_{ij} m_{ij}
\delta_{i=k}\,
n_i \langle\lambda_i\rangle
\left( x_{ij} x_{il} + \delta_{j=l} \right)
\\&=
n_k \langle\lambda_k\rangle
\sum_{j} m_{kj}
\left( x_{kj} x_{kl} + \delta_{j=l} \right)
\\&=
n_k \langle\lambda_k\rangle \left[
x_{kl} \sum_{j} m_{kj} x_{kj}
+
m_{kl}
\right]
\\&=
n_k \langle\lambda_k\rangle \left[
x_{kl} \left(\mathbf M \mathbf X^\top\right)_{kk}
+
m_{kl}
\right]
\\&=
\left\{
\operatorname{diag}\left[\mathbf n \circ \langle\boldsymbol\lambda\rangle\right]
\left[
\left(\mathbf I\circ\mathbf M \mathbf X^\top\right) \mathbf X
+
\mathbf M
\right]
\right\}_{kl}
\end{aligned}\end{equation}
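A result like this is easy to get wrong, so it is worth checking against finite differences. The sketch below tests the form $\operatorname{diag}\left[\mathbf n \circ \langle\boldsymbol\lambda\rangle\right]\left[(\mathbf I\circ\mathbf M\mathbf X^\top)\mathbf X + \mathbf M\right]$ for the rate term only. The sizes and the random values of $\mathbf n$, $\boldsymbol\mu$, $\mathbf X$ and $\mathbf M$ are arbitrary stand-ins, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
L2, K = 5, 3                        # hypothetical sizes for illustration
X = 0.3 * rng.standard_normal((L2, K))
M = rng.standard_normal((L2, K))    # direction in which the Hessian is applied
n = rng.uniform(1.0, 3.0, L2)       # stand-in for the per-bin weights n
mu = 0.1 * rng.standard_normal(L2)  # stand-in for the mean mu

def rate(X):
    # <lambda> = exp(mu + 1/2 diag[X X^T]); diag[X X^T] is the row-wise sum of squares
    return np.exp(mu + 0.5 * np.sum(X**2, axis=1))

def grad_rate_term(X):
    # gradient of n^T <lambda> with respect to X: diag[n o <lambda>] X
    return (n * rate(X))[:, None] * X

def hvp_rate_term(X, M):
    # diag[n o <lambda>] ((I o M X^T) X + M); only the diagonal of M X^T is needed
    d = np.sum(M * X, axis=1)
    return (n * rate(X))[:, None] * (d[:, None] * X + M)

# Central difference of the gradient along M approximates the Hessian-vector product
eps = 1e-6
hvp_fd = (grad_rate_term(X + eps * M) - grad_rate_term(X - eps * M)) / (2 * eps)
max_err = np.max(np.abs(hvp_fd - hvp_rate_term(X, M)))
```

Computing only the diagonal of $\mathbf M\mathbf X^\top$ via a row-wise sum avoids forming the full $L^2\times L^2$ product.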


The remaining terms of the loss are

\begin{equation}\begin{aligned}
\tfrac 1 2 \left \{ \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right] - \ln|\mathbf X^\top \mathbf X |\right\}
\end{aligned}\end{equation}

If $\mathbf X$ is square and invertible, then $\ln|\mathbf X^\top \mathbf X| = 2\ln|\mathbf X|$ and the gradient of these terms is

\begin{equation}\begin{aligned}
\boldsymbol\Sigma_0^{-1}\mathbf X - \mathbf X^{-\top}
\end{aligned}\end{equation}

Its scalar product with $\mathbf M$ is

\begin{equation}\begin{aligned}
\operatorname{tr}[
\mathbf X^\top \boldsymbol\Sigma_0^{-1}\mathbf M - \mathbf X^{-1}\mathbf M
]
\end{aligned}\end{equation}

and differentiating once more with respect to $\mathbf X$ gives the corresponding Hessian-vector product

\begin{equation}\begin{aligned}
\boldsymbol\Sigma_0^{-1}\mathbf M + \mathbf X^{-\top}\mathbf M^\top \mathbf X^{-\top}
\end{aligned}\end{equation}

We can work with a full-rank, invertible $\mathbf X$, but we then need to be able to take its inverse. This seems to imply that the inverse will diverge if $\mathbf X$ ever takes on a value that is low-rank.
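A scalar case may make this concern concrete. Take $K=L^2=1$, write $\mathbf X = x$ and $\boldsymbol\Sigma_0 = \sigma_0^2$ (symbols introduced here only for illustration), so that

\begin{equation}\begin{aligned}
\mathcal L(x) &= n \exp\left(\mu + \tfrac 1 2 x^2\right) + \tfrac 1 2 \left\{ \frac{x^2}{\sigma_0^2} - \ln x^2 \right\}
\\
\frac{d\mathcal L}{dx} &= n \langle\lambda\rangle x + \frac{x}{\sigma_0^2} - \frac 1 x
\end{aligned}\end{equation}

As $x\to0$ the term $-1/x$ in the gradient diverges, but so does $-\tfrac 1 2\ln x^2$ in the loss itself. Under these assumptions the log-determinant acts as a barrier: the loss grows without bound near the rank-deficient point, so a descent method with sufficiently small steps should be pushed away from it rather than into it.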

## Reduced Fourier-space approximation 

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\Sigma)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] - \ln|\boldsymbol\Sigma|\right\}
\end{aligned}\end{equation}

We consider an approximate posterior covariance of the form

\begin{equation}\begin{aligned}
\boldsymbol\Sigma &\approx \mathbf F^\top \mathbf Q \mathbf Q^\top \mathbf F,
\,\,\,\,\mathbf Q\in\mathbb R^{K\times K}
\end{aligned}\end{equation}



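Here $\mathbf F$ must have shape $K\times L^2$ for the product to have the dimensions of $\boldsymbol\Sigma$. Neglecting terms that do not depend on $\mathbf Q$, the loss to be minimized is: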

\begin{equation}\begin{aligned}
\mathcal L(\mathbf Q)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\mathbf F^\top \mathbf Q \mathbf Q^\top \mathbf F\right]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf F^\top \mathbf Q \mathbf Q^\top \mathbf F\right] - \ln| \mathbf Q \mathbf Q^\top |\right\}
\end{aligned}\end{equation}

Differentiating the general loss with respect to a single element $\Sigma_{ij}$ of the covariance gives:

\begin{equation}\begin{aligned}
\frac{\partial\mathcal L}{\partial\Sigma_{ij}}
&=
\frac{\partial}{\partial\Sigma_{ij}} \left\{ \mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] - \ln|\boldsymbol\Sigma|\right\}
\right\}
\\&=
\mathbf n^\top \frac{\partial}{\partial\Sigma_{ij}} \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)
+ 
\frac 1 2 \frac{\partial}{\partial\Sigma_{ij}} \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] 
- 
\frac 1 2 \frac{\partial}{\partial\Sigma_{ij}} \ln|\boldsymbol\Sigma|
\\&=
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n \circ \langle\boldsymbol\lambda\rangle]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-\top}\right\}_{ij}
\end{aligned}\end{equation}

$$
\frac{\partial\mathcal L}{\partial\theta_i} = 
\operatorname{tr}\left[
\left(\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
\right)^\top
\frac{\partial\boldsymbol\Sigma}{\partial \theta_i}
\right]
$$

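Taking the parameter to be an element $Q_{ij}$ of $\mathbf Q$, and writing $\mathbf J^{ij}$ for the single-entry matrix that is one at position $(i,j)$ and zero elsewhere, the derivative of the covariance is

\begin{equation}\begin{aligned}
\frac{\partial\boldsymbol\Sigma}{\partial Q_{ij}}
=
\mathbf F^\top  (\mathbf J^{ij}\mathbf Q^\top + \mathbf Q \mathbf J^{ji}) \mathbf F
\end{aligned}\end{equation}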

$$
\frac{\partial\mathcal L}{\partial Q_{ij}} = \tfrac 1 2
\operatorname{tr}\left[
\left(
\operatorname{diag}[\mathbf n \circ \langle\boldsymbol\lambda\rangle ]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}
\right)
\mathbf F^\top  (\mathbf J^{ij}\mathbf Q^\top + \mathbf Q \mathbf J^{ji}) \mathbf F
\right]
$$

$$
2QFF^\top
$$

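Right-multiplication $\mathbf A \mathbf J^{ij}$ places column $i$ of $\mathbf A$ in column $j$ of an otherwise zero matrix.

Left-multiplication $\mathbf J^{ij}\mathbf A$ places row $j$ of $\mathbf A$ in row $i$ of an otherwise zero matrix.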
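These two rules can be confirmed on a small example. The helper `single_entry` below is a hypothetical name introduced for illustration, and it uses 0-based indices as is conventional in NumPy.

```python
import numpy as np

def single_entry(i, j, rows, cols):
    """J^{ij}: zeros everywhere except a one at row i, column j (0-based)."""
    J = np.zeros((rows, cols))
    J[i, j] = 1.0
    return J

A = np.arange(1.0, 10.0).reshape(3, 3)
i, j = 0, 2

right = A @ single_entry(i, j, 3, 3)  # column i of A lands in column j
left = single_entry(i, j, 3, 3) @ A   # row j of A lands in row i

# Build the expected results directly from the stated rules
expected_right = np.zeros((3, 3))
expected_right[:, j] = A[:, i]
expected_left = np.zeros((3, 3))
expected_left[i, :] = A[j, :]
```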

$$
\frac
{\partial(\mathbf F^\top \mathbf Q \mathbf Q^\top \mathbf F)_{ij}}
{\partial\mathbf Q_{kl}}
=
\sum_m F_{ki} F_{mj} Q_{ml}
+
F_{mi} F_{kj} Q_{ml}
=
F_{ki} (F^\top Q)_{jl}
+
F_{kj} (F^\top Q)_{il}
$$
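The element-wise derivative above can be compared against finite differences and against the single-entry-matrix form $\mathbf F^\top(\mathbf J^{kl}\mathbf Q^\top + \mathbf Q\mathbf J^{lk})\mathbf F$. This is a sketch only: the sizes are arbitrary, and a random matrix stands in for $\mathbf F$, since the derivative identity does not depend on $\mathbf F$ being a Fourier basis.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L2 = 3, 5                       # hypothetical sizes
F = rng.standard_normal((K, L2))   # random stand-in for F (K x L2)
Q = rng.standard_normal((K, K))

def sigma(Q):
    return F.T @ Q @ Q.T @ F

def dsigma_dQ(Q, k, l):
    # entry (i, j) is F_ki (F^T Q)_jl + F_kj (F^T Q)_il
    FtQ = F.T @ Q
    return np.outer(F[k], FtQ[:, l]) + np.outer(FtQ[:, l], F[k])

def dsigma_dQ_via_J(Q, k, l):
    # F^T (J^{kl} Q^T + Q J^{lk}) F with J^{kl} the single-entry matrix
    J = np.zeros((K, K))
    J[k, l] = 1.0
    return F.T @ (J @ Q.T + Q @ J.T) @ F

eps = 1e-6
max_err_fd = 0.0
max_err_J = 0.0
for k in range(K):
    for l in range(K):
        dQ = np.zeros((K, K))
        dQ[k, l] = eps
        fd = (sigma(Q + dQ) - sigma(Q - dQ)) / (2 * eps)
        analytic = dsigma_dQ(Q, k, l)
        max_err_fd = max(max_err_fd, np.max(np.abs(fd - analytic)))
        max_err_J = max(max_err_J, np.max(np.abs(dsigma_dQ_via_J(Q, k, l) - analytic)))
```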

$$
\frac{\partial\mathcal L}{\partial Q_{kl}} = \tfrac 1 2
\operatorname{tr}\left[
\left(
\operatorname{diag}[\mathbf n \circ \langle\boldsymbol\lambda\rangle ]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}
\right)
\mathbf F^\top  (\mathbf J^{kl}\mathbf Q^\top + \mathbf Q \mathbf J^{lk}) \mathbf F
\right]
$$

$$
\sum_m [F_{ki} F_{mj} 
+
F_{mi} F_{kj}] Q_{ml}
$$

$$
\left[Q^\top F A F^\top\right]_{qp}
$$

# We need a way to verify derivatives. 

I'm not sure I feel like .. let's just.. ones. 
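One simple way to verify a derivative is to compare it with central finite differences. The sketch below does this for the low-rank gradient $\left(\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top$. The helper `numerical_gradient` is a hypothetical name, and all numerical values (sizes, $\mathbf n$, $\boldsymbol\mu$, $\boldsymbol\Sigma_0^{-1}$, $\mathbf X$) are random stand-ins rather than anything taken from data.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of df/dX for a scalar function f of a matrix X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        dX = np.zeros_like(X)
        dX[idx] = eps
        G[idx] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
L2, K = 6, 2                              # hypothetical sizes
n = rng.uniform(1.0, 3.0, L2)
mu = 0.1 * rng.standard_normal(L2)
B = rng.standard_normal((L2, L2))
Sigma0_inv = B @ B.T + L2 * np.eye(L2)    # symmetric positive definite stand-in
X = 0.5 * rng.standard_normal((L2, K))

def loss(X):
    rate = np.exp(mu + 0.5 * np.sum(X**2, axis=1))
    _, logdet = np.linalg.slogdet(X.T @ X)    # ln|X^T X|, finite for full column rank
    return n @ rate + 0.5 * (np.trace(Sigma0_inv @ X @ X.T) - logdet)

def grad(X):
    rate = np.exp(mu + 0.5 * np.sum(X**2, axis=1))
    return (n * rate)[:, None] * X + Sigma0_inv @ X - np.linalg.pinv(X).T

max_err = np.max(np.abs(numerical_gradient(loss, X) - grad(X)))
```

A small `max_err` (limited by floating-point round-off) indicates that the analytic gradient and the loss are consistent. The same helper can be pointed at any other scalar loss and candidate gradient.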

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\mu,\boldsymbol\Sigma)
&=
\mathbf n^\top \left[\langle\boldsymbol\lambda\rangle-\bar{\mathbf y} \circ \boldsymbol\mu\right]
+ 
\tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] + 
{\boldsymbol\mu}^\top{{\boldsymbol\Sigma}_0}^{-1}\boldsymbol\mu
+ \ln|\boldsymbol\Sigma_0| - \ln|\boldsymbol\Sigma|
\right\}
\\
\langle\boldsymbol\lambda\rangle &=
\exp\left(
\boldsymbol\mu+\boldsymbol\mu_0
+
\tfrac 1 2 \operatorname{diag}\left[\boldsymbol\Sigma\right]\right)
\end{aligned}
\label{loss}\end{equation}

## Variational Bayes: Low-rank approximation 

We consider an approximate posterior covariance of the form

\begin{equation}\begin{aligned}
\boldsymbol\Sigma &\approx \mathbf X \mathbf X^\top,\,\,\,\,\mathbf X\in\mathbb R^{L^2\times K}
\end{aligned}\end{equation}

where $\mathbf X$ is a tall, thin matrix with as many rows as there are spatial bins ($L^2$) and $K<L^2$ columns. We view $\mathbf X$ as being composed of $L^2$ length-$K$ row vectors $\mathbf x_i$, $i=1,\dots,L^2$:

\begin{equation}\begin{aligned}
\mathbf X^\top &= \{ \mathbf x_1^\top,...,\mathbf x_{L^2}^\top \}
\end{aligned}\end{equation}

Neglecting terms that do not depend on $\mathbf X$, the loss to be minimized is:

\begin{equation}\begin{aligned}
\mathcal L(\mathbf X)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right] - \ln|\mathbf X^\top \mathbf X |\right\}
\end{aligned}\end{equation}

Note that we have used $\ln|\mathbf X^\top \mathbf X|$ above, rather than $\ln|\mathbf X\mathbf X^\top|$. Since $\mathbf X \mathbf X^\top$ has rank $K<L^2$, it always has a null space of dimension $L^2-K$ with zero eigenvalues. This null space doesn't affect the gradient, but it does make the log-determinant diverge to $-\infty$. $\ln|\mathbf X^\top \mathbf X|$ is the log-determinant restricted to the $K$-dimensional space spanned by the columns of $\mathbf X$, and remains finite as long as $\mathbf X$ has full column rank. The gradient is:

\begin{equation}\begin{aligned}
%%%% GRADIENT IN X
\operatorname{\nabla}_{\mathbf X}
\mathcal L &=
\left(
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top
\end{aligned}\end{equation}


The derivatives of $\operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right]$ and $\ln|\mathbf X^\top \mathbf X|$ can be found in The Matrix Cookbook. The gradient of $\mathbf n^\top \exp\left(\boldsymbol\mu+ \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)$ can be derived by considering single elements of $\mathbf X$.

The Hessian-vector product is given below. Note that $\langle\boldsymbol\lambda\rangle$ depends on $\mathbf X$, which contributes the $\left(\mathbf I\circ\mathbf M\mathbf X^\top\right)\mathbf X$ factor in the first term:

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\mathbf X} \left< \operatorname{\nabla}_{\mathbf X} \mathcal L, \mathbf M \right>
&=
\operatorname{\nabla}_{\mathbf X} \left< 
\left(
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top
, \mathbf M \right>
\\
&=
\operatorname{\nabla}_{\mathbf X}
\operatorname{tr}\left[
\mathbf X^\top
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]\mathbf M
+ 
\mathbf X^\top \boldsymbol\Sigma_0^{-1}\mathbf M
- 
{\mathbf X^{+}}\mathbf M
\right]
\\
&=
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]
\left[\left(\mathbf I\circ\mathbf M\mathbf X^\top\right)\mathbf X + \mathbf M\right]
+ 
\boldsymbol\Sigma_0^{-1}\mathbf M
+{\mathbf X^+}^\top \mathbf M^\top {\mathbf X^+}^\top - (\mathbf I - {\mathbf X^+}^\top \mathbf X^\top) \mathbf M \mathbf X^+ {\mathbf X^+}^\top
\end{aligned}\end{equation}

The derivative involving the pseudoinverse is given in Golub and Pereyra (1972), Eq. 4.12.
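The complete Hessian-vector product can be checked by differencing the gradient along $\mathbf M$. The sketch below assumes $\mathbf X$ has full column rank and includes the $\left(\mathbf I\circ\mathbf M\mathbf X^\top\right)\mathbf X$ term that arises because $\langle\boldsymbol\lambda\rangle$ depends on $\mathbf X$. As before, the sizes and random inputs are stand-ins and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
L2, K = 6, 2                              # hypothetical sizes
n = rng.uniform(1.0, 3.0, L2)
mu = 0.1 * rng.standard_normal(L2)
B = rng.standard_normal((L2, L2))
Sigma0_inv = B @ B.T + L2 * np.eye(L2)    # symmetric positive definite stand-in
X = 0.5 * rng.standard_normal((L2, K))
M = rng.standard_normal((L2, K))

def rate(X):
    return np.exp(mu + 0.5 * np.sum(X**2, axis=1))

def grad(X):
    # (diag[n o <lambda>] + Sigma0^{-1}) X - (X^+)^T
    return (n * rate(X))[:, None] * X + Sigma0_inv @ X - np.linalg.pinv(X).T

def hvp(X, M):
    Xp = np.linalg.pinv(X)                # X^+, shape K x L2
    w = (n * rate(X))[:, None]
    # rate term: diag[n o <lambda>] ((I o M X^T) X + M)
    rate_part = w * (np.sum(M * X, axis=1)[:, None] * X + M)
    # prior term: Sigma0^{-1} M
    prior_part = Sigma0_inv @ M
    # pseudoinverse term: (X^+)^T M^T (X^+)^T - (I - X X^+) M X^+ (X^+)^T
    P_perp = np.eye(L2) - X @ Xp
    pinv_part = Xp.T @ M.T @ Xp.T - P_perp @ M @ Xp @ Xp.T
    return rate_part + prior_part + pinv_part

eps = 1e-6
hvp_fd = (grad(X + eps * M) - grad(X - eps * M)) / (2 * eps)
max_err = np.max(np.abs(hvp_fd - hvp(X, M)))
```

The code uses $\mathbf X\mathbf X^+$ where the equation has ${\mathbf X^+}^\top\mathbf X^\top$; the two are equal because $\mathbf X\mathbf X^+$ is a symmetric projector.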