In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from util   import *
from basics import *
from simulate_data import *
from estimators    import *
from config import *
from scipy.special import *
configure_pylab()   

In [8]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});
MathJax.Hub.Queue(
  ["resetEquationNumbers", MathJax.InputJax.TeX],
  ["PreProcess", MathJax.Hub],
  ["Reprocess", MathJax.Hub]
);


# Variational Bayes

The posterior mode can be a good solution if the prior is appropriately chosen. However, the Laplace approximation is not accurate enough to estimate the model likelihood and optimize the prior hyperparameters when they are unknown. This is because the Laplace approximation uses the curvature at the posterior mode to approximate the posterior covariance, which is inaccurate when the posterior is highly skewed.

To select hyperparameters, we employ variational Bayes. This optimizes the mean and covariance of a Gaussian approximation to the posterior, $q(\mathbf z)$, to minimize the Kullback-Leibler divergence from the true posterior $p(\mathbf z)$ to this Gaussian approximation.

\begin{equation}\begin{aligned}
p(\mathbf z) &= \Pr(\mathbf z | \mathbf y) = \Pr( \mathbf y | \mathbf z ) \frac{\Pr(\mathbf z)}{\Pr(\mathbf y)}
\\
q(\mathbf z) &\sim \mathcal N( \boldsymbol\mu, \boldsymbol \Sigma)
\\
q(\mathbf z) &= \underset{\boldsymbol\mu, \boldsymbol \Sigma}{\operatorname{argmin}}
D_{\text{KL}}\left[ q(\mathbf z) \| p(\mathbf z) 
\right]
\end{aligned}\end{equation}

Minimizing this KL divergence is equivalent to maximizing the expected log-likelihood of the observations $\mathbf y$, averaged over $q(\mathbf z)$, while simultaneously minimizing the KL divergence from the prior $\Pr(\mathbf z)$ to the approximate posterior.

\begin{equation}\begin{aligned}
D_{\text{KL}}[q(\mathbf z) \| p(\mathbf z)] 
&= -\left< \ln\Pr(\mathbf y|\mathbf z)\right>
+ D_{\text{KL}}[q(\mathbf z) \| \Pr(\mathbf z)]
+ \text{constant}
\end{aligned}\end{equation}

Above (and throughout this manuscript), expectations $\langle\cdot\rangle$ are taken with respect to the approximate posterior $q(\mathbf z)$ unless otherwise stated.

Optimizing the full posterior covariance $\boldsymbol \Sigma$ is intractable. For example, the covariance for a 100×100 spatial grid contains $10^8$ entries. Instead, we explore various parameterizations of $\boldsymbol \Sigma$ which balance computational tractability with expressiveness.

We evaluate the following approximations for the posterior covariance: 
1. A diagonal approximation $\boldsymbol\Sigma\approx\mathbf A \operatorname{diag}[\mathbf v] \mathbf A^\top$, where $\mathbf A$ is a fixed spatial convolution. We choose $\mathbf A$ to be a local Gaussian blur. This assumes that correlations are local, and that there is a high-frequency cutoff in their spatial scale. It will not capture long-range correlations. This has complexity $\mathcal O(L^2 \log(L))$, where $L$ is the linear spatial dimension. (TODO: check complexity after writing code)
2. An inverse diagonal approximation $\boldsymbol\Sigma\approx[\boldsymbol\Sigma_0^{-1} + \operatorname{diag}[\mathbf p]]^{-1}$, where $\boldsymbol\Sigma_0$ is the prior covariance and $\mathbf p$ is a vector of the inverse variance (precision) at each spatial location. This form resembles the Laplace approximation with a covariance correction. This has complexity ???. (TODO: check complexity after writing code)
3. A low-rank approximation $\boldsymbol\Sigma\approx\mathbf X\mathbf X^\top$, where $\mathbf X$ is a tall, thin matrix. This captures the principal subspace of $\boldsymbol\Sigma$. This has complexity $\mathcal O(L^2 K)$, where $L$ is the linear spatial dimension and $K$ is the number of components in $\mathbf X$. (TODO: check complexity after writing code)
4. A reduced Fourier-space representation $\mathbf F^\top \mathbf Q \mathbf Q^\top \mathbf F$, where $\mathbf F$ is the unitary 2D Fourier transform, with frequencies that are zero in the prior $\boldsymbol\Sigma_0$ discarded, and $\mathbf Q$ is a (lower? upper? TODO) triangular matrix. Since the posterior cannot assign probability mass where the prior $\boldsymbol\Sigma_0$ is zero, this fully parameterizes $\boldsymbol\Sigma$ in a compact way. The Fourier transform can be computed quickly using the Fast Fourier Transform (FFT). Complex values in $\mathbf Q$ can be avoided by using the Discrete Cosine Transform (DCT). This has complexity $\mathcal O(L^2 \log(L) K^3)$, where $L$ is the linear spatial dimension and $K$ is the rank of $\mathbf Q$. (TODO: check complexity after writing code)
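
To make the trade-off concrete, the following sketch counts the numbers stored by each parameterization for a 100×100 grid. The value of `K` is a hypothetical example, and the Fourier count assumes that $\mathbf Q$ is a $K\times K$ triangular matrix over $K$ retained frequencies, which is one reading of the description above rather than a settled design.

```python
# Illustrative storage counts for each covariance parameterization.
# L matches the 100x100 example in the text; K is a hypothetical choice.
L = 100            # linear size of the spatial grid
N = L * L          # number of spatial bins
K = 20             # hypothetical number of components / retained frequencies

counts = {
    "full covariance (entries)": N * N,
    "diagonal v": N,
    "inverse-diagonal p": N,
    "low-rank X (N x K)": N * K,
    # Assumption: Q is K x K triangular, so it has K(K+1)/2 free entries.
    "Fourier, triangular Q (K x K)": K * (K + 1) // 2,
}

for name, count in counts.items():
    print(f"{name:32s} {count:>12,d}")
```

The counts describe storage only; the cost of evaluating the loss and its derivatives for each form is a separate question that the complexity estimates above address.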

In the following sections, we derive closed-form expressions for the gradient and Hessian of each of these models. But first, we write down the general form for $D_{\text{KL}}(q(z) \| p(z))$ in the case of a log-Gaussian Cox process. 

We employ a Poisson observation model, which models the firing intensity $\lambda_i$ at each location $i$ as an exponential function of the estimated log-rate $z_i$, plus the prior mean $\mu_{0,i}$ (which we assume is fixed).

\begin{equation}\begin{aligned}
\Pr\left(\textstyle\int_t^{t+\Delta t} y(t)\, dt = k\right) &\sim \operatorname{Poisson}\left(\textstyle\int_t^{t+\Delta t} \lambda(t)\, dt\right)
=
\frac{1}{k!} \lambda^k e^{-\lambda}.
\end{aligned}\end{equation}

If the animal is at location $i$ at time $t$, then the probability of observing $y$ spikes is

\begin{equation}\begin{aligned}
\Pr(y_t) &\sim \operatorname{Poisson}\left[ \lambda_i = e^{z_i+\mu_{0,i}} \right]
\\
\ln\Pr(y_t)&= y_t \ln(\lambda_i)-\lambda_i + \text{constant}
\\
&= y_t z_i - e^{z_i+\mu_{0,i}} + \text{constant}
\end{aligned}\end{equation}

Above, constant terms that depend only on the observations $\mathbf y$ are omitted as they do not change the solution. The overall log-likelihood is a sum of this Poisson log-likelihood over all observations. We simplify this sum by averaging observations in the same bin together:

\begin{equation}\begin{aligned}
T_i &= \{ t \mid \mathbf x_t \text{ in bin }i  \},
\\
n_i &= |T_i|,
\\
\bar y_i &= \tfrac 1 {n_i} \textstyle\sum_{t \in T_i} y_t,
\\
\textstyle\sum_{t \in T_i} \left[y_t \ln(\lambda_i) - \lambda_i\right]
&= n_i \left[ \bar y_i \ln(\lambda_i) - \lambda_i \right].
\end{aligned}\end{equation}
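
The binning identity can be checked on synthetic data. The sketch below is a minimal illustration using NumPy; the variable names (`bin_index`, `lam`, and so on) and the simulated data are hypothetical and not taken from the modules imported at the top of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bins = 5        # hypothetical number of spatial bins
n_samples = 200   # hypothetical number of time steps
bin_index = rng.integers(0, n_bins, size=n_samples)  # bin occupied at each time step
lam = rng.uniform(0.5, 2.0, size=n_bins)             # hypothetical rate in each bin
y = rng.poisson(lam[bin_index])                      # spike count at each time step

# n_i: number of visits to each bin; ybar_i: mean spike count per visit.
n = np.bincount(bin_index, minlength=n_bins)
spike_total = np.bincount(bin_index, weights=y, minlength=n_bins)
ybar = np.divide(spike_total, n, out=np.zeros(n_bins), where=n > 0)

# Per-sample log-likelihood (dropping the ln(y!) constant) ...
ll_samples = np.sum(y * np.log(lam[bin_index]) - lam[bin_index])
# ... agrees with the binned form n^T (ybar * ln(lam) - lam).
ll_binned = n @ (ybar * np.log(lam) - lam)

print(ll_samples, ll_binned)
```

Bins that are never visited have $n_i = 0$ and contribute nothing to the sum, which is why the division guards against empty bins.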

This can be written in vector notation as 
\begin{equation}\begin{aligned}
\ln \Pr(\mathbf y | \mathbf z)
&= \mathbf n^\top \left( \bar{\mathbf y} \circ \mathbf z - \boldsymbol\lambda \right) + \text{constant},
\end{aligned}\end{equation}

where $\mathbf n$ is a vector of the number of visits to each location, and $\circ$ denotes element-wise multiplication. Since $q(\mathbf z)$ is Gaussian, the expectation $\langle\ln \Pr(\mathbf y | \mathbf z)\rangle$ has a closed form based on the mean of the log-normal distribution:

\begin{equation}\begin{aligned}
\left< \ln p(\mathbf y | \mathbf z) \right>_{q(\mathbf z)} 
&= \langle
\mathbf n^\top \left( \bar{\mathbf y} \circ \mathbf z - \boldsymbol\lambda \right)
\rangle + \text{constant}
\\
&= 
\mathbf n^\top( \bar{\mathbf y} \circ \boldsymbol\mu)
- \mathbf n^\top\langle\boldsymbol\lambda\rangle + \text{constant}
\\
&= 
\mathbf n^\top \left[\bar{\mathbf y} \circ \boldsymbol\mu
- \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)\right] + \text{constant.}
\end{aligned}\end{equation}
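
As a sanity check, the closed form can be compared against a Monte Carlo average. The sketch below assumes the convention $\boldsymbol\lambda = \exp(\mathbf z)$ with $\mathbf z \sim \mathcal N(\boldsymbol\mu, \boldsymbol\Sigma)$, so that $\langle\boldsymbol\lambda\rangle = \exp(\boldsymbol\mu + \tfrac12\operatorname{diag}[\boldsymbol\Sigma])$; all numerical values and the function name `expected_loglik` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                            # hypothetical number of bins
mu = rng.normal(scale=0.3, size=D)               # posterior mean of z
B = rng.normal(size=(D, D))
Sigma = 0.05 * B @ B.T + 0.05 * np.eye(D)        # posterior covariance of z
n = rng.integers(1, 5, size=D).astype(float)     # visits per bin
ybar = rng.uniform(0.0, 2.0, size=D)             # mean spike count per visit

def expected_loglik(mu, Sigma, n, ybar):
    """Closed-form <ln Pr(y|z)> under q(z), up to an additive constant."""
    mean_rate = np.exp(mu + 0.5 * np.diag(Sigma))   # log-normal mean
    return n @ (ybar * mu - mean_rate)

# Monte Carlo estimate of the same expectation.
z = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean((ybar * z - np.exp(z)) @ n)
closed = expected_loglik(mu, Sigma, n, ybar)

print(mc, closed)
```

Only the diagonal of $\boldsymbol\Sigma$ enters the expected rate, because each $\lambda_i$ depends on a single $z_i$.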

Next, we need to compute $D_{\text{KL}}[q(\mathbf z) \| \Pr(\mathbf z)]$. Since both the prior and the approximate posterior are multivariate Gaussians, this has a closed form (here $D$ is the dimension of $\mathbf z$):

\begin{equation}\begin{aligned}
D_{\text{KL}}[q(\mathbf z) \| \Pr(\mathbf z)]
&=
\tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] + 
{(\boldsymbol\mu-\boldsymbol\mu_0)}^\top{{\boldsymbol\Sigma}_0}^{-1}(\boldsymbol\mu-\boldsymbol\mu_0)
+ \ln|\boldsymbol\Sigma_0| - \ln|\boldsymbol\Sigma| - D
\right\}
\end{aligned}\end{equation}
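
A direct implementation of the Gaussian KL divergence is short. The sketch below is a minimal NumPy version; the function name `gaussian_kl` is hypothetical. It includes the additive constant $-D$ (the dimension of $\mathbf z$) from the standard formula, which does not affect the optimization.

```python
import numpy as np

def gaussian_kl(mu, Sigma, mu0, Sigma0):
    """KL[N(mu, Sigma) || N(mu0, Sigma0)] for full-rank covariances."""
    D = mu.shape[0]
    diff = mu - mu0
    # Use linear solves instead of forming Sigma0^{-1} explicitly.
    trace_term = np.trace(np.linalg.solve(Sigma0, Sigma))
    maha_term = diff @ np.linalg.solve(Sigma0, diff)
    logdet0 = np.linalg.slogdet(Sigma0)[1]
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (trace_term + maha_term + logdet0 - logdet - D)

# One-dimensional example: q = N(1, 2), prior = N(0, 1).
kl_1d = gaussian_kl(np.array([1.0]), np.array([[2.0]]),
                    np.array([0.0]), np.array([[1.0]]))
print(kl_1d)
```

For a 100×100 grid the dense solves above are far too expensive, which is the motivation for the structured forms of $\boldsymbol\Sigma$ considered in this notebook.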

Up to constants, then, the loss function that we must minimize is: 

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\mu,\boldsymbol\Sigma)
&=
\mathbf n^\top \left[\langle\boldsymbol\lambda\rangle-\bar{\mathbf y} \circ \boldsymbol\mu\right]
+ 
\tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] + 
{(\boldsymbol\mu-\boldsymbol\mu_0)}^\top{{\boldsymbol\Sigma}_0}^{-1}(\boldsymbol\mu-\boldsymbol\mu_0)
+ \ln|\boldsymbol\Sigma_0| - \ln|\boldsymbol\Sigma|
\right\}
\end{aligned}
\label{loss}\end{equation}

The precise form for this loss function, and its derivatives, depends on the choice of approximation for $\boldsymbol\Sigma$, as we detail in the following sections. However, the derivatives in $\boldsymbol\mu$ are always the same, and similar to the gradients for the MAP estimator with the expected rate $\langle\boldsymbol\lambda\rangle$ replacing $\boldsymbol\lambda$:

\begin{equation}\begin{aligned}
%%%% GRADIENT IN μ
\operatorname{\nabla}_{\boldsymbol\mu}
\mathcal L &=
\mathbf n \circ (\langle\boldsymbol\lambda\rangle - \bar{\mathbf y})
+\boldsymbol\Sigma_0^{-1}(\boldsymbol\mu-\boldsymbol\mu_0)
\\
%%%% HESSIAN IN μ
\operatorname{H}_{\boldsymbol \mu}
\mathcal L &=
\operatorname{diag}[\mathbf n \circ \langle\boldsymbol\lambda\rangle]
+\boldsymbol\Sigma_0^{-1}
\end{aligned}\end{equation}
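
These expressions can be checked against central finite differences. The sketch below holds $\boldsymbol\Sigma$ fixed, keeps only the terms of the loss that depend on $\boldsymbol\mu$, and assumes $\langle\boldsymbol\lambda\rangle = \exp(\boldsymbol\mu + \tfrac12\operatorname{diag}[\boldsymbol\Sigma])$. The problem size, random inputs, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5                                            # hypothetical number of bins
n = rng.integers(1, 20, size=D).astype(float)    # visits per bin
ybar = rng.uniform(0.0, 2.0, size=D)             # mean spike count per visit
mu0 = np.zeros(D)                                # prior mean
B0 = rng.normal(size=(D, D))
Sigma0_inv = np.linalg.inv(B0 @ B0.T + np.eye(D))  # prior precision
Sigma = 0.1 * np.eye(D)                          # posterior covariance, held fixed
mu = rng.normal(scale=0.3, size=D)

def mean_rate(mu):
    return np.exp(mu + 0.5 * np.diag(Sigma))

def loss(mu):
    # Terms of the loss that depend on mu.
    diff = mu - mu0
    return n @ (mean_rate(mu) - ybar * mu) + 0.5 * diff @ Sigma0_inv @ diff

def grad_mu(mu):
    return n * (mean_rate(mu) - ybar) + Sigma0_inv @ (mu - mu0)

def hess_mu(mu):
    return np.diag(n * mean_rate(mu)) + Sigma0_inv

# Central finite differences along each coordinate direction.
eps = 1e-5
basis = np.eye(D)
fd_grad = np.array([(loss(mu + eps * e) - loss(mu - eps * e)) / (2 * eps)
                    for e in basis])
fd_hess = np.array([(grad_mu(mu + eps * e) - grad_mu(mu - eps * e)) / (2 * eps)
                    for e in basis])

print(np.max(np.abs(fd_grad - grad_mu(mu))))
print(np.max(np.abs(fd_hess - hess_mu(mu))))
```

Because the Hessian is a positive diagonal matrix plus the prior precision, it is positive definite, so the problem in $\boldsymbol\mu$ is convex for fixed $\boldsymbol\Sigma$.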

# Gradients of the posterior covariance

Neglecting terms that do not depend on $\boldsymbol\Sigma$, the loss to be minimized is:

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\Sigma) &=
\mathbf n^\top \langle\boldsymbol\lambda\rangle + \tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] - \ln|\boldsymbol\Sigma|\right\}
\\
\langle\boldsymbol\lambda\rangle &= 
\exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)
\end{aligned}\end{equation}

Using the formula from The Matrix Cookbook, the derivative of this in $\boldsymbol\Sigma$ (Jacobian) is 

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
&=
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}\right\}
\end{aligned}\end{equation}

Ideally, we also want the second derivative, since Newton-Raphson (often) provides quadratic convergence. The Hessian is a fourth-order tensor, which is cumbersome. However, we only need to compute Hessian-vector products. These can be calculated by taking the derivative of the Jacobian times a "vector". In this case, since the object being optimized is a matrix, we take the scalar product between the matrix-valued Jacobian and a matrix $\mathbf M$.

\begin{equation}\begin{aligned}
\left<\operatorname H_{\boldsymbol\Sigma} \mathcal L,\mathbf M\right>
&=
\operatorname{\nabla}_{\boldsymbol\Sigma} 
\left<
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
,\mathbf M
\right>
\\&=
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\left(
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
\right)^\top
\mathbf M
\right]
\\
&=
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\left(
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]
+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}\right\}
\right)^\top
\mathbf M
\right]
\\
&=
\tfrac 1 2 
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\left(
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]
+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}
\right)
\mathbf M
\right]
\end{aligned}\end{equation}

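(Transposes are dropped in the final step because $\operatorname{diag}[\cdot]$, $\boldsymbol\Sigma_0^{-1}$, and $\boldsymbol\Sigma^{-1}$ are symmetric.) The term $\operatorname{tr}[\boldsymbol\Sigma_0^{-1}\mathbf M]$ does not depend on $\boldsymbol\Sigma$, so its derivative vanishes. Using the formulas in The Matrix Cookbook, we can evaluate the remaining trace derivatives.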

\begin{equation}\begin{aligned}
\left<\operatorname H_{\boldsymbol\Sigma} \mathcal L,\mathbf M\right>
&=
\tfrac 1 2 
\left\{
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\left(
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]
- \boldsymbol\Sigma^{-1}
\right)
\mathbf M
\right]
\right\}
\\
&=
\tfrac 1 2 
\left\{
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]
\mathbf M
\right]
-
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\boldsymbol\Sigma^{-1}
\mathbf M
\right]
\right\}
\\
&=
\tfrac 1 2 
\left\{
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]
\mathbf M
\right]
+
\boldsymbol\Sigma^{-1}
\mathbf M^\top
\boldsymbol\Sigma^{-1}
\right\}
\end{aligned}\end{equation}

The derivative of the trace of a matrix-valued function is the transpose of the scalar-derivative of said function. $\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]
\mathbf M
\right]$ is best computed element-wise:

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\boldsymbol\Sigma}
\operatorname{tr}\left[
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle ]\,
\mathbf M
\right]
&= 
\operatorname{\nabla}_{\boldsymbol\Sigma}
\sum_k
[\mathbf n\circ\langle\boldsymbol\lambda\rangle]_k
\mathbf M_{kk}
\\
&= 
\sum_k
n_k \mathbf M_{kk} \operatorname{\nabla}_{\boldsymbol\Sigma} \langle\lambda_k\rangle
\\
&=
\tfrac 1 2 \operatorname{diag}\left[
\mathbf n\circ\langle\boldsymbol\lambda\rangle\circ\operatorname{diag}(\mathbf M)
\right]
\end{aligned}\end{equation}

Combining these terms, the Hessian-vector product is

\begin{equation}\begin{aligned}
\left<\operatorname H_{\boldsymbol\Sigma} \mathcal L,\mathbf M\right>
&=
\tfrac 1 2 
\left\{
\tfrac 1 2 \operatorname{diag}\left[
\mathbf n\circ\langle\boldsymbol\lambda\rangle\circ\operatorname{diag}(\mathbf M)
\right]
+
\boldsymbol\Sigma^{-1}
\mathbf M^\top
\boldsymbol\Sigma^{-1}
\right\}
\end{aligned}\end{equation}

## Numerically verify these gradients
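
One way to carry out this check is with central finite differences on a small random problem. The sketch below is illustrative only: it treats the entries of $\boldsymbol\Sigma$ as independent (the unstructured-derivative convention of The Matrix Cookbook), and the problem size, random inputs, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 4                                            # hypothetical number of bins
n = rng.integers(1, 10, size=D).astype(float)
mu = rng.normal(scale=0.3, size=D)
B0 = rng.normal(size=(D, D))
Sigma0_inv = np.linalg.inv(B0 @ B0.T + np.eye(D))
B = rng.normal(size=(D, D))
Sigma = 0.2 * B @ B.T + 0.5 * np.eye(D)          # symmetric positive definite

def mean_rate(S):
    return np.exp(mu + 0.5 * np.diag(S))

def loss(S):
    # Terms of the loss that depend on Sigma.
    return n @ mean_rate(S) + 0.5 * (np.trace(Sigma0_inv @ S)
                                     - np.linalg.slogdet(S)[1])

def grad(S):
    # Unstructured derivative; the transposes vanish when S is symmetric.
    return 0.5 * (np.diag(n * mean_rate(S)) + Sigma0_inv.T - np.linalg.inv(S).T)

def hvp(S, M):
    # Hessian-vector product at a symmetric S, in the direction M.
    S_inv = np.linalg.inv(S)
    return 0.5 * (0.5 * np.diag(n * mean_rate(S) * np.diag(M))
                  + S_inv @ M.T @ S_inv)

eps = 1e-6

# Finite-difference gradient, one matrix entry at a time.
fd_grad = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        E = np.zeros((D, D))
        E[i, j] = eps
        fd_grad[i, j] = (loss(Sigma + E) - loss(Sigma - E)) / (2 * eps)

# Finite-difference Hessian-vector product: directional derivative of grad.
M = rng.normal(size=(D, D))
fd_hvp = (grad(Sigma + eps * M) - grad(Sigma - eps * M)) / (2 * eps)

print(np.max(np.abs(fd_grad - grad(Sigma))))
print(np.max(np.abs(fd_hvp - hvp(Sigma, M))))
```

The finite-difference Hessian-vector product is the directional derivative of the gradient along $\mathbf M$, which equals $\nabla_{\boldsymbol\Sigma}\langle\nabla_{\boldsymbol\Sigma}\mathcal L, \mathbf M\rangle$ because the Hessian is symmetric.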

## Gradients of parameterized posterior covariance

The above gradients only apply if we're parameterizing $\boldsymbol\Sigma$ directly. This is computationally impractical. We explore the following: 

1. $\boldsymbol\Sigma = \mathbf A\operatorname{diag}[\mathbf v]\,\mathbf A^\top$
2. $\boldsymbol\Sigma = \mathbf X \mathbf X^\top$
3. $\boldsymbol\Sigma = \mathbf F^\top \mathbf Q\mathbf Q^\top \mathbf F$
4. $\boldsymbol\Sigma = \left(\boldsymbol\Sigma_0^{-1}+\operatorname{diag}[\mathbf p]\right)^{-1}$

How can we handle these derivatives? 

In cases (2) and (3), $\boldsymbol\Sigma$ isn't full rank, and the term $\ln|\boldsymbol\Sigma|$ diverges to negative infinity. However, since the null space of $\boldsymbol\Sigma$ is not being optimized, it does not affect the gradient. Since $\boldsymbol\Sigma$ is positive semi-definite, it can always be factored as:

$$
\boldsymbol\Sigma = \mathbf X\mathbf X^\top
$$

In these cases, the log-determinant can be replaced by $\ln|\mathbf X^\top\mathbf X|$, which remains well-defined assuming $\mathbf X$ has fixed rank $K$ equal to its number of columns. The derivative of this expression in $\mathbf X$ is:

$$
\partial_{\mathbf X} \ln |\mathbf X^\top\mathbf X| = 2 {\mathbf X^+}^\top
$$
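
This identity is easy to confirm numerically. The sketch below compares $2{\mathbf X^+}^\top$, computed with `numpy.linalg.pinv`, against central finite differences for a small random $\mathbf X$ with full column rank; the sizes and the helper name `logdet_gram` are arbitrary, hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 6, 2                          # hypothetical sizes: X is tall and thin
X = rng.normal(size=(N, K))

def logdet_gram(X):
    """ln |X^T X|, finite whenever X has full column rank."""
    return np.linalg.slogdet(X.T @ X)[1]

# Analytic derivative: twice the transposed pseudoinverse.
analytic = 2.0 * np.linalg.pinv(X).T

# Central finite differences, one element of X at a time.
eps = 1e-6
fd = np.zeros_like(X)
for i in range(N):
    for j in range(K):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (logdet_gram(X + E) - logdet_gram(X - E)) / (2 * eps)

print(np.max(np.abs(fd - analytic)))
```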



\begin{equation}\begin{aligned}
\partial_{\mathbf X_{ij}}\boldsymbol\Sigma_{kl} 
&= 
\sum_m 
\partial_{\mathbf X_{ij}}
\mathbf X_{km} \mathbf X_{lm}
\\
&=
\sum_m 
\delta_{ij=km} \mathbf X_{lm}
+
\mathbf X_{km} \delta_{ij=lm}
\\
&=
\delta_{i=k} \mathbf X_{lj}
+
\mathbf X_{kj} \delta_{i=l}
\end{aligned}\end{equation}

According to The Matrix Cookbook, if $\boldsymbol\Sigma(\theta_i)$ is a function of a parameter $\theta_i$, then the chain rule is: 

\begin{equation}\begin{aligned}
\frac{\partial\mathcal L}{\partial\theta_i} = 
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
,
\frac{\partial\boldsymbol\Sigma}{\partial \theta_i}
\right>
=
\operatorname{tr}\left[
\left(\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
\right)^\top
\frac{\partial\boldsymbol\Sigma}{\partial \theta_i}
\right]
=
\sum_{kl}
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma_{kl}}
\frac{\partial\boldsymbol\Sigma_{kl}}{\partial \theta_i}
\end{aligned}\end{equation}

For a vector or matrix of parameters $\boldsymbol\Theta$, we write this succinctly as 

$$
\operatorname{\nabla}_{\boldsymbol\Theta} \mathcal L
=
\left<
\operatorname{\nabla}_{\boldsymbol\Sigma} \mathcal L
,
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma
\right>
$$

$$
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Theta}
,
\mathbf U
\right>
=
\left<
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
,
\frac{\partial\boldsymbol\Sigma}{\partial\boldsymbol\Theta}
\right>
,
\mathbf U
\right>
$$

$$
\frac{\partial}{\partial\boldsymbol\Theta^\top}
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Theta}
,
\mathbf M
\right>
=
\frac{\partial}{\partial\boldsymbol\Theta^\top}
\left<
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
,
\frac{\partial\boldsymbol\Sigma}{\partial\boldsymbol\Theta}
\right>
,
\mathbf M
\right>
$$

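The chain rule applies to the Hessian-vector product as follows. Here $\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma$ is treated as constant, which is exact only when $\boldsymbol\Sigma$ is linear in $\boldsymbol\Theta$ (as in parameterization 1); otherwise an additional term involving the second derivative of $\boldsymbol\Sigma$ in $\boldsymbol\Theta$ appears.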

\begin{equation}\begin{aligned}
\left<\operatorname H_{\boldsymbol\Sigma} \mathcal L,\mathbf M\right>
&=
\operatorname{\nabla}_{\boldsymbol\Sigma} 
\left<
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
,\mathbf M
\right>
\end{aligned}\end{equation}

\begin{equation}\begin{aligned}
\left<\operatorname{H}_{\boldsymbol\Theta}\mathcal L,\mathbf M\right>
&=
\operatorname{\nabla}_{\boldsymbol\Theta}
\left<
\operatorname{\nabla}_{\boldsymbol\Theta} \mathcal L,
\mathbf M
\right>
\\&=
\operatorname{\nabla}_{\boldsymbol\Theta}\left<\left<\operatorname{\nabla}_{\boldsymbol\Sigma} \mathcal L,\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma\right>,\mathbf M\right>
\\
&=
\left<\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma,
\operatorname{\nabla}_{\boldsymbol\Sigma}\left<\left<\operatorname{\nabla}_{\boldsymbol\Sigma} \mathcal L,\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma\right>,\mathbf M\right>\right>
\\
&=
\left<
\left<
\left<
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma
,
\operatorname{H}_{\boldsymbol\Sigma} \mathcal L
\right>
,
\mathbf M
\right>
,
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma
\right>
\end{aligned}\end{equation}

If these contractions are associative, this can be written informally as:

\begin{equation}\begin{aligned}
\left<\operatorname{H}_{\boldsymbol\Theta}\mathcal L,\mathbf M\right>
&=
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma
:
\operatorname{H}_{\boldsymbol\Sigma} \mathcal L
:
\mathbf M
:
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma
\end{aligned}\end{equation}

\begin{equation}\begin{aligned}
\left<\operatorname{H}_{\boldsymbol\Theta}\mathcal L,\mathbf M\right>
&=
[\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma]
[\operatorname{H}_{\boldsymbol\Sigma} \mathcal L](\mathbf M)
[\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma]^\top
\end{aligned}\end{equation}

# Cleaning up this mess

The original Jacobian and Hessian-vector product are

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
&=
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}\right\}
\\
\left<\operatorname H_{\boldsymbol\Sigma} \mathcal L,\mathbf M\right>
&=
\tfrac 1 2 
\left\{
\tfrac 1 2 \operatorname{diag}\left[
\mathbf n\circ\langle\boldsymbol\lambda\rangle\circ\operatorname{diag}(\mathbf M)
\right]
+
\boldsymbol\Sigma^{-1}
\mathbf M^\top
\boldsymbol\Sigma^{-1}
\right\}
\end{aligned}\end{equation}

The chain rules for the Jacobian and Hessian-vector product are:


\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\boldsymbol\Theta} \mathcal L
&=
\left<
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma,
\operatorname{\nabla}_{\boldsymbol\Sigma} \mathcal L
\right>
\\
\left<\operatorname{H}_{\boldsymbol\Theta}\mathcal L,\mathbf U\right>
&=
\left<
\operatorname{\nabla}_{\boldsymbol\Theta}\boldsymbol\Sigma
,
\left<
\operatorname H_{\boldsymbol\Sigma}\mathcal L
,
\operatorname{\nabla}_{\boldsymbol\Theta}\boldsymbol\Sigma
\right>
,
\mathbf U
\right>
\end{aligned}\end{equation}

For each parameterization, we need to write down $\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma$, if possible. 

$\operatorname H_{\boldsymbol\Sigma}\mathcal L$ is a bilinear form that accepts matrices $\mathbf M$.

$\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma$ is also a bilinear form. It relates matrix-valued derivatives in $\boldsymbol\Sigma$ to matrix-valued derivatives in $\boldsymbol\Theta$.

$\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma$ is typically not invertible. This may not matter: a pseudoinverse suffices, if available.


1. $\boldsymbol\Sigma = \mathbf A\operatorname{diag}[\mathbf v]\,\mathbf A^\top$
2. $\boldsymbol\Sigma = \mathbf X \mathbf X^\top$
3. $\boldsymbol\Sigma = \mathbf F^\top \mathbf Q\mathbf Q^\top \mathbf F$
4. $\boldsymbol\Sigma = \left(\boldsymbol\Sigma_0^{-1}+\operatorname{diag}[\mathbf p]\right)^{-1}$


## Diagonal approximation $\boldsymbol\Sigma = \mathbf A\operatorname{diag}[\mathbf v]\,\mathbf A^\top$

\begin{equation}\begin{aligned}
\partial_{\mathbf v_k}\boldsymbol\Sigma_{ij}
&= 
\sum_{lm} 
\partial_{\mathbf v_k}
\mathbf A_{il}
\operatorname{diag}[\mathbf v]_{lm}
\mathbf A^\top_{mj}
\\
&= 
\sum_{l} 
\mathbf A_{il}
\left(\partial_{\mathbf v_k}\mathbf v_l\right)
\mathbf A^\top_{lj}
\\
&= 
\sum_{l} 
\mathbf A_{il}
\delta_{l=k}
\mathbf A^\top_{lj}
\\
&= 
\mathbf A_{ik}
\mathbf A^\top_{kj}
\end{aligned}\end{equation}

This relates $\mathbf v$ (index $k$) to $\boldsymbol\Sigma$ (indices $ij$). The factor $\mathbf v_k$ drops out because $\boldsymbol\Sigma$ is linear in $\mathbf v$.

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\boldsymbol\Sigma}\mathcal L
&=
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}\right\}
\\
\operatorname{\nabla}_{\boldsymbol\Theta} \mathcal L
&=
\left<
\operatorname{\nabla}_{\boldsymbol\Theta} \boldsymbol\Sigma,
\operatorname{\nabla}_{\boldsymbol\Sigma} \mathcal L
\right>
\end{aligned}\end{equation}


\begin{equation}\begin{aligned}
\frac{\partial\mathcal L}{\partial\theta_i} = 
\left<
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
,
\frac{\partial\boldsymbol\Sigma}{\partial \theta_i}
\right>
=
\operatorname{tr}\left[
\left(\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma}
\right)^\top
\frac{\partial\boldsymbol\Sigma}{\partial \theta_i}
\right]
=
\sum_{kl}
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma_{kl}}
\frac{\partial\boldsymbol\Sigma_{kl}}{\partial \theta_i}
\end{aligned}\end{equation}


Substituting $\partial\boldsymbol\Sigma_{ij}/\partial \mathbf v_k = \mathbf A_{ik}\mathbf A^\top_{kj}$ gives the gradient in $\mathbf v$:

\begin{equation}\begin{aligned}
\frac{\partial\mathcal L}{\partial \mathbf v_k}
&=
\sum_{ij}
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma_{ij}}
\frac{\partial\boldsymbol\Sigma_{ij}}{\partial \mathbf v_k}
\\
&=
\sum_{ij}
\frac{\partial\mathcal L}{\partial\boldsymbol\Sigma_{ij}}
\mathbf A_{ik}
\mathbf A^\top_{kj}
\\
&=
\sum_{ij}
\tfrac 1 2\left\{\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]+ \boldsymbol\Sigma_0^{-1}- \boldsymbol\Sigma^{-1}\right\}_{ij}
\mathbf A_{ik}
\mathbf A^\top_{kj}
\\
&=
\tfrac 1 2\left\{
\sum_{ij}
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]_{ij}
\mathbf A_{ik}
\mathbf A^\top_{kj}
+ 
\sum_{ij}
[\boldsymbol\Sigma_0^{-1}]_{ij}
\mathbf A_{ik}
\mathbf A^\top_{kj}
- 
\sum_{ij}
[\boldsymbol\Sigma^{-1}]_{ij}
\mathbf A_{ik}
\mathbf A^\top_{kj}
\right\}
\\
&=
\tfrac 1 2
\left\{
[\mathbf A^\top
\operatorname{diag}[\mathbf n\circ\langle\boldsymbol\lambda\rangle]
\mathbf A]_{kk}
+ 
[\mathbf A^\top \boldsymbol\Sigma_0^{-1} \mathbf A]_{kk}
- 
[\mathbf A^\top \boldsymbol\Sigma^{-1} \mathbf A]_{kk}
\right\}
\end{aligned}\end{equation}

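If $\mathbf A$ is invertible, the inverse of $\boldsymbol\Sigma$ is

$$
\boldsymbol\Sigma^{-1} = [\mathbf A\operatorname{diag}[\mathbf v]\,\mathbf A^\top]^{-1}
=
\mathbf A^{-\top}\operatorname{diag}[\mathbf v^{-1}]\,\mathbf A^{-1},
$$

so that $[\mathbf A^\top \boldsymbol\Sigma^{-1} \mathbf A]_{kk} = 1/\mathbf v_k$. The diagonal of $\boldsymbol\Sigma$ is linear in $\mathbf v$:

$$
\operatorname{diag}[\boldsymbol\Sigma]_i = \boldsymbol\Sigma_{ii} = \sum_j \mathbf A_{ij}^2 \mathbf v_j = [\mathbf G\mathbf v]_i,
\qquad
\mathbf G = \mathbf A\circ\mathbf A.
$$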

The gradient and Hessian of this loss function in $\mathbf v$ are:

\begin{equation}\begin{aligned}
%%%% GRADIENT IN v
\operatorname{\nabla}_{\mathbf v}
\mathcal L &=\tfrac 1 2 \left\{
\mathbf G^\top (\mathbf n \circ \langle\boldsymbol\lambda\rangle)
+ \operatorname{diag}[\mathbf A^\top \boldsymbol\Sigma_0^{-1} \mathbf A]
- \tfrac 1 {\mathbf v}
\right\}
\\
%%%% HESSIAN IN v
\operatorname{H}_{\mathbf v}
\mathcal L &=\tfrac 1 2 \left\{
\tfrac 1 2 
\mathbf G^\top \operatorname{diag}[\mathbf n \circ \langle\boldsymbol\lambda\rangle] \, \mathbf G 
+ \operatorname{diag}\left[\tfrac 1 {\mathbf v^2}\right]
\right\}
\end{aligned}\end{equation}

This follows from the usual matrix and scalar derivatives, with the exception of the term $\mathbf n^\top \langle\boldsymbol\lambda\rangle$. Its derivatives can be obtained by differentiating with respect to single elements of $\mathbf v$ (note: $\mathbf G$ is symmetric):

\begin{equation}\begin{aligned}
\tfrac{d}{dv_j} \mathbf n^\top\langle\boldsymbol\lambda\rangle 
&=  \mathbf n^\top
\tfrac{d}{dv_j}  \langle\boldsymbol\lambda\rangle 
\\
&=  \mathbf n^\top
\tfrac{d}{dv_j} \exp\left(\boldsymbol\mu + \tfrac12\mathbf G\mathbf v\right)
\\
&= \tfrac{d}{dv_j} \sum_k
n_k \exp\left[\mu_k + \tfrac12(\mathbf G\mathbf v)_k\right]
\\
&= \sum_k
n_k \left[
\langle\lambda_k\rangle
\cdot \tfrac12 \tfrac{d}{dv_j} (\mathbf G\mathbf v)_k
\right]
\\
&= 
\tfrac12 \sum_k
 n_k 
\langle\boldsymbol\lambda_k\rangle
\textstyle \mathbf G_{kj} 
\\
&= 
\tfrac12 \left[ \mathbf G^\top(\mathbf n\circ\langle\boldsymbol\lambda\rangle) \right]_j
\\\\
\tfrac{d^2}{dv_i\,dv_j} \mathbf n^\top\langle\boldsymbol\lambda\rangle 
&= 
\tfrac12 \sum_k n_k \mathbf G_{kj} \tfrac{d}{dv_i} \langle\lambda_k\rangle
\\
&= 
\tfrac14 \sum_k n_k 
\langle\lambda_k\rangle
\mathbf G_{kj} 
\mathbf G_{ki} 
\\
&= 
\tfrac 1 4 \left[\mathbf G^\top \operatorname{diag}[\mathbf n\circ \langle\boldsymbol\lambda\rangle] \, 
\mathbf G
\right]_{ij}
\end{aligned}\end{equation}
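
The gradient and Hessian in $\mathbf v$ can be checked with finite differences. The sketch below makes several assumptions that should be read as illustrative: $\mathbf A$ is a small symmetric Gaussian blur on a 1D grid, $\mathbf G = \mathbf A\circ\mathbf A$ is its element-wise square (so that $\operatorname{diag}[\boldsymbol\Sigma] = \mathbf G\mathbf v$), and $\mathbf A$ is invertible, so that $\ln|\boldsymbol\Sigma| = \sum_k \ln v_k$ up to a constant. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 6                                            # hypothetical number of bins
x = np.arange(D)
A = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # symmetric Gaussian blur
G = A * A                                        # element-wise square of A

n = rng.integers(1, 10, size=D).astype(float)
mu = rng.normal(scale=0.3, size=D)
B0 = rng.normal(size=(D, D))
Sigma0_inv = np.linalg.inv(B0 @ B0.T + np.eye(D))
c = np.diag(A.T @ Sigma0_inv @ A)                # diag[A^T Sigma0^{-1} A]
v = rng.uniform(0.5, 1.5, size=D)

def mean_rate(v):
    return np.exp(mu + 0.5 * G @ v)

def loss(v):
    # tr[Sigma0^{-1} A diag(v) A^T] = c . v;  ln|Sigma| = sum(ln v) + constant.
    return n @ mean_rate(v) + 0.5 * (c @ v - np.sum(np.log(v)))

def grad_v(v):
    return 0.5 * (G.T @ (n * mean_rate(v)) + c - 1.0 / v)

def hess_v(v):
    return 0.5 * (0.5 * G.T @ np.diag(n * mean_rate(v)) @ G + np.diag(1.0 / v**2))

# Central finite differences along each coordinate direction.
eps = 1e-6
basis = np.eye(D)
fd_grad = np.array([(loss(v + eps * e) - loss(v - eps * e)) / (2 * eps)
                    for e in basis])
fd_hess = np.array([(grad_v(v + eps * e) - grad_v(v - eps * e)) / (2 * eps)
                    for e in basis])

print(np.max(np.abs(fd_grad - grad_v(v))))
print(np.max(np.abs(fd_hess - hess_v(v))))
```

Both terms of the Hessian are positive semi-definite and the second is positive definite for $\mathbf v > 0$, so the loss is convex in $\mathbf v$ under these assumptions.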

## Low-rank approximation $\boldsymbol\Sigma = \mathbf X \mathbf X^\top$

We consider an approximate posterior covariance of the form

\begin{equation}\begin{aligned}
\boldsymbol\Sigma &\approx \mathbf X \mathbf X^\top,\,\,\,\,\mathbf X\in\mathbb R^{L^2\times K}
\end{aligned}\end{equation}

where $\mathbf X$ is a tall, thin matrix, with as many rows as there are spatial bins ($L^2$), and $K<L^2$ columns. We view $\mathbf X$ as being composed of $L^2$ length-$K$ row-vectors $\mathbf x_k$. 

\begin{equation}\begin{aligned}
\mathbf X^\top &= \{ \mathbf x_1^\top,...,\mathbf x_{L^2}^\top \}
\end{aligned}\end{equation}

Neglecting terms that do not depend on $\mathbf X$, the loss to be minimized is:

\begin{equation}\begin{aligned}
\mathcal L(\mathbf X)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right] - \ln|\mathbf X^\top \mathbf X |\right\}
\end{aligned}\end{equation}

Note that we have used $\ln|\mathbf X^\top \mathbf X |$ above, rather than $\ln|\mathbf X\mathbf X^\top|$. Since $\mathbf X \mathbf X^\top$ has rank $K<L^2$, it always has a null space of dimension $L^2-K$ with zero eigenvalues. This null space doesn't affect the gradient, but it does make the log-determinant undefined. $\ln|\mathbf X^\top \mathbf X |$ considers only the log-determinant in the low-rank space spanned by $\mathbf X$, and remains finite. The Jacobian is:

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\mathbf X}
\mathcal L &=
\left(
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top
\end{aligned}\end{equation}


The derivatives of $\operatorname{tr}\left[\boldsymbol\Sigma_0^{-1}\mathbf X \mathbf X^\top\right]$ and $\ln|\mathbf X^\top \mathbf X|$ can be found in The Matrix Cookbook. The gradient of $\mathbf n^\top \exp\left(\boldsymbol\mu+ \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)$ can be found by considering single elements of $\mathbf X$.

\begin{equation}\begin{aligned}
\tfrac{d}{dx_{ij}}
\mathbf n^\top \exp\left(\boldsymbol\mu+ \tfrac 1 2 \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]\right)
&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12 \tfrac{d}{dx_{ij}} \operatorname{diag}\left[\mathbf X \mathbf X^\top\right]_l
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12\sum_k \tfrac{d}{dx_{ij}} \mathbf x_{lk}^2
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\tfrac12\sum_k 2 \mathbf x_{lk} \delta_{ij=lk}
\\&=
\textstyle\sum_l n_l\langle\lambda_l\rangle
\mathbf x_{lj} \delta_{i=l}
\\&=
n_i\langle\lambda_i\rangle \mathbf x_{ij}
\\&=
\left[\operatorname{diag}[\mathbf n\circ \langle\boldsymbol\lambda\rangle]\, \mathbf X\right]_{ij}
\end{aligned}\end{equation}
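
The full Jacobian in $\mathbf X$ can be verified the same way. The following sketch builds the low-rank loss for a small random problem and compares the analytic gradient with central finite differences; the sizes, random inputs, and function names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 6, 2                                      # hypothetical: N bins, rank K
n = rng.integers(1, 10, size=N).astype(float)
mu = rng.normal(scale=0.3, size=N)
B0 = rng.normal(size=(N, N))
Sigma0_inv = np.linalg.inv(B0 @ B0.T + np.eye(N))
X = rng.normal(scale=0.5, size=(N, K))

def mean_rate(X):
    # diag[X X^T] is the vector of squared row norms of X.
    return np.exp(mu + 0.5 * np.sum(X**2, axis=1))

def loss(X):
    return n @ mean_rate(X) + 0.5 * (np.trace(X.T @ Sigma0_inv @ X)
                                     - np.linalg.slogdet(X.T @ X)[1])

def grad(X):
    return (np.diag(n * mean_rate(X)) + Sigma0_inv) @ X - np.linalg.pinv(X).T

# Central finite differences, one element of X at a time.
eps = 1e-6
fd_grad = np.zeros_like(X)
for i in range(N):
    for j in range(K):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd_grad[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)

print(np.max(np.abs(fd_grad - grad(X))))
```

Only the $N\times K$ matrix $\mathbf X$ and its $K\times N$ pseudoinverse are formed by the gradient, apart from the dense prior precision used here for convenience.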

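For the Hessian-vector product, we also need the derivative of the scalar product between this term of the Jacobian and a matrix $\mathbf M$:

\begin{equation}\begin{aligned}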
\operatorname{\partial}_{\mathbf X_{ij}} 
\left< 
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] \mathbf X
, 
\mathbf M 
\right>
&
=
\operatorname{\partial}_{\mathbf X_{ij}} 
\operatorname{tr}\left[
\mathbf X^\top \operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] 
\mathbf M
\right]
\\
&
=
\operatorname{\partial}_{\mathbf X_{ij}}
\textstyle\sum_k
\left[
\mathbf X^\top \operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] 
\mathbf M
\right]_{kk}
\\
&
=
\operatorname{\partial}_{\mathbf X_{ij}}
\textstyle\sum_{klm}
\mathbf X^\top_{kl} \operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]_{lm}
\mathbf M_{mk}
\\
&
=
\operatorname{\partial}_{\mathbf X_{ij}}
\textstyle\sum_{kl}
\left(
\mathbf X_{lk} \left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]_{l}
\right)
\mathbf M_{lk}
\\
&
=
\operatorname{\partial}_{\mathbf X_{ij}}
\textstyle\sum_{kl}
\left(
\mathbf X_{lk} n_l\langle\lambda_l\rangle
\right)
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
\left[
\left(
\operatorname{\partial}_{\mathbf X_{ij}}
\mathbf X_{lk} 
\right)
n_l\langle\lambda_l\rangle
+
\mathbf X_{lk} 
\left(
\operatorname{\partial}_{\mathbf X_{ij}}
n_l\langle\lambda_l\rangle
\right)
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
\left[
\delta_{il} \delta_{jk} n_l\langle\lambda_l\rangle
+
\mathbf X_{lk} n_l
\left(
\operatorname{\partial}_{\mathbf X_{ij}}
\langle\lambda_l\rangle
\right)
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
\left[
\delta_{il} \delta_{jk} n_l\langle\lambda_l\rangle
+
\mathbf X_{lk} n_l
\operatorname{\partial}_{\mathbf X_{ij}}
\exp\left(\mu_l + \tfrac 1 2 \boldsymbol\Sigma_{ll}\right)
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
\left[
\delta_{il} \delta_{jk} n_l\langle\lambda_l\rangle
+
\tfrac 1 2
\mathbf X_{lk} n_l\langle\lambda_l\rangle
\operatorname{\partial}_{\mathbf X_{ij}}
\boldsymbol\Sigma_{ll}
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
n_l\langle\lambda_l\rangle
\left[
\delta_{il} \delta_{jk}
+
\tfrac 1 2
\mathbf X_{lk}
\operatorname{\partial}_{\mathbf X_{ij}}
[\mathbf X\mathbf X^\top]_{ll}
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
n_l\langle\lambda_l\rangle
\left[
\delta_{il} \delta_{jk}
+
\tfrac 1 2
\mathbf X_{lk}
\sum_p 
\operatorname{\partial}_{\mathbf X_{ij}}
\mathbf X_{lp}^2
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
n_l\langle\lambda_l\rangle
\left[
\delta_{il} \delta_{jk}
+
\mathbf X_{lk}
\sum_p 
\mathbf X_{lp} \delta_{il} \delta_{jp}
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
n_l\langle\lambda_l\rangle
\left[
\delta_{il} \delta_{jk} 
+
\mathbf X_{lk} 
\mathbf X_{lj} \delta_{il}
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{kl}
n_l\langle\lambda_l\rangle
\delta_{il}
\left[
\delta_{jk} 
+
\mathbf X_{lk} 
\mathbf X_{lj} 
\right]
\mathbf M_{lk}
\\
&
=
\textstyle\sum_{k}
n_i\langle\lambda_i\rangle
\left[
\delta_{jk} 
+
\mathbf X_{ik} 
\mathbf X_{ij} 
\right]
\mathbf M_{ik}
\\
&
=
n_i\langle\lambda_i\rangle
\textstyle\sum_{k}
\left[
\delta_{jk} 
+
\mathbf X_{ik} 
\mathbf X_{ij} 
\right]
\mathbf M_{ik}
\\
&
=
n_i\langle\lambda_i\rangle
\left\{
\textstyle\sum_{k}
\left[
\delta_{jk} 
\mathbf M_{ik}
\right]
+
\textstyle\sum_{k}
\left[
\mathbf X_{ik} 
\mathbf X_{ij} 
\mathbf M_{ik}
\right]
\right\}
\\
&
=
n_i\langle\lambda_i\rangle
\left[
\mathbf M_{ij}
+
\mathbf X_{ij} 
\textstyle\sum_{k}
\mathbf X_{ik} 
\mathbf M_{ik}
\right]
\\
&
=
n_i\langle\lambda_i\rangle
\left[
\mathbf M_{ij}
+
\mathbf X_{ij}
[ \mathbf X \mathbf M^\top ]_{ii}
\right]
\end{aligned}\end{equation}
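
As a sanity check on the index-notation result above, here is a minimal finite-difference sketch. It assumes, as in the derivation, that $\boldsymbol\Sigma = \mathbf X \mathbf X^\top$ and $\langle\lambda_l\rangle = \exp(\mu_l + \tfrac 1 2 \boldsymbol\Sigma_{ll})$ with $\boldsymbol\mu$ held fixed. The sizes, random test values, and function names are hypothetical and are not taken from the notebook's own modules.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                              # hypothetical sizes: X is N x K
X  = 0.3 * rng.standard_normal((N, K))
M  = rng.standard_normal((N, K))         # arbitrary direction matrix
n  = rng.uniform(1.0, 3.0, N)
mu = 0.1 * rng.standard_normal(N)

def rate(X):
    # <lambda> = exp(mu + diag(X X^T) / 2), assuming Sigma = X X^T
    return np.exp(mu + 0.5 * np.sum(X**2, axis=1))

def f(X):
    # tr[ X^T diag(n o <lambda>) M ] = sum_{lk} X_lk n_l <lambda_l> M_lk
    return np.sum(X * (n * rate(X))[:, None] * M)

def grad_closed_form(X):
    w = n * rate(X)                      # n_i <lambda_i>
    XMt_ii = np.sum(X * M, axis=1)       # [X M^T]_ii
    return w[:, None] * (M + X * XMt_ii[:, None])

def grad_finite_difference(X, eps=1e-6):
    # Central differences, one entry of X at a time
    G = np.zeros_like(X)
    for i in range(N):
        for j in range(K):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

err = np.max(np.abs(grad_closed_form(X) - grad_finite_difference(X)))
print("max abs error:", err)
```

If the closed form is right, the reported error should be at the level of finite-difference noise.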

\begin{equation}\begin{aligned}
\operatorname{\nabla}_{\mathbf X} \left< \operatorname{\nabla}_{\mathbf X} \mathcal L , \mathbf M \right>
&=
\operatorname{\nabla}_{\mathbf X} \left< 
\left(
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right] + \boldsymbol\Sigma_0^{-1}\right)\mathbf X - {\mathbf X^{+}}^\top
, \mathbf M \right>
\\
&=
\operatorname{\nabla}_{\mathbf X}
\operatorname{tr}\left[
\mathbf X^\top
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]\mathbf M
+ 
\mathbf X^\top \boldsymbol\Sigma_0^{-1}\mathbf M
- 
{\mathbf X^{+}}\mathbf M
\right]
\\
&=
\operatorname{diag}\left[\mathbf n\circ\langle\boldsymbol\lambda\rangle\right]\mathbf M
+ 
\boldsymbol\Sigma_0^{-1}\mathbf M
+{\mathbf X^+}^\top \mathbf M^\top {\mathbf X^+}^\top - (\mathbf I - {\mathbf X^+}^\top \mathbf X^\top) \mathbf M \mathbf X^+ {\mathbf X^+}^\top
\end{aligned}\end{equation}

## Inverse-diagonal approximation 

We consider an approximate posterior covariance of the form

\begin{equation}\begin{aligned}
\boldsymbol\Sigma^{-1} &\approx \boldsymbol\Sigma_0^{-1} + \operatorname{diag}\left[\frac 1 {\mathbf v} \right]
\end{aligned}\end{equation}

This can be optimized using the fixed-point iteration:

\begin{equation}\begin{aligned}
\mathbf v \gets \operatorname{diag}\left[
\left(\boldsymbol\Sigma_0^{-1} + \operatorname{diag}\left[\frac 1 {\mathbf v} \right]\right)^{-1}
\right]
\end{aligned}\end{equation}

I can't prove that this converges, but in practice it seems to. It is also expensive to compute: the matrix inverse above has cubic complexity. Individual marginal variances can be calculated using Krylov-subspace algorithms, but I haven't tested this.

This can be solved self-consistently by adjusting $\mathbf v$ until 

$$
\frac 1 {\mathbf v} = \mathbf n \circ \langle\boldsymbol\lambda\rangle
$$

$$
\frac 1 {\mathbf v} = \mathbf n \circ \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}\left[\left(\boldsymbol\Sigma_0^{-1} + \operatorname{diag}\left[\tfrac 1 {\mathbf v}\right]\right)^{-1}\right]
\right)
$$
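
One way to look for a self-consistent $\mathbf v$ is a damped fixed-point iteration on the precision $1/\mathbf v$. The sketch below is only an illustration under stated assumptions: the prior covariance (an exponential kernel), $\mathbf n = \mathbf 1$, $\boldsymbol\mu = \mathbf 0$, the damping factor, and the function names are all made up for the example, and it uses a dense cubic-cost inverse. Convergence is not guaranteed in general; the damping is there because the update acts like negative feedback (a larger precision shrinks the variance, which lowers the rate).

```python
import numpy as np

N = 6
idx = np.arange(N)
# Hypothetical prior covariance: exponential kernel on a 1D grid
Sigma0 = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)
Sigma0_inv = np.linalg.inv(Sigma0)
n  = np.ones(N)     # assumed counts
mu = np.zeros(N)    # assumed posterior mean, held fixed

def target_precision(v):
    # Right-hand side: n o exp(mu + diag[(Sigma0^-1 + diag[1/v])^-1] / 2)
    Sigma = np.linalg.inv(Sigma0_inv + np.diag(1.0 / v))
    return n * np.exp(mu + 0.5 * np.diag(Sigma))

def solve_v(v0, damping=0.5, iters=200):
    # Damped iteration on w = 1/v
    w = 1.0 / v0
    for _ in range(iters):
        w = (1 - damping) * w + damping * target_precision(1.0 / w)
    return 1.0 / w

v = solve_v(np.ones(N))
residual = np.max(np.abs(1.0 / v - target_precision(v)))
print("self-consistency residual:", residual)
```

The residual measures how far $1/\mathbf v$ is from $\mathbf n \circ \langle\boldsymbol\lambda\rangle$ at the returned $\mathbf v$.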

My intuition is that there should be some sort of negative-feedback based solution here. Maybe assuming that 

$$
\mathbf v = \exp(\boldsymbol \mu)
$$

would work? 

\begin{equation}\begin{aligned}
\mathcal L(\mathbf v)&=\mathbf n^\top \exp\left(\boldsymbol\mu + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma]\right)+ \tfrac 1 2 \left \{ \operatorname{tr}[\boldsymbol\Sigma_0^{-1}\boldsymbol\Sigma] - \ln|\boldsymbol\Sigma|\right\}
\end{aligned}\end{equation}

## Let's try this: chain rule

$$
\partial (\mathbf X ^{-1}) = - \mathbf X ^{-1} (\partial \mathbf X)  \mathbf X^{-1}
$$

$$
\mathbf Y = \mathbf X^{-1}
$$

$$
\boldsymbol\Sigma = 
\left(\boldsymbol\Sigma_0^{-1} + \operatorname{diag}\left[\frac 1 {\mathbf v} \right]\right)^{-1}
$$

$$
\partial \boldsymbol\Sigma = - \boldsymbol\Sigma \left(\partial \operatorname{diag}\left[\frac 1 {\mathbf v} \right] \right) \boldsymbol\Sigma
$$
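
The matrix-inverse differential can be checked numerically. The sketch below takes $\boldsymbol\Sigma = \left(\boldsymbol\Sigma_0^{-1} + \operatorname{diag}[1/\mathbf v]\right)^{-1}$, as in the inverse-diagonal approximation, so that $\partial\boldsymbol\Sigma = -\boldsymbol\Sigma\,(\partial\operatorname{diag}[1/\mathbf v])\,\boldsymbol\Sigma$ with $\partial\operatorname{diag}[1/\mathbf v] = \operatorname{diag}[-\partial\mathbf v/\mathbf v^2]$. The random prior covariance, the perturbation `dv`, and the helper name `posterior_cov` are assumptions made for the test only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
A = rng.standard_normal((N, N))
Sigma0 = A @ A.T + N * np.eye(N)         # hypothetical SPD prior covariance
Sigma0_inv = np.linalg.inv(Sigma0)
v  = rng.uniform(0.5, 2.0, N)            # positive variances
dv = rng.standard_normal(N)              # arbitrary perturbation direction

def posterior_cov(v):
    # Sigma = (Sigma0^-1 + diag[1/v])^-1
    return np.linalg.inv(Sigma0_inv + np.diag(1.0 / v))

Sigma = posterior_cov(v)

# Differential of diag[1/v] along dv is diag[-dv / v^2]
dSigma_formula = -Sigma @ np.diag(-dv / v**2) @ Sigma

# Central finite difference along the same direction
eps = 1e-6
dSigma_numeric = (posterior_cov(v + eps * dv) - posterior_cov(v - eps * dv)) / (2 * eps)

err = np.max(np.abs(dSigma_formula - dSigma_numeric))
print("max abs error:", err)
```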