In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

from pylab import *


# Variational inference of generalized linear models with Gaussian priors

This notebook reviews how to perform [variational Bayesian](https://en.wikipedia.org/wiki/Variational_Bayesian_methods) inference on some classes of [Generalized Linear Models (GLMs)](https://en.wikipedia.org/wiki/Generalized_linear_model), where both the prior and approximate posterior are multivariate normal distributions.

Consider an $N$-dimensional space of latent (unobserved) variables $\mathbf z\in\mathbb R^N$; $\mathbf z=\{z_1,..,z_N\}$. Let $\mathbf z$ have a known (or specified) multivariate Gaussian prior distribution, with mean $\boldsymbol\mu_z$ and covariance matrix $\boldsymbol\Sigma_z$:

\begin{equation}\begin{aligned}
\mathbf z &\sim\mathcal N(\boldsymbol\mu_z,\boldsymbol\Sigma_z)
\end{aligned}\end{equation}

We consider a dataset of $M$ observations $\mathbf y = \{y_1,..,y_M\}$, which are independent conditioned on $\mathbf z$:

\begin{equation}\begin{aligned}
\Pr(\mathbf y|\mathbf z) = \prod_{m=1..M} \Pr(y_m | \mathbf z)
\end{aligned}\end{equation}

In these notes, we focus on $\Pr(y|\mathbf z)$ from the [natural exponential family](https://en.wikipedia.org/wiki/Natural_exponential_family#General_multivariate_case). For a single scalar $y_m$, the canonical form of distributions from this family can be written as:

\begin{equation}\begin{aligned}
\ln\Pr(y_m | \mathbf z) &= y_m\cdot \theta_m - A( \theta_m ) + \mathcal O(y_m)
\end{aligned}\end{equation}

Above, $A(\cdot)$ is a known function, $y_m$ is the $m^{\text{th}}$ observation, and $\theta_m = \mathbf b_m^\top\mathbf z$ 
is a fixed linear projection ($\mathbf b_m^\top\in\mathbb R^N$) of the latent states $\mathbf z$ associated with this observation. We've ignored any normalization terms that depend only on the observations $y_m$ ("$\mathcal O(y_m)$"), since these are fixed and will not affect our optimization.

Overall, the log-likelihood of $\mathbf y$ given $\mathbf z$ can be written as the sum:

\begin{equation}\begin{aligned}
\ln \Pr(\mathbf y|\mathbf z) 
&=
\textstyle\sum_{m=1..M} \left\{ 
y_m\cdot \theta_m - A( \theta_m )
\right\}
+ \text{constant}
\end{aligned}\end{equation}

We write this in vector notation as:

\begin{equation}\begin{aligned}
\ln \Pr(\mathbf y|\mathbf z) 
&=
\mathbf y^\top \boldsymbol\theta - \mathbf 1^\top A(\boldsymbol\theta)
+ \text{constant}
\end{aligned}\end{equation}

where $A(\cdot)$ acts element-wise and $\mathbf 1$ is a length-$M$ column vector of 1s.
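
As a minimal sketch of the vector form, assuming a Poisson-type model with $A=\exp$ and hypothetical toy values for `B`, `z`, and `y` (none of these come from the notes), the following compares the vector expression with the explicit sum over observations:

```python
import numpy as np

def glm_log_likelihood(y, B, z, A=np.exp):
    """ln Pr(y|z), dropping terms that depend only on y."""
    theta = B @ z                      # natural parameters, one per observation
    return y @ theta - np.sum(A(theta))

# Hypothetical toy data: M = 3 observations, N = 2 latent dimensions.
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([0.5, -0.2])
y = np.array([2.0, 0.0, 1.0])

ll_vector = glm_log_likelihood(y, B, z)
# The same quantity written as an explicit sum over m.
ll_sum = sum(y[m] * (B[m] @ z) - np.exp(B[m] @ z) for m in range(len(y)))
```

The function name `glm_log_likelihood` and its `A` argument are illustrative choices; any element-wise $A$ can be passed in.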

Since we will optimize the posterior using gradient descent (and, ideally, Newton's method), it is helpful if $A(\cdot)$ is differentiable (ideally twice differentiable). For point-process inference in neuroscience, the derivative $A'(\cdot) = \rho(\cdot)$ is usually a firing-rate nonlinearity (e.g. exponential, logistic). We will use a multivariate Gaussian to approximate the posterior distribution of the latent states $\mathbf z$, so it is helpful if the expectations of $\rho(\theta)$ with respect to a Gaussian distribution have a closed form. 

In these notes, we consider an exponential nonlinearity $\rho = \exp$, which corresponds to a linear-nonlinear-Poisson model, as well as a sigmoidal Gaussian-CDF nonlinearity $\rho = \Phi$, which corresponds to a ["probit"](https://en.wikipedia.org/wiki/Probit) observation model. (Logistic regression can be approximated by matching the logistic sigmoid to a rescaled Gaussian CDF.) For obtaining derivatives, it is useful to recall that if $x\sim\mathcal N(\mu,\sigma^2)$ is Gaussian, then the derivatives of the expectation of $A(x)$ with respect to $\mu$ and $\sigma^2$ are:

\begin{equation}\begin{aligned}
\partial_\mu \langle A(x) \rangle_x
&=
\langle 
A'(x) 
\rangle_x 
=
\langle 
\rho(x) 
\rangle_x 
\\
\partial_{\sigma^2} \langle A(x)\rangle_x
&=
\tfrac1{2\sigma^2}
\left<(x-\mu) A'(x)\right>_x
=
\tfrac1{2\sigma^2}
\left<(x-\mu) \rho(x)\right>_x
\end{aligned}\end{equation}

If $A = \rho = \exp$, then $\langle \rho(x) \rangle_x = \exp(\mu + \sigma^2/2)$ and:

\begin{equation}\begin{aligned}
\partial_\mu \langle \exp(x) \rangle_x
&= \exp(\mu + \sigma^2/2)
\\
\partial_{\sigma^2} \langle \exp(x)\rangle_x
&=
\tfrac 1 2 \exp(\mu + \sigma^2/2)
\end{aligned}\end{equation}
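
These closed forms can be checked numerically. The sketch below is an illustration under stated assumptions: it uses Gauss-Hermite quadrature (`numpy.polynomial.hermite_e.hermegauss`) to approximate Gaussian expectations, arbitrary test values for $\mu$ and $\sigma^2$, and central finite differences for the derivatives.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes/weights for the weight function exp(-u^2 / 2).
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)   # normalise to a standard-normal average

def expect(f, mu, var):
    """<f(x)> for x ~ N(mu, var), approximated by quadrature."""
    return np.sum(weights * f(mu + np.sqrt(var) * nodes))

mu, var, h = 0.3, 0.49, 1e-5               # arbitrary test point and step size
closed_form = np.exp(mu + var / 2.0)       # <exp(x)>

# Central finite differences of <exp(x)> in mu and in sigma^2.
d_mu = (expect(np.exp, mu + h, var) - expect(np.exp, mu - h, var)) / (2 * h)
d_var = (expect(np.exp, mu, var + h) - expect(np.exp, mu, var - h)) / (2 * h)
```

Here `d_mu` should agree with `closed_form` and `d_var` with half of it, up to discretisation error.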

These expectations also have a closed form if $\rho = \Phi$. Let $u = (x-\mu)/\sigma$. Then $x = \mu + \sigma u$ and $dx = \sigma du$. Define $\gamma$ as $\gamma = (1+\sigma^2)^{-1/2}$. Then:

\begin{equation}\begin{aligned}
\partial_\mu\left<A(x)\right>_x
&=
\left<\Phi(x)\right>_x
\\&=
 \int \Phi(x) \frac 1 \sigma \phi\left(\tfrac{x-\mu}{\sigma}\right)\, dx
\\&=
\int \Phi(\mu + \sigma u) \phi(u) \,du
\\&=
\Phi(\gamma \mu)
\\
\\
\partial_{\sigma^2}\left<A(x)\right>_x
&=
\tfrac 1 {2\sigma^2} 
\left<(x-\mu)\Phi(x)\right>_x
\\
&=
\frac 1 {2\sigma^2} 
\int
(x-\mu)\Phi(x)\frac1\sigma\phi\left(\tfrac{x-\mu}{\sigma}\right)\,dx
\\
&=
\frac 1 {2\sigma^2} 
\int
(\mu + \sigma u -\mu)\Phi(\mu + \sigma u)\phi(u)\,du
\\
&=
\frac 1 {2 \sigma}
\int
u\Phi(\mu + \sigma u)\phi(u)\,du
\\
&=
\frac \gamma 2 \phi(\gamma\mu)
\end{aligned}\end{equation}
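
The two probit identities can be verified by quadrature. This is a sketch under assumptions: it takes $\gamma = (1+\sigma^2)^{-1/2}$, uses `scipy.stats.norm` for $\Phi$ and $\phi$, and picks arbitrary test values for $\mu$ and $\sigma$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.stats import norm

nodes, weights = hermegauss(80)
weights = weights / np.sqrt(2.0 * np.pi)   # standard-normal quadrature weights

mu, sigma = 0.4, 0.8                       # arbitrary test point
x = mu + sigma * nodes                     # quadrature points for x ~ N(mu, sigma^2)
gamma = 1.0 / np.sqrt(1.0 + sigma**2)

# <Phi(x)>, to be compared with Phi(gamma * mu).
mean_Phi = np.sum(weights * norm.cdf(x))

# <(x - mu) Phi(x)> / (2 sigma^2), to be compared with (gamma / 2) phi(gamma * mu).
dvar_A = np.sum(weights * (x - mu) * norm.cdf(x)) / (2.0 * sigma**2)
```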

## Variational Bayes

We are interested in the posterior distribution of $\mathbf z$ given the observations $\mathbf y$:

\begin{equation}\begin{aligned}
\Pr(\mathbf z | \mathbf y) = \Pr(\mathbf y | \mathbf z)
\frac{\Pr(\mathbf z)}{\Pr(\mathbf y)}
\end{aligned}\end{equation}

The variational Bayesian approach approximates this posterior $\Pr(\mathbf z | \mathbf y)$ with a simpler, more tractable distribution. Here, we will use a multivariate Gaussian distribution:

\begin{equation}\begin{aligned}
\Pr(\mathbf z | \mathbf y)
\approx Q(\mathbf z) = \mathcal N(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
\end{aligned}\end{equation}

The parameters $\Theta = (\boldsymbol\mu_q,\boldsymbol\Sigma_q)$ of $Q(\mathbf z)$ are fit by minimizing the [Kullback-Leibler (KL) divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) from the true posterior $\Pr(\mathbf z | \mathbf y)$ to $Q(\mathbf z)$:

\begin{equation}\begin{aligned}
\Theta\gets\underset{\Theta}{\operatorname{argmin}}
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z | \mathbf y)\right]
\end{aligned}\end{equation}

This is equivalent to jointly maximizing the entropy $\operatorname H_Q = \left<-\ln Q(\mathbf z)\right>_{Q}$ and the expected log-posterior under $Q$, $\left<\ln\Pr(\mathbf z | \mathbf y)\right>_{Q}$:

\begin{equation}\begin{aligned}
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z | \mathbf y)\right]
&=
\left<
\ln\frac{Q(\mathbf z)}
{\Pr(\mathbf z | \mathbf y)}
\right>_{Q}
\\
&=
-\left<
-\ln Q(\mathbf z)
\right>_Q
-
\left<
\ln\Pr(\mathbf z | \mathbf y)
\right>_Q
\\
&=
-\operatorname H_Q
-\left<\ln \Pr(\mathbf z | \mathbf y)\right>_{Q},
\end{aligned}\end{equation}

where $\langle\cdot\rangle_Q$ denotes averaging with respect to $Q(\mathbf z)$.

It is also equivalent to minimizing the KL divergence $D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z)\right]$ from the prior to $Q$, while maximizing the expected log-likelihood $\left<\ln\Pr(\mathbf y | \mathbf z)\right>_Q$:

\begin{equation}\begin{aligned}
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z | \mathbf y)\right]
&=
\left<
\ln Q(\mathbf z)
\right>_Q
-
\left<
\ln\Pr(\mathbf z | \mathbf y)
\right>_Q
\\
&=
\left<
\ln Q(\mathbf z)
\right>_Q
-
\left<
\ln\left(
\Pr(\mathbf y | \mathbf z)
\frac
{\Pr(\mathbf z)}
{\Pr(\mathbf y)}
\right)
\right>_Q
\\
&=
\left<\ln \frac{Q(\mathbf z)}{\Pr(\mathbf z)}\right>_Q-\left<\ln\Pr(\mathbf y | \mathbf z)
\right>_Q
+\left<\ln\Pr(\mathbf y)\right>_Q\\
&=
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z)\right]
-\left<\ln\Pr(\mathbf y | \mathbf z)\right>_Q+c,
\end{aligned}\end{equation}

where the probability of the observations $\Pr(\mathbf y)$ can be neglected, since it is a constant ("$c$") that does not depend on $Q$. 
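
The decomposition can be checked on a case where everything is available in closed form. The sketch below assumes a scalar linear-Gaussian model, $z\sim\mathcal N(0,1)$ and $y|z\sim\mathcal N(z,1)$, chosen only because its posterior and evidence are known exactly; it is not one of the observation models used in these notes, and the values of `y`, `m`, and `v` are arbitrary.

```python
import numpy as np

def kl_1d(m_q, v_q, m_p, v_p):
    """KL[ N(m_q, v_q) || N(m_p, v_p) ] for scalar Gaussians."""
    return 0.5 * ((m_q - m_p) ** 2 / v_p + v_q / v_p - 1.0 - np.log(v_q / v_p))

y, m, v = 1.0, 0.3, 0.8            # one observation; Q = N(m, v)

# Model: z ~ N(0, 1), y | z ~ N(z, 1)  =>  posterior N(y/2, 1/2), evidence N(0, 2).
kl_to_posterior = kl_1d(m, v, y / 2.0, 0.5)
kl_to_prior = kl_1d(m, v, 0.0, 1.0)
expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + v)
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0

# KL to the prior, minus expected log-likelihood, plus the constant ln Pr(y).
rhs = kl_to_prior - expected_loglik + log_evidence
```

Both sides agree to machine precision, and `log_evidence` plays the role of the constant $c$.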

Since both $Q(\mathbf z)$ and $\Pr(\mathbf z)$ are multivariate Gaussian, the KL divergence $D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z)\right]$ [has the closed form](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Kullback%E2%80%93Leibler_divergence)

\begin{equation}\begin{aligned}
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z)\right]
&=
\frac 1 2 \left\{
(\boldsymbol\mu_z-\boldsymbol\mu_q)^\top 
\boldsymbol\Sigma_z^{-1}
(\boldsymbol\mu_z-\boldsymbol\mu_q)
+
\operatorname{tr}\left(
\boldsymbol\Sigma_z^{-1}
\boldsymbol\Sigma_q
\right)
-
\ln|
\boldsymbol\Sigma_z^{-1}
\boldsymbol\Sigma_q
|
\right\}
+\text{constant}
\end{aligned}\end{equation}
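
A minimal sketch of this divergence in code, written with the standard closed form for two multivariate normals and with the constant $-N/2$ included. In that standard form the log-determinant term enters with a negative sign, since it comes from the entropy of $Q$. The function name `gaussian_kl` is a hypothetical helper, not part of any library.

```python
import numpy as np

def gaussian_kl(mu_q, Sigma_q, mu_z, Sigma_z):
    """D_KL[ N(mu_q, Sigma_q) || N(mu_z, Sigma_z) ], constants included."""
    N = mu_q.shape[0]
    d = mu_z - mu_q
    Sz_inv_Sq = np.linalg.solve(Sigma_z, Sigma_q)      # Sigma_z^{-1} Sigma_q
    _, logdet = np.linalg.slogdet(Sz_inv_Sq)           # ln |Sigma_z^{-1} Sigma_q|
    quad = d @ np.linalg.solve(Sigma_z, d)             # Mahalanobis term
    return 0.5 * (quad + np.trace(Sz_inv_Sq) - logdet - N)

# Scalar example: Q = N(1, 2), prior = N(0, 1).
kl_example = gaussian_kl(np.array([1.0]), np.array([[2.0]]),
                         np.array([0.0]), np.array([[1.0]]))
```

Using `solve` and `slogdet` avoids forming explicit inverses and is usually more stable than `inv` and `det`.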

For our choice of the canonically-parameterized natural exponential family, the expected negative log-likelihood can be written as: 


\begin{equation}\begin{aligned}
-\langle\ln \Pr(\mathbf y|\mathbf z) \rangle + \mathcal O(\mathbf y)
&=
\langle
\mathbf 1^\top A(\boldsymbol\theta)
-
\mathbf y^\top \boldsymbol\theta
\rangle
\\
&=
\langle
\mathbf 1^\top A(\mathbf B\mathbf z)
-
\mathbf y^\top \mathbf B\mathbf z
\rangle
\\
&=
\mathbf 1^\top\langle A(\mathbf B\mathbf z)\rangle
-
\mathbf y^\top \mathbf B\boldsymbol\mu_q
\end{aligned}\end{equation}

where $\mathbf B = \{\mathbf b_1,..,\mathbf b_M \}^\top$ is the matrix of all $\mathbf b$.

## The objective

Neglecting constant terms that do not depend on $(\boldsymbol\mu_q,\boldsymbol\Sigma_q)$, the loss function to be minimized is: 

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
&=
D_{\text{KL}}\left[ Q(\mathbf z) \| \Pr(\mathbf z)\right]
-\left<\ln\Pr(\mathbf y | \mathbf z)\right>_Q
\\
&=
\frac 1 2 \left\{
(\boldsymbol\mu_z-\boldsymbol\mu_q)^\top 
\boldsymbol\Sigma_z^{-1}
(\boldsymbol\mu_z-\boldsymbol\mu_q)
+
\operatorname{tr}\left(
\boldsymbol\Sigma_z^{-1}
\boldsymbol\Sigma_q
\right)
-
\ln|\boldsymbol\Sigma_z^{-1}\boldsymbol\Sigma_q|\right\}
+ \mathbf 1^\top\langle A(\mathbf B\mathbf z)\rangle_Q
- \mathbf y^\top \mathbf B \boldsymbol\mu_q 
\end{aligned}\end{equation}

To optimize this, we need the derivatives in $\boldsymbol\mu_q$ and $\boldsymbol\Sigma_q$. We consider each of these variables separately. 

### Derivatives in ${\boldsymbol\mu}_q$

The gradient and Hessian of $\mathcal L$ with respect to $\boldsymbol\mu_q$ are:

\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\mu_q}
&=
\partial_{\boldsymbol\mu_q}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
=
\boldsymbol\Sigma_z^{-1}
(\boldsymbol\mu_q-\boldsymbol\mu_z)
+
\mathbf B^\top\left\{\langle\rho(\mathbf B\mathbf z) \rangle_Q-\mathbf y\right\}
\\
\mathbf H_{\boldsymbol\mu_q}
&=
\operatorname H_{\boldsymbol\mu_q}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
=
\boldsymbol\Sigma_z^{-1}
+
\mathbf B^\top
\operatorname{diag}\left[
\langle
\rho'(\mathbf B\mathbf z) 
\rangle_Q
\right]
\mathbf B
\end{aligned}\end{equation}

### Gradient in $\boldsymbol\Sigma_q$

The gradient and Hessian of $\mathcal L$ with respect to $\boldsymbol\Sigma_q$ are more complicated. An analytic derivative in $\boldsymbol\Sigma_q$ can be obtained using the various identities provided in [The Matrix Cookbook](https://www2.imm.dtu.dk/pubdb/pubs/3274-full.html).

\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\Sigma_q}
&=
\partial_{\boldsymbol\Sigma_q}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
\right\}
+ 
\partial_{\boldsymbol\Sigma_q}
\mathbf 1^\top \langle A (\mathbf B\mathbf z)\rangle_Q
\end{aligned}\end{equation}

The term $\partial_{\boldsymbol\Sigma_q} \mathbf 1^\top \langle A(\mathbf B\mathbf z)\rangle_Q$ is less straightforward, but can be obtained by considering derivatives with respect to individual elements of $\boldsymbol\Sigma_q$. Recall that $\theta_m = \mathbf b_m^\top \mathbf z$. Let $\boldsymbol\mu_{\boldsymbol\theta}$ and $\boldsymbol\sigma^2_{\boldsymbol\theta}$ denote the vectors of the means and variances of each $\theta$, respectively. These are given by the projections: 

\begin{equation}\begin{aligned}
\boldsymbol\mu_{\boldsymbol\theta}
&=
\mathbf B\boldsymbol\mu_q
\\
\boldsymbol\sigma^2_{\boldsymbol\theta}
&=
\operatorname{diag}\left[
\mathbf B\boldsymbol\Sigma_q\mathbf B^\top
\right].
\end{aligned}\end{equation}
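
As a sketch, the projected means and variances can be computed without forming the full $M\times M$ matrix $\mathbf B\boldsymbol\Sigma_q\mathbf B^\top$. The helper name `project_moments` and the toy values are illustrative assumptions.

```python
import numpy as np

def project_moments(B, mu_q, Sigma_q):
    """Means and variances of theta = B z for z ~ N(mu_q, Sigma_q)."""
    mu_theta = B @ mu_q
    # Row-wise quadratic forms b_m^T Sigma_q b_m, i.e. diag[B Sigma_q B^T].
    var_theta = np.einsum('mi,ij,mj->m', B, Sigma_q, B)
    return mu_theta, var_theta

# Hypothetical toy values: M = 3 observations, N = 2 latent dimensions.
B = np.array([[1.0, 0.5], [0.2, -1.0], [0.7, 0.7]])
mu_q = np.array([0.1, -0.3])
Sigma_q = np.array([[0.5, 0.1], [0.1, 0.4]])
mu_theta, var_theta = project_moments(B, mu_q, Sigma_q)
```

The `einsum` costs $O(MN^2)$, rather than the $O(M^2N)$ needed to build the full projected covariance when $M \gg N$.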

The derivative of $\sigma^2_{\theta_m}$ with respect to a single element $\boldsymbol\Sigma_{q,ij}$ is

\begin{equation}\begin{aligned}
\frac{\partial \sigma^2_{\theta_m}}
{\partial\boldsymbol\Sigma_{q,ij}}
&=
\mathbf B_{m,i}
\mathbf B_{m,j}
\end{aligned}\end{equation}

We also know that the derivative of the expectation of $A$ is: 

\begin{equation}\begin{aligned}
\frac{\partial \langle A(\theta_m) \rangle}
{\partial{\sigma^2_{\theta_m}}}
 = 
\frac 1 {2\sigma^2_{\theta_m}}
\left<
(\theta_m - \mu_{\theta_m}) \rho(\theta_m)
\right>
\end{aligned}\end{equation}

Combining these two with the chain rule yields

\begin{equation}\begin{aligned}
\frac{\partial\langle A(\theta_m) \rangle}
{\partial\boldsymbol\Sigma_{q,ij}}
=
\frac{\partial \sigma^2_{\theta_m}}{\partial\boldsymbol\Sigma_{q,ij}}
\frac{\partial \langle A(\theta_m) \rangle}{\partial{\sigma^2_{\theta_m}}}
=
\mathbf B_{i,m}^\top\frac 
{\left<\rho(\theta_m)(\theta_m - \mu_{\theta_m})\right>}
{2\sigma^2_{\theta_m}}
\mathbf B_{m,j},
\end{aligned}\end{equation}

which gives the derivative of the likelihood term with respect to each element of $\boldsymbol\Sigma_q$:

\begin{equation}\begin{aligned}
\frac
{\partial\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q}
{\partial\boldsymbol\Sigma_{q,ij}}
&=
\sum_{m=1..M}
\frac
{\partial\langle A(\theta_m)\rangle_Q}
{\partial\boldsymbol\Sigma_{q,ij}}
\\
&=
\sum_{m=1..M}
\mathbf B_{i,m}^\top
\frac 
{\left<\rho(\theta_m)(\theta_m - \mu_{\theta_m})\right>}
{2\sigma^2_{\theta_m}}
\mathbf B_{m,j}
\\
&=
\left\{
\mathbf B^\top
\operatorname{diag}\left[
{
\frac
{1}
{2\boldsymbol\sigma^2_{\boldsymbol\theta}}\circ
\left<\rho(\boldsymbol\theta)\circ(\boldsymbol\theta - \boldsymbol\mu_{\boldsymbol\theta})\right>}
\right]
\mathbf B
\right\}_{ij},
\end{aligned}\end{equation}

where $\circ$ denotes element-wise multiplication.

Overall and in terms of $\mathbf z$, we have:

\begin{equation}\begin{aligned}
{\partial}_{\boldsymbol\Sigma_{q}}
\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q
&=
\frac 1 2
\mathbf B^\top
\left(\mathbf I \circ 
\operatorname{diag}\left[
\mathbf B\boldsymbol\Sigma_q\mathbf B^\top
\right]
\right)^{-1}
\operatorname{diag}\left[
{\left<\rho(\mathbf B\mathbf z)\circ\mathbf B(\mathbf z - \boldsymbol\mu_q)\right>}
\right]
\mathbf B
\end{aligned}\end{equation}
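
For a nonlinearity without a convenient closed form, the expectation $\left<\rho(\theta_m)(\theta_m-\mu_{\theta_m})\right>$ can be approximated numerically. The sketch below assumes Gauss-Hermite quadrature for the one-dimensional Gaussian averages; the helper names and toy values are hypothetical. It is checked here against the known result for $\rho=\exp$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)   # standard-normal quadrature weights

def dA_dvar(rho, mu_theta, var_theta):
    """<rho(theta)(theta - mu)> / (2 sigma^2) per observation, by quadrature."""
    sd = np.sqrt(var_theta)[:, None]
    theta = mu_theta[:, None] + sd * nodes[None, :]          # shape (M, n_nodes)
    moment = np.sum(weights * rho(theta) * (theta - mu_theta[:, None]), axis=1)
    return moment / (2.0 * var_theta)

def grad_Sigma_likelihood(rho, B, mu_q, Sigma_q):
    """B^T diag[ d<A>/d sigma^2 ] B, the likelihood term of the gradient."""
    mu_theta = B @ mu_q
    var_theta = np.einsum('mi,ij,mj->m', B, Sigma_q, B)
    return B.T @ (dA_dvar(rho, mu_theta, var_theta)[:, None] * B)

# Hypothetical toy values.
B = np.array([[1.0, 0.5], [0.2, -1.0], [0.7, 0.7]])
mu_q = np.array([0.1, -0.3])
Sigma_q = np.array([[0.5, 0.1], [0.1, 0.4]])
G = grad_Sigma_likelihood(np.exp, B, mu_q, Sigma_q)
```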

In practice it is almost always easier to first derive the closed-form for $\langle A(\theta)\rangle$, and then differentiate. 


\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\Sigma_q}
&=
\partial_{\boldsymbol\Sigma_q}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
\right\}
+ 
\frac 1 2
\mathbf B^\top
\operatorname{diag}\left[
{\left<\rho(\mathbf B\mathbf z)\circ\mathbf B(\mathbf z - \boldsymbol\mu_q)\right>}
\circ\operatorname{diag}\left[\mathbf B\boldsymbol\Sigma_q\mathbf B^\top\right]^{-1}
\right]
\mathbf B
\end{aligned}\end{equation}


\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\Sigma_q}
&=
\partial_{\boldsymbol\Sigma_q}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
\right\}
+ 
\mathbf B^\top
\operatorname{diag}\left[
\partial_{\boldsymbol\sigma^2_\theta}
\left<
A(\boldsymbol\theta)
\right>
\right]
\mathbf B
\end{aligned}\end{equation}

# Some special cases

If $\rho = \exp$, we have:

\begin{equation}\begin{aligned}
\frac
{\partial\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q}
{\partial\boldsymbol\Sigma_{q}}
&=
\frac 1 2 
\mathbf B^\top
\operatorname{diag}\left[
 \exp(\boldsymbol\mu_\theta + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma_\theta])
\right]
\mathbf B
\\
\Rightarrow
\mathbf J_{\boldsymbol\Sigma_q}
&=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
+ 
\mathbf B^\top
\operatorname{diag}\left[
 \exp(\boldsymbol\mu_\theta + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma_\theta])
\right]
\mathbf B
\right\}
\end{aligned}\end{equation}
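
A finite-difference check is a cheap way to validate a matrix gradient like this one. The sketch below assumes the $\boldsymbol\Sigma_q$-dependent part of the loss with the log-determinant entering as $-\tfrac12\ln|\boldsymbol\Sigma_q|$ (the standard Gaussian KL divergence), so the analytic gradient contains $-\boldsymbol\Sigma_q^{-\top}$. All numerical values and function names are hypothetical.

```python
import numpy as np

def loss_in_Sigma(Sigma_q, mu_q, Sigma_z_inv, B):
    """Sigma_q-dependent terms of the loss for rho = exp (constants dropped)."""
    _, logdet = np.linalg.slogdet(Sigma_q)
    var_theta = np.einsum('mi,ij,mj->m', B, Sigma_q, B)
    rate = np.exp(B @ mu_q + 0.5 * var_theta)
    return 0.5 * (np.trace(Sigma_z_inv @ Sigma_q) - logdet) + rate.sum()

def grad_in_Sigma(Sigma_q, mu_q, Sigma_z_inv, B):
    """Analytic gradient: (1/2){Sigma_z^-1 - Sigma_q^-T + B^T diag[rate] B}."""
    var_theta = np.einsum('mi,ij,mj->m', B, Sigma_q, B)
    rate = np.exp(B @ mu_q + 0.5 * var_theta)
    return 0.5 * (Sigma_z_inv - np.linalg.inv(Sigma_q).T + B.T @ (rate[:, None] * B))

# Hypothetical toy problem.
B = np.array([[1.0, 0.5], [0.2, -1.0], [0.7, 0.7]])
mu_q = np.array([0.1, -0.3])
Sigma_q = np.array([[0.5, 0.1], [0.1, 0.4]])
Sigma_z_inv = np.array([[1.0, -0.2], [-0.2, 2.0]])

D = np.array([[0.3, -0.1], [-0.1, 0.2]])   # symmetric perturbation direction
h = 1e-6
fd = (loss_in_Sigma(Sigma_q + h * D, mu_q, Sigma_z_inv, B)
      - loss_in_Sigma(Sigma_q - h * D, mu_q, Sigma_z_inv, B)) / (2 * h)
analytic = np.sum(grad_in_Sigma(Sigma_q, mu_q, Sigma_z_inv, B) * D)
```

The directional derivative `fd` should match the inner product `analytic` to within finite-difference error.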

If $\rho = \Phi$, we have: 

\begin{equation}\begin{aligned}
\frac
{\partial\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q}
{\partial\boldsymbol\Sigma_{q}}
&=
\frac 1 2 
\mathbf B^\top
\operatorname{diag}\left[
\boldsymbol\gamma\circ\phi(\boldsymbol\gamma\circ\boldsymbol\mu_\theta)
\right]
\mathbf B,
\text{  where  } \gamma_m =
\frac 1 {\sqrt{1 + \sigma_m^2}}
\\
\Rightarrow
\mathbf J_{\boldsymbol\Sigma_q}
&=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
+ 
\mathbf B^\top
\operatorname{diag}\left[
\boldsymbol\gamma\circ\phi(\boldsymbol\gamma\circ\boldsymbol\mu_\theta)
\right]
\mathbf B
\right\}
\end{aligned}\end{equation}
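
The probit likelihood term can be checked the same way. The sketch below makes an assumption not stated in the notes: it takes $A(x) = x\Phi(x) + \phi(x)$, one antiderivative of $\Phi$, for which Stein's lemma gives $\langle A(x)\rangle = \mu\,\Phi(\gamma\mu) + \gamma^{-1}\phi(\gamma\mu)$. The toy values and function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def expected_A(B, mu_q, Sigma_q):
    """1^T <A(theta)> for A(x) = x Phi(x) + phi(x), so that A' = Phi."""
    mu_theta = B @ mu_q
    s = np.sqrt(1.0 + np.einsum('mi,ij,mj->m', B, Sigma_q, B))   # 1 / gamma
    return np.sum(mu_theta * norm.cdf(mu_theta / s) + s * norm.pdf(mu_theta / s))

def grad_Sigma_probit(B, mu_q, Sigma_q):
    """(1/2) B^T diag[gamma * phi(gamma * mu_theta)] B."""
    mu_theta = B @ mu_q
    gamma = 1.0 / np.sqrt(1.0 + np.einsum('mi,ij,mj->m', B, Sigma_q, B))
    return 0.5 * B.T @ ((gamma * norm.pdf(gamma * mu_theta))[:, None] * B)

# Hypothetical toy problem.
B = np.array([[1.0, 0.5], [0.2, -1.0], [0.7, 0.7]])
mu_q = np.array([0.1, -0.3])
Sigma_q = np.array([[0.5, 0.1], [0.1, 0.4]])

D = np.array([[0.3, -0.1], [-0.1, 0.2]])   # symmetric perturbation direction
h = 1e-6
fd = (expected_A(B, mu_q, Sigma_q + h * D)
      - expected_A(B, mu_q, Sigma_q - h * D)) / (2 * h)
analytic = np.sum(grad_Sigma_probit(B, mu_q, Sigma_q) * D)
```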

# Considering a special case

We consider the special case where the variational posterior is parameterized as

$$
\boldsymbol\Sigma_q^{-1}
=
\boldsymbol\Sigma_0^{-1}
+
\operatorname{diag}[\mathbf p]
$$

Let

$$
\boldsymbol\Lambda = \boldsymbol\Sigma_0^{-1}
$$

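In this case, the gradient in $\mathbf {p}$ follows by the chain rule from the gradient in $\boldsymbol\Sigma_q$. Recall that the likelihood term of that gradient is: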

\begin{equation}\begin{aligned}
{\partial}_{\boldsymbol\Sigma_{q}}
\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q
&=
\frac 1 2
\mathbf B^\top
\left(\mathbf I \circ 
\operatorname{diag}\left[
\mathbf B\boldsymbol\Sigma_q\mathbf B^\top
\right]
\right)^{-1}
\operatorname{diag}\left[
{\left<\rho(\mathbf B\mathbf z)\circ\mathbf B(\mathbf z - \boldsymbol\mu_q)\right>}
\right]
\mathbf B
\end{aligned}\end{equation}

### Hessian in $\boldsymbol\Sigma_q$

The Hessian in $\boldsymbol\Sigma_q$ is a fourth-order tensor. It's simpler to express the Hessian in terms of a Hessian-vector product, which can be used with [Krylov subspace](https://en.wikipedia.org/wiki/Krylov_subspace) solvers to efficiently compute $\mathbf H_{\boldsymbol\Sigma_q}^{-1} \mathbf J_{\boldsymbol\Sigma_q}$ in Newton's method. These solvers are implemented, for example, in Matlab and Scipy's "minres" function. Considering an $N\times N$ matrix $\mathbf M$, the Hessian-vector product is given by 

\begin{equation}\begin{aligned}
\left<\mathbf H_{\boldsymbol\Sigma_q}, \mathbf M\right>
&=
\partial_{\boldsymbol\Sigma_q}
\left< 
\mathbf J_{\boldsymbol\Sigma_q}
,
\mathbf M
\right>
=
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf J_{\boldsymbol\Sigma_q}^\top \mathbf M
\right]
\end{aligned}\end{equation}

where $\langle\cdot,\cdot\rangle$ denotes the scalar product. The Hessian-vector product for terms arising from the KL divergence is straightforward:

\begin{equation}\begin{aligned}
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
\right\}^\top\mathbf M
\right]
=
-\frac 1 2 
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\boldsymbol\Sigma_q^{-1}
\mathbf M
\right]
=
\frac 1 2 
\boldsymbol\Sigma_q^{-1}
\mathbf M^\top
\boldsymbol\Sigma_q^{-1}
\end{aligned}\end{equation}

The contribution from the expected negative log-likelihood is not straightforward in the general case. It does simplify for some specific choices of $\rho(\cdot)$, however. 

#### Hessian in $\boldsymbol\Sigma_q$ for $\rho=\exp$

Before continuing, we derive the following lemma. We use Einstein summation to simplify the notation:

\begin{equation}\begin{aligned}
\partial_{\boldsymbol\Sigma_{q,ij}}
\operatorname{tr}
\left[
\mathbf C
\operatorname{diag}\left[
\langle f(\boldsymbol\theta)\rangle
\right]
\right]
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
\mathbf C
\operatorname{diag}\left[
\langle f(\boldsymbol\theta)\rangle
\right]
\right]_{kk}
\\
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
\mathbf C_{lm}
\operatorname{diag}\left[
\langle f(\boldsymbol\theta)\rangle
\right]_{mn}
\right]_{kk}
\\
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
{\mathbf C}_{kk}
\operatorname{diag}\left[
\langle f(\boldsymbol\theta)\rangle
\right]_{k}
\right]
\\
&=
{\mathbf C}_{kk}
\langle 
\partial_{\boldsymbol\Sigma_{q,ij}}
f(\theta_k)\rangle
\\
&=
{\mathbf C}_{kk}
\mathbf B_{ik}^\top
\,\partial_{\sigma^2_{\theta}} \langle f(\theta_k)\rangle
\mathbf B_{kj}
\\
&=
\mathbf B_{ik}^\top
{\mathbf C}_{kk}
\,\partial_{\sigma^2_{\theta}} \langle f(\theta_k)\rangle
\mathbf B_{kj}
\\
&=
\left\{
\mathbf B^\top
\operatorname{diag}[\mathbf C]
\operatorname{diag}
\left[
\partial_{\sigma^2_{\boldsymbol\theta}} \langle f(\boldsymbol\theta)\rangle
\right]
\mathbf B
\right\}_{ij}
\\
\\
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}
\left[
\mathbf C
\operatorname{diag}\left[
\langle f(\boldsymbol\theta)\rangle
\right]
\right]
&=
\mathbf B^\top
\operatorname{diag}[\mathbf C]
\operatorname{diag}
\left[
\partial_{\sigma^2_{\boldsymbol\theta}} \langle f(\boldsymbol\theta)\rangle
\right]
\mathbf B
\end{aligned}\end{equation}

\begin{equation}\begin{aligned}
\langle 
\partial_{\boldsymbol\Sigma_q}
f(\boldsymbol\theta)\rangle
&=
\mathbf B^\top
\operatorname{diag}[
\partial_{\sigma^2_{\boldsymbol\theta}} \langle f(\boldsymbol\theta)\rangle
]
\mathbf B
\end{aligned}\end{equation}

We now consider the Hessian-vector product for the expected negative log-likelihood term in the case of an exponential nonlinearity, $\rho=\exp$:

\begin{equation}\begin{aligned}
\text{let } \bar{\boldsymbol\lambda}' &= 
\exp(\boldsymbol\mu_\theta + \tfrac 1 2 \operatorname{diag}[\boldsymbol\Sigma_\theta])
\\
\\
\partial_{\boldsymbol\Sigma_q}
\left< 
\partial_{\boldsymbol\Sigma_q}
\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q
,
\mathbf M
\right>
&=
\frac 1 2 
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B^\top
\operatorname{diag}\left[
\bar{\boldsymbol\lambda}'
\right]
\mathbf B\,
\mathbf M
\right]
\\
&=
\frac 1 2 
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B
\mathbf M
\mathbf B^\top
\operatorname{diag}\left[
\bar{\boldsymbol\lambda}'
\right]
\right]
\\
&=
\frac 1 2 
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B
\mathbf M
\mathbf B^\top
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]
\right]
\\
&=
\frac 1 2 
\mathbf B^\top
\operatorname{diag}[
\mathbf B
\mathbf M
\mathbf B^\top
]
\operatorname{diag}
\left[
\partial_{\sigma^2_{\boldsymbol\theta}} \langle A(\boldsymbol\theta)\rangle
\right]\,
\mathbf B
\\
&=
\frac 1 4 
\mathbf B^\top
\operatorname{diag}[
\mathbf B
\mathbf M
\mathbf B^\top
]
\operatorname{diag}
\left[
\bar{\boldsymbol\lambda}'
\right]\,
\mathbf B
\end{aligned}\end{equation}
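
The following sketch shows how such a Hessian-vector product might be passed to a Krylov solver. It assumes SciPy's `scipy.sparse.linalg.minres` and `LinearOperator`, treats $N\times N$ matrices as length-$N^2$ vectors, and takes the KL contribution with the sign implied by $-\tfrac12\ln|\boldsymbol\Sigma_q|$ in the loss, i.e. $+\tfrac12\boldsymbol\Sigma_q^{-1}\mathbf M^\top\boldsymbol\Sigma_q^{-1}$. The toy problem and all function names are hypothetical.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

# Hypothetical toy problem (sizes and values are illustrative only).
rng = np.random.default_rng(1)
N, M = 3, 6
B = 0.5 * rng.standard_normal((M, N))
mu_q = 0.1 * rng.standard_normal(N)
W = rng.standard_normal((N, N))
Sigma_q = 0.5 * np.eye(N) + 0.1 * W @ W.T      # symmetric positive definite
Sigma_z_inv = np.eye(N)

def rate_of(Sigma):
    """lambda-bar' = exp(mu_theta + sigma^2_theta / 2)."""
    var_theta = np.einsum('mi,ij,mj->m', B, Sigma, B)
    return np.exp(B @ mu_q + 0.5 * var_theta)

def grad(Sigma):
    """Gradient of the loss in Sigma_q for rho = exp."""
    weighted = B.T @ (rate_of(Sigma)[:, None] * B)
    return 0.5 * (Sigma_z_inv - np.linalg.inv(Sigma).T + weighted)

def hvp(Mat, Sigma):
    """Hessian-vector product: KL part plus expected-likelihood part."""
    Sinv = np.linalg.inv(Sigma)
    proj = np.einsum('mi,ij,mj->m', B, Mat, B)             # diag[B M B^T]
    lik = 0.25 * B.T @ ((proj * rate_of(Sigma))[:, None] * B)
    return 0.5 * Sinv @ Mat.T @ Sinv + lik

# Solve H x = J with MINRES, flattening matrices to vectors.
J = grad(Sigma_q)
op = LinearOperator(
    (N * N, N * N),
    matvec=lambda v: hvp(np.reshape(v, (N, N)), Sigma_q).ravel(),
    dtype=float,
)
x, info = minres(op, J.ravel())
newton_step = x.reshape(N, N)                   # approximates H^{-1} J
```

MINRES only needs the operator to be symmetric, which holds here under the Frobenius inner product. In a full optimizer the step would still need a safeguard that keeps $\boldsymbol\Sigma_q$ positive definite.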

#### Hessian in $\boldsymbol\Sigma_q$ for $\rho=\Phi$

In the case that $\rho = \Phi$:


\begin{equation}\begin{aligned}
\text{let } \bar{\boldsymbol\lambda}' &= 
\tfrac 1 2
\boldsymbol\gamma\circ\phi(\boldsymbol\gamma\circ\boldsymbol\mu_\theta)
=
\tfrac 1 {2 \boldsymbol\sigma^2_{\theta} } \left<(\boldsymbol\theta-\boldsymbol\mu_\theta)\Phi(\boldsymbol\theta)\right>
\\
\\
\partial_{\boldsymbol\Sigma_q}
\left< 
\partial_{\boldsymbol\Sigma_q}
\mathbf 1^\top \langle A(\boldsymbol\theta)\rangle_Q
,
\mathbf M
\right>
&=
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B^\top
\operatorname{diag}\left[
\bar{\boldsymbol\lambda}'
\right]
\mathbf B\,
\mathbf M
\right]
\\
&=
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B
\mathbf M
\mathbf B^\top
\operatorname{diag}\left[
\bar{\boldsymbol\lambda}'
\right]
\right]
\\
&=
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B
\mathbf M
\mathbf B^\top
\operatorname{diag}\left[
\tfrac 1 {2\boldsymbol\sigma^2_{\theta}} \left<(\boldsymbol\theta-\boldsymbol\mu_\theta)\Phi(\boldsymbol\theta)\right>
\right]
\right]
\\
&=
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}\left[
\mathbf B
\mathbf M
\mathbf B^\top
\operatorname{diag}\left[
\left<
\tfrac 1 {2\boldsymbol\sigma^2_{\theta}}
(\boldsymbol\theta-\boldsymbol\mu_\theta)\Phi(\boldsymbol\theta)\right>
\right]
\right]
\\
&=
\mathbf B^\top
\operatorname{diag}[
\mathbf B
\mathbf M
\mathbf B^\top
]
\operatorname{diag}
\left[
\partial_{\sigma^2_{\boldsymbol\theta}} 
\left<
\tfrac 1 {2\boldsymbol\sigma^2_{\theta}}
(\boldsymbol\theta-\boldsymbol\mu_\theta)\Phi(\boldsymbol\theta)
\right>
\right]
\mathbf B
\\
&=
\mathbf B^\top
\operatorname{diag}[
\mathbf B
\mathbf M
\mathbf B^\top
]
\operatorname{diag}
\left[
\tfrac {-1} {2\boldsymbol\sigma^4_{\theta}}
\left<
(\boldsymbol\theta-\boldsymbol\mu_\theta)\Phi(\boldsymbol\theta)
\right>
+
\tfrac 1 {2\boldsymbol\sigma^2_{\theta}}
\partial_{\sigma^2_{\boldsymbol\theta}} 
\left<
\boldsymbol\theta
\Phi(\boldsymbol\theta)
\right>
-
\tfrac {\boldsymbol\mu_\theta} {2\boldsymbol\sigma^2_{\theta}}
\partial_{\sigma^2_{\boldsymbol\theta}} 
\left<
\Phi(\boldsymbol\theta)
\right>
\right]
\mathbf B
\\
&=
\mathbf B^\top
\operatorname{diag}[
\mathbf B
\mathbf M
\mathbf B^\top
]
\operatorname{diag}
\left[
\tfrac {-1} {\boldsymbol\sigma^2_{\theta}}
\bar{\boldsymbol\lambda}'
+
\tfrac 1 {2\boldsymbol\sigma^2_{\theta}}
\partial_{\sigma^2_{\boldsymbol\theta}} 
\left<
\boldsymbol\theta
\Phi(\boldsymbol\theta)
\right>
-
\tfrac {\boldsymbol\mu_\theta} {2\boldsymbol\sigma^2_{\theta}}
\partial_{\sigma^2_{\boldsymbol\theta}} 
\left<
\Phi(\boldsymbol\theta)
\right>
\right]
\mathbf B
\end{aligned}\end{equation}


Applying the lemma above to $f = A$, again using Einstein summation, gives:

\begin{equation}\begin{aligned}
\partial_{\boldsymbol\Sigma_{q,ij}}
\operatorname{tr}
\left[
\mathbf C
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]
\right]
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
\mathbf C
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]
\right]_{kk}
\\
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
\mathbf C_{lm}
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]_{mn}
\right]_{kk}
\\
&=
\partial_{\boldsymbol\Sigma_{q,ij}}
\left[
{\mathbf C}_{kk}
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]_{k}
\right]
\\
&=
{\mathbf C}_{kk}
\langle 
\partial_{\boldsymbol\Sigma_{q,ij}}
A(\theta_k)\rangle
\\
&=
{\mathbf C}_{kk}
\mathbf B_{ik}^\top
\tfrac
{\left<\rho(\theta_k)(\theta_k - \mu_{\theta_k})\right>}
{2\sigma^2_{\theta_k}}
\mathbf B_{kj}
\\
&=
\mathbf B_{ik}^\top
{\mathbf C}_{kk}
\tfrac
{\left<\rho(\theta_k)(\theta_k - \mu_{\theta_k})\right>}
{2\sigma^2_{\theta_k}}
\mathbf B_{kj}
\\
&=
\left\{
\mathbf B^\top
\operatorname{diag}[\mathbf C]
\operatorname{diag}
\left[
\tfrac
{\left<\rho(\theta_k)(\theta_k - \mu_{\theta_k})\right>}
{2\sigma^2_{\theta_k}}
\right]
\mathbf B
\right\}_{ij}
\\
\\
\partial_{\boldsymbol\Sigma_q}
\operatorname{tr}
\left[
\mathbf C
\operatorname{diag}\left[
\langle A(\boldsymbol\theta)\rangle
\right]
\right]
&=
\mathbf B^\top
\operatorname{diag}[\mathbf C]
\operatorname{diag}
\left[
\tfrac
{\left<\rho(\theta_k)(\theta_k - \mu_{\theta_k})\right>}
{2\sigma^2_{\theta_k}}
\right]
\mathbf B
\end{aligned}\end{equation}

# Case study: log-Gaussian Poisson 

We explore the specific case of $\rho = \exp$. In this case, $\rho''=\rho'=\rho$ and $\langle\rho(\theta)\rangle$ has the closed form $\exp(\mu_\theta + \sigma^2_\theta/2)$, which can be obtained by completing the square in the integral (or by looking up the mean of a log-normal distribution). Define $\boldsymbol\lambda$ as:

\begin{equation}\begin{aligned}
\boldsymbol\lambda 
&= \exp(\mathbf B \mathbf z)
\\
\langle\boldsymbol\lambda\rangle 
&= \exp\left(\mathbf B \boldsymbol\mu_q + \tfrac12\operatorname{diag}[\mathbf B \boldsymbol\Sigma_q \mathbf B^\top ]\right)
\end{aligned}\end{equation}

The loss is 

\begin{equation}\begin{aligned}
\mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q)
&=
\frac 1 2 \left\{
(\boldsymbol\mu_z-\boldsymbol\mu_q)^\top 
\boldsymbol\Sigma_z^{-1}
(\boldsymbol\mu_z-\boldsymbol\mu_q)
+
\operatorname{tr}\left(
\boldsymbol\Sigma_z^{-1}
\boldsymbol\Sigma_q
\right)
-
\ln|\boldsymbol\Sigma_z^{-1}\boldsymbol\Sigma_q|\right\}
+ \mathbf 1^\top\langle\boldsymbol\lambda\rangle 
- \mathbf y^\top \mathbf B \boldsymbol\mu_q 
\end{aligned}\end{equation}

The gradient and Hessian of $\mathcal L$ with respect to $\boldsymbol\mu_q$ are:

\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\mu_q}
&=
\boldsymbol\Sigma_z^{-1}
(\boldsymbol\mu_q-\boldsymbol\mu_z)
+
\mathbf B^\top(\langle\boldsymbol\lambda\rangle -\mathbf y)
\\
\mathbf H_{\boldsymbol\mu_q}
&=
\boldsymbol\Sigma_z^{-1}
+
\mathbf B^\top
\operatorname{diag}\left[
\langle\boldsymbol\lambda\rangle 
\right]
\mathbf B
\end{aligned}\end{equation}
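
With these two expressions, $\boldsymbol\mu_q$ can be updated by Newton's method while $\boldsymbol\Sigma_q$ is held fixed. The sketch below is one possible implementation under assumptions: it adds a simple step-halving safeguard (not part of the derivation above), uses synthetic Poisson counts, and all function names are hypothetical.

```python
import numpy as np

def mu_loss(mu_q, y, B, mu_z, Sigma_z_inv, half_var):
    """mu_q-dependent terms of the log-Gaussian Poisson loss."""
    d = mu_q - mu_z
    theta = B @ mu_q
    return 0.5 * d @ Sigma_z_inv @ d + np.exp(theta + half_var).sum() - y @ theta

def mu_grad_hess(mu_q, y, B, mu_z, Sigma_z_inv, half_var):
    """Gradient J and Hessian H of the loss in mu_q."""
    rate = np.exp(B @ mu_q + half_var)                    # <lambda>
    J = Sigma_z_inv @ (mu_q - mu_z) + B.T @ (rate - y)
    H = Sigma_z_inv + B.T @ (rate[:, None] * B)
    return J, H

def newton_mu(y, B, mu_z, Sigma_z_inv, Sigma_q, n_iter=50):
    """Newton iterations on mu_q with Sigma_q held fixed."""
    half_var = 0.5 * np.einsum('mi,ij,mj->m', B, Sigma_q, B)
    args = (y, B, mu_z, Sigma_z_inv, half_var)
    mu_q = mu_z.copy()
    for _ in range(n_iter):
        J, H = mu_grad_hess(mu_q, *args)
        step = np.linalg.solve(H, J)
        t, f0 = 1.0, mu_loss(mu_q, *args)
        # Halve the step until the loss does not increase (simple safeguard).
        while mu_loss(mu_q - t * step, *args) > f0 and t > 1e-8:
            t *= 0.5
        mu_q = mu_q - t * step
    return mu_q

# Synthetic data: counts drawn from a Poisson model with a random latent state.
rng = np.random.default_rng(2)
N, M = 3, 20
B = 0.5 * rng.standard_normal((M, N))
y = rng.poisson(np.exp(B @ rng.standard_normal(N))).astype(float)
mu_z, Sigma_z_inv, Sigma_q = np.zeros(N), np.eye(N), 0.1 * np.eye(N)

mu_hat = newton_mu(y, B, mu_z, Sigma_z_inv, Sigma_q)
half_var = 0.5 * np.einsum('mi,ij,mj->m', B, Sigma_q, B)
J_hat, _ = mu_grad_hess(mu_hat, y, B, mu_z, Sigma_z_inv, half_var)
```

Because the Hessian is positive definite, each Newton direction is a descent direction, and the gradient `J_hat` at the returned point should be close to zero.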


The gradient in $\boldsymbol\Sigma_q$ is

\begin{equation}\begin{aligned}
\mathbf J_{\boldsymbol\Sigma_q}
&=
\frac 1 2 \left\{
\boldsymbol\Sigma_z^{-1}
-
\boldsymbol\Sigma_q^{-\top}
+
\mathbf B^\top
\operatorname{diag}[\langle\boldsymbol\lambda\rangle]
\mathbf B
\right\}
\end{aligned}\end{equation}
