# Automating Probabilistic ML

* probabilistic programming
* specify a probabilistic model
    * $\to$ "directly" get the solution

### preliminaries for black box variational inference

* $\nabla_\lambda E_{q_\lambda(z)}[f(z)] = E_{q_{\lambda}(z)}[(\nabla_\lambda \log q_\lambda(z)) f(z)]$
    * *proof*  
    $\nabla_\lambda E_{q_\lambda(z)}[f(z)] = \nabla_\lambda \int q_\lambda(z) f(z) dz$  
    $= \int (\nabla_\lambda q_\lambda(z)) f(z) dz$  
    $= \int q_\lambda(z) (\nabla_\lambda \log q_\lambda (z)) f(z) dz$
* $E_{q_\lambda(z)} [\nabla_\lambda \log q_\lambda (z)] = 0$
    * *proof*  
    the above expression $= \int q_\lambda (z) (q_\lambda(z))^{-1} \nabla_\lambda q_\lambda (z) dz$  
    $= \nabla_\lambda \int q_\lambda (z) dz$  
    $= \nabla_\lambda (1) = 0$

* $\nabla_\lambda E_{q_\lambda(z)} [\log \frac{p(x, z)}{q_\lambda(z)}]$  
$= \nabla_\lambda \int q_\lambda(z) \log \frac{p(x, z)}{q_\lambda(z)} dz$  
$= \int (\nabla_\lambda q_\lambda(z)) \log \frac{p(x, z)}{q_\lambda(z)} dz - \int q_\lambda(z) (\nabla_\lambda \log q_\lambda(z)) dz$  
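$= E_{q_\lambda(z)}[(\nabla_\lambda \log q_\lambda(z)) \log \frac{p(x, z)}{q_\lambda(z)}]$  
(the first integral uses $\nabla_\lambda q_\lambda = q_\lambda \nabla_\lambda \log q_\lambda$; the second is $E_{q_\lambda(z)}[\nabla_\lambda \log q_\lambda(z)] = 0$)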
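The two score-function identities can be checked numerically. The sketch below assumes a specific choice that is not in the notes, $q_\lambda(z) = \mathcal{N}(z | \lambda, 1)$ and $f(z) = z^2$, for which $\nabla_\lambda \log q_\lambda(z) = z - \lambda$ and the exact gradient is $2 \lambda$. The function name `score_identity_check` is made up for illustration.

```python
import random

def score_identity_check(lam=1.5, num_samples=200_000, seed=0):
    """Monte Carlo check of both identities for q_lam(z) = N(z | lam, 1).

    With f(z) = z**2 we have E[f(z)] = lam**2 + 1, so the exact gradient
    with respect to lam is 2 * lam.
    """
    rng = random.Random(seed)
    score_mean = 0.0   # estimates E[grad log q]
    grad_est = 0.0     # estimates E[(grad log q) f(z)]
    for _ in range(num_samples):
        z = rng.gauss(lam, 1.0)
        score = z - lam            # grad_lam log q_lam(z)
        score_mean += score
        grad_est += score * z ** 2
    return score_mean / num_samples, grad_est / num_samples

score_mean, grad_est = score_identity_check()
print(score_mean, grad_est)  # should be roughly 0 and roughly 2 * 1.5
```

The estimate is noisy even with many samples, which already hints at the variance problem of this estimator.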

## black box variational inference

* $\nabla_\lambda E_{q_\lambda(z)}[\log \frac{p(x, z)}{q_\lambda(z)}]$  
$\approx \frac{1}{S} \sum_{s=1}^S (\nabla_\lambda \log q_\lambda(z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda (z^{(s)}))$
    * $z^{(s)} \stackrel{iid}{\sim} q_\lambda(z)$
    * optimize variational distribution with stochastic gradient descent
    * consider $p(x, z) = \frac{1}{c} v(x, z)$  
    then we can ignore the normalizing constant $c$: it only shifts the ELBO by $-\log c$, and $\nabla_\lambda \log c = 0$

* the naive black box gradient estimator has high variance, so SGD is inefficient
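A minimal sketch of the BBVI loop, under assumptions chosen only so the answer is known: the model is $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, 1)$, and $q_m(z) = \mathcal{N}(z | m, 1)$ with the variance held fixed. The ELBO gradient is then exactly $x - 2m$, so the optimum is $m = x / 2$. Only the unnormalized joint `log_v` is used, as in the remark about $c$. All function names are hypothetical.

```python
import random

def log_v(x, z):
    """Unnormalized log joint for z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z ** 2 - 0.5 * (x - z) ** 2

def log_q(z, m):
    """Log density of q_m(z) = N(z | m, 1), up to an additive constant."""
    return -0.5 * (z - m) ** 2

def bbvi_gradient(x, m, num_samples, rng):
    """Score-function estimate of d ELBO / d m from S samples of q_m."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, 1.0)
        score = z - m                       # grad_m log q_m(z)
        total += score * (log_v(x, z) - log_q(z, m))
    return total / num_samples

def fit_bbvi(x, steps=500, num_samples=50, lr=0.05, seed=0):
    """Stochastic gradient ascent on the ELBO with respect to m."""
    rng = random.Random(seed)
    m = 0.0
    for _ in range(steps):
        m += lr * bbvi_gradient(x, m, num_samples, rng)
    return m

m_hat = fit_bbvi(x=2.0)
print(m_hat)  # should land near x / 2 = 1, with some jitter from gradient noise
```

Only `log_v` and the score of $q$ are needed, which is what makes the method "black box".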

**rao-blackwellization**

* given an estimator that depends on multiple random variables, if we can integrate out some of them analytically,
then we get an estimator with lower (or equal) variance
* e.g. let $\bar{J} = E_X E_Y[J(X, Y)]$  
$\hat{J}(X) = E_Y[J(X, Y) | X]$  
$E_X[\hat{J}(X)] = \bar{J}$  
then $J(X, Y)$ can be sampled to estimate $\bar{J}$  
and $\hat{J}(X)$ can be sampled to estimate $\bar{J}$

**theorem** $Var(\hat{J}(X)) \leq Var(J(X, Y))$
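A small numerical illustration of the theorem, assuming a toy setup that is not from the notes: $X \sim \mathcal{N}(1, 1)$ and $Y \sim \mathcal{N}(2, 1)$ independent, with $J(X, Y) = X + Y$. Integrating out $Y$ gives $\hat{J}(X) = X + 2$; both have mean 3, but the variances should be about 2 and 1.

```python
import random
import statistics

def rao_blackwell_demo(num_samples=20_000, seed=0):
    """Sample J(X, Y) = X + Y and its conditional mean J_hat(X) = X + E[Y]."""
    rng = random.Random(seed)
    naive, rao_blackwellized = [], []
    for _ in range(num_samples):
        x = rng.gauss(1.0, 1.0)
        y = rng.gauss(2.0, 1.0)
        naive.append(x + y)                 # J(X, Y)
        rao_blackwellized.append(x + 2.0)   # E_Y[J(X, Y) | X], Y integrated out
    return naive, rao_blackwellized

naive, rb = rao_blackwell_demo()
print(statistics.mean(naive), statistics.variance(naive))
print(statistics.mean(rb), statistics.variance(rb))
```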

**simple case** $P(X, Z) = \prod_i p(z_i) p(x_i | z_i) = \prod_i p(x_i, z_i)$  
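$q(z) = \prod_i q_{\lambda_i} (z_i)$ (mean field approximation)  
$\nabla_{\lambda_i} E_{q(z)} [\log \frac{p(x, z)}{q(z)}]$  
$= \nabla_{\lambda_i} E_{q(z_1), \ldots, q(z_n)} [\sum_j \log p(x_j, z_j) - \sum_j \log q(z_j)]$  
$= E_{q(z_1), \ldots, q(z_n)} [(\nabla_{\lambda_i} \log q(z_i)) (\sum_j \log p(x_j, z_j) - \sum_j \log q(z_j))]$  
$= E_{q(z_i)}[(\nabla_{\lambda_i} \log q(z_i))(\log p(x_i, z_i) - \log q(z_i))]$  
(only $q(z_i)$ depends on $\lambda_i$, and every term with $j \neq i$ factorizes into $E_{q(z_i)}[\nabla_{\lambda_i} \log q(z_i)] \cdot E_{q(z_j)}[\cdots] = 0$)

so the gradient w.r.t. $\lambda_i$ is an expectation over $z_i$ only: the other $z_j$ have been integrated out analytically (rao-blackwellization)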

**e.g.** logistic regression  
$p(w) \prod_i p(t_i | w)$  
$q(w) = \mathcal{N}(w | m, V)$  
can't factor

**control variates**

**e.g.** $\mu = E_{p(v)}[f(v)]$  
then $\hat{\mu} = f(v^{(s)})$ with $v^{(s)} \sim p(v)$  
high variance

consider any random variable $w$ s.t. $E[w] = 0$  
use $\tilde{\mu} = \hat{\mu} - c w$  
$E[\tilde{\mu}] = E[\hat{\mu}] - c E[w] = E[\hat{\mu}]$  
want to show $Var(\tilde{\mu}) < Var(\hat{\mu})$  
$Var(\tilde{\mu}) = Var(\hat{\mu}) + c^2 Var(w) - 2 c Cov(\hat{\mu}, w)$  
to minimize, $c = \frac{Cov(\hat{\mu}, w)}{Var(w)}$, and we get  
$= Var(\hat{\mu}) - \frac{(Cov(\hat{\mu}, w))^2}{Var(w)}$  
$\leq Var(\hat{\mu})$  
this is also equal to $Var(\hat{\mu}) - Cor(\hat{\mu}, w)^2 Var(\hat{\mu})$  
$= Var(\hat{\mu}) (1 - Cor(\hat{\mu}, w)^2)$
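The variance formula can be seen at work on a standard toy problem (an assumption, not from the notes): estimate $\mu = E[e^v]$ with $v \sim U(0, 1)$, using $w = v - \frac{1}{2}$ as the zero-mean variable. Since $e^v$ and $v$ are strongly correlated, most of the variance should disappear.

```python
import math
import random
import statistics

def control_variate_demo(num_samples=20_000, seed=0):
    """Return raw samples f(v), corrected samples f(v) - c * w, and c."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(num_samples)]
    mu_hat = [math.exp(vi) for vi in v]     # raw one-sample estimates f(v)
    w = [vi - 0.5 for vi in v]              # E[w] = 0 by construction
    mu_bar = statistics.mean(mu_hat)
    cov = sum((a - mu_bar) * b for a, b in zip(mu_hat, w)) / num_samples
    var_w = sum(b * b for b in w) / num_samples
    c = cov / var_w                         # variance-minimizing coefficient
    mu_tilde = [a - c * b for a, b in zip(mu_hat, w)]
    return mu_hat, mu_tilde, c

mu_hat, mu_tilde, c = control_variate_demo()
print(statistics.mean(mu_hat), statistics.variance(mu_hat))
print(statistics.mean(mu_tilde), statistics.variance(mu_tilde))  # same mean, much smaller variance
```

Here $c$ is estimated from the same samples it is applied to, which is convenient but introduces a small bias; estimating it on a separate batch avoids that.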

how to choose $w$ and calculate $c$?

* $w^{(s)} = \nabla_\lambda \log q_\lambda (z^{(s)})$
* $\hat{\mu}^{(s)} = (\nabla_\lambda \log q_\lambda(z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda (z^{(s)}))$  
where $z^{(s)} \sim q_\lambda$
* to calculate/estimate $c$
    * $b = Cov(\hat{\mu}, w) \approx \frac{1}{S} \sum_s^S (\hat{\mu}^{(s)} - \bar{\mu}) (w^{(s)} - 0)$
    * $\bar{\mu} = \frac{1}{S} \sum_s \hat{\mu}^{(s)}$
    * $a = Var(w) \approx \frac{1}{S} \sum_s (w^{(s)})^2$
    * then $c = b / a$

then we get $\frac{1}{S} \sum_{s=1}^S (\nabla_\lambda \log q_\lambda (z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda(z^{(s)}) - c)$
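The recipe for $b$, $a$ and $c$ can be sketched on a toy model, assumed here for illustration: $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, 1)$, $q_m(z) = \mathcal{N}(z | m, 1)$, where the exact gradient w.r.t. $m$ is $x - 2m$. The function name is hypothetical.

```python
import random
import statistics

def bbvi_with_control_variate(x=2.0, m=0.0, num_samples=20_000, seed=0):
    """Per-sample BBVI gradients before and after subtracting c * w."""
    rng = random.Random(seed)
    w, mu_hat = [], []
    for _ in range(num_samples):
        z = rng.gauss(m, 1.0)
        score = z - m                                   # w = grad_m log q_m(z)
        log_p = -0.5 * z ** 2 - 0.5 * (x - z) ** 2      # up to a constant
        log_q = -0.5 * (z - m) ** 2                     # up to a constant
        w.append(score)
        mu_hat.append(score * (log_p - log_q))
    mu_bar = statistics.mean(mu_hat)
    b = sum((u - mu_bar) * wi for u, wi in zip(mu_hat, w)) / num_samples  # Cov(mu_hat, w)
    a = sum(wi * wi for wi in w) / num_samples                            # Var(w), using E[w] = 0
    c = b / a
    mu_tilde = [u - c * wi for u, wi in zip(mu_hat, w)]  # score * (log p - log q - c)
    return mu_hat, mu_tilde, c

mu_hat, mu_tilde, c = bbvi_with_control_variate()
print(statistics.mean(mu_hat), statistics.variance(mu_hat))
print(statistics.mean(mu_tilde), statistics.variance(mu_tilde))  # same mean, lower variance
```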

## variational auto-encoder

two ideas

1. alternative method to express gradients as expectations and estimate them through sampling
2. amortized inference

recall,  
$ELBO = E_{q(z)}[\log \frac{p(x, z)}{q(z)}]$  
$= E_{q(z)}[\log p(z) + \log p(x|z) - \log q(z)]$  
$= E_{q(z)}[\log p(x|z)] - d_{KL}(q(z) || p(z))$  
and gradients for the second term should be easy to obtain

### reparameterization

given $E_{q_\phi(z)}[f(x, z)]$  
want to optimize by choosing $q_\phi(z)$ (need gradients w.r.t. $\phi$)

in some cases, can rewrite $E[\cdots]$ so that $\nabla$ is easy

**e.g.** let $q_\phi(z) = \mathcal{N}(z | \mu, \Sigma)$

* $\phi = (\mu, \Sigma)$
* cholesky decomposition $\Sigma = L L^\top$, $L$ is lower triangular
* then $q_\phi(z) = \mathcal{N}(z | \mu, L L^\top)$
* let $\epsilon \sim \mathcal{N}(0, I)$
* then we can say $z = \mu + L \epsilon$
* $z \sim \mathcal{N}(\mu, L I L^\top)$
* $E_{q_\phi(z)} [f(x, z)] = E_{\epsilon}[f(x, \mu + L \epsilon)]$
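The transformation $z = \mu + L \epsilon$ can be checked empirically. The sketch below hard-codes the 2-D case with plain lists; the values of `mu` and `L` are arbitrary assumptions.

```python
import random

def sample_gaussian(mu, L, rng):
    """Draw z = mu + L eps with eps ~ N(0, I); L is a 2x2 lower-triangular matrix."""
    eps = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [
        mu[0] + L[0][0] * eps[0],
        mu[1] + L[1][0] * eps[0] + L[1][1] * eps[1],
    ]

def empirical_moments(mu, L, num_samples=100_000, seed=0):
    """Sample mean and covariance of z = mu + L eps."""
    rng = random.Random(seed)
    zs = [sample_gaussian(mu, L, rng) for _ in range(num_samples)]
    mean = [sum(z[k] for z in zs) / num_samples for k in range(2)]
    cov = [[sum((z[a] - mean[a]) * (z[b] - mean[b]) for z in zs) / num_samples
            for b in range(2)] for a in range(2)]
    return mean, cov

mu = [1.0, -2.0]
L = [[2.0, 0.0], [0.5, 1.0]]   # Sigma = L L^T = [[4, 1], [1, 1.25]]
mean, cov = empirical_moments(mu, L)
print(mean, cov)  # should be close to mu and to L L^T
```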

**more formally**,  
$z \sim q_\phi(z) \iff \epsilon \sim p(\epsilon)$ and $z = g_\phi(\epsilon)$

$ELBO = E_{q_\phi(z)}[\log \frac{p(x, z)}{q_\phi(z)}]$  
$= E_\epsilon[\log \frac{p(x, g_\phi(\epsilon))}{q_\phi(g_\phi(\epsilon))}]$

$\nabla_\phi ELBO = E_\epsilon[\nabla_\phi \log p(x, g_\phi(\epsilon))] - E_\epsilon[\nabla_\phi \log q_\phi(g_\phi(\epsilon))]$

then $\nabla_\phi ELBO \approx \frac{1}{S} \sum_{s=1}^S [\nabla_\phi \log p(x, g_\phi(\epsilon^{(s)})) - \nabla_\phi \log q_\phi(g_\phi(\epsilon^{(s)}))]$ with $\epsilon^{(s)} \stackrel{iid}{\sim} p(\epsilon)$

then use chain rule  
requirements w.r.t. $\nabla$ are stronger than in BBVI  
requires $g_\phi(\epsilon)$ to be differentiable
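A hedged comparison of the reparameterization estimator with the score-function estimator, on a toy model assumed for illustration: $z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, 1)$, $q(z) = \mathcal{N}(m, s^2)$, $g_\phi(\epsilon) = m + s \epsilon$. Only the gradient w.r.t. $m$ is computed; its exact value is $x - 2m$.

```python
import random
import statistics

def gradient_samples(x=2.0, m=0.0, s=1.0, num_samples=20_000, seed=0):
    """Per-sample estimates of d ELBO / d m from both estimators."""
    rng = random.Random(seed)
    reparam, score_fn = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = m + s * eps                     # z = g_phi(eps)
        # Chain rule: d/dm log p(x, z) = (d log p / dz) * (dz / dm), with dz/dm = 1.
        dlogp_dz = -z + (x - z)
        # log q(g_phi(eps)) = -log s - eps^2 / 2 + const has no dependence on m,
        # so the second term of the gradient vanishes here.
        reparam.append(dlogp_dz)
        # Score-function estimate of the same gradient, for comparison.
        log_p = -0.5 * z ** 2 - 0.5 * (x - z) ** 2
        log_q = -0.5 * eps ** 2
        score_fn.append((z - m) / s ** 2 * (log_p - log_q))
    return reparam, score_fn

reparam, score_fn = gradient_samples()
print(statistics.mean(reparam), statistics.variance(reparam))
print(statistics.mean(score_fn), statistics.variance(score_fn))  # same mean, larger variance
```

The price is that $\log p(x, z)$ must be differentiated w.r.t. $z$ (done by hand here), which BBVI never requires.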

consider $p_\theta(x, z) = \prod_i p_\theta(x_i, z_i) = \prod_i p_\theta(z_i) p_\theta(x_i | z_i)$  

* also assume $q(z) = \prod_i q_i(z_i)$

$ELBO = E_{q(z)}[\log \frac{p(x, z)}{q(z)}]$  
$= E_{q(z)}[\log \prod p(x_i | z_i) + \log \prod p(z_i)- \log \prod q(z_i)]$  
$= \sum_i E_{q(z_i)}[\log p(x_i | z_i)] + \sum_i E_{q(z_i)}[\log p(z_i)] - \sum_i E_{q(z_i)}[\log q(z_i)]$  
$= \sum_i E_\epsilon [\log p(x_i | g_{\phi_i}(\epsilon))] + \cdots$

$\nabla_{\phi_k} ELBO = E_\epsilon [\nabla_{\phi_k} \log p(x_k | g_{\phi_k}(\epsilon))] + \cdots$

### amortized inference

consider $p(x, z) = \prod_i p(z_i) p(x_i | z_i)$ and $q(z) = \prod_i q(z_i)$

for variational inference, pick $q(z_i)$ to optimize ELBO

* implicitly, $q(z_i)$ depends on $x_i$, but the dependence is not explicit
* $q(z_i)$ "flexible" to adjust according to the data
* $q(z_i) = q_\phi(z_i | x_i)$
* **e.g.**, $\mu = A x_i$
    * puts some restrictions on parameters
    * if we want to be flexible, instead of a linear model, use something like a DNN

**e.g.** $q_\phi(z_i | x_i) = \mathcal{N}(z_i | \mu_\phi(x_i), \sigma_\phi(x_i)^2)$

then $x_i$ is input to the network  
and network has two outputs, $\mu$ and $\sigma^2$  
$\phi$ represents the weights/biases of network and are parameters to be tuned  
$z_i \in \mathbb{R}^d$, use diagonal approximation  
$z_{ik} \sim \mathcal{N}(\mu_{k, \phi}(x_i), \sigma^2_{k, \phi}(x_i))$

tradeoff: the number of variational parameters goes from $2 N$ (two per example, $\mu_i$ and $\sigma_i^2$) to $|\phi|$ (the number of weights/biases in the NN), which does not grow with $N$
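A minimal sketch of an amortized encoder for scalar $x_i$ and scalar $z_i$: one shared network maps each $x_i$ to $(\mu_\phi(x_i), \sigma_\phi^2(x_i))$. The architecture (one tanh hidden layer of width 8) and the choice to predict $\log \sigma^2$ and exponentiate are assumptions made for this illustration, and the weights are random rather than trained.

```python
import math
import random

def init_encoder(hidden=8, seed=0):
    """Weights/biases phi of a one-hidden-layer network for scalar inputs."""
    rng = random.Random(seed)
    return {
        "w1": [rng.gauss(0.0, 0.5) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w_mu": [rng.gauss(0.0, 0.5) for _ in range(hidden)],
        "b_mu": 0.0,
        "w_logvar": [rng.gauss(0.0, 0.5) for _ in range(hidden)],
        "b_logvar": 0.0,
    }

def encode(phi, x):
    """Return (mu_phi(x), sigma_phi(x)^2), the parameters of q_phi(z | x)."""
    h = [math.tanh(w * x + b) for w, b in zip(phi["w1"], phi["b1"])]
    mu = sum(w * hk for w, hk in zip(phi["w_mu"], h)) + phi["b_mu"]
    log_var = sum(w * hk for w, hk in zip(phi["w_logvar"], h)) + phi["b_logvar"]
    return mu, math.exp(log_var)    # exp keeps the variance positive

def num_parameters(phi):
    """Count scalar weights and biases in phi."""
    return sum(len(v) if isinstance(v, list) else 1 for v in phi.values())

phi = init_encoder()
data = [0.1 * i for i in range(1000)]                  # N = 1000 scalar examples
variational_params = [encode(phi, x) for x in data]    # one (mu, sigma^2) per example
print(num_parameters(phi), 2 * len(data))              # |phi| versus 2N
```

Adding more examples grows `data` but not `phi`, and a new $x$ gets its $q_\phi(z | x)$ from a single forward pass.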

minibatch sample

$\nabla_\phi ELBO \approx \frac{1}{S} \sum_s \frac{N}{M} \sum_{i \in \mathcal{M}} \nabla_\phi [\log p_\theta(g_\phi(\epsilon_i^{(s)})) + \log p_\theta (x_i | g_\phi(\epsilon_i^{(s)})) - \log q_\phi(g_\phi(\epsilon_i^{(s)}))]$

* $S$ : number of samples of $\epsilon$
* $N$ : number of examples
* $M$ : number of examples in the minibatch $\mathcal{M}$

(in paper, $S = 1$ as long as $M \geq 100$)
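The factor $\frac{N}{M}$ makes the minibatch sum an unbiased estimate of the full sum over examples. A small sketch, using random numbers as stand-ins for the per-example gradient terms (an assumption, no model is evaluated here):

```python
import random

def minibatch_estimate(per_example_terms, batch_size, rng):
    """Estimate sum_i term_i from M examples drawn without replacement."""
    n = len(per_example_terms)
    batch = rng.sample(range(n), batch_size)    # index set of the minibatch
    return n / batch_size * sum(per_example_terms[i] for i in batch)

rng = random.Random(0)
terms = [rng.gauss(1.0, 0.1) for _ in range(1000)]   # stand-ins for per-example terms, N = 1000
full_sum = sum(terms)
estimates = [minibatch_estimate(terms, 100, rng) for _ in range(5000)]   # M = 100
print(full_sum, sum(estimates) / len(estimates))     # the two should be close
```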

VAE: combine the two ideas  
$+$ $p(x_i | z_i)$ is also approximated by a DNN

**e.g.**: $x_i$ is a binary image, $x_i \in \{0, 1\}^{d \times d}$, with pixels indexed by $k$: $p(x_i | z_i) = \prod_k \mu_{\theta, k}(z_i)^{x_{ik}} (1 - \mu_{\theta, k}(z_i))^{1 - x_{ik}}$

**e.g.**, $x_i \in \mathbb{R}^d$ with diagonal covariance: $p(x_i | z_i) = \prod_k \mathcal{N}(x_{ik} | \mu_{\theta, k}(z_i), \sigma^2_{\theta, k}(z_i))$
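Both decoder likelihoods are sums over components in log space. In the sketch below, `mu` and `var` are plain lists standing in for the decoder network outputs $\mu_\theta(z_i)$ and $\sigma^2_\theta(z_i)$; the network itself is omitted and the function names are hypothetical.

```python
import math

def bernoulli_log_lik(x, mu):
    """log p(x | z) = sum_k x_k log mu_k + (1 - x_k) log(1 - mu_k)."""
    return sum(xk * math.log(mk) + (1 - xk) * math.log(1.0 - mk)
               for xk, mk in zip(x, mu))

def diag_gaussian_log_lik(x, mu, var):
    """log p(x | z) = sum_k log N(x_k | mu_k, var_k)."""
    return sum(-0.5 * math.log(2.0 * math.pi * vk) - 0.5 * (xk - mk) ** 2 / vk
               for xk, mk, vk in zip(x, mu, var))

# Flattened binary image with three pixels and per-pixel probabilities.
print(bernoulli_log_lik([1, 0, 1], [0.9, 0.2, 0.7]))
# Real-valued example with two components.
print(diag_gaussian_log_lik([0.5, -1.0], [0.0, -1.0], [1.0, 0.25]))
```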

**e.g.** logistic regression

* $f_i = w^\top \phi(x_i)$
* $y_i = \sigma(f_i)$
* $t_i \sim Bernoulli(y_i)$
* $\log L = \sum t_i \log y_i + \sum (1 - t_i) \log (1 - y_i)$
* $\nabla_w \log L = \sum (t_i - y_i) \phi(x_i)$
* $\nabla^2_w \log L = -\sum y_i (1 - y_i) \phi(x_i) \phi(x_i)^\top = -\Phi^\top R \Phi$ with $R = diag(y_i (1 - y_i))$
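The gradient and Hessian of the logistic-regression log-likelihood can be written out directly. The Hessian is computed with its sign, i.e. as $-\sum_i y_i (1 - y_i) \phi(x_i) \phi(x_i)^\top$, so it is negative semi-definite. The tiny data set is made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_lik(w, features, targets):
    """sum_i t_i log y_i + (1 - t_i) log(1 - y_i), with y_i = sigmoid(w^T phi(x_i))."""
    total = 0.0
    for phi, t in zip(features, targets):
        y = sigmoid(sum(wk * pk for wk, pk in zip(w, phi)))
        total += t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total

def grad_log_lik(w, features, targets):
    """sum_i (t_i - y_i) phi(x_i)."""
    grad = [0.0] * len(w)
    for phi, t in zip(features, targets):
        y = sigmoid(sum(wk * pk for wk, pk in zip(w, phi)))
        for k, pk in enumerate(phi):
            grad[k] += (t - y) * pk
    return grad

def hessian_log_lik(w, features):
    """-sum_i y_i (1 - y_i) phi(x_i) phi(x_i)^T."""
    d = len(w)
    hess = [[0.0] * d for _ in range(d)]
    for phi in features:
        y = sigmoid(sum(wk * pk for wk, pk in zip(w, phi)))
        for a in range(d):
            for b in range(d):
                hess[a][b] -= y * (1.0 - y) * phi[a] * phi[b]
    return hess

features = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]   # rows are phi(x_i)
targets = [1, 0, 1]
w = [0.3, -0.7]
print(grad_log_lik(w, features, targets))
print(hessian_log_lik(w, features))
```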

* prior: $w \sim \mathcal{N}(0, \frac{1}{\alpha} I)$
* approx posterior: $q(w) = \mathcal{N}(m, V) = \mathcal{N}(m, L L^\top)$

* ELBO: $E_{q_\lambda(z)}[\log \frac{p(x, z)}{q_\lambda(z)}]$
    * $x \to t$
    * $z \to w$
    * $\lambda \to (m, L)$ (or $(m, V)$)
* BBVI: $\nabla_\lambda ELBO = E_{q_\lambda(z)}[(\nabla_\lambda \log q_\lambda(z)) (\log p(x, z) - \log q_\lambda(z))]$
    * can also add a constant to reduce variance (control variate)
* reparameterization: $ELBO = E_\epsilon[\log \frac{p(x, g_\lambda(\epsilon))}{q_\lambda(g_\lambda(\epsilon))}]$
    * reparameterize $z = g_\lambda(\epsilon)$
        * for $z \sim \mathcal{N}(\mu, V)$, $z = \mu + L \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
    * $ELBO = E_{q(w)}[\log p(w) + \log p(t|w) - \log q(w)]$  
    $= E_{q(w)}[(\frac{d}{2} \log \alpha - \frac{\alpha}{2} w^\top w) + (\sum_i t_i \log y_i + (1 - t_i) \log (1 - y_i)) - (-\frac{1}{2} \log |V| - \frac{1}{2} (w - m)^\top V^{-1} (w - m))]$ (up to additive constants)
        * $\partial_X \log |X| = (X^{-1})^\top$
        * $\partial_X \log |X^\top X| = 2 (X^{-1})^\top$
        * $\partial_A (a^\top A^{-1} b) = -(A^{-1})^\top ab^\top (A^{-1})^\top$
        * $\partial_X Tr(X X^\top B) = B X + B^\top X$
        * $\partial_X (a^\top X b) = ab^\top$

direct solution: calculate ELBO and $\nabla_m$ and $\nabla_L$ and use (S)GA

* gaussian identities given $q(w) = \mathcal{N}(w | m, V)$
    * $\nabla_m E_{q(w)}[f(w)] = E_{q(w)}[\nabla_w f(w)]$
    * $\nabla_V E_{q(w)}[f(w)] = \frac{1}{2} E_{q(w)}[\nabla_w^2 f(w)]$
* $\nabla_m = -\alpha m + \sum (t_i - E[y_i]) \phi(x_i)$
* $\nabla_V = \frac{1}{2} (-\alpha I - \Phi^\top R \Phi + V^{-1})$
    * $R = diag(E[y_i (1 - y_i)])$
    * $V \leftarrow V + \gamma \nabla_V$ (may not be PSD)  
    better to use gradient ascent on $L$ instead
* $\nabla_L ELBO = -\alpha L - \Phi^\top R \Phi L + (L^{-1})^\top$ (from $V = L L^\top$ and the chain rule, $\nabla_L = 2 (\nabla_V) L$)
    * use gradient ascent on $m, L$

BBVI approach

* $\log q(z) \to -\frac{1}{2} \log |V| - \frac{1}{2} (w - m)^\top V^{-1} (w - m)$  
$= -\frac{1}{2} \log |L L^\top| - \frac{1}{2} (w - m)^\top (L^\top)^{-1} L^{-1} (w - m)$
* $\log p(x, z) \to \frac{d}{2} \log \alpha - \frac{\alpha}{2} w^\top w + \sum_i (t_i \log y_i + (1 - t_i) \log (1 - y_i))$
* $\nabla_m \log q(z) = V^{-1} (w - m)$
* $\nabla_L \log q(z) = -((L^{-1})^\top - V^{-1} (w - m) (w - m)^\top (L^{-1})^\top)$

reparameterization approach: $w = m + L \epsilon$

* $ELBO = E_\epsilon[\frac{d}{2} \log \alpha - \frac{\alpha}{2} (m + L \epsilon)^\top (m + L \epsilon) + \frac{1}{2} \log |L L^\top| + \frac{1}{2} \epsilon^\top L^\top (L^\top)^{-1} L^{-1} L \epsilon + \sum_i t_i \log \sigma((m + L \epsilon)^\top \phi(x_i)) + \sum_i (1 - t_i) \log (1 - \sigma((m + L \epsilon)^\top \phi(x_i)))]$
* $\nabla_m ELBO = E_\epsilon[\nabla_m \cdots]$  
$= E_\epsilon[-\alpha (m + L \epsilon) + \sum_i (t_i - y_i) \phi(x_i)]$  
$= -\alpha m + \sum (t_i - E_\epsilon[y_i]) \phi(x_i)$  
which is the same as in the direct solution
* $\nabla_L ELBO = E_\epsilon[\nabla_L \cdots]$  
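$= -\alpha L + (L^{-1})^\top - \sum_i \phi(x_i) E_\epsilon[y_i \epsilon^\top]$  
not the same in form as in the direct method, but equivalent: by Stein's lemma $E_\epsilon[y_i \epsilon^\top] = E_\epsilon[y_i (1 - y_i)] \phi(x_i)^\top L$, so the sum equals $\Phi^\top R \Phi L$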
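The claimed equivalence can be checked numerically in one dimension, where $w = m + L \epsilon$ with scalar $m$, $L$ and feature $\phi$. The sketch compares $E_\epsilon[y \epsilon]$ (the form from the reparameterization approach) with $E_\epsilon[y (1 - y)] \phi L$ (the form from the direct solution). The parameter values are arbitrary assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def compare_forms(m=0.2, L=0.8, phi=1.5, num_samples=200_000, seed=0):
    """Monte Carlo estimates of E[y eps] and E[y (1 - y)] phi L for w = m + L eps."""
    rng = random.Random(seed)
    lhs = 0.0   # reparameterization form: E[y eps]
    rhs = 0.0   # direct form: E[y (1 - y)] phi L
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        y = sigmoid((m + L * eps) * phi)
        lhs += y * eps
        rhs += y * (1.0 - y) * phi * L
    return lhs / num_samples, rhs / num_samples

lhs, rhs = compare_forms()
print(lhs, rhs)  # the two estimates should agree up to Monte Carlo error
```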