# Automating Probabilistic ML

* probabilistic programming
* specify a probabilistic model
    * $\to$ "directly" get the solution

### preliminaries for black box variational inference

* $\nabla_\lambda E_{q_\lambda(z)}[f(z)] = E_{q_{\lambda}(z)}[(\nabla_\lambda \log q_\lambda(z)) f(z)]$
    * *proof*  
    $\nabla_\lambda E_{q_\lambda(z)}[f(z)] = \nabla_\lambda \int q_\lambda(z) f(z) dz$  
    $= \int (\nabla_\lambda q_\lambda(z)) f(z) dz$  
    $= \int q_\lambda(z) (\nabla_\lambda \log q_\lambda (z)) f(z) dz$
* $E_{q_\lambda(z)} [\nabla_\lambda \log q_\lambda (z)] = 0$
    * *proof*  
    the above expression $= \int q_\lambda (z) (q_\lambda(z))^{-1} \nabla_\lambda q_\lambda (z) dz$  
    $= \int \nabla_\lambda q_\lambda (z) dz$  
    $= \nabla_\lambda \int q_\lambda (z) dz$  
    $= \nabla_\lambda (1) = 0$

* $\nabla_\lambda E_{q_\lambda(z)} [\log \frac{p(x, z)}{q_\lambda(z)}]$  
$= \nabla_\lambda \int q_\lambda(z) \log \frac{p(x, z)}{q_\lambda(z)} dz$  
$= \int (\nabla_\lambda q_\lambda(z)) \log \frac{p(x, z)}{q_\lambda(z)} dz - \int q_\lambda(z) (\nabla_\lambda \log q_\lambda(z)) dz$  
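$= E_{q_\lambda(z)}[(\nabla_\lambda \log q_\lambda(z)) \log \frac{p(x, z)}{q_\lambda(z)}]$  
(using $\nabla_\lambda q_\lambda = q_\lambda \nabla_\lambda \log q_\lambda$ in the first integral; the second integral is $E_{q_\lambda(z)}[\nabla_\lambda \log q_\lambda(z)] = 0$)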
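
The first two identities above can be checked numerically. The sketch below is only an illustration under assumed choices that are not part of the notes: $q_\lambda(z) = \mathcal{N}(z | \lambda, 1)$, so that $\nabla_\lambda \log q_\lambda(z) = z - \lambda$, and $f(z) = z^2$, for which the exact gradient is $2\lambda$. The function name `score_function_gradient` is hypothetical.

```python
import random

def score_function_gradient(lam, f, num_samples, rng):
    """Monte Carlo estimates of E_q[(grad log q) f] and E_q[grad log q]
    for the assumed family q_lam(z) = N(z | lam, 1)."""
    total_grad = 0.0
    total_score = 0.0
    for _ in range(num_samples):
        z = rng.gauss(lam, 1.0)
        score = z - lam  # d/d lam of log N(z | lam, 1)
        total_grad += score * f(z)
        total_score += score
    return total_grad / num_samples, total_score / num_samples

rng = random.Random(0)
lam = 1.5
grad_est, mean_score = score_function_gradient(lam, lambda z: z * z, 200_000, rng)

print(grad_est)    # should be close to the exact gradient 2 * lam = 3.0
print(mean_score)  # should be close to 0
```

Only samples from $q_\lambda$ and evaluations of $f$ are needed, which is what makes the estimator "black box".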

### black box variational inference

* $\nabla_\lambda E_{q_\lambda(z)}[\log \frac{p(x, z)}{q_\lambda(z)}]$  
$\approx \frac{1}{S} \sum_s^S (\nabla_\lambda \log q_\lambda(z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda (z^{(s)}))$
    * $z^{(s)} \stackrel{iid}{\sim} q_\lambda(z)$
    * optimize variational distribution with stochastic gradient descent
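    * consider $p(x, z) = \frac{1}{c} v(x, z)$ with unnormalized $v$  
    then we can ignore the normalizing constant $c$ since $\nabla_\lambda \log c = 0$  
    (its contribution to the gradient is $-\log c \, E_{q_\lambda(z)}[\nabla_\lambda \log q_\lambda(z)] = 0$)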

* the naive black box gradient estimate has high variance, so SGD is inefficient
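
A minimal sketch of the naive estimator and the SGD loop, under assumptions that are not from the notes: a toy model $p(z) = \mathcal{N}(0, 1)$, $p(x | z) = \mathcal{N}(x | z, 1)$ with a single observation $x$, and $q_\lambda(z) = \mathcal{N}(z | m, 1)$ with $\lambda = m$. For this model the exact ELBO gradient is $x - 2m$, which gives something to compare against. The helper names (`log_v`, `log_q`, `bbvi_gradient`) and the step size are illustrative choices.

```python
import math
import random

def log_v(x, z):
    # Unnormalized log joint for the assumed toy model; constants are dropped,
    # which is allowed because the normalizing constant does not affect the gradient.
    return -0.5 * z * z - 0.5 * (x - z) ** 2

def log_q(z, m):
    # log N(z | m, 1)
    return -0.5 * (z - m) ** 2 - 0.5 * math.log(2.0 * math.pi)

def bbvi_gradient(x, m, num_samples, rng):
    """Naive score-function estimate of the ELBO gradient with respect to m."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, 1.0)
        score = z - m  # d/dm of log N(z | m, 1)
        total += score * (log_v(x, z) - log_q(z, m))
    return total / num_samples

rng = random.Random(0)
x = 2.0

# Exact gradient for this toy model is x - 2 * m, i.e. 2.0 at m = 0.
grad_at_zero = bbvi_gradient(x, 0.0, 100_000, rng)

# Stochastic gradient ascent on the ELBO (descent on its negative).
m = 0.0
for _ in range(500):
    m += 0.02 * bbvi_gradient(x, m, 50, rng)

print(grad_at_zero)  # should be near 2.0
print(m)             # should be near the optimum x / 2 = 1.0
```

Even in this one-dimensional example many samples are needed for a stable gradient, which motivates the variance reduction techniques that follow.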

**rao-blackwellization**

* given an estimator that depends on multiple random variables, if we can integrate out some of them analytically,
then we get an estimator with the same mean and lower (or equal) variance
* e.g. let $\bar{J} = E_X E_Y[J(X, Y)]$  
$\hat{J}(X) = E_Y[J(X, Y) | X]$  
$E_X[\hat{J}(X)] = \bar{J}$  
then $J(X, Y)$ can be sampled to estimate $\bar{J}$  
and $\hat{J}(X)$ can be sampled to estimate $\bar{J}$

**theorem** $Var(\hat{J}(X)) \leq Var(J(X, Y))$
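
As a quick illustration of the theorem, the sketch below uses an assumed example that is not from the notes: independent $X, Y \sim \mathcal{N}(0, 1)$ and $J(X, Y) = (X + Y)^2$, so that $\hat{J}(X) = E_Y[J(X, Y) | X] = X^2 + 1$ and $\bar{J} = 2$. The helpers `mean` and `variance` are defined locally.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

rng = random.Random(0)
num_samples = 100_000

naive = []          # samples of J(X, Y) = (X + Y)^2
rao_blackwell = []  # samples of J_hat(X) = X^2 + 1, with Y integrated out analytically
for _ in range(num_samples):
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(0.0, 1.0)
    naive.append((x + y) ** 2)
    rao_blackwell.append(x * x + 1.0)

print(mean(naive), mean(rao_blackwell))          # both should be near 2
print(variance(naive), variance(rao_blackwell))  # the second should be clearly smaller
```

Both estimators target the same $\bar{J}$; integrating out $Y$ only removes the part of the variance caused by $Y$.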

**simple case** $p(x, z) = \prod_i p(z_i) p(x_i | z_i) = \prod_i p(x_i, z_i)$  
$q(z) = \prod_i q_{\lambda_i} (z_i)$ (mean field approximation)  
$\nabla_{\lambda_i} E_{q(z)} [\log \frac{p(x, z)}{q(z)}]$  
$= \nabla_{\lambda_i} E_{q(z_1), ... q(z_n)} [\sum_j \log p(x_j, z_j) - \sum_j \log q(z_j)]$  
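$= E_{q(z_1), ..., q(z_n)} [(\nabla_{\lambda_i} \log q(z_i))  (\sum_j \log p(x_j, z_j) - \sum_j \log q(z_j))]$  
$= E_{q(z_i)}[(\nabla_{\lambda_i} \log q(z_i))(\log p(x_i, z_i) - \log q(z_i))]$  
(only $q(z_i)$ depends on $\lambda_i$; terms with $j \neq i$ are independent of $z_i$ and vanish because $E_{q(z_i)}[\nabla_{\lambda_i} \log q(z_i)] = 0$)  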

**e.g.** logistic regression  
$p(w) \prod_i p(t_i | w)$  
$q(w) = \mathcal{N}(w | m, V)$  
this does not factor into independent per-variable terms, so the simplification above does not apply

**control variates**

**e.g.** $\mu = E_{p(v)}[f(v)]$  
then $\hat{\mu} = f(v^{(s)})$ with $v^{(s)} \sim p(v)$  
high variance

consider any random variable $w$ s.t. $E[w] = 0$  
use $\tilde{\mu} = \hat{\mu} - c w$  
$E[\tilde{\mu}] = E[\hat{\mu}] - c E[w] = E[\hat{\mu}]$  
want to show $Var(\tilde{\mu}) \leq Var(\hat{\mu})$ for a suitable $c$  
$Var(\tilde{\mu}) = Var(\hat{\mu}) + c^2 Var(w) - 2 c Cov(\hat{\mu}, w)$  
to minimize, $c = \frac{Cov(\hat{\mu}, w)}{Var(w)}$, and we get  
$Var(\tilde{\mu}) = Var(\hat{\mu}) - \frac{(Cov(\hat{\mu}, w))^2}{Var(w)}$  
$\leq Var(\hat{\mu})$  
this is also equal to $Var(\hat{\mu}) - Cor(\hat{\mu}, w)^2 Var(\hat{\mu})$  
$= Var(\hat{\mu}) (1 - Cor(\hat{\mu}, w)^2)$
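
The variance formula can be checked on a small example. The sketch below assumes, purely for illustration, $v \sim \mathcal{N}(0, 1)$, $f(v) = e^v$ (so $\mu = e^{1/2}$), and the zero-mean variable $w = v$; none of these choices come from the notes. The coefficient $c$ is estimated from the same samples.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

rng = random.Random(0)
num_samples = 100_000

w = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]  # zero-mean variable, E[w] = 0
mu_hat = [math.exp(wi) for wi in w]                     # plain estimates f(v) with v = w

# Sample covariance and the variance-minimizing coefficient c = Cov / Var(w).
mean_mu, mean_w = mean(mu_hat), mean(w)
cov = mean([(a - mean_mu) * (b - mean_w) for a, b in zip(mu_hat, w)])
c = cov / variance(w)

mu_tilde = [a - c * b for a, b in zip(mu_hat, w)]

var_plain = variance(mu_hat)
var_cv = variance(mu_tilde)
cor_sq = cov ** 2 / (var_plain * variance(w))

print(mean(mu_hat), mean(mu_tilde))  # both should be near exp(0.5)
print(var_plain, var_cv)             # var_cv should be smaller
print(var_plain * (1.0 - cor_sq))    # should match var_cv
```

Because $c$ is computed from the same samples, the sample version of the identity holds up to floating point error; the stronger the correlation between $\hat{\mu}$ and $w$, the larger the reduction.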

how to choose $w$ and calculate $c$?

* $w^{(s)} = \nabla_\lambda \log q_\lambda (z^{(s)})$
* $\hat{\mu}^{(s)} = (\nabla_\lambda \log q_\lambda(z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda (z^{(s)}))$  
where $z^{(s)} \sim q_\lambda$
* to calculate/estimate $c$
    * $b = Cov(\hat{\mu}, w) \approx \frac{1}{S} \sum_s^S (\hat{\mu}^{(s)} - \bar{\mu}) (w^{(s)} - 0)$
    * $\bar{\mu} = \frac{1}{S} \sum_s \hat{\mu}^{(s)}$
    * $a = Var(w) \approx \frac{1}{S} \sum_s (w^{(s)})^2$
    * then $c = b / a$

then we get $\frac{1}{S} \sum_s^S (\nabla_\lambda \log q_\lambda (z^{(s)})) (\log p(x, z^{(s)}) - \log q_\lambda(z^{(s)}) - c)$
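
The recipe above can be sketched for a scalar $\lambda$ as follows. The toy model is an assumption made only for illustration: $p(z) = \mathcal{N}(0, 1)$, $p(x | z) = \mathcal{N}(x | z, 1)$, $q_\lambda(z) = \mathcal{N}(z | m, 1)$ with $\lambda = m$, for which the exact ELBO gradient is $x - 2m$. The function name `bbvi_terms_with_cv` is hypothetical; for a vector $\lambda$ one would typically estimate a separate $c$ per component.

```python
import math
import random

def log_v(x, z):
    # Unnormalized log joint of the assumed toy model (constants dropped).
    return -0.5 * z * z - 0.5 * (x - z) ** 2

def log_q(z, m):
    # log N(z | m, 1)
    return -0.5 * (z - m) ** 2 - 0.5 * math.log(2.0 * math.pi)

def bbvi_terms_with_cv(x, m, num_samples, rng):
    """Per-sample gradient terms without and with the score-function control variate."""
    w, mu_hat = [], []
    for _ in range(num_samples):
        z = rng.gauss(m, 1.0)
        score = z - m  # w^(s) = d/dm of log q
        w.append(score)
        mu_hat.append(score * (log_v(x, z) - log_q(z, m)))
    mu_bar = sum(mu_hat) / num_samples
    b = sum((g - mu_bar) * ws for g, ws in zip(mu_hat, w)) / num_samples  # Cov(mu_hat, w)
    a = sum(ws * ws for ws in w) / num_samples                            # Var(w), using E[w] = 0
    c = b / a
    # g - c * ws equals score * (log p - log q - c)
    corrected = [g - c * ws for g, ws in zip(mu_hat, w)]
    return mu_hat, corrected, c

def mean(values):
    return sum(values) / len(values)

def variance(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

rng = random.Random(0)
naive_terms, cv_terms, c = bbvi_terms_with_cv(2.0, 0.0, 100_000, rng)
grad_naive, grad_cv = mean(naive_terms), mean(cv_terms)
var_naive, var_cv = variance(naive_terms), variance(cv_terms)

print(grad_naive, grad_cv)  # both should be near the exact gradient 2.0
print(var_naive, var_cv)    # var_cv should be smaller
```

Estimating $c$ from the same samples that are used for the gradient introduces a small bias; a common workaround is to estimate $c$ on a separate batch of samples.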