# Variational Inference

### Motivation

**e.g.** (Bayesian) logistic regression

* can't obtain exact posterior
* possible solution: Laplace approximation

**e.g.** LDA

* use MCMC to sample from posterior
    * slow

**e.g.** EM algorithm

* requires calculation of expectation
    * sometimes not possible to calculate analytically

### Convex Duality

**outline**

* suppose $f(x)$ is concave
* find a linear function of $x$ that upper-bounds $f(x)$
* $f(x) = \min_\lambda \{\lambda^\top x - f^*(\lambda)\}$
    * $f^*(\lambda) = \min_x \{\lambda^\top x - f(x)\}$ (the conjugate function)
* $f(x) \leq \lambda^\top x - f^*(\lambda)$ for any $\lambda$ (by definition of minimum)

**e.g.** $f(x) = \log x$

* already know this is concave
* $f^*(\lambda) = \min_x \{\lambda x - \log x\}$
* set the derivative of the objective w.r.t. $x$ to zero: $\lambda - x^{-1} = 0$  
$\implies x = \lambda^{-1}$
* then $f^*(\lambda) = 1 + \log \lambda$
* $\implies f(x) = \min_\lambda \{\lambda x - (1 + \log \lambda)\}$
    * i.e., $f(x) \leq \lambda x - (1 + \log \lambda)$
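
The bound for $\log x$ can be checked numerically. The following is a minimal sketch using only the standard library; the function name `log_upper_bound` and the grids of test values are illustrative choices, not part of the notes.

```python
import math

def log_upper_bound(x, lam):
    """Linear upper bound on log(x) from convex duality: lam*x - (1 + log(lam))."""
    return lam * x - (1.0 + math.log(lam))

xs = [0.1, 0.5, 1.0, 2.0, 10.0]
lams = [0.05, 0.5, 1.0, 2.0, 10.0]

# The bound should hold for every pair (x, lam) with lam > 0.
bound_holds = all(
    math.log(x) <= log_upper_bound(x, lam) + 1e-12
    for x in xs
    for lam in lams
)

# The bound is tight at lam = 1/x, the stationary point derived above.
gaps_at_optimum = [log_upper_bound(x, 1.0 / x) - math.log(x) for x in xs]
```

Each fixed $\lambda$ gives one straight line above $\log x$; minimizing over $\lambda$ picks the line that touches the curve at the given $x$.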
    
**e.g.** $f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$

* not concave (or convex for that matter)
* $g(x) = \log f(x) = -\log (1 + e^{-x})$ is concave
    * "$f(x)$ is log concave"
    * can verify concavity with the second derivative  
    $g''(x) = \frac{-e^x}{(1 + e^x)^2} \leq 0$
* $g^*(\lambda) = \min_x \{\lambda x + \log (1 + e^{-x}) \}$
* set the derivative of the objective w.r.t. $x$ to zero: $\lambda - \frac{e^{-x}}{1 + e^{-x}} = 0$  
$\implies \lambda = \frac{e^{-x}}{1 + e^{-x}}$  
$\implies x = \log \frac{1 - \lambda}{\lambda}$  
$\implies 1 + e^{-x} = \frac{1}{1 - \lambda}$  
$\implies g^*(\lambda) = \lambda \log \frac{1 - \lambda}{\lambda} + \log \frac{1}{1 - \lambda}$  
$= \lambda \log (1 - \lambda) - \lambda \log \lambda - \log (1 - \lambda)$  
$= -\lambda \log \lambda - (1 - \lambda) \log (1 - \lambda)$
* this is the binary entropy function, i.e., $g^*(\lambda) = H(\lambda)$ for $\lambda \in (0, 1)$
* $g(x) \leq \lambda x - H(\lambda)$
* $f(x) = e^{g(x)} \leq e^{\lambda x - H(\lambda)}$
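
The exponential bound on the sigmoid can be checked the same way. This is a small sketch under the assumption $\lambda \in (0, 1)$; the helper names are illustrative. Solving the stationarity condition above for $\lambda$ gives $\lambda = 1 - \sigma(x)$, which is where the bound should be tight.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    """H(lam) = -lam*log(lam) - (1-lam)*log(1-lam), for 0 < lam < 1."""
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

def sigmoid_upper_bound(x, lam):
    """Exponential upper bound on sigmoid(x): exp(lam*x - H(lam))."""
    return math.exp(lam * x - binary_entropy(lam))

xs = [-4.0, -1.0, 0.0, 1.0, 4.0]
lams = [0.1, 0.3, 0.5, 0.7, 0.9]

# The bound should hold for every x and every lam in (0, 1).
bound_holds = all(
    sigmoid(x) <= sigmoid_upper_bound(x, lam) + 1e-12
    for x in xs
    for lam in lams
)

# Tight at lam = 1 - sigmoid(x), from the stationarity condition.
tight_gaps = [sigmoid_upper_bound(x, 1.0 - sigmoid(x)) - sigmoid(x) for x in xs]
```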

### Using convex duality bounds in machine learning

**local variational methods**

* likelihood function $L = f_1 f_2 \cdots$  
$\leq f_1 g_2 g_3 f_4 \cdots$ (replace some of the $f$'s with approximations)
    * not the same as the likelihood, but hopefully the bounds are close enough
    * a linear approximation is okay locally

### Global/block variational approximation

* if we can't calculate posterior $p(z | data)$, 
what is the best replacement/approximation?
* find $q(z)$ s.t. $distance(q(z), p(z | data))$ is minimized
    * need some measure of distance/divergence
    * one possibility: KL divergence $d_{KL}(p_1(z), p_2(z)) = E_1[\log \frac{p_1(z)}{p_2(z)}]$
        * KL divergence is not symmetric, so there are two options: minimize $d_{KL}(q, p)$ over $q(z)$ (variational) or minimize $d_{KL}(p, q)$ (expectation propagation, EP)
        * for this class use $\min_{q(z)} d_{KL}(q, p)$
            * other way around is more difficult since the expectation is taken under $p(z | data)$, which we don't have
            * if $p$ is multimodal, the variational approach will choose one mode
            * if $p$ is multimodal, the EP approach will average the modes (although the direct variational approach still involves $p$)
            * either way, there's a compromise for multimodal $p$
* $\log p(y) = E_{q(z)}[\log \frac{q(z)}{p(z | y)}] + E_{q(z)}[\log \frac{p(y, z)}{q(z)}]$
    * *proof*: the above rhs is  
    $E_{q(z)}[\log \frac{q(z) p(y, z)}{p(z|y) q(z)}]$  
    $= E_{q(z)}[\log \frac{p(y) p(z|y)}{p(z|y)}]$  
    $= E_{q(z)}[\log p(y)] = \log p(y)$
* then $\log p(y) = d_{KL}(q(z), p(z|y)) + L$ ($L$ is the lower bound)
    * $p(y)$ doesn't depend on $q(z)$, so we can treat it as constant
    * minimizing the KL-divergence is equivalent to maximizing $L$, the variational lower bound (also called evidence lower bound, ELBO)
* $\log p(y) \geq L$, since $d_{KL} \geq 0$
* the same bound follows from Jensen's inequality:  
$\log p(y) = \log \int p(z) p(y|z) dz$  
$= \log \int q(z) \frac{p(z)}{q(z)} p(y|z) dz$  
$\geq \int q(z) \log \frac{p(z) p(y|z)}{q(z)} dz$  
$= E_{q(z)} [\log \frac{p(y, z)}{q(z)}] = L$
* alternative and equivalent form of ELBO:  
$ELBO = E_{q(z)}[\log \frac{p(y, z)}{q(z)}]$  
$= E_{q(z)}[\log p(y|z)] + E_{q(z)}[\log \frac{p(z)}{q(z)}]$  
$= E_{q(z)}[\log p(y|z)] - d_{KL}(q(z), p(z))$
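
The identities in this section can be verified on a toy model. The sketch below assumes a discrete latent $z$ with three states and a single fixed observation $y$; all numbers (`prior`, `likelihood`, `q`) are made up for illustration.

```python
import math

prior = [0.5, 0.3, 0.2]          # p(z), illustrative values
likelihood = [0.10, 0.40, 0.70]  # p(y | z) for the one observed y, illustrative values
q = [0.2, 0.5, 0.3]              # an arbitrary approximating distribution q(z)

joint = [pz * lik for pz, lik in zip(prior, likelihood)]  # p(y, z)
evidence = sum(joint)                                     # p(y)
posterior = [j / evidence for j in joint]                 # p(z | y)

def kl(a, b):
    """KL divergence d_KL(a, b) = E_a[log(a / b)] for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

log_evidence = math.log(evidence)

# ELBO = E_q[log p(y, z) / q(z)]
elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

# Alternative form: E_q[log p(y | z)] - d_KL(q(z), p(z))
elbo_alt = sum(qz * math.log(lik) for qz, lik in zip(q, likelihood)) - kl(q, prior)

# Gap between log evidence and ELBO is d_KL(q(z), p(z | y))
gap = kl(q, posterior)
```

Because the gap is a KL divergence, it is non-negative and vanishes only when `q` equals `posterior`.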

### Connection between global and local variational methods

* global approximation can be derived from convex duality
* for this part, $z$ is discrete
* consider $\log p(y) = \log \sum_z p(y, z) = \log \sum_z e^{\log p(y, z)}$
    * let $x$ be a vector indexed by $z$ with entries $\log p(y, z)$
    * $\log p(y) = f(x) = \log \sum_z e^{\log p(y, z)} = \log \sum_z e^{x_z}$
    * $f(x)$ is convex, so convex duality gives a lower bound (with $\max$ in place of $\min$)
* $f(x) = \max_\lambda \{\lambda^\top x - f^*(\lambda)\}$  
$\geq \lambda^\top x - f^*(\lambda)$  
$= \sum_z q(z) \log p(y, z) - f^*([q_z])$
    * $\lambda$ is a vector indexed by $z$ with values $q(z) = q_z$
    * $f^*(\lambda) = \max_x \{\lambda^\top x - f(x)\}$  
    $= \max_{p(y, z)} \{\sum_z q(z) \log p(y, z) - \log p(y)\}$  
    $= \max_{p(y, z)} \{\sum_z q(z) \log p(z|y)\}$  
    $= \sum_z q(z) \log q(z)$ (the maximum is attained at $p(z|y) = q(z)$)
    * then we get $\log p(y) = f(x) \geq \sum_z q(z) \log p(y, z) - \sum_z q(z) \log q(z)$  
    $= E_{q(z)} [\log \frac{p(y, z)}{q(z)}] = ELBO$
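
A short numerical sketch of this duality view, assuming a three-state $z$ and made-up values for $\log p(y, z)$: the dual expression $\lambda^\top x - \sum_z \lambda_z \log \lambda_z$ stays below the log-sum-exp for any distribution $\lambda = q$, and matches it when $q$ is the softmax of $x$, i.e., the posterior $p(z|y)$.

```python
import math

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def dual_lower_bound(x, q):
    """q^T x - sum_z q_z log q_z, the conjugate-based lower bound on logsumexp(x)."""
    linear = sum(qz * xz for qz, xz in zip(q, x))
    neg_entropy = sum(qz * math.log(qz) for qz in q if qz > 0.0)
    return linear - neg_entropy

# x_z = log p(y, z) for an illustrative joint over three latent states
x = [math.log(0.05), math.log(0.12), math.log(0.14)]
f_x = logsumexp(x)  # equals log p(y)

q_arbitrary = [0.2, 0.5, 0.3]
q_optimal = [math.exp(v - f_x) for v in x]  # softmax(x) = p(z | y)

bound_arbitrary = dual_lower_bound(x, q_arbitrary)
bound_optimal = dual_lower_bound(x, q_optimal)
```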

### Variational EM algorithm

* $p(y, z)$ parameterized by $\theta$
* $q(z)$ parameterized by $\lambda$
* then $\log p(y | \theta) \geq L(\lambda, \theta)$  
$= E_{q(z|\lambda)}[\log \frac{p(y, z | \theta)}{q(z|\lambda)}]$

**algorithm** Variational EM

* initialize $\theta^{(0)}$, $i = 0$
* E-step
    * pick $q$, i.e., $\lambda^{(i)} = \arg\max_\lambda L(\lambda, \theta^{(i)})$
* M-step
    * pick $\theta^{(i+1)} = \arg\max_\theta L(\lambda^{(i)}, \theta)$
* set $i \leftarrow i + 1$ and repeat the E-step and M-step until $L$ converges
 
**e.g.** $q(z)$ is not restricted  
$\min d_{KL}(q(z), p(z|y)) \implies q(z) = p(z | y) = p(z | y, \theta^{(i)})$  
then this is the implicit solution of the E-step  
M-step: $\max_{\theta^{(i+1)}} E_{p(z|y, \theta^{(i)})}[\log p(y, z | \theta^{(i+1)})] - E_{p(z | y, \theta^{(i)})}[\log p(z | y, \theta^{(i)})]$  
first term is exact, second term doesn't depend on $\theta^{(i+1)}$, so this recovers the standard EM algorithm
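
As a rough illustration of the unrestricted case, the sketch below runs the E-step/M-step loop on a model chosen only for this example: a two-component 1D Gaussian mixture with unit variances, where $\theta$ is the mixing weights and means. The synthetic data, initial values, and function names are assumptions, not part of the notes. Since $q$ is set to the exact posterior, the ELBO after each E-step equals the log likelihood and should never decrease.

```python
import math
import random

def normal_pdf(y, mu):
    """Density of N(mu, 1) at y."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def e_step(ys, weights, mus):
    """Unrestricted q: set q(z_i) to the exact posterior p(z_i | y_i, theta)."""
    q = []
    for y in ys:
        joint = [w * normal_pdf(y, mu) for w, mu in zip(weights, mus)]
        total = sum(joint)
        q.append([j / total for j in joint])
    return q

def m_step(ys, q):
    """Maximize E_q[log p(y, z | theta)] over mixing weights and means."""
    k = len(q[0])
    counts = [sum(qi[j] for qi in q) for j in range(k)]
    weights = [c / len(ys) for c in counts]
    mus = [sum(qi[j] * y for qi, y in zip(q, ys)) / counts[j] for j in range(k)]
    return weights, mus

def elbo(ys, q, weights, mus):
    """L = E_q[log p(y, z | theta) - log q(z)], summed over data points."""
    total = 0.0
    for y, qi in zip(ys, q):
        for j, qij in enumerate(qi):
            if qij > 0.0:
                total += qij * (math.log(weights[j] * normal_pdf(y, mus[j])) - math.log(qij))
    return total

rng = random.Random(0)
ys = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]

weights, mus = [0.5, 0.5], [-1.0, 1.0]  # theta^(0)
history = []
for _ in range(20):
    q = e_step(ys, weights, mus)
    history.append(elbo(ys, q, weights, mus))
    weights, mus = m_step(ys, q)
```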

### Mean field approximation

* pick $q(z)$ to maximize $L = E_{q(z)}[\log \frac{p(y, z)}{q(z)}]$
* if family of $q(z)$ is not restricted, then $q(z) = p(z|y)$
* restrict family $Q = \{q(z)\}$
* how to pick $Q$?
    * restrict form of $q(z)$, e.g., $q$ can be Gaussian with different parameters (fixed form variational inference)
    * divide $z$ into independent subgroups $z_1, ..., z_K$, then $q(z) = \prod_j q(z_j)$  (**mean field approximation**)
    
**General form of the solution to the mean field approximation**

* consider discrete case for now
* again, $\log p(y) \geq L = \sum_z q(z) \log \frac{p(y, z)}{q(z)}$
* restrict $q(z) = \prod_j q(z_j)$
* then we get $L = \sum_z (\prod_a q(z_a)) \log p(y, z) - \sum_z (\prod_a q(z_a)) (\sum_l \log q (z_l))$  
$= \sum_z q(z_j) \prod_{l \neq j} q(z_l) \log p(y, z) - \sum_z (\prod_a q(z_a)) \log q(z_j) - \sum_z (\prod_a q(z_a)) \sum_{l \neq j} \log q(z_l)$ (singling out one factor $j$)
    * optimize w.r.t. $q(z_j)$
    * third term sums out $z_j$, so constant w.r.t. $q(z_j)$
    * second term: summation over $z_l$ for $l \neq j$ results in $1$, so it simplifies to $\sum_{z_j} q(z_j) \log q(z_j)$
    * first term can be rewritten as  
    $\sum_z q(z_j) \prod_{l \neq j} q(z_l) \log p(y, z)$  
    $= \sum_{z_j} q(z_j) \sum_{z_{l \neq j}} \prod_{l \neq j} q(z_l) \log p(y, z)$  
    $= \sum_{z_j} q(z_j) E_{\prod_{l \neq j} q(z_l)} [\log p(y, z)]$
* so $L = constant + \sum_{z_j} q(z_j) g(z_j, y) - \sum_{z_j} q(z_j) \log q(z_j)$
    * where $g(z_j, y) = E_{\prod_{l \neq j} q(z_l)}[\log p(y, z)]$
* rewriting again, we get $L = constant + \sum_{z_j} q(z_j) \log \frac{\exp(g(z_j, y))}{q(z_j)}$
* let $h(z_j) \propto e^{g(z_j, y)}$ be the normalized version of $e^{g(z_j, y)}$ (i.e., treat as distribution; the normalizer is absorbed into the constant), then $L = constant - d_{KL}(q(z_j), h(z_j))$
* pick $q(z_j)$ to minimize divergence $\implies q(z_j) = h(z_j)$  
$\implies q(z_j) \propto e^{g(z_j, y)}$
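
The update $q(z_j) \propto e^{g(z_j, y)}$ can be sketched as coordinate ascent on a tiny discrete model. The example below assumes two binary latent variables and a made-up joint table $p(y, z_1, z_2)$ for a fixed $y$; each update holds one factor fixed and recomputes the other, so the ELBO should be non-decreasing and stay below $\log p(y)$.

```python
import math

# p(y, z1, z2) for one fixed y; illustrative numbers, rows index z1, columns index z2
joint = [[0.18, 0.03], [0.03, 0.12]]
log_joint = [[math.log(v) for v in row] for row in joint]
log_evidence = math.log(sum(sum(row) for row in joint))

def normalize_exp(g):
    """Return the distribution proportional to exp(g), computed stably."""
    m = max(g)
    w = [math.exp(v - m) for v in g]
    s = sum(w)
    return [v / s for v in w]

def elbo(q1, q2):
    """E_q[log p(y, z) - log q(z)] with q(z) = q1(z1) q2(z2)."""
    total = 0.0
    for a, qa in enumerate(q1):
        for b, qb in enumerate(q2):
            if qa > 0.0 and qb > 0.0:
                total += qa * qb * (log_joint[a][b] - math.log(qa) - math.log(qb))
    return total

q1, q2 = [0.6, 0.4], [0.5, 0.5]  # arbitrary initialization
history = [elbo(q1, q2)]
for _ in range(20):
    # g(z1) = E_{q2}[log p(y, z1, z2)], then q1 is proportional to exp(g)
    q1 = normalize_exp([sum(q2[b] * log_joint[a][b] for b in range(2)) for a in range(2)])
    history.append(elbo(q1, q2))
    # g(z2) = E_{q1}[log p(y, z1, z2)], then q2 is proportional to exp(g)
    q2 = normalize_exp([sum(q1[a] * log_joint[a][b] for a in range(2)) for b in range(2)])
    history.append(elbo(q1, q2))
```

The final ELBO is strictly below `log_evidence` here because the true posterior over $(z_1, z_2)$ does not factorize.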

**e.g.** linear regression

* prior: 
    * $w \sim \mathcal{N}(0, \alpha^{-1} I)$
    * $\alpha \sim Gamma(a_0, b_0)$
* likelihood: $t \mid w, \beta \sim \mathcal{N}(\Phi w, \beta^{-1} I)$
* posterior: $p(w, \alpha | t)$, complicated to solve
* use approximate posterior $q(w, \alpha) = q(w) q(\alpha)$
* $\log p(\alpha, w, t) = \log Gamma(\alpha | a_0, b_0) + \log \mathcal{N}(w | 0, \alpha^{-1} I) + \log \mathcal{N}(t | \Phi w, \beta^{-1} I)$  
$= constant + (a_0 - 1) \log \alpha -b_0 \alpha + \frac{d}{2} \log \alpha - \frac{\alpha}{2} w^\top w - \frac{\beta}{2} w^\top \Phi^\top \Phi w + \beta w^\top \Phi^\top t$
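* $\log q(\alpha) = E_{q(w)}[\log p(\alpha, w, t)] + constant$ (can ignore constants w.r.t. $\alpha$)  
$= (a_0 + d / 2 - 1) \log \alpha - (b_0 + \frac{1}{2} E_{q(w)}[w^\top w]) \alpha + constant$  
$= (a_0 + d / 2 - 1) \log \alpha - (b_0 + \frac{1}{2} (||m_N||^2 + tr(S_N))) \alpha + constant$  
$\implies q(\alpha) = Gamma(a_N, b_N)$
    * $a_N = a_0 + d / 2$
    * $b_N = b_0 + \frac{1}{2} (||m_N||^2 + tr(S_N))$
    * $E_{q(w)}[w^\top w] = ||m_N||^2 + tr(S_N)$, where $m_N$ and $S_N$ are the mean and covariance of $q(w)$ (currently unknown)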
* $\log q(w) = E_{q(\alpha)}[-\frac{\alpha}{2} w^\top w - \frac{\beta}{2} w^\top \Phi^\top \Phi w + \beta w^\top \Phi^\top t] + constant$  
$= -\frac{E[\alpha]}{2} w^\top w - \frac{\beta}{2} w^\top \Phi^\top \Phi w + \beta w^\top \Phi^\top t + constant$  
$\implies q(w) = \mathcal{N}(m_N, S_N)$
    * the quadratic term gives the precision matrix $E[\alpha] I + \beta \Phi^\top \Phi$  
    $\implies S_N = (E_{q(\alpha)}[\alpha] I + \beta \Phi^\top \Phi)^{-1}$
    * $m_N = \beta S_N \Phi^\top t$
    * $E[\alpha] = \frac{a_N}{b_N} = \frac{a_0 + d / 2}{b_0 + \frac{1}{2} (||m_N||^2 + tr(S_N))}$
* so we get an alternating optimization algorithm
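
The alternating updates can be sketched in a few lines. To stay within the standard library, this example hard-codes $d = 2$ basis functions (a bias and a linear term) and inverts the $2 \times 2$ precision matrix in closed form; the synthetic data, the value of $\beta$, the hyperparameters `a0`, `b0`, and the fixed iteration count are all assumptions made for illustration.

```python
import random

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def variational_linear_regression(phi, t, beta, a0=0.01, b0=0.01, iters=50):
    """Alternate the q(w) = N(m_N, S_N) and q(alpha) = Gamma(a_N, b_N) updates."""
    dim = 2  # this sketch only handles two basis functions
    gram = [[sum(r[i] * r[j] for r in phi) for j in range(dim)] for i in range(dim)]  # Phi^T Phi
    phi_t = [sum(r[i] * ti for r, ti in zip(phi, t)) for i in range(dim)]             # Phi^T t
    a_n = a0 + dim / 2.0
    e_alpha = a0 / b0  # initial guess for E[alpha]
    for _ in range(iters):
        # q(w): S_N = (E[alpha] I + beta Phi^T Phi)^{-1}, m_N = beta S_N Phi^T t
        precision = [[e_alpha * (i == j) + beta * gram[i][j] for j in range(dim)]
                     for i in range(dim)]
        s_n = inverse_2x2(precision)
        m_n = [beta * sum(s_n[i][j] * phi_t[j] for j in range(dim)) for i in range(dim)]
        # q(alpha): b_N = b_0 + (||m_N||^2 + tr(S_N)) / 2, E[alpha] = a_N / b_N
        b_n = b0 + 0.5 * (sum(m * m for m in m_n) + s_n[0][0] + s_n[1][1])
        e_alpha = a_n / b_n
    return m_n, s_n, a_n, b_n

rng = random.Random(0)
true_w, beta = [1.0, -2.0], 25.0  # noise standard deviation is beta ** -0.5 = 0.2
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
phi = [[1.0, x] for x in xs]
t = [true_w[0] + true_w[1] * x + rng.gauss(0.0, beta ** -0.5) for x in xs]

m_n, s_n, a_n, b_n = variational_linear_regression(phi, t, beta)
```

A fixed number of iterations is used for simplicity; monitoring the change in $E[\alpha]$ or the ELBO would be a more careful stopping rule.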

### Clustering with constraints

**motivation**

* difficult clustering problem
* but we have some affinity information (e.g., whether some pairs of points belong in the same cluster or not)
* "must link" vs "do not link" constraints
* alternatively, we might have soft constraints
* penalized probabilistic clustering

**penalized probabilistic clustering**

* GMM: $L = \prod_i \prod_j (p_j p(y_i | \theta_j))^{z_{ij}}$
* want to impose "must link" or "do not link" constraints (soft constraints)
* encode preferences using weights $w_{il}$
    * $w_{il} = 0 \implies$ no constraint
    * $w_{il} > 0 \implies$ prefer to link $i$ and $l$
    * $w_{il} < 0 \implies$ prefer to put $i$ and $l$ in separate clusters
    * $w_{il}$ given a priori
* $\prod_i \prod_{l \neq i} e^{w_{il} \delta(z_i, z_l)}$  
$= \exp(\sum_i \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il})$
* $p(z) = (\prod_i \prod_j p_j^{z_{ij}})(\exp(\sum_i \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il}))$ (not normalized)
* $\Omega = \sum_z p(z)$ is the normalizing constant
* complete data likelihood $L = \Omega^{-1} \prod_i \prod_j (p_j p(y_i | \theta_j))^{z_{ij}} \exp(\sum_i \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il})$
    * $\alpha_{ij} = \log (p_j p(y_i | \theta_j))$
* $\log L = -\log \Omega + \sum_i \sum_j z_{ij} \alpha_{ij} + \sum_i \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il}$
    * $z_{ij}$ and $z_{lj}$ are correlated in the true posterior (not separable)
* mean field variational approximation $q(z) = \prod_i^N q(z_i) = \prod_i \prod_j q_{ij}^{z_{ij}}$
    * implies each $i$ has its own parameters for variational distribution
    * but those are integrated out
* E-step: compute new approximation $q(z)$
* can calculate ELBO directly and optimize
* alternatively can use general solution of mean field
* $g(z_i) = E_{\prod_{l \neq i} q(z_l)}[\log L]$  
$= constant + E[\sum_j z_{ij} \alpha_{ij}] + E[2 \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il}]$  
$= constant + \sum_j z_{ij} \alpha_{ij} + 2 \sum_{l \neq i} \sum_j z_{ij} q_{lj} w_{il}$
    * the factor $2$ appears because each pair is counted twice in $\sum_i \sum_{l \neq i}$ (assuming symmetric weights $w_{il} = w_{li}$)
    * if $z_i = j$ then $z_{ij} = 1$ and $z_{ik} = 0$ $\forall k \neq j$  
    then $g(z_i = j) = constant + \alpha_{ij} + 2 \sum_{l \neq i} q_{lj} w_{il}$  
    the last term is a sum of the $w_{il}$ constraints weighted by $q_{lj}$
    * $q_{ij} = q(z_i = j) \propto \exp(\alpha_{ij} + 2 \sum_{l \neq i} q_{lj} w_{il})$  
    $= p_j p(y_i | \theta_j) \exp(2 \sum_{l \neq i} q_{lj} w_{il})$
* then the E-step becomes:
    * init $q_{ij}$'s
    * for $i = 1, ..., N$, calculate $q_{ij}$ for $j = 1, ..., K$ using  
    $q_{ij} \propto p_j p(y_i | \theta_j) \exp(2 \sum_{l \neq i} q_{lj} w_{il})$
    * repeat the sweep until the $q_{ij}$'s converge

* alternatively, the direct derivation of the E-step:  
    * $q(z) = \prod_i q(z_i) = \prod_i \prod_j q_{ij}^{z_{ij}}$
    * ELBO $= E_{q(z)}[\log \frac{p(y, z)}{q(z)}]$  
    $= E[-\log \Omega + \sum_i \sum_j z_{ij} \alpha_{ij} + \sum_i \sum_{l \neq i} \sum_j z_{ij} z_{lj} w_{il} - \sum_i \sum_j z_{ij} \log q_{ij}]$  
    $= -\log \Omega + \sum_i \sum_j q_{ij} \alpha_{ij} + \sum_i \sum_{l \neq i} \sum_j q_{ij} q_{lj} w_{il} - \sum_i \sum_j q_{ij} \log q_{ij}$
    * optimize ELBO w.r.t. $q(z_i)$, i.e. $\{q_{ij}\}_{j=1}^K$
        * we have constraint $\sum_j q_{ij} = 1$
    * $Lagrangian = ELBO + \lambda_i (\sum_j q_{ij} - 1)$
    * set $\partial_{q_{ab}} Lagrangian = \alpha_{ab} + \sum_{l \neq a} q_{lb} w_{al} + \sum_{i \neq a} q_{ib} w_{ia} - \log q_{ab} - \frac{q_{ab}}{q_{ab}} + \lambda_a = 0$ (with symmetric weights $w_{il} = w_{li}$)  
    $\implies 0 = constant + \alpha_{ab} + 2 \sum_{l \neq a} q_{lb} w_{al} - \log q_{ab}$  
    $\implies \log q_{ab} = constant + \alpha_{ab} + 2 \sum_{l \neq a} q_{lb} w_{al}$  
    $\implies q_{ab} \propto e^{\alpha_{ab}} e^{2 \sum_{l \neq a} q_{lb} w_{al}}$  
    $= p_b p(y_a | \theta_b) e^{2 \sum_{l \neq a} q_{lb} w_{al}}$
    * then $q_{ij} \propto p_j p(y_i | \theta_j) e^{2\sum_{l \neq i} q_{lj} w_{il}}$
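
The resulting fixed-point update can be sketched as repeated sweeps over the points. In the example below, `alpha[i][j]` stands for $\alpha_{ij} = \log(p_j p(y_i | \theta_j))$ and `w` is assumed symmetric with `w[i][l]` $= w_{il}$; the three-point data set, the must-link weight, and the fixed number of sweeps are made up for illustration.

```python
import math

def e_step(alpha, w, q, sweeps=10):
    """Mean-field sweeps: q_ij proportional to exp(alpha_ij + 2 * sum_{l != i} q_lj * w_il)."""
    n, k = len(alpha), len(alpha[0])
    q = [row[:] for row in q]  # copy so the caller's initialization is untouched
    for _ in range(sweeps):
        for i in range(n):
            scores = [
                alpha[i][j] + 2.0 * sum(w[i][l] * q[l][j] for l in range(n) if l != i)
                for j in range(k)
            ]
            m = max(scores)  # subtract the max for numerical stability
            unnorm = [math.exp(s - m) for s in scores]
            total = sum(unnorm)
            q[i] = [u / total for u in unnorm]
    return q

# Point 0 clearly prefers cluster 0, point 1 clearly prefers cluster 1,
# point 2 is ambiguous on its own.
alpha = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
    [math.log(0.5), math.log(0.5)],
]
q_init = [[0.5, 0.5] for _ in range(3)]

no_constraints = [[0.0] * 3 for _ in range(3)]
must_link_0_2 = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # w_02 = w_20 > 0

q_free = e_step(alpha, no_constraints, q_init)
q_linked = e_step(alpha, must_link_0_2, q_init)
```

With no constraints, point 2 stays split evenly between the clusters; the must-link weight pulls it toward the cluster preferred by point 0.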

* M-step: maximize w.r.t. $\theta_j$'s and $p_j$'s
    * recall $ELBO = -\log \Omega + \sum_i \sum_j q_{ij} \alpha_{ij} + \sum_i \sum_{l \neq i} \sum_j q_{ij} q_{lj} w_{il} - \sum_i \sum_j q_{ij} \log q_{ij}$
    * last two terms don't contain any model parameters and can be ignored when optimizing
    * $\Omega$ and $\alpha$ contain model parameters
    * then $ELBO = -\log \sum_z p(z) + \sum_i \sum_j q_{ij} \log(p_j p(y_i | \theta_j)) + constant$
    * M-step for $\mu_j$ and $\Sigma_j$ identical to GMM except with $q_{ij}$ instead of $\gamma_{ij}$, since $\Omega$ does not depend on $\mu_j$ or $\Sigma_j$ and can be ignored
        * $\mu_j = \frac{\sum_i q_{ij} y_i}{\sum_i q_{ij}}$
        * $\Sigma_j = \frac{1}{\sum_i q_{ij}} \sum_i q_{ij} (y_i - \mu_j) (y_i - \mu_j)^\top$
    * M-step for $p_j$'s is difficult, since $\Omega$ depends on the $p_j$'s
        * relaxed/inaccurate solution assumes $\partial_{p_j} \Omega = 0$ which yields the same result as in GMM: $p_j = \frac{\sum_i q_{ij}}{N}$
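
The M-step updates above can be sketched for scalar observations, where $\Sigma_j$ reduces to a variance. The function name and the toy data are illustrative, and the mixing-weight update is the relaxed one that ignores the dependence of $\Omega$ on $p_j$.

```python
def m_step(ys, q):
    """Weighted GMM-style updates using the variational responsibilities q_ij."""
    n, k = len(ys), len(q[0])
    counts = [sum(q[i][j] for i in range(n)) for j in range(k)]
    mus = [sum(q[i][j] * ys[i] for i in range(n)) / counts[j] for j in range(k)]
    variances = [
        sum(q[i][j] * (ys[i] - mus[j]) ** 2 for i in range(n)) / counts[j]
        for j in range(k)
    ]
    # Relaxed update for the mixing weights: treats Omega as constant in p_j.
    weights = [c / n for c in counts]
    return mus, variances, weights

ys = [0.0, 1.0, 10.0, 11.0]
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # hard assignments for clarity
mus, variances, weights = m_step(ys, q)
```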