In deep learning, we usually have a set of visible variables $v$ and a set of hidden variables $h$.  
Inference usually means computing $p(h|v)$, or expectations with respect to it. These are often necessary for maximum likelihood learning.  

Many deep models (with many latent layers) have an intractable posterior distribution. This is due to interactions between latent variables.

# Inference as Optimization

Inference can be described as an optimization problem, and we can find an approximate solution to it.  
We want to compute $\log p(v;\theta)$, but it may be too costly to marginalize out $h$.  
We can compute instead the evidence lower bound (ELBO):
$$\mathcal{L}(v, \theta, q) = \log p(v;\theta) - D_\text{KL}(q(h|v)||p(h|v;\theta))$$
where $q$ is an arbitrary distribution over $h$.  
$\mathcal{L} \leq \log p(v;\theta)$ because the KL divergence is non-negative, and they are equal if $q$ is the same distribution as $p(h|v)$.  
$\mathcal{L}$ can be much easier to compute for some $q$. We can rewrite it as:
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h,v)] + H(q)$$
$$H(q) = - \mathbb{E}_{h \sim q}[\log q(h|v)]$$
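
The two forms of $\mathcal{L}$ can be checked numerically on a small discrete model. The sketch below is only an illustration: the joint table `joint`, the distribution `q`, and the helper `elbo_terms` are hypothetical and not part of the original notes.

```python
import numpy as np

# Hypothetical toy joint p(h, v): 3 hidden states (rows), 2 visible states (columns).
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])  # entries sum to 1

def elbo_terms(joint, v, q):
    """Return (log p(v), KL(q || p(h|v)), E_q[log p(h, v)] + H(q))."""
    p_v = joint[:, v].sum()                  # marginalize out h
    posterior = joint[:, v] / p_v            # exact p(h | v)
    kl = np.sum(q * (np.log(q) - np.log(posterior)))
    expected_log_joint = np.sum(q * np.log(joint[:, v]))
    entropy = -np.sum(q * np.log(q))
    return np.log(p_v), kl, expected_log_joint + entropy

q = np.array([0.5, 0.3, 0.2])                # an arbitrary q(h | v)
log_pv, kl, elbo = elbo_terms(joint, v=0, q=q)
print(log_pv, kl, elbo)                      # elbo equals log_pv - kl
```

Here marginalizing over $h$ is cheap, so both sides can be computed exactly; the bound becomes useful when `p_v` itself is too expensive to evaluate.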

We can think of inference as finding the $q$ that maximizes $\mathcal{L}$.  
Exact inference searches over a family of functions that contains $p(h|v)$. We can make inference less expensive by:
- Searching over a restricted family of $q$
- Using an imperfect optimization procedure.

# Expectation Maximization

EM is a popular algorithm for training with latent variables. We alternate between two steps until convergence:

- E-step: Set $q(h^{(i)}|v) = p(h^{(i)}|v^{(i)};\theta^{(0)})$ for all training examples, where $\theta^{(0)}$ is the value of the parameters at the start of the step. $q$ is defined in terms of $\theta^{(0)}$, so changing $\theta$ afterwards doesn't affect $q$.
- M-step: Maximize $\sum_i \mathcal{L}(v^{(i)},\theta,q)$ w.r.t. $\theta$.  

It can be seen as coordinate ascent on $\mathcal{L}$: maximizing w.r.t. $q$ on the E-step, and w.r.t. $\theta$ on the M-step.  
On the M-step, we can take one or several gradient steps.  
The M-step introduces a gap between $\mathcal{L}$ and $\log p(v)$ as $\theta$ moves further away from $\theta^{(0)}$. The E-step reduces the gap to $0$ each time.
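
As an illustration of the two steps, here is a minimal sketch of EM on a one-dimensional mixture of two Gaussians, where $h$ is the component index. The model, the function names (`em_gmm`, `log_gauss`) and the synthetic data are assumptions chosen for the example; the M-step is solved in closed form here rather than with gradient steps.

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def em_gmm(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture; h is the component index."""
    pi = np.array([0.5, 0.5])
    mean = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    log_liks = []
    for _ in range(n_iter):
        # E-step: q(h | v) = p(h | v; theta_old), the exact posterior.
        log_joint = np.log(pi) + log_gauss(x[:, None], mean, var)   # shape (N, 2)
        log_pv = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        q = np.exp(log_joint - log_pv[:, None])
        log_liks.append(log_pv.sum())        # log-likelihood under theta_old
        # M-step: maximize sum_i E_q[log p(h, v; theta)] in closed form.
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mean = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mean) ** 2).sum(axis=0) / nk
    return pi, mean, var, np.array(log_liks)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
pi, mean, var, log_liks = em_gmm(x)
print(mean, log_liks[-1])
```

Because each E-step closes the gap between $\mathcal{L}$ and $\log p(v)$, the recorded log-likelihood should never decrease from one iteration to the next.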

# MAP Inference and Sparse Coding

MAP inference computes the single most likely value of $h$:
$$h^* = \arg \max_h p(h|v)$$

MAP inference is a form of approximate inference, with $q$ required to be a Dirac distribution:
$$q(h|v) = \delta(h - \mu)$$
$q$ is entirely controlled by $\mu$. Dropping the terms of $\mathcal{L}$ that do not depend on $\mu$, we get:
$$\mu^* = \arg \max_\mu \log p(h=\mu, v)$$

This can be trained with a procedure similar to EM: the E-step uses MAP inference to infer $h^*$, and the M-step updates $\theta$ to increase $\log p(h^*, v)$.  

MAP inference is usually used for sparse coding.  
We start with a sparsity-inducing prior, such as a factorial Laplace:
$$p(h_i) = \frac{\lambda}{2} \exp (-\lambda |h_i|)$$

The visible units are generated by applying a linear transformation and adding Gaussian noise:
$$p(v|h) = \mathcal{N}(v; Wh + b, \beta^{-1}I)$$

Computing $p(h|v)$ is difficult because the latent variables are connected, and the sparse prior makes these interactions non-Gaussian.  
We use MAP inference and learn the parameters by maximizing the ELBO.  
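
A worked step makes the link between MAP inference and an $L^1$ penalty explicit. Writing the visible vector as $v$, and using the Laplace prior and Gaussian likelihood above, the negative log joint is:
$$-\log p(h, v) = \lambda \sum_i |h_i| + \frac{\beta}{2} \| v - Wh - b \|_2^2 + \text{const}$$
where the constant collects the normalizers that do not depend on $h$. MAP inference therefore solves an $L^1$-regularized least-squares problem:
$$h^* = \arg \min_h \lambda \| h \|_1 + \frac{\beta}{2} \| v - Wh - b \|_2^2$$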

Let $H$ be the matrix of all hidden units (one row per example), and $V$ the matrix of all visible units. The sparse coding process minimizes the following criterion:
$$J(H,W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} (V - HW^T)^2_{ij}$$

We can minimize $J$ by alternating between minimization w.r.t. $H$ (a convex problem) and w.r.t. $W$ (a linear regression).
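
The alternating scheme can be sketched as follows. This is one possible implementation under stated assumptions, not a reference one: the $H$ update uses ISTA (proximal gradient with soft-thresholding), which is only one of several solvers for this convex problem, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def criterion(V, H, W):
    """J(H, W) = sum |H_ij| + sum (V - H W^T)_ij^2."""
    return np.abs(H).sum() + ((V - H @ W.T) ** 2).sum()

def update_H(V, H, W, n_steps=50):
    """ISTA steps on H with W fixed (convex in H)."""
    # Lipschitz constant of the gradient of the squared-error term.
    L = 2.0 * np.linalg.norm(W, 2) ** 2 + 1e-12
    for _ in range(n_steps):
        grad = -2.0 * (V - H @ W.T) @ W
        Z = H - grad / L
        H = np.sign(Z) * np.maximum(np.abs(Z) - 1.0 / L, 0.0)   # soft-threshold
    return H

def update_W(V, H):
    """Least-squares fit of V ~ H W^T with H fixed (a linear regression)."""
    Wt, *_ = np.linalg.lstsq(H, V, rcond=None)
    return Wt.T

rng = np.random.default_rng(0)
N, n, m = 100, 8, 4
H_true = rng.laplace(size=(N, m)) * (rng.random((N, m)) < 0.3)   # sparse codes
V = H_true @ rng.normal(size=(n, m)).T + 0.01 * rng.normal(size=(N, n))

H, W = np.zeros((N, m)), rng.normal(size=(n, m))
history = [criterion(V, H, W)]
for _ in range(10):
    H = update_H(V, H, W)
    history.append(criterion(V, H, W))
    W = update_W(V, H)
    history.append(criterion(V, H, W))
print(history[0], history[-1])
```

Each half-step can only lower $J$. In practice a norm constraint on the columns of $W$ is usually added, because otherwise the $L^1$ term can be shrunk by scaling $H$ down and $W$ up; the sketch omits it to match the criterion as written above.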

# Variational Inference and Learning

We can maximize $\mathcal{L}$ over a restricted family of distributions $q$. This family should be chosen so that $\mathbb{E}_q \log p(h,v)$ is easy to compute. We can do this by adding assumptions about how $q$ factorizes. For example, the mean field approach requires $q$ to be factorial:
$$q(h|v) = \prod_i q(h_i|v)$$
More generally, we can impose any graphical model structure on $q$; this approach is called structured variational inference.  

We specify how $q$ factorizes, and then determine the optimal distribution within that factorization.  
For discrete $h$, we usually store $q$ in a table and optimize it with classical optimization techniques.  
For continuous $h$, we use calculus of variations to perform optimization over a space of functions.  

We can think of maximizing $\mathcal{L}$ w.r.t. $q$ as minimizing $D_\text{KL}(q(h|v)||p(h|v))$, which fits $q$ to $p$. This is the opposite direction from maximum likelihood learning, which minimizes $D_\text{KL}(p_\text{data}||p_\text{model})$ and so encourages the model to have high probability everywhere the data has high probability. Here we instead encourage $q$ to have low probability everywhere the posterior has low probability. We choose this direction because the computations involve taking expectations over $q$, which is simple because of how $q$ is chosen.
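
The effect of the KL direction can be seen on a toy problem. The sketch below assumes a hypothetical bimodal "posterior" on a one-dimensional grid and fits a single Gaussian $q$ by brute-force search over its mean and standard deviation; none of these choices come from the original notes.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 801)

def log_normalize(log_w):
    """Turn unnormalized log-weights on the grid into log-probabilities."""
    return log_w - np.logaddexp.reduce(log_w)

def log_gauss_grid(mean, std):
    return log_normalize(-0.5 * ((x - mean) / std) ** 2)

# Hypothetical bimodal target: equal mixture of two narrow Gaussians at -3 and +3.
log_p = log_normalize(np.logaddexp(-0.5 * ((x + 3) / 0.5) ** 2,
                                   -0.5 * ((x - 3) / 0.5) ** 2))

def kl(log_a, log_b):
    """KL(a || b) for two distributions on the grid, given in log space."""
    return np.sum(np.exp(log_a) * (log_a - log_b))

candidates = [(m, s) for m in np.linspace(-4, 4, 33)
              for s in np.linspace(0.3, 4.0, 38)]
reverse = min(candidates, key=lambda c: kl(log_gauss_grid(*c), log_p))  # KL(q || p)
forward = min(candidates, key=lambda c: kl(log_p, log_gauss_grid(*c)))  # KL(p || q)
print("KL(q||p) fit (mean, std):", reverse)
print("KL(p||q) fit (mean, std):", forward)
```

The $D_\text{KL}(q||p)$ fit locks onto a single mode with a small standard deviation, avoiding the low-probability region between the modes, while the $D_\text{KL}(p||q)$ fit spreads over both modes.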

## Discrete Latent Variables

Usually $q$ is defined as a lookup table. For example, if $h$ is binary, and we use the mean field assumption, we only need a vector $\hat{h}$: $q(h_i = 1 | v) = \hat{h}_i$.  
We need to optimize $q$ using basic optimization techniques. But it should be fast, because the optimization is done in the inner loop of training. A popular choice is to solve fixed-point equations:
$$\frac{\partial}{\partial \hat{h}_i} \mathcal{L} = 0$$

We iteratively update all $\hat{h}_i$ until we satisfy a convergence criterion.  

### Example: Binary sparse coding

The input $v \in \mathbb{R}^n$ is generated by adding Gaussian noise to the sum of $m$ different components. Each component is switched on or off by the corresponding hidden unit $h \in \{ 0, 1 \} ^m$.
$$p(h_i = 1) = \sigma(b_i)$$
$$p(v|h) = \mathcal{N}(v; Wh, \beta^{-1})$$
with $b$ a set of biases, $W$ a weight matrix, and $\beta$ a diagonal precision matrix.  

Let's suppose we want to train this model with maximum likelihood. This requires the derivative w.r.t. the parameters, for example for a bias:
$$\frac{\partial}{\partial b_i} \log p(v) = \mathbb{E}_{h \sim p(h|v)} \frac{\partial}{\partial b_i} \log p(h)$$

This requires an expectation w.r.t. $p(h|v)$, which is intractable.  

We can solve the problem using variational inference, with a mean field assumption:
$$q(h|v) = \prod_i q(h_i|v)$$

We represent $q$ using a vector of probabilities: $q(h_i = 1 | v) = \hat{h}_i$. We add the restriction that $\hat{h}_i$ is never $0$ or $1$; this is done using an unrestricted vector $z$ and setting $\hat{h} = \sigma(z)$.  
This gives us a tractable $\mathcal{L}$.  

It is possible to run gradient ascent on $\mathcal{L}$ w.r.t. both $\theta$ and $\hat{h}$, but:
- We need to store a $\hat{h}$ for each $v$ (doesn't scale to big data)
- We need to extract $\hat{h}$ very quickly from $v$.  

So we don't run gradient ascent on $\hat{h}$; instead we rapidly estimate a local maximum by solving the fixed-point equations:
$$\nabla_{\hat{h}} \mathcal{L}(v, \theta, \hat{h}) = 0$$
We iterate over the units, solving for one $\hat{h}_i$ at a time, until convergence:
$$\frac{\partial}{\partial \hat{h}_i}\mathcal{L}(v, \theta, \hat{h}) = 0$$

It can be more advantageous to update several units at once. Some models like RBMs are structured in a way that allows block updates, but sparse coding is not.  

Another technique is damping: we solve for the individual optimal values, and move all units a little in that direction.
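
The per-unit fixed-point iteration for binary sparse coding can be sketched as below. The update rule is obtained by setting $\partial \mathcal{L} / \partial \hat{h}_i = 0$ for the model defined above, which gives $\hat{h}_i = \sigma\big(b_i + v^\top \beta W_{:,i} - \tfrac{1}{2} W_{:,i}^\top \beta W_{:,i} - \sum_{j \neq i} W_{:,j}^\top \beta W_{:,i} \hat{h}_j\big)$. Everything else is an assumption made for illustration: the function names, the toy sizes, storing the diagonal of $\beta$ as a vector, and the brute-force `exact_log_pv` used only to check the bound on a tiny model.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prior(h, b):
    # Valid for binary h and, by linearity, for the mean-field expectation h_hat.
    return np.sum(h * np.log(sigmoid(b)) + (1 - h) * np.log(sigmoid(-b)))

def elbo(v, W, b, beta, h_hat):
    """ELBO under a factorial q; beta holds the diagonal precisions."""
    A = W.T @ (beta[:, None] * W)                        # W^T beta W
    # E_q[h^T A h]: uses E[h_i h_j] = h_hat_i h_hat_j (i != j) and E[h_i^2] = h_hat_i.
    quad = h_hat @ A @ h_hat + np.sum(np.diag(A) * h_hat * (1 - h_hat))
    log_lik = 0.5 * np.sum(np.log(beta / (2 * np.pi))) \
        - 0.5 * (v @ (beta * v) - 2 * v @ (beta * (W @ h_hat)) + quad)
    entropy = -np.sum(h_hat * np.log(h_hat) + (1 - h_hat) * np.log(1 - h_hat))
    return log_lik + log_prior(h_hat, b) + entropy

def mean_field(v, W, b, beta, n_sweeps=20):
    A = W.T @ (beta[:, None] * W)
    h_hat = np.full(W.shape[1], 0.5)
    trace = [elbo(v, W, b, beta, h_hat)]
    for _ in range(n_sweeps):
        for i in range(len(h_hat)):                      # one unit at a time
            others = A[i] @ h_hat - A[i, i] * h_hat[i]
            z = b[i] + v @ (beta * W[:, i]) - 0.5 * A[i, i] - others
            h_hat[i] = np.clip(sigmoid(z), 1e-9, 1 - 1e-9)   # keep away from 0 and 1
        trace.append(elbo(v, W, b, beta, h_hat))
    return h_hat, np.array(trace)

def exact_log_pv(v, W, b, beta):
    """Brute-force log p(v) by summing over all 2^m values of h (tiny m only)."""
    logs = []
    for bits in product([0, 1], repeat=W.shape[1]):
        h = np.array(bits, dtype=float)
        r = v - W @ h
        log_lik = 0.5 * np.sum(np.log(beta / (2 * np.pi))) - 0.5 * r @ (beta * r)
        logs.append(log_lik + log_prior(h, b))
    return np.logaddexp.reduce(logs)

rng = np.random.default_rng(0)
n, m = 5, 4
W, b = rng.normal(size=(n, m)), rng.normal(size=m)
beta = np.full(n, 2.0)
v = W @ np.array([1.0, 0.0, 1.0, 0.0]) + rng.normal(size=n) / np.sqrt(beta)
h_hat, trace = mean_field(v, W, b, beta)
log_pv = exact_log_pv(v, W, b, beta)
print(h_hat, trace[-1], log_pv)
```

Since $\mathcal{L}$ is concave in each $\hat{h}_i$ taken alone, every single-unit update can only increase it, and the final value stays below the exact $\log p(v)$.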

## Calculus of Variations

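Many ML algorithms minimize a function $J(\theta)$ by finding $\theta \in \mathbb{R}^n$ for which $J$ takes on its minimal value.  
A functional $J[f]$ is a function of a function $f$. Just as we can take partial derivatives of a function w.r.t. the elements of its vector argument, we can take functional derivatives (also called variational derivatives) of a functional $J[f]$ w.r.t. individual values of the function $f(x)$ at any specific value of $x$.  
The functional derivative of $J$ w.r.t. the value of $f$ at point $x$ is denoted: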
$$\frac{\delta}{\delta f(x)} J$$

We can optimize a functional by solving for the function where the functional derivative at every point is $0$.

### Example

Find the probability distribution over $x \in \mathbb{R}$ that has maximal differential entropy. We want to maximize:
$$H[p] = - \mathbb{E}_x \log p(x)$$

We need Lagrange multipliers to:
- Make sure $p(x)$ integrates to $1$.
- Fix the variance $\sigma^2$, otherwise the entropy increases without bound.
- Fix the mean $\mu$, otherwise shifting the distribution yields another one with the same entropy.

The Lagrangian functional is:
$$\mathcal{L}[p] = \lambda_1(\int p(x)dx - 1) + \lambda_2 (\mathbb{E}[x] - \mu) + \lambda_3 (\mathbb{E}[(x - \mu)^2] - \sigma^2) + H[p]$$

We need to find $p$ such that:
$$\forall x, \frac{\delta}{\delta p(x)} \mathcal{L} = 0$$
We obtain:
$$p(x) = \exp (\lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1)$$
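
The intermediate step can be spelled out. With $H[p] = -\int p(x) \log p(x) dx$ and the expectations written as integrals against $p(x)$, each term of the Lagrangian is an integral over $x$, so the functional derivative is the derivative of the integrand w.r.t. $p(x)$:
$$\frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x)$$
Setting this to $0$ for every $x$ and solving for $p(x)$ gives the exponential form above.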

We also need to choose the $\lambda$ values to satisfy the constraints. For example, we may set $\lambda_1 = 1 - \log \sigma \sqrt{2 \pi}$, $\lambda_2 = 0$, $\lambda_3 = - \frac{1}{2\sigma^2}$. We obtain:
$$p(x) = \mathcal{N}(\mu, \sigma^2)$$

This is one reason to use a Gaussian when we do not know the true distribution: among distributions with a given variance, the Gaussian has maximum entropy, so it imposes the least possible amount of structure.  
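
As a quick sanity check, we can compare the closed-form differential entropies of three distributions with the same variance. The choice of comparison distributions (uniform and Laplace) and the value of `sigma` are arbitrary assumptions for the example.

```python
import numpy as np

def entropy_gaussian(sigma):
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

def entropy_uniform(sigma):
    width = sigma * np.sqrt(12.0)     # a uniform of this width has variance sigma^2
    return np.log(width)

def entropy_laplace(sigma):
    scale = sigma / np.sqrt(2.0)      # a Laplace with this scale has variance sigma^2
    return 1.0 + np.log(2.0 * scale)

sigma = 1.5
entropies = {
    "gaussian": entropy_gaussian(sigma),
    "laplace": entropy_laplace(sigma),
    "uniform": entropy_uniform(sigma),
}
for name, value in entropies.items():
    print(f"{name:9s} {value:.4f}")
```

For any fixed `sigma` the Gaussian comes out on top, consistent with the derivation above.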

There are no critical points (functions) corresponding to minimum entropy. As a distribution places more probability on the two points $x = \mu \pm \sigma$, and less probability on all other values, it loses entropy while keeping the desired variance. There is no single minimizing function, just as there is no minimal positive real number.  
One solution would be to place zero mass on all but two points, using a mixture of Dirac distributions, but this cannot be described by a single probability density function, so it cannot be found by our optimization method.

## Continuous Latent Variables

We need to use calculus of variations when maximizing $\mathcal{L}$ w.r.t. $q(h|v)$.  

Usually, we don't need to solve a calculus of variations problem ourselves; we just use a general equation. For example, for the mean field approximation:
$$q(h|v) = \prod_i q(h_i|v)$$
The optimal unnormalized distribution for each factor is:
$$\tilde{q}(h_i|v) = \exp (\mathbb{E}_{h_{-i} \sim q(h_{-i}|v)} \log \tilde{p}(v,h))$$

We only need to use calculus of variations to develop new forms of variational learning.  

This is a fixed-point equation: we can apply it repeatedly for every value of $i$ until convergence to get $q(h|v)$.  
But it also tells us the functional form that the optimal solution will take, whether or not we use the fixed-point equations to arrive there.  
The usual technique is to take that functional form, regard some of its values as parameters, and learn them with the usual optimization algorithms.
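
A small worked case, assumed here purely for illustration: if the posterior over two continuous latent variables is a correlated Gaussian with mean `mu` and precision matrix `Lam`, the mean field equation above yields Gaussian factors $q(h_i|v)$ with variance $1/\Lambda_{ii}$ and a mean that depends on the other factor's mean. Iterating that update is a fixed-point iteration on the two means.

```python
import numpy as np

# Hypothetical target posterior p(h | v) = N(mu, inv(Lam)) over two latent variables.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 1.5]])          # precision matrix (positive definite)

# Mean-field factors q(h_i | v) = N(m_i, 1 / Lam_ii); only the means need iterating.
m = np.zeros(2)
for _ in range(100):
    for i in range(2):
        j = 1 - i
        # Optimal factor mean given the current mean of the other factor.
        m[i] = mu[i] - Lam[i, j] * (m[j] - mu[j]) / Lam[i, i]

q_var = 1.0 / np.diag(Lam)                   # variances of the mean-field factors
true_var = np.diag(np.linalg.inv(Lam))       # true marginal variances
print("means:", m, "q variances:", q_var, "true variances:", true_var)
```

The iteration recovers the true means, while the factor variances come out smaller than the true marginal variances, which is the usual behavior of a factorial $q$ fitted with $D_\text{KL}(q||p)$.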

## Interactions between Learning and Inference

Training using approximate inference tends to adapt the model in a way that makes the approximating assumptions more true.  

Variational learning increases $\mathbb{E}_{h \sim q} \log p(v, h)$. This behavior causes our approximating assumptions to become self-fulfilling. Training a model with a unimodal approximate posterior yields a model whose true posterior is far closer to unimodal than it would be with exact inference.  

It's difficult to estimate the amount of harm caused by a variational approximation. After training, we can estimate $\log p(v;\theta)$ and find the gap with $\mathcal{L}$ for a specific value of $\theta$. A small gap doesn't mean that the approximation is accurate for other values of $\theta$, or that we found a good $\theta$.  
We would need to know $\theta^* = \arg \max_\theta \log p(v;\theta)$.  
If $\max_q \mathcal{L}(v, \theta^*, q) \ll \log p(v; \theta^*)$, it might be that the posterior under $\theta^*$ is too complex to be captured by our $q$ family, and learning will never approach $\theta^*$.

# Learned Approximate Inference

We can think of the optimization process as a function $f$ that maps an input $v$ to an approximate distribution $q^* = \arg \max_q \mathcal{L}(v, q)$.  
We can approximate this function using a neural network $\hat{f}(v, \theta)$.

## Wake-Sleep

When training to infer $h$ from $v$ we don't have a supervised training set.  
The wake-sleep algorithm draws samples of both $h$ and $v$ from the model.  
In a directed model, we can use ancestral sampling to go from $h$ to $v$, and then learn the reverse mapping from $v$ to $h$. The drawback is that this approach only learns on inputs that have high probability according to the model, not according to the training data.
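
The idea of learning the reverse mapping from model samples can be sketched on a linear-Gaussian directed model, chosen here only because its exact posterior mean is known and can be used for comparison. This covers the sampling-and-fitting part only, not the full wake-sleep algorithm; the "inference network" is a plain linear map fitted by least squares, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 2, 5, 0.5
W = rng.normal(size=(n, m))               # hypothetical generative weights

# Ancestral sampling: draw h from the prior N(0, I), then v given h.
N = 20000
H = rng.normal(size=(N, m))
V = H @ W.T + sigma * rng.normal(size=(N, n))

# Learn the reverse mapping v -> h from the sampled pairs.
# Here the inference model is just a linear map R fitted by least squares.
Rt, *_ = np.linalg.lstsq(V, H, rcond=None)
R = Rt.T                                   # predicted h is R @ v

# For this model the exact posterior mean is E[h | v] = (W^T W + sigma^2 I)^{-1} W^T v.
R_exact = np.linalg.solve(W.T @ W + sigma ** 2 * np.eye(m), W.T)
print(np.max(np.abs(R - R_exact)))
```

The fitted map approaches the exact posterior mean, but only because every training pair was drawn from the model itself, which is exactly the limitation noted above.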

## Other Forms of Learned Inference

A single pass through a learned inference network yields faster inference than iterating the mean field fixed-point equations. It is trained by running the network, improving its estimate with one step of mean field, then updating the network to output this refined estimate instead of its original one.  

Learned approximate inference is often used for generative modeling, as in the variational autoencoder. The inference network is used to define $\mathcal{L}$, and the parameters of the network are adapted to increase $\mathcal{L}$.