In [None]:
%%HTML
<!-- Improve display on a projector -->
<style>
.rendered_html {font-size: 1.2em; line-height: 150%;}
div.prompt {min-width: 0ex; padding: 0px;}
.container {width:95% !important;}
</style>

In [None]:
%matplotlib notebook
%autosave 0
import numpy as np
import matplotlib.pyplot as plt
import torch
import pyro
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

In [None]:
import torchvision
path_to_datasets = '/home/phuijse/datasets'
mnist_train_data = torchvision.datasets.MNIST(path_to_datasets, train=True, download=True,
                                              transform=torchvision.transforms.ToTensor())
mnist_test_data = torchvision.datasets.MNIST(path_to_datasets, train=False, download=True,
                                             transform=torchvision.transforms.ToTensor())

fig, ax = plt.subplots(1, 10, figsize=(6, 1), tight_layout=True)
for i in range(10):
    image, label = mnist_train_data[i]
    ax[i].imshow(image.numpy()[0, :, :], cmap=plt.cm.Greys_r)
    ax[i].axis('off')
    ax[i].set_title(label)

$$
p(x_1, x_2) = p(x_2|x_1) p(x_1)
$$

$$
p(x_1|z)p(x_2|z)
$$

# Latent Variable Models (LVM)


Let's say we want to model a dataset $X = (x_1, x_2, \ldots, x_N)$ with $x_i \in \mathbb{R}^D$ 

> We are looking for $p(x)$

Each sample has D attributes

> These are the **observed variables** (visible space)

To model the data we have to propose dependency relationships between attributes

> Modeling correlation is difficult

One alternative is to assume that what we observe is correlated due to *hidden causes*

> These are the **latent variables** (hidden space)

Models with latent variables are called **Latent Variable Models** (LVM)

Then we get the marginal using

$$
\begin{align}
p(x) &= \int_z p(x, z) \,dz \nonumber \\
&= \int_z p(x|z) p(z) \,dz \nonumber
\end{align}
$$

Did we gain anything? 

> The integral can be hard to solve (in some cases it is tractable)

The answer is YES

> We can propose simple $p(x|z)$ and $p(z)$ and get complex $p(x)$

If the integral is intractable we try approximate inference
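
As a minimal sketch of the marginalization above, assume a one-dimensional linear-Gaussian model (the values of `w`, `b` and `s` are illustrative, not taken from the lecture). The integral is approximated by averaging $p(x|z)$ over samples from $p(z)$ and compared with the closed form that exists in this special case.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def marginal_mc(x, w, b, s, n_samples=200_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)] with z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return gaussian_pdf(x, w * z + b, s ** 2).mean()

w, b, s = 2.0, 1.0, 0.5   # illustrative parameters of p(x|z) = N(w z + b, s^2)
x = 0.3
p_mc = marginal_mc(x, w, b, s)
p_exact = gaussian_pdf(x, b, w ** 2 + s ** 2)   # N(x | b, w^2 + s^2)
print(p_mc, p_exact)
```

When $p(x|z)$ is not linear-Gaussian no such closed form is available in general, and the naive sample average quickly becomes inefficient as the dimension of $z$ grows.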

# Principal Component Analysis (PCA)

PCA is an algorithm to reduce the dimensionality of continuous data

Let's say we have $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{N \times D}$ 

In classical PCA we 

1. Compute the covariance matrix of the centered data, $C = \frac{1}{N} X^T X$
1. Solve the eigenvalue problem $(C - \lambda I)W = 0$

This comes from 

$$
\max_W \text{Tr}(W^T C W), \text{s.t.} ~ W^T W = I
$$

> PCA finds an **orthogonal transformation** $W$ that **maximizes the variance** of the projected data $XW$

Then we can keep only the columns of $W$ associated with the largest eigenvalues to reduce the dimensionality of $XW$
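
The two steps can be sketched with NumPy on synthetic data (the mixing matrix `A` and the sizes are arbitrary choices for illustration). The sketch checks numerically that the eigenvectors are orthonormal and that the variance of each projected coordinate equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data (illustrative): N=500 samples, D=3 attributes
A = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.2, 0.1]])
X = rng.standard_normal((500, 3)) @ A.T
X = X - X.mean(axis=0)                  # center the data

C = X.T @ X / X.shape[0]                # covariance matrix (D x D)
eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # re-sort in decreasing order
eigvals, W = eigvals[order], eigvecs[:, order]

K = 2
Z = X @ W[:, :K]                        # projected data (N x K)
print(Z.var(axis=0), eigvals[:K])
```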


### Example: Classical PCA for MNIST using pytorch

Implementation using Singular Value Decomposition (SVD)

In [None]:
class PCA:
    def __init__(self, data, K=2):
        self.data_mean = torch.mean(data, dim=0)
        data_centered = data - self.data_mean.expand_as(data)
        U, S, V = torch.svd(data_centered.T)
        # S is sorted in decreasing order
        self.W = U[:, :K]
    
    def encode(self, x):
        return torch.mm(x - self.data_mean.expand_as(x), self.W)

    def decode(self, z):
        return self.data_mean + torch.mm(z, self.W.T)

Project data and plot the reduced space

In [None]:
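# Flatten the test images to (N, 784) and scale the pixel values to [0, 1]
test_data = mnist_test_data.data.reshape(-1, 28*28).float() / 255.
pca = PCA(test_data, K=2)
Z = pca.encode(test_data)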

fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
for digit in range(10):
    mask = mnist_test_data.targets == digit
    # One scatter call per digit: colors come from the default color cycle
    ax.scatter(Z[mask, 0].detach().numpy(), Z[mask, 1].detach().numpy(), 
               s=5, alpha=0.5, label=str(digit))
plt.legend()
ax.set_xlabel('PC 1'); ax.set_ylabel('PC 2');

The two most important principal components

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(4, 1.5), tight_layout=True)
for i in range(2):
    ax[i].imshow(pca.W[:, i].reshape(28, 28).detach().numpy())
    ax[i].axis('off')
    ax[i].set_title('PC %d' %(i+1))

Plot some reconstructions

In [None]:
fig, ax = plt.subplots(2, 10, figsize=(8, 2), tight_layout=True)
reconstructions = pca.decode(Z[:10, :]).reshape(-1, 28, 28).detach().numpy()
for i in range(10):
    ax[0, i].imshow(test_data[i, :].reshape(28, 28).detach().numpy(), cmap=plt.cm.Greys_r)
    ax[0, i].axis('off')
    ax[1, i].imshow(reconstructions[i], cmap=plt.cm.Greys_r)
    ax[1, i].axis('off')

## Probabilistic interpretation for PCA

We can give a probabilistic interpretation to PCA as an LVM

An observed sample $x_i \in \mathbb{R}^D$ is modeled as 

$$
x_i = W z_i + B + \epsilon
$$

> The observed variable is related to the latent variable via a **linear mapping**

where 
- $B \in \mathbb{R}^D$ is the mean of $X$
- $W \in \mathbb{R}^{D\times K}$ is a linear transformation matrix
- $\epsilon$ is noise

> $z_i \in  \mathbb{R}^K$ is a continuous latent variable with $K<D$

#### Assumption: The noise is independent and Gaussian distributed with variance $\sigma^2$

Then

$$
p(x_i | z_i) = \mathcal{N}(B + W z_i, I \sigma^2)
$$

Note: In general factor analysis the noise has a diagonal covariance

#### Assumption: The latent variable has a standard Gaussian prior

$$
p(z_i) = \mathcal{N}(0, I)
$$


#### Marginal likelihood

The Gaussian is conjugate to itself (the convolution of Gaussians is Gaussian)
$$
\begin{align}
p(x) &= \int p(x|z) p(z) \,dz \nonumber \\
&= \mathcal{N}(x|B, W W^T + I\sigma^2 ) \nonumber
\end{align}
$$

> We have parametrized a normal with full covariance from two normals with diagonal covariance

The parameters are calculated from 
- $\mathbb{E}[x] = W\mathbb{E}[z] + B + \mathbb{E}[\epsilon] = B$
- $\mathbb{E}[(Wz + \epsilon)(Wz + \epsilon)^T] = W \mathbb{E}[zz^T] W^T + \mathbb{E}[\epsilon \epsilon^T] = W W^T + I\sigma^2$
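
A quick numerical check of these moments, as a sketch with arbitrary illustrative values for `W`, `B` and `sigma`: we sample from the generative model and compare the empirical mean and covariance with $B$ and the $D \times D$ matrix $W W^T + I\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 4, 2, 200_000            # illustrative sizes
W = rng.standard_normal((D, K))
B = rng.standard_normal(D)
sigma = 0.3

z = rng.standard_normal((N, K))              # z_i ~ N(0, I)
eps = sigma * rng.standard_normal((N, D))    # isotropic Gaussian noise
x = z @ W.T + B + eps                        # x_i = W z_i + B + eps

cov_empirical = np.cov(x, rowvar=False)
cov_model = W @ W.T + sigma ** 2 * np.eye(D)
print(np.abs(x.mean(axis=0) - B).max(), np.abs(cov_empirical - cov_model).max())
```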

#### Posterior

Using Bayes we can obtain the posterior to go from observed to latent

$$
p(z|x) = \mathcal{N}(z|M^{-1}W^T(x-B), \sigma^2 M^{-1} )
$$

where

$$
M = W^T W + I\sigma^2
$$
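
The posterior parameters can be cross-checked against generic Gaussian conditioning on the joint of $(z, x)$. The sketch below uses random illustrative values and verifies that the mean $M^{-1}W^T(x-B)$ and the covariance $\sigma^2 M^{-1}$ agree with the conditioning formulas, while only requiring the inversion of a $K \times K$ matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 5, 2                                    # illustrative sizes
W = rng.standard_normal((D, K))
B = rng.standard_normal(D)
sigma2 = 0.25
x = rng.standard_normal(D)

M = W.T @ W + sigma2 * np.eye(K)               # K x K matrix
post_mean = np.linalg.solve(M, W.T @ (x - B))  # M^{-1} W^T (x - B)
post_cov = sigma2 * np.linalg.inv(M)           # sigma^2 M^{-1}

# Cross-check: condition the joint Gaussian of (z, x) on x
S = W @ W.T + sigma2 * np.eye(D)               # marginal covariance of x (D x D)
cond_mean = W.T @ np.linalg.solve(S, x - B)
cond_cov = np.eye(K) - W.T @ np.linalg.solve(S, W)
print(np.allclose(post_mean, cond_mean), np.allclose(post_cov, cond_cov))
```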

#### Training

We fit the model to find $W$, $B$ and $\sigma$ by maximizing the marginal likelihood

$$
\max \log L(W, B, \sigma^2) = \sum_{i=1}^N \log p(x_i)
$$

From here we can take derivatives and obtain closed-form solutions for the parameters

> Solution for $W$ is equivalent to conventional PCA ($\sigma^2 \to 0$)

> Now we have an estimate of $\sigma$, error bars for $z$, and a generative model
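
A sketch of the closed-form maximum likelihood solution, following the standard result for probabilistic PCA: $\sigma^2$ is the average of the discarded eigenvalues and the columns of $W$ are the leading eigenvectors scaled by $\sqrt{\lambda_j - \sigma^2}$. The function name `ppca_ml` and the synthetic data are assumptions for illustration, and the arbitrary rotation of the latent space is fixed to the identity.

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form maximum likelihood PPCA (latent rotation fixed to identity)."""
    B = X.mean(axis=0)
    C = np.cov(X - B, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                 # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[K:].mean()                       # average discarded variance
    W = eigvecs[:, :K] * np.sqrt(eigvals[:K] - sigma2)
    return W, B, sigma2

rng = np.random.default_rng(0)
D, K, N = 6, 2, 50_000                                # illustrative sizes
W_true = rng.standard_normal((D, K))
X = rng.standard_normal((N, K)) @ W_true.T + 0.5 * rng.standard_normal((N, D))
W, B, sigma2 = ppca_ml(X, K)
print(sigma2)   # should be close to the true noise variance 0.5**2
```

Note that $W$ is only identified up to a rotation, so the comparison with `W_true` should be done through $W W^T$.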


## Self-study
- Barber, Chapter 21 and Murphy, Chapter 12
- Model with categorical latent variables: **Gaussian Mixture Model** (INFO337)


# Autoencoders

Autoencoders are deep neural networks for dimensionality reduction

![nn.svg](attachment:nn.svg)


#### Architecture
- Input and output have the same dimensionality
- Code (bottleneck) has smaller dimensionality than input/output
- **Encoder:** Neural net that maps input to code

$$
z = g_\phi(x)
$$

- **Decoder:** Neural net that maps code to output

$$
\hat x = f_\theta(z)
$$

- Model is trained by matching the input with the output (data as targets) with MSE (or cross-entropy)

$$
\hat \theta, \hat \phi = \text{arg} \min_{\phi, \theta} \| x - f_\theta(g_\phi(x)) \|^2
$$

> **Probabilistic interpretation:** Maximum likelihood with spherical Gaussian (or Bernoulli) likelihood

Typically an L2 regularizer on $\theta$ and $\phi$ is used

> **Probabilistic interpretation:** Spherical Gaussian prior on the parameters
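
To make the Bernoulli case concrete, the sketch below checks that the negative Bernoulli log-likelihood, with probabilities given by a sigmoid of the decoder output, coincides with the cross-entropy written directly on the logits. The helper names and the toy inputs are hypothetical, and the second function is one common numerically stable way of writing this loss, not necessarily how any particular library implements it.

```python
import numpy as np

def bernoulli_nll(x, logits):
    """-log Bernoulli(x | p) with p = sigmoid(logits), summed over dimensions."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

def bce_with_logits(x, logits):
    """Cross-entropy written on the logits: softplus(l) - x * l, in stable form."""
    return np.sum(np.maximum(logits, 0.0) - logits * x
                  + np.log1p(np.exp(-np.abs(logits))))

rng = np.random.default_rng(0)
x = rng.uniform(size=16)                   # toy pixel intensities in [0, 1]
logits = rng.normal(scale=2.0, size=16)    # toy decoder outputs
print(bernoulli_nll(x, logits), bce_with_logits(x, logits))
```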




### Example: Autoencoder for MNIST in pytorch

One module for the encoder and one for the decoder

Two hidden layers each

In [None]:
class Decoder(torch.nn.Module):
    def __init__(self, latent_dim, output_dim=28*28, hidden_dim=128):
        super(Decoder, self).__init__()
        self.hidden1 = torch.nn.Linear(latent_dim, hidden_dim)
        self.hidden2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.output = torch.nn.Linear(hidden_dim, output_dim)
        self.activation = torch.nn.Softplus()

    def forward(self, z):
        h = self.activation(self.hidden1(z))
        h = self.activation(self.hidden2(h))
        return self.output(h)

class Encoder(torch.nn.Module):
    def __init__(self, latent_dim, input_dim=28*28, hidden_dim=128):
        super(Encoder, self).__init__()
        self.hidden1 = torch.nn.Linear(input_dim, hidden_dim)
        self.hidden2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.code = torch.nn.Linear(hidden_dim, latent_dim)
        self.activation = torch.nn.Softplus()

    def forward(self, x):
        h = self.activation(self.hidden1(x))
        h = self.activation(self.hidden2(h))
        return (self.code(h))
    
class AutoEncoder(torch.nn.Module):
    def __init__(self, latent_dim, hidden_dim=128):
        super(AutoEncoder, self).__init__() 
        self.encoder = Encoder(latent_dim, hidden_dim=hidden_dim)
        self.decoder = Decoder(latent_dim, hidden_dim=hidden_dim)
        
    def forward(self, x):
        return self.decoder(self.encoder(x))

Prepare datasets and dataloaders

In [None]:
from torch.utils.data import DataLoader, SubsetRandomSampler

np.random.seed(0)
idx = list(range(len(mnist_train_data)))
np.random.shuffle(idx)
split = int(0.7*len(idx))

train_loader = DataLoader(mnist_train_data, batch_size=128, drop_last=True,
                          sampler=SubsetRandomSampler(idx[:split]))

valid_loader = DataLoader(mnist_train_data, batch_size=128, drop_last=True,
                          sampler=SubsetRandomSampler(idx[split:]))

test_loader = DataLoader(mnist_test_data, batch_size=1024, drop_last=False, shuffle=False)

Train the autoencoder

In [None]:
model = AutoEncoder(latent_dim=2)
criterion = torch.nn.BCEWithLogitsLoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

use_gpu = torch.cuda.is_available()  # fall back to CPU when no GPU is present
if use_gpu:
    model = model.cuda()

fig, ax = plt.subplots()
for nepoch in tqdm_notebook(range(10)):
    # Plot latent space on the fly
    Z = torch.tensor([], device='cuda') if use_gpu else torch.tensor([], device='cpu')
    for x, label in test_loader:
        if use_gpu:
            x = x.cuda()
        Z = torch.cat((Z, model.encoder(x.reshape(-1, 28*28))))
    Z = Z.detach().cpu().numpy()
    ax.cla()
    for digit in range(10):
        mask = mnist_test_data.targets == digit
        # One scatter call per digit: colors come from the default color cycle
        ax.scatter(Z[mask, 0], Z[mask, 1], 
                   s=5, alpha=0.5, label=str(digit))
    plt.legend()
    fig.canvas.draw()
    # Actual training
    epoch_loss = 0.0
    for x, label in train_loader:
        if use_gpu:
            x = x.cuda()
        optimizer.zero_grad()
        hatx = model.forward(x.reshape(-1, 28*28))
        loss = criterion(hatx, x.reshape(-1, 28*28))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print("%d %f" %(nepoch, epoch_loss))

Inspect reconstructions

In [None]:
output_activation = torch.nn.Sigmoid()
fig, ax = plt.subplots(2, 10, figsize=(8, 2), tight_layout=True)
# hatx and x come from the last training batch and may live on the GPU
reconstructions = output_activation(hatx).reshape(-1, 28, 28).detach().cpu().numpy()
for i in range(10):
    ax[0, i].imshow(x.detach().cpu().numpy()[i+10, 0, :, :], cmap=plt.cm.Greys_r)
    ax[0, i].axis('off')
    ax[1, i].imshow(reconstructions[i+10], cmap=plt.cm.Greys_r)
    ax[1, i].axis('off')

# VI for LVM

The LVM is defined by the joint density between observed $x$ and latent variables $z$

$$
p(x, z) = \prod_i p(x_i|z_i) p(z_i)
$$

If we follow the **PCA recipe** (linear mapping, Gaussian likelihood and Gaussian prior) we obtain an analytical posterior

If we use a more complex (non-linear) mapping the posterior and evidence may not be tractable

> In such case, we can use **VI**

We propose an approximate posterior $q_\nu(z|x)$ and maximize the ELBO

$$
\begin{align}
\log p(x) \geq \mathcal{L}(\nu) &= \mathbb{E}_{z\sim q_\nu(z|x)} \left[\log \frac{p(x, z)}{q_\nu(z|x)}\right] \nonumber \\
&=- \int q_\nu(z|x) \log \frac{q_\nu(z|x)}{p(x, z)} dz \nonumber 
\end{align} 
$$

to find the best parameters $\hat \nu$
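
A small numerical illustration of the bound, assuming a one-dimensional linear-Gaussian model where everything is available in closed form (the values of `w`, `s2` and `x` are illustrative). The ELBO of a Gaussian $q$ is written as the expected log-likelihood minus the KL divergence to the prior, which is algebraically the same quantity as above. The bound is tight when $q$ is the exact posterior and strictly lower otherwise.

```python
import numpy as np

w, s2, x = 1.5, 0.4, 2.0     # illustrative model: p(x|z) = N(w z, s2), p(z) = N(0, 1)

def elbo(m, v):
    """ELBO for q(z) = N(m, v): E_q[log p(x|z)] - KL(q || p(z))."""
    expected_loglik = (-0.5 * np.log(2 * np.pi * s2)
                       - ((x - w * m) ** 2 + w ** 2 * v) / (2 * s2))
    kl = 0.5 * (m ** 2 + v - np.log(v) - 1.0)
    return expected_loglik - kl

# Exact log evidence: p(x) = N(x | 0, w^2 + s2)
log_evidence = -0.5 * np.log(2 * np.pi * (w ** 2 + s2)) - 0.5 * x ** 2 / (w ** 2 + s2)

v_post = 1.0 / (1.0 + w ** 2 / s2)      # exact posterior variance
m_post = v_post * w * x / s2            # exact posterior mean
print(elbo(0.0, 1.0), elbo(m_post, v_post), log_evidence)
```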

# Variational Autoencoder (VAE)

The Variational Autoencoder (VAE) is an LVM where **deep neural networks** are used to model the **conditional distributions** between latent $z$ and observed $x$ variables

It was proposed simultaneously by [(Kingma and Welling, ICLR, Dec. 2013)](https://arxiv.org/pdf/1312.6114.pdf) and [(Rezende *et al.*, ICML, Jan. 2014)](https://arxiv.org/abs/1401.4082), perhaps sparking the revived interest in **Deep Learning plus Approximate Bayesian Inference** that we see today


The difference with a regular autoencoder is that the latent (code) is now a stochastic variable
- a prior distribution is placed on $z$: $p(z)$
- a neural network is used to model the likelihood: $p_\theta(x | z)$
- a neural network is used to model the approximate posterior: $q_\phi(z|x)$
- The parameters of the networks $\theta$ and $\phi$ are deterministic, *i.e.* this is not a "fully" Bayesian neural net

> Variational Inference is used to obtain the posterior and point estimates of the global parameters

In what follows we will review the assumptions, training and the key contributions of this work to the field of Bayesian Neural Networks



#### VAE Assumption 1: 

The latent variable has a standard Gaussian prior

$$
p(z_i) = \mathcal{N}(0, I)
$$

#### VAE Assumption 2: 

The approximate posterior is a Factorized (diagonal) Gaussian

$$
q_\phi(z_i|x_i) = \mathcal{N}(\mu_i, \text{diag}(\sigma_i^2))
$$

**Problem:** The number of variational parameters scales with $N$, which is impractical for large datasets

> **Solution:** Local variational parameters are replaced by a function

This is called **amortization**

For example

$$
\mu_i, \sigma_i = g_\phi(x_i)
$$

where $g_\phi(\cdot)$ is the **encoder network**

#### VAE Assumption 3: 

The likelihood is chosen depending on the data

- Continuous data: Factorized (diagonal) Gaussian
$$
\mu_i, \sigma_i = f_\theta(z_i)
$$
- Binary data: Bernoulli
$$
p_i = f_\theta(z_i)
$$

where $f_\theta(\cdot)$ is the **decoder network**

# Writing a VAE in Pyro

First we will create a "dual-headed" encoder

The encoder outputs the parameters of the factorized Gaussian associated with the latent variable

In [None]:
class EncoderDual(torch.nn.Module):
    def __init__(self, latent_dim, input_dim=28*28, hidden_dim=128):
        super(EncoderDual, self).__init__()
        self.hidden1 = torch.nn.Linear(input_dim, hidden_dim)
        self.hidden2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.z_loc = torch.nn.Linear(hidden_dim, latent_dim)
        self.z_scale = torch.nn.Linear(hidden_dim, latent_dim)
        self.activation = torch.nn.Softplus()

    def forward(self, x):
        h = self.activation(self.hidden1(x))
        h = self.activation(self.hidden2(h))
        return self.z_loc(h), torch.exp(self.z_scale(h))
    
class Decoder(torch.nn.Module):
    def __init__(self, latent_dim, output_dim=28*28, hidden_dim=128):
        super(Decoder, self).__init__()
        self.hidden1 = torch.nn.Linear(latent_dim, hidden_dim)
        self.hidden2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.output = torch.nn.Linear(hidden_dim, output_dim)
        self.activation = torch.nn.Softplus()

    def forward(self, z):
        h = self.activation(self.hidden1(z))
        h = self.activation(self.hidden2(h))
        return self.output(h)

The generative process in the model 

>For each $i=1,2,\ldots, N$
- Sample: $z_i \sim \mathcal{N}(0, I)$ # Prior
- Compute: $p_i = f_\theta(z_i)$ # Decoder
- Sample: $x_i \sim \text{Bernoulli}(p_i)$ # Likelihood

And for the guide

>For each $i=1,2,\ldots, N$
- Compute: $\mu_i, \sigma_i = g_\phi(x_i)$ # Encoder
- Sample: $z_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ # Approximate posterior


Note that we are not using `pyro.param` in the guide
> Instead of having $2N$ variational parameters we **amortize** with the encoder

In [None]:
from pyro.distributions import Bernoulli, Normal

class VariationalAutoEncoder(torch.nn.Module):
    
    def __init__(self, latent_dim, hidden_dim=128):
        super(VariationalAutoEncoder, self).__init__() 
        self.encoder = EncoderDual(latent_dim, hidden_dim=hidden_dim)
        self.decoder = Decoder(latent_dim, hidden_dim=hidden_dim)
        self.latent_dim = latent_dim
        
    def model(self, x):
        pyro.module("decoder", self.decoder)
        with pyro.plate("data", size=x.shape[0]):
            # p(z)
            z_loc = torch.zeros(x.shape[0], self.latent_dim, device=x.device)
            z_scale = torch.ones(x.shape[0], self.latent_dim, device=x.device)
            z = pyro.sample("latent", Normal(z_loc, z_scale).to_event(1))
            # p(x|z)
            p_logits = self.decoder.forward(z)
            pyro.sample("observed", Bernoulli(logits=p_logits, validate_args=False).to_event(1), 
                        obs=x.reshape(-1, 28*28))
    
    def guide(self, x):
        pyro.module("encoder", self.encoder)
        with pyro.plate("data", size=x.shape[0]):
            # q(z|x)
            z_loc, z_scale  = self.encoder.forward(x.reshape(-1, 28*28))
            pyro.sample("latent", Normal(z_loc, z_scale).to_event(1))

We train the VAE using SVI with the `Trace_ELBO` loss

In [None]:
pyro.enable_validation(True) # BUG?
pyro.clear_param_store()

vae = VariationalAutoEncoder(latent_dim=2)

use_gpu = torch.cuda.is_available()  # fall back to CPU when no GPU is present
if use_gpu:
    vae = vae.cuda()
    
svi = pyro.infer.SVI(model=vae.model, 
                     guide=vae.guide, 
                     optim=pyro.optim.Adam({"lr": 1e-2}), 
                     loss=pyro.infer.Trace_ELBO())

fig, ax = plt.subplots()
for nepoch in tqdm_notebook(range(10)):
    # Plot latent space on the fly
    Z = torch.tensor([], device='cuda') if use_gpu else torch.tensor([], device='cpu')
    for x, label in test_loader:
        if use_gpu:
            x = x.cuda()
        Z = torch.cat((Z, torch.cat((vae.encoder(x.reshape(-1, 28*28))), dim=1)), dim=0)
    Z = Z.detach().cpu().numpy()
    ax.cla()
    for digit in range(10):
        mask = mnist_test_data.targets == digit
        ax.errorbar(x=Z[mask, 0], y=Z[mask, 1], 
                    xerr=Z[mask, 2], yerr=Z[mask, 3],
                    fmt='none', alpha=0.5, label=str(digit))
    plt.legend()
    fig.canvas.draw()
    
    # Actual training
    epoch_loss = 0.0
    for x, label in train_loader:
        if use_gpu:
            x = x.cuda()
        epoch_loss += svi.step(x)
    print("%d %f" %(nepoch, epoch_loss))

Reconstructions

In [None]:
if use_gpu:
    vae = vae.cpu()
    
output_activation = torch.nn.Sigmoid()
fig, ax = plt.subplots(4, 10, figsize=(8, 4), tight_layout=True)

x, label = next(iter(train_loader))
# Encode the batch once, then decode the mean code and one sampled code
z_loc, z_scale = vae.encoder.forward(x.reshape(-1, 28*28))
z = Normal(z_loc, z_scale).rsample()
reconstructions_mean = output_activation(vae.decoder(z_loc)).reshape(-1, 28, 28).detach().numpy()
reconstructions = output_activation(vae.decoder(z)).reshape(-1, 28, 28).detach().numpy()
for i in range(10):
    # Row 0: input, row 1: mean reconstruction, row 2: sampled reconstruction
    ax[0, i].imshow(x.detach().numpy()[i, 0, :, :], cmap=plt.cm.Greys_r)
    ax[0, i].axis('off')
    ax[1, i].imshow(reconstructions_mean[i], cmap=plt.cm.Greys_r)
    ax[1, i].axis('off')
    ax[2, i].imshow(reconstructions[i], cmap=plt.cm.Greys_r)
    ax[2, i].axis('off')
    # Row 3: difference between the mean and the sampled reconstruction
    ax[3, i].imshow(reconstructions_mean[i] - reconstructions[i], cmap=plt.cm.RdBu_r, 
                    vmin=-0.05, vmax=0.05)
    ax[3, i].axis('off')

Sampling

In [None]:
M = 30
z_plot = np.linspace(-3, 3, num=M)
big_imag = np.zeros(shape=(28*M, 28*M))

for i in range(M):
    for j in range(M):
        z = torch.tensor(np.array([z_plot[j], z_plot[M-1-i]]), dtype=torch.float32)
        xhat = output_activation(vae.decoder.forward(z)).reshape(28, 28).detach().numpy()
        big_imag[i*28:(i+1)*28, j*28:(j+1)*28] = xhat

fig, ax = plt.subplots(figsize=(9, 9), tight_layout=True)
Z_plot1, Z_plot2 = np.meshgrid(z_plot, z_plot)
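# Use the same latent range for the image grid, the histogram and the contour
z_min, z_max = z_plot[0], z_plot[-1]
ax.matshow(big_imag, vmin=0.0, vmax=1.0, cmap=plt.cm.gray, extent=[z_min, z_max, z_min, z_max])
H, xedge, yedge = np.histogram2d(Z[:, 0], Z[:, 1], bins=M, range=[[z_min, z_max], [z_min, z_max]])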
ax.contour(Z_plot1, Z_plot2, H.T, linewidths=3, levels=[1], cmap=plt.cm.Reds);

# Details on the VAE training

The ELBO in this case is 

$$
\begin{align}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{z\sim q_\phi(z|x)} \left [\log p_\theta(x|z) p(z) - \log q_\phi(z|x) \right ] ,
\end{align}
$$

the VAE is trained by maximizing the ELBO via gradient ascent (equivalently, gradient descent on the negative ELBO)

$$
\theta_{t+1} = \theta_{t} + \eta \nabla_\theta \mathcal{L}(\theta_{t}, \phi_{t})
$$

$$
\phi_{t+1} = \phi_{t} + \eta \nabla_\phi \mathcal{L}(\theta_{t}, \phi_{t})
$$

> We need the derivatives of the ELBO with respect to $\theta$ and $\phi$

### The derivative with respect to $\theta$ 

We can ignore the terms not dependent on $\theta$ 

$$
\nabla_\theta \mathcal{L}(\theta, \phi)  = \nabla_\theta \mathbb{E}_{z\sim q_\phi(z|x)}\left [\log p_\theta(x|z)\right ] = \mathbb{E}_{z\sim q_\phi(z|x)} \left [\nabla_\theta \log  p_\theta(x|z)\right ] 
$$

If we can sample from $q_\phi(z|x)$ then we can "Monte-Carlo approximate" the expected value 

$$
\nabla_\theta \mathcal{L}(\theta, \phi) \approx \frac{1}{S} \sum_{s=1}^S \nabla_\theta \log p_\theta(x|z^{(s)})
$$

### The derivative with respect to $\phi$ 

Let's consider a general function $f(z)$ that depends on $z$, e.g. $\log p_\theta(x|z) p(z)$ and $\log q_\phi(z|x)$

$$
\nabla_\phi  \mathcal{L}(\theta, \phi) =  \nabla_\phi \mathbb{E}_{z\sim q_\phi(z|x)}\left [f(z) \right ] = \nabla_\phi \int q_\phi(z|x) f(z) dz
$$

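The gradient acts on the distribution we sample from, so it cannot simply be moved inside the expectation and estimated by Monte-Carlo sampling. How do we solve this?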

##### Traditional solution: [REINFORCE](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) aka score function (SF)

Using the identity $\nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi\log q_\phi(z)$, and assuming that $f(z)$ does not depend on $\phi$, the gradient can be rewritten as an expectation under $q_\phi(z|x)$

This way we can again rely on Monte-Carlo sampling of $q_\phi(z|x)$

$$
\nabla_\phi \mathbb{E}_{z\sim q_\phi(z|x)}\left [f(z)\right ] = \mathbb{E}_{z\sim q_\phi(z|x)}\left [ f(z) \nabla_\phi \log q_\phi(z|x) \right ] \approx \frac{1}{S} \sum_{s=1}^S f(z^{(s)}) \nabla_\phi \log q_\phi(z^{(s)}|x)
$$

The estimator is unbiased but has high variance; in most cases it is not usable in practice
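
A toy sketch of the score function estimator, assuming $q = \mathcal{N}(\mu, \sigma^2)$ in one dimension and the integrand $f(z) = z^2$ (both chosen only for illustration). Here $\mathbb{E}_q[f] = \mu^2 + \sigma^2$, so the exact gradient with respect to $\mu$ is $2\mu$; the sample mean recovers it, but the spread of the individual estimates is large.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                 # illustrative q(z) = N(mu, sigma^2)
f = lambda z: z ** 2                 # toy integrand; E_q[f] = mu^2 + sigma^2

z = mu + sigma * rng.standard_normal(1_000_000)
score = (z - mu) / sigma ** 2        # d/dmu log N(z | mu, sigma^2)
sf_samples = f(z) * score            # one gradient estimate per sample

print(sf_samples.mean())             # compare with the exact gradient 2 * mu
print(sf_samples.std())              # per-sample spread of the estimator
```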

##### Key contribution in VAE: **Reparameterization trick**

The latent is $z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$ so

$$
z = g(\phi, \epsilon) = \mu_\phi + \epsilon \sigma_\phi, \quad \epsilon \sim \mathcal{N}(0, I)
$$

Instead of sampling $z$ we sample $\epsilon$ and apply $g$ to obtain $z$, so

$$
\mathbb{E}_{z\sim q_\phi(z|x)}\left [f(z) \right ] =  \mathbb{E}_{\epsilon\sim \mathcal{N}(0, I)}\left [  f(g(\phi, \epsilon))  \right ] 
$$

Now the distribution under the expectation does not depend on $\phi$, so the gradient can be moved inside

$$
\nabla_\phi \mathbb{E}_{\epsilon\sim \mathcal{N}(0, I)}\left [  f(g(\phi, \epsilon))  \right ] = \mathbb{E}_{\epsilon\sim \mathcal{N}(0, I)}\left [  f'(g(\phi, \epsilon)) \nabla_\phi g(\phi, \epsilon) \right ] 
$$

This estimator is unbiased and has a [much lower variance than REINFORCE](https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb)

And we can do Monte-Carlo sampling

$$
\nabla_\phi \mathbb{E}_{z\sim q_\phi(z|x)}\left [f(z)\right ] \approx \frac{1}{S} \sum_{s=1}^S f'(g(\phi, \epsilon^{(s)})) \nabla_\phi g(\phi, \epsilon^{(s)}) 
$$

We only require that $z = g(\phi, \epsilon)$ and that $\nabla_\phi g$ exists
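
A toy sketch of the reparameterized estimator, assuming $q = \mathcal{N}(\mu, \sigma^2)$ in one dimension and $f(z) = z^2$ (illustrative choices). With $z = \mu + \sigma\epsilon$ we have $\partial g / \partial \mu = 1$ and $\partial g / \partial \sigma = \epsilon$, and the exact gradients of $\mathbb{E}_q[f] = \mu^2 + \sigma^2$ are $2\mu$ and $2\sigma$. For this example the per-sample standard deviation of the $\mu$ gradient is $2\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                     # illustrative variational parameters
f_prime = lambda z: 2.0 * z              # derivative of the toy integrand f(z) = z^2

eps = rng.standard_normal(1_000_000)     # noise that does not depend on phi
z = mu + sigma * eps                     # z = g(phi, eps)
grad_mu_samples = f_prime(z) * 1.0       # chain rule with dg/dmu = 1
grad_sigma_samples = f_prime(z) * eps    # chain rule with dg/dsigma = eps

print(grad_mu_samples.mean(), grad_sigma_samples.mean())
print(grad_mu_samples.std())
```

In automatic differentiation frameworks the same effect is obtained by sampling with a reparameterized sampler, such as the `rsample()` call used in the reconstruction cell, and backpropagating through the sample.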

#### Beyond the reparameterization trick: [Pathwise derivatives](https://arxiv.org/pdf/1806.01851.pdf)

### (Once again) more attention on the ELBO

We can write the ELBO as
$$
\begin{align}
\mathcal{L}(\theta, \phi) &= \mathbb{E}_{z\sim q_\phi(z|x)} \left [\log p_\theta(x|z) p(z) - \log q_\phi(z|x) \right ] \nonumber \\
&= \mathbb{E}_{z\sim q_\phi(z|x)} \left [\log p_\theta(x|z) \right ] - D_{KL}\left[ q_\phi(z|x) || p(z) \right]\nonumber
\end{align}
$$

Hence maximizing the ELBO means

> Maximize the log likelihood when sampling from the approximate posterior: **Minimize Reconstruction error**

> Minimize the divergence between the approximate posterior and prior: **Regularization**



### Further reducing the variance by using closed-form terms


The RHS term in the ELBO is the KL divergence between two multivariate Gaussian distributions which has a closed [form](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions)

If we consider the assumptions in this case we have

$$
D_\text{KL}\left[q_\phi(z|x) || p(z) \right] = \frac{1}{2}\sum_{j=1}^K \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
$$

where $K$ is the dimensionality of the latent variable

> The derivatives are straightforward and the variance is low
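
As a sanity check of the closed-form expression, the sketch below compares it with a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The values of `mu` and `sigma` stand in for the encoder outputs of a single sample and are purely illustrative.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)] in closed form."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

def log_normal(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((z - mu) / sigma) ** 2, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])           # illustrative encoder outputs (K = 2)
sigma = np.array([0.8, 1.5])

z = mu + sigma * rng.standard_normal((200_000, 2))   # samples from q
kl_mc = np.mean(log_normal(z, mu, sigma) - log_normal(z, np.zeros(2), np.ones(2)))
kl_exact = kl_diag_gaussian_to_standard(mu, sigma)
print(kl_mc, kl_exact)
```

The closed form needs no samples at all, which removes one source of noise from the ELBO gradient.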

### Pyro notes:

- [`TraceMeanField_ELBO`](https://docs.pyro.ai/en/stable/inference_algos.html#pyro.infer.trace_mean_field_elbo.TraceMeanField_ELBO) assumes reparameterized latent variables in the guide and uses the analytical KL when available
- [More on Pyro's variance reduction](https://pyro.ai/examples/svi_part_iii.html)