# Variational Autoencoders (VAEs)

## The neural net perspective

In neural net language, a variational autoencoder consists of an encoder, a decoder, and a loss function.

![](./images/1.png)

The *encoder* is a neural network. Its input is a datapoint $x$, its output is a hidden representation $z$, and it has weights and biases $\theta$. To be concrete, let’s say $x$ is a 28 by 28-pixel photo of a handwritten number. The encoder ‘encodes’ the $784$-dimensional data into a latent (hidden) representation space $z$ with far fewer than $784$ dimensions. This is typically referred to as a ‘bottleneck’ because the encoder must learn an efficient compression of the data into this lower-dimensional space. Let’s denote the encoder $q_\theta (z \mid x)$. We note that the lower-dimensional space is stochastic: the encoder outputs the parameters of $q_\theta (z \mid x)$, which is a Gaussian probability density. We can sample from this distribution to get noisy values of the representation $z$.

The *decoder* is another neural net. Its input is the representation $z$, its output is the parameters of the probability distribution of the data, and it has weights and biases $\phi$. The decoder is denoted by $p_\phi(x\mid z)$. Running with the handwritten digit example, let’s say the photos are black and white and represent each pixel as $0$ or $1$. The probability distribution of a single pixel can then be represented using a Bernoulli distribution. The decoder gets as input the latent representation of a digit $z$ and outputs $784$ Bernoulli parameters, one for each of the $784$ pixels in the image. The decoder ‘decodes’ the real-valued numbers in $z$ into $784$ real-valued numbers between $0$ and $1$. Information from the original $784$-dimensional vector cannot be perfectly transmitted, because the decoder only has access to a summary of the information (in the form of a less-than-$784$-dimensional vector $z$). How much information is lost? We measure this using the reconstruction log-likelihood $\log p_\phi (x\mid z)$, whose units are nats. This measure tells us how effectively the decoder has learned to reconstruct an input image $x$ given its latent representation $z$.
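
To make the reconstruction log-likelihood concrete, here is a minimal NumPy sketch. It assumes a hypothetical 4-pixel binary "image" instead of $784$ pixels, and the function name and the two candidate decoder outputs are made up for illustration.

```python
import numpy as np

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """Sum over pixels of log p(x_j | z) for binary pixels x_j in {0, 1}."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return float(np.sum(x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs)))

# Hypothetical 4-pixel image and two candidate decoder outputs.
x = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.9, 0.1])  # probability mass on the true pixels
bad = np.array([0.1, 0.9, 0.1, 0.9])   # probability mass on the wrong pixels

ll_good = bernoulli_log_likelihood(x, good)
ll_bad = bernoulli_log_likelihood(x, bad)
print(ll_good)  # about -0.42 nats
print(ll_bad)   # about -9.21 nats
```

The log-likelihood is never positive here, and it moves toward zero as the decoder places more probability on the pixels that are actually on or off.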

The *loss function* of the variational autoencoder is the negative log-likelihood with a regularizer. Because there are no global representations that are shared by all datapoints, we can decompose the loss function into terms $l_i$ that each depend on only a single datapoint. The total loss is then $\sum_{i=1}^N l_i$ for $N$ total datapoints. The loss function $l_i$ for datapoint $x_i$ is:

$$l_i(\theta, \phi) = - \mathbb{E}_{z\sim q_\theta(z\mid x_i)}[\log p_\phi(x_i\mid z)] + \mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z))$$

The first term is the reconstruction loss, or expected negative log-likelihood of the $i$-th datapoint. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, statistically we say that the decoder parameterizes a likelihood distribution that does not place much probability mass on the true data. For example, if our goal is to model black and white images and our model places high probability on there being black spots where there are actually white spots, this will yield the worst possible reconstruction. Poor reconstruction will incur a large cost in this loss function.

The second term is a regularizer that we throw in (we’ll see how it’s derived later). This is the Kullback-Leibler divergence between the encoder’s distribution $q_\theta(z\mid x)$ and $p(z)$. This divergence measures how much information is lost (in units of nats) when using $q$ to represent $p$. It is one measure of how close $q$ is to $p$.

In the variational autoencoder, $p$ is specified as a standard Normal distribution with mean zero and variance one, or $p(z) = \mathrm{Normal}(0, 1)$. If the encoder outputs representations $z$ that are different from those of a standard normal distribution, it will receive a penalty in the loss. This regularizer term means ‘keep the representations $z$ of each digit sufficiently diverse’. If we didn’t include the regularizer, the encoder could learn to cheat and give each datapoint a representation in a different region of Euclidean space. This is bad, because then two images of the same number (say a 2 written by different people, $2_{alice}$ and $2_{bob}$) could end up with very different representations $z_{alice}, z_{bob}$. We want the representation space of $z$ to be meaningful, so we penalize this behavior. This has the effect of keeping similar numbers’ representations close together (e.g. so the representations of the digit two $\{z_{alice}, z_{bob}, z_{ali}\}$ remain sufficiently close).
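
When the encoder is a Gaussian with mean $\mu_k$ and variance $\sigma_k^2$ in each latent dimension $k$ (a diagonal covariance, which is an assumption of this sketch), the divergence to the standard Normal has a well-known closed form:

$$\mathbb{KL}(q_\theta(z\mid x_i) \mid\mid p(z)) = \frac{1}{2}\sum_{k}\left(\sigma_k^2 + \mu_k^2 - 1 - \log \sigma_k^2\right)$$

The NumPy sketch below evaluates this expression. The function name and the example values are illustrative only.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# Encoder output that matches the prior exactly: no penalty.
kl_match = kl_to_standard_normal(np.zeros(2), np.zeros(2))

# Same variance, but the mean has drifted far from the origin.
kl_shifted = kl_to_standard_normal(np.array([3.0, -3.0]), np.zeros(2))

# Mean at the origin, but a much narrower distribution than the prior.
kl_narrow = kl_to_standard_normal(np.zeros(2), np.full(2, -4.0))

print(kl_match)    # 0.0
print(kl_shifted)  # 9.0
print(kl_narrow)   # positive: narrow codes are penalized too
```

Both ways of cheating, pushing a datapoint’s code far from the origin or shrinking its variance toward a single point, raise the penalty.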

We train the variational autoencoder using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder $\theta$ and $\phi$. For stochastic gradient descent with step size $\rho$, the encoder parameters are updated using $\theta \leftarrow \theta - \rho \frac{\partial l}{\partial \theta}$ and the decoder is updated similarly.
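
The update rule can be illustrated on a toy problem. This is a minimal sketch, assuming a one-parameter quadratic loss $l(\theta) = (\theta - 3)^2$ as a hypothetical stand-in for the VAE loss. Its gradient is written by hand, whereas a real VAE would get gradients from backpropagation over minibatches.

```python
def sgd_step(theta, grad, rho):
    """One gradient descent update: theta <- theta - rho * dl/dtheta."""
    return theta - rho * grad

theta = 0.0  # initial parameter value
rho = 0.1    # step size

for _ in range(100):
    grad = 2.0 * (theta - 3.0)  # derivative of (theta - 3)^2
    theta = sgd_step(theta, grad, rho)

print(theta)  # very close to the minimizer, 3.0
```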

## Implementation in Keras

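First, let’s implement the encoder net $Q(z \mid X)$, which takes input $X$ and outputs two things: $\mu(X)$ and $\Sigma(X)$, the parameters of the Gaussian. In the code, the network outputs the log of the variance (`log_sigma`) rather than $\Sigma(X)$ itself.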

In [None]:
from keras.datasets import mnist
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.losses import binary_crossentropy
from keras.callbacks import LearningRateScheduler

import numpy as np
import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf


m = 50        # minibatch size
n_z = 2       # latent dimension
n_epoch = 10  # training epochs

# Assumption: MNIST loaded via keras.datasets, flattened to 784 values in [0, 1]
(X_train, _), _ = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.

# Q(z|X) -- encoder
inputs = Input(shape=(784,))
h_q = Dense(512, activation='relu')(inputs)
mu = Dense(n_z, activation='linear')(h_q)
log_sigma = Dense(n_z, activation='linear')(h_q)  # log-variance of Q(z|X)

That is, our $Q(z \mid X)$ is a neural net with one hidden layer. In this implementation, our latent variable is two-dimensional so that we can easily visualize it. In practice, though, a higher-dimensional latent variable usually captures the data better.

However, we now face a problem: how do we get $z$ from the encoder outputs? The obvious answer is to sample $z$ from a Gaussian whose parameters are the outputs of the encoder. Alas, sampling directly won’t do if we want to train the VAE with gradient descent, because the sampling operation has no gradient!

There is, however, a trick called the reparameterization trick that makes the network differentiable. It moves the non-differentiable sampling operation out of the network: the random draw still happens, but it no longer sits on the path between the parameters and the loss, so the network can still be trained.

The reparameterization trick is as follows. Recall that if we have $x \sim N(\mu, \Sigma)$ and standardize it to $x_{std}$ with mean $0$ and variance $1$, we can recover the original distribution by reversing the standardization. Hence, we have this equation:

$$x = \mu + \Sigma^{\frac{1}{2}}x_{std}$$

With that in mind, we could extend it. If we sample from a standard normal distribution, we could convert it to any Gaussian we want if we know the mean and the variance. Hence we could implement our sampling operation of $z$ by:

$$z = \mu(X) + \Sigma^{\frac{1}{2}}(X){\epsilon}$$

where $\epsilon \sim N(0,1)$

Now, during backpropagation, we no longer need to worry about the sampling process: $\epsilon$ is drawn outside the network and doesn’t depend on any of its parameters, so the gradient flows through $\mu(X)$ and $\Sigma(X)$ rather than through the sampling step.

In [None]:
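def sample_z(args):
    mu, log_sigma = args
    # eps ~ N(0, I), shaped like mu so that any batch size works
    # (`stddev` is the Keras 2 argument name; Keras 1.x called it `std`)
    eps = K.random_normal(shape=K.shape(mu), mean=0., stddev=1.)
    # log_sigma is the log-variance, so exp(log_sigma / 2) is the std. deviation
    return mu + K.exp(log_sigma / 2) * eps


# Sample z ~ Q(z|X)
z = Lambda(sample_z)([mu, log_sigma])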
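
The same computation can be checked outside Keras. This is a minimal NumPy sketch, assuming (as in `sample_z`) that the second argument is the log-variance. The function name `sample_z_numpy` and the example values are illustrative.

```python
import numpy as np

def sample_z_numpy(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(size=mu.shape)  # randomness independent of mu, log_var
    return mu + np.exp(log_var / 2.0) * eps

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: mean 1.5 and variance 0.25 (std 0.5) in both dimensions.
mu = np.full((100_000, 2), 1.5)
log_var = np.full((100_000, 2), np.log(0.25))

z = sample_z_numpy(mu, log_var, rng)
print(z.mean(axis=0))  # close to [1.5, 1.5]
print(z.std(axis=0))   # close to [0.5, 0.5]
```

Because $\epsilon$ is the only source of randomness, $z$ is a deterministic and differentiable function of the mean and the standard deviation for any fixed draw, which is what lets gradients reach the encoder.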

Now we create the decoder net $P(X \mid z)$:

In [None]:
# P(X|z) -- decoder
decoder_hidden = Dense(512, activation='relu')
decoder_out = Dense(784, activation='sigmoid')

h_p = decoder_hidden(z)
outputs = decoder_out(h_p)

Lastly, this model lets us do three things: reconstruct inputs, encode inputs into latent variables, and generate data from latent variables. So, we have three Keras models:

In [None]:
# Overall VAE model, for reconstruction and training
vae = Model(inputs, outputs)

# Encoder model, to encode input into latent variable
# We use the mean as the output as it is the center point, the representative of the gaussian
encoder = Model(inputs, mu)

# Generator model, generate new data given latent variable z
d_in = Input(shape=(n_z,))
d_h = decoder_hidden(d_in)
d_out = decoder_out(d_h)
decoder = Model(d_in, d_out)

Then, we need to translate our loss into Keras code:

In [None]:
def vae_loss(y_true, y_pred):
    """ Calculate loss = reconstruction loss + KL loss for each data in minibatch """
    # -E[log P(X|z)]: Bernoulli negative log-likelihood, summed over the 784 pixels
    # (keyword arguments avoid depending on the positional order of target/output)
    recon = K.sum(K.binary_crossentropy(target=y_true, output=y_pred), axis=1)
    # D_KL(Q(z|X) || P(z)); calculate in closed form as both dist. are Gaussian
    kl = 0.5 * K.sum(K.exp(log_sigma) + K.square(mu) - 1. - log_sigma, axis=1)

    return recon + kl

and then train it:

In [None]:
vae.compile(optimizer='adam', loss=vae_loss)
# `epochs` is the Keras 2 argument name (Keras 1.x used `nb_epoch`)
vae.fit(X_train, X_train, batch_size=m, epochs=n_epoch)

And that’s it: an implementation of a VAE in Keras!