
rho_VAE:

A repo dedicated to our paper ρ-VAE: Autoregressive parametrization of the VAE encoder

Main idea:

We replace the usual diagonal Gaussian parametrization of the approximate posterior in VAE models with an autoregressive Gaussian.

The usual way:

In standard VAE models, each input training sample x^{(i)} induces an approximate posterior over the latent space that is a diagonal Gaussian, i.e., q(z | x^{(i)}) = \mathcal{N}\big(z; \mu_i, \mathrm{diag}(\sigma_i^2)\big). This is realized by one linear layer that outputs the mean vector \mu_i of this distribution for the sample, and another linear layer that outputs the logvar, i.e., the logarithm of the diagonal elements \sigma_i^2 of the covariance matrix for the sample.

The next step is to generate samples from this distribution. The reparametrization trick is then used to generate latent codes as z = \mu + \sigma \odot \epsilon, where \sigma \odot \epsilon is the element-wise multiplication of the standard-deviation vector \sigma with another vector \epsilon of random samples generated on the fly from the zero-mean, unit-variance normal distribution. Note that this comes from the fact that the Cholesky factor of a diagonal covariance is another diagonal matrix whose diagonal elements are the square roots of the diagonal elements of the covariance matrix.

So, to keep things practical, an important restriction on the choice of how to parametrize the approximate posterior distribution is that its covariance should have a straightforward Cholesky factorization, such that it can be constructed parametrically, without having to compute it numerically.

On the other hand, we also need to calculate the KL-divergence between this approximate posterior and the prior over the latent space, which is usually a zero-mean, unit-variance Gaussian. To avoid further complications, we would rather have this KL-divergence term in closed form and again expressed parametrically, so that the gradients with respect to the network weights can be back-propagated easily.

With this standard diagonal Gaussian choice, the KL-divergence can be derived in closed form as:

D_{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I_d)\big) = -\frac{1}{2} \sum_{j=1}^{d} \big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big).

However, while very handy, this parametrization may be too simple to approximate the posterior well. Note that it does not allow any correlation between the dimensions of the posterior. It only lets each dimension scale arbitrarily and independently of the others, a freedom which may not even be necessary, since variances in the pixel domain are usually within the same range.
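For concreteness, here is a minimal sketch of the standard encoder head described above; the names h_dim, z_dim, fc_mu and fc_logvar are illustrative, not the exact code of this repository:

import torch
from torch import nn

class DiagonalGaussianHead(nn.Module):
    # Maps encoder features h of shape (batch, h_dim) to the parameters of a
    # diagonal Gaussian posterior over a latent space of dimension z_dim.
    def __init__(self, h_dim, z_dim):
        super().__init__()
        self.fc_mu = nn.Linear(h_dim, z_dim)      # outputs the mean vector mu
        self.fc_logvar = nn.Linear(h_dim, z_dim)  # outputs log of the diagonal variances

    def forward(self, h):
        return self.fc_mu(h), self.fc_logvar(h)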

The rho_VAE way:

We propose to construct the covariance matrix of the approximate posterior differently. In particular, we account for correlation in the simplest possible way, i.e., through an AR(1) process, with the covariance parametrized as:

C_{\rho,s} = s\,\begin{bmatrix} 1 & \rho & \cdots & \rho^{d-1} \\ \rho & 1 & \cdots & \rho^{d-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{d-1} & \rho^{d-2} & \cdots & 1 \end{bmatrix}, \quad \text{i.e., } \big[C_{\rho,s}\big]_{ij} = s\,\rho^{|i-j|},

where \rho is a scalar (with |\rho| < 1) that controls the level of correlation and s > 0 is another scalar that scales the whole covariance.

These two scalars can be constructed as the outputs of two linear layers, each of size (h_dim, 1) (so z_dim times fewer parameters than the standard (h_dim, z_dim) layer). Since we want the correlation coefficient to have magnitude less than one (it can be positive or negative), we pass the output of the first layer through a tanh activation. Similarly to the standard way, the output of the other layer is taken to be the logarithm of the positive scaling factor.
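A minimal sketch of this ρ-head, mirroring the description above (layer names such as fc_rho and fc_logs are illustrative, not the repository's exact code):

import torch
from torch import nn

class RhoGaussianHead(nn.Module):
    # Maps encoder features h of shape (batch, h_dim) to the mean mu, the
    # scalar correlation rho in (-1, 1) and the log of the scaling factor s.
    def __init__(self, h_dim, z_dim):
        super().__init__()
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_rho = nn.Linear(h_dim, 1)    # one scalar per sample
        self.fc_logs = nn.Linear(h_dim, 1)   # one scalar per sample

    def forward(self, h):
        mu = self.fc_mu(h)
        rho = torch.tanh(self.fc_rho(h))     # keeps |rho| < 1
        log_s = self.fc_logs(h)              # s = exp(log_s) > 0
        return mu, rho, log_s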

As far as the practical issues of optimization are concerned, this is fortunately as convenient as the standard way. The Cholesky factorization of this covariance is parametric and has the form:

L_{\rho,s} = \sqrt{s}\,\begin{bmatrix} 1 & 0 & \cdots & 0 \\ \rho & \sqrt{1-\rho^2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{d-1} & \rho^{d-2}\sqrt{1-\rho^2} & \cdots & \sqrt{1-\rho^2} \end{bmatrix},

which has a structure similar to the covariance matrix itself.
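As a sanity check, this parametric factor can be compared against a numerical Cholesky factorization of the AR(1) covariance; the following standalone sketch (not part of the repository's code) verifies both L L^T = C and agreement with torch.linalg.cholesky:

import torch

def ar1_covariance(rho, s, d):
    # s times the Toeplitz matrix with entries rho^|i-j|
    idx = torch.arange(d)
    return s * rho ** (idx.view(-1, 1) - idx.view(1, -1)).abs().float()

def ar1_cholesky(rho, s, d):
    # Parametric lower-triangular factor: first column rho^(i-1), the other
    # entries rho^(i-j) * sqrt(1 - rho^2), everything scaled by sqrt(s).
    idx = torch.arange(d)
    L = torch.tril(rho ** (idx.view(-1, 1) - idx.view(1, -1)).clamp(min=0).float())
    L[:, 1:] = L[:, 1:] * (1 - rho ** 2) ** 0.5
    return (s ** 0.5) * L

rho, s, d = 0.7, 1.3, 5
C = ar1_covariance(rho, s, d)
L = ar1_cholesky(rho, s, d)
print(torch.allclose(L @ L.T, C, atol=1e-5))                   # True
print(torch.allclose(L, torch.linalg.cholesky(C), atol=1e-5))  # True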

As for the KL-divergence term with respect to the standard normal prior, this also comes in closed form:

D_{KL}\big(\mathcal{N}(\mu, C_{\rho,s}) \,\|\, \mathcal{N}(0, I_d)\big) = \frac{1}{2}\Big(\|\mu\|^2 + d\,(s - 1) - d\log s - (d-1)\log(1-\rho^2)\Big),

which follows from the general Gaussian KL expression by plugging in \mathrm{tr}(C_{\rho,s}) = d\,s and \det(C_{\rho,s}) = s^d (1-\rho^2)^{d-1}.

In essence: a drop-in replacement for the usual encoder:

Concretely, as far as the implementation of our idea is concerned, we only make the following 3 changes:

  1. Replacing the fully-connected linear layer of size (h_dim, z_dim) that outputs logvar with two linear layers of size (h_dim, 1). The first one is passed through a tanh non-linearity and outputs the correlation coefficient rho, and the second one (without any non-linearity) outputs the logarithm of the scaling factor, i.e., log_s.

  2. Changing the reparametrization function from

def reparameterize(self, mu, logvar):
    std = torch.exp(0.5*logvar)
    eps = torch.randn_like(std)
    z_q = mu + eps*std
    return z_q

to:

def reparameterize(self, mu, rho, logs):
    # Sample z ~ N(mu, s * Toeplitz(rho^|i-j|)) via the AR(1) recursion, which
    # matches multiplying a standard-normal vector by the Cholesky factor above.
    std = torch.sqrt(logs.exp())                      # sqrt(s), shape (batch, 1)
    z_q = torch.randn_like(rho).view(-1, 1) * std     # first latent dimension
    for j in range(1, mu.size(1)):
        innovation = torch.randn_like(rho).view(-1, 1) * std * torch.sqrt(1 - rho ** 2)
        z_q = torch.cat((z_q, rho * z_q[:, -1].view(-1, 1) + innovation), 1)
    return z_q + mu

Note that this is equivalent to generating the samples by matrix-vector multiplication of the Cholesky form above with the vector eps. The reason we chose this direct form is that Toeplitz matrices are not yet implemented in PyTorch, so we use a for-loop to realize the AR(1) process in practice. This, however, does not bring any slow-down; we observe that rho_VAE runs as fast as the baseline.
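For reference, here is what that matrix-vector form could look like, building the Cholesky factor per sample and multiplying it with eps; this is a sketch assuming, as above, that rho and logs have shape (batch, 1), and it is not the repository's code:

import torch

def reparameterize_matvec(mu, rho, logs):
    # mu: (batch, z_dim); rho, logs: (batch, 1)
    batch, z_dim = mu.shape
    idx = torch.arange(z_dim, device=mu.device)
    expo = (idx.view(-1, 1) - idx.view(1, -1)).clamp(min=0).float()
    L = torch.tril(rho.view(-1, 1, 1) ** expo)                   # rho^(i-j) on/below the diagonal
    L[:, :, 1:] = L[:, :, 1:] * torch.sqrt(1 - rho.view(-1, 1, 1) ** 2)
    L = logs.exp().sqrt().view(-1, 1, 1) * L                     # scale by sqrt(s)
    eps = torch.randn(batch, z_dim, 1, device=mu.device)
    return mu + torch.bmm(L, eps).squeeze(-1)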

  3. Changing the Kullback-Leibler divergence term in the loss from

KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

to:

KLD = 0.5 * (mu.pow(2).sum(dim=1)
             - z_dim * logs.view(-1)
             - (z_dim - 1) * torch.log(1 - rho.view(-1) ** 2 + 1e-7)
             + z_dim * (logs.view(-1).exp() - 1)).mean()

Note that these are analogous quantities with the same role and range; the first one is valid when the approximate posterior is diagonal, and the second one when it is AR(1).
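As a quick cross-check of the closed-form expression against a numerical KL computation (a standalone sketch with illustrative values, not part of the repository's code):

import torch
from torch.distributions import MultivariateNormal, kl_divergence

z_dim, rho, s = 4, 0.6, 0.8
mu = torch.randn(z_dim)

# AR(1) covariance: s * rho^|i-j|
idx = torch.arange(z_dim)
C = s * rho ** (idx.view(-1, 1) - idx.view(1, -1)).abs().float()

kl_exact = kl_divergence(MultivariateNormal(mu, C),
                         MultivariateNormal(torch.zeros(z_dim), torch.eye(z_dim)))

kl_closed = 0.5 * (mu.pow(2).sum()
                   - z_dim * torch.log(torch.tensor(s))
                   - (z_dim - 1) * torch.log(torch.tensor(1 - rho ** 2))
                   + z_dim * (s - 1))

print(torch.allclose(kl_exact, kl_closed, atol=1e-4))  # True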

These 3 changes can be applied in a plug-and-play manner, meaning that wherever relevant, replacing the usual way with this ρ-way is expected to produce better results. So there are no extra parameters to tune!

Some generated samples:

Here are some samples generated after the above changes, while keeping all other things the same:

On mnist,

Vanilla-VAE: [image: vanilla-VAE samples]; rho-VAE: [image: rho-VAE samples]

and on fashion-mnist,

Vanilla-VAE: [image: vanilla-VAE samples]; rho-VAE: [image: rho-VAE samples]

Citation:

Here is how to cite this work:

@misc{ferdowsi2019vae,
    title={$ρ$-VAE: Autoregressive parametrization of the VAE encoder},
    author={Sohrab Ferdowsi and Maurits Diephuis and Shideh Rezaeifar and Slava Voloshynovskiy},
    year={2019},
    eprint={1909.06236},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
