Understanding disentangling in β-VAE #33

howardyclo · 2018-11-27T09:09:47Z

Metadata

Authors: Christopher P. Burgess, Irina Higgins, +4 authors Alexander Lerchner
Organization: DeepMind
Publish Date: 2018.04
Paper: https://arxiv.org/pdf/1804.03599.pdf
3rd-party code: https://github.com/1Konny/Beta-VAE

Useful Tutorials of VAE and β-VAE

Read "From Autoencoder to Beta-VAE" or "What a Disentangled Net We Weave: Representation Learning in VAEs" for the intuition behind VAE and β-VAE.
Read "Variational Coin Toss" for the intuition behind variational inference (the basis of VAE).
Read the variational inference notes in Stanford CS228 (Probabilistic Graphical Models), or see "A Tutorial on Variational Bayesian Inference" for more mathematical detail.
See also the original VAE paper and the "Notes on Variational Autoencoders".
This paper is a follow-up to the original β-VAE paper.

Background

β-VAE is a state-of-the-art model for unsupervised learning of disentangled visual representations.
β-VAE adds an extra hyperparameter β to the VAE objective, which constricts the effective encoding capacity of the latent bottleneck and encourages the latent representation to be more factorized.
The disentangled representations learned by β-VAE have been shown to be important for learning a hierarchy of abstract visual concepts conducive to imagination (SCAN, Higgins et al.) and for improving the transfer performance of reinforcement learning policies, including simulation-to-reality transfer in robotics (DARLA, Higgins et al.).
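
For reference, the objective described above can be written as follows. This uses the usual notation from the original β-VAE paper, which is not spelled out in these notes: an encoder q_φ(z|x), a decoder p_θ(x|z), and a prior p(z) that is typically an isotropic unit Gaussian. Setting β = 1 recovers the standard VAE objective.

```latex
\mathcal{L}(\theta, \phi; x, z, \beta) =
  \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
  - \beta \, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)
```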

Motivation

It is not yet understood why the factorized representations learnt by β-VAE are axis-aligned with the human intuition of the data generative factors more closely than those learnt by the standard VAE.
Furthermore, β-VAE has other limitations, such as worse reconstruction fidelity compared to the standard VAE. This is caused by a trade-off introduced by the modified training objective that punishes reconstruction quality in order to encourage disentanglement within the latent representations.
This paper attempts to shed light on the question of why β-VAE disentangles, and to use the new insights to suggest practical improvements to the β-VAE framework to overcome the reconstruction-disentanglement trade-off.

Understanding disentangling in β-VAE

From the perspective of the information bottleneck principle (Tishby et al., 1999), the β-VAE training objective encourages the latent distribution q(z|x) to transmit information about the data points x efficiently, by jointly minimizing the β-weighted KL term and maximizing the data log-likelihood.
A strong pressure for overlapping posteriors encourages β-VAE to find a representation space preserving as much as possible the locality of points on the data manifold.
Hypothesis: β-VAE finds latent components which make different contributions to the log-likelihood term of the objective function. These latent components tend to correspond to features in the data that are intuitively qualitatively different, and therefore may align with the generative factors in the data.
For example, consider optimizing the β-VAE objective under an almost complete information bottleneck constraint (i.e. β >> 1). The optimal strategy in this scenario is to encode only the information about the data points that yields the most significant improvement in the data log-likelihood term, E_{q(z|x)}[log p(x|z)].
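
The trade-off between the β-weighted KL term and the log-likelihood term can be made concrete with a minimal NumPy sketch. It assumes a diagonal-Gaussian posterior, a unit-Gaussian prior, and a pixel-wise Bernoulli decoder; none of these choices is stated in the notes above, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats, summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)


def bernoulli_log_likelihood(x, x_recon_prob, eps=1e-7):
    """Pixel-wise Bernoulli log p(x|z), summed over pixels (assumed decoder)."""
    p = np.clip(x_recon_prob, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p), axis=-1)


def beta_vae_loss(x, x_recon_prob, mu, logvar, beta):
    """Negative beta-VAE objective averaged over the batch (to be minimized)."""
    recon = bernoulli_log_likelihood(x, x_recon_prob)
    kl = gaussian_kl(mu, logvar)
    return np.mean(-recon + beta * kl)
```

With β >> 1, every nat of KL costs far more than it does at β = 1, so a latent dimension is only worth using if it buys a large gain in log-likelihood.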

Intuition of Improvement (The most important part)

For example, in the dSprites dataset (consisting of white 2D sprites varying in position, rotation, scale and shape rendered onto a black background) the model might only encode the sprite position under such a constraint. Intuitively, when optimizing a pixel-wise decoder log likelihood, information about position will result in the most gains compared to information about any of the other factors of variation in the data, since the likelihood will vanish if reconstructed position is off by just a few pixels.
Continuing this intuitive picture, we can imagine that if the capacity of the information bottleneck were gradually increased, the model would continue to utilize those extra bits for an increasingly precise encoding of position, until some point of diminishing returns is reached for position information, where a larger improvement can be obtained by encoding and reconstructing another factor of variation in the dataset, such as sprite scale.
The authors test this intuition by training a model to generate dSprites conditioned on the ground-truth factors, with a controllable information bottleneck. Each factor is independently scaled by a learnable parameter and is subject to independently scaled additive noise (also learned), similar to the reparameterized latent distribution in β-VAE. Throughout training, the capacity of the information bottleneck is increased linearly. The experiment shows that the early capacity is allocated to the positional latents only (x and y), followed by the scale latent, then the shape and orientation latents.
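
The noisy channel in that experiment can be sketched roughly as below. This is an illustration under assumptions, not the authors' code: it treats every factor as a standardized real number (shape is really categorical), uses fixed values where the paper learns the scales, and measures the capacity used by each factor as its KL to a unit-Gaussian prior. All names and values are hypothetical.

```python
import numpy as np


def noisy_channel(factors, scale, noise_scale, rng):
    """z_i = scale_i * f_i + noise_scale_i * eps_i, with eps_i ~ N(0, 1)."""
    eps = rng.standard_normal(factors.shape)
    return scale * factors + noise_scale * eps


def per_factor_kl(factors, scale, noise_scale):
    """Batch-average KL( N(scale*f, noise_scale^2) || N(0, 1) ) per factor, in nats."""
    mu = scale * factors
    var = noise_scale**2
    kl = 0.5 * (mu**2 + var - 1.0 - np.log(var))
    return kl.mean(axis=0)


rng = np.random.default_rng(0)
factors = rng.standard_normal((1000, 5))      # hypothetical standardized factors
scale = np.array([2.0, 2.0, 1.0, 0.5, 0.0])   # hypothetical scales (learned in the paper)
noise_scale = np.ones(5)                      # hypothetical noise scales (also learned)
z = noisy_channel(factors, scale, noise_scale, rng)
print(per_factor_kl(factors, scale, noise_scale))
```

A factor with zero scale and unit noise uses no capacity at all, while a larger scale relative to the noise spends more nats on that factor. In this sketch, that ratio is the quantity a model would raise first for position and only later for scale, shape and orientation.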

Reference

SCAN: Learning Hierarchical Compositional Visual Concepts by Irina Higgins et al. ICLR 2018.
DARLA: Improving Zero-Shot Transfer in Reinforcement Learning by Irina Higgins et al. ICML 2017.

howardyclo · 2018-12-07

How to Tune Hyperparameters Gamma and C? (Response by Christopher P. Burgess)

Gamma sets the strength of the penalty for deviating from the target KL, C. You want to tune it so that the (batch) average KL stays close to C (say, within 1 nat) across the range of C that you use. The exact value doesn't usually matter much; just avoid setting it so high that it destabilises the optimisation. C itself should start low (e.g. 0 or 1) and gradually increase to a value high enough that reconstructions end up good quality. A good way to estimate C_max is to train β-VAE on your dataset with a beta low enough that reconstructions end up good quality, then look at the trained model's average KL. That KL can be your C_max, because it gives a rough guide to the average amount of representational capacity needed to encode your dataset.
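
The advice above can be sketched in a few lines of NumPy, assuming the capacity-controlled objective takes the form E[log p(x|z)] - γ |KL - C| and that the penalty is applied to the batch-average KL. The helper names, the linear schedule parameters and the 1-nat tolerance check are illustrative assumptions, not part of the original response.

```python
import numpy as np


def linear_capacity(step, c_max, anneal_steps, c_start=0.0):
    """Target KL in nats: rises linearly from c_start to c_max, then stays at c_max."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return c_start + frac * (c_max - c_start)


def capacity_loss(recon_log_lik, kl, gamma, c):
    """Negative objective: -E[log p(x|z)] + gamma * |mean KL - C| (to be minimized)."""
    return -np.mean(recon_log_lik) + gamma * abs(np.mean(kl) - c)


def kl_tracks_target(mean_kl, c, tol=1.0):
    """Rule of thumb from the text: batch-average KL within about 1 nat of C."""
    return abs(mean_kl - c) < tol
```

In a training loop, one would compute `c = linear_capacity(step, c_max, anneal_steps)` at every step and log `kl_tracks_target(mean_kl, c)`. If the check fails for long stretches, gamma is probably too low; if training becomes unstable, it is probably too high.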
