Understanding disentangling in β-VAE #33

Open
howardyclo opened this issue Nov 27, 2018 · 1 comment
howardyclo commented Nov 27, 2018

Metadata

Useful Tutorials of VAE and β-VAE

Background

  • β-VAE is a state-of-the-art model for unsupervised learning of disentangled visual representations.
  • β-VAE adds an extra hyperparameter β to the VAE objective, which constricts the effective encoding capacity of the latent bottleneck and encourages the latent representation to be more factorized.
  • The disentangled representations learned by β-VAE have been shown to be important for learning a hierarchy of abstract visual concepts conducive to imagination (SCAN, Higgins et al.) and for improving transfer performance of reinforcement learning policies, including simulation-to-reality transfer in robotics (DARLA, Higgins et al.).
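
As a rough illustration of how β enters the objective, here is a minimal sketch of the per-example loss. It assumes the usual setup of a diagonal Gaussian posterior and a standard normal prior; the function names and the scalar `recon_nll` input (the negative log-likelihood of the reconstruction) are placeholders for illustration, not taken from the paper's code.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_nll, mu, log_var, beta):
    """Loss to minimise: reconstruction NLL plus the beta-weighted KL term.

    beta = 1 gives the standard VAE objective; beta > 1 tightens the bottleneck.
    """
    return recon_nll + beta * gaussian_kl(mu, log_var)

# A posterior equal to the prior costs nothing; moving the mean away costs KL.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))
print(beta_vae_loss(recon_nll=10.0, mu=[1.0], log_var=[0.0], beta=4.0))
```

In a real model `recon_nll`, `mu` and `log_var` would come from the decoder and encoder networks; plain floats are used here only to keep the sketch self-contained.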

Motivation

  • It is currently unknown what causes the factorized representations learnt by β-VAE to be axis aligned with the human intuition of the data generative factors compared to the standard VAE.
  • Furthermore, β-VAE has other limitations, such as worse reconstruction fidelity compared to the standard VAE. This is caused by a trade-off introduced by the modified training objective, which sacrifices reconstruction quality in order to encourage disentanglement within the latent representations.
  • This paper attempts to shed light on the question of why β-VAE disentangles, and to use the new insights to suggest practical improvements to the β-VAE framework to overcome the reconstruction-disentanglement trade-off.

Understanding disentangling in β-VAE

  • From the perspective of the information bottleneck principle (Tishby et al. 1999), the β-VAE training objective encourages the latent distribution q(z|x) to efficiently transmit information about the data points x by jointly minimizing the β-weighted KL term and maximizing the data log-likelihood.
  • A strong pressure for overlapping posteriors encourages β-VAE to find a representation space preserving as much as possible the locality of points on the data manifold.
  • Hypothesis: β-VAE finds latent components which make different contributions to the log-likelihood term of the objective function. These latent components tend to correspond to features in the data that are intuitively qualitatively different, and therefore may align with the generative factors in the data.
  • For example, consider optimizing the β-VAE objective under an almost complete information bottleneck constraint (i.e. β >> 1). The optimal strategy in this scenario is to encode only the information about the data points that yields the most significant improvement in the data log-likelihood (i.e. E_{q(z|x)}[log p(x|z)]).
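
For reference, the objective discussed in these bullets can be written out as below. The parameter symbols θ and φ follow common VAE notation and are an assumption here, since the notes only write q(z|x) and p(x|z). The second line is a standard identity (with q_φ(z) denoting the posterior averaged over the data) showing why the KL term can be read as a capacity: averaged over the data, it upper-bounds the mutual information between x and z.

```latex
\mathcal{L}(\theta, \phi; x, \beta)
  = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
  - \beta \, D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big)

\mathbb{E}_{p(x)}\Big[ D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big) \Big]
  = I_q(x; z) + D_{\mathrm{KL}}\big(q_\phi(z) \,\|\, p(z)\big)
  \;\ge\; I_q(x; z),
\qquad q_\phi(z) = \mathbb{E}_{p(x)}\big[q_\phi(z|x)\big]
```

With β >> 1 the second term dominates, so the model can only afford the few nats of KL that buy the largest gain in the first term.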

Intuition of Improvement (The most important part)

  • For example, in the dSprites dataset (consisting of white 2D sprites varying in position, rotation, scale and shape, rendered onto a black background) the model might only encode the sprite position under such a constraint. Intuitively, when optimizing a pixel-wise decoder log-likelihood, information about position will result in the most gains compared to information about any of the other factors of variation in the data, since the likelihood will vanish if the reconstructed position is off by just a few pixels.
  • Continuing this intuitive picture, we can imagine that if the capacity of the information bottleneck were gradually increased, the model would continue to utilize those extra bits for an increasingly precise encoding of position, until some point of diminishing returns is reached for position information, where a larger improvement can be obtained by encoding and reconstructing another factor of variation in the dataset, such as sprite scale.
  • They further test this intuition by training a model to generate dSprites conditioned on ground truth factors, with a controllable information bottleneck. Each factor is independently scaled by a learnable parameter and is subject to independently scaled additive noise (also learned), similar to the reparameterized latent distribution in β-VAE. Throughout training, the capacity of the information bottleneck increases linearly. The experiment shows that the early capacity is allocated to the positional latents only (x and y), followed by a scale latent, then shape and orientation latents.
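
The "most valuable factor first" behaviour has a textbook analogue that can be run in a few lines: reverse water-filling for independent Gaussian sources under squared error. This is only an analogy and not the paper's experiment (dSprites uses a pixel-wise likelihood, not a Gaussian source model), and the per-factor variances below are hypothetical numbers chosen to mimic the ordering described above.

```python
import math

def reverse_water_filling(variances, total_rate_bits, iters=200):
    """Split a bit budget across independent Gaussian components so that the
    total squared error is minimal (textbook reverse water-filling)."""
    lo, hi = 0.0, max(variances)  # search range for the "water level" theta
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        used = sum(0.5 * math.log2(v / theta) for v in variances if v > theta)
        if used > total_rate_bits:
            lo = theta  # over budget: raise the water level
        else:
            hi = theta  # within budget: try a lower level
    # Components whose variance is below the water level receive no bits.
    return [0.5 * math.log2(v / hi) if v > hi else 0.0 for v in variances]

# Hypothetical variances: how much squared error each factor accounts for.
factors = {"position": 16.0, "scale": 4.0, "shape": 1.0, "rotation": 0.25}

for budget in (0.5, 2.0, 4.0):
    rates = reverse_water_filling(list(factors.values()), budget)
    print(budget, {name: round(r, 3) for name, r in zip(factors, rates)})
```

At a small budget only the highest-variance component receives bits; as the budget grows, the next component switches on once encoding it pays more than refining the first, which mirrors the diminishing-returns argument in the bullets.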

Reference

Further Readings

@howardyclo

How to Tune Hyperparameters Gamma and C? (Response by Christopher P. Burgess)

Gamma sets the strength of the penalty for deviating from the target KL, C. Tune it so that the (batch) average KL stays close to C (say, within 1 nat) across the range of C that you use. The exact value of gamma doesn't usually matter much; just avoid making it so high that it destabilises the optimisation. C itself should start low (e.g. 0 or 1) and gradually increase to a value high enough that reconstructions end up good quality. A good way to estimate Cmax is to train β-VAE on your dataset with a beta low enough that reconstructions end up good quality, then look at the trained model's average KL. That KL can be your Cmax because it gives you a rough guide to the average amount of representational capacity needed to encode your dataset.
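
A minimal sketch of how gamma and C fit together, assuming the penalty takes the form gamma * |KL - C| and that C is annealed linearly from a low value up to Cmax. The function names, the linear schedule, and the numbers in the usage lines are illustrative assumptions, not recommended settings.

```python
def capacity_at(step, c_max, anneal_steps, c_min=0.0):
    """Target KL (in nats) at a training step: linear ramp from c_min to c_max,
    then held at c_max."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return c_min + frac * (c_max - c_min)

def capacity_controlled_loss(recon_nll, kl, gamma, capacity):
    """Loss to minimise: reconstruction NLL plus gamma times the distance
    between the (batch-average) KL and the current target capacity."""
    return recon_nll + gamma * abs(kl - capacity)

# Placeholder values: check how far the KL sits from its target mid-training.
c = capacity_at(step=50_000, c_max=25.0, anneal_steps=100_000)
print(c, capacity_controlled_loss(recon_nll=100.0, kl=13.0, gamma=1000.0, capacity=c))
```

Logging `abs(kl - capacity)` during training is a simple way to apply the advice above: if it regularly exceeds about 1 nat, gamma is probably too low; if the loss becomes unstable, gamma is probably too high.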
