# Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

This paper introduces InfoGAN, an extension of the GAN framework that learns disentangled representations in a completely unsupervised manner.

A disentangled representation is one in which different factors/dimensions of the representation correspond to different types of semantic change in the output. E.g. walking along one dimension might change only the thickness of generated digits.

## Theory: Mutual information for inducing latent codes
* Normal GANs just use a simple factored continuous noise vector $z$ as input.
    * No restrictions on how this input is used by the generator.
    * Thus possible (and likely?) that $z$ will be used in a highly entangled way.


* Instead let the input consist of two parts:
    * $z$ same as in normal GANs.
    * $c = [ c_1, c_2, \dotsc, c_L ]$ used to capture the salient structured semantic features of the data.
        * Assume a factored distribution $p(c_1, \dotsc, c_L) = \prod^L_{i=1} p(c_i)$
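
A minimal sketch of how such a two-part input could be sampled. The sizes (62 noise dimensions, one 10-way categorical code, two continuous codes), the Gaussian noise and the uniform priors are illustrative assumptions, not taken from these notes; `sample_latent` is a hypothetical helper name.

```python
import random

def sample_latent(noise_dim=62, num_categories=10, num_continuous=2, rng=random):
    """Draw one generator input (z, c) with a factored prior over c."""
    # z: unstructured noise, same role as in a normal GAN (assumed Gaussian here).
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # c_1: a categorical code, uniform over its values, one-hot encoded.
    k = rng.randrange(num_categories)
    c_cat = [1.0 if i == k else 0.0 for i in range(num_categories)]
    # c_2..c_L: continuous codes, each drawn independently (factored prior).
    c_cont = [rng.uniform(-1.0, 1.0) for _ in range(num_continuous)]
    # The generator receives the concatenation of z and all codes.
    return z + c_cat + c_cont, k, c_cont

generator_input, category, continuous_codes = sample_latent()
```

Every code is drawn independently of the others, which is exactly the factored assumption $p(c_1, \dotsc, c_L) = \prod_i p(c_i)$.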


* To force G to parametrize a distribution that actually uses $c$ they add *information-theoretic* regularization.
    * Want high mutual information between $c$ and generator distribution $G(z, c)$
    * Mutual information: $I(A; B) = H(A) - H(A\ \lvert\ B) = H(B) - H(B\ \lvert\ A)$, $H$ is entropy.
    * Intuition: mutual information measures how much information is gained by conditioning on another variable.
        * Independence means mutual information is zero.
        * If $c$ "changes" the generative distribution the mutual information is higher, which we want.
    * For $x \sim p_G(x) = G(z, c)$, we want $p(c\ \lvert\ x)$ to have small entropy, i.e. the generative process should not lose the information stored in $c$.
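    * They add $-\lambda I(c;G(z, c))$ to the GAN objective, so G is also trained to maximize the mutual information: $\min_G \max_D V(D, G) - \lambda I(c;G(z, c))$, where $V(D, G)$ is the standard GAN objective.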
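
To make the intuition concrete, here is a small sketch that evaluates $I(A; B) = H(A) - H(A\ \lvert\ B)$ on a discrete joint table. The two example tables are made up for illustration: one where the variables are independent and one where each variable fully determines the other.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_information(joint):
    """I(A;B) = H(A) - H(A|B) for a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    # H(A|B) = sum_b p(b) * H(A | B=b)
    h_a_given_b = 0.0
    for b, pb in enumerate(p_b):
        if pb > 0:
            conditional = [row[b] / pb for row in joint]
            h_a_given_b += pb * entropy(conditional)
    return entropy(p_a) - h_a_given_b

# Independent variables: knowing B tells us nothing about A.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Fully dependent variables: knowing B determines A.
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(dependent))
```

The independent table gives (numerically) zero, while the dependent table gives $H(A) = \log 2$, since conditioning on $B$ removes all uncertainty about $A$.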
    
    
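* Can't maximize $I(c;G(z, c))$ directly since it requires the posterior $p(c\ \lvert\ x)$.
    * Instead, use a technique called *Variational Information Maximization*.
        * Introduce an auxiliary distribution $Q(c\ \lvert\ x)$ that approximates the posterior $p(c\ \lvert\ x)$.
        * This gives a lower bound on the mutual information, the same idea as in variational inference: the gap is an expected KL divergence between $p(c\ \lvert\ x)$ and $Q(c\ \lvert\ x)$, which is non-negative.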
    * $I(c;G(z, c)) = H(c) - H(c\ \lvert\ G(z, c)) \geq \mathbb{E}_{x \sim G(z, c)} \left[ \mathbb{E}_{c' \sim p(c\ \lvert\ x)} \left[ \log Q(c'\ \lvert\ x) \right] \right] + H(c) = \mathcal{L}_I$
    * With this formulation we would still need to sample from the posterior in the inner expectation, so they use an identity to rewrite the bound so that $c$ is only sampled from the prior:
    * $\mathcal{L}_I = \mathbb{E}_{c \sim p(c), x \sim G(z, c)} \left[ \log Q(c\ \lvert\ x) \right] + H(c)$
    * $H(c)$ can be optimized over as well, but in this work they keep $p(c)$ fixed. Thus consider $H(c)$ constant.
    * $\mathcal{L}_I$ can be approximated with Monte Carlo sampling. It can be maximized directly w.r.t. $Q$, and w.r.t. $G$ via the reparametrization trick.
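
A toy numerical check of the bound above, as a sketch under strong simplifications: $c$ and $x$ are both binary and the "generator" is just a fixed table $p(x\ \lvert\ c)$ (all numbers are made up). It compares the exact mutual information with $\mathcal{L}_I$ for the true posterior and for a deliberately poor $Q$, and estimates $\mathcal{L}_I$ by Monte Carlo using only samples $c \sim p(c)$, $x \sim p(x\ \lvert\ c)$.

```python
import math
import random

P_C = [0.5, 0.5]                         # fixed prior p(c)
P_X_GIVEN_C = [[0.9, 0.1], [0.1, 0.9]]   # toy "generator": p(x | c), indexed [c][x]
H_C = -sum(p * math.log(p) for p in P_C)  # entropy of the prior, a constant

def exact_mi():
    """Exact I(c; x) computed from the full joint distribution."""
    p_x = [sum(P_C[c] * P_X_GIVEN_C[c][x] for c in range(2)) for x in range(2)]
    return sum(
        P_C[c] * P_X_GIVEN_C[c][x] * math.log(P_X_GIVEN_C[c][x] / p_x[x])
        for c in range(2) for x in range(2)
    )

def lower_bound(q):
    """L_I = E_{c~p(c), x~p(x|c)}[log Q(c|x)] + H(c), with q[x][c] = Q(c|x)."""
    expectation = sum(
        P_C[c] * P_X_GIVEN_C[c][x] * math.log(q[x][c])
        for c in range(2) for x in range(2)
    )
    return expectation + H_C

def monte_carlo_bound(q, num_samples, rng):
    """Estimate L_I by sampling c from the prior, then x from the generator."""
    total = 0.0
    for _ in range(num_samples):
        c = 0 if rng.random() < P_C[0] else 1
        x = 0 if rng.random() < P_X_GIVEN_C[c][0] else 1
        total += math.log(q[x][c])
    return total / num_samples + H_C

Q_TRUE = [[0.9, 0.1], [0.1, 0.9]]  # true posterior p(c|x) for this symmetric toy case
Q_BAD = [[0.7, 0.3], [0.3, 0.7]]   # a less accurate auxiliary distribution

print(exact_mi(), lower_bound(Q_TRUE), lower_bound(Q_BAD))
print(monte_carlo_bound(Q_TRUE, 20000, random.Random(0)))
```

When $Q$ equals the true posterior the bound is tight; any other $Q$ gives a strictly smaller value, which is why maximizing $\mathcal{L}_I$ over $Q$ tightens the bound.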

## Implementation
* Q parametrized as a neural network that shares parameters with D.
    * One extra fully connected layer to output parameters for $Q(c\ \lvert\ x)$
    * For categorical $c_i$ use softmax for $Q(c_i\ \lvert\ x)$
    * For continuous $c_j$, other distributions can be used, but a factored Gaussian is usually enough.
* The $\lambda$ parameter is easy to tune.
* They use DCGAN architecture.
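
A sketch of how $\log Q(c\ \lvert\ x)$ could be evaluated from the raw outputs of the extra layer, assuming one categorical code and a factored Gaussian over the continuous codes. The output layout (`logits`, then means, then log standard deviations) and the function names are hypothetical choices made for this example, not the paper's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def gaussian_log_pdf(value, mean, log_std):
    """Log density of a 1-D Gaussian parametrized by mean and log std."""
    std = math.exp(log_std)
    return -0.5 * math.log(2 * math.pi) - log_std - 0.5 * ((value - mean) / std) ** 2

def log_q(head_out, category, c_cont, num_categories):
    """log Q(c|x) for one sample.

    head_out layout (assumed): [logits..., mean_1..mean_J, log_std_1..log_std_J]
    category: index of the sampled categorical code
    c_cont:   list of the J sampled continuous codes
    """
    j = len(c_cont)
    logits = head_out[:num_categories]
    means = head_out[num_categories:num_categories + j]
    log_stds = head_out[num_categories + j:num_categories + 2 * j]
    # Factored Q: log-probabilities of the individual codes simply add up.
    total = log_softmax(logits)[category]
    for value, mean, log_std in zip(c_cont, means, log_stds):
        total += gaussian_log_pdf(value, mean, log_std)
    return total

# Example: 3 categories, 1 continuous code, uninformative head outputs.
print(log_q([0.0, 0.0, 0.0, 0.5, 0.0], category=1, c_cont=[0.5], num_categories=3))
```

Averaging this quantity over a batch of sampled $(c, x)$ pairs gives the Monte Carlo estimate of the expectation in $\mathcal{L}_I$.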

## Experiments
TODO