# Autoencoding beyond pixels using a learned similarity metric
* In short, this paper combines a variational autoencoder (VAE) with a generative adversarial network (GAN) to form the VAE-GAN model.

* The idea is to use the feature representations learned by the GAN discriminator in the reconstruction objective of the VAE. A plain VAE focuses on pixel-wise errors; measuring errors feature-wise instead should capture the data distribution better.

* Pixel-wise errors are a poor fit for image data because they do not model human visual perception well: a small image translation that a human would barely notice can produce a large pixel-wise error.

* They also show some disentanglement of latent factors.
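
The point about translations can be illustrated with a deliberately contrived toy example. The striped image and the helper names `mse` and `shift_right` are made up for this sketch and do not come from the paper; the stripes are chosen so that a one-pixel shift is the worst case.

```python
def mse(a, b):
    """Mean squared pixel-wise error between two equally sized 2D images."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)) / n


def shift_right(img, k=1):
    """Shift every row k pixels to the right, wrapping around at the border."""
    return [row[-k:] + row[:-k] for row in img]


# Hypothetical 8x8 image of vertical stripes with period 2.
stripes = [[float(c % 2) for c in range(8)] for _ in range(8)]

# The same stripes moved one pixel to the right.
shifted = shift_right(stripes, 1)

# A global 10% dimming, for comparison.
dimmed = [[0.9 * p for p in row] for row in stripes]

err_shift = mse(stripes, shifted)  # every pixel flips
err_dim = mse(stripes, dimmed)     # only bright pixels change, and only slightly
```

Here the one-pixel shift gives the maximum possible error for a binary image, while the dimming gives an error about two hundred times smaller, even though both images still clearly show the same stripes.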

## VAE recap
$z \sim Enc(x) = q(z\ \lvert\ x), \quad \hat{x} \sim Dec(z) = p(x\ \lvert\ z)$

The VAE loss consists of two parts:
* Regularization that keeps the latent posterior close to the prior over $z$ (often $p(z) = \mathcal{N}(0, \mathbf{I})$) via the KL divergence $\mathcal{L}_{prior} = KL(q(z\ \lvert\ x) \lVert p(z))$
* Reconstruction loss, which is the negative expected log likelihood $-\mathbb{E}_{q(z \lvert x)} \left[ \log\ p(x\ \lvert\ z) \right]$ (pixel-wise).
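
A minimal sketch of the two terms, under assumptions these notes do not state: the encoder outputs a diagonal Gaussian parameterized by `mu` and `log_var`, and the decoder likelihood is a unit-variance Gaussian. With those assumptions the KL term has a closed form and the reconstruction term reduces to a squared error up to a constant. The function names are illustrative.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def pixelwise_nll(x, x_hat):
    """-log N(x | x_hat, I) up to an additive constant (assumed decoder likelihood)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Hypothetical encoder output and reconstruction for a single example.
mu, log_var = [1.0, 0.0], [0.0, 0.0]
x, x_hat = [0.0, 1.0], [0.0, 0.0]

l_prior = kl_to_standard_normal(mu, log_var)
l_recon = pixelwise_nll(x, x_hat)
vae_loss = l_prior + l_recon
```

The expectation over $q(z \lvert x)$ is approximated here by a single reconstruction `x_hat`, which is the usual single-sample estimate.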

## GAN recap
* Two networks, generator and discriminator
    * Generator generates data samples from noise $z$.
    * Discriminator outputs probability that a sample is real (vs fake).
    * Binary cross entropy $\mathcal{L}_{GAN} = \log(D(x)) + \log(1 - D(G(z)))$, minimize for training G, maximize for training D.
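
As a rough sketch, the objective can be evaluated directly from discriminator outputs. The function below takes lists of probabilities rather than networks, and averaging over the batch is a choice made for this sketch; the name `gan_objective` is illustrative.

```python
import math


def gan_objective(d_real, d_fake):
    """Batch average of log D(x) + log(1 - D(G(z))).

    d_real: discriminator probabilities on real samples.
    d_fake: discriminator probabilities on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
at_chance = gan_objective([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator drives the objective towards 0 from below.
confident = gan_objective([0.9, 0.9], [0.1, 0.1])
```

The discriminator pushes this value up towards 0, while the generator pushes it down by making `d_fake` large.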

## VAE-GAN

### Architecture
<img src="figs/vaegan/vaegan.png" width="30%" height="30%">

### Losses
* A GAN discriminator learns a good feature representation of the data, which we want to reuse as the basis for a loss.
* They replace the pixel-wise reconstruction part of the VAE loss with a reconstruction loss expressed in the discriminator's feature space.
    * Pick a layer $D_l$ of the discriminator.
    * Choose a Gaussian as the distribution over the output at this layer, $p(D_l(x)\ \lvert\ z) = \mathcal{N}(D_l(x)\ \lvert\ D_l(\hat{x}), \mathbf{I})$
    * The new reconstruction error is $\mathcal{L}_{llike}^{D_l} = -\mathbb{E}_{q(z \lvert x)} \left[ \log\ p(D_l(x)\ \lvert\ z) \right]$
    * So we minimize the squared difference between the features $D_l(x)$ and $D_l(\hat{x})$, not between the pixels of $x$ and $\hat{x}$.
    * They call this learned similarity.
* They also use an augmented GAN loss, described under Training.
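
A minimal sketch of the feature-wise reconstruction error. With an identity-covariance Gaussian, the negative log likelihood is half the squared distance between the feature vectors, up to a constant. `toy_features` is a hand-written stand-in for $D_l$ (averaging adjacent pairs of inputs), not a learned discriminator layer, and is only there to make the example runnable.

```python
def feature_recon_loss(features, x, x_hat):
    """0.5 * ||D_l(x) - D_l(x_hat)||^2, i.e. -log N(D_l(x) | D_l(x_hat), I) up to a constant."""
    fx, fx_hat = features(x), features(x_hat)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(fx, fx_hat))


def toy_features(v):
    """Hypothetical stand-in for D_l: average of each adjacent pair of inputs."""
    return [(v[i] + v[i + 1]) / 2.0 for i in range(0, len(v) - 1, 2)]


x = [0.0, 1.0, 0.0, 1.0]
x_hat = [1.0, 0.0, 1.0, 0.0]  # x shifted by one position

# Pixel-wise error (identity features) versus feature-wise error.
pixel_loss = feature_recon_loss(lambda v: v, x, x_hat)
feat_loss = feature_recon_loss(toy_features, x, x_hat)
```

In this toy case the pixel-wise loss is large while the feature-wise loss is zero, because the stand-in features happen to ignore the shift. Whether a real discriminator layer behaves this way depends on what it learns.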
    
### Training
* The discriminator sees generated samples decoded from both the prior $p(z)$ and the encoder posterior $q(z \lvert x) = Enc(x)$.
    * $\mathcal{L}_{GAN} = \log(D(x)) + \log(1 - D(G(z))) + \log(1 - D(G(Enc(x))))$
* Error signals are limited to the relevant network by using only parts of the whole loss for the encoder, decoder/generator, and discriminator.
    * Encoder is trained by minimizing $\mathcal{L}_{prior} + \mathcal{L}_{llike}^{D_l}$ w.r.t. $\theta_{encoder}$.
    * Decoder/Generator is trained by minimizing $\gamma \mathcal{L}_{llike}^{D_l} + \mathcal{L}_{GAN}$ w.r.t. $\theta_{decoder}$. Since $\log(D(x))$ does not depend on the decoder, minimizing $\mathcal{L}_{GAN}$ here amounts to minimizing $\log(1 - D(G(z))) + \log(1 - D(G(Enc(x))))$.
    * Discriminator is trained by maximizing $\mathcal{L}_{GAN}$ w.r.t. $\theta_{discriminator}$.
* The weight $\gamma$ balances the reconstruction term against the GAN term in the decoder update; the paper interprets it as a weight between style and content.
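
The split of the loss across the three networks can be sketched as follows, using the sign convention of these notes (the discriminator maximizes $\mathcal{L}_{GAN}$, so it minimizes its negative). The scalar inputs stand in for values computed by real networks, and the numbers, including `gamma`, are hypothetical.

```python
import math


def augmented_gan_objective(d_real, d_prior, d_recon):
    """log D(x) + log(1 - D(G(z))) + log(1 - D(G(Enc(x)))) for one example."""
    return math.log(d_real) + math.log(1.0 - d_prior) + math.log(1.0 - d_recon)


def per_network_losses(l_prior, l_llike, l_gan, gamma):
    """Quantity each network minimizes with respect to its own parameters only."""
    return {
        "encoder": l_prior + l_llike,
        "decoder": gamma * l_llike + l_gan,
        "discriminator": -l_gan,  # maximizing L_GAN == minimizing -L_GAN
    }


# Hypothetical values for a single training step.
l_gan = augmented_gan_objective(d_real=0.5, d_prior=0.5, d_recon=0.5)
losses = per_network_losses(l_prior=1.0, l_llike=2.0, l_gan=l_gan, gamma=0.5)
```

In an autodiff framework each entry would be differentiated only with respect to the matching parameter set, which is what keeps, for example, the GAN signal out of the encoder.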

<img src="figs/vaegan/vaegan-pseudocode.png" width="36%" height="36%">

### Experiments
* 64x64 images.
* They state, as many others do, that log likelihood measures do not correlate with visual fidelity.
* Fractional striding ("deconvolutions") with stride 2
* RMSProp with learning rate 3e-4
* Batch size 64

#### Visual attribute vectors, CelebA
* Images in dataset are annotated with binary attributes (glasses, pale skin, etc).
* Aligned dataset
* They inspect the latent space to find directions corresponding to semantic features.
    * Use encoder to get latent representation.
    * For each binary attribute:
        * Compute mean representation vectors of all images with and without the attribute.
        * The visual attribute vector is the difference of the two mean vectors.
    * Not perfect, but captures most of the attributes.
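
The attribute-vector procedure above can be sketched as below. The latent codes and the `has_glasses` labels are made-up toy data, and the helper names are illustrative; in the paper the latents would come from the trained encoder. The last step, adding the vector to a latent code before decoding, is shown as one plausible way to use the result.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attribute_vector(latents, has_attribute):
    """Mean latent of images with the attribute minus mean latent of images without."""
    with_attr = [z for z, flag in zip(latents, has_attribute) if flag]
    without_attr = [z for z, flag in zip(latents, has_attribute) if not flag]
    mu_with = mean_vector(with_attr)
    mu_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mu_with, mu_without)]


# Hypothetical 2D latent codes and binary labels for one attribute.
latents = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
has_glasses = [True, True, False, False]

v_glasses = attribute_vector(latents, has_glasses)

# Move a latent code along the attribute direction (would then be decoded).
z = [0.0, 1.0]
z_edited = [z_i + v_i for z_i, v_i in zip(z, v_glasses)]
```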

#### Attribute similarity, Labeled faces in the wild
* Align faces as preprocessing
* Concatenate attribute vector to inputs of encoder, decoder, and discriminator. For encoder and discriminator, concat to first fully connected layer.
* Have a regression network (similar architecture to the encoder) to predict attributes.
* The paper does not give many details.

#### Supervised tasks, CIFAR-10, STL-10
* Semisupervised setup
* Pretrain unsupervised
* Not sure how they use the label
* Results are not competitive.