# Progressive Growing of GANs for Improved Quality, Stability, and Variation
This paper
* Proposes a new method for training GANs that both speeds up and stabilizes training. It is also orthogonal to other stabilization methods.
* Proposes a new method to increase variation in the learned distribution.
* Suggests a new metric for evaluating GANs.
* Evaluates on standard datasets (SoTA on CIFAR10) and on a new high-resolution dataset, with great results.

## Introduction
* They point out a common GAN training problem
    * Gradients from "distance" metrics between the training data distribution and the generated distribution
    can point in more or less random directions if there is little to no overlap between these distributions.
* Other papers have suggested ways to improve this
    * By using other distance metrics or otherwise adding constraints that prevent it (WGAN/WGAN-GP, BGAN, LSGAN, EBGAN, etc).
* The problem gets worse at higher resolutions, because generated images are easier to tell apart from real ones, which makes the discriminator's job too easy.
    
## Progressive Growing of GANs
In short
* Both generator and discriminator are initially defined to produce/handle very low resolution.
* Then layers that produce/handle increased resolution are progressively added to the respective networks during training. Both networks grow at the same time.
* The idea is that this allows for first learning about large scale structure and then later having to focus on small details.
    * A form of curriculum learning
    * Intuitively it's simpler to turn a 512x512 image into a 1024x1024 one than it is to map the initial latent vector directly to the full resolution image.
    * Initially fewer modes.
* The generator and discriminator have the same (but flipped) architecture.
* New layers are "faded in" by
    * First scaling the output of the previous last layer to the new resolution, via nearest neighbor upsampling in the generator and average pooling in the discriminator.
    * Then adding a bypass connection (like a residual block) around the new layer and taking a weighted sum of the old (up-/downscaled) output and the new layer's output.
    * $(1 - \alpha) \cdot \text{old} + \alpha \cdot \text{new}$
    * $\alpha$ is increased linearly from 0 to 1 over a number of batches.
* At each resolution a 1x1 convolution layer converts the layer output to RGB (generator) or from RGB (discriminator), so the layers themselves can use more than 3 filters.
* When training the discriminator, real images are downscaled to fit the current resolution.
    * During fade-in of a new layer, the same weighted-sum technique is applied to the real images too.
    
<img src="../figs/prog_gan/proggan.png">
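
The fade-in step can be sketched in a few lines of NumPy. This is only an illustration of the weighted sum, not the authors' code: the function names, the `(N, C, H, W)` layout, the step counts, and the random arrays standing in for network outputs are all assumptions. The blending of real images reflects one reading of "the same technique": mix the full-resolution image with a copy that was downscaled and upscaled again.

```python
import numpy as np

def upsample_nearest_2x(x):
    """Double H and W of an (N, C, H, W) array by repeating each pixel."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def downsample_avg_2x(x):
    """Halve H and W of an (N, C, H, W) array with 2x2 average pooling."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def fade_in(old, new, alpha):
    """Weighted sum (1 - alpha) * old + alpha * new."""
    return (1.0 - alpha) * old + alpha * new

def alpha_schedule(step, fade_steps):
    """Linear ramp from 0 to 1 over `fade_steps` steps, then held at 1."""
    return min(1.0, step / fade_steps)

rng = np.random.default_rng(0)

# Generator side: stand-ins for the RGB output of the old last layer (8x8)
# and of the newly added layer (16x16).
old_rgb = rng.standard_normal((4, 3, 8, 8))
new_rgb = rng.standard_normal((4, 3, 16, 16))

alpha = alpha_schedule(step=250, fade_steps=1000)  # hypothetical step counts
blended = fade_in(upsample_nearest_2x(old_rgb), new_rgb, alpha)

# Real images during fade-in: mix the full-resolution image with a
# low-resolution copy brought back to the same size.
real = rng.standard_normal((4, 3, 16, 16))
real_low = upsample_nearest_2x(downsample_avg_2x(real))
real_blended = fade_in(real_low, real, alpha)
```

With `alpha = 0` the new layer is ignored entirely, and with `alpha = 1` the bypass contributes nothing, so the transition never changes the output abruptly.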

## Minibatch Standard Deviation
To increase variation in the generated distribution they use a variant of *minibatch discrimination*.
* *minibatch discrimination*: add the following layer at the end of the discriminator
    * Idea: Compute feature statistics at some layer across a minibatch to pass this information to discriminator.
    * Take the features of all samples and map them through shared weights to produce a separate set of statistics, such that each sample has information from all other samples in the minibatch.
    * These are concatenated to the individual features.
* *This variant*
    * No learnable parameters or hyperparameters.
    * Compute the standard deviation of each feature at each spatial position across the minibatch.
    * Average these over all features and spatial positions into a single scalar.
    * Replicate this scalar to form a new feature map that is concatenated to every sample in the minibatch.
    * Best results if inserted towards the end of the discriminator.
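
The parameter-free variant above can be written as a short NumPy function. This is a minimal sketch under assumptions: the function name and the `(N, C, H, W)` layout are illustrative, and the population standard deviation (NumPy's default) is used because the notes don't specify which estimator is meant.

```python
import numpy as np

def minibatch_stddev(x):
    """Append one constant feature map holding the batch-wide average std.

    x has shape (N, C, H, W); the result has shape (N, C + 1, H, W).
    """
    n, _, h, w = x.shape
    # Std of each feature at each spatial position, taken across the minibatch.
    std = x.std(axis=0)                      # shape (C, H, W)
    # Average over all features and positions into a single scalar.
    mean_std = std.mean()
    # Replicate the scalar into one extra feature map per sample.
    extra = np.full((n, 1, h, w), mean_std, dtype=x.dtype)
    return np.concatenate([x, extra], axis=1)

# Two samples that differ everywhere by exactly 2: the per-position std is 1.
x = np.concatenate([np.ones((1, 2, 4, 4)), -np.ones((1, 2, 4, 4))])
out = minibatch_stddev(x)
```

If the generator collapses to near-identical samples, the extra map goes towards zero, which gives the discriminator an easy cue that the batch is fake.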
    
## Normalization in Generator and Discriminator
* The competition between generator and discriminator can sometimes cause escalation of signal magnitudes.
* Usually combated with normalization techniques, like batch normalization or layer normalization, in the generator (and sometimes the discriminator).
* They theorize that covariate shift (which these techniques were originally intended to deal with) is not an issue for GANs.
* Instead just focus on constraining signal magnitudes.
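
The notes above don't say how the magnitudes are constrained. One parameter-free option, which to my recollection matches what the paper calls pixelwise feature vector normalization in the generator, is to rescale the feature vector at every pixel to unit root-mean-square. The sketch below assumes an `(N, C, H, W)` layout and a small `eps` for numerical safety; both the function name and the `eps` value are illustrative choices.

```python
import numpy as np

def pixel_norm(x, eps=1e-8):
    """Scale the C-dimensional feature vector at each pixel to unit RMS.

    x has shape (N, C, H, W); the output has the same shape.
    """
    # Mean of squares over the channel axis, kept as a broadcastable axis.
    mean_sq = np.mean(x ** 2, axis=1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)

rng = np.random.default_rng(0)
features = 50.0 * rng.standard_normal((2, 16, 4, 4))  # deliberately large values
normed = pixel_norm(features)
```

Unlike batch normalization this has no learned scale or running statistics. It only bounds the magnitude of the activations, which is the stated goal.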

## Evaluation Metrics

### Multi-Scale Structural Similarity (MS-SSIM)
* 


### Sliced Wasserstein Distance (SWD)
