# Improved Techniques for Training GANs
This paper presents several techniques that help stabilize GAN training, which is often very difficult. Training a GAN means finding a Nash equilibrium of the adversarial game, but in practice we use standard gradient descent, which is designed to minimize a cost rather than find an equilibrium and can fail to converge. The techniques presented here aim to reduce that risk.

## Training tips

### Feature matching
Instead of training G to maximize D's output on generated samples, they propose an objective for G that minimizes the distance between real and generated samples in the feature space of some intermediate layer of D.

Specifically, they pick a layer $l$ in D, take its output for real samples $D_l(x)$ and for generated samples $D_l(G(z))$, and minimize the squared Euclidean distance between the two averages: $\lVert \mathbb{E}_x D_l(x) - \mathbb{E}_z D_l(G(z)) \rVert_2^2$.

The reasoning is that D is trained to find the features that best separate real from generated data, so training G to match the statistics of those features on real data targets exactly what D uses to tell the two apart.
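As a rough illustration, the feature matching objective can be sketched in NumPy. This assumes the features of layer $l$ have already been extracted into arrays of shape `(batch, feature_dim)` and that the expectations are approximated by batch means; the function name and the random stand-in features are made up for the example.

```python
import numpy as np

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between the batch-mean features of real and generated samples.

    Both inputs are assumed to be activations of the same intermediate
    layer of D, with shape (batch, feature_dim).
    """
    mean_real = real_features.mean(axis=0)
    mean_fake = fake_features.mean(axis=0)
    return float(np.sum((mean_real - mean_fake) ** 2))

# Random stand-ins for D_l(x) and D_l(G(z)); a real setup would take these from D.
rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(64, 8))
fake = rng.normal(loc=0.0, size=(64, 8))
print(feature_matching_loss(real, fake))
```

In an actual training loop this value would be the loss for G only, while D keeps its usual objective.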

### Minibatch discrimination
D normally judges each image on its own, so it cannot notice when all of G's outputs look alike. They therefore let D look at the other images in the same minibatch to some extent, which can steer the gradients away from mode collapse.

They take the output of D at an intermediate layer $l$ as the features of each image $x_i$ in the batch and call it $f(x_i)$.

The feature vector $f(x_i) \in \mathbb{R}^A$ is multiplied by a tensor $T \in \mathbb{R}^{A \times B \times C}$ to get a matrix $M_i \in \mathbb{R}^{B \times C}$ for each image.

They define 
 * $c_b(x_i, x_j) = \exp(-\lVert M_{i, b} - M_{j, b} \rVert_{L_1})$ where $M_{i, b}$ is row $b$ in $M_i$
 * $o(x_i)_b = \sum_{j=1}^n c_b(x_i, x_j)$
 * $o(x_i) = [o(x_i)_1, o(x_i)_2, \dotsc, o(x_i)_B]$

They then concatenate $o(x_i)$ and $f(x_i)$ and feed the result into the next layer of D.

These minibatch features are computed separately for batches of real samples and for batches of generated samples.
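The definitions above can be sketched in NumPy as follows. This is only a minimal forward pass under the notation of these notes: `f` holds the features $f(x_i)$ as rows, `T` is the tensor of shape `(A, B, C)`, and the function names are made up for the example. In a real model `T` would be a learned parameter.

```python
import numpy as np

def minibatch_features(f, T):
    """Compute o(x_i) for every sample in the batch.

    f: (n, A) intermediate features of D, one row per sample.
    T: (A, B, C) tensor.
    Returns an array of shape (n, B).
    """
    # M[i] = f[i] contracted with T over the A axis, giving shape (n, B, C).
    M = np.einsum("na,abc->nbc", f, T)
    # L1 distance between row b of M_i and row b of M_j, shape (n, n, B).
    l1 = np.abs(M[:, None, :, :] - M[None, :, :, :]).sum(axis=3)
    c = np.exp(-l1)
    # Sum over j; the j = i term contributes exp(0) = 1, as in the sum from 1 to n.
    return c.sum(axis=1)

def augment(f, T):
    """Concatenate f(x_i) with o(x_i), ready for the next layer of D."""
    return np.concatenate([f, minibatch_features(f, T)], axis=1)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 5))       # n = 4 samples, A = 5 features
T = rng.normal(size=(5, 3, 2))    # B = 3, C = 2
print(augment(f, T).shape)
```

When all samples in a batch are identical, every $c_b$ equals 1 and each entry of $o(x_i)$ equals the batch size, which is the kind of signal that lets D detect a collapsed generator.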

### Historical averaging
They add a penalty term to the cost that punishes parameter values that drift far from their average over recent timesteps.

They define it as $\lVert \theta - \frac{1}{T} \sum_{i=1}^T \theta_{(i)} \rVert^2$ where $\theta_{(i)}$ are the parameter values at previous timesteps and $T$ is the number of steps back to account for.

This is done for both G and D.
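A minimal sketch of the penalty, assuming the parameters are flattened into a single NumPy vector and the last $T$ values are kept in a fixed-size window. The class name and the `window` argument are made up for the example.

```python
from collections import deque

import numpy as np

class HistoricalAverage:
    """Keeps the last `window` parameter vectors and scores drift from their mean."""

    def __init__(self, window):
        self.history = deque(maxlen=window)

    def penalty(self, theta):
        # Squared L2 distance between theta and the mean of the stored history.
        if not self.history:
            return 0.0
        mean = np.mean(np.stack(list(self.history)), axis=0)
        return float(np.sum((theta - mean) ** 2))

    def record(self, theta):
        # Store a copy so later in-place updates of theta do not alter the history.
        self.history.append(np.array(theta, copy=True))

avg = HistoricalAverage(window=2)
avg.record(np.array([0.0, 0.0]))
avg.record(np.array([2.0, 2.0]))
print(avg.penalty(np.array([2.0, 1.0])))
```

In training, one such tracker would be kept per network and the penalty added to that network's loss, typically scaled by a weight that is a hyperparameter not specified in these notes.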

### One sided label smoothing
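Label smoothing is not a new idea: it replaces the hard 0 and 1 targets of a classifier/discriminator with smoothed values such as 0.1 and 0.9.

For GANs only the positive labels are smoothed (for example to 0.9) and the 0 labels for generated samples are left at 0. With targets $\alpha$ for real and $\beta$ for generated data, the optimal discriminator is $D(x) = \frac{\alpha p_{data}(x) + \beta p_{model}(x)}{p_{data}(x) + p_{model}(x)}$, so a nonzero $\beta$ keeps $p_{model}$ in the numerator and gives generated samples that are far from the data little incentive to move closer to it.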
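As a sketch, one-sided smoothing only changes the target used in the cross-entropy for real samples. The function below assumes `d_real` and `d_fake` are D's output probabilities on real and generated batches; the name `discriminator_loss` and the default target of 0.9 are illustrative choices.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, real_target=0.9, eps=1e-12):
    """Binary cross-entropy for D with smoothed targets on real samples only.

    d_real, d_fake: D's probabilities of "real" for real and generated samples.
    Generated samples keep the hard target 0.
    """
    real_term = -(real_target * np.log(d_real + eps)
                  + (1.0 - real_target) * np.log(1.0 - d_real + eps))
    fake_term = -np.log(1.0 - d_fake + eps)
    return float(real_term.mean() + fake_term.mean())

d_fake = np.array([0.1, 0.2])
# An overconfident D on real data is penalized more than one that outputs 0.9.
print(discriminator_loss(np.array([0.9, 0.9]), d_fake))
print(discriminator_loss(np.array([0.99, 0.99]), d_fake))
```

With `real_target=1.0` this reduces to the standard discriminator loss.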

### Virtual batch normalization
Normal batch normalization normalizes each sample in the batch based on statistics (mean, variance) collected from the same batch.

In their proposed virtual batch normalization (VBN) they instead normalize with respect to a reference batch which is picked once at the start.

Then, at each update step during training, both the current batch and the reference batch are fed through the network. Statistics are collected from the activations of the reference batch and used to normalize the current batch.

This is expensive (two forward propagations), so they only use it in G.
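A simplified sketch of the normalization step for a single layer, following the description in these notes where the statistics come from the reference batch alone. The paper additionally folds each example's own activation into the statistics, which is left out here, as are the learned scale and shift; the function name is made up.

```python
import numpy as np

def virtual_batch_norm(x, ref, eps=1e-5):
    """Normalize activations x with the mean and variance of a fixed reference batch.

    x:   (batch, features) activations of the current batch.
    ref: (ref_batch, features) activations of the reference batch at the same layer.
    """
    mean = ref.mean(axis=0)
    var = ref.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
ref = rng.normal(loc=3.0, scale=2.0, size=(32, 4))  # chosen once, then kept fixed
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
out = virtual_batch_norm(x, ref)
print(out.shape)
```

Because the statistics no longer depend on the current batch, the output for one sample is unaffected by which other samples share its minibatch.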

## Evaluation tip
They suggest classifying the generated images with the Inception model and computing a score from its predictions, which they find correlates well with human judgement.

Low entropy in the softmax output $p(y\ |\ x)$ (high probability for a single class) is good and gives a higher score.

The score also rewards variety in the output of G, so the marginal $p(y) = \int p(y\ |\ x=G(z))dz$ should have high entropy.

The Inception score is then defined as $\exp \left( \mathbb{E}_x \left[ KL(p(y\ |\ x)\ \|\ p(y)) \right] \right)$
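Given the class probabilities for a set of generated images, the score itself is a few lines of NumPy. The sketch below assumes `p_yx` already holds the softmax outputs of the classifier (one row per image) and approximates $p(y)$ by the mean over those rows; running the Inception model is not shown.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    p_yx: (n_samples, n_classes) array whose rows are p(y|x) for generated images.
    """
    p_y = p_yx.mean(axis=0, keepdims=True)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_and_varied = np.eye(4)          # each image clearly a different class
uninformative = np.full((4, 4), 0.25)     # classifier cannot tell anything
print(inception_score(confident_and_varied))
print(inception_score(uninformative))
```

The two toy inputs show the extremes: confident predictions spread evenly over the classes give a score close to the number of classes, while uniform predictions give a score close to 1.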

## Semisupervision
For training data consisting of K classes they suggest having a classifier $p_{model}$ with K+1 outputs where the extra dimension corresponds to a new "generated" class.

This allows for learning from both unlabeled and labeled data. For labeled samples, maximize the likelihood for the corresponding label. For unlabeled real samples, minimize the likelihood for the "generated" class $K+1$. For generated samples, maximize the likelihood for the "generated" class $K+1$.

The losses are as follows
* $L_{supervised} = -\mathbb{E}_{x,y \sim p_{data}(x, y)} \left[ \log\ p_{model}(y\ |\ x, y < K+1) \right]$
* $L_{unsupervised} = -\left( \mathbb{E}_{x \sim p_{data}(x)} \left[ \log\ (1 - p_{model} (y=K+1\ |\ x)) \right] + \mathbb{E}_{x \sim G} \left[ \log\ p_{model} (y=K+1\ |\ x) \right] \right)$
* $L = L_{supervised} + L_{unsupervised}$

$D(x) = 1 - p_{model} (y=K+1\ |\ x)$ corresponds to D in the standard GAN setup.
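The two losses can be sketched in NumPy from the raw classifier logits. The sketch assumes logits of shape `(n, K+1)` with the "generated" class stored in the last column, labels given as 0-based integers in `[0, K)`, and expectations approximated by batch means; the function names are made up.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def supervised_loss(logits, labels):
    """-E[log p(y | x, y < K+1)]: renormalize over the K real classes only."""
    logp = log_softmax(logits[:, :-1])
    return float(-logp[np.arange(len(labels)), labels].mean())

def unsupervised_loss(logits_real, logits_fake):
    """-(E_real[log(1 - p(K+1 | x))] + E_fake[log p(K+1 | x)])."""
    p_real = np.exp(log_softmax(logits_real))
    # 1 - p(K+1 | x) is the total probability of the K real classes.
    log_not_generated = np.log(p_real[:, :-1].sum(axis=1))
    log_generated = log_softmax(logits_fake)[:, -1]
    return float(-(log_not_generated.mean() + log_generated.mean()))

# K = 3 real classes plus one "generated" class.
labeled_logits = np.array([[4.0, 0.0, 0.0, -2.0]])
labels = np.array([0])
fake_logits = np.array([[-2.0, -2.0, -2.0, 4.0]])
total = supervised_loss(labeled_logits, labels) + unsupervised_loss(labeled_logits, fake_logits)
print(total)
```

Here the labeled batch is reused as the real batch for the unsupervised term; with unlabeled data available, that term would also be evaluated on the unlabeled samples.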

TODO: there is some more to this

They say that semisupervision consistently seems to give better results, and speculate that this is because it biases D towards an internal representation that emphasizes the same features humans emphasize.

## Experiments

### MNIST
D and G both with 5 hidden layers. Weight normalization. Gaussian noise added at every layer of D.

Semisupervised training with different fractions of labeled samples from the dataset. They report the test error and show some generated samples with different setups.

They find that
* Semisupervision + feature matching gives visually worse samples but pretty good classification results
* Using minibatch discrimination instead of feature matching gives visually better samples
* Minibatch discrimination + semisupervision gives worse classification than semisupervision + feature matching

### CIFAR-10
D is 9 layer CNN with dropout and weight normalization. G is 4 layer CNN (conv transposes I guess?) with batch normalization.

They do some ablation experiments here to see which of the training tips is most valuable. They find that removing minibatch discrimination gives the biggest drop in Inception score. With all their methods combined, the results are visually very good.

### SVHN
Same network setups as for CIFAR-10.

### ImageNet
They use the DCGAN architecture.

ImageNet is very difficult to learn a generative model for (128x128 and 1000 classes).

Using their proposed techniques they improve on the previous DCGAN results and at least learn some recognizable animal features such as eyes, fur and texture. The larger structure of the images is still wrong though.