# Generative Adversarial Networks

## Basics
This paper introduces a new framework for fitting generative models. The goal is to find a model that is close to some data distribution from which we want to generate more examples. The training set consists of samples from this distribution.

In essence this is done by training two models at the same time in an adversarial process. One is a generative model $G$ that tries to capture the data distribution and one is a discriminative model $D$ that tries to distinguish whether a sample came from the training data or from $G$, i.e. output a high probability for $x \sim p_{data}(x)$ (training data) and low probability for $x \sim p_G(x)$. 

The training goal of $G$, which steers the direction of its weight updates, is to generate samples that fool $D$. Informally, $G$ wants to generate samples $x$ to which $D$ assigns a high probability. This can be expressed as the following minimax game.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log (1 - D(G(z))) \right]$$

A solution to this game exists where $G$ perfectly models the training data distribution and $D$ outputs $\frac{1}{2}$ for every $x$, since it can no longer tell real samples from generated ones.
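As a small numeric check of this claim, the sketch below evaluates the objective on a made-up three-point discrete distribution. It relies on the closed form for the best discriminator against a fixed generator, $D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$, which the paper derives. The distributions and function names are illustrative assumptions, not taken from the paper.

```python
import math

def optimal_discriminator(p_data, p_g):
    """Best response D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed G."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value(p_data, p_g, d):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_g}[log(1 - D(x))] on a finite support."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d))
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d))
    return real_term + fake_term

p_data = [0.2, 0.5, 0.3]     # made-up data distribution
p_g_bad = [0.6, 0.3, 0.1]    # generator that does not match the data yet
p_g_good = list(p_data)      # generator that matches the data exactly

d_star = optimal_discriminator(p_data, p_g_good)
v_good = value(p_data, p_g_good, d_star)
v_bad = value(p_data, p_g_bad, optimal_discriminator(p_data, p_g_bad))
```

With the matching generator every entry of `d_star` is 0.5 and `v_good` equals $-\log 4$. `v_bad` is larger, so a generator that does not match the data can still lower the objective.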

The authors point out a good conceptual model of this: the generative model $G$ is a money counterfeiter and the discriminative model $D$ is the police, who try to tell real bills from fake ones. The competition between them forces both to become better. This drives the generative model towards the goal of modeling the actual data distribution.

### Problems with other generative models
Before this paper, getting good results with deep generative models was difficult, and they had seen less success than deep discriminative models (e.g. input an image, get a label out). This is mostly because of intractable computations for posterior distributions. These are usually handled with sampling techniques like *Markov chain Monte Carlo* (MCMC) or approximation techniques like *variational inference*, both of which have pros and cons. This paper at least offers an alternative for generative models.

### Specifics of GAN
The adversarial idea works for all (?) generative models, but the work in this paper uses neural networks. This special case is called *adversarial nets*, and it can be trained using backpropagation and dropout.

For adversarial nets we first define a prior $p_z(z)$ on input variables $z$ to $G$ which are just noise.

The network then represents a mapping $G(z, \theta_g)$ to the data space of things we want to generate. $G(z, \theta_g)$ is differentiable which makes backpropagation possible. $\theta_g$ are the parameters of this network.

We also have a second neural network $D(x, \theta_d)$ which is a mapping from data space to a scalar value that represents the probability that $x$ came from the actual data distribution rather than from $p_g$ (from $G$). $D(x)$ should be high for samples $x$ which are from the training data and low otherwise.
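The two mappings can be sketched as plain functions. This is a minimal illustration, assuming a one-dimensional data space, a single affine layer for $G$ and a logistic unit for $D$. The paper uses multilayer networks, and all names and parameter values here are hypothetical.

```python
import math
import random

def generator(z, theta_g):
    """G(z; theta_g): map noise z to data space (here a single affine layer)."""
    scale, shift = theta_g
    return scale * z + shift

def discriminator(x, theta_d):
    """D(x; theta_d): probability in (0, 1) that x came from the data."""
    weight, bias = theta_d
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

rng = random.Random(0)
theta_g = (1.0, 0.0)     # hypothetical initial generator parameters
theta_d = (0.5, -1.0)    # hypothetical initial discriminator parameters

z = rng.gauss(0.0, 1.0)                   # z ~ p_z(z), assumed standard normal
x_fake = generator(z, theta_g)            # a sample from p_g
p_real = discriminator(x_fake, theta_d)   # D's belief that x_fake is real
```

Both functions are differentiable in their parameters, which is what allows gradients to flow from $D$'s output back to $\theta_g$.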

$D$ is trained to maximize the probability of assigning the correct label to both samples drawn from the training data and samples drawn from $G$. At the same time, $G$ is trained to minimize $\log(1 - D(G(z)))$. In other words, $G$ is trained to create samples that $D$ classifies as real, since $D(G(z))$ close to 1 makes the expression small.

This is expressed as the following minimax objective, which the loss functions in an implementation are based on. In practice it is optimized iteratively, with the two networks balancing each other towards an equilibrium.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log (1 - D(G(z))) \right]$$

#### Training in practice
The training is done by iteratively updating the weights of both networks.

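If we instead used a nested loop that optimized $D$ to completion in the inner loop for every update of $G$ in the outer loop, it would be computationally expensive and, on a finite dataset, would result in overfitting.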

Instead training is done by $k$ steps of updates to $D$ followed by 1 step of update to $G$. In this way, $D$ is kept almost optimal for its task while $G$ slowly becomes better at generating samples close to the data distribution. When one model is updated the other is kept fixed.

In practice, the gradient from the minimax objective might not work well for learning. When $G$ is still very bad early on, $D$ can reject its samples with high confidence, so $\log (1 - D(G(z)))$ saturates and gives almost no gradient to $G$. Thus, it can be a good idea to initially update $G$ by instead *maximizing* $\log (D(G(z)))$, which gives stronger gradients early in training.
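To see the saturation concretely, write $D(G(z)) = \sigma(a)$, where $a$ is the discriminator's pre-sigmoid output (assuming $D$ ends in a sigmoid unit, which is an assumption of this sketch). Then $\frac{d}{da} \log(1 - \sigma(a)) = -\sigma(a)$ and $\frac{d}{da} \log \sigma(a) = 1 - \sigma(a)$. The snippet below compares the two for a confident rejection (very negative $a$).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a); the minimax generator loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a); the alternative objective."""
    return 1.0 - sigmoid(a)

# a << 0 means D is confident that the generated sample is fake.
for a in (-10.0, -5.0, 0.0):
    print(f"a={a:6.1f}  D(G(z))={sigmoid(a):.5f}  "
          f"minimax grad={grad_saturating(a):+.5f}  "
          f"alternative grad={grad_non_saturating(a):+.5f}")
```

When $D(G(z))$ is near 0, the minimax gradient is near 0 as well, while the alternative gradient is near 1. Minimizing the first and maximizing the second both push $a$ upwards, so the direction agrees and only the strength of the signal differs.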

The following image shows how both nets learn incrementally. The black dotted line is the data distribution, the green line is the generated distribution of $G$, and the blue dashed line is the output of $D$. The $z$'s are sampled from the bottom horizontal line and then mapped via $G$ to $x$, which are distributed according to the green line.

In the fourth plot, $G$ matches the data distribution perfectly and thus $D$ can't distinguish between them and says it's 50/50.

<img src="figs/GAN/gan-training.png" width="80%" height="80%">


**Pseudo code**
```
while not done:
    repeat k times:
        sample m noise inputs z_1,...,z_m from the noise prior p_z(z)
        sample m data examples x_1,...,x_m from the data distribution (training data)
        update discriminator D by ascending the gradient Eq1

    sample m noise inputs z_1,...,z_m from the noise prior p_z(z)
    update generator G by descending the gradient Eq2
        (or, in early iterations, by ascending the gradient of log D(G(z_i)))
```

where Eq1 and Eq2 are defined below 

$$\mathrm{Eq1}:\, \nabla_{\theta_d} \frac{1}{m} \sum^m_{i=1} \left[ \log D(x_i) + \log(1 - D(G(z_i))) \right] \quad \mathrm{Eq2}:\, \nabla_{\theta_g} \frac{1}{m} \sum^m_{i=1} \log(1 - D(G(z_i)))$$
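The loop can be sketched in plain Python for a toy one-dimensional problem. Everything specific here is an assumption made for illustration: a linear generator $G(z) = az + b$, a logistic discriminator $D(x) = \sigma(wx + c)$, training data drawn from $N(3, 1)$, hand-derived gradients and arbitrary hyperparameters. It is not the paper's architecture, and such a small setup is not guaranteed to converge.

```python
import math
import random

def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def d_out(x, theta_d):
    w, c = theta_d
    return sigmoid(w * x + c)

def g_out(z, theta_g):
    a, b = theta_g
    return a * z + b

def d_gradient(xs, zs, theta_d, theta_g):
    """Eq1: gradient of mean[log D(x) + log(1 - D(G(z)))] w.r.t. (w, c)."""
    gw = gc = 0.0
    for x, z in zip(xs, zs):
        fake = g_out(z, theta_g)
        r = 1.0 - d_out(x, theta_d)    # d/dlogit of log D(x)
        f = -d_out(fake, theta_d)      # d/dlogit of log(1 - D(fake))
        gw += r * x + f * fake
        gc += r + f
    return gw / len(xs), gc / len(xs)

def g_gradient(zs, theta_d, theta_g):
    """Eq2: gradient of mean[log(1 - D(G(z)))] w.r.t. (a, b)."""
    w, _ = theta_d
    ga = gb = 0.0
    for z in zs:
        fake = g_out(z, theta_g)
        dx = -d_out(fake, theta_d) * w   # d/dx of log(1 - D(x)) at x = fake
        ga += dx * z                     # dG/da = z
        gb += dx                         # dG/db = 1
    return ga / len(zs), gb / len(zs)

def train(steps=200, k=1, m=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta_d, theta_g = (0.1, 0.0), (1.0, 0.0)   # arbitrary starting values
    for _ in range(steps):
        for _ in range(k):
            zs = [rng.gauss(0.0, 1.0) for _ in range(m)]   # noise prior
            xs = [rng.gauss(3.0, 1.0) for _ in range(m)]   # toy "training data"
            gw, gc = d_gradient(xs, zs, theta_d, theta_g)
            theta_d = (theta_d[0] + lr * gw, theta_d[1] + lr * gc)   # ascend Eq1
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ga, gb = g_gradient(zs, theta_d, theta_g)
        theta_g = (theta_g[0] - lr * ga, theta_g[1] - lr * gb)       # descend Eq2
    return theta_d, theta_g
```

Calling `train()` returns the final `(w, c)` and `(a, b)`. Note that $G$ is held fixed while $D$ is updated and vice versa, and that this sketch only uses the saturating Eq2, not the alternative objective for early iterations.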


#### The noise prior input p(z)
The paper doesn't say what kind of distribution they use in their experiments, but from what I've read elsewhere it seems to often be a normal or uniform distribution.
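Sampling from either kind of prior is simple. The helper below is a hypothetical sketch: the function name, the unit variance and the $[-1, 1]$ range are assumptions, not values taken from the paper.

```python
import random

def sample_noise(m, dim, kind="normal", rng=random):
    """Draw m noise vectors z of length dim from an assumed prior p_z."""
    if kind == "normal":
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
    if kind == "uniform":
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(m)]
    raise ValueError(f"unknown prior: {kind}")

# Seeded generator so the batch is reproducible.
batch = sample_noise(m=4, dim=2, kind="uniform", rng=random.Random(0))
```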

#### Theory
TODO: explain and summarize proofs

TODO: game theory - nash equilibrium


### Experiments
The image shows generated examples where $G$ has been trained on different datasets. The rightmost column shows the closest training image to the neighboring generated sample.
<img src="figs/GAN/gan-experiments.png" width="80%" height="80%">


### Problems
GANs can be difficult to train, because $G$ and $D$ have to be kept in sync since they basically teach each other. If one is too weak, the other will not be able to learn what it should.

This can cause a degenerate case where $G$ maps many different $z$ to the same single example $x$, one that $D$ classifies badly. The paper calls this the *Helvetica scenario*. It also shows that $G$ doesn't necessarily converge to the data distribution.


## Discussion and Thoughts
This adversarial training technique is nice because it can be applied to many different models. See *Adversarial Autoencoders* for an example.

TODO: read nips 2016 tutorial on GAN for more tips and tricks, https://arxiv.org/pdf/1701.00160v1.pdf

The adversarial technique seems to have become popular.