# Generative Adversarial Networks (GANs)

GANs are generative models that **create new data instances that resemble the training data**.

```
For example, GANs can create images that look like photographs of human faces, 
even though the faces don't belong to any real person.
```

Process:
* Pairing a generator with a discriminator:
  * *Generator* learns to produce the target output
  * *Discriminator* learns to distinguish real data from the output of the generator.
* The generator tries to fool the discriminator, and the discriminator tries to keep from being fooled.

## Generative Models

* Generative models 
  * can generate new data instances.
  * capture the joint probability p(X,Y), or just p(X) if there are no labels.
  * include the distribution of the data itself, and determine how likely a given instance is.
* Discriminative models 
  * discriminate between different kinds of data instances.
  * capture the conditional probability p(Y|X).
  * ignore the question of whether a given instance is likely, and just determine how likely a label is to apply to the instance.
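
To make the difference between p(X,Y) and p(Y|X) concrete, here is a minimal sketch that estimates both by counting. The toy dataset and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy dataset of (x, y) pairs: x is a feature value, y is a label.
data = [("round", "cat"), ("round", "cat"), ("round", "dog"),
        ("long", "dog"), ("long", "dog"), ("long", "cat"),
        ("long", "dog"), ("round", "cat")]

n = len(data)
joint_counts = Counter(data)
x_counts = Counter(x for x, _ in data)

def p_joint(x, y):
    """Generative view: p(X=x, Y=y), estimated by counting."""
    return joint_counts[(x, y)] / n

def p_x(x):
    """Marginal p(X=x): how likely the instance itself is."""
    return x_counts[x] / n

def p_label_given_x(y, x):
    """Discriminative view: p(Y=y | X=x) = p(x, y) / p(x)."""
    return joint_counts[(x, y)] / x_counts[x]

print(p_joint("round", "cat"))          # 0.375
print(p_x("round"))                     # 0.5
print(p_label_given_x("cat", "round"))  # 0.75
```

The conditional probability says nothing about how common "round" instances are. Only the joint (or the marginal) carries that information.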

### Modeling Probabilities
Both generative and discriminative models can estimate probabilities, but they don't have to.

* Generative models
  * They can model the distribution of data by imitating that distribution, without outputting an explicit probability.
  * They produce convincing "fake" data that looks like it's drawn from that distribution.
* Discriminative models
  * They can label an instance without assigning a probability to that label.
  * They produce predicted labels whose distribution is similar to the real distribution of labels in the data.

### Generative Models Are Hard
* Generative models 
  * They have to model more.
  * They might capture correlations between the main object and other objects in an image.
  * They try to model how data is placed throughout the space, and real data can follow very complicated distributions.
* Discriminative models 
  * They have to model relatively less.
  * They might capture the difference between objects in an image by looking for just a few tell-tale patterns, ignoring many of the correlations that the generative model must get right.
  * They try to draw boundaries in the data space.

**GANs offer an effective way to train such rich models to resemble a real distribution**.

## GAN Structure

### Generator 
* It is a neural network.
* Learns to generate plausible data.
* The output is connected directly to the discriminator input.  
* The generated instances become negative training examples for the discriminator.
* Generator training
  * Random input (like noise or a uniform distribution)
  * Generator network, which transforms the random input into a data instance
  * Discriminator network, which classifies the generated data as real or fake.
  * Calculate *generator loss*, which penalizes the generator for failing to fool the discriminator.
  * Backpropagate through both the discriminator and generator to obtain gradients.
  * Use gradients to change only the generator weights.
* It creates fake data by incorporating feedback from the discriminator.
  * Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
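
The generator training steps above can be sketched on a toy 1-D GAN with hand-written gradients. Everything here is an assumption made for illustration: the linear generator `G(z) = a*z + b`, the logistic discriminator `D(x) = sigmoid(w*x + c)`, the learning rate, and the use of the minimax form `log(1 - D(G(z)))` as the generator loss.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def generator_step(gen, disc, batch_size=64, lr=0.05, rng=random):
    """One generator update on a toy 1-D GAN (hypothetical linear models).

    gen  = {"a": ..., "b": ...}  ->  G(z) = a*z + b
    disc = {"w": ..., "c": ...}  ->  D(x) = sigmoid(w*x + c)
    """
    grad_a = grad_b = loss = 0.0
    for _ in range(batch_size):
        z = rng.gauss(0.0, 1.0)                          # random input
        fake = gen["a"] * z + gen["b"]                   # generator output
        d_fake = sigmoid(disc["w"] * fake + disc["c"])   # discriminator output
        loss += math.log(max(1.0 - d_fake, 1e-12))       # generator loss
        # Backpropagate through the discriminator, then through the generator.
        # d/ds log(1 - sigmoid(s)) = -sigmoid(s), and ds/dfake = w.
        dloss_dfake = -d_fake * disc["w"]
        grad_a += dloss_dfake * z
        grad_b += dloss_dfake
    # Gradient descent on the generator weights only; disc is left untouched.
    gen["a"] -= lr * grad_a / batch_size
    gen["b"] -= lr * grad_b / batch_size
    return loss / batch_size

random.seed(0)
gen = {"a": 1.0, "b": 0.0}
disc = {"w": 1.0, "c": -2.0}   # this fixed discriminator rates larger x as more real
before = dict(disc)
for _ in range(100):
    generator_step(gen, disc)
print(gen["b"] > 0.0, disc == before)   # True True
```

The gradient passes through the discriminator's weight `w`, but only `gen` is modified, which is the point of the last two training steps.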

### Discriminator 
* It is a neural network.
* The input comes from two sources:
  * Fake data directly from the generator output; used as negative examples.
  * Real data instances; used as positive examples.
* Learns to distinguish the fake data from real data.
  * Classifies the real data.
  * Classifies the fake data from the generator.
  * Ignores the generator loss and just uses the *discriminator loss*.
  * The *discriminator loss* penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.
    * It updates its own weights through backpropagation.
* It then penalizes the generator for producing implausible results.
  * Through backpropagation, it provides a signal that the generator uses to update its weights.
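
As a rough illustration of the *discriminator loss*, the sketch below uses binary cross-entropy with real instances labeled 1 and fake instances labeled 0. The choice of cross-entropy and the `eps` guard against `log(0)` are assumptions, not something these notes prescribe.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy over a real batch (label 1) and a fake batch (label 0).

    d_real: discriminator outputs D(x) for real instances, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) for generated instances
    """
    # Penalty for calling real instances fake.
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    # Penalty for calling fake instances real.
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

good = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
bad = discriminator_loss(d_real=[0.2, 0.1], d_fake=[0.9, 0.8])
print(good < bad)  # True
```

A discriminator that scores real data high and fake data low gets the smaller loss.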

## GAN Training

```
1. When training begins, the generator produces fake data.
2. The discriminator quickly learns to tell that it's fake.
3. As training progresses, the generator's output gets closer to real data.
4. The discriminator starts to get fooled. If the training goes well, the discriminator gets worse at telling the difference between real and fake. 
5. The discriminator starts to classify fake data as real, and its accuracy decreases.
```

Because a GAN contains two separately trained networks, its training algorithm must address two complications:
* GANs must juggle two different kinds of training (generator and discriminator).
* GAN convergence is hard to identify.

### Alternating Training
1. The discriminator trains for one or more epochs.
  * Keep the generator constant during the discriminator training phase.
  * Learns how to recognize the generator's flaws.
2. The generator trains for one or more epochs.
  * Keep the discriminator constant during the generator training phase.
  * Optimize based on discriminator back-propagated signal.
3. Repeat steps 1 and 2 to continue to train the generator and discriminator networks.
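
The alternating schedule can be sketched as two phase functions, each of which updates only its own network. This is a toy 1-D setup with hand-written gradients; the linear models, the real-data distribution N(4, 1), and the use of a few update steps per phase (rather than full epochs) are all assumptions for illustration.

```python
import math
import random

def sigmoid(s):
    # Numerically stable form: never exponentiates a large positive number.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_discriminator(gen, disc, real_sampler, steps, lr, rng):
    """Phase 1: update disc only; gen is held constant."""
    for _ in range(steps):
        x = real_sampler(rng)
        fake = gen["a"] * rng.gauss(0.0, 1.0) + gen["b"]
        d_real = sigmoid(disc["w"] * x + disc["c"])
        d_fake = sigmoid(disc["w"] * fake + disc["c"])
        # Gradient ascent on log D(x) + log(1 - D(G(z))).
        disc["w"] += lr * ((1.0 - d_real) * x - d_fake * fake)
        disc["c"] += lr * ((1.0 - d_real) - d_fake)

def train_generator(gen, disc, steps, lr, rng):
    """Phase 2: update gen only; disc is held constant."""
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(disc["w"] * (gen["a"] * z + gen["b"]) + disc["c"])
        # Gradient descent on log(1 - D(G(z))), backpropagated through disc.
        dloss_dfake = -d_fake * disc["w"]
        gen["a"] -= lr * dloss_dfake * z
        gen["b"] -= lr * dloss_dfake

def train_gan(rounds=200, d_steps=5, g_steps=5, lr=0.05, seed=0):
    rng = random.Random(seed)
    gen = {"a": 1.0, "b": 0.0}      # G(z) = a*z + b
    disc = {"w": 0.0, "c": 0.0}     # D(x) = sigmoid(w*x + c)
    real_sampler = lambda r: r.gauss(4.0, 1.0)   # hypothetical real data
    for _ in range(rounds):
        train_discriminator(gen, disc, real_sampler, d_steps, lr, rng)  # step 1
        train_generator(gen, disc, g_steps, lr, rng)                    # step 2
    return gen, disc

gen, disc = train_gan()
```

Keeping each phase in its own function makes it easy to check that the other network really is held constant. The sketch makes no claim about how well this toy setup converges.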

### Convergence
* As the generator improves with training, the discriminator performance gets worse because the discriminator can't easily tell the difference between real and fake. 
* If the generator succeeds perfectly, then the discriminator has a 50% accuracy. <br>`In effect, the discriminator flips a coin to make its prediction.`
* The discriminator feedback gets less meaningful over time. 
  * This progression poses a problem for convergence of the GAN as a whole.
* The convergence is often a fleeting, rather than stable, state.
  * If the GAN continues training past the point when the discriminator is giving completely random feedback, the generator starts to train on junk feedback, and its own quality may collapse.
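
The 50% figure can be checked numerically. The sketch below relies on a standard result that is not stated in these notes: when both densities are known, the best possible discriminator outputs `p_real(x) / (p_real(x) + p_fake(x))`. The Gaussian densities are hypothetical stand-ins for the real and generated distributions.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def optimal_discriminator(x, p_real, p_fake):
    """Best achievable D(x) when both densities are known."""
    return p_real(x) / (p_real(x) + p_fake(x))

p_real = lambda x: normal_pdf(x, 4.0, 1.0)
p_fake_early = lambda x: normal_pdf(x, 0.0, 1.0)   # generator far from the real data
p_fake_perfect = p_real                             # generator matches the real data

for x in (0.0, 2.0, 4.0):
    early = optimal_discriminator(x, p_real, p_fake_early)
    perfect = optimal_discriminator(x, p_real, p_fake_perfect)
    print(f"x={x}: early={early:.3f}  perfect={perfect}")
```

With a perfect generator the last column is 0.5 for every `x`, so the discriminator has nothing left to base a decision on.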

## GAN Loss Functions
* GANs try to replicate a probability distribution. 
* The loss function needs to reflect the distance between the distribution of the data generated by the GAN and that of the real data.
* A GAN can have two loss functions: 
  * one for generator training
  * one for discriminator training.
* The generator can only affect one term in the distance measure: the term that reflects the distribution of the fake data. So, during generator training we drop the other term, which reflects the distribution of the real data.

### Minimax Loss

The **generator tries to minimize** the following function while the **discriminator tries to maximize** it. It derives from the cross-entropy between the real and generated distributions.

*E<sub>x</sub>[ log(D(x)) ] + E<sub>z</sub>[ log(1 - D(G(z))) ]*

* x = real data instance
* E<sub>x</sub> = expected value over all real data instances.
* D(x) = discriminator's estimate of the probability that real data instance is real.
* z = noise
* G(z) = fake data instance, i.e. generator's output when given noise z.
* E<sub>z</sub> = expected value over all random inputs to the generator (in effect, over all generated fake instances G(z)).
* D(G(z)) = discriminator's estimate of the probability that a fake instance is real.

The **generator** can't directly affect the *log(D(x))* term. It can only **minimize *log(1 - D(G(z)))***.

### Modified Minimax Loss

The Minimax Loss can cause the GAN to get stuck in the early stages of GAN training when the discriminator's job is very easy.<br>
The remedy is to modify the generator loss so that the **generator** tries to **maximize *log D(G(z))***.
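
A quick way to see why the modification helps is to compare the gradients of the two generator objectives when the discriminator confidently rejects a fake. The sketch assumes the discriminator ends in a sigmoid, so `D(G(z)) = sigmoid(s)` for some logit `s`; the function names are made up.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_minimax(s):
    """d/ds of log(1 - sigmoid(s)), the loss the generator minimizes."""
    return -sigmoid(s)

def grad_modified(s):
    """d/ds of log(sigmoid(s)), the objective the generator maximizes."""
    return 1.0 - sigmoid(s)

# s is the discriminator's logit for a fake sample.
for s in (-8.0, -4.0, 0.0):
    print(f"D(G(z))={sigmoid(s):.4f}  "
          f"minimax grad={grad_minimax(s):+.4f}  "
          f"modified grad={grad_modified(s):+.4f}")
```

When `D(G(z))` is close to 0, the minimax gradient is close to 0 as well, so the generator barely moves. The modified objective has a gradient close to 1 in exactly that situation.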

### Wasserstein loss

* A modification of the GAN scheme called *Wasserstein GAN* or *WGAN*.
* In the original WGAN formulation, the critic's weights are clipped so that they remain within a constrained range.
* The discriminator does not actually classify instances. 
  * So, it is called a *critic* instead of a discriminator.
* For each instance the discriminator outputs a number.
* Discriminator training tries to make the output bigger for real instances than for fake instances.

Discriminator tries to maximize **Critic Loss**: *D(x) - D(G(z))*.

Generator tries to maximize **Generator Loss**: *D(G(z))*.
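
The two Wasserstein objectives and the weight clipping can be sketched in a few lines. Batch means stand in for the expectations, and the clipping limit of 0.01 is an assumed value chosen for illustration.

```python
def critic_objective(critic_real, critic_fake):
    """Quantity the critic maximizes: mean D(x) - mean D(G(z))."""
    return sum(critic_real) / len(critic_real) - sum(critic_fake) / len(critic_fake)

def generator_objective(critic_fake):
    """Quantity the generator maximizes: mean D(G(z))."""
    return sum(critic_fake) / len(critic_fake)

def clip_weights(weights, limit=0.01):
    """Clamp every weight into [-limit, limit]; the limit is an assumption."""
    return [max(-limit, min(limit, w)) for w in weights]

# Critic outputs are unbounded scores, not probabilities.
print(critic_objective([2.5, 3.0], [-1.0, -0.5]))  # 3.5
print(generator_objective([-1.0, -0.5]))           # -0.75
print(clip_weights([0.5, -0.003, -2.0]))           # [0.01, -0.003, -0.01]
```

Neither objective involves a logarithm or a sigmoid, which is why the scores do not need to lie between 0 and 1.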

Benefits:
* Less vulnerable to getting stuck than minimax-based GANs.
* Avoid problems with vanishing gradients.
* The *earth mover distance* between the real and generated distributions has the advantage of being a true metric.
  * A true metric is a measure of distance in a space of probability distributions. Cross-entropy is not a metric in this sense.
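
Two of the properties a metric must have are easy to check by hand: the distance from a distribution to itself is zero, and the distance is symmetric. The sketch below compares cross-entropy with a 1-D earth mover distance on two made-up histograms. It assumes both histograms share the same unit-spaced bins, which is what makes the running-sum formula valid.

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def earth_mover_distance(p, q):
    """1-D earth mover distance for histograms on the same unit-spaced bins."""
    total, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi      # mass that still has to move to later bins
        total += abs(carried)
    return total

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]

print(cross_entropy(p, p) > 0)                                        # True
print(cross_entropy(p, q) == cross_entropy(q, p))                     # False
print(earth_mover_distance(p, p))                                     # 0.0
print(earth_mover_distance(p, q) == earth_mover_distance(q, p))       # True
```

Cross-entropy fails both checks, while the earth mover distance passes them.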

## GAN Common Problems

### Vanishing Gradients
Problem:
* If your discriminator is too good, then generator training can fail due to vanishing gradients. 
* An optimal discriminator doesn't provide enough information for the generator to make progress.

Remedies:
* *Wasserstein loss*
* *Modified minimax loss*

### Mode Collapse
Problem:
* If a generator produces an especially plausible output, the generator may learn to produce only that output.
* The generator always tries to find the one output that seems most plausible to the discriminator.
* If the generator starts producing the same output (or a small set of outputs) over and over again, the discriminator's best strategy is to learn to always reject that output. 
* But if the next generation of discriminator gets stuck in a local minimum and doesn't find the best strategy, then it's too easy for the next generator iteration to find the most plausible output for the current discriminator.
* Each iteration of generator over-optimizes for a particular discriminator, and the discriminator never manages to learn its way out of the trap. 
* As a result, the generators rotate through a small set of output types.

Remedies: force the generator to broaden its scope
* *Wasserstein loss*
  * Lets you train the discriminator to optimality without worrying about vanishing gradients. If the discriminator doesn't get stuck in local minima, it learns to reject the outputs that the generator stabilizes on. 
  * So the generator has to try something new.
* *Unrolled GANs*
  * Use a generator loss function that incorporates not only the current discriminator's classifications, but also the outputs of future discriminator versions. 
  * So the generator can't over-optimize for a single discriminator.

### Failure to Converge
Problem:
* The GAN may continue training past the point when the discriminator is giving completely random feedback.
* The generator starts to train on junk feedback, and its own quality may collapse.
* GANs frequently fail to converge.

Remedies:
* Adding noise to discriminator inputs
* Regularizing the discriminator by penalizing its weights
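
Both remedies are small additions to the discriminator's training step. The sketch below is one possible reading: Gaussian noise for the inputs and an L2 penalty for the weights. The noise scale, the penalty strength, and the L2 form are all assumptions.

```python
import random

def add_input_noise(batch, std=0.1, rng=random):
    """Add Gaussian noise to every value the discriminator sees."""
    return [[v + rng.gauss(0.0, std) for v in instance] for instance in batch]

def l2_weight_penalty(weights, strength=1e-3):
    """Penalty term to add to the discriminator loss."""
    return strength * sum(w * w for w in weights)

random.seed(0)
real_batch = [[0.2, 0.8], [0.5, 0.1]]          # hypothetical 2-value instances
noisy_batch = add_input_noise(real_batch)
base_loss = 0.4                                # placeholder discriminator loss
regularized_loss = base_loss + l2_weight_penalty([3.0, 4.0], strength=0.1)
print(len(noisy_batch), len(noisy_batch[0]))   # 2 2
```

The noise would be applied to real and generated batches alike before they reach the discriminator.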

## GAN Variations

### Progressive GANs
* The generator's first layers produce very low resolution images.
* The subsequent layers add details. 
* This technique allows the GAN to train more quickly.
* Produces higher resolution images.

### Conditional GANs
* Trains on a labeled data set.
* Lets you specify the label for each instance to be generated. 
* Models the conditional probability p(X|Y).

```
An unconditional MNIST GAN would produce random digits, 
while a conditional MNIST GAN would let you specify which digit the GAN should generate.
```
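
One common way to let the caller specify the label is to append a one-hot encoding of it to the noise vector before it enters the generator. The sketch below shows only that input construction. The concatenation scheme, the dimensions, and the function names are assumptions, not the only way to condition a GAN.

```python
import random

def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def conditional_generator_input(label, noise_dim=8, num_classes=10, rng=random):
    """Concatenate noise z with a one-hot label y to form the generator input."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, num_classes)

random.seed(0)
x = conditional_generator_input(label=7)
print(len(x))   # 18
print(x[8:])    # one-hot vector with a 1.0 at index 7
```

The noise part still varies from call to call, so the same label can produce many different digits.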

### Image-to-Image Translation
* Take an image as input.
* Map it to a generated output image with different properties. 
* The loss is a weighted combination of the usual *discriminator loss* and a *pixel-wise loss* that penalizes the generator for departing from the source image.

```
We can take a mask image with a blob of color in the shape of a car, 
and the GAN can fill in the shape with photorealistic car details.
```
```
Take sketches of handbags and turn them into photorealistic images of handbags.
```
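
The weighted loss for image-to-image translation can be sketched as follows. The mean absolute (L1) difference as the pixel-wise term and the weight value are assumptions. `reference` is whichever image the generator is penalized for departing from.

```python
def pixelwise_l1(generated, reference):
    """Mean absolute difference between two same-sized images (nested lists)."""
    diffs = [abs(g - r)
             for g_row, r_row in zip(generated, reference)
             for g, r in zip(g_row, r_row)]
    return sum(diffs) / len(diffs)

def image_translation_loss(adversarial_loss, generated, reference, pixel_weight=100.0):
    """Weighted combination of the discriminator-based loss and the pixel-wise loss."""
    return adversarial_loss + pixel_weight * pixelwise_l1(generated, reference)

generated = [[0.0, 1.0], [0.5, 0.5]]   # hypothetical 2x2 grayscale images
reference = [[0.0, 0.0], [0.5, 1.0]]
print(pixelwise_l1(generated, reference))   # 0.375
total = image_translation_loss(0.7, generated, reference, pixel_weight=10.0)
```

A larger `pixel_weight` keeps the output closer to the reference image at the cost of giving the discriminator less influence.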

### CycleGAN
* Transform images from one set into images that could plausibly belong to another set. 
* The training data is simply two sets of images. 
* The system requires no labels or pairwise correspondences between images.

```
Take an image of a horse and turn it into an image of a zebra.
```

### Text-to-Image Synthesis
* Take text as input.
* Produce images that are plausible and described by the text. 
* Can only produce images from a small set of classes.

```
"This flower has petals that are yellow with shades of orange." 
produces an image of a flower that has yellow petals with orange shades.
```

### Super-resolution
* Increase the resolution of images.
* Adds detail where necessary to fill in blurry areas.
* Some patterns might be skipped and made-up patterns might be produced.

```
Given the blurry image, a sharper image is produced.
```

### Face Inpainting
* Semantic image inpainting task. 
* Chunks of an image are blacked out.
* The system tries to fill in the missing chunks.

### Text-to-Speech
* Produce synthesized speech from text input.



