<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/gan/c1_w3_wgan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wasserstein GAN with Gradient Penalty (WGAN-GP)

### WGAN-GP improvements

WGAN-GP improves training stability. As shown below, when the model architecture is suboptimal, WGAN-GP can still produce good results where the original GAN cost function fails.

<img src="https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/gan/images/wgan-gp-1.png" width=600>


Below is the Inception Score for different methods. The experiments in the WGAN-GP paper demonstrate better image quality and convergence compared with WGAN. DCGAN achieves slightly better image quality and converges faster, but the Inception Score for WGAN-GP is more stable once it starts converging.

<img src="https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/gan/images/wgan-gp-2.png" width=500>

### Goals
In this notebook, you're going to build a Wasserstein GAN with Gradient Penalty (WGAN-GP) that solves some of the stability issues with the GANs that you have been using up until this point. Specifically, you'll use a special kind of loss function known as the W-loss, where W stands for Wasserstein, and gradient penalties to prevent mode collapse.

*Fun Fact: Wasserstein is named after a mathematician at Penn State, Leonid Vaseršteĭn. You'll see it abbreviated to W (e.g. WGAN, W-loss, W-distance).*

### Learning Objectives
1.   Get hands-on experience building a more stable GAN: Wasserstein GAN with Gradient Penalty (WGAN-GP).
2.   Train the more advanced WGAN-GP model.

GAN `discriminator loss` minimization:
$$\min_D -\left(\mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]\right)$$

To train the GAN `generator` $G$, we calculate the loss when comparing predictions for generated images $p_i=D(G(z_i))$ to the response $y_i=1$. Therefore, for the GAN generator, minimizing the loss function can be written as follows:

GAN `generator loss` minimization:
$$\min_G -\left(\mathbb{E}_{z \sim p_Z}[\log D(G(z))]\right)$$
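As a rough numeric sketch of the two formulas above, the expectations can be approximated by batch means. The function names `discriminator_loss` and `generator_loss` are illustrative, not part of this notebook's code, and the inputs are assumed to be discriminator outputs strictly between 0 and 1.

```python
import math

def discriminator_loss(d_real, d_fake):
    """-(mean(log D(x)) + mean(log(1 - D(G(z))))) over two batches.

    d_real: discriminator outputs on real images, each in (0, 1)
    d_fake: discriminator outputs on generated images, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """-mean(log D(G(z))): low when the discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A discriminator that separates real from fake well has a lower loss
# than one that outputs 0.5 for everything.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(confident < undecided)  # True
```

In a real training loop these quantities are usually computed with a binary cross-entropy loss on tensors; the plain-Python version here is only meant to make the terms of each formula visible.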

### Wasserstein loss function

First, the Wasserstein loss requires that we use $y_i=1$ and $y_i=-1$ as labels, rather than 1 and 0. We also remove the sigmoid activation from the final layer of the **discriminator (critic)**, so that predictions $p_i$ are no longer constrained to the range $[0, 1]$ and can instead be any real number in $(-\infty, \infty)$. For this reason, the discriminator in a WGAN is usually referred to as a critic. The Wasserstein loss function is then defined as follows:

Wasserstein loss:
$$-\frac{1}{n}\sum_{i=1}^{n}{(y_i\,p_i)}$$

To train the WGAN critic $D$, we calculate the loss when comparing predictions for real images $p_i=D(x_i)$ to the response $y_i=1$ and predictions for generated images $p_i=D(G(z_i))$ to the response $y_i=-1$. Therefore, for the WGAN critic, minimizing the loss function can be written as follows:

WGAN critic loss minimization:

$$\min_D -\left(\mathbb{E}_{x \sim p_X}[D(x)] - \mathbb{E}_{z \sim p_Z}[D(G(z))]\right)$$
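To see how the label-based Wasserstein loss leads to the critic objective, here is a minimal sketch that applies the loss once to a batch of real scores (labels $+1$) and once to a batch of generated scores (labels $-1$), then adds the two parts. The helper names `wasserstein_loss` and `critic_loss` and the example scores are assumptions made for illustration.

```python
def wasserstein_loss(labels, preds):
    """-(1/n) * sum(y_i * p_i) for labels y_i in {+1, -1}."""
    return -sum(y * p for y, p in zip(labels, preds)) / len(preds)

def critic_loss(real_scores, fake_scores):
    """Wasserstein loss on real scores (y = +1) plus generated scores (y = -1).

    Algebraically this equals -(mean(real_scores) - mean(fake_scores)).
    """
    real_part = wasserstein_loss([1] * len(real_scores), real_scores)
    fake_part = wasserstein_loss([-1] * len(fake_scores), fake_scores)
    return real_part + fake_part

# Unbounded critic scores: high for real images, low for generated ones.
real_scores = [2.0, 4.0]    # mean 3.0
fake_scores = [-1.0, -3.0]  # mean -2.0
print(critic_loss(real_scores, fake_scores))  # -5.0, i.e. -(3.0 - (-2.0))
```

Minimizing this value pushes the critic to score real images higher and generated images lower, which is why the scores need a separate constraint to keep them from growing without bound.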

### The Lipschitz Constraint

It may surprise you that we are now allowing the critic to output any number in the range $(-\infty, \infty)$, rather than applying a sigmoid function to restrict the output to the usual $[0, 1]$ range. The Wasserstein loss can therefore be very large, which is unsettling: usually, large numbers in neural networks are to be avoided!

In fact, the authors of the WGAN paper show that for the Wasserstein loss function to work, we also need to place an additional constraint on the critic. Specifically, it is required that the critic is a `1-Lipschitz continuous` function. Let’s pick this apart to understand what it means in more detail.

The critic is a function $D$ that converts an image into a prediction. We say that this function is 1-Lipschitz if it satisfies the following inequality for any two input images $x_1$ and $x_2$:

$$\frac{|D(x_1) - D(x_2)|}{|x_1 - x_2|} \leq 1$$

Here, $|x_1 - x_2|$ is the average pixelwise absolute difference between two images and $|D(x_1) - D(x_2)|$ is the absolute difference between the critic predictions. Essentially, we require a limit on the rate at which the predictions of the critic can change between two images (i.e., the absolute value of the gradient must be at most 1 everywhere).
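The inequality can be checked numerically on toy functions. The sketch below is an illustration only: it treats an "image" as a short list of pixel values and uses two hypothetical critics, `mean_critic` (which satisfies the constraint under the distance defined above) and `sum_critic` (which violates it). Sampling pairs can reveal a violation but cannot prove that the constraint holds everywhere.

```python
import random

def mean_abs_diff(x1, x2):
    """Average pixelwise absolute difference between two images."""
    return sum(abs(a - b) for a, b in zip(x1, x2)) / len(x1)

def lipschitz_ratio(critic, x1, x2):
    """|D(x1) - D(x2)| / |x1 - x2| for a single pair of distinct images."""
    return abs(critic(x1) - critic(x2)) / mean_abs_diff(x1, x2)

def mean_critic(x):
    """Toy critic: mean pixel value. Its ratio never exceeds 1."""
    return sum(x) / len(x)

def sum_critic(x):
    """Toy critic: sum of pixel values. Its ratio can reach len(x)."""
    return sum(x)

rng = random.Random(0)  # fixed seed so the sampled pairs are reproducible
pairs = [([rng.random() for _ in range(4)], [rng.random() for _ in range(4)])
         for _ in range(1000)]

max_mean = max(lipschitz_ratio(mean_critic, a, b) for a, b in pairs)
max_sum = max(lipschitz_ratio(sum_critic, a, b) for a, b in pairs)
print("largest ratio, mean critic:", max_mean)
print("largest ratio, sum critic:", max_sum)
```

For the pair `[1, 1, 1, 1]` and `[0, 0, 0, 0]`, the mean critic gives a ratio of exactly 1 while the sum critic gives 4, so only the former respects the 1-Lipschitz bound.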

## Generator and Critic

You will begin by importing some useful packages, defining visualization functions, building the generator, and building the critic. Since the changes for WGAN-GP are made to the loss function during training, you can simply reuse your previous GAN code for the generator and critic classes. Remember that in WGAN-GP, you no longer use a discriminator that classifies fake and real as 0 and 1, but rather a critic that scores images with real numbers.