# WGAN: Wasserstein Generative Adversarial Networks

In short, WGAN (the ‘W’ stands for Wasserstein) proposes a **new cost function**.

The old version of the GAN minimax optimization game:

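$$ \min_G \max_D \; E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$

which, when the discriminator is optimal, corresponds (up to scaling and an additive constant) to a statistical quantity called the Jensen-Shannon divergence. Here $p_z$ is the noise distribution that the generator's input $z$ is sampled from.

And here is the new one that WGAN uses:

$$ \min_G \max_D \; E_{x \sim p_{data}(x)}[D(x)] - E_{z \sim p_z(z)}[D(G(z))] $$

which approximates a statistical quantity called the 1-Wasserstein distance, provided $D$ is constrained to be 1-Lipschitz. Note that $D$ no longer outputs a probability here: it outputs an unbounded real-valued score.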
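To make the difference between the two objectives concrete, here is a minimal sketch in plain Python that evaluates both on a small batch. It assumes the standard forms of the objectives, $E[\log D(x)] + E[\log(1 - D(G(z)))]$ for the original GAN and $E[D(x)] - E[D(G(z))]$ for WGAN. The function names and the score values are made up for illustration and do not come from any library or from a trained model.

```python
import math

def gan_value(d_real, d_fake):
    """Original GAN objective on a batch.

    d_real, d_fake: discriminator outputs, probabilities strictly in (0, 1).
    Returns mean(log D(x)) + mean(log(1 - D(G(z)))).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def wgan_value(c_real, c_fake):
    """WGAN objective on a batch.

    c_real, c_fake: critic outputs, unbounded real-valued scores.
    Returns mean(D(x)) - mean(D(G(z))).
    """
    return sum(c_real) / len(c_real) - sum(c_fake) / len(c_fake)

# Hypothetical outputs, chosen by hand for illustration only.
d_real, d_fake = [0.9, 0.8, 0.95], [0.1, 0.2, 0.05]
c_real, c_fake = [2.0, 1.5, 2.5], [-1.0, -0.5, -1.5]

print("GAN objective: ", gan_value(d_real, d_fake))
print("WGAN objective:", wgan_value(c_real, c_fake))
```

The only structural changes are dropping the logarithms and letting the scores range over all real numbers instead of probabilities. The discriminator (critic) tries to make either value large, while the generator tries to make it small.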

The original **GAN** paper showed that when the discriminator is optimal, updating the generator amounts to minimizing the Jensen-Shannon divergence between the data distribution and the generated distribution.

The **Jensen-Shannon divergence** is a way of measuring how different two probability distributions are. The larger the JSD, the more “different” the two distributions are, and vice versa. You compute it like this:

$$ JSD(P||Q) = \frac{1}{2} KL(P|| \frac{P+Q}{2}) + \frac{1}{2} KL(Q|| \frac{P+Q}{2}) $$

$$ KL(A||B)= \int_{-\infty}^{\infty} a(x)\log \frac{a(x)}{b(x)} dx $$

However, the authors of the WGAN paper argued that minimizing the JSD is not the best thing to do, because when the two distributions don’t overlap at all, you can show that the JSD stays at a constant value of $\log 2$, no matter how far apart they are. A function that has a constant value has a gradient equal to zero, and a zero gradient is bad because it means that the generator learns absolutely nothing.
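The constant-JSD problem is easy to check numerically. The sketch below is a hypothetical toy setup with discrete distributions on a small grid, not code from the WGAN paper. It uses the common convention with a factor of $\frac{1}{2}$ on each KL term, under which the constant is $\log 2$ (without those factors it would be $2\log 2$).

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists.

    Terms with p_i == 0 contribute 0 by convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with the usual 1/2 factors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(index, size=10):
    """All probability mass on a single grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

p = point_mass(0)
for shift in (1, 5, 9):
    # The supports never overlap, so the JSD does not change with the shift.
    print(f"shift={shift}  JSD={jsd(p, point_mass(shift)):.6f}")
print(f"log 2 = {math.log(2):.6f}")
```

Moving the second distribution closer to or further from the first changes nothing in the output, which is exactly why a generator trained on this signal gets no useful gradient.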

The alternative distance metric proposed by the **WGAN** authors is the **1-Wasserstein distance**, sometimes called the earth mover distance.

It gets the name “earth mover distance” because of an analogy. Imagine that one of the two distributions is a pile of earth, and the other is a pit.

The earth mover distance measures the cost of transporting the pile of earth to the pit, assuming that you’re transporting the mud, sand, dirt, etc., as efficiently as possible. Here, “cost” is the distance each bit of earth travels × the amount of earth moved, summed over all the moves.

Concretely, the earth mover distance between two distributions can be written as:

$$
EMD(P_r, P_{\theta}) = \inf_{\gamma \in \Pi} \sum_{x,y} \|x-y\| \, \gamma(x,y) = \inf_{\gamma \in \Pi} E_{(x,y)\sim \gamma} \|x-y\|
$$

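Where $\inf$ is the infimum (the greatest lower bound, which you can think of as a minimum), $x$ and $y$ are points drawn from the two distributions, and $\gamma$ is a transport plan: a joint distribution whose marginals are $P_r$ and $P_{\theta}$, with $\gamma(x,y)$ saying how much earth to move from $x$ to $y$. $\Pi$ is the set of all such plans, and the infimum picks out the cheapest one.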
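In one dimension the optimal transport plan does not have to be searched for: the 1-Wasserstein distance equals the area between the two cumulative distribution functions. The sketch below relies on that fact for discrete distributions on an integer grid with unit spacing. The function names and the grid setup are illustrative assumptions, not part of the WGAN paper or of any library.

```python
def emd_1d(p, q):
    """Earth mover distance between two distributions on the grid 0, 1, ..., n-1.

    In 1D this equals the sum over grid points of |CDF_p - CDF_q|
    (assuming unit spacing between neighbouring grid points).
    """
    assert len(p) == len(q)
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        total += abs(cdf_gap)   # earth that still has to cross to the next cell
    return total

def point_mass(index, size=10):
    """All probability mass on a single grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

p = point_mass(0)
for shift in (1, 5, 9):
    # Unlike the JSD, the distance grows with how far apart the piles are.
    print(f"shift={shift}  EMD={emd_1d(p, point_mass(shift))}")
```

For two non-overlapping point masses the result is simply the distance between them, so the value keeps shrinking as the generated distribution moves toward the real one. That is the property that gives the generator a usable gradient.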

Unfortunately, computing this infimum directly is intractable, because it requires searching over every possible transport plan. So instead, we compute something that looks totally different:

$$
EMD(P_r, P_{\theta}) = \sup_{\|f\|_L \leq 1} E_{x \sim P_r}[f(x)] - E_{x \sim P_{\theta}}[f(x)]
$$

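Here $\sup$ is the supremum (least upper bound), taken over all 1-Lipschitz functions $f$, that is, functions satisfying $|f(x) - f(y)| \leq \|x - y\|$ for all $x$ and $y$. The connection between these two equations certainly doesn’t seem evident at first, but through some fancy math called the Kantorovich-Rubinstein duality (try saying that three times fast), you can show that these two formulas for the Wasserstein/earth mover distance give exactly the same value.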
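A small worked example makes the equivalence easier to believe. It is a toy case chosen for illustration: both distributions are point masses on the real line, $P_r = \delta_0$ and $P_{\theta} = \delta_{\theta}$.

In the transport formulation there is only one possible plan, which moves all of the mass from $0$ to $\theta$, so

$$
\inf_{\gamma \in \Pi} E_{(x,y)\sim \gamma} \|x-y\| = |0 - \theta| = |\theta|
$$

In the dual formulation the expectations collapse to single function values, and the 1-Lipschitz constraint bounds their difference:

$$
E_{x \sim P_r}[f(x)] - E_{x \sim P_{\theta}}[f(x)] = f(0) - f(\theta) \leq |0 - \theta| = |\theta|
$$

The bound is attained by the 1-Lipschitz function $f(x) = -\mathrm{sign}(\theta)\, x$, for which $f(0) - f(\theta) = \mathrm{sign}(\theta)\,\theta = |\theta|$. So the supremum is also $|\theta|$, and the two formulas agree.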

## Code

! python3 /Users/luisagaltarossa/Documents/doc_generative_ai/PyTorch-GAN/implementations/relativistic_gan/relativistic_gan.py --n_epochs 10

Namespace(n_epochs=10, batch_size=64, lr=0.0002, b1=0.5, b2=0.999, n_cpu=8, latent_dim=100, img_size=32, channels=1, sample_interval=400, rel_avg_gan=False)
[Epoch 0/10] [Batch 0/938] [D loss: 0.694829] [G loss: 0.713984]
[Epoch 0/10] [Batch 1/938] [D loss: 0.687632] [G loss: 0.713861]
[Epoch 0/10] [Batch 2/938] [D loss: 0.684558] [G loss: 0.713641]
[Epoch 0/10] [Batch 3/938] [D loss: 0.677194] [G loss: 0.712993]
[Epoch 0/10] [Batch 4/938] [D loss: 0.676178] [G loss: 0.713047]
[Epoch 0/10] [Batch 5/938] [D loss: 0.668581] [G loss: 0.712437]
[Epoch 0/10] [Batch 6/938] [D loss: 0.665704] [G loss: 0.711283]
[Epoch 0/10] [Batch 7/938] [D loss: 0.658166] [G loss: 0.710028]
[Epoch 0/10] [Batch 8/938] [D loss: 0.658035] [G loss: 0.709110]
[Epoch 0/10] [Batch 9/938] [D loss: 0.652608] [G loss: 0.707713]
[Epoch 0/10] [Batch 10/938] [D loss: 0.645431] [G loss: 0.703137]
[Epoch 0/10] [Batch 11/938] [D loss: 0.639332] [G loss: 0.700064]
[Epoch 0/10] [Batch 12/938] [D loss: 0.632474] [G loss: 0.697