# Wasserstein GAN

This paper is concerned with measuring how close the model distribution is to the real distribution in a way that improves GAN training.

Practically, I think the main contributions are
* It presents another way to stabilize training of GANs which seems to work really well.
* There seems to be a correlation between discriminator loss and quality of the generated content.

The paper is relatively theory-heavy and gives a theoretical justification for why the new algorithm works, but the changes needed are very easy to implement.


## Background
The paper begins with some background on learning distributions, i.e. generative models. Using their notation:
* $P_r$ is the actual distribution
* $P_\theta$ is the model distribution parametrized by $\theta$.

The most intuitive way of finding $P_\theta$ might be maximum likelihood estimation (MLE): $\arg\max_\theta \frac{1}{m} \sum^m_{i=1} \log P_\theta (x^{(i)})$

In the limit $m \to \infty$, MLE is equivalent to minimizing the KL divergence $KL(P_r\ ||\ P_\theta) = \int P_r(x) \log \left( \frac{P_r(x)}{P_\theta(x)} \right) dx$
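
The link between MLE and the KL divergence can be made explicit with a short derivation. This is a sketch, assuming $P_r$ and $P_\theta$ have densities and that the sample average converges to the expectation under $P_r$:

$$\begin{aligned}
KL(P_r\ ||\ P_\theta) &= \int P_r(x) \log \left( \frac{P_r(x)}{P_\theta(x)} \right) dx \\
&= \underbrace{\mathbb{E}_{x \sim P_r} \left[ \log P_r(x) \right]}_{\text{independent of } \theta} - \mathbb{E}_{x \sim P_r} \left[ \log P_\theta(x) \right]
\end{aligned}$$

Minimizing over $\theta$ therefore amounts to maximizing $\mathbb{E}_{x \sim P_r} \left[ \log P_\theta(x) \right]$, which is what the MLE objective $\frac{1}{m} \sum^m_{i=1} \log P_\theta (x^{(i)})$ approaches as $m \to \infty$.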

This means that the KL divergence goes to $\infty$ if even a single data point gets zero density under $P_\theta$. We are often dealing with data on a low-dimensional manifold, and then it is very unlikely that the supports of $P_r$ and $P_\theta$ cover each other perfectly, which in turn means that the KL divergence will be infinite. Bad!

MLE is still used a lot, but to work around this issue a noise term is usually added to $P_\theta$ so that every data point has some density. This has its own drawback: the added noise reduces sample quality (blurriness).

Another approach to find a good $P_\theta$ is to learn a deterministic function $g_\theta:\ z \rightarrow x$ (e.g. a neural network) where $z$ is sampled from a known distribution $p(z)$.

For this approach, a measure of closeness between distributions is needed, $\rho$. The paper discusses some different such measures and their impacts on convergence. This metric/distance/divergence $\rho$ is then what would be used for the loss function $\theta \rightarrow \rho(P_\theta, P_r)$

### Distance measures
---
**Set theory recap**

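$\sup$ is the supremum, the least upper bound of a set: the smallest value that is greater than or equal to every element of the set. It does not have to belong to the set itself, e.g. the supremum of the open interval $(0, 1)$ is $1$.

$\inf$ is the infimum, the greatest lower bound of a set: the largest value that is less than or equal to every element of the set.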

---

* Total Variation (TV): 
$$\delta(P_r, P_g) = \sup_A \lvert P_r(A) - P_g(A) \rvert$$ where $A$ ranges over all measurable (Borel) subsets of the sample space.


* Kullback-Leibler divergence (KL): 
$$KL(P_r\ ||\ P_g) = \int P_r(x) \log \left( \frac{P_r(x)}{P_g(x)} \right) dx$$ 
KL divergence is asymmetric and also has the previously discussed problems.


* Jensen-Shannon divergence (JS): 
$$JS(P_r, P_g) = KL(P_r\ \lVert\ P_m) + KL(P_g\ \lVert\ P_m), \quad P_m = \frac{P_r + P_g}{2}$$
JS divergence is symmetric and always defined.


* Earth-Mover distance (EM), aka Wasserstein-1:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi (P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma} \left[\lVert x - y \rVert \right]$$
    * $\Pi (P_r, P_g)$ is the set of all joint distributions whose marginal distributions are $P_r$ and $P_g$.
    * The intuition behind the EM distance is that $\gamma(x, y)$ is how much probability mass must be moved from $x$ to $y$ for $P_r$ to become $P_g$.
    * The EM distance is the cost of the optimal "transport plan": each plan $\gamma$ pays mass times distance, $\gamma(x, y) \lVert x - y \rVert$, summed over all pairs $x, y$, and the infimum picks the cheapest plan.
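
The transport-plan intuition can be made concrete on a tiny discrete case. The sketch below is illustrative only: the two-point distributions, the helper names `marginals` and `transport_cost`, and both candidate plans are made up for these notes and do not come from the paper.

```python
def marginals(gamma):
    """Row sums and column sums of a joint distribution given as a matrix."""
    rows = [sum(row) for row in gamma]
    cols = [sum(col) for col in zip(*gamma)]
    return rows, cols


def transport_cost(gamma, xs, ys):
    """Expected distance |x - y| under the joint distribution gamma."""
    return sum(
        gamma[i][j] * abs(xs[i] - ys[j])
        for i in range(len(xs))
        for j in range(len(ys))
    )


# Both distributions live on the points 0 and 1 (hypothetical toy data).
xs = ys = [0.0, 1.0]
p_r = [0.5, 0.5]
p_g = [0.25, 0.75]

# Plan 1: keep as much mass in place as possible, move 0.25 from x=0 to y=1.
plan_direct = [[0.25, 0.25],
               [0.0, 0.5]]

# Plan 2: the independent coupling gamma(x, y) = P_r(x) * P_g(y).
plan_indep = [[pi * qj for qj in p_g] for pi in p_r]

for name, plan in [("direct", plan_direct), ("independent", plan_indep)]:
    # Both plans are valid members of the set of couplings of p_r and p_g.
    assert marginals(plan) == (p_r, p_g)
    print(name, transport_cost(plan, xs, ys))
```

Both plans have the right marginals, but the direct plan is cheaper. In this toy case it is also optimal, since at least 0.25 of the mass has to move a distance of 1, so its cost is the EM distance.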

They give an example of learning $P_r$, the distribution of $(0, Z)$, with $P_g$, the distribution of $(\theta, Z)$, where $Z \sim Unif(0, 1)$ in both cases. We want $\theta \to 0$, but all of the aforementioned distances except the EM distance fail to give a useful training signal: for $\theta \neq 0$ the KL divergence is $\infty$ and the other two are constant, so there is no gradient to learn from. Only the EM distance, $W(P_r, P_g) = \lvert \theta \rvert$, shrinks continuously as $\theta \to 0$.
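
The behavior in this example can be checked numerically on a simplified version. The sketch below drops the shared $z$ coordinate, so $P_r$ becomes a point mass at $0$ and $P_g$ a point mass at $\theta$ on a small 1D grid. The grid, the helper names, and the use of the common $\frac{1}{2}$-weighted form of JS are assumptions made for illustration (without the $\frac{1}{2}$ weights the JS constant simply doubles).

```python
import math


def point_mass(n, idx):
    """Distribution on n grid points with all mass at index idx."""
    return [1.0 if i == idx else 0.0 for i in range(n)]


def kl(p, q):
    """KL(p || q); infinite when p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence, 1/2-weighted convention (assumption)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def tv(p, q):
    """Total variation; for discrete distributions this is half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def w1(p, q, dx):
    """Wasserstein-1 in 1D, computed as the integral of |CDF_p - CDF_q|."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * dx
    return total


n, dx = 5, 0.25                      # grid points 0, 0.25, ..., 1
p_r = point_mass(n, 0)               # all mass at 0
for k in range(n):
    p_g = point_mass(n, k)           # all mass at theta = k * dx
    print(k * dx, kl(p_r, p_g), js(p_r, p_g), tv(p_r, p_g), w1(p_r, p_g, dx))
```

For every $\theta > 0$ on this grid, KL is infinite, TV is $1$ and JS is $\log 2$, while the Wasserstein-1 value equals $\theta$ and is the only one that tells the model which way to move.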

---
**Lipschitz**

A function $f: X \to Y$ is K-Lipschitz if for all pairs of inputs $x_1, x_2 \in X$ the following holds:
$$d_Y(f(x_1), f(x_2)) \leq K d_X(x_1, x_2)$$
where $d_X$ and $d_Y$ are some distance functions.

For a differentiable $f$ this says that $K$ is an upper bound on $\lvert f' \rvert$, i.e. the slope of $f$ never exceeds $K$ in absolute value.
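
As a quick sanity check of the definition, the sketch below measures the steepest slope between sampled pairs of points for $f(x) = \sin(2x)$, whose derivative is bounded by $2$ in absolute value. The function, the grid, and the helper name `max_slope` are arbitrary choices for illustration, and the absolute difference is used as the distance on both sides.

```python
import math


def max_slope(f, xs):
    """Largest |f(a) - f(b)| / |a - b| over all distinct pairs in xs."""
    best = 0.0
    for i, a in enumerate(xs):
        for b in xs[i + 1:]:
            best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best


def f(x):
    # f'(x) = 2 cos(2x), so |f'| <= 2 and f should be 2-Lipschitz.
    return math.sin(2.0 * x)


xs = [i * 0.01 for i in range(-100, 101)]   # grid on [-1, 1]
slope = max_slope(f, xs)
print(slope)  # stays below K = 2 and gets close to it near x = 0
```

A sampled check like this can only suggest a Lipschitz constant; it cannot prove one, since the steepest pair might not be on the grid.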

---


## Wasserstein GAN
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi (P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \left[\lVert x - y \rVert \right]$$

The infimum in the EM distance definition is intractable so WGAN is based on an approximation.

For this they use the Kantorovich-Rubinstein duality which says
$$W(P_r, P_\theta) = \sup_{\lVert f \rVert_L \leq 1} \mathbb{E}_{x \sim P_r} \left[ f(x) \right] - \mathbb{E}_{x \sim P_\theta} \left[ f(x) \right]$$

They then replace the supremum over 1-Lipschitz functions with a supremum over K-Lipschitz functions, which equals $K \cdot W(P_r, P_\theta)$ since $f$ is K-Lipschitz exactly when $f / K$ is 1-Lipschitz, so the objective simply scales by $K$.
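
The dual form and the scaling by $K$ can be illustrated on samples. The sketch below is a toy under stated assumptions, not the paper's training procedure: the sample values, the shift `theta`, and the hand-picked linear functions are made up, and the expectations are replaced by sample means.

```python
def dual_objective(f, real, fake):
    """Sample estimate of E_{x~P_r}[f(x)] - E_{x~P_theta}[f(x)]."""
    def mean(values):
        return sum(values) / len(values)
    return mean([f(x) for x in real]) - mean([f(x) for x in fake])


theta = 0.5
real = [0.0, 0.25, 0.5, 0.75, 1.0]     # samples standing in for P_r
fake = [x + theta for x in real]       # the same samples shifted by theta


def f_opt(x):
    return -x          # 1-Lipschitz, attains the supremum for a pure shift


def f_weak(x):
    return -0.5 * x    # also 1-Lipschitz, but gives a smaller value


def f_k(x):
    return -3.0 * x    # 3-Lipschitz, i.e. K = 3


print(dual_objective(f_opt, real, fake))   # equals theta, the EM distance here
print(dual_objective(f_weak, real, fake))  # a lower bound on the EM distance
print(dual_objective(f_k, real, fake))     # K times the EM distance
```

Any 1-Lipschitz $f$ gives a lower bound on the distance, and the supremum picks the best one. In WGAN the hand-picked $f$ is replaced by a neural network that is trained to maximize this objective.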

## TODO

We say a distance $d$ is weaker than a distance $d'$ if every sequence that converges under $d'$ also converges under $d$.

Kantorovich-Rubinstein duality

Together, these results show that every sequence of distributions that converges under the KL, reverse-KL, TV, or JS divergence also converges under the Wasserstein distance. They also show that a small earth-mover distance corresponds to a small difference in distributions.