# Towards Principled Methods for Training Generative Adversarial Networks
This note is a section-by-section summary of the [paper](https://arxiv.org/abs/1701.04862), based on personal understanding. Some personal remarks appear at the end of the note.

## 1. Introduction


In a GAN, the discriminator $D$ is trained to maximize [[Ref](2014_GAN.ipynb)]

$$
L(D,G) = \mathbb{E}_{x\sim p_r} \left[\log D(x)\right] + \mathbb{E}_{x\sim p_g} \left[\log (1-D(x))\right]
$$

where $p_r$ is the real data distribution and $p_g$ is the generator distribution. The optimal discriminator then has the form

$$
D^*(x) = \frac{p_r(x)}{p_r(x)+p_g(x)}
$$

When the discriminator is optimal,

$$
L(D^*,G)=  2 JSD \left( p_r | p_g \right) -2 \log 2
$$

where  

$$
JSD\left( p_r | p_g \right) = \frac{1}{2}KL\left( p_r | p_A \right)  + \frac{1}{2} KL\left( p_g | p_A \right) 
$$

and $p_A=\frac{p_r+p_g}{2}$ is the average of the two distributions. It is conjectured that the success of GANs is due to the switch from traditional maximum likelihood approaches to the Jensen-Shannon divergence [[Ref](https://arxiv.org/abs/1511.01844)].
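The identity $L(D^*,G) = 2 JSD \left( p_r | p_g \right) - 2 \log 2$ can be checked numerically. The snippet below is a minimal sketch on a finite sample space; the two distributions and the helper names (`kl`, `jsd`, `discriminator_objective`) are made up for illustration and do not come from the paper.

```python
import numpy as np

def kl(p, q):
    """KL(p | q) for discrete distributions; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence using the average p_A = (p + q) / 2."""
    p_a = 0.5 * (p + q)
    return 0.5 * kl(p, p_a) + 0.5 * kl(q, p_a)

def discriminator_objective(d, p_r, p_g):
    """L(D, G) = E_{p_r}[log D] + E_{p_g}[log(1 - D)] on a finite set."""
    return float(np.sum(p_r * np.log(d)) + np.sum(p_g * np.log(1.0 - d)))

# Two arbitrary (hypothetical) distributions with full support on 4 points.
p_r = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

# Optimal discriminator D*(x) = p_r(x) / (p_r(x) + p_g(x)).
d_star = p_r / (p_r + p_g)

lhs = discriminator_objective(d_star, p_r, p_g)
rhs = 2.0 * jsd(p_r, p_g) - 2.0 * np.log(2.0)
print(lhs, rhs)  # the two numbers should agree up to rounding
```

Any other discriminator, for example `d_star + 0.05`, gives a strictly smaller objective on this example, which is what optimality of $D^*$ means here.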

However, in practice (experimentally), as the discriminator gets better, the updates to the generator get consistently worse.



<br/><br/>
## 2. Source of Instability

According to the theory, the discriminator will have cost at most $2 \log 2 - 2 JSD \left( p_r | p_g \right) $. 
However, in practice, if we just train $D$ until convergence, its error goes to $0$, as observed in the following figure (from the [paper](https://arxiv.org/abs/1701.04862)):

<img src="GANtrainingTheoryFig1.png" width="800"/>

This means the Jensen-Shannon divergence is maxed out, i.e. $JSD \left( p_r | p_g \right) = \log 2$. **The only way this can happen is if the distributions are not continuous, or they have disjoint supports.**
- One possible cause of a **discontinuous distribution** is that its **support lies on a low dimensional manifold**
    - Often $p_r$ is concentrated on a low dimensional manifold [[Ref](https://papers.nips.cc/paper/2010/hash/8a1e808b55fde9455cb3d8857ed88389-Abstract.html)]
    - **Lemma 1.** The image of the generator net, $G(Z)$, is contained in a countable union of manifolds of dimension at most $\dim(Z)$. Hence, if $\dim(Z)$ is smaller than the dimension of the data space, $G(Z)$ has measure zero in it.
    
    

<br/><br/>
### 2.1. The perfect discriminator theorems

**Theorem 2.1.** If the supports of $p_r$ and $p_g$ are contained in disjoint compact manifolds $\mathcal{M}_r$ and $\mathcal{M}_g$, then there exists a perfect (i.e. accuracy 1) discriminator for all $x\in \mathcal{M}_r \cup \mathcal{M}_g$.

**Theorem 2.2.** If the supports of $p_r$ and $p_g$ are contained in closed compact manifolds $\mathcal{M}_r$ and $\mathcal{M}_g$ that do not align perfectly and do not have full dimension, and $p_r$ and $p_g$ are continuous, then there exists a perfect (i.e. accuracy 1) discriminator for almost all $x\in \mathcal{M}_r \cup \mathcal{M}_g$.

**Theorem 2.3.** If the supports of $p_r$ and $p_g$ are contained in manifolds $\mathcal{M}_r$ and $\mathcal{M}_g$ that do not align perfectly and do not have full dimension, and $p_r$ and $p_g$ are continuous, then $JSD\left( p_r | p_g \right) = \log2$.
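A toy illustration of these statements, as a sketch only: the theorems concern distributions on manifolds, while the snippet uses a finite sample space with disjoint supports as a stand-in. The distributions and helper names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p | q) for discrete distributions; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence using the average p_A = (p + q) / 2."""
    p_a = 0.5 * (p + q)
    return 0.5 * kl(p, p_a) + 0.5 * kl(q, p_a)

# Disjoint supports: p_r lives on the first two points, p_g on the last two.
p_r = np.array([0.5, 0.5, 0.0, 0.0])
p_g = np.array([0.0, 0.0, 0.25, 0.75])

# Optimal discriminator: 1 on the support of p_r, 0 on the support of p_g.
d_star = p_r / (p_r + p_g)

# Accuracy with equal priors on "real" and "generated" samples.
predict_real = d_star > 0.5
accuracy = 0.5 * p_r[predict_real].sum() + 0.5 * p_g[~predict_real].sum()

jsd_value = jsd(p_r, p_g)
print(accuracy, jsd_value, np.log(2.0))
```

On this example the discriminator separates the two supports exactly and the divergence reaches its maximum, $\log 2$, regardless of how the mass is spread inside each support.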

<br/><br/>
### 2.2. The consequences and the problems of each cost function
Consequences: by Theorems 2.1 and 2.2, if the supports of $p_r$ and $p_g$ are disjoint or lie on low dimensional manifolds, the optimal discriminator is perfect and its gradient is zero (almost everywhere).

Next, we look at the gradient that the generator receives under each cost function.

##### 2.2.1. The original cost function

$$
\lim_{||D-D^*|| \rightarrow 0} \nabla_\theta \mathbb{E}_{z\sim p(z)}\left[ \log \left(1-D(G_\theta(z))\right)\right] = 0
$$

##### 2.2.2. The $-\log D$ alternative

$$
- \nabla_\theta \mathbb{E}_{z\sim p(z)}\left[ \log D^*(G_\theta(z))\right] = \nabla_\theta \left[ KL\left( p_g | p_r \right) - 2 JSD \left( p_g | p_r \right) \right]
$$
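A sketch of where this identity comes from, based on my reading of the argument in the paper (not a full proof). The notation is introduced only for this derivation: $\theta_0$ denotes the current generator parameters, $p_{g_\theta}$ the generator distribution at parameters $\theta$, $D^*$ the optimal discriminator for the fixed pair $(p_r, p_{g_{\theta_0}})$, and all gradients are evaluated at $\theta=\theta_0$.

Since $\frac{D^*(x)}{1-D^*(x)} = \frac{p_r(x)}{p_{g_{\theta_0}}(x)}$, the reverse KL can be written as

$$
KL\left( p_{g_\theta} | p_r \right) = - \mathbb{E}_{z\sim p(z)}\left[ \log \frac{D^*(G_\theta(z))}{1-D^*(G_\theta(z))} \right] + KL\left( p_{g_\theta} | p_{g_{\theta_0}} \right)
$$

The last term is minimized at $\theta=\theta_0$, so its gradient vanishes there. For the Jensen-Shannon term, the expectation over $p_r$ in $L(D^*,G)$ does not depend on $\theta$ once $D^*$ is held fixed, which (by the usual envelope-type argument for GANs) gives

$$
\nabla_\theta \, 2 JSD \left( p_{g_\theta} | p_r \right) = \nabla_\theta \mathbb{E}_{z\sim p(z)}\left[ \log \left(1-D^*(G_\theta(z))\right)\right]
$$

Subtracting the second equation from the gradient of the first, the $\log(1-D^*)$ terms cancel:

$$
\nabla_\theta \left[ KL\left( p_{g_\theta} | p_r \right) - 2 JSD \left( p_{g_\theta} | p_r \right) \right] = - \nabla_\theta \mathbb{E}_{z\sim p(z)}\left[ \log D^*(G_\theta(z))\right]
$$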

The negative Jensen-Shannon term pushes the two distributions apart. *Note* also that the KL term here is the reverse KL, $KL\left( p_g | p_r \right)$, which is not the one equivalent to maximum likelihood.

In practice, the norm of the gradient grows drastically as the discriminator is trained closer to optimality.

<br/><br/>
## 3. Toward softer metrics and distributions

How to fix this? Break the assumptions of the previous theorems (by adding noise)!