In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


# Introduction

When evaluating the quality of synthetic data, it is natural to ask whether one could
- train a Neural Network (NN) to distinguish between real and synthetic data

We will call a NN designed for that purpose a *Discriminator*.

We will call the NN designed to *generate* synthetic data the *Generator*.

It's easy to train a weak Discriminator
- one that distinguishes between real data and noise (random data)

We can train a stronger Discriminator if we have access to higher quality (than noise) synthetic data.

The higher the quality of the synthetic data, the stronger the Discriminator.

But how do we construct a Generator that might be able to create synthetic data good enough to fool the Discriminator ?

Using the NN for the Discriminator
- given an input $\x$ created by the Generator
- we can compute the Gradient 
    - of the logit (the Discriminator output indicating Real or Not Real)
    - with respect to $\x$
- the Generator can modify $\x$ using the Gradient in the direction that moves the logit toward "Real"  
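
The idea can be sketched with a hypothetical linear Discriminator whose logit is $w \cdot x + b$. The weights `w`, bias `b`, and step size are illustrative assumptions, not values from a trained model. For this model the gradient of the logit with respect to the input is simply `w`, so stepping the input along `w` raises the logit.

```python
# Hypothetical linear Discriminator: logit(x) = w . x + b
w = [0.5, -1.0, 2.0]   # illustrative weights (assumption)
b = -0.25              # illustrative bias (assumption)

def logit(x):
    """Discriminator output before the sigmoid; larger means 'more Real'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def grad_logit_wrt_x(x):
    """For a linear logit, d logit / d x_i = w_i (independent of x)."""
    return list(w)

def nudge_toward_real(x, step=0.1):
    """Move x a small step in the direction that increases the logit."""
    g = grad_logit_wrt_x(x)
    return [xi + step * gi for xi, gi in zip(x, g)]

x_fake = [1.0, 1.0, -1.0]
x_new = nudge_toward_real(x_fake)
print(logit(x_fake), logit(x_new))  # the second logit is larger
```

With a real (nonlinear) Discriminator the gradient would come from backpropagation rather than a closed form.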
    

One can imagine an iterative process in which
- feedback from the Discriminator improves the Generator
- the resulting higher quality synthetic data from the Generator can be used to train a stronger Discriminator

This "adversarial" training is the basis for a *Generative Adversarial Network (GAN)*

**Aside**

The [GAN](https://arxiv.org/pdf/1406.2661.pdf) was invented by Ian Goodfellow in one night, following a party at a [bar](https://www.technologyreview.com/2018/02/21/145289/the-ganfather-the-man-whos-given-machines-the-gift-of-imagination/) !

# Details 

**Notation summary**


Symbol | Meaning
:----|:---|
<img width=100 /> | <img width=300 /> 
$p_\text{data}$ | Distribution of real data 
$\x \in p_\text{data}$  | Real sample 
$p_\text{model}$ | Distribution of fake data 
$\hat{\x}$ | Fake sample
           | $\hat{\x} \not\in p_\text{data}$ 
           | $\text{shape}(\hat{\x}) = \text{shape} ( \x ) $
$\tilde{\x}$ | Sample (real or fake)
             | $\text{shape} ( \tilde{\x} ) =\text{shape}(\x)$
$D_{\Theta_D}$ | Discriminator NN, parameterized by $\Theta_D$ 
               | Binary classifier:  $\tilde{\x} \mapsto \{ \text{Real}, \text{Fake} \} $
               | $D_{\Theta_D}(\tilde{\x}) \in \{ \text{Real}, \text{Fake} \} \text{ for } \text{shape}(\tilde{\x}) = \text{shape}(\x)$ 
$\z$ | vector of random numbers with distribution $p_\z$
$G_{\Theta_G}$  | Generator NN, parameterized by $\Theta_G$  
                | $\z \mapsto \hat{\x}$
                | $\text{shape}( G(\z) ) = \text{shape}(\x)$
                | $G(\z) \in p_\text{model}$



Our goal is to generate new *synthetic* examples.

Let
- $\x$ denote a *real* example
    - vector of length $n$
- $\pdata$ be the distribution of real examples
   - $\x \in \pdata$
   

We will create a Neural Network called the *Generator*

Generator $G_{\Theta_G}$ (parameterized by $\Theta_G$) will
- take a vector $\z$ of random numbers from distribution $p_\z$ as input
- output $\hat{\x}$, a *synthetic/fake* example
    - vector of length $n$

Let
- $\pmodel$ be the distribution of fake examples

<table>
    <tr>
        <th><center>GAN Generator</center></th>
    </tr>
    <tr>
        <td><img src="images/GAN_generator.png"></td>
    </tr>
</table>

The Generator will be paired with another Neural Network called the *Discriminator*.

The Discriminator $D_{\Theta_D}$ (parameterized by $\Theta_D$) is a binary Classifier
- takes a vector $\tilde{\x} \in \pdata \cup \pmodel$
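
To make the shapes concrete, here is a minimal sketch of the two networks as plain functions. The single-layer form, the random (untrained) weights, and the sizes `n` and `k` are all assumptions chosen for illustration. The Discriminator returns a probability of Real, which is thresholded at 0.5 to obtain a label.

```python
import math
import random

random.seed(0)
n, k = 4, 2   # example length n and noise length k (illustrative values)

# Hypothetical single-layer networks with random, untrained weights
theta_G = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]  # n x k
theta_D = [random.gauss(0, 1) for _ in range(n)]                      # length n

def G(z):
    """Generator: noise vector z (length k) -> fake example (length n)."""
    return [sum(row[j] * z[j] for j in range(k)) for row in theta_G]

def D(x_tilde):
    """Discriminator: example (length n) -> probability that it is Real."""
    s = sum(w * xi for w, xi in zip(theta_D, x_tilde))
    return 1.0 / (1.0 + math.exp(-s))

z = [random.gauss(0, 1) for _ in range(k)]
x_hat = G(z)
label = "Real" if D(x_hat) >= 0.5 else "Fake"
print(len(x_hat), D(x_hat), label)
```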

**Goal of Discriminator**
$$
\begin{array}{lcll}
D( \tilde{\x} ) & = & \text{Real} & \text{ for } \tilde{\x} \in p_\text{data} \\
D (\tilde{\x} ) & = &\text{Fake}  &\text{ for } \tilde{\x} \in p_\text{model}
\end{array}
$$

That is
- the Discriminator tries to distinguish between Real and Fake examples

<table>
    <tr>
        <th><center>GAN Discriminator</center></th>
    </tr>
    <tr>
        <td><img src="images/GAN_discriminator.png"></td>
    </tr>
</table>

In contrast, the goal of the Generator is to have its fake examples classified as Real.

**Goal of Generator**
$$
\begin{array}{lcll}
D (\hat{\x} ) & = & \text{Real} & \text{ for } \hat{\x} = G_{\Theta_G}(\z)  \in p_\text{model}
\end{array}
$$

That is
- the Generator tries to create fake examples that can fool the Discriminator into classifying as Real

How is this possible ?

We describe a training process (that updates $\Theta_G$ and $\Theta_D$)
- That follows an *iterative* game
- Train the Discriminator to distinguish between 
    - Real examples
    - and the Fake examples produced by the Generator on the prior iteration
- Train the Generator to produce examples better able to fool the updated Discriminator

Sounds reasonable, but how do we get the Generator to improve its fakes ?

We will define loss functions 
- $\loss_G$ for the Generator
- $\loss_D$ for the Discriminator

Then we can improve the Generator (parameterized by $\Theta_G$) by Gradient Descent
- updating $\Theta_G$ by $- \frac{\partial\loss_G}{\partial {\Theta_G}}$
- since $\Theta_G$ controls production of $\hat{\x}$, we modify $\Theta_G$ rather than $\hat{\x}$ directly

That is
- The Discriminator will indirectly give "hints" to the Generator as to why a fake example failed to fool

<table>
    <tr>
        <th><center>GAN Generator training</center></th>
    </tr>
    <tr>
        <td><img src="images/GAN_generator_train.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>GAN Discriminator training</center></th>
    </tr>
    <tr>
        <td><img src="images/GAN_discriminator_train.png"></td>
    </tr>
</table>

After enough rounds of the "game" we hope that the Generator and Discriminator battle to a stand-off
- the Generator produces realistic fakes
- the Discriminator has only a $50 \%$ chance of correctly labeling a fake as Fake

# Loss functions

The goal of the generator can be stated as
- Creating $\pmodel$ such that
- $\pmodel \approx \pdata$



 
There are a number of ways to measure the dis-similarity of two distributions
- KL divergence
    - equivalent to Maximum Likelihood estimation
- Jensen Shannon Divergence (JSD)
- Earth Mover Distance (Wasserstein GAN)
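
As a small illustration of the first two measures, the sketch below computes the KL divergence and the JSD for two discrete distributions. The distributions are made-up values, and natural logarithms are assumed. The Earth Mover Distance is not covered here.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions.
    Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen Shannon Divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = [0.7, 0.2, 0.1]    # illustrative "real" distribution
p_model = [0.4, 0.4, 0.2]   # illustrative "fake" distribution

print(kl(p_data, p_model), kl(p_model, p_data))   # KL is not symmetric
print(jsd(p_data, p_model), jsd(p_model, p_data)) # JSD is symmetric
```

Both measures are zero when the two distributions are identical, which is the Generator's target.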

The original paper's minimax objective, when the Discriminator is optimal, is equivalent to minimizing the Jensen Shannon Divergence (up to a constant), so we illustrate with that objective.

To be concrete, let the Discriminator use labels
- $1$ for Real
- $0$ for Fake

 
The Discriminator tries to maximize the per example $- \loss_D$ (equivalently, to minimize $\loss_D$), where $D(\tilde{\x})$ is the probability it assigns to Real

$$
- \loss_D = 
\begin{cases} 
\log D(\tilde{\x}) & \text{ when } \tilde{\x} \in \pdata \\
\log \left( 1 - D(\tilde{\x}) \right) & \text{ when } \tilde{\x} \in \pmodel \\
\end{cases}
$$

That is
- Classify real $\x$ as Real
- Classify fake $\hat{\x}$ as Fake

In training the Discriminator, we present it with batches of examples
- half real, half fake

The Discriminator tries to minimize the average loss over the batch (equivalently, to maximize its negative)
$$
\begin{array}{lcll}
\loss_D & = & - \left( \frac{1}{2}  \E_{\x^\ip \in \pdata } { \log D(\x^\ip) }  +  \frac{1}{2} \E_{\z \in p_\z} { \log \left( 1 - D(G(\z))  \right) } \right) & \text{leading minus sign to turn this into minimization } \\
& = &  - \left( \frac{1}{2}  \E_{\x^\ip \in \pdata } { \log D(\x^\ip) }  +  \frac{1}{2} \E_{\x^\ip  \in \pmodel} { \log \left( 1 - D(\x^\ip)  \right) } \right) & G(\z) = \x^\ip \text{ for fake examples}\\
& = & - \frac{1}{2}  \sum_{\x^\ip \in \pdata}  \pdata (\x^\ip) \log D(\x^\ip)   - \frac{1}{2}  \sum_{\x^\ip \in \pmodel} \pmodel (\x^\ip) \log ( 1 - D(\x^\ip))   \\
\end{array}
$$ 

You will recognize this term as Binary Cross Entropy (BCE)
- hence, you will see BCE used as the Loss Function in the code
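
As a quick numeric check of the BCE claim, the sketch below evaluates the batch loss both ways. The Discriminator outputs are made-up probabilities, and the helper `bce` is a hand-written illustration rather than a library function.

```python
import math

def bce(labels, probs):
    """Binary cross entropy averaged over a batch (natural log)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

# Hypothetical Discriminator outputs on a batch: half real, half fake
d_real = [0.9, 0.8]   # D(x) for real examples (label 1)
d_fake = [0.3, 0.1]   # D(G(z)) for fake examples (label 0)

# Loss written as in the formula:
# -(1/2 E[log D(x)] + 1/2 E[log(1 - D(G(z)))])
loss_formula = -(0.5 * sum(math.log(d) for d in d_real) / len(d_real)
                 + 0.5 * sum(math.log(1 - d) for d in d_fake) / len(d_fake))

# Same quantity via BCE on the combined batch with labels 1 (Real), 0 (Fake)
loss_bce = bce([1, 1, 0, 0], d_real + d_fake)
print(loss_formula, loss_bce)   # the two values agree
```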
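For the Generator, the game is zero sum
$$
\begin{array}{ll}
\loss_G = - \loss_D & \text{Zero sum game} \\
\end{array}
$$

Only the fake-example term of $\loss_D$ depends on the Generator, so the per-example Loss for the Generator is 
$$\loss_G = \log \left( 1 - D(G(\z)) \right)$$

which decreases toward its minimum as
$$D(G(\z)) \rightarrow 1$$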

That is
- the Discriminator mis-classifies the fake example as Real

The Generator takes batches of $\z$ (and hence sees only fake examples, not an even mix of real and fake as does the Discriminator).

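Since the game is zero sum
$$
\loss_G = - \loss_D 
$$
you will similarly see BCE as the Loss for the Generator
- except the "true" labels passed to BCE will be an array of "Real"
- as opposed to a mix of "Real" and "Fake" labels in the BCE of the Discriminator
- strictly, BCE with all "Real" labels minimizes $- \log D(G(\z))$ rather than $\log ( 1 - D(G(\z)) )$; both are reduced by raising $D(G(\z))$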
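
The two forms of the Generator loss can be compared numerically. The sketch below uses made-up Discriminator outputs and a hand-written `bce` helper (both are assumptions for illustration). It shows that BCE with all labels set to Real and the zero-sum form both fall as the fakes become more convincing.

```python
import math

def bce(labels, probs):
    """Binary cross entropy averaged over a batch (natural log)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

def generator_loss_bce(d_fake):
    """BCE with every 'true' label set to Real (1): mean of -log D(G(z))."""
    return bce([1] * len(d_fake), d_fake)

def generator_loss_zero_sum(d_fake):
    """Zero-sum form: mean of log(1 - D(G(z)))."""
    return sum(math.log(1 - d) for d in d_fake) / len(d_fake)

weak_fakes = [0.1, 0.2, 0.3]     # hypothetical D(G(z)) early in training
better_fakes = [0.6, 0.7, 0.8]   # hypothetical D(G(z)) after improvement

for d_fake in (weak_fakes, better_fakes):
    print(generator_loss_bce(d_fake), generator_loss_zero_sum(d_fake))
```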

So the iterative game seeks to solve a minimax problem

$$
\min_{G}\max_{D} \left( \mathbb{E}_{\x \in p_\text{data}} \log D(\x) + \mathbb{E}_{\z \in p_\z} \log \left( 1 - D(G(\z)) \right) \right)
$$
- $D$ tries to 
    - make $D(\x)$ big: correctly classify (with high probability) real $\x$
    - and $D(G(\z))$ small: correctly classify (with low probability) fake $G(\z)$
- $G$ tries to
    - make $D(G(\z))$ high: fool $D$ into a high probability for a fake

Note that the Generator improves 
- by updating $\Theta_G$
- so as to increase $D(G(\z))$
    - the mis-classification of the fake as Real

## Optimal Discriminator Loss

We can minimize the per example $\loss_D$ with respect to $D(\x)$ by taking the derivative and setting it to $0$

$$
\begin{array}{lcll}
\frac{ \partial \loss_D}{\partial D(\x)} & = & - \frac{1}{2} \left( 
 \pdata(\x) * \frac{1}{\log_e 10} \frac{1}{D(\x)} + 
 \pmodel(\x) * \frac{1}{\log_e 10} \frac{1}{1 - D(\x)} * -1  \right) & \text{Definition: } \log_{10} a = \frac{\log_e a}{\log_e 10};  \\
 & & & \text{Derivative of } \log_e a \text{ is } \frac{1}{a}\\
 & = & - \frac{1}{2 * \log_e 10} \frac{ \pdata(\x) * (1 - D(\x)) - \pmodel(\x)* D(\x)}{D(\x) * (1 - D(\x)) } \\
 & = & \frac{1}{c} \frac{ \pdata (\x) - D(\x) ( \pmodel(\x) +\pdata(\x))}{D(\x) * (1 - D(\x))} & \text{where } c = - 2 * \log_e 10 \\
\frac{ \partial \loss_D}{\partial D(\x)} = 0 & \Rightarrow & D^*(\x) = \frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)}
\end{array}
$$

So the optimal Discriminator assigns to $\x$ the probability of being Real

$$
\frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)}
$$

The optimal Generator results in
$$
\pmodel(\x) = \pdata(\x)
$$

Thus, if the minimax optimization succeeds
$$
D^*(\x) = \frac{1}{2}
$$

Nothing better than a coin toss !
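
The optimal Discriminator can be checked numerically. The sketch below minimizes the contribution of a single $\x$ to $\loss_D$ by grid search and compares the result with the closed form. The density values are illustrative assumptions, and natural logarithms are used, since the base only changes a constant factor.

```python
import math

def pointwise_loss(d, p_data, p_model):
    """Contribution of a single x to L_D when D(x) = d."""
    return -0.5 * (p_data * math.log(d) + p_model * math.log(1 - d))

def d_star(p_data, p_model):
    """Closed-form optimal Discriminator output at x."""
    return p_data / (p_data + p_model)

p_data_x, p_model_x = 0.6, 0.2   # illustrative densities at some x

# Grid search over D(x) in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda d: pointwise_loss(d, p_data_x, p_model_x))

print(best, d_star(p_data_x, p_model_x))   # both approximately 0.75
print(d_star(0.4, 0.4))                    # equal densities: a coin toss
```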

# Training

We will train Generator $G_{\Theta_G}$ and Discriminator $D_{\Theta_D}$ by turns
- creating sequence of updated parameters
    - $\Theta_{G, (1)} \ldots \Theta_{G,(T)}$
    - $\Theta_{D, (1)} \ldots \Theta_{D,(T)}$
- Trained *competitively*

**Competitive training**

Iteration $\tt$

- Train $D_{\Theta_{D, (\tt-1)}}$ on samples
    - $\tilde{\x} \in p_\text{data} \cup p_{\text{model}, (\tt-1)}$
        - where $G_{\Theta_{G, (\tt-1)}} ( \z) \in p_{\text{model}, (\tt-1)}$
    - Update $\Theta_{D, (\tt-1)}$ to $\Theta_{D, \tp}$ via gradient $\frac{\partial \loss_D}{\partial \Theta_{D,(\tt-1)}}$
        - $D$ is a maximizer of $\mathbb{E}_{\x \in p_\text{data}} \log D(\x) + \mathbb{E}_{\z \in p_\z} \log ( \, 1 - D(G(\z)) \, )$
- Train $G_{\Theta_{G, (\tt-1)}}$ on random samples $\z$
    - Create samples $\hat{\x}_\tp = G_{\Theta_{G, (\tt-1)}}(\z)  \in p_{\text{model}, (\tt-1)}$
    - Have Discriminator $D_{\Theta_{D, \tp}}$ evaluate $D_{\Theta_{D,\tp}} ( \hat{\x}_\tp )$
    - Update $\Theta_{G, (\tt-1)}$ to $\Theta_{G, \tp}$ via gradient $\frac{\partial \loss_G}{\partial \Theta_{G,(\tt-1)}}$
        - $G$ is a minimizer of $\mathbb{E}_{\z \in p_\z} \log ( \, 1 - D(G(\z)) \, )$
            - i.e., want $D(G(\z))$ to be high
    - May update $G$ multiple times per update of $D$
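
The alternating updates can be sketched on a toy one-dimensional problem with hand-derived gradients. Everything here is an assumption made for illustration: real data drawn from $N(3, 1)$, a linear Generator `a*z + b`, a logistic Discriminator `sigmoid(w*x + c)`, and the learning rate, batch size, and iteration count. The Generator uses BCE with all labels set to Real, as described in the Loss functions section. This is not the code of the linked Keras example, and no claim is made about where the parameters end up.

```python
import math
import random

random.seed(0)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

a, b = 1.0, 0.0      # Generator parameters (Theta_G): G(z) = a*z + b
w, c = 0.1, 0.0      # Discriminator parameters (Theta_D): D(x) = sigmoid(w*x + c)
lr, batch, T = 0.05, 64, 200

for t in range(T):
    # --- Train D on half real, half fake (fakes from the current G) ---
    x_real = [random.gauss(3.0, 1.0) for _ in range(batch)]
    z = [random.gauss(0.0, 1.0) for _ in range(batch)]
    x_fake = [a * zi + b for zi in z]
    d_real = [sigmoid(w * x + c) for x in x_real]
    d_fake = [sigmoid(w * x + c) for x in x_fake]
    # Gradient of L_D = -(1/2 mean log D(x) + 1/2 mean log(1 - D(x_hat)))
    grad_w = 0.5 * (sum((d - 1) * x for d, x in zip(d_real, x_real))
                    + sum(d * x for d, x in zip(d_fake, x_fake))) / batch
    grad_c = 0.5 * (sum(d - 1 for d in d_real) + sum(d_fake)) / batch
    w, c = w - lr * grad_w, c - lr * grad_c

    # --- Train G against the updated D ---
    z = [random.gauss(0.0, 1.0) for _ in range(batch)]
    x_fake = [a * zi + b for zi in z]
    d_fake = [sigmoid(w * x + c) for x in x_fake]
    # Gradient of L_G = -mean log D(G(z)), via the chain rule through D
    grad_a = sum((d - 1) * w * zi for d, zi in zip(d_fake, z)) / batch
    grad_b = sum((d - 1) * w for d in d_fake) / batch
    a, b = a - lr * grad_a, b - lr * grad_b

print(a, b, w, c)
```

Note that the Generator never sees real data directly: its gradient depends on the real examples only through the Discriminator's parameters `w` and `c`.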

**Training code for a simple GAN**

[Here](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/dcgan_overriding_train_step.ipynb#scrollTo=AOO8AqLy86jb)
       is the code for the training step of a simple GAN.

# Issues

Although the description of GAN training as an adversarial game is appealing,
actually getting training to find a stable equilibrium is difficult in practice.


## Vanishing Gradient

Early in training, the Discriminator has the advantage
- it has been trained to distinguish real input from noise
- the parameters of the Generator are still at their random initial values
    - Generator needs feedback from Discriminator in order to learn direction for improvement

What happens if the Discriminator is "too good" ?
- $D(\hat{\x}) = 0$ for all $\hat{\x} \in p_\text{model}$ 

With absolute certainty that every $\hat{\x}$ from the Generator is Fake, the gradient is zero (or near zero)
- Generator can't learn (weight updates near zero)

So we don't want the Discriminator to be too good, too early in training.
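
The effect can be seen in the gradient with respect to the Discriminator's logit $s$, where $D = \sigma(s)$. The sketch below uses the closed-form derivatives of two Generator losses; the logit values are illustrative. When the Discriminator is certain a sample is Fake ($s$ very negative), the gradient of the zero-sum loss $\log(1 - D)$ shrinks toward zero, whereas the gradient of BCE with Real labels, $-\log D$, stays close to $-1$.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_zero_sum(s):
    """d/ds of log(1 - sigmoid(s)), which simplifies to -sigmoid(s)."""
    return -sigmoid(s)

def grad_bce_real_labels(s):
    """d/ds of -log(sigmoid(s)), which simplifies to sigmoid(s) - 1."""
    return sigmoid(s) - 1.0

# More negative logit = Discriminator more certain the sample is Fake
for s in [0.0, -5.0, -10.0]:
    print(s, sigmoid(s), grad_zero_sum(s), grad_bce_real_labels(s))
```

These are gradients at the Discriminator's output only; the gradient reaching $\Theta_G$ is further scaled by the chain rule through both networks.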

## Mode Collapse

We condition the Generator on random $\z$ so that it will produce diverse $\hat{\x}$.

Sometimes, the Generator is only able to create a single $\hat{\x}'$ (or a small number of them) that is good
enough to fool the Discriminator.

In this case: the Generator may learn to ignore input $\z$ and *only* produce $\hat{\x}'$.

## Hard to achieve equilibrium

The optimal solution is the Nash equilibrium of the minimax problem
$$
\min_{G}\max_{D} \left( \mathbb{E}_{\x \in p_\text{data}} \log D(\x) + \mathbb{E}_{\z \in p_\z} \log \left( 1 - D(G(\z)) \right) \right)
$$

However: the objective of Neural Network training is minimization of a Loss.

There is no guarantee that Gradient Descent will always converge to the Nash equilibrium
- [See this paper, section 3](https://arxiv.org/pdf/1412.6515.pdf)
- [Also, see this paper, section 3](https://arxiv.org/pdf/1606.03498.pdf)

The gradients are partial derivatives with respect to one player's parameters, *holding everything else constant*.

But everything is *not* constant: the Generator and Discriminator are each modifying their weights.
- So the weight update of the Generator may not result in improvement if the simultaneous weight update of the Discriminator moves in the opposite direction.

An often cited example of a 2 player game illustrates the point
- Player 1 seeks to minimize product $x * y$ by manipulating $x$
$$
\begin{array}{ll}
\frac{\partial x * y}{\partial x} = y \\
x \rightarrow (x -y) & \text {update x by negative of gradient} \\
\end{array}
$$
- Player 2 seeks to minimize product $- x * y$ by manipulating $y$
$$
\begin{array}{ll}
\frac{\partial (- x * y)}{\partial y} = - x \\
y \rightarrow (y + x) & \text {update y by negative of gradient} \\
\end{array}
$$

If $x, y$ have opposite signs, then the update causes them to either both increase or both decrease.
- one can show by experiment that repeated updates cause $x, y$ to oscillate with increasing magnitude.
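
The experiment is easy to run. The sketch below applies both updates simultaneously, each computed from the old values, with a step size of 1 to match the update rules above. The starting point is an arbitrary choice.

```python
import math

def simulate(x, y, steps, lr=1.0):
    """Simultaneous gradient descent: player 1 minimizes x*y over x,
    player 2 minimizes -x*y over y."""
    history = [(x, y)]
    for _ in range(steps):
        grad_x = y          # d(x*y)/dx
        grad_y = -x         # d(-x*y)/dy
        # Both players update from the old values
        x, y = x - lr * grad_x, y - lr * grad_y
        history.append((x, y))
    return history

history = simulate(1.0, -1.0, steps=8)
for x, y in history:
    print(x, y, math.hypot(x, y))   # distance from (0, 0) grows every step
```

The signs of $x$ and $y$ keep flipping while the distance from the equilibrium at $(0, 0)$ grows by a factor of $\sqrt{2}$ per step, so the players spiral outward rather than converge.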

# Code

- [DCGAN overriding `train_step` (Keras example)](https://keras.io/examples/generative/dcgan_overriding_train_step/)
- [Wasserstein GAN with Gradient Penalty](https://keras.io/examples/generative/wgan_gp/#create-the-wgangp-model)

# References
- [Goodfellow](https://arxiv.org/pdf/1406.2661.pdf)
- [Huszar](https://arxiv.org/pdf/1511.05101.pdf)
- [Wasserstein GAN paper](https://arxiv.org/pdf/1701.07875.pdf)

**Good blog, submitted as paper**
- [Weng blog](https://arxiv.org/pdf/1904.08994.pdf)