In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


# How does the GAN make $\pdata \approx \pmodel$ ?

The Generator Loss function we constructed is a proxy to achieve the goal

$$\pmodel \approx \pdata$$

That is: the distribution of samples produced by the Generator is (approximately) the same as the "true" distribution
- we note that we don't know the "true" $\pdata$
    - we only have a sample from it: the training set, which defines an *empirical* distribution

There are several ways to quantify

$$\pmodel \approx \pdata$$

One choice would be the minimization of KL Divergence
- $\KL( \pdata || \pmodel)$

As a reminder: we now show that this is equivalent to Maximum Likelihood estimation

Choose $\pmodel$ to Minimize

$$
\begin{array}{lcll}
\KL( \pdata || \pmodel ) & = & \int_\x  { \pdata(\x) \, \left( \log\frac{\pdata(\x)}{\pmodel(\x)} \right) }{d\x} & \text{Definition of KL Divergence} \\
& = & \E_{\x \in \pdata} \left[ \log \pdata(\x) - \log \pmodel(\x) \right] & \text{Log of a ratio is a difference of logs} \\
\text{so minimizing KL over } \pmodel \\
& \equiv & \text{minimizing } \E_{\x \in \pdata} \left[ - \log \pmodel(\x) \right]  & \text{Since } \E_{\x \in \pdata} \log \pdata(\x) \text{ does not depend on } \pmodel \\
\end{array}
$$

So minimizing $\KL$ is equivalent to minimizing the Negative Log Likelihood.
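
As a small numerical illustration of this equivalence, the sketch below uses made-up discrete distributions (the values of `p_data` and the three candidate models are assumptions chosen only for the example). It checks that KL and the expected Negative Log Likelihood differ by the entropy of `p_data`, which is the same for every candidate, so both criteria pick the same model.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """E_{x ~ p}[-log q(x)]: the expected Negative Log Likelihood of q under p."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical "true" distribution over 3 outcomes, and candidate models.
p_data = [0.5, 0.3, 0.2]
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close":   [0.45, 0.35, 0.20],
    "far":     [0.10, 0.10, 0.80],
}

for name, q in candidates.items():
    # KL and NLL differ by H(p_data), which does not depend on the candidate.
    gap = cross_entropy(p_data, q) - kl(p_data, q)
    print(f"{name:8s} KL={kl(p_data, q):.4f}  NLL={cross_entropy(p_data, q):.4f}  gap={gap:.4f}")

best_by_kl = min(candidates, key=lambda n: kl(p_data, candidates[n]))
best_by_nll = min(candidates, key=lambda n: cross_entropy(p_data, candidates[n]))
print(best_by_kl, best_by_nll)
```

The printed `gap` is identical on every row, which is why the ranking of candidates is the same under either objective.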

Notice that the expectation is over the "true" distribution $\pdata$.

The expectation is certainly reasonable for training but perhaps not best for the purposes of generating synthetic data
- It measures fidelity to the training data
- NOT how "realistic" the synthetic data is
- The penalty for $\pmodel$ placing large probability mass around a particular $\hat{\x}'$
is small when $\pdata(\hat{\x}') \approx 0$, because such points carry almost no weight in an expectation over $\pdata$
    - so the Generator may create a large quantity of synthetic data that is improbable given the training set
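
The following sketch illustrates the point with a made-up three-outcome example (both distributions are assumptions chosen for illustration). The model puts 30% of its mass on an outcome that the data almost never produces. The divergence whose expectation is over the data barely notices, while the divergence with the arguments swapped, whose expectation is over the model's own samples, is several times larger.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)); the expectation is over p."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example: the third outcome is almost never seen in real data.
p_data = [0.50, 0.499, 0.001]
# A model that places 30% of its mass on that improbable outcome.
p_model = [0.35, 0.35, 0.30]

forward = kl(p_data, p_model)   # expectation over p_data
reverse = kl(p_model, p_data)   # expectation over p_model

print(f"KL(p_data || p_model) = {forward:.3f}")
print(f"KL(p_model || p_data) = {reverse:.3f}")
```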

If we knew the true $\pdata$, a better objective to minimize for the purpose of generating synthetic data would
be the same divergence with its arguments reversed
$$
\KL( \pmodel || \pdata)
$$

which, apart from a term for the entropy of $\pmodel$ (which does not involve $\pdata$), amounts to minimizing
$$
\E_{\x \in \pmodel} \left[ - \log \pdata(\x) \right]
$$

The expectation is over the synthetic data, not the true data
- $-\log(\pdata(\x))$ is the *surprisal* of $\x$ under $\pdata$; its average is the log of Perplexity
    - a measure of the "surprise" in seeing $\x$
- So the expectation asks: for each synthetic datum generated, how likely is it to occur in the true distribution?

This is merely a theoretical argument
- In practical terms: we only have the empirical $\pdata$
- So we can't evaluate $\log \pdata(\hat{\x})$ for a synthetic $\hat{\x} \in \pmodel$ unless $\hat{\x}$ replicates a sample in the training data
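
A toy sketch of this limitation, using a hypothetical four-item training set (the items and the helper names `p_data_hat` and `surprisal` are invented for the example): the empirical distribution assigns probability zero to anything not in the training set, so the negative log probability of a novel synthetic sample is infinite.

```python
import math
from collections import Counter

# Hypothetical training set over a small discrete space.
train = ["cat", "dog", "cat", "bird"]
counts = Counter(train)

def p_data_hat(x):
    """Empirical probability of x: its relative frequency in the training set."""
    return counts[x] / len(train)

def surprisal(x):
    """-log of the empirical probability; infinite for anything not in the training set."""
    p = p_data_hat(x)
    return -math.log(p) if p > 0 else math.inf

seen = surprisal("cat")      # finite: "cat" appears in the training set
unseen = surprisal("fish")   # infinite: a novel sample has empirical probability 0

print(seen, unseen)
```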

# Jensen-Shannon Divergence

We have observed that the KL divergence is *not* symmetric
$$
\KL( P || Q ) \ne \KL( Q || P )
$$
because the expectations are taken over different distributions.

An alternative measure of similarity of two distributions is the Jensen-Shannon Divergence (JSD)

  $$
    \begin{array} \\
    \text{JSD}( P || Q ) & = & \text{JSD}( Q || P )\\
    & = & \frac{1}{2} \; \text{KL} \left( P \, ||\, \frac{P+Q}{2} \right) + \\
    && \frac{1}{2} \; \text{KL} \left( Q \, || \, \frac{P+Q}{2} \right)
    \end{array}
    $$
    
This measure is
- symmetric
- is a kind of mixture of $\KL(P || Q)$ and $\KL(Q || P)$.
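
A minimal sketch of the JSD for discrete distributions, written directly from the definition above (the example distributions are arbitrary assumptions). It checks three properties: KL changes when its arguments are swapped, the JSD does not, and the JSD of two distributions with disjoint support equals $\log 2$ (using natural logs).

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence: average KL of p and q to their midpoint mixture."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Arbitrary example distributions over 3 outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

print("KL(p||q), KL(q||p):", kl(p, q), kl(q, p))      # differ
print("JSD(p||q), JSD(q||p):", jsd(p, q), jsd(q, p))  # equal

# With disjoint supports the JSD stays finite (log 2), whereas KL would be infinite.
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
print("JSD of disjoint distributions:", disjoint)
```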

[Huszar](https://arxiv.org/pdf/1511.05101.pdf) proposes a Generalized JSD which interpolates between the two terms
$$
    \begin{array}{lcl}
    \text{JSD}_\pi( P || Q ) & = & \text{JSD}_{1-\pi}( Q || P )\\
    & = & \pi \; \text{KL} \left( P \, ||\, \pi P + (1-\pi) Q \right) + \\
    && (1-\pi) \; \text{KL} \left( Q \, || \, \pi P + (1-\pi) Q \right)
    \end{array}
    $$
    
The Generalized JSD
- Is **not** symmetric, although
$$
\text{JSD}_\pi( P || Q ) = \text{JSD}_{1-\pi}( Q || P )
$$
- Is similar to Maximum Likelihood (i.e., $\KL(P || Q )$) when $\pi = \epsilon$ is small
- Is similar to $\KL(Q || P )$ when $\pi = 1-\epsilon$

$$
\frac{\text{JSD}_{\epsilon}( P || Q )}{\epsilon} \approx \text{KL}( P || Q )
\qquad
\frac{\text{JSD}_{1-\epsilon}( P || Q )}{\epsilon} \approx \text{KL}( Q || P )
$$
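
The sketch below implements the Generalized JSD for discrete distributions from its definition and checks its limiting behaviour numerically. The example distributions and the choice $\epsilon = 10^{-4}$ are assumptions made for illustration. Note the normalization: in both limits the Generalized JSD itself shrinks to zero in proportion to $\epsilon$, so it is divided by $\epsilon$ before being compared with a KL divergence.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd_pi(p, q, pi):
    """Generalized JSD: pi * KL(p || m) + (1 - pi) * KL(q || m), with m = pi*p + (1-pi)*q."""
    m = [pi * px + (1 - pi) * qx for px, qx in zip(p, q)]
    return pi * kl(p, m) + (1 - pi) * kl(q, m)

# Arbitrary example distributions over 3 outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

# Swapping the arguments is the same as replacing pi by 1 - pi.
pi = 0.3
print(jsd_pi(p, q, pi), jsd_pi(q, p, 1 - pi))

eps = 1e-4
ml_like = jsd_pi(p, q, eps) / eps            # close to KL(p || q)
reverse_like = jsd_pi(p, q, 1 - eps) / eps   # close to KL(q || p)
print(ml_like, kl(p, q))
print(reverse_like, kl(q, p))
```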

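In implementing Generalized JSD (with $P = \pdata$ and $Q = \pmodel$)
- The Discriminator is trained (as usual) on a mix of real and fake examples
    - But *not* in equal numbers
    - $\pi$ is the fraction of samples from $P$ (real), matching the weight of $P$ in the mixture $\pi P + (1-\pi) Q$
    - $(1-\pi)$ is the fraction of samples from $Q$ (fake)
    - $\pi \lt \frac{1}{2}$: fake samples over-represented; the objective leans toward Maximum Likelihood
    - $\pi \gt \frac{1}{2}$: real samples over-represented; the objective leans toward $\KL(Q || P)$
- Might this explain why we often see training with the Generator updated twice for each update of the Discriminator?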
        
        

# Adversarial Training and the Jensen-Shannon Divergence

The Discriminator Loss $\loss_D$, summed over all examples (ignoring the $\frac{1}{2}$ from the previous presentation, where we assumed an equal number of Real and Fake examples), is

$$
\begin{array} \\ 
\loss_D 
& = &  - \left(  \E_{\x^\ip \in \pdata } { \log D(\x^\ip) }  + \E_{\x^\ip  \in \pmodel} { \log \left( 1 - D(\x^\ip)  \right) } \right) & \x^\ip = G(\z) \text{ for fake examples}\\
\end{array}
$$ 

We also showed that the optimal Discriminator results in 
$$
D^*(\x) =  \frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)}
$$
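
As a quick sanity check of this formula, consider a single point $\x$: its contribution to $\loss_D$ is $-\left( \pdata(\x) \log d + \pmodel(\x) \log(1-d) \right)$, where $d = D(\x)$. The sketch below (the two density values are made up for the example) minimizes this by brute-force grid search and compares the result with the closed form.

```python
import math

def pointwise_loss(d, a, b):
    """Contribution of a single x to L_D: -(a log d + b log(1 - d)),
    where a = p_data(x), b = p_model(x), d = D(x)."""
    return -(a * math.log(d) + b * math.log(1 - d))

# Hypothetical density values at some point x.
a, b = 0.3, 0.1

# Brute-force search over D(x) in (0, 1) with step 0.001.
grid = [i / 1000 for i in range(1, 1000)]
d_best = min(grid, key=lambda d: pointwise_loss(d, a, b))

# Closed form for the optimal Discriminator at x.
d_star = a / (a + b)

print(d_best, d_star)
```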

Plugging $D^*(\x)$ into $\loss_D$ (Goodfellow Equation ):

$$
\begin{array}{lcll}
\loss_D 
& = &  - \left(  \E_{\x \in \pdata } { \log  \frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)} }  + \E_{\x  \in \pmodel} { \log  \frac{\pmodel (\x)}{ \pmodel(\x) +\pdata(\x)} } \right) \\
& = & 
- \left(  \KL( \pdata || \pmodel +\pdata ) + \KL( \pmodel ||  \pmodel +\pdata ) \right) & \text{Def. of } \KL \\
& = & - \left( - \log 4 + \KL \left( \pdata || \frac{\pmodel +\pdata}{2} \right) + \KL \left( \pmodel || \frac{ \pmodel +\pdata}{2} \right) \right) & \text{dividing second arg. of each KL term by 2}  \\
& & & \text{adds } \log 2 \text{ to each KL term.} \\
& & & \text{The } - \log 4 \text{ offsets this}  \\
& = & - \left( - \log 4  + 2 \cdot \text{JSD} (  \pdata || \pmodel ) \right) & \text{Def. of JSD}\\
 & & & \text{this is Equation 6 of Goodfellow}\\
\end{array}
$$ 
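
The identity can be checked numerically. The sketch below uses arbitrary discrete distributions (an assumption for illustration) and the sign convention $\loss_D = -(\ldots)$ used in this section. Evaluated at the optimal Discriminator, the loss comes out as $\log 4 - 2 \cdot \text{JSD}(\pdata || \pmodel)$, that is, $-\left( -\log 4 + 2 \cdot \text{JSD} \right)$.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence between discrete distributions p and q."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def loss_d(p_data, p_model, D):
    """L_D = -(E_{p_data}[log D(x)] + E_{p_model}[log(1 - D(x))]) over a discrete x."""
    return -sum(pd * math.log(d) + pm * math.log(1 - d)
                for pd, pm, d in zip(p_data, p_model, D))

# Arbitrary example distributions over 3 outcomes.
p_data = [0.7, 0.2, 0.1]
p_model = [0.1, 0.3, 0.6]

# Optimal Discriminator: D*(x) = p_data(x) / (p_model(x) + p_data(x)).
d_star = [pd / (pd + pm) for pd, pm in zip(p_data, p_model)]

lhs = loss_d(p_data, p_model, d_star)
rhs = math.log(4) - 2 * jsd(p_data, p_model)
print(lhs, rhs)

# When the Generator matches the data exactly, D* = 1/2 everywhere and L_D = log 4.
at_equilibrium = loss_d(p_data, p_data, [0.5, 0.5, 0.5])
print(at_equilibrium, math.log(4))
```

Since $\text{JSD} \ge 0$, the value $\log 4$ is the largest loss the optimal Discriminator can be forced to accept, and it is reached only when $\pmodel = \pdata$.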

Thus Goodfellow shows that, against the optimal Discriminator, the Generator's side of the minimax
minimizes the Jensen-Shannon Divergence between $\pdata$ and $\pmodel$.

To summarize
- $\loss_D$ can be expressed in terms of KL Divergence
- *Under the assumption* that the Discriminator can train to be the **optimal** adversary
    - $\loss_D$ becomes equivalent (up to a constant and a scale factor) to the Jensen-Shannon Divergence