In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py

# How does the GAN make $\pdata \approx \pmodel$ ?

The Generator Loss function we constructed is a proxy to achieve the goal

$$\pmodel \approx \pdata$$

That is: the distribution of samples produced by the Generator is (approximately) the same as the "true" distribution
- we note that we don't know the "true" $\pdata$
    - we only have a sample available: the training set, which defines an *empirical* distribution

There are several ways to quantify

$$\pmodel \approx \pdata$$

One choice would be the minimization of KL Divergence
- $\KL( \pdata || \pmodel)$

An alternative, still using KL Divergence
- $\KL( \pmodel || \pdata )$

Which is a better choice ?

In order to answer the question, we begin with a few preliminaries
- Definition of KL Divergence
- Proving 
    - minimizing KL Divergence increases log-likelihood

# Definition of KL Divergence

As a reminder of the definition of KL Divergence

$$
\begin{array} \\
\text{KL}(p || q ) 
& = & -\sum\limits_{\x}p(\x) \log q(\x) + \sum\limits_{\x}p(\x) \log p(\x)  \\
& = & \sum\limits_{\x} {  p(\x) \, \left( \log p(\x) - \log q(\x) \right) } \\
& = & \E_{\x \sim p } { \left( \log p(\x) - \log q(\x) \right) } \\
\end{array}
$$

You can see that it is
- the point-wise difference between the (log) probability of $\x$ in distributions $p$ and $q$
- averaged over the distribution of $\x \sim p$

and thus is an averaged, point-wise measure of the dissimilarity of the two distributions.

We note that the KL Divergence is *not symmetric*
$$
\KL( \pdata || \pmodel) \ne \KL( \pmodel || \pdata )
$$
 so the two choices are different.
 - both are expectations
 - but over *different* distributions
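The definition and the asymmetry can be checked numerically. The following is a minimal sketch for discrete distributions; the helper name `kl_divergence` and the two probability vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    # Expectation over x ~ p of the point-wise difference in log probability
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

# Two hypothetical distributions over three outcomes
p = [0.8, 0.1, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)  # expectation taken over p
kl_qp = kl_divergence(q, p)  # expectation taken over q

print(f"KL(p || q) = {kl_pq:.4f}")
print(f"KL(q || p) = {kl_qp:.4f}")
```

The two printed values differ because each is an average over a different distribution.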

## KL Divergence leads to Maximum Likelihood Estimation

We now show that using $\KL( \pdata || \pmodel )$ as a loss function
- results in an estimate of the model distribution $\pmodel$
- that is the Maximum Likelihood estimate given the training examples (represented by $\pdata$)

That is
- $\pmodel$ is the best explanation of the training dataset $\pdata$

Choosing $\pmodel$ to minimize $\KL( \pdata || \pmodel )$ gives

$$
\begin{array} \\
\KL( \pdata || \pmodel ) & = & \int_\x  { \pdata(\x) \, \left( \log\frac{\pdata(\x)}{\pmodel(\x)} \right) }{d\x} & \text{Definition of KL Divergence} \\
& = & \E_{\x \sim \pdata} \left( \log \pdata(\x) - \log \pmodel(\x) \right) & \text{Log of a ratio is the difference of the logs} \\
\text{minimizing KL} \\
& \Leftrightarrow & \text{minimizing } \E_{\x \sim \pdata} \left( - \log \pmodel(\x) \right)  & \text{Since } \E_{\x \sim \pdata} \log \pdata(\x) \text{ does not depend on } \pmodel \\
\end{array}
$$

So minimizing $\KL$ is equivalent to 
- minimizing the Negative Log Likelihood
- in other words: *maximizing* the Log Likelihood
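A small numerical sketch of this equivalence, under the assumption of a binary variable and a Bernoulli model family (both chosen here purely for illustration): the KL Divergence and the expected Negative Log Likelihood differ only by a term that does not depend on the model, so they share the same minimizer.

```python
import numpy as np

# Hypothetical empirical distribution of a binary variable: P(x=0)=0.7, P(x=1)=0.3
p_data = np.array([0.7, 0.3])

# Candidate models: Bernoulli(theta) for theta on a grid
thetas = np.linspace(0.01, 0.99, 99)
p_model = np.stack([1 - thetas, thetas], axis=1)  # shape (99, 2)

# KL(p_data || p_model) for every candidate
kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)), axis=1)

# Expected Negative Log Likelihood under p_data for every candidate
nll = -np.sum(p_data * np.log(p_model), axis=1)

# The gap is the same for every candidate: it depends only on p_data
gap = kl - nll

best_kl = thetas[np.argmin(kl)]
best_nll = thetas[np.argmin(nll)]
print("theta minimizing KL :", best_kl)
print("theta minimizing NLL:", best_nll)
```

Both searches select the same `theta`, the one that matches the empirical frequency of `x = 1`.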

# Choosing the KL Divergence

The first choice 
$$
\KL( \pdata || \pmodel) = \E_{\x \sim \pdata } { \left( \log \pdata (x) - \log 
\pmodel(x) \right) }
$$

maximizes $\log(\pmodel(\x))$ for $\x \in \pdata$
- $\pmodel$ assigns high probability to Real examples
- model creates Real examples with high probability

By way of analogy with measures for Classification
- the expectation over $\pdata$ emphasizes Recall over Precision

We can achieve high Recall
- by reducing chance of False Negatives (FN)
- even if it increases chance of False Positives (FP)

In the GAN context this means

$$
\begin{array} \\
\text{reducing FN}   \leadsto \pmodel \text{ assigns high probability to each training example in } \pdata & \text{fidelity to training data}\\
\text{increasing FP} \leadsto \pmodel \text{ assigns high probability to } \x \notin \pdata \\
\end{array}
$$



The second choice $\KL( \pmodel || \pdata )$

$$
\KL( \pmodel || \pdata) = \E_{\x \sim \pmodel } { \left( \log \pmodel (x) - \log 
\pdata (x) \right) } 
$$

maximizes $\log \pdata(\x)$ for $\x \sim \pmodel$
- emphasizes that synthetic examples are "realistic"
    - highly probable, as defined by the empirical distribution (training data) $\pdata$
    
This ("realistic examples") may be a more desirable property than "high fidelity" to the training data.

Continuing with our Recall versus Precision analogy, this measure
- increases Precision by reducing False Positives
    - examples generated by $\pmodel$ are likely according to $\pdata$


So it seems as if the second choice  $\KL( \pmodel || \pdata )$ may be more desirable.

**But** we don't know the true $\pdata$ !
- we only have an empirical sample: the training dataset
- so, in practical terms: we can't evaluate $\pdata(\x)$ for a newly generated $\x$, and so can't maximize it

Thus, practical considerations lead us to the first choice.
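The Recall versus Precision trade-off between the two choices can be illustrated with a toy experiment. This is only a sketch under strong assumptions: $\pdata$ is taken to be a *known* two-mode distribution on a grid (which is exactly what we lack in practice), the model family is a single Gaussian, and the best model is found by a coarse grid search. All names (`normal_pmf`, `kl`, the grids) are illustrative.

```python
import numpy as np

def normal_pmf(x, mu, sigma):
    """Gaussian density evaluated on grid x, renormalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(p, q):
    """KL(p || q) for probability vectors with strictly positive entries."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

x = np.linspace(-12.0, 12.0, 1201)

# Assumed "true" distribution: two well-separated modes at -3 and +3
p_data = 0.5 * normal_pmf(x, -3.0, 1.0) + 0.5 * normal_pmf(x, 3.0, 1.0)

# Model family: a single Gaussian, searched over a coarse grid of (mu, sigma)
mus = np.arange(-5.0, 5.5, 0.5)
sigmas = np.arange(0.5, 4.25, 0.25)
candidates = [(mu, s) for mu in mus for s in sigmas]

# First choice: KL(p_data || p_model), expectation over the data
forward = [kl(p_data, normal_pmf(x, mu, s)) for mu, s in candidates]
# Second choice: KL(p_model || p_data), expectation over the model
reverse = [kl(normal_pmf(x, mu, s), p_data) for mu, s in candidates]

mu_fwd, sigma_fwd = candidates[int(np.argmin(forward))]
mu_rev, sigma_rev = candidates[int(np.argmin(reverse))]

print("KL(p_data || p_model) picks mu, sigma =", mu_fwd, sigma_fwd)
print("KL(p_model || p_data) picks mu, sigma =", mu_rev, sigma_rev)
```

In this setup the first choice selects a wide Gaussian centered between the modes: it covers all of the data (Recall) but also places mass where the data has very little (False Positives). The second choice selects a narrow Gaussian sitting on one mode: its samples are all plausible (Precision) but half of the data is missed.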

# Jensen-Shannon Divergence

We have observed that the KL divergence is *not* symmetric
$$
\KL( P || Q ) \ne \KL( Q || P )
$$
because the expectations are taken over different distributions.

An alternative measure of the dissimilarity of two distributions is the Jensen-Shannon Divergence (JSD)

  $$
    \begin{array} \\
    \text{JSD}( P || Q ) & = & \text{JSD}( Q || P )\\
    & = & \frac{1}{2} \; \text{KL} \left( P \, ||\, \frac{P+Q}{2} \right) + \\
    && \frac{1}{2} \; \text{KL} \left( Q \, || \, \frac{P+Q}{2} \right)
    \end{array}
    $$
    
This measure is
- symmetric
- is a kind of mixture of $\KL(P || Q)$ and $\KL(Q || P)$.
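A minimal sketch of the JSD for discrete distributions, built directly from the definition above. The function names and example vectors are assumptions for illustration. It also checks a well-known property not stated above: with natural logs, the JSD never exceeds $\log 2$, which it reaches when the two distributions do not overlap.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon Divergence: average KL of p and q to their midpoint."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical example distributions
p = [0.8, 0.1, 0.1]
q = [0.4, 0.4, 0.2]

print("JSD(p || q) =", jsd(p, q))
print("JSD(q || p) =", jsd(q, p))  # same value: the measure is symmetric

# Distributions with no overlap reach the upper bound log 2
print("JSD of disjoint distributions =", jsd([1.0, 0.0], [0.0, 1.0]))
print("log 2 =", np.log(2))
```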

[Huszar](https://arxiv.org/pdf/1511.05101.pdf) has a Generalized JSD which interpolates between the two terms
$$
    \begin{array} \\
    \text{JSD}_\pi( P || Q ) & = & \text{JSD}_{1-\pi}( Q || P )\\
    & = & \pi \; \text{KL} \left( \,  P \, ||\, \pi P + (1-\pi) Q \, \right) + \\
    && (1-\pi) \; \text{KL} \left( \, Q \, || \, \pi P + (1-\pi) Q \, \right)
    \end{array}
    $$
    
The Generalized JSD is
- **not** symmetric (unless $\pi = \frac{1}{2}$), although
$$
\text{JSD}_\pi( P || Q ) = \text{JSD}_{1-\pi}( Q || P )
$$

Huszar shows that, for small values of $\pi$
$$
\frac{
    \text{JSD}_\pi( P || Q )
  }{\pi}  
    \approx \text{KL} \left( \,  P \, ||\, Q \right)
$$
and
$$
\frac{
    \text{JSD}_{1-\pi}( P || Q )
  }{\pi}  
    \approx \text{KL} \left( \,  Q \, ||\, P \right)
$$

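In the first case
- $\text{JSD}_\pi( P || Q )$ is approximately proportional to $\text{KL} \left( \,  P \, ||\, Q \right)$
- with $P = \pdata$ and $Q = \pmodel$, minimizing it is Maximum Likelihood estimation

In the second case
- $\text{JSD}_{1-\pi}( P || Q )$ is approximately proportional to $\text{KL} \left( \,  Q \, ||\, P \right)$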
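The limiting behavior can be checked numerically. The sketch below implements the Generalized JSD from its definition for discrete distributions (function names and example vectors are illustrative assumptions) and normalizes by the small mixture weight $\pi$ in both limits.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def generalized_jsd(p, q, pi):
    """JSD_pi(P || Q): weight pi on P and (1 - pi) on Q, as in the definition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = pi * p + (1.0 - pi) * q
    return pi * kl_divergence(p, m) + (1.0 - pi) * kl_divergence(q, m)

# Hypothetical example distributions
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])
pi = 1e-3  # a "small" value of pi

# Index close to 0: behaves like KL(P || Q)
print(generalized_jsd(p, q, pi) / pi, "vs KL(P || Q) =", kl_divergence(p, q))

# Index close to 1: behaves like KL(Q || P)
print(generalized_jsd(p, q, 1 - pi) / pi, "vs KL(Q || P) =", kl_divergence(q, p))

# Swapping the arguments is the same as replacing pi by (1 - pi)
print(generalized_jsd(p, q, 0.3), "==", generalized_jsd(q, p, 0.7))
```

The agreement in the first two lines is approximate and improves as `pi` shrinks.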




In implementing Generalized JSD (taking $P = \pdata$ and $Q = \pmodel$, as in the Maximum Likelihood case above)
- The Discriminator is trained (as usual) on a mix of real and fake examples
    - But *not* in equal numbers
    - $\pi$ is the fraction of samples from $P$, matching the mixture $\pi P + (1-\pi) Q$
    - $(1-\pi)$ is the fraction of samples from $Q$
    - $\pi \lt \frac{1}{2}$: samples from $Q$ are over-represented
    - $\pi \gt \frac{1}{2}$: samples from $P$ are over-represented
- Might this explain why we often see training with the Generator updated twice for each update of the Discriminator ?
        
        

# Adversarial Training and the Jensen-Shannon Divergence

The Discriminator Loss $\loss_D$
- summed over all examples 
    - (ignoring the $\frac{1}{2}$ from the previous presentation where we assumed equal number of Real and Fake)
    
is
$$
\begin{array} \\ 
\loss_D 
& = &  - \left(  \E_{\x^\ip \in \pdata } { \log D(\x^\ip) }  + \E_{\x^\ip  \in \pmodel} { \log \left( 1 - D(\x^\ip)  \right) } \right) & \x^\ip = G(\z) \text{ for fake examples}\\
\end{array}
$$ 

We also showed that the optimal Discriminator results in 
$$
D^*(\x) =  \frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)}
$$
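As a sanity check on the formula for $D^*$, the sketch below evaluates $\loss_D$ on a small discrete example and compares the closed-form optimum against many randomly chosen Discriminators. The distributions and helper name are assumptions made for illustration.

```python
import numpy as np

def discriminator_loss(d, p_data, p_model):
    """L_D = -( E_{x~p_data} log D(x) + E_{x~p_model} log(1 - D(x)) )."""
    return float(-np.sum(p_data * np.log(d) + p_model * np.log(1.0 - d)))

# Hypothetical discrete distributions over three outcomes
p_data = np.array([0.8, 0.1, 0.1])
p_model = np.array([0.4, 0.4, 0.2])

# Closed-form optimal Discriminator: one output probability per outcome
d_star = p_data / (p_data + p_model)
loss_star = discriminator_loss(d_star, p_data, p_model)

# Compare against randomly chosen Discriminators
rng = np.random.default_rng(0)
random_ds = rng.uniform(0.01, 0.99, size=(1000, 3))
random_losses = [discriminator_loss(d, p_data, p_model) for d in random_ds]

print("loss at D*            :", loss_star)
print("best loss among random:", min(random_losses))
```

None of the random Discriminators achieves a lower loss than $D^*$.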

Plugging $D^*(\x)$ into $\loss_D$ (Goodfellow Equation ):

$$
\begin{array} \\ 
\loss_D 
& = &  - \left(  \E_{\x^\ip \in \pdata } { \log  \frac{\pdata (\x)}{ \pmodel(\x) +\pdata(\x)} }  + \E_{\x^\ip  \in \pmodel} { \log  \frac{\pmodel (\x)}{ \pmodel(\x) +\pdata(\x)} } \right) \\
& = & 
- \left(  \KL( \pdata || \pmodel(\x) +\pdata(\x) ) + \KL( \pmodel ||  \pmodel(\x) +\pdata(\x) ) \right) & \text{Def. of } \KL \\
& = & - \left( - \log 4 + \KL \left( \pdata || \frac{\pmodel(\x) +\pdata(\x)}{2} \right) + \KL \left( \pmodel || \frac{ \pmodel(\x) +\pdata(\x)}{2} \right) \right) & \text{dividing second arg. of each KL term by 2}  \\
& & & \text{adds } \log 2 \text{ to each KL term when expanded} \\
& & & \text{into log form.} \\
& & & \text{The } - \log 4 \text{ offsets this}  \\
& = & - \left( - \log 4  + 2 * \text{JSD} (  \pdata || \pmodel ) \right) & \text{Def. of JSD}\\
 & & & \text{this is Equation 6 of Goodfellow}\\
\end{array}
$$ 

The above equations show that
- under the assumption that the Discriminator can train to be the **optimal** adversary $D^*$
- $\loss_D$ becomes a constant minus twice the Jensen-Shannon Divergence between $\pdata$ and $\pmodel$ (last line above)

The Generator's objective opposes the Discriminator's: it tries to *increase* $\loss_D$.

So solving the minimax optimally
minimizes
the JSD between $\pdata$ and $\pmodel$.
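The relationship between the optimal Discriminator loss and the JSD can be verified on a small discrete example. This is a sketch with assumed helper names and made-up distributions; it checks that $\loss_D$ evaluated at $D^*$ equals $\log 4 - 2 \, \text{JSD}(\pdata || \pmodel)$, and that the loss is exactly $\log 4$ when the two distributions coincide (where $D^*(\x) = \frac{1}{2}$ everywhere).

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon Divergence between p and q."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def optimal_discriminator_loss(p_data, p_model):
    """L_D evaluated at D*(x) = p_data(x) / (p_data(x) + p_model(x))."""
    d_star = p_data / (p_data + p_model)
    return float(-np.sum(p_data * np.log(d_star) + p_model * np.log(1.0 - d_star)))

p_data = np.array([0.8, 0.1, 0.1])

# A poor model, a closer model, and a perfect model (all hypothetical)
models = [
    np.array([0.1, 0.1, 0.8]),
    np.array([0.4, 0.4, 0.2]),
    p_data.copy(),
]

results = []
for p_model in models:
    loss = optimal_discriminator_loss(p_data, p_model)
    rhs = np.log(4) - 2 * jsd(p_data, p_model)
    results.append((loss, rhs))
    print(f"L_D(D*) = {loss:.6f}   log 4 - 2*JSD = {rhs:.6f}")
```

As the model approaches the data distribution, the JSD shrinks and the best loss the Discriminator can achieve rises toward $\log 4$: the Generator improves by making the optimal Discriminator's job harder.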




In [2]:
print("Done")

Done
