## Forward KL and Maximum Likelihood Estimation (MLE)

A standard choice for the discrepancy between the data distribution and a parametric model $p_\phi$ is the (forward) Kullback–Leibler divergence:
\begin{align}
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
:=& \int p_{\text{data}}(\mathbf x)\,
\log\!\frac{p_{\text{data}}(\mathbf x)}{p_\phi(\mathbf x)}\, d\mathbf x \\
=& \mathbb E_{\mathbf x\sim p_{\text{data}}}\!\Big[\log p_{\text{data}}(\mathbf x) - \log p_\phi(\mathbf x)\Big].
\tag{1}
\end{align}

It is **asymmetric**: in general,
$$
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
\ne
\mathcal D_{\mathrm{KL}}\!\left(p_\phi \,\Vert\, p_{\text{data}}\right).
\tag{2}
$$
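
As a quick numerical illustration of (2), here is a minimal sketch on a three-outcome discrete space. The two probability vectors are made-up values chosen for illustration, and the helper `kl` is a hypothetical name, not something defined elsewhere in these notes.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_data = np.array([0.5, 0.4, 0.1])   # illustrative "data" distribution
p_model = np.array([0.2, 0.3, 0.5])  # illustrative model distribution

forward = kl(p_data, p_model)  # D_KL(p_data || p_model)
reverse = kl(p_model, p_data)  # D_KL(p_model || p_data)
print(f"forward KL = {forward:.4f}, reverse KL = {reverse:.4f}")
```

The two printed values differ, while `kl(p, p)` is zero for any `p`.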

### Mode covering (intuition)
If there exists a set $A$ with positive $p_{\text{data}}$-mass where $p_\phi(\mathbf x)=0$ for $\mathbf x\in A$, then the integrand in (1) contains $\log\!\big(p_{\text{data}}(\mathbf x)/0\big)=+\infty$ on $A$. Hence
$$
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)=+\infty,
$$
so minimizing forward KL **forces** the model to assign nonzero probability wherever the data has support—i.e., it encourages *mode covering*.
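
The blow-up can be seen on a toy discrete example. This is a sketch under assumed values: `covers` and `misses` are hypothetical model distributions, and `forward_kl` is an illustrative helper that returns infinity explicitly when the model puts zero mass on a point the data can produce.

```python
import numpy as np

def forward_kl(p_data, p_model):
    """D_KL(p_data || p_model) for discrete distributions, in nats."""
    p_data, p_model = np.asarray(p_data, float), np.asarray(p_model, float)
    support = p_data > 0
    # Zero model mass on a point the data can produce makes the divergence infinite.
    if np.any(p_model[support] == 0):
        return float("inf")
    return float(np.sum(p_data[support] * np.log(p_data[support] / p_model[support])))

p_data = np.array([0.5, 0.5, 0.0])  # two equally likely "modes"
covers = np.array([0.4, 0.4, 0.2])  # spreads mass over both modes (and beyond)
misses = np.array([1.0, 0.0, 0.0])  # drops the second mode entirely

kl_covers = forward_kl(p_data, covers)          # finite
kl_misses = forward_kl(p_data, misses)          # infinite
reverse_kl_misses = forward_kl(misses, p_data)  # reverse direction stays finite
```

Swapping the arguments gives the reverse KL, which remains finite for the model that ignores a mode.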



## Decomposing the forward KL

Start from (1):
$$
\begin{aligned}
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
&= \mathbb E_{p_{\text{data}}}\!\left[\log p_{\text{data}}(\mathbf x)\right]
   - \mathbb E_{p_{\text{data}}}\!\left[\log p_\phi(\mathbf x)\right] \\
&= -\,\mathbb E_{p_{\text{data}}}\!\left[\log p_\phi(\mathbf x)\right]
   \;-\; \mathcal H\!\left(p_{\text{data}}\right),
\end{aligned}
\tag{3}
$$
where
$$
\mathcal H\!\left(p_{\text{data}}\right)
:= -\,\mathbb E_{p_{\text{data}}}\!\left[\log p_{\text{data}}(\mathbf x)\right]
$$
is the (Shannon) entropy of the data distribution and **does not depend on $\phi$**.

Equation (3) is the familiar "cross-entropy minus entropy" decomposition:
$$
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
= \underbrace{-\,\mathbb E_{p_{\text{data}}}\!\left[\log p_\phi(\mathbf x)\right]}_{\text{cross-entropy } \mathcal H(p_{\text{data}},p_\phi)}
- \underbrace{\left(-\,\mathbb E_{p_{\text{data}}}\!\left[\log p_{\text{data}}(\mathbf x)\right]\right)}_{\text{entropy } \mathcal H(p_{\text{data}})}.
$$
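
The decomposition can be checked numerically in the discrete case, where the integrals become sums. The probability vectors below are illustrative assumptions.

```python
import numpy as np

p_data = np.array([0.5, 0.4, 0.1])   # illustrative data distribution
p_model = np.array([0.2, 0.3, 0.5])  # illustrative model distribution

entropy = -np.sum(p_data * np.log(p_data))          # H(p_data)
cross_entropy = -np.sum(p_data * np.log(p_model))   # H(p_data, p_model)
kl = np.sum(p_data * np.log(p_data / p_model))      # D_KL(p_data || p_model)

# KL should equal cross-entropy minus entropy, up to floating-point error.
gap = kl - (cross_entropy - entropy)
```

Since the KL divergence is nonnegative, the cross-entropy is never smaller than the entropy.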


## Lemma 1.1.1 — Minimizing KL \(\Leftrightarrow\) MLE

Because $\mathcal H(p_{\text{data}})$ is constant in $\phi$, minimizing the forward KL is equivalent to maximizing the expected log-likelihood under the data:
$$
\boxed{
\min_{\phi}\; \mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
\;\;\Longleftrightarrow\;\;
\max_{\phi}\; \mathbb E_{\mathbf x\sim p_{\text{data}}}\!\left[\log p_\phi(\mathbf x)\right].
}
\tag{4}
$$
This is precisely **maximum likelihood estimation (MLE)** at the population level. In terms of the optimal parameters:
$$
\phi^{*} \in \arg\min_{\phi}\, \mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
\quad\Longleftrightarrow\quad
\phi^{*} \in \arg\max_{\phi}\, \mathbb E_{p_{\text{data}}}\!\big[\log p_\phi(\mathbf x)\big].
\tag{5}
$$
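
A small sanity check of (5), assuming a Bernoulli data distribution with parameter 0.3 and a Bernoulli model family searched over a grid. Both the parameter value and the grid are illustrative choices.

```python
import numpy as np

p_true = 0.3                        # assumed Bernoulli parameter of p_data
phis = np.linspace(0.01, 0.99, 99)  # candidate model parameters, step 0.01

# Population expected log-likelihood E_{x ~ p_data}[log p_phi(x)] for each phi.
expected_ll = p_true * np.log(phis) + (1 - p_true) * np.log(1 - phis)

# Forward KL D_KL(p_data || p_phi) for each phi.
kl = (p_true * np.log(p_true / phis)
      + (1 - p_true) * np.log((1 - p_true) / (1 - phis)))

phi_kl = phis[np.argmin(kl)]            # minimizer of forward KL
phi_mle = phis[np.argmax(expected_ll)]  # maximizer of expected log-likelihood
```

Both searches select the same grid point, which coincides with the true parameter because the model family contains the data distribution.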


## Empirical MLE via Monte Carlo

In practice, we cannot evaluate the population expectation in (4), but we do have i.i.d. samples
$$
\{\mathbf x^{(i)}\}_{i=1}^N \stackrel{\text{i.i.d.}}{\sim} p_{\text{data}}.
$$
The Monte Carlo (sample) approximation of the expected log-likelihood is
$$
\mathbb E_{p_{\text{data}}}\!\big[\log p_\phi(\mathbf x)\big]
\;\approx\;
\frac{1}{N}\sum_{i=1}^N \log p_\phi\!\big(\mathbf x^{(i)}\big),
\tag{6}
$$
leading to the empirical **negative** log-likelihood (NLL) objective:
$$
\widehat{\mathcal L}_{\mathrm{MLE}}(\phi)
:= -\,\frac{1}{N}\sum_{i=1}^N \log p_\phi\!\big(\mathbf x^{(i)}\big).
\tag{7}
$$
We then solve
$$
\phi_{\text{MLE}} \in \arg\min_{\phi}\, \widehat{\mathcal L}_{\mathrm{MLE}}(\phi)
\quad\left(\equiv \arg\max_{\phi}\, \frac{1}{N}\sum_{i=1}^N \log p_\phi(\mathbf x^{(i)})\right).
\tag{8}
$$

**Key point:** optimization of (7) requires only evaluating $p_\phi$ (and its gradients); **no evaluation of $p_{\text{data}}(\mathbf x)$** is needed—only samples from it.
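
The objective (7) can be sketched for a one-dimensional Gaussian model $p_\phi=\mathcal N(\mu,\sigma^2)$. The synthetic samples stand in for draws from $p_{\text{data}}$, and the function name `nll` is an illustrative choice. For this model family the minimizer of (7) is known in closed form (sample mean and standard deviation), so no optimizer is needed here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)  # stand-in samples from p_data

def nll(mu, sigma, x):
    """Average negative log-likelihood of samples x under N(mu, sigma^2)."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (x - mu) ** 2 / (2 * sigma**2)))

# Closed-form Gaussian MLE: sample mean and (biased, ddof=0) sample std.
mu_hat, sigma_hat = x.mean(), x.std()

nll_at_mle = nll(mu_hat, sigma_hat, x)
nll_shifted = nll(mu_hat + 0.1, sigma_hat, x)   # perturbing mu increases the NLL
nll_widened = nll(mu_hat, 1.1 * sigma_hat, x)   # so does perturbing sigma
```

Only the samples `x` and the model density enter the computation; the density of $p_{\text{data}}$ is never evaluated.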


---

## Expanded derivations and details

### (A) From definition to cross-entropy form

$$
\begin{aligned}
\mathcal D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\Vert\, p_\phi\right)
&= \int p_{\text{data}}(\mathbf x)\log\frac{p_{\text{data}}(\mathbf x)}{p_\phi(\mathbf x)}\,d\mathbf x \\
&= \int p_{\text{data}}(\mathbf x)\log p_{\text{data}}(\mathbf x)\,d\mathbf x
  - \int p_{\text{data}}(\mathbf x)\log p_\phi(\mathbf x)\,d\mathbf x \\
&= -\mathcal H(p_{\text{data}})
  + \mathcal H\!\left(p_{\text{data}},p_\phi\right),
\end{aligned}
$$
where \(\mathcal H(p_{\text{data}},p_\phi) := -\mathbb E_{p_{\text{data}}}[\log p_\phi(\mathbf x)]\).
Since \(\mathcal H(p_{\text{data}})\) is independent of \(\phi\), minimizing KL over \(\phi\) is the same as minimizing \(\mathcal H(p_{\text{data}},p_\phi)\), i.e. maximizing \(\mathbb E_{p_{\text{data}}}[\log p_\phi(\mathbf x)]\).

### (B) Likelihood of an i.i.d. dataset

For i.i.d. data, the joint likelihood factorizes:
\[
p_\phi\!\big(\mathbf x^{(1)},\dots,\mathbf x^{(N)}\big)
= \prod_{i=1}^N p_\phi\!\big(\mathbf x^{(i)}\big).
\]
Taking logs gives the **log-likelihood**:
\[
\log p_\phi\!\big(\mathbf x^{(1:N)}\big)
= \sum_{i=1}^N \log p_\phi\!\big(\mathbf x^{(i)}\big).
\]
Dividing by \(N\) and negating yields (7). Therefore minimizing NLL equals maximizing the (average) log-likelihood.

### (C) Gradient used in practice

Assuming we can differentiate through \(p_\phi\),
\[
\nabla_\phi \widehat{\mathcal L}_{\mathrm{MLE}}(\phi)
= -\,\frac{1}{N}\sum_{i=1}^N \nabla_\phi \log p_\phi\!\big(\mathbf x^{(i)}\big).
\]
Stochastic gradients use a minibatch \(\mathcal B\subset\{1,\dots,N\}\) drawn uniformly at random:
\[
g(\phi;\mathcal B)
= -\,\frac{1}{|\mathcal B|}\sum_{i\in\mathcal B} \nabla_\phi \log p_\phi\!\big(\mathbf x^{(i)}\big),
\]
which, under this sampling, is an unbiased estimator of the full gradient, enabling SGD/Adam.
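
A minimal sketch of the minibatch estimator, assuming the model \(\mathcal N(\mu,1)\) so that \(\nabla_\mu \log p_\mu(x)=x-\mu\). The data, learning rate, batch size, and the helper name `nll_grad` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1_000)  # stand-in dataset

def nll_grad(mu, batch):
    """Gradient w.r.t. mu of the average NLL of `batch` under N(mu, 1)."""
    return float(np.mean(mu - batch))

mu0 = 0.0
full_grad = nll_grad(mu0, x)

# Equal-size minibatches that partition the data: their gradients average
# exactly to the full gradient, which illustrates unbiasedness.
batches = rng.permutation(x).reshape(10, 100)
mean_minibatch_grad = np.mean([nll_grad(mu0, b) for b in batches])

# Plain SGD with uniformly sampled minibatches.
mu, lr = 0.0, 0.1
for _ in range(500):
    batch = rng.choice(x, size=32, replace=False)
    mu -= lr * nll_grad(mu, batch)
```

After the loop, `mu` fluctuates around the sample mean `x.mean()`, which is the exact minimizer of the NLL for this model.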

### (D) Why forward KL is “mode covering”

Let \(S_{\text{data}}=\{\mathbf x: p_{\text{data}}(\mathbf x)>0\}\) and
\(S_\phi=\{\mathbf x: p_\phi(\mathbf x)>0\}\).
If \(S_{\text{data}}\not\subseteq S_\phi\), then there is a set \(A\subseteq S_{\text{data}}\setminus S_\phi\)
with positive \(p_{\text{data}}\)-mass. On \(A\), the integrand in (1) equals \(+\infty\),
so \(\mathcal D_{\mathrm{KL}}(p_{\text{data}}\Vert p_\phi)=+\infty\).
Therefore, whenever some \(\phi\) attains a finite forward KL, every minimizing sequence \(\{\phi_t\}\) must eventually satisfy \(S_{\text{data}}\subseteq S_{\phi_t}\) (up to \(p_{\text{data}}\)-null sets), pushing \(p_\phi\) to **cover** all data modes.  
(Conversely, minimizing the *reverse* KL \(\mathcal D_{\mathrm{KL}}(p_\phi\Vert p_{\text{data}})\) tends to be *mode seeking*, because it heavily penalizes placing mass where the data has none but does **not** blow up when \(p_\phi\) ignores a data mode.)
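
Mode covering shows up clearly when a unimodal model is fit to bimodal data by maximum likelihood. The sketch below assumes synthetic data from an equal mixture of \(\mathcal N(-3,1)\) and \(\mathcal N(+3,1)\) and a single-Gaussian model, for which the MLE is the sample mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in bimodal data: equal mixture of N(-3, 1) and N(+3, 1).
centers = rng.choice([-3.0, 3.0], size=20_000)
x = centers + rng.normal(size=20_000)

# Forward-KL / MLE fit of a single Gaussian: sample mean and std.
mu_hat, sigma_hat = x.mean(), x.std()

# The mixture has mean 0 and variance 1 + 3^2 = 10, so the fitted Gaussian
# sits between the modes and is wide enough to cover both of them.
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

The fitted density places its peak in the low-density region between the modes, which is the price of covering both.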

---

## Summary

- Forward KL:
  \(\mathcal D_{\mathrm{KL}}(p_{\text{data}}\Vert p_\phi)
  = -\mathbb E_{p_{\text{data}}}\!\left[\log p_\phi(\mathbf x)\right] - \mathcal H(p_{\text{data}})\).
- Minimizing forward KL over \(\phi\) is **equivalent** to **MLE**:
  \(\arg\min_\phi \mathcal D_{\mathrm{KL}}(p_{\text{data}}\Vert p_\phi)
  = \arg\max_\phi \mathbb E_{p_{\text{data}}}[\log p_\phi(\mathbf x)]\).
- The empirical objective is the average **negative log-likelihood**:
  \(\widehat{\mathcal L}_{\mathrm{MLE}}(\phi)=-(1/N)\sum_i \log p_\phi(\mathbf x^{(i)})\),
  optimized with SGD on minibatches—no evaluation of \(p_{\text{data}}(\mathbf x)\) required.


# Fisher Divergence (Score-Based Modeling)

**Definition**
$$
\mathcal D_F(p\Vert q)
:= \mathbb E_{\mathbf x\sim p}\!\left[ \left\| \nabla_{\mathbf x}\log p(\mathbf x) - \nabla_{\mathbf x}\log q(\mathbf x) \right\|_2^2 \right].
$$

**What it measures**
- Compares the **score functions** $s_p(\mathbf x)=\nabla_{\mathbf x}\log p(\mathbf x)$ and $s_q(\mathbf x)=\nabla_{\mathbf x}\log q(\mathbf x)$.
- These are vector fields pointing toward regions of **higher probability**.
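- $\mathcal D_F(p\Vert q)\ge 0$. Under mild regularity conditions (e.g., smooth densities that are positive everywhere), it equals $0$ iff $p=q$: if the scores agree everywhere, then $\log p-\log q$ is constant, and normalization forces that constant to be zero.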

**Why it’s useful**
- **Invariant to normalization constants** (depends only on gradients of log-densities), so it works with **unnormalized models**.
- Forms the basis of **score matching**: learn a model score $s_\phi(\mathbf x)=\nabla_{\mathbf x}\log p_\phi(\mathbf x)$ that minimizes
$$
\mathbb E_{p_{\text{data}}}\!\left[\|s_\phi(\mathbf x)-s_{p_{\text{data}}}(\mathbf x)\|_2^2\right].
$$
- Core to **score-based/diffusion generative models**: take $p=p_{\text{data}}$ (target) and $q=p_\phi$ (model), train $s_\phi$ to align with the data score field, then **sample** using Langevin dynamics / SDEs.

**Takeaway**
> Fisher divergence trains models to match the **direction & magnitude** of “move-to-higher-density” vectors. Under mild regularity conditions, matching these scores is enough to match the **entire distribution**.
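
For two one-dimensional Gaussians the Fisher divergence can be estimated by Monte Carlo and compared with a closed form. The sketch below assumes $p=\mathcal N(0,1)$ and $q=\mathcal N(1,4)$; the closed-form expression is derived for this example by substituting $x=\mu_p+\sigma_p\varepsilon$ and is not taken from the notes.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score d/dx log N(x; mu, sigma^2); the normalizing constant drops out."""
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
mu_p, sigma_p = 0.0, 1.0  # parameters of p (illustrative)
mu_q, sigma_q = 1.0, 2.0  # parameters of q (illustrative)
x = rng.normal(mu_p, sigma_p, size=200_000)  # samples from p

# Monte Carlo estimate of E_p[(s_p(x) - s_q(x))^2].
diff = gaussian_score(x, mu_p, sigma_p) - gaussian_score(x, mu_q, sigma_q)
fisher_mc = float(np.mean(diff**2))

# Closed form for two 1-D Gaussians (derived for this sketch).
fisher_exact = ((sigma_p / sigma_q**2 - 1 / sigma_p) ** 2
                + (mu_p - mu_q) ** 2 / sigma_q**4)
```

Note that `gaussian_score` never touches the factor $1/\sqrt{2\pi\sigma^2}$, which is the normalization invariance mentioned above.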


 

# Other Divergences: $f$-Divergences and Wasserstein Distance

**General $f$-divergence**  
$$
D_f(p\Vert q)=\int q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)\,dx,\qquad
f:\mathbb R_{+}\!\to\mathbb R\ \text{convex},\ f(1)=0.
$$

**Examples (choose $f(u)$):**
- $f(u)=u\log u \ \Rightarrow\ D_f=D_{\mathrm{KL}}(p\Vert q)$ *(forward KL)*  
- $f(u)=\tfrac12\!\left[u\log u-(u+1)\log\!\tfrac{1+u}{2}\right] \ \Rightarrow\ D_f=D_{\mathrm{JS}}(p\Vert q)$ *(Jensen–Shannon)*
- $f(u)=\tfrac12|u-1|\ \Rightarrow\ D_f=D_{\mathrm{TV}}(p,q)$ *(total variation)*

**Explicit forms**
- $D_{\mathrm{JS}}(p\Vert q)=\tfrac12 D_{\mathrm{KL}}(p\Vert m)+\tfrac12 D_{\mathrm{KL}}(q\Vert m)$, with $m=\tfrac12(p+q)$  
- $D_{\mathrm{TV}}(p,q)=\tfrac12\int_{\mathbb R^D}\!|p(x)-q(x)|\,dx=\displaystyle\sup_{A\subseteq\mathbb R^D\ \text{measurable}}|p(A)-q(A)|$
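
The generator functions listed above can be checked against the explicit forms on a discrete space, where the integral becomes a sum. This is a sketch assuming strictly positive probability vectors (so that $u\log u$ is well defined); the vectors and helper names are illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)) for strictly positive discrete p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

def f_kl(u):
    return u * np.log(u)

def f_js(u):
    return 0.5 * (u * np.log(u) - (u + 1) * np.log((1 + u) / 2))

def f_tv(u):
    return 0.5 * np.abs(u - 1)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.2, 0.3, 0.5])
m = 0.5 * (p + q)

# Explicit forms to compare against.
kl_explicit = kl(p, q)
js_explicit = 0.5 * kl(p, m) + 0.5 * kl(q, m)
tv_explicit = 0.5 * float(np.sum(np.abs(p - q)))

kl_f, js_f, tv_f = (f_divergence(p, q, f) for f in (f_kl, f_js, f_tv))
```

Each pair (`kl_f`, `kl_explicit`), (`js_f`, `js_explicit`), (`tv_f`, `tv_explicit`) agrees up to floating-point error.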

**Intuition / when useful**
- **JS:** smooth, symmetric, bounded; balances both distributions (key in GAN theory).  
- **TV:** largest possible probability gap across events; sensitive to pointwise differences.  
- **KL (forward):** penalizes missing data support $\Rightarrow$ *mode covering*.  

**Optimal transport viewpoint — Wasserstein distance**  
Measures the **minimal cost to move mass** from $p$ to $q$ (depends on sample-space geometry, not density ratios). Unlike $f$-divergences, it remains meaningful even when supports of $p$ and $q$ do not overlap.
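
The contrast with $f$-divergences can be sketched with two point masses on a one-dimensional grid. The example assumes unit grid spacing and uses the standard one-dimensional identity that the Wasserstein-1 distance equals the integral of the absolute difference of the CDFs; the helper names are illustrative.

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence (nats) for discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # m > 0 wherever a > 0, so the ratio is well defined
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def w1_on_grid(p, q, dx=1.0):
    """1-D Wasserstein-1 distance on a uniform grid: sum of |CDF_p - CDF_q| * dx."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

n = 20
p = np.zeros(n)
p[0] = 1.0  # point mass at grid position 0

results = []
for shift in (1, 5, 10):
    q = np.zeros(n)
    q[shift] = 1.0  # point mass at grid position `shift`
    results.append((shift, js(p, q), w1_on_grid(p, q)))
```

With disjoint supports, JS saturates at $\log 2$ regardless of the shift, whereas the Wasserstein distance grows with the shift and so still carries a usable signal about how far apart the distributions are.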

> **Bottom line:** different divergences encode different notions of “closeness,” leading to distinct optimization dynamics and learning behavior in generative modeling.


# Energy-Based Models (EBMs)

**Idea.** EBMs (Ackley et al., 1985; LeCun et al., 2006) define a probability distribution via an **energy** function $E_\phi(\mathbf{x})$ that assigns **lower energy to more probable** data.

**Density**
$$
p_\phi(\mathbf{x}) \;=\; \frac{1}{Z(\phi)}\,\exp\!\big(-E_\phi(\mathbf{x})\big),
\qquad
Z(\phi) \;=\; \int \exp\!\big(-E_\phi(\mathbf{x})\big)\,d\mathbf{x}
$$

- $Z(\phi)$ is the **partition function** (normalizing constant).

**Training**
- Typically maximize **log-likelihood** $\sum_i \log p_\phi(\mathbf{x}^{(i)})$.
- Challenge: $Z(\phi)$ is often **intractable** to compute or differentiate.

**Connection to diffusion / score-based models**
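- Diffusion models learn the **score** $\nabla_{\mathbf{x}}\log p(\mathbf{x})$ (gradient of log-density). For an EBM, $\nabla_{\mathbf x}\log p_\phi(\mathbf x)=-\nabla_{\mathbf x}E_\phi(\mathbf x)$ because $Z(\phi)$ is constant in $\mathbf x$, so the score **does not depend on $Z(\phi)$**.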
- This **circumvents** explicit partition-function computation during training and sampling.
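
A one-dimensional sketch of why the score avoids the partition function. It assumes an illustrative double-well energy $E(x)=(x^2-1)^2$ (not from the notes) and estimates $Z$ by a Riemann sum, which is only feasible because the example is one-dimensional.

```python
import numpy as np

def energy(x):
    """Illustrative double-well energy E(x) = (x^2 - 1)^2."""
    return (x**2 - 1.0) ** 2

def score(x):
    """Model score -dE/dx, computed without any partition function."""
    return -4.0 * x * (x**2 - 1.0)

xs = np.linspace(-3.0, 3.0, 6001)
dx = xs[1] - xs[0]

# Riemann-sum estimate of Z; tractable here only because x is 1-D.
Z = float(np.sum(np.exp(-energy(xs))) * dx)
log_p = -energy(xs) - np.log(Z)  # normalized log-density on the grid

# Finite-difference gradient of the normalized log-density.
fd_score = np.gradient(log_p, dx)

# Compare on interior points, where np.gradient uses central differences.
max_err = float(np.max(np.abs(fd_score[1:-1] - score(xs)[1:-1])))
```

The finite-difference score of the normalized density matches `score`, which never used `Z`.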


# Autoregressive (AR) Models — One-Slide

**Idea.** Model the joint distribution by factorizing it with the **chain rule**:
$$
p_\phi(\mathbf{x})=\prod_{i=1}^{D} p_\phi(x_i \mid \mathbf{x}_{<i}),
\qquad
\mathbf{x}=(x_1,\ldots,x_D),\ \mathbf{x}_{<i}=(x_1,\ldots,x_{i-1}).
$$

**Parameterization.** Each conditional $p_\phi(x_i \mid \mathbf{x}_{<i})$ is a neural net (e.g., **Transformer**), allowing rich dependencies.  
- Terms are **normalized by design** (e.g., softmax for discrete, parameterized Gaussian for continuous), so the joint is **normalized automatically** and no partition function is needed.

**Training.** Maximize exact likelihood (equivalently **minimize NLL**)
$$
\max_\phi \sum_{n} \sum_{i=1}^{D} \log p_\phi\!\big(x_i^{(n)} \mid \mathbf{x}_{<i}^{(n)}\big).
$$
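
A toy sketch of the chain-rule likelihood for binary sequences of length $D=3$. The lookup table of random logits is a hypothetical stand-in for a neural network; it is small enough to verify by enumeration that the joint sums to one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D = 3  # sequence length, binary tokens

# Hypothetical tabular conditionals: one logit per prefix (stand-in for a neural net).
logits = {prefix: rng.normal()
          for i in range(D)
          for prefix in itertools.product([0, 1], repeat=i)}

def cond_prob_one(prefix):
    """p(x_i = 1 | x_<i) via a sigmoid, so each conditional is normalized."""
    return 1.0 / (1.0 + np.exp(-logits[prefix]))

def log_prob(x):
    """Chain-rule log-likelihood: sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(tuple(x[:i]))
        total += np.log(p1 if xi == 1 else 1.0 - p1)
    return float(total)

# Enumerate all 2^D sequences: the joint probabilities sum to one.
total_mass = sum(np.exp(log_prob(x)) for x in itertools.product([0, 1], repeat=D))
```

Because every conditional is normalized, `total_mass` equals one without any explicit global normalization step.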

**Pros**
- Strong **density estimation** with **exact likelihoods**.
- Flexible conditionals capture complex structure.

**Cons**
- **Sequential** sampling $\Rightarrow$ slower generation.
- **Fixed ordering** may restrict flexibility.

**Takeaway.** Despite sampling limits, AR models are a **foundational** class of likelihood-based generative models and remain central in modern research.


# Variational Autoencoders (VAEs) — One Slide

**Idea.** Add latent variables $\mathbf z$ to capture hidden structure in data $\mathbf x$.  
Learn:
- **Encoder** $q_\theta(\mathbf z\mid \mathbf x)$ (approx. posterior),
- **Decoder** $p_\phi(\mathbf x\mid \mathbf z)$ (generative likelihood),
with prior $p_{\text{prior}}(\mathbf z)$ (usually $\mathcal N(0,I)$).

**Model.** $p_\phi(\mathbf x,\mathbf z)=p_{\text{prior}}(\mathbf z)\,p_\phi(\mathbf x\mid \mathbf z)$ and  
$\log p_\phi(\mathbf x)\ \ge\ \mathcal L_{\text{ELBO}}(\theta,\phi;\mathbf x)$

**Training objective (ELBO).**
$$
\mathcal L_{\text{ELBO}}(\theta,\phi;\mathbf x)
= \mathbb E_{\mathbf z\sim q_\theta(\mathbf z\mid \mathbf x)}
\big[\log p_\phi(\mathbf x\mid \mathbf z)\big]
\;-\;
D_{\text{KL}}\!\big(q_\theta(\mathbf z\mid \mathbf x)\,\Vert\, p_{\text{prior}}(\mathbf z)\big).
$$

**Term meanings.**
- $\mathbb E[\log p_\phi(\mathbf x\mid \mathbf z)]$: **reconstruction** (fit data given latent).  
- $D_{\text{KL}}(q_\theta\Vert p_{\text{prior}})$: **regularization** (keep latents near prior).
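
The bound $\log p_\phi(\mathbf x)\ge\mathcal L_{\text{ELBO}}$ can be checked on a toy model where everything is available in closed form. The sketch assumes a scalar linear-Gaussian model, $p(z)=\mathcal N(0,1)$ and $p(x\mid z)=\mathcal N(z,1)$, so that $p(x)=\mathcal N(0,2)$ and the true posterior is $\mathcal N(x/2,\,1/2)$; this model is an illustrative choice, not one used in the notes.

```python
import numpy as np

def elbo(x, m, s2):
    """Closed-form ELBO for p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(m, s2)."""
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x|z)]
    kl = 0.5 * (m**2 + s2 - 1.0 - np.log(s2))                     # KL(q || N(0,1))
    return float(recon - kl)

def log_marginal(x):
    """Exact log p(x) for this toy model, where x ~ N(0, 2)."""
    return float(-0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0)

x = 1.3  # an arbitrary observation
tight = elbo(x, m=x / 2, s2=0.5)  # q equals the true posterior
loose = elbo(x, m=0.0, s2=1.0)    # q equals the prior
```

The bound is tight when the encoder matches the true posterior and strictly loose otherwise; the gap is the KL divergence from $q$ to the true posterior.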

**Pros.**
- Principled likelihood-based learning with **tractable** objective.  
- Scales with neural nets; amortized inference via encoder.

**Cons.**
- Samples can be **less sharp**; training pathologies (e.g., posterior collapse, where the decoder ignores $\mathbf z$ and $q_\theta(\mathbf z\mid\mathbf x)$ collapses to the prior).

**Takeaway.** VAEs fuse neural nets with latent-variable models; they helped establish modern likelihood-based generative modeling and paved the way for diffusion/score-based methods.


# Normalizing Flows & Neural ODE Flows — One Slide

**Idea.** Learn an **invertible** map $f_\phi:\mathbf z\!\to\!\mathbf x$ that pushes a simple base density $p(\mathbf z)$ (e.g., $\mathcal N(0,I)$) to a complex data density $p_\phi(\mathbf x)$.  
- **NFs:** compose bijective layers with tractable Jacobians.  
- **NODEs:** model a continuous-time bijection via an ODE.

**Change of variables (likelihood)**
Let $\mathbf z=f_\phi^{-1}(\mathbf x)$. Then
$$
\log p_\phi(\mathbf x)
= \log p(\mathbf z) + \log\!\left|\det\!\left(\frac{\partial f_\phi^{-1}(\mathbf x)}{\partial \mathbf x}\right)\right|.
$$
*Enables exact MLE training.*  
(For NODEs, the log-det becomes a time integral of the Jacobian trace along the ODE path.)
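
The change-of-variables formula can be verified on the simplest possible flow, a scalar affine map $x=f(z)=az+b$ with a standard normal base. The parameter values are illustrative assumptions; the pushforward is then $\mathcal N(b,a^2)$, which gives an independent expression to compare against.

```python
import numpy as np

a, b = 2.0, -1.0  # hypothetical parameters of the affine flow x = a*z + b

def log_standard_normal(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * z**2

def flow_log_prob(x):
    """log p(x) = log p(z) + log|d f^{-1}/dx| with z = f^{-1}(x) = (x - b)/a."""
    z = (x - b) / a
    return log_standard_normal(z) + np.log(abs(1.0 / a))

x = np.linspace(-4.0, 4.0, 9)

# Direct log-density of N(b, a^2), the pushforward of N(0, 1) under x = a*z + b.
direct = -0.5 * np.log(2 * np.pi * a**2) - (x - b) ** 2 / (2 * a**2)
```

The two expressions agree at every test point; deeper flows apply the same rule layer by layer and sum the log-determinant terms.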

**Training.** Maximize $\sum_n \log p_\phi(\mathbf x^{(n)})$; backprop through the invertible layers (or ODE solver).

**Pros**
- **Exact likelihoods**, no partition function.  
- **Invertible sampling/inference:** $\mathbf z\!\leftrightarrow\!\mathbf x$.  
- Efficient when Jacobian is **triangular/coupling** (e.g., RealNVP, Glow).

**Cons**
- Architectural constraints for **bijectivity** and **tractable Jacobians** may limit expressivity.  
- **NODEs** can be compute-heavy (solver steps, stiffness).  
- Scaling to very **high dimensions** can be challenging.

**Takeaway.** Flows provide likelihood-based generative models via invertible mappings and the change-of-variables formula; NODEs are their continuous-time counterpart.


# Generative Adversarial Networks (GANs) — One Slide

**Setup.** A **generator** $G_\phi$ maps noise $z\sim p_{\text{prior}}$ to samples $G_\phi(z)$; a **discriminator** $D_\xi$ scores real vs. fake.

**Min–max objective**
$$
\min_{G_\phi}\ \max_{D_\xi}\ 
\mathbb E_{\mathbf x\sim p_{\text{data}}}\big[\log D_\xi(\mathbf x)\big]
+\mathbb E_{z\sim p_{\text{prior}}}\big[\log\!\big(1-D_\xi(G_\phi(z))\big)\big].
$$

**Optimal discriminator (for fixed $G_\phi$)**
$$
D^{*}(\mathbf x)=\frac{p_{\text{data}}(\mathbf x)}{p_{\text{data}}(\mathbf x)+p_{G_\phi}(\mathbf x)}.
$$

**With $D_\xi=D^{*}$, the generator objective reduces to Jensen–Shannon (JS) divergence**
$$
\min_{G_\phi}\ 2\,D_{\mathrm{JS}}\!\big(p_{\text{data}}\Vert p_{G_\phi}\big)-\log 4,
\qquad
D_{\mathrm{JS}}(p\Vert q)=\tfrac12 D_{\mathrm{KL}}(p\Vert m)+\tfrac12 D_{\mathrm{KL}}(q\Vert m),
$$
with $m=\tfrac12(p+q)$.
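
The identity can be checked numerically on a discrete space, where densities become probability vectors and expectations become sums. The two vectors are illustrative assumptions standing in for $p_{\text{data}}$ and $p_{G_\phi}$.

```python
import numpy as np

p_data = np.array([0.5, 0.4, 0.1])  # illustrative data distribution
p_gen = np.array([0.2, 0.3, 0.5])   # illustrative generator distribution

def gan_value(d):
    """Discrete analogue of E_data[log D] + E_gen[log(1 - D)]."""
    return float(np.sum(p_data * np.log(d)) + np.sum(p_gen * np.log(1.0 - d)))

d_star = p_data / (p_data + p_gen)  # optimal discriminator for fixed generator
value = gan_value(d_star)

# Jensen-Shannon divergence from its explicit form.
m = 0.5 * (p_data + p_gen)
js = float(0.5 * np.sum(p_data * np.log(p_data / m))
           + 0.5 * np.sum(p_gen * np.log(p_gen / m)))

# Any other discriminator gives a smaller value for the inner maximization.
value_perturbed = gan_value(np.clip(d_star + 0.05, 1e-6, 1 - 1e-6))
```

Here `value` matches `2 * js - np.log(4)`, and perturbing the discriminator away from `d_star` lowers the value.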

**Interpretation**
- GANs **do not** define an explicit density; they **bypass likelihood** and instead *match distributions* via an adversarial game.  
- JS link places GANs in the broader **$f$-divergence minimization** family (e.g., $f$-GAN).

**Pros / Cons**
- Can produce **high-fidelity** samples.  
- Training is **unstable** (min–max dynamics; architecture and regularization matter).

**Today**
- Often used as an **auxiliary** (e.g., adversarial losses) alongside other generative models, including **diffusion**.
