# What is KL divergence?

**Kullback–Leibler (KL) divergence** measures how one distribution $P$ differs from another distribution $Q$ that you use to approximate it.

* **Definition (discrete):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\,\log\frac{P(x)}{Q(x)}
  $$
* **Definition (continuous):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\int p(x)\,\log\frac{p(x)}{q(x)}\,dx
  $$
* **Expectation form (very useful):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}\big[\log p(x)-\log q(x)\big]
  $$

# Key properties (the “feel”)

* $D_{\mathrm{KL}}(P\|Q)\ge 0$ (Gibbs’ inequality), and $=0$ iff $P=Q$ almost everywhere.
* **Not symmetric:** in general $D_{\mathrm{KL}}(P\|Q)\neq D_{\mathrm{KL}}(Q\|P)$, so KL is not a distance metric.
* **Relation to cross-entropy:**
  $D_{\mathrm{KL}}(P\|Q)=H(P,Q)-H(P)$.
  When optimizing over $Q$, minimizing KL is the same as minimizing cross-entropy (since $H(P)$ is constant in $Q$).
* **Mode-covering vs mode-seeking:**
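  Minimizing $D_{\mathrm{KL}}(P\|Q)$ (as in MLE) heavily penalizes $Q(x)\approx 0$ wherever $P(x)>0$, so $Q$ is pushed to “cover” all of the support where $P$ has mass. Minimizing the **reverse** $D_{\mathrm{KL}}(Q\|P)$ instead penalizes $Q$ for placing mass where $P(x)\approx 0$, so $Q$ tends to lock onto a single mode (mode-seeking).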
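The mode-covering versus mode-seeking behavior can be seen numerically. The following is a minimal sketch, not a general fitting routine: it assumes a hypothetical bimodal target (an equal mixture of $\mathcal N(-3, 0.7^2)$ and $\mathcal N(3, 0.7^2)$), discretizes everything on a grid, and brute-force searches over the mean and standard deviation of a single Gaussian $Q$. The grid ranges and the names `means`, `stds`, and `best` are illustrative choices.

```python
import math
import torch
from torch.distributions import Normal

dtype = torch.float64
x = torch.linspace(-12.0, 12.0, 1201, dtype=dtype)  # evaluation grid

# Target P: equal mixture of N(-3, 0.7^2) and N(3, 0.7^2) (hypothetical choice)
comp_a = Normal(torch.tensor(-3.0, dtype=dtype), torch.tensor(0.7, dtype=dtype))
comp_b = Normal(torch.tensor(3.0, dtype=dtype), torch.tensor(0.7, dtype=dtype))
log_p_density = torch.logsumexp(
    torch.stack([comp_a.log_prob(x), comp_b.log_prob(x)]) + math.log(0.5), dim=0
)
log_p = torch.log_softmax(log_p_density, dim=-1)  # normalize on the grid

# Candidate Q: every (mean, std) pair on a coarse grid
means = torch.linspace(-4.0, 4.0, 41, dtype=dtype)
stds = torch.linspace(0.3, 4.0, 38, dtype=dtype)
q = Normal(means[:, None, None], stds[None, :, None])
log_q = torch.log_softmax(q.log_prob(x), dim=-1)  # shape (41, 38, 1201)

forward_kl = (log_p.exp() * (log_p - log_q)).sum(-1)  # KL(P || Q)
reverse_kl = (log_q.exp() * (log_q - log_p)).sum(-1)  # KL(Q || P)

def best(kl):
    # Return the (mean, std) pair that minimizes the given KL table
    i, j = divmod(kl.argmin().item(), kl.shape[1])
    return means[i].item(), stds[j].item()

m_fwd, s_fwd = best(forward_kl)
m_rev, s_rev = best(reverse_kl)
print(f"forward KL minimizer: mean={m_fwd:.2f}, std={s_fwd:.2f}")
print(f"reverse KL minimizer: mean={m_rev:.2f}, std={s_rev:.2f}")
```

Under these assumptions the forward-KL fit should sit between the two modes with a large standard deviation, while the reverse-KL fit should settle on one mode with a standard deviation close to that component's.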

# Why it matters (quick intuition)

* In supervised learning with softmax, the usual loss is cross-entropy $H(P,Q)$, equivalent to minimizing $D_{\mathrm{KL}}(P\|Q)$ with respect to $Q$.
* In VAEs, a KL term regularizes the approximate posterior $q_\phi(z\mid x)$ toward a simple prior $p(z)$ (often $\mathcal N(0,I)$).
* In information theory, KL is the expected extra code length when coding samples from $P$ using a code optimized for $Q$.
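The coding interpretation can be checked by hand on a tiny alphabet. This sketch assumes idealized code lengths of $-\log_2 Q(x)$ bits per symbol and uses base-2 logs, so the KL comes out in bits rather than nats; the two distributions are made-up examples.

```python
import math

P = [0.5, 0.25, 0.25]   # true symbol frequencies
Q = [0.25, 0.25, 0.5]   # frequencies the code was designed for

# Ideal code lengths in bits: -log2 of the assumed probability
len_q = [-math.log2(q) for q in Q]
len_p = [-math.log2(p) for p in P]

avg_len_q = sum(p * l for p, l in zip(P, len_q))  # cross-entropy H(P, Q)
avg_len_p = sum(p * l for p, l in zip(P, len_p))  # entropy H(P)
kl_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(avg_len_q, avg_len_p, kl_bits)  # 1.75 1.5 0.25
```

The mismatched code costs 1.75 bits per symbol on average instead of the optimal 1.5, and the 0.25-bit gap is exactly $D_{\mathrm{KL}}(P\|Q)$.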

# Handy closed form: Gaussian–Gaussian

For $P=\mathcal N(\mu_p,\Sigma_p)$, $Q=\mathcal N(\mu_q,\Sigma_q)$ in $k$ dimensions:

$$
D_{\mathrm{KL}}(P\|Q) = \frac{1}{2}\Big(
\log\frac{\det\Sigma_q}{\det\Sigma_p}
- k
+ \mathrm{tr}(\Sigma_q^{-1}\Sigma_p)
+ (\mu_q-\mu_p)^\top \Sigma_q^{-1}(\mu_q-\mu_p)
\Big)
$$

(For diagonal covariances this simplifies element-wise.)
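The closed form above can be transcribed term by term and compared against PyTorch's registered KL for `MultivariateNormal`. This is a sketch for unbatched inputs only; `gaussian_kl` is a hypothetical helper name, and the means and covariances are arbitrary positive-definite examples.

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) from the closed form (unbatched)."""
    k = mu_p.shape[-1]
    diff = mu_q - mu_p
    # Solve linear systems instead of forming cov_q^{-1} explicitly
    trace_term = torch.trace(torch.linalg.solve(cov_q, cov_p))
    maha_term = diff @ torch.linalg.solve(cov_q, diff)
    logdet_term = torch.logdet(cov_q) - torch.logdet(cov_p)
    return 0.5 * (logdet_term - k + trace_term + maha_term)

mu_p = torch.tensor([0.0, 1.0])
cov_p = torch.tensor([[1.0, 0.3], [0.3, 0.5]])
mu_q = torch.tensor([0.5, 0.5])
cov_q = torch.tensor([[0.8, -0.1], [-0.1, 1.2]])

kl_manual = gaussian_kl(mu_p, cov_p, mu_q, cov_q)
kl_torch = kl_divergence(
    MultivariateNormal(mu_p, covariance_matrix=cov_p),
    MultivariateNormal(mu_q, covariance_matrix=cov_q),
)
print("manual:", kl_manual.item(), " torch:", kl_torch.item())
```

The two values should agree up to floating-point error.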



In [1]:
import torch
import torch.nn.functional as F

## 1) Discrete KL between two categorical distributions

In [2]:
# Ground-truth distribution P (e.g., a soft label)
P = torch.tensor([0.7, 0.2, 0.1])  # must sum to 1
# Model logits for Q (before softmax)
logits_Q = torch.tensor([1.2, 0.3, -0.8])

# Convert logits to log-probs with log_softmax (stable)
log_Q = F.log_softmax(logits_Q, dim=0)
Q = log_Q.exp()

# KL(P || Q) = sum P * (log P - log Q)
log_P = torch.log(P)
kl_PQ = torch.sum(P * (log_P - log_Q))

# Cross-entropy & entropy check: KL = H(P,Q) - H(P)
cross_entropy = -torch.sum(P * log_Q)
entropy_P = -torch.sum(P * log_P)
kl_check = cross_entropy - entropy_P

print("Q probs:", Q)
print("KL(P||Q):", kl_PQ.item(), " | via CE-Entropy:", kl_check.item())

Q probs: tensor([0.6485, 0.2637, 0.0878])
KL(P||Q): 0.011200123466551304  | via CE-Entropy: 0.011200129985809326


**Notes**

* Use `log_softmax` + log-space math for numerical stability.
* If $P$ is one-hot, this reduces to the usual negative log-likelihood for the correct class.
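The one-hot case can be verified directly. This sketch reuses the example logits from the cell above and assumes class 0 is the correct label. Because a one-hot $P$ contains zeros, it uses `torch.xlogy`, which evaluates $p\log p$ with the convention $0\log 0 = 0$, instead of `torch.log(P)`.

```python
import torch
import torch.nn.functional as F

logits_Q = torch.tensor([1.2, 0.3, -0.8])
log_Q = F.log_softmax(logits_Q, dim=0)

P_onehot = torch.tensor([1.0, 0.0, 0.0])  # assume class 0 is the correct label

# KL(P || Q) = sum P log P - sum P log Q, with 0 * log(0) treated as 0
kl_onehot = torch.sum(torch.xlogy(P_onehot, P_onehot) - P_onehot * log_Q)

nll = -log_Q[0]                                                 # NLL of the correct class
ce = F.cross_entropy(logits_Q.unsqueeze(0), torch.tensor([0]))  # built-in loss

print(kl_onehot.item(), nll.item(), ce.item())
```

All three numbers should coincide, since the entropy of a one-hot distribution is zero.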


## 2) KL between Gaussians (analytic) with `torch.distributions`

In [3]:
import torch
from torch.distributions import Normal, Independent, kl_divergence

# 1D Gaussians
P = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))      # N(0,1)
Q = Normal(loc=torch.tensor(1.0), scale=torch.tensor(0.5))      # N(1,0.5^2)

kl_PQ = kl_divergence(P, Q)
kl_QP = kl_divergence(Q, P)
print("KL(P||Q):", kl_PQ.item(), "  KL(Q||P):", kl_QP.item())

# Multivariate diagonal case using Independent(Normal(...))
mu_p = torch.tensor([0.0, 1.0, -1.0])
logvar_p = torch.tensor([0.0, -0.5, 0.2])   # log sigma^2
mu_q = torch.tensor([0.5, 0.5, -0.5])
logvar_q = torch.tensor([-0.2, 0.0, 0.0])

P_diag = Independent(Normal(loc=mu_p, scale=(0.5*logvar_p).exp()), 1)
Q_diag = Independent(Normal(loc=mu_q, scale=(0.5*logvar_q).exp()), 1)

print("Diag MV KL(P||Q):", kl_divergence(P_diag, Q_diag).item())

KL(P||Q): 2.8068528175354004   KL(Q||P): 0.8181471824645996
Diag MV KL(P||Q): 0.4773434102535248


## 3) Monte-Carlo estimate of KL when you can sample $x\sim P$ but don’t have a closed form

Using the expectation form $D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}[\log p(x)-\log q(x)]$.

In [4]:
import torch
from torch.distributions import Normal

torch.manual_seed(0)
P = Normal(0., 1.)      # true
Q = Normal(1., 0.5)     # approximation

N = 200000
x = P.sample((N,))       # x ~ P
mc_kl = (P.log_prob(x) - Q.log_prob(x)).mean()
print("MC estimate KL(P||Q):", mc_kl.item())

MC estimate KL(P||Q): 2.8220419883728027


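As $N$ grows, this approaches the analytic `kl_divergence(P, Q)` value (about 2.807 for these two Gaussians); the Monte Carlo error shrinks roughly like $1/\sqrt{N}$.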
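To see the convergence, the estimate can be repeated for several sample sizes together with its standard error. This is a sketch using the same two Gaussians; the sample sizes and the `results` dictionary are illustrative choices.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
P = Normal(0.0, 1.0)
Q = Normal(1.0, 0.5)
analytic = kl_divergence(P, Q).item()

results = {}
for n in (1_000, 100_000, 1_000_000):
    x = P.sample((n,))                         # x ~ P
    log_ratio = P.log_prob(x) - Q.log_prob(x)  # log p(x) - log q(x)
    estimate = log_ratio.mean().item()
    std_err = (log_ratio.std() / n ** 0.5).item()  # standard error of the mean
    results[n] = (estimate, std_err)
    print(f"N={n:>9,d}  estimate={estimate:.4f}  std_err={std_err:.4f}")

print(f"analytic KL(P||Q) = {analytic:.4f}")
```

The standard error gives a rough sense of how far an estimate is likely to be from the analytic value at each $N$.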

## 4) VAE-style KL: $q_\phi(z\mid x)=\mathcal N(\mu,\mathrm{diag}(\sigma^2))$ vs prior $p(z)=\mathcal N(0,I)$

The per-dimension closed form (summed over dims):

$$
D_{\mathrm{KL}}\big(\mathcal N(\mu,\mathrm{diag}(\sigma^2)) \,\|\, \mathcal N(0,I)\big)
= \frac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big)
$$

In [5]:
import torch

def kl_standard_normal(mu, logvar):
    # mu, logvar: (batch, latent_dim)
    # returns KL per-sample
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)

# Example batch
batch = 4
latent_dim = 8
mu = torch.zeros(batch, latent_dim)               # pretend encoder output
logvar = torch.zeros(batch, latent_dim)           # log sigma^2
kl = kl_standard_normal(mu, logvar)               # shape: (batch,)
print("KL to N(0,I) per sample:", kl)             # zeros because mu=0, logvar=0 => σ^2=1

KL to N(0,I) per sample: tensor([0., 0., 0., 0.])
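The all-zeros example only exercises the trivial case. As a sanity check on non-trivial inputs, the hand-written formula can be compared against `torch.distributions`. This is a sketch with random, made-up encoder outputs; the function is repeated so the block runs on its own.

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

def kl_standard_normal(mu, logvar):
    # Same formula as above: KL( N(mu, diag(exp(logvar))) || N(0, I) ) per sample
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)

torch.manual_seed(0)
mu = torch.randn(4, 8)            # pretend encoder means
logvar = 0.5 * torch.randn(4, 8)  # pretend encoder log-variances

q = Independent(Normal(mu, (0.5 * logvar).exp()), 1)
prior = Independent(Normal(torch.zeros_like(mu), torch.ones_like(mu)), 1)

kl_formula = kl_standard_normal(mu, logvar)
kl_library = kl_divergence(q, prior)
print("formula:", kl_formula)
print("library:", kl_library)
```

Both tensors have shape `(batch,)` and should match up to floating-point error.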


## 5) Training loop snippet: minimizing KL wrt $Q$

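If you have a fixed $P$ (e.g., labels or a teacher distribution) and you’re learning $Q_\theta$, you can minimize either the cross-entropy $H(P,Q_\theta)$ or the KL divergence itself: they differ by $H(P)$, which is constant in $\theta$, so their gradients are identical. The snippet below computes the full KL, which has the convenient property of reaching exactly 0 when $Q_\theta=P$: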


In [6]:
import torch
import torch.nn.functional as F

def kl_loss_to_target_distribution(logits, target_probs):
    # logits: (B, C), target_probs: (B, C) rows sum to 1
    log_q = F.log_softmax(logits, dim=1)
    log_p = torch.log(target_probs.clamp_min(1e-12))
    return torch.mean(torch.sum(target_probs * (log_p - log_q), dim=1))  # KL(P||Q)

# Example
B, C = 3, 5
logits = torch.randn(B, C, requires_grad=True)
with torch.no_grad():
    target_probs = F.softmax(torch.randn(B, C), dim=1)

loss = kl_loss_to_target_distribution(logits, target_probs)
loss.backward()
print("KL loss:", loss.item())


KL loss: 1.0116071701049805
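The claim that KL and cross-entropy give the same gradients with respect to the logits can be checked with autograd. This sketch uses random, made-up logits and targets, and also compares against the built-in `F.kl_div`, which expects log-probabilities as its first argument and (by default) probabilities as its second.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 3, 5
logits = torch.randn(B, C, requires_grad=True)
target_probs = F.softmax(torch.randn(B, C), dim=1)  # strictly positive rows

log_q = F.log_softmax(logits, dim=1)
kl = torch.mean(torch.sum(target_probs * (target_probs.log() - log_q), dim=1))
ce = torch.mean(torch.sum(-target_probs * log_q, dim=1))

# Built-in equivalent: batchmean divides the total sum by the batch size
kl_builtin = F.kl_div(log_q, target_probs, reduction="batchmean")

grad_kl = torch.autograd.grad(kl, logits, retain_graph=True)[0]
grad_ce = torch.autograd.grad(ce, logits)[0]

print("KL:", kl.item(), " built-in KL:", kl_builtin.item(), " CE:", ce.item())
print("max grad difference:", (grad_kl - grad_ce).abs().max().item())
```

The loss values differ by the target entropy, but the gradients should be identical up to floating-point error.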


## Common gotchas

* **Support mismatch:** If $Q(x)=0$ where $P(x)>0$, $D_{\mathrm{KL}}(P\|Q)=\infty$. In code, clamping probabilities before taking logs avoids `inf`/`nan`, but the result is then a large finite number rather than a true infinity.
* **Direction matters:** Swapping $P$ and $Q$ changes both the value and the behavior of the optimizer.
* **Log-space for stability:** Prefer log-probs (`log_softmax`) and add a small epsilon when taking logs of empirical probabilities.
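The support-mismatch and direction gotchas can be reproduced on a three-outcome example. This is a plain-Python sketch; `kl_discrete` is a hypothetical helper, the distributions are made up, and the clamped version is left unnormalized for simplicity.

```python
import math

def kl_discrete(p, q):
    """KL(P || Q) in nats for two probability lists of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf   # P has mass where Q has none
        total += pi * math.log(pi / qi)
    return total

P = [0.5, 0.5, 0.0]
Q = [1.0, 0.0, 0.0]

kl_pq = kl_discrete(P, Q)  # infinite: Q misses outcome 1, where P has mass
kl_qp = kl_discrete(Q, P)  # finite: P covers everything Q uses

# Clamping Q away from zero trades the infinity for a large finite penalty
eps = 1e-12
Q_clamped = [max(qi, eps) for qi in Q]
kl_pq_clamped = kl_discrete(P, Q_clamped)

print("KL(P||Q):", kl_pq)
print("KL(Q||P):", kl_qp)
print("KL(P||Q) with clamped Q:", kl_pq_clamped)
```

The size of the clamped value depends entirely on the chosen `eps`, which is why it should be read as a numerical guard rather than as a meaningful divergence.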

# The connection between **KL divergence** and **maximum likelihood estimation (MLE)**

---

### 1. Start with KL divergence

We want to measure how close our model distribution $p_\theta(x)$ is to the true data distribution $p_{\text{data}}(x)$.

$$
D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \Big[ \log \frac{p_{\text{data}}(x)}{p_\theta(x)} \Big]
$$

---

### 2. Separate the terms

$$
D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)] \;-\; \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]
$$

* The **first term** ($\mathbb{E}[\log p_{\text{data}}(x)]$) depends only on the true data distribution, which we **can’t change**.
* The **second term** ($-\mathbb{E}[\log p_\theta(x)]$) depends on our model parameters $\theta$.
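The two-term split can be checked numerically for a case where everything is available in closed form. This sketch assumes $p_{\text{data}}=\mathcal N(2,1)$ and a unit-variance Gaussian model with a few hypothetical values of $\theta$. The first term is the negative entropy of $p_{\text{data}}$, and the second is estimated by Monte Carlo.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
p_data = Normal(2.0, 1.0)
x = p_data.sample((500_000,))

# First term: E[log p_data(x)] = -H(p_data); it does not depend on theta
neg_entropy = -p_data.entropy().item()

checks = {}
for theta in (0.0, 1.0, 2.0):
    p_theta = Normal(theta, 1.0)
    # Second term: -E[log p_theta(x)], estimated from samples of p_data
    expected_nll = -p_theta.log_prob(x).mean().item()
    kl = kl_divergence(p_data, p_theta).item()
    checks[theta] = (kl, neg_entropy + expected_nll)
    print(f"theta={theta:.1f}  analytic KL={kl:.4f}  sum of terms={neg_entropy + expected_nll:.4f}")
```

Only the second term changes as $\theta$ varies, which is why it alone drives the optimization.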

---

### 3. What this means

Minimizing KL divergence (making our model close to the real data distribution) is the same as:

$$
\arg \min_\theta D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)
= \arg \max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]
$$

**In words:**

> Training your model means adjusting $\theta$ so that it gives **high probability to the data you actually observe**.

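With a finite dataset, the expectation is approximated by the average log-likelihood over the observed samples, and maximizing that average is exactly **maximum likelihood estimation (MLE)**.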

---

### 4. Intuition

* KL divergence is just telling us: “Make the model’s probabilities match the real data’s probabilities.”
* Since we only have samples from $p_{\text{data}}$, we maximize the **log-likelihood** of those samples.
* Observed data points to which the model assigns very small probability $p_\theta(x)$ are penalized heavily, because $-\log p_\theta(x)\to\infty$ as $p_\theta(x)\to 0$.
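The step from an expectation to a sample average can be made concrete. In this sketch the data are assumed to come from $\mathcal N(2,1)$ and the model is a unit-variance Gaussian with unknown mean. A brute-force search over candidate means (the hypothetical grid `thetas`) finds the value that maximizes the average log-likelihood, which for this model is the sample mean.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = Normal(2.0, 1.0).sample((1000,))  # observed samples from p_data

thetas = torch.linspace(0.0, 4.0, 401)  # candidate means, step 0.01
# Average log-likelihood of the data under each candidate model N(theta, 1)
avg_loglik = Normal(thetas[:, None], 1.0).log_prob(x).mean(dim=1)

theta_hat = thetas[avg_loglik.argmax()].item()
print("grid-search MLE:", theta_hat)
print("sample mean:    ", x.mean().item())
```

The grid-search maximizer should land within one grid step of the sample mean.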

---

### 5. Simple analogy

Think of KL divergence like a teacher grading a model:

* The teacher knows the true answers (the real distribution).
* The model proposes probabilities for each answer.
* The teacher penalizes the model whenever it gives **low probability to the correct answer**.
* Minimizing KL = learning to assign high probability to the data that actually occurs.

---

**Training with maximum likelihood = minimizing KL divergence between the real data distribution and your model.**


Let’s do a small **PyTorch demo** that shows how maximizing log-likelihood = minimizing KL divergence.

We’ll use a very simple case:

* The **true data distribution** is a Normal(μ=2, σ=1).
* Our **model** is a Normal(μ=θ, σ=1), and we’ll try to learn its mean θ (the variable `mu` in the code).
* Training with **maximum likelihood** will push μ toward 2.
* That’s the same as minimizing the **KL divergence**.


In [8]:
import torch
import torch.optim as optim
from torch.distributions import Normal, kl_divergence

# True data distribution
p_data = Normal(loc=2.0, scale=1.0)

# Model distribution (parameterized by mu = θ)
mu = torch.tensor([0.0], requires_grad=True)   # start from wrong mean
model = lambda: Normal(loc=mu, scale=torch.tensor(1.0))

optimizer = optim.SGD([mu], lr=0.1)

for step in range(20):
    # Sample from the true data distribution
    x = p_data.sample((512,))  # minibatch of data
    
    # Negative log-likelihood loss (equivalent to KL up to a constant)
    log_likelihood = model().log_prob(x).mean()
    loss = -log_likelihood
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Analytic KL between p_data and the model, evaluated after this step's update
    kl = kl_divergence(p_data, model()).item()
    
    print(f"Step {step:02d} | mu={mu.item():.3f} | NLL={loss.item():.3f} | KL={kl:.3f}")


Step 00 | mu=0.202 | NLL=3.473 | KL=1.617
Step 01 | mu=0.380 | NLL=2.987 | KL=1.313
Step 02 | mu=0.540 | NLL=2.669 | KL=1.066
Step 03 | mu=0.688 | NLL=2.505 | KL=0.861
Step 04 | mu=0.816 | NLL=2.204 | KL=0.700
Step 05 | mu=0.927 | NLL=2.047 | KL=0.575
Step 06 | mu=1.032 | NLL=1.979 | KL=0.469
Step 07 | mu=1.125 | NLL=1.844 | KL=0.382
Step 08 | mu=1.212 | NLL=1.802 | KL=0.311
Step 09 | mu=1.291 | NLL=1.733 | KL=0.251
Step 10 | mu=1.359 | NLL=1.633 | KL=0.205
Step 11 | mu=1.419 | NLL=1.518 | KL=0.169
Step 12 | mu=1.473 | NLL=1.546 | KL=0.139
Step 13 | mu=1.527 | NLL=1.573 | KL=0.112
Step 14 | mu=1.577 | NLL=1.562 | KL=0.090
Step 15 | mu=1.626 | NLL=1.532 | KL=0.070
Step 16 | mu=1.662 | NLL=1.508 | KL=0.057
Step 17 | mu=1.698 | NLL=1.505 | KL=0.046
Step 18 | mu=1.726 | NLL=1.494 | KL=0.038
Step 19 | mu=1.752 | NLL=1.447 | KL=0.031


---

### What happens:

* `loss` = **negative log-likelihood** → what we minimize.
* `kl` = **KL divergence** between the true distribution and our model.
* As training goes on:

  * `mu` moves closer to `2.0` (the true mean).
  * Both the **NLL** and **KL divergence** go down.

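This shows directly that **minimizing NLL (MLE)** is the same as **minimizing KL divergence**: in expectation the two differ only by the entropy of $p_{\text{data}}$, which is $\tfrac{1}{2}\log(2\pi e)\approx 1.419$ nats for a unit-variance Gaussian. In the log above, each step's NLL (computed before the update) is roughly the previous step's KL (computed after the update) plus that constant, up to minibatch noise; for example, $0.038 + 1.419 \approx 1.457$ against the observed NLL of 1.447 at step 19.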