# What is KL divergence?

**Kullback–Leibler (KL) divergence** measures how one distribution $P$ differs from another distribution $Q$ that you use to approximate it.

* **Definition (discrete):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\,\log\frac{P(x)}{Q(x)}
  $$
* **Definition (continuous):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\int p(x)\,\log\frac{p(x)}{q(x)}\,dx
  $$
* **Expectation form (very useful):**

  $$
  D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}\big[\log p(x)-\log q(x)\big]
  $$

# Key properties (the “feel”)

* $D_{\mathrm{KL}}(P\|Q)\ge 0$ (Gibbs’ inequality), and $=0$ iff $P=Q$ almost everywhere.
* **Not symmetric:** in general $D_{\mathrm{KL}}(P\|Q)\neq D_{\mathrm{KL}}(Q\|P)$, so KL is not a distance metric.
* **Relation to cross-entropy:**
  $D_{\mathrm{KL}}(P\|Q)=H(P,Q)-H(P)$.
  When optimizing over $Q$, minimizing KL is the same as minimizing cross-entropy (since $H(P)$ is constant in $Q$).
* **Mode-covering vs mode-seeking:**
  Minimizing $D_{\mathrm{KL}}(P\|Q)$ (as in MLE) encourages $Q$ to “cover” all support where $P$ has mass; minimizing the **reverse** $D_{\mathrm{KL}}(Q\|P)$ is more mode-seeking.
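
The mode-covering versus mode-seeking contrast can be seen on a toy problem. The sketch below is only an illustration under assumptions that are not part of the text above: a bimodal target (an equal mixture of Gaussians at ±3 with std 0.5), a single-Gaussian $Q$, a discretized grid on $[-10, 10]$, and a brute-force search over the mean and std of $Q$. The helper names `grid_log_probs` and `best` are hypothetical.

```python
import math
import torch

# Discretize the real line so both KL directions become finite sums.
x = torch.linspace(-10.0, 10.0, 2001, dtype=torch.float64)

def grid_log_probs(mean, std):
    """Log-probabilities of a Gaussian restricted to the grid x and renormalized."""
    mean = torch.as_tensor(mean, dtype=torch.float64).unsqueeze(-1)
    std = torch.as_tensor(std, dtype=torch.float64).unsqueeze(-1)
    log_density = -0.5 * ((x - mean) / std) ** 2 - torch.log(std)
    return log_density - torch.logsumexp(log_density, dim=-1, keepdim=True)

# Target P: equal mixture of two narrow Gaussians at -3 and +3 (assumed toy target).
log_p = torch.logsumexp(
    torch.stack([grid_log_probs(-3.0, 0.5), grid_log_probs(3.0, 0.5)]), dim=0
) - math.log(2.0)

# Candidate Q: every (mean, std) pair on a coarse search grid.
means = torch.linspace(-4.0, 4.0, 33, dtype=torch.float64)
stds = torch.linspace(0.3, 4.0, 38, dtype=torch.float64)
mm, ss = torch.meshgrid(means, stds, indexing="ij")
log_q = grid_log_probs(mm, ss)                      # shape (33, 38, 2001)

p, q = log_p.exp(), log_q.exp()
forward_kl = (p * (log_p - log_q)).sum(-1)          # KL(P || Q) for each candidate
reverse_kl = (q * (log_q - log_p)).sum(-1)          # KL(Q || P) for each candidate

def best(kl):
    # Return the (mean, std) pair with the smallest divergence.
    i, j = divmod(int(kl.argmin()), kl.shape[1])
    return means[i].item(), stds[j].item()

m_fwd, s_fwd = best(forward_kl)
m_rev, s_rev = best(reverse_kl)
print(f"forward KL fit: mean={m_fwd:+.2f}, std={s_fwd:.2f}")
print(f"reverse KL fit: mean={m_rev:+.2f}, std={s_rev:.2f}")
```

On this grid the forward fit should sit between the two modes with a wide std, since it must put mass wherever $P$ has mass. The reverse fit should lock onto one of the modes with a std close to 0.5.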

# Why it matters (quick intuition)

* In supervised learning with softmax, the usual loss is cross-entropy $H(P,Q)$, which is equivalent to minimizing $D_{\mathrm{KL}}(P\|Q)$ with respect to $Q$.
* In VAEs, a KL term regularizes the approximate posterior $q_\phi(z\mid x)$ toward a simple prior $p(z)$ (often $\mathcal N(0,I)$).
* In information theory, KL is the expected extra code length when coding samples from $P$ using a code optimized for $Q$.
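
The coding interpretation can be checked by hand with base-2 logs. The sketch below assumes a four-symbol source with dyadic probabilities and a mismatched uniform code; both distributions are made up for illustration.

```python
import torch

# Assumed toy source: four symbols with dyadic probabilities.
P = torch.tensor([0.5, 0.25, 0.125, 0.125])
# Assumed mismatched model: a uniform code, i.e. 2 bits per symbol.
Q = torch.tensor([0.25, 0.25, 0.25, 0.25])

ideal_len_P = -torch.log2(P)   # ideal code lengths if we knew P: 1, 2, 3, 3 bits
ideal_len_Q = -torch.log2(Q)   # lengths of the code built for Q: 2 bits each

entropy_bits = torch.sum(P * ideal_len_P)        # H(P): best achievable average length
cross_entropy_bits = torch.sum(P * ideal_len_Q)  # H(P, Q): average length with the Q-code
kl_bits = torch.sum(P * (torch.log2(P) - torch.log2(Q)))

print(f"H(P)     = {entropy_bits.item():.2f} bits")
print(f"H(P,Q)   = {cross_entropy_bits.item():.2f} bits")
print(f"KL(P||Q) = {kl_bits.item():.2f} bits")
```

Here $H(P)=1.75$ bits and $H(P,Q)=2$ bits, so the code built for $Q$ wastes $0.25$ bits per symbol on average, which is exactly $D_{\mathrm{KL}}(P\|Q)$ measured in bits.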

# Handy closed form: Gaussian–Gaussian

For $P=\mathcal N(\mu_p,\Sigma_p)$, $Q=\mathcal N(\mu_q,\Sigma_q)$ in $k$ dimensions:

$$
D_{\mathrm{KL}}(P\|Q) = \frac{1}{2}\Big(
\log\frac{\det\Sigma_q}{\det\Sigma_p}
- k
+ \mathrm{tr}(\Sigma_q^{-1}\Sigma_p)
+ (\mu_q-\mu_p)^\top \Sigma_q^{-1}(\mu_q-\mu_p)
\Big)
$$

(For diagonal covariances this simplifies element-wise.)
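
As a sanity check, the closed form above can be written out term by term and compared with `torch.distributions`. This is a minimal sketch: the function name `gaussian_kl` and the example means and covariances are assumptions chosen for illustration.

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL(N(mu_p, cov_p) || N(mu_q, cov_q)) for full covariance matrices."""
    k = mu_p.shape[-1]
    diff = mu_q - mu_p
    # Solve linear systems against cov_q instead of forming its inverse.
    trace_term = torch.trace(torch.linalg.solve(cov_q, cov_p))
    quad_term = diff @ torch.linalg.solve(cov_q, diff)
    logdet_term = torch.logdet(cov_q) - torch.logdet(cov_p)
    return 0.5 * (logdet_term - k + trace_term + quad_term)

# Assumed example parameters (both covariances are symmetric positive definite).
mu_p = torch.tensor([0.0, 1.0], dtype=torch.float64)
cov_p = torch.tensor([[1.0, 0.3], [0.3, 0.5]], dtype=torch.float64)
mu_q = torch.tensor([0.5, -0.5], dtype=torch.float64)
cov_q = torch.tensor([[2.0, -0.4], [-0.4, 1.0]], dtype=torch.float64)

manual = gaussian_kl(mu_p, cov_p, mu_q, cov_q)
reference = kl_divergence(
    MultivariateNormal(mu_p, covariance_matrix=cov_p),
    MultivariateNormal(mu_q, covariance_matrix=cov_q),
)
print("manual:", manual.item(), " torch.distributions:", reference.item())
```

The two numbers should agree to floating-point precision. Using `torch.linalg.solve` rather than an explicit inverse is a design choice for numerical stability, not something the formula requires.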



In [1]:
import torch
import torch.nn.functional as F

In [2]:
## 1) Discrete KL between two categorical distributions


# Ground-truth distribution P (e.g., a soft label)
P = torch.tensor([0.7, 0.2, 0.1])  # must sum to 1
# Model logits for Q (before softmax)
logits_Q = torch.tensor([1.2, 0.3, -0.8])

# Convert logits to log-probs with log_softmax (stable)
log_Q = F.log_softmax(logits_Q, dim=0)
Q = log_Q.exp()

# KL(P || Q) = sum P * (log P - log Q)
log_P = torch.log(P)
kl_PQ = torch.sum(P * (log_P - log_Q))

# Cross-entropy & entropy check: KL = H(P,Q) - H(P)
cross_entropy = -torch.sum(P * log_Q)
entropy_P = -torch.sum(P * log_P)
kl_check = cross_entropy - entropy_P

print("Q probs:", Q)
print("KL(P||Q):", kl_PQ.item(), " | via CE-Entropy:", kl_check.item())

Q probs: tensor([0.6485, 0.2637, 0.0878])
KL(P||Q): 0.011200123466551304  | via CE-Entropy: 0.011200129985809326


**Notes**

* Use `log_softmax` + log-space math for numerical stability.
* If $P$ is one-hot, this reduces to the usual negative log-likelihood for the correct class.
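
The one-hot case can be verified directly. The sketch below reuses the illustrative logits from the cell above and assumes class 0 is the correct label. `torch.xlogy` is used so that the $0\log 0$ terms of a one-hot $P$ count as zero.

```python
import torch
import torch.nn.functional as F

logits_Q = torch.tensor([1.2, 0.3, -0.8])   # same illustrative logits as above
target = torch.tensor(0)                    # assumed correct class index

P_onehot = F.one_hot(target, num_classes=3).float()   # tensor([1., 0., 0.])
log_Q = F.log_softmax(logits_Q, dim=0)

# torch.xlogy(p, p) returns 0 where p == 0, so 0 * log 0 is treated as 0.
kl_onehot = torch.sum(torch.xlogy(P_onehot, P_onehot) - P_onehot * log_Q)

# Standard NLL for the correct class via the built-in loss (expects a batch dim).
nll = F.cross_entropy(logits_Q.unsqueeze(0), target.unsqueeze(0))

print("KL(one-hot || Q):", kl_onehot.item(), " | NLL:", nll.item())
```

Both values equal $-\log Q(\text{correct class})$, because the entropy of a one-hot distribution is zero.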


In [3]:
## 2) KL between Gaussians (analytic) with `torch.distributions`

import torch
from torch.distributions import Normal, Independent, kl_divergence

# 1D Gaussians
P = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))      # N(0,1)
Q = Normal(loc=torch.tensor(1.0), scale=torch.tensor(0.5))      # N(1,0.5^2)

kl_PQ = kl_divergence(P, Q)
kl_QP = kl_divergence(Q, P)
print("KL(P||Q):", kl_PQ.item(), "  KL(Q||P):", kl_QP.item())

# Multivariate diagonal case using Independent(Normal(...))
mu_p = torch.tensor([0.0, 1.0, -1.0])
logvar_p = torch.tensor([0.0, -0.5, 0.2])   # log sigma^2
mu_q = torch.tensor([0.5, 0.5, -0.5])
logvar_q = torch.tensor([-0.2, 0.0, 0.0])

P_diag = Independent(Normal(loc=mu_p, scale=(0.5*logvar_p).exp()), 1)
Q_diag = Independent(Normal(loc=mu_q, scale=(0.5*logvar_q).exp()), 1)

print("Diag MV KL(P||Q):", kl_divergence(P_diag, Q_diag).item())

KL(P||Q): 2.8068528175354004   KL(Q||P): 0.8181471824645996
Diag MV KL(P||Q): 0.4773434102535248


## 3) Monte-Carlo estimate of KL when you can sample $x\sim P$ but don’t have a closed form

Using the expectation form $D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}[\log p(x)-\log q(x)]$.

In [4]:
import torch
from torch.distributions import Normal

torch.manual_seed(0)
P = Normal(0., 1.)      # true
Q = Normal(1., 0.5)     # approximation

N = 200000
x = P.sample((N,))       # x ~ P
mc_kl = (P.log_prob(x) - Q.log_prob(x)).mean()
print("MC estimate KL(P||Q):", mc_kl.item())

MC estimate KL(P||Q): 2.8220419883728027


As $N$ grows, this approaches the analytic `kl_divergence(P, Q)` value (about 2.807 in the previous cell).
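
To see the convergence rather than take it on faith, one can repeat the estimate for several sample sizes and report an estimated standard error. This is a rough sketch using the same $P$ and $Q$ as above; the sample sizes and the `results` dictionary are arbitrary choices.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
P = Normal(0.0, 1.0)
Q = Normal(1.0, 0.5)
analytic = kl_divergence(P, Q).item()

results = {}
for N in (100, 10_000, 1_000_000):
    x = P.sample((N,))
    terms = P.log_prob(x) - Q.log_prob(x)      # one log-ratio per sample
    estimate = terms.mean().item()
    std_err = (terms.std() / N ** 0.5).item()  # estimated standard error of the mean
    results[N] = (estimate, std_err)
    print(f"N={N:>9,d}  estimate={estimate:.4f}  +/- {std_err:.4f}  (analytic {analytic:.4f})")
```

The standard error shrinks roughly like $1/\sqrt{N}$, so each 100-fold increase in samples buys about one extra decimal digit.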

## 4) VAE-style KL: $q_\phi(z\mid x)=\mathcal N(\mu,\mathrm{diag}(\sigma^2))$ vs prior $p(z)=\mathcal N(0,I)$

The per-dimension closed form (summed over dims):

$$
D_{\mathrm{KL}}\big(\mathcal N(\mu,\mathrm{diag}(\sigma^2)) \,\|\, \mathcal N(0,I)\big)
= \frac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\big)
$$

In [5]:
import torch

def kl_standard_normal(mu, logvar):
    # mu, logvar: (batch, latent_dim)
    # returns KL per-sample
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)

# Example batch
batch = 4
latent_dim = 8
mu = torch.zeros(batch, latent_dim)               # pretend encoder output
logvar = torch.zeros(batch, latent_dim)           # log sigma^2
kl = kl_standard_normal(mu, logvar)               # shape: (batch,)
print("KL to N(0,I) per sample:", kl)             # zeros because mu=0, logvar=0 => σ^2=1

KL to N(0,I) per sample: tensor([0., 0., 0., 0.])
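
The all-zeros example is a trivial case. A slightly stronger check, sketched below with random stand-ins for the encoder outputs (an assumption for illustration), compares the closed form against `kl_divergence` from `torch.distributions`.

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

def kl_standard_normal(mu, logvar):
    # Same closed form as above: KL(N(mu, diag(sigma^2)) || N(0, I)) per sample.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)

torch.manual_seed(0)
mu = torch.randn(4, 8)             # assumed stand-in for encoder means
logvar = 0.5 * torch.randn(4, 8)   # assumed stand-in for encoder log-variances

q = Independent(Normal(mu, (0.5 * logvar).exp()), 1)
prior = Independent(Normal(torch.zeros_like(mu), torch.ones_like(mu)), 1)

closed_form = kl_standard_normal(mu, logvar)
reference = kl_divergence(q, prior)   # shape: (4,)
print("closed form:", closed_form)
print("reference:  ", reference)
```

The two tensors should match to floating-point precision, and every entry is positive because none of the random posteriors equals the prior.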


## 5) Training loop snippet: minimizing KL wrt $Q$

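If you have a fixed $P$ (e.g., labels or a teacher distribution) and you’re learning $Q_\theta$, a stable objective is the cross-entropy $H(P,Q_\theta)$, which equals the KL up to the constant $H(P)$. The function below computes the KL itself, so its gradient with respect to the logits is the same as that of the cross-entropy: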


In [6]:
import torch
import torch.nn.functional as F

def kl_loss_to_target_distribution(logits, target_probs):
    # logits: (B, C), target_probs: (B, C) rows sum to 1
    log_q = F.log_softmax(logits, dim=1)
    log_p = torch.log(target_probs.clamp_min(1e-12))
    return torch.mean(torch.sum(target_probs * (log_p - log_q), dim=1))  # KL(P||Q)

# Example
B, C = 3, 5
logits = torch.randn(B, C, requires_grad=True)
with torch.no_grad():
    target_probs = F.softmax(torch.randn(B, C), dim=1)

loss = kl_loss_to_target_distribution(logits, target_probs)
loss.backward()
print("KL loss:", loss.item())


KL loss: 1.0116071701049805
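
Because KL and cross-entropy differ only by $H(P)$, which does not depend on the logits, their gradients should coincide. The sketch below checks this on random tensors and also compares the hand-written KL with PyTorch's built-in `F.kl_div`, which takes log-probabilities as its first argument. The shapes and seed are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 3, 5
logits = torch.randn(B, C, requires_grad=True)
target_probs = F.softmax(torch.randn(B, C), dim=1)   # assumed fixed teacher distribution

log_q = F.log_softmax(logits, dim=1)

# KL(P || Q) averaged over the batch, written out and via the built-in.
kl_manual = torch.mean(torch.sum(target_probs * (target_probs.log() - log_q), dim=1))
kl_builtin = F.kl_div(log_q, target_probs, reduction="batchmean")

# Cross-entropy H(P, Q) averaged over the batch.
ce = torch.mean(torch.sum(-target_probs * log_q, dim=1))

grad_kl, = torch.autograd.grad(kl_manual, logits, retain_graph=True)
grad_ce, = torch.autograd.grad(ce, logits)

print("KL manual vs built-in:", kl_manual.item(), kl_builtin.item())
print("max |grad_KL - grad_CE|:", (grad_kl - grad_ce).abs().max().item())
```

The loss values differ by the entropy of the targets, but the optimizer sees the same gradient either way.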


## Common gotchas

* **Support mismatch:** If $Q(x)=0$ where $P(x)>0$, $D_{\mathrm{KL}}(P\|Q)=\infty$. In code, clamp probabilities when taking logs.
* **Direction matters:** Swapping $P$ and $Q$ changes both the value and the behavior of the optimizer.
* **Log-space for stability:** Prefer log-probs (`log_softmax`) and add small eps when taking logs of empirical probabilities.
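
The support-mismatch gotcha is easy to reproduce. In the sketch below the two distributions are made up so that $Q$ assigns zero mass to an outcome that $P$ considers likely. `torch.xlogy` implements the $0\log 0 = 0$ convention, and the clamp value `eps` is an arbitrary choice.

```python
import torch

P = torch.tensor([0.5, 0.5, 0.0])
Q = torch.tensor([1.0, 0.0, 0.0])   # assumed model with zero mass on the second outcome

# Naive formula: log(0) - log(0) on the third outcome produces nan.
kl_naive = torch.sum(P * (torch.log(P) - torch.log(Q)))

# torch.xlogy(a, b) = a * log(b), defined as 0 wherever a == 0.
kl_exact = torch.sum(torch.xlogy(P, P) - torch.xlogy(P, Q))

# Clamping Q keeps the loss finite, at the cost of a large but bounded penalty.
eps = 1e-12
kl_clamped = torch.sum(torch.xlogy(P, P) - torch.xlogy(P, Q.clamp_min(eps)))

print("naive:  ", kl_naive.item())    # nan
print("exact:  ", kl_exact.item())    # inf
print("clamped:", kl_clamped.item())  # finite
```

Clamping changes the objective slightly, so it is a practical workaround rather than the mathematically exact value.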

# The deep connection between **KL divergence** and **maximum likelihood estimation (MLE)**.

---

### 1. Start with KL divergence

We want to measure how close our model distribution $p_\theta(x)$ is to the true data distribution $p_{\text{data}}(x)$.

$$
D(p_{\text{data}} \| p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \Big[ \log \frac{p_{\text{data}}(x)}{p_\theta(x)} \Big]
$$

---

### 2. Separate the terms

$$
= \mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)] \;-\; \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]
$$

* The **first term** ($\mathbb{E}[\log p_{\text{data}}(x)]$) depends only on the true data distribution, which we **can’t change**.
* The **second term** ($-\mathbb{E}[\log p_\theta(x)]$) depends on our model parameters $\theta$.

---

### 3. What this means

Minimizing KL divergence (making our model close to the real data distribution) is the same as:

$$
\arg \min_\theta D(p_{\text{data}} \| p_\theta)
= \arg \max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]
$$

**In words:**

> Training your model means adjusting $\theta$ so that it gives **high probability to the data you actually observe**.

This is exactly **maximum likelihood estimation (MLE)**.

---

### 4. Intuition

* KL divergence is just telling us: “Make the model’s probabilities match the real data’s probabilities.”
* Since we only have samples from $p_{\text{data}}$, we maximize the **log-likelihood** of those samples.
* Data points to which the model assigns very small probability $p_\theta(x)$ are penalized heavily, because $-\log p_\theta(x)\to\infty$ as $p_\theta(x)\to 0$.

---

### 5. Simple analogy

Think of KL divergence like a teacher grading a model:

* The teacher knows the true answers (the real distribution).
* The model proposes probabilities for each answer.
* The teacher penalizes the model whenever it gives **low probability to the correct answer**.
* Minimizing KL = learning to assign the highest probability to the real data.

---

**Training with maximum likelihood = minimizing KL divergence between the real data distribution and your model.**


Let’s do a small **PyTorch demo** that shows how maximizing log-likelihood = minimizing KL divergence.

We’ll use a very simple case:

* The **true data distribution** is a Normal(μ=2, σ=1).
* Our **model** is a Normal(μ=θ, σ=1), and we’ll try to learn μ.
* Training with **maximum likelihood** will push μ toward 2.
* That’s the same as minimizing the **KL divergence**.


In [8]:
import torch
import torch.optim as optim
from torch.distributions import Normal, kl_divergence

# True data distribution
p_data = Normal(loc=2.0, scale=1.0)

# Model distribution (parameterized by mu = θ)
mu = torch.tensor([0.0], requires_grad=True)   # start from wrong mean
model = lambda: Normal(loc=mu, scale=torch.tensor(1.0))

optimizer = optim.SGD([mu], lr=0.1)

for step in range(20):
    # Sample from the true data distribution
    x = p_data.sample((512,))  # minibatch of data
    
    # Negative log-likelihood loss (equivalent to KL up to a constant)
    log_likelihood = model().log_prob(x).mean()
    loss = -log_likelihood
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Compare with true KL divergence analytically
    kl = kl_divergence(p_data, model()).item()
    
    print(f"Step {step:02d} | mu={mu.item():.3f} | NLL={loss.item():.3f} | KL={kl:.3f}")


Step 00 | mu=0.202 | NLL=3.473 | KL=1.617
Step 01 | mu=0.380 | NLL=2.987 | KL=1.313
Step 02 | mu=0.540 | NLL=2.669 | KL=1.066
Step 03 | mu=0.688 | NLL=2.505 | KL=0.861
Step 04 | mu=0.816 | NLL=2.204 | KL=0.700
Step 05 | mu=0.927 | NLL=2.047 | KL=0.575
Step 06 | mu=1.032 | NLL=1.979 | KL=0.469
Step 07 | mu=1.125 | NLL=1.844 | KL=0.382
Step 08 | mu=1.212 | NLL=1.802 | KL=0.311
Step 09 | mu=1.291 | NLL=1.733 | KL=0.251
Step 10 | mu=1.359 | NLL=1.633 | KL=0.205
Step 11 | mu=1.419 | NLL=1.518 | KL=0.169
Step 12 | mu=1.473 | NLL=1.546 | KL=0.139
Step 13 | mu=1.527 | NLL=1.573 | KL=0.112
Step 14 | mu=1.577 | NLL=1.562 | KL=0.090
Step 15 | mu=1.626 | NLL=1.532 | KL=0.070
Step 16 | mu=1.662 | NLL=1.508 | KL=0.057
Step 17 | mu=1.698 | NLL=1.505 | KL=0.046
Step 18 | mu=1.726 | NLL=1.494 | KL=0.038
Step 19 | mu=1.752 | NLL=1.447 | KL=0.031


---

### What happens:

* `loss` = **negative log-likelihood** → what we minimize.
* `kl` = **KL divergence** between the true distribution and our model.
* As training goes on:

  * `mu` moves closer to `2.0` (the true mean).
  * Both the **NLL** and **KL divergence** go down.

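This shows directly that **minimizing NLL (MLE)** is the same as **minimizing KL divergence**: the expected NLL equals the KL plus the entropy of $p_{\text{data}}$, which is $\tfrac{1}{2}\log(2\pi e)\approx 1.419$ nats for a unit-variance Gaussian and does not depend on $\mu$. In the last row, $1.447 \approx 0.031 + 1.419$ up to minibatch noise.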

# How **Maximum Likelihood Estimation (MLE)** works in practice

---

### 1. The big idea

We want our model $p_\theta(x)$ to match the true data distribution.
But we **don’t know the true distribution**, we only have samples (a dataset).

So instead of the *expected* log-likelihood:

$$
\mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]
$$

we approximate it with the **empirical average over the dataset**.

---

### 2. Empirical log-likelihood

$$
\mathbb{E}_D[\log p_\theta(x)] 
= \frac{1}{|D|} \sum_{x \in D} \log p_\theta(x)
$$

This just means: take all data points in your dataset $D$, compute their log-probabilities under your model, and average them.

---

### 3. Maximum likelihood learning

$$
\max_\theta \; \frac{1}{|D|} \sum_{x \in D} \log p_\theta(x)
$$

So, MLE says: **find parameters $\theta$ that maximize the average log-likelihood of the observed data**.

---

### 4. Equivalent form

For i.i.d. samples, maximizing the average log-likelihood is the same as maximizing the likelihood of the whole dataset:

$$
p_\theta(x^{(1)}, \ldots, x^{(m)}) = \prod_{x \in D} p_\theta(x)
$$

Because log is monotonic, maximizing the product of probabilities is the same as maximizing the sum of log-probabilities, and dividing by the constant $|D|$ does not change the maximizer.
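
There is also a practical reason to work with the sum of logs: the raw product underflows quickly. The sketch below assumes a standard normal model and 2000 synthetic samples, purely for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
model = Normal(0.0, 1.0)          # assumed model p_theta
data = model.sample((2000,))      # assumed i.i.d. dataset

probs = model.log_prob(data).exp()   # per-sample densities, each at most about 0.399
likelihood = probs.prod()            # product of 2000 small numbers
log_likelihood = model.log_prob(data).sum()

print("likelihood (product):", likelihood.item())
print("log-likelihood (sum):", log_likelihood.item())
```

The product underflows to exactly `0.0` in floating point, so it carries no gradient signal, while the log-likelihood stays a well-behaved finite number.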

---

### 5. Intuition

* Each data point “votes”: if your model assigns high probability to it, you get rewarded.
* If your model assigns tiny probability, you get punished a lot (because of the log).
* Training = adjust $\theta$ so the data points you actually see get as high probability as possible.

---

### 🔑 Connection to the KL derivation above

This ties back to the previous section:

* Minimizing KL divergence $D(p_{\text{data}} \| p_\theta)$
  is equivalent to
* Maximizing the expected log-likelihood, which MLE approximates with this empirical average over the dataset.

---


## 1) Fit a Gaussian **mean** by MLE (variance known)

In [9]:
import torch
import torch.optim as optim
from torch.distributions import Normal

torch.manual_seed(0)

# ----- Data (unknown to the learner) -----
true_mu, true_sigma = 2.0, 1.0
p_data = Normal(true_mu, true_sigma)
D = p_data.sample((2000,))  # dataset of i.i.d. samples

# ----- Model -----
# Model is Normal(mu, sigma=true_sigma). We learn mu by maximizing avg log-likelihood.
mu = torch.tensor([0.0], requires_grad=True)  # wrong init on purpose
sigma = torch.tensor(true_sigma)              # treated as known

optimizer = optim.SGD([mu], lr=0.1)

def nll(x):
    # Negative log-likelihood = - average log p_theta(x)
    # For Normal, log p = -0.5*((x-mu)^2/sigma^2 + log(2π σ^2))
    return 0.5 * (((x - mu)**2) / (sigma**2) + torch.log(2*torch.pi*(sigma**2))).mean()

for step in range(25):
    loss = nll(D)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 5 == 0:
        print(f"step {step:02d}  mu={mu.item():.3f}  NLL={loss.item():.3f}")

print("Estimated mu (MLE):", mu.item(), " | True mu:", true_mu)

step 00  mu=0.202  NLL=3.471
step 05  mu=0.944  NLL=2.149
step 10  mu=1.383  NLL=1.688
step 15  mu=1.642  NLL=1.527
step 20  mu=1.795  NLL=1.471
Estimated mu (MLE): 1.8704493045806885  | True mu: 2.0


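**What you’ll see:** `mu` moves toward ~2.0 and NLL goes down. With `lr=0.1`, 25 steps are not quite enough to converge, which is why the final estimate is 1.87 rather than ≈2.0.
This is maximizing $\frac{1}{|D|}\sum \log p_\theta(x)$ (minimizing its negative).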
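
For a Gaussian with known variance, the maximizer of the average log-likelihood has a closed form: the sample mean. The sketch below regenerates a dataset with the same settings as above and runs gradient descent for more steps (200 is an arbitrary choice) to show that it converges to that value.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
D = Normal(2.0, 1.0).sample((2000,))   # same setup as above: true mean 2, known sigma 1

mu = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([mu], lr=0.1)

for step in range(200):                # assumed: more steps than the 25 used above
    # NLL up to an additive constant that does not depend on mu (sigma = 1).
    loss = 0.5 * ((D - mu) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("gradient-descent mu:", mu.item())
print("sample mean:        ", D.mean().item())
```

The learned `mu` matches the sample mean, not the true mean of 2.0. MLE fits the data you have, and the two only coincide as the dataset grows.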

## 2) Fit **mean and variance** by MLE
Here we learn both $\mu$ and $\sigma$. We parameterize $\log \sigma$ for stability and positivity.

In [11]:
import torch
import torch.optim as optim

torch.manual_seed(0)

# ----- Data -----
true_mu, true_sigma = 1.5, 0.7
x = torch.normal(true_mu, true_sigma, size=(3000,))  # dataset

# ----- Parameters to learn -----
mu = torch.tensor([0.0], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)  # sigma = exp(log_sigma)

optimizer = optim.Adam([mu, log_sigma], lr=0.05)

def nll_gaussian(x):
    # NLL for Normal(mu, sigma): 0.5*((x-mu)^2/sigma^2 + 2 log sigma + log(2π))
    sigma = torch.exp(log_sigma)
    return 0.5 * (((x - mu)**2) / (sigma**2) + 2*log_sigma + torch.log(torch.tensor(2*torch.pi))).mean()

for step in range(200):
    loss = nll_gaussian(x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 40 == 0:
        print(f"step {step:03d}  mu={mu.item():.3f}  sigma={torch.exp(log_sigma).item():.3f}  NLL={loss.item():.3f}")

print("Estimated mu:", mu.item(), " | True mu:", true_mu)
print("Estimated sigma:", torch.exp(log_sigma).item(), " | True sigma:", true_sigma)


step 000  mu=0.050  sigma=1.051  NLL=2.292
step 040  mu=1.454  sigma=0.735  NLL=1.075
step 080  mu=1.490  sigma=0.713  NLL=1.065
step 120  mu=1.501  sigma=0.703  NLL=1.065
step 160  mu=1.501  sigma=0.702  NLL=1.065
Estimated mu: 1.5009100437164307  | True mu: 1.5
Estimated sigma: 0.7017973065376282  | True sigma: 0.7


**Why this is MLE:**
Both snippets directly minimize the **negative** of the empirical average log-likelihood (i.e., NLL). Because `log` is monotonic, this is equivalent to maximizing the product of probabilities for all samples in $D$.

# **Main idea of Monte Carlo estimation**

---

### 1. Expectation = Average under the true distribution

You often want to compute:

$$
\mathbb{E}_{x \sim P}[g(x)] = \sum_x g(x) P(x) \quad \text{(discrete case)}
$$

or an integral in the continuous case.

But usually $P(x)$ is complicated, so computing this sum/integral exactly is hard.

---

### 2. Monte Carlo trick

Instead of calculating the expectation exactly:

* Draw random samples $x^1, x^2, \dots, x^T$ from the distribution $P$.
* Approximate the expectation by the **sample average**:

$$
\hat{g}(x^1,\dots,x^T) = \frac{1}{T} \sum_{t=1}^T g(x^t)
$$

---

### 3. Why does this work?

* By the **Law of Large Numbers**, as $T \to \infty$, the sample average converges to the true expectation.
* For finite $T$, it’s only an estimate — and it’s itself a **random variable**, because it depends on the random samples you happened to draw.

---

### 4. Intuition

Imagine you want to know the **average height** in a big population:

* True expectation = exact average over everyone (hard to compute).
* Monte Carlo = take a random sample of people, measure their heights, average them.

That sample mean is your estimate.

---

### 5. Why is $\hat{g}$ a random variable?

Because each time you sample, you could get a different set of $x^t$. So your estimate can change from run to run.
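
The run-to-run variation can be measured by repeating the estimate many times. The sketch below uses $g(x)=x^2$ under a standard normal. The 200 repetitions, the two sample sizes, and the helper name `mc_estimate` are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
P = torch.distributions.Normal(0.0, 1.0)

def mc_estimate(T):
    # One Monte Carlo estimate of E[x^2] from T samples.
    return (P.sample((T,)) ** 2).mean()

spread = {}
for T in (100, 10_000):
    estimates = torch.stack([mc_estimate(T) for _ in range(200)])   # 200 independent reruns
    spread[T] = estimates.std().item()
    print(f"T={T:>6d}  mean of estimates={estimates.mean().item():.4f}  "
          f"std of estimates={spread[T]:.4f}")
```

Since $\mathrm{Var}(x^2)=2$ for a standard normal, the standard deviation of the estimator is $\sqrt{2/T}$. The measured spread should therefore be near 0.14 for $T=100$ and about ten times smaller for $T=10{,}000$.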

---

### 🔥 Tiny PyTorch example

Suppose $P = \mathcal{N}(0,1)$, and we want $\mathbb{E}[x^2]$.
We know analytically it should be $1$ (variance of standard normal).


In [12]:
import torch

# True distribution
P = torch.distributions.Normal(0., 1.)

# Monte Carlo estimate of E[x^2]
T = 1000
samples = P.sample((T,))
estimate = (samples**2).mean()

print("Monte Carlo estimate:", estimate.item())

Monte Carlo estimate: 0.9941948056221008


Run it a few times and you’ll see estimates close to 1, but not exactly 1 (because of randomness).

---

✅ So the **main idea**:
Monte Carlo replaces a difficult expectation with a simple average over random samples.

# How the **Maximum Likelihood Estimation (MLE) principle extends to autoregressive models**

---

### 1. Autoregressive factorization

If you have $n$ variables in a vector $x = (x_1, x_2, \dots, x_n)$, the joint probability can always be written as:

$$
p_\theta(x) = \prod_{i=1}^n p_\text{neural}(x_i \mid x_{<i}; \theta_i)
$$

* Each variable $x_i$ depends on the **previous variables** $x_{<i}$.
* The model learns each conditional distribution with parameters $\theta_i$.
* In practice, these conditionals are usually implemented by a **neural network** (e.g., RNN, Transformer, PixelCNN).
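
The factorization itself is just the chain rule of probability, which can be verified on a small table. The sketch below assumes a random joint distribution over three binary variables; nothing here is a neural model.

```python
import torch

torch.manual_seed(0)
# Assumed toy joint over three binary variables, stored as a 2x2x2 table.
joint = torch.rand(2, 2, 2, dtype=torch.float64)
joint = joint / joint.sum()

p1 = joint.sum(dim=(1, 2))                 # p(x1)
p12 = joint.sum(dim=2)                     # p(x1, x2)
p2_given_1 = p12 / p1[:, None]             # p(x2 | x1)
p3_given_12 = joint / p12[:, :, None]      # p(x3 | x1, x2)

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2)
reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
print("max abs error:", (reconstructed - joint).abs().max().item())
```

An autoregressive model replaces each of these exact conditional tables with a learned, parameterized conditional.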

---

### 2. Parameters

$$
\theta = \{\theta_1, \dots, \theta_n\}
$$

These are all the weights of the neural network(s) used for each conditional.

---

### 3. Training data

You have a dataset:

$$
\mathcal{D} = \{x^{(1)}, \dots, x^{(m)}\}
$$

with $m$ training examples.

---

### 4. Likelihood of the data

The total likelihood is the product of the probabilities of each training example:

$$
\mathcal{L}(\theta, \mathcal{D}) = \prod_{j=1}^m p_\theta(x^{(j)})
= \prod_{j=1}^m \prod_{i=1}^n p_\text{neural}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)
$$

---

### 5. Maximum Likelihood Estimation (MLE)

We want to find parameters $\theta$ that maximize this likelihood:

$$
\arg \max_\theta \mathcal{L}(\theta, \mathcal{D})
= \arg \max_\theta \log \mathcal{L}(\theta, \mathcal{D})
$$

Taking the log turns products into sums (much easier for optimization):

$$
\log \mathcal{L}(\theta, \mathcal{D})
= \sum_{j=1}^m \sum_{i=1}^n \log p_\text{neural}(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i)
$$

---

### 6. Why this matters

* In **autoregressive models** (like GPT, PixelCNN, WaveNet), we can’t solve for $\theta$ analytically.
* Instead, we optimize this log-likelihood with gradient descent.
* This is why training these models looks like “minimize cross-entropy loss over each next-step prediction.”
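
The link between the double sum and cross-entropy can be checked numerically. The sketch below uses random logits and targets as stand-ins for a model and a dataset (the shapes are arbitrary) and compares the summed log-probabilities with `F.cross_entropy` using `reduction="sum"`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, n, V = 4, 6, 10                       # assumed: 4 sequences, 6 predicted positions, 10 tokens
logits = torch.randn(m, n, V)            # stand-in for model outputs, one row per position
targets = torch.randint(0, V, (m, n))    # stand-in for the observed next tokens

# Log-likelihood: pick log p(x_i | x_<i) of each observed token, sum over i and j.
log_probs = F.log_softmax(logits, dim=-1)
log_likelihood = log_probs.gather(-1, targets.unsqueeze(-1)).sum()

# Cross-entropy summed over all positions is the negative of the same quantity.
ce_sum = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="sum")

print("log-likelihood:", log_likelihood.item(), " -CE(sum):", -ce_sum.item())
```

With the default `reduction="mean"`, the same loss is divided by the number of predicted tokens, which rescales the objective without changing its maximizer.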

---

✅ **In plain words:**
Training an autoregressive model by MLE means:
“Adjust the parameters so that, for every training example, the model assigns high probability to each token/variable given the previous ones.”

That’s exactly what language models do: predict the next word (maximize conditional likelihood), across all words in all sentences.


Here’s a **tiny, self-contained PyTorch demo** of **MLE for an autoregressive model**.
We make a toy dataset of digit sequences where each next token is $(x_{t-1}+1)\bmod 10$ (with a little noise).
The first token is only used as context, so training maximizes

$$
\sum_{j=1}^m\sum_{i=2}^{n}\log p_\theta(x_i^{(j)}\mid x_{<i}^{(j)}),
$$

which in code is just **cross-entropy over next-token predictions**.

In [13]:
# Autoregressive MLE demo (RNN language model over digits 0..9)
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(0)

# -----------------------
# 1) Toy dataset
# -----------------------
V = 10           # vocab: digits 0..9
SEQ_LEN = 12     # total length per sequence (including first token)
N_TRAIN = 5000
N_VAL   = 1000
NOISE_P = 0.1    # with this prob, next token is random (adds entropy)

def gen_seq():
    x0 = torch.randint(0, V, (1,))
    seq = [x0.item()]
    for t in range(SEQ_LEN-1):
        if torch.rand(()) < NOISE_P:
            nxt = torch.randint(0, V, (1,)).item()
        else:
            nxt = (seq[-1] + 1) % V
        seq.append(nxt)
    return torch.tensor(seq, dtype=torch.long)

class ToyAR(Dataset):
    def __init__(self, n):
        self.data = [gen_seq() for _ in range(n)]
    def __len__(self): return len(self.data)
    def __getitem__(self, idx):
        x = self.data[idx]
        # Teacher forcing: predict x[1:] from x[:-1]
        inp  = x[:-1]          # length L-1
        targ = x[1:]           # next-token targets
        return inp, targ

train_ds, val_ds = ToyAR(N_TRAIN), ToyAR(N_VAL)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)
val_dl   = DataLoader(val_ds,   batch_size=128)

# -----------------------
# 2) Autoregressive model
# -----------------------
class TinyAR(nn.Module):
    def __init__(self, vocab_size=V, d_model=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
    def forward(self, x):
        # x: (B, L) -> logits: (B, L, V)
        h = self.emb(x)
        h, _ = self.rnn(h)
        return self.head(h)

model = TinyAR()
opt = torch.optim.Adam(model.parameters(), lr=3e-3)

# -----------------------
# 3) Training (MLE)
#    Maximize sum log p(x_t | x_<t)
#    => Minimize cross-entropy over all time steps
# -----------------------
def run_epoch(dl, train=True):
    model.train(train)
    total_nll = 0.0
    total_tokens = 0
    with torch.set_grad_enabled(train):
        for inp, targ in dl:
            logits = model(inp)                 # (B,L,V)
            loss = F.cross_entropy(
                logits.reshape(-1, V),          # all time steps as batch
                targ.reshape(-1)                # targets flattened
            )
            if train:
                opt.zero_grad(); loss.backward(); opt.step()
            # Track negative log-likelihood per token
            total_nll += loss.item() * targ.numel()
            total_tokens += targ.numel()
    nll_tok = total_nll / total_tokens          # average NLL per token
    ppl = torch.exp(torch.tensor(nll_tok)).item()
    return nll_tok, ppl

for epoch in range(1, 11):
    tr_nll, tr_ppl = run_epoch(train_dl, train=True)
    va_nll, va_ppl = run_epoch(val_dl,   train=False)
    print(f"epoch {epoch:02d} | train NLL/tok {tr_nll:.3f} (ppl {tr_ppl:.2f})"
          f" | val NLL/tok {va_nll:.3f} (ppl {va_ppl:.2f})")


epoch 01 | train NLL/tok 0.703 (ppl 2.02) | val NLL/tok 0.497 (ppl 1.64)
epoch 02 | train NLL/tok 0.499 (ppl 1.65) | val NLL/tok 0.495 (ppl 1.64)
epoch 03 | train NLL/tok 0.498 (ppl 1.65) | val NLL/tok 0.496 (ppl 1.64)
epoch 04 | train NLL/tok 0.497 (ppl 1.64) | val NLL/tok 0.495 (ppl 1.64)
epoch 05 | train NLL/tok 0.497 (ppl 1.64) | val NLL/tok 0.494 (ppl 1.64)
epoch 06 | train NLL/tok 0.496 (ppl 1.64) | val NLL/tok 0.497 (ppl 1.64)
epoch 07 | train NLL/tok 0.496 (ppl 1.64) | val NLL/tok 0.495 (ppl 1.64)
epoch 08 | train NLL/tok 0.495 (ppl 1.64) | val NLL/tok 0.495 (ppl 1.64)
epoch 09 | train NLL/tok 0.495 (ppl 1.64) | val NLL/tok 0.496 (ppl 1.64)
epoch 10 | train NLL/tok 0.495 (ppl 1.64) | val NLL/tok 0.496 (ppl 1.64)
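
The plateau near 0.50 nats in the log above is not a training failure. It is roughly the entropy of the noisy data-generating rule, below which no model can go on average. A back-of-the-envelope check, assuming the generator works exactly as in `gen_seq` (noise draws are uniform over all `V` digits, including the "correct" one):

```python
import math

V = 10          # vocabulary size used in the demo above
NOISE_P = 0.1   # noise probability used in the demo above

# The next token equals (prev + 1) % V either by the rule or because the
# uniform noise draw happens to pick it.
p_rule = (1 - NOISE_P) + NOISE_P / V    # probability of the "expected" next digit
p_other = NOISE_P / V                   # probability of each of the other V - 1 digits

entropy = -(p_rule * math.log(p_rule) + (V - 1) * p_other * math.log(p_other))
print(f"conditional entropy: {entropy:.3f} nats, perplexity floor: {math.exp(entropy):.2f}")
```

This gives about 0.50 nats per token (perplexity about 1.65). Measured values can land slightly above or below it because the training and validation sets are finite samples.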


### What to look for

* The printed **NLL per token** should go down, and **perplexity** (exp of NLL) should drop as the model learns the conditional next-digit distribution.
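* The loss you minimize is the per-token average over the $m$ sequences in a batch (`F.cross_entropy` averages over all $m(n-1)$ predicted tokens by default):

  $$
  -\frac{1}{m(n-1)}\sum_{j=1}^m\sum_{i=2}^{n}\log p_\theta(x_i^{(j)}\mid x_{<i}^{(j)}),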
  $$