# The Data Generation Problem
Data comes in many modalities, such as text, images, audio, and video. We denote a generic data sample by the mathematical object $x$.

### Data Space
A data sample $x$ belongs to a known data space, denoted as $\mathcal{X}$ (i.e., $x \in \mathcal{X}$). Essentially, the data space is a high-dimensional space that contains every possible combination of values of the raw data representation. However, not all points within this high-dimensional space represent valid or meaningful data.

**Example:**
Consider a dataset of $5 \times 5$ black & white images of digits. If each pixel can take a binary value $\{0, 1\}$, the data space is $\mathcal{X} = \{0, 1\}^{5 \times 5}$. This space contains $2^{25}$ possible combinations. However, the vast majority of these points look like random noise; only a tiny subset forms coherent, recognizable digits, and only those are considered valid data points.
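
To make the size of this space concrete, here is a minimal sketch that counts the points in $\{0, 1\}^{5 \times 5}$ and draws one of them uniformly at random. The seed and the `#`/`.` rendering are arbitrary choices for illustration.

```python
import random

H, W = 5, 5                       # image height and width from the example
space_size = 2 ** (H * W)         # each of the 25 pixels is either 0 or 1
print(space_size)                 # 33554432

# Draw one point uniformly from the data space.
# A uniform draw is almost always noise rather than a recognizable digit.
rng = random.Random(0)            # fixed seed, arbitrary choice
image = [[rng.randint(0, 1) for _ in range(W)] for _ in range(H)]
for row in image:
    print("".join("#" if pixel else "." for pixel in row))
```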

### Data Distribution
The data distribution, denoted as $p(x)$, describes the probability of observing a specific sample $x$ from the space $\mathcal{X}$. This distribution is the key to differentiating meaningful data from noise in the data space:
* Valid/Meaningful samples are assigned high probability.
* Invalid samples are assigned a probability near or equal to zero.

In essence, $p(x)$ mathematically represents the hidden patterns and structure governing the data.

### Dataset
A dataset is a finite collection of samples $\{x_1, x_2, \dots, x_n\}$ that have been drawn (sampled) from the underlying data distribution $p(x)$.
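
As a rough illustration, the sketch below draws a dataset from a made-up distribution over a four-point data space. The point names and probabilities are hypothetical; the idea is only that valid points dominate the dataset while zero-probability points never appear.

```python
import random
from collections import Counter

# Hypothetical toy distribution p(x) over a 4-point data space.
p = {"valid_a": 0.55, "valid_b": 0.44, "noise_a": 0.01, "noise_b": 0.0}

rng = random.Random(0)            # fixed seed, arbitrary choice
n = 1000
dataset = rng.choices(list(p.keys()), weights=list(p.values()), k=n)

# Empirical frequencies approximate p(x) when n is large.
counts = Counter(dataset)
empirical = {x: counts[x] / n for x in p}
print(empirical)
```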

Data spaces are typically high-dimensional, and it is infeasible to capture or enumerate every point because the volume of the space is too vast. However, valid data usually occupies a very small, concentrated region within this space (often referred to as a "manifold"). By modeling the data distribution $p(x)$, we can focus purely on these high-probability regions and ignore the massive regions of the data space that contain invalid noise.

### Goal of Generation
In generative modeling, the goal is to learn a target data distribution $p(x)$, either explicitly (learning the density function) or implicitly (learning a mechanism to generate samples).

We aim to construct a model that, after observing a limited training dataset, learns to sample new, unique data points that look as if they were drawn from the original distribution $p(x)$, although the original distribution is unknown.

# Discriminative and Generative Models
### Discriminative Models
For supervised learning tasks with data $(x, y)$, a discriminative model learns to directly estimate the conditional probability $P(y|x)$. This approach learns a decision boundary between classes, which is effectively a direct mapping from an input to an output. Because they focus solely on the boundary, discriminative models do not explicitly model the underlying distribution of the input data: they learn how to separate classes to minimize the loss, but not what the data itself looks like.

### Generative Models
A generative model aims to learn the complete data distribution. For a dataset without labels, it learns the distribution $P(x)$, and for a labelled dataset, it learns the joint distribution $P(x, y)$. This is often understood through the factorization $P(x|y)P(y)$, meaning the model learns "what the data looks like" (the likelihood $P(x|y)$) given a specific label.

This understanding allows generative models to perform both types of tasks:
* Generative Task: For a desired label $y$, the model can generate new synthetic features $x$ by sampling from $P(x|y)$.
* Discriminative Task: To classify a new input $x$, the model uses Bayes' Rule to calculate the posterior probability $P(y|x) \propto P(x|y)P(y)$, allowing it to determine the most likely label.
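
The two tasks can be sketched with a tiny tabular model. The classes, feature values, and probabilities below are made up for illustration; `generate` and `classify` are hypothetical helper names.

```python
import random

prior = {"cat": 0.6, "dog": 0.4}              # P(y), hypothetical numbers
likelihood = {                                 # P(x | y), hypothetical numbers
    "cat": {"small": 0.9, "large": 0.1},
    "dog": {"small": 0.3, "large": 0.7},
}

def generate(y, rng):
    """Generative task: sample a feature x from P(x | y)."""
    xs = list(likelihood[y])
    return rng.choices(xs, weights=[likelihood[y][x] for x in xs], k=1)[0]

def classify(x):
    """Discriminative task: pick the y that maximizes P(x | y) P(y)."""
    return max(prior, key=lambda y: likelihood[y][x] * prior[y])

rng = random.Random(0)
print([generate("dog", rng) for _ in range(5)])
print(classify("small"), classify("large"))    # cat dog
```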

<img src="https://cdn.prod.website-files.com/65d8ee5f025f02594c614c17/66d088d78f87048868f94fee_66d085ede64cc943387d4695_1.webp" width=500>

# Bayes' Theorem in Machine Learning
Given labelled data $(x, y)$:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} = \frac{P(x|y)P(y)}{\sum_{k} P(x|y_k)P(y_k)}$$

* **$P(x|y)$ (Likelihood):** The probability of observing feature $x$ given that it belongs to class $y$. This is what the generative model explicitly learns (the distribution of the data for each class). This model can be learned either statistically or through a neural network.
* **$P(y)$ (Prior):** The probability of a class $y$ occurring before seeing any specific data. This represents our prior belief or assumption. It is often easy to compute by calculating the frequency of each class in the dataset.
* **$P(x)$ (Evidence / Marginal):** The total probability of observing feature $x$ regardless of the class. 
    * Calculated as $\sum_{k} P(x|y_k)P(y_k)$ (summing the weighted likelihoods over all possible classes). 
    * In classification, this acts as a normalization constant to ensure the posterior probabilities sum to 1.
    * *Note:* $P(x)$ is easy to compute in simple classification tasks with few classes, but it becomes computationally intractable in complex generative models with high-dimensional or continuous latent variables. In classification, $P(x)$ does not depend on $y$, so we can simply ignore it and find the prediction by
    $$ \text{Prediction} = \arg\max_y \big( P(x|y)P(y) \big)$$
    In generative tasks, we have to approximate it.
* **$P(y|x)$ (Posterior):** The probability that the data belongs to class $y$ given the observed feature $x$. This is the final prediction we want to make.
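
A small numeric sketch of the four terms, using made-up values for three classes and one fixed observation $x$. It also checks that dividing by the evidence does not change which class wins.

```python
prior = [0.5, 0.3, 0.2]            # P(y_k), hypothetical values
likelihood = [0.02, 0.10, 0.05]    # P(x | y_k) for one fixed observation x

joint = [l * p for l, p in zip(likelihood, prior)]   # P(x | y_k) P(y_k)
evidence = sum(joint)                                # P(x)
posterior = [j / evidence for j in joint]            # P(y_k | x)

# The evidence is the same for every class, so the argmax is unchanged.
pred_unnormalized = max(range(3), key=lambda k: joint[k])
pred_posterior = max(range(3), key=lambda k: posterior[k])
print(posterior, pred_unnormalized, pred_posterior)
```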

In the discriminative approach, the model learns $P(y|x)$ directly (the mapping from input to class) without modeling the underlying data distribution. In the generative approach, the model learns the likelihood $P(x|y)$ (the data distribution per class) and the prior $P(y)$, then combines them using Bayes' rule to compute the posterior $P(y|x)$ for classification.

# Naive Bayes
In the standard application of Bayes' rule, the likelihood term $P(x_1, x_2, \dots, x_n | y)$ represents the joint probability of all features occurring together given the class. Computing this is difficult because features often have complex correlations (dependencies), requiring a massive amount of data to model accurately.

Therefore, Naive Bayes simplifies the model by introducing the assumption of conditional independence. This assumes that, *given the class label*, knowing the value of one feature provides no information about the value of any other feature.

This assumption allows us to decompose the complex joint likelihood into a simple product of individual probabilities:

$$P(x|y) = P(x_1, x_2, \dots, x_n | y) \approx \prod_{i=1}^{n}P(x_i | y)$$

Consequently, the posterior probability for a class $y$ is proportional to:

$$P(y|x) \propto P(y) \cdot \prod_{i=1}^{n}P(x_i | y)$$

Note: This is a strong, naive assumption because real-world features are rarely independent. However, despite this theoretical flaw, Naive Bayes is computationally efficient and often performs surprisingly well as a baseline classifier. While its probability estimates might be inaccurate due to the assumption, its classification decisions are often correct.

### Example
**Scenario:**
We have a binary classification problem where the class $y$ can be either **Spam (1)** or **Not Spam (0)**. Our data has two features, $x_1$ and $x_2$.

We want to classify a new sample where $x_1=a$ and $x_2=b$. To do this, we calculate the posterior probability for both classes and pick the highest one.

**Step 1: Set up the equation for Class 1 (Spam)**
Using the Naive Bayes assumption, we expand the formula:
$$P(y=1 | x_1=a, x_2=b) \propto P(y=1) \cdot P(x_1=a | y=1) \cdot P(x_2=b | y=1)$$

**Step 2: "Learning" by Counting**
We estimate these values directly from our training data frequencies:
* $P(y=1)$: How frequent is Spam overall? (Prior)
* $P(x_1=a | y=1)$: In Spam messages, how often does feature $x_1$ equal $a$? (Likelihood 1)
* $P(x_2=b | y=1)$: In Spam messages, how often does feature $x_2$ equal $b$? (Likelihood 2)

**Step 3: Comparison**
We repeat the calculation for **Class 0 (Not Spam)**:
$$P(y=0 | x_1=a, x_2=b) \propto P(y=0) \cdot P(x_1=a | y=0) \cdot P(x_2=b | y=0)$$

**Step 4: Decision**
We compare the two resulting scores.
* If Score(Spam) > Score(Not Spam), we classify the sample as Spam; otherwise, we classify it as Not Spam.


The key distinction lies in how we treat the features $x_1$ and $x_2$. Without the Naive assumption, we must treat the features as a coupled pair $(x_1, x_2)$. To calculate the likelihood $P(x_1=a, x_2=b | y)$, we would need to count how often the exact combination **"a AND b"** appears together in the training data. However, if the specific pair $(a, b)$ never appeared in the training set (sparsity), the probability becomes 0, and the model fails to make a prediction.

Naive Bayes assumes $x_1$ and $x_2$ are independent given the class. We treat them separately:
1.  Calculate $P(x_1=a | y)$ by looking only at the $x_1$ column.
2.  Calculate $P(x_2=b | y)$ by looking only at the $x_2$ column.
3.  Multiply the results.

This allows the model to assign a probability to a combination $(a, b)$ even if those two feature values never appeared together in the training data.
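
The counting procedure can be sketched as follows. The eight training rows are made up so that the pair $(a, b)$ never occurs together; the function names are hypothetical.

```python
# Each row is (x1, x2, y) with y = 1 for Spam and y = 0 for Not Spam.
train = [
    ("a", "c", 1), ("a", "c", 1), ("d", "b", 1), ("a", "c", 1),
    ("d", "c", 0), ("d", "b", 0), ("d", "c", 0), ("a", "c", 0),
]

def prior(y):
    """P(y): fraction of training rows with label y."""
    return sum(1 for row in train if row[2] == y) / len(train)

def feature_likelihood(index, value, y):
    """P(x_index = value | y), counted from a single feature column."""
    rows = [row for row in train if row[2] == y]
    return sum(1 for row in rows if row[index] == value) / len(rows)

def joint_likelihood(x1, x2, y):
    """P(x1, x2 | y) without the naive assumption: count the exact pair."""
    rows = [row for row in train if row[2] == y]
    return sum(1 for row in rows if row[0] == x1 and row[1] == x2) / len(rows)

def naive_score(x1, x2, y):
    """Unnormalized posterior P(y) * P(x1 | y) * P(x2 | y)."""
    return prior(y) * feature_likelihood(0, x1, y) * feature_likelihood(1, x2, y)

scores = {y: naive_score("a", "b", y) for y in (0, 1)}
prediction = max(scores, key=scores.get)
print(joint_likelihood("a", "b", 1), scores, prediction)
```

With this data, the joint count gives zero for both classes, while the naive scores are non-zero and favor Spam.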

# Maximum Likelihood Estimation (MLE)

In the context of generative modeling, our goal is to approximate an unknown underlying data distribution $P_{data}(\mathbf{x})$ using a parametric model $P_{\theta}(\mathbf{x})$, where $\theta$ represents the model parameters.

### Definition of Likelihood
Before defining the estimator, we must define the **Likelihood Function**. Given a set of observed data, the likelihood function $L(\theta; \mathcal{D})$ measures how well a model (a specific parameter set $\theta$) explains the observed data.

> **Definition:** Let $\mathcal{D} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)}\}$ be a dataset. The likelihood function $L(\theta; \mathcal{D})$ is defined as the joint probability (or joint density) of the observed data viewed as a function of the parameters $\theta$, that is, the probability of observing all data samples together under the given model:
> $$L(\theta; \mathcal{D}) = P(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(n)} | \theta)$$

To make the computation of the joint probability tractable, we typically assume the data points are **independent and identically distributed (i.i.d.)**. Under this assumption, we treat each sample individually, so the joint probability factorizes into the product of marginal probabilities:
$$L(\theta; \mathcal{D}) = \prod_{i=1}^{n} P_{\theta}(\mathbf{x}^{(i)})$$

## Log-Likelihood
In practice, we work with the **Log-Likelihood** $\ell(\theta)$ because the product of many small probabilities leads to numerical underflow. Since the logarithm is a monotonically increasing function, maximizing the likelihood is equivalent to maximizing the log-likelihood:
$$\ell(\theta) = \log \left( \prod_{i=1}^{n} P_{\theta}(\mathbf{x}^{(i)}) \right) = \sum_{i=1}^{n} \log P_{\theta}(\mathbf{x}^{(i)})$$
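
The underflow problem is easy to reproduce. In this sketch, the 2000 per-sample probabilities of 0.01 are made-up values; their product is $10^{-4000}$, far below what a 64-bit float can represent.

```python
import math

probs = [0.01] * 2000              # hypothetical per-sample probabilities

likelihood = 1.0
for p in probs:
    likelihood *= p                # underflows to exactly 0.0 along the way

log_likelihood = sum(math.log(p) for p in probs)   # stays finite

print(likelihood)                  # 0.0
print(log_likelihood)
```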

## MLE Intuition
The data space for high-dimensional inputs (e.g., $1024 \times 1024$ RGB images, with roughly 3.1 million dimensions) is unimaginably vast. However, meaningful data (like images of cats) does not exist uniformly across this space; instead, it concentrates on a low-dimensional subset called a **manifold**.

The intuition behind MLE is built on two observations:
1.  **Sparsity of Observations:** Out of all possible configurations in the high-dimensional space, our collected samples $\mathcal{D}$ represent the "realized" events. Because these specific points were observed, they are likely to lie in regions of high density under the true probability distribution $P_{data}$.
2.  **The Optimization Goal:** We want our model $P_{\theta}$ to mimic this true distribution. We achieve this by adjusting the parameters $\theta$ so that the model "shifts" its probability mass away from the "empty" noise regions and concentrates it on the specific coordinates of our observed samples.

For example, in a dataset of cat images, the model should learn that a specific arrangement of pixels forming a "cat" is highly probable, while a random arrangement of "static" noise, despite being in the same high-dimensional pixel space, has a probability near zero.

Therefore, the training objective is to find the model that assigns the highest possible probability to the observed data (the dataset). In other words, we maximize the likelihood, or equivalently the log-likelihood, of the observed data.

$$\theta_{MLE} = \arg \max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(\mathbf{x}^{(i)})$$
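
As a minimal worked case, assume the model is a Bernoulli distribution with a single parameter $\theta = P_\theta(x = 1)$. The coin-flip data is made up, and the grid search is only for illustration; for this model the maximizer is known to be the sample mean.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]    # hypothetical observations, 7 ones out of 10

def log_likelihood(theta, xs):
    """Sum of log P_theta(x) for a Bernoulli model with P_theta(x = 1) = theta."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in xs)

grid = [i / 1000 for i in range(1, 1000)]           # candidate theta values in (0, 1)
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
sample_mean = sum(data) / len(data)

print(theta_mle, sample_mean)             # 0.7 0.7
```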


# Kullback-Leibler (KL) Divergence
The KL Divergence is a measure of how one probability distribution $Q$ diverges from a second, reference probability distribution $P$. In machine learning, we often use it to measure how far our model $P_{\theta}$ is from the true data distribution $P_{data}$.

> **Definition:** For two distributions $P$ and $Q$ defined on the same probability space, the KL Divergence is defined as:
> $$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx$$
> Or in expectation form:
> $$D_{KL}(P \parallel Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right] = \mathbb{E}_{x \sim P} [\log P(x) - \log Q(x)]$$

## Properties
1.  **Dissimilarity:** The more the two distributions differ, the larger the divergence tends to be.
2.  **Identity:** $D_{KL}(P \parallel P) = 0$. A distribution has zero divergence from itself.
3.  **Non-negativity:** $D_{KL}(P \parallel Q) \geq 0$ for any $P, Q$ (via Jensen's Inequality), with equality only when $P = Q$.
4.  **Asymmetry:** In general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$, so the KL divergence is not a true distance metric.
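
These properties can be checked numerically for discrete distributions, where the integral in the definition becomes a sum. The two distributions below are made up, and the helper assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.4, 0.1]     # hypothetical distributions over three outcomes
Q = [0.3, 0.3, 0.4]

print(kl_divergence(P, P))   # 0.0 (identity)
print(kl_divergence(P, Q))   # positive
print(kl_divergence(Q, P))   # positive, but a different value (asymmetry)
```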

## MLE as Divergence Minimization
In generative models, the objective is to minimize the divergence between the true data distribution $P_{data}$ and our model $P_{\theta}$ so that the model provides a good approximation of the true distribution. We show that minimizing the KL divergence is equivalent to maximizing the likelihood.

## The Mathematical Equivalence
We can prove that minimizing the KL divergence between the data and the model is mathematically equivalent to maximizing the log-likelihood of the dataset.

Given the definition of KL Divergence:
$$D_{KL}(P_{data} \parallel P_{\theta}) = \mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{data}(\mathbf{x}) - \log P_{\theta}(\mathbf{x})]$$

Using the linearity of expectation, we can split this into two distinct terms:
$$D_{KL}(P_{data} \parallel P_{\theta}) = \underbrace{\mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{data}(\mathbf{x})]}_{\text{Negative entropy } -H(P_{data})} - \underbrace{\mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{\theta}(\mathbf{x})]}_{\text{Expected log-likelihood (negative cross-entropy)}}$$

When minimizing the divergence with respect to the model parameters $\theta$, we notice the following:
* **The Entropy Term:** $\mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{data}(\mathbf{x})]$ depends only on the true data. It is a constant with respect to $\theta$ and does not change during training.
* **The Empirical Estimate:** Since we don't know $P_{data}$ exactly, we estimate the second term using our samples $\mathcal{D}$ because we know the data points are sampled from the true distribution.
    $$\mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{\theta}(\mathbf{x})] \approx \frac{1}{n} \sum_{i=1}^{n} \log P_{\theta}(\mathbf{x}^{(i)})$$

Therefore, up to this sampling approximation, minimizing the divergence is equivalent to maximizing the average log-likelihood (the constant factor $1/n$ does not change the maximizer):
$$\arg \min_{\theta} D_{KL}(P_{data} \parallel P_{\theta}) \approx \arg \max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(\mathbf{x}^{(i)})$$
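
The equivalence can be checked on a toy case where $P_{data}$ is known. This sketch assumes both the data distribution and the model are Bernoulli, with a made-up true parameter of 0.3, and compares the two objectives over a grid of candidate $\theta$ values.

```python
import math

p_true = 0.3                                  # hypothetical P_data(x = 1)
grid = [i / 100 for i in range(1, 100)]       # candidate theta values in (0, 1)

def kl(theta):
    """D_KL(P_data || P_theta) for two Bernoulli distributions."""
    return (p_true * math.log(p_true / theta)
            + (1 - p_true) * math.log((1 - p_true) / (1 - theta)))

def expected_log_likelihood(theta):
    """E_{x ~ P_data}[log P_theta(x)], the only term that depends on theta."""
    return p_true * math.log(theta) + (1 - p_true) * math.log(1 - theta)

theta_min_kl = min(grid, key=kl)
theta_max_ll = max(grid, key=expected_log_likelihood)
print(theta_min_kl, theta_max_ll)             # 0.3 0.3
```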

Here, the sample average weights each of our $n$ samples equally, with weight $1/n$; in other words, it replaces $P_{data}$ with the empirical distribution of the dataset. Because a finite dataset only approximates the true distribution, the MLE objective is itself only an estimate of the true KL-divergence objective.

## The Lower Bound (Data Entropy)
Note that even in a "perfect" training scenario where the model matches the data distribution exactly, the expected negative log-likelihood (the cross-entropy) does not necessarily reach zero. It is bounded below by the **Entropy of the data**, $H(P_{data})$. This represents the inherent "noise" or uncertainty in the data that no model can eliminate.
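
One way to see this bound is to rearrange the decomposition of the KL divergence given earlier and apply the non-negativity property:

$$-\mathbb{E}_{\mathbf{x} \sim P_{data}} [\log P_{\theta}(\mathbf{x})] = H(P_{data}) + D_{KL}(P_{data} \parallel P_{\theta}) \geq H(P_{data})$$

Equality holds when $D_{KL}(P_{data} \parallel P_{\theta}) = 0$, that is, when the model matches the data distribution.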

# Explicit Distribution Learning
Explicit distribution learning involves the construction of a generative model that provides an explicit functional form for the probability density function (or mass function), denoted as $p_{\theta}(\mathbf{x})$. Unlike implicit models (e.g., GANs), which only provide a mechanism to sample data, explicit models allow for the direct evaluation of the likelihood of a given observation.

## The Fundamental Challenge: Intractability and Normalization
Modeling a high-dimensional data distribution $p_{data}(\mathbf{x})$ (where $\mathbf{x} \in \mathbb{R}^D$) is computationally intensive due to the **curse of dimensionality**. A valid probability distribution must satisfy the normalization constraint:
$$\int_{\mathbf{x}} p_{\theta}(\mathbf{x}) d\mathbf{x} = 1$$
In high-dimensional spaces, calculating the normalization constant (the partition function) required to turn a neural network's unnormalized output into a valid density is generally intractable, because it involves integrating over the entire space. Consequently, explicit modeling focuses on architectures that bypass, simplify, or approximate this normalization while maintaining a clear expression for the density.

The standard approach is to estimate the distribution with MLE (equivalently, by minimizing the KL divergence). In practice, the MLE objective is usually implemented through three different modeling approaches:

1. Autoregressive Modeling
2. Energy-based Modeling
3. Flow-based Modeling

## Autoregressive (AR) Modeling
Autoregressive (AR) models belong to the class of **explicit density generative models**. They decompose the high-dimensional joint probability distribution of a data point into a sequence of conditional distributions. This structure allows for exact likelihood computation and stable training via MLE.


The core principle of AR modeling is the application of the **Probability Chain Rule**. A $d$-dimensional data sample $\mathbf{x} = (x_1, x_2, \dots, x_d)$ is represented as a product of $d$ univariate conditional distributions.

For any specific component $x_i$ within a sample, the model predicts the probability of that component given all preceding components:
$$p(x_i | \mathbf{x}_{<i}) = p(x_i | x_1, x_2, \dots, x_{i-1})$$

The entire sample's probability is the product of these conditionals:
$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i | \mathbf{x}_{<i})$$
By converting a complex $d$-dimensional joint distribution into $d$ one-dimensional distributions, the model simplifies the learning task while maintaining an **explicit** representation of the distributions for the sample.
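
A quick sanity check of the chain rule: if every conditional is a valid distribution, the product is automatically normalized. The sketch below uses a made-up logistic conditional over binary vectors as a stand-in for a learned model, and sums $p(\mathbf{x})$ over all $2^3$ points.

```python
import itertools
import math

D = 3   # toy dimensionality

def cond_prob_one(prefix):
    """Hypothetical conditional p(x_i = 1 | x_<i): logistic function of the prefix sum."""
    return 1.0 / (1.0 + math.exp(-(0.5 * sum(prefix) - 0.2)))

def joint_prob(x):
    """Chain rule: p(x) = prod_i p(x_i | x_<i)."""
    prob = 1.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])
        prob *= p1 if xi == 1 else 1.0 - p1
    return prob

total = sum(joint_prob(x) for x in itertools.product([0, 1], repeat=D))
print(total)   # 1.0 up to floating-point error
```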

AR models assume a fixed ordering of dimensions (e.g., raster scan for images, left-to-right for text). The model assumes that $x_i$ depends only on the observed values $\mathbf{x}_{<i}$, satisfying the **causal constraint**.

### Architectural Pipeline
The transformation from input to probability follows this general flow:
1.  **Input Context:** Previous elements $\mathbf{x}_{<i}$ are fed into the model.
2.  **Neural Network Feature Extractor:** A backbone architecture (e.g., RNN, LSTM, Transformer, or Causal CNN) processes the sequence to produce a hidden representation $h_i$.
3.  **Output Head:**
    * **Discrete Data (e.g., Text):** A Softmax layer over a vocabulary.
    * **Continuous Data (e.g., Audio/Images):** A parametric distribution, such as a **Gaussian Mixture Model (GMM)** or a discretized logistic distribution, where the network predicts parameters $\mu_i$ and $\sigma_i$.
4.  **Sampling:** A value is sampled from $p(x_i | \mathbf{x}_{<i})$ and appended to the context for the next step $i+1$. The first value $x_1$ has no preceding context, so it is sampled from the learned unconditional distribution $p(x_1)$.
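
The sampling loop in step 4 can be sketched as follows. The function `cond_prob_one` is a hypothetical stand-in for the network plus output head (steps 1 to 3); here it is a fixed logistic function over binary values rather than a trained model.

```python
import math
import random

def cond_prob_one(prefix):
    """Hypothetical stand-in for the network: p(x_i = 1 | x_<i)."""
    return 1.0 / (1.0 + math.exp(-(0.5 * sum(prefix) - 0.2)))

def sample(d, rng):
    """Generate d values one at a time, feeding each sample back as context."""
    x = []                                   # empty context for the first value
    for _ in range(d):
        p1 = cond_prob_one(x)                # conditional given everything so far
        x.append(1 if rng.random() < p1 else 0)
    return x

rng = random.Random(0)                       # fixed seed, arbitrary choice
x = sample(8, rng)
print(x)
```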

### MLE Objective and Loss Function
AR models are trained by maximizing the log-likelihood of the training data $\mathcal{D}$. This is equivalent to minimizing the **Cross-Entropy Loss**:
$$\mathcal{L}(\theta) = -\sum_{\mathbf{x} \in \mathcal{D}} \sum_{i=1}^{d} \log p_{\theta}(x_i | \mathbf{x}_{<i})$$

### Teacher Forcing
During training, we use **Teacher Forcing**. Instead of feeding the model's own (potentially erroneous) predictions from step $i-1$ into step $i$, we provide the **ground truth** values from the training set. This allows for parallelization during training (especially in Transformers and Causal CNNs) because all $x_i$ conditionals can be computed simultaneously, which also leads to faster convergence.
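
A minimal sketch of the teacher-forced loss, again using a hypothetical logistic conditional in place of a real network. Each term conditions on the ground-truth prefix `x[:i]`, never on a model sample, so the terms do not depend on one another.

```python
import math

def cond_prob_one(prefix):
    """Hypothetical stand-in for the network: p(x_i = 1 | x_<i)."""
    return 1.0 / (1.0 + math.exp(-(0.5 * sum(prefix) - 0.2)))

def nll_teacher_forced(x):
    """-sum_i log p(x_i | x_<i), with every prefix taken from the ground truth."""
    terms = []
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])            # context is the true prefix
        terms.append(-math.log(p1 if xi == 1 else 1.0 - p1))
    return sum(terms)                        # terms are independent of each other

dataset = [[1, 0, 1, 1], [0, 0, 1, 0]]       # hypothetical training sequences
loss = sum(nll_teacher_forced(x) for x in dataset)
print(loss)
```

In a real implementation the loop over `i` would be replaced by a single batched forward pass; the loop is kept here only for readability.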

### Challenges
1. **Exposure Bias**: A significant drawback of Teacher Forcing is **Exposure Bias**. During training, the model only sees ground truth context. During inference (generation), it sees its own generated (noisy) samples. Errors accumulate over time, leading to a divergence between the training and testing distributions.

2. **Sampling Bottleneck**: While training of AR models can be parallelized, **inference is inherently sequential and recursive** since each value depends on all previously generated values. To generate $x_{100}$, the model must first generate and process $x_1$ through $x_{99}$. Sampling a single sequence therefore requires $O(d)$ sequential model evaluations, where $d$ is the number of dimensions/tokens. This makes AR models significantly slower at test time than parallel generative models like GANs or non-autoregressive Flows.

## Energy-Based Model (EBM)
Energy-Based Models (EBMs) represent a fundamental shift from directly modeling probabilities to modeling the energy of a data point. The energies of data points are then converted into a valid distribution.

Instead of trying to design a neural network that explicitly outputs a normalized probability distribution (which must sum to 1), we design a model that outputs a single scalar value called **energy** for a given input $x$. Borrowing from statistical physics, energy represents the state of a system. Low energy implies a stable, likely state. High energy implies an unstable, unlikely state.

The Role of the Model: We define an energy function $E_\theta$, parameterized by a neural network, which maps an input $x$ to a scalar energy value:
$$E_\theta: \mathbb{R}^D \rightarrow \mathbb{R}$$

Ideally, the observed data (Real) should be assigned **low energy**, and noise/unobserved data (Fake) should be assigned **high energy**.

### The Boltzmann Distribution
To utilize the energy function for generative tasks, we must convert the energy (scalar output) of the neural network into a valid probability distribution. We achieve this using the **Boltzmann Distribution**.

Given an energy function $E_\theta(x)$, the probability density function $p_\theta(x)$ is defined as:

$$p_\theta(x) = \frac{\exp\left(-E_\theta(x)\right)}{Z(\theta)}$$

*  **The Boltzmann Factor ($\exp(-E_\theta(x))$):** This term converts the energy into a strictly positive, unnormalized density. Because of the negative sign, lower energy results in a higher unnormalized probability score.
*  **The Partition Function ($Z(\theta)$):** This is the normalization constant required to ensure the probability distribution integrates (or sums) to 1.

The core difficulty in EBMs lies in the partition function $Z(\theta)$. It is obtained by **marginalizing** (integrating or summing) the unnormalized density over the entire data space, which can be extremely difficult for high-dimensional data spaces.

$$Z(\theta) = \int_{x \in \mathcal{X}} \exp\left(-E_\theta(x)\right) dx$$

* **Independence from Samples:** Note that $Z(\theta)$ is a function of the model parameters $\theta$, but it does **not** depend on any specific data point $x$ because it always sums over the entire data space.
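
On a tiny discrete space, $Z(\theta)$ can be computed by brute force, which makes the construction easy to inspect. The energy function below is made up (it counts disagreeing neighbouring bits) and stands in for a neural network. The enumeration has $2^D$ terms, which is exactly what becomes infeasible as $D$ grows.

```python
import itertools
import math

D = 4   # toy dimensionality; brute-force enumeration needs 2**D terms

def energy(x):
    """Hypothetical energy: one unit for each pair of neighbouring bits that differ."""
    return sum(1.0 for a, b in zip(x, x[1:]) if a != b)

space = list(itertools.product([0, 1], repeat=D))
Z = sum(math.exp(-energy(x)) for x in space)            # partition function
p = {x: math.exp(-energy(x)) / Z for x in space}        # Boltzmann distribution

print(len(space), Z)
print(p[(0, 0, 0, 0)], p[(0, 1, 0, 1)])    # low energy gets higher probability
```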

### Universality of Boltzmann Distribution
Rather than modeling the distribution directly, we go through the two-step process of first defining an energy and then converting it into a distribution because of the **Universality of the Boltzmann Distribution**.

For almost any well-defined, strictly positive probability density $Q(x)$, there exists an energy function $E(x)$ such that $Q$ can be expressed as a Boltzmann distribution. For example, choosing $E(x) = -\log Q(x)$ gives $\exp(-E(x)) = Q(x)$ with $Z = 1$.

This implies that EBMs are extremely expressive and can be applied to almost any data. By learning an appropriate energy function $E_\theta(x)$, a neural network can theoretically approximate any complex distribution of data, without being restricted by specific architectural constraints required to ensure normalization (like in Autoregressive models or Flow-based models).

> Note: Softmax, Gaussian, Bernoulli, etc., are all special cases of the Boltzmann distribution.
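
The softmax case is easy to verify on a finite set of states: treating the negative logits as energies reproduces the softmax output. The sketch also checks the construction $E(x) = -\log Q(x)$ for an arbitrary strictly positive distribution. All numbers are made up.

```python
import math

def boltzmann(energies):
    """p_i = exp(-E_i) / Z over a finite set of states."""
    weights = [math.exp(-e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

logits = [2.0, 0.5, -1.0]                        # hypothetical network outputs
from_softmax = softmax(logits)
from_energy = boltzmann([-z for z in logits])    # energy = negative logit

Q = [0.7, 0.2, 0.1]                              # any strictly positive distribution
recovered = boltzmann([-math.log(q) for q in Q]) # energy = -log Q

print(from_softmax, from_energy)
print(recovered)
```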

### Data $\to$ Energy $\to$ Explicit Distribution
The ultimate workflow of learning an Energy-Based Model can be summarized as follows:

1.  **The Objective:** We want to learn the true underlying distribution of a dataset, $p_{data}(x)$.
2.  **The Proxy:** We define a parameterized energy model $E_\theta(x)$ (usually a deep neural network).
3.  **The Transformation:** We convert this energy model into an explicit distribution using the Boltzmann equation.
4.  **The Learning Process:** We adjust $\theta$ to assign **low** energy to real samples and **high** energy to other points in the space ($x_{fake}$).

$$\text{Data } x \xrightarrow{\text{Learn } E_\theta} \text{Energy } E_\theta(x) \xrightarrow{\text{Boltzmann}} \text{Distribution } p_\theta(x)$$
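
The whole workflow can be sketched on a four-state data space, where the energy is just a lookup table and $Z$ can be summed exactly. This is an illustrative assumption, not a practical training method: the data, learning rate, and step count are made up, and exact gradients are only available here because the space is tiny. The update uses the fact that, for a tabular energy, the gradient of the average log-likelihood with respect to $E(s)$ is $p_\theta(s) - \hat{p}_{data}(s)$, where $\hat{p}_{data}$ is the empirical frequency.

```python
import math
from collections import Counter

states = [0, 1, 2, 3]                        # tiny hypothetical data space
data = [0, 0, 0, 0, 0, 1, 1, 1, 2, 3]        # hypothetical observed samples
counts = Counter(data)
p_data = [counts[s] / len(data) for s in states]

energy = [0.0 for _ in states]               # tabular E_theta: one parameter per state

def boltzmann(energies):
    weights = [math.exp(-e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

lr = 1.0                                     # arbitrary step size
for _ in range(500):
    p_model = boltzmann(energy)
    # Gradient ascent on the average log-likelihood: lower the energy of states
    # seen more often than the model predicts, raise it for the others.
    energy = [e + lr * (pm - pd) for e, pm, pd in zip(energy, p_model, p_data)]

p_model = boltzmann(energy)
print(p_data)
print(p_model)
print(energy)
```

After training, frequently observed states end up with lower energy than rarely observed ones, and the resulting Boltzmann distribution matches the empirical frequencies.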