# The Data Generation Problem
Data comes in many modalities, such as text, images, audio, and video. We denote a generic data sample by the mathematical object $x$.

### Data Space
A data sample $x$ belongs to a known data space, denoted as $\mathcal{X}$ (i.e., $x \in \mathcal{X}$). The data space is a high-dimensional space that contains all possible value combinations of the raw data representation. However, not all points within this space represent valid or meaningful data.

**Example:**
Consider a dataset of $5 \times 5$ black & white images of digits. If each pixel can take a binary value $\{0, 1\}$, the data space is $\mathcal{X} = \{0, 1\}^{5 \times 5}$. This space contains $2^{25}$ possible combinations. However, the vast majority of these points look like random noise; only a tiny subset actually forms coherent, recognizable digits, and these are what we consider valid data points.
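To make the size of this space concrete, here is a minimal sketch in Python. The helper names `random_point` and `render` are illustrative assumptions, not part of the notes; the sketch only counts the points in $\{0, 1\}^{5 \times 5}$ and draws one of them uniformly at random, which will almost certainly look like noise rather than a digit.

```python
import random

SIDE = 5  # the image is SIDE x SIDE binary pixels, as in the example above

# Size of the data space {0, 1}^(5x5): every extra pixel doubles the count.
num_points = 2 ** (SIDE * SIDE)

def random_point(rng):
    """Draw one point uniformly from the data space (hypothetical helper)."""
    return [[rng.randint(0, 1) for _ in range(SIDE)] for _ in range(SIDE)]

def render(image):
    """Render a binary image as text, '#' for 1 and '.' for 0."""
    return "\n".join("".join("#" if p else "." for p in row) for row in image)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(num_points)  # 33554432
print(render(random_point(rng)))
```

Running it a few times with different seeds gives a feel for how rarely a uniformly drawn point resembles a digit.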

### Data Distribution
The data distribution, denoted as $p(x)$, describes the probability of observing a specific sample $x$ from the space $\mathcal{X}$. This distribution is the key to differentiating meaningful data from noise in the data space:
* Valid/Meaningful samples are assigned high probability.
* Invalid samples are assigned a probability near or equal to zero.

In essence, $p(x)$ mathematically represents the hidden patterns and structure governing the data.

### Dataset
A dataset is a finite collection of samples $\{x_1, x_2, \dots, x_n\}$ that have been drawn (sampled) from the underlying data distribution $p(x)$.

Data spaces are typically high-dimensional and mathematically intractable. It is impossible to capture or enumerate every point in the space because the volume is too vast. However, valid data usually occupies a very small, concentrated region within this vast space (often referred to as a "manifold"). By modeling the data distribution $p(x)$, we can focus purely on these high-probability regions and ignore the massive regions of the data space that contain invalid noise.
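The relationship between $p(x)$ and a dataset can be sketched with a toy example. Everything below is an assumption made for illustration: a 3-bit data space, a hand-written distribution `p` that puts all its mass on two "valid" points, and a hypothetical helper `sample_dataset`. The sketch draws a finite dataset from `p` and compares the empirical frequencies with the true probabilities.

```python
import random
from collections import Counter

# Hypothetical toy distribution over the data space {0, 1}^3.
# Only two of the eight points are "valid"; the rest get probability zero.
p = {
    (0, 0, 0): 0.0, (0, 0, 1): 0.0, (0, 1, 0): 0.0, (0, 1, 1): 0.6,
    (1, 0, 0): 0.0, (1, 0, 1): 0.4, (1, 1, 0): 0.0, (1, 1, 1): 0.0,
}

def sample_dataset(p, n, rng):
    """Draw n independent samples from the discrete distribution p."""
    points = list(p)
    weights = [p[x] for x in points]
    return rng.choices(points, weights=weights, k=n)

rng = random.Random(0)
dataset = sample_dataset(p, 1000, rng)

# Empirical frequencies approximate p(x) on the points that were observed.
counts = Counter(dataset)
empirical = {x: c / len(dataset) for x, c in counts.items()}
print(empirical)
```

Only the high-probability points ever show up in the dataset, which is the sense in which modeling $p(x)$ lets us ignore most of the data space.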

### Goal of Generation
In generative modeling, the goal is to learn a target data distribution $p(x)$, either explicitly (learning the density function) or implicitly (learning a mechanism to generate samples).

We aim to construct a model that, after observing a limited training dataset, learns to sample new, unique data points that look as if they were drawn from the original distribution $p(x)$, although the original distribution is unknown.
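As a minimal sketch of this goal, consider the explicit route in one dimension. The assumptions here are loud: the "unknown" distribution is simulated as a Gaussian with illustrative parameters, and the model family is also assumed to be Gaussian, so learning reduces to estimating a mean and a standard deviation. Real generative models are far more flexible, but the learn-then-sample pattern is the same.

```python
import random
import statistics

rng = random.Random(0)

# Stand-in for the unknown p(x). In practice we never see this line,
# only the samples it produces. The parameters 5.0 and 2.0 are illustrative.
train = [rng.gauss(5.0, 2.0) for _ in range(2000)]

# "Learning": estimate the parameters of the assumed Gaussian model.
mu_hat = statistics.fmean(train)
sigma_hat = statistics.stdev(train)

# "Generation": draw new points from the learned model, not from the data.
generated = [rng.gauss(mu_hat, sigma_hat) for _ in range(5)]

print(mu_hat, sigma_hat)
print(generated)
```

The generated values are new points rather than copies of training samples, yet they follow approximately the same distribution as the training data.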

# Discriminative and Generative Models
### Discriminative Models
For supervised learning tasks with data $(x, y)$, a discriminative model learns to directly estimate the conditional probability $P(y|x)$. This amounts to learning a decision boundary between classes, which is effectively a direct mapping from an input to an output. Because they focus solely on the boundary, discriminative models do not explicitly model the underlying distribution of the input data: they learn how to separate classes so as to minimize the loss, but not what the data itself looks like.

### Generative Models
A generative model aims to learn the complete data distribution. For a dataset without labels, it learns the distribution $P(x)$, and for a labelled dataset, it learns the joint distribution $P(x, y)$. The joint is often understood through the factorization $P(x, y) = P(x|y)P(y)$, meaning the model learns "what the data looks like" given a specific label (the likelihood $P(x|y)$) together with how common each label is (the prior $P(y)$).

This understanding allows generative models to perform both types of tasks:
* Generative Task: For a desired label $y$, the model can generate new synthetic features $x$ by sampling from $P(x|y)$.
* Discriminative Task: To classify a new input $x$, the model uses Bayes' Rule to calculate the posterior probability $P(y|x) \propto P(x|y)P(y)$, allowing it to determine the most likely label.
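Both tasks can be sketched with a very small generative classifier. The setup below is an assumption for illustration only: one real-valued feature, two classes, simulated data, and a Gaussian model for each class-conditional $P(x|y)$. The names `model`, `generate`, and `classify` are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical labelled data: one real-valued feature x, binary label y.
data = (
    [(rng.gauss(0.0, 1.0), 0) for _ in range(500)]
    + [(rng.gauss(4.0, 1.0), 1) for _ in range(500)]
)

# Learn P(y) by class frequency and P(x|y) as one Gaussian per class.
model = {}
for label in (0, 1):
    xs = [x for x, y in data if y == label]
    model[label] = {
        "prior": len(xs) / len(data),
        "mu": statistics.fmean(xs),
        "sigma": statistics.stdev(xs),
    }

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def generate(label):
    """Generative task: sample a new x from the learned P(x|y)."""
    m = model[label]
    return rng.gauss(m["mu"], m["sigma"])

def classify(x):
    """Discriminative task: argmax over y of P(x|y) P(y)."""
    def score(y):
        m = model[y]
        return gaussian_pdf(x, m["mu"], m["sigma"]) * m["prior"]
    return max(model, key=score)

print(generate(1))     # a new synthetic feature for class 1
print(classify(-0.5))  # label with the highest P(x|y) P(y)
```

The same learned quantities serve both functions, which is the practical meaning of a generative model "performing both types of tasks".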

<img src="https://cdn.prod.website-files.com/65d8ee5f025f02594c614c17/66d088d78f87048868f94fee_66d085ede64cc943387d4695_1.webp" width=500>

# Bayes' Theorem in Machine Learning
Given labelled data $(x, y)$:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} = \frac{P(x|y)P(y)}{\sum_{k} P(x|y_k)P(y_k)}$$

* **$P(x|y)$ (Likelihood):** The probability of observing feature $x$ given that it belongs to class $y$. This is what the generative model explicitly learns (the distribution of the data for each class). This model can be learned either statistically or through a neural network.
* **$P(y)$ (Prior):** The probability of a class $y$ occurring before seeing any specific data. This represents our prior belief or assumption. It is often easy to compute by calculating the frequency of each class in the dataset.
* **$P(x)$ (Evidence / Marginal):** The total probability of observing feature $x$ regardless of the class. 
    * Calculated as $\sum_{k} P(x|y_k)P(y_k)$ (summing the weighted likelihoods over all possible classes). 
    * In classification, this acts as a normalization constant to ensure the posterior probabilities sum to 1.
    * *Note:* Since $P(x)$ does not depend on $y$, it does not change which class scores highest. In classification tasks, we can therefore ignore it and find the prediction by
    $$ \text{Prediction} = \arg\max_y \big( P(x|y)P(y) \big)$$
    While $P(x)$ is easy to compute in simple classification tasks with few classes, it becomes computationally intractable in complex generative models with high-dimensional or continuous latent variables. In those generative tasks, we have to approximate it.
* **$P(y|x)$ (Posterior):** The probability that the data belongs to class $y$ given the observed feature $x$. This is the final prediction we want to make.
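The four terms can be traced through a small numeric example. The prior and likelihood values below are made up for illustration and do not come from any dataset; the sketch computes the evidence as the sum of weighted likelihoods, normalizes to get the posterior, and checks that the predicted class is the same with or without dividing by $P(x)$.

```python
# Hypothetical numbers for a single observed feature value x.
prior = {"spam": 0.3, "not_spam": 0.7}       # P(y)
likelihood = {"spam": 0.8, "not_spam": 0.1}  # P(x|y)

# Numerator of Bayes' theorem for each class: P(x|y) P(y).
unnormalized = {y: likelihood[y] * prior[y] for y in prior}

# Evidence: P(x) = sum over classes of P(x|y_k) P(y_k).
evidence = sum(unnormalized.values())

# Posterior: P(y|x), which sums to 1 over the classes.
posterior = {y: s / evidence for y, s in unnormalized.items()}

# Dividing by the evidence rescales every class equally,
# so the argmax is the same before and after normalization.
prediction_from_scores = max(unnormalized, key=unnormalized.get)
prediction_from_posterior = max(posterior, key=posterior.get)

print(evidence)
print(posterior)
print(prediction_from_scores, prediction_from_posterior)
```

With these numbers the evidence is $0.8 \cdot 0.3 + 0.1 \cdot 0.7 = 0.31$, and the posterior for spam is $0.24 / 0.31 \approx 0.774$.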

In the discriminative approach, the model learns $P(y|x)$ directly (the mapping from input to class) without modeling the underlying data distribution. In the generative approach, the model learns the likelihood $P(x|y)$ (the data distribution per class) and the prior $P(y)$, then combines them using Bayes' rule to compute the posterior $P(y|x)$ for classification.

# Naive Bayes
In the general form of Bayes' rule, the likelihood is the joint distribution of all features occurring together given the class, but the hidden dependencies between features are usually difficult to compute. Therefore, we use Naive Bayes to simplify the model by assuming all features are conditionally independent given the class, meaning that, once the class is known, knowing one feature gives no information about any other feature. This is formalized by

$$P(x|y) = P(x_1, x_2, \dots, x_n | y) = \prod_{i=1}^{n}P(x_i | y)$$

This is a strong assumption, as features are almost never independent in reality.

Usually, Naive Bayes is used as a statistical method that learns the distribution of the data through counting. This is a fast but rather unreliable way of learning.

### Example
For a binary classification problem with the dataset $(\{x_1, x_2\}, y)$, we want to classify a new sample, that is, to compute $P(y | x_1=a, x_2=b)$
...


# Naive Bayes

In the standard application of Bayes' rule, the likelihood term $P(x_1, x_2, \dots, x_n | y)$ represents the joint probability of all features occurring together given the class. Computing this is difficult because features often have complex correlations (dependencies), requiring a massive amount of data to model accurately.

Therefore, Naive Bayes simplifies the model by introducing the assumption of conditional independence. This assumes that, *given the class label*, knowing the value of one feature provides no information about the value of any other feature.

This assumption allows us to decompose the complex joint likelihood into a simple product of individual probabilities:

$$P(x|y) = P(x_1, x_2, \dots, x_n | y) \approx \prod_{i=1}^{n}P(x_i | y)$$

Consequently, the posterior probability for a class $y$ is proportional to:

$$P(y|x) \propto P(y) \cdot \prod_{i=1}^{n}P(x_i | y)$$

Note on Reality: This is a strong, naive assumption because real-world features are rarely independent. However, despite this theoretical flaw, Naive Bayes is computationally efficient and often performs surprisingly well as a baseline classifier. While its probability estimates might be inaccurate due to the assumption, its classification decisions are often correct.

### Example
**Scenario:**
We have a binary classification problem where the class $y$ can be either **Spam (1)** or **Not Spam (0)**. Our data has two features, $x_1$ and $x_2$.

We want to classify a new sample where $x_1=a$ and $x_2=b$. To do this, we calculate the posterior probability for both classes and pick the highest one.

**Step 1: Set up the equation for Class 1 (Spam)**
Using the Naive Bayes assumption, we expand the formula:
$$P(y=1 | x_1=a, x_2=b) \propto P(y=1) \cdot P(x_1=a | y=1) \cdot P(x_2=b | y=1)$$

**Step 2: "Learning" by Counting**
We estimate these values directly from our training data frequencies:
* $P(y=1)$: How frequent is Spam overall? (Prior)
* $P(x_1=a | y=1)$: In Spam messages, how often does feature $x_1$ equal $a$? (Likelihood 1)
* $P(x_2=b | y=1)$: In Spam messages, how often does feature $x_2$ equal $b$? (Likelihood 2)

**Step 3: Comparison**
We repeat the calculation for **Class 0 (Not Spam)**:
$$P(y=0 | x_1=a, x_2=b) \propto P(y=0) \cdot P(x_1=a | y=0) \cdot P(x_2=b | y=0)$$

**Step 4: Decision**
We compare the two resulting scores.
* If Score(Spam) > Score(Not Spam), we classify the sample as Spam; otherwise, we classify it as Not Spam.
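The four steps can be sketched as a counting procedure. The training set below is invented for illustration, and the names `score` and `classify` are hypothetical; the sketch estimates the prior and the per-feature likelihoods from frequencies and compares the two class scores for the sample $(x_1=a, x_2=b)$. No smoothing is applied, so a feature value that never occurs in a class gives that class a score of 0.

```python
from collections import Counter

# Hypothetical training set: ((x1, x2), y) with y = 1 for Spam, 0 for Not Spam.
train = [
    (("a", "b"), 1), (("a", "c"), 1), (("a", "b"), 1), (("d", "b"), 1),
    (("d", "c"), 0), (("d", "c"), 0), (("a", "c"), 0), (("d", "b"), 0),
    (("d", "c"), 0), (("d", "c"), 0),
]

# Step 2: "learning" by counting.
class_counts = Counter(y for _, y in train)
feature_counts = Counter(
    (i, value, y) for x, y in train for i, value in enumerate(x)
)

def score(x, y):
    """Unnormalized posterior: P(y) * product over i of P(x_i | y)."""
    s = class_counts[y] / len(train)  # prior P(y)
    for i, value in enumerate(x):
        s *= feature_counts[(i, value, y)] / class_counts[y]  # P(x_i | y)
    return s

def classify(x):
    """Steps 3 and 4: compute the score of every class and pick the largest."""
    return max(class_counts, key=lambda y: score(x, y))

sample = ("a", "b")
print(score(sample, 1))  # 0.4 * 3/4 * 3/4
print(score(sample, 0))  # 0.6 * 1/6 * 1/6
print(classify(sample))
```

On this toy data the Spam score is $0.4 \cdot 0.75 \cdot 0.75 = 0.225$ and the Not Spam score is $0.6 \cdot \tfrac{1}{6} \cdot \tfrac{1}{6} \approx 0.017$, so the sample is classified as Spam.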


The key distinction lies in how we treat the features $x_1$ and $x_2$. Without the Naive assumption, we must treat the features as a coupled pair $(x_1, x_2)$. To calculate the likelihood $P(x_1=a, x_2=b | y)$, we would need to count how often the exact combination **"a AND b"** appears together in the training data. However, if the specific pair $(a, b)$ never appeared in the training set (sparsity), the probability becomes 0, and the model fails to make a prediction.

Naive Bayes assumes $x_1$ and $x_2$ are independent given the class. We treat them separately:
1.  Calculate $P(x_1=a | y)$ by looking only at the $x_1$ column.
2.  Calculate $P(x_2=b | y)$ by looking only at the $x_2$ column.
3.  Multiply the results.

This allows the model to estimate the probability of a combination $(a, b)$ even if those two feature values never appeared together in the training data.
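The contrast between the two estimates can be sketched on a handful of rows. The data is hypothetical and restricted to one class (Spam) for brevity, and the helper names are assumptions. The pair $(a, b)$ never occurs together, so the joint count gives 0, while the per-column estimate is still positive.

```python
# Hypothetical Spam-only rows (x1, x2); the pair ("a", "b") never co-occurs.
spam_rows = [("a", "c"), ("a", "c"), ("d", "b"), ("d", "b")]

def joint_likelihood(rows, x1, x2):
    """Without the naive assumption: count the exact pair (x1, x2)."""
    return sum(1 for r in rows if r == (x1, x2)) / len(rows)

def naive_likelihood(rows, x1, x2):
    """With the naive assumption: count each column on its own, then multiply."""
    p1 = sum(1 for r in rows if r[0] == x1) / len(rows)
    p2 = sum(1 for r in rows if r[1] == x2) / len(rows)
    return p1 * p2

print(joint_likelihood(spam_rows, "a", "b"))  # 0.0
print(naive_likelihood(spam_rows, "a", "b"))  # 0.25
```

Note that in this toy data the two features are in fact strongly dependent, so the value 0.25 is an artifact of the independence assumption rather than an accurate estimate. That is the trade-off the assumption makes: it generalizes to unseen combinations at the cost of probability estimates that may be off.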