# The Data Generation Problem
Data comes in many modalities, such as text, images, audio, and video. We denote a generic data sample by the mathematical object $x$.

### Data Space
A data sample $x$ belongs to a known data space, denoted $\mathcal{X}$ (i.e., $x \in \mathcal{X}$). The data space is a high-dimensional space containing every possible combination of values of the raw data representation. However, not all points in this space represent valid or meaningful data.

**Example:**
Consider a dataset of $5 \times 5$ black-and-white images of digits. If each pixel takes a binary value in $\{0, 1\}$, the data space is $\mathcal{X} = \{0, 1\}^{5 \times 5}$, which contains $2^{25}$ (about 33.5 million) possible images. The vast majority of these look like random noise; only a tiny subset forms coherent, recognizable digits, and those are the points we consider valid data.
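
To make the size of this space concrete, here is a minimal sketch that counts the points in $\{0, 1\}^{5 \times 5}$ and draws one uniformly at random. The helper `random_image` and the fixed seed are illustrative assumptions, not part of the notes.

```python
import random

HEIGHT, WIDTH = 5, 5

# Each of the 25 pixels is independently 0 or 1, so |X| = 2^25.
num_points = 2 ** (HEIGHT * WIDTH)
print(num_points)  # 33554432

def random_image(rng):
    """Draw one point uniformly from the data space {0, 1}^(5x5)."""
    return [[rng.randint(0, 1) for _ in range(WIDTH)] for _ in range(HEIGHT)]

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
image = random_image(rng)
for row in image:
    # Render 1 as '#' and 0 as '.'.
    print("".join("#" if pixel else "." for pixel in row))
```

A uniformly drawn point like this will almost always look like noise rather than a digit, which is the gap between the data space and the valid data inside it.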

### Data Distribution
The data distribution, denoted as $p(x)$, describes the probability of observing a specific sample $x$ from the space $\mathcal{X}$. This distribution is the key to differentiating meaningful data from noise in the data space:
* Valid/Meaningful samples are assigned high probability.
* Invalid samples are assigned a probability near or equal to zero.

In essence, $p(x)$ mathematically represents the hidden patterns and structure governing the data.

### Dataset
A dataset is a finite collection of samples $\{x_1, x_2, ..., x_n\}$ that have been drawn (sampled) from the underlying data distribution $p(x)$.

Data spaces are typically so high-dimensional that capturing or enumerating every point is infeasible. However, valid data usually occupies a very small, concentrated region within this vast space (often referred to as a "manifold"). By modeling the data distribution $p(x)$, we can focus on these high-probability regions and ignore the massive regions of the data space that contain only invalid noise.

### Goal of Generation
In generative modeling, the goal is to learn a target data distribution $p(x)$, either explicitly (learning the density function) or implicitly (learning a mechanism to generate samples).

We aim to construct a model that, after observing a limited training dataset, learns to sample new, unique data points that look as if they were drawn from the original distribution $p(x)$, although the original distribution is unknown.
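
As a toy illustration of this goal, the sketch below simulates a dataset from a made-up distribution over the tiny space $\{0, 1\}^2$, fits an explicit model by relative frequency, and then samples new points from the fitted model. The distribution `true_p`, the dataset size, and the seed are all assumptions chosen for the example; in a real problem $p(x)$ is unknown and the space is far too large for a lookup table.

```python
import random
from collections import Counter

# Hypothetical "true" distribution over the data space {0,1}^2.
# It is written out here only so that we can simulate a dataset.
true_p = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}

rng = random.Random(0)
points = list(true_p)
weights = [true_p[x] for x in points]

# The training dataset: a finite sample drawn from p(x).
dataset = rng.choices(points, weights=weights, k=1000)

# Explicit modeling: estimate the probability of each point by relative frequency.
counts = Counter(dataset)
model_p = {x: counts[x] / len(dataset) for x in points}

# Generation: draw new samples from the learned model, not from true_p.
new_samples = rng.choices(points, weights=[model_p[x] for x in points], k=5)
print(model_p)
print(new_samples)
```

The fitted `model_p` concentrates its mass on the same two points as `true_p`, so samples drawn from it resemble samples from the original distribution.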

# Discriminative and Generative Models
### Discriminative Models
For supervised learning tasks with data $(x, y)$, a discriminative model directly estimates the conditional probability $P(y|x)$. In effect it learns a decision boundary between classes, that is, a direct mapping from input to output. Because it focuses solely on that boundary, a discriminative model does not explicitly model the distribution of the input data: it learns how to separate classes so as to minimize the loss, but not what the data itself looks like.

### Generative Models
A generative model aims to learn the complete data distribution. For a dataset without labels, it learns the distribution $P(x)$; for a labelled dataset, it learns the joint distribution $P(x, y)$. The joint is often understood through the factorization $P(x, y) = P(x|y)P(y)$: the model learns "what the data looks like" given a specific label (the likelihood $P(x|y)$), together with how common each label is (the prior $P(y)$).

This understanding allows generative models to perform both types of tasks:
* Generative Task: For a desired label $y$, the model can generate new synthetic features $x$ by sampling from $P(x|y)$.
* Discriminative Task: To classify a new input $x$, the model uses Bayes' Rule to calculate the posterior probability $P(y|x) \propto P(x|y)P(y)$, allowing it to determine the most likely label.
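
The two tasks can be sketched with a toy model in which $x$ is a single real-valued feature and each class has a Gaussian $P(x|y)$. The class names, means, standard deviations, and priors below are made-up assumptions for illustration only.

```python
import math
import random

# Hypothetical class-conditional model: x is a 1-D feature, y is a label.
# P(x|y) is Gaussian with a per-class (mean, std); P(y) is the class prior.
params = {"cat": (4.0, 1.0), "dog": (10.0, 2.0)}
prior = {"cat": 0.5, "dog": 0.5}

def likelihood(x, y):
    """Gaussian density P(x|y)."""
    mean, std = params[y]
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def generate(y, rng):
    """Generative task: sample a new feature x from P(x|y)."""
    mean, std = params[y]
    return rng.gauss(mean, std)

def classify(x):
    """Discriminative task: pick the label maximizing P(x|y) P(y)."""
    return max(prior, key=lambda y: likelihood(x, y) * prior[y])

rng = random.Random(0)
print(generate("dog", rng))          # a synthetic feature for the label "dog"
print(classify(4.5), classify(9.0))  # cat dog
```

The same two ingredients, $P(x|y)$ and $P(y)$, serve both functions: `generate` samples from the likelihood, while `classify` compares likelihood times prior across labels.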

<img src="https://cdn.prod.website-files.com/65d8ee5f025f02594c614c17/66d088d78f87048868f94fee_66d085ede64cc943387d4695_1.webp" width=500>

# Bayes' Theorem in Machine Learning
Given labelled data $(x, y)$:

$$P(y|x) = \frac{P(x|y)P(y)}{P(x)} = \frac{P(x|y)P(y)}{\sum_{k} P(x|y_k)P(y_k)}$$

* **$P(x|y)$ (Likelihood):** The probability of observing feature $x$ given that it belongs to class $y$. This is what the generative model explicitly learns (the distribution of the data for each class). It can be estimated with classical statistical methods or with a neural network.
* **$P(y)$ (Prior):** The probability of a class $y$ occurring before seeing any specific data. This represents our prior belief or assumption. It is often easy to compute by calculating the frequency of each class in the dataset.
* **$P(x)$ (Evidence / Marginal):** The total probability of observing feature $x$ regardless of the class. 
    * Calculated as $\sum_{k} P(x|y_k)P(y_k)$ (summing the weighted likelihoods over all possible classes). 
    * In classification, this acts as a normalization constant to ensure the posterior probabilities sum to 1.
    * *Note:* For classification, $P(x)$ does not depend on $y$, so it can be dropped when all we need is the most likely class:
    $$ \text{Prediction} = \arg\max_y \big( P(x|y)P(y) \big)$$
    While $P(x)$ is easy to compute in simple classification tasks with few classes, it becomes computationally intractable in complex generative models with high-dimensional or continuous latent variables, where the sum turns into an integral over the latent space. In those generative tasks, it has to be approximated.
* **$P(y|x)$ (Posterior):** The probability that the data belongs to class $y$ given the observed feature $x$. This is the final prediction we want to make.

In the discriminative approach, the model learns $P(y|x)$ directly (the mapping from input to class) without modeling the underlying data distribution. In the generative approach, the model learns the likelihood $P(x|y)$ (the data distribution per class) and the prior $P(y)$, then combines them using Bayes' rule to compute the posterior $P(y|x)$ for classification.
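
A small numeric example may help tie the four terms together. The sketch below uses made-up numbers for a two-class problem with one binary feature (the class names and probabilities are assumptions, not data from these notes) and computes the evidence and the posterior for an observed $x = 1$.

```python
# Hypothetical numbers for a two-class problem with one binary feature x.
prior = {"spam": 0.3, "ham": 0.7}       # P(y)
likelihood = {"spam": 0.8, "ham": 0.1}  # P(x = 1 | y)

# Unnormalized scores P(x|y) P(y) for the observed x = 1.
joint = {y: likelihood[y] * prior[y] for y in prior}

# Evidence P(x): sum of the scores over all classes.
evidence = sum(joint.values())

# Posterior P(y|x): normalize so the values sum to 1.
posterior = {y: joint[y] / evidence for y in joint}

# Dividing by the evidence does not change which class scores highest.
prediction = max(joint, key=joint.get)

print(round(evidence, 2))
print({y: round(p, 3) for y, p in posterior.items()})
print(prediction)
```

Here the evidence is $0.8 \cdot 0.3 + 0.1 \cdot 0.7 = 0.31$, the posterior for "spam" is $0.24 / 0.31 \approx 0.77$, and the arg max over the unnormalized scores gives the same class as the arg max over the posterior.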

# Naive Bayes
Naive Bayes is a classification algorithm based on Bayes' theorem. It is used on a variety of datasets, including text data, image data, and numerical data.

The algorithm computes the posterior probability of each class given the input features and predicts the class with the highest probability. It is called "naive" because it assumes the features are conditionally independent given the class, which lets the likelihood factor into a product of per-feature terms.



# ML version
$$P(y_k|X) = \frac{P(X|y_k)P(y_k)}{P(X)} = \frac{P(x_1|y_k)\cdot P(x_2|y_k)\cdots P(x_n|y_k)\cdot P(y_k)}{P(X)} = \frac{P(y_k)\cdot \prod^{n}_{i=1}P(x_i|y_k)}{P(X)}$$

where

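$$P(y_k) = \frac{\sum^{m}_{t=1}\mathbb{1}\big(y^{(t)}=y_k\big)}{m}$$

$$P(x_i = a_j|y_k) = \frac{\sum^{m}_{t=1}\mathbb{1}\big(x_i^{(t)}=a_j \text{ and } y^{(t)}=y_k\big)}{\sum^{m}_{t=1}\mathbb{1}\big(y^{(t)}=y_k\big)}$$

Here $\mathbb{1}(\cdot)$ equals 1 when the condition holds and 0 otherwise, and the superscript $(t)$ indexes the $t$th of the $m$ training examples, so $x_i^{(t)}$ is the value of feature $i$ in example $t$ and $y^{(t)}$ is its label.

$P(y_k|X)$: the probability of the $k$th class, $y_k$, given the input features $X$, which contains $n$ features ($x_1, x_2, \cdots, x_n$).

$P(y_k)$: the probability of the $k$th class (prior probability), which equals the fraction of the $m$ training examples that belong to class $y_k$.

$P(x_i|y_k)$: the probability of feature $x_i$ given the class $y_k$ (conditional probability). For a specific value $a_j$, it equals the number of training examples in which feature $i$ equals $a_j$ and the label equals $y_k$, divided by the number of training examples whose label equals $y_k$.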

To make a prediction, we select the class with the highest posterior probability $P(y_k|X)$. Since $P(X)$ is the same for every class, it suffices to compare the numerators.
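
The counting formulas above can be sketched in a few lines for categorical features. This is a minimal illustration, not a reference implementation: the function names `fit` and `predict` and the toy weather data are assumptions, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def fit(X, y):
    """Estimate P(y_k) and the counts behind P(x_i = a_j | y_k) by counting."""
    m = len(y)
    class_counts = Counter(y)
    prior = {c: n / m for c, n in class_counts.items()}
    # feature_counts[(i, c)][a] = number of class-c examples with feature i equal to a
    feature_counts = defaultdict(Counter)
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return prior, class_counts, feature_counts

def predict(x, prior, class_counts, feature_counts):
    """Return the class maximizing P(y_k) * prod_i P(x_i | y_k), plus all scores."""
    scores = {}
    for c in prior:
        score = prior[c]
        for i, value in enumerate(x):
            # No smoothing: an unseen (value, class) pair gives probability 0.
            score *= feature_counts[(i, c)][value] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
     ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
y = ["yes", "no", "no", "yes", "yes", "no"]

model = fit(X, y)
label, scores = predict(("sunny", "no"), *model)
print(label, scores)
```

For the query `("sunny", "no")`, the score for "yes" is $\frac{1}{2} \cdot \frac{2}{3} \cdot \frac{3}{3} = \frac{1}{3}$, while the score for "no" is 0 because no "no" example has `windy = "no"`. The scores are the numerators of the posterior; $P(X)$ is omitted because it is the same for both classes.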
