# Understanding Diffusion from scratch

```{warning}
These notes are a work in progress. The content has been written down but not yet organized. Once it is complete, I will reorganize the page for readability.
```

This tutorial has two parts. The first part follows the [Step-by-Step Diffusion](https://arxiv.org/pdf/2406.08929v1) tutorial to build strong foundations on the topic. The second part follows [Stanley Chan's Tutorial on Diffusion Models](https://arxiv.org/pdf/2403.18103) for clear intuition on the stochastic differential equation (SDE) formulation of diffusion models.

## Fundamentals

Each fundamental concept is explained in its own collapsible note below. The explanations are intentionally verbose.

```{admonition} What is a random variable?
:class: note, dropdown

A **random variable** is a way to assign numbers to the outcomes of a random process.

**Think of it like this:**  
- You roll a die. The result (1, 2, 3, 4, 5, or 6) is a **random number** → This is a **random variable**.  
- You measure the time it takes for a website to load. The time is **random and can take any real value** → Another **random variable**.  

---

**Types of Random Variables**  

**1. Discrete Random Variable**  
A discrete random variable can take only specific values (like whole numbers).  

**Example: Rolling a Fair Die**  
Let $ X $ be the number shown on a fair six-sided die. The possible values of $ X $ are:  

$$
X \in \{1, 2, 3, 4, 5, 6\}
$$

Since the die is fair, each outcome has an equal probability:

$$
P(X = x) =
\begin{cases}
\frac{1}{6}, & x = 1,2,3,4,5,6 \\
0, & \text{otherwise}
\end{cases}
$$

For example, the probability of rolling a **4** is:

$$
P(X = 4) = \frac{1}{6}
$$

---

**2. Continuous Random Variable**  
A continuous random variable can take **any value within a range**.  

**Example: Webpage Loading Time**  
Let $ Y $ be the time (in seconds) for a webpage to load. The possible values of $ Y $ are:  

$$
Y \in [0, \infty)
$$

Since $ Y $ can take uncountably many values, the probability of any single exact value is zero, so we describe $ Y $ with a **probability density function (PDF)** instead of assigning probabilities to individual values.  

If $ Y $ follows an **exponential distribution**, its PDF is:

$$
f_Y(y) = \lambda e^{-\lambda y}, \quad y \geq 0
$$

where $ \lambda $ is a constant that controls the rate of decay.  

To find the probability that the webpage loads in **less than 3 seconds**, we integrate the PDF:

$$
P(Y \leq 3) = \int_0^3 \lambda e^{-\lambda y} dy
$$

Evaluating the integral gives $ P(Y \leq 3) = 1 - e^{-3\lambda} $, the probability that the page loads within 3 seconds.  

---

**Key Takeaways**  
- **Discrete Random Variable** → Takes countable values (e.g., die rolls, number of heads in coin flips).  
- **Continuous Random Variable** → Takes any value in a range (e.g., time, temperature, height). 
```
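
The die and loading-time examples from the random-variable note above can be checked numerically. The following is a minimal sketch using only the Python standard library. The rate `lam = 0.5` and the helper name `exponential_cdf` are assumptions chosen for illustration and do not come from the referenced tutorials.

```python
import math
from fractions import Fraction

# Discrete case: PMF of a fair six-sided die, stored as exact fractions.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def exponential_cdf(y: float, lam: float) -> float:
    """P(Y <= y) for Y ~ Exponential(lam), i.e. the integral of
    lam * exp(-lam * t) over t from 0 to y."""
    return 1.0 - math.exp(-lam * y) if y >= 0 else 0.0

lam = 0.5  # assumed rate, for illustration only
p_within_3s = exponential_cdf(3.0, lam)

print(die_pmf[4])             # 1/6
print(round(p_within_3s, 4))  # 0.7769
```

A larger `lam` makes the density decay faster, so more of the probability sits below 3 seconds.
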


```{admonition} What is a Probability Distribution?
:class: note, dropdown

A **probability distribution** describes how values of a random variable are distributed. It tells us the likelihood of different outcomes occurring.

---

**1. Discrete Probability Distribution**  
A discrete probability distribution is used for **discrete random variables**, where the variable takes a finite or countably infinite number of values.

Each possible value $ x_i $ has an associated probability $ P(X = x_i) $, and the total probability must sum to 1:

$$
\sum_{i} P(X = x_i) = 1
$$

**Probability Mass Function (PMF):**  
For a discrete random variable, the **probability mass function (PMF)**, denoted as $ P(X = x) $, gives the probability of the random variable taking a specific value $ x $. It satisfies:

1. $ 0 \leq P(X = x) \leq 1 $ for all $ x $.
2. The total probability is 1:

   $$
   \sum_x P(X = x) = 1
   $$

**Example: Rolling a Fair Die**  
For a fair six-sided die, the probability of each face is:

$$
P(X = x) =
\begin{cases}
\frac{1}{6}, & x = 1,2,3,4,5,6 \\
0, & \text{otherwise}
\end{cases}
$$

The sum of probabilities:

$$
\sum_{x=1}^{6} P(X = x) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 1
$$

---

**2. Continuous Probability Distribution**  
A continuous probability distribution is used for **continuous random variables**, where the variable can take any value in an interval.

**Probability Density Function (PDF):**  
Instead of a probability mass function (PMF), we use a **probability density function (PDF)**, denoted as $ f_X(x) $. The probability of the variable lying in an interval $ [a, b] $ is:

$$
P(a \leq X \leq b) = \int_a^b f_X(x) \, dx
$$

For a valid probability density function, it must satisfy:

1. $ f_X(x) \geq 0 $ for all $ x $
2. The total probability must integrate to 1:

$$
\int_{-\infty}^{\infty} f_X(x) dx = 1
$$

**Example: Normal (Gaussian) Distribution**  
The normal distribution is a common continuous distribution. With mean $ \mu $ and variance $ \sigma^2 $, its PDF is:

$$
f_X(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty
$$

where:
- $ \mu $ is the mean (center of the distribution).
- $ \sigma^2 $ is the variance (spread of the distribution).

For a **standard normal distribution** ($ \mu = 0, \sigma^2 = 1 $):

$$
f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}
$$

To find the probability of $ X $ being in a range, we integrate:

$$
P(-1 \leq X \leq 1) = \int_{-1}^{1} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx
$$

---

**Key Differences Between Discrete and Continuous Distributions**  

| Feature            | Discrete Distribution                 | Continuous Distribution |
|-------------------|--------------------------------|--------------------------|
| Random Variable Type | Takes countable values (e.g., integers) | Takes uncountably many values (e.g., real numbers in an interval) |
| Probability Function | Probability Mass Function (PMF) | Probability Density Function (PDF) |
| Probability Calculation | $ P(X = x) $ gives exact probability | $ P(a \leq X \leq b) = \int_a^b f_X(x) dx $ |
| Example | Rolling a die, number of heads in coin flips | Heights, weights, webpage load times |


```
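
The two requirements on a PDF (non-negative, integrates to 1) and the interval probability $ P(-1 \leq X \leq 1) $ for the standard normal can be approximated with simple numerical integration. This sketch uses a midpoint rule. The helper names `normal_pdf` and `integrate`, the step count, and the finite range $[-10, 10]$ used to stand in for $(-\infty, \infty)$ are illustrative assumptions.

```python
import math

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def integrate(f, a: float, b: float, n: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Probability mass within one standard deviation of the mean (about 0.6827).
p_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)

# Total mass; [-10, 10] is wide enough that the neglected tails are negligible.
total_mass = integrate(normal_pdf, -10.0, 10.0)

print(p_within_one_sigma, total_mass)
```
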

```{admonition} What is a Probability Density Function?
:class: note, dropdown

A **Probability Density Function (PDF)** describes the relative likelihood of a continuous random variable taking values near a given point. Unlike a **Probability Mass Function (PMF)** (used for discrete variables), the PDF does not give the probability of a single outcome. Instead, integrating it over an interval gives the probability that the variable falls within that interval.

---

**Definition**  
For a continuous random variable $ X $, the **probability density function (PDF)**, denoted as $ f_X(x) $, satisfies the following properties:

1. **Non-negativity:**  

   $$
   f_X(x) \geq 0, \quad \forall x \in \mathbb{R}
   $$

2. **Total Probability is 1:** 

   $$
   \int_{-\infty}^{\infty} f_X(x) \, dx = 1
   $$

3. **Probability of an Interval:**  
   The probability that $ X $ lies in an interval $ [a, b] $ is given by:

   $$
   P(a \leq X \leq b) = \int_a^b f_X(x) \, dx
   $$

Since a continuous random variable can take uncountably many values, the probability of it taking any specific single value is always **zero**:

$$
P(X = x) = \int_x^x f_X(t) \, dt = 0
$$

This is why we always consider probabilities over intervals rather than individual points.

---

**Example: Uniform Distribution**  
A continuous random variable $ X $ following a **Uniform Distribution** over the interval $ [a, b] $ has a PDF:

$$
f_X(x) =
\begin{cases}
\frac{1}{b - a}, & a \leq x \leq b \\
0, & \text{otherwise}
\end{cases}
$$

The probability of $ X $ being in a subinterval $ [c, d] $ (where $ a \leq c < d \leq b $) is:

$$
P(c \leq X \leq d) = \int_c^d \frac{1}{b - a} \, dx = \frac{d - c}{b - a}
$$

---

**Example: Normal (Gaussian) Distribution**  
A **Normal (Gaussian) Distribution** is given by the PDF:

$$
f_X(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty
$$

where:
- $ \mu $ is the **mean** (center of the distribution).
- $ \sigma^2 $ is the **variance** (spread of the distribution).

To find the probability of $ X $ falling in a certain range $ [a, b] $, we compute:

$$
P(a \leq X \leq b) = \int_a^b \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} dx
$$

Since this integral **does not have a closed-form solution** in terms of elementary functions, we evaluate it with numerical integration, the error function, or lookup tables of cumulative probabilities.

---

**Key Differences Between PMF and PDF**  

| Feature            | PMF (Discrete)                          | PDF (Continuous)                    |
|-------------------|--------------------------------|--------------------------------|
| Definition        | $ P(X = x) $ gives exact probability of a value | $ f_X(x) $ represents density, not probability |
| Total Probability | $ \sum P(X = x) = 1 $ | $ \int_{-\infty}^{\infty} f_X(x) dx = 1 $ |
| Single Value Probability | $ P(X = x) > 0 $ for some $ x $ | $ P(X = x) = 0 $ for all $ x $ |
| Example          | Number of heads in 10 coin flips | Height of people in cm |

```
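
A quick way to see that a density is not a probability is a uniform distribution on an interval shorter than 1: the density exceeds 1, yet interval probabilities stay between 0 and 1. The sketch below assumes the interval $[0, 0.5]$, and the helper names are hypothetical.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform[a, b] at x: 1 / (b - a) inside the interval, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_interval_prob(c: float, d: float, a: float, b: float) -> float:
    """P(c <= X <= d) for X ~ Uniform[a, b], clipping [c, d] to the support."""
    lo, hi = max(c, a), min(d, b)
    return max(hi - lo, 0.0) / (b - a)

a, b = 0.0, 0.5                                # assumed support, for illustration
density = uniform_pdf(0.25, a, b)              # 2.0, a density larger than 1
prob = uniform_interval_prob(0.1, 0.3, a, b)   # (0.3 - 0.1) / 0.5, roughly 0.4

print(density, prob)
```
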
```{admonition} What is a Cumulative Distribution Function?
:class: note, dropdown
The **Cumulative Distribution Function (CDF)** gives the probability that a random variable $ X $ takes on a value **less than or equal to** a given number $ x $. It is useful for describing both **discrete** and **continuous** probability distributions.

---

**Definition**  
The **Cumulative Distribution Function (CDF)** of a random variable $ X $, denoted as $ F_X(x) $, is defined as:

$$
F_X(x) = P(X \leq x)
$$

For **discrete random variables**, the CDF is the sum of probabilities up to $ x $:

$$
F_X(x) = \sum_{t \leq x} P(X = t)
$$

For **continuous random variables**, the CDF is obtained by integrating the probability density function (PDF):

$$
F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt
$$

where $ f_X(x) $ is the **Probability Density Function (PDF)**.

---

**Properties of the CDF**  
1. **Non-decreasing Function:**  
   Since probabilities accumulate, the CDF is always **non-decreasing**:

   $$
   F_X(a) \leq F_X(b), \quad \text{for } a \leq b
   $$

2. **Limits:**  
   - As $ x $ decreases without bound, no probability has been accumulated yet, so the CDF approaches **0**:

     $$
     \lim_{x \to -\infty} F_X(x) = 0
     $$

   - As $ x $ increases without bound, all of the probability has been accumulated, so the CDF approaches **1**:

     $$
     \lim_{x \to \infty} F_X(x) = 1
     $$

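3. **Computing Probability Between Two Values:**  
   The probability that $ X $ lies in a half-open interval $ (a, b] $ is:

   $$
   P(a < X \leq b) = F_X(b) - F_X(a)
   $$

   For a continuous random variable, single points have zero probability, so this also equals $ P(a \leq X \leq b) $.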

4. **Relationship with PDF:**  
   If $ X $ is continuous, the CDF and PDF are related by differentiation:

   $$
   f_X(x) = \frac{d}{dx} F_X(x)
   $$

---

**Example: Discrete Random Variable (Rolling a Fair Die)**  
Let $ X $ be the result of rolling a fair six-sided die. The **PMF** is:

$$
P(X = x) =
\begin{cases}
\frac{1}{6}, & x = 1,2,3,4,5,6 \\
0, & \text{otherwise}
\end{cases}
$$

The **CDF** is:

$$
F_X(x) =
\begin{cases}
0, & x < 1 \\
\frac{1}{6}, & 1 \leq x < 2 \\
\frac{2}{6}, & 2 \leq x < 3 \\
\frac{3}{6}, & 3 \leq x < 4 \\
\frac{4}{6}, & 4 \leq x < 5 \\
\frac{5}{6}, & 5 \leq x < 6 \\
1, & x \geq 6
\end{cases}
$$

This means:
- The probability of rolling **≤ 3** is $ F_X(3) = \frac{3}{6} = 0.5 $.
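- Because the die is discrete, $ F_X(4) - F_X(2) $ excludes the left endpoint and gives the probability of rolling a **3 or 4**:

  $$
  P(2 < X \leq 4) = F_X(4) - F_X(2) = \frac{4}{6} - \frac{2}{6} = \frac{2}{6}
  $$

- The probability of rolling **between 2 and 4 inclusive** must subtract $ F_X(1) $ instead:

  $$
  P(2 \leq X \leq 4) = F_X(4) - F_X(1) = \frac{4}{6} - \frac{1}{6} = \frac{3}{6}
  $$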

---

**Example: Continuous Random Variable (Uniform Distribution on $ [0,1] $)**  
For a uniform distribution between 0 and 1, the **PDF** is:

$$
f_X(x) =
\begin{cases}
1, & 0 \leq x \leq 1 \\
0, & \text{otherwise}
\end{cases}
$$

The **CDF** is:

$$
F_X(x) =
\begin{cases}
0, & x < 0 \\
x, & 0 \leq x \leq 1 \\
1, & x > 1
\end{cases}
$$

For example:
- $ P(X \leq 0.5) = F_X(0.5) = 0.5 $
- $ P(0.2 \leq X \leq 0.8) = F_X(0.8) - F_X(0.2) = 0.8 - 0.2 = 0.6 $

---

**Example: Normal (Gaussian) Distribution**  
For a **Normal (Gaussian) distribution**:

$$
F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(t - \mu)^2}{2\sigma^2}} dt
$$

This integral **does not have a closed-form solution** in terms of elementary functions, so we use numerical approximations, the error function, or lookup tables.

For a **standard normal distribution** ($ \mu = 0, \sigma^2 = 1 $), the CDF is denoted as:

$$
\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}} dt
$$

Common values (from a normal table):
- $ \Phi(0) = 0.5 $
- $ \Phi(1) \approx 0.8413 $
- $ \Phi(-1) \approx 0.1587 $

To find $ P(0 \leq X \leq 1) $ for a standard normal variable:

$$
P(0 \leq X \leq 1) = \Phi(1) - \Phi(0) = 0.8413 - 0.5 = 0.3413
$$

---

**Key Differences Between PDF and CDF**  

| Feature | PDF (Continuous) | CDF |
|---------|-----------------|-----|
| Definition | $ f_X(x) $ gives the density, not probability | $ F_X(x) = P(X \leq x) $ gives cumulative probability |
| Relationship | $ P(a \leq X \leq b) = \int_a^b f_X(x) dx $ | $ P(a \leq X \leq b) = F_X(b) - F_X(a) $ |
| Example | $ f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} $ (Normal) | $ F_X(x) = \int_{-\infty}^{x} f_X(t) dt $ |

```
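
The CDF examples above can be reproduced in a few lines. For a discrete variable, $ F_X(b) - F_X(a) $ counts the outcomes in the half-open interval $ (a, b] $, so the left endpoint needs care. For the normal distribution, the sketch relies on `statistics.NormalDist` from the Python standard library (available since Python 3.8). The helper name `die_cdf` is an assumption made for illustration.

```python
from fractions import Fraction
from statistics import NormalDist

def die_cdf(x: float) -> Fraction:
    """F_X(x) = P(X <= x) for a fair six-sided die."""
    faces_at_or_below = sum(1 for face in range(1, 7) if face <= x)
    return Fraction(faces_at_or_below, 6)

# F(4) - F(2) covers the outcomes {3, 4}: P(2 < X <= 4) = 1/3.
p_3_or_4 = die_cdf(4) - die_cdf(2)

# Including the outcome 2 requires subtracting F(1): P(2 <= X <= 4) = 1/2.
p_2_to_4 = die_cdf(4) - die_cdf(1)

# Standard normal: P(0 <= X <= 1) = Phi(1) - Phi(0), about 0.3413.
std_normal = NormalDist(mu=0.0, sigma=1.0)
p_0_to_1 = std_normal.cdf(1.0) - std_normal.cdf(0.0)

print(p_3_or_4, p_2_to_4, round(p_0_to_1, 4))
```
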
```{admonition} What is Expectation of a random variable?
:class: note, dropdown

The **expectation** (or **expected value**) of a random variable represents its long-term average value over many trials. It gives an idea of the **center** of the distribution.

---

**Definition**  
For a random variable $ X $, the expectation (denoted as $ \mathbb{E}[X] $) is defined as:

- **For a discrete random variable:**

  $$
  \mathbb{E}[X] = \sum_{i} x_i P(X = x_i)
  $$

- **For a continuous random variable:**

  $$
  \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx
  $$

where:
- $ P(X = x_i) $ is the probability mass function (PMF) for discrete variables.
- $ f_X(x) $ is the probability density function (PDF) for continuous variables.

The expectation can be interpreted as a **weighted average**, where each possible value of $ X $ is weighted by its probability.

---

**Properties of Expectation**  

1. **Linearity:**  
   For any two random variables $ X $ and $ Y $, and constants $ a, b $:

   $$
   \mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]
   $$

2. **Expectation of a Constant:**  
   If $ c $ is a constant, then:

   $$
   \mathbb{E}[c] = c
   $$

3. **Expectation of a Function of X:**  
   If $ g(X) $ is a function of a random variable $ X $, then:

   $$
   \mathbb{E}[g(X)] = \sum_{i} g(x_i) P(X = x_i) \quad \text{(discrete)}
   $$

   $$
   \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \quad \text{(continuous)}
   $$

---

**Example: Discrete Random Variable (Rolling a Fair Die)**  
Let $ X $ be the result of rolling a fair six-sided die. The possible values are $ X \in \{1,2,3,4,5,6\} $, and the probability of each value is:

$$
P(X = x) = \frac{1}{6}, \quad x \in \{1,2,3,4,5,6\}
$$

The expectation is:

$$
\mathbb{E}[X] = \sum_{x=1}^{6} x P(X = x)
$$

$$
= 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6}
$$

$$
= \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = 3.5
$$

So, if you roll a fair die many times, the **average outcome** approaches **3.5**, even though 3.5 is not itself a possible roll.

---

**Example: Continuous Random Variable (Uniform Distribution on [0,1])**  
Let $ X $ follow a uniform distribution on $ [0,1] $, meaning its PDF is:

$$
f_X(x) =
\begin{cases}
1, & 0 \leq x \leq 1 \\
0, & \text{otherwise}
\end{cases}
$$

The expectation is:

$$
\mathbb{E}[X] = \int_0^1 x f_X(x) dx = \int_0^1 x \cdot 1 \, dx
$$

$$
= \frac{x^2}{2} \Big|_0^1 = \frac{1}{2}
$$

So, the **expected value of a uniformly distributed variable** on $ [0,1] $ is **0.5**.

---

**Expectation and Mean in Statistics**  
The expectation $ \mathbb{E}[X] $ is also called the **mean** or **first moment** of a random variable and is denoted as:

$$
\mu = \mathbb{E}[X]
$$

For a normal distribution $ X \sim \mathcal{N}(\mu, \sigma^2) $, the expected value is simply:

$$
\mathbb{E}[X] = \mu
$$

---

**Key Takeaways**  
- Expectation is a **long-run average** value of a random variable.
- For **discrete** variables, expectation is a **sum** over all possible values.
- For **continuous** variables, expectation is an **integral** over the probability density.
- Expectation is **linear**: the expectation of a sum is the sum of the expectations, and constant factors can be pulled out.
```
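
The expectation formulas translate directly into a weighted sum (discrete case) and a numerical integral (continuous case). The sketch below checks the die mean, the uniform mean, and linearity. The function names and the midpoint-rule step count are illustrative assumptions.

```python
from fractions import Fraction

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation_discrete(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

def expectation_continuous(pdf, a, b, g=lambda x: x, n=10_000):
    """Midpoint-rule approximation of the integral of g(x) * pdf(x) over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) * pdf(a + (i + 0.5) * h) for i in range(n))

mean_die = expectation_discrete(die_pmf)                              # 7/2
mean_uniform = expectation_continuous(lambda x: 1.0, 0.0, 1.0)        # about 0.5

# Linearity: E[2X + 3] equals 2 * E[X] + 3.
lhs = expectation_discrete(die_pmf, lambda x: 2 * x + 3)
rhs = 2 * mean_die + 3

print(mean_die, mean_uniform, lhs == rhs)
```
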
```{admonition} What is meant by Expectation over a probability distribution?
:class: tip, dropdown
Expectation over a probability distribution means computing the **average value** of a function of a random variable, weighted by the probability distribution of that variable.

In simpler terms, if a random variable $ X $ follows a certain probability distribution, the expectation tells us the **average value of $ X $** when sampled from that distribution.

---

**Definition**  
For a function $ g(X) $ of a random variable $ X $, the expectation over a probability distribution is:

- **For a discrete random variable** with probability mass function (PMF) $ P(X = x) $:

  $$
  \mathbb{E}[g(X)] = \sum_{x} g(x) P(X = x)
  $$

- **For a continuous random variable** with probability density function (PDF) $ f_X(x) $:

  $$
  \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx
  $$

This tells us the expected value of the function $ g(X) $ when $ X $ is sampled according to its probability distribution.

---

**Special Case: Expectation of $ X $ Itself**  
If $ g(X) = X $, then the expectation simply gives the **mean** of the distribution:

- **Discrete case:**

  $$
  \mathbb{E}[X] = \sum_{x} x P(X = x)
  $$

- **Continuous case:**

  $$
  \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx
  $$

---

**Example 1: Discrete Expectation Over a Probability Distribution**  
Consider a **biased** coin flip where $ X $ represents the number of heads in a single flip:

- $ P(X = 1) = p $ (probability of heads)
- $ P(X = 0) = 1 - p $ (probability of tails)

The expectation of $ X $ is:

$$
\mathbb{E}[X] = (1 \cdot p) + (0 \cdot (1 - p)) = p
$$

This means that if we repeatedly flip the coin, the **average number of heads per flip** is equal to the probability of getting heads.

---

**Example 2: Continuous Expectation Over a Probability Distribution**  
Let $ X $ follow a standard **normal distribution** $ \mathcal{N}(0,1) $, meaning it has the PDF:

$$
f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

To compute the expectation:

$$
\mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) dx
$$

Since the normal distribution is **symmetric around zero**, the positive and negative contributions cancel out:

$$
\mathbb{E}[X] = 0
$$

which makes sense because a standard normal distribution has a mean of zero.

---

**Expectation of a Function Over a Distribution**  
Instead of computing $ \mathbb{E}[X] $, we can compute the expectation of a function $ g(X) $.  

For example, consider computing the expectation of $ g(X) = X^2 $ for a normal distribution:

$$
\mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x) dx
$$

For a normal distribution $ X \sim \mathcal{N}(\mu, \sigma^2) $, rearranging the variance identity $ \sigma^2 = \mathbb{E}[X^2] - \mu^2 $ gives:

$$
\mathbb{E}[X^2] = \sigma^2 + \mu^2
$$

which shows how variance and mean influence the expected squared value.

---

**Key Takeaways**  
- Expectation over a probability distribution means computing the average outcome, weighted by the probability of each outcome.
- It applies to both **discrete** (sums) and **continuous** (integrals) cases.
- The expectation of a function $ g(X) $ can be computed using the probability distribution of $ X $.
- The expectation of $ X $ itself gives the **mean** of the distribution.
```
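
When the integral is inconvenient, an expectation over a distribution can be approximated by averaging over samples drawn from that distribution (a Monte Carlo estimate). The sketch below estimates $ \mathbb{E}[X] $ and $ \mathbb{E}[X^2] $ for a normal distribution and compares them with $ \mu $ and $ \sigma^2 + \mu^2 $. The values `mu = 1.0`, `sigma = 2.0`, the sample count, and the seed are assumptions chosen for illustration.

```python
import random

mu, sigma = 1.0, 2.0      # assumed parameters, for illustration only
n_samples = 100_000
rng = random.Random(0)    # fixed seed so repeated runs give the same estimate

samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]

# Sample averages approximate expectations under the sampling distribution.
estimated_mean = sum(samples) / n_samples
estimated_second_moment = sum(x * x for x in samples) / n_samples

exact_second_moment = sigma ** 2 + mu ** 2  # 5.0 for the assumed parameters

print(estimated_mean, estimated_second_moment, exact_second_moment)
```

The estimates fluctuate around the exact values, and the error shrinks roughly like $ 1/\sqrt{n} $ as the number of samples grows.
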
```{admonition} Expectation of a Function Over a Distribution vs. Expectation Over a Probability Distribution
:class: tip, dropdown
**1. Expectation Over a Probability Distribution**  
This refers to the general concept of computing an expected value based on a **probability distribution**. It can apply to a random variable $ X $ itself or to any function of $ X $.

If $ X $ is a random variable with a probability distribution given by:
- **PMF** $ P(X = x) $ (discrete case)
- **PDF** $ f_X(x) $ (continuous case)

Then the expectation of $ X $ itself is:

- **Discrete Case:**

  $$
  \mathbb{E}[X] = \sum_{x} x P(X = x)
  $$
  
- **Continuous Case:**

  $$
  \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) dx
  $$

This simply computes the **mean** or **average** value of $ X $ when sampled from its probability distribution.

---

**2. Expectation of a Function Over a Distribution**  
This extends the idea of expectation to **functions of a random variable**. Instead of computing the expectation of $ X $, we compute the expectation of some function **$ g(X) $**, which could be **nonlinear**.

The expectation of a function $ g(X) $ over a probability distribution is:

- **Discrete Case:**

  $$
  \mathbb{E}[g(X)] = \sum_{x} g(x) P(X = x)
  $$
  
- **Continuous Case:**

  $$
  \mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx
  $$

This formulation is useful when dealing with **moment calculations**, **variance computations**, and **statistical transformations**.

---

**Key Difference**  

| Concept | Expectation Over a Probability Distribution | Expectation of a Function Over a Distribution |
|---------|--------------------------------|--------------------------------|
| Definition | Computes the expectation of the random variable itself | Computes the expectation of a function of the random variable |
| Formula (Discrete) | $ \mathbb{E}[X] = \sum x P(X = x) $ | $ \mathbb{E}[g(X)] = \sum g(x) P(X = x) $ |
| Formula (Continuous) | $ \mathbb{E}[X] = \int x f_X(x) dx $ | $ \mathbb{E}[g(X)] = \int g(x) f_X(x) dx $ |
| Example | Mean of a normal distribution: $ \mathbb{E}[X] = \mu $ | Expected squared value: $ \mathbb{E}[X^2] = \sigma^2 + \mu^2 $ |
| Purpose | Computes the **average value** of a random variable | Computes the **average value of a transformed variable** |

---

**Example 1: Expectation Over a Probability Distribution (Mean of a Fair Die)**  
Let $ X $ be the result of rolling a fair six-sided die:

$$
P(X = x) =
\begin{cases}
\frac{1}{6}, & x \in \{1,2,3,4,5,6\} \\
0, & \text{otherwise}
\end{cases}
$$

The expectation (mean value) of $ X $ is:

$$
\mathbb{E}[X] = \sum_{x=1}^{6} x P(X = x)
$$

$$
= \frac{1}{6} (1 + 2 + 3 + 4 + 5 + 6) = 3.5
$$

This means the **average outcome of rolling the die** is **3.5**.

---

**Example 2: Expectation of a Function Over a Distribution (Expected Squared Value of a Fair Die)**  
Now, let's compute the expectation of $ g(X) = X^2 $:

$$
\mathbb{E}[X^2] = \sum_{x=1}^{6} x^2 P(X = x)
$$

$$
= \frac{1}{6} (1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2)
$$

$$
= \frac{1}{6} (1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17
$$

This result tells us that the **expected squared outcome** of rolling the die is about **15.17**, which is **not** the same as squaring the expected value:

$$
(\mathbb{E}[X])^2 = 3.5^2 = 12.25
$$

The gap between the two is exactly the variance:

$$
\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.92
$$

---

**Example 3: Expectation of a Function in a Continuous Distribution**  
Let $ X $ follow a standard **normal distribution** $ \mathcal{N}(0,1) $, meaning its PDF is:

$$
f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

To compute the expectation of $ X^2 $:

$$
\mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x) dx
$$

For a normal distribution $ X \sim \mathcal{N}(\mu, \sigma^2) $, it is a known result that:

$$
\mathbb{E}[X^2] = \sigma^2 + \mu^2
$$

For a standard normal distribution ($ \mu = 0, \sigma^2 = 1 $), this simplifies to:

$$
\mathbb{E}[X^2] = 1
$$

Again, this shows the difference between $ \mathbb{E}[X] = 0 $ and $ \mathbb{E}[X^2] = 1 $.

---

**Key Takeaways**  
- **Expectation over a probability distribution** finds the average of the **random variable**.
- **Expectation of a function over a distribution** finds the average of a **transformed random variable**.
- If $ g(X) = X $, the expectation of the function reduces to the expectation of the variable.
- These concepts are essential for **moments**, **variance**, and **transformations** in probability.

```
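
The fair-die numbers above, and the variance they imply, can be verified with exact arithmetic. This is a small sketch using `fractions.Fraction`, and the variable names are illustrative.

```python
from fractions import Fraction

die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# First moment E[X] and second moment E[X^2] as probability-weighted sums.
first_moment = sum(x * p for x, p in die_pmf.items())         # 7/2
second_moment = sum(x ** 2 * p for x, p in die_pmf.items())   # 91/6

# Var(X) = E[X^2] - (E[X])^2
variance = second_moment - first_moment ** 2                  # 35/12

print(first_moment, second_moment, variance)
```
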



```{admonition} What is chain rule of probability?
:class: note, dropdown
The **chain rule of probability** (also called the **product rule**) allows us to compute the joint probability of multiple events by breaking it down into conditional probabilities.

---

**Definition**  
For any $ n $ random variables $ X_1, X_2, \dots, X_n $, their joint probability can be decomposed as:

$$
P(X_1, X_2, \dots, X_n) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1, X_2) \dots P(X_n \mid X_1, X_2, \dots, X_{n-1})
$$

In particular, for **two** random variables:

$$
P(A, B) = P(A \mid B) P(B) = P(B \mid A) P(A)
$$

For **three** random variables:

$$
P(A, B, C) = P(A) P(B \mid A) P(C \mid A, B)
$$

This process extends to **any number of variables**.

---

**Intuition**  
The chain rule breaks down the **joint probability** into a sequence of **conditional probabilities**, explaining how each variable depends on the previous ones.

Example: Suppose we have three events:
- $ X_1 $ = "It rains"
- $ X_2 $ = "I carry an umbrella"
- $ X_3 $ = "I stay dry"

Using the chain rule:

$$
P(\text{Rain, Umbrella, Dry}) = P(\text{Rain}) P(\text{Umbrella} \mid \text{Rain}) P(\text{Dry} \mid \text{Rain, Umbrella})
$$

Each probability **conditions on the previous** event, showing how they are **linked**.

---

**Example: Probability of Drawing Cards**  
Consider drawing **three** cards from a deck **without replacement**:

- $ A $ = "First card is an Ace"
- $ B $ = "Second card is an Ace"
- $ C $ = "Third card is an Ace"

The probability of drawing three Aces is:

$$
P(A, B, C) = P(A) P(B \mid A) P(C \mid A, B)
$$

Given there are **4 Aces in 52 cards**, we calculate:

$$
P(A) = \frac{4}{52}
$$

If we already drew an Ace, only **3 Aces remain in 51 cards**:

$$
P(B \mid A) = \frac{3}{51}
$$

If two Aces are drawn, only **2 Aces remain in 50 cards**:

$$
P(C \mid A, B) = \frac{2}{50}
$$

So:

$$
P(A, B, C) = \frac{4}{52} \times \frac{3}{51} \times \frac{2}{50} = \frac{24}{132600} \approx 0.00018
$$

---

**Application in Machine Learning and Bayesian Networks**  
The chain rule is fundamental in:
- **Bayesian Networks**: Used to compute probabilities in graphical models.
- **Hidden Markov Models (HMMs)**: Used for **sequence modeling**.
- **Naive Bayes Classifier**: Assumes conditional independence to simplify computations.

For a Bayesian network with **nodes $ X_1, X_2, \dots, X_n $** structured in a **dependency graph**, the chain rule becomes:

$$
P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))
$$

where **Parents($ X_i $)** are the nodes that influence $ X_i $.

---

**Key Takeaways**  
- The **chain rule** expresses **joint probability** in terms of **conditional probabilities**.
- It helps in **breaking down complex probability calculations**.
- It is widely used in **Bayesian inference, machine learning, and probability theory**.
```
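
The three-Aces calculation is a direct product of conditional probabilities, and it can be cross-checked by counting hands. The sketch below uses exact fractions and `math.comb` (Python 3.8+). The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Chain rule: P(A, B, C) = P(A) * P(B | A) * P(C | A, B)
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)
p_third_ace_given_first_two = Fraction(2, 50)

p_three_aces = p_first_ace * p_second_ace_given_first * p_third_ace_given_first_two

# Cross-check by counting: 3-card hands made only of Aces over all 3-card hands.
p_by_counting = Fraction(comb(4, 3), comb(52, 3))

print(p_three_aces, float(p_three_aces))  # 1/5525, about 0.00018
```

Both routes give the same value, which is what the chain rule guarantees: the order of the draws does not change which hands count as a success.
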
```{admonition} Definition of Joint Probability
:class: note, dropdown

The **joint probability** of two or more random variables is the probability that all events occur simultaneously.

---

**Definition**  
For two random variables $ X $ and $ Y $, the **joint probability** is denoted as:

$$
P(X = x, Y = y)
$$

which represents the probability that **both** $ X = x $ and $ Y = y $ occur **together**.

For **multiple variables** $ X_1, X_2, \dots, X_n $, the joint probability is:

$$
P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)
$$

which represents the probability that all random variables take their respective values **at the same time**.

---

**Example: Rolling Two Dice**  
Let $ X $ and $ Y $ be the outcomes of rolling two fair six-sided dice.

- There are **36 possible outcomes** since each die has 6 sides.
- The probability of any specific outcome, e.g., $ (X = 2, Y = 5) $, is:

  $$
  P(X = 2, Y = 5) = \frac{1}{36}
  $$

since each of the 36 outcomes is equally likely.

---

**Computing Joint Probability Using Conditional Probability**  
Using the **chain rule**, joint probability can be computed as:

$$
P(X, Y) = P(X \mid Y) P(Y)
$$

or equivalently:

$$
P(X, Y) = P(Y \mid X) P(X)
$$

This expresses the joint probability in terms of **conditional probability**.

---

**Independent Events and Joint Probability**  
If $ X $ and $ Y $ are **independent**, then their joint probability simplifies to:

$$
P(X, Y) = P(X) P(Y)
$$

This means that knowing $ X $ does **not** affect the probability of $ Y $.

**Example: Two Independent Coin Flips**  
Let $ X $ and $ Y $ be the outcomes of two independent coin flips, where:

- $ P(X = H) = \frac{1}{2} $
- $ P(Y = H) = \frac{1}{2} $

Since the flips are independent:

$$
P(X = H, Y = H) = P(X = H) P(Y = H) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}
$$

---

**Joint Probability Distribution (JPD)**  
The **joint probability distribution** describes the probability of all possible combinations of $ X $ and $ Y $.

For **discrete random variables**, the JPD is represented as a **table**.

| $ X \backslash Y $ | $ Y = 0 $ | $ Y = 1 $ |
|------------------|---------|---------|
| $ X = 0 $ | $ P(0,0) $ | $ P(0,1) $ |
| $ X = 1 $ | $ P(1,0) $ | $ P(1,1) $ |

Each entry in the table represents a **joint probability**.

For **continuous random variables**, the JPD is defined using the **joint probability density function (PDF)**:

$$
P(a \leq X \leq b, c \leq Y \leq d) = \int_a^b \int_c^d f_{X,Y}(x, y) \, dy \, dx
$$

where $ f_{X,Y}(x, y) $ is the **joint PDF**.

---

**Marginal Probability from Joint Probability**  
The **marginal probability** of a single variable is found by **summing** (discrete case) or **integrating** (continuous case) over the other variable.

- **Discrete Case:**
  
  $$
  P(X = x) = \sum_{y} P(X = x, Y = y)
  $$

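- **Continuous Case:**
  
  $$
  f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy
  $$

This gives the distribution of $ X $ alone, regardless of the value of $ Y $. In the continuous case the result is the marginal density $ f_X(x) $ rather than a probability.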

---

**Key Takeaways**  
- **Joint probability** measures the probability of two or more events occurring together.
- It can be computed using **conditional probability** and the **chain rule**.
- **Independence** simplifies the computation as $ P(X, Y) = P(X) P(Y) $.
- The **joint probability distribution (JPD)** describes how multiple variables interact.
```
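
The two-dice example can be written as an explicit joint table, from which marginals and the independence condition follow by summation. This is a minimal sketch, and the dictionary layout and names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Joint PMF of two fair dice: each of the 36 ordered pairs is equally likely.
joint = {(x, y): Fraction(1, 36) for x, y in product(faces, repeat=2)}

# Marginals: sum the joint PMF over the other variable.
marginal_x = {x: sum(joint[(x, y)] for y in faces) for x in faces}
marginal_y = {y: sum(joint[(x, y)] for x in faces) for y in faces}

# Independence: P(X = x, Y = y) = P(X = x) * P(Y = y) for every pair.
independent = all(
    joint[(x, y)] == marginal_x[x] * marginal_y[y] for x, y in joint
)

print(joint[(2, 5)], marginal_x[2], independent)  # 1/36 1/6 True
```
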
```{admonition} Definition of Marginalization and Marginal Likelihood
:class: note, dropdown
**Marginalization** and **marginal likelihood** are related concepts in probability and Bayesian inference, both involving summing or integrating over hidden or unobserved variables. However, they serve different purposes.

---

**1. Marginalization**  

Marginalization refers to the process of obtaining the probability of a **subset** of random variables by summing or integrating over the remaining variables.

- **For discrete random variables**, marginalization is done by summing over all possible values of another variable:

  $$
  P(X = x) = \sum_{y} P(X = x, Y = y)
  $$

- **For continuous random variables**, marginalization is done by integrating the joint density over the unwanted variable:

  $$
  f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy
  $$

This process removes the dependency on the second variable, leaving only the probability distribution for the first variable.

---

**Example: Marginalization in a Joint Distribution**  

Consider a **joint probability table** for two discrete variables $ X $ and $ Y $:

| $ X \backslash Y $ | $ Y = 0 $ | $ Y = 1 $ | Marginal $ P(X) $ |
|------------------|---------|---------|--------------|
| $ X = 0 $ | $ 0.2 $ | $ 0.3 $ | $ 0.2 + 0.3 = 0.5 $ |
| $ X = 1 $ | $ 0.1 $ | $ 0.4 $ | $ 0.1 + 0.4 = 0.5 $ |

The **marginal probability** of $ X = 0 $ is:

$$
P(X = 0) = P(X = 0, Y = 0) + P(X = 0, Y = 1) = 0.2 + 0.3 = 0.5
$$

This removes the dependency on $ Y $, leaving only the probabilities for $ X $.

---

**2. Marginal Likelihood (Evidence in Bayesian Inference)**  

The **marginal likelihood**, also called the **evidence**, is the probability of observed data, **integrating out any hidden or latent variables**.

If $ X $ represents the observed data and $ Z $ is a latent (hidden) variable, the marginal likelihood is:

- **For discrete variables**:

  $$
  P(X) = \sum_{Z} P(X \mid Z) P(Z)
  $$

- **For continuous variables**:

  $$
  P(X) = \int P(X \mid Z) P(Z) \, dZ
  $$

The sum (or integral) runs over all possible values of the latent variable $ Z $, making $ P(X) $ a **weighted average of the likelihoods $ P(X \mid Z) $, with weights given by the prior $ P(Z) $**.

---

**Example: Bayesian Model Evidence**  

Suppose we are classifying an email as **spam ($ S $) or not spam ($ \neg S $)**, but we do not know in advance whether a given email is spam. Let:

- $ X $ = "email contains the word 'free'"
- $ S $ = "email is spam"
- $ P(X \mid S) = 0.8 $ (80% of spam emails contain "free")
- $ P(X \mid \neg S) = 0.1 $ (only 10% of non-spam emails contain "free")
- $ P(S) = 0.3 $, meaning 30% of emails are spam
- $ P(\neg S) = 0.7 $, meaning 70% of emails are not spam

Using **the law of total probability**, the marginal likelihood of $ X $ (observing "free") is:

$$
P(X) = P(X \mid S) P(S) + P(X \mid \neg S) P(\neg S)
$$

$$
P(X) = (0.8 \times 0.3) + (0.1 \times 0.7) = 0.24 + 0.07 = 0.31
$$

This marginal likelihood helps in **Bayesian inference**, particularly in **Bayes’ theorem**:

$$
P(S \mid X) = \frac{P(X \mid S) P(S)}{P(X)}
$$

which gives the probability of an email being spam given that it contains "free."

---

**Differences Between Marginalization and Marginal Likelihood**  

| Feature | Marginalization | Marginal Likelihood |
|---------|----------------|---------------------|
| Definition | Computes the probability of a subset of variables by summing or integrating over the others | Computes the probability of observed data by integrating out hidden variables |
| Purpose | To remove dependencies on other variables and find marginal probabilities | Used in **Bayesian inference** to compute the evidence for a model |
| Formula (Discrete) | $ P(X) = \sum_{Y} P(X, Y) $ | $ P(X) = \sum_{Z} P(X \mid Z) P(Z) $ |
| Formula (Continuous) | $ P(X) = \int P(X, Y) \, dY $ | $ P(X) = \int P(X \mid Z) P(Z) \, dZ $ |
| Application | Used in probability theory, graphical models | Used in Bayesian statistics and machine learning |
| Example | Summing over joint probabilities to get $ P(X) $ | Computing $ P(X) $ by integrating over a latent variable $ Z $ |

---

**Key Takeaways**  
- **Marginalization** computes the probability of a variable by removing dependencies on other variables.
- **Marginal probability** is found by summing (discrete case) or integrating (continuous case) over the other variables.
- **Marginal likelihood (evidence)** is used in **Bayesian inference** to find the probability of observed data **regardless of hidden variables**.
- These concepts are widely used in **probabilistic graphical models, Bayesian networks, and machine learning**.

```
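
The two worked examples above (the joint table and the spam filter) can be reproduced with a few lines of Python. This is a minimal sketch using only the numbers given in the note; the helper name `marginal` and the variable names are illustrative choices, not part of any library.

```python
# Joint probability table P(X, Y) from the example above, keyed by (x, y).
joint = {
    (0, 0): 0.2, (0, 1): 0.3,
    (1, 0): 0.1, (1, 1): 0.4,
}

def marginal(joint, axis):
    """Sum the joint table over every variable except the one at index `axis`."""
    result = {}
    for outcome, prob in joint.items():
        key = outcome[axis]
        result[key] = result.get(key, 0.0) + prob
    return result

p_x = marginal(joint, axis=0)  # sums over Y, leaving P(X)
p_y = marginal(joint, axis=1)  # sums over X, leaving P(Y)

# Spam example: marginal likelihood (evidence) via the law of total probability,
# then the posterior via Bayes' theorem.
p_free_given_spam = 0.8
p_free_given_ham = 0.1
p_spam = 0.3
p_ham = 1.0 - p_spam

evidence = p_free_given_spam * p_spam + p_free_given_ham * p_ham
posterior_spam = p_free_given_spam * p_spam / evidence

print("P(X):", p_x)
print("P(Y):", p_y)
print("P('free'):", evidence)
print("P(spam | 'free'):", posterior_spam)
```

The posterior comes out to roughly 0.77: seeing the word "free" raises the probability of spam from the prior of 0.3 to about 77%.
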
```{admonition} What is the reparametrization trick and why is it important?
:class: note, dropdown

The **reparametrization trick** is a technique used in **variational inference**, particularly in **variational autoencoders (VAEs)**, to enable gradient-based optimization of stochastic objectives. It allows gradients to pass through **random sampling operations**, making it possible to optimize models using **backpropagation**.

---

**1. The Problem: Why Do We Need the Reparametrization Trick?**  

In **stochastic neural networks**, we often need to optimize a loss function that involves **sampling from a probability distribution**. A common scenario is optimizing the **expected value** of a function:

$$
\mathbb{E}_{z \sim q(z \mid x)} [f(z)]
$$

This notation represents the **expectation of a function over a probability distribution**. It means that we are computing the expectation of the function $ f(z) $ with respect to the probability distribution $ q(z \mid x) $.

- **Expectation Over a Probability Distribution:**  
  The expectation is taken **over the distribution** $ q(z \mid x) $, meaning that we are integrating over all possible values of $ z $ weighted by their probability under $ q(z \mid x) $:

  $$
  \mathbb{E}_{z \sim q(z \mid x)} [f(z)] = \int f(z) q(z \mid x) dz
  $$

- **Expectation of a Function Over the Distribution:**  
  Here, $ f(z) $ is a function of $ z $, and we want to compute its average value under the probability distribution $ q(z \mid x) $. Because $ f $ is typically a nonlinear function (for example, a neural network decoder), this expectation usually has no closed form, so we approximate it with **Monte Carlo sampling**.

However, **naively sampling $ z \sim q(z \mid x) $ blocks backpropagation**: the sample is produced by a random draw, so there is no gradient path from $ z $ back to the parameters of $ q $. This makes it difficult to train models that involve such expectations.

---

**2. The Reparametrization Trick: A Solution**  

The reparametrization trick **re-writes the sampling process** in a way that allows gradients to be computed. Instead of directly sampling $ z \sim q(z \mid x) $, we express $ z $ as a **deterministic function** of some random noise $ \epsilon $ and parameters $ \mu, \sigma $:

$$
z = \mu + \sigma \cdot \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1)
$$

This trick **separates** the randomness (introduced by $ \epsilon $) from the learnable parameters ($ \mu, \sigma $), allowing **gradients to flow through $ \mu $ and $ \sigma $**.

Now, instead of optimizing:

$$
\mathbb{E}_{z \sim q(z \mid x)} [f(z)]
$$

we optimize:

$$
\mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)} [f(\mu + \sigma \epsilon)]
$$

which can be **differentiated w.r.t.** $ \mu $ and $ \sigma $ using standard **gradient-based methods**.

---

**3. Example: Variational Autoencoder (VAE)**  

A **Variational Autoencoder (VAE)** uses the reparametrization trick to learn a probabilistic latent representation.

1. Instead of sampling directly from $ q(z \mid x) = \mathcal{N}(z; \mu, \sigma^2) $, we sample from a standard normal distribution:

   $$
   \epsilon \sim \mathcal{N}(0,1)
   $$

2. We then reparametrize:

   $$
   z = \mu + \sigma \cdot \epsilon
   $$

3. The loss function includes a **reconstruction loss**, which depends on the sampled $ z $, and a **KL-divergence term**; training requires gradients of both with respect to $ \mu $ and $ \sigma $.

4. The reparametrization trick allows **gradient updates to propagate through $ \mu $ and $ \sigma $**.

---

**4. Why Is the Reparametrization Trick Important?**  

✔ **Enables Backpropagation Through Stochastic Nodes**  
   - Without this trick, gradients cannot flow through sampling operations.  
   - It enables training probabilistic models like VAEs using **gradient descent**.

✔ **Reduces Variance in Gradient Estimates**  
   - Compared to the score-function (REINFORCE) estimator, it typically provides more stable, lower-variance gradients.

✔ **Used in Bayesian Deep Learning and Reinforcement Learning**  
   - Used to train **Bayesian neural networks**, which learn uncertainty in deep learning.
   - Applied in reinforcement learning methods with continuous actions (e.g., Soft Actor-Critic), where actions are sampled through a reparametrized policy.

---

**5. Limitations and Extensions**  

- **Does not directly apply to discrete random variables**  
  - A discrete sample is not a differentiable function of the parameters, so relaxations such as the **Gumbel-Softmax trick** are used instead.

- **Assumes reparametrizable distributions**  
  - The trick works well for distributions like **Gaussian**, but for more complex distributions, alternative methods (e.g., normalizing flows) are needed.

---

**6. Understanding the Expectation in the Given Equation**  

Recall the expectation we want to optimize:

$$
\mathbb{E}_{z \sim q(z \mid x)} [f(z)]
$$

- **Expectation Over a Probability Distribution:**  
  The expectation is taken over the **latent variable distribution** $ q(z \mid x) $, meaning that we integrate over all possible values of $ z $ weighted by their probability under $ q(z \mid x) $.

  $$
  \mathbb{E}_{z \sim q(z \mid x)} [f(z)] = \int f(z) q(z \mid x) dz
  $$

- **Expectation of a Function Over the Distribution:**  
  The function $ f(z) $ could represent an objective function or loss that we are optimizing. The distribution $ q(z \mid x) $ itself is known (we define it), but the integral is typically **intractable**, so we use **Monte Carlo estimation** to approximate this expectation:

  $$
  \frac{1}{N} \sum_{i=1}^{N} f(z_i), \quad z_i \sim q(z \mid x)
  $$

  However, **direct sampling blocks gradient flow**, which is why the **reparametrization trick** is crucial.

---

**Key Takeaways**  
- **Reparametrization Trick** allows **gradient-based learning** in models that involve **stochastic sampling**.  
- Converts sampling into a **deterministic function** of noise and parameters.  
- Used in **VAEs**, **Bayesian deep learning**, and **reinforcement learning**.  
- **Essential for optimizing probabilistic models using backpropagation**.  
- The expectation in $ \mathbb{E}_{z \sim q(z \mid x)} [f(z)] $ is over the **distribution** $ q(z \mid x) $, meaning we integrate over all possible values of $ z $ to compute the expected value of $ f(z) $.

```
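
The reparametrized gradient can be checked numerically on a case where the answer is known. The sketch below is an illustration under simple assumptions: it takes $ f(z) = z^2 $ (chosen here only because $ \mathbb{E}[z^2] = \mu^2 + \sigma^2 $ has exact gradients $ 2\mu $ and $ 2\sigma $), and the function name `reparam_gradients` is hypothetical. It applies the chain rule through $ z = \mu + \sigma \epsilon $ by hand instead of using an autodiff library.

```python
import random

def reparam_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[f(z)] for f(z) = z**2,
    using the reparametrization z = mu + sigma * eps with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # randomness, independent of mu and sigma
        z = mu + sigma * eps        # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z             # derivative of f(z) = z**2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

mu, sigma = 1.5, 0.5
est_mu, est_sigma = reparam_gradients(mu, sigma)

# Exact values for comparison: E[z^2] = mu^2 + sigma^2.
print("d/dmu    estimate:", est_mu, " exact:", 2 * mu)
print("d/dsigma estimate:", est_sigma, " exact:", 2 * sigma)
```

In a real model the hand-written chain rule is replaced by automatic differentiation, but the structure is the same: the noise `eps` is drawn first, and everything after it is an ordinary differentiable computation.
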
```{admonition} What is meant by log likelihood? Why do we maximize it in neural network training?
:class: note, dropdown

**1. What is Log Likelihood?**  

The **log likelihood** is a fundamental concept in probability and machine learning, used to estimate model parameters by **maximizing the probability of observed data**. It is commonly applied in **maximum likelihood estimation (MLE)** and is the foundation of many loss functions in deep learning.

**Likelihood Function**  
Given a dataset $ \mathcal{D} = \{ x_1, x_2, \dots, x_n \} $, where each $ x_i $ is a data point, and a probabilistic model with parameters $ \theta $, the **likelihood function** is defined as:

$$
L(\theta) = P(\mathcal{D} \mid \theta)
$$

Here:
- $ P(\mathcal{D} \mid \theta) $ represents the probability of observing the data $ \mathcal{D} $ given the model parameters $ \theta $.
- The goal of **maximum likelihood estimation (MLE)** is to find the parameters $ \theta^* $ that maximize this probability:

  $$
  \theta^* = \arg\max_{\theta} L(\theta)
  $$

Since models often assume **independent** data points, the likelihood function is expressed as a product:

$$
L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta)
$$

where:
- $ P(x_i \mid \theta) $ is the probability of the individual data point $ x_i $ under the model.
- The product arises because we assume that each data point is **independent and identically distributed (i.i.d.)**.

**Log Likelihood Function**  
The log likelihood is simply the **logarithm** of the likelihood function:

$$
\log L(\theta) = \log P(\mathcal{D} \mid \theta)
$$

Using the i.i.d. assumption:

$$
\log L(\theta) = \log \prod_{i=1}^{n} P(x_i \mid \theta)
$$

Applying the **logarithm property** ($ \log ab = \log a + \log b $):

$$
\log L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)
$$

Here:
- The **logarithm transforms the product into a sum**, which is easier to compute and numerically more stable.
- Instead of multiplying many small probabilities (which can cause numerical underflow), we sum their log probabilities.

Because the logarithm is strictly increasing, the same $ \theta $ maximizes both functions. Thus, **maximizing likelihood is equivalent to maximizing the log likelihood**, which simplifies optimization.

---

**2. Why Do We Maximize Log Likelihood in Neural Networks?**  

Neural networks often predict probabilities. To train a model, we want to maximize the probability assigned to the correct data points, i.e., maximize:

$$
P(\mathcal{D} \mid \theta)
$$

Since working with probabilities directly can be unstable (due to small values), we use the log likelihood:

$$
\theta^* = \arg\max_{\theta} \log L(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)
$$

This is the objective function used in probabilistic models.

---

**3. Example: Log Likelihood in Classification (Softmax + Cross-Entropy Loss)**  

In **neural network classification**, we model the probability of each class using the **softmax function**:

$$
P(y \mid x, \theta) = \frac{\exp(f_{\theta}(x)_y)}{\sum_{j} \exp(f_{\theta}(x)_j)}
$$

where:
- $ f_{\theta}(x)_y $ is the predicted score for class $ y $.
- The denominator ensures all class probabilities sum to **1**.

The likelihood for a dataset $ \mathcal{D} $ is:

$$
L(\theta) = \prod_{i=1}^{n} P(y_i \mid x_i, \theta)
$$

Taking the **log likelihood**:

$$
\log L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta)
$$

Maximizing this log likelihood is **equivalent to minimizing the cross-entropy loss** (the negative log likelihood):

$$
\mathcal{L}(\theta) = - \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta)
$$

Thus, **maximizing log likelihood is the same as minimizing cross-entropy loss** in classification tasks.

---

**4. Why Log Likelihood is Used in Training?**  

✔ **Converts Products into Sums**  
   - Probabilities are small values between **0 and 1**.
   - Multiplying many probabilities leads to **numerical underflow**.
   - Taking the **log** prevents this issue by converting the product into a sum.

✔ **Easier Optimization**  
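   - A sum of log terms has gradients that decompose over data points, which suits minibatch gradient descent. For simple models such as logistic regression the negative log likelihood is even **convex**; for deep neural networks it generally is not.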

✔ **Directly Related to Cross-Entropy Loss**  
   - In classification, **maximizing log likelihood = minimizing cross-entropy loss**.

✔ **Has a Probabilistic Interpretation**  
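   - Maximizing log likelihood makes the model assign **high probability to the observed data**; how well this generalizes to new data still depends on model capacity and regularization.

✔ **Log Likelihood Gradient Helps in Backpropagation**  
   - With softmax outputs, the gradient of the cross-entropy loss with respect to the scores is simply the predicted probabilities minus the one-hot label, which gives well-behaved updates in gradient descent.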

---

**5. Key Takeaways**  

- **Log likelihood** is the logarithm of the likelihood function, used to estimate parameters in probabilistic models.  
- **Maximizing log likelihood** finds the best parameters that make the observed data most probable.  
- **In neural networks**, log likelihood optimization is equivalent to minimizing **cross-entropy loss** for classification.  
- **Computationally stable** and helps avoid underflow in probability calculations.  
- **Essential in probabilistic deep learning models**, such as **VAEs, Bayesian neural networks, and language models**.

```
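
Two of the points above are easy to verify directly: the underflow problem with raw likelihoods, and the fact that cross-entropy is just the negative log probability of the correct class. The following is a small sketch; the scores in `scores` are made-up numbers standing in for the network outputs $ f_{\theta}(x) $, and the helper names are illustrative.

```python
import math

# 1) Underflow: a product of many small probabilities collapses to 0.0,
#    while the sum of their logs stays representable.
probs = [0.01] * 2000
product = 1.0
for p in probs:
    product *= p
log_likelihood = sum(math.log(p) for p in probs)
print("product of probabilities:", product)
print("sum of log probabilities:", log_likelihood)

# 2) Softmax + negative log likelihood (cross-entropy) for one example.
def log_softmax(scores):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def cross_entropy(scores, label):
    """Negative log probability assigned to the correct class."""
    return -log_softmax(scores)[label]

scores = [2.0, 0.5, -1.0]   # hypothetical network outputs, one per class
loss_if_label_0 = cross_entropy(scores, label=0)  # the highest-scoring class
loss_if_label_2 = cross_entropy(scores, label=2)  # the lowest-scoring class
print("loss when the true class is 0:", loss_if_label_0)
print("loss when the true class is 2:", loss_if_label_2)
```

The loss is small when the true label is the class with the highest score and large when it is the class with the lowest score, which is exactly the behavior maximum likelihood asks for.
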
```{admonition} What is KL Divergence and why is it important?
:class: note, dropdown

**1. What is KL Divergence?**  

**Kullback-Leibler (KL) Divergence** is a fundamental concept in probability theory and machine learning. It measures how one probability distribution differs from another. In other words, it quantifies the **information loss** when we approximate a true distribution with another.

For two probability distributions:
- **True distribution**: $ P(x) $
- **Approximate distribution**: $ Q(x) $

The **KL divergence** is defined as:

$$
D_{\text{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \quad \text{(discrete case)}
$$

or, for continuous distributions:

$$
D_{\text{KL}}(P \parallel Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx
$$

where:
- $ P(x) $ is the **true** distribution (e.g., the actual data distribution).
- $ Q(x) $ is the **approximate** distribution (e.g., a model trying to approximate $ P $).
- The **log ratio** measures how much $ P(x) $ and $ Q(x) $ diverge at each point.

The **KL divergence is always non-negative**:

$$
D_{\text{KL}}(P \parallel Q) \geq 0
$$

with equality ($ D_{\text{KL}}(P \parallel Q) = 0 $) if and only if **$ P(x) = Q(x) $ for all $ x $**.

---

**2. Intuition Behind KL Divergence**  

✔ **Measures Information Loss**  
   - If we use $ Q(x) $ to approximate $ P(x) $, KL divergence tells us **how much information is lost**.

✔ **Asymmetry: $ D_{\text{KL}}(P \parallel Q) \neq D_{\text{KL}}(Q \parallel P) $**  
   - KL divergence is **not symmetric**, meaning it is **not a true distance metric**.

✔ **Expectation of Log Difference**  
   - The term $ \log \frac{P(x)}{Q(x)} = \log P(x) - \log Q(x) $ is the difference between the log-probabilities of the two distributions.
   - KL divergence takes the **expectation under $ P(x) $**, meaning that the true distribution **weights the difference**.

✔ **Lower KL = Better Approximation**  
   - If $ D_{\text{KL}}(P \parallel Q) $ is small, $ Q(x) $ is a good approximation of $ P(x) $.
   - If KL is large, $ Q(x) $ is far from $ P(x) $, meaning a poor approximation.

---

**3. Why is KL Divergence Important?**  

KL divergence is widely used in **machine learning, statistics, and deep learning** for **probability estimation, model optimization, and generative modeling**.

✔ **Used in Variational Inference**  
   - In **variational autoencoders (VAEs)**, KL divergence is used to **regularize** the latent space by forcing the approximate posterior $ Q(z \mid x) $ to be close to a prior $ P(z) $.

✔ **Used in Bayesian Deep Learning**  
   - KL divergence measures how much information is lost when using an **approximate posterior** instead of the **true Bayesian posterior**.

✔ **Used in Reinforcement Learning (RL)**  
   - In **policy optimization**, KL divergence ensures that updates do not drastically change the policy distribution.

✔ **Related to Cross-Entropy Loss**  
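   - Cross-entropy decomposes as $ H(P, Q) = H(P) + D_{\text{KL}}(P \parallel Q) $. Since $ H(P) $ does not depend on the model, minimizing cross-entropy loss in classification is the same as minimizing KL divergence.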

---

**4. Example: KL Divergence Between Two Normal Distributions**  

For two Gaussian distributions:

- **True distribution**: $ P(x) = \mathcal{N}(\mu_1, \sigma_1^2) $
- **Approximate distribution**: $ Q(x) = \mathcal{N}(\mu_2, \sigma_2^2) $

The KL divergence is:

$$
D_{\text{KL}}(P \parallel Q) =
\log \frac{\sigma_2}{\sigma_1} +
\frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
$$

where:
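- The terms $ \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2}{2\sigma_2^2} - \frac{1}{2} $ measure the mismatch in **variance**.
- The term $ \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2} $ measures the difference in **mean**, scaled by the variance of $ Q $.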
- If $ \mu_1 = \mu_2 $ and $ \sigma_1 = \sigma_2 $, then $ D_{\text{KL}}(P \parallel Q) = 0 $.

---

**5. Symmetric Alternative: Jensen-Shannon (JS) Divergence**  

Since KL divergence is **not symmetric**, an alternative is **Jensen-Shannon divergence (JS divergence)**:

$$
D_{\text{JS}}(P \parallel Q) = \frac{1}{2} D_{\text{KL}}(P \parallel M) + \frac{1}{2} D_{\text{KL}}(Q \parallel M)
$$

where:

$$
M(x) = \frac{1}{2} (P(x) + Q(x))
$$

JS divergence **is symmetric** and **bounded between 0 and $ \log 2 $** (between 0 and 1 when the logarithm is taken in base 2), making it useful for comparing distributions.

---

**6. Key Takeaways**  

✔ **KL divergence measures the difference between two probability distributions.**  
✔ **It quantifies how much information is lost when using $ Q(x) $ instead of $ P(x) $.**  
✔ **Lower KL means a better approximation.**  
✔ **Used in VAEs, Bayesian deep learning, RL, and probability models.**  
✔ **Not symmetric: $ D_{\text{KL}}(P \parallel Q) \neq D_{\text{KL}}(Q \parallel P) $.**  
✔ **JS divergence is a symmetric alternative.**


```
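
The definitions above translate almost line by line into code. The sketch below uses natural logarithms and assumes $ Q(x) > 0 $ wherever $ P(x) > 0 $ (otherwise the KL divergence is infinite); the function names and the two example distributions are illustrative choices.

```python
import math

def kl_discrete(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists (natural log).
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_discrete(p, q):
    """Jensen-Shannon divergence: average of two KL terms against the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_discrete(p, m) + 0.5 * kl_discrete(q, m)

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print("KL(P||Q):", kl_discrete(p, q))
print("KL(Q||P):", kl_discrete(q, p))   # differs from KL(P||Q): KL is asymmetric
print("JS(P,Q): ", js_discrete(p, q))
print("JS(Q,P): ", js_discrete(q, p))   # same as JS(P,Q): JS is symmetric
print("KL between identical Gaussians:", kl_gaussian(0.0, 1.0, 0.0, 1.0))
print("KL(N(0,1) || N(1,4)):", kl_gaussian(0.0, 1.0, 1.0, 2.0))
```

With natural logarithms the JS value always stays below $ \log 2 \approx 0.693 $.
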

## Forward Process

![image.png](figures/gaussian_diff.png)

```{admonition} How do joint and marginal distributions come into the picture here?
:class: note, dropdown

**Understanding Equation (1) in the Context of Joint and Marginal Distributions**  

1. **Understanding Equation (1) and the Joint Distribution**  

   The given equation describes a **stochastic process** where noise is added to a random variable iteratively:

   $$
   x_{t+1} := x_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma^2).
   $$

   Here:
   - $ x_0 $ is the **original data** drawn from the true data distribution $ p^*(x_0) $.
   - $ x_{t+1} $ is generated **recursively** from $ x_t $ by adding Gaussian noise $ \eta_t $.
   - $ \eta_t $ follows a **normal distribution** $ \mathcal{N}(0, \sigma^2) $.
   - $ \sigma^2 $ controls the scale of noise added at each step.

   This **iterative noise addition** results in a **joint distribution** over the entire sequence:

   $$
   p(x_0, x_1, \dots, x_T).
   $$

   **Why Does This Define a Joint Distribution?**  
   A **joint distribution** describes the probability of all random variables occurring together. The equation defines a **Markov chain**, where:

   $$
   p(x_0, x_1, \dots, x_T) = p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}).
   $$

   Each transition follows a **Gaussian conditional probability**:

   $$
   p(x_t \mid x_{t-1}) = \mathcal{N}(x_t \mid x_{t-1}, \sigma^2).
   $$

   This means:
   - The probability of a full trajectory $ (x_0, x_1, ..., x_T) $ is determined by the **initial distribution** $ p(x_0) $ and the **conditional distributions** $ p(x_t \mid x_{t-1}) $.
   - Since each $ x_t $ depends only on $ x_{t-1} $ and not on earlier states, this process satisfies the **Markov property**.
   - Expanding one transition step mathematically:

     $$
     x_t = x_{t-1} + \eta_{t-1}, \quad \text{where } \eta_{t-1} \sim \mathcal{N}(0, \sigma^2).
     $$

     This shows that $ x_t $ is sampled from a Gaussian distribution centered at $ x_{t-1} $ with variance $ \sigma^2 $.

2. **How the Marginal Distribution Appears in This Process**  

   A **marginal distribution** is obtained by summing (discrete case) or integrating (continuous case) over unwanted variables in a joint distribution.

   To find the **marginal distribution** of $ x_t $, we integrate out previous states:

   $$
   p_t(x_t) = \int p(x_t \mid x_{t-1}) p_{t-1}(x_{t-1}) \, dx_{t-1}.
   $$

   By repeating this process, we marginalize out all prior steps:

   $$
   p_t(x_t) = \int p(x_t \mid x_{t-1}) p(x_{t-1} \mid x_{t-2}) \dots p(x_1 \mid x_0) p(x_0) \, dx_0 \dots dx_{t-1}.
   $$

   This equation shows that the marginal distribution $ p_t(x_t) $ depends on how noise **accumulates** over time.

3. **What Happens as $ T $ Grows Large?**  

   Each step adds independent Gaussian noise, so after $ t $ steps $ x_t $ is distributed as $ x_0 $ plus Gaussian noise of total variance $ t\sigma^2 $. As $ t $ increases, the accumulated noise dominates the data, and the **marginal distribution** $ p_T(x_T) $ becomes approximately **Gaussian**, regardless of the original data distribution $ p^*(x_0) $:

   $$
   p_T(x_T) \approx \mathcal{N}(0, T\sigma^2 I).
   $$

   This means that after **many diffusion steps**, the data is **almost completely transformed into Gaussian noise**, losing its original structure.

4. **Summary of Key Points**  

   ✔ **Equation (1) defines a joint distribution** because it constructs a **probabilistic chain** over multiple variables with **conditional dependencies**.  
   ✔ **The marginal distribution** of each $ x_t $ is obtained by integrating out previous steps from the joint distribution.  
   ✔ As the process progresses, the **marginal distribution of $ x_T $ becomes approximately Gaussian** (with variance growing as $ T\sigma^2 $), meaning the data is transformed into almost pure noise.  

```
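
The forward process is simple enough to simulate directly. The sketch below is a one-dimensional toy under stated assumptions: $ x_0 $ is drawn from a two-point distribution on $ \{-1, +1\} $ (chosen only because it is clearly non-Gaussian and has variance 1), and the step count and $ \sigma $ are arbitrary. Since independent noises add, the variance of $ x_T $ should be close to $ \mathrm{Var}(x_0) + T\sigma^2 $.

```python
import random

def simulate_forward(num_samples=10_000, num_steps=50, sigma=0.5, seed=0):
    """Run x_{t+1} = x_t + eta_t for `num_steps` steps on many samples.
    x_0 is drawn from a toy two-point distribution {-1, +1} (an assumption)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(num_samples):
        x = rng.choice([-1.0, 1.0])       # x_0 ~ p*, clearly non-Gaussian
        for _ in range(num_steps):
            x += rng.gauss(0.0, sigma)    # add eta_t ~ N(0, sigma^2)
        finals.append(x)
    return finals

def mean_and_variance(values):
    """Sample mean and (population) variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

num_steps, sigma = 50, 0.5
x_T = simulate_forward(num_steps=num_steps, sigma=sigma)
mean_T, var_T = mean_and_variance(x_T)

# Var(x_0) = 1 for the two-point distribution, and each step adds sigma^2.
expected_var = 1.0 + num_steps * sigma ** 2
print("mean of x_T:", mean_T)
print("variance of x_T:", var_T, " expected:", expected_var)
```

The two modes of $ x_0 $ are no longer visible in the samples of $ x_T $: the accumulated noise has a standard deviation of about 3.5, much larger than the spacing between the modes.
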

## Reverse Process

We have a way to start with a data point $x_0$ and reach the point $x_t$ by following the forward diffusion noise addition. Now let's use this information to formulate the reverse diffusion process as a problem statement.

![image.png](figures/rev_diff.png)

```{admonition} More details on the reverse process
:class: note, dropdown
1. **Understanding the Forward and Reverse Diffusion Process**  

   The **forward diffusion process** follows the equation:

   $$
   x_t = x_{t-1} + \sqrt{\beta_t} \eta, \quad \eta \sim \mathcal{N}(0, I).
   $$

   - Here, **$ \beta_t $ is the time-dependent noise variance** set by the noise schedule: the variance can change at each step, while the mean of the transition is always $ x_{t-1} $.
   - In other words, each step retains the previous state and adds Gaussian noise on top of it.

   In contrast, the **reverse diffusion process** aims to recover $ x_{t-1} $ given $ x_t $, by modeling the reverse conditional as a Gaussian:

   $$
   p(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1} \mid \mu_{\theta}(x_t), \sigma_t^2 I).
   $$

   - The **mean function $\mu_{\theta}(x_t)$ is learned** ($ \theta $ denotes the parameters of the model); unlike the forward mean, it is not known in advance.
   - The **variance $\sigma_t^2$ can be learned or fixed**, depending on the formulation.

2. **What Happens When the Reverse Process is Not Gaussian?**  

   If the conditional probability **$ p(x_{t-1} \mid x_t) $ is not Gaussian**, then we cannot directly sample from a simple normal distribution. Instead, we must compute:

   $$
   p(x_{t-1} \mid x_t) = \frac{p(x_t \mid x_{t-1}) p(x_{t-1})}{p(x_t)}.
   $$

   - **$ p(x_t \mid x_{t-1}) $** is the known forward transition (Gaussian).
   - **$ p(x_t) $** is the marginal distribution obtained by integrating over all possible previous states:

     $$
     p(x_t) = \int p(x_t \mid x_{t-1}) p(x_{t-1}) \, dx_{t-1}.
     $$

   - This requires solving an **intractable integral** over all possible prior states, making direct computation difficult.

3. **Equation to be Solved for Reverse Diffusion**  

   Since direct computation of **$ p(x_{t-1} \mid x_t) $** is intractable, we need to approximate it. The equation to solve in the general case is:

   $$
   p(x_{t-1} \mid x_t) \propto p(x_t \mid x_{t-1}) p(x_{t-1}).
   $$

   - If **$ p(x_{t-1}) $** is not Gaussian, this distribution can be **complex and multimodal**, and modeling it exactly would require a more expressive generative model (e.g., **normalizing flows or energy-based models**).
   - In such cases, sampling from **$ p(x_{t-1} \mid x_t) $** is difficult because we do not have a closed-form solution.

4. **How the Reverse Process Simplifies If It Is Gaussian**  

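   If we assume that **$ p(x_{t-1}) $ is also Gaussian**, the product of two Gaussians in Bayes' rule is again Gaussian. In practice $ p(x_{t-1}) $ is generally not Gaussian, but when the per-step noise $ \beta_t $ is small, the reverse conditional is still approximately Gaussian. In both cases the conditional probability takes the form: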

   $$
   p(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1} \mid \mu_{\theta}(x_t), \sigma_t^2 I).
   $$

   - This means the entire reverse process can be modeled using a **deep neural network** that predicts the **mean function $\mu_{\theta}(x_t)$**.
   - The **variance $\sigma_t^2$ can either be learned or set using a predefined noise schedule**.

5. **Key Takeaways**  

   ✔ **The forward transition has a known mean ($ x_{t-1} $) and a per-step noise variance $ \beta_t $ set by the noise schedule.**  
   ✔ **The reverse process does not have a fixed mean; instead, it must be learned using a function $ \mu_{\theta}(x_t) $.**  
   ✔ **If the reverse process is non-Gaussian, we must marginalize over all previous states, requiring intractable integrations.**  
   ✔ **If the reverse process is Gaussian, it simplifies to a normal distribution with a learnable mean and variance, making training feasible.**    
```
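
The Gaussian case can be verified numerically. The sketch below assumes a one-dimensional toy setting in which $ x_{t-1} \sim \mathcal{N}(m, s^2) $ (the names `prior_mean` and `prior_var` and all numeric values are illustrative). It compares the standard closed-form Gaussian posterior with a brute-force application of Bayes' rule on a grid, where the normalizing sum plays the role of the marginal $ p(x_t) $.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_closed_form(x_t, prior_mean, prior_var, beta):
    """Mean and variance of p(x_{t-1} | x_t) when x_{t-1} ~ N(prior_mean, prior_var)
    and x_t = x_{t-1} + sqrt(beta) * eta with eta ~ N(0, 1)."""
    post_var = prior_var * beta / (prior_var + beta)
    post_mean = (beta * prior_mean + prior_var * x_t) / (prior_var + beta)
    return post_mean, post_var

def posterior_on_grid(x_t, prior_mean, prior_var, beta, lo=-10.0, hi=10.0, n=20_001):
    """Bayes' rule done numerically: normalize p(x_t | x_{t-1}) p(x_{t-1}) on a grid."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    unnorm = [normal_pdf(x_t, x, beta) * normal_pdf(x, prior_mean, prior_var)
              for x in grid]
    evidence = sum(unnorm) * step     # approximates the marginal p(x_t)
    post = [u / evidence for u in unnorm]
    mean = sum(x * p for x, p in zip(grid, post)) * step
    var = sum((x - mean) ** 2 * p for x, p in zip(grid, post)) * step
    return mean, var

x_t, prior_mean, prior_var, beta = 1.2, 0.0, 1.0, 0.1
closed = posterior_closed_form(x_t, prior_mean, prior_var, beta)
numeric = posterior_on_grid(x_t, prior_mean, prior_var, beta)
print("closed form (mean, var):", closed)
print("grid Bayes  (mean, var):", numeric)
```

The grid approach only works here because the toy problem is one-dimensional. For images, the same integral runs over every pixel at once, which is why the reverse mean has to be learned instead.
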

![](figures/small_sigma_more_gauss_rev.png)

![](figures/rev_diff1.png)

**1. Why Do We Learn Only the Mean?**  

The goal of the reverse process is to recover **$ x_{t-1} $** from **$ x_t $**. However, directly modeling the full conditional probability **$ p(x_{t-1} \mid x_t) $** can be difficult. Instead, we focus on learning the **mean** of this distribution, denoted as **$ \mu_{t-1}(x_t) $**: under a **Gaussian assumption** with a known (fixed) variance, the mean is all that is needed to describe the distribution.

Thus, instead of learning the entire probability distribution **$ p(x_{t-1} \mid x_t) $**, we approximate it using the expectation:

$$
\mu_{t-1}(z) := \mathbb{E}[x_{t-1} \mid x_t = z]
$$

This means that given **a specific value of $ x_t $**, the best estimate of $ x_{t-1} $ is its expected value.

---

**2. Why Are We Taking an Expectation?**  

Expectation appears because in the **reverse diffusion process**, each step is **stochastic**, meaning multiple values of **$ x_{t-1} $** could lead to the same $ x_t $ due to added noise in the forward process.

- The expectation **$ \mathbb{E}[x_{t-1} \mid x_t] $** gives us the **average** previous step given the current state $ x_t $ (for a Gaussian, this is also the most likely value).
- Instead of predicting a single deterministic value, we predict **the expected value** of $ x_{t-1} $.

Since the forward process follows:

$$
x_t = x_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma^2 I)
$$

taking the conditional expectation averages out the randomness introduced by $ \eta_t $ and gives the best estimate of $ x_{t-1} $ in the mean-squared-error sense.

---

**3. How Do We Estimate the Expectation? (Regression Formulation)**  

To estimate **$ \mu_{t-1}(x_t) $**, we approximate it using a **function $ f(x_t) $** that tries to **predict** $ x_{t-1} $. This function is found by minimizing the **mean squared error** (MSE) loss:

$$
\mu_{t-1} = \arg\min_{f: \mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x_t, x_{t-1}} ||f(x_t) - x_{t-1}||_2^2.
$$

Here:
- **$ f(x_t) $** is a neural network (or any regression model) that tries to predict $ x_{t-1} $.
- **The squared norm $ ||f(x_t) - x_{t-1}||_2^2 $** measures how far our prediction is from the true value.
- **Expectation $ \mathbb{E}_{x_t, x_{t-1}} $** means we average over many samples to find the best function $ f(x_t) $ that minimizes this error.

This is a standard **regression problem**: we are training a function to predict **$ x_{t-1} $** given **$ x_t $** by minimizing the squared error.
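
The reason the MSE minimizer equals the conditional mean can be seen with a short derivation, sketched here under the assumption of finite second moments. With $ \mu_{t-1}(x_t) = \mathbb{E}[x_{t-1} \mid x_t] $ as defined above, add and subtract $ \mu_{t-1}(x_t) $ inside the norm:

$$
\mathbb{E}\,\|f(x_t) - x_{t-1}\|_2^2
= \mathbb{E}\,\|f(x_t) - \mu_{t-1}(x_t)\|_2^2
+ \mathbb{E}\,\|\mu_{t-1}(x_t) - x_{t-1}\|_2^2 .
$$

The cross term vanishes because $ \mathbb{E}[\mu_{t-1}(x_t) - x_{t-1} \mid x_t] = 0 $. The second term does not depend on $ f $, and the first term is zero exactly when $ f = \mu_{t-1} $, so the regression objective is minimized by the conditional mean.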

---

**4. Why is the Expectation Over $ x_t, x_{t-1} $?**  

- **We do not have a single $ x_t $ for each $ x_{t-1} $**: the forward process is random, so the same $ x_{t-1} $ can produce many different $ x_t $.
- Conversely, many different **$ x_{t-1} $** values can produce the same observed $ x_t $ because of the Gaussian noise **$ \eta_t $**.
- This means we must train our function **by averaging over all possible ($ x_t, x_{t-1} $) pairs** drawn from the joint distribution defined by the data and the forward process.

---

**5. Alternative Expression Using the Forward Process**  

Since we know the forward step follows:

$$
x_t = x_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma^2 I)
$$

we can rewrite the regression formulation as:

$$
\mu_{t-1} = \arg\min_{f: \mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x_{t-1}, \eta_t} ||f(x_{t-1} + \eta_t) - x_{t-1}||_2^2.
$$

This means:
- Instead of regressing directly on ($ x_t, x_{t-1} $) pairs, we use the known **noisy forward process** to **generate** training pairs.
- Given **$ x_{t-1} $**, we add Gaussian noise **$ \eta_t $** to obtain **$ x_t $**, and then train our model **to recover $ x_{t-1} $ from $ x_t $**.

This regression **maps noisy inputs to clean ones**, making it essentially a **denoising problem**.
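
The recipe above (simulate pairs with the forward step, then regress) can be sketched in a few lines. This is a deliberately tiny illustration and not how a real diffusion model is trained: it assumes a one-dimensional $ x_{t-1} \sim \mathcal{N}(0, 1) $ and replaces the neural network with a linear function $ f(x_t) = a\,x_t + b $ fitted by least squares. Under these assumptions the exact conditional mean is $ x_t / (1 + \sigma^2) $, so the fitted slope can be checked against it.

```python
import random

def make_pairs(num_pairs, sigma, seed=0):
    """Simulate training pairs (x_t, x_{t-1}) with the forward step
    x_t = x_{t-1} + eta_t. Here x_{t-1} ~ N(0, 1) is a toy assumption."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        x_prev = rng.gauss(0.0, 1.0)
        x_t = x_prev + rng.gauss(0.0, sigma)   # eta_t ~ N(0, sigma^2)
        pairs.append((x_t, x_prev))
    return pairs

def fit_linear_denoiser(pairs):
    """Least-squares fit of f(x_t) = a * x_t + b, i.e. the minimizer of the
    empirical mean squared error within this small function class."""
    n = len(pairs)
    mean_in = sum(x for x, _ in pairs) / n
    mean_out = sum(y for _, y in pairs) / n
    cov = sum((x - mean_in) * (y - mean_out) for x, y in pairs) / n
    var = sum((x - mean_in) ** 2 for x, _ in pairs) / n
    a = cov / var
    b = mean_out - a * mean_in
    return a, b

sigma = 0.5
a, b = fit_linear_denoiser(make_pairs(50_000, sigma))

# For x_{t-1} ~ N(0, 1): E[x_{t-1} | x_t] = x_t / (1 + sigma^2).
print("fitted slope:", a, " exact:", 1.0 / (1.0 + sigma ** 2))
print("fitted intercept:", b, " exact: 0.0")
```

The fitted denoiser shrinks $ x_t $ toward the data mean, which is what removing a small amount of noise looks like in this toy setting. A neural network plays the same role when the data distribution is not Gaussian and the conditional mean is no longer linear.
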

---

**6. Connection to Image Denoising and Deep Learning**  

- If **$ p^* $** is an image distribution, the regression objective becomes:
  - Given a **noisy image $ x_t $**, predict the **slightly less noisy image $ x_{t-1} $**.
  - This is closely related to what **image denoising networks** do, explaining why diffusion models can be trained using **convolutional neural networks (CNNs)**.

- The function **$ f(x_t) $** is usually parameterized by a neural network:
  - **UNet architectures** are commonly used because they are good at removing noise at different scales.

---

**7. Summary of Key Concepts**  

✔ **Instead of modeling the full conditional distribution $ p(x_{t-1} \mid x_t) $, we only learn the mean function $ \mathbb{E}[x_{t-1} \mid x_t] $, which simplifies the problem.**  
✔ **Expectation is necessary because multiple $ x_{t-1} $ values could lead to the same $ x_t $ due to noise, and we need the best average estimate.**  
✔ **We solve this as a regression problem: given $ x_t $, predict $ x_{t-1} $ using a function $ f(x_t) $, trained to minimize mean squared error (MSE).**  
✔ **Instead of using real ($ x_t, x_{t-1} $) pairs, we simulate them using the known forward process by adding Gaussian noise.**  
✔ **This makes diffusion models closely related to denoising models, which explains why CNNs work well in this context.**  

```{admonition} Independent and Identically Distributed
:class: note, dropdown

1. **Definition of i.i.d.**  

   The term **i.i.d.** stands for **Independent and Identically Distributed** and refers to a collection of random variables that satisfy two conditions:

   - **Independent:** Each random variable does not affect the others.
   - **Identically Distributed:** All random variables follow the same probability distribution.

   Mathematically, a sequence of random variables **$ X_1, X_2, ..., X_n $** is **i.i.d.** if:

   - **Independence:**  
   
     $$
     P(X_1, X_2, ..., X_n) = P(X_1) P(X_2) ... P(X_n).
     $$
     
     This means knowing the value of **$ X_1 $** does not provide any information about **$ X_2 $**, and so on.

   - **Identical Distribution:**  
     For all **$ i $**, the probability distribution of **$ X_i $** is the same:
     
     $$
     P(X_i \leq x) = P(X_j \leq x), \quad \forall i, j.
     $$
     
     This means all random variables are drawn from the same probability distribution.

2. **Why is i.i.d. Important?**  

   Many fundamental results in probability, statistics, and machine learning assume that data points are **i.i.d.** because it simplifies analysis and model training.

   ✔ **Allows Use of the Law of Large Numbers (LLN):**  
      - If we take many i.i.d. samples, the sample mean **converges to the true mean** of the distribution.

   ✔ **Enables the Central Limit Theorem (CLT):**  
      - The suitably standardized sum (or average) of i.i.d. random variables with finite variance is **approximately normally distributed** when the number of samples is large, regardless of the original distribution.

   ✔ **Simplifies Machine Learning Models:**  
      - Many algorithms assume that training samples are **independent and identically distributed** to avoid biases in learning patterns.

   ✔ **Makes Statistical Inference Easier:**  
      - With i.i.d. samples, standard estimators such as the sample mean and the (Bessel-corrected) sample variance are **unbiased and consistent**.

3. **Examples of i.i.d. and Non-i.i.d. Data**  

   - **i.i.d. Example:**  
     - Rolling a fair six-sided die multiple times.
     - Each roll does not affect the next roll (independence).
     - Each roll has the same probability distribution (identical distribution).

   - **Non-i.i.d. Example (Dependent Data):**  
     - Stock prices: The price today depends on yesterday’s price.
     - Sentences in a book: The probability of a word depends on the previous words (language models assume sequential dependence).
     - Weather conditions: Tomorrow’s temperature depends on today’s temperature.

4. **Mathematical Implications of i.i.d. in Machine Learning**  

   - Suppose we have a dataset **$ X_1, X_2, ..., X_n $** that is i.i.d. with mean **$ \mu $** and variance **$ \sigma^2 $**.
   - The **sample mean** is:
   
     $$
     \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
     $$
     
   - The **Law of Large Numbers (LLN)** states:
   
     $$
     \bar{X} \to \mu \quad \text{as } n \to \infty.
     $$
     
     This ensures that as we collect more data, our estimates become more accurate.

   - The **Central Limit Theorem (CLT)** states:
   
     $$
     \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx \mathcal{N}(0,1) \quad \text{for large } n.
     $$
     
     This means that even if the original data distribution is not normal, the sample mean is **approximately normally distributed** for large $ n $, provided the variance $ \sigma^2 $ is finite.

5. **Summary of Key Takeaways**  

   ✔ **i.i.d. means that each data point is drawn independently from the same probability distribution.**  
   ✔ **It simplifies many statistical and machine learning models, making inference easier.**  
   ✔ **Key theorems like LLN and CLT rely on i.i.d. assumptions for convergence properties.**  
   ✔ **Real-world data is often not strictly i.i.d. (e.g., time series, text, stock prices), requiring specialized models.**  
```
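
The LLN and CLT statements in the note above can be checked numerically. The following is a small sketch, assuming an exponential distribution with mean 1 and standard deviation 1 as the (deliberately non-Gaussian) source of i.i.d. samples; the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example distribution: Exponential(scale=1), with mean 1 and std 1.
mu, sigma = 1.0, 1.0

# LLN: the sample mean approaches mu as n grows.
for n in (10, 1_000, 100_000):
    sample_mean = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:>6}: sample mean = {sample_mean:.4f}")

# CLT: standardized sample means are approximately N(0, 1) for large n.
n, trials = 500, 5_000
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))

# For a standard normal, about 68% of the mass lies within one unit of zero.
frac_within_1 = np.mean(np.abs(z) < 1.0)
print(f"z mean = {z.mean():.3f}, z std = {z.std():.3f}, P(|z| < 1) = {frac_within_1:.3f}")
```

Even though individual exponential samples are strongly skewed, the standardized means `z` behave approximately like a standard normal.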

## Discretization of probability distribution

The key idea here is that we model the evolution of a probability distribution **continuously in time** rather than using discrete jumps. We want to understand how the distributions at each step are related and ensure that the final distribution is **independent of the choice of discretization steps**.

---

**1. The Meaning of Discretization in Diffusion**  

We define a sequence of probability distributions **$ p_0, p_1, \dots, p_T $** as a **discretization** of a **continuous-time function** $ p(x,t) $, where:

- **$ p_0(x) $** is the **original data distribution**.
- **$ p_T(x) $** is the **final noisy distribution** (approximately Gaussian when the added noise dominates the data).
- **$ p(x, t) $** evolves continuously over time from **$ t=0 $** to **$ t=1 $**.

The discrete distributions are snapshots of this **smooth, continuous-time process** taken at evenly spaced times:

$$
p(x, k\Delta t) = p_k(x), \quad \text{where} \quad \Delta t = \frac{1}{T}
$$

---

**2. Why Is This Important?**
- **The number of steps $T$ controls how fine the discretization is**:
  - If **$T$ is large**, adjacent distributions **$p_k$** and **$p_{k-1}$** are **very close**.
  - If **$T$ is small**, adjacent distributions are **more different**, making the reverse process harder.
- **This leads naturally to a continuous-time description of diffusion**, which connects to **stochastic differential equations (SDEs)**.

---

**Variance Scaling and the Noise Process**

**3. Understanding the Variance at Each Step**
The forward diffusion process at each discrete step is defined as:

$$
x_k = x_{k-1} + \eta_k, \quad \eta_k \sim \mathcal{N}(0, \sigma^2)
$$

Since independent Gaussian noise is added at every step, and the variances of independent Gaussians add, the distribution of $x_T$ given $x_0$ after $T$ steps is:

$$
x_T \mid x_0 \sim \mathcal{N}(x_0, T\sigma^2)
$$

This means that the **total variance grows linearly** with the number of steps **$T$**. However, we want the final variance to be **fixed and independent of $T$**.

To achieve this, we **scale the noise variance** at each step using:

$$
\sigma = \sigma_q \sqrt{\Delta t}, \quad \text{where} \quad \Delta t = \frac{1}{T}
$$

This ensures that:

$$
\sum_{k=1}^{T} \sigma^2 = T \sigma_q^2 \Delta t = \sigma_q^2
$$

Thus, the **total variance remains constant** regardless of the number of steps **$T$**.
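
The effect of this scaling can be checked with a short simulation. This is a sketch under assumed values (`sigma_q = 1.5` and the path count are chosen only for illustration): for several step counts $T$, it draws the per-step noise with standard deviation $\sigma_q \sqrt{\Delta t}$ and measures the variance of $x_T - x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_q, n_paths = 1.5, 10_000   # assumed values, chosen only for illustration

final_var = {}
for T in (10, 100, 1000):
    dt = 1.0 / T
    sigma = sigma_q * np.sqrt(dt)                     # per-step noise std
    steps = rng.normal(0.0, sigma, size=(n_paths, T))
    displacement = steps.sum(axis=1)                  # x_T - x_0 after T steps
    final_var[T] = displacement.var()
    print(f"T={T:>4}: Var(x_T - x_0) = {final_var[T]:.3f} (target {sigma_q**2:.3f})")
```

Up to sampling error, every choice of $T$ gives the same total variance $\sigma_q^2$.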

---

**4. The Final Continuous-Time Formulation**
Now, we switch from **discrete indexing** $ k $ to a **continuous-time variable** $ t \in [0,1] $.

- Instead of writing **$ x_k $**, we use **$ x_t $** to represent the state of **$ x $** at time **$ t $**.
- The forward process becomes:

$$
x_{t+\Delta t} = x_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_q^2 \Delta t)
$$

---

**5. What This Implies for the Marginal Distribution**
Since noise accumulates over time, the **distribution of $ x_t $ given the starting point $ x_0 $** is:

$$
x_t \mid x_0 \sim \mathcal{N}(x_0, \sigma_t^2), \quad \text{where} \quad \sigma_t = \sigma_q \sqrt{t}
$$

This ensures that, conditioned on $ x_0 $, the process remains **Gaussian at all times**, which is crucial for the **reverse process**. The marginal $ p(x, t) $ is the data distribution $ p_0 $ convolved with this Gaussian.
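
The relation $\sigma_t = \sigma_q \sqrt{t}$ can be verified at intermediate times as well. The sketch below assumes a single fixed starting point `x0 = 3.0` and illustrative values for `sigma_q`, `T`, and the number of paths; it simulates the forward process and compares the empirical mean and variance of $x_t$ with $x_0$ and $\sigma_q^2 t$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_q, T, n_paths = 1.0, 200, 10_000   # assumed illustrative values
dt = 1.0 / T
x0 = 3.0                                  # a single fixed starting point

# Each row is one path; cumulative sums give the state at t = dt, 2*dt, ..., 1.
increments = rng.normal(0.0, sigma_q * np.sqrt(dt), size=(n_paths, T))
paths = x0 + np.cumsum(increments, axis=1)

empirical = {}
for t in (0.25, 0.5, 1.0):
    k = round(t * T)                      # number of steps taken, so that k * dt = t
    x_t = paths[:, k - 1]                 # column k - 1 holds the state after k steps
    empirical[t] = (x_t.mean(), x_t.var())
    print(f"t={t}: mean={x_t.mean():.3f}, var={x_t.var():.3f}, "
          f"expected var={sigma_q**2 * t:.3f}")
```

The empirical mean stays near $x_0$ while the variance grows linearly in $t$, matching $\mathcal{N}(x_0, \sigma_q^2 t)$.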

---

**Key Mathematical Insights**

**6. Why Scaling the Variance is Necessary**
If we did **not** scale the variance correctly:
- The **total variance would depend on $ T $**, meaning the final distribution **$ p_T(x) $** would change if we modified the number of steps.
- This would make the process **inconsistent across discretizations**, as different choices of **$T$** would lead to different noise levels and therefore different results.

By choosing:

$$
\sigma = \sigma_q \sqrt{\Delta t}
$$

we ensure that the **final variance is always $\sigma_q^2$**, making the diffusion process **consistent**.

---

**7. The Role of Continuous-Time Notation**
- The switch from **discrete indices** $ k $ to a **continuous variable** $ t $ removes the dependence on a particular step count: the same process can be discretized with any $ T $.
- We can now describe the process using **differential equations**, which makes analysis and computation easier.

---

**Summary of Everything We Covered**

✔ **The probability distributions $ p_0, \dots, p_T $ are a discrete approximation of a continuous function $ p(x,t) $.**  
✔ **The variance must be scaled properly to ensure that the final distribution does not depend on the number of steps $T$.**  
✔ **Switching to a continuous-time formulation allows us to use stochastic differential equations (SDEs).**  
✔ **The total noise added over time follows a Gaussian distribution with a variance that grows proportionally to time.**  
✔ **The final step is to express the forward and reverse processes in terms of SDEs, which is what we will explore next.**  