In [None]:
# # Comment the following lines if you're not in colab:
# from google.colab import drive
# drive.mount('/content/drive')
# # If you're in colab, cd to your own working directory here:
# %cd ..//..//content//drive//MyDrive//Colab-Notebooks//HY-673-Tutorials//Tutorial-9

In [None]:
import torch as tc

# <u>Noise Scheduler</u>

1) When training diffusion models, we gradually add noise to our image $x_0$ until the result is approximately distributed as $\mathcal{N}(0, 1)$. This is the task of the **noise scheduler** in a diffusion model, which dictates the amount and type of noise that should be added at each timestep $t$. It typically follows a predetermined schedule that determines how the variance of the noise increases over time, and the term *diffusion step* refers to a single noise-adding step in this forward (stochastic) process. During training, at each iteration, we apply $t$ diffusion steps, with $t$ chosen uniformly in $\{1,2, \dots, T\}$, where $T$ is a predefined hyperparameter. So, given a variance $\beta_t$ from a noise schedule $\beta_1, \beta_2, \dots, \beta_T$, the image $x$ at timestep $t$ can be described by:

\begin{equation}
x_{t} = \sqrt{1-\beta_t} \cdot x_{t-1} + \epsilon_t, \ \text{where} \ \epsilon_t \sim \mathcal{N}\left(0, \beta_t \right), \ \text{or, more compactly:}
\end{equation}

\begin{equation}
x_{t}\sim \mathcal{N}\left(\sqrt{1-\beta_t} \cdot x_{t-1}, \ \beta_t\right).
\end{equation}

![](fig/fwd.png)
![](https://drive.google.com/uc?export=view&id=1NUpt4iIL6CeZ4M-87lAjr8QFC2vxL53C)
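
As a rough illustration of the forward equation above, the following sketch applies the noising step one timestep at a time. The helper name `forward_step`, the stand-in image batch, and the schedule `toy_betas` are assumptions made for this example only (the schedule values mirror the ones used later in this notebook).

```python
import torch as tc

def forward_step(x_prev, beta_t):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + eps_t, with eps_t ~ N(0, beta_t)."""
    # A sample with variance beta_t is sqrt(beta_t) times a standard normal sample
    eps_t = (beta_t ** 0.5) * tc.randn_like(x_prev)
    return ((1.0 - beta_t) ** 0.5) * x_prev + eps_t

tc.manual_seed(0)
x0 = tc.ones(4, 1, 28, 28)                   # stand-in "image" batch (assumption)
toy_betas = tc.linspace(1e-4, 0.02, 1000)    # assumed linear schedule, for illustration

# Apply all T steps one after the other
x = x0
for beta in toy_betas:
    x = forward_step(x, beta.item())

# After all steps, the signal is almost gone and x looks like standard normal noise
print(x.mean().item(), x.std().item())
```

Note that reaching step $t$ this way costs $t$ sequential noise draws per sample.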

2) During sampling (or generation), the diffusion model works in reverse, starting from noise and progressively removing it to generate a sample. Here, the noise scheduler specifies how to reverse the noise process. It effectively manages the variance of the noise being removed at each timestep, aiming to accurately reverse the noise addition process used during training. The scheduler guides the model with information about the noise level at each step. The model's objective during training is to go back from $x_{t}$ to $x_{t-1}$, i.e., to predict one single noise step $\epsilon_t$. So the loss can be, for example, the MSE between the true noise $\epsilon_t$ and the model's prediction $\hat{\epsilon}_{t}$.

![](fig/bkw.png)
![](https://drive.google.com/uc?export=view&id=1frSbtHEK41w_oBQSKvyPC186n7uB7b0k)

Both processes can also be defined by using Markov chains.

## <u>Problem</u>

Applying noise to get from $x_0$ to $x_t$ is normally an iterative procedure that requires sampling noise from a Gaussian distribution $t$ times and mixing this noise with the image at every step. For large $T$, if we actually implemented this process iteratively, training large diffusion models would be **computationally heavy** and probably even infeasible.

## <u>Solution</u>

Luckily, researchers have come up with a method to **pre-compute** the noise parameters so that $x_t$ can be obtained in **one single step**. These parameters are the $\bar{\alpha}_{t}$ coefficients below. You can find more details in the slides or in the original paper, which is inside the References folder. The formula to get $x_t$ directly from $x_0$, given our noise schedule $\beta_1, \beta_2, \dots, \beta_T$, is:

\begin{equation}
x_t \sim \mathcal{N} \left(\sqrt{\bar{\alpha}_t} \cdot x_0, \ 1-\bar{\alpha}_t\right), \ \text{with:}
\end{equation}
\begin{equation}
\alpha_t = 1 - \beta_t, \ \text{and} \ \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.
\end{equation}
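
The closed-form expression above can be sketched in a few lines. The function name `q_sample`, the stand-in image batch, and the schedule `toy_betas` are assumptions introduced for this example only; timesteps are treated as 1-indexed, as in the formulas.

```python
import torch as tc

toy_betas = tc.linspace(1e-4, 0.02, 1000)    # assumed linear schedule, for illustration
alphas = 1.0 - toy_betas                     # alpha_t = 1 - beta_t
alphabar = tc.cumprod(alphas, dim=0)         # alphabar[t - 1] = prod_{s=1}^{t} alpha_s

def q_sample(x0, t, alphabar):
    """Draw x_t ~ N(sqrt(abar_t) * x0, 1 - abar_t) directly from x0; t is 1-indexed."""
    abar_t = alphabar[t - 1]
    eps = tc.randn_like(x0)
    return tc.sqrt(abar_t) * x0 + tc.sqrt(1.0 - abar_t) * eps

x0 = tc.ones(4, 1, 28, 28)                   # stand-in "image" batch (assumption)
x_500 = q_sample(x0, t=500, alphabar=alphabar)
```

A single noise draw now replaces the $t$ sequential draws of the iterative procedure.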


It is useful to have all these parameters pre-computed in order to save extra computational costs that would be introduced during training and inference:

In [None]:
def get_schedules(beta_1, beta_t, timesteps):
    """
    A linear noise scheduler that precomputes all the parameters (fractions, square roots, etc).
    Each returned tensor has timesteps + 1 entries and is indexed directly by the timestep t.
    """
    beta_t = (beta_t - beta_1) * tc.arange(start=0, end=timesteps + 1, dtype=tc.float32) / timesteps + beta_1
    sqrt_beta_t = tc.sqrt(beta_t)
    alpha_t = 1 - beta_t
    # we'll take the logarithm of alpha_t to compute the cumulative product more stably:
    log_alpha_t = tc.log(alpha_t)
    # since we took the logarithm we exponentiate the cumulative sum:
    alphabar_t = tc.cumsum(log_alpha_t, dim=0).exp()
    sqrt_abar = tc.sqrt(alphabar_t)
    one_over_sqrt_a = 1 / tc.sqrt(alpha_t)
    sqrt_inv_abar = tc.sqrt(1 - alphabar_t)
    inv_abar_over_sqrt_inv_abar = (1 - alpha_t) / sqrt_inv_abar
    return {
        "alpha": alpha_t,
        "one_over_sqrt_a": one_over_sqrt_a,
        "sqrt_beta": sqrt_beta_t,
        "alphabar": alphabar_t,
        "sqrt_abar": sqrt_abar,
        "sqrt_inv_abar": sqrt_inv_abar,
        "inv_alpha_over_sqrt_inv_abar": inv_abar_over_sqrt_inv_abar,
    }

Usage:
+ Define the maximum allowed number of diffusion steps $n_T$.
+ Define the endpoints of the noise schedule, $\beta_1$ and $\beta_T$.

In [None]:
n_T = 1000
betas = [1e-4, 0.02]
schedules = get_schedules(beta_1=betas[0], beta_t=betas[1], timesteps=n_T)

# <u>Training </u>

Let us simulate how we'll use the scheduler during training. From our course theory we have the following:

![](fig/train.png)
![](https://drive.google.com/uc?export=view&id=1ueIRtbIhMl6PzqJpIEo_uBYJjnA8ImEn)

### Key Remarks:

+ It helps to rearrange the diffusion process we defined previously so as to express $\epsilon$ in terms of $x_t$ and $x_0$:
\begin{equation}
\epsilon = \frac{x_t - \sqrt{\bar{\alpha}_t} \cdot x_0} {\sqrt{1-\bar{\alpha}_t}}, \ \epsilon \sim \mathcal{N}(0,1).
\end{equation}
+ The neural network $\epsilon_\theta$ aims to predict the noise $\epsilon$ that was added to the original data to get the noised version $x_t$ without knowing $x_0$. This prediction $\hat{\epsilon}$ will be used to iteratively denoise the data during the sampling phase, effectively reversing the diffusion process. Regarding training, all we care about is **approximating the noise** as accurately as possible.
+ Hence, our **loss function** will measure the distance between $\epsilon$ and $\hat{\epsilon}$ (for example, the MSE).
+ The network learns to estimate $\epsilon$ from $x_t$ for any given $t$, which corresponds to a **specific level** of noise addition.
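
The remarks above can be checked numerically. In this sketch, `abar_t` is a hypothetical value of $\bar{\alpha}_t$, the data is random, and the all-zeros `eps_hat` is only a placeholder for a real network's prediction.

```python
import torch as tc
import torch.nn.functional as F

tc.manual_seed(0)
x0 = tc.randn(8, 1, 28, 28)        # stand-in for a batch of clean images (assumption)
eps = tc.randn_like(x0)            # the noise we add, eps ~ N(0, 1)
abar_t = tc.tensor(0.3)            # hypothetical alphabar_t for some timestep t

# Forward: noise the clean data in one step
x_t = tc.sqrt(abar_t) * x0 + tc.sqrt(1 - abar_t) * eps

# Rearranged: recover eps from x_t and x0
eps_recovered = (x_t - tc.sqrt(abar_t) * x0) / tc.sqrt(1 - abar_t)
print(tc.allclose(eps_recovered, eps, atol=1e-5))

# The training loss compares the true noise with the model's prediction
eps_hat = tc.zeros_like(eps)       # placeholder prediction, not a real model output
loss = F.mse_loss(eps_hat, eps)
```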

### Question:

We said that the network tries to predict the noise added in a single step, i.e., from $x_{t-1}$ to $x_{t}$, and the noise that we add in each step is different, because we have a schedule for our variances $\beta_t$. Our loss function is between $\epsilon \sim \mathcal{N}(0,1)$ and the predicted noise $\hat{\epsilon}$, but this does not reflect the actual noise that was added in that particular step, because $x_{t} - \sqrt{1-\beta_t} \cdot x_{t-1} \sim \mathcal{N}(0, \beta_t)$. So, how is this strategy correct for training the model?

### Answer:

It is effectively the same problem, because we know deterministically how to get from an $\epsilon$ at timestep $t$ to the exact corresponding $\epsilon_t$ (we use this transformation in the sampling/reverse process). Whether we ask the model to output samples from $\mathcal{N}(0,1)$ or samples from $\mathcal{N}(0,1)$ multiplied by a scalar $\sqrt{\beta_t}$ does not matter for our optimization problem. We prefer the unscaled target because its range is the same at every timestep, which is numerically easier, so there is no need to overcomplicate the model's task when we are solving the same problem in the end.

In code, and adding some more details, the training procedure can be implemented as:

In [None]:
def model(noised_img, diffusion_steps):
    # Assume code of our diffusion model with parameters theta
    # returning its estimation of epsilon:
    return tc.randn_like(noised_img)

batch_size = 64
length, height = 28, 28
epochs = 1

def train_loop():

    # Step 1: Loop:
    for epoch in range(epochs):

        # Step 2: Fetch a batch from the dataset (let's simulate it with random numbers):
        # img, _ = next(iter(dataloader)).view(-1, 1, 28, 28)
        x = tc.randn(size=(batch_size, 1, length, height))

        # Step 3: Get number of timesteps t in U(1, n_T) for this iteration:
        timesteps = tc.randint(low=1, high=n_T + 1, size=(x.shape[0],))

        # Step 4: Sample N(0,1) in the shape of the input:
        epsilon = tc.randn_like(x)

        # Step 5: Zero gradients:
        # optim.zero_grad()

        # Step 6: Add noise to the input:
        # sqrt(abar) * x + sqrt(1-abar) * epsilon
        x_t = schedules["sqrt_abar"][timesteps, None, None, None] * x + \
              schedules["sqrt_inv_abar"][timesteps, None, None, None] * epsilon

        # Step 7: Normalize the timesteps to (0, 1] by dividing by the maximum allowed number of timesteps:
        t = timesteps/n_T

        # Step 8: Get model's prediction of epsilon given the noised input x_t
        # and the number of diffusion steps chosen for this iteration t:
        epsilon_hat = model(noised_img=x_t, diffusion_steps=t)

        # Step 9: Calculate loss between epsilon and epsilon_hat:
        # loss = loss_fn(epsilon_hat, epsilon)

        # Step 10: Backpropagation:
        # loss.backward()

        # Step 11: Update weights and reiterate:
        # optim.step()
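
The `[timesteps, None, None, None]` indexing in Step 6 picks one coefficient per batch element and reshapes it so that it broadcasts over the channel and spatial dimensions. Here is a small sketch of that mechanism, where `coeffs` is a made-up stand-in for one of the precomputed schedule tensors:

```python
import torch as tc

coeffs = tc.linspace(0.0, 1.0, 11)     # stand-in for a schedule tensor with T + 1 = 11 entries
timesteps = tc.tensor([1, 5, 10])      # one timestep per batch element
x = tc.ones(3, 1, 28, 28)              # batch of 3 "images"

# Index with the timesteps, then add three singleton axes for broadcasting
per_sample = coeffs[timesteps, None, None, None]
print(per_sample.shape)                # a (3, 1, 1, 1) tensor

# Each image is scaled by the coefficient of its own timestep
scaled = per_sample * x
```

This is what allows every sample in the batch to be noised with a different $t$ in one vectorized operation.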

# <u>Sampling/Generation</u>

Before closing this notebook, let us simulate the sampling process of our diffusion model. We want to progressively transform noisy data into structured output, i.e., the reverse of the diffusion process.  From our course theory, the math says:

![](fig/sample.png)
![](https://drive.google.com/uc?export=view&id=1quoQs6zbP2LskAEwvffuJuqVIbiYo8SK)

Here, $\sigma_t$ is the standard deviation of the noise added at step $t$, i.e., $\sigma_t = \sqrt{\beta_t}$ (so $\sigma_t^2$ is just our $\beta_t$), and the rest of the notation is consistent with what we've said so far. We will implement this algorithm directly. For more details on the matter, you can visit the course theory or the original paper.

In [None]:
def sample(diffusion_model, timesteps, n_samples, sample_shape, schedule):
    # Step 1: x_T begins as N(0,1), the last diffusion step:
    x_T = tc.randn(size=(n_samples, *sample_shape))
    # Step 2: for all timesteps T, T-1, ..., 1:
    x_i = x_T
    for i in range(timesteps, 0, -1):
        # Step 3: sample z in N(0,1) (no noise is added at the final step i = 1):
        z = tc.randn(size=(n_samples, *sample_shape)) if i > 1 else 0
        # Step 4: denoise the image to go to the previous step.
        # Subtract the scaled noise prediction from x_i, rescale, then add fresh noise:
        ts = tc.tensor(i / timesteps).repeat(n_samples,)
        epsilon = diffusion_model(noised_img=x_i, diffusion_steps=ts)
        x_i = schedule["one_over_sqrt_a"][i] * \
              (x_i - schedule["inv_alpha_over_sqrt_inv_abar"][i] * epsilon) + \
              schedule["sqrt_beta"][i] * z
    # Step 5: after all denoising steps, we are in the sample space:
    return x_i

In [None]:
x = sample(diffusion_model=model, timesteps=n_T, n_samples=batch_size, sample_shape=(1, 28, 28), schedule=schedules)

## <u>Bonus: More Efficient Reverse Process</u>

There are several techniques to accelerate the reverse process, enabling the model to require fewer steps than the forward process. This is often done by noticing where the most effective denoising steps take place. It is beyond the scope of our tutorial, but in case you are interested in these more SOTA methods:

+ **Learned Step Sizes:** Instead of using a fixed schedule for the noise levels, some models learn an optimal set of step sizes or intervals. This way, the model can focus computation on the most critical stages of the reverse process, skipping steps where changes are less impactful.
+ **Learned Variance Schedules:** Models can also learn more efficient variance schedules, which determine the amount of noise to add or remove at each step. By optimizing these schedules, the model can achieve better fidelity with fewer steps.
+ **Subsampling of Timesteps:** This technique involves selectively skipping certain timesteps during the reverse process. By training the model to handle larger "jumps" in the denoising path, fewer steps are needed to reconstruct the clean data from the noisy latent state.
+ **Conditioning on Multiple Steps:** Some methods involve conditioning the reverse diffusion on multiple timesteps at once, effectively teaching the model to predict several steps ahead. This approach can significantly reduce the number of necessary reverse steps.
+ **Using an Adaptive Solver:** In certain advanced implementations, adaptive solvers can be used to dynamically adjust the number of timesteps based on the complexity of the current reconstruction task. This means that simpler samples may require fewer steps, while more complex ones may still use more steps, optimizing overall efficiency.
+ **Auxiliary Networks:**  Auxiliary networks can be employed to predict larger chunks of the denoising trajectory, allowing for more substantial updates in each step. This reduces the number of steps required by improving the quality of each update.
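
To give a flavor of the timestep-subsampling idea from the list above, the sketch below only selects a shorter, roughly evenly spaced sequence of timesteps. The helper `strided_timesteps` is a hypothetical name, and the sketch deliberately stops there: the update rule applied at each step must also be adapted to the larger jumps, as discussed in the articles below, and that part is not shown.

```python
import torch as tc

def strided_timesteps(n_train_steps, n_sample_steps):
    """Pick n_sample_steps roughly evenly spaced timesteps from {1, ..., n_train_steps}, in descending order."""
    # linspace goes from the noisiest step down to step 1; round to integer timesteps
    return tc.linspace(n_train_steps, 1, n_sample_steps).round().long()

# Visit only 10 of the 1000 training timesteps during the reverse process
ts = strided_timesteps(n_train_steps=1000, n_sample_steps=10)
print(ts)
```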

### <u>Example Articles</u>
+ https://arxiv.org/pdf/2102.09672.pdf
+ https://arxiv.org/pdf/2105.14080.pdf
+ https://arxiv.org/pdf/2206.00927.pdf
+ https://arxiv.org/pdf/2009.09761.pdf