<div style="text-align: center;">
    <h1 style="font-style: italic;">Fun With Diffusion Models!</h1>
</div>

## ***Part A: The Power of Diffusion Models!***

&emsp;&emsp;In part A, I explore diffusion models, implement sampling loops, and apply them to tasks like inpainting and creating optical illusions.

### ***Part 0: Setup***

&emsp;&emsp;For this part, I instantiate DeepFloyd's `stage_1` and `stage_2` objects used for generation, along with several text prompts for sample generation. To check how closely the generated images follow the text descriptions, I experimented with various parameter settings, in particular adjusting `num_inference_steps` to observe changes in output quality. These trials helped me understand how the model controls image detail and refinement.

&emsp;&emsp;The random seed I use here is $42$, and I use the same seed in all subsequent parts.

&emsp;&emsp;The text prompts used in this part are: *an oil painting of a snowy mountain village*, *a man wearing a hat* and *a rocket ship*. The corresponding generated images are shown below:
<div style="text-align: center;">
    <h4 style="font-weight: bold;">Stage 1 with Size [3, 64, 64]</h4>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <p style="font-weight: bold;">an oil painting of a snowy mountain village</p>
        <img src="./media/A0/stage1_image_0.png" alt="Image 1" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a man wearing a hat</p>
        <img src="./media/A0/stage1_image_1.png" alt="Image 2" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a rocket ship</p>
        <img src="./media/A0/stage1_image_2.png" alt="Image 3" style="width: 200px;">
    </div>
</div>

&emsp;&emsp;The images generated at this stage are noticeably blurry.

<div style="text-align: center;">
    <h4 style="font-weight: bold;">Stage 2 with Size [3, 256, 256]</h4>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <p style="font-weight: bold;">an oil painting of a snowy mountain village</p>
        <img src="./media/A0/stage2_image_0.png" alt="Image 1" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a man wearing a hat</p>
        <img src="./media/A0/stage2_image_1.png" alt="Image 2" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a rocket ship</p>
        <img src="./media/A0/stage2_image_2.png" alt="Image 3" style="width: 200px;">
    </div>
</div>
<div style="text-align: center;">
    <h5>num_inference_steps = 20</h5>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <p style="font-weight: bold;">an oil painting of a snowy mountain village</p>
        <img src="./media/A0/stage2_alter1_image_0.png" alt="Image 1" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a man wearing a hat</p>
        <img src="./media/A0/stage2_alter1_image_1.png" alt="Image 2" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a rocket ship</p>
        <img src="./media/A0/stage2_alter1_image_2.png" alt="Image 3" style="width: 200px;">
    </div>
</div>
<div style="text-align: center;">
    <h5>num_inference_steps = 50</h5>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <p style="font-weight: bold;">an oil painting of a snowy mountain village</p>
        <img src="./media/A0/stage2_alter2_image_0.png" alt="Image 1" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a man wearing a hat</p>
        <img src="./media/A0/stage2_alter2_image_1.png" alt="Image 2" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a rocket ship</p>
        <img src="./media/A0/stage2_alter2_image_2.png" alt="Image 3" style="width: 200px;">
    </div>
</div>
<div style="text-align: center;">
    <h5>num_inference_steps = 100</h5>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <p style="font-weight: bold;">an oil painting of a snowy mountain village</p>
        <img src="./media/A0/stage2_alter3_image_0.png" alt="Image 1" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a man wearing a hat</p>
        <img src="./media/A0/stage2_alter3_image_1.png" alt="Image 2" style="width: 200px;">
    </div>
    <div style="text-align: center;">
        <p style="font-weight: bold;">a rocket ship</p>
        <img src="./media/A0/stage2_alter3_image_2.png" alt="Image 3" style="width: 200px;">
    </div>
</div>
<div style="text-align: center;">
    <h5>num_inference_steps = 125</h5>
</div>

&emsp;&emsp;The outputs are of higher quality when the text prompt contains more detail. Increasing `num_inference_steps` also improves quality, at the cost of slower generation: with more steps, the outputs generally show clearer structure and finer detail.

### ***Part 1: Sampling Loops***

&emsp;&emsp;In this section of the problem set, I create my own "sampling loops" using the pretrained DeepFloyd denoisers to generate high-quality images. I adapt these sampling loops for various tasks, such as inpainting or creating optical illusions.

#### ***1.1 Implementing the Forward Process***

&emsp;&emsp;In this part, I implement the forward process of the diffusion model, which gradually adds noise to a clean image. The forward process is defined by:
$$
    q(x_{t} | x_{0}) = \mathcal{N}(x_{t}; \sqrt{\overline{\alpha}_{t}}x_{0}, (1 - \overline{\alpha}_{t})\mathbf{I}),
$$
which is equivalent to computing
$$
    x_{t} = \sqrt{\overline{\alpha}_{t}}x_{0} + \sqrt{1 - \overline{\alpha}_{t}} \epsilon \quad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}).
$$
&emsp;&emsp;That is, given a clean image $x_{0}$, we get a noisy image $x_{t}$ at timestep $t$ by sampling from a Gaussian with mean $\sqrt{\overline{\alpha}_{t}}x_{0}$ and variance $(1 - \overline{\alpha}_{t})$.
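
&emsp;&emsp;As a minimal sketch of the sampling equation above, the snippet below uses NumPy rather than the project's actual code. The linear beta schedule in `make_alphas_cumprod`, the function name `forward`, and the random stand-in image are assumptions for illustration only; the project takes $\overline{\alpha}_{t}$ from the pretrained model rather than from this schedule.

```python
import numpy as np

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear beta schedule, used only to get plausible alpha-bar values."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) for a clean image x0 at timestep t."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(42)
alphas_cumprod = make_alphas_cumprod()
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for the clean image
noisy = {t: forward(x0, t, alphas_cumprod, rng) for t in (250, 500, 750)}
```

&emsp;&emsp;Because $\overline{\alpha}_{t}$ decreases with $t$, larger timesteps keep less of the original signal and more of the noise term.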

&emsp;&emsp;Here is an example of adding noise to `campanile.png`:
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/campanile.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">campanile.png</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_250.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">noise level = 250</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_500.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">noise level = 500</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_750.png" alt="Image 4" style="width: 200px;">
        <p style="font-weight: bold;">noise level = 750</p>
    </div>
</div>

#### ***1.2 Classical Denoising***

&emsp;&emsp;First, I try to denoise these images using classical methods. I again work with the noisy images at timesteps $[250, 500, 750]$, applying *Gaussian blur filtering* in an effort to reduce the noise. The results are shown below:
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_250.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile at t=250</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_500.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile at t=500</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_750.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile at t=750</p>
    </div>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/A1/denoised_image_250.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Gaussian Blur Denoising at t=250</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/denoised_image_500.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Gaussian Blur Denoising at t=500</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/denoised_image_750.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">Gaussian Blur Denoising at t=750</p>
    </div>
</div>
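
&emsp;&emsp;For reference, Gaussian blur denoising can be sketched with a separable kernel in plain NumPy. This is not the code used to produce the images above: the function names, the choice of `sigma`, the reflect padding, and the synthetic test image are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma=2.0):
    """Separable Gaussian blur over the last two axes of an (H, W) or (C, H, W) array."""
    radius = int(3 * sigma)
    k = gaussian_kernel1d(sigma, radius)
    out = img.astype(float)
    for axis in (-2, -1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        # Convolve every 1D slice along this axis; "valid" restores the original length.
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

rng = np.random.default_rng(42)
clean = np.tile(np.linspace(-1.0, 1.0, 64), (3, 64, 1))  # smooth stand-in image
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
blurred = gaussian_blur(noisy, sigma=2.0)
```

&emsp;&emsp;Blurring averages out pixel-wise noise but also removes high-frequency image detail, which is why it struggles at the higher noise levels.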

#### ***1.3 One-Step Denoising***

&emsp;&emsp;Now I utilize a pretrained diffusion model to perform denoising. The denoiser is implemented in `stage_1.unet`, which is a UNet architecture that has been extensively trained on a vast dataset of $(x_{0}, x_{t})$ image pairs. This model enables us to estimate the Gaussian noise present in the image, which we can then subtract to retrieve an approximation of the original image.

&emsp;&emsp;Additionally, the diffusion model requires a text prompt embedding to guide the denoising process. I use `"a high quality photo"` as the relevant text prompt for conditioning the model.
<div style="text-align: center;">
    <h4 style="font-weight: bold;">Timestep = 250</h4>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/campanile.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Original Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_250.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/estimated_clean_image_250.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">Estimate of Original Campanile</p>
    </div>
</div>
<div style="text-align: center;">
    <h4 style="font-weight: bold;">Timestep = 500</h4>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/campanile.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Original Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_500.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/estimated_clean_image_500.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">Estimate of Original Campanile</p>
    </div>
</div>
<div style="text-align: center;">
    <h4 style="font-weight: bold;">Timestep = 750</h4>
</div>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/campanile.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Original Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/noisy_image_750.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Noisy Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/estimated_clean_image_750.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">Estimate of Original Campanile</p>
    </div>
</div>
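
&emsp;&emsp;The one-step estimate follows from solving the forward equation in Part $1.1$ for $x_{0}$. The sketch below shows only that algebra: in the project the noise estimate comes from `stage_1.unet` conditioned on the prompt embedding, whereas here the true noise stands in for it, and the value of `a_bar_t` and the function name `estimate_x0` are hypothetical.

```python
import numpy as np

def estimate_x0(x_t, eps_hat, a_bar_t):
    """Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps for x0, given a noise estimate."""
    return (x_t - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)

rng = np.random.default_rng(42)
a_bar_t = 0.5  # hypothetical alpha-bar value at the chosen timestep
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for the clean image
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps

# Using the true noise as the "prediction" recovers x0 up to floating-point error.
x0_hat = estimate_x0(x_t, eps, a_bar_t)
```

&emsp;&emsp;With a learned noise estimate the recovery is only approximate, and the division by $\sqrt{\overline{\alpha}_{t}}$ amplifies prediction errors as $\overline{\alpha}_{t}$ shrinks at larger timesteps.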

#### ***1.4 Iterative Denoising***

&emsp;&emsp;Part $1.3$ shows that the denoising UNet does a good job of projecting the image onto the natural image manifold, but its estimates degrade as more noise is added. This makes sense, since the problem becomes increasingly difficult at higher noise levels.

&emsp;&emsp;Diffusion models are designed for iterative denoising. To speed up this process, we can create a new list of timesteps called `strided_timesteps`, which lets us skip steps. The first element of `strided_timesteps` corresponds to the noisiest image, i.e. the largest timestep, and `strided_timesteps[-1]` corresponds to a clean image. One straightforward way to construct this list is to use a regular stride. Here I apply a stride of $30$.
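
&emsp;&emsp;A strided schedule of this kind can be built in one line. The starting timestep of $990$ is an assumption, chosen so that the list ends exactly at $0$ with a stride of $30$.

```python
start_t = 990  # assumed largest timestep; not stated in the text above
stride = 30

# Descending timesteps: noisiest first, clean image (t = 0) last.
strided_timesteps = list(range(start_t, -1, -stride))
```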

&emsp;&emsp;On the $i$-th denoising step, we’re at `strided_timesteps[i]` and aim to reach `strided_timesteps[i+1]`, moving from a noisier to a less noisy image. To do this, we apply the following formula:
$$
    x_{t'} = \frac{\sqrt{\overline{\alpha}_{t'}} \beta_{t}}{1 - \overline{\alpha}_{t}}x_{0} + \frac{\sqrt{\alpha_{t}}(1 - \overline{\alpha}_{t'})}{1 - \overline{\alpha}_{t}}x_{t} + v_{\sigma},
$$
where:
- $x_{t}$ is the image at timestep $t$
- $x_{t'}$ is the noisy image at timestep $t'$ where $t' < t$ (less noisy)
- $\overline{\alpha}_{t}$ is defined by `alpha_cumprod`
- $\alpha_{t} = \frac{\overline{\alpha}_{t}}{\overline{\alpha}_{t'}}$
- $\beta_{t} = 1 - \alpha_{t}$
- $x_{0}$ is the current estimate of the clean image

&emsp;&emsp;The estimate $x_{0}$ is computed from the predicted noise in the same way as in Section $1.3$.

&emsp;&emsp;The term $v_{\sigma}$ is random noise, which in the case of DeepFloyd is also predicted by the model. The function `add_variance` adds the correct amount of noise to the image.
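
&emsp;&emsp;A single update of this form can be sketched as follows. The function name `denoise_step`, the linear beta schedule, and the optional `noise` argument (standing in for whatever `add_variance` contributes) are assumptions; the clean-image estimate is passed in directly instead of being computed by the UNet.

```python
import numpy as np

def denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod, noise=None):
    """Move from timestep t to the less noisy timestep t_prev (t_prev < t)."""
    a_bar_t = alphas_cumprod[t]
    a_bar_prev = alphas_cumprod[t_prev]
    alpha_t = a_bar_t / a_bar_prev
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    x_prev = coef_x0 * x0_hat + coef_xt * x_t
    if noise is not None:  # placeholder for the v_sigma term
        x_prev = x_prev + noise
    return x_prev

rng = np.random.default_rng(42)
betas = np.linspace(1e-4, 0.02, 1000)  # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)
t, t_prev = 690, 660
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
x_t = np.sqrt(alphas_cumprod[t]) * x0  # noise-free x_t, for a sanity check only
x_prev = denoise_step(x_t, x0, t, t_prev, alphas_cumprod)
# With no noise anywhere, the step lands on sqrt(alpha_bar_{t'}) * x0.
```

&emsp;&emsp;The sanity check works because the two coefficients combine to $\sqrt{\overline{\alpha}_{t'}}$ when $x_{t} = \sqrt{\overline{\alpha}_{t}}x_{0}$, so the update is consistent with the forward process.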

&emsp;&emsp;The noisy images produced during iterative denoising are shown below:
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/A1/iterative_image_690.png" alt="Image 1" style="width: 160px;">
        <p style="font-weight: bold;">Noisy Campanile at t=690</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/iterative_image_540.png" alt="Image 2" style="width: 160px;">
        <p style="font-weight: bold;">Noisy Campanile at t=540</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/iterative_image_390.png" alt="Image 3" style="width: 160px;">
        <p style="font-weight: bold;">Noisy Campanile at t=390</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/iterative_image_240.png" alt="Image 4" style="width: 160px;">
        <p style="font-weight: bold;">Noisy Campanile at t=240</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/iterative_image_90.png" alt="Image 5" style="width: 160px;">
        <p style="font-weight: bold;">Noisy Campanile at t=90</p>
    </div>
</div>
&emsp;&emsp;Comparing iterative denoising with the earlier methods, both the iteratively denoised image and the one-step estimate look plausible, but the iterative result is clearer and recovers more fine detail.
<br><br>
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/campanile.png" alt="Image 1" style="width: 200px;">
        <p style="font-weight: bold;">Original</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/clean_image.png" alt="Image 2" style="width: 200px;">
        <p style="font-weight: bold;">Iterative Denoised Campanile</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/single_denoising_image.png" alt="Image 3" style="width: 200px;">
        <p style="font-weight: bold;">One-Step Denoised</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/gaussian_image.png" alt="Image 4" style="width: 200px;">
        <p style="font-weight: bold;">Gaussian Blurred Campanile</p>
    </div>
</div>

#### ***1.5 Diffusion Model Sampling***

&emsp;&emsp;In Part $1.4$, I used the diffusion model to denoise an image. The `iterative_denoise` function can also generate images from scratch: setting `i_start = 0` and passing in random noise effectively denoises pure noise. Here are $5$ results for the prompt `"a high quality photo"`:
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/A1/generated_image0.png" alt="Image 1" style="width: 160px;">
        <p style="font-weight: bold;">Sample 1</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/generated_image1.png" alt="Image 2" style="width: 160px;">
        <p style="font-weight: bold;">Sample 2</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/generated_image2.png" alt="Image 3" style="width: 160px;">
        <p style="font-weight: bold;">Sample 3</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/generated_image3.png" alt="Image 4" style="width: 160px;">
        <p style="font-weight: bold;">Sample 4</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/generated_image4.png" alt="Image 5" style="width: 160px;">
        <p style="font-weight: bold;">Sample 5</p>
    </div>
</div>

#### ***1.6 Classifier-Free Guidance (CFG)***

&emsp;&emsp;The images generated in the previous section are not of high quality, and some are completely nonsensical. To significantly improve image quality, we can use a technique known as **Classifier-Free Guidance (CFG)**.

&emsp;&emsp;In CFG, we calculate both conditional and unconditional noise estimates, denoted as $\epsilon_{c}$ and $\epsilon_{u}$. Our new noise estimate is then formulated as:
$$
    \epsilon = \epsilon_{u} + \gamma (\epsilon_{c} - \epsilon_{u}),
$$
where $\gamma$ controls the strength of CFG. Notice that for $\gamma = 0$, we get an unconditional noise estimate, and for $\gamma = 1$ we get the conditional noise estimate. The magic happens when $\gamma > 1$. In this case, we get much higher quality images.
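
&emsp;&emsp;The combination rule itself is a one-liner. In the sketch below, the function name `cfg_noise` is hypothetical and the two noise estimates are random arrays standing in for the UNet outputs with and without the prompt conditioning.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional estimate."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(42)
eps_u = rng.standard_normal((3, 64, 64))  # stand-in unconditional noise estimate
eps_c = rng.standard_normal((3, 64, 64))  # stand-in conditional noise estimate
eps = cfg_noise(eps_u, eps_c, gamma=7.0)
```

&emsp;&emsp;For $\gamma > 1$ the result lies beyond $\epsilon_{c}$ on the line through $\epsilon_{u}$ and $\epsilon_{c}$, which pushes each denoising step further in the direction the prompt suggests.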

&emsp;&emsp;Here are $5$ images of `"a high quality photo"` with a CFG scale of $\gamma = 7$, which look much better than those in the prior section:
<div style="display: flex; justify-content: space-around; align-items: center;">
    <div style="text-align: center;">
        <img src="./media/A1/cfg_generated_image0.png" alt="Image 1" style="width: 160px;">
        <p style="font-weight: bold;">Sample 1 with CFG</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/cfg_generated_image1.png" alt="Image 2" style="width: 160px;">
        <p style="font-weight: bold;">Sample 2 with CFG</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/cfg_generated_image2.png" alt="Image 3" style="width: 160px;">
        <p style="font-weight: bold;">Sample 3 with CFG</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/cfg_generated_image3.png" alt="Image 4" style="width: 160px;">
        <p style="font-weight: bold;">Sample 4 with CFG</p>
    </div>
    <div style="text-align: center;">
        <img src="./media/A1/cfg_generated_image4.png" alt="Image 5" style="width: 160px;">
        <p style="font-weight: bold;">Sample 5 with CFG</p>
    </div>
</div>