# Proj 2

## Abstract
* What is the problem this paper addresses?
* Why is it an important problem?
* Why are current approaches insufficient?
* Methods: In this work, we develop an approach to address these deficiencies

This paper addresses the high computational cost of diffusion modelling. This is an important problem because pixel space diffusion requires forward passes over very high dimensional inputs (high resolution images), which makes training and inference very expensive, slows down image generation, and restricts maximum model size and input resolution. Furthermore, current diffusion models waste capacity modelling imperceptible, high frequency details in pixel space. To address these limitations, the authors develop Latent Diffusion Models, a class of 2-stage diffusion models that learn a mapping between pixel space and a low dimensional latent space, and perform diffusion only in that latent space. The resulting model is more computationally efficient and scalable, since all diffusion steps occur in latent space, and only a single forward pass of the decoder is required to reconstruct the final image in pixel space.

## Problem definition
* What question are you trying to solve?
* Observed and unobserved random variables?
* What is the goal of the project?

We aim to develop a generative model that can learn the probability distribution $p(x)$ over images $x \in D$, where $D$ is the MNIST dataset. Once trained, the model should be able to sample new images $\hat x \sim p(x)$.

Diffusion models learn $p(x)$ by learning to reverse the forward diffusion process, and can be sampled from by iteratively denoising pure noise into images. However, diffusion is currently done in high dimensional pixel space $\mathbb{R}^{28\times28}$, which is computationally expensive since each diffusion step has to process all 784 dimensions.

The main question we are trying to address is: how can we model $p(x)$ efficiently while avoiding the high computational cost of performing diffusion directly in pixel space?

To address this, we introduce a latent variable model in which we learn a mapping between pixel space and a lower dimensional latent space through a Variational Auto-Encoder (VAE). During training, images $x$ are compressed into smaller latent vectors $z_0$, and diffusion is performed in latent space instead of pixel space. At generation time, the reverse diffusion process produces a latent $z_0$, which is decoded back into pixel space to produce a generated image $\hat{x}$. See the Models and Methods section for more details.

Observed Variables
* MNIST images: $x\in\mathbb{R}^{28\times28}$
* Digit labels: $y\in\{0,1,\ldots,9\}$

Latent Variables
* Compressed latent code of image produced by VAE: $z\in\mathbb{R}^d, \text{ latent dim }d < 784$
* Latent representation of the image after $t$ steps of the forward diffusion process: $z_t$ for $t\in\{1,2,\ldots,T\}$, where $T$ is the maximum number of diffusion steps

Our goal is to show that diffusion in a compressed latent representation learned by a VAE can reduce computational cost and speed up generation while retaining image quality. The final trained model should be able to start from a random noisy $z_T$, reverse diffusion in latent space to get $z_0$, and decode the final latent vector into pixel space to generate an image $\hat{x}$ that looks like it comes from $D$.

## Models and Methods
* Describe the model and inference algorithms.
* Graphical and generative model.
* What parameters do we estimate and how?
* What is the interpretation of those parameters, how do they solve the problem?

The Latent Diffusion Model is made up of 2 parts: an autoencoder for perceptual compression and a denoising U-Net for diffusion.

### Autoencoder
The authors tested 2 autoencoders, a VAE (Variational Auto-Encoder) and a VQ-GAN (Vector-Quantized Generative Adversarial Network). The purpose of the autoencoder is to compress images into a smaller latent space so diffusion can run more efficiently, while losing as little image fidelity as possible. Both autoencoders are trained first (via SGD), separately from the diffusion model, and are kept fixed during diffusion model training.

#### VAE
VAEs have 2 parts:
* an encoder $q_\phi(z | x)$ that defines the distribution over latents given an image
* a decoder $p_\theta(x | z)$ that defines the distribution of images reconstructed from latent codes

We model $q_\phi(z | x) = N(\mu_\phi(x), \sigma_\phi^2(x) I)$, where $\mu_\phi, \sigma_\phi^2$ are the outputs of the encoder.

To sample $z \sim q_\phi(z | x)$ while maintaining differentiability, we use the Reparameterization Trick: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \epsilon \sim N(0, I)$
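The reparameterization step can be sketched in a few lines. This is a minimal illustration, assuming the encoder outputs a log-variance vector (a common convention that the text above does not specify); the function name `reparameterize` and the example values are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1) per dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # sigma = sqrt(exp(log sigma^2))
        eps = rng.gauss(0.0, 1.0)    # all randomness comes from eps, not from mu or sigma
        z.append(m + sigma * eps)
    return z

# Hypothetical encoder outputs for a 2-dimensional latent.
mu = [0.5, -1.0]
log_var = [0.0, -2.0]
rng = random.Random(0)
z = reparameterize(mu, log_var, rng)
```

Because `z` is a deterministic function of `mu`, `log_var`, and the externally drawn `eps`, gradients can flow to the encoder outputs in an autodiff framework.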

We model $p_\theta(x | z) = N(\mu_\theta(z), I)$, where $\mu_\theta(z)$ is the decoder's output, the reconstruction of latent code $z$ in pixel space. Then: $\log p_\theta(x | z) = -\frac{1}{2} ||x - \mu_\theta(z)||^2 + \text{ const}$

Assuming the latent prior $p(z) = N(0, I)$: $\text{ KL} \left[ q_\phi(z | x) || p(z) \right] = \text{KL} \left[ N(\mu_\phi(x), \sigma_\phi^2(x) I) || N(0, I) \right] = \frac{1}{2} \sum_i \left[ \mu_{\phi, i}^2(x) + \sigma_{\phi, i}^2(x) -\log \sigma_{\phi, i}^2(x) - 1\right]$
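As a quick sanity check of the closed form above, the sketch below evaluates it for a diagonal Gaussian given as lists of means and variances. The function name `kl_to_standard_normal` is illustrative.

```python
import math

def kl_to_standard_normal(mu, sigma2):
    """KL[N(mu, diag(sigma2)) || N(0, I)] via the closed-form sum over dimensions."""
    return 0.5 * sum(m * m + s2 - math.log(s2) - 1.0 for m, s2 in zip(mu, sigma2))

# The KL is zero when the posterior already equals the prior...
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# ...and grows when the mean moves away from zero.
kl_shift = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
```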

Classic VAEs minimize the following objective: $L = -\mathbb{E}_{z \sim q_\phi(z | x)} \left[ \log p_\theta(x | z) \right] + \text{KL} \left[ q_\phi(z | x) || p(z) \right]$

The first term $L_{recon} = -\mathbb{E}_{z \sim q_\phi(z | x)} \left[ \log p_\theta(x | z) \right]$ is the reconstruction loss: after encoding an image $x$ and sampling a latent $z$ from the resulting distribution, how likely is the original image when reconstructed from latent space? This term encourages the VAE encoder to compress as much information about $x$ as possible into the latent code $z$, and pushes the decoder to put high probability mass on reconstructing the original image.

The second term $L_{reg} = \text{KL} \left[ q_\phi(z | x) || p(z) \right]$ penalizes the encoder when the approximate posterior $q_\phi(z | x)$ deviates from the prior $p(z) = N(0, I)$. This term encourages the model to learn a maximally information-dense latent space that compresses the high dimensional input space into independent Gaussian features.

VAE generative model:
* sample latents: $z \sim N(0, I)$
* reconstruct (decode) latents into images: $x \sim p_\theta(x | z)$

![vae graphical model](images/vae_graphical_model.png)

In Latent Diffusion Models, the authors use a modified VAE objective with $L_{recon} = \lambda_{L1} ||x - \mu_\theta(z)||_1 + \lambda_{per} \text{LPIPS}(x, \mu_\theta(z)) + \lambda_{adv} L_{adv}$

The first term $\lambda_{L1} ||x - \mu_\theta(z)||_1$ is just the L1 loss between the image $x$ and the reconstructed image $\mu_\theta(z)$. The authors chose to use an L1 instead of L2 loss because they found that using an L1 loss made images less blurry (L2 is minimized by the mean value, which for images, is gray and blurry, while L1 is minimized by the median value).
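The mean-versus-median claim can be checked numerically. The sketch below uses made-up intensity values and a simple grid search over constant predictions; all names and values are illustrative, not taken from the paper.

```python
import statistics

def l1_cost(c, values):
    """Total absolute error of predicting the constant c for every value."""
    return sum(abs(v - c) for v in values)

def l2_cost(c, values):
    """Total squared error of predicting the constant c for every value."""
    return sum((v - c) ** 2 for v in values)

# Hypothetical pixel intensities: mostly dark with one bright outlier.
values = [0.0, 0.0, 0.1, 0.1, 1.0]
candidates = [i / 100 for i in range(101)]  # constant predictions 0.00, 0.01, ..., 1.00

best_l1 = min(candidates, key=lambda c: l1_cost(c, values))  # lands on the median
best_l2 = min(candidates, key=lambda c: l2_cost(c, values))  # lands on the mean
```

The L2 minimizer is pulled toward the outlier, while the L1 minimizer stays at a value that actually occurs in the data.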

The second term $\lambda_{per} \text{LPIPS}(x, \mu_\theta(z))$ is the perceptual loss term. LPIPS is the Learned Perceptual Image Patch Similarity, defined as $\text{LPIPS}(x, \hat x) = \sum_l w_l ||\phi_l(x) - \phi_l(\hat x)||^2$, where $\phi_l(x)$ are the intermediate layer activations of some pretrained neural network (VGG-16). This loss term encourages the reconstruction to match the higher-level features (like textures) of the original image. These features are presumably important, since otherwise the pretrained network would not have learned them.

The final term $L_{adv} = - D(\mu_\theta(z))$ is the patch-based adversarial loss, where $D$ is a GAN discriminator that must predict if the patch is real or generated. This patch-based adversarial loss encourages the VAE decoder to generate sharp, realistic looking patches that are indistinguishable from ones coming from real images.

#### VQ-GAN
The VQ-GAN is very similar to the VAE modified with adversarial loss as described above. The major difference is that the encoded latents are no longer continuous, but instead come from a discrete set of learnable embeddings (codewords) $e_k$. Encoded image latents are quantized to the closest codeword, $z_q = \arg\min_{e_k} ||z - e_k||_2$, and the decoder only ever uses the quantized codewords to reconstruct the input image. The only additional loss term is $L_{quant} = || \text{sg}(z_q) - z ||^2$, where $z_q$ is the quantized latent and $\text{sg}$ is the stop-gradient operator (codewords are updated using EMA in more recent versions of VQ-VAE, not SGD). This loss term is called the commitment loss, and pushes the encoder to produce latents close to the codebook entries (commit to them).
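The quantization step and commitment loss can be sketched for a single latent vector. This is a toy illustration with a hypothetical three-entry codebook; a real VQ-GAN quantizes every spatial position of a feature map and relies on an autodiff framework for the stop gradient.

```python
def quantize(z, codebook):
    """Return (index, codeword) of the codebook entry closest to z in L2 distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    k = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return k, codebook[k]

def commitment_loss(z, z_q):
    """||sg(z_q) - z||^2: z_q is treated as a constant, so only z would get gradients."""
    return sum((q - zi) ** 2 for q, zi in zip(z_q, z))

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # hypothetical codewords e_k
z = [0.9, 1.2]                                    # hypothetical encoder output
k, z_q = quantize(z, codebook)
loss = commitment_loss(z, z_q)
```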

VQ-GAN generative model:
* sample latent codewords: $z \in \{e_1, \ldots, e_K\}$, where $K$ is the number of codewords
* reconstruct (decode) latent codewords into images: $x \sim p_\theta(x | z)$

![vq gan graphical model](images/vq_gan_graphical_model.png)

### Diffusion model
To reduce the cost of running the diffusion process in pixel space, we first map each image $x\in D$ to a lower dimensional latent vector $z_0$ using a VAE:

$$ z_0 \sim q_\phi(z_0 | x) $$

In latent space, we define a forward diffusion process that adds noise gradually (with $\beta_t$ following a variance schedule),

$$ p(z_t \mid z_{t-1}) = N\left(\sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\right), \qquad t = 1,\dots,T $$

$$ z_t = \sqrt{1-\beta_t} z_{t-1} + \sqrt{\beta_t} \epsilon, \qquad \epsilon \sim N(0, I) $$

By using the substitution

$$ \bar{\alpha}_t = \prod_{i=1}^t (1 - \beta_i)$$

we can get a closed form for the forward process in a single step (instead of having to perform $t$ steps of forward diffusion sampling)

$$ p(z_t \mid z_0) = N\left(\sqrt{\bar{\alpha}_t}\, z_0,\; (1 - \bar{\alpha}_t) I \right) $$

$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim N(0,I) $$
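The single-step forward process can be sketched as follows. The linear schedule and its endpoints (`1e-4` to `0.02`) are an assumption borrowed from common practice, not something fixed by the text above, and the function names are illustrative.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas; the endpoints are an assumed, commonly used choice."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), returned for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_sample(z0, t, alpha_bars, rng):
    """Sample z_t directly from z_0 (t is 1-indexed); also return the noise used."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]
    return z_t, eps

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)
z_t, eps = forward_sample([1.0, -1.0], 500, alpha_bars, random.Random(0))
```

Since $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, late latents carry almost no signal from $z_0$ and are close to pure Gaussian noise.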

During training, the diffusion model learns to predict the noise added to $z_0$. We denote predicted noise by

$$ \epsilon_\theta(z_t,t) $$

The diffusion model is trained to minimize the objective (via SGD)

$$
L = \mathbb{E}_{z_0 \sim q_\phi(z_0\mid x),\ \epsilon \sim N(0, I),\ t \sim \text{Uniform}\{1,\dots,T\}} \left[ \ \|\, \epsilon - \epsilon_\theta(z_t, t)\, \|^2 \ \right], \quad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
$$
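One Monte Carlo sample of this objective can be sketched as below. Drawing $t$ uniformly from $\{1, \dots, T\}$ is the usual choice and is assumed here; `eps_model` stands in for the U-Net and `zero_model` is a hypothetical placeholder that always predicts zero noise.

```python
import math
import random

def alpha_bar_schedule(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def diffusion_loss_sample(z0, eps_model, alpha_bars, rng):
    """One Monte Carlo sample of ||eps - eps_model(z_t, t)||^2."""
    T = len(alpha_bars)
    t = rng.randint(1, T)                      # t ~ Uniform{1, ..., T} (assumed)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]    # noise the model must predict
    a_bar = alpha_bars[t - 1]
    z_t = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]
    eps_hat = eps_model(z_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def zero_model(z_t, t):
    """Placeholder predictor: always guesses zero noise."""
    return [0.0] * len(z_t)

betas = [0.1, 0.2, 0.3]                        # tiny hypothetical schedule
alpha_bars = alpha_bar_schedule(betas)
z0 = [0.5, -0.25]                              # hypothetical encoded latent
loss = diffusion_loss_sample(z0, zero_model, alpha_bars, random.Random(0))
```

In training, this quantity would be averaged over a minibatch and minimized with respect to the parameters of `eps_model`.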

Diffusion model sampling (with $\alpha_t = 1 - \beta_t$ and sampling noise scale $\sigma_t$, e.g. $\sigma_t^2 = \beta_t$):
* sample $z_T \sim N(0, I)$
* for $t = T, \ldots, 1$:
    * $z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(z_t, t) \right)$
    * if $t > 1$: $z_{t-1} \leftarrow z_{t-1} + \sigma_t \epsilon, \qquad \epsilon \sim N(0, I)$
* map the final denoised latent $z_0$ back to pixel space using the decoder: $\hat{x} = \mu_\theta(z_0)$
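The sampling loop above can be sketched as follows, assuming $\alpha_t = 1 - \beta_t$ and the choice $\sigma_t^2 = \beta_t$ (one common option, not fixed by the text). `eps_model` stands in for the trained U-Net. The toy `point_mass_eps_model` is a hypothetical oracle for data concentrated at a single latent `c`, used only to check the update rule; the final decoding step is omitted.

```python
import math
import random

def alpha_bar_schedule(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def sample_latent(eps_model, betas, dim, rng):
    """Reverse diffusion in latent space: start from noise z_T and return z_0."""
    alpha_bars = alpha_bar_schedule(betas)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # z_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        alpha_t = 1.0 - betas[t - 1]
        coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bars[t - 1])
        eps_hat = eps_model(z, t)
        z = [(zi - coef * e) / math.sqrt(alpha_t) for zi, e in zip(z, eps_hat)]
        if t > 1:                                           # no noise on the last step
            sigma_t = math.sqrt(betas[t - 1])               # assumed sigma_t^2 = beta_t
            z = [zi + sigma_t * rng.gauss(0.0, 1.0) for zi in z]
    return z

def point_mass_eps_model(c, alpha_bars):
    """Exact noise predictor when every training latent equals c (toy case)."""
    def model(z_t, t):
        a_bar = alpha_bars[t - 1]
        return [(zt - math.sqrt(a_bar) * ci) / math.sqrt(1.0 - a_bar) for zt, ci in zip(z_t, c)]
    return model

betas = [0.05 * (i + 1) for i in range(10)]                 # hypothetical 10-step schedule
c = [0.3, -0.7]
model = point_mass_eps_model(c, alpha_bar_schedule(betas))
z0 = sample_latent(model, betas, dim=2, rng=random.Random(0))
```

With the oracle predictor, the final step recovers `c` regardless of the noise injected earlier, which is a useful check that the coefficients are wired correctly.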

![ldm graphical model](images/ldm_graphical_model.png)


#### Architecture
The diffusion model $\epsilon_\theta(z_t, t)$ is implemented as a residual U-Net. Time embeddings are added to intermediate activations to condition the diffusion model on time. Domain-specific conditioning is done through cross-attention at intermediate U-Net layers. Domain-specific encoders (e.g. an unmasked transformer for text) embed token-based conditioning information to be fed into cross-attention.
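The text does not say how the time embedding is computed. The sketch below assumes a sinusoidal embedding, a common choice in diffusion U-Nets, and adds it directly to a hypothetical activation vector. In practice the embedding usually passes through a small learned projection first, which is omitted here.

```python
import math

def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map a timestep t to a dim-dimensional vector of sines and cosines (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Hypothetical intermediate activation for one spatial position.
activation = [0.2, -0.1, 0.4, 0.0]
emb = sinusoidal_time_embedding(t=10, dim=len(activation))
conditioned = [a + e for a, e in zip(activation, emb)]  # time-conditioned activation
```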

### Parameter Interpretation
Since the estimated parameters are the weights of large neural networks ("giant inscrutable matrices"), we do not attempt to interpret them.

However, once these parameters are learned, the noise predictor approximates a scaled negative score of the noisy latent distribution, $\epsilon_\theta(z_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{z_t} \log p(z_t)$. Sampling can therefore follow the gradient of the log density to denoise pure-noise latents into ones that can be reconstructed into plausible-looking images by the VAE decoder.
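The link between noise prediction and the score can be read off the closed-form forward process $p(z_t \mid z_0)$ defined earlier. The short derivation below is a sketch: the first line differentiates the Gaussian log density, and the second averages over $z_0$ given $z_t$, using the fact that the squared-error-optimal predictor is $\mathbb{E}[\epsilon \mid z_t]$.

$$ \nabla_{z_t} \log p(z_t \mid z_0) = -\frac{z_t - \sqrt{\bar{\alpha}_t}\, z_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}} $$

$$ \nabla_{z_t} \log p(z_t) = \mathbb{E}_{z_0 \sim p(z_0 \mid z_t)} \left[ \nabla_{z_t} \log p(z_t \mid z_0) \right] = -\frac{\mathbb{E}[\epsilon \mid z_t]}{\sqrt{1 - \bar{\alpha}_t}} \approx -\frac{\epsilon_\theta(z_t, t)}{\sqrt{1 - \bar{\alpha}_t}} $$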

## Results and Validation
* What will your results show?
* How will you quantify how well your approach answered the question?
* What other models and methods will you compare against?
* How do you validate your answers and uncertainty?
* What figures/tables will you use?

Our results will show whether diffusing in a low dimensional latent space is helpful compared to running diffusion directly on pixels. Our primary goal is to evaluate whether the latent diffusion model (LDM) can still generate realistic images. If the samples are blurry or inconsistent, then reducing the dimensionality would not be worth it. We will then test whether the runtime improves.

To measure how well the model generates images, we will use Fréchet Inception Distance (FID), which measures the distance between the distribution of generated images and the distribution of real images in the feature space of a pretrained Inception network. Lower FID means the model is generating higher quality images. We will also measure precision, which tells us how often the samples look like actual MNIST digits, letting us know if our model captures the structure of MNIST. Finally, we will measure recall, which tells us whether the samples cover the entire range of digits.
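For reference, FID is usually computed by fitting a Gaussian to the Inception features of each image set. The notation below ($\mu_r, \Sigma_r$ for real images and $\mu_g, \Sigma_g$ for generated ones) is introduced here for illustration:

$$ \text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right) $$

The first term compares the feature means and the second compares the feature covariances, so FID is zero only when both match.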

We will compare our LDM against a pixel space diffusion model, a GAN, and a standard VAE. We will measure training time per iteration, generation speed, and memory usage. Since the latent space has fewer dimensions than the raw image, we expect the diffusion to be faster and less memory intensive. If the LDM achieves similar FID, precision, and recall but is noticeably faster and more memory efficient, then the approach is successful.

We will include generated images from both models, tables with the FID, precision, recall, and resource usage, and a plot comparing the training and generation time. 