
## I. VAE

#### Q1:

We sample an image by **ancestral sampling**: each variable is sampled in the order given by the dependencies of our model, parents before children. The process is as follows:

1. **Sample the latent variable $z_n$:**

   $$
   z_n \sim \mathcal{N}(0, \mathbf{I}_d)
   $$

   where $\mathbf{I}_d$ is the $d \times d$ identity matrix and $d$ is the dimension of the latent space.

2. **Compute the output of the decoder:**

   $$
   f_\theta(z_n)
   $$

   where $f_\theta$ is the decoder network parameterized by $\theta$.

3. **Sample the pixels of the image:**

   $$
   x_m \sim \mathcal{B}(f_\theta(z_n)_m)
   $$

   where $\mathcal{B}$ represents a Bernoulli distribution, and $f_\theta(z_n)_m$ is the probability that pixel $m$ equals 1.

---

#### **Note**:
In step 3, **all pixels can be sampled in parallel**, as the pixels are conditionally independent given $z_n$ by definition in our model.
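
The three steps above can be sketched in a few lines of NumPy. This is only an illustration: `toy_decoder` is a hypothetical stand-in (a random linear map followed by a sigmoid) for the trained network $f_\theta$, and the sizes `latent_dim` and `num_pixels` are assumed values, not taken from the assignment.

```python
import numpy as np

def sample_image(decoder, latent_dim, rng):
    """Ancestral sampling: z ~ N(0, I_d), then x_m ~ Bernoulli(f(z)_m)."""
    z = rng.standard_normal(latent_dim)   # step 1: sample the latent variable
    probs = decoder(z)                    # step 2: per-pixel Bernoulli parameters
    # step 3: all pixels are sampled at once (conditionally independent given z)
    x = (rng.random(probs.shape) < probs).astype(np.uint8)
    return x, probs

# Hypothetical stand-in for the trained decoder f_theta (assumption).
rng = np.random.default_rng(0)
latent_dim, num_pixels = 20, 784
W = 0.5 * rng.standard_normal((num_pixels, latent_dim))
b = np.zeros(num_pixels)

def toy_decoder(z):
    # Sigmoid keeps every output in (0, 1), as a Bernoulli parameter must be.
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))

x, probs = sample_image(toy_decoder, latent_dim, rng)
```

Comparing a uniform random array against `probs` draws every pixel in one vectorized operation, which is the parallelism mentioned in the note.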


#### Q2:

This method is inefficient because evaluating $p(x_n \mid z_n)$ is costly, and the cost increases as the dimension grows. Specifically:

- Evaluating $p(x_n \mid z_n)$ for a single latent sample $z_n$ is a product over all $M$ pixels, so it requires $M$ multiplications, and this has to be repeated for every sampled $z_n$.

Additionally, this method can lead to **numerical underflow**, because it multiplies a large number of values between 0 and 1.
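
The underflow issue can be reproduced with a small experiment. The sketch below uses made-up pixel probabilities and an assumed image size `M`; it compares the naive product of per-pixel probabilities with the sum of their logarithms, which is the usual way to keep the computation in a representable range.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10_000  # assumed number of pixels, for illustration only

# Made-up per-pixel Bernoulli parameters and one binary image drawn from them.
probs = rng.uniform(0.05, 0.95, size=M)
x = (rng.random(M) < probs).astype(np.float64)

# Probability assigned to each observed pixel value.
per_pixel = np.where(x == 1, probs, 1.0 - probs)

# Naive product of M numbers in (0, 1): underflows to exactly 0.0 in float64.
naive_likelihood = np.prod(per_pixel)

# Working in log space turns the product into a sum and stays finite.
log_likelihood = np.sum(np.log(per_pixel))

print(naive_likelihood, log_likelihood)
```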


#### Q3: KL Divergence Analysis

1. **Positivity of KL Divergence**:
   Using **Jensen's inequality**, we can see that the KL divergence is always non-negative:
   $$
   D_{\text{KL}}(p \| q) \geq 0
   $$

2. **Equality Case**:
   The divergence is zero if and only if $p = q$ (almost everywhere), so a very small divergence means $p \approx q$.

3. **Infinite Divergence**:
   To get an infinite KL divergence $D_{\text{KL}}(p \| q)$, take $q$ such that its support does not entirely cover the support of $p$, i.e., there is a region where $p(x) > 0$ but $q(x) = 0$.

   - For **Gaussian distributions**, this is not possible because both $p$ and $q$ have support over all of $\mathbb{R}$.
   - However, we can approximate this behavior by choosing $p$ and $q$ such that:
     - $\mu_p$ and $\mu_q$ (the means) are far apart, i.e., $|\mu_p - \mu_q|$ is large.
     - $\sigma_p$ and $\sigma_q$ (the standard deviations) are small.

   This results in two spikes that barely overlap, leading to a very large divergence.
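
To make the Gaussian case concrete, the sketch below evaluates the standard closed form for the KL divergence between two univariate Gaussians, $D_{\text{KL}}(p \| q) = \log\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$. The function name and the particular means and standard deviations are illustrative choices, not values from the assignment.

```python
import math

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL(p || q) for univariate Gaussians p and q."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

# Identical distributions: the divergence is zero.
kl_same = kl_gaussians(0.0, 1.0, 0.0, 1.0)

# Nearly identical distributions: the divergence is tiny.
kl_close = kl_gaussians(0.0, 1.0, 0.01, 1.0)

# Two narrow spikes far apart: the divergence is very large but still finite.
kl_far = kl_gaussians(-10.0, 0.1, 10.0, 0.1)

print(kl_same, kl_close, kl_far)
```

The last case grows like $(\mu_p - \mu_q)^2 / (2\sigma_q^2)$, so it can be made arbitrarily large, yet it never becomes infinite for finite means and positive variances.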


#### Q4: Rewriting $\log p(x_n)$

The marginal likelihood $\log p(x_n)$ can be expressed as:

$$
\log p(x_n) = \log \int p(x_n, z_n) \, dz_n
$$

Directly computing this is intractable in general: when the decoder is a neural network, the integral over the latent variable $z_n$ has no closed form.


#### Q5: Pushing up the ELBO

When the ELBO is pushed up, two things can happen:

- $\log p(x_n)$ can increase. In that case, our model assigns higher probability to the observed data. This directly improves the quality of the generated images.

- $D_{\text{KL}}(q(Z \mid x_n) \| p(Z \mid x_n))$ can decrease. In that case, we are learning a better encoder, since the approximate posterior $q(Z \mid x_n)$ moves closer to the true posterior $p(Z \mid x_n)$.

Either way, optimizing the lower bound (ELBO) improves the model by increasing the likelihood of the data or by aligning the encoder's distribution with the true posterior, ultimately yielding better image generation.
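
The two cases above follow from the standard decomposition of the log-likelihood, written here as a short derivation in the notation of this answer:

$$
\begin{aligned}
\log p(x_n)
&= \mathbb{E}_{q(Z \mid x_n)}\left[ \log \frac{p(x_n, Z)}{p(Z \mid x_n)} \right] \\
&= \underbrace{\mathbb{E}_{q(Z \mid x_n)}\left[ \log \frac{p(x_n, Z)}{q(Z \mid x_n)} \right]}_{\text{ELBO}}
 + \underbrace{\mathbb{E}_{q(Z \mid x_n)}\left[ \log \frac{q(Z \mid x_n)}{p(Z \mid x_n)} \right]}_{D_{\text{KL}}(q(Z \mid x_n) \| p(Z \mid x_n))}
\end{aligned}
$$

Since the KL term is non-negative, any increase of the ELBO must be absorbed by a larger $\log p(x_n)$, a smaller KL term, or both.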




## II. Diffusion models

#### Q11:

Using the definition of the marginal likelihood:

$$
\log p(x) = \log \int p(x, z_{1:T}) \, dz_{1:T},
$$

we introduce the variational distribution $q_\phi(z_{1:T} | x)$ to make the integral more tractable. By multiplying and dividing by $q_\phi(z_{1:T} | x)$, we get:

$$
\log p(x) = \log \int q_\phi(z_{1:T} | x) \frac{p(x, z_{1:T})}{q_\phi(z_{1:T} | x)} \, dz_{1:T}.
$$

Applying Jensen's inequality (the logarithm is concave, so $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$), we obtain:

$$
\log p(x) \geq \int q_\phi(z_{1:T} | x) \log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T} | x)} \, dz_{1:T}.
$$

Simplifying the right-hand side, this becomes:

$$
\log p(x) \geq \mathbb{E}_{q_\phi(z_{1:T} | x)} \left[ \log p(x, z_{1:T}) - \log q_\phi(z_{1:T} | x) \right].
$$

Thus, the **Evidence Lower Bound (ELBO)** for the Markovian Hierarchical Variational Autoencoder is:

$$
\log p(x) \geq \mathbb{E}_{q_\phi(z_{1:T} | x)} \left[ \log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T} | x)} \right].
$$
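
The bound can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration: it replaces the continuous latents $z_{1:T}$ with a single discrete latent taking `K` values, so that the integral becomes a finite sum and $\log p(x)$ can be computed exactly. The joint probabilities are arbitrary made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # assumed number of discrete latent states (toy stand-in for z_{1:T})

# Made-up joint probabilities p(x, z) for one fixed observation x.
joint = rng.uniform(0.01, 0.1, size=K)

# Exact log marginal likelihood: the integral reduces to a finite sum.
log_px = np.log(joint.sum())

def elbo(q):
    """E_q[log p(x, z) - log q(z | x)] for a discrete distribution q."""
    return np.sum(q * (np.log(joint) - np.log(q)))

# An arbitrary variational distribution gives a value below log p(x).
q_random = rng.dirichlet(np.ones(K))
elbo_random = elbo(q_random)

# The true posterior p(z | x) makes the bound tight.
q_posterior = joint / joint.sum()
elbo_posterior = elbo(q_posterior)

print(log_px, elbo_random, elbo_posterior)
```

The gap `log_px - elbo_random` is exactly the KL divergence between `q_random` and the true posterior, which is why it vanishes when `q_posterior` is used.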


#### Q12:

The model is an encoder-decoder architecture in which each step of the encoding phase adds noise to the output of the previous step. The observed data point $x$ is treated as the initial latent variable $z_0$ in the hierarchical framework.

1. During the forward diffusion process, the latent variable $z_t$ is produced at each timestep by applying a linear Gaussian transformation to $z_{t-1}$, introducing noise as defined by the model.

2. During the reverse (generation) process, the variational diffusion model (VDM) attempts to reconstruct the original data $x$ by iteratively denoising the latent variables $z_T, z_{T-1}, \dots, z_1$.

3. At each reverse timestep, the model predicts the noise added at that step to guide the reconstruction process.
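
The forward process in step 1 can be sketched as follows. The answer above only says that each step is a linear Gaussian transformation; the specific form $z_t = \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the common variance-preserving choice and is an assumption here, as are the linear schedule `alphas`, the number of steps `T`, and the random array standing in for an image.

```python
import numpy as np

def forward_diffusion(x, alphas, rng):
    """Return [z_0, z_1, ..., z_T] with z_0 = x (assumed variance-preserving steps)."""
    latents = [x]
    for alpha_t in alphas:
        noise = rng.standard_normal(x.shape)
        # Linear Gaussian step: shrink the previous latent and add fresh noise.
        z_t = np.sqrt(alpha_t) * latents[-1] + np.sqrt(1.0 - alpha_t) * noise
        latents.append(z_t)
    return latents

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2_000)   # stand-in for a flattened image
T = 1000                                # assumed number of diffusion steps
alphas = np.linspace(0.9999, 0.98, T)   # assumed noise schedule

latents = forward_diffusion(x, alphas, rng)
z_T = latents[-1]

# After many steps the signal is almost gone and z_T is close to N(0, I).
print(z_T.mean(), z_T.std())
```

The reverse process would start from a sample like `z_T` and apply a learned denoising step at each timestep; that part needs a trained network and is not sketched here.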