# Assignment 3
Lucas Bezerra, 171412, lucas.camaradantasbezerra@kaust.edu.sa

### 1. Zero-Shot Learning and Vision-Language

1.1 Basic Understanding

- <strong>What are the differences between GAZSL [2] and CIZSL [1]?</strong>

CIZSL builds on GAZSL by adding a creativity-inspired loss to the generator's objective. Besides the regular adversarial loss, hallucinated class descriptions $t^h$ are generated as convex combinations of two training-set class descriptions ($t^h = \alpha t_a^s + (1-\alpha)t_b^s\;,\;\alpha \in \left[0.2,0.8\right]$). Then, for any $t^h \sim p_{text}^h, z \sim p_z$, the generator is trained to:

1. Maximize the likelihood that the generator output (given hallucinated text $t^h, z$) is classified as real by the discriminator.
2. Maximize the entropy of the discriminator's class prediction on the generator output (given hallucinated text $t^h, z$). This encourages the generator to create samples that the discriminator cannot confidently assign to any particular seen class, so the hallucinated samples deviate from the classes seen during training.

The CIZSL paper [1] reports that this extra loss improves zero-shot performance compared to GAZSL.

<span style="color:red">Anything else?</span>

- <strong>How the creativity loss is connected with the classification head over
classes? Why it can be helpful?</strong>

As stated in the previous question, the creativity loss is computed on the output of the discriminator's classification head: it encourages the generator to create samples that the head cannot confidently assign to any seen class. This is helpful because it prevents the generator from simply reproducing the classes it saw in the dataset, which encourages creativity when generating samples for new classes. The creativity is kept in check, however, since the discriminator still needs to be tricked into believing the samples are real, as in the traditional GAN setting.
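
To make the link to the classification head concrete, here is a minimal NumPy sketch of the quantity being maximized: the Shannon entropy of the softmax over seen classes. The function names are hypothetical, and it assumes the head outputs raw logits; CIZSL itself uses a generalized divergence, so this is only the simplest special case.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def class_entropy(logits, eps=1e-12):
    # Shannon entropy of the predicted class distribution, one value per sample.
    p = softmax(logits)
    return -np.sum(p * np.log(p + eps), axis=-1)

# A confident prediction has low entropy; a flat one has the maximum, log(K).
confident = np.array([[10.0, 0.0, 0.0, 0.0, 0.0]])
flat = np.zeros((1, 5))
print(class_entropy(confident), class_entropy(flat))
```

Maximizing this value on samples generated from hallucinated text pushes the predicted distribution towards uniform over the seen classes.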

- Run the code of CIZSL on one text-based dataset (e.g., CUB-wiki).
Please report your performance using the provided hyperparameters (your
performance may slightly different from the reported due to instability and
different hyper-parameters). You can find the code here.

Interpolation between two real text features:
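
A minimal NumPy sketch of this interpolation, following $t^h = \alpha t_a^s + (1-\alpha)t_b^s$ with $\alpha \in [0.2, 0.8]$. The function name, the uniform sampling of $\alpha$, and the way the class pair is drawn are assumptions for illustration, not the code from the CIZSL repository.

```python
import numpy as np

def hallucinate_text(text_feats, n_samples, rng, alpha_low=0.2, alpha_high=0.8):
    """Mix pairs of seen-class text features into hallucinated ones.

    text_feats: array of shape (n_classes, feat_dim), one row per seen class.
    """
    n_classes = text_feats.shape[0]
    a = rng.integers(0, n_classes, size=n_samples)
    # A non-zero offset modulo n_classes guarantees b != a.
    offset = rng.integers(1, n_classes, size=n_samples)
    b = (a + offset) % n_classes
    # Assumption: alpha is drawn uniformly from [alpha_low, alpha_high].
    alpha = rng.uniform(alpha_low, alpha_high, size=(n_samples, 1))
    return alpha * text_feats[a] + (1.0 - alpha) * text_feats[b]

rng = np.random.default_rng(0)
toy_feats = rng.normal(size=(5, 7))      # 5 toy classes, 7-dim text features
t_h = hallucinate_text(toy_feats, 16, rng)
print(t_h.shape)
```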

Implement SM Divergence suitable for our case: 
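
As a hedged sketch, the block below implements the textbook Sharma-Mittal divergence between two discrete distributions, $D_{\alpha,\beta}(p\|q) = \frac{1}{\beta-1}\left[\left(\sum_i p_i^{\alpha} q_i^{1-\alpha}\right)^{\frac{1-\beta}{1-\alpha}} - 1\right]$. The defaults mirror the `SM_Alpha=0.5` and `SM_Beta=0.9999` values in the run log below. The function name and the argument order are assumptions, and the convention used in the CIZSL code may differ.

```python
import numpy as np

def sm_divergence(p, q, alpha=0.5, beta=0.9999, eps=1e-12):
    """Sharma-Mittal divergence between distributions on the last axis.

    Requires alpha != 1 and beta != 1 (those cases are defined as limits).
    """
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    # Renormalize after clipping so both inputs still sum to one.
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    inner = np.sum(p ** alpha * q ** (1.0 - alpha), axis=-1)
    return (inner ** ((1.0 - beta) / (1.0 - alpha)) - 1.0) / (beta - 1.0)

# Divergence of a peaked class prediction from the uniform distribution.
pred = np.array([0.7, 0.1, 0.1, 0.1])
uniform = np.full(4, 0.25)
print(sm_divergence(pred, uniform))
```

Minimizing this divergence to the uniform distribution over seen classes plays the same role as maximizing the entropy of the class prediction.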

In [1]:
# %%bash
%cd CIZSL
!./run.sh

/home/camaral/code/gans_course/hw5/CIZSL
Namespace(dataset='CUB', splitmode='hard', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight=0.1, validate=0, SM_Alpha=0.5, SM_Beta=0.9999, gpu='0', manualSeed=None, resume=None, disp_interval=20, save_interval=200, evl_interval=10)
Random Seed:  2010
100%|█████████████████████████████████████| 3001/3001 [2:15:51<00:00,  2.72s/it]
Reproduce CUB hard
Accuracy is 14.26%, and Generalized AUC is 11.53%
Namespace(dataset='NAB', splitmode='hard', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight=0.1, validate=0, SM_Alpha=0.5, SM_Beta=0.9999, gpu='0', manualSeed=None, resume=None, disp_interval=20, save_interval=200, evl_interval=10)
Random Seed:  6154
100%|█████████████████████████████████████| 3001/3001 [8:51:46<00:00, 10.63s/it]
Reproduce NAB hard
Accuracy is 8.936%, and Generalized AUC is 6.715%
Namespace(dataset='NAB', splitmode='easy', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight

### 2. Score-Based Generative Modeling

2.1 Basic Understanding

- <strong>Describe the pipeline logic of [4] (i.e., forward and backward steps).</strong>

The forward process is a diffusion process, described by a Stochastic Differential Equation (SDE), that slowly adds noise to the data distribution ($p_0(x)$), a little at each time step, up to the final time $T$, where the distribution $p_T(x)$ is essentially pure noise and nothing of the original data is left.

The diffusion process can be reversed by another diffusion process that runs backwards in time, from $T$ to $0$. The reverse process depends on the forward one only through the score $\nabla_x \log p_t(x), t \in \left[ 0,T\right]$, so it can be obtained by estimating the score. The score estimate $s_\theta(x(t),t)$ is learned during training, and new samples are generated by drawing from the noise distribution and simulating the reverse process.
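
For reference, the two steps can be written out in the notation of [4]. The drift $f$, the diffusion coefficient $g$ and the weighting $\lambda(t)$ are symbols taken from that paper and are not defined elsewhere in this report. The forward and reverse-time SDEs are:

$$ dx = f(x,t)\,dt + g(t)\,dw \quad\text{(forward)}, \qquad dx = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w} \quad\text{(reverse)} $$

where $\bar{w}$ is a Wiener process running backwards in time from $T$ to $0$. The score network is fit by denoising score matching against the known perturbation kernel $p_{0t}(x(t)|x(0))$:

$$ \theta^* = \arg\min_\theta \mathbb{E}_t\left\{\lambda(t)\,\mathbb{E}_{x(0)}\mathbb{E}_{x(t)|x(0)}\left[\left\| s_\theta(x(t),t) - \nabla_{x(t)}\log p_{0t}(x(t)|x(0))\right\|_2^2\right]\right\} $$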

- <strong>What is Energy-Based Models (EBMs) and Score-Based Generative Models (SBGMs)?</strong>

Energy-based models are those that model a data distribution in the form:

$$ p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta},\;\text{where: } Z_\theta = \int e^{-E_\theta(x)} dx$$

$Z_\theta$ is intractable, as it requires integrating over the whole data space. The usual approach is therefore to work with the gradient of the log-likelihood, $\nabla_\theta \log p_\theta(x) = - \nabla_\theta E_\theta(x) + \mathbb{E}_{x'\sim p_\theta(x')}\left[\nabla_\theta E_\theta(x') \right]$, and use gradient ascent to maximize the log-likelihood of a dataset under $p_\theta$. The expectation is estimated with samples drawn from the model itself, typically via MCMC.

However, even if a model for the distribution is given, it is still hard to sample from it. Score-based sampling techniques such as Langevin MCMC only need the score $\nabla_x \log p_t(x)$, which does not involve the normalizing constant, so they can sample from a data distribution without estimating the normalized density itself. Methods that learn an estimate of the score from data and use score-based sampling to generate data similar to a given dataset are called Score-Based Generative Models.
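
The gradient identity follows in two lines from the definition of $p_\theta$, writing $x'$ for the integration variable to keep it apart from the data point $x$:

$$ \log p_\theta(x) = -E_\theta(x) - \log Z_\theta $$

$$ \nabla_\theta \log Z_\theta = \frac{1}{Z_\theta}\int \nabla_\theta e^{-E_\theta(x')}\,dx' = -\int \frac{e^{-E_\theta(x')}}{Z_\theta}\,\nabla_\theta E_\theta(x')\,dx' = -\mathbb{E}_{x'\sim p_\theta(x')}\left[\nabla_\theta E_\theta(x')\right] $$

Subtracting the second expression from $-\nabla_\theta E_\theta(x)$ gives the gradient of the log-likelihood, with the intractable $Z_\theta$ replaced by an expectation under the model.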

- <strong>What is the difference among Euler-Maruyama sampling, Langevin MCMC sampling, and Predictor-Corrector (PC) sampling?</strong>

The Euler-Maruyama sampling method applies the Euler-Maruyama SDE solver to numerically integrate the reverse-time SDE, which transports a sample from the prior to the data distribution. Starting from a sample of the prior, it takes small discrete steps backwards in time, each one using the learned score, and the final state is the newly generated sample.
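
As an illustration, the sketch below runs Euler-Maruyama on the reverse-time VE SDE for a 1-D toy problem. Everything specific here is an assumption made for the example: the data distribution is a Gaussian $N(\mu, s_0^2)$ so that the exact score is known in closed form and stands in for a trained $s_\theta$, the noise schedule is geometric, and the constants are arbitrary.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 25.0   # assumed noise range
MU, S0 = 2.0, 0.5                   # toy data distribution N(MU, S0^2)

def sigma(t):
    # Geometric noise schedule on t in [0, 1].
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def diffusion_coeff(t):
    # g(t) = sqrt(d[sigma^2(t)]/dt) for the geometric schedule.
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))

def true_score(x, t):
    # Exact score of p_t = N(MU, S0^2 + sigma(t)^2 - SIGMA_MIN^2); replaces s_theta.
    var_t = S0 ** 2 + sigma(t) ** 2 - SIGMA_MIN ** 2
    return -(x - MU) / var_t

def euler_maruyama_sampler(score_fn, n_samples, n_steps, rng):
    dt = 1.0 / n_steps
    x = SIGMA_MAX * rng.standard_normal(n_samples)   # sample from the prior
    for t in np.linspace(1.0, dt, n_steps):          # integrate from t=1 to t=0
        g = diffusion_coeff(t)
        z = rng.standard_normal(n_samples)
        # Reverse-time step with zero drift: x <- x + g^2 * score * dt + g * sqrt(dt) * z
        x = x + g ** 2 * score_fn(x, t) * dt + g * np.sqrt(dt) * z
    return x

samples = euler_maruyama_sampler(true_score, 5000, 1000, np.random.default_rng(0))
print(samples.mean(), samples.std())
```

With enough steps, the sample mean and standard deviation should land close to `MU` and `S0`.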

The Langevin MCMC method generates samples by iterating over the equation:

$$x_i^m = x_i^{m-1} + \varepsilon_i s_\theta (x_i^{m-1}, \sigma_i) + \sqrt{2 \varepsilon_i} z_i^m,\quad m=1,2,3,\dots,M$$

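It starts from an initial point $x_i^0$ and iterates $M$ times at noise level $\sigma_i$, ending at the sample $x_i^M$. In annealed sampling as used in SMLD, the noise levels are visited from the largest to the smallest, and each chain is initialized with the final sample of the previous level.

The Predictor-Corrector (PC) class of samplers combines the two previous methods. At each time step the Predictor, which can be any numerical SDE solver (e.g. Euler-Maruyama), proposes the next sample, and the Corrector, which can be any score-based MCMC approach (e.g. Langevin MCMC), refines it so that it better matches the marginal distribution at that time. The PC samplers proposed in [4] are tailored for reverse diffusion sampling, where the authors report better results than with either of the previously mentioned techniques alone.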
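
A minimal sketch of the update above for a single, fixed noise level. It assumes a hypothetical `score_fn` that returns the score at `x`; here the exact score of a Gaussian target stands in for a trained $s_\theta$.

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size, n_steps, rng):
    """Iterate x <- x + eps * score(x) + sqrt(2 * eps) * z for n_steps."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * z
    return x

# Toy target N(3, 1), whose score is -(x - 3).
def gaussian_score(x):
    return -(x - 3.0)

rng = np.random.default_rng(0)
chains = langevin_sample(gaussian_score, np.zeros(4000), 0.01, 2000, rng)
print(chains.mean(), chains.std())
```

The chains start far from the target at zero and drift towards it. A smaller step size reduces the discretization bias at the cost of more iterations.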
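
The following sketch shows how a predictor step and a corrector step can be interleaved, on a 1-D toy problem for the VE SDE. The Gaussian data distribution (which gives a closed-form score in place of a trained $s_\theta$), the geometric noise schedule, the constants, and the corrector step-size rule `eps = 2 * (snr * sigma(t))**2` are all assumptions for illustration and are not taken from [4].

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 25.0   # assumed noise range
MU, S0 = 2.0, 0.5                   # toy data distribution N(MU, S0^2)

def sigma(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def diffusion_coeff(t):
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))

def true_score(x, t):
    # Exact score of the perturbed toy distribution; replaces s_theta.
    return -(x - MU) / (S0 ** 2 + sigma(t) ** 2 - SIGMA_MIN ** 2)

def pc_sampler(score_fn, n_samples, n_steps, rng, snr=0.16, n_corrector=1):
    dt = 1.0 / n_steps
    x = SIGMA_MAX * rng.standard_normal(n_samples)   # sample from the prior
    for t in np.linspace(1.0, dt, n_steps):
        # Predictor: one Euler-Maruyama step of the reverse-time SDE (t -> t - dt).
        g = diffusion_coeff(t)
        z = rng.standard_normal(n_samples)
        x = x + g ** 2 * score_fn(x, t) * dt + g * np.sqrt(dt) * z
        # Corrector: Langevin MCMC at the new time, targeting p_{t - dt}.
        t_next = t - dt
        eps = 2.0 * (snr * sigma(t_next)) ** 2       # hypothetical step-size rule
        for _ in range(n_corrector):
            z = rng.standard_normal(n_samples)
            x = x + eps * score_fn(x, t_next) + np.sqrt(2.0 * eps) * z
    return x

samples = pc_sampler(true_score, 5000, 500, np.random.default_rng(0))
print(samples.mean(), samples.std())
```

Setting `n_corrector=0` recovers the predictor-only (Euler-Maruyama) sampler, which makes the two easy to compare on the same problem.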

- <strong>What is the difference among VE SDE, VP SDE, and sub-VP SDE?</strong>

Variance Exploding (VE) SDE: The continuous-time version of the perturbation kernels used in SMLD. It is called variance exploding because its variance grows without bound as $t\to \infty$. It is given by:

$$ dx = \sqrt{\frac{d\left[\sigma^2(t)\right]}{dt}} dw $$

Variance Preserving (VP) SDE: The continuous-time version of the perturbation kernels used in DDPM. It is called variance preserving because its variance stays equal to one at all times, provided the initial distribution also has unit variance. It is given by:

$$ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}dw $$

Sub-VP SDE: A new SDE proposed by the authors of [4], whose variance is always bounded above by that of the VP SDE at every intermediate time step. It is given by:

$$ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)\left(1-e^{-2\int_0^t \beta(s)ds}\right)}dw $$
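
The difference between the three SDEs is easiest to see in the variance of their perturbation kernels $p_{0t}(x(t)|x(0))$. The sketch below evaluates these variances, assuming a geometric $\sigma(t)$ for the VE SDE and a linear $\beta(t)$ for the VP and sub-VP SDEs; the schedules, constants and function names are assumptions for illustration.

```python
import numpy as np

def ve_variance(t, sigma_min=0.01, sigma_max=50.0):
    # VE: sigma^2(t) - sigma^2(0), with sigma(t) = sigma_min * (sigma_max / sigma_min)^t.
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t
    return sigma_t ** 2 - sigma_min ** 2

def beta_integral(t, beta_min=0.1, beta_max=20.0):
    # Integral of beta(s) = beta_min + s * (beta_max - beta_min) from 0 to t.
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def vp_variance(t):
    # VP: 1 - exp(-int_0^t beta(s) ds)
    return 1.0 - np.exp(-beta_integral(t))

def sub_vp_variance(t):
    # sub-VP: (1 - exp(-int_0^t beta(s) ds))^2, never larger than the VP variance.
    return (1.0 - np.exp(-beta_integral(t))) ** 2

for t in np.linspace(0.0, 1.0, 6):
    print(f"t={t:.1f}  VE={ve_variance(t):10.4f}  VP={vp_variance(t):.4f}  subVP={sub_vp_variance(t):.4f}")
```

The VE variance grows by orders of magnitude, while the VP and sub-VP variances stay in $[0, 1]$, with the sub-VP one below the VP one at every $t$.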

- <strong>How and why SDE is connected with SBGMs?</strong>

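The transformation from the data distribution to the prior distribution can be modelled as a diffusion process, i.e. the solution of an SDE, and the reverse of a diffusion process is also a diffusion process. The drift of the reverse-time SDE depends on the score $\nabla_x \log p_t(x)$, which is exactly the quantity that SBGMs learn to estimate. Once the score estimate is plugged in, samples can be generated by solving the reverse-time SDE with regular SDE solvers (e.g. Euler-Maruyama), optionally combined with score-based MCMC methods (e.g. Langevin MCMC) that refine the samples.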

- <strong>How the likelihood is computed in probabilistic flow ODE? Why this can not be done for normal SDE?</strong>



- <strong>Clarify potential disadvantages of discrete noisy perturbation (Hint,
suggest reading [5].)</strong>