# Assignment 3
Lucas Bezerra, 171412, lucas.camaradantasbezerra@kaust.edu.sa

### 1. Zero-Shot Learning and Vision-Language

1.1 Basic Understanding

- <strong>What are the differences between GAZSL [2] and CIZSL [1]?</strong>

CIZSL builds on GAZSL by adding a creativity-inspired loss to the generator's training objective, on top of the regular adversarial loss. Hallucinated class descriptions $t^h$ are generated as convex combinations of pairs of training-set class descriptions ($t^h = \alpha t_a^s + (1-\alpha)t_b^s\;,\;\alpha \in \left[0.2,0.8\right]$). Then, for any $t^h \sim p_{text}^h, z \sim p_z$, the generator is trained to:

1. Maximize the likelihood that the generator output (given hallucinated text $t^h$ and noise $z$) is classified as real by the discriminator.
2. Maximize the entropy of the discriminator's class prediction for the generator output (given hallucinated text $t^h$ and noise $z$). This encourages the generator to produce visual features that the discriminator cannot confidently assign to any seen class, which pushes the hallucinated features away from the seen classes.

According to the results reported in [1], this additional loss improves zero-shot performance over GAZSL.


- <strong>How is the creativity loss connected with the classification head over
classes? Why can it be helpful?</strong>

As stated in the previous question, the creativity loss is computed on the output of the discriminator's classification head: it encourages the generator to create visual features that this head cannot confidently assign to any class. This is helpful because it prevents the generator from simply reproducing the classes it saw in the dataset, thus encouraging creativity when generating features for new classes. The creativity is still bounded, since the discriminator must also be tricked into believing the generated features are real, as in the traditional GAN setting.
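
The role of the entropy term can be illustrated with a small, dependency-free sketch. It is not the CIZSL implementation: the number of classes and the logits are hypothetical values chosen only to show that a prediction spread evenly over the seen classes has maximal entropy, while a confident prediction has entropy close to zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


num_seen_classes = 4  # hypothetical number of seen classes

# Hypothetical classification-head logits for two generated samples.
confident = softmax([8.0, 0.0, 0.0, 0.0])  # clearly assigned to seen class 0
undecided = softmax([0.0, 0.0, 0.0, 0.0])  # not assignable to any seen class

h_confident = entropy(confident)
h_undecided = entropy(undecided)
h_max = math.log(num_seen_classes)  # upper bound, reached by the uniform case
```

Maximizing this entropy for features generated from hallucinated text therefore rewards the second kind of prediction.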

- Run the code of CIZSL on one text-based dataset (e.g., CUB-wiki).
Please report your performance using the provided hyperparameters (your
performance may differ slightly from the reported one due to instability and
different hyper-parameters). You can find the code here.

Interpolation between two real text features:

In [None]:
# TODO: Interpolation between two real text features

# Take all text descriptions
text_feat_1 = np.array([dataset.train_text_feature[i, :] for i in labels])
text_feat_2 = np.copy(text_feat_1)
# Shuffle them to get random pairs of text descriptions
np.random.shuffle(text_feat_1)  
np.random.shuffle(text_feat_2)
# Sample alpha (the interpolation coefficient used to create hallucinated text descriptions)
alpha = np.random.uniform(low=.2, high=.8, size=len(labels))

# Compute hallucinated text (combinations of text descriptions from training set)
text_feat_mean = alpha*text_feat_1.T + (1-alpha)*text_feat_2.T
# Normalize hallucinated features (l2-norm = 1)
text_feat_mean = normalize(text_feat_mean.T, norm='l2', axis=1)

Implement SM Divergence suitable for our case:

In [2]:
# TODO: Implement SM Divergence suitable for our case.

# Uniform distribution: 1/Ks for all classes
q_shape = Variable(torch.FloatTensor(G_fake_C.data.size(0), G_fake_C.data.size(1))).cuda()
q_shape.data.fill_(1.0 / G_fake_C.data.size(1))

SM_ab = F.sigmoid(log_SM_ab(ones))
# Compute Alpha
SM_a = 0.2 + torch.div(SM_ab[0][0], 1.6666666666666667).cuda()
# Compute Beta
SM_b = 0.2 + torch.div(SM_ab[0][1], 1.6666666666666667).cuda()
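# Exponent as implemented here: (1-Alpha)/(1-Beta)
# NOTE: the Sharma-Mittal divergence is usually written with exponent (1-Beta)/(1-Alpha)
pow_a_b = torch.div(1 - SM_a, 1 - SM_b)
# Base: sum over all classes i of p_i^Alpha * q_i^(1-Alpha)
# (1e-5 is added to p_i for numerical stability)
alpha_term = (torch.pow(G_fake_C + 1e-5, SM_a) * torch.pow(q_shape, 1 - SM_a)).sum(1)
# Result: 1/(Beta-1) * [Base^Exponent - 1]
entropy_GX_fake_vec = torch.div(torch.pow(alpha_term, pow_a_b) - 1, SM_b - 1)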
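
As a sanity check on the formula, here is a minimal plain-Python sketch of the Sharma-Mittal divergence as it is usually stated, with exponent $(1-\beta)/(1-\alpha)$. It is a reference under that assumption, not a drop-in replacement for the cell above; the function names and the example distributions are hypothetical. It checks that the divergence from the uniform distribution to itself is zero, and that for $\beta \to 1$ the value approaches the Rényi divergence.

```python
import math


def sharma_mittal(p, q, alpha, beta):
    """Sharma-Mittal divergence SM_{alpha,beta}(p || q) for discrete p and q.

    Assumes alpha != 1 and beta != 1.
    """
    base = sum((pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q))
    exponent = (1.0 - beta) / (1.0 - alpha)
    return (base ** exponent - 1.0) / (beta - 1.0)


def renyi(p, q, alpha):
    """Renyi divergence of order alpha, the beta -> 1 limit of Sharma-Mittal."""
    base = sum((pi ** alpha) * (qi ** (1.0 - alpha)) for pi, qi in zip(p, q))
    return math.log(base) / (alpha - 1.0)


uniform = [1.0 / 3.0] * 3
peaked = [0.7, 0.2, 0.1]  # hypothetical classifier output

# Same alpha and beta as in the training log (SM_Alpha=0.5, SM_Beta=0.9999).
sm_uniform = sharma_mittal(uniform, uniform, alpha=0.5, beta=0.9999)
sm_peaked = sharma_mittal(peaked, uniform, alpha=0.5, beta=0.9999)
renyi_peaked = renyi(peaked, uniform, alpha=0.5)
```

Minimizing this divergence to the uniform distribution plays the same role as maximizing the entropy of the class prediction.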

Training:

In [1]:
# %%bash
%cd CIZSL
!./run.sh
%cd ..

/home/camaral/code/gans_course/hw5/CIZSL
Namespace(dataset='CUB', splitmode='hard', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight=0.1, validate=0, SM_Alpha=0.5, SM_Beta=0.9999, gpu='0', manualSeed=None, resume=None, disp_interval=20, save_interval=200, evl_interval=10)
Random Seed:  2010
100%|█████████████████████████████████████| 3001/3001 [2:15:51<00:00,  2.72s/it]
Reproduce CUB hard
Accuracy is 14.26%, and Generalized AUC is 11.53%
Namespace(dataset='NAB', splitmode='hard', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight=0.1, validate=0, SM_Alpha=0.5, SM_Beta=0.9999, gpu='0', manualSeed=None, resume=None, disp_interval=20, save_interval=200, evl_interval=10)
Random Seed:  6154
100%|█████████████████████████████████████| 3001/3001 [8:51:46<00:00, 10.63s/it]
Reproduce NAB hard
Accuracy is 8.936%, and Generalized AUC is 6.715%
Namespace(dataset='NAB', splitmode='easy', model_number=2, exp_name='Reproduce', main_dir='./', creativity_weight

1.2 Explorations

- (25pt) Reproduce the results on attribute-based datasets including AWA2 and SUN. You can mainly build your network on top of GAZSL [2].

Since the GAZSL repository already includes the AWA2 and SUN datasets, I built CIZSL on top of it. Namely, I added the creativity loss based on hallucinated class descriptions to the GAZSL code, along with other minor modifications. The script "/ZSL_GAN/train_GBU_CIZSL.py" contains the modifications, which are commented.

In [19]:
%cd ZSL_GAN
!python train_GBU_CIZSL.py --dataset 'AWA2' --preprocessing --z_dim 10 --creativity_weight 0.1
%cd ..

/home/camaral/code/gans_course/hw5/ZSL_GAN
Running parameters:
{
    "dataset":"AWA2",
    "dataroot":"/home/camaral/code/gans_course/hw5/ZSL_GAN/data",
    "matdataset":true,
    "image_embedding":"res101",
    "class_embedding":"att",
    "preprocessing":true,
    "standardization":false,
    "validation":false,
    "gpu":"0",
    "exp_idx":"",
    "manualSeed":null,
    "resume":null,
    "z_dim":10,
    "disp_interval":20,
    "save_interval":200,
    "evl_interval":40,
    "exp_name":"Reproduce",
    "creativity_weight":0.1,
    "validate":0,
    "model_num":2
}
Random Seed:  1445
_netG_att(
  (main): Sequential(
    (0): Linear(in_features=95, out_features=4096, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=4096, out_features=2048, bias=True)
    (3): Tanh()
  )
)
_netD(
  (D_shared): Sequential(
    (0): Linear(in_features=2048, out_features=4096, bias=True)
    (1): ReLU()
  )
  (D_gan): Linear(in_features=4096, out_features=1, bias=True)
  (D_a

In [1]:
%cd ZSL_GAN
!python train_GBU_CIZSL.py --dataset 'SUN' --preprocessing --z_dim 10 --creativity_weight 0.1
%cd ..

/home/camaral/code/gans_course/hw5/ZSL_GAN
Running parameters:
{
    "dataset":"SUN",
    "dataroot":"/home/camaral/code/gans_course/hw5/ZSL_GAN/data",
    "matdataset":true,
    "image_embedding":"res101",
    "class_embedding":"att",
    "preprocessing":true,
    "standardization":false,
    "validation":false,
    "gpu":"0",
    "exp_idx":"",
    "manualSeed":null,
    "resume":null,
    "z_dim":10,
    "disp_interval":20,
    "save_interval":200,
    "evl_interval":40,
    "exp_name":"Reproduce",
    "creativity_weight":0.1,
    "validate":0,
    "model_num":2
}
Random Seed:  9908
_netG_att(
  (main): Sequential(
    (0): Linear(in_features=112, out_features=4096, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Linear(in_features=4096, out_features=2048, bias=True)
    (3): Tanh()
  )
)
_netD(
  (D_shared): Sequential(
    (0): Linear(in_features=2048, out_features=4096, bias=True)
    (1): ReLU()
  )
  (D_gan): Linear(in_features=4096, out_features=1, bias=True)
  (D_a

- (Bonus Point.) (20pt) Replace the original text features with the one extracted from CLIP [3]. (Hint. You only need to get the text features from CLIP instead of re-training it.)

I could not implement this properly because the dataset does not document how it stores the attributes. My idea was to:

- Get attributes for each image (e.g. has orange beak, has purple breast)
- Merge all attributes together in a string ("A bird that has orange beak and has purple breast")
- Tokenize the strings using CLIP: `import clip; token = clip.tokenize(string_batch)`
- Encode the tokens using CLIP: `model, _ = clip.load('ViT-B/32', device='cuda'); text_feats = model.encode_text(token.to('cuda'))`

The resulting text features could then be passed to the model in place of the original ones.
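
The string-merging step of the plan above could look like the following sketch. The helper name `attributes_to_prompt` and the default subject are hypothetical, and the attribute phrases are the two examples from the list; how the attributes are actually read from the dataset is left open, as in the text.

```python
def attributes_to_prompt(attributes, subject="A bird"):
    """Join attribute phrases into one sentence-like prompt for a text encoder."""
    if not attributes:
        return subject
    return f"{subject} that " + " and ".join(attributes)


prompt = attributes_to_prompt(["has orange beak", "has purple breast"])
```

A batch of such prompts would then be the `string_batch` passed to `clip.tokenize`. The tokenizer has a fixed context length, so very long attribute lists may need to be shortened or split.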

### 2. Score-Based Generative Modeling

2.1 Basic Understanding

- <strong>Describe the pipeline logic of [4] (i.e., forward and backward steps).</strong>

The forward process is a diffusion process, described by a Stochastic Differential Equation (SDE), that slowly adds noise to the data distribution $p_0(x)$, a little at each time step, until time $T$, where the distribution $p_T(x)$ is essentially pure noise and no information about the original data is left.

The diffusion process can be reversed by another diffusion process, the reverse-time SDE, which only requires the score $\nabla_x \log p_t(x), t \in \left[ 0,T\right]$. The score is approximated by a time-dependent model $s_\theta(x(t),t)$ that is learned during training; new samples are then generated by drawing from $p_T$ and solving the reverse-time SDE back to $t=0$.
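
For reference, the two processes can be written as follows, in the notation of [4]. Here $f$ is the drift, $g$ the diffusion coefficient, $w$ a standard Wiener process and $\bar{w}$ a Wiener process running backward in time:

$$ dx = f(x,t)\,dt + g(t)\,dw \quad \text{(forward)} $$

$$ dx = \left[ f(x,t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w} \quad \text{(reverse)} $$

The score model is trained with a denoising score matching objective, where $\lambda(t)$ is a positive weighting function and $p_{0t}$ is the perturbation kernel of the forward SDE:

$$ \theta^* = \arg\min_\theta\; \mathbb{E}_t \left\{ \lambda(t)\, \mathbb{E}_{x(0)}\, \mathbb{E}_{x(t)|x(0)} \left[ \left\| s_\theta(x(t),t) - \nabla_{x(t)} \log p_{0t}(x(t)\,|\,x(0)) \right\|_2^2 \right] \right\} $$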

- <strong>What are Energy-Based Models (EBMs) and Score-Based Generative Models (SBGMs)?</strong>

Energy-based models are those that model a data distribution in the form:

$$ p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta},\;\text{where: } Z_\theta = \int e^{-E_\theta(x)} dx$$

$Z_\theta$ is not tractable as it involves computing an integral over all dimensions of the data, thus the usual way to go is to compute $\nabla_\theta \log p_\theta(x) = - \nabla_\theta E_\theta(x) + \mathbb{E}_{x'\sim p_\theta(x')}\left[\nabla_\theta E_\theta(x') \right]$, approximate the expectation with samples drawn from the model, and use gradient ascent to maximize the log-likelihood of the dataset under $p_\theta$.

However, even if a model for the distribution is given, it is still hard to sample from it. Score-based sampling techniques such as Langevin MCMC only need the score $\nabla_x \log p(x)$, which does not depend on the normalizing constant, so they can draw samples without ever evaluating the normalized density. Methods that estimate the score from data and use score-based sampling to generate data similar to a given dataset are called Score-Based Generative Models.
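
The reason the score is convenient can be checked numerically: since $\log p_\theta(x) = -E_\theta(x) - \log Z_\theta$ and $Z_\theta$ does not depend on $x$, the gradient with respect to $x$ is the same with or without the normalizing constant. The sketch below uses a hypothetical one-dimensional quadratic energy (a Gaussian with mean 1 and standard deviation 2), chosen because its normalizing constant is known in closed form.

```python
import math

MU, SIGMA = 1.0, 2.0  # hypothetical parameters of the toy energy


def energy(x):
    """Quadratic energy; exp(-energy) is an unnormalized Gaussian density."""
    return (x - MU) ** 2 / (2.0 * SIGMA ** 2)


def log_unnormalized_density(x):
    return -energy(x)


def log_density(x):
    """Normalized log-density: subtracts log Z, known here in closed form."""
    log_z = 0.5 * math.log(2.0 * math.pi * SIGMA ** 2)
    return -energy(x) - log_z


def numerical_grad(fn, x, h=1e-5):
    """Central finite difference approximation of d fn / d x."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


x0 = 0.3
score_unnormalized = numerical_grad(log_unnormalized_density, x0)
score_normalized = numerical_grad(log_density, x0)
score_closed_form = -(x0 - MU) / SIGMA ** 2  # -dE/dx
```

All three values agree up to numerical error, which is why score-based methods can avoid computing $Z_\theta$.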

- <strong>What is the difference among Euler-Maruyama sampling, Langevin MCMC sampling, and Predictor-Corrector (PC) sampling?</strong>

The Euler-Maruyama sampling method applies the Euler-Maruyama SDE solver to the reverse-time SDE: starting from a sample of the prior, it integrates numerically from $t=T$ back to $t=0$, which yields a new sample that approximately follows the data distribution.

The Langevin MCMC method generates samples by iterating over the equation:

$$x_i^m = x_i^{m-1} + \varepsilon_i s_\theta (x_i^{m-1}, \sigma_i) + \sqrt{2 \varepsilon_i} z_i^m,\quad m=1,2,3,\dots,M$$

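Here $\varepsilon_i$ is the step size for noise level $\sigma_i$ and $z_i^m \sim \mathcal{N}(0, I)$. The chain starts from an initial point $x_i^0$ and iterates until reaching the final generated sample $x_i^M$.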
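
A minimal sketch of this update rule on a toy problem is given below. To keep it self-contained, the learned score $s_\theta$ is replaced by the exact score of a one-dimensional Gaussian, and a single noise level is used; the target mean, step size, chain length and number of chains are hypothetical choices.

```python
import math
import random

TARGET_MEAN, TARGET_STD = 2.0, 1.0  # hypothetical target distribution


def score(x):
    """Exact score of N(TARGET_MEAN, TARGET_STD^2); stands in for s_theta."""
    return -(x - TARGET_MEAN) / TARGET_STD ** 2


def langevin_chain(x0, step_size, num_steps, rng):
    """Run x <- x + eps * score(x) + sqrt(2 * eps) * z for num_steps steps."""
    x = x0
    for _ in range(num_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + step_size * score(x) + math.sqrt(2.0 * step_size) * z
    return x


rng = random.Random(0)
# Every chain starts far from the target mean, at x = 0.
samples = [langevin_chain(0.0, 0.05, 200, rng) for _ in range(1000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The final samples have mean and variance close to those of the target. The finite step size introduces a small bias in the variance, which shrinks as the step size decreases.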

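The Predictor-Corrector (PC) class of samplers generalizes the previous two methods: the Predictor can be any SDE solver (e.g. Euler-Maruyama) and the Corrector can be any score-based MCMC approach (e.g. Langevin MCMC). At each time step the predictor moves the sample along the reverse-time SDE, and the corrector then refines it towards the marginal distribution at that time. The proposed PC samplers are tailored for reverse diffusion sampling, where [4] reports that they perform better than either a predictor or a corrector alone.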
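
The structure of a PC sampler can be sketched on a toy problem in plain Python. The assumptions are: a one-dimensional data distribution $\mathcal{N}(0,1)$, a forward SDE $dx = \sigma^t dw$ with the hypothetical value $\sigma = 25$, and the exact score in place of a trained network. The corrector step size (a fixed fraction of the current variance) is also a simplification chosen for this toy case. Setting `corrector_steps=0` reduces the sampler to plain Euler-Maruyama.

```python
import math
import random

SIGMA = 25.0    # hypothetical SDE parameter: g(t) = SIGMA**t
DATA_VAR = 1.0  # toy data distribution: N(0, DATA_VAR)


def diffusion_coeff(t):
    return SIGMA ** t


def marginal_var(t):
    """Variance added by the forward SDE dx = SIGMA**t dw up to time t."""
    return (SIGMA ** (2.0 * t) - 1.0) / (2.0 * math.log(SIGMA))


def score(x, t):
    """Exact score of p_t = N(0, DATA_VAR + marginal_var(t)); replaces s_theta."""
    return -x / (DATA_VAR + marginal_var(t))


def pc_sample(rng, num_steps=200, corrector_steps=1, eps=1e-3, ratio=0.02):
    x = rng.gauss(0.0, math.sqrt(marginal_var(1.0)))  # sample from the prior
    dt = (1.0 - eps) / (num_steps - 1)
    for i in range(num_steps):
        t = 1.0 - i * dt
        # Corrector: Langevin MCMC step(s) at the current noise level.
        for _ in range(corrector_steps):
            step = ratio * (DATA_VAR + marginal_var(t))
            x += step * score(x, t) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        # Predictor: one Euler-Maruyama step of the reverse-time SDE.
        g = diffusion_coeff(t)
        x += dt * g ** 2 * score(x, t) + math.sqrt(dt) * g * rng.gauss(0.0, 1.0)
    return x


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
var_em = sample_variance([pc_sample(rng, corrector_steps=0) for _ in range(2000)])
var_pc = sample_variance([pc_sample(rng, corrector_steps=1) for _ in range(2000)])
```

With an exact score both variants recover a variance close to `DATA_VAR`, so this toy case does not show the benefit of the corrector; it only shows where each step sits in the loop.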

- <strong>What is the difference among VE SDE, VP SDE, and sub-VP SDE?</strong>

Variance Exploding (VE) SDE: The continuous-time version of the perturbation kernels used in SMLD. It is called variance exploding because its variance grows without bound as $t\to \infty$. It is given by:

$$ dx = \sqrt{\frac{d\left[\sigma^2(t)\right]}{dt}} dw $$

Variance Preserving (VP) SDE: The continuous-time version of the perturbation kernels used in DDPM. It is called variance preserving as its variance is always one provided the initial distribution also has unit variance. It is given by:

$$ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}dw $$

Sub-VP SDE: A new type of SDE proposed by the authors, whose variance is bounded by that of the VP SDE at every intermediate time step. It is given by:

$$ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)\left(1-e^{-2\int_0^t \beta(s)ds}\right)}dw $$
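
The three variance behaviours can be compared numerically using the closed-form variances of the perturbation kernels $p_{0t}(x(t)\,|\,x(0))$ given in [4]. The sketch below assumes a linear $\beta(t)$ and a geometric $\sigma(t)$; the endpoint values are illustrative choices, not values taken from this assignment.

```python
import math

# Illustrative schedule endpoints (assumptions for this sketch).
BETA_MIN, BETA_MAX = 0.1, 20.0
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0


def integrated_beta(t):
    """Integral of beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN) over [0, t]."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2


def ve_kernel_var(t):
    """VE: sigma(t)^2 - sigma(0)^2, with sigma(t) geometric in t."""
    sigma_t = SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t
    return sigma_t ** 2 - SIGMA_MIN ** 2


def vp_kernel_var(t):
    """VP: 1 - exp(-integral of beta); approaches 1 from below."""
    return 1.0 - math.exp(-integrated_beta(t))


def sub_vp_kernel_var(t):
    """Sub-VP: the square of the VP kernel variance, hence never larger."""
    return (1.0 - math.exp(-integrated_beta(t))) ** 2


times = [i / 10 for i in range(11)]
table = [(t, ve_kernel_var(t), vp_kernel_var(t), sub_vp_kernel_var(t))
         for t in times]
```

In `table`, the VE column grows by several orders of magnitude, the VP column saturates near 1, and the sub-VP column stays at or below the VP column at every time.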

- <strong>How and why are SDEs connected with SBGMs?</strong>

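The transformation from the data distribution to the prior distribution can be modelled as a diffusion process (an SDE), and the reverse of a diffusion process is also a diffusion process. The reverse-time SDE depends on the data only through the score $\nabla_x \log p_t(x)$, which is exactly the quantity that SBGMs estimate. Once the score is learned, samples can be generated by solving the reverse-time SDE with regular SDE solvers (e.g. Euler-Maruyama), optionally combined with score-based MCMC methods (e.g. Langevin MCMC).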

- <strong>How is the likelihood computed in the probabilistic flow ODE? Why can this not be done for a normal SDE?</strong>

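For any diffusion process (SDE), there exists a deterministic process (ODE) whose trajectories share the same marginal distributions $p_t(x)$, called the probabilistic flow ODE (the probability flow ODE in [4]). With the score replaced by its estimate $s_\theta$, it can be written as $dx = \mathbf{\tilde{f_\theta}}(\mathbf{x},t)\,dt$. Because this ODE is deterministic, the log-likelihood of a data point follows from the instantaneous change-of-variables formula: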

$$ \text{log}\,p_0(x(0)) = \text{log}\,p_T(x(T))+\int\limits_0^T \nabla\cdot \mathbf{\tilde{f_\theta}}(\mathbf{x}(t),t) dt $$

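The divergence in the integrand can be estimated with the Skilling-Hutchinson trace estimator:

$$\nabla\cdot \mathbf{\tilde{f_\theta}}(\mathbf{x},t) = \mathbb{E}_{p(\varepsilon)}\left[ \varepsilon^T \nabla\mathbf{\tilde{f_\theta}}(\mathbf{x},t) \varepsilon \right] $$

where $\nabla\mathbf{\tilde{f_\theta}}$ is the Jacobian of $\mathbf{\tilde{f_\theta}}$ and $p(\varepsilon)$ has zero mean and identity covariance. The term $\varepsilon^T \nabla\mathbf{\tilde{f_\theta}}(\mathbf{x},t)$ is a vector-Jacobian product that can be computed using automatic differentiation (available in most deep learning packages) without forming the full Jacobian. This gives an unbiased estimate of the divergence, which in turn enables the computation of the likelihood of data: $\text{log}\,p_0(x(0))$.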
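
The estimator can be checked on a case where the divergence is known. The sketch below assumes a hypothetical linear vector field $f(x) = Ax$, whose Jacobian is $A$ and whose divergence is therefore the trace of $A$. The vector-Jacobian product is written out by hand here; with a neural network it would come from automatic differentiation. Rademacher noise is one valid choice for $p(\varepsilon)$.

```python
import random

# Hypothetical linear vector field f(x) = A x; its Jacobian is A.
A = [[1.0, 2.0, 0.0],
     [0.0, -3.0, 1.0],
     [4.0, 0.0, 0.5]]


def vector_jacobian_product(eps):
    """Return eps^T A, the quantity autodiff would provide for a network."""
    n = len(A)
    return [sum(eps[i] * A[i][j] for i in range(n)) for j in range(n)]


def hutchinson_divergence(num_probes, rng):
    """Average eps^T A eps over Rademacher probes (zero mean, identity covariance)."""
    total = 0.0
    for _ in range(num_probes):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(len(A))]
        vjp = vector_jacobian_product(eps)
        total += sum(v * e for v, e in zip(vjp, eps))
    return total / num_probes


exact_divergence = sum(A[i][i] for i in range(len(A)))  # trace of A
estimate = hutchinson_divergence(20000, random.Random(0))
```

The estimate converges to the trace as the number of probes grows. In likelihood computation a single probe per data point is often used and the noise averages out over the dataset.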

The same cannot be done directly with the SDE because its trajectories are stochastic: there is no deterministic, invertible map between $x(0)$ and $x(T)$ to which the change-of-variables formula could be applied. Without the probabilistic flow ODE, methods therefore rely on sampling techniques (e.g. Langevin MCMC) to get samples of the data distribution, without ever computing the likelihood.

- <strong>Clarify potential disadvantages of discrete noisy perturbation (Hint, suggest reading [5].)</strong>

Ideally there would be infinitely many noise scales, which leads to a continuous-time SDE that can be converted into a probabilistic flow ODE, allowing for:

- Exact likelihood computation (as shown in the previous item)
- Faster sampling
- Uniquely identifiable representations

With finitely many noise perturbations, the discrete process only approximates the SDE, so the equivalence with the ODE is approximate rather than exact and there is no guarantee of achieving the aforementioned properties.

2.2 Explorations

- <strong>You are required to generate images by using Euler Maruyama sampling strategy. You need to fill in all the TODOs in the given .ipynb. You may check the tutorial here.</strong>

Please check the implementation below; the full code is available in "Tutorial_on_Score_Based_Generative_Modeling_(PyTorch).ipynb". The code is commented step by step to show my understanding.

In [None]:
num_steps = 500  #@param {'type':'integer'}
def Euler_Maruyama_sampler(score_model, 
                           marginal_prob_std,
                           diffusion_coeff, 
                           batch_size=64, 
                           num_steps=num_steps, 
                           device='cuda', 
                           eps=1e-3):
  """Generate samples from score-based models with the Euler-Maruyama solver.

  Args:
    score_model: A PyTorch model that represents the time-dependent score-based model.
    marginal_prob_std: A function that gives the standard deviation of
      the perturbation kernel.
    diffusion_coeff: A function that gives the diffusion coefficient of the SDE.
    batch_size: The number of samplers to generate by calling this function once.
    num_steps: The number of sampling steps. 
      Equivalent to the number of discretized time steps.
    device: 'cuda' for running on GPUs, and 'cpu' for running on CPUs.
    eps: The smallest time step for numerical stability.
  
  Returns:
    Samples.    
  """

  # We start by sampling z_T from our prior: N(0,I)
  z_shape = (batch_size, 1, 28, 28)
  z = torch.randn(*z_shape, device=device)
  # Then we multiply it by the std. deviation (given by marginal_prob_std)
  # T=1 (last possible time, as 0 <= t <= 1)
  t = torch.ones(*z_shape, device=device)
  x = z*marginal_prob_std(t)
  # x now stores samples from the prior distribution

  # Compute values of timesteps given num_steps and eps
  steps = torch.linspace(1., eps, num_steps, device=device)
  step_size = steps[0]-steps[1]
    
  # Now we run the Euler-Maruyama method for each timestep
  with torch.no_grad():
    for ts in tqdm.notebook.tqdm(steps):
      # Current t
      bts = ts*torch.ones(*z_shape, device=device)
      g = diffusion_coeff(bts)
      # Sample more noise
      z = torch.randn(*z_shape, device=device)
      # Execute Euler-Maruyama method
      x = x + step_size*(g**2)*score_model(x,bts[:,0,0,0]) \
            + torch.sqrt(step_size)*g*z

  return x

- <strong>Compute the likelihood on CIFAR10 (similar to the example provided on MNIST) for probabilistic flow ODE.</strong>

The code that implements SBGM for CIFAR10 is available in "Tutorial SBGM - CIFAR10.ipynb". Minor changes were made to adapt the code to this dataset:

- Changed the model architecture so that the skip connections work (each encoder-decoder pair must have the same size)
- Changed the number of input/output channels of the model from 1 to 3 (RGB)
- Changed the shape of generated noise to be: (batch_size, 3, 32, 32)

After training, the average negative log-likelihood is 5.217463 bits/dim. This metric (lower is better) measures how much probability the model, evaluated through the probabilistic flow ODE, assigns to actual data samples: fewer bits per dimension means the data is more likely under the model.
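
For context, the conversion from a negative log-likelihood in nats to bits/dim can be sketched as follows. The helper name is hypothetical. The offset of 8 bits is an assumption about the data scaling: it applies when 8-bit images are rescaled to $[0,1]$, because dividing each of the 256 intensity levels by 256 changes the density by $\log_2 256 = 8$ bits per dimension.

```python
import math


def nats_to_bits_per_dim(nll_nats, shape, offset_bits=8.0):
    """Convert a per-image negative log-likelihood in nats to bits/dim.

    offset_bits=8.0 assumes 8-bit images rescaled to [0, 1] (see lead-in).
    """
    num_dims = math.prod(shape)  # 3 * 32 * 32 = 3072 for CIFAR10
    return nll_nats / (math.log(2.0) * num_dims) + offset_bits


cifar_shape = (3, 32, 32)
# An NLL of -2 bits per dimension in [0, 1] space maps to 6 bits/dim.
example_bpd = nats_to_bits_per_dim(-2.0 * 3072 * math.log(2.0), cifar_shape)
```

Under this convention, 8 bits/dim corresponds to a uniform model over the 256 intensity levels, so values below 8 indicate that the model has captured structure in the data.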