<a href="https://colab.research.google.com/github/notenoughsun/bayes_ML/blob/master/Assignment_3_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MA060129, Bayesian Machine Learning Homework 3: Practical Problems

#### List of problems:
[Task 1](#Task1) 30 points

[Task 2](#Task2) 15 points

[Task 3](#Task3) 5 points 

[Task 4](#Task4) 10 points 

------ Total : 60 points  -------

[Bonus](#Bonus) 20 points

## Variational Autoencoders

VAEs define a two-step generative process: a latent code is drawn from a prior $p(z)$ over the latent space, and an observation is then drawn from a conditional generative distribution $p_{\theta}(x|z)$, which is parametrized by a deep neural network (DNN). Our goal is to maximize the marginal log-likelihood $\log p(x)$, which is intractable in the general case. Therefore, the variational inference (VI) framework is used, which replaces the log-likelihood with a tractable lower bound:

\begin{equation*}
    \begin{aligned}
    & \log p(x) \geq \mathcal{L}(x;\theta;q) = \mathbb{E}_{z\sim q(z|x)}[\log p_{\theta}(x|z)] - \text{KL}[q(z|x)\|p(z)],
    \end{aligned}
\end{equation*}

where $q(z|x)$ is a variational posterior distribution. Given the empirical data distribution $p_e(x) = \frac1N\sum_{i=1}^N \delta_{x_i}(x)$, we aim to maximize the average marginal log-likelihood. Following the variational auto-encoder architecture, inference is amortized by choosing a variational distribution $q_{\phi}(z|x)$ that is also parametrized by a DNN.

\begin{equation*}
    \begin{aligned}
    & \arg\max\limits_{\phi, \theta}\mathbb{E}_{x}\mathcal{L}(x,\phi,\theta)=\arg\max\limits_{\phi, \theta}\Big(\mathbb{E}_{x}\mathbb{E}_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \mathbb{E}_x \text{KL}[q_{\phi}(z|x)\|p(z)]\Big).
    \end{aligned}
\end{equation*}
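
As a hedged illustration of the KL term in the objective above: if we assume a standard normal prior $p(z)=\mathcal{N}(0, I)$ and a diagonal Gaussian $q_{\phi}(z|x)$ (neither choice is fixed by the text above), the KL divergence has a closed form. The function name `gaussian_kl` and the tensor shapes below are illustrative only.

```python
import torch

def gaussian_kl(z_mean, z_logvar):
    """
    Closed-form KL[N(z_mean, diag(exp(z_logvar))) || N(0, I)].
    Assumes a standard normal prior and a diagonal Gaussian posterior.
    :param z_mean:   (MB, hid_dim)
    :param z_logvar: (MB, hid_dim)
    :return: KL per observation, (MB, )
    """
    per_dim = 0.5 * (z_logvar.exp() + z_mean.pow(2) - 1.0 - z_logvar)
    return per_dim.sum(dim=1)

# The KL is zero when q(z|x) coincides with the prior ...
kl_same = gaussian_kl(torch.zeros(4, 8), torch.zeros(4, 8))
# ... and grows as the posterior mean moves away from zero.
kl_shifted = gaussian_kl(torch.ones(4, 8), torch.zeros(4, 8))
```

With this closed form only the reconstruction term $\mathbb{E}_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$ has to be estimated by sampling.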

To evaluate the performance of the VAE, we will estimate the negative log-likelihood (NLL) on the test set. The NLL is estimated by importance sampling:
\begin{equation*}
   - \log p(x) \approx - \log \frac{1}{K} 
   \sum_{i = 1}^K \frac{p_\theta(x | z_i) p(z_i)}{q_\phi(z_i | x)},\,\,\,\,z_i \sim q_\phi(z | x).     
\end{equation*}
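
The estimator above can be sketched numerically as follows. This is a minimal sketch, not the seminar implementation: the helper names `is_nll` and `log_normal` are hypothetical, and the toy model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z,1)$) is an assumption chosen because its marginal $\mathcal{N}(0,2)$ is known exactly. The average of the importance weights is taken in log space with `torch.logsumexp` to avoid underflow.

```python
import math
import torch

def is_nll(log_p_x_given_z, log_p_z, log_q_z_given_x):
    """
    Importance-sampling estimate of -log p(x).
    Each argument has shape (K, MB): K samples z_i ~ q(z|x) per observation.
    :return: (MB, ) estimate of the NLL
    """
    log_w = log_p_x_given_z + log_p_z - log_q_z_given_x  # log importance weights
    K = log_w.shape[0]
    # log(1/K * sum_i w_i) = logsumexp_i(log w_i) - log K
    return -(torch.logsumexp(log_w, dim=0) - math.log(K))

def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Toy check: p(z) = N(0, 1), p(x|z) = N(z, 1), hence p(x) = N(0, 2).
# Using the exact posterior q(z|x) = N(x/2, 1/2) makes every weight equal p(x).
torch.manual_seed(0)
x = torch.tensor([0.3, -1.2])
K = 64
z = x / 2 + math.sqrt(0.5) * torch.randn(K, x.shape[0])
nll_estimate = is_nll(log_normal(x, z, 1.0),
                      log_normal(z, 0.0, 1.0),
                      log_normal(z, x / 2, 0.5))
nll_exact = -log_normal(x, 0.0, 2.0)
```

In the toy check the estimate matches the exact NLL up to floating-point error because the proposal equals the true posterior. With a learned $q_\phi(z|x)$ the estimate is only approximate and improves as $K$ grows.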

### References
1. Auto-Encoding Variational Bayes https://arxiv.org/pdf/1312.6114.pdf
2. Beta-VAE https://pdfs.semanticscholar.org/a902/26c41b79f8b06007609f39f82757073641e2.pdf
3. Importance Weighted Autoencoders https://arxiv.org/pdf/1509.00519.pdf 

## VAE
Below you can find an empty class for the VAE model. It contains methods that will help you complete all the tasks in this assignment. You can use code from the VAE seminar to implement the methods; if you use other sources, please provide a reference.

In [None]:
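import math
import torch

def log_gaussian(x, mean, logvar, dim=None):
    # Elementwise log-density of N(x; mean, exp(logvar)); 1e-5 guards against division by zero
    log_normal = -0.5 * (math.log(2.0 * math.pi) + logvar +
                         torch.pow(x - mean, 2) / (logvar.exp() + 1e-5))
    # Sum over `dim`, or over all elements when dim is None
    return log_normal.sum() if dim is None else log_normal.sum(dim)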

In [None]:
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, hid_dim):
        """
        hid_dim: int, dimension of the latent space
        Input images are expected to have size (3, x_dim, x_dim).
        """
        super(VAE, self).__init__()
        self.hid_dim = hid_dim
        # Initialize Encoder and Decoder networks

       
        self.init_params()

    def init_params(self):
        for m in self.modules():
            if isinstance(m, nn.Linear) or isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d):
                nn.init.xavier_normal_(m.weight.data)
        
    def q_z(self, x):
        """
        VARIATIONAL POSTERIOR
        :param x: input image
        :return: parameters of q(z|x), (MB, hid_dim)
        """
        ## YOUR CODE HERE

    def p_x(self, z):
        """
        GENERATIVE DISTRIBUTION
        :param z: latent vector          (MB, hid_dim)
        :return: parameters of p(x|z)    (MB, inp_dim)
        """
        ## YOUR CODE HERE

    def forward(self, x):
        """
        Encode the image, sample z and decode
        :param x: input image
        :return: parameters of p(x|z_hat), z_hat, parameters of q(z|x)
        """
        # YOUR CODE HERE
        return x_mu, x_logvar, z_sample, z_mu, z_logvar

    def log_p_z(self, z):
        """
        Log density of the Prior
        :param z: latent vector     (MB, hid_dim)
        :return: sum_i log p(z_i)   (1, )
        """
        # YOUR CODE HERE

    def reconstruct_x(self, x):
        x_mean, _, _, _, _ = self.forward(x)
        return x_mean

    def kl(self, z, z_mean, z_logvar):
        """
        KL-divergence KL[q(z|x) || p(z)]
        :param z:                               (MB, hid_dim)
        :param z_mean: mean of q(z|x)           (MB, hid_dim)
        :param z_logvar: log variance of q(z|x) (MB, hid_dim)
        :return: KL                             (MB, )
        """
        # YOUR CODE HERE

    def calculate_loss(self, x, beta=1.):
        """
        Given the input batch, compute the negative ELBO 
        :param x:   (MB, inp_dim)
        :param beta: Float
        :return: nll + beta * KL  (MB, ) or (1, )
        """
        # YOUR CODE HERE

    def calculate_nll(self, X, K=100):
        """
        Estimate NLL by importance sampling 
        (see VAE seminar, but be careful with dimensions)
        :param X: dataset, (N, 3, w, h)
        :param K: number of importance samples per observation
        :return: IS estimate
        """
        # YOUR CODE HERE       

    def generate_x(self, N=25):
        """
        Sample using your VAE: sample z from the prior and decode it
        :param N: number of samples
        :return: X (N, inp_size)
        """
        # YOUR CODE HERE

    @staticmethod
    def reparametrize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # noise with the same shape, device and dtype as std
        return mu + eps * std

## Generalization

The dataset used to train a generative model is often exponentially small compared to the support of the density $p(x)$. Thus, it is important to be able to evaluate the generalization ability of the learned model.

In this assignment you will evaluate the generalization ability of the VAE using the approach discussed in https://arxiv.org/abs/1811.03259.

The authors propose to study the generalization ability of a generative model using **probing features**: functions that map an image to a value, e.g. the number of objects in the image.
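
To make the idea of a probing feature concrete, here is a rough sketch of one: counting dots as connected components of a thresholded image. It assumes bright, non-overlapping dots on a dark background and pixel values in [0, 1]. None of this is guaranteed for the actual dataset, and the name `count_dots` and the threshold value are hypothetical.

```python
import numpy as np
from scipy import ndimage

def count_dots(image, threshold=0.5):
    """
    Hypothetical probing feature: number of bright blobs in an image.
    :param image: (H, W, 3) float array with values in [0, 1]
    :param threshold: pixels whose brightest channel exceeds this count as foreground
    :return: int, number of connected foreground components
    """
    mask = image.max(axis=-1) > threshold
    _, num_components = ndimage.label(mask)
    return num_components

# Synthetic example: three separated coloured squares on a black background.
example = np.zeros((32, 32, 3))
example[2:6, 2:6, 0] = 1.0      # red square
example[10:14, 20:24, 1] = 1.0  # green square
example[25:29, 8:12, 2] = 1.0   # blue square
num_dots = count_dots(example)
```

Touching or overlapping dots merge into a single component, so such a heuristic undercounts in crowded images. A learned classifier, as in the bonus task, is more robust.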


We will use the dataset with dots, which can be downloaded [here](https://drive.google.com/open?id=1CsDMOIGEsD1l3BLhuQDfEfEmLEb83wMz). 

---
<a id='Task1'></a>
**Task 1. [30 pts]: Training**
* Train your VAE on the **subset** of images containing only 3 dots (use batches 0-5 for training and leave batches 6 and 7 for validation).
* Plot ELBO vs Iteration, KL vs Iteration and NLL vs Iteration during training.
* Plot 10 pairs of `image`-`reconstruction` for 10 random images from the validation dataset 
* Plot 10 samples from the model

Note that the task is considered completed only if your model produces reasonable **reconstructions** and **samples**. By reasonable I mean:
- Reconstructions and true images have the same number of dots of similar colors
- Samples have dots of different colors on them (they may not have perfect shapes)

---
Some hints that might be useful (you do not have to use all of them):
- Use CNNs for encoder and decoder
- Use a Gaussian distribution for $p(x|z)$
- **Fix** the variance of $p(x|z)$ and train only the mean (the NLL then reduces to an MSE loss, up to scale and an additive constant)
- Scale pixels of the input (dataset) and output (generated) images to [-1,1] range
- Use `Upsample` + `Conv` instead of `ConvTranspose` in the decoder
- Use the $\beta$-VAE objective instead of the plain ELBO:
    $$ -\text{NLL} - \beta\, \text{KL}(q(z|x)\|p(z))$$
This objective is maximized (equivalently, minimize $\text{NLL} + \beta\, \text{KL}$), and $\beta$ is a hyperparameter. You can either fix it or use so-called $\beta$-annealing, in which $\beta$ is gradually increased from 0 during training.
- If reconstructions look nice but samples are bad, you probably need to put more weight on the KL-term  (larger $\beta$)

- Other standard DL tricks also apply, e.g. learning-rate annealing, early stopping, augmentations, etc.
---
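
A few of the hints above can be sketched in code. This is a minimal sketch under stated assumptions: the helper names (`linear_beta`, `to_model_range`, `upsample_block`), the linear warm-up shape, the 8-bit input format and the layer sizes are all illustrative choices, not requirements of the assignment.

```python
import torch
import torch.nn as nn

def linear_beta(step, warmup_steps, beta_max=1.0):
    """Beta-annealing: grow beta linearly from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def to_model_range(images_uint8):
    """Scale 8-bit pixel values from [0, 255] to [-1, 1]."""
    return images_uint8.float() / 127.5 - 1.0

def upsample_block(in_channels, out_channels):
    """Upsample + Conv decoder block that doubles the spatial resolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(),
    )

# Example usage on dummy data.
betas = [linear_beta(s, warmup_steps=100) for s in (0, 50, 200)]
scaled = to_model_range(torch.tensor([0, 255], dtype=torch.uint8))
features = upsample_block(4, 2)(torch.zeros(1, 4, 8, 8))  # -> (1, 2, 16, 16)
```

In a training loop, the per-batch loss would then be formed as the reconstruction NLL plus `linear_beta(step, warmup_steps)` times the KL term, matching the `calculate_loss(x, beta)` signature of the class above.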

In [None]:
# Load the data, define train and validation datasets

In [None]:
# Train the VAE (do not forget to save a checkpoint
# and send it along with the HW solution)

In [None]:
# Plot reconstructions

---
<a id='Task2'></a>
**Task 2. [15 pts]: Evaluation**

Calculate the NLL on the validation set containing only 3 dots.

In [None]:
# your code here

---
<a id='Task3'></a>
**Task 3. [5 pts]: Generalization [1]**

* Sample 25 images from the VAE and draw them on a 5 $\times$ 5 grid.
* What can you say about the generalization ability of the model based on the results?
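
One possible way to lay out the 25 samples is to tile them into a single image array and display it with one `imshow` call. The sketch below assumes the samples are available as a NumPy array of shape (25, 3, H, W) with values in [-1, 1]. The `generate_x` docstring above only promises `(N, inp_size)`, so a reshape may be needed first. The helper name `tile_images` is hypothetical.

```python
import numpy as np

def tile_images(images, rows=5, cols=5):
    """
    Arrange a batch of images on a rows x cols grid.
    :param images: (rows * cols, C, H, W) array with values in [-1, 1]
    :return: (rows * H, cols * W, C) array with values in [0, 1]
    """
    n, c, h, w = images.shape
    if n != rows * cols:
        raise ValueError("expected exactly rows * cols images")
    grid = images.reshape(rows, cols, c, h, w)
    grid = grid.transpose(0, 3, 1, 4, 2)        # (rows, H, cols, W, C)
    grid = grid.reshape(rows * h, cols * w, c)
    return np.clip((grid + 1.0) / 2.0, 0.0, 1.0)

# Dummy batch: image i has constant intensity i / 12 - 1, so tiles are easy to tell apart.
dummy = np.stack([np.full((3, 4, 4), i / 12.0 - 1.0) for i in range(25)])
grid = tile_images(dummy)
```

The resulting array can be passed directly to `matplotlib.pyplot.imshow`. Images are placed in row-major order, so sample `i` lands in row `i // 5`, column `i % 5`.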

In [None]:
# Sampling

# your code here

`comment`

--- 
<a id='Task4'></a>
**Task 4. [10 pts]: Generalization [2]** 

* Define 2 new validation sets: one containing only 2 dots (batches 6 and 7) and one containing only 4 dots (batches 6 and 7)
* Plot reconstructions of 10 random images from each dataset
* Compute the NLL of your VAE on these datasets
* What can you say about the generalization ability of the model based on the results?


In [None]:
# NLL for 2 and 4 dots per image

# your code here

In [None]:
`comment`

---
<a id='Bonus'></a>
## Bonus task [20 pts]

Assume that we want to quantify the generalization ability of the model. To do that, we need to accurately count the number of dots in every generated image.

1. Train a neural network that classifies images by the number of dots with high accuracy (>95%)
2. Generate 10,000 images from your VAE
3. Classify generated images and plot the histogram
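
For step 3, once the classifier has produced an integer dot count for each generated image, the histogram can be computed as below. This is a sketch only: the predictions here are made-up placeholder values rather than results, and `dot_count_frequencies` is a hypothetical helper name.

```python
import numpy as np

def dot_count_frequencies(predicted_counts, max_dots):
    """
    Relative frequency of each predicted number of dots.
    :param predicted_counts: iterable of non-negative ints, one per generated image
    :param max_dots: largest dot count the classifier can output
    :return: array of length >= max_dots + 1 that sums to 1
    """
    predicted_counts = np.asarray(predicted_counts, dtype=np.int64)
    counts = np.bincount(predicted_counts, minlength=max_dots + 1)
    return counts / counts.sum()

# Placeholder predictions standing in for classifier outputs.
placeholder_predictions = np.array([3, 3, 2, 4, 3, 3, 4, 3])
frequencies = dot_count_frequencies(placeholder_predictions, max_dots=5)
```

The frequencies can be drawn with `matplotlib.pyplot.bar(np.arange(len(frequencies)), frequencies)`. Any mass outside the training value of 3 dots indicates how the model spreads probability over unseen dot counts.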

In [None]:
# your code here