<a href="https://colab.research.google.com/github/parthasarathydNU/gen-ai-coursework/blob/main/vae-intro/ConvolutionalVariationalAutoencoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Reference](https://www.tensorflow.org/tutorials/generative/cvae#define_the_loss_function_and_the_optimizer)

# Convolutional Variational Autoencoder

In this notebook we explore how to build a Variational Autoencoder (VAE) and the concepts around it.

Unlike a traditional autoencoder, which maps each input to a single point in a latent space, a Variational Autoencoder maps each input to the parameters of a known probability distribution, such as the mean and variance of a Gaussian. This approach produces a continuous, structured latent space that is useful for image generation.

# Setup

In [2]:
!pip install tensorflow-probability

# to generate gifs
!pip install imageio
!pip install git+https://github.com/tensorflow/docs


In [3]:
from IPython import display

import glob
import imageio
import matplotlib.pyplot as plt
import numpy as np
import PIL
import tensorflow as tf
import tensorflow_probability as tfp
import time

# Load the MNIST dataset

Each MNIST image is a 28x28 grid of 784 integers. Each integer lies between 0 and 255 and represents the intensity of one pixel.

In [19]:
(train_images, _), (test_images, _) = tf.keras.datasets.mnist.load_data()

# Converting data to a Bernoulli distribution

In [20]:
def preprocess_images(images):
  """
  Convert the images to a binary format so that each pixel
  can be modeled with a Bernoulli distribution.
  First add a channel axis and normalize the pixel values to [0, 1],
  then binarize them to 0 or 1 using a threshold of 0.5.
  """
  images = images.reshape((images.shape[0], 28, 28, 1)) / 255.
  return np.where(images > .5, 1.0, 0.0).astype('float32')

In [21]:
train_images.shape

(60000, 28, 28)

In [22]:
train_images[0]

In [24]:
train_images[0][0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0], dtype=uint8)

# Converting to binary distribution

In [25]:
train_images = preprocess_images(train_images)
test_images = preprocess_images(test_images)

In [26]:
train_images.shape

(60000, 28, 28, 1)

In [27]:
train_images[0].shape

(28, 28, 1)

In [28]:
train_images[0][0]

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]], dtype=float32)

# Reshaping the images and converting to a Bernoulli distribution: why?

The function `preprocess_images(images)` is designed to prepare image data for use in a machine learning model, specifically by processing the images into a format that is suitable for a neural network. The function does two main operations on the input images: reshaping and binarizing. Here's a breakdown of each step:

1. **Reshaping the Images:**
   - **`images.reshape((images.shape[0], 28, 28, 1))`:** This line changes the shape of the input array `images`. The reason for this reshaping involves several factors:
     - **Original Shape:** As loaded by `tf.keras.datasets.mnist.load_data()`, the images arrive as a `(60000, 28, 28)` array of `uint8` values, so the 28x28 grid (784 pixels per image) is already in place. What is missing is the channel axis: convolutional layers expect each image as a 3D tensor of height, width, and channels.
     - **Target Shape:**
       - **`images.shape[0]`** is the number of images in the array. This part remains dynamic, accommodating however many images are passed in.
       - **`(28, 28)`** keeps each image as a 28x28 matrix, preserving the 2D structure that the convolutions rely on.
       - **`1`** at the end of the shape tuple adds a single channel to the images, making them compatible with the input format that CNNs typically expect: [batch_size, height, width, channels]. For grayscale images like those in the MNIST dataset, the channel dimension is 1.
     - **Normalization:** The division by 255 is a normalization step, converting pixel values from the range [0, 255] to [0, 1]. Keeping inputs in a small, fixed range generally helps training converge, and here it also puts the pixels on the same scale as the 0.5 threshold applied in the next step.

2. **Binarizing the Images:**
   - **`np.where(images > .5, 1.0, 0.0).astype('float32')`:** After the images are reshaped and normalized, this line binarizes the image data:
     - **Thresholding at 0.5:** This uses a threshold of 0.5 on the normalized pixel values. If a pixel's value is greater than 0.5, it is set to 1.0; otherwise, it is set to 0.0. This step is effectively turning the grayscale images into black and white images, where each pixel is either fully "on" (1.0) or fully "off" (0.0). The choice of 0.5 as a threshold works well in practice for images where pixel values are normalized between 0 and 1.
     - **Type Conversion:** `.astype('float32')` converts the data type of the numpy array to `float32`, which is a common data type used in neural networks due to a good balance between precision and memory requirements.

Overall, the `preprocess_images` function not only adjusts the format and shape of the image data to make it suitable for processing with convolutional neural networks but also simplifies the data representation by converting it to binary. This can lead to more efficient learning when dealing with binary or nearly-binary images such as handwritten digits, which often don't require the full range of grayscale to be accurately represented and recognized.
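
The two steps can be traced on a small synthetic array. This is a minimal sketch that assumes random `uint8` values as a stand-in for real MNIST images (the `raw` array and its seed are made up for illustration); it applies the same expressions as `preprocess_images`.

```python
import numpy as np

# Hypothetical stand-in for two raw MNIST images: uint8 values in [0, 255].
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(2, 28, 28), dtype=np.uint8)

# Step 1: add a channel axis and scale the values to [0, 1].
scaled = raw.reshape((raw.shape[0], 28, 28, 1)) / 255.

# Step 2: threshold at 0.5 and cast to float32.
binary = np.where(scaled > .5, 1.0, 0.0).astype('float32')

print(binary.shape)       # (2, 28, 28, 1)
print(binary.dtype)       # float32
print(np.unique(binary))  # only the values 0.0 and 1.0 remain
```

Because 127/255 is just below 0.5 and 128/255 is just above it, the threshold is equivalent to asking whether the raw pixel value is at least 128.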

# Why do we convert it to a Bernoulli distribution?

Variational Autoencoders (VAEs) are a type of generative model that often use specific assumptions about the distribution of input data to simplify the training process and improve the model's performance. When working with image data like the MNIST dataset, these assumptions can play a crucial role. Here's why the pixels are binarized (here with a fixed threshold of 0.5) and modeled with a Bernoulli distribution:

1. **Binarization of Images:**
   - **Data Simplification:** MNIST images, which are grayscale, contain pixel values ranging from 0 to 255. Binarizing these values (i.e., converting them to 0s and 1s) simplifies the model by reducing the complexity of the input space. Instead of modeling 256 possible intensities for each pixel, the model only needs to consider two states: on or off (pixel is white or black).
   - **Match to Output Activation:** VAEs often use a sigmoid activation function in the output layer, which constrains the output values between 0 and 1. Binarizing the inputs to also be 0 or 1 makes it easier for the model to learn the appropriate mappings since the output directly corresponds to the probability of a pixel being on (1) or off (0).

2. **Modeling Pixels with Bernoulli Distribution:**
   - **Probabilistic Interpretation:** The Bernoulli distribution is a simple discrete distribution, which has two possible outcomes: 0 or 1. This matches the binarized nature of the input data. By modeling each pixel as a Bernoulli distributed variable, the model learns to output the probability that a given pixel should be 1 (pixel on) based on the latent space representation.
   - **Training Objective:** In VAEs, the objective includes a reconstruction loss that measures how well the output of the decoder matches the original input. For binarized data, this is typically measured using the binary cross-entropy between the input images and the reconstructed probabilities. The cross-entropy is a natural choice for loss when dealing with probabilities output from a Bernoulli distribution, effectively encouraging the model to adjust its parameters to minimize the difference between the predicted probabilities and the actual binary values of the input pixels.

3. **Statistical Efficiency and Stability:**
   - **Reduction of Variability:** By converting pixel values to a binary format, you reduce the intra-class variability among the pixels, which can lead to more stable and efficient learning. This is because the model no longer needs to account for varying degrees of intensity within the same class of objects (digits in the case of MNIST).
   - **Robustness:** Models trained on binary data tend to be less sensitive to small variations and noise in pixel values, which can be beneficial, especially in simpler models or when computational resources are limited.

In summary, converting pixel values from grayscale to a binary format and modeling these binary values with a Bernoulli distribution simplifies the VAE's learning process, aligns with the nature of the output layer's activation function, and is suitable for the kind of reconstruction loss used (binary cross-entropy). This approach makes training more straightforward and often more effective for generating or reconstructing binary images.
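
To make the link between the Bernoulli model and binary cross-entropy concrete, here is a small NumPy sketch. The helper name `bernoulli_log_prob` and the example probabilities are illustrative assumptions, not part of the notebook's model. It shows that the negative Bernoulli log-likelihood of the pixels is the binary cross-entropy, and that confident wrong predictions are penalized much more heavily than confident correct ones.

```python
import numpy as np

def bernoulli_log_prob(x, p, eps=1e-7):
    """Log-probability of binary pixels x under independent Bernoulli(p) pixels."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

x = np.array([1.0, 0.0, 1.0, 0.0])     # binarized pixels
good = np.array([0.9, 0.1, 0.8, 0.2])  # predicted probabilities, mostly right
bad = np.array([0.1, 0.9, 0.2, 0.8])   # predicted probabilities, mostly wrong

# Binary cross-entropy is the negative log-likelihood.
bce_good = -bernoulli_log_prob(x, good)
bce_bad = -bernoulli_log_prob(x, bad)

print(round(bce_good, 3))  # 0.657
print(round(bce_bad, 3))   # 7.824
```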

# Setting Parameters

In [29]:
test_images.shape

(10000, 28, 28, 1)

In [31]:
train_images.shape

(60000, 28, 28, 1)

In [30]:
train_size = 60000
batch_size = 32
test_size = 10000

# Use tf.data to batch and shuffle the data

In [32]:
train_dataset = (tf.data.Dataset.from_tensor_slices(train_images)
                 .shuffle(train_size).batch(batch_size))
test_dataset = (tf.data.Dataset.from_tensor_slices(test_images)
                .shuffle(test_size).batch(batch_size))
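
As a quick sanity check on these settings, the number of batches per pass can be worked out by hand. The sketch below uses only the three values defined above and relies on `batch()` keeping a smaller final batch by default (`drop_remainder=False`).

```python
import math

train_size, test_size, batch_size = 60000, 10000, 32

# Number of batches per pass over each dataset.
train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)

# Size of the final batch (a full batch when the division is exact).
last_train = train_size % batch_size or batch_size
last_test = test_size % batch_size or batch_size

print(train_batches, last_train)  # 1875 32
print(test_batches, last_test)    # 313 16
```

So the final test batch holds only 16 images, which matters for any code that assumes a fixed batch dimension.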

# Defining the encoder and decoder networks with tf.keras

In this VAE we use two small Convolutional Neural Networks for the encoder and decoder networks. These models are referred to as inference/recognition and generation models respectively.

Let `x` and `z` denote the observation and latent variable respectively in the following descriptions.

## Encoder Network

This defines the approximate posterior distribution `q(z|x)`, which takes as input an observation and outputs a set of parameters for specifying the conditional distribution of the latent representation `z`. In this example, we model the distribution as a diagonal Gaussian, so the network outputs the mean and log-variance parameters of a factorized Gaussian. It outputs the log-variance rather than the variance itself for numerical stability.

> Let's break down the concepts mentioned:

### Diagonal Gaussian
In a VAE, the encoder network models the distribution of latent variables as a Gaussian (normal) distribution. The term **"diagonal Gaussian"** refers to the type of covariance matrix used in this Gaussian distribution. A diagonal covariance matrix means that all off-diagonal entries are zero, implying that there are no correlations between different latent variables. Each latent variable has its variance, but these variances don't interact. This simplification makes it easier to compute and less parameter-intensive than a full covariance matrix.
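
One way to see what the diagonal assumption buys is to check that the joint log-density factorizes into a sum of one-dimensional log-densities. The following is a minimal NumPy sketch with made-up values for `z`, `mean`, and `var`; the two helper functions are illustrative, not part of the notebook.

```python
import numpy as np

def log_normal_1d(z, mean, var):
    """Elementwise log-density of univariate Gaussians."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var)

def log_normal_full(z, mean, cov):
    """Log-density of a multivariate Gaussian with a full covariance matrix."""
    d = z.shape[0]
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

z = np.array([0.3, -1.2])
mean = np.array([0.0, -1.0])
var = np.array([0.5, 2.0])  # one variance per latent dimension

per_dim = np.sum(log_normal_1d(z, mean, var))
full = log_normal_full(z, mean, np.diag(var))

print(np.isclose(per_dim, full))  # True
```

With `d` latent dimensions, the diagonal form needs only `d` variance parameters, whereas a full symmetric covariance matrix needs `d * (d + 1) / 2`.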

### Mean and Log-Variance Outputs
The encoder network does two key things:
1. **Outputs the Mean:** This is the mean of the Gaussian distribution of the latent variables. It represents the central point around which the latent variable values are distributed, essentially capturing the "average" state of the latent factors for the given input.
2. **Outputs the Log-Variance:** Instead of outputting the variance directly, the encoder outputs the logarithm of the variance. This step is crucial for numerical stability:
   - **Avoid Negative Variance:** Variance must be positive, but the final layer of the encoder is unconstrained and can emit any real number, including negative ones. Treating that output as the log-variance sidesteps the problem: the variance is recovered with the exponential function (the inverse of the logarithm), and the exponential of any real number is positive.
   - **Easier Optimization:** Variance values can span several orders of magnitude (very small to very large). Working in the log space normalizes this scale, potentially leading to smoother and more stable optimization landscapes.
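
As a rough illustration of these two outputs, the sketch below assumes the encoder emits one vector of length `2 * latent_dim` per input, with the first half read as the mean and the second half as the log-variance. That layout, and the numbers in `encoder_out`, are assumptions made for this example.

```python
import numpy as np

latent_dim = 2  # hypothetical latent size

# Hypothetical raw encoder outputs for a batch of 3 inputs (unconstrained reals).
encoder_out = np.array([[ 0.5, -1.0, -3.0,  0.0],
                        [ 1.5,  0.2,  2.0, -0.7],
                        [-0.3,  0.8, -8.0,  4.0]])

# First half of each row is the mean, second half is the log-variance.
mean, logvar = np.split(encoder_out, 2, axis=1)

var = np.exp(logvar)        # always positive, even for negative logvar
std = np.exp(0.5 * logvar)  # standard deviation, computed without a square root

print(mean.shape, logvar.shape)  # (3, 2) (3, 2)
print((var > 0).all())           # True
```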

### Why Log-Variance and Not Variance
Outputting the log-variance instead of the variance directly has advantages in terms of numerical stability and computational efficiency:
- **Numerical Stability:** Calculating the variance directly can lead to numerical issues, such as floating-point underflow or overflow, especially when the variance is extremely low (close to zero) or high. Using the logarithm mitigates these risks.
- **Ease of Optimization:** In optimization, especially with gradient descent methods, dealing with parameters that can vary over several orders of magnitude (as variances can) is challenging. Using the logarithm transforms the variance into a scale that's more manageable for the optimizer.
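
Both points can be checked numerically. The values below are deliberately extreme and purely illustrative: they show variances several orders of magnitude apart collapsing to a modest range in log space, and a very small variance that underflows to zero in `float32` while its logarithm is stored without trouble.

```python
import numpy as np

# Variances spanning nine orders of magnitude become a narrow range of log-variances.
variances = np.array([1e-6, 1e-3, 1.0, 1e3])
logvars = np.log(variances)
print(np.round(logvars, 2))  # roughly -13.82, -6.91, 0.0, 6.91

# A hypothetical, extremely small variance.
tiny_var = 1e-50
as_variance = np.float32(tiny_var)        # underflows: float32 cannot represent it
as_logvar = np.float32(np.log(tiny_var))  # an ordinary number in log space

print(as_variance)  # 0.0
print(as_logvar)    # about -115.13

# The standard deviation is still recoverable from the log-variance.
std = np.exp(np.float32(0.5) * as_logvar)
print(std > 0)      # True
```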

In summary, in a VAE, the encoder network approximates the posterior distribution of the latent variables given the input data. By outputting parameters (mean and log-variance) for a diagonal Gaussian distribution, the encoder simplifies the representation of the input data into a latent space where each dimension is independent and normally distributed. This structured approach allows the VAE not only to efficiently encode variations in the data but also to handle the intricacies of probabilistic generative modeling.

## Decoder Network

This defines the conditional distribution of the observation `p(x|z)`, which takes a latent sample `z` as an input and outputs the parameters for a conditional distribution of the observation. The prior over the latent variable, `p(z)`, is modeled as a unit Gaussian.

>  My understanding was that the decoder network was supposed to take in a distribution and output back an image. But here I see that it outputs the parameters for a distribution. Isn't that the job of the encoder? 🤔

🤖 💬 Your understanding of the roles of the encoder and decoder in a Variational Autoencoder (VAE) is on the right track, but there's a bit more nuance to how these components interact and what they output, particularly in the case of the decoder.

### Encoder vs. Decoder Roles

- **Encoder Network:** The encoder network takes in the original data (like images) and transforms it into a latent space. As you noted, it outputs parameters for a probability distribution—typically the mean and log-variance of a Gaussian distribution. These parameters define how we think about the latent variables given the input data, thus approximating the posterior distribution of the latent variables.

- **Decoder Network:** The decoder, on the other hand, aims to reconstruct the original input data from the latent space. It takes a sample `z` from the latent space, drawn from the distribution defined by the encoder's outputs, and attempts to reconstruct the original input.
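
Drawing the latent sample that is handed to the decoder can be sketched as follows. The `mean` and `logvar` values and the helper `sample_latent` are hypothetical. The sample is formed as the mean plus the standard deviation times standard normal noise, which is one common way to draw from a diagonal Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input, with a 2-dimensional latent space.
mean = np.array([1.0, -2.0])
logvar = np.array([-2.0, 0.0])

def sample_latent(mean, logvar, rng, n=1):
    """Draw n samples z ~ N(mean, diag(exp(logvar)))."""
    eps = rng.standard_normal((n, mean.shape[0]))  # standard normal noise
    return mean + np.exp(0.5 * logvar) * eps

z = sample_latent(mean, logvar, rng, n=100_000)

# Empirical statistics should be close to the requested parameters.
print(np.round(z.mean(axis=0), 2))  # close to [ 1. -2.]
print(np.round(z.var(axis=0), 2))   # close to exp(logvar), about [0.14 1.]
```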

### Output of the Decoder

The decoder does not simply spit out a reconstructed image directly. Instead, it outputs parameters that define a probability distribution of the possible outputs. Here’s how this works and why:

- **Conditional Distribution:** When the description states that the decoder outputs parameters for a "conditional distribution of the observation," it means that the decoder specifies how likely certain outputs are, given the latent sample `z`. The decoder is essentially modeling `p(x|z)`, the probability of the data `x` given the latent variables `z`.

- **Probabilistic Reconstruction:** The actual output data (like an image) is then sampled from this distribution. This approach allows the model to express uncertainty about the reconstructions and to generate multiple possible outputs from the same latent variables, reflecting variations that could plausibly lead to the observed inputs.

- **Example with Images:** If the input data are images, and the VAE is trained on these, the decoder might output the mean and variance for a Gaussian distribution for each pixel. Alternatively, in cases like binarized images, it might output the probability that each pixel should be 1 (if using a Bernoulli distribution, as often done with MNIST).
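
For the Bernoulli case, the relationship between the decoder's raw outputs, the per-pixel probabilities, and sampled images can be sketched in a few lines. The `logits` array below is a made-up stand-in for a decoder output on a tiny 2x2 "image", and the assumption that the decoder emits logits (pre-sigmoid values) is labeled here rather than taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical decoder output: one logit per pixel.
logits = np.array([[ 4.0, -4.0],
                   [ 0.0,  2.0]])

# Parameters of p(x|z): the probability that each pixel is on.
probs = sigmoid(logits)

# Three binary images sampled from the same per-pixel distribution.
samples = (rng.random((3,) + probs.shape) < probs).astype('float32')

print(np.round(probs, 3))  # [[0.982 0.018] [0.5 0.881]]
print(samples.shape)       # (3, 2, 2)
```

Pixels with probabilities near 0 or 1 come out the same in almost every sample, while the pixel at 0.5 flips between samples; this is the uncertainty that a single deterministic reconstruction would hide.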

### Why Use Distributions at Both Ends?

Using distributions both at the encoder and decoder ends makes VAEs powerful for a couple of reasons:
- **Flexibility in Data Generation:** By handling distributions rather than fixed outputs, VAEs can generate new data samples that are variations on the learned data, useful for tasks like data augmentation, anomaly detection, or generative models.
- **Better Handling of Uncertainty:** This approach inherently allows the model to handle and express uncertainty about the data it processes and generates, a key advantage in probabilistic modeling.

Thus, while it might initially seem like the job of outputting distributions should be confined to the encoder, having the decoder also output parameters for distributions enables more robust, flexible, and realistic modeling of data in various conditions.

> It will be an interesting exercise for us to look at the outputs generated by the decoder.

