## VQ-VAE
A vector-quantised variational autoencoder (VQ-VAE) learns discrete latent codes from data. The motivation is that many modalities are naturally discrete: language is inherently discrete [1](https://arxiv.org/pdf/1711.00937), speech is typically represented as a sequence of symbols, and images can often be described concisely by language.


The idea is to keep a fixed number of learned codes (the codebook) and map each high-dimensional input onto them. Let's take an image $x$ and encode it to $z_{e}$ using an encoder $q(z|x)$. Using L2 distance, we select the code $z_{q}$ in the codebook that is closest to $z_{e}$. We then decode $z_{q}$ to get the reconstruction $x_{q}$ using a decoder $p(x|z)$.

$$
    x \overset{\text{encoder}}{\longrightarrow} z_{e} \overset{\text{quantisation}}{\longrightarrow} z_{q} \overset{\text{decoder}}{\longrightarrow} x_{q}
$$

So if you have an image of dimension 32x32x3, we can encode it to a 512-dimensional vector ($z_{e} \in \mathbb{R}^{512}$) and keep a codebook of 1024 codes, each 512-dimensional (a codebook in $\mathbb{R}^{1024 \times 512}$). We calculate the L2 distance between $z_{e}$ and all the codes, select the one with the least distance as $z_{q}$, and decode it to $x_{q}$. The fundamental question I have is: if you have 50k images, how are we encoding them with just 1024 codes, and how can we learn to reconstruct all the images from just 1024 codes?
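The nearest-code lookup described above can be sketched in a few lines of NumPy. This is only an illustration: the codebook and the encoder output are random stand-ins (nothing is trained), and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

num_codes, code_dim = 1024, 512                     # sizes from the text
codebook = rng.normal(size=(num_codes, code_dim))   # random stand-in for a learned codebook
z_e = rng.normal(size=code_dim)                     # random stand-in for one encoder output

# Squared L2 distance from z_e to every code. Skipping the square root
# does not change which code is closest.
distances = np.sum((codebook - z_e) ** 2, axis=1)   # shape (1024,)
index = int(np.argmin(distances))                   # index of the closest code
z_q = codebook[index]                               # the vector the decoder would receive

print(index, z_q.shape)
```

Only `index` needs to be stored for this vector; `z_q` can always be recovered from the codebook.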

## Codebook size
- Input Image: 32x32x3 (RGB image)
- Encoded Representation: 4x4x512
- Codebook Size: 1024 codes (each code is 512-dimensional)

The encoder produces 16 vectors (4x4=16) where each vector is 512-dimensional. Each of these 16 vectors gets mapped to its nearest neighbor in the codebook. This means:
- Each spatial location in the 4x4 encoded space selects one code from the codebook
- We need 16 selections from the codebook to represent one image
- Total possible combinations = 1024¹⁶ = (2¹⁰)¹⁶ = 2¹⁶⁰ distinct index grids, so up to 2¹⁶⁰ distinct decoded images
- This massive space (2¹⁶⁰ ≈ 1.46 × 10⁴⁸) easily accommodates our 50k training images
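To make the per-location selection concrete, here is a minimal NumPy sketch that quantises a 4x4x512 grid against a 1024-code codebook and counts the possible index grids. The encoder output and codebook are random stand-ins, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 512))   # random stand-in for a learned codebook
z_e = rng.normal(size=(4, 4, 512))        # random stand-in for one image's encoder output

flat = z_e.reshape(-1, 512)               # 16 vectors, one per spatial location

# Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, shape (16, 1024)
d = (flat ** 2).sum(axis=1, keepdims=True) - 2 * flat @ codebook.T + (codebook ** 2).sum(axis=1)

indices = d.argmin(axis=1).reshape(4, 4)  # the 16 code indices that represent the image
z_q = codebook[indices]                   # shape (4, 4, 512), what the decoder would receive

num_index_grids = 1024 ** 16              # each of the 16 locations picks one of 1024 codes
print(indices.shape, z_q.shape, num_index_grids == 2 ** 160)
```

The image is represented by the 4x4 grid of integers in `indices`, and it is the number of such grids, not the number of codes, that bounds how many distinct images can be represented.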

## Compression Analysis
VQ-VAE performs lossy compression. Here's why:

Original Image Size:
- Dimensions: 32x32x3 pixels
- Bits per pixel: 8 (0-255)
- Total bits: 32 × 32 × 3 × 8 = 24,576 bits

Compressed Representation:
- Need to store 16 indices (4x4 spatial locations)
- Each index needs 10 bits (to represent 1024 choices)
- Total bits: 16 × 10 = 160 bits

Compression ratio = 24,576/160 = 153.6x reduction in size
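
The arithmetic above can be checked with a short script. It counts only the per-image indices; the codebook and decoder are shared across all images and are not included in the count. Variable names are my own.

```python
import math

# Original image: 32x32 pixels, 3 channels, 8 bits per channel value
height, width, channels, bits_per_value = 32, 32, 3, 8
original_bits = height * width * channels * bits_per_value

# Compressed form: one codebook index per location of the 4x4 grid
num_codes, grid_h, grid_w = 1024, 4, 4
bits_per_index = math.ceil(math.log2(num_codes))   # 10 bits address 1024 codes
compressed_bits = grid_h * grid_w * bits_per_index

ratio = original_bits / compressed_bits
print(original_bits, compressed_bits, ratio)       # 24576 160 153.6
```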


## How to learn the codebook?
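The VQ-VAE paper uses the following loss function to train the encoder, the decoder and the codebook.

$$
    L = ||x - D(z_{q})||^2 + ||sg[z_{e}] - z_{q}||^2 + \beta ||z_{e} - sg[z_{q}]||^2
$$

where $D$ is the decoder, $z_{e} = e(x)$ is the output of the encoder $e$, $z_{q}$ is the code in the codebook $C$ closest to $z_{e}$, $sg$ is the stop-gradient operation, and $\beta$ is a scalar hyperparameter.
The first term is the reconstruction loss, the second term is the codebook loss and the third term is the commitment loss. The encoder and decoder are trained to minimise the reconstruction loss; because the nearest-code lookup has no gradient, the paper copies the gradient at $z_{q}$ straight through to $z_{e}$ so that the encoder still receives it. The codebook is updated only by the codebook loss, which pulls the selected code towards $z_{e}$. The commitment loss updates only the encoder, pulling $z_{e}$ towards its selected code.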
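
As a rough illustration of the three terms, the sketch below computes them for a toy example in NumPy. The linear "encoder" and "decoder", the tiny sizes and all names are hypothetical stand-ins. NumPy has no autograd, so the stop gradient appears only in the comments: in the forward pass $sg$ is the identity, and it only decides which parameters each term would update. The value $\beta = 0.25$ is the one reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes and linear "networks", for illustration only.
x = rng.normal(size=8)                  # input
W_enc = rng.normal(size=(4, 8))         # stand-in encoder
W_dec = rng.normal(size=(8, 4))         # stand-in decoder
codebook = rng.normal(size=(5, 4))      # 5 codes of dimension 4
beta = 0.25

z_e = W_enc @ x                                            # encoder output
k = int(np.argmin(((codebook - z_e) ** 2).sum(axis=1)))    # nearest code index
z_q = codebook[k]                                          # quantised latent
x_q = W_dec @ z_q                                          # reconstruction

reconstruction = np.sum((x - x_q) ** 2)
codebook_loss = np.sum((z_e - z_q) ** 2)       # sg on z_e: would update only the codebook
commitment = beta * np.sum((z_e - z_q) ** 2)   # sg on z_q: would update only the encoder
total = reconstruction + codebook_loss + commitment

print(reconstruction, codebook_loss, commitment, total)
```

The codebook and commitment terms have the same forward value up to the factor $\beta$; they differ only in where their gradients go.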


## Commitment loss
The commitment loss acts as a regulariser: without it, $z_{e}$ and $z_{q}$ can both change at the same time, which makes training unstable. Let's take an example again.
- Iteration one - image1 has selected [10, 20] code book indices and $z_{e}$ is [0.1, 0.2]
- Iteration two - image1 has selected [30, 40] code book indices and $z_{e}$ is [0.15, 0.18]

In the above case both $z_{e}$ and the selected codes $z_{q}$ changed between iterations. To avoid this we use the commitment loss: the codebook is kept fixed (hence the stop-gradient operation, so that this term does not backpropagate into the codebook) and $z_{e}$ is pushed towards $z_{q}$, so the encoder commits to a code instead of drifting between codes. Fixing the codebook in this term makes sense because the codebook learning useful features from the data is ultimately our main priority.
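
The effect of the commitment term alone can be sketched with a few hand-written gradient steps. This is a simplification: $z_{e}$ is treated as a free parameter rather than the output of an encoder network, and the code vector, step size and $\beta$ are hypothetical values chosen for illustration.

```python
import numpy as np

beta, lr = 0.25, 0.5                  # hypothetical values for illustration
z_q = np.array([0.5, -0.5])           # selected code, held fixed (stop gradient)
z_e = np.array([0.1, 0.2])            # encoder output, treated as a free parameter

gaps = []
for _ in range(5):
    grad = 2 * beta * (z_e - z_q)     # gradient of beta * ||z_e - sg[z_q]||^2 w.r.t. z_e
    z_e = z_e - lr * grad             # only z_e moves; z_q is never updated
    gaps.append(float(np.linalg.norm(z_e - z_q)))

print(gaps)
```

With these values each step shrinks the distance between $z_{e}$ and $z_{q}$ by a factor of 0.75 while the code itself stays where it is.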


