# Variational Autoencoders

## Introduction
An autoencoder is a pair of neural networks trained without supervision to learn an identity function, i.e. to reproduce its own input. Its purpose is twofold:

* learn an efficient compression of the data
* decompress the data with minimal error

Accordingly, it consists of two neural networks:
* **Encoder network**: takes a high-dimensional input $\mathbf{x}\in \mathbb{R}^n$ and maps it to a latent code $\mathbf{z} \in \mathbb{R}^k$ with $k \ll n$. The map is given by
        $$ \mathbf{z} = g_\phi(\mathbf{x})$$
* **Decoder network**: maps the low-dimensional latent code back to the high-dimensional input space, giving the reconstruction
        $$ \mathbf{x}' = f_\theta(g_\phi (\mathbf{x}))$$

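The parameters $\theta$ and $\phi$ are trained jointly by minimizing the reconstruction loss

$$L(\theta,\phi) = \frac{1}{N} \sum_{i=1}^N \lVert \mathbf{x}_i - f_\theta(g_\phi(\mathbf{x}_i)) \rVert^2$$

where $\mathbf{x}_i$ is the $i$-th training sample and $N$ is the number of samples (not to be confused with the input dimension $n$).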
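To make the encoder, decoder, and loss concrete, here is a minimal NumPy sketch. It assumes purely linear maps for $g_\phi$ and $f_\theta$ (no biases or nonlinearities), which is a simplification chosen for illustration and not something these notes prescribe. The names `encode`, `decode`, `reconstruction_loss`, and the sizes `n`, `k`, `N` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim n, latent dim k << n, N samples.
n, k, N = 8, 2, 16

# Hypothetical linear encoder/decoder weights, randomly initialized.
phi = rng.normal(size=(k, n)) * 0.1    # encoder parameters
theta = rng.normal(size=(n, k)) * 0.1  # decoder parameters


def encode(x, phi):
    """g_phi: R^n -> R^k, applied row-wise to a batch of shape (N, n)."""
    return x @ phi.T


def decode(z, theta):
    """f_theta: R^k -> R^n, applied row-wise to a batch of shape (N, k)."""
    return z @ theta.T


def reconstruction_loss(x, phi, theta):
    """Squared reconstruction error, averaged over the samples."""
    x_rec = decode(encode(x, phi), theta)
    return np.mean(np.sum((x - x_rec) ** 2, axis=1))


x = rng.normal(size=(N, n))  # synthetic data, for illustration only
z = encode(x, phi)
loss = reconstruction_loss(x, phi, theta)
print(z.shape, loss)
```

In practice both maps would be deep networks and the loss would be minimized by gradient descent over $\theta$ and $\phi$ together; the sketch only evaluates the loss once for untrained weights.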

For videos, the tensors carry a channel dimension in addition to the frame (time) and spatial dimensions:
$$\mathbf{x}\in \mathbb{R}^{B\times C\times T \times H \times W} $$
$$\mathbf{z} \in \mathbb{R}^{B'\times C'\times T' \times H' \times W'}$$

where $B$ is the batch size (the number of videos processed together), $C$ is the number of channels (e.g. RGB), $T$ is the number of frames, and $H, W$ are the frame height and width, which are assumed to be uniform across frames in these notes. Primed quantities denote the corresponding sizes in latent space. The channels essentially act as labels for the abstract features that a model can learn.

We further define the spatiotemporal compression factors $C_t\times C_f\times C_f$, with $C_t = T/T'$ and $C_f = H/H' = W/W'$. For example, in LTX-Video with batch size 1:
* input video dimensions ($C\times T\times H\times W$): $3\times 64 \times 512 \times 512$, with a pixel-space volume of $50331648$.
* latent dimensions ($C'\times T'\times H'\times W'$): $128\times 8 \times 16 \times 16$, with a latent-space volume of $262144$.

This gives an overall compression ratio of $1:192$ (since $50331648 / 262144 = 192$), with each latent token covering a block of $8\times 32\times 32$ pixels (frames $\times$ height $\times$ width). Generally, the pixel-space volume is patchified into blocks of this size, and the patches are processed independently.
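The arithmetic in this example can be checked with a few lines of Python. The shapes are the ones quoted above with the batch dimension dropped; the helper `compression_factors` is a hypothetical name introduced here, not part of any library.

```python
from math import prod

# Shapes from the example, ordered as (C, T, H, W); batch size 1 is omitted.
pixel_shape = (3, 64, 512, 512)
latent_shape = (128, 8, 16, 16)


def compression_factors(pixel_shape, latent_shape):
    """Return (C_t, C_f) = (T/T', H/H'), requiring H/H' == W/W'."""
    _, t, h, w = pixel_shape
    _, t_lat, h_lat, w_lat = latent_shape
    if t % t_lat or h % h_lat or w % w_lat:
        raise ValueError("latent dims must divide the pixel dims evenly")
    if h // h_lat != w // w_lat:
        raise ValueError("expected the same compression along H and W")
    return t // t_lat, h // h_lat


pixel_volume = prod(pixel_shape)
latent_volume = prod(latent_shape)
c_t, c_f = compression_factors(pixel_shape, latent_shape)

num_tokens = prod(latent_shape[1:])   # latent positions T' * H' * W'
pixels_per_token = c_t * c_f * c_f    # pixel positions covered by one token

print(pixel_volume)                   # 50331648
print(latent_volume)                  # 262144
print(pixel_volume // latent_volume)  # 192
print((c_t, c_f, c_f))                # (8, 32, 32)
print(num_tokens, pixels_per_token)   # 2048 8192
```

Note that the overall ratio of 192 is smaller than the per-token block size of 8192 because the channel count grows from 3 to 128 while the spatiotemporal extent shrinks.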