## Issues with Autoencoders

#### 1. Lack of Robust Latent Space Structure

- **Traditional AE Issue**:

    - In a traditional AE, the encoder compresses data into a latent space, but there is `no guarantee` that this space is `structured` or `organized in a meaningful way`.
    - This disorganized latent space makes it challenging to sample meaningful new data points, as the space doesn’t capture the data distribution well.

- **VAE Solution**:

    - VAEs introduce a probabilistic approach by encoding `data as distributions` rather than fixed points in the latent space.
    - Instead of mapping each input to a single point, a VAE maps it to a **mean** and **variance**.
    - During training, the VAE learns to model the latent space as a `continuous and structured` space by encouraging it to approximate a known distribution, typically a **standard normal distribution**.
    - This makes it possible to sample coherent new data from any point within the distribution, leading to a smoother and more organized latent space.


#### 2. Difficulty in Sampling New Data

- **Traditional AE Issue**:

    - Since traditional AEs map data to specific points in an unstructured latent space, `random sampling` in this space often leads to `meaningless outputs` that don’t resemble the training data.

- **VAE Solution**:

    - VAEs address this with **probabilistic sampling** and the **reparameterization trick**.
    - Because training pushes the encoded distributions toward a known prior, points sampled from the learned distribution lie close to real data points in the latent space and decode to outputs that resemble the original data distribution.
    - The **reparameterization trick** enables VAEs to perform stochastic sampling in a way that’s compatible with gradient-based learning.


#### 3. Inflexible and Deterministic Output

- **Traditional AE Issue**:

    - Traditional AEs are **deterministic**, meaning for any input, they always produce the same output in the latent space.
    - This approach limits the AE’s ability to generalize to minor variations in the data and doesn’t allow for **variability** in generated outputs.

- **VAE Solution**:

    - VAEs introduce **variability** by encoding data as distributions (mean and variance) rather than single points.
    - For a given input, VAEs can output slightly different reconstructed data by sampling from the distribution.
    - This added **variability** enhances the model’s robustness, particularly for generating new data that includes inherent randomness.

## Transition to Variational AE

#### 1. Encoder: Change to Probabilistic Encoding
- Traditional AE: Encodes input into a fixed point in the latent space.

- VAE Modification: Encodes input into a `mean` and `variance` for a probability distribution in the latent space, rather than a single point.

- This change allows the model to sample from a range of values around the mean, capturing variability in the data.

```python
import numpy as np
import pandas as pd
```

```python
# Dummy data representing height (cm) and weight (kg)
data = np.array([
    [160, 55],
    [170, 65],
    [180, 75],
    [190, 85],
    [200, 95]
])
```

```python
# Convert to a DataFrame for easy viewing
df = pd.DataFrame(data, columns=['Height (cm)', 'Weight (kg)'])
df
```

|   | Height (cm) | Weight (kg) |
|---|-------------|-------------|
| 0 | 160 | 55 |
| 1 | 170 | 65 |
| 2 | 180 | 75 |
| 3 | 190 | 85 |
| 4 | 200 | 95 |


#### 2. Traditional AE Encoding vs. VAE Encoding

- `Traditional AE Encoding`: Each input (height and weight) would map directly to a `single fixed point` in the latent space. So, our encoder might convert each height-weight pair into a unique 2D point like [1.5, -0.3], [0.8, 1.2], etc.

- `VAE Encoding`: Instead of a single point, the encoder `maps each input` to a `mean` and `variance`. For instance, for an input [160, 55], the VAE might output a `mean` of [1.5, -0.3] and a `variance` of [0.2, 0.1].

- This mean and variance define a normal distribution for each data point in the latent space.

- In a Variational Autoencoder (VAE), we want to learn a distribution over the latent space rather than fixed deterministic points. The encoder output `z_mean` is treated as the center of a Gaussian, so each input is represented by a probability distribution defined by a mean and a variance:

    - `Mean`: The mean is the center of the distribution in the latent space. It is a `vector`, and each dimension of the vector corresponds to a dimension in the latent space.
    - `Variance`: The variance defines the spread or uncertainty in the latent space for that particular point. Again, it's a `vector`, and each dimension corresponds to a variance for a particular latent dimension.
      
- **Interpretation from the VAE Objective Function**:
  - The **KL divergence loss** in a VAE objective function compares the distribution represented by `z_mean` and `z_log_var` to a prior distribution (often a standard Gaussian).
  - Minimizing this KL divergence pulls each encoded Gaussian (centered at `z_mean`, with log-variance `z_log_var`) toward the prior, which enforces a **structured latent space**.
  - Thus, although `z_mean` is computed via a dense layer, it acts as the **learned mean of the distribution** in the latent space.

If we want to represent a simple `2D latent space`, we could have a `2D mean` and `2D variance`. 

Here's how that looks:

- Input: [Height, Weight] = [160, 55]
- Latent space (2D): The VAE would `learn` a `2D mean vector` and a `2D variance vector` that define a distribution over the 2D latent space.
  
Let’s say the VAE outputs the following for one data point:

- Mean Latent Vector: [1.5, -0.3] (This could represent a point in the 2D latent space where the data is centered).
- Variance Latent Vector: [0.2, 0.1] (This represents how much spread there is around the mean along each axis).

Now, the VAE does not simply map [160, 55] to the latent point [1.5, -0.3] directly. Instead, it creates a distribution centered around [1.5, -0.3] with a spread determined by [0.2, 0.1].

```python
# Example dataset (Height, Weight) for 3 data points
data_points = np.array([[160, 55],
                        [165, 60],
                        [170, 65]])
```

```python
# Encoder outputs (mean and variance for each data point)
mean_latent     = np.array([[1.5, -0.3],
                            [1.6, -0.2],
                            [1.7, -0.1]])  # 3 data points, 2D means

variance_latent = np.array([[0.2, 0.1],
                            [0.3, 0.15],
                            [0.25, 0.2]])  # 3 data points, 2D variances
```
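The arrays above only describe the distributions. As a minimal sketch of what "encoding as a distribution" means in practice, the following NumPy snippet draws one latent sample per data point from a diagonal Gaussian with those means and variances. The helper name `sample_latent` and the fixed seed are illustrative assumptions, not part of any library.

```python
import numpy as np

# Same illustrative encoder outputs as above (3 data points, 2D latent space)
mean_latent = np.array([[1.5, -0.3],
                        [1.6, -0.2],
                        [1.7, -0.1]])
variance_latent = np.array([[0.2, 0.1],
                            [0.3, 0.15],
                            [0.25, 0.2]])

def sample_latent(mean, variance, rng):
    """Draw one sample per row from N(mean, diag(variance))."""
    std = np.sqrt(variance)                     # standard deviation per latent dimension
    epsilon = rng.standard_normal(mean.shape)   # noise from N(0, I)
    return mean + std * epsilon

rng = np.random.default_rng(seed=0)  # fixed seed only for reproducibility
z = sample_latent(mean_latent, variance_latent, rng)
print(z.shape)  # (3, 2): one 2D latent sample per data point

# Averaging many samples for the first data point recovers its mean
many = np.stack([sample_latent(mean_latent, variance_latent, rng)
                 for _ in range(5000)])
print(many[:, 0, :].mean(axis=0))  # close to [1.5, -0.3]
```

Each call returns a different `z` for the same input, which is exactly the variability a fixed-point encoding lacks.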

**Example** - encoder

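```python
from keras.layers import Input, Dense, LeakyReLU, Lambda
from keras.models import Model

# Input: a single data point, e.g., a flattened image or a feature vector
# (input_dim and latent_dim are assumed to be defined beforehand).
encoder_input = Input(shape=(input_dim,), name='encoder_input')
```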

```python
# Fully connected Dense layer with 300 units, followed by a
# LeakyReLU activation function, to learn features of the input data
x = Dense(300)(encoder_input)
x = LeakyReLU()(x)
```

```python
# Output of a Dense layer with latent_dim units.
# Each data point is mapped to a vector of size latent_dim that represents the
# mean (the center) of its Gaussian distribution in the latent space.
# For example, if latent_dim=2, then z_mean is a 2D vector: [mean_1, mean_2].
z_mean = Dense(latent_dim, name='z_mean')(x)
```

```python
# Output of another Dense layer with latent_dim units.
# No explicit log() is applied: the raw output of the Dense layer is simply
# interpreted as the log-variance, i.e. the layer predicts the variance on a
# logarithmic scale.
# Later, in the reparameterization trick, exp(0.5 * z_log_var) converts it to
# the standard deviation used for sampling in the latent space.
z_log_var = Dense(latent_dim, name='z_log_var')(x)
```
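One reason to predict the log-variance rather than the variance itself is that a Dense layer without an activation can output any real number, while a variance must be positive. The small NumPy sketch below uses made-up raw outputs (the values in `raw_output` are assumptions for illustration) to show that exponentiating always yields a valid variance.

```python
import numpy as np

# Hypothetical raw outputs of the z_log_var layer: any real number is allowed
raw_output = np.array([-3.0, 0.0, 2.0])

variance = np.exp(raw_output)     # always strictly positive
std = np.exp(0.5 * raw_output)    # standard deviation, as used when sampling

print(variance)                          # all entries > 0, even for negative raw outputs
print(np.allclose(std ** 2, variance))   # True: exp(0.5 * log_var) squared is the variance
```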

```python
# Reparameterization trick: enables backpropagation through stochastic sampling.
# Lambda(sampling): custom layer that applies the sampling function (defined
# further below) to z_mean and z_log_var.
# sampling: generates a random latent point by adding standard normal noise,
# scaled by the standard deviation, to the mean.
z = Lambda(sampling,
           output_shape=(latent_dim,),
           name='z')([z_mean, z_log_var])
```

```python
# Build the encoder model
encoder = Model(encoder_input, [z_mean, z_log_var, z], name='encoder')
```

```python
import keras.backend as K  # Keras 2.x-style backend API

def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]      # Number of samples in the batch
    dim = K.int_shape(z_mean)[1]    # Dimensionality of the latent space

    # Sample noise from a standard normal distribution
    epsilon = K.random_normal(shape=(batch, dim))

    # Reparameterization trick: z = mean + std * epsilon,
    # where std = exp(0.5 * log_var)
    z = z_mean + K.exp(0.5 * z_log_var) * epsilon

    return z
```

**What happens during training?**

- `Backpropagation`: Sampling directly from a distribution is not a differentiable operation. The reparameterization trick rewrites the sample as a deterministic function of `z_mean`, `z_log_var`, and an independent noise term `epsilon`, so gradients can flow through `z` to the encoder and the VAE can be trained with standard gradient descent.

- `During inference`: No gradients are needed at test time. To generate new data, sample `z` from the prior (a standard normal distribution) and pass it through the decoder; to embed an existing input, it is common to use `z_mean` directly.
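To make the differentiability claim concrete: for a fixed noise draw $\epsilon$, the sample $z = \mu + \exp(\tfrac{1}{2}\log\sigma^2)\,\epsilon$ is an ordinary function of the mean and log-variance, with $\partial z / \partial \mu = 1$ and $\partial z / \partial \log\sigma^2 = \tfrac{1}{2}\sigma\epsilon$. The NumPy sketch below checks these derivatives with finite differences. The function name `reparameterize` and the numeric values are illustrative assumptions.

```python
import numpy as np

def reparameterize(z_mean, z_log_var, epsilon):
    """z = mean + std * epsilon, with std = exp(0.5 * log_var)."""
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

z_mean, z_log_var = 1.5, np.log(0.2)   # illustrative values
epsilon = 0.7                          # one fixed draw of the noise
h = 1e-6                               # finite-difference step

# Numerical derivatives with the noise held fixed
dz_dmean = (reparameterize(z_mean + h, z_log_var, epsilon)
            - reparameterize(z_mean - h, z_log_var, epsilon)) / (2 * h)
dz_dlogvar = (reparameterize(z_mean, z_log_var + h, epsilon)
              - reparameterize(z_mean, z_log_var - h, epsilon)) / (2 * h)

# Analytic derivatives
std = np.exp(0.5 * z_log_var)
print(np.isclose(dz_dmean, 1.0))                     # True
print(np.isclose(dz_dlogvar, 0.5 * std * epsilon))   # True
```

Because the randomness lives entirely in `epsilon`, an automatic differentiation framework can treat it as a constant input and compute gradients for `z_mean` and `z_log_var` in the usual way.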

#### Loss Function in VAE

The **loss function** in a **Variational Autoencoder (VAE)** consists of two main components:
1. **Reconstruction Loss**: Measures how well the decoder reconstructs the input data.
2. **KL Divergence Loss**: Regularizes the latent space by measuring the difference between the learned latent space distribution and a prior (usually a standard normal distribution).
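
The two components are typically added together. A common formulation, shown here as a sketch in which the symbol $\mathcal{L}_{\text{VAE}}$ is introduced purely for illustration, is the negative evidence lower bound:

$$
\mathcal{L}_{\text{VAE}}(x) = \underbrace{-\mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right]}_{\text{reconstruction loss}} + \underbrace{D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right)}_{\text{KL divergence loss}}
$$

Minimizing this sum trades off reconstruction accuracy against keeping the encoded distribution $q(z \mid x)$ close to the prior $p(z)$.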

#### 1. Reconstruction Loss

The **reconstruction loss** is responsible for ensuring that the model can accurately reconstruct the input data. It measures how far the output (reconstructed data) is from the original input data.

- **Formula**:
  $$
  \text{Reconstruction Loss} = -\mathbb{E}_{q(z \mid x)}\left[ \log p(x \mid z) \right]
  $$
  Where:
  - $q(z \mid x)$ is the approximate posterior (encoded distribution) of the latent variables $z$ given the input data $x$.
  - $p(x \mid z)$ is the likelihood of the data $x$ given the latent variable $z$, which is modeled by the decoder.
  - The negative sign turns the expected log-likelihood, which we want to maximize, into a loss to minimize.
  
  In practice, this term is usually implemented as either **Mean Squared Error (MSE)** or **Binary Cross-Entropy**, depending on the type of data being reconstructed.

For example:
- **MSE** is used for continuous, real-valued data.
- **Binary Cross-Entropy** is used for binary data, or for values scaled to the range [0, 1] (such as normalized pixel intensities).

In code:
```python
from keras.losses import MeanSquaredError

# Assume x and x_reconstructed are the true and reconstructed inputs
reconstruction_loss = MeanSquaredError()(x, x_reconstructed)
```
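
For intuition, both reconstruction losses can also be computed by hand. The NumPy sketch below uses made-up values for `x` and `x_reconstructed` (assumptions for illustration) and averages over all elements; other reductions, such as summing over features, are also used in practice.

```python
import numpy as np

# Made-up batch of 2 samples with 3 features each, values in [0, 1]
x = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
x_reconstructed = np.array([[0.1, 0.8, 0.9],
                            [0.7, 0.2, 0.1]])

# Mean squared error, averaged over all elements
mse = np.mean((x - x_reconstructed) ** 2)

# Binary cross-entropy, averaged over all elements;
# clipping avoids log(0) for predictions that are exactly 0 or 1
eps = 1e-7
p = np.clip(x_reconstructed, eps, 1 - eps)
bce = -np.mean(x * np.log(p) + (1 - x) * np.log(1 - p))

print(mse)  # small value: the reconstructions are close to the inputs
print(bce)  # positive, and it grows as predictions move away from the targets
```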


#### 2. KL Divergence Loss

The KL Divergence Loss serves as a regularizer by ensuring that the learned latent space distribution $q(z \mid x)$ is close to a known distribution (typically a standard normal distribution $\mathcal{N}(0, I)$).
- Formula:

$$
\text{KL Divergence Loss} = D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{i=1}^{D_z} \left( \exp\left(\log \sigma_i^2\right) + \mu_i^2 - 1 - \log \sigma_i^2 \right)
$$


Where:
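- $\mu_i$ and $\sigma_i^2$ are the mean and variance of the $i$-th latent dimension of the approximate posterior distribution $q(z \mid x)$; in the encoder above, $\mu_i$ corresponds to `z_mean` and $\log \sigma_i^2$ to `z_log_var`.
- $p(z)$ is the prior distribution (usually $\mathcal{N}(0, I)$).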
- $D_z$ is the dimensionality of the latent space.
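
The closed-form expression translates directly into code. Below is a minimal NumPy sketch, assuming the encoder outputs are available as arrays `z_mean` and `z_log_var` of shape `(batch, D_z)`. The function name `kl_divergence` is illustrative.

```python
import numpy as np

def kl_divergence(z_mean, z_log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(z_log_var) + z_mean ** 2 - 1.0 - z_log_var, axis=1)

# A posterior equal to the prior has zero divergence
print(kl_divergence(np.zeros((1, 2)), np.zeros((1, 2))))  # [0.]

# The illustrative encoder output used earlier: mean [1.5, -0.3], variance [0.2, 0.1]
z_mean = np.array([[1.5, -0.3]])
z_log_var = np.log(np.array([[0.2, 0.1]]))
kl = kl_divergence(z_mean, z_log_var)
print(kl)  # positive: this posterior differs from the prior
```

The further `z_mean` moves from zero, or the variance from one, the larger this penalty becomes, which is what keeps the latent space organized around the prior.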