# **Project: Anomaly Detection for AITEX Dataset**
#### Track: Bonus
## `Notebook`: Understanding Variational Autoencoders (VAEs) with AitexVAE
**Author**: Oliver Grau 

**Date**: 21.04.2025  
**Version**: 1.0

# Introduction

This document breaks down the core concepts of Variational Autoencoders (VAEs) and maps them to a specific PyTorch implementation, AitexVAE (version 1). We'll explore the architecture, the mathematics, and key design choices.

## High-Level VAE Concepts

A Variational Autoencoder (VAE) is a generative model that learns a compressed representation (latent space) of input data and can then generate new data by sampling from this latent space. It comprises four primary conceptual components:

1.  **Encoder**: This network, parameterized by $ \phi $, takes an input $ x $ and maps it to the parameters of a probability distribution $ q_\phi(z|x) $ in the latent space. This distribution represents our belief about the latent variable $ z $ given the input $ x $.
2.  **Latent Space Sampling**: A latent vector $ z $ is sampled from the distribution $ q_\phi(z|x) $ (e.g., a Gaussian $ \mathcal{N}(\mu, \sigma^2) $). This is typically done using the reparameterization trick to allow for backpropagation.
3.  **Decoder**: This network, parameterized by $ \theta $, takes the sampled latent vector $ z $ and maps it back to a reconstructed output $ \hat{x} $. This process models the likelihood $ p_\theta(x|z) $ of observing $ x $ given $ z $.
4.  **Loss Function**: The VAE is trained by optimizing a loss function that typically consists of two terms:
    * **Reconstruction Loss**: This term measures the dissimilarity between the original input $ x $ and the reconstructed output $ \hat{x} $. Common choices include Mean Squared Error (MSE) or Binary Cross-Entropy (BCE).
    * **KL Divergence**: This term acts as a regularizer. It encourages the learned latent distribution $ q_\phi(z|x) $ to be close to a predefined prior distribution $ p(z) $, often a standard normal distribution $ \mathcal{N}(0, I) $.

Mathematically, the objective is to maximize the Evidence Lower Bound (ELBO), which is equivalent to minimizing the negative ELBO:
$$ \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \,\|\, p(z)) $$
The first term corresponds to the reconstruction likelihood, and the second term is the KL divergence.
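
Because gradient-based optimizers minimize, the ELBO is negated to obtain a training loss. Written with the same symbols as above (the name $ \mathcal{L}_{\text{VAE}} $ is introduced here only for illustration), the sign flip gives:

$$ \mathcal{L}_{\text{VAE}}(\theta, \phi; x) = -\mathcal{L}(\theta, \phi; x) = \underbrace{-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction loss}} + \underbrace{\text{KL}(q_\phi(z|x) \,\|\, p(z))}_{\text{regularizer}} $$

Both terms are non-negative penalties in this form, which is why the implementation later adds a reconstruction loss and a KL term rather than subtracting one from the other.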

## Mapping Theory to Your AitexVAE Class: Building the Model

Let's explore how each conceptual block of the VAE maps to the AitexVAE PyTorch implementation.

### 1. The Encoder: From Input to Latent Representation

The encoder's role is to process the input data $ x $ and output the parameters of the latent distribution $ q_\phi(z|x) $.

```python
self.encoder = nn.Sequential(
    nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), # Output: B x 32 x H/2 x W/2
    nn.ReLU(),
    nn.Dropout2d(dropout_p),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),          # Output: B x 64 x H/4 x W/4
    nn.ReLU(),
    nn.Dropout2d(dropout_p),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),         # Output: B x 128 x H/8 x W/8
    nn.ReLU(),
    nn.Dropout2d(dropout_p),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),        # Output: B x 256 x H/16 x W/16
    nn.ReLU(),
    nn.Dropout2d(dropout_p),
    nn.Flatten()                                                   # Output: B x (256*H/16*W/16)
)
```

* **Convolutional Layers (`Conv2d`)**: Each `Conv2d` layer with `stride=2` effectively reduces the spatial dimensions (height and width) of the feature maps by half. For an input image of size $256 \times 256$:
    * $256 \times 256 \rightarrow 128 \times 128 \rightarrow 64 \times 64 \rightarrow 32 \times 32 \rightarrow 16 \times 16$
* **Channel Depth**: Simultaneously, the channel depth increases ($ \text{in\_channels} \rightarrow 32 \rightarrow 64 \rightarrow 128 \rightarrow 256 $), allowing the network to learn increasingly complex and abstract features.
* **Flattening (`Flatten()`)**: After the convolutional blocks, the resulting 3D feature map (channels, height, width) is flattened into a 1D vector. This vector serves as the input to the fully connected layers that define the latent space parameters.

➡️ This entire encoder block corresponds to the process of inferring a latent distribution over the code $ z $ from the input $ x $. The output of `self.encoder` is a flattened feature vector that will be used to determine $ \mu $ and $ \log\sigma^2 $.
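
The halving at each step follows from the standard `Conv2d` output-size formula. A minimal sketch in plain Python, assuming `dilation=1` and the kernel, stride, and padding values used above (the helper names are hypothetical, not part of AitexVAE):

```python
def conv2d_out_size(size, kernel_size=3, stride=2, padding=1):
    """Output size of nn.Conv2d along one spatial dimension (dilation=1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def encoder_spatial_sizes(input_size, num_layers=4):
    """Spatial size after each of the stride-2 convolutions in the encoder."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv2d_out_size(sizes[-1]))
    return sizes


print(encoder_spatial_sizes(256))  # [256, 128, 64, 32, 16]
```

With these settings the formula reduces to rounding `size / 2` up, so inputs whose side length is not divisible by 16 still pass through the encoder but end with a feature map that is not exactly `size / 16`.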

#### Understanding `feature_map_dim`

The `feature_map_dim` variable represents the total number of features in the flattened output of the final convolutional layer in the encoder.

```python
# Example: If in_channels=1, for a 256x256 input image
# After 4 stride-2 Conv layers, H_out = 256 / (2^4) = 16, W_out = 256 / (2^4) = 16
# self.feature_map_size would be 16 in this case.
# The number of channels in the last Conv layer is 256.
self.feature_map_dim = 256 * (self.feature_map_size ** 2)
```

* **Calculation**: It's calculated as the product of the number of output channels of the last convolutional layer (256 in this VAE) and the squared spatial dimension of the feature map at that point (`self.feature_map_size ** 2`).
* **Purpose**: This dimension is critical because it defines the input size for the fully connected layers (`self.fc_mu` and `self.fc_logvar`) that will output the parameters of the latent distribution.

➡️ `feature_map_dim` essentially transforms the rich, spatially-aware features learned by the CNN part of the encoder into a single vector suitable for defining the latent distribution parameters.

**Example:**
If the input image is $256 \times 256$ and the encoder downsamples it 4 times with a stride of 2, the final feature map's spatial dimensions will be $16 \times 16$.
* `self.feature_map_size = 16`
* `self.feature_map_dim = 256 * 16 * 16 = 65536`

This `feature_map_dim` of 65536 becomes the input dimension for `self.fc_mu` and `self.fc_logvar`.
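
To get a feel for what this dimension costs, the arithmetic can be sketched in plain Python. The helper names and `latent_dim = 128` are assumptions chosen for illustration, and the image side is assumed to be divisible by 16:

```python
def feature_map_dim(image_size, num_downsamples=4, out_channels=256):
    """Flattened encoder output size for a square input."""
    feature_map_size = image_size // (2 ** num_downsamples)
    return out_channels * feature_map_size ** 2


def linear_param_count(in_features, out_features):
    """Weights plus biases of one nn.Linear layer."""
    return in_features * out_features + out_features


dim = feature_map_dim(256)
latent_dim = 128  # hypothetical value, not taken from AitexVAE
params_per_head = linear_param_count(dim, latent_dim)

print(dim)              # 65536
print(params_per_head)  # 8388736
```

Each of the two heads (`fc_mu` and `fc_logvar`) would carry this many parameters under these assumptions, so the flattened dimension dominates the model size.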

### 2. The Latent Space: Defining the Bottleneck

After the encoder processes the input, the resulting feature vector is used to define the parameters of the latent distribution. In most VAEs, this is a Gaussian distribution.

```python
self.fc_mu = nn.Linear(self.feature_map_dim, latent_dim)
self.fc_logvar = nn.Linear(self.feature_map_dim, latent_dim)
```

These two fully connected layers take the flattened output from `self.encoder` and produce:
* `mu`: The mean vector $ \mu(x) $ of the latent distribution.
* `logvar`: The logarithm of the variance vector $ \log \sigma^2(x) $ of the latent distribution. Using log-variance helps ensure that the variance is always positive and can aid in numerical stability during training.

➡️ These outputs, $ \mu(x) $ and $ \log \sigma^2(x) $, define the parameters of the approximate posterior distribution $ q_\phi(z|x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x))) $.

#### 2a. Sampling from the Latent Space: The Reparameterization Trick

To train the VAE using gradient descent, we need to be able to backpropagate through the sampling process. The reparameterization trick allows this.

```python
def reparameterize(self, mu, logvar):
    std = torch.exp(0.5 * logvar)  # Calculate standard deviation: sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)   # Sample epsilon from a standard normal distribution N(0, I)
    return mu + eps * std         # Compute z = mu + epsilon * sigma
```

* **The Trick**: Instead of directly sampling $ z $ from $ \mathcal{N}(\mu, \sigma^2) $, we sample $ \epsilon $ from a standard normal distribution $ \mathcal{N}(0, I) $ (which is independent of $ \mu $ and $ \sigma^2 $) and then compute $ z = \mu + \sigma \cdot \epsilon $.
* **Benefit**: This separates the stochastic part (sampling $ \epsilon $) from the parameters $ \mu $ and $ \sigma $, allowing gradients to flow back to `fc_mu` and `fc_logvar` (and thus to the encoder) during backpropagation.

➡️ The reparameterization trick is crucial for enabling gradient-based optimization of the VAE, as it makes the sampling step differentiable with respect to the parameters of the latent distribution.
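
The gradient flow can be checked directly with autograd. The following is a small standalone sketch (not part of AitexVAE) that repeats the three lines of `reparameterize` on toy tensors and inspects the resulting gradients:

```python
import torch

torch.manual_seed(0)

# Toy latent parameters that require gradients, as fc_mu / fc_logvar outputs would.
mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)  # random noise; no gradient is tracked for eps
z = mu + eps * std

z.sum().backward()

# dz/dmu = 1 and dz/dlogvar = 0.5 * eps * std
expected_logvar_grad = 0.5 * eps * std.detach()
print(mu.grad)
print(logvar.grad)
```

Both `mu.grad` and `logvar.grad` are populated, while `eps` is treated as a constant input. That is the sense in which the stochastic part has been moved out of the differentiable path.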

#### 2b. Deep Dive: The Posterior Distribution $q_\phi(z|x)$

The notation $ q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2)) $ specifies the form of the learned latent distribution.

It means:
* We are modeling the latent variables $ z $ with a multivariate Gaussian distribution.
* The mean of this Gaussian is $ \mu = [\mu_1, \mu_2, ..., \mu_d] $, where $d$ is the `latent_dim`.
* The covariance matrix $ \Sigma $ is a diagonal matrix, where the diagonal elements are the variances $ \sigma_i^2 $:
    $$ \Sigma = \text{diag}(\sigma^2) = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_d^2 \end{pmatrix} $$

➡️ This choice implies that each latent dimension $ z_i $ is treated as independent of the others, conditioned on $ x $, with its own mean $ \mu_i $ and variance $ \sigma_i^2 $.

##### Why Diagonal Covariance?

Using a diagonal covariance matrix for $ q_\phi(z|x) $ is a common simplification with several advantages:

| Advantage                      | Explanation                                                                 |
| :----------------------------- | :-------------------------------------------------------------------------- |
| **Computationally efficient** | We only need to learn $ d $ variances instead of $ d(d+1)/2 $ parameters for a full covariance matrix. |
| **Easy sampling** | Sampling $ z $ simplifies to $ z_i = \mu_i + \sigma_i \cdot \epsilon_i $, where $ \epsilon_i \sim \mathcal{N}(0, 1) $. This is what the reparameterization trick leverages. |
| **Closed-form KL divergence** | The KL divergence between $ q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2)) $ and a standard normal prior $ p(z) = \mathcal{N}(0, I) $ has a simple analytical (closed-form) solution, making the loss calculation straightforward. |
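
For reference, the closed-form KL divergence mentioned in the table can be written per latent dimension, using the same symbols as above:

$$ \text{KL}\big(\mathcal{N}(\mu, \text{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{i=1}^{d} \left( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \right) = -\frac{1}{2} \sum_{i=1}^{d} \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right) $$

Each summand is zero exactly when $ \mu_i = 0 $ and $ \sigma_i^2 = 1 $, so the term vanishes only when the posterior matches the prior.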

##### Exploring Alternatives: Full Covariance Matrix

What if we used a full covariance matrix, $ q_\phi(z|x) = \mathcal{N}(\mu, \Sigma) $, where $ \Sigma \in \mathbb{R}^{d \times d} $ is a dense, symmetric positive definite matrix?

* **Pros**:
    * Can model correlations between latent dimensions, potentially capturing more complex relationships in the data.
    * May lead to a better approximation of the true posterior distribution if latent dimensions are indeed correlated.
* **Cons**:
    * Requires estimating $ \mathcal{O}(d^2) $ parameters for $ \Sigma $, which can be computationally expensive and prone to overfitting for high-dimensional latent spaces.
    * Sampling requires a Cholesky decomposition of $ \Sigma = LL^\top $, so $ z = \mu + L \cdot \epsilon $. This is more computationally intensive than the diagonal case.
    * The KL divergence to a standard normal prior still has a closed form, $ \frac{1}{2}\left(\text{tr}(\Sigma) + \mu^\top \mu - d - \log\det\Sigma\right) $, but evaluating it requires a trace and a log-determinant of $ \Sigma $ instead of a simple sum over independent dimensions.

| Feature         | Diagonal Covariance                | Full Covariance                      |
| :-------------- | :--------------------------------- | :----------------------------------- |
| Parameter count | Linear in $ d $ ($d$ means, $d$ variances) | Quadratic in $ d $ ($d$ means, $d(d+1)/2$ covariance terms) |
| Sampling        | Easy: $ z = \mu + \sigma \cdot \epsilon $           | More complex: Requires Cholesky decomposition for $ L $ in $ z = \mu + L \cdot \epsilon $ |
| KL divergence   | Closed-form and simple to $ \mathcal{N}(0, I) $ | More complex, though analytical |
| Flexibility     | Lower (assumes latent dimension independence) | Higher (can model correlations)    |
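
The difference in parameter count grows quickly with the latent dimension. A short sketch in plain Python (the function names are hypothetical) counts only the covariance parameters, since both variants need the same $ d $ means:

```python
def diagonal_cov_params(d):
    """One variance per latent dimension."""
    return d


def full_cov_params(d):
    """Free entries of a symmetric d x d matrix (diagonal plus upper triangle)."""
    return d * (d + 1) // 2


for d in (8, 64, 128):
    print(d, diagonal_cov_params(d), full_cov_params(d))
# 8 8 36
# 64 64 2080
# 128 128 8256
```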

##### A Practical Compromise: Low-Rank + Diagonal Covariance

Some VAE variants aim for a middle ground by modeling the covariance matrix with more structure than purely diagonal but less complexity than full covariance. One such approach is:
$$ \Sigma = D + UU^\top $$
Where:
* $ D $ is a diagonal matrix (easy to learn, captures individual variances).
* $ UU^\top $ is a low-rank matrix (where $ U $ is a $ d \times k $ matrix with $ k \ll d $), which can capture the most significant correlations.

This offers a good trade-off between expressiveness and computational efficiency.
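
One way to see why this structure is cheap to sample from: if $ \epsilon_d \sim \mathcal{N}(0, I_d) $ and $ \epsilon_k \sim \mathcal{N}(0, I_k) $ are independent, then $ z = \mu + D^{1/2} \epsilon_d + U \epsilon_k $ has covariance $ D + UU^\top $, and no Cholesky decomposition is needed. The sketch below illustrates this in plain Python; the function names and the toy values are assumptions for illustration only:

```python
import random


def low_rank_plus_diag_cov(diag, U):
    """Build the dense matrix Sigma = D + U U^T (for inspection only)."""
    d, k = len(diag), len(U[0])
    cov = [[sum(U[i][r] * U[j][r] for r in range(k)) for j in range(d)]
           for i in range(d)]
    for i in range(d):
        cov[i][i] += diag[i]
    return cov


def sample_low_rank_plus_diag(mu, diag, U, rng):
    """Draw z = mu + sqrt(D) * eps_d + U @ eps_k without ever forming Sigma."""
    d, k = len(mu), len(U[0])
    eps_d = [rng.gauss(0.0, 1.0) for _ in range(d)]
    eps_k = [rng.gauss(0.0, 1.0) for _ in range(k)]
    return [mu[i] + diag[i] ** 0.5 * eps_d[i]
            + sum(U[i][r] * eps_k[r] for r in range(k))
            for i in range(d)]


mu = [0.0, 0.0, 0.0]
diag = [1.0, 1.0, 1.0]      # diagonal entries of D (variances)
U = [[1.0], [2.0], [0.0]]   # d x k factor with k = 1

cov = low_rank_plus_diag_cov(diag, U)
print(cov)  # [[2.0, 2.0, 0.0], [2.0, 5.0, 0.0], [0.0, 0.0, 1.0]]

z = sample_low_rank_plus_diag(mu, diag, U, random.Random(0))
```

In this toy example the first two latent dimensions become correlated through the single column of `U`, while the third stays independent. The covariance needs $ d + dk $ parameters instead of $ d(d+1)/2 $.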

### 3. The Decoder: Reconstructing from Latent Space

The decoder takes a sampled latent vector $ z $ and attempts to reconstruct the original input $ x $. It models $ p_\theta(x|z) $.

```python
# First, a fully connected layer to prepare z for upsampling
self.fc_dec = nn.Linear(latent_dim, 64 * (self.feature_map_size ** 2))

self.decoder = nn.Sequential(
    # Unflatten projects the 1D vector from fc_dec back into a 3D spatial feature map
    # Target shape: B x 64 x feature_map_size x feature_map_size
    nn.Unflatten(1, (64, self.feature_map_size, self.feature_map_size)),

    # The encoder reached feature_map_size via four stride-2 steps (H -> H/16).
    # The layers below are simplified for illustration: two stride-4 steps take a
    # 16x16 feature map (from a 256x256 input) back to a 256x256 output.
    # First upsampling step: 16x16 -> 64x64
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4, padding=0),
    nn.ReLU(),
    # This layer would then upsample 64x64 -> 256x256
    nn.ConvTranspose2d(32, in_channels, kernel_size=4, stride=4, padding=0),
    nn.Sigmoid() # Output activation
)
```

* **Fully Connected Layer (`fc_dec`)**: This layer takes the `latent_dim`-sized vector $ z $ and projects it to a higher-dimensional vector. The size of this vector is chosen to match the flattened dimensions of the feature map that the convolutional transpose layers expect (e.g., $64 \times \text{feature\_map\_size} \times \text{feature\_map\_size}$).
* **Unflattening (`Unflatten`)**: This layer reshapes the 1D output of `fc_dec` back into a 3D volume (channels, height, width), suitable for input to convolutional transpose layers.
* **Convolutional Transpose Layers (`ConvTranspose2d`)**: These layers, also known as "deconvolutional" layers, perform upsampling. With `kernel_size=4`, `stride=4`, and `padding=0`, each `ConvTranspose2d` layer multiplies both the height and the width of its input by four. The channel depth typically decreases as spatial dimensions increase.
* **Output Activation (`Sigmoid`)**: The `Sigmoid` activation function is used at the end. It squashes the output values into the range $(0, 1)$. This is often suitable when the input data (e.g., images) is normalized to $[0,1]$ and a Binary Cross-Entropy (BCE) loss is used.

➡️ The decoder implements the generative part of the VAE, learning the likelihood model $ p_\theta(x|z) $ that maps latent codes back to the data space.
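
The upsampling factors can be verified with the standard `ConvTranspose2d` output-size formula. This is a plain-Python sketch with a hypothetical helper name, not code from AitexVAE:

```python
def conv_transpose2d_out_size(size, kernel_size, stride, padding=0,
                              output_padding=0, dilation=1):
    """Output size of nn.ConvTranspose2d along one spatial dimension."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# The two stride-4 layers from the simplified decoder above.
size = 16
sizes = [size]
for _ in range(2):
    size = conv_transpose2d_out_size(size, kernel_size=4, stride=4)
    sizes.append(size)

print(sizes)  # [16, 64, 256]
```

A decoder that mirrors the encoder's four stride-2 steps would be an alternative design. For instance, `kernel_size=4, stride=2, padding=1` doubles the size at each step according to the same formula.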

#### 3a. Choosing the Decoder's Output Activation: The Role of Sigmoid

The choice of the final activation function in the decoder is crucial and depends on the nature of the input data and the reconstruction loss function.

**When is `Sigmoid()` appropriate?**

| Use Case                                  | Input Data Range | Typical Loss Function | Use Sigmoid? |
| :---------------------------------------- | :--------------- | :-------------------- | :----------- |
| Images with pixel values normalized to $[0,1]$ | $[0,1]$          | BCE                   | ✅ **Yes** |
| Grayscale images normalized to $[0,1]$    | $[0,1]$          | MSE                   | ⚠️ Optional (can sometimes help, but not strictly necessary) |
| Data normalized to $[-1,1]$ (e.g., some audio) | $[-1,1]$         | MSE / Other           | ❌ No (use `Tanh` instead) |
| Data with arbitrary range (e.g., FFT outputs) | Any              | Spectral Loss / MSE   | ❌ **Avoid Sigmoid** (it squashes every value into $(0,1)$) |

**Your Case: AITEX with MSE and potentially a Frequency-Domain Loss**

Given that your VAE (AitexVAE) might use Mean Squared Error (MSE) and potentially a frequency-domain loss, and you are not strictly using Binary Cross-Entropy (BCE):
* The `Sigmoid` function constrains every output to the interval $(0, 1)$. If your input data isn't strictly in this range, or if a custom loss operates on representations that don't need this constraint, `Sigmoid` can be detrimental because the decoder cannot reproduce values outside that interval.
* For MSE loss, the decoder output doesn't need to be squashed by a sigmoid, especially if the input data isn't normalized to $[0,1]$ or if you want the model to be able to output values outside this range when that suits the reconstruction.

➡️ **Recommendation**: If you are not using BCE loss and your input data is not strictly normalized to $[0,1]$ (or if you want the flexibility for outputs to exceed this range before loss calculation), it's generally better to **remove the `Sigmoid` activation** from the final layer of the decoder.

**Proposed Replacement:**

Instead of:
```python
    nn.ConvTranspose2d(32, in_channels, kernel_size=4, stride=4, padding=0),
    nn.Sigmoid()
```

Consider:
```python
    nn.ConvTranspose2d(32, in_channels, kernel_size=4, stride=4, padding=0)
    # Optionally, you might use nn.ReLU() if outputs should be non-negative,
    # or no activation at all if outputs can be any real number.
    # For image data that is ultimately [0,1] but using MSE, sometimes Tanh scaled to [0,1]
    # or just no activation (and clipping data before MSE) are alternatives.
```

#### Summary of Activation Choices for Decoder Output

| Activation | Output Range | Common Use When...                                      |
| :--------- | :----------- | :---------------------------------------------------- |
| `Sigmoid`  | $(0, 1)$     | Output is a probability or data normalized to $[0,1]$ (often with BCE loss). |
| `Tanh`     | $(-1, 1)$    | Data is normalized to $[-1,1]$ (e.g., mean-centered data). |
| `ReLU`     | $[0, \infty)$ | Output values are expected to be non-negative.        |
| None (Linear)| $(-\infty, \infty)$ | No specific constraints on output range (e.g., for regression, or when using MSE on unnormalized data, or before custom loss functions like FFT loss). |
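
The output ranges in the table can be checked numerically. The following plain-Python sketch applies each option to a few sample pre-activation values; the values themselves are arbitrary:

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    return max(0.0, v)


activations = {
    "Sigmoid": sigmoid,
    "Tanh": math.tanh,
    "ReLU": relu,
    "None (Linear)": lambda v: v,
}

pre_activations = [-3.0, 0.0, 3.0]  # arbitrary sample values
for name, fn in activations.items():
    print(name, [round(fn(v), 3) for v in pre_activations])
```

Only the linear output passes the values through unchanged. A decoder ending in `Sigmoid` could never output the `3.0` in this example, no matter how it is trained.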

## AitexVAE in Action: The Forward Pass and Loss Function

With the building blocks defined, let's see how they connect in the forward pass and how the model is trained using the VAE loss function.

### 1. The Forward Pass: Connecting the Components

The `forward` method defines how input data flows through the VAE.

```python
def forward(self, x):
    encoded = self.encoder(x)          # Pass input through encoder
    mu = self.fc_mu(encoded)           # Get latent mean
    logvar = self.fc_logvar(encoded)   # Get latent log-variance
    z = self.reparameterize(mu, logvar) # Sample latent vector z using reparameterization
    
    dec_input = self.fc_dec(z)         # Prepare z for the convolutional decoder
    x_recon = self.decoder(dec_input)  # Reconstruct the input
    
    return x_recon, mu, logvar         # Return reconstruction and latent parameters
```

The `forward` pass executes the VAE's pipeline:
1.  Encode the input $ x $ into a feature vector.
2.  Calculate the mean $ \mu $ and log-variance $ \log\sigma^2 $ of the latent distribution.
3.  Sample a latent vector $ z $ using the reparameterization trick.
4.  Decode $ z $ to produce the reconstructed input $ \hat{x} $ (denoted `x_recon`).

It returns:
* `x_recon` ($ \hat{x} $): The reconstructed image/data.
* `mu` ($ \mu $), `logvar` ($ \log\sigma^2 $): The parameters of the learned latent distribution, necessary for calculating the KL divergence part of the loss.
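
To confirm that the shapes line up end to end, the fragments shown so far can be assembled into one module and run on random data. The class below is a hypothetical assembly for illustration (the name `AitexVAESketch`, the constructor signature, and the default values are assumptions), and it uses the simplified two-layer decoder from above and not necessarily the real AitexVAE decoder:

```python
import torch
import torch.nn as nn


class AitexVAESketch(nn.Module):
    """Hypothetical assembly of the fragments shown in this notebook."""

    def __init__(self, in_channels=1, latent_dim=16, image_size=64, dropout_p=0.1):
        super().__init__()
        # Four stride-2 convolutions shrink each spatial side by a factor of 16.
        self.feature_map_size = image_size // 16
        self.feature_map_dim = 256 * (self.feature_map_size ** 2)

        layers, channels = [], [in_channels, 32, 64, 128, 256]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(), nn.Dropout2d(dropout_p)]
        self.encoder = nn.Sequential(*layers, nn.Flatten())

        self.fc_mu = nn.Linear(self.feature_map_dim, latent_dim)
        self.fc_logvar = nn.Linear(self.feature_map_dim, latent_dim)

        self.fc_dec = nn.Linear(latent_dim, 64 * (self.feature_map_size ** 2))
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, self.feature_map_size, self.feature_map_size)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4, padding=0),
            nn.ReLU(),
            nn.ConvTranspose2d(32, in_channels, kernel_size=4, stride=4, padding=0),
            nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def forward(self, x):
        encoded = self.encoder(x)
        mu, logvar = self.fc_mu(encoded), self.fc_logvar(encoded)
        z = self.reparameterize(mu, logvar)
        return self.decoder(self.fc_dec(z)), mu, logvar


model = AitexVAESketch()
x = torch.rand(2, 1, 64, 64)  # random stand-in for a batch of grayscale images
x_recon, mu, logvar = model(x)
print(x_recon.shape, mu.shape, logvar.shape)
```

A small `image_size` of 64 keeps the two linear heads light for a quick smoke test. The reconstruction has the same shape as the input, and `mu` and `logvar` each have shape `(batch, latent_dim)`.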

### 2. The VAE Loss Function: Training the Model

The VAE loss function combines the reconstruction quality with a regularization term for the latent space.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss (example: Binary Cross-Entropy).
    # With reduction='sum', the per-element BCE terms are summed over all pixels
    # and all samples in the batch. Both x_recon and x must lie in [0, 1].
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')
    
    # KL Divergence
    # This is the analytical form for KL(N(mu, sigma^2) || N(0, I))
    # KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    
    return recon_loss + kl_loss # Total loss
```

* **Reconstruction Loss (`recon_loss`)**:
    * The example uses `F.binary_cross_entropy(x_recon, x, reduction='sum')`. This measures the pixel-level difference between the original input `x` and the reconstructed output `x_recon`. It's suitable when inputs and outputs are in the $[0,1]$ range (often used with a sigmoid output).
    * If using MSE, this would be `F.mse_loss(x_recon, x, reduction='sum')`.
* **KL Divergence (`kl_loss`)**:
    * This term ` -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())` is the analytical formula for the KL divergence between the learned Gaussian posterior $ q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2)) $ and a standard Gaussian prior $ p(z) = \mathcal{N}(0, I) $.
    * It encourages the learned distributions $ q_\phi(z|x) $ for different inputs $ x $ to be, on average, close to the standard normal distribution. This regularizes the latent space, making it more structured and suitable for generation.

The VAE is trained by minimizing this combined `recon_loss + kl_loss`.
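
The MSE alternative mentioned above can be sketched as a variant of the same function. The name `vae_loss_mse` and the choice to return the two terms alongside the total (handy for logging) are assumptions for illustration, not the AitexVAE API:

```python
import torch
import torch.nn.functional as F


def vae_loss_mse(x, x_recon, mu, logvar):
    """Summed MSE reconstruction loss plus the closed-form KL term."""
    recon_loss = F.mse_loss(x_recon, x, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss, recon_loss, kl_loss


# Sanity check on toy tensors: a perfect reconstruction with a posterior
# equal to the prior (mu = 0, logvar = 0) gives zero loss.
x = torch.rand(2, 1, 8, 8)
total, recon, kl = vae_loss_mse(x, x.clone(), torch.zeros(2, 4), torch.zeros(2, 4))
print(total.item())  # 0.0

# Shifting every mean to 1 adds 0.5 per latent entry: 0.5 * 8 = 4.0.
_, _, kl_shifted = vae_loss_mse(x, x.clone(), torch.ones(2, 4), torch.zeros(2, 4))
print(kl_shifted.item())  # 4.0
```

Unlike BCE, this variant places no constraint on the range of `x_recon`, which fits a decoder without a final `Sigmoid`.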

## Conclusion: Synthesizing VAE Concepts and AitexVAE Implementation

This exploration has walked through the fundamental principles of Variational Autoencoders and their practical implementation in AitexVAE. We followed the data through the model: the **encoder** compresses the input into a probabilistic latent representation, `feature_map_dim` bridges the convolutional features and the latent parameters, and the **latent space** is where the **reparameterization trick** enables learning and where choices such as diagonal versus full covariance shape the posterior distribution $q_\phi(z|x)$.

We also examined the **decoder**'s function in reconstructing data from these latent codes and the importance of selecting an appropriate **output activation function** based on data characteristics and the chosen loss function. Finally, the **forward pass** showed how these components work together, culminating in the **VAE loss function**, which balances reconstruction fidelity against latent space regularization.

Understanding these interconnected elements and design considerations is crucial not only for using AitexVAE effectively but also for adapting it or developing new VAE architectures. The choices made at each stage, from network depth and layer types to the specifics of the latent distribution and loss components, collectively define the VAE's capacity to learn meaningful representations and generate coherent data. This breakdown should serve as a solid foundation for further experimentation, fine-tuning, and applications of VAEs in other domains.

<p style="font-size: 0.8em; text-align: center;">© 2025 Oliver Grau. Educational content for personal use only. See LICENSE.txt for full terms and conditions.</p>