# **Project: Anomaly Detection for AITEX Dataset**
#### Track: VAE
## `Notebook 9`: Concepts and Math for VAE
**Author**: Oliver Grau 

**Date**: 27.03.2025  
**Version**: 1.0


# **What is a Variational Autoencoder (VAE)?**

# 📚 Table of Contents

- [Introduction](#introduction)
- [What is a Variational Autoencoder (VAE) and what problem does it solve?](#what-is-a-variational-autoencoder-vae-and-what-problem-does-it-solve)
- [Core Idea of VAE](#core-idea)
- [The Mathematics of a VAE](#the-mathematics-of-a-vae)
  - [Latent Distributions: Mean and Standard Deviation](#latent-distributions-mean-and-standard-deviation)
  - [Loss Function](#loss-function)
- [Why is this useful for Anomaly Detection?](#why-is-this-useful-for-anomaly-detection)
- [Difference between a Normal Autoencoder and a VAE](#difference-between-a-normal-autoencoder-and-a-vae)
- [Why Use a Distribution Instead of a Direct Vector?](#why-use-a-distribution-instead-of-a-direct-vector)
- [The Reparameterization Trick](#the-reparameterization-trick)
- [What is KL Divergence Doing?](#what-is-kl-divergence-doing)
- [Summary: Why This Matters for Anomaly Detection](#summary-why-this-matters-for-anomaly-detection)
- [Generative Models and Anomaly Detection](#generative-models-and-anomaly-detection)
  - [What Does "Generative" Mean?](#what-does-generative-mean)
  - [Anomaly Detection with Generative Models](#anomaly-detection-with-generative-models)
  - [Concrete Example](#concrete-example)
  - [Key Insight](#key-insight)
- [Conclusion](#conclusion)


## Introduction

In this notebook, we dive deep into the **core concepts and mathematics** behind **Variational Autoencoders (VAEs)**, focusing specifically on how they enable **anomaly detection** in highly regular data such as **fabric textures**.

We will explore:
- Why VAEs model **distributions** instead of fixed encodings
- How the **Reparameterization Trick** allows gradient-based optimization
- The role of **KL Divergence** in shaping a meaningful latent space
- And most importantly: **How the generative power of VAEs helps detect anomalies** through reconstruction errors.

This is not just a theoretical journey. Every concept is linked back to **real-world anomaly detection**, so that you understand **when VAEs work and when they do not**, and **how to use them effectively**.

---

A **Variational Autoencoder (VAE)** is a type of **generative model**. It learns to approximate the **distribution** of your input data. In your case: clean, **defect-free grayscale fabric patches**.

## Core Idea of VAE:
> "Learn what *normal* fabric looks like, so that you can detect when something is *not normal* (an anomaly)."

VAEs consist of two main components:

1. **Encoder**: Compresses the input image $ x $ into a **distribution** over latent variables $ z $.  
2. **Decoder**: Reconstructs the original image $ x $ from a sampled $ z \sim q(z|x) $.

---

### The Mathematics of a VAE

Unlike a regular Autoencoder (which gives you a fixed encoding $ z = f(x) $), the VAE encoder outputs **two vectors**:

- $ \mu(x) $: Mean of the latent distribution
- $ \sigma(x) $: Standard deviation of the latent distribution

Then it samples:

$$
z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

→ This is called the **reparameterization trick**.

---

### Loss Function

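The VAE objective, the **evidence lower bound (ELBO)**, is made up of two parts and is **maximized** during training:

$$
\mathcal{L}_{\text{VAE}}(x) = \underbrace{\mathbb{E}_{q(z|x)} [\log p(x|z)]}_{\text{Reconstruction term (log-likelihood)}} - \underbrace{D_{\text{KL}}(q(z|x) \parallel p(z))}_{\text{KL Divergence}}
$$

In practice we minimize its negative. The negative expected log-likelihood becomes the reconstruction loss, and the KL term is often weighted by a factor $ \beta $ (with $ \beta = 1 $ recovering the standard VAE):

$$
\text{Loss}(x) = \text{Reconstruction Loss} + \beta \cdot D_{\text{KL}}(q(z|x) \parallel \mathcal{N}(0, I))
$$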

Where:

- The **Reconstruction Loss** is usually **MSE (Mean Squared Error)** or **BCE (Binary Cross Entropy)** — it ensures that the output looks like the input.
- The **KL Divergence** forces the encoder’s output distributions to stay close to a standard normal distribution.
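
The two terms can be combined in a few lines. The following is a minimal sketch, not a reference implementation: it assumes an MSE reconstruction term summed over pixels, an encoder that predicts `logvar` $ = \log(\sigma^2) $, and a hypothetical function name `vae_loss`. The toy tensors stand in for real images and encoder outputs.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Illustrative VAE loss: reconstruction term plus beta-weighted KL term.

    Assumes x and x_recon have shape (B, ...), mu and logvar have shape (B, d).
    """
    batch_size = x.shape[0]
    # Reconstruction term: squared error summed over pixels, averaged over the batch
    recon = F.mse_loss(x_recon, x, reduction="sum") / batch_size
    # KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
    return recon + beta * kl, recon, kl

# Toy check with stand-in tensors (no real model involved)
x = torch.rand(4, 1, 8, 8)
x_recon = x.clone()              # pretend the reconstruction is perfect
mu = torch.zeros(4, 16)          # posterior mean equal to the prior mean
logvar = torch.zeros(4, 16)      # sigma = 1, equal to the prior
total, recon, kl = vae_loss(x, x_recon, mu, logvar, beta=1.0)
print(total.item(), recon.item(), kl.item())
```

With a perfect reconstruction and a posterior identical to the prior, both terms vanish. Any deviation in the reconstruction or in $ \mu $, $ \sigma $ makes the loss positive.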

---

### Why is this useful for Anomaly Detection?

Because your fabric images are **highly regular**, the VAE can **efficiently learn the structure of normal images**.

So when you input an **anomalous patch** (e.g. a thread defect or stain):

- The encoder struggles to represent it as a "normal" latent variable.
- The decoder **fails to reconstruct** the image well.
- ⇒ This leads to a **high reconstruction error**, which you interpret as an **anomaly score**.

---

### Example: Your Fabric Images

VAEs will learn:

> "Normal fabrics have repetitive patterns, regular textures, straight lines..."

When something **breaks that regularity** (e.g., missing threads, irregular weave), the model won’t know how to reconstruct it, and that **failure becomes your anomaly signal**.


Let's dive deeper into the **KL Divergence**, **Reparameterization**, and **why VAEs use distributions (mean & std)** instead of direct latent vectors.

---

## What’s the key difference between a normal Autoencoder and a VAE?

| Type | Latent Representation                                      |
|------|------------------------------------------------------------|
| Autoencoder | Deterministic: $ z = f(x) $                                |
| VAE | Probabilistic: $ z \sim \mathcal{N}(\mu(x), \sigma^2(x)) $ |

Instead of learning a **single point in latent space**, the VAE learns a **distribution** – specifically, a Gaussian with **mean** $ \mu $ and **standard deviation** $ \sigma $.

---

## Why use a distribution (mean & std) instead of a direct vector?

### Intuition:
When using a single point:
- The model may **overfit** the training data.
- The latent space becomes **disjointed** — there's no smooth transition between similar images.

By using a distribution:
- The VAE encourages the latent space to be **continuous and smooth**.
- It becomes possible to **generate new samples** by drawing from a normal distribution $ \mathcal{N}(0, I) $.
- This makes the model **generative** — it can create new, plausible data from noise!

---

## The Reparameterization Trick

The problem: we need to **sample** from a distribution during training, but sampling is **non-differentiable**, which breaks backpropagation.

The solution:  
Use the **reparameterization trick**:

Instead of sampling directly from $ z \sim \mathcal{N}(\mu, \sigma^2) $, we do this:

$$
z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)
$$

Now, $ \mu $ and $ \sigma $ are part of a **deterministic computation graph** (differentiable), and the randomness is isolated in $ \epsilon $.
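
To see that gradients really do pass through the sampling step, here is a small sketch. It assumes the encoder predicts `logvar` $ = \log(\sigma^2) $ rather than $ \sigma $ directly; the function name `reparameterize` and the tensor shapes are illustrative choices.

```python
import torch

def reparameterize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)      # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)        # all randomness lives here
    return mu + std * eps

# Stand-ins for encoder outputs: 3 samples, 4 latent dimensions
mu = torch.zeros(3, 4, requires_grad=True)
logvar = torch.zeros(3, 4, requires_grad=True)

z = reparameterize(mu, logvar)
z.sum().backward()                     # backpropagate through the sampling step
print(mu.grad)                         # dz/dmu is 1 for every element
print(logvar.grad)                     # depends on the sampled eps
```

Both `mu.grad` and `logvar.grad` are populated, which would not be possible if `z` were drawn by an opaque sampling call with no path back to the encoder outputs.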

---

## 📏 What is KL Divergence doing?

KL Divergence measures how "far apart" two distributions are.

In a VAE, it compares:

$$
D_{\text{KL}}(q(z|x) \;||\; p(z))
$$

- $ q(z|x) $: The **learned** distribution (from the encoder).
- $ p(z) $: A **fixed prior** distribution — usually a standard normal $ \mathcal{N}(0, I) $.

### Why penalize this divergence?

We **force** the learned latent distributions to stay close to the prior. That way:

- The **latent space stays well-organized**.
- We can sample $ z $ from the prior $ \mathcal{N}(0, I) $, and the decoder is likely to produce realistic outputs.

### The KL Term (Analytic Expression):

If $ q(z|x) $ is a Gaussian with diagonal covariance and $ p(z) $ is a standard normal:

$$
D_{\text{KL}}(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \;||\; \mathcal{N}(0, I)) = \frac{1}{2} \sum_{i=1}^{d} \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right)
$$

This can be implemented easily in PyTorch.

---

### Why this all matters for anomaly detection

- By enforcing a **smooth latent space**, the model learns a strong **prior over normal data**.
- **Anomalous inputs** (e.g., defects in fabric) do **not follow this prior** well → they result in poor reconstructions.
- So, **high reconstruction error** = likely anomaly.

---

In **anomaly detection**, you're **not using the VAE to generate images for fun**, like a GAN generating cat pictures. Instead, you want to use its **generative capacity** in a very specific way.

Let’s unpack this step by step.

---

## What does “generative” mean?

A **generative model** is one that can **learn the distribution of a dataset** and then **generate new samples** that "look like" the training data.

For a VAE, this means:

- During training, it learns a mapping from inputs $ x $ (your fabric patches) to a latent space $ z $ (via $ q(z|x) $)
- This latent space is forced (via KL loss) to look like a **standard normal distribution** $ \mathcal{N}(0, I) $
- After training, you can sample a random vector $ z \sim \mathcal{N}(0, I) $ and feed it into the decoder
- The decoder then generates a **new, plausible image** that resembles your training data

➡️ **So the model learns to map "noise" (random vector $ z $) into realistic outputs.**

---

## In the context of your project: why does this matter?

You’re not **sampling random noise** to create new fabrics.  
You're using this **generative ability in reverse**:

### 🚨 Anomaly Detection with Generative Models:
- You feed a possibly **defective image patch** into the VAE.
- The VAE encodes it into a latent distribution.
- It samples from that distribution and decodes it → giving a **reconstructed version of what the model thinks is "normal"**.

If the input was a **normal fabric**, the reconstruction will be **very accurate**.

If the input was **anomalous**:
- The VAE **never saw such defects** during training.
- Its decoder will reconstruct what it **thinks should have been there**.
- The result will look like a **"corrected" version** of the patch.
- You compare the input and the output → **difference = anomaly score**

---

### 📸 Concrete Example:

Let’s say the original input patch looks like this:

```
Input image:     Patch with a broken thread (defect)
Latent z:        Sample from encoder (conditioned on input)
Output image:    Patch without the broken thread (model's guess of normal)
```

If the input is:

- Normal → model reconstructs it well → low error
- Abnormal → model reconstructs it as if **nothing was wrong** → large error at the defect region
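
The effect can be illustrated without any trained model. In this toy sketch the "reconstruction" is simply assumed to be the clean stripe pattern, standing in for what a well-trained decoder might output. The synthetic 16×16 patch and all variable names are made up for illustration.

```python
import torch

# Synthetic 16x16 "fabric": vertical stripes with period 4 (values 0 or 1)
cols = torch.arange(16)
clean = ((cols % 4) < 2).float().repeat(16, 1)      # shape (16, 16)

# Defective input: a 3x3 region where the stripe pattern is wiped out
defective = clean.clone()
defective[6:9, 4:7] = 0.5

# Stand-in for the decoder output: assume the model reproduces the clean pattern
reconstruction = clean

error_map = (defective - reconstruction) ** 2       # per-pixel squared error
anomaly_score = error_map.mean().item()             # one score per patch

# Where is the error non-zero?
defect_rows, defect_cols = torch.nonzero(error_map, as_tuple=True)
print(anomaly_score)
print(defect_rows.min().item(), defect_rows.max().item())
print(defect_cols.min().item(), defect_cols.max().item())
```

The error map is zero everywhere except in rows 6 to 8 and columns 4 to 6, exactly where the defect was placed. The per-pixel map localizes the defect, while its mean gives a single score for the patch.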

---

## Key Insight

Even though you're feeding in **real input images**, the decoder’s job is **not to memorize or copy**.

It’s to **guess the most likely "normal version"** of that input **based on everything it learned**.

This ability to **infer what should be there** is **what makes the model generative**, and it's **why it works so well for anomaly detection**. This holds as long as the anomalies differ clearly from normal texture and are not too small.

---

A VAE is called "generative" because, once trained, it can take random vectors from latent space and create realistic images. In anomaly detection, we **use this to reconstruct the "normal" version** of a possibly defective image and then detect discrepancies between the reconstruction and original.

Let’s **separate two things** that often cause confusion:

---

## Two modes of using a VAE:

### 1. **Sampling Mode (generative use case)**

- You **sample a random latent vector** $ z \sim \mathcal{N}(0, I) $
- Pass it through the **decoder**
- Output: a **completely new image** that *looks like* your training data

✅ This is the *true generative use case*, the kind of sampling that GANs and VAEs are used for in **creative generation**.

> You’re not using this mode in your anomaly detection setup.

---

### 2. **Reconstruction Mode (your use case)**

- You **input a real image** $ x $
- The **encoder** maps it to $ \mu(x) $ and $ \sigma(x) $ → a distribution over latent codes
- You sample $ z \sim \mathcal{N}(\mu(x), \sigma^2(x)) $
- The **decoder** reconstructs the image: $ \hat{x} $ is the decoder output for $ z $ (for example, the mean of $ p(x|z) $)

✅ This is what you're doing for **anomaly detection**

> Here, the input image is **always** the starting point.
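
The two modes differ only in where $ z $ comes from. The sketch below makes this concrete with a deliberately tiny, **untrained** model. The class `TinyVAE`, its single linear layers, and the flattened 8×8 patches are hypothetical choices for illustration, not the architecture used in this project.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Hypothetical minimal VAE on flattened patches; for illustration only."""

    def __init__(self, n_pixels=64, latent_dim=8):
        super().__init__()
        self.latent_dim = latent_dim
        self.enc = nn.Linear(n_pixels, 2 * latent_dim)   # outputs [mu, logvar]
        self.dec = nn.Sequential(nn.Linear(latent_dim, n_pixels), nn.Sigmoid())

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

    def reconstruct(self, x):
        # Reconstruction mode: the input image is the starting point
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z)

    def sample(self, n):
        # Sampling mode: start from the prior N(0, I), no input image
        z = torch.randn(n, self.latent_dim)
        return self.dec(z)

model = TinyVAE()                 # untrained, so the outputs are meaningless here
x = torch.rand(5, 64)             # five flattened 8x8 patches (random stand-ins)
x_hat = model.reconstruct(x)      # reconstruction mode
x_new = model.sample(3)           # sampling mode
scores = ((x - x_hat) ** 2).mean(dim=1)   # one reconstruction error per patch
print(x_hat.shape, x_new.shape, scores.shape)
```

Both modes share the same decoder. Only `reconstruct` involves the encoder, and only its output can be compared against an input to obtain an anomaly score.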

---

## 🔁 So why is it still called “generative”?

Even though you're not generating random samples, the **decoder itself has learned how to “generate” a normal-looking image from any latent code**.

And by **training it with a KL loss**, you made sure that the latent space is:

- Smooth
- Well-organized
- Close to a standard Gaussian

So **in theory**, you *could* sample a random $ z \sim \mathcal{N}(0, I) $ and get a realistic output. That’s why it’s still called a **generative model** — it has this potential, even if you don’t use it that way.

---

## 🔍 In practice: reconstruction from input

You can think of it this way:

> “I feed the model a possibly defective image, and it *imagines* what the most probable normal version of that image would be based on everything it has learned from normal data.”

That’s the **power of the generative decoder**:  
It doesn't copy the input. It tries to "guess what should be there."

And if there’s something strange (a defect), it often **fails to reconstruct that part correctly**, leading to a **large reconstruction error → anomaly detected**.

---

## Summary

| Term              | Meaning in your context                                                                                    |
|-------------------|------------------------------------------------------------------------------------------------------------|
| Input image       | The actual 256×256 fabric patch                                                                            |
| Latent code $ z $ | Sampled from the encoder's output distribution $ \mathcal{N}(\mu(x), \sigma^2(x)) $                        |
| Decoder           | Tries to reconstruct what a “normal” version of the input would look like                                  |
| Generative        | Refers to decoder’s ability to output plausible data based on latent $ z $ — even without a specific input |

The input image is the starting point in your anomaly detection setup, and the "generative" part is what enables the decoder to guess what should be there based on learned regularities.

Let’s now dive into the **KL Divergence** — both **conceptually** and **mathematically** — and see exactly **why it appears in a VAE**, what it **does**, and **how to compute it**.

---

## What is KL Divergence?

The **Kullback-Leibler (KL) divergence** is a measure of how one probability distribution **differs** from another.

Formally, for two probability distributions $ q(z) $ and $ p(z) $:

$$
D_{\text{KL}}(q(z) \,\|\, p(z)) = \int q(z) \log \frac{q(z)}{p(z)} \, dz
$$

It tells you how much **information is lost** when you use $ p(z) $ to approximate $ q(z) $.  
It is **not symmetric**: $ D_{\text{KL}}(q \,\|\, p) \neq D_{\text{KL}}(p \,\|\, q) $
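
A quick numerical example makes the asymmetry tangible. This sketch uses the standard closed form for the KL divergence between two one-dimensional Gaussians; the helper `kl_gauss` and the chosen parameters are illustrative.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for 1-D Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# q = N(0, 1) and p = N(1, 2^2)
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(q || p)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(p || q)
print(round(forward, 4), round(reverse, 4))   # 0.4431 1.3069
```

Swapping the arguments roughly triples the value here, which is why the order $ q \,\|\, p $ in the VAE loss matters.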

---

## Why do we use KL Divergence in VAEs?

In a VAE, the **encoder** learns a distribution $ q(z|x) $ (typically a Gaussian with mean and variance).  
But we want this distribution to be **close to a fixed prior**, usually:

$$
p(z) = \mathcal{N}(0, I)
$$

So we penalize the **difference** between the learned distribution $ q(z|x) $ and this prior using KL divergence:

$$
\text{Loss} = \text{Reconstruction Error} + \beta \cdot D_{\text{KL}}(q(z|x) \,\|\, p(z))
$$

This forces the encoder to produce **latent spaces that are organized, continuous, and close to a standard Gaussian**.

---

## The Math: KL Between Two Gaussians

Assume:

- $ q(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) $
- $ p(z) = \mathcal{N}(0, I) $

Then the KL divergence has a **closed-form solution**:

$$
D_{\text{KL}}(q(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{i=1}^d \left( \mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1 \right)
$$

Where:
- $ \mu_i $: mean of latent dimension $ i $
- $ \sigma_i $: stddev of latent dimension $ i $
- $ d $: dimensionality of the latent space (e.g. 16, 32…)

This is **simple to compute** in PyTorch and fully differentiable.

---

## Intuition of each term:

$$
\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1
$$

| Term                  | Intuition |
|-----------------------|----------|
| $ \mu_i^2 $           | Penalizes moving the center of the distribution away from 0 |
| $ \sigma_i^2 $        | Penalizes making the distribution too wide |
| $ -\log(\sigma_i^2) $ | Penalizes making it too narrow |
| $ -1 $                | Constant offset so that the KL is exactly 0 when $ \mu_i = 0 $ and $ \sigma_i = 1 $ |

So the KL loss:
- Keeps the encoder’s output distribution centered around zero
- Prevents it from collapsing into a **deterministic** encoder (where $ \sigma \to 0 $)
- Keeps the **entire latent space “used”** and **smooth**
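
Plugging a few values into the per-dimension expression shows how each term pulls the encoder back toward $ \mu = 0 $, $ \sigma = 1 $. The helper name `kl_per_dim` and the example values are arbitrary.

```python
import math

def kl_per_dim(mu, sigma):
    """Per-dimension KL( N(mu, sigma^2) || N(0, 1) ) from the closed form."""
    return 0.5 * (mu ** 2 + sigma ** 2 - math.log(sigma ** 2) - 1)

for mu, sigma in [(0.0, 1.0), (2.0, 1.0), (0.0, 3.0), (0.0, 0.1)]:
    print(f"mu={mu}, sigma={sigma}: KL={kl_per_dim(mu, sigma):.3f}")
```

The penalty is 0 at $ (\mu, \sigma) = (0, 1) $, 2.0 for a shifted mean of 2, about 2.9 for a distribution that is too wide ($ \sigma = 3 $), and about 1.8 for one that is too narrow ($ \sigma = 0.1 $).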

---

## Visual Intuition

Imagine each training input creates a **small Gaussian cloud** in latent space.  
The KL loss makes sure that:

- All clouds **overlap well**
- They don’t wander off into the corners
- They **match the shape** of the big, round standard normal cloud

This helps ensure that the latent space is **compact**, **smooth**, and **well-behaved**. This is crucial for both reconstruction and generative capabilities.

---

## PyTorch Example

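```python
import torch

# Assume mu and logvar come from the encoder, each with shape (B, d)
def kl_divergence(mu, logvar):
    # Sums over the batch and all latent dimensions; returns a scalar
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```

- $ \text{logvar} = \log(\sigma^2) $ is often predicted by the encoder for numerical stability
- This function returns a single scalar: it sums over all latent dimensions **and** over all samples in the batch. Divide by the batch size to get the mean KL per sample.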
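
As a sanity check, the closed-form expression can be compared against the generic Gaussian KL in `torch.distributions`. This sketch uses random `mu` and `logvar` tensors as stand-ins for encoder outputs and sums over the latent dimension only, so it yields one KL value per sample.

```python
import torch
from torch.distributions import Normal, kl_divergence as torch_kl

torch.manual_seed(0)
mu = torch.randn(4, 16)          # stand-in encoder means, shape (B, d)
logvar = torch.randn(4, 16)      # stand-in encoder log-variances

# Closed form, summed over latent dimensions only -> one value per sample
closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# Reference: elementwise KL between Normal distributions, then summed per sample
q = Normal(mu, torch.exp(0.5 * logvar))
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
reference = torch_kl(q, p).sum(dim=1)

print(torch.allclose(closed_form, reference, atol=1e-4))
```

Agreement between the two confirms that the one-line formula is the KL divergence to a standard normal, not an approximation of it.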

---

## 🧩 Summary

| Concept | Meaning                                                                    |
|--------|----------------------------------------------------------------------------|
| **KL Divergence** | Measures how far the encoder's distribution is from the standard Gaussian  |
| **Why it's used** | Ensures the latent space is smooth, compact, and similar for all inputs    |
| **What it does** | Regularizes the encoder and keeps $ \mu \approx 0 $, $ \sigma \approx 1 $    |
| **Practical form** | Simple closed-form expression with $ \mu $ and $ \sigma $ from the encoder |

Next, we compare **pixel-wise loss** and **frequency-based loss** in the context of **anomaly detection** with VAEs, especially for **fabric textures** like those in the AITEX dataset.

---

## 1. Pixel-wise Loss (e.g. MSE)

This is the **classic reconstruction loss**. You’re simply comparing the **reconstructed image** $ \hat{x} $ to the **original image** $ x $, pixel by pixel:

$$
\mathcal{L}_{\text{pixel}} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2
$$

- Measures local differences in intensity
- Penalizes **per-pixel** errors equally
- Very effective when images have sharp features (e.g., digits, faces)

### ✅ Pros:
- Simple and widely used
- Works well for images where anomalies are visible as pixel-level changes

### ❌ Cons (especially for textures):
- **Insensitive to structural patterns**
- Can miss **global pattern irregularities** in textures (like fabrics)
- Can allow **blurry reconstructions** that "fool" the loss even if the pattern is broken

---

## 2. Frequency-based Loss (e.g. FFT loss)

This loss compares **differences in the frequency domain**, rather than pixel-by-pixel. You transform both the input and reconstructed image using the **Fourier Transform (FT)**:

$$
X_f = \text{FFT}(x), \quad \hat{X}_f = \text{FFT}(\hat{x})
$$

Then compute the loss (e.g. L1 or L2) between **magnitude spectra**:

$$
\mathcal{L}_{\text{freq}} = \frac{1}{N} \sum_{i=1}^{N} \left|\, |X_{f, i}| - |\hat{X}_{f, i}| \,\right|^2
$$

Note: often, we **only use the magnitudes** and ignore the phase.

### ✅ Pros:
- Captures **global structure**, regularity, and periodicity
- Very sensitive to **changes in pattern frequency** (e.g. broken weave, missing lines)
- Much better suited for **fabric-like textures**

### ❌ Cons:
- May ignore small local anomalies (e.g. small stains)
- Ignores phase (i.e., location) unless explicitly preserved
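
The phase limitation is easy to demonstrate: a circular shift of an image changes only the phase of its Fourier transform, so the magnitude spectrum stays the same. The sketch below uses a random image as a stand-in for a patch; the shift of 5 pixels is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 1, 32, 32)                       # stand-in image, shape (B, 1, H, W)
x_shifted = torch.roll(x, shifts=5, dims=-1)       # circular shift by 5 pixels

mag = torch.abs(torch.fft.fft2(x))
mag_shifted = torch.abs(torch.fft.fft2(x_shifted))

freq_diff = torch.mean((mag - mag_shifted) ** 2).item()   # magnitude-based difference
pixel_diff = F.mse_loss(x_shifted, x).item()              # pixel-wise difference
print(freq_diff, pixel_diff)
```

The magnitude-based difference is zero up to floating-point error, while the pixel-wise MSE is clearly non-zero. A loss built on magnitudes alone therefore cannot tell *where* a pattern sits, which is one reason to pair it with a pixel-wise term.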

---

## 3. Combining Both Losses

In practice, **combining both pixel-wise and frequency loss** works very well:

$$
\mathcal{L}_{\text{total}} = \lambda_{\text{pixel}} \cdot \mathcal{L}_{\text{pixel}} + \lambda_{\text{freq}} \cdot \mathcal{L}_{\text{freq}}
$$

Where $ \lambda $ are weights to balance the two.  
This gives you the best of both worlds:

| Type | Captures |
|------|----------|
| Pixel loss | Local image details |
| Frequency loss | Global, periodic structure & texture |

For AITEX, where patterns are **highly repetitive**, the **frequency component is crucial**.

---

## Refresher: What’s in the Frequency Domain?

When you apply a **2D FFT** to an image, the result tells you:

- **What frequencies are present** (how often things repeat)
- **How strong they are** (intensity of those patterns)
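- In the usual visualization, where the zero frequency is shifted to the center (e.g. with `fftshift`), the center of the FFT image contains **low frequencies** (smooth gradients)
- In that same view, the outer edges contain **high frequencies** (sharp edges, fine textures)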

In fabrics:
- Regular stripes, grids → show as strong **peaks** in frequency space
- Anomalies → distort these peaks or introduce irregular ones
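
A synthetic stripe pattern shows both effects. In this sketch the 64×64 image, the stripe period of 8 pixels, and the helper `off_peak_energy` are all illustrative choices; a real fabric patch would have a richer spectrum.

```python
import math
import torch

# 64x64 image with vertical stripes repeating every 8 pixels
cols = torch.arange(64, dtype=torch.float32)
stripes = torch.cos(2 * math.pi * cols / 8).repeat(64, 1)

mag = torch.abs(torch.fft.fft2(stripes))
# Move the zero-frequency bin to the center, as in typical FFT visualizations
mag_centered = torch.fft.fftshift(mag)

# Strongest peak in the unshifted spectrum: row 0, column 8 or 56 (= 64 - 8)
peak_row, peak_col = divmod(torch.argmax(mag).item(), 64)
print(peak_row, peak_col)

# Defective version: wipe out the pattern in a 10x10 region
defective = stripes.clone()
defective[20:30, 20:30] = 0.0
mag_def = torch.abs(torch.fft.fft2(defective))

def off_peak_energy(m):
    """Sum of magnitudes outside the two stripe peaks."""
    m = m.clone()
    m[0, 8] = 0.0
    m[0, 56] = 0.0
    return m.sum().item()

print(off_peak_energy(mag), off_peak_energy(mag_def))
```

The clean stripes put essentially all of their energy into two symmetric peaks at 8 cycles per image width (64 / 8). The local defect spreads energy into many other frequency bins, which is the kind of change a frequency-based loss responds to.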

---

## PyTorch Snippet (Magnitude Spectrum)

Here’s a tiny example of how to implement frequency loss in PyTorch:

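```python
import torch

def fft_magnitude(img):
    # Assumes img is (B, 1, H, W); fft2 transforms the last two dims (H, W)
    fft = torch.fft.fft2(img)
    # Keep only the magnitude spectrum; the phase is discarded
    return torch.abs(fft)

def frequency_loss(x, x_recon):
    # Mean squared difference between the two magnitude spectra
    x_f = fft_magnitude(x)
    x_recon_f = fft_magnitude(x_recon)
    return torch.mean((x_f - x_recon_f) ** 2)
```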

Then combine it with MSE:

```python
import torch.nn.functional as F
loss = F.mse_loss(x_recon, x) + 0.1 * frequency_loss(x, x_recon)
```

The `0.1` is a hyperparameter you’ll want to tune for your dataset.

---

## 🧩 Summary

| Loss | Good For | Weakness |
|------|----------|----------|
| **Pixel-wise (MSE)** | Local intensity differences | Misses global patterns |
| **Frequency-based** | Repetitive textures, structural regularity | Misses small, localized defects |
| **Combined** | Balanced sensitivity to local and global defects; well suited to fabrics like AITEX | Adds loss weights that must be tuned |


## 🏁 Conclusion

Through this notebook, we've unpacked the **mathematical foundations** and **intuitive insights** that make VAEs such powerful tools for anomaly detection.

Key takeaways:
- VAEs learn a **probabilistic latent space**, enabling them to generate plausible reconstructions of normal data.
- The **KL Divergence** ensures the latent space is **structured and smooth**, critical for generalization.
- Anomalies are detected by observing **large reconstruction errors**, as the VAE struggles to recreate data it was never trained to model.

Understanding these principles empowers you to not just **apply VAEs**, but also **adapt and improve** them for your specific anomaly detection tasks.


<p style="font-size: 0.8em; text-align: center;">© 2025 Oliver Grau. Educational content for personal use only. See LICENSE.txt for full terms and conditions.</p>