# **Project: Anomaly Detection for AITEX Dataset**
#### Track: VAE
## `Notebook 7`: Diagnostic Playbook for VAE Training
**Author**: Oliver Grau 

**Date**: 27.03.2025  
**Version**: 1.0


This **diagnostic playbook** maps common VAE training symptoms to likely root causes and suggested actions, so you have clear decision points during training.

---

## 🧭 VAE Anomaly Detection Debugging Cheat Sheet  
**Symptoms → Root Causes → Suggested Actions**

---

### 🔴 **SYMPTOM 1: Precision, Recall, F1 = 0 but ROC AUC > 0.5**

#### ✅ Meaning:
- The model ranks anomalies *slightly better than chance*
- But the **threshold** results in **no correct detections**

#### 🧠 Root Cause:
- Threshold too high (e.g., 95th percentile) for early-stage or weak reconstructions
- Error distribution too narrow (over-smoothing)
- Too little separation between normal and defect errors

#### 🛠 Suggested Actions:
- Manually lower the threshold (e.g., 85th–90th percentile) and rerun evaluation
- Plot error histogram (normal vs defect)
- Apply **early stopping** before over-smoothing flattens differences
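
To see how sensitive the metrics are to the threshold choice, the manual adjustment above can be sketched as a small percentile sweep. This is a minimal sketch and not the notebook's evaluation code: the names `errors` and `labels`, the label convention (1 = defect, 0 = normal), and the choice to take the percentile over normal-sample errors only are all assumptions.

```python
import numpy as np

def sweep_percentile_thresholds(errors, labels, percentiles=(80, 85, 90, 95, 99)):
    """Report precision/recall/F1 for several percentile-based thresholds.

    errors: 1-D per-sample reconstruction errors (hypothetical name).
    labels: 1-D array, 1 = defect, 0 = normal (assumed convention).
    Assumption: the threshold is a percentile of the normal-sample errors.
    """
    errors = np.asarray(errors, dtype=float)
    labels = np.asarray(labels, dtype=int)
    normal_errors = errors[labels == 0]

    results = []
    for p in percentiles:
        threshold = float(np.percentile(normal_errors, p))
        predicted = errors > threshold  # True = flagged as anomaly

        tp = int(np.sum(predicted & (labels == 1)))
        fp = int(np.sum(predicted & (labels == 0)))
        fn = int(np.sum(~predicted & (labels == 1)))

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        results.append({
            "percentile": p,
            "threshold": threshold,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return results
```

If every row still shows zero recall, the problem is more likely the lack of separation between normal and defect errors than the threshold itself.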

---

### 🔴 **SYMPTOM 2: All reconstructions look the same**

#### 🧠 Root Cause:
- Decoder is ignoring latent code (posterior collapse)
- KL weight too high too early
- Latent space underutilized
- Over-regularization or low decoder capacity

#### 🛠 Suggested Actions:
- Lower `kl_weight`, or use **KL annealing** over epochs
- Increase decoder capacity (more ConvTranspose, residuals, etc.)
- Visualize μ and σ histograms
- Try **Conditional VAE** (add fabric code as input)
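
KL annealing can take many forms; one common variant is a linear warm-up. The sketch below assumes that schedule. The function names, `warmup_epochs`, and `max_kl_weight` are hypothetical and would need to be mapped onto however `kl_weight` is used in the actual training loop.

```python
def kl_weight_schedule(epoch, warmup_epochs=10, max_kl_weight=1.0):
    """Linear KL warm-up: 0 at epoch 0, max_kl_weight from warmup_epochs onward."""
    if warmup_epochs <= 0:
        return max_kl_weight
    return max_kl_weight * min(1.0, epoch / warmup_epochs)


def annealed_vae_loss(recon_loss, kl_loss, epoch, warmup_epochs=10, max_kl_weight=1.0):
    """Combine reconstruction and KL terms using the annealed weight."""
    kl_weight = kl_weight_schedule(epoch, warmup_epochs, max_kl_weight)
    return recon_loss + kl_weight * kl_loss


# Example: print the weight for the first few epochs of a 10-epoch warm-up.
for epoch in range(0, 12, 2):
    print(epoch, kl_weight_schedule(epoch))
```

Starting at zero lets the decoder learn to use the latent code before the KL term begins pulling the posterior toward the prior.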

---

### 🔴 **SYMPTOM 3: Heatmaps are almost all red or uniform**

#### 🧠 Root Cause:
- Model fails to reconstruct almost everything (underfitting)
- Reconstruction quality is low overall — too noisy or blurred
- Might also occur if decoder is too weak or latent too small

#### 🛠 Suggested Actions:
- Train longer (if early epochs)
- Increase latent dimensionality (e.g., 32 → 64 or 128)
- Use BatchNorm or better weight initialization in decoder
- Try skip connections or shallow UNet-style decoder

---

### 🔴 **SYMPTOM 4: Latent μ is flat or centered too tightly around 0**

#### 🧠 Root Cause:
- Posterior collapse: model ignores latent code
- Decoder learns to reconstruct without variability
- KL loss dominates too early

#### 🛠 Suggested Actions:
- Apply **KL warm-up/annealing**
- Lower `kl_weight`
- Visualize latent histograms regularly
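
Alongside the histograms, a per-dimension KL value gives a quick numeric signal for collapse: dimensions whose KL is close to zero carry almost no information. The sketch below assumes the encoder outputs `mu` and `logvar` (log-variance) arrays of shape `(n_samples, latent_dim)`; the `kl_floor` cut-off of 0.01 is an arbitrary illustrative value, not a recommendation from this project.

```python
import numpy as np

def latent_health(mu, logvar, kl_floor=0.01):
    """Summarize latent usage from encoder outputs.

    mu, logvar: arrays of shape (n_samples, latent_dim) (assumed layout).
    kl_floor: hypothetical cut-off below which a dimension counts as collapsed.
    """
    mu = np.asarray(mu, dtype=float)
    logvar = np.asarray(logvar, dtype=float)

    # KL(N(mu, sigma^2) || N(0, 1)) per dimension, averaged over samples.
    kl_per_dim = 0.5 * (mu ** 2 + np.exp(logvar) - 1.0 - logvar).mean(axis=0)
    collapsed = kl_per_dim < kl_floor

    return {
        "kl_per_dim": kl_per_dim,
        "collapsed_dims": np.flatnonzero(collapsed),
        "num_collapsed": int(collapsed.sum()),
        "mean_sigma_per_dim": np.exp(0.5 * logvar).mean(axis=0),
    }
```

Tracking `num_collapsed` per epoch shows whether a change to `kl_weight` or the warm-up schedule actually brings latent dimensions back into use.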

---

### 🔴 **SYMPTOM 5: σ → 0 or σ very narrow**

#### 🧠 Root Cause:
- Encoder is too confident → latent sampling becomes deterministic
- Anomalies won't be well explored in latent space
- No uncertainty modeled

#### 🛠 Suggested Actions:
- Add noise to input (mild)
- Reduce encoder layer depth
- Check that `kl_weight` is not too low, and raise it if needed (the KL term is what pulls σ back toward 1)
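
Mild input noise can be sketched as follows. This assumes image batches are float arrays scaled to `[0, 1]` and that the reconstruction target stays the clean batch (a denoising-style setup); `noise_std=0.02` is an illustrative value, not a tuned one.

```python
import numpy as np

def add_mild_noise(batch, noise_std=0.02, rng=None):
    """Add zero-mean Gaussian noise to a batch and clip back to [0, 1].

    batch: float array of images scaled to [0, 1] (assumed range).
    noise_std: hypothetical noise level; keep it small relative to texture contrast.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = np.asarray(batch, dtype=float)
    noisy = batch + rng.normal(0.0, noise_std, size=batch.shape)
    return np.clip(noisy, 0.0, 1.0)


# Usage: feed `noisy` to the encoder, compare the reconstruction to `clean`.
clean = np.full((4, 1, 8, 8), 0.5)
noisy = add_mild_noise(clean, noise_std=0.02, rng=np.random.default_rng(0))
```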

---

### 🔴 **SYMPTOM 6: ROC AUC stays ~0.55–0.60 across training, never improves**

#### 🧠 Root Cause:
- Latent space or decoder isn't expressive enough
- Model has no capacity to distinguish anomaly-specific cues
- Dataset doesn't contain detectable differences at patch level

#### 🛠 Suggested Actions:
- Try multi-fabric training (if currently training per fabric)
- Increase latent size or decoder width
- Use a hybrid loss that puts more weight on the frequency term
- Re-check if defect masks align well with image patches
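
One way to re-check mask alignment is to compute, for each patch, the fraction of pixels the mask marks as defective, and compare that grid with the labels used in evaluation. The sketch below assumes a 2-D binary mask and non-overlapping square patches; if the pipeline uses strides or overlapping patches, the slicing would need to change accordingly.

```python
import numpy as np

def patch_defect_fractions(mask, patch_size):
    """Return the fraction of defect pixels in each non-overlapping patch.

    mask: 2-D array, nonzero = defect pixel (assumed convention).
    patch_size: side length of square patches (assumed non-overlapping grid).
    Rows/columns that do not fill a whole patch are cropped away.
    """
    mask = (np.asarray(mask) > 0).astype(float)
    height, width = mask.shape
    rows, cols = height // patch_size, width // patch_size

    cropped = mask[: rows * patch_size, : cols * patch_size]
    patches = cropped.reshape(rows, patch_size, cols, patch_size)
    return patches.mean(axis=(1, 3))  # shape: (rows, cols)
```

Patches labelled as defective whose fraction is zero (or nearly so) point to a misalignment between masks and patch coordinates, or to a labelling rule that is too permissive.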

---

## ✅ Bonus Checks

| Check | Why |
|-------|-----|
| **Visualize recon error map** | Shows what regions the model fails to reconstruct |
| **FFT error map** | Detects missing structure in frequency space |
| **μ/σ histograms** | Health of latent space |
| **Error histograms** | Threshold effectiveness |
| **GIF over epochs** | Detects overtraining visually |
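
An FFT error map can be computed in several ways; the sketch below compares log-magnitude spectra of the original and the reconstruction. It assumes single-channel 2-D patches of equal shape and is one possible formulation, not necessarily the one used elsewhere in this project.

```python
import numpy as np

def fft_error_map(original, reconstruction):
    """Absolute difference of log-magnitude spectra, low frequencies centered.

    original, reconstruction: 2-D grayscale arrays of equal shape (assumed).
    """
    original = np.asarray(original, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)

    # fftshift moves the zero-frequency (DC) component to the center.
    orig_spec = np.fft.fftshift(np.fft.fft2(original))
    recon_spec = np.fft.fftshift(np.fft.fft2(reconstruction))

    # log1p compresses the large dynamic range of spectral magnitudes.
    return np.abs(np.log1p(np.abs(orig_spec)) - np.log1p(np.abs(recon_spec)))
```

Large values far from the center of the map indicate high-frequency texture that the reconstruction is missing, which is typical of over-smoothed outputs.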

<p style="font-size: 0.8em; text-align: center;">© 2025 Oliver Grau. Educational content for personal use only. See LICENSE.txt for full terms and conditions.</p>