# Vanilla Autoencoder: Deterministic Representation Learning

## Core Idea

An autoencoder learns a compressed representation $z = f_\phi(x)$ by training to reconstruct its input
through an information bottleneck. The encoder maps high-dimensional data to a low-dimensional
latent space, forcing the network to learn the most salient features.

## Mathematical Foundation

### Architecture

$$x \xrightarrow{\text{Encoder } f_\phi} z \xrightarrow{\text{Decoder } g_\theta} \hat{x}$$

- Encoder: $f_\phi: \mathbb{R}^D \to \mathbb{R}^d$ where $d \ll D$
- Decoder: $g_\theta: \mathbb{R}^d \to \mathbb{R}^D$
- Reconstruction: $\hat{x} = g_\theta(f_\phi(x))$

### Loss Function

$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{x \sim p_{\text{data}}}[\|x - g_\theta(f_\phi(x))\|^2]$$

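For binary data (e.g., binarized MNIST), the binary cross-entropy summed over input dimensions $i$ is used instead, with $\hat{x}_i \in (0, 1)$: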
$$\mathcal{L} = -\sum_i [x_i \log \hat{x}_i + (1-x_i) \log(1-\hat{x}_i)]$$
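
As a small worked check of the two losses above, the sketch below evaluates them for a single sample in plain Python. The vectors `x` and `x_hat` are made-up illustration values, and the clamp `eps` is an assumption added only to keep the logarithm finite; it is not part of the formulas.

```python
import math


def squared_error(x, x_hat):
    """Squared L2 reconstruction error ||x - x_hat||^2 for one sample."""
    return sum((xi - xhi) ** 2 for xi, xhi in zip(x, x_hat))


def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Binary cross-entropy summed over dimensions; x_hat must lie in (0, 1)."""
    total = 0.0
    for xi, xhi in zip(x, x_hat):
        xhi = min(max(xhi, eps), 1.0 - eps)  # clamp so log() stays finite
        total -= xi * math.log(xhi) + (1.0 - xi) * math.log(1.0 - xhi)
    return total


x = [1.0, 0.0, 1.0, 0.0]      # binarized "pixels" (illustrative values)
x_hat = [0.9, 0.1, 0.6, 0.4]  # decoder outputs after a sigmoid (illustrative values)

se = squared_error(x, x_hat)
bce = binary_cross_entropy(x, x_hat)
print(f"squared error: {se:.4f}")
print(f"BCE:           {bce:.4f}")
```

Note that `nn.MSELoss()` with its default `reduction="mean"` averages over every element, so the value it reports is the squared norm divided by the number of input dimensions (and averaged over the batch), not the raw squared norm.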

### Connection to PCA (Theorem)

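**Bourlard & Kamp (1988):** For a linear autoencoder (no activation functions) trained with MSE loss
on centered data, the optimal weights span the same subspace as the top-$d$ principal components of the data.
The weights themselves need not equal the principal directions: any invertible linear change of basis
in the latent space yields the same reconstruction.

**Implication:** A nonlinear autoencoder can be viewed as a generalization of PCA from linear subspaces to nonlinear manifolds.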
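
The subspace statement can be illustrated numerically in the smallest possible case, $D = 2$ and $d = 1$. The sketch below uses synthetic data and only the standard library; the data, the 0.5-degree search grid, and the helper `recon_error` are assumptions made for illustration. The search is restricted to orthogonal projections onto a line, which suffices here because, for a fixed unit-norm decoder direction, the error-minimizing encoder is the projection onto that direction.

```python
import math
import random

random.seed(0)

# Synthetic 2-D data stretched along a known direction (illustrative only).
true_angle = math.radians(30.0)
points = []
for _ in range(500):
    a = random.gauss(0.0, 3.0)  # large variance along the main axis
    b = random.gauss(0.0, 0.5)  # small variance across it
    points.append((a * math.cos(true_angle) - b * math.sin(true_angle),
                   a * math.sin(true_angle) + b * math.cos(true_angle)))

# Center the data, as PCA assumes.
mx = sum(p[0] for p in points) / len(points)
my = sum(p[1] for p in points) / len(points)
points = [(px - mx, py - my) for px, py in points]


def recon_error(encoder, decoder):
    """Mean squared error of the linear autoencoder x -> decoder * (encoder . x)."""
    total = 0.0
    for px, py in points:
        z = encoder[0] * px + encoder[1] * py  # scalar latent code, d = 1
        total += (px - decoder[0] * z) ** 2 + (py - decoder[1] * z) ** 2
    return total / len(points)


# First principal direction from the 2x2 covariance matrix (closed form).
cxx = sum(px * px for px, _ in points) / len(points)
cyy = sum(py * py for _, py in points) / len(points)
cxy = sum(px * py for px, py in points) / len(points)
pca_angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
v = (math.cos(pca_angle), math.sin(pca_angle))
pca_error = recon_error(v, v)

# Rescaling the latent code changes the weights but not the reconstruction.
scale = 5.0
rescaled_error = recon_error((scale * v[0], scale * v[1]),
                             (v[0] / scale, v[1] / scale))

# Brute-force search over projection directions on a 0.5-degree grid.
best_error, best_angle = min(
    (recon_error((math.cos(t), math.sin(t)), (math.cos(t), math.sin(t))), t)
    for t in (math.radians(0.5 * k) for k in range(360))
)

print(f"PCA direction:       {math.degrees(pca_angle):.2f} deg, error {pca_error:.4f}")
print(f"Rescaled weights:    error {rescaled_error:.4f}")
print(f"Best grid direction: {math.degrees(best_angle):.2f} deg, error {best_error:.4f}")
```

The grid search should land on the grid point nearest the principal direction, and the rescaled encoder/decoder pair should reach the same error with different weights, which is why the theorem speaks of a shared subspace rather than identical weights.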

## Problem Statement

Dimensionality reduction methods face a trade-off:
- **PCA:** Linear, closed-form, but cannot capture nonlinear structure
- **Kernel PCA:** Nonlinear, but eigendecomposing the $n \times n$ kernel matrix costs $O(n^3)$ and the kernel must be fixed in advance
- **Autoencoder:** Learns nonlinear mapping, scalable, flexible architecture

## Algorithm Comparison

| Method | Linearity | Complexity | Generative |
|--------|-----------|------------|------------|
| PCA | Linear | $O(D^2 n)$ | No |
| Kernel PCA | Nonlinear | $O(n^3)$ | No |
| Autoencoder | Nonlinear | $O(n \cdot T)$ | Limited |
| VAE | Nonlinear | $O(n \cdot T)$ | Yes |

Here $n$ is the number of training samples, $D$ the input dimension, and $T$ the training cost per sample (number of epochs times the cost of one forward and backward pass).

## Complexity Analysis

- **Time:** $O(n \cdot L \cdot h^2)$ per epoch, where $n$ = number of samples, $L$ = number of layers, $h$ = hidden width (taking every layer to be about $h$ wide)
- **Space:** $O(\sum_l h_l \cdot h_{l+1})$ for weights
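
To make the space estimate concrete, the following sketch counts weights and biases for a fully connected encoder and its mirrored decoder. The layer widths (784, 256, 128, 32) are the `AEConfig` defaults used in this notebook; the helper names `layer_dims` and `count_params` are hypothetical and exist only for this illustration.

```python
def layer_dims(input_dim, hidden_dims, latent_dim):
    """Layer widths of the encoder and of the mirrored decoder."""
    encoder = [input_dim, *hidden_dims, latent_dim]
    decoder = encoder[::-1]
    return encoder, decoder


def count_params(dims):
    """Weights (h_l * h_{l+1}) and biases (one per output unit) of a dense stack."""
    weights = sum(a * b for a, b in zip(dims[:-1], dims[1:]))
    biases = sum(dims[1:])
    return weights, biases


encoder, decoder = layer_dims(784, (256, 128), 32)
enc_w, enc_b = count_params(encoder)
dec_w, dec_b = count_params(decoder)
total = enc_w + enc_b + dec_w + dec_b

# One multiply-add per weight for a forward pass over a single sample.
forward_macs_per_sample = enc_w + dec_w

print(f"encoder: {enc_w:,} weights + {enc_b:,} biases")
print(f"decoder: {dec_w:,} weights + {dec_b:,} biases")
print(f"total parameters: {total:,}")
print(f"forward multiply-adds per sample: {forward_macs_per_sample:,}")
```

For these widths the total comes to 476,720 parameters, dominated by the two $784 \times 256$ layers; the weight sum is exactly the $\sum_l h_l \cdot h_{l+1}$ term above, and the biases add only a small correction.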

In [None]:
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch import Tensor
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import make_grid

In [None]:
@dataclass
class AEConfig:
    """Configuration for Autoencoder.
    
    Core Idea:
        The latent_dim controls the information bottleneck.
        Smaller = more compression = more information loss.
    
    Mathematical Theory:
        Compression ratio = input_dim / latent_dim
        For MNIST: 784 / 32 = 24.5x compression
    """
    input_dim: int = 784
    hidden_dims: Tuple[int, ...] = (256, 128)
    latent_dim: int = 32
    
    lr: float = 1e-3
    batch_size: int = 128
    num_epochs: int = 20
    
    device: str = field(default_factory=lambda: "cuda" if torch.cuda.is_available() else "cpu")
    seed: int = 42
    
    @property
    def compression_ratio(self) -> float:
        return self.input_dim / self.latent_dim

In [None]:
class Encoder(nn.Module):
    r"""Encoder network: maps input to latent space.
    
    Core Idea:
        Progressive dimensionality reduction through hidden layers,
        forcing the network to learn compact representations.
    
    Mathematical Theory:
        $z = f_\phi(x) = W_L \cdot \sigma(W_{L-1} \cdots \sigma(W_1 x))$
        where $\sigma$ is ReLU activation.
    """
    
    def __init__(self, config: AEConfig) -> None:
        super().__init__()
        dims = [config.input_dim] + list(config.hidden_dims) + [config.latent_dim]
        
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU(inplace=True))
        
        self.net = nn.Sequential(*layers)
    
    def forward(self, x: Tensor) -> Tensor:
        return self.net(x.view(x.size(0), -1))

In [None]:
class Decoder(nn.Module):
    r"""Decoder network: maps latent code to reconstruction.
    
    Core Idea:
        Mirror architecture of encoder, progressively expanding
        dimensions back to input space.
    
    Mathematical Theory:
        $\hat{x} = g_\theta(z) = \sigma_{out}(W_L \cdot \sigma(W_{L-1} \cdots \sigma(W_1 z)))$
        where $\sigma_{out}$ is Sigmoid for [0,1] normalized images.
    """
    
    def __init__(self, config: AEConfig) -> None:
        super().__init__()
        dims = [config.latent_dim] + list(reversed(config.hidden_dims)) + [config.input_dim]
        
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU(inplace=True))
            else:
                layers.append(nn.Sigmoid())
        
        self.net = nn.Sequential(*layers)
    
    def forward(self, z: Tensor) -> Tensor:
        return self.net(z)

In [None]:
class Autoencoder(nn.Module):
    """Complete Autoencoder model.
    
    Core Idea:
        Composition of encoder and decoder with shared bottleneck.
        The bottleneck forces learning of compressed representations.
    
    Comparison:
        vs VAE: No probabilistic latent space, cannot sample new data
        vs PCA: Nonlinear, but no closed-form solution
    """
    
    def __init__(self, config: AEConfig) -> None:
        super().__init__()
        self.config = config
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)
    
    def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon.view_as(x), z
    
    def encode(self, x: Tensor) -> Tensor:
        return self.encoder(x)
    
    def decode(self, z: Tensor) -> Tensor:
        return self.decoder(z)

In [None]:
class AETrainer:
    """Training orchestrator for Autoencoder.
    
    Core Idea:
        Minimize reconstruction error via gradient descent.
        MSE loss for continuous data, BCE for binary.
    """
    
    def __init__(self, config: AEConfig) -> None:
        self.config = config
        self.device = torch.device(config.device)
        
        torch.manual_seed(config.seed)
        
        self.model = Autoencoder(config).to(self.device)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=config.lr)
        self.criterion = nn.MSELoss()
        self.history: Dict[str, List[float]] = {"loss": []}
    
    def train_epoch(self, dataloader: DataLoader) -> float:
        self.model.train()
        total_loss = 0.0
        
        for x, _ in dataloader:
            x = x.to(self.device)
            
            x_recon, _ = self.model(x)
            loss = self.criterion(x_recon, x)
            
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(dataloader)
        self.history["loss"].append(avg_loss)
        return avg_loss
    
    @torch.no_grad()
    def reconstruct(self, x: Tensor) -> Tuple[Tensor, Tensor]:
        self.model.eval()
        x = x.to(self.device)
        x_recon, z = self.model(x)
        return x_recon.cpu(), z.cpu()
    
    @torch.no_grad()
    def get_latent(self, dataloader: DataLoader) -> Tuple[np.ndarray, np.ndarray]:
        self.model.eval()
        latents, labels = [], []
        
        for x, y in dataloader:
            z = self.model.encode(x.to(self.device))
            latents.append(z.cpu().numpy())
            labels.append(y.numpy())
        
        return np.concatenate(latents), np.concatenate(labels)

In [None]:
def create_dataloader(config: AEConfig) -> DataLoader:
    transform = transforms.ToTensor()
    dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
    return DataLoader(dataset, batch_size=config.batch_size, shuffle=True, drop_last=True, num_workers=2)

In [None]:
def visualize_reconstruction(trainer: AETrainer, dataset: datasets.MNIST, n: int = 10) -> None:
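    # Convert to plain ints so indexing works with any Dataset implementation.
    indices = torch.randperm(len(dataset))[:n].tolist()
    x = torch.stack([dataset[i][0] for i in indices])
    x_recon, _ = trainer.reconstruct(x)
    
    # squeeze=False keeps axes two-dimensional even when n == 1.
    fig, axes = plt.subplots(2, n, figsize=(n, 2.5), squeeze=False)
    for i in range(n):
        axes[0, i].imshow(x[i].squeeze(), cmap="gray")
        axes[1, i].imshow(x_recon[i].squeeze(), cmap="gray")
        for ax in axes[:, i]:
            # Hide ticks only; axis("off") would also hide the row labels below.
            ax.set_xticks([])
            ax.set_yticks([])
    axes[0, 0].set_ylabel("Original")
    axes[1, 0].set_ylabel("Recon")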
    plt.tight_layout()
    plt.show()

In [None]:
def visualize_latent_space(trainer: AETrainer, dataloader: DataLoader) -> None:
    from sklearn.manifold import TSNE
    
    z, y = trainer.get_latent(dataloader)
    z_sample = z[:2000]
    y_sample = y[:2000]
    
    # Project with t-SNE only when the latent space has more than two dimensions.
    if z.shape[1] > 2:
        z_2d = TSNE(n_components=2, random_state=42).fit_transform(z_sample)
        title = "Latent Space (t-SNE projection)"
    else:
        z_2d = z_sample
        title = "Latent Space"
    
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(z_2d[:, 0], z_2d[:, 1], c=y_sample, cmap="tab10", alpha=0.6, s=10)
    plt.colorbar(scatter, label="Digit")
    plt.title(title)
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_training_curve(history: Dict[str, List[float]]) -> None:
    plt.figure(figsize=(8, 5))
    plt.plot(history["loss"], marker="o")
    plt.xlabel("Epoch")
    plt.ylabel("MSE Loss")
    plt.title("Training Loss")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
if __name__ == "__main__":
    config = AEConfig(latent_dim=32, num_epochs=20)
    dataloader = create_dataloader(config)
    trainer = AETrainer(config)
    
    print(f"Model parameters: {sum(p.numel() for p in trainer.model.parameters()):,}")
    print(f"Compression ratio: {config.compression_ratio:.1f}x")
    print(f"Device: {config.device}")
    
    for epoch in range(config.num_epochs):
        loss = trainer.train_epoch(dataloader)
        if (epoch + 1) % 5 == 0:
            print(f"Epoch [{epoch+1}/{config.num_epochs}] Loss: {loss:.6f}")
    
    plot_training_curve(trainer.history)
    visualize_reconstruction(trainer, dataloader.dataset)
    visualize_latent_space(trainer, dataloader)

## Summary

Vanilla autoencoders learn compressed representations through reconstruction:

1. **Information bottleneck:** A latent dimension much smaller than the input dimension ($d \ll D$) forces feature learning
2. **Nonlinear PCA:** Generalizes linear dimensionality reduction
3. **Deterministic:** Unlike VAE, no probabilistic sampling in latent space

**Limitations:**
- Latent space is irregular (not suitable for generation)
- No explicit regularization on latent structure
- Cannot sample new data points reliably

**Next:** VAE adds probabilistic structure to enable generation.