# Generative Adversarial Networks: Theory and Implementation

## Core Idea

A GAN frames generative modeling as a two-player minimax game between a Generator $G$ and a Discriminator $D$.
The Generator learns to map noise $z \sim p_z$ to data space, while the Discriminator learns to distinguish
real samples from generated ones. At the Nash equilibrium of this game, $G$ produces samples that $D$ cannot tell apart from real data.

## Mathematical Foundation

### Value Function (Goodfellow et al., 2014)

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

**Variable Definitions:**
- $p_{\text{data}}$: True data distribution
- $p_z$: Prior noise distribution (typically $\mathcal{N}(0, I)$)
- $p_g$: Implicit distribution defined by $G(z)$ where $z \sim p_z$
- $D(x) \in [0, 1]$: Probability that $x$ is real
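
The value function can be estimated by sampling, which makes the roles of $p_{\text{data}}$, $p_z$, $G$ and $D$ concrete. The following is a minimal sketch using only the standard library; the 1-D Gaussians, the shift generator `G`, and both discriminators are hypothetical choices made for illustration, not part of the original formulation.

```python
import math
import random


def estimate_value(D, G, sample_data, sample_noise, n=10_000, seed=0):
    """Monte Carlo estimate of V(D, G) from n real and n generated samples."""
    rng = random.Random(seed)
    real_term = sum(math.log(D(sample_data(rng))) for _ in range(n)) / n
    fake_term = sum(math.log(1.0 - D(G(sample_noise(rng)))) for _ in range(n)) / n
    return real_term + fake_term


# Hypothetical toy setup: p_data = N(2, 1), p_z = N(0, 1), G(z) = z so p_g = N(0, 1).
def sample_data(rng):
    return rng.gauss(2.0, 1.0)


def sample_noise(rng):
    return rng.gauss(0.0, 1.0)


def G(z):
    return z


def D_blind(x):
    """A discriminator that ignores its input."""
    return 0.5


def D_informed(x):
    """Density-ratio discriminator for these two Gaussians: sigmoid(2x - 2)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 2.0)))


v_blind = estimate_value(D_blind, G, sample_data, sample_noise)
v_informed = estimate_value(D_informed, G, sample_data, sample_noise)
print(f"blind D:    V = {v_blind:.4f}  (-log 4 = {-math.log(4):.4f})")
print(f"informed D: V = {v_informed:.4f}")
```

A constant $D(x) = 0.5$ always yields $V = -\log 4$, while a discriminator that exploits the gap between the two distributions attains a larger value. This is the quantity the inner maximization over $D$ increases.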

### Optimal Discriminator (Theorem 1)

For fixed $G$, the optimal discriminator is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

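**Proof:** Because $x = G(z)$ with $z \sim p_z$ is distributed according to $p_g$, the second expectation can be taken over $x \sim p_g$, so the value function can be written as:
$$V(D, G) = \int_x \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$$

For any $(a, b) \in \mathbb{R}_{\geq 0}^2 \setminus \{(0, 0)\}$, the function $f(y) = a \log y + b \log(1-y)$
achieves its maximum on $[0, 1]$ at $y = \frac{a}{a+b}$, since $f$ is concave and $f'(y) = \frac{a}{y} - \frac{b}{1-y}$ vanishes there. Setting $a = p_{\text{data}}(x)$, $b = p_g(x)$ pointwise yields the result.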
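
The pointwise maximization step in the proof can be checked numerically. The sketch below brute-forces the maximizer of $f$ on a grid; the $(a, b)$ pairs are arbitrary illustrative density values, and the helper names are hypothetical.

```python
import math


def f(y, a, b):
    """Pointwise integrand a*log(y) + b*log(1 - y) from the proof."""
    return a * math.log(y) + b * math.log(1.0 - y)


def argmax_on_grid(a, b, steps=9_999):
    """Brute-force maximizer of f over an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: f(y, a, b))


# Illustrative values of a = p_data(x) and b = p_g(x) at some fixed x.
for a, b in [(0.3, 0.1), (0.5, 0.5), (0.05, 0.4)]:
    y_star = argmax_on_grid(a, b)
    print(f"a={a}, b={b}: grid argmax={y_star:.4f}, a/(a+b)={a / (a + b):.4f}")
```

In each case the grid maximizer agrees with $a/(a+b)$ up to the grid spacing of $10^{-4}$.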

### Global Optimum (Theorem 2)

Substituting $D^*$ into $V$:

$$C(G) = V(D^*, G) = -\log 4 + 2 \cdot \text{JSD}(p_{\text{data}} \| p_g)$$

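where JSD is the Jensen-Shannon divergence, $\text{JSD}(p \| q) = \tfrac{1}{2} \text{KL}(p \| m) + \tfrac{1}{2} \text{KL}(q \| m)$ with $m = \tfrac{1}{2}(p + q)$. Since $\text{JSD} \geq 0$ with equality iff $p_{\text{data}} = p_g$,
the global minimum $C(G) = -\log 4$ is achieved iff $p_g = p_{\text{data}}$.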
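
The identity for $C(G)$ can be verified on small discrete distributions, where the integral becomes a finite sum. This is a sketch under that discrete assumption; the two three-outcome distributions are made up for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p_data, p_g):
    """C(G) = V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # outcome has no mass under either distribution
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total


# Hypothetical three-outcome distributions, chosen only for illustration.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
c_min = value_at_optimal_d(p_data, p_data)
print(f"V(D*, G)         = {lhs:.6f}")
print(f"-log 4 + 2 * JSD = {rhs:.6f}")
print(f"C(G) at p_g = p_data: {c_min:.6f}")  # -1.386294, i.e. -log 4
```

The first two printed values coincide, and setting $p_g = p_{\text{data}}$ drives $C(G)$ down to $-\log 4$.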

## Problem Statement

Likelihood-based generative models either optimize a bound on an explicit density (VAEs) or restrict the architecture to invertible maps with tractable Jacobians (flow-based models).
GANs sidestep both requirements by training an implicit density model adversarially, enabling:
- High-fidelity sample generation without tractable likelihood
- Flexible generator architectures
- Sharp, realistic outputs (vs. blurry VAE reconstructions)

## Algorithm Comparison

| Method | Likelihood | Sample Quality | Training Stability | Mode Coverage |
|--------|------------|----------------|-------------------|---------------|
| VAE | Tractable ELBO | Blurry | Stable | Good |
| Flow | Exact | Good | Stable | Good |
| GAN | Implicit | Sharp | Unstable | Mode collapse risk |
| Diffusion | Tractable ELBO | Excellent | Stable | Excellent |

## Complexity Analysis

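- **Time:** $O(n \cdot (T_G + T_D))$ per epoch, where $n$ = number of training samples and $T_G$, $T_D$ = per-sample forward-plus-backward cost of each network
- **Space:** $O(|\theta_G| + |\theta_D|)$ for model parameters, plus optimizer state of the same order (Adam keeps two moment estimates per parameter) and batch activations
- **Convergence:** No general guarantee when $G$ and $D$ are non-convex neural networks; in practice training requires careful tuning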
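
To make the space term concrete, the parameter counts can be worked out by hand. The sketch below assumes the MLP layout used in this notebook with its default sizes (`latent_dim=100`, `hidden_dim=256`, 28×28 single-channel images); the helper functions are hypothetical and only mirror how fully connected and BatchNorm layers store learnable parameters.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of a BatchNorm1d layer (running stats excluded)."""
    return 2 * n_features


# Assumed sizes: latent_dim=100, hidden_dim=256, 28x28 single-channel images.
d_z, h, d_x = 100, 256, 28 * 28

generator = (
    linear_params(d_z, h)                                    # first block, no BatchNorm
    + linear_params(h, 2 * h) + batchnorm_params(2 * h)
    + linear_params(2 * h, 4 * h) + batchnorm_params(4 * h)
    + linear_params(4 * h, d_x)                              # output layer before tanh
)
discriminator = (
    linear_params(d_x, 4 * h)
    + linear_params(4 * h, 2 * h)
    + linear_params(2 * h, h)
    + linear_params(h, 1)                                    # output layer before sigmoid
)

print(f"|theta_G| = {generator:,}")      # 1,489,424
print(f"|theta_D| = {discriminator:,}")  # 1,460,225
```

Both networks hold roughly 1.5 million parameters, so with Adam's two moment buffers the training state is about three times that per network.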

In [None]:
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch import Tensor
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.utils import make_grid

In [None]:
@dataclass
class GANConfig:
    r"""Configuration for GAN training.
    
    Core Idea:
        Centralized hyperparameter management following the principle of separation
        of concerns. All training dynamics are controlled from a single source.
    
    Mathematical Theory:
        - latent_dim: Dimension of $z \sim p_z = \mathcal{N}(0, I_{d_z})$
        - label_smoothing: Replaces hard labels $y=1$ with $y=0.9$ to prevent
          discriminator overconfidence, improving gradient flow to generator.
    """
    latent_dim: int = 100
    hidden_dim: int = 256
    image_channels: int = 1
    image_size: int = 28
    
    lr_discriminator: float = 2e-4
    lr_generator: float = 2e-4
    beta1: float = 0.5
    beta2: float = 0.999
    
    batch_size: int = 64
    num_epochs: int = 50
    label_smoothing: float = 0.9
    
    device: str = field(default_factory=lambda: "cuda" if torch.cuda.is_available() else "cpu")
    seed: int = 42
    
    @property
    def image_dim(self) -> int:
        return self.image_channels * self.image_size * self.image_size

In [None]:
class Generator(nn.Module):
    r"""MLP-based Generator Network.
    
    Core Idea:
        Maps low-dimensional noise $z \in \mathbb{R}^{d_z}$ to high-dimensional
        data space $x \in \mathbb{R}^{d_x}$ through a series of learned nonlinear
        transformations. The mapping implicitly defines $p_g(x)$.
    
    Mathematical Theory:
        $G: \mathbb{R}^{d_z} \to \mathbb{R}^{d_x}$ where
        $G(z) = \tanh(W_L \cdot \text{LReLU}(W_{L-1} \cdots \text{LReLU}(W_1 z)))$
        
        Output activation $\tanh$ ensures $G(z) \in [-1, 1]^{d_x}$ matching
        normalized image range.
    
    Architecture:
        - Progressive dimension expansion: $d_z \to 256 \to 512 \to 1024 \to d_x$
        - BatchNorm after each hidden layer except the first, for training stability
        - LeakyReLU(0.2) to prevent dead neurons
    
    Complexity:
        Time: O(d_z * h + h^2 * L + h * d_x) where h=hidden_dim, L=num_layers
        Space: O(sum of weight matrices)
    """
    
    def __init__(self, config: GANConfig) -> None:
        super().__init__()
        self.config = config
        
        self.net = nn.Sequential(
            self._block(config.latent_dim, config.hidden_dim, normalize=False),
            self._block(config.hidden_dim, config.hidden_dim * 2),
            self._block(config.hidden_dim * 2, config.hidden_dim * 4),
            nn.Linear(config.hidden_dim * 4, config.image_dim),
            nn.Tanh(),
        )
        self._init_weights()
    
    def _block(self, in_features: int, out_features: int, normalize: bool = True) -> nn.Sequential:
        layers = [nn.Linear(in_features, out_features)]
        if normalize:
            layers.append(nn.BatchNorm1d(out_features))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return nn.Sequential(*layers)
    
    def _init_weights(self) -> None:
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0.0, 0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.normal_(m.weight, 1.0, 0.02)
                nn.init.zeros_(m.bias)
    
    def forward(self, z: Tensor) -> Tensor:
        if z.dim() == 1:
            z = z.unsqueeze(0)
        out = self.net(z)
        return out.view(-1, self.config.image_channels, self.config.image_size, self.config.image_size)

In [None]:
class Discriminator(nn.Module):
    r"""MLP-based Discriminator Network.
    
    Core Idea:
        Binary classifier that estimates $D(x) = P(x \text{ is real})$.
        Trained by minimizing binary cross-entropy on real vs. generated
        samples; its output provides the gradient signal to the generator.
    
    Mathematical Theory:
        $D: \mathbb{R}^{d_x} \to [0, 1]$ where
        $D(x) = \sigma(W_L \cdot \text{LReLU}(W_{L-1} \cdots \text{LReLU}(W_1 x)))$
        
        At optimum: $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$
    
    Architecture:
        - Progressive dimension reduction: $d_x \to 1024 \to 512 \to 256 \to 1$
        - Dropout(0.3) for regularization and to prevent discriminator dominance
        - No BatchNorm (can cause training instability in discriminator)
    
    Complexity:
        Time: O(d_x * h + h^2 * L) where h=hidden_dim, L=num_layers
        Space: O(sum of weight matrices)
    """
    
    def __init__(self, config: GANConfig) -> None:
        super().__init__()
        self.config = config
        
        self.net = nn.Sequential(
            self._block(config.image_dim, config.hidden_dim * 4),
            self._block(config.hidden_dim * 4, config.hidden_dim * 2),
            self._block(config.hidden_dim * 2, config.hidden_dim),
            nn.Linear(config.hidden_dim, 1),
            nn.Sigmoid(),
        )
        self._init_weights()
    
    def _block(self, in_features: int, out_features: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Dropout(0.3),
        )
    
    def _init_weights(self) -> None:
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0.0, 0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    
    def forward(self, x: Tensor) -> Tensor:
        x_flat = x.view(x.size(0), -1)
        return self.net(x_flat)

In [None]:
class GANTrainer:
    r"""Training orchestrator for GAN.
    
    Core Idea:
        Implements alternating optimization: fix G, update D; fix D, update G.
        Uses non-saturating generator loss for stable gradients.
    
    Mathematical Theory:
        Discriminator update (maximize):
        $\nabla_{\theta_D} \frac{1}{m} \sum_{i=1}^m [\log D(x^{(i)}) + \log(1 - D(G(z^{(i)})))]$
        
        Generator update (non-saturating form):
        $\nabla_{\theta_G} \frac{1}{m} \sum_{i=1}^m [-\log D(G(z^{(i)}))]$
        
        The non-saturating loss $-\log D(G(z))$ provides stronger gradients
        early in training compared to $\log(1 - D(G(z)))$ which saturates
        when $D(G(z)) \approx 0$.
    """
    
    def __init__(self, config: GANConfig) -> None:
        self.config = config
        self.device = torch.device(config.device)
        
        torch.manual_seed(config.seed)
        
        self.generator = Generator(config).to(self.device)
        self.discriminator = Discriminator(config).to(self.device)
        
        self.optimizer_g = torch.optim.Adam(
            self.generator.parameters(),
            lr=config.lr_generator,
            betas=(config.beta1, config.beta2)
        )
        self.optimizer_d = torch.optim.Adam(
            self.discriminator.parameters(),
            lr=config.lr_discriminator,
            betas=(config.beta1, config.beta2)
        )
        
        self.criterion = nn.BCELoss()
        self.fixed_noise = torch.randn(64, config.latent_dim, device=self.device)
        self.history: Dict[str, List[float]] = {
            "loss_d": [], "loss_g": [], "d_real": [], "d_fake": []
        }
    
    def _train_discriminator(self, real_images: Tensor) -> Tuple[float, float, float]:
        batch_size = real_images.size(0)
        real_labels = torch.full((batch_size, 1), self.config.label_smoothing, device=self.device)
        fake_labels = torch.zeros(batch_size, 1, device=self.device)
        
        self.optimizer_d.zero_grad()
        
        # Real batch: push D(x) toward the (smoothed) real target.
        output_real = self.discriminator(real_images)
        loss_real = self.criterion(output_real, real_labels)
        
        # Fake batch: detach so the D step sends no gradient into the generator.
        z = torch.randn(batch_size, self.config.latent_dim, device=self.device)
        fake_images = self.generator(z).detach()
        output_fake = self.discriminator(fake_images)
        loss_fake = self.criterion(output_fake, fake_labels)
        
        # Minimizing the summed BCE is equivalent to maximizing V(D, G) over D.
        loss_d = loss_real + loss_fake
        loss_d.backward()
        self.optimizer_d.step()
        
        return loss_d.item(), output_real.mean().item(), output_fake.mean().item()
    
    def _train_generator(self, batch_size: int) -> float:
        # Hard target of 1.0: with a smoothed target the loss is no longer -log D(G(z)).
        real_labels = torch.ones(batch_size, 1, device=self.device)
        
        self.optimizer_g.zero_grad()
        
        z = torch.randn(batch_size, self.config.latent_dim, device=self.device)
        fake_images = self.generator(z)
        output = self.discriminator(fake_images)
        
        loss_g = self.criterion(output, real_labels)
        loss_g.backward()
        self.optimizer_g.step()
        
        return loss_g.item()
    
    def train_epoch(self, dataloader: DataLoader) -> Dict[str, float]:
        self.generator.train()
        self.discriminator.train()
        
        epoch_loss_d, epoch_loss_g = 0.0, 0.0
        epoch_d_real, epoch_d_fake = 0.0, 0.0
        num_batches = len(dataloader)
        
        for real_images, _ in dataloader:
            real_images = real_images.to(self.device)
            batch_size = real_images.size(0)
            
            loss_d, d_real, d_fake = self._train_discriminator(real_images)
            loss_g = self._train_generator(batch_size)
            
            epoch_loss_d += loss_d
            epoch_loss_g += loss_g
            epoch_d_real += d_real
            epoch_d_fake += d_fake
        
        metrics = {
            "loss_d": epoch_loss_d / num_batches,
            "loss_g": epoch_loss_g / num_batches,
            "d_real": epoch_d_real / num_batches,
            "d_fake": epoch_d_fake / num_batches,
        }
        
        for key, value in metrics.items():
            self.history[key].append(value)
        
        return metrics
    
    @torch.no_grad()
    def generate_samples(self, num_samples: int = 64) -> Tensor:
        self.generator.eval()
        z = torch.randn(num_samples, self.config.latent_dim, device=self.device)
        return self.generator(z).cpu()

In [None]:
def create_dataloader(config: GANConfig) -> DataLoader:
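    # ToTensor scales pixels to [0, 1]; Normalize(0.5, 0.5) then maps them to [-1, 1],
    # matching the generator's tanh output range.
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ])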
    dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
    return DataLoader(dataset, batch_size=config.batch_size, shuffle=True, drop_last=True, num_workers=2)
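
With `drop_last=True`, the number of updates per epoch follows from integer division. A small arithmetic sketch, assuming the 60,000-image MNIST training split and the default `batch_size=64` and `num_epochs=50`; `updates_per_epoch` is a hypothetical helper, not a PyTorch function.

```python
def updates_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of batches a data loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size
    return -(-num_samples // batch_size)  # ceiling division


num_train = 60_000  # size of the MNIST training split
steps = updates_per_epoch(num_train, batch_size=64)
print(steps)                   # 937 batches per epoch
print(steps * 50)              # 46850 updates of each network over 50 epochs
print(num_train - steps * 64)  # 32 images left out of each epoch by drop_last
```

Dropping the final partial batch keeps every batch at the same size, which matters for the generator's BatchNorm statistics.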

In [None]:
def visualize_samples(samples: Tensor, title: str = "Generated Samples") -> None:
    grid = make_grid(samples, nrow=8, normalize=True, value_range=(-1, 1))
    plt.figure(figsize=(10, 10))
    plt.imshow(grid.permute(1, 2, 0).numpy())
    plt.title(title)
    plt.axis("off")
    plt.tight_layout()
    plt.show()

In [None]:
def plot_training_curves(history: Dict[str, List[float]]) -> None:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    axes[0].plot(history["loss_d"], label="Discriminator", alpha=0.8)
    axes[0].plot(history["loss_g"], label="Generator", alpha=0.8)
    axes[0].set_xlabel("Epoch")
    axes[0].set_ylabel("Loss")
    axes[0].set_title("Training Loss")
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    axes[1].plot(history["d_real"], label="D(real)", alpha=0.8)
    axes[1].plot(history["d_fake"], label="D(fake)", alpha=0.8)
    axes[1].axhline(y=0.5, color="r", linestyle="--", label="Equilibrium")
    axes[1].set_xlabel("Epoch")
    axes[1].set_ylabel("Discriminator Output")
    axes[1].set_title("Discriminator Confidence")
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
if __name__ == "__main__":
    config = GANConfig(num_epochs=50, batch_size=64)
    dataloader = create_dataloader(config)
    trainer = GANTrainer(config)
    
    print(f"Generator parameters: {sum(p.numel() for p in trainer.generator.parameters()):,}")
    print(f"Discriminator parameters: {sum(p.numel() for p in trainer.discriminator.parameters()):,}")
    print(f"Device: {config.device}")
    
    for epoch in range(config.num_epochs):
        metrics = trainer.train_epoch(dataloader)
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch [{epoch+1}/{config.num_epochs}] "
                  f"Loss_D: {metrics['loss_d']:.4f} Loss_G: {metrics['loss_g']:.4f} "
                  f"D(real): {metrics['d_real']:.3f} D(fake): {metrics['d_fake']:.3f}")
            samples = trainer.generate_samples(64)
            visualize_samples(samples, f"Epoch {epoch+1}")
    
    plot_training_curves(trainer.history)

## Summary

GAN training alternates between:
1. **Discriminator step**: Maximize $\log D(x) + \log(1 - D(G(z)))$ to improve real/fake classification
2. **Generator step**: Maximize $\log D(G(z))$ (non-saturating) to fool the discriminator
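
The reason for preferring the non-saturating generator objective shows up in the gradients with respect to the discriminator's logit. The sketch below writes $D(G(z)) = \sigma(a)$ and compares the two derivatives analytically; the logit values are arbitrary illustrative choices.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): the minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): the non-saturating loss."""
    return -(1.0 - sigmoid(a))


# a is the discriminator logit on a generated sample, so D(G(z)) = sigmoid(a).
for a in (-6.0, -2.0, 0.0, 2.0):
    d = sigmoid(a)
    print(f"D(G(z))={d:.4f}  saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

When the discriminator confidently rejects a sample ($D(G(z)) \approx 0$), the minimax gradient is close to zero while the non-saturating gradient is close to $-1$, so the generator still receives a usable signal early in training.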

Key implementation details:
- One-sided label smoothing (real-label target 0.9 instead of 1.0) discourages discriminator overconfidence
- Adam with $\beta_1=0.5$ reduces momentum, improving stability
- LeakyReLU keeps a nonzero gradient for negative pre-activations, avoiding dead units in both networks
- Dropout in discriminator prevents overfitting to training data
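
The effect of label smoothing on the discriminator's target can be seen from binary cross-entropy alone. This is a standard-library sketch with hypothetical helper names; it does not call PyTorch's `BCELoss`, though it computes the same per-sample quantity.

```python
import math


def bce(p, target):
    """Binary cross-entropy for a single prediction p in (0, 1)."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def best_prediction(target, steps=999):
    """Grid search for the prediction that minimizes BCE against `target`."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda p: bce(p, target))


print(best_prediction(0.9))             # 0.9
print(bce(0.999, 1.0) < bce(0.9, 1.0))  # True: hard labels reward extreme confidence
print(bce(0.999, 0.9) > bce(0.9, 0.9))  # True: smoothed labels penalize it
```

With a hard target the loss keeps falling as the output approaches 1, whereas a target of 0.9 places the minimum at 0.9 and penalizes more extreme outputs.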

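At the theoretical optimum, $D^*(x) = \frac{1}{2}$ wherever $p_{\text{data}}(x) > 0$, which corresponds to $p_g = p_{\text{data}}$. In practice, discriminator outputs near 0.5 are only a rough indicator of this, and with the real-label target smoothed to 0.9 the discriminator's optimal output at $p_g = p_{\text{data}}$ is $0.45$ rather than $0.5$.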
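
How the optimal discriminator flattens as the generator improves can be illustrated with two Gaussians. A minimal sketch, assuming a hypothetical 1-D setting with unit variances where only the generator's mean changes:

```python
import math


def normal_pdf(x, mean, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def optimal_d(x, mean_data, mean_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for two unit-variance Gaussians."""
    pd, pg = normal_pdf(x, mean_data), normal_pdf(x, mean_g)
    return pd / (pd + pg)


# Hypothetical setup: p_data = N(0, 1); the generator mean moves toward 0.
xs = [-1.0, 0.0, 1.0]
for mean_g in (2.0, 0.5, 0.0):
    values = ", ".join(f"{optimal_d(x, 0.0, mean_g):.3f}" for x in xs)
    print(f"mean_g={mean_g}: D*(x) at x=-1, 0, 1 -> {values}")
```

As the generator mean approaches the data mean, $D^*$ moves toward 0.5 at every point, and it equals 0.5 exactly once the two distributions coincide.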