# Comprehensive Tutorial on Wasserstein GANs (WGANs)

Wasserstein GANs (WGANs), introduced by Martin Arjovsky, Soumith Chintala, and Léon Bottou in 2017, are an improvement over the original GAN framework. WGANs address several training issues of standard GANs by using an approximation of the Wasserstein distance (also known as the Earth Mover's distance) as the training objective.

## Mathematical Foundations

A WGAN consists of two networks:

1. **Generator (G)**: This network takes a random noise vector $\mathbf{z}$ from a prior distribution $p_{\mathbf{z}}$ and maps it to the data space as $G(\mathbf{z}; \theta_G)$. The generator's objective is to produce data whose distribution $p_G$ resembles the true data distribution $p_{\text{data}}$.

2. **Critic (C)**: In WGANs, the discriminator is replaced by a critic that outputs a scalar score representing the "realness" of the data. Unlike the original GAN discriminator, the critic does not output probabilities.

The key idea is to minimize the Wasserstein distance between the real data distribution and the generated data distribution:
$$
W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x,y) \sim \gamma} [\| x - y \|]
$$
where $\Pi(p_{\text{data}}, p_G)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are $p_{\text{data}}$ and $p_G$ respectively.
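For intuition, the infimum over couplings has a simple closed form in one dimension: with two equally sized samples, the optimal coupling pairs the sorted values, so the empirical Wasserstein-1 distance is the mean absolute difference between the sorted samples. The sketch below relies on that special case only; the function name `wasserstein_1d` is illustrative and does not come from any library.

```python
def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equally sized 1-D samples.

    In 1-D the optimal transport plan matches the i-th smallest x to the
    i-th smallest y, so no explicit search over couplings is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


real = [0.0, 1.0, 2.0]
fake = [5.0, 3.0, 4.0]             # the same points shifted by 3
print(wasserstein_1d(real, fake))  # 3.0
```

Shifting every point by 3 costs exactly 3 units of "earth moving" per point, which is what the function reports. In higher dimensions no such shortcut exists, which is one reason WGANs estimate the distance with a trained critic instead.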

## Training Procedure

The training of WGANs involves the following steps, typically repeated iteratively:

1. **Sample real data** $(\mathbf{x} \sim p_{\text{data}})$.
2. **Sample noise** $(\mathbf{z} \sim p_{\mathbf{z}})$ and generate fake data $(\hat{\mathbf{x}} = G(\mathbf{z}))$.
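3. **Update Critic** (typically several times per generator update):
   - Compute critic loss:
     $$
     L_C = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[C(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(G(\mathbf{z}))]
     $$
   - Perform a gradient descent step on $L_C$ to update $\theta_C$.
   - Enforce the Lipschitz constraint on the critic. The original WGAN does this by clipping each weight in $\theta_C$ to a small interval $[-c, c]$ after the update.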
4. **Update Generator**:
   - Compute generator loss:
     $$
     L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(G(\mathbf{z}))]
     $$
   - Perform a gradient descent step on $L_G$ to update $\theta_G$.
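The steps above can be sketched on a deliberately tiny 1-D problem where every gradient is available by hand. Everything in this example is an assumption made for illustration: the real data is $N(3, 1)$, the critic is the linear function $C(x) = w x$, the generator is the shift $G(z) = z + \theta$, and the Lipschitz constraint is enforced by clipping $w$ to $[-1, 1]$. The hyperparameter values are arbitrary choices for the toy, not recommendations.

```python
import random

random.seed(0)

REAL_MEAN = 3.0      # toy target: real data ~ N(3, 1)
CLIP = 1.0           # weight-clipping bound c; keeps |w| <= 1 (1-Lipschitz)
N_CRITIC = 5         # critic steps per generator step
LR_C, LR_G = 0.2, 0.02
BATCH = 64


def mean(values):
    return sum(values) / len(values)


def train(steps=600):
    w = 0.0          # critic parameter: C(x) = w * x
    theta = 0.0      # generator parameter: G(z) = z + theta
    for _ in range(steps):
        for _ in range(N_CRITIC):
            real = [random.gauss(REAL_MEAN, 1.0) for _ in range(BATCH)]
            fake = [random.gauss(0.0, 1.0) + theta for _ in range(BATCH)]
            # L_C = -mean(C(real)) + mean(C(fake)), so dL_C/dw is:
            grad_w = -mean(real) + mean(fake)
            w -= LR_C * grad_w
            w = max(-CLIP, min(CLIP, w))   # enforce the Lipschitz constraint
        # L_G = -mean(C(G(z))) = -w * mean(z + theta), so dL_G/dtheta = -w
        grad_theta = -w
        theta -= LR_G * grad_theta
    return w, theta


w, theta = train()
print(f"critic weight w = {w:.3f}, generator shift theta = {theta:.3f}")
```

In this toy the generator shift drifts toward the real mean and then oscillates around it, because the critic's sign flips whenever the generator overshoots. A real WGAN replaces the hand-written gradients with automatic differentiation and the scalar parameters with neural networks.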

## Gradients of the WGAN Training Process

### Critic Training

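By the Kantorovich-Rubinstein duality, the Wasserstein distance can be written as a supremum over 1-Lipschitz functions:
$$
W(p_{\text{data}}, p_G) = \sup_{\|C\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[C(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(G(\mathbf{z}))]
$$
The critic approximates this supremum, so it maximizes the gap between its mean score on real data and its mean score on generated data. Equivalently, it minimizes the following loss: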
$$
L_C = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[C(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(G(\mathbf{z}))]
$$

To update the critic, we compute the gradient of $L_C$ with respect to the critic's parameters $\theta_C$:
$$
\nabla_{\theta_C} L_C = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\nabla_{\theta_C} C(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [\nabla_{\theta_C} C(G(\mathbf{z}))]
$$

### Generator Training

The generator aims to maximize the critic's score on the generated data, which corresponds to minimizing the following loss:
$$
L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(G(\mathbf{z}))]
$$

To update the generator, we compute the gradient of $L_G$ with respect to the generator's parameters $\theta_G$:
$$
\nabla_{\theta_G} L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [\nabla_{\theta_G} C(G(\mathbf{z}))]
$$
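Expanding this gradient with the chain rule makes the role of the critic explicit. The Jacobian notation $J_{\theta_G} G(\mathbf{z})$ is introduced here for illustration and is not used elsewhere in this tutorial:
$$
\nabla_{\theta_G} L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} \left[ \left( J_{\theta_G} G(\mathbf{z}) \right)^{\top} \nabla_{\mathbf{x}} C(\mathbf{x}) \Big|_{\mathbf{x} = G(\mathbf{z})} \right]
$$
Here $J_{\theta_G} G(\mathbf{z})$ is the Jacobian of the generator output with respect to $\theta_G$. The generator receives its learning signal only through the critic's input gradient $\nabla_{\mathbf{x}} C$, which is why keeping that gradient well behaved matters for training.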

### Gradient Penalty

To enforce the Lipschitz constraint on the critic without clipping weights, WGAN-GP adds a gradient penalty term to the critic loss:
$$
L_{\text{GP}} = \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}} \left[ (\|\nabla_{\hat{\mathbf{x}}} C(\hat{\mathbf{x}})\|_2 - 1)^2 \right]
$$
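where $\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1 - \epsilon) G(\mathbf{z})$ with $\epsilon \sim U[0, 1]$, that is, $\hat{\mathbf{x}}$ is sampled uniformly along straight lines between pairs of points from the real data and the generated data distributions. The penalty pushes the norm of the critic's input gradient toward 1 at these points.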
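A minimal sketch of the penalty estimate follows, under loud assumptions: the data is 1-D, and the critic is the toy function $C(x) = a x^2 + b x$, chosen only because its input gradient $2 a x + b$ can be written by hand. The helper names `critic_input_grad` and `gradient_penalty` are hypothetical. With a neural-network critic the input gradient would come from automatic differentiation instead.

```python
import random

random.seed(0)


def critic_input_grad(x, a, b):
    """dC/dx for the toy critic C(x) = a * x**2 + b * x."""
    return 2.0 * a * x + b


def gradient_penalty(real, fake, a, b):
    """Monte Carlo estimate of E[(|dC/dx_hat| - 1)^2] on interpolates."""
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = random.random()                      # eps ~ U[0, 1)
        x_hat = eps * x + (1.0 - eps) * x_tilde    # point on the segment
        grad_norm = abs(critic_input_grad(x_hat, a, b))  # 1-D: norm = |.|
        total += (grad_norm - 1.0) ** 2
    return total / len(real)


real = [random.gauss(3.0, 1.0) for _ in range(256)]
fake = [random.gauss(0.0, 1.0) for _ in range(256)]

# A linear critic with slope 1 has gradient norm 1 everywhere: zero penalty.
print(gradient_penalty(real, fake, a=0.0, b=1.0))   # 0.0
# A steeper critic violates the constraint and is penalized.
print(gradient_penalty(real, fake, a=0.0, b=3.0))   # 4.0
```

Note that the penalty is two-sided: a critic whose gradient norm is below 1 is penalized as well, not only one whose gradient is too steep.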

### Training Procedure with Gradient Penalty

The training procedure of WGANs with the gradient penalty term is as follows:

1. **Critic Update**:
    - Sample real data $(\mathbf{x} \sim p_{\text{data}})$.
    - Sample noise $(\mathbf{z} \sim p_{\mathbf{z}})$ and generate fake data $(\tilde{\mathbf{x}} = G(\mathbf{z}))$.
    - Sample $\epsilon \sim U[0, 1]$ and form the interpolate $\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1 - \epsilon) \tilde{\mathbf{x}}$, a point drawn uniformly along the straight line between $\mathbf{x}$ and $\tilde{\mathbf{x}}$.
    - Compute the critic loss with the gradient penalty, where $\lambda > 0$ is the penalty weight:
      $$
      L_C = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [C(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [C(G(\mathbf{z}))] + \lambda \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}} \left[ (\|\nabla_{\hat{\mathbf{x}}} C(\hat{\mathbf{x}})\|_2 - 1)^2 \right]
      $$
    - Compute gradients:
      $$
      \nabla_{\theta_C} L_C = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} [\nabla_{\theta_C} C(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [\nabla_{\theta_C} C(G(\mathbf{z}))] + \lambda \mathbb{E}_{\hat{\mathbf{x}} \sim p_{\hat{\mathbf{x}}}} \left[ 2 (\|\nabla_{\hat{\mathbf{x}}} C(\hat{\mathbf{x}})\|_2 - 1) \nabla_{\theta_C} \|\nabla_{\hat{\mathbf{x}}} C(\hat{\mathbf{x}})\|_2 \right]
      $$
    - Update $\theta_C$ using gradient descent.

2. **Generator Update**:
    - Sample noise $(\mathbf{z} \sim p_{\mathbf{z}})$.
    - Generate fake data $(\tilde{\mathbf{x}} = G(\mathbf{z}))$.
    - Compute the generator loss:
      $$
      L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[C(\tilde{\mathbf{x}})]
      $$
    - Compute gradients:
      $$
      \nabla_{\theta_G} L_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}} [\nabla_{\theta_G} C(\tilde{\mathbf{x}})]
      $$
    - Update $\theta_G$ using gradient descent.
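One critic update followed by one generator update can be traced by hand on a toy problem. The sketch below assumes 1-D data, a hypothetical critic $C(x) = a x^2 + b x$ with $\theta_C = (a, b)$, and a shift generator $G(z) = z + \theta$, so that each term of the gradient formulas above has a closed form. The learning rates and starting values are arbitrary, and a practical implementation would rely on automatic differentiation with second-order gradient support rather than these hand-derived expressions.

```python
import random

random.seed(0)
LAMBDA = 10.0   # penalty weight; 10 is the value used in the WGAN-GP paper


def mean(values):
    return sum(values) / len(values)


def critic(x, a, b):
    """Toy critic C(x) = a * x**2 + b * x with parameters theta_C = (a, b)."""
    return a * x * x + b * x


def critic_loss_and_grads(real, fake, eps, a, b):
    """Return L_C and its analytic gradient with respect to (a, b)."""
    x_hat = [e * x + (1.0 - e) * xt for e, x, xt in zip(eps, real, fake)]
    g = [2.0 * a * xh + b for xh in x_hat]           # dC/dx at each x_hat
    loss = (-mean([critic(x, a, b) for x in real])
            + mean([critic(xt, a, b) for xt in fake])
            + LAMBDA * mean([(abs(gi) - 1.0) ** 2 for gi in g]))
    # d|g|/d(param) = sign(g) * dg/d(param); dg/da = 2 * x_hat, dg/db = 1
    sign = [1.0 if gi >= 0 else -1.0 for gi in g]
    outer = [2.0 * (abs(gi) - 1.0) * s for gi, s in zip(g, sign)]
    grad_a = (-mean([x * x for x in real]) + mean([xt * xt for xt in fake])
              + LAMBDA * mean([o * 2.0 * xh for o, xh in zip(outer, x_hat)]))
    grad_b = -mean(real) + mean(fake) + LAMBDA * mean(outer)
    return loss, grad_a, grad_b


# One critic update followed by one generator update.
theta, a, b = 0.0, 0.0, 0.5          # generator G(z) = z + theta
real = [random.gauss(3.0, 1.0) for _ in range(128)]
fake = [random.gauss(0.0, 1.0) + theta for _ in range(128)]
eps = [random.random() for _ in range(128)]

loss_before, grad_a, grad_b = critic_loss_and_grads(real, fake, eps, a, b)
a, b = a - 1e-3 * grad_a, b - 1e-3 * grad_b
loss_after, _, _ = critic_loss_and_grads(real, fake, eps, a, b)

z = [random.gauss(0.0, 1.0) for _ in range(128)]
# L_G = -mean(C(z + theta)), so dL_G/dtheta = -mean(2 * a * (z + theta) + b)
grad_theta = -mean([2.0 * a * (zi + theta) + b for zi in z])
theta -= 1e-2 * grad_theta

print(f"critic loss: {loss_before:.3f} -> {loss_after:.3f}")
print(f"generator shift after one step: {theta:.4f}")
```

The sketch stops after a single pair of updates on purpose: it is meant to show how each term of the loss and its gradient is assembled, not to demonstrate convergence.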

## Key Innovations

1. **Wasserstein Distance**: The Wasserstein distance varies smoothly as the generated distribution moves toward the real one, even when the two distributions do not overlap, so it provides a more meaningful loss metric than the Jensen-Shannon divergence underlying standard GANs.
2. **Lipschitz Constraint**: Enforcing the Lipschitz constraint on the critic keeps its gradients bounded, which is what allows the critic's objective to estimate the Wasserstein distance and leads to more stable training.
3. **Gradient Penalty**: The gradient penalty term provides a practical way to enforce the Lipschitz constraint without clipping weights, leading to better performance.

## Advantages of WGANs

1. **Improved Training Stability**: WGANs are more stable to train and less sensitive to hyperparameter choices compared to standard GANs.
2. **Meaningful Loss Metric**: The Wasserstein distance offers a meaningful loss metric that correlates better with the quality of generated samples.
3. **Reduced Mode Collapse**: WGANs are less prone to mode collapse, where the generator produces only a limited variety of samples.

## Drawbacks of WGANs

1. **Increased Computational Cost**: The gradient penalty term and the need for more critic updates per generator update increase the computational cost.
2. **Sensitive to Critic Capacity**: The performance of WGANs can be sensitive to the capacity and architecture of the critic network.

## Conclusion

Wasserstein GANs (WGANs) have significantly improved the stability and performance of GAN training by using the Wasserstein distance and enforcing the Lipschitz constraint on the critic. Understanding the mathematical foundations and training dynamics of WGANs, including the derivatives of the training process and the gradient penalty term, is crucial for leveraging their full potential and addressing their limitations.
