# Advanced Topics: GAE, A2C, and Beyond

## Overview
This notebook covers advanced policy gradient techniques including Generalized Advantage Estimation (GAE) and Advantage Actor-Critic (A2C).

### Learning Objectives
1. Understand Generalized Advantage Estimation (GAE)
2. Learn A2C algorithm with batch updates
3. Understand the bias-variance trade-off in advantage estimation
4. Explore practical training techniques

## 1. Generalized Advantage Estimation (GAE)

### Problem: Choosing the Right Advantage Estimator

Different advantage estimators have different bias-variance properties:

**1-step TD:**
$$A_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$$
- Low variance, high bias

**n-step TD:**
$$A_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) - V(s_t)$$
- Medium variance, medium bias

**Monte Carlo (∞-step):**
$$A_t^{(\infty)} = G_t - V(s_t), \quad G_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$$
- High variance, low bias
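
The three estimators above differ only in how many real rewards are used before bootstrapping from the value function. The following is a minimal sketch, not a reference implementation: the helper `n_step_advantage` and the reward and value numbers are made up for illustration, and the trajectory is assumed to end in a terminal state with value 0.

```python
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """n-step advantage at time t; `values` has one more entry than `rewards`."""
    T = len(rewards)
    n = min(n, T - t)  # truncate the lookahead at the end of the trajectory
    discounted = sum(gamma**l * rewards[t + l] for l in range(n))
    return discounted + gamma**n * values[t + n] - values[t]

# Made-up trajectory; the last value entry is the terminal state (V = 0).
rewards = np.array([1.0, 1.0, 1.0, 0.0, 1.0])
values = np.array([2.5, 2.0, 1.5, 0.5, 1.5, 0.0])

one_step = n_step_advantage(rewards, values, t=0, n=1)
three_step = n_step_advantage(rewards, values, t=0, n=3)
monte_carlo = n_step_advantage(rewards, values, t=0, n=len(rewards))
print(one_step, three_step, monte_carlo)
```

With `n=1` the estimate leans almost entirely on `values`, while with `n=len(rewards)` only `values[t]` enters as a baseline and the rest comes from sampled rewards.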

### Solution: GAE

Generalized Advantage Estimation combines all n-step estimators into a single exponentially weighted average, which simplifies to a discounted sum of TD errors:

$$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error at step $t$.
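
It may help to see why a sum of TD errors amounts to a weighted average of the n-step estimators. Writing $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ for the TD error at step $t$, the value terms in the n-step estimator telescope:

$$A_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}$$

Weighting the n-step estimators by $(1-\lambda)\lambda^{n-1}$ and swapping the order of summation gives

$$(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} A_t^{(n)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} \, (1-\lambda)\sum_{n=l+1}^{\infty} \lambda^{n-1} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

since $(1-\lambda)\sum_{n=l+1}^{\infty} \lambda^{n-1} = \lambda^l$ for $0 \le \lambda < 1$.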

### Efficient Computation

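The sum above satisfies the recursion

$$A_t = \delta_t + (\gamma\lambda) A_{t+1}$$

Starting from zero beyond the last timestep and iterating backwards, all advantages of a length-$T$ trajectory can be computed in a single $O(T)$ pass.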
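
As a quick sanity check, the recursion can be compared against the explicit sum on a short trajectory. This is only a sketch: `gae_direct` and `gae_recursive` are hypothetical helper names, the TD errors are made up, and the infinite sum is truncated at the end of the trajectory.

```python
import numpy as np

def gae_direct(deltas, gamma, lam):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, truncated at the trajectory end."""
    T = len(deltas)
    return np.array([
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ])

def gae_recursive(deltas, gamma, lam):
    """A_t = delta_t + gamma*lam*A_{t+1}, starting from 0 and going backwards."""
    advantages = np.zeros(len(deltas))
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

deltas = np.array([0.5, -0.2, 0.1, 0.3])  # made-up TD errors
direct = gae_direct(deltas, gamma=0.99, lam=0.95)
recursive = gae_recursive(deltas, gamma=0.99, lam=0.95)
print(np.allclose(direct, recursive))
```

The direct version costs $O(T^2)$ operations while the recursive one costs $O(T)$, which is why the backward pass is the usual choice.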

## 2. GAE Parameters

### Lambda Parameter (λ)

- **λ = 0**: Uses only 1-step TD (low variance, high bias)
- **λ = 1**: Uses full Monte Carlo returns (high variance, low bias)
- **0 < λ < 1**: Interpolates between the two (typical: λ = 0.95)

### Gamma Parameter (γ)

- **γ = 0**: Only immediate reward matters
- **γ = 1**: All future rewards are weighted equally (only safe in episodic tasks, where returns stay finite)
- **Typical**: γ = 0.99 or γ = 0.999
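
To get a feel for these values, a common rule of thumb treats $1/(1-\gamma)$ as the effective horizon, i.e. roughly how many future steps contribute meaningfully to the return. The sketch below assumes a constant reward of 1 per step; `discounted_return` is a hypothetical helper, not part of the notebook's code.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G_0 = sum_l gamma^l * r_l for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = np.ones(1000)  # constant reward of 1 for 1000 steps
for gamma in [0.0, 0.9, 0.99, 0.999]:
    horizon = 1.0 / (1.0 - gamma)  # rule-of-thumb effective horizon
    g0 = discounted_return(rewards, gamma)
    print(f"gamma={gamma}: G_0={g0:.2f}, effective horizon ~ {horizon:.0f}")
```

For a constant reward of 1, the return approaches the effective horizon as long as the trajectory is much longer than that horizon.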

### Effect of λ on Advantage Estimates

```
λ = 0.0:  A_t = δ_t                           (1-step)
λ = 0.5:  A_t = δ_t + 0.5γδ_{t+1} + ...      (mixed)
λ = 1.0:  A_t = G_t - V(s_t)                 (MC)
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def compute_gae(rewards, values, gamma=0.99, lambda_=0.95):
    """
    Compute Generalized Advantage Estimation.
    
    Args:
        rewards: Array of rewards, length T
        values: Array of value estimates, length T + 1; the last entry
            bootstraps the state after the final step (use 0 if terminal)
        gamma: Discount factor
        lambda_: GAE lambda parameter
    
    Returns:
        advantages: Array of advantage estimates
        returns: Array of returns (advantages + values)
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    
    for t in reversed(range(len(rewards))):
        # TD error
        delta = rewards[t] + gamma * values[t+1] - values[t]
        
        # GAE
        gae = delta + gamma * lambda_ * gae
        advantages[t] = gae
    
    returns = advantages + values[:-1]
    return advantages, returns

# Example
rewards = np.array([1.0, 1.0, 1.0, 0.0, 1.0])
values = np.array([2.5, 2.0, 1.5, 0.5, 1.5, 1.0])  # V(s_0)..V(s_5); last entry is the bootstrap value

# Compare different lambda values
lambdas = [0.0, 0.5, 0.95, 1.0]
fig, axes = plt.subplots(1, len(lambdas), figsize=(15, 3))

for idx, lambda_ in enumerate(lambdas):
    advantages, returns = compute_gae(rewards, values, gamma=0.99, lambda_=lambda_)
    
    ax = axes[idx]
    ax.bar(range(len(advantages)), advantages, alpha=0.7, label='Advantages')
    ax.set_title(f'λ = {lambda_}')
    ax.set_xlabel('Timestep')
    ax.set_ylabel('Advantage')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("GAE with different lambda values:")
for lambda_ in lambdas:
    advantages, _ = compute_gae(rewards, values, gamma=0.99, lambda_=lambda_)
    print(f"λ = {lambda_}: advantages = {advantages}")

## 3. A2C Algorithm

### Advantage Actor-Critic (A2C)

A2C extends Actor-Critic with:
1. **Batch updates**: Collect multiple trajectories before updating
2. **GAE**: Use Generalized Advantage Estimation (the original formulation uses n-step returns; GAE is a common substitute)
3. **Parallel environments**: Collect data from multiple environments simultaneously
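
With several environments stepping in parallel, episodes can end in the middle of a rollout, so the backward GAE pass must not leak information across episode boundaries. The following is a sketch under assumptions rather than a definitive implementation: `compute_gae_batch`, the `(T, N)` array layout, and the `dones` mask are conventions chosen here for illustration.

```python
import numpy as np

def compute_gae_batch(rewards, values, dones, last_values, gamma=0.99, lambda_=0.95):
    """
    rewards, values, dones: arrays of shape (T, N) for T steps in N environments.
    dones[t, i] is 1.0 if the episode in environment i ended at step t.
    last_values: shape (N,), value estimates for the states after the last step.
    """
    T, N = rewards.shape
    advantages = np.zeros((T, N))
    gae = np.zeros(N)
    next_values = last_values
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # No bootstrapping and no advantage carry-over across an episode boundary.
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        gae = delta + gamma * lambda_ * not_done * gae
        advantages[t] = gae
        next_values = values[t]
    return advantages, advantages + values

# Toy rollout: 3 steps, 2 environments; environment 1 terminates at step 1.
rewards = np.ones((3, 2))
values = np.zeros((3, 2))
dones = np.zeros((3, 2))
dones[1, 1] = 1.0
advantages, returns = compute_gae_batch(
    rewards, values, dones, last_values=np.zeros(2), gamma=1.0, lambda_=1.0
)
print(advantages)
```

In the toy rollout, the advantage at step 1 of environment 1 counts only its own reward, because the reward at step 2 belongs to a new episode.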

### Algorithm

```
Initialize actor π_θ and critic V_φ
for iteration = 1 to num_iterations:
    # Collect trajectories from N parallel environments
    for env_id = 1 to N:
        Collect trajectory from environment
    
    # Compute advantages using GAE
    advantages ← compute_gae(trajectories)
    
    # Batch update (A2C is on-policy, so num_epochs is typically 1)
    for epoch = 1 to num_epochs:
        # Critic update
        L_critic = MSE(V_φ(s), returns)
        φ ← φ - β∇_φ L_critic
        
        # Actor update (gradient descent on the loss = ascent on the objective)
        L_actor = -mean(log π_θ(a|s) * advantages)
        θ ← θ - α∇_θ L_actor
```
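
The update step of the pseudocode above might look as follows in PyTorch. This is a minimal sketch, not a full training loop: the network sizes, learning rates, and the randomly generated batch are all assumptions standing in for real rollouts and GAE outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, n_actions, batch_size = 4, 2, 64  # hypothetical sizes

actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Stand-ins for a collected batch; in practice these come from rollouts and GAE.
states = torch.randn(batch_size, obs_dim)
actions = torch.randint(0, n_actions, (batch_size,))
advantages = torch.randn(batch_size)
returns = torch.randn(batch_size)

# Critic update: regress V(s) toward the return targets.
critic_loss = nn.functional.mse_loss(critic(states).squeeze(-1), returns)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor update: minimizing -log pi(a|s) * A ascends the policy-gradient objective.
dist = torch.distributions.Categorical(logits=actor(states))
actor_loss = -(dist.log_prob(actions) * advantages).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

print(f"critic loss: {critic_loss.item():.4f}, actor loss: {actor_loss.item():.4f}")
```

Separate optimizers are used here so the actor and critic can have different learning rates, matching the separate step sizes α and β in the pseudocode; a shared network with a combined loss is another common design.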

## 4. Practical Training Techniques

### 1. Advantage Normalization

Normalize advantages to have zero mean and unit variance:

$$A_{norm} = \frac{A - \text{mean}(A)}{\text{std}(A) + \epsilon}$$

**Benefits:**
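- Stabilizes training by keeping the scale of the policy gradient consistent across batches
- Makes the choice of learning rate less sensitive to reward scale
- Centers the batch so that it contains both positive and negative advantages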

### 2. Return Normalization

Normalize returns for value function training:

$$G_{norm} = \frac{G - \text{mean}(G)}{\text{std}(G) + \epsilon}$$
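
One practical detail is that a critic trained on normalized targets predicts in normalized units, so the statistics need to be kept to map predictions back. The sketch below assumes per-batch statistics; `normalize_returns` is a hypothetical helper and the return values are made up.

```python
import torch

def normalize_returns(returns, epsilon=1e-8):
    """Return normalized targets together with the statistics used."""
    mean, std = returns.mean(), returns.std()
    return (returns - mean) / (std + epsilon), mean, std

returns = torch.tensor([10.0, 20.0, 30.0, 40.0])  # made-up returns
targets, mean, std = normalize_returns(returns)

# Map normalized values back to the original reward scale.
recovered = targets * (std + 1e-8) + mean
print(targets, recovered)
```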

### 3. Gradient Clipping

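Rescale the gradient whenever its norm exceeds a threshold, to prevent overly large updates:

$$\nabla \leftarrow \nabla \cdot \min\left(1, \frac{c}{\lVert \nabla \rVert}\right)$$

where $c$ is the maximum allowed gradient norm (`max_norm`).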
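
In PyTorch, clipping by global norm is available as `torch.nn.utils.clip_grad_norm_`. The example below is a small sketch with a toy linear model and a deliberately inflated loss; the threshold of 0.5 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
loss = (model(torch.randn(8, 4)) * 1000.0).pow(2).mean()  # deliberately large loss
loss.backward()

max_norm = 0.5
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.4f}")
```

The call belongs between `loss.backward()` and `optimizer.step()`, so that the optimizer sees the clipped gradients.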

### 4. Entropy Regularization

Encourage exploration by adding entropy bonus:

$$L_{total} = L_{policy} + L_{value} - \beta H(\pi)$$

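where $H(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)]$ is the policy entropy and $\beta$ is the entropy coefficient (not to be confused with the critic learning rate $\beta$ in the A2C pseudocode).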
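
For a discrete policy, the entropy can be read directly off `torch.distributions.Categorical`. The two distributions below are made-up examples showing the extremes: a uniform policy, which attains the maximum entropy $\log 4$ for four actions, and a nearly deterministic one.

```python
import math
import torch
from torch.distributions import Categorical

uniform = Categorical(probs=torch.tensor([0.25, 0.25, 0.25, 0.25]))
peaked = Categorical(probs=torch.tensor([0.97, 0.01, 0.01, 0.01]))

print(uniform.entropy().item(), math.log(4))  # uniform policy: maximum entropy
print(peaked.entropy().item())                # near-deterministic policy: small entropy
```

Subtracting $\beta H(\pi)$ from the loss therefore penalizes policies that collapse onto a single action too early.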

In [None]:
import torch
import torch.nn as nn

def normalize_advantages(advantages, epsilon=1e-8):
    """
    Normalize advantages to have zero mean and unit variance.
    
    Args:
        advantages: Tensor of advantages
        epsilon: Small constant for numerical stability
    
    Returns:
        Normalized advantages
    """
    return (advantages - advantages.mean()) / (advantages.std() + epsilon)

def compute_policy_loss(log_probs, advantages, entropy, entropy_coeff=0.01):
    """
    Compute policy loss with entropy regularization.
    
    Args:
        log_probs: Log probabilities of actions
        advantages: Advantage estimates
        entropy: Policy entropy
        entropy_coeff: Entropy regularization coefficient
    
    Returns:
        Policy loss
    """
    policy_loss = -(log_probs * advantages.detach()).mean()
    entropy_loss = -entropy_coeff * entropy.mean()
    return policy_loss + entropy_loss

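# Example: a random 4-action categorical policy, so log-probs and entropy are valid
logits = torch.randn(32, 4, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
log_probs = dist.log_prob(actions)  # shape (32,), always <= 0
entropy = dist.entropy()            # shape (32,), always >= 0
advantages = torch.randn(32)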

# Normalize advantages
advantages_norm = normalize_advantages(advantages)

# Compute loss
loss = compute_policy_loss(log_probs, advantages_norm, entropy, entropy_coeff=0.01)

print(f"Original advantages - mean: {advantages.mean():.4f}, std: {advantages.std():.4f}")
print(f"Normalized advantages - mean: {advantages_norm.mean():.4f}, std: {advantages_norm.std():.4f}")
print(f"Policy loss: {loss.item():.4f}")

## 5. Comparison of Methods

### REINFORCE vs Actor-Critic vs A2C

| Aspect | REINFORCE | Actor-Critic | A2C |
|--------|-----------|--------------|-----|
| **Advantage Estimator** | Full return | 1-step TD | GAE |
| **Variance** | High | Low | Tunable via λ |
| **Bias** | Low | High | Tunable via λ |
| **Update Frequency** | Per episode | Per step | Per batch |
| **Parallel Envs** | No | No | Yes |
| **Sample Efficiency** | Low | Medium | High |
| **Convergence Speed** | Slow | Medium | Fast |
| **Complexity** | Simple | Medium | Complex |

### When to Use Each

- **REINFORCE**: Educational purposes, simple environments
- **Actor-Critic**: Online, per-step learning in a single environment
- **A2C**: When parallel environments are available and stable batch updates matter

## 6. Summary

### Key Takeaways
1. **GAE**: Provides flexible bias-variance trade-off via λ parameter
2. **A2C**: Combines GAE with batch updates and parallel environments
3. **Practical techniques**: Normalization, clipping, entropy regularization
4. **Trade-offs**: Variance vs bias, sample efficiency vs convergence speed

### Next Steps
- Implement trust region methods (PPO, TRPO)
- Explore off-policy methods (SAC, TD3)
- Study model-based RL approaches