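# Actor-Critic and Generalized Advantage Estimation (GAE)

This tutorial walks through the Actor-Critic architecture and the mathematics behind GAE.

## Contents
1. [From REINFORCE to Actor-Critic](#1-from-reinforce-to-actor-critic)
2. [TD Error and Advantage Estimation](#2-td-error-and-advantage-estimation)
3. [Generalized Advantage Estimation (GAE)](#3-generalized-advantage-estimation-gae)
4. [A2C Implementation](#4-a2c-implementation)
5. [Exercises](#5-exercises)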

In [None]:
import sys
sys.path.insert(0, '..')

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(42)
np.random.seed(42)

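## 1. From REINFORCE to Actor-Critic

### 1.1 Limitations of REINFORCE

**Problem 1: High variance**
- The Monte Carlo (MC) return $G_t$ carries the randomness of the entire remaining trajectory
- The variance grows with trajectory length

**Problem 2: Updates must wait until the episode ends**
- No online learning
- Inefficient for long episodes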
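
The variance claim can be checked numerically. The sketch below is a toy setup, not part of the original notebook: it assumes every reward is drawn i.i.d. from $\mathcal{N}(1, 1)$, in which case the variance of the discounted return is $\sum_k \gamma^{2k}$ and therefore grows with the horizon. The helper names `mc_return` and `return_variance` are hypothetical.

```python
import numpy as np

def mc_return(rewards, gamma=0.99):
    """Discounted return G_0 = sum_k gamma^k * r_k for one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards, dtype=float)))

def return_variance(horizon, n_trajectories=5000, gamma=0.99, seed=0):
    """Empirical variance of G_0 when each reward is i.i.d. N(1, 1) (toy assumption)."""
    rng = np.random.default_rng(seed)
    rewards = rng.normal(loc=1.0, scale=1.0, size=(n_trajectories, horizon))
    returns = np.array([mc_return(r, gamma) for r in rewards])
    return returns.var()

# Longer trajectories accumulate more reward noise in G_0.
for horizon in (1, 10, 100):
    print(f"horizon={horizon:3d}  Var[G_0] ~ {return_variance(horizon):.2f}")
```

Real environments add policy and transition randomness on top of reward noise, so this toy model likely understates the effect.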

### 1.2 The Actor-Critic Solution

**Core idea**: replace the MC return with a learned value function $V(s)$

- **Actor**: the policy network $\pi_\theta(a|s)$
- **Critic**: the value network $V_\phi(s)$

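## 2. TD Error and Advantage Estimation

### 2.1 Temporal-Difference (TD) Error

**Definition of the TD error**:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

**Intuition**:
- $r_t + \gamma V(s_{t+1})$: the reward actually received plus the estimated value of the next state
- $V(s_t)$: the estimated value of the current state
- $\delta_t > 0$: the outcome was better than expected
- $\delta_t < 0$: the outcome was worse than expected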

In [None]:
def compute_td_error(reward, value, next_value, done, gamma=0.99):
    """
    Compute the one-step TD error.
    
    δ_t = r_t + γ(1-done)V(s_{t+1}) - V(s_t)
    """
    if done:
        # Terminal transition: there is no next state to bootstrap from.
        return reward - value
    return reward + gamma * next_value - value

# Demo
print("Case 1: reward 1, V(s)=5, V(s')=5")
delta1 = compute_td_error(1, 5, 5, False)
print(f"TD error: {delta1:.4f} (slightly better than expected)")

print("\nCase 2: reward 10, V(s)=5, V(s')=5")
delta2 = compute_td_error(10, 5, 5, False)
print(f"TD error: {delta2:.4f} (much better than expected)")

print("\nCase 3: reward -5, V(s)=5, V(s')=5")
delta3 = compute_td_error(-5, 5, 5, False)
print(f"TD error: {delta3:.4f} (worse than expected)")

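### 2.2 TD Error as an Advantage Estimate

**Key insight**: the TD error is an unbiased estimate of the advantage function when the critic is exact, and a biased one otherwise.

$$\mathbb{E}[\delta_t | s_t, a_t] = \mathbb{E}[r_t + \gamma V(s_{t+1}) | s_t, a_t] - V(s_t)$$

If $V = V^\pi$ (the true value function):
$$\mathbb{E}[\delta_t | s_t, a_t] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$$

In practice $V_\phi$ only approximates $V^\pi$, so $\delta_t$ is biased. Its variance is low, however, because it depends on a single transition.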
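
The identity can be verified on a small tabular problem. The following is a minimal sketch under stated assumptions: the MDP (`P`, `R`) and the policy `pi` are randomly generated and purely hypothetical, rewards are deterministic given $(s, a)$, and $V^\pi$ is obtained exactly by solving the Bellman equation. Averaging sampled TD errors for a fixed $(s, a)$ should then approach $A^\pi(s, a)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Hypothetical random MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # pi[s, a]

# Exact V^pi from the Bellman equation V = r_pi + gamma * P_pi @ V.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = np.sum(pi * R, axis=1)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Exact advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s).
Q = R + gamma * P @ V
A = Q - V[:, None]

def mean_td_error(s, a, n_samples=200_000):
    """Average sampled TD errors r + gamma * V(s') - V(s) for a fixed (s, a)."""
    next_states = rng.choice(n_states, size=n_samples, p=P[s, a])
    deltas = R[s, a] + gamma * V[next_states] - V[s]
    return deltas.mean()

for s in range(n_states):
    for a in range(n_actions):
        print(f"s={s} a={a}  mean delta={mean_td_error(s, a):+.3f}  A={A[s, a]:+.3f}")
```

Replacing `V` with a perturbed copy inside `mean_td_error` would show the bias that appears once the critic is inexact.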

## 3. Generalized Advantage Estimation (GAE)

### 3.1 Bias-Variance Trade-off

| Method | Formula | Bias | Variance |
|------|------|------|------|
| TD(0) | $\delta_t$ | High | Low |
| MC | $G_t - V(s_t)$ | None | High |
| n-step | $\sum_{k=0}^{n-1}\gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)$ | Medium | Medium |
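
The n-step row of the table can be written out directly. This is a minimal sketch, assuming a single trajectory segment with no terminal state inside it; the function name `n_step_advantage` and the `bootstrap_value` argument (the critic's estimate for the state after the last reward) are illustrative choices, not taken from the notebook.

```python
import numpy as np

def n_step_advantage(rewards, values, bootstrap_value, t, n, gamma=0.99):
    """
    n-step advantage estimate at time t:
        sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * V(s_{t+n}) - V(s_t)
    values[i] is V(s_i); bootstrap_value is V(s_T) for the state after the last reward.
    If t + n runs past the segment, n is shortened to the remaining length.
    """
    T = len(rewards)
    n = min(n, T - t)
    all_values = np.append(values, bootstrap_value)  # V(s_0), ..., V(s_T)
    discounts = gamma ** np.arange(n)
    n_step_return = np.sum(discounts * np.asarray(rewards[t:t + n], dtype=float))
    return n_step_return + gamma ** n * all_values[t + n] - all_values[t]

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [4.5, 4.0, 3.5, 3.0, 2.5]

# n=1 is the TD error; n=T with a zero bootstrap value is the MC estimate.
for n in (1, 3, 5):
    print(f"n={n}: {n_step_advantage(rewards, values, 0.0, t=0, n=n):.4f}")
```

Small `n` leans on the critic (more bias, less variance), while large `n` leans on sampled rewards (less bias, more variance).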

### 3.2 The GAE Formula

**GAE is an exponentially weighted sum of TD errors**:

$$A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

**Recursive form** (efficient to compute):
$$A_t = \delta_t + \gamma\lambda A_{t+1}$$

**Special cases**:
- $\lambda = 0$: $A_t = \delta_t$ (TD(0))
- $\lambda = 1$: $A_t = G_t - V(s_t)$ (MC)
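
Both special cases can be checked against the explicit sum. The sketch below is a deliberately naive $O(T^2)$ cross-check, not an efficient implementation: it assumes a single segment with no terminal state inside it and a zero bootstrap value after the last step. The name `gae_direct_sum` is hypothetical.

```python
import numpy as np

def gae_direct_sum(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """
    GAE from the explicit sum A_t = sum_l (gamma * lam)^l * delta_{t+l},
    truncated at the end of the segment. Intended only as a readable cross-check.
    """
    rewards = np.asarray(rewards, dtype=float)
    all_values = np.append(np.asarray(values, dtype=float), bootstrap_value)
    deltas = rewards + gamma * all_values[1:] - all_values[:-1]
    T = len(rewards)
    advantages = np.zeros(T)
    for t in range(T):
        weights = (gamma * lam) ** np.arange(T - t)
        advantages[t] = np.sum(weights * deltas[t:])
    return advantages, deltas

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [4.5, 4.0, 3.5, 3.0, 2.5]
gamma = 0.99

# lambda = 0 keeps only the first term of the sum, so A_t = delta_t.
adv_td, deltas = gae_direct_sum(rewards, values, 0.0, gamma, lam=0.0)

# lambda = 1 telescopes to the discounted return minus V(s_t).
adv_mc, _ = gae_direct_sum(rewards, values, 0.0, gamma, lam=1.0)
mc_returns = np.array([
    sum(gamma ** k * r for k, r in enumerate(rewards[t:])) for t in range(len(rewards))
])

print(np.allclose(adv_td, deltas))
print(np.allclose(adv_mc, mc_returns - np.array(values)))
```

The $\lambda = 1$ case works because the $\gamma V(s_{t+l+1})$ term of each TD error cancels the $-V(s_{t+l+1})$ term of the next one, leaving only the rewards, $-V(s_t)$, and the final bootstrap value.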

In [None]:
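def compute_gae(rewards, values, next_value, dones, gamma=0.99, gae_lambda=0.95):
    """
    Compute Generalized Advantage Estimation (GAE).
    
    Formulas:
        δ_t = r_t + γ(1-done)V(s_{t+1}) - V(s_t)
        A_t = δ_t + γλ(1-done)A_{t+1}
    
    Computed efficiently with a backward recursion.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0
    
    # Iterate backward in time
    for t in reversed(range(T)):
        if t == T - 1:
            # The last step bootstraps from the value of the state after the rollout
            next_val = next_value
        else:
            next_val = values[t + 1]
        
        # TD error; (1 - done) removes the bootstrap term at episode ends
        delta = rewards[t] + gamma * (1 - dones[t]) * next_val - values[t]
        
        # GAE recursion; (1 - done) stops advantages leaking across episodes
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages[t] = gae
    
    # Return target = advantage + value
    returns = advantages + np.array(values)
    
    return advantages, returns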

In [None]:
# Demo of the GAE computation
rewards = [1, 1, 1, 1, 1]
values = [4.5, 4.0, 3.5, 3.0, 2.5]
dones = [0, 0, 0, 0, 1]
next_value = 0

advantages, returns = compute_gae(rewards, values, next_value, dones)

print("Rewards:", rewards)
print("Value estimates:", values)
print("GAE advantages:", advantages.round(4))
print("Return targets:", returns.round(4))

In [None]:
# Visualize the effect of different λ values
def visualize_gae_lambda():
    rewards = [1] * 20
    values = list(np.linspace(10, 1, 20))
    dones = [0] * 19 + [1]
    
    lambdas = [0.0, 0.5, 0.9, 0.95, 1.0]
    
    plt.figure(figsize=(12, 5))
    
    for lam in lambdas:
        adv, _ = compute_gae(rewards, values, 0, dones, gae_lambda=lam)
        plt.plot(adv, label=f'λ={lam}', linewidth=2)
    
    plt.xlabel('Time Step')
    plt.ylabel('Advantage')
    plt.title('GAE with Different λ Values')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

visualize_gae_lambda()

## 4. A2C Implementation

### 4.1 Network Architecture

In [None]:
class ActorCritic(nn.Module):
    """
    Actor-Critic network with a shared feature extractor.
    
    Architecture:
        state -> [shared] -> features
                               |
                    +----------+----------+
                    |                     |
               [actor_head]          [critic_head]
                    |                     |
                 policy                 value
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor head: one logit per discrete action
        self.actor = nn.Linear(hidden_dim, action_dim)
        
        # Critic head: scalar state value
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, state):
        features = self.shared(state)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value
    
    def get_action_and_value(self, state):
        logits, value = self.forward(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, entropy, value.squeeze(-1)

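### 4.2 Loss Function

**Total A2C loss**:
$$L = L_{policy} + c_v \cdot L_{value} - c_{ent} \cdot H(\pi)$$

where:
- $L_{policy} = -\mathbb{E}[\log \pi(a|s) \cdot A]$, with the advantage $A$ treated as a constant (no gradient flows through it)
- $L_{value} = \mathbb{E}[(V(s) - G)^2]$, where $G$ is the return target (here $G = A + V(s)$ from GAE)
- $H(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)] = -\sum_a \pi(a|s) \log \pi(a|s)$ (entropy regularization)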
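
To make the entropy term concrete, here is a small NumPy sketch that evaluates $H(\pi)$ for a categorical policy given its logits. The helper names `log_softmax` and `categorical_entropy` are hypothetical; the `entropy()` method of `torch.distributions.Categorical` used in this notebook computes the same quantity.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities from unnormalized logits."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def categorical_entropy(logits):
    """H(pi) = -sum_a pi(a|s) * log pi(a|s) for each row of logits."""
    log_probs = log_softmax(np.asarray(logits, dtype=float))
    probs = np.exp(log_probs)
    return -np.sum(probs * log_probs, axis=-1)

uniform_logits = [0.0, 0.0, 0.0]   # every action equally likely
peaked_logits = [10.0, 0.0, 0.0]   # almost always picks the first action

print("uniform:", categorical_entropy(uniform_logits))  # equals log(3), the maximum for 3 actions
print("peaked: ", categorical_entropy(peaked_logits))   # close to 0
```

Subtracting $c_{ent} \cdot H(\pi)$ from the loss therefore rewards policies that stay closer to uniform, which slows premature collapse onto a single action.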

In [None]:
def a2c_loss(log_probs, values, advantages, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    """
    Compute the A2C loss.
    
    L = L_policy + c_v * L_value - c_ent * H(π)
    """
    # Policy loss (advantages are detached so they act as constants)
    policy_loss = -(log_probs * advantages.detach()).mean()
    
    # Value loss (mean squared error against the return targets)
    value_loss = ((values - returns.detach()) ** 2).mean()
    
    # Entropy loss (negated because we want to maximize entropy)
    entropy_loss = -entropies.mean()
    
    # Total loss
    total_loss = policy_loss + value_coef * value_loss + entropy_coef * entropy_loss
    
    return total_loss, {
        'policy_loss': policy_loss.item(),
        'value_loss': value_loss.item(),
        'entropy': -entropy_loss.item()
    }

## 5. Exercises

### Exercise 1: Full A2C Training

In [None]:
import gymnasium as gym

def train_a2c(env_name='CartPole-v1', num_updates=200, n_steps=128,
              gamma=0.99, gae_lambda=0.95, lr=3e-4):
    """
    A2C training loop: collect n_steps transitions, then do one gradient update.
    """
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    model = ActorCritic(state_dim, action_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    episode_rewards = []
    current_reward = 0
    state, _ = env.reset()
    
    for update in range(num_updates):
        # Collect n_steps transitions
        states, actions, rewards, dones = [], [], [], []
        log_probs, values, entropies = [], [], []
        
        for _ in range(n_steps):
            state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action, log_prob, entropy, value = model.get_action_and_value(state_t)
            
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            # Simplification: a time-limit truncation is treated like a true termination,
            # so the value of the final state is not bootstrapped on truncated episodes.
            done = terminated or truncated
            
            states.append(state)
            actions.append(action.item())
            rewards.append(reward)
            dones.append(float(done))
            log_probs.append(log_prob)
            values.append(value.item())
            entropies.append(entropy)
            
            current_reward += reward
            
            if done:
                episode_rewards.append(current_reward)
                current_reward = 0
                state, _ = env.reset()
            else:
                state = next_state
        
        # Compute the bootstrap value for the state after the rollout
        with torch.no_grad():
            state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            _, next_value = model(state_t)
            next_value = next_value.item()
        
        # Compute GAE
        advantages, returns = compute_gae(rewards, values, next_value, dones, gamma, gae_lambda)
        
        # Convert to tensors
        advantages = torch.tensor(advantages, dtype=torch.float32)
        returns = torch.tensor(returns, dtype=torch.float32)
        log_probs = torch.stack(log_probs).squeeze()
        entropies = torch.stack(entropies).squeeze()
        
        # Recompute the values with gradients enabled (the rollout stored plain floats)
        states_t = torch.tensor(np.array(states), dtype=torch.float32)
        _, values_t = model(states_t)
        values_t = values_t.squeeze()
        
        # Normalize the advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Compute the loss and update
        loss, metrics = a2c_loss(log_probs, values_t, advantages, returns, entropies)
        
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        
        if (update + 1) % 20 == 0 and episode_rewards:
            mean_reward = np.mean(episode_rewards[-20:])
            print(f"Update {update+1}, Mean Reward: {mean_reward:.2f}, "
                  f"Policy Loss: {metrics['policy_loss']:.4f}, "
                  f"Value Loss: {metrics['value_loss']:.4f}")
    
    env.close()
    return episode_rewards

# Train
rewards = train_a2c(num_updates=150)

In [None]:
# Plot per-episode rewards and a moving average
if rewards:
    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.3)
    window = min(20, len(rewards))
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), smoothed, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('A2C Training on CartPole')
    plt.grid(True, alpha=0.3)
    plt.show()

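## Summary

### Key Takeaways

1. **Actor-Critic**: combines policy learning (the actor) with value learning (the critic)

2. **TD error**: $\delta_t = r_t + \gamma V(s') - V(s)$

3. **GAE**: $A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}$
   - $\lambda=0$: TD(0), low variance, high bias
   - $\lambda=1$: MC, high variance, no bias
   - $\lambda=0.95$: a commonly used default

4. **A2C loss**: $L = L_{policy} + c_v L_{value} - c_{ent} H(\pi)$

### Next Steps

- PPO: adds a clipped objective to improve stability
- Parallel environments: improve sampling efficiency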