# Policy Gradient Methods Tutorial

## Policy Gradient Methods: From REINFORCE to Actor-Critic

---

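This tutorial covers the theory and implementation details of policy gradient methods:

1. **Policy gradient theorem** - the theoretical basis for optimizing a policy directly
2. **REINFORCE** - the most basic policy gradient algorithm
3. **Variance reduction** - baselines and the advantage function
4. **Actor-Critic methods** - policy gradients combined with a learned value function
5. **GAE** - generalized advantage estimation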

---

**References**:
- Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Sutton et al. (1999). Policy gradient methods for reinforcement learning with function approximation
- Schulman et al. (2016). High-dimensional continuous control using generalized advantage estimation

In [None]:
# Environment setup
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical, Normal
import matplotlib.pyplot as plt
from collections import deque
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Plot settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

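## 1. The Policy Gradient Theorem

### 1.1 From Value-Based to Policy-Based Methods

| Aspect | Value-based (DQN) | Policy-based |
|--------|-------------------|--------------|
| Learning target | Q(s,a) → implicit policy | Learns π(a\|s) directly |
| Action space | Mainly discrete | Handles continuous actions naturally |
| Policy type | Deterministic (argmax) | Stochastic (probability distribution) |

### 1.2 Objective

Maximize the expected cumulative discounted return: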

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

### 1.3 Policy Gradient Theorem (Sutton et al., 1999)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi}(s, a)\right]$$

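**Intuition**:
- $\nabla_\theta \log \pi_\theta(a|s)$: the direction in parameter space that increases the probability of action $a$
- $Q^{\pi}(s, a)$: how good the action is
- Good actions get their probability increased, bad actions get it decreased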
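The theorem can be checked numerically in the simplest possible setting. The sketch below assumes a one-state problem (a bandit) with a softmax policy and fixed, illustrative action values; the helper names `softmax`, `policy_gradient`, and `finite_difference` are hypothetical and not part of the tutorial code. It evaluates the expectation $\mathbb{E}_\pi[\nabla_\theta \log \pi \cdot Q]$ exactly by summing over actions and compares it with a finite-difference gradient of $J(\theta)$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(theta, q_values):
    """J(theta) = sum_a pi(a) * Q(a) for a one-state problem."""
    probs = softmax(theta)
    return sum(p * q for p, q in zip(probs, q_values))

def policy_gradient(theta, q_values):
    """Evaluate E_pi[grad log pi(a) * Q(a)] exactly by summing over actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, q_a) in enumerate(zip(probs, q_values)):
        for j in range(len(theta)):
            # For a softmax policy: d log pi(a) / d theta_j = 1[a == j] - pi(j)
            score = (1.0 if a == j else 0.0) - probs[j]
            grad[j] += p_a * score * q_a
    return grad

def finite_difference(theta, q_values, eps=1e-6):
    """Central-difference gradient of J(theta), used only as a reference."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        diff = expected_return(up, q_values) - expected_return(down, q_values)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.0, 0.0, 0.0]      # uniform initial policy (assumed)
q_values = [0.1, 0.9, 0.0]   # illustrative action values (assumed)

pg = policy_gradient(theta, q_values)
fd = finite_difference(theta, q_values)
print("policy gradient  :", [round(g, 4) for g in pg])
print("finite difference:", [round(g, 4) for g in fd])
```

The two gradients agree up to numerical error, and only the component for the highest-valued action is positive, which matches the intuition above.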

In [None]:
# Visualization: what the log-probability gradient does

def visualize_policy_gradient_intuition():
    """Illustrate the intuition behind a policy gradient update."""
    fig, axes = plt.subplots(1, 3, figsize=(14, 4))
    
    # Initial policy
    actions = ['left', 'right', 'jump']
    initial_probs = [0.33, 0.33, 0.34]
    
    axes[0].bar(actions, initial_probs, color='steelblue', alpha=0.7)
    axes[0].set_ylim(0, 0.8)
    axes[0].set_title('Initial policy π(a|s)', fontsize=12)
    axes[0].set_ylabel('Probability')
    
    # Suppose "right" yields a high return
    rewards = [0.1, 0.9, 0.0]  # illustrative action values
    axes[1].bar(actions, rewards, color='green', alpha=0.7)
    axes[1].set_ylim(0, 1.0)
    axes[1].set_title('Action value Q(s,a)', fontsize=12)
    axes[1].set_ylabel('Return')
    
    # Policy after the update
    updated_probs = [0.15, 0.70, 0.15]  # illustrative: probability of "right" goes up
    axes[2].bar(actions, updated_probs, color='orange', alpha=0.7)
    axes[2].set_ylim(0, 0.8)
    axes[2].set_title('Updated policy π\'(a|s)', fontsize=12)
    axes[2].set_ylabel('Probability')
    
    plt.tight_layout()
    plt.suptitle('Policy gradient intuition: raise the probability of high-return actions', y=1.02, fontsize=14)
    plt.show()

visualize_policy_gradient_intuition()

## 2. Policy Network Implementation

### 2.1 Discrete Action Spaces: Softmax Policy

$$\pi_\theta(a|s) = \frac{\exp(h(s, a; \theta))}{\sum_{a'} \exp(h(s, a'; \theta))}$$

In [None]:
class DiscretePolicy(nn.Module):
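    """
    Policy network for discrete action spaces.
    
    Defines a softmax distribution over actions.
    The network outputs logits, which are passed to Categorical for numerical stability.
    """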
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        
        # Orthogonal initialization
        for layer in self.net:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                nn.init.zeros_(layer.bias)
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return action logits."""
        return self.net(state)
    
    def get_distribution(self, state: torch.Tensor) -> Categorical:
        """Return the action distribution."""
        logits = self.forward(state)
        return Categorical(logits=logits)
    
    def sample(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Sample an action; returns (action, log_prob, entropy)."""
        dist = self.get_distribution(state)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy()

# Quick test
policy = DiscretePolicy(state_dim=4, action_dim=2)
state = torch.randn(1, 4)

action, log_prob, entropy = policy.sample(state)
probs = policy.get_distribution(state).probs

print(f"State: {state.squeeze().numpy().round(3)}")
print(f"Action probabilities: {probs.squeeze().detach().numpy().round(3)}")
print(f"Sampled action: {action.item()}")
print(f"log π(a|s): {log_prob.item():.4f}")
print(f"Entropy: {entropy.item():.4f}")

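### 2.2 Continuous Action Spaces: Gaussian Policy

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

For bounded action spaces, squash the sample with **tanh** and correct the log-probability for the change of variables: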

$$a = \tanh(u), \quad u \sim \mathcal{N}(\mu, \sigma^2)$$

$$\log \pi(a|s) = \log \mathcal{N}(u|\mu,\sigma^2) - \sum_i \log(1 - \tanh^2(u_i))$$
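One way to sanity-check the correction term is to confirm that the corrected density still integrates to 1 over the action interval $(-1, 1)$. The sketch below does this for a single action dimension with a plain midpoint rule; the function names and the values `mean=0.3`, `std=0.5` are illustrative assumptions, not part of the tutorial code.

```python
import math

def normal_log_pdf(u, mean, std):
    """Log-density of a one-dimensional Gaussian."""
    z = (u - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def squashed_log_prob(u, mean, std):
    """log pi(a|s) for a = tanh(u): Gaussian log-density minus log(1 - tanh(u)^2)."""
    return normal_log_pdf(u, mean, std) - math.log(1.0 - math.tanh(u) ** 2)

def integrate_density_over_actions(mean, std, n=20000):
    """Midpoint-rule integral of pi(a|s) over a in (-1, 1)."""
    width = 2.0 / n
    total = 0.0
    for i in range(n):
        a = -1.0 + (i + 0.5) * width   # midpoint of the i-th cell, strictly inside (-1, 1)
        u = math.atanh(a)              # pre-squash value that maps to this action
        total += math.exp(squashed_log_prob(u, mean, std)) * width
    return total

mass = integrate_density_over_actions(mean=0.3, std=0.5)
print(f"total probability mass over (-1, 1): {mass:.6f}")
```

Dropping the `- math.log(1.0 - math.tanh(u) ** 2)` term makes the integral differ from 1, which is a quick way to see that the Jacobian correction is needed.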

In [None]:
class ContinuousPolicy(nn.Module):
    """
    Gaussian policy for continuous action spaces.
    
    Outputs the mean and standard deviation of a Gaussian.
    Actions are squashed into [-1, 1] with tanh.
    """
    
    LOG_STD_MIN = -20.0
    LOG_STD_MAX = 2.0
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        self.feature_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.mean_layer = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learnable, state-independent log std
    
    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Return (mean, std)."""
        features = self.feature_net(state)
        mean = self.mean_layer(features)
        log_std = torch.clamp(self.log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
        std = log_std.exp()
        return mean, std
    
    def sample(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Sample an action using the reparameterization trick."""
        mean, std = self.forward(state)
        dist = Normal(mean, std)
        
        # Reparameterized sample
        u = dist.rsample()
        action = torch.tanh(u)  # squash to [-1, 1]
        
        # Correct log_prob for the Jacobian of the tanh transform
        log_prob = dist.log_prob(u).sum(dim=-1)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(dim=-1)
        
        # Entropy of the pre-tanh Gaussian, not of the squashed distribution
        entropy = dist.entropy().sum(dim=-1)
        return action, log_prob, entropy

# Quick test
cont_policy = ContinuousPolicy(state_dim=3, action_dim=2)
state = torch.randn(1, 3)

action, log_prob, entropy = cont_policy.sample(state)
mean, std = cont_policy.forward(state)

print(f"Mean μ: {mean.squeeze().detach().numpy().round(3)}")
print(f"Std σ: {std.squeeze().detach().numpy().round(3)}")
print(f"Sampled action (after tanh): {action.squeeze().detach().numpy().round(3)}")
print(f"Action range check: [{action.min().item():.3f}, {action.max().item():.3f}]")

## 3. The REINFORCE Algorithm

### 3.1 How It Works

REINFORCE (Williams, 1992) uses the **Monte Carlo return** as an unbiased estimate of the Q-value:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

Gradient estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) G_t^{(i)}$$

### 3.2 Algorithm

```
1. Initialize policy parameters θ
2. Loop:
   a. Collect one complete trajectory τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T)
   b. For t = 0, 1, ..., T:
      - Compute the return G_t = Σ_{k=t}^{T} γ^{k-t} r_k
      - θ ← θ + α · G_t · ∇_θ log π_θ(a_t|s_t)
```

In [None]:
def compute_returns(rewards: List[float], gamma: float, normalize: bool = True) -> torch.Tensor:
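    """
    Compute Monte Carlo returns.
    
    G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
    
    Accumulates backwards over the episode in O(T) time.
    """
    returns = []
    G = 0.0
    
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    
    returns = torch.tensor(returns, dtype=torch.float32)
    
    # Optional standardization; note that this shifts and rescales the returns
    if normalize and len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    
    return returns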

# Demonstrate the return computation
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
gamma = 0.99

returns = compute_returns(rewards, gamma, normalize=False)
print("Rewards:", rewards)
print(f"Returns (γ={gamma}):")
for t, (r, G) in enumerate(zip(rewards, returns)):
    print(f"  t={t}: r={r:.1f}, G_t={G.item():.4f}")

In [None]:
class REINFORCE:
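    """
    REINFORCE implementation.
    
    Properties:
    - Uses full-episode returns (Monte Carlo)
    - Unbiased gradient estimate, but high variance
    - Can only update after a complete episode
    """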
    
    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-3, gamma: float = 0.99):
        self.gamma = gamma
        
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Trajectory storage
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        """Sample an action from the policy."""
        state_t = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob, _ = self.policy.sample(state_t)
        
        self.log_probs.append(log_prob)
        return action.item()
    
    def store_reward(self, reward: float):
        """Store a reward."""
        self.rewards.append(reward)
    
    def update(self) -> float:
        """Update the policy once the episode has finished."""
        # Monte Carlo returns
        returns = compute_returns(self.rewards, self.gamma, normalize=True)
        
        # Policy loss: -E[log π(a|s) * G].
        # Each stored log-prob has shape (1,); concatenate to shape (T,) so the
        # product with returns is elementwise instead of broadcasting to (T, T).
        log_probs = torch.cat(self.log_probs)
        policy_loss = -(log_probs * returns).mean()
        
        # Optimize
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear the trajectory
        self.log_probs = []
        self.rewards = []
        
        return policy_loss.item()

print("REINFORCE class defined")

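## 4. Variance Reduction: Baselines

### 4.1 The High-Variance Problem

The main weakness of REINFORCE is **high variance**:
- The return $G_t$ accumulates the randomness of every later action and transition
- Gradient estimates are therefore noisy
- Training converges slowly

### 4.2 Introducing a Baseline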

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q(s, a) - b(s))\right]$$

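**Why does this leave the expectation unchanged?** Because $b(s)$ does not depend on the action, it factors out of the expectation over actions in each state, and what remains is the gradient of a constant:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \, \nabla_\theta 1 = 0$$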
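The zero-expectation property is easy to confirm numerically for a softmax policy. The sketch below assumes a single state with three actions and a deliberately large constant baseline; `softmax` and `expected_score_times_baseline` are hypothetical helper names and the logits are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score_times_baseline(theta, baseline):
    """Return sum_a pi(a) * grad_theta log pi(a) * b, one entry per parameter."""
    probs = softmax(theta)
    result = []
    for j in range(len(theta)):
        total = 0.0
        for a, p_a in enumerate(probs):
            # Softmax score function: d log pi(a) / d theta_j = 1[a == j] - pi(j)
            score = (1.0 if a == j else 0.0) - probs[j]
            total += p_a * score * baseline
        result.append(total)
    return result

bias = expected_score_times_baseline([0.2, -1.0, 0.7], baseline=100.0)
print("expected baseline term per parameter:", bias)
```

Every component is zero up to floating-point rounding, regardless of how large the baseline is. The baseline changes the spread of individual samples, not the mean of the gradient estimate.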

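### 4.3 A Near-Optimal Baseline

The state-value function $V(s)$ is a near-optimal baseline:

$$A(s, a) = Q(s, a) - V(s)$$

This is the **advantage function**: how much better or worse an action is than the policy's average behavior in that state.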

In [None]:
# Visualization: how a baseline reduces variance

def visualize_baseline_effect():
    """Show how subtracting a baseline reduces the variance of the gradient signal."""
    np.random.seed(42)
    
    # Simulated returns with a large positive mean
    n_samples = 1000
    base_return = 100
    returns = np.random.normal(base_return, 20, n_samples)
    
    # Stand-in for one component of ∇log π(a|s), which has zero mean under the policy.
    # Subtracting a constant from G alone would not change Var(G); what matters is
    # the variance of the product score * (G - b).
    score = np.random.normal(0.0, 1.0, n_samples)
    
    # Per-sample gradient signal without a baseline: score * G
    gradient_no_baseline = score * returns
    
    # With a baseline (here the mean return): score * (G - b)
    baseline = returns.mean()
    gradient_with_baseline = score * (returns - baseline)
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    axes[0].hist(gradient_no_baseline, bins=50, alpha=0.7, color='steelblue')
    axes[0].axvline(x=0, color='red', linestyle='--', label='zero')
    axes[0].set_title(f'No baseline\nmean={np.mean(gradient_no_baseline):.1f}, variance={np.var(gradient_no_baseline):.1f}')
    axes[0].set_xlabel('Gradient signal')
    axes[0].legend()
    
    axes[1].hist(gradient_with_baseline, bins=50, alpha=0.7, color='orange')
    axes[1].axvline(x=0, color='red', linestyle='--', label='zero')
    axes[1].set_title(f'With baseline\nmean={np.mean(gradient_with_baseline):.1f}, variance={np.var(gradient_with_baseline):.1f}')
    axes[1].set_xlabel('Gradient signal')
    axes[1].legend()
    
    plt.suptitle('A baseline reduces variance while leaving the expectation unchanged', y=1.02)
    plt.tight_layout()
    plt.show()
    
    print(f"Variance reduction factor: {np.var(gradient_no_baseline) / np.var(gradient_with_baseline):.2f}x")

visualize_baseline_effect()

In [None]:
class ValueNetwork(nn.Module):
    """State-value network V(s)."""
    
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class REINFORCEBaseline:
    """
    REINFORCE with a baseline.
    
    Uses the state-value function V(s) as a baseline to reduce variance:
    A(s,a) = G - V(s)  (advantage estimate)
    """
    
    def __init__(self, state_dim: int, action_dim: int, 
                 lr_policy: float = 1e-3, lr_value: float = 1e-3, gamma: float = 0.99):
        self.gamma = gamma
        
        self.policy = DiscretePolicy(state_dim, action_dim)
        self.value_net = ValueNetwork(state_dim)
        
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        self.log_probs = []
        self.values = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        state_t = torch.FloatTensor(state).unsqueeze(0)
        
        action, log_prob, _ = self.policy.sample(state_t)
        value = self.value_net(state_t)
        
        self.log_probs.append(log_prob)
        self.values.append(value)
        
        return action.item()
    
    def store_reward(self, reward: float):
        self.rewards.append(reward)
    
    def update(self) -> Tuple[float, float]:
        # Monte Carlo returns (unnormalized: they are the regression target for V)
        returns = compute_returns(self.rewards, self.gamma, normalize=False)
        
        # Each stored log-prob has shape (1,) and each value (1, 1);
        # flatten both to shape (T,) so they line up with the returns.
        log_probs = torch.cat(self.log_probs)
        values = torch.cat(self.values).squeeze(-1)
        
        # Advantage: A = G - V
        advantages = returns - values.detach()  # detach: the policy loss must not update the critic
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Policy loss
        policy_loss = -(log_probs * advantages).mean()
        
        # Value loss (MSE)
        value_loss = F.mse_loss(values, returns)
        
        # Update both networks
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        
        # Clear the trajectory
        self.log_probs = []
        self.values = []
        self.rewards = []
        
        return policy_loss.item(), value_loss.item()

print("REINFORCEBaseline class defined")

## 5. Actor-Critic Methods

### 5.1 From REINFORCE to Actor-Critic

| Method | Q estimate | Update timing |
|--------|------------|---------------|
| REINFORCE | $G_t$ (MC) | End of episode |
| Actor-Critic | $r + \gamma V(s')$ (TD) | Every step |

### 5.2 TD Advantage Estimate

$$A(s, a) \approx r + \gamma V(s') - V(s) = \delta_t \quad (\text{TD error})$$
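As a small worked example, the one-step estimate can be computed by hand for a non-terminal and a terminal transition. The function name `td_error` and the numbers are illustrative assumptions; at a terminal state the bootstrap term is dropped because no further reward follows.

```python
def td_error(reward, value, next_value, gamma, done):
    """One-step advantage estimate: delta = r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value

# Non-terminal step: 1.0 + 0.99 * 0.9 - 0.8
delta_mid = td_error(reward=1.0, value=0.8, next_value=0.9, gamma=0.99, done=False)

# Terminal step: 1.0 + 0 - 0.85
delta_end = td_error(reward=1.0, value=0.85, next_value=0.0, gamma=0.99, done=True)

print(f"delta (non-terminal): {delta_mid:.3f}")
print(f"delta (terminal):     {delta_end:.3f}")
```

A positive TD error means the step turned out better than the critic predicted, so the corresponding action is reinforced.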

### 5.3 A2C Loss Function

$$\mathcal{L} = \underbrace{-\log \pi(a|s) \cdot A}_{\text{actor loss}} + \underbrace{c_1 (V(s) - G)^2}_{\text{critic loss}} - \underbrace{c_2 H(\pi)}_{\text{entropy regularization}}$$
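To make the three terms concrete, the sketch below evaluates the loss on a tiny hand-written batch, averaging each term over the batch. The function name `a2c_loss` and all input numbers are made up for illustration; the coefficients `c1=0.5` and `c2=0.01` are assumed here because they match the defaults used by the `A2C` class in this tutorial.

```python
import math

def a2c_loss(log_probs, advantages, values, returns, entropies, c1=0.5, c2=0.01):
    """Return (total, actor, critic, entropy), each term averaged over the batch."""
    n = len(log_probs)
    actor = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    critic = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    total = actor + c1 * critic - c2 * entropy
    return total, actor, critic, entropy

# Two transitions with illustrative numbers
log_probs = [math.log(0.5), math.log(0.25)]
advantages = [1.0, -1.0]     # treated as constants (no gradient flows through them)
values = [0.5, 1.0]          # critic predictions V(s)
returns = [1.0, 0.0]         # regression targets G
entropies = [0.6, 0.4]

total, actor, critic, entropy = a2c_loss(log_probs, advantages, values, returns, entropies)
print(f"actor={actor:.4f} critic={critic:.4f} entropy={entropy:.4f} total={total:.4f}")
```

The entropy term enters with a minus sign: minimizing the total loss therefore pushes entropy up, which discourages the policy from collapsing onto one action too early.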

In [None]:
class ActorCriticNetwork(nn.Module):
    """
    Actor-Critic network with shared features.
    
    Architecture:
    state -> [shared_net] -> features
                              |-> [actor_head] -> policy
                              |-> [critic_head] -> value
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor head
        self.actor = nn.Linear(hidden_dim, action_dim)
        
        # Critic head
        self.critic = nn.Linear(hidden_dim, 1)
    
    def forward(self, state: torch.Tensor):
        features = self.shared(state)
        logits = self.actor(features)
        value = self.critic(features)
        return Categorical(logits=logits), value
    
    def get_action_and_value(self, state: torch.Tensor):
        dist, value = self.forward(state)
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), value.squeeze(-1)

# Quick test
ac_net = ActorCriticNetwork(4, 2)
state = torch.randn(1, 4)
action, log_prob, entropy, value = ac_net.get_action_and_value(state)

print(f"Action: {action.item()}")
print(f"log π(a|s): {log_prob.item():.4f}")
print(f"Entropy: {entropy.item():.4f}")
print(f"V(s): {value.item():.4f}")

## 6. GAE: Generalized Advantage Estimation

### 6.1 n-step Returns

$$G_t^{(n)} = r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$$

| n | Properties |
|---|------------|
| n=1 | TD(0): low variance, high bias |
| n=∞ | MC: high variance, unbiased |
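The two rows of the table are the end points of the same formula, which a short sketch can make explicit. The helper `n_step_return` is hypothetical; it assumes `values` holds one more entry than `rewards` (the value of the state reached after the last reward, set to 0 for a terminal state) and truncates `n` at the end of the episode.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards from step t, plus a bootstrapped value."""
    horizon = min(n, len(rewards) - t)   # cannot look past the end of the episode
    discounted = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    return discounted + gamma ** horizon * values[t + horizon]

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.8, 0.9, 1.0, 0.95, 0.85, 0.0]   # final entry: terminal state, value 0
gamma = 0.99

one_step = n_step_return(rewards, values, t=0, n=1, gamma=gamma)    # TD(0) target
full = n_step_return(rewards, values, t=0, n=len(rewards), gamma=gamma)  # Monte Carlo return

print(f"n=1 target: {one_step:.4f}")
print(f"n=5 target: {full:.4f}")
```

With `n=1` the target leans almost entirely on the critic's estimate of the next state; with `n` equal to the episode length the critic drops out and the target is the plain discounted return.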

### 6.2 The GAE Formula

GAE trades off bias against variance with an exponentially weighted sum of multi-step TD errors:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$A_t^{GAE} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$$

- $\lambda=0$: TD(0), high bias and low variance
- $\lambda=1$: Monte Carlo, low bias and high variance
- $\lambda=0.95$: a common default that balances the two
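The two limiting cases can be verified directly. The sketch below is a simplified pure-Python version that assumes a single episode ending in a terminal state (so it needs no `done` flags); `gae` and `discounted_returns` are hypothetical helper names, and `values` carries one extra trailing entry of 0 for the terminal state.

```python
def gae(rewards, values, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1} for one episode."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t, accumulated from the end of the episode."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.8, 0.9, 1.0, 0.95, 0.85, 0.0]   # last entry: terminal state
gamma = 0.99

adv_td = gae(rewards, values, gamma, lam=0.0)
adv_mc = gae(rewards, values, gamma, lam=1.0)

# Reference quantities for the two limits
deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]
mc = [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

print("lambda=0 equals TD errors:", all(abs(a - d) < 1e-9 for a, d in zip(adv_td, deltas)))
print("lambda=1 equals G_t - V  :", all(abs(a - m) < 1e-9 for a, m in zip(adv_mc, mc)))
```

With $\lambda=1$ the intermediate value estimates cancel in a telescoping sum, leaving $G_t - V(s_t)$; intermediate values of $\lambda$ interpolate between the two.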

In [None]:
def compute_gae(
    rewards: List[float],
    values: List[float],
    next_value: float,
    dones: List[bool],
    gamma: float,
    gae_lambda: float
) -> Tuple[torch.Tensor, torch.Tensor]:
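    """
    Compute GAE (Generalized Advantage Estimation).
    
    δ_t = r_t + γV(s_{t+1}) - V(s_t)
    A_t^GAE = Σ (γλ)^k δ_{t+k}
    
    Returns (advantages, returns), where returns = advantages + values
    is the regression target for the critic.
    """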
    advantages = []
    gae = 0.0
    
    values = list(values) + [next_value]
    
    for t in reversed(range(len(rewards))):
        # If the episode ended at step t, the next state's value is 0
        next_val = 0.0 if dones[t] else values[t + 1]
        
        # TD error
        delta = rewards[t] + gamma * next_val - values[t]
        
        # GAE accumulation (reset across episode boundaries)
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    
    advantages = torch.tensor(advantages, dtype=torch.float32)
    returns = advantages + torch.tensor(values[:-1], dtype=torch.float32)
    
    return advantages, returns

# Demonstrate the GAE computation
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.8, 0.9, 1.0, 0.95, 0.85]
dones = [False, False, False, False, True]
gamma = 0.99
gae_lambda = 0.95

advantages, returns = compute_gae(rewards, values, 0.0, dones, gamma, gae_lambda)

print("GAE example:")
print(f"Rewards: {rewards}")
print(f"Value estimates: {values}")
print(f"Advantages (GAE λ={gae_lambda}): {advantages.numpy().round(3)}")
print(f"Return targets: {returns.numpy().round(3)}")

In [None]:
# Visualization: how λ affects the advantage estimates

def visualize_gae_lambda_effect():
    """Plot GAE advantages for several values of λ."""
    np.random.seed(42)
    
    # Simulate one episode
    T = 50
    rewards = np.random.normal(1.0, 0.5, T)
    true_values = np.cumsum(rewards[::-1])[::-1] * 0.99 ** np.arange(T)  # rough stand-in for the true values
    noisy_values = true_values + np.random.normal(0, 0.3, T)  # noisy critic estimates
    dones = [False] * (T-1) + [True]
    gamma = 0.99
    
    lambdas = [0.0, 0.5, 0.9, 1.0]
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    for ax, lam in zip(axes.flat, lambdas):
        advantages, _ = compute_gae(rewards.tolist(), noisy_values.tolist(), 0.0, dones, gamma, lam)
        
        ax.plot(advantages.numpy(), label=f'GAE λ={lam}')
        ax.axhline(y=0, color='red', linestyle='--', alpha=0.5)
        ax.set_xlabel('Time step')
        ax.set_ylabel('Advantage estimate')
        ax.set_title(f'λ={lam}: std={advantages.std().item():.3f}')
        ax.legend()
    
    plt.suptitle('GAE: effect of λ on the advantage estimates', y=1.02)
    plt.tight_layout()
    plt.show()

visualize_gae_lambda_effect()

## 7. Full A2C Implementation

In [None]:
class A2C:
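    """
    Advantage Actor-Critic (A2C), full implementation.
    
    Features:
    - GAE for advantage estimation
    - Entropy regularization to encourage exploration
    - Gradient clipping to prevent exploding gradients
    """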
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        entropy_coef: float = 0.01,
        value_coef: float = 0.5,
        max_grad_norm: float = 0.5
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.max_grad_norm = max_grad_norm
        
        self.model = ActorCriticNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        
        # Rollout buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []
        self.entropies = []
    
    def select_action(self, state: np.ndarray) -> int:
        state_t = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob, entropy, value = self.model.get_action_and_value(state_t)
        
        self.states.append(state)
        self.actions.append(action.item())
        self.log_probs.append(log_prob)
        self.values.append(value.item())
        self.entropies.append(entropy)
        
        return action.item()
    
    def store(self, reward: float, done: bool):
        self.rewards.append(reward)
        self.dones.append(done)
    
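    def update(self, next_state: np.ndarray, done: bool) -> Dict[str, float]:
        # Bootstrap value of the state that follows the last stored step
        if done:
            next_value = 0.0
        else:
            with torch.no_grad():
                next_state_t = torch.FloatTensor(next_state).unsqueeze(0)
                _, next_value = self.model(next_state_t)
                next_value = next_value.item()
        
        # Compute GAE
        advantages, returns = compute_gae(
            self.rewards, self.values, next_value, 
            self.dones, self.gamma, self.gae_lambda
        )
        
        # Standardize advantages
        if len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Each stored log-prob and entropy has shape (1,); concatenate to (T,)
        # so they line up with the advantages instead of broadcasting to (T, T).
        log_probs = torch.cat(self.log_probs)
        entropies = torch.cat(self.entropies)
        # self.values holds plain floats, which carry no gradient. Recompute
        # V(s_t) from the stored states so the value loss can train the critic.
        states_t = torch.as_tensor(np.array(self.states), dtype=torch.float32)
        _, values = self.model(states_t)
        values = values.squeeze(-1)
        
        # Losses
        policy_loss = -(log_probs * advantages.detach()).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_bonus = entropies.mean()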
        
        total_loss = (
            policy_loss 
            + self.value_coef * value_loss 
            - self.entropy_coef * entropy_bonus
        )
        
        # Optimize
        self.optimizer.zero_grad()
        total_loss.backward()
        nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
        self.optimizer.step()
        
        # Clear buffers
        self.states = []
        self.actions = []
        self.rewards = []
        self.log_probs = []
        self.values = []
        self.dones = []
        self.entropies = []
        
        return {
            "policy_loss": policy_loss.item(),
            "value_loss": value_loss.item(),
            "entropy": entropy_bonus.item(),
            "total_loss": total_loss.item()
        }

print("A2C class defined")

## 8. Training and Evaluation

In [None]:
# Try to import gymnasium
try:
    import gymnasium as gym
    HAS_GYM = True
    print("Gymnasium imported successfully")
except ImportError:
    HAS_GYM = False
    print("Warning: gymnasium is not installed, skipping the training demo")
    print("Install with: pip install gymnasium")

In [None]:
def train_agent(agent, env_name: str, num_episodes: int = 300, log_interval: int = 50):
    """Train an agent, updating once per episode."""
    if not HAS_GYM:
        print("gymnasium is required for training")
        return []
    
    env = gym.make(env_name)
    rewards_history = []
    
    print(f"\n{'='*50}")
    print(f"Training {agent.__class__.__name__} on {env_name}")
    print(f"{'='*50}\n")
    
    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            if hasattr(agent, 'store'):
                # Pass `terminated`, not `done`: after a time-limit truncation the
                # agent should still bootstrap from V(next_state) rather than 0.
                agent.store(reward, terminated)
            else:
                agent.store_reward(reward)
            
            total_reward += reward
            state = next_state
            
            if done:
                break
        
        # Update at the end of the episode
        if isinstance(agent, A2C):
            agent.update(next_state, terminated)
        else:
            agent.update()
        
        rewards_history.append(total_reward)
        
        if (episode + 1) % log_interval == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1:4d} | mean reward (last 100): {avg_reward:.2f}")
    
    env.close()
    return rewards_history

In [None]:
# Comparison experiment
if HAS_GYM:
    env_name = "CartPole-v1"
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    env.close()
    
    num_episodes = 300  # adjustable
    
    results = {}
    
    # REINFORCE
    print("\n[1/3] Training REINFORCE...")
    agent_rf = REINFORCE(state_dim, action_dim, lr=1e-3, gamma=0.99)
    results['REINFORCE'] = train_agent(agent_rf, env_name, num_episodes)
    
    # REINFORCE + Baseline
    print("\n[2/3] Training REINFORCE+Baseline...")
    agent_rfb = REINFORCEBaseline(state_dim, action_dim, lr_policy=1e-3, lr_value=1e-3, gamma=0.99)
    results['REINFORCE+Baseline'] = train_agent(agent_rfb, env_name, num_episodes)
    
    # A2C
    print("\n[3/3] Training A2C...")
    agent_a2c = A2C(state_dim, action_dim, lr=3e-4, gamma=0.99, gae_lambda=0.95)
    results['A2C (GAE)'] = train_agent(agent_a2c, env_name, num_episodes)
else:
    print("Skipping training (gymnasium is not installed)")
    results = {}

In [None]:
# Plot learning curves
def plot_learning_curves(results: Dict[str, List[float]], window: int = 50):
    """Plot learning curves for all agents on one figure."""
    if not results:
        print("No training data to plot")
        return
    
    fig, ax = plt.subplots(figsize=(12, 6))
    
    colors = ['steelblue', 'orange', 'green']
    
    for (name, rewards), color in zip(results.items(), colors):
        # Raw curve
        ax.plot(rewards, alpha=0.2, color=color)
        
        # Moving average
        if len(rewards) > window:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            ax.plot(range(window-1, len(rewards)), smoothed, label=name, color=color, linewidth=2)
    
    ax.set_xlabel('Episode')
    ax.set_ylabel('Total Reward')
    ax.set_title('Policy gradient methods compared')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_learning_curves(results)

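## 9. Algorithm Comparison

| Algorithm | Q estimate | Update timing | Variance | Bias | Suggested use |
|-----------|------------|---------------|----------|------|---------------|
| REINFORCE | MC ($G_t$) | End of episode | High | None | Teaching / simple environments |
| REINFORCE+Baseline | MC + V(s) | End of episode | Medium | None | General tasks |
| A2C | TD + GAE | Every step / n-step | Low | Some | Production use |
| PPO | A2C + clipping | Every step / n-step | Low | Some | **Recommended default** |

### Practical Advice

1. **Learning rate**: use a smaller learning rate for the actor than for the critic (e.g. 1e-4 vs 1e-3)
2. **GAE λ**: 0.95 is a common choice that balances bias and variance
3. **Entropy coefficient**: around 0.01; too large leads to excessive exploration, too small to premature convergence
4. **Gradient clipping**: a max norm of 0.5 to 1.0 guards against unstable updates
5. **Advantage normalization**: almost always helps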

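## 10. Exercises

### Exercise 1: Understanding the Policy Gradient

Why does the policy gradient formula use $\log \pi$ rather than $\pi$? Explain from both a mathematical and an intuitive angle.

### Exercise 2: Implementing Improvements

Add the following improvements to REINFORCE and observe the effect:
1. Causality: weight each action only by the rewards that follow it (rather than by the whole-episode return)
2. Reward normalization

### Exercise 3: Continuous Actions

Train A2C on the `Pendulum-v1` environment using the `ContinuousPolicy` class.

### Exercise 4: Hyperparameter Tuning

Try different GAE λ values (0.9, 0.95, 0.99, 1.0) and compare the stability of the learning curves.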

In [None]:
# Exercise scratch area
# Write your exercise code here

pass

---

## Further Reading

1. Sutton & Barto, "Reinforcement Learning: An Introduction", Chapter 13
2. Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation", 2016
3. Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (A3C), 2016
4. OpenAI Spinning Up: https://spinningup.openai.com/

---

[Back to parent directory](../README.md)