# Actor-Critic and PPO: A Hands-On Policy Gradient Tutorial

---

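## Learning Objectives

| Objective | Description |
|-----------|-------------|
| Policy gradients | Understand the mathematics behind the policy gradient theorem |
| Actor-Critic | Master the actor-critic architecture |
| GAE | Understand the bias-variance trade-off in generalized advantage estimation |
| PPO | Implement Proximal Policy Optimization |
| Practice | Train on CartPole and compare the algorithms |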

---

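## Part 1: Policy Gradient Theory

### 1.1 From Q-Learning to Policy Gradients

| Method | What is learned | Strengths | Weaknesses |
|--------|-----------------|-----------|------------|
| DQN | Q(s,a) | High sample efficiency | Limited to discrete actions |
| Policy gradient | π(a\|s) | Handles continuous actions and stochastic policies | High-variance gradient estimates |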

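### 1.2 The Policy Gradient Theorem

**Objective**: maximize the expected return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

**Gradient**:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s, a)\right]$$

**Intuition**: raise the probability of actions that lead to high return and lower the probability of actions that lead to low return.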
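The estimator can be checked on a toy problem. The sketch below, in plain Python, assumes a stateless two-action softmax policy (a bandit) in which the reward of each action plays the role of $Q(s, a)$; the function names are illustrative and not part of the tutorial code.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_policy_gradient(logits, action_rewards, n_samples=5000, seed=0):
    """Monte Carlo estimate of grad J for a stateless softmax policy.

    Uses the identity d/d(logit_k) log pi(a) = 1[a == k] - pi(k).
    """
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        ret = action_rewards[a]  # stands in for Q(s, a)
        for k in range(len(logits)):
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += score * ret / n_samples
    return grad

# Action 0 pays 1, action 1 pays 0, and the policy starts out uniform.
grad = estimate_policy_gradient([0.0, 0.0], action_rewards=[1.0, 0.0])
print(grad)
```

The estimate is positive for the rewarded action and negative for the other one, so a gradient ascent step shifts probability toward action 0. The exact gradient for this setup is (0.25, -0.25).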

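### 1.3 Variance Reduction: Baselines

**Problem**: the raw policy gradient estimator has high variance.

**Fix**: subtract a baseline b(s) that depends only on the state, which leaves the expected gradient unchanged:

$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q(s, a) - b(s))\right]$$

**Common baseline**: $b(s) = V(s)$. This is not exactly the variance-minimizing baseline, but it is simple, works well in practice, and is what the critic learns.

**Advantage function**: $A(s, a) = Q(s, a) - V(s)$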
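A short derivation shows why subtracting a baseline does not bias the gradient. It assumes a discrete action space and a baseline that does not depend on the action:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \, b(s)\right] = b(s) \sum_a \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s) = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \, \nabla_\theta 1 = 0$$

The baseline term therefore changes only the variance of the estimator, not its mean.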

---

## Part 2: The Actor-Critic Architecture

### 2.1 Two Networks Working Together

```
State s
    │
    ├──────────────┬────────────────────────┐
    ↓              ↓                        │
┌───────┐     ┌───────┐                     │
│ Actor │     │ Critic│                     │
│ π(a|s)│     │ V(s)  │                     │
└───┬───┘     └───┬───┘                     │
    │             │                         │
    ↓             ↓                         │
 Action a    Advantage A = r + γV' - V      │
    │             │                         │
    ↓             ↓                         │
 Environment ←── Policy-gradient update ←───┘
```

### 2.2 Component Roles

| Component | Input | Output | Role |
|-----------|-------|--------|------|
| Actor | s | π(a\|s) | Decides how to act |
| Critic | s | V(s) | Estimates the value of the state |

---

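## Part 3: Generalized Advantage Estimation (GAE)

### 3.1 The Bias-Variance Trade-Off

| Method | Formula | Bias | Variance |
|--------|---------|------|----------|
| 1-step TD | $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ | High | Low |
| MC | $A_t = R_t - V(s_t)$ | Low | High |

Here $R_t$ is the discounted Monte Carlo return from step $t$.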

### 3.2 The GAE Formula

$$\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

where the TD error is
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

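### 3.3 Effect of the λ Parameter

| λ | Behaviour | When to use |
|---|-----------|-------------|
| 0 | One-step TD | When V(s) is accurate |
| 1 | Monte Carlo | When V(s) is inaccurate |
| 0.95 | A good balance in practice | Recommended default |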
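The two extremes in the table can be verified on a small trajectory. The sketch below, in plain Python, uses the backward form of the GAE sum, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, and assumes a single three-step trajectory with no terminal state inside it; the numbers are made up for illustration.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward GAE recursion for one trajectory without internal terminals."""
    values = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]   # critic estimates V(s_0), V(s_1), V(s_2)
gamma = 0.9

# lam = 0: each advantage is just the one-step TD error
td = gae(rewards, values, last_value=0.0, gamma=gamma, lam=0.0)

# lam = 1: each advantage is the discounted return minus V(s_t)
mc = gae(rewards, values, last_value=0.0, gamma=gamma, lam=1.0)

print(td)
print(mc)
```

With these inputs the λ = 0 result is approximately (0.86, 0.87, 0.70) and the λ = 1 result is approximately (2.21, 1.50, 0.70). Intermediate values of λ interpolate between the two.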

---

## Part 4: Environment Setup

In [None]:
# Import core libraries
import numpy as np
import random
from typing import Tuple, List, Dict, Optional
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical

import matplotlib.pyplot as plt

In [None]:
# Import Gymnasium
try:
    import gymnasium as gym
    HAS_GYM = True
except ImportError:
    HAS_GYM = False
    print("Please install gymnasium: pip install gymnasium")

In [None]:
# Configuration: fix the random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Compute device: {DEVICE}")

plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)

---

## Part 5: Implementing the Rollout Buffer

In [None]:
class RolloutBuffer:
    """
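    On-policy rollout buffer.
    
    Unlike DQN's experience replay, this buffer:
    - stores rollout steps in time order
    - discards its data after every update (on-policy constraint)
    - computes GAE advantage estimates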
    """
    
    def __init__(self, gamma: float = 0.99, gae_lambda: float = 0.95):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.reset()
    
    def reset(self):
        """Clear the buffer."""
        self.states = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.values = []
        self.dones = []
    
    def add(self, state, action, log_prob, reward, value, done):
        """Append one step of data."""
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)
    
    def __len__(self):
        return len(self.states)

In [None]:
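def compute_gae(self, last_value: float) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Compute generalized advantage estimates (GAE).
    
    Backward recursion:
        δ_t = r_t + γ(1-d_t)V(s_{t+1}) - V(s_t)
        Â_t = δ_t + γλ(1-d_t)Â_{t+1}
    
    Args:
        last_value: V of the state that follows the last stored step,
            used to bootstrap when the rollout stops mid-episode.
    
    Returns:
        (returns, advantages), each a float32 tensor of shape (n_steps,)
    """
    rewards = np.asarray(self.rewards, dtype=np.float32)
    values = np.asarray(self.values, dtype=np.float32)
    # Cast the boolean done flags to floats so (1 - d_t) is plain arithmetic
    dones = np.asarray(self.dones, dtype=np.float32)
    n_steps = len(rewards)
    
    # Append the bootstrap value for the state after the last step
    values = np.append(values, np.float32(last_value))
    
    # Backward GAE recursion; d_t = 1 cuts both the bootstrap and the
    # accumulated advantage at episode boundaries
    advantages = np.zeros(n_steps, dtype=np.float32)
    gae = 0.0
    
    for t in reversed(range(n_steps)):
        delta = rewards[t] + self.gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
        advantages[t] = gae
    
    # Value targets: return = advantage + value
    returns = advantages + values[:-1]
    
    return (torch.as_tensor(returns, dtype=torch.float32),
            torch.as_tensor(advantages, dtype=torch.float32))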

RolloutBuffer.compute_gae = compute_gae

In [None]:
# Sanity-check the GAE computation
buffer = RolloutBuffer(gamma=0.99, gae_lambda=0.95)
for i in range(10):
    buffer.add(
        state=np.random.randn(4),
        action=0,
        log_prob=-0.5,
        reward=1.0,
        value=0.5,
        done=(i == 9)
    )

returns, advantages = buffer.compute_gae(last_value=0.0)
print(f"Buffer size: {len(buffer)}")
print(f"Returns shape: {returns.shape}")
print(f"Advantages shape: {advantages.shape}")

---

## Part 6: The Actor-Critic Network

In [None]:
class ActorCriticNetwork(nn.Module):
    """
    Actor-critic network with shared parameters.
    
    Architecture:
        State → shared layers → actor head  → π(a|s)
                              → critic head → V(s)
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        # Shared feature layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        
        # Actor head (policy logits)
        self.actor = nn.Linear(hidden_dim, action_dim)
        
        # Critic head (state value)
        self.critic = nn.Linear(hidden_dim, 1)
        
        self._init_weights()
    
    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                nn.init.zeros_(m.bias)
        # A small gain on the actor output keeps the initial policy close to uniform
        nn.init.orthogonal_(self.actor.weight, gain=0.01)
    
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        features = self.shared(x)
        action_logits = self.actor(features)
        value = self.critic(features)
        return action_logits, value

In [None]:
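def get_action_and_value(self, state, action=None):
    """
    Return the action, its log-probability, the policy entropy, and the value.
    
    Two uses:
    1. Action selection (action=None): sample an action from π
    2. Evaluation (action given): compute log π(a|s) for that action
    """
    logits, value = self(state)
    # Build the distribution from the logits directly; this is numerically
    # more stable than applying softmax first and passing probabilities
    dist = Categorical(logits=logits)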
    
    if action is None:
        action = dist.sample()
    
    log_prob = dist.log_prob(action)
    entropy = dist.entropy()
    
    return action, log_prob, entropy, value.squeeze(-1)

ActorCriticNetwork.get_action_and_value = get_action_and_value
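The log-probability and entropy returned by the method can both be computed from the logits alone. The plain-Python sketch below illustrates the underlying arithmetic with the log-sum-exp trick; it is a stand-alone illustration, not the code path PyTorch uses internally, and the helper names are made up.

```python
import math

def log_softmax(logits):
    """Stable log-probabilities: subtract the log-sum-exp of the logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entropy(logits):
    """H[pi] = -sum_a pi(a) * log pi(a)."""
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)

# A uniform policy over two actions has the maximum entropy, log(2).
uniform_entropy = entropy([0.0, 0.0])

# A very large logit does not overflow because the max is subtracted first.
peaked_logp = log_softmax([1000.0, 0.0])

print(uniform_entropy, peaked_logp)
```

A naive `math.exp(1000.0)` would raise `OverflowError`, which is why working in log space is the usual choice when a policy becomes very confident.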

In [None]:
# Smoke-test the network
net = ActorCriticNetwork(state_dim=4, action_dim=2).to(DEVICE)
x = torch.randn(32, 4).to(DEVICE)

action, log_prob, entropy, value = net.get_action_and_value(x)
print(f"Action: {action.shape}")
print(f"Log-prob: {log_prob.shape}")
print(f"Entropy: {entropy.shape}")
print(f"Value: {value.shape}")
print(f"Parameter count: {sum(p.numel() for p in net.parameters()):,}")

---

## Part 7: The A2C Agent

In [None]:
class A2CAgent:
    """
    Advantage Actor-Critic agent.
    
    Loss:
        L = L_policy + c_v × L_value - c_e × H[π]
    
    where:
        L_policy = -E[log π(a|s) × Â]        (policy gradient)
        L_value  = E[(V(s) - R)²]            (value-function error)
        H[π]     = -Σ_a π(a|s) log π(a|s)    (entropy bonus)
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 256,
        lr: float = 7e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        device: str = 'auto'
    ):
        self.gamma = gamma
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        
        # 'auto' picks CUDA when available; any other value is used as given
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(device)
        
        self.network = ActorCriticNetwork(state_dim, action_dim, hidden_dim).to(self.device)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        self.buffer = RolloutBuffer(gamma=gamma, gae_lambda=gae_lambda)

In [None]:
def get_action(self, state: np.ndarray) -> Tuple[int, float, float]:
    """Sample an action; returns (action, log_prob, value)."""
    state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
    with torch.no_grad():
        action, log_prob, _, value = self.network.get_action_and_value(state_t)
    return action.item(), log_prob.item(), value.item()

def store(self, state, action, log_prob, reward, value, done):
    """Store one transition."""
    self.buffer.add(state, action, log_prob, reward, value, done)

A2CAgent.get_action = get_action
A2CAgent.store = store

In [None]:
def update_a2c(self, last_value: float) -> Dict[str, float]:
    """Run one A2C update on the buffered rollout."""
    returns, advantages = self.buffer.compute_gae(last_value)
    
    states = torch.FloatTensor(np.array(self.buffer.states)).to(self.device)
    actions = torch.LongTensor(self.buffer.actions).to(self.device)
    returns = returns.to(self.device)
    advantages = advantages.to(self.device)
    
    # Normalize advantages (needs at least two samples, otherwise std is NaN)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    # Re-evaluate the stored actions under the current policy
    _, new_log_probs, entropy, values = self.network.get_action_and_value(states, actions)
    
    # Policy loss
    policy_loss = -(new_log_probs * advantages.detach()).mean()
    
    # Value loss
    value_loss = F.mse_loss(values, returns)
    
    # Entropy loss (negative entropy, so minimizing it encourages exploration)
    entropy_loss = -entropy.mean()
    
    # Total loss
    loss = policy_loss + self.value_coef * value_loss + self.entropy_coef * entropy_loss
    
    # Optimization step with gradient-norm clipping
    self.optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(self.network.parameters(), self.max_grad_norm)
    self.optimizer.step()
    
    self.buffer.reset()
    
    return {
        'policy_loss': policy_loss.item(),
        'value_loss': value_loss.item(),
        'entropy': -entropy_loss.item()
    }

A2CAgent.update = update_a2c

---

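## Part 8: Core Ideas of PPO

### 8.1 Trust-Region Constraint

**Problem**: the step size of a policy gradient update is hard to control.
- Step too large → the policy collapses
- Step too small → learning is slow

PPO addresses this by limiting how far each update can move the new policy away from the policy that collected the data.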

### 8.2 The PPO-Clip Objective

**Probability ratio**:
$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

**Clipped objective** (to be maximized):
$$L^{CLIP} = \mathbb{E}_t\left[\min\left(r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

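### 8.3 Effect of Clipping

| Case | Clipped? | Effect |
|------|----------|--------|
| A > 0, r > 1+ε | Yes | Stops the action's probability from being increased further |
| A < 0, r < 1-ε | Yes | Stops the action's probability from being decreased further |
| Otherwise | No | Normal gradient |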
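The three rows of the table can be reproduced numerically. The plain-Python sketch below evaluates the per-sample objective and a finite-difference slope with respect to the ratio, assuming ε = 0.2; the helper names are illustrative only.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite difference of the objective with respect to the ratio."""
    up = ppo_clip_objective(ratio + h, advantage, epsilon)
    down = ppo_clip_objective(ratio - h, advantage, epsilon)
    return (up - down) / (2 * h)

cases = [
    (1.5, 1.0),    # A > 0, r > 1 + eps: clipped
    (0.5, -1.0),   # A < 0, r < 1 - eps: clipped
    (0.5, 1.0),    # A > 0, r < 1 - eps: not clipped
    (1.5, -1.0),   # A < 0, r > 1 + eps: not clipped
    (1.0, 1.0),    # inside the clip range
]
for ratio, advantage in cases:
    value = ppo_clip_objective(ratio, advantage)
    print(ratio, advantage, value, round(slope(ratio, advantage), 3))
```

In the two clipped cases the slope is zero, so those samples contribute no gradient. In the third and fourth cases the ratio lies outside the clip range but the slope is nonzero: the `min` keeps the gradient whenever the update would move the ratio back toward 1.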

---

## Part 9: The PPO Agent

In [None]:
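class PPOAgent:
    """
    Proximal Policy Optimization agent.
    
    Key differences from A2C:
    1. Multiple epochs: each rollout is trained on several times
    2. Mini-batches: the rollout is split into smaller batches
    3. PPO-Clip: the probability ratio is clipped
    """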
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 256,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        clip_epsilon: float = 0.2,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        n_epochs: int = 10,
        mini_batch_size: int = 64,
        device: str = 'auto'
    ):
        self.gamma = gamma
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.n_epochs = n_epochs
        self.mini_batch_size = mini_batch_size
        
        # 'auto' picks CUDA when available; any other value is used as given
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(device)
        
        self.network = ActorCriticNetwork(state_dim, action_dim, hidden_dim).to(self.device)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr, eps=1e-5)
        
        self.buffer = RolloutBuffer(gamma=gamma, gae_lambda=gae_lambda)

In [None]:
# Reuse the action-selection and storage functions defined for A2C
PPOAgent.get_action = get_action
PPOAgent.store = store

In [None]:
def update_ppo(self, last_value: float) -> Dict[str, float]:
    """Run one PPO update (several epochs of mini-batch steps) on the buffered rollout."""
    returns, advantages = self.buffer.compute_gae(last_value)
    
    states = torch.FloatTensor(np.array(self.buffer.states)).to(self.device)
    actions = torch.LongTensor(self.buffer.actions).to(self.device)
    old_log_probs = torch.FloatTensor(self.buffer.log_probs).to(self.device)
    returns = returns.to(self.device)
    advantages = advantages.to(self.device)
    
    # Normalize advantages over the whole rollout
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    batch_size = len(states)
    total_losses = {'policy': 0, 'value': 0, 'entropy': 0}
    n_updates = 0
    
    # Several passes over the same rollout
    for _ in range(self.n_epochs):
        indices = np.random.permutation(batch_size)
        
        # Mini-batch updates
        for start in range(0, batch_size, self.mini_batch_size):
            end = start + self.mini_batch_size
            mb_idx = indices[start:end]
            
            mb_states = states[mb_idx]
            mb_actions = actions[mb_idx]
            mb_old_log_probs = old_log_probs[mb_idx]
            mb_returns = returns[mb_idx]
            mb_advantages = advantages[mb_idx]
            
            # Re-evaluate the stored actions under the current policy
            _, new_log_probs, entropy, values = self.network.get_action_and_value(
                mb_states, mb_actions
            )
            
            # Probability ratio r = π_new / π_old, computed in log space
            ratio = torch.exp(new_log_probs - mb_old_log_probs)
            
            # PPO-Clip objective (negated because the optimizer minimizes)
            surr1 = ratio * mb_advantages
            surr2 = torch.clamp(
                ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon
            ) * mb_advantages
            policy_loss = -torch.min(surr1, surr2).mean()
            
            # Value loss
            value_loss = F.mse_loss(values, mb_returns)
            
            # Entropy loss (negative entropy)
            entropy_loss = -entropy.mean()
            
            # Total loss
            loss = (policy_loss + 
                   self.value_coef * value_loss + 
                   self.entropy_coef * entropy_loss)
            
            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.network.parameters(), self.max_grad_norm)
            self.optimizer.step()
            
            total_losses['policy'] += policy_loss.item()
            total_losses['value'] += value_loss.item()
            total_losses['entropy'] += entropy.mean().item()
            n_updates += 1
    
    self.buffer.reset()
    
    return {
        'policy_loss': total_losses['policy'] / n_updates,
        'value_loss': total_losses['value'] / n_updates,
        'entropy': total_losses['entropy'] / n_updates
    }

PPOAgent.update = update_ppo
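How many optimizer steps a single rollout triggers follows directly from the nested epoch and mini-batch loops. The helper below is a small plain-Python sketch of that arithmetic; its name is made up for illustration, and the inputs use the rollout length of 128 from the PPO training run together with the agent's default mini-batch size of 64 and 10 epochs.

```python
import math

def gradient_steps_per_rollout(n_steps, mini_batch_size, n_epochs):
    """Optimizer steps per rollout for an epoch / mini-batch loop.

    The last mini-batch of an epoch may be smaller than the others,
    which is why the division is rounded up.
    """
    batches_per_epoch = math.ceil(n_steps / mini_batch_size)
    return batches_per_epoch * n_epochs

# 128 steps split into 2 mini-batches of 64, repeated for 10 epochs
steps_even = gradient_steps_per_rollout(n_steps=128, mini_batch_size=64, n_epochs=10)

# 100 steps give one batch of 64 and one of 36 per epoch
steps_ragged = gradient_steps_per_rollout(n_steps=100, mini_batch_size=64, n_epochs=10)

print(steps_even, steps_ragged)
```

A2C, by contrast, takes exactly one optimizer step per rollout, which is the main reason PPO gets more out of each batch of experience.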

---

## Part 10: Training and Evaluation

In [None]:
def train_policy_gradient(
    agent,
    env_name: str = 'CartPole-v1',
    total_steps: int = 50000,
    n_steps: int = 128,
    seed: int = 42,
    algo_name: str = 'Agent',
    verbose: bool = True
) -> List[float]:
    """Train a policy gradient agent and return the per-episode rewards."""
    if not HAS_GYM:
        return []
    
    env = gym.make(env_name)
    
    if verbose:
        print(f"\n{'='*50}")
        print(f"Training {algo_name}")
        print(f"{'='*50}")
    
    rewards_history = []
    episode_reward = 0.0
    best_avg = float('-inf')
    
    state, _ = env.reset(seed=seed)
    step = 0
    
    while step < total_steps:
        for _ in range(n_steps):
            action, log_prob, value = agent.get_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            agent.store(state, action, log_prob, reward, value, done)
            episode_reward += reward
            step += 1
            
            if done:
                rewards_history.append(episode_reward)
                episode_reward = 0.0
                state, _ = env.reset()
                
                # Log once per 20 completed episodes (here rather than after
                # each update, so the same line is not printed repeatedly)
                if verbose and len(rewards_history) % 20 == 0:
                    avg = np.mean(rewards_history[-20:])
                    best_avg = max(best_avg, avg)
                    print(f"Episode {len(rewards_history):4d} | Avg: {avg:7.2f} | Best: {best_avg:7.2f}")
            else:
                state = next_state
            
            if step >= total_steps:
                break
        
        # Bootstrap value for the state that follows the last stored step
        _, _, last_value = agent.get_action(state)
        agent.update(last_value)
    
    env.close()
    return rewards_history
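# The loop above folds `terminated` and `truncated` into a single `done` flag,
# so reaching the time limit is treated like the pole falling over and the
# value of the next state is dropped. A common refinement keeps the two flags
# apart. The plain-Python sketch below shows the idea for a one-step value
# target; it assumes that only true termination should stop bootstrapping,
# and the helper name `td_target` is hypothetical.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step value target that bootstraps unless the episode truly ended."""
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap

# Pole fell over: there is no future reward, so the target is just the reward.
failed = td_target(reward=1.0, next_value=20.0, terminated=True)

# Time limit reached (truncated=True, terminated=False): the state is still
# "alive", so the critic's estimate of the next state is kept.
timed_out = td_target(reward=1.0, next_value=20.0, terminated=False)

print(failed, timed_out)
```

# Applying this inside the rollout buffer would also require the value of the
# final observation before the reset, which the tutorial code does not store.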

In [None]:
# Train A2C
if HAS_GYM:
    a2c_agent = A2CAgent(state_dim=4, action_dim=2)
    rewards_a2c = train_policy_gradient(
        a2c_agent,
        total_steps=50000,
        n_steps=5,
        algo_name='A2C'
    )

In [None]:
# Train PPO
if HAS_GYM:
    ppo_agent = PPOAgent(state_dim=4, action_dim=2)
    rewards_ppo = train_policy_gradient(
        ppo_agent,
        total_steps=50000,
        n_steps=128,
        algo_name='PPO'
    )

---

## Part 11: Visualizing the Results

In [None]:
def plot_comparison(results: dict, window: int = 20):
    """Plot smoothed learning curves for each algorithm."""
    plt.figure(figsize=(12, 5))
    colors = ['#1f77b4', '#ff7f0e']
    
    for idx, (name, rewards) in enumerate(results.items()):
        if len(rewards) >= window:
            # Moving average over `window` episodes
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            # Cycle through the palette so more than two curves also work
            plt.plot(smoothed, label=name, color=colors[idx % len(colors)], linewidth=2)
    
    plt.xlabel('Episode', fontsize=12)
    plt.ylabel('Total Reward', fontsize=12)
    plt.title('A2C vs PPO Learning Curves', fontsize=14)
    plt.legend(fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
if HAS_GYM and rewards_a2c and rewards_ppo:
    plot_comparison({
        'A2C': rewards_a2c,
        'PPO': rewards_ppo
    })

---

## Part 12: Visualizing PPO Clipping

In [None]:
def visualize_ppo_clip(epsilon: float = 0.2):
    """Visualize the PPO clipping mechanism for positive and negative advantages."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    ratios = np.linspace(0.5, 1.5, 100)
    
    # One panel per sign of the advantage
    panels = [
        (1.0, 'Positive advantage (A > 0)'),
        (-1.0, 'Negative advantage (A < 0)'),
    ]
    
    for ax, (advantage, title) in zip(axes, panels):
        unclipped = ratios * advantage
        clipped = np.clip(ratios, 1-epsilon, 1+epsilon) * advantage
        objective = np.minimum(unclipped, clipped)
        
        ax.plot(ratios, unclipped, 'b--', label='Unclipped', alpha=0.7)
        ax.plot(ratios, clipped, 'r--', label='Clipped', alpha=0.7)
        ax.plot(ratios, objective, 'g-', linewidth=2, label='PPO objective')
        ax.axvline(1-epsilon, color='gray', linestyle=':', alpha=0.5)
        ax.axvline(1+epsilon, color='gray', linestyle=':', alpha=0.5)
        ax.set_xlabel('Probability ratio r(θ)')
        ax.set_ylabel('Objective value')
        ax.set_title(f'{title}, ε = {epsilon}')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_ppo_clip(epsilon=0.2)

---

## Part 13: Unit Tests

In [None]:
def run_tests():
    """Run the unit tests."""
    print("Running unit tests...\n")
    passed = failed = 0
    
    # Test 1: RolloutBuffer
    try:
        buf = RolloutBuffer(gamma=0.99, gae_lambda=0.95)
        for i in range(10):
            buf.add(np.random.randn(4), 0, -0.5, 1.0, 0.5, i == 9)
        returns, advantages = buf.compute_gae(0.0)
        assert returns.shape == (10,)
        print("Test 1 passed: RolloutBuffer")
        passed += 1
    except Exception as e:
        print(f"Test 1 failed: {e}")
        failed += 1
    
    # Test 2: ActorCriticNetwork
    try:
        net = ActorCriticNetwork(4, 2, 64)
        x = torch.randn(32, 4)
        action, log_prob, entropy, value = net.get_action_and_value(x)
        assert action.shape == (32,)
        assert value.shape == (32,)
        print("Test 2 passed: ActorCriticNetwork")
        passed += 1
    except Exception as e:
        print(f"Test 2 failed: {e}")
        failed += 1
    
    # Test 3: A2CAgent
    try:
        agent = A2CAgent(4, 2, device='cpu')
        state = np.random.randn(4).astype(np.float32)
        action, lp, val = agent.get_action(state)
        assert 0 <= action < 2
        for i in range(10):
            agent.store(state, 0, -0.5, 1.0, 0.5, i == 9)
        agent.update(0.0)
        print("Test 3 passed: A2CAgent")
        passed += 1
    except Exception as e:
        print(f"Test 3 failed: {e}")
        failed += 1
    
    # Test 4: PPOAgent
    try:
        agent = PPOAgent(4, 2, mini_batch_size=32, device='cpu')
        state = np.random.randn(4).astype(np.float32)
        action, lp, val = agent.get_action(state)
        assert 0 <= action < 2
        for i in range(64):
            agent.store(state, 0, -0.5, 1.0, 0.5, i == 63)
        agent.update(0.0)
        print("Test 4 passed: PPOAgent")
        passed += 1
    except Exception as e:
        print(f"Test 4 failed: {e}")
        failed += 1
    
    print(f"\n{'='*40}")
    print(f"Result: {passed} passed, {failed} failed")
    print(f"{'='*40}")

run_tests()

---

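## Summary

### Algorithm Comparison

| Property | A2C | PPO |
|----------|-----|-----|
| Data usage | Each rollout is used once, then discarded | Each rollout is reused for several epochs, then discarded |
| Update rule | Plain policy gradient step | Clipped surrogate objective |
| Stability | Moderate | High |
| Tuning difficulty | Moderate | Low |

### Tuning Suggestions

| Parameter | A2C | PPO |
|-----------|-----|-----|
| Learning rate | 7e-4 | 3e-4 |
| γ | 0.99 | 0.99 |
| GAE λ | 0.95 | 0.95 |
| Clip ε | - | 0.2 |
| N-steps | 5 | 2048 (the CartPole run in this tutorial uses 128) |
| Epochs | 1 | 10 |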
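One way to keep these settings together in code is a small frozen dataclass. The sketch below is only an illustration: `TrainConfig` and the two constants are hypothetical names that the tutorial code does not use, and the values are copied from the table except for PPO's `n_steps`, which follows the 128 used in the training cell.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    lr: float
    n_steps: int
    n_epochs: int
    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_epsilon: Optional[float] = None  # None means no ratio clipping (A2C)

A2C_CONFIG = TrainConfig(lr=7e-4, n_steps=5, n_epochs=1)
PPO_CONFIG = TrainConfig(lr=3e-4, n_steps=128, n_epochs=10, clip_epsilon=0.2)

print(A2C_CONFIG)
print(PPO_CONFIG)
```

Freezing the dataclass makes accidental mutation during a run raise an error, which helps when comparing results across algorithms.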

---

## References

1. Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016
2. Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation", ICLR 2016
3. Schulman et al., "Proximal Policy Optimization Algorithms", arXiv:1707.06347, 2017
