# Policy Gradient Fundamentals

This tutorial explains the mathematical principles behind policy gradient methods and how to implement them.

## Contents
1. Core Idea
2. Mathematical Derivation
3. The REINFORCE Algorithm
4. Variance Reduction Techniques
5. Exercises

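## 1. Core Idea

### 1.1 From Value-Based to Policy-Based Methods

**Value-based methods**:
- Learn a value function $Q(s,a)$ or $V(s)$
- Derive the policy from the value function: $\pi(s) = \arg\max_a Q(s,a)$
- Drawback: the $\arg\max$ is hard to compute over a continuous action space

**Policy-based methods**:
- Parameterize the policy $\pi_\theta(a|s)$ directly
- Optimize the expected return by gradient ascent
- Advantage: continuous actions and stochastic policies are supported naturally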
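
The contrast between the two families can be sketched in a few lines of plain Python. This is only an illustration: the `q_values` list and the helper names `greedy_action`, `softmax`, and `sample_action` are assumptions made for the example, and the standard library is used so that the snippet runs without extra dependencies.

```python
import math
import random

# Hypothetical action values for one state with three discrete actions.
q_values = [1.0, 2.5, 2.4]


def greedy_action(q):
    """Value-based control: pick the action with the highest estimated value."""
    return max(range(len(q)), key=lambda a: q[a])


def softmax(prefs):
    """Turn real-valued preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Policy-based control: sample an action from pi(a|s)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
probs = softmax(q_values)  # here the preferences simply reuse q_values
counts = [0] * len(q_values)
for _ in range(1000):
    counts[sample_action(probs, rng)] += 1

print("greedy action:", greedy_action(q_values))
print("policy probabilities:", [round(p, 3) for p in probs])
print("sampled action counts:", counts)
```

The greedy rule always returns the same action, while the stochastic policy keeps choosing the nearly-as-good third action a large fraction of the time, which is the behaviour a policy-based method can represent and tune directly.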

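### 1.2 Policy Parameterization

**Discrete action spaces** use a softmax policy, where $f_\theta(s)$ is the vector of logits produced by the network:
$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s)_a)}{\sum_{a'} \exp(f_\theta(s)_{a'})}$$

**Continuous action spaces** use a Gaussian policy:
$$\pi_\theta(a|s) = \mathcal{N}(a | \mu_\theta(s), \sigma_\theta(s)^2)$$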
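
Policy gradient methods work with $\log \pi_\theta(a|s)$, so it may help to see how that quantity is computed for both parameterizations. The following is a minimal sketch in plain Python. The numbers standing in for the network outputs (`logits`, `mean`, `std`) are made up, and the helper names `log_softmax` and `gaussian_log_prob` are assumptions for this example.

```python
import math
import random


def log_softmax(logits):
    """Log-probabilities of a softmax policy, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gaussian_log_prob(action, mean, std):
    """Log-density of a scalar action under N(mean, std^2)."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


# Discrete case: the logits stand in for the network output f_theta(s).
logits = [0.2, 1.5, -0.3]
log_probs = log_softmax(logits)

# Continuous case: mean and std stand in for mu_theta(s) and sigma_theta(s).
rng = random.Random(0)
mean, std = 0.5, 0.2
action = rng.gauss(mean, std)

print("discrete log-probs:", [round(lp, 4) for lp in log_probs])
print("sampled continuous action:", round(action, 4))
print("its log-density:", round(gaussian_log_prob(action, mean, std), 4))
```

In a PyTorch implementation the same quantities would typically come from `torch.distributions.Categorical` and `torch.distributions.Normal`.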

In [None]:
# Environment setup
import sys
sys.path.insert(0, '..')

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment ready!")

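## 2. Mathematical Derivation

### 2.1 Objective Function

Policy gradient methods maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ is a trajectory and $r_t$ is the reward received after taking action $a_t$ in state $s_t$.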
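
Because $J(\theta)$ is an expectation over trajectories, it can be approximated by averaging the discounted return of sampled rollouts. The sketch below uses made-up reward sequences (`sampled_rewards`) rather than a real environment. The function names are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t for one trajectory's reward list."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_objective(reward_sequences, gamma):
    """Monte Carlo estimate of J(theta): average R(tau) over sampled trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences from three rollouts of the same policy.
sampled_rewards = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
]
gamma = 0.9

for rs in sampled_rewards:
    print(rs, "->", round(discounted_return(rs, gamma), 4))
print("J estimate:", round(estimate_objective(sampled_rewards, gamma), 4))
```

With more rollouts the average approaches the true expectation. The policy gradient derivation in the next sections asks how this average changes as $\theta$ changes.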

### 2.2 Policy Gradient Theorem

**Key question**: how do we compute $\nabla_\theta J(\theta)$?

**Log-derivative trick**:
$$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)$$

**Policy gradient theorem** (Sutton et al., 1999):
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot Q^\pi(s_t, a_t)\right]$$

In [None]:
# Demonstrate the log-derivative trick
def demonstrate_log_derivative_trick():
    """Numerically verify the log-derivative trick."""

    # A simple softmax policy
    theta = torch.tensor([1.0, 2.0], requires_grad=True)

    # Compute action probabilities
    probs = torch.softmax(theta, dim=0)

    # Method 1: differentiate the probability directly
    grad_direct = torch.autograd.grad(probs[0], theta, retain_graph=True)[0]

    # Method 2: use the log-derivative trick
    log_prob = torch.log(probs[0])
    grad_log = torch.autograd.grad(log_prob, theta, retain_graph=True)[0]
    grad_trick = probs[0] * grad_log

    print("Direct gradient:", grad_direct.detach().numpy())
    print("Log-trick:", grad_trick.detach().numpy())
    print("Difference:", (grad_direct - grad_trick).abs().max().item())

demonstrate_log_derivative_trick()

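### 2.3 Derivation

**Step 1**: Trajectory probability
$$p(\tau|\theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t)$$

**Step 2**: Log trajectory probability
$$\log p(\tau|\theta) = \log p(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\right]$$

**Step 3**: Gradient (the initial-state distribution and the environment dynamics do not depend on $\theta$)
$$\nabla_\theta \log p(\tau|\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

**Step 4**: Policy gradient. Applying the log-derivative trick to $p(\tau|\theta)$ gives $\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[R(\tau) \nabla_\theta \log p(\tau|\theta)\right]$, and substituting Step 3 yields
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$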
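
The result of Step 4 can be sanity-checked on the smallest possible case: a one-step problem ($T = 1$) with a softmax policy over three actions, where the expectation can be computed exactly by enumerating the actions. The sketch below compares the score-function form $\mathbb{E}_a[r(a) \nabla_\theta \log \pi_\theta(a)]$ with a finite-difference gradient of $J(\theta) = \sum_a \pi_\theta(a) r(a)$. The values of `theta` and `rewards` are arbitrary, and all function names are assumptions for this example.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score(theta, action):
    """grad_theta log pi_theta(action) for a softmax policy: one_hot(action) - pi."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(theta))]


def score_function_gradient(theta, rewards):
    """E_a[r(a) * grad log pi(a)], with the expectation computed exactly."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        s = score(theta, a)
        for k in range(len(theta)):
            grad[k] += p * r * s[k]
    return grad


def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central differences on J(theta), used only as an independent check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += h
        down[k] -= h
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * h))
    return grad


theta = [0.1, -0.4, 0.7]   # illustrative policy parameters
rewards = [1.0, 0.0, 2.0]  # illustrative per-action rewards

g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print("score-function gradient:   ", [round(g, 6) for g in g_score])
print("finite-difference gradient:", [round(g, 6) for g in g_fd])
```

The two gradients should agree up to finite-difference error. In a real task the expectation is replaced by an average over sampled trajectories, which is where the variance discussed later comes from.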

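## 3. The REINFORCE Algorithm

### 3.1 Algorithm Description

REINFORCE uses the Monte Carlo return $G_t$ as an unbiased estimate of $Q^\pi(s_t, a_t)$:

$$G_t = \sum_{k=0}^{T-1-t} \gamma^k r_{t+k}$$

**Gradient estimate** from $N$ sampled episodes, where episode $i$ has length $T_i$:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(a_t^i|s_t^i) \cdot G_t^i$$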

In [None]:
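def compute_returns(rewards, gamma=0.99):
    """
    Compute Monte Carlo returns.

    Formula:
        G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
            = r_t + γG_{t+1}

    The backward recursion computes every G_t in a single pass.
    """
    returns = []
    G = 0.0

    # Walk backwards so each step reuses the return of the step after it
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order

    return torch.tensor(returns, dtype=torch.float32)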

In [None]:
# Demonstrate the return computation
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
gamma = 0.99

returns = compute_returns(rewards, gamma)
print("Rewards:", rewards)
print("Returns:", returns.numpy().round(4))
print()
print("Check G_0 = 1 + 0.99×1 + 0.99²×1 + 0.99³×1 + 0.99⁴×1")
print(f"By hand:  {1 + 0.99 + 0.99**2 + 0.99**3 + 0.99**4:.4f}")
print(f"Function: {returns[0]:.4f}")

### 3.2 A Simple Implementation

In [None]:
class SimplePolicy(nn.Module):
    """A simple policy network for discrete actions."""
    
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state):
        """Return the action logits."""
        return self.net(state)
    
    def get_action(self, state):
        """Sample an action and return it with its log-probability."""
        logits = self.forward(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob

In [None]:
def reinforce_update(policy, optimizer, states, actions, returns):
    """
    One REINFORCE update step.

    Loss:
        L = -E[log π(a|s) · G]

    The minus sign turns gradient ascent on J into gradient descent on L.
    """
    # Log-probabilities of the actions that were taken
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    # Policy gradient loss
    policy_loss = -(log_probs * returns).mean()

    # Optimization step
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    
    return policy_loss.item()

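## 4. Variance Reduction Techniques

### 4.1 The Problem: High Variance

The REINFORCE gradient estimate $\hat{g}$ has high variance. Every sample is weighted by the full return $G_t$, so the variance grows roughly with the second moment of the returns:
$$\text{Var}[\hat{g}] \propto \mathbb{E}[G_t^2]$$

As a result:
- Training is unstable
- Many samples are needed
- Convergence is slow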

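### 4.2 Baseline

**Key insight**: subtracting any baseline $b(s)$ that does not depend on the action leaves the expected gradient unchanged:

$$\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a|s)] = 0$$

**Choosing the baseline**: the state-value function $b(s) = V^\pi(s)$ is the standard choice. It is not exactly variance-optimal (the variance-minimizing baseline additionally weights returns by $\|\nabla_\theta \log \pi_\theta(a|s)\|^2$), but it is close in practice and straightforward to learn.

**Advantage function**, estimated from a single sample by $G_t - V(s_t)$:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) \approx G_t - V(s_t)$$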
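
Both claims, same expectation and lower variance, can be checked exactly on a one-step problem by enumerating the actions of a softmax policy. The sketch below is an illustration under assumed numbers: the logits and `rewards` are made up, the rewards deliberately share a large common offset, and `estimator_moments` is a hypothetical helper. It uses $V = \mathbb{E}_a[r(a)]$ as the baseline, which is a common choice and not necessarily the variance-optimal one.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def score(probs, action):
    """grad log pi(action) for a softmax policy: one_hot(action) - pi."""
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(probs))]


def estimator_moments(probs, rewards, baseline):
    """Exact mean and total variance of g(a) = (r(a) - b) * grad log pi(a), a ~ pi."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p in enumerate(probs):
        g = [(rewards[a] - baseline) * s for s in score(probs, a)]
        for k in range(n):
            mean[k] += p * g[k]
        second_moment += p * sum(x * x for x in g)
    # Trace of the covariance matrix: E[|g|^2] - |E[g]|^2
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = softmax([0.1, -0.4, 0.7])
rewards = [10.0, 9.0, 12.0]  # illustrative rewards with a large common offset
value = sum(p * r for p, r in zip(probs, rewards))  # V = E_a[r(a)]

mean_plain, var_plain = estimator_moments(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(probs, rewards, baseline=value)

print("mean without baseline:", [round(m, 6) for m in mean_plain])
print("mean with baseline:   ", [round(m, 6) for m in mean_base])
print("variance without baseline:", round(var_plain, 4))
print("variance with baseline:   ", round(var_base, 4))
```

The two means coincide, while the variance with the baseline is much smaller because the common offset in the rewards no longer multiplies the score.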

In [None]:
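def demonstrate_variance_reduction():
    """
    Demonstrate how a baseline shrinks the weights in the gradient estimate.

    Each sample contributes grad log pi * weight, so the estimator's variance
    grows with the second moment E[weight^2]. (np.var of the weights would not
    show this, because variance is unchanged by subtracting a constant.)
    """
    np.random.seed(42)

    # Simulated returns
    returns = np.random.normal(100, 20, 1000)  # mean 100, std 20

    # No baseline: the weight is the raw return
    second_moment_no_baseline = np.mean(returns ** 2)

    # With baseline: subtract the mean return
    baseline = np.mean(returns)
    advantages = returns - baseline
    second_moment_with_baseline = np.mean(advantages ** 2)

    print(f"E[G^2] without baseline:  {second_moment_no_baseline:.2f}")
    print(f"E[(G-b)^2] with baseline: {second_moment_with_baseline:.2f}")
    print(f"Reduction: {(1 - second_moment_with_baseline/second_moment_no_baseline)*100:.1f}%")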
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    axes[0].hist(returns, bins=30, alpha=0.7, label='Returns')
    axes[0].axvline(baseline, color='r', linestyle='--', label=f'Baseline={baseline:.1f}')
    axes[0].set_title('No baseline: distribution of returns')
    axes[0].legend()

    axes[1].hist(advantages, bins=30, alpha=0.7, color='green', label='Advantages')
    axes[1].axvline(0, color='r', linestyle='--', label='Zero')
    axes[1].set_title('With baseline: distribution of advantages')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()

demonstrate_variance_reduction()

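### 4.3 Return Normalization

Another simple and effective variance reduction technique is to standardize the returns:

$$G_t^{\text{norm}} = \frac{G_t - \mu_G}{\sigma_G + \epsilon}$$

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the collected returns, and $\epsilon$ guards against division by zero.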

In [None]:
def normalize_returns(returns, eps=1e-8):
    """
    Standardize returns to zero mean and unit standard deviation.

    Benefits:
    1. Reduces variance
    2. Keeps the gradient scale consistent
    3. Makes training insensitive to reward scaling
    """
    # as_tensor accepts lists and existing tensors alike
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return (returns - returns.mean()) / (returns.std() + eps)

# Demo
raw_returns = [100, 120, 80, 150, 90]
norm_returns = normalize_returns(raw_returns)

print("Raw returns:", raw_returns)
print("Normalized:", norm_returns.numpy().round(3))
print(f"Mean: {norm_returns.mean():.6f}, std: {norm_returns.std():.6f}")

## 5. Exercises

### Exercise 1: Implement Full REINFORCE

In [None]:
import gymnasium as gym

def train_reinforce(env_name='CartPole-v1', num_episodes=500, gamma=0.99, lr=1e-3):
    """
    Full REINFORCE training loop.
    """
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    policy = SimplePolicy(state_dim, action_dim)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    
    episode_rewards = []
    
    for episode in range(num_episodes):
        # Collect one trajectory
        states, actions, rewards, log_probs = [], [], [], []
        state, _ = env.reset()
        done = False
        
        while not done:
            state_t = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            action, log_prob = policy.get_action(state_t)
            
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            
            states.append(state)
            actions.append(action.item())
            rewards.append(reward)
            log_probs.append(log_prob)
            
            state = next_state
        
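        # Compute returns
        returns = compute_returns(rewards, gamma)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize

        # Update the policy. Each log_prob has shape (1,); concatenating gives
        # shape (T,), which matches returns elementwise.
        log_probs = torch.cat(log_probs)
        policy_loss = -(log_probs * returns).mean()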
        
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        
        episode_rewards.append(sum(rewards))
        
        if (episode + 1) % 50 == 0:
            mean_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode+1}, Mean Reward: {mean_reward:.2f}")
    
    env.close()
    return episode_rewards

# Train (with fewer episodes, for demonstration)
rewards = train_reinforce(num_episodes=200)

In [None]:
# Plot the training curve
def plot_rewards(rewards, window=20):
    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.3, label='Episode Reward')
    
    # Moving average
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), smoothed, label=f'{window}-Episode Average')
    
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('REINFORCE Training on CartPole')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

plot_rewards(rewards)

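### Exercise 2: Questions to Think About

1. **Why does REINFORCE need complete episodes?**
   - Hint: consider how $G_t$ is computed

2. **Why does a baseline leave the expected gradient unchanged?**
   - Hint: prove that $\mathbb{E}[\nabla \log \pi \cdot b(s)] = 0$

3. **How should the optimal baseline be chosen?**
   - Hint: minimize the variance of the gradient estimate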

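## Summary

### Key Takeaways

1. **Policy gradient theorem**: $\nabla_\theta J = \mathbb{E}[\nabla \log \pi \cdot Q]$

2. **REINFORCE**: estimates $Q$ with the Monte Carlo return $G_t$

3. **Variance reduction**: baselines and return normalization

4. **Advantage function**: $A(s,a) = Q(s,a) - V(s)$

### Next Steps

- Actor-Critic: learn a value function to use as the baseline
- GAE: better advantage estimates
- PPO: stable policy updates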