# Deep Q-Network Variants: A Detailed Guide

---

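## Module Overview

This module gives a systematic introduction to the main variants of the Deep Q-Network (DQN), covering theoretical derivations, core innovations, implementation details, and comparative analysis.

### Learning Objectives

After completing this module, you will be able to:

1. **Understand the limitations of the original DQN** and its failure modes in practice
2. **Master the mathematics behind each DQN variant**, including Double DQN, Dueling DQN, Noisy Networks, and Categorical DQN
3. **Implement several replay buffer strategies** (uniform, prioritized, N-step)
4. **Analyze how Rainbow** combines these improvements to reach state-of-the-art Atari performance at the time of its publication
5. **Run comparison experiments** and interpret the results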

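### Contents

1. [Background and Problem Definition](#1-background-and-problem-definition)
2. [Double DQN: Reducing Overestimation Bias](#2-double-dqn-reducing-overestimation-bias)
3. [Dueling DQN: Value-Advantage Decomposition](#3-dueling-dqn-value-advantage-decomposition)
4. [Noisy Networks: Parametric Exploration](#4-noisy-networks-parametric-exploration)
5. [Categorical DQN (C51): Distributional Reinforcement Learning](#5-categorical-dqn-c51-distributional-reinforcement-learning)
6. [Prioritized Experience Replay](#6-prioritized-experience-replay)
7. [N-step Learning: Multi-step Returns](#7-n-step-learning-multi-step-returns)
8. [Rainbow: Combining the Improvements](#8-rainbow-combining-the-improvements)
9. [Experimental Comparison and Analysis](#9-experimental-comparison-and-analysis)
10. [Summary and Outlook](#10-summary-and-outlook)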

---

## Environment Setup and Imports

In [None]:
# Standard library
import warnings
warnings.filterwarnings('ignore')

# Scientific computing
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# Deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reinforcement learning environments
try:
    import gymnasium as gym
    HAS_GYM = True
except ImportError:
    HAS_GYM = False
    print("Warning: gymnasium not installed. Run: pip install gymnasium")

# Local module
from dqn_variants import (
    DQNVariant,
    DQNVariantConfig,
    DQNVariantAgent,
    ReplayBuffer,
    PrioritizedReplayBuffer,
    NStepReplayBuffer,
    SumTree,
    NoisyLinear,
    DQNNetwork,
    DuelingNetwork,
    NoisyNetwork,
    CategoricalNetwork,
    RainbowNetwork,
    train_agent,
    evaluate_agent,
    compare_variants,
    plot_comparison,
)

# Set random seeds
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Plot settings
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.unicode_minus'] = False

print("Environment configured successfully!")

---

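## 1. Background and Problem Definition

### 1.1 Review of the Original DQN

Deep Q-Network (Mnih et al., 2015) is a landmark in deep reinforcement learning: it was the first to learn control policies end to end, from raw pixels to actions.

**Core idea**: approximate the optimal action-value function $Q^*(s, a)$ with a neural network $Q(s, a; \theta)$.

**TD target** (where $\theta^-$ denotes the parameters of the target network):
$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

**Loss function**:
$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ (Q(s, a; \theta) - y)^2 \right]$$

**Two key innovations**:
1. **Experience Replay**: breaks the correlation between consecutive samples and improves data efficiency
2. **Target Network**: keeps the learning target stable and helps prevent divergence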
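
To make the TD target and the loss concrete, the following minimal NumPy sketch evaluates both for a tiny batch. The array names (`q_online`, `q_target_next`) and the `dones` mask are illustrative assumptions and are not part of the `dqn_variants` module; the mask switches off bootstrapping on terminal transitions, a standard detail that the formula leaves implicit.

```python
import numpy as np

def dqn_td_targets(rewards, dones, q_target_next, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), without bootstrapping on terminal steps."""
    max_next = q_target_next.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def dqn_loss(q_online, actions, targets):
    """Mean squared TD error over the batch."""
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((q_taken - targets) ** 2)

# Toy batch: 2 transitions, 3 actions (all numbers are made up for illustration).
q_online = np.array([[1.0, 2.0, 0.5], [0.0, 0.3, 0.2]])       # Q(s, .; theta)
q_target_next = np.array([[0.5, 1.0, 0.0], [2.0, 1.0, 3.0]])  # Q(s', .; theta^-)
actions = np.array([1, 2])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])  # the second transition ends the episode

targets = dqn_td_targets(rewards, dones, q_target_next, gamma=0.9)
loss = dqn_loss(q_online, actions, targets)
print("targets:", targets)
print("loss:", loss)
```

In a real agent the targets are treated as constants during backpropagation, which is the role the target network plays.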

### 1.2 Limitations of the Original DQN

Despite its breakthrough results, DQN has several fundamental limitations:

In [None]:
# Visualize the limitations of DQN
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Overestimation bias
ax1 = axes[0, 0]
np.random.seed(42)
true_q = np.array([1.0, 1.5, 0.8, 1.2])  # true Q-values
noise = np.random.randn(4, 100) * 0.5  # estimation noise
estimated_q = true_q[:, np.newaxis] + noise
max_estimated = np.max(estimated_q, axis=0)

ax1.hist(max_estimated, bins=30, alpha=0.7, density=True, label='E[max Q]')
ax1.axvline(np.max(true_q), color='red', linestyle='--', linewidth=2, label=f'max Q* = {np.max(true_q)}')
ax1.axvline(np.mean(max_estimated), color='blue', linestyle='-', linewidth=2, label=f'E[max Q] = {np.mean(max_estimated):.2f}')
ax1.set_xlabel('Value')
ax1.set_ylabel('Density')
ax1.set_title('1. Overestimation Bias')
ax1.legend()
ax1.text(0.05, 0.95, r'$E[\max_a Q] \geq \max_a E[Q]$', transform=ax1.transAxes, fontsize=12, verticalalignment='top')

# 2. Sample inefficiency
ax2 = axes[0, 1]
buffer_size = 1000
td_errors = np.abs(np.random.randn(buffer_size)) * np.exp(-np.arange(buffer_size) / 200)  # TD errors that decay along the buffer index
uniform_samples = np.random.choice(buffer_size, 100, replace=False)
prioritized_samples = np.random.choice(buffer_size, 100, replace=False, p=td_errors/td_errors.sum())

ax2.scatter(np.arange(buffer_size), td_errors, alpha=0.3, s=10, label='All transitions')
ax2.scatter(uniform_samples, td_errors[uniform_samples], color='red', s=30, alpha=0.8, label='Uniform sampling')
ax2.scatter(prioritized_samples, td_errors[prioritized_samples], color='green', s=30, alpha=0.8, marker='^', label='Prioritized sampling')
ax2.set_xlabel('Transition index')
ax2.set_ylabel('|TD error|')
ax2.set_title('2. Sample Inefficiency')
ax2.legend()

# 3. Exploration challenge
ax3 = axes[1, 0]
steps = np.arange(100000)
epsilon_greedy = np.maximum(1.0 - steps / 50000, 0.1)  # linear ε-greedy decay with a floor of 0.1
noisy_exploration = 0.5 * np.exp(-steps / 30000) + 0.1 * np.sin(steps / 5000)  # stylized curve standing in for the state-dependent exploration of Noisy Networks
noisy_exploration = np.maximum(noisy_exploration, 0.05)

ax3.plot(steps, epsilon_greedy, 'r-', linewidth=2, label='ε-greedy (state-independent)')
ax3.plot(steps, noisy_exploration, 'g-', linewidth=2, alpha=0.8, label='Noisy Networks (state-dependent)')
ax3.set_xlabel('Training steps')
ax3.set_ylabel('Exploration amount')
ax3.set_title('3. Exploration Challenge')
ax3.legend()
ax3.set_xlim(0, 100000)

# 4. Scalar value limitation
ax4 = axes[1, 1]
x = np.linspace(-10, 30, 1000)
dist1 = np.exp(-(x - 10)**2 / 20)  # low-variance distribution: a single narrow mode at 10
dist2 = 0.25 * np.exp(-(x - 0)**2 / 10) + 0.75 * np.exp(-(x - 13.3)**2 / 10)  # high-variance distribution: two modes, mean close to 10
dist1 /= np.sum(dist1) * (x[1] - x[0])
dist2 /= np.sum(dist2) * (x[1] - x[0])

ax4.fill_between(x, 0, dist1, alpha=0.5, label=f'Action A (E[R]={np.sum(x * dist1 * (x[1]-x[0])):.1f}, low var)')
ax4.fill_between(x, 0, dist2, alpha=0.5, label=f'Action B (E[R]={np.sum(x * dist2 * (x[1]-x[0])):.1f}, high var)')
ax4.set_xlabel('Return')
ax4.set_ylabel('Probability density')
ax4.set_title('4. Scalar Value Limitation\n(same mean, different distributions)')
ax4.legend()

plt.tight_layout()
plt.suptitle('Four Limitations of DQN', fontsize=14, y=1.02)
plt.show()

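### 1.3 Summary of Problems

| Problem | Cause | Effect | Remedy |
|------|------|------|----------|
| **Overestimation bias** | The same max is used for both selection and evaluation | Instability, suboptimal policies | Double DQN |
| **Low sample efficiency** | Uniform random sampling | Slow learning, wasted data | Prioritized Replay |
| **Weak exploration** | ε-greedy ignores the state | Hard to escape local optima | Noisy Networks |
| **Scalar value limitation** | Only the expected value is modeled | Distributional information is lost; risk-neutral | Categorical DQN |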

---

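## 2. Double DQN: Reducing Overestimation Bias

### 2.1 Core Idea

**Root cause**: standard DQN applies a single max over the same network's Q-values both to select the next action and to evaluate it.

$$y^{\text{DQN}} = r + \gamma \underbrace{\max_{a'} Q(s', a'; \theta^-)}_{\text{selection = evaluation}}$$

When the Q-value estimates are noisy, the max tends to pick actions whose values are overestimated, which produces a systematic upward bias.

**Double DQN's fix**: decouple action selection (online network) from action evaluation (target network).

$$y^{\text{Double}} = r + \gamma\, Q\left(s', \underbrace{\arg\max_{a'} Q(s', a'; \theta)}_{\text{online selects}}; \theta^-\right)$$

### 2.2 Mathematical Derivation

Let the true Q-value be $Q^*(s,a)$ and the estimate be $Q(s,a) = Q^*(s,a) + \epsilon(s,a)$, where $\epsilon(s,a) \sim \mathcal{N}(0, \sigma^2)$ is zero-mean noise.

**Overestimation in standard DQN**: because the max is convex, Jensen's inequality gives
$$\mathbb{E}\left[\max_a Q(s,a)\right] = \mathbb{E}\left[\max_a (Q^*(s,a) + \epsilon(s,a))\right] \geq \max_a Q^*(s,a)$$

**Correction in Double DQN**:
If the noise used for selection is independent of the noise used for evaluation, the evaluated value of the selected action $\hat{a}$ has expectation $\mathbb{E}[Q^*(s,\hat{a})] \leq \max_a Q^*(s,a)$, so the upward bias disappears. In practice the online and target networks are correlated, so the overestimation is greatly reduced rather than removed entirely.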
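
As a concrete illustration of the decoupling, here is a small NumPy sketch of the Double DQN target next to the standard DQN target. The function name and the toy Q-value arrays are hypothetical and only stand in for the outputs of an online and a target network.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """The online network selects a'; the target network evaluates it."""
    best_actions = q_online_next.argmax(axis=1)          # selection with theta
    batch_idx = np.arange(len(best_actions))
    evaluated = q_target_next[batch_idx, best_actions]   # evaluation with theta^-
    return rewards + gamma * (1.0 - dones) * evaluated

# One transition, three actions (made-up numbers).
q_online_next = np.array([[1.0, 3.0, 2.0]])   # online network prefers action 1
q_target_next = np.array([[2.5, 1.5, 4.0]])   # target network's noisy maximum is action 2
rewards = np.array([0.0])
dones = np.array([0.0])

y_double = double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=1.0)
y_dqn = rewards + (1.0 - dones) * q_target_next.max(axis=1)
print("Double DQN target:", y_double)
print("Standard DQN target:", y_dqn)
```

For any pair of Q-value arrays the Double DQN target can never exceed the standard DQN target, because the target network's value at the selected action is at most its own maximum.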

In [None]:
def simulate_overestimation(n_actions=10, n_samples=10000, noise_std=1.0):
    """Simulate the overestimation bias."""
    true_q = np.zeros(n_actions)  # the true Q-value of every action is 0
    
    # Standard DQN: the same noise source is used for selection and evaluation
    dqn_estimates = []
    for _ in range(n_samples):
        noise = np.random.randn(n_actions) * noise_std
        estimated_q = true_q + noise
        dqn_estimates.append(np.max(estimated_q))  # max both selects and evaluates
    
    # Double DQN: independent noise sources for selection and evaluation
    double_dqn_estimates = []
    for _ in range(n_samples):
        noise_online = np.random.randn(n_actions) * noise_std  # online network noise
        noise_target = np.random.randn(n_actions) * noise_std  # target network noise
        
        estimated_online = true_q + noise_online
        estimated_target = true_q + noise_target
        
        best_action = np.argmax(estimated_online)  # online network selects
        double_dqn_estimates.append(estimated_target[best_action])  # target network evaluates
    
    return np.mean(dqn_estimates), np.mean(double_dqn_estimates), np.max(true_q)

# Overestimation for different numbers of actions
action_counts = [2, 4, 8, 16, 32, 64]
dqn_bias = []
ddqn_bias = []

for n_actions in action_counts:
    dqn_est, ddqn_est, true_max = simulate_overestimation(n_actions=n_actions)
    dqn_bias.append(dqn_est - true_max)
    ddqn_bias.append(ddqn_est - true_max)

plt.figure(figsize=(10, 6))
plt.plot(action_counts, dqn_bias, 'ro-', linewidth=2, markersize=10, label='Standard DQN (overestimation)')
plt.plot(action_counts, ddqn_bias, 'g^-', linewidth=2, markersize=10, label='Double DQN (corrected)')
plt.axhline(0, color='gray', linestyle='--', linewidth=1, label='True value')
plt.xlabel('Number of Actions', fontsize=12)
plt.ylabel('Estimation Bias', fontsize=12)
plt.title('Overestimation Bias: DQN vs Double DQN\n(more actions, larger overestimation)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xscale('log', base=2)
plt.show()

print(f"\nWith {action_counts[-1]} actions:")
print(f"  DQN overestimation: {dqn_bias[-1]:.3f}")
print(f"  Double DQN bias: {ddqn_bias[-1]:.3f}")
print(f"  Improvement: {(dqn_bias[-1] - ddqn_bias[-1]) / dqn_bias[-1] * 100:.1f}%")

---

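## 3. Dueling DQN: Value-Advantage Decomposition

### 3.1 Core Idea

**Key insight**: in many states, knowing the value of the state matters more than knowing the exact value of every action.

The Q-function is decomposed as:
$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$$

where:
- $V(s)$: state value, "how good is this state?"
- $A(s, a)$: advantage function, "how much better is action $a$ than the average action?"

### 3.2 Identifiability Constraint

**Problem**: $V(s) + A(s,a) = (V(s) + c) + (A(s,a) - c)$ holds for any constant $c$, so $V$ and $A$ cannot be recovered uniquely from $Q$ alone.

**Solution**: subtract the mean advantage. The advantage term that is actually added to $V(s)$ then sums to zero over actions, and $V(s)$ equals the mean of $Q(s, \cdot)$.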
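
The effect of the mean subtraction can be checked numerically. The sketch below is a standalone NumPy illustration; `dueling_aggregate` is a hypothetical helper, not the `DuelingNetwork` class imported above.

```python
import numpy as np

def dueling_aggregate(value, advantage):
    """Q = V + (A - mean_a A); value has shape (batch, 1), advantage (batch, n_actions)."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantage = np.array([[1.0, -1.0, 3.0]])

q = dueling_aggregate(value, advantage)

# Shifting the raw advantages by a constant no longer changes Q,
# so the advantage stream cannot absorb part of the state value.
q_shifted_adv = dueling_aggregate(value, advantage - 5.0)

print("Q:", q)
print("mean_a Q:", q.mean(axis=1), "(equals V)")
print("Q with shifted advantages:", q_shifted_adv)
```

With the naive sum $V + A$, the pairs $(V, A)$ and $(V + c, A - c)$ would be indistinguishable; after mean subtraction, $V(s)$ is pinned to the average Q-value of the state.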

### 3.3 Network Architecture

In [None]:
# Visualize the dueling architecture
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Standard DQN architecture
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.set_aspect('equal')

# Draw boxes and arrows
boxes_dqn = [
    {'pos': (1, 5), 'text': 'State\n$s$', 'color': 'lightblue'},
    {'pos': (4, 5), 'text': 'Shared\nLayers', 'color': 'lightyellow'},
    {'pos': (7, 5), 'text': 'Q-values\n$Q(s,a)$', 'color': 'lightgreen'},
]

for box in boxes_dqn:
    rect = plt.Rectangle((box['pos'][0]-0.8, box['pos'][1]-0.8), 1.6, 1.6, 
                          facecolor=box['color'], edgecolor='black', linewidth=2)
    ax1.add_patch(rect)
    ax1.text(box['pos'][0], box['pos'][1], box['text'], ha='center', va='center', fontsize=11)

ax1.annotate('', xy=(3.2, 5), xytext=(1.8, 5), arrowprops=dict(arrowstyle='->', lw=2))
ax1.annotate('', xy=(6.2, 5), xytext=(4.8, 5), arrowprops=dict(arrowstyle='->', lw=2))
ax1.set_title('Standard DQN Architecture', fontsize=14)
ax1.axis('off')

# Dueling DQN architecture
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.set_aspect('equal')

boxes_dueling = [
    {'pos': (1, 5), 'text': 'State\n$s$', 'color': 'lightblue'},
    {'pos': (3.5, 5), 'text': 'Shared\nLayers', 'color': 'lightyellow'},
    {'pos': (6, 7), 'text': 'Value\n$V(s)$', 'color': 'lightcoral'},
    {'pos': (6, 3), 'text': 'Advantage\n$A(s,a)$', 'color': 'lightgreen'},
    {'pos': (8.5, 5), 'text': 'Q-values\n$Q(s,a)$', 'color': 'plum'},
]

for box in boxes_dueling:
    rect = plt.Rectangle((box['pos'][0]-0.8, box['pos'][1]-0.8), 1.6, 1.6, 
                          facecolor=box['color'], edgecolor='black', linewidth=2)
    ax2.add_patch(rect)
    ax2.text(box['pos'][0], box['pos'][1], box['text'], ha='center', va='center', fontsize=10)

# Arrows
ax2.annotate('', xy=(2.7, 5), xytext=(1.8, 5), arrowprops=dict(arrowstyle='->', lw=2))
ax2.annotate('', xy=(5.2, 7), xytext=(4.3, 5.5), arrowprops=dict(arrowstyle='->', lw=2))
ax2.annotate('', xy=(5.2, 3), xytext=(4.3, 4.5), arrowprops=dict(arrowstyle='->', lw=2))
ax2.annotate('', xy=(7.7, 5.5), xytext=(6.8, 6.5), arrowprops=dict(arrowstyle='->', lw=2))
ax2.annotate('', xy=(7.7, 4.5), xytext=(6.8, 3.5), arrowprops=dict(arrowstyle='->', lw=2))

# Aggregation formula (double-quoted raw string so the primes need no escaping)
ax2.text(5, 1, r"$Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$", 
         ha='center', fontsize=12, bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax2.set_title('Dueling DQN Architecture', fontsize=14)
ax2.axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Demonstrate a forward pass through the dueling network
state_dim, action_dim, hidden_dim = 4, 2, 64

# Create the dueling network
dueling_net = DuelingNetwork(state_dim, action_dim, hidden_dim)
print("Dueling Network Architecture:")
print(dueling_net)

# Random state input
sample_state = torch.randn(1, state_dim)

# Forward pass, inspecting the intermediate results
with torch.no_grad():
    features = F.relu(dueling_net.fc1(sample_state))
    features = F.relu(dueling_net.fc2(features))
    
    value = dueling_net.value_stream(features)
    advantage = dueling_net.advantage_stream(features)
    q_values = dueling_net(sample_state)

print("\nIntermediate results:")
print(f"  V(s) = {value.item():.4f}")
print(f"  A(s,a) = {advantage.numpy().flatten()}")
print(f"  mean A(s,a) = {advantage.mean().item():.4f}")
print(f"  Q(s,a) = {q_values.numpy().flatten()}")
print("\nCheck: V + (A - mean(A)) = Q")
print(f"  {value.item():.4f} + ({advantage.numpy().flatten()} - {advantage.mean().item():.4f}) = {q_values.numpy().flatten()}")

---

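## 4. Noisy Networks: Parametric Exploration

### 4.1 Core Idea

**Problem**: ε-greedy exploration is state-independent; it applies the same exploration probability in every state.

**Solution**: add noise with a learnable scale to the network weights, which yields state-dependent exploration.

**Noisy Linear Layer**:
$$y = (\mu^w + \sigma^w \odot \varepsilon^w) x + (\mu^b + \sigma^b \odot \varepsilon^b)$$

where:
- $\mu$: learnable mean parameters
- $\sigma$: learnable noise-scale parameters
- $\varepsilon$: random noise, which is resampled rather than learned

### 4.2 Factorized Noise

To reduce the number of random variables that must be sampled, factorized Gaussian noise is used. For a layer with $p$ inputs and $q$ outputs, only $p + q$ Gaussian samples are drawn instead of $pq + q$; the number of learnable parameters stays the same:
$$\varepsilon_{ij} = f(\varepsilon_i) \cdot f(\varepsilon_j), \quad f(x) = \text{sign}(x)\sqrt{|x|}$$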
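
The factorization can be sketched in a few lines of NumPy. This is an illustrative standalone version, not the `NoisyLinear` class imported above; the helper names, the uniform range used for `mu_w`, and the initial scale $0.5/\sqrt{p}$ are assumptions chosen for the example.

```python
import numpy as np

def scaled_noise(size, rng):
    """f(x) = sign(x) * sqrt(|x|), applied to standard Gaussian samples."""
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))

def factorized_noise(in_features, out_features, rng):
    """Build the weight and bias noise from p + q samples instead of p*q + q."""
    eps_in = scaled_noise(in_features, rng)     # p samples, one per input
    eps_out = scaled_noise(out_features, rng)   # q samples, one per output
    weight_eps = np.outer(eps_out, eps_in)      # (q, p) noise matrix
    bias_eps = eps_out
    return weight_eps, bias_eps

rng = np.random.default_rng(0)
in_features, out_features = 64, 32
weight_eps, bias_eps = factorized_noise(in_features, out_features, rng)

# Noisy linear forward pass: y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)
mu_w = rng.uniform(-0.1, 0.1, size=(out_features, in_features))
mu_b = np.zeros(out_features)
sigma_w = np.full((out_features, in_features), 0.5 / np.sqrt(in_features))  # assumed initial scale
sigma_b = np.full(out_features, 0.5 / np.sqrt(in_features))
x = rng.standard_normal(in_features)
y = (mu_w + sigma_w * weight_eps) @ x + (mu_b + sigma_b * bias_eps)

print("Gaussian samples drawn:", in_features + out_features)
print("Samples needed for independent noise:", in_features * out_features + out_features)
print("Output shape:", y.shape)
```

The noise matrix is an outer product and therefore has rank one, which is the price paid for drawing far fewer samples per forward pass.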

In [None]:
# Demonstrate the NoisyLinear layer
in_features, out_features = 64, 32
noisy_layer = NoisyLinear(in_features, out_features)

print("NoisyLinear Layer Parameters:")
print(f"  mu_weight shape: {noisy_layer.mu_weight.shape}")
print(f"  sigma_weight shape: {noisy_layer.sigma_weight.shape}")
print(f"  Total params: {sum(p.numel() for p in noisy_layer.parameters())}")
print(f"  Standard Linear params would be: {in_features * out_features + out_features}")

# Show how resampling the noise changes the output
sample_input = torch.randn(1, in_features)

outputs = []
for _ in range(100):
    noisy_layer.reset_noise()  # resample the noise
    with torch.no_grad():
        output = noisy_layer(sample_input)
    outputs.append(output.numpy().flatten())

outputs = np.array(outputs)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Output distribution
ax1 = axes[0]
for i in range(min(5, out_features)):
    ax1.hist(outputs[:, i], bins=20, alpha=0.5, label=f'Output {i}')
ax1.set_xlabel('Output value')
ax1.set_ylabel('Frequency')
ax1.set_title('Noisy Layer Output Distribution\n(same input, different noise samples)')
ax1.legend()

# Noise-scale parameters
ax2 = axes[1]
sigma_values = noisy_layer.sigma_weight.detach().numpy().flatten()
ax2.hist(sigma_values, bins=50, alpha=0.7, color='orange')
ax2.axvline(sigma_values.mean(), color='red', linestyle='--', label=f'Mean σ = {sigma_values.mean():.4f}')
ax2.set_xlabel('σ value')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of Noise Scales (σ)\n(right after initialization)')
ax2.legend()

plt.tight_layout()
plt.show()

print("\nOutput statistics:")
print(f"  Mean: {outputs.mean(axis=0)[:5]}")
print(f"  Std: {outputs.std(axis=0)[:5]}")

---

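## 5. Categorical DQN (C51): Distributional Reinforcement Learning

### 5.1 Core Idea

**Paradigm shift**: model the full return distribution instead of only its expectation.

**Distribution representation**: a discrete distribution over $N$ fixed support points (atoms)
$$Z(s, a) \sim \text{Categorical}(z_1, ..., z_N; p_1(s,a), ..., p_N(s,a))$$

with support points:
$$z_i = V_{\min} + (i - 1) \cdot \Delta z, \quad i = 1, \dots, N, \quad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}$$

### 5.2 Distributional Bellman Operator

$$\mathcal{T} Z(s, a) \stackrel{D}{=} R + \gamma Z(S', A')$$

### 5.3 Projection onto the Support

Because $r + \gamma z_j$ generally falls between support points (or outside $[V_{\min}, V_{\max}]$), the shifted distribution has to be projected back onto the discrete support.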
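
A compact way to write this projection is to compute each shifted atom's fractional index on the support and split its probability mass between the two neighbouring atoms. The function below is a minimal NumPy sketch for a single transition; its name, signature, and the `done` flag are assumptions made for illustration, and a tiny five-atom support is used so the arithmetic is easy to follow.

```python
import numpy as np

def project_distribution(next_probs, reward, done, support, gamma=0.99):
    """Project the distribution of r + gamma * z back onto the fixed support."""
    n_atoms = len(support)
    v_min, v_max = support[0], support[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    # Terminal transitions collapse every atom onto the reward itself.
    tz = np.clip(reward + (1.0 - done) * gamma * support, v_min, v_max)
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)  # fractional atom index
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    projected = np.zeros(n_atoms)
    # Split each atom's mass between its two neighbours in proportion to proximity.
    np.add.at(projected, lower, next_probs * (upper - b))
    np.add.at(projected, upper, next_probs * (b - lower))
    # When b is an integer (lower == upper) both weights above are zero,
    # so the whole mass is assigned to that atom here.
    np.add.at(projected, lower, next_probs * (lower == upper))
    return projected

support = np.linspace(0.0, 4.0, 5)                 # atoms 0, 1, 2, 3, 4
next_probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # all mass on z = 2

# r + gamma * 2 = 2.5 lies halfway between atoms 2 and 3.
mid = project_distribution(next_probs, reward=0.5, done=0.0, support=support, gamma=1.0)
# A terminal transition puts all mass on the reward.
terminal = project_distribution(next_probs, reward=1.0, done=1.0, support=support, gamma=1.0)
print("non-terminal:", mid)
print("terminal:", terminal)
```

As long as no atom is clipped at the support bounds, this projection preserves the mean of the shifted distribution, which is why the expected Q-value still follows the ordinary Bellman backup.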

In [None]:
# Visualize the distribution representation used by Categorical DQN
n_atoms = 51
v_min, v_max = -10, 10
support = np.linspace(v_min, v_max, n_atoms)
delta_z = (v_max - v_min) / (n_atoms - 1)

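# Simulate the return distributions of two actions
def create_distribution(mean, std, support):
    """Build a discretized Gaussian on the support atoms."""
    probs = np.exp(-(support - mean)**2 / (2 * std**2))
    probs /= probs.sum()
    return probs

# Both distributions sit well inside [v_min, v_max], so truncation at the bounds does not shift their means
dist_action1 = create_distribution(mean=2, std=1, support=support)  # action 1: low variance
dist_action2 = create_distribution(mean=2, std=3, support=support)  # action 2: same mean, high variance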

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Original distributions
ax1 = axes[0]
ax1.bar(support, dist_action1, width=delta_z*0.8, alpha=0.7, label='Action 1 (low var)', color='blue')
ax1.bar(support, dist_action2, width=delta_z*0.8, alpha=0.5, label='Action 2 (high var)', color='red')
ax1.axvline(np.sum(support * dist_action1), color='blue', linestyle='--', label=f'E[Z1] = {np.sum(support * dist_action1):.1f}')
ax1.axvline(np.sum(support * dist_action2), color='red', linestyle='--', label=f'E[Z2] = {np.sum(support * dist_action2):.1f}')
ax1.set_xlabel('Return value')
ax1.set_ylabel('Probability')
ax1.set_title('Return Distributions for Two Actions\n(same mean, different variance)')
ax1.legend()

# Distributional Bellman shift
ax2 = axes[1]
reward = 1.0
gamma = 0.99
shifted_support = reward + gamma * support

ax2.bar(support, dist_action1, width=delta_z*0.8, alpha=0.5, label='Original Z(s,a)', color='blue')
ax2.bar(shifted_support, dist_action1, width=delta_z*0.8, alpha=0.5, label=f'r + γZ(s\',a\') [r={reward}]', color='green')
ax2.axvline(v_min, color='red', linestyle=':', label=f'Support bounds: [{v_min}, {v_max}]')
ax2.axvline(v_max, color='red', linestyle=':')
ax2.set_xlabel('Return value')
ax2.set_ylabel('Probability')
ax2.set_title('Distributional Bellman Shift\n(must be projected back onto the support)')
ax2.legend()

# Distribution after projection
ax3 = axes[2]
# Simplified projection
projected_probs = np.zeros(n_atoms)
for i, z in enumerate(support):
    shifted_z = np.clip(reward + gamma * z, v_min, v_max)
    # Find the neighbouring support points
    lower_idx = int((shifted_z - v_min) / delta_z)
    lower_idx = np.clip(lower_idx, 0, n_atoms - 2)
    upper_idx = lower_idx + 1
    # Linear interpolation
    upper_weight = (shifted_z - support[lower_idx]) / delta_z
    lower_weight = 1 - upper_weight
    projected_probs[lower_idx] += lower_weight * dist_action1[i]
    projected_probs[upper_idx] += upper_weight * dist_action1[i]

ax3.bar(support, dist_action1, width=delta_z*0.8, alpha=0.5, label='Original', color='blue')
ax3.bar(support, projected_probs, width=delta_z*0.4, alpha=0.7, label='After projection', color='green')
ax3.set_xlabel('Return value')
ax3.set_ylabel('Probability')
ax3.set_title('Categorical Projection\n(projected back onto the original support)')
ax3.legend()

plt.tight_layout()
plt.show()

print("C51 parameters:")
print(f"  N_atoms = {n_atoms}")
print(f"  V_min = {v_min}, V_max = {v_max}")
print(f"  Δz = {delta_z:.4f}")

---

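## 6. Prioritized Experience Replay

### 6.1 Core Idea

**Problem**: uniform sampling treats every transition alike and therefore underuses the transitions that carry the most information.

**Solution**: assign sampling priorities according to the magnitude of the TD error.

### 6.2 Priority Definition

$$p_i = |\delta_i| + \epsilon$$

where $\delta_i$ is the TD error and $\epsilon$ is a small positive constant that keeps priorities from being zero.

### 6.3 Sampling Probability

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

- $\alpha = 0$: uniform sampling
- $\alpha = 1$: sampling fully proportional to priority

### 6.4 Importance-Sampling Correction

Prioritized sampling changes the data distribution, so importance-sampling (IS) weights are needed to correct for it:

$$w_i = \left( \frac{1}{N \cdot P(i)} \right)^\beta / \max_j w_j$$

Here $N$ is the number of stored transitions, and $\beta$ is annealed from $\beta_0$ to 1 over the course of training.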
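
The three formulas above fit in a short NumPy sketch. The function names are hypothetical, and the weights are normalized by the maximum within the sampled batch, which is one common choice and an assumption here; the `PrioritizedReplayBuffer` used in this module may normalize differently.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |delta_i| + eps."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def importance_weights(probs, sampled_idx, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, normalized by the largest weight in the batch."""
    n = len(probs)
    weights = (1.0 / (n * probs[sampled_idx])) ** beta
    return weights / weights.max()

td_errors = np.array([0.1, -2.0, 0.5, 1.0])   # made-up TD errors for four transitions
probs = per_probabilities(td_errors, alpha=0.6)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=3, p=probs)
weights = importance_weights(probs, batch_idx, beta=0.4)

print("sampling probabilities:", probs)
print("sampled indices:", batch_idx)
print("IS weights:", weights)
```

Transitions that are sampled more often receive smaller weights, so their gradient contribution is scaled down; with $\beta = 1$ the correction fully compensates for the non-uniform sampling.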

In [None]:
# Demonstrate the SumTree data structure
capacity = 8
tree = SumTree(capacity)

# Add samples
priorities = [1.0, 3.0, 2.0, 4.0, 1.5, 2.5, 3.5, 0.5]
for i, p in enumerate(priorities):
    tree.add(p, f"data_{i}")

print("SumTree Structure:")
print(f"  Capacity: {capacity}")
print(f"  Priorities: {priorities}")
print(f"  Total priority (root): {tree.total_priority}")
print(f"  Expected total: {sum(priorities)}")

# Visualize the sampling distribution
n_samples = 10000
sample_counts = np.zeros(capacity)

for _ in range(n_samples):
    cumsum = np.random.uniform(0, tree.total_priority)
    idx, priority, data = tree.get(cumsum)
    data_idx = int(data.split('_')[1])
    sample_counts[data_idx] += 1

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Priority vs. sampling frequency
ax1 = axes[0]
x = np.arange(capacity)
width = 0.35
ax1.bar(x - width/2, np.array(priorities) / sum(priorities), width, label='Expected (priority/total)', color='blue', alpha=0.7)
ax1.bar(x + width/2, sample_counts / n_samples, width, label='Empirical sampling freq', color='orange', alpha=0.7)
ax1.set_xlabel('Data index')
ax1.set_ylabel('Probability')
ax1.set_title('Prioritized Sampling Distribution')
ax1.set_xticks(x)
ax1.legend()

# β annealing curve
ax2 = axes[1]
beta_start = 0.4
beta_frames = 100000
frames = np.arange(beta_frames + 50000)
beta = beta_start + (1 - beta_start) * np.minimum(frames / beta_frames, 1.0)

ax2.plot(frames, beta, 'g-', linewidth=2)
ax2.axhline(1.0, color='red', linestyle='--', label='β = 1 (full correction)')
ax2.axvline(beta_frames, color='gray', linestyle=':', label=f'beta_frames = {beta_frames}')
ax2.fill_between(frames, beta_start, beta, alpha=0.3, color='green')
ax2.set_xlabel('Training frames')
ax2.set_ylabel('β (IS exponent)')
ax2.set_title('Importance Sampling Correction Annealing')
ax2.legend()

plt.tight_layout()
plt.show()

In [None]:
# Demonstrate the PrioritizedReplayBuffer
per_buffer = PrioritizedReplayBuffer(
    capacity=1000,
    alpha=0.6,
    beta_start=0.4,
    beta_frames=10000
)

# Add samples
for _ in range(100):
    state = np.random.randn(4).astype(np.float32)
    action = np.random.randint(2)
    reward = np.random.randn()
    next_state = np.random.randn(4).astype(np.float32)
    done = np.random.random() < 0.1
    per_buffer.push(state, action, reward, next_state, done)

print("PER buffer state:")
print(f"  Size: {len(per_buffer)}")
print(f"  Beta: {per_buffer.beta:.4f}")

# Sample a batch
states, actions, rewards, next_states, dones, indices, weights = per_buffer.sample(32)

print("\nSample results:")
print(f"  States shape: {states.shape}")
print(f"  Weights range: [{weights.min():.4f}, {weights.max():.4f}]")
print(f"  Mean weight: {weights.mean():.4f}")

# Update priorities
td_errors = np.abs(np.random.randn(32))  # simulated |TD errors|; priorities are based on magnitudes
per_buffer.update_priorities(indices, td_errors)
print("\nPriorities updated")

---

## 7. N-step Learning: Multi-step Returns

### 7.1 Core Idea

**TD(0)**: uses only the immediate reward before bootstrapping; high bias, low variance
$$G_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})$$

**Monte Carlo**: uses the full return; zero bias, high variance
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ...$$

**N-step**: a compromise between the two
$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V(s_{t+n})$$

### 7.2 Bias-Variance Tradeoff

| n | Bias | Variance | Notes |
|---|------|------|------|
| 1 | High | Low | Conservative but stable |
| 3-5 | Medium | Medium | Commonly used sweet spot |
| ∞ | Zero | High | Monte Carlo |
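
In a replay-based agent, n-step returns are usually built by sliding a window over consecutive one-step transitions. The sketch below shows one way to do this with a `deque`; the function name and the tuple layout are assumptions made for illustration and do not describe the `NStepReplayBuffer` imported above.

```python
from collections import deque

def n_step_transitions(trajectory, n, gamma):
    """Convert 1-step transitions into n-step transitions.

    trajectory: list of (state, action, reward, next_state, done)
    returns:    list of (state, action, n_step_reward, bootstrap_state, done, steps)
    """
    out = []
    window = deque()

    def emit():
        # Discounted sum of the rewards currently in the window.
        ret = sum(gamma ** k * tr[2] for k, tr in enumerate(window))
        state, action = window[0][0], window[0][1]
        bootstrap_state, done = window[-1][3], window[-1][4]
        out.append((state, action, ret, bootstrap_state, done, len(window)))
        window.popleft()

    for transition in trajectory:
        window.append(transition)
        if len(window) == n:
            emit()
        if transition[4]:      # episode ended: flush the remaining shorter windows
            while window:
                emit()
    return out

# Four-step episode with reward 1 at every step; states are just integers.
trajectory = [(t, 0, 1.0, t + 1, t == 3) for t in range(4)]
for item in n_step_transitions(trajectory, n=3, gamma=0.5):
    print(item)
```

The learner would then form the target as `n_step_reward + gamma**steps * (1 - done) * V(bootstrap_state)`. Storing `steps` matters because windows near the end of an episode are shorter than `n`; a trajectory that is cut off without `done` leaves its last partial windows unemitted in this sketch.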

In [None]:
# Demonstrate n-step return computation
def compute_n_step_return(rewards, gamma, n, bootstrap_value=0):
    """Compute the n-step return; bootstrap only when all n rewards are available."""
    n_step_return = 0.0
    for i in range(min(n, len(rewards))):
        n_step_return += (gamma ** i) * rewards[i]
    if len(rewards) >= n:
        n_step_return += (gamma ** n) * bootstrap_value
    return n_step_return

# Simulate the reward sequence of one episode
np.random.seed(42)
episode_rewards = np.random.randn(20) + 0.5  # rewards with a positive mean
gamma = 0.99
bootstrap_value = 5.0  # value estimate used for bootstrapping at non-terminal states

# Compute the returns for different n
n_values = [1, 3, 5, 10, len(episode_rewards)]
returns_by_n = {}

for n in n_values:
    returns = []
    for t in range(len(episode_rewards) - n + 1):
        rewards_slice = episode_rewards[t:t+n]
        # A window that reaches the end of the episode lands in a terminal state, whose value is 0
        bootstrap = 0.0 if t + n == len(episode_rewards) else bootstrap_value
        g = compute_n_step_return(rewards_slice, gamma, n, bootstrap)
        returns.append(g)
    returns_by_n[n] = returns
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Compare the returns for different n
ax1 = axes[0]
for n in n_values:
    label = 'MC' if n == len(episode_rewards) else f'n={n}'
    ax1.plot(returns_by_n[n], 'o-', alpha=0.7, label=label, markersize=4)
ax1.set_xlabel('Time step t')
ax1.set_ylabel(f'G_t^(n) (γ={gamma})')
ax1.set_title('N-step Returns at Different Time Steps')
ax1.legend()
ax1.grid(True, alpha=0.3)

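# Visualize the bias-variance tradeoff
ax2 = axes[1]

# Monte Carlo experiment: resample the rewards in every run and bootstrap from a
# value estimate that carries a fixed systematic error
n_experiments = 5000
n_range = range(1, 16)
bias_estimates = []
variance_estimates = []

horizon = len(episode_rewards)
mean_reward, reward_std = 0.5, 1.0
bootstrap_error = 2.0  # systematic error of the bootstrap value estimate

def expected_tail_return(start):
    """Expected discounted return from step `start` to the end of the episode."""
    return sum(gamma**k * mean_reward for k in range(horizon - start))

true_return = expected_tail_return(0)

for n in n_range:
    experiment_returns = []
    for _ in range(n_experiments):
        sampled_rewards = mean_reward + reward_std * np.random.randn(n)  # reward noise drives the variance
        biased_bootstrap = expected_tail_return(n) + bootstrap_error  # bootstrap error drives the bias
        g = compute_n_step_return(sampled_rewards, gamma, n, biased_bootstrap)
        experiment_returns.append(g)
    
    bias_estimates.append(abs(np.mean(experiment_returns) - true_return))
    variance_estimates.append(np.var(experiment_returns))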

ax2.plot(list(n_range), bias_estimates, 'b-o', label='Bias', linewidth=2)
ax2.plot(list(n_range), variance_estimates, 'r-^', label='Variance', linewidth=2)
ax2.set_xlabel('n (number of steps)')
ax2.set_ylabel('Value')
ax2.set_title('Bias-Variance Tradeoff in N-step Learning')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

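## 8. Rainbow: Combining the Improvements

### 8.1 Combination Strategy

Rainbow (Hessel et al., 2018) combines all six improvements covered above:

1. **Double DQN** - reduces overestimation
2. **Dueling DQN** - value-advantage decomposition
3. **Noisy Networks** - parametric exploration
4. **Categorical DQN** - distributional RL
5. **Prioritized Experience Replay** - prioritized sampling
6. **N-step Learning** - multi-step returns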
### 8.2 Performance Gains

Median human-normalized scores on the Atari benchmark (no-op starts), as reported by Hessel et al. (2018):

| Algorithm | Score | Gain relative to DQN |
|------|------|-------------|
| DQN | 79% | baseline |
| Double DQN | 117% | +48% |
| Prioritized DDQN | 140% | +77% |
| Dueling DDQN | 151% | +91% |
| Distributional DQN (C51) | 164% | +108% |
| **Rainbow** | **223%** | **+182%** |

In [None]:
# Visualize the published scores and the ablation findings
agents = ['DQN', 'Double\nDQN', 'Prioritized\nDDQN', 'Dueling\nDDQN', 'Noisy\nDQN', 'Distributional\nDQN', 'Rainbow']
scores = [79, 117, 140, 151, 118, 164, 223]  # median human-normalized score (%), no-op starts, Hessel et al. (2018)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Published score of each agent
ax1 = axes[0]
colors = plt.cm.viridis(np.linspace(0.2, 0.9, len(agents)))
bars = ax1.bar(agents, scores, color=colors, edgecolor='black', linewidth=1)
ax1.axhline(100, color='red', linestyle='--', linewidth=2, label='Human performance')
ax1.set_ylabel('Median Human-Normalized Score (%)')
ax1.set_title('Published Agent Scores\n(Atari Benchmark)')
ax1.legend()

# Add value labels
for bar, score in zip(bars, scores):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
             f'{score}%', ha='center', va='bottom', fontsize=10)

# Ablation findings
ax2 = axes[1]
# Hessel et al. (2018) report the ablations as learning curves rather than a score table,
# so the bars encode only the qualitative ranking described in the paper.
ablation_components = ['- Multi-step', '- PER', '- Distributional', '- Noisy', '- Dueling', '- Double']
impact_levels = [3, 3, 2, 2, 1, 1]  # 3 = large drop, 2 = moderate drop, 1 = small change

ax2.barh(ablation_components, impact_levels, color='red', alpha=0.7, edgecolor='black')
ax2.set_xticks([1, 2, 3])
ax2.set_xticklabels(['small', 'moderate', 'large'])
ax2.set_xlabel('Effect of removal on median score (qualitative)')
ax2.set_title('Ablation Study: Removing Components\n(which component matters most?)')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print("Key ablation findings (Hessel et al., 2018):")
print("  1. Removing PER or multi-step learning hurts the most")
print("  2. Removing the distributional component ranks next")
print("  3. Removing dueling or Double Q-learning changes the median score only slightly")
print("  4. The combined agent outperforms every individual baseline")

---

## 9. Experimental Comparison and Analysis

Below we compare the performance of several DQN variants on the CartPole environment.

In [None]:
# Quick experiment: compare DQN variants
# Note: a small number of episodes is used to save time

if HAS_GYM:
    print("Starting the comparison experiment (this may take a few minutes...)\n")
    
    # Use small settings for a quick demonstration
    results = compare_variants(
        env_name="CartPole-v1",
        variants=[
            DQNVariant.VANILLA,
            DQNVariant.DOUBLE,
            DQNVariant.DUELING,
            DQNVariant.RAINBOW,
        ],
        num_episodes=100,  # use more episodes in practice
        seed=42,
    )
    
    # Plot the comparison
    plot_comparison(results)
else:
    print("Gymnasium is not installed; skipping the experiment")
    print("Install with: pip install gymnasium")

---

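## 10. Summary and Outlook

### 10.1 Key Takeaways

| Variant | Core innovation | Problem addressed |
|------|----------|------------|
| **Double DQN** | Decouples action selection from evaluation | Overestimation bias |
| **Dueling DQN** | V/A decomposition architecture | Generalization across actions |
| **Noisy Networks** | Parametric noise | State-independent exploration |
| **Categorical DQN** | Distribution modeling | Scalar value limitation |
| **PER** | TD-error-based priorities | Sample efficiency |
| **N-step** | Multi-step bootstrapping | Credit assignment |
| **Rainbow** | Combination of all of the above | Strongest overall performance among these variants |

### 10.2 Practical Recommendations

1. **Simple tasks**: vanilla DQN or Double DQN is usually enough
2. **Medium difficulty**: Double + Dueling + PER is a good combination
3. **Hard tasks**: use Rainbow for the best performance
4. **Limited compute**: Double DQN + N-step is a cost-effective choice

### 10.3 Directions for Further Study

- **Implicit Quantile Networks (IQN)**: more flexible distribution modeling
- **Noisy Networks extensions**: e.g. NoisyNet-A3C
- **Retrace/V-trace**: off-policy correction
- **R2D2**: distributed training combined with recurrent networks
- **Agent57**: an Atari agent that exceeds human-level performance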

---

## References

1. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. *Nature*.
2. van Hasselt, H. et al. (2016). Deep Reinforcement Learning with Double Q-learning. *AAAI*.
3. Wang, Z. et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. *ICML*.
4. Fortunato, M. et al. (2018). Noisy Networks for Exploration. *ICLR*.
5. Bellemare, M. et al. (2017). A Distributional Perspective on Reinforcement Learning. *ICML*.
6. Schaul, T. et al. (2016). Prioritized Experience Replay. *ICLR*.
7. Hessel, M. et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. *AAAI*.