# Deep Q-Network (DQN): A Complete Deep Reinforcement Learning Tutorial

---

## Table of Contents

1. Introduction and Core Ideas
2. Mathematical Foundations in Depth
3. Core Innovations Explained
4. Implementation and Code Walkthrough
5. Training Experiments and Visualization
6. Comparing Algorithm Variants
7. Hyperparameter Sensitivity Analysis
8. Diagnosing Common Problems
9. Summary and Further Directions

---

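## 1. Introduction and Core Ideas

### 1.1 What Is DQN?

**Deep Q-Network (DQN)** is the work that launched deep reinforcement learning. DeepMind introduced it in 2013 and published an extended version in Nature in 2015. It was the first to show that a neural network can learn control policies directly from raw pixel input: evaluated on 49 Atari games, it reached a level comparable to a professional human games tester on more than half of them (29 games).

**The core idea in one sentence**:

> Approximate the Q-function with a deep neural network, and stabilize training with experience replay and a target network.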

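### 1.2 Historical Context

| Year | Milestone |
|------|-----------|
| 2013 | First DQN paper: learns from raw pixels and surpasses a human expert on three of seven Atari games |
| 2015 | Nature paper published; deep reinforcement learning enters the mainstream |
| 2016 | Double DQN, Dueling DQN, and PER published in quick succession |
| 2017 | Rainbow combines six DQN improvements and sets a new state of the art on Atari |
| 2018 | TD3 and SAC carry replay buffers and target networks further into continuous action spaces, following DDPG |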

### 1.3 The Core Problem DQN Solves

Classical Q-Learning stores state-action values in a **Q-table**:

```
Q[state][action] = value
```

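**Problems**:
1. **Curse of dimensionality**: the state space grows exponentially, so a table cannot be stored
   - Atari pixels: 210 × 160 × 3 = 100,800 dimensions
   - Possible raw frames: up to $256^{100800} \approx 10^{242{,}750}$ (an upper bound; reachable states are far fewer, but still far too many to tabulate)
2. **No generalization**: the table says nothing about states it has never seen

**The DQN solution**:

```
Q(s, a) ≈ f_θ(s)[a]  (neural-network approximation)
```

- The network has a fixed number of parameters $\theta$, typically a few million, instead of one table entry per state
- Similar states share weights automatically, which gives generalization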
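
For contrast, the tabular update that DQN generalizes can be sketched in a few lines. This is a minimal illustration and not code from this tutorial's modules: the helper `q_learning_update`, the step size `alpha`, and the toy transition are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a dict keyed by (state, action)."""
    # Bootstrapped value of the next state (zero if the episode ended)
    best_next = 0.0 if done else max(
        q_table[(next_state, a)] for a in range(n_actions)
    )
    td_target = reward + gamma * best_next
    # Move the stored entry a fraction alpha toward the TD target
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Every unseen (state, action) pair starts at 0.0 and needs its own entry,
# which is exactly what stops scaling for pixel inputs.
q_table = defaultdict(float)
new_value = q_learning_update(
    q_table, state=0, action=1, reward=1.0, next_state=1, done=False, n_actions=2
)
print(new_value)
```

Each entry is updated independently, so nothing learned about one state transfers to another; the network in DQN replaces the dictionary lookup with a shared function.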

In [None]:
# Environment setup and dependency imports
import sys
import warnings
warnings.filterwarnings('ignore')

# Core scientific computing libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Deep learning framework
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reinforcement learning environment (every later cell depends on it,
# so fail early with a clear message instead of continuing without it)
try:
    import gymnasium as gym
    print(f"gymnasium version: {gym.__version__}")
except ImportError as exc:
    raise ImportError("Please install gymnasium: pip install gymnasium") from exc

# Local module imports
from dqn_core import DQNConfig, DQNAgent, DQNNetwork, DuelingDQNNetwork, ReplayBuffer
from training_utils import TrainingConfig, train_dqn, evaluate_agent, plot_training_curves

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
rcParams['font.size'] = 11
rcParams['axes.labelsize'] = 12
rcParams['axes.titlesize'] = 13
rcParams['figure.figsize'] = (10, 6)

# Fix random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

# Detect the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

---

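## 2. Mathematical Foundations in Depth

### 2.1 The Basic Framework: Markov Decision Processes (MDPs)

A reinforcement learning problem is formalized as an MDP, a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: state space
- $\mathcal{A}$: action space
- $P(s'|s, a)$: state transition probability
- $R(s, a, s')$: reward function
- $\gamma \in [0, 1)$: discount factor

**Objective**: find a policy $\pi$ that maximizes the expected cumulative discounted reward: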

$$
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
$$
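
As a quick check of the formula, the return of a finite reward sequence can be accumulated backwards using $G_t = r_{t+1} + \gamma G_{t+1}$. The helper name `discounted_return` and the toy rewards are assumptions made for this sketch.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1.75
```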

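### 2.2 The Action-Value Function Q(s, a)

**Definition**: the expected cumulative discounted reward from taking action $a$ in state $s$ and following policy $\pi$ thereafter: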

$$
Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]
$$

**Optimal Q-function** $Q^*$: the largest Q-value achievable by any policy:

$$
Q^*(s, a) = \max_\pi Q^\pi(s, a)
$$

Once $Q^*$ is known, an optimal policy simply picks the action with the largest Q-value in each state:

$$
\pi^*(s) = \arg\max_a Q^*(s, a)
$$

### 2.3 The Bellman Optimality Equation

The optimal Q-function satisfies a recursive relation:

$$
\boxed{Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]}
$$

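**Intuition**:
- $Q^*(s, a)$ = expected immediate reward + discounted optimal value of the future
- This is the heart of dynamic programming: a long-horizon problem decomposes into one-step decisions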
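
Applying the Bellman optimality equation repeatedly as an update rule gives Q-value iteration. The sketch below runs it on a hypothetical two-state, two-action deterministic MDP invented for this example (the `transitions` table, `q_value_iteration`, and $\gamma = 0.5$ are all assumptions), where the fixed point can be checked by hand.

```python
# (state, action) -> (reward, next_state); a made-up deterministic MDP
transitions = {
    (0, 0): (0.0, 0),  # stay in state 0, no reward
    (0, 1): (1.0, 1),  # move to state 1, reward 1
    (1, 0): (0.0, 0),  # move back to state 0, no reward
    (1, 1): (2.0, 1),  # stay in state 1, reward 2
}


def q_value_iteration(transitions, n_actions, gamma, sweeps=100):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a') to every pair."""
    q = {key: 0.0 for key in transitions}
    for _ in range(sweeps):
        # The right-hand side reads the previous sweep's table
        q = {
            (s, a): reward + gamma * max(q[(next_s, b)] for b in range(n_actions))
            for (s, a), (reward, next_s) in transitions.items()
        }
    return q


q_star = q_value_iteration(transitions, n_actions=2, gamma=0.5)
greedy_policy = {s: max(range(2), key=lambda a: q_star[(s, a)]) for s in (0, 1)}
print(q_star)
print(greedy_policy)
```

By hand: staying in state 1 forever yields $2 / (1 - 0.5) = 4$, so $Q^*(1, 1) = 4$ and $Q^*(0, 1) = 1 + 0.5 \cdot 4 = 3$; the greedy policy picks action 1 in both states.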

In [None]:
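# Visualization: the recursive structure of the Bellman equation
def visualize_bellman():
    """
    Diagram of the Bellman equation.
    
    After taking action a in the current state s:
    1. Receive the immediate reward r
    2. Transition to the next state s'
    3. In s', pick the best action a' = argmax Q(s', ·)
    4. Add the discounted future value γ * max_a' Q(s', a')
    """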
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Draw the state nodes
    ax.scatter([0], [0.5], s=3000, c='steelblue', alpha=0.8, zorder=5)
    ax.scatter([4], [0.5], s=3000, c='coral', alpha=0.8, zorder=5)
    ax.scatter([8], [0.5], s=3000, c='seagreen', alpha=0.8, zorder=5)
    
    # Node labels
    ax.text(0, 0.5, 's', fontsize=16, ha='center', va='center', fontweight='bold', color='white')
    ax.text(4, 0.5, "s'", fontsize=16, ha='center', va='center', fontweight='bold', color='white')
    ax.text(8, 0.5, '...', fontsize=16, ha='center', va='center', fontweight='bold', color='white')
    
    # Arrows and their labels
    ax.annotate('', xy=(3.3, 0.5), xytext=(0.7, 0.5),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    ax.text(2, 0.8, 'action a', fontsize=12, ha='center')
    ax.text(2, 0.2, 'reward r', fontsize=12, ha='center', color='darkred')
    
    ax.annotate('', xy=(7.3, 0.5), xytext=(4.7, 0.5),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    ax.text(6, 0.8, "argmax Q(s', ·)", fontsize=12, ha='center')
    ax.text(6, 0.2, r'$\gamma \max Q(s\prime, a\prime)$', fontsize=12, ha='center', color='darkgreen')
    
    # Formula
    ax.text(4, -0.3, r"$Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a')$", 
            fontsize=16, ha='center', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
    ax.set_xlim(-1.5, 10)
    ax.set_ylim(-0.8, 1.3)
    ax.axis('off')
    ax.set_title('Bellman Optimality Equation Recursive Structure', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

visualize_bellman()

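### 2.4 Deriving the DQN Loss Function

**Core idea**: turn Q-Learning into a supervised regression problem

1. **Samples**: draw $(s, a, r, s', \text{done})$ from the experience replay buffer

2. **Prediction**: the output of the online network
   $$\hat{Q} = Q(s, a; \theta)$$

3. **Target**: the TD target, computed with the target network
   $$y = r + \gamma (1 - \text{done}) \cdot \max_{a'} Q(s', a'; \theta^-)$$

4. **Loss function**: mean squared error
   $$\boxed{L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ (y - Q(s, a; \theta))^2 \right]}$$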
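
The four steps above map onto a few tensor operations in PyTorch. The following is a sketch under stated assumptions and not the `DQNAgent` implementation: single linear layers stand in for the two networks, and the batch is random data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
state_dim, n_actions, batch_size, gamma = 4, 2, 8, 0.99

# Stand-ins for the online network (theta) and target network (theta^-)
online_net = nn.Linear(state_dim, n_actions)
target_net = nn.Linear(state_dim, n_actions)
target_net.load_state_dict(online_net.state_dict())

# Step 1: a synthetic batch (s, a, r, s', done)
states = torch.randn(batch_size, state_dim)
actions = torch.randint(0, n_actions, (batch_size,))
rewards = torch.randn(batch_size)
next_states = torch.randn(batch_size, state_dim)
dones = torch.zeros(batch_size)
dones[-1] = 1.0  # mark the last transition as terminal

# Step 2: prediction Q(s, a; theta), picking the column of the action taken
q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

# Step 3: TD target from the target network; no gradient flows through it
with torch.no_grad():
    next_q_max = target_net(next_states).max(dim=1).values
    td_target = rewards + gamma * (1.0 - dones) * next_q_max

# Step 4: mean squared error between prediction and target
loss = F.mse_loss(q_pred, td_target)
loss.backward()
print(f"loss = {loss.item():.4f}")
```

Because the target is built under `torch.no_grad()`, only the online network receives gradients, and the terminal transition's target reduces to its reward.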

In [None]:
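def demonstrate_td_target():
    """
    Walk through the TD target computation.
    
    The heart of TD (temporal-difference) learning is bootstrapping:
    the estimate for the current state is updated from the estimate for
    the next state, without waiting for the episode to end.
    """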
    print("=" * 60)
    print("TD Target Calculation Demo")
    print("=" * 60)
    
    # Simulate one batch of data
    batch_size = 4
    gamma = 0.99
    
    # Simulated transitions
    rewards = np.array([1.0, 0.0, -1.0, 10.0])  # immediate rewards
    dones = np.array([0, 0, 0, 1])  # terminal flags
    next_q_max = np.array([5.2, 3.1, 8.7, 0.0])  # max Q-value in the next state
    current_q = np.array([4.5, 3.0, 6.5, 8.0])  # current Q-value estimates
    
    # Compute the TD target
    td_target = rewards + gamma * (1 - dones) * next_q_max
    
    # Compute the TD error
    td_error = td_target - current_q
    
    print("\nSample Data:")
    print(f"  Immediate reward r:    {rewards}")
    print(f"  Terminal flag done:    {dones}")
    print(f"  max Q(s', a'):         {next_q_max}")
    print(f"  Current Q(s, a):       {current_q}")
    
    print(f"\nTD Target Calculation:")
    print(f"  y = r + gamma(1-done)max Q(s', a')")
    print(f"  gamma = {gamma}")
    
    for i in range(batch_size):
        print(f"\n  Sample {i+1}:")
        print(f"    y = {rewards[i]:.1f} + {gamma} * (1-{dones[i]}) * {next_q_max[i]:.1f}")
        print(f"      = {rewards[i]:.1f} + {gamma * (1 - dones[i]) * next_q_max[i]:.2f}")
        print(f"      = {td_target[i]:.2f}")
        print(f"    TD Error = {td_target[i]:.2f} - {current_q[i]:.2f} = {td_error[i]:.2f}")
    
    print(f"\n\nLoss = mean(TD_Error^2) = {np.mean(td_error**2):.4f}")

demonstrate_td_target()

---

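## 3. Core Innovations Explained

### 3.1 Why Does Naively Combining a Neural Network with Q-Learning Fail?

**Problem 1: correlated samples**
- In online learning, consecutive samples $(s_t, s_{t+1}, ...)$ are highly correlated
- This violates the i.i.d. assumption behind SGD
- Result: the network overfits to recent experience and catastrophically forgets older experience

**Problem 2: unstable targets**
- The TD target depends on the network being trained: $y = r + \gamma \max_{a'} Q(s', a'; \theta)$
- Every update to the network also moves the target
- Result: the network chases a moving target and oscillates instead of converging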

### 3.2 Experience Replay

**Problem addressed**: correlated samples

**Core idea**:
1. Store each transition $(s, a, r, s', \text{done})$ in a buffer
2. At training time, sample mini-batches uniformly at random

$$
\mathcal{D} = \{(s_1, a_1, r_1, s_1'), ..., (s_N, a_N, r_N, s_N')\}
$$

$$
\text{batch} \sim \text{Uniform}(\mathcal{D})
$$

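**Benefits**:
1. Breaks temporal correlation
2. Each sample can be reused many times (better sample efficiency)
3. Lower gradient variance (averaging over a batch)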
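
The `ReplayBuffer` imported from `dqn_core` is not listed in this tutorial, so the following is only a sketch of what a uniform replay buffer can look like, using nothing but the standard library. The class name `SimpleReplayBuffer` and its internals are assumptions for illustration, not the module's actual code.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement within one batch, uniformly over the buffer
        batch = random.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)


random.seed(0)
buffer = SimpleReplayBuffer(capacity=3)
for t in range(5):
    # Toy transitions whose state is just the time index
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=(t == 4))

sampled_states, _, _, _, _ = buffer.sample(2)
print(len(buffer), sampled_states)
```

With capacity 3 and five pushes, only the transitions from steps 2, 3, and 4 remain, and a sampled batch no longer follows time order.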

In [None]:
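def demonstrate_experience_replay():
    """
    Demonstrate how experience replay works.
    
    Experience replay makes the RL update look like supervised learning:
    - dataset = the experience buffer
    - input = state
    - label = TD target
    """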
    print("=" * 60)
    print("Experience Replay Mechanism Demo")
    print("=" * 60)
    
    # Create the replay buffer
    buffer = ReplayBuffer(capacity=10)
    
    # Simulate collecting experience
    print("\n1. Collecting Experience:")
    for i in range(8):
        state = np.random.randn(4).astype(np.float32)
        action = np.random.randint(0, 2)
        reward = np.random.randn()
        next_state = state + np.random.randn(4).astype(np.float32) * 0.1
        done = i == 7
        
        buffer.push(state, action, reward, next_state, done)
        print(f"   Step {i+1}: Store (s, a={action}, r={reward:.2f}, s', done={done})")
    
    print(f"\n   Buffer status: {len(buffer)}/{buffer.capacity} samples")
    
    # Sample at random
    print("\n2. Random sample batch_size=4:")
    states, actions, rewards, next_states, dones = buffer.sample(4)
    
    print(f"   Sampled actions: {actions}")
    print(f"   Sampled rewards: {rewards.round(2)}")
    print(f"   Sampled dones: {dones}")
    
    print("\nKey Insight:")
    print("   - Random sampling breaks temporal correlation")
    print("   - Same experience can be sampled multiple times")
    print("   - This makes neural network training more stable")

demonstrate_experience_replay()

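### 3.3 Target Network

**Problem addressed**: unstable targets

**Core idea**:
1. Maintain two networks: the online network $\theta$ and the target network $\theta^-$
2. Compute the TD target with the target network: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$
3. Synchronize periodically: every $C$ steps, set $\theta^- \leftarrow \theta$

**Intuition**:
- Holding the target fixed for $C$ steps gives the online network time to "catch up"
- This resembles supervised learning with fixed labels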

$$
\text{Without Target Network: } y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \theta_t) \quad \text{(Moving target)}
$$

$$
\text{With Target Network: } y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \theta^-) \quad \text{(Fixed target)}
$$
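
The periodic synchronization $\theta^- \leftarrow \theta$ amounts to copying the state dict every $C$ optimizer steps. The sketch below assumes a stand-in linear layer, a pseudo-loss on random inputs, and $C = 5$ (`sync_every`); none of these come from `dqn_core`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
online_net = nn.Linear(4, 2)   # stand-in for the online Q-network
target_net = nn.Linear(4, 2)   # stand-in for the target Q-network
target_net.load_state_dict(online_net.state_dict())
target_net.requires_grad_(False)  # the target network is never trained directly

optimizer = torch.optim.SGD(online_net.parameters(), lr=0.1)
sync_every = 5   # the period C
sync_steps = []

for step in range(1, 11):
    # Pseudo-loss on random inputs, only to make the online weights move
    loss = online_net(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:
        # Hard update: copy every online parameter into the target network
        target_net.load_state_dict(online_net.state_dict())
        sync_steps.append(step)

print(sync_steps)
print(torch.equal(online_net.weight, target_net.weight))
```

Between synchronizations the two networks drift apart; right after a sync (here at steps 5 and 10) they are identical again.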

In [None]:
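def demonstrate_target_network():
    """
    Demonstrate the stabilizing effect of the target network.
    
    Compares how the online and target Q-values change while only the online network is updated.
    """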
    print("=" * 60)
    print("Target Network Stability Demo")
    print("=" * 60)
    
    # Create two small networks
    state_dim, action_dim = 4, 2
    online_net = DQNNetwork(state_dim, action_dim, hidden_dims=[32, 32])
    target_net = DQNNetwork(state_dim, action_dim, hidden_dims=[32, 32])
    target_net.load_state_dict(online_net.state_dict())  # initial sync
    
    # Fixed input state
    test_state = torch.randn(1, state_dim)
    
    print("\nInitial State:")
    with torch.no_grad():
        q_online = online_net(test_state)
        q_target = target_net(test_state)
    print(f"   Online network Q values: {q_online.numpy().flatten().round(3)}")
    print(f"   Target network Q values: {q_target.numpy().flatten().round(3)}")
    
    # Simulate a few gradient updates (online network only)
    optimizer = torch.optim.Adam(online_net.parameters(), lr=0.01)
    
    print("\nSimulate 5 gradient updates (only update online network):")
    for step in range(5):
        # Random pseudo-loss to simulate an update
        loss = online_net(torch.randn(4, state_dim)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        with torch.no_grad():
            q_online = online_net(test_state)
            q_target = target_net(test_state)
        
        print(f"\n   Step {step+1}:")
        print(f"      Online Q: {q_online.numpy().flatten().round(3)}  (changing)")
        print(f"      Target Q: {q_target.numpy().flatten().round(3)}  (stable)")
    
    print("\n\nKey Insight:")
    print("   - Target network provides stable regression target")
    print("   - Online network gradually learns, target stays fixed")
    print("   - Sync target network every C steps")

demonstrate_target_network()

---

## 4. Implementation and Code Walkthrough

### 4.1 Q-Network Architecture
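
`DQNNetwork` is imported from `dqn_core` and its source is not shown here. As a rough picture of the kind of model involved, a plain MLP that maps a state to one Q-value per action can be sketched as follows; the class name `MLPQNetwork` and its layout are assumptions, and the real class may differ.

```python
import torch
import torch.nn as nn


class MLPQNetwork(nn.Module):
    """State in, one Q-value per action out."""

    def __init__(self, state_dim, action_dim, hidden_dims):
        super().__init__()
        layers = []
        in_dim = state_dim
        for hidden_dim in hidden_dims:
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        # No activation on the output: Q-values are unbounded real numbers
        layers.append(nn.Linear(in_dim, action_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)


torch.manual_seed(0)
q_net = MLPQNetwork(state_dim=4, action_dim=2, hidden_dims=[128, 128])
n_params = sum(p.numel() for p in q_net.parameters())

with torch.no_grad():
    q_values = q_net(torch.tensor([[1.5, 0.0, -0.2, 0.3]]))

print(n_params)                      # 17410
print(q_values.argmax(dim=1).item())  # index of the greedy action
```

The parameter count follows from the layer shapes: (4·128 + 128) + (128·128 + 128) + (128·2 + 2) = 17,410.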

In [None]:
def explain_network_architecture():
    """
    A closer look at the DQN network architecture.
    
    The Q-network maps a state to one Q-value per action:
        f_theta: S -> R^{|A|}
    
    Input: state vector s in R^d
    Output: Q-values for all actions [Q(s, a_1), Q(s, a_2), ..., Q(s, a_n)]
    """
    print("=" * 60)
    print("Q Network Architecture")
    print("=" * 60)
    
    # Create the network
    state_dim = 4  # CartPole state dimension
    action_dim = 2  # number of CartPole actions
    hidden_dims = [128, 128]
    
    network = DQNNetwork(state_dim, action_dim, hidden_dims)
    
    print(f"\nNetwork Structure:")
    print(f"   Input layer:  {state_dim} dim (state)")
    for i, dim in enumerate(hidden_dims):
        print(f"   Hidden {i+1}: {dim} dim + ReLU")
    print(f"   Output layer: {action_dim} dim (Q values for each action)")
    
    # Count parameters
    total_params = sum(p.numel() for p in network.parameters())
    trainable_params = sum(p.numel() for p in network.parameters() if p.requires_grad)
    
    print(f"\nParameter Statistics:")
    print(f"   Total parameters: {total_params:,}")
    print(f"   Trainable parameters: {trainable_params:,}")
    
    # Forward-pass demo
    print("\nForward Pass Demo:")
    sample_state = torch.tensor([[1.5, 0.0, -0.2, 0.3]])  # example state
    
    with torch.no_grad():
        q_values = network(sample_state)
    
    print(f"   Input state:  {sample_state.numpy().flatten()}")
    print(f"   Output Q values: {q_values.numpy().flatten().round(3)}")
    print(f"   Selected action: {q_values.argmax(dim=1).item()} (action with max Q)")
    
    # Print the network layers
    print("\nNetwork Layer Details:")
    print(network)

explain_network_architecture()

### 4.2 Dueling Architecture
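
The aggregation $Q(s, a) = V(s) + A(s, a) - \text{mean}_{a'} A(s, a')$ is the only part that differs from a plain Q-network. The sketch below shows just that head on top of precomputed features; `DuelingHead` is a hypothetical name, and the `DuelingDQNNetwork` class from `dqn_core` may be organized differently.

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Combine a scalar value stream and a per-action advantage stream."""

    def __init__(self, feature_dim, action_dim):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s)
        self.advantage = nn.Linear(feature_dim, action_dim)  # A(s, ·)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        # Subtracting the mean advantage pins down the decomposition
        return v + a - a.mean(dim=1, keepdim=True)


torch.manual_seed(0)
head = DuelingHead(feature_dim=8, action_dim=3)
features = torch.randn(5, 8)  # stand-in for the shared backbone's output

with torch.no_grad():
    q_values = head(features)
    state_values = head.value(features)

# With the mean subtracted, averaging Q over actions recovers V(s)
print(torch.allclose(q_values.mean(dim=1, keepdim=True), state_values, atol=1e-5))
```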

In [None]:
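def explain_dueling_architecture():
    """
    Dueling DQN architecture explained.
    
    Core idea: decompose the Q-function into a state value V(s) and an action advantage A(s, a)
    
    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    
    Intuition:
    - V(s): how good the state itself is (independent of the action)
    - A(s, a): how much better this action is than the average action
    
    Benefits:
    - V(s) is updated by every sample, so it is learned well even in states where the choice of action matters little
    - Often faster convergence and better generalization across actions
    """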
    print("=" * 60)
    print("Dueling DQN Architecture")
    print("=" * 60)
    
    state_dim = 4
    action_dim = 2
    
    # Create the dueling network
    dueling_net = DuelingDQNNetwork(state_dim, action_dim, hidden_dims=[128, 128])
    
    print("\nNetwork Structure:")
    print("""
           +---------------------+
           |   Shared Features   |
           |   (MLP backbone)    |
           +----------+----------+
                      |
              +-------+-------+
              |               |
        +-----v-----+   +-----v-----+
        | Value     |   | Advantage |
        | Stream    |   | Stream    |
        | V(s) -> 1 |   | A(s,.) -> |A|
        +-----+-----+   +-----+-----+
              |               |
              +-------+-------+
                      |
              +-------v-------+
              | Q(s,a) = V(s) |
              | + A(s,a)      |
              | - mean(A)     |
              +---------------+
    """)
    
    # Forward-pass demo
    print("\nForward Pass Demo:")
    sample_state = torch.tensor([[1.5, 0.0, -0.2, 0.3]])
    
    with torch.no_grad():
        q_values = dueling_net(sample_state)
    
    print(f"   Input state:  {sample_state.numpy().flatten()}")
    print(f"   Output Q values: {q_values.numpy().flatten().round(3)}")
    
    print("\nMathematical Formula:")
    print("   Q(s, a) = V(s) + A(s, a) - (1/|A|) sum_a' A(s, a')")
    print("")
    print("Why subtract mean?")
    print("   - Ensure identifiability")
    print("   - Q = V + A has infinitely many decompositions")
    print("   - Constraint sum_a A(s,a) = 0 makes it unique")

explain_dueling_architecture()

### 4.3 Epsilon-Greedy Policy
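
Epsilon-greedy action selection with a linearly decayed epsilon fits in two small functions. This is a standard-library sketch; `linear_epsilon`, `select_action`, and the example Q-values are assumptions made for illustration and are not the agent's own methods.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Interpolate linearly from eps_start to eps_end, then hold eps_end."""
    progress = min(1.0, step / decay_steps)
    return eps_start + (eps_end - eps_start) * progress


def select_action(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # made-up Q-values for three actions

greedy_action = select_action(q_values, epsilon=0.0, rng=rng)
explored = [select_action(q_values, epsilon=1.0, rng=rng) for _ in range(200)]

print(greedy_action)            # 1, the argmax
print(sorted(set(explored)))    # actions seen under pure exploration
print(linear_epsilon(5_000))    # roughly halfway between 1.0 and 0.01
```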

In [None]:
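def demonstrate_epsilon_greedy():
    """
    Epsilon-greedy policy explained.
    
    Exploration-exploitation tradeoff:
    - Exploitation: pick the best action known so far
    - Exploration: try other actions, which may turn out to be better
    
    Epsilon-greedy policy:
    - With probability (1 - epsilon), take the greedy action (argmax Q)
    - With probability epsilon, take a uniformly random action
    
    Epsilon decay schedule:
    - Early in training: high epsilon (more exploration)
    - Late in training: low epsilon (more exploitation)
    """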
    print("=" * 60)
    print("Epsilon-Greedy Policy Demo")
    print("=" * 60)
    
    # Simulate epsilon decay
    epsilon_start = 1.0
    epsilon_end = 0.01
    epsilon_decay = 10000
    
    steps = np.arange(0, 15000)
    epsilon_values = []
    
    for step in steps:
        decay_progress = min(1.0, step / epsilon_decay)
        epsilon = epsilon_start + (epsilon_end - epsilon_start) * decay_progress
        epsilon_values.append(epsilon)
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Epsilon decay curve
    ax1.plot(steps, epsilon_values, 'b-', linewidth=2)
    ax1.axhline(y=epsilon_start, color='r', linestyle='--', alpha=0.5, label=f'Start: {epsilon_start}')
    ax1.axhline(y=epsilon_end, color='g', linestyle='--', alpha=0.5, label=f'End: {epsilon_end}')
    ax1.axvline(x=epsilon_decay, color='orange', linestyle='--', alpha=0.5, label=f'Decay steps: {epsilon_decay}')
    ax1.set_xlabel('Training Step')
    ax1.set_ylabel('Epsilon')
    ax1.set_title('Epsilon Decay Schedule')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Exploration vs. exploitation probabilities
    checkpoints = [0, 2500, 5000, 10000, 15000]
    explore_ratios = []
    exploit_ratios = []
    
    for cp in checkpoints:
        idx = min(cp, len(epsilon_values) - 1)
        eps = epsilon_values[idx]
        explore_ratios.append(eps)
        exploit_ratios.append(1 - eps)
    
    x = np.arange(len(checkpoints))
    width = 0.35
    
    ax2.bar(x - width/2, explore_ratios, width, label='Exploration', color='coral')
    ax2.bar(x + width/2, exploit_ratios, width, label='Exploitation', color='steelblue')
    ax2.set_xlabel('Training Step')
    ax2.set_ylabel('Probability')
    ax2.set_title('Exploration vs Exploitation Ratio')
    ax2.set_xticks(x)
    ax2.set_xticklabels([str(cp) for cp in checkpoints])
    ax2.legend()
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\nKey Insight:")
    print("   - Initial epsilon=1.0: 100% random exploration, collect diverse experience")
    print("   - Linear decay: gradually shift to exploitation")
    print("   - Final epsilon=0.01: retain small exploration to avoid local optima")

demonstrate_epsilon_greedy()

---

## 5. Training Experiments and Visualization

### 5.1 The CartPole Environment

In [None]:
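def explore_cartpole_env():
    """
    Explore the CartPole-v1 environment.
    
    CartPole is the "Hello World" of reinforcement learning:
    - Goal: keep the pole balanced by pushing the cart left or right
    - State: [cart position, cart velocity, pole angle, pole angular velocity]
    - Actions: 0 (push left), 1 (push right)
    - Reward: +1 for every step the pole stays up
    - Termination: the pole tilts more than 12° or the cart leaves the track
    - Success criterion: average reward >= 475 over 100 consecutive episodes
    """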
    print("=" * 60)
    print("CartPole-v1 Environment")
    print("=" * 60)
    
    env = gym.make('CartPole-v1')
    
    print("\nEnvironment Info:")
    print(f"   Observation space: {env.observation_space}")
    print(f"   Action space: {env.action_space}")
    
    print("\nState Variables:")
    state_names = ['Cart Position', 'Cart Velocity', 'Pole Angle', 'Pole Angular Velocity']
    for i, name in enumerate(state_names):
        low = env.observation_space.low[i]
        high = env.observation_space.high[i]
        print(f"   [{i}] {name}: [{low:.2f}, {high:.2f}]")
    
    print("\nAction Meanings:")
    print("   0: Push cart left")
    print("   1: Push cart right")
    
    # Run a random policy
    print("\nRandom Policy Demo (5 episodes):")
    episode_rewards = []
    
    for ep in range(5):
        state, _ = env.reset(seed=ep)
        total_reward = 0
        
        for step in range(500):
            action = env.action_space.sample()  # random action
            next_state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            
            if terminated or truncated:
                break
        
        episode_rewards.append(total_reward)
        print(f"   Episode {ep+1}: Reward = {total_reward:.0f}, Steps = {step+1}")
    
    print(f"\nRandom policy avg reward: {np.mean(episode_rewards):.1f} +/- {np.std(episode_rewards):.1f}")
    print("Success criterion: avg reward >= 475 (max 500 steps)")
    
    env.close()

explore_cartpole_env()

### 5.2 Training DQN

In [None]:
def train_dqn_cartpole(num_episodes: int = 200):
    """
    Train DQN on CartPole.
    
    Uses a small number of episodes for a quick demo.
    For production use, 500-1000 episodes are recommended.
    """
    print("=" * 60)
    print("DQN Training Demo")
    print("=" * 60)
    
    # Get environment info
    env = gym.make('CartPole-v1')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    env.close()
    
    # Configure the DQN agent
    config = DQNConfig(
        state_dim=state_dim,
        action_dim=action_dim,
        hidden_dims=[128, 128],
        learning_rate=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=5000,
        buffer_size=50000,
        batch_size=64,
        target_update_freq=100,
        double_dqn=True,
        dueling=False,
        seed=42,
    )
    
    agent = DQNAgent(config)
    
    print(f"\nConfiguration:")
    print(f"   State dim: {state_dim}")
    print(f"   Action dim: {action_dim}")
    print(f"   Network: {config.hidden_dims}")
    print(f"   Double DQN: {config.double_dqn}")
    print(f"   Device: {agent.device}")
    
    # Configure training parameters
    training_config = TrainingConfig(
        num_episodes=num_episodes,
        max_steps_per_episode=500,
        eval_frequency=50,
        eval_episodes=5,
        log_frequency=20,
        save_frequency=100,
        checkpoint_dir='./dqn_checkpoints',
        early_stopping_reward=475,
        early_stopping_episodes=10,
    )
    
    print(f"\nStarting training for {num_episodes} episodes...")
    print("-" * 60)
    
    # Train
    metrics = train_dqn(
        agent=agent,
        env_name='CartPole-v1',
        config=training_config,
        verbose=True,
    )
    
    return agent, metrics

# Run training
agent, metrics = train_dqn_cartpole(num_episodes=200)

### 5.3 Visualizing Training Results

In [None]:
# Plot the training curves
plot_training_curves(
    metrics,
    title="DQN Training on CartPole-v1",
    window=10,
    show=True
)

### 5.4 Evaluating the Trained Agent

In [None]:
def evaluate_trained_agent(agent: DQNAgent, num_episodes: int = 10):
    """
    Evaluate the trained agent.
    """
    print("=" * 60)
    print("Agent Evaluation")
    print("=" * 60)
    
    mean_reward, std_reward = evaluate_agent(
        agent=agent,
        env_name='CartPole-v1',
        num_episodes=num_episodes,
        verbose=True,
        deterministic=True,
    )
    
    print(f"\nEvaluation Results:")
    print(f"   Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
    
    if mean_reward >= 475:
        print("   [OK] Reached CartPole success criterion (>=475)")
    else:
        print(f"   [X] Below criterion, gap: {475 - mean_reward:.1f}")

evaluate_trained_agent(agent, num_episodes=10)

---

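## 6. Comparing Algorithm Variants

### 6.1 Double DQN

**Problem**: standard DQN systematically overestimates Q-values

The mathematical reason (Jensen's inequality, since max is convex):
$$
\mathbb{E}[\max_a Q(s, a)] \geq \max_a \mathbb{E}[Q(s, a)]
$$

Taking the max over noisy estimates tends to select whichever action has the largest positive error, which biases the target upward.

**Solution**: decouple action selection from value evaluation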

$$
y^{\text{Double}} = r + \gamma Q\left(s', \underbrace{\arg\max_{a'} Q(s', a'; \theta)}_{\text{Online network selects}};  \underbrace{\theta^-}_{\text{Target network evaluates}}\right)
$$
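
A small worked example makes the difference between the two targets concrete. The Q-values below are made up for illustration, and the `(1 - done)` factor from the TD target in Section 2.4 is included so that terminal transitions are handled the same way.

```python
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 0.0])
dones = torch.tensor([0.0, 1.0])  # the second transition is terminal

# Hypothetical next-state Q-values for a batch of 2 states and 2 actions
next_q_online = torch.tensor([[1.0, 3.0], [2.0, 0.5]])  # Q(s', ·; theta)
next_q_target = torch.tensor([[4.0, 2.0], [1.0, 5.0]])  # Q(s', ·; theta^-)

# Double DQN: the online network selects, the target network evaluates
best_actions = next_q_online.argmax(dim=1, keepdim=True)
double_q = next_q_target.gather(1, best_actions).squeeze(1)
double_target = rewards + gamma * (1.0 - dones) * double_q

# Standard DQN: the target network both selects and evaluates
vanilla_q = next_q_target.max(dim=1).values
vanilla_target = rewards + gamma * (1.0 - dones) * vanilla_q

print(double_target)   # first entry: 1 + 0.99 * 2 = 2.98
print(vanilla_target)  # first entry: 1 + 0.99 * 4 = 4.96
```

For the first transition the online network prefers action 1, which the target network values at 2.0, whereas the plain max picks the target network's own largest value of 4.0. The Double DQN target can never exceed the standard one, which is how it limits overestimation.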

In [None]:
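def demonstrate_overestimation():
    """
    Demonstrate Q-value overestimation.
    
    Sources of overestimation:
    1. Function approximation error (the neural network is imperfect)
    2. Statistical bias of the max operator
    """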
    print("=" * 60)
    print("Q Value Overestimation Demo")
    print("=" * 60)
    
    # Simulation: true Q-values plus noise
    np.random.seed(42)
    
    num_actions = 4
    num_experiments = 1000
    noise_std = 0.5
    
    true_q_values = np.array([1.0, 0.8, 0.6, 0.4])  # true Q-values
    true_max = np.max(true_q_values)
    
    estimated_max_values = []
    
    for _ in range(num_experiments):
        # Add estimation noise
        noisy_q = true_q_values + np.random.randn(num_actions) * noise_std
        estimated_max_values.append(np.max(noisy_q))
    
    estimated_max_mean = np.mean(estimated_max_values)
    overestimation = estimated_max_mean - true_max
    
    print(f"\nSimulation Setup:")
    print(f"   True Q values: {true_q_values}")
    print(f"   True max Q: {true_max}")
    print(f"   Noise std: {noise_std}")
    print(f"   Experiments: {num_experiments}")
    
    print(f"\nResults:")
    print(f"   Estimated max Q mean: {estimated_max_mean:.3f}")
    print(f"   Overestimation: {overestimation:.3f}")
    print(f"   Overestimation ratio: {overestimation / true_max * 100:.1f}%")
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Histogram
    ax1.hist(estimated_max_values, bins=30, alpha=0.7, color='steelblue', edgecolor='black')
    ax1.axvline(x=true_max, color='red', linestyle='--', linewidth=2, label=f'True max = {true_max}')
    ax1.axvline(x=estimated_max_mean, color='orange', linestyle='-', linewidth=2, 
                label=f'Estimated mean = {estimated_max_mean:.2f}')
    ax1.set_xlabel('Estimated Max Q Value')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Max Q Estimation Distribution')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Overestimation at different noise levels
    noise_levels = np.linspace(0.1, 2.0, 20)
    overestimations = []
    
    for noise in noise_levels:
        estimates = []
        for _ in range(500):
            noisy_q = true_q_values + np.random.randn(num_actions) * noise
            estimates.append(np.max(noisy_q))
        overestimations.append(np.mean(estimates) - true_max)
    
    ax2.plot(noise_levels, overestimations, 'b-o', linewidth=2, markersize=6)
    ax2.axhline(y=0, color='r', linestyle='--', alpha=0.5)
    ax2.set_xlabel('Noise Standard Deviation')
    ax2.set_ylabel('Overestimation')
    ax2.set_title('Overestimation vs Noise Level')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nKey Insight:")
    print("   - max operation produces positive bias in presence of noise")
    print("   - More noise leads to more overestimation")
    print("   - Double DQN mitigates this by decoupling selection and evaluation")

demonstrate_overestimation()

### 6.2 Variant Comparison

In [None]:
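def compare_dqn_variants(num_episodes: int = 150):
    """
    Compare the performance of different DQN variants.
    
    The following variants are compared (one seed each, so differences
    between curves are illustrative rather than statistically reliable):
    1. Standard DQN
    2. Double DQN
    3. Dueling DQN
    4. Double Dueling DQN
    """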
    print("=" * 60)
    print("DQN Variants Comparison")
    print("=" * 60)
    
    env = gym.make('CartPole-v1')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    env.close()
    
    variants = [
        ("Standard DQN", False, False),
        ("Double DQN", True, False),
        ("Dueling DQN", False, True),
        ("Double Dueling", True, True),
    ]
    
    results = {}
    
    training_config = TrainingConfig(
        num_episodes=num_episodes,
        log_frequency=num_episodes + 1,  # disable logging
        eval_frequency=num_episodes + 1,  # disable intermediate evaluation
        save_frequency=num_episodes + 1,  # disable checkpointing
    )
    
    for name, double_dqn, dueling in variants:
        print(f"\nTraining {name}...")
        
        config = DQNConfig(
            state_dim=state_dim,
            action_dim=action_dim,
            double_dqn=double_dqn,
            dueling=dueling,
            seed=42,
            epsilon_decay=3000,
        )
        agent = DQNAgent(config)
        
        metrics = train_dqn(
            agent=agent,
            env_name='CartPole-v1',
            config=training_config,
            verbose=False,
        )
        
        results[name] = metrics
        final_avg = np.mean(metrics.episode_rewards[-30:])
        print(f"   Done! Last 30 episodes avg: {final_avg:.1f}")
    
    # Plot the comparison
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    colors = plt.cm.tab10.colors
    
    window = 10
    for idx, (name, metrics) in enumerate(results.items()):
        rewards = metrics.episode_rewards
        if len(rewards) >= window:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            ax1.plot(range(window-1, len(rewards)), smoothed, 
                    label=name, color=colors[idx], linewidth=2)
    
    ax1.set_xlabel('Episode')
    ax1.set_ylabel('Reward')
    ax1.set_title('Learning Curves Comparison')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Bar chart of final performance
    names = list(results.keys())
    final_rewards = [np.mean(results[n].episode_rewards[-30:]) for n in names]
    
    bars = ax2.bar(names, final_rewards, color=[colors[i] for i in range(len(names))])
    ax2.set_ylabel('Average Reward (Last 30 Episodes)')
    ax2.set_title('Final Performance Comparison')
    ax2.tick_params(axis='x', rotation=15)
    ax2.grid(True, alpha=0.3, axis='y')
    
    for bar, reward in zip(bars, final_rewards):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
                f'{reward:.1f}', ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    return results

# Run the comparison (few episodes for a quick demo)
comparison_results = compare_dqn_variants(num_episodes=150)

---

## 7. Hyperparameter Sensitivity Analysis

In [None]:
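def analyze_hyperparameters():
    """
    Hyperparameter sensitivity analysis.
    
    Summarizes how key hyperparameters affect performance, including:
    - learning_rate: learning rate
    - target_update_freq: target network update frequency
    - epsilon_decay: speed of exploration-rate decay
    """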
    print("=" * 60)
    print("Hyperparameter Sensitivity Analysis")
    print("=" * 60)
    
    print("\nKey Hyperparameters and Their Effects:")
    print("""
    +--------------------+----------------+---------------------------+
    | Hyperparameter     | Suggested      | Effect                    |
    +====================+================+===========================+
    | learning_rate      | 1e-4 ~ 1e-3    | Convergence vs Stability  |
    +--------------------+----------------+---------------------------+
    | gamma              | 0.99 ~ 0.999   | Long-term reward weight   |
    +--------------------+----------------+---------------------------+
    | buffer_size        | 10K ~ 1M       | Data diversity vs Memory  |
    +--------------------+----------------+---------------------------+
    | batch_size         | 32 ~ 256       | Gradient stability vs Cost|
    +--------------------+----------------+---------------------------+
    | target_update_freq | 100 ~ 10000    | Target stability vs Delay |
    +--------------------+----------------+---------------------------+
    | epsilon_decay      | 1K ~ 100K      | Exploration vs Convergence|
    +--------------------+----------------+---------------------------+
    | hidden_dims        | [64,64] ~      | Capacity vs Overfitting   |
    |                    | [512,512]      |                           |
    +--------------------+----------------+---------------------------+
    """)
    
    print("\nRecommended Config for CartPole-v1:")
    print("""
    config = DQNConfig(
        hidden_dims=[128, 128],     # Medium network sufficient
        learning_rate=1e-3,         # Relatively high, fast convergence
        gamma=0.99,                 # Standard discount factor
        epsilon_decay=5000,         # Fast decay
        target_update_freq=100,     # Frequent updates
        buffer_size=50000,          # Medium buffer
        batch_size=64,              # Standard batch
    )
    """)
    
    print("\nRecommended Config for LunarLander-v2:")
    print("""
    config = DQNConfig(
        hidden_dims=[256, 256],     # Larger network
        learning_rate=5e-4,         # Smaller learning rate
        gamma=0.99,
        epsilon_decay=20000,        # Slower decay
        target_update_freq=500,     # Less frequent updates
        buffer_size=100000,         # Larger buffer
        batch_size=64,
    )
    """)

analyze_hyperparameters()

---

## 8. Diagnosing Common Problems
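
Gradient clipping appears among the remedies in this section's guide as `grad_clip=10.0`. How that option is wired inside `DQNAgent` is not shown in this tutorial, so the following is only a sketch of what clipping the global gradient norm to 10 typically looks like in PyTorch, with a stand-in linear layer and deliberately extreme synthetic targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Linear(4, 2)  # stand-in for a Q-network
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

states = torch.randn(16, 4)
# Huge targets on purpose, so that the raw gradient is very large
targets = torch.full((16, 2), 1000.0)

loss = F.mse_loss(net(states), targets)
optimizer.zero_grad()
loss.backward()

# Rescale all gradients in place so their global L2 norm is at most 10;
# the return value is the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in net.parameters()))
optimizer.step()

print(f"grad norm before: {norm_before.item():.1f}, after: {norm_after.item():.1f}")
```

Clipping must happen after `backward()` and before `optimizer.step()`; it bounds the size of a single update without changing the gradient's direction.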

In [None]:
def troubleshooting_guide():
    """
    Troubleshooting guide for common DQN training problems.
    """
    print("=" * 60)
    print("DQN Training Troubleshooting Guide")
    print("=" * 60)
    
    problems = [
        {
            "symptom": "Reward not improving, persistent oscillation",
            "causes": [
                "Learning rate too high",
                "Target network updated too frequently",
                "Network capacity insufficient",
                "Insufficient exploration (epsilon decays too fast)",
            ],
            "solutions": [
                "Reduce learning rate to 1e-4",
                "Increase target_update_freq to 1000+",
                "Increase hidden layer width/depth",
                "Increase epsilon_decay steps",
            ],
        },
        {
            "symptom": "Q values keep growing to extreme values",
            "causes": [
                "Severe overestimation",
                "Reward scale too large",
                "Gradient explosion",
            ],
            "solutions": [
                "Enable Double DQN",
                "Clip or normalize rewards",
                "Enable gradient clipping (grad_clip=10.0)",
            ],
        },
        {
            "symptom": "Slow learning, requires too many samples",
            "causes": [
                "Low exploration efficiency",
                "Buffer too small",
                "Batch size too small",
            ],
            "solutions": [
                "Increase initial exploration (epsilon_start=1.0)",
                "Increase buffer capacity",
                "Increase batch_size to 128",
            ],
        },
        {
            "symptom": "Converged to suboptimal policy",
            "causes": [
                "Network capacity insufficient",
                "Exploration stopped too early",
                "Stuck in local optimum",
            ],
            "solutions": [
                "Increase network capacity",
                "Keep epsilon_end > 0 (e.g., 0.01)",
                "Run multiple times with different seeds",
            ],
        },
    ]
    
    for i, problem in enumerate(problems, 1):
        print(f"\nProblem {i}: {problem['symptom']}")
        print("-" * 50)
        
        print("Possible Causes:")
        for cause in problem['causes']:
            print(f"   * {cause}")
        
        print("Solutions:")
        for solution in problem['solutions']:
            print(f"   [OK] {solution}")

troubleshooting_guide()

---

## 9. Summary and Further Directions

### 9.1 DQN Key Points Recap

In [None]:
def summary():
    """
    Summary of the key points for learning DQN.
    """
    print("=" * 60)
    print("DQN Key Points Summary")
    print("=" * 60)
    
    print("""
    1. Core Idea
       Approximate Q function with neural network, combined with
       experience replay and target network for stable training.
    
    2. Key Innovations
       * Experience Replay: Break sample correlation, improve data efficiency
       * Target Network: Stabilize TD target, prevent divergence
    
    3. Mathematical Foundation
       * Bellman Equation: Q*(s,a) = E[r + gamma max Q*(s',a')]
       * TD Error: delta = r + gamma max Q(s') - Q(s,a)
       * Loss Function: L(theta) = E[(TD_target - Q(s,a;theta))^2]
    
    4. Algorithm Variants
       * Double DQN: Decouple selection and evaluation, reduce overestimation
       * Dueling DQN: Decompose V(s) and A(s,a), faster convergence
       * PER: Prioritize high TD-error samples
    
    5. Applicable Scenarios
       [OK] Discrete action space
       [OK] High-dimensional states (images, sensors)
       [OK] Delayed reward tasks
       [X] Continuous action space (need DDPG/TD3/SAC)
    """)
    
    print("\nLearning Path:")
    print("""
    DQN -> Double DQN -> Dueling DQN -> PER -> Rainbow
                 |
            DDPG/TD3/SAC (continuous actions)
                 |
            PPO/A3C (policy gradient)
                 |
            Model-based RL (MuZero, Dreamer)
    """)

summary()

### 9.2 References

**Core papers:**
1. Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. *NIPS Workshop*.
2. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. *Nature*.
3. van Hasselt, H., et al. (2016). Deep Reinforcement Learning with Double Q-learning. *AAAI*.
4. Wang, Z., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. *ICML*.
5. Schaul, T., et al. (2016). Prioritized Experience Replay. *ICLR*.
6. Hessel, M., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. *AAAI*.

**Recommended resources:**
- [OpenAI Spinning Up](https://spinningup.openai.com/)
- [DeepMind Blog](https://deepmind.com/blog)
- [Berkeley Deep RL Course](http://rail.eecs.berkeley.edu/deeprlcourse/)