# Deep Q-Network (DQN): An In-Depth Hands-On Tutorial

---

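## Learning Objectives

| Objective | Description |
|------|------|
| Theoretical foundations | Understand the mathematics behind DQN and its core innovations |
| Experience replay | Master the principle and implementation of Experience Replay |
| Target network | Understand how a Target Network stabilizes training |
| Algorithm variants | Implement Double DQN and Dueling DQN |
| Hands-on skills | Train and evaluate an agent in the CartPole environment |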

---

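## Part 1: Why Do We Need Deep Reinforcement Learning?

### 1.1 Limitations of Tabular Q-Learning

| Problem | Explanation | Consequence |
|------|------|------|
| State-space explosion | Go has roughly $10^{170}$ legal positions | The Q-table cannot be stored |
| Continuous state spaces | Robot joint angles are continuous values | States cannot index a table directly; discretization is lossy and scales poorly |
| No generalization | Similar states must be learned independently | Very low sample efficiency |
| High-dimensional inputs | An image can have millions of pixels | Raw inputs cannot serve as table indices |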
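
To make the table-size problem concrete, the sketch below counts Q-table entries when each dimension of a continuous state is cut into equal bins. The helper name `q_table_entries` and the bin counts are illustrative assumptions; 4 state dimensions and 2 actions are assumed as a small example.

```python
def q_table_entries(bins_per_dim: int, state_dims: int, num_actions: int) -> int:
    """Number of Q(s, a) entries when every state dimension is split into equal bins."""
    return (bins_per_dim ** state_dims) * num_actions


# The table grows exponentially with the number of state dimensions.
for bins in (10, 100, 1000):
    entries = q_table_entries(bins, state_dims=4, num_actions=2)
    print(f"{bins:>5} bins per dimension -> {entries:,} entries")
```

Even for this tiny problem, a finer grid quickly makes the table impractical, and every entry still has to be visited separately because a table shares nothing between neighboring cells.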

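### 1.2 Function Approximation: The Core Solution

**Core idea**: approximate the optimal action-value function with a parameterized function

$$Q^*(s, a) \approx Q(s, a; \theta)$$

where $\theta$ denotes the neural network parameters.

**Why neural networks?**

1. **Universal approximation theorem**: a sufficiently wide network can approximate any continuous function on a compact domain to arbitrary accuracy
2. **Automatic feature extraction**: useful representations are learned from raw inputs
3. **Generalization**: similar states share representations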

---

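## Part 2: Core DQN Techniques

### 2.1 Experience Replay

**Problem**: in online learning, consecutive samples are highly correlated

```
Temporal correlation:  s_1 → s_2 → s_3 → s_4 → ...
                        ↓     ↓     ↓     ↓
SGD assumption:        i.i.d. (independent and identically distributed)
```

**Solution**: store → shuffle → sample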

$$\text{Buffer}: D = \{(s_t, a_t, r_t, s_{t+1}, d_t)\}_{t=1}^{N}$$

$$\text{Sample}: \{\tau_i\}_{i=1}^{B} \sim \text{Uniform}(D)$$
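
The effect of uniform sampling on correlation can be checked numerically. The sketch below is a toy illustration, not part of DQN itself: it assumes a 1-D random walk as a stand-in for a trajectory of states and compares the lag-1 correlation of consecutive samples with that of a uniformly sampled batch. The helper `lag1_correlation` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trajectory: a 1-D random walk, so each state is close to the previous one.
trajectory = np.cumsum(rng.normal(size=10_000))


def lag1_correlation(x: np.ndarray) -> float:
    """Pearson correlation between each element and its successor."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])


# Consecutive samples, as an online learner would see them.
sequential = trajectory[:2_000]

# Uniform sampling without replacement from the whole buffer.
indices = rng.choice(len(trajectory), size=2_000, replace=False)
replayed = trajectory[indices]

print(f"lag-1 correlation, sequential: {lag1_correlation(sequential):.3f}")
print(f"lag-1 correlation, replayed:   {lag1_correlation(replayed):.3f}")
```

The sequential batch is almost perfectly correlated, while the replayed batch is close to uncorrelated, which is much nearer to the i.i.d. assumption behind SGD.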

### 2.2 Target Network

**Problem**: the bootstrapped target shifts every time the Q-network is updated

$$\text{Target} = r + \gamma \max_{a'} Q(s', a'; \theta) \quad \text{(unstable)}$$

**Solution**: use a lagged target network with parameters $\theta^-$

$$\text{Target} = r + \gamma \max_{a'} Q(s', a'; \theta^-) \quad \text{(stable)}$$

**Update strategies**:
- Hard update every $C$ steps: $\theta^- \leftarrow \theta$
- Or soft update at every step: $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$, with $\tau \ll 1$
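
Both update rules can be sketched in a few lines of PyTorch. The function names `hard_update` and `soft_update` are hypothetical helpers introduced here for illustration, and the single `nn.Linear` layer is only a stand-in for a Q-network.

```python
import torch
import torch.nn as nn


def hard_update(online: nn.Module, target: nn.Module) -> None:
    """Copy all parameters: theta_minus <- theta."""
    target.load_state_dict(online.state_dict())


@torch.no_grad()
def soft_update(online: nn.Module, target: nn.Module, tau: float) -> None:
    """Polyak averaging: theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    for p_target, p_online in zip(target.parameters(), online.parameters()):
        p_target.mul_(1.0 - tau).add_(p_online, alpha=tau)


torch.manual_seed(0)
online = nn.Linear(4, 2)
target = nn.Linear(4, 2)
hard_update(online, target)      # start from identical weights

with torch.no_grad():
    online.weight.add_(1.0)      # pretend a gradient step moved the online network

soft_update(online, target, tau=0.005)
gap = (online.weight - target.weight).abs().max().item()
print(f"max weight gap after one soft update: {gap:.4f}")
```

With a small $\tau$ the target network closes only a small fraction of the gap per step, so it trails the online network smoothly instead of jumping every $C$ steps.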

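### 2.3 DQN Loss Function

**TD error**:

$$\delta = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)$$

For a terminal transition ($d = 1$) the bootstrap term is dropped, so the target reduces to $r$.

**Loss function (MSE)**:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\delta^2\right]$$

**Gradient update** (the target is treated as a constant, so no gradient flows through $\theta^-$):

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$$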
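
As a worked example of the formulas above, the following NumPy sketch computes targets, TD errors, and the MSE loss for a hypothetical mini-batch of three transitions. All numbers are made up for illustration; the `(1 - dones)` mask is the usual way to drop the bootstrap term at terminal states.

```python
import numpy as np

gamma = 0.99

# Hypothetical mini-batch of 3 transitions with 2 actions each.
q_online = np.array([[1.0, 2.0], [0.5, 0.0], [3.0, 1.0]])       # Q(s, ·; θ)
q_target_next = np.array([[2.0, 4.0], [1.0, 1.5], [9.0, 9.0]])  # Q(s', ·; θ⁻)
actions = np.array([1, 0, 0])
rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])   # the last transition is terminal

# Q(s, a; θ) for the actions that were actually taken.
q_taken = q_online[np.arange(len(actions)), actions]

# Target: r + γ max_a' Q(s', a'; θ⁻), with the bootstrap term masked at terminals.
targets = rewards + gamma * q_target_next.max(axis=1) * (1.0 - dones)

td_errors = targets - q_taken
loss = np.mean(td_errors ** 2)

print("targets:  ", targets)
print("TD errors:", td_errors)
print("MSE loss: ", loss)
```

Note how the third transition ignores the large next-state values entirely: because it is terminal, its target is just the reward.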

---

## Part 3: Environment Setup

In [None]:
# Import core libraries
import numpy as np
import random
from collections import deque
from dataclasses import dataclass
from typing import Tuple, List, Optional

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import matplotlib.pyplot as plt

In [None]:
# Import Gymnasium
try:
    import gymnasium as gym
    HAS_GYM = True
except ImportError:
    HAS_GYM = False
    print("Please install gymnasium: pip install gymnasium")

In [None]:
# Configuration
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Compute device: {DEVICE}")

# Plot configuration
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)

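### 3.1 Exploring the CartPole Environment

CartPole is a classic control problem: keep an upright pole balanced by moving a cart left or right.

| Property | Value |
|------|----|
| State space | 4-dimensional, continuous |
| Action space | 2 discrete actions |
| Reward | +1 per step (episodes are truncated at 500 steps) |
| Success criterion | Average return ≥ 475 over 100 consecutive episodes |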

In [None]:
# Explore the environment
if HAS_GYM:
    env = gym.make('CartPole-v1')
    
    print("CartPole-v1 environment info:")
    print(f"  Observation space: {env.observation_space}")
    print(f"  Action space: {env.action_space}")
    
    state, _ = env.reset(seed=SEED)
    print("\nState components:")
    print(f"  [0] Cart position: {state[0]:.4f}")
    print(f"  [1] Cart velocity: {state[1]:.4f}")
    print(f"  [2] Pole angle: {state[2]:.4f}")
    print(f"  [3] Pole angular velocity: {state[3]:.4f}")
    
    env.close()

---

## Part 4: Experience Replay Buffer

In [None]:
@dataclass
class Transition:
    """A single-step transition."""
    state: np.ndarray
    action: int
    reward: float
    next_state: np.ndarray
    done: bool

In [None]:
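class ReplayBuffer:
    """
    Experience replay buffer.
    
    Complexity:
        push: O(1); once full, the oldest transition is evicted
        sample: batch_size random index lookups; indexing into the middle
            of a deque is O(n), so sampling slows down for very large buffers
    """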
    
    def __init__(self, capacity: int = 100000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))
    
    def sample(self, batch_size: int) -> Tuple[np.ndarray, ...]:
        batch = random.sample(self.buffer, batch_size)
        
        states = np.array([t.state for t in batch], dtype=np.float32)
        actions = np.array([t.action for t in batch], dtype=np.int64)
        rewards = np.array([t.reward for t in batch], dtype=np.float32)
        next_states = np.array([t.next_state for t in batch], dtype=np.float32)
        dones = np.array([t.done for t in batch], dtype=np.float32)
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        return len(self.buffer)
    
    def is_ready(self, batch_size: int) -> bool:
        return len(self) >= batch_size

In [None]:
# Test the buffer
buffer = ReplayBuffer(capacity=1000)

for i in range(100):
    buffer.push(
        state=np.random.randn(4),
        action=i % 2,
        reward=1.0,
        next_state=np.random.randn(4),
        done=False
    )

states, actions, rewards, _, _ = buffer.sample(32)
print(f"Buffer size: {len(buffer)}")
print(f"Sampled shapes: states={states.shape}, actions={actions.shape}")
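
Because indexing into the middle of a `deque` is not constant time, large buffers are often stored in preallocated arrays instead. The following is a minimal sketch of that alternative, not the tutorial's implementation: the class name `ArrayReplayBuffer` and its fields are hypothetical, and it keeps the same `push`/`sample` interface as the `ReplayBuffer` above.

```python
import numpy as np


class ArrayReplayBuffer:
    """Hypothetical ring buffer backed by preallocated NumPy arrays."""

    def __init__(self, capacity: int, state_dim: int, seed: int = 0):
        self.capacity = capacity
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        self.pos = 0       # next write position
        self.size = 0      # number of valid entries
        self.rng = np.random.default_rng(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        i = self.pos
        self.states[i] = state
        self.actions[i] = action
        self.rewards[i] = reward
        self.next_states[i] = next_state
        self.dones[i] = float(done)
        self.pos = (i + 1) % self.capacity              # wrap around: overwrite the oldest entry
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        # Uniform sampling without replacement over the valid entries.
        idx = self.rng.choice(self.size, size=batch_size, replace=False)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])

    def __len__(self) -> int:
        return self.size


buf = ArrayReplayBuffer(capacity=5, state_dim=4)
for step in range(8):   # push more than the capacity to exercise wrap-around
    buf.push(np.full(4, step), step % 2, 1.0, np.full(4, step + 1), False)

states, actions, rewards, next_states, dones = buf.sample(3)
print(len(buf), states.shape, sorted(buf.states[:, 0].tolist()))
```

The batch comes out of `sample` already stacked with the right dtypes, so no per-transition Python objects need to be unpacked on every training step.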

---

## Part 5: DQN Network Architecture

In [None]:
class DQNNetwork(nn.Module):
    """
    Standard DQN network.
    
    Architecture: State → FC → ReLU → FC → ReLU → FC → Q-values
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_dim)
        )
        
        self._init_weights()
    
    def _init_weights(self):
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
                nn.init.zeros_(m.bias)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

In [None]:
# Test the network
net = DQNNetwork(state_dim=4, action_dim=2).to(DEVICE)
x = torch.randn(32, 4).to(DEVICE)
q_values = net(x)

print(f"Input: {x.shape}")
print(f"Output: {q_values.shape}")
print(f"Parameter count: {sum(p.numel() for p in net.parameters()):,}")
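
The printed parameter count can be cross-checked by hand: each fully connected layer contributes `n_in * n_out` weights plus `n_out` biases. The helper `mlp_param_count` below is a hypothetical name used only for this check.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# state_dim=4, two hidden layers of 128 units, action_dim=2
print(f"{mlp_param_count([4, 128, 128, 2]):,}")
```

For the default sizes this is 640 + 16,512 + 258 parameters, which should match the count reported for `DQNNetwork(state_dim=4, action_dim=2)`.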

---

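## Part 6: Dueling DQN

### 6.1 Core Idea

Decompose the Q-value into a **state value** $V(s)$ and an **advantage function** $A(s,a)$:

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$$

Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be shifted between $V$ and $A$ without changing $Q$.

**Why does it work?**

- $V(s)$ learns the overall value of a state, independently of the action
- $A(s,a)$ focuses on the relative merit of each action
- Learning is more efficient in states where the choice of action matters little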
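
The aggregation formula can be tried on a tiny example. The sketch below uses NumPy and made-up numbers; `dueling_q` is a hypothetical helper that applies the formula above to a value of shape `(B, 1)` and advantages of shape `(B, |A|)`.

```python
import numpy as np


def dueling_q(value: np.ndarray, advantage: np.ndarray) -> np.ndarray:
    """Q = V + (A - mean over actions of A)."""
    return value + advantage - advantage.mean(axis=-1, keepdims=True)


value = np.array([[10.0]])
advantage = np.array([[1.0, 3.0]])

q = dueling_q(value, advantage)
q_shifted = dueling_q(value, advantage + 100.0)   # add a constant to every advantage

print("Q:", q)
print("Q with shifted advantages:", q_shifted)
print("mean of Q over actions:", q.mean(axis=-1))
```

Shifting all advantages by a constant leaves $Q$ unchanged, and the mean of $Q$ over actions equals $V(s)$, so the value stream is pinned to a well-defined quantity.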

In [None]:
class DuelingDQNNetwork(nn.Module):
    """
    Dueling DQN network.
    
    Architecture:
        State → shared layers → value stream     → V(s)
                              → advantage stream → A(s,a)
        Q(s,a) = V(s) + (A(s,a) - mean(A))
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        
        # Shared feature layer
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(inplace=True)
        )
        
        # Value stream V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1)
        )
        
        # Advantage stream A(s,a)
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.feature(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        
        # Aggregate: Q = V + (A - mean(A))
        q_values = value + (advantage - advantage.mean(dim=-1, keepdim=True))
        return q_values

In [None]:
# Test the Dueling network
dueling_net = DuelingDQNNetwork(state_dim=4, action_dim=2).to(DEVICE)
q_values = dueling_net(x)

print(f"Dueling DQN output: {q_values.shape}")
print(f"Parameter count: {sum(p.numel() for p in dueling_net.parameters()):,}")

---

## Part 7: DQN Agent

In [None]:
class DQNAgent:
    """
    DQN agent.
    
    Supports:
    - Standard DQN
    - Double DQN (decouples action selection from evaluation)
    - Dueling DQN (V-A decomposition)
    """
    
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 128,
        lr: float = 1e-3,
        gamma: float = 0.99,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.01,
        epsilon_decay: float = 0.995,
        buffer_size: int = 100000,
        batch_size: int = 64,
        target_update_freq: int = 100,
        double_dqn: bool = False,
        dueling: bool = False,
        device: str = 'auto'
    ):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.double_dqn = double_dqn
        
        # Device: 'auto' picks CUDA when available; any other value is used as given
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(device)
        
        # Networks (the target network starts as an exact copy of the online network)
        NetworkClass = DuelingDQNNetwork if dueling else DQNNetwork
        self.q_network = NetworkClass(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_network = NetworkClass(state_dim, action_dim, hidden_dim).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Optimizer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        
        # Experience replay
        self.buffer = ReplayBuffer(buffer_size)
        
        # Counter of gradient updates
        self.update_count = 0

In [None]:
# Action selection
def get_action(self, state: np.ndarray, training: bool = True) -> int:
    """ε-greedy action selection; pass training=False for a purely greedy policy."""
    if training and random.random() < self.epsilon:
        return random.randint(0, self.action_dim - 1)
    
    state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
    with torch.no_grad():
        q_values = self.q_network(state_t)
    return q_values.argmax(dim=1).item()

DQNAgent.get_action = get_action

In [None]:
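# Update step
def update(self, state, action, reward, next_state, done) -> Optional[float]:
    """Store a transition and run one training step; returns the loss, or None while the buffer is filling."""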
    self.buffer.push(state, action, reward, next_state, done)
    
    if not self.buffer.is_ready(self.batch_size):
        return None
    
    # Sample a mini-batch
    states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)
    
    states_t = torch.FloatTensor(states).to(self.device)
    actions_t = torch.LongTensor(actions).to(self.device)
    rewards_t = torch.FloatTensor(rewards).to(self.device)
    next_states_t = torch.FloatTensor(next_states).to(self.device)
    dones_t = torch.FloatTensor(dones).to(self.device)
    
    # Current Q-values of the actions that were taken
    current_q = self.q_network(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
    
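    # Target Q-values (no gradient flows through the target)
    with torch.no_grad():
        if self.double_dqn:
            # Double DQN: the online network selects the action, the target network evaluates it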
            next_actions = self.q_network(next_states_t).argmax(dim=1)
            next_q = self.target_network(next_states_t).gather(1, next_actions.unsqueeze(1)).squeeze(1)
        else:
            next_q = self.target_network(next_states_t).max(dim=1)[0]
        
        target_q = rewards_t + self.gamma * next_q * (1 - dones_t)
    
    # Loss and optimization
    loss = F.mse_loss(current_q, target_q)
    
    self.optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(self.q_network.parameters(), 10)
    self.optimizer.step()
    
    # Hard-update the target network every target_update_freq gradient steps
    self.update_count += 1
    if self.update_count % self.target_update_freq == 0:
        self.target_network.load_state_dict(self.q_network.state_dict())
    
    return loss.item()

DQNAgent.update = update

In [None]:
# Exploration-rate decay (called once per episode)
def decay_epsilon(self):
    self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

DQNAgent.decay_epsilon = decay_epsilon

---

## Part 8: Training and Evaluation

In [None]:
def train_dqn(
    env_name: str = 'CartPole-v1',
    num_episodes: int = 200,
    double_dqn: bool = False,
    dueling: bool = False,
    seed: int = 42,
    verbose: bool = True
) -> Tuple[DQNAgent, List[float]]:
    """Train a DQN agent and return it together with the per-episode returns."""
    if not HAS_GYM:
        return None, []
    
    # Set seeds
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
    # Create the environment
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    # Algorithm name
    algo_name = "DQN"
    if double_dqn:
        algo_name = "Double " + algo_name
    if dueling:
        algo_name = "Dueling " + algo_name
    
    if verbose:
        print(f"\n{'='*50}")
        print(f"Training {algo_name}")
        print(f"{'='*50}")
    
    # Create the agent
    agent = DQNAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        double_dqn=double_dqn,
        dueling=dueling
    )
    
    # Training loop
    rewards_history = []
    best_avg = float('-inf')
    
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        total_reward = 0
        done = False
        
        while not done:
            action = agent.get_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
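            done = terminated or truncated
            
            # Store only true termination as the done flag: a time-limit truncation
            # is not a terminal state, so the target should still bootstrap from next_state
            agent.update(state, action, reward, next_state, terminated)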
            state = next_state
            total_reward += reward
        
        agent.decay_epsilon()
        rewards_history.append(total_reward)
        
        if verbose and (episode + 1) % 25 == 0:
            avg = np.mean(rewards_history[-25:])
            best_avg = max(best_avg, avg)
            print(f"Episode {episode+1:4d} | Avg: {avg:7.2f} | Best: {best_avg:7.2f} | ε: {agent.epsilon:.3f}")
    
    env.close()
    return agent, rewards_history

In [None]:
# Train the baseline DQN
agent_dqn, rewards_dqn = train_dqn(
    num_episodes=150,
    double_dqn=False,
    dueling=False
)

In [None]:
# Train Double Dueling DQN
agent_dd, rewards_dd = train_dqn(
    num_episodes=150,
    double_dqn=True,
    dueling=True
)
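
Training returns are collected under ε-greedy exploration, so they understate what the learned greedy policy can do. A separate evaluation pass without exploration or learning gives a cleaner number. The sketch below is one possible way to do this and is not part of the original notebook: `evaluate` is a hypothetical helper that works with any environment following the Gymnasium `reset`/`step` signatures, and `FixedLengthEnv` is a toy stand-in used only so the example runs without extra dependencies.

```python
from typing import Callable, List


class FixedLengthEnv:
    """Toy stand-in that follows the Gymnasium reset/step signatures (hypothetical)."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return [0.0], {}

    def step(self, action):
        self.t += 1
        terminated = False
        truncated = self.t >= self.horizon   # the episode ends by time limit
        return [float(self.t)], 1.0, terminated, truncated, {}


def evaluate(env, policy: Callable, num_episodes: int = 10, seed: int = 0) -> List[float]:
    """Run the policy without exploration or learning; return per-episode returns."""
    returns = []
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        done, total = False, 0.0
        while not done:
            action = policy(state)
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns


returns = evaluate(FixedLengthEnv(horizon=5), policy=lambda s: 0, num_episodes=3)
print(returns)
```

With the agent trained above, the policy would be `lambda s: agent_dqn.get_action(s, training=False)` and the environment `gym.make('CartPole-v1')`; the mean of the returned list can then be compared against the success criterion.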

---

## Part 9: Visualizing Results

In [None]:
def plot_learning_curves(results: dict, window: int = 20):
    """Plot moving-average learning curves for comparison."""
    plt.figure(figsize=(12, 5))
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
    
    for idx, (name, rewards) in enumerate(results.items()):
        if len(rewards) >= window:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            plt.plot(smoothed, label=name, color=colors[idx % len(colors)], linewidth=2)
    
    plt.xlabel('Episode', fontsize=12)
    plt.ylabel('Total Reward', fontsize=12)
    plt.title('Learning Curves of DQN Variants', fontsize=14)
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Plot the comparison
if rewards_dqn and rewards_dd:
    plot_learning_curves({
        'DQN': rewards_dqn,
        'Double Dueling DQN': rewards_dd
    })
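
The smoothing in `plot_learning_curves` is a moving average computed with `np.convolve(..., mode='valid')`. A small example with made-up rewards shows what that call returns and why the smoothed curve is shorter than the raw one.

```python
import numpy as np

rewards = [10, 20, 30, 40, 50]
window = 3

# Each output point is the mean of `window` consecutive rewards;
# 'valid' keeps only positions where the window fits entirely.
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

print(smoothed)
print(len(rewards), "->", len(smoothed))
```

The output has `len(rewards) - window + 1` points, so point `i` of the smoothed curve summarizes episodes `i` through `i + window - 1` rather than episode `i` alone.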

---

## Part 10: Interactive Experiments

### Experiment 1: ε Decay Schedule

In [None]:
def visualize_epsilon_decay(decay_rates: List[float], episodes: int = 200):
    """Visualize different decay rates."""
    plt.figure(figsize=(10, 5))
    
    for decay in decay_rates:
        epsilons = []
        eps = 1.0
        for _ in range(episodes):
            epsilons.append(eps)
            eps = max(0.01, eps * decay)
        plt.plot(epsilons, label=f'decay={decay}')
    
    plt.xlabel('Episode')
    plt.ylabel('Epsilon')
    plt.title('ε-greedy exploration-rate decay')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

visualize_epsilon_decay([0.99, 0.995, 0.999])
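
The same schedule can be examined in closed form: with a multiplicative decay applied once per episode, $\varepsilon_n = \max(\varepsilon_{\text{end}}, \varepsilon_{\text{start}} \cdot d^{\,n})$. The helpers below (`episodes_to_floor` and `epsilon_after`, both hypothetical names) compute how many episodes it takes to reach the floor and what ε is after a given number of episodes, using the start value 1.0 and floor 0.01 from the plot above.

```python
import math


def episodes_to_floor(decay: float, eps_start: float = 1.0, eps_end: float = 0.01) -> int:
    """Smallest n with eps_start * decay**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))


def epsilon_after(n: int, decay: float, eps_start: float = 1.0, eps_end: float = 0.01) -> float:
    """Exploration rate after n multiplicative decay steps, clipped at eps_end."""
    return max(eps_end, eps_start * decay ** n)


for decay in (0.99, 0.995, 0.999):
    print(f"decay={decay}: floor reached after {episodes_to_floor(decay)} episodes, "
          f"epsilon after 150 episodes = {epsilon_after(150, decay):.3f}")
```

One practical consequence: with the agent's default decay of 0.995, a 150-episode run like the ones in Part 8 ends with ε still near 0.47, so roughly half of the actions in the final training episodes are random. A faster decay or a longer run may be worth trying if the training curves plateau early.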

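### Experiment 2: Double DQN Reduces Overestimation

**The overestimation problem**: the maximum over noisy estimates is biased upward,

$$\mathbb{E}[\max_a Q(s, a)] \geq \max_a \mathbb{E}[Q(s, a)]$$

**The Double DQN solution**:

$$y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)$$

- The online network selects the greedy action
- The target network evaluates the value of that action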
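
The bias can be reproduced with a small Monte Carlo experiment. The setup is an idealized assumption for illustration only: every action has a true value of 0, and two estimators add independent zero-mean Gaussian noise. In a real agent the online and target networks are correlated, so Double DQN reduces the bias rather than removing it.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials, num_actions, noise_std = 10_000, 4, 1.0

# Assumption: the true Q-value of every action is 0, so the true maximum is 0.
q_a = rng.normal(0.0, noise_std, size=(num_trials, num_actions))  # "online" estimates
q_b = rng.normal(0.0, noise_std, size=(num_trials, num_actions))  # "target" estimates

# Single estimator: select and evaluate with the same noisy values.
single = q_a.max(axis=1)

# Double estimator: select with one set of estimates, evaluate with the other.
best = q_a.argmax(axis=1)
double = q_b[np.arange(num_trials), best]

print("true max value:        0.000")
print(f"single estimator mean: {single.mean():.3f}")
print(f"double estimator mean: {double.mean():.3f}")
```

The single estimator reports a clearly positive value even though every action is worth 0, while the double estimator stays close to 0 because the noise that picked the action is independent of the noise that scores it.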

---

## Part 11: Unit Tests

In [None]:
def run_tests():
    """Run the unit tests."""
    print("Running unit tests...\n")
    passed = failed = 0
    
    # Test 1: ReplayBuffer
    try:
        buf = ReplayBuffer(100)
        for i in range(50):
            buf.push(np.random.randn(4), 0, 1.0, np.random.randn(4), False)
        assert len(buf) == 50
        s, a, r, ns, d = buf.sample(32)
        assert s.shape == (32, 4)
        print("Test 1 passed: ReplayBuffer")
        passed += 1
    except Exception as e:
        print(f"Test 1 failed: {e}")
        failed += 1
    
    # Test 2: DQNNetwork
    try:
        net = DQNNetwork(4, 2, 64)
        x = torch.randn(32, 4)
        out = net(x)
        assert out.shape == (32, 2)
        print("Test 2 passed: DQNNetwork")
        passed += 1
    except Exception as e:
        print(f"Test 2 failed: {e}")
        failed += 1
    
    # Test 3: DuelingDQNNetwork
    try:
        net = DuelingDQNNetwork(4, 2, 64)
        x = torch.randn(32, 4)
        out = net(x)
        assert out.shape == (32, 2)
        print("Test 3 passed: DuelingDQNNetwork")
        passed += 1
    except Exception as e:
        print(f"Test 3 failed: {e}")
        failed += 1
    
    # Test 4: DQNAgent
    try:
        agent = DQNAgent(4, 2, batch_size=32, device='cpu')
        state = np.random.randn(4).astype(np.float32)
        action = agent.get_action(state)
        assert 0 <= action < 2
        for _ in range(50):
            agent.update(state, 0, 1.0, state, False)
        print("Test 4 passed: DQNAgent")
        passed += 1
    except Exception as e:
        print(f"Test 4 failed: {e}")
        failed += 1
    
    print(f"\n{'='*40}")
    print(f"Results: {passed} passed, {failed} failed")
    print(f"{'='*40}")

run_tests()

---

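## Summary

### Key Takeaways

| Technique | Problem addressed | Method |
|------|-----------|------|
| Experience replay | Sample correlation | Store transitions and sample them uniformly at random |
| Target network | Unstable targets | A lagged copy of the network |
| Double DQN | Q-value overestimation | Separate action selection from evaluation |
| Dueling DQN | Learning efficiency | V-A value decomposition |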

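### Tuning Recommendations

| Parameter | Recommended range | Notes |
|------|----------|------|
| Learning rate | 1e-4 ~ 1e-3 | Unstable if too large |
| Batch size | 32 ~ 256 | Depends on available memory |
| γ | 0.99 | 0.95 works for short-horizon tasks |
| Target update | Every 100 ~ 1000 steps | Or soft update with τ=0.005 |
| ε decay (per episode) | 0.995 ~ 0.9999 | Depends on task difficulty |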

---

## References

1. Mnih et al., "Playing Atari with Deep Reinforcement Learning", 2013
2. Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015
3. van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning", AAAI 2016
4. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning", ICML 2016
