# An In-Depth Tutorial on Temporal Difference (TD) Learning

---

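## Overview

This tutorial takes a close look at temporal difference (TD) learning, one of the central ideas in reinforcement learning. TD methods combine the sampling idea of Monte Carlo methods with the bootstrapping idea of dynamic programming, and they underpin most of modern reinforcement learning.

### Learning Objectives

After completing this tutorial, you will be able to:

1. **Understand the core idea of TD learning**: grasp the bootstrapping principle of "updating a guess from a guess"
2. **Tell the TD algorithms apart**: TD(0), SARSA, Q-Learning, Expected SARSA, Double Q-Learning
3. **Understand eligibility traces**: see how TD(λ) unifies TD(0) and Monte Carlo
4. **Apply the methods**: implement and compare the TD algorithms in classic environments
5. **Analyze algorithm properties**: understand on-policy vs. off-policy learning and the bias-variance trade-off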

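### Table of Contents

1. [Intuition Behind TD Learning](#1.-Intuition-Behind-TD-Learning)
2. [Mathematical Foundations](#2.-Mathematical-Foundations)
3. [The TD(0) Algorithm in Detail](#3.-The-TD(0)-Algorithm-in-Detail)
4. [SARSA vs. Q-Learning](#4.-SARSA-vs.-Q-Learning)
5. [Expected SARSA and Double Q-Learning](#5.-Expected-SARSA-and-Double-Q-Learning)
6. [N-Step TD and TD(λ)](#6.-N-Step-TD-and-TD(λ))
7. [Experiment: The Cliff Walking Environment](#7.-Experiment:-The-Cliff-Walking-Environment)
8. [Experiment: Random Walk and Value Prediction](#8.-Experiment:-Random-Walk-and-Value-Prediction)
9. [Hyperparameter Analysis](#9.-Hyperparameter-Analysis)
10. [Summary and Next Steps](#10.-Summary-and-Next-Steps)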

## Environment Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# 设置matplotlib中文显示
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['figure.dpi'] = 100

# 导入本地模块
from td_algorithms import (
    TDConfig, TrainingMetrics,
    TD0ValueLearner, SARSA, ExpectedSARSA, QLearning,
    DoubleQLearning, NStepTD, TDLambda, SARSALambda,
    create_td_learner, EligibilityTraceType
)
from environments import (
    GridWorld, GridWorldConfig, CliffWalkingEnv,
    WindyGridWorld, RandomWalk, Action
)
from utils import (
    plot_training_curves, plot_value_heatmap,
    plot_policy_arrows, compute_rmse,
    visualize_cliff_walking_path
)

# 设置随机种子
np.random.seed(42)

print("环境准备完成!")

---

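## 1. Intuition Behind TD Learning

### 1.1 Three Learning Paradigms Compared

Reinforcement learning has three main paradigms for learning value functions:

| Method | When it updates | Information used | Characteristics |
|------|----------|----------|------|
| **Dynamic programming (DP)** | Every step | Full model of the environment | Requires known transition probabilities |
| **Monte Carlo (MC)** | End of episode | Complete episode return | Unbiased but high variance |
| **Temporal difference (TD)** | Every step | One-step reward + estimated value of the next state | Biased but low variance |

### 1.2 Core Insight: Bootstrapping

At the heart of TD learning is **bootstrapping**: using the current estimates to update the estimates. This sounds circular, but it works well in practice because every update also folds in one real observed reward:

Imagine you are learning to play chess:
- **MC approach**: play the whole game, then adjust the evaluation of every move according to whether you won or lost
- **TD approach**: make one move, then adjust the evaluation of the previous position according to the evaluation of the new one

Advantages of TD:
1. No need to wait for the game to end
2. Learning happens at every step
3. Updates have lower variance, which usually makes learning more stable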

In [None]:
# 可视化：TD vs MC 更新机制

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Monte Carlo 更新
ax1 = axes[0]
states_mc = ['S₀', 'S₁', 'S₂', 'S₃', 'S₄', 'End']
rewards_mc = [0, -1, -1, -1, 10]
ax1.scatter(range(6), [0]*6, s=200, c='blue', zorder=5)
for i, (s, r) in enumerate(zip(states_mc, rewards_mc + [0])):
    ax1.annotate(s, (i, 0.1), ha='center', fontsize=12)
    if i < 5:
        ax1.annotate(f'r={rewards_mc[i]}', (i+0.5, -0.1), ha='center', fontsize=10, color='gray')
        ax1.arrow(i+0.15, 0, 0.7, 0, head_width=0.03, head_length=0.05, fc='blue', ec='blue')

# 显示MC的回报计算
g_return = sum(rewards_mc)
ax1.annotate(f'G = {g_return}', (2.5, -0.3), ha='center', fontsize=14, 
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
ax1.annotate('等待回合结束\n然后用G更新所有V(s)', (2.5, 0.35), ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
ax1.set_xlim(-0.5, 5.5)
ax1.set_ylim(-0.5, 0.6)
ax1.axis('off')
ax1.set_title('Monte Carlo: 用完整回报更新', fontsize=13, fontweight='bold')

# TD 更新
ax2 = axes[1]
states_td = ['S₀', 'S₁']
ax2.scatter([0, 2], [0, 0], s=200, c='green', zorder=5)
ax2.annotate('S₀\nV=5', (0, 0.15), ha='center', fontsize=12)
ax2.annotate('S₁\nV=8', (2, 0.15), ha='center', fontsize=12)
ax2.annotate('r=-1', (1, -0.08), ha='center', fontsize=10, color='gray')
ax2.arrow(0.2, 0, 1.5, 0, head_width=0.03, head_length=0.1, fc='green', ec='green')

# TD目标和更新
ax2.annotate('TD目标 = r + γV(S₁) = -1 + 0.9×8 = 6.2', (1, -0.25), ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))
ax2.annotate('TD误差 δ = 6.2 - 5 = 1.2', (1, -0.4), ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))
ax2.annotate('立即更新V(S₀)\n无需等待回合结束', (1, 0.4), ha='center', fontsize=11,
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))
ax2.set_xlim(-0.5, 2.5)
ax2.set_ylim(-0.6, 0.7)
ax2.axis('off')
ax2.set_title('TD: 用单步奖励+估计更新', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

---

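## 2. Mathematical Foundations

### 2.1 Value Functions Recap

**State-value function** $V^\pi(s)$: the expected return when starting in state $s$ and following policy $\pi$

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$$

**Action-value function** $Q^\pi(s, a)$: the expected return when taking action $a$ in state $s$ and following policy $\pi$ thereafter

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$$

### 2.2 The TD(0) Update Rule

Here $\alpha \in (0, 1]$ is the learning rate (step size) and $\gamma \in [0, 1]$ is the discount factor.

**TD target**:
$$G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$$

**TD error**:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

**Update rule**:
$$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$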
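
As a quick worked example of this update, the sketch below applies a single TD(0) step to the numbers used in the TD panel of the Section 1 illustration ($V(S_0)=5$, $V(S_1)=8$, $r=-1$, $\gamma=0.9$). The helper name `td0_update` and the step size $\alpha=0.1$ are assumptions made for illustration; they are not taken from the tutorial's `td_algorithms` module.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # A terminal next state contributes no bootstrapped value.
    bootstrap = 0.0 if terminal else V[s_next]
    td_target = r + gamma * bootstrap
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error


# Illustrative value table (hypothetical state names).
V = {"S0": 5.0, "S1": 8.0}
delta = td0_update(V, "S0", r=-1.0, s_next="S1")
print(f"TD error = {delta:.2f}, updated V(S0) = {V['S0']:.2f}")
```

Only $V(S_0)$ changes: it moves a fraction $\alpha$ of the way from its old value toward the TD target, while $V(S_1)$ is left untouched.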

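### 2.3 Why Does TD Work?

Key insight: if the target bootstrapped from the true value function $V^\pi$, it would be an **unbiased** estimate of $V^\pi(S_t)$. This is simply the Bellman equation:

$$\mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\right] = V^\pi(s)$$

In practice we substitute the current estimate $V(S_{t+1})$ for the unknown $V^\pi(S_{t+1})$, which makes the TD target **biased**. That bias shrinks as $V$ converges to $V^\pi$.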
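
The identity above can be checked numerically. The sketch below uses the 5-state Random Walk introduced in Section 3.2 (equal chance of moving left or right, reward +1 only on reaching the right end, $\gamma=1$), whose true values are $1/6, \dots, 5/6$. The state indexing (0 and 6 terminal, 1 to 5 for A to E) and the function name `expected_td_target` are assumptions chosen for this illustration.

```python
import numpy as np

# True values for gamma = 1: terminals are 0, states A..E are 1/6 .. 5/6.
v_true = np.array([0, 1, 2, 3, 4, 5, 0]) / 6.0


def expected_td_target(V, s, gamma=1.0):
    """Average the one-step TD target over the two equally likely moves."""
    total = 0.0
    for s_next in (s - 1, s + 1):
        reward = 1.0 if s_next == 6 else 0.0  # +1 only at the right terminal
        total += 0.5 * (reward + gamma * V[s_next])
    return total


# With the true values, the expected target reproduces V^pi exactly (no bias).
for s in range(1, 6):
    assert np.isclose(expected_td_target(v_true, s), v_true[s])

# With an all-zero initial estimate, the expected target is biased.
v_guess = np.zeros(7)
bias = np.array([expected_td_target(v_guess, s) - v_true[s] for s in range(1, 6)])
print("Bias of the expected TD target under V = 0:", np.round(bias, 3))
```

The bias here comes entirely from the inaccurate bootstrap values, which is why it fades as the estimates improve.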

In [None]:
# Visualizing the TD(0) update

fig, ax = plt.subplots(figsize=(12, 6))

# Boxes of the TD update flow chart: (x, y, label, color).
# Backslashes in the mathtext are doubled because these are not raw strings
# ("\g" and "\d" are invalid escape sequences in ordinary string literals).
boxes = [
    (0.1, 0.7, 'Current state\n$S_t$', 'lightblue'),
    (0.4, 0.7, 'Immediate reward\n$R_{t+1}$', 'lightgreen'),
    (0.7, 0.7, 'Next state\n$S_{t+1}$', 'lightblue'),
    (0.7, 0.4, 'Estimated value\n$V(S_{t+1})$', 'lightyellow'),
    (0.4, 0.4, 'TD target\n$R_{t+1} + \\gamma V(S_{t+1})$', 'lightcoral'),
    (0.1, 0.4, 'Current estimate\n$V(S_t)$', 'lightyellow'),
    (0.25, 0.15, 'TD error\n$\\delta_t = $TD target$ - V(S_t)$', 'plum'),
]

for x, y, text, color in boxes:
    ax.add_patch(plt.Rectangle((x-0.08, y-0.08), 0.16, 0.16, 
                               facecolor=color, edgecolor='black', linewidth=2))
    ax.text(x, y, text, ha='center', va='center', fontsize=10)

# Arrows
arrows = [
    ((0.18, 0.7), (0.32, 0.7)),   # S_t -> R
    ((0.48, 0.7), (0.62, 0.7)),   # R -> S_{t+1}
    ((0.7, 0.62), (0.7, 0.48)),   # S_{t+1} -> V(S_{t+1})
    ((0.62, 0.4), (0.48, 0.4)),   # V(S_{t+1}) -> TD target
    ((0.4, 0.62), (0.4, 0.48)),   # R -> TD target
    ((0.32, 0.4), (0.18, 0.4)),   # TD target -> compared with V(S_t)
    ((0.1, 0.32), (0.2, 0.23)),   # V(S_t) -> delta
    ((0.4, 0.32), (0.3, 0.23)),   # TD target -> delta
]

for start, end in arrows:
    ax.annotate('', xy=end, xytext=start,
                arrowprops=dict(arrowstyle='->', color='gray', lw=2))

# Final update formula
ax.text(0.5, 0.02, r'Update: $V(S_t) \leftarrow V(S_t) + \alpha \cdot \delta_t$', 
        ha='center', fontsize=14, 
        bbox=dict(boxstyle='round', facecolor='gold', alpha=0.8))

ax.set_xlim(0, 1)
ax.set_ylim(0, 0.9)
ax.axis('off')
ax.set_title('TD(0) update flow chart', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

---

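## 3. The TD(0) Algorithm in Detail

### 3.1 TD(0) for Policy Evaluation

Given a fixed policy $\pi$, TD(0) estimates that policy's state-value function $V^\pi$.

**Algorithm**:
```
Initialize: V(s) arbitrarily, with V(terminal state) = 0

For each episode:
    Initialize S
    For each step of the episode:
        A ← action chosen by policy π in S
        Take A, observe R, S'
        V(S) ← V(S) + α[R + γV(S') - V(S)]
        S ← S'
    until S is a terminal state
```

### 3.2 The Random Walk Experiment

Random Walk is a classic environment for checking that TD prediction is correct:
- States: A-B-C-D-E (5 non-terminal states)
- Both ends are terminal states
- Each step moves left or right at random with equal probability
- Reaching the right end gives reward +1; reaching the left end gives reward 0
- The true values have a closed-form solution ($V(A), \dots, V(E) = 1/6, \dots, 5/6$ for $\gamma = 1$), which makes verification easy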
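
The closed-form values can be recovered by solving the Bellman equations $v = r + \gamma P v$ as a linear system. The sketch below does this with NumPy under the rules listed above; the function name `random_walk_true_values` is hypothetical and independent of the `get_true_values` method of the tutorial's `RandomWalk` class, whose implementation is not shown here.

```python
import numpy as np


def random_walk_true_values(n_states=5, gamma=1.0):
    """Solve (I - gamma * P) v = r for the non-terminal states of the walk."""
    P = np.zeros((n_states, n_states))  # transitions among non-terminal states
    r = np.zeros(n_states)              # expected one-step reward per state
    for i in range(n_states):
        if i > 0:
            P[i, i - 1] = 0.5           # step left to another non-terminal state
        if i < n_states - 1:
            P[i, i + 1] = 0.5           # step right to another non-terminal state
    # From the rightmost state, a right step (probability 0.5) ends with reward +1.
    r[n_states - 1] = 0.5
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)


values = random_walk_true_values(5)
print(np.round(values, 4))  # values for A..E
```

Transitions into terminal states are simply left out of `P`, which is what keeps the system solvable even at $\gamma = 1$.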

In [None]:
# Random Walk 环境演示

env = RandomWalk(n_states=5)

# 获取真实价值
true_values = env.get_true_values(gamma=1.0)
print("Random Walk 真实价值函数 (γ=1):")
print("状态:  T   A    B    C    D    E   T")
print(f"价值: {true_values}")

# 可视化
fig, ax = plt.subplots(figsize=(12, 4))

# 绘制状态
states = ['T(0)', 'A', 'B', 'C', 'D', 'E', 'T(1)']
x_pos = np.arange(7)
colors = ['gray'] + ['skyblue']*5 + ['gold']

for i, (x, state, color) in enumerate(zip(x_pos, states, colors)):
    circle = plt.Circle((x, 0.5), 0.3, color=color, ec='black', lw=2)
    ax.add_patch(circle)
    ax.text(x, 0.5, state, ha='center', va='center', fontsize=12, fontweight='bold')
    ax.text(x, 0.05, f'V={true_values[i]:.2f}', ha='center', fontsize=10)

# 绘制转移箭头
for i in range(1, 6):
    # 向左
    ax.annotate('', xy=(i-0.7, 0.65), xytext=(i-0.35, 0.65),
                arrowprops=dict(arrowstyle='->', color='red', lw=1.5))
    ax.text(i-0.5, 0.8, '50%', ha='center', fontsize=8, color='red')
    # 向右
    ax.annotate('', xy=(i+0.7, 0.35), xytext=(i+0.35, 0.35),
                arrowprops=dict(arrowstyle='->', color='blue', lw=1.5))
    ax.text(i+0.5, 0.2, '50%', ha='center', fontsize=8, color='blue')

ax.set_xlim(-0.5, 6.5)
ax.set_ylim(-0.2, 1.1)
ax.axis('off')
ax.set_title('Random Walk 环境', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# TD(0) vs. MC on the Random Walk

def td0_prediction(env, n_episodes=100, alpha=0.1, gamma=1.0):
    """TD(0) policy evaluation; returns the value table and per-episode RMSE."""
    V = np.zeros(env.n_total_states)
    rmse_history = []
    true_v = env.get_true_values(gamma)

    for episode in range(n_episodes):
        state, _ = env.reset()

        while True:
            # The action argument is a placeholder: the walk direction is
            # assumed to be sampled inside RandomWalk.step.
            next_state, reward, done, _, _ = env.step(0)

            # TD(0) update (no bootstrapping from a terminal state)
            td_target = reward + gamma * V[next_state] * (not done)
            td_error = td_target - V[state]
            V[state] += alpha * td_error

            state = next_state
            if done:
                break

        # RMSE over the non-terminal states only
        rmse = np.sqrt(np.mean((V[1:-1] - true_v[1:-1])**2))
        rmse_history.append(rmse)

    return V, rmse_history


def mc_prediction(env, n_episodes=100, alpha=0.1, gamma=1.0):
    """Every-visit, constant-step-size Monte Carlo policy evaluation."""
    V = np.zeros(env.n_total_states)
    rmse_history = []
    true_v = env.get_true_values(gamma)

    for episode in range(n_episodes):
        state, _ = env.reset()
        trajectory = [state]
        rewards = []

        # Collect the complete trajectory
        while True:
            next_state, reward, done, _, _ = env.step(0)
            rewards.append(reward)
            trajectory.append(next_state)
            if done:
                break

        # MC update: accumulate the return backwards through the episode
        G = 0
        for t in range(len(rewards) - 1, -1, -1):
            G = rewards[t] + gamma * G
            state = trajectory[t]
            V[state] += alpha * (G - V[state])

        # RMSE over the non-terminal states only
        rmse = np.sqrt(np.mean((V[1:-1] - true_v[1:-1])**2))
        rmse_history.append(rmse)

    return V, rmse_history


# 运行实验
env = RandomWalk(n_states=5)
n_episodes = 100
n_runs = 100  # 多次运行取平均

td_rmse_all = []
mc_rmse_all = []

print("运行TD(0) vs MC比较实验...")
for run in range(n_runs):
    np.random.seed(run)
    _, td_rmse = td0_prediction(env, n_episodes, alpha=0.1)
    _, mc_rmse = mc_prediction(env, n_episodes, alpha=0.1)
    td_rmse_all.append(td_rmse)
    mc_rmse_all.append(mc_rmse)

td_rmse_mean = np.mean(td_rmse_all, axis=0)
mc_rmse_mean = np.mean(mc_rmse_all, axis=0)
print("完成!")

# 绘图
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# RMSE曲线
ax1 = axes[0]
ax1.plot(td_rmse_mean, label='TD(0)', color='blue', linewidth=2)
ax1.plot(mc_rmse_mean, label='Monte Carlo', color='red', linewidth=2)
ax1.set_xlabel('回合', fontsize=12)
ax1.set_ylabel('RMSE', fontsize=12)
ax1.set_title('TD(0) vs MC: 价值估计误差', fontsize=13, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# 最终价值估计
ax2 = axes[1]
np.random.seed(42)
V_td, _ = td0_prediction(env, 100, alpha=0.1)
np.random.seed(42)
V_mc, _ = mc_prediction(env, 100, alpha=0.1)
true_v = env.get_true_values(1.0)

states_label = ['A', 'B', 'C', 'D', 'E']
x = np.arange(5)
width = 0.25

ax2.bar(x - width, true_v[1:-1], width, label='真实价值', color='green', alpha=0.8)
ax2.bar(x, V_td[1:-1], width, label='TD(0)估计', color='blue', alpha=0.8)
ax2.bar(x + width, V_mc[1:-1], width, label='MC估计', color='red', alpha=0.8)
ax2.set_xticks(x)
ax2.set_xticklabels(states_label)
ax2.set_xlabel('状态', fontsize=12)
ax2.set_ylabel('价值', fontsize=12)
ax2.set_title('100回合后的价值估计', fontsize=13, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

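### 3.3 Key Observations

The experiment above illustrates three points:

1. **TD(0) usually learns faster**: at the same learning rate, TD(0) typically reduces the error more quickly than MC on this task
2. **Lower variance**: each TD(0) target depends on a single random transition rather than on a whole episode
3. **Biased, but the bias vanishes**: early TD(0) targets are biased by the initial value estimates, yet the estimates still converge to the true values when the step size is decayed appropriately (with a constant $\alpha$ they keep fluctuating around them)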

---

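## 4. SARSA vs. Q-Learning

### 4.1 From Policy Evaluation to Control

TD(0) evaluates a fixed policy. **SARSA** and **Q-Learning** extend the TD idea to the **control problem**, where the policy is learned and improved at the same time.

### 4.2 SARSA (State-Action-Reward-State-Action)

**On-policy TD control**: learns the Q-values of the behavior policy itself

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

Key point: the TD target uses the **next action actually taken**, $A_{t+1}$

### 4.3 Q-Learning

**Off-policy TD control**: learns the optimal Q-values regardless of the behavior policy, provided every state-action pair keeps being visited

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

Key point: the TD target uses the **maximum Q-value**, $\max_a Q(S_{t+1}, a)$

### 4.4 Core Differences

| Property | SARSA | Q-Learning |
|------|-------|------------|
| Type | On-policy | Off-policy |
| TD target | $R + \gamma Q(S', A')$ | $R + \gamma \max_a Q(S', a)$ |
| Policy being learned | Behavior policy (including exploration) | Optimal policy (no exploration) |
| Safety during training | Safer | More aggressive |
| Typical behavior | Avoids risky states | Pursues the optimal path |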
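
The difference in the table comes down to a single line in the target computation. The sketch below contrasts the two targets on a tiny hand-made Q-table; the function names, the array layout `Q[state, action]`, and the numbers are all illustrative assumptions rather than the API of the tutorial's `td_algorithms` module.

```python
import numpy as np


def sarsa_target(Q, r, s_next, a_next, gamma=0.9, done=False):
    """On-policy target: bootstrap from the action actually taken next."""
    return r if done else r + gamma * Q[s_next, a_next]


def q_learning_target(Q, r, s_next, gamma=0.9, done=False):
    """Off-policy target: bootstrap from the greedy action, whatever is taken."""
    return r if done else r + gamma * np.max(Q[s_next])


# Two states, two actions; in state 1 action 1 looks much better than action 0.
Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])

# Suppose the epsilon-greedy behavior policy explores and picks action 0 in state 1.
sarsa_t = sarsa_target(Q, r=-1.0, s_next=1, a_next=0)
qlearn_t = q_learning_target(Q, r=-1.0, s_next=1)
print(f"SARSA target: {sarsa_t:.2f}, Q-Learning target: {qlearn_t:.2f}")
```

When the next action happens to be greedy, both targets coincide; they differ only on exploratory steps, which is exactly where SARSA "feels" the cost of exploration.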

In [None]:
# SARSA vs Q-Learning 算法对比图示

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# SARSA
ax1 = axes[0]
ax1.text(0.5, 0.9, 'SARSA (On-Policy)', ha='center', fontsize=14, fontweight='bold')

# 状态转移
ax1.add_patch(plt.Rectangle((0.15, 0.5), 0.15, 0.1, facecolor='lightblue', edgecolor='black'))
ax1.text(0.225, 0.55, '$S_t$', ha='center', fontsize=11)
ax1.add_patch(plt.Rectangle((0.35, 0.5), 0.15, 0.1, facecolor='lightgreen', edgecolor='black'))
ax1.text(0.425, 0.55, '$A_t$', ha='center', fontsize=11)
ax1.add_patch(plt.Rectangle((0.55, 0.5), 0.15, 0.1, facecolor='lightyellow', edgecolor='black'))
ax1.text(0.625, 0.55, '$R$', ha='center', fontsize=11)
ax1.add_patch(plt.Rectangle((0.75, 0.5), 0.15, 0.1, facecolor='lightblue', edgecolor='black'))
ax1.text(0.825, 0.55, "$S'$", ha='center', fontsize=11)

# A' (实际选择的动作)
ax1.add_patch(plt.Rectangle((0.75, 0.3), 0.15, 0.1, facecolor='lightcoral', edgecolor='black', linewidth=2))
ax1.text(0.825, 0.35, "$A'$", ha='center', fontsize=11, fontweight='bold')
ax1.annotate('', xy=(0.825, 0.4), xytext=(0.825, 0.5),
            arrowprops=dict(arrowstyle='->', color='red', lw=2))
ax1.text(0.88, 0.45, 'ε-greedy\n选择', fontsize=9, color='red')

# 更新公式
ax1.text(0.5, 0.15, r"$Q(S_t, A_t) \leftarrow Q + \alpha[R + \gamma Q(S', A') - Q]$",
         ha='center', fontsize=12, bbox=dict(facecolor='lightyellow', alpha=0.8))

ax1.text(0.5, 0.02, '使用实际执行的动作A\'来更新', ha='center', fontsize=11, color='red')

# 箭头
for x1, x2 in [(0.3, 0.35), (0.5, 0.55), (0.7, 0.75)]:
    ax1.annotate('', xy=(x2, 0.55), xytext=(x1, 0.55),
                arrowprops=dict(arrowstyle='->', color='gray'))

ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.axis('off')

# Q-Learning
ax2 = axes[1]
ax2.text(0.5, 0.9, 'Q-Learning (Off-Policy)', ha='center', fontsize=14, fontweight='bold')

# 状态转移
ax2.add_patch(plt.Rectangle((0.15, 0.5), 0.15, 0.1, facecolor='lightblue', edgecolor='black'))
ax2.text(0.225, 0.55, '$S_t$', ha='center', fontsize=11)
ax2.add_patch(plt.Rectangle((0.35, 0.5), 0.15, 0.1, facecolor='lightgreen', edgecolor='black'))
ax2.text(0.425, 0.55, '$A_t$', ha='center', fontsize=11)
ax2.add_patch(plt.Rectangle((0.55, 0.5), 0.15, 0.1, facecolor='lightyellow', edgecolor='black'))
ax2.text(0.625, 0.55, '$R$', ha='center', fontsize=11)
ax2.add_patch(plt.Rectangle((0.75, 0.5), 0.15, 0.1, facecolor='lightblue', edgecolor='black'))
ax2.text(0.825, 0.55, "$S'$", ha='center', fontsize=11)

# max Q (最大化)
ax2.add_patch(plt.Rectangle((0.75, 0.3), 0.15, 0.1, facecolor='lightgreen', edgecolor='black', linewidth=2))
ax2.text(0.825, 0.35, r"$\max_a Q$", ha='center', fontsize=10, fontweight='bold')
ax2.annotate('', xy=(0.825, 0.4), xytext=(0.825, 0.5),
            arrowprops=dict(arrowstyle='->', color='green', lw=2))
ax2.text(0.88, 0.45, '取最大\nQ值', fontsize=9, color='green')

# 更新公式
ax2.text(0.5, 0.15, r"$Q(S_t, A_t) \leftarrow Q + \alpha[R + \gamma \max_a Q(S', a) - Q]$",
         ha='center', fontsize=12, bbox=dict(facecolor='lightgreen', alpha=0.5))

ax2.text(0.5, 0.02, '使用最优动作的Q值来更新', ha='center', fontsize=11, color='green')

# 箭头
for x1, x2 in [(0.3, 0.35), (0.5, 0.55), (0.7, 0.75)]:
    ax2.annotate('', xy=(x2, 0.55), xytext=(x1, 0.55),
                arrowprops=dict(arrowstyle='->', color='gray'))

ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.axis('off')

plt.tight_layout()
plt.show()

---

## 5. Expected SARSA and Double Q-Learning

### 5.1 Expected SARSA

Expected SARSA uses the **expectation** of the next state's Q-values under the policy instead of a single sampled action:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A)] - Q(S_t, A_t)\right]$$

For an ε-greedy policy:
$$\mathbb{E}_\pi[Q(S', A)] = \frac{\epsilon}{|A|} \sum_a Q(S', a) + (1-\epsilon) \max_a Q(S', a)$$

**Advantage**: removes the variance introduced by sampling the next action
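
The ε-greedy expectation can be computed directly from the action probabilities. The sketch below is one minimal way to do so; `epsilon_greedy_probs` and `expected_sarsa_target` are hypothetical helper names, and the Q-values are made up for illustration.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)  # uniform exploration mass
    probs[np.argmax(q_values)] += 1.0 - epsilon      # extra mass on the greedy action
    return probs


def expected_sarsa_target(Q, r, s_next, epsilon=0.1, gamma=0.9, done=False):
    """TD target that averages Q(s', .) under the epsilon-greedy policy."""
    if done:
        return r
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    return r + gamma * float(np.dot(probs, Q[s_next]))


Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
probs = epsilon_greedy_probs(Q[1], epsilon=0.1)
target = expected_sarsa_target(Q, r=-1.0, s_next=1)
print(f"probabilities: {probs}, target: {target:.3f}")
```

Setting `epsilon=0` turns this into the Q-Learning target, which is one way to see Q-Learning as a special case of Expected SARSA with a greedy target policy.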

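### 5.2 Double Q-Learning

The max operator in Q-Learning causes **maximization bias** (overestimation). Double Q-Learning addresses it by maintaining two Q-tables:

```
With probability 0.5:
    a* = argmax_a Q_A(S', a)                                  # select with Q_A
    Q_A(S, A) ← Q_A(S, A) + α[R + γQ_B(S', a*) - Q_A(S, A)]   # evaluate with Q_B
Otherwise:
    a* = argmax_a Q_B(S', a)                                  # select with Q_B
    Q_B(S, A) ← Q_B(S, A) + α[R + γQ_A(S', a*) - Q_B(S, A)]   # evaluate with Q_A
```

**Key insight**: decoupling action *selection* from action *evaluation* keeps the same noise from entering the target twice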
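
The pseudocode above translates almost line for line into NumPy. The following is a minimal sketch, assuming tabular Q-values stored as `Q[state, action]` arrays and a NumPy random `Generator`; the function name `double_q_update` and the example numbers are illustrative and not taken from the tutorial's `DoubleQLearning` class.

```python
import numpy as np


def double_q_update(Q_a, Q_b, s, a, r, s_next, done, rng, alpha=0.1, gamma=0.99):
    """One Double Q-Learning step; modifies exactly one of the two tables in place."""
    # A fair coin decides which table selects the action (and gets updated).
    if rng.random() < 0.5:
        selector, evaluator = Q_a, Q_b
    else:
        selector, evaluator = Q_b, Q_a

    if done:
        target = r
    else:
        a_star = int(np.argmax(selector[s_next]))        # selection
        target = r + gamma * evaluator[s_next, a_star]   # evaluation by the other table
    selector[s, a] += alpha * (target - selector[s, a])


rng = np.random.default_rng(0)
Q_a = np.zeros((2, 2))
Q_b = np.zeros((2, 2))
Q_a[1] = [0.5, 2.0]   # Q_A is optimistic about action 1 in state 1
Q_b[1] = [1.0, 0.5]   # Q_B disagrees
double_q_update(Q_a, Q_b, s=0, a=0, r=0.0, s_next=1, done=False, rng=rng)
print("Q_A[0, 0] =", Q_a[0, 0], " Q_B[0, 0] =", Q_b[0, 0])
```

In this example either branch bootstraps from a cross-evaluated value of 0.5, whereas a plain max over `Q_a[1]` would have used the optimistic 2.0.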

In [None]:
# 最大化偏差演示

np.random.seed(42)

# 模拟场景：有多个动作，每个动作的真实价值为0，但估计有噪声
n_actions = 10
true_value = 0.0
noise_std = 1.0
n_samples = 1000

max_estimates = []
double_estimates = []

for _ in range(n_samples):
    # 标准Q-Learning：估计值 = 真实值 + 噪声
    q_estimates = true_value + np.random.randn(n_actions) * noise_std
    max_estimates.append(np.max(q_estimates))
    
    # Double Q-Learning：两组独立估计
    q_a = true_value + np.random.randn(n_actions) * noise_std
    q_b = true_value + np.random.randn(n_actions) * noise_std
    best_action = np.argmax(q_a)  # 用Q_A选择
    double_estimates.append(q_b[best_action])  # 用Q_B评估

# 绘图
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 直方图
ax1 = axes[0]
ax1.hist(max_estimates, bins=50, alpha=0.7, label=f'Q-Learning (max)', color='red', density=True)
ax1.hist(double_estimates, bins=50, alpha=0.7, label=f'Double Q-Learning', color='blue', density=True)
ax1.axvline(true_value, color='green', linestyle='--', linewidth=2, label=f'真实值 = {true_value}')
ax1.axvline(np.mean(max_estimates), color='red', linestyle=':', linewidth=2)
ax1.axvline(np.mean(double_estimates), color='blue', linestyle=':', linewidth=2)
ax1.set_xlabel('估计值', fontsize=12)
ax1.set_ylabel('密度', fontsize=12)
ax1.set_title('最大化偏差演示', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)

# 期望值比较
ax2 = axes[1]
methods = ['真实值', 'Q-Learning\n(max)', 'Double\nQ-Learning']
values = [true_value, np.mean(max_estimates), np.mean(double_estimates)]
colors = ['green', 'red', 'blue']
bars = ax2.bar(methods, values, color=colors, alpha=0.7)

# 添加数值标签
for bar, val in zip(bars, values):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             f'{val:.3f}', ha='center', fontsize=12)

ax2.axhline(true_value, color='green', linestyle='--', alpha=0.5)
ax2.set_ylabel('期望估计值', fontsize=12)
ax2.set_title(f'期望估计值比较 (动作数={n_actions})', fontsize=13, fontweight='bold')

# 添加说明
bias = np.mean(max_estimates) - true_value
ax2.text(1, np.mean(max_estimates) + 0.3, f'过估计偏差: {bias:.3f}', 
         ha='center', fontsize=11, color='red')

plt.tight_layout()
plt.show()

print(f"分析结果:")
print(f"  真实值: {true_value}")
print(f"  Q-Learning期望估计: {np.mean(max_estimates):.4f} (过估计 {np.mean(max_estimates) - true_value:.4f})")
print(f"  Double Q-Learning期望估计: {np.mean(double_estimates):.4f} (偏差 {np.mean(double_estimates) - true_value:.4f})")

---

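## 6. N-Step TD and TD(λ)

### 6.1 N-Step TD

N-step TD uses the actual rewards from the next n steps and then bootstraps from the estimated value of the state reached after those n steps, $S_{t+n}$:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

- n=1: TD(0)
- n=∞: Monte Carlo (no bootstrapping once n reaches the end of the episode)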
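
To make the two limiting cases concrete, the sketch below computes $G_t^{(n)}$ for a recorded episode. The reward sequence matches the Monte Carlo panel of the Section 1 illustration; the value estimates, the choice $\gamma = 0.9$, and the function name `n_step_return` are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return from time t.

    rewards[k] holds R_{k+1}; values[k] holds V(S_k), with one extra entry
    for the terminal state (which must be 0).
    """
    T = len(rewards)        # episode length
    end = min(t + n, T)     # truncate at the end of the episode
    G = 0.0
    for k in range(t, end):
        G += gamma ** (k - t) * rewards[k]
    if end < T:             # bootstrap only if the episode has not ended
        G += gamma ** n * values[end]
    return G


rewards = [0, -1, -1, -1, 10]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]   # hypothetical estimates; terminal = 0

one_step = n_step_return(rewards, values, t=0, n=1, gamma=0.9)     # TD(0) target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.9)
mc_return = n_step_return(rewards, values, t=0, n=100, gamma=0.9)  # full return
print(one_step, two_step, mc_return)
```

Any `n` at least as large as the remaining episode length gives the same Monte Carlo return, because the bootstrap term is dropped once the terminal state is reached.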

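### 6.2 TD(λ) and Eligibility Traces

Rather than committing to one particular n, TD(λ) takes a **geometrically weighted average** of all n-step returns:

$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

In an episodic task that ends at time $T$, every n-step return with $n \ge T - t$ equals the full return $G_t$, so the weight left over after the episode ends, $\lambda^{T-t-1}$, falls on $G_t$. Setting $\lambda = 0$ therefore recovers the TD(0) target, and $\lambda = 1$ recovers the Monte Carlo return.

**Eligibility traces** give an efficient implementation:

$$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$

An eligibility trace records how much each state is "responsible" for the current TD error, so a single-step update achieves a multi-step effect.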
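
The forward-view definition of the λ-return can be evaluated directly for a finished episode, which makes the two limiting cases easy to verify. The sketch below is self-contained and uses made-up rewards and value estimates; `lambda_return` is a hypothetical helper name, and the forward view shown here is the quantity that eligibility traces approximate online, not the trace algorithm itself.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return for an episodic task.

    rewards[k] holds R_{k+1}; values[k] holds V(S_k), terminal value last (0).
    """
    T = len(rewards)

    def n_step(n):
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:
            G += gamma ** n * values[end]
        return G

    # Geometric weights on the truncated n-step returns ...
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    # ... and the remaining weight on the full Monte Carlo return.
    G_lam += lam ** (T - t - 1) * n_step(T - t)
    return G_lam


rewards = [0, -1, -1, -1, 10]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]   # hypothetical estimates; terminal = 0

td0_target = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.9)
mixed = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.9)
mc_return = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.9)
print(td0_target, mixed, mc_return)
```

Intermediate values of λ blend the low-variance one-step target with the unbiased full return.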

In [None]:
# λ-回报的几何加权可视化

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 不同λ的权重分布
ax1 = axes[0]
n_steps = 15
lambdas = [0.0, 0.5, 0.8, 0.95, 1.0]
colors = plt.cm.viridis(np.linspace(0, 1, len(lambdas)))

for lam, color in zip(lambdas, colors):
    if lam < 1:
        weights = [(1 - lam) * (lam ** (n-1)) for n in range(1, n_steps + 1)]
    else:
        weights = [0] * (n_steps - 1) + [1]  # MC: 只有最后一个
    
    ax1.plot(range(1, n_steps + 1), weights, 'o-', 
             label=f'λ={lam}', color=color, markersize=6)

ax1.set_xlabel('n-step回报', fontsize=12)
ax1.set_ylabel('权重 (1-λ)λⁿ⁻¹', fontsize=12)
ax1.set_title('TD(λ)中各n-step回报的权重', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# 资格迹衰减示意
ax2 = axes[1]
gamma = 0.99
lam = 0.9

# 模拟一个访问序列
# 状态1在t=0访问，状态2在t=5访问
t = np.arange(20)

# 状态1的资格迹
e1 = np.zeros(20)
e1[0] = 1
for i in range(1, 20):
    e1[i] = gamma * lam * e1[i-1]

# 状态2的资格迹
e2 = np.zeros(20)
e2[5] = 1
for i in range(6, 20):
    e2[i] = gamma * lam * e2[i-1]

ax2.plot(t, e1, 'b-o', label='状态1 (t=0访问)', markersize=5)
ax2.plot(t, e2, 'r-s', label='状态2 (t=5访问)', markersize=5)
ax2.axvline(0, color='blue', linestyle='--', alpha=0.5)
ax2.axvline(5, color='red', linestyle='--', alpha=0.5)

ax2.set_xlabel('时间步', fontsize=12)
ax2.set_ylabel('资格迹 E(s)', fontsize=12)
ax2.set_title(f'资格迹衰减 (γ={gamma}, λ={lam})', fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

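## 7. Experiment: The Cliff Walking Environment

Cliff Walking is the classic environment for showing how SARSA and Q-Learning differ:

```
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
S C C C C C C C C C C G
```

- S: start
- G: goal
- C: cliff (falling in sends the agent back to the start with a reward of -100)
- All other steps give a reward of -1

### Expected Behavior
- **SARSA**: learns a **safe path** away from the cliff (it accounts for the risk of exploratory moves)
- **Q-Learning**: learns the **optimal path** along the cliff edge (it assumes greedy action selection from then on)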
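
To see what "safe" and "optimal" cost in this grid, the sketch below implements the transition rules listed above and scores the two candidate paths. It is a stand-alone approximation written for illustration: the names `cliff_step` and `rollout`, the action numbering (0 = up, 1 = right, 2 = down, 3 = left), and the choice not to end the episode on a fall are assumptions, and the tutorial's `CliffWalkingEnv` may differ in such details.

```python
HEIGHT, WIDTH = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left


def cliff_step(state, action):
    """Return (next_state, reward, done) for one move on the cliff grid."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), HEIGHT - 1)   # stay inside the grid
    col = min(max(state[1] + d_col, 0), WIDTH - 1)
    if row == 3 and 1 <= col <= 10:                   # stepped into the cliff
        return START, -100, False
    return (row, col), -1, (row, col) == GOAL


def rollout(actions):
    """Follow a fixed action sequence from the start; return (state, total reward)."""
    state, total = START, 0
    for action in actions:
        state, reward, done = cliff_step(state, action)
        total += reward
        if done:
            break
    return state, total


edge_path = [0] + [1] * 11 + [2]           # hug the cliff edge
safe_path = [0] * 3 + [1] * 11 + [2] * 3   # go along the top row
print("edge path:", rollout(edge_path))
print("safe path:", rollout(safe_path))
```

The edge path is shorter when executed perfectly, but a single exploratory "down" move along it costs -100, which is the risk SARSA's on-policy targets take into account.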

In [None]:
# Cliff Walking实验

env = CliffWalkingEnv()

# 配置
config = TDConfig(
    alpha=0.5,
    gamma=0.99,
    epsilon=0.1
)

n_episodes = 500

# 训练SARSA
print("训练SARSA...")
sarsa_learner = create_td_learner('sarsa', config=config)
sarsa_metrics = sarsa_learner.train(env, n_episodes=n_episodes, log_interval=n_episodes+1)

# 训练Q-Learning
print("训练Q-Learning...")
qlearn_learner = create_td_learner('q_learning', config=config)
qlearn_metrics = qlearn_learner.train(env, n_episodes=n_episodes, log_interval=n_episodes+1)

# 训练Expected SARSA
print("训练Expected SARSA...")
exp_sarsa_learner = create_td_learner('expected_sarsa', config=config)
exp_sarsa_metrics = exp_sarsa_learner.train(env, n_episodes=n_episodes, log_interval=n_episodes+1)

print("训练完成!")

In [None]:
# 绘制学习曲线

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

window = 20

# 奖励曲线
ax1 = axes[0]
for metrics, name, color in [
    (sarsa_metrics, 'SARSA', 'blue'),
    (qlearn_metrics, 'Q-Learning', 'red'),
    (exp_sarsa_metrics, 'Expected SARSA', 'green')
]:
    rewards = metrics.episode_rewards
    if len(rewards) >= window:
        smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
        ax1.plot(smoothed, label=name, color=color, linewidth=2)

ax1.set_xlabel('回合', fontsize=12)
ax1.set_ylabel('累积奖励', fontsize=12)
ax1.set_title('Cliff Walking 学习曲线', fontsize=13, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# 最后100回合的统计
ax2 = axes[1]
last_100 = 100

methods = ['SARSA', 'Q-Learning', 'Exp. SARSA']
mean_rewards = [
    np.mean(sarsa_metrics.episode_rewards[-last_100:]),
    np.mean(qlearn_metrics.episode_rewards[-last_100:]),
    np.mean(exp_sarsa_metrics.episode_rewards[-last_100:])
]
std_rewards = [
    np.std(sarsa_metrics.episode_rewards[-last_100:]),
    np.std(qlearn_metrics.episode_rewards[-last_100:]),
    np.std(exp_sarsa_metrics.episode_rewards[-last_100:])
]

colors = ['blue', 'red', 'green']
bars = ax2.bar(methods, mean_rewards, yerr=std_rewards, 
               color=colors, alpha=0.7, capsize=5)

for bar, val in zip(bars, mean_rewards):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             f'{val:.1f}', ha='center', fontsize=11)

ax2.set_ylabel('平均奖励', fontsize=12)
ax2.set_title(f'最后{last_100}回合平均奖励', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# 可视化学习到的路径

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

def visualize_path(ax, q_function, title):
    """可视化从Q函数推断的路径"""
    height, width = 4, 12
    
    # 绘制网格
    for i in range(height + 1):
        ax.axhline(i, color='gray', linewidth=0.5)
    for j in range(width + 1):
        ax.axvline(j, color='gray', linewidth=0.5)
    
    # 绘制悬崖
    for c in range(1, 11):
        ax.add_patch(plt.Rectangle((c, 0), 1, 1, facecolor='red', alpha=0.5))
    
    # 起点和终点
    ax.add_patch(plt.Rectangle((0, 0), 1, 1, facecolor='green', alpha=0.5))
    ax.text(0.5, 0.5, 'S', ha='center', va='center', fontsize=12, fontweight='bold')
    ax.add_patch(plt.Rectangle((11, 0), 1, 1, facecolor='gold', alpha=0.5))
    ax.text(11.5, 0.5, 'G', ha='center', va='center', fontsize=12, fontweight='bold')
    
    # 模拟贪婪路径
    state = (3, 0)  # 起点
    path = [state]
    deltas = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    
    for _ in range(50):
        state_idx = state[0] * width + state[1]
        
        q_values = []
        for a in range(4):
            key = (state_idx, a)
            if key in q_function:
                q_values.append((a, q_function[key]))
        
        if not q_values:
            break
        
        best_action = max(q_values, key=lambda x: x[1])[0]
        delta = deltas[best_action]
        new_state = (
            np.clip(state[0] + delta[0], 0, height - 1),
            np.clip(state[1] + delta[1], 0, width - 1)
        )
        
        if new_state == (3, 11):
            path.append(new_state)
            break
        
        if new_state[0] == 3 and 1 <= new_state[1] <= 10:
            path.append(new_state)
            break
        
        state = new_state
        path.append(state)
    
    # 绘制路径
    if len(path) > 1:
        xs = [p[1] + 0.5 for p in path]
        ys = [height - p[0] - 0.5 for p in path]
        ax.plot(xs, ys, 'b-o', linewidth=3, markersize=8, alpha=0.7)
    
    ax.set_xlim(0, width)
    ax.set_ylim(0, height)
    ax.set_aspect('equal')
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.axis('off')

visualize_path(axes[0], sarsa_learner.q_function, 'SARSA (安全路径)')
visualize_path(axes[1], qlearn_learner.q_function, 'Q-Learning (最优路径)')
visualize_path(axes[2], exp_sarsa_learner.q_function, 'Expected SARSA')

plt.tight_layout()
plt.show()

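### 7.1 Analysis of Results

The experiment above typically shows the following:

1. **Q-Learning learns the shortest path along the cliff edge**
   - Its targets assume the greedy policy will be followed from the next state onward
   - During training (with exploration) it falls off the cliff fairly often
   - At evaluation time (purely greedy) it follows the optimal path

2. **SARSA learns a safe path away from the cliff**
   - Its targets include the exploratory actions it will actually take, and those can be mistakes
   - It therefore prefers a safer but longer path
   - Its average reward during training is higher (fewer falls)

3. **Expected SARSA is expected to behave much like SARSA, with less noisy updates**
   - It is still on-policy here, so its targets also account for exploration
   - Averaging over the next actions removes the action-sampling variance
   - Learning tends to be more stable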

---

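## 8. Experiment: Random Walk and Value Prediction

We now use a larger Random Walk environment to compare how the learning rate $\alpha$ and the trace parameter $\lambda$ affect TD value prediction.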

In [None]:
# 不同学习率的TD(0)比较

env = RandomWalk(n_states=19)  # 更大的状态空间
true_values = env.get_true_values(gamma=1.0)

alphas = [0.01, 0.05, 0.1, 0.2, 0.5]
n_episodes = 100
n_runs = 50

results = {}

print("测试不同学习率...")
for alpha in alphas:
    rmse_all = []
    for run in range(n_runs):
        np.random.seed(run)
        _, rmse = td0_prediction(env, n_episodes, alpha=alpha, gamma=1.0)
        rmse_all.append(rmse)
    results[alpha] = np.mean(rmse_all, axis=0)
    print(f"  α={alpha}: 最终RMSE = {results[alpha][-1]:.4f}")

# 绘图
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 学习曲线
ax1 = axes[0]
colors = plt.cm.viridis(np.linspace(0, 1, len(alphas)))

for alpha, color in zip(alphas, colors):
    ax1.plot(results[alpha], label=f'α={alpha}', color=color, linewidth=2)

ax1.set_xlabel('回合', fontsize=12)
ax1.set_ylabel('RMSE', fontsize=12)
ax1.set_title('不同学习率的TD(0)学习曲线', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# 最终RMSE vs 学习率
ax2 = axes[1]
final_rmse = [results[a][-1] for a in alphas]
ax2.plot(alphas, final_rmse, 'bo-', linewidth=2, markersize=10)
ax2.set_xlabel('学习率 α', fontsize=12)
ax2.set_ylabel('最终RMSE', fontsize=12)
ax2.set_title('学习率 vs 最终误差', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 标记最优
best_idx = np.argmin(final_rmse)
ax2.scatter([alphas[best_idx]], [final_rmse[best_idx]], 
            color='red', s=200, zorder=5, marker='*')
ax2.annotate(f'最优 α={alphas[best_idx]}', 
             (alphas[best_idx], final_rmse[best_idx]),
             xytext=(alphas[best_idx]+0.05, final_rmse[best_idx]+0.02),
             fontsize=11, color='red')

plt.tight_layout()
plt.show()

In [None]:
# Comparing TD(λ) across different values of λ

def td_lambda_prediction(env, n_episodes=100, alpha=0.1, gamma=1.0, lam=0.0):
    """TD(λ) policy evaluation with accumulating eligibility traces."""
    V = np.zeros(env.n_total_states)
    true_v = env.get_true_values(gamma)
    rmse_history = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        # Eligibility traces, reset at the start of every episode
        E = np.zeros(env.n_total_states)

        while True:
            next_state, reward, done, _, _ = env.step(0)

            # TD error (no bootstrapping from a terminal state)
            td_target = reward + gamma * V[next_state] * (not done)
            td_error = td_target - V[state]

            # Decay all traces, then bump the trace of the current state
            E *= gamma * lam
            E[state] += 1

            # Update every state in proportion to its trace
            V += alpha * td_error * E

            state = next_state
            if done:
                break

        # RMSE over the non-terminal states only
        rmse = np.sqrt(np.mean((V[1:-1] - true_v[1:-1])**2))
        rmse_history.append(rmse)

    return V, rmse_history


# 测试不同λ值
env = RandomWalk(n_states=19)
lambdas = [0.0, 0.4, 0.8, 0.9, 0.95, 1.0]
n_episodes = 100
n_runs = 50

lambda_results = {}

print("测试不同λ值...")
for lam in lambdas:
    rmse_all = []
    for run in range(n_runs):
        np.random.seed(run)
        _, rmse = td_lambda_prediction(env, n_episodes, alpha=0.1, gamma=1.0, lam=lam)
        rmse_all.append(rmse)
    lambda_results[lam] = np.mean(rmse_all, axis=0)
    print(f"  λ={lam}: 最终RMSE = {lambda_results[lam][-1]:.4f}")

# 绘图
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 学习曲线
ax1 = axes[0]
colors = plt.cm.plasma(np.linspace(0, 1, len(lambdas)))

for lam, color in zip(lambdas, colors):
    label = f'λ={lam}' if lam < 1 else 'λ=1 (MC)'
    ax1.plot(lambda_results[lam], label=label, color=color, linewidth=2)

ax1.set_xlabel('回合', fontsize=12)
ax1.set_ylabel('RMSE', fontsize=12)
ax1.set_title('不同λ值的TD(λ)学习曲线', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# λ vs 最终RMSE
ax2 = axes[1]
final_rmse = [lambda_results[l][-1] for l in lambdas]
ax2.plot(lambdas, final_rmse, 'ro-', linewidth=2, markersize=10)
ax2.set_xlabel('λ', fontsize=12)
ax2.set_ylabel('最终RMSE', fontsize=12)
ax2.set_title('λ vs 最终误差', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 标记最优
best_idx = np.argmin(final_rmse)
ax2.scatter([lambdas[best_idx]], [final_rmse[best_idx]], 
            color='green', s=200, zorder=5, marker='*')
ax2.annotate(f'最优 λ={lambdas[best_idx]}', 
             (lambdas[best_idx], final_rmse[best_idx]),
             xytext=(lambdas[best_idx]+0.05, final_rmse[best_idx]+0.01),
             fontsize=11, color='green')

plt.tight_layout()
plt.show()

---

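## 9. Hyperparameter Analysis

### 9.1 Effect of the Learning Rate α

- **Too small**: convergence is slow, and the estimates may still be far from the true values when the training budget runs out
- **Too large**: the estimates oscillate and become unstable
- **Reasonable range**: values between 0.1 and 0.5 are a common starting point for small tabular tasks, but the best value depends on the environment

### 9.2 Effect of the Discount Factor γ

- **γ→0**: short-sighted, only the immediate reward matters
- **γ→1**: far-sighted, future rewards count almost as much as immediate ones
- **Typical values**: 0.9-0.99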
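
One way to get a feel for these values is to translate γ into a time scale. Since $\sum_{k \ge 0} \gamma^k = 1/(1-\gamma)$, that quantity is often read as an "effective horizon". The sketch below computes it together with the number of steps after which a reward's weight $\gamma^k$ drops below 1%; both helper names and the 1% threshold are illustrative choices, and the formulas assume $0 < \gamma < 1$.

```python
import math


def effective_horizon(gamma):
    """Sum of the discount weights 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def steps_until_negligible(gamma, threshold=0.01):
    """Smallest k such that gamma**k falls below the threshold."""
    return math.ceil(math.log(threshold) / math.log(gamma))


for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: horizon ~ {effective_horizon(gamma):.0f} steps, "
          f"weight < 1% after {steps_until_negligible(gamma)} steps")
```

Moving from 0.9 to 0.99 stretches the horizon by roughly a factor of ten, which is why the choice matters most in tasks with long delays between actions and rewards.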

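### 9.3 Effect of the Exploration Rate ε

- **Too small**: the agent may get stuck in a local optimum
- **Too large**: learning is inefficient, and the final ε-greedy behavior performs poorly
- **Common strategy**: decay ε from a high value to a low one over training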
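
A decaying schedule can be as simple as a linear ramp. The sketch below shows one possible version together with ε-greedy action selection; the function names, the start and end values (0.3 and 0.01), and the 300-episode ramp are assumptions picked for illustration, not settings used by the tutorial's `TDConfig`.

```python
import numpy as np


def linear_epsilon(episode, eps_start=0.3, eps_end=0.01, decay_episodes=300):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
schedule = [linear_epsilon(ep) for ep in (0, 150, 300, 500)]
greedy_action = epsilon_greedy_action(np.array([0.1, 0.7, 0.3]), epsilon=0.0, rng=rng)
print("epsilon schedule:", np.round(schedule, 3), "greedy action:", greedy_action)
```

Exponential or step-wise schedules are equally common; what matters is that exploration stays high while the Q-values are still unreliable and shrinks once they have settled.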

In [None]:
# 超参数敏感性分析

env = CliffWalkingEnv()

# 测试不同ε值
epsilons = [0.01, 0.05, 0.1, 0.2, 0.3]
n_episodes = 300
n_runs = 20

epsilon_results = {}

print("测试不同探索率ε...")
for eps in epsilons:
    rewards_all = []
    for run in range(n_runs):
        np.random.seed(run)
        config = TDConfig(alpha=0.5, gamma=0.99, epsilon=eps)
        learner = create_td_learner('sarsa', config=config)
        metrics = learner.train(env, n_episodes=n_episodes, log_interval=n_episodes+1)
        rewards_all.append(metrics.episode_rewards)
    epsilon_results[eps] = np.mean(rewards_all, axis=0)
    print(f"  ε={eps}: 最终平均奖励 = {np.mean(epsilon_results[eps][-50:]):.2f}")

# 绘图
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 学习曲线
ax1 = axes[0]
colors = plt.cm.coolwarm(np.linspace(0, 1, len(epsilons)))
window = 20

for eps, color in zip(epsilons, colors):
    rewards = epsilon_results[eps]
    if len(rewards) >= window:
        smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
        ax1.plot(smoothed, label=f'ε={eps}', color=color, linewidth=2)

ax1.set_xlabel('回合', fontsize=12)
ax1.set_ylabel('累积奖励', fontsize=12)
ax1.set_title('不同探索率ε的SARSA学习曲线', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# ε vs 最终奖励
ax2 = axes[1]
final_rewards = [np.mean(epsilon_results[e][-50:]) for e in epsilons]
ax2.plot(epsilons, final_rewards, 'go-', linewidth=2, markersize=10)
ax2.set_xlabel('探索率 ε', fontsize=12)
ax2.set_ylabel('最终平均奖励', fontsize=12)
ax2.set_title('探索率 vs 最终性能', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 标记最优
best_idx = np.argmax(final_rewards)
ax2.scatter([epsilons[best_idx]], [final_rewards[best_idx]], 
            color='red', s=200, zorder=5, marker='*')

plt.tight_layout()
plt.show()

---

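## 10. Summary and Next Steps

### 10.1 Key Takeaways

1. **The essence of TD learning**: update a "guess" from a "guess", without waiting for the episode to end

2. **Key formulas**:
   - TD target: $R_{t+1} + \gamma V(S_{t+1})$
   - TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

3. **Algorithm selection guide**:
   | Scenario | Recommended algorithm |
   |------|----------|
   | Safe exploration is needed | SARSA |
   | Learning from historical (off-policy) data | Q-Learning |
   | Low variance is needed | Expected SARSA |
   | Noisy environments | Double Q-Learning |
   | Fast propagation of reward information | TD(λ) or n-step |

4. **Hyperparameter suggestions**:
   - α: start at 0.1 and adjust according to convergence speed
   - γ: 0.99 works for most tasks
   - ε: start at 0.1; it can be decayed during training
   - λ: start at 0.9 and adjust to the problem

### 10.2 Next Steps

1. **Function approximation**:
   - Linear function approximation + TD learning
   - Deep Q-Network (DQN): neural network + Q-Learning

2. **Actor-Critic methods**:
   - Combine policy gradients with TD learning
   - A2C, A3C, PPO, etc.

3. **Continuous action spaces**:
   - DDPG, TD3, SAC

4. **Model-based methods**:
   - Dyna-Q: combines model learning with TD learning

### 10.3 References

- Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*. MIT Press.
- Watkins, C. J., & Dayan, P. (1992). Q-learning. *Machine Learning*.
- Van Hasselt, H. (2010). Double Q-learning. *NeurIPS*.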

In [None]:
# 总结图：TD方法家族

fig, ax = plt.subplots(figsize=(14, 8))

# 绘制层次结构
boxes = [
    # 顶层
    (0.5, 0.9, '时序差分学习 (TD Learning)', 'gold', 0.25, 0.08),
    
    # 第二层
    (0.25, 0.7, '策略评估\n(Policy Evaluation)', 'lightblue', 0.18, 0.1),
    (0.75, 0.7, '控制\n(Control)', 'lightgreen', 0.18, 0.1),
    
    # 第三层 - 评估
    (0.15, 0.5, 'TD(0)', 'lightyellow', 0.12, 0.08),
    (0.35, 0.5, 'TD(λ)', 'lightyellow', 0.12, 0.08),
    
    # 第三层 - 控制
    (0.55, 0.5, 'On-Policy', 'lightcoral', 0.12, 0.08),
    (0.85, 0.5, 'Off-Policy', 'plum', 0.12, 0.08),
    
    # 第四层
    (0.45, 0.3, 'SARSA', 'orange', 0.1, 0.08),
    (0.6, 0.3, 'SARSA(λ)', 'orange', 0.1, 0.08),
    (0.75, 0.3, 'Expected\nSARSA', 'orange', 0.1, 0.08),
    
    (0.85, 0.3, 'Q-Learning', 'violet', 0.1, 0.08),
    (0.95, 0.3, 'Double Q', 'violet', 0.08, 0.08),
    
    # 深度学习扩展
    (0.5, 0.1, '深度强化学习', 'cyan', 0.35, 0.08),
]

for x, y, text, color, w, h in boxes:
    ax.add_patch(plt.Rectangle((x-w/2, y-h/2), w, h, 
                               facecolor=color, edgecolor='black', linewidth=2))
    ax.text(x, y, text, ha='center', va='center', fontsize=9, fontweight='bold')

# 连接线
connections = [
    ((0.5, 0.82), (0.25, 0.8)),
    ((0.5, 0.82), (0.75, 0.8)),
    ((0.25, 0.6), (0.15, 0.58)),
    ((0.25, 0.6), (0.35, 0.58)),
    ((0.75, 0.6), (0.55, 0.58)),
    ((0.75, 0.6), (0.85, 0.58)),
    ((0.55, 0.42), (0.45, 0.38)),
    ((0.55, 0.42), (0.6, 0.38)),
    ((0.55, 0.42), (0.75, 0.38)),
    ((0.85, 0.42), (0.85, 0.38)),
    ((0.85, 0.42), (0.95, 0.38)),
    ((0.5, 0.22), (0.5, 0.18)),
]

for start, end in connections:
    ax.annotate('', xy=end, xytext=start,
                arrowprops=dict(arrowstyle='->', color='gray', lw=1.5))

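# Examples of deep RL extensions, listed just below the box so that they
# do not collide with the box label drawn at the same height
deep_methods = ['DQN', 'Double DQN', 'Dueling DQN', 'Rainbow', 'A3C', 'PPO']
for i, method in enumerate(deep_methods):
    x = 0.25 + i * 0.1
    ax.text(x, 0.03, method, ha='center', va='center', fontsize=8)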

ax.set_xlim(0, 1.05)
ax.set_ylim(0, 1)
ax.axis('off')
ax.set_title('TD学习方法家族', fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n恭喜你完成了时序差分学习教程!")
print("下一步建议: 学习Deep Q-Network (DQN)，将TD学习与深度学习结合")