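# Temporal Difference Learning Tutorial

This tutorial gives a systematic introduction to the core concepts of temporal difference (TD) learning and to the implementation of its main algorithms.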

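## Contents

1. [TD Learning Basics](#1-td-learning-basics)
2. [TD(0) Prediction](#2-td0-prediction)
3. [SARSA Control](#3-sarsa-control)
4. [Q-Learning Control](#4-q-learning-control)
5. [Algorithm Comparison Experiments](#5-algorithm-comparison-experiments)
6. [Eligibility Trace Methods](#6-eligibility-trace-methods)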

In [None]:
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import sys
import os

sys.path.insert(0, os.path.dirname(os.getcwd()))
np.random.seed(42)

In [None]:
# Configure matplotlib (SimHei is listed first so that CJK labels render if present)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.style.use('seaborn-v0_8-whitegrid')

---

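## 1. TD Learning Basics

### 1.1 Core Idea

TD learning combines the strengths of two methods:
- **Monte Carlo methods**: learn from sampled experience.
- **Dynamic programming**: update estimates from other estimates (bootstrapping).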
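
To make the contrast concrete, the following minimal sketch compares the two learning targets for the same starting state. The reward sequence, the discount factor, and the value estimate `v_next` are made-up numbers chosen for illustration; the helper names are hypothetical and not part of the tutorial's `core` module.

```python
def mc_target(rewards, gamma):
    """Monte Carlo target: the full discounted return of the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, gamma, v_next):
    """TD target: one real reward plus the current estimate of the next state."""
    return reward + gamma * v_next


# Hypothetical episode with three steps and a reward only at the end.
rewards = [0.0, 0.0, 1.0]
mc = mc_target(rewards, gamma=0.5)
td = td_target(rewards[0], gamma=0.5, v_next=0.6)
print(f"MC target: {mc:.2f}, TD target: {td:.2f}")
```

The Monte Carlo target needs the whole episode before it can be computed, while the TD target is available after a single step but inherits any error in `v_next`.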

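### 1.2 TD Error

The TD error measures how far the current estimate $V(S_t)$ is from the one-step bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$: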

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
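
As a small worked example, the formula can be evaluated directly. The function name `td_error` and the numbers below are illustrative assumptions, not taken from the tutorial's code.

```python
def td_error(reward, gamma, v_next, v_current):
    """delta = R + gamma * V(S') - V(S)."""
    return reward + gamma * v_next - v_current


# Hypothetical values: V(S_t) = 0.5, V(S_{t+1}) = 0.7, reward 0, no discounting.
delta = td_error(reward=0.0, gamma=1.0, v_next=0.7, v_current=0.5)
print(f"delta = {delta:.2f}")
```

A positive error means the transition turned out better than the current estimate predicted, so the estimate for $S_t$ should move up.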

In [None]:
# Import the tutorial's core modules
from core import TDConfig, TD0ValueLearner, SARSA, QLearning
from environments import RandomWalk, CliffWalkingEnv

---

## 2. TD(0) Prediction

### 2.1 The RandomWalk Environment

```
T(0) - A - B - C - D - E - T(1)
```

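- Random policy: move left or right with probability 0.5 each.
- Reaching the right terminal state yields a reward of +1; reaching the left terminal state yields 0.
- The diagram shows the 5-state version; the code in this section uses 19 non-terminal states.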
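
Under the reward scheme described above and with $\gamma = 1$, the true value of state $i$ (counting from the left) is the probability of exiting on the right, which is $i / (n + 1)$ for $n$ non-terminal states. The sketch below checks this by iterating the Bellman equation in plain Python. `random_walk_true_values` is a hypothetical helper written for this illustration; it is not the tutorial's `env.get_true_values`, which may be implemented differently.

```python
def random_walk_true_values(n_states, tol=1e-12):
    """Iterate V(i) = 0.5 * V(i-1) + 0.5 * V(i+1) until the values stop changing."""
    # Index 0 is the left terminal (value 0). Index n_states + 1 stands in for the
    # right terminal; fixing it at 1 folds the +1 exit reward into the boundary.
    V = [0.0] * (n_states + 2)
    V[-1] = 1.0
    while True:
        max_change = 0.0
        for i in range(1, n_states + 1):
            new_value = 0.5 * V[i - 1] + 0.5 * V[i + 1]
            max_change = max(max_change, abs(new_value - V[i]))
            V[i] = new_value
        if max_change < tol:
            return V[1:-1]


values = random_walk_true_values(5)
for name, value in zip("ABCDE", values):
    print(f"{name}: {value:.4f}")
```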

In [None]:
# Create the RandomWalk environment
env = RandomWalk(n_states=19)
print(f"Number of states: {env.n_total_states}")

In [None]:
# Get the true state values
true_values = env.get_true_values(gamma=1.0)
print("True values (middle 5 states):")
for i in [8, 9, 10, 11, 12]:
    print(f"  State {i}: {true_values[i]:.4f}")

### 2.2 TD(0) Update Rule

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
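
The update rule can be sketched without the tutorial's classes. The following self-contained example runs tabular TD(0) on a 5-state random walk with the 0/+1 reward scheme; the function `td0_random_walk`, the initial value of 0.5, and the seed are assumptions made for this illustration and do not describe how `TD0ValueLearner` works internally.

```python
import random


def td0_random_walk(n_states=5, n_episodes=100, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) prediction for the random policy on a 1-D random walk."""
    rng = random.Random(seed)
    right_terminal = n_states + 1
    # Terminal states (index 0 and n_states + 1) keep value 0; others start at 0.5.
    V = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(n_episodes):
        s = (n_states + 1) // 2  # start in the middle state
        while 0 < s < right_terminal:
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == right_terminal else 0.0
            # TD(0) update: move V(s) toward the one-step bootstrapped target.
            V[s] += alpha * (reward + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]


estimates = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]
rmse = (sum((e - t) ** 2 for e, t in zip(estimates, true_values)) / 5) ** 0.5
print(f"RMSE after 100 episodes: {rmse:.4f}")
```

Each update happens immediately after a single transition, so learning proceeds within the episode rather than only at its end.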

In [None]:
# Create the TD(0) learner
config = TDConfig(alpha=0.1, gamma=1.0)
learner = TD0ValueLearner(config)

In [None]:
# Train
metrics = learner.train(env, n_episodes=100, log_interval=50)

In [None]:
# Visualize the learned values
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

states = range(1, env.n_states + 1)
estimated = [learner._value_function.get(s, 0.0) for s in states]
true_vals = [true_values[s] for s in states]

In [None]:
# Plot estimated values against true values
axes[0].plot(states, true_vals, 'b-', label='True', linewidth=2)
axes[0].plot(states, estimated, 'r--', label='TD(0)', linewidth=2)
axes[0].set_xlabel('State')
axes[0].set_ylabel('Value')
axes[0].set_title('TD(0) vs True Values')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

In [None]:
# Plot the absolute error per state
errors = [abs(e - t) for e, t in zip(estimated, true_vals)]
axes[1].bar(states, errors, color='coral', alpha=0.7)
axes[1].set_xlabel('State')
axes[1].set_ylabel('Absolute Error')
axes[1].set_title(f'Errors (Mean: {np.mean(errors):.4f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 3. SARSA Control

### 3.1 The SARSA Algorithm

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

The name comes from the quintuple used in each update: **S**tate-**A**ction-**R**eward-**S**tate-**A**ction.
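
A single SARSA update can be sketched as follows. The dictionary-based Q-table, the function `sarsa_update`, and the numbers are assumptions for illustration; the tutorial's `SARSA` class may store and update its values differently.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q


# Hypothetical Q-table with two states and two actions per state.
Q = {0: [0.0, 0.0], 1: [2.0, 4.0]}

# The agent took action 1 in state 0, received -1, and its policy then picked
# action 0 in state 1 (for example, because of an exploratory step).
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
print(Q[0][1])
```

Because the target uses `Q[s_next][a_next]`, exploratory actions taken by the behavior policy feed directly into the learned values.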

In [None]:
# Create the CliffWalking environment
env = CliffWalkingEnv()
print("CliffWalking environment:")
env.render()

In [None]:
# Create the SARSA learner
sarsa_config = TDConfig(alpha=0.5, gamma=1.0, epsilon=0.1)
sarsa = SARSA(sarsa_config)

In [None]:
# Train SARSA
sarsa_metrics = sarsa.train(env, n_episodes=500, log_interval=250)

---

## 4. Q-Learning Control

### 4.1 The Q-Learning Algorithm

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

Key difference: the target uses the **max** over next actions rather than the action actually taken next.
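
The following sketch shows one Q-Learning update on a small made-up Q-table. The dictionary layout, the function `q_learning_update`, and the numbers are illustrative assumptions rather than the tutorial's `QLearning` implementation.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done=False):
    """Off-policy update: bootstrap from the best action in s_next."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q


# Hypothetical Q-table with two states and two actions per state.
Q = {0: [0.0, 0.0], 1: [2.0, 4.0]}

# The agent took action 1 in state 0 and received -1. Whatever action it takes
# next, the target uses max(Q[1]) = 4.0.
q_learning_update(Q, s=0, a=1, r=-1.0, s_next=1, alpha=0.5, gamma=1.0)
print(Q[0][1])
```

The update signature has no `a_next` argument at all: the next action chosen by the behavior policy never enters the target.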

In [None]:
# Create the Q-Learning learner
qlearn_config = TDConfig(alpha=0.5, gamma=1.0, epsilon=0.1)
qlearn = QLearning(qlearn_config)

In [None]:
# Train Q-Learning on a fresh environment instance
env = CliffWalkingEnv()
qlearn_metrics = qlearn.train(env, n_episodes=500, log_interval=250)

---

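## 5. Algorithm Comparison Experiments

### 5.1 SARSA vs Q-Learning

In the CliffWalking environment:
- **SARSA** learns a safe path that stays away from the cliff.
- **Q-Learning** learns the shortest path, which runs along the cliff edge.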
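
To put numbers on the two paths, the sketch below replays both action sequences on the classic 4x12 cliff layout from Sutton and Barto (reward -1 per step, -100 and a reset to the start for stepping into the cliff). This layout is an assumption: the tutorial's `CliffWalkingEnv` is not shown here and may use different dimensions or rewards.

```python
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_cliff(pos):
    """Cliff cells lie on the bottom row between start and goal."""
    row, col = pos
    return row == ROWS - 1 and 1 <= col <= COLS - 2


def rollout(actions):
    """Follow a fixed action sequence; return (total_reward, reached_goal)."""
    pos, total = START, 0
    for action in actions:
        d_row, d_col = MOVES[action]
        row = min(max(pos[0] + d_row, 0), ROWS - 1)
        col = min(max(pos[1] + d_col, 0), COLS - 1)
        pos = (row, col)
        if is_cliff(pos):
            total -= 100
            pos = START
        else:
            total -= 1
        if pos == GOAL:
            return total, True
    return total, False


edge_path = ["up"] + ["right"] * 11 + ["down"]
safe_path = ["up"] * 3 + ["right"] * 11 + ["down"] * 3
print("Edge path:", rollout(edge_path))
print("Safe path:", rollout(safe_path))
```

Followed without mistakes, the edge path is shorter and therefore earns a higher return. With ε-greedy exploration during training, however, a single random step from the edge path can cost -100, which is why the safer route can earn more reward while learning.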

In [None]:
# Compare the training curves
fig, ax = plt.subplots(figsize=(10, 4))
window = 50

In [None]:
# Plot the SARSA curve (raw rewards plus a moving average)
sarsa_rewards = sarsa_metrics.episode_rewards
ax.plot(sarsa_rewards, alpha=0.2, color='blue')
sarsa_smooth = np.convolve(sarsa_rewards, np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(sarsa_rewards)), sarsa_smooth, 
        color='blue', linewidth=2, label='SARSA')

In [None]:
# Plot the Q-Learning curve (raw rewards plus a moving average)
qlearn_rewards = qlearn_metrics.episode_rewards
ax.plot(qlearn_rewards, alpha=0.2, color='red')
qlearn_smooth = np.convolve(qlearn_rewards, np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(qlearn_rewards)), qlearn_smooth, 
        color='red', linewidth=2, label='Q-Learning')

In [None]:
ax.set_xlabel('Episode')
ax.set_ylabel('Reward')
ax.set_title('SARSA vs Q-Learning on CliffWalking')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

In [None]:
# Evaluate the greedy policies
env = CliffWalkingEnv()
sarsa_eval = sarsa.evaluate(env, n_episodes=100)
qlearn_eval = qlearn.evaluate(env, n_episodes=100)

print("Greedy policy evaluation (no exploration):")
print(f"  SARSA:      {sarsa_eval[0]:.2f} +/- {sarsa_eval[1]:.2f}")
print(f"  Q-Learning: {qlearn_eval[0]:.2f} +/- {qlearn_eval[1]:.2f}")

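### 5.2 Key Insights

| Metric | SARSA | Q-Learning |
|------|-------|------------|
| Training reward (with exploration) | Higher | Lower |
| Evaluation reward (greedy policy) | Lower | Higher |
| Learned path | Safe path | Optimal path |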

---

## 6. Eligibility Trace Methods

### 6.1 The TD(λ) Algorithm

Eligibility traces propagate the TD error back to all recently visited states:

$$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}[S_t = s]$$

$$V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
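
The two equations can be sketched in a few lines of plain Python using accumulating traces. The function `td_lambda_episode`, the list-based value table, and the three-step episode are assumptions made for this illustration and are not the tutorial's `TDLambda` class.

```python
def td_lambda_episode(V, transitions, alpha, gamma, lam):
    """Apply accumulating-trace TD(lambda) updates along one episode.

    transitions is a list of (state, reward, next_state, done) tuples.
    """
    E = [0.0] * len(V)
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]
        # Decay every trace, then bump the trace of the state just visited.
        E = [gamma * lam * e for e in E]
        E[s] += 1.0
        # Every state is updated in proportion to its trace.
        for i in range(len(V)):
            V[i] += alpha * delta * E[i]
    return V


# Hypothetical episode: 0 -> 1 -> 2 -> terminal (index 3), reward +1 at the end.
V = [0.0, 0.0, 0.0, 0.0]
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.8)
print([round(v, 4) for v in V])
```

With `lam=0.8` the final reward updates all three visited states at once, with weights that shrink for states visited earlier. With `lam=0.0` only the last state before termination would change, which recovers TD(0).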

In [None]:
from core import TDLambda

In [None]:
# Compare different values of λ
env = RandomWalk(n_states=19)
true_values = env.get_true_values(gamma=1.0)

lambda_values = [0.0, 0.4, 0.8, 0.95]
results = {}

In [None]:
for lam in lambda_values:
    config = TDConfig(alpha=0.1, gamma=1.0, lambda_=lam)
    learner = TDLambda(config)
    learner.train(env, n_episodes=100, log_interval=200)
    
    # Compute the RMSE over the non-terminal states
    estimated = {s: learner._value_function.get(s, 0.0) 
                 for s in range(env.n_total_states)}
    rmse = np.sqrt(np.mean([(estimated[s] - true_values[s])**2 
                            for s in range(1, env.n_states + 1)]))
    results[lam] = rmse
    print(f"lambda = {lam:.2f}: RMSE = {rmse:.4f}")

In [None]:
# Visualize the effect of λ
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(list(results.keys()), list(results.values()), 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('lambda')
ax.set_ylabel('RMSE')
ax.set_title('Effect of lambda on TD(lambda) Performance')
ax.grid(True, alpha=0.3)
plt.show()

---

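## Summary

### Key Takeaways

1. **TD learning** combines Monte Carlo sampling with dynamic-programming bootstrapping.
2. **SARSA** is on-policy: its value estimates account for the risk introduced by exploration, so it learns safer behavior.
3. **Q-Learning** is off-policy: it learns the values of the optimal policy while following an exploratory one.
4. **Eligibility traces** use the λ parameter to interpolate between TD(0) (λ = 0) and Monte Carlo (λ = 1).

### Choosing an Algorithm

| Scenario | Recommended algorithm |
|------|----------|
| Safety matters during learning | SARSA |
| The optimal policy is required | Q-Learning |
| Sparse rewards | TD(λ) |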