# SARSA vs. Q-Learning: A Comparison

---

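## Learning Objectives

By the end of this tutorial, you will be able to:
- Explain how the SARSA algorithm works
- Distinguish On-Policy from Off-Policy learning
- Implement SARSA and Expected SARSA
- Compare how the algorithms behave in the Cliff Walking environment
- Reason about the trade-off between safety and optimality

## Prerequisites

- Basics of the Q-Learning algorithm
- Temporal-difference (TD) learning concepts
- Basic Python and NumPy

## Estimated Time

40-50 minutes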

---

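## Part 1: On-Policy vs Off-Policy

### 1.1 Basic Concepts

Reinforcement learning distinguishes two roles a policy can play:

- **Behavior Policy**: the policy the agent actually uses to interact with the environment and collect data
- **Target Policy**: the policy the agent is learning about and improving

### 1.2 Off-Policy

**Behavior policy ≠ target policy**

Canonical example: **Q-Learning**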

```
Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
                          ↑
                    uses max (greedy)
                    regardless of the action actually taken next
```

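**Characteristics**:
- Learns the optimal (greedy) policy directly; the exploration policy only determines which state-action pairs get visited
- Can learn from experience generated by other policies (for example, human demonstrations)
- Can reuse historical data, which tends to improve sample efficiency

### 1.3 On-Policy

**Behavior policy = target policy**

Canonical example: **SARSA**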

```
Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') - Q(S,A)]
                          ↑
                    uses the action A' actually taken
                    consistent with the behavior policy
```

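**Characteristics**:
- Learns the value function of the policy currently being followed
- Exploration affects what is learned
- More conservative, because the cost of exploratory actions is built into the values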
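To make the two update targets concrete, the sketch below computes both from a single transition. All numbers (the Q-values of the next state, the reward, the discount factor, and the explored action) are made up for illustration and are not taken from the experiment in this tutorial.

```python
import numpy as np

# Hypothetical Q-values for the next state S' (illustrative numbers only).
q_next = np.array([-5.0, -2.0, -8.0, -3.0])
reward, gamma = -1.0, 0.9

# Assume the epsilon-greedy behavior policy explored and picked action 2 in S'.
next_action = 2

# Off-policy (Q-Learning): bootstrap from the greedy action, ignoring next_action.
q_learning_target = reward + gamma * np.max(q_next)

# On-policy (SARSA): bootstrap from the action that will actually be executed.
sarsa_target = reward + gamma * q_next[next_action]

print(f"Q-Learning target: {q_learning_target:.2f}")
print(f"SARSA target:      {sarsa_target:.2f}")
```

With these numbers the Q-Learning target is about -2.8 and the SARSA target about -8.2: the same transition pulls the two estimates in different directions whenever the executed action is not the greedy one.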

---

## Part 2: The SARSA Algorithm

### 2.1 How It Works

SARSA is named after the quintuple that each update needs: **S**tate-**A**ction-**R**eward-**S**tate-**A**ction

**Update rule**:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

**Key difference from Q-Learning**: the target uses the next action actually taken, $A_{t+1}$, instead of $\max$
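As a worked instance of the SARSA update, take the illustrative values (assumed here, not taken from the experiment) $\alpha = 0.5$, $\gamma = 0.9$, $R_{t+1} = -1$, $Q(S_t, A_t) = 0$ and $Q(S_{t+1}, A_{t+1}) = 2$:

$$Q(S_t, A_t) \leftarrow 0 + 0.5 \left[ -1 + 0.9 \times 2 - 0 \right] = 0.5 \times 0.8 = 0.4$$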

### 2.2 Pseudocode

```
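Algorithm: SARSA

1. Initialize Q(s, a) = 0
2. For each episode:
   a. Initialize state S
   b. Choose action A using the policy  ← this step is outside the loop
   c. Repeat:
      i.   Take action A, observe R, S'
      ii.  Choose A' (in S') using the policy
      iii. Q(S, A) ← Q(S, A) + α[R + γ Q(S', A') - Q(S, A)]
      iv.  S ← S', A ← A'
   d. Until S is a terminal state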
```

---

## Part 3: Implementation

### Step 1: Imports and Environment Setup

In [None]:
# ============================================================
# Import the required libraries
# ============================================================

import numpy as np
from collections import defaultdict
from typing import Tuple, List, Dict, Any, Optional
import matplotlib.pyplot as plt

# Fix the random seed for reproducibility
np.random.seed(42)

# Plot configuration
plt.rcParams['figure.figsize'] = (12, 5)
plt.rcParams['font.size'] = 11

print("Libraries imported")

In [None]:
# ============================================================
# Cliff Walking environment (reused)
# ============================================================

class CliffWalkingEnv:
    """Cliff Walking environment."""
    
    # Action index -> (row delta, column delta); row 0 is the top row
    ACTIONS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
    ACTION_NAMES = ['Up', 'Right', 'Down', 'Left']
    
    def __init__(self, height: int = 4, width: int = 12):
        self.height = height
        self.width = width
        self.start = (height - 1, 0)
        self.goal = (height - 1, width - 1)
        self.cliff = [(height - 1, j) for j in range(1, width - 1)]
        self.state = self.start
        self.n_actions = 4
        
    def reset(self) -> Tuple[int, int]:
        self.state = self.start
        return self.state
    
    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool]:
        di, dj = self.ACTIONS[action]
        new_i = np.clip(self.state[0] + di, 0, self.height - 1)
        new_j = np.clip(self.state[1] + dj, 0, self.width - 1)
        next_state = (int(new_i), int(new_j))
        
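        # Falling off the cliff: reward -100, back to the start, episode continues
        if next_state in self.cliff:
            self.state = self.start
            return self.state, -100.0, False
        
        self.state = next_state
        # Note: the step that reaches the goal is rewarded 0 here, not -1
        if self.state == self.goal:
            return self.state, 0.0, True
        return self.state, -1.0, False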
    
    def render(self, path: Optional[List] = None) -> None:
        grid = [['.' for _ in range(self.width)] for _ in range(self.height)]
        for pos in self.cliff:
            grid[pos[0]][pos[1]] = 'C'
        grid[self.start[0]][self.start[1]] = 'S'
        grid[self.goal[0]][self.goal[1]] = 'G'
        if path:
            for pos in path[1:-1]:
                if pos not in self.cliff and pos != self.start and pos != self.goal:
                    grid[pos[0]][pos[1]] = '*'
        print("┌" + "─" * (self.width * 2 + 1) + "┐")
        for row in grid:
            print("│ " + " ".join(row) + " │")
        print("└" + "─" * (self.width * 2 + 1) + "┘")


env = CliffWalkingEnv()
print("Cliff Walking environment:")
env.render()

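### Step 2: Implement the SARSA Agent

In [None]:
class SARSAAgent:
    """
    SARSA agent (On-Policy TD Control)
    
    Key differences from Q-Learning:
    - The update uses the next action A' that is actually taken
    - It learns the value of the current policy, not of the optimal policy
    """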
    
    def __init__(
        self,
        n_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 1.0,
        epsilon_decay: float = 0.995,
        epsilon_min: float = 0.01
    ):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.q_table: Dict[Any, np.ndarray] = defaultdict(
            lambda: np.zeros(n_actions)
        )
        
    def get_action(self, state: Any, training: bool = True) -> int:
        """Select an action with the ε-greedy policy (ties broken at random)."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        
        q_values = self.q_table[state]
        max_q = np.max(q_values)
        max_actions = np.where(np.isclose(q_values, max_q))[0]
        return np.random.choice(max_actions)
    
    def update(
        self,
        state: Any,
        action: int,
        reward: float,
        next_state: Any,
        next_action: int,  # SARSA needs the next action
        done: bool
    ) -> float:
        """
        SARSA update rule
        
        Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') - Q(S,A)]
        
        Note: requires the next_action argument. Returns the TD error.
        """
        current_q = self.q_table[state][action]
        
        if done:
            # Terminal transition: no bootstrapping
            target = reward
        else:
            # Core of SARSA: bootstrap from the actual next_action
            target = reward + self.gamma * self.q_table[next_state][next_action]
        
        td_error = target - current_q
        self.q_table[state][action] += self.lr * td_error
        
        return td_error
    
    def decay_epsilon(self) -> None:
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


print("SARSA agent class defined")

### Step 3: Implement the Q-Learning Agent (for comparison)

In [None]:
class QLearningAgent:
    """Q-Learning agent (Off-Policy TD Control)"""
    
    def __init__(
        self,
        n_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 1.0,
        epsilon_decay: float = 0.995,
        epsilon_min: float = 0.01
    ):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.q_table: Dict[Any, np.ndarray] = defaultdict(
            lambda: np.zeros(n_actions)
        )
        
    def get_action(self, state: Any, training: bool = True) -> int:
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        q_values = self.q_table[state]
        max_q = np.max(q_values)
        max_actions = np.where(np.isclose(q_values, max_q))[0]
        return np.random.choice(max_actions)
    
    def update(
        self,
        state: Any,
        action: int,
        reward: float,
        next_state: Any,
        done: bool
    ) -> float:
        """Q-Learning: Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]"""
        current_q = self.q_table[state][action]
        
        if done:
            target = reward
        else:
            # Q-Learning: bootstrap from the greedy (max) action value
            target = reward + self.gamma * np.max(self.q_table[next_state])
        
        td_error = target - current_q
        self.q_table[state][action] += self.lr * td_error
        return td_error
    
    def decay_epsilon(self) -> None:
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


print("Q-Learning agent class defined")

### Step 4: Implement the Training Functions

In [None]:
def train_q_learning(env, agent, episodes=500, max_steps=200, verbose=False):
    """Train a Q-Learning agent (or any agent whose update() takes the same arguments)."""
    history = {'rewards': [], 'steps': []}
    
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0.0
        steps = 0
        
        for _ in range(max_steps):
            action = agent.get_action(state, training=True)
            next_state, reward, done = env.step(action)
            agent.update(state, action, reward, next_state, done)
            
            state = next_state
            total_reward += reward
            steps += 1
            
            if done:
                break
        
        agent.decay_epsilon()
        history['rewards'].append(total_reward)
        history['steps'].append(steps)
        
        if verbose and (episode + 1) % 100 == 0:
            avg = np.mean(history['rewards'][-100:])
            print(f"Q-Learning Episode {episode+1}: Avg Reward = {avg:.2f}")
    
    return history


def train_sarsa(env, agent, episodes=500, max_steps=200, verbose=False):
    """Train a SARSA agent."""
    history = {'rewards': [], 'steps': []}
    
    for episode in range(episodes):
        state = env.reset()
        # SARSA: choose the initial action before entering the loop
        action = agent.get_action(state, training=True)
        
        total_reward = 0.0
        steps = 0
        
        for _ in range(max_steps):
            next_state, reward, done = env.step(action)
            # SARSA: choose the next action before updating
            next_action = agent.get_action(next_state, training=True)
            
            # The SARSA update needs next_action
            agent.update(state, action, reward, next_state, next_action, done)
            
            state = next_state
            action = next_action  # Key step: the chosen action is carried forward
            total_reward += reward
            steps += 1
            
            if done:
                break
        
        agent.decay_epsilon()
        history['rewards'].append(total_reward)
        history['steps'].append(steps)
        
        if verbose and (episode + 1) % 100 == 0:
            avg = np.mean(history['rewards'][-100:])
            print(f"SARSA Episode {episode+1}: Avg Reward = {avg:.2f}")
    
    return history


print("Training functions defined")

### Step 5: Comparison Experiment

In [None]:
# ============================================================
# Comparison experiment: Q-Learning vs SARSA
# ============================================================

print("="*60)
print("Cliff Walking: Q-Learning vs SARSA comparison experiment")
print("="*60)

# Experiment parameters
EPISODES = 500
LEARNING_RATE = 0.5
EPSILON = 0.1  # Fixed exploration rate, to make the behavioral difference easy to see

# Create the environment and the agents
env = CliffWalkingEnv()

q_agent = QLearningAgent(
    n_actions=4,
    learning_rate=LEARNING_RATE,
    epsilon=EPSILON,
    epsilon_decay=1.0,  # no decay
    epsilon_min=EPSILON
)

sarsa_agent = SARSAAgent(
    n_actions=4,
    learning_rate=LEARNING_RATE,
    epsilon=EPSILON,
    epsilon_decay=1.0,
    epsilon_min=EPSILON
)

# Train
print("\nTraining Q-Learning...")
q_history = train_q_learning(env, q_agent, episodes=EPISODES, verbose=True)

print("\nTraining SARSA...")
sarsa_history = train_sarsa(env, sarsa_agent, episodes=EPISODES, verbose=True)

print("\nTraining finished!")

### Step 6: Visualize the Comparison

In [None]:
def plot_comparison(q_history, sarsa_history, window=10):
    """Plot the learning curves of Q-Learning and SARSA side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Reward curves (moving average over `window` episodes)
    q_smooth = np.convolve(q_history['rewards'], np.ones(window)/window, mode='valid')
    sarsa_smooth = np.convolve(sarsa_history['rewards'], np.ones(window)/window, mode='valid')
    
    axes[0].plot(q_smooth, label='Q-Learning', color='blue', alpha=0.8)
    axes[0].plot(sarsa_smooth, label='SARSA', color='red', alpha=0.8)
    # Reference line for the optimal path: 13 steps, and because the step into
    # the goal is rewarded 0 in this environment, its return is -12.
    # It is drawn before legend() so that its label shows up in the legend.
    axes[0].axhline(y=-12, color='green', linestyle='--', alpha=0.5, label='Optimal path')
    axes[0].set_xlabel('Episode')
    axes[0].set_ylabel('Total Reward')
    axes[0].set_title('Learning curve: episode reward')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Step curves
    q_steps_smooth = np.convolve(q_history['steps'], np.ones(window)/window, mode='valid')
    sarsa_steps_smooth = np.convolve(sarsa_history['steps'], np.ones(window)/window, mode='valid')
    
    axes[1].plot(q_steps_smooth, label='Q-Learning', color='blue', alpha=0.8)
    axes[1].plot(sarsa_steps_smooth, label='SARSA', color='red', alpha=0.8)
    axes[1].set_xlabel('Episode')
    axes[1].set_ylabel('Steps')
    axes[1].set_title('Learning curve: steps per episode')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print("\n" + "="*50)
    print("Statistics over the last 100 episodes")
    print("="*50)
    print(f"Q-Learning: mean reward = {np.mean(q_history['rewards'][-100:]):.2f}")
    print(f"SARSA:      mean reward = {np.mean(sarsa_history['rewards'][-100:]):.2f}")


plot_comparison(q_history, sarsa_history)

### Step 7: Extract and Compare the Learned Paths

In [None]:
def extract_path(agent, env, max_steps=50):
    """Roll out the greedy policy and return the list of visited states."""
    state = env.reset()
    path = [state]
    
    for _ in range(max_steps):
        action = agent.get_action(state, training=False)
        next_state, _, done = env.step(action)
        path.append(next_state)
        state = next_state
        if done:
            break
    
    return path


# Extract the paths
print("\n" + "="*60)
print("Comparison of the learned paths")
print("="*60)

print("\nPath learned by Q-Learning (tends toward the shortest path, along the cliff edge):")
q_path = extract_path(q_agent, env)
env.reset()
env.render(q_path)
print(f"Path length: {len(q_path) - 1} steps")

print("\nPath learned by SARSA (tends toward a safer path, away from the cliff):")
sarsa_path = extract_path(sarsa_agent, env)
env.reset()
env.render(sarsa_path)
print(f"Path length: {len(sarsa_path) - 1} steps")
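Beyond path length, one way to quantify how "risky" a path is would be to count the visited cells that sit directly above a cliff cell. The sketch below does this with a hypothetical helper, `count_cliff_edge_cells`, which is not part of the tutorial's code. The two paths are written by hand for the default 4×12 grid; they are not output from the trained agents.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column), row 0 is the top row


def count_cliff_edge_cells(path: List[Cell], cliff: List[Cell]) -> int:
    """Count visited cells from which a single 'down' move lands on the cliff."""
    cliff_set = set(cliff)
    return sum((row + 1, col) in cliff_set for row, col in path)


# Cliff cells of the default 4x12 grid: bottom row, columns 1..10.
cliff = [(3, j) for j in range(1, 11)]

# Hand-written cliff-edge path: up once, right along row 2, down into the goal.
edge_path = [(3, 0)] + [(2, j) for j in range(12)] + [(3, 11)]

# Hand-written safe path: up to the top row, right along row 0, down to the goal.
safe_path = [(3, 0), (2, 0), (1, 0)] + [(0, j) for j in range(12)] + [(1, 11), (2, 11), (3, 11)]

print(count_cliff_edge_cells(edge_path, cliff))  # 10
print(count_cliff_edge_cells(safe_path, cliff))  # 0
```

Passing `q_path` or `sarsa_path` together with `env.cliff` would apply the same measure to the learned paths.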

---

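## Part 4: Analyzing the Behavioral Difference

### 4.1 Why does Q-Learning choose the cliff-edge path?

The Q-Learning update uses $\max$, so it learns **the value of the optimal policy**:

- It assumes the optimal (greedy) policy is followed from the next state on, so the agent never falls off the cliff
- The path along the cliff edge is the shortest, so it has the highest return
- During training, however, ε-greedy exploration still makes the agent fall off the cliff from time to time

**Result**: the learned policy is optimal, but the agent often fails during training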
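The claim that the cliff-edge path has the highest return can be checked with a little arithmetic. The sketch below assumes undiscounted returns, no exploration, and this tutorial's reward scheme (-1 per step, 0 for the step that reaches the goal). The 17-step "safe" path along the top row is an assumption chosen for illustration.

```python
def greedy_return(n_steps: int, step_reward: float = -1.0, goal_reward: float = 0.0) -> float:
    """Undiscounted return of a path that reaches the goal in n_steps without falling."""
    return (n_steps - 1) * step_reward + goal_reward


# Cliff-edge path on the 4x12 grid: 1 up, 11 right, 1 down = 13 steps.
edge_return = greedy_return(13)

# Assumed safe path along the top row: 3 up, 11 right, 3 down = 17 steps.
safe_return = greedy_return(17)

print(edge_return, safe_return)  # -12.0 -16.0
```

Without exploration the cliff-edge path is better by 4 reward units, which is exactly what the max-based target evaluates.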

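### 4.2 Why does SARSA choose the safe path?

SARSA uses the action actually taken, so it learns **the value of the current ε-greedy policy**:

- Its values account for the random actions that exploration may select
- Next to the cliff, a single exploratory action can cause a fall
- Paths farther from the cliff therefore have higher value under this policy

**Result**: the learned policy is more conservative, but training is more stable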
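A rough one-step calculation shows how much exploration costs next to the cliff. The sketch assumes the agent stands in the row directly above the cliff, its greedy action is "right", and the settings match the experiment (ε = 0.1, four actions, -1 per step, -100 for falling). It looks only at the immediate reward and ignores the extra cost of restarting from the start state.

```python
# Assumed setting: one step taken directly above the cliff, greedy action = "right".
epsilon, n_actions = 0.1, 4
step_reward, cliff_reward = -1.0, -100.0

# Under epsilon-greedy, each specific action is picked at random with
# probability epsilon / n_actions; "down" is the only one that falls.
p_fall = epsilon / n_actions

# Expected immediate reward of one step along the cliff edge.
expected_reward_edge = (1 - p_fall) * step_reward + p_fall * cliff_reward

# One row higher, no single action can reach the cliff.
expected_reward_safe = step_reward

print(f"edge: {expected_reward_edge:.3f}, safe: {expected_reward_safe:.3f}")
```

Each step along the edge costs about -3.475 in expectation instead of -1. SARSA's on-policy values include this penalty, while Q-Learning's max-based target ignores it.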

### 4.3 Visual Comparison

In [None]:
def visualize_value_comparison(q_agent, sarsa_agent, env):
    """Compare the value functions learned by the two algorithms."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    for idx, (agent, name) in enumerate([(q_agent, 'Q-Learning'), (sarsa_agent, 'SARSA')]):
        v_table = np.zeros((env.height, env.width))
        for i in range(env.height):
            for j in range(env.width):
                state = (i, j)
                if state in agent.q_table:
                    v_table[i, j] = np.max(agent.q_table[state])
        
        im = axes[idx].imshow(v_table, cmap='RdYlGn', aspect='auto')
        axes[idx].set_title(f'{name} value function V(s) = max_a Q(s, a)')
        axes[idx].set_xlabel('Column')
        axes[idx].set_ylabel('Row')
        plt.colorbar(im, ax=axes[idx])
        
        # Mark the cliff cells
        for pos in env.cliff:
            axes[idx].add_patch(plt.Rectangle(
                (pos[1]-0.5, pos[0]-0.5), 1, 1,
                fill=True, color='black', alpha=0.5
            ))
    
    plt.tight_layout()
    plt.show()


visualize_value_comparison(q_agent, sarsa_agent, env)

---

## Part 5: Expected SARSA

### 5.1 How It Works

Expected SARSA is a refinement of SARSA that uses the **expectation** of the next state's Q-values instead of a single sampled value:

$$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \mathbb{E}[Q(S',A')] - Q(S,A) \right]$$

The expectation is taken under the current policy:

$$\mathbb{E}[Q(S',A')] = \sum_a \pi(a|S') Q(S',a)$$

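**Characteristic**: it keeps SARSA's on-policy target while removing the variance caused by sampling $A'$, so its update target is as low-variance as Q-Learning's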
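The sketch below illustrates the expectation under an ε-greedy policy and compares it with the sampled values that SARSA would bootstrap from. The Q-values and ε = 0.2 are illustrative, and `epsilon_greedy_probs` is a hypothetical helper name; ties are assigned to the first maximal action, which does not change the expectation because tied actions have equal Q-values.

```python
import numpy as np


def epsilon_greedy_probs(q_values: np.ndarray, epsilon: float) -> np.ndarray:
    """Action probabilities of an epsilon-greedy policy for the given Q-values."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)              # exploration mass, spread uniformly
    probs[np.argmax(q_values)] += 1.0 - epsilon  # remaining mass on the greedy action
    return probs


# Illustrative Q-values of the next state S'.
q_next = np.array([1.0, 2.0, 0.5, 0.5])
probs = epsilon_greedy_probs(q_next, epsilon=0.2)

# Expected SARSA bootstraps from this single, deterministic number.
expected_q = float(np.dot(probs, q_next))

# SARSA bootstraps from Q(S', A') with A' sampled from the same policy.
rng = np.random.default_rng(0)
samples = rng.choice(q_next, size=10_000, p=probs)

print(f"expected Q: {expected_q:.3f}")
print(f"sampled Q:  mean = {samples.mean():.3f}, std = {samples.std():.3f}")
```

The sample mean lands close to the expected value, but every individual SARSA target carries the spread shown by the standard deviation; Expected SARSA averages that spread away analytically.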

In [None]:
class ExpectedSARSAAgent:
    """Expected SARSA agent"""
    
    def __init__(
        self,
        n_actions: int,
        learning_rate: float = 0.1,
        discount_factor: float = 0.99,
        epsilon: float = 1.0,
        epsilon_decay: float = 0.995,
        epsilon_min: float = 0.01
    ):
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.q_table: Dict[Any, np.ndarray] = defaultdict(
            lambda: np.zeros(n_actions)
        )
        
    def get_action(self, state: Any, training: bool = True) -> int:
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        q_values = self.q_table[state]
        max_q = np.max(q_values)
        max_actions = np.where(np.isclose(q_values, max_q))[0]
        return np.random.choice(max_actions)
    
    def _get_expected_q(self, state: Any) -> float:
        """Compute the expected Q-value under the ε-greedy policy."""
        q_values = self.q_table[state]
        
        # Action probabilities under the ε-greedy policy
        probs = np.ones(self.n_actions) * self.epsilon / self.n_actions
        best_action = np.argmax(q_values)
        probs[best_action] += 1 - self.epsilon
        
        return np.dot(probs, q_values)
    
    def update(
        self,
        state: Any,
        action: int,
        reward: float,
        next_state: Any,
        done: bool
    ) -> float:
        """Expected SARSA update. Returns the TD error."""
        current_q = self.q_table[state][action]
        
        if done:
            target = reward
        else:
            # Expected SARSA: bootstrap from the expected Q-value
            target = reward + self.gamma * self._get_expected_q(next_state)
        
        td_error = target - current_q
        self.q_table[state][action] += self.lr * td_error
        return td_error
    
    def decay_epsilon(self) -> None:
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)


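# Train Expected SARSA
print("\nTraining Expected SARSA...")
exp_sarsa_agent = ExpectedSARSAAgent(
    n_actions=4,
    learning_rate=LEARNING_RATE,
    epsilon=EPSILON,
    epsilon_decay=1.0,
    epsilon_min=EPSILON
)

# train_q_learning is reused because Expected SARSA's update() takes the same
# arguments; with verbose=True the progress lines are therefore labeled "Q-Learning".
exp_sarsa_history = train_q_learning(env, exp_sarsa_agent, episodes=EPISODES, verbose=True)

print("\nPath learned by Expected SARSA:")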
exp_sarsa_path = extract_path(exp_sarsa_agent, env)
env.reset()
env.render(exp_sarsa_path)

---

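## Summary

### Algorithm Comparison

| Property | Q-Learning | SARSA | Expected SARSA |
|----------|------------|-------|----------------|
| Type | Off-Policy | On-Policy | On-Policy (as implemented here) |
| Update target | $\max_a Q(S',a)$ | $Q(S',A')$ | $\mathbb{E}[Q(S',A')]$ |
| Policy learned | Optimal policy | Current policy | Current policy |
| Variance of the update target | Low | High | Low |
| Safety | Aggressive | Conservative | Conservative (same ε-greedy target policy as SARSA) |

### Choosing an Algorithm

- **Q-Learning**: when the goal is the optimal policy and unstable training is acceptable
- **SARSA**: when exploration itself must be safe, as in robot control
- **Expected SARSA**: a balanced option that keeps the on-policy target with lower-variance updates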

### Core Formulas

| Algorithm | Update rule |
|-----------|-------------|
| Q-Learning | $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]$ |
| SARSA | $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S',A') - Q(S,A)]$ |
| Expected SARSA | $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \mathbb{E}[Q(S',A')] - Q(S,A)]$ |

---

## Unit Tests

In [None]:
def run_tests():
    """Run the unit tests."""
    print("Running unit tests...\n")
    passed = 0
    failed = 0
    
    # Test 1: SARSA update
    try:
        agent = SARSAAgent(n_actions=4, learning_rate=0.5, discount_factor=0.9)
        state = (0, 0)
        next_state = (0, 1)
        
        # Set the Q-values of next_state
        agent.q_table[next_state] = np.array([1.0, 2.0, 0.0, 0.0])
        
        # SARSA update with next_action=1 (Q-value 2.0)
        agent.update(state, 0, -1.0, next_state, 1, False)
        
        # Q(s,a) = 0 + 0.5 * (-1 + 0.9 * 2.0 - 0) = 0.5 * 0.8 = 0.4
        expected = 0.4
        assert np.isclose(agent.q_table[state][0], expected), \
            f"SARSA update is wrong: {agent.q_table[state][0]} != {expected}"
        
        print("Test 1 passed: SARSA update is correct")
        passed += 1
    except AssertionError as e:
        print(f"Test 1 failed: {e}")
        failed += 1
    
    # Test 2: Expected SARSA expectation
    try:
        agent = ExpectedSARSAAgent(n_actions=4, epsilon=0.2)
        state = (0, 0)
        agent.q_table[state] = np.array([1.0, 2.0, 0.5, 0.5])
        
        # With ε=0.2, action 1 (greedy) has probability 0.8 + 0.2/4 = 0.85
        # Every other action has probability 0.2/4 = 0.05
        # E[Q] = 0.05*1.0 + 0.85*2.0 + 0.05*0.5 + 0.05*0.5 = 1.8
        expected_q = agent._get_expected_q(state)
        assert np.isclose(expected_q, 1.8), f"Expected Q-value is wrong: {expected_q}"
        
        print("Test 2 passed: Expected SARSA expectation is correct")
        passed += 1
    except AssertionError as e:
        print(f"Test 2 failed: {e}")
        failed += 1
    
    # Test 3: Q-Learning and SARSA both reach a reasonable average return
    try:
        env = CliffWalkingEnv()
        
        # Train both agents
        q_agent = QLearningAgent(n_actions=4, learning_rate=0.5, epsilon=0.1, 
                                  epsilon_decay=1.0, epsilon_min=0.1)
        sarsa_agent = SARSAAgent(n_actions=4, learning_rate=0.5, epsilon=0.1,
                                  epsilon_decay=1.0, epsilon_min=0.1)
        
        q_hist = train_q_learning(env, q_agent, episodes=200, verbose=False)
        sarsa_hist = train_sarsa(env, sarsa_agent, episodes=200, verbose=False)
        
        # Check performance after training (mean over the last 50 episodes)
        q_avg = np.mean(q_hist['rewards'][-50:])
        sarsa_avg = np.mean(sarsa_hist['rewards'][-50:])
        
        assert q_avg > -100 and sarsa_avg > -100, "Training did not converge"
        
        print(f"Test 3 passed: both algorithms train normally (Q-Learning: {q_avg:.1f}, SARSA: {sarsa_avg:.1f})")
        passed += 1
    except AssertionError as e:
        print(f"Test 3 failed: {e}")
        failed += 1
    
    print(f"\n{'='*50}")
    print(f"Tests finished: {passed} passed, {failed} failed")
    if failed == 0:
        print("All tests passed!")
    print(f"{'='*50}")
    
    return failed == 0


run_tests()

---

## References

1. Rummery, G.A. & Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems.
2. Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction, 2nd ed. Chapter 6.
3. Van Seijen, H., et al. (2009). A Theoretical and Empirical Analysis of Expected Sarsa.
