# Hindsight Experience Replay (HER): An In-Depth Tutorial

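## Table of Contents
1. [Background: The Challenge of Goal-Conditioned RL](#1-background-the-challenge-of-goal-conditioned-rl)
2. [The Core Idea of HER](#2-the-core-idea-of-her)
3. [Goal Selection Strategies](#3-goal-selection-strategies)
4. [HER Implementation in Detail](#4-her-implementation-in-detail)
5. [Advanced Variants: Prioritized HER and Curriculum HER](#5-advanced-variants-prioritized-her-and-curriculum-her)
6. [Experiments and Analysis](#6-experiments-and-analysis)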

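## 1. Background: The Challenge of Goal-Conditioned RL

### 1.1 Goal-Conditioned Reinforcement Learning

**Standard RL**: learn a policy $\pi(a|s)$ that maximizes $\mathbb{E}[\sum_t \gamma^t r_t]$.

**Goal-conditioned RL**: learn a policy $\pi(a|s, g)$ that maximizes $\mathbb{E}[\sum_t \gamma^t r(s_t, a_t, g)]$, where the goal $g$ is given to the agent as an additional input.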
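One common way to realize $\pi(a|s, g)$ is to feed the policy the concatenation of the state and the goal. The following is a minimal sketch under that assumption; the class `GoalConditionedLinearPolicy`, its deterministic linear form, and the dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


class GoalConditionedLinearPolicy:
    """Toy deterministic policy: a linear map applied to concat(state, goal)."""

    def __init__(self, state_dim, goal_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single weight matrix over the joint (state, goal) input.
        self.W = rng.normal(scale=0.1, size=(action_dim, state_dim + goal_dim))

    def act(self, state, goal):
        x = np.concatenate([state, goal])
        return self.W @ x


policy = GoalConditionedLinearPolicy(state_dim=4, goal_dim=2, action_dim=2)
s = np.zeros(4)
a1 = policy.act(s, np.array([1.0, 0.0]))
a2 = policy.act(s, np.array([0.0, 1.0]))
print(a1, a2)  # same state, different goals, different actions
```

Because the goal is just another input, one set of parameters serves every goal, which is what allows experience gathered for one goal to inform behavior for another.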

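**Typical applications**:
- Robotic manipulation: pick up an object and place it at a specified position
- Navigation: reach an arbitrary target location
- Games: complete a specified task

### 1.2 The Sparse-Reward Problem

A typical sparse reward function:
$$r(s, a, g) = \begin{cases} 0 & \text{if } \|s_{\text{achieved}} - g\| < \epsilon \\ -1 & \text{otherwise} \end{cases}$$

Here $s_{\text{achieved}}$ is the part of the state that is compared with the goal (for example, the object's position) and $\epsilon$ is the success tolerance.

**The problem**:
- A random policy almost never reaches the goal
- Nearly every collected sample has reward $-1$
- With a constant reward, there is no learning signal that distinguishes good actions from bad ones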
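To make the problem concrete, here is a small sketch that rolls out a random walk in a 2D plane and counts how often the sparse reward is ever $0$. The environment (a point starting at the origin, Gaussian steps, goal at $(8, 8)$) and the helper `random_walk_rewards` are assumptions made for illustration only.

```python
import numpy as np


def sparse_reward(achieved, goal, eps=0.5):
    """Return 0 when the achieved position is within eps of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < eps else -1.0


def random_walk_rewards(goal, n_steps=20, step_scale=0.5, rng=None):
    """Roll out a random walk from the origin and collect the per-step rewards."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.zeros(2)
    rewards = []
    for _ in range(n_steps):
        pos = pos + rng.normal(scale=step_scale, size=2)
        rewards.append(sparse_reward(pos, goal))
    return np.array(rewards)


rng = np.random.default_rng(0)
goal = np.array([8.0, 8.0])
all_rewards = np.concatenate(
    [random_walk_rewards(goal, rng=rng) for _ in range(200)]
)
success_fraction = np.mean(all_rewards == 0.0)
print(f"fraction of steps with reward 0: {success_fraction:.4f}")
```

With the goal this far from the start, almost every step returns $-1$, so a learner sees an essentially constant reward.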

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Dict, Optional, Callable
import sys
sys.path.append('..')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
def visualize_sparse_reward_problem():
    """Visualize the sparse-reward problem."""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Panel 1: sketch of a goal-conditioned task
    ax = axes[0]
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    
    # Start point, goal, and trajectory
    start = np.array([1, 1])
    goal = np.array([8, 8])
    
    # Random trajectory (does not reach the goal)
    traj = [start]
    pos = start.copy()
    for _ in range(20):
        pos = pos + np.random.randn(2) * 0.5
        pos = np.clip(pos, 0, 10)
        traj.append(pos.copy())
    traj = np.array(traj)
    
    ax.plot(traj[:, 0], traj[:, 1], 'b-', alpha=0.7, linewidth=2)
    ax.plot(start[0], start[1], 'go', markersize=15, label='Start')
    ax.plot(goal[0], goal[1], 'r*', markersize=20, label='Goal')
    ax.plot(traj[-1, 0], traj[-1, 1], 'bs', markersize=12, label='Final position')
    
    # Goal region
    circle = plt.Circle(goal, 0.5, color='red', alpha=0.2)
    ax.add_patch(circle)
    
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title('Goal-conditioned task: reach the specified goal')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Panel 2: reward distribution (sparse)
    ax = axes[1]
    rewards = np.array([-1] * 100)  # every sample is a failure
    ax.hist(rewards, bins=20, color='red', alpha=0.7)
    ax.axvline(x=0, color='green', linestyle='--', label='Success reward (r = 0)')
    ax.set_xlabel('Reward')
    ax.set_ylabel('Count')
    ax.set_title('Sparse reward distribution (every sample fails)')
    ax.legend()
    
    # Panel 3: after HER relabeling (illustrative numbers, not measured)
    ax = axes[2]
    rewards_her = np.concatenate([
        np.array([-1] * 20),   # original failed samples
        np.array([0] * 80),    # relabeled samples, best case: all reach their new goal
    ])
    ax.hist(rewards_her, bins=20, color='green', alpha=0.7)
    ax.set_xlabel('Reward')
    ax.set_ylabel('Count')
    ax.set_title('Reward distribution after HER (illustrative)')
    
    plt.tight_layout()
    plt.show()

visualize_sparse_reward_problem()

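## 2. The Core Idea of HER

### 2.1 Key Insight

**A failed trajectory still contains useful information**:
- It did not reach the original goal $g$
- But it did reach some state $s_T$
- Had the goal been $g' = s_T$, the same trajectory would count as a success

### 2.2 Mathematical Formulation

**Original transition**: $(s_t, a_t, g, r_t, s_{t+1})$, where $r_t = -1$ (failure)

**Relabeled transition**: $(s_t, a_t, g', r'_t, s_{t+1})$, where:
- $g'$ is a substitute goal chosen by a goal selection strategy, for example $g' = \text{achieved\_goal}(s_{t+1})$
- $r'_t = R(s_{t+1}, g')$ is recomputed for the new goal; it equals $0$ (success) when $g' = \text{achieved\_goal}(s_{t+1})$, and may still be $-1$ when $g'$ is achieved only later in the episode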
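The relabeling step can be sketched in a few lines. This is a minimal illustration, assuming transitions are stored as plain dictionaries and the achieved goal is the 2D position of the next state; the helper `relabel` and the dictionary keys are hypothetical.

```python
import numpy as np


def sparse_reward(achieved_goal, goal, eps=0.5):
    """Return 0 when the achieved goal is within eps of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) < eps else -1.0


def relabel(transition, new_goal):
    """Copy the transition, swap in the new goal, and recompute the reward."""
    relabeled = dict(transition)
    relabeled["goal"] = new_goal
    relabeled["reward"] = sparse_reward(transition["achieved_goal"], new_goal)
    return relabeled


original = {
    "state": np.array([2.0, 1.5]),
    "action": np.array([1.0, 0.5]),
    "next_state": np.array([3.0, 2.0]),
    "achieved_goal": np.array([3.0, 2.0]),  # goal achieved in next_state
    "goal": np.array([8.0, 8.0]),
    "reward": -1.0,
}
hindsight = relabel(original, new_goal=original["achieved_goal"])
print(hindsight["reward"])  # 0.0: the new goal is exactly what was achieved
```

Note that the state, action, and next state are untouched; only the goal and the reward change, which is why relabeling requires a reward function that can be evaluated for arbitrary goals.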

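### 2.3 Why Does HER Work?

1. **Sample efficiency**: each stored transition is replayed with its original goal plus $k$ relabeled goals, giving $(1 + k)$ times as many training samples
2. **Share of informative samples**: a fraction $\frac{k}{k+1}$ of the training samples carry a goal that the agent has actually achieved. Only transitions that land within $\epsilon$ of their relabeled goal receive reward $0$, so the share of zero-reward samples rises from roughly 0% to a nonzero value that is bounded above by $\frac{k}{k+1}$ when the original goal is never reached
3. **Value function learning**: the Q-network learns how to reach a wide range of states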
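The difference between the relabeled share and the zero-reward share can be checked numerically. The sketch below uses a random-walk episode and the 'future' strategy; it assumes that the candidate set for a transition includes the goal achieved by that transition itself, which is one common implementation choice. The helper `her_future_rewards` is hypothetical.

```python
import numpy as np


def sparse_reward(achieved_goal, goal, eps=0.5):
    """Return 0 when the achieved goal is within eps of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) < eps else -1.0


def her_future_rewards(achieved, goal, k, rng):
    """Rewards of the original and the 'future'-relabeled copies of one episode.

    achieved[t] is the goal achieved after transition t.
    """
    original, relabeled = [], []
    T = len(achieved)
    for t in range(T):
        original.append(sparse_reward(achieved[t], goal))
        for _ in range(k):
            future_idx = rng.integers(t, T)  # this step or any later one
            relabeled.append(sparse_reward(achieved[t], achieved[future_idx]))
    return np.array(original), np.array(relabeled)


rng = np.random.default_rng(0)
k = 4
achieved = np.cumsum(rng.normal(scale=0.5, size=(20, 2)), axis=0)
orig_r, her_r = her_future_rewards(achieved, np.array([8.0, 8.0]), k, rng)

relabeled_fraction = len(her_r) / (len(orig_r) + len(her_r))
positive_fraction = np.mean(np.concatenate([orig_r, her_r]) == 0.0)
print(f"relabeled fraction: {relabeled_fraction:.2f}")
print(f"positive fraction:  {positive_fraction:.2f}")
```

The relabeled fraction is exactly $\frac{k}{k+1}$ by construction, while the positive fraction depends on how far the agent moves between the transition and the sampled future step.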

In [None]:
def demonstrate_her_relabeling():
    """Demonstrate the HER relabeling process."""
    
    print("=" * 70)
    print("HER relabeling demo")
    print("=" * 70)
    
    # Simulate a failed episode
    original_goal = np.array([8.0, 8.0])
    trajectory = [
        {'state': np.array([0.0, 0.0]), 'achieved': np.array([0.0, 0.0])},
        {'state': np.array([1.0, 0.5]), 'achieved': np.array([1.0, 0.5])},
        {'state': np.array([2.0, 1.5]), 'achieved': np.array([2.0, 1.5])},
        {'state': np.array([3.0, 2.0]), 'achieved': np.array([3.0, 2.0])},  # final state
    ]
    
    epsilon = 0.5  # success threshold
    
    print(f"\nOriginal goal: {original_goal}")
    print(f"Final achieved goal: {trajectory[-1]['achieved']}")
    print(f"Distance to goal: {np.linalg.norm(trajectory[-1]['achieved'] - original_goal):.2f}")
    print(f"Success threshold: {epsilon}")
    print("Result: ❌ failure (goal not reached)")
    
    print("\n" + "-" * 70)
    print("HER relabeling ('future' strategy, k=4)")
    print("-" * 70)
    
    k = 4
    for t, step in enumerate(trajectory[:-1]):
        print(f"\nTransition {t}: s={step['state']} -> s'={trajectory[t+1]['state']}")
        
        # Original transition
        distance_to_goal = np.linalg.norm(trajectory[t+1]['achieved'] - original_goal)
        original_reward = 0 if distance_to_goal < epsilon else -1
        print(f"  Original: g={original_goal}, r={original_reward}")
        
        # Relabel: sample k goals (with replacement) from the goals achieved from this step onward
        future_goals = [trajectory[i]['achieved'] for i in range(t+1, len(trajectory))]
        
        for i in range(k):
            new_goal = future_goals[np.random.randint(len(future_goals))]
            distance = np.linalg.norm(trajectory[t+1]['achieved'] - new_goal)
            new_reward = 0 if distance < epsilon else -1
            status = "✓" if new_reward == 0 else "✗"
            print(f"  HER {i+1}: g'={new_goal}, r'={new_reward} {status}")

demonstrate_her_relabeling()

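## 3. Goal Selection Strategies

### 3.1 The Four Main Strategies

Here $\text{ag}_i$ denotes the goal achieved in state $s_i$, $t$ is the index of the transition being relabeled, and $T$ is the final time step of the episode.

| Strategy | Description | Formula |
|----------|-------------|---------|
| **final** | Use the goal achieved at the end of the episode | $g' = \text{ag}_{T}$ |
| **future** | Sample from goals achieved after time $t$ in the same episode | $g' \sim \{\text{ag}_i : t < i \le T\}$ |
| **episode** | Sample from all goals achieved in the same episode | $g' \sim \{\text{ag}_i : i \in [0,T]\}$ |
| **random** | Sample from all goals achieved so far during training | $g' \sim \mathcal{D}_{\text{achieved}}$ |

### 3.2 Choosing a Strategy

- **future** is usually the best default: the substitute goal lies on the path the agent actually followed after the transition (Andrychowicz et al., 2017, report it performing best with $k = 4$ or $k = 8$)
- **final** is the simplest and works well for quick prototypes
- **random** offers the greatest goal diversity, which may help generalization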
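The four strategies differ only in the index set they sample from, which the following sketch makes explicit. It assumes `achieved[i]` holds the goal achieved after transition `i` (so the 'future' set for transition `t` starts at index `t`), and that `history` holds achieved goals from earlier episodes. The function `sample_hindsight_goals` is a hypothetical helper, not the API of any particular library.

```python
import numpy as np


def sample_hindsight_goals(achieved, t, strategy, k, rng, history=None):
    """Sample k substitute goals for transition t of one episode.

    achieved[i] is the goal achieved after transition i. `history` holds
    achieved goals from earlier episodes and is only used by 'random'.
    """
    T = len(achieved)
    if strategy == "final":
        idx = np.full(k, T - 1)
    elif strategy == "future":
        idx = rng.integers(t, T, size=k)
    elif strategy == "episode":
        idx = rng.integers(0, T, size=k)
    elif strategy == "random":
        pool = achieved if history is None else np.concatenate([history, achieved])
        return pool[rng.integers(0, len(pool), size=k)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return achieved[idx]


rng = np.random.default_rng(0)
achieved = np.cumsum(rng.normal(scale=0.5, size=(10, 2)), axis=0)
history = rng.normal(scale=2.0, size=(50, 2))
for name in ("final", "future", "episode", "random"):
    goals = sample_hindsight_goals(
        achieved, t=3, strategy=name, k=4, rng=rng, history=history
    )
    print(name, goals.shape)
```

Sampling is done with replacement, so 'future' still returns `k` goals for the last transitions of an episode, where fewer than `k` distinct candidates remain.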

In [None]:
def visualize_goal_strategies():
    """Visualize the different goal selection strategies."""
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Simulated trajectory
    np.random.seed(42)
    T = 10
    trajectory = np.cumsum(np.random.randn(T, 2) * 0.5, axis=0)
    current_t = 3  # current time step
    
    strategies = [
        ('final', 'Final: use the final achieved goal'),
        ('future', 'Future: sample from goals achieved later'),
        ('episode', 'Episode: sample from the whole episode'),
        ('random', 'Random: sample from past achieved goals'),
    ]
    
    for ax, (strategy, title) in zip(axes.flat, strategies):
        # Plot the trajectory
        ax.plot(trajectory[:, 0], trajectory[:, 1], 'b-', alpha=0.5, linewidth=2)
        
        # Mark every state
        for i in range(T):
            color = 'red' if i == current_t else 'blue'
            size = 100 if i == current_t else 30
            ax.scatter(trajectory[i, 0], trajectory[i, 1], c=color, s=size, zorder=5)
        
        # Highlight the candidate goals for this strategy
        if strategy == 'final':
            candidates = [trajectory[-1]]
        elif strategy == 'future':
            candidates = trajectory[current_t+1:]
        elif strategy == 'episode':
            candidates = trajectory
        else:  # random
            candidates = np.random.randn(10, 2) * 2 + trajectory.mean(axis=0)
        
        for cand in candidates:
            ax.scatter(cand[0], cand[1], c='green', s=80, marker='*', 
                      alpha=0.7, edgecolors='black')
        
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.set_title(title)
        ax.grid(True, alpha=0.3)
        
        # Legend entries
        ax.scatter([], [], c='red', s=100, label=f'Current state t={current_t}')
        ax.scatter([], [], c='green', s=80, marker='*', label='Candidate HER goals')
        ax.legend(loc='upper left')
    
    plt.tight_layout()
    plt.show()

visualize_goal_strategies()

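## 4. HER Implementation in Detail

The cells in this section import from the `hindsight_experience_replay` module, which is assumed to be available on the import path (the setup cell adds `..` to `sys.path`).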

In [None]:
from hindsight_experience_replay import (
    GoalSelectionStrategy,
    Transition,
    Episode,
    HERConfig,
    GoalConditionedReplayBuffer,
    HindsightExperienceReplay,
)

In [None]:
# Define the reward function
def sparse_reward(achieved_goal, desired_goal, threshold=0.5):
    """Sparse reward: 0 within `threshold` of the goal, -1 otherwise."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if distance < threshold else -1.0

# Create the HER configuration
config = HERConfig(
    strategy=GoalSelectionStrategy.FUTURE,
    n_sampled_goal=4,  # sample 4 HER goals per transition
    reward_function=sparse_reward,
)

print("HER configuration:")
print(f"  Goal selection strategy: {config.strategy.value}")
print(f"  Goals sampled per transition: {config.n_sampled_goal}")

In [None]:
# Create the HER replay buffer
state_dim = 6  # (pos_x, pos_y, vel_x, vel_y, grip, obj)
action_dim = 3
goal_dim = 2  # (target_x, target_y)

her = HindsightExperienceReplay(
    capacity=10000,
    state_dim=state_dim,
    action_dim=action_dim,
    goal_dim=goal_dim,
    config=config,
)

print(f"HER buffer capacity: {her.capacity}")

In [None]:
# Simulate an episode of a goal-conditioned task
def simulate_episode(goal, episode_length=20):
    """Simulate one episode of a goal-conditioned task with a random policy."""
    
    transitions = []
    
    # Initial state
    state = np.zeros(state_dim)
    state[:2] = np.random.randn(2) * 0.5  # random start position
    
    for _ in range(episode_length):
        # Random action (for simplicity)
        action = np.random.randn(action_dim) * 0.1
        
        # Simple dynamics
        next_state = state.copy()
        next_state[:2] += action[:2]  # position update
        
        # Achieved goal = position after the step
        achieved_goal = next_state[:2].copy()
        
        # Compute the reward
        reward = sparse_reward(achieved_goal, goal)
        
        # Check for success
        done = reward == 0.0
        
        transition = Transition(
            state=state,
            action=action,
            reward=reward,
            next_state=next_state,
            done=done,
            goal=goal,
            achieved_goal=achieved_goal,
        )
        transitions.append(transition)
        
        if done:
            break
            
        state = next_state
    
    return Episode(transitions=transitions, goal=goal)

# Simulate several episodes
print("Simulating the goal-conditioned task...")

successes = 0
for i in range(50):
    goal = np.random.randn(goal_dim) * 3  # random goal
    episode = simulate_episode(goal)
    
    # Store the episode in the HER buffer
    her.store_episode(episode)
    
    if episode.transitions[-1].done:
        successes += 1

print(f"\nOriginal success rate: {successes}/50 = {successes/50:.1%}")
print(f"HER buffer size: {len(her)}")

In [None]:
# Sample a batch and analyze the HER samples
batch_size = 256
batch = her.sample(batch_size)

# Count rewards by sign
positive_rewards = np.sum(batch.rewards >= 0)
negative_rewards = np.sum(batch.rewards < 0)

print("\nHER sampling analysis:")
print(f"  Batch size: {batch_size}")
print(f"  Zero-reward (success) samples: {positive_rewards} ({positive_rewards/batch_size:.1%})")
print(f"  Negative-reward (failure) samples: {negative_rewards} ({negative_rewards/batch_size:.1%})")

# Visualize
plt.figure(figsize=(8, 4))
plt.bar(['Success (r ≥ 0)', 'Failure (r < 0)'], 
        [positive_rewards, negative_rewards],
        color=['green', 'red'])
plt.ylabel('Number of samples')
plt.title('Reward distribution in a HER-sampled batch')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

print(f"\nOriginal success rate: {successes/50:.1%}; success samples in the HER batch: {positive_rewards/batch_size:.1%}")

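## 5. Advanced Variants: Prioritized HER and Curriculum HER

### 5.1 Prioritized HER

Prioritized HER combines HER with prioritized experience replay: transition $i$ is sampled with a probability that grows with the magnitude of its TD error $\delta_i$:

$$P(i) \propto |\delta_i|^\alpha$$

The exponent $\alpha \ge 0$ controls how strongly sampling is skewed; $\alpha = 0$ recovers uniform sampling.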
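As a worked example of the formula, the sketch below turns a vector of TD errors into sampling probabilities and draws a batch of indices. The small constant `eps` (which keeps transitions with zero TD error sampleable) and the default `alpha=0.6` are illustrative assumptions, and `priority_probabilities` is a hypothetical helper.

```python
import numpy as np


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Normalize (|delta_i| + eps)^alpha into a probability vector."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


td_errors = np.array([0.1, -2.0, 0.5, 0.0])
probs = priority_probabilities(td_errors)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(td_errors), size=8, p=probs)
print(np.round(probs, 3), batch_idx)
```

The transition with the largest absolute TD error (index 1 here) gets the highest probability, and only the magnitude of the error matters, not its sign.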

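### 5.2 Curriculum HER

**Adjust goal difficulty dynamically over training**:
- Early on: prefer "easy" goals (close to what the agent can already do)
- Later: prefer "harder" goals (pushing the boundary of what the agent can do)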
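One simple way to express such a schedule is to weight candidate goals by their distance from the agent and to flip the sign of that preference as training progresses. The sketch below is a distance-based heuristic written for illustration; it is not the method of Fang et al. (2019), and the function `curriculum_weights`, the `progress` variable, and the `temperature` parameter are all assumptions.

```python
import numpy as np


def curriculum_weights(distances, progress, temperature=1.0):
    """Sampling weights over candidate goals.

    progress in [0, 1]: 0 favors near goals, 0.5 is uniform, 1 favors far goals.
    """
    d = np.asarray(distances, dtype=float)
    # Normalize distances to [0, 1] so the temperature is scale-free.
    span = d.max() - d.min()
    d_norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    # The preference slides from "near" (negative sign) to "far" (positive sign).
    scores = (2.0 * progress - 1.0) * d_norm / temperature
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()


rng = np.random.default_rng(0)
goals = rng.normal(scale=3.0, size=(200, 2))
dist = np.linalg.norm(goals, axis=1)
for progress in (0.0, 0.5, 1.0):
    w = curriculum_weights(dist, progress, temperature=0.2)
    print(f"progress={progress:.1f}  mean sampled distance={np.sum(w * dist):.2f}")
```

The expected distance of a sampled goal increases monotonically with `progress`, which matches the easy-to-hard ordering described above; a lower `temperature` makes the preference sharper.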

In [None]:
from hindsight_experience_replay import PrioritizedHER, CurriculumHER

In [None]:
# Demonstrate goal sampling under a curriculum
def demonstrate_curriculum():
    """Demonstrate goal sampling in curriculum HER."""
    
    # The agent's current capability
    agent_reach = 2.0  # distance the agent can reliably reach
    
    # Simulated goal pools
    easy_goals = np.random.randn(50, 2) * 1.0  # easy goals
    medium_goals = np.random.randn(50, 2) * 3.0  # medium goals
    hard_goals = np.random.randn(50, 2) * 5.0  # hard goals
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    training_stages = [
        ('Early: mostly easy goals', 0.8, 0.15, 0.05),
        ('Middle: mixed sampling', 0.3, 0.5, 0.2),
        ('Late: mostly hard goals', 0.1, 0.3, 0.6),
    ]
    
    for ax, (title, easy_prob, med_prob, hard_prob) in zip(axes, training_stages):
        # Sample goals according to the stage probabilities
        n_samples = 100
        sampled = []
        
        for _ in range(n_samples):
            r = np.random.random()
            if r < easy_prob:
                idx = np.random.randint(len(easy_goals))
                sampled.append(('easy', easy_goals[idx]))
            elif r < easy_prob + med_prob:
                idx = np.random.randint(len(medium_goals))
                sampled.append(('medium', medium_goals[idx]))
            else:
                idx = np.random.randint(len(hard_goals))
                sampled.append(('hard', hard_goals[idx]))
        
        # Plot
        colors = {'easy': 'green', 'medium': 'orange', 'hard': 'red'}
        for difficulty, goal in sampled:
            ax.scatter(goal[0], goal[1], c=colors[difficulty], alpha=0.5, s=30)
        
        # Circle marking the agent's reach
        circle = plt.Circle((0, 0), agent_reach, fill=False, 
                           color='blue', linestyle='--', linewidth=2)
        ax.add_patch(circle)
        
        ax.scatter(0, 0, c='blue', s=100, marker='s', label='Agent')
        ax.set_xlim(-8, 8)
        ax.set_ylim(-8, 8)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.set_title(title)
        ax.set_aspect('equal')
        ax.grid(True, alpha=0.3)
        
        # Legend entries
        ax.scatter([], [], c='green', label='Easy goals')
        ax.scatter([], [], c='orange', label='Medium goals')
        ax.scatter([], [], c='red', label='Hard goals')
        ax.legend(loc='upper right', fontsize=8)
    
    plt.tight_layout()
    plt.show()

demonstrate_curriculum()

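## 6. Experiments and Analysis

### 6.1 HER vs. Standard RL: Learning Curve Comparison

The curves in this section are synthetic: `simulate_learning_comparison` generates them from hand-picked learning rates to illustrate the expected qualitative difference. They are not measurements from trained agents.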

In [None]:
def simulate_learning_comparison():
    """Generate synthetic learning curves for HER vs. standard RL (illustration only)."""
    
    np.random.seed(42)
    n_episodes = 200
    
    # Standard RL (sparse reward, slow learning)
    standard_rl = []
    success_prob = 0.01  # very low initial success rate
    for ep in range(n_episodes):
        # Slow learning
        success_prob = min(0.95, success_prob + 0.002 * (1 - success_prob))
        success_prob += np.random.randn() * 0.01
        success_prob = np.clip(success_prob, 0, 1)
        standard_rl.append(success_prob)
    
    # HER (fast learning)
    her_rl = []
    success_prob = 0.01
    for ep in range(n_episodes):
        # Fast learning
        success_prob = min(0.95, success_prob + 0.02 * (1 - success_prob))
        success_prob += np.random.randn() * 0.02
        success_prob = np.clip(success_prob, 0, 1)
        her_rl.append(success_prob)
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Learning curves
    ax = axes[0]
    ax.plot(standard_rl, 'b-', label='Standard RL (sparse reward)', alpha=0.8)
    ax.plot(her_rl, 'r-', label='HER', alpha=0.8)
    ax.set_xlabel('Training episode')
    ax.set_ylabel('Success rate')
    ax.set_title('Learning curves (simulated)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0, 1)
    
    # Episodes needed to reach each threshold
    ax = axes[1]
    thresholds = [0.3, 0.5, 0.7, 0.9]
    
    standard_episodes = []
    her_episodes = []
    
    for thresh in thresholds:
        # Find the first episode that reaches the threshold (n_episodes if never reached)
        std_ep = next((i for i, v in enumerate(standard_rl) if v >= thresh), n_episodes)
        her_ep = next((i for i, v in enumerate(her_rl) if v >= thresh), n_episodes)
        standard_episodes.append(std_ep)
        her_episodes.append(her_ep)
    
    x = np.arange(len(thresholds))
    width = 0.35
    
    ax.bar(x - width/2, standard_episodes, width, label='Standard RL', color='blue')
    ax.bar(x + width/2, her_episodes, width, label='HER', color='red')
    ax.set_xlabel('Target success rate')
    ax.set_ylabel('Training episodes needed')
    ax.set_title('Sample efficiency (simulated)')
    ax.set_xticks(x)
    ax.set_xticklabels([f'{t:.0%}' for t in thresholds])
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
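    # Compute the speedup (only meaningful when both curves reach the threshold)
    print("\nSample-efficiency comparison (simulated curves):")
    for thresh, std, her_ep in zip(thresholds, standard_episodes, her_episodes):
        if std >= n_episodes or her_ep >= n_episodes:
            print(f"  {thresh:.0%} success rate: not reached within {n_episodes} episodes by at least one method")
        elif her_ep > 0:
            speedup = std / her_ep
            print(f"  {thresh:.0%} success rate: standard RL={std} episodes, HER={her_ep} episodes, speedup {speedup:.1f}x")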

simulate_learning_comparison()

In [None]:
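# Print a summary (the box is built programmatically so the borders stay aligned)
summary_lines = [
    "",
    "  Core idea:",
    "    failed trajectory + substituted goal = successful experience",
    "",
    "  Relabeling:",
    "    original: (s, a, g, r=-1, s')",
    "    HER:      (s, a, g'=achieved(s'), r'=0, s')",
    "",
    "  Goal selection strategies:",
    "    • final:   goal achieved at the end of the episode",
    "    • future:  goals achieved later in the episode (recommended)",
    "    • episode: goals achieved anywhere in the episode",
    "    • random:  goals achieved at any point in training",
    "",
    "  Sample efficiency:",
    "    relabeled share of samples: k/(k+1)",
    "    speedup: task-dependent; the curves above are simulated",
    "",
    "  Where it applies:",
    "    ✓ goal-conditioned tasks",
    "    ✓ sparse-reward environments",
    "    ✓ robotic manipulation",
    "",
]
width = 74
title = "Hindsight Experience Replay (HER): Key Takeaways"
print("╔" + "═" * width + "╗")
print("║" + title.center(width) + "║")
print("╠" + "═" * width + "╣")
for line in summary_lines:
    print("║" + line.ljust(width) + "║")
print("╚" + "═" * width + "╝")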

## References

1. Andrychowicz, M., et al. (2017). Hindsight Experience Replay. NeurIPS.
2. Plappert, M., et al. (2018). Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research. arXiv preprint.
3. Fang, M., et al. (2019). Curriculum-guided Hindsight Experience Replay. NeurIPS.
4. Zhao, R., & Tresp, V. (2018). Energy-Based Hindsight Experience Prioritization. CoRL.