# Popular Reinforcement Learning Algorithms: A Comprehensive Guide

---

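## Table of Contents

1. [Introduction and Theoretical Foundations](#1-引言与理论基础)
2. [DDPG: Deep Deterministic Policy Gradient](#2-ddpg-深度确定性策略梯度)
3. [TD3: Twin Delayed Deep Deterministic Policy Gradient](#3-td3-双延迟深度确定性策略梯度)
4. [SAC: Soft Actor-Critic](#4-sac-软演员-评论家)
5. [Algorithm Comparison and Experiments](#5-算法对比与实验)
6. [Best Practices and Tuning Guide](#6-最佳实践与调参指南)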

---

## Environment Setup

In [None]:
# Import the required libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings('ignore')

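# Fix the random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Plot style. The 'seaborn-v0_8-*' names only exist in matplotlib >= 3.6,
# so fall back to the default style on older versions
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    pass
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12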

In [None]:
# Import the accompanying implementation module
from popular_rl_algorithms import (
    DDPGAgent, DDPGConfig,
    TD3Agent, TD3Config,
    SACAgent, SACConfig,
    ReplayBuffer,
    DeterministicActor, GaussianActor,
    QNetwork, TwinQNetwork,
    train_continuous_agent,
    evaluate_agent,
    compare_algorithms,
)

try:
    import gymnasium as gym
    HAS_GYM = True
    print("Gymnasium loaded successfully!")
except ImportError:
    HAS_GYM = False
    print("Warning: gymnasium not installed. Some features will be unavailable.")

---

## 1. 引言与理论基础

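### 1.1 Formalizing the Reinforcement Learning Problem

A reinforcement learning problem is usually formalized as a **Markov decision process (MDP)**, defined by the five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

| Symbol | Meaning | Description |
|--------|---------|-------------|
| $\mathcal{S}$ | State space | The set of all states the environment can be in |
| $\mathcal{A}$ | Action space | The set of all actions the agent can take |
| $P(s'\|s, a)$ | Transition probability | Probability of moving to $s'$ after taking action $a$ in state $s$ |
| $R(s, a, s')$ | Reward function | Immediate reward received on the transition |
| $\gamma \in [0, 1]$ | Discount factor | Decay applied to future rewards; with bounded rewards, $\gamma < 1$ keeps the infinite-horizon sum finite |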

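### 1.2 Core Objective

**Goal**: find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory generated by following policy $\pi$.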
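As a small illustration of the sum inside $J(\pi)$, the sketch below computes the discounted return of a single finite reward sequence. The function name `discounted_return` and the sample rewards are assumptions made for this example; they are not part of the accompanying module.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)
```

The expectation in $J(\pi)$ would then be estimated by averaging this quantity over many trajectories sampled from $\pi$.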

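### 1.3 Value Functions

#### State Value Function
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

**Intuition**: starting from state $s$ and following policy $\pi$, how much cumulative reward can we expect?

#### Action Value Function (Q-Function)
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

**Intuition**: if we take action $a$ in state $s$ and then follow policy $\pi$, how much cumulative reward can we expect?

#### Advantage Function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

**Intuition**: how much better is action $a$ than the policy's average behavior in $s$? A positive value means better than average, a negative value means worse.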

In [None]:
# Visualize the value-function concepts
def visualize_value_concepts():
    """Visualize how V(s), Q(s,a) and A(s,a) relate to each other"""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # V(s): state value
    states = np.linspace(-2, 2, 100)
    v_values = -states**2 + 4  # synthetic V(s), for illustration only
    axes[0].plot(states, v_values, 'b-', linewidth=2)
    axes[0].fill_between(states, v_values, alpha=0.3)
    axes[0].set_xlabel('State s')
    axes[0].set_ylabel('V(s)')
    axes[0].set_title('State Value Function V(s)\n"How good is this state?"')
    axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
    
    # Q(s,a): action value
    actions = ['Left', 'Stay', 'Right']
    q_values = [2.5, 3.0, 3.5]  # synthetic Q(s,a), for illustration only
    colors = ['#ff7f0e', '#2ca02c', '#1f77b4']
    bars = axes[1].bar(actions, q_values, color=colors, edgecolor='black')
    axes[1].set_xlabel('Action a')
    axes[1].set_ylabel('Q(s, a)')
    axes[1].set_title('Action Value Function Q(s,a)\n"How good is action a in state s?"')
    # Using the plain mean of Q as V(s) assumes a uniform policy over the three actions
    axes[1].axhline(y=np.mean(q_values), color='r', linestyle='--', label=f'V(s) = {np.mean(q_values):.1f}')
    axes[1].legend()
    
    # A(s,a): advantage function
    v_s = np.mean(q_values)
    a_values = [q - v_s for q in q_values]
    colors_a = ['red' if a < 0 else 'green' for a in a_values]
    axes[2].bar(actions, a_values, color=colors_a, edgecolor='black', alpha=0.7)
    axes[2].set_xlabel('Action a')
    axes[2].set_ylabel('A(s, a)')
    axes[2].set_title('Advantage Function A(s,a) = Q(s,a) - V(s)\n"How much better than average is this action?"')
    axes[2].axhline(y=0, color='k', linestyle='-', linewidth=1)
    
    plt.tight_layout()
    plt.show()

visualize_value_concepts()

### 1.4 Bellman Equations

Value functions satisfy a recursive relationship known as the **Bellman equation**:

#### Bellman Expectation Equations
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[R(s, a) + \gamma \mathbb{E}_{s' \sim P}[V^\pi(s')]\right]$$

$$Q^\pi(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P}\left[\mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')]\right]$$

#### Bellman Optimality Equations
$$V^*(s) = \max_a \left[R(s, a) + \gamma \mathbb{E}_{s' \sim P}[V^*(s')]\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \mathbb{E}_{s' \sim P}\left[\max_{a'} Q^*(s', a')\right]$$

**Key idea**: the value of the current state = immediate reward + discounted future value
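To make the Bellman optimality equation concrete, here is a minimal value-iteration sketch on a hypothetical two-state deterministic MDP. The `TRANSITIONS` table, the action names, and the function `value_iteration` are all invented for this illustration.

```python
# Hypothetical deterministic MDP: TRANSITIONS[s][a] = (reward, next_state)
TRANSITIONS = {
    0: {"stay": (0.0, 0), "move": (1.0, 1)},
    1: {"stay": (2.0, 1), "move": (0.0, 0)},
}


def value_iteration(transitions, gamma, tol=1e-10, max_iters=10_000):
    """Apply V(s) <- max_a [R(s,a) + gamma * V(s')] until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(r + gamma * values[s_next] for r, s_next in actions.values())
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


optimal_values = value_iteration(TRANSITIONS, gamma=0.5)
print(optimal_values)
```

With $\gamma = 0.5$, staying in state 1 forever is worth $2 / (1 - 0.5) = 4$, and moving there from state 0 is worth $1 + 0.5 \cdot 4 = 3$, which is what the iteration converges to.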

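### 1.5 Policy Gradient Theorem

For a parameterized policy $\pi_\theta$, the gradient of the objective with respect to the parameters is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t, a_t)\right]$$

**Intuition**:
- $\nabla_\theta \log \pi_\theta(a_t|s_t)$: the direction that increases the probability of action $a_t$
- $A^{\pi_\theta}(s_t, a_t)$: how good or bad action $a_t$ is
- Their product: good actions (positive advantage) become more likely, bad actions (negative advantage) become less likely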
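The intuition above can be checked on a one-step problem. The sketch below assumes a hypothetical two-armed bandit with a softmax policy over logits, and computes the expectation exactly instead of sampling trajectories; `softmax` and `policy_gradient` are names introduced for this example only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(logits, rewards):
    """Exact E_a[ grad log pi(a) * A(a) ] for a one-step (bandit) problem."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # V = E_a[r(a)]
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        advantage = r_a - baseline
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k)
            score = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += p_a * score * advantage
    return grad


logits = [0.0, 0.0]
rewards = [1.0, 0.0]  # action 0 is the better arm
for _ in range(200):
    grad = policy_gradient(logits, rewards)
    logits = [x + 1.0 * g for x, g in zip(logits, grad)]  # gradient ascent

final_probs = softmax(logits)
print(final_probs)
```

The action with positive advantage ends up with almost all of the probability mass, as the last bullet describes.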

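### 1.6 Algorithm Taxonomy

```
Reinforcement learning algorithms
├── Value-Based
│   ├── Q-Learning
│   ├── DQN (and variants: Double, Dueling, Rainbow)
│   └── Suited to discrete action spaces
│
├── Policy-Based
│   ├── REINFORCE
│   ├── PPO, TRPO
│   └── Suited to continuous and discrete action spaces
│
└── Actor-Critic (combines both)
    ├── A2C/A3C
    ├── DDPG → TD3 (deterministic policy)
    └── SAC (stochastic policy + maximum entropy)
```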

In [None]:
# Visualize the algorithm taxonomy
def visualize_algorithm_taxonomy():
    """Draw a chart of the algorithm taxonomy"""
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.axis('off')
    
    # Draw the category boxes
    boxes = {
        'Value-Based': (1, 7, 2.5, 2, '#3498db'),
        'Policy-Based': (4, 7, 2.5, 2, '#e74c3c'),
        'Actor-Critic': (7, 7, 2.5, 2, '#2ecc71'),
    }
    
    algorithms = {
        'Value-Based': ['DQN', 'Double DQN', 'Dueling DQN', 'Rainbow'],
        'Policy-Based': ['REINFORCE', 'PPO', 'TRPO'],
        'Actor-Critic': ['A2C/A3C', 'DDPG', 'TD3', 'SAC'],
    }
    
    for name, (x, y, w, h, color) in boxes.items():
        rect = plt.Rectangle((x, y), w, h, fill=True, facecolor=color, 
                             edgecolor='black', linewidth=2, alpha=0.7)
        ax.add_patch(rect)
        ax.text(x + w/2, y + h - 0.3, name, ha='center', va='top', 
               fontsize=12, fontweight='bold', color='white')
        
        # Add the list of algorithms
        for i, algo in enumerate(algorithms[name]):
            ax.text(x + w/2, y + h - 0.8 - i*0.4, algo, ha='center', 
                   fontsize=10, color='white')
    
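    # Add characteristic annotations (typical traits only; e.g. A2C/A3C are on-policy
    # and also handle discrete actions, unlike DDPG/TD3/SAC)
    ax.text(2.25, 4.5, 'Discrete actions\nHigh sample efficiency\nDeterministic policy', ha='center', 
           fontsize=9, style='italic', bbox=dict(boxstyle='round', facecolor='wheat'))
    ax.text(5.25, 4.5, 'Continuous/discrete actions\nOn-Policy\nHigh variance', ha='center', 
           fontsize=9, style='italic', bbox=dict(boxstyle='round', facecolor='wheat'))
    ax.text(8.25, 4.5, 'Continuous actions\nOff-Policy (DDPG/TD3/SAC)\nLower variance', ha='center', 
           fontsize=9, style='italic', bbox=dict(boxstyle='round', facecolor='wheat'))
    
    ax.set_title('Taxonomy of Deep Reinforcement Learning Algorithms', fontsize=16, fontweight='bold', pad=20)
    
    # Connect each box to its annotation with an arrow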
    ax.annotate('', xy=(2.25, 5.2), xytext=(2.25, 7),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5))
    ax.annotate('', xy=(5.25, 5.2), xytext=(5.25, 7),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5))
    ax.annotate('', xy=(8.25, 5.2), xytext=(8.25, 7),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5))
    
    plt.tight_layout()
    plt.show()

visualize_algorithm_taxonomy()

---

## 2. DDPG: 深度确定性策略梯度

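### 2.1 Core Idea

**DDPG (Deep Deterministic Policy Gradient)** carries the ideas behind DQN over to continuous action spaces:

- **Deterministic policy**: $a = \mu_\theta(s)$, which outputs the action value directly
- **Experience replay**: breaks the correlation between consecutive samples
- **Target networks**: stabilize the training targets
- **Exploration noise**: exploration comes from noise added to the action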

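### 2.2 Mathematical Foundations

#### Deterministic Policy Gradient Theorem (Silver et al., 2014)

For a deterministic policy $\mu_\theta: \mathcal{S} \rightarrow \mathcal{A}$, the gradient is:

$$\nabla_\theta J = \mathbb{E}_{s \sim \mathcal{D}}\left[\nabla_a Q(s, a)|_{a=\mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s)\right]$$

**Intuition**: adjust the policy in the direction that increases the Q-value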
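The chain rule in this gradient can be traced by hand on a one-dimensional toy problem. The sketch below assumes a critic known in closed form, $Q(s, a) = -(a - 2s)^2$, and a linear policy $\mu_\theta(s) = \theta s$; both are hypothetical choices for illustration, as are the function names.

```python
def dq_da(state, action):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def dpg_step(theta, states, lr):
    """One ascent step on theta for the linear policy mu_theta(s) = theta * s."""
    grad = 0.0
    for s in states:
        a = theta * s        # a = mu_theta(s)
        dmu_dtheta = s       # d mu_theta(s) / d theta
        grad += dq_da(s, a) * dmu_dtheta
    grad /= len(states)
    return theta + lr * grad


theta = 0.0
states = [-1.0, -0.5, 0.5, 1.0]  # stand-in for a batch sampled from the replay buffer
for _ in range(200):
    theta = dpg_step(theta, states, lr=0.1)
print(theta)  # converges towards 2.0, the slope that maximizes Q for every state
```

In DDPG both derivatives are obtained by backpropagating through the critic and actor networks rather than written out analytically.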

#### Critic Update (TD Learning)

$$L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s')) - Q_\phi(s, a))^2\right]$$

#### Soft Target-Network Update (Polyak Averaging)

$$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$$
$$\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$$

where $\tau \ll 1$ (typical value: 0.005)
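A minimal sketch of the Polyak update on plain Python lists shows how slowly the target tracks the online weights. The helper `soft_update` and the weight values are assumptions for this example; a real agent would apply the same rule to every network parameter tensor.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


online = [1.0, -2.0]  # hypothetical online-network weights, held fixed here
target = [0.0, 0.0]   # target-network weights start somewhere else
for _ in range(1000):
    target = soft_update(target, online, tau=0.005)
print(target)
```

After $n$ updates against fixed online weights, the remaining gap shrinks by a factor of $(1 - \tau)^n$, so with $\tau = 0.005$ the target is still slightly short of the online weights after 1000 steps.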

In [None]:
# Visualize the DDPG architecture
def visualize_ddpg_architecture():
    """Visualize the DDPG network architecture"""
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.set_xlim(0, 14)
    ax.set_ylim(0, 10)
    ax.axis('off')
    
    # Actor network
    actor_box = plt.Rectangle((1, 6), 3, 3, fill=True, facecolor='#3498db', 
                               edgecolor='black', linewidth=2, alpha=0.8)
    ax.add_patch(actor_box)
    ax.text(2.5, 8.5, 'Actor μ_θ(s)', ha='center', va='center', 
           fontsize=12, fontweight='bold', color='white')
    ax.text(2.5, 7.5, 's → a', ha='center', va='center', fontsize=11, color='white')
    ax.text(2.5, 6.7, 'Deterministic policy', ha='center', va='center', fontsize=10, color='white')
    
    # Critic network
    critic_box = plt.Rectangle((6, 6), 3, 3, fill=True, facecolor='#e74c3c', 
                                edgecolor='black', linewidth=2, alpha=0.8)
    ax.add_patch(critic_box)
    ax.text(7.5, 8.5, 'Critic Q_φ(s,a)', ha='center', va='center', 
           fontsize=12, fontweight='bold', color='white')
    ax.text(7.5, 7.5, '(s,a) → Q', ha='center', va='center', fontsize=11, color='white')
    ax.text(7.5, 6.7, 'Value estimate', ha='center', va='center', fontsize=10, color='white')
    
    # Target networks
    target_actor = plt.Rectangle((1, 1), 3, 2, fill=True, facecolor='#3498db', 
                                  edgecolor='black', linewidth=2, alpha=0.4, linestyle='--')
    ax.add_patch(target_actor)
    ax.text(2.5, 2, "Actor' μ_θ'(s)", ha='center', va='center', fontsize=10)
    
    target_critic = plt.Rectangle((6, 1), 3, 2, fill=True, facecolor='#e74c3c', 
                                   edgecolor='black', linewidth=2, alpha=0.4, linestyle='--')
    ax.add_patch(target_critic)
    ax.text(7.5, 2, "Critic' Q_φ'(s,a)", ha='center', va='center', fontsize=10)
    
    # Experience replay buffer
    buffer_box = plt.Rectangle((11, 4), 2.5, 4, fill=True, facecolor='#2ecc71', 
                                edgecolor='black', linewidth=2, alpha=0.8)
    ax.add_patch(buffer_box)
    ax.text(12.25, 7.5, 'Replay', ha='center', va='center', fontsize=11, fontweight='bold', color='white')
    ax.text(12.25, 6.8, 'Buffer', ha='center', va='center', fontsize=11, fontweight='bold', color='white')
    ax.text(12.25, 5.8, '(s,a,r,s\',d)', ha='center', va='center', fontsize=9, color='white')
    ax.text(12.25, 5.0, 'Uniform sampling', ha='center', va='center', fontsize=9, color='white')
    
    # Arrows
    ax.annotate('', xy=(6, 7.5), xytext=(4, 7.5),
               arrowprops=dict(arrowstyle='->', color='black', lw=2))
    ax.text(5, 7.8, 'a', ha='center', fontsize=10)
    
    ax.annotate('', xy=(2.5, 6), xytext=(2.5, 3),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5, ls='--'))
    ax.text(3.2, 4.5, 'τ-soft update', ha='left', fontsize=9, color='gray')
    
    ax.annotate('', xy=(7.5, 6), xytext=(7.5, 3),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5, ls='--'))
    
    ax.annotate('', xy=(11, 6), xytext=(9, 7),
               arrowprops=dict(arrowstyle='->', color='green', lw=1.5))
    
    ax.set_title('DDPG Architecture', fontsize=16, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.show()

visualize_ddpg_architecture()

### 2.3 DDPG Pseudocode

```
Algorithm: DDPG
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Initialize:
    Actor network μ_θ, critic network Q_φ
    Target networks μ_θ' ← μ_θ, Q_φ' ← Q_φ
    Replay buffer D

For each episode:
    Observe initial state s
    
    For each step:
        1. Select action: a = μ_θ(s) + ε,  ε ~ N(0, σ²)  # add exploration noise
        2. Execute the action, observe (r, s', done)
        3. Store the transition: D ← D ∪ {(s, a, r, s', done)}
        4. Sample a batch B from D
        
        5. Compute the TD target:
           y = r + γ(1-done) · Q_φ'(s', μ_θ'(s'))
        
        6. Update the critic:
           φ ← φ - α_Q · ∇_φ (1/|B|) Σ(y - Q_φ(s,a))²
        
        7. Update the actor:
           θ ← θ + α_μ · ∇_θ (1/|B|) Σ Q_φ(s, μ_θ(s))
        
        8. Soft-update the target networks:
           θ' ← τθ + (1-τ)θ'
           φ' ← τφ + (1-τ)φ'
        
        s ← s'
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

In [None]:
# DDPG core code demo
import copy

class DDPGDemo:
    """Simplified demo of the core DDPG computations"""
    
    def __init__(self, state_dim, action_dim, max_action):
        self.max_action = max_action
        
        # Actor: s -> a
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh()  # output in [-1, 1]
        )
        
        # Critic: (s, a) -> Q
        self.critic = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )
        
        # Target networks start as copies of the online networks; a full agent
        # would soft-update them after every learning step (omitted in this demo)
        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target = copy.deepcopy(self.critic)
        
        print("DDPG network structure:")
        print(f"  Actor: {state_dim} -> [256] -> [256] -> {action_dim}")
        print(f"  Critic: {state_dim + action_dim} -> [256] -> [256] -> 1")
    
    def select_action(self, state, noise_std=0.1):
        """Deterministic policy plus exploration noise"""
        with torch.no_grad():
            action = self.actor(state) * self.max_action
        
        # Add Gaussian noise for exploration
        noise = torch.randn_like(action) * noise_std * self.max_action
        action = (action + noise).clamp(-self.max_action, self.max_action)
        
        return action
    
    def compute_critic_loss(self, states, actions, rewards, next_states, dones, gamma=0.99):
        """Critic TD loss"""
        with torch.no_grad():
            # Target action from the target actor
            next_actions = self.actor_target(next_states) * self.max_action
            # Target Q-value from the target critic
            target_q = rewards + gamma * (1 - dones) * self.critic_target(
                torch.cat([next_states, next_actions], dim=-1)
            )
        
        # Current Q-value
        current_q = self.critic(torch.cat([states, actions], dim=-1))
        
        # MSE loss
        loss = F.mse_loss(current_q, target_q)
        return loss
    
    def compute_actor_loss(self, states):
        """Actor policy-gradient loss"""
        # Maximizing Q(s, μ(s)) is the same as minimizing -Q
        actions = self.actor(states) * self.max_action
        q_values = self.critic(torch.cat([states, actions], dim=-1))
        loss = -q_values.mean()
        return loss

# Demo
demo = DDPGDemo(state_dim=4, action_dim=2, max_action=1.0)

# Synthetic inputs
batch_size = 32
states = torch.randn(batch_size, 4)
actions = torch.randn(batch_size, 2).clamp(-1, 1)
rewards = torch.randn(batch_size, 1)
next_states = torch.randn(batch_size, 4)
dones = torch.zeros(batch_size, 1)

# Compute the losses
critic_loss = demo.compute_critic_loss(states, actions, rewards, next_states, dones)
actor_loss = demo.compute_actor_loss(states)

print(f"\nCritic Loss: {critic_loss.item():.4f}")
print(f"Actor Loss: {actor_loss.item():.4f}")

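### 2.4 Problems and Limitations of DDPG

| Problem | Cause | Effect |
|---------|-------|--------|
| **Q-value overestimation** | The actor is trained to maximize a noisy critic, so it exploits the critic's positive errors, and bootstrapping propagates them | Suboptimal policy |
| **Unstable training** | Critic errors are passed on to the actor | Oscillating performance |
| **Hyperparameter sensitivity** | Noise and learning rates are coupled | Hard to tune |
| **Insufficient exploration** | Deterministic policy + simple noise | Local optima |

These problems motivated the development of TD3 and SAC.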

---

## 3. TD3: 双延迟深度确定性策略梯度

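### 3.1 Core Idea

**TD3 (Twin Delayed DDPG)** addresses DDPG's problems with three key techniques:

1. **Clipped Double Q-learning**: use two Q-networks and take the minimum to reduce overestimation
2. **Delayed Policy Updates**: update the actor less often so the critic can settle first
3. **Target Policy Smoothing**: add noise to the target action to regularize the Q-function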

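### 3.2 Mathematical Foundations

#### 1. Clipped Double Q-learning

Maintain two independent Q-networks $Q_{\phi_1}$ and $Q_{\phi_2}$:

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i'}(s', \tilde{a}')$$

**Why take the minimum?**
- Overestimation stems from $\mathbb{E}[\max] \geq \max \mathbb{E}$: maximizing over noisy estimates is biased upward
- Taking the minimum gives a pessimistic estimate that offsets the overestimation bias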
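The bias argument can be reproduced with a small Monte Carlo experiment. The setup below is a toy assumption, not a trained agent: every action has a true value of 0, and two independent "critics" observe it with Gaussian noise. The function `estimate_biases` and its parameters are hypothetical.

```python
import random


def estimate_biases(num_actions=10, num_trials=5000, noise_std=1.0, seed=0):
    """Compare a single-critic max with a clipped double-Q estimate.

    Every action has true value 0, so any nonzero average is pure bias.
    """
    rng = random.Random(seed)
    single_total = 0.0
    clipped_total = 0.0
    for _ in range(num_trials):
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        best = max(range(num_actions), key=lambda a: q1[a])  # greedy w.r.t. critic 1
        single_total += q1[best]
        clipped_total += min(q1[best], q2[best])
    return single_total / num_trials, clipped_total / num_trials


single_bias, clipped_bias = estimate_biases()
print(f"single critic: {single_bias:+.3f}, clipped double Q: {clipped_bias:+.3f}")
```

In this toy setting the single-critic estimate sits well above the true value of 0, while the clipped estimate lands close to it and may dip slightly below, which is the pessimism the second bullet refers to.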

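#### 2. Delayed Policy Updates

Update the actor only once for every $d$ critic updates:

```python
if update_step % policy_delay == 0:
    update_actor()
    soft_update_targets()
```

**Why delay?**
- Critic errors are passed on to the actor
- Delaying lets the critic converge to a more accurate estimate first
- Typical value: $d = 2$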

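#### 3. Target Policy Smoothing

$$\tilde{a}' = \mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

**Why add noise?**
- The Q-function can develop sharp peaks (overfitting to particular actions)
- The noise pushes the critic towards a smooth value function
- The effect is similar to regularization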
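A scalar sketch of the smoothing formula follows. It also clips the result to the action bounds, as the notebook's `TD3Demo` does; the function name, the bound `max_action=1.0`, and the sample value $\mu_{\theta'}(s') = 0.8$ are assumptions for illustration.

```python
import random


def smoothed_target_action(mu_next, sigma, noise_clip, max_action, rng):
    """clip(mu' + clip(eps, -c, c), -max_action, max_action) with eps ~ N(0, sigma^2)."""
    eps = rng.gauss(0.0, sigma)
    eps = max(-noise_clip, min(noise_clip, eps))            # clip the noise to [-c, c]
    return max(-max_action, min(max_action, mu_next + eps))  # stay inside the action space


rng = random.Random(0)
samples = [
    smoothed_target_action(0.8, sigma=0.2, noise_clip=0.5, max_action=1.0, rng=rng)
    for _ in range(1000)
]
print(min(samples), max(samples))
```

Every sample stays within $c$ of the target action and inside the valid action range, so the critic is evaluated over a small neighborhood rather than at a single point.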

In [None]:
# Visualize TD3's three improvements
def visualize_td3_improvements():
    """Visualize the three key improvements of TD3 over DDPG"""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # 1. Clipped Double Q
    ax1 = axes[0]
    x = np.linspace(0, 10, 100)
    q1 = 5 + np.sin(x) + np.random.randn(100) * 0.3
    q2 = 5 + np.sin(x) + np.random.randn(100) * 0.3 + 0.5
    q_min = np.minimum(q1, q2)
    
    ax1.plot(x, q1, 'b-', alpha=0.5, label='Q₁')
    ax1.plot(x, q2, 'r-', alpha=0.5, label='Q₂')
    ax1.plot(x, q_min, 'g-', linewidth=2, label='min(Q₁, Q₂)')
    ax1.axhline(y=5, color='k', linestyle='--', alpha=0.3, label='True Q')
    ax1.set_xlabel('State-Action')
    ax1.set_ylabel('Q Value')
    ax1.set_title('1. Clipped Double Q-learning\nReduces overestimation bias')
    ax1.legend(fontsize=8)
    
    # 2. Delayed Policy Updates
    ax2 = axes[1]
    steps = np.arange(0, 20)
    critic_updates = np.ones(20)
    actor_updates = np.zeros(20)
    actor_updates[::2] = 1  # one actor update every 2 steps
    
    width = 0.35
    ax2.bar(steps - width/2, critic_updates, width, label='Critic Updates', color='#e74c3c', alpha=0.7)
    ax2.bar(steps + width/2, actor_updates, width, label='Actor Updates', color='#3498db', alpha=0.7)
    ax2.set_xlabel('Training Step')
    ax2.set_ylabel('Update')
    ax2.set_title('2. Delayed Policy Updates\nActor updated once every d critic updates')
    ax2.legend()
    ax2.set_xlim(-1, 20)
    
    # 3. Target Policy Smoothing
    ax3 = axes[2]
    a_mean = 0.5
    a_samples = np.random.randn(500) * 0.2 + a_mean
    a_clipped = np.clip(a_samples, a_mean - 0.5, a_mean + 0.5)
    
    ax3.hist(a_clipped, bins=30, density=True, alpha=0.7, color='#2ecc71', edgecolor='black')
    ax3.axvline(x=a_mean, color='r', linestyle='--', linewidth=2, label=f'μ(s\') = {a_mean}')
    ax3.axvline(x=a_mean - 0.5, color='k', linestyle=':', label='Clip bounds')
    ax3.axvline(x=a_mean + 0.5, color='k', linestyle=':')
    ax3.set_xlabel('Target Action')
    ax3.set_ylabel('Density')
    ax3.set_title('3. Target Policy Smoothing\nã\' = μ\'(s\') + clip(ε, -c, c)')
    ax3.legend(fontsize=8)
    
    plt.tight_layout()
    plt.show()

visualize_td3_improvements()

### 3.3 TD3 Pseudocode

```
Algorithm: TD3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Initialize:
    Actor μ_θ, twin critics Q_φ₁, Q_φ₂
    Target networks θ' ← θ, φ'₁ ← φ₁, φ'₂ ← φ₂
    Replay buffer D

For each timestep t:
    # Explore
    a = μ_θ(s) + ε,  ε ~ N(0, σ²)
    Execute a, observe (r, s', done)
    D ← D ∪ {(s, a, r, s', done)}
    
    # Learn
    Sample a batch B ~ D
    
    # Target Policy Smoothing
    ã' = μ_θ'(s') + clip(ε, -c, c),  ε ~ N(0, σ_target²)
    
    # Clipped Double Q-learning
    y = r + γ(1-done) · min(Q_φ'₁(s', ã'), Q_φ'₂(s', ã'))
    
    # Update the critics
    φᵢ ← minimize (1/|B|) Σ(y - Q_φᵢ(s, a))²,  i=1,2
    
    # Delayed Policy Updates
    if t mod d == 0:
        # Update the actor (using Q₁ only)
        θ ← maximize (1/|B|) Σ Q_φ₁(s, μ_θ(s))
        
        # Soft-update the target networks
        θ' ← τθ + (1-τ)θ'
        φ'ᵢ ← τφᵢ + (1-τ)φ'ᵢ,  i=1,2
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

In [None]:
# TD3 core code demo
import copy

class TD3Demo:
    """Demo of the core TD3 computations"""
    
    def __init__(self, state_dim, action_dim, max_action):
        self.max_action = max_action
        self.policy_delay = 2
        self.policy_noise = 0.2
        self.noise_clip = 0.5
        self.update_count = 0
        
        # Actor
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh()
        )
        
        # Twin Critics
        self.critic1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1)
        )
        self.critic2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1)
        )
        
        # Target networks start as copies of the online networks; a full agent
        # would soft-update them on every delayed actor update (omitted in this demo)
        self.actor_target = copy.deepcopy(self.actor)
        self.critic1_target = copy.deepcopy(self.critic1)
        self.critic2_target = copy.deepcopy(self.critic2)
        
        print("TD3 network structure:")
        print(f"  Actor: {state_dim} -> [256] -> [256] -> {action_dim}")
        print(f"  Critic1 & Critic2: {state_dim + action_dim} -> [256] -> [256] -> 1")
        print(f"  Policy Delay: {self.policy_delay}")
    
    def compute_target_actions(self, next_states):
        """Noisy target actions (Target Policy Smoothing)"""
        with torch.no_grad():
            # Target action from the target actor
            next_actions = self.actor_target(next_states) * self.max_action
            
            # Add clipped noise
            noise = torch.randn_like(next_actions) * self.policy_noise
            noise = noise.clamp(-self.noise_clip, self.noise_clip)
            
            # Clip to the action space
            next_actions = (next_actions + noise).clamp(-self.max_action, self.max_action)
        
        return next_actions
    
    def compute_critic_loss(self, states, actions, rewards, next_states, dones, gamma=0.99):
        """Clipped Double Q-learning loss"""
        # Target actions (with noise)
        next_actions = self.compute_target_actions(next_states)
        
        with torch.no_grad():
            # Two target Q-values from the target critics
            target_q1 = self.critic1_target(torch.cat([next_states, next_actions], dim=-1))
            target_q2 = self.critic2_target(torch.cat([next_states, next_actions], dim=-1))
            
            # Take the minimum (Clipped Double Q)
            target_q = torch.min(target_q1, target_q2)
            target_q = rewards + gamma * (1 - dones) * target_q
        
        # Current Q-values
        current_q1 = self.critic1(torch.cat([states, actions], dim=-1))
        current_q2 = self.critic2(torch.cat([states, actions], dim=-1))
        
        # Loss for each critic
        loss1 = F.mse_loss(current_q1, target_q)
        loss2 = F.mse_loss(current_q2, target_q)
        
        return loss1 + loss2, current_q1.mean(), current_q2.mean()
    
    def should_update_actor(self):
        """Increment the update counter and report whether the actor is due (Delayed Update)"""
        self.update_count += 1
        return self.update_count % self.policy_delay == 0

# Demo
td3_demo = TD3Demo(state_dim=4, action_dim=2, max_action=1.0)

# Synthetic training batch
batch_size = 32
states = torch.randn(batch_size, 4)
actions = torch.randn(batch_size, 2).clamp(-1, 1)
rewards = torch.randn(batch_size, 1)
next_states = torch.randn(batch_size, 4)
dones = torch.zeros(batch_size, 1)

critic_loss, q1_mean, q2_mean = td3_demo.compute_critic_loss(
    states, actions, rewards, next_states, dones
)

print(f"\nCritic Loss: {critic_loss.item():.4f}")
print(f"Q1 Mean: {q1_mean.item():.4f}, Q2 Mean: {q2_mean.item():.4f}")
print(f"Should update actor: {td3_demo.should_update_actor()}")

---

## 4. SAC: 软演员-评论家

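### 4.1 Core Idea

**SAC (Soft Actor-Critic)** introduces the maximum entropy framework:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]$$

**Key innovations**:
- Maximize the policy entropy alongside the reward
- A stochastic policy provides built-in exploration
- The temperature parameter $\alpha$ is tuned automatically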

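### 4.2 Maximum Entropy Reinforcement Learning

#### Why maximize entropy?

1. **Exploration**: a high-entropy policy naturally tries more actions
2. **Robustness**: the policy does not over-commit to a single action sequence
3. **Multimodality**: the policy can learn several ways of solving the task

#### Soft Value Functions

**Soft V-function**:
$$V(s) = \mathbb{E}_{a \sim \pi}\left[Q(s, a) - \alpha \log \pi(a|s)\right]$$

**Soft Q-function**:
$$Q(s, a) = r + \gamma \mathbb{E}_{s'}\left[V(s')\right]$$

#### Automatic Temperature Tuning

$$J(\alpha) = \mathbb{E}_{a \sim \pi}\left[-\alpha(\log \pi(a|s) + \bar{\mathcal{H}})\right]$$

where $\bar{\mathcal{H}}$ is the target entropy (usually set to $-\dim(\mathcal{A})$).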
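To see which way the temperature objective pushes $\alpha$, the sketch below takes one gradient-descent step on $J(\alpha)$ for two hypothetical batches of log-probabilities. Parameterizing $\alpha = \exp(\texttt{log\_alpha})$ to keep it positive is an assumption of this example (a common choice, not something stated above), as are the function name and the numbers.

```python
import math


def temperature_step(log_alpha, log_probs, target_entropy, lr):
    """One descent step on J(alpha) = mean(-alpha * (log_pi + target_entropy)).

    alpha is parameterized as exp(log_alpha), so it stays positive.
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad_log_alpha = -alpha * mean_term  # dJ / d(log_alpha)
    return log_alpha - lr * grad_log_alpha


target_entropy = -2.0  # -dim(A) for a 2-D action space

# Entropy estimate is -mean(log_probs): -3.0 here, below the target of -2.0
low_entropy_batch = [3.0, 3.5, 2.5]
# Entropy estimate is -1.0 here, above the target of -2.0
high_entropy_batch = [1.0, 1.5, 0.5]

raised = temperature_step(0.0, low_entropy_batch, target_entropy, lr=0.1)
lowered = temperature_step(0.0, high_entropy_batch, target_entropy, lr=0.1)
print(raised, lowered)
```

When the policy's entropy is below the target, $\alpha$ grows and the entropy bonus gains weight; when the entropy is above the target, $\alpha$ shrinks.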

In [None]:
# Visualize the effect of maximum entropy
def visualize_entropy_effect():
    """Visualize how entropy regularization shapes the policy"""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    x = np.linspace(-3, 3, 100)
    
    # Low-entropy policy (nearly deterministic)
    ax1 = axes[0]
    low_entropy = np.exp(-(x - 0.5)**2 / 0.1)
    low_entropy /= low_entropy.sum() * (x[1] - x[0])
    ax1.fill_between(x, low_entropy, alpha=0.7, color='#e74c3c')
    ax1.set_xlabel('Action')
    ax1.set_ylabel('π(a|s)')
    ax1.set_title('Low-entropy policy (α → 0)\nNearly deterministic, concentrated on one action')
    ax1.set_ylim(0, 3)
    
    # Medium-entropy policy
    ax2 = axes[1]
    mid_entropy = np.exp(-(x - 0.5)**2 / 1.0)
    mid_entropy /= mid_entropy.sum() * (x[1] - x[0])
    ax2.fill_between(x, mid_entropy, alpha=0.7, color='#2ecc71')
    ax2.set_xlabel('Action')
    ax2.set_ylabel('π(a|s)')
    ax2.set_title('Medium-entropy policy (suitable α)\nBalances exploration and exploitation')
    ax2.set_ylim(0, 0.8)
    
    # High-entropy policy (uniform)
    ax3 = axes[2]
    high_entropy = np.ones_like(x) / 6  # uniform density over [-3, 3]
    ax3.fill_between(x, high_entropy, alpha=0.7, color='#3498db')
    ax3.set_xlabel('Action')
    ax3.set_ylabel('π(a|s)')
    ax3.set_title('High-entropy policy (α → ∞)\nRandom, close to uniform')
    ax3.set_ylim(0, 0.3)
    
    plt.tight_layout()
    plt.show()

visualize_entropy_effect()

### 4.3 SAC's Gaussian Policy

SAC uses a **learnable Gaussian policy**:

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

For bounded action spaces, it applies **tanh squashing**:

$$u \sim \mathcal{N}(\mu, \sigma^2), \quad a = \tanh(u)$$

**Log-probability (accounting for the change of variables)**:

$$\log \pi(a|s) = \log \mathcal{N}(u|\mu, \sigma^2) - \sum_i \log(1 - \tanh^2(u_i))$$
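The correction term $\log(1 - \tanh^2(u))$ is usually evaluated in the algebraically equivalent form $2(\log 2 - u - \text{softplus}(-2u))$, which is the form the policy demo in this notebook uses. The sketch below compares the two numerically; the helper names are introduced for this example only.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_det_naive(u):
    """log(1 - tanh(u)^2); loses precision once tanh(u) rounds to +/-1."""
    return math.log(1.0 - math.tanh(u) ** 2)


def log_det_stable(u):
    """Equivalent form 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


for u in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(u, log_det_naive(u), log_det_stable(u))

# For large |u| the naive form would evaluate log(0); the stable form stays finite
print(log_det_stable(25.0))
```

The identity follows from $1 - \tanh^2(u) = 4 / (e^{u} + e^{-u})^2$, and the stable form avoids subtracting two nearly equal numbers when $|u|$ is large.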

In [None]:
# SAC Gaussian policy demo
class SACPolicyDemo(nn.Module):
    """Reparameterizable Gaussian policy for SAC"""
    
    LOG_STD_MIN = -20
    LOG_STD_MAX = 2
    
    def __init__(self, state_dim, action_dim):
        super().__init__()
        
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU()
        )
        
        # Mean and log standard deviation heads
        self.mean_layer = nn.Linear(256, action_dim)
        self.log_std_layer = nn.Linear(256, action_dim)
    
    def forward(self, state):
        """Return the Gaussian parameters (mean, std)"""
        features = self.shared(state)
        mean = self.mean_layer(features)
        log_std = self.log_std_layer(features)
        log_std = torch.clamp(log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
        return mean, log_std.exp()
    
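    def sample(self, state):
        """Sample an action using the reparameterization trick"""
        mean, std = self.forward(state)
        
        # Reparameterization: u = mean + std * epsilon
        normal = torch.distributions.Normal(mean, std)
        u = normal.rsample()  # differentiable sample
        
        # Squash into (-1, 1) with tanh
        action = torch.tanh(u)
        
        # Log-probability with the change-of-variables (Jacobian) correction
        log_prob = normal.log_prob(u).sum(dim=-1)
        # d(tanh(u))/du = 1 - tanh²(u); the line below subtracts log(1 - tanh²(u))
        # in the numerically stable form 2 * (log 2 - u - softplus(-2u))
        log_prob -= (2 * (np.log(2) - u - F.softplus(-2 * u))).sum(dim=-1)
        
        return action, log_prob, mean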

# Demo
policy = SACPolicyDemo(state_dim=4, action_dim=2)
state = torch.randn(5, 4)  # 5 samples

action, log_prob, mean = policy.sample(state)

print("SAC Gaussian policy sampling:")
print(f"  State shape: {state.shape}")
print(f"  Action shape: {action.shape}")
print(f"  Log prob shape: {log_prob.shape}")
print(f"\n  Sample actions:\n{action}")
print(f"\n  Log probabilities: {log_prob}")

### 4.4 SAC Pseudocode

```
Algorithm: SAC (with automatic temperature tuning)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Initialize:
    Actor π_θ (Gaussian policy)
    Twin critics Q_φ₁, Q_φ₂
    Target critics Q_φ'₁, Q_φ'₂
    Temperature α (learnable)
    Target entropy H̄ = -dim(A)

For each timestep:
    # Sample an action from the policy
    a ~ π_θ(·|s)
    Execute a, observe (r, s', done)
    D ← D ∪ {(s, a, r, s', done)}
    
    # Sample a batch
    B ~ D
    
    # Compute the target (Soft Bellman)
    a' ~ π_θ(·|s')
    y = r + γ(1-done) · (min(Q_φ'₁(s',a'), Q_φ'₂(s',a')) - α log π_θ(a'|s'))
    
    # Update the critics
    φᵢ ← minimize (1/|B|) Σ(y - Q_φᵢ(s,a))²,  i=1,2
    
    # Update the actor
    a_new ~ π_θ(·|s)
    θ ← maximize (1/|B|) Σ[min(Q_φ₁(s,a_new), Q_φ₂(s,a_new)) - α log π_θ(a_new|s)]
    
    # Update the temperature (automatic tuning)
    α ← minimize (1/|B|) Σ[-α(log π_θ(a_new|s) + H̄)]
    
    # Soft-update the target critics
    φ'ᵢ ← τφᵢ + (1-τ)φ'ᵢ,  i=1,2
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

---

## 5. 算法对比与实验

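### 5.1 Algorithm Properties Compared

In [None]:
# Build the algorithm comparison table
import pandas as pd

comparison_data = {
    'Property': ['Policy type', 'Action space', 'On/Off Policy', 'Sample efficiency', 
                 'Training stability', 'Exploration mechanism', 'Hyperparameter sensitivity', 'Core contribution'],
    'DDPG': ['Deterministic', 'Continuous', 'Off-policy', 'High', 
             'Low', 'Additive noise', 'High', 'DPG + neural networks'],
    'TD3': ['Deterministic', 'Continuous', 'Off-policy', 'High', 
            'Medium', 'Additive noise', 'Medium', 'Twin Q + Delayed + Smoothing'],
    'SAC': ['Stochastic', 'Continuous', 'Off-policy', 'High', 
            'High', 'Entropy regularization', 'Low', 'Maximum entropy + automatic temperature']
}

df = pd.DataFrame(comparison_data)
df.set_index('Property', inplace=True)

# Display the table
display(HTML(df.to_html()))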

In [None]:
# Simulated training-curve comparison
def simulate_training_curves():
    """Plot synthetic training curves for the three algorithms (illustration only)"""
    np.random.seed(42)
    timesteps = np.arange(0, 100000, 1000)
    
    # DDPG: synthetic curve with large oscillations
    ddpg_base = -200 + 180 * (1 - np.exp(-timesteps / 30000))
    ddpg_noise = np.random.randn(len(timesteps)) * 30
    ddpg_returns = ddpg_base + ddpg_noise
    ddpg_returns[50:60] -= 80  # inject a simulated collapse
    
    # TD3: synthetic curve, smoother but with a slower time constant
    td3_base = -200 + 190 * (1 - np.exp(-timesteps / 35000))
    td3_noise = np.random.randn(len(timesteps)) * 15
    td3_returns = td3_base + td3_noise
    
    # SAC: synthetic curve, smoothest and with the fastest time constant
    sac_base = -200 + 200 * (1 - np.exp(-timesteps / 25000))
    sac_noise = np.random.randn(len(timesteps)) * 10
    sac_returns = sac_base + sac_noise
    
    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))
    
    ax.plot(timesteps, ddpg_returns, 'b-', alpha=0.3)
    ax.plot(timesteps, pd.Series(ddpg_returns).rolling(10).mean(), 'b-', 
           linewidth=2, label='DDPG')
    
    ax.plot(timesteps, td3_returns, 'orange', alpha=0.3)
    ax.plot(timesteps, pd.Series(td3_returns).rolling(10).mean(), 'orange', 
           linewidth=2, label='TD3')
    
    ax.plot(timesteps, sac_returns, 'g-', alpha=0.3)
    ax.plot(timesteps, pd.Series(sac_returns).rolling(10).mean(), 'g-', 
           linewidth=2, label='SAC')
    
    ax.set_xlabel('Timesteps', fontsize=12)
    ax.set_ylabel('Episode Return', fontsize=12)
    ax.set_title('Algorithm training curves (synthetic illustration, not measured results)', fontsize=14)
    ax.grid(True, alpha=0.3)
    ax.axhline(y=-200, color='k', linestyle='--', alpha=0.3, label='Random')
    # Build the legend only after every labeled artist has been added
    ax.legend(fontsize=11)
    
    # Annotation
    ax.annotate('Simulated DDPG collapse', xy=(55000, -100), xytext=(60000, -50),
               arrowprops=dict(arrowstyle='->', color='blue'),
               fontsize=10, color='blue')
    
    plt.tight_layout()
    plt.show()

simulate_training_curves()

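### 5.2 Live Training Demo

In [None]:
# Live training demo (only if gymnasium is available)
if HAS_GYM:
    print("Starting the live training demo...")
    print("Note: full training takes a long time; only a few episodes are run here")
    
    # Create the environment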
    env = gym.make('Pendulum-v1')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])
    
    print(f"\nEnvironment: Pendulum-v1")
    print(f"State dimension: {state_dim}")
    print(f"Action dimension: {action_dim}")
    print(f"Action range: [-{max_action}, {max_action}]")
    
    # Create the SAC agent
    config = SACConfig(
        state_dim=state_dim,
        action_dim=action_dim,
        max_action=max_action,
        batch_size=64,
    )
    agent = SACAgent(config)
    
    # Quick training demo
    num_episodes = 5
    episode_returns = []
    
    for ep in range(num_episodes):
        state, _ = env.reset(seed=42 + ep)
        episode_return = 0
        done = False
        steps = 0
        
        while not done and steps < 200:
            action = agent.select_action(state, training=True)
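            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Store `terminated` rather than `done`: a time-limit truncation is not a true
            # terminal state, so the critic should still bootstrap from next_state
            agent.store_transition(state, action, reward, next_state, terminated)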
            agent.update()
            
            state = next_state
            episode_return += reward
            steps += 1
        
        episode_returns.append(episode_return)
        print(f"Episode {ep + 1}: Return = {episode_return:.2f}, Steps = {steps}")
    
    env.close()
    print(f"\nMean return: {np.mean(episode_returns):.2f}")
else:
    print("gymnasium is not installed; skipping the live training demo")
    print("Install it with: pip install gymnasium")

---

## 6. 最佳实践与调参指南

### 6.1 Recommended Hyperparameters

In [None]:
# Recommended hyperparameter table
hyperparams = {
    'Hyperparameter': ['Learning rate (Actor)', 'Learning rate (Critic)', 'Batch Size', 
                       'Buffer Size', 'γ (Discount)', 'τ (Soft Update)',
                       'Exploration noise', 'Hidden Dims'],
    'DDPG': ['1e-4', '1e-3', '64-256', '1M', '0.99', '0.005',
             '0.1 (Gaussian)', '[256, 256]'],
    'TD3': ['3e-4', '3e-4', '256', '1M', '0.99', '0.005',
            '0.1 (Gaussian)', '[256, 256]'],
    'SAC': ['3e-4', '3e-4', '256', '1M', '0.99', '0.005',
            'Automatic (entropy)', '[256, 256]']
}

df_hyper = pd.DataFrame(hyperparams)
df_hyper.set_index('Hyperparameter', inplace=True)
display(HTML(df_hyper.to_html()))

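### 6.2 Tuning Advice

#### When training is unstable
1. **Lower the learning rate**: especially the actor's
2. **Decrease τ**: a smaller τ makes the target networks update more slowly
3. **Increase the batch size**: reduces gradient variance
4. **Clip gradients**: prevents exploding gradients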
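As an illustration of the last item, the sketch below rescales a flat list of gradient values so that their global L2 norm does not exceed a threshold. The function `clip_by_global_norm` is written for this example; in PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Rescaling keeps the direction of the update and only limits its length, which is why it is usually preferred over clipping each component separately.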

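#### When convergence is too slow
1. **Raise the learning rate**: but keep an eye on stability
2. **Increase network capacity**: wider or deeper networks
3. **Check the reward design**: is it sparse? Is its scale appropriate?

#### When exploration is insufficient
1. **DDPG/TD3**: increase the exploration noise σ
2. **SAC**: check whether α is too small, or raise `target_entropy`
3. **Use more initial random steps**: collect more diverse experience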

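### 6.3 Algorithm Selection Guide

```
Which algorithm should I choose?
│
├─ Do you need a deterministic policy?
│  ├─ Yes → TD3 (more stable than DDPG)
│  └─ No → SAC (usually the first choice)
│
├─ Does sample efficiency matter?
│  ├─ Yes → SAC or TD3 (off-policy)
│  └─ No → consider PPO (simpler)
│
└─ Do you need a multimodal policy?
   ├─ Yes → SAC (stochastic policy)
   └─ No → TD3 also works
```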

---

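## Summary

This notebook covered three popular deep reinforcement learning algorithms in detail:

| Algorithm | Core innovation | Best suited for |
|-----------|-----------------|-----------------|
| **DDPG** | DPG + neural networks | Historical significance; since superseded |
| **TD3** | Twin Q + Delayed + Smoothing | When a deterministic policy is required |
| **SAC** | Maximum entropy + automatic temperature | General-purpose first choice, most stable |

### Key Takeaways

1. **Understand the problem**: all three algorithms address policy learning in continuous action spaces
2. **Stability matters**: stability improves step by step from DDPG to TD3 to SAC
3. **Exploration is key**: SAC's entropy regularization provides a principled exploration mechanism
4. **Practice counts**: beyond the theory, extensive experimentation and tuning are necessary

### Further Study

- **Papers**: read the original papers for more detail
- **Code**: run `popular_rl_algorithms.py` to experiment
- **Extensions**: explore related algorithms such as PPO, D4PG, and MPO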