# Part 5: Temporal Difference Learning

In this notebook, we'll learn **Temporal Difference (TD)** methods, a central class of model-free RL algorithms that combine ideas from Monte Carlo and Dynamic Programming.

## What You'll Learn
- TD(0) prediction
- The bias-variance tradeoff (TD vs MC)
- SARSA (On-policy TD control)
- Q-Learning (Off-policy TD control)
- Comparison of SARSA and Q-Learning

## Prerequisites
- Understanding of MDPs and Bellman equations (Notebooks 01-02)
- Monte Carlo methods (Notebook 04)

Let's begin!

## Setup

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import time

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
np.random.seed(42)  # Seeds NumPy only; the environment's own RNG is seeded via env.reset(seed=...)

print("Setup complete!")

In [None]:
# Create environment and helper variables
env = gym.make("FrozenLake-v1", is_slippery=True)

n_states = env.observation_space.n
n_actions = env.action_space.n
action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']
action_arrows = ['←', '↓', '→', '↑']

print("FrozenLake Environment")
print("=" * 40)
print(f"States: {n_states}")
print(f"Actions: {n_actions}")

In [None]:
# Visualization helper functions
def plot_value_function(V, title="Value Function", ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    V_grid = V.reshape(nrow, ncol)
    
    im = ax.imshow(V_grid, cmap='RdYlGn', vmin=0, vmax=max(V.max(), 0.01))
    plt.colorbar(im, ax=ax, shrink=0.8)
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            color = 'white' if V_grid[i, j] < V.max() / 2 else 'black'
            ax.text(j, i, f'{cell}\n{V[state]:.3f}', ha='center', va='center',
                   fontsize=9, color=color)
    
    ax.set_xticks(range(ncol))
    ax.set_yticks(range(nrow))
    ax.set_title(title)
    return ax

def plot_policy(Q, title="Policy", ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    colors = {'S': 'lightblue', 'F': 'white', 'H': 'lightcoral', 'G': 'lightgreen'}
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            
            rect = plt.Rectangle((j, nrow-1-i), 1, 1, fill=True,
                                 facecolor=colors.get(cell, 'white'), edgecolor='black')
            ax.add_patch(rect)
            
            best_action = np.argmax(Q[state])
            
            if cell not in ['H', 'G']:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, 
                       f'{cell}\n{action_arrows[best_action]}',
                       ha='center', va='center', fontsize=14, fontweight='bold')
            else:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, cell,
                       ha='center', va='center', fontsize=14, fontweight='bold')
    
    ax.set_xlim(0, ncol)
    ax.set_ylim(0, nrow)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title)
    return ax

print("Visualization functions ready!")

---
# 1. What is Temporal Difference Learning?

**Temporal Difference (TD)** learning combines ideas from:
- **Monte Carlo**: Learn from experience (model-free)
- **Dynamic Programming**: Bootstrap (update estimates based on other estimates)

## Key Insight: Bootstrapping

**Monte Carlo** waits until the end of the episode to update:
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$$

where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$ (full return)

**TD** updates immediately using an estimated return:
$$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$$

The term $R_{t+1} + \gamma V(S_{t+1})$ is called the **TD target**.

The difference $(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$ is the **TD error** ($\delta$).
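
To make the two update rules concrete, here is a small worked example in plain Python. All numbers (the three states, their current estimates, the rewards, and the choices $\gamma = 0.5$, $\alpha = 0.25$) are made up for illustration and were picked so the arithmetic is exact.

```python
# One-step TD(0) update vs. a Monte Carlo update on made-up numbers.
gamma = 0.5   # discount factor (chosen so the arithmetic is exact)
alpha = 0.25  # learning rate

# Hypothetical current estimates for three states
V = {"A": 0.0, "B": 2.0, "C": 0.0}

# TD(0): needs only the first transition A -> B with reward 1.0
reward, next_state = 1.0, "B"
td_target = reward + gamma * V[next_state]    # 1.0 + 0.5 * 2.0 = 2.0
td_error = td_target - V["A"]                 # 2.0 - 0.0 = 2.0
v_a_after_td = V["A"] + alpha * td_error      # 0.0 + 0.25 * 2.0 = 0.5

# Monte Carlo: needs every reward until the episode ends
rewards_until_end = [1.0, 0.0, 8.0]           # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
G = sum(gamma**k * r for k, r in enumerate(rewards_until_end))  # 1 + 0 + 0.25 * 8 = 3.0
v_a_after_mc = V["A"] + alpha * (G - V["A"])  # 0.0 + 0.25 * 3.0 = 0.75

print(f"TD target = {td_target}, TD error = {td_error}, V(A) after TD = {v_a_after_td}")
print(f"MC return = {G}, V(A) after MC = {v_a_after_mc}")
```

The TD update is available as soon as the first transition is observed, while the MC update has to wait for the whole reward sequence. The two targets differ because TD trusts the current estimate $V(B)$ instead of the rewards that actually followed.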

## Advantages of TD

1. **Online learning**: Update after every step, not just at episode end
2. **Works for continuing tasks**: Don't need episodes to terminate
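3. **Lower variance**: The target depends on a single random reward and transition rather than on every reward until the end of the episode
4. **Often faster convergence**: In practice TD usually learns faster than constant-α MC on stochastic tasks, although this is an empirical observation, not a general guarantee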

## Disadvantages

1. **Biased**: Bootstrapping introduces bias from current estimates
2. **Depends on initialization**: Bad initial values can slow learning
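
The variance claim can be checked on a toy problem. The sketch below is not part of the FrozenLake experiments: it assumes a five-state symmetric random walk (states 1 to 5, start in state 3, reward 1 only when exiting on the right, $\gamma = 1$), for which the true values are $V(s) = s/6$. It compares the spread of MC returns from the start state with the spread of one-step TD targets. Note that the TD targets here use the true values, so they are unbiased in this sketch; the bias listed above appears when $V$ is still an inaccurate estimate.

```python
import random
import statistics

rng = random.Random(0)
N_STATES = 5    # non-terminal states 1..5; states 0 and 6 are terminal
START = 3
# Known true values for this walk (terminal states have value 0)
TRUE_V = [0.0] + [s / 6 for s in range(1, N_STATES + 1)] + [0.0]

def run_episode():
    """Return (mc_return, td_target) as seen from the start state, gamma = 1."""
    state = START
    first_td_target = None
    total_reward = 0.0
    while 0 < state < N_STATES + 1:
        next_state = state + rng.choice([-1, 1])
        reward = 1.0 if next_state == N_STATES + 1 else 0.0
        if first_td_target is None:
            # One-step target: immediate reward plus value of the next state
            first_td_target = reward + TRUE_V[next_state]
        total_reward += reward
        state = next_state
    return total_reward, first_td_target

samples = [run_episode() for _ in range(2000)]
mc_returns = [g for g, _ in samples]
td_targets = [t for _, t in samples]

print(f"MC return variance: {statistics.pvariance(mc_returns):.4f}")
print(f"TD target variance: {statistics.pvariance(td_targets):.4f}")
```

Every MC return is either 0 or 1, while every TD target is either 1/3 or 2/3, so the TD targets are spread much more tightly around the same mean of 0.5.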

---
# 2. TD(0) Prediction

The simplest TD method: update after each step using the immediate reward and the current estimate of the next state's value.

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

Where:
- $\alpha$ is the learning rate
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target
- $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error $\delta_t$

In [None]:
def td0_prediction(env, policy, gamma, alpha, n_episodes):
    """
    TD(0) Prediction for estimating V^π.
    
    Args:
        env: Gymnasium environment
        policy: Policy to evaluate (π[s,a] probabilities)
        gamma: Discount factor
        alpha: Learning rate
        n_episodes: Number of episodes
    
    Returns:
        V: Estimated state value function
        V_history: V at intervals for visualization
        td_errors: TD errors during training
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    V = np.zeros(n_states)
    V_history = [V.copy()]
    td_errors = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        done = False
        
        while not done:
            # Select action according to policy
            action = np.random.choice(n_actions, p=policy[state])
            
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
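            # TD update
            # V(s) <- V(s) + α * [r + γV(s') - V(s)]
            # Bootstrap from V(s') unless s' is truly terminal. A time-limit
            # truncation ends the episode but s' still has value, so mask on
            # `terminated`, not `done`.
            td_target = reward + gamma * V[next_state] * (not terminated)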
            td_error = td_target - V[state]
            V[state] += alpha * td_error
            
            td_errors.append(td_error)
            state = next_state
        
        # Save history at intervals (max(1, ...) avoids a modulo-by-zero
        # when n_episodes < 10)
        if (episode + 1) % max(1, n_episodes // 10) == 0:
            V_history.append(V.copy())
    
    return V, V_history, td_errors

In [None]:
# Run TD(0) prediction for random policy
uniform_policy = np.ones((n_states, n_actions)) / n_actions

print("TD(0) Prediction")
print("=" * 50)

V_td, V_history_td, td_errors = td0_prediction(
    env, uniform_policy, gamma=0.99, alpha=0.1, n_episodes=50000
)

print(f"\nEstimated V^π (random policy):")
print(V_td.reshape(4, 4).round(4))

In [None]:
# Compare TD(0) with true values (from DP)
def extract_mdp(env):
    n_s = env.observation_space.n
    n_a = env.action_space.n
    P = np.zeros((n_s, n_a, n_s))
    R = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                P[s, a, next_s] += prob
                R[s, a] += prob * reward
    return P, R

def policy_evaluation_dp(P, R, policy, gamma, theta=1e-8):
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += policy[s, a] * (R[s, a] + gamma * np.sum(P[s, a] * V))
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V

P, R = extract_mdp(env)
V_true = policy_evaluation_dp(P, R, uniform_policy, gamma=0.99)

print("Comparison: TD(0) vs True Values (DP)")
print("=" * 60)
print(f"Mean Absolute Error: {np.mean(np.abs(V_td - V_true)):.4f}")
print(f"Max Absolute Error: {np.max(np.abs(V_td - V_true)):.4f}")

In [None]:
# Visualize TD(0) convergence
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

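# Plot V at different stages. td0_prediction stores a snapshot every
# n_episodes // 10 episodes (index 0 is the initial V), so convert each
# episode count to its snapshot index instead of using the loop counter.
n_episodes_td = 50000  # must match the n_episodes passed to td0_prediction above
snapshot_every = n_episodes_td // 10
episodes_at = [0, 5000, 10000, 25000, 50000]
for ax, ep in zip(axes.flat[:-1], episodes_at):
    hist_idx = min(ep // snapshot_every, len(V_history_td) - 1)
    plot_value_function(V_history_td[hist_idx], title=f"After {ep} episodes", ax=ax)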

# TD error over time
ax = axes.flat[-1]
window = 1000
td_errors_smooth = np.convolve(np.abs(td_errors), np.ones(window)/window, mode='valid')
ax.plot(td_errors_smooth)
ax.set_xlabel('Step')
ax.set_ylabel('|TD Error| (moving avg)')
ax.set_title('TD Error Over Time')

plt.suptitle("TD(0) Prediction Convergence", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 3. SARSA: On-Policy TD Control

**SARSA** (State-Action-Reward-State-Action) is an on-policy TD control algorithm.

## The SARSA Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

The name comes from the quintuple: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$

## On-Policy

SARSA is **on-policy**: it learns about the policy it's following.
- Uses $A_{t+1}$ which is selected by the same policy
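- The Q-values converge to $Q^\pi$ for the behavior policy $\pi$; if ε is decayed toward zero while every state-action pair keeps being visited (and under the usual step-size conditions), that policy becomes greedy and the limit is $Q^*$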
- Typically uses ε-greedy policy
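
Since SARSA evaluates the ε-greedy policy itself, it helps to see what that policy looks like as a probability distribution. The helper below is a hypothetical illustration (the name `epsilon_greedy_probs` and the Q-values are made up). It assumes the common scheme in which the exploratory branch samples uniformly from all actions, including the greedy one.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an ε-greedy policy for one state (hypothetical helper)."""
    n_actions = len(q_values)
    # Index of the first maximal Q-value (same tie-breaking as np.argmax)
    greedy_action = max(range(n_actions), key=lambda a: q_values[a])
    # Every action gets ε / n from the exploratory branch...
    probs = [epsilon / n_actions] * n_actions
    # ...and the greedy action also gets the remaining 1 - ε
    probs[greedy_action] += 1.0 - epsilon
    return probs

q_row = [0.1, 0.4, 0.2, 0.0]    # made-up Q-values for LEFT, DOWN, RIGHT, UP
probs = epsilon_greedy_probs(q_row, epsilon=0.1)
print(probs)
```

With four actions and ε = 0.1, the greedy action is chosen with probability 1 - 0.1 + 0.1/4 = 0.925 and each other action with probability 0.025. SARSA's targets are sampled from exactly this distribution, which is why its Q-values reflect the cost of occasional exploratory moves.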

In [None]:
def sarsa(env, gamma, alpha, n_episodes, epsilon=0.1, 
          epsilon_decay=1.0, min_epsilon=0.01):
    """
    SARSA: On-policy TD Control.
    
    Args:
        env: Gymnasium environment
        gamma: Discount factor
        alpha: Learning rate
        n_episodes: Number of episodes
        epsilon: Exploration rate
        epsilon_decay: Decay rate for epsilon
        min_epsilon: Minimum epsilon
    
    Returns:
        Q: Learned Q-values
        stats: Training statistics
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    
    episode_rewards = []
    episode_lengths = []
    epsilons = []
    
    def epsilon_greedy_action(state, eps):
        if np.random.random() < eps:
            return np.random.randint(n_actions)
        return np.argmax(Q[state])
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        action = epsilon_greedy_action(state, epsilon)
        
        total_reward = 0
        steps = 0
        done = False
        
        while not done:
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Choose next action (for SARSA update)
            next_action = epsilon_greedy_action(next_state, epsilon)
            
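            # SARSA update
            # Q(s,a) <- Q(s,a) + α * [r + γ*Q(s',a') - Q(s,a)]
            # Mask the bootstrap term only when s' is truly terminal; after a
            # time-limit truncation Q(s',a') is still a valid estimate.
            td_target = reward + gamma * Q[next_state, next_action] * (not terminated)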
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error
            
            state = next_state
            action = next_action
            total_reward += reward
            steps += 1
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
        epsilons.append(epsilon)
        
        # Decay epsilon
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    stats = {
        'episode_rewards': episode_rewards,
        'episode_lengths': episode_lengths,
        'epsilons': epsilons
    }
    
    return Q, stats

In [None]:
# Run SARSA
print("SARSA Training")
print("=" * 50)

start_time = time.time()
Q_sarsa, stats_sarsa = sarsa(
    env, gamma=0.99, alpha=0.1, n_episodes=100000,
    epsilon=1.0, epsilon_decay=0.99995, min_epsilon=0.01
)
sarsa_time = time.time() - start_time

print(f"Training time: {sarsa_time:.2f} seconds")
print(f"Final epsilon: {stats_sarsa['epsilons'][-1]:.4f}")

In [None]:
# Plot SARSA training progress
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Learning curve
window = 1000
rewards_smooth = np.convolve(stats_sarsa['episode_rewards'], 
                              np.ones(window)/window, mode='valid')
axes[0, 0].plot(rewards_smooth)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Reward (moving avg)')
axes[0, 0].set_title(f'SARSA Learning Curve (window={window})')

# Epsilon decay
axes[0, 1].plot(stats_sarsa['epsilons'])
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Epsilon')
axes[0, 1].set_title('Exploration Rate Decay')

# Value function
V_sarsa = np.max(Q_sarsa, axis=1)
plot_value_function(V_sarsa, title="Learned V = max Q(s,a)", ax=axes[1, 0])

# Policy
plot_policy(Q_sarsa, title="Learned Policy", ax=axes[1, 1])

plt.suptitle("SARSA Results (100,000 episodes)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 4. Q-Learning: Off-Policy TD Control

**Q-Learning** is an off-policy TD control algorithm and one of the best-known algorithms in RL.

## The Q-Learning Update

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

## Off-Policy

Q-Learning is **off-policy**: it learns about the optimal policy while following a different (exploratory) policy.

Key difference from SARSA:
- **SARSA**: Uses $Q(S_{t+1}, A_{t+1})$ where $A_{t+1}$ comes from the behavior policy
- **Q-Learning**: Uses $\max_a Q(S_{t+1}, a)$ - the value of the best action

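This means Q-Learning estimates $Q^*$ directly, independent of the exploratory policy being followed, provided every state-action pair keeps being visited and the step size is decayed appropriately.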
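
The difference between the two targets is easiest to see on a single transition. In this sketch all numbers are made up: the Q-values of the next state, the reward, and the assumption that ε-greedy exploration happened to pick a non-greedy next action.

```python
gamma = 0.5
alpha = 0.5

# Made-up Q-values for the next state S' (LEFT, DOWN, RIGHT, UP)
q_next = [0.0, 1.0, 0.5, 0.25]
q_current = 0.0      # Q(S, A) before the update
reward = 0.0

# Suppose ε-greedy exploration picked UP (index 3) as the next action A'
next_action = 3

sarsa_target = reward + gamma * q_next[next_action]   # uses the action actually taken
qlearn_target = reward + gamma * max(q_next)          # uses the best action in S'

q_after_sarsa = q_current + alpha * (sarsa_target - q_current)
q_after_qlearn = q_current + alpha * (qlearn_target - q_current)

print(f"SARSA:      target = {sarsa_target}, Q(S, A) = {q_after_sarsa}")
print(f"Q-Learning: target = {qlearn_target}, Q(S, A) = {q_after_qlearn}")
```

Here the SARSA target is 0.125 and the Q-Learning target is 0.5. When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.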

In [None]:
def q_learning(env, gamma, alpha, n_episodes, epsilon=0.1,
               epsilon_decay=1.0, min_epsilon=0.01):
    """
    Q-Learning: Off-policy TD Control.
    
    Args:
        env: Gymnasium environment
        gamma: Discount factor
        alpha: Learning rate
        n_episodes: Number of episodes
        epsilon: Exploration rate
        epsilon_decay: Decay rate for epsilon
        min_epsilon: Minimum epsilon
    
    Returns:
        Q: Learned Q-values
        stats: Training statistics
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    
    episode_rewards = []
    episode_lengths = []
    epsilons = []
    
    def epsilon_greedy_action(state, eps):
        if np.random.random() < eps:
            return np.random.randint(n_actions)
        return np.argmax(Q[state])
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        
        total_reward = 0
        steps = 0
        done = False
        
        while not done:
            # Choose action using ε-greedy
            action = epsilon_greedy_action(state, epsilon)
            
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
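            # Q-Learning update
            # Q(s,a) <- Q(s,a) + α * [r + γ*max_a' Q(s',a') - Q(s,a)]
            # Mask the bootstrap term only when s' is truly terminal; after a
            # time-limit truncation max_a' Q(s',a') is still a valid estimate.
            td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)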
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error
            
            state = next_state
            total_reward += reward
            steps += 1
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
        epsilons.append(epsilon)
        
        # Decay epsilon
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    stats = {
        'episode_rewards': episode_rewards,
        'episode_lengths': episode_lengths,
        'epsilons': epsilons
    }
    
    return Q, stats

In [None]:
# Run Q-Learning
print("Q-Learning Training")
print("=" * 50)

start_time = time.time()
Q_qlearn, stats_qlearn = q_learning(
    env, gamma=0.99, alpha=0.1, n_episodes=100000,
    epsilon=1.0, epsilon_decay=0.99995, min_epsilon=0.01
)
qlearn_time = time.time() - start_time

print(f"Training time: {qlearn_time:.2f} seconds")
print(f"Final epsilon: {stats_qlearn['epsilons'][-1]:.4f}")

In [None]:
# Plot Q-Learning training progress
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Learning curve
window = 1000
rewards_smooth = np.convolve(stats_qlearn['episode_rewards'], 
                              np.ones(window)/window, mode='valid')
axes[0, 0].plot(rewards_smooth)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Reward (moving avg)')
axes[0, 0].set_title(f'Q-Learning Curve (window={window})')

# Epsilon decay
axes[0, 1].plot(stats_qlearn['epsilons'])
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Epsilon')
axes[0, 1].set_title('Exploration Rate Decay')

# Value function
V_qlearn = np.max(Q_qlearn, axis=1)
plot_value_function(V_qlearn, title="Learned V = max Q(s,a)", ax=axes[1, 0])

# Policy
plot_policy(Q_qlearn, title="Learned Policy", ax=axes[1, 1])

plt.suptitle("Q-Learning Results (100,000 episodes)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 5. SARSA vs Q-Learning Comparison

Let's compare the two algorithms in detail.

In [None]:
# Evaluate both policies
def evaluate_policy(env, Q, n_episodes=10000):
    """Evaluate a greedy policy derived from Q."""
    rewards = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = np.argmax(Q[state])
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        
        rewards.append(total_reward)
    
    return np.array(rewards)

# Get optimal Q from DP for comparison
def value_iteration(P, R, gamma, theta=1e-8):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = np.max([R[s, a] + gamma * np.sum(P[s, a] * V) for a in range(n_actions)])
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * np.sum(P[s, a] * V)
    return Q

Q_optimal = value_iteration(P, R, gamma=0.99)

# Evaluate
print("Policy Evaluation Comparison")
print("=" * 50)

rewards_sarsa_eval = evaluate_policy(env, Q_sarsa, n_episodes=10000)
rewards_qlearn_eval = evaluate_policy(env, Q_qlearn, n_episodes=10000)
rewards_optimal_eval = evaluate_policy(env, Q_optimal, n_episodes=10000)

print(f"SARSA: Success rate = {np.mean(rewards_sarsa_eval)*100:.2f}%")
print(f"Q-Learning: Success rate = {np.mean(rewards_qlearn_eval)*100:.2f}%")
print(f"Optimal (DP): Success rate = {np.mean(rewards_optimal_eval)*100:.2f}%")

In [None]:
# Compare learning curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Learning curves side by side
window = 1000
sarsa_smooth = np.convolve(stats_sarsa['episode_rewards'], 
                            np.ones(window)/window, mode='valid')
qlearn_smooth = np.convolve(stats_qlearn['episode_rewards'], 
                             np.ones(window)/window, mode='valid')

axes[0].plot(sarsa_smooth, label='SARSA', alpha=0.8)
axes[0].plot(qlearn_smooth, label='Q-Learning', alpha=0.8)
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Reward (moving avg)')
axes[0].set_title('Learning Curves Comparison')
axes[0].legend()

# Success rate comparison
methods = ['SARSA', 'Q-Learning', 'Optimal (DP)']
success_rates = [
    np.mean(rewards_sarsa_eval)*100,
    np.mean(rewards_qlearn_eval)*100,
    np.mean(rewards_optimal_eval)*100
]
colors = ['steelblue', 'orange', 'green']

bars = axes[1].bar(methods, success_rates, color=colors, edgecolor='black')
axes[1].set_ylabel('Success Rate (%)')
axes[1].set_title('Final Policy Performance')
axes[1].set_ylim(0, 100)
for bar, rate in zip(bars, success_rates):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{rate:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Compare Q-values with optimal
print("Q-value Comparison with Optimal")
print("=" * 50)

sarsa_error = np.mean(np.abs(Q_sarsa - Q_optimal))
qlearn_error = np.mean(np.abs(Q_qlearn - Q_optimal))

print(f"SARSA Mean Absolute Q-error: {sarsa_error:.4f}")
print(f"Q-Learning Mean Absolute Q-error: {qlearn_error:.4f}")

In [None]:
# Visualize policies side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

plot_policy(Q_sarsa, title="SARSA Policy", ax=axes[0])
plot_policy(Q_qlearn, title="Q-Learning Policy", ax=axes[1])
plot_policy(Q_optimal, title="Optimal Policy (DP)", ax=axes[2])

plt.suptitle("Policy Comparison", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 6. Key Differences: SARSA vs Q-Learning

| Aspect | SARSA | Q-Learning |
|--------|-------|------------|
| **Type** | On-policy | Off-policy |
| **Update uses** | $Q(S', A')$ where $A'$ from policy | $\max_a Q(S', a)$ |
| **Learns** | $Q^\pi$ for behavior policy | $Q^*$ optimal Q |
| **Behavior** | More conservative/safe | More aggressive/risky |
| **Convergence** | To $Q^\pi$ | To $Q^*$ |

## On-Policy vs Off-Policy

**On-policy (SARSA)**:
- Learns about the policy it's following
- Takes exploration into account
- May be safer in dangerous environments

**Off-policy (Q-Learning)**:
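- Learns about the greedy (optimal) policy while following a different, exploratory policy
- Can use experience from any source (replay buffer)
- Reusing experience often makes it more sample efficient, but its targets ignore the cost of exploratory moves, so behavior during learning may be riskier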
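
As a rough sketch of what "experience from any source" means, the example below runs Q-Learning updates on a fixed list of logged transitions, with no environment interaction and no behavior policy at all. The tiny deterministic MDP (two non-terminal states, one terminal state, two actions) and all of its rewards are invented for this illustration.

```python
import random

gamma, alpha = 0.5, 0.5
n_states, n_actions = 3, 2          # state 2 is terminal
Q = [[0.0] * n_actions for _ in range(n_states)]

# Logged transitions: (state, action, reward, next_state, terminated).
# How they were collected does not matter for the Q-Learning update.
logged = [
    (0, 0, 0.0, 1, False),
    (0, 1, 0.25, 2, True),
    (1, 0, 1.0, 2, True),
    (1, 1, 0.0, 0, False),
]

rng = random.Random(0)
for _ in range(200):                # replay the same data many times
    rng.shuffle(logged)
    for s, a, r, s_next, terminated in logged:
        # Off-policy target: best action in s_next, regardless of what was done there
        target = r + gamma * max(Q[s_next]) * (not terminated)
        Q[s][a] += alpha * (target - Q[s][a])

for s in range(2):
    print(f"Q[{s}] = {[round(q, 4) for q in Q[s]]}")
```

For this MDP the optimal values can be checked by hand: $Q^*(1,0) = 1$, $Q^*(0,0) = 0.5 \cdot 1 = 0.5$, $Q^*(0,1) = 0.25$, and $Q^*(1,1) = 0.5 \cdot 0.5 = 0.25$. A SARSA update could not be replayed this way, because its target needs the next action chosen by the current policy.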

---
# 7. Effect of Learning Rate

The learning rate α controls how much new information overrides old information.
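
One way to build intuition before running the experiment: if the target stayed fixed, each update would shrink the remaining error by a factor of $(1 - \alpha)$, so after $n$ updates the error is $(1 - \alpha)^n$ times the initial error. The sketch below counts how many updates each α needs to get within 5% of a fixed target of 1.0. The helper name, the target, and the tolerance are arbitrary choices for illustration.

```python
def updates_to_reach(alpha, target=1.0, tolerance=0.05):
    """Count updates V <- V + alpha * (target - V) until |target - V| < tolerance."""
    v, n_updates = 0.0, 0
    while abs(target - v) >= tolerance:
        v += alpha * (target - v)
        n_updates += 1
    return n_updates

steps = {alpha: updates_to_reach(alpha) for alpha in [0.01, 0.1, 0.5, 0.9]}
for alpha, n in steps.items():
    print(f"α = {alpha}: {n} updates to get within 5% of the target")
```

Larger α reaches the target in far fewer updates. In real TD learning the targets are noisy, and the same property that makes a large α fast also makes it follow that noise closely, which is the speed/stability trade-off explored in the experiment below.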

In [None]:
# Test different learning rates
alphas = [0.01, 0.1, 0.5, 0.9]
results_alpha = {}

print("Testing different learning rates (Q-Learning)")
print("=" * 50)

for alpha in alphas:
    Q, stats = q_learning(
        env, gamma=0.99, alpha=alpha, n_episodes=50000,
        epsilon=1.0, epsilon_decay=0.9999, min_epsilon=0.01
    )
    rewards = evaluate_policy(env, Q, n_episodes=5000)
    results_alpha[alpha] = {
        'Q': Q,
        'stats': stats,
        'success_rate': np.mean(rewards) * 100
    }
    print(f"α = {alpha}: Success rate = {results_alpha[alpha]['success_rate']:.2f}%")

In [None]:
# Plot learning rate comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Learning curves
window = 500
for alpha in alphas:
    rewards_smooth = np.convolve(results_alpha[alpha]['stats']['episode_rewards'],
                                  np.ones(window)/window, mode='valid')
    axes[0].plot(rewards_smooth, label=f'α = {alpha}')

axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Reward (moving avg)')
axes[0].set_title('Learning Curves for Different α')
axes[0].legend()

# Final success rates
success_rates = [results_alpha[a]['success_rate'] for a in alphas]
axes[1].bar([str(a) for a in alphas], success_rates, color='steelblue', edgecolor='black')
axes[1].set_xlabel('Learning Rate (α)')
axes[1].set_ylabel('Success Rate (%)')
axes[1].set_title('Final Performance vs Learning Rate')

for i, (a, rate) in enumerate(zip(alphas, success_rates)):
    axes[1].text(i, rate + 1, f'{rate:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---
# Summary

## TD Methods Overview

| Method | Type | Update Rule | Learns |
|--------|------|-------------|--------|
| **TD(0)** | Prediction | $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ | $V^\pi$ |
| **SARSA** | On-policy Control | $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s',a') - Q(s,a)]$ | $Q^\pi$ |
| **Q-Learning** | Off-policy Control | $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ | $Q^*$ |

## Key Takeaways

1. **TD combines MC and DP**: Model-free like MC, bootstraps like DP
2. **Updates every step**: Don't need to wait for episode end
3. **SARSA (on-policy)**: Learns about the policy being followed, more conservative
4. **Q-Learning (off-policy)**: Learns optimal policy regardless of behavior, more aggressive
5. **Trade-offs**: TD has lower variance but is biased; MC is unbiased but high variance

## TD vs MC vs DP

| Property | DP | MC | TD |
|----------|----|----|----|
| Model-free | No | Yes | Yes |
| Bootstraps | Yes | No | Yes |
| Online (step-by-step) | N/A (plans from a model) | No | Yes |
| Works for continuing tasks | Yes | No | Yes |

## Next Steps

In the final notebook (**06_algorithm_comparison.ipynb**), we'll:
- Compare all algorithms side by side
- Discuss when to use which method
- Summarize the entire tutorial

In [None]:
print("Congratulations! You've completed Part 5 of the RL Tutorial!")
print("\nKey takeaways:")
print("- TD methods update after every step using bootstrapping")
print("- SARSA is on-policy: learns about the policy it follows")
print("- Q-Learning is off-policy: learns optimal policy directly")
print("- Both are model-free and work for continuing tasks")
print("- Learning rate α controls the speed/stability trade-off")
print("\nNext: 06_algorithm_comparison.ipynb")