# Part 6: Algorithm Comparison and Conclusion

In this final notebook, we'll compare all the RL algorithms we've learned and provide a comprehensive summary of the tutorial.

## What You'll Learn
- Recap of all fundamental RL algorithms
- Side-by-side comparison of all algorithms
- Performance benchmarks
- When to use which algorithm
- Complete summary of RL concepts
- Next steps for further learning

Let's begin!

## Setup

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from collections import defaultdict

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
np.random.seed(42)

print("Setup complete!")

In [None]:
# Create environment
env = gym.make("FrozenLake-v1", is_slippery=True)
env.reset(seed=42)  # seed the environment RNG so slippery transitions are repeatable

n_states = env.observation_space.n
n_actions = env.action_space.n
action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']
action_arrows = ['←', '↓', '→', '↑']

print(f"FrozenLake: {n_states} states, {n_actions} actions")

In [None]:
# Extract MDP for DP methods
def extract_mdp(env):
    n_s = env.observation_space.n
    n_a = env.action_space.n
    P = np.zeros((n_s, n_a, n_s))
    R = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                P[s, a, next_s] += prob
                R[s, a] += prob * reward
    return P, R

P, R = extract_mdp(env)

In [None]:
# Visualization functions
def plot_policy(Q, title="Policy", ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(5, 5))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    colors = {'S': 'lightblue', 'F': 'white', 'H': 'lightcoral', 'G': 'lightgreen'}
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            
            rect = plt.Rectangle((j, nrow-1-i), 1, 1, fill=True,
                                 facecolor=colors.get(cell, 'white'), edgecolor='black')
            ax.add_patch(rect)
            
            best_action = np.argmax(Q[state])
            
            if cell not in ['H', 'G']:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, 
                       f'{cell}\n{action_arrows[best_action]}',
                       ha='center', va='center', fontsize=12, fontweight='bold')
            else:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, cell,
                       ha='center', va='center', fontsize=12, fontweight='bold')
    
    ax.set_xlim(0, ncol)
    ax.set_ylim(0, nrow)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title, fontsize=11)
    return ax

---
# 0. Recap from Notebooks 03-05

We've now learned the fundamental RL algorithms:

**Dynamic Programming (Notebook 03):**
- Model-based: Requires knowing P and R
- Exact computation through Bellman updates
- Guaranteed optimal solution
- Examples: Policy Iteration, Value Iteration

**Monte Carlo (Notebook 04):**
- Model-free: Learns from experience
- Episode-based: Must wait until termination
- High variance, zero bias
- Examples: First-visit MC, GLIE MC Control

**Temporal Difference (Notebook 05):**
- Model-free: Learns from experience
- Online: Updates after each step
- Bootstraps: Uses current estimates
- Examples: SARSA (on-policy), Q-learning (off-policy)
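
To make the "wait until termination" versus "bootstrap" distinction concrete, here is a minimal sketch that computes both learning targets for the first state of one toy trajectory. The rewards, the value estimates in `toy_values`, and the helper names are invented for this illustration and do not come from the earlier notebooks.

```python
import numpy as np

def mc_return(rewards, discount):
    """Full discounted return from the first step of a completed episode."""
    G = 0.0
    for r in reversed(rewards):
        G = r + discount * G
    return G

def td_target(reward, next_value, discount, terminated):
    """One-step TD target: reward plus the discounted estimate of the next state."""
    return reward + discount * next_value * (not terminated)

# Hypothetical 3-step episode that ends with a reward of 1
toy_rewards = [0.0, 0.0, 1.0]
toy_values = np.array([0.5, 0.6, 0.8, 0.0])  # invented estimates for s0..s3

G0 = mc_return(toy_rewards, discount=0.9)                       # needs the whole episode
target0 = td_target(toy_rewards[0], toy_values[1], 0.9, False)  # needs one step only

print(f"MC target for s0: {G0:.2f}")       # 0.81
print(f"TD target for s0: {target0:.2f}")  # 0.54
```

The Monte Carlo target depends only on observed rewards, while the TD target inherits whatever error is in the current estimate of the next state.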

---
# 0.1 What This Notebook Does NOT Cover

| Topic | Why Not Here | How It Differs From What We Cover |
|-------|--------------|-----------------------------------|
| **Deep RL implementations** | We compare tabular methods on small discrete state spaces. Deep RL uses neural networks for continuous/high-dimensional states. | Our algorithms maintain explicit Q[s,a] tables. Deep RL methods replace the table with neural networks trained by gradient descent: DQN approximates Q(s,a;θ), while PPO and A3C parameterize the policy. This is necessary for Atari and robotics, but conceptually builds on these foundations. |
| **Advanced algorithms (PPO, SAC, A3C)** | These are modern improvements combining multiple techniques. We focus on foundational algorithms that underpin them. | We compare core methods: DP, MC, TD. PPO (policy optimization), SAC (soft actor-critic), A3C (asynchronous AC) combine ideas from all these plus additional innovations; they're the next step after mastering fundamentals. |
| **Multi-agent reinforcement learning** | Multi-agent settings require game theory and coordination. We focus on single-agent optimization. | Our algorithms optimize one agent against a fixed environment. Multi-agent RL involves multiple learning agents with potentially conflicting goals, which changes the learning dynamics fundamentally (solution concepts such as Nash equilibria, and sometimes communication between agents). |
| **Continuous control and action spaces** | Our comparison uses discrete actions (4 moves in FrozenLake). Continuous control requires policy gradients or discretization. | We compare algorithms that select from finite action sets. Continuous control (robot joints, vehicle steering) needs methods like DDPG, TD3, or SAC that handle continuous action spaces in R^n. |

---
# 0.2 How to Read This Notebook

This notebook provides a **systematic comparison framework** for evaluating RL algorithms:

**1. Implementation Review (Section 1)**: All algorithms in one place for easy reference

**2. Experimental Comparison (Sections 2-4)**: 
   - Run all algorithms on the same environment
   - Measure performance, speed, and accuracy
   - Visualize learned policies side-by-side

**3. Comparative Analysis (Sections 5-6)**:
   - When to use which algorithm
   - Trade-offs between methods
   - Decision guides for real-world problems

**4. Comprehensive Summary (Sections 7-9)**:
   - All RL concepts from the tutorial
   - Algorithm properties table
   - What comes next in your RL journey

**How to engage**:
- Run cells in order to reproduce the comparison
- Pay attention to the performance metrics and trade-offs
- Use the decision guides to understand when to apply each method
- The "Your Turn" section at the end has exercises to test your understanding

---
# 0.3 Preview: What We'll Compare

We will systematically compare algorithms across multiple dimensions:

**Performance Metrics**:
- **Success Rate**: How often does the learned policy reach the goal?
- **Q-value Accuracy**: How close are learned Q-values to the optimal solution?
- **Training Time**: How long does the algorithm take to converge?

**Algorithm Properties**:
- **Model Requirements**: Does it need to know P and R?
- **Update Frequency**: After each step, episode, or sweep?
- **Variance vs Bias**: What are the statistical properties?
- **Convergence Guarantees**: Does it provably reach the optimal solution?

**Practical Considerations**:
- **Sample Efficiency**: How many episodes/steps needed?
- **Computational Cost**: Time per update vs number of updates
- **Exploration Strategy**: How does it balance exploration and exploitation?
- **Applicability**: When should you use each algorithm?

By the end, you'll have a clear decision framework for choosing the right algorithm for your problem.
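
When comparing success rates, it helps to know how much of a difference is just evaluation noise. The sketch below assumes each evaluation episode returns 0 or 1 (as in FrozenLake) and uses a normal-approximation confidence interval. The helper name `success_rate_with_ci` and the synthetic data are hypothetical; the 0.74 used to generate the data is an arbitrary number, not a result from this notebook.

```python
import numpy as np

def success_rate_with_ci(episode_rewards, z=1.96):
    """Mean success rate and an approximate 95% confidence interval."""
    outcomes = np.asarray(episode_rewards, dtype=float)
    n = outcomes.size
    p = outcomes.mean()
    se = np.sqrt(p * (1.0 - p) / n)  # binomial standard error
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

# Synthetic 0/1 outcomes standing in for 10,000 evaluation episodes
rng = np.random.default_rng(0)
fake_outcomes = rng.binomial(1, 0.74, size=10_000)

p_hat, (ci_low, ci_high) = success_rate_with_ci(fake_outcomes)
print(f"Success rate: {p_hat:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

With 10,000 episodes the interval is under two percentage points wide, so two policies whose success rates differ by less than that are hard to tell apart.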

---
# 1. Implement All Algorithms

Let's implement all the algorithms we've learned in one place.

In [None]:
# =====================
# DYNAMIC PROGRAMMING
# =====================

def policy_iteration(P, R, gamma, theta=1e-8):
    """Policy Iteration (DP method, requires model)."""
    n_states, n_actions = R.shape
    
    # Initialize random policy
    policy = np.ones((n_states, n_actions)) / n_actions
    
    iterations = 0
    while True:
        # Policy Evaluation
        V = np.zeros(n_states)
        while True:
            V_new = np.zeros(n_states)
            for s in range(n_states):
                for a in range(n_actions):
                    V_new[s] += policy[s, a] * (R[s, a] + gamma * np.sum(P[s, a] * V))
            if np.max(np.abs(V_new - V)) < theta:
                break
            V = V_new
        
        # Policy Improvement
        new_policy = np.zeros((n_states, n_actions))
        for s in range(n_states):
            q_values = [R[s, a] + gamma * np.sum(P[s, a] * V) for a in range(n_actions)]
            new_policy[s, np.argmax(q_values)] = 1.0
        
        iterations += 1
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    
    # Compute Q from V
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * np.sum(P[s, a] * V)
    
    return Q, iterations


def value_iteration(P, R, gamma, theta=1e-8):
    """Value Iteration (DP method, requires model)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    
    iterations = 0
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = np.max([R[s, a] + gamma * np.sum(P[s, a] * V) for a in range(n_actions)])
        
        iterations += 1
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    
    # Extract Q
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * np.sum(P[s, a] * V)
    
    return Q, iterations

In [None]:
# =====================
# MONTE CARLO
# =====================

def mc_control(env, gamma, n_episodes, epsilon=0.1, epsilon_decay=0.99999, min_epsilon=0.01):
    """Monte Carlo Control with ε-greedy (model-free)."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))
    episode_rewards = []
    
    for episode_num in range(n_episodes):
        # Generate episode
        episode = []
        state, _ = env.reset()
        done = False
        
        while not done:
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[state])
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
        
        episode_rewards.append(sum(r for _, _, r in episode))
        
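        # Update Q (first-visit): find the earliest time step of each (s, a).
        # Marking pairs as "seen" while iterating backwards would instead
        # use the return from the LAST visit.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[s, a] += G
                returns_count[s, a] += 1
                Q[s, a] = returns_sum[s, a] / returns_count[s, a]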
        
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    return Q, episode_rewards

In [None]:
# =====================
# TEMPORAL DIFFERENCE
# =====================

def sarsa(env, gamma, alpha, n_episodes, epsilon=0.1, epsilon_decay=0.99999, min_epsilon=0.01):
    """SARSA: On-policy TD Control (model-free)."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    episode_rewards = []
    
    def eps_greedy(state, eps):
        if np.random.random() < eps:
            return np.random.randint(n_actions)
        return np.argmax(Q[state])
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        action = eps_greedy(state, epsilon)
        total_reward = 0
        done = False
        
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = eps_greedy(next_state, epsilon)
            
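            # SARSA update: stop bootstrapping only on true termination;
            # after a time-limit truncation the next state still has value
            td_target = reward + gamma * Q[next_state, next_action] * (not terminated)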
            Q[state, action] += alpha * (td_target - Q[state, action])
            
            state, action = next_state, next_action
            total_reward += reward
        
        episode_rewards.append(total_reward)
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    return Q, episode_rewards


def q_learning(env, gamma, alpha, n_episodes, epsilon=0.1, epsilon_decay=0.99999, min_epsilon=0.01):
    """Q-Learning: Off-policy TD Control (model-free)."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    episode_rewards = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[state])
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
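            # Q-Learning update: stop bootstrapping only on true termination;
            # after a time-limit truncation the next state still has value
            td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)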
            Q[state, action] += alpha * (td_target - Q[state, action])
            
            state = next_state
            total_reward += reward
        
        episode_rewards.append(total_reward)
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    return Q, episode_rewards

---
# 2. Run All Algorithms

In [None]:
# Parameters
gamma = 0.99
n_episodes_mf = 100000  # For model-free methods

results = {}

print("Running All Algorithms")
print("=" * 60)

# Policy Iteration
print("\n1. Policy Iteration (DP)...")
start = time.time()
Q_pi, iters_pi = policy_iteration(P, R, gamma)
time_pi = time.time() - start
results['Policy Iteration'] = {'Q': Q_pi, 'time': time_pi, 'iterations': iters_pi}
print(f"   Done in {time_pi:.4f}s, {iters_pi} iterations")

# Value Iteration
print("\n2. Value Iteration (DP)...")
start = time.time()
Q_vi, iters_vi = value_iteration(P, R, gamma)
time_vi = time.time() - start
results['Value Iteration'] = {'Q': Q_vi, 'time': time_vi, 'iterations': iters_vi}
print(f"   Done in {time_vi:.4f}s, {iters_vi} iterations")

# Monte Carlo
print(f"\n3. Monte Carlo ({n_episodes_mf} episodes)...")
start = time.time()
Q_mc, rewards_mc = mc_control(env, gamma, n_episodes_mf, epsilon=1.0, epsilon_decay=0.99995)
time_mc = time.time() - start
results['Monte Carlo'] = {'Q': Q_mc, 'time': time_mc, 'rewards': rewards_mc}
print(f"   Done in {time_mc:.2f}s")

# SARSA
print(f"\n4. SARSA ({n_episodes_mf} episodes)...")
start = time.time()
Q_sarsa, rewards_sarsa = sarsa(env, gamma, alpha=0.1, n_episodes=n_episodes_mf, 
                                epsilon=1.0, epsilon_decay=0.99995)
time_sarsa = time.time() - start
results['SARSA'] = {'Q': Q_sarsa, 'time': time_sarsa, 'rewards': rewards_sarsa}
print(f"   Done in {time_sarsa:.2f}s")

# Q-Learning
print(f"\n5. Q-Learning ({n_episodes_mf} episodes)...")
start = time.time()
Q_qlearn, rewards_qlearn = q_learning(env, gamma, alpha=0.1, n_episodes=n_episodes_mf,
                                       epsilon=1.0, epsilon_decay=0.99995)
time_qlearn = time.time() - start
results['Q-Learning'] = {'Q': Q_qlearn, 'time': time_qlearn, 'rewards': rewards_qlearn}
print(f"   Done in {time_qlearn:.2f}s")

print("\n" + "=" * 60)
print("All algorithms complete!")

---
# 3. Evaluate All Policies

In [None]:
def evaluate_policy(env, Q, n_episodes=10000):
    """Evaluate a greedy policy derived from Q."""
    rewards = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            action = np.argmax(Q[state])
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        rewards.append(total_reward)
    return np.array(rewards)

print("Evaluating All Policies (10,000 episodes each)")
print("=" * 60)

for name in results:
    rewards = evaluate_policy(env, results[name]['Q'])
    results[name]['success_rate'] = np.mean(rewards) * 100
    results[name]['eval_rewards'] = rewards
    print(f"{name}: Success rate = {results[name]['success_rate']:.2f}%")

---
# 4. Comprehensive Comparison

In [None]:
# Compare Q-values with optimal (Policy Iteration is our reference)
Q_optimal = results['Policy Iteration']['Q']

print("Q-value Accuracy (Mean Absolute Error vs Optimal)")
print("=" * 60)

for name in results:
    mae = np.mean(np.abs(results[name]['Q'] - Q_optimal))
    results[name]['q_mae'] = mae
    print(f"{name}: MAE = {mae:.6f}")

In [None]:
# Create comprehensive comparison visualization
fig = plt.figure(figsize=(18, 12))

# 1. Success rates bar chart
ax1 = fig.add_subplot(2, 3, 1)
names = list(results.keys())
success_rates = [results[n]['success_rate'] for n in names]
colors = ['#2ecc71', '#27ae60', '#3498db', '#e74c3c', '#9b59b6']
bars = ax1.bar(range(len(names)), success_rates, color=colors, edgecolor='black')
ax1.set_xticks(range(len(names)))
ax1.set_xticklabels([n.replace(' ', '\n') for n in names], fontsize=9)
ax1.set_ylabel('Success Rate (%)')
ax1.set_title('Policy Performance')
ax1.set_ylim(0, 100)
for bar, rate in zip(bars, success_rates):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{rate:.1f}%', ha='center', fontsize=9, fontweight='bold')

# 2. Training time bar chart
ax2 = fig.add_subplot(2, 3, 2)
times = [results[n]['time'] for n in names]
bars = ax2.bar(range(len(names)), times, color=colors, edgecolor='black')
ax2.set_xticks(range(len(names)))
ax2.set_xticklabels([n.replace(' ', '\n') for n in names], fontsize=9)
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Training Time')
for bar, t in zip(bars, times):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(times)*0.02,
            f'{t:.2f}s', ha='center', fontsize=9, fontweight='bold')

# 3. Q-value accuracy
ax3 = fig.add_subplot(2, 3, 3)
maes = [results[n]['q_mae'] for n in names]
bars = ax3.bar(range(len(names)), maes, color=colors, edgecolor='black')
ax3.set_xticks(range(len(names)))
ax3.set_xticklabels([n.replace(' ', '\n') for n in names], fontsize=9)
ax3.set_ylabel('Mean Absolute Error')
ax3.set_title('Q-value Accuracy (vs Optimal)')

# 4-8. Policies side by side
for idx, name in enumerate(names):
    ax = fig.add_subplot(2, 5, 6 + idx)
    plot_policy(results[name]['Q'], title=name, ax=ax)

plt.suptitle('Comprehensive Algorithm Comparison', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Learning curves for model-free methods
fig, ax = plt.subplots(figsize=(12, 5))

window = 1000
model_free = ['Monte Carlo', 'SARSA', 'Q-Learning']
colors_mf = ['#3498db', '#e74c3c', '#9b59b6']

for name, color in zip(model_free, colors_mf):
    rewards = results[name]['rewards']
    smooth = np.convolve(rewards, np.ones(window)/window, mode='valid')
    ax.plot(smooth, label=name, color=color, alpha=0.8)

ax.set_xlabel('Episode')
ax.set_ylabel(f'Reward (moving avg, window={window})')
ax.set_title('Learning Curves: Model-Free Methods')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
# 5. Summary Table

In [None]:
# Create summary table
print("\n" + "=" * 90)
print("ALGORITHM COMPARISON SUMMARY")
print("=" * 90)
print(f"{'Algorithm':<20} {'Type':<15} {'Model':<12} {'Success':<10} {'Time':<12} {'Q-MAE':<10}")
print("-" * 90)

algo_info = {
    'Policy Iteration': ('DP', 'Required'),
    'Value Iteration': ('DP', 'Required'),
    'Monte Carlo': ('MC', 'Free'),
    'SARSA': ('TD (On)', 'Free'),
    'Q-Learning': ('TD (Off)', 'Free')
}

for name in results:
    algo_type, model = algo_info[name]
    success = results[name]['success_rate']
    time_taken = results[name]['time']
    mae = results[name]['q_mae']
    print(f"{name:<20} {algo_type:<15} {model:<12} {success:>6.2f}%   {time_taken:>8.4f}s   {mae:>8.6f}")

print("=" * 90)

---
# 6. When to Use Which Algorithm?

## Decision Guide

```
Do you have a complete model of the environment?
│
├── YES → Use Dynamic Programming
│         ├── Policy Iteration: Fewer iterations, more work per iteration
│         └── Value Iteration: More iterations, less work per iteration
│
└── NO → Use Model-Free Methods
         │
         ├── Do episodes terminate?
         │   ├── YES → Can use Monte Carlo or TD
         │   └── NO → Must use TD methods
         │
         └── Which policy do you want to learn?
             ├── The optimal policy → Q-Learning (off-policy)
             └── The policy you're following → SARSA (on-policy)
```
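
The same decision guide can be written as a small function. This is only a sketch of the tree above: the function name `choose_algorithm` and its three boolean arguments are hypothetical, and real problems usually involve more considerations than these three questions.

```python
def choose_algorithm(has_model: bool, episodic: bool, learn_optimal: bool) -> str:
    """Return a suggestion following the decision guide for tabular methods."""
    if has_model:
        # With known P and R we can plan instead of sampling
        return "Dynamic Programming (Policy Iteration or Value Iteration)"

    # Model-free: pick the TD flavour by which policy we want to learn
    td_choice = "Q-Learning" if learn_optimal else "SARSA"

    # Monte Carlo is only an option when episodes terminate
    if episodic:
        return f"{td_choice} or Monte Carlo"
    return td_choice


print(choose_algorithm(has_model=True, episodic=True, learn_optimal=True))
print(choose_algorithm(has_model=False, episodic=False, learn_optimal=True))
print(choose_algorithm(has_model=False, episodic=True, learn_optimal=False))
```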

In [None]:
# Create decision flowchart visualization
fig, ax = plt.subplots(figsize=(14, 8))

# Draw boxes
boxes = [
    {'pos': (0.5, 0.9), 'text': 'Start: Choose RL Algorithm', 'color': 'lightgray'},
    {'pos': (0.5, 0.75), 'text': 'Have complete\nmodel (P, R)?', 'color': 'lightyellow'},
    {'pos': (0.2, 0.55), 'text': 'Dynamic\nProgramming', 'color': 'lightgreen'},
    {'pos': (0.8, 0.55), 'text': 'Model-Free\nMethods', 'color': 'lightblue'},
    {'pos': (0.1, 0.35), 'text': 'Policy\nIteration', 'color': '#2ecc71'},
    {'pos': (0.3, 0.35), 'text': 'Value\nIteration', 'color': '#27ae60'},
    {'pos': (0.65, 0.35), 'text': 'Episodes\nterminate?', 'color': 'lightyellow'},
    {'pos': (0.5, 0.15), 'text': 'Monte Carlo', 'color': '#3498db'},
    {'pos': (0.95, 0.35), 'text': 'TD Methods', 'color': 'lightblue'},
    {'pos': (0.8, 0.15), 'text': 'SARSA\n(on-policy)', 'color': '#e74c3c'},
    {'pos': (1.0, 0.15), 'text': 'Q-Learning\n(off-policy)', 'color': '#9b59b6'},
]

for box in boxes:
    rect = plt.Rectangle((box['pos'][0]-0.08, box['pos'][1]-0.06), 0.16, 0.12,
                         facecolor=box['color'], edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(box['pos'][0], box['pos'][1], box['text'], ha='center', va='center',
           fontsize=9, fontweight='bold')

# Draw arrows with labels
arrows = [
    ((0.5, 0.84), (0.5, 0.81), ''),
    ((0.42, 0.69), (0.28, 0.61), 'Yes'),
    ((0.58, 0.69), (0.72, 0.61), 'No'),
    ((0.15, 0.49), (0.12, 0.41), ''),
    ((0.25, 0.49), (0.28, 0.41), ''),
    ((0.73, 0.49), (0.58, 0.41), ''),
    ((0.73, 0.35), (0.87, 0.35), 'No'),  # from "Episodes terminate?" to TD Methods
    ((0.57, 0.29), (0.52, 0.21), 'Yes'),
    ((0.93, 0.29), (0.85, 0.21), ''),
    ((0.97, 0.29), (0.98, 0.21), ''),
]

for start, end, label in arrows:
    ax.annotate('', xy=end, xytext=start,
               arrowprops=dict(arrowstyle='->', color='black', lw=1.5))
    if label:
        mid = ((start[0]+end[0])/2, (start[1]+end[1])/2)
        ax.text(mid[0]+0.02, mid[1]+0.02, label, fontsize=9, fontweight='bold', color='darkblue')

ax.set_xlim(-0.05, 1.15)
ax.set_ylim(0.05, 1.0)
ax.axis('off')
ax.set_title('Algorithm Selection Guide', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

---
# 7. Complete RL Concepts Summary

## Core Concepts

| Concept | Definition | Formula |
|---------|------------|----------|
| **State** | Current situation | $s \in S$ |
| **Action** | Decision to take | $a \in A$ |
| **Reward** | Immediate feedback | $R_t$ |
| **Return** | Cumulative discounted reward | $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ |
| **Policy** | Behavior strategy | $\pi(a \mid s) = P[A_t=a \mid S_t=s]$ |
| **State Value** | Expected return from state | $V^\pi(s) = E_\pi[G_t \mid S_t=s]$ |
| **Action Value** | Expected return from (state, action) | $Q^\pi(s,a) = E_\pi[G_t \mid S_t=s, A_t=a]$ |
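
The state value and action value are linked by $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a)$: the state value is the policy-weighted average of the action values. A minimal numeric check, with invented numbers in `Q_demo` and `pi_demo`:

```python
import numpy as np

# Hypothetical action values, shape (n_states, n_actions)
Q_demo = np.array([[1.0, 3.0],
                   [0.0, 2.0]])

# Hypothetical stochastic policy pi(a|s); each row sums to 1
pi_demo = np.array([[0.5, 0.5],
                    [0.1, 0.9]])

V_demo = np.sum(pi_demo * Q_demo, axis=1)   # policy-weighted average per state
greedy_actions = np.argmax(Q_demo, axis=1)  # what a greedy policy would pick

print("V:", V_demo)                  # [2.  1.8]
print("Greedy actions:", greedy_actions)
```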

## Bellman Equations

| Equation | Purpose | Form |
|----------|---------|------|
| **Bellman Expectation (V)** | Value of policy | $V^\pi(s) = \sum_a \pi(a \mid s)[R_s^a + \gamma \sum_{s'} P_{ss'}^a V^\pi(s')]$ |
| **Bellman Expectation (Q)** | Q of policy | $Q^\pi(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a V^\pi(s')$ |
| **Bellman Optimality (V)** | Optimal value | $V^*(s) = \max_a[R_s^a + \gamma \sum_{s'} P_{ss'}^a V^*(s')]$ |
| **Bellman Optimality (Q)** | Optimal Q | $Q^*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} Q^*(s',a')$ |
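
Because the Bellman expectation equation is linear in $V^\pi$, a small MDP can be evaluated exactly by solving a linear system instead of iterating. The sketch below uses a made-up 2-state, 2-action MDP (`P_toy`, `R_toy` are illustrative numbers, not FrozenLake) with the same `P[s, a, s']` and `R[s, a]` layout used in this notebook.

```python
import numpy as np

# Hypothetical MDP: P_toy[s, a, s'] and R_toy[s, a]
P_toy = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.0, 1.0]]])
R_toy = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
pi_toy = np.full((2, 2), 0.5)   # uniform random policy
discount = 0.9

# Average the dynamics and rewards over the policy
P_pi = np.einsum('sa,sat->st', pi_toy, P_toy)
R_pi = np.sum(pi_toy * R_toy, axis=1)

# Matrix form of the Bellman expectation equation: (I - gamma * P_pi) V = R_pi
V_pi = np.linalg.solve(np.eye(2) - discount * P_pi, R_pi)

# Q from V, then confirm that V is the policy-weighted average of Q
Q_pi_toy = R_toy + discount * P_toy @ V_pi
print("V^pi:", V_pi)
print("Consistent:", np.allclose(V_pi, np.sum(pi_toy * Q_pi_toy, axis=1)))
```

The direct solve costs on the order of the cube of the number of states, which is why iterative evaluation is preferred once the state space grows.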

## Algorithm Summary

### Dynamic Programming (Model-Based)

| Algorithm | Key Idea | Update |
|-----------|----------|--------|
| **Policy Iteration** | Alternate eval & improve | Full policy evaluation |
| **Value Iteration** | One-step lookahead | $V(s) \leftarrow \max_a[R_s^a + \gamma \sum_{s'} P_{ss'}^a V(s')]$ |
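
The value iteration backup can also be written without explicit loops over states and actions. This is a sketch of an alternative formulation, not a replacement for the loop version in Section 1; `value_iteration_vectorized` and the toy arrays are names introduced here for illustration.

```python
import numpy as np

def value_iteration_vectorized(P, R, gamma, theta=1e-8):
    """Value iteration where each sweep is one matrix product.

    P has shape (n_states, n_actions, n_states), R has shape (n_states, n_actions).
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V      # one-step lookahead for every (s, a)
        V_new = Q.max(axis=1)      # greedy backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q
        V = V_new

# Toy MDP: in state 0, action 0 pays 1 and moves to absorbing state 1,
# action 1 stays in state 0 with no reward
P_toy = np.array([[[0.0, 1.0], [1.0, 0.0]],
                  [[0.0, 1.0], [0.0, 1.0]]])
R_toy = np.array([[1.0, 0.0],
                  [0.0, 0.0]])

V_star, Q_star = value_iteration_vectorized(P_toy, R_toy, gamma=0.9)
print("V*:", V_star)   # [1. 0.]
```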

### Monte Carlo (Model-Free)

| Aspect | Description |
|--------|-------------|
| **Learns from** | Complete episodes |
| **Update** | After episode ends |
| **Uses** | Actual returns |
| **Variance** | High |
| **Bias** | None |
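
Averaging returns does not require storing them all. The running-mean form below gives the same estimate as `returns_sum / returns_count` and has the same shape as a TD update with step size $1/N$. The list of returns is made up for the example.

```python
import numpy as np

sampled_returns = [1.0, 0.0, 0.0, 1.0, 1.0]  # invented returns for one (s, a) pair

q_estimate, n_visits = 0.0, 0
for G in sampled_returns:
    n_visits += 1
    q_estimate += (G - q_estimate) / n_visits  # Q <- Q + (1/N) * (G - Q)

print(q_estimate, np.mean(sampled_returns))  # both are the sample mean
```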

### Temporal Difference (Model-Free)

| Algorithm | Type | Update |
|-----------|------|--------|
| **TD(0)** | Prediction | $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ |
| **SARSA** | On-policy | $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s',a') - Q(s,a)]$ |
| **Q-Learning** | Off-policy | $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$ |
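
The only difference between the SARSA and Q-Learning updates is which next-state value enters the target. A single hand-worked step shows this; the Q-table, transition, and step size below are invented for illustration.

```python
import numpy as np

step_size, discount = 0.5, 0.9
Q_toy = np.array([[0.0, 0.0],
                  [1.0, 4.0]])

# One hypothetical transition: (s, a, r, s')
s, a, r, s_next = 0, 0, 1.0, 1
a_next = 0  # the action the behaviour policy actually took next (an exploratory one)

sarsa_target = r + discount * Q_toy[s_next, a_next]    # uses the action taken
qlearn_target = r + discount * np.max(Q_toy[s_next])   # uses the best action

Q_sarsa_toy = Q_toy.copy()
Q_sarsa_toy[s, a] += step_size * (sarsa_target - Q_sarsa_toy[s, a])

Q_qlearn_toy = Q_toy.copy()
Q_qlearn_toy[s, a] += step_size * (qlearn_target - Q_qlearn_toy[s, a])

print(f"SARSA:      Q[0,0] = {Q_sarsa_toy[s, a]:.2f}")   # 0.95
print(f"Q-Learning: Q[0,0] = {Q_qlearn_toy[s, a]:.2f}")  # 2.30
```

When the next action happens to be the greedy one, the two updates coincide.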

---
# 8. Enhanced Summary and Concept Map

## Algorithm Properties Comparison

| Algorithm | Model-Based | Online | Variance | Bias | Convergence Guarantee | Best Use Case |
|-----------|-------------|--------|----------|------|----------------------|---------------|
| **Policy Iteration** | Yes | N/A (planning sweeps) | N/A | N/A | Yes (finite steps) | Small state spaces, need exact solution |
| **Value Iteration** | Yes | N/A (planning sweeps) | N/A | N/A | Yes (asymptotic) | Medium state spaces, can stop early |
| **Monte Carlo** | No | No | High | None | Yes (asymptotic) | Episodic tasks, simple to implement |
| **SARSA** | No | Yes | Low | Some | Yes (with conditions) | Safe exploration, learn actual behavior |
| **Q-Learning** | No | Yes | Low | Some | Yes (with conditions) | Learn optimal policy, sample efficiency |

## When to Use Each Algorithm

### Use Dynamic Programming when:
- You have complete knowledge of P and R
- State space is small to medium
- You need guaranteed convergence to optimal
- Computation is cheaper than sampling

### Use Monte Carlo when:
- Environment is unknown (model-free)
- Episodes naturally terminate
- You can afford high variance
- Simple implementation is preferred
- Each episode provides complete information

### Use SARSA when:
- Environment is unknown (model-free)
- You want to learn the policy you're actually following
- Safety matters (avoid risky exploration)
- On-policy learning is required

### Use Q-Learning when:
- Environment is unknown (model-free)
- You want to learn the optimal policy
- Can use off-policy data (experience replay)
- Sample efficiency is important

## Trade-offs Visualization

**Sample Efficiency vs Computational Cost:**
- DP: No samples needed (uses the model), high computation per sweep
- MC: Many samples needed, low computation per sample
- TD: Moderate samples, moderate computation

**Variance vs Bias:**
- DP: N/A (uses exact expectations)
- MC: High variance, zero bias
- TD: Low variance, some bias (from bootstrapping)
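
A small simulation illustrates the variance gap. It is a deliberately artificial setup, assumed only for this sketch: every step yields an independent standard-normal reward, episodes last 10 steps, the discount is 1, and the TD target is given the true next-state value (0). So it isolates variance and shows no bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim_episodes, horizon = 10_000, 10

# Each row is one episode of independent N(0, 1) rewards
sim_rewards = rng.normal(0.0, 1.0, size=(n_sim_episodes, horizon))
next_value_estimate = 0.0  # assumed equal to the true value in this toy setup

mc_targets = sim_rewards.sum(axis=1)                  # sums all 10 noisy rewards
td_targets = sim_rewards[:, 0] + next_value_estimate  # one noisy reward + estimate

print(f"Variance of MC target: {mc_targets.var():.2f}")
print(f"Variance of TD target: {td_targets.var():.2f}")
```

The Monte Carlo target accumulates the noise of every remaining step, while the TD target only carries the noise of one step. If `next_value_estimate` were wrong, the TD target would be biased by that error.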

**Flexibility vs Requirements:**
- DP: Least flexible (needs model), strongest guarantees
- MC: More flexible (model-free), requires episodic tasks
- TD: Most flexible (model-free, online), works for continuing tasks

---
# Congratulations!

You have completed this comprehensive Reinforcement Learning tutorial!

## What You've Learned

1. **Fundamentals** (Notebook 01)
   - What makes RL unique
   - Agent-environment interaction
   - States, actions, rewards, policies

2. **Mathematical Framework** (Notebook 02)
   - Markov Decision Processes
   - Bellman Equations
   - Optimal value functions

3. **Dynamic Programming** (Notebook 03)
   - Policy Evaluation
   - Policy Iteration
   - Value Iteration

4. **Monte Carlo Methods** (Notebook 04)
   - Learning from episodes
   - First-visit vs Every-visit
   - MC Control

5. **Temporal Difference** (Notebook 05)
   - TD(0) Prediction
   - SARSA (On-policy)
   - Q-Learning (Off-policy)

6. **Comparison & Summary** (This Notebook)
   - All algorithms compared
   - When to use what
   - Next steps

In [None]:
print("="*70)
print("   CONGRATULATIONS! You've completed the RL Tutorial!")
print("="*70)
print("\n📚 Notebooks completed:")
print("   01. Introduction to Reinforcement Learning")
print("   02. MDPs and Bellman Equations")
print("   03. Dynamic Programming")
print("   04. Monte Carlo Methods")
print("   05. Temporal Difference Learning")
print("   06. Algorithm Comparison (this one!)")
print("\n🎯 Algorithms mastered:")
print("   • Policy Iteration")
print("   • Value Iteration")
print("   • Monte Carlo Control")
print("   • SARSA")
print("   • Q-Learning")
print("\n🚀 Next steps:")
print("   • Try different environments (CartPole, MountainCar, etc.)")
print("   • Learn about Deep RL (DQN, Policy Gradients)")
print("   • Read Sutton & Barto's book for deeper understanding")
print("   • Implement your own RL agent for a real problem!")
print("\n" + "="*70)
print("   Happy Learning! 🎉")
print("="*70)

---
# 9. What's Next?

Congratulations! You've learned the foundations of reinforcement learning. Here's what comes next:

## Immediate Next Steps

**Function Approximation**: Handle large state spaces with linear/neural approximators
- Instead of Q-tables, use Q(s,a;w) with parameters w
- Linear: Q(s,a;w) = w^T φ(s,a) where φ are hand-crafted features
- Neural: Q(s,a;θ) approximated by deep neural networks
- Enables scaling to millions of states (Atari games, robotics)
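
As a bridge from tables to function approximation, here is a minimal sketch of a linear Q-function with a semi-gradient Q-Learning update. It assumes one-hot features, in which case the linear model is exactly a Q-table. The sizes `N_S`, `N_A` and the helper names are hypothetical.

```python
import numpy as np

N_S, N_A = 4, 2  # hypothetical numbers of states and actions

def phi(s, a):
    """One-hot feature vector for the pair (s, a)."""
    x = np.zeros(N_S * N_A)
    x[s * N_A + a] = 1.0
    return x

def q_hat(w, s, a):
    """Linear action-value estimate w^T phi(s, a)."""
    return w @ phi(s, a)

def semi_gradient_q_update(w, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-Learning step; the target is treated as a constant."""
    best_next = 0.0 if terminated else max(q_hat(w, s_next, b) for b in range(N_A))
    td_error = r + gamma * best_next - q_hat(w, s, a)
    return w + alpha * td_error * phi(s, a)  # gradient of w^T phi w.r.t. w is phi

w = np.zeros(N_S * N_A)
w = semi_gradient_q_update(w, s=0, a=1, r=1.0, s_next=2, terminated=False)
print(q_hat(w, 0, 1))  # 0.1
```

Swapping `phi` for features shared across states is what lets the estimate generalize to states that were never visited.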

**Deep RL**: DQN, A3C, PPO for high-dimensional problems (images, etc.)
- DQN (Deep Q-Network): Q-Learning with neural networks + experience replay
- A3C (Asynchronous Advantage Actor-Critic): Parallel agents learning together
- PPO (Proximal Policy Optimization): Stable policy gradient method
- Handles raw pixels, continuous control, complex environments
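
Experience replay, mentioned above for DQN, is conceptually simple: store transitions and sample random mini-batches from them. The class below is a bare-bones sketch using only the standard library. The name `ReplayBuffer` and its interface are chosen here for illustration and are not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, terminated):
        self.buffer.append((state, action, reward, next_state, terminated))

    def sample(self, batch_size):
        """Uniformly sample transitions and regroup them field by field."""
        batch = random.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

replay = ReplayBuffer(capacity=3)
for t in range(5):
    replay.push(t, 0, 0.0, t + 1, False)  # dummy transitions

states_b, actions_b, rewards_b, next_states_b, flags_b = replay.sample(2)
print(len(replay), states_b)
```

Sampling at random breaks the correlation between consecutive transitions, which is one reason off-policy methods like Q-Learning pair well with replay.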

**Policy Gradients**: Direct policy optimization (REINFORCE, A2C, PPO)
- Instead of learning Q and deriving policy, directly optimize π(a|s;θ)
- Better for continuous action spaces
- Can learn stochastic policies
- Foundation for modern deep RL (PPO, TRPO, SAC)
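
To see direct policy optimization in its smallest form, the sketch below runs REINFORCE with a softmax policy on a made-up two-armed bandit, where each "episode" is a single action. The reward values, step size, and variable names are assumptions for the example, and no baseline is used.

```python
import numpy as np

def softmax(preferences):
    z = preferences - preferences.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
arm_rewards = np.array([0.0, 1.0])  # hypothetical bandit: arm 1 is better
theta = np.zeros(2)                 # policy parameters (action preferences)
lr = 0.1

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    G = arm_rewards[action]          # the return of this one-step episode

    # For a softmax policy, grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    theta += lr * G * grad_log_pi    # REINFORCE update

final_probs = softmax(theta)
print("Final policy:", final_probs)
```

The policy stays stochastic throughout and shifts probability toward the rewarding arm without ever estimating a Q-value.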

## Advanced Topics

**Model-based RL**: Learn environment models and plan
- Learn P(s'|s,a) and R(s,a) from experience
- Use learned model for planning (like DP)
- Sample efficiency: planning with imagined rollouts
- Examples: Dyna-Q, PETS, MuZero
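
The first step of tabular model-based RL, estimating P and R from experience, amounts to counting. The sketch below builds maximum-likelihood estimates in the same `P[s, a, s']`, `R[s, a]` layout used by the DP functions in this notebook. The function name and the sample transitions are hypothetical.

```python
import numpy as np

def estimate_model(transitions, n_s, n_a):
    """Estimate P[s, a, s'] and R[s, a] from (s, a, r, s') tuples by counting."""
    counts = np.zeros((n_s, n_a, n_s))
    reward_sum = np.zeros((n_s, n_a))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    visits = counts.sum(axis=2)
    safe_visits = np.maximum(visits, 1)      # avoid division by zero
    P_hat = counts / safe_visits[:, :, None]
    R_hat = reward_sum / safe_visits
    return P_hat, R_hat, visits

# Invented experience from a 2-state, 2-action problem
experience = [(0, 0, 0.0, 1), (0, 0, 0.0, 1), (0, 0, 1.0, 0), (1, 1, 2.0, 1)]
P_hat, R_hat, visits = estimate_model(experience, n_s=2, n_a=2)

print("P_hat[0, 0]:", P_hat[0, 0])   # one third to state 0, two thirds to state 1
print("R_hat[0, 0]:", R_hat[0, 0])
```

Pairs that were never visited get an all-zero row in `P_hat`, so a planner using this estimate needs some rule for them (for example, optimistic defaults); that choice is left open here.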

**Multi-agent RL**: Game theory, coordination, competition
- Multiple agents learning simultaneously
- Cooperative: Team rewards, coordination
- Competitive: Zero-sum games, adversarial learning
- Mixed: Negotiation, communication protocols

**Exploration**: Better strategies than ε-greedy (UCB, Thompson sampling)
- Upper Confidence Bound (UCB): Explore states with high uncertainty
- Thompson Sampling: Bayesian approach to exploration
- Curiosity-driven: Intrinsic motivation to explore
- Count-based: Bonus for rarely visited states
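
As one example of these strategies, UCB adds a bonus to each action's estimated value that shrinks as the action is tried more often. The function below is a sketch of UCB-style action selection for a single state or bandit. The name `ucb_action` and the exploration constant `c=2.0` are choices made for this example.

```python
import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action with the highest value estimate plus uncertainty bonus.

    q_values: estimated value per action
    counts:   how many times each action has been tried
    t:        total number of selections so far (t >= 1)
    """
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])  # try every action at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_values) + bonus))

# Arm 0 looks slightly better but has been tried 100 times; arm 1 only once
print(ucb_action([0.5, 0.4], counts=[100, 1], t=101))  # 1: large bonus wins
print(ucb_action([0.5, 0.4], counts=[50, 50], t=100))  # 0: equal bonus, higher value wins
```

Unlike ε-greedy, the exploration here is directed: it goes to the actions whose estimates are least certain rather than to a uniformly random action.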

**Transfer Learning**: Apply knowledge across tasks
- Pre-training on simpler tasks
- Meta-learning: Learning to learn (MAML)
- Hierarchical RL: Decompose complex tasks
- Multi-task RL: Share knowledge across related tasks

## Practical Applications

**Robotics and Control**:
- Robot manipulation (grasping, assembly)
- Locomotion (walking, running, jumping)
- Autonomous vehicles (self-driving cars, drones)
- Industrial automation (optimization, scheduling)

**Game Playing**:
- Chess, Go (AlphaZero)
- StarCraft II, Dota 2 (AlphaStar, OpenAI Five)
- Atari games (DQN, Rainbow)
- Poker (Pluribus)

**Resource Management**:
- Data center cooling (Google DeepMind)
- Traffic light optimization
- Energy grid management
- Cloud resource allocation

**Personalization**:
- Recommendation systems (content, products)
- Healthcare (treatment optimization, drug discovery)
- Education (adaptive learning, tutoring systems)
- Finance (portfolio optimization, trading)

## Learning Resources

**Books**:
- Sutton & Barto: "Reinforcement Learning: An Introduction" (2nd ed, 2018)
- Bertsekas: "Dynamic Programming and Optimal Control"
- Szepesvári: "Algorithms for Reinforcement Learning"

**Online Courses**:
- David Silver's RL Course (DeepMind, YouTube)
- CS285 Deep RL (UC Berkeley, Sergey Levine)
- OpenAI Spinning Up in Deep RL

**Code & Libraries**:
- Stable Baselines3: Ready-to-use RL algorithms
- RLlib (Ray): Scalable RL library
- Gymnasium (the maintained successor to OpenAI Gym): Standard environments
- CleanRL: Single-file implementations

The RL journey continues!

---
# 10. Your Turn

Now it's time to test your understanding with comprehensive exercises!

## Exercise 1: Implement Custom Environment Comparison

Apply all algorithms to a different Gymnasium environment and compare results.

**Task**: Complete the code below to compare algorithms on CliffWalking-v0

```python
# YOUR CODE HERE
# 1. Create CliffWalking environment
# 2. Run all 5 algorithms (Policy Iteration, Value Iteration, MC, SARSA, Q-Learning)
# 3. Compare success rates and learning curves
# 4. Which algorithm performs best? Why?

import gymnasium as gym

# TODO: Create environment
cliff_env = gym.make("CliffWalking-v0")

# TODO: Extract MDP for DP methods (adapt extract_mdp function)

# TODO: Run all algorithms

# TODO: Evaluate and compare results

# Question: CliffWalking has a "cliff" that gives -100 reward.
# Which algorithm (SARSA or Q-Learning) do you expect to be safer? Why?
```

<details>
<summary>Click to see hint</summary>

CliffWalking is a 4x12 gridworld where the agent must navigate from bottom-left to bottom-right, avoiding a cliff along the bottom edge.

Key considerations:
- SARSA is on-policy: learns the policy it follows (including exploration)
- Q-Learning is off-policy: learns the optimal policy regardless of behavior
- During learning with ε-greedy, which one will fall off the cliff more often?

</details>

<details>
<summary>Click to see solution</summary>

```python
import gymnasium as gym

# Create environment
cliff_env = gym.make("CliffWalking-v0")
n_states_cliff = cliff_env.observation_space.n
n_actions_cliff = cliff_env.action_space.n

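# Extract MDP for DP
P_cliff, R_cliff = extract_mdp(cliff_env)

# extract_mdp ignores the `done` flag, so the goal would keep costing -1 per step
# and every cliff-avoiding policy would look equally bad to DP.
# Make the goal (last state, bottom-right) absorbing with zero reward.
goal_state = n_states_cliff - 1
P_cliff[goal_state] = 0.0
P_cliff[goal_state, :, goal_state] = 1.0
R_cliff[goal_state] = 0.0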

# Run algorithms
print("Comparing on CliffWalking-v0")
print("=" * 60)

# DP methods
Q_pi_cliff, _ = policy_iteration(P_cliff, R_cliff, gamma=0.99)
Q_vi_cliff, _ = value_iteration(P_cliff, R_cliff, gamma=0.99)

# Model-free (50k episodes)
Q_mc_cliff, rewards_mc_cliff = mc_control(cliff_env, gamma=0.99, n_episodes=50000,
                                          epsilon=1.0, epsilon_decay=0.9999)
Q_sarsa_cliff, rewards_sarsa_cliff = sarsa(cliff_env, gamma=0.99, alpha=0.5, 
                                            n_episodes=50000, epsilon=0.1)
Q_qlearn_cliff, rewards_qlearn_cliff = q_learning(cliff_env, gamma=0.99, alpha=0.5,
                                                   n_episodes=50000, epsilon=0.1)

# Compare learning curves
window = 500
plt.figure(figsize=(12, 5))
plt.plot(np.convolve(rewards_sarsa_cliff, np.ones(window)/window, mode='valid'), 
         label='SARSA', alpha=0.8)
plt.plot(np.convolve(rewards_qlearn_cliff, np.ones(window)/window, mode='valid'),
         label='Q-Learning', alpha=0.8)
plt.xlabel('Episode')
plt.ylabel('Reward (moving avg)')
plt.title('CliffWalking: SARSA vs Q-Learning')
plt.legend()
plt.show()

# Key insight: SARSA learns a safer path (away from cliff) because it's on-policy
# Q-Learning learns the optimal path (close to cliff) but falls off during learning
print(f"SARSA mean reward: {np.mean(rewards_sarsa_cliff[-1000:]):.2f}")
print(f"Q-Learning mean reward: {np.mean(rewards_qlearn_cliff[-1000:]):.2f}")
```

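**Answer**: SARSA is safer during learning because it is on-policy: its value estimates include the cost of the ε-greedy exploration it actually performs. States next to the cliff therefore look bad (one random action there can cost -100), and SARSA learns a path further from the edge. Q-Learning evaluates the greedy policy, so it learns the shortest path along the cliff edge. That path is optimal once exploration stops, but the agent falls off more often while it is still training with ε-greedy.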
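
The difference comes down to the bootstrap target each method uses. The sketch below compares the two targets for a single transition; the Q-values are made-up numbers for a state next to the cliff, and the helper names are hypothetical.

```python
def sarsa_target(reward, q_next, next_action, gamma=0.99):
    """On-policy target: uses the action the behaviour policy actually takes next."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, q_next, gamma=0.99):
    """Off-policy target: uses the greedy action, whatever is actually taken next."""
    return reward + gamma * max(q_next)

# Hypothetical Q-values in the next state, ordered [UP, RIGHT, DOWN, LEFT]
q_next = [-14.0, -12.0, -100.0, -15.0]
reward = -1.0

# Suppose exploration picks DOWN (index 2, into the cliff) as the next action
print(sarsa_target(reward, q_next, next_action=2))  # -1 + 0.99 * (-100) = -100
print(q_learning_target(reward, q_next))            # -1 + 0.99 * (-12), about -12.88
```

SARSA's target absorbs the occasional exploratory fall, which drags down the value of edge states. Q-Learning's target never sees it.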

</details>

## Exercise 2: Hyperparameter Tuning Competition

**Task**: Find the best hyperparameters for Q-Learning on FrozenLake

Experiment with:
- Learning rate α: [0.01, 0.1, 0.5, 0.9]
- Initial epsilon: [0.1, 0.5, 1.0]
- Epsilon decay: [0.9999, 0.99995, 0.99999]
- Number of episodes: [10000, 50000, 100000]

```python
# YOUR CODE HERE
# Run grid search over hyperparameters
# Track success rate for each combination
# Find the best configuration

best_success = 0
best_params = {}

# TODO: Implement grid search
# for alpha in [0.01, 0.1, 0.5, 0.9]:
#     for epsilon_init in [0.1, 0.5, 1.0]:
#         for epsilon_decay in [0.9999, 0.99995, 0.99999]:
#             # Train Q-Learning
#             # Evaluate success rate
#             # Track best

print(f"Best parameters: {best_params}")
print(f"Best success rate: {best_success:.2f}%")
```

<details>
<summary>Click to see hint</summary>

Tips for hyperparameter tuning:
- Higher learning rate α → faster learning but less stable
- Higher initial epsilon → more exploration early on
- Slower epsilon decay → explore for longer
- More episodes → more time to learn but diminishing returns

FrozenLake is stochastic (slippery ice), so you need:
- Enough episodes to overcome randomness
- Balanced exploration (not too greedy too fast)
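
To get a feel for how the decay values interact with the episode budget, the sketch below computes where ε ends up. It assumes ε is multiplied by the decay factor once per episode with no lower floor; check this against the actual `q_learning` implementation, since the helper `episodes_until` and that schedule are assumptions here.

```python
import math

def episodes_until(eps_target, eps_init, decay):
    """Episodes needed for eps_init * decay**n to fall to eps_target."""
    return math.ceil(math.log(eps_target / eps_init) / math.log(decay))

for decay in (0.9999, 0.99995, 0.99999):
    eps_after_50k = 1.0 * decay ** 50_000
    n = episodes_until(0.01, 1.0, decay)
    print(f"decay={decay}: eps after 50k episodes = {eps_after_50k:.3f}, "
          f"episodes to reach 0.01 = {n}")
```

Under these assumptions, starting from ε = 1.0, the three decay values leave ε at about 0.007, 0.08, and 0.61 after 50,000 episodes. The slowest schedule is still mostly exploring when training ends.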

</details>

<details>
<summary>Click to see solution</summary>

```python
import itertools

# Define parameter grid
param_grid = {
    'alpha': [0.01, 0.1, 0.5, 0.9],
    'epsilon': [0.1, 0.5, 1.0],
    'epsilon_decay': [0.9999, 0.99995, 0.99999],
    'n_episodes': [50000]
}

best_success = 0
best_params = {}
results_grid = []

# Grid search over every combination in param_grid
for alpha, eps, decay, n_ep in itertools.product(*param_grid.values()):
    # Train
    Q, _ = q_learning(env, gamma=0.99, alpha=alpha, n_episodes=n_ep,
                      epsilon=eps, epsilon_decay=decay)

    # Evaluate
    rewards = evaluate_policy(env, Q, n_episodes=5000)
    success = np.mean(rewards) * 100

    results_grid.append({
        'alpha': alpha, 'epsilon': eps, 'decay': decay,
        'n_episodes': n_ep, 'success': success
    })

    if success > best_success:
        best_success = success
        best_params = {'alpha': alpha, 'epsilon': eps,
                       'epsilon_decay': decay, 'n_episodes': n_ep}

print("Top 5 configurations:")
print("-" * 70)
sorted_results = sorted(results_grid, key=lambda x: x['success'], reverse=True)
for i, r in enumerate(sorted_results[:5], 1):
    print(f"{i}. α={r['alpha']:.2f}, ε={r['epsilon']:.1f}, "
          f"decay={r['decay']:.5f} → Success={r['success']:.2f}%")

print(f"\nBest: {best_params} → {best_success:.2f}%")
```

**Typical findings**:
- α around 0.1-0.5 works best (not too fast, not too slow)
- High initial ε (1.0) helps explore initially
- Moderate decay (0.99995) balances exploration/exploitation
- 50k+ episodes needed for reliable convergence
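
When ranking configurations, keep in mind that each success rate is itself a noisy estimate. The sketch below computes the binomial standard error for an evaluation run. The 70% success rate is a hypothetical value used only for illustration, and `success_rate_stderr` is a helper introduced here.

```python
import math

def success_rate_stderr(p, n_eval):
    """Standard error, in percentage points, of a success rate p
    estimated from n_eval independent evaluation episodes."""
    return 100 * math.sqrt(p * (1 - p) / n_eval)

se = success_rate_stderr(0.70, 5000)
print(f"Standard error at 70% success over 5000 episodes: {se:.2f} points")
```

This works out to about 0.65 percentage points, so gaps of a point or so between configurations can be evaluation noise alone. Randomness in training adds more on top of that.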

</details>

## Exercise 3: Conceptual Analysis

**Question**: You're building an RL agent for the following scenarios. Which algorithm would you choose and why?

### Scenario A: Robot Navigation
- **Environment**: Robot moving in a warehouse
- **State space**: Continuous (x, y, orientation)
- **Actions**: Continuous (velocity, angular velocity)
- **Episodes**: Never terminate (robot runs continuously)
- **Model**: Unknown (real world)

**Which algorithm? Why?**

<details>
<summary>Click to see answer</summary>

**None of the tabular methods we learned!**

This requires:
1. Function approximation (continuous states)
2. Policy gradients or actor-critic (continuous actions)
3. Online learning (non-episodic)

Best choice: **PPO or SAC** (modern deep RL algorithms)
- PPO: Robust policy gradient method
- SAC: Handles continuous actions well, maximum entropy

Why not our algorithms:
- DP: Needs a known model and cannot enumerate a continuous state space
- MC: Need episodes to terminate
- SARSA/Q-Learning: Tabular (can't handle continuous states/actions)

**Key takeaway**: Tabular methods are foundations. Real robotics needs function approximation!

</details>

### Scenario B: Board Game AI (Chess)
- **Environment**: Chess board
- **State space**: Discrete but huge (on the order of 10^44 legal positions)
- **Actions**: Discrete (legal moves, varies by position)
- **Episodes**: Terminate (checkmate, draw)
- **Model**: Known (chess rules)

**Which algorithm? Why?**

<details>
<summary>Click to see answer</summary>

**For tabular methods: Value Iteration** (if state space were small)

But in practice: **AlphaZero-style approach**
- Monte Carlo Tree Search (MCTS) for planning
- Deep neural networks for value/policy approximation
- Self-play for training

Why not pure tabular:
- State space too large (a table with one entry per position cannot be stored)
- Need function approximation with neural networks

If we had a small board game (Tic-Tac-Toe):
- Use DP (Value Iteration) since we know the model
- Guaranteed optimal solution
- Fast because the state space is small (at most 3^9 = 19,683 board configurations, of which 5,478 are reachable in play)
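
The 3^9 figure counts every way to fill nine cells with X, O, or empty, including boards that can never occur in a real game. A short enumeration shows how many states are actually reachable. This is a sketch that assumes X moves first and play stops at a win or a full board.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def has_winner(board):
    """True if any row, column, or diagonal holds three identical marks."""
    return any(board[a] != ' ' and board[a] == board[b] == board[c]
               for a, b, c in WIN_LINES)

def reachable_states():
    """Depth-first enumeration of all boards reachable from the empty board."""
    start = ' ' * 9
    seen = {start}
    stack = [(start, 'X')]
    while stack:
        board, player = stack.pop()
        if has_winner(board) or ' ' not in board:
            continue  # terminal position: no further moves
        for i, cell in enumerate(board):
            if cell == ' ':
                child = board[:i] + player + board[i + 1:]
                if child not in seen:
                    seen.add(child)
                    stack.append((child, 'O' if player == 'X' else 'X'))
    return seen

states = reachable_states()
print(len(states), 3 ** 9)
```

A table this small fits easily in memory, which is why exact DP is practical for Tic-Tac-Toe and hopeless for chess.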

</details>

### Scenario C: Personalized News Recommendations
- **Environment**: User engagement with news articles
- **State space**: User features (reading history, preferences)
- **Actions**: Which article to recommend
- **Episodes**: Each user session
- **Model**: Unknown (user behavior is complex)

**Which algorithm? Why?**

<details>
<summary>Click to see answer</summary>

**Contextual Bandits** or **Off-policy RL (Q-Learning with replay)**

Best approach:
1. Collect logged data from existing system
2. Use off-policy learning (Q-Learning, importance sampling)
3. Function approximation (user/article features)

Why Q-Learning style:
- Off-policy: Can learn from historical data
- Don't need to explore on live users (can use logged data)
- Model-free: User behavior too complex to model

Why not SARSA:
- On-policy: Would need to explore with current policy
- Risky: Bad recommendations hurt user experience

Modern approach: Contextual bandits
- Simpler than full RL (no long-term rewards)
- Faster learning
- Better suited for immediate feedback
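
Learning from logged data usually starts with estimating how a new policy would have performed on that log. Below is a minimal sketch of inverse propensity scoring (a basic form of importance sampling) on a synthetic log. The contexts, click rates, and policies are all made up for illustration.

```python
import random

def ips_estimate(logged, target_prob):
    """Inverse propensity scoring: mean of reward * pi_target / pi_logging."""
    total = 0.0
    for context, action, reward, logging_prob in logged:
        total += reward * target_prob(context, action) / logging_prob
    return total / len(logged)

# Synthetic log (hypothetical): a uniform-random logging policy over 2 articles
rng = random.Random(0)
click_rate = {("sports", 0): 0.8, ("sports", 1): 0.2,
              ("tech", 0): 0.1, ("tech", 1): 0.7}
logged = []
for _ in range(20_000):
    context = rng.choice(["sports", "tech"])
    action = rng.choice([0, 1])
    reward = 1.0 if rng.random() < click_rate[(context, action)] else 0.0
    logged.append((context, action, reward, 0.5))  # 0.5 = logging probability

def target_prob(context, action):
    """Deterministic target policy: article 0 for sports, article 1 for tech."""
    best = 0 if context == "sports" else 1
    return 1.0 if action == best else 0.0

estimate = ips_estimate(logged, target_prob)
print(f"Estimated click rate of target policy: {estimate:.3f}")
```

In this synthetic setup the target policy's true click rate is 0.5 * 0.8 + 0.5 * 0.7 = 0.75, and the estimate should land close to it without the policy ever being shown to a user. The estimator only works if the logging probabilities are recorded and non-zero for every action the target policy might take.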

</details>

## Bonus Challenge: Implement Your Own Environment

Create a custom Gymnasium environment and compare algorithms!

```python
# YOUR CODE HERE
# Define a simple custom MDP (e.g., a grid world with special rules)
# Implement as a Gymnasium environment
# Run all algorithms and compare

# Example: 5x5 grid with:
# - Start at (0,0)
# - Goal at (4,4) with reward +10
# - Special power-up at (2,2) that doubles future rewards
#   (include a "has power-up" flag in the state so the problem stays Markov)
# - Traps that give -5 reward

# Hint: Inherit from gym.Env and implement:
# - reset()
# - step(action)
# - Define observation_space and action_space
```

Good luck! These exercises will solidify your understanding of when and how to apply RL algorithms.

In [None]:
print("="*70)
print("   Thank you for completing the RL Tutorial!")
print("="*70)
print("\nKey Takeaways:")
print("- DP (Policy/Value Iteration): Model-based, exact, optimal")
print("- MC (Monte Carlo): Model-free, episodic, unbiased")
print("- TD (SARSA/Q-Learning): Model-free, online, efficient")
print("\nRemember:")
print("- No single algorithm is best for all problems")
print("- Consider: model availability, state space size, episode structure")
print("- Tabular methods are foundations for modern deep RL")
print("\nNext Steps:")
print("- Experiment with different environments")
print("- Try function approximation")
print("- Learn about deep RL (DQN, PPO, SAC)")
print("- Build your own RL application!")
print("\n" + "="*70)
print("   The journey continues... Happy learning!")
print("="*70)