# Part 4: Monte Carlo Methods

In this notebook, we'll learn **Monte Carlo (MC) methods** - our first **model-free** algorithms that learn from experience without knowing the environment dynamics.

## What You'll Learn
- Why model-free methods are important
- Monte Carlo prediction (policy evaluation)
- First-visit vs Every-visit MC
- Monte Carlo control (finding optimal policy)
- Exploring starts and ε-soft policies

## Prerequisites
- Understanding of MDPs and value functions (Notebooks 01-02)
- Dynamic Programming concepts (Notebook 03)

Let's begin!

## Setup

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import time

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
np.random.seed(42)

print("Setup complete!")

In [None]:
# Create environment and helper variables
env = gym.make("FrozenLake-v1", is_slippery=True)

n_states = env.observation_space.n
n_actions = env.action_space.n
action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']
action_arrows = ['←', '↓', '→', '↑']

print("FrozenLake Environment")
print("=" * 40)
print(f"States: {n_states}")
print(f"Actions: {n_actions}")

In [None]:
# Visualization helper functions
def plot_value_function(V, title="Value Function", ax=None):
    """Plot value function as a heatmap."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    V_grid = V.reshape(nrow, ncol)
    
    im = ax.imshow(V_grid, cmap='RdYlGn', vmin=0, vmax=max(V.max(), 0.01))
    plt.colorbar(im, ax=ax, shrink=0.8)
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            color = 'white' if V_grid[i, j] < V.max() / 2 else 'black'
            ax.text(j, i, f'{cell}\n{V[state]:.3f}', ha='center', va='center',
                   fontsize=9, color=color)
    
    ax.set_xticks(range(ncol))
    ax.set_yticks(range(nrow))
    ax.set_title(title)
    return ax

def plot_policy(Q, title="Policy", ax=None):
    """Plot policy derived from Q-values."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    colors = {'S': 'lightblue', 'F': 'white', 'H': 'lightcoral', 'G': 'lightgreen'}
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            
            rect = plt.Rectangle((j, nrow-1-i), 1, 1, fill=True,
                                 facecolor=colors.get(cell, 'white'), edgecolor='black')
            ax.add_patch(rect)
            
            best_action = np.argmax(Q[state])
            
            if cell not in ['H', 'G']:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, 
                       f'{cell}\n{action_arrows[best_action]}',
                       ha='center', va='center', fontsize=14, fontweight='bold')
            else:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, cell,
                       ha='center', va='center', fontsize=14, fontweight='bold')
    
    ax.set_xlim(0, ncol)
    ax.set_ylim(0, nrow)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title)
    return ax

print("Visualization functions ready!")

---
# 1. Why Model-Free Methods?

## Limitations of Dynamic Programming

Dynamic Programming (Policy Iteration, Value Iteration) requires:
- Complete knowledge of transition probabilities $P_{ss'}^a$
- Complete knowledge of reward function $R_s^a$

In many real-world problems:
- The environment model is **unknown**
- The model is too **complex** to use directly
- Building the model is too **expensive**

## Model-Free Learning

**Model-free** methods learn directly from experience:
- No need to know P or R
- Learn from sampled episodes
- Can handle unknown environments

Monte Carlo methods are one class of model-free algorithms.

---
# 2. Monte Carlo Prediction

**Problem**: Estimate $V^\pi(s)$ or $Q^\pi(s,a)$ for a given policy $\pi$, without knowing the model.

## Key Idea

The value function is the **expected return**:

$$V^\pi(s) = E_\pi[G_t | S_t = s]$$

Monte Carlo estimates this expectation by **averaging sample returns**:

$$V^\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}$$

where $G_t^{(i)}$ is the return from state $s$ in the $i$-th episode.
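
To make the return $G_t$ concrete, here is a minimal sketch that computes the discounted return for every time step of one episode by working backwards from the end. The helper name `discounted_returns`, the reward sequence, and `gamma=0.5` are made up for illustration; they are chosen so the arithmetic is easy to check by hand.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every time step t of a single episode."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Work backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Hypothetical episode: reward of 1 only on the final step
rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [0.25, 0.5, 1.0]
```

Averaging such returns over many episodes, separately for each state, gives the MC estimate of $V^\pi(s)$.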

## Requirements

- Episodes must **terminate** (finite episodes)
- We learn from **complete episodes** (unlike TD methods)
- Updates happen only at the **end of an episode**
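
Since a new return for a state arrives only when an episode finishes, the average does not have to be recomputed from scratch each time. A running mean can be updated in place with $V \leftarrow V + \frac{1}{N}(G - V)$, which gives the same result as dividing a running sum by a count. The sketch below checks this on made-up returns; `incremental_mean` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def incremental_mean(sample_returns):
    """Running average of returns without storing the full history."""
    estimate = 0.0
    for n, G in enumerate(sample_returns, start=1):
        # Move the estimate towards the new return by a step of 1/n
        estimate += (G - estimate) / n
    return estimate

# Hypothetical returns observed for one state over four episodes
sample_returns = [1.0, 0.0, 0.0, 1.0]
running = incremental_mean(sample_returns)
batch = sum(sample_returns) / len(sample_returns)
print(running, batch)
```

Both numbers agree up to floating-point rounding.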

## First-Visit vs Every-Visit MC

When a state $s$ is visited multiple times in an episode:

**First-Visit MC**: Only use the return from the **first** time $s$ is visited

**Every-Visit MC**: Use returns from **every** time $s$ is visited

Both converge to $V^\pi(s)$ as the number of visits to $s$ goes to infinity.
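
The difference between the two variants shows up only when a state repeats within an episode. The sketch below uses a made-up three-step episode in which state `"A"` is visited twice; `visit_returns` and the episode itself are hypothetical and exist only for this illustration.

```python
def visit_returns(episode, gamma, first_visit_only):
    """Collect the returns credited to each state from one episode.

    episode is a list of (state, reward) pairs in time order.
    """
    # Compute G_t for every step by working backwards
    G = 0.0
    returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    returns.reverse()  # back to time order

    collected = {}
    for state, G in returns:
        if first_visit_only and state in collected:
            continue  # later visits are ignored in first-visit MC
        collected.setdefault(state, []).append(G)
    return collected

# Hypothetical episode: A -> B -> A, reward 1 on the last step
episode = [("A", 0.0), ("B", 0.0), ("A", 1.0)]
first = visit_returns(episode, gamma=0.5, first_visit_only=True)
every = visit_returns(episode, gamma=0.5, first_visit_only=False)
print(first)  # {'A': [0.25], 'B': [0.5]}
print(every)  # {'A': [0.25, 1.0], 'B': [0.5]}
```

First-visit MC credits `"A"` with the return from time step 0 only, while every-visit MC also records the return from the second visit.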

In [None]:
def generate_episode(env, policy):
    """
    Generate an episode following the given policy.
    
    Args:
        env: Gymnasium environment
        policy: Policy array π[s,a] with probabilities or π[s] with action
    
    Returns:
        episode: List of (state, action, reward) tuples
    """
    episode = []
    state, _ = env.reset()
    done = False
    
    while not done:
        # Get action from policy
        if len(policy.shape) == 1:
            action = int(policy[state])
        else:
            action = np.random.choice(len(policy[state]), p=policy[state])
        
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        done = terminated or truncated
    
    return episode

In [None]:
# Demonstrate episode generation
uniform_policy = np.ones((n_states, n_actions)) / n_actions

print("Example Episodes with Random Policy")
print("=" * 60)

for i in range(5):
    episode = generate_episode(env, uniform_policy)
    states = [s for s, a, r in episode]
    actions = [action_arrows[a] for s, a, r in episode]
    rewards = [r for s, a, r in episode]
    total_reward = sum(rewards)
    
    print(f"\nEpisode {i+1} (length={len(episode)}, reward={total_reward}):")
    print(f"  States:  {states}")
    print(f"  Actions: {actions}")
    print(f"  Rewards: {rewards}")

In [None]:
def mc_prediction_first_visit(env, policy, gamma, n_episodes, verbose=False):
    """
    First-Visit Monte Carlo Prediction for estimating V^π.
    
    Args:
        env: Gymnasium environment
        policy: Policy to evaluate
        gamma: Discount factor
        n_episodes: Number of episodes to sample
        verbose: Print progress
    
    Returns:
        V: Estimated state value function
        V_history: Value function at intervals for visualization
    """
    n_states = env.observation_space.n
    
    # Store returns for each state
    returns_sum = np.zeros(n_states)
    returns_count = np.zeros(n_states)
    V = np.zeros(n_states)
    
    V_history = [V.copy()]
    
    for episode_num in range(n_episodes):
        # Generate an episode
        episode = generate_episode(env, policy)
        
        # Record the time step at which each state is first visited
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        G = 0
        
        # Go backwards through the episode, accumulating the return
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            
            # First-visit: only the earliest occurrence of the state contributes
            # (a "seen" set filled while iterating backwards would pick the last visit)
            if first_visit[state] == t:
                returns_sum[state] += G
                returns_count[state] += 1
        
        # Update V
        for s in range(n_states):
            if returns_count[s] > 0:
                V[s] = returns_sum[s] / returns_count[s]
        
        # Save history at intervals
        if (episode_num + 1) % max(1, n_episodes // 10) == 0:
            V_history.append(V.copy())
            if verbose:
                print(f"Episode {episode_num + 1}/{n_episodes}")
    
    return V, V_history

In [None]:
# Run MC prediction for random policy
print("Monte Carlo Prediction (First-Visit)")
print("=" * 50)
print("Evaluating uniform random policy...\n")

start_time = time.time()
V_mc, V_history = mc_prediction_first_visit(env, uniform_policy, gamma=0.99, 
                                             n_episodes=50000, verbose=True)
mc_time = time.time() - start_time

print(f"\nTime taken: {mc_time:.2f} seconds")
print(f"\nEstimated V^π (random policy):")
print(V_mc.reshape(4, 4).round(4))

In [None]:
# Compare with DP solution (if we had the model)
# First, get the "true" values using DP
def extract_mdp(env):
    n_s = env.observation_space.n
    n_a = env.action_space.n
    P = np.zeros((n_s, n_a, n_s))
    R = np.zeros((n_s, n_a))
    for s in range(n_s):
        for a in range(n_a):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                P[s, a, next_s] += prob
                R[s, a] += prob * reward
    return P, R

def policy_evaluation_dp(P, R, policy, gamma, theta=1e-8):
    n_states = P.shape[0]
    n_actions = P.shape[1]
    V = np.zeros(n_states)
    
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += policy[s, a] * (R[s, a] + gamma * np.sum(P[s, a] * V))
        
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V

P, R = extract_mdp(env)
V_true = policy_evaluation_dp(P, R, uniform_policy, gamma=0.99)

print("Comparison: MC Prediction vs DP (True Values)")
print("=" * 60)
print(f"{'State':<8} {'MC Estimate':>15} {'True (DP)':>15} {'Error':>15}")
print("-" * 60)
for s in range(n_states):
    error = abs(V_mc[s] - V_true[s])
    print(f"{s:<8} {V_mc[s]:>15.4f} {V_true[s]:>15.4f} {error:>15.4f}")

print(f"\nMean Absolute Error: {np.mean(np.abs(V_mc - V_true)):.4f}")

In [None]:
# Visualize MC convergence
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Plot value function evolution
# V_history[0] is the initial estimate; a snapshot is then stored every n_episodes // 10 episodes
episodes_at = [0, 5000, 10000, 25000, 50000]
snapshot_every = 50000 // 10
for ax, ep in zip(axes.flat[:-1], episodes_at):
    hist_idx = min(ep // snapshot_every, len(V_history)-1)
    plot_value_function(V_history[hist_idx], title=f"After {ep} episodes", ax=ax)

# Plot comparison with true values
ax = axes.flat[-1]
x = np.arange(n_states)
width = 0.35
ax.bar(x - width/2, V_mc, width, label='MC Estimate', color='steelblue')
ax.bar(x + width/2, V_true, width, label='True (DP)', color='orange')
ax.set_xlabel('State')
ax.set_ylabel('Value')
ax.set_title('MC vs True Values')
ax.legend()

plt.suptitle("Monte Carlo Prediction Convergence (50,000 episodes)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 3. Monte Carlo Estimation of Q-values

For **control** (finding the optimal policy), we need to estimate **action-values** $Q^\pi(s,a)$, not just state-values.

Why? Because to improve a policy, we need to know how good each action is:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$

With just $V(s)$, we would need the model to compute:
$$Q(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a V(s')$$

But we're model-free! So we estimate Q directly.
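
For contrast, the sketch below shows what the model-based one-step lookahead would look like if $P$ and $R$ were available. The two-state, two-action MDP and the values in `V` are invented purely for illustration; nothing here comes from FrozenLake.

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] expected rewards
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],  # from state 1: action 0 moves to state 0, action 1 stays
])
R = np.array([
    [0.0, 1.0],
    [0.0, 0.0],
])
V = np.array([0.0, 2.0])  # made-up state values
gamma = 0.9

# Q(s, a) = R(s, a) + gamma * sum_{s'} P(s, a, s') * V(s')
Q_from_V = R + gamma * (P @ V)
print(Q_from_V)
```

Every term on the right-hand side except `V` is part of the model, which is exactly what a model-free method does not have.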

In [None]:
def mc_prediction_Q(env, policy, gamma, n_episodes):
    """
    First-Visit Monte Carlo Prediction for Q-values.
    
    Returns:
        Q: Estimated action-value function Q[s, a]
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))
    Q = np.zeros((n_states, n_actions))
    
    for _ in range(n_episodes):
        episode = generate_episode(env, policy)
        
        # Record the time step of the first visit to each (state, action) pair
        first_visit = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit.setdefault((state, action), t)
        G = 0
        
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            
            # First-visit check: only the earliest occurrence contributes
            if first_visit[(state, action)] == t:
                returns_sum[state, action] += G
                returns_count[state, action] += 1
        
        # Update Q
        for s in range(n_states):
            for a in range(n_actions):
                if returns_count[s, a] > 0:
                    Q[s, a] = returns_sum[s, a] / returns_count[s, a]
    
    return Q

# Estimate Q for random policy
print("Estimating Q^π for random policy...")
Q_random = mc_prediction_Q(env, uniform_policy, gamma=0.99, n_episodes=50000)

print("\nQ-values for state 0 (start):")
for a in range(n_actions):
    print(f"  Q(0, {action_names[a]}) = {Q_random[0, a]:.4f}")

---
# 4. Monte Carlo Control

**Goal**: Find the optimal policy $\pi^*$ using only experience.

## Approach: Generalized Policy Iteration

Like Policy Iteration, we alternate between:
1. **Policy Evaluation**: Estimate $Q^\pi$ using MC
2. **Policy Improvement**: Make policy greedy with respect to Q

## The Exploration Problem

If our policy is deterministic, we might never visit some (state, action) pairs!

**Solution 1: Exploring Starts**
- Start each episode from a random (state, action) pair
- Ensures all pairs have a chance to be visited

**Solution 2: ε-soft Policies**
- Always have non-zero probability of selecting any action
- E.g., ε-greedy: with probability ε choose a uniformly random action, otherwise choose the greedy (best) action
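
Written out as probabilities, ε-greedy gives every action a share of ε/|A| and adds the remaining 1 − ε to the greedy action, so no action ever has zero probability. A minimal sketch, with a hypothetical helper `epsilon_greedy_probs` and made-up Q-values:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    # Every action gets an equal share of the exploration mass
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action additionally receives the remaining mass
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# Hypothetical Q-values for one state with four actions
probs = epsilon_greedy_probs(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.2)
print(probs)  # roughly [0.05, 0.85, 0.05, 0.05]
```

Sampling from these probabilities is equivalent to the "flip a coin, then pick" formulation above, because the uniformly random branch can also land on the greedy action.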

In [None]:
def mc_control_epsilon_greedy(env, gamma, n_episodes, epsilon=0.1, 
                               epsilon_decay=0.9999, min_epsilon=0.01):
    """
    Monte Carlo Control with ε-greedy exploration.
    
    Args:
        env: Gymnasium environment
        gamma: Discount factor
        n_episodes: Number of episodes
        epsilon: Initial exploration rate
        epsilon_decay: Decay rate for epsilon
        min_epsilon: Minimum epsilon value
    
    Returns:
        Q: Learned action-value function
        policy: Learned policy
        stats: Training statistics
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    # Initialize Q arbitrarily and returns tracking
    Q = np.zeros((n_states, n_actions))
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))
    
    # Statistics
    episode_rewards = []
    episode_lengths = []
    epsilons = []
    
    for episode_num in range(n_episodes):
        # Generate episode using ε-greedy policy derived from Q
        episode = []
        state, _ = env.reset()
        done = False
        
        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = np.argmax(Q[state])
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
        
        # Record stats
        episode_rewards.append(sum(r for _, _, r in episode))
        episode_lengths.append(len(episode))
        epsilons.append(epsilon)
        
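        # Update Q using first-visit MC
        # Record the time step of the first visit to each (state, action) pair
        first_visit = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit.setdefault((state, action), t)
        G = 0
        
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            
            # Only the earliest occurrence of the pair contributes
            if first_visit[(state, action)] == t:
                returns_sum[state, action] += G
                returns_count[state, action] += 1
                Q[state, action] = returns_sum[state, action] / returns_count[state, action]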
        
        # Decay epsilon
        epsilon = max(min_epsilon, epsilon * epsilon_decay)
    
    # Extract greedy policy
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s, np.argmax(Q[s])] = 1.0
    
    stats = {
        'episode_rewards': episode_rewards,
        'episode_lengths': episode_lengths,
        'epsilons': epsilons
    }
    
    return Q, policy, stats

In [None]:
# Run MC Control
print("Monte Carlo Control with ε-greedy")
print("=" * 50)

start_time = time.time()
Q_mc, policy_mc, stats = mc_control_epsilon_greedy(
    env, gamma=0.99, n_episodes=100000, 
    epsilon=1.0, epsilon_decay=0.99995, min_epsilon=0.01
)
mc_control_time = time.time() - start_time

print(f"Training time: {mc_control_time:.2f} seconds")
print(f"Final epsilon: {stats['epsilons'][-1]:.4f}")

In [None]:
# Plot training progress
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Smooth rewards with moving average
window = 1000
rewards_smooth = np.convolve(stats['episode_rewards'], 
                              np.ones(window)/window, mode='valid')

# Episode rewards
axes[0, 0].plot(rewards_smooth)
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Reward (moving avg)')
axes[0, 0].set_title(f'Learning Curve (window={window})')

# Epsilon decay
axes[0, 1].plot(stats['epsilons'])
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Epsilon')
axes[0, 1].set_title('Exploration Rate Decay')

# Final Q-values heatmap
V_from_Q = np.max(Q_mc, axis=1)
plot_value_function(V_from_Q, title="Learned V* = max_a Q(s,a)", ax=axes[1, 0])

# Learned policy
plot_policy(Q_mc, title="Learned Policy", ax=axes[1, 1])

plt.suptitle("Monte Carlo Control Results (100,000 episodes)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Evaluate the learned policy
def evaluate_policy(env, Q, n_episodes=10000):
    """Evaluate a greedy policy derived from Q."""
    rewards = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = np.argmax(Q[state])
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        
        rewards.append(total_reward)
    
    return np.array(rewards)

# Compare learned policy with optimal (from DP) and random
print("Policy Evaluation Comparison")
print("=" * 50)

# MC learned policy
rewards_mc = evaluate_policy(env, Q_mc, n_episodes=10000)
print(f"MC Policy: Success rate = {np.mean(rewards_mc)*100:.2f}%")

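# Random baseline: a fixed deterministic policy picked at random
# (greedy w.r.t. random Q-values). This is not the uniform random policy used earlier.
Q_arbitrary = np.random.rand(n_states, n_actions)
rewards_random = evaluate_policy(env, Q_arbitrary, n_episodes=10000)
print(f"Random Policy: Success rate = {np.mean(rewards_random)*100:.2f}%")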

# Optimal policy (from Value Iteration)
def value_iteration(P, R, gamma, theta=1e-8):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = np.max([R[s, a] + gamma * np.sum(P[s, a] * V) for a in range(n_actions)])
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    # Extract Q
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * np.sum(P[s, a] * V)
    return Q

Q_optimal = value_iteration(P, R, gamma=0.99)
rewards_optimal = evaluate_policy(env, Q_optimal, n_episodes=10000)
print(f"Optimal Policy (DP): Success rate = {np.mean(rewards_optimal)*100:.2f}%")

In [None]:
# Visualize policy comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

policies = ['Random', 'MC Learned', 'Optimal (DP)']
success_rates = [np.mean(rewards_random)*100, np.mean(rewards_mc)*100, np.mean(rewards_optimal)*100]
colors = ['gray', 'steelblue', 'green']

# Bar chart
bars = axes[0].bar(policies, success_rates, color=colors, edgecolor='black')
axes[0].set_ylabel('Success Rate (%)')
axes[0].set_title('Policy Performance Comparison')
axes[0].set_ylim(0, 100)
for bar, rate in zip(bars, success_rates):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{rate:.1f}%', ha='center', fontweight='bold')

# MC learned policy
plot_policy(Q_mc, title="MC Learned Policy", ax=axes[1])

# Optimal policy
plot_policy(Q_optimal, title="Optimal Policy (DP)", ax=axes[2])

plt.tight_layout()
plt.show()

---
# 5. Exploring Starts MC Control

An alternative to ε-greedy is **Exploring Starts**: every (state, action) pair has non-zero probability of being the starting point of an episode.

With infinitely many episodes, every pair is then visited infinitely often, so the Q estimates stay accurate for all actions and not just the currently greedy ones.
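
As a sketch of what a true exploring start would sample, the snippet below draws a uniformly random non-terminal state and a random action. It assumes the default 4x4 FrozenLake layout, hard-coded here as `desc`, and a hypothetical helper `sample_exploring_start`. Placing a real environment into the sampled state depends on its internals and is deliberately left out.

```python
import numpy as np

# Assumed layout of the default 4x4 map (S=start, F=frozen, H=hole, G=goal)
desc = ["SFFF", "FHFH", "FFFH", "HFFG"]

def sample_exploring_start(desc, n_actions, rng):
    """Sample a random (state, action) pair, skipping terminal cells."""
    flat = "".join(desc)
    # Holes and the goal end the episode, so they cannot be start states
    non_terminal = [s for s, cell in enumerate(flat) if cell not in "HG"]
    state = int(rng.choice(non_terminal))
    action = int(rng.integers(n_actions))
    return state, action

rng = np.random.default_rng(0)
state, action = sample_exploring_start(desc, n_actions=4, rng=rng)
print(state, action)
```

Because every non-terminal pair has non-zero probability under this sampler, it satisfies the exploring-starts condition, provided the environment can actually be started there.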

In [None]:
def mc_control_exploring_starts(env, gamma, n_episodes):
    """
    Monte Carlo Control with Exploring Starts.
    
    Note: This requires ability to set the initial state,
    which isn't always possible in real environments.
    """
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    
    Q = np.zeros((n_states, n_actions))
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))
    
    episode_rewards = []
    
    for _ in range(n_episodes):
        # Exploring start: ideally a random initial state AND action.
        # FrozenLake's reset() always returns the start state, so we only
        # approximate this with the normal start and a random first action.
        # As a result, non-greedy actions are tried only in the start state.
        state, _ = env.reset()
        
        # Random first action (exploring start)
        first_action = np.random.randint(n_actions)
        
        # Generate episode
        episode = []
        action = first_action
        done = False
        
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
            
            if not done:
                # Follow greedy policy after first action
                action = np.argmax(Q[state])
        
        episode_rewards.append(sum(r for _, _, r in episode))
        
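        # Update Q using first-visit MC
        # Record the time step of the first visit to each (state, action) pair
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0
        
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            
            # Only the earliest occurrence of the pair contributes
            if first_visit[(s, a)] == t:
                returns_sum[s, a] += G
                returns_count[s, a] += 1
                Q[s, a] = returns_sum[s, a] / returns_count[s, a]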
    
    return Q, episode_rewards

# Run Exploring Starts MC
print("Monte Carlo Control with Exploring Starts")
print("=" * 50)

Q_es, rewards_es = mc_control_exploring_starts(env, gamma=0.99, n_episodes=100000)

# Evaluate
rewards_es_eval = evaluate_policy(env, Q_es, n_episodes=10000)
print(f"Exploring Starts Policy: Success rate = {np.mean(rewards_es_eval)*100:.2f}%")

---
# 6. MC vs DP: Key Differences

| Aspect | Dynamic Programming | Monte Carlo |
|--------|--------------------|--------------|
| **Model** | Requires complete model | Model-free |
| **Updates** | Bootstraps (uses estimated values) | Uses actual returns |
| **Episodes** | No episodes needed (sweeps over all states) | Must wait for episode end |
| **Variance** | No sampling variance (uses exact expectations) | Higher (uses sampled returns) |
| **Bias** | Biased by initialization | Unbiased (true returns) |
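
The variance row can be illustrated without any environment. The sketch below treats each episode's return as a coin flip with a made-up success probability `p_success`, then measures how much an MC estimate fluctuates across repeated experiments for two sample sizes. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p_success = 0.1  # hypothetical probability that an episode returns 1

def spread_of_estimates(n_episodes, n_repeats=200):
    """Standard deviation of the MC estimate across repeated experiments."""
    # Each row is one experiment: n_episodes returns of 0 or 1
    returns = rng.binomial(1, p_success, size=(n_repeats, n_episodes))
    estimates = returns.mean(axis=1)
    return estimates.std()

spread_small = spread_of_estimates(100)
spread_large = spread_of_estimates(10_000)
print(f"Spread with    100 episodes: {spread_small:.4f}")
print(f"Spread with 10,000 episodes: {spread_large:.4f}")
```

The spread shrinks roughly with the square root of the number of episodes, which is why MC estimates typically need many episodes to become stable.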

In [None]:
# Compare convergence characteristics
print("MC Characteristics Demonstration")
print("=" * 50)

# Run MC with different numbers of episodes
episode_counts = [1000, 5000, 10000, 50000, 100000]
mc_errors = []

for n_ep in episode_counts:
    Q_temp, _, _ = mc_control_epsilon_greedy(env, gamma=0.99, n_episodes=n_ep,
                                             epsilon=1.0, epsilon_decay=0.9999, min_epsilon=0.01)
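    # Compare with optimal Q. ε-greedy MC control evaluates an ε-soft policy,
    # and rarely visited pairs keep their initial value of 0, so some gap to Q* can remain.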
    error = np.mean(np.abs(Q_temp - Q_optimal))
    mc_errors.append(error)
    print(f"{n_ep:>7} episodes: Mean |Q - Q*| = {error:.4f}")

# Plot
plt.figure(figsize=(10, 5))
plt.plot(episode_counts, mc_errors, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Episodes')
plt.ylabel('Mean Absolute Error in Q')
plt.title('MC Control Convergence to Optimal Q')
plt.xscale('log')
plt.grid(True, alpha=0.3)
plt.show()

---
# Summary

## Monte Carlo Methods

| Method | Purpose | Key Feature |
|--------|---------|-------------|
| **MC Prediction** | Estimate $V^\pi$ or $Q^\pi$ | Average sample returns |
| **MC Control** | Find $\pi^*$ | Generalized Policy Iteration |
| **First-Visit MC** | Each state counted once per episode | Simple, unbiased |
| **Every-Visit MC** | All visits counted | More data per episode; biased but consistent |

## Key Takeaways

1. **Model-free**: MC learns from experience, no need for P and R
2. **Uses complete episodes**: Must wait until episode ends to update
3. **Unbiased estimates**: Uses actual returns, not bootstrapped values
4. **High variance**: Sample returns can vary a lot
5. **Exploration is crucial**: Need mechanisms like ε-greedy or exploring starts

## Limitations

- Only works for **episodic** tasks (must terminate)
- Updates only at **episode end** (slow for long episodes)
- **High variance** can slow convergence

## Next Steps

In the next notebook (**05_temporal_difference.ipynb**), we'll learn **TD methods** that:
- Update every step (not just at episode end)
- Work for continuing (non-episodic) tasks
- Include SARSA and Q-Learning

In [None]:
print("Congratulations! You've completed Part 4 of the RL Tutorial!")
print("\nKey takeaways:")
print("- Monte Carlo methods learn from complete episodes")
print("- They don't need a model - completely model-free")
print("- MC Prediction estimates V or Q by averaging returns")
print("- MC Control uses ε-greedy or exploring starts for exploration")
print("- MC has high variance but no bias from bootstrapping")
print("\nNext: 05_temporal_difference.ipynb")