<a href="https://colab.research.google.com/github/mdehghani86/RL_labs/blob/master/enhanced_mab_lab_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 2: Multi-Armed Bandits
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 2 | Intermediate Level | 120 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        The multi-armed bandit problem models decision-making under uncertainty. An agent repeatedly chooses among k actions,
        receiving numerical rewards from stationary probability distributions. The challenge is balancing exploration
        (trying different actions to find the best) with exploitation (choosing the current best action).
        This lab reproduces key results from <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>, Chapter 2.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand the exploration-exploitation tradeoff</li>
        <li>Implement the 10-armed testbed</li>
        <li>Compare ε-greedy strategies</li>
        <li>Analyze optimistic initial values</li>
        <li>Reproduce Figures 2.1, 2.2, and 2.3</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Key Concepts</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">q*(a)</code> = true action value</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Qt(a)</code> = estimated value at time t</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">ε-greedy</code> = exploration strategy</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">α</code> = step-size parameter</div>
    </div>
</td>
</tr>
</table>

## Configuration and Setup

In [None]:
# ============================================
# CELL 1: Environment Setup and Configuration
# Purpose: Import libraries and set visualization parameters
# ============================================

import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List
import warnings
from IPython.display import display, HTML
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
warnings.filterwarnings('ignore')

# Configurable color scheme - modify these to change plot colors
COLORS = {
    'greedy': '#008000',      # Green for ε=0
    'epsilon_01': '#FF0000',  # Red for ε=0.01
    'epsilon_1': '#0000FF',   # Blue for ε=0.1
    'optimistic': '#00BFFF',  # Cyan for optimistic
    'realistic': '#808080',   # Gray for realistic
    'violin': '#7f7f7f'       # Gray for violin plots
}

# Standard parameters from Sutton & Barto
K = 10          # Number of arms
STEPS = 1000    # Time steps per run
RUNS = 2000     # Number of independent runs

# Configure matplotlib for better plots
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 11
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['legend.fontsize'] = 10

print("✅ Environment setup complete!")
print(f"📊 Default configuration: {K} arms, {STEPS} steps, {RUNS} runs")

In [None]:
# ============================================
# CELL 2: Custom Pretty Print Function
# Purpose: Create beautiful output displays
# ============================================

def pretty_print(title, content, color='#17a2b8'):
    """Display formatted output in a gradient box"""
    # Gradient runs from `color` to a fixed dark teal (not derived from `color`)
    darker_color = '#0e5a63'

    html = f'''
    <div style="border-radius: 5px;
                margin: 10px 0;
                width: 30cm;
                max-width: 30cm;
                box-sizing: border-box;
                box-shadow: 0 2px 4px rgba(0,0,0,0.1);">
        <div style="background: linear-gradient(90deg, {color} 0%, {darker_color} 100%);
                    padding: 10px 15px; border-radius: 5px 5px 0 0;">
            <strong style="color: white; font-size: 14px;">{title}</strong>
        </div>
        <div style="background: #f8f9fa; padding: 10px 15px; border-radius: 0 0 5px 5px;
                    border-left: 3px solid {color};">
            <div style="color: rgba(73,80,87,0.8); font-size: 12px;">{content}</div>
        </div>
    </div>
    '''
    display(HTML(html))

# Test our pretty print function
pretty_print("Welcome to RL Lab Series!",
             "This function will help us display information beautifully throughout the lab.")

## Interactive Exploration: Understanding Multi-Armed Bandits

Before diving into algorithms, let's build intuition about the multi-armed bandit problem through interactive exploration.

In [None]:
# ============================================
# CELL 3: Interactive Bandit Exploration - Basic Setup
# Purpose: Let students interact with a simple bandit
# ============================================

class InteractiveBandit:
    """Simple bandit for interactive exploration"""

    def __init__(self, k_arms=5, seed=42):
        np.random.seed(seed)
        self.k = k_arms
        # True action values (hidden from student initially)
        self.q_true = np.random.randn(k_arms)
        # Keep track of what student has tried
        self.action_counts = np.zeros(k_arms)
        self.total_rewards = np.zeros(k_arms)
        self.history = []

    def pull_arm(self, action):
        """Pull an arm and get reward"""
        if action < 0 or action >= self.k:
            return None

        # Get noisy reward: R ~ N(q*(a), 1)
        reward = self.q_true[action] + np.random.randn()

        # Update statistics
        self.action_counts[action] += 1
        self.total_rewards[action] += reward
        self.history.append((action, reward))

        return reward

    def get_estimates(self):
        """Get current action value estimates"""
        estimates = np.zeros(self.k)
        for i in range(self.k):
            if self.action_counts[i] > 0:
                estimates[i] = self.total_rewards[i] / self.action_counts[i]
        return estimates

    def show_status(self):
        """Display current status"""
        estimates = self.get_estimates()

        status_html = "<div style='font-family: monospace;'>"
        status_html += "<table style='border-collapse: collapse; margin: 10px 0;'>"
        status_html += "<tr><th style='padding: 5px; border: 1px solid #ccc;'>Arm</th>"
        status_html += "<th style='padding: 5px; border: 1px solid #ccc;'>Times Pulled</th>"
        status_html += "<th style='padding: 5px; border: 1px solid #ccc;'>Avg Reward</th></tr>"

        for i in range(self.k):
            status_html += f"<tr><td style='padding: 5px; border: 1px solid #ccc; text-align: center;'>{i+1}</td>"
            status_html += f"<td style='padding: 5px; border: 1px solid #ccc; text-align: center;'>{int(self.action_counts[i])}</td>"
            if self.action_counts[i] > 0:
                status_html += f"<td style='padding: 5px; border: 1px solid #ccc; text-align: center;'>{estimates[i]:.3f}</td></tr>"
            else:
                status_html += "<td style='padding: 5px; border: 1px solid #ccc; text-align: center;'>---</td></tr>"

        status_html += "</table></div>"

        total_pulls = int(np.sum(self.action_counts))
        if total_pulls > 0:
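            # Rank only arms that have been pulled; unpulled arms hold a placeholder estimate of 0
            best_arm = np.argmax(np.where(self.action_counts > 0, estimates, -np.inf)) + 1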
            avg_reward = np.sum(self.total_rewards) / total_pulls
            status_html += f"<p><strong>Total pulls:</strong> {total_pulls} | "
            status_html += f"<strong>Current best arm:</strong> {best_arm} | "
            status_html += f"<strong>Overall avg reward:</strong> {avg_reward:.3f}</p>"

        display(HTML(status_html))

    def reveal_truth(self):
        """Reveal the true action values"""
        optimal_arm = np.argmax(self.q_true) + 1
        pretty_print("🎯 TRUE ACTION VALUES REVEALED!",
                    f"True values: {[f'{q:.3f}' for q in self.q_true]}<br>" +
                    f"Optimal arm: {optimal_arm} (value: {self.q_true[optimal_arm-1]:.3f})")

# Create a bandit for exploration
bandit = InteractiveBandit(k_arms=5, seed=42)

pretty_print("🎰 Interactive Bandit Ready!",
             "You have a 5-armed bandit. Each arm gives rewards from a different distribution.<br>" +
             "Your goal: figure out which arm is best by pulling arms and observing rewards.<br>" +
             "<strong>Think:</strong> How will you balance trying new arms vs. sticking with good ones?")

In [None]:
# ============================================
# CELL 4: Interactive Bandit - Manual Pulling
# Purpose: Let students manually pull arms
# ============================================

@interact_manual(arm=widgets.IntSlider(min=1, max=5, value=1, description='Arm to pull:'))
def pull_bandit_arm(arm):
    """Pull a bandit arm and see the result"""
    reward = bandit.pull_arm(arm - 1)  # Convert to 0-indexed

    if reward is not None:
        pretty_print(f"🎲 Pulled Arm {arm}",
                    f"Reward received: {reward:.3f}",
                    color='#28a745' if reward > 0 else '#dc3545')
        bandit.show_status()
    else:
        pretty_print("❌ Invalid Arm", "Please select an arm between 1 and 5")

print("👆 Use the slider to select an arm, then click 'Run Interact' to pull it!")
print("💡 Try different strategies: random exploration, stick with best, systematic testing...")

In [None]:
# ============================================
# CELL 5: Interactive Strategy Comparison
# Purpose: Let students compare different exploration strategies
# ============================================

def simulate_strategy(strategy_name, n_steps=50):
    """Simulate different bandit strategies"""
    # Create fresh bandit for each strategy
    test_bandit = InteractiveBandit(k_arms=5, seed=42)

    rewards = []
    actions = []

    for step in range(n_steps):
        if strategy_name == "Random":
            # Pure random exploration
            action = np.random.randint(0, 5)

        elif strategy_name == "Greedy":
            # Always pick current best (pure exploitation)
            estimates = test_bandit.get_estimates()
            if step < 5:  # Try each arm once first
                action = step
            else:
                action = np.argmax(estimates)

        elif strategy_name == "ε-greedy (ε=0.1)":
            # 10% random, 90% greedy
            estimates = test_bandit.get_estimates()
            if step < 5:  # Try each arm once first
                action = step
            elif np.random.random() < 0.1:
                action = np.random.randint(0, 5)
            else:
                action = np.argmax(estimates)

        else:
            # Fail loudly instead of reusing a stale or undefined action
            raise ValueError(f"Unknown strategy: {strategy_name}")

        reward = test_bandit.pull_arm(action)
        rewards.append(reward)
        actions.append(action)

    return rewards, actions, test_bandit

@interact(strategy=widgets.Dropdown(
    options=["Random", "Greedy", "ε-greedy (ε=0.1)"],
    value="Random",
    description='Strategy:'
))
def compare_strategies(strategy):
    """Compare different bandit strategies"""
    rewards, actions, test_bandit = simulate_strategy(strategy, n_steps=100)

    # Calculate metrics
    total_reward = np.sum(rewards)
    avg_reward = np.mean(rewards)
    optimal_arm = np.argmax(test_bandit.q_true)
    optimal_selections = np.sum(np.array(actions) == optimal_arm)
    optimal_percentage = (optimal_selections / len(actions)) * 100

    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Cumulative reward
    ax1.plot(np.cumsum(rewards), linewidth=2)
    ax1.set_xlabel('Steps')
    ax1.set_ylabel('Cumulative Reward')
    ax1.set_title(f'{strategy}: Cumulative Reward')
    ax1.grid(True, alpha=0.3)

    # Action selection histogram
    ax2.hist(actions, bins=np.arange(6)-0.5, alpha=0.7, edgecolor='black')
    ax2.set_xlabel('Arm')
    ax2.set_ylabel('Times Selected')
    ax2.set_title(f'{strategy}: Arm Selection Frequency')
    ax2.set_xticks(range(5))
    ax2.set_xticklabels([f'Arm {i+1}' for i in range(5)])

    # Highlight optimal arm
    ax2.axvline(optimal_arm, color='red', linestyle='--', alpha=0.7, label=f'Optimal (Arm {optimal_arm+1})')
    ax2.legend()

    plt.tight_layout()
    plt.show()

    # Display results
    results_text = f"Total Reward: {total_reward:.2f}<br>"
    results_text += f"Average Reward: {avg_reward:.3f}<br>"
    results_text += f"Optimal Arm Selected: {optimal_percentage:.1f}% of the time<br>"
    results_text += f"Strategy: {strategy}"

    pretty_print(f"📈 Results for {strategy}", results_text)

print("🔄 Select different strategies to see how they perform!")
print("🎯 Notice the trade-off between exploration and exploitation")

In [None]:
# ============================================
# CELL 6: Reveal the Truth and Reflect
# Purpose: Show true values and discuss strategies
# ============================================

# Reveal the true action values
bandit.reveal_truth()

# Show final status of student's exploration
pretty_print("📊 Your Exploration Results", "How well did you do compared to the optimal strategy?")
bandit.show_status()

pretty_print("🤔 Reflection Questions",
             "1. How did you decide which arms to try?<br>" +
             "2. When did you stop exploring and start exploiting?<br>" +
             "3. What would you do differently with more time?<br>" +
             "4. How did the different automated strategies compare?")

pretty_print("🎯 Key Insight",
             "This is the <strong>exploration-exploitation dilemma</strong>!<br>" +
             "• <strong>Explore</strong>: Try new actions to find better options<br>" +
             "• <strong>Exploit</strong>: Use current knowledge to maximize reward<br>" +
             "The challenge is finding the right balance between these two.")

## Part 1: The 10-Armed Testbed (Figure 2.1)

### Mathematical Foundation

The **k-armed bandit problem** is a fundamental framework for studying sequential decision-making under uncertainty. Formally:

- **Actions**: $A_t \in \{1, 2, ..., k\}$ at time step $t$
- **True Action Values**: $q_*(a) = \mathbb{E}[R_t | A_t = a]$ for each action $a$
- **Rewards**: $R_t \sim \mathcal{N}(q_*(A_t), 1)$ (Gaussian with unit variance)
- **Goal**: Maximize cumulative reward $\sum_{t=1}^T R_t$

### Historical Context

The multi-armed bandit problem takes its name from the "one-armed bandit" slot machines in casinos. The problem was formalized by Herbert Robbins in 1952, though the mathematical foundations trace back to Thompson (1933). It has become a cornerstone of:

- **Online advertising** (ad selection)
- **Clinical trials** (treatment allocation)
- **Recommendation systems** (content selection)
- **Resource allocation** (server load balancing)

The **10-armed testbed** specifically uses $k=10$ actions with true values $q_*(a) \sim \mathcal{N}(0, 1)$, providing a standardized benchmark for algorithm comparison.

In [None]:
# ============================================
# CELL 7: Create and Visualize the 10-Armed Testbed
# Purpose: Generate Figure 2.1 showing reward distributions
# Mathematical Note: q*(a) ~ N(0,1), R_t ~ N(q*(a), 1)
# ============================================

# Set seed for a reproducible example (values will differ from the figure in Sutton & Barto)
np.random.seed(0)

# Generate true action values q*(a) ~ N(0, 1)
# This represents the expected reward for each action
q_true = np.random.randn(K)

# Display the true values
pretty_print("🎯 True Action Values Generated",
             f"q*(a) for each arm: {[f'{q:.3f}' for q in q_true]}<br>" +
             f"Optimal action: Arm {np.argmax(q_true) + 1} with value {np.max(q_true):.3f}")

# Create Figure 2.1: Reward distributions
fig, ax = plt.subplots(figsize=(10, 6))

# Generate reward distribution samples for visualization
# Each action gives rewards R ~ N(q*(a), 1)
n_samples = 10000
rewards = np.zeros((n_samples, K))
for i in range(K):
    # Rewards are normally distributed around the true action value
    rewards[:, i] = q_true[i] + np.random.randn(n_samples)

# Create violin plots to show reward distributions
parts = ax.violinplot(rewards, positions=range(1, K+1), widths=0.7,
                      showmeans=False, showextrema=False)

# Style the violins
for pc in parts['bodies']:
    pc.set_facecolor(COLORS['violin'])
    pc.set_alpha(0.7)
    pc.set_edgecolor('black')
    pc.set_linewidth(0.5)

# Add horizontal lines for true values q*(a)
for i in range(K):
    ax.hlines(q_true[i], i+0.7, i+1.3, colors='black', linewidth=1.5)
    ax.text(i+1.4, q_true[i], f'$q_*({i+1})$', fontsize=10, va='center')

# Formatting
ax.set_xlabel('Action', fontsize=12)
ax.set_ylabel('Reward\ndistribution', fontsize=12)
ax.set_xticks(range(1, K+1))
ax.set_xlim(0.5, K+0.5)
ax.set_ylim(-3, 3)
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5, alpha=0.5)
ax.grid(True, alpha=0.3, axis='y')

plt.title('Figure 2.1: The 10-armed testbed', fontsize=12)
plt.tight_layout()
plt.show()

pretty_print("📊 Testbed Analysis",
             f"Each violin shows the distribution of rewards for that action.<br>" +
             f"The black lines show the true expected values q*(a).<br>" +
             f"Notice how rewards are noisy - this creates the exploration challenge!")

## Part 2: ε-Greedy Action Selection

### Algorithm Theory

The **ε-greedy** method is one of the simplest approaches to balance exploration and exploitation. It works by:

1. **Exploitation** (probability $1-ε$): Choose the action with highest estimated value
   $$A_t = \arg\max_a Q_t(a)$$

2. **Exploration** (probability $ε$): Choose a random action uniformly
   $$A_t \sim \text{Uniform}(\{1, 2, ..., k\})$$
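
The two cases above can be sketched in a few lines. This is a minimal illustration rather than the lab's implementation: the helper name `select_action`, the example estimates, and the use of `numpy.random.default_rng` are assumptions made for this example, and actions are 0-indexed as in the code cells.

```python
import numpy as np

def select_action(q_estimates: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Pick an action ε-greedily (hypothetical helper for illustration)."""
    if rng.random() < epsilon:
        # Explore: uniform over all k actions (this may also land on the greedy one)
        return int(rng.integers(len(q_estimates)))
    # Exploit: greedy action, ties broken uniformly at random
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_estimates = np.array([0.2, 1.5, -0.3, 0.9])  # made-up estimates; action 1 is greedy
picks = np.array([select_action(q_estimates, 0.1, rng) for _ in range(20_000)])
greedy_fraction = np.mean(picks == 1)
print(f"Greedy action chosen {greedy_fraction:.3f} of the time")
```

Because an exploratory draw can also land on the greedy action, the greedy action is selected with probability $1 - ε + ε/k$, which is $0.925$ for $ε = 0.1$ and $k = 4$; the printed fraction should be close to that.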

### Action Value Estimation

We maintain estimates $Q_t(a)$ of the true action values $q_*(a)$ using the **sample average method**:

$$Q_t(a) = \frac{\text{sum of rewards when action } a \text{ taken}}{\text{number of times action } a \text{ taken}} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}}$$

This can be computed incrementally as:
$$Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n]$$

where $n$ is the number of times action $a$ has been selected.
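
As a quick sanity check, the incremental rule should reproduce the ordinary mean of the rewards seen so far. The sketch below assumes a made-up reward stream for a single action; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=0.5, scale=1.0, size=200)  # made-up rewards for one action

q = 0.0  # Q_1, the estimate before any reward is seen
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n  # Q_{n+1} = Q_n + (1/n) [R_n - Q_n]

batch_mean = rewards.mean()
print(np.isclose(q, batch_mean))
```

Note that at $n = 1$ the update sets the estimate to $R_1$ exactly, so with sample averages the initial value $Q_1$ stops mattering after an action's first selection.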

### Theoretical Properties

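- **Convergence**: If every action is selected infinitely often (guaranteed when $ε > 0$), then $Q_t(a) \to q_*(a)$ with probability 1 as $t \to \infty$, by the law of large numbers
- **Exploration guarantee**: At every step, each action is selected with probability at least $ε/k$
- **Regret**: With a fixed $ε > 0$ the agent keeps taking exploratory actions forever, so the per-step gap to optimal performance shrinks as the estimates improve but does not vanish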
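
As a worked example of the exploration guarantee, assume the estimates have converged far enough that the greedy action is the optimal action $a^*$. The optimal action is then chosen on every greedy step and on a $1/k$ share of the exploratory steps:

$$\Pr(A_t = a^*) = (1 - ε) + \frac{ε}{k}$$

For the 10-armed testbed ($k = 10$) this gives a ceiling of $0.9 + 0.01 = 0.91$ for $ε = 0.1$ and $0.99 + 0.001 = 0.991$ for $ε = 0.01$.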

In [None]:
# ============================================
# CELL 8: Core ε-Greedy Algorithm Implementation
# Purpose: Define action selection and value update methods with detailed explanations
# ============================================

def create_bandit() -> np.ndarray:
    """
    Create a new 10-armed bandit problem

    Returns:
        q_true: Array of true action values q*(a) ~ N(0,1)
    """
    return np.random.randn(K)

def get_reward(action: int, q_true: np.ndarray) -> float:
    """
    Get reward for an action following the bandit model

    Mathematical model: R ~ N(q*(a), 1)
    This means rewards are normally distributed around the true action value
    with unit variance, creating noise that makes learning challenging.

    Args:
        action: Selected action index (0 to K-1)
        q_true: True action values

    Returns:
        reward: Noisy reward sampled from N(q*(action), 1)
    """
    return q_true[action] + np.random.randn()

def epsilon_greedy(Q: np.ndarray, epsilon: float) -> int:
    """
    ε-greedy action selection - the heart of exploration vs exploitation

    This function implements the fundamental trade-off in reinforcement learning:
    - With probability ε: EXPLORE (try random actions to learn more)
    - With probability 1-ε: EXPLOIT (use current knowledge optimally)

    Mathematical formulation:
    A_t = {
        argmax_a Q_t(a)           with probability 1-ε  (greedy)
        random action             with probability ε     (exploration)
    }

    Args:
        Q: Current action value estimates Q_t(a)
        epsilon: Exploration probability ε ∈ [0,1]

    Returns:
        Selected action index
    """
    if np.random.random() < epsilon:
        # EXPLORATION: Choose random action
        # This helps us learn about actions we haven't tried much
        return np.random.randint(len(Q))  # size taken from Q rather than the global K
    else:
        # EXPLOITATION: Choose greedy action (break ties randomly)
        # Find the action(s) with maximum estimated value
        max_Q = np.max(Q)
        # Handle ties by randomly selecting among equally good actions
        return np.random.choice(np.where(Q == max_Q)[0])

def update_estimates(Q: np.ndarray, N: np.ndarray, action: int,
                    reward: float, alpha: float = None) -> None:
    """
    Update action value estimates incrementally (sample average or constant step size)

    This implements the fundamental learning rule in bandits:
    NewEstimate = OldEstimate + StepSize × [Target - OldEstimate]

    Two update rules:
    1. Sample Average (α = 1/n): Q_n+1 = Q_n + (1/n)[R_n - Q_n]
       - Gives equal weight to all past rewards
       - Converges to true mean with probability 1

    2. Constant Step Size (α = constant): Q_n+1 = Q_n + α[R_n - Q_n]
       - Recent rewards have more influence
       - Better for non-stationary problems

    Args:
        Q: Action value estimates (modified in place)
        N: Action selection counts
        action: Action that was selected
        reward: Observed reward R_t
        alpha: Step size parameter (None for sample average)
    """
    # Increment the count for this action
    N[action] += 1

    if alpha is None:
        # Sample average update: step size = 1/n
        # This gives equal weight to all past observations
        step_size = 1.0 / N[action]
        Q[action] += step_size * (reward - Q[action])
    else:
        # Constant step size update
        # This gives more weight to recent observations
        Q[action] += alpha * (reward - Q[action])

# Display the key algorithmic components
pretty_print("🧠 ε-Greedy Algorithm Components",
             "<strong>Action Selection:</strong> ε-greedy with exploration probability ε<br>" +
             "<strong>Value Estimation:</strong> Incremental sample average Q_n+1 = Q_n + (1/n)[R_n - Q_n]<br>" +
             "<strong>Key Insight:</strong> Balance between exploration (learning) and exploitation (earning)")

## Part 3: Comparing ε-greedy Methods (Figure 2.2)

### Experimental Design

We compare three ε-greedy variants to understand the exploration-exploitation trade-off:

1. **Greedy (ε = 0)**: Pure exploitation, no exploration
2. **ε-greedy (ε = 0.01)**: 1% exploration, 99% exploitation  
3. **ε-greedy (ε = 0.1)**: 10% exploration, 90% exploitation

### Expected Behaviors

- **Greedy**: Fast initial learning but often gets stuck on suboptimal actions
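- **Small ε (0.01)**: Improves slowly, but once it finds the optimal action it selects it about 99% of the time, so it eventually overtakes ε = 0.1
- **Large ε (0.1)**: Finds the optimal action sooner, but never selects it more than about 91% of the time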

In [None]:
# ============================================
# CELL 9: Run ε-greedy Experiments
# Purpose: Generate data for Figure 2.2 comparison
# ============================================

def run_epsilon_greedy_experiment(epsilon: float, runs: int = 2000,
                                 steps: int = 1000) -> Tuple[np.ndarray, np.ndarray]:
    """
    Run ε-greedy bandit experiment with multiple independent runs

    This function implements the standard experimental protocol:
    1. For each run, create a new bandit problem (different q* values)
    2. Run the ε-greedy algorithm for specified steps
    3. Track rewards and optimal action selections
    4. Average across all runs to smooth out run-to-run noise

    Args:
        epsilon: Exploration probability
        runs: Number of independent experiments
        steps: Number of time steps per run

    Returns:
        (average_rewards, optimal_action_percentage)
    """
    # Initialize storage for all runs
    all_rewards = np.zeros((runs, steps))
    all_optimal = np.zeros((runs, steps))

    # Run multiple independent experiments
    for run in range(runs):
        # Create a new bandit problem for this run
        q_true = create_bandit()
        optimal_action = np.argmax(q_true)  # The truly best action

        # Initialize the agent's estimates and counts
        Q = np.zeros(K)  # Action value estimates Q_t(a)
        N = np.zeros(K)  # Number of times each action selected

        # Run the ε-greedy algorithm for specified steps
        for step in range(steps):
            # Select action using ε-greedy policy
            action = epsilon_greedy(Q, epsilon)

            # Get reward from environment
            reward = get_reward(action, q_true)

            # Update action value estimates
            update_estimates(Q, N, action, reward)

            # Record results for analysis
            all_rewards[run, step] = reward
            all_optimal[run, step] = (action == optimal_action)

    # Calculate averages across all runs
    avg_rewards = np.mean(all_rewards, axis=0)
    pct_optimal = np.mean(all_optimal, axis=0) * 100

    return avg_rewards, pct_optimal

# Run experiments for different epsilon values
pretty_print("🔬 Running ε-greedy Experiments",
             f"Testing ε = 0, 0.01, 0.1 with {RUNS} runs of {STEPS} steps each.<br>" +
             "This may take a moment...")

results = {}
epsilon_values = [(0, 'greedy'), (0.01, 'epsilon_01'), (0.1, 'epsilon_1')]

for eps, name in epsilon_values:
    print(f"  Running ε = {eps}...")
    avg_reward, pct_optimal = run_epsilon_greedy_experiment(eps, RUNS, STEPS)
    results[eps] = (avg_reward, pct_optimal, name)

    # Show some statistics
    final_avg_reward = avg_reward[-1]
    final_optimal_pct = pct_optimal[-1]
    pretty_print(f"📊 Results for ε = {eps}",
                f"Final average reward: {final_avg_reward:.3f}<br>" +
                f"Final optimal action %: {final_optimal_pct:.1f}%")

pretty_print("✅ Experiments Complete!", "Ready to visualize the results in Figure 2.2")

In [None]:
# ============================================
# CELL 10: Plot Figure 2.2 - ε-greedy Comparison
# Purpose: Reproduce Figure 2.2 from Sutton & Barto
# ============================================

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))

# Plot average rewards (top panel)
for eps in [0.1, 0.01, 0]:  # Order for correct layering
    avg_reward, _, name = results[eps]
    label = f'ε = {eps} (greedy)' if eps == 0 else f'ε = {eps}'
    ax1.plot(avg_reward, color=COLORS[name], label=label, linewidth=1.5)

ax1.set_ylabel('Average\nreward', fontsize=11)
ax1.set_xlim(0, 1000)
ax1.set_ylim(0, 1.5)
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# Plot optimal action percentage (bottom panel)
for eps in [0.1, 0.01, 0]:  # Order for correct layering
    _, pct_optimal, name = results[eps]
    label = f'ε = {eps} (greedy)' if eps == 0 else f'ε = {eps}'
    ax2.plot(pct_optimal, color=COLORS[name], label=label, linewidth=1.5)

ax2.set_xlabel('Steps', fontsize=11)
ax2.set_ylabel('%\nOptimal\naction', fontsize=11)
ax2.set_xlim(0, 1000)
ax2.set_ylim(0, 100)
ax2.legend(loc='lower right')
ax2.grid(True, alpha=0.3)

fig.suptitle('Figure 2.2: Average performance of ε-greedy action-value methods\non the 10-armed testbed',
             fontsize=12, y=0.995)
plt.tight_layout()
plt.show()

# Analyze the results
analysis_text = "<strong>Key Observations:</strong><br>"
analysis_text += "• <strong>Greedy (ε=0):</strong> Fast initial improvement but plateaus early due to no exploration<br>"
analysis_text += "• <strong>ε=0.01:</strong> Slower start and still improving at the end of the run; it would eventually overtake ε=0.1 on a longer horizon<br>"
analysis_text += "• <strong>ε=0.1:</strong> Finds the optimal action soonest and typically leads over these steps, but is capped near 91% optimal actions<br><br>"
analysis_text += "<strong>Trade-off Insight:</strong> Exploration costs reward now to improve estimates for later; how much of it pays off depends on the horizon"

pretty_print("🎯 Figure 2.2 Analysis", analysis_text)

## Part 4: Optimistic Initial Values (Figure 2.3)

### Theoretical Motivation

**Optimistic initialization** is an elegant way to encourage exploration without explicit randomness. The key insights:

1. **Optimism in the face of uncertainty**: Start with unrealistically high value estimates
2. **Disappointment drives exploration**: When optimistic estimates are too high, the agent will try other actions
3. **Automatic exploration**: Even greedy action selection becomes exploratory initially

### Mathematical Framework

Instead of initializing $Q_1(a) = 0$, we use $Q_1(a) = Q_0$ where $Q_0 > \max_a q_*(a)$.

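**Constant Step Size**: Following the book's experiment, we use a constant step size $\alpha$ (the usual choice for non-stationary problems). Unlike the sample average, it does not discard the initial estimate after the first update: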
$$Q_{n+1} = Q_n + \alpha[R_n - Q_n] = (1-\alpha)Q_n + \alpha R_n$$

Unrolling the recursion shows that $Q_{n+1}$ is a weighted average, giving weight $(1-\alpha)^{n}$ to the initial estimate $Q_1$ and $\alpha(1-\alpha)^{n-i}$ to the $i$-th reward $R_i$.
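
The unrolled form $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$ can be checked numerically against the step-by-step update. The sketch below uses made-up rewards and example values for $\alpha$ and $Q_1$; none of the names come from the lab code.

```python
import numpy as np

alpha, q1 = 0.1, 5.0               # step size and initial estimate (example values)
rng = np.random.default_rng(2)
rewards = rng.normal(size=25)      # made-up rewards R_1, ..., R_n
n = len(rewards)

# Iterative form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = q1
for r in rewards:
    q += alpha * (r - q)

# Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
i = np.arange(1, n + 1)
weights = alpha * (1 - alpha) ** (n - i)
closed_form = (1 - alpha) ** n * q1 + weights @ rewards

print(np.isclose(q, closed_form))                          # both forms agree
print(np.isclose((1 - alpha) ** n + weights.sum(), 1.0))   # weights sum to 1
```

The weight on $Q_1$ decays geometrically but never reaches zero, which is why an optimistic start keeps influencing the estimates for a while.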

### Why It Works

- **Initial optimism**: Every untried action looks better than anything the agent has actually experienced, so greedy selection keeps switching to untried actions
- **Natural convergence**: As true rewards are observed, estimates converge to realistic values
- **No explicit randomness**: Pure greedy selection, but optimism creates exploration
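
The mechanism can be seen in a small deterministic sketch. To keep the output reproducible it assumes noise-free rewards and made-up true values, which differs from the testbed; the helper `greedy_actions` is hypothetical and not part of the lab code.

```python
import numpy as np

def greedy_actions(q_true, initial_q, alpha=0.1, n_steps=10):
    """Run purely greedy selection and return the actions taken (illustrative helper)."""
    Q = np.full(len(q_true), float(initial_q))
    actions = []
    for _ in range(n_steps):
        a = int(np.argmax(Q))            # greedy; ties go to the lowest index
        reward = q_true[a]               # noise-free reward (assumption for a deterministic demo)
        Q[a] += alpha * (reward - Q[a])  # constant step-size update
        actions.append(a)
    return actions

q_true = np.linspace(-1.0, 1.0, 10)      # made-up true values; action 9 is best
optimistic = greedy_actions(q_true, initial_q=5.0)
realistic = greedy_actions(q_true, initial_q=0.0)
print("optimistic:", optimistic)
print("realistic: ", realistic)
```

With $Q_1 = 5$, each pull drags the chosen action's estimate below the untried ones, so all ten actions are tried within the first ten steps. With $Q_1 = 0$, the agent locks onto the first action whose estimate turns positive and never reaches the best one.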

In [None]:
# ============================================
# CELL 11: Optimistic Initial Values Experiment
# Purpose: Generate data for Figure 2.3
# ============================================

def run_optimistic_experiment(epsilon: float, initial_Q: float, alpha: float = 0.1,
                             runs: int = 2000, steps: int = 1000) -> np.ndarray:
    """
    Run experiment with specified initial values and constant step size

    Key differences from sample average method:
    1. Initial estimates can be set to any value (optimistic or realistic)
    2. Uses constant step size α = 0.1 instead of decreasing 1/n
    3. Recent rewards have more influence than distant ones

    Mathematical update rule:
    Q_n+1 = Q_n + α[R_n - Q_n]
         = (1-α)Q_n + αR_n

    This is exponential recency-weighted average, giving weight:
    - (1-α)^n to initial estimate Q_1
    - α(1-α)^(n-i) to i-th reward

    Args:
        epsilon: Exploration probability
        initial_Q: Initial action value estimates Q_1(a)
        alpha: Constant step size parameter
        runs: Number of independent experiments
        steps: Number of time steps per run

    Returns:
        Percentage of optimal actions at each step
    """
    all_optimal = np.zeros((runs, steps))

    for run in range(runs):
        # Initialize new bandit problem
        q_true = create_bandit()
        optimal_action = np.argmax(q_true)

        # Initialize with specified initial values
        # This is the key difference: optimistic vs realistic initialization
        Q = np.ones(K) * initial_Q  # All actions start with same estimate
        N = np.zeros(K)  # Track selections (though not used with constant α)

        # Run the algorithm
        for step in range(steps):
            # Select action using ε-greedy policy
            action = epsilon_greedy(Q, epsilon)

            # Get reward from environment
            reward = get_reward(action, q_true)

            # Update with constant step size (key difference)
            update_estimates(Q, N, action, reward, alpha=alpha)

            # Record if optimal action was selected
            all_optimal[run, step] = (action == optimal_action)

    return np.mean(all_optimal, axis=0) * 100

# Run the two key experiments for Figure 2.3
pretty_print("🚀 Running Optimistic Initialization Experiments",
             "Comparing optimistic greedy vs realistic ε-greedy with constant step size α=0.1")

print("  Experiment 1: Optimistic greedy (Q₁=5, ε=0)...")
optimistic_greedy = run_optimistic_experiment(epsilon=0, initial_Q=5, alpha=0.1)

print("  Experiment 2: Realistic ε-greedy (Q₁=0, ε=0.1)...")
realistic_egreedy = run_optimistic_experiment(epsilon=0.1, initial_Q=0, alpha=0.1)

# Compare final performance
opt_final = optimistic_greedy[-1]
real_final = realistic_egreedy[-1]

pretty_print("📊 Final Performance Comparison",
             f"Optimistic Greedy (Q₁=5, ε=0): {opt_final:.1f}% optimal actions<br>" +
             f"Realistic ε-greedy (Q₁=0, ε=0.1): {real_final:.1f}% optimal actions<br>" +
             f"<strong>Winner:</strong> {'Optimistic' if opt_final > real_final else 'Realistic'} method")

print("✅ Optimistic initialization experiments complete!")

In [None]:
# ============================================
# CELL 12: Plot Figure 2.3 - Optimistic vs Realistic
# Purpose: Reproduce Figure 2.3 from Sutton & Barto
# ============================================

plt.figure(figsize=(8, 5))

# Plot optimistic greedy method
plt.plot(optimistic_greedy, color=COLORS['optimistic'],
         label='Optimistic, greedy\n$Q_1 = 5, ε = 0$', linewidth=1.5)

# Plot realistic ε-greedy method
plt.plot(realistic_egreedy, color=COLORS['realistic'],
         label='Realistic, ε-greedy\n$Q_1 = 0, ε = 0.1$', linewidth=1.5)

plt.xlabel('Steps', fontsize=11)
plt.ylabel('%\nOptimal\naction', fontsize=11)
plt.xlim(0, 1000)
plt.ylim(0, 100)
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)

plt.title('Figure 2.3: The effect of optimistic initial action-value estimates on the 10-armed testbed.\n' +
          'Both methods used a constant step-size parameter, α = 0.1.',
          fontsize=11)
plt.tight_layout()
plt.show()

# Detailed analysis of the results
analysis_text = "<strong>Why Optimistic Initialization Works:</strong><br>"
analysis_text += "1. <strong>Initial Phase:</strong> All actions seem equally good (Q₁=5), encouraging systematic exploration<br>"
analysis_text += "2. <strong>Disappointment:</strong> Actions give lower rewards than expected, agent tries others<br>"
analysis_text += "3. <strong>Convergence:</strong> As estimates fall toward realistic values, the agent usually settles on the optimal action<br><br>"
analysis_text += "<strong>Key Insight:</strong> Optimism creates exploration without explicit randomness!<br>"
analysis_text += "<strong>Trade-off:</strong> Better long-term performance on this stationary testbed, but the initial values need tuning and the exploration drive is only temporary"

pretty_print("🎯 Figure 2.3 Analysis", analysis_text)

# Show the exploration pattern differences
pattern_text = "<strong>Exploration Patterns:</strong><br>"
pattern_text += "• <strong>Optimistic:</strong> Systematic early exploration, then focused exploitation<br>"
pattern_text += "• <strong>ε-greedy:</strong> Random exploration throughout, more consistent but less efficient<br><br>"
pattern_text += "<strong>When to Use Each:</strong><br>"
pattern_text += "• <strong>Optimistic:</strong> When you can set good initial values and want fast convergence<br>"
pattern_text += "• <strong>ε-greedy:</strong> When problem characteristics are unknown or non-stationary"

pretty_print("🔍 Method Comparison", pattern_text)

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Lab Summary</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>Key Findings:</strong></p>
        <ul style="margin: 10px 0; padding-left: 20px;">
            <li><strong>Exploration is Essential:</strong> Pure greedy methods often fail to find optimal actions</li>
            <li><strong>ε-greedy Balance:</strong> ε = 0.1 provides good long-term performance in stationary environments</li>
            <li><strong>Optimistic Initialization:</strong> Can outperform ε-greedy by encouraging systematic early exploration</li>
            <li><strong>Step Size Matters:</strong> Constant α gives more weight to recent observations</li>
        </ul>
        
        <p><strong>Algorithmic Insights:</strong></p>
        <ul style="margin: 10px 0; padding-left: 20px;">
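            <li><strong>Sample Average:</strong> Q<sub>n+1</sub> = Q<sub>n</sub> + (1/n)[R<sub>n</sub> - Q<sub>n</sub>] converges to the true values in stationary problems</li>
            <li><strong>Constant Step Size:</strong> Q<sub>n+1</sub> = Q<sub>n</sub> + α[R<sub>n</sub> - Q<sub>n</sub>] adapts to recent changes</li>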
            <li><strong>Exploration Strategies:</strong> Random (ε-greedy) vs Systematic (optimistic)</li>
        </ul>
    </div>
</div>
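
The contrast between the two update rules in the summary can be seen on a toy reward stream. The sketch below is an illustration under stated assumptions: `track` is a hypothetical helper, and the reward stream (a jump from 0 to 1 at step 100) is made up and noise-free so that the result is deterministic.

```python
import numpy as np

def track(rewards, alpha=None):
    """Return the estimate trajectory; alpha=None means sample-average step size 1/n."""
    q, history = 0.0, []
    for n, r in enumerate(rewards, start=1):
        step = 1.0 / n if alpha is None else alpha
        q += step * (r - q)
        history.append(q)
    return np.array(history)

# Hypothetical non-stationary reward stream: the reward jumps from 0 to 1 at step 100
rewards = np.concatenate([np.zeros(100), np.ones(100)])

sample_avg = track(rewards)             # weights all 200 rewards equally
const_step = track(rewards, alpha=0.1)  # weights recent rewards more heavily

print(f"sample average: {sample_avg[-1]:.3f}")  # overall mean of the stream
print(f"constant alpha: {const_step[-1]:.3f}")  # close to the new value of 1
```

The sample average ends near the mean of the whole stream, while the constant step-size estimate has almost fully moved to the new reward level.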

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li><strong>Scalability:</strong> How would performance change with different numbers of arms (k = 2, 100, 1000)?</li>
        <li><strong>Non-stationarity:</strong> What happens when true action values q*(a) change over time?</li>
        <li><strong>Adaptive ε:</strong> How could you adapt ε over time for better performance?</li>
        <li><strong>Initialization:</strong> How do you choose good optimistic initial values in practice?</li>
        <li><strong>Applications:</strong> Where would you use each method in real-world scenarios?</li>
    </ol>
</div>

<div style="background: #e8f5e8; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #28a745;">
    <h3 style="color: #28a745; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Extensions and Advanced Topics</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>Beyond Basic Bandits:</strong></p>
        <ul style="margin: 10px 0; padding-left: 20px;">
            <li><strong>Upper Confidence Bound (UCB):</strong> A<sub>t</sub> = argmax<sub>a</sub> [Q<sub>t</sub>(a) + c√(ln t / N<sub>t</sub>(a))]</li>
            <li><strong>Thompson Sampling:</strong> Bayesian approach using probability matching</li>
            <li><strong>Gradient Bandits:</strong> Learn action preferences rather than values</li>
            <li><strong>Contextual Bandits:</strong> The best action depends on an observed context/state</li>
        </ul>
    </div>
</div>
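
For readers who want to try the UCB rule listed above, here is a minimal sketch of the action-selection step only. The function name `ucb_select`, the default `c=2.0`, and the example numbers are assumptions for illustration; treating untried actions (N = 0) as maximizing follows the convention in Sutton & Barto, Section 2.7.

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c * sqrt(ln t / N(a))]; untried actions are chosen first."""
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])  # N(a) = 0 is treated as maximally uncertain
    bonus = c * np.sqrt(np.log(t) / N)  # shrinks as an action is sampled more often
    return int(np.argmax(Q + bonus))

# Hypothetical estimates: arm 0 looks better but has been sampled far more often
Q = [1.0, 0.0]
N = [100, 1]
print(ucb_select(Q, N, t=101, c=2.0))  # the exploration bonus favors the rarely tried arm
print(ucb_select(Q, N, t=101, c=0.0))  # with c = 0 the rule reduces to greedy selection
```

Unlike ε-greedy, the exploration here is directed: the bonus is largest for actions whose estimates rest on the fewest samples.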

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">🎓 End of Lab 2: Multi-Armed Bandits | Next: Finite Markov Decision Processes</p>
</div>