<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo ES
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton and Barto Chapter 5 Figure 5.2 | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        This lab implements <strong>Monte Carlo ES (Exploring Starts)</strong>, following the pseudocode in Sutton and Barto Figure 5.2. 
        MC ES learns a Blackjack policy from sampled episodes, without requiring a model of the environment. The key mechanism is 
        <strong>Exploring Starts</strong>: each episode begins with a randomly selected state-action pair. As long as every pair has 
        a nonzero probability of being selected, all state-action pairs are visited infinitely often in the limit. After the random first action, the agent follows its current 
        greedy policy, so exploration comes from the start of each episode while the rest of the episode evaluates and improves the greedy policy.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement Monte Carlo ES from Figure 5.2</li>
        <li>Understand exploring starts mechanism</li>
        <li>Apply first-visit MC for Q-value estimation</li>
        <li>Implement greedy policy improvement</li>
        <li>Reproduce textbook Blackjack results</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get sum close to 21 without exceeding</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → 0=Stick (stop), 1=Hit (draw card)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Can be 1 or 11 (usable if 11)</div>
    </div>
</td>
</tr>
</table>
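
The ace rule above can be made concrete with a small sketch. The helper name `hand_value` is hypothetical (it is not part of Gymnasium or the lab utilities), and it assumes cards are given as integers with an ace written as 1 and face cards as 10.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values (ace given as 1)."""
    total = sum(cards)
    # An ace is "usable" when counting one ace as 11 (adding 10) does not bust.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace

print(hand_value([1, 6]))     # (17, True): ace counted as 11
print(hand_value([1, 6, 9]))  # (16, False): ace must count as 1
```

The second hand shows why hitting with a usable ace cannot bust immediately: the ace falls back to 1 and the hand continues without a usable ace.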

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup and Dependencies</h2>
</div>

We begin by importing all necessary libraries:
- **Gymnasium**: Modern RL environment library (replaces OpenAI Gym)
- **NumPy**: Numerical computations and array operations
- **Matplotlib**: 3D surface plots and 2D heatmaps
- **defaultdict**: Efficient sparse storage for Q-values and returns
- **pretty_print**: Custom utility for formatted output

In [None]:
"""
Cell 1: Import Libraries and Initialize Environment

Purpose:
  - Import all required libraries for MC ES implementation
  - Load pretty_print utility from GitHub for formatted output
  - Configure matplotlib for publication-quality visualizations
  - Create Blackjack-v1 environment

Key Libraries:
  - gymnasium: Provides the Blackjack-v1 environment and the reset/step API
  - numpy: Array operations, random sampling, statistical functions
  - defaultdict: Automatically initializes missing keys with default values
  - matplotlib: 3D plotting (Axes3D), colormaps (cm), pyplot interface

Environment Details:
  - Blackjack-v1 uses the Gymnasium API: reset() -> (obs, info), step() -> 5-tuple
  - State: tuple (player_sum, dealer_card, usable_ace)
  - Action: 0 (Stick) or 1 (Hit)
  - Reward: Terminal only (+1 win, 0 draw, -1 lose)
"""

import sys
import gymnasium as gym  # Modern replacement for gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

# Configure matplotlib for better quality figures
plt.rcParams['figure.dpi'] = 100          # Display resolution
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size
plt.rcParams['font.size'] = 10            # Font size for labels

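# Load pretty_print utility from GitHub repository
# NOTE: exec() runs remote code; only do this for a source you trust
try:
    import requests
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Do not exec an HTTP error page
    exec(response.text)  # Execute utility code (defines pretty_print)
    pretty_print("Environment Setup Complete", 
                 f"Gymnasium version: {gym.__version__}<br>" +
                 "NumPy, Matplotlib loaded successfully<br>" +
                 "Ready to implement Monte Carlo ES", 
                 style='success')
except Exception as e:
    # Fallback if GitHub fetch fails: define a plain-text pretty_print
    # so that later cells calling it still run
    def pretty_print(title, message, style=None):
        print(f"[{title}]")
        print(message.replace('<br>', '\n'))
    print(f"Libraries loaded (pretty_print unavailable: {e})")
    print(f"Gymnasium version: {gym.__version__}")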

# Create the Blackjack environment (v1 is the environment version)
env = gym.make('Blackjack-v1')

# Display environment information
pretty_print("Blackjack Environment Created",
             f"<strong>Action Space:</strong> {env.action_space.n} actions<br>" +
             "• Action 0: Stick (stop drawing cards)<br>" +
             "• Action 1: Hit (draw another card)<br><br>" +
             "<strong>State Space:</strong><br>" +
             "• player_sum: Current sum of player cards (4-21; the plots below show 12-21)<br>" +
             "• dealer_card: Dealer's visible card (1-10, Ace=1)<br>" +
             "• usable_ace: Boolean, True if ace counts as 11<br><br>" +
             "<strong>Reward Structure:</strong><br>" +
             "• +1 for winning, 0 for draw, -1 for losing<br>" +
             "• Rewards given only at episode termination",
             style='info')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Monte Carlo ES Algorithm - Pseudocode</h2>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Figure 5.2: Monte Carlo ES Algorithm from Sutton and Barto</p>
</div>

<div style="background: #e8f5e9; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #4caf50;">
    <h3 style="color: #2e7d32; font-size: 14px; margin: 0 0 8px 0;">Algorithm Components</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        <strong>1. Exploring Starts:</strong> Each episode begins with a random state-action pair, ensuring all pairs are explored.<br><br>
        <strong>2. First-Visit MC:</strong> For each (s,a) pair, only the first occurrence in an episode is used for Q-value updates.<br><br>
        <strong>3. Greedy Policy Improvement:</strong> After each episode, policy becomes greedy: π(s) ← argmax_a Q(s,a).<br><br>
        <strong>4. Running Average:</strong> Q(s,a) is updated as the average of all observed returns from (s,a).
    </p>
</div>
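
The running average in component 4 does not require storing every return. As a minimal sketch (the function name `incremental_mean` is illustrative and not part of the lab code), the same sample mean can be maintained with a visit count and the update Q ← Q + (G − Q) / N:

```python
import math

def incremental_mean(old_mean, count, new_return):
    """Update a sample mean with one more return; count includes the new sample."""
    return old_mean + (new_return - old_mean) / count

observed_returns = [1.0, -1.0, -1.0, 0.0, 1.0]

q = 0.0
for n, g in enumerate(observed_returns, start=1):
    q = incremental_mean(q, n, g)

# The incremental result matches the mean of the stored list.
batch_mean = sum(observed_returns) / len(observed_returns)
assert math.isclose(q, batch_mean, abs_tol=1e-12)
print(q, batch_mean)
```

Storing lists of returns, as the lab code does, mirrors the textbook pseudocode more literally. The incremental form gives the same estimate with constant memory per state-action pair.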

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 3: Episode Generation with Exploring Starts</h2>
</div>

The exploring starts mechanism is what allows MC ES to explore while keeping a deterministic greedy policy. Without it, a greedy policy would never try the actions it currently ranks lower. Starting each episode with a random action means that:
1. Every state-action pair that can begin an episode continues to be sampled
2. We can still follow the greedy policy for the rest of the episode
3. Exploration happens without ongoing ε-greedy behavior
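
For comparison, the textbook formulation picks both the start state and the first action at random. The sketch below assumes the 200-state textbook state space (player sum 12-21) and uses hypothetical names (`STATES`, `ACTIONS`, `sample_start_pair`). It only samples pairs; it does not place a Gymnasium environment into the sampled state.

```python
import random

# Assumed enumeration of the textbook Blackjack states (player sum 12-21).
STATES = [(player_sum, dealer_card, usable_ace)
          for player_sum in range(12, 22)
          for dealer_card in range(1, 11)
          for usable_ace in (False, True)]
ACTIONS = (0, 1)  # 0 = Stick, 1 = Hit

def sample_start_pair(rng):
    """Pick a start state and a first action uniformly at random."""
    return rng.choice(STATES), rng.choice(ACTIONS)

rng = random.Random(0)
sampled = {sample_start_pair(rng) for _ in range(20_000)}
print(len(sampled), "of", len(STATES) * len(ACTIONS), "pairs sampled")
```

Because every pair has the same nonzero probability, each one keeps being selected as the number of episodes grows. The lab code instead takes its start state from the cards dealt by `env.reset()`, so its start-state distribution follows the deal and is not uniform.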

In [None]:
"""
Cell 2: Generate Episode with Exploring Starts

Purpose:
  - Generate complete Blackjack episodes using exploring starts mechanism
  - First action: RANDOM (ensures exploration of all state-action pairs)
  - Subsequent actions: GREEDY (follows current policy for exploitation)

Algorithm:
  1. Reset environment to get initial state
  2. Select RANDOM first action (exploring start)
  3. Execute first action and record (state, action, reward)
  4. For remaining steps:
     a) Select action using greedy policy
     b) Execute action
     c) Record (state, action, reward)
     d) Continue until episode terminates

Parameters:
  env: Gymnasium Blackjack-v1 environment
  policy: Dictionary mapping states to actions (greedy policy)
          Format: policy[state] = action

Returns:
  episode: List of (state, action, reward) tuples
           Example: [((12, 2, False), 1, 0), ((13, 2, False), 1, -1)]

CRITICAL NOTES:
  - The RANDOM first action is what makes this "Exploring Starts"
  - Without random starts, greedy policy would never try sub-optimal actions
  - The start STATE is whatever env.reset() deals, so only states that can
    follow a two-card deal receive a random first action
  - After first action, we follow greedy policy to exploit learned knowledge
"""

def generate_episode_with_exploring_starts(env, policy):
    """
    Generate one complete episode using exploring starts.
    """
    episode = []  # Will store (state, action, reward) tuples
    
    # Initialize episode by resetting environment
    # Gymnasium's reset() returns a (state, info) tuple
    state, info = env.reset()
    
    # ============================================================
    # EXPLORING START: Select RANDOM first action
    # This is THE critical component of exploring starts
    # ============================================================
    action = env.action_space.sample()  # Uniform random: 0 or 1
    
    # Execute the first (random) action
    # Gymnasium's step() returns (state, reward, terminated, truncated, info)
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # Episode ends if either flag is True
    
    # Record first step
    episode.append((state, action, reward))
    
    # ============================================================
    # Continue episode following GREEDY policy
    # After exploring start, we exploit learned policy
    # ============================================================
    state = next_state
    while not done:
        # Get action from greedy policy
        # If state not yet in policy (early in learning), use random
        action = policy.get(state, env.action_space.sample())
        
        # Execute action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        # Record step
        episode.append((state, action, reward))
        
        # Move to next state
        state = next_state
    
    return episode

# Test episode generation
test_policy = {}  # Empty policy for testing
test_episode = generate_episode_with_exploring_starts(env, test_policy)

pretty_print("Episode Generation Function Ready",
             f"<strong>Test episode generated:</strong><br>" +
             f"• Episode length: {len(test_episode)} steps<br>" +
             f"• Final reward: {test_episode[-1][2]}<br>" +
             f"• First step: {test_episode[0]}<br><br>" +
             "<strong>Exploring Starts Mechanism:</strong><br>" +
             "• First action: RANDOM (exploration)<br>" +
             "• Subsequent actions: GREEDY (exploitation)<br>" +
             "• Guarantees all (s,a) pairs visited",
             style='success')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 4: Monte Carlo ES Main Algorithm</h2>
</div>

This is the complete learning algorithm that implements Figure 5.2 from the textbook. The algorithm alternates between:
1. **Policy Evaluation**: Update Q-values based on observed returns
2. **Policy Improvement**: Make policy greedy with respect to Q-values

This pattern is called **Generalized Policy Iteration (GPI)** and is fundamental to many RL algorithms.
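
The evaluation step relies on first-visit returns. The sketch below is a standalone illustration on a made-up episode; the helper name `first_visit_returns` and the toy states "A" and "B" are assumptions, not lab code. It records the first time step of each pair, then accumulates G backward and keeps G only at that first step.

```python
def first_visit_returns(episode, gamma=1.0):
    """Map each (state, action) pair to the return following its first visit."""
    # First time step at which each pair occurs in the episode.
    first_step = {}
    for t, (state, action, _) in enumerate(episode):
        first_step.setdefault((state, action), t)

    G = 0.0
    result = {}
    # Backward pass: G accumulates the (discounted) rewards from step t onward.
    for t in range(len(episode) - 1, -1, -1):
        state, action, reward = episode[t]
        G = gamma * G + reward
        if first_step[(state, action)] == t:
            result[(state, action)] = G
    return result

# Toy episode in which the pair ("A", 0) occurs twice.
toy_episode = [("A", 0, 1), ("B", 1, 0), ("A", 0, 2)]
print(first_visit_returns(toy_episode))
# {('B', 1): 2.0, ('A', 0): 3.0}
```

The pair `("A", 0)` receives the return from its first occurrence (1 + 0 + 2 = 3), not the return of 2 from its later occurrence. In Blackjack a state-action pair cannot repeat within one episode, so the distinction only matters when the same code is reused on other tasks.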

In [None]:
"""
Cell 3: Monte Carlo ES - Complete Learning Algorithm

Purpose:
  - Implement complete MC ES algorithm from Sutton and Barto Figure 5.2
  - Learn optimal Q-values Q*(s,a) through episode sampling
  - Extract optimal policy π*(s) = argmax_a Q*(s,a)

Algorithm (Figure 5.2):
  Initialize:
    - Q(s,a) arbitrarily for all s,a
    - Returns(s,a) ← empty list for all s,a
    - π(s) arbitrarily for all s
  
  Loop for each episode:
    1. Generate episode S0,A0,R1,...,ST-1,AT-1,RT using exploring starts
    2. G ← 0
    3. Loop for each step of episode t = T-1, T-2, ..., 0:
       a) G ← γ*G + Rt+1
       b) Unless the pair St,At appears in S0,A0,...,St-1,At-1:
          - Append G to Returns(St,At)
          - Q(St,At) ← average(Returns(St,At))
          - π(St) ← argmax_a Q(St,a)

Data Structures:
  Q: defaultdict of numpy arrays
     Q[state][action] = estimated action value
     Automatically initializes new states with zeros
  
  returns: defaultdict of lists
     returns[(state,action)] = [G1, G2, G3, ...]
     Stores all observed returns for averaging
  
  policy: regular dict
     policy[state] = action
     Greedy policy derived from Q-values

Parameters:
  env: Blackjack-v1 environment
  num_episodes: Number of episodes to run (default 500000)

Returns:
  Q: Final Q-value estimates (optimal action-values)
  policy: Final greedy policy (optimal policy)
"""

def monte_carlo_es(env, num_episodes=500000):
    """
    Monte Carlo ES control algorithm for finding optimal policy.
    """
    
    # ============================================================
    # INITIALIZATION
    # ============================================================
    
    # Initialize Q(s,a) arbitrarily (here: all zeros)
    # defaultdict automatically creates np.zeros(2) for new states
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # Initialize Returns(s,a) as empty lists
    # Will store all observed returns for each state-action pair
    returns = defaultdict(list)
    
    # Initialize policy arbitrarily (will become greedy)
    policy = {}
    
    # Progress reporting
    pretty_print("Starting Monte Carlo ES Learning",
                 f"<strong>Configuration:</strong><br>" +
                 f"• Episodes: {num_episodes:,}<br>" +
                 f"• Method: First-Visit Monte Carlo<br>" +
                 f"• Discount factor γ: 1.0 (undiscounted)<br>" +
                 f"• Policy improvement: Greedy<br><br>" +
                 "<strong>This will take 2-3 minutes...</strong>",
                 style='warning')
    
    # ============================================================
    # MAIN LEARNING LOOP
    # ============================================================
    
    for episode_num in range(1, num_episodes + 1):
        
        # Generate episode using exploring starts
        episode = generate_episode_with_exploring_starts(env, policy)
        
        # Record the FIRST time step at which each (s,a) pair occurs,
        # so the backward pass below can tell first visits from later ones
        first_visit_step = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit_step.setdefault((state, action), t)
        
        # Initialize return G
        # For Blackjack, γ=1 (undiscounted episodic task)
        G = 0
        
        # ============================================================
        # Process episode BACKWARD to calculate returns efficiently
        # Working backward: G accumulates rewards from end to start
        # ============================================================
        
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            
            # Update return G
            # G = γ*G + R_{t+1}
            # Since γ=1 for Blackjack: G = reward + G
            G = reward + G
            
            # Create state-action tuple for tracking
            state_action = (state, action)
            
            # ============================================================
            # FIRST-VISIT CHECK
            # Only update Q if step t is the first occurrence of (s,a)
            # (a "seen" set filled during the backward pass would pick
            # the LAST occurrence instead)
            # ============================================================
            
            if first_visit_step[state_action] == t:
                
                # Append return to list for this state-action pair
                returns[state_action].append(G)
                
                # ============================================================
                # POLICY EVALUATION
                # Update Q(s,a) as average of all observed returns
                # Q(s,a) = mean([G1, G2, G3, ...])
                # ============================================================
                Q[state][action] = np.mean(returns[state_action])
                
                # ============================================================
                # POLICY IMPROVEMENT
                # Make policy greedy with respect to Q
                # π(s) = argmax_a Q(s,a)
                # ============================================================
                policy[state] = np.argmax(Q[state])
        
        # Progress reporting every 100,000 episodes
        if episode_num % 100000 == 0:
            avg_q = np.mean([np.max(q) for q in Q.values()]) if Q else 0
            print(f"Episode {episode_num:,}/{num_episodes:,} | "
                  f"States visited: {len(Q)} | "
                  f"Avg max Q: {avg_q:.3f}")
    
    # ============================================================
    # LEARNING COMPLETE
    # ============================================================
    
    print("\n" + "="*70)
    print("LEARNING COMPLETE")
    print("="*70)
    
    pretty_print("Monte Carlo ES Learning Complete",
                 f"<strong>Final Statistics:</strong><br>" +
                 f"• Total episodes processed: {num_episodes:,}<br>" +
                 f"• States discovered: {len(Q)}<br>" +
                 f"• Policy entries: {len(policy)}<br>" +
                 f"• Average max Q-value: {np.mean([np.max(q) for q in Q.values()]):.4f}<br><br>" +
                 "<strong>Results:</strong><br>" +
                 "• Q-values are sample-average estimates of Q*<br>" +
                 "• Policy is greedy with respect to the learned Q-values<br>" +
                 "• Ready for visualization and analysis",
                 style='success')
    
    return Q, policy

# Display algorithm readiness
pretty_print("Monte Carlo ES Algorithm Loaded",
             "<strong>Algorithm Components:</strong><br>" +
             "• Exploring starts for exploration<br>" +
             "• First-visit MC for Q-value updates<br>" +
             "• Greedy policy improvement<br>" +
             "• Running average for Q(s,a)<br><br>" +
             "Ready to learn optimal Blackjack policy",
             style='info')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 5: Visualization Functions</h2>
</div>

We create two complementary visualizations:

**3D Surface Plots**: Show the optimal state-value function V*(s) = max_a Q(s,a). Height and color represent value, with blue for unfavorable states and red for favorable states.

**2D Policy Heatmaps**: Display the optimal policy π*(s) = argmax_a Q(s,a) using discrete colors. Green represents STICK and Red represents HIT. We use pcolormesh to ensure crisp, discrete boundaries.
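
Both plots are derived from the same Q-table. As a small sketch with made-up Q-values (the name `greedy_summary` and the numbers in `Q_toy` are illustrative only), V*(s) and π*(s) can be read off per state like this:

```python
def greedy_summary(Q):
    """Return (V, pi) with V[s] = max_a Q[s][a] and pi[s] = argmax_a Q[s][a]."""
    V, pi = {}, {}
    for state, q_values in Q.items():
        # Ties resolve to the lowest action index, as np.argmax does.
        best_action = max(range(len(q_values)), key=lambda a: q_values[a])
        pi[state] = best_action
        V[state] = q_values[best_action]
    return V, pi

# Toy Q-table keyed like the lab's Q dictionary; entries are [Stick, Hit].
Q_toy = {
    (20, 10, False): [0.44, -0.85],
    (13, 10, False): [-0.58, -0.47],
}

V_toy, pi_toy = greedy_summary(Q_toy)
print(V_toy)   # height of the 3D surface at each state
print(pi_toy)  # 0 = Stick, 1 = Hit: cell color in the heatmap
```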

In [None]:
"""
Cell 4: 3D Value Function Visualization

Purpose:
  - Create 3D surface plots of optimal value function V*(s)
  - Show how value changes with player sum and dealer card
  - Separate plots for usable ace vs no usable ace

Visualization Structure:
  - X-axis: Player sum (12-21)
  - Y-axis: Dealer showing card (1-10, where 1=Ace)
  - Z-axis (height): State value V(s)
  - Color: Blue (low/bad) to Red (high/good)

Value Function:
  V*(s) = max_a Q*(s,a)
  The value of a state is the maximum Q-value over all actions
"""

def plot_value_function(Q, title="Optimal State-Value Function V*"):
    """
    Plot 3D surface of value function.
    """
    
    def get_value(player_sum, dealer_card, usable_ace):
        """
        Get optimal value for a state.
        V*(s) = max_a Q*(s,a)
        """
        state = (player_sum, dealer_card, usable_ace)
        if state in Q:
            return np.max(Q[state])  # Max over actions
        else:
            return 0  # Default for unvisited states
    
    def create_surface(usable_ace, ax):
        """
        Create one 3D surface plot.
        """
        # Define state space ranges
        player_range = np.arange(12, 22)  # 12, 13, ..., 21
        dealer_range = np.arange(1, 11)   # 1 (Ace), 2, ..., 10
        
        # Create coordinate meshgrid
        # X[i,j] = player_range[j]
        # Y[i,j] = dealer_range[i]
        X, Y = np.meshgrid(player_range, dealer_range)
        
        # Build value array
        # Z[i,j] = V(player_range[j], dealer_range[i], usable_ace)
        Z = np.array([[get_value(x, y, usable_ace) 
                      for x in player_range]    # Columns
                     for y in dealer_range])    # Rows
        
        # Create 3D surface
        surf = ax.plot_surface(
            X, Y, Z,
            cmap=cm.coolwarm,      # Blue to Red colormap
            linewidth=0,           # No wireframe lines
            antialiased=True,      # Smooth rendering
            vmin=-1, vmax=1,       # Value range for color
            alpha=0.8              # Slight transparency
        )
        
        # Configure axes
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_zlabel('Value V(s)', fontsize=11)
        ax.set_zlim(-1, 1)  # Z-axis limits
        ax.view_init(elev=25, azim=-130)  # Viewing angle
        
        return surf
    
    # Create figure with two subplots (2 rows, 1 column)
    fig = plt.figure(figsize=(14, 11))
    
    # Subplot 1: With usable ace
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title(f'{title} - WITH Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    
    # Subplot 2: Without usable ace
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title(f'{title} - WITHOUT Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    
    plt.tight_layout()
    plt.show()

pretty_print("3D Value Visualization Ready",
             "<strong>Features:</strong><br>" +
             "• Plots V*(s) = max_a Q(s,a)<br>" +
             "• Color: Blue (low) to Red (high)<br>" +
             "• Separate plots for usable/no usable ace<br>" +
             "• 3D surface shows value landscape",
             style='success')

In [None]:
"""
Cell 5: 2D Policy Heatmap Visualization (DISCRETE COLORS - FIXED)

Purpose:
  - Create 2D heatmaps of optimal policy π*(s)
  - Show STICK vs HIT decisions for each state
  - Use DISCRETE colors with NO interpolation

CRITICAL FIX:
  - Uses pcolormesh instead of imshow
  - pcolormesh creates discrete rectangular patches
  - Ensures crisp boundaries between actions
  - No blending or interpolation between values

Color Coding:
  - Green: STICK (action 0) - Stop drawing cards
  - Red: HIT (action 1) - Draw another card

Policy Function:
  π*(s) = argmax_a Q*(s,a)
  Select action with highest Q-value
"""

def plot_policy(policy, title="Optimal Policy π*"):
    """
    Plot 2D heatmap of policy with discrete colors.
    """
    
    def get_action(player_sum, dealer_card, usable_ace):
        """
        Get optimal action for a state.
        Returns: 0 (Stick) or 1 (Hit)
        """
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)  # Default to Hit if not in policy
    
    def create_heatmap(usable_ace, ax):
        """
        Create one 2D policy heatmap.
        """
        # Define state space ranges
        player_range = np.arange(12, 22)  # 12-21
        dealer_range = np.arange(1, 11)   # 1-10
        
        # Build policy grid
        # Z[i,j] = action for state (player_range[j], dealer_range[i], usable_ace)
        Z = np.array([[get_action(p, d, usable_ace)
                      for p in player_range]    # Columns: player sums
                     for d in dealer_range])    # Rows: dealer cards
        
        # ============================================================
        # CRITICAL: Use pcolormesh for DISCRETE values
        # pcolormesh creates colored rectangles without interpolation
        # This ensures actions are displayed as discrete blocks
        # ============================================================
        im = ax.pcolormesh(
            player_range,           # X coordinates (cell centers: player sums)
            dealer_range,           # Y coordinates (cell centers: dealer cards)
            Z,                      # Action values (0 or 1)
            cmap='RdYlGn_r',        # Red-Yellow-Green reversed
                                    # Red = Hit (1), Green = Stick (0)
            edgecolors='black',     # Black gridlines between cells
            linewidth=0.5,          # Gridline thickness
            vmin=0, vmax=1,         # Action range [0, 1]
            shading='nearest'       # One cell centered on each (x, y), no
                                    # interpolation; 'flat' would need X and Y
                                    # to be one element longer than Z
        )
        
        # Configure axes
        ax.set_xticks(player_range)
        ax.set_yticks(dealer_range)
        # Display 'A' for Ace (value 1) on y-axis
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_aspect('equal')  # Square cells
        
        # Add colorbar with discrete labels
        # Ticks at 0.25 and 0.75 to center labels in color regions
        cbar = plt.colorbar(im, ax=ax, ticks=[0.25, 0.75], 
                           fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK (0)', 'HIT (1)'])
        
        return im
    
    # Create figure with two subplots (1 row, 2 columns)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Subplot 1: With usable ace
    ax1.set_title(f'{title} - WITH Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    
    # Subplot 2: Without usable ace
    ax2.set_title(f'{title} - WITHOUT Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    
    plt.tight_layout()
    plt.show()

pretty_print("2D Policy Visualization Ready",
             "<strong>Features:</strong><br>" +
             "• DISCRETE colors using pcolormesh<br>" +
             "• Green = STICK (0), Red = HIT (1)<br>" +
             "• NO interpolation between actions<br>" +
             "• Crisp boundaries for clear visualization<br>" +
             "• Black gridlines show state boundaries",
             style='success')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 6: Run Monte Carlo ES Experiment</h2>
</div>

Now we execute the complete learning process with 500,000 episodes. A run of this size is intended to give:
- Many visits to every state-action pair that occurs in play
- Q-value estimates with low sampling noise
- A greedy policy that has stopped changing in most states
- Results that closely match textbook Figure 5.2
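
A rough way to see why many visits are needed: Blackjack returns lie in [-1, 1], so their standard deviation is at most 1, and the standard error of a sample mean over n returns is at most 1/√n. The sketch below uses a hypothetical helper name. It also assumes independent returns under a fixed policy, which holds only approximately in MC ES because the policy changes during learning.

```python
import math

def standard_error_bound(num_visits, max_abs_return=1.0):
    """Upper bound on the standard error of a mean of bounded returns."""
    return max_abs_return / math.sqrt(num_visits)

for n in (100, 1_000, 10_000):
    print(n, round(standard_error_bound(n), 4))
# 100 0.1
# 1000 0.0316
# 10000 0.01
```

Near a decision boundary the two Q-values of a state can be very close. Separating them reliably may therefore take thousands of visits to that state.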

In [None]:
"""
Cell 6: Execute Monte Carlo ES Learning

Purpose:
  - Run MC ES for 500,000 episodes
  - Learn optimal Q-values and policy
  - Analyze learned policy statistics

Expected Runtime:
  - Approximately 2-3 minutes on modern hardware
  - Progress updates every 100,000 episodes

Expected Results:
  - Policy should closely match textbook Figure 5.2
  - STICK at high sums (20-21)
  - No usable ace: STICK at 17 and above; for 12-16, HIT against
    dealer 7-A and mostly STICK against dealer 2-6
  - Usable ace: HIT through 17, with the boundary at 18-19
"""

# Run Monte Carlo ES
Q, policy = monte_carlo_es(env, num_episodes=500000)

# ============================================================
# ANALYZE LEARNED POLICY
# ============================================================

# Count action distribution
stick_count = sum(1 for action in policy.values() if action == 0)
hit_count = sum(1 for action in policy.values() if action == 1)
total_states = len(policy)

# Calculate statistics
avg_max_q = np.mean([np.max(q) for q in Q.values()])
avg_min_q = np.mean([np.min(q) for q in Q.values()])

# Display results
pretty_print("Policy Learning Complete - Analysis",
             f"<strong>Policy Statistics:</strong><br>" +
             f"• Total states in policy: {total_states}<br>" +
             f"• STICK actions: {stick_count} ({100*stick_count/total_states:.1f}%)<br>" +
             f"• HIT actions: {hit_count} ({100*hit_count/total_states:.1f}%)<br><br>" +
             f"<strong>Q-Value Statistics:</strong><br>" +
             f"• Average max Q-value: {avg_max_q:.4f}<br>" +
             f"• Average min Q-value: {avg_min_q:.4f}<br><br>" +
             "<strong>Expected Pattern:</strong><br>" +
             "• High sums (20-21): STICK<br>" +
             "• No usable ace, sums 12-16: HIT against dealer 7-A, mostly STICK against 2-6<br>" +
             "• Usable ace: HIT through 17, boundary at 18-19",
             style='result')

In [None]:
"""
Cell 7: Visualize Optimal Value Function

Purpose:
  - Display 3D plots of learned value function
  - Show how values vary across state space
  - Compare usable vs non-usable ace scenarios
"""

pretty_print("Generating 3D Value Function Plots",
             "Creating surface visualization of V*(s) = max_a Q(s,a)...",
             style='info')

# Plot value function
plot_value_function(Q, "Optimal State-Value Function V*")

# Interpretation guide
pretty_print("Value Function Interpretation",
             "<strong>Color Coding:</strong><br>" +
             "• Red (high values): Favorable states, likely to win<br>" +
             "• Blue (low values): Unfavorable states, likely to lose<br>" +
             "• White (mid values): Neutral states<br><br>" +
             "<strong>Key Observations:</strong><br>" +
             "• Peak values near player sum 20-21<br>" +
             "• Lower values at low player sums<br>" +
             "• Usable ace provides higher values (flexibility)<br>" +
             "• Values vary with dealer showing card<br>" +
             "• Dealer weak cards (4-6) → higher player values",
             style='note')

In [None]:
"""
Cell 8: Visualize Optimal Policy

Purpose:
  - Display 2D heatmaps of learned policy
  - Show optimal STICK vs HIT decisions
  - Compare with textbook Figure 5.2

CRITICAL: This uses DISCRETE colors (pcolormesh)
  - No blending between Green (STICK) and Red (HIT)
  - Clear boundaries showing decision thresholds
"""

pretty_print("Generating Optimal Policy Heatmaps",
             "Creating discrete policy visualization...<br>" +
             "Green = STICK, Red = HIT",
             style='info')

# Plot policy
plot_policy(policy, "Optimal Policy π* (Learned via MC ES)")

# Interpretation guide
pretty_print("Policy Interpretation",
             "<strong>Color Coding:</strong><br>" +
             "• Green: STICK (action 0) - Stop drawing cards<br>" +
             "• Red: HIT (action 1) - Draw another card<br><br>" +
             "<strong>Policy Patterns:</strong><br>" +
             "• Decision boundary near sum 17 without a usable ace, near 18-19 with one<br>" +
             "• STICK (green) dominates at high sums<br>" +
             "• HIT (red) dominates at low sums against strong dealer cards<br>" +
             "• More aggressive with usable ace (a hit cannot bust the hand)<br>" +
             "• Policy adapts to dealer showing card<br>" +
             "• Hit more against dealer strong cards (7 through A)<br><br>" +
             "<strong>Textbook Comparison:</strong><br>" +
             "This should closely match Figure 5.2 in Sutton and Barto",
             style='note')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <p style="color: #555; line-height: 1.6; font-size: 13px;">
        <strong>1. Exploring Starts Effectiveness:</strong> Random initial actions gave every state-action pair that can begin an episode a chance to be tried, so no ongoing ε-greedy exploration was needed.<br><br>
        <strong>2. Convergence to Optimality:</strong> The greedy policy settled on a policy that closely matches the textbook result, with clear decision boundaries.<br><br>
        <strong>3. Usable Ace Strategy:</strong> States with a usable ace show higher values and more aggressive hitting, because the ace can fall back to counting as 1 instead of busting the hand.<br><br>
        <strong>4. First-Visit MC Accuracy:</strong> Averaging first-visit returns gave Q-value estimates whose sampling noise shrinks as each state-action pair accumulates visits.<br><br>
        <strong>5. GPI Pattern:</strong> Interleaving policy evaluation and policy improvement (Generalized Policy Iteration) drove the policy toward the optimal one.
    </p>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why does exploring starts guarantee sufficient exploration without ε-greedy?</li>
        <li>How would results differ with every-visit MC instead of first-visit?</li>
        <li>Why is the policy more aggressive (more hitting) with a usable ace?</li>
        <li>What would happen if we used γ &lt; 1 instead of γ = 1?</li>
        <li>How could we modify this for continuous action spaces?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 5-1: Blackjack with Monte Carlo ES</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5-2 - Off-Policy Monte Carlo with Importance Sampling</p>
</div>