<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo ES
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton and Barto Chapter 5 Figure 5.2 | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        This lab implements <strong>Monte Carlo ES (Exploring Starts)</strong> as described in Sutton and Barto Figure 5.2. 
        The algorithm learns a Blackjack policy from sampled episodes, without requiring a model of the environment. The key idea is 
        <strong>Exploring Starts</strong>: each episode begins with a randomly selected state-action pair, so every pair that can 
        start an episode is visited infinitely often as the number of episodes grows. After the initial random action, the agent follows its current greedy policy. 
        Interleaving evaluation and greedy improvement in this way is an instance of Generalized Policy Iteration. In practice it reaches the optimal Blackjack policy, although Sutton and Barto note that convergence of Monte Carlo ES has not been formally proved.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement Monte Carlo ES from Figure 5.2</li>
        <li>Understand exploring starts mechanism</li>
        <li>Apply first-visit MC for Q-value estimation</li>
        <li>Implement greedy policy improvement</li>
        <li>Reproduce textbook Blackjack results</li>
        <li>Visualize value functions and policies</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Sum close to 21 without exceeding</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → 0=Stick (stop), 1=Hit (draw card)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">State</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Can be 1 or 11 (usable if counted as 11)</div>
    </div>
</td>
</tr>
</table>
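
The `State` and `Ace` rules above can be made concrete with a small sketch. The helper names `hand_value` and `to_state` are hypothetical (they are not part of Gymnasium); the sketch assumes aces are dealt as 1 and face cards as 10, and only illustrates how a hand maps to `(player_sum, dealer_card, usable_ace)`.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values (ace = 1, faces = 10)."""
    total = sum(cards)
    # An ace is "usable" if counting one ace as 11 (adding 10) does not bust.
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace


def to_state(player_cards, dealer_showing):
    """Build the (player_sum, dealer_card, usable_ace) tuple used in this lab."""
    player_sum, usable_ace = hand_value(player_cards)
    return (player_sum, dealer_showing, usable_ace)


print(to_state([1, 6], 10))      # (17, 10, True): ace counted as 11
print(to_state([1, 6, 9], 10))   # (16, 10, False): ace falls back to 1
```

Note that hitting on a hand with a usable ace can lower the reported sum, because the ace drops from 11 to 1 instead of busting.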

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup and Dependencies</h2>
</div>

We begin by importing all necessary libraries for the Monte Carlo ES implementation:

- **Gymnasium**: Provides the Blackjack-v1 environment with modern API (replaces deprecated gym)
- **NumPy**: For numerical computations, array operations, and random number generation
- **Matplotlib**: For creating 3D surface plots (value functions) and 2D heatmaps (policies)
- **defaultdict**: Efficient sparse storage for Q-values and returns lists
- **pretty_print utility**: For formatted, color-coded output messages
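
As a rough illustration of the sparse storage mentioned above, the sketch below shows how `defaultdict` creates entries only when a state is first indexed. It uses plain Python lists in place of the NumPy arrays used later in the lab (an assumption made to keep the example dependency-free), and the state and return value are made up for illustration.

```python
from collections import defaultdict

# Q maps a state tuple to a two-element list: [value of STICK, value of HIT].
Q = defaultdict(lambda: [0.0, 0.0])
returns = defaultdict(list)

state = (15, 10, False)   # (player_sum, dealer_card, usable_ace)

# Membership tests do not create entries, so unseen states stay out of Q.
print(state in Q)         # False

# Indexing creates the default entry on first access.
returns[(state, 1)].append(-1.0)
Q[state][1] = sum(returns[(state, 1)]) / len(returns[(state, 1)])
print(Q[state])           # [0.0, -1.0]
print(len(Q))             # 1
```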

In [None]:
"""
Cell 1: Import Libraries and Initialize Environment

PURPOSE:
  - Import all required libraries for MC ES implementation
  - Load pretty_print utility for formatted output
  - Configure matplotlib for high-quality visualizations
  - Create Blackjack-v1 environment

KEY LIBRARIES:
  - gymnasium: Modern RL environment library (v1 API with 5-tuple returns)
  - numpy: Numerical operations, averaging, argmax for greedy selection
  - defaultdict: Auto-initializing dictionaries for sparse Q-value storage
  - matplotlib: 3D surface plots and 2D heatmaps with discrete colors

CONFIGURATION:
  - Figure DPI: 100 (display quality)
  - Figure size: 12x8 inches (default)
  - Font size: 10pt (readable labels)
"""

import sys
import gymnasium as gym  # Modern replacement for OpenAI gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D  # For 3D surface plots
import matplotlib.pyplot as plt
from matplotlib import cm  # Colormaps for visualizations
import warnings
warnings.filterwarnings('ignore')  # Suppress ALL warnings (global, not matplotlib-only)

# Configure matplotlib for publication-quality figures
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

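# Load pretty_print utility from GitHub repository
# NOTE: exec() runs whatever code the URL returns, so only use a source you trust
try:
    import requests
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Treat HTTP errors (e.g. 404) as failures
    exec(response.text)  # Execute utility code to load pretty_print function
    
    pretty_print("Environment Ready", 
                 f"Gymnasium version: {gym.__version__}<br>" +
                 "NumPy, Matplotlib loaded successfully<br>" +
                 "Implementing Monte Carlo ES from Figure 5.2<br>" +
                 "All dependencies ready", 
                 style='success')
except Exception as e:
    # Fallback if GitHub fetch fails: define a plain-text pretty_print
    # so that later cells that call it still run
    def pretty_print(title, message="", style=None):
        print(f"[{title}]")
        print(message.replace("<br>", "\n"))

    print("Libraries loaded successfully")
    print(f"Gymnasium version: {gym.__version__}")
    print(f"Note: pretty_print unavailable ({e})")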

# Create Blackjack environment using Gymnasium v1 API
# v1 uses modern API: reset() returns (state, info), step() returns 5-tuple
env = gym.make('Blackjack-v1')

pretty_print("Blackjack Environment Created",
             f"<strong>Action Space:</strong> {env.action_space.n} actions<br>" +
             "• 0 = Stick (stop drawing cards)<br>" +
             "• 1 = Hit (draw another card)<br><br>" +
             "<strong>State Space:</strong><br>" +
             "• player_sum: 4-21 at the deal (plots show 12-21; below 12 a hit cannot bust)<br>" +
             "• dealer_card: 1-10 (Ace=1, face cards=10)<br>" +
             "• usable_ace: True/False<br><br>" +
             "<strong>Rewards:</strong> Terminal only (+1 win, 0 draw, -1 lose)",
             style='info')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Monte Carlo ES Algorithm - Pseudocode and Implementation</h2>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Figure 5.2: Monte Carlo ES Algorithm from Sutton and Barto</p>
</div>

<div style="background: white; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0;">Algorithm Overview</h3>
    <p style="color: #555; line-height: 1.6; margin: 0 0 10px 0; font-size: 13px;">
        Monte Carlo ES solves the exploration problem through <strong>Exploring Starts</strong>. Each episode begins with a 
        random state-action pair (S₀, A₀), then follows the current greedy policy. This ensures:
    </p>
    <ul style="color: #555; line-height: 1.6; margin: 0 0 10px 0; padding-left: 20px; font-size: 13px;">
        <li>All state-action pairs are explored (random start)</li>
        <li>Policy exploits current knowledge (greedy after start)</li>
        <li>Convergence to optimal policy π* (GPI pattern)</li>
    </ul>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        The algorithm alternates between <strong>policy evaluation</strong> (updating Q-values) and 
        <strong>policy improvement</strong> (making policy greedy), converging to optimality.
    </p>
</div>
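
The evaluation half of this loop rests on first-visit returns. The sketch below shows one way to compute them in a single backward pass. The function name `first_visit_returns` and the toy episode are hypothetical; the episode is deliberately not a Blackjack episode, because it repeats a state-action pair to show that only the earliest occurrence is kept.

```python
def first_visit_returns(episode, gamma=1.0):
    """Return {(state, action): G} using the return from the FIRST visit.

    episode is a list of (state, action, reward) tuples, where reward is R_{t+1}.
    """
    # Index of the first time step at which each (state, action) pair occurs.
    first_index = {}
    for t, (state, action, _) in enumerate(episode):
        first_index.setdefault((state, action), t)

    G = 0.0
    result = {}
    # Walk backwards so G can be accumulated in a single pass.
    for t in range(len(episode) - 1, -1, -1):
        state, action, reward = episode[t]
        G = gamma * G + reward
        if first_index[(state, action)] == t:   # no earlier occurrence
            result[(state, action)] = G
    return result


# Toy episode in which the pair ("s", 0) occurs at t=0 and again at t=2.
episode = [("s", 0, 0.0), ("u", 1, 0.0), ("s", 0, 1.0)]
print(first_visit_returns(episode, gamma=0.5))
```

With `gamma=0.5` the pair `("s", 0)` receives the return measured from t=0 (0.25), not the one from its later occurrence at t=2 (1.0).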

In [None]:
"""
Cell 2: Episode Generation with Exploring Starts

PURPOSE:
  - Generate complete Blackjack episodes using exploring starts mechanism
  - First action is RANDOM (exploring start - ensures exploration)
  - Subsequent actions follow GREEDY policy (exploitation)

EXPLORING STARTS EXPLANATION:
  Without exploring starts, a deterministic greedy policy might never try certain
  actions in certain states. Randomizing the first action means every action is
  tried, infinitely often as episodes → ∞, in every state that env.reset() can deal.
  Note that reset() samples the start state from the card-dealing distribution
  rather than uniformly, so this approximates the textbook's exploring starts.
  No epsilon-greedy or other ongoing exploration mechanism is needed.

ALGORITHM:
  1. Reset environment → get initial state S₀
  2. Select RANDOM first action A₀ (exploring start)
  3. Execute A₀, observe R₁, S₁
  4. Record (S₀, A₀, R₁)
  5. For rest of episode:
     a) Select action from current policy π (greedy)
     b) Execute action, observe reward and next state
     c) Record (state, action, reward)
     d) Continue until termination

PARAMETERS:
  env: Gymnasium Blackjack environment
  policy: Dictionary mapping states to actions (greedy policy)

RETURNS:
  episode: List of (state, action, reward) tuples representing complete episode
"""

def generate_episode_with_exploring_starts(env, policy):
    episode = []
    
    # Initialize episode - get starting state
    state, _ = env.reset()  # v1 API returns (state, info)
    
    # CRITICAL: EXPLORING START - select RANDOM first action
    # This is the KEY innovation that ensures exploration
    # Guarantees all (state, action) pairs are visited
    action = env.action_space.sample()  # Uniform random: 0 or 1
    
    # Execute first action
    # v1 API returns: (next_state, reward, terminated, truncated, info)
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated  # Episode ends on either condition
    
    # Record first step (with exploring start action)
    episode.append((state, action, reward))
    
    # Continue episode following GREEDY policy
    # From this point onward, we exploit our current knowledge
    state = next_state
    while not done:
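        # Get greedy action from current policy
        # If state not in policy yet (early episodes), fall back to a random action
        # (the explicit check avoids drawing a random sample on every step)
        if state in policy:
            action = policy[state]
        else:
            action = env.action_space.sample()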
        
        # Execute greedy action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Record step
        episode.append((state, action, reward))
        
        # Move to next state
        state = next_state
    
    return episode

pretty_print("Episode Generation Ready",
             "<strong>Exploring Starts Mechanism:</strong><br>" +
             "• First action: <strong>Random</strong> (exploration guarantee)<br>" +
             "• Subsequent actions: <strong>Greedy</strong> (exploitation)<br>" +
             "• Ensures all (s,a) pairs visited infinitely often<br>" +
             "• No need for epsilon-greedy or other exploration",
             style='success')

In [None]:
"""
Cell 3: Monte Carlo ES - Main Learning Algorithm

PURPOSE:
  - Implement complete MC ES algorithm from Sutton and Barto Figure 5.2
  - Learn optimal action-value function Q*(s,a)
  - Extract optimal policy π*(s) = argmax_a Q*(s,a)

ALGORITHM (Figure 5.2 - Line by Line):
  Initialize:
    - π(s) ∈ A(s) arbitrarily for all s ∈ S
    - Q(s,a) ∈ ℝ arbitrarily for all s ∈ S, a ∈ A(s)
    - Returns(s,a) ← empty list for all s ∈ S, a ∈ A(s)
  
  Loop forever (for each episode):
    1. Choose S₀ ∈ S, A₀ ∈ A(S₀) randomly (exploring start)
    2. Generate episode from S₀, A₀ following π
    3. G ← 0
    4. Loop for each step of episode, t = T-1, T-2, ..., 0:
       a) G ← γG + R_{t+1}
       b) Unless pair S_t, A_t appears earlier in episode:
          - Append G to Returns(S_t, A_t)
          - Q(S_t, A_t) ← average(Returns(S_t, A_t))
          - π(S_t) ← argmax_a Q(S_t, a)

KEY DATA STRUCTURES:
  - Q: defaultdict(lambda: np.zeros(2))
       Stores Q(s,a) estimates. Auto-initializes to [0, 0] for unseen states.
       Q[state][0] = value of STICK, Q[state][1] = value of HIT
  
  - returns: defaultdict(list)
       Stores all observed returns for each (s,a) pair.
       returns[(state, action)] = [G1, G2, G3, ...]
       Q(s,a) = mean of this list
  
  - policy: dict
       Maps states to greedy actions.
       policy[state] = argmax_a Q[state][a]

PARAMETERS:
  env: Blackjack environment
  num_episodes: Number of episodes to run (default 500,000)

RETURNS:
  Q: Final action-value estimates
  policy: Final greedy policy
"""

def monte_carlo_es(env, num_episodes=500000):
    # Initialize Q(s,a) arbitrarily (to zeros)
    # defaultdict automatically creates zero arrays for new states
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # Initialize Returns(s,a) as empty lists
    # Will store all observed returns for averaging
    returns = defaultdict(list)
    
    # Initialize policy π arbitrarily (will become greedy)
    policy = {}
    
    pretty_print("Starting Monte Carlo ES",
                 f"<strong>Configuration:</strong><br>" +
                 f"• Episodes: {num_episodes:,}<br>" +
                 "• Method: First-Visit MC with Exploring Starts<br>" +
                 "• Discount factor: γ = 1.0 (undiscounted)<br>" +
                 "• Expected runtime: 2-3 minutes<br><br>" +
                 "<strong>This implements Figure 5.2 exactly</strong>",
                 style='warning')
    
    # Main learning loop - iterate over episodes
    for episode_num in range(1, num_episodes + 1):
        # Generate episode using exploring starts
        episode = generate_episode_with_exploring_starts(env, policy)
        
        # Record the first time step at which each (state, action) pair occurs
        # This implements the FIRST-VISIT rule: a pair is updated only at the
        # step where it has no earlier occurrence in the episode
        first_visit_step = {}
        for step, (s, a, _) in enumerate(episode):
            first_visit_step.setdefault((s, a), step)
        
        # Process episode BACKWARDS so each return is accumulated in one pass
        # G accumulates reward from end of episode back to start
        G = 0  # Return (undiscounted since gamma=1 for Blackjack)
        
        # Loop through episode backwards: T-1, T-2, ..., 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            
            # Calculate return: G = γ*G + R_{t+1}
            # Since γ=1 for Blackjack: G = reward + G
            G = reward + G
            
            # Create hashable (state, action) tuple for tracking
            state_action = (state, action)
            
            # FIRST-VISIT CHECK: only update at the earliest occurrence
            # This implements the "Unless S_t,A_t appears earlier" condition
            # (a "seen" set filled while looping backwards would pick the LAST visit)
            if first_visit_step[state_action] == t:
                
                # Append G to Returns(S_t, A_t)
                returns[state_action].append(G)
                
                # Q(S_t, A_t) ← average(Returns(S_t, A_t))
                # This is POLICY EVALUATION step
                Q[state][action] = np.mean(returns[state_action])
                
                # π(S_t) ← argmax_a Q(S_t, a)
                # This is POLICY IMPROVEMENT step (make policy greedy)
                # argmax returns index of maximum value (0 or 1)
                policy[state] = np.argmax(Q[state])
        
        # Progress reporting every 100,000 episodes
        if episode_num % 100000 == 0:
            avg_q = np.mean([np.max(q) for q in Q.values()])
            print(f"Episode {episode_num:,}/{num_episodes:,} | Avg V(s): {avg_q:.3f}")
    
    pretty_print("Monte Carlo ES Complete",
                 f"<strong>Learning Results:</strong><br>" +
                 f"• Processed: {num_episodes:,} episodes<br>" +
                 f"• States learned: {len(Q):,} unique states<br>" +
                 f"• Policy states: {len(policy):,}<br>" +
                 "• Q-values estimated by averaging sampled returns (convergence is not checked)<br>" +
                 f"• Policy is greedy w.r.t. Q",
                 style='success')
    
    return Q, policy

pretty_print("MC ES Algorithm Loaded",
             "<strong>Ready to learn optimal Blackjack policy</strong><br>" +
             "Algorithm matches textbook Figure 5.2 line-by-line<br>" +
             "Uses first-visit MC with exploring starts",
             style='info')
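
Cell 3 recomputes `np.mean(returns[state_action])` over the full list on every update, which gets slower as the lists grow. A common alternative is an incremental mean, Q ← Q + (G − Q)/N, which gives the same average without storing the samples. The sketch below is only an illustration of that identity; the `RunningMean` class is a hypothetical helper, not part of the lab code, and the sample returns are made up.

```python
class RunningMean:
    """Track a sample mean without storing the samples."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # mean_n = mean_{n-1} + (value - mean_{n-1}) / n
        self.mean += (value - self.mean) / self.count
        return self.mean


samples = [1.0, -1.0, -1.0, 0.0, 1.0]   # illustrative returns G
rm = RunningMean()
for g in samples:
    rm.update(g)

batch_mean = sum(samples) / len(samples)
print(abs(rm.mean - batch_mean) < 1e-12)   # True
```

In the lab's setting this would mean keeping one count and one mean per `(state, action)` pair instead of a list of returns.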

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 3: Visualization Functions</h2>
</div>

We create two types of visualizations to understand the learned value function and policy:

**1. 3D Surface Plots (Value Function):**
- Display V*(s) = max_a Q(s,a) as a 3D surface
- X-axis: Player sum (12-21)
- Y-axis: Dealer showing card (1-10, where 1=Ace)
- Z-axis (height) and color: State value
- Blue (cold) = low value (likely to lose)
- Red (warm) = high value (likely to win)
- Separate plots for usable/non-usable ace

**2. 2D Policy Heatmaps (Optimal Policy):**
- Display π*(s) = argmax_a Q(s,a) as discrete color grid
- Green = STICK (action 0)
- Red = HIT (action 1)
- Uses **pcolormesh** for discrete values (no interpolation)
- Black gridlines separate cells
- Separate heatmaps for usable/non-usable ace
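
Both plots are built from the same lookup: for each grid cell, take the maximum Q-value for the surface and the maximizing action for the heatmap. The sketch below shows that lookup in plain Python. The function name `build_grids`, its defaults, and the two Q-values are assumptions made for illustration, not learned results.

```python
def build_grids(Q, usable_ace, default_value=0.0, default_action=1):
    """Return (V, P): 10x10 nested lists indexed [dealer_row][player_col]."""
    player_range = range(12, 22)   # columns: player sums 12..21
    dealer_range = range(1, 11)    # rows: dealer card 1 (ace)..10
    V, P = [], []
    for dealer in dealer_range:
        v_row, p_row = [], []
        for player in player_range:
            state = (player, dealer, usable_ace)
            if state in Q:
                q_stick, q_hit = Q[state]
                v_row.append(max(q_stick, q_hit))
                # Ties go to STICK (0), matching argmax's first-index rule.
                p_row.append(0 if q_stick >= q_hit else 1)
            else:
                # Unvisited states fall back to the plotting defaults.
                v_row.append(default_value)
                p_row.append(default_action)
        V.append(v_row)
        P.append(p_row)
    return V, P


# Made-up Q-values for two states, for illustration only.
Q = {(20, 10, False): [0.4, -0.9], (12, 10, False): [-0.6, -0.4]}
V, P = build_grids(Q, usable_ace=False)
print(V[9][8], P[9][8])   # player 20, dealer 10 -> 0.4 0
print(V[9][0], P[9][0])   # player 12, dealer 10 -> -0.4 1
```

Rows follow the dealer card and columns follow the player sum, which is the same orientation the plotting cells use.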

In [None]:
"""
Cell 4: 3D Value Function Visualization

PURPOSE:
  - Create 3D surface plots of optimal state-value function V*(s)
  - Show how value changes with player sum and dealer card
  - Compare states with and without usable ace

VALUE FUNCTION:
  V*(s) = max_a Q(s,a)
  This is the value of being in state s under the optimal policy

VISUALIZATION DETAILS:
  - Surface height and color represent V*(s)
  - Colormap: coolwarm (blue→white→red)
  - Value range: [-1, +1] (lose to win)
  - Viewing angle: elevation=25°, azimuth=-130°
  - Two subplots: one for each usable_ace condition
"""

def plot_value_function(Q, title="Optimal State-Value Function"):
    def get_Z(player_sum, dealer_card, usable_ace):
        """
        Get optimal value V*(s) = max_a Q(s,a) for a state.
        Returns 0 for unvisited states.
        """
        state = (player_sum, dealer_card, usable_ace)
        if state in Q:
            # V*(s) = max over actions
            return np.max(Q[state])
        return 0  # Default for unvisited states
    
    def create_surface(usable_ace, ax):
        """
        Create 3D surface plot for given usable_ace condition.
        """
        # Define state space ranges
        player_range = np.arange(12, 22)  # 12 to 21
        dealer_range = np.arange(1, 11)   # 1 (Ace) to 10
        
        # Create meshgrid for 3D plotting
        # X[i,j] = player_range[j], Y[i,j] = dealer_range[i]
        X, Y = np.meshgrid(player_range, dealer_range)
        
        # Build value array Z[i,j] = V*(player_range[j], dealer_range[i], usable_ace)
        # Each row corresponds to a dealer card
        # Each column corresponds to a player sum
        Z = np.array([[get_Z(x, y, usable_ace) 
                      for x in player_range]  # Columns: player sums
                     for y in dealer_range])  # Rows: dealer cards
        
        # Create 3D surface plot
        surf = ax.plot_surface(
            X, Y, Z,                # Coordinates and heights
            cmap=cm.coolwarm,       # Blue (cold/bad) to Red (warm/good)
            linewidth=0,            # No wireframe lines
            antialiased=True,       # Smooth rendering
            vmin=-1, vmax=1,        # Value range for consistent color mapping
            alpha=0.8               # Slight transparency
        )
        
        # Configure axes labels and limits
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_zlabel('Value V*(s)', fontsize=11)
        ax.set_zlim(-1, 1)  # Z-axis range
        
        # Set viewing angle for best perspective
        ax.view_init(elev=25, azim=-130)
        
        return surf
    
    # Create figure with two 3D subplots (stacked vertically)
    fig = plt.figure(figsize=(14, 11))
    
    # Subplot 1: With usable ace
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title(f'{title} - WITH Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    
    # Subplot 2: Without usable ace
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title(f'{title} - WITHOUT Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    
    plt.tight_layout()
    plt.show()

pretty_print("3D Value Visualization Ready",
             "<strong>Surface Plot Features:</strong><br>" +
             "• Height = State value V*(s)<br>" +
             "• Color: Blue (low) to Red (high)<br>" +
             "• Two plots: with/without usable ace<br>" +
             "• Smooth surface interpolation",
             style='success')

In [None]:
"""
Cell 5: 2D Policy Heatmap with DISCRETE Colors

PURPOSE:
  - Visualize optimal policy π*(s) = argmax_a Q(s,a) as 2D heatmap
  - Show STICK vs HIT decisions for each state
  - Use discrete colors with NO interpolation

CRITICAL FIX:
  This uses pcolormesh so that every cell is drawn as one solid color.
  - pcolormesh: Each cell is a solid quadrilateral (no blending), with optional edge lines
  - imshow: May blend neighboring values, depending on its interpolation setting
  For policy visualization, we need crisp boundaries between actions.

COLOR CODING:
  - Green = STICK (action 0) - stop drawing cards
  - Red = HIT (action 1) - draw another card
  - Colormap: RdYlGn_r (Red-Yellow-Green reversed)

GRID STRUCTURE:
  - Rows: Dealer showing card (1-10, displayed as A,2,3,...,10)
  - Columns: Player sum (12-21)
  - Black gridlines: Separate each state
"""

def plot_policy(policy, title="Optimal Policy"):
    def get_action(player_sum, dealer_card, usable_ace):
        """
        Get optimal action π*(s) for a state.
        Returns 1 (HIT) as default for unvisited states.
        """
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)  # Default to HIT
    
    def create_heatmap(usable_ace, ax):
        """
        Create discrete policy heatmap for given usable_ace condition.
        """
        # Define state space ranges
        player_range = np.arange(12, 22)  # 12-21
        dealer_range = np.arange(1, 11)   # 1-10 (Ace to 10)
        
        # Build policy grid Z[i,j] = action
        # Rows = dealer cards, Columns = player sums
        Z = np.array([[get_action(player, dealer, usable_ace)
                      for player in player_range]  # Columns
                     for dealer in dealer_range])  # Rows
        
        # CRITICAL: Use pcolormesh for DISCRETE values
        # This ensures no interpolation between action values
        # shading='nearest' centers one solid-color cell on each (player, dealer)
        # coordinate; shading='flat' would need 11 cell edges per axis for a 10x10 grid
        im = ax.pcolormesh(
            player_range,           # X coordinates (player sums, cell centers)
            dealer_range,           # Y coordinates (dealer cards, cell centers)
            Z,                      # Action values (0 or 1)
            cmap='RdYlGn_r',        # Colormap: Red=Hit(1), Green=Stick(0)
            edgecolors='black',     # Black gridlines between cells
            linewidth=0.5,          # Gridline thickness
            vmin=0, vmax=1,         # Action value range [0,1]
            shading='nearest'       # One solid cell per grid point (discrete colors)
        )
        
        # Configure axes ticks and labels
        ax.set_xticks(player_range)
        ax.set_yticks(dealer_range)
        # Display 'A' for Ace (dealer_card=1), then 2,3,...,10
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        
        # Set equal aspect ratio (square cells)
        ax.set_aspect('equal')
        
        # Add colorbar with discrete action labels
        # Ticks at 0.25 and 0.75 to center labels in color regions
        cbar = plt.colorbar(im, ax=ax, ticks=[0.25, 0.75], 
                           fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK (0)', 'HIT (1)'])
    
    # Create figure with two subplots side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Heatmap 1: With usable ace
    ax1.set_title(f'{title} - WITH Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    
    # Heatmap 2: Without usable ace
    ax2.set_title(f'{title} - WITHOUT Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    
    plt.tight_layout()
    plt.show()

pretty_print("2D Policy Visualization Ready",
             "<strong>Discrete Policy Heatmaps:</strong><br>" +
             "• Uses pcolormesh for crisp boundaries<br>" +
             "• Green = STICK (0), Red = HIT (1)<br>" +
             "• No color interpolation (discrete actions)<br>" +
             "• Black gridlines separate cells<br>" +
             "• Equal aspect ratio (square cells)",
             style='success')

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 4: Run Monte Carlo ES Experiment</h2>
</div>

Now we execute the complete learning process. We run 500,000 episodes, which is typically enough for:
1. The state-action pairs shown in the plots to be visited many times
2. Q-value estimates to settle close to the true values Q*(s,a)
3. The greedy policy to closely approximate the optimal policy π*(s)
4. Results to closely match the textbook figures

The learning process typically takes 2-3 minutes on modern hardware. We'll see progress updates every 100,000 episodes.
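
The first point in the list above can be checked rather than assumed by counting visits per state-action pair. The sketch below assumes you have access to a dictionary shaped like the `returns` structure inside `monte_carlo_es` (which, as written, is not returned by that function); `coverage_report` is a hypothetical helper and the toy input is made up.

```python
def coverage_report(returns, min_visits=100):
    """Summarise how often each plotted (state, action) pair was visited."""
    counts = []
    for player in range(12, 22):
        for dealer in range(1, 11):
            for usable_ace in (False, True):
                for action in (0, 1):
                    key = ((player, dealer, usable_ace), action)
                    # .get avoids creating entries if returns is a defaultdict
                    counts.append(len(returns.get(key, [])))
    return {
        "pairs": len(counts),
        "min_visits": min(counts),
        "under_threshold": sum(c < min_visits for c in counts),
    }


# Toy input: a single pair with three recorded returns.
toy_returns = {((20, 10, False), 0): [1.0, 1.0, -1.0]}
print(coverage_report(toy_returns, min_visits=1))
# {'pairs': 400, 'min_visits': 0, 'under_threshold': 399}
```

A large `under_threshold` count after training would suggest that some cells in the plots rest on very few samples.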

In [None]:
"""
Cell 6: Execute Monte Carlo ES Learning

PURPOSE:
  - Run MC ES algorithm for 500,000 episodes
  - Learn optimal Q-values and policy
  - Analyze learned policy statistics

EXPECTED BEHAVIOR:
  - Policy should converge to optimal Blackjack strategy
  - Stick more often at high sums (19-21)
  - Hit more often at low sums (12-16)
  - Different strategies with/without usable ace
  - Results should match textbook Figure 5.2
"""

# Run Monte Carlo ES with 500,000 episodes
Q, policy = monte_carlo_es(env, num_episodes=500000)

# Analyze learned policy statistics
# Count how many states have each action as optimal
stick_count = sum(1 for action in policy.values() if action == 0)
hit_count = sum(1 for action in policy.values() if action == 1)
total_states = len(policy)

# Calculate average value across all states
avg_value = np.mean([np.max(q) for q in Q.values()])

pretty_print("Learning Complete - Policy Analysis",
             f"<strong>Policy Statistics:</strong><br>" +
             f"• Total states in policy: {total_states}<br>" +
             f"• Average state value: {avg_value:.3f}<br><br>" +
             f"<strong>Action Distribution:</strong><br>" +
             f"• STICK (action 0): {stick_count} states ({100*stick_count/total_states:.1f}%)<br>" +
             f"• HIT (action 1): {hit_count} states ({100*hit_count/total_states:.1f}%)<br><br>" +
             f"<strong>Expected Pattern:</strong><br>" +
             "• More STICK at high player sums (20-21)<br>" +
             "• More HIT at low player sums (12-16)<br>" +
             "• Decision boundary around sum 17-19<br>" +
             "• Different behavior with usable ace",
             style='result')

In [None]:
"""
Cell 7: Visualize Optimal Value Function

PURPOSE:
  - Display 3D surface plots of learned value function V*(s)
  - Show how values vary across state space
  - Compare usable vs non-usable ace scenarios

INTERPRETATION GUIDE:
  - Red peaks: Best states (high probability of winning)
  - Blue valleys: Worst states (high probability of losing)
  - Gradient: How value changes with player sum and dealer card
"""

pretty_print("Generating 3D Value Function Plots",
             "Creating surface plots of V*(s) = max_a Q(s,a)<br>" +
             "This may take a moment to render...",
             style='info')

# Generate 3D surface plots
plot_value_function(Q, "Optimal State-Value Function V*")

pretty_print("Value Function Interpretation",
             "<strong>Color Coding:</strong><br>" +
             "• 🔴 Red (high): Favorable states, likely to win<br>" +
             "• 🔵 Blue (low): Unfavorable states, likely to lose<br>" +
             "• White: Neutral states<br><br>" +
             "<strong>Key Observations:</strong><br>" +
             "• Peak values near player sum 20-21 (close to winning)<br>" +
             "• Lower values at low sums (far from 21)<br>" +
             "• Usable ace provides higher values (flexibility to hit)<br>" +
             "• Values vary with dealer showing card strength",
             style='note')

In [None]:
"""
Cell 8: Visualize Optimal Policy

PURPOSE:
  - Display 2D heatmaps of learned optimal policy π*(s)
  - Show STICK vs HIT decisions with discrete colors
  - Compare with textbook Figure 5.2 results

INTERPRETATION GUIDE:
  - Look for clear decision boundaries
  - Policy should stick at high sums, hit at low sums
  - Usable ace allows more aggressive hitting
"""

pretty_print("Generating Optimal Policy Heatmaps",
             "Creating discrete policy visualizations<br>" +
             "<strong>Green = STICK</strong>, <strong>Red = HIT</strong>",
             style='info')

# Generate 2D policy heatmaps with discrete colors
plot_policy(policy, "Optimal Policy π* (from MC ES)")

pretty_print("Policy Interpretation",
             "<strong>Policy Patterns (should match Figure 5.2):</strong><br>" +
             "• Clear boundary around player sum 17-20<br>" +
             "• STICK (green) dominates at high sums (20-21)<br>" +
             "• HIT (red) dominates at low sums (12-16)<br>" +
             "• Transition zone at middle sums (17-19)<br>" +
             "• More aggressive with usable ace (a single hit cannot bust)<br>" +
             "• Adapts to dealer showing card (weaker dealer = more stick)<br><br>" +
             "<strong>✓ This should match Sutton and Barto Figure 5.2</strong>",
             style='note')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings and Insights</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Exploring Starts Effectiveness:</strong> Random initial actions provide exploration of the state-action pairs without requiring an ongoing exploration mechanism such as epsilon-greedy. Because every pair that can start an episode keeps being sampled, no action is permanently ruled out by an early, inaccurate Q-value estimate.</p>
        <p><strong>2. Policy Convergence:</strong> The greedy policy should closely match the textbook result, with clear decision boundaries. The policy shows rational behavior: stick at high sums to avoid busting, hit at low sums to improve the hand.</p>
        <p><strong>3. Usable Ace Impact:</strong> States with a usable ace show higher values and a more aggressive hitting strategy. Because the ace can drop from 11 to 1, a single hit cannot bust the hand, so the player can afford more risk.</p>
        <p><strong>4. First-Visit MC Estimation:</strong> For a fixed policy, averaging returns from first visits to each (s,a) pair gives an unbiased estimate of its action value. In MC ES the policy changes between episodes, so the averages mix returns from earlier policies, yet in practice the estimates approach Q*(s,a) as the number of episodes grows.</p>
        <p><strong>5. Generalized Policy Iteration Pattern:</strong> The interleaved pattern of policy evaluation (updating Q-values) and policy improvement (making the policy greedy) is the defining characteristic of GPI. For Monte Carlo ES, Sutton and Barto note that convergence to the optimal policy seems inevitable but has not been formally proved.</p>
        <p><strong>6. Computational Efficiency:</strong> Despite running 500,000 episodes, the algorithm completed in 2-3 minutes due to efficient numpy operations and the simplicity of the Blackjack environment.</p>
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why does exploring starts solve the exploration problem without needing epsilon-greedy?</li>
        <li>How would the algorithm behave if we used every-visit MC instead of first-visit?</li>
        <li>What would happen if we used a discount factor gamma less than 1.0 for Blackjack?</li>
        <li>Could we implement MC ES in continuing (non-episodic) tasks? Why or why not?</li>
        <li>How does the variance of Q-value estimates decrease as we collect more episodes?</li>
        <li>Why is the decision boundary different between usable and non-usable ace states?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center; border-radius: 8px;">
    <p style="margin: 0; font-size: 13px; font-weight: 600;">End of Lab 5-1: Monte Carlo ES for Blackjack</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5-2 - Off-Policy Monte Carlo with Importance Sampling</p>
</div>