<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo Methods
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 5 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo methods learn directly from episodes of experience without requiring a model of the environment. 
        The sampling technique itself dates to work by Stanislaw Ulam and colleagues at Los Alamos in the 1940s; in RL, these methods are particularly effective 
        for episodic tasks. This lab implements the <strong>First-Visit Monte Carlo</strong> algorithm on the classic 
        Blackjack problem from Sutton & Barto (2018), Example 5.1. We explore how Monte Carlo methods estimate value 
        functions through repeated sampling and averaging of returns.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand Monte Carlo prediction methods</li>
        <li>Implement First-Visit MC algorithm</li>
        <li>Learn from sampled episodes of experience</li>
        <li>Estimate action-value functions Q(s,a)</li>
        <li>Visualize value functions and policies</li>
        <li>Work with Gymnasium environments (the successor to OpenAI Gym)</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get sum close to 21 without exceeding</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Hit (draw card) or Stick (stop)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Counts as 1 or 11 (usable when it can count as 11 without busting)</div>
    </div>
</td>
</tr>
</table>

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup and Dependencies</h2>
</div>

We begin by importing the necessary libraries for our Monte Carlo implementation. The key libraries are:
- **Gymnasium**: Provides the Blackjack-v1 environment (successor to OpenAI Gym)
- **NumPy**: For numerical computations and array operations
- **Matplotlib**: For creating visualizations of value functions and policies
- **Collections**: For efficient data structures like defaultdict

In [None]:
"""
Cell 1: Import Required Libraries

Purpose:
  - Import all necessary libraries for Monte Carlo implementation
  - Configure matplotlib for publication-quality visualizations
  - Suppress warnings for cleaner output

Key Libraries:
  - gymnasium: RL environment (Blackjack-v1)
  - numpy: Numerical operations and array handling
  - matplotlib: 3D and 2D plotting
  - defaultdict: Efficient sparse state storage
"""

import gymnasium as gym  # Modern replacement for gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

# Configure matplotlib for better quality figures
plt.rcParams['figure.dpi'] = 100          # Display resolution
plt.rcParams['figure.figsize'] = (12, 8)  # Default figure size
plt.rcParams['font.size'] = 10            # Font size for labels

print("✓ Libraries imported successfully")
print(f"✓ Gymnasium version: {gym.__version__}")

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Creating the Blackjack Environment</h2>
</div>

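The Blackjack environment simulates the card game with simplified rules. The state space consists of three components:
1. **Player sum**: Current sum of the player's cards. Sutton & Barto restrict attention to 12-21, because hitting below 12 can never bust; the Gymnasium environment can also report lower sums at the start of a hand.
2. **Dealer card** (1-10): The dealer's visible card (1 = Ace, 10 = ten or face card)
3. **Usable ace** (True/False, may be reported as 1/0): Whether the player holds an ace counted as 11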

The action space has two actions: Stick (0) to stop taking cards, or Hit (1) to draw another card. Rewards are given only at episode termination: +1 for winning, 0 for drawing, and -1 for losing.
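
To make the usable-ace component concrete, here is a minimal pure-Python sketch of how a hand could be scored. The helper names `has_usable_ace` and `hand_sum` are illustrative and are not part of the lab code; cards are assumed to be integers 1-10, with 1 standing for an ace.

```python
def has_usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def hand_sum(hand):
    """Best total for the hand: count one ace as 11 when that is safe."""
    return sum(hand) + 10 if has_usable_ace(hand) else sum(hand)


# Ace + 6 is a "soft" 17: the ace counts as 11.
soft_hand = [1, 6]
# Drawing a 10 forces the ace back to 1, giving a "hard" 17.
hard_hand = [1, 6, 10]

print(hand_sum(soft_hand), has_usable_ace(soft_hand))  # 17 True
print(hand_sum(hard_hand), has_usable_ace(hard_hand))  # 17 False
```

The second hand shows why a usable ace matters for strategy: a hit that would otherwise bust only converts the soft total into a hard one.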

In [None]:
"""
Cell 2: Initialize Blackjack-v1 Environment

Purpose:
  - Create the Blackjack environment using Gymnasium
  - Verify environment properties (state/action spaces)
  - Demonstrate initial state generation

Environment Details:
  - State: (player_sum, dealer_card, usable_ace)
    * player_sum: 12-21 in Sutton & Barto (Gymnasium may also report sums below 12)
    * dealer_card: 1-10 (Ace through face cards)
    * usable_ace: Boolean (True if ace counts as 11)
  - Actions: 0=Stick, 1=Hit
  - Rewards: {+1, 0, -1} given at episode end only
"""

# Create environment using Gymnasium API
env = gym.make('Blackjack-v1')  # Use v1 (v0 deprecated)

print(f"Environment: Blackjack-v1")
print(f"Action space: {env.action_space}")
print(f"Number of actions: {env.action_space.n}")
print("\nActions:")
print("  0 = Stick (stop drawing cards)")
print("  1 = Hit (draw another card)")

# Reset environment to get initial state
sample_state, _ = env.reset()  # Gymnasium reset() returns (observation, info)
print(f"\nSample initial state: {sample_state}")
print(f"  Player sum: {sample_state[0]}")
print(f"  Dealer showing: {sample_state[1]}")
print(f"  Usable ace: {sample_state[2]}")

---
<div style="border-left: 4px solid #6f42c1; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #6f42c1; margin: 0; font-size: 18px;">Section 3: Monte Carlo ES Algorithm Overview</h2>
</div>

<div style="background: white; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #6f42c1;">
    <h3 style="color: #6f42c1; font-size: 14px; margin: 0 0 8px 0;">Monte Carlo with Exploring Starts</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo ES uses <strong>Exploring Starts</strong> to ensure comprehensive exploration of the state-action space. 
        Each episode begins with a randomly chosen state-action pair, so every pair has a nonzero probability of being selected. After the 
        initial random selection, the agent follows its current policy for the remainder of the episode. This approach 
        addresses the exploration problem while the policy is improved greedily after each episode. The code in this lab does not use exploring starts: it evaluates a fixed stochastic policy (Section 4), and the ES algorithm is shown here as context for Monte Carlo control.
    </p>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #6f42c1; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Figure: Monte Carlo ES Algorithm from Sutton & Barto</p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: #e8f5e9; padding: 12px 15px; border-left: 3px solid #4caf50; vertical-align: top; width: 50%;">
    <h4 style="color: #2e7d32; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Algorithm Steps</h4>
    <ol style="color: #555; line-height: 1.6; margin: 0; padding-left: 20px; font-size: 12px;">
        <li><strong>Exploring Start:</strong> Choose random (S₀, A₀) pair</li>
        <li><strong>Generate Episode:</strong> Follow current policy π from S₁ onward</li>
        <li><strong>Calculate Returns:</strong> Compute G for each visited (s,a)</li>
        <li><strong>Update Q-values:</strong> Average all returns for each (s,a) pair</li>
        <li><strong>Policy Improvement:</strong> Make policy greedy: π(s) ← argmax Q(s,a)</li>
    </ol>
</td>
<td style="background: #fff3e0; padding: 12px 15px; border-left: 3px solid #ff9800; vertical-align: top; width: 50%;">
    <h4 style="color: #e65100; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Why Exploring Starts?</h4>
    <p style="color: #555; font-size: 12px; line-height: 1.6; margin: 0 0 8px 0;">
        Without exploring starts, a deterministic policy might never visit certain state-action pairs, 
        preventing optimal value estimation. Random initialization ensures every (s,a) pair has non-zero 
        probability of being explored.
    </p>
    <p style="color: #555; font-size: 12px; line-height: 1.6; margin: 0;">
        <strong>Key Guarantee:</strong> All state-action pairs are visited infinitely often as episodes → ∞
    </p>
</td>
</tr>
</table>
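
The first and last steps of the list above can be sketched in a few lines. This is a hedged illustration only: `random_start` and `improve_policy` are hypothetical helper names, the state ranges follow the Sutton & Barto formulation (player sum 12-21), and starting a Gymnasium episode from a chosen state would need extra work that is not shown here.

```python
import random


def random_start(rng):
    """Sample a (state, action) pair uniformly for an exploring start."""
    state = (
        rng.randint(12, 21),        # player sum
        rng.randint(1, 10),         # dealer's showing card (1 = ace)
        rng.choice([False, True]),  # usable ace
    )
    action = rng.choice([0, 1])     # 0 = stick, 1 = hit
    return state, action


def improve_policy(Q, policy, visited_states):
    """Make the policy greedy with respect to Q at the visited states."""
    for s in visited_states:
        q_stick, q_hit = Q[s]
        policy[s] = 0 if q_stick >= q_hit else 1  # ties go to stick
    return policy


rng = random.Random(0)  # seeded only to make the sketch reproducible
s0, a0 = random_start(rng)

Q = {s0: [-0.2, 0.1]}   # made-up action values for illustration
policy = improve_policy(Q, {}, [s0])
print(s0, a0, policy[s0])
```

Because every (state, action) pair can be drawn as a start, no pair is starved of samples even when the policy itself is deterministic.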

---
<div style="border-left: 4px solid #28a745; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #28a745; margin: 0; font-size: 18px;">Section 4: Stochastic Policy for Exploration</h2>
</div>

In this implementation, we use an **arbitrary stochastic policy** to generate learning episodes. It doubles as our exploration mechanism, because both actions have nonzero probability in every state. The policy is threshold-based:
- When player sum > 18: Prefer to stick (80% probability) to avoid busting
- When player sum ≤ 18: Prefer to hit (80% probability) to get closer to 21

This is NOT the optimal policy we're trying to find. Rather, it is a reasonable exploration strategy that visits diverse states and actions. The Q-values learned this way estimate the value of each action *under this arbitrary policy*. Acting greedily with respect to them is a single policy-improvement step: the resulting policy is at least as good as the arbitrary one, but it is not guaranteed to be optimal.

In [None]:
"""
Cell 3: Generate Episodes Using Arbitrary Stochastic Policy

Purpose:
  - Play complete Blackjack episodes for data collection
  - Use arbitrary threshold-based policy for exploration
  - Record (state, action, reward) tuples for each step

Policy Definition (ARBITRARY - not optimal):
  - If player_sum > 18:
      P(Stick) = 0.8, P(Hit) = 0.2  (conservative)
  - If player_sum ≤ 18:
      P(Stick) = 0.2, P(Hit) = 0.8  (aggressive)

Why Arbitrary Policy?
  - Provides reasonable exploration
  - Ensures we visit diverse state-action pairs
  - We'll learn Q-values from these episodes
  - Then extract an improved policy via greedy selection

Returns:
  episode: List of (state, action, reward) tuples
"""

def play_episode_arbitrary_policy(env):
    episode = []
    state, _ = env.reset()  # Start new game
    
    while True:
        # Define action probabilities based on player's sum
        if state[0] > 18:
            # High sum: mostly stick to avoid busting
            action_probs = [0.8, 0.2]  # [P(stick), P(hit)]
        else:
            # Low sum: mostly hit to approach 21
            action_probs = [0.2, 0.8]  # [P(stick), P(hit)]
        
        # Sample action from probability distribution
        action = np.random.choice([0, 1], p=action_probs)
        
        # Execute action in environment (Gymnasium step() returns a 5-tuple)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # Episode ends on either
        
        # Store experience tuple
        episode.append((state, action, reward))
        
        # Update state for next iteration
        state = next_state
        
        if done:
            break  # Episode complete
    
    return episode

# Test episode generation
sample_episode = play_episode_arbitrary_policy(env)
print(f"Sample episode length: {len(sample_episode)} steps")
print(f"Final reward: {sample_episode[-1][2]}")
print(f"\nFirst 3 steps:")
for i, (state, action, reward) in enumerate(sample_episode[:3]):
    action_name = "Stick" if action == 0 else "Hit"
    print(f"  {i+1}. State={state}, Action={action_name}, Reward={reward}")

---
<div style="border-left: 4px solid #dc3545; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #dc3545; margin: 0; font-size: 18px;">Section 5: First-Visit Monte Carlo Q-Value Updates</h2>
</div>

The core of Monte Carlo learning is the update of Q-values based on observed returns. We implement the **First-Visit MC** approach:

**First-Visit Rule:** For each (state, action) pair, only the FIRST occurrence in an episode is used for updates. Subsequent visits to the same pair are ignored.
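
As a small illustration of the rule, the sketch below finds the time steps that count as first visits in a made-up episode where one pair repeats. The name `first_visit_indices` and the episode data are assumptions for illustration. In Blackjack itself, Sutton & Barto note that the same state never recurs within an episode, so first-visit and every-visit MC coincide there.

```python
def first_visit_indices(episode):
    """Time steps at which each (state, action) pair first appears."""
    seen = set()
    first = []
    for t, (state, action, _reward) in enumerate(episode):
        if (state, action) not in seen:
            seen.add((state, action))
            first.append(t)
    return first


# Generic toy episode: the pair ("A", 0) occurs at t=0 and again at t=2
toy_episode = [("A", 0, 0), ("B", 1, 0), ("A", 0, 1)]
print(first_visit_indices(toy_episode))  # [0, 1]
```

Only the returns that follow steps 0 and 1 would enter the averages; the repeat at step 2 is skipped.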

**Return Calculation:** From time t when (s,a) is first visited:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$
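
As a worked instance of this formula, the returns for every time step of an episode can be computed in one backward pass using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_returns` and the example rewards are illustrative assumptions; γ = 0.5 is chosen only so the arithmetic is easy to follow (the lab itself uses γ = 1).

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, G_1, ..., G_{T-1}] for rewards [R_1, R_2, ..., R_T]."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backward so each G_t reuses G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


# Two hits with no reward, then a winning stick (+1)
print(discounted_returns([0, 0, 1], gamma=0.5))  # [0.25, 0.5, 1.0]
# Undiscounted: every step shares the final outcome
print(discounted_returns([0, 0, 1], gamma=1.0))  # [1.0, 1.0, 1.0]
```

With γ = 1 and a single terminal reward, every step of a Blackjack episode receives the same return, namely the outcome of the hand.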

**Q-value Update:** The action-value is the average of all observed returns:
$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G^{(i)}(s,a)$$

where $G^{(i)}(s,a)$ is the return following the first visit to (s,a) in the i-th episode that contains it, and N(s,a) is the number of such first visits across all episodes.
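
The same average can also be maintained without storing a running sum, using the incremental form Q ← Q + (G − Q)/N. The sketch below checks, on made-up returns, that the incremental update matches the batch average; `incremental_mean` is an illustrative name and is not part of the lab code.

```python
def incremental_mean(returns):
    """Average a stream of returns one sample at a time."""
    q, n = 0.0, 0
    for G in returns:
        n += 1
        q += (G - q) / n   # Q <- Q + (G - Q) / N
    return q


sample_returns = [1, -1, -1, 1, 1, 0, -1, 1]  # made-up episode outcomes
batch = sum(sample_returns) / len(sample_returns)
online = incremental_mean(sample_returns)

print(batch, online)
assert abs(batch - online) < 1e-12
```

Both forms give the same estimate; the incremental one needs only the current Q and the count per (s,a).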

In [None]:
"""
Cell 4: Update Q-Values Using First-Visit Monte Carlo

Purpose:
  - Update action-value estimates Q(s,a) from episode experience
  - Implement FIRST-VISIT rule (only first occurrence counts)
  - Maintain running averages of returns

Algorithm:
  1. For each (state, action) in episode:
     - Check if this is FIRST visit to (s,a)
     - If yes: calculate return G from this point forward
     - Update running sum and count
     - Compute new average: Q(s,a) = sum(returns) / count

Parameters:
  episode: List of (state, action, reward) tuples
  Q: Action-value estimates (dict of arrays)
  returns_sum: Cumulative sum of returns for each (s,a)
  N: Visit counts for each (s,a)
  gamma: Discount factor (1.0 for undiscounted)
"""

def update_Q(episode, Q, returns_sum, N, gamma=1.0):
    # Track which (s,a) pairs we've already processed (first-visit)
    visited = set()
    
    # Process each step in the episode
    for t, (state, action, reward) in enumerate(episode):
        sa_pair = (state, action)  # Create hashable pair
        
        # First-visit check: only process if not seen before
        if sa_pair not in visited:
            visited.add(sa_pair)  # Mark as visited
            
            # Calculate return G_t from time t onwards
            # G_t = r_{t+1} + γ*r_{t+2} + γ²*r_{t+3} + ...
            G = sum((gamma ** k) * r 
                    for k, (_, _, r) in enumerate(episode[t:]))
            
            # Update cumulative sum of returns
            returns_sum[state][action] += G
            
            # Increment visit counter
            N[state][action] += 1.0
            
            # Update Q-value as running average
            # Q(s,a) = average of all returns from (s,a)
            Q[state][action] = returns_sum[state][action] / N[state][action]

print("✓ Q-value update function ready")

In [None]:
"""
Cell 5: Monte Carlo Prediction - Main Learning Loop

Purpose:
  - Run multiple episodes to learn Q-values
  - Aggregate experience across many games
  - Estimate Q(s,a) through averaging

Process:
  1. Initialize Q, returns_sum, and visit counts (N)
  2. For each episode:
     - Generate episode using arbitrary policy
     - Update Q-values using first-visit MC
  3. Return learned Q-values

Data Structures:
  - Q: defaultdict storing Q(s,a) estimates
  - returns_sum: Cumulative returns for averaging
  - N: Visit counts for each (s,a) pair

Returns:
  Q: Dictionary mapping states to action-value arrays
"""

def mc_predict(env, num_episodes, gamma=1.0):
    # Initialize data structures using defaultdict for sparse storage
    # defaultdict automatically creates entries as needed
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    print(f"Starting MC prediction with {num_episodes:,} episodes...\n")
    
    # Main learning loop
    for i_episode in range(1, num_episodes + 1):
        # Generate episode using arbitrary exploration policy
        episode = play_episode_arbitrary_policy(env)
        
        # Update Q-values based on observed returns
        update_Q(episode, Q, returns_sum, N, gamma)
        
        # Progress reporting every 50k episodes
        if i_episode % 50000 == 0:
            print(f"Episode {i_episode:,}/{num_episodes:,}")
    
    print("\n✓ Monte Carlo prediction complete")
    return Q

print("✓ MC prediction function ready")

---
<div style="border-left: 4px solid #ffc107; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #e0a800; margin: 0; font-size: 18px;">Section 6: Visualization Functions</h2>
</div>

We create two types of visualizations to understand the learned value function and policy:

**3D Surface Plots:** Display state values V(s) as a function of player sum and dealer showing card. The height and color of the surface represent the expected value of being in that state. We create separate plots for states with and without a usable ace, as the ace significantly affects strategy.

**2D Policy Heatmaps:** Show the greedy action (Stick or Hit) for each state using color coding. Green indicates Stick (action 0) and Red indicates Hit (action 1). These heatmaps provide an intuitive view of the decision boundaries implied by the learned Q-values.
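
Both plot types start from the learned Q table. As a rough sketch, a state value under a stochastic policy is the probability-weighted average of that state's action values, while the plotted policy picks the larger one. The names `state_value` and `greedy_action` and the numbers below are illustrative assumptions, not results from the lab.

```python
def state_value(action_values, action_probs):
    """V(s) = sum_a pi(a|s) * Q(s,a) for one state."""
    return sum(p * q for p, q in zip(action_probs, action_values))


def greedy_action(action_values):
    """Index of the largest action value (0 = stick, 1 = hit)."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Made-up Q-values for one state: [Q(s, stick), Q(s, hit)]
q_s = [-0.5, 0.25]

# Value under a policy that hits 80% of the time
print(state_value(q_s, [0.2, 0.8]))
# The greedy action is the one with the larger Q-value
print(greedy_action(q_s))  # 1
```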

In [None]:
"""
Cell 6: Create 3D Surface Plots for Value Functions

Purpose:
  - Visualize state values V(s) as 3D surfaces
  - Show how value varies with player sum and dealer card
  - Create separate plots for usable/non-usable ace

Visualization Details:
  - X-axis: Player sum (12-21)
  - Y-axis: Dealer showing card (1-10, where 1=Ace)
  - Z-axis: State value V(s)
  - Color: Blue (low value) to Red (high value)

Function: plot_blackjack_values(V)
  Input: V = dictionary of state values
  Output: Two 3D surface plots displayed
"""

def plot_blackjack_values(V):
    def get_Z(player_sum, dealer_card, usable_ace):
        """Lookup value for state, return 0 if not in V"""
        state = (player_sum, dealer_card, usable_ace)
        return V.get(state, 0)
    
    def create_surface(usable_ace, ax):
        """Create 3D surface plot for given usable_ace condition"""
        # Define state space ranges
        player_range = np.arange(12, 22)  # 12 to 21
        dealer_range = np.arange(1, 11)   # 1 (Ace) to 10
        
        # Create meshgrid: X[i,j]=player_sum, Y[i,j]=dealer_card
        X, Y = np.meshgrid(player_range, dealer_range)
        
        # Build Z array: Z[i,j] = V(player_range[j], dealer_range[i], usable_ace)
        Z = np.array([[get_Z(x, y, usable_ace) 
                      for x in player_range]    # Columns: player sums
                     for y in dealer_range])    # Rows: dealer cards
        
        # Create 3D surface plot
        surf = ax.plot_surface(
            X, Y, Z,                    # Coordinates and heights
            cmap=cm.coolwarm,           # Blue to red colormap
            linewidth=0,                # No wireframe
            antialiased=True,           # Smooth rendering
            vmin=-1, vmax=1,            # Value range for colors
            alpha=0.8                   # Slight transparency
        )
        
        # Configure axes and labels
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_zlabel('Value', fontsize=11)
        ax.set_zlim(-1, 1)
        ax.view_init(elev=25, azim=-130)  # Viewing angle
        return surf
    
    # Create figure with two subplots
    fig = plt.figure(figsize=(14, 11))
    
    # Plot 1: States WITH usable ace
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title('State Values WITH Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    
    # Plot 2: States WITHOUT usable ace
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title('State Values WITHOUT Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    
    plt.tight_layout()
    plt.show()

print("✓ 3D value function plotting ready")

In [None]:
"""
Cell 7: Create 2D Policy Heatmaps (FIXED FOR DISCRETE VALUES)

Purpose:
  - Visualize policy π(s) as 2D heatmaps
  - Show which action is optimal for each state
  - Use discrete colors: Green=Stick, Red=Hit

CRITICAL FIX:
  - Use pcolormesh instead of imshow for discrete values
  - Ensures crisp boundaries between actions
  - No interpolation between policy decisions

Visualization Details:
  - X-axis: Player sum (12-21)
  - Y-axis: Dealer card (Ace, 2-10)
  - Color: Green = STICK (0), Red = HIT (1)

Function: plot_policy(policy)
  Input: policy = dictionary mapping states to actions
  Output: Two 2D heatmaps displayed side-by-side
"""

def plot_policy(policy):
    def get_action(player_sum, dealer_card, usable_ace):
        """Lookup action for state, default to Hit if not in policy"""
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)  # Default: hit
    
    def create_heatmap(usable_ace, ax):
        """Create discrete 2D heatmap for given usable_ace condition"""
        # Define state space
        player_range = np.arange(12, 22)  # 12-21
        dealer_range = np.arange(1, 11)   # 1-10 (Ace to 10)
        
        # Build policy grid: Z[i,j] = action
        # Rows = dealer cards, Columns = player sums
        Z = np.array([[get_action(player, dealer, usable_ace)
                      for player in player_range]
                     for dealer in dealer_range])
        
        # CRITICAL: Use pcolormesh for discrete values (not imshow)
        # player_range/dealer_range are cell centers with the same shape as Z,
        # so use shading='nearest'; shading='flat' expects cell edges (one more
        # than Z along each axis) and raises a TypeError in recent Matplotlib
        im = ax.pcolormesh(
            player_range,               # X coordinates (cell centers)
            dealer_range,               # Y coordinates (cell centers)
            Z,                          # Action values (0 or 1)
            cmap='RdYlGn_r',           # Red=Hit(1), Green=Stick(0)
            edgecolors='black',         # Black grid lines
            linewidth=0.5,              # Grid line thickness
            vmin=0, vmax=1,             # Discrete action values
            shading='nearest'           # One flat-colored cell per state
        )
        
        # Configure ticks and labels
        ax.set_xticks(player_range)
        ax.set_yticks(dealer_range)
        # Display 'A' for Ace (value 1)
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        
        # Add colorbar with discrete labels
        cbar = plt.colorbar(im, ax=ax, ticks=[0.25, 0.75], 
                           fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK (0)', 'HIT (1)'])
        
        # Set aspect ratio to square
        ax.set_aspect('equal')
        
        return im
    
    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Heatmap 1: Policy WITH usable ace
    ax1.set_title('Policy WITH Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    
    # Heatmap 2: Policy WITHOUT usable ace  
    ax2.set_title('Policy WITHOUT Usable Ace', 
                  fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    
    plt.tight_layout()
    plt.show()

print("✓ 2D policy heatmap plotting ready (DISCRETE)")

---
<div style="border-left: 4px solid #007bff; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #007bff; margin: 0; font-size: 18px;">Section 7: Running Monte Carlo Experiments</h2>
</div>

Now we execute the complete Monte Carlo learning process. This section demonstrates the distinction between two key concepts:

**Arbitrary Exploration Policy:** The stochastic threshold-based policy we defined earlier is used to GENERATE episodes and collect experience. This policy explores the environment but is not necessarily optimal. It serves as our data collection mechanism.

**Greedy Policy Extraction:** After learning Q-values from the arbitrary policy's experiences, we select the action with the highest Q-value in each state: π'(s) = argmax_a Q(s,a). Because Q estimates action values under the arbitrary policy, π' is an improved policy rather than a guaranteed optimum; repeating evaluation and improvement (Monte Carlo control) is what drives the policy toward π*. The code keeps the name `optimal_policy` for this greedy policy.

The learning process flows as: Exploration Policy → Generate Episodes → Learn Q-values → Extract Greedy Policy

In [None]:
"""
Cell 8: Execute Monte Carlo Learning - Main Experiment

Purpose:
  - Run large-scale MC prediction (500k episodes)
  - Learn Q(s,a) using arbitrary exploration policy
  - Extract a greedy (improved) policy from the learned Q-values
  - Compute state values under the arbitrary policy

CRITICAL CONCEPTS:

TWO POLICIES IN PLAY:
  1. ARBITRARY POLICY (exploration):
     - Used to GENERATE episodes
     - Stochastic threshold-based
     - Ensures diverse experience
     - NOT what we're trying to find
  
  2. GREEDY POLICY (exploitation):
     - EXTRACTED from learned Q-values
     - Greedy: π'(s) = argmax_a Q(s,a)
     - Deterministic best action under the learned Q
     - One improvement step over the arbitrary policy;
       stored as `optimal_policy`, though optimality is not guaranteed

LEARNING FLOW:
  Arbitrary Policy → Episodes → Q-values → Greedy Policy
  (exploration)    (data)     (learning)  (improved policy)
"""

NUM_EPISODES = 500000

print("="*70)
print("PHASE 1: LEARNING Q-VALUES")
print("="*70)
print(f"Episodes to run: {NUM_EPISODES:,}")
print("Method: First-Visit Monte Carlo")
print("Exploration: Arbitrary stochastic policy")
print("Goal: Learn Q(s,a) for all state-action pairs\n")

# Run Monte Carlo prediction to learn Q-values
Q = mc_predict(env, NUM_EPISODES)

print("\n" + "="*70)
print("PHASE 2: POLICY EXTRACTION")
print("="*70)

# Compute state values V(s) under ARBITRARY policy
# V(s) = Σ_a π(a|s) * Q(s,a) where π is arbitrary policy
V_arbitrary = {}
for state, action_values in Q.items():
    if state[0] > 18:
        # Arbitrary policy: 80% stick, 20% hit
        V_arbitrary[state] = 0.8 * action_values[0] + 0.2 * action_values[1]
    else:
        # Arbitrary policy: 20% stick, 80% hit
        V_arbitrary[state] = 0.2 * action_values[0] + 0.8 * action_values[1]

# Extract the greedy policy with respect to the learned Q-values
# π'(s) = argmax_a Q(s,a) for each state (one policy-improvement step)
optimal_policy = {}
for state, action_values in Q.items():
    # Select action with highest Q-value (greedy); cast to a plain int
    optimal_policy[state] = int(np.argmax(action_values))

print("✓ Greedy policy extracted via: π'(s) = argmax_a Q(s,a)\n")

# Analyze learned policy
states_count = len(Q)
stick_count = sum(1 for a in optimal_policy.values() if a == 0)
hit_count = sum(1 for a in optimal_policy.values() if a == 1)

print("="*70)
print("RESULTS SUMMARY")
print("="*70)
print(f"\nLearning Statistics:")
print(f"  States explored: {states_count}")
print(f"  Average state value: {np.mean(list(V_arbitrary.values())):.4f}")
print(f"\nGreedy Policy Composition:")
print(f"  States where greedy action is STICK: {stick_count} ({100*stick_count/states_count:.1f}%)")
print(f"  States where greedy action is HIT:   {hit_count} ({100*hit_count/states_count:.1f}%)")
print(f"\nWhat to check:")
print(f"  The greedy policy should stick more often at higher sums")
print(f"  Verify this against the policy heatmaps in Cell 10")
print("="*70)

In [None]:
"""
Cell 9: Visualize Learned Value Function

Purpose:
  - Display 3D surface plots of state values
  - Show how value changes with player sum and dealer card
  - Compare states with/without usable ace

Interpretation Guide:
  - Red (high values): Favorable states, likely to win
  - Blue (low values): Unfavorable states, likely to lose
  - Peak near sum 20-21: Best winning positions
  - Valley at low sums: Poor positions needing improvement
"""

print("Generating 3D value function plots...\n")
plot_blackjack_values(V_arbitrary)

print("\n" + "="*70)
print("VALUE FUNCTION INTERPRETATION")
print("="*70)
print("Color Coding:")
print("  🔴 Red (high): Favorable states with positive expected return")
print("  🔵 Blue (low): Unfavorable states with negative expected return")
print("\nKey Observations:")
print("  • Peak values near player sum 20-21 (close to winning)")
print("  • Lower values when the dealer shows an Ace or 10 (strong dealer cards)")
print("  • Usable ace provides more flexibility and generally higher values")
print("="*70)

In [None]:
"""
Cell 10: Visualize Optimal Policy (DISCRETE COLORS)

Purpose:
  - Display 2D heatmaps of optimal policy
  - Show STICK vs HIT decisions for each state
  - Use discrete colors (no blending)

Color Coding:
  - 🟢 Green = STICK (action 0)
  - 🔴 Red = HIT (action 1)

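Interpretation Guide:
  - Decision boundary typically falls around sum 17-20
  - More aggressive hitting with usable ace (a hit cannot bust immediately)
  - Adapts to dealer's showing card
  - Greedy with respect to the arbitrary policy's Q-values, so it may
    differ from the known optimal Blackjack strategy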
"""

print("Generating optimal policy heatmaps (discrete)...\n")
plot_policy(optimal_policy)

print("\n" + "="*70)
print("OPTIMAL POLICY INTERPRETATION")
print("="*70)
print("Color Coding:")
print("  🟢 Green = STICK (action 0) - Stop drawing cards")
print("  🔴 Red = HIT (action 1) - Draw another card")
print("\nPolicy Patterns to Look For:")
print("  • Threshold around player sum 17-20")
print("  • More conservative without usable ace (risk of busting)")
print("  • More aggressive with usable ace (flexibility)")
print("  • Adapts based on dealer's showing card")
print("  • May differ from expert strategy: this is one greedy improvement step")
print("="*70)

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Policy Learning:</strong> Used an arbitrary exploration policy to generate episodes, then extracted a greedy policy from the learned Q-values. This is one policy-improvement step, not a guarantee of optimality.</p>
        <p><strong>2. Exploration vs Exploitation:</strong> The arbitrary policy provides exploration during learning; the greedy policy is purely exploitative at decision time.</p>
        <p><strong>3. Usable Ace Impact:</strong> The learned strategy differs noticeably with a usable ace, because the ace can revert from 11 to 1 instead of busting.</p>
        <p><strong>4. Decision Boundaries:</strong> A threshold emerges around sum 17-20 for the stick/hit decision, adapting to the dealer's card.</p>
        <p><strong>5. Monte Carlo Strength:</strong> Model-free learning directly from sampled episodes estimates action values without knowledge of the environment dynamics.</p>
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why do we need an exploration policy if we are trying to find the optimal policy?</li>
        <li>What would happen if we used a purely greedy policy from the start?</li>
        <li>How does First-Visit MC differ from Every-Visit MC in terms of bias and variance?</li>
        <li>Why is Monte Carlo particularly suitable for Blackjack compared to Dynamic Programming?</li>
        <li>How could we implement epsilon-greedy exploration instead of arbitrary policy?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 5-1: Blackjack with Monte Carlo Methods</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5-2 - Monte Carlo Control</p>
</div>