<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo Methods
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 5 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo methods learn directly from sampled episodes of experience, without requiring a model of the environment's dynamics.
        They are named after the random-sampling technique that <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" style="color: #17a2b8;">Stanislaw Ulam</a>
        and colleagues developed at Los Alamos in the 1940s, and they suit episodic tasks, where each episode ends with a well-defined return. This lab implements the
        <strong>First-Visit Monte Carlo</strong> algorithm on the classic Blackjack problem from
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>, Example 5.1.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand Monte Carlo prediction methods</li>
        <li>Implement First-Visit MC algorithm</li>
        <li>Learn from sampled episodes of experience</li>
        <li>Estimate action-value functions Q(s,a)</li>
        <li>Visualize value functions and policies</li>
        <li>Work with Gymnasium (formerly OpenAI Gym) environments</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get a card sum as close to 21 as possible without exceeding it</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Hit (draw card) or Stick (stop)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Counts as 1 or 11 (usable when it counts as 11 without busting)</div>
    </div>
</td>
</tr>
</table>
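
The ace rule above can be made concrete with a small helper. This is only an illustrative sketch: `hand_value` is a hypothetical function written for this note, not part of Gymnasium. It assumes cards are stored as integers 1 to 10, with an ace stored as 1 and face cards as 10.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values.

    Cards are integers 1..10; an ace is stored as 1, face cards as 10.
    (Hypothetical helper for illustration; not a Gymnasium function.)
    """
    total = sum(cards)
    # An ace is "usable" when counting it as 11 (adding 10) does not bust.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


print(hand_value([1, 6]))      # (17, True): the ace counts as 11
print(hand_value([1, 6, 9]))   # (16, False): the ace falls back to 1
print(hand_value([10, 5, 7]))  # (22, False): bust
```

The first two elements of the lab's state tuple, `player_sum` and `usable_ace`, correspond to the pair returned here.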

## Section 1: Environment Setup and Dependencies

In [None]:
"""
Cell 1: Import Libraries
"""
import gymnasium as gym  # Gymnasium, the maintained successor to OpenAI Gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

# Configure matplotlib
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✓ All libraries imported successfully")
print(f"✓ Gymnasium version: {gym.__version__}")

## Section 2: Creating the Blackjack Environment

In [None]:
"""
Cell 2: Initialize Blackjack Environment (v1)
"""
# Blackjack-v1 is the environment ID registered in Gymnasium
env = gym.make('Blackjack-v1')

print(f"Environment: Blackjack-v1")
print(f"Action space: {env.action_space}")
print(f"Number of actions: {env.action_space.n}")
print("\nActions: 0 = Stick (stop), 1 = Hit (draw card)")

# Demonstrate environment
sample_state, _ = env.reset()
print(f"\nSample initial state: {sample_state}")
print(f"  Player sum: {sample_state[0]}")
print(f"  Dealer showing: {sample_state[1]}")
print(f"  Usable ace: {sample_state[2]}")

## Section 3: Monte Carlo ES Algorithm Overview

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px;">
    <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
        Monte Carlo with Exploring Starts (MC ES)
    </h2>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Algorithm for Finding Optimal Policy π ≈ π*
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0;">Algorithm Overview</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
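        Monte Carlo ES uses <strong>Exploring Starts</strong> so that every state-action pair has a nonzero probability of being visited.
        Each episode begins with a <strong>randomly chosen</strong> state-action pair and then follows the current policy.
        Alternating this evaluation with greedy policy improvement is designed to move the policy toward the optimal one. The code cells in this lab take a simpler route: a fixed stochastic policy (Section 4) supplies the exploration instead of random starts.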
    </p>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Figure: Monte Carlo ES Algorithm from Sutton & Barto</p>
</div>

<table style="width: 100%; border-spacing: 12px; margin-top: 20px;">
<tr>
<td style="background: #e8f5e9; padding: 12px 15px; border-left: 3px solid #4caf50; vertical-align: top; width: 50%;">
    <h4 style="color: #2e7d32; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Key Steps</h4>
    <ol style="color: #555; line-height: 1.6; margin: 0; padding-left: 20px; font-size: 12px;">
        <li><strong>Exploring Start:</strong> Random (S₀, A₀) pair</li>
        <li><strong>Generate Episode:</strong> Follow current policy π</li>
        <li><strong>Calculate Returns:</strong> G for each (s,a)</li>
        <li><strong>Update Q:</strong> Average returns for each pair</li>
        <li><strong>Policy Improvement:</strong> π(s) ← argmax Q(s,a)</li>
    </ol>
</td>
<td style="background: #fff3e0; padding: 12px 15px; border-left: 3px solid #ff9800; vertical-align: top; width: 50%;">
    <h4 style="color: #e65100; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Why Exploring Starts?</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <p style="margin: 0 0 8px 0;">Without exploring starts, a deterministic policy might never visit certain state-action pairs, preventing us from learning their true values.</p>
        <p style="margin: 0;"><strong>Solution:</strong> Start each episode with a random (s,a) pair to ensure comprehensive exploration of the state-action space.</p>
    </div>
</td>
</tr>
</table>
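
The five key steps can be sketched end to end on a tiny stand-in problem. Exploring starts require beginning an episode from a chosen state-action pair, which the `env.reset()` calls used in this lab do not do, so the sketch below assumes a made-up "toy" hit-or-stick game rather than Blackjack. Every name in it (`toy_step`, `mc_es`, `MAX_SUM`) is hypothetical and the reward scheme is invented for illustration.

```python
import random
from collections import defaultdict

STICK, HIT = 0, 1
MAX_SUM = 9                      # toy bust limit (assumption, not real Blackjack)
STATES = list(range(1, MAX_SUM + 1))


def toy_step(total, action, rng):
    """One step of the toy game: returns (next_total, reward, done)."""
    if action == STICK:
        return total, total / MAX_SUM, True   # sticking pays total / MAX_SUM
    total += rng.randint(1, 3)                # hitting draws a card worth 1..3
    if total > MAX_SUM:
        return total, -1.0, True              # bust
    return total, 0.0, False


def mc_es(num_episodes, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(lambda: [0, 0])
    policy = {s: HIT for s in STATES}         # arbitrary initial policy

    for _ in range(num_episodes):
        # Step 1. Exploring start: random (S0, A0) pair
        state, action = rng.choice(STATES), rng.choice([STICK, HIT])

        # Step 2. Generate the rest of the episode with the current policy
        episode, done = [], False
        while not done:
            next_state, reward, done = toy_step(state, action, rng)
            episode.append((state, action, reward))
            if not done:
                state, action = next_state, policy[next_state]

        # Steps 3-5. Walk backward: return, Q update, greedy improvement
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G += r                            # undiscounted return (gamma = 1)
            earlier = {(s_k, a_k) for s_k, a_k, _ in episode[:t]}
            if (s, a) not in earlier:         # first-visit check
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]   # running average
                policy[s] = max((STICK, HIT), key=lambda act: Q[s][act])
    return Q, policy


Q_toy, policy_toy = mc_es(num_episodes=20000)
for s in STATES:
    print(s, "Stick" if policy_toy[s] == STICK else "Hit")
```

In this toy game, hitting at the maximum sum always busts, so the learned policy sticks there; the remaining states depend on the sampled returns.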

## Section 4: Stochastic Policy for Exploration

In [None]:
"""
Cell 3: Episode Generation with Arbitrary Stochastic Policy

IMPORTANT CONCEPT - ARBITRARY vs OPTIMAL POLICY:
==================================================

1. ARBITRARY POLICY (used during learning):
   - A simple, reasonable policy we start with
   - NOT optimal, but provides exploration
   - In this code: threshold-based stochastic policy
     * If player_sum > 18: 80% stick, 20% hit
     * If player_sum ≤ 18: 20% stick, 80% hit
   - Purpose: Generate episodes to learn Q-values

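2. OPTIMAL POLICY (approximated from Q-values):
   - The policy we are ultimately after
   - Here: the greedy policy with respect to the learned Q-values
   - π'(s) = argmax_a Q(s,a)
   - Q is estimated for the arbitrary policy, so this greedy policy is an
     improved policy (one policy-improvement step), not guaranteed to be π*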

LEARNING PROCESS:
   Arbitrary Policy → Generate Episodes → Learn Q-values → Extract Optimal Policy
"""

def play_episode_arbitrary_policy(env):
    """
    Play complete episode using ARBITRARY stochastic policy.
    This is NOT the optimal policy - it's our exploration policy.
    
    Returns:
        episode: List of (state, action, reward) tuples
    """
    episode = []
    state, _ = env.reset()  # reset() returns (observation, info)
    
    while True:
        # ARBITRARY POLICY DEFINITION:
        # Simple threshold-based probabilities for exploration
        if state[0] > 18:
            # High sum: prefer to stick (conservative)
            action_probs = [0.8, 0.2]  # [P(stick), P(hit)]
        else:
            # Low sum: prefer to hit (aggressive)
            action_probs = [0.2, 0.8]
        
        # Sample action from arbitrary policy
        action = np.random.choice([0, 1], p=action_probs)
        
        # Execute action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        episode.append((state, action, reward))
        state = next_state
        
        if done:
            break
    
    return episode

# Test episode generation
print("Testing arbitrary policy episode generation...\n")
sample_episode = play_episode_arbitrary_policy(env)
print(f"Episode length: {len(sample_episode)} steps")
print(f"Final reward: {sample_episode[-1][2]}")
print(f"\nFirst 3 steps:")
for i, (state, action, reward) in enumerate(sample_episode[:3]):
    action_name = "Stick" if action == 0 else "Hit"
    print(f"  Step {i+1}: State={state}, Action={action_name}, Reward={reward}")

## Section 5: First-Visit Monte Carlo Q-Value Updates

In [None]:
"""
Cell 4: Q-Value Update Function
"""
def update_Q(episode, Q, returns_sum, N, gamma=1.0):
    """
    Update Q-values using First-Visit Monte Carlo.
    
    For each (state, action) pair in episode:
      1. Check if this is the FIRST visit to this pair
      2. Calculate return G from this point forward
      3. Update running average: Q(s,a) = mean(all returns)
    """
    visited = set()
    
    for t, (state, action, reward) in enumerate(episode):
        sa_pair = (state, action)
        
        # First-visit check
        if sa_pair not in visited:
            visited.add(sa_pair)
            
            # Calculate return G from time t
            G = sum((gamma ** k) * r for k, (_, _, r) in enumerate(episode[t:]))
            
            # Update statistics
            returns_sum[state][action] += G
            N[state][action] += 1.0
            
            # Update Q as running average
            Q[state][action] = returns_sum[state][action] / N[state][action]

print("✓ Q-value update function ready")
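
`update_Q` recomputes the discounted sum from scratch at every time step, which is fine for Blackjack's short episodes. For longer episodes, the same returns can be obtained in a single backward pass using the recursion G_t = R_{t+1} + γ·G_{t+1}. The sketch below shows this; `discounted_returns` is a hypothetical helper, not part of the lab code.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode's reward sequence."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backward so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


print(discounted_returns([0, 0, 1], gamma=1.0))  # [1.0, 1.0, 1.0]
print(discounted_returns([0, 0, 1], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With `gamma=1.0`, as used throughout this lab, every step of a Blackjack episode receives the episode's final reward as its return.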

In [None]:
"""
Cell 5: Monte Carlo Prediction Loop
"""
def mc_predict(env, num_episodes, gamma=1.0):
    """
    Monte Carlo prediction for Q-value estimation.
    Uses arbitrary policy for exploration.
    """
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    print(f"Starting Monte Carlo prediction with {num_episodes:,} episodes...\n")
    
    for i_episode in range(1, num_episodes + 1):
        episode = play_episode_arbitrary_policy(env)
        update_Q(episode, Q, returns_sum, N, gamma)
        
        if i_episode % 50000 == 0:
            print(f"Episode {i_episode:,}/{num_episodes:,}")
    
    print("\n✓ Monte Carlo prediction complete")
    return Q

print("✓ MC prediction function ready")
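
The lab keeps `returns_sum` and `N` and divides them to get each Q-value. An equivalent form updates the mean in place, Q_n = Q_{n-1} + (G_n - Q_{n-1}) / n, and needs only the count. The following sketch compares the two on made-up returns; `incremental_mean` is a hypothetical helper name.

```python
import math


def incremental_mean(values):
    """Running average computed without storing the sum of values."""
    mean, n = 0.0, 0
    for g in values:
        n += 1
        mean += (g - mean) / n   # move the estimate toward the new sample
    return mean


sample_returns = [1.0, -1.0, -1.0, 0.0, 1.0]   # made-up returns, for illustration
batch_mean = sum(sample_returns) / len(sample_returns)
print(math.isclose(incremental_mean(sample_returns), batch_mean, abs_tol=1e-12))
```

Replacing `1 / n` with a constant step size gives the recency-weighted update used by later methods, at the cost of no longer being an exact average.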

## Section 6: Visualization Functions

In [None]:
"""
Cell 6: Visualization Functions

- 3D surface plots of the state-value function
- 2D policy heatmaps with axes aligned to (player sum, dealer card)
- Fixed color scales so the two panels are directly comparable
"""

def plot_blackjack_values(V):
    """
    Create 3D surface plots of state values, with and without a usable ace.
    """
    def get_Z(player_sum, dealer_card, usable_ace):
        state = (player_sum, dealer_card, usable_ace)
        return V.get(state, 0)
    
    def create_surface(usable_ace, ax):
        # Create meshgrid
        player_range = np.arange(12, 22)  # 12 to 21
        dealer_range = np.arange(1, 11)   # Ace(1) to 10
        X, Y = np.meshgrid(player_range, dealer_range)
        
        # Compute Z values
        Z = np.array([[get_Z(x, y, usable_ace) 
                      for x in player_range] 
                     for y in dealer_range])
        
        # Create surface
        surf = ax.plot_surface(X, Y, Z, 
                               cmap=cm.coolwarm,
                               linewidth=0,
                               antialiased=True,
                               vmin=-1, vmax=1,
                               alpha=0.8)
        
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_zlabel('Value', fontsize=11)
        ax.set_zlim(-1, 1)
        ax.view_init(elev=25, azim=-130)
        
        return surf
    
    # Create figure
    fig = plt.figure(figsize=(14, 11))
    
    # With usable ace
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title('State Values WITH Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    
    # Without usable ace
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title('State Values WITHOUT Usable Ace', 
                  fontsize=13, fontweight='bold', pad=15)
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    
    plt.tight_layout()
    plt.show()


def plot_policy(policy):
    """
    Create 2D policy heatmaps, with and without a usable ace.

    Rows are the dealer's showing card (Ace at the bottom) and columns are
    the player sum; states missing from the policy default to HIT.
    """
    def get_action(player_sum, dealer_card, usable_ace):
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)  # Default: hit
    
    def create_heatmap(usable_ace, ax):
        # Define ranges
        player_range = range(12, 22)  # 12-21
        dealer_range = range(1, 11)   # 1-10 (Ace to 10)
        
        # Create policy grid (rows=dealer, cols=player)
        Z = np.array([[get_action(player, dealer, usable_ace)
                      for player in player_range]
                     for dealer in dealer_range])
        
        # Create heatmap
        im = ax.imshow(Z, 
                       cmap='RdYlGn_r',  # Red=Hit, Green=Stick
                       aspect='auto',
                       vmin=0, vmax=1,
                       extent=[11.5, 21.5, 0.5, 10.5],
                       origin='lower',
                       interpolation='nearest')
        
        # Set ticks
        ax.set_xticks(range(12, 22))
        ax.set_yticks(range(1, 11))
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        
        # Labels
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        
        # Grid
        ax.grid(True, color='black', linewidth=0.5, alpha=0.3)
        ax.set_axisbelow(False)
        
        # Colorbar
        cbar = plt.colorbar(im, ax=ax, ticks=[0, 1], fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK', 'HIT'])
        
        return im
    
    # Create figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # With usable ace
    ax1.set_title('Policy WITH Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    
    # Without usable ace
    ax2.set_title('Policy WITHOUT Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    
    plt.tight_layout()
    plt.show()

print("✓ Visualization functions ready")

## Section 7: Running Monte Carlo Experiments

In [None]:
"""
Cell 7: Execute Monte Carlo Learning

CRITICAL EXPLANATION - TWO POLICIES:
=====================================

In this cell, we work with TWO different policies:

1. ARBITRARY POLICY (for learning):
   --------------------------------
   - The stochastic policy we use to GENERATE episodes
   - Defined in play_episode_arbitrary_policy()
   - Threshold-based: prefer stick if sum>18, hit if sum≤18
   - Purpose: Explore the environment to learn Q-values
   - This is NOT what we're trying to find!

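2. OPTIMAL POLICY (what we're approximating):
   ------------------------------------------
   - The GREEDY policy extracted from learned Q-values
   - Defined as: π'(s) = argmax_a Q(s,a)
   - Deterministic: always picks the highest-valued action
   - Q here belongs to the arbitrary policy, so the result is an improved
     policy; repeating evaluate-then-improve (as MC ES does) approaches π*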

PROCESS FLOW:
   Arbitrary Policy → Episodes → Q-values → Optimal Policy
   (exploration)     (data)    (learning)  (solution)

ANALOGY:
   - Arbitrary policy = "practice games" with exploration
   - Q-values = knowledge learned from practice
   - Optimal policy = "tournament strategy" using learned knowledge
"""

# Run Monte Carlo prediction
NUM_EPISODES = 500000

print("="*60)
print("LEARNING PHASE: Using ARBITRARY policy for exploration")
print("="*60)
print(f"Episodes: {NUM_EPISODES:,}")
print("Arbitrary policy: Threshold-based stochastic")
print("Goal: Learn Q(s,a) values\n")

# Learn Q-values using arbitrary policy
Q = mc_predict(env, NUM_EPISODES)

print("\n" + "="*60)
print("EXTRACTION PHASE: Deriving OPTIMAL policy from Q-values")
print("="*60)

# Convert Q-values to state values under arbitrary policy
# V(s) = Σ π(a|s) * Q(s,a) for arbitrary policy
V_arbitrary = {}
for state, action_values in Q.items():
    if state[0] > 18:
        # Arbitrary policy: 80% stick, 20% hit
        V_arbitrary[state] = 0.8 * action_values[0] + 0.2 * action_values[1]
    else:
        # Arbitrary policy: 20% stick, 80% hit
        V_arbitrary[state] = 0.2 * action_values[0] + 0.8 * action_values[1]

# Extract OPTIMAL policy (greedy w.r.t. Q)
# This is the policy we're trying to find!
optimal_policy = {}
for state, action_values in Q.items():
    # Select action with highest Q-value (greedy)
    optimal_policy[state] = np.argmax(action_values)
    
print("\n✓ Optimal policy extracted via greedy selection")
print("  π*(s) = argmax_a Q(s,a) for each state\n")

# Analysis
states_count = len(Q)
stick_count = sum(1 for a in optimal_policy.values() if a == 0)
hit_count = sum(1 for a in optimal_policy.values() if a == 1)

print("="*60)
print("RESULTS SUMMARY")
print("="*60)
print(f"States explored: {states_count}")
print(f"\nOptimal Policy Composition:")
print(f"  STICK states: {stick_count} ({100*stick_count/states_count:.1f}%)")
print(f"  HIT states:   {hit_count} ({100*hit_count/states_count:.1f}%)")
print(f"\nAverage state value: {np.mean(list(V_arbitrary.values())):.4f}")
print("\nInterpretation:")
print("  - Learned Q-values from arbitrary policy episodes")
print("  - Extracted optimal policy by always choosing best action")
print("  - Optimal policy is deterministic and greedy")
print("="*60)
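
The two conversions in this cell, V(s) = Σ π(a|s)·Q(s,a) and the greedy choice argmax_a Q(s,a), can be checked by hand on a couple of states. The Q-values below are made up for illustration and are not results from the lab; `behavior_probs`, `state_value`, and `greedy_action` are hypothetical helper names.

```python
def behavior_probs(state):
    """[P(stick), P(hit)] under the lab's threshold policy."""
    return [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]


def state_value(state, q_values):
    """V(s) = sum over actions of pi(a|s) * Q(s,a)."""
    return sum(p * q for p, q in zip(behavior_probs(state), q_values))


def greedy_action(q_values):
    """Index of the largest Q-value (0 = Stick, 1 = Hit)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values [Q(s, stick), Q(s, hit)], for illustration only.
Q_example = {
    (20, 10, False): [0.4, -0.8],
    (13, 2, False): [-0.3, -0.25],
}

for state, q_values in Q_example.items():
    action = "Stick" if greedy_action(q_values) == 0 else "Hit"
    print(state, round(state_value(state, q_values), 2), action)
```

For the first state, V = 0.8·0.4 + 0.2·(-0.8) = 0.16 and the greedy action is Stick; for the second, V = 0.2·(-0.3) + 0.8·(-0.25) = -0.26 and the greedy action is Hit.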

In [None]:
"""
Cell 8: Visualize Value Function
"""
print("Generating 3D value function plots...\n")
plot_blackjack_values(V_arbitrary)

print("\nValue Function Interpretation:")
print("  • Red (high): Favorable states likely to win")
print("  • Blue (low): Unfavorable states likely to lose")
print("  • Peak near sum 20-21: Best positions")
print("  • Usable ace provides more flexibility")

In [None]:
"""
Cell 9: Visualize OPTIMAL Policy
"""
print("Generating optimal policy heatmaps...\n")
plot_policy(optimal_policy)

print("\nOptimal Policy Interpretation:")
print("  • Green = STICK (action 0)")
print("  • Red = HIT (action 1)")
print("  • Clear threshold around sum 17-20")
print("  • More aggressive hitting with usable ace")
print("  • Adapts to dealer's showing card")

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Policy Learning:</strong> We used an arbitrary exploration policy to generate episodes, then extracted a greedy policy from the learned Q-values.</p>
        <p><strong>2. Exploration vs Exploitation:</strong> The arbitrary policy provides exploration, while the extracted policy is purely exploitative (greedy).</p>
        <p><strong>3. Usable Ace Impact:</strong> The strategy differs noticeably with a usable ace: hitting is more attractive because a single hit cannot bust while the ace can still fall back to 1.</p>
        <p><strong>4. Decision Boundaries:</strong> A clear threshold emerges around sum 17-20 for the stick/hit decision.</p>
        <p><strong>5. Monte Carlo Strength:</strong> Model-free learning works directly from sampled experience. Because Q was estimated for the fixed arbitrary policy, the greedy policy is an improvement on it; iterating evaluation and improvement is what converges toward optimal behavior.</p>
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why do we need an exploration policy if we're trying to find the optimal policy?</li>
        <li>What would happen if we used a purely greedy policy from the start?</li>
        <li>How does First-Visit MC differ from Every-Visit MC?</li>
        <li>Why is MC better suited to Blackjack than Dynamic Programming?</li>
        <li>How could we implement ε-greedy exploration instead of the arbitrary policy?</li>
    </ol>
</div>
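
As one possible starting point for question 5, ε-greedy action selection can be sketched as follows. This is a minimal illustration, not the lab's method: `epsilon_greedy_action` is a hypothetical helper, and how ε is chosen or decayed is left open.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.5]          # made-up [Q(s, stick), Q(s, hit)]
picks = [epsilon_greedy_action(q_values, epsilon=0.2, rng=rng) for _ in range(10000)]
print(f"Fraction of greedy picks: {picks.count(1) / len(picks):.3f}")
```

With two actions and ε = 0.2, the greedy action is chosen with probability 1 - ε + ε/2 = 0.9, since the random branch also lands on it half the time.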