<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo Methods
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 5 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo methods learn directly from episodes of experience without requiring a model of the environment. 
        The underlying sampling idea dates to the work of Stanislaw Ulam and John von Neumann at Los Alamos in the 1940s; in RL, these methods are particularly effective 
        for episodic tasks. This lab implements the <strong>First-Visit Monte Carlo</strong> algorithm on the classic 
        Blackjack problem from Sutton & Barto (2018), Example 5.1. We explore how Monte Carlo methods estimate value 
        functions through repeated sampling and averaging of returns.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand Monte Carlo prediction methods</li>
        <li>Implement First-Visit MC algorithm</li>
        <li>Learn from sampled episodes of experience</li>
        <li>Estimate action-value functions Q(s,a)</li>
        <li>Visualize value functions and policies</li>
        <li>Work with Gymnasium environments (the successor to OpenAI Gym)</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get sum close to 21 without exceeding</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Hit (draw card) or Stick (stop)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Counts as 1 or 11 (usable when it counts as 11 without busting)</div>
    </div>
</td>
</tr>
</table>

## Section 1: Environment Setup and Dependencies

We begin by importing the necessary libraries for our Monte Carlo implementation. The key libraries are:
- **Gymnasium**: The maintained successor to OpenAI Gym; provides the Blackjack-v1 environment
- **NumPy**: For numerical computations and array operations
- **Matplotlib**: For creating visualizations of value functions and policies
- **Collections**: For efficient data structures like defaultdict

In [None]:
import sys
import gymnasium as gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully")
print(f"Gymnasium version: {gym.__version__}")

## Section 2: Creating the Blackjack Environment

The Blackjack environment simulates the card game with simplified rules. The state space consists of three components:
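1. **Player sum** (12-21): Current sum of the player's cards. The analysis follows Sutton & Barto in focusing on 12-21, where the hit/stick decision matters; Gymnasium can also report smaller sums, for which hitting can never bust.
2. **Dealer card** (1-10): The dealer's visible card (1 = Ace; 10 = a ten or any face card)
3. **Usable ace** (True/False): Whether the player holds an ace counted as 11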

The action space has two actions: Stick (0) to stop taking cards, or Hit (1) to draw another card. Rewards are given only at episode termination: +1 for winning, 0 for drawing, and -1 for losing.
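
To make the usable-ace component of the state concrete, here is a minimal sketch of how a hand could be scored. The helper name `hand_value` and the card encoding (ace as 1, face cards as 10) are assumptions made for illustration; they are not part of the Gymnasium API.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values.

    Assumed encoding: ace = 1, face cards = 10.
    Hypothetical helper for illustration, not a Gymnasium function.
    """
    total = sum(cards)
    # An ace is "usable" when counting it as 11 does not bust the hand
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace


print(hand_value([1, 6]))      # ace counted as 11
print(hand_value([1, 6, 10]))  # ace must fall back to 1
print(hand_value([10, 9]))     # no ace in the hand
```

The second hand shows why the flag belongs in the state: the total is still 17, but the player has lost the option of falling back to a lower sum.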

In [None]:
env = gym.make('Blackjack-v1')

print(f"Environment: Blackjack-v1")
print(f"Action space: {env.action_space}")
print(f"Number of actions: {env.action_space.n}")
print("Actions: 0 = Stick, 1 = Hit")

sample_state, _ = env.reset()
print(f"\nSample initial state: {sample_state}")
print(f"  Player sum: {sample_state[0]}")
print(f"  Dealer showing: {sample_state[1]}")
print(f"  Usable ace: {sample_state[2]}")

## Section 3: Monte Carlo ES Algorithm Overview

<div style="background: white; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0;">Monte Carlo with Exploring Starts</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo ES uses <strong>Exploring Starts</strong> to ensure broad coverage of the state-action space. 
        Each episode begins with a randomly chosen state-action pair, so every pair has a nonzero probability of being 
        the starting point. After the initial random selection, the agent follows its current policy for the remainder 
        of the episode. This addresses the exploration problem while the policy is improved iteratively toward the optimal one.
    </p>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Figure: Monte Carlo ES Algorithm from Sutton & Barto</p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: #e8f5e9; padding: 12px 15px; border-left: 3px solid #4caf50; vertical-align: top; width: 50%;">
    <h4 style="color: #2e7d32; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Algorithm Steps</h4>
    <ol style="color: #555; line-height: 1.6; margin: 0; padding-left: 20px; font-size: 12px;">
        <li><strong>Exploring Start:</strong> Choose random (S₀, A₀) pair</li>
        <li><strong>Generate Episode:</strong> Follow current policy π from S₁ onward</li>
        <li><strong>Calculate Returns:</strong> Compute G for each visited (s,a)</li>
        <li><strong>Update Q-values:</strong> Average all returns for each (s,a) pair</li>
        <li><strong>Policy Improvement:</strong> Make policy greedy: π(s) ← argmax Q(s,a)</li>
    </ol>
</td>
<td style="background: #fff3e0; padding: 12px 15px; border-left: 3px solid #ff9800; vertical-align: top; width: 50%;">
    <h4 style="color: #e65100; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Why Exploring Starts?</h4>
    <p style="color: #555; font-size: 12px; line-height: 1.6; margin: 0 0 8px 0;">
        Without exploring starts, a deterministic policy might never visit certain state-action pairs, 
        preventing optimal value estimation. Random initialization ensures every (s,a) pair has non-zero 
        probability of being explored.
    </p>
    <p style="color: #555; font-size: 12px; line-height: 1.6; margin: 0;">
        <strong>Key Guarantee:</strong> All state-action pairs are visited infinitely often as episodes → ∞
    </p>
</td>
</tr>
</table>
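
As a rough illustration of the exploring-start step, the sketch below draws a start pair uniformly from the 12-21 state grid and both actions. The names `sample_exploring_start`, `PLAYER_SUMS`, `DEALER_CARDS`, and `ACTIONS` are hypothetical, and the sketch deliberately stops at sampling: it does not attempt to place the Gymnasium environment into the sampled state.

```python
import random

PLAYER_SUMS = range(12, 22)   # 12..21
DEALER_CARDS = range(1, 11)   # 1 (ace) .. 10
ACTIONS = (0, 1)              # 0 = Stick, 1 = Hit


def sample_exploring_start(rng):
    """Draw a start (state, action) pair uniformly at random."""
    state = (
        rng.choice(PLAYER_SUMS),
        rng.choice(DEALER_CARDS),
        rng.choice((False, True)),  # usable ace or not
    )
    action = rng.choice(ACTIONS)
    return state, action


# Count how many distinct pairs appear after many random starts
rng = random.Random(0)
counts = {}
for _ in range(20000):
    pair = sample_exploring_start(rng)
    counts[pair] = counts.get(pair, 0) + 1

print(f"Distinct start pairs seen: {len(counts)} of 400")
```

With 10 x 10 x 2 states and 2 actions there are 400 pairs, and uniform sampling gives each of them the same nonzero chance of starting an episode.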

## Section 4: Stochastic Policy for Exploration

Rather than exploring starts, this implementation uses an **arbitrary stochastic policy** to generate learning episodes. Because it assigns nonzero probability to both actions in every state, it serves as the exploration mechanism during the learning phase. The policy is threshold-based:
- When player sum > 18: Prefer to stick (80% probability) to avoid busting
- When player sum ≤ 18: Prefer to hit (80% probability) to get closer to 21

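This is NOT the optimal policy we're trying to find. Rather, it is a reasonable exploration strategy that ensures we visit diverse states and actions. The Q-values learned here estimate the value of each action under this arbitrary policy. Later we act greedily with respect to them, which yields an improved policy (one step of policy improvement) rather than a policy that is guaranteed to be optimal.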

In [None]:
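def play_episode_arbitrary_policy(env):
    """Play one episode with the threshold policy; return a list of (state, action, reward)."""
    episode = []
    state, _ = env.reset()
    
    while True:
        # Action probabilities are ordered [Stick, Hit]
        if state[0] > 18:
            action_probs = [0.8, 0.2]  # high sum: mostly stick
        else:
            action_probs = [0.2, 0.8]  # low sum: mostly hit
        
        action = np.random.choice([0, 1], p=action_probs)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        # Store the reward received after taking `action` in `state`
        episode.append((state, action, reward))
        state = next_state
        
        if done:
            break
    
    return episode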

sample_episode = play_episode_arbitrary_policy(env)
print(f"Sample episode length: {len(sample_episode)} steps")
print(f"Final reward: {sample_episode[-1][2]}")
print(f"First 3 steps:")
for i, (state, action, reward) in enumerate(sample_episode[:3]):
    action_name = "Stick" if action == 0 else "Hit"
    print(f"  {i+1}. State={state}, Action={action_name}, Reward={reward}")

## Section 5: First-Visit Monte Carlo Q-Value Updates

The core of Monte Carlo learning is the update of Q-values based on observed returns. We implement the **First-Visit MC** approach:

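**First-Visit Rule:** For each (state, action) pair, only the FIRST occurrence in an episode is used for updates. Subsequent visits to the same pair are ignored. In Blackjack a state cannot recur within an episode, because every hit changes either the player sum or the usable-ace flag, so first-visit and every-visit updates coincide here.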

**Return Calculation:** From time t when (s,a) is first visited:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

**Q-value Update:** The action-value is the average of all observed first-visit returns:
$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G^{(i)}(s,a)$$

Here N(s,a) is the number of first visits to (s,a) across all episodes, and $G^{(i)}(s,a)$ is the return $G_t$ that followed the i-th such visit.
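
As a worked example of the return formula, the sketch below computes first-visit returns with a single backward pass, using the recursion G_t = R_{t+1} + γ G_{t+1}. The function name `first_visit_returns` and the sample episode are made up for illustration; the episode uses the same (state, action, reward) layout as the lab, and γ = 0.5 is chosen only to make the discounting visible.

```python
def first_visit_returns(episode, gamma=1.0):
    """Map each (state, action) pair to the return following its first visit."""
    G = 0.0
    returns = {}
    # Walk backward so each return reuses the next one:
    # G_t = R_{t+1} + gamma * G_{t+1}
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # The earliest occurrence is processed last, so overwriting
        # leaves the first-visit return in the dictionary
        returns[(state, action)] = G
    return returns


# Hypothetical episode: hit on 13, hit on 17, stick on 20 and win
episode = [
    ((13, 10, False), 1, 0.0),
    ((17, 10, False), 1, 0.0),
    ((20, 10, False), 0, 1.0),
]
for pair, G in first_visit_returns(episode, gamma=0.5).items():
    print(pair, G)
```

The final step receives G = 1.0, the step before it 0.5, and the first step 0.25. With γ = 1, the setting used throughout this lab, all three pairs would receive the terminal reward unchanged.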

In [None]:
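def update_Q(episode, Q, returns_sum, N, gamma=1.0):
    """First-visit MC update of Q from one episode of (state, action, reward) steps."""
    visited = set()  # (state, action) pairs already updated in this episode
    
    for t, (state, action, reward) in enumerate(episode):
        sa_pair = (state, action)
        
        # First-visit rule: skip later occurrences of the same pair
        if sa_pair not in visited:
            visited.add(sa_pair)
            
            # Discounted return from step t to the end of the episode
            G = sum((gamma ** k) * r for k, (_, _, r) in enumerate(episode[t:]))
            
            # Running average of the returns observed for this pair
            returns_sum[state][action] += G
            N[state][action] += 1.0
            Q[state][action] = returns_sum[state][action] / N[state][action]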

print("Q-value update function ready")

In [None]:
def mc_predict(env, num_episodes, gamma=1.0):
    """Estimate Q for the arbitrary policy with first-visit Monte Carlo."""
    # Per-state arrays indexed by action; unseen states default to zeros
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    print(f"Starting MC prediction with {num_episodes:,} episodes...\n")
    
    for i_episode in range(1, num_episodes + 1):
        # Sample one episode, then fold its returns into the averages
        episode = play_episode_arbitrary_policy(env)
        update_Q(episode, Q, returns_sum, N, gamma)
        
        # Progress report every 50,000 episodes
        if i_episode % 50000 == 0:
            print(f"Episode {i_episode:,}/{num_episodes:,}")
    
    print("\nMonte Carlo prediction complete")
    return Q

print("MC prediction function ready")
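
The `returns_sum / N` average used in `update_Q` can also be maintained incrementally, without storing the running sum. The sketch below is one possible formulation, not the lab's implementation; the name `incremental_mean` and the sample returns are made up for illustration.

```python
def incremental_mean(returns):
    """Average a stream of returns without keeping their running sum."""
    q, n = 0.0, 0
    for G in returns:
        n += 1
        # Q_n = Q_{n-1} + (G_n - Q_{n-1}) / n
        q += (G - q) / n
    return q


# Hypothetical returns observed for one (state, action) pair
sample_returns = [1.0, -1.0, 1.0, 0.0, -1.0, 1.0]
batch_mean = sum(sample_returns) / len(sample_returns)

# Both ways of averaging agree up to floating-point rounding
print(abs(incremental_mean(sample_returns) - batch_mean) < 1e-12)
```

The incremental form needs only Q and N per pair, which is why it is a common choice when memory matters or when the 1/n step size is later swapped for a constant one.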

## Section 6: Visualization Functions

We create two types of visualizations to understand the learned value function and policy:

**3D Surface Plots:** Display state values V(s) as a function of player sum and dealer showing card. The height and color of the surface represent the expected value of being in that state. We create separate plots for states with and without a usable ace, as the ace significantly affects strategy.

**2D Policy Heatmaps:** Show the greedy action (Stick or Hit) for each state using color coding. Green indicates Stick (action 0) and Red indicates Hit (action 1). These heatmaps provide an intuitive view of the decision boundaries learned by the algorithm.
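
Before plotting, it can help to see the grid layout the heatmap relies on: one row per dealer card and one column per player sum. The text-only sketch below prints that grid for a made-up policy. `render_policy` and `toy_policy` are hypothetical names, and unseen states default to Hit, mirroring the default used by the plotting code.

```python
def render_policy(policy, usable_ace, default=1):
    """Return text rows (dealer 1..10) of 'S'/'H' for player sums 12..21."""
    rows = []
    for dealer in range(1, 11):
        cells = [
            "S" if policy.get((player, dealer, usable_ace), default) == 0 else "H"
            for player in range(12, 22)
        ]
        rows.append(f"{dealer:>2} " + " ".join(cells))
    return rows


# Hypothetical policy: stick on 20 or 21, hit otherwise
toy_policy = {
    (player, dealer, False): int(player < 20)
    for player in range(12, 22)
    for dealer in range(1, 11)
}
rows = render_policy(toy_policy, usable_ace=False)
print("\n".join(rows))
```

Each printed row corresponds to one horizontal band of the heatmap, so a stick/hit threshold shows up as the column where `H` switches to `S`.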

In [None]:
def plot_blackjack_values(V):
    def get_Z(player_sum, dealer_card, usable_ace):
        state = (player_sum, dealer_card, usable_ace)
        return V.get(state, 0)
    
    def create_surface(usable_ace, ax):
        player_range = np.arange(12, 22)
        dealer_range = np.arange(1, 11)
        X, Y = np.meshgrid(player_range, dealer_range)
        
        Z = np.array([[get_Z(x, y, usable_ace) 
                      for x in player_range] 
                     for y in dealer_range])
        
        surf = ax.plot_surface(X, Y, Z, 
                               cmap=cm.coolwarm,
                               linewidth=0,
                               antialiased=True,
                               vmin=-1, vmax=1,
                               alpha=0.8)
        
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.set_zlabel('Value', fontsize=11)
        ax.set_zlim(-1, 1)
        ax.view_init(elev=25, azim=-130)
        return surf
    
    fig = plt.figure(figsize=(14, 11))
    
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title('State Values WITH Usable Ace', fontsize=13, fontweight='bold', pad=15)
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title('State Values WITHOUT Usable Ace', fontsize=13, fontweight='bold', pad=15)
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    
    plt.tight_layout()
    plt.show()

def plot_policy(policy):
    def get_action(player_sum, dealer_card, usable_ace):
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)
    
    def create_heatmap(usable_ace, ax):
        player_range = range(12, 22)
        dealer_range = range(1, 11)
        
        Z = np.array([[get_action(player, dealer, usable_ace)
                      for player in player_range]
                     for dealer in dealer_range])
        
        im = ax.imshow(Z, 
                       cmap='RdYlGn_r',
                       aspect='auto',
                       vmin=0, vmax=1,
                       extent=[11.5, 21.5, 0.5, 10.5],
                       origin='lower',
                       interpolation='nearest')
        
        ax.set_xticks(range(12, 22))
        ax.set_yticks(range(1, 11))
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        ax.set_xlabel('Player Sum', fontsize=11)
        ax.set_ylabel('Dealer Showing', fontsize=11)
        ax.grid(True, color='black', linewidth=0.5, alpha=0.3)
        ax.set_axisbelow(False)
        
        cbar = plt.colorbar(im, ax=ax, ticks=[0, 1], fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK', 'HIT'])
        return im
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    ax1.set_title('Policy WITH Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    
    ax2.set_title('Policy WITHOUT Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    
    plt.tight_layout()
    plt.show()

print("Visualization functions ready")

## Section 7: Running Monte Carlo Experiments

Now we execute the complete Monte Carlo learning process. This section demonstrates the distinction between two key concepts:

**Arbitrary Exploration Policy:** The stochastic threshold-based policy we defined earlier is used to GENERATE episodes and collect experience. This policy explores the environment but is not necessarily optimal. It serves as our data collection mechanism.

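**Greedy Policy Extraction:** After learning Q-values from the arbitrary policy's experience, we select the action with the highest Q-value in each state: π'(s) = argmax_a Q(s,a). Because these Q-values estimate returns under the arbitrary policy rather than under an optimal one, the greedy policy is a single step of policy improvement: given accurate Q-values it is at least as good as the arbitrary policy, but it is not guaranteed to be optimal. The code below stores it under the name `optimal_policy`.

The learning process flows as: Exploration Policy → Generate Episodes → Learn Q-values → Extract Greedy Policy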
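
The two quantities derived from Q in this section can be checked by hand on a tiny example. The sketch below uses made-up Q-values for two states; `greedy_action`, `value_under_arbitrary_policy`, and `toy_Q` are hypothetical names, and plain lists stand in for the NumPy arrays used in the lab.

```python
def greedy_action(action_values):
    """Index of the largest action value (ties go to the lowest index)."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


def value_under_arbitrary_policy(state, action_values):
    """V(s) = sum over a of pi(a|s) * Q(s,a) for the threshold policy."""
    p_stick = 0.8 if state[0] > 18 else 0.2
    return p_stick * action_values[0] + (1.0 - p_stick) * action_values[1]


# Made-up Q-values; index 0 = Stick, index 1 = Hit
toy_Q = {
    (20, 10, False): [0.4, -0.8],
    (13, 10, False): [-0.6, -0.4],
}
for state, q in toy_Q.items():
    v = value_under_arbitrary_policy(state, q)
    print(state, greedy_action(q), round(v, 2))
```

For the first state the greedy action is Stick and V = 0.8 · 0.4 + 0.2 · (−0.8) = 0.16; for the second it is Hit and V = 0.2 · (−0.6) + 0.8 · (−0.4) = −0.44. The state value mixes both actions according to the arbitrary policy, while the greedy policy keeps only the better one.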

In [None]:
NUM_EPISODES = 500000

print("="*60)
print("LEARNING PHASE")
print("="*60)
print(f"Episodes: {NUM_EPISODES:,}")
print("Using arbitrary stochastic policy for exploration\n")

Q = mc_predict(env, NUM_EPISODES)

print("\n" + "="*60)
print("POLICY EXTRACTION PHASE")
print("="*60)

V_arbitrary = {}
for state, action_values in Q.items():
    if state[0] > 18:
        V_arbitrary[state] = 0.8 * action_values[0] + 0.2 * action_values[1]
    else:
        V_arbitrary[state] = 0.2 * action_values[0] + 0.8 * action_values[1]

optimal_policy = {}
for state, action_values in Q.items():
    optimal_policy[state] = np.argmax(action_values)

print("Optimal policy extracted via greedy selection\n")

states_count = len(Q)
stick_count = sum(1 for a in optimal_policy.values() if a == 0)
hit_count = sum(1 for a in optimal_policy.values() if a == 1)

print("="*60)
print("RESULTS")
print("="*60)
print(f"States explored: {states_count}")
print(f"\nOptimal Policy:")
print(f"  STICK: {stick_count} states ({100*stick_count/states_count:.1f}%)")
print(f"  HIT:   {hit_count} states ({100*hit_count/states_count:.1f}%)")
print(f"\nAverage state value: {np.mean(list(V_arbitrary.values())):.4f}")
print("="*60)

In [None]:
print("Generating 3D value function plots...\n")
plot_blackjack_values(V_arbitrary)

print("Value Function Interpretation:")
print("  - Red (high): Favorable states")
print("  - Blue (low): Unfavorable states")
print("  - Peak near sum 20-21: Best positions")

In [None]:
print("Generating optimal policy heatmaps...\n")
plot_policy(optimal_policy)

print("Optimal Policy Interpretation:")
print("  - Green = STICK")
print("  - Red = HIT")
print("  - Clear threshold around sum 17-20")

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Policy Learning:</strong> We used an arbitrary exploration policy to generate episodes, then extracted a greedy (improved) policy from the learned Q-values.</p>
        <p><strong>2. Exploration vs Exploitation:</strong> The arbitrary policy provides exploration during learning; the greedy policy is purely exploitative at decision time.</p>
        <p><strong>3. Usable Ace Impact:</strong> Optimal strategy differs significantly with usable ace due to flexibility in avoiding bust.</p>
        <p><strong>4. Decision Boundaries:</strong> Clear threshold emerges around sum 17-20 for stick/hit decision, adapting to dealer card.</p>
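        <p><strong>5. Monte Carlo Strength:</strong> Model-free learning from sampled episodes yields a sensible policy without access to the environment's transition dynamics; reaching the optimal policy requires alternating evaluation and improvement (Monte Carlo control).</p>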
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why do we need an exploration policy if we are trying to find the optimal policy?</li>
        <li>What would happen if we used a purely greedy policy from the start?</li>
        <li>How does First-Visit MC differ from Every-Visit MC in terms of bias and variance?</li>
        <li>Why is Monte Carlo particularly suitable for Blackjack compared to Dynamic Programming?</li>
        <li>How could we implement epsilon-greedy exploration instead of the arbitrary policy?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 5-1: Blackjack with Monte Carlo Methods</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5-2 - Monte Carlo Control</p>
</div>