<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo Methods
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 5 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo methods learn directly from episodes of experience without requiring a model of the environment's dynamics.
        The underlying Monte Carlo method grew out of work by <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" style="color: #17a2b8;">Stanislaw Ulam</a>
        and colleagues at Los Alamos in the 1940s. In reinforcement learning it is a natural fit for episodic tasks, where the return is known only once an episode ends. This lab implements the
        <strong>First-Visit Monte Carlo</strong> algorithm on the classic Blackjack problem from
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>, Example 5.1.
        We use <a href="https://gym.openai.com/" style="color: #17a2b8;">OpenAI Gym</a> for the environment simulation.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand Monte Carlo prediction methods</li>
        <li>Implement First-Visit MC algorithm</li>
        <li>Learn from sampled episodes of experience</li>
        <li>Estimate action-value functions Q(s,a)</li>
        <li>Visualize value functions and policies</li>
        <li>Work with OpenAI Gym environments</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get a sum as close to 21 as possible without exceeding it</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Hit (draw card) or Stick (stop)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Counts as 1 or 11 (usable when counting it as 11 does not bust the hand)</div>
    </div>
</td>
</tr>
</table>

## Section 1: Environment Setup and Dependencies

We begin by importing necessary libraries including OpenAI Gym for the Blackjack environment.
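
The cells in this lab use the classic Gym interface, where `env.reset()` returns only the observation and `env.step()` returns a 4-tuple. Newer Gym and Gymnasium releases return `(observation, info)` from `reset()` and a 5-tuple from `step()`, and they register the environment as `Blackjack-v1`. The sketch below shows two hypothetical helpers, `reset_compat` and `step_compat` (not part of Gym), that accept either shape. It assumes Blackjack observations are 3-tuples, and the stand-in environment classes exist only for illustration.

```python
def reset_compat(env):
    """Return only the observation, whichever reset() signature env uses."""
    out = env.reset()
    # Newer API: (observation, info_dict). A Blackjack observation has three
    # elements, so a 2-tuple ending in a dict can only be the newer form.
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out


def step_compat(env, action):
    """Return (observation, reward, done, info) for a 4- or 5-tuple step()."""
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        return obs, reward, bool(terminated or truncated), info
    return out


# Stand-in environments (hypothetical, for illustration only)
class OldStyleEnv:
    def reset(self):
        return (14, 10, False)

    def step(self, action):
        return (14, 10, False), -1.0, True, {}


class NewStyleEnv:
    def reset(self):
        return (14, 10, False), {}

    def step(self, action):
        return (14, 10, False), -1.0, True, False, {}


print(reset_compat(OldStyleEnv()), reset_compat(NewStyleEnv()))
print(step_compat(NewStyleEnv(), 0))
```

If your installed version uses the newer interface, swapping `env.reset()` and `env.step(action)` for helpers like these is one way to keep the rest of the lab code unchanged.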

In [None]:
"""
Cell 1: Import Libraries and Load Utilities
Purpose: Set up the computational environment with all necessary dependencies
"""

import sys
import gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import requests
import warnings
warnings.filterwarnings('ignore')

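# Fetch and execute the pretty print utility from GitHub
# Note: exec runs the downloaded code as-is, so only point this at a URL you trust
try:
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Use the fallback on HTTP errors instead of exec-ing an error page
    exec(response.text)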
    pretty_print("Environment Ready", 
                 "Successfully loaded all dependencies<br>" +
                 "Libraries: Gym, NumPy, Matplotlib<br>" +
                 "Ready for Monte Carlo Blackjack implementation", 
                 style='success')
except Exception as e:
    # Fallback definition if GitHub fetch fails
    from IPython.display import display, HTML
    def pretty_print(title, content, style='info'):
        themes = {
            'info': {'primary': '#17a2b8', 'secondary': '#0e5a63', 'background': '#f8f9fa'},
            'success': {'primary': '#28a745', 'secondary': '#155724', 'background': '#f8fff9'},
            'warning': {'primary': '#ffc107', 'secondary': '#e0a800', 'background': '#fffdf5'},
            'result': {'primary': '#6f42c1', 'secondary': '#4e2c8e', 'background': '#faf5ff'},
            'note': {'primary': '#20c997', 'secondary': '#0d7a5f', 'background': '#f0fdf9'}
        }
        theme = themes.get(style, themes['info'])
        html = f'''
        <div style="border-radius: 5px; margin: 10px 0; width: 20cm; max-width: 20cm; box-shadow: 0 2px 4px rgba(0,0,0,0.1);">
            <div style="background: linear-gradient(90deg, {theme['primary']} 0%, {theme['secondary']} 100%); padding: 10px 15px; border-radius: 5px 5px 0 0;">
                <strong style="color: white; font-size: 14px;">{title}</strong>
            </div>
            <div style="background: {theme['background']}; padding: 10px 15px; border-radius: 0 0 5px 5px; border-left: 3px solid {theme['primary']};">        
                <div style="color: rgba(0,0,0,0.8); font-size: 12px; line-height: 1.5;">{content}</div>
            </div>
        </div>
        '''
        display(HTML(html))
    
    pretty_print("Fallback Mode", 
                 f"Using local pretty_print definition<br>Error: {str(e)}", 
                 style='warning')

# Configure matplotlib for better visualizations
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

## Section 2: Creating the Blackjack Environment

### OpenAI Gym Environment

OpenAI Gym provides a standardized interface for RL environments. The Blackjack environment simulates the card game with:
- **State space**: (player_sum, dealer_showing, usable_ace)
- **Action space**: {0: Stick, 1: Hit}
- **Reward structure**: Terminal rewards only (+1, 0, -1)

The environment follows the rules from Sutton & Barto Example 5.1.
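
To make the state tuple concrete, here is a minimal sketch of how a hand could be turned into `(player_sum, dealer_showing, usable_ace)`. It assumes cards are encoded as integers with an ace as 1 and face cards as 10. The helper names are illustrative and not taken from the lab code.

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    """Best hand total: count one ace as 11 when that does not exceed 21."""
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


def make_state(player_hand, dealer_showing):
    """Build the (player_sum, dealer_showing, usable_ace) state tuple."""
    return (sum_hand(player_hand), dealer_showing, usable_ace(player_hand))


print(make_state([1, 6], 10))     # (17, 10, True): ace counted as 11
print(make_state([1, 6, 9], 10))  # (16, 10, False): ace falls back to 1
```

The second call shows why hitting with a usable ace cannot bust the hand on the next card: the ace simply reverts to 1.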

In [None]:
"""
Cell 2: Initialize Blackjack Environment
Purpose: Create and explore the OpenAI Gym Blackjack environment
"""

# Create the Blackjack environment using Gym
env = gym.make('Blackjack-v0')

# Explore environment properties
pretty_print("Blackjack Environment Created",
             f"Action space size: {env.action_space.n} actions<br>" +
             "Actions: 0 = Stick (stop), 1 = Hit (draw card)<br>" +
             "State: (player_sum, dealer_card, usable_ace)<br>" +
             "Rewards: +1 (win), 0 (draw), -1 (lose)",
             style='info')

# Demonstrate environment interface
sample_state = env.reset()
pretty_print("Sample Initial State",
             f"State: {sample_state}<br>" +
             f"Player sum: {sample_state[0]}<br>" +
             f"Dealer showing: {sample_state[1]}<br>" +
             f"Usable ace: {sample_state[2]}",
             style='note')

## Section 3: Stochastic Policy for Blackjack

### Initial Policy Definition

We define a simple stochastic policy for exploration:
- If player's sum > 18: P(Stick) = 0.8, P(Hit) = 0.2
- If player's sum ≤ 18: P(Stick) = 0.2, P(Hit) = 0.8

This policy provides a balance between conservative play (sticking on high sums) and exploration.
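
The two rules above can be sketched as a small function that returns `[P(Stick), P(Hit)]`, plus a sampler. This version uses only the standard library. The names `policy_probs` and `sample_action` and the `threshold` and `p_preferred` parameters are introduced here for illustration.

```python
import random

STICK, HIT = 0, 1


def policy_probs(player_sum, threshold=18, p_preferred=0.8):
    """Return [P(stick), P(hit)] for the threshold policy described above."""
    if player_sum > threshold:
        return [p_preferred, 1.0 - p_preferred]   # mostly stick on high sums
    return [1.0 - p_preferred, p_preferred]       # mostly hit otherwise


def sample_action(player_sum, rng):
    """Draw one action from the policy using the supplied random generator."""
    p_stick, _ = policy_probs(player_sum)
    return STICK if rng.random() < p_stick else HIT


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_action(20, rng) for _ in range(10_000)]
print("Empirical P(stick | sum=20):", samples.count(STICK) / len(samples))
```

Because both actions keep a nonzero probability in every state, every state-action pair that is reachable under this policy keeps being sampled, which is what Monte Carlo estimation of Q needs.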

In [None]:
"""
Cell 3: Episode Generation with Stochastic Policy
Purpose: Implement function to play complete episodes using our initial policy
"""

def play_episode(env):
    """
    Play a complete episode of Blackjack using a stochastic policy
    
    The policy is threshold-based:
    - Tends to stick (80% probability) when player sum > 18
    - Tends to hit (80% probability) when player sum ≤ 18
    
    This provides exploration while following reasonable Blackjack strategy.
    
    Args:
        env: OpenAI Gym Blackjack environment
    
    Returns:
        episode: List of (state, action, reward) tuples for the complete episode
    """
    episode = []
    
    # Start a new game; the initial state is random because the cards are dealt at random
    state = env.reset()
    
    while True:
        # Define stochastic policy based on player's sum
        # Higher sums (>18) → prefer to stick to avoid busting
        # Lower sums (≤18) → prefer to hit to get closer to 21
        if state[0] > 18:
            # Conservative: mostly stick when close to 21
            action_probs = [0.8, 0.2]  # [P(stick), P(hit)]
        else:
            # Aggressive: mostly hit when far from 21
            action_probs = [0.2, 0.8]  # [P(stick), P(hit)]
        
        # Sample action from probability distribution
        action = np.random.choice([0, 1], p=action_probs)
        
        # Execute action in environment
        next_state, reward, done, info = env.step(action)
        
        # Record state-action-reward tuple
        episode.append((state, action, reward))
        
        # Update state for next iteration
        state = next_state
        
        # Check if episode is complete (game over)
        if done:
            break
    
    return episode

# Test episode generation
sample_episode = play_episode(env)
pretty_print("Sample Episode Generated",
             f"Episode length: {len(sample_episode)} steps<br>" +
             f"Final reward: {sample_episode[-1][2]}<br>" +
             f"Sample step: {sample_episode[0]}",
             style='success')

# Display full episode for understanding
print("\nComplete episode trajectory:")
for i, (state, action, reward) in enumerate(sample_episode):
    action_name = "Stick" if action == 0 else "Hit"
    print(f"Step {i+1}: State={state}, Action={action_name}, Reward={reward}")

## Section 4: First-Visit Monte Carlo Algorithm

### Theoretical Foundation

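First-Visit Monte Carlo estimates the action-value function by averaging the returns that follow the **first visit** to each state-action pair in an episode:

$$Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G^{(i)}(s,a)$$

Where:
- $G^{(i)}(s,a)$ is the return $G_t$ following the first visit to $(s,a)$ in the $i$-th episode that contains that pair
- $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ is the return from time $t$, with $T$ the final time step of the episode
- $N(s,a)$ is the number of episodes in which $(s,a)$ is visited at least once
- $\gamma$ is the discount factor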
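
As a worked illustration of the return definition, the sketch below computes every $G_t$ in one backward pass using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, then keeps only the return that follows the first visit to each state-action pair. The function names and the toy episode are made up for this example.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backwards so each return reuses the one after it
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


def first_visit_returns(episode, gamma=1.0):
    """Map each (state, action) pair to the return after its first visit."""
    rewards = [r for _, _, r in episode]
    returns = discounted_returns(rewards, gamma)
    first = {}
    for t, (state, action, _) in enumerate(episode):
        # setdefault keeps the earliest entry and ignores later visits
        first.setdefault((state, action), returns[t])
    return first


# Toy episode: hit on 13, hit on 17, stick on 20, then win (+1)
episode = [((13, 10, False), 1, 0.0),
           ((17, 10, False), 1, 0.0),
           ((20, 10, False), 0, 1.0)]

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))
print(first_visit_returns(episode))
```

The backward pass costs one addition and one multiplication per step, whereas recomputing the sum from each time step is quadratic in episode length. Blackjack episodes are only a few steps long, so either approach works here.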

In [None]:
"""
Cell 4: Q-Value Update Function
Purpose: Implement First-Visit Monte Carlo update rule for action-value estimation
"""

def update_Q(episode, Q, returns_sum, N, gamma=1.0):
    """
    Update Q-values using First-Visit Monte Carlo method
    
    For each state-action pair in the episode:
    1. Find the FIRST occurrence (first-visit)
    2. Calculate discounted return from that point
    3. Update running sum and count
    4. Compute new average Q-value
    
    Args:
        episode: List of (state, action, reward) tuples
        Q: Action-value function estimates
        returns_sum: Cumulative returns for each state-action pair
        N: Visit counts for each state-action pair
        gamma: Discount factor (default 1.0 for episodic task)
    """
    # Process each unique state-action pair in the episode
    visited = set()  # Track visited state-action pairs for first-visit
    
    for t, (state, action, reward) in enumerate(episode):
        # Create hashable state-action pair
        sa_pair = (state, action)
        
        # First-visit check: only update if this is first occurrence
        if sa_pair not in visited:
            visited.add(sa_pair)
            
            # Calculate return G from time t onwards
            G = 0
            for k, (_, _, r) in enumerate(episode[t:]):
                # G = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
                G += (gamma ** k) * r
            
            # Update cumulative sum of returns
            returns_sum[state][action] += G
            
            # Increment visit count
            N[state][action] += 1.0
            
            # Update Q-value as running average
            # Q(s,a) = average of all returns from (s,a)
            Q[state][action] = returns_sum[state][action] / N[state][action]

pretty_print("Q-Value Update Function Ready",
             "First-Visit Monte Carlo update implemented<br>" +
             "Calculates returns from first occurrence only<br>" +
             "Maintains running average of returns",
             style='success')

In [None]:
"""
Cell 5: Monte Carlo Prediction Main Loop
Purpose: Implement the complete MC prediction algorithm for policy evaluation
"""

def mc_predict(env, num_episodes, gamma=1.0):
    """
    Monte Carlo prediction for estimating Q-values of the current policy
    
    Runs multiple episodes and updates Q-values using first-visit MC.
    This implements the prediction (policy evaluation) problem:
    Given a policy π, estimate Q^π(s,a) for all state-action pairs.
    
    Args:
        env: OpenAI Gym Blackjack environment
        num_episodes: Number of episodes to simulate
        gamma: Discount factor (1.0 for undiscounted)
    
    Returns:
        Q: Estimated action-value function
    """
    # Initialize data structures
    # defaultdict automatically initializes missing keys with zeros
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    pretty_print("Starting Monte Carlo Prediction",
                 f"Running {num_episodes:,} episodes<br>" +
                 f"Discount factor γ = {gamma}<br>" +
                 "This may take a few minutes...",
                 style='warning')
    
    # Run episodes and update Q-values
    for i_episode in range(1, num_episodes + 1):
        # Generate complete episode
        episode = play_episode(env)
        
        # Update Q-values based on episode
        update_Q(episode, Q, returns_sum, N, gamma)
        
        # Progress reporting
        if i_episode % 10000 == 0:
            avg_q = np.mean([q.mean() for q in Q.values()])
            print(f"\rEpisode {i_episode}/{num_episodes} | Avg Q-value: {avg_q:.4f}", end="")
            sys.stdout.flush()
        elif i_episode % 1000 == 0:
            print(f"\rEpisode {i_episode}/{num_episodes}", end="")
            sys.stdout.flush()
    
    print()  # New line after progress
    
    pretty_print("Monte Carlo Prediction Complete",
                 f"Processed {num_episodes:,} episodes<br>" +
                 f"Estimated Q-values for {len(Q)} states<br>" +
                 f"Average visits per state: {np.mean([n.sum() for n in N.values()]):.1f}",
                 style='success')
    
    return Q

pretty_print("Monte Carlo Prediction Ready",
             "Complete algorithm implemented<br>" +
             "Will estimate Q-values through episode sampling",
             style='info')

## Section 5: Visualization Functions

We create comprehensive visualizations to understand the learned value function and policy.

In [None]:
"""
Cell 6: Visualization Helper Functions
Purpose: Create 3D plots for value functions and 2D heatmaps for policies
"""

def plot_blackjack_values(V):
    """
    Create 3D surface plots of the state-value function
    Separate plots for states with and without usable ace
    
    Args:
        V: State-value function dictionary
    """
    def get_Z(x, y, usable_ace):
        """Get value for a specific state, default to 0 if not visited"""
        if (x, y, usable_ace) in V:
            return V[x, y, usable_ace]
        else:
            return 0
    
    def get_figure(usable_ace, ax):
        """Create 3D surface plot for given usable_ace condition"""
        # Define ranges for player sum and dealer card
        x_range = np.arange(11, 22)  # Player sum from 11 to 21
        y_range = np.arange(1, 11)   # Dealer card from 1 (Ace) to 10
        X, Y = np.meshgrid(x_range, y_range)
        
        # Compute values for all state combinations
        Z = np.array([get_Z(x, y, usable_ace) 
                     for x, y in zip(np.ravel(X), np.ravel(Y))]).reshape(X.shape)
        
        # Create surface plot
        surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, 
                               cmap=plt.cm.coolwarm, vmin=-1.0, vmax=1.0,
                               alpha=0.8, edgecolor='none')
        ax.set_xlabel('Player\'s Current Sum')
        ax.set_ylabel('Dealer\'s Showing Card')
        ax.set_zlabel('State Value')
        ax.view_init(elev=30, azim=-120)  # Set viewing angle
        ax.set_zlim(-1, 1)
    
    # Create figure with two subplots
    fig = plt.figure(figsize=(15, 12))
    
    # Subplot 1: States with usable ace
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title('State Values with Usable Ace', fontsize=14, fontweight='bold')
    get_figure(True, ax1)
    
    # Subplot 2: States without usable ace
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title('State Values without Usable Ace', fontsize=14, fontweight='bold')
    get_figure(False, ax2)
    
    plt.tight_layout()
    plt.show()

def plot_policy(policy):
    """
    Create 2D heatmaps showing the action the given policy selects in each state
    
    Args:
        policy: Dictionary mapping states to actions
    """
    def get_Z(x, y, usable_ace):
        """Get action for a specific state, default to 1 (hit) if not defined"""
        if (x, y, usable_ace) in policy:
            return policy[x, y, usable_ace]
        else:
            return 1  # Default to hit
    
    def get_figure(usable_ace, ax):
        """Create 2D heatmap for given usable_ace condition"""
        x_range = np.arange(11, 22)  # Player sum
        y_range = np.arange(10, 0, -1)  # Dealer card (reversed for display)
        X, Y = np.meshgrid(x_range, y_range)
        
        # Compute policy for all states
        Z = np.array([[get_Z(x, y, usable_ace) for x in x_range] 
                     for y in y_range])
        
        # Create heatmap
        surf = ax.imshow(Z, cmap=plt.get_cmap('RdYlBu_r', 2), 
                        vmin=0, vmax=1, extent=[10.5, 21.5, 0.5, 10.5])
        # Use the Axes methods so ticks apply to this subplot, not whichever axes is current
        ax.set_xticks(x_range)
        ax.set_yticks(y_range)
        ax.invert_yaxis()
        ax.set_xlabel('Player\'s Current Sum')
        ax.set_ylabel('Dealer\'s Showing Card')
        ax.grid(color='black', linestyle='-', linewidth=0.5, alpha=0.3)
        
        # Add colorbar
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.1)
        cbar = plt.colorbar(surf, ticks=[0, 1], cax=cax)
        cbar.ax.set_yticklabels(['STICK (0)', 'HIT (1)'])
    
    # Create figure with two subplots
    fig = plt.figure(figsize=(14, 6))
    
    # Subplot 1: Policy with usable ace
    ax1 = fig.add_subplot(121)
    ax1.set_title('Policy with Usable Ace', fontsize=12, fontweight='bold')
    get_figure(True, ax1)
    
    # Subplot 2: Policy without usable ace
    ax2 = fig.add_subplot(122)
    ax2.set_title('Policy without Usable Ace', fontsize=12, fontweight='bold')
    get_figure(False, ax2)
    
    plt.tight_layout()
    plt.show()

pretty_print("Visualization Functions Ready",
             "3D surface plots for value functions<br>" +
             "2D heatmaps for policy visualization<br>" +
             "Separate views for usable/non-usable ace states",
             style='success')

## Section 6: Running Monte Carlo Experiments
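
The experiment cell in this section turns the estimated Q-values into two derived quantities: a state value $V(s) = \sum_a \pi(a|s)\,Q(s,a)$ under the stochastic policy, and a greedy action $\arg\max_a Q(s,a)$. A minimal sketch of both steps on made-up numbers follows. The function names and the `q_example` values are illustrative and are not results from the lab.

```python
def state_value(q_values, player_sum, threshold=18, p_preferred=0.8):
    """Expected Q under the threshold policy: V(s) = sum_a pi(a|s) * Q(s, a)."""
    if player_sum > threshold:
        pi = (p_preferred, 1.0 - p_preferred)   # (P(stick), P(hit))
    else:
        pi = (1.0 - p_preferred, p_preferred)
    return sum(p * q for p, q in zip(pi, q_values))


def greedy_action(q_values):
    """Index of the largest Q-value; ties go to the lower index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_example = [0.5, -0.5]  # made-up [Q(s, stick), Q(s, hit)]
print(state_value(q_example, player_sum=20))  # weights stick at 0.8
print(state_value(q_example, player_sum=15))  # weights hit at 0.8
print(greedy_action(q_example))               # 0, i.e. stick
```

Since Q here is estimated for the fixed stochastic policy, the greedy action is a single step of policy improvement. It need not coincide with the optimal Blackjack policy.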

In [None]:
"""
Cell 7: Execute Monte Carlo Learning
Purpose: Run MC prediction with large number of episodes and visualize results
"""

# Run Monte Carlo prediction with 500,000 episodes
NUM_EPISODES = 500000

pretty_print("Starting Large-Scale Experiment",
             f"Episodes to run: {NUM_EPISODES:,}<br>" +
             "Expected time: 1-2 minutes<br>" +
             "Learning Q-values for stochastic policy",
             style='warning')

# Run MC prediction
Q = mc_predict(env, NUM_EPISODES)

# Convert Q-values to state values using the stochastic policy
# V(s) = Σ_a π(a|s) * Q(s,a)
V_to_plot = {}
for state, action_values in Q.items():
    # Apply our stochastic policy to get expected value
    if state[0] > 18:
        # Policy: 80% stick, 20% hit
        expected_value = 0.8 * action_values[0] + 0.2 * action_values[1]
    else:
        # Policy: 20% stick, 80% hit
        expected_value = 0.2 * action_values[0] + 0.8 * action_values[1]
    
    V_to_plot[state] = expected_value

# Extract greedy policy from Q-values
greedy_policy = {}
for state, action_values in Q.items():
    # Select action with highest Q-value
    greedy_policy[state] = np.argmax(action_values)

# Analyze results
states_visited = len(Q)
avg_value = np.mean(list(V_to_plot.values()))
stick_states = sum(1 for a in greedy_policy.values() if a == 0)
hit_states = sum(1 for a in greedy_policy.values() if a == 1)

analysis_text = f"""
<strong>Learning Results:</strong><br><br>
• States visited: {states_visited} unique states<br>
• Average state value: {avg_value:.4f}<br>
• Greedy policy statistics:<br>
  - States where the greedy action is STICK: {stick_states} ({100*stick_states/states_visited:.1f}%)<br>
  - States where the greedy action is HIT: {hit_states} ({100*hit_states/states_visited:.1f}%)<br><br>
<strong>Patterns to look for in the plots:</strong><br>
• Greedy policy tends to stick with higher player sums<br>
• Usable ace shifts the sum at which the greedy action switches to STICK<br>
• Dealer's showing card influences the decision boundary
"""

pretty_print("Analysis Complete", analysis_text, style='result')

In [None]:
"""
Cell 8: Visualize Learned Value Function
Purpose: Create 3D visualizations of the state-value function
"""

pretty_print("Generating Value Function Plots",
             "Creating 3D surface plots for state values<br>" +
             "Separate visualizations for usable/non-usable ace",
             style='info')

# Plot the state-value function
plot_blackjack_values(V_to_plot)

pretty_print("Value Function Visualization",
             "<strong>Interpretation:</strong><br>" +
             "• Higher values (red): Favorable states likely to win<br>" +
             "• Lower values (blue): Unfavorable states likely to lose<br>" +
             "• Peak values around sum 20-21: Best winning positions<br>" +
             "• Valley for low sums: Poor positions requiring hits",
             style='note')

In [None]:
"""
Cell 9: Visualize Derived Greedy Policy
Purpose: Show the greedy policy derived from the estimated Q-values
"""

pretty_print("Generating Policy Heatmaps",
             "Creating 2D heatmaps showing greedy actions<br>" +
             "Red = HIT, Blue = STICK",
             style='info')

# Plot the greedy policy
plot_policy(greedy_policy)

pretty_print("Policy Visualization",
             "<strong>Policy Patterns:</strong><br>" +
             "• Clear threshold around sum 17-20<br>" +
             "• Hits on higher sums with a usable ace (one more card cannot bust the hand)<br>" +
             "• Adapts to dealer's showing card<br>" +
             "• Matches intuitive Blackjack strategy",
             style='note')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Monte Carlo Convergence:</strong> With 500,000 episodes, the value function converges to stable estimates, demonstrating the law of large numbers in action.</p>
        <p><strong>2. Policy Structure:</strong> The greedy policy derived from Q shows a clear decision boundary around player sum 17-20. Because Q was estimated for the fixed stochastic policy, this is one step of policy improvement rather than the optimal policy.</p>
        <p><strong>3. Usable Ace Impact:</strong> States with a usable ace show a different greedy strategy, because the ace can fall back to 1 and so one more card cannot bust the hand.</p>
        <p><strong>4. Dealer Card Influence:</strong> The greedy policy adapts to the dealer's showing card. In standard Blackjack strategy, the player sticks on lower sums against weak dealer cards (4-6), where the dealer is more likely to bust.</p>
        <p><strong>5. First-Visit Efficiency:</strong> First-visit MC provides unbiased estimates while being computationally efficient.</p>
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>How would the value function change if we used Every-Visit MC instead of First-Visit?</li>
        <li>What happens to convergence speed with different initial policies?</li>
        <li>How could we implement MC Control to find the optimal policy directly?</li>
        <li>Why is Monte Carlo particularly suitable for Blackjack compared to Dynamic Programming?</li>
        <li>How would adding card counting affect the state space and learning?</li>
        <li>What are the advantages of MC methods when the model is unknown or complex?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 5-1: Blackjack with Monte Carlo Methods</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5-2 - Monte Carlo Control</p>
</div>