<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 5-1: Blackjack with Monte Carlo Methods
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 5 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Monte Carlo methods learn directly from episodes of experience without requiring a model of the environment's dynamics.
        The Monte Carlo method itself is credited to <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" style="color: #17a2b8;">Stanislaw Ulam</a>,
        who developed it at Los Alamos in the 1940s; in reinforcement learning it is a natural fit for episodic tasks, where every episode terminates and yields a return. This lab implements the
        <strong>First-Visit Monte Carlo</strong> algorithm on the classic Blackjack problem from
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>, Example 5.1.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand Monte Carlo prediction methods</li>
        <li>Implement the First-Visit MC algorithm</li>
        <li>Learn from sampled episodes of experience</li>
        <li>Estimate action-value functions Q(s,a)</li>
        <li>Visualize value functions and policies</li>
        <li>Work with Gymnasium (formerly OpenAI Gym) environments</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Blackjack Rules</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Get a sum as close to 21 as possible without exceeding it</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Hit (draw card) or Stick (stop)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (player_sum, dealer_card, usable_ace)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → +1 (win), 0 (draw), -1 (lose)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Ace</code> → Can be 1 or 11 (usable if 11)</div>
    </div>
</td>
</tr>
</table>
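
The ace rule above is easiest to see in code. The following is a minimal sketch, not the environment's own implementation: `hand_value` is a hypothetical helper, and it assumes aces are stored as 1 and face cards as 10.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values, where an ace is 1."""
    total = sum(cards)
    # An ace is "usable" when counting one ace as 11 does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


print(hand_value([1, 6]))      # ace counted as 11
print(hand_value([1, 6, 10]))  # ace falls back to 1 to avoid busting
```

The second component of the result corresponds to the `usable_ace` entry of the state tuple listed under Blackjack Rules.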

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 1: Environment Setup and Dependencies
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 1</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Importing libraries and configuring the computational environment
    </p>
</div>

In [None]:
import sys
import gymnasium as gym
import numpy as np
from collections import defaultdict
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib to register the '3d' projection
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

# Default figure settings shared by every plot in this lab.
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print('Libraries imported successfully')
print(f'Gymnasium version: {gym.__version__}')

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 2: Creating the Blackjack Environment
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 2</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Initializing the Gymnasium Blackjack-v1 environment
    </p>
</div>

In [None]:
env = gym.make('Blackjack-v1')

print('Environment: Blackjack-v1')
print(f'Action space: {env.action_space}')
print(f'Number of actions: {env.action_space.n}')
print('Actions: 0 = Stick (stop), 1 = Hit (draw card)')

# reset() returns (observation, info); the observation is the state tuple
# (player_sum, dealer_card, usable_ace).
sample_state, _ = env.reset()
print(f'\nSample initial state: {sample_state}')
print(f'  Player sum: {sample_state[0]}')
print(f'  Dealer showing: {sample_state[1]}')
print(f'  Usable ace: {sample_state[2]}')

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 3: Monte Carlo ES Algorithm Overview
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 3</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Algorithm for Finding Optimal Policy
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0;">Algorithm Overview</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
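        Monte Carlo ES (Exploring Starts) begins each episode from a randomly selected state-action pair, which guarantees that every pair keeps being visited, and then alternates policy evaluation with greedy policy improvement. The code in this lab performs only the evaluation step: it estimates Q(s,a) for a fixed stochastic policy, whose randomness supplies the exploration in place of exploring starts.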
    </p>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="https://github.com/mdehghani86/RL_labs/blob/master/Lab%2005/MCM_ES.jpg?raw=true" 
         alt="Monte Carlo ES Pseudocode" 
         style="width: 70%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px;">
</div>
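
The "exploring start" step of the pseudocode can be sketched as follows. This is an illustration under stated assumptions rather than part of the lab's pipeline: `sample_exploring_start` is a hypothetical helper, the state ranges follow Example 5.1, and wiring the sampled start into `Blackjack-v1` is not shown.

```python
import random


def sample_exploring_start(rng):
    """Draw a start state and first action uniformly at random (hypothetical helper)."""
    player_sum = rng.randint(12, 21)        # Example 5.1 only considers decisions for sums 12 to 21
    dealer_card = rng.randint(1, 10)        # 1 denotes an ace
    usable_ace = rng.choice([False, True])
    action = rng.choice([0, 1])             # 0 = stick, 1 = hit
    return (player_sum, dealer_card, usable_ace), action


rng = random.Random(0)  # seeded only so the sketch is repeatable
state, action = sample_exploring_start(rng)
print(state, action)
```

Because every state-action pair has a nonzero chance of being chosen as the start, no pair is starved of samples even when the policy being followed is deterministic.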

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 4: Stochastic Policy for Exploration
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 4</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Defining an arbitrary exploration policy
    </p>
</div>

In [None]:
def play_episode_arbitrary_policy(env):
    """Play one episode with a fixed stochastic policy; return [(state, action, reward), ...]."""
    episode = []
    state, _ = env.reset()

    while True:
        # action_probs is [P(stick), P(hit)]: mostly stick above 18, mostly hit otherwise.
        if state[0] > 18:
            action_probs = [0.8, 0.2]
        else:
            action_probs = [0.2, 0.8]

        action = np.random.choice([0, 1], p=action_probs)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Store the reward received after taking `action` in `state`.
        episode.append((state, action, reward))
        state = next_state

        if done:
            break

    return episode

sample_episode = play_episode_arbitrary_policy(env)
print(f'Episode length: {len(sample_episode)} steps')
print(f'Final reward: {sample_episode[-1][2]}')

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 5: First-Visit Monte Carlo Learning
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 5</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Implementing Q-value updates and prediction loop
    </p>
</div>

In [None]:
def update_Q(episode, Q, returns_sum, N, gamma=1.0):
    """First-visit update: average the return that follows the first visit to each (state, action)."""
    visited = set()

    for t, (state, action, reward) in enumerate(episode):
        sa_pair = (state, action)

        # Only the first occurrence of a pair within an episode contributes.
        if sa_pair not in visited:
            visited.add(sa_pair)
            # Discounted return from time step t to the end of the episode.
            G = sum((gamma ** k) * r for k, (_, _, r) in enumerate(episode[t:]))
            returns_sum[state][action] += G
            N[state][action] += 1.0
            Q[state][action] = returns_sum[state][action] / N[state][action]

def mc_predict(env, num_episodes, gamma=1.0):
    """Estimate Q for the arbitrary policy by averaging sampled first-visit returns."""
    # Each table maps a state to one entry per action.
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    print(f'Starting MC prediction with {num_episodes:,} episodes...')

    for i_episode in range(1, num_episodes + 1):
        episode = play_episode_arbitrary_policy(env)
        update_Q(episode, Q, returns_sum, N, gamma)

        if i_episode % 50000 == 0:
            print(f'Episode {i_episode:,}/{num_episodes:,}')

    print('MC prediction complete')
    return Q

print('MC functions ready')
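
The return used in `update_Q` can also be computed in a single backward pass, using the recursion G_t = r_{t+1} + gamma * G_{t+1}. The sketch below is an alternative formulation for illustration, not the code this lab runs; `first_visit_returns` is a hypothetical helper and the two-step episode is made up.

```python
def first_visit_returns(episode, gamma=1.0):
    """Map each (state, action) pair to the return following its first visit."""
    returns = {}
    G = 0.0
    # Walk the episode backward so each return reuses the one computed after it.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # Earlier time steps overwrite later ones, so the first visit is what remains.
        returns[(state, action)] = G
    return returns


# Made-up episode: hit on 13 (reward 0), then stick on 18 and win (reward +1).
episode = [((13, 10, False), 1, 0.0), ((18, 10, False), 0, 1.0)]
print(first_visit_returns(episode, gamma=0.9))
```

Blackjack episodes are only a few steps long, so the efficiency difference is negligible here; the backward pass mainly matters for long episodes, where recomputing the sum at every step becomes quadratic in episode length.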

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 6: Visualization Functions
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 6</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Creating 3D value plots and 2D policy heatmaps
    </p>
</div>

In [None]:
def plot_blackjack_values(V):
    """Plot state values as 3D surfaces, with and without a usable ace."""
    def get_Z(player_sum, dealer_card, usable_ace):
        # States that were never visited default to a value of 0.
        state = (player_sum, dealer_card, usable_ace)
        return V.get(state, 0)

    def create_surface(usable_ace, ax):
        player_range = np.arange(12, 22)
        dealer_range = np.arange(1, 11)
        X, Y = np.meshgrid(player_range, dealer_range)
        Z = np.array([[get_Z(x, y, usable_ace) for x in player_range] for y in dealer_range])
        surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm, linewidth=0, antialiased=True, vmin=-1, vmax=1, alpha=0.8)
        ax.set_xlabel('Player Sum')
        ax.set_ylabel('Dealer Showing')
        ax.set_zlabel('Value')
        ax.set_zlim(-1, 1)
        ax.view_init(elev=25, azim=-130)
        return surf

    fig = plt.figure(figsize=(14, 11))
    ax1 = fig.add_subplot(211, projection='3d')
    ax1.set_title('State Values WITH Usable Ace', fontsize=13, fontweight='bold')
    surf1 = create_surface(True, ax1)
    fig.colorbar(surf1, ax=ax1, shrink=0.5, aspect=10)
    ax2 = fig.add_subplot(212, projection='3d')
    ax2.set_title('State Values WITHOUT Usable Ace', fontsize=13, fontweight='bold')
    surf2 = create_surface(False, ax2)
    fig.colorbar(surf2, ax=ax2, shrink=0.5, aspect=10)
    plt.tight_layout()
    plt.show()

def plot_policy(policy):
    """Plot the chosen action per state as heatmaps, with and without a usable ace."""
    def get_action(player_sum, dealer_card, usable_ace):
        # States missing from the policy default to HIT (1).
        state = (player_sum, dealer_card, usable_ace)
        return policy.get(state, 1)

    def create_heatmap(usable_ace, ax):
        player_range = range(12, 22)
        dealer_range = range(1, 11)
        Z = np.array([[get_action(player, dealer, usable_ace) for player in player_range] for dealer in dealer_range])
        im = ax.imshow(Z, cmap='RdYlGn_r', aspect='auto', vmin=0, vmax=1, extent=[11.5, 21.5, 0.5, 10.5], origin='lower', interpolation='nearest')
        ax.set_xticks(range(12, 22))
        ax.set_yticks(range(1, 11))
        # The dealer's card 1 is an ace, so label it 'A'.
        ax.set_yticklabels(['A'] + list(range(2, 11)))
        ax.set_xlabel('Player Sum')
        ax.set_ylabel('Dealer Showing')
        ax.grid(True, color='black', linewidth=0.5, alpha=0.3)
        cbar = plt.colorbar(im, ax=ax, ticks=[0, 1], fraction=0.046, pad=0.04)
        cbar.ax.set_yticklabels(['STICK', 'HIT'])
        return im

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    ax1.set_title('Policy WITH Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(True, ax1)
    ax2.set_title('Policy WITHOUT Usable Ace', fontsize=12, fontweight='bold')
    create_heatmap(False, ax2)
    plt.tight_layout()
    plt.show()

print('Visualization functions ready')

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px; margin-top: 30px; border-radius: 8px; box-shadow: 0 4px 12px rgba(0,0,0,0.15);">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h2 style="font-family: 'Helvetica Neue', sans-serif; font-size: 20px; margin: 0; font-weight: 300;">
            Section 7: Running Monte Carlo Experiments
        </h2>
        <span style="font-size: 11px; opacity: 0.9;">Section 7</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        Learning Q-values and extracting the greedy policy (one improvement step, not guaranteed optimal)
    </p>
</div>

In [None]:
NUM_EPISODES = 500000
Q = mc_predict(env, NUM_EPISODES)

# State values under the arbitrary policy: V(s) = sum over a of pi(a|s) * Q(s, a),
# using the same action probabilities as play_episode_arbitrary_policy.
V_arbitrary = {}
for state, action_values in Q.items():
    if state[0] > 18:
        V_arbitrary[state] = 0.8 * action_values[0] + 0.2 * action_values[1]
    else:
        V_arbitrary[state] = 0.2 * action_values[0] + 0.8 * action_values[1]

# Greedy policy with respect to the estimated Q. Q belongs to the arbitrary policy,
# so this is one policy-improvement step and is not guaranteed to be optimal.
optimal_policy = {}
for state, action_values in Q.items():
    optimal_policy[state] = np.argmax(action_values)

states_count = len(Q)
stick_count = sum(1 for a in optimal_policy.values() if a == 0)
hit_count = sum(1 for a in optimal_policy.values() if a == 1)

print(f'States explored: {states_count}')
print(f'STICK states: {stick_count} ({100*stick_count/states_count:.1f}%)')
print(f'HIT states: {hit_count} ({100*hit_count/states_count:.1f}%)')
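
The two post-processing steps above, averaging Q under the policy and taking the greedy action, can be checked by hand on a single state. This is a minimal sketch with made-up Q-values; `policy_probs`, `state_value`, and `greedy_action` are hypothetical helper names, not functions defined elsewhere in the lab.

```python
def policy_probs(player_sum):
    """Action probabilities [P(stick), P(hit)] of the lab's fixed stochastic policy."""
    return [0.8, 0.2] if player_sum > 18 else [0.2, 0.8]


def state_value(player_sum, q_values):
    """V(s) = sum over a of pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(policy_probs(player_sum), q_values))


def greedy_action(q_values):
    """Index of the largest Q-value; ties go to the lower index, as with np.argmax."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.5, -0.5]  # made-up values for [stick, hit], for illustration only
print(state_value(20, q), greedy_action(q))
```

With a player sum of 20 the policy sticks 80% of the time, so the state value is 0.8 * 0.5 + 0.2 * (-0.5), while the greedy action simply picks stick because it has the larger Q-value.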

In [None]:
plot_blackjack_values(V_arbitrary)

In [None]:
plot_policy(optimal_policy)