<a href="https://colab.research.google.com/github/mdehghani86/RL_labs/blob/master/Lab_4_2_Policy_Improvement_and_Policy_Iteration_json.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 4-2: Policy Improvement and Policy Iteration
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 4 | Advanced Level | 90 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Policy iteration combines policy evaluation and policy improvement to find optimal policies for MDPs.
        This lab implements the complete policy iteration algorithm on <strong>Jack's Car Rental</strong> problem from
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>, Example 4.2.
        This classic problem demonstrates how dynamic programming handles complex state spaces with multiple constraints
        and stochastic dynamics modeled by <a href="https://en.wikipedia.org/wiki/Poisson_distribution" style="color: #17a2b8;">Poisson distributions</a>.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement greedy policy improvement, as justified by the policy improvement theorem</li>
        <li>Understand the policy iteration algorithm</li>
        <li>Handle complex state-action spaces</li>
        <li>Work with Poisson-distributed dynamics</li>
        <li>Visualize policy and value function evolution</li>
        <li>Solve Jack's Car Rental problem</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Problem Details</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">States</code> → (cars_loc1, cars_loc2) ∈ [0,20]×[0,20]</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Transfer [-5, +5] cars between locations</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Rewards</code> → $10/rental - $2/transfer</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Dynamics</code> → Poisson(λ) for requests/returns</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Constraints</code> → Max 20 cars per location</div>
    </div>
</td>
</tr>
</table>

## Section 1: Environment Setup and Utilities

We begin by importing necessary libraries and loading the pretty print utility for enhanced output formatting.

In [None]:
"""
Cell 1: Import Libraries and Load Pretty Print Utility
Purpose: Set up computational environment with necessary libraries and pretty print utility
"""

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import time
import requests
import warnings
warnings.filterwarnings('ignore')

# Fetch and execute the pretty print utility from GitHub
try:
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
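    response = requests.get(url, timeout=10)
    response.raise_for_status()  # treat HTTP errors as failures so the fallback below is used
    exec(response.text)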
    pretty_print("Environment Ready",
                 "Successfully loaded pretty_print utility<br>" +
                 "Libraries imported: NumPy, SciPy, Matplotlib<br>" +
                 "Ready for Policy Iteration implementation",
                 style='success')
except Exception as e:
    # Fallback definition if GitHub fetch fails
    from IPython.display import display, HTML
    def pretty_print(title, content, style='info'):
        themes = {
            'info': {'primary': '#17a2b8', 'secondary': '#0e5a63', 'background': '#f8f9fa'},
            'success': {'primary': '#28a745', 'secondary': '#155724', 'background': '#f8fff9'},
            'warning': {'primary': '#ffc107', 'secondary': '#e0a800', 'background': '#fffdf5'},
            'result': {'primary': '#6f42c1', 'secondary': '#4e2c8e', 'background': '#faf5ff'},
            'note': {'primary': '#20c997', 'secondary': '#0d7a5f', 'background': '#f0fdf9'}
        }
        theme = themes.get(style, themes['info'])
        html = f'''
        <div style="border-radius: 5px; margin: 10px 0; width: 20cm; max-width: 20cm; box-shadow: 0 2px 4px rgba(0,0,0,0.1);">
            <div style="background: linear-gradient(90deg, {theme['primary']} 0%, {theme['secondary']} 100%); padding: 10px 15px; border-radius: 5px 5px 0 0;">
                <strong style="color: white; font-size: 14px;">{title}</strong>
            </div>
            <div style="background: {theme['background']}; padding: 10px 15px; border-radius: 0 0 5px 5px; border-left: 3px solid {theme['primary']};">
                <div style="color: rgba(0,0,0,0.8); font-size: 12px; line-height: 1.5;">{content}</div>
            </div>
        </div>
        '''
        display(HTML(html))

    pretty_print("Fallback Mode",
                 f"Using local pretty_print definition<br>Error: {str(e)}",
                 style='warning')

# Configure matplotlib for better visualizations
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

## Section 2: Jack's Car Rental Problem Setup

### Problem Description

Jack manages two car rental locations. Each day:
- **Requests** arrive at each location (Poisson distributed)
- **Returns** come back to each location (Poisson distributed)
- Jack can **transfer** cars between locations overnight (max 5)

### Mathematical Formulation

- **State**: $s = (n_1, n_2)$ where $n_i$ = number of cars at location $i$
- **Action**: $a \in \{-5, -4, \ldots, 4, 5\}$ (positive = move cars from location 1 to location 2, negative = from 2 to 1)
- **Reward**: $R = 10 \times \text{rentals} - 2 \times |a|$
- **Dynamics**: Daily request and return counts at each location follow Poisson distributions
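
To make the formulation concrete, the sketch below applies one day of dynamics for a single, fixed outcome of requests and returns. The helper `simulate_day` and the example numbers are hypothetical and only illustrate the bookkeeping; the sign convention (positive action moves cars from location 1 to location 2) follows the code cells in this lab, and the action is assumed to be valid for the given state.

```python
MAX_CARS = 20        # capacity per location (from the lab)
RENTAL_REWARD = 10   # revenue per rented car
TRANSFER_COST = 2    # cost per transferred car


def simulate_day(state, action, requests, returns):
    """Hypothetical helper: one day of dynamics for known request/return counts.

    action > 0 moves cars from location 1 to location 2.
    Assumes the action is valid, i.e. no location goes negative.
    """
    n1, n2 = state
    # Overnight transfer, capped at the lot capacity
    n1 = min(n1 - action, MAX_CARS)
    n2 = min(n2 + action, MAX_CARS)
    # Rentals are limited by the cars actually on each lot
    rented1 = min(requests[0], n1)
    rented2 = min(requests[1], n2)
    reward = RENTAL_REWARD * (rented1 + rented2) - TRANSFER_COST * abs(action)
    # Returned cars become available the next day, again capped at capacity
    n1 = min(n1 - rented1 + returns[0], MAX_CARS)
    n2 = min(n2 - rented2 + returns[1], MAX_CARS)
    return (n1, n2), reward


next_state, reward = simulate_day(state=(10, 3), action=2,
                                  requests=(3, 6), returns=(3, 2))
print(next_state, reward)  # (8, 2) 76
```

Location 2 receives 6 requests but only has 5 cars after the transfer, so one request is lost; the reward is 10 × 8 rentals minus 2 × 2 transferred cars.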

In [None]:
"""
Cell 2: Define Problem Constants and Parameters
Purpose: Set up all constants for Jack's Car Rental problem
"""

# ============================================
# PROBLEM CONSTANTS
# ============================================

# Maximum number of cars at each location
MAX_CARS = 20

# Maximum cars that can be transferred overnight
MAX_TRANSFER = 5

# Rewards and costs
RENTAL_REWARD = 10  # Revenue per car rental
TRANSFER_COST = 2   # Cost per car transferred

# Policy evaluation convergence threshold
EPSILON = 1e-5

# Discount factor for future rewards
GAMMA = 0.9

# Poisson distribution parameters (λ values)
# Location 1: Requests ~ Poisson(3), Returns ~ Poisson(3)
# Location 2: Requests ~ Poisson(4), Returns ~ Poisson(2)
LAMBDA_REQUESTS_1 = 3
LAMBDA_RETURNS_1 = 3
LAMBDA_REQUESTS_2 = 4
LAMBDA_RETURNS_2 = 2

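# Truncate every Poisson distribution to the outcomes 0..POISSON_UPPER_BOUND-1.
# For λ ≤ 4 the pmf beyond this range is negligible, and the truncation
# keeps the expectation over all request/return combinations tractable.
POISSON_UPPER_BOUND = 12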

# Pre-compute Poisson PMFs for efficiency
POISSON_RETURNS_2 = poisson.pmf(range(POISSON_UPPER_BOUND), LAMBDA_RETURNS_2)  # λ=2
POISSON_REQUESTS_1 = poisson.pmf(range(POISSON_UPPER_BOUND), LAMBDA_REQUESTS_1)  # λ=3
POISSON_RETURNS_1 = poisson.pmf(range(POISSON_UPPER_BOUND), LAMBDA_RETURNS_1)  # λ=3
POISSON_REQUESTS_2 = poisson.pmf(range(POISSON_UPPER_BOUND), LAMBDA_REQUESTS_2)  # λ=4

# Create state space: all possible (cars_at_loc1, cars_at_loc2) combinations
STATES = np.array([[x, y] for x in range(MAX_CARS + 1)
                          for y in range(MAX_CARS + 1)])

pretty_print("Problem Constants Initialized",
             f"State space size: {len(STATES)} states<br>" +
             f"Action space: [{-MAX_TRANSFER}, {MAX_TRANSFER}]<br>" +
             f"Discount factor γ = {GAMMA}<br>" +
             f"Convergence threshold ε = {EPSILON}",
             style='info')
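
The truncation at `POISSON_UPPER_BOUND` discards a small amount of probability mass. A quick way to see how much is to sum the kept part of each distribution; the sketch below writes the Poisson pmf out with the standard library (the helper `poisson_pmf` is an illustrative stand-in for `scipy.stats.poisson.pmf`, not part of the lab code).

```python
import math

POISSON_UPPER_BOUND = 12  # same truncation point as in the lab


def poisson_pmf(k, lam):
    """Poisson probability mass function, written out with the standard library."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Probability mass kept by summing outcomes 0..POISSON_UPPER_BOUND-1
kept_mass = {
    lam: sum(poisson_pmf(k, lam) for k in range(POISSON_UPPER_BOUND))
    for lam in (2, 3, 4)
}

for lam, kept in kept_mass.items():
    print(f"lambda={lam}: kept={kept:.6f}, dropped={1 - kept:.2e}")
```

The largest rate, λ = 4, loses the most: a little under 0.1% of its mass. The truncated probabilities therefore do not sum exactly to 1, which slightly underestimates expected values but leaves the greedy comparison between actions essentially unaffected.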

## Section 3: Policy Evaluation Implementation

### The Bellman Equation for Policy Evaluation

For a given policy $\pi$, we compute:

$$V^\pi(s) = \sum_{s',r} p(s',r|s,\pi(s))[r + \gamma V^\pi(s')]$$

Where the transition probabilities depend on Poisson-distributed requests and returns.
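
Before applying this to Jack's problem, it can help to see the same update on a problem small enough to check by hand. The sketch below runs in-place iterative policy evaluation on a made-up 2-state chain under a fixed policy (`P` and `R` are illustrative values, not derived from the car rental dynamics) and compares the result with the exact solution of the linear system $(I - \gamma P)V = R$.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy (not Jack's problem):
# P[s, s'] is the transition probability, R[s] the expected one-step reward.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, 0.0])
GAMMA = 0.9
EPSILON = 1e-5

V = np.zeros(2)
sweeps = 0
while True:
    delta = 0.0
    for s in range(2):
        v_old = V[s]
        # In-place Bellman expectation update for state s
        V[s] = R[s] + GAMMA * P[s] @ V
        delta = max(delta, abs(v_old - V[s]))
    sweeps += 1
    if delta < EPSILON:
        break

# Exact solution of (I - gamma * P) V = R for comparison
V_exact = np.linalg.solve(np.eye(2) - GAMMA * P, R)
print(sweeps, V, V_exact)
```

The stopping rule is the same as in the lab: sweep until the largest change in any state's value falls below `EPSILON`.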

In [None]:
"""
Cell 3: Policy Evaluation Function
Purpose: Implement iterative policy evaluation for Jack's Car Rental
"""

def update_values(states, policy, values):
    """
    Perform one sweep of policy evaluation across all states

    This implements the Bellman expectation equation for the given policy.
    For each state, we compute the expected value considering:
    1. The action taken according to the policy
    2. Stochastic rental requests (Poisson distributed)
    3. Stochastic returns (Poisson distributed)
    4. Rewards from rentals minus transfer costs

    Args:
        states: Array of all possible states
        policy: Current policy (action for each state)
        values: Current value function estimates

    Returns:
        values: Updated value function
        delta: Maximum change in value (for convergence check)
    """
    delta = 0

    for state, action in zip(states, policy):
        # Store old value for delta calculation
        v_old = values[state[0], state[1]]

        # Execute action: transfer cars between locations
        # Positive action: transfer from loc1 to loc2
        # Negative action: transfer from loc2 to loc1
        state_after_transfer = state.copy()
        state_after_transfer[0] -= action  # Location 1 loses action cars
        state_after_transfer[1] += action  # Location 2 gains action cars

        # Ensure state remains within bounds [0, 20]
        state_after_transfer = np.clip(state_after_transfer, 0, MAX_CARS)

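        # Calculate expected value over all possible request/return combinations
        # Using meshgrid for vectorized computation. indexing='ij' with this
        # argument order makes the flattened axis order (requests 1, requests 2,
        # returns 1, returns 2) match joint_prob computed below.
        n_requests_1, n_requests_2, n_returns_1, n_returns_2 = np.meshgrid(
            range(POISSON_UPPER_BOUND), range(POISSON_UPPER_BOUND),
            range(POISSON_UPPER_BOUND), range(POISSON_UPPER_BOUND),
            indexing='ij'
        )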

        # Compute joint probability of rental requests
        requests_joint_prob = np.outer(POISSON_REQUESTS_1, POISSON_REQUESTS_2)

        # Actual rentals = min(requests, available_cars)
        n_rentals_1 = np.minimum(n_requests_1, state_after_transfer[0])
        n_rentals_2 = np.minimum(n_requests_2, state_after_transfer[1])

        # Calculate immediate rewards
        rewards = (RENTAL_REWARD * (n_rentals_1 + n_rentals_2).flatten()
                  - TRANSFER_COST * abs(action))

        # Compute joint probability of returns
        returns_joint_prob = np.outer(POISSON_RETURNS_1, POISSON_RETURNS_2)

        # Calculate final state after rentals and returns
        n_final_1 = np.minimum(
            state_after_transfer[0] - n_rentals_1 + n_returns_1, MAX_CARS
        ).flatten()
        n_final_2 = np.minimum(
            state_after_transfer[1] - n_rentals_2 + n_returns_2, MAX_CARS
        ).flatten()

        # Look up values of next states
        v_next = values[n_final_1, n_final_2]

        # Compute total joint probability
        joint_prob = np.outer(requests_joint_prob, returns_joint_prob).flatten()

        # Bellman update: expected value = Σ p(s',r|s,a)[r + γV(s')]
        values[state[0], state[1]] = joint_prob @ (rewards + GAMMA * v_next)

        # Track maximum change for convergence
        delta = max(delta, abs(v_old - values[state[0], state[1]]))

    return values, delta

pretty_print("Policy Evaluation Ready",
             "Bellman expectation equation implemented<br>" +
             "Handles Poisson-distributed dynamics<br>" +
             "Vectorized for computational efficiency",
             style='success')
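
Cell 3 multiplies a flattened joint probability vector with flattened outcome grids, which only gives the right expectation when both are flattened in the same axis order. A minimal sketch of a sanity check on a toy pair of distributions follows (`p_a` and `p_b` are made-up values, not from the lab): the expectation of one variable computed over the flattened grid should equal that variable's own mean.

```python
import numpy as np

p_a = np.array([0.2, 0.8])        # hypothetical pmf over outcomes a in {0, 1}
p_b = np.array([0.5, 0.3, 0.2])   # hypothetical pmf over outcomes b in {0, 1, 2}

# indexing='ij' keeps the axis order equal to the argument order:
# axis 0 is a, axis 1 is b
a_grid, b_grid = np.meshgrid(range(2), range(3), indexing='ij')
joint = np.outer(p_a, p_b)        # joint[a, b] = p_a[a] * p_b[b]

# E[a] over the flattened grid; should match the mean of p_a, which is 0.8
mean_a = joint.flatten() @ a_grid.flatten()
print(mean_a)

# With the default indexing='xy' the first two axes are swapped
a_xy, b_xy = np.meshgrid(range(2), range(3))
print(a_grid.shape, a_xy.shape)   # (2, 3) (3, 2)
```

The same kind of check scales to the four-dimensional grid of requests and returns: compute the expected number of requests at one location from the flattened arrays and compare it with the corresponding λ.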

## Section 4: Policy Improvement Implementation

### Policy Improvement Theorem

Given a value function $V^\pi$, we can improve the policy by acting greedily:

$$\pi'(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V^\pi(s')]$$

This guarantees $V^{\pi'} \geq V^\pi$ for all states.
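
As a small illustration of the greedy step, the sketch below computes $Q(s,a)$ for a made-up MDP with two states and two actions and picks the maximizing action in each state. The arrays `P` and `R` are hypothetical; `V` is the value of the policy that always takes action 0 ("stay"), which can be checked by hand.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (not Jack's problem).
# P[a, s, s'] are transition probabilities, R[a, s] expected rewards.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay in the current state
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch to the other state
R = np.array([[0.0, 1.0],                  # staying in state 1 pays 1 per step
              [0.0, 0.0]])                 # switching pays nothing
GAMMA = 0.9

# Value of the policy "always stay": V(0) = 0, V(1) = 1 / (1 - 0.9) = 10
V = np.array([0.0, 10.0])

# Q[a, s] = R[a, s] + gamma * sum over s' of P[a, s, s'] * V[s']
Q = R + GAMMA * P @ V
greedy_policy = Q.argmax(axis=0)

print(Q)
print(greedy_policy)  # [1 0]
```

In state 0 the greedy policy switches (value 9 instead of 0), while in state 1 it keeps staying, so the improved policy is at least as good in every state and strictly better in state 0.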

In [None]:
"""
Cell 4: Policy Improvement Function
Purpose: Implement greedy policy improvement based on current value function
"""

def update_policy(states, policy, values):
    """
    Improve policy by acting greedily with respect to current value function

    For each state, we evaluate all possible actions and select the one
    that maximizes expected return. This implements the policy improvement
    theorem, guaranteeing monotonic improvement.

    Args:
        states: Array of all possible states
        policy: Current policy to be improved
        values: Current value function

    Returns:
        policy: Improved policy
        stable: True if policy unchanged (convergence)
    """
    stable = True

    for i in range(states.shape[0]):
        state = states[i]
        old_action = policy[i]

        # Determine valid action range considering constraints:
        # 1. Can't transfer more cars than available at source
        # 2. Can't exceed capacity at destination
        # 3. Maximum transfer limit of 5 cars

        # Lower bound: max transfer from loc2 to loc1
        actions_lb = max(
            -state[1],          # Can't transfer more than available at loc2
            state[0] - MAX_CARS,  # Can't exceed capacity at loc1
            -MAX_TRANSFER       # Transfer limit
        )

        # Upper bound: max transfer from loc1 to loc2
        actions_ub = min(
            state[0],           # Can't transfer more than available at loc1
            MAX_CARS - state[1],  # Can't exceed capacity at loc2
            MAX_TRANSFER        # Transfer limit
        )

        # Create action space for this state
        actions = np.arange(actions_lb, actions_ub + 1)
        action_values = []

        # Evaluate each possible action
        for action in actions:
            # Apply action to get state after transfer
            state_after_transfer = state.copy()
            state_after_transfer[0] -= action
            state_after_transfer[1] += action
            state_after_transfer = np.clip(state_after_transfer, 0, MAX_CARS)

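            # Calculate expected value for this action
            # (Similar computation as in policy evaluation.) indexing='ij' with
            # this argument order makes the flattened axis order (requests 1,
            # requests 2, returns 1, returns 2) match joint_prob computed below.
            n_requests_1, n_requests_2, n_returns_1, n_returns_2 = np.meshgrid(
                range(POISSON_UPPER_BOUND), range(POISSON_UPPER_BOUND),
                range(POISSON_UPPER_BOUND), range(POISSON_UPPER_BOUND),
                indexing='ij'
            )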

            # Joint probability of requests
            requests_joint_prob = np.outer(POISSON_REQUESTS_1, POISSON_REQUESTS_2)

            # Calculate rentals (limited by available cars)
            n_rentals_1 = np.minimum(n_requests_1, state_after_transfer[0])
            n_rentals_2 = np.minimum(n_requests_2, state_after_transfer[1])

            # Immediate rewards
            rewards = (RENTAL_REWARD * (n_rentals_1 + n_rentals_2).flatten()
                      - TRANSFER_COST * abs(action))

            # Joint probability of returns
            returns_joint_prob = np.outer(POISSON_RETURNS_1, POISSON_RETURNS_2)

            # Final states after rentals and returns
            n_final_1 = np.minimum(
                state_after_transfer[0] - n_rentals_1 + n_returns_1, MAX_CARS
            ).flatten()
            n_final_2 = np.minimum(
                state_after_transfer[1] - n_rentals_2 + n_returns_2, MAX_CARS
            ).flatten()

            # Look up values of next states
            v_next = values[n_final_1, n_final_2]

            # Total joint probability
            joint_prob = np.outer(requests_joint_prob, returns_joint_prob).flatten()

            # Q(s,a) = expected return for this state-action pair
            action_value = joint_prob @ (rewards + GAMMA * v_next)
            action_values.append(action_value)

        # Select action with maximum expected value (greedy)
        policy[i] = actions[np.argmax(action_values)]

        # Check if policy changed
        if stable and policy[i] != old_action:
            stable = False

    return policy, stable

pretty_print("Policy Improvement Ready",
             "Greedy policy improvement implemented<br>" +
             "Evaluates all valid actions per state<br>" +
             "Selects action maximizing expected return",
             style='success')
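
The action bounds used in Cell 4 can be isolated into a small function to see which transfers are allowed in a given state. The helper name `valid_actions` is hypothetical; the three constraints are the ones listed in the cell's comments.

```python
def valid_actions(n1, n2, max_cars=20, max_transfer=5):
    """Hypothetical helper: list the transfers allowed in state (n1, n2).

    Positive values move cars from location 1 to location 2.
    """
    lower = max(-n2, n1 - max_cars, -max_transfer)   # most cars moved from 2 to 1
    upper = min(n1, max_cars - n2, max_transfer)     # most cars moved from 1 to 2
    return list(range(lower, upper + 1))


print(valid_actions(0, 0))    # [0]
print(valid_actions(3, 19))   # [-5, -4, -3, -2, -1, 0, 1]
print(valid_actions(10, 10))  # [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
```

Note that ruling out transfers that would overfill the destination is a modeling choice of this lab's code. In Sutton & Barto's statement of the example, cars beyond the 20-car limit are returned to the nationwide company and simply disappear from the problem.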

## Section 5: Policy Iteration Algorithm

### The Complete Algorithm

Policy Iteration alternates between:
1. **Policy Evaluation**: Compute $V^\pi$ for current policy
2. **Policy Improvement**: Update $\pi$ to be greedy w.r.t. $V^\pi$

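Each improvement step yields a policy at least as good as the previous one, and a finite MDP has only finitely many deterministic policies, so the process terminates at an optimal policy $\pi^*$ with optimal value function $V^*$.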
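
The alternation can be sketched end to end on a tiny made-up MDP. The function `policy_iteration_small` below is a hypothetical illustration, not the lab's implementation: it evaluates each policy exactly with a linear solve instead of iterative sweeps, which is only practical because the example has two states.

```python
import numpy as np


def policy_iteration_small(P, R, gamma=0.9):
    """Hypothetical sketch: policy iteration with exact evaluation.

    P[a, s, s'] are transition probabilities, R[a, s] expected rewards.
    Returns the final policy and its value function.
    """
    n_actions, n_states, _ = P.shape
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[policy, idx]          # row s is P[policy[s], s, :]
        R_pi = R[policy, idx]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V           # stable policy, stop
        policy = new_policy


# Made-up MDP: action 0 stays, action 1 switches; staying in state 1 pays 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])

policy, V = policy_iteration_small(P, R)
print(policy, V)
```

Starting from "always stay", one improvement step changes the action in state 0 to "switch", and the next step finds the policy unchanged, so the loop stops with values of 9 and 10 for the two states.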

In [None]:
"""
Cell 5: Main Policy Iteration Implementation
Purpose: Combine policy evaluation and improvement to find optimal policy
"""

def policy_iteration():
    """
    Execute complete policy iteration algorithm for Jack's Car Rental

    Alternates between:
    1. Policy Evaluation: Compute V^π for current policy
    2. Policy Improvement: Make π greedy w.r.t. V^π

    Continues until policy is stable (no further improvements possible)
    """

    # Initialize value function to zero for all states
    values = np.zeros((MAX_CARS + 1, MAX_CARS + 1))

    # Initialize policy: do nothing (transfer 0 cars) for all states
    policy = np.zeros(STATES.shape[0], dtype=int)

    pretty_print("Starting Policy Iteration",
                 f"Initial policy: No transfers<br>" +
                 f"State space: {MAX_CARS + 1} × {MAX_CARS + 1} = {len(STATES)} states<br>" +
                 f"Action space: up to {2 * MAX_TRANSFER + 1} valid actions per state",
                 style='info')

    # Plot initial policy
    plt.figure(figsize=(8, 6))
    plt.imshow(policy.reshape(MAX_CARS + 1, MAX_CARS + 1), cmap='RdBu_r',
               vmin=-MAX_TRANSFER, vmax=MAX_TRANSFER, origin='lower')
    plt.colorbar(label='Cars transferred (+ from 1 to 2, - from 2 to 1)')
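    # Rows of the reshaped grid index Location 1 and columns index Location 2,
    # so imshow puts Location 2 on the x-axis and Location 1 on the y-axis
    plt.xlabel('Cars at Location 2')
    plt.ylabel('Cars at Location 1')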
    plt.title('Initial Policy (Iteration 0)')
    plt.show()

    # Main policy iteration loop
    stable = False
    iteration = 0

    while not stable:
        iteration += 1
        start_time = time.time()

        pretty_print(f"Iteration {iteration}",
                    "Starting policy evaluation...",
                    style='info')

        # POLICY EVALUATION
        # Iterate until value function converges
        eval_iterations = 0
        while True:
            values, delta = update_values(STATES, policy, values)
            eval_iterations += 1

            if eval_iterations % 10 == 0:
                print(f"  Evaluation iteration {eval_iterations}: δ = {delta:.6f}")

            # Check convergence
            if delta < EPSILON:
                pretty_print("Policy Evaluation Complete",
                           f"Converged after {eval_iterations} iterations<br>" +
                           f"Final δ = {delta:.8f}",
                           style='success')
                break

        # POLICY IMPROVEMENT
        pretty_print(f"Iteration {iteration}",
                    "Starting policy improvement...",
                    style='info')

        policy, stable = update_policy(STATES, policy, values)

        elapsed_time = time.time() - start_time

        if stable:
            pretty_print("Policy Iteration Complete!",
                       f"Optimal policy found after {iteration} iterations<br>" +
                       f"Last iteration time: {elapsed_time:.2f} seconds",
                       style='result')
        else:
            pretty_print(f"Iteration {iteration} Complete",
                       f"Policy improved<br>" +
                       f"Time: {elapsed_time:.2f} seconds",
                       style='success')

        # Visualize current policy
        plt.figure(figsize=(8, 6))
        policy_grid = policy.reshape(MAX_CARS + 1, MAX_CARS + 1)
        plt.imshow(policy_grid, cmap='RdBu_r',
                  vmin=-MAX_TRANSFER, vmax=MAX_TRANSFER, origin='lower')
        plt.colorbar(label='Cars transferred (+ from 1 to 2, - from 2 to 1)')
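        # Rows of policy_grid index Location 1 and columns index Location 2,
        # so Location 2 is on the x-axis and Location 1 on the y-axis
        plt.xlabel('Cars at Location 2')
        plt.ylabel('Cars at Location 1')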
        plt.title(f'Policy at Iteration {iteration}')

        # Add contour lines for better visualization
        X, Y = np.meshgrid(range(MAX_CARS + 1), range(MAX_CARS + 1))
        plt.contour(X, Y, policy_grid, levels=range(-5, 6),
                   colors='black', alpha=0.4, linewidths=0.5)
        plt.show()

    return policy, values

pretty_print("Policy Iteration Function Ready",
             "Complete algorithm implemented<br>" +
             "Will iterate until optimal policy found",
             style='success')

## Section 6: Execute Policy Iteration and Visualize Results

In [None]:
"""
Cell 6: Run Policy Iteration and Generate Final Visualizations
Purpose: Execute the complete algorithm and visualize optimal policy and value function
"""

# Run policy iteration to find optimal policy
pretty_print("Executing Policy Iteration",
             "This will take several minutes to converge...<br>" +
             "Watch as the policy evolves toward optimality!",
             style='warning')

optimal_policy, optimal_values = policy_iteration()

# Create comprehensive visualization of results
fig = plt.figure(figsize=(15, 5))

# Subplot 1: Final optimal policy
ax1 = plt.subplot(1, 3, 1)
policy_grid = optimal_policy.reshape(MAX_CARS + 1, MAX_CARS + 1)
im1 = ax1.imshow(policy_grid, cmap='RdBu_r',
                 vmin=-MAX_TRANSFER, vmax=MAX_TRANSFER, origin='lower')
# Rows of policy_grid index Location 1 and columns index Location 2
ax1.set_xlabel('Cars at Location 2')
ax1.set_ylabel('Cars at Location 1')
ax1.set_title('Optimal Policy π*')
plt.colorbar(im1, ax=ax1, label='Transfer')

# Add contour lines
X, Y = np.meshgrid(range(MAX_CARS + 1), range(MAX_CARS + 1))
ax1.contour(X, Y, policy_grid, levels=range(-5, 6),
           colors='black', alpha=0.4, linewidths=0.5)

# Subplot 2: Value function heatmap
ax2 = plt.subplot(1, 3, 2)
im2 = ax2.imshow(optimal_values, cmap='viridis', origin='lower')
# optimal_values[n1, n2]: rows are Location 1, columns are Location 2
ax2.set_xlabel('Cars at Location 2')
ax2.set_ylabel('Cars at Location 1')
ax2.set_title('Value Function V*')
plt.colorbar(im2, ax=ax2, label='Expected Return')

# Subplot 3: 3D surface plot of value function
ax3 = plt.subplot(1, 3, 3, projection='3d')
ax3.plot_surface(X, Y, optimal_values, cmap='viridis',
                 edgecolor='none', alpha=0.8)
# X follows the column index (Location 2), Y the row index (Location 1)
ax3.set_xlabel('Cars at Location 2')
ax3.set_ylabel('Cars at Location 1')
ax3.set_zlabel('Expected Return')
ax3.set_title('Value Function V* (3D)')
ax3.view_init(elev=30, azim=45)

plt.tight_layout()
plt.show()

# Analyze optimal policy characteristics
positive_transfers = np.sum(optimal_policy > 0)
negative_transfers = np.sum(optimal_policy < 0)
no_transfers = np.sum(optimal_policy == 0)
max_value = np.max(optimal_values)
min_value = np.min(optimal_values)

analysis_text = f"""
<strong>Optimal Policy Analysis:</strong><br><br>
• States with transfers from Location 1 to 2: {positive_transfers} ({100*positive_transfers/len(STATES):.1f}%)<br>
• States with transfers from Location 2 to 1: {negative_transfers} ({100*negative_transfers/len(STATES):.1f}%)<br>
• States with no transfer: {no_transfers} ({100*no_transfers/len(STATES):.1f}%)<br><br>
<strong>Value Function Statistics:</strong><br>
• Maximum expected return: ${max_value:.2f}<br>
• Minimum expected return: ${min_value:.2f}<br>
• Average expected return: ${np.mean(optimal_values):.2f}<br><br>
<strong>Key Insights:</strong><br>
• The policy tends to balance cars between locations<br>
• Location 2 has higher demand (λ=4) but lower returns (λ=2)<br>
• Optimal policy compensates by transferring cars to Location 2<br>
• Transfer costs create a threshold effect in the policy
"""

pretty_print("Results Analysis", analysis_text, style='result')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Policy Structure:</strong> The optimal policy shows a clear diagonal pattern, transferring cars from Location 1 to 2 when Location 1 has excess inventory and Location 2 is low.</p>
        <p><strong>2. Asymmetric Dynamics:</strong> The different Poisson parameters at each location create an asymmetric optimal policy that favors transfers to Location 2.</p>
        <p><strong>3. Transfer Threshold:</strong> Due to transfer costs ($2 per car), small imbalances are not corrected - there's a threshold effect.</p>
        <p><strong>4. Convergence:</strong> Policy iteration converges in relatively few iterations (typically 4-6) despite the large state space (441 states).</p>
        <p><strong>5. Value Function:</strong> The value function is smooth and generally increases as more cars are available at either location, reflecting higher earning potential.</p>
    </div>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>How would the optimal policy change if transfer costs increased to $5 per car?</li>
        <li>What if we added a capacity constraint on overnight transfers (e.g., only one truck available)?</li>
        <li>How would non-linear transfer costs (e.g., fixed cost + per-car cost) affect the policy?</li>
        <li>Could we speed up convergence using value iteration instead of policy iteration?</li>
        <li>How would the solution change with different Poisson parameters?</li>
        <li>What real-world factors are we ignoring that might affect Jack's optimal strategy?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 4-2: Policy Improvement and Policy Iteration</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 5 - Monte Carlo Methods</p>
</div>