<a href="https://colab.research.google.com/github/mdehghani86/RL_labs/blob/master/Lab%204-1%3A%20Policy%20Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 4-1: Policy Evaluation
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 4 | Intermediate Level | 60 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Policy evaluation is the foundation of dynamic programming methods in reinforcement learning. Building on the dynamic programming framework introduced by
        <a href="https://en.wikipedia.org/wiki/Richard_E._Bellman" style="color: #17a2b8;">Richard Bellman</a> in the 1950s,
        this iterative algorithm computes the state-value function for a given policy by repeatedly applying the Bellman expectation equation as an update rule.
        This lab implements the fundamental algorithm from
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto (2018)</a>,
        Chapter 4, demonstrating convergence on the classic 4×4 gridworld.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand the iterative policy evaluation algorithm</li>
        <li>Implement the Bellman equation for value computation</li>
        <li>Observe convergence properties of dynamic programming</li>
        <li>Extract greedy policies from value functions</li>
        <li>Reproduce Figure 4.1 from Sutton & Barto</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Key Concepts</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">V(s)</code> → State-value function</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">π(a|s)</code> → Policy (action probability)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">γ</code> → Discount factor</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Bellman Equation</code> → V(s) = Σ π(a|s)[R + γV(s')]</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Gridworld</code> → 4×4 MDP environment</div>
    </div>
</td>
</tr>
</table>

## Section 1: Environment Setup and Pretty Print Utility

We begin by setting up our environment and loading the pretty print utility from GitHub for enhanced output formatting.

In [1]:
"""
Cell 1: Import Libraries and Load Pretty Print Utility
Purpose: Set up the computational environment and load the pretty print utility from GitHub
"""

import numpy as np
import requests
import sys
from io import StringIO
from tabulate import tabulate
import warnings
warnings.filterwarnings('ignore')

# Fetch and execute the pretty print utility from GitHub
try:
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
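    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Use the fallback rather than exec-ing an error page
    exec(response.text)  # Runs remote code: only do this with a URL you trust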
    pretty_print("Utility Loaded",
                 "Successfully loaded pretty_print utility from GitHub<br>" +
                 "Ready to begin Policy Evaluation implementation",
                 style='success')
except Exception as e:
    # Fallback definition if GitHub fetch fails
    from IPython.display import display, HTML
    def pretty_print(title, content, style='info'):
        """Fallback pretty print function"""
        themes = {
            'info': {'primary': '#17a2b8', 'secondary': '#0e5a63', 'background': '#f8f9fa'},
            'success': {'primary': '#28a745', 'secondary': '#155724', 'background': '#f8fff9'},
            'warning': {'primary': '#ffc107', 'secondary': '#e0a800', 'background': '#fffdf5'},
            'result': {'primary': '#6f42c1', 'secondary': '#4e2c8e', 'background': '#faf5ff'},
            'note': {'primary': '#20c997', 'secondary': '#0d7a5f', 'background': '#f0fdf9'}
        }
        theme = themes.get(style, themes['info'])
        html = f'''
        <div style="border-radius: 5px; margin: 10px 0; width: 20cm; max-width: 20cm; box-shadow: 0 2px 4px rgba(0,0,0,0.1);">
            <div style="background: linear-gradient(90deg, {theme['primary']} 0%, {theme['secondary']} 100%); padding: 10px 15px; border-radius: 5px 5px 0 0;">
                <strong style="color: white; font-size: 14px;">{title}</strong>
            </div>
            <div style="background: {theme['background']}; padding: 10px 15px; border-radius: 0 0 5px 5px; border-left: 3px solid {theme['primary']};">
                <div style="color: rgba(0,0,0,0.8); font-size: 12px; line-height: 1.5;">{content}</div>
            </div>
        </div>
        '''
        display(HTML(html))
    pretty_print("Fallback Mode",
                 "Using local pretty_print definition<br>" +
                 "GitHub utility fetch failed, but continuing with fallback",
                 style='warning')

## Section 2: Gridworld MDP Implementation

### Theoretical Foundation

A **Markov Decision Process (MDP)** is defined by the tuple $(S, A, P, R, \gamma)$ where:
- $S$ is the state space
- $A$ is the action space
- $P(s'|s,a)$ is the state transition probability
- $R(s,a,s')$ is the reward function
- $\gamma$ is the discount factor

Our gridworld is a deterministic MDP where the agent moves in a 4×4 grid with two terminal states.
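
To make the tuple concrete, the sketch below writes $P$ and $R$ for the 4×4 grid as a single deterministic `step` function. The names `step`, `WIDTH`, `HEIGHT`, `TERMINALS`, and `ACTIONS` are illustrative and not part of the lab code, and treating terminal states as absorbing is an assumption of this sketch. The terminal corners match the ones used later in this lab.

```python
# Illustrative sketch only: the names below are assumptions, not the lab's API.
WIDTH, HEIGHT = 4, 4
TERMINALS = {(0, 3), (3, 0)}  # top-left and bottom-right, origin at bottom-left
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, right, left

def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    if state in TERMINALS:
        return state, 0  # assumption: terminal states are absorbing
    x = min(max(state[0] + action[0], 0), WIDTH - 1)   # clip x to the grid
    y = min(max(state[1] + action[1], 0), HEIGHT - 1)  # clip y to the grid
    return (x, y), -1  # every move from a non-terminal state costs -1

print(step((0, 0), (1, 0)))   # ((1, 0), -1): ordinary move right
print(step((0, 0), (-1, 0)))  # ((0, 0), -1): bumping the wall keeps the agent in place
```

Because each `(state, action)` pair has exactly one successor, $P(s'|s,a)$ is 1 for that successor and 0 elsewhere.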

In [2]:
"""
Cell 2: BaseGridworld Class Definition
Purpose: Implement the gridworld MDP environment with states, actions, and rewards
"""

class BaseGridworld:
    """
    Defines the base class for the Gridworld MDP.

    State representation: (x,y) cartesian coordinates
    - Origin (0,0) is bottom-left corner
    - x increases rightward, y increases upward

    Action representation: (x_offset, y_offset)
    - (0, 1): Move UP
    - (0, -1): Move DOWN
    - (1, 0): Move RIGHT
    - (-1, 0): Move LEFT

    Example transitions:
    - State (0,0) + Action (1,0) → Next state (1,0) [move right]
    - State (1,2) + Action (0,1) → Next state (1,3) [move up]
    """

    def __init__(self, width, height, start_state=None, terminal_states=None):
        """
        Initialize the gridworld environment

        Args:
            width: Grid width (x-dimension)
            height: Grid height (y-dimension)
            start_state: Initial agent position (tuple)
            terminal_states: List of terminal state positions (defaults to none)
        """
        self.width = width
        self.height = height
        self.start_state = start_state
        # Avoid a shared mutable default argument: build a fresh list per instance
        self.terminal_states = list(terminal_states) if terminal_states is not None else []

        # Initialize agent position
        self.reset_state()

    def get_possible_actions(self, state):
        """
        Returns all possible actions from any state
        Following the uniform random policy assumption
        """
        # Four cardinal directions: UP, LEFT, DOWN, RIGHT
        all_actions = [(0, 1), (-1, 0), (0, -1), (1, 0)]
        return all_actions

    def get_states(self):
        """
        Generate all possible states in the gridworld
        Returns list of (x,y) tuples for all grid positions
        """
        return [(x, y) for x in range(self.width) for y in range(self.height)]

    def get_state_reward_transition(self, state, action):
        """
        Execute action from state and return next state and reward

        Implements deterministic transitions with boundary conditions:
        - Actions that would move off-grid keep agent in same position
        - All non-terminal transitions give reward of -1
        - Terminal state transitions give reward of 0
        """
        # Compute potential next state
        next_state = np.array(state) + np.array(action)

        # Apply boundary conditions (agent stays in place if hitting wall)
        next_state = self._clip_state_to_grid(next_state)

        # Convert to integer tuple
        next_state = int(next_state[0]), int(next_state[1])

        # Get reward for this transition
        reward = self.get_reward(state, action, next_state)

        return next_state, reward

    def get_reward(self, state, action, next_state):
        """
        Reward function: -1 for all non-terminal transitions, 0 from terminal states
        This encourages the agent to reach terminal states quickly
        """
        if state in self.terminal_states:
            return 0
        else:
            return -1

    def _clip_state_to_grid(self, state):
        """
        Ensure state remains within grid boundaries
        Implements "bump into wall" mechanics
        """
        x, y = state
        x_clipped = np.clip(x, 0, self.width - 1)
        y_clipped = np.clip(y, 0, self.height - 1)
        return x_clipped, y_clipped

    def is_terminal(self, state):
        """Check if a state is terminal"""
        return tuple(state) in self.terminal_states

    def reset_state(self):
        """Reset agent to starting position"""
        self.state = self.start_state
        return self.state

pretty_print("Gridworld MDP Created",
             "BaseGridworld class implemented with:<br>" +
             "• Deterministic state transitions<br>" +
             "• Reward function: -1 (non-terminal), 0 (terminal)<br>" +
             "• Wall collision handling (agent stays in place)",
             style='success')

## Section 3: Policy Evaluation Algorithm

### The Bellman Equation for Policy Evaluation

For a given policy $\pi$, the state-value function satisfies:

$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$$

In our gridworld with uniform random policy:
- $\pi(a|s) = 0.25$ for all actions (equal probability)
- $P(s'|s,a) = 1$ for the deterministic next state
- Iterative updates: $V_{k+1}(s) \leftarrow \sum_a \pi(a|s)[R + \gamma V_k(s')]$
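
As a worked instance of the update, the sketch below computes $V_2$ for the state immediately to the right of the top-left terminal, starting from $V_1 = -1$ in every non-terminal state. The function name `backup` and the hard-coded successor values are illustrative assumptions, not part of the lab code.

```python
def backup(successor_values, reward=-1.0, gamma=1.0):
    """One Bellman expectation backup under a uniform random policy.

    successor_values: V_k(s') for each of the equally likely actions.
    """
    prob = 1.0 / len(successor_values)
    return sum(prob * (reward + gamma * v) for v in successor_values)

# State (1, 3): UP bumps the wall (stay, V_1 = -1), LEFT reaches the terminal (V = 0),
# RIGHT and DOWN lead to ordinary states (V_1 = -1).
v2 = backup([-1.0, 0.0, -1.0, -1.0])
print(v2)  # -1.75
```

Three of the four actions contribute $-1 + (-1) = -2$ and one contributes $-1 + 0 = -1$, so the average is $-1.75$.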

In [3]:
"""
Cell 3: Display Utility Functions
Purpose: Helper functions for visualizing policies and state values
"""

def action_to_arrow(action):
    """
    Convert action tuple to arrow symbol for visualization

    Mapping:
    (0, 1) → ↑ (up)
    (0, -1) → ↓ (down)
    (1, 0) → → (right)
    (-1, 0) → ← (left)
    """
    x, y = action
    if y == +1: return '↑'
    if y == -1: return '↓'
    if x == +1: return '→'
    if x == -1: return '←'
    return ''

pretty_print("Display Utilities Ready",
             "Action-to-arrow converter loaded for policy visualization",
             style='info')

In [4]:
"""
Cell 4: UniformPolicyAgent Implementation
Purpose: Implement the iterative policy evaluation algorithm with uniform random policy
"""

class UniformPolicyAgent:
    """
    Implements iterative policy evaluation for uniform random policy
    Algorithm: Sutton & Barto Chapter 4, page 75
    """

    def __init__(self, mdp, γ=0.9, eps=1e-2, n_iterations=1000):
        """
        Initialize and run policy evaluation

        Args:
            mdp: Gridworld MDP environment
            γ: Discount factor (0 to 1)
            eps: Convergence threshold
            n_iterations: Maximum iterations
        """
        self.mdp = mdp
        self.γ = γ

        # Initialize V(s) = 0 for all states
        self.values = np.zeros((self.mdp.width, self.mdp.height))
        self.policy = {}

        # Run iterative policy evaluation
        pretty_print("Starting Policy Evaluation",
                    f"Parameters: γ={γ}, ε={eps}, max_iter={n_iterations}",
                    style='info')

        # Iterative policy evaluation main loop
        for iteration in range(n_iterations):
            # Store V_{k+1} values
            new_values = np.zeros_like(self.values)

            # Update each state
            for state in self.mdp.get_states():
                # Skip terminal states (their value stays 0)
                if state in self.mdp.terminal_states:
                    continue

                # Compute expected value under uniform random policy
                q_values = {}
                for action in self.mdp.get_possible_actions(state):
                    # Uniform random policy: π(a|s) = 1/4 for all actions
                    action_prob = 1.0 / len(self.mdp.get_possible_actions(state))

                    # Compute Q(s,a) = R + γV(s')
                    q_values[action] = self.compute_q_value(state, action)

                    # Bellman update: V(s) = Σ_a π(a|s) * Q(s,a)
                    new_values[state] += action_prob * q_values[action]

            # Check convergence with the L1 norm: Σ_s |V_{k+1}(s) - V_k(s)| < ε
            # (Sutton & Barto use the maximum change over states instead)
            delta = np.sum(np.abs(new_values - self.values))
            if delta < eps:
                pretty_print("Convergence Achieved",
                           f"Converged at iteration {iteration} with δ={delta:.6f}",
                           style='success')
                break

            # Update values for next iteration
            self.values = new_values

        # Extract greedy policy from final value function
        self.policy = self.update_policy()

    def compute_q_value(self, state, action):
        """
        Compute Q(s,a) = R(s,a,s') + γV(s')

        This is the expected return for taking action a in state s
        """
        # Get next state and immediate reward
        next_state, reward = self.mdp.get_state_reward_transition(state, action)

        # Q-value = immediate reward + discounted future value
        return reward + self.γ * self.values[next_state]

    def update_policy(self):
        """
        Extract greedy policy from current value function
        π'(s) = argmax_a Q(s,a)

        Returns all optimal actions when there are ties
        """
        policy = {}

        for state in self.mdp.get_states():
            # Skip terminal states
            if state in self.mdp.terminal_states:
                continue

            # Compute Q-values for all actions
            q_values = {}
            for action in self.mdp.get_possible_actions(state):
                q_values[action] = self.compute_q_value(state, action)

            # Find all actions with maximum Q-value (handle ties)
            max_q = max(q_values.values())
            # Include all actions within rounding error of max
            policy[state] = [a for a, v in q_values.items()
                           if round(v, 5) == round(max_q, 5)]

        return policy

pretty_print("Policy Evaluation Agent Ready",
             "UniformPolicyAgent class implemented with:<br>" +
             "• Iterative Bellman updates<br>" +
             "• Convergence checking<br>" +
             "• Greedy policy extraction",
             style='success')

## Section 4: Running Policy Evaluation Experiments

We'll now reproduce Figure 4.1 from Sutton & Barto, showing:
1. How state values evolve over iterations
2. How the greedy policy improves
3. Convergence to the optimal policy
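
A compact way to sanity-check the numbers in Figure 4.1 is a vectorized sweep over the whole grid. The sketch below is an independent NumPy version, not the `UniformPolicyAgent` class from this lab. The function name `evaluate_random_policy` is hypothetical, and it indexes the array as `[row, col]` with row 0 at the top, which differs from the lab's `(x, y)` convention.

```python
import numpy as np

def evaluate_random_policy(k, size=4, gamma=1.0):
    """Run k synchronous sweeps of policy evaluation for the uniform random policy."""
    v = np.zeros((size, size))
    terminal = np.zeros((size, size), dtype=bool)
    terminal[0, 0] = terminal[-1, -1] = True  # top-left and bottom-right corners

    for _ in range(k):
        # Pad with edge values so that moving off-grid returns the state's own value.
        p = np.pad(v, 1, mode='edge')
        neighbours = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
        new_v = -1.0 + gamma * 0.25 * neighbours  # reward -1, each action with prob 1/4
        new_v[terminal] = 0.0                     # terminal values stay at 0
        v = new_v
    return v

print(np.round(evaluate_random_policy(2), 2))
```

With `k=2`, the states adjacent to a terminal corner come out at -1.75 and all other non-terminal states at -2, which is the second-sweep result the lab's agent should also produce.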

In [5]:
"""
Cell 5: Main Experiment Runner
Purpose: Execute policy evaluation for multiple iteration counts to show convergence
"""

def compute_state_value_and_policy(iterations=(), γ=1):
    """
    Run policy evaluation for different iteration counts
    Shows convergence of both values and policies

    Args:
        iterations: List of iteration counts to evaluate
        γ: Discount factor
    """
    # Create 4×4 gridworld with two corners as terminal states
    mdp = BaseGridworld(width=4, height=4,
                       terminal_states=[(0, 3), (3, 0)])

    pretty_print("Gridworld Configuration",
                f"Size: 4×4<br>" +
                f"Terminal states: (0,3) top-left, (3,0) bottom-right<br>" +
                f"Discount factor γ = {γ}",
                style='info')

    for n_iter in iterations:
        # Run policy evaluation for n_iter iterations
        agent = UniformPolicyAgent(mdp=mdp, γ=γ, n_iterations=n_iter)

        # Display iteration header
        pretty_print(f"Iteration k = {n_iter}",
                    "Policy evaluation results after specified iterations",
                    style='result')

        # Display state values
        print("\nState-Value Function V(s):")
        # Flip vertically so (0,0) appears at bottom-left
        value_grid = np.flipud(agent.values.T)
        print(np.round(value_grid, 2))

        # Create formatted table
        formatted_grid = tabulate(np.round(value_grid, 2),
                                 tablefmt='grid',
                                 floatfmt=".2f")
        print(formatted_grid)

        # Display greedy policy
        print("\nGreedy Policy π'(s):")
        policy_grid = [['' for x in range(mdp.width)]
                      for y in range(mdp.height)]

        # Convert policy to arrows
        for (x, y), actions in agent.policy.items():
            # Show all optimal actions (there may be ties)
            arrows = ' '.join([action_to_arrow(a) for a in actions])
            policy_grid[y][x] = arrows

        # Flip vertically for display
        policy_grid = policy_grid[::-1]

        # Create formatted policy table
        formatted_policy = tabulate(policy_grid, tablefmt='grid')
        print(formatted_policy)
        print()

In [6]:
"""
Cell 6: Execute Experiments and Display Results
Purpose: Run the complete experiment sequence to reproduce Figure 4.1
"""

# Define iteration sequence to show convergence progression
iteration_sequence = [0, 1, 2, 3, 10, 1000]

# Set discount factor
γ = 1.0  # No discounting for this example

pretty_print("Starting Figure 4.1 Reproduction",
             f"Running policy evaluation for k = {iteration_sequence}<br>" +
             f"Uniform random policy: π(a|s) = 0.25 for all actions<br>" +
             f"This reproduces Sutton & Barto Figure 4.1",
             style='note')

# Run the experiments
compute_state_value_and_policy(iterations=iteration_sequence, γ=γ)


Iteration k = 0

State-Value Function V(s):
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
+------+------+------+------+
| 0.00 | 0.00 | 0.00 | 0.00 |
+------+------+------+------+
| 0.00 | 0.00 | 0.00 | 0.00 |
+------+------+------+------+
| 0.00 | 0.00 | 0.00 | 0.00 |
+------+------+------+------+
| 0.00 | 0.00 | 0.00 | 0.00 |
+------+------+------+------+

Greedy Policy π'(s):
+---------+---------+---------+---------+
|         | ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → |         |
+---------+---------+---------+---------+




Iteration k = 1

State-Value Function V(s):
[[ 0. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1.  0.]]
+-------+-------+-------+-------+
|  0.00 | -1.00 | -1.00 | -1.00 |
+-------+-------+-------+-------+
| -1.00 | -1.00 | -1.00 | -1.00 |
+-------+-------+-------+-------+
| -1.00 | -1.00 | -1.00 | -1.00 |
+-------+-------+-------+-------+
| -1.00 | -1.00 | -1.00 |  0.00 |
+-------+-------+-------+-------+

Greedy Policy π'(s):
+---------+---------+---------+---------+
|         | ←       | ↑ ← ↓ → | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑       | ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑ ← ↓ → | ↑ ← ↓ → | ↑ ← ↓ → | ↓       |
+---------+---------+---------+---------+
| ↑ ← ↓ → | ↑ ← ↓ → | →       |         |
+---------+---------+---------+---------+




Iteration k = 2

State-Value Function V(s):
[[ 0.   -1.75 -2.   -2.  ]
 [-1.75 -2.   -2.   -2.  ]
 [-2.   -2.   -2.   -1.75]
 [-2.   -2.   -1.75  0.  ]]
+-------+-------+-------+-------+
|  0.00 | -1.75 | -2.00 | -2.00 |
+-------+-------+-------+-------+
| -1.75 | -2.00 | -2.00 | -2.00 |
+-------+-------+-------+-------+
| -2.00 | -2.00 | -2.00 | -1.75 |
+-------+-------+-------+-------+
| -2.00 | -2.00 | -1.75 |  0.00 |
+-------+-------+-------+-------+

Greedy Policy π'(s):
+---------+---------+---------+---------+
|         | ←       | ←       | ↑ ← ↓ → |
+---------+---------+---------+---------+
| ↑       | ↑ ←     | ↑ ← ↓ → | ↓       |
+---------+---------+---------+---------+
| ↑       | ↑ ← ↓ → | ↓ →     | ↓       |
+---------+---------+---------+---------+
| ↑ ← ↓ → | →       | →       |         |
+---------+---------+---------+---------+




Iteration k = 3

State-Value Function V(s):
[[ 0.   -2.44 -2.94 -3.  ]
 [-2.44 -2.88 -3.   -2.94]
 [-2.94 -3.   -2.88 -2.44]
 [-3.   -2.94 -2.44  0.  ]]
+-------+-------+-------+-------+
|  0.00 | -2.44 | -2.94 | -3.00 |
+-------+-------+-------+-------+
| -2.44 | -2.88 | -3.00 | -2.94 |
+-------+-------+-------+-------+
| -2.94 | -3.00 | -2.88 | -2.44 |
+-------+-------+-------+-------+
| -3.00 | -2.94 | -2.44 |  0.00 |
+-------+-------+-------+-------+

Greedy Policy π'(s):
+-----+-----+-----+-----+
|     | ←   | ←   | ← ↓ |
+-----+-----+-----+-----+
| ↑   | ↑ ← | ← ↓ | ↓   |
+-----+-----+-----+-----+
| ↑   | ↑ → | ↓ → | ↓   |
+-----+-----+-----+-----+
| ↑ → | →   | →   |     |
+-----+-----+-----+-----+




Iteration k = 10

State-Value Function V(s):
[[ 0.   -6.14 -8.35 -8.97]
 [-6.14 -7.74 -8.43 -8.35]
 [-8.35 -8.43 -7.74 -6.14]
 [-8.97 -8.35 -6.14  0.  ]]
+-------+-------+-------+-------+
|  0.00 | -6.14 | -8.35 | -8.97 |
+-------+-------+-------+-------+
| -6.14 | -7.74 | -8.43 | -8.35 |
+-------+-------+-------+-------+
| -8.35 | -8.43 | -7.74 | -6.14 |
+-------+-------+-------+-------+
| -8.97 | -8.35 | -6.14 |  0.00 |
+-------+-------+-------+-------+

Greedy Policy π'(s):
+-----+-----+-----+-----+
|     | ←   | ←   | ← ↓ |
+-----+-----+-----+-----+
| ↑   | ↑ ← | ← ↓ | ↓   |
+-----+-----+-----+-----+
| ↑   | ↑ → | ↓ → | ↓   |
+-----+-----+-----+-----+
| ↑ → | →   | →   |     |
+-----+-----+-----+-----+




Iteration k = 1000

State-Value Function V(s):
[[  0.   -13.99 -19.99 -21.98]
 [-13.99 -17.99 -19.99 -19.99]
 [-19.99 -19.99 -17.99 -13.99]
 [-21.98 -19.99 -13.99   0.  ]]
+--------+--------+--------+--------+
|   0.00 | -13.99 | -19.99 | -21.98 |
+--------+--------+--------+--------+
| -13.99 | -17.99 | -19.99 | -19.99 |
+--------+--------+--------+--------+
| -19.99 | -19.99 | -17.99 | -13.99 |
+--------+--------+--------+--------+
| -21.98 | -19.99 | -13.99 |   0.00 |
+--------+--------+--------+--------+

Greedy Policy π'(s):
+-----+-----+-----+-----+
|     | ←   | ←   | ← ↓ |
+-----+-----+-----+-----+
| ↑   | ↑ ← | ← ↓ | ↓   |
+-----+-----+-----+-----+
| ↑   | ↑ → | ↓ → | ↓   |
+-----+-----+-----+-----+
| ↑ → | →   | →   |     |
+-----+-----+-----+-----+



In [7]:
"""
Cell 7: Final Analysis and Conclusions
Purpose: Summarize findings and explain convergence properties
"""

analysis_text = """
<strong>Figure 4.1 Analysis - Convergence of Iterative Policy Evaluation</strong><br><br>
<strong>Key Observations:</strong><br>
• <strong>k=0:</strong> Initial values all zero, random policy<br>
• <strong>k=1:</strong> First Bellman update, values reflect immediate rewards<br>
• <strong>k=2:</strong> Values propagate from terminal states<br>
• <strong>k=3:</strong> Policy becomes optimal (though values not yet converged)<br>
• <strong>k=10:</strong> Values closer to convergence<br>
• <strong>k=1000:</strong> Converged to within the tolerance ε = 0.01; values are within roughly 0.02 of the exact -14, -18, -20, -22<br><br>

<strong>Important Insights:</strong><br>
• The greedy policy becomes optimal after just 3 iterations<br>
• Value function continues refining even after policy is optimal<br>
• Demonstrates separation of policy improvement and value convergence<br>
• Terminal states act as "sinks" pulling values toward them
"""

pretty_print("Convergence Analysis", analysis_text, style='result')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. Rapid Policy Convergence:</strong> The greedy policy with respect to V(s) becomes optimal after only 3 iterations, even though the value function hasn't fully converged.</p>
        <p><strong>2. Value Propagation:</strong> Values propagate backward from terminal states, with each iteration extending the "influence" of the zero-valued terminal states by one step.</p>
        <p><strong>3. Bellman Consistency:</strong> At the fixed point, values satisfy the Bellman equation exactly: V(s) = Σ π(a|s)[R(s,a,s') + γV(s')]. The iterative run stops once the total change in a sweep drops below ε, so its final values match the fixed point only approximately.</p>
        <p><strong>4. Policy Improvement Theorem:</strong> The greedy policy with respect to V^π is guaranteed to be at least as good as π, and strictly better if π is not optimal.</p>
    </div>
</div>
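
Finding 3 can be checked directly: for a fixed policy the Bellman equation is linear in V, so the fixed point can be obtained with one linear solve instead of iteration. The sketch below builds the transition matrix of the uniform random policy over the 14 non-terminal states and solves $(I - \gamma P)V = R$ with `np.linalg.solve`. The helper name `solve_exact` and the `(row, col)` indexing with row 0 at the top are assumptions made for this illustration, not part of the lab code.

```python
import numpy as np

def solve_exact(size=4, gamma=1.0):
    """Solve the Bellman expectation equations for the uniform random policy."""
    terminals = {(0, 0), (size - 1, size - 1)}  # (row, col), row 0 at the top
    states = [(r, c) for r in range(size) for c in range(size)
              if (r, c) not in terminals]
    index = {s: i for i, s in enumerate(states)}

    n = len(states)
    P = np.zeros((n, n))  # transitions among non-terminal states only
    for (r, c), i in index.items():
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nr = min(max(r + dr, 0), size - 1)  # bumping a wall keeps the agent in place
            nc = min(max(c + dc, 0), size - 1)
            if (nr, nc) in index:               # terminal successors contribute V = 0
                P[i, index[(nr, nc)]] += 0.25

    rewards = -np.ones(n)  # reward of -1 on every transition
    values = np.linalg.solve(np.eye(n) - gamma * P, rewards)

    grid = np.zeros((size, size))
    for s, i in index.items():
        grid[s] = values[i]
    return grid

print(np.round(solve_exact(), 2))
```

For γ = 1 the solve gives -14, -18, -20, and -22, the converged values reported in Figure 4.1 of Sutton & Barto. The iterative run in this lab approaches them (-13.99, -17.99, -19.99, -21.98) before it stops at the ε threshold.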

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why does the policy converge to optimal before the values fully converge?</li>
        <li>How would changing γ from 1.0 to 0.9 affect the convergence pattern?</li>
        <li>What would happen if we used a non-uniform initial policy instead?</li>
        <li>How does the grid structure (walls, terminal states) influence value propagation?</li>
        <li>Can you predict the exact optimal value function analytically for this gridworld?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 4-1: Policy Evaluation</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 4-2 - Policy Iteration</p>
</div>