# Part 3: Dynamic Programming

In this notebook, we'll learn **Dynamic Programming (DP)** methods for solving MDPs. These are **model-based** methods that require complete knowledge of the environment.

## What You'll Learn
- Policy Evaluation (computing $V^\pi$)
- Policy Improvement (making policy better)
- Policy Iteration (evaluation + improvement)
- Value Iteration (finding $V^*$ directly)
- Comparison of methods

## Prerequisites
- Understanding of MDPs and Bellman equations (Notebook 02)

Let's begin!

## Setup

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
import time

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
np.random.seed(42)

print("Setup complete!")

In [None]:
# Create FrozenLake environment and extract MDP components
env = gym.make("FrozenLake-v1", is_slippery=True)

n_states = env.observation_space.n
n_actions = env.action_space.n
action_names = ['LEFT', 'DOWN', 'RIGHT', 'UP']
action_arrows = ['←', '↓', '→', '↑']

# Extract transition and reward matrices
def extract_mdp(env):
    """Extract P[s,a,s'] and R[s,a] from environment."""
    n_s = env.observation_space.n
    n_a = env.action_space.n
    
    P = np.zeros((n_s, n_a, n_s))
    R = np.zeros((n_s, n_a))
    
    for s in range(n_s):
        for a in range(n_a):
            for prob, next_s, reward, done in env.unwrapped.P[s][a]:
                P[s, a, next_s] += prob
                R[s, a] += prob * reward
    
    return P, R

P, R = extract_mdp(env)

print("FrozenLake MDP Loaded")
print("=" * 40)
print(f"States: {n_states}")
print(f"Actions: {n_actions}")
print(f"Transition matrix shape: {P.shape}")
print(f"Reward matrix shape: {R.shape}")

In [None]:
# Visualization helper functions
def plot_value_function(V, title="Value Function", ax=None):
    """Plot value function as a heatmap."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    V_grid = V.reshape(nrow, ncol)
    
    im = ax.imshow(V_grid, cmap='RdYlGn', vmin=0, vmax=max(V.max(), 0.01))
    plt.colorbar(im, ax=ax, shrink=0.8)
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            color = 'white' if V_grid[i, j] < V.max() / 2 else 'black'
            ax.text(j, i, f'{cell}\n{V[state]:.3f}', ha='center', va='center',
                   fontsize=9, color=color)
    
    ax.set_xticks(range(ncol))
    ax.set_yticks(range(nrow))
    ax.set_title(title)
    return ax

def plot_policy(policy, title="Policy", ax=None):
    """Plot policy showing best action in each state."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))
    
    desc = env.unwrapped.desc.astype(str)
    nrow, ncol = desc.shape
    colors = {'S': 'lightblue', 'F': 'white', 'H': 'lightcoral', 'G': 'lightgreen'}
    
    for i in range(nrow):
        for j in range(ncol):
            state = i * ncol + j
            cell = desc[i, j]
            
            rect = plt.Rectangle((j, nrow-1-i), 1, 1, fill=True,
                                 facecolor=colors.get(cell, 'white'), edgecolor='black')
            ax.add_patch(rect)
            
            # Get best action (handle both deterministic and stochastic policies)
            if len(policy.shape) == 1:
                best_action = int(policy[state])
            else:
                best_action = np.argmax(policy[state])
            
            if cell not in ['H', 'G']:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, 
                       f'{cell}\n{action_arrows[best_action]}',
                       ha='center', va='center', fontsize=14, fontweight='bold')
            else:
                ax.text(j + 0.5, nrow - 1 - i + 0.5, cell,
                       ha='center', va='center', fontsize=14, fontweight='bold')
    
    ax.set_xlim(0, ncol)
    ax.set_ylim(0, nrow)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title)
    return ax

print("Visualization functions ready!")

---
# 1. What is Dynamic Programming?

**Dynamic Programming** (DP) is a method for solving complex problems by:
1. Breaking them into simpler subproblems
2. Solving the subproblems
3. Combining solutions to solve the original problem

## Requirements for DP

DP can be applied when the problem has:
1. **Optimal substructure**: An optimal solution can be built from optimal solutions to its subproblems
2. **Overlapping subproblems**: The same subproblems recur many times, so their solutions can be cached and reused

MDPs satisfy both properties:
- The Bellman equation gives a recursive decomposition (optimal substructure)
- The value function stores and reuses subproblem solutions (overlapping subproblems)
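
To make the two properties concrete outside of RL, here is a minimal sketch using the Fibonacci numbers (an illustrative choice, not part of this notebook's MDP code). The recursion plays the role of the substructure, and `functools.lru_cache` caches each subproblem so it is solved only once.

```python
from functools import lru_cache

calls = 0  # counts how many subproblems are actually computed

@lru_cache(maxsize=None)
def fib(n):
    """n-th Fibonacci number, with cached subproblem solutions."""
    global calls
    calls += 1
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))   # 832040
print(calls)     # 31 subproblems computed, one per distinct n
```

Without the cache the same call tree would recompute small subproblems over and over. The value function plays the same caching role for MDPs.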

## DP in Reinforcement Learning

DP assumes **full knowledge of the MDP** (model-based):
- We know the transition probabilities $P_{ss'}^a$
- We know the reward function $R_s^a$

This is used for **planning** in a known environment, not learning from experience.
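
As a sketch of what "knowing the model" means in code, the block below builds $P$ and $R$ from a hand-written two-state model. The `model` dictionary is hypothetical. It only mimics the `(prob, next_state, reward, done)` layout that the Setup cell reads from `env.unwrapped.P`.

```python
# Hypothetical two-state, two-action model in the layout
# model[s][a] = [(prob, next_state, reward, done), ...]
model = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

n_states, n_actions = 2, 2
P = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
R = [[0.0] * n_actions for _ in range(n_states)]

for s in range(n_states):
    for a in range(n_actions):
        for prob, next_s, reward, done in model[s][a]:
            P[s][a][next_s] += prob      # P[s][a][s'] = Pr(s' | s, a)
            R[s][a] += prob * reward     # R[s][a] = expected immediate reward

# Every (s, a) row of P must be a probability distribution over s'
for s in range(n_states):
    for a in range(n_actions):
        assert abs(sum(P[s][a]) - 1.0) < 1e-12

print(P[0][1])  # [0.2, 0.8]
print(R[0][1])  # 0.8
```

Checking that each row of `P` sums to 1 is a cheap sanity test worth running on any extracted model before planning with it.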

---
# 2. Policy Evaluation (Prediction)

**Problem**: Given a policy $\pi$, compute the state-value function $V^\pi$.

## Approach: Iterative Policy Evaluation

Use the Bellman expectation equation as an update rule:

$$V_{k+1}(s) = \sum_a \pi(a|s) \left[ R_s^a + \gamma \sum_{s'} P_{ss'}^a V_k(s') \right]$$

Start with arbitrary $V_0$ and iterate until convergence.

**Convergence**: $V_k \to V^\pi$ as $k \to \infty$ for $\gamma < 1$, because each update is a $\gamma$-contraction in the max norm.
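
A minimal sketch of this convergence, assuming the simplest possible case: a single state that loops to itself and pays a reward of 1 on every step (a hypothetical MDP, not FrozenLake). The fixed point is $1/(1-\gamma)$, and the error should shrink by a factor of $\gamma$ per sweep.

```python
gamma = 0.9
reward = 1.0                      # reward received on every step
v_exact = reward / (1 - gamma)    # closed-form fixed point of the update

v = 0.0                           # V_0
errors = []
for k in range(200):
    v = reward + gamma * v        # Bellman expectation backup for the single state
    errors.append(abs(v - v_exact))

# Ratio of consecutive errors: how much one sweep shrinks the error
ratio = errors[11] / errors[10]
print(round(v, 6))        # 10.0
print(round(ratio, 6))    # 0.9
```

The same geometric decay explains why evaluation needs many more sweeps as $\gamma$ approaches 1.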

In [None]:
def policy_evaluation(P, R, policy, gamma, theta=1e-8, max_iterations=1000):
    """
    Iterative Policy Evaluation.
    
    Args:
        P: Transition matrix P[s,a,s']
        R: Reward matrix R[s,a]
        policy: Policy matrix π[s,a] (probabilities)
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum iterations
    
    Returns:
        V: State value function
        history: List of V at each iteration (for visualization)
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    
    # Initialize V arbitrarily (zeros is fine)
    V = np.zeros(n_states)
    history = [V.copy()]
    
    for iteration in range(max_iterations):
        V_new = np.zeros(n_states)
        
        for s in range(n_states):
            # V(s) = Σ_a π(a|s) * [R(s,a) + γ * Σ_s' P(s'|s,a) * V(s')]
            for a in range(n_actions):
                # Expected value of next state
                expected_next_V = np.sum(P[s, a] * V)
                # Add contribution from this action
                V_new[s] += policy[s, a] * (R[s, a] + gamma * expected_next_V)
        
        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new
        history.append(V.copy())
        
        if delta < theta:
            print(f"Policy Evaluation converged in {iteration + 1} iterations (delta={delta:.2e})")
            break
    
    return V, history

In [None]:
# Evaluate a uniform random policy
uniform_policy = np.ones((n_states, n_actions)) / n_actions

print("Evaluating Uniform Random Policy")
print("=" * 50)

V_random, history_random = policy_evaluation(P, R, uniform_policy, gamma=0.99)

print("\nValue function for random policy:")
print(V_random.reshape(4, 4).round(4))

In [None]:
# Visualize the convergence of policy evaluation
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

iterations_to_show = [0, 1, 2, 5, 10, 20, 50, len(history_random)-1]
iterations_to_show = [min(i, len(history_random)-1) for i in iterations_to_show]

for idx, (ax, it) in enumerate(zip(axes.flat, iterations_to_show)):
    V_it = history_random[it]
    plot_value_function(V_it, title=f"Iteration {it}", ax=ax)

plt.suptitle("Policy Evaluation Convergence (Random Policy, γ=0.99)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Plot convergence curve
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Max change per iteration
deltas = [np.max(np.abs(history_random[i+1] - history_random[i])) 
          for i in range(len(history_random)-1)]
axes[0].semilogy(deltas)
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Max |V_new - V_old| (log scale)')
axes[0].set_title('Convergence of Policy Evaluation')
axes[0].axhline(y=1e-8, color='r', linestyle='--', label='Threshold (1e-8)')
axes[0].legend()

# Plot 2: Value of specific states over iterations
states_to_track = [0, 1, 6, 14]  # Start, near start, middle, near goal
for s in states_to_track:
    values = [history_random[i][s] for i in range(len(history_random))]
    axes[1].plot(values, label=f'State {s}')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('V(s)')
axes[1].set_title('Value Evolution for Selected States')
axes[1].legend()

plt.tight_layout()
plt.show()

---
# 3. Policy Improvement

**Problem**: Given a value function $V^\pi$, find a better policy $\pi'$.

## Approach: Greedy Policy

For each state, choose the action that maximizes expected return:

$$\pi'(s) = \arg\max_a \left[ R_s^a + \gamma \sum_{s'} P_{ss'}^a V^\pi(s') \right]$$

## Policy Improvement Theorem

If $\pi'$ is the greedy policy with respect to $V^\pi$, then:

$$V^{\pi'}(s) \geq V^\pi(s) \text{ for all } s$$

The new policy is at least as good as the old one!
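
The theorem can be checked numerically on a small example. The sketch below assumes a hypothetical deterministic two-state MDP (`next_state` and `reward` are made up for illustration). It evaluates a uniform policy, takes the greedy policy, and compares the two value functions.

```python
gamma = 0.9
# Hypothetical deterministic MDP: next_state[s][a] and reward[s][a]
# Actions: 0 = stay, 1 = move to the other state
next_state = [[0, 1], [1, 0]]
reward = [[0.0, 0.0], [1.0, 0.0]]   # only "stay in state 1" pays off

def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation for a stochastic policy[s][a]."""
    V = [0.0, 0.0]
    for _ in range(sweeps):
        V = [sum(policy[s][a] * (reward[s][a] + gamma * V[next_state[s][a]])
                 for a in range(2))
             for s in range(2)]
    return V

def greedy(V):
    """One-hot greedy policy with respect to V."""
    policy = []
    for s in range(2):
        q = [reward[s][a] + gamma * V[next_state[s][a]] for a in range(2)]
        best = q.index(max(q))
        policy.append([1.0 if a == best else 0.0 for a in range(2)])
    return policy

uniform = [[0.5, 0.5], [0.5, 0.5]]
V_old = evaluate(uniform)
pi_new = greedy(V_old)
V_new = evaluate(pi_new)

print([round(v, 4) for v in V_old])  # [2.25, 2.75]
print(pi_new)                        # [[0.0, 1.0], [1.0, 0.0]]
print([round(v, 4) for v in V_new])  # [9.0, 10.0]
```

In this toy case a single greedy step already reaches the best policy. In general it only guarantees that no state gets worse.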

In [None]:
def policy_improvement(P, R, V, gamma):
    """
    Compute greedy policy with respect to value function V.
    
    Args:
        P: Transition matrix
        R: Reward matrix
        V: Current value function
        gamma: Discount factor
    
    Returns:
        policy: New deterministic policy (one-hot encoded)
        Q: Action-value function
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    
    # Compute Q-values
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * np.sum(P[s, a] * V)
    
    # Create greedy policy
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        best_action = np.argmax(Q[s])
        policy[s, best_action] = 1.0
    
    return policy, Q

In [None]:
# Improve the random policy
print("Policy Improvement")
print("=" * 50)

improved_policy, Q = policy_improvement(P, R, V_random, gamma=0.99)

# Show the improved policy
print("\nImproved Policy (best action for each state):")
best_actions = np.argmax(improved_policy, axis=1).reshape(4, 4)
for i in range(4):
    row = [action_arrows[a] for a in best_actions[i]]
    print(f"  {row}")

In [None]:
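# Compare random and improved policies
# Note: plot_policy draws the argmax action, so the uniform policy appears as all '←' (ties resolve to action 0)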
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plot_policy(uniform_policy, title="Original: Uniform Random Policy", ax=axes[0])
plot_policy(improved_policy, title="Improved: Greedy w.r.t. V^π_random", ax=axes[1])

plt.tight_layout()
plt.show()

In [None]:
# Evaluate the improved policy to verify it's better
V_improved, _ = policy_evaluation(P, R, improved_policy, gamma=0.99)

print("\nComparison of Value Functions")
print("=" * 50)
print(f"{'State':<8} {'V_random':>12} {'V_improved':>12} {'Improvement':>12}")
print("-" * 50)
for s in range(n_states):
    diff = V_improved[s] - V_random[s]
    print(f"{s:<8} {V_random[s]:>12.4f} {V_improved[s]:>12.4f} {diff:>12.4f}")

print(f"\nTotal improvement: {np.sum(V_improved - V_random):.4f}")
print(f"All states improved or same: {np.all(V_improved >= V_random - 1e-10)}")

---
# 4. Policy Iteration

**Policy Iteration** alternates between:
1. **Policy Evaluation**: Compute $V^\pi$ for current policy
2. **Policy Improvement**: Make policy greedy with respect to $V^\pi$

Repeat until the policy no longer changes (converges to optimal policy $\pi^*$).

## Algorithm

```
1. Initialize policy π arbitrarily
2. Repeat:
   a. Policy Evaluation: Compute V^π
   b. Policy Improvement: π' = greedy(V^π)
   c. If π' = π, stop (converged)
   d. π = π'
3. Return π* and V*
```

## Convergence

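Policy iteration is guaranteed to converge to an optimal policy in a finite number of iterations for a finite MDP: there are only finitely many deterministic policies, and each improvement step yields a strictly better policy unless the current one is already optimal.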
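
For a rough sense of scale: a deterministic policy picks one of $|A|$ actions in each of $|S|$ states, so there are $|A|^{|S|}$ of them. The short calculation below uses the 4x4 FrozenLake sizes from this notebook (16 states, 4 actions).

```python
n_states, n_actions = 16, 4          # 4x4 FrozenLake
n_deterministic_policies = n_actions ** n_states
print(n_deterministic_policies)      # 4294967296
```

The bound is finite but large. Policy iteration typically stops after far fewer improvement steps than this count suggests.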

In [None]:
def policy_iteration(P, R, gamma, theta=1e-8, max_iterations=100):
    """
    Policy Iteration algorithm.
    
    Args:
        P: Transition matrix
        R: Reward matrix
        gamma: Discount factor
        theta: Convergence threshold for policy evaluation
        max_iterations: Maximum policy improvement iterations
    
    Returns:
        policy: Optimal policy
        V: Optimal value function
        policy_history: List of policies at each iteration
        V_history: List of value functions at each iteration
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    
    # Initialize with random policy
    policy = np.ones((n_states, n_actions)) / n_actions
    
    policy_history = [policy.copy()]
    V_history = []
    
    for iteration in range(max_iterations):
        # Policy Evaluation
        V, _ = policy_evaluation(P, R, policy, gamma, theta=theta)
        V_history.append(V.copy())
        
        # Policy Improvement
        new_policy, Q = policy_improvement(P, R, V, gamma)
        
        # Check if policy changed
        if np.array_equal(new_policy, policy):
            print(f"Policy Iteration converged in {iteration + 1} iterations")
            break
        
        policy = new_policy
        policy_history.append(policy.copy())
    
    return policy, V, policy_history, V_history

In [None]:
# Run Policy Iteration
print("Running Policy Iteration")
print("=" * 50)

start_time = time.time()
optimal_policy_pi, V_star_pi, policy_history, V_history = policy_iteration(P, R, gamma=0.99)
pi_time = time.time() - start_time

print(f"\nTime taken: {pi_time:.4f} seconds")
print(f"\nOptimal Value Function V*:")
print(V_star_pi.reshape(4, 4).round(4))

In [None]:
# Visualize policy evolution
n_policies = len(policy_history)
fig, axes = plt.subplots(1, min(n_policies, 5), figsize=(4*min(n_policies, 5), 4))

if n_policies == 1:
    axes = [axes]

for i, ax in enumerate(axes):
    if i < n_policies:
        plot_policy(policy_history[i], title=f"Iteration {i}", ax=ax)

plt.suptitle("Policy Evolution in Policy Iteration", fontsize=14, y=1.05)
plt.tight_layout()
plt.show()

In [None]:
# Visualize final optimal policy and value function
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plot_value_function(V_star_pi, title="Optimal Value Function V*", ax=axes[0])
plot_policy(optimal_policy_pi, title="Optimal Policy π*", ax=axes[1])

plt.suptitle("Policy Iteration Result (γ=0.99)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

---
# 5. Value Iteration

**Value Iteration** combines policy evaluation and improvement into a single update:

$$V_{k+1}(s) = \max_a \left[ R_s^a + \gamma \sum_{s'} P_{ss'}^a V_k(s') \right]$$

This is applying the **Bellman Optimality Equation** as an update rule.

## Key Insight

- Policy Iteration does full policy evaluation (many iterations) before each improvement
- Value Iteration does only **one sweep** of evaluation, then immediately improves
- Value Iteration is like "truncated" policy iteration with k=1, where k is the number of evaluation sweeps per improvement step

## Algorithm

```
1. Initialize V arbitrarily
2. Repeat:
   For each state s:
     V(s) = max_a [R(s,a) + γ * Σ P(s'|s,a) * V(s')]
   Until V converges
3. Extract policy: π(s) = argmax_a [R(s,a) + γ * Σ P(s'|s,a) * V(s')]
```
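
The per-state loop in the pseudocode can also be written as one array operation. The sketch below is a vectorized variant, assuming NumPy arrays `P[s, a, s']` and `R[s, a]` laid out as in this notebook. The two-state MDP itself is hypothetical and only there to keep the block self-contained.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP (actions: 0 = stay, 1 = move)
P = np.zeros((2, 2, 2))            # P[s, a, s']
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])         # R[s, a]

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)        # shape (n_states, n_actions)
    V_new = Q.max(axis=1)          # Bellman optimality backup for all states
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy action per state
print(V.round(4))                  # [ 9. 10.]
print(policy)                      # [1 0]
```

Here `P @ V` contracts the last axis of `P` with `V`, which gives the expected next-state value for every state-action pair at once.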

In [None]:
def value_iteration(P, R, gamma, theta=1e-8, max_iterations=1000):
    """
    Value Iteration algorithm.
    
    Args:
        P: Transition matrix
        R: Reward matrix
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum iterations
    
    Returns:
        V: Optimal value function
        policy: Optimal policy
        history: Value function at each iteration
    """
    n_states = P.shape[0]
    n_actions = P.shape[1]
    
    # Initialize V arbitrarily
    V = np.zeros(n_states)
    history = [V.copy()]
    
    for iteration in range(max_iterations):
        V_new = np.zeros(n_states)
        
        for s in range(n_states):
            # Compute Q-values for all actions
            Q_s = np.zeros(n_actions)
            for a in range(n_actions):
                Q_s[a] = R[s, a] + gamma * np.sum(P[s, a] * V)
            
            # Take maximum
            V_new[s] = np.max(Q_s)
        
        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new
        history.append(V.copy())
        
        if delta < theta:
            print(f"Value Iteration converged in {iteration + 1} iterations (delta={delta:.2e})")
            break
    
    # Extract optimal policy
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        Q_s = np.zeros(n_actions)
        for a in range(n_actions):
            Q_s[a] = R[s, a] + gamma * np.sum(P[s, a] * V)
        policy[s, np.argmax(Q_s)] = 1.0
    
    return V, policy, history

In [None]:
# Run Value Iteration
print("Running Value Iteration")
print("=" * 50)

start_time = time.time()
V_star_vi, optimal_policy_vi, vi_history = value_iteration(P, R, gamma=0.99)
vi_time = time.time() - start_time

print(f"\nTime taken: {vi_time:.4f} seconds")
print(f"\nOptimal Value Function V*:")
print(V_star_vi.reshape(4, 4).round(4))

In [None]:
# Visualize Value Iteration convergence
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

iterations_to_show = [0, 1, 5, 10, 20, 50, 100, len(vi_history)-1]
iterations_to_show = [min(i, len(vi_history)-1) for i in iterations_to_show]

for idx, (ax, it) in enumerate(zip(axes.flat, iterations_to_show)):
    V_it = vi_history[it]
    plot_value_function(V_it, title=f"Iteration {it}", ax=ax)

plt.suptitle("Value Iteration Convergence (γ=0.99)", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Plot convergence comparison
fig, ax = plt.subplots(figsize=(10, 5))

# Value Iteration convergence
vi_deltas = [np.max(np.abs(vi_history[i+1] - vi_history[i])) 
             for i in range(len(vi_history)-1)]
ax.semilogy(vi_deltas, label='Value Iteration', linewidth=2)

ax.axhline(y=1e-8, color='r', linestyle='--', label='Threshold')
ax.set_xlabel('Iteration')
ax.set_ylabel('Max |V_new - V_old| (log scale)')
ax.set_title('Value Iteration Convergence')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
# 6. Comparison: Policy Iteration vs Value Iteration

Let's compare the two methods in detail.

In [None]:
# Compare final results
print("Comparison of Policy Iteration vs Value Iteration")
print("=" * 60)

# Check if they found the same solution
V_diff = np.max(np.abs(V_star_pi - V_star_vi))
policy_same = np.array_equal(np.argmax(optimal_policy_pi, axis=1), 
                              np.argmax(optimal_policy_vi, axis=1))

print(f"\nMax difference in V*: {V_diff:.2e}")
print(f"Same optimal policy: {policy_same}")
print(f"\nPolicy Iteration time: {pi_time:.4f}s")
print(f"Value Iteration time: {vi_time:.4f}s")

In [None]:
# Visualize both results side by side
fig, axes = plt.subplots(2, 2, figsize=(12, 12))

plot_value_function(V_star_pi, title="V* (Policy Iteration)", ax=axes[0, 0])
plot_value_function(V_star_vi, title="V* (Value Iteration)", ax=axes[0, 1])
plot_policy(optimal_policy_pi, title="π* (Policy Iteration)", ax=axes[1, 0])
plot_policy(optimal_policy_vi, title="π* (Value Iteration)", ax=axes[1, 1])

plt.tight_layout()
plt.show()

In [None]:
# Benchmark with different discount factors
gammas = [0.5, 0.9, 0.95, 0.99, 0.999]
results = []

print("Benchmarking with Different Discount Factors")
print("=" * 70)
print(f"{'γ':>8} {'PI iters':>12} {'PI time':>12} {'VI iters':>12} {'VI time':>12}")
print("-" * 70)

for gamma in gammas:
    # Policy Iteration
    start = time.time()
    _, V_pi, policy_hist, _ = policy_iteration(P, R, gamma, theta=1e-8)
    pi_time_g = time.time() - start
    pi_iters = len(policy_hist)
    
    # Value Iteration
    start = time.time()
    V_vi, _, vi_hist = value_iteration(P, R, gamma, theta=1e-8)
    vi_time_g = time.time() - start
    vi_iters = len(vi_hist) - 1
    
    print(f"{gamma:>8.3f} {pi_iters:>12} {pi_time_g:>11.4f}s {vi_iters:>12} {vi_time_g:>11.4f}s")
    results.append({'gamma': gamma, 'pi_iters': pi_iters, 'pi_time': pi_time_g,
                    'vi_iters': vi_iters, 'vi_time': vi_time_g})

In [None]:
# Visualize benchmark results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot iterations
gammas_plot = [r['gamma'] for r in results]
pi_iters_plot = [r['pi_iters'] for r in results]
vi_iters_plot = [r['vi_iters'] for r in results]

x = np.arange(len(gammas_plot))
width = 0.35

axes[0].bar(x - width/2, pi_iters_plot, width, label='Policy Iteration', color='steelblue')
axes[0].bar(x + width/2, vi_iters_plot, width, label='Value Iteration', color='orange')
axes[0].set_xlabel('Discount Factor γ')
axes[0].set_ylabel('Iterations')
axes[0].set_title('Iterations to Convergence')
axes[0].set_xticks(x)
axes[0].set_xticklabels([f'{g}' for g in gammas_plot])
axes[0].legend()

# Plot time
pi_times_plot = [r['pi_time'] for r in results]
vi_times_plot = [r['vi_time'] for r in results]

axes[1].bar(x - width/2, pi_times_plot, width, label='Policy Iteration', color='steelblue')
axes[1].bar(x + width/2, vi_times_plot, width, label='Value Iteration', color='orange')
axes[1].set_xlabel('Discount Factor γ')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Computation Time')
axes[1].set_xticks(x)
axes[1].set_xticklabels([f'{g}' for g in gammas_plot])
axes[1].legend()

plt.tight_layout()
plt.show()

---
# 7. Evaluating the Optimal Policy

Let's test the optimal policy we computed on the actual environment.

In [None]:
def evaluate_policy_empirically(env, policy, n_episodes=10000):
    """Evaluate a policy by running episodes."""
    rewards = []
    steps_list = []
    
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total_reward = 0
        steps = 0
        done = False
        
        while not done:
            # Get action from policy
            if len(policy.shape) == 1:
                action = int(policy[obs])
            else:
                action = np.random.choice(n_actions, p=policy[obs])
            
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            steps += 1
            done = terminated or truncated
        
        rewards.append(total_reward)
        steps_list.append(steps)
    
    return np.array(rewards), np.array(steps_list)

# Evaluate different policies
env_eval = gym.make("FrozenLake-v1", is_slippery=True)
n_eval_episodes = 10000

print(f"Evaluating Policies ({n_eval_episodes} episodes each)")
print("=" * 60)

# Random policy
uniform_policy = np.ones((n_states, n_actions)) / n_actions
rewards_random, steps_random = evaluate_policy_empirically(env_eval, uniform_policy, n_eval_episodes)
print(f"Random Policy: Success rate = {np.mean(rewards_random)*100:.2f}%, Avg steps = {np.mean(steps_random):.1f}")

# Optimal policy from Policy Iteration
rewards_pi, steps_pi = evaluate_policy_empirically(env_eval, optimal_policy_pi, n_eval_episodes)
print(f"Policy Iteration: Success rate = {np.mean(rewards_pi)*100:.2f}%, Avg steps = {np.mean(steps_pi):.1f}")

# Optimal policy from Value Iteration
rewards_vi, steps_vi = evaluate_policy_empirically(env_eval, optimal_policy_vi, n_eval_episodes)
print(f"Value Iteration: Success rate = {np.mean(rewards_vi)*100:.2f}%, Avg steps = {np.mean(steps_vi):.1f}")

env_eval.close()

In [None]:
# Visualize evaluation results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Success rates
policies_names = ['Random', 'Policy Iteration', 'Value Iteration']
success_rates = [np.mean(rewards_random)*100, np.mean(rewards_pi)*100, np.mean(rewards_vi)*100]
colors = ['gray', 'steelblue', 'orange']

bars = axes[0].bar(policies_names, success_rates, color=colors, edgecolor='black')
axes[0].set_ylabel('Success Rate (%)')
axes[0].set_title('Policy Comparison: Success Rate')
axes[0].set_ylim(0, 100)

for bar, rate in zip(bars, success_rates):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                f'{rate:.1f}%', ha='center', fontsize=11, fontweight='bold')

# Average steps over all episodes (successful or not)
avg_steps = [np.mean(steps_random), np.mean(steps_pi), np.mean(steps_vi)]

bars = axes[1].bar(policies_names, avg_steps, color=colors, edgecolor='black')
axes[1].set_ylabel('Average Steps')
axes[1].set_title('Policy Comparison: Average Episode Length')

for bar, steps in zip(bars, avg_steps):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{steps:.1f}', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

---
# 8. Effect of Discount Factor on Optimal Policy

Let's see how different discount factors affect the computed optimal policy.
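
Before looking at the plots, it may help to see how strongly each $\gamma$ shrinks a delayed reward. The sketch below assumes a reward of 1 received after `k` steps, which is worth $\gamma^k$ today. `discount_weight` is a helper introduced here for illustration.

```python
def discount_weight(gamma, k):
    """Present value of a reward of 1 received after k steps."""
    return gamma ** k

for gamma in [0.1, 0.5, 0.9, 0.99]:
    weights = [discount_weight(gamma, k) for k in (1, 5, 10)]
    print(f"gamma={gamma}: " + ", ".join(f"{w:.4f}" for w in weights))
```

With a small $\gamma$, a reward ten steps away is worth almost nothing, so values far from the goal are close to zero.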

In [None]:
# Compute optimal policies for different discount factors
gammas_to_compare = [0.1, 0.5, 0.9, 0.99]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, gamma in enumerate(gammas_to_compare):
    V, policy, _ = value_iteration(P, R, gamma)
    
    plot_value_function(V, title=f"V* (γ={gamma})", ax=axes[0, idx])
    plot_policy(policy, title=f"π* (γ={gamma})", ax=axes[1, idx])

plt.suptitle("Effect of Discount Factor on Optimal Policy", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Lower γ: Agent is 'shortsighted'; values fall off quickly with distance from the goal")
print("- Higher γ: Agent is 'farsighted', considers long-term consequences")
print("- The optimal policy may change based on how much we value future rewards")

---
# Summary

## Dynamic Programming Methods

| Method | What it does | Key equation | Requirements |
|--------|-------------|--------------|---------------|
| **Policy Evaluation** | Computes $V^\pi$ | Bellman Expectation | Policy $\pi$, full MDP |
| **Policy Improvement** | Gets better policy | Greedy w.r.t. V | Value function V |
| **Policy Iteration** | Finds $\pi^*$, $V^*$ | Eval + Improve | Full MDP |
| **Value Iteration** | Finds $V^*$, $\pi^*$ | Bellman Optimality | Full MDP |

## Key Takeaways

1. **DP requires full model knowledge** - we need P and R
2. **Policy Iteration**: Alternates evaluation and improvement, fewer iterations but each is expensive
3. **Value Iteration**: One-step evaluation + improvement, more iterations but each is cheap
4. **Both converge** to the optimal policy and value function
5. **Discount factor** affects what the optimal policy looks like

## Limitations of DP

- Requires **complete model** of the environment (P, R)
- Computationally expensive for large state spaces (curse of dimensionality)
- Cannot be used when the environment is unknown
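
To put a number on the cost for large state spaces, the sketch below estimates the memory needed for the dense `P[s, a, s']` and `R[s, a]` arrays used in this notebook, assuming 8-byte floats. `dense_model_bytes` is a hypothetical helper, and the 10,000-state case is an arbitrary illustration.

```python
def dense_model_bytes(n_states, n_actions, bytes_per_float=8):
    """Memory for a dense float64 P[s, a, s'] array plus R[s, a]."""
    p_entries = n_states * n_actions * n_states
    r_entries = n_states * n_actions
    return (p_entries + r_entries) * bytes_per_float

print(dense_model_bytes(16, 4))        # 8704 bytes for 4x4 FrozenLake
print(dense_model_bytes(10_000, 4))    # 3200320000 bytes, about 3.2 GB
```

The number of entries in `P` grows with the square of the number of states, and every full sweep touches each entry once, so compute cost grows the same way.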

## Next Steps

In the next notebook (**04_monte_carlo.ipynb**), we'll learn **model-free** methods that can learn from experience without knowing the transition probabilities!

In [None]:
print("Congratulations! You've completed Part 3 of the RL Tutorial!")
print("\nKey takeaways:")
print("- Policy Evaluation: Iteratively compute V^π using Bellman expectation")
print("- Policy Improvement: Make policy greedy w.r.t. current V")
print("- Policy Iteration: Alternate eval and improvement until convergence")
print("- Value Iteration: Apply Bellman optimality directly")
print("- Both methods find the optimal policy when given the full MDP")
print("\nNext: 04_monte_carlo.ipynb")