# Reinforcement Learning: Zero to Hero

## A Comprehensive Journey from Foundational Concepts to Advanced Applications

Welcome to this comprehensive Jupyter notebook on Reinforcement Learning! This notebook will guide you through a complete learning journey, starting from the basics and progressing to advanced research topics and real-world applications.

### What You'll Learn

- **Foundational Concepts**: Understanding the core principles of reinforcement learning, including MDPs, value functions, and policies
- **Core Algorithms**: Implementing and understanding key algorithms like Q-Learning, SARSA, DQN, and policy gradient methods
- **Advanced Topics**: Exploring cutting-edge techniques in reward engineering, scaling, and specialized RL methods
- **Real-World Applications**: Seeing how RL is applied in robotics, game playing, finance, healthcare, and more
- **Research & Deployment**: Understanding current research trends and how to deploy RL systems in production

### How to Use This Notebook

1. Execute cells sequentially from top to bottom
2. Read the explanations carefully before running code
3. Experiment with the code examples
4. Modify parameters to see how they affect results
5. Complete the exercises to reinforce your understanding

Let's begin your journey into the exciting world of Reinforcement Learning!

## Table of Contents

1. [Setup & Dependencies](#setup)
2. [Section 1: Foundational Concepts](#section1)
   - [Introduction to Reinforcement Learning](#intro-rl)
   - [Multi-Armed Bandit Problem](#bandits)
   - [Core Terminology and MDP Framework](#mdp)
   - [Policies and Value Functions](#policies)
   - [Dynamic Programming](#dynamic-programming)
   - [Learning Paradigms](#learning-paradigms)
3. [Section 2: Core Algorithms](#section2)
   - [Monte Carlo Methods](#monte-carlo)
   - [Temporal Difference Learning](#td-learning)
   - [Q-Learning](#q-learning)
   - [Deep Q-Networks (DQN)](#dqn)
   - [Policy Optimization Methods](#policy-optimization)
4. [Section 3: Advanced Topics](#section3)
   - [Reward Engineering](#reward-engineering)
   - [Scaling and Generalization](#scaling)
   - [Advanced Policy Methods](#advanced-policy)
   - [Specialized RL Techniques](#specialized)
5. [Section 4: Code Implementations](#section4)
   - [Bandit Algorithms](#bandit-implementations)
   - [MDP and Dynamic Programming](#mdp-implementations)
   - [Monte Carlo Methods](#mc-implementations)
   - [Temporal Difference Methods](#td-implementations)
   - [Deep RL Implementations](#deep-rl-implementations)
6. [Section 5: Real-World Applications](#section5)
   - [Traffic Signal Control](#traffic)
   - [Robotics](#robotics)
   - [Autonomous Trading](#trading)
   - [Recommendation Systems](#recommendations)
   - [Healthcare](#healthcare)
   - [Hyperparameter Tuning](#hyperparameter-tuning)
   - [Game Playing](#game-playing)
   - [Energy Management](#energy)
   - [Chess Environment](#chess)
7. [Section 6: Advanced Research & Deployment](#section6)
   - [Current Research Trends](#research-trends)
   - [Ethical and Safety Considerations](#ethics)
   - [Deployment Challenges](#deployment)
   - [End-to-End Pipeline](#pipeline)
   - [Recent Research](#recent-research)
8. [Conclusion and Next Steps](#conclusion)

<a id='setup'></a>
## Setup & Dependencies

Before we begin, we need to install the required Python packages. This notebook uses several popular libraries for numerical computation, visualization, and reinforcement learning environments.

In [None]:
# Install required packages
# Note: You may need to restart the kernel after installation

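# Version specifiers are quoted so the shell does not treat ">" as output redirection.
# matplotlib 3.6+ is required for the 'seaborn-v0_8-darkgrid' style used in the next cell.
!pip install "numpy>=1.21.0"
!pip install "matplotlib>=3.6.0"
!pip install "seaborn>=0.11.0"
!pip install "gym>=0.21.0"
!pip install "torch>=1.10.0"
!pip install "pandas>=1.3.0"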

### Import Required Libraries

Now let's import all the libraries we'll be using throughout this notebook.

In [None]:
# Core numerical and scientific computing
import numpy as np
import pandas as pd
from collections import defaultdict, deque
import random
import time

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Reinforcement Learning environments
import gym

# Deep Learning framework
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

print("All libraries imported successfully!")

### Verify Installation

Let's verify that all packages are installed correctly and check their versions.

In [None]:
# Verification cell - check all installations
import sys

def check_package(package_name, import_name=None):
    """Check if a package is installed and print its version."""
    if import_name is None:
        import_name = package_name
    
    try:
        module = __import__(import_name)
        version = getattr(module, '__version__', 'Unknown')
        print(f"✓ {package_name:20s} version: {version}")
        return True
    except ImportError:
        print(f"✗ {package_name:20s} NOT INSTALLED")
        return False

print("Checking package installations...\n")
print(f"Python version: {sys.version.split()[0]}\n")

packages = [
    ('numpy', 'numpy'),
    ('matplotlib', 'matplotlib'),
    ('seaborn', 'seaborn'),
    ('gym', 'gym'),
    ('torch', 'torch'),
    ('pandas', 'pandas')
]

all_installed = True
for package_name, import_name in packages:
    if not check_package(package_name, import_name):
        all_installed = False

print("\n" + "="*50)
if all_installed:
    print("✓ All packages installed successfully!")
    print("You're ready to start learning Reinforcement Learning!")
else:
    print("✗ Some packages are missing. Please install them using:")
    print("  pip install numpy matplotlib seaborn gym torch pandas")
print("="*50)

<a id='section1'></a>
## Section 1: Foundational Concepts

In this section, we'll build a solid foundation in reinforcement learning by exploring core concepts, starting with the simplest problems and gradually increasing complexity.

<a id='intro-rl'></a>
### Introduction to Reinforcement Learning

**What is Reinforcement Learning?**

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. Unlike other machine learning paradigms, RL focuses on learning through trial and error, receiving feedback in the form of rewards or penalties.

**How RL Differs from Other Machine Learning Paradigms:**

1. **Supervised Learning**: 
   - Learns from labeled examples (input-output pairs)
   - The correct answer is provided for each training example
   - Example: Image classification, where each image has a label

2. **Unsupervised Learning**: 
   - Learns patterns from unlabeled data
   - No explicit feedback or correct answers
   - Example: Clustering customers based on purchasing behavior

3. **Reinforcement Learning**: 
   - Learns from interaction with an environment
   - Receives delayed rewards/penalties as feedback
   - Must discover which actions yield the most reward through exploration
   - Example: Teaching a robot to walk, playing chess, or optimizing ad placement

**Key Characteristics of RL:**

- **Sequential Decision Making**: Actions affect future states and rewards
- **Trial and Error**: The agent must explore to discover good strategies
- **Delayed Consequences**: Rewards may come long after the actions that caused them
- **Trade-offs**: Must balance exploration (trying new things) vs exploitation (using known good strategies)

**The RL Loop:**

The fundamental interaction pattern in RL is:

1. Agent observes the current **state** of the environment
2. Agent selects and performs an **action**
3. Environment transitions to a new **state**
4. Agent receives a **reward** signal
5. Repeat

Let's see this in action with a simple example:

In [None]:
# Simple demonstration of the RL loop
# We'll create a basic environment and agent to illustrate the interaction

class SimpleEnvironment:
    """A simple environment where the agent tries to reach a goal."""
    
    def __init__(self):
        self.position = 0  # Agent starts at position 0
        self.goal = 5      # Goal is at position 5
        self.max_steps = 10
        self.current_step = 0
    
    def reset(self):
        """Reset the environment to initial state."""
        self.position = 0
        self.current_step = 0
        return self.position
    
    def step(self, action):
        """Execute an action and return (next_state, reward, done).
        
        Args:
            action: 0 = move left, 1 = move right
        
        Returns:
            next_state: The new position
            reward: Reward for this transition
            done: Whether the episode is finished
        """
        self.current_step += 1
        
        # Update position based on action
        if action == 0:  # Move left
            self.position = max(0, self.position - 1)
        else:  # Move right
            self.position = min(10, self.position + 1)
        
        # Calculate reward
        if self.position == self.goal:
            reward = 10  # Large reward for reaching goal
            done = True
        elif self.current_step >= self.max_steps:
            reward = -5  # Penalty for taking too long
            done = True
        else:
            reward = -1  # Small penalty for each step (encourages efficiency)
            done = False
        
        return self.position, reward, done


class SimpleAgent:
    """A simple agent that takes random actions."""
    
    def select_action(self, state):
        """Select an action (randomly for now)."""
        return np.random.choice([0, 1])  # 0 = left, 1 = right


# Demonstrate the RL loop
print("Demonstrating the Reinforcement Learning Loop\n")
print("="*60)

env = SimpleEnvironment()
agent = SimpleAgent()

# Run one episode
state = env.reset()
total_reward = 0
step = 0

print(f"Initial State: Position = {state}, Goal = {env.goal}\n")

done = False
while not done:
    # Agent observes state and selects action
    action = agent.select_action(state)
    action_name = "LEFT" if action == 0 else "RIGHT"
    
    # Environment responds to action
    next_state, reward, done = env.step(action)
    
    # Track cumulative reward
    total_reward += reward
    step += 1
    
    # Display the interaction
    print(f"Step {step}:")
    print(f"  State: {state} → Action: {action_name} → Next State: {next_state}")
    print(f"  Reward: {reward:+.0f} | Total Reward: {total_reward:+.0f}")
    
    if done:
        if next_state == env.goal:
            print(f"\n✓ Goal reached in {step} steps!")
        else:
            print(f"\n✗ Failed to reach goal within {env.max_steps} steps.")
    print()
    
    state = next_state

print("="*60)
print(f"\nFinal Total Reward: {total_reward:+.0f}")
print("\nThis demonstrates the core RL loop:")
print("  1. Agent observes STATE")
print("  2. Agent takes ACTION")
print("  3. Environment provides REWARD and new STATE")
print("  4. Repeat until episode ends")

**Key Observations:**

- The agent doesn't know the optimal strategy initially
- It must learn through experience which actions lead to higher rewards
- The random agent above is inefficient - a learning agent would improve over time
- This simple example captures the essence of RL: learning to make good decisions through interaction

In the following sections, we'll explore how agents can learn optimal strategies, starting with one of the simplest RL problems: the Multi-Armed Bandit.

<a id='bandits'></a>
### Multi-Armed Bandit Problem

**What is the Multi-Armed Bandit Problem?**

Imagine you're in a casino facing a row of slot machines (also called "one-armed bandits"). Each machine has a different, unknown probability of paying out. You have a limited budget and want to maximize your total winnings. Which machines should you play?

This is the **K-Armed Bandit Problem**, one of the simplest yet most fundamental problems in reinforcement learning. It's called "K-armed" because there are K different slot machines (or "arms") to choose from.

**The Exploration-Exploitation Dilemma**

The bandit problem perfectly illustrates the core challenge in RL:

- **Exploitation**: Play the machine that has given you the best results so far (use your current knowledge)
- **Exploration**: Try other machines to see if they might be better (gather more information)

If you only exploit, you might miss out on better options you haven't tried enough. If you only explore, you waste time on machines you already know are bad. The key is finding the right balance.

**Formal Definition:**

- You have K actions (arms) to choose from
- Each action has an unknown expected reward (the "true value")
- When you select an action, you receive a reward drawn from that action's probability distribution
- Your goal: maximize the total reward over many time steps
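
One common way to make this goal precise is through *regret*. The notation below ($q_*(a)$ for the true value of arm $a$, $A_t$ for the arm chosen at step $t$, $T$ for the number of steps) is introduced here for illustration only:

$$
q_*(a) = \mathbb{E}[R_t \mid A_t = a], \qquad \text{Regret}(T) = T \cdot \max_a q_*(a) - \mathbb{E}\left[\sum_{t=1}^{T} q_*(A_t)\right]
$$

Under this definition, maximizing the expected total reward is the same as minimizing regret: every pull of a suboptimal arm adds the gap between the best arm's true value and the chosen arm's true value.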

Let's implement a simple bandit environment:

In [None]:
class MultiArmedBandit:
    """A K-armed bandit environment.
    
    Each arm has a true mean reward, and pulling an arm gives a reward
    sampled from a normal distribution around that mean.
    """
    
    def __init__(self, k=10, mean_range=(0, 1), std=1.0):
        """Initialize the bandit.
        
        Args:
            k: Number of arms
            mean_range: Range for true mean rewards
            std: Standard deviation of reward distributions
        """
        self.k = k
        self.std = std
        
        # True mean reward for each arm (unknown to the agent)
        self.true_means = np.random.uniform(mean_range[0], mean_range[1], k)
        
        # Track which arm is actually best
        self.best_arm = np.argmax(self.true_means)
        
    def pull(self, arm):
        """Pull an arm and receive a reward.
        
        Args:
            arm: Index of the arm to pull (0 to k-1)
            
        Returns:
            reward: Sampled reward from this arm's distribution
        """
        # Sample reward from normal distribution around true mean
        reward = np.random.normal(self.true_means[arm], self.std)
        return reward
    
    def get_optimal_reward(self):
        """Return the expected reward of the best arm."""
        return self.true_means[self.best_arm]


# Create a 10-armed bandit
np.random.seed(42)
bandit = MultiArmedBandit(k=10)

print("Multi-Armed Bandit Environment Created")
print("="*60)
print(f"Number of arms: {bandit.k}")
print(f"\nTrue mean rewards for each arm:")
for i, mean in enumerate(bandit.true_means):
    marker = " ← BEST" if i == bandit.best_arm else ""
    print(f"  Arm {i}: {mean:.3f}{marker}")
print(f"\nOptimal arm: {bandit.best_arm} (mean reward: {bandit.get_optimal_reward():.3f})")
print("\nNote: The agent doesn't know these true values!")
print("      It must learn them through experience.")

#### The Greedy Strategy and Its Fatal Flaw

**What is the Greedy Strategy?**

The simplest approach to the bandit problem is the **greedy strategy**: always choose the action that has the highest estimated value based on your experience so far.

**How it works:**
1. Keep track of the average reward received from each arm
2. Always select the arm with the highest average reward
3. Update the average after each pull
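
The averages in steps 1 and 3 do not require storing every past reward. As a minimal sketch (the reward values are made up for illustration), the incremental rule `estimate += (reward - estimate) / n` reproduces the ordinary mean:

```python
import numpy as np

# Hypothetical reward sequence from a single arm (illustrative values only)
rewards = [1.0, 0.0, 2.0, 1.0]

q = 0.0  # running estimate of the arm's value
n = 0    # number of pulls so far
for r in rewards:
    n += 1
    # NewEstimate = OldEstimate + (1/n) * (Reward - OldEstimate)
    q += (r - q) / n

batch_mean = float(np.mean(rewards))
print(f"incremental estimate: {q}, batch mean: {batch_mean}")
```

Both numbers agree, so the agent only needs to keep one estimate and one counter per arm.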

**The Fatal Flaw:**

The greedy strategy can easily get stuck on a suboptimal arm! Here's why:

- Suppose you try arm 3 first and get lucky with a high reward
- Now arm 3 has the highest estimated value
- The greedy strategy keeps choosing arm 3 for as long as its running average stays above the other arms' estimates, which is often indefinitely
- You may never discover that arm 7 is actually better!

This is called **premature convergence** - the agent stops exploring too early and misses better options.

Let's implement a greedy agent and see this problem in action:

In [None]:
class GreedyAgent:
    """An agent that always selects the arm with highest estimated value."""
    
    def __init__(self, k):
        """Initialize the agent.
        
        Args:
            k: Number of arms
        """
        self.k = k
        self.q_estimates = np.zeros(k)  # Estimated value of each arm
        self.action_counts = np.zeros(k)  # Number of times each arm was pulled
        
    def select_action(self):
        """Select the arm with highest estimated value (greedy)."""
        # Break ties randomly
        max_value = np.max(self.q_estimates)
        best_arms = np.where(self.q_estimates == max_value)[0]
        return np.random.choice(best_arms)
    
    def update(self, action, reward):
        """Update estimates after receiving a reward.
        
        Uses incremental average formula:
        NewEstimate = OldEstimate + (1/n) * (Reward - OldEstimate)
        """
        self.action_counts[action] += 1
        n = self.action_counts[action]
        
        # Incremental update of average
        self.q_estimates[action] += (1/n) * (reward - self.q_estimates[action])


def run_experiment(agent, bandit, steps=1000):
    """Run an experiment with an agent on a bandit.
    
    Returns:
        rewards: Array of rewards received at each step
        optimal_actions: Array indicating if optimal action was chosen
    """
    rewards = np.zeros(steps)
    optimal_actions = np.zeros(steps)
    
    for t in range(steps):
        # Agent selects action
        action = agent.select_action()
        
        # Environment provides reward
        reward = bandit.pull(action)
        
        # Agent updates its estimates
        agent.update(action, reward)
        
        # Track results
        rewards[t] = reward
        optimal_actions[t] = 1 if action == bandit.best_arm else 0
    
    return rewards, optimal_actions


# Demonstrate the greedy strategy's failure
print("Demonstrating the Greedy Strategy\n")
print("="*60)

np.random.seed(42)
bandit = MultiArmedBandit(k=10)
greedy_agent = GreedyAgent(k=10)

rewards, optimal_actions = run_experiment(greedy_agent, bandit, steps=1000)

print(f"\nAfter 1000 steps:")
print(f"\nArm selection counts:")
for i in range(bandit.k):
    count = greedy_agent.action_counts[i]
    estimate = greedy_agent.q_estimates[i]
    true_value = bandit.true_means[i]
    marker = " ← OPTIMAL" if i == bandit.best_arm else ""
    print(f"  Arm {i}: pulled {count:4.0f} times | "
          f"estimated value: {estimate:6.3f} | true value: {true_value:6.3f}{marker}")

optimal_pct = np.mean(optimal_actions) * 100
avg_reward = np.mean(rewards)
optimal_reward = bandit.get_optimal_reward()

print(f"\nPerformance:")
print(f"  Optimal action selected: {optimal_pct:.1f}% of the time")
print(f"  Average reward: {avg_reward:.3f}")
print(f"  Optimal reward: {optimal_reward:.3f}")
print(f"  Average regret per step: {optimal_reward - avg_reward:.3f}")

print(f"\n⚠️  Notice: The greedy agent likely got stuck on a suboptimal arm!")
print(f"    It stopped exploring and missed the best option.")

#### Visualizing the Greedy Strategy's Failure

In [None]:
# Run multiple experiments to see the pattern
num_experiments = 100
steps = 1000

all_rewards = np.zeros((num_experiments, steps))
all_optimal = np.zeros((num_experiments, steps))

np.random.seed(42)
for i in range(num_experiments):
    bandit = MultiArmedBandit(k=10)
    agent = GreedyAgent(k=10)
    rewards, optimal = run_experiment(agent, bandit, steps)
    all_rewards[i] = rewards
    all_optimal[i] = optimal

# Calculate averages across experiments
avg_rewards = np.mean(all_rewards, axis=0)
avg_optimal = np.mean(all_optimal, axis=0)

# Create visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Plot 1: Average reward over time
ax1.plot(avg_rewards, linewidth=2, color='red', alpha=0.8)
ax1.set_xlabel('Steps', fontsize=12)
ax1.set_ylabel('Average Reward', fontsize=12)
ax1.set_title('Greedy Strategy: Average Reward Over Time', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.axhline(y=np.mean(avg_rewards), color='red', linestyle='--', alpha=0.5, label=f'Mean: {np.mean(avg_rewards):.3f}')
ax1.legend()

# Plot 2: Percentage of optimal actions
ax2.plot(avg_optimal * 100, linewidth=2, color='red', alpha=0.8)
ax2.set_xlabel('Steps', fontsize=12)
ax2.set_ylabel('% Optimal Action', fontsize=12)
ax2.set_title('Greedy Strategy: Percentage of Optimal Actions', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.axhline(y=np.mean(avg_optimal) * 100, color='red', linestyle='--', alpha=0.5, 
            label=f'Mean: {np.mean(avg_optimal)*100:.1f}%')
ax2.legend()
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("   - The greedy strategy quickly settles on an arm (often suboptimal)")
print("   - It rarely selects the optimal action because it stopped exploring")
print("   - Performance plateaus early and doesn't improve")
print("   - This demonstrates why pure exploitation fails!")

#### The Epsilon-Greedy Algorithm

**A Simple Solution to the Exploration-Exploitation Dilemma**

The epsilon-greedy algorithm provides a simple yet effective solution to the greedy strategy's fatal flaw. Instead of always exploiting, it introduces controlled exploration.

**How Epsilon-Greedy Works:**

With probability $\epsilon$ (epsilon): Choose a **random** action (explore)

With probability $1 - \epsilon$: Choose the **best known** action (exploit)

**Mathematical Formulation:**

$$
A_t = \begin{cases}
\text{random action} & \text{with probability } \epsilon \\
\arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon
\end{cases}
$$

where $Q_t(a)$ is the estimated value of action $a$ at time $t$.

**Key Parameters:**

- $\epsilon = 0$: Pure exploitation (greedy strategy)
- $\epsilon = 1$: Pure exploration (random selection)
- $\epsilon = 0.1$: A common choice - explore 10% of the time
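
One detail worth making concrete: the random branch can also land on the greedy arm. Assuming exploration picks uniformly among all $k$ arms and there are no ties, the greedy arm is chosen with probability $(1 - \epsilon) + \epsilon / k$. The helper name `greedy_arm_probability` below is hypothetical and only serves this illustration:

```python
def greedy_arm_probability(epsilon, k):
    """Probability that epsilon-greedy selects the current greedy arm.

    Assumes uniform random exploration over all k arms and no ties.
    """
    return (1 - epsilon) + epsilon / k


for eps in [0.0, 0.01, 0.1, 0.3]:
    p = greedy_arm_probability(eps, k=10)
    print(f"epsilon = {eps:.2f}: P(greedy arm) = {p:.3f}")
```

For $\epsilon = 0.1$ and $k = 10$ this gives 0.91, which is also the ceiling on how often the optimal arm can be selected once the agent has identified it.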

**Advantages:**
- Simple to implement
- Guarantees all actions are tried infinitely often in the limit (for any $\epsilon > 0$)
- Balances exploration and exploitation

**Trade-offs:**
- Explores uniformly (doesn't prioritize promising actions)
- Continues exploring even after finding the best action
- Choice of $\epsilon$ affects performance

Let's implement the epsilon-greedy algorithm:

In [None]:
class EpsilonGreedyAgent:
    """An agent that uses epsilon-greedy action selection."""
    
    def __init__(self, k, epsilon=0.1):
        """Initialize the agent.
        
        Args:
            k: Number of arms
            epsilon: Probability of random exploration (0 to 1)
        """
        self.k = k
        self.epsilon = epsilon
        self.q_estimates = np.zeros(k)  # Estimated value of each arm
        self.action_counts = np.zeros(k)  # Number of times each arm was pulled
        
    def select_action(self):
        """Select action using epsilon-greedy strategy."""
        if np.random.random() < self.epsilon:
            # Explore: choose random action
            return np.random.randint(0, self.k)
        else:
            # Exploit: choose best known action
            max_value = np.max(self.q_estimates)
            best_arms = np.where(self.q_estimates == max_value)[0]
            return np.random.choice(best_arms)
    
    def update(self, action, reward):
        """Update estimates after receiving a reward."""
        self.action_counts[action] += 1
        n = self.action_counts[action]
        
        # Incremental update of average
        self.q_estimates[action] += (1/n) * (reward - self.q_estimates[action])


# Test epsilon-greedy with different epsilon values
print("Epsilon-Greedy Algorithm Demonstration\n")
print("="*60)

np.random.seed(42)
bandit = MultiArmedBandit(k=10)

epsilon_values = [0.0, 0.01, 0.1, 0.3]
results = {}

for eps in epsilon_values:
    agent = EpsilonGreedyAgent(k=10, epsilon=eps)
    rewards, optimal_actions = run_experiment(agent, bandit, steps=1000)
    
    results[eps] = {
        'rewards': rewards,
        'optimal': optimal_actions,
        'avg_reward': np.mean(rewards),
        'optimal_pct': np.mean(optimal_actions) * 100
    }
    
    print(f"\nε = {eps:.2f}:")
    print(f"  Average reward: {results[eps]['avg_reward']:.3f}")
    print(f"  Optimal action: {results[eps]['optimal_pct']:.1f}% of the time")

print("\n" + "="*60)
print("\n💡 Key Insight:")
print("   - ε = 0.0 (greedy) gets stuck on suboptimal actions")
print("   - Small ε values (0.01-0.1) balance exploration and exploitation well")
print("   - Large ε values (0.3) explore too much and waste opportunities")

#### Comparing Greedy vs Epsilon-Greedy Performance

In [None]:
# Run comprehensive comparison across multiple experiments
num_experiments = 200
steps = 1000
epsilon_values = [0.0, 0.01, 0.1, 0.3]

# Store results for each epsilon value
all_results = {eps: {'rewards': [], 'optimal': []} for eps in epsilon_values}

np.random.seed(42)
for i in range(num_experiments):
    bandit = MultiArmedBandit(k=10)
    
    for eps in epsilon_values:
        agent = EpsilonGreedyAgent(k=10, epsilon=eps)
        rewards, optimal = run_experiment(agent, bandit, steps)
        all_results[eps]['rewards'].append(rewards)
        all_results[eps]['optimal'].append(optimal)

# Calculate averages
avg_results = {}
for eps in epsilon_values:
    avg_results[eps] = {
        'rewards': np.mean(all_results[eps]['rewards'], axis=0),
        'optimal': np.mean(all_results[eps]['optimal'], axis=0)
    }

# Create comprehensive visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

colors = ['red', 'orange', 'green', 'blue']
labels = [f'ε = {eps:.2f}' + (' (Greedy)' if eps == 0 else '') for eps in epsilon_values]

# Plot 1: Average reward over time
for eps, color, label in zip(epsilon_values, colors, labels):
    ax1.plot(avg_results[eps]['rewards'], linewidth=2, color=color, alpha=0.8, label=label)

ax1.set_xlabel('Steps', fontsize=12)
ax1.set_ylabel('Average Reward', fontsize=12)
ax1.set_title('Epsilon-Greedy: Average Reward Comparison', fontsize=14, fontweight='bold')
ax1.legend(loc='lower right', fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Percentage of optimal actions
for eps, color, label in zip(epsilon_values, colors, labels):
    ax2.plot(avg_results[eps]['optimal'] * 100, linewidth=2, color=color, alpha=0.8, label=label)

ax2.set_xlabel('Steps', fontsize=12)
ax2.set_ylabel('% Optimal Action', fontsize=12)
ax2.set_title('Epsilon-Greedy: Optimal Action Selection', fontsize=14, fontweight='bold')
ax2.legend(loc='lower right', fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n📊 Performance Summary (averaged over {} experiments):".format(num_experiments))
print("="*70)
print(f"{'Strategy':<20} {'Avg Reward':<15} {'Optimal %':<15} {'Final Optimal %'}")
print("="*70)

for eps in epsilon_values:
    strategy = f"ε = {eps:.2f}" + (" (Greedy)" if eps == 0 else "")
    avg_reward = np.mean(avg_results[eps]['rewards'])
    avg_optimal = np.mean(avg_results[eps]['optimal']) * 100
    final_optimal = np.mean(avg_results[eps]['optimal'][-100:]) * 100  # Last 100 steps
    
    print(f"{strategy:<20} {avg_reward:<15.3f} {avg_optimal:<15.1f} {final_optimal:.1f}%")

print("="*70)
print("\n✅ Conclusions:")
print("   1. Greedy (ε=0) performs poorly due to lack of exploration")
print("   2. Small epsilon (0.01-0.1) achieves good balance")
print("   3. ε=0.1 often performs best over this 1000-step horizon (check the table above)")
print("   4. Too much exploration (ε=0.3) wastes opportunities to exploit")
print("   5. Epsilon-greedy is a simple, effective way to manage the exploration-exploitation trade-off")

#### Optimistic Initial Values: Exploration Through Disappointment

**A Clever Alternative to Epsilon-Greedy**

The Optimistic Initial Values approach provides a different solution to encourage exploration. Instead of randomly exploring, it uses **disappointment-driven exploration**.

**The Key Idea:**

Initialize all action-value estimates to be **optimistically high** (higher than any realistic reward). When the agent tries an action and receives a lower-than-expected reward, it becomes "disappointed" and tries other actions, naturally encouraging exploration.

**How It Works:**

1. Set initial Q-values to a high value (e.g., +5 when the true mean rewards lie between 0 and 1)
2. Use a greedy strategy (no epsilon needed!)
3. Each action will initially seem promising
4. After an action is tried, its estimate drops from the optimistic value toward the rewards actually observed
5. The agent tries every action at least once before settling on the one that looks best

**Mathematical Intuition:**

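If we initialize $Q_0(a) = c$ for all actions, where $c$ is large, then after $n$ previous pulls of arm $a$ the next reward $R_{n+1}$ updates its estimate as:

$$Q_{n+1}(a) = Q_n(a) + \frac{1}{n+1}(R_{n+1} - Q_n(a))$$

Since early rewards fall below the optimistic estimate ($R_{n+1} < Q_n(a)$), the estimate decreases, making untried actions, whose estimates are still $c$, more attractive. Note that with this sample-average step size the first update ($n = 0$, step size 1) replaces $c$ with the first observed reward, so the optimism buys each arm one early pull rather than a gradual decay.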
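
How long the optimism lasts depends on the step size. The sketch below traces a single arm's estimate under the sample-average step $\frac{1}{n}$ and under a constant step size. The function `trace_estimates`, its `step_size` parameter, and the reward values are hypothetical and chosen for illustration; the constant-step variant is a common alternative, not what the agents in this notebook use:

```python
def trace_estimates(rewards, initial_value, step_size=None):
    """Return the estimate after each reward for a single arm.

    step_size=None uses the sample-average step 1/n;
    a float uses that value as a constant step size.
    """
    q = initial_value
    history = []
    for n, r in enumerate(rewards, start=1):
        alpha = 1.0 / n if step_size is None else step_size
        q += alpha * (r - q)
        history.append(q)
    return history


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards from one arm
sample_avg = trace_estimates(rewards, initial_value=5.0)
constant = trace_estimates(rewards, initial_value=5.0, step_size=0.1)

print("sample-average step:", sample_avg)
print("constant step 0.1:  ", [round(q, 3) for q in constant])
```

With the sample-average step, the estimate jumps from 5.0 to the first reward and the initial value is forgotten. With a constant step of 0.1 it decays gradually (roughly 4.6, 4.24, 3.916), so a larger initial value keeps an arm looking attractive for more pulls.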

**Advantages:**
- No need to tune an epsilon parameter
- Exploration happens naturally through the learning process
- Simple to implement
- Works well for stationary problems

**Disadvantages:**
- Only explores at the beginning (not suitable for non-stationary problems)
- Requires knowing a good initial value
- Less flexible than epsilon-greedy

Let's implement and compare this approach:

In [None]:
class OptimisticGreedyAgent:
    """A greedy agent with optimistic initial value estimates."""
    
    def __init__(self, k, initial_value=5.0):
        """Initialize the agent with optimistic values.
        
        Args:
            k: Number of arms
            initial_value: Optimistic initial estimate for all actions
        """
        self.k = k
        self.initial_value = initial_value
        # Initialize all estimates optimistically
        self.q_estimates = np.ones(k) * initial_value
        self.action_counts = np.zeros(k)
        
    def select_action(self):
        """Select action greedily (highest estimated value)."""
        max_value = np.max(self.q_estimates)
        best_arms = np.where(self.q_estimates == max_value)[0]
        return np.random.choice(best_arms)
    
    def update(self, action, reward):
        """Update estimates after receiving a reward."""
        self.action_counts[action] += 1
        n = self.action_counts[action]
        
        # Incremental update
        self.q_estimates[action] += (1/n) * (reward - self.q_estimates[action])


# Demonstrate optimistic initial values
print("Optimistic Initial Values Demonstration\n")
print("="*60)

np.random.seed(42)
bandit = MultiArmedBandit(k=10)

print(f"True reward range: [{bandit.true_means.min():.2f}, {bandit.true_means.max():.2f}]")
print(f"Optimal arm: {bandit.best_arm} (mean: {bandit.get_optimal_reward():.3f})\n")

# Test different initial values
initial_values = [0.0, 2.0, 5.0, 10.0]
optimistic_results = {}

for init_val in initial_values:
    agent = OptimisticGreedyAgent(k=10, initial_value=init_val)
    rewards, optimal_actions = run_experiment(agent, bandit, steps=1000)
    
    optimistic_results[init_val] = {
        'rewards': rewards,
        'optimal': optimal_actions,
        'avg_reward': np.mean(rewards),
        'optimal_pct': np.mean(optimal_actions) * 100,
        'agent': agent
    }
    
    print(f"Initial Value = {init_val:.1f}:")
    print(f"  Average reward: {optimistic_results[init_val]['avg_reward']:.3f}")
    print(f"  Optimal action: {optimistic_results[init_val]['optimal_pct']:.1f}% of the time")
    print(f"  Final estimates: {agent.q_estimates}")
    print()

print("="*60)
print("\n💡 Key Observations:")
print("   - Initial value = 0: Behaves like standard greedy (poor exploration)")
print("   - Initial value = 5-10: Encourages exploration through disappointment")
print("   - Any initial value above the true means forces each arm to be tried at least once")
print("   - Estimates of frequently pulled arms approach their true values regardless of initialization")

#### Comparing Optimistic Initial Values with Epsilon-Greedy

In [None]:
# Comprehensive comparison: Optimistic vs Epsilon-Greedy
num_experiments = 200
steps = 1000

# Strategies to compare
strategies = {
    'Greedy (Q=0)': {'type': 'optimistic', 'init': 0.0},
    'Optimistic (Q=5)': {'type': 'optimistic', 'init': 5.0},
    'ε-greedy (ε=0.1)': {'type': 'epsilon', 'epsilon': 0.1},
    'ε-greedy (ε=0.01)': {'type': 'epsilon', 'epsilon': 0.01}
}

comparison_results = {name: {'rewards': [], 'optimal': []} for name in strategies.keys()}

np.random.seed(42)
for i in range(num_experiments):
    bandit = MultiArmedBandit(k=10)
    
    for name, config in strategies.items():
        if config['type'] == 'optimistic':
            agent = OptimisticGreedyAgent(k=10, initial_value=config['init'])
        else:
            agent = EpsilonGreedyAgent(k=10, epsilon=config['epsilon'])
        
        rewards, optimal = run_experiment(agent, bandit, steps)
        comparison_results[name]['rewards'].append(rewards)
        comparison_results[name]['optimal'].append(optimal)

# Calculate averages
avg_comparison = {}
for name in strategies.keys():
    avg_comparison[name] = {
        'rewards': np.mean(comparison_results[name]['rewards'], axis=0),
        'optimal': np.mean(comparison_results[name]['optimal'], axis=0)
    }

# Visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

colors = ['red', 'purple', 'green', 'orange']
linestyles = ['-', '-', '--', '--']

# Plot 1: Average reward
for (name, color, ls) in zip(strategies.keys(), colors, linestyles):
    ax1.plot(avg_comparison[name]['rewards'], linewidth=2, color=color, 
             linestyle=ls, alpha=0.8, label=name)

ax1.set_xlabel('Steps', fontsize=12)
ax1.set_ylabel('Average Reward', fontsize=12)
ax1.set_title('Optimistic Initial Values vs Epsilon-Greedy: Average Reward', 
              fontsize=14, fontweight='bold')
ax1.legend(loc='lower right', fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Optimal action percentage
for (name, color, ls) in zip(strategies.keys(), colors, linestyles):
    ax2.plot(avg_comparison[name]['optimal'] * 100, linewidth=2, color=color, 
             linestyle=ls, alpha=0.8, label=name)

ax2.set_xlabel('Steps', fontsize=12)
ax2.set_ylabel('% Optimal Action', fontsize=12)
ax2.set_title('Optimistic Initial Values vs Epsilon-Greedy: Optimal Action Selection', 
              fontsize=14, fontweight='bold')
ax2.legend(loc='lower right', fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📊 Performance Comparison (averaged over {} experiments):".format(num_experiments))
print("="*80)
print(f"{'Strategy':<25} {'Avg Reward':<15} {'Optimal %':<15} {'Early (0-100)':<15} {'Late (900-1000)'}")
print("="*80)

for name in strategies.keys():
    avg_reward = np.mean(avg_comparison[name]['rewards'])
    avg_optimal = np.mean(avg_comparison[name]['optimal']) * 100
    early_optimal = np.mean(avg_comparison[name]['optimal'][:100]) * 100
    late_optimal = np.mean(avg_comparison[name]['optimal'][-100:]) * 100
    
    print(f"{name:<25} {avg_reward:<15.3f} {avg_optimal:<15.1f} {early_optimal:<15.1f} {late_optimal:.1f}%")

print("="*80)
print("\n✅ Key Insights:")
print("   1. Optimistic initialization explores more early on")
print("   2. Epsilon-greedy maintains consistent exploration throughout")
print("   3. Optimistic approach eventually stops exploring (greedy after learning)")
print("   4. Both methods significantly outperform standard greedy")
print("   5. Choice depends on problem: stationary → optimistic, non-stationary → epsilon-greedy")

#### Upper Confidence Bound (UCB): Uncertainty-Driven Exploration

**A Principled, Uncertainty-Driven Bandit Algorithm**

The Upper Confidence Bound (UCB) algorithm represents a more principled approach to the exploration-exploitation dilemma. Instead of exploring randomly (epsilon-greedy) or through disappointment (optimistic initialization), UCB explores based on **uncertainty**.

**The Core Principle:**

"It's reasonable to be optimistic in the face of uncertainty."

UCB selects actions based on both:
1. **Estimated value** (exploitation)
2. **Uncertainty in that estimate** (exploration)

Actions that have been tried less often have higher uncertainty, making them more attractive for exploration.

**Mathematical Formulation:**

The UCB action selection rule is:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where:
- $Q_t(a)$ = estimated value of action $a$ at time $t$ (exploitation term)
- $c$ = exploration parameter (controls degree of exploration)
- $t$ = current time step (total number of actions taken)
- $N_t(a)$ = number of times action $a$ has been selected (uncertainty term)
- $\sqrt{\frac{\ln t}{N_t(a)}}$ = uncertainty bonus (larger for less-tried actions)

**How the Uncertainty Bonus Works:**

- Actions tried fewer times have larger $\sqrt{\frac{\ln t}{N_t(a)}}$ (more uncertainty)
- As an action is tried more, $N_t(a)$ increases and the bonus decreases
- Because $\ln t$ keeps growing while $N_t(a)$ stays fixed for an unselected action, that action's bonus rises until it is tried again
- The bonus naturally balances exploration and exploitation
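
To make these points concrete, the bonus can be evaluated for a few counts. This is a minimal sketch: `ucb_bonus` is a hypothetical helper introduced here for illustration (it is not part of the notebook's agent classes), and `c=2.0` is just an illustrative default.

```python
import math

def ucb_bonus(t, n, c=2.0):
    """Exploration bonus c * sqrt(ln t / N_t(a)) for an action selected n times."""
    return c * math.sqrt(math.log(t) / n)

# Fixed time step, increasing counts: the bonus shrinks.
t = 1000
for n in (1, 10, 100, 500):
    print(f"t={t}, N={n:>3}: bonus={ucb_bonus(t, n):.3f}")

# Fixed count, increasing time: the bonus grows slowly.
for t in (100, 1000, 10000):
    print(f"t={t:>5}, N=10: bonus={ucb_bonus(t, 10):.3f}")
```

At a fixed $t$, the bonus shrinks as the count grows. At a fixed count, it grows slowly with $t$, which is what eventually brings neglected actions back into consideration.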

**Advantages:**
- No random exploration: action selection is deterministic given the history (apart from tie-breaking)
- Automatically balances exploration and exploitation
- Theoretical guarantees on performance (logarithmic regret)
- Prioritizes promising actions while ensuring all are tried

**Disadvantages:**
- More complex to implement
- Requires tuning the $c$ parameter
- Assumes stationary reward distributions
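
The "logarithmic regret" guarantee listed under the advantages can be stated more concretely. The notation here is an assumption introduced for this sketch: $\mu_a$ is the true mean reward of action $a$, $\mu^* = \max_a \mu_a$, and $\Delta_a = \mu^* - \mu_a$ is the gap of action $a$. Expected regret after $T$ steps is the reward lost by not always playing the best action:

$$\text{Regret}(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right] = \sum_{a} \Delta_a \, \mathbb{E}[N_T(a)]$$

In the classic UCB1 analysis, which assumes rewards bounded in $[0, 1]$, each suboptimal action is selected only $O(\ln T / \Delta_a^2)$ times in expectation, so

$$\text{Regret}(T) = O\left(\sum_{a:\, \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)$$

The bound relies on that bounded-reward assumption, so it is best read as intuition rather than as a guarantee for every bandit used in this notebook.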

Let's implement UCB:

In [None]:
class UCBAgent:
    """An agent using Upper Confidence Bound action selection."""
    
    def __init__(self, k, c=2.0):
        """Initialize the UCB agent.
        
        Args:
            k: Number of arms
            c: Exploration parameter (typically sqrt(2) or 2)
        """
        self.k = k
        self.c = c
        self.q_estimates = np.zeros(k)
        self.action_counts = np.zeros(k)
        self.t = 0  # Total time steps
        
    def select_action(self):
        """Select action using UCB formula."""
        self.t += 1
        
        # First, try each action at least once
        if self.t <= self.k:
            return self.t - 1
        
        # Calculate UCB values for all actions
        ucb_values = np.zeros(self.k)
        for a in range(self.k):
            if self.action_counts[a] == 0:
                # Untried actions get infinite value (shouldn't happen after initial phase)
                ucb_values[a] = float('inf')
            else:
                # UCB formula: Q(a) + c * sqrt(ln(t) / N(a))
                exploitation = self.q_estimates[a]
                exploration = self.c * np.sqrt(np.log(self.t) / self.action_counts[a])
                ucb_values[a] = exploitation + exploration
        
        # Select action with highest UCB value
        max_ucb = np.max(ucb_values)
        best_actions = np.where(ucb_values == max_ucb)[0]
        return np.random.choice(best_actions)
    
    def update(self, action, reward):
        """Update estimates after receiving a reward."""
        self.action_counts[action] += 1
        n = self.action_counts[action]
        
        # Incremental update
        self.q_estimates[action] += (1/n) * (reward - self.q_estimates[action])


# Demonstrate UCB algorithm
print("Upper Confidence Bound (UCB) Algorithm Demonstration\n")
print("="*60)

np.random.seed(42)
bandit = MultiArmedBandit(k=10)

print(f"Optimal arm: {bandit.best_arm} (mean: {bandit.get_optimal_reward():.3f})\n")

# Test different c values
c_values = [0.5, 1.0, 2.0, 4.0]
ucb_results = {}

for c in c_values:
    agent = UCBAgent(k=10, c=c)
    rewards, optimal_actions = run_experiment(agent, bandit, steps=1000)
    
    ucb_results[c] = {
        'rewards': rewards,
        'optimal': optimal_actions,
        'avg_reward': np.mean(rewards),
        'optimal_pct': np.mean(optimal_actions) * 100
    }
    
    print(f"c = {c:.1f}:")
    print(f"  Average reward: {ucb_results[c]['avg_reward']:.3f}")
    print(f"  Optimal action: {ucb_results[c]['optimal_pct']:.1f}% of the time")
    print(f"  Action counts: {agent.action_counts.astype(int)}")
    print()

print("="*60)
print("\n💡 Key Observations:")
print("   - Small c (0.5): Less exploration; may converge faster but risks locking onto a suboptimal arm")
print("   - Medium c (1-2): Good balance, typical choice")
print("   - Large c (4): More exploration, ensures thorough search")
print("   - UCB naturally tries all actions but focuses on promising ones")

#### Comprehensive Comparison: All Three Strategies

In [None]:
# Final comprehensive comparison of all strategies
num_experiments = 200
steps = 1000

# All strategies to compare
all_strategies = {
    'Greedy': {'type': 'greedy'},
    'ε-greedy (ε=0.1)': {'type': 'epsilon', 'epsilon': 0.1},
    'ε-greedy (ε=0.01)': {'type': 'epsilon', 'epsilon': 0.01},
    'Optimistic (Q=5)': {'type': 'optimistic', 'init': 5.0},
    'UCB (c=2)': {'type': 'ucb', 'c': 2.0},
    'UCB (c=1)': {'type': 'ucb', 'c': 1.0}
}

final_results = {name: {'rewards': [], 'optimal': []} for name in all_strategies.keys()}

np.random.seed(42)
for i in range(num_experiments):
    bandit = MultiArmedBandit(k=10)
    
    for name, config in all_strategies.items():
        if config['type'] == 'greedy':
            agent = GreedyAgent(k=10)
        elif config['type'] == 'epsilon':
            agent = EpsilonGreedyAgent(k=10, epsilon=config['epsilon'])
        elif config['type'] == 'optimistic':
            agent = OptimisticGreedyAgent(k=10, initial_value=config['init'])
        else:  # ucb
            agent = UCBAgent(k=10, c=config['c'])
        
        rewards, optimal = run_experiment(agent, bandit, steps)
        final_results[name]['rewards'].append(rewards)
        final_results[name]['optimal'].append(optimal)

# Calculate averages
avg_final = {}
for name in all_strategies.keys():
    avg_final[name] = {
        'rewards': np.mean(final_results[name]['rewards'], axis=0),
        'optimal': np.mean(final_results[name]['optimal'], axis=0)
    }

# Create comprehensive visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 11))

colors = ['red', 'green', 'lightgreen', 'purple', 'blue', 'lightblue']
linestyles = ['-', '-', '--', '-', '-', '--']
linewidths = [2, 2.5, 2, 2, 2.5, 2]

# Plot 1: Average reward over time
for (name, color, ls, lw) in zip(all_strategies.keys(), colors, linestyles, linewidths):
    ax1.plot(avg_final[name]['rewards'], linewidth=lw, color=color, 
             linestyle=ls, alpha=0.8, label=name)

ax1.set_xlabel('Steps', fontsize=13)
ax1.set_ylabel('Average Reward', fontsize=13)
ax1.set_title('Multi-Armed Bandit: Complete Strategy Comparison - Average Reward', 
              fontsize=15, fontweight='bold')
ax1.legend(loc='lower right', fontsize=11, ncol=2)
ax1.grid(True, alpha=0.3)

# Plot 2: Percentage of optimal actions
for (name, color, ls, lw) in zip(all_strategies.keys(), colors, linestyles, linewidths):
    ax2.plot(avg_final[name]['optimal'] * 100, linewidth=lw, color=color, 
             linestyle=ls, alpha=0.8, label=name)

ax2.set_xlabel('Steps', fontsize=13)
ax2.set_ylabel('% Optimal Action', fontsize=13)
ax2.set_title('Multi-Armed Bandit: Complete Strategy Comparison - Optimal Action Selection', 
              fontsize=15, fontweight='bold')
ax2.legend(loc='lower right', fontsize=11, ncol=2)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

# Detailed performance table
print("\n📊 Final Performance Comparison (averaged over {} experiments):".format(num_experiments))
print("="*95)
print(f"{'Strategy':<25} {'Avg Reward':<15} {'Total Optimal %':<18} {'Early (0-100)':<18} {'Late (900-1000)'}")
print("="*95)

# Sort by average reward for ranking
sorted_strategies = sorted(all_strategies.keys(), 
                          key=lambda x: np.mean(avg_final[x]['rewards']), 
                          reverse=True)

for rank, name in enumerate(sorted_strategies, 1):
    avg_reward = np.mean(avg_final[name]['rewards'])
    avg_optimal = np.mean(avg_final[name]['optimal']) * 100
    early_optimal = np.mean(avg_final[name]['optimal'][:100]) * 100
    late_optimal = np.mean(avg_final[name]['optimal'][-100:]) * 100
    
    rank_str = f"#{rank} {name}"
    print(f"{rank_str:<25} {avg_reward:<15.3f} {avg_optimal:<18.1f} {early_optimal:<18.1f} {late_optimal:.1f}%")

print("="*95)

print("\n🏆 Final Rankings and Insights:\n")
print("1. UCB (c=2) typically performs best overall")
print("   - Principled exploration based on uncertainty")
print("   - Strong theoretical guarantees")
print("   - No random exploration needed\n")

print("2. ε-greedy (ε=0.1) is a close second")
print("   - Simple and effective")
print("   - Works well in non-stationary environments")
print("   - Easy to implement and tune\n")

print("3. Optimistic initialization works well early")
print("   - Good for stationary problems")
print("   - Only one setting to choose: an initial value above the expected rewards")
print("   - Exploration decreases over time\n")

print("4. Pure greedy fails dramatically")
print("   - Gets stuck on first good option")
print("   - Demonstrates importance of exploration\n")

print("💡 Key Takeaway:")
print("   The exploration-exploitation dilemma is fundamental to RL.")
print("   Different strategies offer different trade-offs, but all successful")
print("   approaches balance trying new things with using what works.")

<a id='mdp'></a>
### Core Terminology and MDP Framework

Now that we've explored the multi-armed bandit problem, let's expand our understanding to more complex reinforcement learning scenarios. We'll introduce the fundamental terminology and the Markov Decision Process (MDP) framework that underlies most RL algorithms.

#### Fundamental RL Terminology

Before diving into MDPs, let's clearly define the core concepts that appear in every RL problem:

**1. Agent**
- The learner and decision maker
- Observes the environment and takes actions
- Goal: Learn a policy that maximizes cumulative reward
- Example: A robot, a game-playing AI, a trading algorithm

**2. Environment**
- Everything outside the agent
- Responds to the agent's actions
- Provides observations and rewards
- Example: The physical world, a game board, a stock market

**3. State (s)**
- A representation of the current situation
- Contains all relevant information for decision making
- Can be fully observable or partially observable
- Example: Robot's position and velocity, chess board configuration, account balance

**4. Action (a)**
- A choice the agent can make
- Can be discrete (finite set) or continuous (infinite range)
- Available actions may depend on the current state
- Example: Move left/right, place chess piece, buy/sell/hold

**5. Reward (r)**
- Immediate feedback signal from the environment
- Scalar value indicating how good/bad an action was
- The agent's goal is to maximize cumulative reward
- Example: +1 for reaching goal, -1 for collision, profit/loss amount

**The Agent-Environment Interface:**

```
     ┌─────────┐
     │  Agent  │
     └────┬────┘
          │
    action│ ↓
     ┌────┴────────┐
     │ Environment │
     └────┬────────┘
          │
  state,  │ ↑
  reward  │
     ┌────┴────┐
     │  Agent  │
     └─────────┘
```
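
The loop in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions: `CorridorEnv`, `run_episode`, and `always_right` are hypothetical names invented for this example, and the reward values (+1.0 at the goal, -0.1 per step) are arbitrary.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: 0 = left, 1 = right; moves are clipped at the corridor ends
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.1
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return 1


total, steps = run_episode(CorridorEnv(), always_right)
print(f"steps={steps}, total_reward={total:.1f}")
```

The agent only ever sees what `step` returns (state, reward, and a done flag), which is the same separation between agent and environment shown in the diagram.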

Let's implement a simple environment class to demonstrate these concepts:

In [None]:
class GridWorldEnvironment:
    """A simple grid world environment demonstrating RL concepts.
    
    The agent navigates a 2D grid to reach a goal while avoiding obstacles.
    This demonstrates: states (grid positions), actions (movements),
    rewards (goal/obstacle/step), and the agent-environment interaction.
    """
    
    def __init__(self, grid_size=5, goal_pos=(4, 4), obstacles=None):
        """Initialize the grid world.
        
        Args:
            grid_size: Size of the square grid
            goal_pos: (row, col) position of the goal
            obstacles: List of (row, col) positions that are obstacles
        """
        self.grid_size = grid_size
        self.goal_pos = goal_pos
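        # Use the default obstacles only when none are given (an empty list means "no obstacles")
        self.obstacles = obstacles if obstacles is not None else [(2, 2), (3, 2)]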
        
        # Action space: 0=up, 1=right, 2=down, 3=left
        self.actions = ['UP', 'RIGHT', 'DOWN', 'LEFT']
        self.action_effects = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        
        # Initialize state
        self.agent_pos = None
        self.reset()
    
    def reset(self):
        """Reset environment to initial state.
        
        Returns:
            state: Initial state (agent position)
        """
        # Start at top-left corner
        self.agent_pos = (0, 0)
        return self.agent_pos
    
    def step(self, action):
        """Execute an action and return the result.
        
        Args:
            action: Integer 0-3 representing direction
            
        Returns:
            next_state: New agent position
            reward: Reward for this transition
            done: Whether episode is finished
            info: Additional information (dict)
        """
        # Calculate new position
        delta = self.action_effects[action]
        new_row = self.agent_pos[0] + delta[0]
        new_col = self.agent_pos[1] + delta[1]
        new_pos = (new_row, new_col)
        
        # Check if new position is valid
        if self._is_valid_position(new_pos):
            self.agent_pos = new_pos
        # If invalid (wall), agent stays in place
        
        # Calculate reward and check if done
        reward, done, info = self._get_reward_and_done()
        
        return self.agent_pos, reward, done, info
    
    def _is_valid_position(self, pos):
        """Check if position is within bounds and not an obstacle."""
        row, col = pos
        
        # Check bounds
        if row < 0 or row >= self.grid_size or col < 0 or col >= self.grid_size:
            return False
        
        # Check obstacles
        if pos in self.obstacles:
            return False
        
        return True
    
    def _get_reward_and_done(self):
        """Calculate reward and check if episode is done."""
        info = {}
        
        # Check if reached goal
        if self.agent_pos == self.goal_pos:
            return 10.0, True, {'reason': 'goal_reached'}
        
        # Check if hit obstacle (shouldn't happen with valid position check)
        if self.agent_pos in self.obstacles:
            return -10.0, True, {'reason': 'obstacle_hit'}
        
        # Small negative reward for each step (encourages efficiency)
        return -0.1, False, {'reason': 'step'}
    
    def render(self):
        """Display the current state of the grid world."""
        grid = [['.' for _ in range(self.grid_size)] for _ in range(self.grid_size)]
        
        # Mark obstacles
        for obs in self.obstacles:
            grid[obs[0]][obs[1]] = 'X'
        
        # Mark goal
        grid[self.goal_pos[0]][self.goal_pos[1]] = 'G'
        
        # Mark agent
        grid[self.agent_pos[0]][self.agent_pos[1]] = 'A'
        
        # Print grid
        print('\n' + '─' * (self.grid_size * 2 + 1))
        for row in grid:
            print('│' + ' '.join(row) + '│')
        print('─' * (self.grid_size * 2 + 1))
        print(f"Agent at: {self.agent_pos}")


# Demonstrate the environment and core concepts
print("Demonstrating Core RL Concepts with Grid World\n")
print("="*60)

# Create environment
env = GridWorldEnvironment(grid_size=5)

print("\n🌍 ENVIRONMENT: 5x5 Grid World")
print("   - Goal: Reach position (4,4) marked with 'G'")
print("   - Obstacles: Positions marked with 'X'")
print("   - Agent: Current position marked with 'A'")

print("\n📍 STATE: Agent's position in the grid (row, col)")
print(f"   - Initial state: {env.agent_pos}")
print(f"   - State space size: {env.grid_size * env.grid_size} possible positions")

print("\n🎮 ACTIONS: Four possible movements")
for i, action_name in enumerate(env.actions):
    print(f"   - Action {i}: {action_name}")

print("\n🎁 REWARDS:")
print("   - Reach goal: +10.0")
print("   - Each step: -0.1 (encourages efficiency)")
print("   - Hit wall or obstacle: Agent stays in place (the -0.1 step cost still applies)")

print("\n" + "="*60)
print("\nInitial State:")
env.render()

# Simulate a few steps
print("\n" + "="*60)
print("Simulating Agent-Environment Interaction:\n")

# RIGHT x4, then DOWN x4: reaches the goal at (4,4) while avoiding the obstacles at (2,2) and (3,2)
actions_to_take = [1, 1, 1, 1, 2, 2, 2, 2]
total_reward = 0

for step, action in enumerate(actions_to_take, 1):
    action_name = env.actions[action]
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    
    print(f"Step {step}:")
    print(f"  Action: {action_name}")
    print(f"  New State: {next_state}")
    print(f"  Reward: {reward:+.1f}")
    print(f"  Total Reward: {total_reward:+.1f}")
    print(f"  Done: {done}")
    
    if done:
        print(f"\n✓ Episode finished: {info['reason']}")
        env.render()
        break
    print()

print("\n" + "="*60)
print("\n💡 Key Observations:")
print("   1. STATE: Represents where the agent is")
print("   2. ACTION: What the agent chooses to do")
print("   3. REWARD: Feedback on how good the action was")
print("   4. ENVIRONMENT: Determines next state and reward")
print("   5. AGENT: Would learn which actions to take in each state")
print("\n   This interaction loop is the foundation of all RL!")

#### Visualizing Agent-Environment Interaction

In [None]:
# Create a visualization of the agent-environment interaction
import matplotlib.patches as patches

def visualize_grid_world(env, trajectory=None):
    """Visualize the grid world and optionally a trajectory.
    
    Args:
        env: GridWorldEnvironment instance
        trajectory: List of (state, action) tuples to visualize
    """
    fig, ax = plt.subplots(figsize=(8, 8))
    
    # Draw grid
    for i in range(env.grid_size + 1):
        ax.plot([0, env.grid_size], [i, i], 'k-', linewidth=1)
        ax.plot([i, i], [0, env.grid_size], 'k-', linewidth=1)
    
    # Draw obstacles
    for obs in env.obstacles:
        rect = patches.Rectangle((obs[1], env.grid_size - obs[0] - 1), 1, 1, 
                                 linewidth=2, edgecolor='black', facecolor='gray', alpha=0.7)
        ax.add_patch(rect)
        ax.text(obs[1] + 0.5, env.grid_size - obs[0] - 0.5, 'X', 
               ha='center', va='center', fontsize=20, fontweight='bold')
    
    # Draw goal
    goal = env.goal_pos
    rect = patches.Rectangle((goal[1], env.grid_size - goal[0] - 1), 1, 1, 
                             linewidth=2, edgecolor='green', facecolor='lightgreen', alpha=0.7)
    ax.add_patch(rect)
    ax.text(goal[1] + 0.5, env.grid_size - goal[0] - 0.5, 'G', 
           ha='center', va='center', fontsize=20, fontweight='bold', color='darkgreen')
    
    # Draw start position
    start = (0, 0)
    rect = patches.Rectangle((start[1], env.grid_size - start[0] - 1), 1, 1, 
                             linewidth=2, edgecolor='blue', facecolor='lightblue', alpha=0.5)
    ax.add_patch(rect)
    ax.text(start[1] + 0.5, env.grid_size - start[0] - 0.5, 'S', 
           ha='center', va='center', fontsize=20, fontweight='bold', color='darkblue')
    
    # Draw trajectory if provided
    if trajectory:
        for i, (state, action) in enumerate(trajectory):
            row, col = state
            # Convert to plot coordinates
            x = col + 0.5
            y = env.grid_size - row - 0.5
            
            # Draw step number
            ax.text(x, y, str(i+1), ha='center', va='center', 
                   fontsize=12, color='red', fontweight='bold',
                   bbox=dict(boxstyle='circle', facecolor='white', alpha=0.8))
            
            # Draw arrow for action
            if action is not None:
                delta = env.action_effects[action]
                dx = delta[1] * 0.3
                dy = -delta[0] * 0.3  # Negative because y-axis is flipped
                ax.arrow(x, y, dx, dy, head_width=0.15, head_length=0.1, 
                        fc='red', ec='red', alpha=0.6, linewidth=2)
    
    ax.set_xlim(0, env.grid_size)
    ax.set_ylim(0, env.grid_size)
    ax.set_aspect('equal')
    ax.set_xticks(range(env.grid_size + 1))
    ax.set_yticks(range(env.grid_size + 1))
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_title('Grid World Environment\nS=Start, G=Goal, X=Obstacle', 
                fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    return fig, ax


# Visualize the environment
env = GridWorldEnvironment(grid_size=5)

# Create a sample trajectory
trajectory = []
state = env.reset()
# RIGHT x4, then DOWN x4: reaches the goal while avoiding the obstacles at (2,2) and (3,2)
actions = [1, 1, 1, 1, 2, 2, 2, 2]

for action in actions:
    trajectory.append((state, action))
    state, reward, done, info = env.step(action)
    if done:
        trajectory.append((state, None))  # Final state, no action
        break

# Create visualization
fig, ax = visualize_grid_world(env, trajectory)
plt.show()

print("\n📊 Visualization shows:")
print("   - Blue 'S': Starting state")
print("   - Green 'G': Goal state")
print("   - Gray 'X': Obstacles")
print("   - Red numbers: Step sequence")
print("   - Red arrows: Actions taken")
print("\nThis illustrates how the agent navigates through states")
print("by taking actions to reach the goal!")

#### Markov Decision Processes (MDPs)

**The Mathematical Framework for Reinforcement Learning**

Now that we understand the basic terminology, let's formalize these concepts using the **Markov Decision Process (MDP)** framework. MDPs provide the mathematical foundation for most reinforcement learning algorithms.

**What is an MDP?**

A Markov Decision Process is a mathematical model for sequential decision-making under uncertainty. It's defined by a tuple $(S, A, P, R, \gamma)$:

**MDP Components:**

1. **$S$: State Space**
   - Set of all possible states
   - Can be finite (grid positions) or infinite (continuous positions)
   - Example: $S = \{(0,0), (0,1), \dots, (4,4)\}$ for a 5×5 grid

2. **$A$: Action Space**
   - Set of all possible actions
   - Can be state-dependent: $A(s)$ = actions available in state $s$
   - Example: $A = \{\text{UP, DOWN, LEFT, RIGHT}\}$

3. **$P$: Transition Probability Function**
   - $P(s'|s,a)$ = probability of reaching state $s'$ from state $s$ after taking action $a$
   - Defines the dynamics of the environment
   - Must satisfy: $\sum_{s' \in S} P(s'|s,a) = 1$ for all $s, a$
   - Example: In deterministic grid world, $P(s'|s,a) = 1$ for one $s'$ and 0 for others

4. **$R$: Reward Function**
   - $R(s, a, s')$ = immediate reward for transition from $s$ to $s'$ via action $a$
   - Sometimes simplified as $R(s)$ or $R(s,a)$
   - Defines the objective the agent should optimize
   - Example: $R(s_{goal}) = +10$, $R(s_{other}) = -0.1$

5. **$\gamma$: Discount Factor**
   - Value between 0 and 1 that determines importance of future rewards
   - $\gamma = 0$: Only immediate rewards matter (myopic)
   - $\gamma = 1$: All future rewards equally important (far-sighted)
   - Typical values: 0.9, 0.95, 0.99
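
A short numerical sketch shows how $\gamma$ changes the value of the same reward sequence. It assumes the standard discounted return $G = \sum_k \gamma^k r_k$, where $r_k$ is the reward received $k$ steps into the future. `discounted_return` is a hypothetical helper name, and the rewards mimic an 8-step grid-world episode: seven steps at -0.1 followed by +10 at the goal.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], accumulating from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Seven step penalties followed by the goal reward
rewards = [-0.1] * 7 + [10.0]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.3f}")
```

With $\gamma = 0$ only the first step penalty counts. As $\gamma$ approaches 1, the distant goal reward dominates the return.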

**The Markov Property**

The "Markov" in MDP refers to the **Markov Property** (also called the memoryless property):

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$$

**In plain English:** The future depends only on the present, not on the past.
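
The property can be checked empirically on a small example. The sketch below uses a hypothetical two-state weather chain (actions are omitted for brevity, and the probabilities are made up). It estimates the chance of "sunny" following "sunny" separately for each possible previous state. If the chain is Markov, the two estimates should agree up to sampling noise.

```python
import random
from collections import Counter

# Hypothetical transition table: P[s][s'] is the probability of moving from s to s'.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_next(state, rng):
    """Sample the next state using only the current state (no history is passed in)."""
    next_states = list(P[state])
    weights = [P[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

rng = random.Random(0)
chain = ["sunny"]
for _ in range(50_000):
    chain.append(sample_next(chain[-1], rng))

# Count next states after "sunny", split by what the state was one step earlier.
counts = {"sunny": Counter(), "rainy": Counter()}
for prev, cur, nxt in zip(chain, chain[1:], chain[2:]):
    if cur == "sunny":
        counts[prev][nxt] += 1

estimates = {prev: c["sunny"] / sum(c.values()) for prev, c in counts.items()}
for prev, p in estimates.items():
    print(f"P(next=sunny | current=sunny, previous={prev}) ≈ {p:.3f}")
```

Both estimates should land close to 0.8: knowing the previous state adds nothing once the current state is known.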

**Why is the Markov Property Important?**

1. **Tractability**: Makes the problem computationally feasible
   - Don't need to remember entire history
   - State contains all relevant information

2. **Simplifies Learning**: Agent only needs to learn from current state
   - No need to condition on past states
   - Enables dynamic programming and temporal difference learning

3. **Theoretical Guarantees**: Most RL theory assumes Markov property
   - Convergence proofs rely on it
   - For finite MDPs, an optimal policy that depends only on the current state is guaranteed to exist

**Example - Chess:**
- **Markov**: Current board position is the state (contains all relevant info)
- **Non-Markov**: Only knowing the last move (need full game history)

**When the Markov Property Doesn't Hold:**

In practice, many problems are **Partially Observable MDPs (POMDPs)** where:
- Agent doesn't see the full state
- Must infer state from observations
- Example: Robot with limited sensors, poker (can't see opponent's cards)
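
Partial observability can be illustrated with a sensor that hides part of the state. In this minimal sketch, `observe` is a hypothetical observation function that reports only the column of a `(row, col)` grid state, so several distinct states look identical to the agent.

```python
def observe(state):
    """Hypothetical sensor: reports only the column of a (row, col) state."""
    row, col = state
    return col

# All states of a 3x3 grid
states = [(r, c) for r in range(3) for c in range(3)]

# Group states by the observation they produce
aliases = {}
for s in states:
    aliases.setdefault(observe(s), []).append(s)

for obs, group in aliases.items():
    print(f"observation {obs} <- states {group}")
```

Each observation maps back to three different states, so the observation alone is not a Markov state. The agent would need memory or a belief over states to act well.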

Let's implement an MDP simulator to demonstrate these concepts:

In [None]:
class SimpleMDP:
    """A simple MDP simulator with explicit transition probabilities.
    
    This class demonstrates the core MDP components:
    - State space S
    - Action space A  
    - Transition probabilities P(s'|s,a)
    - Reward function R(s,a,s')
    - Discount factor gamma
    """
    
    def __init__(self, states, actions, transitions, rewards, gamma=0.9):
        """Initialize the MDP.
        
        Args:
            states: List of state identifiers
            actions: List of action identifiers
            transitions: Dict mapping (state, action) -> {next_state: probability}
            rewards: Dict mapping (state, action, next_state) -> reward
            gamma: Discount factor (0 to 1)
        """
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.rewards = rewards
        self.gamma = gamma
        
        self.current_state = None
        
        # Verify transition probabilities sum to 1
        self._verify_transitions()
    
    def _verify_transitions(self):
        """Verify that transition probabilities are valid."""
        for (state, action), next_states in self.transitions.items():
            total_prob = sum(next_states.values())
            if not np.isclose(total_prob, 1.0):
                raise ValueError(
                    f"Transition probabilities for ({state}, {action}) sum to {total_prob}, not 1.0"
                )
    
    def reset(self, initial_state=None):
        """Reset to initial state.
        
        Args:
            initial_state: Starting state (random if None)
            
        Returns:
            state: The initial state
        """
        if initial_state is None:
            # Pick by index so the state keeps its original Python type
            self.current_state = self.states[np.random.randint(len(self.states))]
        else:
            self.current_state = initial_state
        return self.current_state
    
    def step(self, action):
        """Take an action and transition to next state.
        
        Args:
            action: Action to take
            
        Returns:
            next_state: The resulting state
            reward: Reward received
            info: Additional information
        """
        if self.current_state is None:
            raise ValueError("Must call reset() before step()")
        
        # Get transition probabilities for current state and action
        next_state_probs = self.transitions.get((self.current_state, action), {})
        
        if not next_state_probs:
            raise ValueError(f"No transitions defined for state {self.current_state}, action {action}")
        
        # Sample next state according to transition probabilities
        # (sample an index so the state keeps its original Python type)
        next_states = list(next_state_probs.keys())
        probabilities = list(next_state_probs.values())
        next_state = next_states[np.random.choice(len(next_states), p=probabilities)]
        
        # Get reward
        reward = self.rewards.get((self.current_state, action, next_state), 0.0)
        
        # Update current state
        old_state = self.current_state
        self.current_state = next_state
        
        info = {
            'old_state': old_state,
            'action': action,
            'probability': next_state_probs[next_state]
        }
        
        return next_state, reward, info
    
    def get_transition_prob(self, state, action, next_state):
        """Get P(next_state | state, action)."""
        return self.transitions.get((state, action), {}).get(next_state, 0.0)
    
    def get_reward(self, state, action, next_state):
        """Get R(state, action, next_state)."""
        return self.rewards.get((state, action, next_state), 0.0)


# Create a simple 2x2 grid world MDP
print("Simple 2x2 Grid World MDP\n")
print("="*60)

# Define the MDP components
# States: positions in 2x2 grid
states = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']

# Actions: move right or down
actions = ['RIGHT', 'DOWN']

# Transitions: P(s'|s,a)
# In this simple example, actions are deterministic
transitions = {
    ('(0,0)', 'RIGHT'): {'(0,1)': 1.0},
    ('(0,0)', 'DOWN'): {'(1,0)': 1.0},
    ('(0,1)', 'RIGHT'): {'(0,1)': 1.0},  # Hit wall, stay in place
    ('(0,1)', 'DOWN'): {'(1,1)': 1.0},
    ('(1,0)', 'RIGHT'): {'(1,1)': 1.0},
    ('(1,0)', 'DOWN'): {'(1,0)': 1.0},  # Hit wall, stay in place
    ('(1,1)', 'RIGHT'): {'(1,1)': 1.0},  # Goal state, stay
    ('(1,1)', 'DOWN'): {'(1,1)': 1.0},   # Goal state, stay
}

# Rewards: R(s,a,s')
rewards = {
    ('(0,0)', 'RIGHT', '(0,1)'): -1,
    ('(0,0)', 'DOWN', '(1,0)'): -1,
    ('(0,1)', 'RIGHT', '(0,1)'): -1,
    ('(0,1)', 'DOWN', '(1,1)'): 10,  # Reaching goal
    ('(1,0)', 'RIGHT', '(1,1)'): 10,  # Reaching goal
    ('(1,0)', 'DOWN', '(1,0)'): -1,
    ('(1,1)', 'RIGHT', '(1,1)'): 0,  # At goal
    ('(1,1)', 'DOWN', '(1,1)'): 0,   # At goal
}

# Create MDP
mdp = SimpleMDP(states, actions, transitions, rewards, gamma=0.9)

print("MDP Components:")
print(f"\n1. State Space S: {states}")
print(f"   |S| = {len(states)} states")

print(f"\n2. Action Space A: {actions}")
print(f"   |A| = {len(actions)} actions")

print(f"\n3. Discount Factor γ: {mdp.gamma}")

print("\n4. Transition Function P(s'|s,a):")
print("   Example: P((0,1) | (0,0), RIGHT) =", mdp.get_transition_prob('(0,0)', 'RIGHT', '(0,1)'))
print("   Example: P((1,0) | (0,0), DOWN) =", mdp.get_transition_prob('(0,0)', 'DOWN', '(1,0)'))

print("\n5. Reward Function R(s,a,s'):")
print("   Example: R((0,0), RIGHT, (0,1)) =", mdp.get_reward('(0,0)', 'RIGHT', '(0,1)'))
print("   Example: R((0,1), DOWN, (1,1)) =", mdp.get_reward('(0,1)', 'DOWN', '(1,1)'))

print("\n" + "="*60)
print("\nGrid Layout:")
print("  (0,0) → (0,1)")
print("    ↓       ↓")
print("  (1,0) → (1,1) [GOAL]")
print("\nGoal: Reach (1,1) from (0,0)")

#### Simulating the MDP

In [None]:
# Simulate episodes in the MDP
print("Simulating MDP Episodes\n")
print("="*60)

# Run a few episodes with random actions
num_episodes = 3
max_steps = 5

for episode in range(1, num_episodes + 1):
    print(f"\nEpisode {episode}:")
    print("-" * 40)
    
    state = mdp.reset(initial_state='(0,0)')
    total_reward = 0
    
    print(f"Initial state: {state}")
    
    for step in range(max_steps):
        # Random action selection
        action = np.random.choice(mdp.actions)
        
        # Take action
        next_state, reward, info = mdp.step(action)
        total_reward += reward
        
        print(f"  Step {step+1}: {info['old_state']} --[{action}]--> {next_state}")
        print(f"           Reward: {reward:+.0f}, Total: {total_reward:+.0f}")
        
        # Update the tracked state first so the post-loop check
        # below sees the state the episode actually ended in
        state = next_state
        
        # Check if reached goal
        if state == '(1,1)':
            print(f"\n  ✓ Reached goal in {step+1} steps!")
            break
    
    if state != '(1,1)':
        print(f"\n  ✗ Did not reach goal in {max_steps} steps")

print("\n" + "="*60)
print("\n💡 Key MDP Concepts Demonstrated:")
print("   1. States: Discrete positions in the grid")
print("   2. Actions: RIGHT and DOWN movements")
print("   3. Transitions: Deterministic (probability = 1.0)")
print("   4. Rewards: Negative for steps, positive for goal")
print("   5. Markov Property: Next state depends only on current state and action")

#### Visualizing State Transitions

In [None]:
# Visualize the MDP as a state transition diagram
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def visualize_mdp_transitions(mdp):
    """Create a visualization of MDP state transitions."""
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Define state positions for visualization
    state_positions = {
        '(0,0)': (1, 3),
        '(0,1)': (3, 3),
        '(1,0)': (1, 1),
        '(1,1)': (3, 1)
    }
    
    # Draw states
    for state, (x, y) in state_positions.items():
        if state == '(1,1)':
            # Goal state - green
            circle = plt.Circle((x, y), 0.3, color='lightgreen', ec='darkgreen', linewidth=3)
            ax.add_patch(circle)
            ax.text(x, y, state + '\nGOAL', ha='center', va='center', 
                   fontsize=11, fontweight='bold')
        elif state == '(0,0)':
            # Start state - blue
            circle = plt.Circle((x, y), 0.3, color='lightblue', ec='darkblue', linewidth=3)
            ax.add_patch(circle)
            ax.text(x, y, state + '\nSTART', ha='center', va='center', 
                   fontsize=11, fontweight='bold')
        else:
            # Regular state - white
            circle = plt.Circle((x, y), 0.3, color='white', ec='black', linewidth=2)
            ax.add_patch(circle)
            ax.text(x, y, state, ha='center', va='center', 
                   fontsize=12, fontweight='bold')
    
    # Draw transitions
    for (state, action), next_states in mdp.transitions.items():
        for next_state, prob in next_states.items():
            if state == next_state:
                # Self-loop (hitting wall or at goal)
                continue
            
            x1, y1 = state_positions[state]
            x2, y2 = state_positions[next_state]
            
            # Calculate arrow position
            dx = x2 - x1
            dy = y2 - y1
            length = np.sqrt(dx**2 + dy**2)
            
            # Normalize and shorten to account for circle radius
            dx_norm = dx / length
            dy_norm = dy / length
            
            start_x = x1 + dx_norm * 0.35
            start_y = y1 + dy_norm * 0.35
            end_x = x2 - dx_norm * 0.35
            end_y = y2 - dy_norm * 0.35
            
            # Get reward for this transition
            reward = mdp.get_reward(state, action, next_state)
            
            # Color based on action
            color = 'blue' if action == 'RIGHT' else 'red'
            
            # Draw arrow
            ax.annotate('', xy=(end_x, end_y), xytext=(start_x, start_y),
                       arrowprops=dict(arrowstyle='->', lw=2, color=color, alpha=0.7))
            
            # Add label
            mid_x = (start_x + end_x) / 2
            mid_y = (start_y + end_y) / 2
            label = f"{action}\nR={reward:+.0f}"
            ax.text(mid_x, mid_y, label, ha='center', va='center',
                   fontsize=9, bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    # Add legend
    right_patch = mpatches.Patch(color='blue', label='RIGHT action')
    down_patch = mpatches.Patch(color='red', label='DOWN action')
    ax.legend(handles=[right_patch, down_patch], loc='upper right', fontsize=11)
    
    ax.set_xlim(0, 4)
    ax.set_ylim(0, 4)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title('MDP State Transition Diagram\n(Arrows show actions and rewards)', 
                fontsize=14, fontweight='bold', pad=20)
    
    plt.tight_layout()
    return fig, ax

# Create visualization
fig, ax = visualize_mdp_transitions(mdp)
plt.show()

print("\n📊 Transition Diagram shows:")
print("   - Blue arrows: RIGHT actions")
print("   - Red arrows: DOWN actions")
print("   - Labels show: Action name and Reward")
print("   - Green circle: Goal state (1,1)")
print("   - Blue circle: Start state (0,0)")
print("\nThis visualizes the complete MDP structure:")
print("how states connect through actions and what rewards are received!")

#### Discounted Return and Value Functions

**From Immediate Rewards to Long-Term Value**

In reinforcement learning, we don't just care about immediate rewards - we want to maximize the **total reward over time**. This leads us to the concepts of return and value functions.

**The Return (Cumulative Reward)**

The **return** $G_t$ at time $t$ is the total discounted reward from that point forward:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + ...$

Or more compactly:

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
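
A short worked step, added here because the Bellman equations later in this section build on it: factoring $\gamma$ out of every term after the first shows that the return satisfies a one-step recursion.

$G_t = R_{t+1} + \gamma \left(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + ...\right) = R_{t+1} + \gamma G_{t+1}$

As an illustrative check with rewards [1, 2, 3] and $\gamma = 0.9$ (numbers chosen only for this example): $G_2 = 3$, $G_1 = 2 + 0.9 \times 3 = 4.7$, and $G_0 = 1 + 0.9 \times 4.7 = 5.23$, which matches the direct sum $1 + 0.9 \times 2 + 0.81 \times 3 = 5.23$.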

**The Discount Factor γ (Gamma)**

The discount factor $\gamma \in [0, 1]$ determines how much we value future rewards:

- **$\gamma = 0$**: Only immediate reward matters (myopic)
  - $G_t = R_{t+1}$
  - Agent is short-sighted

- **$\gamma = 1$**: All future rewards equally important (far-sighted)
  - $G_t = R_{t+1} + R_{t+2} + R_{t+3} + ...$
  - Can lead to infinite returns in continuing tasks

- **$\gamma \in (0, 1)$**: Balance between immediate and future rewards
  - Typical values: 0.9, 0.95, 0.99
  - Ensures finite returns even over infinite horizons, provided the rewards are bounded

**Why Discount Future Rewards?**

1. **Mathematical Convenience**: Ensures convergence for infinite horizons
2. **Uncertainty**: Future is uncertain, so future rewards are less reliable
3. **Preference**: Often prefer rewards sooner rather than later
4. **Computational**: Makes the problem tractable
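
The convergence point in item 1 can be checked numerically. The sketch below assumes a constant reward of 1 per step (an illustrative choice, not the notebook's grid world) and compares a long truncated sum with the geometric-series limit $1/(1-\gamma)$. The helper name `truncated_return` is hypothetical.

```python
def truncated_return(reward, gamma, num_steps):
    """Sum reward * gamma**k for k = 0 .. num_steps - 1."""
    return sum(reward * gamma ** k for k in range(num_steps))


for gamma in (0.5, 0.9, 0.99):
    # Closed-form limit of the geometric series for a constant reward of 1
    limit = 1.0 / (1.0 - gamma)
    approx = truncated_return(1.0, gamma, num_steps=5000)
    print(f"gamma={gamma}: truncated sum={approx:.4f}, limit={limit:.4f}")
```

For $\gamma < 1$ the truncated sum settles at the limit; with $\gamma = 1$ the same sum would simply keep growing with `num_steps`.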

**Example - Effect of Gamma:**

Suppose we receive rewards: [1, 1, 1, 1, 1]

- $\gamma = 0.0$: $G = 1$ (only first reward)
- $\gamma = 0.5$: $G = 1 + 0.5 + 0.25 + 0.125 + 0.0625 = 1.9375$
- $\gamma = 0.9$: $G = 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951$
- $\gamma = 1.0$: $G = 5$ (all rewards equally)

Let's implement a function to calculate discounted returns:

In [None]:
def calculate_discounted_return(rewards, gamma):
    """Calculate the discounted return for a sequence of rewards.
    
    Args:
        rewards: List or array of rewards [r1, r2, r3, ...]
        gamma: Discount factor (0 to 1)
        
    Returns:
        G: The discounted return
    """
    G = 0
    for t, reward in enumerate(rewards):
        G += (gamma ** t) * reward
    return G


def calculate_returns_to_go(rewards, gamma):
    """Calculate return-to-go for each time step.
    
    Return-to-go at time t is the discounted sum of rewards from t onward.
    
    Args:
        rewards: List of rewards [r1, r2, r3, ...]
        gamma: Discount factor
        
    Returns:
        returns: List of returns-to-go [G0, G1, G2, ...]
    """
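    returns = []
    G = 0
    
    # Work backwards so each return reuses the next one: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)
    
    # Returns were collected last-to-first; restore chronological order
    returns.reverse()
    return returns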


# Demonstrate discounted return calculation
print("Discounted Return Calculation\n")
print("="*60)

# Example reward sequence
rewards = [1, 1, 1, 1, 1]
print(f"Reward sequence: {rewards}\n")

# Calculate for different gamma values
gamma_values = [0.0, 0.5, 0.9, 0.99, 1.0]

print("Effect of Discount Factor γ:")
print("-" * 60)
print(f"{'γ':<10} {'Discounted Return':<20} {'Interpretation'}")
print("-" * 60)

for gamma in gamma_values:
    G = calculate_discounted_return(rewards, gamma)
    
    if gamma == 0.0:
        interp = "Only immediate reward"
    elif gamma == 1.0:
        interp = "All rewards equally"
    elif gamma < 0.5:
        interp = "Very myopic"
    elif gamma < 0.9:
        interp = "Moderately far-sighted"
    else:
        interp = "Very far-sighted"
    
    print(f"{gamma:<10.2f} {G:<20.4f} {interp}")

print("\n" + "="*60)

# Example with varying rewards
print("\nExample with Varying Rewards:\n")
rewards2 = [1, 2, 3, 4, 5]
gamma = 0.9

print(f"Rewards: {rewards2}")
print(f"γ = {gamma}\n")

G = calculate_discounted_return(rewards2, gamma)
print(f"Total discounted return: {G:.4f}")

# Show the calculation step by step
print("\nStep-by-step calculation:")
print(f"G = {rewards2[0]} + {gamma}×{rewards2[1]} + {gamma}²×{rewards2[2]} + {gamma}³×{rewards2[3]} + {gamma}⁴×{rewards2[4]}")
print(f"G = {rewards2[0]} + {gamma*rewards2[1]:.2f} + {gamma**2*rewards2[2]:.2f} + {gamma**3*rewards2[3]:.2f} + {gamma**4*rewards2[4]:.2f}")
print(f"G = {G:.4f}")

# Calculate returns-to-go
returns_to_go = calculate_returns_to_go(rewards2, gamma)
print("\nReturns-to-go at each time step:")
for t, (r, G_t) in enumerate(zip(rewards2, returns_to_go)):
    print(f"  t={t}: Reward={r}, Return-to-go G_{t}={G_t:.4f}")

#### Value Functions: Estimating Long-Term Value

**From Returns to Value Functions**

While the return $G_t$ tells us the actual cumulative reward from a specific trajectory, **value functions** tell us the **expected** return from a state or state-action pair.

**State-Value Function V(s)**

The **state-value function** $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$:

$V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$

$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$

**Interpretation:**
- "How good is it to be in state $s$?"
- Expected cumulative reward if we start in $s$ and follow policy $\pi$
- Depends on the policy being followed
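
Because $V^\pi(s)$ is an expectation, it can be approximated by averaging sampled returns. The sketch below assumes a hypothetical one-step task (not the grid world above): from the start state the agent receives +10 with probability 0.8 and -1 otherwise, after which the episode ends. The function name `estimate_state_value` and the seed are illustrative choices.

```python
import numpy as np


def estimate_state_value(num_episodes, seed=0):
    """Average sampled one-step returns to approximate V(s)."""
    rng = np.random.default_rng(seed)
    # Assumed task: reward +10 with probability 0.8, otherwise -1, then the episode ends
    success = rng.random(num_episodes) < 0.8
    returns = np.where(success, 10.0, -1.0)
    return returns.mean()


exact_value = 0.8 * 10 + 0.2 * (-1)  # expectation computed by hand
estimate = estimate_state_value(20_000)
print(f"Exact V(s) = {exact_value:.2f}, sampled estimate = {estimate:.2f}")
```

The estimate fluctuates around the exact value and tightens as the number of sampled episodes grows.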

**Action-Value Function Q(s,a)**

The **action-value function** $Q^\pi(s,a)$ is the expected return starting from state $s$, taking action $a$, then following policy $\pi$:

$Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$

$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$

**Interpretation:**
- "How good is it to take action $a$ in state $s$?"
- Expected cumulative reward if we start in $s$, take action $a$, then follow $\pi$
- Also called Q-values (hence "Q-learning")

**Relationship Between V and Q:**

The state-value is the expected action-value under the policy:

$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$

For a deterministic policy that always chooses action $a^*$ in state $s$:

$V^\pi(s) = Q^\pi(s, a^*)$

**Optimal Value Functions:**

The **optimal state-value function** $V^*(s)$ is the maximum value achievable in state $s$:

$V^*(s) = \max_\pi V^\pi(s)$

The **optimal action-value function** $Q^*(s,a)$ is the maximum value achievable by taking action $a$ in state $s$:

$Q^*(s,a) = \max_\pi Q^\pi(s,a)$

**Key Insight:**
If we know $Q^*(s,a)$ for all states and actions, we can act optimally by choosing:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

This is why Q-learning is so powerful - it learns $Q^*$ directly!
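
The $\arg\max$ rule above can be sketched as a small lookup over a Q-table. The Q-values, state names, and the helper `greedy_policy` below are hypothetical and chosen only for illustration; the resulting policy is optimal only if the table actually equals $Q^*$.

```python
# Hypothetical Q-values for illustration; a real agent would have to learn these
q_table = {
    ('s0', 'LEFT'): 1.0, ('s0', 'RIGHT'): 2.5,
    ('s1', 'LEFT'): 0.3, ('s1', 'RIGHT'): -1.0,
}


def greedy_policy(q_table):
    """Return {state: argmax_a Q(state, a)} for every state in the table."""
    best = {}  # state -> (best action so far, its Q-value)
    for (state, action), value in q_table.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}


policy = greedy_policy(q_table)
print(policy)
```

Ties are broken in favour of the action seen first, which is an arbitrary choice made for this sketch.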

Let's demonstrate these concepts with examples:

In [None]:
# Demonstrate value functions with simple examples
print("Value Functions: V(s) and Q(s,a)\n")
print("="*60)

# Simple example: 3-state chain
# States: S0 -> S1 -> S2 (terminal)
# Actions: FORWARD (deterministic)
# Rewards: 0 for S0 -> S1, +10 for S1 -> S2 (received on entering the terminal state)

print("Example: Simple 3-State Chain")
print("\nStates: S0 → S1 → S2 (terminal)")
print("Action: FORWARD (deterministic)")
print("Rewards: R(S0→S1)=0, R(S1→S2)=10")
print("\n" + "-"*60)

gamma = 0.9
print(f"\nDiscount factor γ = {gamma}")

# Calculate V(s) for each state using V(s) = R + γ * V(s')
# V(S2) = 0 (terminal state, no future rewards)
# V(S1) = 10 + γ * V(S2) = 10.0  (the +10 arrives on the very next step, so it is not discounted)
# V(S0) = 0 + γ * V(S1) = 0.9 * 10.0 = 9.0

V_S2 = 0
V_S1 = 10 + gamma * V_S2
V_S0 = 0 + gamma * V_S1

print("\nState-Value Function V(s):")
print(f"  V(S0) = {V_S0:.2f}  (2 steps to reward)")
print(f"  V(S1) = {V_S1:.2f}  (1 step to reward)")
print(f"  V(S2) = {V_S2:.2f}  (terminal state)")

print("\n💡 Interpretation:")
print("   - V(S0) < V(S1) because S0 is farther from the reward")
print("   - Each step away reduces value by factor of γ")
print("   - V(s) tells us 'how good' each state is")

print("\n" + "="*60)

# Example with multiple actions
print("\nExample: Grid World with Multiple Actions\n")
print("Consider state S with two actions:")
print("  - Action A1: Leads to goal (reward +10) with prob 0.8")
print("  - Action A2: Leads to goal (reward +10) with prob 0.3")
print("\nBoth actions give -1 reward if they don't reach goal")

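# The episode ends after this single action, so Q(s,a) is just the
# expected immediate reward and the discount factor plays no role here.
# Q(S, A1) = 0.8 * 10 + 0.2 * (-1) = 7.8
# Q(S, A2) = 0.3 * 10 + 0.7 * (-1) = 2.3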

Q_S_A1 = 0.8 * 10 + 0.2 * (-1)
Q_S_A2 = 0.3 * 10 + 0.7 * (-1)

print("\nAction-Value Function Q(s,a):")
print(f"  Q(S, A1) = {Q_S_A1:.2f}  (high success rate)")
print(f"  Q(S, A2) = {Q_S_A2:.2f}  (low success rate)")

print("\n💡 Interpretation:")
print("   - Q(S, A1) > Q(S, A2) because A1 is more likely to succeed")
print("   - Optimal action: A1 (higher Q-value)")
print("   - Q(s,a) tells us 'how good' each action is in each state")

# If following a policy that chooses A1 with prob 0.7 and A2 with prob 0.3
V_S = 0.7 * Q_S_A1 + 0.3 * Q_S_A2
print(f"\nIf policy π(A1|S)=0.7, π(A2|S)=0.3:")
print(f"  V(S) = 0.7 × Q(S,A1) + 0.3 × Q(S,A2)")
print(f"  V(S) = 0.7 × {Q_S_A1:.2f} + 0.3 × {Q_S_A2:.2f}")
print(f"  V(S) = {V_S:.2f}")

print("\n" + "="*60)
print("\n🎯 Key Takeaways:")
print("   1. V(s): Expected return from state s")
print("   2. Q(s,a): Expected return from taking action a in state s")
print("   3. V(s) = Σ π(a|s) Q(s,a) (weighted average over actions)")
print("   4. Optimal policy: Choose action with highest Q-value")
print("   5. Value functions are the foundation of RL algorithms!")

#### Visualizing the Effect of Discount Factor

In [None]:
# Visualize how discount factor affects returns
import matplotlib.pyplot as plt

# Create a reward sequence
num_steps = 20
rewards = np.ones(num_steps)  # Constant reward of 1 at each step

# Calculate returns for different gamma values
gamma_values = [0.5, 0.7, 0.9, 0.95, 0.99]
colors = ['red', 'orange', 'green', 'blue', 'purple']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Discount weights over time
time_steps = np.arange(num_steps)
for gamma, color in zip(gamma_values, colors):
    weights = gamma ** time_steps
    ax1.plot(time_steps, weights, linewidth=2, color=color, 
            label=f'γ = {gamma}', marker='o', markersize=4, alpha=0.7)

ax1.set_xlabel('Time Steps into Future', fontsize=12)
ax1.set_ylabel('Discount Weight (γᵗ)', fontsize=12)
ax1.set_title('How Discount Factor Weights Future Rewards', fontsize=14, fontweight='bold')
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 1.1])

# Plot 2: Total discounted return
returns = []
for gamma in gamma_values:
    G = calculate_discounted_return(rewards, gamma)
    returns.append(G)

bars = ax2.bar(range(len(gamma_values)), returns, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
ax2.set_xlabel('Discount Factor γ', fontsize=12)
ax2.set_ylabel('Total Discounted Return', fontsize=12)
ax2.set_title('Total Return for Constant Reward Sequence', fontsize=14, fontweight='bold')
ax2.set_xticks(range(len(gamma_values)))
ax2.set_xticklabels([f'{g}' for g in gamma_values])
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, val in zip(bars, returns):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.1f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Visualization Insights:")
print("\nLeft Plot - Discount Weights:")
print("   - Shows how much each future reward is weighted")
print("   - Lower γ: Future rewards decay quickly")
print("   - Higher γ: Future rewards remain important longer")
print("\nRight Plot - Total Returns:")
print("   - Shows cumulative effect of discounting")
print("   - γ=0.5: Only considers ~2 steps ahead effectively")
print("   - γ=0.99: Considers ~100 steps ahead effectively")
print("\n💡 Rule of thumb: Effective horizon ≈ 1/(1-γ) steps")
for gamma in gamma_values:
    horizon = 1 / (1 - gamma)
    print(f"   γ={gamma}: ~{horizon:.0f} steps")

#### Policies: Mapping States to Actions

**What is a Policy?**

A **policy** $\pi$ is a strategy that defines how the agent behaves - it maps states to actions. The policy is what the agent learns in reinforcement learning.

**Types of Policies:**

**1. Deterministic Policy**

A deterministic policy $\pi: S \rightarrow A$ maps each state to a single action:

$a = \pi(s)$

**Example:**
- In grid world: "Always move RIGHT in state (0,0)"
- In chess: "Always make the move that captures the most valuable piece"

**2. Stochastic Policy**

A stochastic policy $\pi(a|s)$ gives a probability distribution over actions for each state:

$\pi(a|s) = P(A_t = a | S_t = s)$

where $\sum_a \pi(a|s) = 1$ for all states $s$

**Example:**
- In grid world: "Move RIGHT with 70% probability, DOWN with 30% in state (0,0)"
- Epsilon-greedy: "Take best action with probability 1-ε, random action with probability ε"
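
The epsilon-greedy example can be written out as an explicit action distribution. The sketch below assumes one common variant in which the random draw is uniform over all actions, the greedy one included; the function name `epsilon_greedy_probs` and the Q-values are hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return {action: probability} for an epsilon-greedy policy in one state.

    q_values: dict mapping each action to its estimated value.
    With probability epsilon an action is drawn uniformly (greedy action
    included); otherwise the greedy action is taken.
    """
    actions = list(q_values)
    greedy = max(actions, key=q_values.get)
    # Every action gets an equal share of the exploration mass
    probs = {a: epsilon / len(actions) for a in actions}
    # The greedy action additionally receives the exploitation mass
    probs[greedy] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs({'RIGHT': 2.0, 'DOWN': 0.5}, epsilon=0.1)
print(probs)
```

Under this variant the greedy action is chosen with probability $1 - \varepsilon + \varepsilon/|A|$, slightly more than $1 - \varepsilon$, and the probabilities still sum to 1.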

**Why Stochastic Policies?**

1. **Exploration**: Randomness helps explore the environment
2. **Partial Observability**: When state is uncertain, randomization can help
3. **Game Theory**: In competitive settings, randomization prevents exploitation
4. **Continuous Actions**: Natural representation for continuous action spaces

**Optimal Policy**

The **optimal policy** $\pi^*$ maximizes the expected return from every state:

$\pi^* = \arg\max_\pi V^\pi(s) \text{ for all } s \in S$

**Key Theorem:** For any finite MDP with discounted returns, there exists an optimal deterministic policy!

This means we can always find a policy that doesn't need randomness to be optimal (though stochastic policies are still useful during learning).

Let's implement policy representations:

In [None]:
class DeterministicPolicy:
    """A deterministic policy that maps states to actions."""
    
    def __init__(self):
        """Initialize empty policy."""
        self.policy = {}  # state -> action mapping
    
    def set_action(self, state, action):
        """Set the action for a given state."""
        self.policy[state] = action
    
    def get_action(self, state):
        """Get the action for a given state."""
        return self.policy.get(state, None)
    
    def __repr__(self):
        return f"DeterministicPolicy({len(self.policy)} states)"


class StochasticPolicy:
    """A stochastic policy that gives probability distributions over actions."""
    
    def __init__(self):
        """Initialize empty policy."""
        self.policy = {}  # state -> {action: probability} mapping
    
    def set_action_probs(self, state, action_probs):
        """Set action probabilities for a given state.
        
        Args:
            state: The state
            action_probs: Dict mapping actions to probabilities
        """
        # Verify probabilities sum to 1
        total = sum(action_probs.values())
        if not np.isclose(total, 1.0):
            raise ValueError(f"Action probabilities must sum to 1, got {total}")
        self.policy[state] = action_probs.copy()
    
    def get_action_prob(self, state, action):
        """Get probability of taking action in state."""
        return self.policy.get(state, {}).get(action, 0.0)
    
    def sample_action(self, state):
        """Sample an action according to the policy."""
        action_probs = self.policy.get(state, {})
        if not action_probs:
            return None
        actions = list(action_probs.keys())
        probs = list(action_probs.values())
        return np.random.choice(actions, p=probs)
    
    def __repr__(self):
        return f"StochasticPolicy({len(self.policy)} states)"


# Demonstrate policy representations
print("Policy Representations\n")
print("="*60)

# Example 1: Deterministic policy for 2x2 grid
print("Example 1: Deterministic Policy for 2x2 Grid\n")

det_policy = DeterministicPolicy()
det_policy.set_action('(0,0)', 'RIGHT')
det_policy.set_action('(0,1)', 'DOWN')
det_policy.set_action('(1,0)', 'RIGHT')
det_policy.set_action('(1,1)', 'STAY')  # Goal state; 'STAY' is a placeholder label, not one of the MDP's actions

print("Deterministic Policy π(s):")
for state in ['(0,0)', '(0,1)', '(1,0)', '(1,1)']:
    action = det_policy.get_action(state)
    print(f"  π({state}) = {action}")

print("\n💡 This policy always takes the same action in each state")

print("\n" + "="*60)

# Example 2: Stochastic policy
print("\nExample 2: Stochastic Policy (Exploration)\n")

stoch_policy = StochasticPolicy()

# State (0,0): Prefer RIGHT but sometimes go DOWN
stoch_policy.set_action_probs('(0,0)', {'RIGHT': 0.7, 'DOWN': 0.3})

# State (0,1): Prefer DOWN
stoch_policy.set_action_probs('(0,1)', {'RIGHT': 0.1, 'DOWN': 0.9})

# State (1,0): Prefer RIGHT
stoch_policy.set_action_probs('(1,0)', {'RIGHT': 0.9, 'DOWN': 0.1})

print("Stochastic Policy π(a|s):")
for state in ['(0,0)', '(0,1)', '(1,0)']:
    print(f"\n  State {state}:")
    for action in ['RIGHT', 'DOWN']:
        prob = stoch_policy.get_action_prob(state, action)
        if prob > 0:
            print(f"    π({action}|{state}) = {prob:.1f}")

print("\n💡 This policy has randomness - different actions with different probabilities")

# Sample actions from stochastic policy
print("\nSampling 10 actions from state (0,0):")
samples = [stoch_policy.sample_action('(0,0)') for _ in range(10)]
print(f"  Actions: {samples}")
right_count = samples.count('RIGHT')
down_count = samples.count('DOWN')
print(f"  RIGHT: {right_count}/10 ({right_count*10}%), DOWN: {down_count}/10 ({down_count*10}%)")
print(f"  Expected: RIGHT: 70%, DOWN: 30%")

print("\n" + "="*60)
print("\n🎯 Key Points:")
print("   1. Deterministic: π(s) → single action")
print("   2. Stochastic: π(a|s) → probability distribution")
print("   3. Optimal policies can be deterministic")
print("   4. Stochastic policies useful for exploration during learning")

#### Bellman Equations: The Foundation of RL Algorithms

**The Bellman Equations**

The **Bellman equations** are fundamental recursive relationships that express value functions in terms of themselves. They are the mathematical foundation for most RL algorithms.

**Bellman Equation for V(s):**

The value of a state equals the expected immediate reward plus the discounted value of the next state:

$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$

**In words:**
1. Consider all possible actions under policy $\pi$
2. For each action, consider all possible next states
3. Sum up: immediate reward + discounted value of next state
4. Weight by probabilities

**Bellman Equation for Q(s,a):**

$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\right]$

**Bellman Optimality Equations:**

For the optimal value functions:

$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^*(s')\right]$

$Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$

**Why Are Bellman Equations Important?**

1. **Recursive Structure**: Break down long-term value into immediate reward + future value
2. **Dynamic Programming**: Enable iterative computation of value functions
3. **Temporal Difference Learning**: Basis for TD learning and Q-learning
4. **Optimality**: Optimal policies satisfy the Bellman optimality equations
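
The optimality equation for $V^*$ can be turned into an update rule by repeatedly replacing each $V(s)$ with its right-hand side, a procedure commonly called value iteration. The sketch below hard-codes the 2x2 grid world from earlier in this notebook; because its transitions are deterministic, the sum over $s'$ collapses to a single term. The names `MODEL`, `backup`, and `value_iteration` are illustrative.

```python
GAMMA = 0.9
STATES = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']
ACTIONS = ['RIGHT', 'DOWN']

# Deterministic model of the 2x2 grid: (state, action) -> (next_state, reward)
MODEL = {
    ('(0,0)', 'RIGHT'): ('(0,1)', -1), ('(0,0)', 'DOWN'): ('(1,0)', -1),
    ('(0,1)', 'RIGHT'): ('(0,1)', -1), ('(0,1)', 'DOWN'): ('(1,1)', 10),
    ('(1,0)', 'RIGHT'): ('(1,1)', 10), ('(1,0)', 'DOWN'): ('(1,0)', -1),
    ('(1,1)', 'RIGHT'): ('(1,1)', 0),  ('(1,1)', 'DOWN'): ('(1,1)', 0),
}


def backup(V, state):
    """Right-hand side of the Bellman optimality equation for one state."""
    return max(reward + GAMMA * V[next_state]
               for next_state, reward in (MODEL[(state, a)] for a in ACTIONS))


def value_iteration(tol=1e-8, max_sweeps=1000):
    """Sweep over all states until the largest change falls below tol."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        V_new = {s: backup(V, s) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration()
print({s: round(v, 2) for s, v in V_star.items()})
```

Each sweep uses only the values from the previous sweep, so the order in which states are visited does not affect the result.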

**The Bellman Deadlock**

The Bellman equations create a system of equations where:
- Each value depends on other values
- We have $|S|$ equations with $|S|$ unknowns (for V)
- Or $|S| \times |A|$ equations with $|S| \times |A|$ unknowns (for Q)

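**The Problem:**
- Values are defined in terms of each other, so no state's value can be computed in isolation
- For a fixed policy the system is linear and can in principle be solved directly, but that costs roughly $O(|S|^3)$ and requires a known model; the optimality equations contain a $\max$ and are nonlinear
- This notebook calls the circular dependency the "Bellman deadlock"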

**Solutions:**
1. **Iterative Methods**: Dynamic Programming (policy evaluation, value iteration)
2. **Sampling Methods**: Monte Carlo, Temporal Difference learning
3. **Function Approximation**: Neural networks for large state spaces
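
For a small MDP with a known model and a fixed policy, the $|S|$ Bellman equations for $V^\pi$ are linear, so they can also be solved in one shot as a cross-check on the iterative methods. The sketch below assumes the 2x2 grid world with the policy RIGHT in (0,0), DOWN in (0,1), RIGHT in (1,0), and a zero-reward self-loop at the goal. The matrix `P_pi` and vector `r_pi` are written out by hand for this example.

```python
import numpy as np

gamma = 0.9
states = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']

# P_pi[i, j] = probability of moving from states[i] to states[j] under the policy
P_pi = np.array([
    [0, 1, 0, 0],   # (0,0) --RIGHT--> (0,1)
    [0, 0, 0, 1],   # (0,1) --DOWN-->  (1,1)
    [0, 0, 0, 1],   # (1,0) --RIGHT--> (1,1)
    [0, 0, 0, 1],   # (1,1) stays at the goal
], dtype=float)

# r_pi[i] = expected immediate reward in states[i] under the policy
r_pi = np.array([-1.0, 10.0, 10.0, 0.0])

# V = r_pi + gamma * P_pi @ V   <=>   (I - gamma * P_pi) @ V = r_pi
V = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, r_pi)

for state, value in zip(states, V):
    print(f"V({state}) = {value:.2f}")
```

The direct solve scales cubically with the number of states, which is one reason iterative and sampling-based methods are preferred for larger problems.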

**The Curse of Dimensionality**

As the state space grows, computational requirements explode:

- **Tabular Methods**: Need to store value for every state
  - 10 binary features → $2^{10} = 1,024$ states
  - 20 binary features → $2^{20} = 1,048,576$ states
  - 30 binary features → $2^{30} = 1,073,741,824$ states

- **Continuous States**: Infinite states (e.g., robot position)

**Addressing the Curse:**
1. **Function Approximation**: Learn V(s) or Q(s,a) with neural networks
2. **Sampling**: Don't visit all states, learn from experience
3. **Generalization**: Use features to generalize across similar states
4. **Hierarchical Methods**: Break problem into subproblems
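
To make the storage cost of tabular methods concrete, the sketch below estimates the size of a dense Q-table. It assumes 4 actions and one 8-byte float per entry, both illustrative choices; the helper name `q_table_bytes` is hypothetical.

```python
def q_table_bytes(num_binary_features, num_actions, bytes_per_value=8):
    """Bytes needed for a dense table with one value per (state, action) pair."""
    num_states = 2 ** num_binary_features
    return num_states * num_actions * bytes_per_value


for n in (10, 20, 30):
    size = q_table_bytes(n, num_actions=4)
    print(f"{n} binary features: {size / 2**20:,.2f} MiB")
```

Every 10 additional binary features multiply the table size by 1,024, so a table that fits comfortably in memory at 20 features needs tens of gibibytes at 30.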

Let's demonstrate the Bellman equations with a simple example:

In [None]:
# Demonstrate Bellman equations with policy evaluation
print("Bellman Equations: Policy Evaluation Example\n")
print("="*60)

# Use our 2x2 grid MDP from earlier
states = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']
gamma = 0.9

# Define a simple policy: always go RIGHT from (0,0) and (1,0), DOWN from (0,1)
policy = {
    '(0,0)': {'RIGHT': 1.0},
    '(0,1)': {'DOWN': 1.0},
    '(1,0)': {'RIGHT': 1.0},
    '(1,1)': {'RIGHT': 1.0}  # Terminal, doesn't matter
}

# Transitions and rewards (from earlier MDP)
transitions = {
    ('(0,0)', 'RIGHT'): {'(0,1)': 1.0},
    ('(0,1)', 'DOWN'): {'(1,1)': 1.0},
    ('(1,0)', 'RIGHT'): {'(1,1)': 1.0},
    ('(1,1)', 'RIGHT'): {'(1,1)': 1.0},
}

rewards = {
    ('(0,0)', 'RIGHT', '(0,1)'): -1,
    ('(0,1)', 'DOWN', '(1,1)'): 10,
    ('(1,0)', 'RIGHT', '(1,1)'): 10,
    ('(1,1)', 'RIGHT', '(1,1)'): 0,
}

print("MDP Setup:")
print("  States: (0,0) → (0,1)")
print("            ↓       ↓")
print("          (1,0) → (1,1) [GOAL]")
print(f"\n  Discount factor γ = {gamma}")
print("\n  Policy π:")
for state, action_probs in policy.items():
    for action, prob in action_probs.items():
        if prob > 0:
            print(f"    π({state}) = {action}")

print("\n" + "-"*60)
print("\nApplying Bellman Equation: V(s) = Σ π(a|s) Σ P(s'|s,a)[R + γV(s')]\n")

# Iterative policy evaluation
V = {s: 0.0 for s in states}  # Initialize values to 0
V['(1,1)'] = 0.0  # Terminal state

print("Iteration 0 (Initial):")
for state in states:
    print(f"  V({state}) = {V[state]:.2f}")

# Perform a few iterations
for iteration in range(1, 6):
    V_new = V.copy()
    
    for state in states:
        if state == '(1,1)':  # Terminal state
            continue
        
        # Apply Bellman equation
        v = 0.0
        for action, action_prob in policy[state].items():
            # Get transitions for this state-action pair
            next_states = transitions.get((state, action), {})
            
            for next_state, trans_prob in next_states.items():
                reward = rewards.get((state, action, next_state), 0.0)
                # Bellman equation: R + γ * V(s')
                v += action_prob * trans_prob * (reward + gamma * V[next_state])
        
        V_new[state] = v
    
    V = V_new
    
    print(f"\nIteration {iteration}:")
    for state in states:
        print(f"  V({state}) = {V[state]:.2f}")

print("\n" + "="*60)
print("\n💡 Observations:")
print("   1. Values converge through iterative application of Bellman equation")
print("   2. V(1,1) = 0 (terminal state, no future rewards)")
print("   3. V(0,1) and V(1,0) are high (one step from goal)")
print("   4. V(0,0) is lower (two steps from goal, more discounting)")
print("   5. Each iteration uses previous values to compute new values")

print("\n" + "="*60)
print("\n🎯 Key Insights:")
print("   1. Bellman equations express values recursively")
print("   2. Values depend on each other (circular dependency = Bellman deadlock)")
print("   3. Iterative methods converge to true values")
print("   4. This is the foundation of Dynamic Programming!")

#### Visualizing the Curse of Dimensionality

In [None]:
# Visualize the curse of dimensionality
import matplotlib.pyplot as plt

# Calculate state space sizes for different scenarios
# Store as (name, exponent) to avoid overflow
scenarios = [
    ('Grid 5×5', np.log10(25)),
    ('Grid 10×10', np.log10(100)),
    ('Grid 20×20', np.log10(400)),
    ('10 binary features', np.log10(2**10)),
    ('15 binary features', np.log10(2**15)),
    ('20 binary features', np.log10(2**20)),
    ('Chess (approx)', 43),
    ('Go (approx)', 170)
]

names = [s[0] for s in scenarios]
log_sizes = [s[1] for s in scenarios]

# Create visualization
fig, ax = plt.subplots(figsize=(12, 8))

# Bar lengths are log10(state count), so the x-axis is effectively a log scale
y_pos = np.arange(len(names))
colors = ['green', 'green', 'yellow', 'yellow', 'orange', 'red', 'darkred', 'darkred']

bars = ax.barh(y_pos, log_sizes, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

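ax.set_yticks(y_pos)
ax.set_yticklabels(names, fontsize=11)
# barh draws the first item at the bottom; flip the axis so the small state
# spaces sit at the top, next to the "tractable" annotation
ax.invert_yaxis()
ax.set_xlabel('Number of States (log₁₀ scale)', fontsize=13, fontweight='bold')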
ax.set_title('The Curse of Dimensionality in Reinforcement Learning', 
            fontsize=15, fontweight='bold', pad=20)
ax.grid(True, alpha=0.3, axis='x')

# Add value labels (keep fractional exponents, so 25 states shows as 10^1.4 rather than 10^1)
for bar, log_size in zip(bars, log_sizes):
    label = f'10^{log_size:.3g}'
    
    ax.text(log_size, bar.get_y() + bar.get_height()/2, f'  {label}',
           va='center', ha='left', fontsize=10, fontweight='bold')

# Add annotations
ax.text(0.02, 0.98, 'Tractable with\ntabular methods', 
       transform=ax.transAxes, fontsize=11, va='top',
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

ax.text(0.02, 0.50, 'Need function\napproximation', 
       transform=ax.transAxes, fontsize=11, va='top',
       bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.7))

ax.text(0.02, 0.15, 'Extremely\nchallenging', 
       transform=ax.transAxes, fontsize=11, va='top',
       bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n📊 The Curse of Dimensionality:")
print("\n" + "="*60)
print("\nState Space Growth:")
# Recreate actual sizes for printing.
# Small spaces are exact counts; chess and Go are stored as powers of ten,
# flagged explicitly so that 43 is not printed as "43 states".
actual_scenarios = [
    ('Grid 5×5', 25, False),
    ('Grid 10×10', 100, False),
    ('Grid 20×20', 400, False),
    ('10 binary features', 2**10, False),
    ('15 binary features', 2**15, False),
    ('20 binary features', 2**20, False),
    ('Chess (approx)', 43, True),
    ('Go (approx)', 170, True)
]
for name, size, is_exponent in actual_scenarios:
    if is_exponent:
        print(f"  {name:<25} ~10^{size} states")
    else:
        print(f"  {name:<25} {size:>20,} states")

print("\n" + "="*60)
print("\n💡 Key Insights:")
print("   1. State space grows exponentially with features")
print("   2. Tabular methods only work for small state spaces")
print("   3. Real-world problems need function approximation")
print("   4. Deep RL uses neural networks to handle large spaces")
print("   5. Sampling and generalization are essential!")

print("\n" + "="*60)
print("\n🎯 Summary of MDP Framework:")
print("\n   We've covered the complete MDP framework:")
print("   ✓ Core terminology (agent, environment, state, action, reward)")
print("   ✓ MDP components (S, A, P, R, γ)")
print("   ✓ Markov Property and its importance")
print("   ✓ Discounted returns and value functions")
print("   ✓ Policies (deterministic and stochastic)")
print("   ✓ Bellman equations (foundation of RL algorithms)")
print("   ✓ Challenges (Bellman deadlock, curse of dimensionality)")
print("\n   Next: We'll learn algorithms to solve MDPs!")

<a id='dynamic-programming'></a>
### Dynamic Programming: Solving MDPs with Perfect Knowledge

**From Theory to Algorithms**

Now that we understand the Bellman equations, we can use them to solve MDPs! **Dynamic Programming (DP)** methods provide exact solutions when we have perfect knowledge of the environment's dynamics.

**What is Dynamic Programming?**

Dynamic Programming is a general approach to solving complex problems by:
1. Breaking them into simpler subproblems
2. Solving each subproblem once
3. Storing solutions to avoid recomputation
4. Combining solutions to solve the original problem

In RL, DP uses the Bellman equations to iteratively compute value functions.

**Key Assumptions for DP:**

1. **Perfect Model**: We know $P(s'|s,a)$ and $R(s,a,s')$ for all states and actions
2. **Finite State/Action Spaces**: Can enumerate all states and actions
3. **Markov Property**: Future depends only on current state
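
To make the first assumption concrete, here is a minimal sketch of what a tabular model can look like. The two-state MDP, its state and action names, and the `is_valid_model` helper are all hypothetical and are not part of this notebook's `SimpleMDP` class; the point is only that a "perfect model" means every state-action pair has a proper probability distribution over next states.

```python
import math

# Hypothetical two-state model, for illustration only.
# transitions[(s, a)] maps next_state -> P(s'|s,a)
transitions = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 0.8, "B": 0.2},
}
# rewards[(s, a, s')] holds R(s,a,s'); missing entries are treated as 0
rewards = {("A", "go", "B"): 1.0, ("B", "go", "A"): -1.0}


def is_valid_model(transitions, states, actions):
    """Return True if every (state, action) pair has a proper distribution."""
    for s in states:
        for a in actions:
            dist = transitions.get((s, a))
            if dist is None:
                return False  # the model must cover every state-action pair
            if any(p < 0 for p in dist.values()):
                return False  # probabilities cannot be negative
            if not math.isclose(sum(dist.values()), 1.0):
                return False  # probabilities must sum to 1
            if any(s_next not in states for s_next in dist):
                return False  # next states must be known states
    return True


print(is_valid_model(transitions, ["A", "B"], ["stay", "go"]))
```

A check like this is cheap to run before any DP sweep, since a distribution that does not sum to 1 silently distorts every value computed from it.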

**Two Main DP Algorithms:**

1. **Policy Evaluation**: Compute $V^\pi(s)$ for a given policy $\pi$
2. **Policy Improvement**: Find a better policy given $V^\pi(s)$

Combining these gives us **Policy Iteration** and **Value Iteration** algorithms.

**Why Study DP?**

Even though DP requires perfect knowledge (rarely available in practice), it's important because:
- Provides theoretical foundation for RL
- Many RL algorithms are approximate DP methods
- Helps understand convergence and optimality
- Works well for planning problems (e.g., robotics with simulators)

Let's start with Policy Evaluation!

#### Policy Evaluation: Computing the Value Function

**The Problem:**

Given a policy $\pi$, compute the state-value function $V^\pi(s)$ for all states.

**The Bellman Equation for Policy Evaluation:**

$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$

This is a system of $|S|$ linear equations with $|S|$ unknowns. We could solve it directly, but for large state spaces, we use an **iterative approach**.
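
For reference, the direct solution can be written in matrix form. The symbols $P^\pi$ and $R^\pi$ are introduced here only for illustration; they average the model over the policy's action choices:

$P^\pi(s,s') = \sum_a \pi(a|s) P(s'|s,a), \qquad R^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) R(s,a,s')$

$V^\pi = R^\pi + \gamma P^\pi V^\pi \quad \Longrightarrow \quad V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$

The inverse exists whenever $\gamma < 1$. A dense solve costs roughly $O(|S|^3)$, which is why repeated sweeps are preferred once $|S|$ gets large.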

**Iterative Policy Evaluation Algorithm:**

1. Initialize $V(s) = 0$ for all states (or any arbitrary values)
2. Repeat until convergence:
   - For each state $s$:
     - $V_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V_k(s')\right]$
3. Stop when $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ (small threshold)
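
As a quick hand check, here is how the update plays out on the 2x2 grid used in this notebook, assuming rewards of $-1$ per ordinary move, $+10$ for entering the goal $(1,1)$, $\gamma = 0.9$, and a policy that goes RIGHT from $(0,0)$ and DOWN from $(0,1)$. Starting from $V_0 = 0$ everywhere:

$V_1(0,1) = 10 + 0.9 \cdot V_0(1,1) = 10$

$V_1(0,0) = -1 + 0.9 \cdot V_0(0,1) = -1$

$V_2(0,0) = -1 + 0.9 \cdot V_1(0,1) = 8$

A third sweep changes nothing, so $\max_s |V_3(s) - V_2(s)| = 0 < \theta$ and the loop stops.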

**Key Insights:**

- Each iteration applies the Bellman equation as an update rule
- Uses old values $V_k(s')$ to compute new values $V_{k+1}(s)$
- Guaranteed to converge to $V^\pi$ as $k \rightarrow \infty$
- Called "bootstrapping" - using estimates to update estimates

**Two Variants:**

1. **Synchronous**: Update all states using old values, then replace all at once
2. **Asynchronous**: Update states one at a time, using most recent values

Asynchronous often converges faster in practice.
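
The difference between the two variants can be sketched on a toy problem. The example below is a hypothetical five-state chain (not one of this notebook's MDPs) where the policy always moves right and only the final step into the terminal state pays $+1$. The names `backup` and `evaluate` are made up for this sketch.

```python
GAMMA = 0.9
N = 5  # states 0..4; state 4 is terminal


def backup(V, s):
    """Bellman backup for the fixed 'move right' policy on the chain."""
    if s == N - 1:
        return 0.0  # terminal state keeps value 0
    reward = 1.0 if s + 1 == N - 1 else 0.0  # +1 only for entering the terminal state
    return reward + GAMMA * V[s + 1]


def evaluate(in_place, order, theta=1e-9):
    """Return (values, sweeps) for synchronous or in-place evaluation."""
    V = [0.0] * N
    sweeps = 0
    while True:
        sweeps += 1
        # Synchronous reads from a frozen copy; in-place reads the live list
        source = V if in_place else list(V)
        delta = 0.0
        for s in order:
            new_v = backup(source, s)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V, sweeps


order = [3, 2, 1, 0]  # sweep from the state nearest the goal outwards
V_sync, sweeps_sync = evaluate(in_place=False, order=order)
V_inplace, sweeps_inplace = evaluate(in_place=True, order=order)
print(f"synchronous: {sweeps_sync} sweeps, in-place: {sweeps_inplace} sweeps")
```

With this ordering the in-place version finishes in 2 sweeps versus 5 for the synchronous one, and both reach the same values. Sweeping in the opposite order (`[0, 1, 2, 3]`) removes the advantage on this chain, so the benefit depends on the order in which states are visited.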

Let's implement policy evaluation!

In [None]:
def policy_evaluation(mdp, policy, gamma=0.9, theta=0.0001, max_iterations=1000):
    """Evaluate a policy using iterative policy evaluation.
    
    Args:
        mdp: MDP object with states, actions, transitions, rewards
        policy: Dict mapping state -> {action: probability}
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum number of iterations
        
    Returns:
        V: Dict mapping state -> value
        iterations: Number of iterations until convergence
    """
    # Initialize value function to zero
    V = {s: 0.0 for s in mdp.states}
    
    for iteration in range(max_iterations):
        delta = 0  # Track maximum change in value
        V_new = V.copy()
        
        # Update value for each state
        for state in mdp.states:
            v = 0.0
            
            # Sum over all actions according to policy
            for action, action_prob in policy.get(state, {}).items():
                # Sum over all possible next states
                next_states = mdp.transitions.get((state, action), {})
                
                for next_state, trans_prob in next_states.items():
                    reward = mdp.rewards.get((state, action, next_state), 0.0)
                    # Bellman equation
                    v += action_prob * trans_prob * (reward + gamma * V[next_state])
            
            V_new[state] = v
            delta = max(delta, abs(V_new[state] - V[state]))
        
        V = V_new
        
        # Check for convergence
        if delta < theta:
            print(f"Policy evaluation converged in {iteration + 1} iterations")
            return V, iteration + 1
    
    print(f"Policy evaluation reached max iterations ({max_iterations})")
    return V, max_iterations


# Demonstrate policy evaluation on our 2x2 grid MDP
print("Policy Evaluation Demonstration\n")
print("="*60)

# Recreate the 2x2 grid MDP
states = ['(0,0)', '(0,1)', '(1,0)', '(1,1)']
actions = ['RIGHT', 'DOWN']

transitions = {
    ('(0,0)', 'RIGHT'): {'(0,1)': 1.0},
    ('(0,0)', 'DOWN'): {'(1,0)': 1.0},
    ('(0,1)', 'RIGHT'): {'(0,1)': 1.0},
    ('(0,1)', 'DOWN'): {'(1,1)': 1.0},
    ('(1,0)', 'RIGHT'): {'(1,1)': 1.0},
    ('(1,0)', 'DOWN'): {'(1,0)': 1.0},
    ('(1,1)', 'RIGHT'): {'(1,1)': 1.0},
    ('(1,1)', 'DOWN'): {'(1,1)': 1.0},
}

rewards = {
    ('(0,0)', 'RIGHT', '(0,1)'): -1,
    ('(0,0)', 'DOWN', '(1,0)'): -1,
    ('(0,1)', 'RIGHT', '(0,1)'): -1,
    ('(0,1)', 'DOWN', '(1,1)'): 10,
    ('(1,0)', 'RIGHT', '(1,1)'): 10,
    ('(1,0)', 'DOWN', '(1,0)'): -1,
    ('(1,1)', 'RIGHT', '(1,1)'): 0,
    ('(1,1)', 'DOWN', '(1,1)'): 0,
}

mdp = SimpleMDP(states, actions, transitions, rewards, gamma=0.9)

# Define a policy: always go RIGHT from (0,0) and (1,0), DOWN from (0,1)
policy = {
    '(0,0)': {'RIGHT': 1.0},
    '(0,1)': {'DOWN': 1.0},
    '(1,0)': {'RIGHT': 1.0},
    '(1,1)': {'RIGHT': 1.0}
}

print("MDP: 2x2 Grid World")
print("  (0,0) → (0,1)")
print("    ↓       ↓")
print("  (1,0) → (1,1) [GOAL]")
print(f"\nDiscount factor γ = {mdp.gamma}")

print("\nPolicy π:")
for state, actions_dict in policy.items():
    for action, prob in actions_dict.items():
        if prob > 0:
            print(f"  π({state}) = {action}")

print("\n" + "-"*60)
print("\nRunning Policy Evaluation...\n")

# Evaluate the policy
V, num_iterations = policy_evaluation(mdp, policy, gamma=0.9, theta=0.0001)

print("\nFinal Value Function V^π(s):")
print("-"*60)
for state in states:
    print(f"  V^π({state}) = {V[state]:7.4f}")

print("\n" + "="*60)
print("\n💡 Interpretation:")
print(f"   - V^π(1,1) = {V['(1,1)']:.4f} (terminal state, no future rewards)")
print(f"   - V^π(0,1) = {V['(0,1)']:.4f} (one step from goal via DOWN)")
print(f"   - V^π(1,0) = {V['(1,0)']:.4f} (one step from goal via RIGHT)")
print(f"   - V^π(0,0) = {V['(0,0)']:.4f} (two steps from goal)")
print("\n   Values reflect expected cumulative reward following policy π")

#### Visualizing Policy Evaluation Convergence

In [None]:
# Visualize how values converge during policy evaluation
def policy_evaluation_with_history(mdp, policy, gamma=0.9, theta=0.0001, max_iterations=1000):
    """Policy evaluation that tracks value history for visualization."""
    V = {s: 0.0 for s in mdp.states}
    history = {s: [0.0] for s in mdp.states}
    
    for iteration in range(max_iterations):
        delta = 0
        V_new = V.copy()
        
        for state in mdp.states:
            v = 0.0
            for action, action_prob in policy.get(state, {}).items():
                next_states = mdp.transitions.get((state, action), {})
                for next_state, trans_prob in next_states.items():
                    reward = mdp.rewards.get((state, action, next_state), 0.0)
                    v += action_prob * trans_prob * (reward + gamma * V[next_state])
            
            V_new[state] = v
            history[state].append(v)
            delta = max(delta, abs(V_new[state] - V[state]))
        
        V = V_new
        
        if delta < theta:
            return V, history, iteration + 1
    
    return V, history, max_iterations


# Run policy evaluation with history tracking
V, history, num_iters = policy_evaluation_with_history(mdp, policy, gamma=0.9, theta=0.0001)

# Create visualization
fig, ax = plt.subplots(figsize=(12, 7))

colors = ['blue', 'green', 'orange', 'red']
markers = ['o', 's', '^', 'd']

for state, color, marker in zip(states, colors, markers):
    iterations = range(len(history[state]))
    values = history[state]
    ax.plot(iterations, values, linewidth=2.5, color=color, marker=marker,
           markersize=6, markevery=max(1, len(iterations)//10), 
           label=f'V({state})', alpha=0.8)

ax.set_xlabel('Iteration', fontsize=13, fontweight='bold')
ax.set_ylabel('Value V(s)', fontsize=13, fontweight='bold')
ax.set_title('Policy Evaluation: Value Function Convergence', 
            fontsize=15, fontweight='bold', pad=15)
ax.legend(loc='right', fontsize=12, framealpha=0.9)
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.3)

# Add convergence annotation
ax.axvline(x=num_iters, color='red', linestyle='--', linewidth=2, alpha=0.5)
ax.text(num_iters, ax.get_ylim()[1]*0.9, f'Converged\n(iter {num_iters})',
       ha='center', fontsize=11, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n📊 Convergence Analysis:")
print("="*60)
print(f"\nConverged in {num_iters} iterations")
print("\nFinal values:")
for state in states:
    print(f"  V({state}) = {V[state]:7.4f}")

print("\n💡 Observations:")
print("   1. Values start at 0 and converge to true values")
print("   2. Terminal state (1,1) stays at 0")
print("   3. States closer to goal converge to higher values")
print("   4. Convergence is geometric: the error shrinks by at least a factor of gamma per sweep")
print("   5. Each iteration uses Bellman equation as update rule")

#### Demonstrating Policy Evaluation on a Larger Grid World

In [None]:
# Create a larger 4x4 grid world for more interesting demonstration
def create_grid_world_mdp(size=4, goal=(3, 3), obstacles=None, gamma=0.9):
    """Create a grid world MDP.
    
    Args:
        size: Grid size (size x size)
        goal: Goal position (row, col)
        obstacles: List of obstacle positions
        gamma: Discount factor
        
    Returns:
        mdp: SimpleMDP object
    """
    if obstacles is None:
        obstacles = []
    
    # Generate all states
    states = []
    for i in range(size):
        for j in range(size):
            if (i, j) not in obstacles:
                states.append(f'({i},{j})')
    
    actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']
    action_effects = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}
    
    transitions = {}
    rewards = {}
    
    for state_str in states:
        # Parse state
        state = eval(state_str)
        
        for action in actions:
            # Treat the goal as absorbing (an assumption, matching the terminal
            # state of the 2x2 grid): every action stays put with zero reward,
            # so the +10 cannot be collected again by stepping out and back in
            if state == goal:
                transitions[(state_str, action)] = {state_str: 1.0}
                rewards[(state_str, action, state_str)] = 0.0
                continue
            
            # Calculate next state
            delta = action_effects[action]
            next_state = (state[0] + delta[0], state[1] + delta[1])
            
            # Check if next state is valid
            if (0 <= next_state[0] < size and 0 <= next_state[1] < size and 
                next_state not in obstacles):
                next_state_str = f'({next_state[0]},{next_state[1]})'
            else:
                # Hit wall or obstacle, stay in place
                next_state_str = state_str
            
            transitions[(state_str, action)] = {next_state_str: 1.0}
            
            # Set rewards
            if next_state == goal:
                rewards[(state_str, action, next_state_str)] = 10.0
            elif next_state_str == state_str:
                rewards[(state_str, action, next_state_str)] = -1.0  # Hit wall or obstacle
            else:
                rewards[(state_str, action, next_state_str)] = -0.1  # Step cost
    
    return SimpleMDP(states, actions, transitions, rewards, gamma)


# Create 4x4 grid world
print("Policy Evaluation on 4x4 Grid World\n")
print("="*60)

grid_mdp = create_grid_world_mdp(size=4, goal=(3, 3), obstacles=[(1, 1), (2, 2)], gamma=0.9)

print("Grid World: 4x4 with obstacles at (1,1) and (2,2)")
print("Goal: (3,3)")
print(f"States: {len(grid_mdp.states)} states")
print(f"Discount factor: γ = {grid_mdp.gamma}")

# Create a simple policy: move towards goal (right and down preferred)
grid_policy = {}
for state_str in grid_mdp.states:
    state = eval(state_str)
    
    if state == (3, 3):  # Goal state
        grid_policy[state_str] = {'RIGHT': 0.25, 'DOWN': 0.25, 'LEFT': 0.25, 'UP': 0.25}
    else:
        # Prefer moving towards goal
        if state[0] < 3 and state[1] < 3:
            grid_policy[state_str] = {'RIGHT': 0.4, 'DOWN': 0.4, 'LEFT': 0.1, 'UP': 0.1}
        elif state[0] < 3:
            grid_policy[state_str] = {'DOWN': 0.7, 'RIGHT': 0.1, 'LEFT': 0.1, 'UP': 0.1}
        elif state[1] < 3:
            grid_policy[state_str] = {'RIGHT': 0.7, 'DOWN': 0.1, 'LEFT': 0.1, 'UP': 0.1}
        else:
            grid_policy[state_str] = {'RIGHT': 0.25, 'DOWN': 0.25, 'LEFT': 0.25, 'UP': 0.25}

print("\nPolicy: Stochastic policy favoring movement towards goal")
print("\nRunning policy evaluation...\n")

# Evaluate policy
V_grid, num_iters = policy_evaluation(grid_mdp, grid_policy, gamma=0.9, theta=0.001)

# Visualize value function as a grid
print("\nValue Function V^π(s) as Grid:")
print("="*60)

value_grid = np.full((4, 4), np.nan)
for state_str, value in V_grid.items():
    state = eval(state_str)
    value_grid[state[0], state[1]] = value

# Print as formatted grid
print("\n     Col 0    Col 1    Col 2    Col 3")
print("   " + "-"*42)
for i in range(4):
    row_str = f"Row {i} |"  
    for j in range(4):
        if np.isnan(value_grid[i, j]):
            row_str += "   XXX   "
        else:
            row_str += f" {value_grid[i, j]:6.2f}  "
    print(row_str)

print("\n(XXX = obstacle)")

# Create heatmap visualization
fig, ax = plt.subplots(figsize=(10, 9))

# Mask obstacles
masked_grid = np.ma.masked_where(np.isnan(value_grid), value_grid)

im = ax.imshow(masked_grid, cmap='RdYlGn', aspect='auto', interpolation='nearest')
ax.set_xticks(range(4))
ax.set_yticks(range(4))
ax.set_xticklabels(range(4), fontsize=12)
ax.set_yticklabels(range(4), fontsize=12)
ax.set_xlabel('Column', fontsize=13, fontweight='bold')
ax.set_ylabel('Row', fontsize=13, fontweight='bold')
ax.set_title('Value Function V^π(s) Heatmap\n(Greener = Higher Value)', 
            fontsize=15, fontweight='bold', pad=15)

# Add value labels
for i in range(4):
    for j in range(4):
        if not np.isnan(value_grid[i, j]):
            text = ax.text(j, i, f'{value_grid[i, j]:.2f}',
                         ha="center", va="center", color="black", 
                         fontsize=11, fontweight='bold')
        else:
            text = ax.text(j, i, 'X',
                         ha="center", va="center", color="white", 
                         fontsize=20, fontweight='bold')

# Mark goal
ax.add_patch(plt.Rectangle((2.5, 2.5), 1, 1, fill=False, edgecolor='blue', linewidth=4))
ax.text(3, 3.8, 'GOAL', ha='center', fontsize=12, fontweight='bold', color='blue')

plt.colorbar(im, ax=ax, label='Value')
plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   1. States next to the goal have the highest values (+10 is paid on entering the goal)")
print("   2. Values generally decrease with distance from the goal")
print("   3. Obstacles and walls lower nearby values (each bump costs -1)")
print("   4. Policy evaluation successfully computed V^π for all states")
print("   5. This tells us how good each state is under the given policy")

#### Policy Improvement: Finding Better Policies

**From Evaluation to Improvement**

Now that we can evaluate a policy, the natural question is: **Can we find a better policy?**

The answer is yes, using the **Policy Improvement Theorem**!

**Policy Improvement Theorem:**

Given a policy $\pi$ and its value function $V^\pi$, we can create an improved policy $\pi'$ by acting greedily with respect to $V^\pi$:

$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V^\pi(s')\right]$

Or equivalently, using Q-values:

$\pi'(s) = \arg\max_a Q^\pi(s,a)$

**The theorem guarantees:** $V^{\pi'}(s) \geq V^\pi(s)$ for all states $s$

**Intuition:**

- We have $V^\pi(s)$ telling us how good each state is under policy $\pi$
- For each state, we look one step ahead and choose the action that leads to the best expected value
- This greedy policy must be at least as good as $\pi$
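
The theorem can be checked numerically on a small case. The sketch below rebuilds the 2x2 grid from this notebook as a plain dictionary, starts from a deliberately poor policy that bumps into walls, takes one greedy step, and compares values before and after. The helper names (`q_value`, `evaluate`, `greedy`) and the `poor_policy` are made up for this illustration and are independent of the notebook's own functions.

```python
GAMMA = 0.9
STATES = ["(0,0)", "(0,1)", "(1,0)", "(1,1)"]
ACTIONS = ["RIGHT", "DOWN"]
# Deterministic model: (state, action) -> (next_state, reward)
MODEL = {
    ("(0,0)", "RIGHT"): ("(0,1)", -1), ("(0,0)", "DOWN"): ("(1,0)", -1),
    ("(0,1)", "RIGHT"): ("(0,1)", -1), ("(0,1)", "DOWN"): ("(1,1)", 10),
    ("(1,0)", "RIGHT"): ("(1,1)", 10), ("(1,0)", "DOWN"): ("(1,0)", -1),
    ("(1,1)", "RIGHT"): ("(1,1)", 0), ("(1,1)", "DOWN"): ("(1,1)", 0),
}


def q_value(V, s, a):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_s, r = MODEL[(s, a)]
    return r + GAMMA * V[next_s]


def evaluate(policy, theta=1e-8):
    """Iterative evaluation of a deterministic policy (state -> action)."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: q_value(V, s, policy[s]) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < theta:
            return V_new
        V = V_new


def greedy(V):
    """Pick the action with the largest one-step lookahead in every state."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}


# A poor starting policy: bump into the wall from (0,1) and (1,0)
poor_policy = {"(0,0)": "RIGHT", "(0,1)": "RIGHT", "(1,0)": "DOWN", "(1,1)": "RIGHT"}
V_old = evaluate(poor_policy)
better_policy = greedy(V_old)
V_new = evaluate(better_policy)

for s in STATES:
    print(f"{s}: {poor_policy[s]:>5} -> {better_policy[s]:>5}   "
          f"V: {V_old[s]:7.2f} -> {V_new[s]:7.2f}")
```

In this run the greedy step switches (0,1) to DOWN and (1,0) to RIGHT, and no state ends up with a lower value than before, which is what the theorem promises.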

**Generalized Policy Iteration (GPI)**

Combining policy evaluation and policy improvement gives us a powerful framework:

```
1. Initialize policy π arbitrarily
2. Repeat:
   a. Policy Evaluation: Compute V^π
   b. Policy Improvement: π' ← greedy(V^π)
   c. If π' = π, stop (optimal policy found)
   d. π ← π'
```

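This particular form of GPI, with a full policy evaluation before every improvement step, is called **Policy Iteration**. For finite MDPs it is guaranteed to converge to the optimal policy $\pi^*$!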

**Why GPI Works:**

- Evaluation makes the value function consistent with the current policy
- Improvement makes the policy greedy with respect to the current value function
- These two processes work together, pushing towards optimality
- Convergence is guaranteed for finite MDPs

Let's implement policy improvement!

In [None]:
def policy_improvement(mdp, V, gamma=0.9, old_policy=None):
    """Improve a policy by acting greedily with respect to value function.
    
    Args:
        mdp: MDP object
        V: Value function (dict: state -> value)
        gamma: Discount factor
        old_policy: Optional policy to compare the greedy policy against
        
    Returns:
        new_policy: Improved policy (dict: state -> {action: probability})
        policy_stable: True if old_policy was given and the greedy policy
            matches it, False otherwise
    """
    new_policy = {}
    
    for state in mdp.states:
        # Calculate Q(s,a) for all actions
        q_values = {}
        
        for action in mdp.actions:
            q = 0.0
            next_states = mdp.transitions.get((state, action), {})
            
            for next_state, trans_prob in next_states.items():
                reward = mdp.rewards.get((state, action, next_state), 0.0)
                q += trans_prob * (reward + gamma * V[next_state])
            
            q_values[action] = q
        
        # Choose action(s) with maximum Q-value
        if q_values:
            max_q = max(q_values.values())
            best_actions = [a for a, q in q_values.items() if np.isclose(q, max_q)]
            
            # Create deterministic policy (or uniform over best actions if tie)
            new_policy[state] = {a: 1.0/len(best_actions) for a in best_actions}
    
    # Stable means the greedy step left the policy unchanged
    policy_stable = old_policy is not None and policies_equal(old_policy, new_policy)
    return new_policy, policy_stable


def policy_iteration(mdp, gamma=0.9, theta=0.0001, max_iterations=100):
    """Find optimal policy using policy iteration.
    
    Args:
        mdp: MDP object
        gamma: Discount factor
        theta: Convergence threshold for policy evaluation
        max_iterations: Maximum number of policy iterations
        
    Returns:
        policy: Optimal policy
        V: Optimal value function
        num_iterations: Number of iterations
    """
    # Initialize with random policy
    policy = {}
    for state in mdp.states:
        # Uniform random policy
        policy[state] = {a: 1.0/len(mdp.actions) for a in mdp.actions}
    
    for iteration in range(max_iterations):
        # Policy Evaluation
        V, _ = policy_evaluation(mdp, policy, gamma, theta, max_iterations=1000)
        
        # Policy Improvement (also reports whether the policy changed)
        new_policy, policy_stable = policy_improvement(mdp, V, gamma, old_policy=policy)
        
        # Check if policy has converged
        if policy_stable:
            print(f"\nPolicy iteration converged in {iteration + 1} iterations")
            return new_policy, V, iteration + 1
        
        policy = new_policy
    
    print(f"\nPolicy iteration reached max iterations ({max_iterations})")
    return policy, V, max_iterations


def policies_equal(policy1, policy2):
    """Check if two policies are equal."""
    if set(policy1.keys()) != set(policy2.keys()):
        return False
    
    for state in policy1:
        actions1 = policy1[state]
        actions2 = policy2.get(state, {})
        
        if set(actions1.keys()) != set(actions2.keys()):
            return False
        
        for action in actions1:
            if not np.isclose(actions1[action], actions2.get(action, 0)):
                return False
    
    return True


# Demonstrate policy iteration on 2x2 grid
print("Policy Iteration Demonstration\n")
print("="*60)

print("Finding optimal policy for 2x2 Grid World...\n")

# Run policy iteration
optimal_policy, optimal_V, num_iters = policy_iteration(mdp, gamma=0.9, theta=0.0001)

print("\n" + "="*60)
print("\nOptimal Policy π*:")
print("-"*60)
for state in states:
    actions_str = ", ".join([f"{a}({p:.2f})" for a, p in optimal_policy[state].items() if p > 0])
    print(f"  π*({state}) = {actions_str}")

print("\nOptimal Value Function V*:")
print("-"*60)
for state in states:
    print(f"  V*({state}) = {optimal_V[state]:7.4f}")

print("\n" + "="*60)
print("\n💡 Interpretation:")
print("   - Policy iteration found the optimal policy")
print("   - From (0,0): RIGHT and DOWN tie (both reach the goal in two steps)")
print("   - From (0,1): Go DOWN to goal (1,1)")
print("   - From (1,0): Go RIGHT to goal (1,1)")
print("   - Every greedy path is a shortest path to the goal!")

#### Value Iteration: A More Efficient Approach

**Combining Evaluation and Improvement**

Policy iteration works well but can be slow because it fully evaluates each policy. **Value iteration** provides a more efficient alternative by combining evaluation and improvement into a single update.

**Value Iteration Algorithm:**

Instead of alternating between full policy evaluation and improvement, value iteration updates values using the Bellman optimality equation:

$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[R(s,a,s') + \gamma V_k(s')\right]$

**Algorithm:**

```
1. Initialize V(s) = 0 for all states
2. Repeat until convergence:
   For each state s:
     V(s) ← max_a Σ P(s'|s,a)[R(s,a,s') + γV(s')]
3. Extract policy: π(s) = argmax_a Σ P(s'|s,a)[R(s,a,s') + γV(s')]
```

**Key Differences from Policy Iteration:**

1. **No explicit policy**: Works directly with value function
2. **Single update**: Combines evaluation and improvement
3. **Cheaper iterations**: Each sweep is one backup per state, though it usually takes more sweeps than policy iteration takes outer iterations
4. **Simpler implementation**: No need to track policy during iteration

**Why Value Iteration Works:**

- Each update moves V closer to V*
- The max operator implicitly improves the policy
- Guaranteed to converge to V* (and thus π*)
- Convergence is geometric: the distance to V* shrinks by at least a factor of γ per sweep
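
Because the Bellman optimality backup is a $\gamma$-contraction in the max norm, the error after $k$ sweeps is at most $\gamma^k$ times the initial error. That gives a rough worst-case sweep count, sketched below. The function name `sweeps_needed` and the example numbers (initial error 10, tolerance $10^{-4}$) are made up for illustration, and real runs often stop much earlier than this bound suggests.

```python
import math


def sweeps_needed(gamma, initial_error, tolerance):
    """Smallest k such that gamma**k * initial_error <= tolerance."""
    if initial_error <= tolerance:
        return 0
    return math.ceil(math.log(tolerance / initial_error) / math.log(gamma))


for gamma in (0.5, 0.9, 0.99):
    k = sweeps_needed(gamma, initial_error=10.0, tolerance=1e-4)
    print(f"gamma={gamma}: at most {k} sweeps")
```

The bound grows quickly as $\gamma$ approaches 1, which is one reason problems with long effective horizons are slow to solve with plain value iteration.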

**Relationship to Policy Iteration:**

Value iteration is like policy iteration with just one sweep of policy evaluation per iteration. Both converge to the same optimal solution. Value iteration usually needs more iterations, but each one is much cheaper, so which method is faster overall depends on the problem.

Let's implement value iteration!

In [None]:
def value_iteration(mdp, gamma=0.9, theta=0.0001, max_iterations=1000):
    """Find optimal value function and policy using value iteration.
    
    Args:
        mdp: MDP object
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum number of iterations
        
    Returns:
        V: Optimal value function
        policy: Optimal policy
        num_iterations: Number of iterations
    """
    def q_value(state, action, V):
        """Expected return of taking `action` in `state`, then following V."""
        q = 0.0
        next_states = mdp.transitions.get((state, action), {})
        for next_state, trans_prob in next_states.items():
            reward = mdp.rewards.get((state, action, next_state), 0.0)
            q += trans_prob * (reward + gamma * V[next_state])
        return q
    
    # Initialize value function
    V = {s: 0.0 for s in mdp.states}
    num_iterations = max_iterations
    
    for iteration in range(max_iterations):
        delta = 0
        V_new = V.copy()
        
        # Update each state
        for state in mdp.states:
            action_values = [q_value(state, action, V) for action in mdp.actions]
            
            # Bellman optimality update
            if action_values:
                V_new[state] = max(action_values)
                delta = max(delta, abs(V_new[state] - V[state]))
        
        V = V_new
        
        # Check convergence
        if delta < theta:
            num_iterations = iteration + 1
            print(f"Value iteration converged in {num_iterations} iterations")
            break
    else:
        # Runs only if the loop finished without hitting the break above
        print(f"Value iteration reached max iterations ({max_iterations})")
    
    # Extract the greedy policy (also done when the iteration cap was reached)
    policy = {}
    for state in mdp.states:
        q_values = {action: q_value(state, action, V) for action in mdp.actions}
        
        if q_values:
            max_q = max(q_values.values())
            best_actions = [a for a, q in q_values.items() if np.isclose(q, max_q)]
            policy[state] = {a: 1.0/len(best_actions) for a in best_actions}
    
    return V, policy, num_iterations


# Demonstrate value iteration
print("Value Iteration Demonstration\n")
print("="*60)

print("Finding optimal policy for 2x2 Grid World using Value Iteration...\n")

# Run value iteration
V_opt, policy_opt, num_iters_vi = value_iteration(mdp, gamma=0.9, theta=0.0001)

print("\n" + "="*60)
print("\nOptimal Policy π* (from Value Iteration):")
print("-"*60)
for state in states:
    actions_str = ", ".join([f"{a}({p:.2f})" for a, p in policy_opt[state].items() if p > 0])
    print(f"  π*({state}) = {actions_str}")

print("\nOptimal Value Function V*:")
print("-"*60)
for state in states:
    print(f"  V*({state}) = {V_opt[state]:7.4f}")

print("\n" + "="*60)
print("\n💡 Comparison: Policy Iteration vs Value Iteration")
print("\n  Both methods found the same optimal solution!")
print(f"  Policy Iteration: {num_iters} iterations (each with a full policy evaluation)")
print(f"  Value Iteration: {num_iters_vi} iterations (one sweep each)")
print("\n  The counts are not directly comparable: value iteration")
print("  sweeps are cheap, but it usually needs more of them.")

#### Visualizing Value Iteration on 4x4 Grid

In [None]:
# Apply value iteration to the larger 4x4 grid world
print("Value Iteration on 4x4 Grid World\n")
print("="*60)

print("Running value iteration on 4x4 grid with obstacles...\n")

# Run value iteration
V_grid_opt, policy_grid_opt, num_iters_grid = value_iteration(grid_mdp, gamma=0.9, theta=0.001)

# Visualize optimal value function
value_grid_opt = np.full((4, 4), np.nan)
for state_str, value in V_grid_opt.items():
    state = eval(state_str)
    value_grid_opt[state[0], state[1]] = value

# Create visualization with optimal policy arrows
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Plot 1: Optimal Value Function
masked_grid = np.ma.masked_where(np.isnan(value_grid_opt), value_grid_opt)
im1 = ax1.imshow(masked_grid, cmap='RdYlGn', aspect='auto', interpolation='nearest')
ax1.set_xticks(range(4))
ax1.set_yticks(range(4))
ax1.set_xlabel('Column', fontsize=12, fontweight='bold')
ax1.set_ylabel('Row', fontsize=12, fontweight='bold')
ax1.set_title('Optimal Value Function V*(s)', fontsize=14, fontweight='bold')

# Add value labels
for i in range(4):
    for j in range(4):
        if not np.isnan(value_grid_opt[i, j]):
            ax1.text(j, i, f'{value_grid_opt[i, j]:.2f}',
                    ha="center", va="center", color="black", 
                    fontsize=10, fontweight='bold')
        else:
            ax1.text(j, i, 'X', ha="center", va="center", 
                    color="white", fontsize=18, fontweight='bold')

plt.colorbar(im1, ax=ax1, label='Value')

# Plot 2: Optimal Policy
ax2.set_xlim(-0.5, 3.5)
ax2.set_ylim(-0.5, 3.5)
ax2.set_aspect('equal')
ax2.set_xticks(range(4))
ax2.set_yticks(range(4))
ax2.set_xlabel('Column', fontsize=12, fontweight='bold')
ax2.set_ylabel('Row', fontsize=12, fontweight='bold')
ax2.set_title('Optimal Policy π*(s)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.invert_yaxis()

# Draw obstacles (they are not states, so they never appear in the policy dict)
for (i, j) in [(1, 1), (2, 2)]:
    ax2.add_patch(plt.Rectangle((j-0.4, i-0.4), 0.8, 0.8, 
                                fill=True, facecolor='gray', edgecolor='black', linewidth=2))
    ax2.text(j, i, 'X', ha='center', va='center', 
            color='white', fontsize=18, fontweight='bold')

# Draw policy arrows
arrow_map = {'UP': (0, -0.3), 'DOWN': (0, 0.3), 'LEFT': (-0.3, 0), 'RIGHT': (0.3, 0)}

for state_str, actions in policy_grid_opt.items():
    state = eval(state_str)
    i, j = state
    
    # No arrows at the goal; it is marked separately below
    if (i, j) == (3, 3):
        continue
    
    # Draw arrows for best action(s)
    for action, prob in actions.items():
        if prob > 0.1:  # Only draw if significant probability
            dx, dy = arrow_map.get(action, (0, 0))
            ax2.arrow(j, i, dx, dy, head_width=0.15, head_length=0.1,
                     fc='blue', ec='blue', linewidth=2, alpha=0.7)

# Mark goal
ax2.add_patch(plt.Circle((3, 3), 0.3, fill=True, facecolor='gold', 
                         edgecolor='darkgreen', linewidth=3))
ax2.text(3, 3, 'G', ha='center', va='center', 
        color='darkgreen', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Results:")
print("="*60)
print(f"\nConverged in {num_iters_grid} iterations")
print("\nOptimal policy shows the best action in each state")
print("Arrows point towards the goal, avoiding obstacles")

print("\n💡 Key Insights:")
print("   1. Value iteration found the optimal policy efficiently")
print("   2. Policy directs agent towards goal from any state")
print("   3. Obstacles are naturally avoided")
print("   4. V*(s) reflects optimal expected return from each state")
print("   5. This is the foundation for solving MDPs!")

#### Model-Based vs Model-Free Reinforcement Learning

**A Fundamental Distinction in RL**

Now that we've seen Dynamic Programming in action, it's important to understand a fundamental distinction in reinforcement learning: **model-based** vs **model-free** approaches.

**Model-Based Reinforcement Learning**

**Definition:** The agent has (or learns) a model of the environment's dynamics.

**What is a "model"?**
- Transition probabilities: $P(s'|s,a)$
- Reward function: $R(s,a,s')$
- Essentially, knowledge of how the environment works
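
As a concrete illustration, a tabular model can be stored as a lookup from `(state, action)` to a list of `(probability, next_state, reward)` outcomes. The sketch below is hypothetical: the two-state MDP, the names `model` and `expected_backup`, and all numbers are made up for illustration and are not the `grid_mdp` used earlier. It shows how knowing $P$ and $R$ lets us compute an expected one-step backup without touching the environment.

```python
# Hypothetical two-state MDP; all names and numbers are illustrative assumptions.
# model[(state, action)] -> list of (probability, next_state, reward)
model = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}


def expected_backup(model, V, state, action, gamma=0.9):
    """One-step lookahead: sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in model[(state, action)])


# An arbitrary current value estimate, used only to demonstrate the backup
V = {"s0": 0.0, "s1": 2.0}
q_go = expected_backup(model, V, "s0", "go")
q_stay = expected_backup(model, V, "s0", "stay")
print(f"Q(s0, go) = {q_go:.2f}, Q(s0, stay) = {q_stay:.2f}")
```

A model-free agent has no access to a table like `model`; it only sees the sampled `(next_state, reward)` that the environment returns after each action.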

**Examples:**
- Dynamic Programming (what we just learned!)
- Planning algorithms
- Simulators (e.g., chess, Go, robotics simulators)
- Learned models (agent learns $P$ and $R$ from experience)

**Advantages of Model-Based RL:**

1. **Sample Efficiency**: Can plan without interacting with environment
   - Simulate many trajectories mentally
   - No need to try every action in every state
   - Particularly valuable when real-world interactions are expensive

2. **Faster Learning**: Can use planning algorithms
   - With an exact model of a finite MDP, Dynamic Programming converges to the optimal solution
   - Can reason about consequences before acting
   - Update values for all states simultaneously

3. **Generalization**: Model can be used for multiple tasks
   - Same model, different reward functions
   - Transfer learning across related problems
   - What-if analysis and counterfactual reasoning

4. **Interpretability**: Can understand and debug the model
   - Inspect transition probabilities
   - Verify model correctness
   - Explain agent's reasoning

**Disadvantages of Model-Based RL:**

1. **Model Errors**: If model is wrong, policy will be suboptimal
   - "All models are wrong, but some are useful"
   - Model errors compound over long horizons
   - Difficult to model complex, stochastic environments

2. **Computational Cost**: Planning can be expensive
   - Need to solve Bellman equations
   - Scales poorly with state/action space size
   - Curse of dimensionality

3. **Model Learning**: Learning accurate models is hard
   - Requires lots of data
   - High-dimensional state spaces are challenging
   - Stochastic environments are difficult to model

4. **Availability**: Many real-world problems lack good models
   - Human behavior is hard to model
   - Complex physical systems
   - Unknown or changing dynamics

**Model-Free Reinforcement Learning**

**Definition:** The agent learns directly from experience without a model.

**What does "model-free" mean?**
- No knowledge of $P(s'|s,a)$ or $R(s,a,s')$
- Learns value functions or policies directly
- Trial-and-error learning

**Examples:**
- Monte Carlo methods
- Temporal Difference learning (TD, SARSA, Q-learning)
- Policy gradient methods
- Deep RL (DQN, A3C, PPO)

**Advantages of Model-Free RL:**

1. **No Model Required**: Works when model is unknown or complex
   - Don't need to know environment dynamics
   - Can handle black-box environments
   - Robust to model misspecification

2. **Simpler**: Often easier to implement
   - Direct learning from experience
   - No need to learn or maintain a model
   - Fewer components to debug

3. **Scalability**: Can use function approximation
   - Neural networks for large state spaces
   - Generalization across states
   - Handles continuous state/action spaces

4. **Robustness**: Less sensitive to model errors
   - Learns from actual experience
   - Adapts to environment changes
   - No compounding of model errors

**Disadvantages of Model-Free RL:**

1. **Sample Inefficiency**: Requires many interactions
   - Must try actions to learn their value
   - Can't simulate or plan ahead
   - Expensive in real-world applications

2. **Slower Learning**: No planning capability
   - Must experience each state-action pair
   - Can't reason about consequences
   - Updates are local (one state at a time)

3. **Exploration Challenges**: Must balance exploration/exploitation
   - Risk of getting stuck in local optima
   - May miss better strategies
   - Requires careful exploration strategy

4. **No Generalization Across Tasks**: Learns for specific reward
   - Must relearn if reward function changes
   - Limited transfer learning
   - Task-specific knowledge

**Why Model-Free Methods Are Needed**

Despite the advantages of model-based RL, model-free methods are essential because:

1. **Real-World Complexity**: Most real-world environments are too complex to model accurately
   - Human interactions, market dynamics, weather patterns
   - High-dimensional, stochastic, non-stationary

2. **Unknown Dynamics**: Often we don't know how the environment works
   - Black-box systems
   - Proprietary or inaccessible internals

3. **Model Errors Are Costly**: Wrong models lead to wrong policies
   - Model-free methods learn from ground truth
   - More robust in practice

4. **Scalability**: Model-free + function approximation handles large spaces
   - Deep RL successes (e.g., DQN on Atari, robotic control)
   - Continuous control problems

5. **Simplicity**: Easier to implement and debug
   - Fewer moving parts
   - Direct optimization of objective

**The Spectrum: Hybrid Approaches**

Modern RL often combines both approaches:

- **Dyna**: Model-free learning + model-based planning
- **Model-Based RL with Learned Models**: Learn model from data, use for planning
- **Imagination-Augmented Agents**: Use model for auxiliary predictions
- **World Models**: Learn compact model, train policy in model
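
To make the hybrid idea concrete, here is a minimal tabular Dyna-Q sketch. Everything in it is an assumption made for illustration: the five-state corridor, the `step` function, the hyperparameters, and the deterministic `learned_model`. Each real step triggers one model-free Q-learning update, is recorded in the learned model, and is then followed by several simulated updates replayed from that model.

```python
import random
from collections import defaultdict

# Hypothetical 5-state corridor: start in state 0, reward 1 on reaching state 4.
GOAL = 4
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """True environment dynamics (the agent only sees sampled outcomes)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def dyna_q(episodes=50, planning_steps=10, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)]
    learned_model = {}       # (state, action) -> (next_state, reward, done)

    def update(s, a, r, s2, done):
        # Standard Q-learning target; no bootstrapping from terminal states
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)              # model-free: learn from the real step
            learned_model[(s, a)] = (s2, r, done)  # model learning: remember the outcome
            for _ in range(planning_steps):        # planning: replay simulated steps
                ps, pa = rng.choice(list(learned_model))
                ps2, pr, pdone = learned_model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q


Q = dyna_q()
greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
print("Greedy action per non-terminal state (1 = right):", greedy)
```

Setting `planning_steps=0` turns this into plain Q-learning, which is a convenient way to compare how many real interactions each variant needs.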

**When to Use Each Approach?**

**Use Model-Based RL when:**
- You have an accurate model (simulator, known dynamics)
- Sample efficiency is critical (expensive interactions)
- State space is small enough for planning
- You need interpretability
- Multiple tasks with same dynamics

**Use Model-Free RL when:**
- Model is unknown or too complex
- Environment is high-dimensional
- You have abundant data/interactions
- Robustness to model errors is important
- Simplicity is preferred

Let's visualize this distinction:

In [None]:
# Create a comparison visualization
import matplotlib.patches as mpatches

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

# Model-Based RL Diagram
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('Model-Based Reinforcement Learning', fontsize=16, fontweight='bold', pad=20)

# Agent
agent_box = mpatches.FancyBboxPatch((1, 7), 2, 1.5, boxstyle="round,pad=0.1", 
                                    edgecolor='blue', facecolor='lightblue', linewidth=3)
ax1.add_patch(agent_box)
ax1.text(2, 7.75, 'Agent', ha='center', va='center', fontsize=13, fontweight='bold')

# Model
model_box = mpatches.FancyBboxPatch((4, 7), 2, 1.5, boxstyle="round,pad=0.1", 
                                    edgecolor='green', facecolor='lightgreen', linewidth=3)
ax1.add_patch(model_box)
ax1.text(5, 7.75, 'Model\nP(s\'|s,a)\nR(s,a,s\')', ha='center', va='center', fontsize=11, fontweight='bold')

# Environment
env_box = mpatches.FancyBboxPatch((7, 7), 2, 1.5, boxstyle="round,pad=0.1", 
                                  edgecolor='red', facecolor='lightcoral', linewidth=3)
ax1.add_patch(env_box)
ax1.text(8, 7.75, 'Environment', ha='center', va='center', fontsize=13, fontweight='bold')

# Arrows
ax1.annotate('', xy=(4, 7.75), xytext=(3, 7.75), 
            arrowprops=dict(arrowstyle='->', lw=2, color='black'))
ax1.text(3.5, 8.2, 'Query', ha='center', fontsize=10)

ax1.annotate('', xy=(7, 7.75), xytext=(6, 7.75), 
            arrowprops=dict(arrowstyle='->', lw=2, color='black'))
ax1.text(6.5, 8.2, 'Action', ha='center', fontsize=10)

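# State and reward flow from the environment back to the agent,
# so the arrow heads leave the environment box and enter the agent box
ax1.annotate('', xy=(8, 6.5), xytext=(8, 7), 
            arrowprops=dict(arrowstyle='->', lw=2, color='black'))
ax1.annotate('', xy=(2, 7), xytext=(2, 6.5), 
            arrowprops=dict(arrowstyle='->', lw=2, color='black'))
ax1.text(5, 6.2, 'State, Reward', ha='center', fontsize=10)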

# Planning
plan_box = mpatches.FancyBboxPatch((1, 4), 5, 1.5, boxstyle="round,pad=0.1", 
                                   edgecolor='purple', facecolor='plum', linewidth=3, linestyle='--')
ax1.add_patch(plan_box)
ax1.text(3.5, 4.75, 'Planning\n(DP, Value Iteration)', ha='center', va='center', 
        fontsize=12, fontweight='bold')

ax1.annotate('', xy=(3.5, 5.5), xytext=(3.5, 7), 
            arrowprops=dict(arrowstyle='<->', lw=2, color='purple', linestyle='--'))

# Advantages/Disadvantages
ax1.text(5, 2.5, 'Advantages:', fontsize=12, fontweight='bold', color='green')
ax1.text(5, 2, '• Sample efficient', fontsize=10, ha='center')
ax1.text(5, 1.6, '• Can plan ahead', fontsize=10, ha='center')
ax1.text(5, 1.2, '• Fast learning', fontsize=10, ha='center')

ax1.text(5, 0.5, 'Disadvantages:', fontsize=12, fontweight='bold', color='red')
ax1.text(5, 0, '• Requires accurate model', fontsize=10, ha='center')
ax1.text(5, -0.4, '• Model errors compound', fontsize=10, ha='center')

# Model-Free RL Diagram
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('Model-Free Reinforcement Learning', fontsize=16, fontweight='bold', pad=20)

# Agent
agent_box2 = mpatches.FancyBboxPatch((2, 7), 2.5, 1.5, boxstyle="round,pad=0.1", 
                                     edgecolor='blue', facecolor='lightblue', linewidth=3)
ax2.add_patch(agent_box2)
ax2.text(3.25, 7.75, 'Agent\n(Q-learning,\nPolicy Gradient)', ha='center', va='center', 
        fontsize=11, fontweight='bold')

# Environment
env_box2 = mpatches.FancyBboxPatch((5.5, 7), 2.5, 1.5, boxstyle="round,pad=0.1", 
                                   edgecolor='red', facecolor='lightcoral', linewidth=3)
ax2.add_patch(env_box2)
ax2.text(6.75, 7.75, 'Environment', ha='center', va='center', fontsize=13, fontweight='bold')

# Direct interaction arrows
ax2.annotate('', xy=(5.5, 8), xytext=(4.5, 8), 
            arrowprops=dict(arrowstyle='->', lw=3, color='black'))
ax2.text(5, 8.5, 'Action', ha='center', fontsize=11, fontweight='bold')

ax2.annotate('', xy=(4.5, 7.5), xytext=(5.5, 7.5), 
            arrowprops=dict(arrowstyle='->', lw=3, color='black'))
ax2.text(5, 7, 'State, Reward', ha='center', fontsize=11, fontweight='bold')

# Direct learning
learn_box = mpatches.FancyBboxPatch((2, 4.5), 6, 1.5, boxstyle="round,pad=0.1", 
                                    edgecolor='orange', facecolor='lightyellow', linewidth=3)
ax2.add_patch(learn_box)
ax2.text(5, 5.25, 'Direct Learning from Experience\n(No Model)', ha='center', va='center', 
        fontsize=12, fontweight='bold')

ax2.annotate('', xy=(5, 6), xytext=(5, 7), 
            arrowprops=dict(arrowstyle='<->', lw=2, color='orange'))

# Advantages/Disadvantages
ax2.text(5, 2.5, 'Advantages:', fontsize=12, fontweight='bold', color='green')
ax2.text(5, 2, '• No model required', fontsize=10, ha='center')
ax2.text(5, 1.6, '• Robust to model errors', fontsize=10, ha='center')
ax2.text(5, 1.2, '• Simpler implementation', fontsize=10, ha='center')

ax2.text(5, 0.5, 'Disadvantages:', fontsize=12, fontweight='bold', color='red')
ax2.text(5, 0, '• Sample inefficient', fontsize=10, ha='center')
ax2.text(5, -0.4, '• Slower learning', fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print("\n📊 Model-Based vs Model-Free RL")
print("="*60)
print("\nModel-Based RL (e.g., Dynamic Programming):")
print("  • Uses model of environment (P, R)")
print("  • Can plan without interacting")
print("  • Sample efficient but requires accurate model")
print("  • Examples: DP, Dyna, Model-based planning")

print("\nModel-Free RL (e.g., Q-learning):")
print("  • Learns directly from experience")
print("  • No model of environment needed")
print("  • Sample inefficient but robust")
print("  • Examples: Monte Carlo, TD, Q-learning, Policy Gradients")

print("\n" + "="*60)
print("\n🎯 Key Takeaway:")
print("\n   Dynamic Programming is model-based - it requires perfect")
print("   knowledge of the environment. In the next sections, we'll")
print("   learn model-free methods that work without this knowledge!")
print("\n   Model-free methods are essential for real-world RL where")
print("   we don't have access to perfect models.")

**Summary: Dynamic Programming and Learning Paradigms**

In this section, we've covered:

1. **Policy Evaluation**: Computing $V^\pi(s)$ for a given policy using iterative Bellman updates
2. **Policy Improvement**: Finding better policies by acting greedily with respect to value functions
3. **Value Iteration**: Efficiently finding optimal policies by combining evaluation and improvement
4. **Model-Based vs Model-Free**: Understanding when we need models and when we can learn without them

**Key Insights:**

- Dynamic Programming provides exact solutions when we have perfect models
- Policy iteration and value iteration both converge to optimal policies
- The Bellman equations are the foundation for all these algorithms
- Model-based methods are sample-efficient but require accurate models
- Model-free methods are often the practical choice when no accurate model is available

**What's Next:**

In the following sections, we'll explore model-free methods that learn directly from experience:
- Monte Carlo methods (learn from complete episodes)
- Temporal Difference learning (learn from every step)
- Q-learning (off-policy TD control)
- Deep RL (handling large state spaces)

These methods form the foundation of modern reinforcement learning!

<a id='section2'></a>
## Section 2: Core Algorithms

In this section, we'll explore the fundamental algorithms that enable agents to learn optimal policies. We'll start with Monte Carlo methods, which learn from complete episodes, then progress to Temporal Difference methods that can learn from individual steps.

<a id='monte-carlo'></a>
### Monte Carlo Methods

**Learning from Complete Episodes**

Monte Carlo (MC) methods are a class of reinforcement learning algorithms that learn by averaging sample returns from complete episodes. Unlike Dynamic Programming, MC methods don't require a model of the environment - they learn directly from experience.

**The Core Principle:**

Monte Carlo methods estimate value functions by **averaging the actual returns** observed after visiting states. The key insight is:

*"The value of a state is the expected return starting from that state. If we experience many episodes and average the returns, we'll get a good estimate of the true value."*

**Key Characteristics:**

1. **Episode-Based Learning**: Must wait until episode ends to update values
2. **Model-Free**: Don't need to know transition probabilities or rewards
3. **Sample-Based**: Learn from actual experience, not from a model
4. **Unbiased Estimates**: Returns are actual outcomes, not bootstrapped estimates
5. **High Variance**: Individual returns can vary significantly

**When to Use Monte Carlo Methods:**

- ✓ Episodic tasks (games, simulations with clear endings)
- ✓ When you don't have a model of the environment
- ✓ When you can simulate or experience complete episodes
- ✗ Continuing tasks (no natural episode boundaries)
- ✗ When episodes are very long (slow learning)

**Comparison with Dynamic Programming:**

| Aspect | Dynamic Programming | Monte Carlo |
|--------|-------------------|-------------|
| Model Required | Yes (need P and R) | No (model-free) |
| Learning | From model | From experience |
| Updates | Every state | Only visited states |
| Bootstrapping | Yes (use estimates) | No (use actual returns) |
| Variance | Low | High |
| Bias | Depends on initialization | Unbiased |

Let's explore Monte Carlo methods in detail!

#### Monte Carlo Prediction: Estimating Value Functions

**The Goal:** Estimate the state-value function $V^\pi(s)$ for a given policy $\pi$.

**The Approach:** 
1. Follow policy $\pi$ to generate episodes
2. For each state visited, record the return (cumulative discounted reward)
3. Average the returns to estimate the value

**Two Variants: First-Visit vs Every-Visit MC**

**First-Visit Monte Carlo:**
- Only count the **first time** a state is visited in an episode
- Average returns from first visits only
- Each first-visit return is an independent, unbiased sample of $V^\pi(s)$, so the average is guaranteed to converge to the true value
- More commonly used in practice

**Every-Visit Monte Carlo:**
- Count **every time** a state is visited in an episode
- Average returns from all visits
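- Also converges to the true value, although its estimates are biased for finite samples because returns from the same episode are correlated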
- Can learn faster in some cases

**Mathematical Formulation:**

For a state $s$ visited at time $t$ in an episode:

**Return from that visit** (for an episode that terminates at time $T$):
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$

**Value estimate (after $n$ visits):**
$V(s) = \frac{1}{n} \sum_{i=1}^{n} G_i(s)$

where $G_i(s)$ is the return following the $i$-th visit to state $s$.
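
The two variants can be compared on a single hand-made episode. The sketch below is illustrative only: the episode, the state names `A` and `B`, and `gamma = 0.5` are assumptions chosen so the arithmetic is easy to follow. State `A` is visited at steps 0 and 2, so first-visit MC uses only $G_0$ for it, while every-visit MC averages $G_0$ and $G_2$.

```python
# Hypothetical episode: (state, reward received after acting in that state)
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0), ("B", 3.0)]
gamma = 0.5

# Compute G_t for every step by working backwards: G_t = R_{t+1} + gamma * G_{t+1}
returns = [0.0] * len(episode)
G = 0.0
for t in range(len(episode) - 1, -1, -1):
    G = episode[t][1] + gamma * G
    returns[t] = G

first_visit, every_visit = {}, {}
for t, (state, _) in enumerate(episode):
    every_visit.setdefault(state, []).append(returns[t])  # record every occurrence
    first_visit.setdefault(state, [returns[t]])           # keeps only the first return

print("Returns per step:", returns)
for state in ("A", "B"):
    fv = sum(first_visit[state]) / len(first_visit[state])
    ev = sum(every_visit[state]) / len(every_visit[state])
    print(f"{state}: first-visit estimate = {fv}, every-visit estimate = {ev}")
```

With a single episode the two estimates differ noticeably; averaged over many episodes, both approach $V^\pi(s)$.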

**Incremental Update Formula:**

Instead of storing all returns and averaging, we can update incrementally:

$V(s) \leftarrow V(s) + \frac{1}{N(s)} [G - V(s)]$

where:
- $N(s)$ = number of times state $s$ has been visited
- $G$ = return observed from this visit
- $\frac{1}{N(s)}$ = step size (learning rate)

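For non-stationary problems, a common variant replaces $\frac{1}{N(s)}$ with a constant step size $\alpha \in (0, 1]$:
$V(s) \leftarrow V(s) + \alpha [G - V(s)]$

This is no longer an exact sample average: it weights recent returns more heavily, which lets the estimate track values that change over time.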
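
The following sketch checks numerically that the $\frac{1}{N(s)}$ update reproduces the batch average, and shows what a constant $\alpha$ does instead. The list `returns_seen` and the choice `alpha = 0.5` are made-up values for illustration.

```python
returns_seen = [4.0, 8.0, 6.0, 2.0]  # hypothetical returns observed for one state

V_avg, n = 0.0, 0           # sample-average estimate, step size 1/N(s)
V_const, alpha = 0.0, 0.5   # constant step-size estimate

for G in returns_seen:
    n += 1
    V_avg += (G - V_avg) / n            # V <- V + (1/N) * (G - V)
    V_const += alpha * (G - V_const)    # V <- V + alpha * (G - V)

batch_mean = sum(returns_seen) / len(returns_seen)
print(f"batch mean = {batch_mean}, incremental = {V_avg}, constant-alpha = {V_const}")
```

The incremental estimate matches the batch mean, while the constant-$\alpha$ estimate is pulled toward the most recent returns.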

Let's implement both first-visit and every-visit Monte Carlo prediction:

In [None]:
def generate_episode(env, policy, max_steps=100):
    """Generate an episode following a given policy.
    
    Args:
        env: Environment with reset() and step() methods
        policy: Function that takes state and returns action
        max_steps: Maximum steps per episode
        
    Returns:
        episode: List of (state, action, reward) tuples
    """
    episode = []
    state = env.reset()
    
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        
        episode.append((state, action, reward))
        
        if done:
            break
        
        state = next_state
    
    return episode


def calculate_returns(episode, gamma=0.9):
    """Calculate returns for each step in an episode.
    
    Args:
        episode: List of (state, action, reward) tuples
        gamma: Discount factor
        
    Returns:
        returns: List of returns, one for each step
    """
    returns = []
    G = 0
    
    # Accumulate returns backwards from the end of the episode
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)
    
    # Reverse once so that returns[t] lines up with episode[t]
    returns.reverse()
    return returns


def mc_prediction_first_visit(env, policy, num_episodes=1000, gamma=0.9):
    """First-visit Monte Carlo prediction.
    
    Estimates V(s) by averaging returns from first visits to each state.
    
    Args:
        env: Environment
        policy: Policy to evaluate (function: state -> action)
        num_episodes: Number of episodes to generate
        gamma: Discount factor
        
    Returns:
        V: Dictionary mapping states to estimated values
        returns_history: List of returns for each episode (for visualization)
    """
    # Initialize value function and visit counts
    V = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    returns_history = []
    
    for episode_num in range(num_episodes):
        # Generate episode
        episode = generate_episode(env, policy)
        
        # Calculate returns
        returns = calculate_returns(episode, gamma)
        
        # Track total return for this episode
        returns_history.append(returns[0] if returns else 0)
        
        # Track which states we've seen in this episode (for first-visit)
        visited_states = set()
        
        # Update value estimates
        for t, (state, action, reward) in enumerate(episode):
            # First-visit: only update if this is the first time seeing this state
            if state not in visited_states:
                visited_states.add(state)
                
                # Add return to sum
                returns_sum[state] += returns[t]
                returns_count[state] += 1
                
                # Update value estimate (average of returns)
                V[state] = returns_sum[state] / returns_count[state]
    
    return dict(V), returns_history


def mc_prediction_every_visit(env, policy, num_episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo prediction.
    
    Estimates V(s) by averaging returns from all visits to each state.
    
    Args:
        env: Environment
        policy: Policy to evaluate (function: state -> action)
        num_episodes: Number of episodes to generate
        gamma: Discount factor
        
    Returns:
        V: Dictionary mapping states to estimated values
        returns_history: List of returns for each episode (for visualization)
    """
    # Initialize value function and visit counts
    V = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    returns_history = []
    
    for episode_num in range(num_episodes):
        # Generate episode
        episode = generate_episode(env, policy)
        
        # Calculate returns
        returns = calculate_returns(episode, gamma)
        
        # Track total return for this episode
        returns_history.append(returns[0] if returns else 0)
        
        # Update value estimates
        for t, (state, action, reward) in enumerate(episode):
            # Every-visit: update for every occurrence of the state
            returns_sum[state] += returns[t]
            returns_count[state] += 1
            
            # Update value estimate (average of returns)
            V[state] = returns_sum[state] / returns_count[state]
    
    return dict(V), returns_history


print("Monte Carlo Prediction Implementation\n")
print("="*60)
print("\nImplemented:")
print("  ✓ First-Visit MC Prediction")
print("  ✓ Every-Visit MC Prediction")
print("  ✓ Episode generation")
print("  ✓ Return calculation")
print("\nReady to evaluate policies on episodic environments!")

#### Demonstrating Monte Carlo Prediction on Grid World

In [None]:
# Create a simple policy for the grid world
def random_policy(state):
    """Random policy: choose actions uniformly at random."""
    return np.random.randint(0, 4)  # 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT


def greedy_policy(state):
    """Goal-seeking heuristic: move right, then down, toward the goal (4,4), ignoring obstacles."""
    row, col = state
    goal_row, goal_col = 4, 4
    
    # Move right if not at rightmost column
    if col < goal_col:
        return 1  # RIGHT
    # Move down if not at bottom row
    elif row < goal_row:
        return 2  # DOWN
    # Otherwise move randomly
    else:
        return np.random.randint(0, 4)


print("Evaluating Policies with Monte Carlo Prediction\n")
print("="*60)

# Create environment
env = GridWorldEnvironment(grid_size=5)

print("\nEnvironment: 5x5 Grid World")
print(f"  Start: (0,0)")
print(f"  Goal: {env.goal_pos}")
print(f"  Obstacles: {env.obstacles}")

# Evaluate random policy
print("\n" + "-"*60)
print("Evaluating Random Policy with First-Visit MC...")
V_random_fv, returns_random_fv = mc_prediction_first_visit(
    env, random_policy, num_episodes=5000, gamma=0.9
)

print("\nEvaluating Random Policy with Every-Visit MC...")
V_random_ev, returns_random_ev = mc_prediction_every_visit(
    env, random_policy, num_episodes=5000, gamma=0.9
)

# Evaluate greedy policy
print("\nEvaluating Greedy Policy with First-Visit MC...")
V_greedy_fv, returns_greedy_fv = mc_prediction_first_visit(
    env, greedy_policy, num_episodes=5000, gamma=0.9
)

print("\nEvaluating Greedy Policy with Every-Visit MC...")
V_greedy_ev, returns_greedy_ev = mc_prediction_every_visit(
    env, greedy_policy, num_episodes=5000, gamma=0.9
)

print("\n" + "="*60)
print("\n✓ Evaluation complete!")
print(f"\nRandom Policy:")
print(f"  States evaluated: {len(V_random_fv)}")
print(f"  Start state value (First-Visit): {V_random_fv.get((0,0), 0):.2f}")
print(f"  Start state value (Every-Visit): {V_random_ev.get((0,0), 0):.2f}")

print(f"\nGreedy Policy:")
print(f"  States evaluated: {len(V_greedy_fv)}")
print(f"  Start state value (First-Visit): {V_greedy_fv.get((0,0), 0):.2f}")
print(f"  Start state value (Every-Visit): {V_greedy_ev.get((0,0), 0):.2f}")

print("\n💡 Observation: The greedy policy should have the higher value if it reaches the goal faster")

#### Visualizing Value Function Convergence

In [None]:
# Plot smoothed per-episode returns (top row) and the estimated value functions (bottom row)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Random Policy - Returns over episodes
ax = axes[0, 0]
window = 100
smoothed_random_fv = np.convolve(returns_random_fv, np.ones(window)/window, mode='valid')
smoothed_random_ev = np.convolve(returns_random_ev, np.ones(window)/window, mode='valid')

ax.plot(smoothed_random_fv, label='First-Visit', linewidth=2, alpha=0.8)
ax.plot(smoothed_random_ev, label='Every-Visit', linewidth=2, alpha=0.8)
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Average Return', fontsize=11)
ax.set_title('Random Policy: Return Convergence', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Greedy Policy - Returns over episodes
ax = axes[0, 1]
smoothed_greedy_fv = np.convolve(returns_greedy_fv, np.ones(window)/window, mode='valid')
smoothed_greedy_ev = np.convolve(returns_greedy_ev, np.ones(window)/window, mode='valid')

ax.plot(smoothed_greedy_fv, label='First-Visit', linewidth=2, alpha=0.8, color='green')
ax.plot(smoothed_greedy_ev, label='Every-Visit', linewidth=2, alpha=0.8, color='lightgreen')
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Average Return', fontsize=11)
ax.set_title('Greedy Policy: Return Convergence', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Value function heatmap for Random Policy (First-Visit)
ax = axes[1, 0]
value_grid = np.zeros((5, 5))
for (row, col), value in V_random_fv.items():
    value_grid[row, col] = value

im = ax.imshow(value_grid, cmap='RdYlGn', aspect='auto')
ax.set_title('Random Policy: Value Function (First-Visit)', fontsize=12, fontweight='bold')
ax.set_xlabel('Column', fontsize=11)
ax.set_ylabel('Row', fontsize=11)

# Add value labels
for i in range(5):
    for j in range(5):
        text = ax.text(j, i, f'{value_grid[i, j]:.1f}',
                      ha="center", va="center", color="black", fontsize=10)

plt.colorbar(im, ax=ax)

# Plot 4: Value function heatmap for Greedy Policy (First-Visit)
ax = axes[1, 1]
value_grid = np.zeros((5, 5))
for (row, col), value in V_greedy_fv.items():
    value_grid[row, col] = value

im = ax.imshow(value_grid, cmap='RdYlGn', aspect='auto')
ax.set_title('Greedy Policy: Value Function (First-Visit)', fontsize=12, fontweight='bold')
ax.set_xlabel('Column', fontsize=11)
ax.set_ylabel('Row', fontsize=11)

# Add value labels
for i in range(5):
    for j in range(5):
        text = ax.text(j, i, f'{value_grid[i, j]:.1f}',
                      ha="center", va="center", color="black", fontsize=10)

plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()

print("\n📊 Visualization Insights:")
print("\n1. Top Row: Smoothed per-episode returns (100-episode moving average)")
print("   - Each policy is fixed, so the curves fluctuate around a constant mean")
print("   - First-visit and every-visit curves differ only by sampling noise")
print("   - Compare the two panels to see which policy earns higher returns")

print("\n2. Bottom Row: Value function heatmaps")
print("   - Greener cells = higher estimated values; unvisited cells show 0.0")
print("   - Values should increase as we approach goal state (4,4)")
print("   - Compare the color scales to see which policy has higher values")

print("\n3. Key Takeaway:")
print("   Monte Carlo successfully estimates state values from experience!")
print("   No model required - just sample episodes and average returns.")

#### On-Policy Monte Carlo Control

**From Prediction to Control: Learning Optimal Policies**

So far, we've used Monte Carlo methods for **prediction** - evaluating a given policy. Now we'll use MC for **control** - finding the optimal policy.

**The Control Problem:**

Given: An environment (MDP without model)
Goal: Find the optimal policy $\pi^*$ that maximizes expected return

**On-Policy Learning:**

In **on-policy** learning, we learn about and improve the same policy that we're using to generate behavior. The agent:
1. Acts according to policy $\pi$
2. Learns from that experience
3. Improves policy $\pi$
4. Repeats

This contrasts with **off-policy** learning (covered later), where the agent learns about one policy while following another.

**The Algorithm: Monte Carlo Control with Epsilon-Greedy**

We can't use a purely greedy policy (always exploit) because we need exploration. The solution: **epsilon-greedy exploration**.

**Key Idea:** Instead of learning $V(s)$, we learn $Q(s,a)$ (action-values), which tells us the value of taking action $a$ in state $s$.

**Algorithm Steps:**

1. **Initialize**: 
   - $Q(s,a) = 0$ for all states and actions
   - $\pi$ = epsilon-greedy policy based on $Q$

2. **Repeat** for many episodes:
   - **Generate episode** following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T$
   - **For each** state-action pair $(s,a)$ in the episode:
     - Calculate return $G$ following first visit to $(s,a)$
     - Update: $Q(s,a) \leftarrow \text{average of returns following } (s,a)$
   - **Improve policy**: $\pi \leftarrow$ epsilon-greedy with respect to $Q$

**Epsilon-Greedy Policy:**

$\pi(a|s) = \begin{cases}
1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s,a') \\
\frac{\epsilon}{|A(s)|} & \text{otherwise}
\end{cases}$
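
The formula can be checked numerically for a single state. In the sketch below, the helper name `epsilon_greedy_probs` and the Q-values are assumptions made for illustration; ties are resolved in favor of the first maximal action, which is how `np.argmax` behaves.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for one state under the epsilon-greedy rule."""
    q_values = np.asarray(q_values, dtype=float)
    # Every action receives the exploration share epsilon / |A(s)|
    probs = np.full(q_values.shape, epsilon / q_values.size)
    # The greedy action additionally receives the remaining 1 - epsilon
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([0.2, 1.5, -0.3, 0.9], epsilon=0.1)
print("pi(a|s):", probs)
print("sum:", probs.sum())
```

Note that the greedy action can also be picked during exploration, which is why its probability is $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ rather than $1 - \epsilon$.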

**Why This Works:**

1. **Exploration**: Epsilon-greedy ensures all actions are tried
2. **Exploitation**: Mostly choose actions with highest Q-values
3. **Improvement**: Policy gets better as Q-values become more accurate
4. **Convergence**: Under certain conditions, converges to optimal epsilon-greedy policy

**Generalized Policy Iteration (GPI):**

MC Control is an instance of GPI:
- **Policy Evaluation**: Estimate $Q^\pi$ using MC sampling
- **Policy Improvement**: Make the policy epsilon-greedy with respect to $Q$
- **Iterate**: These two processes work together to approach the best epsilon-greedy policy, which is close to $\pi^*$ when $\epsilon$ is small

Let's implement on-policy MC control:

In [None]:
class EpsilonGreedyPolicy:
    """Epsilon-greedy policy based on Q-values."""
    
    def __init__(self, epsilon=0.1, num_actions=4):
        """Initialize epsilon-greedy policy.
        
        Args:
            epsilon: Probability of random action
            num_actions: Number of possible actions
        """
        self.epsilon = epsilon
        self.num_actions = num_actions
        self.Q = defaultdict(lambda: np.zeros(num_actions))
    
    def get_action(self, state):
        """Select action using epsilon-greedy strategy.
        
        Args:
            state: Current state
            
        Returns:
            action: Selected action
        """
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(0, self.num_actions)
        else:
            # Exploit: best action according to Q
            return np.argmax(self.Q[state])
    
    def get_greedy_action(self, state):
        """Get the greedy action (for evaluation)."""
        return np.argmax(self.Q[state])


def mc_control_on_policy(env, num_episodes=10000, gamma=0.9, epsilon=0.1):
    """On-policy Monte Carlo control with epsilon-greedy exploration.
    
    Learns optimal policy by:
    1. Generating episodes with epsilon-greedy policy
    2. Updating Q-values from returns
    3. Improving policy to be greedy w.r.t. Q
    
    Args:
        env: Environment
        num_episodes: Number of episodes to run
        gamma: Discount factor
        epsilon: Exploration probability
        
    Returns:
        policy: Learned epsilon-greedy policy
        Q: Learned action-value function
        stats: Dictionary with learning statistics
    """
    # Initialize policy
    policy = EpsilonGreedyPolicy(epsilon=epsilon, num_actions=4)
    
    # Track returns for each state-action pair
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    
    # Statistics for tracking progress
    episode_returns = []
    episode_lengths = []
    
    for episode_num in range(num_episodes):
        # Generate episode using current policy
        episode = generate_episode(env, policy.get_action, max_steps=100)
        
        # Calculate returns
        returns = calculate_returns(episode, gamma)
        
        # Track statistics
        episode_returns.append(returns[0] if returns else 0)
        episode_lengths.append(len(episode))
        
        # Track visited state-action pairs (for first-visit)
        visited_pairs = set()
        
        # Update Q-values
        for t, (state, action, reward) in enumerate(episode):
            pair = (state, action)
            
            # First-visit MC
            if pair not in visited_pairs:
                visited_pairs.add(pair)
                
                # Update return statistics
                returns_sum[pair] += returns[t]
                returns_count[pair] += 1
                
                # Update Q-value (average of returns)
                policy.Q[state][action] = returns_sum[pair] / returns_count[pair]
        
        # Policy improvement happens automatically through epsilon-greedy
        # (policy always acts epsilon-greedy w.r.t. current Q)
    
    stats = {
        'episode_returns': episode_returns,
        'episode_lengths': episode_lengths,
        'states_visited': len(policy.Q)
    }
    
    return policy, dict(policy.Q), stats


print("On-Policy Monte Carlo Control Implementation\n")
print("="*60)
print("\nImplemented:")
print("  ✓ Epsilon-greedy policy class")
print("  ✓ On-policy MC control algorithm")
print("  ✓ Q-value learning from episodes")
print("  ✓ Policy improvement through GPI")
print("\nReady to learn optimal policies!")

#### Learning Optimal Policy in Grid World

In [None]:
# Learn optimal policy using MC control
print("Learning Optimal Policy with On-Policy MC Control\n")
print("="*60)

# Create environment
env = GridWorldEnvironment(grid_size=5)

print("\nEnvironment: 5x5 Grid World")
print(f"  Start: (0,0)")
print(f"  Goal: {env.goal_pos}")
print(f"  Obstacles: {env.obstacles}")
print(f"  Actions: {env.actions}")

print("\nTraining agent with MC control...")
print("  Episodes: 10,000")
print("  Epsilon: 0.1")
print("  Gamma: 0.9")

# Train agent
policy, Q, stats = mc_control_on_policy(
    env, 
    num_episodes=10000, 
    gamma=0.9, 
    epsilon=0.1
)

print("\n✓ Training complete!")
print(f"\nLearning Statistics:")
print(f"  States visited: {stats['states_visited']}")
print(f"  Final average return: {np.mean(stats['episode_returns'][-100:]):.2f}")
print(f"  Final average episode length: {np.mean(stats['episode_lengths'][-100:]):.1f}")

# Show learned policy for some key states
print("\nLearned Policy (greedy actions):")
print("-" * 40)
key_states = [(0,0), (0,1), (1,0), (2,0), (3,3), (4,3)]
for state in key_states:
    if state in Q:
        action = policy.get_greedy_action(state)
        action_name = env.actions[action]
        q_values = Q[state]
        print(f"  State {state}: {action_name} (Q-values: {q_values})")

print("\n" + "="*60)

#### Visualizing Policy Improvement Over Episodes

In [None]:
# Visualize learning progress
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Episode returns over time
ax = axes[0, 0]
window = 100
smoothed_returns = np.convolve(stats['episode_returns'], np.ones(window)/window, mode='valid')
ax.plot(smoothed_returns, linewidth=2, color='blue', alpha=0.8)
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Average Return', fontsize=11)
ax.set_title('Learning Progress: Episode Returns', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=np.mean(smoothed_returns[-100:]), color='red', linestyle='--', 
           alpha=0.5, label=f'Final: {np.mean(smoothed_returns[-100:]):.1f}')
ax.legend()

# Plot 2: Episode lengths over time
ax = axes[0, 1]
smoothed_lengths = np.convolve(stats['episode_lengths'], np.ones(window)/window, mode='valid')
ax.plot(smoothed_lengths, linewidth=2, color='green', alpha=0.8)
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Episode Length (steps)', fontsize=11)
ax.set_title('Learning Progress: Episode Lengths', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=np.mean(smoothed_lengths[-100:]), color='red', linestyle='--', 
           alpha=0.5, label=f'Final: {np.mean(smoothed_lengths[-100:]):.1f}')
ax.legend()

# Plot 3: Learned Q-values heatmap (max Q for each state)
ax = axes[1, 0]
q_grid = np.zeros((5, 5))
for (row, col), q_values in Q.items():
    q_grid[row, col] = np.max(q_values)

im = ax.imshow(q_grid, cmap='RdYlGn', aspect='auto')
ax.set_title('Learned Q-Values (max over actions)', fontsize=12, fontweight='bold')
ax.set_xlabel('Column', fontsize=11)
ax.set_ylabel('Row', fontsize=11)

# Add value labels
for i in range(5):
    for j in range(5):
        text = ax.text(j, i, f'{q_grid[i, j]:.1f}',
                      ha="center", va="center", color="black", fontsize=10)

plt.colorbar(im, ax=ax)

# Plot 4: Learned policy visualization
ax = axes[1, 1]
policy_grid = np.full((5, 5), -1)
for (row, col), q_values in Q.items():
    policy_grid[row, col] = np.argmax(q_values)

# Create custom colormap for actions
from matplotlib.colors import ListedColormap
colors = ['white', 'lightblue', 'lightgreen', 'lightyellow', 'lightcoral']
cmap = ListedColormap(colors)

im = ax.imshow(policy_grid, cmap=cmap, aspect='auto', vmin=-1, vmax=3)
ax.set_title('Learned Policy (Greedy Actions)', fontsize=12, fontweight='bold')
ax.set_xlabel('Column', fontsize=11)
ax.set_ylabel('Row', fontsize=11)

# Add action arrows
arrow_map = {0: '↑', 1: '→', 2: '↓', 3: '←', -1: ''}
for i in range(5):
    for j in range(5):
        action = int(policy_grid[i, j])
        arrow = arrow_map.get(action, '')
        ax.text(j, i, arrow, ha="center", va="center", 
               fontsize=20, fontweight='bold', color='black')

plt.tight_layout()
plt.show()

print("\n📊 Visualization Insights:")
print("\n1. Top Left: Returns increase as policy improves")
print("   - Agent learns to reach goal more efficiently")
print("   - Converges to near-optimal performance")

print("\n2. Top Right: Episode lengths decrease")
print("   - Shorter episodes = more efficient paths to goal")
print("   - Agent learns to avoid obstacles and reach goal quickly")

print("\n3. Bottom Left: Q-values show state quality")
print("   - Higher values near goal (green)")
print("   - Lower values far from goal (red)")

print("\n4. Bottom Right: Learned policy")
print("   - Arrows show best action in each state")
print("   - Policy guides agent toward goal")
print("   - Successfully learned from experience!")

print("\n✅ On-Policy MC Control Success:")
print("   The agent learned an effective policy without any model!")
print("   Just from trial and error with epsilon-greedy exploration.")

#### Off-Policy Learning with Importance Sampling

**Learning About One Policy While Following Another**

So far, we've used **on-policy** learning where we learn about the same policy we're following. But what if we want to:
- Learn an optimal (deterministic) policy while exploring (stochastic behavior)?
- Learn from data generated by a different policy (e.g., human demonstrations)?
- Reuse old experience even after the policy has changed?

This is where **off-policy** learning comes in!

**Key Concepts:**

**Target Policy ($\pi$):**
- The policy we want to learn about
- Often deterministic and greedy
- Example: Always take the best action

**Behavior Policy ($b$):**
- The policy we actually follow to generate experience
- Must be exploratory (try all actions)
- Example: Epsilon-greedy or random policy

**The Challenge:**

Episodes are generated by $b$, but we want to estimate values for $\pi$. The returns we observe are from $b$'s distribution, not $\pi$'s!

**The Solution: Importance Sampling**

Importance sampling is a technique from statistics that allows us to estimate expectations under one distribution using samples from another.

**Mathematical Foundation:**

We want to estimate $\mathbb{E}_\pi[G_t]$ (expected return under $\pi$), but we only have samples from $b$.

**Importance Sampling Ratio:**

For a trajectory $\tau = (S_t, A_t, S_{t+1}, A_{t+1}, ..., S_T)$, the importance sampling ratio is:

$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

This ratio weights the return to correct for the difference between policies.

**Intuition:**
- If $\pi$ is more likely to take the actions than $b$: ratio > 1 (weight up)
- If $\pi$ is less likely to take the actions than $b$: ratio < 1 (weight down)
- If $\pi$ would never take an action that $b$ took: ratio = 0 (ignore)
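
The product $\rho_{t:T-1}$ runs from step $t$ to the end of the episode, so it is convenient to compute it backward. The sketch below is illustrative only: the function name `tail_importance_ratios` and the probability lists are hypothetical, and the numbers assume a deterministic target policy and an epsilon-greedy behavior policy with $\epsilon = 0.3$ over four actions.

```python
def tail_importance_ratios(pi_probs, b_probs):
    """Return [rho_{t:T-1} for t = 0, ..., T-1].

    pi_probs[k] and b_probs[k] are pi(A_k|S_k) and b(A_k|S_k) for the
    action actually taken at step k.
    """
    ratios = [0.0] * len(pi_probs)
    running = 1.0
    # Walk backward: rho_{t:T-1} = (pi_t / b_t) * rho_{t+1:T-1}
    for k in reversed(range(len(pi_probs))):
        running *= pi_probs[k] / b_probs[k]
        ratios[k] = running
    return ratios

# Hypothetical 4-step episode. The behavior policy takes the target's action
# with probability 0.7 + 0.3/4 = 0.775 and any other action with 0.3/4 = 0.075.
# At step 2 the behavior policy explored, so the target probability is 0.
pi_probs = [1.0, 1.0, 0.0, 1.0]
b_probs = [0.775, 0.775, 0.075, 0.775]

ratios = tail_importance_ratios(pi_probs, b_probs)
print(ratios)
```

Only the return from the last step keeps a nonzero weight here: every earlier return passes through the exploratory action at step 2, which the target policy would never take.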

**Off-Policy MC Prediction with Importance Sampling:**

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

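where $\mathcal{T}(s)$ is the set of time steps at which state $s$ was visited (only the first visit in each episode for a first-visit method), and $T(t)$ is the termination time of the episode that contains step $t$.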

**Coverage Assumption:**

For off-policy learning to work, we need:
$\pi(a|s) > 0 \implies b(a|s) > 0$

In words: The behavior policy must try all actions that the target policy might take.
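
For small tabular problems, coverage can be checked by brute force before any learning starts. The following is a minimal sketch under stated assumptions: `find_coverage_violations` and the toy policies are hypothetical names, and the probability functions are assumed to take `(action, state)` in that order.

```python
def find_coverage_violations(states, actions, target_prob, behavior_prob):
    """Return (state, action) pairs where pi(a|s) > 0 but b(a|s) == 0."""
    violations = []
    for state in states:
        for action in actions:
            if target_prob(action, state) > 0 and behavior_prob(action, state) == 0:
                violations.append((state, action))
    return violations

# Toy setup: two states, four actions
states = [(0, 0), (0, 1)]
actions = [0, 1, 2, 3]

def target_prob(action, state):
    """Deterministic target: always take action 1."""
    return 1.0 if action == 1 else 0.0

def uniform_prob(action, state):
    """Uniform random behavior: covers every action."""
    return 0.25

def never_right_prob(action, state):
    """Behavior that never takes action 1: violates coverage."""
    return 0.0 if action == 1 else 1.0 / 3.0

ok = find_coverage_violations(states, actions, target_prob, uniform_prob)
bad = find_coverage_violations(states, actions, target_prob, never_right_prob)
print(ok)   # no violations
print(bad)  # action 1 is uncovered in both states
```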

**Advantages of Off-Policy Learning:**
1. Can learn optimal policy while exploring
2. Can learn from observational data
3. Can reuse experience from old policies
4. More flexible than on-policy methods

**Disadvantages:**
1. Higher variance (importance sampling ratios can be large)
2. Slower convergence
3. Requires more data
4. Can be unstable if policies are very different

Let's implement off-policy MC with importance sampling:

In [None]:
def compute_importance_sampling_ratio(episode, target_policy, behavior_policy):
    """Compute per-step importance sampling ratios for an episode.
    
    Args:
        episode: List of (state, action, reward) tuples
        target_policy: Function that returns probability of action given state
        behavior_policy: Function that returns probability of action given state
        
    Returns:
        ratios: List where ratios[t] = rho_{t:T-1}, the product of
            pi(A_k|S_k) / b(A_k|S_k) for k = t, ..., T-1. This is the ratio
            that weights the return G_t; ratios[0] covers the whole episode.
    """
    ratios = [0.0] * len(episode)
    tail_ratio = 1.0
    
    # Walk backward so each step multiplies in the ratios of all later steps
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        
        # Get probabilities under both policies
        pi_prob = target_policy(action, state)
        b_prob = behavior_policy(action, state)
        
        if b_prob == 0:
            # Coverage violated: avoid division by zero and give zero
            # weight to this step and every earlier one
            tail_ratio = 0.0
        else:
            tail_ratio *= (pi_prob / b_prob)
        
        ratios[t] = tail_ratio
    
    return ratios


def mc_prediction_off_policy(env, target_policy_func, behavior_policy_func,
                             target_policy_probs, behavior_policy_probs,
                             num_episodes=5000, gamma=0.9):
    """Off-policy Monte Carlo prediction with ordinary importance sampling.
    
    Learns value function for target policy using episodes from behavior policy.
    
    Args:
        env: Environment
        target_policy_func: Function that selects actions for target policy
        behavior_policy_func: Function that selects actions for behavior policy
        target_policy_probs: Function that returns P(a|s) for target policy
        behavior_policy_probs: Function that returns P(a|s) for behavior policy
        num_episodes: Number of episodes to generate
        gamma: Discount factor
        
    Returns:
        V: Estimated value function for target policy
        stats: Learning statistics
    """
    V = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    
    episode_returns = []
    importance_ratios = []
    
    for episode_num in range(num_episodes):
        # Generate episode using behavior policy
        episode = generate_episode(env, behavior_policy_func, max_steps=100)
        
        # Calculate returns
        returns = calculate_returns(episode, gamma)
        
        # Calculate importance sampling ratios (ratios[t] = rho_{t:T-1})
        ratios = compute_importance_sampling_ratio(
            episode, target_policy_probs, behavior_policy_probs
        )
        
        # Track statistics
        if returns:
            episode_returns.append(returns[0])
        if ratios:
            importance_ratios.append(ratios[0])  # Full-episode ratio rho_{0:T-1}
        
        # Update value estimates (first-visit)
        visited_states = set()
        
        for t, (state, action, reward) in enumerate(episode):
            if state not in visited_states and t < len(ratios):
                visited_states.add(state)
                
                # Weight return by importance sampling ratio
                weighted_return = ratios[t] * returns[t]
                
                returns_sum[state] += weighted_return
                returns_count[state] += 1
                
                # Update value estimate
                V[state] = returns_sum[state] / returns_count[state]
    
    stats = {
        'episode_returns': episode_returns,
        'importance_ratios': importance_ratios,
        'states_visited': len(V)
    }
    
    return dict(V), stats


print("Off-Policy Monte Carlo with Importance Sampling\n")
print("="*60)
print("\nImplemented:")
print("  ✓ Importance sampling ratio computation")
print("  ✓ Off-policy MC prediction")
print("  ✓ Target vs behavior policy separation")
print("\nReady to learn from off-policy data!")

#### Demonstrating Off-Policy Learning

In [None]:
# Define target and behavior policies
print("Off-Policy Learning Demonstration\n")
print("="*60)

# Create environment
env = GridWorldEnvironment(grid_size=5)

# Target policy: Greedy (deterministic)
def target_policy_action(state):
    """Greedy policy - always move toward goal."""
    row, col = state
    goal_row, goal_col = 4, 4
    
    if col < goal_col:
        return 1  # RIGHT
    elif row < goal_row:
        return 2  # DOWN
    else:
        return 1  # Default

def target_policy_prob(action, state):
    """Probability of action under target policy (deterministic)."""
    return 1.0 if action == target_policy_action(state) else 0.0

# Behavior policy: Epsilon-greedy (exploratory)
epsilon_behavior = 0.3

def behavior_policy_action(state):
    """Epsilon-greedy behavior policy."""
    if np.random.random() < epsilon_behavior:
        return np.random.randint(0, 4)  # Random
    else:
        return target_policy_action(state)  # Greedy

def behavior_policy_prob(action, state):
    """Probability of action under behavior policy (epsilon-greedy)."""
    greedy_action = target_policy_action(state)
    
    if action == greedy_action:
        return 1.0 - epsilon_behavior + epsilon_behavior / 4.0
    else:
        return epsilon_behavior / 4.0

print("\nPolicies:")
print("  Target Policy: Deterministic (move right, then down toward the goal)")
print("  Behavior Policy: ε-greedy with ε=0.3 (exploratory)")

print("\nLearning value function for target policy...")
print("  (using episodes generated by behavior policy)")

# Learn off-policy
V_off_policy, stats_off = mc_prediction_off_policy(
    env,
    target_policy_action,
    behavior_policy_action,
    target_policy_prob,
    behavior_policy_prob,
    num_episodes=5000,
    gamma=0.9
)

# For comparison, learn on-policy with target policy
print("\nFor comparison, learning with on-policy (target policy)...")
V_on_policy, returns_on = mc_prediction_first_visit(
    env, target_policy_action, num_episodes=5000, gamma=0.9
)

print("\n" + "="*60)
print("\nResults:")
print(f"\nOff-Policy Learning:")
print(f"  States visited: {stats_off['states_visited']}")
print(f"  Start state value: {V_off_policy.get((0,0), 0):.2f}")
print(f"  Average importance ratio: {np.mean(stats_off['importance_ratios']):.3f}")
print(f"  Max importance ratio: {np.max(stats_off['importance_ratios']):.3f}")

print(f"\nOn-Policy Learning (for comparison):")
print(f"  States visited: {len(V_on_policy)}")
print(f"  Start state value: {V_on_policy.get((0,0), 0):.2f}")

print("\n💡 Key Observations:")
print("   - Off-policy successfully learns target policy values")
print("   - Uses exploratory behavior policy for data collection")
print("   - Importance ratios correct for policy difference")
print("   - Values should be similar to on-policy estimates")

#### Weighted Importance Sampling: Reducing Variance

**The Variance Problem with Ordinary Importance Sampling**

Ordinary importance sampling (what we just implemented) has a significant problem: **high variance**.

**Why High Variance?**

The importance sampling ratio $\rho = \prod \frac{\pi(a|s)}{b(a|s)}$ can become very large:
- If the trajectory is long, many ratios multiply together
- If policies differ significantly, individual ratios can be large
- A single large ratio can dominate the average

**Example:**
- Suppose we have 100 episodes with ratio ≈ 1.0
- And 1 episode with ratio = 100
- The single outlier heavily skews the estimate!
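
The example above can be made concrete with a few lines of arithmetic. This sketch adds one assumption that is not in the text: every episode, including the outlier, has a return of exactly 1.0, so any sensible value estimate should be 1.0.

```python
import numpy as np

# 100 episodes with ratio 1.0 plus one outlier with ratio 100
ratios = np.array([1.0] * 100 + [100.0])
# Assumption for illustration: every episode returned exactly 1.0
returns = np.ones(101)

# Ordinary importance sampling: divide by the number of episodes
ordinary_estimate = np.sum(ratios * returns) / len(returns)
print(f"Ordinary IS estimate: {ordinary_estimate:.2f}")
print(f"Largest observed return: {returns.max():.2f}")
```

The estimate is $200/101 \approx 1.98$, almost double the largest return that was ever observed. One episode out of 101 is responsible for nearly half of the numerator.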

**The Solution: Weighted Importance Sampling**

Instead of a simple average, use a **weighted average** where the weights are the importance sampling ratios themselves.

**Mathematical Formulation:**

**Ordinary Importance Sampling:**
$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

**Weighted Importance Sampling:**
$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$

**Key Difference:**
- Ordinary: Divide by number of visits
- Weighted: Divide by sum of importance ratios

**Intuition:**

Weighted IS gives more weight to returns with larger importance ratios, but normalizes by the total weight. This:
- Reduces the impact of extreme ratios
- Provides more stable estimates
- Converges faster in practice
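
Weighted importance sampling can also be applied one return at a time by keeping a running sum of weights. The sketch below shows one common incremental form; `weighted_is_update` and the sample data are hypothetical, and the loop checks that the incremental result agrees with the batch formula above.

```python
def weighted_is_update(value, cum_weight, ratio, g):
    """Apply one weighted-IS update and return (new_value, new_cum_weight)."""
    if ratio == 0.0:
        # A zero-weight return changes neither numerator nor denominator
        return value, cum_weight
    cum_weight += ratio
    # Move the estimate toward g by the fraction of total weight this sample holds
    value += (ratio / cum_weight) * (g - value)
    return value, cum_weight

# Hypothetical (ratio, return) pairs observed for one state
samples = [(1.0, 2.0), (0.5, 4.0), (3.0, 1.0)]

value, cum_weight = 0.0, 0.0
for ratio, g in samples:
    value, cum_weight = weighted_is_update(value, cum_weight, ratio, g)

# Batch formula: sum(ratio * return) / sum(ratio)
batch = sum(r * g for r, g in samples) / sum(r for r, _ in samples)
print(value, batch)
```

Because the estimate is always a weighted average of observed returns, it can never leave the range spanned by those returns, no matter how large a single ratio becomes.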

**Bias-Variance Trade-off:**

| Method | Bias | Variance | Convergence |
|--------|------|----------|-------------|
| Ordinary IS | Unbiased | High | Slower |
| Weighted IS | Biased (initially) | Low | Faster |

**Important Note:**
- Weighted IS is biased for finite samples
- But the bias goes to zero as samples increase
- In practice, lower variance usually wins!
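
The source of the bias is easiest to see with a single observed return whose ratio is nonzero. In that case the ratio cancels in the weighted estimate:

$V(s) = \frac{\rho_{t:T(t)-1} G_t}{\rho_{t:T(t)-1}} = G_t \quad\Rightarrow\quad \mathbb{E}_b[V(s)] = \mathbb{E}_b[G_t \mid S_t = s]$

So after one sample the weighted estimate is an unbiased estimate of the value under the behavior policy $b$, not under $\pi$. The ordinary estimate with one sample is $\rho_{t:T(t)-1} G_t$, whose expectation is the quantity we actually want:

$\mathbb{E}_b[\rho_{t:T(t)-1} G_t \mid S_t = s] = \mathbb{E}_\pi[G_t \mid S_t = s]$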

**When to Use Each:**
- **Ordinary IS**: When you need unbiased estimates, have lots of data
- **Weighted IS**: When variance is a problem, limited data (most practical cases)

Let's implement weighted importance sampling and compare:

In [None]:
def mc_prediction_off_policy_weighted(env, target_policy_func, behavior_policy_func,
                                      target_policy_probs, behavior_policy_probs,
                                      num_episodes=5000, gamma=0.9):
    """Off-policy Monte Carlo prediction with weighted importance sampling.
    
    Uses weighted average to reduce variance compared to ordinary IS.
    
    Args:
        env: Environment
        target_policy_func: Function that selects actions for target policy
        behavior_policy_func: Function that selects actions for behavior policy
        target_policy_probs: Function that returns P(a|s) for target policy
        behavior_policy_probs: Function that returns P(a|s) for behavior policy
        num_episodes: Number of episodes to generate
        gamma: Discount factor
        
    Returns:
        V: Estimated value function for target policy
        stats: Learning statistics
    """
    V = defaultdict(float)
    # For weighted IS, we need numerator and denominator separately
    weighted_returns_sum = defaultdict(float)  # Sum of (ratio * return)
    weights_sum = defaultdict(float)  # Sum of ratios
    
    episode_returns = []
    importance_ratios = []
    
    for episode_num in range(num_episodes):
        # Generate episode using behavior policy
        episode = generate_episode(env, behavior_policy_func, max_steps=100)
        
        # Calculate returns
        returns = calculate_returns(episode, gamma)
        
        # Calculate importance sampling ratios
        ratios = compute_importance_sampling_ratio(
            episode, target_policy_probs, behavior_policy_probs
        )
        
        # Track statistics
        if returns:
            episode_returns.append(returns[0])
        if ratios:
            importance_ratios.append(ratios[-1])
        
        # Update value estimates (first-visit)
        visited_states = set()
        
        for t, (state, action, reward) in enumerate(episode):
            if state not in visited_states and t < len(ratios):
                visited_states.add(state)
                
                # Weighted importance sampling
                ratio = ratios[t]
                weighted_return = ratio * returns[t]
                
                # Update numerator and denominator
                weighted_returns_sum[state] += weighted_return
                weights_sum[state] += ratio
                
                # Update value estimate (weighted average)
                if weights_sum[state] > 0:
                    V[state] = weighted_returns_sum[state] / weights_sum[state]
    
    stats = {
        'episode_returns': episode_returns,
        'importance_ratios': importance_ratios,
        'states_visited': len(V)
    }
    
    return dict(V), stats


print("Weighted Importance Sampling Implementation\n")
print("="*60)
print("\nImplemented:")
print("  ✓ Weighted importance sampling")
print("  ✓ Variance reduction through weighted averaging")
print("  ✓ Separate tracking of numerator and denominator")
print("\nReady to compare with ordinary importance sampling!")

#### Comparing Ordinary vs Weighted Importance Sampling

In [None]:
# Compare ordinary and weighted importance sampling
print("Comparing Ordinary vs Weighted Importance Sampling\n")
print("="*60)

# Run multiple trials to measure variance
num_trials = 20
num_episodes_per_trial = 2000

ordinary_estimates = []
weighted_estimates = []

print(f"\nRunning {num_trials} trials with {num_episodes_per_trial} episodes each...")
print("(This may take a moment)\n")

for trial in range(num_trials):
    # Ordinary importance sampling
    V_ordinary, _ = mc_prediction_off_policy(
        env, target_policy_action, behavior_policy_action,
        target_policy_prob, behavior_policy_prob,
        num_episodes=num_episodes_per_trial, gamma=0.9
    )
    ordinary_estimates.append(V_ordinary.get((0,0), 0))
    
    # Weighted importance sampling
    V_weighted, _ = mc_prediction_off_policy_weighted(
        env, target_policy_action, behavior_policy_action,
        target_policy_prob, behavior_policy_prob,
        num_episodes=num_episodes_per_trial, gamma=0.9
    )
    weighted_estimates.append(V_weighted.get((0,0), 0))
    
    if (trial + 1) % 5 == 0:
        print(f"  Completed {trial + 1}/{num_trials} trials")

# Calculate statistics
ordinary_mean = np.mean(ordinary_estimates)
ordinary_std = np.std(ordinary_estimates)
weighted_mean = np.mean(weighted_estimates)
weighted_std = np.std(weighted_estimates)

print("\n" + "="*60)
print("\nResults for Start State (0,0):")
print("-" * 60)
print(f"\nOrdinary Importance Sampling:")
print(f"  Mean estimate: {ordinary_mean:.3f}")
print(f"  Std deviation: {ordinary_std:.3f}")
print(f"  Min estimate: {np.min(ordinary_estimates):.3f}")
print(f"  Max estimate: {np.max(ordinary_estimates):.3f}")

print(f"\nWeighted Importance Sampling:")
print(f"  Mean estimate: {weighted_mean:.3f}")
print(f"  Std deviation: {weighted_std:.3f}")
print(f"  Min estimate: {np.min(weighted_estimates):.3f}")
print(f"  Max estimate: {np.max(weighted_estimates):.3f}")

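# Compare variances (std squared), guarding against a zero denominator
ordinary_var = ordinary_std ** 2
weighted_var = weighted_std ** 2
variance_reduction = ((ordinary_var - weighted_var) / ordinary_var) * 100 if ordinary_var > 0 else 0.0
print(f"\n📊 Variance Reduction: {variance_reduction:.1f}%")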

print("\n" + "="*60)

#### Visualizing Variance Reduction

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Distribution of estimates
ax = axes[0]
ax.hist(ordinary_estimates, bins=15, alpha=0.6, label='Ordinary IS', color='red', edgecolor='black')
ax.hist(weighted_estimates, bins=15, alpha=0.6, label='Weighted IS', color='green', edgecolor='black')
ax.axvline(ordinary_mean, color='red', linestyle='--', linewidth=2, label=f'Ordinary Mean: {ordinary_mean:.2f}')
ax.axvline(weighted_mean, color='green', linestyle='--', linewidth=2, label=f'Weighted Mean: {weighted_mean:.2f}')
ax.set_xlabel('Value Estimate for State (0,0)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Distribution of Value Estimates', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 2: Box plot comparison
ax = axes[1]
data_to_plot = [ordinary_estimates, weighted_estimates]
bp = ax.boxplot(data_to_plot, patch_artist=True, widths=0.6)
# Set tick labels separately; the `labels` keyword is deprecated in newer Matplotlib
ax.set_xticklabels(['Ordinary IS', 'Weighted IS'])

# Color the boxes
colors = ['lightcoral', 'lightgreen']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_ylabel('Value Estimate for State (0,0)', fontsize=12)
ax.set_title('Variance Comparison', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add variance reduction annotation
ax.text(1.5, ax.get_ylim()[1] * 0.95, 
        f'Variance Reduction:\n{variance_reduction:.1f}%',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
        fontsize=11, fontweight='bold', ha='center')

plt.tight_layout()
plt.show()

print("\n📊 Visualization Insights:")
print("\n1. Left Plot: Distribution of estimates across trials")
print("   - Ordinary IS: Wider spread (higher variance)")
print("   - Weighted IS: Tighter distribution (lower variance)")
print("   - Both centered around similar mean values")

print("\n2. Right Plot: Box plot shows variance clearly")
print("   - Ordinary IS: Larger box and whiskers")
print("   - Weighted IS: Smaller box (more consistent estimates)")
print("   - Outliers are less extreme with weighted IS")

print("\n✅ Key Takeaways:")
print("   1. Weighted IS significantly reduces variance")
print("   2. More stable and reliable estimates")
print("   3. Faster convergence in practice")
print("   4. Preferred method for most off-policy learning")
print("\n   Weighted importance sampling is the practical choice!")

#### Monte Carlo Methods: Limitations and Challenges

**Understanding When MC Methods Fall Short**

While Monte Carlo methods are powerful and model-free, they have significant limitations that restrict their applicability. Understanding these limitations motivates the development of more advanced methods like Temporal Difference learning.

**1. The "Wait Until the End" Problem**

**The Issue:**
- MC methods must wait until an episode completes before updating values
- No learning happens during the episode
- All updates occur at the end

**Why This Matters:**
- **Long Episodes**: If episodes take 1000 steps, you wait 1000 steps to learn anything
- **Continuing Tasks**: Some tasks never end (e.g., process control, robot operation)
- **Slow Feedback**: Can't adjust behavior mid-episode based on what's working

**Example:**
```
Episode: S₀ → S₁ → S₂ → ... → S₉₉₉ → S₁₀₀₀ (terminal)
         ↑                                    ↑
    No learning                         All learning happens here!
```

**Impact:**
- Inefficient use of experience
- Slow learning, especially with long episodes
- Cannot handle continuing (non-episodic) tasks

**2. High Variance in Return Estimates**

**The Issue:**
- Returns depend on entire trajectory of rewards
- Many random events accumulate
- Different episodes from same state can have very different returns

**Mathematical Perspective:**

Return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$

Each $R_i$ is random, and we're summing many random variables:
- More terms → more variance
- Longer episodes → higher variance
- Stochastic environments → even more variance
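
To put numbers on this, the sketch below makes a simplifying assumption that is not in the text: rewards are independent and each has the same variance $\sigma^2$. Under that assumption the variance of an $n$-step discounted return is $\sigma^2 \sum_{k=0}^{n-1} \gamma^{2k}$. The function name `discounted_return_variance` is hypothetical.

```python
import numpy as np

def discounted_return_variance(reward_variance, gamma, horizon):
    """Variance of sum_{k < horizon} gamma^k * R_k for independent rewards.

    Assumes every reward has the same variance and rewards are independent.
    """
    ks = np.arange(horizon)
    return reward_variance * np.sum(gamma ** (2 * ks))

# Variance grows with the horizon and levels off near 1 / (1 - gamma^2)
for horizon in (1, 10, 100):
    print(horizon, discounted_return_variance(1.0, 0.9, horizon))

# Without discounting (gamma = 1) the variance grows linearly with the horizon
print(discounted_return_variance(1.0, 1.0, 100))
```

Discounting caps the growth, but with $\gamma$ close to 1 the return can still carry many times the variance of a single reward.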

**Consequences:**
- Need many episodes to get accurate estimates
- Slow convergence
- Unstable learning, especially early on
- Off-policy methods (importance sampling) make this worse

**Example:**
```
From state S, three episodes:
  Episode 1: G = 10  (got lucky)
  Episode 2: G = -5  (got unlucky)
  Episode 3: G = 3   (typical)
  
Average: 2.67 (but high variance!)
Need many more episodes for stable estimate
```
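
The three returns in this example are enough to estimate how uncertain the average is. The sketch below uses the standard error of the mean; the target precision of 0.5 is an arbitrary choice for illustration, and the episode count is only a rough estimate because it relies on a spread measured from three samples.

```python
import numpy as np

returns = np.array([10.0, -5.0, 3.0])

mean = returns.mean()
# Sample standard deviation (ddof=1) because the true mean is unknown
sample_std = returns.std(ddof=1)
std_err = sample_std / np.sqrt(len(returns))
print(f"Mean: {mean:.2f}, standard error: {std_err:.2f}")

# Rough number of episodes needed for a standard error of 0.5,
# assuming the spread of returns stays about the same
target_std_err = 0.5
episodes_needed = int(np.ceil((sample_std / target_std_err) ** 2))
print(f"Episodes needed for standard error {target_std_err}: {episodes_needed}")
```

The standard error (about 4.33) is larger than the estimate itself (about 2.67), so three episodes cannot even establish whether the value of the state is positive.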

**3. Inefficient Learning from Experience**

**The Issue:**
- Each episode provides one data point per state visited
- Can't learn from partial episodes
- Doesn't leverage structure of the problem

**Comparison:**
- **MC**: Uses complete return from state to end
- **Better approach**: Could learn from each step along the way

**Example:**
```
Episode: S₀ → S₁ → S₂ → S₃ → S₄ (terminal, R=10)

MC Learning:
  - Updates V(S₀), V(S₁), V(S₂), V(S₃) once at end
  - Uses full return for each
  
Potential Improvement:
  - Could update after each step
  - Could learn from partial information
  - 4 learning opportunities (one per transition) instead of 1!
```

**4. Requires Episodic Tasks**

**The Issue:**
- MC methods fundamentally require episodes to terminate
- Many real-world problems are continuing (no natural end)

**Examples of Continuing Tasks:**
- Process control (factory, power plant)
- Robot operation (runs indefinitely)
- Stock trading (market never closes permanently)
- Recommendation systems (always serving users)

**Workarounds (not ideal):**
- Artificially terminate episodes
- Use very long episodes (but then variance increases)
- Neither solution is satisfactory

**5. Slow Convergence**

**The Issue:**
- Due to high variance, need many episodes
- Each episode only updates visited states
- Learning is sample-inefficient

**Factors Affecting Convergence:**
- Episode length (longer → slower)
- Environment stochasticity (more random → slower)
- State space size (larger → slower)
- Exploration strategy (poor exploration → slower)

**Practical Impact:**
- May need millions of episodes for complex problems
- Expensive in terms of computation and time
- Not practical for real-world systems with costly interactions

**Summary of Limitations:**

| Limitation | Impact | Severity |
|------------|--------|----------|
| Wait until end | Slow learning | High |
| High variance | Need many samples | High |
| Episodic only | Can't handle continuing tasks | Critical |
| Sample inefficiency | Expensive learning | Medium |
| Slow convergence | Long training times | Medium |

**The Path Forward: Temporal Difference Learning**

These limitations motivate **Temporal Difference (TD) learning**, which we'll explore next. TD methods:

- ✓ Learn from every step (not just at episode end)
- ✓ Work with continuing tasks
- ✓ Lower variance (bootstrap from estimates, at the cost of some bias)
- ✓ More sample-efficient in many problems
- ✓ Often faster convergence in practice

**When to Use Monte Carlo Despite Limitations:**

MC methods are still valuable when:
- Episodes are short
- You need unbiased estimates
- Environment is deterministic or low-noise
- You have access to a simulator (cheap episodes)
- You want simple, easy-to-understand algorithms

**Key Insight:**

Monte Carlo methods taught us that we can learn from experience without a model. But their limitations show us that we can do better by learning from partial episodes and bootstrapping from our own estimates. This insight leads directly to Temporal Difference learning, which combines the best of MC and Dynamic Programming!

#### Visualizing MC Limitations

In [None]:
# Demonstrate the high variance problem
print("Demonstrating Monte Carlo Limitations\n")
print("="*60)

# Run MC prediction multiple times to show variance
num_runs = 50
episodes_per_run = 1000

start_state_estimates = []

print(f"\nRunning MC prediction {num_runs} times...")
print(f"Each run uses {episodes_per_run} episodes\n")

for run in range(num_runs):
    V, _ = mc_prediction_first_visit(
        env, greedy_policy, num_episodes=episodes_per_run, gamma=0.9
    )
    start_state_estimates.append(V.get((0,0), 0))

mean_estimate = np.mean(start_state_estimates)
std_estimate = np.std(start_state_estimates)

print(f"Results for start state (0,0):")
print(f"  Mean estimate: {mean_estimate:.3f}")
print(f"  Std deviation: {std_estimate:.3f}")
print(f"  Min: {np.min(start_state_estimates):.3f}")
print(f"  Max: {np.max(start_state_estimates):.3f}")
print(f"  Range: {np.max(start_state_estimates) - np.min(start_state_estimates):.3f}")

# Visualize variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Distribution of estimates
ax = axes[0]
ax.hist(start_state_estimates, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
ax.axvline(mean_estimate, color='red', linestyle='--', linewidth=2, 
           label=f'Mean: {mean_estimate:.2f}')
ax.axvline(mean_estimate - std_estimate, color='orange', linestyle=':', linewidth=2,
           label=f'±1 Std: {std_estimate:.2f}')
ax.axvline(mean_estimate + std_estimate, color='orange', linestyle=':', linewidth=2)
ax.set_xlabel('Value Estimate', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('High Variance in MC Estimates', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 2: Estimates over runs
ax = axes[1]
ax.plot(start_state_estimates, marker='o', linestyle='-', alpha=0.6, color='steelblue')
ax.axhline(mean_estimate, color='red', linestyle='--', linewidth=2, label='Mean')
ax.fill_between(range(num_runs), 
                mean_estimate - std_estimate, 
                mean_estimate + std_estimate,
                alpha=0.2, color='orange', label='±1 Std')
ax.set_xlabel('Run Number', fontsize=12)
ax.set_ylabel('Value Estimate', fontsize=12)
ax.set_title('Variability Across Runs', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("\n⚠️  Key Observations:")
print("\n1. High Variance:")
print(f"   - Estimates vary significantly across runs")
print(f"   - Standard deviation is {(std_estimate/mean_estimate)*100:.1f}% of mean")
print(f"   - Need many episodes for stable estimates")

print("\n2. Sample Inefficiency:")
print(f"   - Used {num_runs * episodes_per_run:,} total episodes")
print(f"   - Still seeing significant variance")
print(f"   - Each episode only updates visited states once")

print("\n3. Episodic Requirement:")
print(f"   - Must wait for episode to complete")
print(f"   - No learning during episode")
print(f"   - Cannot handle continuing tasks")

print("\n🎯 Motivation for Temporal Difference Learning:")
print("   These limitations show we need methods that:")
print("   • Learn from every step, not just episode ends")
print("   • Have lower variance through bootstrapping")
print("   • Work with continuing tasks")
print("   • Are more sample-efficient")
print("\n   → This leads us to TD learning in the next section!")

<a id='td-learning'></a>
### Temporal Difference Learning

**Learning from Every Step**

Temporal Difference (TD) learning represents a fundamental breakthrough in reinforcement learning. Unlike Monte Carlo methods that must wait until the end of an episode to update value estimates, TD methods learn from **every single step** of experience.

**The Key Insight:**

TD learning combines the best aspects of two approaches:

1. **From Monte Carlo**: Learn directly from experience without a model
2. **From Dynamic Programming**: Update estimates based on other estimates (bootstrapping)

**Why "Temporal Difference"?**

The name comes from the fact that TD methods learn from the **difference** between estimates at successive **time** steps. Instead of waiting for the actual return, TD methods use the difference between the current estimate and a better estimate based on the next state.

**Advantages of TD Learning:**

1. **Online Learning**: Update after every step, not just at episode end
2. **Lower Variance**: Bootstrap from estimates rather than full returns
3. **Works with Continuing Tasks**: No need for episodes to terminate
4. **More Sample Efficient**: Learn more from each experience
5. **Faster Convergence**: Updates propagate information more quickly

**The Trade-off:**

- **MC**: Unbiased but high variance (uses actual returns)
- **TD**: Biased but lower variance (uses estimated returns)
- In practice, TD's lower variance often outweighs its bias, so TD methods usually learn faster on stochastic tasks
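
The trade-off can be seen numerically with a small simulation. The sketch below assumes a toy chain in which every step yields a reward drawn from N(0, 1), so the true value of every state is 0. The chain, the episode length, and the deliberately wrong estimate `v_next_estimate` are all illustrative assumptions, not part of the grid world used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

n_steps = 10        # assumed episode length from the start state
n_samples = 10_000  # number of sampled targets
gamma = 1.0         # undiscounted, to keep the arithmetic simple

# Assumed toy chain: each step yields a reward drawn from N(0, 1),
# so the true value of every state is 0.
rewards = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_steps))

# MC target for the start state: the full (discounted) return.
discounts = gamma ** np.arange(n_steps)
mc_targets = rewards @ discounts

# TD(0) target for the start state: first reward plus the current estimate
# of the next state's value. The estimate is deliberately wrong (0.5, not 0).
v_next_estimate = 0.5
td_targets = rewards[:, 0] + gamma * v_next_estimate

print(f"MC target: mean={mc_targets.mean():+.3f}, var={mc_targets.var():.3f}")
print(f"TD target: mean={td_targets.mean():+.3f}, var={td_targets.var():.3f}")
```

The MC target averages out near the true value but accumulates the noise of every step, while the TD target only carries the noise of one step and inherits whatever error is in the next-state estimate.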

Let's explore the simplest TD method: TD(0).

#### TD(0) Prediction Algorithm

**The Simplest Temporal Difference Method**

TD(0) (pronounced "TD-zero") is the most fundamental TD algorithm. It updates the value estimate for a state immediately after transitioning to the next state.

**The TD(0) Update Rule:**

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
$$

**Breaking Down the Formula:**

- $V(S_t)$: Current value estimate for state $S_t$
- $\alpha$: Learning rate (step size), typically 0.01 to 0.5
- $R_{t+1}$: Immediate reward received after taking action
- $\gamma$: Discount factor (0 to 1)
- $V(S_{t+1})$: Value estimate for next state
- $R_{t+1} + \gamma V(S_{t+1})$: **TD target** (estimate of true value)
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$: **TD error** (how wrong we were)

**Intuition:**

1. We're in state $S_t$ with value estimate $V(S_t)$
2. We take an action and receive reward $R_{t+1}$, landing in state $S_{t+1}$
3. We form a **better estimate** of $V(S_t)$: $R_{t+1} + \gamma V(S_{t+1})$
4. We move our estimate toward this better estimate
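
The four steps above can be traced with concrete numbers. This is a minimal sketch of a single update; the state names, the current estimates, the reward, and the values of `alpha` and `gamma` are hypothetical.

```python
alpha, gamma = 0.1, 0.9      # assumed step size and discount factor
V = {"A": 2.0, "B": 5.0}     # hypothetical current estimates

# One observed transition: A --(reward 1.0)--> B
state, reward, next_state = "A", 1.0, "B"

td_target = reward + gamma * V[next_state]   # 1.0 + 0.9 * 5.0 = 5.5
td_error = td_target - V[state]              # 5.5 - 2.0 = 3.5
V[state] += alpha * td_error                 # 2.0 + 0.1 * 3.5 = 2.35

print(f"TD target = {td_target:.2f}, TD error = {td_error:.2f}, new V(A) = {V[state]:.2f}")
```

The estimate for A moves only a fraction `alpha` of the way toward the target, which keeps a single noisy transition from overwriting everything learned so far.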

**Comparison with Monte Carlo:**

Monte Carlo update:
$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]
$$
where $G_t$ is the **actual return** (sum of all future rewards)

TD(0) update:
$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
$$
where $R_{t+1} + \gamma V(S_{t+1})$ is an **estimated return**
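
The two targets are closely related. As a short derivation, assume the estimates $V$ are held fixed during the episode, let $T$ be the terminal time step, and take $V(S_T) = 0$. Then the Monte Carlo error unrolls into a discounted sum of TD errors $\delta_k$:

$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma \left[ G_{t+1} - V(S_{t+1}) \right] \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \left[ G_{t+2} - V(S_{t+2}) \right] \\
&= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k
\end{aligned}
$$

In other words, MC applies all of an episode's TD errors at once at the end, while TD(0) applies each one as soon as it is observed.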

**Key Differences:**

| Aspect | Monte Carlo | TD(0) |
|--------|-------------|-------|
| Target | $G_t$ (actual return) | $R_{t+1} + \gamma V(S_{t+1})$ (estimated) |
| When to update | End of episode | After each step |
| Bias | Unbiased | Biased (uses estimate) |
| Variance | High | Lower |
| Continuing tasks | No | Yes |

**The Bootstrapping Concept:**

TD methods "bootstrap" - they update estimates based on other estimates. This is like pulling yourself up by your bootstraps! Initially, all estimates might be wrong, but they gradually improve and help each other converge to the true values.
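
A small sketch can show estimates helping each other converge. It assumes a deterministic chain S0 → S1 → S2 → S3 → terminal with a reward of 10 on the final transition and 0 elsewhere; the chain and the values of `alpha` and `gamma` are illustrative assumptions.

```python
# Assumed toy problem: a deterministic chain S0 -> S1 -> S2 -> S3 -> terminal,
# with reward 10 on the final transition and 0 elsewhere.
n_states = 4
alpha, gamma = 0.5, 0.9     # assumed step size and discount factor
V = [0.0] * (n_states + 1)  # last entry is the terminal state, fixed at 0

def run_episode(V):
    """Apply one TD(0) update per transition along the chain."""
    for s in range(n_states):
        reward = 10.0 if s == n_states - 1 else 0.0
        td_target = reward + gamma * V[s + 1]
        V[s] += alpha * (td_target - V[s])

for episode in range(1, 5):
    run_episode(V)
    print(f"after episode {episode}: " + ", ".join(f"{v:.2f}" for v in V[:n_states]))
```

In this setup the reward information reaches one state further back with each episode: S3 learns from the reward itself, then S2 learns from S3's improved estimate, and so on.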

Let's implement TD(0) prediction:

In [None]:
def td_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """
    TD(0) prediction: Estimate state-value function for a given policy.
    
    Args:
        env: Environment with reset() and step() methods
        policy: Function that takes state and returns action
        num_episodes: Number of episodes to run
        alpha: Learning rate (step size)
        gamma: Discount factor
    
    Returns:
        V: Dictionary mapping states to value estimates
        episode_lengths: List of episode lengths for tracking
    """
    # Initialize value function
    V = defaultdict(float)
    episode_lengths = []
    
    for episode in range(num_episodes):
        state = env.reset()
        episode_length = 0
        
        while True:
            # Select action according to policy
            action = policy(state)
            
            # Take action and observe next state and reward
            next_state, reward, done, _ = env.step(action)
            episode_length += 1
            
            # TD(0) update rule
            # V(S) ← V(S) + α[R + γV(S') - V(S)]
            # Terminal states are never updated, so V[next_state] stays 0 there
            td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]
            V[state] = V[state] + alpha * td_error
            
            if done:
                episode_lengths.append(episode_length)
                break
            
            state = next_state
    
    return V, episode_lengths


print("TD(0) Prediction Algorithm Implemented!")
print("="*60)
print("\nKey Features:")
print("  • Updates after every step (online learning)")
print("  • Uses bootstrapping (estimates from estimates)")
print("  • Lower variance than Monte Carlo")
print("  • Works with continuing tasks")
print("\nUpdate Rule: V(S) ← V(S) + α[R + γV(S') - V(S)]")

#### Comparing TD(0) with Monte Carlo

Now let's compare TD(0) prediction with Monte Carlo prediction on the same environment to see the differences in practice.

In [None]:
# Use the same grid world environment from before
print("Comparing TD(0) vs Monte Carlo Prediction\n")
print("="*60)

# Create environment
env = SimpleGridWorld()

# Use the same greedy policy toward goal
def greedy_policy(state):
    """Simple policy: move toward goal (3,3)."""
    row, col = state
    goal_row, goal_col = 3, 3
    
    # Move toward goal
    if row < goal_row:
        return 2  # Down
    elif row > goal_row:
        return 0  # Up
    elif col < goal_col:
        return 1  # Right
    else:
        return 3  # Left

# Run both algorithms with same parameters
num_episodes = 500
gamma = 0.9

print(f"\nRunning both algorithms for {num_episodes} episodes...\n")

# TD(0) prediction
print("Running TD(0) prediction...")
V_td, lengths_td = td_prediction(env, greedy_policy, 
                                  num_episodes=num_episodes, 
                                  alpha=0.1, gamma=gamma)

# Monte Carlo prediction (first-visit)
print("Running Monte Carlo prediction...")
V_mc, lengths_mc = mc_prediction_first_visit(env, greedy_policy, 
                                              num_episodes=num_episodes, 
                                              gamma=gamma)

print("\nDone!\n")

# Compare value estimates for key states
print("Value Estimates Comparison:")
print("="*60)
print(f"{'State':<12} {'TD(0)':<12} {'MC':<12} {'Difference':<12}")
print("-"*60)

# Compare some key states
key_states = [(0,0), (0,3), (1,1), (2,2), (3,0), (3,3)]
for state in key_states:
    v_td = V_td.get(state, 0)
    v_mc = V_mc.get(state, 0)
    diff = abs(v_td - v_mc)
    print(f"{str(state):<12} {v_td:<12.3f} {v_mc:<12.3f} {diff:<12.3f}")

# Calculate statistics
all_states = set(list(V_td.keys()) + list(V_mc.keys()))
differences = [abs(V_td.get(s, 0) - V_mc.get(s, 0)) for s in all_states]
mean_diff = np.mean(differences)
max_diff = np.max(differences)

print("\n" + "="*60)
print(f"\nStatistics:")
print(f"  Mean absolute difference: {mean_diff:.4f}")
print(f"  Max absolute difference:  {max_diff:.4f}")
print(f"  Number of states visited: {len(all_states)}")

#### Visualizing Faster Convergence of TD(0)

One of the key advantages of TD learning is faster convergence. Let's visualize this by tracking how the value estimates evolve over episodes.

In [None]:
def td_prediction_with_tracking(env, policy, num_episodes=500, alpha=0.1, gamma=0.9, track_state=(0,0)):
    """
    TD(0) prediction with tracking of value estimates over time.
    
    Args:
        env: Environment
        policy: Policy function
        num_episodes: Number of episodes
        alpha: Learning rate
        gamma: Discount factor
        track_state: State to track value estimates for
    
    Returns:
        V: Final value function
        value_history: List of value estimates for tracked state
    """
    V = defaultdict(float)
    value_history = []
    
    for episode in range(num_episodes):
        state = env.reset()
        
        while True:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            
            # TD(0) update
            td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]
            V[state] = V[state] + alpha * td_error
            
            if done:
                break
            
            state = next_state
        
        # Track value estimate after each episode
        value_history.append(V[track_state])
    
    return V, value_history


def mc_prediction_with_tracking(env, policy, num_episodes=500, gamma=0.9, track_state=(0,0)):
    """
    Monte Carlo prediction with tracking of value estimates over time.
    """
    V = defaultdict(float)
    returns = defaultdict(list)
    value_history = []
    
    for episode in range(num_episodes):
        # Generate episode
        episode_data = []
        state = env.reset()
        
        while True:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode_data.append((state, reward))
            
            if done:
                break
            state = next_state
        
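        # Calculate returns and update values (first-visit)
        G = 0
        first_visit_return = {}
        
        # Walk backward through the episode; later overwrites come from earlier
        # time steps, so each state ends up with the return from its first visit.
        for state, reward in reversed(episode_data):
            G = reward + gamma * G
            first_visit_return[state] = G
        
        for state, G_first in first_visit_return.items():
            returns[state].append(G_first)
            V[state] = np.mean(returns[state])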
        
        # Track value estimate after each episode
        value_history.append(V[track_state])
    
    return V, value_history


# Run both algorithms with tracking
print("Tracking Convergence: TD(0) vs Monte Carlo\n")
print("="*60)

env = SimpleGridWorld()
track_state = (0, 0)  # Track the start state
num_episodes = 500

print(f"\nTracking value estimates for state {track_state}...\n")

# Run multiple times to get average behavior
num_runs = 20
td_histories = []
mc_histories = []

for run in range(num_runs):
    # TD(0)
    _, td_hist = td_prediction_with_tracking(env, greedy_policy, 
                                             num_episodes=num_episodes,
                                             alpha=0.1, gamma=0.9,
                                             track_state=track_state)
    td_histories.append(td_hist)
    
    # Monte Carlo
    _, mc_hist = mc_prediction_with_tracking(env, greedy_policy,
                                             num_episodes=num_episodes,
                                             gamma=0.9,
                                             track_state=track_state)
    mc_histories.append(mc_hist)

# Average across runs
td_avg = np.mean(td_histories, axis=0)
mc_avg = np.mean(mc_histories, axis=0)

# Calculate standard deviation for confidence bands
td_std = np.std(td_histories, axis=0)
mc_std = np.std(mc_histories, axis=0)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Convergence comparison
ax = axes[0]
episodes = np.arange(num_episodes)

# TD(0) line
ax.plot(episodes, td_avg, linewidth=2, color='blue', label='TD(0)', alpha=0.8)
ax.fill_between(episodes, td_avg - td_std, td_avg + td_std, 
                alpha=0.2, color='blue')

# Monte Carlo line
ax.plot(episodes, mc_avg, linewidth=2, color='red', label='Monte Carlo', alpha=0.8)
ax.fill_between(episodes, mc_avg - mc_std, mc_avg + mc_std, 
                alpha=0.2, color='red')

ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel(f'Value Estimate for State {track_state}', fontsize=12)
ax.set_title('Convergence Speed: TD(0) vs Monte Carlo', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 2: Variance comparison
ax = axes[1]

# Calculate rolling standard deviation of the run-averaged curve
# (a smoothness proxy, not the across-run variance)
window = 50
td_rolling_std = pd.Series(td_avg).rolling(window=window, min_periods=1).std()
mc_rolling_std = pd.Series(mc_avg).rolling(window=window, min_periods=1).std()

ax.plot(episodes, td_rolling_std, linewidth=2, color='blue', 
        label='TD(0)', alpha=0.8)
ax.plot(episodes, mc_rolling_std, linewidth=2, color='red', 
        label='Monte Carlo', alpha=0.8)

ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel(f'Rolling Std Dev (window={window})', fontsize=12)
ax.set_title('Variance Comparison', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print statistics
print("\n" + "="*60)
print("\n📊 Convergence Analysis:\n")

# Find when each method gets close to final value
td_final = td_avg[-1]
mc_final = mc_avg[-1]
threshold = 0.1  # Within 10% of final value

td_converge = np.where(np.abs(td_avg - td_final) < threshold * abs(td_final))[0]
mc_converge = np.where(np.abs(mc_avg - mc_final) < threshold * abs(mc_final))[0]

td_converge_ep = td_converge[0] if len(td_converge) > 0 else num_episodes
mc_converge_ep = mc_converge[0] if len(mc_converge) > 0 else num_episodes

print(f"Final Value Estimates:")
print(f"  TD(0):        {td_final:.4f}")
print(f"  Monte Carlo:  {mc_final:.4f}")
print(f"  Difference:   {abs(td_final - mc_final):.4f}")

print(f"\nConvergence Speed (episodes to get within 10% of final value):")
print(f"  TD(0):        {td_converge_ep} episodes")
print(f"  Monte Carlo:  {mc_converge_ep} episodes")
if td_converge_ep < mc_converge_ep:
    speedup = mc_converge_ep / max(td_converge_ep, 1)
    print(f"  → TD(0) is {speedup:.1f}x faster!")

print(f"\nVariance (average std dev across runs):")
print(f"  TD(0):        {np.mean(td_std):.4f}")
print(f"  Monte Carlo:  {np.mean(mc_std):.4f}")
if np.mean(mc_std) > 0:
    # Only report a relative figure when MC actually varies across runs
    variance_reduction = (1 - np.mean(td_std) / np.mean(mc_std)) * 100
    print(f"  → TD(0) std dev is {variance_reduction:.1f}% lower than MC")

print("\n" + "="*60)
print("\n✅ Key Observations:\n")
print("1. Faster Convergence:")
print("   • TD(0) typically converges faster than MC")
print("   • Updates after every step vs waiting for episode end")
print("   • Information propagates more quickly through states")

print("\n2. Lower Variance:")
print("   • TD(0) has smoother learning curves")
print("   • Bootstrapping reduces variance")
print("   • More stable estimates with fewer episodes")

print("\n3. Sample Efficiency:")
print("   • TD(0) learns more from each episode")
print("   • Every transition provides a learning opportunity")
print("   • Better use of experience")

print("\n🎯 Conclusion:")
print("   TD learning combines the best of both worlds:")
print("   • Model-free like Monte Carlo")
print("   • Bootstrapping like Dynamic Programming")
print("   • Result: Faster, more efficient learning!")

#### Summary: TD(0) Prediction

**What We Learned:**

1. **TD Learning Fundamentals**:
   - Learn from every step, not just episode ends
   - Bootstrap from current estimates
   - Combine MC's model-free approach with DP's bootstrapping

2. **TD(0) Update Rule**:
   $$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
   - TD target: $R_{t+1} + \gamma V(S_{t+1})$
   - TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

3. **Advantages Over Monte Carlo**:
   - Faster convergence
   - Lower variance
   - Online learning
   - Works with continuing tasks
   - More sample efficient

4. **Trade-offs**:
   - TD is biased (uses estimates)
   - MC is unbiased (uses actual returns)
   - In practice, TD's lower variance usually wins

**Next Steps:**

TD(0) is just for prediction (evaluating a policy). In the next sections, we'll explore:
- **SARSA**: On-policy TD control (learning optimal policies)
- **Q-Learning**: Off-policy TD control
- **Deep RL**: Combining TD learning with neural networks

These methods build on the TD(0) foundation to create powerful learning algorithms!

#### SARSA: On-Policy TD Control

**From Prediction to Control**

TD(0) taught us how to evaluate a policy (prediction). Now we'll learn how to find optimal policies using **SARSA** (State-Action-Reward-State-Action), an on-policy TD control algorithm.

**What is SARSA?**

SARSA is a TD method that learns action-value functions Q(s,a) instead of state-value functions V(s). By learning Q-values, the agent can directly select actions without needing a model of the environment.

**Why "SARSA"?**

The name comes from the tuple of information used in each update:
- **S**: Current state
- **A**: Action taken
- **R**: Reward received
- **S'**: Next state
- **A'**: Next action (chosen by the current policy)

**On-Policy Learning:**

SARSA is an **on-policy** algorithm, meaning:
- It learns about the policy it's currently following
- The next action A' used in the update is chosen by the same policy being learned
- Because its updates include the exploratory actions the agent actually takes, SARSA tends to learn more conservative behavior when exploration is risky

**The SARSA Update Rule:**

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
$$

where:
- $Q(S_t, A_t)$: Current Q-value estimate
- $\alpha$: Learning rate (step size)
- $R_{t+1}$: Immediate reward
- $\gamma$: Discount factor
- $Q(S_{t+1}, A_{t+1})$: Q-value of next state-action pair
- $A_{t+1}$: Action actually taken in next state (following current policy)

**SARSA TD Target:**

$$
\text{TD Target} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})
$$

**SARSA TD Error:**

$$
\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
$$
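
A single update with concrete numbers shows what "on-policy" means here. This is a minimal sketch: the table size, the Q-values, the observed quintuple, and the values of `alpha` and `gamma` are hypothetical. For contrast it also prints the target that a greedy choice of A' would give, which is what off-policy methods such as Q-Learning use.

```python
import numpy as np

alpha, gamma = 0.1, 0.9          # assumed step size and discount factor
Q = np.zeros((3, 2))             # hypothetical table: 3 states x 2 actions
Q[1] = [4.0, 6.0]                # hypothetical current estimates for state 1

# One observed quintuple (S, A, R, S', A')
s, a, r, s_next, a_next = 0, 1, -1.0, 1, 0

td_target = r + gamma * Q[s_next, a_next]    # uses the action actually chosen
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error

greedy_target = r + gamma * Q[s_next].max()  # what a greedy next action would give
print(f"SARSA target = {td_target:.2f}, target under a greedy A' = {greedy_target:.2f}")
print(f"Updated Q(S, A) = {Q[s, a]:.3f}")
```

Because the policy happened to choose the lower-valued action in S', the SARSA target is lower than the greedy one: the cost of exploration is built into what SARSA learns.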

**Key Differences from TD(0):**

| Aspect | TD(0) | SARSA |
|--------|-------|-------|
| Learns | State values V(s) | Action values Q(s,a) |
| Purpose | Policy evaluation (prediction) | Control (evaluation and improvement together) |
| Update uses | Next state value | Next state-action value |
| Output | Value function for a fixed policy | Action-value function and the policy derived from it |

**SARSA Algorithm:**

1. Initialize Q(s,a) arbitrarily for all state-action pairs
2. For each episode:
   - Initialize state S
   - Choose action A from S using policy derived from Q (e.g., ε-greedy)
   - For each step of episode:
     - Take action A, observe R and S'
     - Choose A' from S' using policy derived from Q
     - Update: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
     - S ← S', A ← A'
   - Until S is terminal
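
The "policy derived from Q" in the steps above is typically ε-greedy. The sketch below is one possible version; the function name `epsilon_greedy`, the example Q-values, and the random tie-breaking among equally good actions are illustrative choices.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one.

    Ties among greedy actions are broken at random (an illustrative choice).
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_values = np.array([1.0, 3.0, 3.0, 0.5])   # hypothetical Q(s, .) for one state
counts = np.bincount(
    [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(10_000)],
    minlength=len(q_values),
)
print("selection frequencies:", counts / counts.sum())
```

Random tie-breaking matters most early in training: with a Q-table initialized to zeros, a plain `argmax` always returns the first action, so all exploration would come from the ε branch alone.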

Let's implement SARSA and apply it to the Taxi-v3 environment from OpenAI Gym!

#### Implementing SARSA for Taxi-v3

**The Taxi Problem:**

The Taxi-v3 environment is a classic RL problem where:
- A taxi must pick up a passenger at one location and drop them off at another
- The taxi can move in 4 directions (North, South, East, West)
- The taxi can pick up and drop off passengers
- Rewards: +20 for successful dropoff, -1 per step, -10 for illegal pick-up/drop-off

This is a perfect environment to demonstrate SARSA because:
- Discrete state and action spaces (good for tabular methods)
- Clear goal and reward structure
- Requires learning a multi-step strategy
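
Taxi-v3 represents each situation as a single integer, which is what lets the Q-table be a plain 2-D array. The sketch below mirrors the encoding described in the Gym documentation (taxi row, taxi column, passenger location, destination). The helpers `encode` and `decode` are written here for illustration and are not imported from Gym.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 500)."""
    # 5 rows x 5 columns x 5 passenger locations (4 landmarks + in taxi) x 4 destinations
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination

def decode(state):
    """Inverse of encode: recover (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination

state = encode(taxi_row=3, taxi_col=1, passenger_loc=2, destination=0)
print(state, decode(state))
print("number of states:", 5 * 5 * 5 * 4)
```

With 500 states and 6 actions, the full Q-table has only 3,000 entries, small enough for tabular SARSA to visit every relevant entry many times.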

In [None]:
class SARSAAgent:
    """
    SARSA (On-Policy TD Control) Agent.
    
    Learns optimal policy through on-policy temporal difference learning.
    """
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Initialize SARSA agent.
        
        Args:
            n_states: Number of states in the environment
            n_actions: Number of actions available
            alpha: Learning rate (step size)
            gamma: Discount factor
            epsilon: Exploration rate for ε-greedy policy
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        
        # Initialize Q-table with zeros
        self.Q = np.zeros((n_states, n_actions))
        
    def select_action(self, state):
        """
        Select action using ε-greedy policy.
        
        Args:
            state: Current state
            
        Returns:
            action: Selected action
        """
        if np.random.random() < self.epsilon:
            # Explore: choose random action
            return np.random.randint(self.n_actions)
        else:
            # Exploit: choose best action
            return np.argmax(self.Q[state])
    
    def update(self, state, action, reward, next_state, next_action):
        """
        Update Q-value using SARSA update rule.
        
        Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
        
        Args:
            state: Current state S
            action: Action taken A
            reward: Reward received R
            next_state: Next state S'
            next_action: Next action A' (chosen by policy)
        """
        # SARSA TD target: R + γQ(S',A')
        td_target = reward + self.gamma * self.Q[next_state, next_action]
        
        # TD error: TD target - current estimate
        td_error = td_target - self.Q[state, action]
        
        # Update Q-value
        self.Q[state, action] += self.alpha * td_error
    
    def get_greedy_action(self, state):
        """
        Get the greedy action (best action) for a state.
        Used for evaluation without exploration.
        
        Args:
            state: Current state
            
        Returns:
            action: Best action according to Q-table
        """
        return np.argmax(self.Q[state])


print("SARSA Agent Implemented!")
print("="*60)
print("\nKey Features:")
print("  • On-policy TD control algorithm")
print("  • Learns Q(s,a) action-value function")
print("  • Uses ε-greedy policy for exploration")
print("  • Updates based on action actually taken")
print("  • Suitable for episodic tasks")

#### Training SARSA on Taxi-v3

In [None]:
def train_sarsa(env, agent, num_episodes=5000, max_steps=200):
    """
    Train SARSA agent on an environment.
    
    Args:
        env: OpenAI Gym environment
        agent: SARSA agent
        num_episodes: Number of training episodes
        max_steps: Maximum steps per episode
        
    Returns:
        episode_rewards: List of total rewards per episode
        episode_lengths: List of episode lengths
    """
    episode_rewards = []
    episode_lengths = []
    
    for episode in range(num_episodes):
        # Initialize episode
        state = env.reset()
        action = agent.select_action(state)  # Choose initial action
        
        total_reward = 0
        steps = 0
        
        for step in range(max_steps):
            # Take action, observe result
            next_state, reward, done, _ = env.step(action)
            
            # Count every step, including the final one, so the terminal
            # reward (e.g. +20 for a dropoff) is part of the episode total
            total_reward += reward
            steps += 1
            
            if not done:
                # Choose next action using current policy
                next_action = agent.select_action(next_state)
                
                # SARSA update
                agent.update(state, action, reward, next_state, next_action)
                
                # Move to next state-action pair
                state = next_state
                action = next_action
            else:
                # Episode ended: rows of Q for true terminal states are never
                # updated, so Q(S', 0) is still 0 there
                agent.update(state, action, reward, next_state, 0)
                break
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
        
        # Print progress
        if (episode + 1) % 500 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_length = np.mean(episode_lengths[-100:])
            print(f"Episode {episode + 1}/{num_episodes} | "
                  f"Avg Reward (last 100): {avg_reward:.2f} | "
                  f"Avg Length: {avg_length:.1f}")
    
    return episode_rewards, episode_lengths


# Create Taxi-v3 environment
print("Training SARSA Agent on Taxi-v3 Environment\n")
print("="*60)

env = gym.make('Taxi-v3')

print(f"\nEnvironment Details:")
print(f"  State space size: {env.observation_space.n}")
print(f"  Action space size: {env.action_space.n}")
print(f"  Actions: 0=South, 1=North, 2=East, 3=West, 4=Pickup, 5=Dropoff")

# Create SARSA agent
agent = SARSAAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    alpha=0.1,      # Learning rate
    gamma=0.99,     # Discount factor
    epsilon=0.1     # Exploration rate
)

print(f"\nAgent Parameters:")
print(f"  Learning rate (α): {agent.alpha}")
print(f"  Discount factor (γ): {agent.gamma}")
print(f"  Exploration rate (ε): {agent.epsilon}")

print(f"\nStarting training...\n")

# Train the agent
episode_rewards, episode_lengths = train_sarsa(env, agent, num_episodes=5000)

print("\n" + "="*60)
print("Training Complete!")
print("="*60)

# Calculate final performance
final_avg_reward = np.mean(episode_rewards[-100:])
final_avg_length = np.mean(episode_lengths[-100:])

print(f"\nFinal Performance (last 100 episodes):")
print(f"  Average Reward: {final_avg_reward:.2f}")
print(f"  Average Episode Length: {final_avg_length:.1f} steps")
print(f"  Success Rate: {(np.array(episode_rewards[-100:]) > 0).mean() * 100:.1f}%")

#### Visualizing SARSA Learning Curve

In [None]:
# Create visualization of learning progress
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

# Calculate moving averages for smoother curves
window = 100
rewards_smooth = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
lengths_smooth = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')

episodes = np.arange(len(episode_rewards))
episodes_smooth = np.arange(len(rewards_smooth))

# Plot 1: Episode Rewards
ax1.plot(episodes, episode_rewards, alpha=0.3, color='blue', linewidth=0.5, label='Raw Rewards')
ax1.plot(episodes_smooth, rewards_smooth, color='darkblue', linewidth=2, label=f'{window}-Episode Moving Average')
ax1.axhline(y=0, color='red', linestyle='--', alpha=0.5, linewidth=1, label='Break-even')
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Total Reward', fontsize=12)
ax1.set_title('SARSA Learning Curve: Episode Rewards Over Time', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Episode Lengths
ax2.plot(episodes, episode_lengths, alpha=0.3, color='green', linewidth=0.5, label='Raw Lengths')
ax2.plot(episodes_smooth, lengths_smooth, color='darkgreen', linewidth=2, label=f'{window}-Episode Moving Average')
ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Episode Length (steps)', fontsize=12)
ax2.set_title('SARSA Learning Curve: Episode Length Over Time', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Learning Curve Analysis:\n")
print("1. Episode Rewards:")
print("   • Initially negative (agent is learning)")
print("   • Gradually improves as Q-values converge")
print("   • Stabilizes at positive rewards (successful deliveries)")

print("\n2. Episode Length:")
print("   • Initially high (random exploration)")
print("   • Decreases as agent learns efficient paths")
print("   • Stabilizes at optimal trajectory length")

print("\n3. Learning Progress:")
initial_avg = np.mean(episode_rewards[:100])
final_avg = np.mean(episode_rewards[-100:])
improvement = final_avg - initial_avg
print(f"   • Initial performance (first 100 episodes): {initial_avg:.2f}")
print(f"   • Final performance (last 100 episodes): {final_avg:.2f}")
print(f"   • Total improvement: {improvement:.2f} ({improvement/abs(initial_avg)*100:.1f}% better)")

#### Evaluating the Learned Policy

In [None]:
def evaluate_policy(env, agent, num_episodes=100, render=False):
    """
    Evaluate the learned policy without exploration.
    
    Args:
        env: OpenAI Gym environment
        agent: Trained SARSA agent
        num_episodes: Number of evaluation episodes
        render: Whether to render the environment
        
    Returns:
        avg_reward: Average reward over episodes
        avg_length: Average episode length
        success_rate: Percentage of successful episodes
    """
    total_rewards = []
    episode_lengths = []
    successes = 0
    
    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0
        steps = 0
        
        done = False
        while not done:
            if render and episode == 0:  # Render first episode only
                env.render()
            
            # Use greedy policy (no exploration)
            action = agent.get_greedy_action(state)
            state, reward, done, _ = env.step(action)
            
            total_reward += reward
            steps += 1
        
        total_rewards.append(total_reward)
        episode_lengths.append(steps)
        
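        # Taxi-v3 pays +20 for a delivery and every other reward is negative,
        # so a positive return implies the passenger was delivered.
        # Heuristic: a very slow but successful delivery would not be counted.
        if total_reward > 0:
            successes += 1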
    
    avg_reward = np.mean(total_rewards)
    avg_length = np.mean(episode_lengths)
    success_rate = (successes / num_episodes) * 100
    
    return avg_reward, avg_length, success_rate


print("Evaluating Learned Policy\n")
print("="*60)

# Evaluate the trained agent
avg_reward, avg_length, success_rate = evaluate_policy(env, agent, num_episodes=100)

print(f"\nEvaluation Results (100 episodes, greedy policy):")
print(f"  Average Reward: {avg_reward:.2f}")
print(f"  Average Episode Length: {avg_length:.1f} steps")
print(f"  Success Rate: {success_rate:.1f}%")

print("\n" + "="*60)
print("\n✅ SARSA Training and Evaluation Complete!")
print("\nKey Achievements (confirm against the metrics above):")
print("  • Agent learned to navigate the taxi environment")
print("  • Learned effective pickup and dropoff strategies")
print("  • A high success rate with short episodes indicates efficient paths")
print("  • Learned entirely from trial and error!")

env.close()

#### Summary: SARSA Algorithm

**What We Learned:**

1. **SARSA Fundamentals**:
   - On-policy TD control algorithm
   - Learns Q(s,a) action-value function
   - Updates based on actions actually taken by the policy
   - Name from: State-Action-Reward-State-Action

2. **SARSA Update Rule**:
   $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
   - Uses the action A' actually chosen by the policy
   - On-policy: learns about the policy being followed
   - Conservative: accounts for exploration in learning

3. **Practical Implementation**:
   - Successfully trained agent on Taxi-v3 environment
   - Achieved high success rate and efficient navigation
   - Demonstrated clear learning progress over episodes
   - Learned complex multi-step strategies

4. **Key Advantages**:
   - Model-free: no need to know environment dynamics
   - Online learning: updates after every step
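   - Converges to the optimal policy under standard conditions: every state-action pair is visited infinitely often, the policy becomes greedy in the limit, and the step sizes decay appropriately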
   - Safe exploration: learns about actual policy behavior

5. **On-Policy vs Off-Policy**:
   - SARSA (on-policy): learns about the policy it follows
   - More conservative: its values account for exploratory actions, which tends to be safer during learning
   - Next: Q-Learning (off-policy) for comparison

**Next Steps:**

Now that we understand SARSA (on-policy TD control), we'll explore:
- **Q-Learning**: Off-policy TD control that learns optimal policy directly
- **Deep Q-Networks (DQN)**: Scaling TD learning to large state spaces
- **Policy Gradient Methods**: Direct policy optimization

These methods build on the TD learning foundation to tackle increasingly complex problems!

<a id='q-learning'></a>
### Q-Learning: Off-Policy TD Control

**From On-Policy to Off-Policy Learning**

We've seen how SARSA learns about the policy it's currently following (on-policy). Now we'll explore **Q-Learning**, one of the most influential algorithms in reinforcement learning: an off-policy TD control algorithm that learns the optimal action values directly!

**What is Q-Learning?**

Q-Learning is a model-free, off-policy TD control algorithm that learns the optimal action-value function Q*(s,a) regardless of the policy being followed, provided that policy keeps visiting every state-action pair. This makes it more flexible than on-policy methods: the training data can come from any sufficiently exploratory behavior policy.

**Key Characteristics:**

1. **Off-Policy**: Learns about the optimal policy while following a different (exploratory) policy
2. **Model-Free**: Doesn't require knowledge of environment dynamics (transition probabilities or rewards)
3. **Value-Based**: Learns Q-values, from which the optimal policy can be derived
4. **Bootstrapping**: Updates estimates based on other estimates (like all TD methods)

**Why is Q-Learning Model-Free?**

Q-Learning is considered model-free because:

- **No Environment Model Required**: The agent doesn't need to know P(s'|s,a) (transition probabilities) or R(s,a) (reward function)
- **Learns from Experience**: Updates Q-values directly from observed transitions (s, a, r, s')
- **No Planning**: Doesn't simulate future trajectories using a model
- **Direct Learning**: Learns the value function without first learning how the environment works

This is in contrast to model-based methods (like Dynamic Programming) that require complete knowledge of the environment's dynamics.

**The Q-Learning Update Rule:**

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
$$

Where:
- $S_t$: Current state
- $A_t$: Action taken
- $R_{t+1}$: Reward received
- $S_{t+1}$: Next state
- $\alpha$: Learning rate
- $\gamma$: Discount factor
- $\max_{a} Q(S_{t+1}, a)$: Maximum Q-value over all actions in next state

**Q-Learning TD Target:**

$$
\text{TD Target} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)
$$

**Q-Learning TD Error:**

$$
\delta_t = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)
$$

**The Key Difference: max vs actual action**

| Aspect | SARSA (On-Policy) | Q-Learning (Off-Policy) |
|--------|-------------------|-------------------------|
| Update uses | $Q(S', A')$ - action actually taken | $\max_a Q(S', a)$ - best possible action |
| Learns about | Policy being followed | Optimal policy |
| Behavior | Conservative, accounts for exploration | Optimistic, assumes optimal behavior |
| Convergence | To the values of the policy being followed | To the optimal action values Q* |

**Why the max operator matters:**

- SARSA: "What will I actually do next?" → Uses A' from current policy
- Q-Learning: "What's the best I could do next?" → Uses max over all actions

This lets Q-Learning learn the optimal action values even while it behaves exploratorily, as long as every state-action pair keeps being visited!

**Q-Learning Algorithm:**

1. Initialize Q(s,a) arbitrarily for all state-action pairs (with Q = 0 for terminal states)
2. For each episode:
   - Initialize state S
   - For each step of episode:
     - Choose action A from S using policy derived from Q (e.g., ε-greedy)
     - Take action A, observe R and S'
     - Update: $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_a Q(S',a) - Q(S,A)]$
     - S ← S'
   - Until S is terminal

Let's implement Q-Learning and apply it to a grid-world problem!
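
Before the implementation, the difference between the two TD targets can be checked on a single transition. The sketch below uses made-up numbers: the Q-table fragment, `gamma`, `alpha`, and the choice of exploratory next action are all illustrative assumptions, not values from the notebook's experiments.

```python
gamma = 0.9   # discount factor (illustrative value)
alpha = 0.5   # learning rate (illustrative value)

# Hypothetical Q-table fragment: two states, two actions each.
Q = {
    "s":      [1.0, 0.0],   # Q(s, a0), Q(s, a1)
    "s_next": [2.0, 5.0],   # Q(s', a0), Q(s', a1)
}

state, action, reward, next_state = "s", 0, -1.0, "s_next"
next_action = 0  # suppose epsilon-greedy happened to explore and picked a0

# SARSA bootstraps from the action actually selected in s'.
sarsa_target = reward + gamma * Q[next_state][next_action]

# Q-Learning bootstraps from the best action in s', whatever is taken next.
q_learning_target = reward + gamma * max(Q[next_state])

sarsa_new_q = Q[state][action] + alpha * (sarsa_target - Q[state][action])
q_learning_new_q = Q[state][action] + alpha * (q_learning_target - Q[state][action])

print(f"SARSA target:      {sarsa_target:.2f} -> new Q = {sarsa_new_q:.2f}")
print(f"Q-Learning target: {q_learning_target:.2f} -> new Q = {q_learning_new_q:.2f}")
```

Because the exploratory action has a lower value than the best action in the next state, SARSA's target (0.80) is lower than Q-Learning's (3.50). When the next action happens to be the greedy one, the two targets coincide.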

#### Implementing Q-Learning for Grid World

**The Grid World Problem:**

We'll create a simple grid world environment where:
- The agent starts at a specific position
- The goal is to reach a target position
- The agent can move in 4 directions: up, down, left, right
- Rewards: +10 for reaching the goal, -1 for each ordinary step, -10 for moving into an obstacle (the agent stays in place)
- Moves that would leave the grid keep the agent in place and cost the ordinary -1

This is perfect for demonstrating Q-Learning because:
- Simple, discrete state and action spaces
- Clear optimal policy exists
- Easy to visualize Q-values and learned policy
- Can compare with SARSA
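
To have a yardstick for the learning curves, the best achievable undiscounted episode return can be computed with a breadth-first search over the grid. This is a minimal sketch, assuming the 5x5 layout, start (0, 0), goal (4, 4), and obstacle list used in the training cell later in this section; `shortest_path_length` is a hypothetical helper, not part of the notebook's environment class.

```python
from collections import deque


def shortest_path_length(size, start, goal, obstacles):
    """Fewest moves from start to goal, or None if the goal is unreachable."""
    blocked = set(obstacles)
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (row, col), dist = queue.popleft()
        if (row, col) == goal:
            return dist
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in blocked and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


steps = shortest_path_length(5, (0, 0), (4, 4), [(1, 1), (2, 2), (3, 1)])
# Every move costs -1 except the final one, which earns +10 instead.
best_return = -1 * (steps - 1) + 10
print(f"Shortest path: {steps} steps, best episode return: {best_return}")
# prints "Shortest path: 8 steps, best episode return: 3"
```

A converged greedy policy should produce episodes of this length. During training with ε = 0.1, the average return will sit somewhat below this value because of exploratory moves.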

In [None]:
class GridWorld:
    """A simple grid world environment for Q-Learning."""

    def __init__(self, size=5, start=(0, 0), goal=(4, 4), obstacles=None):
        self.size = size
        self.start = start
        self.goal = goal
        self.obstacles = obstacles if obstacles else []
        self.state = start

        # Actions: 0=up, 1=down, 2=left, 3=right
        self.actions = [0, 1, 2, 3]
        self.action_names = ['UP', 'DOWN', 'LEFT', 'RIGHT']

    def reset(self):
        """Reset environment to start state."""
        self.state = self.start
        return self.state

    def step(self, action):
        """Execute action and return (next_state, reward, done)."""
        row, col = self.state

        # Calculate new position (moves off the grid are clamped to the edge)
        if action == 0:  # UP
            new_state = (max(0, row - 1), col)
        elif action == 1:  # DOWN
            new_state = (min(self.size - 1, row + 1), col)
        elif action == 2:  # LEFT
            new_state = (row, max(0, col - 1))
        else:  # RIGHT
            new_state = (row, min(self.size - 1, col + 1))

        # Check if hit obstacle
        if new_state in self.obstacles:
            new_state = self.state  # Stay in place
            reward = -10
        # Check if reached goal
        elif new_state == self.goal:
            reward = 10
        # Normal step
        else:
            reward = -1

        self.state = new_state
        done = (new_state == self.goal)

        return new_state, reward, done

print("Grid World Environment Created!")
print("=" * 60)

#### Q-Learning Agent Implementation

In [None]:
class QLearningAgent:
    """Q-Learning (Off-Policy TD Control) Agent."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        """
        Initialize Q-Learning agent.

        Args:
            n_states: Number of states (for grid: size * size)
            n_actions: Number of actions (4 for grid world)
            alpha: Learning rate
            gamma: Discount factor
            epsilon: Exploration rate for ε-greedy
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        # Initialize Q-table: Q[state][action]
        # For grid world, state is (row, col) tuple
        self.Q = {}

    def get_q_value(self, state, action):
        """Get Q-value for state-action pair."""
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return self.Q[state][action]

    def select_action(self, state):
        """Select action using ε-greedy policy."""
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(self.n_actions)
        else:
            # Exploit: best action
            return self.get_greedy_action(state)

    def get_greedy_action(self, state):
        """Get best action for state (greedy)."""
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done):
        """
        Update Q-value using Q-Learning update rule.

        Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]

        Args:
            state: Current state S
            action: Action taken A
            reward: Reward received R
            next_state: Next state S'
            done: Whether episode ended
        """
        # Ensure states exist in Q-table
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        if next_state not in self.Q:
            self.Q[next_state] = np.zeros(self.n_actions)

        # Q-Learning TD target: R + γ max_a Q(S',a)
        # Key difference from SARSA: uses MAX instead of actual next action
        if done:
            td_target = reward  # No future rewards if episode ended
        else:
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        # TD error
        td_error = td_target - self.Q[state][action]

        # Update Q-value
        self.Q[state][action] += self.alpha * td_error

print("Q-Learning Agent Implemented!")
print("=" * 60)
print("\nKey Features:")
print("  • Off-policy learning with ε-greedy exploration")
print("  • Uses max Q-value for next state (not actual action)")
print("  • Learns optimal policy Q* directly")
print("  • Model-free: no environment dynamics needed")

#### Training Q-Learning Agent

In [None]:
def train_qlearning(env, agent, num_episodes=1000, max_steps=100):
    """
    Train Q-Learning agent on environment.

    Args:
        env: Grid world environment
        agent: Q-Learning agent
        num_episodes: Number of training episodes
        max_steps: Maximum steps per episode

    Returns:
        episode_rewards: List of total rewards per episode
        episode_lengths: List of episode lengths
    """
    episode_rewards = []
    episode_lengths = []

    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0
        steps = 0

        for step in range(max_steps):
            # Select action using ε-greedy
            action = agent.select_action(state)

            # Take action
            next_state, reward, done = env.step(action)

            # Q-Learning update (off-policy)
            agent.update(state, action, reward, next_state, done)

            total_reward += reward
            steps += 1
            state = next_state

            if done:
                break

        episode_rewards.append(total_reward)
        episode_lengths.append(steps)

        # Print progress
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_length = np.mean(episode_lengths[-100:])
            print(f"Episode {episode + 1}/{num_episodes} | "
                  f"Avg Reward: {avg_reward:.2f} | Avg Length: {avg_length:.1f}")

    return episode_rewards, episode_lengths

print("Training function ready!")

In [None]:
# Create environment and agent
print("Training Q-Learning Agent on Grid World\n")
print("=" * 60)

# Create 5x5 grid world with some obstacles
obstacles = [(1, 1), (2, 2), (3, 1)]
env = GridWorld(size=5, start=(0, 0), goal=(4, 4), obstacles=obstacles)

print(f"Environment:")
print(f"  Grid size: {env.size}x{env.size}")
print(f"  Start: {env.start}")
print(f"  Goal: {env.goal}")
print(f"  Obstacles: {obstacles}")
print(f"  Actions: {env.action_names}")

# Create Q-Learning agent
agent = QLearningAgent(
    n_states=env.size * env.size,
    n_actions=len(env.actions),
    alpha=0.1,
    gamma=0.95,
    epsilon=0.1
)

print(f"\nAgent Parameters:")
print(f"  Learning rate (α): {agent.alpha}")
print(f"  Discount factor (γ): {agent.gamma}")
print(f"  Exploration rate (ε): {agent.epsilon}")

print(f"\nStarting training...\n")

# Train the agent
episode_rewards, episode_lengths = train_qlearning(env, agent, num_episodes=1000)

print("\n" + "=" * 60)
print("Training complete!")

#### Visualizing Q-Learning Performance

In [None]:
# Plot learning curves
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Smooth the curves
window = 50
rewards_smooth = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
lengths_smooth = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')

# Plot 1: Rewards
ax1.plot(episode_rewards, alpha=0.3, color='blue', label='Raw')
ax1.plot(range(window-1, len(episode_rewards)), rewards_smooth,
         linewidth=2, color='blue', label=f'Smoothed ({window}-episode avg)')
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Total Reward', fontsize=12)
ax1.set_title('Q-Learning: Episode Rewards Over Time', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Episode lengths
ax2.plot(episode_lengths, alpha=0.3, color='green', label='Raw')
ax2.plot(range(window-1, len(episode_lengths)), lengths_smooth,
         linewidth=2, color='green', label=f'Smoothed ({window}-episode avg)')
ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Episode Length (steps)', fontsize=12)
ax2.set_title('Q-Learning: Episode Length Over Time', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("   - Rewards increase as agent learns optimal policy")
print("   - Episode length decreases as agent finds shorter paths")
print("   - Convergence indicates successful learning!")

#### Visualizing Learned Q-Values and Policy

In [None]:
def visualize_q_values_and_policy(agent, env):
    """Visualize Q-values and learned policy on grid."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

    # Create grids for visualization
    max_q_grid = np.zeros((env.size, env.size))
    policy_grid = np.zeros((env.size, env.size), dtype=int)

    # Fill grids (states never visited keep Q = 0 and default action 0 = UP)
    for row in range(env.size):
        for col in range(env.size):
            state = (row, col)
            if state in agent.Q:
                max_q_grid[row, col] = np.max(agent.Q[state])
                policy_grid[row, col] = np.argmax(agent.Q[state])

    # Plot 1: Q-values heatmap
    im1 = ax1.imshow(max_q_grid, cmap='RdYlGn', interpolation='nearest')
    ax1.set_title('Maximum Q-Values per State', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Column', fontsize=12)
    ax1.set_ylabel('Row', fontsize=12)

    # Add Q-values as text
    for row in range(env.size):
        for col in range(env.size):
            state = (row, col)
            if state == env.goal:
                text = 'GOAL'
                color = 'white'
            elif state in env.obstacles:
                text = 'X'
                color = 'red'
            else:
                text = f'{max_q_grid[row, col]:.1f}'
                color = 'black' if max_q_grid[row, col] < 5 else 'white'
            ax1.text(col, row, text, ha='center', va='center',
                    color=color, fontsize=10, fontweight='bold')

    plt.colorbar(im1, ax=ax1, label='Max Q-Value')

    # Plot 2: Policy arrows
    ax2.imshow(np.ones((env.size, env.size)) * 0.5, cmap='gray', alpha=0.3)
    ax2.set_title('Learned Policy (Greedy)', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Column', fontsize=12)
    ax2.set_ylabel('Row', fontsize=12)

    # Arrow directions
    arrows = {0: '↑', 1: '↓', 2: '←', 3: '→'}

    for row in range(env.size):
        for col in range(env.size):
            state = (row, col)
            if state == env.goal:
                ax2.add_patch(plt.Rectangle((col-0.4, row-0.4), 0.8, 0.8,
                                            fill=True, color='gold', alpha=0.7))
                ax2.text(col, row, '★', ha='center', va='center',
                        fontsize=24, color='darkgreen')
            elif state in env.obstacles:
                ax2.add_patch(plt.Rectangle((col-0.4, row-0.4), 0.8, 0.8,
                                            fill=True, color='red', alpha=0.5))
                ax2.text(col, row, 'X', ha='center', va='center',
                        fontsize=20, color='darkred', fontweight='bold')
            else:
                action = policy_grid[row, col]
                ax2.text(col, row, arrows[action], ha='center', va='center',
                        fontsize=24, color='blue', fontweight='bold')

    ax2.set_xticks(range(env.size))
    ax2.set_yticks(range(env.size))
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Visualize
visualize_q_values_and_policy(agent, env)

print("\n📊 Visualization Explanation:")
print("\nLeft Plot (Q-Values):")
print("  • Shows maximum Q-value for each state")
print("  • Higher values (green) indicate states closer to goal")
print("  • Lower values (red) indicate less desirable states")
print("\nRight Plot (Policy):")
print("  • Arrows show the best action in each state")
print("  • The policy should route around the obstacles")
print("  • Arrows in well-visited states should lead toward the goal (★)")
print("  • States never visited keep all-zero Q-values and default to ↑")

#### Summary: Q-Learning Algorithm

**What We Learned:**

1. **Q-Learning Fundamentals**:
   - Off-policy TD control algorithm
   - Learns optimal Q*(s,a) directly
   - Uses max operator for next state value
   - Model-free: no environment dynamics needed

2. **Q-Learning Update Rule**:
   $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
   - Uses max Q-value for next state (not actual action)
   - Off-policy: learns optimal policy while exploring
   - Optimistic: assumes best possible future actions

3. **Practical Implementation**:
   - Successfully trained agent on grid world
   - Learned to navigate around obstacles
   - Found efficient paths to the goal
   - Visualized Q-values and policy

4. **Key Advantages**:
   - Model-free: works without knowing environment dynamics
   - Off-policy: can learn from any sufficiently exploratory policy
   - Converges to the optimal action-value function Q* in the tabular case (given sufficient exploration and suitably decaying step sizes)
   - Simple and effective for tabular problems

5. **SARSA vs Q-Learning Comparison**:

| Aspect | SARSA | Q-Learning |
|--------|-------|------------|
| Policy Type | On-policy | Off-policy |
| Update Target | $R + \gamma Q(S', A')$ | $R + \gamma \max_a Q(S', a)$ |
| Learns | Policy being followed | Optimal policy |
| Behavior | Conservative | Optimistic |
| Safety | Safer (accounts for exploration) | Can be risky during training (values ignore exploratory missteps) |
| Convergence | To the values of the followed policy | To the optimal action values Q* |

**Why Q-Learning is Model-Free:**

Q-Learning doesn't require:
- Transition probabilities P(s'|s,a)
- Reward function R(s,a)
- Environment model for planning

It learns directly from experience (s, a, r, s') tuples!

**Next Steps:**

Q-Learning works great for small, discrete state spaces. For larger problems, we need:
- **Deep Q-Networks (DQN)**: Combining Q-Learning with neural networks
- **Function Approximation**: Handling continuous and high-dimensional states
- **Advanced Techniques**: Experience replay, target networks, double Q-learning

These extensions allow Q-Learning to scale to complex problems like Atari games and robotic control!

#### Epsilon-Decreasing Exploration Strategy

**The Problem with Fixed Epsilon**

In our Q-Learning implementation above, we used a fixed epsilon value (ε = 0.1). This means the agent explores randomly 10% of the time throughout the entire training process. While this works, it's not optimal:

- **Early in training**: We want MORE exploration to discover good actions
- **Late in training**: We want LESS exploration to exploit what we've learned

A fixed epsilon means we're either:
- Under-exploring early (if ε is too small)
- Over-exploring late (if ε is too large)

**Exploration Schedules: The Solution**

An **exploration schedule** (or **epsilon decay**) gradually reduces epsilon over time, allowing the agent to:
1. Explore extensively at the beginning
2. Gradually shift toward exploitation
3. Eventually converge to a near-greedy policy

**Common Epsilon Decay Strategies:**

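1. **Linear Decay**: Decrease epsilon by a constant amount each episode
   $$\epsilon_t = \epsilon_{start} - \frac{t}{T}(\epsilon_{start} - \epsilon_{end}), \quad 0 \le t \le T$$
   where $t$ is the current episode and $T$ is the number of episodes over which epsilon decays; for $t > T$, $\epsilon_t = \epsilon_{end}$

2. **Exponential Decay**: Multiply epsilon by a decay factor each episode
   $$\epsilon_t = \max\left(\epsilon_{end},\ \epsilon_{start} \cdot d^{\,t}\right)$$
   where $d$ is the decay rate (e.g., 0.995); we write $d$ rather than $\gamma$ to avoid confusion with the discount factor

3. **Step Decay**: Reduce epsilon by a factor at specific intervals
   $$\epsilon_t = \max\left(\epsilon_{end},\ \epsilon_{start} \cdot \text{decay}^{\lfloor t / \text{step} \rfloor}\right)$$
   where $\text{decay}$ is the reduction factor (e.g., 0.5) and $\text{step}$ is the interval length in episodes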

**Key Parameters:**
- $\epsilon_{start}$: Initial exploration rate (e.g., 1.0 for full exploration)
- $\epsilon_{end}$: Minimum exploration rate (e.g., 0.01 to maintain some exploration)
- Decay rate: How quickly epsilon decreases
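
For exponential decay, it helps to know how many episodes pass before epsilon reaches its floor, since that determines how much of training is spent exploring. Solving $\epsilon_{start} \cdot d^{\,t} \le \epsilon_{end}$ for $t$ gives $t \ge \ln(\epsilon_{end}/\epsilon_{start}) / \ln d$. The sketch below evaluates this bound; `episodes_until_floor` is a hypothetical helper name, and $d$ denotes the decay rate of the exponential schedule.

```python
import math


def episodes_until_floor(epsilon_start, epsilon_end, decay_rate):
    """Smallest episode t with epsilon_start * decay_rate**t <= epsilon_end."""
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(decay_rate))


for rate in (0.99, 0.995, 0.999):
    t = episodes_until_floor(1.0, 0.01, rate)
    print(f"decay rate {rate}: floor reached at episode {t}")
```

With a start of 1.0, a floor of 0.01, and a decay rate of 0.995, the floor is reached at episode 919, so a 1000-episode run spends almost all of its training above the minimum exploration rate. A decay rate of 0.999 would not reach the floor within 1000 episodes at all.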

Let's implement these strategies and see their effect on learning!

In [None]:
class EpsilonSchedule:
    """Base class for epsilon decay schedules."""
    
    def __init__(self, epsilon_start=1.0, epsilon_end=0.01):
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.current_epsilon = epsilon_start
    
    def get_epsilon(self, episode):
        """Get epsilon value for current episode."""
        raise NotImplementedError
    
    def reset(self):
        """Reset epsilon to starting value."""
        self.current_epsilon = self.epsilon_start


class LinearDecay(EpsilonSchedule):
    """Linear epsilon decay schedule."""
    
    def __init__(self, epsilon_start=1.0, epsilon_end=0.01, decay_episodes=1000):
        super().__init__(epsilon_start, epsilon_end)
        self.decay_episodes = decay_episodes
    
    def get_epsilon(self, episode):
        """Linear decay: ε_t = ε_start - (t/T)(ε_start - ε_end), then ε_end for t >= T"""
        if episode >= self.decay_episodes:
            self.current_epsilon = self.epsilon_end
        else:
            decay_amount = (self.epsilon_start - self.epsilon_end) * (episode / self.decay_episodes)
            self.current_epsilon = self.epsilon_start - decay_amount
        
        return self.current_epsilon


class ExponentialDecay(EpsilonSchedule):
    """Exponential epsilon decay schedule."""
    
    def __init__(self, epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995):
        super().__init__(epsilon_start, epsilon_end)
        self.decay_rate = decay_rate
    
    def get_epsilon(self, episode):
        """Exponential decay: ε_t = max(ε_end, ε_start * decay_rate^t)"""
        self.current_epsilon = max(
            self.epsilon_end,
            self.epsilon_start * (self.decay_rate ** episode)
        )
        return self.current_epsilon


class StepDecay(EpsilonSchedule):
    """Step-based epsilon decay schedule."""
    
    def __init__(self, epsilon_start=1.0, epsilon_end=0.01, decay_factor=0.5, step_size=200):
        super().__init__(epsilon_start, epsilon_end)
        self.decay_factor = decay_factor
        self.step_size = step_size
    
    def get_epsilon(self, episode):
        """Step decay: ε_t = max(ε_end, ε_start * decay_factor^⌊t/step_size⌋)"""
        num_steps = episode // self.step_size
        self.current_epsilon = max(
            self.epsilon_end,
            self.epsilon_start * (self.decay_factor ** num_steps)
        )
        return self.current_epsilon


print("Epsilon Decay Schedules Implemented!")
print("=" * 60)
print("\nAvailable schedules:")
print("  1. LinearDecay: Constant decrease per episode")
print("  2. ExponentialDecay: Multiplicative decrease per episode")
print("  3. StepDecay: Decrease at fixed intervals")

#### Visualizing Epsilon Decay Over Time

Let's visualize how different decay strategies behave over the course of training:

In [None]:
# Create different decay schedules
num_episodes = 1000

schedules = {
    'Linear Decay': LinearDecay(epsilon_start=1.0, epsilon_end=0.01, decay_episodes=800),
    'Exponential Decay': ExponentialDecay(epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995),
    'Step Decay': StepDecay(epsilon_start=1.0, epsilon_end=0.01, decay_factor=0.5, step_size=200),
    'Fixed Epsilon': None  # For comparison
}

# Track epsilon values over episodes
epsilon_values = {name: [] for name in schedules.keys()}

for episode in range(num_episodes):
    for name, schedule in schedules.items():
        if schedule is None:
            epsilon_values[name].append(0.1)  # Fixed epsilon
        else:
            epsilon_values[name].append(schedule.get_epsilon(episode))

# Visualize
plt.figure(figsize=(14, 6))

colors = ['blue', 'green', 'red', 'gray']
linestyles = ['-', '-', '-', '--']

for (name, values), color, linestyle in zip(epsilon_values.items(), colors, linestyles):
    plt.plot(values, label=name, color=color, linestyle=linestyle, linewidth=2, alpha=0.8)

plt.xlabel('Episode', fontsize=12)
plt.ylabel('Epsilon (ε)', fontsize=12)
plt.title('Comparison of Epsilon Decay Strategies', fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper right')
plt.grid(True, alpha=0.3)
plt.ylim([-0.05, 1.05])

# Add annotations
plt.axhline(y=0.01, color='black', linestyle=':', alpha=0.5, linewidth=1)
plt.text(num_episodes * 0.95, 0.05, 'ε_min = 0.01', fontsize=10, ha='right')

plt.tight_layout()
plt.show()

print("\n📊 Decay Strategy Characteristics:\n")
print("Linear Decay:")
print("  • Constant rate of decrease")
print("  • Predictable and easy to tune")
print("  • Good for problems with known training duration\n")

print("Exponential Decay:")
print("  • Fast initial decrease, then slower")
print("  • Smooth transition from exploration to exploitation")
print("  • A common default in practice\n")

print("Step Decay:")
print("  • Sudden drops at intervals")
print("  • Allows extended exploration at each level")
print("  • Can suit training that proceeds in stages\n")

print("Fixed Epsilon:")
print("  • No decay - constant exploration")
print("  • Simple, but usually not the best trade-off")
print("  • Continues exploring even after convergence")

#### Modified Q-Learning Agent with Epsilon Decay

Now let's create an enhanced Q-Learning agent that uses epsilon decay:

In [None]:
class QLearningAgentWithDecay:
    """Q-Learning agent with epsilon decay schedule."""
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon_schedule=None):
        """
        Initialize Q-Learning agent with epsilon decay.
        
        Args:
            n_states: Number of states
            n_actions: Number of actions
            alpha: Learning rate
            gamma: Discount factor
            epsilon_schedule: EpsilonSchedule object (if None, uses fixed epsilon=0.1)
        """
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon_schedule = epsilon_schedule
        self.epsilon = 0.1 if epsilon_schedule is None else epsilon_schedule.epsilon_start
        
        # Q-table: dictionary mapping states to action values
        self.Q = {}
    
    def get_q_values(self, state):
        """Get Q-values for a state (initialize if not seen before)."""
        if state not in self.Q:
            self.Q[state] = np.zeros(self.n_actions)
        return self.Q[state]
    
    def select_action(self, state):
        """Select action using ε-greedy policy with current epsilon."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            q_values = self.get_q_values(state)
            return np.argmax(q_values)
    
    def update(self, state, action, reward, next_state, done):
        """Update Q-value using Q-Learning rule."""
        q_values = self.get_q_values(state)
        next_q_values = self.get_q_values(next_state)
        
        # Q-Learning TD target
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(next_q_values)
        
        # TD error
        td_error = td_target - q_values[action]
        
        # Update Q-value
        q_values[action] += self.alpha * td_error
    
    def update_epsilon(self, episode):
        """Update epsilon based on schedule."""
        if self.epsilon_schedule is not None:
            self.epsilon = self.epsilon_schedule.get_epsilon(episode)


print("Enhanced Q-Learning Agent with Epsilon Decay Implemented!")

#### Comparing Learning Performance with Different Decay Strategies

Let's train multiple agents with different epsilon strategies and compare their learning performance:

In [None]:
def train_with_decay(env, agent, num_episodes=1000, max_steps=100):
    """Train Q-Learning agent with epsilon decay."""
    episode_rewards = []
    episode_lengths = []
    epsilon_history = []
    
    for episode in range(num_episodes):
        # Update epsilon for this episode
        agent.update_epsilon(episode)
        epsilon_history.append(agent.epsilon)
        
        # Run episode
        state = env.reset()
        total_reward = 0
        steps = 0
        
        for step in range(max_steps):
            action = agent.select_action(state)
            next_state, reward, done = env.step(action)
            agent.update(state, action, reward, next_state, done)
            
            total_reward += reward
            steps += 1
            state = next_state
            
            if done:
                break
        
        episode_rewards.append(total_reward)
        episode_lengths.append(steps)
    
    return episode_rewards, episode_lengths, epsilon_history


# Run experiments with different decay strategies
print("Training Q-Learning Agents with Different Epsilon Strategies\n")
print("=" * 60)

num_episodes = 1000
num_runs = 5  # Multiple runs to average out run-to-run randomness

# Define strategies to compare
strategies = {
    'Fixed ε=0.1': None,
    'Linear Decay': LinearDecay(epsilon_start=1.0, epsilon_end=0.01, decay_episodes=800),
    'Exponential Decay': ExponentialDecay(epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995),
}

# Store results
results = {name: {'rewards': [], 'lengths': []} for name in strategies.keys()}

# Run experiments
np.random.seed(42)
for strategy_name, schedule in strategies.items():
    print(f"\nTraining with {strategy_name}...")
    
    for run in range(num_runs):
        # Create fresh environment and agent
        obstacles = [(1, 1), (2, 2), (3, 1)]
        env = GridWorld(size=5, start=(0, 0), goal=(4, 4), obstacles=obstacles)
        
        # Reset schedule for each run
        if schedule is not None:
            schedule.reset()
        
        agent = QLearningAgentWithDecay(
            n_states=env.size * env.size,
            n_actions=len(env.actions),
            alpha=0.1,
            gamma=0.95,
            epsilon_schedule=schedule
        )
        
        # Train
        rewards, lengths, _ = train_with_decay(env, agent, num_episodes)
        results[strategy_name]['rewards'].append(rewards)
        results[strategy_name]['lengths'].append(lengths)
    
    print(f"  Completed {num_runs} runs")

print("\n" + "=" * 60)
print("Training complete!")

#### Visualizing Performance Comparison

In [None]:
# Calculate average performance across runs
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

colors = {'Fixed ε=0.1': 'gray', 'Linear Decay': 'blue', 'Exponential Decay': 'green'}
window = 50  # Smoothing window

# Plot 1: Average Rewards
for strategy_name, data in results.items():
    # Average across runs
    avg_rewards = np.mean(data['rewards'], axis=0)
    
    # Smooth the curve
    if len(avg_rewards) >= window:
        smoothed = np.convolve(avg_rewards, np.ones(window)/window, mode='valid')
        x = range(window-1, len(avg_rewards))
    else:
        smoothed = avg_rewards
        x = range(len(avg_rewards))
    
    ax1.plot(x, smoothed, label=strategy_name, color=colors[strategy_name], 
             linewidth=2.5, alpha=0.8)

ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Average Reward', fontsize=12)
ax1.set_title('Learning Performance: Rewards Over Time', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11, loc='lower right')
ax1.grid(True, alpha=0.3)

# Plot 2: Average Episode Lengths
for strategy_name, data in results.items():
    # Average across runs
    avg_lengths = np.mean(data['lengths'], axis=0)
    
    # Smooth the curve
    if len(avg_lengths) >= window:
        smoothed = np.convolve(avg_lengths, np.ones(window)/window, mode='valid')
        x = range(window-1, len(avg_lengths))
    else:
        smoothed = avg_lengths
        x = range(len(avg_lengths))
    
    ax2.plot(x, smoothed, label=strategy_name, color=colors[strategy_name], 
             linewidth=2.5, alpha=0.8)

ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Average Episode Length (steps)', fontsize=12)
ax2.set_title('Learning Performance: Episode Length Over Time', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11, loc='upper right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n📊 Performance Summary (Final 100 Episodes):\n")
print("=" * 60)

for strategy_name, data in results.items():
    # Calculate statistics for last 100 episodes
    final_rewards = [np.mean(run[-100:]) for run in data['rewards']]
    final_lengths = [np.mean(run[-100:]) for run in data['lengths']]
    
    avg_reward = np.mean(final_rewards)
    std_reward = np.std(final_rewards)
    avg_length = np.mean(final_lengths)
    std_length = np.std(final_lengths)
    
    print(f"\n{strategy_name}:")
    print(f"  Average Reward: {avg_reward:6.2f} ± {std_reward:.2f}")
    print(f"  Average Length: {avg_length:6.2f} ± {std_length:.2f} steps")

print("\n" + "=" * 60)
print("\n💡 Key Insights:")
print("   • Epsilon decay strategies typically converge faster")
print("   • Final performance is often better with decay")
print("   • Exponential decay provides smooth transition")
print("   • Linear decay is more predictable and easier to tune")
print("   • Fixed epsilon continues exploring unnecessarily")

#### Summary: Epsilon-Decreasing Exploration Strategy

**What We Learned:**

1. **The Problem with Fixed Epsilon**:
   - Explores too much late in training (wastes time)
   - Or explores too little early in training (misses good actions)
   - Doesn't adapt to the learning progress

2. **Epsilon Decay Strategies**:
   - **Linear Decay**: Constant decrease rate, predictable
   - **Exponential Decay**: Fast initial decrease, then slower (most common)
   - **Step Decay**: Sudden drops at intervals

3. **Benefits of Epsilon Decay**:
   - Better exploration early in training
   - Better exploitation late in training
   - Faster convergence to optimal policy
   - Higher final performance

4. **Practical Considerations**:
   - Start with high epsilon (0.9-1.0) for thorough exploration
   - End with small epsilon (0.01-0.05) to maintain some exploration
   - Tune decay rate based on problem complexity
   - Exponential decay is a good default choice

5. **When to Use Each Strategy**:
   - **Linear**: When you know training duration and want predictable behavior
   - **Exponential**: General-purpose, works well in most scenarios
   - **Step**: For curriculum learning or staged training
   - **Fixed**: Only for very simple problems, or when exploration must continue indefinitely (e.g., non-stationary environments)
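
The three decay strategies compared above can be sketched as plain functions of the episode index. This is a minimal illustration, not the notebook's `LinearDecay` / `ExponentialDecay` classes, which may differ in detail; the function names and the `drop_factor` and `drop_every` parameters are assumptions made for this example.

```python
def linear_epsilon(episode, eps_start=1.0, eps_end=0.01, decay_episodes=800):
    """Decrease epsilon by a constant amount per episode, then hold at eps_end."""
    fraction = min(episode / decay_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def exponential_epsilon(episode, eps_start=1.0, eps_end=0.01, decay_rate=0.995):
    """Multiply epsilon by decay_rate once per episode, never going below eps_end."""
    return max(eps_end, eps_start * decay_rate ** episode)


def step_epsilon(episode, eps_start=1.0, eps_end=0.01, drop_factor=0.5, drop_every=100):
    """Scale epsilon by drop_factor every drop_every episodes, floored at eps_end."""
    return max(eps_end, eps_start * drop_factor ** (episode // drop_every))


# Compare the three schedules at a few points in training
for ep in (0, 100, 400, 800, 1000):
    print(ep,
          round(linear_epsilon(ep), 3),
          round(exponential_epsilon(ep), 3),
          round(step_epsilon(ep), 3))
```

Printing the values side by side makes the qualitative difference visible: the linear schedule falls steadily, the exponential one drops quickly and then flattens, and the step schedule stays constant between sudden drops.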

**Implementation Tips:**

```python
# Good starting values for exponential decay
epsilon_start = 1.0
epsilon_end = 0.01
decay_rate = 0.995  # Applied once per episode; reaches 0.01 after ~920 episodes

# For linear decay
total_episodes = 1000  # Example training length
decay_episodes = int(0.8 * total_episodes)  # Decay over 80% of training
```
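
To pick a decay rate for a given training budget, it can help to compute how many episodes an exponential schedule needs to hit its floor. The sketch below assumes epsilon is multiplied by `decay_rate` exactly once per episode; the helper name `episodes_to_reach` is made up for this example.

```python
import math


def episodes_to_reach(eps_start, eps_end, decay_rate):
    """Smallest n such that eps_start * decay_rate**n <= eps_end."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay_rate))


# How long each decay rate takes to go from 1.0 down to 0.01
for rate in (0.99, 0.995, 0.999):
    print(f"decay_rate={rate}: {episodes_to_reach(1.0, 0.01, rate)} episodes")
```

If the result is much larger than the number of training episodes, the agent never reaches `epsilon_end` and keeps exploring more than intended; if it is much smaller, exploration ends early.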

**Next Steps:**

Epsilon decay is a fundamental technique used in:
- Deep Q-Networks (DQN)
- All epsilon-greedy based algorithms
- Exploration strategies in general

In the next sections, we'll see how this technique scales to deep reinforcement learning with neural networks!

<a id='dqn'></a>
### Deep Q-Networks (DQN)

**The Limitation of Tabular Q-Learning**

So far, we've been using **tabular methods** - storing Q-values in a table (or dictionary) with one entry for each state-action pair. This works well for small problems like grid worlds, but it has severe limitations:

**Problems with Tabular Methods:**

1. **Memory Explosion**: 
   - A 100×100 grid with 4 actions needs 40,000 entries
   - Atari frames are 210×160 pixels with a 128-color palette, so the number of possible screens is astronomically large
   - Continuous state spaces (e.g., robot joint angles) have infinite states

2. **No Generalization**:
   - Each state is learned independently
   - Similar states don't share knowledge
   - Must visit every state many times to learn

3. **Scalability**:
   - Can't handle high-dimensional inputs (images, sensor data)
   - Impractical for real-world problems
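
The memory growth is easy to check with a little arithmetic. The sketch below counts Q-table entries for the grid example and for a hypothetical discretization of a 4-dimensional continuous state into 100 bins per dimension; the binning scheme and the 8-bytes-per-entry figure are assumptions for illustration only.

```python
def q_table_entries(values_per_dim, n_dims, n_actions):
    """Number of (state, action) entries in a table over a discretized state space."""
    return values_per_dim ** n_dims * n_actions


grid = q_table_entries(100, 2, 4)    # 100x100 grid world, 4 actions
binned = q_table_entries(100, 4, 2)  # hypothetical: 4 features, 100 bins each, 2 actions

print(f"Grid world:         {grid:,} entries")
print(f"Binned 4-D state:   {binned:,} entries")
print(f"Memory (float64):   {binned * 8 / 1e9:.1f} GB")
```

Each extra state dimension multiplies the table size by the number of bins, which is why tabular methods stop being practical well before image-sized inputs.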

**The Solution: Function Approximation**

Instead of storing Q-values in a table, we use a **function approximator** to estimate them:

$$Q(s, a) \approx Q(s, a; \theta)$$

where $\theta$ represents the parameters of our function approximator.

**Why Neural Networks?**

Neural networks are universal function approximators that can:

1. **Handle High-Dimensional Inputs**: Process images, sensor data, etc.
2. **Generalize**: Similar inputs produce similar outputs
3. **Learn Features**: Automatically extract relevant features from raw data
4. **Scale**: Work with millions of states using thousands of parameters

**From Q-Table to Q-Network:**

```
Tabular Q-Learning:          Deep Q-Learning:
┌─────────────────┐          ┌─────────────────┐
│  State-Action   │          │      State      │
│     Table       │          │  (e.g., image)  │
│                 │          └────────┬────────┘
│  s₁,a₁ → 0.5    │                   │
│  s₁,a₂ → 0.3    │          ┌────────▼────────┐
│  s₂,a₁ → 0.8    │          │ Neural Network  │
│  ...            │          │    Q(s; θ)      │
└─────────────────┘          └────────┬────────┘
                                      │
                             ┌────────▼────────┐
                             │  Q-values for   │
                             │  all actions    │
                             │ [Q(s,a₁), ...]  │
                             └─────────────────┘
```

**Key Insight:**

Instead of looking up Q(s,a) in a table, we:
1. Feed the state into a neural network
2. The network outputs Q-values for all actions
3. Select the action with the highest Q-value

This allows us to:
- Handle complex, high-dimensional states
- Generalize to unseen states
- Learn from raw sensory inputs (pixels, audio, etc.)

#### DQN Architecture

**What is a Deep Q-Network?**

A Deep Q-Network (DQN) is a neural network that approximates the Q-function. It was introduced by DeepMind (Mnih et al., 2013), and the 2015 Nature version achieved human-level performance on many Atari games by learning directly from pixel inputs.

**Network Architecture:**

The basic DQN architecture consists of:

1. **Input Layer**: Receives the state representation
   - For images: Raw pixels or preprocessed frames
   - For vectors: State features (position, velocity, etc.)

2. **Hidden Layers**: Extract features and learn representations
   - Fully connected layers for vector inputs
   - Convolutional layers for image inputs
   - Activation functions (ReLU is common)

3. **Output Layer**: Produces Q-values for each action
   - One output neuron per action
   - No activation (linear output)

**Mathematical Formulation:**

Given a state $s$, the Q-network computes:

$$Q(s, a; \theta) = f_{\theta}(s)_a$$

where:
- $f_{\theta}$ is the neural network with parameters $\theta$
- $f_{\theta}(s)$ outputs a vector of Q-values, one for each action
- $f_{\theta}(s)_a$ is the Q-value for action $a$

**Training Objective:**

We train the network to minimize the **squared temporal difference (TD) error**:

$$L(\theta) = \mathbb{E}\left[(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta))^2\right]$$

This is the same TD error as in Q-learning, but now we're updating the network parameters $\theta$ by gradient descent instead of overwriting table entries! For terminal transitions the target is just $r$, exactly as in the tabular `done` case.
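
To make the "parameters instead of table entries" idea concrete, here is a minimal NumPy sketch of one gradient step on the squared TD error. It swaps the neural network for a linear approximator (one weight row per action) so the gradient can be written by hand; the transition values, learning rate, and variable names are made up for illustration. The target is held constant during differentiation, which is the usual semi-gradient treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions = 4, 2
gamma, lr = 0.99, 0.01

# Hypothetical linear approximator: one weight row per action,
# so Q(s, .; theta) = theta @ s returns a vector of n_actions values.
theta = rng.normal(scale=0.1, size=(n_actions, state_dim))


def q_values(theta, s):
    return theta @ s


# One made-up transition (s, a, r, s')
s = np.array([0.02, 0.5, -0.1, 0.3])
a, r = 1, 1.0
s_next = np.array([0.03, 0.4, -0.09, 0.2])

# TD target, treated as a constant while differentiating (semi-gradient)
td_target = r + gamma * np.max(q_values(theta, s_next))
td_error = td_target - q_values(theta, s)[a]

# The gradient of td_error**2 with respect to theta[a] is -2 * td_error * s,
# so a gradient-descent step adds lr * 2 * td_error * s to that row.
theta[a] += lr * 2.0 * td_error * s

new_td_error = td_target - q_values(theta, s)[a]
print(f"TD error before: {td_error:.4f}, after one step: {new_td_error:.4f}")
```

Only the row for the action that was taken changes, but because the weights are shared across all states, the update also shifts the Q-values of nearby states. That shared effect is the generalization a table cannot provide.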

**Architecture Variants:**

1. **Simple DQN** (for low-dimensional states):
   ```
   State → FC(64) → ReLU → FC(64) → ReLU → FC(n_actions) → Q-values
   ```

2. **Convolutional DQN** (for image inputs):
   ```
   Image → Conv(32, 8×8, s=4) → ReLU → Conv(64, 4×4, s=2) → ReLU →
         Conv(64, 3×3, s=1) → ReLU → FC(512) → ReLU → FC(n_actions) → Q-values
   ```

3. **Dueling DQN** (separates value and advantage):
   ```
   State → Shared Layers → ┬→ Value Stream → V(s)
                           └→ Advantage Stream → A(s,a)
   Q(s,a) = V(s) + (A(s,a) - mean(A(s,:)))
   ```
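
The dueling combination rule from the last variant can be written in a few lines of NumPy. This is only a sketch of the aggregation step, with made-up stream outputs standing in for what the value and advantage heads would produce; the function name `dueling_q` is an assumption.

```python
import numpy as np


def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with the mean-subtracted rule.

    value:      array of shape (batch, 1)
    advantages: array of shape (batch, n_actions)
    """
    return value + (advantages - advantages.mean(axis=1, keepdims=True))


# Made-up stream outputs for a batch of two states and three actions
value = np.array([[2.0], [-1.0]])
advantages = np.array([[1.0, 3.0, 2.0],
                       [0.5, 0.5, 0.5]])

q = dueling_q(value, advantages)
print(q)
print("Mean Q per state equals V(s):", q.mean(axis=1))
```

Subtracting the mean advantage pins the average Q-value of each state to V(s), which removes the ambiguity of adding a constant to V and subtracting it from A. The greedy action is unchanged because the same offset is applied to every action.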

**Key Design Choices:**

1. **Network Size**: 
   - Larger networks can represent more complex functions
   - But require more data and computation
   - Start small and increase if needed

2. **Activation Functions**:
   - ReLU is standard for hidden layers
   - No activation on output (Q-values can be any real number)

3. **Output Structure**:
   - Output Q-values for ALL actions simultaneously
   - More efficient than separate networks per action
   - Allows easy action selection: argmax over outputs
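
When weighing network size, it helps to know how quickly the parameter count grows. A fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases, so the total for a stack of such layers can be computed without building the network. The helper name below is hypothetical.

```python
def mlp_param_count(state_dim, hidden_dims, action_dim):
    """Count weights and biases of a fully connected network."""
    sizes = [state_dim, *hidden_dims, action_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


# 4-dimensional state and 2 actions, with a few hidden-layer configurations
for hidden in ([32], [64, 64], [128, 128, 64]):
    print(f"{str(hidden):16s} {mlp_param_count(4, hidden, 2):,} parameters")
```

Most of the parameters sit in the hidden-to-hidden layers, so widening a layer increases the count roughly quadratically while adding a layer of the same width increases it linearly.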

Let's implement a simple Q-network in PyTorch!

#### Implementing a Q-Network in PyTorch

We'll create a flexible Q-network class that can handle different state dimensions and action spaces.

In [None]:
class QNetwork(nn.Module):
    """Deep Q-Network for approximating Q-values.
    
    This network takes a state as input and outputs Q-values for all actions.
    """
    
    def __init__(self, state_dim, action_dim, hidden_dims=(64, 64)):
        """
        Initialize the Q-Network.
        
        Args:
            state_dim: Dimension of the state space (input size)
            action_dim: Number of possible actions (output size)
            hidden_dims: List of hidden layer sizes
        """
        super(QNetwork, self).__init__()
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        # Build the network layers
        layers = []
        
        # Input layer
        prev_dim = state_dim
        
        # Hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim
        
        # Output layer (no activation - Q-values can be any real number)
        layers.append(nn.Linear(prev_dim, action_dim))
        
        # Combine all layers into a sequential model
        self.network = nn.Sequential(*layers)
        
        # Initialize weights using Xavier initialization
        self._initialize_weights()
    
    def _initialize_weights(self):
        """Initialize network weights for better training."""
        for module in self.network:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.constant_(module.bias, 0.0)
    
    def forward(self, state):
        """
        Forward pass through the network.
        
        Args:
            state: State tensor of shape (batch_size, state_dim) or (state_dim,)
        
        Returns:
            Q-values for all actions, shape (batch_size, action_dim) or (action_dim,)
        """
        return self.network(state)
    
    def get_action(self, state, epsilon=0.0):
        """
        Select an action using epsilon-greedy policy.
        
        Args:
            state: Current state (numpy array or tensor)
            epsilon: Exploration rate (0 = greedy, 1 = random)
        
        Returns:
            Selected action (integer)
        """
        # Exploration: random action
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        # Exploitation: greedy action
        with torch.no_grad():
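            # Convert state to a float32 tensor on the network's device if needed
            if isinstance(state, np.ndarray):
                device = next(self.parameters()).device
                state = torch.as_tensor(state, dtype=torch.float32, device=device)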
            
            # Get Q-values and select best action
            q_values = self.forward(state)
            action = torch.argmax(q_values).item()
            
        return action
    
    def get_q_values(self, state):
        """
        Get Q-values for a state.
        
        Args:
            state: State tensor or numpy array
        
        Returns:
            Q-values as numpy array
        """
        with torch.no_grad():
            if isinstance(state, np.ndarray):
                device = next(self.parameters()).device
                state = torch.as_tensor(state, dtype=torch.float32, device=device)
            q_values = self.forward(state)
            return q_values.cpu().numpy()


print("Q-Network Implementation Complete!")
print("=" * 60)
print("\nKey Features:")
print("  • Flexible architecture with configurable hidden layers")
print("  • Xavier weight initialization for stable training")
print("  • Epsilon-greedy action selection built-in")
print("  • Handles both single states and batches")
print("  • PyTorch implementation for GPU acceleration")

#### Demonstrating the Q-Network with Sample States

Let's create a Q-network and see how it processes states and produces Q-values.

In [None]:
# Example 1: Simple Q-Network for CartPole-like environment
print("Example 1: Q-Network for CartPole Environment")
print("=" * 60)

# CartPole has 4-dimensional state and 2 actions
state_dim = 4  # [cart position, cart velocity, pole angle, pole angular velocity]
action_dim = 2  # [push left, push right]

# Create Q-network
q_net = QNetwork(state_dim=state_dim, action_dim=action_dim, hidden_dims=[64, 64])

print(f"\nNetwork Architecture:")
print(q_net)

# Count parameters
total_params = sum(p.numel() for p in q_net.parameters())
trainable_params = sum(p.numel() for p in q_net.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Create a sample state
sample_state = np.array([0.02, 0.5, -0.1, 0.3])  # Example CartPole state
print(f"\nSample State: {sample_state}")

# Forward pass: get Q-values
q_values = q_net.get_q_values(sample_state)
print(f"\nQ-values:")
print(f"  Q(s, left):  {q_values[0]:.4f}")
print(f"  Q(s, right): {q_values[1]:.4f}")

# Select action (greedy)
action = q_net.get_action(sample_state, epsilon=0.0)
action_name = "LEFT" if action == 0 else "RIGHT"
print(f"\nGreedy Action: {action} ({action_name})")

# Select action with exploration
print(f"\nWith ε=0.1 exploration:")
actions = [q_net.get_action(sample_state, epsilon=0.1) for _ in range(10)]
print(f"  10 action samples: {actions}")
print(f"  Greedy action selected: {actions.count(action)}/10 times")

In [None]:
# Example 2: Batch Processing
print("\n" + "=" * 60)
print("Example 2: Batch Processing Multiple States")
print("=" * 60)

# Create a batch of states
batch_size = 5
batch_states = np.random.randn(batch_size, state_dim)

print(f"\nBatch of {batch_size} states:")
print(batch_states)

# Forward pass with batch
batch_states_tensor = torch.FloatTensor(batch_states)
batch_q_values = q_net(batch_states_tensor)

print(f"\nBatch Q-values (shape: {batch_q_values.shape}):")
print(batch_q_values.detach().numpy())

# Select best action for each state in batch
best_actions = torch.argmax(batch_q_values, dim=1)
print(f"\nBest actions for each state: {best_actions.numpy()}")

In [None]:
# Example 3: Different Network Architectures
print("\n" + "=" * 60)
print("Example 3: Comparing Different Network Architectures")
print("=" * 60)

architectures = {
    'Small': [32],
    'Medium': [64, 64],
    'Large': [128, 128, 64],
    'Deep': [64, 64, 64, 64]
}

print(f"\nState dim: {state_dim}, Action dim: {action_dim}\n")

for name, hidden_dims in architectures.items():
    net = QNetwork(state_dim=state_dim, action_dim=action_dim, hidden_dims=hidden_dims)
    params = sum(p.numel() for p in net.parameters())
    
    print(f"{name:10s} {str(hidden_dims):20s} → {params:,} parameters")

print("\n💡 Architecture Selection Tips:")
print("   • Start with medium-sized networks (64-128 units)")
print("   • Increase size if underfitting (poor performance)")
print("   • Decrease size if overfitting or slow training")
print("   • Deeper networks can learn more complex patterns")
print("   • But require more data and careful tuning")

In [None]:
# Example 4: Visualizing Q-values for Different States
print("\n" + "=" * 60)
print("Example 4: Visualizing Q-values Across State Space")
print("=" * 60)

# Create a simple 2D state space for visualization
simple_q_net = QNetwork(state_dim=2, action_dim=4, hidden_dims=[32, 32])

# Generate a grid of states
x = np.linspace(-2, 2, 20)
y = np.linspace(-2, 2, 20)
X, Y = np.meshgrid(x, y)

# Compute Q-values for each state in the grid
q_values_grid = np.zeros((20, 20, 4))

for i in range(20):
    for j in range(20):
        state = np.array([X[i, j], Y[i, j]])
        q_values_grid[i, j] = simple_q_net.get_q_values(state)

# Plot Q-values for each action
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
action_names = ['Action 0', 'Action 1', 'Action 2', 'Action 3']

for idx, (ax, action_name) in enumerate(zip(axes.flat, action_names)):
    im = ax.contourf(X, Y, q_values_grid[:, :, idx], levels=20, cmap='RdYlGn')
    ax.set_xlabel('State Dimension 1', fontsize=11)
    ax.set_ylabel('State Dimension 2', fontsize=11)
    ax.set_title(f'Q-values for {action_name}', fontsize=12, fontweight='bold')
    plt.colorbar(im, ax=ax, label='Q-value')
    ax.grid(True, alpha=0.3)

plt.suptitle('Q-Network Output Across 2D State Space\n(Untrained Network - Random Initialization)', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("   • Each subplot shows Q-values for one action")
print("   • Colors indicate Q-value magnitude (green=high, red=low)")
print("   • This is an UNTRAINED network (random weights)")
print("   • After training, Q-values would reflect learned policy")
print("   • The network can generalize to unseen states!")

#### Summary: Neural Network Q-Function

**What We Learned:**

1. **Function Approximation**:
   - Replaces Q-tables with neural networks
   - Enables handling of large/continuous state spaces
   - Provides generalization to unseen states

2. **Q-Network Architecture**:
   - Input: State representation
   - Hidden layers: Feature extraction with ReLU activations
   - Output: Q-values for all actions (no activation)

3. **Key Design Decisions**:
   - Network size: Balance capacity vs. sample efficiency
   - Depth: Deeper networks for complex patterns
   - Initialization: Xavier/He initialization for stable training

4. **Advantages Over Tabular Methods**:
   - **Scalability**: Handle millions of states with thousands of parameters
   - **Generalization**: Similar states produce similar Q-values
   - **Flexibility**: Can process raw sensory inputs (images, audio)
   - **Efficiency**: Share knowledge across similar states

5. **Implementation Details**:
   - PyTorch provides automatic differentiation for training
   - Batch processing for efficient computation
   - Epsilon-greedy action selection integrated
   - GPU acceleration available

**Next Steps:**

We now have a Q-network that can approximate Q-values, but we haven't trained it yet! In the next sections, we'll learn about:

- **Experience Replay**: Storing and reusing past experiences for stable training
- **Target Networks**: Preventing moving target problems during training
- **DQN Training**: Putting it all together to learn from experience
- **Double DQN**: Reducing overestimation bias

These techniques are crucial for making deep Q-learning work in practice!

#### Experience Replay

**The Problem with Online Learning**

When training a Q-network, a naive approach would be to update the network immediately after each interaction with the environment:

1. Observe state $s_t$
2. Take action $a_t$
3. Receive reward $r_t$ and next state $s_{t+1}$
4. Immediately update the network using this single transition
5. Discard the transition and move on

**Why This Fails:**

This online learning approach has several critical problems:

1. **Correlation Between Consecutive Samples**:
   - Sequential experiences are highly correlated
   - The agent visits similar states in succession
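   - Stochastic gradient descent works best when training samples are i.i.d. (independent and identically distributed)
   - Correlated samples bias each update toward the most recent states, leading to poor convergence and overfitting to recent experience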

2. **Sample Inefficiency**:
   - Each experience is used only once for learning
   - Gathering experiences can be expensive (especially in real-world scenarios)
   - We're throwing away valuable data!

3. **Catastrophic Forgetting**:
   - The network quickly forgets what it learned about earlier states
   - As the agent explores new regions, it "overwrites" knowledge about previous regions
   - This leads to unstable and oscillating behavior

**The Solution: Experience Replay**

Experience replay, originally proposed by Lin (1992) and popularized for deep RL by the DQN paper (Mnih et al., 2015), solves these problems elegantly:

**Key Idea**: Store past experiences in a **replay buffer** (memory) and randomly sample mini-batches for training.

**How It Works:**

1. **Store**: Save each transition $(s_t, a_t, r_t, s_{t+1}, done_t)$ in a replay buffer
2. **Sample**: Randomly sample a mini-batch of transitions from the buffer
3. **Learn**: Update the network using the sampled batch
4. **Repeat**: Continue storing new experiences and sampling for training

**Why Experience Replay Works:**

1. **Breaks Correlation**:
   - Random sampling makes training batches much closer to i.i.d.
   - Transitions from different episodes and time steps are mixed
   - Network sees diverse experiences in each update

2. **Improves Sample Efficiency**:
   - Each experience can be used multiple times
   - Rare or important experiences aren't immediately forgotten
   - Better utilization of collected data

3. **Stabilizes Learning**:
   - Smooths out the learning process
   - Reduces variance in updates
   - Prevents catastrophic forgetting

4. **Enables Off-Policy Learning**:
   - Can learn from experiences generated by old policies
   - Decouples data collection from learning
   - More flexible training strategies

**Mathematical Perspective:**

The Q-learning update with experience replay:

$$
\begin{aligned}
\text{Sample mini-batch: } & \{(s_i, a_i, r_i, s'_i, done_i)\}_{i=1}^{N} \sim \mathcal{D} \\
\text{Target: } & y_i = r_i + \gamma (1 - done_i) \max_{a'} Q(s'_i, a'; \theta^-) \\
\text{Loss: } & \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - Q(s_i, a_i; \theta)\right)^2
\end{aligned}
$$

where:
- $\mathcal{D}$ is the replay buffer
- $N$ is the mini-batch size
- $\theta$ are the current network parameters
- $\theta^-$ are the target network parameters (we'll cover this next)
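
The target and loss formulas above translate almost line by line into array operations. The NumPy sketch below uses small made-up arrays in place of real network outputs: `q_next_target` stands for $Q(s'_i, \cdot\,; \theta^-)$ and `q_current` for $Q(s_i, \cdot\,; \theta)$. All values and variable names are assumptions for illustration.

```python
import numpy as np

gamma = 0.99

# A made-up mini-batch of N = 3 transitions
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 0.0, 1.0])   # third transition ends its episode
actions = np.array([0, 1, 1])

# Stand-ins for network outputs, one row per transition, one column per action
q_next_target = np.array([[0.5, 1.0], [2.0, 0.0], [3.0, 4.0]])  # Q(s', . ; theta^-)
q_current = np.array([[1.5, 0.2], [0.3, 2.0], [0.0, -0.5]])     # Q(s, . ; theta)

# y_i = r_i + gamma * (1 - done_i) * max_a' Q(s'_i, a'; theta^-)
targets = rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

# Q(s_i, a_i; theta): pick the column of the action actually taken
q_taken = q_current[np.arange(len(actions)), actions]

# Mean squared error over the batch
loss = np.mean((targets - q_taken) ** 2)

print("targets:", targets)
print("Q(s, a):", q_taken)
print(f"loss: {loss:.4f}")
```

The `(1 - dones)` factor zeroes out the bootstrap term for terminal transitions, so the third target is just its reward even though the next-state Q-values in that row are large.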

**Practical Considerations:**

- **Buffer Size**: Typically 10,000 to 1,000,000 transitions
  - Larger buffers provide more diversity but use more memory
  - Should be large enough to store experiences from many episodes

- **Batch Size**: Usually 32 to 256 transitions
  - Larger batches provide more stable gradients
  - Smaller batches train faster but with more variance

- **Sampling Strategy**: Uniform random sampling is most common
  - Advanced: Prioritized Experience Replay samples important transitions more often

- **When to Start Training**: Wait until buffer has enough samples
  - Typically start training after 1,000-10,000 initial experiences
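
As a rough illustration of the prioritized sampling strategy mentioned in the list above, the sketch below draws indices with probability proportional to the magnitude of each transition's TD error. It is a simplified, hedged take on the proportional variant: the function name, the default `alpha`, `beta`, and `eps` values, and the flat-array storage are assumptions made for this example, and practical implementations usually use a sum-tree for efficiency.

```python
import numpy as np


def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """Sample indices with probability proportional to |TD error|**alpha."""
    if rng is None:
        rng = np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** alpha  # eps keeps every transition sampleable
    probs = priorities / priorities.sum()
    indices = rng.choice(len(td_errors), size=batch_size, p=probs)  # with replacement
    # Importance-sampling weights compensate for the non-uniform sampling
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    return indices, weights / weights.max(), probs


# Made-up TD errors for four stored transitions; the third is the most "surprising"
td_errors = np.array([0.1, 0.1, 5.0, 0.1])
indices, weights, probs = prioritized_sample(
    td_errors, batch_size=8, rng=np.random.default_rng(0)
)

print("sampling probabilities:", probs.round(3))
print("sampled indices:       ", indices)
print("importance weights:    ", weights.round(3))
```

Setting `alpha=0` recovers uniform sampling, so the uniform buffer implemented in this notebook is the special case with no prioritization.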

Let's implement a replay buffer!

In [None]:
class ReplayBuffer:
    """Experience Replay Buffer for storing and sampling transitions.
    
    The replay buffer stores transitions (s, a, r, s', done) and provides
    random sampling for training. This breaks correlation between consecutive
    samples and improves sample efficiency.
    """
    
    def __init__(self, capacity=10000):
        """Initialize the replay buffer.
        
        Args:
            capacity: Maximum number of transitions to store
        """
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)  # Automatically removes old experiences
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition in the buffer.
        
        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Next state
            done: Whether episode ended
        """
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        """Sample a random batch of transitions.
        
        Args:
            batch_size: Number of transitions to sample
            
        Returns:
            Tuple of (states, actions, rewards, next_states, dones)
            Each element is a numpy array with batch_size entries
        """
        # Randomly sample batch_size transitions
        batch = random.sample(self.buffer, batch_size)
        
        # Unzip the batch into separate arrays
        states, actions, rewards, next_states, dones = zip(*batch)
        
        # Convert to numpy arrays for easier processing
        states = np.array(states)
        actions = np.array(actions)
        rewards = np.array(rewards)
        next_states = np.array(next_states)
        dones = np.array(dones, dtype=np.float32)
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        """Return the current size of the buffer."""
        return len(self.buffer)
    
    def is_ready(self, batch_size):
        """Check if buffer has enough samples for training.
        
        Args:
            batch_size: Required batch size
            
        Returns:
            True if buffer has at least batch_size samples
        """
        return len(self.buffer) >= batch_size


print("ReplayBuffer class implemented successfully!")
print("\nKey Features:")
print("  • Stores transitions (s, a, r, s', done)")
print("  • Automatic capacity management with deque")
print("  • Random sampling for breaking correlations")
print("  • Efficient batch preparation")

#### Demonstrating the Replay Buffer

Let's see how the replay buffer works in practice with some examples.

In [None]:
# Example 1: Basic Usage
print("=" * 60)
print("Example 1: Basic Replay Buffer Operations")
print("=" * 60)

# Create a replay buffer with small capacity for demonstration
replay_buffer = ReplayBuffer(capacity=5)

print(f"\nInitial buffer size: {len(replay_buffer)}")
print(f"Buffer capacity: {replay_buffer.capacity}")

# Add some transitions
print("\nAdding transitions to buffer...")
for i in range(7):
    state = np.array([i, i*2])
    action = i % 4
    reward = i * 0.1
    next_state = np.array([i+1, (i+1)*2])
    done = (i == 6)
    
    replay_buffer.push(state, action, reward, next_state, done)
    print(f"  Step {i}: Added transition | Buffer size: {len(replay_buffer)}")

print(f"\n💡 Notice: Buffer size capped at {replay_buffer.capacity}")
print("   Oldest transitions are automatically removed!")

In [None]:
# Example 2: Sampling from the Buffer
print("\n" + "=" * 60)
print("Example 2: Sampling Mini-Batches")
print("=" * 60)

# Create a larger buffer and fill it
replay_buffer = ReplayBuffer(capacity=100)

# Simulate collecting experiences
print("\nCollecting 50 experiences...")
for i in range(50):
    state = np.random.randn(4)  # Random 4D state
    action = np.random.randint(0, 3)  # Random action from {0, 1, 2}
    reward = np.random.randn()  # Random reward
    next_state = np.random.randn(4)
    done = (i % 10 == 9)  # Episode ends every 10 steps
    
    replay_buffer.push(state, action, reward, next_state, done)

print(f"Buffer size: {len(replay_buffer)}")

# Sample a mini-batch
batch_size = 8
print(f"\nSampling mini-batch of size {batch_size}...")

states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

print(f"\nBatch contents:")
print(f"  States shape: {states.shape}")
print(f"  Actions shape: {actions.shape}")
print(f"  Rewards shape: {rewards.shape}")
print(f"  Next states shape: {next_states.shape}")
print(f"  Dones shape: {dones.shape}")

print(f"\nFirst 3 transitions in batch:")
for i in range(3):
    print(f"\n  Transition {i}:")
    print(f"    State: {states[i]}")
    print(f"    Action: {actions[i]}")
    print(f"    Reward: {rewards[i]:.3f}")
    print(f"    Done: {bool(dones[i])}")

In [None]:
# Example 3: Demonstrating Correlation Breaking
print("\n" + "=" * 60)
print("Example 3: Breaking Temporal Correlation")
print("=" * 60)

# Create buffer and add sequential experiences
replay_buffer = ReplayBuffer(capacity=100)

# Simulate an agent moving through a 1D environment
print("\nSimulating sequential experiences (agent moving right):")
for position in range(50):
    state = np.array([position])
    action = 1  # Always move right
    reward = 0.1
    next_state = np.array([position + 1])
    done = False
    
    replay_buffer.push(state, action, reward, next_state, done)

print(f"Added {len(replay_buffer)} sequential transitions (positions 0-49)")

# Sample and show that samples are NOT sequential
print("\nSampling 10 transitions:")
states, actions, rewards, next_states, dones = replay_buffer.sample(10)

positions = states.flatten()
print(f"\nSampled positions: {positions.astype(int)}")
print(f"\n✓ Notice: Positions are NOT consecutive!")
print(f"  This breaks the temporal correlation.")
print(f"  The network sees diverse experiences in each batch.")

# Show correlation in sequential vs sampled
sequential_positions = np.arange(10)
sampled_positions = np.sort(positions[:10].astype(int))

print(f"\nComparison:")
print(f"  Sequential (online learning): {sequential_positions}")
print(f"  Sampled (replay buffer):      {sampled_positions}")
print(f"\n  Sequential correlation: HIGH (consecutive states)")
print(f"  Sampled correlation:    LOW (random states)")

In [None]:
# Example 4: Sample Efficiency Demonstration
print("\n" + "=" * 60)
print("Example 4: Sample Efficiency with Experience Replay")
print("=" * 60)

# Simulate training with and without replay
num_experiences = 1000
batch_size = 32
training_steps = 100

print(f"\nScenario: Collected {num_experiences} experiences")
print(f"Training for {training_steps} steps with batch size {batch_size}")

# Without replay: each experience used once
experiences_used_no_replay = num_experiences

# With replay: each training step uses batch_size samples
experiences_used_with_replay = training_steps * batch_size

print(f"\n📊 Sample Usage:")
print(f"\n  WITHOUT Experience Replay:")
print(f"    • Each experience used: 1 time")
print(f"    • Total experience usage: {experiences_used_no_replay}")
print(f"    • Training updates: {num_experiences}")

print(f"\n  WITH Experience Replay:")
print(f"    • Each experience used: ~{experiences_used_with_replay / num_experiences:.1f} times (on average)")
print(f"    • Total experience usage: {experiences_used_with_replay}")
print(f"    • Training updates: {training_steps}")

efficiency_gain = experiences_used_with_replay / experiences_used_no_replay
print(f"\n  ✓ Sample Efficiency Gain: {efficiency_gain:.1f}x")
print(f"    We get {efficiency_gain:.1f}x more learning from the same data!")

In [None]:
# Example 5: Visualizing Buffer Dynamics
print("\n" + "=" * 60)
print("Example 5: Replay Buffer Dynamics Over Time")
print("=" * 60)

# Simulate buffer filling and sampling over time
buffer_capacity = 100
replay_buffer = ReplayBuffer(capacity=buffer_capacity)

buffer_sizes = []
experiences_collected = []

# Collect experiences over time
for step in range(200):
    # Add new experience
    state = np.random.randn(4)
    action = np.random.randint(0, 3)
    reward = np.random.randn()
    next_state = np.random.randn(4)
    done = False
    
    replay_buffer.push(state, action, reward, next_state, done)
    
    buffer_sizes.append(len(replay_buffer))
    experiences_collected.append(step + 1)

# Plot buffer size over time
plt.figure(figsize=(12, 6))

plt.plot(experiences_collected, buffer_sizes, linewidth=2, color='blue', label='Buffer Size')
plt.axhline(y=buffer_capacity, color='red', linestyle='--', linewidth=2, label=f'Capacity ({buffer_capacity})')
plt.fill_between(experiences_collected, 0, buffer_sizes, alpha=0.3, color='blue')

plt.xlabel('Experiences Collected', fontsize=12)
plt.ylabel('Buffer Size', fontsize=12)
plt.title('Replay Buffer Size Over Time', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("   • Buffer grows linearly until reaching capacity")
print("   • After capacity is reached, oldest experiences are removed")
print("   • This maintains a 'sliding window' of recent experiences")
print("   • Ensures buffer contains relevant, up-to-date experiences")

#### Summary: Experience Replay

**What We Learned:**

1. **The Problem**:
   - Online learning suffers from correlated samples
   - Sample inefficiency (each experience used once)
   - Catastrophic forgetting of earlier experiences

2. **The Solution**:
   - Store transitions in a replay buffer
   - Randomly sample mini-batches for training
   - Reuse experiences multiple times

3. **Key Benefits**:
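   - **Breaks Correlation**: Random sampling makes each batch approximately i.i.d.
   - **Sample Efficiency**: Each experience can be used in several updates
   - **Stability**: Smooths learning and reduces the variance of updates
   - **Off-Policy**: Fits off-policy methods such as Q-learning, which can learn from experiences generated by older policies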

4. **Implementation Details**:
   - Use `deque` with `maxlen` for automatic capacity management
   - Store complete transitions: $(s, a, r, s', done)$
   - Random sampling with `random.sample()`
   - Batch preparation for efficient training

5. **Practical Guidelines**:
   - Buffer size: 10,000 - 1,000,000 transitions
   - Batch size: 32 - 256 transitions
   - Start training after buffer has enough samples
   - Larger buffers = more diversity, more memory
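
The memory side of the buffer-size trade-off can be estimated before training starts. The following is a rough sketch, assuming states are stored as float32 vectors and each transition holds two states plus an action, a reward, and a done flag. `estimate_buffer_bytes` is a hypothetical helper, not part of this notebook's `ReplayBuffer`, and the 84x84 image size is only an illustrative choice. Real Python containers add per-object overhead on top of this figure.

```python
def estimate_buffer_bytes(capacity, state_dim, bytes_per_float=4):
    """Rough lower bound on replay buffer memory (hypothetical helper).

    Assumes each transition stores state and next_state as float32 vectors,
    plus one 8-byte action, one 4-byte reward, and one 1-byte done flag.
    """
    per_transition = 2 * state_dim * bytes_per_float + 8 + 4 + 1
    return capacity * per_transition


# CartPole-sized states (4 floats) versus a flattened 84x84 image observation
for capacity in (10_000, 1_000_000):
    small = estimate_buffer_bytes(capacity, state_dim=4)
    large = estimate_buffer_bytes(capacity, state_dim=84 * 84)
    print(f"capacity={capacity:>9,}: vector states ~{small / 1e6:.2f} MB, "
          f"image states ~{large / 1e9:.2f} GB")
```

For low-dimensional states the buffer is cheap even at a million transitions; for image observations the same capacity can dominate memory, which is one reason buffer size is treated as a tunable hyperparameter.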

**Impact on Deep RL:**

Experience replay was a crucial ingredient in making DQN work. Without it:
- Training is unstable and often diverges
- Sample efficiency is poor
- Performance is significantly worse

With experience replay:
- DQN can learn to play Atari games directly from pixels
- Training is far more stable
- Sample efficiency is greatly improved

**Next Steps:**

Now that we have both a Q-network and experience replay, we need one more ingredient for stable DQN training: **target networks**. We'll cover this next!

#### Target Networks for Stable Training

**The Moving Target Problem**

When training a Q-network, we face a fundamental instability issue. Recall the Q-learning update:

$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$

The problem: We're using the **same network** to:
1. Estimate the current Q-value: $Q(s, a)$
2. Estimate the target Q-value: $r + \gamma \max_{a'} Q(s', a')$

This creates a **moving target problem**:
- Every time we update the network, we change both the prediction AND the target
- It's like trying to hit a target that moves every time you adjust your aim
- This leads to oscillations, divergence, and unstable training
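
The moving target can be seen in a few lines of plain Python. This is a toy sketch, assuming a linear Q-function over two made-up feature vectors (`phi_sa` and `phi_next` are hypothetical, not taken from any environment in this notebook). It runs three TD updates twice: once computing the target from the weights being trained, and once from a frozen copy.

```python
def q_value(weights, features):
    """Linear Q-value: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def td_step(weights, target_weights, phi_sa, phi_next, reward, gamma=0.99, lr=0.1):
    """One semi-gradient TD update; the target is computed from target_weights."""
    target = reward + gamma * q_value(target_weights, phi_next)
    td_error = target - q_value(weights, phi_sa)
    return [w + lr * td_error * f for w, f in zip(weights, phi_sa)]


# Hypothetical feature vectors for (s, a) and (s', a'). They overlap, as
# features of consecutive states usually do with function approximation.
phi_sa, phi_next = [1.0, 0.5], [0.5, 1.0]
reward, gamma = 1.0, 0.99
weights = [0.0, 0.0]

# Shared weights: the target is recomputed from the weights being trained.
shared = list(weights)
targets_shared = []
for _ in range(3):
    targets_shared.append(reward + gamma * q_value(shared, phi_next))
    shared = td_step(shared, shared, phi_sa, phi_next, reward, gamma)

# Frozen copy: the target stays fixed while the online weights change.
online, frozen = list(weights), list(weights)
targets_frozen = []
for _ in range(3):
    targets_frozen.append(reward + gamma * q_value(frozen, phi_next))
    online = td_step(online, frozen, phi_sa, phi_next, reward, gamma)

print("targets with shared weights:", [round(t, 3) for t in targets_shared])
print("targets with frozen weights:", [round(t, 3) for t in targets_frozen])
```

With shared weights, every update to $Q(s, a)$ also shifts the value used for $s'$, so the target drifts after each step. With the frozen copy, the target stays put until the copy is refreshed.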

**The Target Network Solution**

The solution is to use **two separate networks**:

1. **Online Network** (parameters $\theta$): Updated every step, used to select actions
2. **Target Network** (parameters $\theta^-$): Updated infrequently, used to compute targets

The modified update becomes:

$
\text{Loss} = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]
$

where:
- $Q_\theta$ is the online network (being trained)
- $Q_{\theta^-}$ is the target network (held fixed)
- $\mathcal{D}$ is the replay buffer
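
To make the loss concrete, here is a small worked example in plain Python. It is a sketch on a made-up batch of three transitions (all numbers are invented for illustration). It also masks the bootstrap term on terminal transitions, which the formula above leaves implicit and which the agent code in this notebook handles with `(1 - dones)`.

```python
def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """TD targets r + gamma * max_a' Q_target(s', a'); no bootstrap on terminal steps."""
    return [
        r + (0.0 if done else gamma * max(q_next))
        for r, q_next, done in zip(rewards, next_q_target, dones)
    ]


def mse(predictions, targets):
    """Mean squared error between predicted Q-values and TD targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up batch of three transitions with two actions each
rewards = [1.0, 1.0, 0.0]
next_q_target = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]  # target network outputs for s'
dones = [False, False, True]                           # last transition is terminal
current_q = [2.5, 2.0, 0.5]                            # online network Q(s, a) for taken actions

targets = dqn_targets(rewards, next_q_target, dones)
print("targets:", targets)
print("loss:", mse(current_q, targets))
```

The third target equals the reward alone because the episode ended there, even though the target network assigns large values to the next state.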

**How Target Networks Work:**

1. Initialize both networks with the same weights: $\theta^- = \theta$
2. For many steps:
   - Use online network to select actions and compute current Q-values
   - Use target network to compute target Q-values
   - Update only the online network
3. Periodically (e.g., every 1000 steps): $\theta^- \leftarrow \theta$

**Why This Works:**

- The target network provides **stable targets** for many updates
- The online network can learn without chasing a moving target
- Periodic updates ensure the target network eventually catches up
- This dramatically improves training stability

**Key Hyperparameters:**

- **Update Frequency**: How often to copy weights (e.g., every 1000-10000 steps)
- **Soft Updates** (alternative): $\theta^- \leftarrow \tau \theta + (1-\tau) \theta^-$ with small $\tau$ (e.g., 0.001)
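
The soft-update rule is not implemented elsewhere in this notebook, so here is a framework-free sketch of it. Parameters are stored as plain lists standing in for weight tensors, and `soft_update` is a hypothetical helper; in a deep learning framework the same rule would be applied to each parameter tensor.

```python
def soft_update(online_params, target_params, tau=0.001):
    """Polyak averaging: target <- tau * online + (1 - tau) * target, per parameter."""
    return {
        name: [tau * w + (1.0 - tau) * w_t
               for w, w_t in zip(online_params[name], target_params[name])]
        for name in target_params
    }


# Toy parameters stored as plain lists (stand-ins for weight tensors)
online = {"layer1": [1.0, -2.0], "layer2": [0.5]}
target = {"layer1": [0.0, 0.0], "layer2": [0.0]}

for step in range(1, 1001):
    target = soft_update(online, target, tau=0.001)
    if step in (1, 100, 1000):
        print(step, [round(w, 4) for w in target["layer1"]])
```

With the online parameters held fixed, the target closes roughly 63% of the gap after $1/\tau$ steps, so $\tau$ plays a role similar to the update frequency in the hard-copy scheme: smaller values mean a slower-moving target.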

Let's implement a complete DQN agent with target networks!

In [None]:
class DQNAgent:
    """Deep Q-Network agent with experience replay and target network."""
    
    def __init__(self, state_dim, action_dim, hidden_dim=128, 
                 lr=1e-3, gamma=0.99, epsilon_start=1.0, epsilon_end=0.01, 
                 epsilon_decay=0.995, buffer_size=10000, batch_size=64,
                 target_update_freq=1000):
        """
        Initialize DQN agent.
        
        Args:
            state_dim: Dimension of state space
            action_dim: Number of possible actions
            hidden_dim: Size of hidden layers
            lr: Learning rate
            gamma: Discount factor
            epsilon_start: Initial exploration rate
            epsilon_end: Minimum exploration rate
            epsilon_decay: Multiplicative decay applied to epsilon after each update step
            buffer_size: Size of replay buffer
            batch_size: Mini-batch size for training
            target_update_freq: Steps between target network updates
        """
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.steps = 0
        
        # Create online and target networks
        self.online_network = QNetwork(state_dim, action_dim, hidden_dim)
        self.target_network = QNetwork(state_dim, action_dim, hidden_dim)
        
        # Initialize target network with same weights as online network
        self.target_network.load_state_dict(self.online_network.state_dict())
        self.target_network.eval()  # Target network is always in eval mode
        
        # Optimizer for online network
        self.optimizer = optim.Adam(self.online_network.parameters(), lr=lr)
        
        # Experience replay buffer
        self.replay_buffer = ReplayBuffer(buffer_size)
        
    def select_action(self, state, training=True):
        """Select action using epsilon-greedy policy.
        
        Args:
            state: Current state
            training: If True, use epsilon-greedy; if False, use greedy
            
        Returns:
            action: Selected action
        """
        # Exploration: random action
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.action_dim)
        
        # Exploitation: best action according to online network
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.online_network(state_tensor)
            return q_values.argmax().item()
    
    def store_transition(self, state, action, reward, next_state, done):
        """Store a transition in the replay buffer."""
        self.replay_buffer.push(state, action, reward, next_state, done)
    
    def update(self):
        """Perform one update step using a mini-batch from replay buffer.
        
        Returns:
            loss: TD loss value (or None if buffer too small)
        """
        # Don't update if buffer doesn't have enough samples
        if len(self.replay_buffer) < self.batch_size:
            return None
        
        # Sample mini-batch from replay buffer
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
        
        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)
        
        # Compute current Q-values using online network
        current_q_values = self.online_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Compute target Q-values using target network
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
        
        # Compute loss (Mean Squared Error)
        loss = F.mse_loss(current_q_values, target_q_values)
        
        # Optimize the online network
        self.optimizer.zero_grad()
        loss.backward()
        # Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(self.online_network.parameters(), max_norm=10)
        self.optimizer.step()
        
        # Update target network periodically
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.update_target_network()
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
        
        return loss.item()
    
    def update_target_network(self):
        """Copy weights from online network to target network."""
        self.target_network.load_state_dict(self.online_network.state_dict())
        print(f"Target network updated at step {self.steps}")


print("DQN Agent with Target Network implemented!")
print("\nKey components:")
print("  ✓ Online network for action selection and training")
print("  ✓ Target network for stable Q-value targets")
print("  ✓ Experience replay for breaking correlations")
print("  ✓ Epsilon-greedy exploration")
print("  ✓ Periodic target network updates")

#### Training DQN on CartPole

Now let's train our DQN agent on the CartPole environment! CartPole is a classic control problem where the goal is to balance a pole on a moving cart.

**CartPole Environment:**
- **State**: 4 continuous values (cart position, cart velocity, pole angle, pole angular velocity)
- **Actions**: 2 discrete actions (push left or push right)
- **Reward**: +1 for every timestep the pole stays upright
- **Episode ends**: When pole falls too far or cart moves off screen
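- **Success**: Average reward of 195+ over 100 episodes. This is the CartPole-v0 criterion, used throughout this notebook as a convenient target; CartPole-v1, which we train on, allows up to 500 steps per episode and its official threshold is 475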

Let's implement the training loop:

In [None]:
def train_dqn(env_name='CartPole-v1', num_episodes=500, max_steps=500, 
              print_every=50, render=False):
    """
    Train a DQN agent on a Gym environment.
    
    Args:
        env_name: Name of Gym environment
        num_episodes: Number of episodes to train
        max_steps: Maximum steps per episode
        print_every: Print progress every N episodes
        render: Whether to render the environment
        
    Returns:
        agent: Trained DQN agent
        episode_rewards: List of total rewards per episode
        episode_lengths: List of episode lengths
        losses: List of training losses
    """
    # Create environment
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    print(f"Training DQN on {env_name}")
    print(f"State dimension: {state_dim}")
    print(f"Action dimension: {action_dim}")
    print("="*60)
    
    # Create agent
    agent = DQNAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        hidden_dim=128,
        lr=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=0.995,
        buffer_size=10000,
        batch_size=64,
        target_update_freq=100
    )
    
    # Training metrics
    episode_rewards = []
    episode_lengths = []
    losses = []
    recent_rewards = deque(maxlen=100)
    
    # Training loop
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        episode_loss = []
        
        for step in range(max_steps):
            if render:
                env.render()
            
            # Select and perform action
            action = agent.select_action(state, training=True)
            next_state, reward, done, _ = env.step(action)
            
            # Store transition
            agent.store_transition(state, action, reward, next_state, done)
            
            # Update agent
            loss = agent.update()
            if loss is not None:
                episode_loss.append(loss)
            
            episode_reward += reward
            state = next_state
            
            if done:
                break
        
        # Record metrics
        episode_rewards.append(episode_reward)
        episode_lengths.append(step + 1)
        recent_rewards.append(episode_reward)
        if episode_loss:
            losses.append(np.mean(episode_loss))
        
        # Print progress
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(recent_rewards)
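            # Average episode length over the same 100-episode window
            avg_length = np.mean(episode_lengths[-100:])
            print(f"Episode {episode + 1}/{num_episodes}")
            print(f"  Avg Reward (last 100): {avg_reward:.2f}")
            print(f"  Avg Length (last 100): {avg_length:.1f}")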
            print(f"  Epsilon: {agent.epsilon:.3f}")
            print(f"  Buffer Size: {len(agent.replay_buffer)}")
            if losses:
                print(f"  Avg Loss: {np.mean(losses[-100:]):.4f}")
            
            # Check if solved
            if avg_reward >= 195.0:
                print(f"\n🎉 Environment solved in {episode + 1} episodes!")
                print(f"   Average reward: {avg_reward:.2f}")
                break
    
    env.close()
    return agent, episode_rewards, episode_lengths, losses


# Train the agent
print("Starting DQN training...\n")
agent, rewards, lengths, losses = train_dqn(
    env_name='CartPole-v1',
    num_episodes=500,
    max_steps=500,
    print_every=50
)

print("\n" + "="*60)
print("Training completed!")
print(f"Total episodes: {len(rewards)}")
print(f"Final average reward (last 100): {np.mean(rewards[-100:]):.2f}")
print(f"Best episode reward: {max(rewards):.2f}")

#### Visualizing Training Progress

Let's visualize how the agent's performance improved during training:

In [None]:
# Create comprehensive training visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Episode Rewards
ax1 = axes[0, 0]
ax1.plot(rewards, alpha=0.3, color='blue', label='Episode Reward')
# Moving average
window = 50
if len(rewards) >= window:
    moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(rewards)), moving_avg, color='red', 
             linewidth=2, label=f'{window}-Episode Moving Average')
ax1.axhline(y=195, color='green', linestyle='--', linewidth=2, 
            label='Solved Threshold (195)', alpha=0.7)
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Total Reward', fontsize=12)
ax1.set_title('DQN Training Progress: Episode Rewards', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Episode Lengths
ax2 = axes[0, 1]
ax2.plot(lengths, alpha=0.3, color='purple', label='Episode Length')
if len(lengths) >= window:
    moving_avg_length = np.convolve(lengths, np.ones(window)/window, mode='valid')
    ax2.plot(range(window-1, len(lengths)), moving_avg_length, color='orange', 
             linewidth=2, label=f'{window}-Episode Moving Average')
ax2.set_xlabel('Episode', fontsize=12)
ax2.set_ylabel('Episode Length (steps)', fontsize=12)
ax2.set_title('Episode Lengths Over Time', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Training Loss
ax3 = axes[1, 0]
if losses:
    ax3.plot(losses, alpha=0.5, color='red', label='TD Loss')
    if len(losses) >= window:
        moving_avg_loss = np.convolve(losses, np.ones(window)/window, mode='valid')
        ax3.plot(range(window-1, len(losses)), moving_avg_loss, color='darkred', 
                 linewidth=2, label=f'{window}-Episode Moving Average')
    ax3.set_xlabel('Episode', fontsize=12)
    ax3.set_ylabel('Loss', fontsize=12)
    ax3.set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')  # Log scale for better visualization
else:
    ax3.text(0.5, 0.5, 'No loss data available', 
             ha='center', va='center', fontsize=12)
    ax3.set_title('Training Loss Over Time', fontsize=14, fontweight='bold')

# Plot 4: Reward Distribution
ax4 = axes[1, 1]
ax4.hist(rewards, bins=30, alpha=0.7, color='green', edgecolor='black')
ax4.axvline(x=np.mean(rewards), color='red', linestyle='--', 
            linewidth=2, label=f'Mean: {np.mean(rewards):.2f}')
ax4.axvline(x=195, color='blue', linestyle='--', 
            linewidth=2, label='Solved: 195')
ax4.set_xlabel('Total Reward', fontsize=12)
ax4.set_ylabel('Frequency', fontsize=12)
ax4.set_title('Distribution of Episode Rewards', fontsize=14, fontweight='bold')
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n📊 Training Summary Statistics:")
print("="*60)
print(f"Total Episodes: {len(rewards)}")
print(f"\nReward Statistics:")
print(f"  Mean: {np.mean(rewards):.2f}")
print(f"  Std: {np.std(rewards):.2f}")
print(f"  Min: {np.min(rewards):.2f}")
print(f"  Max: {np.max(rewards):.2f}")
print(f"  Last 100 episodes mean: {np.mean(rewards[-100:]):.2f}")
print(f"\nEpisode Length Statistics:")
print(f"  Mean: {np.mean(lengths):.2f}")
print(f"  Max: {np.max(lengths)}")
print(f"\nExploration:")
print(f"  Final epsilon: {agent.epsilon:.4f}")
print(f"  Replay buffer size: {len(agent.replay_buffer)}")

if np.mean(rewards[-100:]) >= 195:
    print("\n✅ Environment SOLVED! Average reward ≥ 195 over last 100 episodes")
else:
    print(f"\n⚠️  Not quite solved yet. Need {195 - np.mean(rewards[-100:]):.2f} more reward on average.")

#### Understanding the Results

**What We Observe:**

1. **Learning Curve**: The agent starts with poor performance (random actions) and gradually improves as it learns

2. **Exploration vs Exploitation**: 
   - Early episodes: High epsilon → more exploration → variable performance
   - Later episodes: Low epsilon → more exploitation → more stable, higher performance

3. **Target Network Impact**:
   - Training is typically stable, without wild oscillations
   - The TD loss stays bounded instead of diverging; it need not fall monotonically, because the targets shift each time the target network is refreshed
   - The agent usually converges to a good policy

4. **Experience Replay Benefits**:
   - Efficient use of past experiences
   - Breaks temporal correlations
   - Enables mini-batch training

**Key Insights:**

- **Target networks** prevent the moving target problem and stabilize training
- **Experience replay** breaks correlations and improves sample efficiency
- **Epsilon-greedy** exploration ensures the agent discovers good strategies
- The combination of these techniques makes DQN work!
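
How quickly exploration fades depends on where the decay is applied. In this notebook's `DQNAgent`, epsilon is multiplied by `epsilon_decay` once per call to `update()`, so the schedule is counted in update steps rather than episodes. The sketch below counts how many such steps it takes to hit the floor; `steps_until_floor` is a hypothetical helper written for this illustration.

```python
def steps_until_floor(epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon reaches its floor."""
    epsilon, steps = epsilon_start, 0
    while epsilon > epsilon_end:
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        steps += 1
    return steps


for decay in (0.99, 0.995, 0.999):
    steps = steps_until_floor(epsilon_decay=decay)
    print(f"decay={decay}: floor reached after {steps} update steps")
```

With the default settings the floor is reached after a little over 900 updates, which can be only a few dozen CartPole episodes. If the agent seems to stop exploring too early, a decay closer to 1 or a per-episode schedule are the usual things to try.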

**Hyperparameter Sensitivity:**

- **Learning rate**: Too high → instability; too low → slow learning
- **Target update frequency**: Too frequent → moving target; too rare → slow adaptation
- **Batch size**: Larger → more stable gradients; smaller → more updates
- **Buffer size**: Larger → more diversity; smaller → more recent experiences

**Next Steps:**

DQN with target networks is a powerful algorithm, but it still has limitations. In the next section, we'll explore **Double DQN**, which addresses the overestimation bias in standard DQN!

#### Double DQN: Addressing Overestimation Bias

**The Problem with Standard DQN**

Standard DQN has a subtle but important flaw: it tends to **overestimate** Q-values. This happens because of how the max operator is used in the Q-learning update:

$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]
$

**Why Overestimation Occurs:**

The same network is used for both:
1. **Selecting** the best action: $\arg\max_{a'} Q(s', a')$
2. **Evaluating** that action: $Q(s', a')$

This creates a **maximization bias**: if the Q-values have any estimation errors (which they always do), the max operation will tend to select actions with positive errors, leading to systematic overestimation.

**Example of the Problem:**

Imagine you're estimating the value of 3 actions, and your estimates have random errors:
- True values: [1.0, 1.0, 1.0] (all equal)
- Noisy estimates: [0.9, 1.2, 0.8] (with random errors)
- Standard DQN picks: max([0.9, 1.2, 0.8]) = 1.2
- This overestimates the true value of 1.0!

Over many updates, these overestimations accumulate and can hurt performance.

**The Double DQN Solution**

Double DQN (DDQN) addresses this by **decoupling action selection from action evaluation**:

1. Use the **online network** to select the best action
2. Use the **target network** to evaluate that action

**Double DQN Update Rule:**

$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q_{\theta^-}\left(s', \arg\max_{a'} Q_\theta(s', a')\right) - Q(s,a) \right]
$

where:
- $Q_\theta$ is the online network (selects action)
- $Q_{\theta^-}$ is the target network (evaluates action)

**Key Insight:**

By using different networks for selection and evaluation, we reduce the correlation between the errors, which reduces overestimation bias.

**Benefits of Double DQN:**

- More accurate Q-value estimates
- Better performance on many tasks
- More stable learning
- Minimal computational overhead (we already have both networks!)

Let's implement Double DQN and compare it with standard DQN!
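
Before the implementation, the maximization bias from the three-action example can be checked numerically. The following is a Monte Carlo sketch, assuming Gaussian estimation noise and two fully independent sets of estimates (`simulate_bias` and its parameters are hypothetical). In actual Double DQN the online and target networks are correlated, so the bias is reduced rather than removed.

```python
import random


def simulate_bias(num_actions=3, noise_std=0.2, trials=20_000, seed=0):
    """Compare single- and double-estimator values when every true value is 1.0."""
    rng = random.Random(seed)
    true_value = 1.0
    single_total, double_total = 0.0, 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same true action values
        est_a = [true_value + rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        est_b = [true_value + rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Single estimator: select and evaluate with the same estimates
        single_total += max(est_a)
        # Double estimator: select with est_a, evaluate with est_b
        best = max(range(num_actions), key=lambda a: est_a[a])
        double_total += est_b[best]
    return single_total / trials, double_total / trials


single, double = simulate_bias()
print(f"single estimator: {single:.3f}  (true value 1.0)")
print(f"double estimator: {double:.3f}  (true value 1.0)")
```

The single estimator lands noticeably above 1.0 even though no action is actually better than another, while the double estimator stays close to the true value because the noise used for selection is independent of the noise used for evaluation.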

In [None]:
class DoubleDQNAgent(DQNAgent):
    """Double DQN agent that reduces overestimation bias.

    Inherits from DQNAgent and only modifies the update method to use
    Double Q-learning: online network selects actions, target network evaluates them.
    """

    def update(self):
        """Update the agent using Double Q-learning.

        Key difference from standard DQN:
        - Online network selects the best action
        - Target network evaluates that action
        """
        # Need enough samples in buffer
        if len(self.replay_buffer) < self.batch_size:
            return None

        # Sample mini-batch through the buffer's own sample() method, as in DQNAgent.update
        # (random.sample() needs a sequence and would not work on the buffer object itself)
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)

        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)

        # Compute current Q-values
        current_q_values = self.online_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Compute target Q-values using Double Q-learning
        with torch.no_grad():
            # DOUBLE DQN: Use online network to SELECT actions
            next_actions = self.online_network(next_states).argmax(1)

            # DOUBLE DQN: Use target network to EVALUATE those actions
            next_q_values = self.target_network(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)

            # Compute targets
            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values

        # Compute loss (Mean Squared Error, same as DQNAgent)
        loss = F.mse_loss(current_q_values, target_q_values)

        # Optimize the online network
        self.optimizer.zero_grad()
        loss.backward()
        # Clip gradients as DQNAgent.update does, so the two agents differ only in the target
        torch.nn.utils.clip_grad_norm_(self.online_network.parameters(), max_norm=10)
        self.optimizer.step()

        # Update target network periodically
        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.update_target_network()

        # Decay epsilon
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

        return loss.item()


print("Double DQN Agent implemented!")
print("\nKey difference from standard DQN:")
print("  ✓ Online network SELECTS the best action")
print("  ✓ Target network EVALUATES that action")
print("  ✓ This decoupling reduces overestimation bias")

#### Comparing Standard DQN vs Double DQN

Now let's train both algorithms on the same environment and compare their performance. We'll look at:

1. **Learning curves**: How quickly do they learn?
2. **Final performance**: Which achieves better results?
3. **Stability**: Which is more consistent?
4. **Q-value estimates**: Do we see evidence of overestimation?

Let's run the comparison:
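
Once both runs finish, the first three criteria can also be reduced to a few numbers per run. This is a minimal sketch, assuming each run is just a list of per-episode rewards. `summarize_run`, its window size, and the reward curves below are hypothetical and only show the output format.

```python
from statistics import mean, pstdev


def summarize_run(rewards, last_n=50, threshold=195.0):
    """Summarize one training run (hypothetical helper for comparing agents)."""
    tail = rewards[-last_n:]
    # First episode count at which the trailing last_n-episode mean reaches the threshold
    first_solved = next(
        (i + 1 for i in range(last_n - 1, len(rewards))
         if mean(rewards[i - last_n + 1:i + 1]) >= threshold),
        None,
    )
    return {
        "final_mean": mean(tail),               # final performance
        "final_std": pstdev(tail),              # stability (lower is more consistent)
        "episodes_to_threshold": first_solved,  # learning speed (None if never reached)
    }


# Made-up reward curves, only to show the output format
run_a = [20.0] * 50 + [200.0] * 100
run_b = [20.0] * 100 + [150.0, 250.0] * 25
print(summarize_run(run_a))
print(summarize_run(run_b))
```

A single seed gives only one sample per algorithm, so differences in these numbers are suggestive at best; repeating the comparison over several seeds makes them more trustworthy.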

In [None]:
def train_agent_comparison(agent_class, agent_name, env_name='CartPole-v1',
                           num_episodes=300, max_steps=500, seed=42):
    """
    Train an agent and return metrics for comparison.

    Args:
        agent_class: DQNAgent or DoubleDQNAgent class
        agent_name: Name for logging
        env_name: Gym environment name
        num_episodes: Number of training episodes
        max_steps: Max steps per episode
        seed: Random seed for reproducibility

    Returns:
        Dictionary with training metrics
    """
    # Set seeds for reproducibility
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)

    # Create environment
    env = gym.make(env_name)
    env.seed(seed)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    print(f"\nTraining {agent_name} on {env_name}")
    print("=" * 60)

    # Create agent
    agent = agent_class(
        state_dim=state_dim,
        action_dim=action_dim,
        hidden_dim=128,
        lr=1e-3,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.01,
        epsilon_decay=0.995,
        buffer_size=10000,
        batch_size=64,
        target_update_freq=100
    )

    # Training metrics
    episode_rewards = []
    episode_lengths = []
    losses = []
    q_values = []  # Track Q-values to detect overestimation

    # Training loop
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        episode_loss = []
        episode_q = []

        for step in range(max_steps):
            # Select and perform action
            action = agent.select_action(state, training=True)

            # Track Q-values
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                q_vals = agent.online_network(state_tensor).max().item()
                episode_q.append(q_vals)

            next_state, reward, done, _ = env.step(action)

            # Store transition
            agent.store_transition(state, action, reward, next_state, done)

            # Update agent
            loss = agent.update()
            if loss is not None:
                episode_loss.append(loss)

            episode_reward += reward
            state = next_state

            if done:
                break

        # Record metrics
        episode_rewards.append(episode_reward)
        episode_lengths.append(step + 1)
        if episode_loss:
            losses.append(np.mean(episode_loss))
        if episode_q:
            q_values.append(np.mean(episode_q))

        # Print progress
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode + 1}/{num_episodes} | "
                  f"Avg Reward: {avg_reward:.2f} | "
                  f"Epsilon: {agent.epsilon:.3f}")

    env.close()

    return {
        'name': agent_name,
        'rewards': episode_rewards,
        'lengths': episode_lengths,
        'losses': losses,
        'q_values': q_values,
        'agent': agent
    }


# Train both agents with the same seed for fair comparison
print("Starting comparison experiment...")
print("This will train both DQN and Double DQN on CartPole-v1")
print("=" * 60)

# Train standard DQN
dqn_results = train_agent_comparison(
    DQNAgent,
    "Standard DQN",
    num_episodes=300,
    seed=42
)

# Train Double DQN
ddqn_results = train_agent_comparison(
    DoubleDQNAgent,
    "Double DQN",
    num_episodes=300,
    seed=42
)

print("\n" + "=" * 60)
print("Training completed for both agents!")
print("=" * 60)

#### Visualizing the Comparison

Let's create comprehensive visualizations to compare the two algorithms:

In [None]:
# Create comprehensive comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Prepare data
window = 20  # Moving average window

# Plot 1: Episode Rewards Comparison
ax1 = axes[0, 0]
ax1.plot(dqn_results['rewards'], alpha=0.2, color='blue', linewidth=0.5)
ax1.plot(ddqn_results['rewards'], alpha=0.2, color='red', linewidth=0.5)

# Moving averages
if len(dqn_results['rewards']) >= window:
    dqn_ma = np.convolve(dqn_results['rewards'], np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(dqn_results['rewards'])), dqn_ma,
             color='blue', linewidth=2.5, label='Standard DQN')

if len(ddqn_results['rewards']) >= window:
    ddqn_ma = np.convolve(ddqn_results['rewards'], np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(ddqn_results['rewards'])), ddqn_ma,
             color='red', linewidth=2.5, label='Double DQN')

# 195 is the CartPole-v0 "solved" criterion; CartPole-v1 registers a reward threshold of 475
ax1.axhline(y=195, color='green', linestyle='--', linewidth=2,
            label='Solved Threshold (CartPole-v0: 195)', alpha=0.7)
ax1.set_xlabel('Episode', fontsize=12)
ax1.set_ylabel('Total Reward', fontsize=12)
ax1.set_title('Learning Curves: DQN vs Double DQN', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Q-Value Estimates Over Time
ax2 = axes[0, 1]
if dqn_results['q_values'] and ddqn_results['q_values']:
    ax2.plot(dqn_results['q_values'], alpha=0.3, color='blue', linewidth=0.5)
    ax2.plot(ddqn_results['q_values'], alpha=0.3, color='red', linewidth=0.5)

    # Moving averages
    if len(dqn_results['q_values']) >= window:
        dqn_q_ma = np.convolve(dqn_results['q_values'], np.ones(window)/window, mode='valid')
        ax2.plot(range(window-1, len(dqn_results['q_values'])), dqn_q_ma,
                 color='blue', linewidth=2.5, label='Standard DQN')

    if len(ddqn_results['q_values']) >= window:
        ddqn_q_ma = np.convolve(ddqn_results['q_values'], np.ones(window)/window, mode='valid')
        ax2.plot(range(window-1, len(ddqn_results['q_values'])), ddqn_q_ma,
                 color='red', linewidth=2.5, label='Double DQN')

    ax2.set_xlabel('Episode', fontsize=12)
    ax2.set_ylabel('Average Max Q-Value', fontsize=12)
    ax2.set_title('Q-Value Estimates: Evidence of Overestimation?', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)

# Plot 3: Training Loss Comparison
ax3 = axes[1, 0]
if dqn_results['losses'] and ddqn_results['losses']:
    ax3.plot(dqn_results['losses'], alpha=0.3, color='blue', linewidth=0.5)
    ax3.plot(ddqn_results['losses'], alpha=0.3, color='red', linewidth=0.5)

    # Moving averages
    if len(dqn_results['losses']) >= window:
        dqn_loss_ma = np.convolve(dqn_results['losses'], np.ones(window)/window, mode='valid')
        ax3.plot(range(window-1, len(dqn_results['losses'])), dqn_loss_ma,
                 color='blue', linewidth=2.5, label='Standard DQN')

    if len(ddqn_results['losses']) >= window:
        ddqn_loss_ma = np.convolve(ddqn_results['losses'], np.ones(window)/window, mode='valid')
        ax3.plot(range(window-1, len(ddqn_results['losses'])), ddqn_loss_ma,
                 color='red', linewidth=2.5, label='Double DQN')

    ax3.set_xlabel('Episode', fontsize=12)
    ax3.set_ylabel('Loss', fontsize=12)
    ax3.set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
    ax3.legend(fontsize=11)
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')

# Plot 4: Performance Distribution
ax4 = axes[1, 1]
ax4.hist(dqn_results['rewards'], bins=30, alpha=0.5, color='blue',
         label='Standard DQN', edgecolor='black')
ax4.hist(ddqn_results['rewards'], bins=30, alpha=0.5, color='red',
         label='Double DQN', edgecolor='black')
ax4.axvline(x=np.mean(dqn_results['rewards']), color='blue',
            linestyle='--', linewidth=2, alpha=0.7)
ax4.axvline(x=np.mean(ddqn_results['rewards']), color='red',
            linestyle='--', linewidth=2, alpha=0.7)
ax4.set_xlabel('Total Reward', fontsize=12)
ax4.set_ylabel('Frequency', fontsize=12)
ax4.set_title('Reward Distribution Comparison', fontsize=14, fontweight='bold')
ax4.legend(fontsize=11)
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Print detailed comparison statistics
print("\n" + "=" * 70)
print("DETAILED COMPARISON: Standard DQN vs Double DQN")
print("=" * 70)

print("\n📊 REWARD STATISTICS:")
print("-" * 70)
print(f"{'Metric':<30} {'Standard DQN':>18} {'Double DQN':>18}")
print("-" * 70)
print(f"{'Mean Reward':<30} {np.mean(dqn_results['rewards']):>18.2f} "
      f"{np.mean(ddqn_results['rewards']):>18.2f}")
print(f"{'Std Reward':<30} {np.std(dqn_results['rewards']):>18.2f} "
      f"{np.std(ddqn_results['rewards']):>18.2f}")
print(f"{'Max Reward':<30} {np.max(dqn_results['rewards']):>18.2f} "
      f"{np.max(ddqn_results['rewards']):>18.2f}")
print(f"{'Last 50 Episodes Mean':<30} {np.mean(dqn_results['rewards'][-50:]):>18.2f} "
      f"{np.mean(ddqn_results['rewards'][-50:]):>18.2f}")

print("\n📈 Q-VALUE STATISTICS (Overestimation Check):")
print("-" * 70)
if dqn_results['q_values'] and ddqn_results['q_values']:
    print(f"{'Mean Q-Value':<30} {np.mean(dqn_results['q_values']):>18.2f} "
          f"{np.mean(ddqn_results['q_values']):>18.2f}")
    print(f"{'Max Q-Value':<30} {np.max(dqn_results['q_values']):>18.2f} "
          f"{np.max(ddqn_results['q_values']):>18.2f}")
    print(f"{'Final 50 Episodes Mean Q':<30} {np.mean(dqn_results['q_values'][-50:]):>18.2f} "
          f"{np.mean(ddqn_results['q_values'][-50:]):>18.2f}")

print("\n🎯 CONVERGENCE:")
print("-" * 70)

# Find the first episode count at which the 100-episode moving average reaches 195.
# Note: 195 over 100 episodes is the CartPole-v0 criterion; CartPole-v1 registers 475.
dqn_solved = -1
ddqn_solved = -1
threshold = 195
window_size = 100

# The range ends at len(...) + 1 so that the final 100-episode window is checked too
for i in range(window_size, len(dqn_results['rewards']) + 1):
    if np.mean(dqn_results['rewards'][i-window_size:i]) >= threshold:
        dqn_solved = i
        break

for i in range(window_size, len(ddqn_results['rewards']) + 1):
    if np.mean(ddqn_results['rewards'][i-window_size:i]) >= threshold:
        ddqn_solved = i
        break

if dqn_solved > 0:
    print(f"{'Standard DQN solved at':<30} Episode {dqn_solved}")
else:
    print(f"{'Standard DQN':<30} Not solved")

if ddqn_solved > 0:
    print(f"{'Double DQN solved at':<30} Episode {ddqn_solved}")
else:
    print(f"{'Double DQN':<30} Not solved")

print("\n" + "=" * 70)

# Determine winner (based on a single seed, so treat the result as indicative only)
dqn_final = np.mean(dqn_results['rewards'][-50:])
ddqn_final = np.mean(ddqn_results['rewards'][-50:])

if ddqn_final > dqn_final:
    improvement = ((ddqn_final - dqn_final) / dqn_final) * 100
    print("\n🏆 WINNER: Double DQN")
    print(f"   Improvement: {improvement:.1f}% better final performance")
elif dqn_final > ddqn_final:
    improvement = ((dqn_final - ddqn_final) / ddqn_final) * 100
    print("\n🏆 WINNER: Standard DQN")
    print(f"   Improvement: {improvement:.1f}% better final performance")
else:
    print("\n🤝 TIE: Both algorithms performed similarly")

print("=" * 70)
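The convergence check above runs the same trailing-window scan once per agent. As a minimal sketch, the scan can be factored into one helper that keeps a running sum instead of recomputing each window mean. The name `first_solved_episode` and its default arguments are assumptions made for illustration, not part of the notebook.

```python
def first_solved_episode(rewards, threshold=195.0, window=100):
    """Return the number of episodes completed when the mean of the last
    `window` rewards first reaches `threshold`, or None if it never does."""
    if window <= 0:
        raise ValueError("window must be positive")
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            # Drop the reward that just left the trailing window
            running_sum -= rewards[i - window]
        # Only test once a full window of episodes is available
        if i + 1 >= window and running_sum / window >= threshold:
            return i + 1
    return None
```

Calling `first_solved_episode(dqn_results['rewards'])` and `first_solved_episode(ddqn_results['rewards'])` would then give one value per agent, with `None` meaning the threshold was never reached.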

#### Analysis: Why Double DQN Works Better

**Key Observations from the Comparison:**

1. **Q-Value Estimates**:
   - Standard DQN typically shows higher Q-values than Double DQN under identical settings
   - Higher values alone do not prove overestimation; compare them with the achievable return. With `gamma=0.99` and a reward of 1 per step, the discounted return in CartPole cannot exceed $1/(1-\gamma) = 100$, so average max Q-values well above 100 indicate overestimation
   - Double DQN usually produces more conservative estimates, which is consistent with a reduced overestimation bias

2. **Learning Stability**:
   - Double DQN often shows smoother learning curves
   - Less variance in performance across episodes
   - More consistent convergence to good policies

3. **Final Performance**:
   - Double DQN frequently achieves better or equal final performance
   - The improvement is more pronounced in complex environments
   - CartPole is relatively simple, so differences may be subtle, and a single seed is not enough to establish them

4. **Computational Cost**:
   - Double DQN has very little additional computational cost: one extra forward pass of the online network on the next states per update
   - Same network architecture and nearly the same training time
   - Only the target computation in the update rule changes

**When Does Double DQN Help Most?**

Double DQN provides the biggest benefits when:

- The environment has stochastic rewards or transitions
- The action space is large
- Q-value estimation is noisy
- Returns depend on long horizons, so estimation errors compound through bootstrapping

**Practical Recommendations:**

✅ **Use Double DQN** as your default choice: it costs almost nothing extra and rarely hurts, although it reduces the overestimation bias rather than eliminating it

✅ **Combine with other improvements** like:
   - Prioritized Experience Replay
   - Dueling Networks
   - Noisy Networks for exploration

✅ **Monitor Q-values** during training to detect overestimation issues

**Mathematical Insight:**

The key insight is that by decoupling action selection from evaluation, we reduce the positive bias:

$\mathbb{E}[\max_a Q(s,a)] \geq \max_a \mathbb{E}[Q(s,a)]$

This inequality (Jensen's inequality applied to the convex max function) shows that taking the max of noisy estimates gives an upward-biased result. Double DQN mitigates this by selecting the greedy action with the online network and evaluating it with the target network, so the same noisy estimate is not used for both steps. The two networks are correlated rather than fully independent, which is why the bias is reduced but not removed.

**Next Steps:**

Double DQN is a foundational improvement to DQN. Modern deep RL often combines it with other techniques like:

- **Dueling DQN**: Separate value and advantage streams
- **Prioritized Replay**: Sample important transitions more frequently
- **Rainbow DQN**: Combines multiple improvements into one algorithm

In the next sections, we'll explore policy gradient methods, which take a fundamentally different approach to RL!
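The inequality in the mathematical insight above can be checked numerically. The sketch below is an idealized illustration, not the notebook's agent code: every action has a true value of 0, and two independent Gaussian noise vectors stand in for two Q-estimates. The function name `max_bias_demo` and all of its parameters are hypothetical. In Double DQN the online and target networks are correlated, so the real bias reduction is smaller than in this independent-estimator setting.

```python
import random
import statistics


def max_bias_demo(num_actions=10, noise_std=1.0, trials=20000, seed=0):
    """Compare single- and double-estimator value estimates when every
    action has true value 0, so the true max_a Q(s, a) is 0."""
    rng = random.Random(seed)
    single, double = [], []
    for _ in range(trials):
        # Two independent noisy estimates of the same all-zero Q-values
        q_a = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # Single estimator: select and evaluate with the same estimate
        single.append(max(q_a))
        # Double estimator: select with q_a, evaluate with q_b
        best_action = max(range(num_actions), key=q_a.__getitem__)
        double.append(q_b[best_action])
    return statistics.mean(single), statistics.mean(double)


single_bias, double_bias = max_bias_demo()
print(f"single-estimator mean: {single_bias:.3f}")
print(f"double-estimator mean: {double_bias:.3f}")
```

Because the true maximum is 0, both printed means are direct measurements of bias. The single-estimator mean should come out clearly positive (roughly 1.5 for ten unit-variance estimates), while the double-estimator mean should sit near zero. Increasing `num_actions` or `noise_std` widens the gap, which matches the conditions listed earlier for when Double DQN helps most.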

<a id='section3'></a>
## Section 3: Advanced Topics

*Content will be added in subsequent tasks*

<a id='section4'></a>
## Section 4: Code Implementations

*Content will be added in subsequent tasks*

<a id='section5'></a>
## Section 5: Real-World Applications

*Content will be added in subsequent tasks*

<a id='section6'></a>
## Section 6: Advanced Research & Deployment

*Content will be added in subsequent tasks*

<a id='conclusion'></a>
## Conclusion and Next Steps

*Content will be added in subsequent tasks*