<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 6b: SARSA vs Q-Learning in Windy Gridworld
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton & Barto Chapter 6 | Intermediate Level | 75 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        The Windy Gridworld problem, introduced in 
        <a href="http://incompleteideas.net/book/the-book-2nd.html" style="color: #17a2b8;">Sutton & Barto Example 6.5</a>,
        is a convenient testbed for comparing on-policy (SARSA) and off-policy (Q-learning) TD control methods.
        In this environment, a crosswind pushes the agent upward in some columns, so the next state depends on both the chosen action and the wind strength of the current column. The wind is deterministic: every transition defined in this lab has probability 1.
        We'll implement both algorithms and compare their learning curves, exploring how
        the on-policy vs off-policy distinction affects convergence and performance.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement SARSA (on-policy TD control)</li>
        <li>Implement Q-learning (off-policy TD control)</li>
        <li>Understand the impact of wind on state transitions</li>
        <li>Compare on-policy vs off-policy learning</li>
        <li>Analyze convergence rates and final policies</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Key Concepts</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">SARSA</code> → On-policy TD control</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Q-learning</code> → Off-policy TD control</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">ε-greedy</code> → Action selection strategy</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Wind strength</code> → [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Grid size</code> → 7×10 gridworld</div>
    </div>
</td>
</tr>
</table>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; margin-top: 20px; border-left: 3px solid #17a2b8;">
    <h2 style="color: #17a2b8; font-size: 16px; margin: 0 0 8px 0; font-weight: 600;">Section 1: Environment Setup</h2>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        We begin by setting up the Windy Gridworld environment, a classic benchmark problem for comparing TD control algorithms.
    </p>
</div>

### The Windy Gridworld Problem

![Windy Gridworld](https://www.researchgate.net/profile/Markus-Dumke/publication/320890681/figure/fig1/AS:763210537922560@1558974980641/The-windy-gridworld-task-The-goal-is-to-move-from-the-start-state-S-to-the-goal-state-G.jpg)

**Environment characteristics:**
- **Grid**: 7 rows × 10 columns, indexed as (row, column) from the top-left corner (0, 0)
- **Start**: Position (3, 0) marked as 'S'
- **Goal**: Position (3, 7) marked as 'G'
- **Wind**: Upward push of 1 or 2 cells in columns 3 to 8 (strength shown at bottom)
- **Actions**: 4 standard moves (up, down, left, right)
- **Reward**: -1 per step until goal reached
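
To make the wind rule concrete, here is a minimal sketch of a single transition. It assumes that the wind of the column the agent is leaving is applied, which is how the environment class in this lab computes transitions. The names `windy_step`, `WIND`, and `MOVES` are illustrative and not part of gym.

```python
# Illustrative helper (hypothetical names), independent of gym
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push per column
ROWS, COLS = 7, 10
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}

def windy_step(position, action):
    """Return the next (row, col) after applying the action and the wind."""
    row, col = position
    d_row, d_col = MOVES[action]
    # Wind of the current column pushes upward, i.e. toward smaller row indices
    new_row = row + d_row - WIND[col]
    new_col = col + d_col
    # Clip to the grid so the agent cannot leave it
    new_row = min(max(new_row, 0), ROWS - 1)
    new_col = min(max(new_col, 0), COLS - 1)
    return (new_row, new_col)

print(windy_step((3, 0), "right"))  # no wind in column 0 -> (3, 1)
print(windy_step((3, 6), "right"))  # wind 2 in column 6 -> (1, 7)
print(windy_step((3, 3), "down"))   # wind 1 cancels the move down -> (3, 3)
```

Note that moving right from (3, 6) overshoots the goal row: the agent lands in column 7 but two rows too high, which is why a straight line to the goal does not work.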

In [None]:
"""
Cell 1: Install Dependencies and Import Libraries
Purpose: Set up the environment with required packages and load utilities
"""

# Install specific gym version for compatibility
!pip install gym==0.20 -q

import gym
import numpy as np
import sys
from gym.envs.toy_text import discrete
import matplotlib.pyplot as plt
import itertools
from collections import namedtuple, defaultdict
import pandas as pd
from IPython.display import display, HTML, clear_output
import warnings
warnings.filterwarnings('ignore')

# Load pretty print utility
try:
    import requests
    url = 'https://raw.githubusercontent.com/mdehghani86/RL_labs/master/utility/rl_utility.py'
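    response = requests.get(url, timeout=10)  # do not hang if the host is unreachable
    response.raise_for_status()  # on HTTP errors, fall through to the local fallback
    exec(response.text)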
    pretty_print("Environment Ready", 
                 "Successfully loaded dependencies and pretty_print utility<br>" +
                 "Gym version 0.20 installed for WindyGridworld compatibility", 
                 style='success')
except Exception as e:
    def pretty_print(title, content, style='info'):
        """Fallback pretty print function"""
        themes = {
            'info': {'primary': '#17a2b8', 'secondary': '#0e5a63', 'background': '#f8f9fa'},
            'success': {'primary': '#28a745', 'secondary': '#155724', 'background': '#f8fff9'},
            'warning': {'primary': '#ffc107', 'secondary': '#e0a800', 'background': '#fffdf5'},
            'result': {'primary': '#6f42c1', 'secondary': '#4e2c8e', 'background': '#faf5ff'},
            'note': {'primary': '#20c997', 'secondary': '#0d7a5f', 'background': '#f0fdf9'}
        }
        theme = themes.get(style, themes['info'])
        html = f'''
        <div style="border-radius: 5px; margin: 10px 0; width: 20cm; max-width: 20cm; box-shadow: 0 2px 4px rgba(0,0,0,0.1);">
            <div style="background: linear-gradient(90deg, {theme['primary']} 0%, {theme['secondary']} 100%); padding: 10px 15px; border-radius: 5px 5px 0 0;">
                <strong style="color: white; font-size: 14px;">{title}</strong>
            </div>
            <div style="background: {theme['background']}; padding: 10px 15px; border-radius: 0 0 5px 5px; border-left: 3px solid {theme['primary']};">
                <div style="color: rgba(0,0,0,0.8); font-size: 12px; line-height: 1.5;">{content}</div>
            </div>
        </div>
        '''
        display(HTML(html))
    
    pretty_print("Fallback Mode", 
                 "Using local pretty_print definition<br>" +
                 "Dependencies loaded successfully", 
                 style='warning')

In [None]:
"""
Cell 2: Define Windy Gridworld Environment
Purpose: Create custom gym environment with wind dynamics
"""

# Define action constants for readability
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class WindyGridworldEnv(discrete.DiscreteEnv):
    """
    Windy Gridworld Environment
    
    7x10 grid with wind pushing agent upward in certain columns
    Start: (3, 0), Goal: (3, 7)
    Wind strength by column: [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
    """
    
    metadata = {'render.modes': ['human', 'ansi']}
    
    def __init__(self):
        self.shape = (7, 10)  # Grid dimensions
        nS = np.prod(self.shape)  # Total number of states
        nA = 4  # Number of actions
        
        # Define wind strength for each column
        # Column:     0  1  2  3  4  5  6  7  8  9
        # Wind:       0  0  0  1  1  1  2  2  1  0
        winds = np.zeros(self.shape)
        winds[:, [3, 4, 5, 8]] = 1  # Wind strength 1
        winds[:, [6, 7]] = 2  # Wind strength 2
        
        # Calculate transition probabilities
        P = {}
        for s in range(nS):
            position = np.unravel_index(s, self.shape)
            P[s] = {a: [] for a in range(nA)}
            
            # Define transitions for each action
            P[s][UP] = self._calculate_transition_prob(position, [-1, 0], winds)
            P[s][RIGHT] = self._calculate_transition_prob(position, [0, 1], winds)
            P[s][DOWN] = self._calculate_transition_prob(position, [1, 0], winds)
            P[s][LEFT] = self._calculate_transition_prob(position, [0, -1], winds)
        
        # Initial state distribution (always start at (3, 0))
        isd = np.zeros(nS)
        isd[np.ravel_multi_index((3, 0), self.shape)] = 1.0
        
        super(WindyGridworldEnv, self).__init__(nS, nA, P, isd)
    
    def _calculate_transition_prob(self, current, delta, winds):
        """
        Calculate next state considering action and wind effect
        
        Wind pushes agent upward (negative row direction)
        """
        # Apply action and wind effect
        new_position = np.array(current) + np.array(delta) + np.array([-1, 0]) * winds[tuple(current)]
        new_position = self._limit_coordinates(new_position).astype(int)
        new_state = np.ravel_multi_index(tuple(new_position), self.shape)
        
        # Check if goal reached
        is_done = tuple(new_position) == (3, 7)
        
        # Return transition: (probability, next_state, reward, done)
        return [(1.0, new_state, -1.0, is_done)]
    
    def _limit_coordinates(self, coord):
        """
        Keep agent within grid boundaries
        """
        coord[0] = min(coord[0], self.shape[0] - 1)
        coord[0] = max(coord[0], 0)
        coord[1] = min(coord[1], self.shape[1] - 1)
        coord[1] = max(coord[1], 0)
        return coord
    
    def render(self, mode='human', close=False):
        """
        Render the current state of the environment
        x = current position, T = goal, o = empty cell
        """
        if close:
            return
        
        outfile = sys.stdout
        
        for s in range(self.nS):
            position = np.unravel_index(s, self.shape)
            
            if self.s == s:
                output = " x "  # Current position
            elif position == (3, 7):
                output = " T "  # Goal
            else:
                output = " o "  # Empty cell
            
            if position[1] == 0:
                output = output.lstrip()
            if position[1] == self.shape[1] - 1:
                output = output.rstrip()
                output += "\n"
            
            outfile.write(output)
        outfile.write("\n")

# Create environment instance
env = WindyGridworldEnv()

pretty_print("Windy Gridworld Created",
             "Environment specifications:<br>" +
             "• Grid size: 7×10<br>" +
             "• Start position: (3, 0)<br>" +
             "• Goal position: (3, 7)<br>" +
             "• Wind columns: [3-5]=1, [6-7]=2, [8]=1<br>" +
             "• Reward: -1 per step",
             style='success')

## Section 2: Policy Implementation

### Epsilon-Greedy Policy

Both SARSA and Q-learning use an ε-greedy policy for action selection:
- With probability ε: explore (pick an action uniformly at random, which may also be the greedy one)
- With probability 1-ε: exploit (pick the action with the highest Q-value)
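
A quick numeric check of the resulting action probabilities may help. The sketch below uses a hypothetical helper `epsilon_greedy_probs` and made-up Q-values for one state with four actions. Because the random draw can also land on the greedy action, that action ends up with probability 1-ε+ε/4 rather than 1-ε.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an ε-greedy policy for one state (illustrative helper)."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # exploration share for every action
    probs[int(np.argmax(q_values))] += 1.0 - epsilon  # remaining mass goes to the greedy action
    return probs

# Made-up Q-values; action 1 is the greedy one
probs = epsilon_greedy_probs(np.array([-3.0, -1.0, -2.0, -4.0]), epsilon=0.1)
print(probs)        # greedy action gets 0.925, the others 0.025 each
print(probs.sum())  # sums to 1 up to floating-point rounding
```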

In [None]:
"""
Cell 3: Define Epsilon-Greedy Policy
Purpose: Implement action selection strategy for exploration-exploitation balance
"""

def epsilon_greedy_policy(Q, state, nA, epsilon):
    """
    Create epsilon-greedy policy based on Q-values
    
    Args:
        Q: Action-value function dictionary
        state: Current state
        nA: Number of actions
        epsilon: Exploration probability
    
    Returns:
        probs: Action probabilities array
    """
    # Initialize with epsilon/nA probability for each action (exploration)
    probs = np.ones(nA) * epsilon / nA
    
    # Find best action based on Q-values
    best_action = np.argmax(Q[state])
    
    # Add remaining probability mass to best action (exploitation)
    probs[best_action] += 1.0 - epsilon
    
    return probs

pretty_print("Policy Function Ready",
             "Epsilon-greedy policy implemented<br>" +
             "Balances exploration and exploitation based on ε parameter",
             style='info')

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; margin-top: 20px; border-left: 3px solid #00acc1;">
    <h2 style="color: #00acc1; font-size: 16px; margin: 0 0 8px 0; font-weight: 600;">Section 3: SARSA Implementation</h2>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implementing SARSA (State-Action-Reward-State-Action), an on-policy TD control algorithm that learns the value of the policy being followed.
    </p>
</div>

### SARSA: On-Policy TD Control

SARSA update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

Key characteristic: **On-policy** - learns about the policy being followed (ε-greedy)
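
As a worked instance of the update rule, the sketch below applies one SARSA step to made-up numbers. The function name `sarsa_update` and the values are illustrative assumptions; only the formula comes from the rule above.

```python
def sarsa_update(q_sa, reward, q_next_sa, alpha=0.5, gamma=1.0):
    """One SARSA update for a single (S, A) pair; returns the new Q(S, A)."""
    td_target = reward + gamma * q_next_sa   # R + γ Q(S', A')
    td_error = td_target - q_sa              # how far the current estimate is off
    return q_sa + alpha * td_error

# Q(S, A) = 0, reward = -1, and the sampled next action has Q(S', A') = -2
new_q = sarsa_update(q_sa=0.0, reward=-1.0, q_next_sa=-2.0)
print(new_q)  # 0 + 0.5 * (-1 + 1.0 * (-2) - 0) = -1.5
```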

In [None]:
"""
Cell 4: Implement SARSA Algorithm
Purpose: On-policy TD control for optimal epsilon-greedy policy
"""

# Define statistics tracking
EpisodeStats = namedtuple("Stats", ["episode_lengths", "episode_rewards"])

def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """
    SARSA algorithm: On-policy TD control
    
    Args:
        env: OpenAI gym environment
        num_episodes: Number of episodes to run
        discount_factor: Gamma discount factor
        alpha: TD learning rate
        epsilon: Exploration probability
    
    Returns:
        Q: Learned action-value function
        stats: Episode statistics
    """
    # Initialize Q-table with zeros
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # Track statistics
    stats = EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes)
    )
    
    pretty_print("Starting SARSA Training",
                 f"Episodes: {num_episodes}<br>" +
                 f"α={alpha}, γ={discount_factor}, ε={epsilon}",
                 style='info')
    
    for i_episode in range(num_episodes):
        # Progress indicator
        if (i_episode + 1) % 100 == 0:
            print(f"\rEpisode {i_episode + 1}/{num_episodes}", end="")
            sys.stdout.flush()
        
        # Initialize S
        state = env.reset()
        
        # Choose A from S using policy derived from Q (ε-greedy)
        action_probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
        action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
        
        # Episode loop
        for t in itertools.count():
            # Take action A, observe R, S'
            next_state, reward, done, _ = env.step(action)
            
            # Choose A' from S' using policy derived from Q (ε-greedy)
            next_action_probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
            next_action = np.random.choice(np.arange(len(next_action_probs)), p=next_action_probs)
            
            # Update statistics (t is zero-based, so steps taken so far = t + 1)
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t + 1
            
            # SARSA update: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
            td_target = reward + discount_factor * Q[next_state][next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error
            
            if done:
                break
            
            # S ← S', A ← A'
            state = next_state
            action = next_action
    
    print("\n")
    return Q, stats

pretty_print("SARSA Implementation Complete",
             "On-policy TD control algorithm ready<br>" +
             "Updates use actual next action A' from ε-greedy policy",
             style='success')

## Section 4: Q-Learning Implementation

### Q-Learning: Off-Policy TD Control

Q-learning update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

Key characteristic: **Off-policy** - learns the value of the greedy (optimal) policy while behaving ε-greedily
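
The two targets differ only when the action chosen in the next state is not the greedy one. The following sketch, with made-up Q-values and an assumed exploratory next action, puts both targets side by side.

```python
import numpy as np

alpha, gamma = 0.5, 1.0
reward = -1.0
q_sa = 0.0                                     # current estimate Q(S, A)
q_next = np.array([-2.0, -1.0, -4.0, -3.0])    # made-up Q(S', a) for the four actions
sampled_next_action = 0                        # assume ε-greedy happened to explore

sarsa_target = reward + gamma * q_next[sampled_next_action]   # uses Q(S', A')
q_learning_target = reward + gamma * q_next.max()             # uses max_a Q(S', a)

sarsa_new_q = q_sa + alpha * (sarsa_target - q_sa)
q_learning_new_q = q_sa + alpha * (q_learning_target - q_sa)

print(sarsa_target, q_learning_target)   # -3.0 -2.0
print(sarsa_new_q, q_learning_new_q)     # -1.5 -1.0
```

If the sampled action had been the greedy one (index 1), both targets would have been identical.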

In [None]:
"""
Cell 5: Implement Q-Learning Algorithm
Purpose: Off-policy TD control for learning optimal policy
"""

def q_learning(env, num_episodes, discount_factor=0.9, alpha=0.5, epsilon=0.05):
    """
    Q-Learning algorithm: Off-policy TD control
    
    Args:
        env: OpenAI gym environment
        num_episodes: Number of episodes to run
        discount_factor: Gamma discount factor
        alpha: Learning rate
        epsilon: Exploration probability
    
    Returns:
        Q: Learned action-value function
        stats: Episode statistics
    """
    # Initialize Q-table with zeros
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    
    # Track statistics
    stats = EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes)
    )
    
    pretty_print("Starting Q-Learning Training",
                 f"Episodes: {num_episodes}<br>" +
                 f"α={alpha}, γ={discount_factor}, ε={epsilon}",
                 style='info')
    
    for i_episode in range(num_episodes):
        # Progress indicator
        if (i_episode + 1) % 100 == 0:
            print(f"\rEpisode {i_episode + 1}/{num_episodes}", end="")
            sys.stdout.flush()
        
        # Initialize S
        state = env.reset()
        
        # Episode loop
        for t in range(10000):  # Max steps per episode
            # Choose A from S using policy derived from Q (ε-greedy)
            action_probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            
            # Take action A, observe R, S'
            next_state, reward, done, _ = env.step(action)
            
            # Q-learning update: Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S',a) - Q(S,A)]
            # Key difference from SARSA: uses max Q(S',a) instead of Q(S',A')
            td_target = reward + discount_factor * np.max(Q[next_state])
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error
            
            # Update statistics (t is zero-based, so steps taken so far = t + 1)
            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t + 1
            
            if done:
                break
            
            # S ← S'
            state = next_state
    
    print("\n")
    return Q, stats

pretty_print("Q-Learning Implementation Complete",
             "Off-policy TD control algorithm ready<br>" +
             "Updates use maximum Q-value for next state (optimal action)",
             style='success')

## Section 5: Running Experiments

Now we'll train both algorithms and compare their performance.

In [None]:
"""
Cell 6: Define Experiment Parameters
Purpose: Set hyperparameters for both algorithms
"""

# Experiment parameters
NUM_EPISODES = 300

# SARSA parameters
SARSA_PARAMS = {
    'discount_factor': 1.0,
    'alpha': 0.5,
    'epsilon': 0.1
}

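# Q-Learning parameters
# Note: γ and ε differ from the SARSA settings above, so differences in the
# results reflect the hyperparameters as well as the update rule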
QLEARNING_PARAMS = {
    'discount_factor': 0.9,
    'alpha': 0.5,
    'epsilon': 0.05
}

pretty_print("Experiment Parameters Set",
             f"<strong>Common:</strong> {NUM_EPISODES} episodes<br><br>" +
             f"<strong>SARSA:</strong><br>" +
             f"• α={SARSA_PARAMS['alpha']}, γ={SARSA_PARAMS['discount_factor']}, ε={SARSA_PARAMS['epsilon']}<br><br>" +
             f"<strong>Q-Learning:</strong><br>" +
             f"• α={QLEARNING_PARAMS['alpha']}, γ={QLEARNING_PARAMS['discount_factor']}, ε={QLEARNING_PARAMS['epsilon']}",
             style='info')

In [None]:
"""
Cell 7: Train Both Algorithms
Purpose: Run SARSA and Q-learning on the Windy Gridworld
"""

# Train SARSA
pretty_print("Training SARSA", "Running on-policy TD control...", style='info')
Q_sarsa, stats_sarsa = sarsa(env, NUM_EPISODES, **SARSA_PARAMS)

# Train Q-Learning
pretty_print("Training Q-Learning", "Running off-policy TD control...", style='info')
Q_qlearning, stats_qlearning = q_learning(env, NUM_EPISODES, **QLEARNING_PARAMS)

pretty_print("Training Complete",
             f"<strong>SARSA Results:</strong><br>" +
             f"• Final episode reward: {stats_sarsa.episode_rewards[-1]:.0f}<br>" +
             f"• Final episode length: {stats_sarsa.episode_lengths[-1]:.0f}<br><br>" +
             f"<strong>Q-Learning Results:</strong><br>" +
             f"• Final episode reward: {stats_qlearning.episode_rewards[-1]:.0f}<br>" +
             f"• Final episode length: {stats_qlearning.episode_lengths[-1]:.0f}",
             style='result')

## Section 6: Visualization and Analysis

Let's visualize the learning curves and compare the performance of both algorithms.
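
The smoothed curves in the plots are rolling means. As a small illustration with made-up episode lengths, a window of 2 with `min_periods=1` averages each value with its predecessor and leaves the first value unchanged:

```python
import pandas as pd

episode_lengths = pd.Series([10, 20, 30, 40])  # made-up values for illustration
smoothed = episode_lengths.rolling(2, min_periods=1).mean()
print(smoothed.tolist())  # [10.0, 15.0, 25.0, 35.0]
```

Setting `min_periods=1` avoids NaN values at the start of the curve, at the cost of averaging over fewer episodes there.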

In [None]:
"""
Cell 8: Create Comprehensive Comparison Plots
Purpose: Visualize and compare SARSA vs Q-learning performance
"""

def plot_algorithm_comparison(stats_sarsa, stats_qlearning, smoothing_window=10):
    """
    Create comparison plots for SARSA and Q-learning
    
    Args:
        stats_sarsa: SARSA episode statistics
        stats_qlearning: Q-learning episode statistics
        smoothing_window: Window size for moving average
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Episode Length Comparison
    axes[0, 0].plot(stats_sarsa.episode_lengths, alpha=0.3, color='blue', label='SARSA (raw)')
    axes[0, 0].plot(pd.Series(stats_sarsa.episode_lengths).rolling(smoothing_window, min_periods=1).mean(),
                   color='blue', linewidth=2, label='SARSA (smoothed)')
    axes[0, 0].plot(stats_qlearning.episode_lengths, alpha=0.3, color='red', label='Q-Learning (raw)')
    axes[0, 0].plot(pd.Series(stats_qlearning.episode_lengths).rolling(smoothing_window, min_periods=1).mean(),
                   color='red', linewidth=2, label='Q-Learning (smoothed)')
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Episode Length')
    axes[0, 0].set_title('Episode Length Over Time', fontweight='bold')
    axes[0, 0].legend(loc='upper right')
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Episode Reward Comparison
    rewards_smoothed_sarsa = pd.Series(stats_sarsa.episode_rewards).rolling(smoothing_window, min_periods=1).mean()
    rewards_smoothed_qlearning = pd.Series(stats_qlearning.episode_rewards).rolling(smoothing_window, min_periods=1).mean()
    
    axes[0, 1].plot(stats_sarsa.episode_rewards, alpha=0.3, color='blue', label='SARSA (raw)')
    axes[0, 1].plot(rewards_smoothed_sarsa, color='blue', linewidth=2, label='SARSA (smoothed)')
    axes[0, 1].plot(stats_qlearning.episode_rewards, alpha=0.3, color='red', label='Q-Learning (raw)')
    axes[0, 1].plot(rewards_smoothed_qlearning, color='red', linewidth=2, label='Q-Learning (smoothed)')
    axes[0, 1].set_xlabel('Episode')
    axes[0, 1].set_ylabel('Episode Reward')
    axes[0, 1].set_title(f'Episode Reward (Smoothed over {smoothing_window} episodes)', fontweight='bold')
    axes[0, 1].legend(loc='lower right')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Cumulative Steps Comparison
    axes[1, 0].plot(np.cumsum(stats_sarsa.episode_lengths), np.arange(len(stats_sarsa.episode_lengths)),
                   color='blue', linewidth=2, label='SARSA')
    axes[1, 0].plot(np.cumsum(stats_qlearning.episode_lengths), np.arange(len(stats_qlearning.episode_lengths)),
                   color='red', linewidth=2, label='Q-Learning')
    axes[1, 0].set_xlabel('Time Steps')
    axes[1, 0].set_ylabel('Episodes')
    axes[1, 0].set_title('Learning Speed Comparison', fontweight='bold')
    axes[1, 0].legend(loc='lower right')
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Performance Summary
    axes[1, 1].axis('off')
    
    # Calculate statistics
    sarsa_final_avg = np.mean(stats_sarsa.episode_lengths[-20:])
    qlearning_final_avg = np.mean(stats_qlearning.episode_lengths[-20:])
    sarsa_best = np.min(stats_sarsa.episode_lengths)
    qlearning_best = np.min(stats_qlearning.episode_lengths)
    
    summary_text = f"""
    Algorithm Comparison Summary:
    
    SARSA (On-Policy):
    • Average steps (last 20 episodes): {sarsa_final_avg:.1f}
    • Best episode: {sarsa_best:.0f} steps
    • Update target: Q(S', A') of the sampled action
    • Learns: value of the ε-greedy policy it follows
    
    Q-Learning (Off-Policy):
    • Average steps (last 20 episodes): {qlearning_final_avg:.1f}
    • Best episode: {qlearning_best:.0f} steps
    • Update target: max over a of Q(S', a)
    • Learns: value of the greedy policy
    
    Notes:
    • The runs use different γ and ε, so compare with care
    • Wind is deterministic; no cell carries an extra penalty
    • Both should approach the shortest path to the goal
    """
    
    axes[1, 1].text(0.1, 0.5, summary_text, fontsize=11, verticalalignment='center',
                   family='monospace', bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.5))
    
    plt.suptitle('SARSA vs Q-Learning on Windy Gridworld', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Generate comparison plots
plot_algorithm_comparison(stats_sarsa, stats_qlearning, smoothing_window=10)

pretty_print("Analysis Complete",
             "What to look for in the plots:<br>" +
             "• How quickly each algorithm's episode length drops<br>" +
             "• Any remaining gap between the curves, which also reflects the different ε and γ<br>" +
             "• Whether both algorithms reach the goal reliably by the final episodes",
             style='result')

In [None]:
"""
Cell 9: Test with Different Learning Rates
Purpose: Compare algorithm sensitivity to learning rate
"""

# Test with very small alpha
ALPHA_SMALL = 0.01
EPISODES_EXTENDED = 1000

pretty_print("Extended Experiment",
             f"Testing with α={ALPHA_SMALL} for {EPISODES_EXTENDED} episodes<br>" +
             "This demonstrates the effect of learning rate on convergence",
             style='info')

# Train with small alpha
Q_sarsa_small, stats_sarsa_small = sarsa(
    env, EPISODES_EXTENDED, 
    discount_factor=SARSA_PARAMS['discount_factor'],
    alpha=ALPHA_SMALL,
    epsilon=SARSA_PARAMS['epsilon']
)

Q_qlearning_small, stats_qlearning_small = q_learning(
    env, EPISODES_EXTENDED,
    discount_factor=QLEARNING_PARAMS['discount_factor'],
    alpha=ALPHA_SMALL,
    epsilon=QLEARNING_PARAMS['epsilon']
)

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(pd.Series(stats_sarsa.episode_lengths).rolling(20, min_periods=1).mean(),
         label=f'SARSA (α={SARSA_PARAMS["alpha"]})', color='blue', linewidth=2)
plt.plot(pd.Series(stats_sarsa_small.episode_lengths).rolling(20, min_periods=1).mean(),
         label=f'SARSA (α={ALPHA_SMALL})', color='lightblue', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Episode Length (smoothed)')
plt.title('SARSA: Effect of Learning Rate', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(pd.Series(stats_qlearning.episode_lengths).rolling(20, min_periods=1).mean(),
         label=f'Q-Learning (α={QLEARNING_PARAMS["alpha"]})', color='red', linewidth=2)
plt.plot(pd.Series(stats_qlearning_small.episode_lengths).rolling(20, min_periods=1).mean(),
         label=f'Q-Learning (α={ALPHA_SMALL})', color='salmon', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Episode Length (smoothed)')
plt.title('Q-Learning: Effect of Learning Rate', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.suptitle('Learning Rate Impact on Convergence', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

pretty_print("Learning Rate Analysis",
             "<strong>Observations:</strong><br>" +
             f"• Small α ({ALPHA_SMALL}) requires more episodes to converge<br>" +
             f"• Large α ({SARSA_PARAMS['alpha']}) converges faster but may be less stable<br>" +
             "• Compare the two panels to judge which algorithm is more sensitive to α in this run",
             style='result')

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Key Findings</h3>
    <div style="color: #555; line-height: 1.6; font-size: 13px;">
        <p><strong>1. On-Policy vs Off-Policy:</strong> SARSA learns the value of the policy it follows (ε-greedy), while Q-learning learns the value of the greedy policy regardless of the exploratory actions it takes.</p>
        <p><strong>2. Convergence Speed:</strong> In these runs Q-learning uses a lower ε (0.05 vs 0.1) and a lower γ (0.9 vs 1.0) than SARSA, so shorter Q-learning episodes cannot be attributed to the update rule alone. A lower ε by itself shortens episodes because fewer exploratory steps are taken.</p>
        <p><strong>3. Safety Near Edges:</strong> SARSA's preference for safer paths shows up when an exploratory slip is costly, as in Cliff Walking (Sutton & Barto Example 6.6). Windy Gridworld has no penalized cells, so here the on-policy vs off-policy difference appears mainly in the value estimates rather than in the route.</p>
        <p><strong>4. Wind Effect:</strong> Both algorithms learn to compensate for the wind. With identical hyperparameters, their greedy paths are expected to be similar.</p>
        <p><strong>5. Learning Rate Sensitivity:</strong> Both algorithms are sensitive to α, with smaller values providing more stable but slower convergence.</p>
    </div>
</div>
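
Because the two runs in this lab use different γ and ε, one way to isolate the effect of the update rule is to train both algorithms with identical settings and compare their greedy paths. The following is a minimal sketch under stated assumptions: it reimplements the deterministic wind dynamics in plain NumPy instead of using the gym environment, and the names `step`, `choose`, `train`, and `greedy_path_length`, as well as the shared hyperparameters and seed, are illustrative choices rather than the lab's settings.

```python
import numpy as np

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left

def step(pos, action):
    """Deterministic windy transition, clipped to the grid."""
    row = min(max(pos[0] + MOVES[action][0] - WIND[pos[1]], 0), ROWS - 1)
    col = min(max(pos[1] + MOVES[action][1], 0), COLS - 1)
    return (row, col)

def choose(Q, pos, epsilon, rng):
    """ε-greedy action selection for one state."""
    if rng.random() < epsilon:
        return int(rng.integers(4))
    return int(np.argmax(Q[pos]))

def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular SARSA or Q-learning; only the bootstrap term differs."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, 4))
    for _ in range(episodes):
        pos = START
        action = choose(Q, pos, epsilon, rng)
        while pos != GOAL:
            nxt = step(pos, action)
            if method == "sarsa":
                nxt_action = choose(Q, nxt, epsilon, rng)
                bootstrap = Q[nxt][nxt_action]   # value of the action actually taken next
            else:
                bootstrap = Q[nxt].max()         # value of the greedy action
            Q[pos][action] += alpha * (-1.0 + gamma * bootstrap - Q[pos][action])
            if method != "sarsa":
                nxt_action = choose(Q, nxt, epsilon, rng)  # pick the next action after the update
            pos, action = nxt, nxt_action
    return Q

def greedy_path_length(Q, max_steps=200):
    """Steps the greedy policy needs from START to GOAL, or None if it does not arrive."""
    pos, steps = START, 0
    while pos != GOAL and steps < max_steps:
        pos = step(pos, int(np.argmax(Q[pos])))
        steps += 1
    return steps if pos == GOAL else None

results = {m: greedy_path_length(train(m)) for m in ("sarsa", "q_learning")}
print(results)
```

Evaluating the greedy policy removes exploration noise from the comparison. Sutton & Barto report 15 steps as the minimum for this task, so any reported path length should be at least 15.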

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>If Q-learning's episodes are shorter than SARSA's in your run, how much of the gap comes from the update rule and how much from the lower ε?</li>
        <li>How would changing the wind pattern affect the relative performance of the two algorithms?</li>
        <li>What would happen if we used a decaying epsilon schedule instead of a fixed epsilon?</li>
        <li>In what scenarios would SARSA's conservative behavior be preferable to Q-learning's optimality?</li>
        <li>How would adding stochastic wind (varying strength) affect the comparison?</li>
        <li>Could we combine both algorithms to get benefits of both on-policy safety and off-policy optimality?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 6b: SARSA vs Q-Learning in Windy Gridworld</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 7 - Function Approximation</p>
</div>