<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 8-2: Dyna-Q with Prioritized Sweeping
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton and Barto Chapter 8.4 | 60 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        This lab implements <strong>Prioritized Sweeping</strong>, an enhancement to Dyna-Q that focuses computational
        effort on the state-action pairs whose values are expected to change the most. Instead of sampling previously
        visited state-action pairs uniformly at random, as Dyna-Q does during planning, prioritized sweeping maintains
        a <strong>priority queue</strong> ordered by the magnitude of the potential value change. This typically makes
        planning more efficient, especially in large state spaces where uniform sampling spends most updates on pairs
        whose values would barely change.
    </p>
</div>
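
As a small illustration of the ordering idea, Python's `queue.PriorityQueue` always returns its smallest entry first, so storing the negated magnitude makes the largest potential change come out first. The state-action pairs and magnitudes below are made up for this sketch and do not come from the maze.

```python
from queue import PriorityQueue

# Hypothetical (state, action) pairs with made-up |ΔQ| magnitudes.
candidates = {(3, 1): 0.40, (7, 0): 0.05, (5, 2): 0.90}

pq = PriorityQueue()
for pair, magnitude in candidates.items():
    # PriorityQueue pops the smallest entry, so negate the magnitude
    # to retrieve the largest potential change first.
    pq.put((-magnitude, pair))

order = []
while not pq.empty():
    neg_magnitude, pair = pq.get()
    order.append(pair)

print(order)  # [(5, 2), (3, 1), (7, 0)]
```

The pair with magnitude 0.90 is served first and the pair with 0.05 last, which is the order in which planning updates would be spent.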

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Understand the prioritized sweeping algorithm</li>
        <li>Implement a priority queue for planning</li>
        <li>Track state predecessors efficiently</li>
        <li>Compare with standard Dyna-Q</li>
        <li>Analyze computational efficiency gains</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Key Components</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Priority Queue</code> → Updates ordered by |ΔQ|</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Predecessors</code> → Track s→s' relationships</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Threshold θ</code> → Minimum priority for updates</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Backward Focus</code> → Propagate value changes</div>
    </div>
</td>
</tr>
</table>

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup</h2>
</div>

We'll use the same Shortcut Maze environment as in Lab 8-1, so the results can be compared directly with
Dyna-Q's random planning. The goal is to see whether prioritized sweeping finds short paths with fewer wasted updates.

In [None]:
"""
Cell 1: Import Libraries and Setup Environment

Purpose:
  - Clone maze environment repository
  - Import required libraries including PriorityQueue
  - Setup visualization parameters

Key Libraries:
  - queue.PriorityQueue: Maintains planning queue ordered by priority
  - numpy: Array operations and numerical computation
  - matplotlib: Performance visualization
  - rlglue: Agent-environment interaction framework

Environment:
  - 6x9 grid maze with walls
  - Deterministic transitions
  - +1 reward at goal, 0 elsewhere
"""

# Clone the repository with maze environment
!git clone https://github.com/mdehghani86/MazeExampleRep.git

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os, shutil
from tqdm import tqdm
from queue import PriorityQueue  # Critical for prioritized sweeping

# Install and import RL-Glue
!pip install jdc rlglue
import jdc

from MazeExampleRep.rl_glue import RLGlue
from MazeExampleRep.agent import BaseAgent
from MazeExampleRep.maze_env import ShortcutMazeEnvironment

# Create results directory
os.makedirs('results', exist_ok=True)

# Configure matplotlib
plt.rcParams.update({'font.size': 15})
plt.rcParams.update({'figure.figsize': [8, 5]})

print("✓ Environment setup complete")
print("✓ Priority queue library loaded")
print("✓ Ready for prioritized sweeping implementation")

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Prioritized Sweeping Algorithm</h2>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="MazeExampleRep/images/prioritized_sweeping.png" alt="Prioritized Sweeping Pseudocode" 
         style="width: 70%; max-width: 700px; border: 2px solid #17a2b8; border-radius: 8px;">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Prioritized Sweeping Algorithm from Sutton & Barto</p>
</div>

<div style="background: #e8f5e9; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #4caf50;">
    <h3 style="color: #2e7d32; font-size: 14px; margin: 0 0 8px 0;">Algorithm Enhancements</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        <strong>1. Priority Queue:</strong> State-action pairs prioritized by |r + γ·max Q(s',·) - Q(s,a)|<br>
        <strong>2. Predecessor Tracking:</strong> Maintain a reverse model for backward propagation<br>
        <strong>3. Threshold θ:</strong> Only queue updates with priority > θ<br>
        <strong>4. Focused Updates:</strong> Process the highest-priority state-action pairs first<br>
        <strong>5. Cascade Effect:</strong> Updates propagate to predecessor state-action pairs
    </p>
</div>
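
To make items 1 and 3 above concrete, the sketch below computes the priority of a single transition on a toy Q-table and checks it against θ. The table values, the transition, and the helper name `td_priority` are illustrative assumptions and are not part of the lab's agent. Terminal transitions, which this lab marks with next state -1, would need the max term dropped and are not handled here.

```python
import numpy as np

def td_priority(q_values, s, a, r, s_next, gamma):
    """Return |r + gamma * max_a' Q(s', a') - Q(s, a)| for one transition."""
    return abs(r + gamma * np.max(q_values[s_next]) - q_values[s, a])

# Toy Q-table: 3 states, 2 actions (values are made up).
q = np.array([[0.0, 0.0],
              [0.0, 0.5],
              [1.0, 0.2]])
gamma, theta = 0.95, 0.05

# Hypothetical transition: (s=1, a=1) leads to s'=2 with reward 0.
p = td_priority(q, s=1, a=1, r=0.0, s_next=2, gamma=gamma)
should_queue = p > theta

print(round(float(p), 3), bool(should_queue))  # 0.45 True
```

Here the priority is |0 + 0.95·1.0 - 0.5| = 0.45, which exceeds θ = 0.05, so this pair would be queued.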

In [None]:
"""
Cell 2: PriorityAgent Class Initialization

Purpose:
  - Initialize agent with prioritized sweeping components
  - Setup priority queue and predecessor tracking
  - Configure threshold parameter θ

Key Data Structures:
  - queue: PriorityQueue for planning order
  - predecessors: Dict mapping states to their predecessors
  - theta: Threshold for minimum priority
  - model: Standard Dyna-Q model {s: {a: (s', r)}}

CRITICAL: queue.PriorityQueue is a min-heap, so priorities are stored negated
          to make the largest priority come out first
"""

class PriorityAgent(BaseAgent):

    def agent_init(self, agent_info):
        """Initialize prioritized sweeping agent."""
        
        # Extract standard parameters
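        try:
            self.num_states = agent_info["num_states"]
            self.num_actions = agent_info["num_actions"]
        except KeyError as err:
            # Fail fast: the Q-table below cannot be built without these values
            raise KeyError("num_states and num_actions are required in agent_info") from err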
            
        self.gamma = agent_info.get("discount", 0.95)
        self.step_size = agent_info.get("step_size", 0.1)
        self.epsilon = agent_info.get("epsilon", 0.1)
        self.planning_steps = agent_info.get("planning_steps", 10)

        # Random number generators
        self.rand_generator = np.random.RandomState(agent_info.get('random_seed', 50))
        self.planning_rand_generator = np.random.RandomState(
            agent_info.get('planning_random_seed', 50))

        # Standard Dyna-Q components
        self.q_values = np.zeros((self.num_states, self.num_actions))
        self.actions = list(range(self.num_actions))
        self.past_action = -1
        self.past_state = -1
        self.model = {}
        
        # ============================================================
        # PRIORITIZED SWEEPING COMPONENTS
        # ============================================================
        self.theta = agent_info.get("theta", 0.05)  # Priority threshold
        self.queue = PriorityQueue()                # Priority queue for planning
        self.predecessors = {}                      # s' -> [(s, a), ...]
        
        print(f"Initialized with θ = {self.theta}")

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 1: Core Components</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implement the three core components of prioritized sweeping:
        model update, predecessor tracking, and priority queue management.
        <br><br>
        <strong>Key insight:</strong> These components work together to focus planning on important updates.
    </p>
</div>
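
The shapes of the two bookkeeping structures can be sketched on plain dictionaries. This is a standalone illustration with a hypothetical helper name (`record_transition`) and made-up transitions; it is not the `PriorityAgent` methods you are asked to write, though the same ideas apply.

```python
def record_transition(model, predecessors, s, a, s_next, r):
    """Store the transition s --a--> s_next in a forward model and a reverse index."""
    # Forward model: model[s][a] = (s', r)
    model.setdefault(s, {})[a] = (s_next, r)
    # Reverse index: predecessors[s'] = [(s, a), ...] without duplicates
    pairs = predecessors.setdefault(s_next, [])
    if (s, a) not in pairs:
        pairs.append((s, a))

model, predecessors = {}, {}
record_transition(model, predecessors, s=0, a=1, s_next=1, r=0.0)
record_transition(model, predecessors, s=2, a=3, s_next=1, r=0.0)
record_transition(model, predecessors, s=0, a=1, s_next=1, r=0.0)  # repeat visit

print(model)         # {0: {1: (1, 0.0)}, 2: {3: (1, 0.0)}}
print(predecessors)  # {1: [(0, 1), (2, 3)]}
```

The repeated visit overwrites the same model entry and does not add a second copy to the predecessor list, which keeps the backward sweep from re-evaluating the same pair twice per update.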

In [None]:
%%add_to PriorityAgent

# [HANDS-ON EXERCISE 1A: Model Update]

def update_model(self, past_state, past_action, state, reward):
    """
    Update the model (same as Dyna-Q).
    
    Args:
        past_state (int): Previous state s
        past_action (int): Action taken a
        state (int): Resulting state s'
        reward (float): Observed reward r
    """
    
    ### START YOUR CODE HERE ### (1-4 lines)
    # Hint: Create nested dict if state not in model
    # Store transition: model[s][a] = (s', r)
    
    
    
    ### END YOUR CODE HERE ###

In [None]:
%%add_to PriorityAgent

# [HANDS-ON EXERCISE 1B: Predecessor Tracking]

def update_predecessors(self, past_state, past_action, state):
    """
    Track which state-action pairs lead to each state.
    
    Args:
        past_state (int): Previous state s
        past_action (int): Action taken a  
        state (int): Resulting state s'
        
    Hint: Store (s,a) pairs that lead to s' for backward propagation
    """
    
    ### START YOUR CODE HERE ### (1-4 lines)
    # Create list for state if not exists
    # Append (past_state, past_action) if not already there
    
    
    
    ### END YOUR CODE HERE ###

In [None]:
%%add_to PriorityAgent

# [HANDS-ON EXERCISE 1C: Priority Queue Update]

def update_queue(self, past_state, past_action, state, reward):
    """
    Add state-action pair to queue with appropriate priority.
    
    Priority = |r + γ·max Q(s',·) - Q(s,a)|
    Note: Use negative priority for min-heap behavior
    """
    
    ### START YOUR CODE HERE ### (2-4 lines)
    # Step 1: Calculate TD error magnitude
    # priority = |reward + gamma * max(Q[state]) - Q[past_state][past_action]|
    
    
    # Step 2: If priority > theta, add to queue with NEGATIVE priority
    # self.queue.put((-priority, (past_state, past_action)))
    
    
    ### END YOUR CODE HERE ###

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 2: Planning Step</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implement the prioritized planning step - the core innovation.
        Process highest-priority updates and propagate changes to predecessors.
        <br><br>
        <strong>Critical:</strong> Updates cascade backward through predecessor states.
    </p>
</div>
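
The backward cascade is easiest to see on a tiny problem. The sketch below runs a prioritized sweep on a hypothetical three-state chain (0 → 1 → 2 → terminal) with a single action and a fully known model. The chain, the value γ = 0.9, and the step size of 1.0 (chosen so the numbers come out exact) are assumptions for illustration; the lab's agent uses its own parameters and the maze model.

```python
from queue import PriorityQueue
import numpy as np

TERMINAL = -1
gamma, alpha, theta = 0.9, 1.0, 0.05

# Known deterministic model for the chain: 0 -> 1 -> 2 -> terminal (reward 1).
model = {0: {0: (1, 0.0)}, 1: {0: (2, 0.0)}, 2: {0: (TERMINAL, 1.0)}}
predecessors = {1: [(0, 0)], 2: [(1, 0)]}
q = np.zeros((3, 1))

def target(s_next, r):
    """One-step target; terminal transitions contribute only the reward."""
    return r if s_next == TERMINAL else r + gamma * np.max(q[s_next])

pq = PriorityQueue()
pq.put((-1.0, (2, 0)))  # seed: the rewarding transition has TD error 1.0
processed = []

for _ in range(10):  # planning budget
    if pq.empty():
        break
    _, (s, a) = pq.get()
    s_next, r = model[s][a]
    q[s, a] += alpha * (target(s_next, r) - q[s, a])
    processed.append((s, a))
    # Re-evaluate every pair that the model predicts leads into s.
    for (ps, pa) in predecessors.get(s, []):
        _, pr = model[ps][pa]
        priority = abs(target(s, pr) - q[ps, pa])
        if priority > theta:
            pq.put((-priority, (ps, pa)))

print(processed)  # [(2, 0), (1, 0), (0, 0)]
print(q.ravel())
```

Only three of the ten budgeted updates are used: the sweep starts at the rewarding transition, walks backward one predecessor at a time, and stops as soon as the queue is empty.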

In [None]:
%%add_to PriorityAgent

# [HANDS-ON EXERCISE 2: Prioritized Planning]

def planning_step(self):
    """
    Perform prioritized sweeping planning steps.
    Process queue in priority order and propagate to predecessors.
    """
    
    ### START YOUR CODE HERE ###
    
    # Planning loop - continue for planning_steps or until queue empty
    for _ in range(self.planning_steps):
        if self.queue.empty():
            break
            
        # Step 1: Get highest priority state-action from queue
        # priority, (state, action) = self.queue.get()
        
        
        # Step 2: Get model prediction
        # next_state, reward = self.model[state][action]
        
        
        # Step 3: Q-learning update
        # Handle terminal (next_state == -1) and non-terminal cases
        
        
        
        
        # Step 4: Update all predecessors of current state
        # For each (s, a) that leads to current state:
        #   Calculate priority for predecessor
        #   Add to queue if priority > theta
        
        
        
        
        
    ### END YOUR CODE HERE ###

In [None]:
%%add_to PriorityAgent

# Helper functions for action selection

def argmax(self, q_values):
    """Argmax with random tie-breaking."""
    top = float("-inf")
    ties = []
    for i in range(len(q_values)):
        if q_values[i] > top:
            top = q_values[i]
            ties = []
        if q_values[i] == top:
            ties.append(i)
    return self.rand_generator.choice(ties)

def choose_action_egreedy(self, state):
    """ε-greedy action selection."""
    if self.rand_generator.rand() < self.epsilon:
        action = self.rand_generator.choice(self.actions)
    else:
        values = self.q_values[state]
        action = self.argmax(values)
    return action

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 3: Agent Integration</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Complete the agent methods to integrate all components.
        Remember to update model, queue, and predecessors in the correct order.
    </p>
</div>

In [None]:
%%add_to PriorityAgent

# [HANDS-ON EXERCISE 3: Agent Methods]

def agent_start(self, state):
    """First action selection."""
    
    ### START YOUR CODE HERE ### (2 lines)
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_step(self, reward, state):
    """Main learning step with prioritized sweeping."""
    
    ### START YOUR CODE HERE ###
    
    # Step 1: Direct RL update
    
    
    # Step 2: Update model
    
    
    # Step 3: Update priority queue
    
    
    # Step 4: Update predecessors
    
    
    # Step 5: Planning
    
    
    # Step 6: Select next action
    
    
    # Step 7: Store state and action
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_end(self, reward):
    """Terminal state handling."""
    
    ### START YOUR CODE HERE ###
    
    # Similar to agent_step but with terminal state (-1)
    # Remember: No predecessors for terminal state
    
    
    
    
    
    ### END YOUR CODE HERE ###

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 3: Running Experiments</h2>
</div>

Test the prioritized sweeping agent and compare its performance with standard Dyna-Q.

In [None]:
"""
Cell 3: Experiment Runner

Purpose:
  - Run prioritized sweeping experiments
  - Measure steps per episode
  - Compare learning efficiency

Metrics:
  - Steps to goal (lower is better)
  - Learning speed
  - Planning efficiency
"""

def run_experiment(env, agent, env_parameters, agent_parameters, exp_parameters):
    """Run experiment with prioritized sweeping."""
    
    # Extract parameters
    num_runs = exp_parameters['num_runs']
    num_episodes = exp_parameters['num_episodes']
    planning_steps_all = agent_parameters['planning_steps']

    env_info = env_parameters                     
    agent_info = {
        "num_states": agent_parameters["num_states"],
        "num_actions": agent_parameters["num_actions"],
        "epsilon": agent_parameters["epsilon"],
        "theta": agent_parameters["theta"],
        "discount": env_parameters["discount"],
        "step_size": agent_parameters["step_size"]
    }

    all_averages = np.zeros((len(planning_steps_all), num_runs, num_episodes))
    log_data = {'planning_steps_all': planning_steps_all}

    for idx, planning_steps in enumerate(planning_steps_all):
        print(f'Planning steps: {planning_steps}')
        agent_info["planning_steps"] = planning_steps  

        for i in tqdm(range(num_runs)):
            agent_info['random_seed'] = i
            agent_info['planning_random_seed'] = i

            rl_glue = RLGlue(env, agent)
            rl_glue.rl_init(agent_info, env_info)

            for j in range(num_episodes):
                rl_glue.rl_start()
                is_terminal = False
                num_steps = 0
                
                while not is_terminal:
                    reward, _, action, is_terminal = rl_glue.rl_step()
                    num_steps += 1

                all_averages[idx][i][j] = num_steps

    log_data['all_averages'] = all_averages
    np.save("results/Priority-Sweeping_steps", log_data)
    

def plot_steps_per_episode(file_path):
    """Plot learning curves."""
    
    data = np.load(file_path, allow_pickle=True).item()
    all_averages = data['all_averages']
    planning_steps_all = data['planning_steps_all']

    for i, planning_steps in enumerate(planning_steps_all):
        plt.plot(np.mean(all_averages[i], axis=0), 
                label=f'Planning steps = {planning_steps}')

    plt.xlabel('Episodes')
    plt.ylabel('Steps\nper\nepisode', rotation=0, labelpad=40)
    plt.axhline(y=16, linestyle='--', color='grey', alpha=0.4,
               label='Optimal')
    # Build the legend after all labelled artists exist so 'Optimal' is listed
    plt.legend(loc='upper right')
    plt.title('Prioritized Sweeping Performance')
    plt.show()

In [None]:
"""
Cell 4: Run Prioritized Sweeping Experiment

Purpose:
  - Execute experiment with optimized parameters
  - Visualize learning performance
  - Save results for analysis

Parameters:
  - θ = 0.2: Priority threshold
  - Planning steps = 25
  - ε = 0.1: Exploration rate
"""

# ============================================================
# EXPERIMENT CONFIGURATION
# ============================================================

# Experiment parameters
experiment_parameters = {
    "num_runs": 20,        # Number of independent runs
    "num_episodes": 30,    # Episodes per run
}

# Environment parameters
environment_parameters = { 
    "discount": 0.95,
}

# Agent parameters
agent_parameters = {  
    "num_states": 54,      # 6x9 grid
    "num_actions": 4,      # 4 directions
    "epsilon": 0.1, 
    "step_size": 0.125,
    "theta": 0.2,          # Priority threshold
    "planning_steps": [25] # Focus computational effort
}

print("Running prioritized sweeping experiment...")
print(f"Configuration: θ={agent_parameters['theta']}, "
      f"planning={agent_parameters['planning_steps'][0]}\n")

# Run experiment
current_env = ShortcutMazeEnvironment
current_agent = PriorityAgent

run_experiment(current_env, current_agent, environment_parameters, 
              agent_parameters, experiment_parameters)

# Plot results
plot_steps_per_episode('results/Priority-Sweeping_steps.npy')

# Save results
shutil.make_archive('results', 'zip', 'results')

print("\n✓ Experiment complete!")

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 4: Analysis and Observations</h2>
</div>

<div style="background: #f0f4f8; padding: 20px; margin: 20px 0; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 15px 0;">📝 Record Your Observations</h3>
    <p style="color: #555; line-height: 1.6; margin: 0 0 15px 0; font-size: 13px;">
        Based on your experimental results, record your observations about prioritized sweeping performance.
        Consider the following aspects in your analysis:
    </p>
    
    <div style="background: white; padding: 15px; border-radius: 5px; margin-top: 10px;">
        <p style="color: #17a2b8; font-weight: bold; margin: 0 0 10px 0;">1. Learning Speed Comparison</p>
        <div style="background: #f8f9fa; padding: 10px; border-left: 2px solid #17a2b8; margin: 10px 0;">
            <em style="color: #666; font-size: 12px;">Compare with standard Dyna-Q from Lab 8-1. How much faster does prioritized sweeping converge?</em>
            <div style="margin-top: 10px; padding: 10px; border: 1px dashed #ccc; min-height: 60px;">
                <!-- Student observation here -->
            </div>
        </div>
        
        <p style="color: #17a2b8; font-weight: bold; margin: 20px 0 10px 0;">2. Computational Efficiency</p>
        <div style="background: #f8f9fa; padding: 10px; border-left: 2px solid #17a2b8; margin: 10px 0;">
            <em style="color: #666; font-size: 12px;">How does prioritized sweeping achieve better results with the same number of planning steps?</em>
            <div style="margin-top: 10px; padding: 10px; border: 1px dashed #ccc; min-height: 60px;">
                <!-- Student observation here -->
            </div>
        </div>
        
        <p style="color: #17a2b8; font-weight: bold; margin: 20px 0 10px 0;">3. Key Algorithm Insights</p>
        <div style="background: #f8f9fa; padding: 10px; border-left: 2px solid #17a2b8; margin: 10px 0;">
            <em style="color: #666; font-size: 12px;">What makes prioritized sweeping more effective? Consider the role of the priority queue and backward focusing.</em>
            <div style="margin-top: 10px; padding: 10px; border: 1px dashed #ccc; min-height: 60px;">
                <!-- Student observation here -->
            </div>
        </div>
    </div>
</div>

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Takeaways</h3>
    <p style="color: #555; line-height: 1.6; font-size: 13px;">
        <strong>1. Focused Computation:</strong> Priority queue ensures updates happen where most needed<br><br>
        <strong>2. Backward Propagation:</strong> Value changes cascade through predecessor states<br><br>
        <strong>3. Threshold θ:</strong> Balances computation vs. accuracy<br><br>
        <strong>4. Efficiency Gain:</strong> Same planning budget achieves faster convergence<br><br>
        <strong>5. Scalability:</strong> More important in larger state spaces
    </p>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Further Exploration</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>How would different θ values affect performance?</li>
        <li>What happens in stochastic environments?</li>
        <li>How does performance scale with maze size?</li>
        <li>Could we combine with Dyna-Q+ exploration bonus?</li>
        <li>What are the memory requirements vs. standard Dyna-Q?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 8-2: Prioritized Sweeping</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Module 8 Complete - Planning and Learning with Tabular Methods</p>
</div>