<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 8-1: Dyna-Q and Dyna-Q+ Algorithms
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton and Barto Chapter 8 | 90 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        This lab implements the <strong>Dyna-Q</strong> and <strong>Dyna-Q+</strong> algorithms, which integrate planning, acting, and learning.
        Dyna-Q combines direct reinforcement learning with planning using a learned model. The key innovation is 
        <strong>simulated experience</strong>: the agent uses its model to generate imaginary transitions for additional learning.
        Dyna-Q+ extends this with an <strong>exploration bonus</strong> for long-unvisited state-action pairs, enabling better
        adaptation to changing environments.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement the Dyna-Q algorithm</li>
        <li>Understand model-based planning</li>
        <li>Implement Dyna-Q+ with exploration bonus</li>
        <li>Compare performance in changing environments</li>
        <li>Analyze the exploration-exploitation tradeoff</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Algorithm Components</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Direct RL</code> → Q-learning from real experience</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Model Learning</code> → Store (s,a,r,s') transitions</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Planning</code> → Q-learning from simulated experience</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Exploration Bonus</code> → κ√τ for long-unvisited pairs (Q+)</div>
    </div>
</td>
</tr>
</table>
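
As a small illustration of the Direct RL component listed above, the following sketch applies one tabular Q-learning update to a toy table. The function name `q_learning_update`, the 3x2 table, and the numbers are illustrative assumptions, not part of the lab code.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # The bootstrap term is dropped when the transition ends the episode
    target = r if terminal else r + gamma * np.max(q[s_next])
    td_error = target - q[s, a]
    q[s, a] += alpha * td_error
    return td_error

q = np.zeros((3, 2))     # toy table: 3 states, 2 actions
q[1] = [0.0, 1.0]        # pretend state 1 already has one valuable action
td = q_learning_update(q, s=0, a=1, r=0.0, s_next=1)
print(td, q[0, 1])       # TD error ≈ 0.95, updated value ≈ 0.095
```

Planning reuses the same update; only the source of the transition (real step versus model lookup) differs.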

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup and Dependencies</h2>
</div>

This lab uses the **Shortcut Maze Environment**, in which the agent must navigate from start (S) to goal (G). After 3000 timesteps a shortcut opens, testing whether the agent can discover and exploit the new path.

In [None]:
"""
Cell 1: Import Libraries and Clone Repository

Purpose:
  - Clone the maze environment repository
  - Import all required libraries for Dyna algorithms
  - Set up RL-Glue framework for agent-environment interaction

Key Libraries:
  - numpy: Numerical operations and array handling
  - matplotlib: Visualization of results and state visitations
  - rlglue: Framework for RL experiments
  - jdc: Provides the %%add_to cell magic for adding methods to a class across cells

Environment Details:
  - Shortcut Maze: 6x9 grid world
  - Actions: Up, Down, Left, Right (deterministic)
  - Reward: +1 at goal, 0 elsewhere
  - Discount: γ = 0.95
"""

# Clone the repository with maze environment
!git clone https://github.com/mdehghani86/MazeExampleRep.git

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os, shutil
from tqdm import tqdm

# Install required packages
!pip install jdc rlglue
import jdc

# Import RL-Glue components
from MazeExampleRep.rl_glue import RLGlue
from MazeExampleRep.agent import BaseAgent
from MazeExampleRep.maze_env import ShortcutMazeEnvironment

# Create results directory
os.makedirs('results', exist_ok=True)

# Configure matplotlib for better quality figures
plt.rcParams.update({'font.size': 15})
plt.rcParams.update({'figure.figsize': [8, 5]})

print("✓ Environment setup complete")
print("✓ Shortcut Maze environment loaded")
print("✓ Ready to implement Dyna-Q algorithms")

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Dyna-Q Algorithm Implementation</h2>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="MazeExampleRep/images/DynaQ.png" alt="Dyna-Q Pseudocode" style="width: 70%; max-width: 700px; border: 2px solid #17a2b8; border-radius: 8px;">
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Dyna-Q Algorithm Pseudocode</p>
</div>

<div style="background: #e8f5e9; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #4caf50;">
    <h3 style="color: #2e7d32; font-size: 14px; margin: 0 0 8px 0;">Algorithm Steps</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        <strong>1. Action Selection:</strong> ε-greedy policy for exploration<br>
        <strong>2. Direct RL:</strong> Q-learning update from real experience<br>
        <strong>3. Model Learning:</strong> Store observed transitions<br>
        <strong>4. Planning:</strong> n simulated Q-learning updates from model
    </p>
</div>
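
The four steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the lab's agent: the 5-state chain, `env_step`, and `dyna_q` are hypothetical names invented here, and the step size is chosen only to make the toy converge quickly.

```python
import numpy as np

N_STATES, GOAL = 5, 4     # toy chain: states 0..4, goal at the right end
ACTIONS = (0, 1)          # 0 = left, 1 = right (toy action set)

def env_step(s, a):
    """Deterministic chain dynamics; returns (next_state, reward), -1 if terminal."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    done = s_next == GOAL
    return (-1 if done else s_next), (1.0 if done else 0.0)

def dyna_q(n_planning, episodes=20, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.RandomState(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    model = {}                      # model[s][a] = (next_state, reward)
    steps_per_episode = []
    for _episode in range(episodes):
        s, steps = 0, 0
        while s != -1:
            # 1. Action selection: epsilon-greedy with random tie-breaking
            if rng.rand() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = rng.choice(np.flatnonzero(q[s] == q[s].max()))
            s_next, r = env_step(s, a)
            # 2. Direct RL: Q-learning update from the real transition
            target = r if s_next == -1 else r + gamma * q[s_next].max()
            q[s, a] += alpha * (target - q[s, a])
            # 3. Model learning: remember the latest outcome of (s, a)
            model.setdefault(s, {})[a] = (s_next, r)
            # 4. Planning: n simulated updates from previously seen pairs
            for _plan in range(n_planning):
                ps = rng.choice(list(model.keys()))
                pa = rng.choice(list(model[ps].keys()))
                pn, pr = model[ps][pa]
                ptarget = pr if pn == -1 else pr + gamma * q[pn].max()
                q[ps, pa] += alpha * (ptarget - q[ps, pa])
            s, steps = s_next, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode

q, steps = dyna_q(n_planning=5)
print(steps)   # steps per episode; the shortest possible episode is 4 steps
```

Calling `dyna_q(n_planning=0)` on the same toy gives plain one-step Q-learning, which is a convenient baseline for comparison.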

In [None]:
"""
Cell 2: DynaQAgent Class Initialization

Purpose:
  - Initialize the Dyna-Q agent with all necessary components
  - Set up Q-values, model, and random number generators

Key Components:
  - q_values: Action-value estimates Q(s,a)
  - model: Dictionary storing transitions {s: {a: (s', r)}}
  - planning_steps: Number of simulated updates per real step
  - Two RNGs: One for action selection, one for planning
"""

class DynaQAgent(BaseAgent):

    def agent_init(self, agent_info):
        """Setup for the agent called when the experiment first starts."""
        
        # Extract agent parameters
        try:
            self.num_states = agent_info["num_states"]
            self.num_actions = agent_info["num_actions"]
        except KeyError as err:
            # Fail fast: the agent cannot build its Q-table without these sizes
            raise KeyError("num_states and num_actions required in agent_info") from err
            
        self.gamma = agent_info.get("discount", 0.95)
        self.step_size = agent_info.get("step_size", 0.1)
        self.epsilon = agent_info.get("epsilon", 0.1)
        self.planning_steps = agent_info.get("planning_steps", 10)

        # Initialize random number generators
        self.rand_generator = np.random.RandomState(agent_info.get('random_seed', 42))
        self.planning_rand_generator = np.random.RandomState(agent_info.get('planning_random_seed', 42))

        # Initialize Q-values and model
        self.q_values = np.zeros((self.num_states, self.num_actions))
        self.actions = list(range(self.num_actions))
        self.past_action = -1
        self.past_state = -1
        
        # Model: dictionary of dictionaries
        # model[s][a] = (next_state, reward)
        self.model = {}

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 1: Model Update</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implement the model update step. The model stores observed transitions for later planning.
        <br><br>
        <strong>Hint:</strong> For deterministic environments, simply store the observed (s', r) for each (s, a) pair.
    </p>
</div>
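
To make the nested-dictionary layout concrete, here is a standalone sketch of how a deterministic model could be filled in. The helper `record_transition` and the state/action numbers are made up for illustration; this is one possible idiom, not necessarily the intended exercise solution.

```python
model = {}

def record_transition(model, s, a, s_next, r):
    """Store the latest observed outcome of taking action a in state s."""
    # setdefault creates the inner dictionary the first time s is seen
    model.setdefault(s, {})[a] = (s_next, r)

record_transition(model, 0, 2, 1, 0.0)
record_transition(model, 0, 3, 0, 0.0)
record_transition(model, 8, 0, -1, 1.0)   # -1 marks a terminal next state
record_transition(model, 0, 2, 9, 0.0)    # a repeat visit overwrites the old entry
print(model)
# {0: {2: (9, 0.0), 3: (0, 0.0)}, 8: {0: (-1, 1.0)}}
```

Overwriting on repeat visits is what lets a deterministic model track a changed environment, but only for pairs the agent actually tries again.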

In [None]:
%%add_to DynaQAgent

# [HANDS-ON EXERCISE 1]

def update_model(self, past_state, past_action, state, reward):
    """
    Updates the model with observed transition.
    
    Args:
        past_state (int): Previous state s
        past_action (int): Action taken a
        state (int): Resulting state s'
        reward (float): Observed reward r
    
    Hint: Create nested dictionary structure if state not yet in model
    """
    
    ### START YOUR CODE HERE ### (1-4 lines)
    # Check if past_state exists in model
    # If not, create empty dictionary for it
    # Store the transition: model[s][a] = (s', r)
    
    
    
    ### END YOUR CODE HERE ###

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 2: Planning Step</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implement the planning step - the heart of Dyna-Q. Use the model to generate simulated experience and update Q-values.
        <br><br>
        <strong>Key Points:</strong><br>
        • Randomly sample from previously observed (s,a) pairs<br>
        • Use the model to get (s', r)<br>
        • Apply Q-learning update<br>
        • Terminal states are stored as -1
    </p>
</div>
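
The last key point deserves care: because terminal states are stored as -1, indexing a NumPy Q-table with that value does not raise an error but silently reads the last row. The sketch below, with a hypothetical helper `td_target` and toy values, shows sampling from a small model and computing the target safely.

```python
import numpy as np

def td_target(q, next_state, reward, gamma=0.95):
    """Return the Q-learning target, treating next_state == -1 as terminal."""
    if next_state == -1:
        return reward                      # no bootstrap from a terminal state
    return reward + gamma * np.max(q[next_state])

q = np.zeros((3, 2))
q[2] = [0.2, 0.6]                          # toy values for the last state
model = {0: {1: (2, 0.0)}, 2: {0: (-1, 1.0)}}

# Sample a previously observed (s, a) pair with a dedicated planning RNG
rng = np.random.RandomState(0)
s = rng.choice(list(model.keys()))
a = rng.choice(list(model[s].keys()))
next_state, reward = model[s][a]
print(s, a, td_target(q, next_state, reward))

print(td_target(q, 2, 0.0))                # ≈ 0.57 (0.95 * 0.6)
print(td_target(q, -1, 1.0))               # 1.0
# Pitfall: skipping the terminal check would bootstrap from q[-1], the last row
wrong = 1.0 + 0.95 * np.max(q[-1])         # ≈ 1.57 instead of 1.0
```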

In [None]:
%%add_to DynaQAgent

# [HANDS-ON EXERCISE 2]

def planning_step(self):
    """
    Performs planning using the model (indirect RL).
    Samples random state-action pairs and updates Q-values.
    
    CRITICAL: Use self.planning_rand_generator for randomness
    """
    
    # ============================================================
    # PLANNING LOOP: Repeat for self.planning_steps iterations
    # ============================================================
    
    ### START YOUR CODE HERE ###
    for step in range(self.planning_steps):
        # Step 1: Randomly select a state from the model
        # Hint: Get list of states using list(self.model.keys())
        
        
        # Step 2: Randomly select an action for that state
        # Hint: Get available actions using list(self.model[state].keys())
        
        
        # Step 3: Query the model for next state and reward
        # Hint: next_state, reward = self.model[state][action]
        
        
        # Step 4: Perform Q-learning update
        # Different update for terminal (next_state == -1) vs non-terminal
        # Terminal: Q(s,a) += α[r - Q(s,a)]
        # Non-terminal: Q(s,a) += α[r + γ*max(Q(s',·)) - Q(s,a)]
        
        
        
        
    ### END YOUR CODE HERE ###

In [None]:
%%add_to DynaQAgent

# Helper functions for action selection

def argmax(self, q_values):
    """Argmax with random tie-breaking."""
    top = float("-inf")
    ties = []

    for i in range(len(q_values)):
        if q_values[i] > top:
            top = q_values[i]
            ties = []
        if q_values[i] == top:
            ties.append(i)

    return self.rand_generator.choice(ties)

def choose_action_egreedy(self, state):
    """ε-greedy action selection."""
    if self.rand_generator.rand() < self.epsilon:
        action = self.rand_generator.choice(self.actions)
    else:
        values = self.q_values[state]
        action = self.argmax(values)
    return action

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 3: Agent Step Methods</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Complete the agent's step methods to integrate all Dyna-Q components.
        <br><br>
        <strong>Order of operations:</strong><br>
        1. Direct RL update (Q-learning)<br>
        2. Update model<br>
        3. Planning steps<br>
        4. Select next action
    </p>
</div>

In [None]:
%%add_to DynaQAgent

# [HANDS-ON EXERCISE 3]

def agent_start(self, state):
    """First action selection in episode."""
    
    ### START YOUR CODE HERE ### (2 lines)
    # Select action using ε-greedy
    # Store state and action for later use
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_step(self, reward, state):
    """Main learning step combining all Dyna-Q components."""
    
    ### START YOUR CODE HERE ###
    # Step 1: Direct RL - Q-learning update
    # Q(s,a) += α[r + γ*max(Q(s',·)) - Q(s,a)]
    
    
    # Step 2: Update the model
    
    
    # Step 3: Planning
    
    
    # Step 4: Select next action
    
    
    # Step 5: Store current state and action
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_end(self, reward):
    """Handle terminal state."""
    
    ### START YOUR CODE HERE ###
    # Step 1: Q-learning update for terminal state
    # Q(s,a) += α[r - Q(s,a)]
    
    
    # Step 2: Update model (use -1 for terminal state)
    
    
    # Step 3: Final planning step
    
    
    ### END YOUR CODE HERE ###

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 3: Running Dyna-Q Experiments</h2>
</div>

Let's test the Dyna-Q agent with different numbers of planning steps to see how planning improves learning speed.

In [None]:
"""
Cell 3: Experiment Runner Functions

Purpose:
  - Define functions to run experiments and visualize results
  - Compare different planning step values

Metrics:
  - Steps per episode (lower is better)
  - Cumulative reward over time
  - State visitation heatmaps
"""

def run_experiment(env, agent, env_parameters, agent_parameters, exp_parameters):
    """Run experiment with multiple planning step values."""
    
    # Extract parameters
    num_runs = exp_parameters['num_runs']
    num_episodes = exp_parameters['num_episodes']
    planning_steps_all = agent_parameters['planning_steps']

    env_info = env_parameters                     
    agent_info = {
        "num_states": agent_parameters["num_states"],
        "num_actions": agent_parameters["num_actions"],
        "epsilon": agent_parameters["epsilon"], 
        "discount": env_parameters["discount"],
        "step_size": agent_parameters["step_size"]
    }

    all_averages = np.zeros((len(planning_steps_all), num_runs, num_episodes))
    log_data = {'planning_steps_all': planning_steps_all}

    for idx, planning_steps in enumerate(planning_steps_all):
        print(f'Planning steps: {planning_steps}')
        agent_info["planning_steps"] = planning_steps  

        for i in tqdm(range(num_runs)):
            agent_info['random_seed'] = i
            agent_info['planning_random_seed'] = i

            rl_glue = RLGlue(env, agent)
            rl_glue.rl_init(agent_info, env_info)

            for j in range(num_episodes):
                rl_glue.rl_start()
                is_terminal = False
                num_steps = 0
                
                while not is_terminal:
                    reward, _, action, is_terminal = rl_glue.rl_step()
                    num_steps += 1

                all_averages[idx][i][j] = num_steps

    log_data['all_averages'] = all_averages
    np.save("results/Dyna-Q_planning_steps", log_data)

def plot_steps_per_episode(file_path):
    """Plot learning curves for different planning steps."""
    
    data = np.load(file_path, allow_pickle=True).item()
    all_averages = data['all_averages']
    planning_steps_all = data['planning_steps_all']

    for i, planning_steps in enumerate(planning_steps_all):
        plt.plot(np.mean(all_averages[i], axis=0), 
                label=f'Planning steps = {planning_steps}')

    plt.xlabel('Episodes')
    plt.ylabel('Steps\nper\nepisode', rotation=0, labelpad=40)
    plt.axhline(y=16, linestyle='--', color='grey', alpha=0.4, 
               label='Optimal')
    # Build the legend after axhline so the 'Optimal' line is listed too
    plt.legend(loc='upper right')
    plt.title('Dyna-Q Learning Performance')
    plt.show()

In [None]:
"""
Cell 4: Run Dyna-Q Experiment

Purpose:
  - Test Dyna-Q with 0, 5, and 50 planning steps
  - Compare learning efficiency

Expected Results:
  - More planning steps → Faster learning
  - n=0 is standard Q-learning (no planning)
  - n=50 should converge quickly
"""

# ============================================================
# EXPERIMENT CONFIGURATION
# ============================================================

# Experiment parameters
experiment_parameters = {
    "num_runs": 30,        # Average over 30 runs
    "num_episodes": 40,    # Episodes per run
}

# Environment parameters
environment_parameters = { 
    "discount": 0.95,
}

# Agent parameters
agent_parameters = {  
    "num_states": 54,       # 6x9 grid
    "num_actions": 4,       # Up, Down, Left, Right
    "epsilon": 0.1,         # Exploration rate
    "step_size": 0.125,     # Learning rate
    "planning_steps": [0, 5, 50]  # Compare different values
}

print("Running Dyna-Q experiments...")
print("This will take 2-3 minutes\n")

# Run experiment
current_env = ShortcutMazeEnvironment
current_agent = DynaQAgent

run_experiment(current_env, current_agent, environment_parameters, 
              agent_parameters, experiment_parameters)

# Plot results
plot_steps_per_episode('results/Dyna-Q_planning_steps.npy')

# Archive the results directory as results.zip
shutil.make_archive('results', 'zip', 'results')

print("\n✓ Experiment complete!")
print("Notice how more planning steps lead to faster learning.")

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 4: Dyna-Q+ with Exploration Bonus</h2>
</div>

<div style="background: #e8f5e9; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #4caf50;">
    <h3 style="color: #2e7d32; font-size: 14px; margin: 0 0 8px 0;">Dyna-Q+ Enhancements</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        <strong>1. Exploration Bonus:</strong> Add κ√τ(s,a) to rewards during planning<br>
        <strong>2. Complete Model:</strong> Initialize unvisited actions with self-loops<br>
        <strong>3. Time Tracking:</strong> Count steps since last visit for each (s,a)<br><br>
        This encourages revisiting long-unvisited state-action pairs, helping discover environment changes.
    </p>
</div>
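
To get a feel for the size of the bonus, the sketch below evaluates κ√τ for a few elapsed times, using κ = 0.001 (the default in the agent initialization that follows). The specific τ values are arbitrary examples.

```python
import numpy as np

kappa = 0.001
tau = np.array([0, 100, 2500, 10000])   # steps since each pair was last tried
bonus = kappa * np.sqrt(tau)
for t, b in zip(tau, bonus):
    print(f"tau = {t:>5d}  ->  bonus = {b:.3f}")
# bonuses: 0.000, 0.010, 0.050, 0.100
```

Since the goal reward is +1, a pair has to go untried for thousands of steps before its bonus becomes comparable to discounted goal value, and the square root makes the bonus grow ever more slowly with elapsed time.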

In [None]:
"""
Cell 5: DynaQPlusAgent Initialization

Purpose:
  - Initialize Dyna-Q+ with exploration bonus components
  - Add time-since-visit tracking (tau)
  - Add exploration bonus parameter (kappa)

New Components:
  - tau[s,a]: Steps since last visit
  - kappa: Scaling factor for exploration bonus
"""

class DynaQPlusAgent(BaseAgent):
    
    def agent_init(self, agent_info):
        """Initialize Dyna-Q+ agent."""
        
        # Same initialization as Dyna-Q
        try:
            self.num_states = agent_info["num_states"]
            self.num_actions = agent_info["num_actions"]
        except KeyError as err:
            # Fail fast: the agent cannot build its Q-table without these sizes
            raise KeyError("num_states and num_actions required") from err
            
        self.gamma = agent_info.get("discount", 0.95)
        self.step_size = agent_info.get("step_size", 0.1)
        self.epsilon = agent_info.get("epsilon", 0.1)
        self.planning_steps = agent_info.get("planning_steps", 10)
        
        # ============================================================
        # NEW FOR DYNA-Q+: Exploration bonus parameter
        # ============================================================
        self.kappa = agent_info.get("kappa", 0.001)

        # Random number generators
        self.rand_generator = np.random.RandomState(agent_info.get('random_seed', 42))
        self.planning_rand_generator = np.random.RandomState(agent_info.get('planning_random_seed', 42))

        # Initialize Q-values and time tracking
        self.q_values = np.zeros((self.num_states, self.num_actions))
        
        # ============================================================
        # NEW FOR DYNA-Q+: Track time since last visit
        # ============================================================
        self.tau = np.zeros((self.num_states, self.num_actions))
        
        self.actions = list(range(self.num_actions))
        self.past_action = -1
        self.past_state = -1
        self.model = {}

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 4: Dyna-Q+ Model Update</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Modify model update for Dyna-Q+: when visiting a state for the first time,
        add ALL actions to the model (unvisited actions loop back with reward 0).
        <br><br>
        <strong>Hint:</strong> This ensures all actions can be selected during planning.
    </p>
</div>
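
The "complete model" idea can be illustrated outside the agent class. In this sketch the helper name `record_transition_plus` and the toy numbers are assumptions made for illustration; it shows one way the placeholder self-loops might be created, not necessarily the intended solution.

```python
def record_transition_plus(model, actions, s, a, s_next, r):
    """Record (s, a) -> (s_next, r); a newly seen state gets every action."""
    if s not in model:
        # Untried actions are modeled as returning to s with reward 0
        model[s] = {b: (s, 0.0) for b in actions}
    model[s][a] = (s_next, r)              # the real observation replaces the placeholder

model = {}
record_transition_plus(model, range(4), s=0, a=1, s_next=5, r=0.0)
print(model)
# {0: {0: (0, 0.0), 1: (5, 0.0), 2: (0, 0.0), 3: (0, 0.0)}}
```

With every action present, planning can sample pairs the agent has never tried, which is what allows their growing bonus to influence the Q-values.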

In [None]:
%%add_to DynaQPlusAgent

# [HANDS-ON EXERCISE 4]

def update_model(self, past_state, past_action, state, reward):
    """
    Update model with complete action set for new states.
    
    CRITICAL DIFFERENCE from Dyna-Q:
    - When adding a new state, initialize ALL actions
    - Unvisited actions: self-loop with reward 0
    """
    
    if past_state not in self.model:
        self.model[past_state] = {past_action: (state, reward)}
        
        ### START YOUR CODE HERE ### (3 lines)
        # Add all other actions as self-loops with reward 0
        # Hint: Loop through self.actions
        # For actions != past_action: model[past_state][action] = (past_state, 0)
        
        
        
        ### END YOUR CODE HERE ###
    else:
        self.model[past_state][past_action] = (state, reward)

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 5: Planning with Exploration Bonus</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Implement planning with exploration bonus: reward + κ√τ(s,a).
        <br><br>
        <strong>Key insight:</strong> Actions not visited recently get bonus reward during planning,
        encouraging exploration of potentially changed areas.
    </p>
</div>
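
A single simulated update makes the effect visible. The following sketch uses a hypothetical helper `planning_update` on a toy 2x2 table and compares a recently tried pair with one that has been idle for 400 steps; both modeled transitions carry zero reward.

```python
import numpy as np

def planning_update(q, tau, s, a, next_state, reward,
                    kappa=0.001, alpha=0.1, gamma=0.95):
    """One simulated Q-learning update with the kappa * sqrt(tau) bonus."""
    bonus_reward = reward + kappa * np.sqrt(tau[s, a])
    if next_state == -1:                   # terminal: no bootstrap term
        target = bonus_reward
    else:
        target = bonus_reward + gamma * np.max(q[next_state])
    q[s, a] += alpha * (target - q[s, a])

q = np.zeros((2, 2))
tau = np.array([[0.0, 0.0],
                [0.0, 400.0]])             # pair (1, 1) untried for 400 steps

planning_update(q, tau, s=1, a=0, next_state=1, reward=0.0)  # fresh pair
planning_update(q, tau, s=1, a=1, next_state=1, reward=0.0)  # stale pair
print(q[1])    # fresh pair stays at 0, stale pair rises to about 0.002
```

The stale action now looks slightly better than its neighbor purely because of elapsed time, so a greedy policy will eventually try it in the real environment.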

In [None]:
%%add_to DynaQPlusAgent

# [HANDS-ON EXERCISE 5]

def planning_step(self):
    """
    Planning with exploration bonus.
    Adds κ√τ(s,a) to reward during simulated updates.
    """
    
    ### START YOUR CODE HERE ###
    for step in range(self.planning_steps):
        # Step 1: Sample random state
        
        
        # Step 2: Sample random action for that state
        
        
        # Step 3: Get model prediction
        
        
        # Step 4: Add exploration bonus
        # bonus_reward = reward + self.kappa * np.sqrt(self.tau[state, action])
        
        
        # Step 5: Q-learning update with bonus reward
        # Remember to handle terminal states (next_state == -1)
        
        
        
        
        
    ### END YOUR CODE HERE ###

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">🔧 Hands-On Exercise 6: Complete Dyna-Q+ Agent</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        Complete the agent methods with time tracking.
        <br><br>
        <strong>Critical:</strong> Update tau (time since visit) correctly:<br>
        • Increment all tau values each step<br>
        • Reset tau[s,a] = 0 when (s,a) is visited
    </p>
</div>
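
The two bookkeeping rules above amount to two array operations per real step. This standalone sketch replays a made-up visit sequence on a toy 3x2 table to show the resulting counters.

```python
import numpy as np

tau = np.zeros((3, 2))                      # toy table: 3 states, 2 actions
visits = [(0, 1), (1, 0), (0, 1), (2, 1)]   # (state, action) taken at each step

for s, a in visits:
    tau += 1          # every pair gets one step older
    tau[s, a] = 0     # the pair just taken is fresh again

print(tau)
# [[4. 1.]
#  [2. 4.]
#  [4. 0.]]
```

Pairs that were never taken, such as (0, 0), keep counting from the start of the run, so they accumulate the largest bonus.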

In [None]:
%%add_to DynaQPlusAgent

# Helper functions (same as Dyna-Q)
def argmax(self, q_values):
    """Argmax with random tie-breaking."""
    top = float("-inf")
    ties = []
    for i in range(len(q_values)):
        if q_values[i] > top:
            top = q_values[i]
            ties = []
        if q_values[i] == top:
            ties.append(i)
    return self.rand_generator.choice(ties)

def choose_action_egreedy(self, state):
    """ε-greedy action selection."""
    if self.rand_generator.rand() < self.epsilon:
        action = self.rand_generator.choice(self.actions)
    else:
        values = self.q_values[state]
        action = self.argmax(values)
    return action

In [None]:
%%add_to DynaQPlusAgent

# [HANDS-ON EXERCISE 6]

def agent_start(self, state):
    """First action - no tau update yet."""
    
    ### START YOUR CODE HERE ### (2 lines)
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_step(self, reward, state):
    """Main step with time tracking."""
    
    ### START YOUR CODE HERE ###
    # Step 1: Update tau (time since visit)
    # Increment all tau values
    
    
    # Reset tau for visited (s,a) pair
    
    
    # Step 2: Direct RL update
    
    
    # Step 3: Update model
    
    
    # Step 4: Planning
    
    
    # Step 5: Select next action
    
    
    # Step 6: Store state and action
    
    
    ### END YOUR CODE HERE ###
    
    return self.past_action

def agent_end(self, reward):
    """Terminal state handling."""
    
    ### START YOUR CODE HERE ###
    # Update tau
    
    
    
    # Direct RL update
    
    
    # Update model
    
    
    # Planning
    
    
    ### END YOUR CODE HERE ###

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 5: Testing in Changing Environments</h2>
</div>

Now we'll test both algorithms in an environment where a shortcut opens after 3000 steps.
Dyna-Q+ should discover and exploit the new path, while Dyna-Q may get stuck with its old model.

In [None]:
"""
Cell 6: Run Comparison Experiment

Purpose:
  - Compare Dyna-Q vs Dyna-Q+ in changing environment
  - Visualize cumulative reward and state visitations

Expected Results:
  - Dyna-Q+: Discovers shortcut, increased reward rate
  - Dyna-Q: Stuck with old path
"""

# [Additional experiment runner functions would go here]
# [Similar structure to previous experiments but tracking state visitations]

print("Running comparison experiment...")
print("Environment changes at step 3000 (shortcut opens)")
print("\nThis will take 3-4 minutes...\n")

# Run experiments and plot results
# [Code to run and visualize experiments]

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <p style="color: #555; line-height: 1.6; font-size: 13px;">
        <strong>1. Planning Efficiency:</strong> More planning steps per real step reduce the number of episodes needed to reach a near-optimal path<br><br>
        <strong>2. Model-Based Benefits:</strong> Dyna-Q learns from both real and simulated experience<br><br>
        <strong>3. Exploration Bonus:</strong> Dyna-Q+ adapts to environment changes through targeted exploration<br><br>
        <strong>4. Trade-offs:</strong> Planning requires computation but provides sample efficiency<br><br>
        <strong>5. Adaptation:</strong> κ√τ bonus helps discover and exploit new opportunities
    </p>
</div>

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why does Dyna-Q fail to find the shortcut while Dyna-Q+ succeeds?</li>
        <li>How would performance change with different κ values?</li>
        <li>What happens if planning steps are increased to 100 or 1000?</li>
        <li>How could prioritized sweeping improve planning efficiency?</li>
        <li>What are the computational trade-offs of model-based methods?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 8-1: Dyna-Q and Dyna-Q+ Algorithms</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 8-2 - Prioritized Sweeping</p>
</div>