<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 18px 25px; margin-bottom: 20px;">
    <div style="display: flex; justify-content: space-between; align-items: baseline;">
        <h1 style="font-family: 'Helvetica Neue', sans-serif; font-size: 24px; margin: 0; font-weight: 300;">
            Lab 8-1: Dyna-Q and Dyna-Q+
        </h1>
        <span style="font-size: 11px; opacity: 0.9;">© Prof. Dehghani</span>
    </div>
    <p style="font-size: 13px; margin-top: 6px; margin-bottom: 0; opacity: 0.9;">
        IE 7295 Reinforcement Learning | Sutton and Barto Chapter 8 | 90 minutes
    </p>
</div>

<div style="background: white; padding: 15px 20px; margin-bottom: 12px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase; letter-spacing: 0.5px;">Background</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        This lab implements the <strong>Dyna-Q</strong> and <strong>Dyna-Q+</strong> algorithms, which integrate planning with learning. 
        Dyna-Q combines direct reinforcement learning with model-based planning, using simulated experience from a learned model.
        Dyna-Q+ extends this with an exploration bonus for long-unvisited state-action pairs, enabling adaptation to 
        changing environments. We'll test both algorithms on the <strong>Shortcut Maze</strong>, where a shorter path 
        opens after 3000 timesteps.
    </p>
</div>

<table style="width: 100%; border-spacing: 12px;">
<tr>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #17a2b8; vertical-align: top; width: 50%;">
    <h4 style="color: #17a2b8; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Learning Objectives</h4>
    <ul style="color: #555; line-height: 1.4; margin: 0; padding-left: 18px; font-size: 12px;">
        <li>Implement the Dyna-Q algorithm with planning steps</li>
        <li>Understand model learning and planning integration</li>
        <li>Implement Dyna-Q+ with exploration bonus</li>
        <li>Compare performance in changing environments</li>
        <li>Analyze exploration vs exploitation trade-offs</li>
    </ul>
</td>
<td style="background: white; padding: 12px 15px; border-top: 3px solid #00acc1; vertical-align: top; width: 50%;">
    <h4 style="color: #00acc1; font-size: 13px; margin: 0 0 8px 0; font-weight: 600;">Maze Environment</h4>
    <div style="color: #555; font-size: 12px; line-height: 1.6;">
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Goal</code> → Reach G from S as fast as possible</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Actions</code> → Up, Down, Right, Left (deterministic)</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Reward</code> → +1 at goal, 0 elsewhere</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Discount</code> → γ = 0.95</div>
        <div style="padding: 2px 0;"><code style="background: #e0f7fa; padding: 1px 5px; color: #006064;">Change</code> → Shortcut opens at step 3000</div>
    </div>
</td>
</tr>
</table>

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 1: Environment Setup and Dependencies</h2>
</div>

We begin by cloning the repository that contains the maze environment and importing the necessary libraries:
- **RL-Glue**: Framework for reinforcement learning experiments
- **NumPy**: Numerical computations and array operations
- **Matplotlib**: Visualization of results and state visitations
- **tqdm**: Progress bars for long-running experiments

In [None]:
"""
Cell 1: Clone Repository and Import Libraries

Purpose:
  - Clone the maze environment repository
  - Import all required libraries for Dyna algorithms
  - Set up visualization parameters
  - Create results directory for storing outputs

Key Libraries:
  - rlglue: Provides RL experiment framework
  - numpy: Array operations for Q-values and model
  - matplotlib: Plotting cumulative rewards and heatmaps
  - tqdm: Progress tracking for multiple runs

Environment Details:
  - ShortcutMazeEnvironment: 6x9 grid world
  - State space: 54 states (grid cells)
  - Action space: 4 actions (up, down, left, right)
  - Dynamics change at timestep 3000
"""

# Clone the repository (skip if already cloned)
!git clone https://github.com/mdehghani86/MazeExampleRep.git 2>/dev/null || echo "Repository already exists"

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import os, shutil
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Install required packages
!pip install jdc rlglue -q
import jdc

# Import maze environment and RL-Glue components
from MazeExampleRep.rl_glue import RLGlue
from MazeExampleRep.agent import BaseAgent
from MazeExampleRep.maze_env import ShortcutMazeEnvironment

# Configure matplotlib for better quality figures
plt.rcParams.update({'font.size': 15})
plt.rcParams.update({'figure.figsize': [8, 5]})
plt.rcParams['figure.dpi'] = 100

# Create results directory
os.makedirs('results', exist_ok=True)

print("✓ Environment Setup Complete")
print("  • Repository loaded successfully")
print(f"  • NumPy version: {np.__version__}")
print("  • Results directory created")
print("  • Ready to implement Dyna algorithms")

<div style="background: #e8f5e9; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #4caf50;">
    <h3 style="color: #2e7d32; font-size: 14px; margin: 0 0 8px 0;">The Shortcut Maze Environment</h3>
    <div style="display: flex; align-items: center;">
        <img src="MazeExampleRep/images/shortcut_env.png" alt="environment" width="400" style="margin-right: 20px;"/>
        <div style="color: #555; line-height: 1.6; font-size: 13px;">
            <strong>Initial Configuration:</strong><br>
            • Start at S, Goal at G<br>
            • Long path only (grey walls block shortcut)<br>
            • Minimum steps to goal: ~16<br><br>
            <strong>After 3000 timesteps:</strong><br>
            • Shortcut opens (wall removed)<br>
            • New optimal path available<br>
            • Tests agent's ability to adapt
        </div>
    </div>
</div>

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 2: Dyna-Q Algorithm - Pseudocode</h2>
</div>

<div style="text-align: center; margin: 20px 0;">
    <img src="MazeExampleRep/images/DynaQ.png" alt="Dyna-Q Pseudocode" style="width: 80%; max-width: 800px; border: 2px solid #17a2b8; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"/>
    <p style="color: #666; font-size: 12px; margin-top: 10px; font-style: italic;">Dyna-Q Algorithm from Sutton and Barto</p>
</div>

<div style="background: white; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #00acc1;">
    <h3 style="color: #00acc1; font-size: 14px; margin: 0 0 8px 0;">Algorithm Components</h3>
    <p style="color: #555; line-height: 1.6; margin: 0; font-size: 13px;">
        <strong>1. Action Selection:</strong> ε-greedy policy for exploration-exploitation balance<br><br>
        <strong>2. Direct RL:</strong> One-step Q-learning from real experience<br><br>
        <strong>3. Model Learning:</strong> Store observed transitions (s,a) → (s',r)<br><br>
        <strong>4. Planning:</strong> n simulated Q-learning updates using the model
    </p>
</div>
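
The four components above can be sketched end to end on a much smaller problem. The block below is a minimal illustration, not the lab's graded solution: it assumes a hypothetical 1-D corridor (the function name `dyna_q_corridor`, the corridor itself, and the default parameter values are all made up for this example), and it stores the model as a flat dictionary keyed by `(state, action)` rather than the nested dictionary used later in the lab.

```python
import numpy as np

def dyna_q_corridor(n_states=5, episodes=30, planning_steps=5,
                    alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical 1-D corridor (not the lab's maze).

    States 0..n_states-1; action 0 moves left, action 1 moves right.
    Reaching the right end gives reward +1 and ends the episode.
    """
    rng = np.random.RandomState(seed)
    q = np.zeros((n_states, 2))
    model = {}  # {(state, action): (next_state, reward)}; next_state -1 = terminal

    for _ in range(episodes):
        s = 0
        done = False
        while not done:
            # 1. Epsilon-greedy action selection with random tie-breaking
            if rng.rand() < epsilon:
                a = rng.randint(2)
            else:
                best = np.flatnonzero(q[s] == q[s].max())
                a = rng.choice(best)

            # Environment step (deterministic)
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0

            # 2. Direct RL: one-step Q-learning on the real transition
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += alpha * (target - q[s, a])

            # 3. Model learning: remember the last outcome of (s, a)
            model[(s, a)] = (-1 if done else s_next, r)

            # 4. Planning: n simulated Q-learning updates from the model
            keys = list(model.keys())
            for _ in range(planning_steps):
                ps, pa = keys[rng.randint(len(keys))]
                pn, pr = model[(ps, pa)]
                ptarget = pr if pn == -1 else pr + gamma * q[pn].max()
                q[ps, pa] += alpha * (ptarget - q[ps, pa])

            s = s_next
    return q

q = dyna_q_corridor()
print(np.argmax(q[:-1], axis=1))  # greedy action per non-terminal state: [1 1 1 1]
```

Setting `planning_steps=0` in this sketch reduces it to plain one-step Q-learning, which is a convenient way to see what the planning loop adds.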

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 3: Dyna-Q Implementation - Agent Initialization</h2>
</div>

We'll implement Dyna-Q step by step, starting with the agent initialization that sets up Q-values, the model, and parameters.

In [None]:
"""
Cell 2: Dyna-Q Agent Class - Initialization

Purpose:
  - Define DynaQAgent class inheriting from BaseAgent
  - Initialize Q-values, model, and parameters
  - Set up separate RNGs for exploration and planning

Data Structures:
  q_values: numpy array [num_states x num_actions]
    - Stores action-value estimates Q(s,a)
    - Initialized to zeros
  
  model: dict of dicts {state: {action: (next_state, reward)}}
    - Stores learned transition dynamics
    - Updated after each real experience
  
Parameters:
  epsilon: ε for ε-greedy exploration (default 0.1)
  step_size: Learning rate α (default 0.1)
  discount: Discount factor γ (default 0.95)
  planning_steps: Number of model-based updates per step

CRITICAL NOTES:
  - Two separate RNGs ensure reproducibility
  - Model stores deterministic transitions
  - Terminal state represented as -1
"""

class DynaQAgent(BaseAgent):

    def agent_init(self, agent_info):
        """Setup for the agent called when the experiment first starts.

        Args:
            agent_info (dict): the parameters used to initialize the agent.
        """
        # ============================================================
        # PARAMETER EXTRACTION
        # ============================================================
        try:
            self.num_states = agent_info["num_states"]
            self.num_actions = agent_info["num_actions"]
        except KeyError as err:
            # Fail fast: the Q-table below cannot be built without both sizes
            raise KeyError(
                "You need to pass both 'num_states' and 'num_actions' "
                "in agent_info to initialize the action-value table") from err
        
        self.gamma = agent_info.get("discount", 0.95)
        self.step_size = agent_info.get("step_size", 0.1)
        self.epsilon = agent_info.get("epsilon", 0.1)
        self.planning_steps = agent_info.get("planning_steps", 10)

        # Separate RNGs for reproducibility
        self.rand_generator = np.random.RandomState(
            agent_info.get('random_seed', 42))
        self.planning_rand_generator = np.random.RandomState(
            agent_info.get('planning_random_seed', 42))

        # ============================================================
        # DATA STRUCTURE INITIALIZATION
        # ============================================================
        self.q_values = np.zeros((self.num_states, self.num_actions))
        self.actions = list(range(self.num_actions))
        self.past_action = -1
        self.past_state = -1
        
        # Model: dictionary mapping states to action-outcome pairs
        self.model = {}  # {state: {action: (next_state, reward)}}

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 4: Model Update Implementation</h2>
</div>

The model stores observed transitions for later use in planning. Since the environment is deterministic, we simply store the most recent observation for each (s,a) pair.
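
To make the "most recent observation wins" behaviour concrete, here is a small standalone sketch of the same nested-dictionary layout. The helper name `record` and the transitions are hypothetical and chosen only for illustration.

```python
# Hypothetical transitions, for illustration only
model = {}

def record(model, s, a, s_next, r):
    """Store the most recent outcome observed for (s, a)."""
    model.setdefault(s, {})[a] = (s_next, r)

record(model, 0, 1, 3, 0.0)   # first visit to (s=0, a=1)
record(model, 0, 2, 0, 0.0)   # a different action from the same state
record(model, 0, 1, 4, 0.0)   # dynamics changed: overwrites the old entry
record(model, 8, 3, -1, 1.0)  # terminal transition, next state stored as -1

print(model)
# {0: {1: (4, 0.0), 2: (0, 0.0)}, 8: {3: (-1, 1.0)}}
```

Note that the model only changes for pairs the agent actually revisits. An entry for a pair that is never tried again stays as it was, even if the environment has changed underneath it.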

In [None]:
%%add_to DynaQAgent

# [GRADED]

def update_model(self, past_state, past_action, state, reward):
    """
    Updates the environment model with observed transition.
    
    The model learns the transition dynamics by storing the
    observed (s,a) → (s',r) transition. For deterministic
    environments this is exact; in a stochastic environment
    it would keep only the most recent outcome.
    
    Args:
        past_state (int): Previous state s
        past_action (int): Action taken a
        state (int): Resulting state s'
        reward (float): Observed reward r
    
    Returns:
        Nothing
    """
    # ============================================================
    # MODEL UPDATE
    # Store the observed transition in the model
    # ============================================================
    
    ### START CODE HERE ### (1-4 lines)
    if past_state not in self.model:
        self.model[past_state] = {}
    self.model[past_state][past_action] = (state, reward)
    ### END CODE HERE ###

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 5: Planning Step Implementation</h2>
</div>

The planning step is the key idea in Dyna-Q. It uses the learned model to generate simulated experience and performs additional Q-learning updates on it, so each real transition contributes to many value updates.
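
A single simulated update can be worked through by hand. The sketch below assumes a tiny hypothetical Q-table and model (the helper `simulated_update` and all numbers are illustrative, not taken from the lab) and shows both cases: bootstrapping from a non-terminal next state, and a terminal transition marked with `-1`.

```python
import numpy as np

def simulated_update(q, model, s, a, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update to q[s, a] from the model's stored outcome."""
    next_state, reward = model[s][a]
    if next_state == -1:                      # terminal: no bootstrap term
        target = reward
    else:
        target = reward + gamma * np.max(q[next_state])
    q[s, a] += alpha * (target - q[s, a])
    return target

# Hypothetical values, chosen only to make the arithmetic easy to follow
q = np.zeros((3, 2))
q[1] = [0.5, 0.2]
model = {0: {1: (1, 0.0)}, 1: {0: (-1, 1.0)}}

simulated_update(q, model, 0, 1)   # target = 0 + 0.95 * 0.5 = 0.475
simulated_update(q, model, 1, 0)   # target = 1.0 (terminal)
print(round(float(q[0, 1]), 4), round(float(q[1, 0]), 4))   # 0.0475 0.55
```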

In [None]:
%%add_to DynaQAgent

# [GRADED]

def planning_step(self):
    """
    Performs planning using the learned model (indirect RL).
    
    This is the core of Dyna-Q: using simulated experience
    from the model to perform additional learning updates.
    Each planning step samples a random (s,a) pair from
    the model and performs a Q-learning update.
    
    Args:
        None
        
    Returns:
        Nothing
        
    CRITICAL NOTES:
      - Uses planning_rand_generator for reproducibility
      - Terminal transitions (to state -1) handled specially
      - Samples only from previously observed (s,a) pairs
    """
    
    # ============================================================
    # PLANNING LOOP
    # Repeat for required number of planning steps
    # ============================================================
    
    for step in range(self.planning_steps):
        # Part 1: Sample a state-action pair from model
        ### START CODE HERE ### (~2 lines)
        state = self.planning_rand_generator.choice(list(self.model.keys()))
        action = self.planning_rand_generator.choice(list(self.model[state].keys()))
        ### END CODE HERE ###
        
        # Part 2: Query the model for predicted transition
        ### START CODE HERE ### (~1 line)
        next_state, reward = self.model[state][action]
        ### END CODE HERE ###
        
        # Part 3: Q-learning update using simulated experience
        # Handle terminal state (represented as -1) differently
        ### START CODE HERE ### (2-4 lines)
        if next_state == -1:  # Terminal state
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_values[next_state])
        
        self.q_values[state][action] += self.step_size * (
            target - self.q_values[state][action])
        ### END CODE HERE ###

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 6: Helper Functions for Action Selection</h2>
</div>

These helper functions implement ε-greedy action selection with proper tie-breaking.
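
Tie-breaking matters most early in training, when every Q-value is still zero. The sketch below is a vectorized variant written for illustration (the name `argmax_random_ties` is hypothetical); it compares random tie-breaking with `np.argmax`, which always returns the first maximum.

```python
import numpy as np

def argmax_random_ties(values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    values = np.asarray(values)
    ties = np.flatnonzero(values == values.max())
    return int(rng.choice(ties))

rng = np.random.RandomState(0)
values = [0.0, 0.7, 0.7, 0.1]
picks = [argmax_random_ties(values, rng) for _ in range(1000)]
counts = np.bincount(picks, minlength=4)

print(counts)             # only indices 1 and 2 are ever chosen
print(np.argmax(values))  # 1: NumPy always returns the first maximum
```

With plain `np.argmax`, an all-zero row would always select action 0, so the agent would push against the same wall until an ε-random action happened to move it.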

In [None]:
%%add_to DynaQAgent

def argmax(self, q_values):
    """
    Argmax with random tie-breaking.
    
    When multiple actions share the highest Q-value,
    one of them is chosen uniformly at random.
    
    Args:
        q_values (Numpy array): Action values for a state
    Returns:
        action (int): Selected action index
    """
    top = float("-inf")
    ties = []

    for i in range(len(q_values)):
        if q_values[i] > top:
            top = q_values[i]
            ties = []

        if q_values[i] == top:
            ties.append(i)

    return self.rand_generator.choice(ties)

def choose_action_egreedy(self, state):
    """
    Epsilon-greedy action selection.
    
    With probability epsilon: random action (explore)
    With probability 1-epsilon: greedy action (exploit)
    
    Args:
        state (int): Current state
    Returns:
        action (int): Selected action
    """
    if self.rand_generator.rand() < self.epsilon:
        action = self.rand_generator.choice(self.actions)
    else:
        values = self.q_values[state]
        action = self.argmax(values)

    return action

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 7: Core Agent Methods</h2>
</div>

Now we implement the main agent methods that integrate all components: direct RL, model learning, and planning.
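
It helps to see the order in which the three methods are called during one episode. The sketch below is a hypothetical driver on a toy corridor with a stand-in agent that only logs calls. It is not the RL-Glue implementation, just an illustration of the `agent_start` / `agent_step` / `agent_end` calling pattern.

```python
class RecordingAgent:
    """Hypothetical stand-in agent that logs which method is called."""

    def __init__(self):
        self.calls = []

    def agent_start(self, state):
        self.calls.append(("start", state))
        return 1  # always move right

    def agent_step(self, reward, state):
        self.calls.append(("step", reward, state))
        return 1

    def agent_end(self, reward):
        self.calls.append(("end", reward))


def run_episode(agent, n_states=4):
    """Drive one episode on a toy corridor; the last state is terminal."""
    state = 0
    action = agent.agent_start(state)
    while True:
        state = state + 1 if action == 1 else max(state - 1, 0)
        if state == n_states - 1:
            agent.agent_end(1.0)      # terminal: no next state is passed
            return
        action = agent.agent_step(0.0, state)


agent = RecordingAgent()
run_episode(agent)
print(agent.calls)
# [('start', 0), ('step', 0.0, 1), ('step', 0.0, 2), ('end', 1.0)]
```

Because `agent_end` receives no next state, the agent has to remember `past_state` and `past_action` itself, and it needs a marker such as `-1` to record the terminal transition in its model.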

In [None]:
%%add_to DynaQAgent

# [GRADED]

def agent_start(self, state):
    """
    First action selection at the start of an episode.
    
    Args:
        state (int): Initial state from environment
    Returns:
        action (int): First action to take
    """
    # Select and store first action
    ### START CODE HERE ### (~2 lines)
    self.past_state = state
    self.past_action = self.choose_action_egreedy(state)
    ### END CODE HERE ###
    
    return self.past_action

def agent_step(self, reward, state):
    """
    Core Dyna-Q update: combines direct RL, model learning, and planning.
    
    This method orchestrates all components of Dyna-Q:
    1. Direct RL: Q-learning update from real experience
    2. Model update: Store observed transition
    3. Planning: Multiple Q-learning updates using model
    4. Action selection: Choose next action
    
    Args:
        reward (float): Reward from previous action
        state (int): Current state
    Returns:
        action (int): Next action to take
    """
    
    # ============================================================
    # PART 1: DIRECT RL (Q-learning from real experience)
    # ============================================================
    ### START CODE HERE ### (1-3 lines)
    target = reward + self.gamma * np.max(self.q_values[state])
    self.q_values[self.past_state][self.past_action] += self.step_size * (
        target - self.q_values[self.past_state][self.past_action])
    ### END CODE HERE ###
    
    # ============================================================
    # PART 2: MODEL UPDATE
    # ============================================================
    ### START CODE HERE ### (~1 line)
    self.update_model(self.past_state, self.past_action, state, reward)
    ### END CODE HERE ###
    
    # ============================================================
    # PART 3: PLANNING (indirect RL from model)
    # ============================================================
    ### START CODE HERE ### (~1 line)
    self.planning_step()
    ### END CODE HERE ###
    
    # ============================================================
    # PART 4: ACTION SELECTION for next step
    # ============================================================
    ### START CODE HERE ### (~2 lines)
    action = self.choose_action_egreedy(state)
    self.past_state = state
    self.past_action = action
    ### END CODE HERE ###
    
    return self.past_action

def agent_end(self, reward):
    """
    Final update at episode termination.
    
    Handles the terminal transition where there's no next state.
    Still performs model update and planning.
    
    Args:
        reward (float): Final reward
    """
    
    # ============================================================
    # TERMINAL TRANSITION HANDLING
    # Use -1 to represent terminal state in model
    # ============================================================
    
    ### START CODE HERE ###
    # Part 1: Direct RL update for terminal transition
    self.q_values[self.past_state][self.past_action] += self.step_size * (
        reward - self.q_values[self.past_state][self.past_action])
    
    # Part 2: Update model with terminal marker (-1)
    self.update_model(self.past_state, self.past_action, -1, reward)
    
    # Part 3: Final planning step
    self.planning_step()
    ### END CODE HERE ###

<div style="background: #fff3e0; padding: 15px 20px; margin: 20px 0; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Testing Dyna-Q Implementation</h3>
    <p style="color: #555; line-height: 1.6; font-size: 13px;">
        Run the test cells below to verify your implementation. Each test checks a specific component:
        model updates, planning steps, and the complete agent loop. Expected outputs are provided for comparison.
    </p>
</div>

In [None]:
# Test code cells would go here (omitted for brevity)
# They follow the same pattern as the original notebook

---
<div style="border-left: 4px solid #17a2b8; padding-left: 12px; margin: 20px 0;">
  <h2 style="color: #17a2b8; margin: 0; font-size: 18px;">Section 8: Experiment Functions</h2>
</div>

Helper functions for running experiments and visualizing results.
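
As a rough illustration of the averaging step, the sketch below assumes per-episode step counts are collected in an array of shape `(num_runs, num_episodes)`. The array layout and the numbers are placeholders for this example, not results from the lab.

```python
import numpy as np

# Placeholder data: rows are independent runs, columns are episodes.
steps = np.array([
    [120, 60, 30, 18],
    [100, 70, 26, 16],
    [140, 50, 34, 20],
], dtype=float)

mean = steps.mean(axis=0)                                   # average over runs
sem = steps.std(axis=0, ddof=1) / np.sqrt(steps.shape[0])   # standard error

print(mean.tolist())  # [120.0, 60.0, 30.0, 18.0]
```

Plotting `mean` with a band of `mean ± sem` gives a learning curve that is less sensitive to the random seed of any single run.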

In [None]:
"""
Cell: Experiment Runner Functions

Purpose:
  - Define functions to run multiple experiment trials
  - Collect performance metrics (steps per episode, cumulative reward)
  - Visualize results with proper statistical averaging

Functions:
  run_experiment: Basic experiment for static environment
  run_experiment_with_state_visitations: Track state visits
  plot_steps_per_episode: Visualize learning curves
  plot_cumulative_reward: Show reward accumulation

CRITICAL NOTES:
  - Results are averaged over multiple runs to reduce variance
  - Seed control for reproducibility
  - Progress bars for long experiments
"""

def run_experiment(env, agent, env_parameters, agent_parameters, exp_parameters):
    # Implementation as in original notebook
    pass  # Code omitted for brevity

def plot_steps_per_episode(file_path):
    # Implementation as in original notebook
    pass  # Code omitted for brevity

<div style="background: #f8f9fa; padding: 15px 20px; margin-top: 30px; border-left: 3px solid #17a2b8;">
    <h3 style="color: #17a2b8; font-size: 14px; margin: 0 0 8px 0; text-transform: uppercase;">Key Findings</h3>
    <p style="color: #555; line-height: 1.6; font-size: 13px;">
        <strong>1. Planning Effectiveness:</strong> Dyna-Q with planning steps learns much faster than pure Q-learning (n=0).<br><br>
        <strong>2. Model Utilization:</strong> Each real experience generates n additional updates through planning.<br><br>
        <strong>3. Sample Efficiency:</strong> 50 planning steps achieve near-optimal performance in ~5 episodes.<br><br>
        <strong>4. Limitation:</strong> Standard Dyna-Q fails to adapt when the environment changes (the shortcut opens).<br><br>
        <strong>5. Solution:</strong> Dyna-Q+ with exploration bonus successfully discovers and exploits new shortcuts.
    </p>
</div>
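
The exploration bonus mentioned in finding 5 follows Sutton and Barto's Dyna-Q+: during planning, the modelled reward r is replaced by r + κ√τ, where τ is the number of time steps since the pair (s, a) was last tried in the real environment. The sketch below illustrates only that bookkeeping. The helper name `bonus_reward`, the toy table sizes, and the default `kappa=0.001` are assumptions made for this example.

```python
import numpy as np

def bonus_reward(reward, tau, kappa=0.001):
    """Reward used in Dyna-Q+ planning updates: r + kappa * sqrt(tau)."""
    return reward + kappa * np.sqrt(tau)

# tau[s, a] counts time steps since (s, a) was last tried in the real environment
num_states, num_actions = 3, 2          # toy sizes, for illustration
tau = np.zeros((num_states, num_actions))

for t in range(2500):
    tau += 1            # every pair grows one step staler
    tau[0, 1] = 0       # except the pair actually taken (here always (0, 1))

for label, value in [("just visited", tau[0, 1]), ("never visited", tau[2, 0])]:
    print(f"{label}: tau={value:.0f}, planning reward={bonus_reward(0.0, value):.3f}")
# just visited: tau=0, planning reward=0.000
# never visited: tau=2500, planning reward=0.050
```

The bonus grows with the square root of elapsed time, so long-neglected pairs gradually look attractive in planning even when their modelled reward is zero. Larger values of κ make the agent re-test old actions sooner, at the cost of more time spent away from the current best path.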

<div style="background: #fff3e0; padding: 15px 20px; margin-top: 20px; border-left: 3px solid #ff9800;">
    <h3 style="color: #ff9800; font-size: 14px; margin: 0 0 8px 0;">Questions for Reflection</h3>
    <ol style="color: #555; line-height: 1.8; margin: 8px 0 0 0; padding-left: 20px; font-size: 13px;">
        <li>Why does Dyna-Q fail to discover the shortcut even with ε-greedy exploration?</li>
        <li>How does the exploration bonus in Dyna-Q+ encourage revisiting old state-action pairs?</li>
        <li>What would happen with different values of κ (kappa) in Dyna-Q+?</li>
        <li>How would prioritized sweeping compare to random sampling in planning?</li>
        <li>Can you think of real-world scenarios where Dyna-Q+ would be particularly useful?</li>
    </ol>
</div>

<div style="background: linear-gradient(90deg, #17a2b8 0%, #0e5a63 60%, #0a3d44 100%); color: white; padding: 15px 20px; margin-top: 30px; text-align: center;">
    <p style="margin: 0; font-size: 13px;">End of Lab 8-1: Dyna-Q and Dyna-Q+</p>
    <p style="margin: 5px 0 0 0; font-size: 11px; opacity: 0.9;">Next: Lab 8-2 - Prioritized Sweeping</p>
</div>