
2D Hide-and-Seek Multi-Agent Reinforcement Learning ST455 Project
------------------------------------

_Authors: 37140, 49537, 49212_

# 1. Introduction

In this project, we explore a 2D adaptation of the popular 'hide and seek' game with inverted rules; only one hider to start with, and multiple seekers. The end goal would be to be the single hider remaining at the end of the episode. We modelled the development of this game on a grid world using multi-agent reinforcement learning (MARL), and we used techniques learnt in this course, as well as other resources, to 'train' hiders and seekers to improve as compared to their completely random initial states.

We design a 2D $10×10$ grid world with one **hider** and four **seekers**, initialised with randomly placed obstacles at a certain 'frequency' percentage of randomness. We set this to be $10%$ and agents only observe their local neighbourhood depending on their role (Manhattan distances of either $2$ or $3$). Seekers can share information about new cells in their vision range or hider sightings, and they receive an intrinsic “curiosity” bonus for exploring unvisited locations. The hider, in turn, is rewarded for each time step it remains uncaught.

### Rules of the Game:
  1. Agents & Roles
     - One “hider” and four “seekers” occupy free cells on the grid.
     - The capturing seeker becomes the new hider (with short invisibility), and the former hider becomes a seeker upon capture.  
    <div style="margin-bottom:1em;"></div>
  2. Grid & Obstacles
     - Square grid of size $N \times N$, with $N = 10$ with randomly placed walls.
     - Agents may only occupy and move through non-wall cells.
    <div style="margin-bottom:1em;"></div>
  3. Movement
     - At each time step, every agent chooses one of five actions: move up, down, left, right, or stay.
     - Moves that would leave the grid, hit a wall, or collide with another agent are invalid, and the agent remains in place.
    <div style="margin-bottom:1em;"></div>
  4. Vision & Observation
     - Seekers have a Manhattan distance vision range of $R = 2$: they observe all cells within $R$, including obstacles and any visible hider.
     - If any seeker sees the hider who is not in its invisibility period, capture is immediate.
    <div style="margin-bottom:1em;"></div>
  5. Shared Belief & Communication
     - Seekers share a central belief map over all grid cells.
     - Belief is reset to full confidence when a sighting (but not a capture) occurs. It is broadcast to all seekers.
     - Each sighting and each new cell exploration incurs a small communication cost per agent, deducted from the reward.
    <div style="margin-bottom:1em;"></div>
  6. Rewards
     - Seekers:
         1. +CaptureReward for catching the hider.
         2. +ExploreReward for each newly discovered cell.
         3. –TimePenalty each time step.
         4. –CommCost per communication event.
     - Hider:
         1. +SurviveReward for each time step it remains uncaught.
        <div style="margin-bottom:1em;"></div>
  7. Intrinsic Motivation
     - Seekers track how often they visit each cell.
     - Each visit grants a curiosity bonus: $\beta / \sqrt{\text{visits} + 1}$, encouraging exploration of new areas.
    <div style="margin-bottom:1em;"></div>
  8. Episode Termination
     - The episode ends immediately when a maximum number of time steps (initialised as $100$) is reached.

To evaluate their performance, we track key metrics over 2,000 training episodes. We later provide a comparison with a different seed of experiments, which exhibit very useful information for analysis, from comparison to the main learning, as well as for evaluation of the quality of our implementation as a group. 

A quick summary of the initial learning we implemented during the project, as well as heatmap visualisations representing movement after 1000 episodes of learning are shown below, as well as the intuitive reasoning behind them. A detailed discussion and a critical analysis of the results are included in sections 5 and 6.

<img src="training_outputs/plot_7.png" width="900"/>  
We notice a lot of correlations between certain aspects of the game. For example, average role switches per episode follow episode reward variance very closely, which was seen with the more complex learning algorithm too. As well as this, we can see a decrease in action selection variance, which hinted towards some convergence with policy, which did appear somewhat weakly with our agents trained using Double Q Learning. 

<div style="margin-bottom:1em;"></div>

We found in general that due to the nature of MARL games, convergence and proof of learning are not immediately explicit, and though certain aspects of learning were present, the use of higher-powered GPUs or longer running times would potentially have made outcomes more prominent. We observe that after initial noisy fluctuations, the long-term behaviours converge and remain stable by around 2000 training episodes.


<div style="margin-bottom:1em;"></div>

**Visit heatmaps after training**  
<div style="display: flex; gap: 10px;">
  <img src="training_outputs/training_output_1.png" alt="Agent 0 Visits" width="280"/>
  <img src="training_outputs/training_output_2.png" alt="Agent 1 Visits" width="280"/>
  <img src="training_outputs/training_output_3.png" alt="Agent 2 Visits" width="280"/>
</div>
<div style="display: flex; gap: 10px; margin-top: 10px;">
  <img src="training_outputs/training_output_4.png" alt="Agent 3 Visits" width="280"/>
  <img src="training_outputs/training_output_5.png" alt="Agent 4 Visits" width="280"/>
</div>
   After 1000 episodes, seekers exhibit distinct exploration patterns. For example, some agents focus on corridors near spawn points, while others explore more towards the edges of the maps.

**Research questions:**  

- *How does seekers' 'intrinsic curiosity' affect its coverage? How efficient is this with communication costs?*  
- *Does Double Q or Actor–Critic yield stronger multi-agent coordination in this environment?*  

We now describe the environment and data and implement our algorithms. We then present a rigorous numerical evaluation and analysis.

# 2. Soundness of Solution Concepts 

We explore two key reinforcement learning methods: Double Q-learning and Actor-Critic. Alongside this, we extend our understanding of the solution concepts by implementing an enhanced reward structure, prioritised learning experience, target network updates for stable learning, shared belief maps, and intrinsic curiosity. We justify some of these choices below, with a more detailed implementation description provided in section 3:

**2.1 Double Q-Learning**  
"Double Q-learning decouples action selection from value estimation by maintaining two independent Q-networks, thereby mitigating the overestimation bias inherent in classical Q-learning" (*van Hasselt et al., 2016*).  
- *Rationale:* In partially observable hide-and-seek, noisy reward signals (possibly stemming from sporadic captures) can lead vanilla Q-learning to overvalue untested actions. Double Q-learning’s bias correction promotes more stable convergence (if convergence occurs).
- *Analysis:* While bias reduction is beneficial, Double Q-learning can under-estimate values, potentially slowing down the exploitation of genuinely good policies. In practice, we observed slightly slower early convergence than vanilla Q, suggesting a trade-off between stability and sample efficiency.

**2.2 Actor-Critic**  
"Actor-critic methods combine a parameterised policy (the Actor) with a learned value-function baseline (the Critic), yielding low-variance gradient estimates and natural handling of continuous or large action spaces" (*Sutton et al., 2000*).  
- *Rationale:* The hide-and-seek environment can involve continuous communication actions (e.g. broadcasting belief states) or movement choices. Policy gradients allow smooth adjustments of stochastic policies, improving exploration in high-dimensional action settings.  
- *Critical Analysis:* Actor-critic methods suffer from high variance in gradient estimates and can be sensitive to hyperparameters (e.g. learning rates). In our experiments, tuning was more delicate than for Q-based agents, and some runs showed policy collapse.

**2.3 Shared Belief Maps**  
We implement a centralised “belief map” that aggregates partial observations from multiple agents into a common spatial estimate of opponent positions.  
- *Rationale:* Partial observability hinders each agent’s situational awareness. By sharing beliefs, seekers can coordinate better avoiding duplicate searches and corner traps without requiring fully centralised control.  
- *Critical Analysis:* While belief sharing improves team performance, it introduces communication overhead and a single point of failure: corrupted beliefs (due to sensor noise) can mislead all agents.

**2.4 Intrinsic Curiosity Modules (ICM)**  
"ICM augments the external reward with an internal “novelty bonus” based on the prediction error of a learned forward model, encouraging agents to explore unfamiliar states" (*Pathak et al., 2017*).  
- *Rationale:* In tasks like hide-and-seek—where rewards are somewhat low and captures are infrequent, extrinsic learning can stagnate, and we noticed some phases of inaction where seekers too far from the hider would just bounce between $2$ adjacent squares. Curiosity rewards drive agents to cover the environment, discovering better hiding or seeking spots.  
- *Critical Analysis:* Curiosity can lead to over-exploring irrelevant regions, especially in settings where novelty does not necessarily correlate with winning. We found that excessive intrinsic reward weight delayed convergence, so we had to focus on balancing of curiosity vs. task reward.

# 3. Implementation

In [500]:
import numpy as np
import pandas as pd
import random
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Optional
import matplotlib.pyplot as plt

try:
    from IPython import get_ipython
    get_ipython().run_line_magic('matplotlib', 'inline')
except Exception:
    pass

import matplotlib.animation as animation
import unittest


@dataclass
class Config:
    """
    Holds all adjustable parameters for the hide-and-seek simulation.

    Attributes:
      seed: random seed for reproducibility
      grid_size: size of the square grid world
      num_agents: total agents (1 hider + seekers)
      t_max: max time steps per episode
      num_episodes: number of episodes
      wall_prob: probability each cell is a wall
      alpha, gamma: learning rate, discount factor
      epsilon_*: parameters for epsilon-greedy
      alpha_actor, alpha_critic: actor-critic learning rates
      beta_intrinsic: weight for intrinsic (novelty) rewards
      use_intrinsic: toggle for intrinsic rewards
      capture_reward: reward when a seeker captures hider
      survive_reward: reward per timestep for hider survival
      explore_reward: reward seekers get for exploring new cells
      timestep_penalty: small step penalty to encourage efficiency
      comm_cost: cost per communication event
      memory_span: not used directly but placeholder
      belief_decay: decay rate for belief confidence
      seeker_vision: vision radius for seekers
      hider_vision: vision radius for hider (larger than seeker)
      grace_steps: invisible steps after role switch
      distance_reward: reward for maintaining optimal distance
      role_switch_reward: reward for successful role switch
      coordination_reward: reward for coordinated actions
      alpha_decay: learning rate decay
      alpha_min: minimum learning rate
      replay_buffer_size: size of experience replay buffer
      batch_size: batch size for experience replay
      target_update_freq: frequency of target network updates
    """
    seed: int = 42
    grid_size: int = 10
    num_agents: int = 5
    t_max: int = 100
    num_episodes: int = 2000
    wall_prob: float = 0.1
    alpha: float = 0.3
    gamma: float = 0.95
    epsilon_start: float = 1.0
    epsilon_min: float = 0.15
    epsilon_decay: float = 0.9995
    alpha_actor: float = 0.3
    alpha_critic: float = 0.3
    alpha_actor_decay: float = 0.9995
    alpha_critic_decay: float = 0.9995
    beta_intrinsic: float = 2.0
    use_intrinsic: bool = True
    capture_reward: float = 150.0
    survive_reward: float = 3.0
    explore_reward: float = 15.0
    timestep_penalty: float = -0.2
    comm_cost: float = 0.00
    memory_span: int = 20
    belief_decay: float = 0.95
    seeker_vision: int = 1
    hider_vision: int = 3
    grace_steps: int = 3
    distance_reward: float = 1.0
    role_switch_reward: float = 30.0
    coordination_reward: float = 8.0
    alpha_decay: float = 0.9995
    alpha_min: float = 0.05
    replay_buffer_size: int = 20000
    batch_size: int = 64
    target_update_freq: int = 200


# Initialise global configuration and random seeds
tools_cfg = Config()
random.seed(tools_cfg.seed)
np.random.seed(tools_cfg.seed)

# Define possible agent actions: (dx, dy)
actions: List[Tuple[int, int]] = [
    (0, 1),   # move right
    (1, 0),   # move down
    (0, -1),  # move left
    (-1, 0),  # move up
    (0, 0),   # stay in place
]
num_actions = len(actions)


### SharedMemory Class

The `SharedMemory` class provides centralised storage for seeker coordination, tracking explored cells, hider sightings, belief distributions, and communication metrics. It maintains the following state variables:

1. **Grid Size**  
   Integer `grid_size` defining the dimensions of the square grid world.

2. **Explored Cells**  
   A `set` of coordinates `explored` tracking cells visited by seekers.

3. **Seeker Positions**  
   A `set` of agent IDs `seeker_positions` tracking active seekers.

4. **Last Sighting**  
   Tuple `last_seen_pos` and integer `last_seen_time` recording the most recent hider observation.

5. **Confidence**  
   Float `confidence` measuring belief certainty, decaying over time.

6. **Belief Distribution**  
   Array `belief` of shape $(N,N)$ representing probability distribution over hider locations.

7. **Communication Counters**  
   Dictionaries `comm_count` and `total_comm_count` tracking message frequency.

8. **Visit History**  
   Dictionary `visit_count` mapping $(aid,pos)$ pairs to visit frequencies.

---

#### Key Methods

1. *_\_init__(self, grid_size: int)*  
   - Initializes `grid_size` and calls `clear_episode()` and `clear_step()`.  
    <div style="margin-bottom:1em;"></div>

2. *clear_episode(self)*  
   - Resets all episode-level state:
     - `explored = set()`
     - `seeker_positions = set()`
     - `last_seen_pos = None`, `last_seen_time = -1`
     - `confidence = 0.0`
     - `belief = \frac{1}{N^2}`
     - `total_comm_count = 0`
     - `visit_count = {}`  
    <div style="margin-bottom:1em;"></div>

3. *clear_step(self)*  
   - Resets `comm_count = {}` for the new time step.  
    <div style="margin-bottom:1em;"></div>

4. *record_seeker(self, aid: int)*  
   - Adds `aid` to `seeker_positions`.  
    <div style="margin-bottom:1em;"></div>

5. *record_sighting(self, aid: int, pos: Tuple[int,int], t: int, conf: float = 1.0)*  
   - Ensures `aid` is in `seeker_positions`
   - Increments `comm_count[aid]` and `total_comm_count`, then does likewise for all other seekers
   - Sets `last_seen_pos = pos`, `last_seen_time = t`, `confidence = conf`
   - Collapses `belief` to zero everywhere except `belief[pos]=1`  
    <div style="margin-bottom:1em;"></div>

6. *record_explored(self, aid: int, cells: Set[Tuple[int,int]]) → Set[Tuple[int,int]]*  
   - Computes `new_cells = cells – explored`
   - If nonempty, increments `comm_count[aid]`, `total_comm_count`, and updates `explored`
   - Returns `new_cells`  
    <div style="margin-bottom:1em;"></div>

7. *record_visit(self, aid: int, pos: Tuple[int,int]) → int*  
   - Increments and returns `visit_count[(aid,pos)]`
   - Supports curiosity bonus of $\frac{\beta}{\sqrt{\text{visits}+1}}$  
    <div style="margin-bottom:1em;"></div>

8. *cost_and_reset(self, aid: int) → float*  
   - Computes `cost = comm_count[aid] * \text{CommCost}`
   - Resets `comm_count[aid]=0`
   - Returns `cost`  
    <div style="margin-bottom:1em;"></div>

9. *decay_confidence(self, t: int)*  
   - If `last_seen_time<0` does nothing
   - Else computes $\Delta t = t - \text{last\_seen\_time}$ and updates  
     $$
     \gamma \leftarrow \gamma \times (\text{decay})^{\Delta t}
     $$  
    <div style="margin-bottom:1em;"></div>

10. *diffuse_belief(self)*  
    - Rolls `belief` up/down/left/right, zeros out wrapped edges, averages all five arrays, then renormalises so $\sum_{ij}B_{ij}=1$  
    <div style="margin-bottom:1em;"></div>

11. *get_belief(self, t: int, top_n: Optional[int] = None) → List[Tuple[Tuple[int,int],float]]*  
    - Computes `weighted = (belief * confidence).flatten()`
    - If `top_n` is `None` or $\geq$ length, sorts all indices by descending weight
    - Else uses partial sort on the top `top_n`
    - Returns a list of $((i,j),\,\text{weight})$ pairs in descending order

In [503]:
class SharedMemory:
    """
    Centralised store for seeker coordination.
    Tracks:
      - Explored grid cells
      - Last hider sightings (position + timestamp + confidence)
      - Belief distribution over hider locations
      - Communication counters
      - Visit counts (for intrinsic rewards)
    """

    def __init__(self, grid_size: int):
        self.grid_size = grid_size
        self.clear_episode()
        self.clear_step()

    def clear_episode(self):
        self.explored: Set[Tuple[int, int]] = set()
        self.seeker_positions: Set[int] = set()
        self.last_seen_pos: Optional[Tuple[int, int]] = None
        self.last_seen_time: int = -1
        self.confidence: float = 0.0
        self.belief: np.ndarray = np.ones((self.grid_size, self.grid_size)) / (self.grid_size ** 2)
        self.total_comm_count: int = 0
        self.visit_count: Dict[Tuple[int, int], int] = {}

    def clear_step(self):
        self.comm_count: Dict[int, int] = {}

    def record_seeker(self, aid: int):
        self.seeker_positions.add(aid)

    def record_sighting(self, aid: int, pos: Tuple[int, int], t: int, conf: float = 1.0):
        if aid not in self.seeker_positions:
            self.record_seeker(aid)
        self.comm_count[aid] = self.comm_count.get(aid, 0) + 1
        self.total_comm_count += 1
        for sid in self.seeker_positions:
            if sid != aid:
                self.comm_count[sid] = self.comm_count.get(sid, 0) + 1
                self.total_comm_count += 1
        self.last_seen_pos = pos
        self.last_seen_time = t
        self.confidence = conf
        self.belief.fill(0.0)
        self.belief[pos] = 1.0

    def record_explored(self, aid: int, cells: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
        new_cells = cells - self.explored
        if new_cells:
            self.comm_count[aid] = self.comm_count.get(aid, 0) + 1
            self.total_comm_count += 1
            self.explored |= new_cells
        return new_cells

    def record_visit(self, aid: int, pos: Tuple[int, int]) -> int:
        key = (aid, pos)
        self.visit_count[key] = self.visit_count.get(key, 0) + 1
        return self.visit_count[key]

    def cost_and_reset(self, aid: int) -> float:
        cost = self.comm_count.get(aid, 0) * tools_cfg.comm_cost
        self.comm_count[aid] = 0
        return cost

    def decay_confidence(self, t: int):
        if self.last_seen_time < 0:
            return
        dt = t - self.last_seen_time
        self.confidence *= (tools_cfg.belief_decay ** dt)

    def diffuse_belief(self):
        B = self.belief
        B_up = np.roll(B, 1, axis=0)
        B_down = np.roll(B, -1, axis=0)
        B_left = np.roll(B, 1, axis=1)
        B_right = np.roll(B, -1, axis=1)
        B_up[0, :] = 0
        B_down[-1, :] = 0
        B_left[:, 0] = 0
        B_right[:, -1] = 0
        new_b = (B + B_up + B_down + B_left + B_right) / 5.0
        total = new_b.sum()
        if total > 0:
            self.belief = new_b / total

    def get_belief(self, t: int, top_n: Optional[int] = None) -> List[Tuple[Tuple[int, int], float]]:
        weighted = (self.belief * self.confidence).flatten()
        length = weighted.size
        if top_n is None or top_n >= length:
            indices = np.argsort(-weighted)
        else:
            partial = np.argpartition(-weighted, top_n)[:top_n]
            indices = partial[np.argsort(-weighted[partial])]
        result = []
        for idx in indices:
            i, j = divmod(idx, self.grid_size)
            result.append(((i, j), weighted[idx]))
        return result

### HideSeekEnv Class

The `HideSeekEnv` class drives a hide-and-seek simulation on a grid, coordinating agent placement, movement, observations, captures and richly structured rewards. Internally it maintains:

1. **Grid**  
   An $N\times N$ array $G\in\{0,1\}^{N\times N}$, sampled once on init via  
   $$  
   G_{ij}\sim\mathrm{Bernoulli}(1-\text{wall\_prob})  
   $$

2. **Shared Memory**  
   A `SharedMemory` instance that tracks seekers' explored cells, last-seen hider location & confidence, belief diffusion, and communication counts.

3. **Agents**  
   A dict mapping `agent.id` → `AgentBase` (or subclass) objects, each with  
   - position $(x,y)$  
   - `role` ("seeker"/"hider")  
   - invisibility counter  
   - learning tables (Q-values or actor/critic weights)

4. **Global Time**  
   A counter $t$ of elapsed timesteps.

5. **Switch Count**  
   A counter $\sigma$ of how many times a capture triggered a role swap.

---

#### Key Methods

1. *__init__(self)*  
   - Builds `self.grid` and instantiates `self.shared`.  
   - Sets  
     $$  
     \begin{aligned}
     \text{self.time}&=0,\quad \\
     \text{self.switch\_count}&=0,\quad \\
     \text{self.agents}&=\{\}.  
     \end{aligned}
     $$  
    <div style="margin-bottom:1em;"></div>

2. *add_agent(self, agent, pos)*  
   - Calls  
     $$  
     \begin{aligned}
     agent.set\_initial\_state&(x,y,role),\quad \\
     agent.env &= self
     \end{aligned}
     $$  
   - Inserts into `self.agents`.  
   - If `agent.role=='seeker'`, calls `self.shared.record_seeker(agent.id)`.  
    <div style="margin-bottom:1em;"></div>

3. *reset(self)*  
   - Resets  
     $$  
     t\leftarrow0,\quad \sigma\leftarrow0  
     $$  
     and clears shared memory via `clear_episode()` and `clear_step()`.  
   - Samples $|\text{agents}|$ free cells $\{(i,j)\mid G_{ij}=0\}$.  
   - Randomly picks one agent as "hider," the rest become "seekers."  
   - Resets each agent's invisibility and re-records seekers in shared memory.  
    <div style="margin-bottom:1em;"></div>

4. *step(self, actions_dict) $\to$ $\{\text{agent.id}\mapsto r_i\}$*  
   Advances one timestep in **four phases**:

   1. **Pre-move Capture**  
      If a seeker is ℓ₁-adjacent to the hider, that seeker immediately captures:  
      - All incur timestep penalty $-p_t$.  
      - Capturing seeker gets  
        $$  
        R_{\rm cap}+R_{\rm role\_switch}.  
        $$  
      - Roles swap, invisibility granted, $\sigma\!+\!=1$, $t\!+\!=1$.  
      - Return rewards.

   2. **Belief & Sightings**  
      - `shared.clear_step()`  
      - `shared.decay_confidence(t)`  
      - `shared.diffuse_belief()`  
      - Each seeker observes; on sighting, calls  
        `shared.record_sighting(id,(h_x,h_y),t)`.

   3. **Movement Decision & Commit**  
      - Propose each move $(x,y)\to(x',y')$, invalid→stay.  
      - Commit all moves, mark `moved_last_action`.  
      - Seekers call  
        $$  
        \text{shared.record\_explored}(id,\{\text{new\_pos}\})  
        $$  
        to log newly explored cells.

   4. **Reward Computation**  
      For each agent $i$, start with  
      $$  
      r_i=-p_t.  
      $$

      - **Seeker**:  
        1. Exploration: $+\;R_{\rm explore}\times|\Delta\text{explored}|$. If no new cells, $-0.5$.  
        2. Coordination: if `shared.last_seen_pos` exists with confidence > 0.5 and this move reduces the Manhattan distance to it, add $+\;R_{\rm coordination}$.

      - **Hider**:  
        1. Survival: $+\;R_{\rm survive}$.  
        2. Distance-based: let  
           $$  
           d=\min_{s\in\text{seekers}}\|\text{pos}_i-\text{pos}_s\|_1.  
           $$  
           If $2\le d\le4$, add $+\;R_{\rm distance}$; if $d<2$, subtract $R_{\rm distance}$.

      - **Stay penalty**: any agent whose `last_action==4` (stay) takes an extra $-1.0$.

      Finally,  
      $$  
      t\leftarrow t+1  
      $$  
      and return $\{\,i\mapsto r_i\}$.

In [506]:
class HideSeekEnv:
    """
    Core simulator:
      - Builds grid, places agents
      - Manages moves, sightings, and captures
      - Tracks rewards and role switches
    """

    def __init__(self):
        self.grid = (np.random.rand(tools_cfg.grid_size, tools_cfg.grid_size) < tools_cfg.wall_prob).astype(int)
        self.shared = SharedMemory(tools_cfg.grid_size)
        self.agents: Dict[int, 'AgentBase'] = {}
        self.time = 0
        self.switch_count = 0

    def add_agent(self, agent: 'AgentBase', pos: Tuple[int, int]):
        agent.set_initial_state(pos[0], pos[1], agent.role)
        agent.env = self
        self.agents[agent.id] = agent
        if agent.role == 'seeker':
            self.shared.record_seeker(agent.id)

    def reset(self):
        self.time = 0
        self.switch_count = 0
        self.shared.clear_episode()
        self.shared.clear_step()
        free_cells = [
            (i, j)
            for i in range(tools_cfg.grid_size)
            for j in range(tools_cfg.grid_size)
            if self.grid[i, j] == 0
        ]
        starts = random.sample(free_cells, tools_cfg.num_agents)
        hider_id = random.choice(list(self.agents.keys()))
        for agent in self.agents.values():
            role = 'hider' if agent.id == hider_id else 'seeker'
            x, y = starts.pop()
            agent.set_initial_state(x, y, role)
            agent.invisible = 0
            if role == 'seeker':
                self.shared.record_seeker(agent.id)

    def step(self, actions_dict: Dict[int, int]) -> Dict[int, float]:
        # Phase 0: preemptive capture detection (before any belief updates)
        hider = next(a for a in self.agents.values() if a.role == 'hider')
        finder = next(
            (
                s.id
                for s in self.agents.values()
                if s.role == 'seeker'
                and abs(s.x - hider.x) + abs(s.y - hider.y) <= 1
            ),
            None,
        )
        if finder is not None:
            # Immediate capture: no belief decay, diffusion, or sightings
            seeker = self.agents[finder]
            rewards = {aid: tools_cfg.timestep_penalty for aid in self.agents}
            rewards[finder] = tools_cfg.capture_reward + tools_cfg.role_switch_reward  # Add role switch reward
            
            # Role swap (no broadcast)
            seeker.role, hider.role = 'hider', 'seeker'
            if finder in self.shared.seeker_positions:
                self.shared.seeker_positions.remove(finder)
            self.shared.seeker_positions.add(hider.id)
            seeker.invisible = tools_cfg.grace_steps
            hider.invisible = 0
            self.switch_count += 1
            self.time += 1
            return rewards

        # Begin standard timestep updates
        self.shared.clear_step()
        self.shared.decay_confidence(self.time)
        self.shared.diffuse_belief()

        # Phase 1: observation (seekers see but don't catch)
        for aid, agent in self.agents.items():
            obs, seen = agent.observe(list(self.agents.values()), self)
            if agent.role == 'seeker' and seen:
                self.shared.record_sighting(aid, (hider.x, hider.y), self.time)

        # Phase 2: check for capture (pre-move)
        finder = None
        for aid, agent in self.agents.items():
            if agent.role == 'seeker' and abs(agent.x - hider.x) + abs(agent.y - hider.y) <= 1:
                finder = aid
                break
        if finder is not None:
            seeker = self.agents[finder]
            rewards = {aid: tools_cfg.timestep_penalty for aid in self.agents}
            rewards[finder] = tools_cfg.capture_reward + tools_cfg.role_switch_reward  # Add role switch reward
            seeker.role, hider.role = 'hider', 'seeker'
            if finder in self.shared.seeker_positions:
                self.shared.seeker_positions.remove(finder)
            self.shared.seeker_positions.add(hider.id)
            seeker.invisible = tools_cfg.grace_steps
            hider.invisible = 0
            self.switch_count += 1
            self.time += 1
            return rewards

        # Phase 3: decide all moves first
        new_positions = {}
        occupied = {agent.state() for agent in self.agents.values()}
        for aid, agent in self.agents.items():
            x0, y0 = agent.state()
            action_index = actions_dict.get(aid, num_actions - 1)
            agent.last_action = action_index
            dx, dy = actions[action_index]
            x1, y1 = x0 + dx, y0 + dy
            if (
                0 <= x1 < tools_cfg.grid_size
                and 0 <= y1 < tools_cfg.grid_size
                and self.grid[x1, y1] == 0
                and (x1, y1) not in occupied
            ):
                new_positions[aid] = (x1, y1)
            else:
                new_positions[aid] = (x0, y0)

        # Now update all positions and track exploration
        new_explored = {aid: 0 for aid in self.agents}
        final_occupied = set()
        for aid, agent in self.agents.items():
            new_pos = new_positions[aid]
            agent.moved_last_action = (new_pos != agent.state())
            agent.x, agent.y = new_pos
            final_occupied.add(new_pos)
            if agent.role == 'seeker':
                newly = self.shared.record_explored(aid, {new_pos})
                new_explored[aid] = len(newly)

        # Phase 4: compute step rewards with enhanced structure
        rewards: Dict[int, float] = {}
        for aid, agent in self.agents.items():
            cost = self.shared.cost_and_reset(aid)
            r = tools_cfg.timestep_penalty - cost
            
            if agent.role == 'seeker':
                # Exploration reward
                r += tools_cfg.explore_reward * new_explored[aid]
                if new_explored[aid] == 0:
                    r -= 0.5  # Penalty for not finding new cells
                
                # Coordination reward
                if self.shared.last_seen_pos and self.shared.confidence > 0.5:
                    # Reward for moving towards the hider's last known position
                    dx = self.shared.last_seen_pos[0] - agent.x
                    dy = self.shared.last_seen_pos[1] - agent.y
                    if abs(dx) + abs(dy) < abs(self.shared.last_seen_pos[0] - agent.x) + abs(self.shared.last_seen_pos[1] - agent.y):
                        r += tools_cfg.coordination_reward
            else:  # hider
                r += tools_cfg.survive_reward
                # Distance-based reward
                seekers = [ag for ag in self.agents.values() if ag.role == 'seeker']
                if seekers:
                    min_dist = min(abs(agent.x - s.x) + abs(agent.y - s.y) for s in seekers)
                    # Reward for maintaining optimal distance (not too close, not too far)
                    if 2 <= min_dist <= 4:
                        r += tools_cfg.distance_reward
                    elif min_dist < 2:
                        r -= tools_cfg.distance_reward  # Penalty for being too close
            
            # Penalty for staying still
            if agent.last_action == 4:  # 'stay' action
                r -= 1.0  # Strong penalty for staying
                
            rewards[aid] = r

        self.time += 1
        return rewards

In [508]:
# ---------------------- Utility Function ----------------------
def compute_bdir(dx: int, dy: int) -> int:
    """
    Convert a 2D offset (dx, dy) into a discrete action index:
      0: right, 1: down, 2: left, 3: up, 4: stay
    Returns the most representative direction of the offset.
    """
    if dx == 0 and dy == 0:
        return num_actions - 1  # stay
    if abs(dx) > abs(dy):
        return 0 if dx > 0 else 2
    return 1 if dy > 0 else 3

### AgentBase Class

The `AgentBase` class provides the structure shared by all agents which manages identity, position, role and bookkeeping. It maintains the following state variables:

1. **Identifier**  
   A unique integer `id`.

2. **Coordinates**  
   Current grid position $(x,y)$ stored in `x` and `y`.

3. **Role**  
   A string `role` equal to `"seeker"` or `"hider"`.

4. **Invisibility counter**  
   Integer `invisible` tracking how many steps the agent remains hidden after capture.

5. **Cumulative reward**  
   Float `reward_sum` summing all rewards earned in the current episode.

6. **Environment reference**  
   `env` pointing to the `HideSeekEnv` instance for observation and movement.

7. **Visited set**  
   A `set` of coordinates for debugging and analysis of exploratory behaviour.

8. **Vision range**  
   Integer `vision_range`, set to $2$ for seekers and $3$ for hiders, defining the radius of local observation.

9. **Last action**  
   Integer `last_action` tracking the most recent action taken.

10. **Movement flag**  
    Boolean `moved_last_action` indicating if the last action resulted in movement.

---

#### Key Methods

1. *__init__(self, aid: int)*  
   - Initializes agent with ID `aid`
   - Sets default role based on ID (seeker if aid=0, hider otherwise)
   - Initializes empty visited set and zero reward sum
   - Sets vision range based on role (2 for seeker, 3 for hider)
   - Initializes position to (0,0)
   - Sets invisibility counter to 0  
    <div style="margin-bottom:1em;"></div>

2. *set_initial_state(self, x: int, y: int, role: str)*  
   - Resets position to $(x,y)$, sets `role`, zeroes `invisible` and `reward_sum`, and clears `visited`.  
   - Updates `vision_range` to $2$ if `role=="seeker"` else $3$.  
   - Resets `last_action` to None and `moved_last_action` to False  
    <div style="margin-bottom:1em;"></div>

3. *state(self) → Tuple[int, int]*  
   - Returns the agent's current coordinates as a tuple $(x,y)$.  
    <div style="margin-bottom:1em;"></div>

4. *observe(self, agents: List['AgentBase'], env: HideSeekEnv) → (Dict[Tuple[int,int],Dict], bool)*  
   - Scans all cells within `vision_range`, building  
     ```python
     obs: Dict[(i,j)→{
       'obstacle': bool,
       'contains_hider': bool
     }]
     ```  
   - Sets `seen=True` if any observed cell contains a visible hider.  
   - Returns `(obs, seen)`, supporting capture detection and learning updates without using policy logic.  
    <div style="margin-bottom:1em;"></div>

5. *update(self, *args, **kwargs)*  
   - Abstract method to be overridden by subclasses
   - Handles learning updates based on experience
   - Takes variable arguments to support different learning algorithms  
    <div style="margin-bottom:1em;"></div>

6. *select_action(self, obs, shared: SharedMemory, t: int, agents: List[AgentBase]) → int*  
   - Abstract method to be overridden by subclasses
   - Determines the next action based on current state and observations
   - Returns an integer representing the chosen action
   - Updates `last_action` with the chosen action
   - May update `moved_last_action` based on whether the action results in movement  
    <div style="margin-bottom:1em;"></div>

In [511]:
class AgentBase:
    """
    Base agent with position, role, and bookkeeping.
    """

    def __init__(self, aid: int):
        self.id = aid
        self.role = 'seeker' if aid == 0 else 'hider'
        self.visited = set()
        self.reward_sum = 0
        self.last_action = None  # Initialize last_action
        self.moved_last_action = False  # Initialize moved_last_action
        self.env = None
        self.vision_range = 2 if self.role == 'seeker' else 3
        self.x = 0
        self.y = 0
        self.invisible = 0

    def set_initial_state(self, x: int, y: int, role: str):
        self.x, self.y, self.role = x, y, role
        self.invisible = 0
        self.reward_sum = 0.0
        self.visited.clear()
        self.vision_range = 2 if role == 'seeker' else 3
        self.last_action = None
        self.moved_last_action = False

    def state(self) -> Tuple[int, int]:
        return (self.x, self.y)

    def observe(self, agents: List['AgentBase'], env: HideSeekEnv):
        obs = {}
        seen = False
        r = self.vision_range
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                i, j = self.x + dx, self.y + dy
                if 0 <= i < tools_cfg.grid_size and 0 <= j < tools_cfg.grid_size:
                    contains = any(
                        ag.id != self.id
                        and ag.role == 'hider'
                        and ag.invisible == 0
                        and (ag.x, ag.y) == (i, j)
                        for ag in agents
                    )
                    obs[(i, j)] = {
                        'obstacle': env.grid[i, j] == 1,
                        'contains_hider': contains,
                    }
                    seen = seen or contains
        return obs, seen

    def update(self, *args, **kwargs):
        """Overridden by subclasses."""
        pass


### Experience and PrioritizedReplayBuffer Classes

The `Experience` and `PrioritizedReplayBuffer` classes work together to implement prioritised experience replay, a technique that improves learning efficiency by sampling more important experiences more frequently.

The `Experience` class represents a single transition in the environment, storing:

1. **State**  
   Tuple `state` containing the agent's position $(x,y)$.

2. **Action**  
   Integer `action` representing the action taken.

3. **Reward**  
   Float `reward` received after taking the action.

4. **Next State**  
   Tuple `next_state` containing the resulting position $(x',y')$.

5. **Done Flag**  
   Boolean `done` indicating if the episode terminated.

6. **Priority**  
   Float `priority` used for importance sampling, defaulting to 1.0.

---

#### PrioritizedReplayBuffer Class

The `PrioritizedReplayBuffer` class manages a collection of experiences with priority-based sampling. It maintains:

1. **Capacity**  
   Integer `capacity` defining the maximum number of experiences stored.

2. **Priority Exponent**  
   Float `alpha` controlling how much prioritization is used.

3. **Importance Sampling Exponent**  
   Float `beta` used to correct the bias introduced by prioritization.

4. **Buffer**  
   List `buffer` storing the experiences.

5. **Priorities**  
   Array `priorities` storing the priority values for each experience.

6. **Position**  
   Integer `position` tracking the current insertion point.

7. **Size**  
   Integer `size` tracking the current number of stored experiences.

---

#### Key Methods

1. *push(self, experience: Experience)*  
   - Adds a new experience to the buffer
   - If buffer is full, overwrites oldest experience
   - Sets priority to maximum existing priority
   - Updates position and size counters  
    <div style="margin-bottom:1em;"></div>

2. *sample(self, batch_size: int) → Tuple[List[Experience], np.ndarray, np.ndarray]*  
   - Returns three elements:
     1. List of sampled experiences
     2. Array of sampled indices
     3. Array of importance sampling weights
   - If buffer size < batch_size, returns all experiences
   - Otherwise, samples based on priorities:
     $$  
     P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}
     $$
   - Computes importance sampling weights:
     $$  
     w_i = \left(\frac{N \cdot P(i)}{\beta}\right)^{-1}
     $$  
    <div style="margin-bottom:1em;"></div>

3. *update_priorities(self, indices: np.ndarray, priorities: np.ndarray)*  
   - Updates priorities for experiences at given indices
   - Used after computing TD errors to adjust sampling probabilities  
    <div style="margin-bottom:1em;"></div>

4. *_\_len__(self) → int*  
   - Returns current number of stored experiences  
    <div style="margin-bottom:1em;"></div>

In [514]:
@dataclass
class Experience:
    """Represents a single experience in the replay buffer."""
    state: Tuple[int, int]
    action: int
    reward: float
    next_state: Tuple[int, int]
    done: bool
    priority: float = 1.0  # Added priority field

class PrioritizedReplayBuffer:
    """Stores and samples experiences with priority-based sampling."""
    def __init__(self, capacity: int = 20000, alpha: float = 0.6, beta: float = 0.4):
        self.capacity = capacity
        self.alpha = alpha  # Priority exponent
        self.beta = beta    # Importance sampling exponent
        self.buffer = []
        self.priorities = np.zeros(capacity)
        self.position = 0
        self.size = 0

    def push(self, experience: Experience):
        max_priority = self.priorities.max() if self.buffer else 1.0
        
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = experience
        self.priorities[self.position] = max_priority
        self.position = (self.position + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int) -> Tuple[List[Experience], np.ndarray, np.ndarray]:
        if self.size < batch_size:
            return self.buffer, np.ones(len(self.buffer)), np.ones(len(self.buffer))
            
        # Calculate sampling probabilities
        priorities = self.priorities[:self.size]
        probs = priorities ** self.alpha
        probs /= probs.sum()
        
        # Sample indices
        indices = np.random.choice(self.size, batch_size, p=probs)
        
        # Calculate importance sampling weights
        weights = (self.size * probs[indices]) ** (-self.beta)
        weights /= weights.max()
        
        return [self.buffer[idx] for idx in indices], indices, weights

    def update_priorities(self, indices: np.ndarray, priorities: np.ndarray):
        self.priorities[indices] = priorities

    def __len__(self) -> int:
        return self.size

### DoubleQLearningMixin Class

The `DoubleQLearningMixin` gives an agent value-based learning off-policy (to mitigate overestimation bias). It maintains the following state variables:

1. **First Q-table**  
   A NumPy array $Q_{1}\in\mathbb{R}^{N\times N\times A}$ initialised to zeros.

2. **Second Q-table**  
   A NumPy array $Q_{2}$ of identical shape and initialisation.

3. **Target Q-tables**  
   NumPy arrays $Q_{1}^{\text{target}}$ and $Q_{2}^{\text{target}}$ for stable learning.

4. **Exploration rate**  
   Float $\epsilon$, starting at `tools_cfg.epsilon_start` and decaying toward `tools_cfg.epsilon_min`.

5. **Learning rate**  
   Float $\alpha$, starting at `tools_cfg.alpha` and decaying toward `tools_cfg.alpha_min`.

6. **Replay Buffer**  
   `PrioritizedReplayBuffer` instance for storing and sampling experiences.

7. **Update Counters**  
   Integer `steps_since_target_update` tracking when to update target networks.

8. **Performance Tracking**  
   List `episode_rewards` storing recent episode rewards for adaptive learning.

9. **TD Error Tracking**  
   List `td_errors` storing temporal difference errors for priority updates.

---

#### Key Methods

1. *__init__(self)*  
   - Initializes $Q_{1}$, $Q_{2}$, $Q_{1}^{\text{target}}$, and $Q_{2}^{\text{target}}$ to zero arrays
   - Sets $\epsilon \gets \mathit{epsilon\_start}$ and $\alpha \gets \mathit{alpha}$
   - Creates `PrioritizedReplayBuffer` with capacity `tools_cfg.replay_buffer_size`
   - Initializes empty lists for `episode_rewards` and `td_errors`
   - Sets `steps_since_target_update = 0`  
    <div style="margin-bottom:1em;"></div>

2. *decay_epsilon(self)*  
   - Updates  
     $$
       \epsilon \;\gets\; \max\bigl(\mathit{epsilon\_min},\,\epsilon \times \mathit{epsilon\_decay}\bigr).
     $$  
    <div style="margin-bottom:1em;"></div>

3. *update_learning_rate(self, episode_reward: float)*  
   - Appends `episode_reward` to `episode_rewards`
   - If average reward over last 10 episodes is positive:
     $$
       \alpha \;\gets\; \max\bigl(\mathit{alpha\_min},\,\alpha \times \mathit{alpha\_decay}\bigr)
     $$
   - Otherwise:
     $$
       \alpha \;\gets\; \min\bigl(\mathit{alpha},\,\alpha / \mathit{alpha\_decay}\bigr)
     $$  
    <div style="margin-bottom:1em;"></div>

4. *select_action_double(self, x: int, y: int, bdir: int) → int*  
   - Computes the element‐wise sum  
     $$
       Q_{\text{sum}} = Q_{1}^{\text{target}}[x,y] + Q_{2}^{\text{target}}[x,y].
     $$  
   - Filters to actions that keep the agent within the grid.  
   - With probability $\epsilon$, returns a random valid action; otherwise returns  
     $$
       \arg\max_{a} \;Q_{\text{sum}}[a].
     $$  
    <div style="margin-bottom:1em;"></div>

5. *double_q_update(self, ps: Tuple[int,int], a: int, ns: Tuple[int,int], r: float, bps: int, bns: int)*  
   - Adds intrinsic reward for exploration:
     $$
       r \;\gets\; r + \frac{\beta}{\sqrt{\text{visits}+1}}
     $$
   - Stores experience in replay buffer
   - Updates target networks periodically:
     $$
       Q_{i}^{\text{target}} \;\gets\; Q_{i} \quad \text{if} \quad \text{steps} \geq \text{update\_freq}
     $$
   - Samples batch from replay buffer
   - With probability $0.5$, chooses one of two updates:  
     - Update $Q_1$:  
       $$
       \begin{aligned}
         \mathit{best} &= \arg\max_{a'} Q_{1}[ns,a'],\quad \\
         \mathit{target} &= r + \gamma\,Q_{2}^{\text{target}}[ns,\mathit{best}],\quad \\
         \mathit{td\_error} &= |\mathit{target} - Q_{1}[ps,a]|,\quad \\
         Q_{1}[ps,a] \;&\gets\; Q_{1}[ps,a] + \alpha\,w\,\bigl(\mathit{target} - Q_{1}[ps,a]\bigr).
       \end{aligned}
       $$  
     - Or update $Q_2$ symmetrically using $Q_1$ for the target estimate
   - Updates priorities in replay buffer based on TD errors
   - Alternating which table provides evaluation versus target sharply reduces optimistic bias in single-table Q-learning  

In [517]:
class DoubleQLearningMixin:
    """
    Adds Double Q-learning with prioritised experience replay, target networks, and adaptive learning rates.
    """

    def __init__(self):
        self.Q1 = np.zeros((tools_cfg.grid_size, tools_cfg.grid_size, num_actions))
        self.Q2 = np.zeros_like(self.Q1)
        self.target_Q1 = np.zeros_like(self.Q1)
        self.target_Q2 = np.zeros_like(self.Q2)
        self.epsilon = tools_cfg.epsilon_start
        self.alpha = tools_cfg.alpha
        self.last_action = None
        self.replay_buffer = PrioritizedReplayBuffer(
            capacity=tools_cfg.replay_buffer_size,
            alpha=0.6,  # Priority exponent
            beta=0.4    # Importance sampling exponent
        )
        self.update_target_every = tools_cfg.target_update_freq
        self.steps_since_target_update = 0
        self.batch_size = tools_cfg.batch_size
        self.episode_rewards = []
        self.td_errors = []  # Track TD errors for priority updates

    def decay_epsilon(self):
        """Decay the exploration rate."""
        self.epsilon = max(tools_cfg.epsilon_min, self.epsilon * tools_cfg.epsilon_decay)

    def update_learning_rate(self, episode_reward: float):
        """Adapt learning rate based on performance."""
        self.episode_rewards.append(episode_reward)
        if len(self.episode_rewards) > 10:
            avg_reward = np.mean(self.episode_rewards[-10:])
            if avg_reward > 0:
                self.alpha = max(tools_cfg.alpha_min, self.alpha * tools_cfg.alpha_decay)
            else:
                self.alpha = min(tools_cfg.alpha, self.alpha / tools_cfg.alpha_decay)

    def select_action_double(self, x: int, y: int, bdir: int) -> int:
        valid = []
        for i, (dx, dy) in enumerate(actions):
            nx, ny = x + dx, y + dy
            if (0 <= nx < tools_cfg.grid_size and 
                0 <= ny < tools_cfg.grid_size and 
                self.env.grid[nx, ny] == 0):
                valid.append(i)
        
        if not valid:
            self.last_action = num_actions - 1
            return self.last_action
            
        if random.random() < self.epsilon:
            self.last_action = random.choice(valid)
            self.decay_epsilon()
            return self.last_action
            
        # Use target networks for action selection
        qsum = self.target_Q1[x, y] + self.target_Q2[x, y]
        masked_q = qsum.copy()
        masked_q[~np.isin(np.arange(num_actions), valid)] = -float('inf')
        self.last_action = int(np.argmax(masked_q))
        return self.last_action

    def double_q_update(self, ps: Tuple[int, int], a: int, ns: Tuple[int, int], r: float, bps: int, bns: int):
        # Add intrinsic reward for exploration
        if self.env and self.env.shared:
            visit_count = self.env.shared.record_visit(self.id, ns)
            r += tools_cfg.beta_intrinsic / np.sqrt(visit_count + 1)
        
        # Store experience in replay buffer
        done = False
        experience = Experience(ps, a, r, ns, done)
        self.replay_buffer.push(experience)
        
        # Update target networks periodically
        self.steps_since_target_update += 1
        if self.steps_since_target_update >= self.update_target_every:
            self.target_Q1 = self.Q1.copy()
            self.target_Q2 = self.Q2.copy()
            self.steps_since_target_update = 0
        
        # Sample from replay buffer and update Q-values
        if len(self.replay_buffer) >= self.batch_size:
            batch, indices, weights = self.replay_buffer.sample(self.batch_size)
            td_errors = []
            
            for i, exp in enumerate(batch):
                if random.random() < 0.5:
                    best_next = np.argmax(self.Q1[exp.next_state[0], exp.next_state[1], :])
                    target = exp.reward + tools_cfg.gamma * self.target_Q2[exp.next_state[0], exp.next_state[1], best_next]
                    td_error = target - self.Q1[exp.state[0], exp.state[1], exp.action]
                    self.Q1[exp.state[0], exp.state[1], exp.action] += self.alpha * weights[i] * td_error
                else:
                    best_next = np.argmax(self.Q2[exp.next_state[0], exp.next_state[1], :])
                    target = exp.reward + tools_cfg.gamma * self.target_Q1[exp.next_state[0], exp.next_state[1], best_next]
                    td_error = target - self.Q2[exp.state[0], exp.state[1], exp.action]
                    self.Q2[exp.state[0], exp.state[1], exp.action] += self.alpha * weights[i] * td_error
                
                td_errors.append(abs(td_error))
            
            # Update priorities based on TD errors
            self.replay_buffer.update_priorities(indices, np.array(td_errors) + 1e-6)  # Add small constant to avoid zero priorities



### RLSeekerAgent Class

The `RLSeekerAgent` combines `AgentBase` and `DoubleQLearningMixin` to implement a seeker guided by shared belief. It adds to the base agent:

1. **Q-tables**  
   Two arrays $Q_{1},Q_{2}\in\mathbb{R}^{N\times N\times A}$, initialised to zero.

2. **Target Q-tables**  
   Two arrays $Q_{1}^{\text{target}},Q_{2}^{\text{target}}$ for stable learning.

3. **Exploration rate**  
   Scalar $\epsilon$, decayed after each selection.

4. **Learning rate**  
   Scalar $\alpha$, adapted based on performance.

5. **Vision range**  
   Fixed to 2 for seekers.

6. **Replay Buffer**  
   `PrioritizedReplayBuffer` for storing and sampling experiences.

7. **Performance Tracking**  
   List `episode_rewards` for adaptive learning rate.

8. **TD Error Tracking**  
   List `td_errors` for priority updates.

---

#### Key Methods

1. *__init__(self, aid: int)*  
   - Calls both parent initialisers
   - Sets `vision_range=2`
   - Initializes Q-tables and target networks
   - Creates prioritized replay buffer
   - Sets up performance tracking lists  
    <div style="margin-bottom:1em;"></div>

2. *select_action(self, obs, shared: SharedMemory, t: int, agents)* → *int*  
   - Fetches current state $(x,y)$
   - Queries `shared.get_belief(t, top_n=1)` for the most likely hider cell $(i,j)$
   - If belief exists and confidence > 0.2:
     - Computes direction index $b$ toward $(i,j)$
   - Otherwise:
     - Finds nearest unexplored cell
     - Computes direction index $b$ toward that cell
   - Gets valid actions that keep agent within grid
   - With probability $\epsilon$:
     - If valid actions exist toward preferred direction, chooses one
     - Otherwise, chooses random valid action
   - Otherwise:
     - Computes $Q_{\text{sum}} = Q_{1}^{\text{target}}[x,y] + Q_{2}^{\text{target}}[x,y]$
     - Boosts Q-value for preferred direction by `tools_cfg.explore_reward`
     - Masks invalid actions with $-\infty$
     - Returns $\arg\max_a Q_{\text{sum}}[a]$
   - Calls `decay_epsilon()`
   - Returns the chosen action  
    <div style="margin-bottom:1em;"></div>

3. *update(self, ps, a, ns, r, bps, bns)*  
   - Delegates to `double_q_update(ps,a,ns,r,bps,bns)`
   - This method:
     - Adds intrinsic reward for exploration
     - Stores experience in replay buffer
     - Updates target networks periodically
     - Samples batch from replay buffer
     - Updates Q-values using double Q-learning
     - Updates priorities based on TD errors  

In [520]:
class RLSeekerAgent(AgentBase, DoubleQLearningMixin):
    """
    Seeker uses Double Q-learning guided by shared belief.
    """

    def __init__(self, aid: int):
        AgentBase.__init__(self, aid)
        DoubleQLearningMixin.__init__(self)

    def select_action(self, obs, shared: SharedMemory, t: int, agents: List[AgentBase]) -> int:
        ps = self.state()
        belief = shared.get_belief(t, top_n=1)
        # If we have a confident belief, chase the hider
        if belief and belief[0][1] > 0.2:
            ref = belief[0][0]
            dx, dy = ref[0] - ps[0], ref[1] - ps[1]
            bps = compute_bdir(dx, dy)
        else:
            # Fallback: move toward nearest unexplored cell
            unexplored = [(i, j) for i in range(tools_cfg.grid_size)
                          for j in range(tools_cfg.grid_size)
                          if (i, j) not in self.env.shared.explored and self.env.grid[i, j] == 0]
            if unexplored:
                nearest = min(unexplored, key=lambda pos: abs(pos[0] - ps[0]) + abs(pos[1] - ps[1]))
                dx, dy = nearest[0] - ps[0], nearest[1] - ps[1]
                bps = compute_bdir(dx, dy)
            else:
                bps = num_actions - 1  # stay if nowhere to go

        # Use the existing double Q-learning action selection, but bias toward bps
        valid = []
        for i, (adx, ady) in enumerate(actions):
            nx, ny = ps[0] + adx, ps[1] + ady
            if (0 <= nx < tools_cfg.grid_size and 
                0 <= ny < tools_cfg.grid_size and 
                self.env.grid[nx, ny] == 0):
                valid.append(i)
        if not valid:
            self.last_action = num_actions - 1
            return num_actions - 1

        if random.random() < self.epsilon:
            valid_towards = [i for i in valid if i == bps]
            if valid_towards:
                self.last_action = valid_towards[0]
                self.decay_epsilon()
                return self.last_action
            self.last_action = random.choice(valid)
            self.decay_epsilon()
            return self.last_action

        qsum = self.Q1[ps[0], ps[1]] + self.Q2[ps[0], ps[1]]
        # Boost Q-value for the preferred direction
        qsum[bps] += tools_cfg.explore_reward
        masked_q = qsum.copy()
        masked_q[~np.isin(np.arange(num_actions), valid)] = -float('inf')
        self.last_action = int(np.argmax(masked_q))
        return self.last_action

    def update(
        self,
        ps: Tuple[int, int],
        a: int,
        ns: Tuple[int, int],
        r: float,
        bps: int,
        bns: int,
    ):
        self.double_q_update(ps, a, ns, r, bps, bns)


### RLHiderAgent Class

The `RLHiderAgent` also uses double Q-learning but chooses actions to evade. It differs only in vision and target selection:

1. **Q-tables**  
   Two arrays $Q_{1},Q_{2}\in\mathbb{R}^{N\times N\times A}$, initialised to zero.

2. **Target Q-tables**  
   Two arrays $Q_{1}^{\text{target}},Q_{2}^{\text{target}}$ for stable learning.

3. **Exploration rate**  
   Scalar $\epsilon$, decayed after each selection.

4. **Learning rate**  
   Scalar $\alpha$, adapted based on performance.

5. **Vision range**  
   Fixed to 3 for hiders.

6. **Replay Buffer**  
   `PrioritizedReplayBuffer` for storing and sampling experiences.

7. **Performance Tracking**  
   List `episode_rewards` for adaptive learning rate.

8. **TD Error Tracking**  
   List `td_errors` for priority updates.

---

#### Key Methods

1. *__init__(self, aid: int)*  
   - Initialises parents and sets `vision_range=3`
   - Initialises Q-tables and target networks
   - Creates prioritised replay buffer
   - Sets up performance tracking lists  
    <div style="margin-bottom:1em;"></div>

2. *select_action(self, obs, shared, t, agents)* → *int*  
   - Fetches current state $(x,y)$
   - Gets valid actions that keep agent within grid
   - If no valid actions exist, returns stay action
   - Finds nearest seeker by Manhattan distance
   - If seekers exist and nearest is within distance 3:
     - Computes direction $b$ away from nearest seeker
     - With probability $\epsilon$:
       - If valid actions exist away from seeker, chooses one
       - Otherwise, chooses random valid action
     - Otherwise:
       - Computes $Q_{\text{sum}} = Q_{1}^{\text{target}}[x,y] + Q_{2}^{\text{target}}[x,y]$
       - Boosts Q-value for away direction by `tools_cfg.survive_reward`
       - Masks invalid actions with $-\infty$
       - Returns $\arg\max_a Q_{\text{sum}}[a]$
   - Otherwise:
     - Uses default `select_action_double` for normal exploration
   - Calls `decay_epsilon()`
   - Returns the chosen action  
    <div style="margin-bottom:1em;"></div>

3. *update(self, ps, a, ns, r, bps, bns)*  
   - Delegates to `double_q_update(ps,a,ns,r,bps,bns)`
   - This method:
     - Adds intrinsic reward for exploration
     - Stores experience in replay buffer
     - Updates target networks periodically
     - Samples batch from replay buffer
     - Updates Q-values using double Q-learning
     - Updates priorities based on TD errors  

In [523]:
class RLHiderAgent(AgentBase, DoubleQLearningMixin):
    """
    Hider uses Double Q-learning to avoid the nearest seeker.
    """

    def __init__(self, aid: int):
        AgentBase.__init__(self, aid)
        DoubleQLearningMixin.__init__(self)

    def select_action(self, obs, shared: SharedMemory, t: int, agents: List[AgentBase]) -> int:
        ps = self.state()
        seekers = [ag for ag in agents if ag.role == 'seeker']
        
        # Get valid actions
        valid = []
        for i, (dx, dy) in enumerate(actions):
            nx, ny = ps[0] + dx, ps[1] + dy
            if (0 <= nx < tools_cfg.grid_size and 
                0 <= ny < tools_cfg.grid_size and 
                self.env.grid[nx, ny] == 0):
                valid.append(i)
        
        if not valid:
            self.last_action = num_actions - 1
            return num_actions - 1
            
        # Find nearest seeker
        if seekers:
            nearest = min(seekers, key=lambda s: abs(s.x - ps[0]) + abs(s.y - ps[1]))
            dx, dy = ps[0] - nearest.x, ps[1] - nearest.y
            away_dir = compute_bdir(dx, dy)
            
            # If seeker is close, prioritize moving away
            if abs(nearest.x - ps[0]) + abs(nearest.y - ps[1]) <= 3:
                if random.random() < self.epsilon:
                    valid_away = [i for i in valid if i == away_dir]
                    if valid_away:
                        self.last_action = valid_away[0]
                        self.decay_epsilon()
                        return self.last_action
                    self.last_action = random.choice(valid)
                    self.decay_epsilon()
                    return self.last_action
                
                qsum = self.Q1[ps[0], ps[1]] + self.Q2[ps[0], ps[1]]
                # Boost Q-values for actions moving away from nearest seeker
                qsum[away_dir] += tools_cfg.survive_reward
                
                masked_q = qsum.copy()
                masked_q[~np.isin(np.arange(num_actions), valid)] = -float('inf')
                self.last_action = int(np.argmax(masked_q))
                return self.last_action
        
        # Default to normal action selection if no immediate threat
        act = self.select_action_double(ps[0], ps[1], 0)
        self.decay_epsilon()
        return act

    def update(
        self,
        ps: Tuple[int, int],
        a: int,
        ns: Tuple[int, int],
        r: float,
        bps: int,
        bns: int,
    ):
        self.double_q_update(ps, a, ns, r, bps, bns)

### ActorCriticAgent Class

The `ActorCriticAgent` implements an actor-critic policy without Q-tables:

1. **Policy preferences**  
   Array $\mathbf{H}\in\mathbb{R}^{N\times N\times A}$.

2. **Value function**  
   Array $\mathbf{V}\in\mathbb{R}^{N\times N}$.

3. **Last action probabilities**  
   Vector $\pi\in\mathbb{R}^{A}$.

4. **Learning rates**  
   Scalars $\alpha_{\text{actor}},\alpha_{\text{critic}}$.

5. **Vision range**  
   Manhattan distance of $2$ for seekers and $3$ for hiders.

6. **Exploration rate**  
   Scalar $\epsilon$ for epsilon-greedy action selection.

7. **Last action**  
   Integer tracking the most recent action taken.

---

#### Key Methods

1. *__init__(self, aid: int)*  
   - Initializes base agent and sets up actor-critic components
   - Sets vision range based on role (2 for seeker, 3 for hider)
   - Initialises exploration rate to `tools_cfg.epsilon_start`  
    <div style="margin-bottom:1em;"></div>

2. *decay_epsilon(self)*  
   - Updates exploration rate: $\epsilon \gets \max(\epsilon_{\min}, \epsilon \times \epsilon_{\text{decay}})$  
    <div style="margin-bottom:1em;"></div>

3. *select_action(self, obs, shared, t, agents)* → *int*  
   - Gets valid actions within grid bounds
   - If no valid actions, returns stay action
   - If seeker and confident belief exists:
     - Boosts preferences toward hider location
   - If hider and nearby seekers:
     - Boosts preferences away from nearest seeker
   - With probability $\epsilon$:
     - Returns random valid action
   - Otherwise:
     - Computes softmax over preferences
     - Samples action from distribution
   - Stores probabilities and returns action  
    <div style="margin-bottom:1em;"></div>

4. *update(self, ps, a, ns, r)*  
   - Adds intrinsic reward for exploration
   - Computes TD error: $\delta = r + \gamma V[n_x,n_y] - V[x,y]$
   - Updates value function: $V[x,y] \mathrel{+}= \alpha_{\text{critic}}\delta$
   - Updates policy preferences:
     $$
     \begin{aligned}
       H[x,y,a] &\mathrel{+}= \alpha_{\text{actor}}\delta(1 - \pi[a]) \\
       H[x,y,i] &\mathrel{-}= \alpha_{\text{actor}}\delta\,\pi[i] \quad \text{for} \quad i\neq a
     \end{aligned}
     $$  

In [526]:
class ActorCriticAgent(AgentBase):
    """
    An agent implementing an Actor-Critic algorithm.
    """

    def __init__(self, aid: int):
        AgentBase.__init__(self, aid)
        self.H = np.zeros((tools_cfg.grid_size, tools_cfg.grid_size, num_actions))
        self.V = np.zeros((tools_cfg.grid_size, tools_cfg.grid_size))
        self.last_probs = np.ones(num_actions) / num_actions
        self.epsilon = tools_cfg.epsilon_start
        self.last_action = None

    def decay_epsilon(self):
        self.epsilon = max(tools_cfg.epsilon_min, self.epsilon * tools_cfg.epsilon_decay)

    def select_action(self, obs, shared: SharedMemory, t: int, agents: List[AgentBase]) -> int:
        x, y = self.state()
        valid = []
        for i, (dx, dy) in enumerate(actions):
            nx, ny = x + dx, y + dy
            if (0 <= nx < tools_cfg.grid_size and 
                0 <= ny < tools_cfg.grid_size and 
                self.env.grid[nx, ny] == 0):
                valid.append(i)
                
        if not valid:
            self.last_action = num_actions - 1
            return num_actions - 1
            
        # Get observation-based preferences
        if self.role == 'seeker':
            belief = shared.get_belief(t, top_n=1)
            if belief and belief[0][1] > 0.5:
                ref = belief[0][0]
                dx, dy = ref[0] - x, ref[1] - y
                pref_dir = compute_bdir(dx, dy)
                self.H[x, y, pref_dir] += tools_cfg.explore_reward
        else:  # hider
            seekers = [ag for ag in agents if ag.role == 'seeker']
            if seekers:
                nearest = min(seekers, key=lambda s: abs(s.x - x) + abs(s.y - y))
                if abs(nearest.x - x) + abs(nearest.y - y) <= 3:
                    dx, dy = x - nearest.x, y - nearest.y
                    away_dir = compute_bdir(dx, dy)
                    self.H[x, y, away_dir] += tools_cfg.survive_reward
            
        if random.random() < self.epsilon:
            self.decay_epsilon()
            self.last_action = random.choice(valid)
            return self.last_action
            
        prefs = self.H[x, y].copy()
        prefs[~np.isin(np.arange(num_actions), valid)] = -float('inf')
        exp_prefs = np.exp(prefs - np.max(prefs))
        probs = exp_prefs / exp_prefs.sum()
        
        self.last_action = np.random.choice(range(num_actions), p=probs)
        self.last_probs = probs
        return self.last_action

    def update(
        self, ps: Tuple[int, int], a: int, ns: Tuple[int, int], r: float
    ):
        # Add intrinsic reward for exploration
        if self.env and self.env.shared:
            visit_count = self.env.shared.record_visit(self.id, ns)
            r += tools_cfg.beta_intrinsic / np.sqrt(visit_count + 1)
            
        x, y = ps
        nx, ny = ns
        td_err = r + tools_cfg.gamma * self.V[nx, ny] - self.V[x, y]
        self.V[x, y] += tools_cfg.alpha_critic * td_err
        for ai in range(num_actions):
            if ai == a:
                self.H[x, y, ai] += tools_cfg.alpha_actor * td_err * (1 - self.last_probs[ai])
            else:
                self.H[x, y, ai] -= tools_cfg.alpha_actor * td_err * self.last_probs[ai]



### Logger Class

The `Logger` class records and visualises learning metrics across episodes. It maintains the following state variables:

1. **Agent types**  
   A Python `dict` mapping each agent's `id` to `"ActorCritic"` or `"DoubleQ"`.

2. **Episode rewards**  
   A Python `dict` of lists: `episode_rewards[aid]` accumulates each agent's total reward per episode.

3. **Rewards by type**  
   Two lists `rewards_by_type["ActorCritic"]` and `rewards_by_type["DoubleQ"]` holding average reward per episode.

4. **Capture times**  
   A list `capture_times` of integers recording how many steps each episode took to catch the hider.

5. **Coverage rates**  
   A list `coverage_rates` of floats measuring fraction of free cells explored:  
   $$
     \text{coverage}_e \;=\; \frac{|\text{env.shared.explored}|}{N^2 \;-\;\sum_{i,j}G_{ij}}
   $$

6. **Communication counts**  
   A list `comm_counts` of total communication events (messages sent + received) per episode.

7. **Visit counts**  
   A NumPy array `visit_counts` of shape $(A,N,N)$ tallying how often each agent visits each grid cell.

8. **Grace periods**  
   An optional list `grace_periods` tracking the environment's `grace_counter` per episode.

9. **Figures**  
   A list `figures` storing matplotlib `Figure` objects produced by `plot`.

10. **Episode actions**  
    A list of lists tracking all actions taken in each episode.

11. **Episode captures**  
    A list tracking when captures occur in each episode.

12. **Episode switch counts**  
    A list tracking role switches per episode.

13. **Reward variance**  
    A list tracking variance in rewards across agents per episode.

14. **Action variance**  
    A list tracking variance in action selection per episode.

15. **Learning progress**  
    A list tracking Q-value convergence over time.

---

#### Key Methods

1. *__init__(self, grid_size: int, agents: List[AgentBase])*  
   - Initialises all tracking containers
   - Sets up agent type mapping
   - Creates visit count arrays
   - Initialises episode counters  
    <div style="margin-bottom:1em;"></div>

2. *log_step(self, env, agents: List[AgentBase])*  
   - Records actions for current step
   - Updates explored cells tracking
   - Checks for and records captures
   - Updates visit counts for all agents  
    <div style="margin-bottom:1em;"></div>

3. *log_episode(self, ep: int, agents: List[AgentBase], env: HideSeekEnv)*  
   - Records episode-level metrics:
     - Rewards by agent and type
     - Capture times and role switches
     - Coverage and communication rates
     - Action and reward variances
     - Learning progress indicators
   - Updates visit counts and exploration tracking
   - Clears step-level buffers for next episode  
    <div style="margin-bottom:1em;"></div>

4. *plot(self)*  
   - Generates multiple visualization figures:
     1. **Reward Analysis**
        - Smoothed average rewards by agent type
        - Reward variance over time
        - Role-specific performance metrics
     2. **Learning Progress**
        - Q-value convergence
        - Action selection patterns
        - Exploration efficiency
     3. **Game Dynamics**
        - Capture times and success rates
        - Coverage rates and exploration patterns
        - Communication frequency
     4. **Agent Behavior**
        - Visit heatmaps for each agent
        - Action distribution analysis
        - Role switch patterns
   - Stores all figures in `self.figures`  

In [529]:
class Logger:
    """
    Records and plots training metrics.
    """

    def __init__(self, grid_size: int, agents: List[AgentBase]):
        self.agent_types = {ag.id: ('ActorCritic' if isinstance(ag, ActorCriticAgent) else 'DoubleQ') for ag in agents}
        self.episode_rewards: Dict[int, List[float]] = {aid: [] for aid in self.agent_types}
        self.rewards_by_type: Dict[str, List[float]] = {'ActorCritic': [], 'DoubleQ': []}
        self.capture_times: List[int] = []  # one entry per episode
        self.coverage_rates: List[float] = []
        self.comm_counts: List[int] = []
        self.visit_counts = np.zeros((len(agents), grid_size, grid_size), dtype=int)
        self.last_episode_visits = np.zeros((len(agents), grid_size, grid_size), dtype=int)
        self.grace_periods: List[int] = []
        self.figures = []
        self.episode_count = 0
        self.moves_per_episode: List[int] = []
        self.stay_actions_per_episode: List[int] = []
        self.blocked_moves_per_episode: List[int] = []
        self.total_actions_per_episode: List[int] = []
        self.current_episode_moves = 0
        self.avg_q_values: List[float] = []
        self.avg_values: List[float] = []
        self.episode_actions = []
        self.episode_captures = []  # Track captures per episode
        self.episode_switch_counts = []  # Track role switches per episode
        self.episode_reward_variance = []  # Track reward variance per episode
        self.episode_action_variance = []  # Track action variance per episode
        self.successful_captures = []  # Track successful captures
        self.avg_capture_time = []  # Track average time to capture
        self.exploration_efficiency = []  # Track how quickly agents explore
        self.episode_lengths = []  # Track how long episodes last
        self.learning_progress = []  # Track Q-value changes
        self.episode_explored_cells = []  # Track explored cells per episode

    def log_step(self, env, agents: List[AgentBase]):
        # Record all agents' actions for this step
        step_actions = []
        for agent in agents:
            step_actions.append((agent.last_action, agent.moved_last_action))
        if len(self.episode_actions) == 0:
            self.episode_actions.append([])
        self.episode_actions[-1].append(step_actions)
        
        # Track explored cells for this episode
        if len(self.episode_explored_cells) == 0:
            self.episode_explored_cells.append(set())
        self.episode_explored_cells[-1].update(env.shared.explored)
        
        # Check for captures in this step
        hider = next((a for a in agents if a.role == 'hider'), None)
        if hider:
            for agent in agents:
                if agent.role == 'seeker' and abs(agent.x - hider.x) + abs(agent.y - hider.y) <= 1:
                    if not self.episode_captures or self.episode_captures[-1] != env.time:
                        self.episode_captures.append(env.time)
                        break  

    def log_episode(self, ep: int, agents: List[AgentBase], env: HideSeekEnv):
        if self.episode_captures:
            first_cap = self.episode_captures[0]
        else:
            first_cap = tools_cfg.t_max
        self.capture_times.append(first_cap)
        self.episode_count = ep
        self.last_episode_visits.fill(0)

        self.episode_switch_counts.append(env.switch_count)
        
        episode_rewards = [ag.reward_sum for ag in agents]
        self.episode_reward_variance.append(np.var(episode_rewards))
        
        action_counts = np.zeros(num_actions)
        for step_actions in self.episode_actions[-1]:
            for action, _ in step_actions:
                action_counts[action] += 1
        self.episode_action_variance.append(np.var(action_counts))
        
        total_free = tools_cfg.grid_size ** 2 - env.grid.sum()
        # Use the episode-specific explored cells
        episode_coverage = len(self.episode_explored_cells[-1]) / total_free
        self.coverage_rates.append(episode_coverage)
        
        # Start a new set for the next episode
        self.episode_explored_cells.append(set())
        
        total_moves = 0
        total_stays = 0
        total_blocked = 0
        total_actions = 0
        total_q = 0
        total_v = 0
        q_count = 0
        v_count = 0
        
        for ag in agents:
            self.episode_rewards[ag.id].append(ag.reward_sum)
            x, y = ag.state()
            self.visit_counts[ag.id, x, y] += 1
            self.last_episode_visits[ag.id, x, y] += 1
            
            if isinstance(ag, DoubleQLearningMixin):
                total_q += np.mean(ag.Q1 + ag.Q2)
                q_count += 1
            elif isinstance(ag, ActorCriticAgent):
                total_v += np.mean(ag.V)
                v_count += 1

        if q_count > 0:
            self.avg_q_values.append(total_q / q_count)
        if v_count > 0:
            self.avg_values.append(total_v / v_count)

        for typ in self.rewards_by_type:
            ids = [aid for aid, t in self.agent_types.items() if t == typ]
            vals = [self.episode_rewards[aid][ep] for aid in ids]
            avg = float(np.mean(vals)) if vals else 0.0
            self.rewards_by_type[typ].append(avg)

        total_free = tools_cfg.grid_size ** 2 - env.grid.sum()
        coverage = len(env.shared.explored) / total_free
        self.coverage_rates.append(coverage)
        self.comm_counts.append(env.shared.total_comm_count)
        if hasattr(env, 'grace_counter'):
            self.grace_periods.append(env.grace_counter)

        # Count all actions from the episode_actions list
        for step_actions in self.episode_actions[-1]:  
            for action, moved in step_actions:
                total_actions += 1
                if action == num_actions - 1:  # Stay action
                    total_stays += 1
                elif not moved:  
                    total_blocked += 1
                else:  # Successful move
                    total_moves += 1
        
        self.moves_per_episode.append(total_moves)
        self.stay_actions_per_episode.append(total_stays)
        self.blocked_moves_per_episode.append(total_blocked)
        self.total_actions_per_episode.append(total_actions)
        
        self.episode_actions.append([])

        self.episode_captures.clear()

        # Track successful captures
        if env.switch_count > 0:
            self.successful_captures.append(1)
        else:
            self.successful_captures.append(0)
            
        # Track average capture time
        if self.episode_captures:
            self.avg_capture_time.append(np.mean(self.episode_captures))
        else:
            self.avg_capture_time.append(tools_cfg.t_max)
            
        # Track exploration efficiency
        self.exploration_efficiency.append(coverage)
        
        # Track episode length
        self.episode_lengths.append(env.time)
        
        # Track learning progress (Q-value changes)
        q_changes = []
        for ag in agents:
            if isinstance(ag, DoubleQLearningMixin):
                q_changes.append(np.mean(np.abs(ag.Q1 - ag.Q2)))
        if q_changes:
            self.learning_progress.append(np.mean(q_changes))

    def plot(self):
        # Create a figure for rewards and moves
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Plot 1: Smoothed reward
        df = pd.DataFrame(self.rewards_by_type)
        df.rolling(window=50).mean().plot(ax=axes[0, 0])
        axes[0, 0].set_title('Smoothed Avg Reward per Agent Type')
        axes[0, 0].set_xlabel('Episode')
        axes[0, 0].set_ylabel('Reward')
        axes[0, 0].grid(True)
        
        # Plot 2: Movement statistics
        moves_df = pd.DataFrame({
            'Successful Moves': self.moves_per_episode,
            'Stay Actions': self.stay_actions_per_episode,
            'Blocked Moves': self.blocked_moves_per_episode
        })
        moves_df.rolling(window=50).mean().plot(ax=axes[0, 1])
        axes[0, 1].set_title('Movement Statistics per Episode')
        axes[0, 1].set_xlabel('Episode')
        axes[0, 1].set_ylabel('Count')
        axes[0, 1].grid(True)
        
        # Plot 3: Q-values
        if self.avg_q_values:
            q_series = pd.Series(self.avg_q_values)
            q_series.rolling(window=50).mean().plot(ax=axes[1, 0], label='Q-values')
            axes[1, 0].set_title('Smoothed Average Q-values')
            axes[1, 0].set_xlabel('Episode')
            axes[1, 0].set_ylabel('Average Q-value')
            axes[1, 0].grid(True)
        
        # Plot 4: Reward Variance
        reward_variance = pd.DataFrame(self.rewards_by_type).rolling(window=50).var()
        reward_variance.plot(ax=axes[1, 1])
        axes[1, 1].set_title('Rolling Variance of Rewards (50-episode window)')
        axes[1, 1].set_xlabel('Episode')
        axes[1, 1].set_ylabel('Variance')
        axes[1, 1].grid(True)
        
        plt.tight_layout()
        self.figures.append(fig)

        # Create a figure for capture times and learning metrics
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Plot capture times (actual captures)
        capture_series = pd.Series(self.capture_times)
        capture_series.rolling(window=50).mean().plot(ax=axes[0, 0])
        axes[0, 0].set_title('Average Time to Capture')
        axes[0, 0].set_xlabel('Episode')
        axes[0, 0].set_ylabel('Time Steps')
        axes[0, 0].grid(True)
        
        # Plot role switches over time
        switch_series = pd.Series(self.episode_switch_counts)
        switch_series.rolling(window=50).mean().plot(ax=axes[0, 1])
        axes[0, 1].set_title('Average Role Switches per Episode')
        axes[0, 1].set_xlabel('Episode')
        axes[0, 1].set_ylabel('Number of Switches')
        axes[0, 1].grid(True)
        
        # Plot action variance over time
        action_var_series = pd.Series(self.episode_action_variance)
        action_var_series.rolling(window=50).mean().plot(ax=axes[1, 0])
        axes[1, 0].set_title('Action Selection Variance')
        axes[1, 0].set_xlabel('Episode')
        axes[1, 0].set_ylabel('Variance')
        axes[1, 0].grid(True)
        
        # Plot reward variance over time
        reward_var_series = pd.Series(self.episode_reward_variance)
        reward_var_series.rolling(window=50).mean().plot(ax=axes[1, 1])
        axes[1, 1].set_title('Episode Reward Variance')
        axes[1, 1].set_xlabel('Episode')
        axes[1, 1].set_ylabel('Variance')
        axes[1, 1].grid(True)
        
        plt.tight_layout()
        self.figures.append(fig)

        # Create a figure for capture times and coverage
        fig, axes = plt.subplots(2, 1, figsize=(15, 10))
        
        # Plot capture times
        capture_series = pd.Series(self.capture_times)
        capture_series.rolling(window=50).mean().plot(ax=axes[0])
        axes[0].set_title('Smoothed Capture Time per Episode')
        axes[0].set_xlabel('Episode')
        axes[0].set_ylabel('Time Steps')
        axes[0].grid(True)
        
        # Plot coverage rate
        coverage_series = pd.Series(self.coverage_rates)
        coverage_series.rolling(window=50).mean().plot(ax=axes[1])
        axes[1].set_title('Smoothed Coverage Rate per Episode')
        axes[1].set_xlabel('Episode')
        axes[1].set_ylabel('Coverage Rate')
        axes[1].grid(True)
        
        plt.tight_layout()
        self.figures.append(fig)

        # Create a figure for learning metrics
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Plot 1: Success Rate
        success_rate = pd.Series(self.successful_captures).rolling(window=50).mean()
        success_rate.plot(ax=axes[0, 0])
        axes[0, 0].set_title('Capture Success Rate (50-episode window)')
        axes[0, 0].set_xlabel('Episode')
        axes[0, 0].set_ylabel('Success Rate')
        axes[0, 0].grid(True)
        
        # Plot 2: Average Capture Time
        # Only consider episodes where a capture occurred
        capture_times = [t for t in self.capture_times if t < tools_cfg.t_max]
        if capture_times:
            capture_time = pd.Series(capture_times).rolling(window=50).mean()
            capture_time.plot(ax=axes[0, 1])
            axes[0, 1].set_title('Average Time to Capture (only successful captures)')
            axes[0, 1].set_xlabel('Episode')
            axes[0, 1].set_ylabel('Time Steps')
            axes[0, 1].grid(True)
        
        # Plot 3: Exploration Efficiency
        exploration = pd.Series(self.exploration_efficiency).rolling(window=50).mean()
        exploration.plot(ax=axes[1, 0])
        axes[1, 0].set_title('Exploration Efficiency')
        axes[1, 0].set_xlabel('Episode')
        axes[1, 0].set_ylabel('Coverage Rate')
        axes[1, 0].grid(True)
        
        # Plot 4: Q-value Convergence
        if self.learning_progress:
            learning = pd.Series(self.learning_progress).rolling(window=50).mean()
            learning.plot(ax=axes[1, 1])
            axes[1, 1].set_title('Q-value Convergence')
            axes[1, 1].set_xlabel('Episode')
            axes[1, 1].set_ylabel('Q-value Difference')
            axes[1, 1].grid(True)
        
        plt.tight_layout()
        self.figures.append(fig)

        # Create a figure for reward and action statistics
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Plot 1: Reward Variance
        reward_variance = pd.DataFrame(self.rewards_by_type).rolling(window=50).var()
        reward_variance.plot(ax=axes[0, 0])
        axes[0, 0].set_title('Rolling Variance of Rewards (50-episode window)')
        axes[0, 0].set_xlabel('Episode')
        axes[0, 0].set_ylabel('Variance')
        axes[0, 0].grid(True)
        
        # Plot 2: Role Switches
        switch_series = pd.Series(self.episode_switch_counts)
        switch_series.rolling(window=50).mean().plot(ax=axes[0, 1])
        axes[0, 1].set_title('Average Role Switches per Episode')
        axes[0, 1].set_xlabel('Episode')
        axes[0, 1].set_ylabel('Number of Switches')
        axes[0, 1].grid(True)
        
        # Plot 3: Episode Lengths
        length_series = pd.Series(self.episode_lengths)
        length_series.rolling(window=50).mean().plot(ax=axes[1, 0])
        axes[1, 0].set_title('Average Episode Length')
        axes[1, 0].set_xlabel('Episode')
        axes[1, 0].set_ylabel('Time Steps')
        axes[1, 0].grid(True)
        
        # Plot 4: Action Variance
        action_var_series = pd.Series(self.episode_action_variance)
        action_var_series.rolling(window=50).mean().plot(ax=axes[1, 1])
        axes[1, 1].set_title('Action Selection Variance')
        axes[1, 1].set_xlabel('Episode')
        axes[1, 1].set_ylabel('Variance')
        axes[1, 1].grid(True)
        
        plt.tight_layout()
        self.figures.append(fig)

# 4. Choice & Description of Data  

In this work, we employ a fully synthetic, grid-based environment to isolate and evaluate multi-agent hide-and-seek strategies under controlled complexity. All parameters are configurable, allowing systematic study of agent behaviour as environment properties vary.


### Training and Demo Function

The `run_training_and_demo()` function orchestrates the complete training process for the hide-and-seek environment. It maintains the following key components:

1. **Environment Setup**  
   - Creates `HideSeekEnv` instance
   - Initialises agents with specific roles:
     - Agent 0: `RLSeekerAgent`
     - Agent 1: `ActorCriticAgent`
     - Remaining agents: `RLHiderAgent`

2. **Performance Tracking**  
   - Dictionary `episode_rewards` tracking rewards per agent
   - Dictionary `best_rewards` storing best performance
   - Dictionary `no_improvement_count` tracking learning plateaus

3. **Logger**  
   - `Logger` instance for metric collection and visualization

---

#### Key Features

1. **Multi-Agent Training**  
   - Coordinates multiple agents with different learning algorithms
   - Handles role-specific updates and rewards

2. **Adaptive Learning**  
   - Adjusts learning rates based on performance
   - Tracks learning plateaus
   - Implements intrinsic rewards for exploration

3. **Comprehensive Logging**  
   - Records detailed metrics at step and episode levels
   - Tracks performance across different agent types
   - Generates visualisation plots

4. **Role-Specific Processing**  
   - Different update logic for seeker vs hider agents
   - Belief-based updates for seekers
   - Evasion-based updates for hiders

5. **Performance Optimisation**  
   - Tracks best performance per agent
   - Monitors learning progress
   - Adapts learning parameters

In [531]:
# ---------------------- Training & Demo ----------------------
def run_training_and_demo():
    """
    Initialise environment, create agents, run training for num_episodes,
    log metrics and plot results.
    """
    print("\nInitializing environment and agents...")
    env = HideSeekEnv()
    agents: List[AgentBase] = []
    for aid in range(tools_cfg.num_agents):
        if aid == 0:
            ag = RLSeekerAgent(aid)
        elif aid == 1:
            ag = ActorCriticAgent(aid)
        else:
            ag = RLHiderAgent(aid)
        env.add_agent(ag, (0, 0))
        agents.append(ag)
    logger = Logger(tools_cfg.grid_size, agents)

    # Performance tracking
    episode_rewards = {aid: [] for aid in range(tools_cfg.num_agents)}
    best_rewards = {aid: float('-inf') for aid in range(tools_cfg.num_agents)}
    no_improvement_count = {aid: 0 for aid in range(tools_cfg.num_agents)}
    
    print(f"\nStarting training for {tools_cfg.num_episodes} episodes...")
    print("=" * 80)

    for ep in range(tools_cfg.num_episodes):
        env.reset()
        for ag in agents:
            ag.visited.clear()
        logger.episode_actions.append([])
        
        episode_reward = {aid: 0.0 for aid in range(tools_cfg.num_agents)}
        
        for t in range(tools_cfg.t_max):
            prev_states = {ag.id: ag.state() for ag in agents}
            actions_dict = {
                ag.id: ag.select_action(None, env.shared, env.time, agents)
                for ag in agents
            }
            rewards = env.step(actions_dict)
            logger.log_step(env, agents)

            for ag in agents:
                ps, ns = prev_states[ag.id], ag.state()
                r = rewards[ag.id]
                episode_reward[ag.id] += r
                
                if tools_cfg.use_intrinsic and ag.role == 'seeker':
                    cnt = env.shared.record_visit(ag.id, ns)
                    r += tools_cfg.beta_intrinsic / np.sqrt(cnt + 1)

                if isinstance(ag, ActorCriticAgent):
                    ag.update(ps, actions_dict[ag.id], ns, r)
                else:
                    if isinstance(ag, RLSeekerAgent):
                        belief_pre = env.shared.get_belief(env.time, top_n=1)
                        ref_pre = belief_pre[0][0] if belief_pre else ps
                        bps = compute_bdir(ref_pre[0] - ps[0], ref_pre[1] - ps[1])
                        belief_post = env.shared.get_belief(env.time, top_n=1)
                        ref_post = belief_post[0][0] if belief_post else ns
                        bns = compute_bdir(ref_post[0] - ns[0], ref_post[1] - ns[1])
                    else:
                        bps, bns = 0, 0
                    ag.update(ps, actions_dict[ag.id], ns, r, bps, bns)
                ag.reward_sum += r

        # Update learning rates based on performance
        for ag in agents:
            if isinstance(ag, DoubleQLearningMixin):
                ag.update_learning_rate(episode_reward[ag.id])
            
            # Track best performance
            if episode_reward[ag.id] > best_rewards[ag.id]:
                best_rewards[ag.id] = episode_reward[ag.id]
                no_improvement_count[ag.id] = 0
            else:
                no_improvement_count[ag.id] += 1
                
            episode_rewards[ag.id].append(episode_reward[ag.id])
                
        logger.log_episode(ep, agents, env)
        
        if (ep + 1) % 100 == 0:
            print(f"Episode {ep + 1}/{tools_cfg.num_episodes} completed")

    print("\nTraining completed. Generating plots...")
    logger.plot()

### TestCoreExtended Class

The `TestCoreExtended` class verifies the environment is created properly and handles edge cases well. This would give us confident in our complex learning algorithms implemented.

_setUp_
 - Initialises a fresh environment with a seeker and hider in known positions for consistent testing conditions.

_test_seeker_sees_hider_updates_belief_
 - Verifies that seeker sightings correctly update the shared belief system with proper weight distribution.

_test_capture_does_not_update_belief_
 - Ensures belief system maintains appropriate uncertainty levels after hider captures.

_test_invisible_hider_not_seen_
 - Confirms that seekers cannot detect hiders during their grace period after role switches.

_test_role_switch_logic_
 - Validates correct role swapping and grace period application after successful captures.

_test_intrinsic_reward_added_for_seekers_
 - Tests the exploration bonus system for seekers, ensuring proper reward calculation and application.

_test_full_episode_runs_t_max_steps_
 - Guarantees episodes run for exactly the specified maximum number of steps without premature termination.

_test_training_loop_never_ends_early_
 - Ensures training episodes maintain consistent length and proper termination conditions.

In [533]:
class TestCoreExtended(unittest.TestCase):
    def setUp(self):
        """Shared setup for each test."""
        self.env = HideSeekEnv()
        self.seeker = RLSeekerAgent(0)
        self.hider = RLHiderAgent(1)
        self.env.grid = np.zeros_like(self.env.grid)
        self.env.add_agent(self.seeker, (5, 5))
        self.env.add_agent(self.hider, (5, 6))
        self.seeker.role = 'seeker'
        self.hider.role = 'hider'

    def test_seeker_sees_hider_updates_belief(self):
        self.seeker.invisible = self.hider.invisible = 0
        self.env.shared.clear_step()
        self.env.shared.decay_confidence(self.env.time)
        self.env.shared.diffuse_belief()
        obs, seen = self.seeker.observe([self.seeker, self.hider], self.env)
        self.assertTrue(seen)
        self.env.shared.record_sighting(self.seeker.id, self.hider.state(), self.env.time)
        top_loc, weight = self.env.shared.get_belief(self.env.time, top_n=1)[0]
        self.assertEqual(top_loc, self.hider.state())
        self.assertAlmostEqual(weight, 1.0, places=2)

    def test_capture_does_not_update_belief(self):
        self.env.shared.clear_episode()
        self.env.shared.confidence = 0.5
        self.env.shared.belief.fill(1 / (tools_cfg.grid_size ** 2))
        rewards = self.env.step({self.seeker.id: 0})
        top_loc, weight = self.env.shared.get_belief(self.env.time, top_n=1)[0]
        self.assertLess(weight, 1.0)

    def test_invisible_hider_not_seen(self):
        self.hider.invisible = 999
        obs, seen = self.seeker.observe([self.seeker, self.hider], self.env)
        self.assertFalse(seen)

    def test_role_switch_logic(self):
        rewards = self.env.step({self.seeker.id: 0})
        self.assertEqual(self.seeker.role, 'hider')
        self.assertEqual(self.hider.role, 'seeker')
        self.assertEqual(self.seeker.invisible, tools_cfg.grace_steps)

    def test_intrinsic_reward_added_for_seekers(self):
        self.env.reset()
        self.seeker.x, self.seeker.y = 5, 5
        self.hider.x, self.hider.y = 0, 0
        prev = self.seeker.reward_sum
        prev_pos = self.seeker.state()
        act = self.seeker.select_action(None, self.env.shared, self.env.time, [self.seeker, self.hider])
        rewards = self.env.step({self.seeker.id: act})
        post_pos = self.seeker.state()
        cnt = self.env.shared.record_visit(self.seeker.id, post_pos)
        bonus = tools_cfg.beta_intrinsic / np.sqrt(cnt + 1)
        self.assertGreater(rewards[self.seeker.id] + bonus, rewards[self.seeker.id])

    def test_full_episode_runs_t_max_steps(self):
        """Ensure each episode always advances exactly t_max steps."""
        for _ in range(3):
            env = HideSeekEnv()
            agents: List[AgentBase] = []
            for aid in range(tools_cfg.num_agents):
                ag = [RLSeekerAgent, ActorCriticAgent, RLHiderAgent][min(aid, 2)](aid)
                env.add_agent(ag, (0, 0))
                agents.append(ag)

            env.reset()
            self.assertEqual(env.time, 0)
            for _ in range(tools_cfg.t_max):
                actions_dict = {
                    ag.id: ag.select_action(None, env.shared, env.time, agents)
                    for ag in agents
                }
                env.step(actions_dict)

            self.assertEqual(
                env.time,
                tools_cfg.t_max,
                f"Expected env.time == {tools_cfg.t_max}, got {env.time}"
            )

    def test_training_loop_never_ends_early(self):
        """Ensure run_training_and_demo always runs each episode for t_max steps."""
        # Create a small test environment
        env = HideSeekEnv()
        agents = []
        for aid in range(tools_cfg.num_agents):
            ag = [RLSeekerAgent, ActorCriticAgent, RLHiderAgent][min(aid, 2)](aid)
            env.add_agent(ag, (0, 0))
            agents.append(ag)
            
        # Run just 10 episodes to test
        for ep in range(10):
            env.reset()
            for t in range(tools_cfg.t_max):
                actions_dict = {
                    ag.id: ag.select_action(None, env.shared, env.time, agents)
                    for ag in agents
                }
                env.step(actions_dict)
            self.assertEqual(env.time, tools_cfg.t_max, 
                           f"Episode {ep} ended at step {env.time}, expected {tools_cfg.t_max}")


In [535]:
# Main execution
if __name__ == "__main__":
    print("Running main...")
    suite = unittest.TestLoader().loadTestsFromTestCase(TestCoreExtended)
    print("Running tests...")
    unittest.TextTestRunner(verbosity=2).run(suite)
    print("Tests completed, running training and demo...")

run_training_and_demo()
print("Program completed.")


test_capture_does_not_update_belief (__main__.TestCoreExtended.test_capture_does_not_update_belief) ... ok
test_full_episode_runs_t_max_steps (__main__.TestCoreExtended.test_full_episode_runs_t_max_steps)
Ensure each episode always advances exactly t_max steps. ... ok
test_intrinsic_reward_added_for_seekers (__main__.TestCoreExtended.test_intrinsic_reward_added_for_seekers) ... ok
test_invisible_hider_not_seen (__main__.TestCoreExtended.test_invisible_hider_not_seen) ... ok
test_role_switch_logic (__main__.TestCoreExtended.test_role_switch_logic) ... ok
test_seeker_sees_hider_updates_belief (__main__.TestCoreExtended.test_seeker_sees_hider_updates_belief) ... ok
test_training_loop_never_ends_early (__main__.TestCoreExtended.test_training_loop_never_ends_early)
Ensure run_training_and_demo always runs each episode for t_max steps. ... 

Running main...
Running tests...


ok

----------------------------------------------------------------------
Ran 7 tests in 0.464s

OK


Tests completed, running training and demo...

Initializing environment and agents...

Starting training for 2000 episodes...
Episode 100/2000 completed
Episode 200/2000 completed
Episode 300/2000 completed
Episode 400/2000 completed
Episode 500/2000 completed
Episode 600/2000 completed
Episode 700/2000 completed
Episode 800/2000 completed
Episode 900/2000 completed
Episode 1000/2000 completed
Episode 1100/2000 completed
Episode 1200/2000 completed
Episode 1300/2000 completed
Episode 1400/2000 completed
Episode 1500/2000 completed
Episode 1600/2000 completed
Episode 1700/2000 completed
Episode 1800/2000 completed
Episode 1900/2000 completed
Episode 2000/2000 completed

Training completed. Generating plots...
Program completed.


<img src="training_outputs/plot_1.png" width="900"/>  
<img src="training_outputs/plot_2.png" width="900"/>  
<img src="training_outputs/plot_3.png" width="900"/>  
<img src="training_outputs/plot_4.png" width="900"/>  
<img src="training_outputs/plot_5.png" width="900"/>  

## 5. Numerical Evaluation

Multi-agent environments can be categorised as cooperative, competitive, or mixed. In mixed settings, such as ours, analysing dynamics without explicit equilibrium analysis is inherently incomplete. Nevertheless, our findings provide reasons to believe that agents in our scenario are indeed moving towards equilibrium.

Firstly, Kuba et al. (2022) highlight that policy-gradient methods become less effective in multi-agent settings if opponents simultaneously update their policies, treating others merely as part of the environment. Centralised training with decentralised execution frameworks, such as Multi-Agent Proximal Policy Optimisation, typically mitigates this issue by evaluating policies across larger action samples. Although we utilised a standard policy-gradient method (actor-critic) without centralised training, convergence was still observed.

This convergence is demonstrated by an increase in the average number of role switches per episode, coupled with a synchronous and gradual decline in the average capture time. Within our environment, this trend suggests that seeker agents improved coordination through information diffusion mechanisms, effectively converging towards the hider’s last known position. As episodes increased and role switching became rewarding, seekers learned more efficient movements towards the hider, thus increasing switching frequency.

Additionally, the observed rise in episode reward variance underscores the reward mechanism—agents gain substantial rewards when successfully locating the hider; therefore, greater switching frequency naturally increased reward variability.

For agents employing the Double Q-Learning algorithm, we also observed expected convergence of Q-values. Initially, there was a pronounced increase in Q-value differences due to the exploration of previously unseen state-action pairs. However, after approximately 600 episodes, these differences steadily diminished, indicating that sufficient states had been explored, allowing Q-values to stabilise.

The per-agent plots comparing smoothed rewards indicated minimal differences between Double Q-learning and actor-critic methods, implying both algorithms adapted effectively and learned at comparable rates.

Other performance metrics considered included the average number of time steps required for an agent to be captured and the generalisation capabilities of agents in new environments. However, imbalanced agent numbers across learning paradigms caused significantly higher payoff variance for actor-critic agents, rendering comparisons between paradigms unreliable, as illustrated by the Rolling Variance of Rewards plot. Additionally, the coverage of the 10x10 grid ranged only from 14% to 20%, suggesting inadequate exploration even on fixed maps with known obstacles, thereby limiting agents’ abilities to generalise effectively to new environments.

# 6. Conclusion

Throughout this project, we trained multiple agents within a grid world environment using two reinforcement learning methods—Actor-Critic and Double Q learning. We have analysed how these decentralised training and decentralised execution perform in a novel mixed (cooperative and competitive) multi-agent setting of interaction. Our results find that both Actor-Critic and Double Q-learning agents demonstrated effective convergence behaviours. This was seen particularly for the Double-Q-Learning agents through the constant decline in the difference between Q-values. 

What could be considered is dynamics that could arise in the environment beyond 2000 episodes. In particular, we would like to see if training a set map for long enough would (i) increase the agents' coverage of the map and (ii) allow agents to generalise their learning to new maps (with different sets of obstacles). Moreover, decentralised training and execution frameworks are known to converge slower to equilibria as agents sequentially respond to the previous policies of each other. An interesting question to consider is how much quicker centralised training would perform in this novel multi-agent environment. 

In sum, this project demonstrates that multiple agents can learn to coordinate and compete in a simple grid world. The findings highlight the importance of stability and reward design in multi-agent systems and point the way towards more robust and scalable learning strategies.


## 8. References

- Kuba, Jakub Grudzien, et al. “Trust region policy optimisation in multi-agent reinforcement learning.” *arXiv preprint arXiv:2109.11251* (2021).

- van Hasselt, Hado; Guez, Arthur; Silver, David. “Deep Reinforcement Learning with Double Q-learning.” *Proceedings of the AAAI Conference on Artificial Intelligence* (2016).

- Sutton, Richard S.; McAllester, David A.; Singh, Satinder P.; Mansour, Yishay. “Policy gradient methods for reinforcement learning with function approximation.” *Advances in Neural Information Processing Systems* 12 (2000).

- Pathak, Deepak; Agrawal, Pulkit; Efros, Alexei A.; Darrell, Trevor. “Curiosity-driven Exploration by Self-supervised Prediction.” *Proceedings of the 34th International Conference on Machine Learning* (2017).
