# Strategic Memory RL: An Agent With Active, Learnable Memory

## Expanded Literature & Reference Models

This agent builds upon and advances the following lines of research:

| **Approach / Paper**                                               | **Core Idea**                                   | **Key Weakness vs This**                                        |
| ------------------------------------------------------------------ | ----------------------------------------------- | --------------------------------------------------------------- |
| **DQN/LSTM-based RL**<br>Hausknecht & Stone, 2015                  | RNN hidden state as memory                      | Struggles with long delays, limited memory                      |
| **Neural Episodic Control**<br>Pritzel et al., 2017                | Non-parametric DND table, kNN retrieval         | No learnable retention, no end-to-end training                  |
| **Differentiable Neural Computer**<br>Graves et al., 2016          | RNN w/ differentiable read/write memory         | Expensive, hard to scale, hard to tune                          |
| **Neural Map / Memory-Augmented RL**<br>Parisotto et al., 2018     | Spatially structured memory, soft addressing    | Static retention, not fully learnable, not cue-driven           |
| **MERLIN (Unsupervised Predictive Memory)**<br>Wayne et al., 2018  | Latent predictive memory, unsupervised training | Retention not explicit, memory not strategic                    |
| **Decision Transformer**<br>Chen et al., 2021                      | Uses a GPT-style transformer over trajectory    | No explicit, persistent external memory; not episodic retrieval |
| **GTrXL (Transformer-XL RL)**<br>Parisotto et al., 2020            | Gated Transformer-XL for RL sequence modeling   | "Memory" = recent history, not explicit retention or recall     |
| **MVP: Memory Value Propagation**<br>Oh et al., 2020               | Learnable memory with value propagation         | Not as interpretable, not retention-focused                     |
| **Recurrent Independent Mechanisms (RIMs)**<br>Goyal et al., 2021  | Modular memory units, attention-based gating    | No persistent, recallable episodic buffer                       |
| **Active Memory / Episodic Control (EC)**<br>Blundell et al., 2016 | Episodic memory with tabular kNN access         | No differentiable retention, no meta-learning                   |

---

## **Expanded Comparison Table**

| Feature / Method              | LSTM PPO     | DNC/NTM        | Decision Transformer | GTrXL             | NEC / DND | Neural Map | **Strategic Memory Agent**  |
| ----------------------------- | ------------ | -------------- | -------------------- | ----------------- | --------- | ---------- | --------------------------- |
| Core Memory Type              | Hidden state | External R/W   | In-Context (GPT)     | Segment history   | kNN table | 2D spatial | Episodic buffer + retention |
| Memory Retention              | Fades        | Manual/learned | None                 | History window    | FIFO      | Manual     | *Learnable, optimized*      |
| Retrieval                     | Implicit     | Soft/explicit  | Implicit             | History attention | kNN/soft  | Soft/read  | *Soft attention*            |
| Retention Learning            | No           | Partial        | No                   | No                | No        | No         | **Yes**                     |
| Interpretable Recall          | No           | Hard           | No                   | Some              | Some      | No         | **Yes (attention, use)**    |
| Persistent Memory             | No           | Partial        | No                   | Partial           | Yes       | Yes        | **Yes**                     |
| Sequence Length               | Short/medium | Short          | *Long*               | *Long*            | Medium    | Medium     | *Long*                      |
| No Hints/Flags                | Yes          | Yes            | Yes                  | Yes               | Yes       | Yes        | **Yes**                     |
| Outperforms on Delayed Reward | ✗            | ±              | ±                    | ±                 | ±         | ±          | **✓✓✓**                     |

---

## **Additional References**

* **Hausknecht & Stone, 2015**: “Deep Recurrent Q-Learning for Partially Observable MDPs”
* **Pritzel et al., 2017**: “Neural Episodic Control”, [arXiv:1703.01988](https://arxiv.org/abs/1703.01988)
* **Parisotto et al., 2018**: “Neural Map: Structured Memory for Deep Reinforcement Learning”, [ICLR 2018](https://openreview.net/forum?id=B14TlG-RW)
* **Wayne et al., 2018**: “Unsupervised Predictive Memory in a Goal-Directed Agent” (MERLIN), [arXiv:1803.10760](https://arxiv.org/abs/1803.10760)
* **Chen et al., 2021**: “Decision Transformer: Reinforcement Learning via Sequence Modeling”, [arXiv:2106.01345](https://arxiv.org/abs/2106.01345)
* **Parisotto et al., 2020**: “Stabilizing Transformers for Reinforcement Learning”, [ICML 2020 (GTrXL)](http://proceedings.mlr.press/v119/parisotto20a.html)
* **Oh et al., 2020**: “Value Propagation Networks”, [ICLR 2020](https://openreview.net/forum?id=B1xSperKvB)
* **Goyal et al., 2021**: “Recurrent Independent Mechanisms”, [ICLR 2021](https://openreview.net/forum?id=mLcmdlEUxy-)
* **Blundell et al., 2016**: “Model-Free Episodic Control”, [arXiv:1606.04460](https://arxiv.org/abs/1606.04460)
* **Graves et al., 2016**: “Hybrid computing using a neural network with dynamic external memory” (DNC), [Nature 2016](https://www.nature.com/articles/nature20101)
* **Sukhbaatar et al., 2015**: “End-To-End Memory Networks”, [arXiv:1503.08895](https://arxiv.org/abs/1503.08895)

---

## Why This Agent Stands Out

* **First to jointly optimize both memory retention (what to keep/discard) and retrieval (what to attend to) in a single, end-to-end RL agent**.
* **Flexible plug-and-play memory**: Can be swapped for many memory architectures (transformers, graph attention, learned compression).
* **No task-specific hacks**: Outperforms the above on classic RL memory benchmarks *without using any domain knowledge* or “cheat” features.
* **Interpretable, practical, and scalable**: Suitable for real-world problems where “what matters” is unknown and must be discovered.
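
To make the retention/retrieval split concrete, here is a minimal pure-Python sketch of a fixed-capacity memory that evicts by retention score and reads by soft attention. It is an illustration only, not the `StrategicMemoryBuffer` implementation: the class `ToyRetentionMemory`, its methods, and the hand-supplied retention scores are all assumptions. In a trained agent the scores would presumably come from a learned network optimized together with the policy.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyRetentionMemory:
    """Hypothetical fixed-capacity memory; each slot holds a vector and a retention score."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.keys = []
        self.retention = []

    def write(self, key, retention_score):
        self.keys.append(list(key))
        self.retention.append(float(retention_score))
        if len(self.keys) > self.max_entries:
            # Evict the slot with the lowest retention score instead of the oldest (FIFO).
            drop = min(range(len(self.retention)), key=self.retention.__getitem__)
            del self.keys[drop]
            del self.retention[drop]

    def read(self, query):
        """Soft attention: softmax over scaled dot-product similarity to the query."""
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, k) / scale for k in self.keys])
        readout = [
            sum(w * k[i] for w, k in zip(weights, self.keys))
            for i in range(len(query))
        ]
        return readout, weights


memory = ToyRetentionMemory(max_entries=2)
memory.write([1.0, 0.0], retention_score=0.9)  # cue worth keeping
memory.write([0.0, 1.0], retention_score=0.1)  # distractor
memory.write([0.0, 0.5], retention_score=0.5)  # overflow: the distractor is evicted
readout, weights = memory.read([1.0, 0.0])
print(memory.keys)   # [[1.0, 0.0], [0.0, 0.5]]
```

The attention weights returned by `read` are what would make recall inspectable: they show which stored entry the agent relied on for a given query.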


In [1]:
from agent import TraceRL
from environments import MemoryTaskEnv
from benchmark import AgentPerformanceBenchmark
from memory import StrategicMemoryBuffer, StrategicMemoryTransformerPolicy




In [2]:
# SETUP ===================================
DELAY = 16
MEM_DIM = 32
N_EPISODES = 2500
N_MEMORIES = 16

AGENT_KWARGS = dict(
    device="cpu",
    verbose=0,
    lam=0.95, 
    gamma=0.99, 
    ent_coef=0.01,
    learning_rate=1e-3, 
    
)
MEMORY_AGENT_KWARGS=dict(
    her=False,
    reward_norm=False,
    aux_modules=None,
    
    intrinsic_expl=True,
    intrinsic_eta=0.01,
    
    use_rnd=True, 
    rnd_emb_dim=32, 
    rnd_lr=1e-3,
)

# HELPERS =================================
def total_timesteps(delay, n_episodes):
    """Training budget in environment steps, assuming each episode lasts `delay` steps."""
    return delay * n_episodes
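
As a quick sanity check on the budget arithmetic, the sketch below recomputes the values for `DELAY = 16` and `N_EPISODES = 2500`. It assumes every episode lasts exactly `delay` steps, and the variable names `budget`, `logged_steps`, `episodes_done`, and `progress` are illustrative only.

```python
def total_timesteps(delay, n_episodes):
    return delay * n_episodes


budget = total_timesteps(16, 2500)   # full training budget in env steps
logged_steps = 4000                  # steps taken at the first log interval
episodes_done = logged_steps // 16   # episodes completed so far
progress = logged_steps / budget     # fraction of the budget used

print(budget, episodes_done, f"{progress:.1%}")  # 40000 250 10.0%
```

These numbers line up with a first log entry at 250 episodes, 4000 total timesteps, and 10.0% progress.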

## **Example:** Simple training setup

In [3]:
# ENVIRONMENT =============================
env = MemoryTaskEnv(delay=DELAY, difficulty=0)

# MEMORY BUFFER ===========================
memory = StrategicMemoryBuffer(
    obs_dim=env.observation_space.shape[0],
    action_dim=1,          # For Discrete(2)
    mem_dim=MEM_DIM,
    max_entries=N_MEMORIES,
    device="cpu"
)

# POLICY NETWORK (use class) ==============
policy = StrategicMemoryTransformerPolicy

# (optional) AUXILIARY MODULES ============
"""
aux_modules = [
    CueAuxModule(feat_dim=MEM_DIM*2, n_classes=2),
    ConfidenceAuxModule(feat_dim=MEM_DIM*2)
]
"""

# AGENT SETUP =============================
agent = TraceRL(
    policy_class=policy,
    env=env,
    memory=memory,
    memory_learn_retention=True,    
    memory_retention_coef=0.01,   
    # aux_modules=aux_modules,  
    device="cpu",
    verbose=1,
    lam=0.95, 
    gamma=0.99, 
    ent_coef=0.01,
    learning_rate=1e-3, 
    
    **MEMORY_AGENT_KWARGS
)

# TRAIN THE AGENT =========================
agent.learn(
    total_timesteps=total_timesteps(DELAY, N_EPISODES),
    log_interval=250
)

-------------------------------------
| rollout/              |           |
|    ep_len_mean        |   16.000  |
|    ep_rew_mean        |    0.062  |
|    ep_rew_std         |    0.999  |
|    policy_entropy     |    0.497  |
|    advantage_mean     |   -0.477  |
|    advantage_std      |    0.127  |
|    aux_loss_mean      |    0.000  |
| time/                 |           |
|    fps                |      142  |
|    episodes           |      250  |
|    time_elapsed       |       28  |
|    total_timesteps    |     4000  |
| train/                |           |
|    loss               |   -2.677  |
|    policy_loss        |   -2.794  |
|    value_loss         |    0.242  |
|    explained_variance |    0.288  |
|    n_updates          |      250  |
|    progress           |  10.0%    |
| rnd_net_dist/         |           |
|    mean_rnd_bonus     |    0.000  |
| memory/               |           |
|    usefulness_loss    |    0.005  |
-------------------------------------
------------

## Benchmark this agent against a regular PPO and a RecurrentPPO

The benchmark uses an environment that requires the agent to remember past observations in order to choose the correct final action.

The final action earns a reward of +1 if it matches the first item of the first observation and -1 otherwise; every other step gives 0 reward, so cause and effect are separated by the full delay.
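
The task structure can be sketched with a toy environment. This is a hypothetical stand-in, not the `MemoryTaskEnv` used in this notebook: the class `ToyDelayedRecallEnv`, the two-element observation layout, and the helper `run_episode` are assumptions made for illustration.

```python
import random


class ToyDelayedRecallEnv:
    """Hypothetical delayed-recall task: the cue appears only in the first observation."""

    def __init__(self, delay, seed=None):
        self.delay = delay
        self.rng = random.Random(seed)
        self.cue = 0
        self.t = 0

    def reset(self):
        self.cue = self.rng.randint(0, 1)
        self.t = 0
        return [float(self.cue), 0.0]  # [cue, normalized time]

    def step(self, action):
        self.t += 1
        done = self.t >= self.delay
        reward = 0.0
        if done:
            # Only the final action is scored, against the cue from step 0.
            reward = 1.0 if action == self.cue else -1.0
        obs = [0.0, self.t / self.delay]  # the cue slot is blanked after step 0
        return obs, reward, done


def run_episode(env, policy):
    """Run one episode; `policy` receives the first and the current observation."""
    first_obs = obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(first_obs, obs))
        total += reward
    return total


env = ToyDelayedRecallEnv(delay=16, seed=0)
remembering = lambda first_obs, obs: int(first_obs[0])  # oracle with perfect recall
print(run_episode(env, remembering))  # 1.0
```

A policy that ignores the cue, such as `lambda first_obs, obs: 0`, scores +1 or -1 depending on the random cue, so its mean episode reward hovers around 0 with a standard deviation close to 1. That is the signature of a chance-level agent in the result tables.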

In [None]:
# BATCH EXPERIMENT SETUP ==================
if __name__ == "__main__":
    EXPERIMENTS = [
        dict(delay=4, n_train_episodes=2000, total_timesteps=total_timesteps(4,2500), difficulty=0, mode_name="EASY", verbose=0, eval_base=True),
        dict(delay=4, n_train_episodes=5000, total_timesteps=total_timesteps(4,3500), difficulty=1, mode_name="HARD", verbose=0, eval_base=True),
        dict(delay=16, n_train_episodes=7500, total_timesteps=total_timesteps(16,3500), difficulty=0, mode_name="EASY", verbose=0, eval_base=False),
        dict(delay=32, n_train_episodes=7500, total_timesteps=total_timesteps(32,5000), difficulty=1, mode_name="EASY", verbose=0, eval_base=False),
        #dict(delay=64, n_train_episodes=15000, total_timesteps=15000*64, difficulty=0, mode_name="HARD", verbose=0, eval_base=False),
        dict(delay=256, n_train_episodes=20000, total_timesteps=total_timesteps(256,10000), difficulty=0, mode_name="HARD", verbose=1, eval_base=False),
    ]

    # Custom memory agent config 
    memory_agent_config = dict(
        action_dim=1,          # For Discrete(2)
        mem_dim=MEM_DIM,
        max_entries=N_MEMORIES,
        policy_class=StrategicMemoryTransformerPolicy,
        **AGENT_KWARGS,
        **MEMORY_AGENT_KWARGS
       
    )

    results = []
    for exp in EXPERIMENTS:
        benchmark = AgentPerformanceBenchmark(exp, memory_agent_config=memory_agent_config)
        results.append(benchmark.run())



Training in EASY mode with delay of 4 steps



Finalizing Results: 100%|██████████| 7/7 [01:22<00:00, 11.81s/step]             


╭────┬──────────────────────┬─────────┬────────┬───────────────┬──────────────┬────────────────╮
│    │ Agent                │   Delay │ Mode   │   Mean Ep Rew │   Std Ep Rew │   Duration (s) │
├────┼──────────────────────┼─────────┼────────┼───────────────┼──────────────┼────────────────┤
│  0 │ PPO                  │       4 │ EASY   │             0 │            1 │        8.36426 │
│  1 │ RecurrentPPO         │       4 │ EASY   │             0 │            1 │       26.2419  │
│  2 │ TraceRL              │       4 │ EASY   │             0 │            1 │       47.7484  │
╰────┴──────────────────────┴─────────┴────────┴───────────────┴──────────────┴────────────────╯

Training in HARD mode with delay of 4 steps



Finalizing Results: 100%|██████████| 7/7 [01:43<00:00, 14.75s/step]             


╭────┬──────────────────────┬─────────┬────────┬───────────────┬──────────────┬────────────────╮
│    │ Agent                │   Delay │ Mode   │   Mean Ep Rew │   Std Ep Rew │   Duration (s) │
├────┼──────────────────────┼─────────┼────────┼───────────────┼──────────────┼────────────────┤
│  0 │ PPO                  │       4 │ HARD   │           0   │     1        │        10.5363 │
│  1 │ RecurrentPPO         │       4 │ HARD   │          -0.1 │     0.994987 │        34.2714 │
│  2 │ TraceRL              │       4 │ HARD   │           0.2 │     0.979796 │        58.0726 │
╰────┴──────────────────────┴─────────┴────────┴───────────────┴──────────────┴────────────────╯

Training in EASY mode with delay of 32 steps



Training TraceRL:   0%|          | 0/3 [00:00<?, ?step/s]