Recursive Language Models extended with a typed, gateway-traced library of LangGraph subgraphs that compounds skill across queries.
Status: Phase 1 complete. Library hypothesis validated on two benchmarks. See EXPERIMENTS.md for the running record and PROPOSAL.md for the system + execution strategy.
80%+ on three benchmark rows with a library built in a single day, vs. 0-40% for the vanilla RLM scaffold. One hand-built node moves LoCoDiff from 40% → 85%.
| Benchmark | Vanilla RLM (M0) | + library |
|---|---|---|
| LoCoDiff Q3-locked | 0/5 (0%) | 4/5 (80%) |
| LoCoDiff stratified-20 | 8/20 (40%) | 17/20 (85%) (node-only baseline) |
| Lab-repo audit (10 seed) | — | 8/10 (80%) |
Library ablation on Q3-locked: removing the library drops accuracy from 80% → 0%. Adding it takes the planner from "rewrites a diff parser across 14-22 turns and times out" to "INVOKEs LIB.git_diff_applier in turn 2 and FINAL_VARs in turn 3", about 65× faster end-to-end.
Recursive Language Models (Zhang, Kraska, and Khattab; MIT CSAIL, 2026) let an LM drive a Python REPL to decompose long context without ever loading it into the model's window. compound-rlm extends that paradigm with library learning: every successful skill becomes a typed, retrievable LangGraph subgraph in a versioned catalog. New queries try retrieval first; the planner only synthesizes (or solves directly) when no library node fits.
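The core RLM move can be illustrated in miniature: the long context lives inside a sandbox as a variable, and the model only ever sees the small slices its own code selects. The `repl_exec` helper below is a stand-in for the real sandbox in `core/`, not its actual API:

```python
# Minimal illustration of the RLM idea: the context never enters the
# model's window; the planner's code slices it inside the sandbox.
context = "\n".join(f"line {i}: payload" for i in range(100_000))

def repl_exec(code: str, env: dict) -> str:
    """Hypothetical stand-in for the sandboxed REPL in core/."""
    local = dict(env)
    exec(code, {}, local)
    return str(local.get("out", ""))

# One planner turn: grep for the relevant line instead of reading
# 100k lines into the model's context window.
result = repl_exec(
    "out = [l for l in ctx.splitlines() if 'line 42:' in l][:1]",
    {"ctx": context},
)
print(result)  # → ['line 42: payload']
```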
```
                  ROOT PLANNER (DSPy)
query ──►  inputs: query, ctx, top-K library candidates
           output: INVOKE(node_id, args) | DIRECT_SOLVE | (later: SYNTHESIZE)
                   │
   ┌───────────────┼────────────────┐
   ▼               ▼                ▼
LIBRARY        SYNTHESIZE      DIRECT_SOLVE
RETRIEVE       (Phase 2)       (vanilla RLM in-context)
(FAISS over
 intent embs)
   │
   ▼
EXECUTE in sandbox
(local exec; Docker per Phase 2)
   │
   ▼
TRAJECTORY + Langfuse trace + catalog success update
```
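The retrieval-first policy above can be sketched as follows. `Candidate`, the similarity scores, and the 0.75 threshold are illustrative assumptions, not the real `library/` implementation (which retrieves via FAISS over intent embeddings):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: str
    score: float  # cosine similarity of query intent vs. node intent

def plan(query: str, candidates: list[Candidate], threshold: float = 0.75) -> str:
    """Retrieval-first policy: INVOKE the best library node when its
    intent similarity clears the threshold; otherwise fall back to the
    vanilla RLM DIRECT_SOLVE path (SYNTHESIZE is Phase 2)."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is not None and best.score >= threshold:
        return f"INVOKE({best.node_id})"
    return "DIRECT_SOLVE"

# A diff-application query retrieves the seed node...
print(plan("apply this git diff", [Candidate("LIB.git_diff_applier", 0.91)]))
# → INVOKE(LIB.git_diff_applier)
# ...while a query with no good match falls through to the vanilla scaffold.
print(plan("summarize the repo", [Candidate("LIB.git_diff_applier", 0.31)]))
# → DIRECT_SOLVE
```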
The novelty isn't the library learning (Voyager, DreamCoder, and LILO got there first). It's that every node is a typed LangGraph subgraph, gateway-traced for production observability, and published as a portable HuggingFace dataset per domain. That combination is what's missing from the existing research stack.
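What "typed" could mean concretely: each catalog entry pairs retrieval metadata with an executable whose argument shape is statically declared. The protocol below is a hypothetical sketch of that contract (the real `Node` protocol lives in `library/`; field names here are assumptions):

```python
from typing import Protocol, TypedDict, runtime_checkable

class DiffArgs(TypedDict):
    """Declared input shape for a diff-application node (illustrative)."""
    file_text: str
    diff: str

@runtime_checkable
class LibraryNode(Protocol):
    """Hypothetical node contract: metadata the retriever embeds, plus a
    typed invoke() that would wrap a compiled LangGraph subgraph."""
    node_id: str
    intent: str  # embedded into the FAISS intent index for retrieval

    def invoke(self, args: DiffArgs) -> str: ...

class GitDiffApplier:
    node_id = "LIB.git_diff_applier"
    intent = "apply a unified git diff to a file's current text"

    def invoke(self, args: DiffArgs) -> str:
        # Real node: a compiled LangGraph subgraph; stubbed here.
        return args["file_text"]

# Structural check: the hand-built node satisfies the protocol.
assert isinstance(GitDiffApplier(), LibraryNode)
```

Declaring the contract as a `Protocol` keeps hand-built and (Phase 2) synthesized nodes interchangeable behind the same registry interface.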
```bash
# Clone + install
git clone git@github.com:protoLabsAI/compound-rlm.git
cd compound-rlm
uv sync

# Configure: gateway URL + key (or point at any OpenAI-compatible backend)
export GATEWAY_URL=http://your-gateway/v1
export GATEWAY_API_KEY=sk-...

# Bootstrap the seed library (registers git_diff_applier + codebase_grep)
uv run python scripts/bootstrap_library.py

# Run on the locked Q3-5 LoCoDiff benchmark
git clone https://github.com/AbanteAI/LoCoDiff-bench /tmp/LoCoDiff-bench
uv run python benchmarks/locodiff/run_with_library.py \
  --catalog libraries/local/coding/catalog.jsonl --locked-q3
# Expect: ~4/5 PASS, ~9 s wall-clock per pass
```
```bash
# Run the lab-repo audit (point --repo at any git repo)
uv run python benchmarks/lab_repo_audit/run_audit.py \
  --catalog libraries/local/coding/catalog.jsonl \
  --repo /path/to/your/repo
```

```
compound_rlm/
├── core/            # M0 RLM scaffold: graph, sandbox, parser, llm, prompts, trajectory
├── library/         # Catalog, Node protocol, IntentIndex (FAISS), LibraryRegistry
├── nodes/           # Hand-built library nodes (currently: git_diff_applier, codebase_grep)
├── primitives/      # (Phase 2) Synthesizer's compositional vocabulary
├── dspy_modules/    # (Phase 1.5) DSPy wrappers for GEPA optimization
├── trajectory/      # SFT-corpus persistence
└── eval/
benchmarks/
├── locodiff/        # AbanteAI LoCoDiff-bench loader, scorer, runner
├── lab_repo_audit/  # 10 seed questions over a local repo (designed for library transfer)
└── longbench_v2/    # (Phase 1.5) THUDM LongBench v2 integration
scripts/             # bootstrap_library, launch_mit_planner, teardown_mit_planner
libraries/           # Versioned catalog snapshots (LFS); local/ is gitignored
trajectories/        # Weekly archives (LFS) for reproduction
```
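The quickstart passes `--catalog libraries/local/coding/catalog.jsonl`, one JSON object per line. The README doesn't show the schema, so the fields below are purely illustrative of what a versioned, success-tracked entry might carry:

```python
import json

# Hypothetical catalog row; the real schema lives in library/ and its
# snapshots under libraries/. Field names here are assumptions.
entry = {
    "node_id": "LIB.git_diff_applier",
    "intent": "apply a unified git diff to a file snapshot",
    "version": "0.1.0",
    "entrypoint": "nodes.git_diff_applier:build",
    "stats": {"invocations": 25, "successes": 21},
}

line = json.dumps(entry)   # one JSON object per catalog.jsonl line
parsed = json.loads(line)
print(parsed["node_id"])   # → LIB.git_diff_applier
```

Keeping per-node success counts in the catalog is what lets the planner's "catalog success update" step (end of the pipeline diagram) bias future retrieval toward nodes that actually work.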
This stack is a synthesis, not an invention:
| Ingredient | Source |
|---|---|
| Recursion over decomposed context | Zhang et al. — RLM |
| Skill-library lifelong learning | Wang et al. — Voyager |
| Library compression / refactoring | Bowers — Stitch (Phase 2) |
| LM-guided library learning | Grand et al. — LILO |
| Programming-not-prompting | DSPy (Phase 1.5) |
| Multi-role decomposition | ROMA (inspiration) |
| Memory typing | Letta (inspiration) |
What we add: production observability (Langfuse), type-checked LangGraph subgraph nodes, multi-domain library publishing as HF datasets, cold-start with hand-curated nodes.
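Publishing a domain library as a HuggingFace dataset could look roughly like this. The record fields, the output path, and the repo name are all assumptions; the actual upload call is shown only as a comment so the sketch stays self-contained:

```python
import json
import pathlib
import tempfile

# Hypothetical per-domain catalog snapshot (schema illustrative).
rows = [
    {"node_id": "LIB.git_diff_applier", "intent": "apply a unified git diff"},
    {"node_id": "LIB.codebase_grep", "intent": "search a repo for a pattern"},
]

# Write the snapshot as JSONL, the format HF datasets ingests directly.
out = pathlib.Path(tempfile.mkdtemp()) / "coding.jsonl"
out.write_text("\n".join(json.dumps(r) for r in rows))

# With the `datasets` library installed, this snapshot becomes portable:
#   from datasets import Dataset
#   Dataset.from_json(str(out)).push_to_hub("your-org/compound-rlm-coding")
print(out.read_text().count("\n") + 1)  # → 2 catalog rows
```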
Apache-2.0. Libraries published under the same.
See CITATION.cff — citable as software while we draft the workshop paper.