Recursive Language Models extended with a typed, gateway-traced library of LangGraph subgraphs that compounds skill across queries.
Status: Phase 1 complete. Library hypothesis validated on two benchmarks. See EXPERIMENTS.md for the running record and PROPOSAL.md for the system + execution strategy.
80%+ on three benchmark rows with a library built in a single day, vs. 0-40% for the vanilla RLM scaffold. One hand-built node moves LoCoDiff from 40% → 85%.
| Benchmark | Vanilla RLM (M0) | + library |
|---|---|---|
| LoCoDiff Q3-locked | 0/5 (0%) | 4/5 (80%) |
| LoCoDiff stratified-20 | 8/20 (40%) | 17/20 (85%) (node-only baseline) |
| Lab-repo audit (10 seed) | — | 8/10 (80%) |
Library ablation on Q3-locked: removing the library drops accuracy from 80% → 0%. Adding it takes the planner from "rewrites a diff parser across 14-22 turns and times out" to "INVOKEs LIB.git_diff_applier in turn 2 and FINAL_VARs in turn 3", about 65× faster end-to-end.
Recursive Language Models (Zhang, Kraska, and Khattab; MIT CSAIL, 2026) let an LM drive a Python REPL to decompose long context without ever loading it into the model's window. compound-rlm extends that paradigm with library learning: every successful skill becomes a typed, retrievable LangGraph subgraph in a versioned catalog. New queries try retrieval first; the planner only synthesizes (or solves directly) when no library node fits.
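The core RLM move can be illustrated in miniature: the long context lives inside a sandbox as a variable, and the model only ever sees the small slices its own code selects. The `repl_exec` helper below is a stand-in for the real sandbox in `core/`, not its actual API:

```python
# Minimal illustration of the RLM idea: the context never enters the
# model's window; the planner's code slices it inside the sandbox.
context = "\n".join(f"line {i}: payload" for i in range(100_000))

def repl_exec(code: str, env: dict) -> str:
    """Hypothetical stand-in for the sandboxed REPL in core/."""
    local = dict(env)
    exec(code, {}, local)
    return str(local.get("out", ""))

# One planner turn: grep for the relevant line instead of reading
# 100k lines into the model's context window.
result = repl_exec(
    "out = [l for l in ctx.splitlines() if 'line 42:' in l][:1]",
    {"ctx": context},
)
print(result)  # → ['line 42: payload']
```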
```
                  ROOT PLANNER (DSPy)
query ──►  inputs: query, ctx, top-K library candidates
           output: INVOKE(node_id, args) | DIRECT_SOLVE | (later: SYNTHESIZE)
                   │
   ┌───────────────┼────────────────┐
   ▼               ▼                ▼
LIBRARY        SYNTHESIZE      DIRECT_SOLVE
RETRIEVE       (Phase 2)       (vanilla RLM in-context)
(FAISS over
 intent embs)
   │
   ▼
EXECUTE in sandbox
(local exec; Docker per Phase 2)
   │
   ▼
TRAJECTORY + Langfuse trace + catalog success update
```
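The retrieval-first policy above can be sketched as follows. `Candidate`, the similarity scores, and the 0.75 threshold are illustrative assumptions, not the real `library/` implementation (which retrieves via FAISS over intent embeddings):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: str
    score: float  # cosine similarity of query intent vs. node intent

def plan(query: str, candidates: list[Candidate], threshold: float = 0.75) -> str:
    """Retrieval-first policy: INVOKE the best library node when its
    intent similarity clears the threshold; otherwise fall back to the
    vanilla RLM DIRECT_SOLVE path (SYNTHESIZE is Phase 2)."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is not None and best.score >= threshold:
        return f"INVOKE({best.node_id})"
    return "DIRECT_SOLVE"

# A diff-application query retrieves the seed node...
print(plan("apply this git diff", [Candidate("LIB.git_diff_applier", 0.91)]))
# → INVOKE(LIB.git_diff_applier)
# ...while a query with no good match falls through to the vanilla scaffold.
print(plan("summarize the repo", [Candidate("LIB.git_diff_applier", 0.31)]))
# → DIRECT_SOLVE
```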
The novelty isn't the library learning (Voyager, DreamCoder, and LILO got there first). It's that every node is a typed LangGraph subgraph, gateway-traced for production observability, and published as a portable HuggingFace dataset per domain. That combination is what's missing from the existing research stack.
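What "typed" could mean concretely: each catalog entry pairs retrieval metadata with an executable whose argument shape is statically declared. The protocol below is a hypothetical sketch of that contract (the real `Node` protocol lives in `library/`; field names here are assumptions):

```python
from typing import Protocol, TypedDict, runtime_checkable

class DiffArgs(TypedDict):
    """Declared input shape for a diff-application node (illustrative)."""
    file_text: str
    diff: str

@runtime_checkable
class LibraryNode(Protocol):
    """Hypothetical node contract: metadata the retriever embeds, plus a
    typed invoke() that would wrap a compiled LangGraph subgraph."""
    node_id: str
    intent: str  # embedded into the FAISS intent index for retrieval

    def invoke(self, args: DiffArgs) -> str: ...

class GitDiffApplier:
    node_id = "LIB.git_diff_applier"
    intent = "apply a unified git diff to a file's current text"

    def invoke(self, args: DiffArgs) -> str:
        # Real node: a compiled LangGraph subgraph; stubbed here.
        return args["file_text"]

# Structural check: the hand-built node satisfies the protocol.
assert isinstance(GitDiffApplier(), LibraryNode)
```

Declaring the contract as a `Protocol` keeps hand-built and (Phase 2) synthesized nodes interchangeable behind the same registry interface.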
```bash
# Clone + install
git clone git@github.com:protoLabsAI/compound-rlm.git
cd compound-rlm
uv sync

# Configure: gateway URL + key (or point at any OpenAI-compatible backend)
export GATEWAY_URL=http://your-gateway/v1
export GATEWAY_API_KEY=sk-...

# Bootstrap the seed library (registers git_diff_applier + codebase_grep)
uv run python scripts/bootstrap_library.py

# Run on the locked Q3-5 LoCoDiff benchmark
git clone https://github.com/AbanteAI/LoCoDiff-bench /tmp/LoCoDiff-bench
uv run python benchmarks/locodiff/run_with_library.py \
  --catalog libraries/local/coding/catalog.jsonl --locked-q3
# Expect: ~4/5 PASS, ~9 s wall-clock per pass
```
```bash
# Run the lab-repo audit (point --repo at any git repo)
uv run python benchmarks/lab_repo_audit/run_audit.py \
  --catalog libraries/local/coding/catalog.jsonl \
  --repo /path/to/your/repo
```

```
compound_rlm/
├── core/            # M0 RLM scaffold: graph, sandbox, parser, llm, prompts, trajectory
├── library/         # Catalog, Node protocol, IntentIndex (FAISS), LibraryRegistry
├── nodes/           # Hand-built library nodes (currently: git_diff_applier, codebase_grep)
├── primitives/      # (Phase 2) Synthesizer's compositional vocabulary
├── dspy_modules/    # (Phase 1.5) DSPy wrappers for GEPA optimization
├── trajectory/      # SFT-corpus persistence
└── eval/
benchmarks/
├── locodiff/        # AbanteAI LoCoDiff-bench loader, scorer, runner
├── lab_repo_audit/  # 10 seed questions over a local repo (designed for library transfer)
└── longbench_v2/    # (Phase 1.5) THUDM LongBench v2 integration
scripts/             # bootstrap_library, launch_mit_planner, teardown_mit_planner
libraries/           # Versioned catalog snapshots (LFS); local/ is gitignored
trajectories/        # Weekly archives (LFS) for reproduction
```
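The quickstart passes `--catalog libraries/local/coding/catalog.jsonl`, one JSON object per line. The README doesn't show the schema, so the fields below are purely illustrative of what a versioned, success-tracked entry might carry:

```python
import json

# Hypothetical catalog row; the real schema lives in library/ and its
# snapshots under libraries/. Field names here are assumptions.
entry = {
    "node_id": "LIB.git_diff_applier",
    "intent": "apply a unified git diff to a file snapshot",
    "version": "0.1.0",
    "entrypoint": "nodes.git_diff_applier:build",
    "stats": {"invocations": 25, "successes": 21},
}

line = json.dumps(entry)   # one JSON object per catalog.jsonl line
parsed = json.loads(line)
print(parsed["node_id"])   # → LIB.git_diff_applier
```

Keeping per-node success counts in the catalog is what lets the planner's "catalog success update" step (end of the pipeline diagram) bias future retrieval toward nodes that actually work.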
This stack is a synthesis, not an invention:
| Ingredient | Source |
|---|---|
| Recursion over decomposed context | Zhang et al. — RLM |
| Skill-library lifelong learning | Wang et al. — Voyager |
| Library compression / refactoring | Bowers — Stitch (Phase 2) |
| LM-guided library learning | Grand et al. — LILO |
| Programming-not-prompting | DSPy (Phase 1.5) |
| Multi-role decomposition | ROMA (inspiration) |
| Memory typing | Letta (inspiration) |
What we add: production observability (Langfuse), type-checked LangGraph subgraph nodes, multi-domain library publishing as HF datasets, cold-start with hand-curated nodes.
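Publishing a domain library as a HuggingFace dataset could look roughly like this. The record fields, the output path, and the repo name are all assumptions; the actual upload call is shown only as a comment so the sketch stays self-contained:

```python
import json
import pathlib
import tempfile

# Hypothetical per-domain catalog snapshot (schema illustrative).
rows = [
    {"node_id": "LIB.git_diff_applier", "intent": "apply a unified git diff"},
    {"node_id": "LIB.codebase_grep", "intent": "search a repo for a pattern"},
]

# Write the snapshot as JSONL, the format HF datasets ingests directly.
out = pathlib.Path(tempfile.mkdtemp()) / "coding.jsonl"
out.write_text("\n".join(json.dumps(r) for r in rows))

# With the `datasets` library installed, this snapshot becomes portable:
#   from datasets import Dataset
#   Dataset.from_json(str(out)).push_to_hub("your-org/compound-rlm-coding")
print(out.read_text().count("\n") + 1)  # → 2 catalog rows
```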
Apache-2.0. Libraries published under the same.
See CITATION.cff — citable as software while we draft the workshop paper.