# compound-rlm

Recursive Language Models extended with a typed, gateway-traced library of LangGraph subgraphs that compounds skill across queries.

Status: Phase 1 complete. Library hypothesis validated on two benchmarks. See EXPERIMENTS.md for the running record and PROPOSAL.md for the system + execution strategy.

## Headline result

80%+ on three benchmarks with a library built in a single day, vs. 0-40% for the vanilla RLM scaffold. One hand-built node moves LoCoDiff from 40% → 85%.

| Benchmark | Vanilla RLM (M0) | + library |
| --- | --- | --- |
| LoCoDiff Q3-locked | 0/5 (0%) | 4/5 (80%) |
| LoCoDiff stratified-20 | 8/20 (40%) | 17/20 (85%, node-only baseline) |
| Lab-repo audit (10 seed) | n/a | 8/10 (80%) |

Library ablation on Q3-locked: removing the library drops accuracy from 80% → 0%. Adding it takes the planner from rewriting a diff parser across 14-22 turns and timing out to calling INVOKE(LIB.git_diff_applier) in turn 2 and returning FINAL_VAR in turn 3, roughly 65× faster end-to-end.

## What it is

Recursive Language Models (Zhang/Kraska/Khattab, MIT CSAIL 2026) let an LM drive a Python REPL to decompose long context without ever loading it into the model's window. compound-rlm extends that paradigm with library learning: every successful skill becomes a typed, retrievable LangGraph subgraph in a versioned catalog. New queries try retrieval first; the planner only synthesizes (or solves directly) when no library node fits.

```
                ROOT PLANNER (DSPy)
   query  ──►   inputs: query, ctx, top-K library candidates
                output: INVOKE(node_id, args) | DIRECT_SOLVE | (later: SYNTHESIZE)
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
   LIBRARY           SYNTHESIZE       DIRECT_SOLVE
   RETRIEVE          (Phase 2)        (vanilla RLM in-context)
   (FAISS over
    intent embs)
       │
       ▼
   EXECUTE in
   sandbox (local exec; Docker per Phase 2)
       │
       ▼
   TRAJECTORY + Langfuse trace + catalog success update
```
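The retrieval-first routing above can be sketched in a few lines. Everything here (the `Candidate` dataclass, `plan`, the 0.75 threshold) is illustrative, not the repo's actual API: the real planner is a DSPy module that also sees the context and may emit arguments.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    node_id: str
    score: float  # similarity of the query embedding to the node's intent embedding

def plan(query: str, candidates: list[Candidate], threshold: float = 0.75) -> str:
    """Retrieval-first dispatch: INVOKE the best-matching library node if any
    candidate clears the threshold, otherwise fall back to solving in-context
    (DIRECT_SOLVE). SYNTHESIZE is the Phase 2 branch and is omitted here."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is not None and best.score >= threshold:
        return f"INVOKE({best.node_id})"
    return "DIRECT_SOLVE"

print(plan("apply this git diff", [Candidate("LIB.git_diff_applier", 0.91)]))
# → INVOKE(LIB.git_diff_applier)
```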

The novelty isn't the library learning (Voyager, DreamCoder, LILO got there first). It's that every node is a typed LangGraph subgraph, gateway-traced for production observability, and published as a portable HuggingFace dataset per domain. That's what's missing from the existing research stack.
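A minimal sketch of what "typed node" means in practice, under the assumption that each catalog entry exposes an id, a natural-language intent (embedded for retrieval), and a typed entry point. `LibraryNode`, `DiffArgs`, and the stubbed `GitDiffApplier` body are all hypothetical; the real nodes are LangGraph subgraphs.

```python
from typing import Protocol, TypedDict, runtime_checkable

class DiffArgs(TypedDict):
    file_text: str  # current contents of the target file
    diff_text: str  # unified diff to apply

@runtime_checkable
class LibraryNode(Protocol):
    """Shape a library node must satisfy to be registered in the catalog."""
    node_id: str
    intent: str
    def run(self, args: DiffArgs) -> str: ...

class GitDiffApplier:
    node_id = "LIB.git_diff_applier"
    intent = "apply a unified git diff to a file's current text"
    def run(self, args: DiffArgs) -> str:
        # The real node is a LangGraph subgraph; this stub only shows the type.
        return args["file_text"]
```

Type-checking the interface (rather than trusting free-form generated code) is what lets the planner compose nodes without re-reading their source.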

## Quickstart

```bash
# Clone + install
git clone git@github.com:protoLabsAI/compound-rlm.git
cd compound-rlm
uv sync

# Configure: gateway URL + key (or point at any OpenAI-compat backend)
export GATEWAY_URL=http://your-gateway/v1
export GATEWAY_API_KEY=sk-...

# Bootstrap the seed library (registers git_diff_applier + codebase_grep)
uv run python scripts/bootstrap_library.py

# Run on the locked Q3-5 LoCoDiff benchmark
git clone https://github.com/AbanteAI/LoCoDiff-bench /tmp/LoCoDiff-bench
uv run python benchmarks/locodiff/run_with_library.py \
  --catalog libraries/local/coding/catalog.jsonl --locked-q3
# Expect: ~4/5 PASS, ~9s wall per pass

# Run on the lab-repo audit (point --repo at any git repo)
uv run python benchmarks/lab_repo_audit/run_audit.py \
  --catalog libraries/local/coding/catalog.jsonl \
  --repo /path/to/your/repo
```

## Layout

```
compound_rlm/
├── core/         # M0 RLM scaffold: graph, sandbox, parser, llm, prompts, trajectory
├── library/      # Catalog, Node protocol, IntentIndex (FAISS), LibraryRegistry
├── nodes/        # Hand-built library nodes (currently: git_diff_applier, codebase_grep)
├── primitives/   # (Phase 2) Synthesizer's compositional vocabulary
├── dspy_modules/ # (Phase 1.5) DSPy wrapping for GEPA optimization
├── trajectory/   # SFT-corpus persistence
└── eval/

benchmarks/
├── locodiff/         # AbanteAI LoCoDiff-bench loader, scorer, runner
├── lab_repo_audit/   # 10 seed questions over a local repo (designed for library transfer)
└── longbench_v2/     # (Phase 1.5) THUDM LongBench v2 integration

scripts/             # bootstrap_library, launch_mit_planner, teardown_mit_planner
libraries/           # Versioned catalog snapshots (LFS), local/ is gitignored
trajectories/        # Weekly archives (LFS) for reproduction
```
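How the `IntentIndex` retrieval step behaves can be illustrated with a toy nearest-neighbor stand-in. The repo uses FAISS over intent embeddings; this sketch substitutes plain numpy cosine similarity, and the class name, methods, and 2-D "embeddings" are all for illustration only.

```python
import numpy as np

class IntentIndex:
    """Toy stand-in for the FAISS-backed IntentIndex: stores L2-normalized
    intent embeddings and returns the top-K node ids by cosine similarity."""
    def __init__(self) -> None:
        self.node_ids: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, node_id: str, emb: np.ndarray) -> None:
        self.node_ids.append(node_id)
        self.vecs.append(emb / np.linalg.norm(emb))

    def search(self, query_emb: np.ndarray, k: int = 3) -> list[str]:
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.array([v @ q for v in self.vecs])
        top = np.argsort(-sims)[:k]
        return [self.node_ids[i] for i in top]

idx = IntentIndex()
idx.add("LIB.git_diff_applier", np.array([1.0, 0.1]))
idx.add("LIB.codebase_grep", np.array([0.1, 1.0]))
print(idx.search(np.array([0.9, 0.2]), k=1))  # → ['LIB.git_diff_applier']
```

With normalized vectors, cosine similarity is a plain inner product, which is also why FAISS's inner-product index works for this after normalization.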

## Building on prior art

This stack is a synthesis, not an invention:

| Ingredient | Source |
| --- | --- |
| Recursion over decomposed context | Zhang et al., RLM |
| Skill-library lifelong learning | Wang et al., Voyager |
| Library compression / refactoring | Bowers, Stitch (Phase 2) |
| LM-guided library learning | Grand et al., LILO |
| Programming-not-prompting | DSPy (Phase 1.5) |
| Multi-role decomposition | ROMA (inspiration) |
| Memory typing | Letta (inspiration) |

What we add: production observability (Langfuse), type-checked LangGraph subgraph nodes, multi-domain library publishing as HF datasets, cold-start with hand-curated nodes.
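The "catalog success update" step could look like the following sketch: one JSON object per line in `catalog.jsonl`, with counters the executor bumps after each run. The field names and `register_node` helper are assumptions for illustration, not the repo's actual schema.

```python
import io
import json

def register_node(catalog: io.TextIOBase, node_id: str, intent: str,
                  version: int = 1) -> dict:
    """Append a minimal catalog entry (one JSON object per line) with
    success/failure counters for later retrieval ranking."""
    entry = {
        "node_id": node_id,
        "intent": intent,
        "version": version,
        "successes": 0,
        "failures": 0,
    }
    catalog.write(json.dumps(entry) + "\n")
    return entry

buf = io.StringIO()
register_node(buf, "LIB.codebase_grep", "search a repo for a symbol or pattern")
print(buf.getvalue().strip())
```

A line-per-entry JSONL file keeps appends cheap and makes versioned snapshots (and HF dataset publishing) a straight file copy.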

## License

Apache-2.0. Published libraries use the same license.

## Citation

See CITATION.cff — citable as software while we draft the workshop paper.
