Harness-managed virtual memory for stateful tool-using LLM agents.
ClawVM manages agent state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. An observable fault model and offline replay oracle make memory-management decisions deterministic and auditable.
This repository contains the evaluation artifact: a deterministic replay engine, synthetic and real-trace workloads, and analysis tools. All experiments run in pure Python 3 with zero external dependencies.
- Typed pages with minimum-fidelity invariants: state degrades gracefully (full, compressed, structured, pointer) instead of being silently dropped.
- Validated writeback: three-phase staged/validated/committed protocol prevents destructive persistence at lifecycle boundaries.
- Observable fault model: six policy-controllable fault types (refetch, duplicate-tool, post-compaction bootstrap, flush-miss, silent recall, pinned invariant) are machine-countable.
- Replay oracle: offline analysis with bounded-lookahead oracle measures the gap between online policy and optimal, separating policy quality from budget insufficiency.
- Deterministic replay: identical inputs produce byte-identical traces and metrics.
- Tier-1 lifecycle gates: six must-pass contract tests for memory-management invariants.
- Tier-2 synthetic policy sweeps: compare six policies (Retrieval, Retrieval+Cache, Compaction-Hybrid, LRU, ClawVM, Oracle) across four workload families and configurable token budgets.
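The fidelity ladder (full, compressed, structured, pointer) and the minimum-fidelity invariant can be sketched in plain Python. Names like `Page`, `Fidelity`, and `fit_to_budget` are illustrative, not the runtime's actual API in `clawvm_runtime/`:

```python
from dataclasses import dataclass
from enum import IntEnum

class Fidelity(IntEnum):
    POINTER = 0      # reference only, cheapest
    STRUCTURED = 1   # schema-extracted fields
    COMPRESSED = 2   # summarized text
    FULL = 3         # verbatim content

@dataclass
class Page:
    page_id: str
    min_fidelity: Fidelity   # invariant: never degrade below this floor
    costs: dict              # token cost per fidelity level
    fidelity: Fidelity = Fidelity.FULL

    def degrade_one_step(self) -> bool:
        """Drop to the next cheaper representation, respecting the invariant."""
        if self.fidelity > self.min_fidelity:
            self.fidelity = Fidelity(self.fidelity - 1)
            return True
        return False  # pinned at its floor; cannot degrade further

def fit_to_budget(pages, budget):
    """Greedily degrade the most expensive page until total cost fits."""
    def total():
        return sum(p.costs[p.fidelity] for p in pages)
    while total() > budget:
        candidates = [p for p in pages if p.fidelity > p.min_fidelity]
        if not candidates:
            break  # budget insufficiency: the invariants alone exceed the budget
        max(candidates, key=lambda p: p.costs[p.fidelity]).degrade_one_step()
    return total()
```

Note that when every page sits at its fidelity floor, the loop stops rather than dropping state, which is the "degrades gracefully instead of being silently dropped" property.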
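The three-phase writeback protocol (staged, then validated, then committed) amounts to a small state machine; a minimal sketch with hypothetical names, not the journal implemented in `clawvm_runtime/`:

```python
from enum import Enum

class Phase(Enum):
    STAGED = "staged"
    VALIDATED = "validated"
    COMMITTED = "committed"

class WritebackError(Exception):
    pass

class WritebackJournal:
    """Toy journal: a write must pass validation before it may replace
    persisted state, so a bad write can never destroy a good page."""

    def __init__(self):
        self.persisted = {}   # page_id -> committed value
        self.pending = {}     # page_id -> (Phase, value)

    def stage(self, page_id, value):
        self.pending[page_id] = (Phase.STAGED, value)

    def validate(self, page_id, check):
        phase, value = self.pending[page_id]
        if phase is not Phase.STAGED or not check(value):
            raise WritebackError(f"validation failed for {page_id}")
        self.pending[page_id] = (Phase.VALIDATED, value)

    def commit(self, page_id):
        phase, value = self.pending[page_id]
        if phase is not Phase.VALIDATED:
            raise WritebackError(f"commit before validation for {page_id}")
        del self.pending[page_id]
        self.persisted[page_id] = value
```

The key design point is that `commit` refuses any entry that has not passed `validate`, so destructive persistence at a lifecycle boundary surfaces as an error instead of silent data loss.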
Run the full preliminary evaluation suite:
```
python3 replay_py/run_preliminary_suite.py --out-root /tmp/clawvm-prelim
```

This runs the Tier-1 lifecycle gates and Tier-2 policy sweeps, producing:
- `preliminary.report.md`: human-readable summary
- `suite.manifest.json`: machine-readable manifest
- `tier1/tier1.report.json`: lifecycle gate results (6 scenarios, pass/fail)
- `tier2/results/tier2.summary.{json,csv,md}`: policy comparison tables (faults, thrash index, oracle gap)
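One way to read the oracle gap in the Tier-2 summary: the bounded-lookahead oracle makes Belady-style eviction choices restricted to a finite window, so the gap between a policy's fault count and the oracle's separates policy quality from budget insufficiency. A toy sketch of such a choice (function name and signature are assumptions, not the replay engine's API):

```python
def bounded_lookahead_evict(resident, future, window):
    """Belady-style eviction under bounded lookahead: evict the resident
    page whose next reference lies farthest ahead within `window`.
    Pages not referenced inside the window tie at the horizon."""
    horizon = future[:window]

    def next_use(page):
        try:
            return horizon.index(page)
        except ValueError:
            return window  # not referenced within the lookahead window

    return max(resident, key=next_use)
```

With an unbounded window this reduces to classical Belady/OPT; bounding it keeps the offline analysis tractable and makes the oracle's advantage over online policies explicit.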
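Byte-identical determinism can be spot-checked by hashing the output trees of two runs with the same inputs. This generic stdlib-only sketch is not a tool shipped in the repository:

```python
import hashlib
from pathlib import Path

def digest_tree(root: str) -> str:
    """Fold every file under root (sorted by relative path) into one
    SHA-256 digest. Two deterministic runs over identical inputs should
    produce the same digest; a mismatch localizes nondeterminism."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```

Hashing the relative path alongside each file's bytes means a renamed or moved output also changes the digest, not just edited contents.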
```
clawvm_runtime/    Runtime: page table, representation selector, writeback journal,
                   fault observer, decision trace logger, engine orchestrator
replay_py/         Replay engine, trace normalization, and CLI tools
openclaw_hooks/    Agent harness hook integration for live experiments
workloads/         Experiment drivers: sweeps, ablations, adversarial tests,
                   live experiment harness, trace conversion
schemas/           JSON schemas for DecisionTrace events and tool simulator
traces/            Trace file layout and naming conventions
docs/              Full documentation suite
EXPERIMENTS.md     Primary execution guide
```
- Experiments guide — complete walkthrough for running evaluations
- Documentation index — architecture, data model, policies, metrics, glossary
- CLI reference — all CLI flags and usage patterns
- Student runbook — quick operational handoff guide