A decontaminated, audit-validated benchmark for evaluating long-horizon coherence in 8B-class agent models.
Long-form writeup: LHC v0.2: A Benchmark for Long-Horizon Agent Coherence (and the Methodology That Got It Honest) (also in Portuguese).
This repo contains:
- LHC v0.2 — 24 hand-curated tasks, 4 gap-mode ablation, full audit trail, four rounds of external review. Also published as the `hugonogueira/lhc-v0.2` HuggingFace dataset.
- Deterministic baseline — an ~80-line Python parser that scores 2/2 deterministically on two structured-state resumption tasks where every fine-tuned 8B model we tested averaged below 2.0 across 12 cells. Standalone repo: hugomn/lhc-resume-state-parser.
- The honest training record — every dead end, including the fine-tuned model that didn't beat its base.
If you are training a model to maintain coherence across long-running conversations or autonomous agent loops — and you want to know whether your fine-tune is actually better than its base, or just better at the test set — LHC is built for you. It is a small, manually-curated benchmark with three properties most LHC-style benchmarks lack:
- Decontaminated. No `based_on` overlap with the synthetic data Ember was trained on. Contamination check tooling is in the repo.
- Gap-mode ablated. Each task is run under 4 different "long-horizon gap" conditions (none / placeholder / neutral / current). The gap mode is the lever that exposes coherence failures vs. surface memorization. A minimal sketch of the ablation is below.
- Pre-registered decision gates. `evals/v0.2/DECISION.md` was locked before any model was scored. The reasoning for shipping or not-shipping is the same reasoning we'd apply to anyone else's results.
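As a rough illustration of the gap-mode lever, each task is assembled four ways: the same setup turns and the same final probe, with different gap material in between. The field names below are hypothetical; the real task schema lives in `evals/v0.2/tasks/` and may differ.

```python
# Hypothetical sketch of the gap-mode ablation; not the actual task schema.
GAP_MODES = ("none", "placeholder", "neutral", "current")

def build_conversation(task: dict, gap_mode: str) -> list[dict]:
    """Assemble setup turns + gap material + final probe for one (task, gap_mode) cell."""
    messages = list(task["setup_turns"])              # hypothetical field name
    if gap_mode != "none":
        # "placeholder" / "neutral" / "current" differ only in what fills the gap
        messages.append({"role": "user", "content": task["gaps"][gap_mode]})
    messages.append({"role": "user", "content": task["probe"]})  # hypothetical field name
    return messages
```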
The Ember v0.1.5 fine-tune did not ship. Under matched local MLX inference, Ember v0.1.5 is statistically indistinguishable from base Qwen3-8B on LHC v0.2; it does not meet our bar for release ("significant improvement on at least one functionality compared to a same-class model").
Getting to that conclusion took four rounds of external review in 72 hours:
- Round 1 (2026-05-08) — caught LHC v0.1 contamination, hash-seed bug, train/eval gap mismatch. Triggered the v0.2 rebuild.
- Round 2 (2026-05-09) — caught that the v0.2 sweep compared Ember (local MLX, with `/no_think`) against the OpenRouter models (no `/no_think`). Inference-config asymmetry confounded the original "Ember regresses by 0.25" verdict.
- Round 3 (2026-05-10) — blessed the matched-inference diagnostic with a wording change and asked us to close one symmetry nuisance.
- Round 4 (2026-05-10) — closing that nuisance surfaced an MLX replication issue (within-session vs across-session). Final n=3 fresh-server-restart comparison is a statistical tie; both CIs cross zero.
All four review rounds and the resulting verdicts are recorded in the journal. Nothing has been edited after the fact.
The 4-model sweep (24 tasks × 4 gap modes × 3 trials per model, judged by Claude Opus 4.7) produced this leaderboard. Ember and Qwen3-8B base ran via different inference paths — see the matched-inference comparison below for the apples-to-apples Ember-vs-base verdict.
| Rank | Model | Mean (max 2.00) | Inference path |
|---|---|---|---|
| #1 | Qwen3-8B base | 1.413 | OpenRouter, no /no_think |
| #2 | Ministral-8B-2512 | 1.351 | OpenRouter |
| #3 (tie) | Ember v0.1.5 iter-900 | 1.285 | local MLX, with /no_think |
| #3 (tie) | Llama-3.1-8B | 1.285 | OpenRouter |
All four benchmark-validity gates passed (judge stability, rank inversion, CI discrimination, variance attribution).
Ember vs base, matched local-MLX inference, n=3 fresh-server trials per cell:
| Gap | Ember mean | Qwen-local mean | Δ E−Q | 95% CI (task-bootstrap)¹ |
|---|---|---|---|---|
| `current` | 1.222 | 1.181 | +0.042 | [−0.139, +0.222] |
| `neutral` | 1.319 | 1.333 | −0.014 | [−0.139, +0.111] |
Both CIs cross zero: statistical tie on both gap modes. Ember is not measurably better than base under matched inference; it is also not measurably worse.
¹ Task-bootstrap CI conditional on the observed 3 restart trials. A hierarchical bootstrap over (task, trial) would be wider, not narrower. Sufficient to reject shipping; not sufficient for fine-grained ranking.
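For reference, a minimal sketch of the paired task-bootstrap described above, assuming one per-task mean per model (each already averaged over its 3 restart trials). This is an illustration, not the published `evals/v0.2/diagnostic_compare.py`, which may differ in detail.

```python
import numpy as np

def paired_task_bootstrap_ci(ember_by_task, qwen_by_task, n_boot=10_000, seed=0):
    """95% CI for the mean per-task delta (Ember minus Qwen), resampling tasks with replacement."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(ember_by_task) - np.asarray(qwen_by_task)  # one paired delta per task
    n = len(deltas)
    boot_means = np.array([deltas[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return deltas.mean(), (lo, hi)

# e.g. 24 per-task means per model:
# delta, (lo, hi) = paired_task_bootstrap_ci(ember_scores, qwen_scores)
```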
An ~80-line deterministic parser (no LLM) scored 0.75 / 2.00 on resumption (8 tasks). On the two structured-state tasks (`resumption_v2_004` warehouse tracker and `resumption_v2_008` inconsistency detection), the parser scores 2/2 deterministically; across the original sweep, every 8B-class model we tested (Ember, Qwen3-8B base, Ministral, Llama-3.1-8B) had a mean below 2.0 on at least one of those tasks. Some LLM cells match the parser's max score of 2/2; none of the LLMs match the parser's deterministic mean. Implication: for structured-state resumption, fine-tuning may be the wrong layer.
See docs/results.md for per-category and per-gap-mode tables, and docs/findings.md F-05 for the full record.
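To make the point concrete, here is a toy (entirely hypothetical) version of the idea: if the transcript encodes state updates in a regular format, exact bookkeeping beats generative recall. The update syntax below is invented for the example; the real baseline lives in the hugomn/lhc-resume-state-parser repo and targets the actual task transcripts.

```python
import re
from collections import defaultdict

# Invented update format, for illustration only.
UPDATE = re.compile(r"\b(add|remove)\s+(\d+)\s+(\w+)", re.IGNORECASE)

def track_inventory(transcript: str) -> dict:
    """Replay every add/remove mention and return the final per-item counts."""
    counts = defaultdict(int)
    for verb, qty, item in UPDATE.findall(transcript):
        counts[item.lower()] += int(qty) if verb.lower() == "add" else -int(qty)
    return dict(counts)

# track_inventory("add 12 pallets ... remove 3 pallets") == {"pallets": 9}
```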
| Path | What |
|---|---|
| `evals/v0.2/` | LHC v0.2 — tasks, decision gates, analyzer, audit scripts, deterministic baseline, matched-inference diagnostics. The current canonical benchmark. |
| `evals/runners/` | The harness — multi-provider runner with stable seeding, full-message scorecards, fail-fast lifecycle, gap-mode ablation. |
| `evals/results/published/lhc-v0.2/` | All sweep + diagnostic scorecards, audit results, verdict. Reproducible. |
| `evals/tasks/` | LHC v0.1 tasks (historical — superseded by v0.2 due to contamination). |
| `docs/journal/` | Append-only chronological log of work sessions. The methodology arc. |
| `docs/findings.md` | Living "what's true now" doc, newest-first. |
| `docs/results.md` | Current LHC v0.2 cross-model leaderboard. |
| `data/` | Synthetic dataset construction pipeline (used for v0.1.5; pending redesign for v0.2). |
| `training/` | LoRA / DoRA recipes via MLX-LM, sweep drivers, eval drivers. |
- LHC v0.2 results — current leaderboard
- Findings — living what-we-know doc
- Journal — 2026-05-08 — round 1: contamination, hash-seed bug, train/eval gap mismatch
- Journal — 2026-05-09 — the sweep verdict (later rebuilt under matched inference)
- Journal — 2026-05-10 — rounds 2–4: matched-inference diagnostic + MLX replication observation + final null-result framing
- LHC v0.2 decision gates — pre-registered methodology
- LHC v0.2 authoring procedure — how the decontaminated tasks were built
- Manifesto — why long-horizon coherence matters
- Architecture — how the harness assembles a run
- Changelog — versioned release notes
```bash
git clone https://github.com/hugomn/lhc.git
cd lhc
uv venv && source .venv/bin/activate
uv pip install -e .
cp .env.example .env  # add provider keys (OPENROUTER_API_KEY + ANTHROPIC_API_KEY work for the published sweep models)
```
```bash
# One model × one gap mode × one trial
python -m evals.runners.lhc \
  --provider openrouter \
  --model qwen/qwen3-8b \
  --judge-provider anthropic \
  --gap-mode current \
  --lhc-version 0.2 \
  --tasks-dir evals/v0.2/tasks
```

To evaluate a model you serve yourself (vLLM, MLX-LM server, etc.), set `SLOWLIT_BASE_URL` to your endpoint and use `--provider slowlit`. Important: the slowlit provider injects `/no_think` as a system-prompt prefix by default (Qwen3-family thinking-mode opt-out). If you're benchmarking a model that doesn't support `/no_think` mode, override the prefix in the slowlit provider config or use `--provider openrouter` with an OpenRouter-mirrored deployment instead — see `evals/runners/lhc.py` for the prefix-injection logic. Whichever inference path you pick, document it alongside your scorecard — the original 2026-05-09 sweep was confounded by exactly this asymmetry; see findings.md F-05.
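For orientation, a minimal sketch of the prefix-injection idea (not the actual code in `evals/runners/lhc.py`), assuming the self-served endpoint is OpenAI-compatible; both vLLM and `mlx_lm.server` expose `/v1/chat/completions`.

```python
import os
import requests

THINKING_OPT_OUT_PREFIX = "/no_think"  # Qwen3-family thinking-mode opt-out

def chat(messages, model, prefix=THINKING_OPT_OUT_PREFIX):
    """Send one chat request, prepending `prefix` to the system prompt when set."""
    base_url = os.environ["SLOWLIT_BASE_URL"]  # your vLLM / MLX-LM server endpoint
    if prefix:
        messages = [
            {**m, "content": f"{prefix}\n{m['content']}"} if m["role"] == "system" else m
            for m in messages
        ]
    resp = requests.post(f"{base_url}/v1/chat/completions",
                         json={"model": model, "messages": messages}, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```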
In our setup on Apple Silicon, MLX-LM server outputs were stable within a single server session but varied across fresh server starts (~0.07–0.13 stdev on overall mean across 3 restart trials, with 6/24 tasks score-flipping per side). If your fine-tune is served via MLX-LM and you want to compare it against another model's score:
- Restart the MLX server between trials (or accept that your "n=3" is one sample with within-session caching).
- The diagnostic scripts in `evals/v0.2/diagnostic_*_replicate.py` show the protocol: launch a fresh `mlx_lm.server` process per trial (a minimal sketch follows below).
- Aim for ≥3 fresh-server-restart trials on both sides of any comparison; report mean + stdev + a paired bootstrap CI (see `evals/v0.2/diagnostic_compare.py`).
We have not isolated the source of the cross-restart variance (could be MLX-LM server, Metal kernels, cache state, sampling defaults, warmup, or process init). If your stack is different, your variance characteristics may be different — measure first.
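A minimal sketch of the fresh-server-restart protocol (the published `diagnostic_*_replicate.py` scripts are the authoritative version; the port, warmup wait, and gap mode below are placeholder choices):

```python
import os
import subprocess
import time

def run_fresh_server_trial(model_path: str, port: int = 8080) -> None:
    """Launch a new mlx_lm.server process, run one LHC trial against it, then tear it down."""
    server = subprocess.Popen(
        ["python", "-m", "mlx_lm.server", "--model", model_path, "--port", str(port)]
    )
    try:
        time.sleep(30)  # crude warmup; poll the server before trusting it in real use
        env = {**os.environ, "SLOWLIT_BASE_URL": f"http://localhost:{port}"}
        subprocess.run(
            ["python", "-m", "evals.runners.lhc",
             "--provider", "slowlit", "--model", model_path,
             "--judge-provider", "anthropic",
             "--gap-mode", "current", "--lhc-version", "0.2",
             "--tasks-dir", "evals/v0.2/tasks"],
            env=env, check=True,
        )
    finally:
        server.terminate()  # one fresh server process per trial
        server.wait()
```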
```bash
# Original 4-model sweep — 4 models × 4 gap modes × 3 trials = 48 scorecards, ~3 hours, ~$5
# Note: this is OUR sweep config (model list and output paths are hardcoded); not a researcher-facing matrix runner
python -m training.sweep_v02 --skip-existing

# Audit + original verdict (with the now-retracted G4 CI; see findings.md F-05)
python evals/v0.2/audit_g9_judge_stability.py --per-model 30
python evals/v0.2/audit_g12_manual.py   # interactive, 12 samples
python evals/v0.2/analyze.py \
  --scorecards-dir evals/results/published/lhc-v0.2/sweep \
  --audit-results evals/results/published/lhc-v0.2/audit-combined.json

# Matched-inference diagnostic (Ember and base Qwen3-8B both via local MLX,
# n=3 fresh-server-restart trials per cell)
python evals/v0.2/diagnostic_local_qwen.py        # Qwen-local trial 1
python evals/v0.2/diagnostic_qwen_replicate.py    # Qwen-local trials 2 + 3
python evals/v0.2/diagnostic_ember_rerun.py       # Ember trial 1
python evals/v0.2/diagnostic_ember_replicate.py   # Ember trials 2 + 3
python evals/v0.2/diagnostic_compare.py           # paired bootstrap CI; defaults to published scorecards
```
- Benchmark contamination is easy to introduce and hard to detect. LHC v0.1's derivative-seed contamination masked non-improvement — without the v0.2 rebuild, we would have shipped a model whose apparent improvement on the contaminated benchmark was an artifact and which, on a clean matched-inference comparison, does not measurably beat its base. The contamination check tooling at `evals/v0.2/build_banned_overlap.py` is reusable for any benchmark-vs-fine-tune comparison.
- Pre-registering decision gates and refusing to move them is the methodology that produced an honest result. `evals/v0.2/DECISION.md` was locked before any model run. When the data showed Ember worse than base, we had no path to argue around it.
- Pre-registered thresholds need a power check. Our G9 threshold of 0.10 was below the noise floor at the originally-specified sample size — caught and corrected by increasing N. Documented in the 2026-05-09 journal entry.
- Deterministic parsers are a real baseline for structured-state tasks. An ~80-line script beat all four 8B-class LLMs on two specific resumption tasks. Anyone training agent models should test this before assuming fine-tuning is the right layer.
- Storing full prompt + full gap + full response in scorecards costs nothing and enables independent re-judging by anyone, ever. Excerpt-only storage made the v0.1 scorecards effectively un-auditable. Lesson learned the hard way.
- Inference-config asymmetry can dominate a benchmark verdict. The original v0.2 sweep ranked Ember vs OpenRouter-served Qwen3-8B and reported "Ember regresses by 0.25, CI [−0.46, −0.06]." Re-running both via the same local MLX server with the same `/no_think` prefix shrank that delta to +0.04 (a tie). About 5/6 of the apparent regression was inference confound. Anyone benchmarking a fine-tune against its base should run them on the same inference path.
- MLX-LM server replication: restart between trials. In our setup on Apple Silicon, MLX outputs were stable within a single server session but varied across fresh server starts. Per-cell stdev across 3 restart trials was 0.02–0.13 on overall mean, with 6 of 24 tasks score-flipping per side. The original sweep's "byte-identical n=3 trials" was a within-session repeat, not three independent samples. We have not isolated the source. Benchmark replications of MLX-served models should restart the server between trials or explicitly state they are within-session.
If you run LHC against a model and want to share scorecards:
- Self-identify your scorecards. Add `model_slug`, `base_model`, `adapter` (or `null`), `inference_path`, `inference_provider`, `system_prompt_prefix`, and `server_config` fields at the top of each scorecard JSON. The published diagnostic scorecards under `evals/results/published/lhc-v0.2/diagnostic-*` show the shape. Filenames disambiguate; JSON content also should. (A placeholder example follows below.)
- Document the inference path in any comparison. "Model X scored Y on LHC v0.2 `current`" is missing the inference path. "Model X (vLLM, no system prefix, n=3 fresh-process trials) scored Y±σ on LHC v0.2 `current`" is comparable to ours.
- If you publish artifacts to HuggingFace from a working directory, do not upload ignored files. This repo's `.gitignore` covers `.env`, `models/`, `checkpoints/`, `evals/results/v0.2/` (working scorecards) and other local-only state. Use a clean `git ls-files` + `huggingface-cli upload` rather than `huggingface-cli upload .`.
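A placeholder example of that self-identification header, written as a Python dict. The field names come from the list above; every value is illustrative, not taken from a real scorecard.

```python
scorecard_header = {
    "model_slug": "my-org/my-8b-finetune-v3",
    "base_model": "Qwen/Qwen3-8B",
    "adapter": None,                      # or the adapter path/slug if you serve one
    "inference_path": "vllm",
    "inference_provider": "self-hosted",
    "system_prompt_prefix": None,         # e.g. "/no_think" for Qwen3-family models
    "server_config": {"max_model_len": 32768},  # whatever describes your server launch
}
```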
Apache 2.0 for code. See LICENSE.
LHC v0.2 tasks and methodology are also Apache 2.0 — fork the benchmark, run your own model on it, send a PR with your scorecard.
If you use LHC or build on this methodology, please cite via CITATION.cff.
Solo research effort by Hugo Nogueira. The methodology is the contribution; the model that didn't ship is the load-bearing example.
This is the seed for an eventual research lab focused on long-horizon agent reliability — but for now, it lives here under my personal name.