
feat(bench): --adapter + --use-raw-turns flags + skill adapter schema fix #154

Merged
KailasMahavarkar merged 2 commits into main from feat/locomo-ab-run on Apr 20, 2026

Conversation

@KailasMahavarkar
Contributor

Three bench-side changes that should ride v0.4.

1. Skill adapter schema fix (bug)

`GraphStoreSkillAdapter.reset()` previously unregistered the `message` kind and re-registered it without `content` REQUIRED and without `EMBED content`, and the prompt told the LLM to emit a `DOCUMENT "text"` clause instead.

Problem: the parent's query strategies read `n.get("content")`, which was empty because DOCUMENT populates the blob/FTS5/vector stores but not the column. Retrieval dropped those rows, `retrieved_memories` bled entity `name` fields into context, and conv-26 F1 crashed to 0.023.

Fix: keep the baseline schema (`content:string` REQUIRED + `EMBED content`). Update the skill-adapter prompt to instruct the LLM to emit `content = "..."` as a typed field. A/B parity restored.
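
To make the failure mode concrete, here is a minimal sketch (the node shapes and the query function are hypothetical stand-ins, not the real GraphStore API) of why a row whose text lives only in the DOCUMENT blob is invisible to a strategy that reads `n.get("content")`:

```python
# Illustrative only: hypothetical node dicts and query function, not the real API.
nodes = [
    # Baseline schema: `content` is a REQUIRED typed field, so it is populated.
    {"kind": "message", "content": "Melanie signed up for the charity 5K."},
    # Earlier skill-adapter output: DOCUMENT clause filled the blob, not the column.
    {"kind": "message", "content": ""},
]

def query_by_content(nodes, needle):
    """Hypothetical stand-in for a parent query strategy that reads n.get('content')."""
    return [n for n in nodes if needle.lower() in (n.get("content") or "").lower()]

print(len(query_by_content(nodes, "charity")))  # 1 -- the DOCUMENT-only row is invisible
```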

2. `--adapter {graphstore,skill}` flag on run_locomo

Lets users pick the baseline vs. the skill adapter from the CLI without editing Python. The default, `graphstore`, preserves prior behaviour.

Extras, only meaningful with `--adapter skill`:

  • `--skill-dump-dir PATH` - dump raw LLM output per session for audit
  • `--no-carry-facts` - disable cross-session fact memory
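
For reference, a minimal argparse sketch of how these flags could be wired. The flag names and defaults follow this PR, but the parser structure is an assumption; adapter construction and benchmark execution are omitted:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch only: not the actual run_locomo wiring.
    p = argparse.ArgumentParser(prog="run_locomo")
    p.add_argument("--adapter", choices=["graphstore", "skill"], default="graphstore",
                   help="deterministic baseline vs. LLM-driven skill adapter")
    # Only meaningful with --adapter skill:
    p.add_argument("--skill-dump-dir", metavar="PATH", default=None,
                   help="dump raw LLM output per session for audit")
    p.add_argument("--no-carry-facts", action="store_true",
                   help="disable cross-session fact memory")
    return p

if __name__ == "__main__":
    args = build_parser().parse_args(["--adapter", "skill", "--skill-dump-dir", "dumps/"])
    print(args.adapter, args.skill_dump_dir, args.no_carry_facts)
```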

3. `--use-raw-turns` flag on run_locomo + `use_raw_turns` param on `load_locomo`

Forces the raw dialogue turns branch instead of the default author-distilled observations. Needed for a fair A/B of any LLM-ingest adapter: feeding pre-distilled facts to a distiller is a distillation-of-distillation test.

Default `False` preserves prior behaviour.
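
A rough sketch of the loader switch, assuming a simplified conversation JSON layout (the field names `sessions`, `turns`, `observations`, `speaker`, and `text` are placeholders, not the actual LoCoMo keys or the real datasets.py code):

```python
import json

def load_locomo(path: str, use_raw_turns: bool = False) -> list[dict]:
    # Sketch only: field names below are placeholders for the LoCoMo JSON layout.
    with open(path) as f:
        conv = json.load(f)
    out = []
    for session in conv["sessions"]:
        if use_raw_turns:
            # Raw branch: every dialogue turn, speaker and text verbatim.
            msgs = [{"speaker": t["speaker"], "text": t["text"]} for t in session["turns"]]
        else:
            # Default branch (prior behaviour): author-distilled observation strings.
            msgs = [{"speaker": None, "text": obs} for obs in session["observations"]]
        out.append({"session_id": session["id"], "messages": msgs})
    return out
```

With a shape like this, `--use-raw-turns` on run_locomo only needs to forward the flag as `load_locomo(..., use_raw_turns=args.use_raw_turns)`; the rest of the pipeline simply sees a longer message list per session (18 vs. 7 on conv-26 session 1, per the verification below).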

Verified

  • 1766 tests pass on main after these changes; no regressions
  • Parser-roundtrip verified for new skill-adapter prompt
  • Dataset loader switch verified on conv-26: observations path = 7 msgs per session; raw path = 18 msgs per session

🤖 Generated with Claude Code

KailasMahavarkar and others added 2 commits April 20, 2026 02:37
Adds --adapter {graphstore,skill} to run_locomo CLI so both the
deterministic baseline and the LLM-driven skill adapter can be
exercised from one entry point.

Extras (only relevant to --adapter skill):
  --skill-dump-dir PATH   save raw LLM output per session
  --no-carry-facts        disable cross-session fact memory

Default adapter unchanged (graphstore). Tests and prior benchmark
invocations keep working as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled fixes needed for a fair LLM-ingest A/B test on LoCoMo.

1. datasets.py: add `use_raw_turns` parameter to load_locomo (default False
   matches prior behaviour - feeds author-distilled observations). When
   True, forces the raw-turns branch: ~20 dialogue turns per session with
   actual speaker/text instead of the 9 pre-extracted facts.
   run_locomo.py: expose as --use-raw-turns CLI flag.

   Why this matters: LoCoMo's observations are hand-distilled by the
   dataset authors. Feeding observations to a Mem0-style "LLM distills
   conversations into facts" adapter is distillation-of-distillation;
   you cannot tell if the LLM is adding value. The fair test is: both
   adapters ingest raw turns. Baseline stores turns verbatim; skill
   adapter runs LLM to produce its own facts. Then their F1 delta
   measures what LLM-at-ingest is actually worth.

2. graphstore_skill.py: drop the schema override that re-registered
   `message` without the `content` field. Parent schema (`content`
   REQUIRED + EMBED content) is load-bearing because parent's query
   strategies call `n.get("content")`. Earlier version switched to
   `DOCUMENT "..."` clause which populates the blob but leaves the
   column empty; retrieval dropped those rows, retrieved_memories
   bled entity `name` fields ("Melanie" as answer text), F1 crashed
   to 0.023 on conv-26.

   Prompt updated accordingly: tell the LLM to emit `content = "text"`
   field directly, no DOCUMENT clause.

No API break. Both observations and raw-turns paths tested on conv-26:
observations  = 7 msgs/s1 (distilled facts)
raw turns     = 18 msgs/s1 (real dialogue), 419 total for the 19 sessions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@KailasMahavarkar KailasMahavarkar merged commit 625551c into main Apr 20, 2026
4 checks passed
@KailasMahavarkar KailasMahavarkar deleted the feat/locomo-ab-run branch April 20, 2026 07:24