feat(bench): --adapter + --use-raw-turns flags + skill adapter schema fix #154
Merged
KailasMahavarkar merged 2 commits into main on Apr 20, 2026
Conversation
Adds `--adapter {graphstore,skill}` to the run_locomo CLI so both the
deterministic baseline and the LLM-driven skill adapter can be
exercised from one entry point.
Extras (only relevant to `--adapter skill`):
- `--skill-dump-dir PATH`: save raw LLM output per session
- `--no-carry-facts`: disable cross-session fact memory
Default adapter unchanged (graphstore). Tests and prior benchmark
invocations keep working as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
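For context, a minimal sketch of how these flags might be wired up in run_locomo.py. The option names and defaults come from this PR; the parser structure and help strings here are assumptions, not the shipped code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical wiring; the real run_locomo.py parser may differ.
    p = argparse.ArgumentParser(prog="run_locomo")
    p.add_argument(
        "--adapter",
        choices=["graphstore", "skill"],
        default="graphstore",  # default unchanged: deterministic baseline
        help="which memory adapter to benchmark",
    )
    # Extras, only meaningful with --adapter skill:
    p.add_argument("--skill-dump-dir", metavar="PATH",
                   help="save raw LLM output per session")
    p.add_argument("--no-carry-facts", action="store_true",
                   help="disable cross-session fact memory")
    return p
```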
Two coupled fixes needed for a fair LLM-ingest A/B test on LoCoMo.
1. datasets.py: add a `use_raw_turns` parameter to `load_locomo` (default
False matches prior behaviour: feeds the author-distilled observations).
When True, it forces the raw-turns branch: ~20 dialogue turns per session
with actual speaker/text instead of the 9 pre-extracted facts.
run_locomo.py: expose it as a `--use-raw-turns` CLI flag.
Why this matters: LoCoMo's observations are hand-distilled by the
dataset authors. Feeding observations to a Mem0-style "LLM distills
conversations into facts" adapter is distillation-of-distillation;
you cannot tell if the LLM is adding value. The fair test is: both
adapters ingest raw turns. Baseline stores turns verbatim; skill
adapter runs LLM to produce its own facts. Then their F1 delta
measures what LLM-at-ingest is actually worth.
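A sketch of the datasets.py change; the flag semantics are from this commit, but the JSON field names below ("sessions", "turns", "observations", "speaker", "text") are assumptions about the LoCoMo layout rather than the real implementation:

```python
import json

def load_locomo(path: str, use_raw_turns: bool = False):
    """Yield (session_id, messages) for one LoCoMo conversation.

    Sketch only: the field names used here are assumptions about
    the LoCoMo JSON, not the real datasets.py.
    """
    with open(path) as f:
        conv = json.load(f)
    for session_id, session in conv["sessions"].items():
        if use_raw_turns:
            # Raw branch: ~20 dialogue turns per session, real speaker/text.
            messages = [f'{t["speaker"]}: {t["text"]}' for t in session["turns"]]
        else:
            # Default branch (prior behaviour): ~9 author-distilled observations.
            messages = list(session["observations"])
        yield session_id, messages
```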
2. graphstore_skill.py: drop the schema override that re-registered
`message` without the `content` field. The parent schema (`content`
REQUIRED + `EMBED content`) is load-bearing because the parent's query
strategies call `n.get("content")`. The earlier version switched to a
`DOCUMENT "..."` clause, which populates the blob but leaves the column
empty; retrieval dropped those rows, `retrieved_memories` bled entity
`name` fields into answers ("Melanie" as answer text), and F1 crashed
to 0.023 on conv-26.
The prompt is updated accordingly: the LLM is told to emit the
`content = "text"` field directly, with no DOCUMENT clause.
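To make the failure mode concrete, here is a sketch of the parent-side retrieval step; the function name and surrounding logic are illustrative, not the actual query-strategy code:

```python
def render_context(nodes):
    # Query strategies read the typed column via n.get("content").
    # Under the DOCUMENT-clause schema the blob/FTS5/vector stores were
    # populated but this column stayed empty, so every skill-adapter
    # row fell through the filter below, and entity `name` fields were
    # all that bled into the answer context.
    lines = []
    for n in nodes:
        text = n.get("content")
        if not text:  # empty column -> row silently dropped
            continue
        lines.append(text)
    return "\n".join(lines)
```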
No API break. Both the observations and raw-turns paths were tested on conv-26:
- observations: 7 msgs in session 1 (distilled facts)
- raw turns: 18 msgs in session 1 (real dialogue), 419 total across the 19 sessions
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three bench-side changes that should ride v0.4.
1. Skill adapter schema fix (bug)
`GraphStoreSkillAdapter.reset()` previously unregistered the `message` kind and re-registered it without `content` REQUIRED and without `EMBED content`, and the prompt told the LLM to emit a `DOCUMENT "text"` clause.
Problem: the parent's query strategies read `n.get("content")`, which was empty because DOCUMENT populates the blob/FTS5/vector stores but not the column. Retrieval dropped those rows, and `retrieved_memories` bled entity `name` fields into the context. Conv-26 F1 crashed to 0.023.
Fix: keep the baseline schema (`content:string` REQUIRED + `EMBED content`). Update the skill-adapter prompt to instruct the LLM to emit `content = "..."` as a typed field. A/B parity restored.
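The prompt-side half of the fix might read something like this; the wording is an assumption, only the `content = "..."` vs DOCUMENT distinction comes from this PR:

```python
# Illustrative prompt fragment (wording assumed; not the shipped prompt):
SCHEMA_NOTE = (
    "Emit message text as a typed field, e.g.\n"
    '    content = "adopted a puppy in May"\n'
    'Never use a DOCUMENT "..." clause: it fills the blob/FTS5/vector '
    "stores but leaves the content column empty, and retrieval drops "
    "rows with an empty content column."
)
```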
2. `--adapter {graphstore,skill}` flag on run_locomo
Lets users pick the baseline vs the skill adapter from the CLI without editing Python. Default `graphstore` preserves prior behaviour.
Extras, only meaningful with `--adapter skill`:
- `--skill-dump-dir PATH`: save raw LLM output per session
- `--no-carry-facts`: disable cross-session fact memory
3. `--use-raw-turns` flag on run_locomo + `use_raw_turns` param on `load_locomo`
Forces the raw dialogue turns branch instead of the default author-distilled observations. Needed for a fair A/B of any LLM-ingest adapter: feeding pre-distilled facts to a distiller is a distillation-of-distillation test.
Default `False` preserves prior behaviour.
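Putting the flags together, the fair A/B described above might then be run as follows (invocation shape assumed; dataset-path arguments etc. omitted):

```bash
python run_locomo.py --use-raw-turns                   # baseline: turns stored verbatim
python run_locomo.py --use-raw-turns --adapter skill   # skill: LLM distills its own facts
```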
🤖 Generated with Claude Code