feat(bench): skill-guided LLM ingest adapter + LoCoMo llm_client wiring fix (#149)
Merged - KailasMahavarkar merged 3 commits into main on Apr 19, 2026
Conversation
New bench adapter `GraphStoreSkillAdapter` in benchmarks/framework/adapters/graphstore_skill.py. Takes the graphstore-dsl skill (tools/skills/graphstore-dsl/SKILL.md) as the system prompt, hands a conversation session to the LLM via litellm, and parses the emitted DSL line-by-line through Lark. Invalid lines get counted and dropped; valid ones execute.
Shape: session -> render_prompt(skill + turns) -> llm_call -> [line for line in raw.splitlines() if dsl_parse(line) ok then gs.execute(line)]
Subclasses GraphStoreAdapter so the query path is identical to the baseline skill-compliant adapter (REMEMBER + RECALL + recency). Only the ingest() method differs, which makes A/B comparison clean: same retrieval, different ingestion.
Overrides reset() to re-register the `message` kind without a REQUIRED content field and without an EMBED directive. The parent adapter was built around the `content=` field path; the skill teaches the `DOCUMENT "..."` clause (post-PR #102), which populates blob + FTS5 + vector in one shot. Registering both is a G7 violation; pick one.
Tracks per-run stats: emitted / executed / parse_failed / exec_failed / llm_empty / sessions, plus a sample of rejected statements. The runner can pick these up via `ingest_done(record_metadata=)` for reporting.
Also bundled (required for this adapter's LLM calls): benchmarks/framework/llm_client.py
- _CONFIG_PATH fixed: was repo_root/autoresearch/config.json, now repo_root/tools/autoresearch/config.json (where the file actually lives)
- QA_MODEL / QA_MODEL_OR split: Ollama's "minimax-m2.7:cloud" wins when local_ollama is reachable; OpenRouter's "minimax/minimax-m2.7:nitro" remains as the paid fallback
Smoke (1 session, 7 messages, LoCoMo conv-26/s1):
- ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
- DSL accept rate: 31/31 = 100%
- nodes: 16 (session + messages + entities)
- edges: 15 (has_message + next + mentions)
- REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg
Known gaps for follow-up:
- LLM under-emitted messages (3/7) in the smoke run - may need a per-msg budget split or a lower max_tokens cap per call
- Compare F1 against the baseline adapter (same retrieval, different ingest) on full LoCoMo - separate runner flag / follow-up PR
- No CLI flag yet on run_locomo.py to pick the adapter; instantiate directly from Python until then
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new surfaces for inspecting / auditing / replaying what the LLM
actually emitted during skill-guided ingest:
1. `adapter.last_raw_output` property
In-memory string set on every ingest() call. None before first
call; always updated afterwards even if the output was empty.
2. `config["skill_dump_raw_dir"] = "<path>"`
Persistent per-session .dsl file at <path>/<session_id>.dsl. Dir
created on adapter __init__. Session id is filesystem-sanitised
before use. Written regardless of empty / parse-failed status so
nothing is lost.
Used for:
- Smoke test inspection ("what did the LLM actually emit?")
- Post-hoc analysis when F1 drops on a specific record
- Replay: point a rerun at the .dsl files, skip the LLM, deterministic
- Diffing two LLM prompts on the same session
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a running belief-state that carries across session ingestions
within one conversation. LLM sees prior ASSERTed facts before
processing each new session. Two effects:
1. Deduplication. LLM no longer re-asserts facts it already wrote.
Observed on LoCoMo conv-26: s2 emitted 4 new ASSERTs instead of
rediscovering s1's 11 facts.
2. Cross-session RETRACT. When a new message contradicts a prior
belief, LLM can emit `RETRACT "fact:id" REASON "..."` before the
superseding ASSERT. (Not observed on conv-26 yet - it's a cumulative
narrative with no real contradictions. The mechanism is in place for
datasets that do contradict.)
Implementation:
- `_scrape_belief_updates(lines, facts)` walks successfully-executed
statements, matches `ASSERT "id" ...` / `RETRACT "id" ...` regex,
updates in-memory dict keyed by fact_id. Regex is deliberately
forgiving; any odd statement just misses the scrape, harmless.
- `_render_known_facts_block(facts)` formats live (non-retracted) facts
as a compact text block injected between skill and session instructions.
- `ingest()` wires these: after executing a session's statements, scrape
belief updates; next session's prompt includes the current belief
state.
- Config knobs:
skill_carry_facts True (default) / False to disable
skill_max_known_facts 120 (soft cap; evicts retracted first, then
oldest live)
- Stats unchanged; memory sits alongside on `self._known_facts`.
Reset clears the facts dict so each new conversation starts fresh.
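The scrape-and-cap mechanism above can be sketched like this. The regex shapes and the eviction order (retracted first, then oldest live) follow the description; the function names and dict layout are assumptions, not the adapter's actual code:

```python
import re
from collections import OrderedDict

# Deliberately forgiving patterns: statements that don't match just miss the scrape.
_ASSERT_RE = re.compile(r'^ASSERT\s+"([^"]+)"\s*(.*)$')
_RETRACT_RE = re.compile(r'^RETRACT\s+"([^"]+)"')


def scrape_belief_updates(lines, facts):
    """Walk successfully executed statements, updating the in-memory belief dict."""
    for line in lines:
        stripped = line.strip()
        m = _ASSERT_RE.match(stripped)
        if m:
            facts[m.group(1)] = {"text": m.group(2), "retracted": False}
            continue
        m = _RETRACT_RE.match(stripped)
        if m and m.group(1) in facts:
            facts[m.group(1)]["retracted"] = True


def evict_to_cap(facts, cap=120):
    """Soft cap: evict retracted facts first, then the oldest live ones."""
    for fid in [f for f, v in facts.items() if v["retracted"]]:
        if len(facts) <= cap:
            return
        del facts[fid]
    while len(facts) > cap:
        facts.popitem(last=False)  # OrderedDict: oldest insertion first
```

Keeping `facts` as an insertion-ordered dict is what makes "oldest live" eviction a one-liner.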
4-session smoke verified:
s1: 11 asserts -> 11 alive
s2: 4 asserts (saw s1's facts) -> 15 alive
s3: 10 asserts -> 25 alive
s4: 8 asserts -> 32 alive
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KailasMahavarkar added a commit that referenced this pull request on Apr 20, 2026
v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)
Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)
Breaking changes: none. All additions are additive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KailasMahavarkar added a commit that referenced this pull request on Apr 20, 2026
* chore(release): bump v0.3.0 -> v0.4.0
v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)
Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)
Breaking changes: none. All additions are additive.
* chore(release): bump pyproject.toml version to 0.4.0
Missed in 07c9986. Pairs with the src/graphstore/__init__.py bump.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
New bench adapter `GraphStoreSkillAdapter`. Uses the `graphstore-dsl` skill as a system prompt, hands each session to an LLM via litellm, parses the emitted DSL line-by-line through Lark, executes valid lines, drops invalid ones.
Goal: test whether LLM-driven ingestion beats deterministic NER+CREATE NODE on LoCoMo. Same query path as the baseline skill-compliant adapter (inherits `GraphStoreAdapter`) - only `ingest()` differs, so A/B comparison is clean.
Shape
```
session -> render_prompt(skill + turns) -> llm_call
-> [ dsl_parse(line) + gs.execute(line) for line in output ]
```
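The pipeline in the diagram can be sketched as below. The `parse` and `execute` callables are stand-ins for the adapter's real internals (a Lark parser and `gs.execute`); the stats keys mirror the per-run counters the PR describes:

```python
def execute_dsl(raw, parse, execute, stats):
    """Parse emitted DSL line-by-line; execute valid lines, count and drop the rest."""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        stats["emitted"] = stats.get("emitted", 0) + 1
        try:
            parse(line)  # e.g. lark.Lark(grammar).parse(line)
        except Exception:
            stats["parse_failed"] = stats.get("parse_failed", 0) + 1
            stats.setdefault("rejected_sample", []).append(line)  # keep a sample for reporting
            continue
        try:
            execute(line)  # e.g. gs.execute(line)
            stats["executed"] = stats.get("executed", 0) + 1
        except Exception:
            stats["exec_failed"] = stats.get("exec_failed", 0) + 1
```

Separating parse failures from execution failures is what makes the accept-rate numbers in the smoke section meaningful: a syntactically valid statement can still fail at execution time.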
Schema difference
Parent adapter registers `message` with `content:string` REQUIRED + `EMBED content`. Skill teaches the `DOCUMENT "..."` clause path (post-PR #102) which populates blob + FTS5 + vector in one call. Registering both is a G7 violation. This adapter's `reset()` re-registers `message` without `content` REQUIRED and without EMBED.
Smoke (1 session, 7 msgs, LoCoMo conv-26/s1)
```
ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
DSL accept rate: 31/31 = 100%
nodes: 16 (session + messages + entities)
edges: 15 (has_message + next + mentions)
REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg
```
LoCoMo wiring fix (bundled, required for the LLM call)
`benchmarks/framework/llm_client.py`:
- `_CONFIG_PATH` fixed: was `repo_root/autoresearch/config.json`, now `repo_root/tools/autoresearch/config.json` (where the file actually lives)
- QA_MODEL / QA_MODEL_OR split: Ollama's `minimax-m2.7:cloud` wins when local Ollama is reachable; OpenRouter's `minimax/minimax-m2.7:nitro` remains as the paid fallback

These two fixes let the bench run entirely on local Ollama for free.
Known gaps (follow-up)
Test plan
🤖 Generated with Claude Code