feat(bench): skill-guided LLM ingest adapter + LoCoMo llm_client wiring fix #149

Merged

KailasMahavarkar merged 3 commits into main from feat/locomo-skill-adapter on Apr 19, 2026

Conversation

@KailasMahavarkar (Contributor)

Summary

New bench adapter `GraphStoreSkillAdapter`: it uses the `graphstore-dsl` skill as the system prompt, hands each session to an LLM via litellm, parses the emitted DSL line by line through Lark, executes the valid lines, and drops the invalid ones.

Goal: test whether LLM-driven ingestion beats deterministic NER+CREATE NODE on LoCoMo. The query path is the same as the baseline skill-compliant adapter's (it inherits `GraphStoreAdapter`); only `ingest()` differs, so the A/B comparison is clean.

Shape

```
session -> render_prompt(skill + turns) -> llm_call
-> [ dsl_parse(line) + gs.execute(line) for line in output ]
```
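
A minimal Python sketch of this loop; `_render_session_prompt` and `_iter_dsl_lines` are the adapter's helpers named in the test plan below, while the exact signatures of `llm_call`, `dsl_parse`, and `gs.execute` (including `dsl_parse` raising on bad input) are assumptions:

```python
def ingest(self, session):
    # Skill text + session turns rendered into one prompt, one LLM roundtrip.
    prompt = self._render_session_prompt(session)
    raw = llm_call(prompt)  # litellm under the hood

    for line in self._iter_dsl_lines(raw):
        try:
            dsl_parse(line)               # Lark gate: reject malformed lines
        except Exception:
            self._stats["parse_failed"] += 1
            continue                      # invalid lines are counted and dropped
        self.gs.execute(line)             # valid lines run against the store
        self._stats["executed"] += 1
```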

Schema difference

Parent adapter registers `message` with `content:string` REQUIRED plus `EMBED content`. The skill instead teaches the `DOCUMENT "..."` clause path (post-PR #102), which populates blob + FTS5 + vector in one call. Registering both is a G7 violation, so this adapter's `reset()` re-registers `message` without the REQUIRED `content` field and without EMBED.
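
A hedged sketch of the override; `register_kind` is a hypothetical name for the store's registration call, and only the REQUIRED/EMBED delta comes from this PR:

```python
class GraphStoreSkillAdapter(GraphStoreAdapter):
    def reset(self):
        super().reset()
        # Parent registered message(content: string REQUIRED) + EMBED content.
        # Re-register without REQUIRED and without EMBED so the skill's
        # `CREATE NODE ... DOCUMENT "..."` path is the sole embedder (no G7 clash).
        self.gs.register_kind("message", fields={"content": "string"}, embed=None)
```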

Smoke (1 session, 7 msgs, LoCoMo conv-26/s1)

```
ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
DSL accept rate: 31/31 = 100%
nodes: 16 (session + messages + entities)
edges: 15 (has_message + next + mentions)
REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg
```

LoCoMo wiring fix (bundled, required for the LLM call)

`benchmarks/framework/llm_client.py`:

  1. `_CONFIG_PATH` was `repo_root/autoresearch/config.json`; corrected to `repo_root/tools/autoresearch/config.json` (where the file actually lives). Without this, no providers resolve and `llm_call` returns empty.
  2. `QA_MODEL` / `QA_MODEL_OR` were both `"minimax/minimax-m2.7:nitro"`. Now split: `QA_MODEL = "minimax-m2.7:cloud"` (Ollama) wins when local_ollama is reachable; `QA_MODEL_OR` remains the OpenRouter paid fallback.

These two fixes let the bench run entirely on local Ollama for free.
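
Roughly what the corrected wiring looks like (model strings and config path are from this PR; the path arithmetic and the reachability check are sketched assumptions):

```python
from pathlib import Path

# benchmarks/framework/llm_client.py -> parents[2] is the repo root
_CONFIG_PATH = Path(__file__).resolve().parents[2] / "tools" / "autoresearch" / "config.json"

QA_MODEL = "minimax-m2.7:cloud"             # local Ollama, free
QA_MODEL_OR = "minimax/minimax-m2.7:nitro"  # OpenRouter, paid fallback

def _pick_model(local_ollama_reachable: bool) -> str:
    # Ollama wins whenever it is reachable; OpenRouter is only the fallback.
    return QA_MODEL if local_ollama_reachable else QA_MODEL_OR
```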

Known gaps (follow-up)

  • LLM under-emitted messages in the smoke run (only 3 of 7 emitted). May need a per-message budget split or smaller batches.
  • No CLI flag yet on `run_locomo.py` to pick adapter. Instantiate from Python until then.
  • Full LoCoMo A/B run (baseline vs skill-adapter on all 1986 Qs) is a separate PR.

Test plan

  • Helper unit tests (`_render_session_prompt`, `_iter_dsl_lines`) pass offline
  • Smoke ingest on 1 LoCoMo session succeeds with 100% DSL accept rate
  • REMEMBER returns correct top hit on a factual query against the ingested session
  • Full LoCoMo A/B (deferred to follow-up)

🤖 Generated with Claude Code

KailasMahavarkar and others added 3 commits April 20, 2026 02:10
New bench adapter `GraphStoreSkillAdapter` in
benchmarks/framework/adapters/graphstore_skill.py. Takes the
graphstore-dsl skill (tools/skills/graphstore-dsl/SKILL.md) as the
system prompt, hands a conversation session to the LLM via litellm,
and parses the emitted DSL line-by-line through Lark. Invalid lines
get counted and dropped; valid ones execute.

Shape:

  session -> render_prompt(skill + turns) -> llm_call -> [line for line
  in raw.splitlines() if dsl_parse(line) ok then gs.execute(line)]

Subclasses GraphStoreAdapter so the query path is identical to the
baseline skill-compliant adapter (REMEMBER + RECALL + recency). Only
the ingest() method differs. This makes A/B comparison clean: same
retrieval, different ingestion.

Overrides reset() to re-register `message` kind without REQUIRED
content field + no EMBED directive. Parent adapter was built around
the `content=` field path; the skill teaches `DOCUMENT "..."` clause
(post-PR #102), which populates blob + FTS5 + vector in one shot.
Registering both is a G7 violation; pick one.

Tracks per-run stats: emitted / executed / parse_failed / exec_failed /
llm_empty / sessions + sample of rejected statements. Runner can pick
these up via `ingest_done(record_metadata=)` for reporting.
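
Illustrative shape of that record (field names are from this message; the
container and the reporting hook's exact signature are assumptions):

    stats = {
        "emitted": 0,           # DSL lines the LLM produced
        "executed": 0,          # parsed and ran successfully
        "parse_failed": 0,      # rejected by the Lark grammar
        "exec_failed": 0,       # parsed but failed inside gs.execute()
        "llm_empty": 0,         # sessions where the LLM returned nothing
        "sessions": 0,
        "rejected_sample": [],  # a few rejected statements, for debugging
    }
    # surfaced to the runner through ingest_done(record_metadata=stats)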

Also bundled (required for this adapter's LLM calls):

  benchmarks/framework/llm_client.py
    - _CONFIG_PATH fixed: was repo_root/autoresearch/config.json,
      now repo_root/tools/autoresearch/config.json (where the file
      actually lives)
    - QA_MODEL / QA_MODEL_OR split: Ollama's "minimax-m2.7:cloud" wins
      when local_ollama is reachable, OpenRouter's
      "minimax/minimax-m2.7:nitro" remains as paid fallback

Smoke (1 session, 7 messages, LoCoMo conv-26/s1):

  ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
  DSL accept rate: 31/31 = 100%
  nodes: 16 (session + messages + entities)
  edges: 15 (has_message + next + mentions)
  REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg

Known gaps for follow-up:
- LLM under-emitted messages (3/7) in smoke - may need per-msg budget
  split or lower max_tokens cap per call
- Compare F1 against baseline adapter (same retrieval, different
  ingest) on full LoCoMo - separate runner flag / follow-up PR
- No CLI flag yet on run_locomo.py to pick the adapter; instantiate
  directly from Python until then

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new surfaces for inspecting / auditing / replaying what the LLM
actually emitted during skill-guided ingest:

1. `adapter.last_raw_output` property
   In-memory string set on every ingest() call. None before first
   call; always updated afterwards even if the output was empty.

2. `config["skill_dump_raw_dir"] = "<path>"`
   Persistent per-session .dsl file at <path>/<session_id>.dsl. Dir
   created on adapter __init__. Session id is filesystem-sanitised
   before use. Written regardless of empty / parse-failed status so
   nothing is lost.

Used for:
- Smoke test inspection ("what did the LLM actually emit?")
- Post-hoc analysis when F1 drops on a specific record
- Replay: point a rerun at the .dsl files, skip the LLM, deterministic
- Diffing two LLM prompts on the same session
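
Hypothetical usage of both surfaces (the constructor shape is an assumption;
the config key and property name are from this message):

    adapter = GraphStoreSkillAdapter(config={"skill_dump_raw_dir": "runs/raw_dsl"})
    adapter.ingest(session)
    print(adapter.last_raw_output)  # exact text the LLM emitted (may be empty)
    # persistent copy for replay/diffing: runs/raw_dsl/<sanitised_session_id>.dsl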

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a running belief-state that carries across session ingestions
within one conversation. LLM sees prior ASSERTed facts before
processing each new session. Two effects:

1. Deduplication. LLM no longer re-asserts facts it already wrote.
   Observed on LoCoMo conv-26: s2 emitted 4 new ASSERTs instead of
   rediscovering s1's 11 facts.
2. Cross-session RETRACT. When a new message contradicts a prior
   belief, LLM can emit `RETRACT "fact:id" REASON "..."` before the
   superseding ASSERT. (Not observed on conv-26 yet - accumulative
   narrative, no real contradictions. Mechanism in place for
   datasets that do contradict.)
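
What the cross-session flow could look like in emitted DSL (the
`RETRACT ... REASON` shape is from this message; the fact ids, ASSERT payload
syntax, and wording are illustrative, and the "session" lines are annotations,
not DSL):

    session 1:
      ASSERT "fact:job-01" "Caroline works weekends at the bakery."
    session 3, after seeing the belief state:
      RETRACT "fact:job-01" REASON "superseded: she changed jobs"
      ASSERT "fact:job-02" "Caroline now works at the library."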

Implementation:
- `_scrape_belief_updates(lines, facts)` walks successfully-executed
  statements, matches `ASSERT "id" ...` / `RETRACT "id" ...` regex,
  updates an in-memory dict keyed by fact_id. The regex is deliberately
  forgiving; any odd statement just misses the scrape, which is harmless.
  (Both helpers are sketched below.)
- `_render_known_facts_block(facts)` formats live (non-retracted) facts
  as a compact text block injected between skill and session instructions.
- `ingest()` wires these: after executing a session's statements, scrape
  belief updates; next session's prompt includes the current belief
  state.
- Config knobs:
    skill_carry_facts      True (default) / False to disable
    skill_max_known_facts  120 (soft cap; evicts retracted first, then
                           oldest live)
- Stats unchanged; memory sits alongside on `self._known_facts`.

Reset clears the facts dict so each new conversation starts fresh.
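
A minimal Python sketch of both helpers, assuming statements arrive as plain
strings; the real regex and fact-record shape may differ:

    import re

    _BELIEF_RE = re.compile(r'^(ASSERT|RETRACT)\s+"([^"]+)"')

    def _scrape_belief_updates(lines, facts):
        """Update the in-memory belief dict from successfully executed lines."""
        for line in lines:
            m = _BELIEF_RE.match(line.strip())
            if not m:
                continue  # deliberately forgiving: odd statements just miss the scrape
            verb, fact_id = m.groups()
            if verb == "ASSERT":
                facts[fact_id] = {"stmt": line, "retracted": False}
            else:  # RETRACT: mark dead but keep the entry
                facts.setdefault(fact_id, {"stmt": line})["retracted"] = True

    def _render_known_facts_block(facts):
        """Compact text block of live (non-retracted) facts for the next prompt."""
        live = [f["stmt"] for f in facts.values() if not f.get("retracted")]
        return "Known facts so far:\n" + "\n".join(f"- {s}" for s in live)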

4-session smoke verified:
    s1: 11 asserts -> 11 alive
    s2:  4 asserts (saw s1's facts) -> 15 alive
    s3: 10 asserts -> 25 alive
    s4:  8 asserts -> 32 alive

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@KailasMahavarkar merged commit 406cab3 into main on Apr 19, 2026
4 checks passed
@KailasMahavarkar deleted the feat/locomo-skill-adapter branch on Apr 19, 2026 at 21:06
KailasMahavarkar added a commit that referenced this pull request Apr 20, 2026
v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)

Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)

Breaking changes: none. All additions are additive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KailasMahavarkar added a commit that referenced this pull request Apr 20, 2026
* chore(release): bump v0.3.0 -> v0.4.0

v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)

Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)

Breaking changes: none. All additions are additive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump pyproject.toml version to 0.4.0

Missed in 07c9986. Pairs with src/graphstore/__init__.py bump.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>