feat(bench): skill-guided LLM ingest adapter + LoCoMo llm_client wiring fix #149

Merged

KailasMahavarkar merged 3 commits into main from feat/locomo-skill-adapter on Apr 19, 2026

Conversation

@KailasMahavarkar (Contributor)

Summary

New bench adapter `GraphStoreSkillAdapter`: it uses the `graphstore-dsl` skill as the system prompt, hands each session to an LLM via litellm, parses the emitted DSL line by line through Lark, executes the valid lines, and drops the invalid ones.

Goal: test whether LLM-driven ingestion beats deterministic NER+CREATE NODE on LoCoMo. The query path is the same as the baseline skill-compliant adapter's (it inherits `GraphStoreAdapter`); only `ingest()` differs, so the A/B comparison is clean.

Shape

```
session -> render_prompt(skill + turns) -> llm_call
-> [ dsl_parse(line) + gs.execute(line) for line in output ]
```
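
A minimal Python sketch of this loop; `_render_session_prompt` and `_iter_dsl_lines` are the adapter's helpers named in the test plan below, while the exact signatures of `llm_call`, `dsl_parse`, and `gs.execute` (including `dsl_parse` raising on bad input) are assumptions:

```python
def ingest(self, session):
    # Skill text + session turns rendered into one prompt, one LLM roundtrip.
    prompt = self._render_session_prompt(session)
    raw = llm_call(prompt)  # litellm under the hood

    for line in self._iter_dsl_lines(raw):
        try:
            dsl_parse(line)               # Lark gate: reject malformed lines
        except Exception:
            self._stats["parse_failed"] += 1
            continue                      # invalid lines are counted and dropped
        self.gs.execute(line)             # valid lines run against the store
        self._stats["executed"] += 1
```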

Schema difference

Parent adapter registers `message` with `content:string` REQUIRED plus `EMBED content`. The skill instead teaches the `DOCUMENT "..."` clause path (post-PR #102), which populates blob + FTS5 + vector in one call. Registering both is a G7 violation, so this adapter's `reset()` re-registers `message` without the REQUIRED `content` field and without EMBED.
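
A hedged sketch of the override; `register_kind` is a hypothetical name for the store's registration call, and only the REQUIRED/EMBED delta comes from this PR:

```python
class GraphStoreSkillAdapter(GraphStoreAdapter):
    def reset(self):
        super().reset()
        # Parent registered message(content: string REQUIRED) + EMBED content.
        # Re-register without REQUIRED and without EMBED so the skill's
        # `CREATE NODE ... DOCUMENT "..."` path is the sole embedder (no G7 clash).
        self.gs.register_kind("message", fields={"content": "string"}, embed=None)
```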

Smoke (1 session, 7 msgs, LoCoMo conv-26/s1)

```
ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
DSL accept rate: 31/31 = 100%
nodes: 16 (session + messages + entities)
edges: 15 (has_message + next + mentions)
REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg
```

LoCoMo wiring fix (bundled, required for the LLM call)

`benchmarks/framework/llm_client.py`:

  1. `_CONFIG_PATH` was `repo_root/autoresearch/config.json`; corrected to `repo_root/tools/autoresearch/config.json` (where the file actually lives). Without this, no providers resolve and `llm_call` returns empty.
  2. `QA_MODEL` / `QA_MODEL_OR` were both `"minimax/minimax-m2.7:nitro"`. Now split: `QA_MODEL = "minimax-m2.7:cloud"` (Ollama) wins when local_ollama is reachable; `QA_MODEL_OR` remains the OpenRouter paid fallback.

These two fixes let the bench run entirely on local Ollama for free.
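
Roughly what the corrected wiring looks like (model strings and config path are from this PR; the path arithmetic and the reachability check are sketched assumptions):

```python
from pathlib import Path

# benchmarks/framework/llm_client.py -> parents[2] is the repo root
_CONFIG_PATH = Path(__file__).resolve().parents[2] / "tools" / "autoresearch" / "config.json"

QA_MODEL = "minimax-m2.7:cloud"             # local Ollama, free
QA_MODEL_OR = "minimax/minimax-m2.7:nitro"  # OpenRouter, paid fallback

def _pick_model(local_ollama_reachable: bool) -> str:
    # Ollama wins whenever it is reachable; OpenRouter is only the fallback.
    return QA_MODEL if local_ollama_reachable else QA_MODEL_OR
```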

Known gaps (follow-up)

  • LLM under-emitted messages in the smoke run (only 3 of 7 emitted). May need a per-message budget split or smaller batches.
  • No CLI flag yet on `run_locomo.py` to pick adapter. Instantiate from Python until then.
  • Full LoCoMo A/B run (baseline vs skill-adapter on all 1986 Qs) is a separate PR.

Test plan

  • Helper unit tests (`_render_session_prompt`, `_iter_dsl_lines`) pass offline
  • Smoke ingest on 1 LoCoMo session succeeds with 100% DSL accept rate
  • REMEMBER returns correct top hit on a factual query against the ingested session
  • Full LoCoMo A/B (deferred to follow-up)

🤖 Generated with Claude Code

KailasMahavarkar and others added 3 commits April 20, 2026 02:10
New bench adapter `GraphStoreSkillAdapter` in
benchmarks/framework/adapters/graphstore_skill.py. Takes the
graphstore-dsl skill (tools/skills/graphstore-dsl/SKILL.md) as the
system prompt, hands a conversation session to the LLM via litellm,
and parses the emitted DSL line-by-line through Lark. Invalid lines
get counted and dropped; valid ones execute.

Shape:

  session -> render_prompt(skill + turns) -> llm_call -> [line for line
  in raw.splitlines() if dsl_parse(line) ok then gs.execute(line)]

Subclasses GraphStoreAdapter so the query path is identical to the
baseline skill-compliant adapter (REMEMBER + RECALL + recency). Only
the ingest() method differs. This makes A/B comparison clean: same
retrieval, different ingestion.

Overrides reset() to re-register `message` kind without REQUIRED
content field + no EMBED directive. Parent adapter was built around
the `content=` field path; the skill teaches `DOCUMENT "..."` clause
(post-PR #102), which populates blob + FTS5 + vector in one shot.
Registering both is a G7 violation; pick one.

Tracks per-run stats: emitted / executed / parse_failed / exec_failed /
llm_empty / sessions + sample of rejected statements. Runner can pick
these up via `ingest_done(record_metadata=)` for reporting.
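
Illustrative shape of that record (field names are from this message; the
container and the reporting hook's exact signature are assumptions):

    stats = {
        "emitted": 0,           # DSL lines the LLM produced
        "executed": 0,          # parsed and ran successfully
        "parse_failed": 0,      # rejected by the Lark grammar
        "exec_failed": 0,       # parsed but failed inside gs.execute()
        "llm_empty": 0,         # sessions where the LLM returned nothing
        "sessions": 0,
        "rejected_sample": [],  # a few rejected statements, for debugging
    }
    # surfaced to the runner through ingest_done(record_metadata=stats)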

Also bundled (required for this adapter's LLM calls):

  benchmarks/framework/llm_client.py
    - _CONFIG_PATH fixed: was repo_root/autoresearch/config.json,
      now repo_root/tools/autoresearch/config.json (where the file
      actually lives)
    - QA_MODEL / QA_MODEL_OR split: Ollama's "minimax-m2.7:cloud" wins
      when local_ollama is reachable, OpenRouter's
      "minimax/minimax-m2.7:nitro" remains as paid fallback

Smoke (1 session, 7 messages, LoCoMo conv-26/s1):

  ingest: 14.3 s (LLM roundtrip via Ollama minimax-m2.7:cloud)
  DSL accept rate: 31/31 = 100%
  nodes: 16 (session + messages + entities)
  edges: 15 (has_message + next + mentions)
  REMEMBER "LGBTQ support group" top-3 hits the answer-bearing msg

Known gaps for follow-up:
- LLM under-emitted messages (3/7) in smoke - may need per-msg budget
  split or lower max_tokens cap per call
- Compare F1 against baseline adapter (same retrieval, different
  ingest) on full LoCoMo - separate runner flag / follow-up PR
- No CLI flag yet on run_locomo.py to pick the adapter; instantiate
  directly from Python until then

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new surfaces for inspecting / auditing / replaying what the LLM
actually emitted during skill-guided ingest:

1. `adapter.last_raw_output` property
   In-memory string set on every ingest() call. None before first
   call; always updated afterwards even if the output was empty.

2. `config["skill_dump_raw_dir"] = "<path>"`
   Persistent per-session .dsl file at <path>/<session_id>.dsl. Dir
   created on adapter __init__. Session id is filesystem-sanitised
   before use. Written regardless of empty / parse-failed status so
   nothing is lost.

Used for:
- Smoke test inspection ("what did the LLM actually emit?")
- Post-hoc analysis when F1 drops on a specific record
- Replay: point a rerun at the .dsl files, skip the LLM, deterministic
- Diffing two LLM prompts on the same session
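
Hypothetical usage of both surfaces (the constructor shape is an assumption;
the config key and property name are from this message):

    adapter = GraphStoreSkillAdapter(config={"skill_dump_raw_dir": "runs/raw_dsl"})
    adapter.ingest(session)
    print(adapter.last_raw_output)  # exact text the LLM emitted (may be empty)
    # persistent copy for replay/diffing: runs/raw_dsl/<sanitised_session_id>.dsl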

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a running belief-state that carries across session ingestions
within one conversation. LLM sees prior ASSERTed facts before
processing each new session. Two effects:

1. Deduplication. LLM no longer re-asserts facts it already wrote.
   Observed on LoCoMo conv-26: s2 emitted 4 new ASSERTs instead of
   rediscovering s1's 11 facts.
2. Cross-session RETRACT. When a new message contradicts a prior
   belief, LLM can emit `RETRACT "fact:id" REASON "..."` before the
   superseding ASSERT. (Not observed on conv-26 yet - accumulative
   narrative, no real contradictions. Mechanism in place for
   datasets that do contradict.)
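
What the cross-session flow could look like in emitted DSL (the
`RETRACT ... REASON` shape is from this message; the fact ids, ASSERT payload
syntax, and wording are illustrative, and the "session" lines are annotations,
not DSL):

    session 1:
      ASSERT "fact:job-01" "Caroline works weekends at the bakery."
    session 3, after seeing the belief state:
      RETRACT "fact:job-01" REASON "superseded: she changed jobs"
      ASSERT "fact:job-02" "Caroline now works at the library."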

Implementation:
- `_scrape_belief_updates(lines, facts)` walks successfully-executed
  statements, matches `ASSERT "id" ...` / `RETRACT "id" ...` regex,
  updates an in-memory dict keyed by fact_id. The regex is deliberately
  forgiving; any odd statement just misses the scrape, which is harmless.
  (Both helpers are sketched below.)
- `_render_known_facts_block(facts)` formats live (non-retracted) facts
  as a compact text block injected between skill and session instructions.
- `ingest()` wires these: after executing a session's statements, scrape
  belief updates; next session's prompt includes the current belief
  state.
- Config knobs:
    skill_carry_facts      True (default) / False to disable
    skill_max_known_facts  120 (soft cap; evicts retracted first, then
                           oldest live)
- Stats unchanged; memory sits alongside on `self._known_facts`.

Reset clears the facts dict so each new conversation starts fresh.
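
A minimal Python sketch of both helpers, assuming statements arrive as plain
strings; the real regex and fact-record shape may differ:

    import re

    _BELIEF_RE = re.compile(r'^(ASSERT|RETRACT)\s+"([^"]+)"')

    def _scrape_belief_updates(lines, facts):
        """Update the in-memory belief dict from successfully executed lines."""
        for line in lines:
            m = _BELIEF_RE.match(line.strip())
            if not m:
                continue  # deliberately forgiving: odd statements just miss the scrape
            verb, fact_id = m.groups()
            if verb == "ASSERT":
                facts[fact_id] = {"stmt": line, "retracted": False}
            else:  # RETRACT: mark dead but keep the entry
                facts.setdefault(fact_id, {"stmt": line})["retracted"] = True

    def _render_known_facts_block(facts):
        """Compact text block of live (non-retracted) facts for the next prompt."""
        live = [f["stmt"] for f in facts.values() if not f.get("retracted")]
        return "Known facts so far:\n" + "\n".join(f"- {s}" for s in live)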

4-session smoke verified:
    s1: 11 asserts -> 11 alive
    s2:  4 asserts (saw s1's facts) -> 15 alive
    s3: 10 asserts -> 25 alive
    s4:  8 asserts -> 32 alive

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@KailasMahavarkar merged commit 406cab3 into main on Apr 19, 2026
4 checks passed
@KailasMahavarkar deleted the feat/locomo-skill-adapter branch on Apr 19, 2026 at 21:06
KailasMahavarkar added a commit that referenced this pull request Apr 20, 2026
v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)

Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)

Breaking changes: none. All additions are additive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KailasMahavarkar added a commit that referenced this pull request Apr 20, 2026
* chore(release): bump v0.3.0 -> v0.4.0

v0.4 ships the retrieval observability triangle:
- REMEMBER signal telemetry + rich meta["signals"] (#150)
- SYS EXPLAIN REMEMBER dry-run (#151)
- ANSWER verb with pluggable reader LLM (#152)

Plus:
- Skills split: graphstore-dsl (runtime) + graphstore-builder (Python) (#148)
- Skill-guided LLM ingest adapter + LoCoMo wiring fix (#149)
- Docusaurus docs site @ graphstore-docs.orkait.com (#142-147)

Breaking changes: none. All additions are additive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump pyproject.toml version to 0.4.0

Missed in 07c9986. Pairs with src/graphstore/__init__.py bump.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>