feat(bonsai): NL->DSL ingestor with correctness guards + TQ1_0 model pipeline #168
Merged — KailasMahavarkar merged 2 commits into main, Apr 20, 2026
Behaviour audit of the CPU Bonsai pipeline surfaced seven correctness +
observability issues that made the raw llama-cpp-python path unsafe to
ship even at the 4s/msg warm latency we had already reached. Fix them
at the root, then swap to the smaller + higher-quality TQ1_0 model.
Fixes (each mapped to an audit finding):
C1 concurrent calls crash llama.cpp - threading.Lock around every
`create_chat_completion` invocation; single instance per ingestor.
C2 long sessions overflow n_ctx and silently evict the skill prefix -
pre-call budget check against (n_ctx - headroom); auto-reset before
the evict would happen, or raise IngestOverflow when even a reset
cannot fit the request.
H1 skill edits silently reuse stale KV - fingerprint the skill bytes
and pin `# skill-sha256=<fp>` into the system prompt. On edit the
prefix changes, so llama.cpp's prefix-match cache naturally
invalidates with no explicit reset call.
H2 empty / <think>-only outputs silently succeed - now raises
IngestEmpty with the raw output attached for diagnosis.
H3 duplicate UPSERT of same entity id crashes BatchRollback - dedupe
UPSERTs pre-parse, record drops in IngestResult.rejected.
H5 zero observability - one structured log line per ingest with
input/raw/statement/exec/rejected/entity/belief counts + duration
+ skill fingerprint + dry_run flag.
H7 no preview - `dry_run=True` returns the DSL without touching the
GraphStore.
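The guards above (C1-C3, H1-H3) can be sketched as a few small helpers. This is a minimal illustration, not the repository's actual code: names like `skill_fingerprint`, `check_budget`, and `dedupe_upserts` are assumed for the sketch, and the real `BonsaiIngestor` wires them into the llama-cpp-python call path.

```python
# Hypothetical sketch of the pre/post-call guards described above.
import hashlib
import re
import threading

_MODEL_LOCK = threading.Lock()  # C1: serialize every create_chat_completion call


class IngestOverflow(RuntimeError):
    """C2: request cannot fit n_ctx even after a cache reset."""


class IngestEmpty(RuntimeError):
    """H2: model produced no usable DSL; raw output attached in the message."""


def skill_fingerprint(skill_bytes: bytes) -> str:
    """H1: hash the skill bytes; pinning `# skill-sha256=<fp>` into the
    system prompt means an edit changes the prefix and busts the KV cache."""
    return hashlib.sha256(skill_bytes).hexdigest()[:12]


def check_budget(prompt_tokens: int, n_ctx: int, headroom: int = 256) -> None:
    """C2: fail fast instead of silently evicting the skill prefix."""
    if prompt_tokens > n_ctx - headroom:
        raise IngestOverflow(f"{prompt_tokens} tokens > {n_ctx - headroom} budget")


def strip_think(raw: str) -> str:
    """H2: drop <think>...</think> blocks; raise if nothing usable remains."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    if not cleaned:
        raise IngestEmpty(f"empty output; raw={raw!r}")
    return cleaned


def dedupe_upserts(statements: list[str]) -> tuple[list[str], list[str]]:
    """H3: keep the first UPSERT per entity id; record drops for
    IngestResult.rejected instead of crashing BatchRollback."""
    seen: set[str] = set()
    kept, rejected = [], []
    for stmt in statements:
        m = re.match(r"UPSERT\s+(\S+)", stmt)
        key = m.group(1) if m else None
        if key is not None and key in seen:
            rejected.append(stmt)
            continue
        if key is not None:
            seen.add(key)
        kept.append(stmt)
    return kept, rejected
```

The lock and budget check run before the model call; `strip_think` and `dedupe_upserts` run on the output before any statement touches the GraphStore.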
Also ships:
- tools/skills/graphstore-bonsai-dsl/SKILL.md
500-token ingest-only skill designed for small local LLMs. Replaces
the 5.5K-token graphstore-dsl skill when the target is Bonsai-sized.
- tools/skills/graphstore-bonsai-dsl/grammar.gbnf
GBNF constraint for llama.cpp grammar-constrained decoding. Not used
by BonsaiIngestor today (added latency without quality gain in our
tests); kept alongside the skill so it tracks any schema changes.
- benchmarks/kaggle/pack_ternary_bonsai/
Kaggle kernel that converts prism-ml/Ternary-Bonsai-4B-unpacked
(FP16 safetensors) to TQ1_0 GGUF and uploads the result to
superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF. Re-runnable for future
model / quant updates.
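The Kaggle kernel's conversion is the usual two-step llama.cpp flow: HF safetensors to FP16 GGUF, then quantize to TQ1_0. A rough sketch of the commands it would issue (the exact kernel code is not shown here, so the paths and the `build_steps` helper are assumptions; the tool names and the `TQ1_0` quant type are real llama.cpp ones):

```python
# Illustrative command construction only; nothing is executed here.
SRC = "prism-ml/Ternary-Bonsai-4B-unpacked"    # FP16 safetensors source
DST = "superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF"  # upload target


def build_steps(workdir: str = "/kaggle/working") -> list[list[str]]:
    f16 = f"{workdir}/bonsai-f16.gguf"
    tq1 = f"{workdir}/bonsai-tq1_0.gguf"
    return [
        # 1. HF safetensors -> FP16 GGUF
        ["python", "llama.cpp/convert_hf_to_gguf.py", SRC, "--outfile", f16],
        # 2. FP16 GGUF -> TQ1_0 (ternary) GGUF
        ["llama.cpp/build/bin/llama-quantize", f16, tq1, "TQ1_0"],
    ]
```

Re-running the kernel with a new `SRC` revision regenerates and re-uploads the GGUF, which is what makes future model/quant updates cheap.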
Tests:
15 unit tests cover strip_think, line-split, UPSERT dedupe, fingerprint
stability + edit detection, frontmatter stripping, empty input
rejection, missing-file errors, and IngestResult defaults. Live model
path (needs the 1.09 GB TQ1_0 GGUF on disk) is not a unit test;
smoke-verified end-to-end separately:
cold dry-run (entity ingest): 10.7s 5 stmts, 2 entities caught
warm real (belief claim): 2.5s 2 stmts, ASSERT emitted
skill fingerprint stable: 32a4fa68e5ab across calls
TQ1_0 vs earlier Q2_K: ~40% faster warm (2.5s vs 3.5-4.6s) + handles
belief claims correctly (Q2_K 1.7B failed the claim test; TQ1_0 4B
passes cleanly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness follow-up to the initial BonsaiIngestor commit. The deeper
end-to-end audit caught two data-shape bugs the guards did not cover:
1. Belief fact_id drift across messages.
T2 emitted ASSERT "fact:favorite_drink". T3 ("Actually I prefer tea
now.") emitted ASSERT "fact:preference" - a new fact_id for the same
underlying concept. Graph ended with two live beliefs for one preference.
Fix: port the _FactState tracking pattern from the existing
graphstore_skill.py adapter into BonsaiIngestor. After every successful
ingest, `_scrape_belief_updates` walks executed ASSERT / RETRACT lines
and updates a running fact_id -> FactState dict. Next ingest renders
the live (non-retracted) entries into a `### KNOWN FACTS (reuse these
fact_ids...)` block prepended to the user message. Model sees prior
ids and reuses them.
The block goes in the USER message, not the system prompt, so the
skill-prefix KV cache stays byte-identical across calls.
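The scrape/render loop can be sketched in a few lines. Names here (`FactState`, `scrape_belief_updates`, `render_known_facts`) mirror the description above but are simplified assumptions, not the adapter's actual code:

```python
# Hypothetical sketch of the fact_id continuity fix.
from dataclasses import dataclass


@dataclass
class FactState:
    value: str
    retracted: bool = False


def scrape_belief_updates(executed: list[str], facts: dict[str, FactState]) -> None:
    """Walk executed ASSERT/RETRACT lines, updating the running state.
    A re-ASSERT of a retracted fact_id un-retracts it."""
    for line in executed:
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "ASSERT":
            fact_id, _, value = parts[1].partition("=")
            facts[fact_id] = FactState(value=value)
        elif parts[0] == "RETRACT" and parts[1] in facts:
            facts[parts[1]].retracted = True


def render_known_facts(facts: dict[str, FactState], max_facts: int = 10) -> str:
    """Render live (non-retracted) facts into the block prepended to the
    next USER message; kept out of the system prompt so the skill-prefix
    KV cache stays byte-identical across calls."""
    live = [(fid, st) for fid, st in facts.items() if not st.retracted]
    live = live[-max_facts:]  # trim to most recent
    lines = [f"- {fid} = {st.value}" for fid, st in live]
    return "### KNOWN FACTS (reuse these fact_ids...)\n" + "\n".join(lines)
```

On the next ingest the model sees the prior fact_ids in the KNOWN FACTS block and emits `RETRACT + ASSERT` against the same id rather than minting a new one.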
2. Edge emission flaky.
With only per-example guidance in the skill, the 4B Q2_K model
occasionally skipped CREATE EDGE statements for ingested entities.
Entities created, edges missing, graph half-built.
Fix: added an explicit numbered rules block to the skill telling the
model to always emit UPSERT + matching CREATE EDGE together.
Post-fix end-to-end (4B TQ1_0, CPU):
T1 Priya joined OpenAI -> 5 stmts, 2 edges emitted
T2 favorite drink coffee -> ASSERT fact:favorite_drink
T3 actually prefer tea -> RETRACT + ASSERT REUSING fact:favorite_drink
final graph: 2 ents, 3 msgs, 1 belief, 2 edges
API additions on BonsaiIngestor:
- .facts read-only snapshot of the running fact state
- .reset_facts() clear state when starting a new user / conversation
Unit tests +10 (25 total now):
- scrape ASSERT creates FactState with kind/value/confidence/source
- scrape RETRACT marks retracted + records reason
- scrape ASSERT -> RETRACT -> ASSERT round trip un-retracts
- scrape ignores non-belief lines
- render hides retracted facts
- render formats all fields
- render trims to max_facts (most recent kept)
- facts property returns a copy (mutation does not leak)
- reset_facts clears state
Full suite: 1827 passed, 101 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
Behaviour audit of the CPU Bonsai pipeline surfaced 2 CRITICAL + 5 HIGH correctness/observability issues plus 2 data-shape bugs caught during the deeper end-to-end re-audit. Fixed all at the root. TQ1_0 model swap ships alongside.
Audit findings → fixes

- C1 concurrent calls crash llama.cpp → `threading.Lock` around every `create_chat_completion`
- C2 context overflow evicts skill prefix → pre-call budget check; `IngestOverflow` when impossible
- H1 skill edits reuse stale KV → pin `# skill-sha256=<fp>` into system prompt
- H2 `<think>`-only output silent no-op → `IngestEmpty` with raw output
- H3 duplicate UPSERT crash → dedupe; drops recorded in `IngestResult.rejected`
- H7 no preview → `dry_run=True` returns DSL without touching the store
- fact_id drift → `_FactState` tracking; scrape ASSERT/RETRACT; inject `### KNOWN FACTS` block in next user msg

What ships
- `src/graphstore/bonsai_ingestor.py`: `BonsaiIngestor` class + `FactState` + `IngestResult` + `IngestEmpty` + `IngestOverflow`, plus public `.facts` / `.reset_facts()` API
- `tests/test_bonsai_ingestor.py`: 25 unit tests
- `tools/skills/graphstore-bonsai-dsl/SKILL.md`: 500-token skill with fact_id reuse rule + rules recap + KNOWN FACTS example
- `tools/skills/graphstore-bonsai-dsl/grammar.gbnf`: parked for future GBNF use
- `benchmarks/kaggle/pack_ternary_bonsai/`: Kaggle kernel producing TQ1_0 GGUF from FP16 source

Model
`models/` is gitignored. Download once:

Measurements (4B TQ1_0, CPU-only)

- `ASSERT fact:favorite_drink=coffee`, then `RETRACT + ASSERT` reusing the same `fact:favorite_drink` (was the main audit bug)
- Warm latency: ~2.5-4s/msg on 8-core AMD 9700X, memory-bandwidth bound.
Test plan

- `uv run pytest tests/ -q`

🤖 Generated with Claude Code