feat(bonsai): NL->DSL ingestor with correctness guards + TQ1_0 model pipeline #168
Merged — KailasMahavarkar merged 2 commits into main, Apr 20, 2026
Behaviour audit of the CPU Bonsai pipeline surfaced seven correctness +
observability issues that made the raw llama-cpp-python path unsafe to
ship even at the 4s/msg warm latency we had already reached. Fix them
at the root, then swap to the smaller + higher-quality TQ1_0 model.
Fixes (each mapped to an audit finding):
C1 concurrent calls crash llama.cpp - threading.Lock around every
`create_chat_completion` invocation; single instance per ingestor.
C2 long sessions overflow n_ctx and silently evict the skill prefix -
pre-call budget check against (n_ctx - headroom); auto-reset before
the evict would happen, or raise IngestOverflow when even a reset
cannot fit the request.
H1 skill edits silently reuse stale KV - fingerprint the skill bytes
and pin `# skill-sha256=<fp>` into the system prompt. On edit the
prefix changes, so llama.cpp's prefix-match cache naturally
invalidates with no explicit reset call.
H2 empty / <think>-only outputs silently succeed - now raises
IngestEmpty with the raw output attached for diagnosis.
H3 duplicate UPSERT of same entity id crashes BatchRollback - dedupe
UPSERTs pre-parse, record drops in IngestResult.rejected.
H5 zero observability - one structured log line per ingest with
input/raw/statement/exec/rejected/entity/belief counts + duration
+ skill fingerprint + dry_run flag.
H7 no preview - `dry_run=True` returns the DSL without touching the
GraphStore.
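The guards above (C1-C3, H1-H3) can be sketched as a few small helpers. This is a minimal illustration, not the repository's actual code: names like `skill_fingerprint`, `check_budget`, and `dedupe_upserts` are assumed for the sketch, and the real `BonsaiIngestor` wires them into the llama-cpp-python call path.

```python
# Hypothetical sketch of the pre/post-call guards described above.
import hashlib
import re
import threading

_MODEL_LOCK = threading.Lock()  # C1: serialize every create_chat_completion call


class IngestOverflow(RuntimeError):
    """C2: request cannot fit n_ctx even after a cache reset."""


class IngestEmpty(RuntimeError):
    """H2: model produced no usable DSL; raw output attached in the message."""


def skill_fingerprint(skill_bytes: bytes) -> str:
    """H1: hash the skill bytes; pinning `# skill-sha256=<fp>` into the
    system prompt means an edit changes the prefix and busts the KV cache."""
    return hashlib.sha256(skill_bytes).hexdigest()[:12]


def check_budget(prompt_tokens: int, n_ctx: int, headroom: int = 256) -> None:
    """C2: fail fast instead of silently evicting the skill prefix."""
    if prompt_tokens > n_ctx - headroom:
        raise IngestOverflow(f"{prompt_tokens} tokens > {n_ctx - headroom} budget")


def strip_think(raw: str) -> str:
    """H2: drop <think>...</think> blocks; raise if nothing usable remains."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    if not cleaned:
        raise IngestEmpty(f"empty output; raw={raw!r}")
    return cleaned


def dedupe_upserts(statements: list[str]) -> tuple[list[str], list[str]]:
    """H3: keep the first UPSERT per entity id; record drops for
    IngestResult.rejected instead of crashing BatchRollback."""
    seen: set[str] = set()
    kept, rejected = [], []
    for stmt in statements:
        m = re.match(r"UPSERT\s+(\S+)", stmt)
        key = m.group(1) if m else None
        if key is not None and key in seen:
            rejected.append(stmt)
            continue
        if key is not None:
            seen.add(key)
        kept.append(stmt)
    return kept, rejected
```

The lock and budget check run before the model call; `strip_think` and `dedupe_upserts` run on the output before any statement touches the GraphStore.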
Also ships:
- tools/skills/graphstore-bonsai-dsl/SKILL.md
500-token ingest-only skill designed for small local LLMs. Replaces
the 5.5K-token graphstore-dsl skill when the target is Bonsai-sized.
- tools/skills/graphstore-bonsai-dsl/grammar.gbnf
GBNF constraint for llama.cpp grammar-constrained decoding. Not used
by BonsaiIngestor today (added latency without quality gain in our
tests); kept alongside the skill so it tracks any schema changes.
- benchmarks/kaggle/pack_ternary_bonsai/
Kaggle kernel that converts prism-ml/Ternary-Bonsai-4B-unpacked
(FP16 safetensors) to TQ1_0 GGUF and uploads the result to
superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF. Re-runnable for future
model / quant updates.
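The Kaggle kernel's conversion is the usual two-step llama.cpp flow: HF safetensors to FP16 GGUF, then quantize to TQ1_0. A rough sketch of the commands it would issue (the exact kernel code is not shown here, so the paths and the `build_steps` helper are assumptions; the tool names and the `TQ1_0` quant type are real llama.cpp ones):

```python
# Illustrative command construction only; nothing is executed here.
SRC = "prism-ml/Ternary-Bonsai-4B-unpacked"    # FP16 safetensors source
DST = "superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF"  # upload target


def build_steps(workdir: str = "/kaggle/working") -> list[list[str]]:
    f16 = f"{workdir}/bonsai-f16.gguf"
    tq1 = f"{workdir}/bonsai-tq1_0.gguf"
    return [
        # 1. HF safetensors -> FP16 GGUF
        ["python", "llama.cpp/convert_hf_to_gguf.py", SRC, "--outfile", f16],
        # 2. FP16 GGUF -> TQ1_0 (ternary) GGUF
        ["llama.cpp/build/bin/llama-quantize", f16, tq1, "TQ1_0"],
    ]
```

Re-running the kernel with a new `SRC` revision regenerates and re-uploads the GGUF, which is what makes future model/quant updates cheap.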
Tests:
15 unit tests cover strip_think, line-split, UPSERT dedupe, fingerprint
stability + edit detection, frontmatter stripping, empty input
rejection, missing-file errors, and IngestResult defaults. Live model
path (needs the 1.09 GB TQ1_0 GGUF on disk) is not a unit test;
smoke-verified end-to-end separately:
cold dry-run (entity ingest): 10.7s 5 stmts, 2 entities caught
warm real (belief claim): 2.5s 2 stmts, ASSERT emitted
skill fingerprint stable: 32a4fa68e5ab across calls
TQ1_0 vs earlier Q2_K: ~40% faster warm (2.5s vs 3.5-4.6s) + handles
belief claims correctly (Q2_K 1.7B failed the claim test; TQ1_0 4B
passes cleanly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness follow-up to the initial BonsaiIngestor commit. The deeper
end-to-end audit caught two data-shape bugs the guards did not cover:
1. Belief fact_id drift across messages.
T2 emitted ASSERT "fact:favorite_drink". T3 ("Actually I prefer tea
now.") emitted ASSERT "fact:preference" - a new fact_id for the same
underlying concept. Graph ended with two live beliefs for one preference.
Fix: port the _FactState tracking pattern from the existing
graphstore_skill.py adapter into BonsaiIngestor. After every successful
ingest, `_scrape_belief_updates` walks executed ASSERT / RETRACT lines
and updates a running fact_id -> FactState dict. Next ingest renders
the live (non-retracted) entries into a `### KNOWN FACTS (reuse these
fact_ids...)` block prepended to the user message. Model sees prior
ids and reuses them.
The block goes in the USER message, not the system prompt, so the
skill-prefix KV cache stays byte-identical across calls.
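The scrape/render loop can be sketched in a few lines. Names here (`FactState`, `scrape_belief_updates`, `render_known_facts`) mirror the description above but are simplified assumptions, not the adapter's actual code:

```python
# Hypothetical sketch of the fact_id continuity fix.
from dataclasses import dataclass


@dataclass
class FactState:
    value: str
    retracted: bool = False


def scrape_belief_updates(executed: list[str], facts: dict[str, FactState]) -> None:
    """Walk executed ASSERT/RETRACT lines, updating the running state.
    A re-ASSERT of a retracted fact_id un-retracts it."""
    for line in executed:
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "ASSERT":
            fact_id, _, value = parts[1].partition("=")
            facts[fact_id] = FactState(value=value)
        elif parts[0] == "RETRACT" and parts[1] in facts:
            facts[parts[1]].retracted = True


def render_known_facts(facts: dict[str, FactState], max_facts: int = 10) -> str:
    """Render live (non-retracted) facts into the block prepended to the
    next USER message; kept out of the system prompt so the skill-prefix
    KV cache stays byte-identical across calls."""
    live = [(fid, st) for fid, st in facts.items() if not st.retracted]
    live = live[-max_facts:]  # trim to most recent
    lines = [f"- {fid} = {st.value}" for fid, st in live]
    return "### KNOWN FACTS (reuse these fact_ids...)\n" + "\n".join(lines)
```

On the next ingest the model sees the prior fact_ids in the KNOWN FACTS block and emits `RETRACT + ASSERT` against the same id rather than minting a new one.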
2. Edge emission flaky.
With only per-example guidance in the skill, the 4B Q2_K model
occasionally skipped CREATE EDGE statements for ingested entities.
Entities created, edges missing, graph half-built.
Fix: added an explicit numbered rules block to the skill telling the
model to always emit UPSERT + matching CREATE EDGE together.
Post-fix end-to-end (4B TQ1_0, CPU):
T1 Priya joined OpenAI -> 5 stmts, 2 edges emitted
T2 favorite drink coffee -> ASSERT fact:favorite_drink
T3 actually prefer tea -> RETRACT + ASSERT REUSING fact:favorite_drink
final graph: 2 ents, 3 msgs, 1 belief, 2 edges
API additions on BonsaiIngestor:
- .facts read-only snapshot of the running fact state
- .reset_facts() clear state when starting a new user / conversation
Unit tests +10 (25 total now):
- scrape ASSERT creates FactState with kind/value/confidence/source
- scrape RETRACT marks retracted + records reason
- scrape ASSERT -> RETRACT -> ASSERT round trip un-retracts
- scrape ignores non-belief lines
- render hides retracted facts
- render formats all fields
- render trims to max_facts (most recent kept)
- facts property returns a copy (mutation does not leak)
- reset_facts clears state
Full suite: 1827 passed, 101 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why
Behaviour audit of the CPU Bonsai pipeline surfaced 2 CRITICAL + 5 HIGH correctness/observability issues plus 2 data-shape bugs caught during the deeper end-to-end re-audit. Fixed all at the root. TQ1_0 model swap ships alongside.
Audit findings → fixes

- C1 concurrent calls crash llama.cpp → `threading.Lock` around every `create_chat_completion`
- C2 context overflow evicts skill prefix → pre-call budget check; `IngestOverflow` when impossible
- H1 skill edits reuse stale KV → pin `# skill-sha256=<fp>` into system prompt
- H2 `<think>`-only output silent no-op → `IngestEmpty` with raw output
- H3 duplicate UPSERT crash → dedupe; drops recorded in `IngestResult.rejected`
- H7 no preview → `dry_run=True` returns DSL without touching the store
- fact_id drift → `_FactState` tracking; scrape ASSERT/RETRACT; inject `### KNOWN FACTS` block in next user msg

What ships
- `src/graphstore/bonsai_ingestor.py`: `BonsaiIngestor` class + `FactState` + `IngestResult` + `IngestEmpty` + `IngestOverflow`, plus public `.facts` / `.reset_facts()` API
- `tests/test_bonsai_ingestor.py`: 25 unit tests
- `tools/skills/graphstore-bonsai-dsl/SKILL.md`: 500-token skill with fact_id reuse rule + rules recap + KNOWN FACTS example
- `tools/skills/graphstore-bonsai-dsl/grammar.gbnf`: parked for future GBNF use
- `benchmarks/kaggle/pack_ternary_bonsai/`: Kaggle kernel producing TQ1_0 GGUF from FP16 source

Model
`models/` is gitignored. Download once:

Measurements (4B TQ1_0, CPU-only)

- `ASSERT fact:favorite_drink=coffee`, then `RETRACT + ASSERT` reusing the same `fact:favorite_drink` (was the main audit bug)
- Warm latency: ~2.5-4s/msg on 8-core AMD 9700X, memory-bandwidth bound.
Test plan

- `uv run pytest tests/ -q`

🤖 Generated with Claude Code