feat(bonsai): NL->DSL ingestor with correctness guards + TQ1_0 model pipeline #168

Merged

KailasMahavarkar merged 2 commits into main from feat/bonsai-nl-ingestor on Apr 20, 2026

feat(bonsai): NL->DSL ingestor with correctness guards + TQ1_0 model pipeline#168
KailasMahavarkar merged 2 commits intomainfrom
feat/bonsai-nl-ingestor

Conversation

KailasMahavarkar (Contributor) commented Apr 20, 2026

Why

Behaviour audit of the CPU Bonsai pipeline surfaced 2 CRITICAL + 5 HIGH correctness/observability issues plus 2 data-shape bugs caught during the deeper end-to-end re-audit. Fixed all at the root. TQ1_0 model swap ships alongside.

Audit findings → fixes

| sev | finding | fix |
| --- | --- | --- |
| C1 | concurrent calls crash llama.cpp | `threading.Lock` around every `create_chat_completion` |
| C2 | long sessions silently evict skill prefix | pre-call budget check; auto-reset; `IngestOverflow` when impossible |
| H1 | skill edits silently reuse stale KV | fingerprint skill, pin `# skill-sha256=<fp>` into system prompt |
| H2 | empty / `<think>`-only output is a silent no-op | raise `IngestEmpty` with raw output |
| H3 | duplicate UPSERT crashes BatchRollback | dedupe pre-parse, record in `IngestResult.rejected` |
| H5 | zero observability | structured log line with input/raw/stmt/exec/rejected/entity/belief counts + duration + skill fp |
| H7 | no preview | `dry_run=True` returns DSL without touching the store |
| H6a | belief fact_id drift across messages | port `_FactState` tracking; scrape ASSERT/RETRACT; inject `### KNOWN FACTS` block into next user msg |
| H6b | flaky edge emission on small models | explicit numbered rules recap in skill: "always emit UPSERT + matching CREATE EDGE together" |
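The C1 and C2 guards compose naturally into one call path. A minimal sketch, assuming hypothetical names (`GuardedIngestor`, `prompt_tokens`, a 256-token headroom) rather than the actual `BonsaiIngestor` internals:

```python
import threading

class IngestOverflow(RuntimeError):
    """Raised when even a freshly reset session cannot fit the request (C2)."""

class GuardedIngestor:
    def __init__(self, llm, n_ctx: int, headroom: int = 256):
        self._llm = llm                    # e.g. a llama_cpp.Llama instance
        self._lock = threading.Lock()      # C1: serialize all completions
        self._budget = n_ctx - headroom    # C2: usable token budget
        self._used = 0                     # tokens consumed this session

    def ingest(self, messages, prompt_tokens: int):
        if prompt_tokens > self._budget:
            # Impossible even after a reset: fail loudly, never evict silently.
            raise IngestOverflow(
                f"prompt needs {prompt_tokens} tokens, budget is {self._budget}")
        if self._used + prompt_tokens > self._budget:
            self._used = 0                 # auto-reset before eviction would occur
        self._used += prompt_tokens
        with self._lock:                   # C1: one llama.cpp call at a time
            return self._llm.create_chat_completion(messages=messages)
```

The key property is ordering: the budget check runs before the lock is taken, so an overflowing request never blocks other callers.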

What ships

  • src/graphstore/bonsai_ingestor.py — BonsaiIngestor class + FactState + IngestResult + IngestEmpty + IngestOverflow, plus public .facts / .reset_facts() API
  • tests/test_bonsai_ingestor.py — 25 unit tests
  • tools/skills/graphstore-bonsai-dsl/SKILL.md — 500-token skill with fact_id reuse rule + rules recap + KNOWN FACTS example
  • tools/skills/graphstore-bonsai-dsl/grammar.gbnf — parked for future GBNF use
  • benchmarks/kaggle/pack_ternary_bonsai/ — Kaggle kernel producing TQ1_0 GGUF from FP16 source

Model

models/ is gitignored. Download once:

mkdir -p models/Ternary-Bonsai-4B-TQ1_0
curl -L -o models/Ternary-Bonsai-4B-TQ1_0/Ternary-Bonsai-4B-TQ1_0.gguf \
  https://huggingface.co/superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF/resolve/main/Ternary-Bonsai-4B-TQ1_0.gguf

Measurements (4B TQ1_0, CPU-only)

| test | result |
| --- | --- |
| T1 entity ingest | 5 stmts including 2 CREATE EDGE; graph: 2 ents, 2 edges |
| T2 belief claim | ASSERT `fact:favorite_drink=coffee` |
| T3 belief correction | RETRACT + ASSERT reusing same `fact:favorite_drink` (was the main audit bug) |
| final graph | 2 ents, 3 msgs, 1 belief, 2 edges |

Warm latency: ~2.5-4s/msg on an 8-core AMD 9700X, memory-bandwidth bound.

Test plan

  • 25 unit tests pass (post-processing + fingerprint + fact state)
  • 1827 total tests pass, 101 skipped (uv run pytest tests/ -q)
  • Live T1-T3 smoke on TQ1_0: all correct including fact_id reuse

🤖 Generated with Claude Code

KailasMahavarkar and others added 2 commits April 20, 2026 17:37
Behaviour audit of the CPU Bonsai pipeline surfaced seven correctness +
observability issues that made the raw llama-cpp-python path unsafe to
ship even at the 4s/msg warm latency we had already reached. Fix them
at the root, then swap to the smaller + higher-quality TQ1_0 model.

Fixes (each mapped to an audit finding):

  C1 concurrent calls crash llama.cpp - threading.Lock around every
     `create_chat_completion` invocation; single instance per ingestor.
  C2 long sessions overflow n_ctx and silently evict the skill prefix -
     pre-call budget check against (n_ctx - headroom); auto-reset before
     the evict would happen, or raise IngestOverflow when even a reset
     cannot fit the request.
  H1 skill edits silently reuse stale KV - fingerprint the skill bytes
     and pin `# skill-sha256=<fp>` into the system prompt. On edit the
     prefix changes, so llama.cpp's prefix-match cache naturally
     invalidates with no explicit reset call.
  H2 empty / <think>-only outputs silently succeed - now raises
     IngestEmpty with the raw output attached for diagnosis.
  H3 duplicate UPSERT of same entity id crashes BatchRollback - dedupe
     UPSERTs pre-parse, record drops in IngestResult.rejected.
  H5 zero observability - one structured log line per ingest with
     input/raw/statement/exec/rejected/entity/belief counts + duration
     + skill fingerprint + dry_run flag.
  H7 no preview - `dry_run=True` returns the DSL without touching the
     GraphStore.
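The H1 fingerprint mechanism described above can be sketched in a few lines. The 12-hex-char truncation is an assumption inferred from the smoke-test fingerprint (32a4fa68e5ab); function names are illustrative, not the real API:

```python
import hashlib

def skill_fingerprint(skill_bytes: bytes) -> str:
    # Truncation length matches the fingerprint seen in the smoke logs;
    # the exact cut-off is an assumption.
    return hashlib.sha256(skill_bytes).hexdigest()[:12]

def build_system_prompt(skill_text: str) -> str:
    # Pinning the fingerprint into the prompt makes the prefix bytes change
    # whenever the skill file is edited, so llama.cpp's prefix-match KV
    # cache invalidates naturally -- no explicit reset call is needed.
    fp = skill_fingerprint(skill_text.encode("utf-8"))
    return f"# skill-sha256={fp}\n{skill_text}"
```

Because the fingerprint is derived from the skill bytes themselves, the cache-invalidation behaviour needs no bookkeeping beyond rebuilding the system prompt.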

Also ships:
  - tools/skills/graphstore-bonsai-dsl/SKILL.md
    500-token ingest-only skill designed for small local LLMs. Replaces
    the 5.5K-token graphstore-dsl skill when the target is Bonsai-sized.
  - tools/skills/graphstore-bonsai-dsl/grammar.gbnf
    GBNF constraint for llama.cpp grammar-constrained decoding. Not used
    by BonsaiIngestor today (added latency without quality gain in our
    tests); kept alongside the skill so it tracks any schema changes.
  - benchmarks/kaggle/pack_ternary_bonsai/
    Kaggle kernel that converts prism-ml/Ternary-Bonsai-4B-unpacked
    (FP16 safetensors) to TQ1_0 GGUF and uploads the result to
    superkaiii/Ternary-Bonsai-4B-TQ1_0-GGUF. Re-runnable for future
    model / quant updates.

Tests:
  15 unit tests cover strip_think, line-split, UPSERT dedupe, fingerprint
  stability + edit detection, frontmatter stripping, empty input
  rejection, missing-file errors, and IngestResult defaults. Live model
  path (needs the 1.09 GB TQ1_0 GGUF on disk) is not a unit test;
  smoke-verified end-to-end separately:

    cold dry-run (entity ingest):   10.7s  5 stmts,  2 entities caught
    warm real   (belief claim):      2.5s  2 stmts,  ASSERT emitted
    skill fingerprint stable:        32a4fa68e5ab across calls

TQ1_0 vs earlier Q2_K: ~40% faster warm (2.5s vs 3.5-4.6s) + handles
belief claims correctly (Q2_K 1.7B failed the claim test; TQ1_0 4B
passes cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness follow-up to the initial BonsaiIngestor commit. The deeper
end-to-end audit caught two data-shape bugs the guards did not cover:

1. Belief fact_id drift across messages.
   T2 emitted ASSERT "fact:favorite_drink". T3 ("Actually I prefer tea
   now.") emitted ASSERT "fact:preference" - a new fact_id for the same
   underlying concept. Graph ended with two live beliefs for one preference.

   Fix: port the _FactState tracking pattern from the existing
   graphstore_skill.py adapter into BonsaiIngestor. After every successful
   ingest, `_scrape_belief_updates` walks executed ASSERT / RETRACT lines
   and updates a running fact_id -> FactState dict. Next ingest renders
   the live (non-retracted) entries into a `### KNOWN FACTS (reuse these
   fact_ids...)` block prepended to the user message. Model sees prior
   ids and reuses them.

   The block goes in the USER message, not the system prompt, so the
   skill-prefix KV cache stays byte-identical across calls.

2. Edge emission flaky.
   With only per-example guidance in the skill, the 4B Q2_K model
   occasionally skipped CREATE EDGE statements for ingested entities.
   Entities created, edges missing, graph half-built.

   Fix: added an explicit numbered rules block to the skill telling the
   model to always emit UPSERT + matching CREATE EDGE together.
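The fact-state tracking in fix 1 can be sketched as a scrape/render pair. The DSL surface syntax and `FactState` fields here are illustrative assumptions (the real grammar lives in SKILL.md), not the actual `_FactState` implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class FactState:
    # Fields are assumptions drawn from the unit-test descriptions.
    value: str
    retracted: bool = False

# Illustrative statement shapes -- not the real Bonsai DSL grammar.
_ASSERT = re.compile(r'^ASSERT\s+(fact:\S+)\s*=\s*(.+)$')
_RETRACT = re.compile(r'^RETRACT\s+(fact:\S+)')

def scrape_belief_updates(executed_lines, facts):
    """Walk executed ASSERT/RETRACT lines, updating fact_id -> FactState."""
    for line in executed_lines:
        if m := _ASSERT.match(line):
            facts[m.group(1)] = FactState(value=m.group(2))  # un-retracts too
        elif m := _RETRACT.match(line):
            if m.group(1) in facts:
                facts[m.group(1)].retracted = True
    return facts

def render_known_facts(facts, max_facts=8):
    """Render live facts into the block prepended to the next USER message
    (not the system prompt, so the skill-prefix KV cache stays byte-identical)."""
    live = [(fid, st) for fid, st in facts.items() if not st.retracted]
    if not live:
        return ""
    rows = "\n".join(f"- {fid} = {st.value}" for fid, st in live[-max_facts:])
    return "### KNOWN FACTS (reuse these fact_ids)\n" + rows
```

Re-asserting a retracted fact_id simply overwrites the entry with a fresh non-retracted state, which is what makes the T3 correction round trip work.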

Post-fix end-to-end (4B TQ1_0, CPU):
  T1 Priya joined OpenAI     -> 5 stmts, 2 edges emitted
  T2 favorite drink coffee   -> ASSERT fact:favorite_drink
  T3 actually prefer tea     -> RETRACT + ASSERT REUSING fact:favorite_drink
  final graph: 2 ents, 3 msgs, 1 belief, 2 edges

API additions on BonsaiIngestor:
  - .facts         read-only snapshot of the running fact state
  - .reset_facts() clear state when starting a new user / conversation
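The contract behind those two additions can be shown with a stand-in class; the internal layout is an assumption, not the real `BonsaiIngestor`:

```python
class FactStoreSketch:
    """Stand-in showing the contract: .facts is a read-only snapshot,
    .reset_facts() clears state for a new user / conversation."""

    def __init__(self):
        self._facts = {}

    @property
    def facts(self):
        return dict(self._facts)   # copy, so caller mutation cannot leak back

    def reset_facts(self):
        self._facts.clear()
```

Returning a copy from `.facts` is what the "mutation does not leak" unit test below verifies.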

Unit tests +10 (25 total now):
  - scrape ASSERT creates FactState with kind/value/confidence/source
  - scrape RETRACT marks retracted + records reason
  - scrape ASSERT -> RETRACT -> ASSERT round trip un-retracts
  - scrape ignores non-belief lines
  - render hides retracted facts
  - render formats all fields
  - render trims to max_facts (most recent kept)
  - facts property returns a copy (mutation does not leak)
  - reset_facts clears state

Full suite: 1827 passed, 101 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@KailasMahavarkar KailasMahavarkar merged commit 665c477 into main Apr 20, 2026
4 checks passed
@KailasMahavarkar KailasMahavarkar deleted the feat/bonsai-nl-ingestor branch April 20, 2026 12:19