feat(data): FTS-powered context enrichment for LLM chat#933
Closed
cpcloud wants to merge 7 commits intomicasa-dev:mainfrom
Closed
feat(data): FTS-powered context enrichment for LLM chat#933cpcloud wants to merge 7 commits intomicasa-dev:mainfrom
cpcloud wants to merge 7 commits intomicasa-dev:mainfrom
Conversation
212d280 to
cdcc204
Compare
Adds plans/707-fts-context-enrichment.md (spec) and plans/707-fts-context-enrichment-plan.md (implementation plan) for the first FTS feature: injecting entity-search context into the chat pipeline's SQL-generation, summary, and fallback prompt builders. Refs micasa-dev#707.
Wires FTS entity search into all three chat prompt builders
(BuildSQLPrompt, BuildSummaryPrompt, BuildSystemPrompt fallback). Each
chat turn runs SearchEntities against an entities_fts virtual table
indexing every text-bearing entity type, fetches a live one-line
summary for each hit via EntitySummary (which revalidates against the
source table), fences the results with a clearly-labeled BEGIN/END
block, and escapes the fence delimiter + code-fence tokens in
user-controlled entity text to prevent prompt-injection breakout.
Stage 1 gets the context for better SQL (entity IDs for disambiguation,
better WHERE filters, fewer fragile LIKEs). Stage 2 gets it as
disambiguation-only background ("use solely for disambiguation, not as
a source of additional facts") so summaries are still grounded in the
SQL results. Fallback gets it to improve the no-SQL answer path.
Error handling: any failure in SearchEntities or EntitySummary
short-circuits to empty context, matching pre-FTS behavior exactly --
enrichment degrades gracefully.
Regression test covers the query-error wrapping path (corrupted FTS
schema is forwarded as a wrapped "search entities:" error).
Refs micasa-dev#707.
Adds plans/707-fts-eval-and-hardening.md: follow-up plan for the FTS context-enrichment work. Three pieces: A. A `micasa eval fts` subcommand that runs a benchmark question set through the live chat pipeline against a fixture DB (or the user's own DB), grades each answer with a deterministic rubric plus an optional LLM judge, and reports FTS-on vs FTS-off deltas. B. SearchEntities becomes a single window-function query with per-entity-type quotas and a BM25 rank threshold. C. setupEntitiesFTS installs AFTER INSERT/UPDATE/DELETE triggers on every source table. UPDATE triggers on parents whose text is embedded in a child's FTS row cascade a refresh to those children. Covers acceptance criteria, privacy warnings for non-fixture runs, partial-failure taxonomy, and JudgeScore sentinel semantics (-1 = "not run"; 0 = genuine all-criteria-failed grade). Refs micasa-dev#707.
setupEntitiesFTS now installs AFTER INSERT / UPDATE / DELETE triggers on every source table that contributes rows to entities_fts (projects, vendors, appliances, maintenance_items, incidents, service_log_entries, quotes). Parent tables whose text is embedded in a child's entity_name (project.title and vendor.name in quote, maintenance_item.name in SLE) get companion _au_cascade triggers that rebuild the child's FTS row when the parent is updated. Cascade JOINs filter on parent.deleted_at IS NULL so a parent soft-delete degrades the child's entity_name (project title disappears from the quote; vendor name disappears; SLE name blanks out) instead of leaving stale text in the index. The populate path carries the same filter so initial rebuilds on app open match the trigger invariant. Trigger installation is idempotent (DROP IF EXISTS + CREATE), so schema drift across app versions heals on the next Store.Open. FK constraints (RESTRICT on quote parents, CASCADE on SLE parents) continue to govern hard-delete feasibility; parent _ad triggers are plain single-table cleanups, no cascade blocks needed. Tests cover: insert, rename, soft-delete, parent-rename cascade for all three relationships, parent-soft-delete cascade via raw DML (the app gates soft-delete with live children, so the cascade path is exercised by sync in production; raw DML matches that scenario in tests), FK cascade on maintenance_item hard-delete, and initial rebuild preserving the soft-delete filter for both SLE and quote joins. Refs micasa-dev#707.
cdcc204 to
9680f38
Compare
Replaces the flat LIMIT 20 in SearchEntities with a three-tier
window-function query and adds natural-language query tolerance.
Ranking:
- Tier 1 takes exactly one row per matching entity type (guarantees
cross-type representation).
- Tier 2 raises each type up to ftsEntityKPerType rows so single noisy
types can't dominate.
- Tier 3 fills any remaining room up to ftsEntityTotalCap from whatever's
left, globally ranked. Single-type searches use the full cap this way.
Package-level tuning constants (not user-configurable -- the eval harness
is the tuning channel):
ftsEntityKPerType = 5
ftsEntityRankCeiling = 0.0 // permissive; eval will tighten
ftsEntityTotalCap = 20
entity_id tiebreaks rank in every ORDER BY so results are stable when
BM25 produces identical ranks on similarly-shaped rows.
Query tolerance:
- prepareFTSEntityQuery lowercases, strips non-alphanum, drops short and
stopword tokens, and OR-joins the survivors as quoted prefix phrases.
- Returns early when no content words survive so a pure-stopword question
like "what is it?" doesn't hammer FTS with an empty MATCH.
Tests cover per-type quota preservation under a flood of first-class
matches, single-type searches using the full cap, every matching type
surfacing when 5+ types share a token, total cap enforcement, rank
threshold plumbing, stable ordering across runs, the query builder
directly, and the end-to-end regression that "what's the status of the
kitchen project?" now surfaces the Kitchen Remodel project.
Refs micasa-dev#707.
Wires the chat-quality eval described in the plan: - internal/ftseval/ package with typed Config, Question, ArmResult, RunResult; Run() drives each question through both FTS arms against a pre-built store, grades with a deterministic regex rubric plus an optional LLM judge, and returns the per-question results. - Fixture seed (SeedFixture) populating projects, vendors, appliances, maintenance items, incidents, one service log, and one quote that ties kitchen to Pacific Plumbing (with the "permit delays" vendor note the long-tail-note question relies on). - Default question set covering disambiguation, cross-entity joins, service-log lookup, aggregate (FTS-neutral), basement incidents, nonexistent entity, long-tail note, and brand filter. - Judge-score sentinel -1 when the judge didn't run (--skip-judge, no summary, parse failure, or judge error); 0-5 when it did. Judge parser tolerates real-world model output: markdown-decorated rubric lines, `:` vs `=` separators, mixed case, leading <think>/<thinking>/ <reasoning> blocks, and "Rationale" as an alias for "Reason". The judge_reason surfaces in Notes when the score is the sentinel. - Table report (default on TTYs, via lipgloss), markdown (default when piping or writing to a file), and JSON. JSON writes a redactedConfig that excludes APIKey so the key never leaks to stdout, --output, or CI artifacts. Judge-score aggregates exclude sentinel rows. - --strict exits 1 on per-question FTS-on rubric regression over questions completed on both arms (sql_error still counts as completed per production behavior; stage-1/stage-2 provider errors do not). - Empty ExpectedEntityIDs are skipped in entity-hit scoring so --db runs (which have a zero-valued SeededFixture) don't false-positive. CLI: `micasa eval fts` with --db, --provider, --model, --judge-model, --questions, --skip-judge, --no-ab, --format, --output, --strict. Default fixture is built in a tempdir that cleans up on exit; --db points at an existing store. Privacy warning printed on stderr when running against a non-fixture DB on a non-local provider. Nix: `nix run '.#fts-eval'` wraps the subcommand. Refactor: moves buildFTSContext and buildTableInfoFrom out of internal/app/chat.go into internal/llm as exported BuildFTSContextFromStore and BuildTableInfo so the eval harness reproduces exactly the prompt-building logic chat uses. Refs micasa-dev#707.
9680f38 to
daa025b
Compare
Collaborator
Author
|
Closed in favor of five separately reviewable PRs:
The chat pipeline wiring from this PR is held back pending a stronger eval signal. #963 ships the eval infra so that evaluation can happen on main. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
entities_ftstable indexing 7 entity types (projects, vendors, appliances, incidents, quotes, maintenance items, service logs), wired into the chat pipeline so the LLM gets entity context in both SQL and summary prompts.entity_idtiebreaker.entities_ftsstays current via SQLite triggers: own-row INSERT/UPDATE/DELETE per entity plus cascading refresh of children — quote rows refresh when their parent project or vendor changes; service_log rows refresh when their parent maintenance_item changes.BuildFTSContextescapes delimiters to protect against prompt injection.EntitySummaryreturns tri-state (found/stale/missing) so the caller can revalidate before using cached results.micasa eval ftssubcommand for chat-quality evaluation: seeded fixture, 8 default questions, FTS-on/off A/B arms, deterministic regex rubric plus optional LLM judge (tolerant parser for real-world model output — markdown decoration,:/=separators,<think>blocks), table/markdown/JSON reports (table default on TTYs),--strictexit code on per-question rubric regression, Nix app wrapper.AGENTS.mdrules against reinventing stdlib helpers and passing conceptual-zero values (nil/empty/0) to third-party functions without checking godoc.Closes #707