multivon-eval 0.11.0 — runtime prompt recorder
The answer to the 20.9% ceiling. The determinacy gate (0.10.1) measured scanner v3 against five real repos: 20.9% of call sites are statically resolvable; the rest build prompts dynamically and are, by construction, out of reach of any static scan. The runtime prompt recorder (designed in #9, promoted from the 0.10.0 deferred list by the gate result) is the honest path past that ceiling: during an eval run, an opt-in interceptor records the rendered prompt text per call site, fingerprinted with the same `fingerprint_text` the static scanner uses. A `**kwargs` unpack that the scanner can only report as UNKNOWN is, at call time, real kwargs with real text.
The honesty discipline survives the new power — three labeled trust tiers, never collapsed:
- static — the scan proves the prompt text;
- runtime — recordings prove only the renderings observed, not all renderings (variable renderings per site are a fingerprint SET, and every verdict speaks in "current recordings matched k of N previously observed renderings" — a site is never called fresh because one rendering matched);
- templates / external prompts — deferred, unverifiable.
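The k-of-N wording of the runtime tier can be illustrated with a minimal sketch. It assumes a SHA-256 hash as a stand-in for the fingerprint function, and `runtime_verdict` is a hypothetical helper; neither is taken from multivon-eval's actual implementation.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stand-in fingerprint (assumption): SHA-256 of the rendered prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def runtime_verdict(previous: set, current: set) -> str:
    """Describe a runtime site in k-of-N terms, never as simply 'fresh'."""
    matched = len(previous & current)
    return (
        f"current recordings matched {matched} of {len(previous)} "
        "previously observed renderings"
    )


# One call site rendered three different prompts in an earlier run.
previous = {fingerprint(t) for t in ("Summarize A", "Summarize B", "Summarize C")}
# The current run reproduced only one of those renderings.
current = {fingerprint(t) for t in ("Summarize A", "Summarize D")}

print(runtime_verdict(previous, current))
# current recordings matched 1 of 3 previously observed renderings
```

Treating a site as a set of fingerprints, rather than a single value, is what keeps one matching rendering from being read as proof that every rendering is unchanged.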
Added
- `multivon_eval.recorder`: opt-in runtime prompt recorder. `record_prompts()` context manager (non-pytest) or `pytest --record-prompts` (plus `--record-prompts-out`, `--record-text`). Method-level wrapping of exactly the three SDK surfaces the static scanner knows (anthropic `Messages.create`, openai `chat.completions.create`, `litellm.completion`/`acompletion`): save original, wrap, restore byte-identical on exit (inherited attributes restored by `delattr`, so `__dict__`s end exactly as found). Zero overhead when off: importing `multivon_eval` performs NO patching, pinned by a fresh-interpreter subprocess test. Recordings stay local in `prompt_recordings.jsonl`; fingerprints only by default, rendered text only behind explicit `--record-text`. Append-safe storage: duplicate (anchor, role, fingerprint) keys merge counts/case_uids on write, atomic rewrite.
- Case binding by observation: a contextvar carries the active `case_uid`; `EvalSuite` binds it per case from `_provenance.case_uid` (one None-check when recording is off) and the pytest plugin binds the test nodeid per test. Recordings carry the case_uids observed per (anchor, role, fingerprint), so the run knows which sites fired for which case.
- `multivon-eval staleness baseline --merge-recordings [FILE]`: merges recordings into `prompt_baseline.json` as `source:"runtime"` records with fingerprint SETS, stored under a separate `runtime_records` key. Merge-only: never rescans, NEVER touches static records; a static rescan never discards the runtime tier; re-merging the same recordings file is idempotent.
- OBSERVED report tier: runtime-sourced sites render distinctly in text/json/markdown: compared recordings-vs-recordings (runtime-only sites cannot be compared against a static scan, and the report says so), always in the k/N language. The determinacy headline gains a third clause: "K sites observed at runtime." The standing footer now states all three trust tiers verbatim, next to the blind-spots list.
- `multivon-eval staleness stamp --from-recordings [FILE]`: prints observed case→site bindings as proposals (case_uid → anchor + fingerprint with observation counts); writes only with explicit `--apply --cases F.jsonl` (targets land as `source:"runtime"`, `bound:"observed"`). Observation removes the fabrication objection that blocked auto-binding in the 0.10.0 adversarial review; the human confirmation stays. Runtime-bound targets are verified against recordings, never against the static scan (reported `unverifiable [runtime]` there, by rule), and never enter the static coverage denominator.
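The save/wrap/restore pattern and the contextvar-based case binding described above can be sketched roughly as follows. Every name here (`record_calls`, `active_case_uid`, `BaseClient`, `Client`, the `prompt` keyword) is hypothetical, and SHA-256 again stands in for the real fingerprint function; this illustrates the general technique, not the recorder's actual code.

```python
import contextvars
import hashlib
from contextlib import contextmanager

# Hypothetical names throughout; not multivon-eval's implementation.
active_case_uid = contextvars.ContextVar("active_case_uid", default=None)


class BaseClient:
    def create(self, **kwargs):
        return "response"


class Client(BaseClient):
    """Inherits create(), so 'create' is absent from Client.__dict__."""


@contextmanager
def record_calls(cls, name, sink):
    """Wrap cls.<name> inside the block, then restore the class exactly."""
    had_own = name in cls.__dict__      # defined on cls itself, or inherited?
    original = cls.__dict__.get(name)   # raw entry; only used if had_own
    target = getattr(cls, name)

    def wrapper(self, **kwargs):
        # At call time **kwargs is concrete, so the rendered text is available.
        text = str(kwargs.get("prompt", ""))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        sink.append((active_case_uid.get(), digest))
        return target(self, **kwargs)

    setattr(cls, name, wrapper)
    try:
        yield
    finally:
        if had_own:
            setattr(cls, name, original)  # put the saved object back
        else:
            delattr(cls, name)            # inherited: just drop the override


before = dict(Client.__dict__)
recorded = []
with record_calls(Client, "create", recorded):
    token = active_case_uid.set("case-1")  # bind the active case
    try:
        params = {"prompt": "Hello"}       # opaque to a static scan
        Client().create(**params)
    finally:
        active_case_uid.reset(token)

print(recorded[0][0])                   # case-1
print(dict(Client.__dict__) == before)  # True
```

Removing the override with `delattr` rather than assigning the inherited function back is what leaves the subclass `__dict__` exactly as it was found; assigning it back would leave a new entry on the subclass.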
Fixed
- `add_check` QAG question generation no longer invents stricter sub-requirements. "Response should mention the return policy" generated questions about return procedures and eligibility the criterion never asked for, scoring a plainly-correct answer 0.33 FAIL (reproduced 3/3 trials). The generation prompt now requires every question be answerable "Yes" by any response satisfying the criterion as stated; the same answer now scores 1.0 (and the evasive control still fails). Found by a fresh-user E2E audit on the quickstart's own example.
- Keyless demo picks an Ollama model you actually have: `python -m multivon_eval` used hardcoded `llama3` and reported "detected but unreachable" when the server was fine and the model wasn't pulled. It now asks `/api/tags` for an installed model (text models preferred), honors `DEMO_MODEL`, and the failure message distinguishes "model not available" from "server unreachable".
- Bootstrap creates the output dir before any paid LLM call: a typo'd or read-only `--output` previously failed after ~$0.12 of judge spend. Progress lines now print to stderr as eac