Release multivon-eval 0.11.0 — runtime prompt recorder · multivon-ai/multivon-eval

The answer to the 20.9% ceiling. The determinacy gate (0.10.1) measured scanner v3 against five real repos: 20.9% of call sites are statically resolvable — the rest build prompts dynamically and are statically unbridgeable by construction. The runtime prompt recorder (designed in #9, promoted from the 0.10.0 deferred list by the gate result) is the honest path past it: during an eval run, an opt-in interceptor records the rendered prompt text per call site, fingerprinted with the same fingerprint_text the static scanner uses. A **kwargs unpack the scanner can only report as UNKNOWN is, at call time, real kwargs with real text.

The honesty discipline survives the new power — three labeled trust tiers, never collapsed:

static — the scan proves the prompt text;
runtime — recordings prove only the renderings observed, not all renderings (variable renderings per site are a fingerprint SET, and every verdict speaks in "current recordings matched k of N previously observed renderings" — a site is never called fresh because one rendering matched);
templates / external prompts — deferred, unverifiable.

Added

multivon_eval.recorder — opt-in runtime prompt recorder. record_prompts() context manager (non-pytest) or pytest --record-prompts (plus --record-prompts-out, --record-text). Method-level wrapping of exactly the three SDK surfaces the static scanner knows — anthropic Messages.create, openai chat.completions.create, litellm.completion/acompletion — save original, wrap, restore byte-identical on exit (inherited attributes restored by delattr, so __dict__s end exactly as found). Zero overhead when off: importing multivon_eval performs NO patching, pinned by a fresh-interpreter subprocess test. Recordings stay local in prompt_recordings.jsonl; fingerprints only by default, rendered text only behind explicit --record-text. Append-safe storage: duplicate (anchor, role, fingerprint) keys merge counts/case_uids on write, atomic rewrite.
Case binding by observation — a contextvar carries the active case_uid; EvalSuite binds it per case from _provenance.case_uid (one None-check when recording is off) and the pytest plugin binds the test nodeid per test. Recordings carry the case_uids observed per (anchor, role, fingerprint) — the run knows which sites fired for which case.
multivon-eval staleness baseline --merge-recordings [FILE] — merges recordings into prompt_baseline.json as source:"runtime" records with fingerprint SETS, stored under a separate runtime_records key. Merge-only: never rescans, NEVER touches static records; a static rescan never discards the runtime tier; re-merging the same recordings file is idempotent.
OBSERVED report tier — runtime-sourced sites render distinctly in text/json/markdown: compared recordings-vs-recordings (runtime-only sites cannot be compared against a static scan, and the report says so), always in the k/N language. The determinacy headline gains a third clause: "K sites observed at runtime." The standing footer now states all three trust tiers verbatim, next to the blind-spots list.
multivon-eval staleness stamp --from-recordings [FILE] — prints observed case→site bindings as proposals (case_uid → anchor + fingerprint with observation counts); writes only with explicit --apply --cases F.jsonl (targets land as source:"runtime", bound:"observed"). Observation removes the fabrication objection that blocked auto-binding in the 0.10.0 adversarial review — the human confirmation stays. Runtime-bound targets are verified against recordings, never against the static scan (reported unverifiable [runtime] there, by rule), and never enter the static coverage denominator.

Fixed

add_check QAG question generation no longer invents stricter sub-requirements. "Response should mention the return policy" generated questions about return procedures and eligibility the criterion never asked for, scoring a plainly-correct answer 0.33 FAIL (reproduced 3/3 trials). The generation prompt now requires every question be answerable "Yes" by any response satisfying the criterion as stated; the same answer now scores 1.0 (and the evasive control still fails). Found by a fresh-user E2E audit on the quickstart's own example.
Keyless demo picks an Ollama model you actually have — python -m multivon_eval used hardcoded llama3 and reported "detected but unreachable" when the server was fine and the model wasn't pulled. It now asks /api/tags for an installed model (text models preferred), honors DEMO_MODEL, and the failure message distinguishes "model not available" from "server unreachable".
Bootstrap creates the output dir before any paid LLM call — a typo'd or read-only --output previously failed after ~$0.12 of judge spend. Progress lines now print to stderr as eac

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

multivon-eval 0.11.0 — runtime prompt recorder

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Added

Fixed

Uh oh!