multivon-eval 0.10.1 — scanner v3 + the published 20.9% gate failure
Scanner v3 — the determinacy gate (spec test-plan #14) run against five real repos (aider, gpt-researcher, open-interpreter, letta, pr-agent) found that 4 of 5 reported zero call sites: not because they have no prompts, but because the scanner was silently blind to how real code calls LLMs. v3 fixes detection and honestly reports what it still cannot read.
Fixed
- Aliased litellm imports detected —
from litellm import acompletionthen bareacompletion(...)(pr-agent's shape) now matches. Star imports and function-local imports stay out of scope. **kwargs-unpacked calls surface as UNKNOWN —litellm.completion(**kwargs)(aider's shape) now emits an honest<dynamic:KwargsUnpack>record instead of vanishing.messages=<variable>surfaces as UNKNOWN — the most common real-world shape (messages list built elsewhere) now emits one dynamic record per call site instead of nothing. A literal emptymessages=[]correctly emits nothing (statically known empty).SCANNER_VERSIONbumped to 3; baselines written by v2 print a "rescan recommended" warning instead of fake drift.
Measured (the determinacy gate, public on the epic)
Honest detection changed the denominator: 73 → 278 sites across the five repos, and static resolvability is 20.9% — below the 50% gate. Conclusion recorded publicly: real-world prompt traffic is mostly dynamic construction; static analysis tracks call-site add/remove for all of it but can verify text drift only for prompts-as-constants codebases (letta-style: 58 static sites). The runtime recorder (epic) is now the priority path for the rest. The staleness report's determinacy headline makes this exact ratio visible per-repo — by design.