multivon-eval 0.10.1 — scanner v3 + the published 20.9% gate failure

@siddharthsrivastava siddharthsrivastava released this 12 Jun 22:09
· 28 commits to main since this release

Scanner v3 — the determinacy gate (spec test-plan #14) run against five real repos (aider, gpt-researcher, open-interpreter, letta, pr-agent) found that 4 of 5 reported zero call sites: not because they have no prompts, but because the scanner was silently blind to how real code calls LLMs. v3 fixes detection and honestly reports what it still cannot read.

Fixed

  • Aliased litellm imports detected: from litellm import acompletion followed by a bare acompletion(...) call (pr-agent's shape) now matches. Star imports and function-local imports stay out of scope.
  • **kwargs-unpacked calls surface as UNKNOWN: litellm.completion(**kwargs) (aider's shape) now emits an honest <dynamic:KwargsUnpack> record instead of vanishing.
  • messages=<variable> surfaces as UNKNOWN — the most common real-world shape (messages list built elsewhere) now emits one dynamic record per call site instead of nothing. A literal empty messages=[] correctly emits nothing (statically known empty).
  • SCANNER_VERSION bumped to 3; baselines written by v2 print a "rescan recommended" warning instead of fake drift.

Measured (the determinacy gate, public on the epic)

Honest detection changed the denominator: 73 → 278 sites across the five repos, and static resolvability is 20.9% — below the 50% gate. Conclusion recorded publicly: real-world prompt traffic is mostly dynamic construction; static analysis tracks call-site add/remove for all of it but can verify text drift only for prompts-as-constants codebases (letta-style: 58 static sites). The runtime recorder (epic) is now the priority path for the rest. The staleness report's determinacy headline makes this exact ratio visible per-repo — by design.
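
The headline ratio can be checked by hand. The sketch below assumes the 58 static sites are the whole numerator (58 of 278 is consistent with the reported 20.9%) and that the gate passes at 50% or above; the function names are hypothetical and not part of multivon-eval.

```python
# Assumption: the gate passes when the static share is at least 50%.
GATE = 0.50


def determinacy(static_sites: int, total_sites: int) -> float:
    """Share of call sites whose prompt text is statically resolvable."""
    if total_sites == 0:
        return 0.0
    return static_sites / total_sites


def passes_gate(static_sites: int, total_sites: int, gate: float = GATE) -> bool:
    return determinacy(static_sites, total_sites) >= gate


if __name__ == "__main__":
    ratio = determinacy(58, 278)  # figures taken from the release notes
    print(f"{ratio:.1%}", "pass" if passes_gate(58, 278) else "fail")
```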