multivon-eval 0.9.1 — Anthropic reasoning-tier fix + vision module + ollama provider
Patch release driven by the pdfhell mini-v4 eval-pipeline post-mortem (CORRECTION_NOTICE.md). The same Anthropic API change that silently broke every Opus 4-7 call in the pdfhell leaderboard would have broken any multivon-eval consumer using Opus 4-7 as a judge — fixing the underlying issue upstream closes the gap for everyone using these adapters.
Fixed
AnthropicAdapter omits temperature for reasoning-tier models. Anthropic's claude-opus-4-7 and the claude-opus-5+ family reject the parameter with a 400. The adapter detects them by name prefix and drops the field; older models still receive temperature unchanged. Same fix applied to multivon_eval.discover._call_judge. New helper AnthropicAdapter._supports_temperature() exposes the decision. 19 new unit tests cover the matrix.
Added
- multivon_eval.vision module. Single call_vision(prompt, sources, judge, max_tokens) function. Providers: anthropic, openai, google, ollama. Per-provider content-block conversion (Anthropic document/image, OpenAI file/image_url, Google Part.from_bytes, Ollama images field). PDFs rasterise via pypdfium2 for ollama. Previously lived in pdfhell.vision; promoted so any multivon-eval consumer can grade images/PDFs without re-implementing per-provider plumbing.
- ollama: is now a first-class JudgeConfig provider. JudgeConfig(provider="ollama", model="llama3.2") resolves to litellm's ollama/<model> driver internally. Matches the colon convention used by the rest of the SDK (anthropic:, openai:, google:). Both sync and async judge paths. OLLAMA_HOST env var sets the base URL (default http://127.0.0.1:11434).
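To make the two additions above more concrete, here is a minimal sketch of the kind of plumbing they describe: mapping an ollama model to litellm's `ollama/<model>` naming, reading `OLLAMA_HOST` with the documented default, and converting raw bytes into per-provider content blocks. The helper names (`resolve_litellm_model`, `ollama_base_url`, `to_content_block`) and the placeholder filename are hypothetical and are not part of multivon-eval's API. The block shapes follow the publicly documented Anthropic, OpenAI chat, and Ollama request formats as best understood here; Google's `Part.from_bytes` is left out because it requires the SDK.

```python
import base64
import os

DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434"  # default stated in the release notes


def resolve_litellm_model(provider: str, model: str) -> str:
    """Hypothetical helper: map an ollama model to litellm's `ollama/<model>` name."""
    return f"ollama/{model}" if provider == "ollama" else model


def ollama_base_url() -> str:
    """Read the base URL from OLLAMA_HOST, falling back to the default."""
    return os.environ.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)


def to_content_block(provider: str, data: bytes, mime_type: str) -> dict:
    """Hypothetical helper: wrap raw bytes in a provider-specific content block."""
    b64 = base64.b64encode(data).decode("ascii")
    is_pdf = mime_type == "application/pdf"
    if provider == "anthropic":
        # Anthropic uses `document` blocks for PDFs and `image` blocks for images.
        kind = "document" if is_pdf else "image"
        return {"type": kind, "source": {"type": "base64", "media_type": mime_type, "data": b64}}
    if provider == "openai":
        data_url = f"data:{mime_type};base64,{b64}"
        if is_pdf:
            # "source.pdf" is a placeholder filename chosen for this example.
            return {"type": "file", "file": {"filename": "source.pdf", "file_data": data_url}}
        return {"type": "image_url", "image_url": {"url": data_url}}
    if provider == "ollama":
        if is_pdf:
            # The notes say PDFs are rasterised (via pypdfium2) before reaching ollama.
            raise ValueError("rasterise PDF pages to images before sending to ollama")
        # Ollama expects base64 strings in an `images` list on the message.
        return {"images": [b64]}
    raise ValueError(f"unsupported provider in this sketch: {provider}")


if __name__ == "__main__":
    print(resolve_litellm_model("ollama", "llama3.2"))
    print(ollama_base_url())
    print(to_content_block("anthropic", b"%PDF-1.7", "application/pdf")["type"])
```

Centralising the conversion in one function per source keeps the provider differences out of calling code, which matches the stated motivation for promoting the module out of pdfhell.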
Compatibility
- No breaking changes. AnthropicAdapter constructor signature is unchanged: temperature is still accepted, just silently dropped for the reasoning tier. Existing pinned dependents (incl. pdfhell>=0.5.0) work without modification.
- Tests: 19 new + 864 pre-existing passing. The two failing test_beginner_friendly.py tests are pre-existing model-name issues independent of this release.
Usage
from multivon_eval import JudgeConfig, call_vision
# Vision grading via any supported provider
judge = JudgeConfig(provider="anthropic", model="claude-opus-4-7") # temperature omitted automatically
answer = call_vision("What is the total?", ["invoice.pdf"], judge)
# Local model judges via ollama
judge = JudgeConfig(provider="ollama", model="qwen2.5:7b")  # first-class, no provider="litellm" needed

Full changelog: CHANGELOG.md