multivon-eval 0.9.1 — Anthropic reasoning-tier fix + vision module + ollama provider

@siddharthsrivastava released this 24 May 05:05
· 48 commits to main since this release

Patch release driven by the pdfhell mini-v4 eval-pipeline post-mortem (CORRECTION_NOTICE.md). The same Anthropic API change that silently broke every Opus 4-7 call in the pdfhell leaderboard would also have broken any multivon-eval consumer using Opus 4-7 as a judge. Fixing the issue upstream, in the adapters themselves, closes the gap for every consumer.

Fixed

  • AnthropicAdapter omits temperature for reasoning-tier models. Anthropic's claude-opus-4-7 and the claude-opus-5+ family reject the parameter with a 400. The adapter detects these models by name prefix and drops the field; older models still receive temperature unchanged. The same fix is applied to multivon_eval.discover._call_judge. A new helper, AnthropicAdapter._supports_temperature(), exposes the decision. 19 new unit tests cover the model matrix.
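
The detection described above can be sketched roughly as follows. This is an illustration only, not the adapter's actual code: the regex, the `supports_temperature` and `build_request_kwargs` names, and the exact version cut-off (treating every claude-opus version from 4-7 upward as reasoning tier) are assumptions made for the example.

```python
import re

# Assumption: reasoning-tier models are named "claude-opus-<major>[-<minor>]...".
# The minor part is capped at two digits so that date suffixes such as
# "-20250514" are not mistaken for a minor version.
_OPUS_VERSION = re.compile(r"^claude-opus-(\d+)(?:-(\d{1,2})(?!\d))?")


def supports_temperature(model: str) -> bool:
    """Return False for models assumed to reject the temperature parameter."""
    match = _OPUS_VERSION.match(model)
    if match is None:
        return True  # not a claude-opus-<N> name, keep temperature
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    # Hypothetical cut-off: claude-opus-4-7 and everything newer.
    return (major, minor) < (4, 7)


def build_request_kwargs(model: str, temperature: float, max_tokens: int) -> dict:
    """Build request kwargs, dropping temperature where it would cause a 400."""
    kwargs = {"model": model, "max_tokens": max_tokens}
    if supports_temperature(model):
        kwargs["temperature"] = temperature
    return kwargs
```

Callers keep passing temperature as before; under these assumptions it is only filtered out at request-build time, which is why the constructor signature does not need to change.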

Added

  • multivon_eval.vision module. Exposes a single call_vision(prompt, sources, judge, max_tokens) function. Providers: anthropic, openai, google, ollama. Content blocks are converted per provider (Anthropic document/image, OpenAI file/image_url, Google Part.from_bytes, Ollama images field). For ollama, PDFs are rasterised via pypdfium2. This code previously lived in pdfhell.vision; it was promoted so any multivon-eval consumer can grade images/PDFs without re-implementing per-provider plumbing.

  • ollama: is now a first-class JudgeConfig provider. JudgeConfig(provider="ollama", model="llama3.2") resolves to litellm's ollama/<model> driver internally. This matches the colon convention used by the rest of the SDK (anthropic:, openai:, google:). Both the sync and async judge paths are supported. The OLLAMA_HOST env var sets the base URL (default http://127.0.0.1:11434).
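
As a rough sketch of the ollama resolution described above: the helper names `split_spec`, `to_litellm_model`, and `ollama_base_url` are hypothetical and not part of the multivon-eval API. Only the `ollama/<model>` form, the `OLLAMA_HOST` variable, and its default come from these notes.

```python
import os

DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434"  # default stated in these notes


def split_spec(spec: str) -> tuple[str, str]:
    """Split a 'provider:model' string on the first colon only.

    Ollama model tags contain colons themselves (e.g. 'qwen2.5:7b'),
    so everything after the first colon is kept as the model name.
    """
    provider, _, model = spec.partition(":")
    return provider, model


def to_litellm_model(provider: str, model: str) -> str:
    """Map an ollama judge to litellm's 'ollama/<model>' naming (sketch)."""
    if provider != "ollama":
        raise ValueError(f"this sketch only handles ollama, got {provider!r}")
    return f"ollama/{model}"


def ollama_base_url() -> str:
    """Read the base URL from OLLAMA_HOST, falling back to the default."""
    return os.environ.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)
```

Splitting on the first colon only is the design point worth noting: a plain `split(":")` would cut a tagged model such as `qwen2.5:7b` in two.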

Compatibility

  • No breaking changes. AnthropicAdapter constructor signature is unchanged — temperature is still accepted, just silently dropped for the reasoning tier. Existing pinned dependents (incl. pdfhell>=0.5.0) work without modification.
  • Tests: 19 new and 864 pre-existing tests pass. The two failing tests in test_beginner_friendly.py are pre-existing model-name issues, independent of this release.

Usage

from multivon_eval import JudgeConfig, call_vision

# Vision grading via any supported provider
judge = JudgeConfig(provider="anthropic", model="claude-opus-4-7")  # temperature omitted automatically
answer = call_vision("What is the total?", ["invoice.pdf"], judge)

# Local model judges via ollama
judge = JudgeConfig(provider="ollama", model="qwen2.5:7b")  # first-class, no provider="litellm" needed

Full changelog: CHANGELOG.md