v0.5.0

@github-actions github-actions released this 10 Oct 18:18
035b238

0.5.0 (2025-10-10)

⚠ BREAKING CHANGES

  • added more groupings under benchmarks catalog (#244)

Features

  • add clockbench evaluation framework and script for synthesizing public dataset (#159) (3ba9836)
  • add IFEval (#182) (8d1b939)
  • add local openbench implementation of groq provider in inspect (#131) (52aea35)
  • add mmmlu eval (#193) (a42c2d5)
  • add mmstar benchmark (#174) (5d085ab)
  • add new openbench documentation (#169) (f3e6a37)
  • add overarching bbh command to run all 18 BBH tasks (463a25f)
  • add preset eval group infrastructure (#215) (d9ea03a)
  • added more groupings under benchmarks catalog (#244) (d932cb0)
  • ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219) (006e248)
  • benchmark: add support for arc-agi (#158) (3f32253)
  • benchmark: add support for detailbench (#154) (23fbca5)
  • benchmark: add support for TUMLU (#160) (#161) (885be75)
  • benchmark: multichallenge implementation (#170) (cf2ab4f)
  • change default model to groq/openai/gpt-oss-20b (#138) (8f7f42f)
  • components: export the run_eval entrypoint method (#157) (acbe7f4)
  • configure release-please for pre-v1.0 version bumping (#133) (c432934)
  • cybench: ported over code for cybench (#207) (7949425)
  • cybersecurity, changelog, more docs (39f123c)
  • display results patch to include task duration stats (#167) (e4e480c)
  • docs: add changelog page (#225) (7db9135)
  • docs: add release notes section and update index with new features for v0.5 (#245) (09ab78e)
  • docs: added feature card and docs page for exercism (#243) (2b38147)
  • docs: Added feature eval docs pages and cache command docs (#191) (50501f1)
  • eval: add support for json output (#14) (f335418)
  • exercism: added support for exercism tasks with agent support for aider, roo, claude, opencode (#151) (d86f0da)
  • graphwalks token filter (#115) (e38658c)
  • groq reasoning effort + bugfix to override inspect's "groq" (#142) (b919cc7)
  • lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197) (7792c45)
  • lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d)
  • lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a)
  • lighteval: add Global-MMLU eval (42 languages) (3542213)
  • lighteval: register BigBench benchmarks in config and registry (77018e9)
  • lighteval: register Global-MMLU benchmarks in config and registry (156f509)
  • link to subscription form on main page (#240) (988d08c)
  • livemcpbench: Adding support for liveMCPBench (#127) (222f678)
  • make evals dash/underscore insensitive (#185) (5ec5177)
  • mbpp (#117) (93ad88b)
  • mcq_eval: enable abstraction of MCQ eval (#181) (2f53db2)
  • mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134) (a875378)
  • openrouter: add OpenRouter provider support (#145) (47b579e)
  • openrouter: add provider routing args support (#180) (12e1d81)
  • otis-mock-aime: added support for otis mock aime 2024-2025 (#218) (1b9fd5c)
  • plugins: add entry point system for external benchmarks (#216) (71e7257)
  • return eval logs from run_eval function (#173) (ee459d9)
  • rootly_terraform: add initial implementation of Rootly Terraform evals (#195) (cd3acae)
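
The dash/underscore-insensitive eval names from #185 can be pictured with a minimal sketch. It is not openbench's actual code: the helper `normalize_benchmark_name`, the `REGISTRY` dict, and its descriptions are hypothetical, and it assumes the canonical form uses underscores.

```python
# Illustrative sketch only; names below are hypothetical, not openbench's API.

def normalize_benchmark_name(name: str) -> str:
    """Map dashes to underscores so both spellings resolve to one key."""
    return name.strip().replace("-", "_")


# Hypothetical registry keyed by canonical (underscore) names.
REGISTRY = {
    "mmmu_pro": "MMMU-Pro benchmark",
    "arc_agi": "ARC-AGI benchmark",
}


def lookup(name: str) -> str:
    """Resolve a user-supplied benchmark name regardless of dash/underscore."""
    key = normalize_benchmark_name(name)
    if key not in REGISTRY:
        available = ", ".join(sorted(REGISTRY))
        raise KeyError(f"Unknown benchmark {name!r}. Available: {available}")
    return REGISTRY[key]


if __name__ == "__main__":
    # Both spellings resolve to the same entry.
    print(lookup("mmmu-pro") == lookup("mmmu_pro"))
```

Normalizing once at the lookup boundary keeps the registry itself in a single spelling, so no entry has to be registered twice.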

Bug Fixes

  • allow for more python versions (#164) (e6682fe)
  • close headqa metadata entry (947522d)
  • cybench: moved cybench dependency into dependency group (#237) (8d30715)
  • handle missing SciCode dependency lazily in solver (#186) (fed4e88)
  • improve BBH target extraction to handle multi-char answers (147e3e0)
  • load Neue Regrade font in Mintlify docs (#177) (550c7f5)
  • make core package actually install (#235) (edeb4b8)
  • normalize benchmark keys during entry point merge (#217) (d285664)
  • register headqa_en and headqa_es variants (6a19aa1)
  • render inspect error correctly (#241) (97ccd10)
  • resolve registry import conflict (98a1c79)
  • scicode: add support for test split, fix test_data.h5 import error (#149) (23fa8cb)
  • update bbh function for programmatic access only (811ce9e)
  • use generic type ignore for bbh task decorator (fd67171)
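
The lazy handling of a missing optional dependency (#186) follows a common Python pattern, sketched below under assumptions. The helper `load_optional` is hypothetical and is not taken from the openbench source; it only shows the general idea of deferring the import until the dependency is actually needed and failing with a readable message.

```python
# Illustrative sketch only; load_optional is a hypothetical helper.
import importlib
from types import ModuleType


def load_optional(module_name: str) -> ModuleType:
    """Import a module at call time instead of at package import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Surface a clear error only when the feature is actually used.
        raise RuntimeError(
            f"Optional dependency {module_name!r} is not installed; "
            "install it to use this benchmark."
        ) from exc


if __name__ == "__main__":
    # A stdlib module stands in for the optional dependency here.
    json_mod = load_optional("json")
    print(json_mod.dumps({"ok": True}))
```

With this shape, importing the package never fails because of an absent extra; only the code path that needs the dependency raises.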

Documentation

  • readme: clarify benchmark case-sensitivity and grader requirements (#135) (c34a5a3)

Chores

  • add @nmayorga7 to CODEOWNERS (abad7bf)
  • alphabetize available benchmarks error (#214) (68f46e9)
  • benchmark: removed combined cti-bench eval (#183) (a77852c)
  • bugbot fixes for MCQ (#190) (6ecaefc)
  • docs: add docs for openrouter and MCQEval (#188) (7f8cd83)
  • docs: alphabetize benchmarks metadata (#187) (ce77812)
  • docs: benchmarks each on new line (#184) (b3c40f8)
  • docs: minor cleanup (#179) (80c9e09)
  • fixed fonts in openbench docs (#178) (6e3c2a5)
  • GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (d0018d1)
  • mcq-eval: accept more dataset types (#194) (cb5e038)
  • move all metrics to discrete files in /metrics (#168) (7caa042)
  • release-please pre-1.0: treat BREAKING as minor (3c88a42)
  • remove pre-commit benchmark checks for easier CI (#213) (67b07a7)
  • rename OpenBench to openbench (#196) (0621b46)
  • rename task to sample in time metrics (#172) (96e1817)
  • sync packaging pyproject (#234) (940a879)
  • update Claude workflows to enhance permissions and streamline triggers (#136) (effb7da)
  • update readme and contributing (#176) (ea606ba)
  • update release-please configuration and add lockfile update workflow (#146) (2d6ad9b)
  • user agent (#163) (e20f3c1)

CI

  • add benchmarks validation pre commit hook (#171) (3725638)
  • remove PR trigger from release-please (#166) (6440d44)