v0.5.0
0.5.0 (2025-10-10)
⚠ BREAKING CHANGES
added more groupings under benchmarks catalog (#244 )
Features
add clockbench evaluation framework and script for synthesizing public dataset. (#159 ) (3ba9836 )
add IFEval (#182 ) (8d1b939 )
add local openbench implementation of groq provider in inspect (#131 ) (52aea35 )
add mmmlu eval (#193 ) (a42c2d5 )
add mmstar benchmark (#174 ) (5d085ab )
add new openbench documentation (#169 ) (f3e6a37 )
add overarching bbh command to run all 18 BBH tasks (463a25f )
add preset eval group infrastructure (#215 ) (d9ea03a )
added more groupings under benchmarks catalog (#244 ) (d932cb0 )
ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219 ) (006e248 )
benchmark: add support for arc-agi (#158 ) (3f32253 )
benchmark: add support for detailbench (#154 ) (23fbca5 )
benchmark: add support for TUMLU (#160 ) (#161 ) (885be75 )
benchmark: multichallenge implementation (#170 ) (cf2ab4f )
change default model to groq/openai/gpt-oss-20b (#138 ) (8f7f42f )
components: export the run_eval entrypoint method (#157 ) (acbe7f4 )
configure release-please for pre-v1.0 version bumping (#133 ) (c432934 )
cybench: ported over code for cybench (#207 ) (7949425 )
cybersecurity, changelog, more docs (39f123c )
display results patch to include task duration stats (#167 ) (e4e480c )
docs: add changelog page (#225 ) (7db9135 )
docs: add release notes section and update index with new features for v0.5 (#245 ) (09ab78e )
docs: added feature card and docs page for exercism (#243 ) (2b38147 )
docs: Added feature eval docs pages and cache command docs (#191 ) (50501f1 )
eval: add support for json output (#14 ) (f335418 )
exercism: added support for exercism tasks w/ agent support for aider, roo, claude, opencode (#151 ) (d86f0da )
graphwalks token filter (#115 ) (e38658c )
groq reasoning effort + bugfix to override inspect's "groq" (#142 ) (b919cc7 )
lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197 ) (7792c45 )
lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d )
lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a )
lighteval: add Global-MMLU eval (42 languages) (3542213 )
lighteval: register BigBench benchmarks in config and registry (77018e9 )
lighteval: register Global-MMLU benchmarks in config and registry (156f509 )
link to subscription form on main page (#240 ) (988d08c )
livemcpbench: Adding support for liveMCPBench (#127 ) (222f678 )
make evals dash/underscore insensitive (#185 ) (5ec5177 )
mbpp (#117 ) (93ad88b )
mcq_eval: enable abstraction of MCQ eval (#181 ) (2f53db2 )
mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134 ) (a875378 )
openrouter: add OpenRouter provider support (#145 ) (47b579e )
openrouter: add provider routing args support (#180 ) (12e1d81 )
otis-mock-aime: added support for otis mock aime 2024-2025 (#218 ) (1b9fd5c )
plugins: add entry point system for external benchmarks (#216 ) (71e7257 )
return eval logs from run_eval function (#173 ) (ee459d9 )
rootly_terraform: add initial implementation of Rootly Terraform evals (#195 ) (cd3acae )
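The "make evals dash/underscore insensitive" entry above can be illustrated with a minimal sketch. This is not openbench's actual implementation: the helper names `normalize_eval_name` and `resolve_eval`, and the plain-dict registry, are hypothetical and only show one way such a lookup could work. Case is deliberately left untouched, since the notes do not say how case is handled.

```python
from typing import Any, Mapping


def normalize_eval_name(name: str) -> str:
    """Hypothetical helper: treat '-' and '_' as the same character."""
    return name.strip().replace("-", "_")


def resolve_eval(name: str, registry: Mapping[str, Any]) -> Any:
    """Look up an eval by name, ignoring dash/underscore differences.

    `registry` is an assumed mapping of eval names to task objects.
    """
    wanted = normalize_eval_name(name)
    for key, value in registry.items():
        if normalize_eval_name(key) == wanted:
            return value
    raise KeyError(f"unknown eval: {name!r}")


# Both spellings resolve to the same (placeholder) registry entry.
registry = {"mmmu_pro": "task-object-placeholder"}
same = resolve_eval("mmmu-pro", registry) == resolve_eval("mmmu_pro", registry)
```

Normalizing both the query and the registry keys means the registry can keep whichever spelling it was registered with.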
Bug Fixes
allow for more python versions (#164 ) (e6682fe )
close headqa metadata entry (947522d )
cybench: moved cybench dependency into dependency group (#237 ) (8d30715 )
handle missing SciCode dependency lazily in solver (#186 ) (fed4e88 )
improve BBH target extraction to handle multi-char answers (147e3e0 )
load Neue Regrade font in Mintlify docs (#177 ) (550c7f5 )
make core package actually install (#235 ) (edeb4b8 )
normalize benchmark keys during entry point merge (#217 ) (d285664 )
register headqa_en and headqa_es variants (6a19aa1 )
render inspect error correctly (#241 ) (97ccd10 )
resolve registry import conflict (98a1c79 )
scicode: add support for test split, fix test_data.h5 import error (#149 ) (23fa8cb )
update bbh function for programmatic access only (811ce9e )
use generic type ignore for bbh task decorator (fd67171 )
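The "handle missing SciCode dependency lazily" fix above follows a common pattern: defer importing an optional package until it is needed, and fail with an actionable message if it is absent. The sketch below is a generic illustration under that assumption, not the code from the fix; `require_optional` and the install hint are hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency at call time (hypothetical helper).

    Raises RuntimeError with an install hint instead of failing at
    package import time when the dependency is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"optional dependency {module_name!r} is not installed; {install_hint}"
        ) from exc


# Usage: a stdlib module stands in for the optional package here.
json_mod = require_optional("json", "install the relevant extra")
```

Calling the helper inside the function that needs the dependency keeps the rest of the package importable when the extra is not installed.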
Documentation
readme: clarify benchmark case-sensitivity and grader requirements (#135 ) (c34a5a3 )
Chores
add @nmayorga7 to CODEOWNERS (abad7bf )
alphabetize available benchmarks error (#214 ) (68f46e9 )
benchmark: removed combined cti-bench eval (#183 ) (a77852c )
bugbot fixes for MCQ (#190 ) (6ecaefc )
docs: add docs for openrouter and MCQEval (#188 ) (7f8cd83 )
docs: alphabetize benchmarks metadata (#187 ) (ce77812 )
docs: benchmarks each on new line (#184 ) (b3c40f8 )
docs: minor cleanup (#179 ) (80c9e09 )
fixed fonts in openbench docs (#178 ) (6e3c2a5 )
GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (d0018d1 )
mcq-eval: accept more dataset types (#194 ) (cb5e038 )
move all metrics to discrete files in /metrics (#168 ) (7caa042 )
release-please pre-1.0: treat BREAKING as minor (3c88a42 )
remove pre-commit benchmark checks for easier CI (#213 ) (67b07a7 )
rename OpenBench to openbench (#196 ) (0621b46 )
rename task to sample in time metrics (#172 ) (96e1817 )
sync packaging pyproject (#234 ) (940a879 )
update Claude workflows to enhance permissions and streamline triggers (#136 ) (effb7da )
update readme and contributing (#176 ) (ea606ba )
update release-please configuration and add lockfile update workflow (#146 ) (2d6ad9b )
user agent (#163 ) (e20f3c1 )
CI
add benchmarks validation pre commit hook (#171 ) (3725638 )
remove PR trigger from release-please (#166 ) (6440d44 )