Release v0.5.0 · groq/openbench

0.5.0 (2025-10-10)

⚠ BREAKING CHANGES

added more groupings under benchmarks catalog (#244)

Features

add clockbench evaluation framework and script for synthesizing public dataset (#159) (3ba9836)
add IFEval (#182) (8d1b939)
add local openbench implementation of groq provider in inspect (#131) (52aea35)
add mmmlu eval (#193) (a42c2d5)
add mmstar benchmark (#174) (5d085ab)
add new openbench documentation (#169) (f3e6a37)
add overarching bbh command to run all 18 BBH tasks (463a25f)
add preset eval group infrastructure (#215) (d9ea03a)
added more groupings under benchmarks catalog (#244) (d932cb0)
ArabicMMLU: add remaining 32 Arabic exam subsets, total 41 subsets (#219) (006e248)
benchmark: add support for arc-agi (#158) (3f32253)
benchmark: add support for detailbench (#154) (23fbca5)
benchmark: add support for TUMLU (#160) (#161) (885be75)
benchmark: multichallenge implementation (#170) (cf2ab4f)
change default model to groq/openai/gpt-oss-20b (#138) (8f7f42f)
components: export the run_eval entrypoint method (#157) (acbe7f4)
configure release-please for pre-v1.0 version bumping (#133) (c432934)
cybench: ported over code for cybench (#207) (7949425)
cybersecurity, changelog, more docs (39f123c)
display results patch to include task duration stats (#167) (e4e480c)
docs: add changelog page (#225) (7db9135)
docs: add release notes section and update index with new features for v0.5 (#245) (09ab78e)
docs: added feature card and docs page for exercism (#243) (2b38147)
docs: Added feature eval docs pages and cache command docs (#191) (50501f1)
eval: add support for json output (#14) (f335418)
exercism: added support for exercism tasks w/ agent support for aider, roo, claude, opencode (#151) (d86f0da)
graphwalks token filter (#115) (e38658c)
groq reasoning effort + bugfix to override inspect's "groq" (#142) (b919cc7)
lighteval: Add 7 core commonsense reasoning benchmarks from LightEval (#197) (7792c45)
lighteval: add BigBench eval (122 MCQ tasks) (9f35b1d)
lighteval: Add cross-lingual understanding benchmarks (XCOPA, XStoryCloze, XWinograd) (917667a)
lighteval: add Global-MMLU eval (42 languages) (3542213)
lighteval: register BigBench benchmarks in config and registry (77018e9)
lighteval: register Global-MMLU benchmarks in config and registry (156f509)
link to subscription form on main page (#240) (988d08c)
livemcpbench: Adding support for liveMCPBench (#127) (222f678)
make evals dash/underscore insensitive (#185) (5ec5177)
mbpp (#117) (93ad88b)
mcq_eval: enable abstraction of MCQ eval (#181) (2f53db2)
mmmu-pro: added support for mmmu_mcq, mmmu_open, mmmu_pro, mmmu_pro_vision (#134) (a875378)
openrouter: add OpenRouter provider support (#145) (47b579e)
openrouter: add provider routing args support (#180) (12e1d81)
otis-mock-aime: added support for otis mock aime 2024-2025 (#218) (1b9fd5c)
plugins: add entry point system for external benchmarks (#216) (71e7257)
return eval logs from run_eval function (#173) (ee459d9)
rootly_terraform: add initial implementation of Rootly Terraform evals (#195) (cd3acae)

Bug Fixes

allow for more python versions (#164) (e6682fe)
close headqa metadata entry (947522d)
cybench: moved cybench dependency into dependency group (#237) (8d30715)
handle missing SciCode dependency lazily in solver (#186) (fed4e88)
improve BBH target extraction to handle multi-char answers (147e3e0)
load Neue Regrade font in Mintlify docs (#177) (550c7f5)
make core package actually install (#235) (edeb4b8)
normalize benchmark keys during entry point merge (#217) (d285664)
register headqa_en and headqa_es variants (6a19aa1)
render inspect error correctly (#241) (97ccd10)
resolve registry import conflict (98a1c79)
scicode: add support for test split, fix test_data.h5 import error (#149) (23fa8cb)
update bbh function for programmatic access only (811ce9e)
use generic type ignore for bbh task decorator (fd67171)

Documentation

readme: clarify benchmark case-sensitivity and grader requirements (#135) (c34a5a3)

Chores

add @nmayorga7 to CODEOWNERS (abad7bf)
alphabetize available benchmarks error (#214) (68f46e9)
benchmark: removed combined cti-bench eval (#183) (a77852c)
bugbot fixes for MCQ (#190) (6ecaefc)
docs: add docs for openrouter and MCQEval (#188) (7f8cd83)
docs: alphabetize benchmarks metadata (#187) (ce77812)
docs: benchmarks each on new line (#184) (b3c40f8)
docs: minor cleanup (#179) (80c9e09)
fixed fonts in openbench docs (#178) (6e3c2a5)
GitHub Terraform: Create/Update .github/workflows/stale.yaml [skip ci] (d0018d1)
mcq-eval: accept more dataset types (#194) (cb5e038)
move all metrics to discrete files in /metrics (#168) (7caa042)
release-please pre-1.0: treat BREAKING as minor (3c88a42)
remove pre-commit benchmark checks for easier CI (#213) (67b07a7)
rename OpenBench to openbench (#196) (0621b46)
rename task to sample in time metrics (#172) (96e1817)
sync packaging pyproject (#234) (940a879)
update Claude workflows to enhance permissions and streamline triggers (#136) (effb7da)
update readme and contributing (#176) (ea606ba)
update release-please configuration and add lockfile update workflow (#146) (2d6ad9b)
user agent (#163) (e20f3c1)

CI

add benchmarks validation pre commit hook (#171) (3725638)
remove PR trigger from release-please (#166) (6440d44)
