OpenCompass v0.5.3 Release Notes

🌟 Highlights

✨ 🧪 Extensive New Benchmarks and Dataset Support: Added support for a wide range of new benchmarks and datasets, including AIME2026, HMMT Feb 2026, SimpleQA-Verified, AdvancedIF, HLE-Verified, MRCR-V2, Molecular-IQ, SciReasoner 1.5, MP20, PerspectiveGap, ZebraLogic, ArxivRollBench, CL-Bench, and more.

✨ 🧩 RawPromptTemplate Support: Introduced RawPromptTemplate and expanded its support across dataset configs, OpenAISDKStreaming, ChatML datasets, retrievers, and multiple benchmark configurations.

✨ 🤖 New Model & API Support: Added support for the OpenAI Responses API, the Gemini SDK API, and Claude SDK thinking content.

✨ 🛠️ Infrastructure & Evaluation Enhancements: Improved concurrent inference, repeat analysis, summarizer configuration, judge post-processing, CI coverage, and evaluation robustness.

🚀 New Features

🔧 Added support for new datasets and benchmarks, including AIME2026 and HMMT Feb 2026 (#2404), Molecular-IQ (#2431), SimpleQA-Verified (#2436), AdvancedIF (#2461), HLE-Verified (#2454), MRCR-V2 (#2467), S2-ToM-G Bench (#2476), SciReasoner 1.5 (#2477, #2479), MP20 (#2482), PerspectiveGap (#2484), ZebraLogic (#2464), and ArxivRollBench (#2458).

🔧 Introduced RawPromptTemplate and new dataset configs (#2407).

🔧 Added RawPromptTemplate support for OpenAISDKStreaming and ChatMLDatasets (#2414).

🔧 Added SciReasoner config and support for retrievers using RawPromptTemplate (#2422).

🔧 Added repeat configs for HMMT2025 and UGD_hard (#2425).

🔧 Added new subjective dataset configs and fixed AdvancedIF-related configs (#2466).

🔧 Added RawPromptTemplate configs for AlignBench v1.1 (#2472), RULER, NeedleBench, LongBench v2, and BABILong (#2474).

🔧 Added support for repeat analysis (#2455).

🔧 Added OpenAI Responses API model support (#2481).

🔧 Added Gemini SDK API model support (#2494).

🔧 Added support for Claude SDK thinking content (#2487).

🔧 Added CL-Bench support (#2483).

🐛 Bug Fixes

🔧 Fixed IFEval path issue (#2406).

🔧 Fixed tag matching in generic_llmjudge_postprocess (#2417).

🔧 Fixed issues in generic.py (#2419).

🔧 Fixed ChatML and dataset utility issues (#2421).

🔧 Fixed OlympiadBench compatibility with RawPromptTemplate (#2423).

🔧 Fixed ellipsis extraction in bio_data (#2429).

🔧 Fixed Molecular-IQ PR test issues (#2432).

🔧 Added multi-round support from api_meta_template (#2435).

🔧 Fixed result metrics in SimpleQA (#2439).

🔧 Fixed no-prediction cases in SciReasoner biology instruction evaluation (#2456).

🔧 Fixed hub version issue (#2457).

🔧 Fixed SimpleQA prompt (#2463).

🔧 Added unsafe pattern handling for MathEvaluator (#2468).

🔧 Fixed MTBench101, WildBench, and AdvancedIF issues (#2470).

🔧 Added subprocess timeout to MATHVerifyEvaluator to prevent parse and verify hangs (#2478).

🔧 Resolved dependency issues for Python 3.12 (#2480).

🔧 Restored LongBench v2 answer prompt punctuation (#2452).

🔧 Fixed MP20 metrics (#2488).

🔧 Prioritized Responses reasoning content (#2490).
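
To illustrate the idea behind the subprocess timeout added for MATHVerifyEvaluator (#2478), here is a minimal sketch of running work in a child process with a hard time limit so that a hanging parse or verify step cannot block the whole evaluation. It is not the code from that PR: the helper name run_with_timeout and its return convention are hypothetical, and only the Python standard library is used.

```python
import subprocess
import sys
from typing import Optional


def run_with_timeout(code: str, timeout_s: float) -> Optional[str]:
    """Run a Python snippet in a child process.

    Hypothetical helper for illustration only. Returns the child's
    stripped stdout, or None if it timed out or exited with an error.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        # Treat a hang as "no result" instead of blocking the caller.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_with_timeout("print(1 + 1)", 30.0))
    print(run_with_timeout("import time; time.sleep(60)", 0.5))
```

In this sketch a timed-out check is reported as a missing result, so the caller can score the sample as incorrect or skip it rather than wait indefinitely.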

⚙ Enhancements and Refactors

⚙ Evaluation and Runtime Improvements:

Added support for concurrent inference across tasks (#2403).
Reduced RJOB sleep time (#2411).
Added new template for CompassAcademic (#2447).
Expanded config options of the summarizer group (#2448).
Added --no-progress parameter for repeat analysis to suppress detailed logs (#2473).
Added prediction postprocessor support for judge models in GenericLLMEvaluator (#2491).
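
As a rough picture of what concurrent inference across tasks (#2403) means in general, the sketch below runs several independent tasks in a thread pool instead of one after another. It does not reflect OpenCompass internals: run_task, run_tasks_concurrently, and the task names are hypothetical stand-ins, and the sleep simulates waiting on a model API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_task(name: str) -> str:
    """Hypothetical stand-in for one inference task (e.g. one dataset)."""
    time.sleep(0.1)  # simulates waiting on a remote model API
    return f"{name}: done"


def run_tasks_concurrently(task_names: List[str], max_workers: int = 4) -> List[str]:
    """Run all tasks in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_names))


if __name__ == "__main__":
    for line in run_tasks_concurrently(["task_a", "task_b", "task_c"]):
        print(line)
```

Threads suit this sketch because API-bound tasks spend most of their time waiting on the network; CPU-bound work would call for a different approach, such as separate processes.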

⚙ RawPromptTemplate Documentation and Compatibility:

Added guide for RawPromptTemplate (#2420).
Improved RawPromptTemplate coverage across benchmark and dataset configs (#2407, #2414, #2422, #2472, #2474).

⚙ Metadata and Release Updates:

Updated dataset index metadata (#2492).
Bumped version to 0.5.3 (#2493).

⚙ CI/CD Improvements:

Refactored ETE test cases and added more unit tests (#2408).
Added inference test cases using mock API (#2428).
Added API test cases into PR tests (#2460).
Removed host network from CI settings (#2462).

⚙ Documentation:

Fixed minor documentation bugs (#2469).

🎉 Welcome New Contributors

A warm welcome and special thanks to our newest contributors who made this release possible:

@JifeiShan made their first contribution in #2473.
@ssiq made their first contribution in #2480.
@yhzhu99 made their first contribution in #2452.
@HuangZixian made their first contribution in #2482.
@WhymustIhaveaname made their first contribution in #2484.
@amanyara made their first contribution in #2464.
@liangzid made their first contribution in #2458.

Full Changelog: 0.5.2...0.5.3

Thank you for using OpenCompass! These updates bring broader benchmark coverage, stronger API support, and more reliable evaluation workflows. Keep exploring, and stay tuned for future innovations! 🌟
