OpenCompass v0.5.3 Release Notes
Highlights
- Extensive New Benchmarks and Dataset Support: Added support for a wide range of new benchmarks and datasets, including AIME2026, HMMT Feb 2026, SimpleQA-Verified, AdvancedIF, HLE-Verified, MRCR-V2, Molecular-IQ, SciReasoner 1.5, MP20, PerspectiveGap, ZebraLogic, ArxivRollBench, CL-Bench, and more.
- RawPromptTemplate Support: Introduced RawPromptTemplate and expanded its support across dataset configs, OpenAISDKStreaming, ChatML datasets, retrievers, and multiple benchmark configurations.
- New Model & API Support: Added support for the OpenAI Responses API, Gemini SDK API and Claude SDK thinking content.
- Infrastructure & Evaluation Enhancements: Improved concurrent inference, repeat analysis, summarizer configuration, judge post-processing, CI coverage, and evaluation robustness.
New Features
- Added support for new datasets and benchmarks, including AIME2026 and HMMT Feb 2026 (#2404), Molecular-IQ (#2431), SimpleQA-Verified (#2436), AdvancedIF (#2461), HLE-Verified (#2454), MRCR-V2 (#2467), S2-ToM-G Bench (#2476), SciReasoner 1.5 (#2477, #2479), MP20 (#2482), PerspectiveGap (#2484), ZebraLogic (#2464), and ArxivRollBench (#2458).
- Introduced RawPromptTemplate and new dataset configs (#2407).
- Added RawPromptTemplate support for OpenAISDKStreaming and ChatMLDatasets (#2414).
- Added SciReasoner config and support for retrievers using RawPromptTemplate (#2422).
- Added repeat configs for HMMT2025 and UGD_hard (#2425).
- Added new subjective dataset configs and fixed AdvancedIF related configs (#2466).
- Added RawPromptTemplate configs for AlignBench v1.1 (#2472), RULER, NeedleBench, LongBench v2, and BABILong (#2474).
- Added support for repeat analysis (#2455).
- Added OpenAI Responses API model support (#2481).
- Added Gemini SDK API model support (#2494).
- Added support for Claude SDK thinking content (#2487).
- Added CL Bench support (#2483).
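As a rough illustration of what a repeat analysis can report, the sketch below summarizes per-sample correctness collected over several repeated runs. It is an assumption about the general idea, not the OpenCompass implementation; the function name `summarize_repeats` and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def summarize_repeats(correct_by_run):
    """Summarize accuracy over repeated evaluation runs.

    correct_by_run: list of runs, each a list of 0/1 flags (one per sample).
    This layout is a hypothetical choice made for the example.
    """
    # Accuracy of each individual run.
    per_run = [mean(run) for run in correct_by_run]
    return {
        "per_run": per_run,
        "mean": mean(per_run),   # average accuracy across repeats
        "std": pstdev(per_run),  # population std dev, i.e. run-to-run spread
    }


runs = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 1]]
summary = summarize_repeats(runs)
```

Reporting the spread alongside the mean makes it easier to tell a real improvement from run-to-run noise on small benchmarks.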
Bug Fixes
- Fixed IFEval path issue (#2406).
- Fixed tag matching in generic_llmjudge_postprocess (#2417).
- Fixed issues in generic.py (#2419).
- Fixed ChatML and dataset utility issues (#2421).
- Fixed OlympiadBench compatibility with RawPromptTemplate (#2423).
- Fixed ellipsis extraction in bio_data (#2429).
- Fixed Molecular-IQ PR test issues (#2432).
- Added multi-round support from api_meta_template (#2435).
- Fixed result metrics in SimpleQA (#2439).
- Fixed no-prediction cases in SciReasoner biology instruction evaluation (#2456).
- Fixed hub version issue (#2457).
- Fixed SimpleQA prompt (#2463).
- Added unsafe pattern handling for MathEvaluator (#2468).
- Fixed MTBench101, WildBench, and AdvancedIF issues (#2470).
- Added subprocess timeout to MATHVerifyEvaluator to prevent parse and verify hangs (#2478).
- Resolved dependency issues for Python 3.12 (#2480).
- Restored LongBench v2 answer prompt punctuation (#2452).
- Fixed MP20 metrics (#2488).
- Prioritized Responses reasoning content (#2490).
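The subprocess timeout mentioned above follows a general pattern: run the step that may hang in a child process and give up once a deadline passes. The sketch below shows that pattern with the standard library only. It does not reproduce the MATHVerifyEvaluator code, and `run_snippet_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys


def run_snippet_with_timeout(code, timeout=5.0):
    """Run a Python snippet in a child process; return its stdout or None.

    Hypothetical helper for illustration. None means the child timed out
    or exited with a non-zero status.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()


fast = run_snippet_with_timeout("print(1 + 1)")
slow = run_snippet_with_timeout("import time; time.sleep(30)", timeout=0.5)
```

Isolating the work in a separate process means a stuck parser cannot block the evaluating process itself; the cost is the overhead of starting a child per call.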
Enhancements and Refactors
Evaluation and Runtime Improvements:
- Supported concurrent inference across tasks (#2403).
- Reduced RJOB sleep time (#2411).
- Added new template for CompassAcademic (#2447).
- Expanded config options of the summarizer group (#2448).
- Added --no-progress parameter for repeat analysis to suppress detailed logs (#2473).
- Added prediction postprocessor support for judge models in GenericLLMEvaluator (#2491).
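To give a feel for concurrent inference across tasks, here is a minimal sketch that dispatches several independent tasks to a thread pool and collects the results by task name. It is an assumption about the general approach rather than the OpenCompass scheduler; `run_tasks_concurrently`, `infer_fn`, and the task layout are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tasks_concurrently(tasks, infer_fn, max_workers=4):
    """Run infer_fn on every task in parallel and map task name -> result.

    tasks: dict of task name -> payload (hypothetical layout).
    infer_fn: callable applied to each payload, e.g. a wrapper around an API call.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every task up front so that slow tasks overlap.
        futures = {pool.submit(infer_fn, payload): name
                   for name, payload in tasks.items()}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside infer_fn.
            results[futures[future]] = future.result()
    return results


def fake_infer(prompts):
    # Stand-in for a model call: upper-cases each prompt.
    return [p.upper() for p in prompts]


outputs = run_tasks_concurrently({"math": ["a", "b"], "qa": ["c"]}, fake_infer)
```

Threads suit API-backed models, where most of the time is spent waiting on the network; locally hosted models are usually limited by the accelerator rather than by the number of threads.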
RawPromptTemplate Documentation and Compatibility:
- Added guide for RawPromptTemplate (#2420).
- Improved RawPromptTemplate coverage across benchmark and dataset configs (#2407, #2414, #2422, #2472, #2474).
Metadata and Release Updates:
CI/CD Improvements:
- Refactored ETE test cases and added more unit tests (#2408).
- Added inference test cases using mock API (#2428).
- Added API test cases into PR tests (#2460).
- Removed host network from CI settings (#2462).
Documentation:
- Fixed minor documentation bugs (#2469).
Welcome New Contributors
A warm welcome and special thanks to our newest contributors who made this release possible:
- @JifeiShan made their first contribution in (#2473).
- @ssiq made their first contribution in (#2480).
- @yhzhu99 made their first contribution in (#2452).
- @HuangZixian made their first contribution in (#2482).
- @WhymustIhaveaname made their first contribution in (#2484).
- @amanyara made their first contribution in (#2464).
- @liangzid made their first contribution in (#2458).
Full Changelog: 0.5.2...0.5.3
Thank you for using OpenCompass! These updates bring broader benchmark coverage, stronger API support, and more reliable evaluation workflows. Keep exploring, and stay tuned for future innovations!