
0.5.3


@ssiq ssiq released this 29 Jun 03:17
a25fdd2

OpenCompass v0.5.3 Release Notes

🌟 Highlights

✨ 🧪 Extensive New Benchmarks and Dataset Support: Added support for a wide range of new benchmarks and datasets, including AIME2026, HMMT Feb 2026, SimpleQA-Verified, AdvancedIF, HLE-Verified, MRCR-V2, Molecular-IQ, SciReasoner 1.5, MP20, PerspectiveGap, ZebraLogic, ArxivRollBench, CL-Bench, and more.

✨ 🧩 RawPromptTemplate Support: Introduced RawPromptTemplate and expanded its support across dataset configs, OpenAISDKStreaming, ChatML datasets, retrievers, and multiple benchmark configurations.

✨ 🤖 New Model & API Support: Added support for the OpenAI Responses API, the Gemini SDK API, and Claude SDK thinking content.

✨ 🛠️ Infrastructure & Evaluation Enhancements: Improved concurrent inference, repeat analysis, summarizer configuration, judge post-processing, CI coverage, and evaluation robustness.


🚀 New Features

🔧 Added support for new datasets and benchmarks, including AIME2026 and HMMT Feb 2026 (#2404), Molecular-IQ (#2431), SimpleQA-Verified (#2436), AdvancedIF (#2461), HLE-Verified (#2454), MRCR-V2 (#2467), S2-ToM-G Bench (#2476), SciReasoner 1.5 (#2477, #2479), MP20 (#2482), PerspectiveGap (#2484), ZebraLogic (#2464), and ArxivRollBench (#2458).

🔧 Introduced RawPromptTemplate and new dataset configs (#2407).

🔧 Added RawPromptTemplate support for OpenAISDKStreaming and ChatMLDatasets (#2414).

🔧 Added SciReasoner config and support for retrievers using RawPromptTemplate (#2422).

🔧 Added repeat configs for HMMT2025 and UGD_hard (#2425).

🔧 Added new subjective dataset configs and fixed AdvancedIF-related configs (#2466).

🔧 Added RawPromptTemplate configs for AlignBench v1.1 (#2472), RULER, NeedleBench, LongBench v2, and BABILong (#2474).

🔧 Added support for repeat analysis (#2455).

🔧 Added OpenAI Responses API model support (#2481).

🔧 Added Gemini SDK API model support (#2494).

🔧 Added support for Claude SDK thinking content (#2487).

🔧 Added CL-Bench support (#2483).
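
For readers unfamiliar with the idea behind repeat analysis (#2455), the sketch below shows one common way to summarize a metric across repeated runs of the same benchmark. It is an illustration only: `summarize_repeats` and the returned field names are hypothetical and are not taken from the OpenCompass codebase, whose actual report format may differ.

```python
import statistics
from typing import Dict, Sequence


def summarize_repeats(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize one metric over repeated runs (hypothetical helper).

    `scores` holds the same metric (e.g. accuracy) from each repeat.
    """
    if not scores:
        raise ValueError("at least one score is required")
    # Sample standard deviation is undefined for a single run; report 0.0.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "n": float(len(scores)),
        "mean": statistics.fmean(scores),
        "std": std,
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not real benchmark results.
    print(summarize_repeats([70.0, 72.0, 74.0]))
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two models exceeds run-to-run noise.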


🐛 Bug Fixes

🔧 Fixed IFEval path issue (#2406).

🔧 Fixed tag matching in generic_llmjudge_postprocess (#2417).

🔧 Fixed issues in generic.py (#2419).

🔧 Fixed ChatML and dataset utility issues (#2421).

🔧 Fixed OlympiadBench compatibility with RawPromptTemplate (#2423).

🔧 Fixed ellipsis extraction in bio_data (#2429).

🔧 Fixed Molecular-IQ PR test issues (#2432).

🔧 Added multi-round support from api_meta_template (#2435).

🔧 Fixed result metrics in SimpleQA (#2439).

🔧 Fixed no-prediction cases in SciReasoner biology instruction evaluation (#2456).

🔧 Fixed hub version issue (#2457).

🔧 Fixed SimpleQA prompt (#2463).

🔧 Added unsafe pattern handling for MathEvaluator (#2468).

🔧 Added subprocess timeout to MATHVerifyEvaluator to prevent parse and verify hangs (#2478).

🔧 Resolved dependency issues for Python 3.12 (#2480).

🔧 Restored LongBench v2 answer prompt punctuation (#2452).

🔧 Fixed MP20 metrics (#2488).

🔧 Prioritized Responses reasoning content (#2490).
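
The subprocess timeout added in #2478 follows a general pattern: run work that might hang in a child process and give up after a deadline. The sketch below illustrates that pattern with the standard library only. It is not the MATHVerifyEvaluator implementation; `run_with_timeout` is a hypothetical helper, and passing the work as a Python snippet is an assumption made to keep the example self-contained.

```python
import subprocess
import sys
from typing import Optional


def run_with_timeout(code: str, timeout_s: float) -> Optional[str]:
    """Run a Python snippet in a child process (hypothetical helper).

    Returns the child's stripped stdout, or None if it times out or
    exits with a non-zero status.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if done.returncode != 0:
        return None
    return done.stdout.strip()


if __name__ == "__main__":
    print(run_with_timeout("print(1 + 1)", 10.0))
    print(run_with_timeout("import time; time.sleep(30)", 0.5))
```

Isolating the work in a separate process means a stuck parser cannot block the evaluation loop, since the parent can terminate the child and treat the sample as unverified.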


⚙ Enhancements and Refactors

⚙ Evaluation and Runtime Improvements:

  • Supported concurrent inference across tasks (#2403).
  • Reduced RJOB sleep time (#2411).
  • Added new template for CompassAcademic (#2447).
  • Expanded config options of the summarizer group (#2448).
  • Added --no-progress parameter for repeat analysis to suppress detailed logs (#2473).
  • Added prediction postprocessor support for judge models in GenericLLMEvaluator (#2491).
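
As a rough illustration of what running inference concurrently across tasks (#2403) can look like, the sketch below dispatches independent tasks to a thread pool, which suits API-bound work where each task mostly waits on the network. `run_tasks_concurrently`, `run_task`, and `max_workers` are hypothetical names chosen for this example; OpenCompass's own scheduling may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_tasks_concurrently(
    tasks: Sequence[T],
    run_task: Callable[[T], R],
    max_workers: int = 4,
) -> List[R]:
    """Run independent tasks in a thread pool (hypothetical helper).

    Results are returned in the same order as `tasks`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order regardless of completion order.
        return list(pool.map(run_task, tasks))


if __name__ == "__main__":
    # Stand-in for a per-task inference call.
    print(run_tasks_concurrently(["aime", "hmmt"], lambda name: name.upper()))
```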

⚙ RawPromptTemplate Documentation and Compatibility:

  • Added guide for RawPromptTemplate (#2420).
  • Improved RawPromptTemplate coverage across benchmark and dataset configs (#2407, #2414, #2422, #2472, #2474).

⚙ Metadata and Release Updates:

  • Updated dataset index metadata (#2492).
  • Bumped version to 0.5.3 (#2493).

⚙ CI/CD Improvements:

  • Refactored ETE test cases and added more unit tests (#2408).
  • Added inference test cases using mock API (#2428).
  • Added API test cases into PR tests (#2460).
  • Removed host network from CI settings (#2462).

⚙ Documentation:

  • Fixed minor documentation bugs (#2469).

🎉 Welcome New Contributors

A warm welcome and special thanks to our newest contributors who made this release possible:


Full Changelog: 0.5.2...0.5.3

Thank you for using OpenCompass! These updates bring broader benchmark coverage, stronger API support, and more reliable evaluation workflows. Keep exploring, and stay tuned for future innovations! 🌟