v1.5.0
Benchmark Datasets
- Math Evaluation: Added HMMT25 math benchmark
- Code Evaluation: Added CL-bench (Tencent) benchmark
- Fixed LiveCodeBench code extraction to use the last fenced code block
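The LiveCodeBench fix above picks the last fenced code block in a model response, which matters when a response contains an early draft followed by a final answer. A minimal sketch of that idea follows. It is an illustration only, not the project's actual implementation: the function name `extract_last_code_block`, the regular expression, and the fallback of returning the stripped text when no fence is found are all assumptions.

```python
import re

# Build the fence marker programmatically so this example can itself sit
# inside a fenced block without confusing Markdown renderers.
FENCE = "`" * 3

# Opening fence with an optional language tag, then the body, then the closing fence.
# (Hypothetical pattern; the real extractor may differ.)
FENCE_RE = re.compile(FENCE + r"[ \t]*([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_last_code_block(text: str) -> str:
    """Return the body of the last fenced code block in `text`.

    Assumption: if the response has no fenced block, fall back to the
    whole response with surrounding whitespace removed.
    """
    blocks = FENCE_RE.findall(text)
    if not blocks:
        return text.strip()
    _lang, body = blocks[-1]  # last block wins
    return body.strip()


if __name__ == "__main__":
    response = (
        "Draft:\n" + FENCE + "python\nprint('draft')\n" + FENCE + "\n"
        "Final:\n" + FENCE + "python\nprint('final')\n" + FENCE + "\n"
    )
    print(extract_last_code_block(response))
```

Taking the last block rather than the first is a heuristic: it favors the final answer over earlier drafts, at the cost of picking up a trailing usage example if the model appends one after its solution.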
Feature Enhancements
- Judge LLM Type: Added support for specifying the judge LLM type in an evaluation
- Volcengine Sandbox: Added Volcengine sandbox environment support
- Anthropic API: Added support for the Anthropic API
- Performance Testing Progress Bar: Added a tqdm progress bar for perf dataset processing
- Unified Subset Update: Updated the unified subset logic
- Server Demo Update: Improved the server demo
Documentation
- Added detailed documentation for benchmarks
- Updated the eval_type documentation
- Updated the documentation sync script
Bug Fixes
- Fixed handling of empty datasets so that they are skipped
- Fixed the rate type update
- Fixed the incorrect prefix on input_audio (Issue #1152)
What's Changed
- feat: specify judge llm type by @jerryldh in #1157
- feat(benchmark): add HMMT25 math benchmark by @XChen-Zero in #1154
- [Fix] skip empty dataset by @Yunnglin in #1161
- feat: add Volcengine sandbox support by @XChen-Zero in #1160
- [Fix] update rate type by @Yunnglin in #1166
- [Feature] add perf dataset tqdm by @Yunnglin in #1168
- Fix LiveCodeBench extraction: use last fenced code block by @koshieguchi in #1170
- [Doc] benchmarks detail doc by @Yunnglin in #1180
- [Fix] fix input_audio wrong prefix (#1152) by @labAxiaoming in #1174
- feat/Add CL-bench (tencent/CL-bench) benchmark by @XChen-Zero in #1191
- Add anthropic api by @Yunnglin in #1202
- update sync doc script by @Yunnglin in #1205
- [Feature] Update server demo by @Yunnglin in #1215
- [Doc] update eval_type by @Yunnglin in #1217
- [Feature] update unify subset by @Yunnglin in #1220
New Contributors
- @jerryldh made their first contribution in #1157
- @XChen-Zero made their first contribution in #1154
- @koshieguchi made their first contribution in #1170
- @labAxiaoming made their first contribution in #1174
Full Changelog: v1.4.2...v1.5.0