v1.5.0

Yunnglin released this 10 Mar 02:01

68a5df2

Benchmark Datasets

Math Evaluation: Added HMMT25 math benchmark
Code Evaluation: Added CL-bench (Tencent) benchmark
Fixed LiveCodeBench code extraction to use the last fenced code block
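The LiveCodeBench fix above can be illustrated with a minimal sketch. This is not the project's actual implementation; the function name `extract_last_code_block` and the fallback to the raw text are assumptions made for illustration. The idea is that a model response may contain several fenced blocks (drafts, then a final answer), and the last one is the most likely final solution.

```python
import re

# Build the fence marker programmatically so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3

# Opening fence with an optional language tag, then the body, then the closing fence.
_BLOCK_RE = re.compile(FENCE + r"[^\n`]*\n(.*?)" + FENCE, re.DOTALL)


def extract_last_code_block(response: str) -> str:
    """Return the body of the last fenced code block in a model response.

    Hypothetical helper: if no fenced block is found, the stripped
    response is returned unchanged (an assumed fallback).
    """
    blocks = _BLOCK_RE.findall(response)
    if not blocks:
        return response.strip()
    return blocks[-1].strip()


if __name__ == "__main__":
    reply = (
        "First attempt:\n" + FENCE + "python\nprint(1)\n" + FENCE + "\n"
        "Corrected version:\n" + FENCE + "python\nprint(2)\n" + FENCE + "\n"
    )
    print(extract_last_code_block(reply))
```

Taking the first block instead would pick up the draft `print(1)` in this example, which is the kind of mismatch the fix is meant to avoid.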

Feature Enhancements

Judge LLM Type: Added support for specifying the judge LLM type during evaluation
Volcengine Sandbox: Added support for the Volcengine sandbox environment
Anthropic API: Added Anthropic API integration
Performance Testing Progress Bar: Added a tqdm progress bar for perf dataset processing
Unified Subset: Updated the unify subset logic
Server Demo: Improved the server demo

Documentation

Added detailed documentation for benchmarks
Updated the eval_type documentation
Updated the documentation sync script

Bug Fixes

Fixed handling of empty datasets, which are now skipped
Fixed an issue with updating the rate type
Fixed the incorrect prefix on input_audio (Issue #1152)
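As a rough illustration of the empty-dataset fix, the sketch below drops subsets with no samples before evaluation. The function `drop_empty_subsets` and the dict-of-lists layout are hypothetical and do not reflect the project's internal data structures.

```python
from typing import Dict, List


def drop_empty_subsets(subsets: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    """Return only the subsets that contain at least one sample.

    Hypothetical helper: empty subsets are reported and skipped so that
    later steps (scoring, averaging) never see a zero-length dataset.
    """
    kept: Dict[str, List[dict]] = {}
    for name, samples in subsets.items():
        if not samples:
            print(f"Skipping empty subset: {name}")
            continue
        kept[name] = samples
    return kept


if __name__ == "__main__":
    data = {"algebra": [{"question": "1+1"}], "geometry": []}
    print(list(drop_empty_subsets(data)))
```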

What's Changed

feat: specify judge llm type by @jerryldh in #1157
feat(benchmark): add HMMT25 math benchmark by @XChen-Zero in #1154
[Fix] skip empty dataset by @Yunnglin in #1161
feat: add Volcengine sandbox support by @XChen-Zero in #1160
[Fix] update rate type by @Yunnglin in #1166
[Feature] add perf dataset tqdm by @Yunnglin in #1168
Fix LiveCodeBench extraction: use last fenced code block by @koshieguchi in #1170
[Doc] benchmarks detail doc by @Yunnglin in #1180
[Fix] fix input_audio wrong prefix (#1152) by @labAxiaoming in #1174
feat/Add CL-bench (tencent/CL-bench) benchmark by @XChen-Zero in #1191
Add anthropic api by @Yunnglin in #1202
update sync doc script by @Yunnglin in #1205
[Feature] Update server demo by @Yunnglin in #1215
[Doc] update eval_type by @Yunnglin in #1217
[Feature] update unify subset by @Yunnglin in #1220

New Contributors

@jerryldh made their first contribution in #1157
@XChen-Zero made their first contribution in #1154
@koshieguchi made their first contribution in #1170
@labAxiaoming made their first contribution in #1174

Full Changelog: v1.4.2...v1.5.0