v1.5.1
中文版
基准测试数据集
- 新增 AIME 2026 数学竞赛基准测试
- 新增 MMMLU 多语言大规模多任务理解基准测试
- 新增 LongBench v2 长文本理解基准测试
功能增强
- 性能测试: 新增获取基准端点(get benchmark endpoint)功能,修复测试连接参数配置
- 性能测试: 优化 SLA 自动调优(SLA auto tune)功能
- 评测服务: 支持以表格形式返回评测结果,修复分析统计相关问题
- Judge 模型: 支持为 Judge LLM 配置 model_args 参数
- 请求追踪: 支持打印 request id,便于请求追踪与调试
问题修复
- 修复 eval() 安全性问题,替换为 ast.literal_eval() 处理字符串参数解析
- 修复性能测试(perf)tokenize 及空子集(empty subset)相关问题
- 修复数据集 shuffle 随机性问题,使用带种子的 random.Random 确保可复现性
English Version
Benchmark Datasets
- Added AIME 2026 math competition benchmark
- Added MMMLU (Multilingual Massive Multitask Language Understanding) benchmark
- Added LongBench v2 for long-context understanding evaluation
Feature Enhancements
- Performance Testing: Added get benchmark endpoint and fixed test connection parameter configuration
- Performance Testing: Added SLA auto-tune functionality
- Evaluation Service: Added support for returning results in table format and fixed analysis bugs
- Judge Model: Added model_args support for judge LLM configuration
- Request Tracking: Added request ID printing for better request tracing and debugging
Bug Fixes
- Fixed a security issue by replacing eval() with ast.literal_eval() for string argument parsing
- Fixed perf tokenize and empty subset related issues
- Fixed dataset shuffling reproducibility by using seeded random.Random
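
The first and last fixes above rely on two standard-library techniques, which the following minimal sketch illustrates. It is not the project's actual code: the helper names parse_value and shuffled, the fallback-to-string behavior, and the default seed of 42 are assumptions made for illustration only.

```python
import ast
import random


def parse_value(text):
    """Parse a command-line string into a Python literal without executing code.

    Hypothetical helper: unlike eval(), ast.literal_eval() accepts only
    literals (numbers, strings, tuples, lists, dicts, sets, booleans, None).
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Assumption: anything that is not a literal (for example a
        # function call) is kept as a plain string instead of being run.
        return text


def shuffled(items, seed=42):
    """Return a shuffled copy of items using a private, seeded generator.

    Hypothetical helper: random.Random(seed) has its own state, so the
    result does not depend on, or disturb, the global random module state.
    """
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result


print(parse_value("{'temperature': 0.7}"))       # parsed into a dict
print(parse_value("__import__('os').getcwd()"))  # left as a string, never executed
print(shuffled(range(5)) == shuffled(range(5)))  # True: same seed, same order
```

Using a dedicated random.Random instance rather than calling random.seed() keeps the shuffle reproducible even when other code draws from the global generator in between.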
What's Changed
- Replace eval() with ast.literal_eval() in ParseStrArgsAction by @RinZ27 in #1221
- [Fix] update kontext_bench, refcoco by @Yunnglin in #1229
- [Benchmark]Add aime26 and SLA auto tune by @Yunnglin in #1230
- [Fix] perf test connection args and add get benchmark endpoint by @Yunnglin in #1232
- [Benchmark] add mmmlu by @Yunnglin in #1235
- [Fix]perf tokenize and empty subset by @Yunnglin in #1236
- [Benchmark] Add longbench_v2 by @Yunnglin in #1237
- [Feature] Return table for service and fix analysis bug by @Yunnglin in #1240
- Add model_args for judge LLM by @haihongtran in #1241
- print request id by @strenuous-life in #1242
- [Fix] Use seeded random.Random for dataset shuffling by @pcabriada in #1243
New Contributors
- @RinZ27 made their first contribution in #1221
- @haihongtran made their first contribution in #1241
- @strenuous-life made their first contribution in #1242
- @pcabriada made their first contribution in #1243
Full Changelog: v1.5.0...v1.5.1