v1.5.1

Yunnglin released this 23 Mar 07:29

e7c85f7

English Version

Benchmark Datasets

Added AIME 2026 math competition benchmark
Added MMMLU (Multilingual Massive Multitask Language Understanding) benchmark
Added LongBench v2 for long-context understanding evaluation

Feature Enhancements

Performance Testing: Added a get benchmark endpoint and fixed the test connection argument configuration
Performance Testing: Added SLA auto-tune
Evaluation Service: Added support for returning results in table format and fixed an analysis bug
Judge Model: Added model_args support for judge LLM configuration
Request Tracking: Added request ID printing to aid request tracing and debugging

Bug Fixes

Fixed a security issue by replacing eval() with ast.literal_eval() when parsing string arguments
Fixed issues with perf tokenization and empty subsets
Made dataset shuffling reproducible by using a seeded random.Random
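
The shuffling fix above relies on a general Python pattern: a private random.Random instance created from a fixed seed yields the same order on every run, regardless of what other code does to the global random state. The sketch below illustrates that pattern only; the helper name shuffled and the default seed are assumptions made for this example, not names taken from the project's code.

```python
import random


def shuffled(items, seed=42):
    """Return a shuffled copy of items using a private, seeded generator.

    Hypothetical helper for illustration; not the project's implementation.
    """
    rng = random.Random(seed)  # independent of the module-level random state
    out = list(items)          # copy so the caller's sequence is left untouched
    rng.shuffle(out)
    return out


samples = list(range(10))
first = shuffled(samples, seed=42)

random.seed(999)  # perturbing the global generator has no effect on the helper
second = shuffled(samples, seed=42)

print(first == second)  # the two orders are identical
```

Keeping the generator local, rather than calling random.seed() globally, means the reproducibility of one dataset's order does not depend on how many random calls happened earlier in the process.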

What's Changed

Replace eval() with ast.literal_eval() in ParseStrArgsAction by @RinZ27 in #1221
[Fix] update kontext_bench, refcoco by @Yunnglin in #1229
[Benchmark]Add aime26 and SLA auto tune by @Yunnglin in #1230
[Fix] perf test connection args and add get benchmark endpoint by @Yunnglin in #1232
[Benchmark] add mmmlu by @Yunnglin in #1235
[Fix]perf tokenize and empty subset by @Yunnglin in #1236
[Benchmark] Add longbench_v2 by @Yunnglin in #1237
[Feature] Return table for service and fix analysis bug by @Yunnglin in #1240
Add model_args for judge LLM by @haihongtran in #1241
print request id by @strenuous-life in #1242
[Fix] Use seeded random.Random for dataset shuffling by @pcabriada in #1243
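
To illustrate the kind of change described in #1221, here is a minimal sketch of an argparse action that parses key=value strings with ast.literal_eval() instead of eval(). The class name LiteralArgsAction, the --model-args flag, and the fallback-to-string behavior are assumptions made for this example; the actual ParseStrArgsAction in the project may differ.

```python
import argparse
import ast


class LiteralArgsAction(argparse.Action):
    """Hypothetical action: turn 'key=value' items into a dict without eval()."""

    def __call__(self, parser, namespace, values, option_string=None):
        parsed = {}
        for item in values:
            key, sep, raw = item.partition('=')
            if not sep:
                parser.error(f'expected key=value, got {item!r}')
            try:
                # literal_eval accepts only Python literals (numbers, strings,
                # lists, dicts, ...) and never executes function calls.
                parsed[key] = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                parsed[key] = raw  # not a literal: keep the raw string
        setattr(namespace, self.dest, parsed)


parser = argparse.ArgumentParser()
parser.add_argument('--model-args', nargs='+', action=LiteralArgsAction, default={})

args = parser.parse_args([
    '--model-args',
    'temperature=0.7',
    'max_tokens=512',
    'name=qwen',
    'cmd=__import__("os").getcwd()',  # would run under eval(); stays a string here
])
print(args.model_args)
```

With eval(), the last argument would have been executed as code. With ast.literal_eval() it is rejected as a non-literal and, in this sketch, kept as an inert string.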

New Contributors

@RinZ27 made their first contribution in #1221
@haihongtran made their first contribution in #1241
@strenuous-life made their first contribution in #1242
@pcabriada made their first contribution in #1243

Full Changelog: v1.5.0...v1.5.1
