v1.5.1

@Yunnglin Yunnglin released this 23 Mar 07:29
· 17 commits to release/1.5 since this release

English Version

Benchmark Datasets

  • Added AIME 2026 math competition benchmark
  • Added MMMLU (Multilingual Massive Multitask Language Understanding) benchmark
  • Added LongBench v2 for long-context understanding evaluation

Feature Enhancements

  • Performance Testing: Added a get benchmark endpoint and fixed the test connection parameter configuration
  • Performance Testing: Improved SLA auto-tune functionality
  • Evaluation Service: Added support for returning evaluation results in table format and fixed analysis statistics issues
  • Judge Model: Added model_args support for judge LLM configuration
  • Request Tracking: Added request ID printing for better request tracing and debugging

Bug Fixes

  • Fixed a security issue by replacing eval() with ast.literal_eval() for string argument parsing
  • Fixed issues with perf tokenization and empty subsets
  • Fixed non-reproducible dataset shuffling by using a seeded random.Random instance
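
As an illustration of the two standard-library techniques named in the fixes above, here is a minimal sketch. The helper names `parse_arg` and `shuffled`, the default seed, and the fallback behavior are assumptions made for this example; they are not taken from the project's code.

```python
import ast
import random


def parse_arg(value: str):
    """Parse a string argument into a Python literal (hypothetical helper).

    ast.literal_eval only accepts literals such as dicts, lists, numbers,
    and strings. Unlike eval(), it never calls functions or runs arbitrary
    expressions. Anything it rejects is returned unchanged as a string.
    """
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return value


def shuffled(items, seed: int = 42):
    """Return a shuffled copy of items using a private, seeded RNG (hypothetical helper).

    A dedicated random.Random instance does not depend on, or disturb,
    the global random state, so the same seed gives the same order.
    """
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out


if __name__ == "__main__":
    print(parse_arg("{'temperature': 0.0, 'max_tokens': 512}"))  # parsed into a dict
    print(parse_arg("__import__('os').getcwd()"))  # not executed, returned as a string
    print(shuffled(range(10), seed=0) == shuffled(range(10), seed=0))
```

In this sketch, falling back to the raw string keeps plain values such as model names usable as arguments, while any input that is not a literal is never executed.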

Full Changelog: v1.5.0...v1.5.1