v1.5.0

@Yunnglin Yunnglin released this 10 Mar 02:01
· 32 commits to release/1.5 since this release

English Version

Benchmark Datasets

  • Math Evaluation: Added HMMT25 math benchmark
  • Code Evaluation: Added CL-bench (Tencent) benchmark
  • Fixed LiveCodeBench code extraction to use the last fenced code block
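The LiveCodeBench change above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `extract_last_code_block`, the regular expression, and the fallback to the raw response when no fence is present are all assumptions made for illustration. The idea is that a model often revises its answer within one response, so the final fenced block is the one most likely to hold the intended solution.

```python
import re

# Matches a fenced block: three backticks, an optional language tag,
# a newline, the body (captured lazily), then the closing backticks.
FENCE_RE = re.compile(r"```[ \t]*([\w+-]*)[ \t]*\n(.*?)```", re.DOTALL)


def extract_last_code_block(response: str) -> str:
    """Return the body of the last fenced code block in `response`.

    Hypothetical helper: if the response contains no fenced block,
    the stripped response is returned unchanged (an assumed fallback).
    """
    blocks = FENCE_RE.findall(response)
    if not blocks:
        return response.strip()
    # Each match is a (language_tag, body) tuple; keep the last body.
    return blocks[-1][1].strip()


if __name__ == "__main__":
    reply = (
        "First attempt:\n```python\nprint(1)\n```\n"
        "Corrected version:\n```python\nprint(2)\n```\n"
    )
    print(extract_last_code_block(reply))
```

Taking the first block instead would return the discarded first attempt in the example reply, which is the kind of mismatch this fix addresses.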

Feature Enhancements

  • Judge LLM Type: Added support for specifying judge LLM type in evaluation
  • Volcengine Sandbox: Added Volcengine sandbox environment support
  • Anthropic API: Added Anthropic API integration support
  • Performance Testing Progress Bar: Added tqdm progress display for perf dataset processing
  • Unified Subset Update: Updated the unify subset logic
  • Server Demo Update: Optimized server demo presentation

Documentation

  • Added benchmark detail documentation
  • Updated eval_type related documentation
  • Updated doc sync script

Bug Fixes

  • Fixed an issue with how empty datasets are skipped
  • Fixed an issue with rate type updates
  • Fixed an incorrect prefix on input_audio (Issue #1152)

Full Changelog: v1.4.2...v1.5.0