v1.7.1

@Yunnglin Yunnglin released this 18 May 05:30

English Version

Benchmark Datasets

  • Multimodal Evaluation: Added AIR-Bench benchmark support
  • Video Evaluation: Added native video benchmark support, including MVBench and Video-MME-v2

Feature Enhancements

  • WebUI and Multi-turn Evaluation: Refactored WebUI and improved multi-turn evaluation capabilities
  • Model Gateway Support: Added LiteLLM as an AI gateway provider
  • Performance Testing: Added perf swe-smith benchmarking capability
  • Performance Testing: Added perf warmup support
  • Custom Dataset: Added support for line_by_line_oai custom dataset mode
  • Evaluation Capability: Added multi-mcq support
  • Agent Capability: Added agent loop support
  • Evaluation Asset Management: Added local-only loading switch for T2V metric assets
  • Sandbox Support: Updated volcengine sandbox support
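The perf warmup item above refers to a common load-testing idea: send a few requests before measurement starts so that cold-start effects (connection setup, cache fills, lazy initialization) do not skew the reported latencies. The sketch below illustrates that general idea only. It is not this project's implementation, and the names `run_with_warmup`, `send_request`, `warmup`, and `measured` are hypothetical, chosen for illustration.

```python
import time
from typing import Callable, List


def run_with_warmup(
    send_request: Callable[[], None], warmup: int, measured: int
) -> List[float]:
    """Return per-request latencies in seconds, excluding warmup requests.

    Hypothetical helper: the name and parameters are assumptions,
    not part of any real tool's API.
    """
    # Warmup phase: issue requests but discard their timings.
    for _ in range(warmup):
        send_request()

    # Measured phase: only these timings are reported.
    latencies: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in for a real request; it only counts how often it was called.
calls = {"count": 0}


def fake_request() -> None:
    calls["count"] += 1


latencies = run_with_warmup(fake_request, warmup=3, measured=5)
print(len(latencies), calls["count"])
```

Here the stand-in request is called eight times in total, but only the five measured calls contribute a latency sample.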

Documentation

  • Updated Web-related documentation

Bug Fixes

  • Fixed issues in perf swe-smith
  • Fixed issues with aigc device args
  • Fixed an issue in the IFBench evaluation logic

Full Changelog: v1.6.1...v1.7.1