v1.7.1

Yunnglin released this 18 May 05:30

aa53e64

Benchmark Datasets

Multimodal Evaluation: Added AIR-Bench benchmark support
Video Evaluation: Added native video benchmark support, including MVBench and Video-MME-v2

Feature Enhancements

WebUI and Multi-turn Evaluation: Refactored WebUI and improved multi-turn evaluation capabilities
Model Gateway Support: Added LiteLLM as an AI gateway provider
Performance Testing: Added perf swe-smith benchmarking capability
Performance Testing: Added perf warmup support
Custom Dataset: Added support for line_by_line_oai custom dataset mode
Evaluation Capability: Added multi-mcq support
Agent Capability: Added agent loop support
Evaluation Asset Management: Added local-only loading switch for T2V metric assets
Sandbox Support: Updated volcengine sandbox support
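
The perf warmup entry above does not say how warmup is implemented in this release. As a rough illustration of the general idea only, the sketch below sends a few untimed requests before the measured ones, so that cold-start effects such as connection setup or cache fills do not skew the reported latencies. The names `run_with_warmup`, `send_request`, and `fake_request` are hypothetical and are not taken from the project's API.

```python
import time
from typing import Callable, List


def run_with_warmup(send_request: Callable[[], None], warmup: int, measured: int) -> List[float]:
    """Send `warmup` untimed requests, then return latencies (seconds) of `measured` timed ones.

    Hypothetical helper for illustration; not the project's implementation.
    """
    for _ in range(warmup):
        send_request()  # result and timing discarded: only primes the target

    latencies: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


# Stand-in for a real request; it only counts how often it is called.
call_count = {"n": 0}


def fake_request() -> None:
    call_count["n"] += 1
    time.sleep(0.001)
```

Calling `run_with_warmup(fake_request, warmup=3, measured=5)` issues eight requests in total but returns only five latency samples, which is the point of a warmup phase.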

Documentation

Updated Web-related documentation

Bug Fixes

Fixed perf swe-smith issues, including loading and multi-process handling
Fixed issues with aigc device args
Fixed an issue in the IFBench evaluation logic

What's Changed

[Feature]Refact webui and multi-turn eval by @Yunnglin in #1315
feat: add LiteLLM as AI gateway provider by @RheagalFire in #1317
[Feature]Add perf swe-smith by @Yunnglin in #1321
[Fix] perf swe-smith by @Yunnglin in #1322
[Fix] perf swe-smith load by @Yunnglin in #1323
[Fix] perf swe multi process by @Yunnglin in #1324
[Feature]Update web and docs by @Yunnglin in #1326
[Feature] Add perf warmup by @Yunnglin in #1329
[Fix]update aigc device args by @Yunnglin in #1336
Add local-only loading switch for T2V metric assets by @AuFlow in #1334
[update] volcengine sandbox by @Yunnglin in #1337
[Feature]Add multi-mcq by @Yunnglin in #1345
[Feature] Add AIR-Bench benchmark support by @haoruilee in #1341
support line_by_line_oai custom dataset mode by @llc-kc in #1342
[Feature] Add native video benchmark support with MVBench and Video-MME-v2 by @haoruilee in #1343
[Feature]Add agent loop by @Yunnglin in #1344
[Fix] ifbench eval logic by @Yunnglin in #1346

New Contributors

@RheagalFire made their first contribution in #1317
@AuFlow made their first contribution in #1334
@haoruilee made their first contribution in #1341
@llc-kc made their first contribution in #1342

Full Changelog: v1.6.1...v1.7.1
