v1.7.1
Benchmark Datasets
- Multimodal Evaluation: Added AIR-Bench benchmark support
- Video Evaluation: Added native video benchmark support, including MVBench and Video-MME-v2
Feature Enhancements
- WebUI and Multi-turn Evaluation: Refactored the WebUI and improved multi-turn evaluation
- Model Gateway Support: Added LiteLLM as an AI gateway provider
- Performance Testing: Added perf swe-smith benchmarking capability
- Performance Testing: Added perf warmup support
- Custom Dataset: Added support for line_by_line_oai custom dataset mode
- Evaluation Capability: Added multi-mcq support
- Agent Capability: Added agent loop support
- Evaluation Asset Management: Added a local-only loading switch for T2V metric assets
- Sandbox Support: Updated volcengine sandbox support
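
For readers unfamiliar with warmup in performance testing, the general idea can be sketched as follows: a few requests are sent first and their timings discarded, so cold-start effects do not skew the measured latencies. This is a minimal illustration only, not the project's implementation; `run_benchmark`, `send_request`, and the `warmup`/`measured` parameters are hypothetical names chosen for this sketch.

```python
import time
from statistics import mean
from typing import Callable, List


def run_benchmark(send_request: Callable[[], None], warmup: int, measured: int) -> List[float]:
    """Return per-request latencies in seconds, excluding warmup requests.

    All names here are hypothetical and only illustrate the warmup idea.
    """
    # Warmup phase: issue requests but discard their timings.
    for _ in range(warmup):
        send_request()

    # Measured phase: record the latency of each request.
    latencies: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    calls = {"count": 0}

    def fake_request() -> None:
        # Stand-in for a model call: the first call simulates a slow cold start.
        calls["count"] += 1
        time.sleep(0.05 if calls["count"] == 1 else 0.001)

    latencies = run_benchmark(fake_request, warmup=2, measured=5)
    print(f"mean latency over {len(latencies)} requests: {mean(latencies) * 1000:.2f} ms")
```

In this sketch the slow first call falls inside the warmup phase, so it does not appear in the reported latencies.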
Documentation
- Updated Web-related documentation
Bug Fixes
- Fixed several perf swe-smith issues, including loading and multi-process runs
- Fixed aigc device argument handling
- Fixed an issue in the IFBench evaluation logic
What's Changed
- [Feature]Refact webui and multi-turn eval by @Yunnglin in #1315
- feat: add LiteLLM as AI gateway provider by @RheagalFire in #1317
- [Feature]Add perf swe-smith by @Yunnglin in #1321
- [Fix] perf swe-smith by @Yunnglin in #1322
- [Fix] perf swe-smith load by @Yunnglin in #1323
- [Fix] perf swe multi process by @Yunnglin in #1324
- [Feature]Update web and docs by @Yunnglin in #1326
- [Feature] Add perf warmup by @Yunnglin in #1329
- [Fix]update aigc device args by @Yunnglin in #1336
- Add local-only loading switch for T2V metric assets by @AuFlow in #1334
- [update] volcengine sandbox by @Yunnglin in #1337
- [Feature]Add multi-mcq by @Yunnglin in #1345
- [Feature] Add AIR-Bench benchmark support by @haoruilee in #1341
- support line_by_line_oai custom dataset mode by @llc-kc in #1342
- [Feature] Add native video benchmark support with MVBench and Video-MME-v2 by @haoruilee in #1343
- [Feature]Add agent loop by @Yunnglin in #1344
- [Fix] ifbench eval logic by @Yunnglin in #1346
New Contributors
- @RheagalFire made their first contribution in #1317
- @AuFlow made their first contribution in #1334
- @haoruilee made their first contribution in #1341
- @llc-kc made their first contribution in #1342
Full Changelog: v1.6.1...v1.7.1