English Version
Benchmark Datasets
- Agent Evaluation: Added SWE-Bench Pro, Tau3-Bench, GAIA, and Terminal-Bench v2.1 for agent capability assessment
- General Evaluation: Added ArxivRollBench for academic paper comprehension
- Vendor Verifier: Added k2, kimi, and minimax vendor verifier benchmarks
Feature Enhancements
- OpenAI Responses API: Added OpenAI Responses API support (Issue #1192)
- API Reranker Evaluation: Added support for API reranker evaluation (Issue #1029)
- Agent Bridge: Added Agent Bridge functionality and updated the WebUI
- MCP Server Support: Added MCP server configuration for NativeAgentConfig
- Image Compression: Added configurable image compression for VLM benchmarks
- Perf - Trie Replay: Added trie agentic trace replay, Turn model, and --duration parameter
- Perf - SwanLab: Added swanlab_host support for self-hosted SwanLab deployments
Documentation
- Unified agent instructions into AGENTS.md
Bug Fixes
- Fixed the build_docker_images check so that it always subsets the dataset first (#1348)
- Fixed max_position_embeddings AttributeError in tokenizer loading (#1354)
- Migrated livecodebench to the parquet dataset on ModelScope (#1357)
- Fixed the repeats parameter having no effect in DatasetDict.from_dataset (#1363)
- Fixed unstable realised QPS in open-loop / rate-paced benchmarks (#1367)
- Fixed integer fields not being rounded after SLA multi-run averaging (#1370)
- Fixed EnclaveAgentEnvironment not failing fast when ms_enclave is missing (#1372)
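To illustrate the kind of problem behind the SLA rounding fix (#1370): averaging a count-like field across several runs generally produces a float, so integer fields need to be rounded back after averaging. The following is a minimal sketch of that idea only, not the project's actual implementation. The field names (`concurrency`, `total_requests`, `latency_s`) and the helper `average_runs` are hypothetical and chosen purely for illustration.

```python
from statistics import mean

# Hypothetical field names; the real SLA result schema may differ.
INT_FIELDS = {"concurrency", "total_requests"}


def average_runs(runs):
    """Average each metric across runs, rounding fields that must stay integers."""
    averaged = {}
    for key in runs[0]:
        value = mean(run[key] for run in runs)
        # Averaging can turn a count into a float, so round it and cast back to int.
        averaged[key] = int(round(value)) if key in INT_FIELDS else value
    return averaged


# Three illustrative runs (made-up numbers, not measured results).
runs = [
    {"concurrency": 8, "total_requests": 100, "latency_s": 0.5},
    {"concurrency": 8, "total_requests": 101, "latency_s": 0.75},
    {"concurrency": 8, "total_requests": 103, "latency_s": 1.0},
]
result = average_runs(runs)
print(result)
```

Keeping the set of integer fields explicit makes it clear which metrics are counts and which are genuinely continuous, such as latency, and should remain floats.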
What's Changed
- core: fix build_docker_images check to always subset dataset first by @liguodongiot in #1348
- [Update] old agent benchmarks by @Yunnglin in #1349
- perf: support swanlab_host for self-hosted SwanLab deployments by @Yunnglin in #1350
- [Benchmark] Add SWE-Bench pro and Tau3-Bench by @Yunnglin in #1351
- Add OpenAI Responses API support, solve #1192 by @haoruilee in #1352
- docs: unify agent instructions into AGENTS.md by @Yunnglin in #1353
- fix: handle max_position_embeddings AttributeError in tokenizer loading by @Yunnglin in #1354
- feat: add configurable image compression for VLM benchmarks by @Yunnglin in #1355
- fix: migrate livecodebench to parquet dataset on ModelScope by @Yunnglin in #1357
- [Fix] Fix ineffective repeats in DatasetDict.from_dataset by @we1sper in #1363
- fix(perf): stabilise realised QPS in open-loop / rate-paced benchmarks by @Syqinx in #1367
- [Feature] Add agent bridge and webui update by @Yunnglin in #1364
- fix(agent/sandbox): fail-fast in EnclaveAgentEnvironment when ms_enclave is missing by @Yunnglin in #1372
- feat: GAIA benchmark + MCP server support for NativeAgentConfig by @Yunnglin in #1371
- [Benchmark] Add ArxivRollBench by @liangzid in #1365
- fix(perf): round int fields after sla multi-run averaging by @qiumuyang in #1370
- feat(benchmarks): add k2/kimi/minimax vendor verifier benchmarks by @Yunnglin in #1375
- feat(perf): trie agentic trace replay + Turn model + --duration by @Yunnglin in #1374
- feat(benchmarks): add Terminal-Bench v2.1 + upgrade harbor integration by @Yunnglin in #1376
- feat: support API reranker evaluation, fix #1029 by @haoruilee in #1377
New Contributors
- @liguodongiot made their first contribution in #1348
- @we1sper made their first contribution in #1363
- @Syqinx made their first contribution in #1367
- @liangzid made their first contribution in #1365
- @qiumuyang made their first contribution in #1370
Full Changelog: v1.7.1...v1.8.0