v1.8.0

@Yunnglin released this 28 May 08:48

Benchmark Datasets

  • Agent Evaluation: Added SWE-Bench Pro, Tau3-Bench, GAIA, and Terminal-Bench v2.1 for agent capability assessment
  • General Evaluation: Added ArxivRollBench for academic paper comprehension
  • Vendor Verifier: Added k2, kimi, minimax vendor verifier benchmarks

Feature Enhancements

  • OpenAI Responses API: Added OpenAI Responses API support (Issue #1192)
  • API Reranker Evaluation: Added support for API reranker evaluation (Issue #1029)
  • Agent Bridge: Added Agent Bridge functionality and updated the WebUI
  • MCP Server Support: NativeAgentConfig now supports configuring MCP servers
  • Image Compression: Added configurable image compression for VLM benchmarks
  • Perf - Trie Replay: Added trie agentic trace replay, Turn model, and --duration parameter
  • Perf - SwanLab: Added swanlab_host support for self-hosted SwanLab deployments

Documentation

  • Unified agent instructions into AGENTS.md

Bug Fixes

  • Fixed the build_docker_images check so that it always subsets the dataset first (#1348)
  • Fixed a max_position_embeddings AttributeError during tokenizer loading (#1354)
  • Migrated livecodebench to the Parquet dataset on ModelScope (#1357)
  • Fixed the repeats parameter having no effect in DatasetDict.from_dataset (#1363)
  • Fixed unstable realised QPS in open-loop / rate-paced perf benchmarks (#1367)
  • Fixed integer fields not being rounded after SLA multi-run averaging (#1370)
  • EnclaveAgentEnvironment now fails fast when ms_enclave is missing (#1372)
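
To illustrate the kind of issue addressed in #1370: averaging a metric over several runs turns integer counts into floats unless they are rounded afterwards. The sketch below is not the project's implementation; the function `average_runs` and the field names are hypothetical, chosen only to show the idea.

```python
from statistics import mean


def average_runs(runs):
    """Average each field across runs, keeping integer fields integral.

    `runs` is a list of dicts with identical keys (hypothetical shape).
    """
    if not runs:
        raise ValueError("need at least one run")
    averaged = {}
    for key in runs[0]:
        values = [run[key] for run in runs]
        avg = mean(values)
        # Treat a field as integer-typed only if every run reported an int
        # (bool is excluded because it is a subclass of int).
        all_int = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        # round() with one argument returns an int (ties go to the even value).
        averaged[key] = round(avg) if all_int else avg
    return averaged


# Hypothetical per-run results; the field names are illustrative only.
runs = [
    {"concurrency": 8, "total_requests": 100, "avg_latency_s": 0.50},
    {"concurrency": 8, "total_requests": 101, "avg_latency_s": 0.75},
    {"concurrency": 8, "total_requests": 103, "avg_latency_s": 1.00},
]
summary = average_runs(runs)
print(summary)
```

Here `total_requests` averages to roughly 101.33 and is reported as the integer 101, while `avg_latency_s` stays a float.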

What's Changed

  • core: fix build_docker_images check to always subset dataset first by @liguodongiot in #1348
  • [Update] old agent benchmarks by @Yunnglin in #1349
  • perf: support swanlab_host for self-hosted SwanLab deployments by @Yunnglin in #1350
  • [Benchmark] Add SWE-Bench pro and Tau3-Bench by @Yunnglin in #1351
  • Add OpenAI Responses API support, solve #1192 by @haoruilee in #1352
  • docs: unify agent instructions into AGENTS.md by @Yunnglin in #1353
  • fix: handle max_position_embeddings AttributeError in tokenizer loading by @Yunnglin in #1354
  • feat: add configurable image compression for VLM benchmarks by @Yunnglin in #1355
  • fix: migrate livecodebench to parquet dataset on ModelScope by @Yunnglin in #1357
  • [Fix] Fix ineffective repeats in DatasetDict.from_dataset by @we1sper in #1363
  • fix(perf): stabilise realised QPS in open-loop / rate-paced benchmarks by @Syqinx in #1367
  • [Feature] Add agent bridge and webui update by @Yunnglin in #1364
  • fix(agent/sandbox): fail-fast in EnclaveAgentEnvironment when ms_enclave is missing by @Yunnglin in #1372
  • feat: GAIA benchmark + MCP server support for NativeAgentConfig by @Yunnglin in #1371
  • [Benchmark] Add ArxivRollBench by @liangzid in #1365
  • fix(perf): round int fields after sla multi-run averaging by @qiumuyang in #1370
  • feat(benchmarks): add k2/kimi/minimax vendor verifier benchmarks by @Yunnglin in #1375
  • feat(perf): trie agentic trace replay + Turn model + --duration by @Yunnglin in #1374
  • feat(benchmarks): add Terminal-Bench v2.1 + upgrade harbor integration by @Yunnglin in #1376
  • feat: support API reranker evaluation, fix #1029 by @haoruilee in #1377
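
For context on the open-loop / rate-paced QPS fix (#1367): a common reason realised QPS drifts below the target is sleeping a fixed interval after each request, so per-request overhead accumulates. One general remedy, sketched below, is to pace against absolute deadlines computed from the start time. This is a minimal illustration under that assumption, not the change made in the PR; `send_times` and `run_open_loop` are hypothetical names.

```python
import time


def send_times(rate_qps, count, start=0.0):
    """Return absolute send deadlines for a fixed-rate open-loop schedule."""
    if rate_qps <= 0:
        raise ValueError("rate_qps must be positive")
    interval = 1.0 / rate_qps
    # Each deadline is derived from the start time, not from the previous
    # send, so a slow iteration does not shift every later request.
    return [start + i * interval for i in range(count)]


def run_open_loop(send, rate_qps, count, clock=time.monotonic, sleep=time.sleep):
    """Call send(i) at each deadline, sleeping only for the time remaining.

    In a real open-loop benchmark `send` should dispatch the request without
    waiting for the response; here it is just a callable.
    """
    start = clock()
    for i, deadline in enumerate(send_times(rate_qps, count, start)):
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)
        send(i)


# Simulated clock so the example runs instantly and deterministically.
now = [0.0]
sent = []


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    now[0] += seconds


def fake_send(i):
    sent.append((i, now[0]))
    now[0] += 0.03  # pretend each dispatch costs 30 ms


run_open_loop(fake_send, rate_qps=10, count=5, clock=fake_clock, sleep=fake_sleep)
```

Despite the simulated 30 ms dispatch cost, the requests go out at multiples of 0.1 s, because the sleep only covers the time left until the next deadline.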

Full Changelog: v1.7.1...v1.8.0