Release v1.8.0 · modelscope/evalscope

English Version

Benchmark Datasets

Agent Evaluation: Added SWE-Bench Pro, Tau3-Bench, GAIA, and Terminal-Bench v2.1 for agent capability assessment
General Evaluation: Added ArxivRollBench for academic paper comprehension
Vendor Verifier: Added k2, kimi, and minimax vendor verifier benchmarks

Feature Enhancements

OpenAI Responses API: Added OpenAI Responses API support (Issue #1192)
API Reranker Evaluation: Added support for API reranker evaluation (Issue #1029)
Agent Bridge: Added Agent Bridge functionality and updated the WebUI
MCP Server Support: Added MCP server configuration for NativeAgentConfig
Image Compression: Added configurable image compression for VLM benchmarks
Perf - Trie Replay: Added trie agentic trace replay, Turn model, and --duration parameter
Perf - SwanLab: Added swanlab_host support for self-hosted SwanLab deployments
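
To illustrate what the OpenAI Responses API entry above involves on the wire, here is a minimal sketch of the request and response shapes, assuming the public `POST /v1/responses` format. The helper names `build_responses_request` and `extract_output_text` are hypothetical and are not evalscope functions; how evalscope wires this internally is not described in these notes.

```python
from typing import Any


def build_responses_request(model: str, messages: list[dict[str, str]]) -> dict[str, Any]:
    """Build a JSON body for POST /v1/responses from chat-style messages.

    Hypothetical helper: the Responses API takes `input` where the
    Chat Completions API takes `messages`.
    """
    return {
        "model": model,
        "input": [{"role": m["role"], "content": m["content"]} for m in messages],
    }


def extract_output_text(response: dict[str, Any]) -> str:
    """Concatenate every output_text part found in the response's output items."""
    parts: list[str] = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as reasoning or tool calls
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)
```

The main difference from Chat Completions is that the answer is spread over typed output items rather than a single `choices[0].message.content` string, so a client has to filter and join the text parts.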

Documentation

Unified agent instructions into AGENTS.md

Bug Fixes

Fixed the build_docker_images check so that the dataset is always subset first (#1348)
Fixed a max_position_embeddings AttributeError during tokenizer loading (#1354)
Migrated livecodebench to the parquet dataset on ModelScope (#1357)
Fixed the repeats parameter having no effect in DatasetDict.from_dataset (#1363)
Fixed unstable realised QPS in open-loop / rate-paced benchmarks (#1367)
Fixed integer fields not being rounded after SLA multi-run averaging (#1370)
Fixed EnclaveAgentEnvironment not failing fast when ms_enclave is missing (#1372)
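
For readers unfamiliar with the open-loop / rate-paced QPS fix (#1367), the sketch below shows one common way to keep the realised request rate stable: schedule each request against an absolute deadline computed from the start time, rather than sleeping a fixed interval after each send. This is an illustrative technique only, not evalscope's actual implementation; `send_offsets` and `run_open_loop` are hypothetical names.

```python
import asyncio
import time
from typing import Awaitable, Callable


def send_offsets(rate: float, total: int) -> list[float]:
    """Return the target launch time (seconds from start) of each request."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [i / rate for i in range(total)]


async def run_open_loop(
    send: Callable[[int], Awaitable[None]], rate: float, total: int
) -> list[float]:
    """Launch send(i) at absolute deadlines and return the actual launch offsets.

    Requests are fired as background tasks, so a slow response never delays
    the next launch (open loop). Deadlines are measured from `start`, so
    per-iteration scheduling error does not accumulate into drift.
    """
    start = time.perf_counter()
    launched: list[float] = []
    tasks: list[asyncio.Task] = []
    for i, offset in enumerate(send_offsets(rate, total)):
        delay = start + offset - time.perf_counter()
        if delay > 0:
            await asyncio.sleep(delay)
        launched.append(time.perf_counter() - start)
        tasks.append(asyncio.create_task(send(i)))
    await asyncio.gather(*tasks)
    return launched
```

With a naive `sleep(1 / rate)` after every send, the time spent issuing each request is added on top of the interval, so the realised QPS falls below the target. Anchoring every launch to the start time removes that cumulative error.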

What's Changed

core: fix build_docker_images check to always subset dataset first by @liguodongiot in #1348
[Update] old agent benchmarks by @Yunnglin in #1349
perf: support swanlab_host for self-hosted SwanLab deployments by @Yunnglin in #1350
[Benchmark] Add SWE-Bench pro and Tau3-Bench by @Yunnglin in #1351
Add OpenAI Responses API support, solve #1192 by @haoruilee in #1352
docs: unify agent instructions into AGENTS.md by @Yunnglin in #1353
fix: handle max_position_embeddings AttributeError in tokenizer loading by @Yunnglin in #1354
feat: add configurable image compression for VLM benchmarks by @Yunnglin in #1355
fix: migrate livecodebench to parquet dataset on ModelScope by @Yunnglin in #1357
[Fix] Fix ineffective repeats in DatasetDict.from_dataset by @we1sper in #1363
fix(perf): stabilise realised QPS in open-loop / rate-paced benchmarks by @Syqinx in #1367
[Feature] Add agent bridge and webui update by @Yunnglin in #1364
fix(agent/sandbox): fail-fast in EnclaveAgentEnvironment when ms_enclave is missing by @Yunnglin in #1372
feat: GAIA benchmark + MCP server support for NativeAgentConfig by @Yunnglin in #1371
[Benchmark] Add ArxivRollBench by @liangzid in #1365
fix(perf): round int fields after sla multi-run averaging by @qiumuyang in #1370
feat(benchmarks): add k2/kimi/minimax vendor verifier benchmarks by @Yunnglin in #1375
feat(perf): trie agentic trace replay + Turn model + --duration by @Yunnglin in #1374
feat(benchmarks): add Terminal-Bench v2.1 + upgrade harbor integration by @Yunnglin in #1376
feat: support API reranker evaluation, fix #1029 by @haoruilee in #1377

New Contributors

@liguodongiot made their first contribution in #1348
@we1sper made their first contribution in #1363
@Syqinx made their first contribution in #1367
@liangzid made their first contribution in #1365
@qiumuyang made their first contribution in #1370

Full Changelog: v1.7.1...v1.8.0
