中文版
基准测试数据集
- 语音评测: 新增 Seed-TTS-Eval 基准测试
- Agent 与工具调用评测: 新增 ACEBench 基准测试
- OCR 评测: 新增 Maritime-OCR-Bench 基准测试
- 图像描述评测: 新增 Caption benchmarks 支持
功能增强
- RAG 评测: 重构 RAG Eval,支持 MTEB 2.x、RAGAS 0.4.x,并引入 Pydantic 配置
- SWE-Bench: 支持为 SWE-Bench 镜像配置自定义 DockerHub namespace
- 图像质量评测: 新增全参考图像质量指标
问题修复
- 修复 SciCode 中 assistant text blocks 读取问题
- 修复 Terminal-Bench 在 trials 前未检查 Docker CLI 的问题
- 修复多轮对话中 `reasoning_content` 未作为顶层字段透传的问题
- 修复 `service` optional-dependencies 中缺少 perf 依赖的问题
- 修复 RAG API encoder/reranker 中超过 `max_seq_length` 的文本截断问题
- 修复 agent bash 工具 stdout 空白字符保留问题,避免 patch 内容损坏
- 修复 Windows 环境下缓存写入可能触发 `PermissionError` 的问题
English Version
Benchmark Datasets
- Speech Evaluation: Added Seed-TTS-Eval benchmark
- Agent and Tool-Use Evaluation: Added ACEBench benchmark
- OCR Evaluation: Added Maritime-OCR-Bench benchmark
- Image Captioning Evaluation: Added support for caption benchmarks
Feature Enhancements
- RAG Evaluation: Refactored RAG Eval with MTEB 2.x, RAGAS 0.4.x, and Pydantic configs
- SWE-Bench: Added support for configuring a custom DockerHub namespace for SWE-Bench images
- Image Quality Evaluation: Added full-reference image quality metrics
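
For readers unfamiliar with the term, a full-reference metric scores a distorted image against a pristine reference image. These notes do not list which metrics were added, so the sketch below uses PSNR purely as a well-known example of the category. The function name `psnr` and its flat-list input are assumptions for illustration, not the library's API.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], distorted: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equally sized pixel sequences.

    Illustrative helper only: pixels are passed as flat sequences,
    and max_value is the largest possible pixel value (255 for 8-bit images).
    """
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between reference and distorted pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the distorted image is closer to the reference; identical images give infinity.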
Bug Fixes
- Fixed reading of assistant text blocks in SciCode
- Fixed Terminal-Bench not checking for the Docker CLI before running trials
- Fixed `reasoning_content` not being forwarded as a top-level field in multi-turn conversations
- Fixed missing perf dependencies in the `service` optional-dependencies
- Fixed the RAG API encoder/reranker to truncate texts exceeding `max_seq_length`
- Fixed the agent bash tool to preserve stdout whitespace, preventing patch corruption
- Fixed a possible `PermissionError` on Windows when writing cache files
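
To illustrate why the stdout whitespace fix matters, here is a minimal sketch, not the project's actual code. In unified diff format a blank context line is a single space followed by a newline, so stripping captured output can silently delete it. The helper `run_command` and its `strip_output` flag are hypothetical names used only for this example.

```python
import subprocess
import sys


def run_command(args: list[str], strip_output: bool = False) -> str:
    """Run a command and return its stdout (hypothetical helper)."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip() if strip_output else result.stdout


# A tiny hunk whose last context line is blank: " \n" in unified diff format.
PATCH = "@@ -1,2 +1,2 @@\n-old\n+new\n \n"

# Emit the patch from a child process so it passes through a real stdout pipe.
emit = [sys.executable, "-c", f"import sys; sys.stdout.write({PATCH!r})"]

raw = run_command(emit)                          # whitespace preserved
stripped = run_command(emit, strip_output=True)  # trailing context line lost
```

Here `raw` matches the emitted patch exactly, while `stripped` has lost the trailing blank context line, so the hunk no longer matches its own header.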
What's Changed
- fix(scicode): read assistant text blocks by @he-yufeng in #1381
- add seed_tts_eval benchmark, solve #1360 by @haoruilee in #1379
- feat(benchmarks): add ACEBench support, fix #1025 by @haoruilee in #1386
- fix(terminal_bench): check docker cli before trials by @Li-Bailiang in #1389
- refactor(rag_eval): MTEB 2.x + RAGAS 0.4.x + Pydantic configs by @Yunnglin in #1383
- add Maritime-OCR-Bench support by @K-zhy in #1388
- fix(models): forward reasoning_content as top-level field in multi-turn by @Yunnglin in #1396
- fix: include perf deps in `service` optional-dependencies by @Blackteaxx in #1398
- Add caption benchmarks by @haoruilee in #1402
- fix(rag): truncate texts exceeding max_seq_length in API encoder/reranker by @Yunnglin in #1407
- fix(agent): preserve stdout whitespace in bash tool to prevent patch corruption by @Yunnglin in #1409
- fix(cache): use persistent jsonl writer to avoid Windows PermissionError by @Yunnglin in #1410
- feat: allow custom DockerHub namespace for SWE-Bench images by @Yunnglin in #1417
- feat(metric): add full-reference image quality metrics by @haoruilee in #1412
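
As a rough illustration of the kind of pre-flight check described in #1389, the sketch below fails fast when a required CLI is missing from `PATH`. It uses only the standard library's `shutil.which`; the function name `require_cli` and the error message are assumptions, not the code merged in that PR.

```python
import shutil


def require_cli(name: str = "docker") -> str:
    """Return the resolved path of a CLI, or raise if it is not on PATH.

    Hypothetical helper: calling it before any trial starts turns a
    confusing mid-run failure into one clear error up front.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"'{name}' CLI not found on PATH; install it before starting trials"
        )
    return path
```

Calling `require_cli()` once at startup would surface a missing Docker installation before any trial work begins.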
New Contributors
- @he-yufeng made their first contribution in #1381
- @Li-Bailiang made their first contribution in #1389
- @Blackteaxx made their first contribution in #1398
Full Changelog: v1.8.0...v1.8.1