v1.8.1

Latest

@Yunnglin released this 16 Jun 09:51

Benchmark Datasets

  • Speech Evaluation: Added Seed-TTS-Eval benchmark
  • Agent and Tool-Use Evaluation: Added ACEBench benchmark
  • OCR Evaluation: Added Maritime-OCR-Bench benchmark
  • Image Captioning Evaluation: Added support for caption benchmarks

Feature Enhancements

  • RAG Evaluation: Refactored RAG Eval to support MTEB 2.x and RAGAS 0.4.x, and introduced Pydantic-based configs
  • SWE-Bench: Added support for a custom DockerHub namespace for SWE-Bench images
  • Image Quality Evaluation: Added full-reference image quality metrics

Bug Fixes

  • Fixed reading of assistant text blocks in SciCode
  • Fixed Terminal-Bench not checking for the Docker CLI before running trials
  • Fixed reasoning_content not being forwarded as a top-level field in multi-turn conversations
  • Fixed missing perf dependencies in the service optional-dependencies
  • Fixed the RAG API encoder/reranker to truncate texts that exceed max_seq_length
  • Fixed the agent bash tool to preserve stdout whitespace, preventing patch corruption
  • Fixed a possible PermissionError on Windows when writing cache files
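
To illustrate why the stdout whitespace fix matters for patches: in a unified diff, unchanged context lines begin with a single space, so a blank context line is just one space followed by a newline. The sketch below is not the project's code. `run_stripped` and `run_raw` are hypothetical helpers that show how stripping captured output can silently alter a diff.

```python
import subprocess
import sys

# A tiny unified diff whose last line is a blank context line (a single space).
DIFF = "--- a/f.txt\n+++ b/f.txt\n@@ -1,2 +1,3 @@\n line one\n+line two\n \n"


def emit_command():
    # Portable stand-in for a shell command that prints the diff to stdout.
    return [sys.executable, "-c", f"import sys; sys.stdout.write({DIFF!r})"]


def run_stripped():
    # Hypothetical "before" behaviour: trimming whitespace from captured stdout.
    out = subprocess.run(emit_command(), capture_output=True, text=True).stdout
    return out.strip()


def run_raw():
    # Hypothetical "after" behaviour: return stdout exactly as produced.
    return subprocess.run(emit_command(), capture_output=True, text=True).stdout


stripped = run_stripped()
raw = run_raw()

# The raw output round-trips; the stripped output has lost the blank context
# line and the trailing newline, so the hunk no longer matches its header.
print(raw == DIFF, stripped == DIFF)
```

Here the hunk header promises three lines in the new file, but the stripped output only carries two of them, which is the kind of mismatch that makes a patch fail to apply.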

What's Changed

  • fix(scicode): read assistant text blocks by @he-yufeng in #1381
  • add seed_tts_eval benchmark, solve #1360 by @haoruilee in #1379
  • feat(benchmarks): add ACEBench support, fix #1025 by @haoruilee in #1386
  • fix(terminal_bench): check docker cli before trials by @Li-Bailiang in #1389
  • refactor(rag_eval): MTEB 2.x + RAGAS 0.4.x + Pydantic configs by @Yunnglin in #1383
  • add Maritime-OCR-Bench support by @K-zhy in #1388
  • fix(models): forward reasoning_content as top-level field in multi-turn by @Yunnglin in #1396
  • fix: include perf deps in service optional-dependencies by @Blackteaxx in #1398
  • Add caption benchmarks by @haoruilee in #1402
  • fix(rag): truncate texts exceeding max_seq_length in API encoder/reranker by @Yunnglin in #1407
  • fix(agent): preserve stdout whitespace in bash tool to prevent patch corruption by @Yunnglin in #1409
  • fix(cache): use persistent jsonl writer to avoid Windows PermissionError by @Yunnglin in #1410
  • feat: allow custom DockerHub namespace for SWE-Bench images by @Yunnglin in #1417
  • feat(metric): add full-reference image quality metrics by @haoruilee in #1412
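
The cache fix in #1410 is described only as using a "persistent jsonl writer". One plausible reading, sketched here under that assumption, is to append records through a single long-lived file handle instead of repeatedly rewriting or replacing the cache file, since on Windows replacing a file that is still held open commonly raises PermissionError. `PersistentJsonlWriter` and `read_jsonl` are hypothetical names for illustration, not the library's API.

```python
import json
import os
import tempfile


class PersistentJsonlWriter:
    """Hypothetical sketch: append records through one long-lived handle."""

    def __init__(self, path):
        # Append mode creates the file if needed and never replaces it.
        self._fh = open(path, "a", encoding="utf-8")

    def write(self, record):
        # One JSON object per line; flush so readers see complete lines.
        self._fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        self._fh.flush()

    def close(self):
        self._fh.close()


def read_jsonl(path):
    # Read the cache back, skipping any empty lines.
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


cache_path = os.path.join(tempfile.mkdtemp(), "cache.jsonl")
writer = PersistentJsonlWriter(cache_path)
writer.write({"id": 1, "pred": "A"})
writer.write({"id": 2, "pred": "B"})
writer.close()

records = read_jsonl(cache_path)
print(records)
```

Appending avoids the delete-and-rename step entirely, at the cost of needing an explicit `close()` when the run finishes.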

Full Changelog: v1.8.0...v1.8.1