v1.4.0

@Yunnglin Yunnglin released this 16 Dec 09:17
· 31 commits to release/1.4 since this release

English Version

Benchmark Datasets

  • General Evaluation: Added EQ-Bench and ZebraLogicBench for reasoning and logic evaluation
  • Code Evaluation: Added MultiplE-MBPP and MBPP for code capability assessment
  • Speech Evaluation: Added the FLEURS and LibriSpeech speech recognition benchmarks

Feature Enhancements

  • Performance Visualization: Added ClearML visualization support to improve performance test (perf) monitoring
  • Service API: Added a service API for more flexible service invocation
  • Lazy Model Loading: Added lazy model support to optimize the model loading mechanism
  • Retry Mechanism: Added a retry function to improve evaluation stability
  • Sandbox Optimization: The sandbox now supports a connection pool and multiple-humaneval multi-language code evaluation
  • Random Algorithm: Updated the random algorithm used in performance testing to improve test accuracy
  • UI Enhancement: The Dashboard now supports configuring HTTP params
  • Progress Bar: Updated the tqdm progress display mechanism
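
The notes above do not show how the retry function is implemented or what its signature is. As a rough illustration of the general idea only, the following is a minimal sketch of a retry decorator with exponential backoff. The names `retry`, `max_attempts`, `base_delay`, and `flaky_request` are hypothetical and are not taken from this release.

```python
import time
from functools import wraps


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,)):
    """Hypothetical helper: re-run a function when it raises a listed exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait base_delay, then 2x, then 4x, ... before retrying.
                    time.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = {"count": 0}


@retry(max_attempts=3, base_delay=0.01, exceptions=(ConnectionError,))
def flaky_request():
    """Simulated call that fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_request()
```

Restricting `exceptions` to transient error types keeps genuine bugs, such as a `TypeError`, from being silently retried.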

Documentation

  • Updated custom VQA documentation
  • Updated parameter configuration documentation
  • Updated benchmarks documentation
  • Updated service documentation
  • Updated MTEB related links

Bug Fixes

  • Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
  • Fixed the token throughput calculation when concurrency is 1
  • Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
  • Fixed issues with the SWE-bench image build and MRCR leading-newline support
  • Fixed NLTK resource checking issues
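
The notes do not describe what was wrong with the throughput calculation. For context only, output token throughput is commonly computed as the total number of generated tokens divided by the wall-clock span of the run. The sketch below shows that common definition and is not the project's actual code; `RequestRecord` and `output_token_throughput` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (times in seconds)."""
    start: float
    end: float
    output_tokens: int


def output_token_throughput(records):
    """Return generated tokens per second over the whole measurement window."""
    if not records:
        return 0.0
    # The window spans from the first request start to the last request end.
    window = max(r.end for r in records) - min(r.start for r in records)
    if window <= 0:
        return 0.0
    return sum(r.output_tokens for r in records) / window


# At concurrency 1, requests run back to back with no overlap.
sequential = [
    RequestRecord(start=0.0, end=2.0, output_tokens=100),
    RequestRecord(start=2.0, end=4.0, output_tokens=100),
]
throughput = output_token_throughput(sequential)  # 200 tokens over 4 seconds
```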

Full Changelog: v1.3.0...v1.4.0