v1.3.0

@Yunnglin released this 28 Nov 07:51

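Chinese Version

Benchmark Datasets

  • Multimodal Evaluation: Added multimodal benchmarks including A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, and MicroVQA
  • Code Evaluation: Added code capability benchmarks including SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, and SciCode
  • General Evaluation: Added benchmarks including GSM8K-V, MGSM, IFBench, and OpenAI MRCR

Feature Enhancements

  • Custom Tool-Call Evaluation: Added support for custom function-call evaluation; see the usage documentation
  • Custom Multimodal VQA Evaluation: Added support for custom Visual Question Answering (VQA) evaluation; see the usage documentation
  • Aggregate Scoring: Updated the aggregation (agg) parameters and improved the score aggregation mechanism
  • Performance Testing: Improved the configuration of performance testing (perf) parameters

Documentation

  • Updated the collection documentation, which now covers building a custom evaluation index; see the usage documentation

Bug Fixes

  • Fixed streaming issues on the perf completion endpoint
  • Fixed the display of judge model error logs
  • Fixed the action of the --no-test-connection parameter
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed issues related to model args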

English Version

Benchmark Datasets

  • Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
  • Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
  • General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks

Feature Enhancements

  • Custom Tool-Call Evaluation: Added support for custom function-call evaluation
  • Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
  • Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
  • Aggregate Scoring: Updated the aggregation (agg) parameters to improve the score aggregation mechanism
  • Performance Testing: Improved the configuration of performance testing (perf) parameters
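
To make the idea of an aggregation (agg) parameter concrete, here is a minimal sketch of how per-sample scores might be reduced to a single benchmark score by a named method. It is an illustration only: the `aggregate` function, the `AGGREGATORS` table, and the method names `mean`, `max`, and `min` are assumptions made for this example and are not taken from the v1.3.0 code.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical table of aggregation methods; the names are assumptions
# for this sketch, not the library's actual agg options.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "max": max,
    "min": min,
}


def aggregate(scores: List[float], agg: str = "mean") -> float:
    """Reduce per-sample scores to one number using the named method."""
    if not scores:
        raise ValueError("cannot aggregate an empty score list")
    if agg not in AGGREGATORS:
        raise ValueError(f"unknown aggregation method: {agg!r}")
    return float(AGGREGATORS[agg](scores))


if __name__ == "__main__":
    per_sample = [1.0, 0.0, 1.0, 1.0]  # e.g. exact-match results for four samples
    print(aggregate(per_sample, "mean"))
    print(aggregate(per_sample, "max"))
```

Keeping the methods in a lookup table means an unknown method name fails loudly instead of silently falling back to a default.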

Documentation

  • Updated eval_type related documentation
  • Updated collection documentation

Bug Fixes

  • Fixed streaming issues on the perf completion endpoint
  • Fixed the display of judge model error logs
  • Fixed the action of the --no-test-connection parameter
  • Fixed error handling for function-call test cases (Issue #1005)
  • Fixed issues related to model args
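
The --no-test-connection fix concerns the argument's action, that is, how the parser turns the flag's presence into a value. The sketch below shows one common way to declare such a boolean flag with Python's standard `argparse` module. It is not the project's actual parser: the `build_parser` and `should_test_connection` helpers and the `perf-sketch` program name are hypothetical, and only the flag name comes from the release notes.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with a single boolean opt-out flag (illustrative)."""
    parser = argparse.ArgumentParser(prog="perf-sketch")
    # With action="store_true" the flag takes no value:
    # absent -> False, present -> True.
    parser.add_argument(
        "--no-test-connection",
        action="store_true",
        default=False,
        help="skip the connection check before the run (assumed meaning)",
    )
    return parser


def should_test_connection(argv: List[str]) -> bool:
    """Return True unless the opt-out flag was passed."""
    args = build_parser().parse_args(argv)
    # argparse converts dashes to underscores in attribute names.
    return not args.no_test_connection


if __name__ == "__main__":
    print(should_test_connection([]))
    print(should_test_connection(["--no-test-connection"]))
```

A flag declared without a suitable action would expect a value after it, which is a typical source of bugs for opt-out switches like this one.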

Full Changelog: v1.2.0...v1.3.0