v1.3.0
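Chinese Version
Benchmark Datasets
- Multimodal Evaluation: Added multimodal benchmarks including A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, and MicroVQA
- Code Evaluation: Added code capability benchmarks including SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, and SciCode
- General Evaluation: Added benchmarks including GSM8K-V, MGSM, IFBench, and OpenAI MRCR
Feature Enhancements
- Custom Tool-Call Evaluation: Added support for custom function-call evaluation; see the usage documentation
- Custom Multimodal VQA Evaluation: Added support for custom Visual Question Answering (VQA) evaluation; see the usage documentation
- Aggregate Scoring: Updated the aggregation (agg) parameters to improve how scores are aggregated
- Performance Testing: Improved the parameter configuration for performance (perf) testing
Documentation
- Updated the collection documentation, which now covers building a custom evaluation index; see the usage documentation
Bug Fixes
- Fixed streaming issues with the perf completion endpoint
- Fixed the error log display for the judge model
- Fixed the argument action of the --no-test-connection flag
- Fixed error handling for function-call test cases (Issue #1005)
- Fixed issues related to model args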
English Version
Benchmark Datasets
- Multimodal Evaluation: Added A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, MicroVQA and other multimodal benchmarks
- Code Evaluation: Added SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, SciCode for code capability assessment
- General Evaluation: Added GSM8K-V, MGSM, IFBench, OpenAI MRCR and other benchmarks
Feature Enhancements
- Custom Evaluation: Added support for custom function-call evaluation
- Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
- Parameter Extension: Added extra_param_spec functionality for more flexible parameter configuration
- Aggregate Scoring: Updated the aggregation (agg) parameters to improve how scores are aggregated
- Performance Testing: Improved the parameter configuration for performance (perf) testing
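To make the function-call and aggregate-scoring items above concrete, here is a minimal sketch of how a function-call evaluation might score model output and then aggregate per-sample results. It is not EvalScope's implementation: the helper names (`normalize_call`, `score_call`, `aggregate`), the exact-match rule, and the sample data are all assumptions chosen for illustration.

```python
import json
from statistics import mean


def normalize_call(call: dict) -> tuple:
    """Return a comparable (name, arguments) pair for a tool call.

    Arguments may arrive as a JSON string or as a dict; both are
    converted to a canonical JSON string with sorted keys.
    """
    args = call.get("arguments", {})
    if isinstance(args, str):
        args = json.loads(args)
    return call.get("name"), json.dumps(args, sort_keys=True)


def score_call(predicted: dict, expected: dict) -> float:
    """Score 1.0 when name and arguments match exactly, else 0.0 (hypothetical rule)."""
    try:
        return float(normalize_call(predicted) == normalize_call(expected))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed model output counts as a miss instead of crashing the run.
        return 0.0


def aggregate(scores: list) -> float:
    """Aggregate per-sample scores with a plain mean (0.0 when there are no samples)."""
    return mean(scores) if scores else 0.0


# Made-up samples: (model output, reference call)
samples = [
    ({"name": "get_weather", "arguments": '{"city": "Paris", "unit": "C"}'},
     {"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}}),
    ({"name": "get_time", "arguments": "{}"},
     {"name": "get_weather", "arguments": {"city": "Paris"}}),
    ({"name": "get_weather", "arguments": "{not valid json"},
     {"name": "get_weather", "arguments": {"city": "Paris"}}),
]

scores = [score_call(pred, ref) for pred, ref in samples]
accuracy = aggregate(scores)
print(scores, accuracy)
```

Canonicalizing the arguments before comparison keeps key order and string-versus-dict encoding from affecting the score, and treating unparseable output as a miss keeps one bad sample from aborting the whole run. A real evaluation may well use a looser match or a different aggregation than the plain mean shown here.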
Documentation
- Updated eval_type related documentation
- Updated collection documentation
Bug Fixes
- Fixed streaming issues with the perf completion endpoint
- Fixed the error log display for the judge model
- Fixed the argument action of the --no-test-connection flag
- Fixed error handling for function-call test cases (Issue #1005)
- Fixed issues related to model args
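For readers unfamiliar with what an argument "action" is in the --no-test-connection fix above, the sketch below shows a boolean flag declared with Python's standard `argparse` module. It assumes the CLI is built on `argparse` and that the flag is a simple on/off switch; the parser name `perf-sketch` and the help text are hypothetical, and these notes do not say what the actual fix changed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one boolean flag (illustrative only)."""
    parser = argparse.ArgumentParser(prog="perf-sketch")
    # store_true: the flag takes no value; absent -> False, present -> True.
    parser.add_argument(
        "--no-test-connection",
        action="store_true",
        default=False,
        help="Skip the connection check before the run starts.",
    )
    return parser


parser = build_parser()
default_args = parser.parse_args([])
flag_args = parser.parse_args(["--no-test-connection"])

# argparse maps dashes in the option name to underscores in the attribute.
print(default_args.no_test_connection, flag_args.no_test_connection)
```

With `action="store_true"` the flag needs no value on the command line. Without an explicit action, `argparse` defaults to `store`, which would require a value such as `--no-test-connection 1` and yield a string rather than a boolean.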
What's Changed
- [Doc] Update doc eval_type by @Yunnglin in #970
- [Benchmark] Add A_OKVQA, CMMU, ScienceQA, V*Bench by @mushenL in #973
- [Benchmark] Add SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini by @Yunnglin in #976
- [Feature] Add custom function-call eval by @Yunnglin in #982
- [Fix] perf completion endpoint streaming by @Yunnglin in #983
- [Fix] fix error log of judge model by @Yunnglin in #986
- add openai mrcr by @sophies-cerebras in #987
- [Feature] Add extra param spec by @Yunnglin in #990
- Add gsm8k_v,mgsm and micro_vqa benchmarks by @mushenL in #995
- fix: fix --no-test-connection args action by @ljwh in #999
- Update collection doc by @Yunnglin in #997
- [Benchmark] Add IFBench by @Yunnglin in #1001
- Fix Issue #1005: error handling for function-call test cases by @hougedengwo in #1007
- [Benchmark] Add SciCode by @Yunnglin in #1011
- [Fix] update perf args by @Yunnglin in #1013
- [Fix] model args by @Yunnglin in #1014
- [Feature] update agg args by @Yunnglin in #1016
- [Feature] Add custom VQA by @Yunnglin in #1019
- [Benchmark] add CMMMU by @Yunnglin in #1020
New Contributors
- @ljwh made their first contribution in #999
- @hougedengwo made their first contribution in #1007
Full Changelog: v1.2.0...v1.3.0