v1.3.0

Yunnglin released this 28 Nov 07:51

5cdbe62

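Chinese Version (translated)

Benchmark Datasets

Multimodal Evaluation: Added multimodal benchmarks including A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, and MicroVQA
Code Evaluation: Added code capability benchmarks including SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, and SciCode
General Evaluation: Added benchmarks including GSM8K-V, MGSM, IFBench, and OpenAI MRCR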

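Feature Enhancements

Custom Tool-Call Evaluation: Added support for custom function-call evaluation; see the usage documentation
Custom Multimodal VQA Evaluation: Added support for custom Visual Question Answering (VQA) evaluation; see the usage documentation
Aggregate Scoring: Updated the aggregation (agg) parameters and improved the score aggregation mechanism
Performance Testing: Improved the configuration of performance testing (perf) parameters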

Documentation Improvements

Updated the collection documentation, which now covers building a custom evaluation index; see the usage documentation
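
As a rough illustration of what a custom evaluation index can mean, the sketch below combines per-dataset scores into one number with a weighted average. This is not EvalScope's API: the function name `weighted_index`, the weights, and the scores are all hypothetical, and the weighted-average scheme is an assumption; the collection documentation is the authority on how indices are actually defined.

```python
from typing import Dict


def weighted_index(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dataset scores into a single index using normalized weights.

    Hypothetical helper for illustration; not part of EvalScope.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Divide by the weight total so the weights need not sum to 1.
    return sum(scores[name] * w for name, w in weights.items()) / total


# Made-up scores and weights, for illustration only.
scores = {"gsm8k_v": 0.5, "mgsm": 0.75, "micro_vqa": 0.25}
weights = {"gsm8k_v": 1.0, "mgsm": 2.0, "micro_vqa": 1.0}
index = weighted_index(scores, weights)
print(index)
```

Normalizing by the weight total keeps the index on the same scale as the individual scores, which makes indices with different weightings easier to compare.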

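Bug Fixes

Fixed streaming issues with the perf completion endpoint
Fixed the display of judge model error logs
Fixed the argument action of the --no-test-connection parameter
Fixed error handling for function-call test cases (Issue #1005)
Fixed issues related to model args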

English Version

Benchmark Datasets

Multimodal Evaluation: Added multimodal benchmarks including A_OKVQA, CMMU, CMMMU, ScienceQA, V*Bench, and MicroVQA
Code Evaluation: Added code capability benchmarks including SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini, and SciCode
General Evaluation: Added benchmarks including GSM8K-V, MGSM, IFBench, and OpenAI MRCR

Feature Enhancements

Custom Function-Call Evaluation: Added support for custom function-call evaluation
Custom VQA: Added support for custom Visual Question Answering (VQA) evaluation
Parameter Extension: Added extra_param_spec for more flexible parameter configuration
Aggregate Scoring: Updated the aggregation (agg) parameters to improve the score aggregation mechanism
Performance Testing: Improved the configuration of performance testing (perf) parameters
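
To make the idea of function-call evaluation concrete, here is a minimal sketch of one way a predicted tool call could be checked against an expected one. It is not EvalScope's implementation: the helper `call_matches`, the sample records, and the exact-match criterion are assumptions for illustration. The only outside convention it relies on is that OpenAI-style tool calls carry their arguments as a JSON string.

```python
import json
from typing import Any, Dict


def call_matches(expected: Dict[str, Any], predicted: Dict[str, Any]) -> bool:
    """Return True when the tool name and the parsed arguments both match.

    Hypothetical scoring helper; a real evaluator may use looser criteria.
    """
    if expected["name"] != predicted.get("name"):
        return False
    try:
        # Arguments arrive as a JSON string, so parse before comparing.
        predicted_args = json.loads(predicted.get("arguments", "{}"))
    except (json.JSONDecodeError, TypeError):
        # Malformed arguments count as a miss instead of aborting the run.
        return False
    return predicted_args == expected["arguments"]


# Made-up test case and model outputs.
expected = {"name": "get_weather", "arguments": {"city": "Hangzhou"}}
good = {"name": "get_weather", "arguments": '{"city": "Hangzhou"}'}
malformed = {"name": "get_weather", "arguments": '{"city": '}

print(call_matches(expected, good))
print(call_matches(expected, malformed))
```

Treating unparseable arguments as a failed sample, rather than letting the exception propagate, keeps one bad model output from stopping the whole evaluation.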

Documentation

Updated eval_type related documentation
Updated collection documentation

Bug Fixes

Fixed streaming issues with the perf completion endpoint
Fixed the display of judge model error logs
Fixed the argument action of the --no-test-connection parameter
Fixed error handling for function-call test cases (Issue #1005)
Fixed issues related to model args
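
The notes do not say what was wrong with the --no-test-connection action, so the following is only a generic sketch of how a boolean switch is usually declared with Python's argparse. Whether EvalScope declares the flag this way is an assumption; the parser name `perf-sketch` and the help text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one boolean switch (illustrative only)."""
    parser = argparse.ArgumentParser(prog="perf-sketch")
    # action="store_true" makes the flag take no value:
    # absent -> False, present -> True.
    parser.add_argument(
        "--no-test-connection",
        action="store_true",
        default=False,
        help="skip the connection test before running (hypothetical wording)",
    )
    return parser


args = build_parser().parse_args(["--no-test-connection"])
# argparse converts dashes to underscores in the attribute name.
print(args.no_test_connection)
```

Without an explicit action, argparse defaults to "store", which expects a value after the flag and rejects the bare switch, so a missing or wrong action is a common source of bugs for flags like this one.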

What's Changed

[Doc] Update doc eval_type by @Yunnglin in #970
[Benchmark] Add A_OKVQA, CMMU, ScienceQA, V*Bench by @mushenL in #973
[Benchmark] Add SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini by @Yunnglin in #976
[Feature] Add custom function-call eval by @Yunnglin in #982
[Fix] perf completion endpoint streaming by @Yunnglin in #983
[Fix] fix error log of judge model by @Yunnglin in #986
add openai mrcr by @sophies-cerebras in #987
[Feature] Add extra param spec by @Yunnglin in #990
Add gsm8k_v,mgsm and micro_vqa benchmarks by @mushenL in #995
fix: fix --no-test-connection args action by @ljwh in #999
Update collection doc by @Yunnglin in #997
[Benchmark] Add IFBench by @Yunnglin in #1001
Resolve Issue #1005: fix error handling for function-call test cases by @hougedengwo in #1007
[Benchmark] Add SciCode by @Yunnglin in #1011
[Fix] update perf args by @Yunnglin in #1013
[Fix] model args by @Yunnglin in #1014
[Feature] update agg args by @Yunnglin in #1016
[Feature] Add custom VQA by @Yunnglin in #1019
[Benchmark] add CMMMU by @Yunnglin in #1020

New Contributors

@ljwh made their first contribution in #999
@hougedengwo made their first contribution in #1007

Full Changelog: v1.2.0...v1.3.0

Contributors

ljwh, hougedengwo, and 3 other contributors