Release v1.4.0 · modelscope/evalscope

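Chinese Version

Benchmark Datasets

General Evaluation: Added reasoning and logic benchmarks such as EQ-Bench and ZebraLogicBench
Code Evaluation: Added code capability benchmarks such as MultiPL-E and MBPP
Speech Evaluation: Added speech recognition benchmarks such as FLEURS and LibriSpeech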

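Feature Enhancements

Performance Test Visualization: Added ClearML visualization support and improved performance test (perf) monitoring
Service API: Added a service API for more flexible service invocation; see the documentation
Lazy Model Loading: Added lazy model support to optimize the model loading mechanism
Retry Mechanism: Added a retry function to improve evaluation stability
Sandbox Optimization: Updated the sandbox to support a connection pool and MultiPL-E multi-language code evaluation
Random Algorithm Optimization: Updated the random algorithm used in performance testing to improve test accuracy
UI Enhancement: Dashboard supports configuration via HTTP params
Progress Bar Optimization: Updated the tqdm progress display mechanism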

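Documentation Improvements

Updated the custom VQA documentation
Updated the parameter configuration documentation
Updated the benchmarks documentation
Updated the service documentation
Updated MTEB related links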

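Bug Fixes

Fixed issues with command-line parameters such as --analysis-report and --dataset-dir
Fixed the token throughput calculation when concurrency is 1
Fixed loading issues for benchmarks such as ChartQA, TAU2, and OmniDocBench
Fixed the SWE-bench image build and added MRCR support for leading newline characters
Fixed issues with the NLTK resource check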

English Version

Benchmark Datasets

General Evaluation: Added EQ-Bench and ZebraLogicBench for reasoning and logic evaluation
Code Evaluation: Added MultiPL-E (MBPP) and MBPP for code capability assessment
Speech Evaluation: Added FLEURS and LibriSpeech for speech recognition benchmarks

Feature Enhancements

Performance Visualization: Added ClearML visualization support for performance (perf) monitoring
Service API: Added a service API for more flexible service invocation
Lazy Model Loading: Added lazy model support to optimize the model loading mechanism
Retry Mechanism: Added a retry function to improve evaluation stability
Sandbox Optimization: Updated the sandbox with connection pool support and MultiPL-E (multiple-humaneval) evaluation
Random Algorithm: Updated the random algorithm used in performance testing for improved accuracy
UI Enhancement: Dashboard now supports configuration via HTTP params
Progress Bar: Updated tqdm progress display mechanism
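
As an illustration of the kind of retry mechanism mentioned above, here is a minimal sketch of a retry decorator with exponential backoff. It is not evalscope's implementation: the name `retry` and the parameters `max_attempts`, `base_delay`, `exceptions`, and `sleep` are all hypothetical and chosen only for this example.

```python
import random
import time
from functools import wraps


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Hypothetical helper: retry a flaky call with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait base_delay, 2x, 4x, ... plus a small random jitter.
                    delay = base_delay * 2 ** (attempt - 1)
                    sleep(delay + random.uniform(0, 0.1))
        return wrapper
    return decorator
```

Wrapping a model or API call with `@retry(max_attempts=3)` would let a transient network error be retried instead of aborting a long evaluation run; passing a custom `sleep` makes the behaviour easy to test without real waiting.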

Documentation

Updated custom VQA documentation
Updated parameter configuration documentation
Updated benchmarks documentation
Updated service documentation
Updated MTEB related links

Bug Fixes

Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
Fixed token throughput calculation at concurrency 1
Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
Fixed SWE-bench image build and MRCR leading newline support
Fixed NLTK resource checking issues
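
For readers unfamiliar with the throughput metrics named in the fix above, the following sketch shows one common way to define output and total token throughput over a measured interval. It is an assumption for illustration only and does not reproduce evalscope's code; the function `token_throughput` and its parameters are hypothetical.

```python
def token_throughput(input_tokens, output_tokens, duration_s):
    """Hypothetical helper: tokens per second over a wall-clock interval."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Output throughput counts only generated tokens.
    output_tps = output_tokens / duration_s
    # Total throughput counts prompt and generated tokens together.
    total_tps = (input_tokens + output_tokens) / duration_s
    return output_tps, total_tps
```

With a single in-flight request (concurrency 1), the interval is simply the time spent serving requests one after another, so both values follow directly from the token counts and the elapsed time.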

What's Changed

[Feature] Add perf ClearML visualization by @Yunnglin in #1032
[Doc] update custom vqa by @Yunnglin in #1036
[Benchmark] Add eq bench by @Yunnglin in #1037
Feature/zebralogicbench by @nhes in #1035
[Fix] Update tau2 by @Yunnglin in #1039
[Feature] Add service api by @Yunnglin in #1042
fix --analysis-report=true bug by @pumpkin12135 in #1046
[feature] add lazy model by @Secbone in #1045
[Doc] update parameter by @Yunnglin in #1048
[Fix] update default work dir by @Yunnglin in #1049
[Feature] Update perf random Algorithm by @Yunnglin in #1050
[Feature] add retry function by @Yunnglin in #1051
Fix --dataset-dir parameter to work correctly by @gbdjxgp in #1053
[Fix] chartqa prompt by @Yunnglin in #1054
[Benchmark] Add fleurs, librispeech by @Yunnglin in #1059
[Fix] multi-if load by @Yunnglin in #1062
[Benchmark] Add MultiplE-mbpp, MBPP by @Yunnglin in #1066
Update mteb link by @Samoed in #1065
UI dashboard supports HTTP params parameters by @pumpkin12135 in #1060
[Feature] update sandbox with pool and multiple-humaneval by @Yunnglin in #1073
check_nltk_data does not accept a parameter by @Zhaoyi-Yan in #1071
[Fix] update check nltk resource by @Yunnglin in #1078
[Fix] omni doc bench load by @Yunnglin in #1079
[Doc] update benchmarks doc by @Yunnglin in #1081
[Doc] update service and doc by @Yunnglin in #1085
fix: Output token throughput and Total token throughput on Concurrency 1 by @cdpath in #1083
[Fix] SWE build image by @Yunnglin in #1087
small fixes to mrcr to support leading \n characters by @sophies-cerebras in #1086
[Feature] Update tqdm process by @Yunnglin in #1089

New Contributors

@nhes made their first contribution in #1035
@pumpkin12135 made their first contribution in #1046
@Secbone made their first contribution in #1045
@gbdjxgp made their first contribution in #1053
@Samoed made their first contribution in #1065
@Zhaoyi-Yan made their first contribution in #1071
@cdpath made their first contribution in #1083

Full Changelog: v1.3.0...v1.4.0
