v1.4.0
中文版
基准测试数据集
- 通用评测: 新增 EQ-Bench、ZebraLogicBench 等推理与逻辑评测基准
- 代码评测: 新增 MultiPL-E、MBPP 等代码能力评测
- 语音评测: 新增 FLEURS、LibriSpeech 等语音识别基准测试
功能增强
- 性能测试可视化: 新增 ClearML 可视化支持,优化性能测试(perf)监控能力
- 服务API: 新增 service API 功能,提供更灵活的服务调用方式(详见服务相关文档)
- 懒加载模型: 新增 lazy model 支持,优化模型加载机制
- 重试机制: 新增 retry function,提升评测稳定性
- 沙箱优化: 更新 sandbox 支持连接池(pool)和 MultiPL-E 多语言代码评测
- 随机算法优化: 更新性能测试随机算法,提升测试准确性
- UI增强: Dashboard 支持 HTTP params 参数配置
- 进度条优化: 更新 tqdm 进度显示机制
文档优化
- 更新自定义 VQA 相关文档
- 更新参数配置相关文档
- 更新基准测试(benchmarks)文档
- 更新服务(service)相关文档
- 更新 MTEB 相关链接
问题修复
- 修复 --analysis-report、--dataset-dir 等命令行参数问题
- 修复并发为1时的令牌吞吐量计算问题
- 修复 ChartQA、TAU2、OmniDocBench 等基准测试加载问题
- 修复 SWE-bench 镜像构建、MRCR 前导换行符支持等问题
- 修复 NLTK 资源检查相关问题
English Version
Benchmark Datasets
- General Evaluation: Added EQ-Bench and ZebraLogicBench for reasoning and logic evaluation
- Code Evaluation: Added MultiPL-E (MBPP) and MBPP for code capability assessment
- Speech Evaluation: Added the FLEURS and LibriSpeech speech recognition benchmarks
Feature Enhancements
- Performance Visualization: Added ClearML visualization support for performance (perf) monitoring
- Service API: Added a service API for more flexible service invocation
- Lazy Model Loading: Added lazy model support to optimize model loading
- Retry Mechanism: Added retry function to improve evaluation stability
- Sandbox Optimization: Updated the sandbox with connection pool support and MultiPL-E (HumanEval) multi-language code evaluation
- Random Algorithm: Updated the random algorithm used in performance testing to improve test accuracy
- UI Enhancement: Dashboard now supports configuration through HTTP params
- Progress Bar: Updated the tqdm progress display
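These notes do not describe how the retry function listed above is implemented. As a minimal sketch of the general technique only (not EvalScope's actual code), a retry wrapper with exponential backoff might look like the following. The names `retry`, `max_attempts`, `base_delay`, and `flaky_request` are hypothetical and chosen for illustration.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.0, exceptions=(Exception,)):
    """Hypothetical decorator: re-run a function when it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"count": 0}


@retry(max_attempts=3, base_delay=0.0)
def flaky_request():
    """Stand-in for a model call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_request()
```

Here the third attempt succeeds, so the caller never sees the two transient failures. A non-zero `base_delay` would space the attempts out, which is usually preferable when the failures come from a rate-limited or overloaded service.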
Documentation
- Updated custom VQA documentation
- Updated parameter configuration documentation
- Updated benchmarks documentation
- Updated service documentation
- Updated MTEB-related links
Bug Fixes
- Fixed command-line parameter issues (--analysis-report, --dataset-dir, etc.)
- Fixed token throughput calculation at concurrency 1
- Fixed benchmark loading issues (ChartQA, TAU2, OmniDocBench, etc.)
- Fixed the SWE-bench image build and added MRCR support for leading newline characters
- Fixed NLTK resource checking issues
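The notes above do not say what was wrong with the throughput calculation at concurrency 1. For orientation only, output and total token throughput are commonly defined as token counts divided by elapsed wall-clock time. The sketch below assumes that definition; the function `token_throughput` and its parameters are hypothetical and not taken from EvalScope.

```python
def token_throughput(input_tokens, output_tokens, elapsed_seconds):
    """Return (output tokens/s, total tokens/s) over one wall-clock window.

    Assumed definition: token counts summed over all requests, divided by
    the elapsed wall-clock time of the whole run.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_in = sum(input_tokens)
    total_out = sum(output_tokens)
    output_tps = total_out / elapsed_seconds
    total_tps = (total_in + total_out) / elapsed_seconds
    return output_tps, total_tps


# Three sequential requests (concurrency 1) completing within 4 seconds overall.
output_tps, total_tps = token_throughput(
    input_tokens=[100, 120, 80],
    output_tokens=[50, 60, 90],
    elapsed_seconds=4.0,
)
```

With one request in flight at a time, the wall-clock window is simply the sum of the request latencies plus any idle time between them, so the same formula applies at concurrency 1 as at higher concurrency.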
What's Changed
- [Feature] Add perf ClearML visualization by @Yunnglin in #1032
- [Doc] update custom vqa by @Yunnglin in #1036
- [Benchmark] Add eq bench by @Yunnglin in #1037
- Feature/zebralogicbench by @nhes in #1035
- [Fix] Update tau2 by @Yunnglin in #1039
- [Feature] Add service api by @Yunnglin in #1042
- fix --analysis-report=true bug by @pumpkin12135 in #1046
- [feature] add lazy model by @Secbone in #1045
- [Doc] update parameter by @Yunnglin in #1048
- [Fix] update default work dir by @Yunnglin in #1049
- [Feature] Update perf random Algorithm by @Yunnglin in #1050
- [Feature] add retry function by @Yunnglin in #1051
- Fix --dataset-dir parameter to work correctly by @gbdjxgp in #1053
- [Fix] chartqa prompt by @Yunnglin in #1054
- [Benchmark] Add fleurs, librispeech by @Yunnglin in #1059
- [Fix] multi-if load by @Yunnglin in #1062
- [Benchmark] Add MultiplE-mbpp, MBPP by @Yunnglin in #1066
- Update mteb link by @Samoed in #1065
- UI dashboard supports HTTP params parameters by @pumpkin12135 in #1060
- [Feature] update sandbox with pool and multiple-humaneval by @Yunnglin in #1073
- check_nltk_data does not accept a parameter by @Zhaoyi-Yan in #1071
- [Fix] update check nltk resource by @Yunnglin in #1078
- [Fix] omni doc bench load by @Yunnglin in #1079
- [Doc] update benchmarks doc by @Yunnglin in #1081
- [Doc] update service and doc by @Yunnglin in #1085
- fix: Output token throughput and Total token throughput on Concurrency 1 by @cdpath in #1083
- [Fix] SWE build image by @Yunnglin in #1087
- small fixes to mrcr to support leading \n characters by @sophies-cerebras in #1086
- [Feature] Update tqdm process by @Yunnglin in #1089
New Contributors
- @nhes made their first contribution in #1035
- @pumpkin12135 made their first contribution in #1046
- @Secbone made their first contribution in #1045
- @gbdjxgp made their first contribution in #1053
- @Samoed made their first contribution in #1065
- @Zhaoyi-Yan made their first contribution in #1071
- @cdpath made their first contribution in #1083
Full Changelog: v1.3.0...v1.4.0