v1.0.2
New Features
- Code evaluation benchmarks (HumanEval, LiveCodeBench) can now run in a sandbox environment. This requires installing ms-enclave first.
- Added support for image-text multimodal benchmarks, including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, and for text-only benchmarks, including Multi-IF, HealthBench, and AMC.
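
Because the sandbox feature depends on ms-enclave being installed, a quick pre-flight check can save a failed run. The sketch below is not part of this release: the helper name `sandbox_dependency_available` is hypothetical, and the import name `ms_enclave` and the `pip install ms-enclave` hint are assumptions based on the package name, so adjust them if your installation differs.

```python
import importlib.util


def sandbox_dependency_available(module_name: str = "ms_enclave") -> bool:
    """Return True if a top-level module with this name can be found.

    The default name "ms_enclave" is an assumption about how the
    ms-enclave package is imported; it is not confirmed by these notes.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if sandbox_dependency_available():
        print("ms-enclave found; sandboxed code benchmarks can be enabled.")
    else:
        # Assumed install command; check the project docs for the exact one.
        print("ms-enclave not found; try: pip install ms-enclave")
```

Running the check before an evaluation on HumanEval or LiveCodeBench makes a missing dependency visible up front rather than partway through a run.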
What's Changed
- [Benchmark] add Multi-IF by @Yunnglin in #822
- Add ai2d_adapter and real_world_qa_adapter by @mushenL in #824
- [Benchmark] Add health bench by @Yunnglin in #826
- fix: make _temp_run top-level to resolve M1 pickle error by @MemoryIt in #827
- [Fix] vlm tokenize by @Yunnglin in #829
- [Doc] update qwen next doc by @Yunnglin in #832
- [Fix] fix bfcl-v3 score by @Yunnglin in #833
- [Benchmark] Add MMBench and MMStar by @mushenL in #834
- [Benchmark] Add Omnibench by @Yunnglin in #837
- [Fix] Fix bfcl validation error by @Yunnglin in #838
- [Feature] add docker sandbox by @Yunnglin in #835
- [Fix] Fix thread pool error by @Yunnglin in #841
- [Benchmark] Add amc23 and OlympiadBench by @mushenL in #840
- [Benchmark] add minerva-math by @Yunnglin in #846
Full Changelog: v1.0.1...v1.0.2