v1.0.2

Yunnglin released this 23 Sep 09:30

ac3a470

New Features

Code evaluation benchmarks (HumanEval, LiveCodeBench) can now run in a sandbox environment. To use this feature, install ms-enclave first.
Added support for image-text multimodal evaluation benchmarks (RealWorldQA, AI2D, MMStar, MMBench, OmniBench) and text-only evaluation benchmarks (Multi-IF, HealthBench, AMC).
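
A minimal sketch for checking the sandbox prerequisite before starting a code benchmark run. The notes only give the package name ms-enclave; the import name `ms_enclave` and the `pip install ms-enclave` command are assumptions, and `is_installed` is a hypothetical helper, not part of this project.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the ms-enclave package is importable as "ms_enclave".
if not is_installed("ms_enclave"):
    print("ms-enclave not found; install it first, e.g. `pip install ms-enclave`")
```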

What's Changed

[Benchmark] add Multi-IF by @Yunnglin in #822
Add ai2d_adapter and real_world_qa_adapter by @mushenL in #824
[Benchmark] Add health bench by @Yunnglin in #826
fix: make _temp_run top-level to resolve M1 pickle error by @MemoryIt in #827
[Fix] vlm tokenize by @Yunnglin in #829
[Doc] update qwen next doc by @Yunnglin in #832
[Fix] fix bfcl-v3 score by @Yunnglin in #833
[Benchmark] Add MMBench and MMStar by @mushenL in #834
[Benchmark] Add Omnibench by @Yunnglin in #837
[Fix] Fix bfcl validation error by @Yunnglin in #838
[Feature] add docker sandbox by @Yunnglin in #835
[Fix] Fix thread pool error by @Yunnglin in #841
[Benchmark] Add amc23 and OlympiadBench by @mushenL in #840
[Benchmark] add minerva-math by @Yunnglin in #846

New Contributors

@MemoryIt made their first contribution in #827

Full Changelog: v1.0.1...v1.0.2

Contributors

Yunnglin, MemoryIt, and mushenL
