Merged
1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -393,6 +393,7 @@ App parameters inherit from [deployment parameters](#部署参数) and [Web-UI parameters](#Web-UI参数)

Evaluation parameters inherit from [deployment parameters](#部署参数).

- 🔥eval_backend: Evaluation backend; defaults to Native, and can also be set to OpenCompass or VLMEvalKit.
Collaborator

Please also add this to the English docs.

Collaborator Author

done

- 🔥eval_dataset: Evaluation dataset; see the [evaluation documentation](./评测.md).
- eval_limit: Number of samples drawn from each evaluation set; defaults to None.
- eval_output_dir: Folder for storing evaluation results; defaults to 'eval_output'.
100 changes: 64 additions & 36 deletions docs/source/Instruction/评测.md
@@ -6,40 +6,54 @@ SWIFT supports eval (evaluation) capabilities to provide standardized evaluation metrics for both original and trained models

SWIFT's eval capability uses the [evaluation framework EvalScope](https://github.com/modelscope/eval-scope) from the ModelScope community, with high-level encapsulation to support the evaluation needs of various models.

Currently, we support the evaluation workflow for **standard evaluation sets** as well as for **user-defined** evaluation sets. The **standard evaluation sets** include:

> Note: EvalScope supports many other advanced capabilities, such as model performance evaluation; please use the EvalScope framework directly for those.

Pure text evaluation:
```text
'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
For details on these datasets, see: https://hub.opencompass.org.cn/home

Multimodal evaluation:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME',
'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN',
'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11',
'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST',
'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL',
'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
For details on these datasets, see: https://github.com/open-compass/VLMEvalKit
> Note: EvalScope supports many other advanced capabilities, such as [model performance evaluation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/quick_start.html); please use the EvalScope framework directly for those.

Currently, we support the evaluation workflow for **standard evaluation datasets** as well as for **user-defined** evaluation datasets. The **standard evaluation datasets** are supported by three evaluation backends:

The supported dataset names are listed below; for detailed information on the datasets, please refer to [all supported datasets](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset.html).

1. Native (default):

Primarily supports pure text evaluation, while **supporting** visualization of evaluation results.
```text
'arc', 'bbh', 'ceval', 'cmmlu', 'competition_math',
'general_qa', 'gpqa', 'gsm8k', 'hellaswag', 'humaneval',
'ifeval', 'iquiz', 'mmlu', 'mmlu_pro',
'race', 'trivia_qa', 'truthful_qa'
```

2. OpenCompass:

Primarily supports pure text evaluation, and currently **does not support** visualization of evaluation results.
```text
'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```

3. VLMEvalKit:

Primarily supports multimodal evaluation, and currently **does not support** visualization of evaluation results (see the example after the dataset list below).
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME',
'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN',
'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11',
'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST',
'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL',
'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
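
For a dataset from the multimodal list above, the `--eval_backend` flag must be set to VLMEvalKit, since Native is the default. A minimal sketch, mirroring the `examples/eval/vlm/eval.sh` script updated in this PR (the model and dataset names are simply the ones used there):

```shell
swift eval \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --eval_backend VLMEvalKit \
    --infer_backend pt \
    --eval_limit 100 \
    --eval_dataset realWorldQA
```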

## Environment Preparation

@@ -59,9 +59,23 @@ pip install -e '.[eval]'

Four evaluation methods are supported: pure text evaluation, multimodal evaluation, URL evaluation, and custom dataset evaluation (see the Python sketch of URL evaluation at the end of this section).

Evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).
**Basic Example**

```shell
swift eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--eval_backend Native \
--infer_backend pt \
--eval_limit 10 \
--eval_dataset gsm8k
```
Where:
- eval_backend: options are Native, OpenCompass, or VLMEvalKit
- infer_backend: options are pt, vllm, or lmdeploy

For the full list of evaluation parameters, see [here](命令行参数.md#评测参数).

The list of evaluation parameters can be found [here](命令行参数.md#评测参数).
More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).
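
URL evaluation can also be driven from Python. The sketch below follows `examples/eval/eval_url/demo.py` from this PR: it deploys the model locally and then points the evaluator at the resulting URL (the `swift.llm` import path is an assumption; the argument values are the ones used in that demo):

```python
# Minimal sketch based on examples/eval/eval_url/demo.py:
# deploy a model, then evaluate it through the deployed URL.
from swift.llm import DeployArguments, EvalArguments, eval_main, run_deploy

deploy_args = DeployArguments(
    model='Qwen/Qwen2.5-1.5B-Instruct', verbose=False, log_interval=-1, infer_backend='vllm')

with run_deploy(deploy_args, return_url=True) as url:
    # 'arc' is a Native-backend dataset; see the dataset lists above.
    eval_main(EvalArguments(model='Qwen2.5-1.5B-Instruct', eval_url=url, eval_dataset=['arc']))
```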

## Custom Evaluation Sets

1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -396,6 +396,7 @@ App parameters inherit from [deployment arguments](#deployment-arguments) and [W

Evaluation Arguments inherit from the [deployment arguments](#deployment-arguments).

- 🔥eval_backend: Evaluation backend; the default is Native, and it can also be set to OpenCompass or VLMEvalKit.
- 🔥eval_dataset: Evaluation dataset, refer to [Evaluation documentation](./Evaluation.md).
- eval_limit: Number of samples for each evaluation set, default is None.
- eval_output_dir: Folder for storing evaluation results, default is 'eval_output'.
110 changes: 69 additions & 41 deletions docs/source_en/Instruction/Evaluation.md
@@ -1,53 +1,67 @@
# Evaluation

SWIFT supports eval(evaluation) capabilities to provide standardized assessment metrics for both the raw model and the trained model.
SWIFT supports eval (evaluation) capabilities to provide standardized evaluation metrics for both raw models and trained models.

## Capability Introduction

SWIFT's eval capability uses the [evaluation framework EvalScope](https://github.com/modelscope/eval-scope) from the ModelScope community, with high-level encapsulation to meet various model evaluation needs.

Currently, we support evaluation processes for **standard evaluation sets** as well as **user-defined** evaluation sets. The **standard evaluation sets** include:

> Note: EvalScope supports many other complex capabilities, such as model performance evaluation. Please use the EvalScope framework directly for those features.

Pure Text Evaluation:
```text
'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
For detailed information on the datasets, please visit: https://hub.opencompass.org.cn/home

Multimodal Evaluation:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME',
'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN',
'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11',
'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST',
'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL',
'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
For detailed information on the datasets, please visit: https://github.com/open-compass/VLMEvalKit
SWIFT's eval capability uses the [EvalScope evaluation framework](https://github.com/modelscope/eval-scope) from the ModelScope community, which provides high-level encapsulation to support the evaluation needs of various models.

> Note: EvalScope supports many other complex capabilities, such as [model performance evaluation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html), so please use the EvalScope framework directly.

Currently, we support the evaluation workflow for **standard evaluation datasets** as well as for **user-defined** evaluation datasets. The **standard evaluation datasets** are supported by three evaluation backends:

Below are the names of the supported datasets. For detailed information on the datasets, please refer to [all supported datasets](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).

1. Native (default):

Primarily supports pure text evaluation, while **supporting** visualization of evaluation results.
```text
'arc', 'bbh', 'ceval', 'cmmlu', 'competition_math',
'general_qa', 'gpqa', 'gsm8k', 'hellaswag', 'humaneval',
'ifeval', 'iquiz', 'mmlu', 'mmlu_pro',
'race', 'trivia_qa', 'truthful_qa'
```

2. OpenCompass:

Primarily supports pure text evaluation, currently **does not support** visualization of evaluation results.
```text
'obqa', 'cmb', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada',
'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze',
'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval',
'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench',
'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```

3. VLMEvalKit:

Primarily supports multimodal evaluation and currently **does not support** visualization of evaluation results (see the example after the dataset list below).
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN',
'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11',
'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2',
'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL',
'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar',
'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL',
'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI',
'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST',
'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500',
'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL',
'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMDU', 'MMBench-Video', 'Video-MME',
'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN',
'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11',
'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST',
'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL',
'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
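
As noted above, a VLMEvalKit dataset requires selecting that backend explicitly, since Native is the default. A minimal sketch of such a run, following the `examples/eval/vlm/eval.sh` change in this PR (model and dataset names are just the ones used in that script):

```shell
swift eval \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --eval_backend VLMEvalKit \
    --infer_backend pt \
    --eval_limit 100 \
    --eval_dataset realWorldQA
```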

## Environment Preparation

```shell
pip install ms-swift[eval] -U
```

Or install from the source code:
Or install from source:

```shell
git clone https://github.com/modelscope/ms-swift.git
@@ -57,11 +57,25 @@ pip install -e '.[eval]'

## Evaluation

We support four types of evaluation: pure text evaluation, multimodal evaluation, URL evaluation, and custom dataset evaluation.
Four evaluation methods are supported: pure text evaluation, multimodal evaluation, URL evaluation, and custom dataset evaluation (see the Python sketch of URL evaluation at the end of this section).

**Basic Example**

```shell
swift eval \
--model Qwen/Qwen2.5-0.5B-Instruct \
--eval_backend Native \
--infer_backend pt \
--eval_limit 10 \
--eval_dataset gsm8k
```
Where:
- eval_backend: options are Native, OpenCompass, VLMEvalKit
- infer_backend: options are pt, vllm, lmdeploy

For sample evaluations, please refer to [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).
For a specific list of evaluation parameters, please refer to [here](./Command-line-parameters.md#evaluation-arguments).

The list of evaluation parameters can be found [here](Command-line-parameters#评测参数).
More evaluation examples can be found in [examples](https://github.com/modelscope/ms-swift/tree/main/examples/eval).
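
URL evaluation can also be scripted in Python against an already-running `swift deploy` server. A minimal sketch (the `swift.llm` import path is assumed and the server URL below is only a placeholder), reusing the names from `examples/eval/eval_url/demo.py` in this PR:

```python
# Sketch: evaluate an already-deployed model through its deployment URL.
from swift.llm import EvalArguments, eval_main

eval_main(EvalArguments(
    model='Qwen2.5-1.5B-Instruct',        # display name of the deployed model
    eval_url='http://127.0.0.1:8000/v1',  # placeholder; use the URL of your running server
    eval_dataset=['arc'],                 # a Native-backend dataset
))
```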

## Custom Evaluation Sets

2 changes: 1 addition & 1 deletion examples/eval/eval_url/demo.py
@@ -11,4 +11,4 @@
with run_deploy(
DeployArguments(model='Qwen/Qwen2.5-1.5B-Instruct', verbose=False, log_interval=-1, infer_backend='vllm'),
return_url=True) as url:
eval_main(EvalArguments(model='Qwen2.5-1.5B-Instruct', eval_url=url, eval_dataset=['ARC_c']))
eval_main(EvalArguments(model='Qwen2.5-1.5B-Instruct', eval_url=url, eval_dataset=['arc']))
3 changes: 2 additions & 1 deletion examples/eval/vlm/eval.sh
@@ -4,4 +4,5 @@ swift eval \
--model Qwen/Qwen2-VL-2B-Instruct \
--infer_backend pt \
--eval_limit 100 \
--eval_dataset realWorldQA
--eval_dataset realWorldQA \
--eval_backend VLMEvalKit
41 changes: 18 additions & 23 deletions swift/llm/argument/eval_args.py
@@ -2,7 +2,7 @@
import datetime as dt
import os
from dataclasses import dataclass, field
from typing import List, Optional
from typing import List, Literal, Optional

from swift.utils import get_logger
from .base_args import to_abspath
@@ -29,7 +29,7 @@ class EvalArguments(DeployArguments):
eval_dataset: List[str] = field(default_factory=list)
eval_limit: Optional[int] = None
eval_output_dir: str = 'eval_output'

eval_backend: Literal['Native', 'OpenCompass', 'VLMEvalKit'] = 'Native'
local_dataset: bool = False

temperature: Optional[float] = 0.
@@ -53,38 +53,33 @@ def __post_init__(self):

@staticmethod
def list_eval_dataset():
from evalscope.constants import EvalBackend
from evalscope.benchmarks.benchmark import BENCHMARK_MAPPINGS
from evalscope.backend.opencompass import OpenCompassBackendManager
from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager
return {
'opencompass': OpenCompassBackendManager.list_datasets(),
'vlmeval': VLMEvalKitBackendManager.list_supported_datasets()
EvalBackend.NATIVE: list(BENCHMARK_MAPPINGS.keys()),
EvalBackend.OPEN_COMPASS: OpenCompassBackendManager.list_datasets(),
EvalBackend.VLM_EVAL_KIT: VLMEvalKitBackendManager.list_supported_datasets()
}

def _init_eval_dataset(self):
if isinstance(self.eval_dataset, str):
self.eval_dataset = [self.eval_dataset]

eval_dataset = self.list_eval_dataset()
from evalscope.backend.opencompass import OpenCompassBackendManager
from evalscope.backend.vlm_eval_kit import VLMEvalKitBackendManager
self.opencompass_dataset = set(eval_dataset['opencompass'])
self.vlmeval_dataset = set(eval_dataset['vlmeval'])
eval_dataset_mapping = {dataset.lower(): dataset for dataset in self.opencompass_dataset | self.vlmeval_dataset}
self.eval_dataset_oc = []
self.eval_dataset_vlm = []
all_eval_dataset = self.list_eval_dataset()
dataset_mapping = {dataset.lower(): dataset for dataset in all_eval_dataset[self.eval_backend]}
valid_dataset = []
for dataset in self.eval_dataset:
dataset = eval_dataset_mapping.get(dataset.lower(), dataset)
if dataset in self.opencompass_dataset:
self.eval_dataset_oc.append(dataset)
elif dataset in self.vlmeval_dataset:
self.eval_dataset_vlm.append(dataset)
else:
raise ValueError(f'eval_dataset: {dataset} is not supported.\n'
f'opencompass_dataset: {OpenCompassBackendManager.list_datasets()}.\n\n'
f'vlmeval_dataset: {VLMEvalKitBackendManager.list_supported_datasets()}.')
if dataset.lower() not in dataset_mapping:
raise ValueError(
f'eval_dataset: {dataset} is not supported.\n'
f'eval_backend: {self.eval_backend} supported datasets: {all_eval_dataset[self.eval_backend]}')
valid_dataset.append(dataset_mapping[dataset.lower()])
self.eval_dataset = valid_dataset

logger.info(f'opencompass dataset: {self.eval_dataset_oc}')
logger.info(f'vlmeval dataset: {self.eval_dataset_vlm}')
logger.info(f'eval_backend: {self.eval_backend}')
logger.info(f'eval_dataset: {self.eval_dataset}')

def _init_result_path(self, folder_name: str) -> None:
self.time = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
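
A possible usage sketch for the new `list_eval_dataset` helper introduced above (the import path is assumed from this file's location, `swift/llm/argument/eval_args.py`, and may differ if the class is re-exported elsewhere):

```python
# Sketch: print which datasets each evaluation backend accepts.
from swift.llm.argument.eval_args import EvalArguments

for backend, datasets in EvalArguments.list_eval_dataset().items():
    print(backend, len(datasets), 'datasets')
```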