# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The gold standard for such evaluation is human preference, but preference annotation is expensive.

To probe a model's subjective capabilities, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method compares model responses pairwise to calculate their win rates ([Chatbot Arena](https://chat.lmsys.org/)).

Based on this method, we support using GPT-4 for subjective evaluation of models.
## Data Preparation

We provide a demo test set, [subjective_demo.xlsx](https://opencompass.openxlab.space/utils/subjective_demo.xlsx), based on [z-bench](https://github.com/zhenbench/z-bench).

Store the set of subjective questions in .xlsx format in the `data/subjective/` directory.

The table includes the following fields:

- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': The capability dimension of the question
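As a quick way to sanity-check the format, you can build such a question set with pandas. This is a minimal sketch, not part of the demo set: the file name and the example row are made up for illustration, and it assumes `pandas` and `openpyxl` are installed.

```python
# Minimal sketch: write a question set with the required fields.
# The example row and file name are illustrative placeholders;
# the data/subjective/ directory must already exist.
import pandas as pd

questions = pd.DataFrame({
    'question': ['Write a short poem about autumn.'],
    'index': [1],
    'reference_answer': ['(open-ended; no single reference answer)'],
    'evaluating_guidance': ['Judge fluency, imagery, and adherence to the theme.'],
    'capability': ['common'],
})
questions.to_excel('data/subjective/my_questions.xlsx', index=False)
```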
## Evaluation Configuration

The process consists of three steps:

1. Model response inference
2. GPT-4 pairwise evaluation
3. Generating the evaluation report

In `config/subjective.py`, we provide annotations to help users understand the configuration file:
```python
# Import datasets and the subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import the partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask


# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b,
# and the evaluation model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # alternately constructs the pairwise comparisons
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
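The `models` list is elided above. For reference, a single entry might look like the following minimal sketch; the field names follow the `HuggingFaceCausalLM` configs used elsewhere in OpenCompass, but the concrete paths and sizes here are placeholder assumptions, not part of the original config.

```python
# Hypothetical single entry for `models`; values are placeholders.
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-hf',
        path='THUDM/chatglm2-6b',            # placeholder model path
        tokenizer_path='THUDM/chatglm2-6b',  # placeholder tokenizer path
        max_out_len=1024,                    # placeholder generation length
        max_seq_len=2048,                    # placeholder context length
        batch_size=8,                        # placeholder batch size
        run_cfg=dict(num_gpus=1),
    ),
    # ... add qwen-7b-chat and internlm-chat-7b entries analogously
]
```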
## Launching the Evaluation

```shell
python run.py config/subjective.py -r
```

The `-r` parameter allows reusing existing model inference and GPT-4 evaluation results.
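If you want to resume from a particular earlier run rather than the latest one, `-r` can also be pointed at a previous run; this assumes the usual timestamped output directories, and the timestamp below is a placeholder.

```shell
# Reuse the outputs of a specific earlier run (placeholder timestamp)
python run.py config/subjective.py -r 20230801_000000
```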
## Evaluation Report

The evaluation report is written to `output/.../summary/timestamp/report.md`. It includes win-rate statistics, battle scores, and ELO ratings, in the following format:
```markdown
# Subjective Analysis

A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%

### Basic statistics (4 stats: win / tie / lose / not bad)

| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf                | qwen-7b-chat-hf              | internlm-chat-7b-hf           |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall                     | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN                          | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN                          | N/A                           | N/A                          | N/A                           |
| CAPA: common                      | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |

![Capabilities Dimension Classification Result](by_capa.png)

![Language Classification Result](by_lang.png)

### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)

| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall     | -8             | 0               | -8                  |
| LANG: CN          | -8             | 0               | -8                  |
| LANG: EN          | N/A            | N/A             | N/A                 |
| CAPA: common      | -8             | 0               | -8                  |

### Bootstrap ELO, Median of n=1000 times

|                  | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504        | 999.912             | 1000.26         |
| elo_score [Std]  | 0.621362       | 0.400226            | 0.694434        |
```
When comparing models A and B, the judge has four choices:

1. A is better than B.
2. A and B are equally good.
3. A is worse than B.
4. Neither A nor B is good.

Accordingly, `win` / `tie` / `lose` / `not bad` denote the proportions of comparisons in which the model wins, ties, loses, or either wins or is judged equally good, respectively.
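To make the score rule in the report concrete (win +3, both good +1, neither good -1, lose -3, base score 0), here is a toy tally; the outcome counts in the example are invented for illustration.

```python
# Toy tally of the battle score rule described in the report.
SCORE = {'win': 3, 'both': 1, 'neither': -1, 'lose': -3}

def battle_score(outcomes):
    """outcomes: iterable of 'win'/'both'/'neither'/'lose' judgments for one model."""
    return sum(SCORE[o] for o in outcomes)

# Invented example: 4 wins, 2 both-good, 1 neither-good, 3 losses
# -> 4*3 + 2*1 - 1*1 - 3*3 = 4
print(battle_score(['win'] * 4 + ['both'] * 2 + ['neither'] + ['lose'] * 3))
```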
`Bootstrap ELO` is the median ELO score obtained by replaying the match results in 1000 random orders.
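Sequential ELO updates depend on match order, which is why the bootstrap shuffles that order and takes the median. The sketch below shows one way such a computation could look; it is an illustration under standard ELO assumptions (K-factor 32, base rating 1000), not the OpenCompass implementation.

```python
import random
import statistics

def elo_ratings(matches, k=32.0, base=1000.0):
    """Sequential ELO updates over (model_a, model_b, winner) records,
    where winner is 'A', 'B', or 'tie'."""
    ratings = {}
    for a, b, winner in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        score_a = {'A': 1.0, 'B': 0.0, 'tie': 0.5}[winner]
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

def bootstrap_elo(matches, n=1000, seed=0):
    """Median ELO per model over n random permutations of the match order."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n):
        shuffled = list(matches)
        rng.shuffle(shuffled)
        for model, score in elo_ratings(shuffled).items():
            samples.setdefault(model, []).append(score)
    return {m: statistics.median(s) for m, s in samples.items()}
```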