
Zero scores on cnn-dm benchmark from HELM #188

Closed

hicleo opened this issue May 13, 2024 · 3 comments
hicleo commented May 13, 2024

When running the evaluation of Sheared-LLaMA-1.3B and the original LLaMA-7B on helm|summarization:cnn-dm, I get all-zero scores:

accelerate launch --multi_gpu --num_processes=3 run_evals_accelerate.py \
    --model_args "pretrained=princeton-nlp/Sheared-LLaMA-1.3B,model_parallel=True" \
    --task "helm|summarization:cnn-dm|0|0" --override_batch_size 1 --output_dir "./evals/"

Output:

|           Task            |Version|         Metric          |Value|   |Stderr|
|---------------------------|------:|-------------------------|----:|---|-----:|
|all                        |       |rouge1                   |    0|±  |     0|
|                           |       |rouge2                   |    0|±  |     0|
|                           |       |rougeL                   |    0|±  |     0|
|                           |       |summac                   |    0|±  |     0|
|                           |       |summarization_coverage   |    0|±  |     0|
|                           |       |summarization_density    |    0|±  |     0|
|                           |       |summarization_compression|    0|±  |     0|
|helm:summarization:cnn-dm:0|      0|rouge1                   |    0|±  |     0|
|                           |       |rouge2                   |    0|±  |     0|
|                           |       |rougeL                   |    0|±  |     0|
|                           |       |summac                   |    0|±  |     0|
|                           |       |summarization_coverage   |    0|±  |     0|
|                           |       |summarization_density    |    0|±  |     0|
|                           |       |summarization_compression|    0|±  |     0|
hicleo commented May 27, 2024

This seems to be related to the task_prompt_formatting. Changing the prompt as follows fixes the issue:

instruction="### Instruction: Summarize the following passage in 3 sentences.\n",
query=f"### Instruction: Summarize the following passage in 3 sentences.\n### Passage: {line['article']}\n### Summary: ",
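
For context, a minimal sketch of how this change might sit inside a lighteval-style prompt function. The function name, the Doc import path, and the reference-summary field (line['highlights'], per the cnn_dailymail dataset) are assumptions based on lighteval's prompt-function pattern, not copied from the actual fix:

from lighteval.tasks.requests import Doc

def cnn_dm(line, task_name: str = None):
    # Prefix both the instruction and the query so the model sees an explicit
    # "### Summary:" cue; a likely cause of the zero scores is the model never
    # producing a summary where the metrics expect one.
    instruction = "### Instruction: Summarize the following passage in 3 sentences.\n"
    return Doc(
        task_name=task_name,
        instruction=instruction,
        query=f"{instruction}### Passage: {line['article']}\n### Summary: ",
        choices=[str(line["highlights"])],  # assumed reference-summary field
        gold_index=0,
    )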

hicleo closed this as completed May 27, 2024

clefourrier commented May 27, 2024

Thanks so much for debugging! Would you be OK with opening a PR to share this fix with the community?


hicleo commented Jun 13, 2024

I'm not sure whether this depends on the model being evaluated: when I use another fine-tuned model, the original code seems fine. It may be necessary to adjust task_prompt_formatting ourselves according to each model's requirements.
