
Evaluating llama2-7b-base and llama2-7b-chat on MMLU with evaluate.py gives results far below the published scores #1232

Closed
Jeryi-Sun opened this issue Oct 19, 2023 · 13 comments
Labels
solved This problem has been already solved

Comments

@Jeryi-Sun
[screenshots: evaluation results well below the published MMLU scores]
python src/evaluate.py \
    --model_name_or_path /llama-2-hf/Llama-2-7b-hf/ \
    --finetuning_type none \
    --template llama2 \
    --task mmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 16
@luckfu

luckfu commented Oct 19, 2023

This is... (speechless)

@lsyysl9711

I ran into a similar problem. It wasn't on this dataset, but the results from fine-tuning with Llama2's official code are far better than those based on this repo, and I haven't found the cause yet.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Oct 20, 2023
@hiyouga
Owner

hiyouga commented Oct 20, 2023

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

@Jeryi-Sun
Author

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

So how should we evaluate LLaMA? There is a lot of debate about this online; could you please look into a fix?

@Jeryi-Sun
Author

Jeryi-Sun commented Oct 20, 2023

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

If we evaluate with https://github.com/EleutherAI/lm-evaluation-harness, I see that their library does not provide a template for llama2 the way LLaMA-Factory does. Could that make the evaluation inaccurate? Thank you!

register_template(
    name="llama2",
    prefix=[
        "<<SYS>>\n{{system}}\n<</SYS>>\n\n"
    ],
    prompt=[
        "[INST] {{query}} [/INST] "
    ],
    system=(
        "You are a helpful, respectful and honest assistant. "
        "Always answer as helpfully as possible, while being safe.  "
        "Your answers should not include any harmful, unethical, "
        "racist, sexist, toxic, dangerous, or illegal content. "
        "Please ensure that your responses are socially unbiased and positive in nature.\n\n"
        "If a question does not make any sense, or is not factually coherent, "
        "explain why instead of answering something not correct. "
        "If you don't know the answer to a question, please don't share false information."
    ),
    sep=[]
)
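For reference, concatenating the prefix and prompt fields above yields the final string fed to the model. A minimal sketch that re-implements the substitution in plain Python (the `render_llama2` helper and the shortened system prompt are illustrative, not LLaMA-Factory's actual code):

```python
# Illustrative expansion of the "llama2" template registered above:
# {{system}} fills the prefix, {{query}} fills the prompt, and the two
# are concatenated. Not LLaMA-Factory's actual rendering code.

SYSTEM = "You are a helpful, respectful and honest assistant."

def render_llama2(query: str, system: str = SYSTEM) -> str:
    prefix = f"<<SYS>>\n{system}\n<</SYS>>\n\n"  # the template's prefix
    prompt = f"[INST] {query} [/INST] "          # the template's prompt
    return prefix + prompt

print(render_llama2("What is the capital of France?"))
```

Note that Meta's reference chat format places the `<<SYS>>` block inside the first `[INST]` turn, so whether this registration matches the harness's expectations is exactly the kind of discrepancy being discussed here.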

@hiyouga
Owner

hiyouga commented Oct 20, 2023

Few-shot evaluation does not need a chat template; in this project, few-shot runs should also use the vanilla template (but my test just now also came out low — fixing it now)
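For context, a vanilla few-shot prompt is just plain concatenation of solved examples followed by the test question, with no chat markup at all. A minimal sketch of that assembly (the helper name and the MMLU-style A/B/C/D layout are illustrative, not the project's actual code):

```python
def build_fewshot_prompt(support, question, choices):
    """Concatenate solved examples and the test question as plain text.

    A base model is scored on the token that follows the final "Answer:",
    which is why few-shot evaluation uses the vanilla (no-op) template
    rather than a chat template.
    """
    def block(q, ch):
        opts = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", ch))
        return f"{q}\n{opts}\nAnswer:"

    shots = "\n\n".join(block(q, ch) + f" {ans}" for q, ch, ans in support)
    return shots + "\n\n" + block(question, choices)

support = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
prompt = build_fewshot_prompt(support, "3 + 3 = ?", ["5", "6", "7", "8"])
print(prompt)
```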

@Jeryi-Sun
Author

Got it, thanks!

@hiyouga
Owner

hiyouga commented Oct 20, 2023

I remember now: LLaMA2 has an overflow problem with left padding. My oversight — let me fix it.
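In a nutshell: when sequences in a batch are padded, the logits that predict the answer token must be read from the position of the last real (non-pad) token, and that index depends on the padding side. A toy sketch with plain Python lists (illustrative only; the actual bug referred to here is a numeric overflow in LLaMA2's fp16 forward pass under left padding, per the linked lm-evaluation-harness issue):

```python
PAD = 0  # assumed pad id; real tokens here are non-zero for simplicity

def last_real_index(ids, padding_side):
    """Position whose logits predict the next (answer) token."""
    n_pad = ids.count(PAD)
    if padding_side == "right":
        # Real tokens first: step back over the trailing pads.
        return len(ids) - n_pad - 1
    # Left padding: pads come first, so the last slot is always real,
    # which is why evaluators prefer it for batched scoring.
    return len(ids) - 1

right = [7, 8, 9, PAD, PAD]  # 3 real tokens, padded to length 5
left  = [PAD, PAD, 7, 8, 9]
print(right[last_real_index(right, "right")], left[last_real_index(left, "left")])
```

Left padding keeps the answer position fixed at the end of the batch, but only if the model's attention masking stays numerically stable over the padded prefix — which is where LLaMA2 in fp16 went wrong.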

@hiyouga
Owner

hiyouga commented Oct 20, 2023

python src/evaluate.py \
    --model_name_or_path llama2-7b \
    --task mmlu \
    --split test \
    --lang en \
    --template vanilla \
    --n_shot 5

[screenshot: corrected MMLU results]

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Oct 20, 2023
@ShunLu91

ShunLu91 commented Apr 19, 2024

Thanks for the nice contributions.
I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?
[screenshot: evaluation results, average accuracy 43.04]

@Zkli-hub

Thanks for the nice contributions. I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?

Did you solve it? I ran into the same problem.

@ShunLu91

Thanks for the nice contributions. I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?

Did you solve it? I ran into the same problem.

I did not solve this issue. But I found that lm-evaluation-harness can achieve a reasonable result.

@DaozeZhang

Few-shot evaluation does not need a chat template; in this project, few-shot runs should also use the vanilla template (but my test just now also came out low — fixing it now)

If I'm evaluating the original chatGLM3, or a fine-tuned chatGLM3, should the template stay consistent with training and be passed as chatglm3 (following the README's "be sure to use exactly the same template for training and inference"), or should it be changed to vanilla or something else?
