
Evaluating llama2-7b-base and llama2-7b-chat on MMLU with evaluate.py gives results far below the published scores #1232

Closed
Jeryi-Sun opened this issue Oct 19, 2023 · 13 comments
Labels
solved This problem has been already solved

Comments

@Jeryi-Sun
[screenshots: evaluation results well below the published MMLU scores]
python src/evaluate.py \
    --model_name_or_path /llama-2-hf/Llama-2-7b-hf/ \
    --finetuning_type none \
    --template llama2 \
    --task mmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 16
@luckfu

luckfu commented Oct 19, 2023

This is... (speechless)

@lsyysl9711

I ran into a similar problem. It wasn't on this dataset, but the results from fine-tuning with Llama2's official code are far better than those based on this repo, and I haven't found the cause yet.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Oct 20, 2023
@hiyouga
Owner

hiyouga commented Oct 20, 2023

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

@Jeryi-Sun
Author

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

So how should we evaluate LLaMA? There is a lot of debate about this online; could you please look into a fix?

@Jeryi-Sun
Author

Jeryi-Sun commented Oct 20, 2023

My test of baichuan2's predictions came out normal; the LLaMA problem may be due to EleutherAI/lm-evaluation-harness#531 (comment)

If we evaluate with https://github.com/EleutherAI/lm-evaluation-harness, I see that their library does not provide a template for llama2 the way LLaMA-Factory does. Could that make the evaluation inaccurate? Thank you!

register_template(
    name="llama2",
    prefix=[
        "<<SYS>>\n{{system}}\n<</SYS>>\n\n"
    ],
    prompt=[
        "[INST] {{query}} [/INST] "
    ],
    system=(
        "You are a helpful, respectful and honest assistant. "
        "Always answer as helpfully as possible, while being safe.  "
        "Your answers should not include any harmful, unethical, "
        "racist, sexist, toxic, dangerous, or illegal content. "
        "Please ensure that your responses are socially unbiased and positive in nature.\n\n"
        "If a question does not make any sense, or is not factually coherent, "
        "explain why instead of answering something not correct. "
        "If you don't know the answer to a question, please don't share false information."
    ),
    sep=[]
)
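For reference, concatenating the prefix and prompt fields above yields the final string fed to the model. A minimal sketch that re-implements the substitution in plain Python (the `render_llama2` helper and the shortened system prompt are illustrative, not LLaMA-Factory's actual code):

```python
# Illustrative expansion of the "llama2" template registered above:
# {{system}} fills the prefix, {{query}} fills the prompt, and the two
# are concatenated. Not LLaMA-Factory's actual rendering code.

SYSTEM = "You are a helpful, respectful and honest assistant."

def render_llama2(query: str, system: str = SYSTEM) -> str:
    prefix = f"<<SYS>>\n{system}\n<</SYS>>\n\n"  # the template's prefix
    prompt = f"[INST] {query} [/INST] "          # the template's prompt
    return prefix + prompt

print(render_llama2("What is the capital of France?"))
```

Note that Meta's reference chat format places the `<<SYS>>` block inside the first `[INST]` turn, so whether this registration matches the harness's expectations is exactly the kind of discrepancy being discussed here.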

@hiyouga
Owner

hiyouga commented Oct 20, 2023

Few-shot evaluation does not need a chat template; in this project, few-shot runs should also use the vanilla template (but my test just now also came out low — fixing it now)
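For context, a vanilla few-shot prompt is just plain concatenation of solved examples followed by the test question, with no chat markup at all. A minimal sketch of that assembly (the helper name and the MMLU-style A/B/C/D layout are illustrative, not the project's actual code):

```python
def build_fewshot_prompt(support, question, choices):
    """Concatenate solved examples and the test question as plain text.

    A base model is scored on the token that follows the final "Answer:",
    which is why few-shot evaluation uses the vanilla (no-op) template
    rather than a chat template.
    """
    def block(q, ch):
        opts = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", ch))
        return f"{q}\n{opts}\nAnswer:"

    shots = "\n\n".join(block(q, ch) + f" {ans}" for q, ch, ans in support)
    return shots + "\n\n" + block(question, choices)

support = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
prompt = build_fewshot_prompt(support, "3 + 3 = ?", ["5", "6", "7", "8"])
print(prompt)
```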

@Jeryi-Sun
Author

Got it, thanks!

@hiyouga
Owner

hiyouga commented Oct 20, 2023

I remember now: LLaMA2 has an overflow problem with left padding. My oversight — let me fix it.
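In a nutshell: when sequences in a batch are padded, the logits that predict the answer token must be read from the position of the last real (non-pad) token, and that index depends on the padding side. A toy sketch with plain Python lists (illustrative only; the actual bug referred to here is a numeric overflow in LLaMA2's fp16 forward pass under left padding, per the linked lm-evaluation-harness issue):

```python
PAD = 0  # assumed pad id; real tokens here are non-zero for simplicity

def last_real_index(ids, padding_side):
    """Position whose logits predict the next (answer) token."""
    n_pad = ids.count(PAD)
    if padding_side == "right":
        # Real tokens first: step back over the trailing pads.
        return len(ids) - n_pad - 1
    # Left padding: pads come first, so the last slot is always real,
    # which is why evaluators prefer it for batched scoring.
    return len(ids) - 1

right = [7, 8, 9, PAD, PAD]  # 3 real tokens, padded to length 5
left  = [PAD, PAD, 7, 8, 9]
print(right[last_real_index(right, "right")], left[last_real_index(left, "left")])
```

Left padding keeps the answer position fixed at the end of the batch, but only if the model's attention masking stays numerically stable over the padded prefix — which is where LLaMA2 in fp16 went wrong.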

@hiyouga
Owner

hiyouga commented Oct 20, 2023

python src/evaluate.py \
    --model_name_or_path llama2-7b \
    --task mmlu \
    --split test \
    --lang en \
    --template vanilla \
    --n_shot 5

[screenshot: corrected MMLU results]

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Oct 20, 2023
@ShunLu91

ShunLu91 commented Apr 19, 2024

Thanks for the nice contributions.
I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?
[screenshot: evaluation results, average accuracy 43.04]

@Zkli-hub

Thanks for the nice contributions. I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?

Did you solve it? I ran into the same problem.

@ShunLu91

Thanks for the nice contributions. I strictly followed your suggestions and used Llama-2-7b-hf, but the evaluation still gets a lower average accuracy of 43.04. How can I improve the result?

Did you solve it? I ran into the same problem.

I did not solve this issue. But I found that lm-evaluation-harness can achieve a reasonable result.

@DaozeZhang

Few-shot evaluation does not need a chat template; in this project, few-shot runs should also use the vanilla template (but my test just now also came out low — fixing it now)

If I'm evaluating the original chatGLM3, or a fine-tuned chatGLM3, should the template stay consistent with training and be passed as chatglm3 (following the README's "be sure to use exactly the same template for training and inference"), or should it be changed to vanilla or something else?
