Using evaluate.py to evaluate llama2-7b-base and llama2-7b-chat on MMLU gives results far below the published benchmark scores #1232
Comments
This......
I ran into a similar problem. It wasn't this dataset, but the results from fine-tuning with Llama2's official code are far better than those from this repo, and I haven't found the cause yet.
My test of baichuan2's predictions came out normal. The LLaMA problem may be caused by EleutherAI/lm-evaluation-harness#531 (comment)
So how should we evaluate LLaMA? There is a lot of debate about this online; could you please look into a fix?
If using https://github.com/EleutherAI
Few-shot evaluation does not need a template, so few-shot runs in this project should also use the vanilla template (but I just tested that and the scores were still low; working on a fix)
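To illustrate what "vanilla template" means here, a minimal sketch follows. The function name and the `Question:`/`Answer:` labels are illustrative assumptions, not this repo's actual prompt format: the few-shot demonstrations and the test question are plainly concatenated, with no chat template wrapped around them.

```python
# Hypothetical sketch of a "vanilla" few-shot prompt: shots and the test
# question are concatenated as plain text, with no chat template applied.
# build_fewshot_prompt and the Question/Answer labels are illustrative only.
def build_fewshot_prompt(shots, question):
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in shots)
    return demos + f"Question: {question}\nAnswer:"

prompt = build_fewshot_prompt([("1+1?", "2")], "2+2?")
print(prompt)
```

A chat template, by contrast, would wrap each turn in model-specific control tokens (e.g. LLaMA-2's `[INST]` markers), which base models were never trained on.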
Great, thanks!
I just remembered: LLaMA2 has an overflow problem with left padding. My mistake; let me fix it.
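For context on why left padding is delicate for LLaMA-style models, here is an illustrative sketch (not this repo's code, and separate from the numeric-overflow bug mentioned above): with left padding, naive position ids `0..L-1` shift the real tokens, so position ids are commonly derived from the attention mask instead, cumulatively counting only non-pad tokens.

```python
# Illustrative sketch: deriving position ids from the attention mask so
# that left padding does not shift real tokens' positions. This mirrors
# the common cumsum(mask)-1 trick; it is not this repo's actual code.
def position_ids_from_mask(attention_mask):
    ids = []
    for row in attention_mask:
        running = 0
        out = []
        for m in row:
            if m:
                out.append(running)  # real token: next position
                running += 1
            else:
                out.append(0)        # pad token: masked out anyway
        ids.append(out)
    return ids

# Two pad tokens on the left; real tokens still get positions 0, 1, 2.
print(position_ids_from_mask([[0, 0, 1, 1, 1]]))  # [[0, 0, 0, 1, 2]]
```

Without this adjustment, a left-padded row would assign its first real token position 2 instead of 0, changing the rotary embeddings and hence the logits.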
I did not solve this issue, but I found that lm-evaluation-harness can achieve a reasonable result.
What about running eval on the original chatGLM3, or on a fine-tuned chatGLM3,