Regarding PPL and GEN modes, could you please provide more details or clarify your question? Thanks #646

niexufei · 2023-11-29T02:25:48Z

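Take a multiple-choice question as an example:

Question: What kind of animal is Xiaobai? A. Mouse B. Cow C. Tiger D. Rabbit

gen (generate) uses the question above as the prompt, lets the model continue writing, and extracts from the continuation which of A / B / C / D the answer is.

ppl (perplexity) gives the model four sentences:

Question: What kind of animal is Xiaobai? A. Mouse B. Cow C. Tiger D. Rabbit The answer is A
Question: What kind of animal is Xiaobai? A. Mouse B. Cow C. Tiger D. Rabbit The answer is B
Question: What kind of animal is Xiaobai? A. Mouse B. Cow C. Tiger D. Rabbit The answer is C
Question: What kind of animal is Xiaobai? A. Mouse B. Cow C. Tiger D. Rabbit The answer is D

It then checks which sentence the model agrees with most (the one with the lowest perplexity) and takes the A / B / C / D attached to that sentence as the model's answer.

Both gen and ppl end with one of A / B / C / D, which is compared with the reference answer to decide whether the question scores.

That is the background; here is the question:
In theory, should a model's ppl and gen scores on the same set of multiple-choice questions be the same? In actual testing, I have found cases where they differ. In such cases, even if the ppl score is high, can ppl represent the model's ability?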
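
For readers who want to see the ppl selection rule concretely, here is a minimal sketch. It assumes you already have per-token log-probabilities for each of the four candidate sentences; the function names `perplexity` and `pick_by_ppl` and the numbers in `toy_logprobs` are made up for illustration and are not taken from OpenCompass.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_by_ppl(candidate_logprobs):
    """Return the option label whose sentence has the lowest perplexity."""
    return min(candidate_logprobs,
               key=lambda label: perplexity(candidate_logprobs[label]))

# Hypothetical per-token log-probabilities for the four candidate sentences
# ("... The answer is A", "... The answer is B", and so on).
toy_logprobs = {
    "A": [-2.1, -1.9, -2.5],
    "B": [-2.0, -2.2, -2.4],
    "C": [-1.8, -2.6, -2.7],
    "D": [-1.2, -0.9, -1.1],
}

print(pick_by_ppl(toy_logprobs))  # D has the highest mean log-probability
```

Note that this rule always returns one of the four labels, which is not guaranteed when the answer has to be extracted from free-form generated text.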

Answered by Leymore

Leymore · 2023-11-29T02:53:02Z

Maintainer

In theory, the ppl and gen scores on multiple-choice questions are not necessarily the same. A language model does next-token prediction: in ppl mode the choice is restricted to A / B / C / D, whereas in gen mode the next token can be anything in the vocabulary.
When the model's instruction-following ability is weak, it may fail to output A / B / C / D at all; or, if the model was fine-tuned in an unusual way, it may output a long explanation first and only then A / B / C / D. Either case can change the extracted result and therefore the accuracy.
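
To illustrate how extraction in gen mode can go wrong, here is a small sketch. The extractor `extract_choice` is a deliberately naive, hypothetical rule (take the first standalone capital A, B, C, or D); it is not the postprocessor OpenCompass actually uses, and the sample outputs are invented.

```python
import re

def extract_choice(generated_text):
    """Return the first standalone A/B/C/D in the text, or None if absent."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None

# Hypothetical model outputs for the same question (reference answer: D).
outputs = [
    "D",                                               # clean answer
    "I think it is a rabbit.",                         # no option letter at all
    "A rabbit is a small mammal, so the answer is D.", # explanation first
]

for text in outputs:
    print(repr(text), "->", extract_choice(text))
```

With this rule, the second output yields no answer, and the third is extracted as A because the sentence happens to start with the article "A", even though the model ends on D. Neither failure can occur in ppl mode, where the model only ranks the four fixed candidates.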

Leymore · 2023-11-29T03:22:24Z

Maintainer

ppl can represent the model's ability on multiple-choice questions, on this dataset, under this mode of use.

Whether that ability can be extrapolated to other capabilities, or even used to judge whether a model is good or bad in general, depends on what you value.
