
Evaluation script for Huggingface Causal models #13

Open
wants to merge 5 commits into base: master

Conversation

@ollmer commented May 28, 2023

The script was written using the existing FLAN eval script as a reference.
Minor changes (a rough sketch follows the list):

  • load models in float16;
  • put the samples on the same device as the model;
  • use the last token of option tokenization instead of the first one;
  • save the aggregated results to a JSON file.
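A rough sketch of how the loading and device-placement changes above might look with the transformers API (illustrative only; the prompt is a placeholder and the exact code in the PR may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"
# Load the model in float16.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "..."  # five-shot MMLU prompt ending with "Answer:"
# Put the sample on the same device as the model before the forward pass.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]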

@ollmer (Author) commented May 28, 2023

I've used this script while working on the MMLU eval for LLaMA models in lm-evaluation-harness. I've replicated the numbers for two of the LLaMA models, 7B & 13B. Results are here.
Example usage, to reproduce LLaMA-7B metrics: python evaluate_hf.py -m huggyllama/llama-7b

@vince62s commented Jul 3, 2023

Hello @ollmer, it's me again (we discussed this on the other repo).
I think there is a mistake in this implementation for models using the GPT-2 pretokenization (gpt2_pretok); let me explain.

The prompt here:

def format_example(df, idx, include_answer=True):
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2
    for j in range(k):
        prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
    prompt += "\nAnswer:"
    if include_answer:
        prompt += " {}\n\n".format(df.iloc[idx, k + 1])
    return prompt

makes it clear that there is no space after "Answer:" when requesting the answer from the model.
However, since the five-shot examples do include a space before the answer, the model will generate this space before the A/B/C/D.

When using a BPE model, the generated token for " B" will be "ĠB", id 347.

The way you get the probs and compute the argmax is based on the logits of "A", "B", "C", "D" instead of the tokens with the space character " " in front of them.

I don't know if you kept the same logic when making your PR https://github.com/EleutherAI/lm-evaluation-harness/pull/497/files,
but I think it is wrong for BPE models using this special pretokenization, like XGen.
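A minimal check of this, assuming the transformers library, with the GPT-2 tokenizer standing in for any model that uses the gpt2_pretok pretokenization:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Without the leading space, the choice letter maps to a different token than
# the one the model actually generates after the few-shot "Answer: X" examples.
print(tok("B").input_ids)   # [33]  -> "B"
print(tok(" B").input_ids)  # [347] -> "ĠB", i.e. " B"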

@ollmer (Author) commented Jul 4, 2023

Thanks for pointing it out! Could you provide the complete code snippet to reproduce the issue?

@vince62s commented Jul 4, 2023

Just run your script with HF's model "Salesforce/xgen-7b-8k-base" (or any BPE-based model)
and print the logits of "A", "B", "C", "D", " A", " B", " C", " D" (note the leading space in the second set);
you will immediately see the issue. I had a look at the Eleuther lm harness and I think it has the same issue.
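A sketch of that comparison (gpt2 is used here as a small stand-in for "Salesforce/xgen-7b-8k-base"; the prompt is a placeholder for a real five-shot MMLU prompt):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in for any BPE-based model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "..."  # five-shot MMLU prompt ending with "Answer:"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

# With a real few-shot prompt, the " X" variants score much higher than the
# bare letters on BPE models.
for choice in ["A", "B", "C", "D", " A", " B", " C", " D"]:
    tid = tok(choice).input_ids[-1]
    print(repr(choice), tid, logits[tid].item())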

@ollmer (Author) commented Jul 7, 2023

  1. In this repo, it's true that the space is not inserted at the end of the sample. I did it precisely the same way it was implemented in this repo for the FLAN models, which I used as the reference implementation when I started this PR. I agree that it could be a problem for some tokenizers; maybe it should be changed to some tokenizer-agnostic way of encoding the choices. But adding a space at the end of the sample will not solve this problem, as you can see in the following illustration (and in the snippet after this list), because the trailing space is tokenized in a completely different way.
[image: illustration of how the trailing space is tokenized]
  2. This issue does not affect the Eleuther harness, because it prepends a space to each answer before calculating the loglikelihood, like this: " {}".format(choice), and calculates the likelihood of the full sample for each of the possible choices.
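A quick illustration of the trailing-space point, using the GPT-2 tokenizer (the exact ids are tokenizer-specific):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

print(tok("Answer:").input_ids[-1])    # 25 -> ":"
print(tok("Answer: ").input_ids[-2:])  # [25, 220] -> ":" followed by a lone " " token
print(tok(" B").input_ids)             # [347] -> " B" as a single token, which is what
                                       # the model emits after the few-shot examples

So appending a space to the prompt produces a standalone space token rather than the single " A"/" B"/... tokens the model was conditioned to produce.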

@vince62s commented Jul 7, 2023

  1. For this repo, I think just taking the logits of " A", " B", " C", " D" would work independently of the tokenizer.
    Sentencepiece will encode " A" the same way as "A".
    BPE will encode " A" as "ĠA", the same way the model will answer.
    In the end it should work this way.

  2. For Eleuther, I don't understand your point here nor the one in the other thread. I thought the whole point of your PR was to align the Harness with the original implementation, hence using the logits of the 4 choices.

In the end I am not sure whether the Harness's MMLU is OK or not, but I had the feeling it was taking the same loglikelihood of "A", "B", "C" and "D".

@ollmer (Author) commented Jul 7, 2023

Follow-up thought: for the logprob selection approach to be correct, we can use the following method. Encode the whole line of the possible answer, Answer: A, where the choice is the last character, then use only the last token of this sequence. This should yield correct results for both tokenizers (assuming that the choice character always translates into exactly one token).
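A minimal sketch of that method, assuming a transformers tokenizer (choice_token_id is a made-up helper name for illustration):

from transformers import AutoTokenizer

def choice_token_id(tokenizer, choice):
    # The last token of "Answer: X" is " X" for gpt2-style BPE tokenizers and the
    # sentencepiece equivalent of "X" for others, matching what each model generates.
    return tokenizer("Answer: {}".format(choice)).input_ids[-1]

tok = AutoTokenizer.from_pretrained("gpt2")
print(choice_token_id(tok, "B"))  # 347 for GPT-2, i.e. the " B" token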

@ollmer (Author) commented Jul 7, 2023

> 2. For Eleuther, I don't understand your point here nor the one in the other thread. I thought the whole point of your PR was to align the Harness with the original implementation, hence using the logits of the 4 choices.

The point was to fully reproduce the original prompt format in lm-harness. Before this discussion, I assumed that logprob selection is numerically equivalent to calculating the loglikelihoods of 4 different samples with different choices. It should be that way because all the tokens are exactly the same except the last one, so you end up with the loglikelihood of the last token in lm-harness being the same as the logprob of the token in the OG approach. Plus, the lm-harness approach is suitable for choices longer than 1 token, because they're using the same code for a lot of different multiple-choice tasks.

@vince62s commented Jul 8, 2023

For this repo, I think we are saying the same thing: encoding Answer: A or A (and taking the argmax among A, B, C, D to compare to the gold answer) amounts to the same thing, except that it is quicker to get the logprob of a single token.

For Eleuther, I have not dug into the code enough, but when I get time I will print the intermediate steps, because I need to explain why we are not getting the same results.

@Heepo commented Jul 26, 2023

@ollmer How about this solution: append the space after Answer: instead of prepending it to "{}\n\n".format(df.iloc[idx, k + 1]).
Change this

for j in range(k):
    prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
prompt += "\nAnswer:"
if include_answer:
    prompt += " {}\n\n".format(df.iloc[idx, k + 1])

to

for j in range(k):
    prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
prompt += "\nAnswer: "
if include_answer:
    prompt += "{}\n\n".format(df.iloc[idx, k + 1])

Now in the few-shot setting, no matter which LLM is used, we expect the model to output one of the four tokens ['A', 'B', 'C', 'D'].
Then we don't use the tokenizer() or tokenizer.encode() methods; we just use tokenizer.convert_tokens_to_ids() instead to acquire the logits of these four tokens.

>>> llama_tokenizer.convert_tokens_to_ids('A')
29909
>>> neox_tokenizer.convert_tokens_to_ids('A')
34
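A short sketch of the indexing step this implies, assuming tokenizer and logits (the last-position logits of a prompt that now ends with "Answer: ") already exist:

import torch

choices = ["A", "B", "C", "D"]
choice_ids = [tokenizer.convert_tokens_to_ids(c) for c in choices]
probs = torch.softmax(logits[choice_ids].float(), dim=0)
pred = choices[int(probs.argmax())]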

@karpathy commented Feb 27, 2024

For others who might stumble here in the future: the current implementation, as it stands, is definitely wrong for some models/tokenizers. E.g. the GPT series would not work. Sentencepiece would also not work unless add_dummy_prefix is turned on.

The problem is here:

probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer("A").input_ids[-1]],
                logits[tokenizer("B").input_ids[-1]],
                logits[tokenizer("C").input_ids[-1]],
                logits[tokenizer("D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)

where instead the space must be prepended when querying the logits, e.g.: logits[tokenizer(" A").input_ids[-1]].
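Concretely, the corrected snippet would then read (this mirrors the fix described in this comment; it is not code from the PR):

probs = (
    torch.nn.functional.softmax(
        torch.tensor(
            [
                logits[tokenizer(" A").input_ids[-1]],
                logits[tokenizer(" B").input_ids[-1]],
                logits[tokenizer(" C").input_ids[-1]],
                logits[tokenizer(" D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)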

It might be helpful to play with the tiktokenizer web app and copy paste in the example prompt, e.g.:

The following are multiple choice questions (with answers) about  abstract algebra.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A. 0
B. 1
C. 2
D. 3
Answer: B

Statement 1 | If aH is an element of a factor group, then |aH| divides |a|. Statement 2 | If H and K are subgroups of G then HK is a subgroup of G.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: B

Statement 1 | Every element of a group generates a cyclic subgroup of the group. Statement 2 | The symmetric group S_10 has 10 elements.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: C

Statement 1| Every function from a finite set onto itself must be one to one. Statement 2 | Every subgroup of an abelian group is abelian.
A. True, True
B. False, False
C. True, False
D. False, True
Answer: A

Find the characteristic of the ring 2Z.
A. 0
B. 3
C. 12
D. 30
Answer: A

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:

(Make sure to show whitespace and delete the last \n if it is present after copy-pasting. You should have exactly 344 tokens.)

For GPT-2 as an example, here we see that the correct next token to match the few-shot examples is " B" (with space), i.e. token 347. But tokenizer("B") would return token 33 instead of 347:

(Pdb) tokenizer("B")
{'input_ids': [33], 'attention_mask': [1]}
(Pdb) tokenizer(" B")
{'input_ids': [347], 'attention_mask': [1]}

It seems like there is discussion in this thread noting the problem, but it was never resolved, and the code as it stands right now is buggy.

@karpathy

Another reason this code will fail is that it hardcodes the max context length to 2048:

while input_ids.shape[-1] > 2048:

but e.g. GPT-2 has a max context length of 1024, so the code will fail violently and throw an error.
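One way around the hard-coded limit would be to read the context length from the model config (attribute names vary by architecture, and model / input_ids are assumed to be in scope, so this is a best-effort sketch rather than the PR's code):

max_len = getattr(model.config, "max_position_embeddings", None) \
    or getattr(model.config, "n_positions", 2048)

while input_ids.shape[-1] > max_len:
    # shorten the prompt here (e.g. drop a few-shot example) and re-tokenize,
    # just as the original loop does against the 2048 constant
    ...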

@vince62s commented Feb 27, 2024

> It seems like there is discussion in this thread noting the problem, but it was never resolved, and the code as it stands right now is buggy.

Here, yes, but after this discussion it was fixed in Eleuther's lm_eval, and in OpenNMT-py we left-strip the space: https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/run_mmlu_opennmt.py#L183

@karpathy

I noticed one more bug

[image: generated prompt showing a double space right before the subject name]

The issue is that there is a double space in the intro line, right before the subject. This is because

def format_subject(subject):
    l = subject.split("_")
    s = ""
    for entry in l:
        s += " " + entry
    return s

See how it prepends " " at the beginning of each entry. But then

def gen_prompt(train_df, subject, k=-1):
    prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
        format_subject(subject)
    )

see how it has a space right after "about", in the string "about {}". This causes a double space.
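A one-line fix would be to build the subject string without the leading space (shown only as an illustration, not as a change made in the PR):

def format_subject(subject):
    # join with single spaces instead of prepending " " to every entry
    return " ".join(subject.split("_"))

With this, gen_prompt's "about {}." produces a single space before the subject.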

@vince62s

Exactly (I noticed it too), but you know what, when you want to replicate the papers' numbers you just keep those in :) (seriously, it changes the results).
