Winogrande degraded results #132

Open
opherlieber opened this issue Mar 27, 2024 · 5 comments

@opherlieber

Hi,

I'm trying to reproduce the results from the Open LLM Leaderboard, and all benchmarks match (within ~0.2%) except for Winogrande, which is consistently lower when run through lighteval.

Examples

accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=mistralai/Mistral-7B-v0.1" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 75.61
OpenLB Result: 78.37

accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=google/gemma-7b" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 73.95
OpenLB Result: 79.01

The OpenLLM results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that available somewhere?

Thanks

@clefourrier (Member) commented Mar 28, 2024

Hi!

The OpenLLM results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that available somewhere?

No: the Open LLM Leaderboard is not using lighteval but a wrapper around the harness. The sha reported is for that wrapper. If you want to reproduce leaderboard results, the best approach is to use the command on the leaderboard's About page, with the correct commit of lm_eval. (The wrapper adds our logging system, for example, and allows launching several tasks at once; all of that logic will be ported to the harness for the community to use soon.)

all benchmarks seem ok (within ~0.2%) except for winogrande which is consistently lower when running through lighteval.

This is not normal; I'll investigate.

clefourrier self-assigned this Mar 28, 2024
@clefourrier (Member)

Could you share one of the details files you generated, so I can compare the outputs?

@opherlieber (Author)

Hey, attaching the output for both runs, thanks!
lighteval_winogrande.tar.gz
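
For comparing the two runs sample by sample, a minimal sketch, assuming the archive unpacks into lighteval's parquet details files under the output directory (the glob pattern and file layout here are assumptions; adjust to whatever the archive actually contains):

import glob
import pandas as pd

# Walk whatever parquet details files the archive contains and peek at them;
# the path pattern is an assumption, not taken from the attached file.
for path in sorted(glob.glob("lighteval_output/**/*.parquet", recursive=True)):
    df = pd.read_parquet(path)
    print(path, df.shape)
    print(df.head(2))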

@opherlieber (Author)

Regarding reproducing the leaderboard results with lm-eval-harness: I was unable to do this for gemma, since the specific lm-eval-harness commit referenced on the leaderboard's About page does not support adding the BOS token, which is required for gemma (unlike mistral, for example, where the results all reproduce within reasonable error).
For example, the following command, using the specific leaderboard harness revision:

python main.py --model=hf-causal-experimental --model_args="pretrained=google/gemma-7b,use_accelerate=True" --tasks=winogrande --num_fewshot=5 --batch_size=1 --output_path=lm_eval_out
gives a 48.6 score for gemma (for mistral I get within 0.3% of the leaderboard score).

So I think I'm still missing something about how the leaderboard runs the models, gemma in particular.
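
A minimal sketch of that BOS point, assuming transformers is installed and you have access to the gated google/gemma-7b checkpoint: gemma's tokenizer prepends <bos> when special tokens are enabled, which is the behaviour the pinned harness revision cannot reproduce.

from transformers import AutoTokenizer

# With access to google/gemma-7b, check whether a leading <bos> is added.
tok = AutoTokenizer.from_pretrained("google/gemma-7b")

with_special = tok("The trophy doesn't fit in the suitcase", add_special_tokens=True)["input_ids"]
without_special = tok("The trophy doesn't fit in the suitcase", add_special_tokens=False)["input_ids"]

print(with_special[0] == tok.bos_token_id)     # True: <bos> is prepended
print(without_special[0] == tok.bos_token_id)  # False: no <bos> without special tokens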

clefourrier added the bug (Something isn't working) label Apr 8, 2024
@clefourrier (Member)

OK, so a full update of what I checked:

  • reran our entire test suite (we use a subset for CI): the test suite passes, so if there is a problem, it was already there when we added the eval
  • compared the samples by re-running lm_eval vs lighteval on the task, plus looking at the examples you provided: the order of the few-shot samples is different for each item and different between the two suites. Imo this is the most likely source of the mismatch above, as we have shown that there can be a difference of up to 3 points on the same task with different ordering of few-shot samples (a sketch of the effect is below).
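
A minimal sketch of that few-shot ordering effect, using an entirely hypothetical pool rather than the real Winogrande data: two suites that sample or order the 5 few-shot examples differently end up scoring the same item against different contexts.

import random

# Hypothetical stand-in for a few-shot pool (not the real Winogrande data).
fewshot_pool = [f"Sentence {i} ... Answer {i}" for i in range(20)]

def build_context(seed, k=5):
    # Sample and order k few-shot examples the way a given suite might.
    rng = random.Random(seed)
    return "\n\n".join(rng.sample(fewshot_pool, k))

# Same pool, same item, different seed/ordering -> different prompt text,
# which in practice can shift task scores by a few points.
print(build_context(seed=0) == build_context(seed=1234))  # False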

I could rename leaderboard|winogrande to lighteval|winogrande to make it clearer that there is a mismatch, as I won't have the time to investigate or fix this in more depth at the moment. However, if you want to take a look, I'd be grateful!

Regarding launching models on the leaderboard, we are using the above command from that harness version, but we might have made a fix in the management of tokens. The person with the most knowledge about the current state of the backend is @NathanHB, but more generally we are looking to move the leaderboard to the main branch of lm_eval quite soon, so it might not be worth a deep dive.
