Winogrande degraded results #132

Open
opherlieber opened this issue Mar 27, 2024 · 5 comments

@opherlieber

Hi,

I'm trying to reproduce the results from the Open LLM Leaderboard, and all benchmarks match (within ~0.2%) except for Winogrande, which is consistently lower when run through lighteval.

Examples

accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=mistralai/Mistral-7B-v0.1" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 75.61
OpenLB Result: 78.37

accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=google/gemma-7b" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 73.95
OpenLB Result: 79.01

The OpenLLM results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that available somewhere?

Thanks

@clefourrier (Member) commented Mar 28, 2024

Hi!

The OpenLLM results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that available somewhere?

No: the Open LLM Leaderboard is not using lighteval but a wrapper around the harness. The sha reported is for that wrapper. If you want to reproduce leaderboard results, the best approach is to use the command on the leaderboard's About page, with the correct commit of lm_eval. (The wrapper adds our logging system, for example, and allows launching several tasks at once; all of that logic will be ported to the harness for the community to use soon.)

all benchmarks seem ok (within ~0.2%) except for winogrande which is consistently lower when running through lighteval.

This is not normal; I'll investigate.

clefourrier self-assigned this Mar 28, 2024
@clefourrier (Member)

Could you share one of the details files you generated, so I can compare the outputs?

@opherlieber (Author)

Hey, attaching the output for both runs, thanks!
lighteval_winogrande.tar.gz
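
For comparing the two runs sample by sample, a minimal sketch, assuming the archive unpacks into lighteval's parquet details files under the output directory (the glob pattern and file layout here are assumptions; adjust to whatever the archive actually contains):

import glob
import pandas as pd

# Walk whatever parquet details files the archive contains and peek at them;
# the path pattern is an assumption, not taken from the attached file.
for path in sorted(glob.glob("lighteval_output/**/*.parquet", recursive=True)):
    df = pd.read_parquet(path)
    print(path, df.shape)
    print(df.head(2))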

@opherlieber (Author)

Regarding reproducing the leaderboard results with lm-eval-harness: I was unable to do this for gemma, since the specific lm-eval-harness commit referenced on the leaderboard's About page does not support adding the BOS token, which is required for gemma (unlike mistral, for example, where the results all reproduce within reasonable error).
For example, the following command, using the specific leaderboard harness revision:

python main.py --model=hf-causal-experimental --model_args="pretrained=google/gemma-7b,use_accelerate=True" --tasks=winogrande --num_fewshot=5 --batch_size=1 --output_path=lm_eval_out
gives a 48.6 score for gemma (for mistral I get within 0.3% of the leaderboard score).

So I think I'm still missing something about how the leaderboard runs the models, gemma in particular.
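
A minimal sketch of that BOS point, assuming transformers is installed and you have access to the gated google/gemma-7b checkpoint: gemma's tokenizer prepends <bos> when special tokens are enabled, which is the behaviour the pinned harness revision cannot reproduce.

from transformers import AutoTokenizer

# With access to google/gemma-7b, check whether a leading <bos> is added.
tok = AutoTokenizer.from_pretrained("google/gemma-7b")

with_special = tok("The trophy doesn't fit in the suitcase", add_special_tokens=True)["input_ids"]
without_special = tok("The trophy doesn't fit in the suitcase", add_special_tokens=False)["input_ids"]

print(with_special[0] == tok.bos_token_id)     # True: <bos> is prepended
print(without_special[0] == tok.bos_token_id)  # False: no <bos> without special tokens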

clefourrier added the bug (Something isn't working) label Apr 8, 2024
@clefourrier (Member)

OK, so a full update of what I checked:

  • reran our entire test suite (we use a subset for CI): the test suite passes, so if there is a problem, it was already there when we added the eval
  • compared the samples by re-running lm_eval vs lighteval on the task, plus looking at the examples you provided: the order of the few-shot samples is different for each item and different between the two suites. Imo this is the most likely source of the mismatch above, as we have shown that there can be a difference of up to 3 points on the same task with different ordering of few-shot samples (a sketch of the effect is below).
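
A minimal sketch of that few-shot ordering effect, using an entirely hypothetical pool rather than the real Winogrande data: two suites that sample or order the 5 few-shot examples differently end up scoring the same item against different contexts.

import random

# Hypothetical stand-in for a few-shot pool (not the real Winogrande data).
fewshot_pool = [f"Sentence {i} ... Answer {i}" for i in range(20)]

def build_context(seed, k=5):
    # Sample and order k few-shot examples the way a given suite might.
    rng = random.Random(seed)
    return "\n\n".join(rng.sample(fewshot_pool, k))

# Same pool, same item, different seed/ordering -> different prompt text,
# which in practice can shift task scores by a few points.
print(build_context(seed=0) == build_context(seed=1234))  # False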

I could rename leaderboard|winogrande to lighteval|winogrande to make it clearer that there is a mismatch, as I won't have the time to investigate or fix this in more depth at the moment. However, if you want to take a look, I'd be grateful!

Regarding launching models on the leaderboard, we are using the above command from that harness version, but we might have made a fix in the management of tokens. The person with the most knowledge about the current state of the backend is @NathanHB, but more generally we are looking to move the leaderboard to the main branch of lm_eval quite soon, so it might not be worth a deep dive.
