Winogrande degraded results #132
Hi,
I'm trying to reproduce the results from the Open LLM Leaderboard, and all benchmarks come out close (within ~0.2%) except for winogrande, which is consistently lower when run through lighteval.

Examples:

```
accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=mistralai/Mistral-7B-v0.1" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
```

LightEval result: 75.61
Open LLM Leaderboard result: 78.37

```
accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=google/gemma-7b" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
```

LightEval result: 73.95
Open LLM Leaderboard result: 79.01

The Open LLM Leaderboard results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that commit available somewhere?
Thanks
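For context (my addition, not part of the original report): after a run like the commands above, the scores can be read back out of `--output_dir`. This is a minimal sketch that assumes lighteval writes a JSON results file with a top-level `"results"` mapping of task to metrics; the exact path and layout may differ between versions, so treat it as a starting point rather than the tool's documented interface.

```python
# Hedged sketch: print every numeric metric found in a lighteval results file,
# assuming a {"results": {task: {metric: value}}} layout (check your version).
import json
import sys

def print_results(path: str) -> None:
    with open(path) as f:
        data = json.load(f)
    for task, metrics in data.get("results", {}).items():
        for name, value in metrics.items():
            if isinstance(value, (int, float)):
                print(f"{task:<40} {name:<12} {value:.4f}")

if __name__ == "__main__":
    # e.g. python print_results.py lighteval_output/results/<model>/results_<date>.json
    print_results(sys.argv[1])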
Hi!
No: the Open LLM Leaderboard is not using lighteval but a wrapper around the harness. The sha reported is for the wrapper - if you want to reproduce leaderboard results, the best would be to use the command in the About page of the leaderboard, with the correct commit of the harness.

This is not normal, I'll investigate.

Could you share one of the details files you generated, so I can compare the outputs?

Hey, attaching the output for both runs, thanks!
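As an editorial aside (an assumption, not something stated in the thread): lighteval saves per-example details as parquet files, so a quick way to diff the two attached runs is something like the sketch below. The file paths and the column name `full_prompt` are hypothetical; inspect the real files' columns first.

```python
# Hedged sketch: diff two lighteval details files for the same task.
# Assumes parquet details with one row per example; the paths and the
# column name below are hypothetical; check df.columns on the real files.
import pandas as pd

a = pd.read_parquet("details_run_a.parquet")  # hypothetical path
b = pd.read_parquet("details_run_b.parquet")  # hypothetical path

print(a.columns.tolist())  # verify the real schema before comparing

col = "full_prompt"  # hypothetical column name
if col in a.columns and col in b.columns:
    diff = a.index[a[col] != b[col]]
    print(f"{len(diff)} / {len(a)} rows differ in {col!r}")
    for i in list(diff)[:3]:
        print("--- run A ---")
        print(a.loc[i, col])
        print("--- run B ---")
        print(b.loc[i, col])
```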
Regarding reproducing the leaderboard results with lm-eval-harness: I was unable to do this for gemma, since the specific lm-eval-harness commit referenced in the leaderboard About page does not support adding the BOS token, which gemma requires (unlike mistral, for example, where all the results reproduce within reasonable error).
So I think I'm still missing something about how the leaderboard runs the models, gemma in particular.
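For illustration (my sketch, not the commenter's code): the BOS behaviour being discussed can be checked directly with transformers, since gemma's tokenizer prepends `<bos>` when special tokens are added, while prompts tokenized without it would score differently in a loglikelihood task like winogrande.

```python
# Minimal check of BOS handling with Hugging Face transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b")
print(tok.bos_token, tok.bos_token_id)

with_bos = tok("The answer is", add_special_tokens=True).input_ids
without = tok("The answer is", add_special_tokens=False).input_ids
print(with_bos)  # should start with the <bos> id for gemma
print(without)   # same ids, minus the leading <bos>
```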
Ok, full update of what I checked. Regarding launching models on the leaderboard, we are using the above command from the harness version, but we might have made a fix in the management of tokens - the person with the most knowledge about the current state of the backend is @NathanHB, but more generally we are looking to use the …