Add Whole History Rating to Leaderboard? #3004
We publish Arena battle data with timestamps here. You can use a sliding window to plot a model's rating over time. Could you contribute a PR?
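Something like a sliding-window online Elo could work as a starting point. A rough sketch (assuming each battle record is a dict with `model_a`, `model_b`, `winner`, and a Unix-seconds `tstamp`, roughly the shape of the published data; the K-factor and window sizes here are arbitrary choices):

```python
from collections import defaultdict

def online_elo(battles, k=4, base=1000):
    """One pass of online Elo updates over a list of battle dicts."""
    rating = defaultdict(lambda: base)
    for b in battles:
        a, m = b["model_a"], b["model_b"]
        expected_a = 1 / (1 + 10 ** ((rating[m] - rating[a]) / 400))
        if b["winner"] == "model_a":
            score_a = 1.0
        elif b["winner"] == "model_b":
            score_a = 0.0
        else:  # any kind of tie counts as half a win for each side
            score_a = 0.5
        rating[a] += k * (score_a - expected_a)
        rating[m] += k * (expected_a - score_a)
    return dict(rating)

def sliding_window_ratings(battles, window_days=14, step_days=7):
    """Yield (window_end_timestamp, ratings) pairs for plotting."""
    battles = sorted(battles, key=lambda b: b["tstamp"])
    day = 86400
    t = battles[0]["tstamp"] + window_days * day
    while t <= battles[-1]["tstamp"] + step_days * day:
        window = [b for b in battles
                  if t - window_days * day <= b["tstamp"] < t]
        if window:
            yield t, online_elo(window)
        t += step_days * day
```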
So I tried this over the weekend and made some graphs, but the library doesn't handle ties, doesn't seem to report uncertainty correctly, and I don't know how the w² parameter should be chosen:
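One common workaround when a rating implementation only accepts decisive games is to split each tie into two half-weight games, one win in each direction. A sketch, where `add_game` is a hypothetical callback standing in for whatever game-recording call the library actually exposes:

```python
def add_result(add_game, player_a, player_b, winner, day):
    """Feed one battle into a win/loss-only rating library.

    add_game(winner, loser, day, weight) is a stand-in for the
    library's own API. If the library has no per-game weight,
    recording both directions at full weight approximates a tie,
    just with more total games.
    """
    if winner == player_a:
        add_game(player_a, player_b, day, 1.0)
    elif winner == player_b:
        add_game(player_b, player_a, day, 1.0)
    else:  # tie: half a win in each direction
        add_game(player_a, player_b, day, 0.5)
        add_game(player_b, player_a, day, 0.5)
```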
Is there any reason why a pre-trained LLM would change in skill over time? The only thing I can think of is network outages or the like causing people to temporarily vote that model down (which I am guilty of). But otherwise the model weights and inference implementation are always the same, right? ("In the context of LLM evaluation, models can be assumed to be static.") So I wonder if ideally there would be a way to mark certain models as having w = 0, and others (API calls, models with internet access, etc.) as having skill that could plausibly change.

(Also, I arbitrarily added 1000 to the results to make them look more like Elo ratings, but I don't know if that really makes them equivalent to Elo ratings.)

(I also want to try counting "both are bad" as a loss for both models against a "HumanEvaluator" model that would serve as a sort of benchmark for ideal LLM performance, but I need to figure out how to implement ties first.)
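The "both are bad" idea can be prototyped as a pure data transformation before fitting: rewrite each such record as two losses against a synthetic benchmark player. A sketch (the `HumanEvaluator` name is just the placeholder from above, and `"tie (both bad)"` is an assumption about how the data labels that vote):

```python
def expand_both_bad(battles, benchmark="HumanEvaluator"):
    """Rewrite 'both are bad' battles as two losses against a
    synthetic benchmark player representing ideal performance."""
    out = []
    for b in battles:
        if b["winner"] == "tie (both bad)":
            # the benchmark "wins" against each model separately
            for loser in (b["model_a"], b["model_b"]):
                out.append({"model_a": benchmark, "model_b": loser,
                            "winner": "model_a", "tstamp": b["tstamp"]})
        else:
            out.append(b)
    return out
```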
For the API-based models, there are frequent claims online that users see models getting worse over time. It would be good to know if that's true. Copying a comment of mine from HF:
I know there are a bunch of Elo variants, but I never learned the exact differences. Here is one summary:
I know Glicko has a measure of uncertainty built in; I'm not sure how that compares to lmsys' bootstrap method.
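For reference, the bootstrap approach is conceptually simple: resample the battles with replacement, refit the ratings each time, and read intervals off the empirical distribution. A sketch (not lmsys' actual code; `fit` can be any battles-to-ratings function, such as the `online_elo` helper sketched earlier):

```python
import random
from collections import defaultdict

def bootstrap_intervals(battles, fit, rounds=100, seed=0):
    """Estimate 95% rating intervals by bootstrap resampling."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = [rng.choice(battles) for _ in battles]
        for model, r in fit(resampled).items():
            samples[model].append(r)
    intervals = {}
    for model, rs in samples.items():
        rs.sort()
        lo = rs[int(0.025 * len(rs))]
        hi = rs[int(0.975 * len(rs)) - 1]
        intervals[model] = (lo, hi)
    return intervals
```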
Maybe WHR would be a better choice? I know WHR is used to track rock climber skill over time, for instance. From their own paper, they say:
WHR can show how models change in skill over time, and how confident we can be in the measurement:
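If an estimator reports a per-time-step mean and standard deviation for each model (as WHR does), the trajectories and confidence bands could be drawn like this (a sketch; the `history` layout is my own assumption, not any library's output format):

```python
import matplotlib.pyplot as plt

def plot_rating_history(history):
    """history: {model: [(timestamp, mean, std), ...]} sorted by time."""
    fig, ax = plt.subplots(figsize=(9, 5))
    for model, points in history.items():
        ts = [p[0] for p in points]
        mu = [p[1] for p in points]
        sd = [p[2] for p in points]
        ax.plot(ts, mu, label=model)
        # shade roughly a 95% band: mean +/- 2 standard deviations
        ax.fill_between(ts,
                        [m - 2 * s for m, s in zip(mu, sd)],
                        [m + 2 * s for m, s in zip(mu, sd)],
                        alpha=0.2)
    ax.set_xlabel("time")
    ax.set_ylabel("rating")
    ax.legend()
    plt.show()
```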