Add Whole History Rating to Leaderboard? #3004
We publish Arena battle data with timestamps here. You can use a sliding window to plot a model's rating over time. Could you contribute a PR?
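Something like a sliding-window online Elo could work as a starting point. A rough sketch (assuming each battle record is a dict with `model_a`, `model_b`, `winner`, and a Unix-seconds `tstamp`, roughly the shape of the published data; the K-factor and window sizes here are arbitrary choices):

```python
from collections import defaultdict

def online_elo(battles, k=4, base=1000):
    """One pass of online Elo updates over a list of battle dicts."""
    rating = defaultdict(lambda: base)
    for b in battles:
        a, m = b["model_a"], b["model_b"]
        expected_a = 1 / (1 + 10 ** ((rating[m] - rating[a]) / 400))
        if b["winner"] == "model_a":
            score_a = 1.0
        elif b["winner"] == "model_b":
            score_a = 0.0
        else:  # any kind of tie counts as half a win for each side
            score_a = 0.5
        rating[a] += k * (score_a - expected_a)
        rating[m] += k * (expected_a - score_a)
    return dict(rating)

def sliding_window_ratings(battles, window_days=14, step_days=7):
    """Yield (window_end_timestamp, ratings) pairs for plotting."""
    battles = sorted(battles, key=lambda b: b["tstamp"])
    day = 86400
    t = battles[0]["tstamp"] + window_days * day
    while t <= battles[-1]["tstamp"] + step_days * day:
        window = [b for b in battles
                  if t - window_days * day <= b["tstamp"] < t]
        if window:
            yield t, online_elo(window)
        t += step_days * day
```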
So I tried this over the weekend and made some graphs, but the library doesn't handle ties, doesn't seem to report uncertainty correctly, and I don't know how the w² parameter should be chosen:
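One common workaround when a rating implementation only accepts decisive games is to split each tie into two half-weight games, one win in each direction. A sketch, where `add_game` is a hypothetical callback standing in for whatever game-recording call the library actually exposes:

```python
def add_result(add_game, player_a, player_b, winner, day):
    """Feed one battle into a win/loss-only rating library.

    add_game(winner, loser, day, weight) is a stand-in for the
    library's own API. If the library has no per-game weight,
    recording both directions at full weight approximates a tie,
    just with more total games.
    """
    if winner == player_a:
        add_game(player_a, player_b, day, 1.0)
    elif winner == player_b:
        add_game(player_b, player_a, day, 1.0)
    else:  # tie: half a win in each direction
        add_game(player_a, player_b, day, 0.5)
        add_game(player_b, player_a, day, 0.5)
```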
Is there any reason why a pre-trained LLM would change in skill over time? The only thing I can think of is network outages or the like causing people to temporarily vote that model down (which I am guilty of). But otherwise the model weights and inference implementation are always the same, right? ("In the context of LLM evaluation, models can be assumed to be static.") So I wonder if ideally there would be a way to mark certain models as having w = 0, and others (API calls, models with internet access, etc.) as having skill that could plausibly change.

(Also, I arbitrarily added 1000 to the results to make them look more like Elo ratings, but I don't know if that really makes them equivalent to Elo ratings.)

(I also want to try counting "both are bad" as a loss for both models against a "HumanEvaluator" model that would serve as a sort of benchmark for ideal LLM performance, but I need to figure out how to implement ties first.)
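The "both are bad" idea can be prototyped as a pure data transformation before fitting: rewrite each such record as two losses against a synthetic benchmark player. A sketch (the `HumanEvaluator` name is just the placeholder from above, and `"tie (both bad)"` is an assumption about how the data labels that vote):

```python
def expand_both_bad(battles, benchmark="HumanEvaluator"):
    """Rewrite 'both are bad' battles as two losses against a
    synthetic benchmark player representing ideal performance."""
    out = []
    for b in battles:
        if b["winner"] == "tie (both bad)":
            # the benchmark "wins" against each model separately
            for loser in (b["model_a"], b["model_b"]):
                out.append({"model_a": benchmark, "model_b": loser,
                            "winner": "model_a", "tstamp": b["tstamp"]})
        else:
            out.append(b)
    return out
```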
For the API-based models, there are frequent claims online that users see models getting worse over time. It would be good to know if that's true. Copying a comment of mine from HF:
I know there are a bunch of Elo variants, but I never learned the exact differences. Here is one summary:
I know Glicko has a measure of uncertainty built in; I'm not sure how that compares to lmsys' bootstrap method.
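For reference, the bootstrap approach is conceptually simple: resample the battles with replacement, refit the ratings each time, and read intervals off the empirical distribution. A sketch (not lmsys' actual code; `fit` can be any battles-to-ratings function, such as the `online_elo` helper sketched earlier):

```python
import random
from collections import defaultdict

def bootstrap_intervals(battles, fit, rounds=100, seed=0):
    """Estimate 95% rating intervals by bootstrap resampling."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = [rng.choice(battles) for _ in battles]
        for model, r in fit(resampled).items():
            samples[model].append(r)
    intervals = {}
    for model, rs in samples.items():
        rs.sort()
        lo = rs[int(0.025 * len(rs))]
        hi = rs[int(0.975 * len(rs)) - 1]
        intervals[model] = (lo, hi)
    return intervals
```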
Maybe WHR would be a better choice? I know WHR is used to track rock climber skill over time, for instance. From their own paper, they say:
WHR can show how models change in skill over time, and how confident we can be in the measurement:
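If an estimator reports a per-time-step mean and standard deviation for each model (as WHR does), the trajectories and confidence bands could be drawn like this (a sketch; the `history` layout is my own assumption, not any library's output format):

```python
import matplotlib.pyplot as plt

def plot_rating_history(history):
    """history: {model: [(timestamp, mean, std), ...]} sorted by time."""
    fig, ax = plt.subplots(figsize=(9, 5))
    for model, points in history.items():
        ts = [p[0] for p in points]
        mu = [p[1] for p in points]
        sd = [p[2] for p in points]
        ax.plot(ts, mu, label=model)
        # shade roughly a 95% band: mean +/- 2 standard deviations
        ax.fill_between(ts,
                        [m - 2 * s for m, s in zip(mu, sd)],
                        [m + 2 * s for m, s in zip(mu, sd)],
                        alpha=0.2)
    ax.set_xlabel("time")
    ax.set_ylabel("rating")
    ax.legend()
    plt.show()
```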