# Convince Me AutoArena Works

Evaluating AI has never been trivial. As traditional ML models give way to LLMs and datasets take on more complex forms, benchmarking models becomes harder. **[AutoArena by Kolena](https://github.com/kolenaIO/autoarena)** is a platform for building leaderboards that rank LLMs by comparing model responses against one another using automated judges.

### AutoArena Overview

AutoArena sets up head-to-head comparisons of model generations before a jury of LLMs. By drawing the jury's automated judges from different LLM families, the aim is to apply the best available measure of generation quality when critiquing other models' generations. Traditional text-similarity metrics, by comparison, say little about "quality". Winners of these head-to-head comparisons gain "Elo", a score that determines a model's overall placement on a leaderboard.

### Experiment

The dependencies needed to run this notebook can be installed with: `pip install ipykernel pandas nbformat plotly`.

In this notebook, we will use a portion of the data from the [LMSYS - Chatbot Arena Human Preference Predictions](https://www.kaggle.com/competitions/lmsys-chatbot-arena/data) training split, which is licensed under [Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/). Each human vote in this dataset indicates which of two models' responses to a prompt was better. The data needed for this experiment has been reformatted and placed in the `models` folder.

### Steps in this notebook

<style>
    .spaced-list li {margin-bottom: 10px;}
</style>

<div style="display: flex; align-items: center;">
    <img src="../assets/getting_started.jpg" width="300"/>
    <ol class="spaced-list">
        <li>Create a project</li>
        <li>Create an automated judge</li>
        <li>Upload model responses</li>
        <li>Make some hypotheses about LLM rankings</li>
        <li>Run the human votes through an <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a> and check the hypotheses</li>
        <li>Verify AutoArena's leaderboard</li>
    </ol>
</div>


In [1]:
%pip install ipykernel pandas nbformat plotly -U -q
from collections import defaultdict

import pandas as pd
import plotly.express as px

pd.options.display.float_format = '{:.2f}'.format

Note: you may need to restart the kernel to use updated packages.


### 1. Create a project

To start, install AutoArena from [PyPI](https://pypi.org/project/autoarena/), add your OpenAI API key to your environment for an automated judge, and run it as a module.
```bash
pip install autoarena
export OPENAI_API_KEY=sk-...
python -m autoarena
```
Once the module is running, visit [http://localhost:8899](http://localhost:8899) to create a project! 

### 2. Create an automated judge

Click on `Configure Judge` > `OpenAI` > `gpt-4o-mini` > `Create`. This creates a `gpt-4o-mini` judge that uses your `OPENAI_API_KEY` environment variable.

### 3. Upload model responses

Let's return to the Leaderboard page to upload our data. The `models` folder contains five CSVs:
1. `models/gpt-3.5-turbo-0613.csv`
2. `models/gpt-4-0314.csv`
3. `models/gpt-4-1106-preview.csv`
4. `models/vicuna-13b.csv`
5. `models/vicuna-33b.csv`

Each CSV contains a `prompt` and `response` column storing a language model's input and output, for example:

In [4]:
sample_df = pd.read_csv("models/gpt-4-0314.csv", usecols=['prompt', 'response'])
sample_df.head()

Unnamed: 0,prompt,response
0,General interaction behaviors instructions:\n\...,"[BEGINIMDETECT]\n{\n ""response"": [\n {\n ..."
1,you are a powerhouse of creative brilliance an...,Topic: Sustainable living and eco-friendly pra...
2,"can you summarize the below ""1\t0\tARRISGro_f7...",The text provided is a list of wireless networ...
3,Apprentissage automatique (Machine Learning) s...,"import pandas as pd\n\nurl = ""https://wagon-pu..."
4,Can you provide a list of 10 youtube video tit...,<answer>\n<item>Tesla's $300 Wireless Charging...


Click on `Add Model` and select all of them to upload to your project.
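For readers who want to build per-model files like these from their own battle data, here is a minimal sketch of one way to do it. It is not the script used to produce the `models` folder: the helper name `split_by_model` is hypothetical, the `response_a` and `response_b` column names are assumptions about the battle table, and the rows are invented for illustration.

```python
import pandas as pd

# Toy battles table. `response_a` / `response_b` are assumed column names,
# and every row here is made up purely for illustration.
battles = pd.DataFrame(
    {
        "model_a": ["gpt-4-0314", "vicuna-13b"],
        "model_b": ["vicuna-33b", "gpt-4-0314"],
        "prompt": ["What is Elo?", "Name a prime number."],
        "response_a": ["A rating system.", "Seven."],
        "response_b": ["A chess score.", "Two."],
    }
)


def split_by_model(battles):
    """Return {model name: DataFrame with `prompt` and `response` columns}."""
    sides = []
    for side in ("a", "b"):
        renamed = {f"model_{side}": "model", f"response_{side}": "response"}
        columns = [f"model_{side}", "prompt", f"response_{side}"]
        sides.append(battles[columns].rename(columns=renamed))
    # Stack the "a" side on top of the "b" side, then group rows by model.
    stacked = pd.concat(sides, ignore_index=True)
    return {
        model: group[["prompt", "response"]].reset_index(drop=True)
        for model, group in stacked.groupby("model")
    }


per_model = split_by_model(battles)
for model, frame in per_model.items():
    print(model, len(frame))
    # frame.to_csv(f"models/{model}.csv", index=False)  # uncomment to write files
```

Each resulting table has one row per prompt the model answered, regardless of which side of the battle it was on.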


### 4. Make some hypotheses about LLM rankings

Which LLMs align the most with a human voter's preferences?

Note that our pool of models is: `gpt-3.5-turbo-0613`, `gpt-4-0314`, `gpt-4-1106-preview`, `vicuna-13b`, and `vicuna-33b`.

- **Hypothesis 1: Bigger Models Are Better**
   * `gpt-4-*` should outperform `gpt-3.5-turbo-0613`
   * `vicuna-33b` should outperform `vicuna-13b`

- **Hypothesis 2: Models Made by Major Industry Leaders Are Better**
   * Vicuna's models should be closer to the bottom of the leaderboard

- **Hypothesis 3: Newer Models Are Better**
   * `gpt-4-1106-preview` should outperform all the other models since it is the newest in the group

Let's examine what the human votes from the dataset indicate based on win rates over other models.

In [6]:
df = pd.read_csv("lmsys-chatbot-arena/train_subset.csv") # a subset of the original train split

def compute_head_to_head_win_rate(battles):
    a_win = pd.pivot_table(battles[battles['winner_model_a'] == 1], index="model_a", columns="model_b", aggfunc="size", fill_value=0)
    b_win = pd.pivot_table(battles[battles['winner_model_b'] == 1], index="model_a", columns="model_b", aggfunc="size", fill_value=0)
    counts = pd.pivot_table(battles, index="model_a", columns="model_b", aggfunc="size", fill_value=0)
    return ((a_win + b_win.T) / (counts + counts.T)).mean(axis=1).sort_values(ascending=True)

row_beats_col_freq = compute_head_to_head_win_rate(df)
fig = px.bar(row_beats_col_freq, title="Approximate Win Rate", text_auto=".2f")
fig.update_layout(yaxis_title="Avg Win Rate", xaxis_title="Model", showlegend=False)
fig

From the plot above, we see that within the GPT family, newer models have higher win rates. Interestingly, `vicuna-33b` scores worse than `vicuna-13b` by this metric, the opposite of what we expected. It may also be surprising to see `gpt-4-0314` below `gpt-3.5-turbo-0613`. Only some of our hypotheses agree with this plot.

Are win rates a sufficient metric? While they are explainable and easy to compute, win rates handle ties imprecisely and do not directly support performance comparisons across more than two models. Consider this scenario: a 5-year-old plays chess against a grandmaster, and the child wins. Why should this single win count the same as a win against a peer of the same skill level?

A better alternative to win rates is an [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system):
- Elo adjusts based on opponent's Elo rating; win rate doesn’t
- Elo rewards/penalizes draws; win rate ignores draws
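
Both points can be seen in a small worked example. The sketch below is illustrative only: the helper names `expected_score` and `elo_delta` and the ratings are made up here, and the constants (K = 4, scale 400, base 10) simply mirror the defaults of `compute_elo` in step 5.

```python
def expected_score(rating, opponent_rating, scale=400, base=10):
    """Score the Elo model expects for `rating` against `opponent_rating` (between 0 and 1)."""
    return 1 / (1 + base ** ((opponent_rating - rating) / scale))


def elo_delta(rating, opponent_rating, actual_score, k=4):
    """Rating change after one game; actual_score is 1 (win), 0.5 (tie), or 0 (loss)."""
    return k * (actual_score - expected_score(rating, opponent_rating))


# Hypothetical ratings chosen only for illustration.
upset_win = elo_delta(800, 1600, actual_score=1)       # underdog beats a far stronger opponent
peer_win = elo_delta(1200, 1200, actual_score=1)       # win against an equally rated peer
peer_tie = elo_delta(1200, 1200, actual_score=0.5)     # tie between equals changes nothing
underdog_tie = elo_delta(800, 1600, actual_score=0.5)  # tie against a stronger opponent still pays

print(f"upset win:    {upset_win:+.2f}")
print(f"peer win:     {peer_win:+.2f}")
print(f"peer tie:     {peer_tie:+.2f}")
print(f"underdog tie: {underdog_tie:+.2f}")
```

A plain win rate would count the upset and the peer win identically and would give no credit for either tie, whereas the Elo update scales each result by how surprising it was.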


### 5. Run the human votes through an [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system) and check the hypotheses

We'll pass all of these head-to-head battles through an Elo rating system. In general, the winner gains some Elo, while the loser's Elo is lowered.

In [7]:
df = pd.read_csv("lmsys-chatbot-arena/train_subset.csv", usecols=['model_a', 'model_b', 'winner_model_a', 'winner_model_b', 'winner_tie'])

def compute_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)
    for _, model_a, model_b, winner_model_a, winner_model_b, winner_tie in battles.itertuples():
        ra, rb = rating[model_a], rating[model_b]
        ea, eb = 1 / (1 + BASE ** ((rb - ra) / SCALE)), 1 / (1 + BASE ** ((ra - rb) / SCALE))
        # actual score for model_a: 1 for a win, 0 for a loss, 0.5 for a tie
        if winner_model_a:
            sa = 1
        elif winner_model_b:
            sa = 0
        elif winner_tie:
            sa = 0.5
        else:
            raise ValueError("no winner selected")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)
    return rating

def display_leaderboard(ratings):
    df = pd.DataFrame(ratings.items(), columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
    df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df

elo_ratings = compute_elo(df)
display_leaderboard(elo_ratings)


Unnamed: 0,Model,Elo rating
1,gpt-4-1106-preview,1025
2,gpt-4-0314,1014
3,vicuna-33b,990
4,gpt-3.5-turbo-0613,987
5,vicuna-13b,984


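From the leaderboard above, we see that the hypotheses largely hold. The bigger models rank above their smaller counterparts, the newest model, `gpt-4-1106-preview`, ranks first, and the GPT-4 models top the leaderboard. The one exception is `vicuna-33b`, which edges out `gpt-3.5-turbo-0613` by 3 points, so Hypothesis 2 holds only in part.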

With Elo scores, it becomes much easier to see whether two models are similar in generation quality (e.g. `gpt-3.5-turbo-0613`, `vicuna-13b`, and `vicuna-33b` have very similar standings).
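
To make "similar standings" concrete, a rating gap can be turned back into an expected head-to-head score. The following is a rough sketch rather than part of the original analysis: `expected_score` is a helper defined here, using the same scale (400) and base (10) as `compute_elo`, and the ratings are copied from the leaderboard above.

```python
def expected_score(rating, opponent_rating, scale=400, base=10):
    """Score the Elo model expects for `rating` against `opponent_rating` (between 0 and 1)."""
    return 1 / (1 + base ** ((opponent_rating - rating) / scale))


# Ratings copied from the human-vote leaderboard above.
leaderboard = {
    "gpt-4-1106-preview": 1025,
    "gpt-4-0314": 1014,
    "vicuna-33b": 990,
    "gpt-3.5-turbo-0613": 987,
    "vicuna-13b": 984,
}

top_model = "gpt-4-1106-preview"
for model, rating in leaderboard.items():
    if model != top_model:
        score = expected_score(leaderboard[top_model], rating)
        print(f"{top_model} vs {model}: expected score {score:.3f}")

# The two Vicuna models are only 6 points apart, so the model sees them as nearly even.
vicuna_gap = expected_score(leaderboard["vicuna-33b"], leaderboard["vicuna-13b"])
print(f"vicuna-33b vs vicuna-13b: expected score {vicuna_gap:.3f}")
```

An expected score near 0.5 means the rating system treats the two models as close to evenly matched. Because these ratings come from a subset of battles with a small K, they may not have fully converged, so the numbers are best read as rough indications.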

### 6. Verify AutoArena's leaderboard

By now, your leaderboard on [AutoArena](http://localhost:8899) should have completed the judging process. Let's click on `Recompute Leaderboard` to refresh the leaderboard's content.

<img src="../assets/recompute.jpg" width="300"/>


You'll find that the leaderboard within [AutoArena](http://localhost:8899) (example below) is very similar to the leaderboard computed above.

<img src="../assets/leaderboard.jpg" width="800"/>

Again, we see `gpt-4-1106-preview` in first place, and all the larger models outperform their smaller counterparts. This time, `vicuna-13b` and `vicuna-33b` no longer have near-identical Elo ratings.
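
If you want a number for how closely the two leaderboards agree, one option is a rank correlation. The sketch below computes Spearman's rho by hand for two orderings of the same models. It is only an illustration: `spearman_rho` is a helper written here, `human_order` comes from the leaderboard in step 5, and `autoarena_order` is a hypothetical placeholder that is not read from the screenshot and should be replaced with the order in your own AutoArena project.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    if set(ranking_a) != set(ranking_b) or len(set(ranking_a)) != len(ranking_a):
        raise ValueError("rankings must contain the same unique items")
    n = len(ranking_a)
    position_b = {item: index for index, item in enumerate(ranking_b)}
    # Sum of squared differences between each item's position in the two rankings.
    d_squared = sum((index - position_b[item]) ** 2 for index, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Order taken from the human-vote Elo leaderboard in step 5.
human_order = [
    "gpt-4-1106-preview",
    "gpt-4-0314",
    "vicuna-33b",
    "gpt-3.5-turbo-0613",
    "vicuna-13b",
]

# HYPOTHETICAL placeholder order: replace with your own AutoArena leaderboard.
autoarena_order = [
    "gpt-4-1106-preview",
    "gpt-4-0314",
    "gpt-3.5-turbo-0613",
    "vicuna-33b",
    "vicuna-13b",
]

rho = spearman_rho(human_order, autoarena_order)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value of 1 means the two orderings are identical, and values close to 1 mean only small swaps separate them. With just five models this is a coarse measure, so it complements rather than replaces a visual comparison.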

### Summary
In this notebook, we evaluated language model responses using **[AutoArena by Kolena](https://github.com/kolenaIO/autoarena)**, a platform designed to rank LLM generations through head-to-head comparisons judged by an automated jury of other LLMs.


We used a dataset that included human preferences for pairwise model comparisons, and uploaded the prompts and responses into AutoArena, where the automated judge critiqued the raw model responses. To check that this strategy is sound, we hypothesized which types of LLMs would align most with human preference. Overall win rates alone missed the full picture, and an Elo rating system gave us a clearer one. We constructed a leaderboard from the human preferences in the dataset and compared it to the leaderboard generated in AutoArena. The automated judge produced a strikingly similar ranking, which supports our approach.

AutoArena can effectively automate model benchmarking in alignment with human preference, or with any other policy, by writing appropriate system prompts for the judges.