# Convince Me This Works

Evaluating AI has never been trivial. As traditional ML models evolve into LLMs and datasets take on more complex forms, benchmarking models becomes difficult. **[AutoArena by Kolena](https://github.com/kolenaIO/autoarena)** is a platform made for creating leaderboards to rank LLM outputs against one another using automated judges.

### AutoArena Overview

AutoArena sets up head-to-head comparisons of model generations before a jury of LLMs. The jury can include automated judges from different LLM families, with the aim of getting the best available measurement of generation quality when critiquing other models' generations. Traditional text similarity metrics, by comparison, are less relevant to measuring quality. The winner of each head-to-head comparison gains "Elo", a score that determines a model's overall placement on a leaderboard.

### Experiment

In this notebook, we will use the [LMSYS - Chatbot Arena Human Preference Predictions](https://www.kaggle.com/competitions/lmsys-chatbot-arena/data) dataset, which is released under the [Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/) license. Each record contains a human vote indicating which of two models' responses to the same prompt was better, or whether they tied. To follow along, please download `train.csv` from the [competition data page](https://www.kaggle.com/competitions/lmsys-chatbot-arena/data?select=train.csv).

### Steps in this notebook

1. Reformat the dataset for [AutoArena](https://github.com/kolenaIO/autoarena/tree/trunk?tab=readme-ov-file#-getting-started)
2. Upload the data into AutoArena
3. Create a jury of automated judges. While the automated judges run in the background...
4. Make a hypothesis. Which LLMs align the most with a human voter's preferences?
5. Run the human votes through an [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system)
6. Verify that the Elo leaderboard agrees with our hypothesis
7. Learn that the automated judges in AutoArena produced a very similar leaderboard

In [1]:
import json
import os
from collections import defaultdict

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from tqdm import tqdm

pd.options.display.float_format = '{:.2f}'.format

### 1. Reformat the dataset for [AutoArena](https://github.com/kolenaIO/autoarena/tree/trunk?tab=readme-ov-file#-getting-started)
Each row of the `train` split of the `lmsys-chatbot-arena` dataset pairs two models on the same prompt and records both of their responses. Three one-hot columns record the human vote: `model_a` won, `model_b` won, or it was a tie.

AutoArena requires each model's unique prompts and responses to be in their own CSV with the columns `prompt` and `response`.

In [2]:
df = pd.read_csv('lmsys-chatbot-arena/train.csv')
df.head()

| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | ["explain function calling. how would you call... | ["Function calling is the process of invoking ... | ["Function calling is the process of invoking ... | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | ["How can I create a test set for a very rare ... | ["Creating a test set for a very rare category... | ["When building a classifier for a very rare c... | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | ["What is the best way to travel from Tel-Aviv... | ["The best way to travel from Tel Aviv to Jeru... | ["The best way to travel from Tel-Aviv to Jeru... | 0 | 1 | 0 |
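
The three one-hot outcome columns can be awkward to read at a glance. As a minimal sketch on toy data (the `winner` column name is our own choice, not part of the dataset), they can be collapsed into a single label:

```python
import pandas as pd

# Toy battles mirroring the three one-hot outcome columns in train.csv
battles = pd.DataFrame({
    "winner_model_a": [1, 0, 0],
    "winner_model_b": [0, 1, 0],
    "winner_tie":     [0, 0, 1],
})

outcome_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

# Sanity check: each row should have exactly one flag set
assert (battles[outcome_cols].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the name of the column holding the 1 in each row
battles["winner"] = (
    battles[outcome_cols]
    .idxmax(axis=1)
    .str.replace("winner_", "", regex=False)
)
print(battles["winner"].tolist())  # ['model_a', 'model_b', 'tie']
```

The same lines should work on the real `df`, where `battles["winner"].value_counts()` would show how often each outcome occurs.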


We will create `auto_arena_df`, which has all of the prompts and responses, with models no longer paired together.

In [3]:
auto_arena_df = pd.concat(
    [
        df[['prompt', 'model_a', 'response_a']].rename(columns={'model_a': 'model', 'response_a': 'response'}),
        df[['prompt', 'model_b', 'response_b']].rename(columns={'model_b': 'model', 'response_b': 'response'})
    ],
    ignore_index=True
)
auto_arena_df['prompt'] = auto_arena_df['prompt'].apply(lambda x: json.loads(x)[0])
auto_arena_df['response'] = auto_arena_df['response'].apply(lambda x: json.loads(x)[0])
auto_arena_df.head()

| | prompt | model | response |
|---|---|---|---|
| 0 | Is it morally right to try to have a certain p... | gpt-4-1106-preview | The question of whether it is morally right to... |
| 1 | What is the difference between marriage licens... | koala-13b | A marriage license is a legal document that al... |
| 2 | explain function calling. how would you call a... | gpt-3.5-turbo-0613 | Function calling is the process of invoking or... |
| 3 | How can I create a test set for a very rare ca... | llama-2-13b-chat | Creating a test set for a very rare category c... |
| 4 | What is the best way to travel from Tel-Aviv t... | koala-13b | The best way to travel from Tel Aviv to Jerusa... |
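
The `prompt` and `response_*` cells appear to be JSON-encoded lists, which is why the cell above calls `json.loads(x)[0]`. The sketch below uses a made-up cell value to show what that call does. Note that it keeps only the first element of each list, so any later turns are dropped.

```python
import json

# A made-up cell value in the same shape as the dataset's `prompt` column
raw_cell = '["What is Elo?", "And how is K chosen?"]'

turns = json.loads(raw_cell)   # a Python list of strings
first_turn = turns[0]          # the element the cell above keeps

print(len(turns), first_turn)  # 2 What is Elo?

# A JSON null decodes to None, so a list like [null] yields a missing value
print(json.loads('[null]')[0])  # None
```

If a list ever contains `null`, the resulting `None` becomes a missing value in the DataFrame, which may be worth filtering out before writing the CSVs.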


Now, we split up the data into one CSV per model, containing the `prompt` and `response` columns. To keep things simple, we will only consider the 40 models that appear most often in the dataset.

In [4]:
folder = 'models'
os.makedirs(folder, exist_ok=True)

model_counts = list(auto_arena_df['model'].value_counts().items())[:40]
models_of_interest = []
for model_name, count in model_counts:
    models_of_interest.append(model_name)
    model_level_df = auto_arena_df[auto_arena_df['model'] == model_name]
    # Ensure that the prompt column is unique
    model_level_df.drop_duplicates(subset='prompt').to_csv(f'{folder}/{model_name}.csv', index=False)
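
Before uploading, it can help to confirm that each file meets the requirements stated above (a `prompt` and a `response` column, with unique prompts). The helper below is a hypothetical sketch, not part of AutoArena, and is demonstrated on a throwaway file:

```python
import os
import tempfile

import pandas as pd


def check_model_csv(path):
    """Return a list of problems found in one per-model CSV (empty if none)."""
    frame = pd.read_csv(path)
    problems = []
    for col in ("prompt", "response"):
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    if "prompt" in frame.columns and frame["prompt"].duplicated().any():
        problems.append("duplicate prompts")
    return problems


# Demo on a temporary file that deliberately repeats a prompt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-model.csv")
    pd.DataFrame(
        {"prompt": ["hi", "hi"], "model": ["demo", "demo"], "response": ["a", "b"]}
    ).to_csv(demo_path, index=False)
    demo_problems = check_model_csv(demo_path)

print(demo_problems)  # ['duplicate prompts']
```

In the notebook, the same function could be run over every file in the `models` folder. An empty list for each file means it matches the expected layout.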

### 2. Upload the data into AutoArena
The `models` folder contains 40 CSVs. Let's go to AutoArena and upload them all, and create a `gpt-4o-mini` judge. If you want to add API keys for other LLM providers, details are available within the UI, or you can see what's available from the [source code here](https://github.com/kolenaIO/autoarena/blob/96ea0326404215891051fad3d7b83db49ab77070/ui/src/components/Judges/types.ts#L111C17-L111C38). To start, simply open your terminal, install AutoArena from [PyPI](https://pypi.org/project/autoarena/), add an OpenAI API key to your environment, and run it as a module.
```bash
pip install autoarena
export OPENAI_API_KEY=sk-...
python -m autoarena
```
Once the module is running, visit [http://localhost:8899](http://localhost:8899) to create a project!

### 3. Create a jury of automated judges

<img src="./assets/tutorial.png" width="300"/>


Let's follow the steps within the tutorial.

1. Create a project with a name of your choice via the UI
2. Add responses from a model by selecting a CSV, such as `models/alpaca-13b.csv`
3. Configure an automated judge via the UI by selecting OpenAI from the "Judges" page, and choosing `gpt-4o-mini`
4. Add responses from another model (select all the remaining ones in `models`) and the `gpt-4o-mini` judge will automatically start voting

While we wait for our jury to go through the thousands of comparisons of model responses in AutoArena, let's see what the human ratings from the dataset reveal.

### 4. Make a hypothesis

Based on the overall win rates of each LLM from the dataset, we can see which ones align most with human preferences.

In [5]:
def compute_head_to_head_win_rate(battles):
    a_win = pd.pivot_table(
        battles[battles['winner_model_a'] == 1],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    b_win = pd.pivot_table(
        battles[battles['winner_model_b'] == 1],
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    num_battles = pd.pivot_table(battles,
        index="model_a", columns="model_b", aggfunc="size", fill_value=0)

    # Computing the proportion of wins for each model as A and as B against all other models
    row_beats_col_freq = (
        (a_win + b_win.T) /
        (num_battles + num_battles.T)
    )

    # Arrange by proportion of wins
    prop_wins = row_beats_col_freq.mean(axis=1).sort_values(ascending=False)
    model_names = list(prop_wins.keys())
    row_beats_col = row_beats_col_freq.loc[model_names, model_names]
    return row_beats_col


# Delete all records outside of the 40 selected models
df = df[(df['model_a'].isin(models_of_interest)) & (df['model_b'].isin(models_of_interest))]

row_beats_col_freq = compute_head_to_head_win_rate(df)
fig = px.bar(
    row_beats_col_freq.mean(axis=1).sort_values(ascending=False),
    title="Approximate Win Rate Against Other Models",
    text_auto=".2f"
)
fig.update_layout(yaxis_title="Average Win Rate", xaxis_title="Model", showlegend=False)
fig

From the plot above, we can hypothesize that larger or newer models tend to be more aligned with human preferences.
We'll also note that the GPT-4 models outperform the other models in this dataset.

### 5. Run the human votes through an [Elo rating system](https://en.wikipedia.org/wiki/Elo_rating_system)

We'll pass all of these head-to-head battles through an Elo rating system, commonly used in the world of chess to determine a skill rating relative to other players. For each battle, the dataset indicates the winner from a pair of models: the winner gains some Elo, while the loser's Elo is lowered. A tie counts as half a win for each side. You can read more about the details of Elo rating systems [here](https://en.wikipedia.org/wiki/Elo_rating_system).
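
To make the update rule concrete, here is a minimal sketch of a single Elo update. It uses the same constants as this notebook's defaults (K=4, scale 400, base 10). The function names `expected_score` and `elo_update` are ours, chosen for illustration.

```python
def expected_score(r_self, r_opp, scale=400, base=10):
    """Expected score (between 0 and 1) of `r_self` against `r_opp`."""
    return 1 / (1 + base ** ((r_opp - r_self) / scale))


def elo_update(ra, rb, score_a, k=4):
    """Return the new (ra, rb) after one battle; score_a is 1, 0.5, or 0."""
    ea = expected_score(ra, rb)
    eb = expected_score(rb, ra)
    return ra + k * (score_a - ea), rb + k * ((1 - score_a) - eb)


# Two evenly rated models: the winner gains K/2 and the loser drops K/2
new_a, new_b = elo_update(1000, 1000, score_a=1)
print(new_a, new_b)  # 1002.0 998.0

# An upset (the lower-rated model wins) moves the ratings further
up_a, up_b = elo_update(1000, 1200, score_a=1)
print(round(up_a, 2), round(up_b, 2))
```

Because the two expected scores always sum to 1, whatever one model gains the other loses, so the total rating across both models stays constant.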

In [7]:
def compute_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    # itertuples() yields the row index first, then the selected columns
    for _, model_a, model_b, winner_model_a, winner_model_b, winner_tie in battles[['model_a', 'model_b', 'winner_model_a', 'winner_model_b', 'winner_tie']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        # Expected scores for each side given the current rating gap
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        
        # Determine the winner based on the one-hot encoded winner columns
        if winner_model_a == 1:
            sa = 1
        elif winner_model_b == 1:
            sa = 0
        elif winner_tie == 1:
            sa = 0.5
        else:
            raise ValueError("no winner selected")

        # Update the ratings
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    return rating

def display_leaderboard(ratings):
    df = pd.DataFrame([
        [n, ratings[n]] for n in ratings.keys()
    ], columns=["Model", "Elo rating"]).sort_values("Elo rating", ascending=False).reset_index(drop=True)
    df["Elo rating"] = (df["Elo rating"] + 0.5).astype(int)
    df.index = df.index + 1
    return df

elo_ratings = compute_elo(df)
display_leaderboard(elo_ratings)

| | Model | Elo rating |
|---|---|---|
| 1 | gpt-4-0125-preview | 1173 |
| 2 | gpt-4-1106-preview | 1155 |
| 3 | gpt-4-0314 | 1139 |
| 4 | gpt-4-0613 | 1089 |
| 5 | gemini-pro | 1078 |
| 6 | gpt-3.5-turbo-0613 | 1077 |
| 7 | mistral-medium | 1075 |
| 8 | claude-1 | 1066 |
| 9 | claude-2.0 | 1060 |
| 10 | mixtral-8x7b-instruct-v0.1 | 1055 |


A single pass of Elo depends on the order in which battles are processed, so let's bootstrap the results to check that the Elo scores are stable and reliable. The bootstrap samples also let us display confidence intervals later on.

In [8]:
def compute_bootstrap_result(battles, func_compute_elo, num_round):
    rows = []
    for i in tqdm(range(num_round), desc="bootstrap"):
        rows.append(func_compute_elo(battles.sample(frac=1.0, replace=True)))
    df = pd.DataFrame(rows)
    return df[df.median().sort_values(ascending=False).index]

BOOTSTRAP_ROUNDS = 100

np.random.seed(42)
bootstrap_elo_lu = compute_bootstrap_result(df, compute_elo, BOOTSTRAP_ROUNDS)
bootstrap_lu_median = bootstrap_elo_lu.median().reset_index().set_axis(["model", "Elo rating"], axis=1)
bootstrap_lu_median["Elo rating"] = (bootstrap_lu_median["Elo rating"] + 0.5).astype(int)
bootstrap_lu_median

bootstrap: 100%|██████████| 100/100 [00:06<00:00, 15.11it/s]


| | model | Elo rating |
|---|---|---|
| 0 | gpt-4-1106-preview | 1186 |
| 1 | gpt-4-0125-preview | 1176 |
| 2 | gpt-4-0314 | 1117 |
| 3 | gpt-4-0613 | 1091 |
| 4 | mistral-medium | 1082 |
| 5 | claude-1 | 1073 |
| 6 | claude-2.0 | 1061 |
| 7 | gemini-pro | 1053 |
| 8 | gpt-3.5-turbo-0314 | 1052 |
| 9 | mixtral-8x7b-instruct-v0.1 | 1048 |
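
The bootstrap above resamples battles with replacement and recomputes the ratings each round. The same idea is easier to see on a simpler statistic. This sketch uses made-up outcomes for one hypothetical model and builds a percentile interval for its win rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up outcomes for one hypothetical model: 70 wins and 30 losses
outcomes = np.array([1] * 70 + [0] * 30)

# Resample the battles with replacement and recompute the statistic each time
boot_win_rates = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(1000)
])

# Percentile interval: the middle 95% of the bootstrap distribution
lower, median, upper = np.quantile(boot_win_rates, [0.025, 0.5, 0.975])
print(f"{lower:.2f} <= {median:.2f} <= {upper:.2f}")
```

The 2.5% and 97.5% quantiles become more stable as the number of rounds grows, so raising `BOOTSTRAP_ROUNDS` above 100 is a reasonable option if the intervals look noisy.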


### 6. Verify that the Elo leaderboard agrees with our hypothesis

From the leaderboard above, we immediately see that the GPT-4 models are the top performers. We may also notice that the larger and newer LLMs tend to appear higher up on the leaderboard. By inspection, the ordering of models by win rate and the ordering by Elo score are very similar, which suggests that the Elo rating system is behaving as expected!
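
One way to go beyond inspection is a rank correlation between the two sets of scores. The sketch below uses hypothetical numbers rather than values from the dataset. It computes Spearman's coefficient as the Pearson correlation of the ranks, and `rank_agreement` is a name we made up.

```python
import pandas as pd


def rank_agreement(scores_a, scores_b):
    """Spearman correlation between two score Series indexed by model name."""
    shared = scores_a.index.intersection(scores_b.index)
    # The Pearson correlation of the ranks is the Spearman coefficient
    return scores_a[shared].rank().corr(scores_b[shared].rank())


# Hypothetical numbers for illustration only, not taken from the dataset
win_rate = pd.Series({"model-w": 0.71, "model-x": 0.64, "model-y": 0.52, "model-z": 0.40})
elo = pd.Series({"model-w": 1150, "model-x": 1160, "model-y": 1040, "model-z": 980})

print(round(rank_agreement(win_rate, elo), 2))  # 0.8
```

In this notebook, the inputs would be something like `row_beats_col_freq.mean(axis=1)` and `pd.Series(elo_ratings)`. A value close to 1 means the two orderings largely agree.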

Let's go ahead and add on the confidence intervals in a more appealing plot.

In [9]:
def visualize_bootstrap_scores(df):
    bars = df.quantile([.025, .5, .975]).T.reset_index(names='model')
    bars.columns = ['model', 'lower', 'rating', 'upper']
    bars = bars.sort_values('rating', ascending=False)

    bars['error_y'] = bars['upper'] - bars['rating']
    bars['error_y_minus'] = bars['rating'] - bars['lower']
    bars['rating_rounded'] = bars['rating'].round(2)

    mid_index = len(bars) // 2
    top_half, bottom_half = bars.iloc[:mid_index], bars.iloc[mid_index:]

    fig = make_subplots(rows=2, cols=1, subplot_titles=("Top Performers", "Rest of Performers"), vertical_spacing=0.3)

    # Function to add traces
    def add_trace(data, row):
        fig.add_trace(
            go.Scatter(
                x=data['model'], 
                y=data['rating'], 
                error_y=dict(type='data', array=data['error_y'], arrayminus=data['error_y_minus']),
                mode='markers+text',
                text=data['rating_rounded'],
                textposition='top center',
                textfont=dict(size=8)
            ),
            row=row, col=1
        )
        
    # Add top half and bottom half plots
    add_trace(top_half, row=1)
    add_trace(bottom_half, row=2)

    fig.update_layout(
        showlegend=False,
        height=800
    )
    fig.update_xaxes(title_text="Model", row=1, col=1)
    fig.update_yaxes(title_text="Rating", row=1, col=1, range=[top_half['rating'].min() - 50, top_half['rating'].max() + 50])
    fig.update_xaxes(title_text="Model", row=2, col=1)
    fig.update_yaxes(title_text="Rating", row=2, col=1, range=[bottom_half['rating'].min() - 50, bottom_half['rating'].max() + 50])
    
    return fig

fig = visualize_bootstrap_scores(bootstrap_elo_lu)
fig.show()

### 7. Learn that the automated judges in AutoArena produced a very similar leaderboard

By now, your leaderboard on [AutoArena](http://localhost:8899) should have a good number of votes on most of the models you've uploaded. Let's click on `Recompute Leaderboard` to refresh the leaderboard's content.

<img src="./assets/recompute.png" width="300"/>


You'll find that the leaderboard within [AutoArena](http://localhost:8899) is very similar to the placement of models shown above, with slight shifts. Again, we'll notice that the newer GPT-4 models are at the top of the leaderboard, and the larger/newer models tend to dominate among the top 20 performers. Conversely, the bottom 20 performers generally consist of the smaller/older models. None of this should be surprising, but it demonstrates that the jury of automated judges (`gpt-4o-mini` in this case) is able to leverage an Elo rating system to create an accurate leaderboard of many LLMs efficiently.
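
A simple way to put a number on "very similar" is the share of models that two leaderboards have in common among their top k entries. This sketch uses placeholder orderings, and it assumes the AutoArena ranking has been copied out of the UI by hand, since no export API is assumed here.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k models that two ordered rankings share."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Placeholder orderings, best model first (not real results)
human_order = ["model-a", "model-b", "model-c", "model-d", "model-e"]
judge_order = ["model-b", "model-a", "model-c", "model-e", "model-d"]

print(top_k_overlap(human_order, judge_order, k=3))  # 1.0
print(top_k_overlap(human_order, judge_order, k=4))  # 0.75
```

Overlap ignores the order within the top k, so it works best alongside a rank-based measure rather than as a replacement for one.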


Here is an example of what you might see in [AutoArena](http://localhost:8899).


<img src="./assets/leaderboard.png" width="600"/>



### Summary
In this notebook, we evaluated language model responses using **[AutoArena by Kolena](https://github.com/kolenaIO/autoarena)**, a platform designed to rank LLM generations through head-to-head comparisons judged by an automated jury of other LLMs.


We used a dataset that included human preferences in pairwise model comparisons and uploaded its data, without the human votes, into AutoArena. Then, we set up a jury of automated judges to critique the raw model responses given their respective prompts. To check that this strategy is sound, we used the overall win rates to hypothesize which types of LLMs would align most closely with human voters' preferences. Using an Elo rating system, we ranked the models by the human preferences from the dataset and compared that ranking to the leaderboard generated in AutoArena. Comparing our hypothesis, the human-vote (*ground truth*) Elo leaderboard, and the AutoArena leaderboard, we found that the automated judges produced a strikingly similar ranking, validating our approach.

AutoArena can effectively automate model benchmarking and align results with human preferences, or with any other policy, by writing appropriate system prompts.