# LMSYS - Chatbot Arena Human Preference Predictions

This notebook presents an exploratory data analysis (EDA) of the dataset provided for the [Chatbot Arena Human Preference Predictions competition](https://www.kaggle.com/competitions/lmsys-chatbot-arena/overview).

The competition dataset consists of user interactions from the Chatbot Arena. In each interaction, a judge provides one or more prompts to two different large language models and then indicates which model gave the more satisfactory response. \
The goal of the competition is to predict the preferences of the judges and determine the likelihood that a given prompt/response pair is selected as the winner.

## Data description
The training data consists of ~57.5K rows and contains the following columns:
- **id** - A unique identifier for the row.
- **model_[a/b]** - The identity of model_[a/b]. Included in train.csv but not test.csv.
- **prompt** - The prompt that was given as an input (to both models).
- **response_[a/b]** - The response from model_[a/b] to the given prompt.
- **winner_model_[a/b]**, **winner_tie** - Binary columns marking the judge's selection. Together they form the ground truth target.



# Imports and setup

In [1]:
%load_ext jupyter_black

In [2]:
import numpy as np
import pandas as pd

import plotly.graph_objects as go
from plotly import express as px
from plotly.offline import init_notebook_mode, iplot

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

In [3]:
init_notebook_mode(connected=True)

# set max column width to 500
pd.set_option("display.max_colwidth", 500)

In [4]:
DATA_PATH = "../data/train.csv"

# Data analysis

In [5]:
df = pd.read_csv(DATA_PATH)
df.head()

,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain percentage of females on managerial positions?"",""OK, does pineapple belong on a pizza? Relax and give me fun answer.""]","[""The question of whether it is morally right to aim for a certain percentage of females in managerial positions is a complex ethical issue that involves considerations of fairness, equality, diversity, and discrimination.\n\nHere are some arguments in favor of and against such policies:\n\n**Arguments in favor:**\n\n1. **Correcting Historical Inequities:** Women have historically been underrepresented in leadership roles due to various cultural, institutional, and social barriers. Aiming fo...","[""As an AI, I don't have personal beliefs or opinions. However, I can tell you that the question of gender quotas in managerial positions is a complex one and positions can vary. Advocates argue that such measures can help correct historical imbalances, promote diversity, and may lead to better decision-making. Critics may argue that positions should be based purely on merit, and that quotas could potentially lead to tokenism or unqualified candidates being promoted. Morality can be subjecti...",1,0,0
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage license and marriage certificate?"",""How can I get both of them as quick as possible in California "",""What is the minimal time to get them? 1 day or 10 days?""]","[""A marriage license is a legal document that allows a couple to get married. It is issued by a government agency, such as a county clerk's office or a state government, and is valid for a certain period of time, usually one year. After the marriage has taken place, the couple must obtain a marriage certificate, which is a document that records the marriage and is used to prove that the marriage took place. The marriage certificate is usually issued by the same government agency that issued ...","[""A marriage license and a marriage certificate are two different legal documents that have separate purposes.\n\n1. Marriage License: A marriage license is a legal document that gives a couple permission to get married. It's usually obtained from local government or court officials before the wedding ceremony takes place. The couple is required to meet certain criteria, such as being of a certain age or not being closely related. Once the license is issued, there's often a waiting period be...",0,1,0
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call a function?""]","[""Function calling is the process of invoking or executing a function in a programming language. To call a function, you need to use the function's name followed by parentheses \""()\"". Within the parentheses, you can pass arguments if the function requires any.\n\nHere is an example of how to call a function in Python:\n\n```python\n# Define a function\ndef greet(name):\n print(\""Hello, \"" + name + \""!\"")\n\n# Call the function\ngreet(\""John\"")\n```\n\nIn this example, the function \""gree...","[""Function calling is the process of invoking a function in a program. A function is a block of code that performs a specific task and can be reused multiple times in a program. Functions help to organize and modularize code, making it more readable and maintainable.\n\nTo call a function, you simply write the name of the function followed by parentheses () containing any arguments or parameters that the function requires. Here's an example:\n```\nfunction greet(name) {\n console.log(\""Hell...",0,0,1
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare category? I want to build a classifier.""]","[""Creating a test set for a very rare category can be challenging, as it may not be possible to collect a sufficient number of examples for the rare category. However, there are several strategies you can use to create a test set for a rare category:\n\n1. Use synthetic data: If the rare category is not present in any real-world data, you can create synthetic data to simulate the rare category. For example, you can use generative adversarial networks (GANs) or other techniques to generate sy...","[""When building a classifier for a very rare category, creating a test set can be challenging due to the limited amount of data available. Here are some steps you can take to create a test set for a rare category:\n\n1. Gather as much data as possible: The more data you have, the easier it will be to create a test set. Look for any additional sources of data that you can use to supplement the data you already have.\n2. Stratified sampling: Use stratified sampling to ensure that your test set...",1,0,0
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv to Jerusalem? Car? Bus? Plane?""]","[""The best way to travel from Tel Aviv to Jerusalem depends on your personal preference and the availability of transportation options. All the options you have mentioned are valid options, but here are some details to help you make your decision:\n\n* By car: Traveling by car is the quickest way to get from Tel Aviv to Jerusalem, as the distance between the two cities is only about 60 kilometers (37 miles). It takes around 45 minutes to drive from Tel Aviv to Jerusalem by car, depending o...","[""The best way to travel from Tel-Aviv to Jerusalem depends on your personal preferences and circumstances. \n\nIf you have a car and enjoy driving, then driving from Tel-Aviv to Jerusalem is an option. However, parking in Jerusalem can be challenging and expensive.\n\nIf you prefer to use public transportation, there are several bus lines that operate between Tel-Aviv and Jerusalem. Some of the most popular bus companies include Egged and Dan. The bus ride typically takes about an hour, dep...",0,1,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57477 entries, 0 to 57476
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              57477 non-null  int64 
 1   model_a         57477 non-null  object
 2   model_b         57477 non-null  object
 3   prompt          57477 non-null  object
 4   response_a      57477 non-null  object
 5   response_b      57477 non-null  object
 6   winner_model_a  57477 non-null  int64 
 7   winner_model_b  57477 non-null  int64 
 8   winner_tie      57477 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 3.9+ MB


## Model performance analysis

#### How often is each model used to generate responses?

In [7]:
model_usage = pd.concat([df["model_a"], df["model_b"]]).value_counts()

# How many unique models are there?
len(model_usage.keys())

64

In [8]:
# plot the model usage
fig = px.bar(
    x=model_usage.index,
    y=model_usage.values,
    labels={"x": "Model", "y": "Usage times"},
    title="Model Usage Distribution",
)

fig.update_layout(width=1200, height=700)

iplot(fig)

From the analysis above, we can see that:
- in total, 64 unique models are used to generate responses,
- the model usage is not evenly distributed. The most popular model (gpt-4-1106-preview) was used 7387 times, while the least common one (mistral-7b-instruct-v0.2) was used only 100 times,
- the most popular models are versions of GPT and Claude.

#### Which pairs of models are compared most often?

In [9]:
# count the number of times each pair of models is compared, regardless of the order
compared_models_count = (
    pd.DataFrame(
        np.sort(df[["model_a", "model_b"]].values, axis=1),
        columns=["model_a", "model_b"],
    )
    .value_counts()
    .reset_index(name="counts")
)

len(compared_models_count)

1275

In [10]:
top_compared_models = compared_models_count.head(20)

fig = px.bar(
    x=top_compared_models["model_a"] + " vs " + top_compared_models["model_b"],
    y=top_compared_models["counts"],
    labels={"x": "Model Comparison", "y": "Comparison times"},
    title="Top 20 Model Comparison Distribution",
)

iplot(fig)

In [11]:
# there are 1275 distinct pairs of compared models. How often does each comparison count occur?

compared_models_count["counts"].value_counts()

counts
4       56
1       55
2       51
3       45
6       38
        ..
153      1
148      1
147      1
145      1
1073     1
Name: count, Length: 197, dtype: int64

In [12]:
# bin the comparison count distribution into 12 bins
comparision_count_distribution_bins = pd.cut(
    compared_models_count["counts"],
    bins=[
        0,
        5,
        10,
        20,
        30,
        40,
        50,
        60,
        70,
        80,
        90,
        100,
        compared_models_count["counts"].max(),
    ],
    precision=0,
    retbins=False,
)

comparision_count_distribution_bins.value_counts(normalize=True).sort_index()

counts
(0, 5]         0.188235
(5, 10]        0.136471
(10, 20]       0.193725
(20, 30]       0.112941
(30, 40]       0.068235
(40, 50]       0.050196
(50, 60]       0.037647
(60, 70]       0.047059
(70, 80]       0.023529
(80, 90]       0.021961
(90, 100]      0.015686
(100, 1073]    0.104314
Name: proportion, dtype: float64

Some pairs of models are compared against each other far more often than others. About a third of the pairs (32.5%) were compared at most 10 times, roughly half (51.8%) at most 20 times, and about 10% more than 100 times.
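The cumulative shares can also be computed directly from the per-pair counts instead of being summed from the binned table. Below is a minimal sketch of that idea; the helper name `cumulative_share` and the toy counts are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def cumulative_share(counts: pd.Series, thresholds: list[int]) -> pd.Series:
    """Share of pairs compared at most t times, for each threshold t."""
    return pd.Series(
        {t: float((counts <= t).mean()) for t in thresholds},
        name="share_of_pairs",
    )


# Hypothetical comparison counts for six model pairs (not from train.csv).
toy_counts = pd.Series([1, 3, 8, 15, 40, 250])
shares = cumulative_share(toy_counts, [10, 20, 100])
print(shares)
```

In the notebook, the same helper could be called on `compared_models_count["counts"]`.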

#### Which models perform the best and the worst?

Performance is calculated as the share of a model's responses that were marked as the winner, out of all responses the model generated. Ties therefore count as non-wins.
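To make the definition easy to check by hand, here is a small sketch that computes the same win rate on a toy table. The function name `win_rates` and the toy rows are assumptions made for illustration; only the column names come from the dataset.

```python
import pandas as pd


def win_rates(frame: pd.DataFrame) -> pd.Series:
    """Wins divided by appearances for every model, on either side."""
    wins = (
        frame.groupby("model_a")["winner_model_a"].sum()
        .add(frame.groupby("model_b")["winner_model_b"].sum(), fill_value=0)
    )
    appearances = (
        frame["model_a"].value_counts()
        .add(frame["model_b"].value_counts(), fill_value=0)
    )
    # A tie adds an appearance without a win for both models.
    return (wins / appearances).sort_values(ascending=False)


# Hypothetical toy battles, not rows from train.csv.
toy = pd.DataFrame(
    {
        "model_a": ["x", "x", "y", "z"],
        "model_b": ["y", "z", "z", "x"],
        "winner_model_a": [1, 0, 0, 0],
        "winner_model_b": [0, 0, 1, 1],
        "winner_tie": [0, 1, 0, 0],
    }
)
rates = win_rates(toy)
print(rates)
```

Here model `x` wins two of its three battles, `z` wins one of three, and `y` wins none of its two.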

In [13]:
model_scores_part1 = df.groupby("model_a")["winner_model_a"].sum()
model_scores_part2 = df.groupby("model_b")["winner_model_b"].sum()

model_scores = model_scores_part1.add(model_scores_part2, fill_value=0)
# fill_value=0 keeps a model that appears on only one side from becoming NaN
model_scores = model_scores / df["model_a"].value_counts().add(
    df["model_b"].value_counts(), fill_value=0
)
model_scores = model_scores.sort_values(ascending=False)

In [14]:
# top 10 best models
fig = px.bar(
    x=model_scores.head(10).index,
    y=model_scores.head(10).values,
    labels={"x": "Model", "y": "Score"},
    title="Top 10 best-rated models",
)

iplot(fig)

In [15]:
# top 10 worst models
fig = px.bar(
    x=model_scores.tail(10).index,
    y=model_scores.tail(10).values,
    labels={"x": "Model", "y": "Score"},
    title="Top 10 worst-rated models",
)

iplot(fig)

The charts above show that some models tend to generate better responses than others, but no model's answers are always rated the best or always rated the worst.

#### What does a tie mean? Can both answers be winners at once?

In [16]:
tie_df = df.query("winner_tie == 1")
(tie_df["winner_model_a"] + tie_df["winner_model_b"]).value_counts()

0    17761
Name: count, dtype: int64

No. A tie means that neither answer was selected as the winner.
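If the three indicator columns are mutually exclusive in every row (the cell above confirms this only for tied rows, so it is an assumption for the rest), they can be collapsed into a single label. The sketch below checks that assumption before collapsing; `to_single_label` and the toy rows are hypothetical.

```python
import pandas as pd

TARGET_COLUMNS = ["winner_model_a", "winner_model_b", "winner_tie"]


def to_single_label(frame: pd.DataFrame) -> pd.Series:
    """Collapse the three indicator columns into one categorical label."""
    targets = frame[TARGET_COLUMNS]
    # Every row should have exactly one indicator set to 1.
    if not (targets.sum(axis=1) == 1).all():
        raise ValueError("each row must have exactly one target column set")
    # idxmax returns the name of the column holding the 1 in each row.
    return targets.idxmax(axis=1).rename("label")


# Hypothetical toy rows, not taken from train.csv.
toy = pd.DataFrame(
    {
        "winner_model_a": [1, 0, 0],
        "winner_model_b": [0, 1, 0],
        "winner_tie": [0, 0, 1],
    }
)
labels = to_single_label(toy)
print(labels.tolist())
```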

#### How frequent are ties?

In [17]:
f"{(tie_df.shape[0] / df.shape[0]) * 100:.1f}%"

'30.9%'

## Positive and negative responses analysis

#### Do some phrases indicate that a response is scored negatively?

In [18]:
# exclude tied examples
df_no_ties = df.query("winner_tie == 0")

loosing_responses = df_no_ties.apply(
    lambda x: (x["response_a"] if x["winner_model_a"] == 0 else x["response_b"]),
    axis=1,
)

loosing_responses

0        ["As an AI, I don't have personal beliefs or opinions. However, I can tell you that the question of gender quotas in managerial positions is a complex one and positions can vary. Advocates argue that such measures can help correct historical imbalances, promote diversity, and may lead to better decision-making. Critics may argue that positions should be based purely on merit, and that quotas could potentially lead to tokenism or unqualified candidates being promoted. Morality can be subjecti...
1        ["A marriage license is a legal document that allows a couple to get married. It is issued by a government agency, such as a county clerk's office or a state government, and is valid for a certain period of time, usually one year. After the marriage has taken place, the couple must obtain a marriage certificate, which is a document that records the marriage and is used to prove that the marriage took place. The marriage certificate is usually issued by the same government agenc

In [19]:
winning_responses = df_no_ties.apply(
    lambda x: (x["response_a"] if x["winner_model_a"] == 1 else x["response_b"]),
    axis=1,
)

In [20]:
# responses are stored as list-like strings; strip the outer brackets and quotes
loosing_responses = loosing_responses.str.strip("[]")
loosing_responses = loosing_responses.str.strip('"')

winning_responses = winning_responses.str.strip("[]")
winning_responses = winning_responses.str.strip('"')
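Stripping brackets and quotes removes only the outer delimiters: a multi-turn response stays joined by `","`, and escape sequences such as `\n` remain literal characters. The cells look like JSON-encoded lists with one string per turn, so an alternative is to decode them. The sketch below assumes every cell is valid JSON; `parse_turns` and `first_turn` are hypothetical helpers, and the handling of `null` turns is a precaution rather than a documented property of the data.

```python
import json


def parse_turns(cell: str) -> list[str]:
    """Decode a JSON-encoded list of turns into a list of Python strings."""
    turns = json.loads(cell)
    # A null turn (if any exist) becomes an empty string.
    return ["" if turn is None else str(turn) for turn in turns]


def first_turn(cell: str) -> str:
    """Return the decoded first turn, or an empty string for an empty list."""
    turns = parse_turns(cell)
    return turns[0] if turns else ""


# Hypothetical cell in the same format as the response columns.
raw = '["Hello,\\nworld", null, "She said \\"hi\\""]'
turns = parse_turns(raw)
print(turns)
```

Applied to the notebook, something like `winning_responses.map(first_turn)` would yield the decoded first turn, which is what the prefix check later in this section looks at.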

In [21]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_no_ties = df_no_ties.assign(
    winning_response=winning_responses,
    loosing_response=loosing_responses,
)

In [22]:
texts = [
    "I do not feel comfortable",
    "I'm sorry, but",
    "I am sorry, but",
    "I apologize, but",
    "Unfortunately I",
    "I do not have enough",
    "I'm just an AI",
    "I'm an AI and",
    "I'm afraid",
    "I can't",
]

unprecise_responses = []
for _, row in df_no_ties.iterrows():
    if row["loosing_response"].startswith(tuple(texts)):
        if row["winning_response"].startswith(tuple(texts)):
            unprecise_responses.append("both")
        else:
            unprecise_responses.append("loosing")
    elif row["winning_response"].startswith(tuple(texts)):
        unprecise_responses.append("winning")
    else:
        continue

pd.Series(unprecise_responses).value_counts(normalize=True)

loosing    0.737934
winning    0.204911
both       0.057155
Name: proportion, dtype: float64

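Among the non-tied pairs in which at least one response starts with one of the listed phrases ("I do not feel comfortable", "I'm sorry, but", "I am sorry, but", "I apologize, but", "Unfortunately I", "I do not have enough", "I'm just an AI", "I'm an AI and", "I'm afraid", "I can't"), only the losing response starts that way in 73.8% of cases, only the winning response in 20.5%, and both in 5.7%. Responses that open with such phrases are therefore considerably more likely to lose.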
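The classification used in the loop above can be isolated into a small function, which makes the three categories easier to test. This is a sketch in plain Python; `refusal_side`, the shortened prefix tuple, and the example pairs are hypothetical.

```python
from collections import Counter
from typing import Optional

# Subset of the prefixes listed in the cell above.
REFUSAL_PREFIXES = ("I'm sorry, but", "I apologize, but", "I'm just an AI", "I can't")


def refusal_side(
    winning: str, losing: str, prefixes: tuple[str, ...] = REFUSAL_PREFIXES
) -> Optional[str]:
    """Say which response of a pair opens with a refusal-style prefix."""
    win_hit = winning.startswith(prefixes)
    lose_hit = losing.startswith(prefixes)
    if win_hit and lose_hit:
        return "both"
    if lose_hit:
        return "losing"
    if win_hit:
        return "winning"
    return None  # neither response matches; such pairs are left out


# Hypothetical (winning, losing) pairs, not taken from train.csv.
pairs = [
    ("Sure, here is the recipe.", "I'm sorry, but I cannot help."),
    ("I can't do that.", "I apologize, but no."),
    ("I'm just an AI, yet here goes.", "Paris."),
    ("Paris.", "London."),
]
sides = [refusal_side(w, l) for w, l in pairs]
shares = Counter(s for s in sides if s is not None)
print(shares)
```

With the notebook's columns, the pairs could come from `zip(df_no_ties["winning_response"], df_no_ties["loosing_response"])`.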

# Conclusions

In this Jupyter Notebook, exploratory data analysis was performed on the dataset provided for the Chatbot Arena Human Preference Predictions competition. The user interactions, model performance, and response characteristics were analyzed.

Here are the key findings from the analysis:

1. **Model Usage**: It was observed that there are 64 unique models used to generate responses. The model usage is not evenly distributed, with some models being more popular than others.

2. **Model Comparisons**: The frequency of model comparisons was analyzed and it was found that certain pairs of models are compared more often than others. About a third of the pairs were compared 1-10 times, and roughly half were compared at most 20 times.

3. **Model Performance**: The performance of each model was calculated based on the percentage of answers marked as winners. The top 10 best-rated models and the top 10 worst-rated models were identified.

4. **Ties**: The occurrence of ties, where there is no winning answer, was investigated. It was found that ties are present in 30.9% of cases.

5. **Poorly Assessed Answers**: The characteristics of responses that are negatively scored were analyzed. Responses opening with phrases like "I'm just an AI" or "I'm afraid", which indicate that the model cannot or will not answer the given prompt, were identified as more likely to obtain a negative score.

Further analysis and modeling can be performed to predict the likelihood of a given prompt/response pair being selected as the winner. This can help in developing more accurate and effective chatbot systems.