# ELO calculation

Before going into the formulas, a few notes:
- The size of an ELO rating change depends mainly on the K-factor (development coefficient). A higher K makes the rating react more strongly to each result, so it fluctuates more; a lower K makes it more stable.
- ELO ratings also take into account the rating difference between the players, which determines each player's probability of winning (expected score).
- ELO rating changes are determined over a "rating period" (e.g. an event or tournament).

ELO uses two formulas:
- Probability of winning: `1/(1+10^(-D/400))`, where `D` is the player's own rating minus the opponent's rating.
- Rating change formula: `K * (score - win_probability)`, where K is the development coefficient, and the score is 1 for a win, 0.5 for a draw and 0 for a loss.
    - For multiple games within a rating period: `K * Sum(score - win_probability)`
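
As a worked instance of the two formulas, the sketch below computes the expected score and the rating change for a 1656-rated player who beats a 1763-rated opponent with K = 30. The function names `expected_score` and `rating_change` are illustrative only and are not taken from any library.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Probability of winning, with D = rating - opponent_rating."""
    rating_difference = rating - opponent_rating
    return 1 / (1 + 10 ** (-rating_difference / 400))


def rating_change(k: float, games: list[tuple[float, float]]) -> float:
    """K * Sum(score - win_probability) over the games of one rating period.

    Each game is a (score, win_probability) pair.
    """
    return k * sum(score - win_probability for score, win_probability in games)


expected = expected_score(1656, 1763)          # lower-rated player, so below 0.5
change = rating_change(30, [(1.0, expected)])  # a single win in the rating period

print(f"expected score: {expected:.3f}")
print(f"rating change:  {change:+.2f}")
```

The lower-rated player is expected to score roughly 0.35, so a win is worth about 19 points after rounding.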

The choice of K is thus the main lever on how quickly ELO ratings change. The FIDE handbook recommends:
> K = 40 for a player new to the rating list until he has completed events with at least 30 games.
>
> K = 20 as long as a player's rating remains under 2400.
>
> K = 10 once a player's published rating has reached 2400 and remains at that level subsequently, even if the rating drops below 2400.
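
The three quoted rules can be sketched as a small lookup. This is only an illustration of the passage above: the function name, its parameters, and the order in which the rules are checked (new-player rule first) are assumptions, and any handbook rule not quoted here is ignored.

```python
def k_factor(games_played: int, rating: int, has_reached_2400: bool) -> int:
    """Pick K from the three quoted rules (hypothetical helper).

    games_played: rated games completed so far.
    rating: current published rating.
    has_reached_2400: True if the published rating has ever reached 2400.
    """
    # Assumption: the new-player rule takes precedence over the others.
    if games_played < 30:
        return 40
    # K stays at 10 once 2400 has been reached, even after dropping below it.
    if has_reached_2400 or rating >= 2400:
        return 10
    return 20


print(k_factor(games_played=12, rating=1600, has_reached_2400=False))
print(k_factor(games_played=80, rating=1900, has_reached_2400=False))
print(k_factor(games_played=400, rating=2380, has_reached_2400=True))
```

The helper functions in this notebook do not use these tiers; they default to a fixed K of 30.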

## Helper functions

In [1]:
def winning_prob(self_rating, opponent_rating):
    """
    Calculate the winning probability using the Elo rating system.

    Args:
        self_rating (float): The rating of the player whose winning probability we want to calculate.
        opponent_rating (float): The rating of the opponent.

    Returns:
        float: The probability of winning.
    """
    return 1 / (1 + 10 ** ((opponent_rating - self_rating) / 400))

In [2]:
def calculate_new_rating(self_rating: int, opponent_rating: int, score: float, k: int = 30):
    """
    Calculate the new rating after a match using the Elo rating system.

    Args:
        self_rating (int): The rating of the player whose new rating we want to calculate.
        opponent_rating (int): The rating of the opponent.
        score (float): The score of the player (1 for win, 0.5 for draw, 0 for loss).
        k (int): The K-factor used in the Elo rating system.

    Returns:
        int: The new rating.
    """
    expected_score = winning_prob(self_rating, opponent_rating)
    return round(self_rating + k * (score - expected_score))

In [3]:
def calculate_rating_change(starting_rating: int, opponent_ratings: list[int], scores: list[float], k: int = 30):
    """
    Calculate the total rating change over a rating period of one or more matches.

    Every expected score is computed from the starting rating.

    Args:
        starting_rating (int): The initial rating of the player.
        opponent_ratings (list[int]): A list of opponent ratings.
        scores (list[float]): A list of scores corresponding to each match.
        k (int): The K-factor used in the Elo rating system.

    Returns:
        int: The rounded rating change (not the new rating) after all matches.
    """
    winning_probabilities = [winning_prob(starting_rating, opponent_rating) for opponent_rating in opponent_ratings]

    rating_change = 0
    for score, winning_probability in zip(scores, winning_probabilities):
        rating_change += score - winning_probability
    
    return round(k * rating_change)

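### Quick analysis of rating period

The cells below apply the same three results in two ways: updating the rating after every game, and treating the three games as one rating period in which every expected score is computed from the starting rating. The two approaches can differ by a few points (1696 versus 1698 here), because the game-by-game version uses the already updated rating and rounds after each game.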

In [4]:
starting_rating = 1656
opponent_ratings = [1763, 1700, 1800]
scores = [1, 0.5, 1]

In [5]:
new_rating = starting_rating
for opponent_rating, score in zip(opponent_ratings, scores):
    new_rating = calculate_new_rating(new_rating, opponent_rating, score)
    print(f"New rating after match against {opponent_rating} with score {score}: {new_rating}")

print(f"Final rating after all matches: {new_rating} ({new_rating - starting_rating} change)")

New rating after match against 1763 with score 1: 1675
New rating after match against 1700 with score 0.5: 1676
New rating after match against 1800 with score 1: 1696
Final rating after all matches: 1696 (40 change)


In [6]:
rating_change = calculate_rating_change(starting_rating, opponent_ratings, scores, k=30)

print(f"Rating change after all matches: {rating_change}")
print(f"New rating after all matches: {starting_rating + rating_change}")

Rating change after all matches: 42
New rating after all matches: 1698


## ELO scores from pre-calculated metrics

For simplicity, we will use a rating period of one game and a fixed K of 30.



### Load data

In [7]:
import os
import json

import pandas as pd

In [8]:
FUNC_OUTPUT_PATH = "../data/metrics/medhelm-mixed/func_output"

In [9]:
metrics = []
for file in os.listdir(FUNC_OUTPUT_PATH):
    if not file.endswith(".json"):
        continue

    with open(os.path.join(FUNC_OUTPUT_PATH, file), 'r') as f:
        data = json.load(f)
    
    model_name = data["original_run"]["model_run"]["model"]["name"]
    dataset_name = data["original_run"]["model_run"]["dataset"]["name"]

    for instance_metrics in data["metrics_results"]["instance_level_metrics"]:
        instance_id = instance_metrics["instance_id"]
        for metric_name, metric_value in instance_metrics.items():
            if metric_name == "instance_id":
                continue
            metrics.append({
                "model_name": model_name,
                "dataset_name": dataset_name,
                "instance_id": instance_id,
                "metric_name": metric_name,
                "metric_value": metric_value,
            })

metrics_df = pd.DataFrame(metrics)
metrics_df

Unnamed: 0,model_name,dataset_name,instance_id,metric_name,metric_value
0,microsoft/phi-3.5-mini-instruct,ehr_sql,id9852,ehr_sql_execution_accuracy,0.0
1,microsoft/phi-3.5-mini-instruct,ehr_sql,id9852,ehr_sql_query_validity,0.0
2,microsoft/phi-3.5-mini-instruct,ehr_sql,id9852,ehr_sql_precision_answerable,1.0
3,microsoft/phi-3.5-mini-instruct,ehr_sql,id9852,ehr_sql_recall_answerable,1.0
4,microsoft/phi-3.5-mini-instruct,ehr_sql,id9852,ehr_sql_total_predicted_answerable,1.0
...,...,...,...,...,...
2378363,google/gemini-1.5-pro-001,pubmed_qa,id999,quasi_prefix_exact_match,0.0
2378364,google/gemini-1.5-pro-001,pubmed_qa,id999,quasi_prefix_exact_match@5,0.0
2378365,google/gemini-1.5-pro-001,pubmed_qa,id999,logprob,0.0
2378366,google/gemini-1.5-pro-001,pubmed_qa,id999,num_perplexity_tokens,0.0
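
For reference, the loader above only relies on a handful of keys in each JSON file. The sketch below rebuilds that shape with made-up values (the model name, dataset name, instance id, and metric values are all hypothetical) and flattens it the same way, which can help when checking a new file against the expected layout.

```python
# Hypothetical file content, reduced to the keys the loader reads.
example_file = {
    "original_run": {
        "model_run": {
            "model": {"name": "example/model"},
            "dataset": {"name": "example_dataset"},
        }
    },
    "metrics_results": {
        "instance_level_metrics": [
            {"instance_id": "id1", "bert_score": 0.75, "rouge_1": 0.25},
        ]
    },
}


def flatten_metrics(data: dict) -> list[dict]:
    """Turn one file into long-format rows, one row per (instance, metric)."""
    model_name = data["original_run"]["model_run"]["model"]["name"]
    dataset_name = data["original_run"]["model_run"]["dataset"]["name"]
    rows = []
    for instance_metrics in data["metrics_results"]["instance_level_metrics"]:
        for metric_name, metric_value in instance_metrics.items():
            if metric_name == "instance_id":
                continue  # the id is a key, not a metric
            rows.append({
                "model_name": model_name,
                "dataset_name": dataset_name,
                "instance_id": instance_metrics["instance_id"],
                "metric_name": metric_name,
                "metric_value": metric_value,
            })
    return rows


rows = flatten_metrics(example_file)
for row in rows:
    print(row)
```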


In [10]:
# Keep only datasets that report bert_score, which filters out multiple-choice datasets
datasets_with_bert_score = metrics_df[metrics_df["metric_name"] == "bert_score"]["dataset_name"].unique()

metrics_df = metrics_df[metrics_df["dataset_name"].isin(datasets_with_bert_score)]
metrics_df

Unnamed: 0,model_name,dataset_name,instance_id,metric_name,metric_value
95272,openai/gpt-4o-mini,aci_bench,id67,rouge_1,0.491192
95273,openai/gpt-4o-mini,aci_bench,id67,rouge_2,0.211838
95274,openai/gpt-4o-mini,aci_bench,id67,rouge_l,0.296373
95275,openai/gpt-4o-mini,aci_bench,id67,summarization_coverage,0.686893
95276,openai/gpt-4o-mini,aci_bench,id67,summarization_density,1.774272
...,...,...,...,...,...
2291845,microsoft/phi-3.5-mini-instruct,medication_qa,id688,prompt_truncated,0.000000
2291846,microsoft/phi-3.5-mini-instruct,medication_qa,id688,max_prob,0.500000
2291847,microsoft/phi-3.5-mini-instruct,medication_qa,id688,logprob,-0.693147
2291848,microsoft/phi-3.5-mini-instruct,medication_qa,id688,num_perplexity_tokens,512.000000


In [11]:
# Filter out unwanted metrics
unwanted_metrics = [
    "num_references",
    "num_train_trials",
    "num_prompt_tokens",
    "num_completion_tokens",
    "num_output_tokens",
    "inference_runtime",
    "batch_size",
    "finish_reason_length",
    "finish_reason_stop",
    "finish_reason_endoftext",
    "finish_reason_unknown",
    "num_train_instances",
    "prompt_truncated",
    "max_prob",
    "logprob",
    "num_perplexity_tokens",
    "num_bytes",
]

metrics_df = metrics_df[~metrics_df["metric_name"].isin(unwanted_metrics)]
metrics_df

Unnamed: 0,model_name,dataset_name,instance_id,metric_name,metric_value
95272,openai/gpt-4o-mini,aci_bench,id67,rouge_1,0.491192
95273,openai/gpt-4o-mini,aci_bench,id67,rouge_2,0.211838
95274,openai/gpt-4o-mini,aci_bench,id67,rouge_l,0.296373
95275,openai/gpt-4o-mini,aci_bench,id67,summarization_coverage,0.686893
95276,openai/gpt-4o-mini,aci_bench,id67,summarization_density,1.774272
...,...,...,...,...,...
2280132,microsoft/phi-3.5-mini-instruct,medication_qa,id688,BERTScore-F,0.737665
2280133,microsoft/phi-3.5-mini-instruct,medication_qa,id688,bert_score,0.737665
2280134,microsoft/phi-3.5-mini-instruct,medication_qa,id688,rouge1,0.282486
2280135,microsoft/phi-3.5-mini-instruct,medication_qa,id688,rouge2,0.034026


In [12]:
metrics_df[metrics_df["model_name"] == "deepseek-ai/deepseek-r1"]

Unnamed: 0,model_name,dataset_name,instance_id,metric_name,metric_value
1384740,deepseek-ai/deepseek-r1,aci_bench,id115,rouge_1,0.476543
1384741,deepseek-ai/deepseek-r1,aci_bench,id115,rouge_2,0.193069
1384742,deepseek-ai/deepseek-r1,aci_bench,id115,rouge_l,0.204938
1384743,deepseek-ai/deepseek-r1,aci_bench,id115,summarization_coverage,0.595491
1384744,deepseek-ai/deepseek-r1,aci_bench,id115,summarization_density,1.173740
...,...,...,...,...,...
1385015,deepseek-ai/deepseek-r1,aci_bench,id143,BERTScore-F,0.814235
1385016,deepseek-ai/deepseek-r1,aci_bench,id143,bert_score,0.814235
1385017,deepseek-ai/deepseek-r1,aci_bench,id143,rouge1,0.485772
1385018,deepseek-ai/deepseek-r1,aci_bench,id143,rouge2,0.154786


In [13]:
metrics_df["dataset_name"].unique()

array(['aci_bench', 'mtsamples', 'medication_qa', 'mtsamples_replicate'],
      dtype=object)

### Score distribution

In [14]:
target_metrics = [
    "rouge_1",
    "rouge_2",
    "rouge_l",
    "bert_score",
]

filtered_metrics_df = metrics_df[metrics_df["metric_name"].isin(target_metrics)]

filtered_metrics_df.groupby("metric_name")["metric_value"].describe()

metric_name,count,mean,std,min,25%,50%,75%,max
bert_score,6958.0,0.743877,0.058322,0.0,0.709928,0.744421,0.783495,0.895537
rouge_1,6958.0,0.226268,0.155291,0.0,0.108108,0.20155,0.305602,0.777778
rouge_2,6958.0,0.061037,0.072013,0.0,0.012739,0.035503,0.077245,0.714286
rouge_l,6958.0,0.131555,0.08642,0.0,0.072727,0.117647,0.166667,0.777778


### Create matches for ELO

In [15]:
def calculate_match_score(match: dict, draw_threshold: float = 0.03):
    """Calculate the match score for a given match.

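    The draw threshold is set to 3% by default, as this is below
    the standard deviation of the score ratio across all matches (calculated below).

    Args:
        match (dict): A dictionary mapping each of the two model names to its metric value.
        draw_threshold (float): Maximum relative difference between the two scores
            for the match to count as a draw.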

    Returns:
        dict: A dictionary containing the match score.
    """
    model_a, model_b = list(match.keys())
    score_a = match[model_a]
    score_b = match[model_b]

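    # Count the match as a draw when the two scores are within
    # draw_threshold of each other in relative terms (3% by default).
    # Equal scores are always a draw, and a zero score_b is never divided by.
    if score_a == score_b or (
        score_b != 0 and abs((score_a / score_b) - 1) < draw_threshold
    ):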
        return {model_a: 0.5, model_b: 0.5}
    elif score_a > score_b:
        return {model_a: 1, model_b: 0}
    else:
        return {model_a: 0, model_b: 1}


def create_matches(
    metrics_df,
    target_metric: str,
    model_name_column: str = "model_name",
    dataset_name_column: str = "dataset_name",
    instance_id_column: str = "instance_id",
    metric_name_column: str = "metric_name",
    metric_value_column: str = "metric_value",
):
    """Create matches for the Elo rating system.

    Each match is a pair of two models along with the score of the match.

    Returns:
        list of dict: A list of dictionaries containing the matches.
        Each dictionary contains the match results, the dataset name and the instance id.
    """
    matches = []
    filtered_metrics_df = metrics_df[metrics_df[metric_name_column] == target_metric]
    filtered_metrics_df = filtered_metrics_df.dropna()
    for _, group in filtered_metrics_df.groupby(
        [dataset_name_column, instance_id_column]
    ):
        # Generate matches for all combinations of models
        models = group[model_name_column].unique()
        for i in range(len(models)):
            for j in range(i + 1, len(models)):
                model_a = models[i]
                model_b = models[j]

                # Create a match dictionary
                model_a_metric = group[group[model_name_column] == model_a][
                    metric_value_column
                ].values[0]
                model_b_metric = group[group[model_name_column] == model_b][
                    metric_value_column
                ].values[0]
                match_results = calculate_match_score(
                    match={
                        model_a: model_a_metric,
                        model_b: model_b_metric,
                    }
                )

                matches.append(
                    {
                        "model_a": model_a,
                        "model_b": model_b,
                        "model_a_score": match_results[model_a],
                        "model_b_score": match_results[model_b],
                        "model_a_metric": model_a_metric,
                        "model_b_metric": model_b_metric,
                        "metric_name": target_metric,
                        "dataset_name": group[dataset_name_column].values[0],
                        "instance_id": group[instance_id_column].values[0],
                    }
                )

    return matches
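
The all-pairs pairing for a single instance can also be sketched without pandas, using `itertools.combinations`. This is a simplified illustration rather than a drop-in replacement: `match_outcome` and `pairwise_matches` are hypothetical names, equal scores are treated as a draw by assumption, and the three metric values are taken from the `aci_bench` / `id100` rows shown in the output below.

```python
from itertools import combinations


def match_outcome(score_a: float, score_b: float, draw_threshold: float = 0.03) -> tuple[float, float]:
    """Return (points_a, points_b) using the ratio-based draw rule."""
    if score_a == score_b:
        return 0.5, 0.5  # assumption: equal scores (including 0 vs 0) draw
    if score_b != 0 and abs(score_a / score_b - 1) < draw_threshold:
        return 0.5, 0.5
    return (1.0, 0.0) if score_a > score_b else (0.0, 1.0)


def pairwise_matches(scores_by_model: dict[str, float]) -> list[dict]:
    """Play every model against every other model exactly once."""
    matches = []
    for (model_a, score_a), (model_b, score_b) in combinations(scores_by_model.items(), 2):
        points_a, points_b = match_outcome(score_a, score_b)
        matches.append({
            "model_a": model_a,
            "model_b": model_b,
            "model_a_score": points_a,
            "model_b_score": points_b,
        })
    return matches


# bert_score values for one instance (aci_bench, id100)
instance_scores = {
    "openai/gpt-4o-mini": 0.832080,
    "google/gemini-1.5-pro-001": 0.827930,
    "deepseek-ai/deepseek-r1": 0.785560,
}

for match in pairwise_matches(instance_scores):
    print(match)
```

With n models per instance this yields n(n-1)/2 matches, so three models produce three matches.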

In [16]:
matches = create_matches(
    metrics_df,
    target_metric="bert_score",
    model_name_column="model_name",
    dataset_name_column="dataset_name",
    instance_id_column="instance_id",
    metric_name_column="metric_name",
    metric_value_column="metric_value",
)

matches_df = pd.DataFrame(matches)
matches_df

Unnamed: 0,model_a,model_b,model_a_score,model_b_score,model_a_metric,model_b_metric,metric_name,dataset_name,instance_id
0,openai/gpt-4o-mini,google/gemini-1.5-pro-001,0.5,0.5,0.832080,0.827930,bert_score,aci_bench,id100
1,openai/gpt-4o-mini,qwen/qwen2.5-7b-instruct,0.5,0.5,0.832080,0.841437,bert_score,aci_bench,id100
2,openai/gpt-4o-mini,openai/gpt-4o,0.5,0.5,0.832080,0.846204,bert_score,aci_bench,id100
3,openai/gpt-4o-mini,microsoft/phi-3.5-mini-instruct,0.5,0.5,0.832080,0.825319,bert_score,aci_bench,id100
4,openai/gpt-4o-mini,deepseek-ai/deepseek-r1,1.0,0.0,0.832080,0.785560,bert_score,aci_bench,id100
...,...,...,...,...,...,...,...,...,...
14350,openai/gpt-4o-mini,meta/llama-3.3-70b-instruct,0.5,0.5,0.764106,0.760645,bert_score,mtsamples_replicate,id9
14351,openai/gpt-4o-mini,openai/gpt-4o,0.5,0.5,0.764106,0.766231,bert_score,mtsamples_replicate,id9
14352,google/gemini-1.5-pro-001,meta/llama-3.3-70b-instruct,0.0,1.0,0.736867,0.760645,bert_score,mtsamples_replicate,id9
14353,google/gemini-1.5-pro-001,openai/gpt-4o,0.0,1.0,0.736867,0.766231,bert_score,mtsamples_replicate,id9


Check the standard deviation of the metric ratio (`model_a_metric / model_b_metric`) in one-on-one matches.

In [17]:
non_0_matches_df = matches_df[
    (matches_df["model_a_metric"] != 0) & (matches_df["model_b_metric"] != 0)
]

(non_0_matches_df["model_a_metric"] / non_0_matches_df["model_b_metric"]).describe()

count    14355.000000
mean         1.011275
std          0.052329
min          0.589143
25%          0.983533
50%          1.007395
75%          1.035708
max          1.688747
dtype: float64

Check the overall distribution of match outcomes (from `model_a`'s perspective: 1.0 is a win, 0.5 a draw, 0.0 a loss).

In [18]:
matches_df["model_a_score"].value_counts()

model_a_score
0.5    7983
1.0    4176
0.0    2196
Name: count, dtype: int64

### Calculate scores

Aggregated across all datasets

In [19]:
# Shuffle the matches
shuffled_matches_df = matches_df.sample(frac=1, random_state=42).reset_index(drop=True)

elo_ratings = {}
for _, row in shuffled_matches_df.iterrows():
    model_a = row["model_a"]
    model_b = row["model_b"]
    score_a = row["model_a_score"]
    score_b = row["model_b_score"]
    
    if model_a not in elo_ratings:
        elo_ratings[model_a] = 1500
    if model_b not in elo_ratings:
        elo_ratings[model_b] = 1500
    model_a_rating = elo_ratings[model_a]
    model_b_rating = elo_ratings[model_b]

    # NOTE: Rating changes are symmetric (equal in size, opposite in sign)
    # when both models use the same K-factor; they would differ if each
    # model had its own K-factor.
    model_a_rating_change = calculate_rating_change(
        starting_rating=model_a_rating,
        opponent_ratings=[model_b_rating],
        scores=[score_a],
        k=30
    )
    model_b_rating_change = calculate_rating_change(
        starting_rating=model_b_rating,
        opponent_ratings=[model_a_rating],
        scores=[score_b],
        k=30
    )

    elo_ratings[model_a] += model_a_rating_change
    elo_ratings[model_b] += model_b_rating_change

elo_ratings

{'microsoft/phi-3.5-mini-instruct': 1416,
 'google/gemini-1.5-pro-001': 1470,
 'openai/gpt-4o-mini': 1625,
 'qwen/qwen2.5-7b-instruct': 1489,
 'meta/llama-3.3-70b-instruct': 1612,
 'openai/gpt-4o': 1547,
 'deepseek-ai/deepseek-r1': 1341}
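
Because both models in a match use the same K, every match should add to one rating what it removes from the other, so the ratings should still average 1500. The sketch below is a quick sanity check on the ratings printed above; the variable names `total_drift` and `ranking` are introduced here for illustration.

```python
# Ratings copied from the output above.
elo_ratings = {
    "microsoft/phi-3.5-mini-instruct": 1416,
    "google/gemini-1.5-pro-001": 1470,
    "openai/gpt-4o-mini": 1625,
    "qwen/qwen2.5-7b-instruct": 1489,
    "meta/llama-3.3-70b-instruct": 1612,
    "openai/gpt-4o": 1547,
    "deepseek-ai/deepseek-r1": 1341,
}

initial_rating = 1500

# Difference between the total rating points now and at the start.
total_drift = sum(elo_ratings.values()) - initial_rating * len(elo_ratings)
print(f"Total drift from the initial pool: {total_drift}")

# Models ordered from highest to lowest rating.
ranking = sorted(elo_ratings, key=elo_ratings.get, reverse=True)
for position, model in enumerate(ranking, start=1):
    print(f"{position}. {model}: {elo_ratings[model]}")
```

A non-zero drift would point to asymmetric updates, for example different K-factors per model.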

In [20]:
metrics_dict = {}
for model, rating in elo_ratings.items():
    metrics_dict[model] = {"bert_score_elo": rating}

print(json.dumps(metrics_dict, indent=4))

{
    "microsoft/phi-3.5-mini-instruct": {
        "bert_score_elo": 1416
    },
    "google/gemini-1.5-pro-001": {
        "bert_score_elo": 1470
    },
    "openai/gpt-4o-mini": {
        "bert_score_elo": 1625
    },
    "qwen/qwen2.5-7b-instruct": {
        "bert_score_elo": 1489
    },
    "meta/llama-3.3-70b-instruct": {
        "bert_score_elo": 1612
    },
    "openai/gpt-4o": {
        "bert_score_elo": 1547
    },
    "deepseek-ai/deepseek-r1": {
        "bert_score_elo": 1341
    }
}
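
The same update loop is repeated for the per-dataset ratings, so it could be factored into a helper. The following is a minimal sketch under stated assumptions: `run_elo` and `expected_score` are hypothetical names, matches are passed as plain `(model_a, model_b, score_a)` tuples instead of DataFrame rows, and `score_b` is taken to be `1 - score_a`.

```python
def expected_score(rating: float, opponent_rating: float) -> float:
    """Probability of winning against the given opponent."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def run_elo(matches, k: int = 30, initial_rating: int = 1500) -> dict:
    """Apply one-game rating periods in order and return the final ratings."""
    ratings = {}
    for model_a, model_b, score_a in matches:
        # New models start at the initial rating.
        rating_a = ratings.setdefault(model_a, initial_rating)
        rating_b = ratings.setdefault(model_b, initial_rating)
        score_b = 1 - score_a

        # Both changes are computed from the pre-match ratings.
        change_a = round(k * (score_a - expected_score(rating_a, rating_b)))
        change_b = round(k * (score_b - expected_score(rating_b, rating_a)))

        ratings[model_a] = rating_a + change_a
        ratings[model_b] = rating_b + change_b
    return ratings


# Toy input with made-up model names.
toy_matches = [
    ("model_x", "model_y", 1.0),  # model_x wins
    ("model_y", "model_z", 0.5),  # draw
    ("model_x", "model_z", 0.0),  # model_z wins
]

toy_ratings = run_elo(toy_matches)
print(toy_ratings)
```

Per-dataset ratings would then amount to calling the helper once per group of matches.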


Per dataset

In [21]:
# Shuffle the matches
shuffled_matches_df = matches_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Initialize a dictionary to store ELO ratings for each dataset
elo_ratings_by_dataset = {}

# Group matches by dataset_name
for dataset_name, group in shuffled_matches_df.groupby("dataset_name"):
    elo_ratings = {}
    for _, row in group.iterrows():
        model_a = row["model_a"]
        model_b = row["model_b"]
        score_a = row["model_a_score"]
        score_b = row["model_b_score"]

        if model_a not in elo_ratings:
            elo_ratings[model_a] = 1500
        if model_b not in elo_ratings:
            elo_ratings[model_b] = 1500

        model_a_rating = elo_ratings[model_a]
        model_b_rating = elo_ratings[model_b]

        # Calculate rating changes for both models
        model_a_rating_change = calculate_rating_change(
            starting_rating=model_a_rating,
            opponent_ratings=[model_b_rating],
            scores=[score_a],
            k=30
        )
        model_b_rating_change = calculate_rating_change(
            starting_rating=model_b_rating,
            opponent_ratings=[model_a_rating],
            scores=[score_b],
            k=30
        )

        elo_ratings[model_a] += model_a_rating_change
        elo_ratings[model_b] += model_b_rating_change

    # Store the ELO ratings for the current dataset
    elo_ratings_by_dataset[dataset_name] = elo_ratings

# Convert the results into a dictionary for easier inspection
metrics_dict = {}
for dataset_name, ratings in elo_ratings_by_dataset.items():
    metrics_dict[dataset_name] = {model: {"bert_score_elo": rating} for model, rating in ratings.items()}

# Print the results
print(json.dumps(metrics_dict, indent=4))

{
    "aci_bench": {
        "openai/gpt-4o-mini": {
            "bert_score_elo": 1550
        },
        "qwen/qwen2.5-7b-instruct": {
            "bert_score_elo": 1492
        },
        "openai/gpt-4o": {
            "bert_score_elo": 1574
        },
        "microsoft/phi-3.5-mini-instruct": {
            "bert_score_elo": 1447
        },
        "google/gemini-1.5-pro-001": {
            "bert_score_elo": 1548
        },
        "meta/llama-3.3-70b-instruct": {
            "bert_score_elo": 1546
        },
        "deepseek-ai/deepseek-r1": {
            "bert_score_elo": 1343
        }
    },
    "medication_qa": {
        "openai/gpt-4o-mini": {
            "bert_score_elo": 1612
        },
        "qwen/qwen2.5-7b-instruct": {
            "bert_score_elo": 1478
        },
        "meta/llama-3.3-70b-instruct": {
            "bert_score_elo": 1596
        },
        "microsoft/phi-3.5-mini-instruct": {
            "bert_score_elo": 1350
        },
        "openai/gpt-4o": {
  

In [25]:
for dataset_name, ratings in elo_ratings_by_dataset.items():
    metrics_dict = {}
    for model, rating in ratings.items():
        metrics_dict[model] = {"bert_score_elo": rating}

    print(f"Dataset: {dataset_name}")
    print(json.dumps(metrics_dict, indent=4))

    # Save the metrics to a JSON file
    with open(f"elo_ratings_{dataset_name}.json", "w") as f:
        json.dump(metrics_dict, f, indent=4)

Dataset: aci_bench
{
    "openai/gpt-4o-mini": {
        "bert_score_elo": 1550
    },
    "qwen/qwen2.5-7b-instruct": {
        "bert_score_elo": 1492
    },
    "openai/gpt-4o": {
        "bert_score_elo": 1574
    },
    "microsoft/phi-3.5-mini-instruct": {
        "bert_score_elo": 1447
    },
    "google/gemini-1.5-pro-001": {
        "bert_score_elo": 1548
    },
    "meta/llama-3.3-70b-instruct": {
        "bert_score_elo": 1546
    },
    "deepseek-ai/deepseek-r1": {
        "bert_score_elo": 1343
    }
}
Dataset: medication_qa
{
    "openai/gpt-4o-mini": {
        "bert_score_elo": 1612
    },
    "qwen/qwen2.5-7b-instruct": {
        "bert_score_elo": 1478
    },
    "meta/llama-3.3-70b-instruct": {
        "bert_score_elo": 1596
    },
    "microsoft/phi-3.5-mini-instruct": {
        "bert_score_elo": 1350
    },
    "openai/gpt-4o": {
        "bert_score_elo": 1538
    },
    "google/gemini-1.5-pro-001": {
        "bert_score_elo": 1426
    }
}
Dataset: mtsamples
{
    "mic