# Computing scores from the `comparia` dataset

This notebook illustrates how to use the `Ranker` classes to compute scores from the `comparia` data.

## Loading the data

In [None]:
import os

# Directory shared by the Hugging Face datasets cache and hub cache
cache_dir = input("Path to the Hugging Face cache directory: ")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir

In [2]:
from rank_comparia.utils import load_comparia

token = None

reactions = load_comparia("ministere-culture/comparia-reactions", token=token)

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/5a02c58f54f2db3fc51f076d24a840f04efa01fa (last modified on Tue Sep 30 13:43:20 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).





## Formatting the data

Here we use *legacy* functions with a simple heuristic to determine the outcome of a *match* (a pair of conversations) from the associated reactions. For each model, we subtract the number of negative reactions from the number of positive reactions. The model with the higher difference wins the match. If both models have the same difference, the *match* is a draw (draws are filtered out in the `get_winners` function).

In [3]:
from rank_comparia.data_transformation import get_matches_with_score, get_winners, get_winrates

matches = get_matches_with_score(reactions)

matches.head(5)

model_a_name,model_b_name,conversation_pair_id,score_a,score_b
str,str,str,i64,i64
"""deepseek-r1-distill-llama-70b""","""claude-3-5-sonnet-v2""","""2276d42cda9b4cdf896323796721ea…",3,0
"""llama-3.1-nemotron-70b-instruc…","""qwen2.5-coder-32b-instruct""","""9e603a6802ec40de94388200a3b3dc…",0,-3
"""llama-3.1-8b""","""llama-3.1-405b""","""d11b8deefbbf425e82edac13f29fe7…",2,1
"""hermes-3-llama-3.1-405b""","""llama-3.1-70b""","""c1a5acc9e7a745f1837f0e5d846772…",-2,-2
"""gemini-2.0-flash-001""","""o3-mini""","""8cdab6f5d8f44d539a2a0f33ad22bf…",2,0


In [4]:
winners = get_winners(matches)
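The heuristic described above can be sketched on toy data. This is only an illustration: the reaction tuples below are a hypothetical layout, not the real `comparia-reactions` schema, and `match_scores` and `outcome` are made-up helper names rather than functions from `rank_comparia`.

```python
# Hypothetical toy reactions: (conversation_pair_id, model position, liked?)
toy_reactions = [
    ("pair-1", "a", True),
    ("pair-1", "a", True),
    ("pair-1", "b", False),
    ("pair-2", "a", False),
    ("pair-2", "b", False),
]


def match_scores(rows):
    """Return {pair_id: (score_a, score_b)}, score = positives minus negatives."""
    scores = {}
    for pair_id, position, liked in rows:
        score_a, score_b = scores.get(pair_id, (0, 0))
        delta = 1 if liked else -1
        if position == "a":
            score_a += delta
        else:
            score_b += delta
        scores[pair_id] = (score_a, score_b)
    return scores


def outcome(score_a, score_b):
    """Return 'a' or 'b' for a win, 'draw' when both differences are equal."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "draw"


toy_scores = match_scores(toy_reactions)
toy_outcomes = {pair_id: outcome(*s) for pair_id, s in toy_scores.items()}
print(toy_outcomes)  # {'pair-1': 'a', 'pair-2': 'draw'}

# Draws are dropped before computing win rates, as the text says get_winners does.
decided = {pair_id: o for pair_id, o in toy_outcomes.items() if o != "draw"}
```

Note that a model can win a match with a negative score, as long as its opponent's score is lower (see the second row of the `matches` preview, 0 against -3).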

We compute the win rate of each model.
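As a rough sketch of what a per-model win rate amounts to, the snippet below counts matches played and matches won on hypothetical toy data. The tuple layout and the `win_rates` helper are assumptions made for illustration; they do not reproduce the internals of `get_winrates`.

```python
from collections import defaultdict

# Hypothetical decided matches (draws already removed): (model_a, model_b, winner)
toy_winners = [
    ("m1", "m2", "m1"),
    ("m1", "m3", "m1"),
    ("m2", "m3", "m3"),
    ("m1", "m2", "m2"),
]


def win_rates(rows):
    """Return {model: win rate in percent} over the matches each model played."""
    played = defaultdict(int)
    won = defaultdict(int)
    for model_a, model_b, winner in rows:
        played[model_a] += 1
        played[model_b] += 1
        won[winner] += 1
    return {model: 100 * won[model] / played[model] for model in played}


rates = win_rates(toy_winners)
print(rates["m3"])  # 50.0
```

The `winrate` column in the output of this section is consistent with this definition: for `gemini-2.0-flash-exp`, 100 × 647 / 856 ≈ 75.58.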

In [5]:
winrates = get_winrates(winners)
winrates.sort("winrate", descending=True)

model_name,matches,wins,winrate
str,u32,u32,f64
"""gemini-2.0-flash-exp""",856,647,75.584112
"""deepseek-v3-chat""",1530,1080,70.588235
"""gemma-3-27b""",495,348,70.30303
"""gemini-2.0-flash-001""",688,480,69.767442
"""gemini-1.5-pro-001""",328,222,67.682927
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",585,222,37.948718
"""lfm-40b""",887,321,36.189402
"""mixtral-8x22b-instruct-v0.1""",1459,445,30.500343
"""mistral-nemo-2407""",1461,435,29.774127


## Computing the scores

For each *match* we compute a score, then shuffle the matches and add them one by one to an `ELORanker`, which updates the scores each time a match is added.

In [6]:
from rank_comparia.elo import ELORanker
from rank_comparia.ranker import Match, MatchScore
import random


def compute_match_score(score_a: int, score_b: int) -> MatchScore:
    final_score = score_b - score_a
    if final_score > 0:
        return MatchScore.B
    elif final_score < 0:
        return MatchScore.A
    else:
        return MatchScore.Draw


def get_shuffled_results(matches: list[Match], model_names: list[str], seed: int = 0):
    random.seed(seed)
    ranker_shuffle = ELORanker(K=40)
    matches_shuffle = random.sample(matches, k=len(matches))
    ranker_shuffle.add_players(model_names)
    ranker_shuffle.compute_scores(matches=matches_shuffle)
    return ranker_shuffle.players
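For readers unfamiliar with Elo, here is a minimal sketch of the textbook update applied after a single match. It assumes `ELORanker` follows the standard formula with a 400-point logistic scale; the notebook only confirms `K=40`, and the starting rating of 1000 used in the example is an assumption.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, result_a: float, k: float = 40.0):
    """Return updated (rating_a, rating_b).

    result_a is 1.0 if A wins, 0.5 for a draw, 0.0 if B wins.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (result_a - exp_a)
    new_b = rating_b + k * ((1.0 - result_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models at the assumed starting rating; A wins.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
print(new_a, new_b)  # 1020.0 980.0
```

The update is zero-sum: whatever one model gains, the other loses. With `K=40` a single match can move a rating by up to 40 points.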

100 sets of scores are computed, each with a different match insertion order.

In [7]:
model_names = set(matches["model_a_name"].unique()) | set(matches["model_b_name"].unique())
matches = [
    Match(
        match_dict["model_a_name"],
        match_dict["model_b_name"],
        compute_match_score(match_dict["score_a"], match_dict["score_b"]),
    )
    for match_dict in matches.to_dicts()
]

player_results = {
    seed: get_shuffled_results(matches=matches, model_names=model_names, seed=seed) for seed in range(100)  # type: ignore
}
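Shuffling matters because sequential Elo updates depend on the order of the matches. The toy sketch below shows this and then averages over several shuffles, mirroring the idea of the cell above. It reimplements a textbook Elo loop (`run_elo`, a hypothetical helper with an assumed starting rating of 1000) rather than calling `ELORanker`.

```python
import random
from statistics import mean


def run_elo(matches, k=40.0, initial=1000.0):
    """Sequential textbook Elo over (model_a, model_b, result_a) tuples."""
    ratings = {}
    for a, b, result_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        exp_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        ratings[a] = ra + k * (result_a - exp_a)
        ratings[b] = rb - k * (result_a - exp_a)
    return ratings


# Hypothetical toy matches: result_a is 1.0 when model_a wins, 0.0 when model_b wins.
toy_matches = [("m1", "m2", 1.0), ("m2", "m3", 1.0), ("m1", "m3", 0.0)]

forward = run_elo(toy_matches)
backward = run_elo(list(reversed(toy_matches)))
print(forward["m1"] != backward["m1"])  # True: same matches, different order


def shuffled_average(matches, n_seeds=100):
    """Average the final ratings over n_seeds random match orders."""
    runs = []
    for seed in range(n_seeds):
        rng = random.Random(seed)  # local generator, global state untouched
        runs.append(run_elo(rng.sample(matches, k=len(matches))))
    return {name: mean(run[name] for run in runs) for name in runs[0]}


toy_average = shuffled_average(toy_matches, n_seeds=10)
```

Using a local `random.Random(seed)` instead of `random.seed(seed)` is a design choice of this sketch: it keeps each run reproducible without modifying the global random state.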

The mean scores are computed:

In [8]:
players_avg_ranking = {
    # Divide by the actual number of shuffled runs instead of a hard-coded 100
    player_name: sum(results[player_name] for results in player_results.values()) / len(player_results)
    for player_name in model_names
}

In [9]:
for player, ranking in sorted(players_avg_ranking.items(), key=lambda x: -x[1]):
    print(f"{player} : {ranking}")

gemini-2.0-flash-exp : 1149.7600338965026
gemma-3-27b : 1145.6220675334805
gemini-2.0-flash-001 : 1140.1673630525236
deepseek-v3-chat : 1117.7780800034252
claude-3-7-sonnet : 1112.231287537241
deepseek-v3-0324 : 1109.901416165925
command-a : 1101.9762156523593
gpt-4.1-mini : 1096.6951244552292
gemma-3-12b : 1087.0556016448456
llama-3.1-nemotron-70b-instruct : 1076.7824321432136
grok-3-mini-beta : 1076.0651567130722
deepseek-r1 : 1071.477265680711
gemma-3-4b : 1061.133589658061
gemini-1.5-pro-002 : 1051.3243868214545
llama-4-scout : 1037.8032023669548
gemini-1.5-pro-001 : 1037.3937535429066
mistral-large-2411 : 1032.3269729226088
mistral-small-3.1-24b : 1024.6608968152257
claude-3-5-sonnet-v2 : 1006.9001525583038
o4-mini : 1005.4739126029199
gpt-4o-mini-2024-07-18 : 1004.5908532800541
o3-mini : 999.0178481656511
gpt-4.1-nano : 996.4487961017362
llama-3.3-70b : 993.7061815960126
mistral-saba : 993.2464174218336
gpt-4o-2024-08-06 : 991.2592904973827
mistral-small-24b-instruct-2501 : 983.7

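Two score computations with different match orders. The final scores vary noticeably from one order to the other: for example, `gpt-4.1-mini` ends at about 1185 with seed 42 and about 1092 with seed 1337.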

In [10]:
ranker_shuffle = ELORanker(K=40)

random.seed(42)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'gpt-4.1-mini': 1185.493097751954,
 'gemini-2.0-flash-001': 1175.05935674024,
 'gemini-2.0-flash-exp': 1141.2356325460535,
 'claude-3-7-sonnet': 1138.4313153049895,
 'command-a': 1135.805419915371,
 'llama-3.1-nemotron-70b-instruct': 1134.3952522668783,
 'o3-mini': 1129.406559564824,
 'deepseek-v3-0324': 1129.1817238871029,
 'gemma-3-27b': 1107.1520354735362,
 'grok-3-mini-beta': 1101.3270625304315,
 'o4-mini': 1097.5163374943552,
 'deepseek-v3-chat': 1095.2648937046413,
 'deepseek-r1': 1081.2666860633276,
 'gemini-1.5-pro-002': 1071.0964823222112,
 'gemma-3-12b': 1063.9251743128825,
 'llama-4-scout': 1057.9315061144282,
 'mistral-saba': 1055.131850109499,
 'gpt-4.1-nano': 1050.640918442269,
 'gemini-1.5-pro-001': 1041.3902843440005,
 'jamba-1.5-large': 1037.9679337504474,
 'claude-3-5-sonnet-v2': 1036.2063287117253,
 'mistral-small-3.1-24b': 1016.7154478338211,
 'qwq-32b': 1013.1401011343322,
 'deepseek-r1-distill-llama-70b': 1001.8564753573025,
 'mistral-large-2411': 997.01988686405

In [11]:
ranker_shuffle = ELORanker(K=40)

random.seed(1337)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'claude-3-7-sonnet': 1168.5374763730636,
 'command-a': 1156.1491953627,
 'gemma-3-27b': 1140.5073274661925,
 'gemini-2.0-flash-001': 1117.398502831557,
 'gemini-2.0-flash-exp': 1114.0765257436406,
 'mistral-small-3.1-24b': 1092.0629598889764,
 'gpt-4.1-mini': 1091.9791204062508,
 'deepseek-v3-chat': 1091.8955246876528,
 'llama-4-scout': 1086.7513518365213,
 'llama-3.1-nemotron-70b-instruct': 1081.0139150292869,
 'deepseek-v3-0324': 1076.5567744886228,
 'o3-mini': 1070.975017608004,
 'o4-mini': 1067.035865387678,
 'gpt-4o-mini-2024-07-18': 1038.869223948626,
 'mistral-saba': 1036.959243234743,
 'jamba-1.5-large': 1034.5212155339914,
 'qwen2.5-coder-32b-instruct': 1033.83075479585,
 'gemma-3-12b': 1023.405510181561,
 'ministral-8b-instruct-2410': 1022.8050708721671,
 'aya-expanse-8b': 1009.3704770503061,
 'deepseek-r1-distill-llama-70b': 1008.0840885023439,
 'claude-3-5-sonnet-v2': 1006.2331514651354,
 'grok-3-mini-beta': 999.1875817744526,
 'llama-3.1-70b': 994.4368529285404,
 'gemma-2

## Using a maximum-likelihood Ranker

Here we compute the scores with an alternative `Ranker`, defined in `src/rank_comparia/maximum_likelihood.py`.
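The notebook does not show how `MaximumLikelihoodRanker` is implemented. A common maximum-likelihood approach for pairwise comparisons is the Bradley-Terry model, so the sketch below fits one with the classical iterative (minorization-maximization) update, purely as an illustration of the idea. The helper names, the half-win treatment of draws, and the mapping to an Elo-like scale centred on 1000 are all assumptions.

```python
import math
from collections import defaultdict


def bradley_terry(matches, n_iter=200):
    """Fit Bradley-Terry strengths from (model_a, model_b, result_a) tuples.

    result_a is 1.0 (A wins), 0.5 (draw, counted as half a win each) or 0.0.
    Assumes every model has at least one (half-)win, otherwise the
    maximum-likelihood strength is zero and the log below fails.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)]: number of matches between i and j
    for a, b, result_a in matches:
        wins[a] += result_a
        wins[b] += 1.0 - result_a
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
    players = sorted(wins)
    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            denom = sum(
                n / (strength[i] + strength[j])
                for (x, j), n in games.items()
                if x == i
            )
            updated[i] = wins[i] / denom
        # Strengths are defined up to a factor: fix the geometric mean to 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {p: s / math.exp(log_mean) for p, s in updated.items()}
    return strength


def to_elo_scale(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    return {p: base + scale * math.log10(s) for p, s in strength.items()}


# Hypothetical toy matches: each pair meets four times.
toy_matches = (
    [("m1", "m2", 1.0)] * 3 + [("m1", "m2", 0.0)]
    + [("m2", "m3", 1.0)] * 3 + [("m2", "m3", 0.0)]
    + [("m1", "m3", 1.0)] * 3 + [("m1", "m3", 0.0)]
)
bt_elo = to_elo_scale(bradley_terry(toy_matches))
```

Unlike sequential Elo, a fit of this kind uses all matches at once, so the result does not depend on the order in which the matches are listed.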

In [12]:
from rank_comparia.maximum_likelihood import MaximumLikelihoodRanker

ranker = MaximumLikelihoodRanker()
ranker.compute_scores(matches=matches)
ranker.get_scores()

{'gemini-2.0-flash-exp': np.float64(1143.1512358130012),
 'gemma-3-27b': np.float64(1136.3184203836174),
 'gemini-2.0-flash-001': np.float64(1133.400570032988),
 'deepseek-v3-chat': np.float64(1114.0328789363393),
 'deepseek-v3-0324': np.float64(1113.2067915235539),
 'claude-3-7-sonnet': np.float64(1105.1612084450092),
 'command-a': np.float64(1101.8786858020478),
 'gpt-4.1-mini': np.float64(1093.6685242271394),
 'gemma-3-12b': np.float64(1081.4653106195133),
 'llama-3.1-nemotron-70b-instruct': np.float64(1078.196872612909),
 'grok-3-mini-beta': np.float64(1071.2724072426324),
 'deepseek-r1': np.float64(1064.2931624791943),
 'gemma-3-4b': np.float64(1055.3168065207053),
 'gemini-1.5-pro-002': np.float64(1047.884270532969),
 'llama-4-scout': np.float64(1038.2019664020102),
 'gemini-1.5-pro-001': np.float64(1037.4923002507746),
 'mistral-small-3.1-24b': np.float64(1030.0445144875346),
 'mistral-large-2411': np.float64(1028.6925455484807),
 'o3-mini': np.float64(1006.2210001085825),
 'cla

## Bootstrap

The `Ranker` classes have a `compute_bootstrap_scores` method that computes scores and bootstrap confidence intervals (for each bootstrap sample, the matches used to compute the scores are drawn with replacement from the initial sample of matches).
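The resampling idea can be sketched independently of the library. The code below is a minimal percentile bootstrap under stated assumptions: `bootstrap_scores`, `percentile` and `win_rate_scores` are hypothetical helpers, the scoring function is a plain win rate rather than a real `Ranker`, and the rank intervals reported by `compute_bootstrap_scores` are not reproduced.

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)


def win_rate_scores(matches):
    """Stand-in scoring function: fraction of matches won by each model."""
    played, won = {}, {}
    for a, b, result_a in matches:
        for name in (a, b):
            played[name] = played.get(name, 0) + 1
        won[a] = won.get(a, 0.0) + result_a
        won[b] = won.get(b, 0.0) + (1.0 - result_a)
    return {name: won[name] / played[name] for name in played}


def bootstrap_scores(matches, score_fn, n_samples=100, seed=0):
    """Median and 95% percentile interval of score_fn over bootstrap resamples."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_samples):
        # Draw len(matches) matches with replacement from the original sample.
        resampled = rng.choices(matches, k=len(matches))
        for name, score in score_fn(resampled).items():
            samples.setdefault(name, []).append(score)
    summary = {}
    for name, values in samples.items():
        values.sort()
        summary[name] = {
            "p2.5": percentile(values, 2.5),
            "median": percentile(values, 50),
            "p97.5": percentile(values, 97.5),
        }
    return summary


# Hypothetical toy matches: (model_a, model_b, result_a)
toy_matches = (
    [("m1", "m2", 1.0)] * 3 + [("m1", "m2", 0.0)]
    + [("m2", "m3", 1.0)] * 3 + [("m2", "m3", 0.0)]
    + [("m1", "m3", 1.0)] * 3 + [("m1", "m3", 0.0)]
)
summary = bootstrap_scores(toy_matches, win_rate_scores)
```

Any function that maps a list of matches to a score per model can be passed as `score_fn`, which is why the interval width depends on how stable the underlying scoring method is.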

In [13]:
ranker = ELORanker(K=40)

ranker.add_players(model_names)  # type: ignore
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:04<00:00, 23.25it/s]


model_name,median,p2.5,p97.5,rank,rank_p2.5,rank_p97.5
str,f64,f64,f64,u32,i64,i64
"""gemini-2.0-flash-exp""",1152.000254,1053.123136,1262.464787,1,1,16
"""gemma-3-27b""",1141.022557,1036.195015,1236.557939,2,1,18
"""gemini-2.0-flash-001""",1129.024535,1009.508868,1232.417883,3,1,23
"""deepseek-v3-0324""",1116.323379,1000.538134,1210.532661,4,1,26
"""command-a""",1114.870281,1003.453468,1236.458954,5,1,26
…,…,…,…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",882.380214,771.627778,998.907549,44,26,48
"""phi-3.5-mini-instruct""",879.390624,775.675877,1003.517693,45,24,48
"""mixtral-8x22b-instruct-v0.1""",861.395916,771.10219,958.908885,46,31,48
"""mistral-nemo-2407""",838.852441,717.000266,954.458342,47,33,48


In [14]:
ranker = MaximumLikelihoodRanker()
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:06<00:00, 15.15it/s]


model_name,median,p2.5,p97.5,rank,rank_p2.5,rank_p97.5
str,f64,f64,f64,u32,i64,i64
"""gemini-2.0-flash-exp""",1144.729114,1121.742313,1167.21675,1,1,4
"""gemini-2.0-flash-001""",1133.627056,1110.422788,1158.197063,2,1,6
"""gemma-3-27b""",1132.57884,1110.02133,1163.233521,3,1,6
"""deepseek-v3-0324""",1113.265848,1070.393128,1144.563903,4,1,11
"""deepseek-v3-chat""",1112.383248,1096.916518,1128.96423,5,2,8
…,…,…,…,…,…,…
"""phi-3.5-mini-instruct""",889.187149,863.661018,919.916999,44,42,45
"""mixtral-8x7b-instruct-v0.1""",881.976693,859.420482,908.405145,45,43,46
"""mixtral-8x22b-instruct-v0.1""",859.889721,844.835259,873.09579,46,45,48
"""mistral-nemo-2407""",854.536247,841.993482,869.60709,47,46,48


# Adding frugality to the score

In [16]:
from rank_comparia.frugality import get_normalized_log_cost, calculate_frugality_score
from rank_comparia.plot import plot_elo_against_frugal_elo

conversations = load_comparia("ministere-culture/comparia-conversations", token=token)
conversations = conversations.rename({"model_a_name": "model_a", "model_b_name": "model_b"})

frugality_score = calculate_frugality_score(conversations, None)
graph = plot_elo_against_frugal_elo(
    frugal_log_score=get_normalized_log_cost(frugality_score, mean="token"), bootstraped_scores=scores
)

Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).





In [17]:
graph