# Computing scores from the `comparia` dataset

This notebook illustrates how to use the `Ranker` classes to compute scores from the `comparia` data.

## Loading the data

In [1]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [None]:
from rank_comparia.utils import load_comparia

reactions = load_comparia("ministere-culture/comparia-reactions")

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/92a324c10228176065909b52bbbaa16430e64c5a (last modified on Wed Jun  4 17:40:33 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


## Formatting the data

Here we use *legacy* functions with a simple heuristic to determine the outcome of a *match* (a pair of conversations) from the associated reactions. For each model, the number of negative reactions is subtracted from the number of positive reactions. The model with the higher difference wins the match. If both models have the same difference, the *match* is a draw (draws are filtered out in the `get_winners` function).

In [3]:
from rank_comparia.data_transformation import get_matches_with_score, get_winners, get_winrates

matches = get_matches_with_score(reactions)

matches.head(5)

model_a_name,model_b_name,conversation_pair_id,score_a,score_b
str,str,str,i64,i64
"""mistral-nemo-2407""","""gemini-2.0-flash-exp""","""c8b9ec214d2f4dafb0a95c78d8a892…",1,0
"""mistral-nemo-2407""","""gpt-4o-mini-2024-07-18""","""dff1c1d545e948f89c1fcd1d7a2abc…",0,2
"""phi-4""","""qwen2.5-coder-32b-instruct""","""eb8ae1d9580944d89b6a48162949e7…",-1,1
"""gemini-2.0-flash-exp""","""gpt-4o-2024-08-06""","""b01da0ffc514436ba0bafb5fc93532…",0,1
"""mixtral-8x7b-instruct-v0.1""","""gpt-4o-mini-2024-07-18""","""080356cf942546dd92483be99cb025…",1,0


In [4]:
winners = get_winners(matches)
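
The heuristic described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code of `get_matches_with_score` or `get_winners`: the helpers `net_score` and `decide_winner`, the model names and the reaction counts are all hypothetical.

```python
from typing import Optional


def net_score(positive: int, negative: int) -> int:
    """Positive reactions minus negative reactions for one model."""
    return positive - negative


def decide_winner(model_a: str, score_a: int, model_b: str, score_b: int) -> Optional[str]:
    """Return the model with the higher net score, or None for a draw."""
    if score_a > score_b:
        return model_a
    if score_b > score_a:
        return model_b
    return None


# Hypothetical matches: (model_a, score_a, model_b, score_b).
toy_matches = [
    ("model-x", net_score(2, 1), "model-y", net_score(0, 0)),  # 1 vs 0: model-x wins
    ("model-x", net_score(0, 1), "model-z", net_score(1, 0)),  # -1 vs 1: model-z wins
    ("model-y", net_score(1, 1), "model-z", net_score(0, 0)),  # 0 vs 0: draw
]

# Draws are dropped, mirroring what the text says `get_winners` does.
toy_winners = [w for w in (decide_winner(*m) for m in toy_matches) if w is not None]
print(toy_winners)
```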

We compute win rates per model.
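
As a rough sketch, a win rate can be read as the percentage of a model's decided matches that it won. The `winrate` helper below is hypothetical and assumes draws were already filtered out; it is not the implementation of `get_winrates`. For instance, 647 wins out of 856 matches gives about 75.58 %.

```python
def winrate(wins: int, n_matches: int) -> float:
    """Win rate in percent, assuming draws have already been removed."""
    if n_matches == 0:
        raise ValueError("a model with no matches has no win rate")
    return 100 * wins / n_matches


print(round(winrate(647, 856), 6))  # 75.584112
```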

In [5]:
winrates = get_winrates(winners)
winrates.sort("winrate", descending=True)

model_name,len,wins,winrate
str,u32,u32,f64
"""gemini-2.0-flash-exp""",856,647,75.584112
"""deepseek-v3-chat""",1530,1080,70.588235
"""gemma-3-27b""",495,348,70.30303
"""gemini-2.0-flash-001""",688,480,69.767442
"""gemini-1.5-pro-001""",328,222,67.682927
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",585,222,37.948718
"""lfm-40b""",887,321,36.189402
"""mixtral-8x22b-instruct-v0.1""",1459,445,30.500343
"""mistral-nemo-2407""",1461,435,29.774127


## Computing the scores

An outcome (`MatchScore`) is computed for each *match*. The matches are then shuffled and added one by one to an `ELORanker`, which updates the scores after each match.

In [6]:
from rank_comparia.elo import ELORanker
from rank_comparia.ranker import Match, MatchScore
import random


def compute_match_score(score_a: int, score_b: int) -> MatchScore:
    """Map the two net reaction scores to the outcome of the match (A, B or Draw)."""
    final_score = score_b - score_a
    if final_score > 0:
        return MatchScore.B
    elif final_score < 0:
        return MatchScore.A
    else:
        return MatchScore.Draw


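def get_shuffled_results(matches: list[Match], model_names: list[str], seed: int = 0):
    """Compute Elo scores after shuffling the matches with the given seed."""
    random.seed(seed)
    ranker_shuffle = ELORanker(K=40)
    # random.sample with k=len(matches) returns a shuffled copy and leaves `matches` untouched
    matches_shuffle = random.sample(matches, k=len(matches))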
    ranker_shuffle.add_players(model_names)
    ranker_shuffle.compute_scores(matches=matches_shuffle)
    return ranker_shuffle.players
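
To see why the order of the matches matters, here is a minimal sketch of the standard Elo update rule. It is not the code of `ELORanker`: the helpers `expected_score`, `elo_update` and `replay`, the starting rating of 1000 and the 400-point scale are assumptions based on the usual Elo formulation; only `K=40` comes from the cell above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, result_a: float, k: float = 40) -> tuple[float, float]:
    """Return updated ratings; result_a is 1 (A wins), 0.5 (draw) or 0 (B wins)."""
    delta = k * (result_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def replay(results: list[float], start: float = 1000.0) -> tuple[float, float]:
    """Apply a sequence of results between the same two players, one by one."""
    a = b = start
    for result_a in results:
        a, b = elo_update(a, b, result_a)
    return a, b


# The same two matches (one win each), replayed in both orders.
print(replay([1, 0]))  # A wins first, then B
print(replay([0, 1]))  # B wins first, then A
```

In this toy example the player who won the most recent match ends up slightly ahead, even though both players have one win each. Averaging over several shuffles reduces this dependence on the order.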

100 sets of scores are computed, each with the matches added in a different order.

In [7]:
model_names = set(matches["model_a_name"].unique()) | set(matches["model_b_name"].unique())
matches = [
    Match(
        match_dict["model_a_name"],
        match_dict["model_b_name"],
        compute_match_score(match_dict["score_a"], match_dict["score_b"]),
    )
    for match_dict in matches.to_dicts()
]

player_results = {
    seed: get_shuffled_results(matches=matches, model_names=model_names, seed=seed) for seed in range(100)  # type: ignore
}

The average scores are computed:

In [8]:
players_avg_ranking = {
    player_name: sum(results[player_name] for results in player_results.values()) / len(player_results) for player_name in model_names
}
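
Alongside the mean, the spread of each score across shuffles gives an idea of how sensitive it is to the order of the matches. The sketch below uses hypothetical data shaped like `player_results` (a mapping from seed to a `{player_name: rating}` dictionary, which is an assumption about what `ranker.players` returns) and the standard `statistics` module.

```python
from statistics import fmean, stdev

# Hypothetical ratings per seed: {seed: {player_name: rating}}.
toy_results = {
    0: {"model-x": 1010.0, "model-y": 990.0},
    1: {"model-x": 1030.0, "model-y": 970.0},
    2: {"model-x": 1020.0, "model-y": 980.0},
}

players = sorted(toy_results[0])
summary = {
    name: (
        fmean(run[name] for run in toy_results.values()),  # average rating over seeds
        stdev(run[name] for run in toy_results.values()),  # sample standard deviation over seeds
    )
    for name in players
}
for name, (mean, spread) in summary.items():
    print(f"{name}: mean={mean:.1f}, stdev={spread:.1f}")
```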

In [9]:
for player, ranking in sorted(players_avg_ranking.items(), key=lambda x: -x[1]):
    print(f"{player} : {ranking}")

gemini-2.0-flash-exp : 1147.8481175501213
gemma-3-27b : 1145.7125030288078
gemini-2.0-flash-001 : 1129.5428311878898
deepseek-v3-0324 : 1125.3260784689844
deepseek-v3-chat : 1124.6133978301414
command-a : 1107.5363513892096
claude-3-7-sonnet : 1103.229348265574
gpt-4.1-mini : 1098.503816385903
gemma-3-12b : 1088.4978148565012
llama-3.1-nemotron-70b-instruct : 1080.3102454203633
grok-3-mini-beta : 1070.8061027704755
deepseek-r1 : 1069.1528137014666
gemma-3-4b : 1062.2977150873512
gemini-1.5-pro-002 : 1048.9745576207565
gemini-1.5-pro-001 : 1037.9813083316171
llama-4-scout : 1036.8566662260016
mistral-large-2411 : 1023.6826125523356
mistral-small-3.1-24b : 1021.6568432057678
o4-mini : 1006.6902726205205
claude-3-5-sonnet-v2 : 1005.5743258490104
gpt-4o-2024-08-06 : 1004.3699633518434
o3-mini : 1002.3993567426545
gpt-4o-mini-2024-07-18 : 1001.2141806253848
llama-3.1-405b : 996.804254959252
mistral-saba : 994.494144668334
mistral-small-24b-instruct-2501 : 994.4299252148985
llama-3.3-70b : 9

Two score computations with different match orders:

In [10]:
ranker_shuffle = ELORanker(K=40)

random.seed(42)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'gemma-3-27b': 1175.7675169152285,
 'deepseek-r1': 1174.2652704264415,
 'gemini-1.5-pro-002': 1172.0786055985104,
 'llama-3.1-nemotron-70b-instruct': 1132.3337169470485,
 'claude-3-7-sonnet': 1128.863574152478,
 'llama-3.3-70b': 1119.6917588766162,
 'deepseek-v3-0324': 1118.149224110632,
 'grok-3-mini-beta': 1115.0312086006782,
 'gemini-1.5-pro-001': 1100.5184241312677,
 'mistral-large-2411': 1096.7574765960023,
 'mistral-small-3.1-24b': 1082.2690858138126,
 'command-a': 1081.9517099869151,
 'gemma-3-12b': 1074.0889420030878,
 'gemma-2-27b-it-q8': 1070.6257559112175,
 'gemma-2-9b-it': 1067.4373937253715,
 'gpt-4o-mini-2024-07-18': 1058.2175255408786,
 'llama-3.1-70b': 1057.9762109737276,
 'gemma-3-4b': 1054.437991134534,
 'gemini-2.0-flash-001': 1051.1481451057036,
 'mistral-saba': 1047.1976969808816,
 'mistral-small-24b-instruct-2501': 1046.9249610246507,
 'gemini-2.0-flash-exp': 1028.484044771889,
 'gpt-4.1-mini': 1027.632715985455,
 'deepseek-v3-chat': 1025.384156513145,
 'claude-3

In [11]:
ranker_shuffle = ELORanker(K=40)

random.seed(1337)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'gemini-2.0-flash-001': 1171.92148019597,
 'gemini-2.0-flash-exp': 1169.0353339957626,
 'gemma-3-27b': 1168.9292503999961,
 'deepseek-r1': 1143.5056696392357,
 'gpt-4.1-mini': 1135.4871025277052,
 'command-a': 1118.5292852098798,
 'deepseek-v3-0324': 1113.9310815347837,
 'llama-3.1-nemotron-70b-instruct': 1099.7292452690317,
 'gemma-3-12b': 1089.6004058492053,
 'claude-3-7-sonnet': 1083.3614620013254,
 'gemini-1.5-pro-001': 1082.626905044981,
 'gemini-1.5-pro-002': 1056.7671938056449,
 'deepseek-v3-chat': 1048.2244121294304,
 'o3-mini': 1045.2339253978762,
 'o4-mini': 1042.195219022252,
 'llama-3.1-405b': 1039.021646735078,
 'jamba-1.5-large': 1036.8735697183015,
 'gemma-3-4b': 1035.1880985062458,
 'deepseek-r1-distill-llama-70b': 1032.6497520747007,
 'aya-expanse-8b': 1032.2632356843005,
 'mistral-small-3.1-24b': 1022.610683586296,
 'hermes-3-llama-3.1-405b': 1015.2924618594501,
 'grok-3-mini-beta': 1012.240710581544,
 'llama-4-scout': 1010.0772783145576,
 'mistral-saba': 1001.145938

## Using a maximum-likelihood Ranker

Here the scores are computed with an alternative `Ranker`, defined in `src/rank_comparia/maximum_likelihood.py`.
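
For intuition, one common way to obtain maximum-likelihood ratings from pairwise results is the Bradley-Terry model. The sketch below fits it with the classic MM iteration and maps the strengths to an Elo-like scale. It is an assumption-laden illustration, not the contents of `maximum_likelihood.py`: the function `bradley_terry_scores`, the offset of 1000, the handling of draws (ignored) and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def bradley_terry_scores(wins: dict[tuple[str, str], int], n_iter: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with MM updates and return Elo-like scores.

    wins[(i, j)] is the number of times i beat j. Draws are ignored, and every
    player is assumed to have at least one win (otherwise its strength goes to 0).
    """
    players = sorted({p for pair in wins for p in pair})
    total_wins: dict[str, int] = defaultdict(int)
    games: dict[tuple[str, str], int] = defaultdict(int)  # matches played per pair
    for (winner, loser), n in wins.items():
        total_wins[winner] += n
        games[(winner, loser)] += n
        games[(loser, winner)] += n

    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            denom = sum(games[(i, j)] / (strength[i] + strength[j]) for j in players if j != i)
            updated[i] = total_wins[i] / denom
        # Strengths are only defined up to a factor: fix the geometric mean to 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {p: s / math.exp(log_mean) for p, s in updated.items()}

    # Same logistic scale as Elo: a 400-point gap means 10-to-1 odds.
    return {p: 1000 + 400 * math.log10(s) for p, s in strength.items()}


toy_wins = {("a", "b"): 3, ("b", "a"): 1, ("b", "c"): 3, ("c", "b"): 1, ("a", "c"): 2, ("c", "a"): 1}
toy_scores = bradley_terry_scores(toy_wins)
print(sorted(toy_scores.items(), key=lambda item: -item[1]))
```

Unlike the sequential Elo update, this kind of fit uses all matches at once, so the result does not depend on the order in which they are listed.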

In [12]:
from rank_comparia.maximum_likelihood import MaximumLikelihoodRanker

ranker = MaximumLikelihoodRanker()
ranker.compute_scores(matches=matches)
ranker.get_scores()

{'gemini-2.0-flash-exp': np.float64(1143.1512358130012),
 'gemma-3-27b': np.float64(1136.3184203836174),
 'gemini-2.0-flash-001': np.float64(1133.400570032988),
 'deepseek-v3-chat': np.float64(1114.0328789363393),
 'deepseek-v3-0324': np.float64(1113.2067915235539),
 'claude-3-7-sonnet': np.float64(1105.1612084450092),
 'command-a': np.float64(1101.8786858020478),
 'gpt-4.1-mini': np.float64(1093.6685242271394),
 'gemma-3-12b': np.float64(1081.4653106195133),
 'llama-3.1-nemotron-70b-instruct': np.float64(1078.196872612909),
 'grok-3-mini-beta': np.float64(1071.2724072426324),
 'deepseek-r1': np.float64(1064.2931624791943),
 'gemma-3-4b': np.float64(1055.3168065207053),
 'gemini-1.5-pro-002': np.float64(1047.884270532969),
 'llama-4-scout': np.float64(1038.2019664020102),
 'gemini-1.5-pro-001': np.float64(1037.4923002507746),
 'mistral-small-3.1-24b': np.float64(1030.0445144875346),
 'mistral-large-2411': np.float64(1028.6925455484807),
 'o3-mini': np.float64(1006.2210001085826),
 'cla

## Bootstrap

The `Ranker` classes have a `compute_bootstrap_scores` method that computes bootstrap scores and confidence intervals (for each bootstrap sample, the matches used to compute the scores are drawn by resampling with replacement from the initial sample of matches).
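
The resampling idea can be sketched independently of the rankers. In the hypothetical helper `bootstrap_interval` below, the statistic is a simple win rate rather than a set of ranker scores, and the number of resamples (100) and the seed are arbitrary choices. It is a percentile bootstrap written with the standard library only, not the implementation of `compute_bootstrap_scores`.

```python
import random
from statistics import median, quantiles


def bootstrap_interval(sample, statistic, n_bootstrap=100, seed=0):
    """Percentile bootstrap: return (median, p2.5, p97.5) of the statistic."""
    rng = random.Random(seed)
    estimates = [
        statistic(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_bootstrap)
    ]
    cuts = quantiles(estimates, n=40)  # 39 cut points, one every 2.5 %
    return median(estimates), cuts[0], cuts[-1]


# Toy sample: 1 = win, 0 = loss for one hypothetical model (60 wins out of 100).
outcomes = [1] * 60 + [0] * 40
mid, low, high = bootstrap_interval(outcomes, lambda xs: sum(xs) / len(xs))
print(f"median={mid:.3f}, 95% interval=[{low:.3f}, {high:.3f}]")
```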

In [13]:
ranker = ELORanker(K=40)

ranker.add_players(model_names)  # type: ignore
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:04<00:00, 24.35it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1153.018523,1041.897383,1287.775123
"""gemma-3-27b""",1147.807565,1023.970027,1258.39757
"""deepseek-v3-chat""",1133.735125,1026.499754,1252.080781
"""gemini-2.0-flash-001""",1132.257993,997.639738,1243.592212
"""deepseek-v3-0324""",1117.417997,995.152635,1219.643944
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",893.843588,774.208085,1012.10676
"""lfm-40b""",889.784753,792.780578,990.414233
"""mistral-nemo-2407""",851.233953,766.146503,950.10665
"""mixtral-8x22b-instruct-v0.1""",850.156447,764.034211,952.399245


In [14]:
ranker = MaximumLikelihoodRanker()
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:05<00:00, 16.80it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1144.740823,1121.58747,1167.795144
"""gemma-3-27b""",1138.644425,1114.739581,1168.384904
"""gemini-2.0-flash-001""",1133.733449,1115.320508,1156.431579
"""deepseek-v3-0324""",1113.812728,1080.199238,1148.106654
"""deepseek-v3-chat""",1113.630365,1097.878134,1132.646018
…,…,…,…
"""phi-3.5-mini-instruct""",889.556082,858.835227,922.926601
"""mixtral-8x7b-instruct-v0.1""",884.614638,856.992706,904.549497
"""mixtral-8x22b-instruct-v0.1""",860.767683,845.844134,875.150503
"""mistral-nemo-2407""",857.140384,840.639206,868.92556


# Adding the notion of frugality to the score

In [15]:
from rank_comparia.frugality import draw_ranked_frugality, get_normalized_log_cost, calculate_frugality_score

conversations = load_comparia("ministere-culture/comparia-conversations")
frugality_score = calculate_frugality_score(conversations, None, True)
graph = draw_ranked_frugality(frugal_log_score=get_normalized_log_cost(frugality_score), bootstraped_scores=scores)

Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


In [16]:
graph