# Computing scores from the `comparia` dataset

In this notebook we illustrate how to use the `Ranker` classes to compute scores from the `comparia` data.

## Loading the data

In [1]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [2]:
from rank_comparia.utils import load_comparia

reactions = load_comparia("ministere-culture/comparia-reactions")

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/92a324c10228176065909b52bbbaa16430e64c5a (last modified on Wed Jun  4 17:40:33 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


## Formatting the data

Here we use *legacy* functions with a simple heuristic to determine the outcome of a *match* (a pair of conversations) from the associated reactions. For each model, the number of negative reactions is subtracted from the number of positive reactions. The model with the larger difference wins the match. If both models have the same difference, the *match* is a tie (ties are filtered out in the `get_winners` function).
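
The heuristic described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the code of `get_matches_with_score` or `get_winners`; the helper names `net_score` and `match_winner` are hypothetical.

```python
from typing import Optional


def net_score(positive: int, negative: int) -> int:
    """Net score of one model in a match: positive minus negative reactions."""
    return positive - negative


def match_winner(score_a: int, score_b: int) -> Optional[str]:
    """Return "a" or "b" for the model with the higher net score, None for a tie."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return None  # tie: such matches would be filtered out before ranking


# A model with 3 positive and 1 negative reaction has a net score of 2.
print(net_score(3, 1))
# Net scores can be negative; 0 still beats -1.
print(match_winner(0, -1))
```

Under this rule a match with net scores `(0, -1)` is won by model A even though it received no positive reaction.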

In [3]:
from rank_comparia.data_transformation import get_matches_with_score, get_winners, get_winrates

matches = get_matches_with_score(reactions)

matches.head(5)

model_a_name,model_b_name,conversation_pair_id,score_a,score_b
str,str,str,i64,i64
"""qwen2.5-coder-32b-instruct""","""mistral-small-24b-instruct-250…","""bcda2b64d8274b1087921c0b7cc7d8…",1,0
"""hermes-3-llama-3.1-405b""","""command-a""","""ecefa182135e4acc9cf04be13b1145…",0,1
"""mixtral-8x22b-instruct-v0.1""","""gemma-2-9b-it""","""bae309920e444c9ea73dbc977b5584…",0,-1
"""phi-4""","""llama-3.1-70b""","""354dce524e094fd3a0d39bb0422154…",2,0
"""mistral-large-2411""","""llama-3.1-nemotron-70b-instruc…","""1739b0fd5fb34f43a182863a6f3f55…",0,2


In [4]:
winners = get_winners(matches)

We compute a win rate for each model.
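
As a rough sketch of what a per-model win rate means here, the snippet below counts decided matches and wins, then reports wins as a percentage of matches played. It assumes ties have already been removed and is not the implementation of `get_winrates`; the function name `compute_winrates` and the toy model names are hypothetical.

```python
from collections import Counter


def compute_winrates(decided_matches):
    """decided_matches: iterable of (model_a, model_b, winner_name), ties removed."""
    played = Counter()
    wins = Counter()
    for model_a, model_b, winner in decided_matches:
        played[model_a] += 1
        played[model_b] += 1
        wins[winner] += 1
    # Win rate expressed as a percentage of the matches each model played.
    return {model: 100.0 * wins[model] / played[model] for model in played}


toy_matches = [
    ("x", "y", "x"),
    ("x", "z", "z"),
    ("x", "y", "x"),
]
rates = compute_winrates(toy_matches)
for model in sorted(rates, key=rates.get, reverse=True):
    print(model, round(rates[model], 2))
```

The same ratio is consistent with the `winrates` table in this notebook, where 647 wins out of 856 matches gives about 75.58.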

In [5]:
winrates = get_winrates(winners)
winrates.sort("winrate", descending=True)

model_name,len,wins,winrate
str,u32,u32,f64
"""gemini-2.0-flash-exp""",856,647,75.584112
"""deepseek-v3-chat""",1530,1080,70.588235
"""gemma-3-27b""",495,348,70.30303
"""gemini-2.0-flash-001""",688,480,69.767442
"""gemini-1.5-pro-001""",328,222,67.682927
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",585,222,37.948718
"""lfm-40b""",887,321,36.189402
"""mixtral-8x22b-instruct-v0.1""",1459,445,30.500343
"""mistral-nemo-2407""",1461,435,29.774127


## Computing scores

For each *match* we compute a score, shuffle the matches, and add them one at a time to an `ELORanker`, which updates the scores after each match.

In [6]:
from rank_comparia.elo import ELORanker
from rank_comparia.ranker import Match, MatchScore
import random


def compute_match_score(score_a: int, score_b: int) -> MatchScore:
    final_score = score_b - score_a
    if final_score > 0:
        return MatchScore.B
    elif final_score < 0:
        return MatchScore.A
    else:
        return MatchScore.Draw


def get_shuffled_results(matches: list[Match], model_names: list[str], seed: int = 0):
    random.seed(seed)
    ranker_shuffle = ELORanker(K=40)
    matches_shuffle = random.sample(matches, k=len(matches))
    ranker_shuffle.add_players(model_names)
    ranker_shuffle.compute_scores(matches=matches_shuffle)
    return ranker_shuffle.players
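
To make the one-match-at-a-time update concrete, here is a minimal sketch of the textbook Elo rule. It assumes a logistic expected score with a 400-point scale and an initial rating of 1000; whether `ELORanker` uses exactly these constants is not shown in this notebook, and the helper names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, result_a: float, k: float = 40.0):
    """result_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a draw."""
    delta = k * (result_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_elo(matches, initial: float = 1000.0, k: float = 40.0):
    """Process (model_a, model_b, result_a) tuples in order and return final ratings."""
    ratings = {}
    for model_a, model_b, result_a in matches:
        rating_a = ratings.setdefault(model_a, initial)
        rating_b = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = elo_update(rating_a, rating_b, result_a, k)
    return ratings


# Three toy matches forming a cycle: x beats y, y beats z, z beats x.
toy = [("x", "y", 1.0), ("y", "z", 1.0), ("z", "x", 1.0)]
forward = run_elo(toy)
backward = run_elo(list(reversed(toy)))
print(forward)
print(backward)
```

The two runs contain the same matches but end with different ratings, which is why the notebook shuffles the matches and repeats the computation over many seeds.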

100 sets of scores are computed, each with a different match insertion order.

In [None]:
model_names = set(matches["model_a_name"].unique()) | set(matches["model_b_name"].unique())
matches = [
    Match(
        match_dict["model_a_name"],
        match_dict["model_b_name"],
        compute_match_score(match_dict["score_a"], match_dict["score_b"]),
    )
    for match_dict in matches.to_dicts()
]

player_results = {
    seed: get_shuffled_results(matches=matches, model_names=model_names, seed=seed) for seed in range(100)  # type: ignore
}

The mean scores are then computed:

In [None]:
players_avg_ranking = {
    # Average each model's score over all shuffled runs.
    player_name: sum(results[player_name] for results in player_results.values()) / len(player_results)
    for player_name in model_names
}

In [None]:
for player, ranking in sorted(players_avg_ranking.items(), key=lambda x: -x[1]):
    print(f"{player} : {ranking}")

gemini-2.0-flash-exp : 1150.7875539940342
gemma-3-27b : 1146.1814815271186
gemini-2.0-flash-001 : 1145.8797934782958
deepseek-v3-0324 : 1120.69196210039
claude-3-7-sonnet : 1116.8451653200439
deepseek-v3-chat : 1112.2859940634464
command-a : 1099.1468132613213
gpt-4.1-mini : 1093.3035931469067
gemma-3-12b : 1079.6231164328428
grok-3-mini-beta : 1075.3302854861156
llama-3.1-nemotron-70b-instruct : 1075.2449628593781
deepseek-r1 : 1069.4332452165186
gemini-1.5-pro-002 : 1059.8031897395003
gemma-3-4b : 1050.277091070327
llama-4-scout : 1042.615116941869
gemini-1.5-pro-001 : 1038.5672651384687
mistral-large-2411 : 1027.209557652193
mistral-small-3.1-24b : 1026.3386272941707
o3-mini : 1013.0501472659827
claude-3-5-sonnet-v2 : 1009.8067470339537
o4-mini : 1003.4964558569891
llama-3.3-70b : 1001.1886684951511
gpt-4o-mini-2024-07-18 : 998.4060853891027
gpt-4o-2024-08-06 : 994.4362943268783
mistral-saba : 994.2686844396636
llama-3.1-405b : 993.0428972867334
mistral-small-24b-instruct-2501 : 988

Two score computations with different match orders:

In [None]:
ranker_shuffle = ELORanker(K=40)

random.seed(42)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'command-a': 1220.1527290800668,
 'gemini-2.0-flash-exp': 1161.6044499773263,
 'deepseek-v3-chat': 1150.560015291643,
 'gemma-3-27b': 1141.4328148299785,
 'gemini-2.0-flash-001': 1139.5359177795244,
 'deepseek-v3-0324': 1119.6334478686797,
 'grok-3-mini-beta': 1119.5307837389546,
 'llama-4-scout': 1091.5751639816478,
 'llama-3.1-nemotron-70b-instruct': 1091.0795137063953,
 'claude-3-7-sonnet': 1088.7675701005921,
 'mistral-small-3.1-24b': 1070.6355146154658,
 'gpt-4.1-mini': 1050.3170643862245,
 'mistral-saba': 1041.3557054625712,
 'gemini-1.5-pro-001': 1041.2486761693076,
 'mistral-small-24b-instruct-2501': 1040.8327831881509,
 'llama-3.3-70b': 1039.0387741753636,
 'gemma-3-12b': 1031.760023662148,
 'gemma-3-4b': 1023.6936864525285,
 'deepseek-r1-distill-llama-70b': 1023.4496226956928,
 'qwq-32b': 1020.8316195161804,
 'gemini-1.5-pro-002': 1014.5163572166467,
 'phi-4': 1013.0522119901723,
 'o4-mini': 1003.6818359030656,
 'o3-mini': 997.813962646233,
 'gemma-2-9b-it': 993.950823037820

In [None]:
ranker_shuffle = ELORanker(K=40)

random.seed(1337)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'deepseek-v3-chat': 1179.3730081196136,
 'gemma-3-27b': 1150.8317142104368,
 'gemini-2.0-flash-001': 1143.7168434136654,
 'grok-3-mini-beta': 1142.6570863406994,
 'mistral-large-2411': 1142.1452498662647,
 'llama-3.1-nemotron-70b-instruct': 1131.4665117329307,
 'deepseek-r1': 1130.811884553102,
 'gemma-3-12b': 1112.5530413171457,
 'gemma-2-9b-it': 1096.048355031984,
 'deepseek-v3-0324': 1094.993840729807,
 'gemini-2.0-flash-exp': 1081.5379183680243,
 'gemini-1.5-pro-002': 1075.4348182301687,
 'command-a': 1072.7119341969287,
 'llama-4-scout': 1067.852953995856,
 'mistral-small-3.1-24b': 1063.5642262327028,
 'claude-3-5-sonnet-v2': 1053.0858633064097,
 'o3-mini': 1037.6401080954292,
 'o4-mini': 1034.6809769907807,
 'qwen2.5-coder-32b-instruct': 1031.3418629004047,
 'gemma-3-4b': 1027.9349699340958,
 'mistral-saba': 1015.2707776128149,
 'mistral-small-24b-instruct-2501': 1014.3792660281887,
 'aya-expanse-8b': 1013.7532922323962,
 'llama-3.3-70b': 1010.9963606887425,
 'claude-3-7-sonnet'
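
One way to quantify how much the match order matters is to summarise each model's score across runs with a mean and a standard deviation. The sketch below assumes a mapping from seed to a `{model: score}` dictionary; the function name `summarize_runs` is hypothetical, and the toy values are the scores printed above for seeds 42 and 1337, rounded to two decimals.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """runs: dict seed -> dict model -> score. Returns model -> (mean, std)."""
    models = set()
    for scores in runs.values():
        models.update(scores)
    summary = {}
    for model in models:
        values = [scores[model] for scores in runs.values()]
        summary[model] = (mean(values), stdev(values))
    return summary


runs = {
    42: {"command-a": 1220.15, "gemma-3-27b": 1141.43},
    1337: {"command-a": 1072.71, "gemma-3-27b": 1150.83},
}
summary = summarize_runs(runs)
for model, (avg, spread) in sorted(summary.items()):
    print(f"{model}: mean={avg:.1f} std={spread:.1f}")
```

With only two runs the standard deviation is a crude indicator, but it already shows that `command-a` moves far more between these two orderings than `gemma-3-27b`.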

Here we compute the scores with an alternative `Ranker`.

In [None]:
from rank_comparia.maximum_likelihood import MaximumLikelihoodRanker

ranker = MaximumLikelihoodRanker()
ranker.compute_scores(matches=matches)
ranker.get_scores()

{'gemini-2.0-flash-exp': np.float64(1143.1512358130012),
 'gemma-3-27b': np.float64(1136.3184203836174),
 'gemini-2.0-flash-001': np.float64(1133.400570032988),
 'deepseek-v3-chat': np.float64(1114.0328789363393),
 'deepseek-v3-0324': np.float64(1113.2067915235539),
 'claude-3-7-sonnet': np.float64(1105.1612084450092),
 'command-a': np.float64(1101.8786858020478),
 'gpt-4.1-mini': np.float64(1093.6685242271394),
 'gemma-3-12b': np.float64(1081.4653106195133),
 'llama-3.1-nemotron-70b-instruct': np.float64(1078.196872612909),
 'grok-3-mini-beta': np.float64(1071.2724072426324),
 'deepseek-r1': np.float64(1064.2931624791943),
 'gemma-3-4b': np.float64(1055.3168065207053),
 'gemini-1.5-pro-002': np.float64(1047.884270532969),
 'llama-4-scout': np.float64(1038.2019664020102),
 'gemini-1.5-pro-001': np.float64(1037.4923002507746),
 'mistral-small-3.1-24b': np.float64(1030.0445144875346),
 'mistral-large-2411': np.float64(1028.6925455484807),
 'o3-mini': np.float64(1006.2210001085826),
 'cla
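
The notebook does not show how `MaximumLikelihoodRanker` fits its scores. A common maximum-likelihood approach for pairwise comparisons is the Bradley-Terry model, sketched below with a simple minorization-maximization loop. Everything here is an assumption for illustration: the function names, the treatment of draws as half a win, and the mapping to an Elo-like scale (`1000 + 400 * log10(strength)`). The sketch also requires every model to have at least one win and one loss.

```python
import math
from collections import defaultdict


def bradley_terry(matches, iterations: int = 200):
    """matches: list of (model_a, model_b, result_a), result_a in {1.0, 0.5, 0.0}."""
    wins = defaultdict(float)   # total (possibly fractional) wins per model
    games = defaultdict(float)  # number of games per unordered pair of models
    for model_a, model_b, result_a in matches:
        wins[model_a] += result_a
        wins[model_b] += 1.0 - result_a
        games[frozenset((model_a, model_b))] += 1.0
    strength = {model: 1.0 for model in wins}
    for _ in range(iterations):
        updated = {}
        for model in strength:
            denom = 0.0
            for pair, count in games.items():
                if model in pair:
                    (other,) = pair - {model}
                    denom += count / (strength[model] + strength[other])
            updated[model] = wins[model] / denom
        # Fix the scale by normalising the geometric mean of strengths to 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength


def to_elo_scale(strength, base: float = 1000.0, scale: float = 400.0):
    """Map Bradley-Terry strengths to an Elo-like scale (assumed convention)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


toy = (
    [("x", "y", 1.0)] * 3 + [("x", "y", 0.0)]
    + [("y", "z", 1.0)] * 2 + [("y", "z", 0.0)]
    + [("x", "z", 1.0)] * 2 + [("x", "z", 0.0)]
)
ratings = to_elo_scale(bradley_terry(toy))
print(sorted(ratings, key=ratings.get, reverse=True))
```

Unlike the sequential Elo update, a fit of this kind uses all matches at once, so the result does not depend on the order in which matches are listed.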

## Bootstrap

The `Ranker` classes have a `compute_bootstrap_scores` method that computes scores and bootstrap confidence intervals (the matches used to compute the scores for each bootstrap sample are drawn by resampling, with replacement, from the initial sample of matches).
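
The resampling idea can be sketched independently of any particular ranker. The snippet below draws samples with replacement, applies a scoring function to each, and reports the median with 2.5th and 97.5th percentiles. It is a generic percentile bootstrap under stated assumptions, not the body of `compute_bootstrap_scores`; `bootstrap_scores`, `percentile` and the toy `win_share` scorer are hypothetical names.

```python
import random
from statistics import median


def percentile(sorted_values, q: float) -> float:
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)


def bootstrap_scores(matches, score_fn, n_samples: int = 100, seed: int = 0):
    """Return model -> (median, p2.5, p97.5) over bootstrap resamples of matches."""
    rng = random.Random(seed)
    per_model = {}
    for _ in range(n_samples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(matches, k=len(matches))
        for model, score in score_fn(resample).items():
            per_model.setdefault(model, []).append(score)
    summary = {}
    for model, values in per_model.items():
        values.sort()
        summary[model] = (median(values), percentile(values, 2.5), percentile(values, 97.5))
    return summary


def win_share(sample):
    """Toy scorer: fraction of matches in the sample won by each model."""
    names = {name for a, b, _ in sample for name in (a, b)}
    return {name: sum(w == name for _, _, w in sample) / len(sample) for name in names}


toy = [("x", "y", "x")] * 8 + [("x", "y", "y")] * 2
summary = bootstrap_scores(toy, win_share)
print(summary["x"])
```

Any function that maps a list of matches to a `{model: score}` dictionary could be passed as `score_fn`, which is how the same procedure can wrap either an Elo-style or a maximum-likelihood scorer.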

In [None]:
ranker = ELORanker(K=40)

ranker.add_players(model_names)  # type: ignore
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:05<00:00, 17.55it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemma-3-27b""",1152.211148,1040.594581,1248.836173
"""gemini-2.0-flash-exp""",1145.397963,1062.294185,1245.563045
"""deepseek-v3-0324""",1134.799235,1013.055123,1228.182057
"""gemini-2.0-flash-001""",1133.382779,1038.849998,1255.32273
"""deepseek-v3-chat""",1114.914604,1013.009749,1228.08963
…,…,…,…
"""phi-3.5-mini-instruct""",885.697997,761.038555,989.372198
"""mixtral-8x7b-instruct-v0.1""",866.404306,755.314356,1002.638954
"""mixtral-8x22b-instruct-v0.1""",853.719716,758.372644,956.329336
"""mistral-nemo-2407""",832.302051,743.147907,930.64927


In [None]:
ranker = MaximumLikelihoodRanker()
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:12<00:00,  7.92it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1143.325771,1125.07339,1169.492016
"""gemma-3-27b""",1136.857481,1103.11539,1167.911554
"""gemini-2.0-flash-001""",1134.308685,1109.516369,1153.092221
"""deepseek-v3-chat""",1114.655428,1099.492036,1125.729753
"""deepseek-v3-0324""",1111.016549,1071.932718,1150.501659
…,…,…,…
"""phi-3.5-mini-instruct""",890.765449,859.732501,920.225684
"""mixtral-8x7b-instruct-v0.1""",884.938064,865.35744,909.701876
"""mixtral-8x22b-instruct-v0.1""",860.519985,845.704374,873.367552
"""mistral-nemo-2407""",854.598367,836.824396,871.170898


# Adding frugality to the score

In [None]:
from rank_comparia.frugality import draw_ranked_frugality, get_normalized_log_cost, calculate_frugality_score

conversations = load_comparia("ministere-culture/comparia-conversations")
frugality_score = calculate_frugality_score(conversations, None, True)
graph = draw_ranked_frugality(frugal_log_score=get_normalized_log_cost(frugality_score), bootstraped_scores=scores)

NameError: name 'load_comparia' is not defined

In [None]:
graph

NameError: name 'graph' is not defined