# Computing scores from the `comparia` dataset

In this notebook we illustrate how to use the `Ranker` classes to compute scores from the `comparia` data.

## Loading the data

In [1]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [2]:
from rank_comparia.utils import load_comparia

reactions = load_comparia("ministere-culture/comparia-reactions")

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/92a324c10228176065909b52bbbaa16430e64c5a (last modified on Wed Jun  4 17:40:33 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


## Formatting the data

Here we use *legacy* functions with a simple heuristic to determine the outcome of a *match* (a pair of conversations) from the associated reactions. For each model, we subtract the number of negative reactions from the number of positive reactions. The model with the higher difference wins the match. If both models have the same difference, the *match* is a draw (draws are filtered out in the `get_winners` function).
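The heuristic can be sketched in a few lines of plain Python. This is only an illustration of the rule described above, not the code of `get_matches_with_score` or `get_winners`; the helpers `net_score` and `match_outcome` are hypothetical names introduced for this example.

```python
def net_score(n_positive: int, n_negative: int) -> int:
    """Net reaction score of one model in a match (hypothetical helper)."""
    return n_positive - n_negative


def match_outcome(score_a: int, score_b: int) -> str:
    """Return "a", "b" or "draw" depending on which net score is higher."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "draw"


# A model with 3 positive and 1 negative reactions has a net score of 2.
print(net_score(3, 1))
# Net scores 2 and -1: model A wins.
print(match_outcome(2, -1))
# Net scores -2 and -2: the match is a draw and would be filtered out.
print(match_outcome(-2, -2))
```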

In [3]:
from rank_comparia.data_transformation import get_matches_with_score, get_winners, get_winrates

matches = get_matches_with_score(reactions)

matches.head(5)

model_a_name,model_b_name,conversation_pair_id,score_a,score_b
str,str,str,i64,i64
"""gemini-1.5-pro-001""","""ministral-8b-instruct-2410""","""4450f89de7904d76af3ba953240c66…",0,1
"""ministral-8b-instruct-2410""","""llama-3.1-nemotron-70b-instruc…","""f8e3b39c027c489ebcb75864416cd9…",2,-1
"""claude-3-5-sonnet-v2""","""mistral-nemo-2407""","""9f06161e5b0a42d9bba227ed7766f8…",-2,-2
"""mistral-large-2411""","""gemini-1.5-pro-002""","""f27ef52c27b64b6e93c1cd42e2fbb4…",1,1
"""qwen2.5-coder-32b-instruct""","""lfm-40b""","""4a4f535270bb435fb9c6b5b4d5cfb9…",3,2


In [4]:
winners = get_winners(matches)

We compute win rates per model.
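As a rough sketch of what a win rate means here, the snippet below counts matches played and matches won per model and reports the ratio as a percentage (100 × wins / matches played). It assumes draws have already been removed; the function `win_rates` and its input format are hypothetical and are not the `get_winrates` implementation.

```python
from collections import Counter


def win_rates(results: list[tuple[str, str, str]]) -> dict[str, float]:
    """Compute win rates in percent from (model_a, model_b, winner) triples."""
    played: Counter[str] = Counter()
    won: Counter[str] = Counter()
    for model_a, model_b, winner in results:
        played[model_a] += 1
        played[model_b] += 1
        won[winner] += 1
    # Counter returns 0 for models that never won.
    return {model: 100 * won[model] / played[model] for model in played}


toy_results = [("x", "y", "x"), ("x", "y", "y"), ("x", "z", "x")]
print(win_rates(toy_results))
```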

In [5]:
winrates = get_winrates(winners)
winrates.sort("winrate", descending=True)

model_name,len,wins,winrate
str,u32,u32,f64
"""gemini-2.0-flash-exp""",856,647,75.584112
"""deepseek-v3-chat""",1530,1080,70.588235
"""gemma-3-27b""",495,348,70.30303
"""gemini-2.0-flash-001""",688,480,69.767442
"""gemini-1.5-pro-001""",328,222,67.682927
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",585,222,37.948718
"""lfm-40b""",887,321,36.189402
"""mixtral-8x22b-instruct-v0.1""",1459,445,30.500343
"""mistral-nemo-2407""",1461,435,29.774127


## Computing the scores

For each *match* we compute a score, shuffle the matches, and add them one by one to an `ELORanker`, which updates the scores each time a match is added.
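To see why the order of the matches matters, here is a minimal sketch of a sequential Elo update. It uses the standard Elo formula with a 400-point scale and an initial rating of 1000, both of which are assumptions; the internals of `ELORanker` may differ. The helpers `expected_score`, `elo_update` and `replay` are hypothetical names.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, points_a: float, k: float = 40.0):
    """points_a is 1.0 if A wins, 0.5 for a draw, 0.0 if B wins."""
    delta = k * (points_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def replay(results, initial: float = 1000.0) -> dict[str, float]:
    """Apply (model_a, model_b, points_a) results in the given order."""
    ratings: dict[str, float] = {}
    for a, b, points_a in results:
        r_a, r_b = ratings.get(a, initial), ratings.get(b, initial)
        ratings[a], ratings[b] = elo_update(r_a, r_b, points_a)
    return ratings


results = [("x", "y", 1.0), ("x", "y", 0.0)]  # x wins once, then y wins once
forward = replay(results)
backward = replay(results[::-1])
# Same matches, different order, different final ratings.
print(forward)
print(backward)
```

In this toy example the most recent winner ends up slightly above 1000, which is why averaging over several shuffles gives a more stable picture than a single pass.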

In [6]:
from rank_comparia.elo import ELORanker
from rank_comparia.ranker import Match, MatchScore
import random


def compute_match_score(score_a: int, score_b: int) -> MatchScore:
    """Map the two net reaction scores of a match to a `MatchScore` outcome."""
    final_score = score_b - score_a
    if final_score > 0:
        return MatchScore.B
    elif final_score < 0:
        return MatchScore.A
    else:
        return MatchScore.Draw


def get_shuffled_results(matches: list[Match], model_names: list[str], seed: int = 0):
    """Compute Elo scores after shuffling the matches with the given seed."""
    random.seed(seed)
    ranker_shuffle = ELORanker(K=40)
    # random.sample with k=len(matches) returns a shuffled copy of the list.
    matches_shuffle = random.sample(matches, k=len(matches))
    ranker_shuffle.add_players(model_names)
    ranker_shuffle.compute_scores(matches=matches_shuffle)
    return ranker_shuffle.players

100 sets of scores are computed, each with a different match insertion order.

In [7]:
model_names = set(matches["model_a_name"].unique()) | set(matches["model_b_name"].unique())
matches = [
    Match(
        match_dict["model_a_name"],
        match_dict["model_b_name"],
        compute_match_score(match_dict["score_a"], match_dict["score_b"]),
    )
    for match_dict in matches.to_dicts()
]

player_results = {
    seed: get_shuffled_results(matches=matches, model_names=model_names, seed=seed) for seed in range(100)  # type: ignore
}

The mean scores over these runs are then computed:

In [8]:
players_avg_ranking = {
    player_name: sum(results[player_name] for results in player_results.values()) / len(player_results)
    for player_name in model_names
}

In [9]:
for player, ranking in sorted(players_avg_ranking.items(), key=lambda x: -x[1]):
    print(f"{player} : {ranking}")

gemini-2.0-flash-exp : 1152.8366412019461
gemma-3-27b : 1144.7063170304061
gemini-2.0-flash-001 : 1136.3821727630689
deepseek-v3-0324 : 1121.7816151994991
deepseek-v3-chat : 1117.4537219092451
command-a : 1112.2775244821546
claude-3-7-sonnet : 1109.096220513812
gpt-4.1-mini : 1097.665189885303
gemma-3-12b : 1089.1013268150045
llama-3.1-nemotron-70b-instruct : 1080.5783685002398
deepseek-r1 : 1071.1763386118964
grok-3-mini-beta : 1068.2694511090176
gemma-3-4b : 1059.3082770360838
gemini-1.5-pro-002 : 1048.3084756377436
gemini-1.5-pro-001 : 1044.9355084180904
llama-4-scout : 1042.4478038824443
mistral-large-2411 : 1033.376497764221
mistral-small-3.1-24b : 1031.5392038354798
claude-3-5-sonnet-v2 : 1010.4206329041114
o3-mini : 1009.1088800159082
o4-mini : 1001.9211577791461
gpt-4o-2024-08-06 : 994.4100249379924
jamba-1.5-large : 992.4584658748473
gpt-4o-mini-2024-07-18 : 990.7846827251981
llama-3.1-405b : 990.3972941832486
mistral-saba : 987.137596178378
llama-3.3-70b : 985.6541098958028
g

Two score computations with different match orders:

In [10]:
ranker_shuffle = ELORanker(K=40)

random.seed(42)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'deepseek-v3-0324': 1236.3719091756216,
 'gemini-2.0-flash-001': 1213.4842721586701,
 'gemini-2.0-flash-exp': 1203.5753442579917,
 'gemma-3-27b': 1198.4349094171573,
 'claude-3-7-sonnet': 1152.7797633880423,
 'deepseek-v3-chat': 1143.1466198974388,
 'llama-3.1-nemotron-70b-instruct': 1136.792135210341,
 'gemini-1.5-pro-002': 1117.1623088312801,
 'gemma-2-27b-it-q8': 1106.2182669601466,
 'gpt-4.1-mini': 1085.0620689453488,
 'command-a': 1083.9635942434466,
 'llama-4-scout': 1080.2956506184617,
 'deepseek-r1': 1063.4064758803981,
 'grok-3-mini-beta': 1058.007952951965,
 'claude-3-5-sonnet-v2': 1051.1853144913307,
 'gemini-1.5-pro-001': 1048.5352869605028,
 'o4-mini': 1046.2094059365756,
 'gemma-3-12b': 1043.0737942198189,
 'mistral-large-2411': 1029.1756086376288,
 'mistral-small-3.1-24b': 1028.2980367179614,
 'o3-mini': 1023.2584618001066,
 'gemma-2-9b-it': 1006.6588356523501,
 'gpt-4o-mini-2024-07-18': 1004.0451285027407,
 'llama-3.1-70b': 1002.0562626858222,
 'llama-3.1-405b': 995.03

In [11]:
ranker_shuffle = ELORanker(K=40)

random.seed(1337)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'gemini-2.0-flash-exp': 1221.5728805040792,
 'gemini-2.0-flash-001': 1150.9215453400564,
 'claude-3-7-sonnet': 1134.422645443624,
 'gemma-3-12b': 1115.3388508066523,
 'llama-3.1-nemotron-70b-instruct': 1114.83223717991,
 'gemma-3-27b': 1110.5979760900896,
 'deepseek-r1': 1105.520595090032,
 'grok-3-mini-beta': 1104.974653247346,
 'deepseek-v3-0324': 1088.0302373366842,
 'mistral-small-3.1-24b': 1064.320830945207,
 'deepseek-v3-chat': 1057.7471390694056,
 'gemini-1.5-pro-001': 1053.0681126101822,
 'mistral-small-24b-instruct-2501': 1050.7599525466683,
 'gpt-4.1-mini': 1049.8798396426466,
 'llama-4-scout': 1047.041144963684,
 'llama-3.1-70b': 1045.0424891233636,
 'gemma-3-4b': 1037.3034731891182,
 'qwq-32b': 1035.0909878284276,
 'ministral-8b-instruct-2410': 1031.7153063316618,
 'o3-mini': 1025.3558987188956,
 'llama-3.1-405b': 1001.2216524746405,
 'gpt-4o-2024-08-06': 1000.0718565058323,
 'jamba-1.5-large': 996.8618735655315,
 'mistral-saba': 995.4473493238643,
 'gemini-1.5-pro-002': 9

Here we compute the scores with an alternative `Ranker`.
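This notebook does not show how `MaximumLikelihoodRanker` works internally. A common maximum-likelihood approach for pairwise comparisons is the Bradley-Terry model, and the sketch below assumes that model; it may differ from what the class actually does. The function `bradley_terry`, the 1000-point offset and the 400-point scale are all assumptions made for illustration. Unlike sequential Elo, this kind of fit uses all matches at once, so the result does not depend on their order.

```python
import math
from collections import defaultdict


def bradley_terry(results, n_iter: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths with the classic iterative (MM) update.

    results: (model_a, model_b, points_a) with points_a in {1.0, 0.5, 0.0}.
    Assumes every model scores at least some points, otherwise its
    strength tends to zero and the logarithm below is undefined.
    """
    points = defaultdict(float)  # total points per model
    games = defaultdict(float)   # number of games per unordered pair
    for a, b, points_a in results:
        points[a] += points_a
        points[b] += 1.0 - points_a
        games[frozenset((a, b))] += 1.0
    strength = {m: 1.0 for m in points}
    for _ in range(n_iter):
        new = {}
        for m in strength:
            denom = sum(
                n / (strength[m] + strength[other])
                for pair, n in games.items() if m in pair
                for other in pair - {m}
            )
            new[m] = points[m] / denom
        # Strengths are defined up to a factor: fix their geometric mean to 1.
        log_mean = sum(math.log(s) for s in new.values()) / len(new)
        strength = {m: s / math.exp(log_mean) for m, s in new.items()}
    # Convert to an Elo-like scale (assumed offset and scale).
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}


# x beats y three times out of four, y beats z three times out of four.
toy = [("x", "y", 1.0)] * 3 + [("x", "y", 0.0)] + [("y", "z", 1.0)] * 3 + [("y", "z", 0.0)]
scores = bradley_terry(toy)
print(scores)  # roughly x: 1190.8, y: 1000.0, z: 809.2
```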

In [12]:
from rank_comparia.maximum_likelihood import MaximumLikelihoodRanker

ranker = MaximumLikelihoodRanker()
ranker.compute_scores(matches=matches)
ranker.get_scores()

{'gemini-2.0-flash-exp': np.float64(1143.1512358130012),
 'gemma-3-27b': np.float64(1136.3184203836174),
 'gemini-2.0-flash-001': np.float64(1133.400570032988),
 'deepseek-v3-chat': np.float64(1114.0328789363393),
 'deepseek-v3-0324': np.float64(1113.2067915235539),
 'claude-3-7-sonnet': np.float64(1105.1612084450092),
 'command-a': np.float64(1101.8786858020478),
 'gpt-4.1-mini': np.float64(1093.6685242271394),
 'gemma-3-12b': np.float64(1081.4653106195133),
 'llama-3.1-nemotron-70b-instruct': np.float64(1078.196872612909),
 'grok-3-mini-beta': np.float64(1071.2724072426324),
 'deepseek-r1': np.float64(1064.2931624791943),
 'gemma-3-4b': np.float64(1055.3168065207053),
 'gemini-1.5-pro-002': np.float64(1047.884270532969),
 'llama-4-scout': np.float64(1038.2019664020102),
 'gemini-1.5-pro-001': np.float64(1037.4923002507746),
 'mistral-small-3.1-24b': np.float64(1030.0445144875346),
 'mistral-large-2411': np.float64(1028.6925455484807),
 'o3-mini': np.float64(1006.2210001085826),
 'cla

## Bootstrap

The `Ranker` classes have a `compute_bootstrap_scores` method that computes bootstrap scores and confidence intervals (the matches used to compute the scores for each bootstrap sample are drawn by resampling, with replacement, from the initial sample of matches).
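The resampling idea can be sketched independently of any ranker. The snippet below is a generic percentile bootstrap, not the implementation of `compute_bootstrap_scores`: it resamples a list with replacement, recomputes a statistic on each resample, and reports the median with the 2.5th and 97.5th percentiles. The helpers `percentile` and `bootstrap_interval` are hypothetical names, and the toy statistic is a plain win rate rather than a ranker score.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of a sorted sequence."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def bootstrap_interval(
    sample: Sequence,
    statistic: Callable[[Sequence], float],
    n_bootstrap: int = 100,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (median, p2.5, p97.5) of the statistic over bootstrap resamples."""
    rng = random.Random(seed)
    estimates = sorted(
        # rng.choices draws with replacement, keeping the original sample size.
        statistic(rng.choices(sample, k=len(sample)))
        for _ in range(n_bootstrap)
    )
    return percentile(estimates, 50), percentile(estimates, 2.5), percentile(estimates, 97.5)


# Toy data: 100 match outcomes for one model, 1 = win and 0 = loss.
outcomes = [1] * 60 + [0] * 40
median, low, high = bootstrap_interval(outcomes, mean)
print(median, low, high)
```

With a ranker, the statistic would be the score of each model computed from the resampled matches, giving one interval per model.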

In [13]:
ranker = ELORanker(K=40)

ranker.add_players(model_names)  # type: ignore
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:04<00:00, 23.79it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1155.859289,1044.442026,1274.769562
"""gemini-2.0-flash-001""",1142.655273,1029.583229,1239.132195
"""gemma-3-27b""",1140.367437,1017.22942,1238.438644
"""deepseek-v3-0324""",1114.776689,1008.863562,1235.314884
"""deepseek-v3-chat""",1114.200893,1006.377414,1203.025085
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",886.571652,756.061861,1018.108256
"""phi-3.5-mini-instruct""",873.88268,781.99617,985.558817
"""mixtral-8x22b-instruct-v0.1""",858.155122,770.765492,993.206626
"""mistral-nemo-2407""",857.208879,751.981659,965.433334


In [14]:
ranker = MaximumLikelihoodRanker()
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:06<00:00, 15.67it/s]


model,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1144.325284,1119.460207,1162.371241
"""gemma-3-27b""",1137.418884,1100.19138,1163.874126
"""gemini-2.0-flash-001""",1135.156265,1113.097957,1157.500833
"""deepseek-v3-0324""",1118.98235,1080.592383,1158.408431
"""deepseek-v3-chat""",1113.222689,1097.716567,1129.206429
…,…,…,…
"""phi-3.5-mini-instruct""",885.512672,852.759295,916.903774
"""mixtral-8x7b-instruct-v0.1""",883.558106,853.418372,913.701314
"""mixtral-8x22b-instruct-v0.1""",859.604307,842.88358,875.133647
"""mistral-nemo-2407""",854.648057,840.199542,870.119204
