# `RankingPipeline`

This notebook tests the complete pipeline, whose parameters control how the scores are computed.

## Computing scores

The `run` method launches the score computation.

In [1]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [2]:
from pathlib import Path
from rank_comparia.pipeline import RankingPipeline

### `RankingPipeline` parameters

- `method`: ranking method to use: `elo_random`, `elo_ordered` or `ml`
- `include_votes`: whether to use the votes data
- `include_reactions`: whether to use the reactions data
- `bootstrap_samples`: number of samples used to compute the *bootstrap* version of the scores
- `batch`: whether matches are processed in batches
- `export_path`: path to the folder in which graphs are exported
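
As a rough illustration of what an Elo-style `method` does, here is a minimal sketch of a single rating update. It is not the `rank_comparia` implementation: the K-factor of 32, the starting rating of 1000 and the helper names are assumptions made for this example. The only detail taken from the notebook is the score encoding visible in the `Match` output (2 when model A wins, 1 for a draw, 0 when model B wins).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score: int, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one match.

    `score` uses the 2 / 1 / 0 encoding (A wins / draw / B wins).
    `k` is an assumed K-factor, not a value taken from rank_comparia.
    """
    outcome_a = score / 2.0  # map 2 / 1 / 0 to 1.0 / 0.5 / 0.0
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models, both starting from an assumed rating of 1000.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = elo_update(
    ratings["model_x"], ratings["model_y"], score=2
)
print(ratings)
```

Because each update depends on the current ratings, the final Elo scores depend on the order in which matches are processed. That is presumably what distinguishes `elo_random` from `elo_ordered`, although the notebook does not say so explicitly.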

In [3]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=True,
    include_reactions=True,
    include_frugality=False,
    bootstrap_samples=5,
    batch=False,
    export_path=Path("output"),
)

Using the latest cached version of the dataset since ministere-culture/comparia-votes couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-votes/default/0.0.0/679f56e14f413546403b3468d717c4417e394326 (last modified on Mon Jul 28 10:06:44 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Fou

Final votes dataset contains 55617 conversations pairs.
Reactions data originally contains 22993 conversations pairs.
Final reactions dataset contains 21244 conversations pairs.


In [4]:
pipeline.matches

| conversation_pair_id (str) | model_a (str) | model_b (str) | score (i32) | categories (list[str]) | model_a_active_params (f64) | model_b_active_params (f64) | total_conv_a_output_tokens (f64) | total_conv_a_kwh (f64) | total_conv_b_output_tokens (f64) | total_conv_b_kwh (f64) |
|---|---|---|---|---|---|---|---|---|---|---|
| "3fa1582ecd2d46f183d2b736b50a1b…" | "chocolatine-2-14b-instruct-v2.…" | "deepseek-r1" | 0 | ["Environment", "Natural Science & Formal Science & Technology"] | 14.0 | 37.0 | 202.0 | 0.000733 | 897.0 | 0.042182 |
| "d5e982f8f05a4428b67cf4eb00307f…" | "phi-3.5-mini-instruct" | "qwen2.5-coder-32b-instruct" | 0 | ["Politics & Government", "Society & Social Issues & Human Rights"] | 14.0 | 37.0 | 202.0 | 0.000733 | 897.0 | 0.042182 |
| "137686c63dcb4b0e9314d155396dd3…" | "phi-4" | "claude-3-5-sonnet-v2" | 2 | ["Education", "Culture & Cultural geography"] | 14.0 | 37.0 | 202.0 | 0.000733 | 897.0 | 0.042182 |
| "7608dcdd04184bf9b4f4e0bd8d51a4…" | "qwen2-7b-instruct" | "llama-3.1-405b" | 0 | ["Entertainment & Travel & Hobby", "Culture & Cultural geography", "Arts"] | 14.0 | 37.0 | 202.0 | 0.000733 | 897.0 | 0.042182 |
| "ffeb3a84f6e64609a650d517fc6bae…" | "mixtral-8x22b-instruct-v0.1" | "deepseek-v3-chat" | 2 | ["Education"] | 14.0 | 37.0 | 202.0 | 0.000733 | 897.0 | 0.042182 |
| … | … | … | … | … | … | … | … | … | … | … |
| "732bd4d0c93d4de4927c6b8614d496…" | "deepseek-r1" | "mistral-large-2411" | 2 | ["Health & Wellness & Medicine"] | 37.0 | 123.0 | 583.0 | 0.027416 | 436.0 | 0.008679 |
| "21e97286051d4e378c2681dcbc2ebc…" | "mistral-small-24b-instruct-250…" | "claude-3-5-sonnet-v2" | 0 | ["Business & Economics & Finance", "Politics & Government", "Environment"] | 24.0 | 300.0 | 305.0 | 0.001834 | 336.0 | 0.045104 |
| "08c50a876efa4c8988d8206c4d35db…" | "gpt-4o-mini-2024-07-18" | "qwen2.5-coder-32b-instruct" | 2 | ["Arts", "Education", "Entertainment & Travel & Hobby"] | 35.0 | 32.0 | 721.0 | 0.005449 | 1291.0 | 0.009212 |
| "9736c3b65f584023b85cc872b3d0b4…" | "mistral-nemo-2407" | "llama-3.1-405b" | 1 | ["Education", "Natural Science & Formal Science & Technology"] | 12.0 | 405.0 | 674.0 | 0.002918 | 1000.0 | 0.237926 |


In [5]:
pipeline.match_list()

[Match(model_a='chocolatine-2-14b-instruct-v2.0.3-q8', model_b='deepseek-r1', score=<MatchScore.B: 0>, id='3fa1582ecd2d46f183d2b736b50a1b10-1ddc7fd95ddc44289d59f059531be62f'),
 Match(model_a='phi-3.5-mini-instruct', model_b='qwen2.5-coder-32b-instruct', score=<MatchScore.B: 0>, id='d5e982f8f05a4428b67cf4eb00307fb7-2aecfd877c854234a381e3b6b6e10175'),
 Match(model_a='phi-4', model_b='claude-3-5-sonnet-v2', score=<MatchScore.A: 2>, id='137686c63dcb4b0e9314d155396dd356-0324fdbb7f3d4b9ca99151584efcac6b'),
 Match(model_a='qwen2-7b-instruct', model_b='llama-3.1-405b', score=<MatchScore.B: 0>, id='7608dcdd04184bf9b4f4e0bd8d51a4bb-7b859100a54647029aeb879eb56d0b2c'),
 Match(model_a='mixtral-8x22b-instruct-v0.1', model_b='deepseek-v3-chat', score=<MatchScore.A: 2>, id='ffeb3a84f6e64609a650d517fc6baedd-fecc240ef69e4400ab6c7ea331c8870d'),
 Match(model_a='deepseek-v3-0324', model_b='gpt-4.1-mini', score=<MatchScore.Draw: 1>, id='976907e715c94f4996e5f2db8a867aab-8e7fed19bb7244bca3cde55bc967ae38'),
 M

In [6]:
scores = pipeline.run()

Computing bootstrap scores from a sample of 76861 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00,  5.88it/s]


In [7]:
scores

| model_name (str) | median (f64) | p2.5 (f64) | p97.5 (f64) | total_output_tokens (f64) | conso_all_conv (f64) | mean_conso_per_token (f64) |
|---|---|---|---|---|---|---|
| "Yi-1.5-9B-Chat" | 834.027391 | 758.216195 | 877.782599 | 36760.0 | 1.456907 | 0.00004 |
| "aya-expanse-8b" | 1011.672168 | 945.47975 | 1062.760477 | 794676.0 | 19.628913 | 0.000025 |
| "c4ai-command-r-08-2024" | 947.415616 | 859.70613 | 999.115581 | 2.141281e6 | 60.586985 | 0.000028 |
| "chocolatine-14b-instruct-dpo-v…" | 761.748259 | 744.996467 | 810.678337 | 168753.0 | 6.56817 | 0.000039 |
| "chocolatine-2-14b-instruct-v2.…" | 840.965973 | 780.82909 | 853.552286 | 833788.0 | 26.342577 | 0.000032 |
| … | … | … | … | … | … | … |
| "qwen2-7b-instruct" | 791.761748 | 659.659345 | 811.405281 | 45350.0 | 1.799493 | 0.00004 |
| "qwen2.5-32b-instruct" | 1065.27488 | 951.207666 | 1128.765847 | 75944.0 | 2.922606 | 0.000038 |
| "qwen2.5-7b-instruct" | 926.277077 | 905.090601 | 1096.955142 | 867508.0 | 22.533149 | 0.000026 |
| "qwen2.5-coder-32b-instruct" | 964.898127 | 910.922079 | 981.977596 | 3.06755e6 | 82.379908 | 0.000027 |

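The `median`, `p2.5` and `p97.5` columns are consistent with a percentile bootstrap: the matches are resampled with replacement, scores are recomputed on each resample, and the spread of the results is summarised per model. The sketch below illustrates that idea in plain Python. The function names, the toy `win_rate` scorer and the toy matches are hypothetical and are not part of `rank_comparia`.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list (q in [0, 100])."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def bootstrap_summary(matches, score_fn, n_samples, seed=0):
    """Resample `matches` with replacement and summarise each model's scores."""
    rng = random.Random(seed)
    per_model = {}
    for _ in range(n_samples):
        resample = rng.choices(matches, k=len(matches))
        for model, value in score_fn(resample).items():
            per_model.setdefault(model, []).append(value)
    summary = {}
    for model, values in per_model.items():
        values.sort()
        summary[model] = {
            "median": statistics.median(values),
            "p2.5": percentile(values, 2.5),
            "p97.5": percentile(values, 97.5),
        }
    return summary


def win_rate(matches):
    """Toy scorer (hypothetical): share of points won by each model."""
    points, games = {}, {}
    for model_a, model_b, score in matches:
        for model, earned in ((model_a, score / 2.0), (model_b, 1.0 - score / 2.0)):
            points[model] = points.get(model, 0.0) + earned
            games[model] = games.get(model, 0) + 1
    return {model: points[model] / games[model] for model in points}


# Toy matches as (model_a, model_b, score) with the 2 / 1 / 0 encoding.
toy_matches = [("m1", "m2", 2), ("m1", "m2", 0), ("m2", "m3", 1), ("m1", "m3", 2)]
summary = bootstrap_summary(toy_matches, win_rate, n_samples=200)
print(summary["m1"])
```

With only a handful of resamples, such as `bootstrap_samples=5` in this notebook, the 2.5th and 97.5th percentiles sit very close to the minimum and maximum of the resampled scores, so the interval is only a rough indication.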

### Another computation method

Here we use only the votes data.

In [None]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=True,
    include_reactions=False,
    bootstrap_samples=5,
    batch=False,
)
scores_votes = pipeline.run()

In [None]:
scores_votes

## Pipeline with an alternative ranker

Here we use the `MaximumLikelihood` ranker.

In [None]:
pipeline = RankingPipeline(method="ml", include_votes=True, include_reactions=True, bootstrap_samples=5, batch=False)

In [None]:
scores_ml = pipeline.run()

In [None]:
pipeline = RankingPipeline(method="ml", include_votes=True, include_reactions=False, bootstrap_samples=5, batch=False)
scores_ml_votes = pipeline.run()
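
The chart legend in this notebook labels the `ml` scores as "BT score", which suggests a Bradley-Terry model fitted by maximum likelihood. The sketch below shows one classic way to fit such a model, the minorization-maximization (MM) iteration. It is an illustration under assumptions, not the `MaximumLikelihood` ranker itself: the function name, the half-win treatment of draws and the normalisation are choices made for this example.

```python
from collections import defaultdict


def bradley_terry(matches, n_iter=200):
    """Estimate Bradley-Terry strengths with the MM iteration.

    `matches` holds (model_a, model_b, score) tuples with score in {2, 1, 0}
    and two distinct models per match. A draw counts as half a win for each
    side (a simplifying assumption). Every model needs at least one (half)
    win for its strength to stay positive.
    """
    wins = defaultdict(float)
    pair_games = defaultdict(float)
    models = set()
    for model_a, model_b, score in matches:
        models.update((model_a, model_b))
        wins[model_a] += score / 2.0
        wins[model_b] += 1.0 - score / 2.0
        pair_games[frozenset((model_a, model_b))] += 1.0

    ordered = sorted(models)
    strengths = {model: 1.0 for model in ordered}
    for _ in range(n_iter):
        updated = {}
        for model in ordered:
            denominator = 0.0
            for pair, games in pair_games.items():
                if model in pair:
                    (other,) = pair - {model}
                    denominator += games / (strengths[model] + strengths[other])
            updated[model] = wins[model] / denominator
        # Rescale so that strengths sum to the number of models.
        total = sum(updated.values())
        strengths = {m: value * len(ordered) / total for m, value in updated.items()}
    return strengths


# Toy matches between three hypothetical models.
toy_matches = [
    ("m1", "m2", 2), ("m1", "m2", 2), ("m2", "m1", 2),
    ("m2", "m3", 2), ("m3", "m2", 2), ("m1", "m3", 1),
]
strengths = bradley_terry(toy_matches)
# Modelled probability that m1 beats m2.
p_m1_beats_m2 = strengths["m1"] / (strengths["m1"] + strengths["m2"])
print(strengths, p_m1_beats_m2)
```

Unlike Elo updates, this fit does not depend on the order of the matches, since it only uses win totals and the number of games per pair.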

## Comparing the different methods

In [None]:
import polars as pl

pl.concat(
    [
        # `run()` names the model column `model_name` (see the `scores` output); alias it to `model`.
        scores.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_elo")),
        scores_votes.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_elo_votes")),
        scores_ml.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_ml")),
        scores_ml_votes.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_ml_votes")),
    ],
    how="align",
)

In [None]:
import polars as pl
import altair as alt

df_pl = pl.concat(
    [
        # `run()` names the model column `model_name` (see the `scores` output); alias it to `model`.
        scores.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_elo")),
        scores_votes.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_elo_votes")),
        scores_ml.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_ml")),
        scores_ml_votes.select(pl.col("model_name").alias("model"), pl.col("median").alias("score_ml_votes")),
    ],
    how="align",
).sort("score_elo", descending=True)

df = df_pl.to_pandas()
df_long = df.melt(
    id_vars=["model"],
    value_vars=["score_elo", "score_elo_votes", "score_ml", "score_ml_votes"],
    var_name="score_type",
    value_name="score",
)
legend_labels = {
    "score_elo": "Elo score (all data)",
    "score_elo_votes": "Elo score (votes data)",
    "score_ml": "BT score (all data)",
    "score_ml_votes": "BT score (votes data)",
}
df_long["score_type"] = df_long["score_type"].map(legend_labels)

chart = (
    alt.Chart(df_long)
    .mark_circle(size=80)
    .encode(
        x=alt.X("model:N", sort=df["model"].tolist(), title="Model"),
        y=alt.Y("score:Q", title="Score", scale=alt.Scale(domain=[500, 1300])),
        color=alt.Color("score_type:N", title="Score Type"),
        tooltip=["model", "score", "score_type"],
    )
    .properties(width=600, height=400)
)

chart

## Scores by category

The `run_category` and `run_all_categories` methods compute scores for a given category, or for all categories whose total number of matches exceeds a threshold.

In [None]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=True,
    include_reactions=True,
    bootstrap_samples=5,
    batch=False,
)

In [None]:
pipeline.run_category("Education")

In [None]:
results = pipeline.run_all_categories()
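
To make the category logic concrete, here is a minimal plain-Python sketch of the two steps described in this section: keeping the matches tagged with one category, and listing the categories that have more matches than a threshold. The helper names, the strict `>` comparison and the toy data are assumptions made for this example. `rank_comparia` may implement the selection differently.

```python
from collections import Counter


def matches_in_category(matches, category):
    """Keep the matches whose `categories` list contains `category`."""
    return [match for match in matches if category in match["categories"]]


def categories_above_threshold(matches, threshold):
    """List the categories that appear in strictly more than `threshold` matches."""
    counts = Counter(
        category for match in matches for category in set(match["categories"])
    )
    return sorted(category for category, count in counts.items() if count > threshold)


# Toy matches with the same fields as the first columns of `pipeline.matches`.
toy = [
    {"model_a": "m1", "model_b": "m2", "score": 2, "categories": ["Education", "Arts"]},
    {"model_a": "m2", "model_b": "m3", "score": 0, "categories": ["Education"]},
    {"model_a": "m1", "model_b": "m3", "score": 1, "categories": ["Environment"]},
]
education = matches_in_category(toy, "Education")
frequent = categories_above_threshold(toy, threshold=1)
print(len(education), frequent)
```

A match can carry several categories, so the same match may contribute to more than one per-category ranking.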