# Summary of the benchmark

For reference, [this article](https://arxiv.org/pdf/2403.08035) reports that GPT-3.5 achieves $\text{acc}=0.89$ and $F_1=0.93$ on *English Hate Speech Detection*. This benchmark evaluates newer models.

## Metrics

Our task is binary classification. The ground truth is a function $f:E\to\{0,1\}$ over the set of inputs $E$, and each model provides an estimate $\hat f:E\to\{0,1\}$ of it.

### Confusion Matrix Notation (Binary Classification)

|              | Predicted 0 | Predicted 1 |
| ------------ | ----------- | ----------- |
| **Actual 0** | TN          | FP          |
| **Actual 1** | FN          | TP          |

* **TP** = True Positives (predicted 1, actual 1)
* **TN** = True Negatives (predicted 0, actual 0)
* **FP** = False Positives (predicted 1, actual 0)
* **FN** = False Negatives (predicted 0, actual 1)
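
As a minimal sketch of the notation above, the four counts can be tallied directly from paired labels. The helper name `confusion_counts` and the toy labels are illustrative assumptions, not part of the benchmark code.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels in {0, 1}."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            # Anything outside {0, 1} is rejected rather than silently counted.
            raise ValueError(f"Non-binary pair: {actual!r}, {predicted!r}")
    return tn, fp, fn, tp


# Toy labels invented for illustration only (not benchmark data).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The return order (TN, FP, FN, TP) follows the table read row by row, which is also the order given by `sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()` for binary labels.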

### Metrics 

- **Precision\_0** : The proportion of predicted class 0 that is actually class 0 : $$\text{Precision}\_0 = \frac{\text{TN}}{\text{TN} + \text{FN}}$$
- **Recall\_0** : The proportion of actual class 0 correctly predicted as class 0 : $$\text{Recall}\_0 = \frac{\text{TN}}{\text{TN} + \text{FP}}$$
- **F1\_0** : The harmonic mean of Precision\_0 and Recall\_0 : $$\text{F1}_0 = 2 \cdot \frac{\text{Precision}\_0 \times \text{Recall}\_0}{\text{Precision}\_0 + \text{Recall}\_0}$$
- **Precision\_1** : The proportion of predicted class 1 that is actually class 1 : $$\text{Precision}\_1 = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
- **Recall\_1** : The proportion of actual class 1 correctly predicted as class 1 : $$\text{Recall}\_1 = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **F1\_1** : The harmonic mean of Precision\_1 and Recall\_1 : $$\text{F1}_1 = 2 \cdot \frac{\text{Precision}\_1 \times \text{Recall}\_1}{\text{Precision}\_1 + \text{Recall}\_1}$$
- **Accuracy** : The proportion of all correct predictions : $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
- **ROC AUC** : Area under the **Receiver Operating Characteristic** curve. It evaluates the tradeoff between True Positive Rate (TPR) and False Positive Rate (FPR) over all thresholds : $$\text{TPR (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad 
\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$
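
The per-class definitions can be written out from the four confusion-matrix counts, with class 0 built on TN and class 1 built on TP. The sketch below is illustrative only: the function name `binary_metrics`, the example counts, and the choice to return 0.0 for an empty denominator are assumptions. It also includes the value that ROC AUC takes when only hard 0/1 predictions are available. In that case the ROC curve has a single interior point, and the area reduces to the mean of TPR and TNR, that is $(\text{Recall}\_1 + \text{Recall}\_0)/2$.

```python
def binary_metrics(tn, fp, fn, tp):
    """Per-class precision/recall/F1, accuracy, and hard-label ROC AUC."""

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    def f1(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    precision_0 = ratio(tn, tn + fn)
    recall_0 = ratio(tn, tn + fp)  # also the true negative rate, 1 - FPR
    precision_1 = ratio(tp, tp + fp)
    recall_1 = ratio(tp, tp + fn)  # also the true positive rate

    return {
        "Precision_0": precision_0,
        "Recall_0": recall_0,
        "F1_0": f1(precision_0, recall_0),
        "Precision_1": precision_1,
        "Recall_1": recall_1,
        "F1_1": f1(precision_1, recall_1),
        "Accuracy": ratio(tp + tn, tp + tn + fp + fn),
        # Only meaningful when both classes occur in the ground truth.
        "ROC_AUC_hard": (recall_1 + recall_0) / 2,
    }


# Example counts invented for illustration (not benchmark results).
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the hard-label ROC AUC is just the average of the two recalls, it carries no information about ranking quality; a threshold-free comparison would need model scores or probabilities rather than 0/1 predictions.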

## Libraries

In [1]:
import pandas as pd 
from pathlib import Path
import os
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score
)

## Global variables

In [2]:
ROOT = Path('../..')
DATA_PATH = ROOT / "data"
BENCHMARK_PATH = DATA_PATH / "benchmark_jigsaw"

console = Console()

## Load predictions

In [3]:
to_exclude = [
    "benchmark_jigsaw.csv",
    "checkpoint",
    "eng",
]

In [4]:
df_benchmark = pd.read_csv(BENCHMARK_PATH / "benchmark_jigsaw.csv", encoding = 'utf-8')
len_benchmark = len(df_benchmark)
console.print(f"Loaded benchmark with {len_benchmark} entries.")

In [5]:
df_benchmark

Unnamed: 0,msg_id,original_text,content,label
0,e6865c831da40459,Combining 'Official requests for clemency' sec...,Combiner la section 'Demandes officielles de c...,0
1,cff2b456e5059b47,Web \n\nIs this her own website? 113.22.84.65,Web\n\nEst-ce son propre site web ? 113.22.84.65,0
2,880d3d608f37bee3,"RFCU \n\nFYI, I have opened Wikipedia:Requests...","RFCU \n\nPour info, j'ai ouvert Wikipedia:Requ...",0
3,dd29d52a4320a2e4,Monitor for cleanup: \nAs of the date signed b...,Surveillance pour nettoyage : \nÀ la date sig...,0
4,d70e773c975839e1,I concur and have done so.,Je suis d'accord et je l'ai fait.,0
...,...,...,...,...
391,bf5aae5b17fb0b77,because I am a huge faggot,parce que je suis un énorme pédé,1
392,8947f2c12aea0a32,"If no-one beats me to it, I'll knock something...","Si personne ne me devance, je vais bricoler qu...",1
393,ccf05c5d4b791836,"ur a faggot, fuck off and get a real job and s...","T'es un pédé, casse-toi et trouve un vrai boul...",1
394,ae7a53526025c0e2,"I SAID SHUT UP, YOU STUPID, STUPID...","J'AI DIT FERME TA GUEULE, ESPÈCE DE BÊTE, BÊTE...",1


In [6]:
dfs = []

for file in os.listdir(BENCHMARK_PATH):
    if file.endswith('.csv') and all(file_exclude not in file for file_exclude in to_exclude):
        file_path = BENCHMARK_PATH / file
        df = pd.read_csv(file_path, encoding='utf-8')
        df['file'] = file  
        assert len(df) == len_benchmark, f"Length mismatch for {file}: {len(df)} vs {len_benchmark}"
        assert "prediction" in df.columns, f"'prediction' column missing in {file}"
        assert "label" in df.columns, f"'label' column missing in {file}"
        dfs.append(df)

console.print(f"Loaded {len(dfs)} additional files from benchmark directory.")
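
The file filter in the loop above matches exclusion entries as substrings of the file name. The sketch below isolates that rule in a helper so its behavior can be checked on its own. The helper name `is_prediction_file` and all candidate file names are hypothetical; only the `to_exclude` entries come from the notebook.

```python
def is_prediction_file(file_name, excluded_substrings):
    """Keep CSV files whose name contains none of the excluded substrings."""
    if not file_name.endswith(".csv"):
        return False
    return not any(excluded in file_name for excluded in excluded_substrings)


to_exclude = ["benchmark_jigsaw.csv", "checkpoint", "eng"]

# Hypothetical file names, chosen to exercise each branch of the filter.
candidates = [
    "benchmark_jigsaw.csv",
    "model_a.csv",
    "model_a_checkpoint.csv",
    "model_b_eng.csv",
    "notes.txt",
    "engine_c.csv",
]

kept = [name for name in candidates if is_prediction_file(name, to_exclude)]
print(kept)
```

Note that substring matching on `"eng"` also drops any file whose name merely contains those letters, such as the hypothetical `engine_c.csv`. If that is not intended, a stricter rule (for example, matching a `_eng` suffix) may be preferable.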

## Compute the metrics

In [7]:
results = []

for df in dfs:
    file_name = df['file'].iloc[0]
    y_true = df['label'].astype(int)
    y_pred = df['prediction'].apply(lambda x: 1 if x else 0).astype(int)
    row = {"Model": file_name.replace('.csv', '').replace('_', ' ')}

    try:
        report = classification_report(y_true, y_pred, output_dict=True)
        row.update({
            "Precision_0": report['0']['precision'],
            "Recall_0": report['0']['recall'],
            "F1_0": report['0']['f1-score'],
            "Precision_1": report['1']['precision'],
            "Recall_1": report['1']['recall'],
            "F1_1": report['1']['f1-score'],
            "Accuracy": report['accuracy'],
        })
    except Exception as e:
        console.print(f"[red]Error computing classification report for {file_name}: {e}[/red]")

    try:
        roc_auc = roc_auc_score(y_true, y_pred)
        row["ROC_AUC"] = roc_auc
    except Exception:  # e.g. ROC AUC is undefined when y_true contains a single class
        row["ROC_AUC"] = None

    results.append(row)
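
The conversion `1 if x else 0` relies on Python truthiness, which is safe for booleans and 0/1 integers but not for every way a prediction might be stored in a CSV: the string `"False"` and a missing value (`NaN`) are both truthy and would be counted as class 1. The following sketch shows a stricter conversion. The function name `to_binary` and the set of accepted strings are assumptions, since the notebook does not show how the `prediction` column is encoded.

```python
import math


def to_binary(value):
    """Map a stored prediction to 0 or 1, rejecting ambiguous values."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"1", "true", "yes"}:  # assumed positive spellings
            return 1
        if text in {"0", "false", "no"}:  # assumed negative spellings
            return 0
        raise ValueError(f"Unrecognized prediction: {value!r}")
    if isinstance(value, float) and math.isnan(value):
        raise ValueError("Missing prediction (NaN)")
    return 1 if value else 0


def naive(value):
    """The truthiness-based conversion, kept for comparison."""
    return 1 if value else 0


for sample in [True, False, 1, 0, "True", "False", " 0 "]:
    print(repr(sample), "naive:", naive(sample), "strict:", to_binary(sample))
```

With a pandas column, the same function could be applied as `df['prediction'].map(to_binary)`; raising on unrecognized values makes a malformed prediction file fail loudly instead of inflating the positive class.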

In [8]:
# === Convert to DataFrame & Display as Rich Table ===
summary_df = pd.DataFrame(results)
summary_df = summary_df.sort_values(by="Accuracy", ascending=False) 

# === Print Table in Rich ===
rich_table = Table(title="Benchmark Summary for All Models", show_lines=True)
for col in summary_df.columns:
    rich_table.add_column(col, justify="center")

for _, row in summary_df.iterrows():
    rich_table.add_row(*[f"{x:.3f}" if isinstance(x, float) else str(x) for x in row])

console.print(rich_table)