# Summary of the benchmark

These results can be compared with those reported in [this article](https://arxiv.org/pdf/2403.08035), in which GPT-3.5 achieves $\text{acc}=0.89$ and $F_1=0.93$ on *English Hate Speech Detection*. This benchmark uses newer models.

## Metrics

The task is binary classification. The ground truth is a function $f:E\to\{0,1\}$, which we estimate with a classifier $\hat f:E\to\{0,1\}$.

### Confusion Matrix Notations (Binary Classification)

|              | Predicted 0 | Predicted 1 |
| ------------ | ----------- | ----------- |
| **Actual 0** | TN          | FP          |
| **Actual 1** | FN          | TP          |

* **TP** = True Positives (predicted 1, actual 1)
* **TN** = True Negatives (predicted 0, actual 0)
* **FP** = False Positives (predicted 1, actual 0)
* **FN** = False Negatives (predicted 0, actual 1)
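
As an illustration of the notation, the four counts can be tallied directly from paired labels. This is a minimal sketch: the helper name `confusion_counts` and the toy labels are invented for the example and are not part of the benchmark. scikit-learn's `confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()` returns the counts in the same TN, FP, FN, TP order.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels in {0, 1}."""
    # Each key is an (actual, predicted) pair; missing pairs count as 0
    pairs = Counter(zip(y_true, y_pred))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]


# Toy labels, invented for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```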

### Metrics 

- **Precision\_0** : The proportion of predicted class 0 that is actually class 0 : $$\text{Precision}\_0 = \frac{\text{TN}}{\text{TN} + \text{FN}}$$
- **Recall\_0** : The proportion of actual class 0 correctly predicted as class 0 : $$\text{Recall}\_0 = \frac{\text{TN}}{\text{TN} + \text{FP}}$$
- **F1\_0** : The harmonic mean of Precision\_0 and Recall\_0 : $$\text{F1}_0 = 2 \cdot \frac{\text{Precision}\_0 \times \text{Recall}\_0}{\text{Precision}\_0 + \text{Recall}\_0}$$
- **Precision\_1** : The proportion of predicted class 1 that is actually class 1 : $$\text{Precision}\_1 = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
- **Recall\_1** : The proportion of actual class 1 correctly predicted as class 1 : $$\text{Recall}\_1 = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **F1\_1** : The harmonic mean of Precision\_1 and Recall\_1 : $$\text{F1}_1 = 2 \cdot \frac{\text{Precision}\_1 \times \text{Recall}\_1}{\text{Precision}\_1 + \text{Recall}\_1}$$
- **Accuracy** : The proportion of all correct predictions : $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$
- **ROC AUC** : Area under the **Receiver Operating Characteristic** curve. It evaluates the tradeoff between True Positive Rate (TPR) and False Positive Rate (FPR) over all thresholds : $$\text{TPR (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad 
\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$
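
To make the metrics concrete, the sketch below computes them from the four confusion-matrix counts, using the standard definitions (for class 1, precision is TP / (TP + FP) and recall is TP / (TP + FN); class 0 swaps the roles of positives and negatives). The function name `binary_metrics`, the toy counts, and the convention of returning 0.0 on a zero denominator are assumptions made for illustration. When the predictions are hard 0/1 labels rather than scores, the ROC curve has a single interior point (FPR, TPR), so its area reduces to (1 + TPR - FPR) / 2.

```python
def binary_metrics(tn, fp, fn, tp):
    """Per-class precision/recall/F1, accuracy, and single-threshold ROC AUC."""

    def safe_div(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0
        return num / den if den else 0.0

    def f1(precision, recall):
        return safe_div(2 * precision * recall, precision + recall)

    precision_0 = safe_div(tn, tn + fn)
    recall_0 = safe_div(tn, tn + fp)
    precision_1 = safe_div(tp, tp + fp)
    recall_1 = safe_div(tp, tp + fn)  # also the true positive rate
    fpr = safe_div(fp, fp + tn)

    return {
        "Precision_0": precision_0,
        "Recall_0": recall_0,
        "F1_0": f1(precision_0, recall_0),
        "Precision_1": precision_1,
        "Recall_1": recall_1,
        "F1_1": f1(precision_1, recall_1),
        "Accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        # Area under the polyline (0, 0) -> (FPR, TPR) -> (1, 1)
        "ROC_AUC": (1 + recall_1 - fpr) / 2,
    }


# Toy counts, invented for illustration only
m = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The single-threshold AUC is the mean of Recall\_1 and Recall\_0, which is worth keeping in mind when comparing it with AUC values computed from continuous scores.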

## Libraries

In [51]:
import pandas as pd 
from pathlib import Path
import os
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score
)

## Global variables

In [52]:
ROOT = Path('../..')
DATA_PATH = ROOT / "data"
BENCHMARK_PATH = DATA_PATH / "benchmark"

console = Console()

## Load predictions

In [53]:
to_exclude = []

In [54]:
df_benchmark = pd.read_csv(DATA_PATH / "benchmark_jigsaw" / "benchmark_jigsaw.csv", encoding = 'utf-8')
len_benchmark = len(df_benchmark)
console.print(f"Loaded benchmark with {len_benchmark} entries.")

In [55]:
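# Prediction files come in pairs: the first name ends with _0.csv, the second with _1.csv.
# Only the _0 halves are listed here; the matching _1 file is resolved when loading.
files = os.listdir(BENCHMARK_PATH / "benchmark_our_custom_model_jigsaw")
files = [f for f in files if f.endswith('_0.csv') and not f.startswith('.')]
# Drop any file whose name contains an excluded string (this also covers exact matches)
files = [f for f in files if not any(ex in f for ex in to_exclude)]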
files

['jigsawen_output_rebl_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_rebl_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawen_output_oeal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_odal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_rdal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_oebl_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_real_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawen_output_real_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawfr_output_oeal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawen_output_rdal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawen_output_odal_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv',
 'jigsawen_output_oebl_Qwen3-4B_<cot_intention>-<cot_categorie_list>_0.csv']

In [56]:
dfs = []

pred_dir = BENCHMARK_PATH / "benchmark_our_custom_model_jigsaw"

for file in files:
    file_0 = file
    # Swap only the trailing suffix: str.replace('_0', '_1') would also rewrite
    # any other '_0' that appears earlier in the file name
    file_1 = file_0[:-len('_0.csv')] + '_1.csv'
    # Verify that the files are paired correctly
    if not (pred_dir / file_1).exists():
        console.print(f"Warning: no matching {file_1} found for {file_0}. Skipping.")
        continue
    df0 = pd.read_csv(pred_dir / file_0, encoding='utf-8')
    df0['file'] = file_0
    df1 = pd.read_csv(pred_dir / file_1, encoding='utf-8')
    df1['file'] = file_1
    assert len(df0) == len(df1), f"Length mismatch between {file_0} and {file_1}"
    # The two halves together must cover every benchmark entry
    df = pd.concat([df0, df1], ignore_index=True)
    assert len(df) == len_benchmark, f"Length mismatch for {file_0}: {len(df)} vs {len_benchmark}"
    assert "prediction" in df.columns, f"'prediction' column missing in {file_0}"
    assert "label" in df.columns, f"'label' column missing in {file_0}"
    dfs.append(df)

console.print(f"Loaded {len(dfs)} prediction file pairs from the benchmark directory.")

## Compute the metrics

In [57]:
results = []

for df in dfs:
    file_name = df['file'].iloc[0]
    # Labels are stored as strings such as 'tensor(1)'; any other value is treated as class 0
    y_true = (df['label'] == 'tensor(1)').astype(int)
    y_pred = df['prediction']
    row = {"Model": file_name.replace('.csv', '').replace('_', ' ')}

    try:
        report = classification_report(y_true, y_pred, output_dict=True)
        row.update({
            "Precision_0": report['0']['precision'],
            "Recall_0": report['0']['recall'],
            "F1_0": report['0']['f1-score'],
            "Precision_1": report['1']['precision'],
            "Recall_1": report['1']['recall'],
            "F1_1": report['1']['f1-score'],
            "Accuracy": report['accuracy'],
        })
    except Exception as e:
        console.print(f"[red]Error computing classification report for {file_name}: {e}[/red]")

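    try:
        # y_pred is passed as-is; if it holds hard 0/1 labels rather than scores,
        # the AUC is computed from a single operating point
        roc_auc = roc_auc_score(y_true, y_pred)
        row["ROC_AUC"] = roc_auc
    except Exception as e:
        # Report the failure instead of silently swallowing every exception
        console.print(f"[yellow]ROC AUC unavailable for {file_name}: {e}[/yellow]")
        row["ROC_AUC"] = None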

    results.append(row)
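
The label conversion in the cell above maps every value other than the exact string `'tensor(1)'` to class 0, so a malformed or differently formatted label would silently count as a negative. A stricter alternative is sketched below. The helper `parse_label` is hypothetical, and the accepted formats (`tensor(0)`, `tensor(1)`, or a bare `0`/`1`) are assumptions about how the CSV files might encode labels.

```python
import re

# Accept either 'tensor(<0|1>)' or a bare '0' / '1', with optional whitespace
_LABEL_RE = re.compile(r"^\s*(?:tensor\(\s*([01])\s*\)|([01]))\s*$")


def parse_label(raw):
    """Map 'tensor(1)', '1', or 1 to an int in {0, 1}; raise on anything else."""
    match = _LABEL_RE.match(str(raw))
    if match is None:
        raise ValueError(f"Unrecognized label: {raw!r}")
    # Exactly one of the two groups is set when the pattern matches
    return int(match.group(1) or match.group(2))


print(parse_label("tensor(1)"), parse_label("tensor(0)"), parse_label(1))
```

With pandas this could be applied as `y_true = df['label'].map(parse_label)`, which fails loudly on the first unexpected value instead of counting it as class 0.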

In [58]:
# === Convert to DataFrame & Display as Rich Table ===
summary_df = pd.DataFrame(results)
summary_df = summary_df.sort_values(by="Accuracy", ascending=False) 

# === Print Table in Rich ===
rich_table = Table(title="Benchmark Summary for All Models", show_lines=True)
for col in summary_df.columns:
    rich_table.add_column(col, justify="center", no_wrap=False)

for _, row in summary_df.iterrows():
    rich_table.add_row(*[f"{x:.3f}" if isinstance(x, float) else str(x) for x in row])

console.print(rich_table)