# Model Evaluation Results

This notebook compares the performance of three transformer models (DistilBERT, MiniLM, and TinyBERT) for multi-label emotion classification. Models are evaluated using micro- and macro-averaged AUC on the Lemotif dataset after two-stage fine-tuning:

1) First-stage fine-tuning on the GoEmotions dataset
2) Second-stage fine-tuning with a custom EmotionModel (emotion + intensity heads)

The comparison weighs both accuracy and model efficiency, since the target is client-side, browser-based inference. **As of the latest results, MiniLM leads DistilBERT on micro AUC (0.904 vs. 0.889) with a near-identical macro AUC (0.798 vs. 0.803), and is the new recommended model for deployment.**

## Import Libraries

In [15]:
%pip install -q pandas matplotlib

import json
import pandas as pd

from pathlib import Path

REPO_ROOT = Path().resolve().parent

Note: you may need to restart the kernel to use updated packages.


## Load Metrics

In [16]:
ARTIFACT_ROOT = REPO_ROOT / "artifacts" / "experiments"

runs = [
    ARTIFACT_ROOT / "distilbert_v1",
    ARTIFACT_ROOT / "minilm_v1",
    ARTIFACT_ROOT / "tinybert_v1",
]

In [17]:
rows = []

for run in runs:
    metrics_path = run / "analysis_metrics.json"
    if not metrics_path.exists():
        print(f"Missing metrics for {run}")
        continue

    with open(metrics_path) as f:
        metrics = json.load(f)

    rows.append(metrics)

df = pd.DataFrame(rows)
df

Unnamed: 0,model,config,timestamp,micro_auc,macro_auc,per_label_auc,label_names
0,distilbert/distilbert-base-uncased,distilbert.yaml,2026-01-13T21:25:01.606962,0.889145,0.803192,"[0.9140127388535032, 0.8397435897435899, 0.843...","[afraid, angry, anxious, ashamed, awkward, bor..."
1,nreimers/MiniLMv2-L12-H384-distilled-from-RoBE...,minilm.yaml,2026-01-13T21:25:13.310337,0.904017,0.798494,"[0.9689490445859873, 0.8306623931623932, 0.833...","[afraid, angry, anxious, ashamed, awkward, bor..."
2,huawei-noah/TinyBERT_General_4L_312D,tinybert.yaml,2026-01-13T21:24:50.109017,0.847198,0.650786,"[0.7659235668789808, 0.45566239316239315, 0.79...","[afraid, angry, anxious, ashamed, awkward, bor..."
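
The `macro_auc` and `per_label_auc` columns should be related: macro AUC is conventionally the unweighted mean of the per-label AUCs. A quick consistency check along these lines can catch a stale or mismatched metrics file. This is a minimal sketch: `check_macro_auc` is a hypothetical helper, the tolerance is arbitrary, and it assumes the pipeline that wrote `analysis_metrics.json` used an unweighted mean, which this notebook does not confirm. The example dictionary uses made-up values, not results from the runs above.

```python
import math


def check_macro_auc(metrics, tol=1e-6):
    """Return (recomputed_macro, matches) for one metrics dict.

    Assumes macro AUC was stored as the unweighted mean of per-label AUCs.
    """
    per_label = metrics["per_label_auc"]
    if not per_label:
        raise ValueError("per_label_auc is empty")
    recomputed = sum(per_label) / len(per_label)
    matches = math.isclose(recomputed, metrics["macro_auc"], abs_tol=tol)
    return recomputed, matches


# Illustrative values only (not taken from the experiment artifacts).
example = {"macro_auc": 0.75, "per_label_auc": [0.9, 0.8, 0.55]}
recomputed, ok = check_macro_auc(example)
print(f"recomputed macro AUC = {recomputed:.4f}, matches stored value: {ok}")
```

In the notebook, the same helper could be applied to each entry of `rows` after loading.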


## Clean & Sort Metrics

In [18]:
display_cols = [
    "model",
    "config",
    "micro_auc",
    "macro_auc",
]

df_sorted = df[display_cols].sort_values("macro_auc", ascending=False).reset_index(drop=True)

df_sorted

Unnamed: 0,model,config,micro_auc,macro_auc
0,distilbert/distilbert-base-uncased,distilbert.yaml,0.889145,0.803192
1,nreimers/MiniLMv2-L12-H384-distilled-from-RoBE...,minilm.yaml,0.904017,0.798494
2,huawei-noah/TinyBERT_General_4L_312D,tinybert.yaml,0.847198,0.650786


## Per-Class Performance Comparison

In [20]:
# Display the five best and five worst labels by AUC for each model
for _, row in df.iterrows():
    model_name = row["model"].split("/")[-1]
    print(f"Per-Class Metrics: {model_name}")

    label_names = row.get("label_names")
    auc = row.get("per_label_auc")

    per_class_df = pd.DataFrame(
        {
            "Emotion": label_names,
            "AUC": auc,
        }
    )

    # Show the five highest-AUC labels (blue gradient), then the five lowest (red gradient)
    top_k_df = per_class_df.sort_values("AUC", ascending=False).head(5)
    display(
        top_k_df.style.format(
            {"AUC": "{:.4f}"}
        ).background_gradient(subset=["AUC"], cmap="Blues", vmin=0.5, vmax=1.0)
    )

    bottom_k_df = per_class_df.sort_values("AUC", ascending=True).head(5)
    display(
        bottom_k_df.style.format(
            {"AUC": "{:.4f}"}
        ).background_gradient(subset=["AUC"], cmap="Reds", vmin=0.0, vmax=0.5)
    )

Per-Class Metrics: distilbert-base-uncased


Unnamed: 0,Emotion,AUC
0,afraid,0.914
8,disgusted,0.9099
10,frustrated,0.894
11,happy,0.8927
15,sad,0.8702


Unnamed: 0,Emotion,AUC
13,nostalgic,0.6139
7,confused,0.6854
17,surprised,0.7169
3,ashamed,0.7261
9,excited,0.7485


Per-Class Metrics: MiniLMv2-L12-H384-distilled-from-RoBERTa-Large


Unnamed: 0,Emotion,AUC
0,afraid,0.9689
12,jealous,0.9446
11,happy,0.8837
8,disgusted,0.8754
10,frustrated,0.8704


Unnamed: 0,Emotion,AUC
13,nostalgic,0.5706
4,awkward,0.6503
3,ashamed,0.7237
9,excited,0.7257
6,calm,0.728


Per-Class Metrics: TinyBERT_General_4L_312D


Unnamed: 0,Emotion,AUC
2,anxious,0.7945
11,happy,0.7824
0,afraid,0.7659
16,satisfied,0.7565
17,surprised,0.7045


Unnamed: 0,Emotion,AUC
1,angry,0.4557
12,jealous,0.4573
4,awkward,0.4699
5,bored,0.589
15,sad,0.6093


### Model Comparison Summary

- MiniLM has the highest micro AUC (0.904), ahead of DistilBERT (0.889), while DistilBERT has a marginally higher macro AUC (0.803 vs. 0.798).
- DistilBERT and MiniLM both substantially outperform TinyBERT, whose macro AUC is much lower (0.651) and whose AUC falls below 0.5 (worse than chance) on angry, jealous, and awkward.
- All models show a gap between micro and macro AUC, which is consistent with class imbalance: micro AUC pools every (example, label) decision, so common emotions dominate it, while macro AUC weights rare emotions equally.
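
How a micro/macro gap can arise is easiest to see on a toy example. The sketch below computes AUC in pure Python as the fraction of (positive, negative) pairs ranked correctly, then averages it two ways: macro (mean of per-label AUCs) and micro (one AUC over all flattened (example, label) entries). The data are invented for illustration, with one common, well-separated label and one rare, poorly ranked label. The function names are hypothetical, and this is not necessarily how the evaluation pipeline computed the reported numbers.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y, S):
    """Unweighted mean of per-label AUCs; also returns the per-label values."""
    n_labels = len(Y[0])
    per_label = [
        roc_auc([row[j] for row in Y], [row[j] for row in S]) for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label


def micro_auc(Y, S):
    """Single AUC over every (example, label) entry pooled together."""
    flat_y = [v for row in Y for v in row]
    flat_s = [v for row in S for v in row]
    return roc_auc(flat_y, flat_s)


# Toy data: rows are examples, columns are labels (label 0 is common, label 1 is rare).
Y = [[1, 0], [1, 0], [1, 0], [0, 0], [0, 1], [0, 0]]
S = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1], [0.2, 0.4], [0.1, 0.25], [0.3, 0.15]]

macro, per_label = macro_auc(Y, S)
print("per-label AUC:", [round(a, 3) for a in per_label])
print(f"macro AUC: {macro:.3f}")
print(f"micro AUC: {micro_auc(Y, S):.3f}")
```

Here the common label is ranked perfectly (AUC 1.0) and the rare label poorly (AUC 0.6), giving a macro AUC of 0.800 but a micro AUC of about 0.906, because the pooled positives come mostly from the common label.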

Given MiniLM's higher micro AUC, near-identical macro AUC, and much smaller model size, it is the best overall choice and will be used as the primary model for inference.