# Model Evaluation Results

This notebook compares the performance of different transformer models for multi-label emotion classification. Models are evaluated using AUC metrics (micro and macro) on the Lemotif dataset after two-stage training:

1) Pre-training on GoEmotions dataset
2) Fine-tuning with a custom EmotionModel (emotion + intensity heads)

The comparison weighs accuracy against model efficiency, since the target deployment is client-side, browser-based inference.
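For reference, the difference between the two AUC averages can be sketched in plain Python. This is a minimal illustration, not the evaluation code behind the numbers in this notebook: it assumes the usual multi-label definitions (macro averages the per-label AUCs, micro pools every sample-label pair into one binary problem), and the helper names and toy data are hypothetical.

```python
def binary_auc(y_true, y_score):
    """ROC AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Unweighted mean of per-label AUCs: every label counts equally."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def micro_auc(Y_true, Y_score):
    """AUC over all (sample, label) pairs pooled together."""
    flat_true = [t for row in Y_true for t in row]
    flat_score = [s for row in Y_score for s in row]
    return binary_auc(flat_true, flat_score)


# Hypothetical toy data: 4 samples, 2 labels; the second label is rarer.
Y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.1]]

print(f"macro AUC: {macro_auc(Y_true, Y_score):.4f}")
print(f"micro AUC: {micro_auc(Y_true, Y_score):.4f}")
```

In this toy case the rarer label is ranked less well, which pulls the macro average below the micro average, the same direction of gap seen in the results below.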

## Import Libraries

In [1]:
%pip install -q pandas matplotlib

import json
import pandas as pd

from pathlib import Path

REPO_ROOT = Path().resolve().parent

Note: you may need to restart the kernel to use updated packages.


## Load Metrics

In [2]:
ARTIFACT_ROOT = REPO_ROOT / "artifacts" / "experiments"

runs = [
    ARTIFACT_ROOT / "distilbert_v1",
    ARTIFACT_ROOT / "minilm_v1",
    ARTIFACT_ROOT / "tinybert_v1",
]

In [3]:
rows = []

for run in runs:
    metrics_path = run / "analysis_metrics.json"
    if not metrics_path.exists():
        print(f"Missing metrics for {run}")
        continue

    with open(metrics_path) as f:
        metrics = json.load(f)

    rows.append(metrics)

df = pd.DataFrame(rows)
df

Unnamed: 0,model,config,timestamp,micro_auc,macro_auc,per_label_auc,label_names
0,distilbert/distilbert-base-uncased,distilbert.yaml,2026-01-08T00:27:16.697360,0.850713,0.782622,"[0.732484076433121, 0.7996794871794872, 0.7879...","[afraid, angry, anxious, ashamed, awkward, bor..."
1,microsoft/MiniLM-L12-H384-uncased,minilm.yaml,2026-01-08T00:27:27.260624,0.781408,0.702835,"[0.8702229299363057, 0.5181623931623931, 0.801...","[afraid, angry, anxious, ashamed, awkward, bor..."
2,huawei-noah/TinyBERT_General_4L_312D,tinybert.yaml,2026-01-08T00:27:37.146654,0.771134,0.642702,"[0.6886942675159236, 0.5069444444444444, 0.804...","[afraid, angry, anxious, ashamed, awkward, bor..."


## Clean & Sort Metrics

In [6]:
display_cols = [
    "model",
    "config",
    "micro_auc",
    "macro_auc",
]

df_sorted = df[display_cols].sort_values("macro_auc", ascending=False).reset_index(drop=True)

df_sorted

Unnamed: 0,model,config,micro_auc,macro_auc
0,distilbert/distilbert-base-uncased,distilbert.yaml,0.850713,0.782622
1,microsoft/MiniLM-L12-H384-uncased,minilm.yaml,0.781408,0.702835
2,huawei-noah/TinyBERT_General_4L_312D,tinybert.yaml,0.771134,0.642702


## Per-Class Performance Comparison

In [5]:
# Display the five highest- and five lowest-AUC labels for each model
for _, row in df.iterrows():
    model_name = row["model"].split("/")[-1]
    print(f"Per-Class Metrics: {model_name}")

    label_names = row.get("label_names")
    auc = row.get("per_label_auc")

    per_class_df = pd.DataFrame(
        {
            "Emotion": label_names,
            "AUC": auc,
        }
    )

    # Top 5 and bottom 5 classes by AUC, rendered as styled tables
    # (blue gradient for the top 5, red gradient for the bottom 5)
    top_k_df = per_class_df.sort_values("AUC", ascending=False).head(5)
    display(
        top_k_df.style.format(
            {"AUC": "{:.4f}"}
        ).background_gradient(subset=["AUC"], cmap="Blues", vmin=0.5, vmax=1.0)
    )

    bottom_k_df = per_class_df.sort_values("AUC", ascending=True).head(5)
    display(
        bottom_k_df.style.format(
            {"AUC": "{:.4f}"}
        ).background_gradient(subset=["AUC"], cmap="Reds", vmin=0.0, vmax=0.5)
    )

Per-Class Metrics: distilbert-base-uncased


Unnamed: 0,Emotion,AUC
8,disgusted,0.9393
12,jealous,0.9367
10,frustrated,0.9183
11,happy,0.8651
15,sad,0.8524


Unnamed: 0,Emotion,AUC
6,calm,0.6937
9,excited,0.7061
5,bored,0.7116
3,ashamed,0.7118
13,nostalgic,0.7202


Per-Class Metrics: MiniLM-L12-H384-uncased


Unnamed: 0,Emotion,AUC
12,jealous,0.913
0,afraid,0.8702
2,anxious,0.8018
10,frustrated,0.7661
11,happy,0.7339


Unnamed: 0,Emotion,AUC
4,awkward,0.413
1,angry,0.5182
5,bored,0.6217
9,excited,0.6387
14,proud,0.654


Per-Class Metrics: TinyBERT_General_4L_312D


Unnamed: 0,Emotion,AUC
2,anxious,0.8048
11,happy,0.7485
12,jealous,0.7421
10,frustrated,0.7352
16,satisfied,0.7179


Unnamed: 0,Emotion,AUC
5,bored,0.4721
1,angry,0.5069
13,nostalgic,0.5114
4,awkward,0.5206
9,excited,0.5673
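One way to read the three bottom-5 tables together is to count how often each emotion appears in them. The sketch below hard-codes the label lists shown in the outputs above (the dictionary keys are shortened model names chosen for illustration) and uses only the standard library.

```python
from collections import Counter

# Bottom-5 labels by AUC, copied from the per-class tables above.
bottom5 = {
    "DistilBERT": ["calm", "excited", "bored", "ashamed", "nostalgic"],
    "MiniLM": ["awkward", "angry", "bored", "excited", "proud"],
    "TinyBERT": ["bored", "angry", "nostalgic", "awkward", "excited"],
}

# Count in how many models each label lands in the bottom 5.
counts = Counter(label for labels in bottom5.values() for label in labels)

weak_everywhere = sorted(l for l, c in counts.items() if c == len(bottom5))
weak_in_two = sorted(l for l, c in counts.items() if c == 2)

print("Bottom 5 in every model:", weak_everywhere)
print("Bottom 5 in two models:", weak_in_two)
```

Labels that are weak in every model are more likely to point at a data issue (few or ambiguous examples) than at a weakness of one architecture, although this notebook does not include the label frequencies needed to confirm that.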


### Model Comparison Summary

- DistilBERT has the highest micro and macro AUC scores (micro: 0.85, macro: 0.78), making it the best performer overall.
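- MiniLM and TinyBERT trail DistilBERT by about 0.07 and 0.08 micro AUC respectively. On macro AUC the gap is wider: MiniLM is about 0.08 lower (0.70) and TinyBERT about 0.14 lower (0.64, the lowest of the three).
- Every model scores lower on macro AUC than on micro AUC (by roughly 0.07 for DistilBERT, 0.08 for MiniLM, and 0.13 for TinyBERT), reflecting the impact of class imbalance (common vs. rare emotions).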

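However, MiniLM is considerably lighter than DistilBERT (about 33M parameters versus about 66M) while staying within roughly 0.07 micro AUC and 0.08 macro AUC of it. For client-side inference, that trade-off is acceptable, so MiniLM will be used as the primary model.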
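The gaps discussed in this summary can be recomputed directly from the reported scores. The snippet below is a small sketch that copies the `micro_auc` and `macro_auc` values from the metrics table above; the `summarize` helper and the shortened model names are illustrative, not part of the project code.

```python
# Scores copied from the metrics table in this notebook.
reported = {
    "DistilBERT": {"micro_auc": 0.850713, "macro_auc": 0.782622},
    "MiniLM": {"micro_auc": 0.781408, "macro_auc": 0.702835},
    "TinyBERT": {"micro_auc": 0.771134, "macro_auc": 0.642702},
}


def summarize(scores, baseline="DistilBERT"):
    """Return the micro-macro gap and the AUC drops relative to the baseline model."""
    base = scores[baseline]
    summary = {}
    for name, m in scores.items():
        summary[name] = {
            "micro_macro_gap": round(m["micro_auc"] - m["macro_auc"], 4),
            "micro_drop_vs_baseline": round(base["micro_auc"] - m["micro_auc"], 4),
            "macro_drop_vs_baseline": round(base["macro_auc"] - m["macro_auc"], 4),
        }
    return summary


summary = summarize(reported)
for name, stats in summary.items():
    print(name, stats)
```

Keeping the baseline as a parameter makes it easy to rerun the comparison if a different model becomes the reference point.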